Multi-modal Emotion Recognition Based on Text Guidance

doi:10.3969/j.issn.1006-2475.2025.07.012

Abstract

Multi-modal emotion recognition is widely used in artificial intelligence, safe driving, and other fields. Because multi-modal information offers richer representations than any single modality, it supports more accurate emotion recognition. Text is a modality that conveys rich and precise information. This paper proposes a multi-modal sentiment analysis model guided by multi-scale text features: text features at different scales are aggregated and used to optimize the features of the other modalities. A text-guided multi-modal aggregation module, AGG, is designed, and contrastive learning is incorporated into the loss function to optimize the whole network. The experimental metrics show that the model performs well in multi-modal emotion recognition, and comparison and ablation experiments further support the soundness and effectiveness of the design.
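The abstract does not state the exact form of the contrastive term in the loss function. As an illustration only, the sketch below shows one common choice, an InfoNCE-style loss that pulls each text embedding toward the embedding of the same sample in another modality and pushes it away from the other samples in the batch. The function names, the `temperature` value, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(text_emb, other_emb, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    text_emb[i] and other_emb[i] are treated as the positive pair;
    every other_emb[j] with j != i acts as a negative for text_emb[i].
    Hypothetical helper for illustration, not the paper's loss.
    """
    text = [l2_normalize(v) for v in text_emb]
    other = [l2_normalize(v) for v in other_emb]
    total = 0.0
    for i, t in enumerate(text):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(t, o)) / temperature for o in other]
        # Numerically stable log-sum-exp over the candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Negative log-probability of picking the true partner.
        total += log_denom - logits[i]
    return total / len(text)


# Toy batch: two samples with 2-D embeddings (made-up values).
text_features = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]   # matches the text pairing
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairing deliberately broken

loss_aligned = info_nce(text_features, audio_aligned)
loss_swapped = info_nce(text_features, audio_swapped)
```

In this toy batch the correctly paired embeddings yield a much smaller loss than the swapped ones, which is the behavior a contrastive term is meant to reward during training.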

Key words: emotion recognition, multi-modal, multi-scale text features, multi-modal fusion

CLC Number: TP391

ZHAI Junlong, GU Lin. Multi-modal Emotion Recognition Based on Text Guidance[J]. Computer and Modernization, 2025, 0(07): 83-89.

