Survey on Multimodal Information Processing and Fusion Based on Modal Categories

doi:10.3969/j.issn.1006-2475.2024.07.008

Abstract

With the continued development of artificial intelligence and deep learning, research on multimodal information processing and fusion has attracted wide attention. This paper reviews the development history and milestone works of multimodal information processing, together with multimodal fusion strategies and models. Mainstream datasets for multimodal information processing and fusion are organized by modality category. Using modality type as the classification criterion, the paper systematically reviews research progress in the field, emphasizes the differences between modalities, and divides multimodal information processing and fusion into four categories: audio-visual processing and fusion, audio-text processing and fusion, visual-text processing and fusion, and visual-audio-text processing and fusion. For each category, the processing and fusion methods and models for the corresponding input modalities are examined in detail. Finally, the development of multimodal processing and fusion is summarized and future directions are discussed.

Key words: multimodal processing; multimodal information processing; multimodal fusion; deep learning

CLC number: TP391

HUANG Wendong, WANG Yifan. Survey on Multimodal Information Processing and Fusion Based on Modal Categories[J]. Computer and Modernization, 2024, 0(07): 47-62.

