Survey on Multimodal Information Processing and Fusion Based on Modal Categories

doi:10.3969/j.issn.1006-2475.2024.07.008

Abstract

With the continued advancement of artificial intelligence and deep learning, multimodal information processing and fusion has attracted widespread attention from researchers. This paper reviews the development history and milestone works of multimodal information processing, along with strategies and models for multimodal fusion. Mainstream datasets for multimodal information processing and fusion are classified and summarized by modality. Using modality type as the classification criterion, the paper systematically reviews research progress in multimodal information processing and fusion, emphasizing the distinctions between modalities. Multimodal information processing and fusion is divided into four categories: audio-visual, audio-text, visual-text, and visual-audio-text processing and fusion. Methods and models for processing and fusing each combination of input modalities are examined in detail. Finally, a summary and outlook on the development of multimodal processing and fusion are provided.

Key words: multimodal processing; multimodal information processing; multimodal fusion; deep learning

CLC Number: TP391

HUANG Wendong, WANG Yifan. Survey on Multimodal Information Processing and Fusion Based on Modal Categories[J]. Computer and Modernization, 2024, 0(07): 47-62.
