Computer and Modernization ›› 2024, Vol. 0 ›› Issue (07): 47-62. DOI: 10.3969/j.issn.1006-2475.2024.07.008
Online: 2024-07-25
Published: 2024-08-08
Abstract: With the continuous development of artificial intelligence and deep learning, research on multimodal information processing and fusion has attracted broad attention. This paper summarizes the development history and milestone works of multimodal information processing, together with multimodal fusion strategies and models. It organizes the mainstream datasets for multimodal information processing and fusion by modality category. Taking modality type as the classification criterion, the paper systematically reviews research progress in multimodal information processing and fusion, highlights the differences between modalities, and divides the field into four categories: audio-visual processing and fusion, audio-text processing and fusion, visual-text processing and fusion, and visual-audio-text processing and fusion, examining the processing and fusion methods and models for each input-modality combination in detail. Finally, it summarizes the field and discusses future directions.
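As a concrete illustration of the feature-level (early) fusion strategy that the survey categorizes, the following minimal PyTorch sketch concatenates audio, visual, and text embeddings before a shared classification head. All module names, embedding dimensions, and the class count are hypothetical stand-ins, not drawn from any surveyed model.

```python
# Illustrative sketch only: feature-level (early) fusion of audio, visual,
# and text embeddings via concatenation. Dimensions are hypothetical.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Concatenate per-modality embeddings, then classify (early fusion)."""
    def __init__(self, d_audio=128, d_visual=512, d_text=768, n_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_audio + d_visual + d_text, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, a, v, t):
        # a: (B, d_audio), v: (B, d_visual), t: (B, d_text)
        return self.head(torch.cat([a, v, t], dim=-1))

# Usage with random stand-in features for a batch of 8 samples.
fusion = ConcatFusion()
logits = fusion(torch.randn(8, 128), torch.randn(8, 512), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```

Late (decision-level) fusion, by contrast, would run one classifier per modality and combine their outputs, trading richer cross-modal interaction for modularity.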
HUANG Wendong, WANG Yifan. Survey on Multimodal Information Processing and Fusion Based on Modal Categories[J]. Computer and Modernization, 2024, 0(07): 47-62.