[1] FENG X, HU Z, LIU C. Survey of research progress on cross-modal retrieval[J]. Computer Science, 2021, 48(8): 13-23. (in Chinese)
[2] WANG K, YIN Q, WANG W, et al. A comprehensive survey on cross-modal retrieval[J]. arXiv preprint arXiv:1607.06215, 2016.
[3] PENG Y, HUANG X, ZHAO Y. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(9): 2372-2385.
[4] COSTA PEREIRA J, COVIELLO E, DOYLE G, et al. On the role of correlation and abstraction in cross-modal multimedia retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(3): 521-535.
[5] HARDOON D R, SZEDMAK S, SHAWE-TAYLOR J. Canonical correlation analysis: An overview with application to learning methods[J]. Neural Computation, 2004, 16(12): 2639-2664.
[6] ANDREW G, ARORA R, BILMES J, et al. Deep canonical correlation analysis[C]// 30th International Conference on Machine Learning (ICML 2013). 2013, 28: 2284-2292.
[7] WANG X, HU P, ZHEN L, et al. DRSL: Deep relational similarity learning for cross-modal retrieval[J]. Information Sciences, 2021, 546(1): 298-311.
[8] ZHAI X, PENG Y, XIAO J. Learning cross-media joint representation with sparse and semisupervised regularization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6): 965-978.
[9] PENG Y, HUANG X, QI J. Cross-media shared representation by hierarchical learning with multiple deep networks[C]// Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016: 3846-3853.
[10] WANG B, YANG Y, XU X, et al. Adversarial cross-modal retrieval[C]// Proceedings of the 25th ACM International Conference on Multimedia. 2017: 154-162.
[11] JIAN Y, XIAO J, CAO Y, et al. Deep pairwise ranking with multi-label information for cross-modal retrieval[C]// 2019 IEEE International Conference on Multimedia and Expo (ICME). 2019: 1810-1815.
[12] ZHEN L, HU P, WANG X, et al. Deep supervised cross-modal retrieval[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019. DOI: 10.1109/CVPR.2019.01064.
[13] ATREY P K, HOSSAIN M A, EL SADDIK A, et al. Multimodal fusion for multimedia analysis: A survey[J]. Multimedia Systems, 2010, 16(6): 345-379.
[14] HE J, ZHANG C, LI X, et al. Survey of multimodal fusion technology for deep learning[J]. Computer Engineering, 2020, 46(5): 1-11. (in Chinese)
[15] SALVADOR A, HYNES N, AYTAR Y, et al. Learning cross-modal embeddings for cooking recipes and food images[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017: 3068-3076.
[16] WANG X, HAN X, HUANG W, et al. Multi-similarity loss with general pair weighting for deep metric learning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019: 5017-5025.
[17] SUN Y, CHENG C, ZHANG Y, et al. Circle Loss: A unified perspective of pair similarity optimization[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 6397-6406.
[18] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2015.
[19] NAIR V, HINTON G E. Rectified linear units improve restricted Boltzmann machines[C]// Proceedings of the 27th International Conference on Machine Learning. 2010: 807-814.
[20] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]// 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009: 248-255.
[21] KIM Y. Convolutional neural networks for sentence classification[J]. arXiv preprint arXiv:1408.5882, 2014.
[22] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 2: 3111-3119.
[23] RASHTCHIAN C, YOUNG P, HODOSH M, et al. Collecting image annotations using Amazon’s Mechanical Turk[C]// Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010: 139-147.
[24] CHUA T S, TANG J, HONG R, et al. NUS-WIDE: A real-world web image database from National University of Singapore[C]// Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR 2009). 2009: 1-9.
[25] RASIWASIA N, COSTA PEREIRA J, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]// Proceedings of the 18th ACM International Conference on Multimedia. 2010: 251-260.
[26] FENG F, WANG X, LI R. Cross-modal retrieval with correspondence autoencoder[C]// Proceedings of the 22nd ACM International Conference on Multimedia. 2014: 7-16.
[27] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2019.
[28] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9.