[1] FENG X, HU Z, LIU C. Survey of research progress on cross-modal retrieval[J]. Computer Science, 2021, 48(8): 13-23. (in Chinese)
[2] WANG K, YIN Q, WANG W, et al. A comprehensive survey on cross-modal retrieval[J]. arXiv preprint arXiv:1607.06215, 2016.
[3] PENG Y, HUANG X, ZHAO Y. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2018, 28(9): 2372-2385.
[4] COSTA PEREIRA J, COVIELLO E, DOYLE G, et al. On the role of correlation and abstraction in cross-modal multimedia retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(3): 521-535.
[5] HARDOON D R, SZEDMAK S, SHAWE-TAYLOR J. Canonical correlation analysis: An overview with application to learning methods[J]. Neural Computation, 2004, 16(12): 2639-2664.
[6] ANDREW G, ARORA R, BILMES J, et al. Deep canonical correlation analysis[C]// 30th International Conference on Machine Learning (ICML 2013). 2013, 28: 2284-2292.
[7] WANG X, HU P, ZHEN L, et al. DRSL: Deep relational similarity learning for cross-modal retrieval[J]. Information Sciences, 2021, 546(1): 298-311.
[8] ZHAI X, PENG Y, XIAO J. Learning cross-media joint representation with sparse and semisupervised regularization[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 24(6): 965-978.
[9] PENG Y, HUANG X, QI J. Cross-media shared representation by hierarchical learning with multiple deep networks[C]// Proceedings of the 25th International Joint Conference on Artificial Intelligence. 2016: 3846-3853.
[10] WANG B, YANG Y, XU X, et al. Adversarial cross-modal retrieval[C]// Proceedings of the 25th ACM International Conference on Multimedia. 2017: 154-162.
[11] JIAN Y, XIAO J, CAO Y, et al. Deep pairwise ranking with multi-label information for cross-modal retrieval[C]// 2019 IEEE International Conference on Multimedia and Expo (ICME). 2019: 1810-1815.
[12] ZHEN L, HU P, WANG X, et al. Deep supervised cross-modal retrieval[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019. DOI: 10.1109/CVPR.2019.01064.
[13] ATREY P K, HOSSAIN M A, EL SADDIK A, et al. Multimodal fusion for multimedia analysis: A survey[J]. Multimedia Systems, 2010, 16(6): 345-379.
[14] HE J, ZHANG C, LI X, et al. Survey of multimodal fusion technology for deep learning[J]. Computer Engineering, 2020, 46(5): 1-11. (in Chinese)
[15] SALVADOR A, HYNES N, AYTAR Y, et al. Learning cross-modal embeddings for cooking recipes and food images[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017: 3068-3076.
[16] WANG X, HAN X, HUANG W, et al. Multi-similarity loss with general pair weighting for deep metric learning[C]// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019: 5017-5025.
[17] SUN Y, CHENG C, ZHANG Y, et al. Circle Loss: A unified perspective of pair similarity optimization[C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 6397-6406.
[18] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2015.
[19] NAIR V, HINTON G E. Rectified linear units improve restricted Boltzmann machines[C]// Proceedings of the 27th International Conference on Machine Learning. 2010: 807-814.
[20] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]// 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009: 248-255.
[21] KIM Y. Convolutional neural networks for sentence classification[J]. arXiv preprint arXiv:1408.5882, 2014.
[22] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 2: 3111-3119.
[23] RASHTCHIAN C, YOUNG P, HODOSH M, et al. Collecting image annotations using Amazon’s Mechanical Turk[C]// Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. 2010: 139-147.
[24] CHUA T S, TANG J, HONG R, et al. NUS-WIDE: A real-world web image database from National University of Singapore[C]// Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR 2009). 2009: 1-9.
[25] RASIWASIA N, COSTA PEREIRA J, COVIELLO E, et al. A new approach to cross-modal multimedia retrieval[C]// Proceedings of the 18th ACM International Conference on Multimedia. 2010: 251-260.
[26] FENG F, WANG X, LI R. Cross-modal retrieval with correspondence autoencoder[C]// Proceedings of the 22nd ACM International Conference on Multimedia. 2014: 7-16.
[27] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2019.
[28] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9.