[1] PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models[J]. International Journal of Computer Vision, 2017,123(1):74-93.
[2] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator[C]// Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015:3156-3164.
[3] LU J, XIONG C, PARIKH D, et al. Knowing when to look: Adaptive attention via a visual sentinel for image captioning[J]. Computer Vision and Pattern Recognition, arXiv:1612.01887, 2017.
[4] YU D, FU J, MEI T, et al. Multi-level attention networks for visual question answering[C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:4187-4195.
[5] XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[J]. Machine Learning, arXiv:1502.03044, 2015.
[6] YANG Z, YUAN Y, WU Y, et al. Review networks for caption generation[C]// Advances in Neural Information Processing Systems. 2016:2361-2369.
[7] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017,39(6):1137-1149.
[8] HE K, ZHANG X, REN S, et al. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification[C]// Proceedings of IEEE International Conference on Computer Vision. 2015:1026-1034.
[9] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: Generating sentences from images[J]. Lecture Notes in Computer Science, 2010,21(10):15-29.
[10] LI S, KULKARNI G, BERG T, et al. Composing simple image descriptions using web-scale n-grams[C]// ACL Conference on Computational Natural Language Learning. 2011:220-228.
[11] KULKARNI G, PREMRAJ V, ORDONEZ V, et al. Baby talk: Understanding and generating simple image descriptions[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2011:1601-1608.
[12] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. Computation and Language, arXiv:1406.1078, 2014.
[13] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]// Advances in Neural Information Processing Systems. 2014:3104-3112.
[14] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[J]. Computation and Language, arXiv:1409.0473, 2014.
[15] KIROS R, SALAKHUTDINOV R, ZEMEL R S. Multimodal neural language models[C]// ICML. 2014:595-603.
[16] WU Q, SHEN C, LIU L, et al. What value do explicit high level concepts have in vision to language problems?[J]. Computer Vision and Pattern Recognition, arXiv:1506.01144, 2015.
[17] YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:4651-4659.
[18] YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes[J]. Computer Vision and Pattern Recognition, arXiv:1611.01646, 2016.
[19] KRISHNA R, ZHU Y, GROTH O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. Computer Vision and Pattern Recognition, arXiv:1602.07332, 2016.
[20] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[J]. Computer Vision and Pattern Recognition, arXiv:1405.0312, 2014.
[21] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002:311-318.
[22] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]// ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005:65-72.
[23] LIN C Y. ROUGE: A package for automatic evaluation of summaries[C]// The Workshop on Text Summarization Branches Out. 2004:25-26.
[24] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015:4566-4575.