[1] |
PLUMMER B A, WANG L, CERVANTES C M, et al. Flickr30k entities:Collecting region-to-phrase correspondences for richer image-to-sentence models[J]. International Journal of Computer Vision, 2017,123(1):74-93.
|
[2] |
VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator[C]// Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015:3156-3164.
|
[3] |
XIONG L C, PARIKH D, SOCHER R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning[J]. Computer Vision and Pattern Recognition, arXiv:1612.01887, 2017.
|
[4] |
YU D, FU J, MEI T, et al. Multi-level attention networks for visual question answering[C] // 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:4187-4195.
|
[5] |
XU K, BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[J]. Machine Learning, arXiv:1502.03044, 2015.
|
[6] |
YANG Z, YUAN Y, WU Y, et al. Review networks for caption generation[C]// Advances in Neural Information Processing Systems. 2016:2361-2369.
|
[7] |
REN S, HE K, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017,39(6):1137-1149.
|
[8] |
HE K, ZHANG X, REN S, et al. Delving deepinto rectifiers: Surpassing human-level performance on image net classification[C]// Proceedings of IEEE International Conference on Computer Vision. 2015:1026-1034.
|
[9] |
FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: Generating sentences from images[J]. Lecture Notes in Computer Science, 2010,21(10):15-29.
|
[10] |
LI S, KULKARNI G, BERG T, et al. Composing simple image descriptions using web-scale n-grams[C]// ACL Conference on Computational Natural Language Learning. 2011:220-228.
|
[11] |
KULKARNI G, PREMRAJ V, ORDONEZ V, et al. Baby talk: Understanding and generating simple image descriptions[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2011:1601-1608.
|
[12] |
CHO K, VAN MERRIENBOER B, GULCEHRE C, et al.Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. Computation and Language, arXiv:1406.1078, 2014.
|
[13] |
SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]// Advances in Neural Information Processing Systems. 2014:3104-3112.
|
[14] |
BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[J]. Computation and Language, arXiv:1409.0473, 2014.
|
[15] |
KIROS R, ALAKHUTDINOV R, ZEMEL R S. Multimodal neural language models[C]// ICML. 2014:595-603.
|
[16] |
WU Q, SHEN C, LIU L, et al. What value do explicit high level concepts have invision to language problems?[J]. Computer Vision and Pattern Recognition, arXiv:1506.01144, 2015.
|
[17] |
YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:4651-4659.
|
[18] |
YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes[J]. Computer Vision and Pattern Recognition, arXiv:1611.01646,2015.
|
[19] |
KRISHNA R, ZHU Y, GROTH O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. Computer Vision and Pattern Recognition, arXiv:1602.07332,2016.
|
[20] |
LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[J]. Computer Vision and Pattern Recognition, arXiv:1405.0312,2014.
|
[21] |
PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002:311-318.
|
[22] |
BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]// ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005:65-72.
|
[23] |
FLICK C. ROUGE: A package for automatic evaluation of summaries[C]// The Workshop on Text Summarization Branches Out. 2004:25-26.
|
[24] |
VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015:4566-4575.
|