Image Captioning Based on Adaptive Attention Model

Abstract

Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little or no visual information from the image to predict non-visual words such as "the" and "of". This paper proposes an adaptive attention model in which the encoder uses a Faster R-CNN network to extract the salient features of an image and the LSTM decoder introduces a visual sentinel. At each time step, the model automatically decides when to rely on visual signals and when to rely only on the language model. The model is evaluated on the Flickr30K and MS-COCO datasets, and the experimental results show that it effectively improves the quality of the generated captions.

Key words: attention mechanism, convolutional neural network (CNN), long short-term memory network (LSTM), image captioning
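
The gating idea described in the abstract can be illustrated with a minimal sketch. It assumes the common visual-sentinel formulation, in which a sentinel score is appended to the region attention scores before the softmax, and the sentinel's share of the probability mass (called `beta` here) controls how much the decoder falls back on the language model. The function names, toy vectors, and scores below are hypothetical and are not taken from the paper; in a full model the scores and the sentinel vector would come from the LSTM decoder, and the region features from Faster R-CNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Blend attended region features with a sentinel vector.

    Assumption (not stated in the abstract): the sentinel score competes
    with the region scores in a single softmax, and its probability `beta`
    acts as the gate between language-model memory and visual features.
    """
    probs = softmax(list(region_scores) + [sentinel_score])
    beta = probs[-1]  # weight given to the sentinel (language model)
    dim = len(sentinel)
    # Start from the sentinel's contribution, then add the weighted regions.
    context = [beta * sentinel[d] for d in range(dim)]
    for weight, feat in zip(probs[:-1], region_feats):
        for d in range(dim):
            context[d] += weight * feat[d]
    return context, beta


if __name__ == "__main__":
    # Two toy region features and a toy sentinel vector (hypothetical values).
    regions = [[1.0, 0.0], [0.0, 1.0]]
    sentinel = [2.0, 2.0]
    # A high sentinel score mimics a non-visual word such as "of".
    _, beta_nonvisual = adaptive_context(regions, [0.0, 0.0], sentinel, 5.0)
    # A low sentinel score mimics a visual word such as an object name.
    _, beta_visual = adaptive_context(regions, [3.0, 0.0], sentinel, -5.0)
    print(beta_nonvisual, beta_visual)
```

In this sketch a `beta` close to 1 means the step relies almost entirely on the sentinel, while a `beta` close to 0 means the context vector is dominated by the attended image regions.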

CLC number: TP391.41

HOU Xing-chen, WANG Jin. Image Captioning Based on Adaptive Attention Model[J]. Computer and Modernization, 2020, 0(06): 95-.
