Computer and Modernization ›› 2020, Vol. 0 ›› Issue (06): 95-.

Image Captioning Based on Adaptive Attention Model

  1. (College of Electronic Engineering, Tianjin University of Technology and Education, Tianjin 300222, China)
  • Received: 2019-10-15  Online: 2020-06-24  Published: 2020-06-28
  • About the authors: HOU Xing-chen (1994-), male, born in Zhoukou, Henan, master's degree candidate; research interests: computer vision and deep learning; E-mail: 243517084@qq.com. WANG Jin (1970-), female, professor, holds a master's degree; research interests: digital image processing and deep learning; E-mail: 13512287240@163.com.

Abstract: Attention-based encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word; however, the decoder likely requires little or no visual information from the image to predict non-visual words such as "the" and "of". This paper proposes an adaptive attention model in which the encoder uses a Faster R-CNN network to extract salient image features and the decoder LSTM network incorporates a visual sentinel. At each time step, the model automatically decides when to rely on visual signals and when to rely on the language model alone. The model is validated on the Flickr30K and MS-COCO datasets, and the experimental results show that it effectively improves the quality of the generated captions.
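The sentinel mechanism summarized in the abstract can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration of one sentinel-gated attention step, following the common visual-sentinel formulation; the class name AdaptiveAttention, the shared size d, and the assumption that region features are pre-projected to the LSTM hidden size are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """One sentinel-gated attention step (illustrative sketch).

    Assumes the k region features (e.g. from Faster R-CNN) are already
    projected to the LSTM hidden size d.
    """
    def __init__(self, d: int, att_dim: int):
        super().__init__()
        self.w_v = nn.Linear(d, att_dim)   # projects region features
        self.w_g = nn.Linear(d, att_dim)   # projects the current hidden state
        self.w_s = nn.Linear(d, att_dim)   # projects the visual sentinel
        self.w_a = nn.Linear(att_dim, 1)   # scalar score per candidate
        self.gate_x = nn.Linear(d, d)      # sentinel gate, input part
        self.gate_h = nn.Linear(d, d)      # sentinel gate, hidden part

    def forward(self, V, x_t, h_prev, h_t, m_t):
        # V: (B, k, d) region features; x_t: (B, d) LSTM input at step t;
        # h_prev, h_t: (B, d) hidden states; m_t: (B, d) LSTM memory cell.
        g_t = torch.sigmoid(self.gate_x(x_t) + self.gate_h(h_prev))
        s_t = g_t * torch.tanh(m_t)                         # visual sentinel
        # Attention logits over the k regions plus the sentinel itself.
        z_img = self.w_a(torch.tanh(self.w_v(V) + self.w_g(h_t).unsqueeze(1))).squeeze(-1)
        z_sen = self.w_a(torch.tanh(self.w_s(s_t) + self.w_g(h_t)))       # (B, 1)
        alpha = F.softmax(torch.cat([z_img, z_sen], dim=1), dim=1)        # (B, k+1)
        beta = alpha[:, -1:]                                # weight on the sentinel
        c_t = (alpha[:, :-1].unsqueeze(-1) * V).sum(dim=1)  # attended visual context
        # beta near 1: the next word comes from the language model alone;
        # beta near 0: the next word is grounded in the attended image regions.
        return beta * s_t + (1.0 - beta) * c_t, beta
```

The returned mixture replaces the plain visual context vector when predicting the next word, so non-visual words such as "the" and "of" can be emitted with beta close to 1, without attending to the image.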

Key words: attention mechanism, Convolutional Neural Network (CNN), Long Short-Term Memory Network (LSTM), image captioning
