Computer and Modernization ›› 2023, Vol. 0 ›› Issue (05): 8-12.

• Artificial Intelligence •

Image Caption Generation Method Based on Channel Attention and Transformer

  LIU Jing, CHEN Jinguang

  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China)
  • Online: 2023-06-06  Published: 2023-06-06
  • About the authors: LIU Jing (b. 1998), female, from Zhouzhi, Shaanxi, M.S. candidate; research interests: image processing and understanding; E-mail: 1850351173@qq.com. Corresponding author: CHEN Jinguang (b. 1977), male, from Zhenping, Henan, professor; research interests: multi-source information fusion, machine learning, and computer applications; E-mail: xacjg@163.com.
  • Funding:
    Natural Science Basic Research Program of Shaanxi Province (2023-JC-YB-568); Scientific Research Program of Shaanxi Provincial Department of Education (22JP028)

Abstract: Image caption generation is the task of automatically translating an image into a natural-language sentence that describes it. To address two shortcomings of existing image captioning methods, namely insufficient use of the local and global features of the image and high time complexity, this paper proposes a hybrid image caption generation model that combines a Convolutional Neural Network (CNN) with a Transformer. Exploiting the spatial and channel characteristics of convolutional networks, the lightweight, high-accuracy Efficient Channel Attention (ECA) module is first fused with the CNN to form an attention residual block, which extracts visual features from the input image. These features are then fed into a Transformer sequence model: the encoder uses self-attention to produce attended visual representations, while the language decoder captures fine-grained information within the caption and learns the interactions among caption tokens, generating the caption for the image step by step from the visual features. In experiments on the MSCOCO dataset, the model improves BLEU-1, BLEU-3, BLEU-4, METEOR, and CIDEr by 0.3, 0.5, 0.7, 0.4, and 1.6 percentage points, respectively.
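As an illustration of the attention residual block described in the abstract, the following is a minimal PyTorch sketch. The module names (ECA, ECAResidualBlock) and layer sizes are illustrative assumptions rather than the authors' released code; the adaptive kernel-size rule follows the published ECA-Net design (Wang et al., 2020).

```python
# Minimal sketch of an ECA-augmented residual block; names and sizes are
# illustrative assumptions, not the paper's implementation.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel weights from a 1D convolution
    over the globally pooled feature, as in ECA-Net."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size: nearest odd number to (log2(C) + b) / gamma.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pool(x)                      # (B, C, H, W) -> (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)     # -> (B, 1, C)
        y = self.conv(y)                      # local cross-channel interaction
        y = y.transpose(1, 2).unsqueeze(-1)   # -> (B, C, 1, 1)
        return x * self.sigmoid(y)            # channel-wise re-weighting

class ECAResidualBlock(nn.Module):
    """Residual block with ECA applied before the skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.eca = ECA(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.eca(out)         # re-weight channels before the skip
        return self.relu(out + x)   # residual connection
```

Because ECA replaces SE-style fully connected layers with a single size-k 1D convolution, it adds only k weights per block, which is consistent with the abstract's emphasis on light weight and low time complexity.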

Key words: image caption, Transformer, channel attention, convolutional neural network, encoder-decoder model

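For the model's second stage, the sketch below shows one plausible way to wire the extracted visual features into a Transformer encoder-decoder for caption generation, using PyTorch's built-in layers. All hyperparameters (d_model, layer counts, vocabulary size) and the learned positional embedding are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of the CNN-features -> Transformer captioning stage;
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    """Encode flattened grid features with self-attention, then decode a
    caption with masked self-attention plus cross-attention to the image."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=3, max_len=20):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # CNN features -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers)
        self.decoder = nn.TransformerDecoder(dec, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) flattened CNN grid; tokens: (B, T) word ids
        memory = self.encoder(self.proj(feats))    # attended visual repr.
        T = tokens.size(1)
        pos = self.pos(torch.arange(T, device=tokens.device))
        # Causal mask so each position attends only to earlier tokens.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=tokens.device), diagonal=1)
        hidden = self.decoder(self.embed(tokens) + pos, memory, tgt_mask=causal)
        return self.out(hidden)                    # (B, T, vocab_size) logits
```

At inference time the decoder would be run autoregressively, feeding each newly predicted token back in until an end-of-sentence token is produced, which matches the abstract's description of generating the caption step by step from the visual features.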