Computer and Modernization, 2023, Vol. 0, Issue (05): 8-12.


Image Caption Generation Method Based on Channel Attention and Transformer

  

  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China)
Online: 2023-06-06    Published: 2023-06-06

Abstract: Image caption generation refers to the automatic translation of an image into a descriptive sentence by a computer. To address the problems of existing image caption generation methods, such as insufficient use of the local and global features of images and high time complexity, this paper proposes a hybrid image caption generation model based on Convolutional Neural Networks (CNN) and the Transformer. Considering the spatial and channel characteristics of the convolutional network, the lightweight and high-accuracy Efficient Channel Attention (ECA) module is first fused with the CNN to form an attention residual block, which extracts visual features from the input image. The features are then fed into the Transformer sequence model: the encoder uses self-attention to obtain attended visual representations, while the language decoder captures fine-grained information in the caption and learns the interactions among the caption sequences. The model was validated on the MSCOCO dataset, where the BLEU-1, BLEU-3, BLEU-4, METEOR, and CIDEr metrics improved by 0.3, 0.5, 0.7, 0.4, and 1.6 percentage points, respectively.

Key words: image caption, Transformer, channel attention, convolutional neural network, encoder-decoder model
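The pipeline outlined in the abstract, an ECA channel-attention residual block on top of a CNN feature extractor whose flattened feature map is fed to a Transformer encoder-decoder, can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed settings (d_model, layer counts, vocabulary size, input resolution, and the backbone depth are placeholders, and positional encodings are omitted); it is not the authors' implementation.

import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention (Wang et al., 2020): channel weights from a
    1-D convolution over globally average-pooled channel descriptors."""
    def __init__(self, channels, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=(k_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                   # x: (B, C, H, W)
        y = self.avg_pool(x)                                # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))        # conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))   # back to (B, C, 1, 1)
        return x * y                                        # re-weight channels


class ECAResidualBlock(nn.Module):
    """Residual block with ECA applied before the skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.eca = ECA(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.eca(self.body(x)))


class CaptionModel(nn.Module):
    """CNN + ECA feature extractor feeding a Transformer encoder-decoder.
    Hyperparameters are placeholders, not values from the paper; token
    positional encodings are omitted for brevity."""
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            ECAResidualBlock(d_model),
            ECAResidualBlock(d_model),
            nn.AdaptiveAvgPool2d(7),                         # 7x7 = 49 visual tokens
        )
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images)                        # (B, D, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)             # (B, 49, D) visual tokens
        tgt = self.embed(captions)                           # (B, T, D)
        mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        dec = self.transformer(feats, tgt, tgt_mask=mask)    # encoder-decoder attention
        return self.out(dec)                                 # (B, T, vocab_size)


if __name__ == "__main__":
    model = CaptionModel()
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 10000, (2, 16))
    print(model(images, captions).shape)                     # torch.Size([2, 16, 10000])

In this sketch the ECA weights are produced by a cheap 1-D convolution over the pooled channel vector, which is what makes the attention lightweight, and the 7x7 pooled feature map is flattened into a short sequence of visual tokens so the Transformer encoder can attend over image regions while the decoder generates the caption autoregressively.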