Computer and Modernization ›› 2023, Vol. 0 ›› Issue (01): 7-12.


Enhanced Image Caption Based on Improved Transformer_decoder

  

  1. (School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China)
  • Online: 2023-03-02    Published: 2023-03-02

Abstract: The Transformer decoder model (Transformer_decoder) has been widely used in image caption tasks, where Self Attention captures fine-grained features to achieve a deeper understanding of the image. This article makes two improvements to Self Attention: Vision-Boosted Attention (VBA) and Relative-Position Attention (RPA). Vision-Boosted Attention adds a VBA layer to the Transformer_decoder and introduces visual features as auxiliary information into the attention model, guiding the decoder to generate description semantics that better match the image content. Building on Self Attention, Relative-Position Attention introduces trainable relative-position parameters that add the relative positional relationships between words to the input sequence. Experiments on the COCO2014 dataset show that the VBA and RPA attention mechanisms each improve image caption performance to a certain extent, and that the decoder model combining the two achieves better semantic expression.
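The relative-position idea described above can be sketched as standard scaled dot-product self-attention with a trainable bias added to the attention logits, indexed by the (clipped) offset between query and key positions. This is a minimal illustrative sketch only; the shapes, the clipping distance, the scalar-bias form, and the random initialization are assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 6, 8
max_rel_dist = 4  # relative offsets clipped to [-4, 4] (assumption)

x = rng.normal(size=(seq_len, d_model))  # decoder input sequence

# Trainable projections (random stand-ins for this sketch)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Trainable relative-position parameters: one bias per clipped offset
rel_bias = rng.normal(size=(2 * max_rel_dist + 1,))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Content-based attention logits
logits = Q @ K.T / np.sqrt(d_model)

# Add the relative-position bias b[clip(j - i)] to each logit (i, j)
idx = np.arange(seq_len)
rel = np.clip(idx[None, :] - idx[:, None], -max_rel_dist, max_rel_dist)
logits = logits + rel_bias[rel + max_rel_dist]

# Softmax over keys, then weighted sum of values
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V

print(out.shape)  # (6, 8)
```

In a full model the bias table (or an equivalent relative-position embedding) would be learned jointly with the rest of the decoder; the VBA layer would additionally cross-attend to encoded visual features, which this sketch omits.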

Key words: image caption, Transformer model, Self Attention mechanism, relative-position attention, vision-boosted attention