计算机与现代化 (Computer and Modernization) ›› 2023, Vol. 0 ›› Issue (01): 7-12.

• Image Processing •

Enhanced Image Caption Based on Improved Transformer_decoder

  林椹尠, 屈嘉欣, 罗亮

  1. (School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China)
  • Online: 2023-03-02  Published: 2023-03-02
  • About the authors: 林椹尠 (b. 1969), female, from Xi'an, Shaanxi, professor, master's supervisor, Ph.D., research interests: computer vision and natural language processing, e-mail: zhx126@126.com; corresponding author: 屈嘉欣 (b. 1996), female, from Xianyang, Shaanxi, master's student, research interest: computer vision, e-mail: qujiaxin_118@163.com; 罗亮 (b. 1981), male, from Linxia, Gansu, lecturer, Ph.D., research interest: mathematical methods in image processing, e-mail: luoliang775@163.com.
  • Supported by:
    National Youth Fund Project (12102341); Shaanxi Provincial Department of Education Project (21JK0904); Natural Science Basic Research Program of Shaanxi Province (2020JM-580)

Abstract: The Transformer decoder (Transformer_decoder) model has been widely used in image caption tasks, where the self-attention mechanism (Self Attention) captures fine-grained features to achieve a deeper understanding of the image. This paper improves the Self Attention mechanism in two ways, introducing a vision-boosted attention mechanism (Vision-Boosted Attention, VBA) and a relative-position attention mechanism (Relative-Position Attention, RPA). The vision-boosted attention mechanism adds a VBA layer to the Transformer_decoder, feeding visual features into the Self Attention model as auxiliary information to guide the decoder toward descriptions that better match the image content. The relative-position attention mechanism extends Self Attention with trainable relative-position parameters, adding word-to-word relative position relationships to the input sequence. Experiments on COCO2014 show that both the VBA and RPA attention mechanisms improve the image caption task to some extent, and that a decoder combining the two mechanisms produces better semantic descriptions.
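
The abstract only names the two mechanisms, so the sketch below shows one plausible way to wire them into a decoder layer in PyTorch: a self-attention sub-layer with trainable relative-position embeddings (the RPA idea, in the style of Shaw et al.'s relative position representations) and an additional cross-attention sub-layer in which the word sequence attends over image-region features (the VBA idea). All class, function, and parameter names here (RelativePositionSelfAttention, VisionBoostedDecoderLayer, max_rel_dist, and so on) are illustrative assumptions, not the paper's actual implementation.

# Minimal PyTorch sketch (hypothetical names; not the paper's released code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativePositionSelfAttention(nn.Module):
    # Single-head self-attention with trainable relative-position embeddings (the RPA idea).
    def __init__(self, d_model, max_rel_dist=16):
        super().__init__()
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One trainable vector per clipped relative offset in [-max_rel_dist, max_rel_dist].
        self.rel_key = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x, causal_mask=None):
        b, n, d = x.shape                       # (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        rel_k = self.rel_key(rel + self.max_rel_dist)              # (n, n, d)
        scores = q @ k.transpose(-2, -1)                           # content-content term
        scores = scores + torch.einsum("bid,ijd->bij", q, rel_k)   # content-position term
        scores = scores / math.sqrt(d)
        if causal_mask is not None:
            scores = scores.masked_fill(causal_mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

class VisionBoostedDecoderLayer(nn.Module):
    # Decoder layer with a VBA-style sub-layer in which the word sequence attends
    # over image-region features that act as auxiliary visual guidance.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, max_rel_dist=16):
        super().__init__()
        self.self_attn = RelativePositionSelfAttention(d_model, max_rel_dist)   # RPA
        self.vba = nn.MultiheadAttention(d_model, n_heads, batch_first=True)    # VBA
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, visual_feats, causal_mask=None):
        # words: (batch, seq_len, d_model); visual_feats: (batch, num_regions, d_model)
        x = self.norm1(words + self.self_attn(words, causal_mask))
        boosted, _ = self.vba(x, visual_feats, visual_feats)   # words query the visual features
        x = self.norm2(x + boosted)
        return self.norm3(x + self.ffn(x))

# Usage: 20 partially generated caption tokens attending over 49 region features.
layer = VisionBoostedDecoderLayer()
tokens = torch.randn(2, 20, 512)
regions = torch.randn(2, 49, 512)
mask = torch.triu(torch.ones(20, 20, dtype=torch.bool), diagonal=1)
print(layer(tokens, regions, causal_mask=mask).shape)   # torch.Size([2, 20, 512])

Placing the VBA sub-layer between self-attention and the feed-forward block mirrors the usual encoder-decoder attention slot of a Transformer decoder; the abstract does not state where the VBA layer is inserted, so the ordering above is an assumption as well.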

Keywords: image caption, Transformer model, Self Attention mechanism, relative-position attention mechanism, vision-boosted attention mechanism
