计算机与现代化 (Computer and Modernization) ›› 2023, Vol. 0 ›› Issue (01): 7-12.

• Image Processing •

Enhanced Image Caption Based on Improved Transformer_decoder

  林椹尠, 屈嘉欣, 罗亮

  1. (School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China)
  • Online: 2023-03-02  Published: 2023-03-02
  • About the authors: 林椹尠 (b. 1969), female, from Xi'an, Shaanxi, professor, master's supervisor, Ph.D., research interests: computer vision and natural language processing, e-mail: zhx126@126.com; corresponding author: 屈嘉欣 (b. 1996), female, from Xianyang, Shaanxi, master's student, research interest: computer vision, e-mail: qujiaxin_118@163.com; 罗亮 (b. 1981), male, from Linxia, Gansu, lecturer, Ph.D., research interest: mathematical methods in image processing, e-mail: luoliang775@163.com.
  • Supported by:
    National Youth Fund Project (12102341); Shaanxi Provincial Department of Education Project (21JK0904); Natural Science Basic Research Program of Shaanxi Province (2020JM-580)

Abstract: The Transformer decoder (Transformer_decoder) model has been widely used in image caption tasks, where the self-attention mechanism (Self Attention) captures fine-grained features to achieve a deeper understanding of the image. This paper improves the Self Attention mechanism in two ways, introducing a vision-boosted attention mechanism (Vision-Boosted Attention, VBA) and a relative-position attention mechanism (Relative-Position Attention, RPA). The vision-boosted attention mechanism adds a VBA layer to the Transformer_decoder, feeding visual features into the Self Attention model as auxiliary information to guide the decoder toward descriptions that better match the image content. The relative-position attention mechanism extends Self Attention with trainable relative-position parameters, adding word-to-word relative position relationships to the input sequence. Experiments on COCO2014 show that both the VBA and RPA attention mechanisms improve the image caption task to some extent, and that a decoder combining the two mechanisms produces better semantic descriptions.
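
The abstract only names the two mechanisms, so the sketch below shows one plausible way to wire them into a decoder layer in PyTorch: a self-attention sub-layer with trainable relative-position embeddings (the RPA idea, in the style of Shaw et al.'s relative position representations) and an additional cross-attention sub-layer in which the word sequence attends over image-region features (the VBA idea). All class, function, and parameter names here (RelativePositionSelfAttention, VisionBoostedDecoderLayer, max_rel_dist, and so on) are illustrative assumptions, not the paper's actual implementation.

# Minimal PyTorch sketch (hypothetical names; not the paper's released code).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativePositionSelfAttention(nn.Module):
    # Single-head self-attention with trainable relative-position embeddings (the RPA idea).
    def __init__(self, d_model, max_rel_dist=16):
        super().__init__()
        self.max_rel_dist = max_rel_dist
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One trainable vector per clipped relative offset in [-max_rel_dist, max_rel_dist].
        self.rel_key = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, x, causal_mask=None):
        b, n, d = x.shape                       # (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        pos = torch.arange(n, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist, self.max_rel_dist)
        rel_k = self.rel_key(rel + self.max_rel_dist)              # (n, n, d)
        scores = q @ k.transpose(-2, -1)                           # content-content term
        scores = scores + torch.einsum("bid,ijd->bij", q, rel_k)   # content-position term
        scores = scores / math.sqrt(d)
        if causal_mask is not None:
            scores = scores.masked_fill(causal_mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

class VisionBoostedDecoderLayer(nn.Module):
    # Decoder layer with a VBA-style sub-layer in which the word sequence attends
    # over image-region features that act as auxiliary visual guidance.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, max_rel_dist=16):
        super().__init__()
        self.self_attn = RelativePositionSelfAttention(d_model, max_rel_dist)   # RPA
        self.vba = nn.MultiheadAttention(d_model, n_heads, batch_first=True)    # VBA
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, visual_feats, causal_mask=None):
        # words: (batch, seq_len, d_model); visual_feats: (batch, num_regions, d_model)
        x = self.norm1(words + self.self_attn(words, causal_mask))
        boosted, _ = self.vba(x, visual_feats, visual_feats)   # words query the visual features
        x = self.norm2(x + boosted)
        return self.norm3(x + self.ffn(x))

# Usage: 20 partially generated caption tokens attending over 49 region features.
layer = VisionBoostedDecoderLayer()
tokens = torch.randn(2, 20, 512)
regions = torch.randn(2, 49, 512)
mask = torch.triu(torch.ones(20, 20, dtype=torch.bool), diagonal=1)
print(layer(tokens, regions, causal_mask=mask).shape)   # torch.Size([2, 20, 512])

Placing the VBA sub-layer between self-attention and the feed-forward block mirrors the usual encoder-decoder attention slot of a Transformer decoder; the abstract does not state where the VBA layer is inserted, so the ordering above is an assumption as well.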

Keywords: image caption, Transformer model, Self Attention mechanism, relative-position attention mechanism, vision-boosted attention mechanism
