Computer and Modernization ›› 2023, Vol. 0 ›› Issue (01): 7-12.


Enhanced Image Caption Based on Improved Transformer_decoder

  

  1. (School of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China)
  • Online: 2023-03-02    Published: 2023-03-02

Abstract: The Transformer decoder model (Transformer_decoder) has been widely used in image caption tasks, where Self Attention captures fine-grained features to achieve a deeper understanding of the image. This article makes two improvements to Self Attention: Vision-Boosted Attention (VBA) and Relative-Position Attention (RPA). Vision-Boosted Attention adds a VBA layer to the Transformer_decoder and introduces visual features as auxiliary information into the attention model, guiding the decoder to generate description semantics that better match the image content. Building on Self Attention, Relative-Position Attention introduces trainable relative-position parameters that add the relative positional relationships between words to the input sequence. Experiments on the COCO2014 dataset show that the VBA and RPA attention mechanisms each improve image caption performance to a certain extent, and that the decoder model combining the two achieves better semantic expression.
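The relative-position idea described above can be sketched as standard scaled dot-product self-attention with a trainable bias added to the attention logits, indexed by the (clipped) offset between query and key positions. This is a minimal illustrative sketch only; the shapes, the clipping distance, the scalar-bias form, and the random initialization are assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 6, 8
max_rel_dist = 4  # relative offsets clipped to [-4, 4] (assumption)

x = rng.normal(size=(seq_len, d_model))  # decoder input sequence

# Trainable projections (random stand-ins for this sketch)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# Trainable relative-position parameters: one bias per clipped offset
rel_bias = rng.normal(size=(2 * max_rel_dist + 1,))

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Content-based attention logits
logits = Q @ K.T / np.sqrt(d_model)

# Add the relative-position bias b[clip(j - i)] to each logit (i, j)
idx = np.arange(seq_len)
rel = np.clip(idx[None, :] - idx[:, None], -max_rel_dist, max_rel_dist)
logits = logits + rel_bias[rel + max_rel_dist]

# Softmax over keys, then weighted sum of values
weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V

print(out.shape)  # (6, 8)
```

In a full model the bias table (or an equivalent relative-position embedding) would be learned jointly with the rest of the decoder; the VBA layer would additionally cross-attend to encoded visual features, which this sketch omits.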

Key words: image caption, Transformer model, Self Attention mechanism, relative-position attention, vision-boosted attention