Computer and Modernization, 2023, Vol. 0, Issue (05): 8-12.


Image Caption Generation Method Based on Channel Attention and Transformer

  

  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China)
Online: 2023-06-06    Published: 2023-06-06

Abstract: Image caption generation refers to the automatic translation of an image into a descriptive sentence by a computer. To address the problems of existing image caption generation methods, such as insufficient use of the local and global features of images and high time complexity, this paper proposes a hybrid image caption generation model based on Convolutional Neural Networks (CNN) and the Transformer. Considering the spatial and channel characteristics of the convolutional network, the lightweight and high-accuracy Efficient Channel Attention (ECA) module is first fused with the CNN to form an attention residual block, which extracts visual features from the input image. The features are then fed into the Transformer sequence model: the encoder uses self-attention to obtain attended visual representations, while the language decoder captures fine-grained information in the caption and learns the interactions among the caption sequences. The model was validated on the MSCOCO dataset, where the BLEU-1, BLEU-3, BLEU-4, METEOR, and CIDEr metrics improved by 0.3, 0.5, 0.7, 0.4, and 1.6 percentage points, respectively.

Key words: image caption, Transformer, channel attention, convolutional neural network, encoder-decoder model
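The pipeline outlined in the abstract, an ECA channel-attention residual block on top of a CNN feature extractor whose flattened feature map is fed to a Transformer encoder-decoder, can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed settings (d_model, layer counts, vocabulary size, input resolution, and the backbone depth are placeholders, and positional encodings are omitted); it is not the authors' implementation.

import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention (Wang et al., 2020): channel weights from a
    1-D convolution over globally average-pooled channel descriptors."""
    def __init__(self, channels, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=(k_size - 1) // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                   # x: (B, C, H, W)
        y = self.avg_pool(x)                                # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))        # conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))   # back to (B, C, 1, 1)
        return x * y                                        # re-weight channels


class ECAResidualBlock(nn.Module):
    """Residual block with ECA applied before the skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.eca = ECA(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.eca(self.body(x)))


class CaptionModel(nn.Module):
    """CNN + ECA feature extractor feeding a Transformer encoder-decoder.
    Hyperparameters are placeholders, not values from the paper; token
    positional encodings are omitted for brevity."""
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            ECAResidualBlock(d_model),
            ECAResidualBlock(d_model),
            nn.AdaptiveAvgPool2d(7),                         # 7x7 = 49 visual tokens
        )
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        feats = self.backbone(images)                        # (B, D, 7, 7)
        feats = feats.flatten(2).transpose(1, 2)             # (B, 49, D) visual tokens
        tgt = self.embed(captions)                           # (B, T, D)
        mask = self.transformer.generate_square_subsequent_mask(captions.size(1))
        dec = self.transformer(feats, tgt, tgt_mask=mask)    # encoder-decoder attention
        return self.out(dec)                                 # (B, T, vocab_size)


if __name__ == "__main__":
    model = CaptionModel()
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 10000, (2, 16))
    print(model(images, captions).shape)                     # torch.Size([2, 16, 10000])

In this sketch the ECA weights are produced by a cheap 1-D convolution over the pooled channel vector, which is what makes the attention lightweight, and the 7x7 pooled feature map is flattened into a short sequence of visual tokens so the Transformer encoder can attend over image regions while the decoder generates the caption autoregressively.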