Computer and Modernization ›› 2023, Vol. 0 ›› Issue (05): 8-12.

• Artificial Intelligence •

Image Caption Generation Method Based on Channel Attention and Transformer

  LIU Jing, CHEN Jinguang

  1. (School of Computer Science, Xi’an Polytechnic University, Xi’an 710048, China)
  • Online: 2023-06-06  Published: 2023-06-06
  • About the authors: LIU Jing (b. 1998), female, from Zhouzhi, Shaanxi, M.S. candidate; research interests: image processing and understanding; E-mail: 1850351173@qq.com. Corresponding author: CHEN Jinguang (b. 1977), male, from Zhenping, Henan, professor; research interests: multi-source information fusion, machine learning, and computer applications; E-mail: xacjg@163.com.
  • Funding:
    Natural Science Basic Research Program of Shaanxi Province (2023-JC-YB-568); Scientific Research Program of Shaanxi Provincial Department of Education (22JP028)

Abstract: Image caption generation is the task of automatically translating an image into a natural-language sentence that describes it. To address two shortcomings of existing image captioning methods, namely insufficient use of the local and global features of the image and high time complexity, this paper proposes a hybrid image caption generation model that combines a Convolutional Neural Network (CNN) with a Transformer. Exploiting the spatial and channel characteristics of convolutional networks, the lightweight, high-accuracy Efficient Channel Attention (ECA) module is first fused with the CNN to form an attention residual block, which extracts visual features from the input image. These features are then fed into a Transformer sequence model: the encoder uses self-attention to produce attended visual representations, while the language decoder captures fine-grained information within the caption and learns the interactions among caption tokens, generating the caption for the image step by step from the visual features. In experiments on the MSCOCO dataset, the model improves BLEU-1, BLEU-3, BLEU-4, METEOR, and CIDEr by 0.3, 0.5, 0.7, 0.4, and 1.6 percentage points, respectively.
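As an illustration of the attention residual block described in the abstract, the following is a minimal PyTorch sketch. The module names (ECA, ECAResidualBlock) and layer sizes are illustrative assumptions rather than the authors' released code; the adaptive kernel-size rule follows the published ECA-Net design (Wang et al., 2020).

```python
# Minimal sketch of an ECA-augmented residual block; names and sizes are
# illustrative assumptions, not the paper's implementation.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel weights from a 1D convolution
    over the globally pooled feature, as in ECA-Net."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size: nearest odd number to (log2(C) + b) / gamma.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pool(x)                      # (B, C, H, W) -> (B, C, 1, 1)
        y = y.squeeze(-1).transpose(1, 2)     # -> (B, 1, C)
        y = self.conv(y)                      # local cross-channel interaction
        y = y.transpose(1, 2).unsqueeze(-1)   # -> (B, C, 1, 1)
        return x * self.sigmoid(y)            # channel-wise re-weighting

class ECAResidualBlock(nn.Module):
    """Residual block with ECA applied before the skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.eca = ECA(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.eca(out)         # re-weight channels before the skip
        return self.relu(out + x)   # residual connection
```

Because ECA replaces SE-style fully connected layers with a single size-k 1D convolution, it adds only k weights per block, which is consistent with the abstract's emphasis on light weight and low time complexity.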

Key words: image caption, Transformer, channel attention, convolutional neural network, encoder-decoder model

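For the model's second stage, the sketch below shows one plausible way to wire the extracted visual features into a Transformer encoder-decoder for caption generation, using PyTorch's built-in layers. All hyperparameters (d_model, layer counts, vocabulary size) and the learned positional embedding are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch of the CNN-features -> Transformer captioning stage;
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class CaptionTransformer(nn.Module):
    """Encode flattened grid features with self-attention, then decode a
    caption with masked self-attention plus cross-attention to the image."""
    def __init__(self, feat_dim=2048, d_model=512, vocab_size=10000,
                 nhead=8, num_layers=3, max_len=20):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # CNN features -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positions
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers)
        self.decoder = nn.TransformerDecoder(dec, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, feat_dim) flattened CNN grid; tokens: (B, T) word ids
        memory = self.encoder(self.proj(feats))    # attended visual repr.
        T = tokens.size(1)
        pos = self.pos(torch.arange(T, device=tokens.device))
        # Causal mask so each position attends only to earlier tokens.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=tokens.device), diagonal=1)
        hidden = self.decoder(self.embed(tokens) + pos, memory, tgt_mask=causal)
        return self.out(hidden)                    # (B, T, vocab_size) logits
```

At inference time the decoder would be run autoregressively, feeding each newly predicted token back in until an end-of-sentence token is produced, which matches the abstract's description of generating the caption step by step from the visual features.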