Multi-scene Fusion Algorithm for Fine-grained Image Caption

Abstract

To address the poor performance of image captioning across different scenes, this paper proposes a multi-scene attention image caption generation algorithm that fuses a convolutional neural network with prior knowledge. The algorithm uses the convolutional neural network to generate visual semantic units, applies named entity recognition to identify and predict the image scene, and uses the predicted scene to automatically adjust the key parameters of the self-attention mechanism and compute multi-scene attention. Finally, the resulting region encodings and semantic prior knowledge are inserted into a Transformer text generator to guide sentence generation. The results show that the algorithm effectively addresses the problem of generated captions missing key scene information. The model is evaluated on the MSCOCO and Flickr30k datasets; on MSCOCO it reaches a CIDEr score of 1.210, outperforming comparable image caption generation models.

Key words: image captioning, convolutional neural network (CNN), named entity recognition (NER), multi-scene attention, Transformer structure
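
The abstract does not say which self-attention parameter the predicted scene adjusts, or how. As a rough illustration of the idea only, the sketch below assumes the parameter is a softmax temperature applied to scaled dot-product attention. The `SCENE_TEMPERATURE` table, its values, the scene labels, and the function names are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical mapping from a predicted scene label to a focusing
# parameter (softmax temperature). The values are illustrative only.
SCENE_TEMPERATURE = {"indoor": 1.0, "outdoor": 0.5}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scene_attention(query, keys, values, scene):
    """Scaled dot-product attention for one query vector.

    The scene label selects a temperature that rescales the scores
    before the softmax; unknown scenes fall back to 1.0 (assumption).
    """
    tau = SCENE_TEMPERATURE.get(scene, 1.0)
    scale = math.sqrt(len(query)) * tau
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Two regions; the query matches the first region's key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
w_indoor, _ = scene_attention(query, keys, values, "indoor")
w_outdoor, _ = scene_attention(query, keys, values, "outdoor")
```

Under this assumption, a lower temperature concentrates the attention weights on the best-matching region, so `w_outdoor[0]` is larger than `w_indoor[0]` in the toy example. A learned, scene-dependent parameter would replace the fixed table in any real model.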

LI Xin-ye, ZHANG Cheng-qiang, ZHOU Xiong-tu, GUO Tai-liang, ZHANG Yong-ai. Multi-scene Fusion Algorithm for Fine-grained Image Caption[J]. Computer and Modernization, 2021(09): 1-6.

