Multi-scene Fusion Algorithm for Fine-grained Image Caption

Abstract

To address the poor performance of image captioning across different scenes, a multi-scene image caption generation algorithm based on a convolutional neural network and prior knowledge is proposed. The algorithm generates visual semantic units with a convolutional neural network, uses named entity recognition to identify and predict the image scene, automatically adjusts the focusing parameter of the self-attention mechanism according to the classification result, and computes a multi-scene attention score. Finally, the resulting region encodings and semantic prior knowledge are fed into a Transformer text generator to guide sentence generation. The results show that the algorithm effectively alleviates the problem of captions lacking key scene information. The model is evaluated with standard evaluation metrics on the MSCOCO and Flickr30k datasets; its CIDEr score on MSCOCO reaches 1.210, which is higher than that of comparable image caption generation models.
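The idea of letting a predicted scene adjust the focus of self-attention can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formula, so the sketch assumes the focusing parameter acts as a multiplier on scaled dot-product scores before the softmax. The names `SCENE_FOCUS` and `scene_attention`, the scene labels, and the parameter values are all hypothetical.

```python
import math

# Hypothetical mapping from a predicted scene label to a focusing parameter.
# The labels and values are illustrative assumptions, not taken from the paper.
SCENE_FOCUS = {"indoor": 1.0, "street": 2.0, "sports": 4.0}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scene_attention(query, keys, values, scene):
    """Scaled dot-product attention whose sharpness depends on the scene.

    query:  list of d floats (e.g. the current decoder state)
    keys:   list of region vectors, each of length d
    values: list of region vectors, one per key
    scene:  predicted scene label, looked up in SCENE_FOCUS
    """
    d = len(query)
    # Assumption: unknown scenes fall back to ordinary attention (gamma = 1).
    gamma = SCENE_FOCUS.get(scene, 1.0)
    scores = [
        gamma * sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the region value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    query = [1.0, 0.0]
    regions = [[1.0, 0.0], [0.0, 1.0]]
    for scene in ("indoor", "sports"):
        weights, _ = scene_attention(query, regions, regions, scene)
        print(scene, [round(w, 3) for w in weights])
```

Under this assumption, a larger focusing parameter concentrates the attention weights on the best-matching region, while a value of 1 recovers standard scaled dot-product attention.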

Key words: image caption, CNN, NER, multi-scene attention, Transformer structure

LI Xin-ye, ZHANG Cheng-qiang, ZHOU Xiong-tu, GUO Tai-liang, ZHANG Yong-ai. Multi-scene Fusion Algorithm for Fine-grained Image Caption[J]. Computer and Modernization, 2021, 0(09): 1-6.

