An Image Description Algorithm Based on Object Detection and Part of Speech Analysis

Abstract

This paper proposes an image description algorithm based on object detection and part-of-speech analysis. It addresses a weakness of existing attention-based image description methods: the generated description often correlates only weakly with the image content. The method extracts information from the image with an object detection algorithm and processes it with a recurrent neural network that uses an attention mechanism to generate the description sentence. While generating each word, the algorithm predicts the word's part of speech and selects a different neural network according to the prediction, which improves the correlation between the description sentence and the original image. Experimental results show that, on several objective evaluation metrics, the descriptions generated by the proposed algorithm improve to varying degrees on those of existing algorithms; in subjective evaluation, they also describe the image content more accurately and fluently.
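The routing idea in the abstract (predict a part of speech, then pick a network) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag set, the two branches, the dot-product attention, the hard argmax selection, and every name below (`predict_pos`, `visual_branch`, `language_branch`, `decode_step`) are assumptions made for illustration, and plain Python lists stand in for learned tensors.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]

# Hypothetical tag set; the paper's actual part-of-speech classes are not given here.
POS_TAGS = ["noun", "verb", "other"]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions: Sequence[Vector], hidden: Vector) -> Vector:
    """Dot-product attention over detected-region features (an assumed scoring rule)."""
    scores = [sum(r * h for r, h in zip(region, hidden)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    return [sum(w * region[d] for w, region in zip(weights, regions)) for d in range(dim)]


def visual_branch(hidden: Vector, regions: Sequence[Vector]) -> Vector:
    """Assumed branch for content words: adds attended image context to the hidden state."""
    context = attend(regions, hidden)
    return [h + c for h, c in zip(hidden, context)]


def language_branch(hidden: Vector, regions: Sequence[Vector]) -> Vector:
    """Assumed branch for function words: ignores the image and keeps the hidden state."""
    return list(hidden)


# Assumed mapping from predicted tag to the branch that handles it.
BRANCHES: Dict[str, Callable[[Vector, Sequence[Vector]], Vector]] = {
    "noun": visual_branch,
    "verb": visual_branch,
    "other": language_branch,
}


def predict_pos(hidden: Vector, pos_weights: Sequence[Vector]) -> str:
    """Linear classifier over the hidden state; returns the most probable tag."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in pos_weights]
    probs = softmax(logits)
    return POS_TAGS[probs.index(max(probs))]


def decode_step(
    hidden: Vector, regions: Sequence[Vector], pos_weights: Sequence[Vector]
) -> Tuple[str, Vector]:
    """One decoding step: predict the tag, then route through the matching branch."""
    tag = predict_pos(hidden, pos_weights)
    return tag, BRANCHES[tag](hidden, regions)


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]                  # toy region features
    pos_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy classifier weights
    print(decode_step([2.0, 0.0], regions, pos_weights))
    print(decode_step([-1.0, -1.0], regions, pos_weights))
```

In a trained model the selection could equally be a soft mixture weighted by the tag probabilities; the hard argmax here is only the simplest way to show the control flow.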

Keywords: image description, recurrent neural network, attention mechanism, object detection, deep learning, natural language processing

GAO Yi-fan, WANG Yong. An Image Description Algorithm Based on Object Detection and Part of Speech Analysis[J]. Computer and Modernization, 2021(03): 108-114.

