[1] MA M, WANG B L, WU Q, et al. Visual scene description and its effect evaluation[J]. Journal of Software, 2019,30(4):867-883.
[2] JIN H Z, LIU X L, HU Z K. An image caption generation model combining global and local features[J]. Journal of Applied Sciences, 2019,37(4):501-509.
[3] RUMELHART D E, HINTON G E, WILLIAMS R J. Learning representations by back-propagating errors[J]. Cognitive Modeling, 1988,5(3):1.
[4] ELMAN J L. Finding structure in time[J]. Cognitive Science, 1990,14(2):179-211.
[5] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: A neural image caption generator[C]// Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2015:3156-3164.
[6] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J/OL]. (2014-09-03)[2019-11-24]. https://arxiv.org/pdf/1406.1078.
[7] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015:91-99.
[8] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[J/OL]. (2016-05-19)[2019-11-24]. https://arxiv.org/pdf/1409.0473.pdf.
[9] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997,9(8):1735-1780.
[10] CHUNG J, GULCEHRE C, CHO K H, et al. Empirical evaluation of gated recurrent neural networks on sequence modeling[J/OL]. (2014-12-11)[2019-11-24]. https://arxiv.org/pdf/1412.3555.
[11] RUMELHART D E, HINTON G E, WILLIAMS R J. Learning representations by back-propagating errors[J]. Nature, 1986,323:533-536.
[12] LECUN Y, BENGIO Y, HINTON G. Deep learning[J]. Nature, 2015,521:436-444.
[13] DENG Z R, ZHANG B J, JIANG Z Q, et al. Image caption model fusing word2vec and attention mechanism[J]. Computer Science, 2019,46(4):268-273.
[14] CHEN X L, FANG H, LIN T Y, et al. Microsoft COCO captions: Data collection and evaluation server[J/OL]. (2015-04-03)[2019-11-24]. https://arxiv.org/pdf/1504.00325.
[15] SHI Y, WANG Y, WU S Q. Machine translation system based on the self-attention model[J]. Computer and Modernization, 2019(7):9-14.
[16] GLOROT X, BENGIO Y. Understanding the difficulty of training deep feedforward neural networks[C]// Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 2010:249-256.
[17] HINTON G E, SRIVASTAVA N, KRIZHEVSKY A, et al. Improving neural networks by preventing co-adaptation of feature detectors[J/OL]. (2012-07-03)[2019-11-24]. https://arxiv.org/pdf/1207.0580.
[18] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: A method for automatic evaluation of machine translation[C]// Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. 2002:311-318.
[19] LIN C Y. ROUGE: A package for automatic evaluation of summaries[C]// Proceedings of the Workshop on Text Summarization Branches Out. 2004:74-81.
[20] BANERJEE S, LAVIE A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]// Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 2005:65-72.
[21] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: Consensus-based image description evaluation[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:4566-4575.
[22] MAO J H, XU W, YANG Y, et al. Deep captioning with multimodal recurrent neural networks (m-RNN)[J/OL]. (2015-06-11)[2019-11-24]. https://arxiv.org/abs/1412.6632.
[23] XU K, LEI BA J, KIROS R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]// Proceedings of the 32nd International Conference on International Conference on Machine Learning. 2015:2048-2057.
[24] JIANG L, XU M, LIU T, et al. DeepVS: A deep learning based video saliency prediction approach[C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018:602-617.
[25] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6077-6086.