Approach for Visual Question Answering Based on Equal Attention Graph Networks

Computer and Modernization, 2021, Vol. 0, Issue (11): 1-6.

• Artificial Intelligence •

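(1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;
2. Information Department (Informatization Technology Center), Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)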

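Published: 2021-12-13 Online: 2021-12-13
About the authors: WANG Tian-xing (born 1996), female, from Yangzhou, Jiangsu, master's student; research interest: visual question answering; E-mail: wtx0730@nuaa.edu.cn. Corresponding author: YUAN Jia-bin (born 1968), male, professor, Ph.D.; research interests: deep learning, high-performance computing, information security; E-mail: jbyuan@nuaa.edu.cn. LIU Xin (born 1995), male, master's student; research interests: deep learning, action recognition; E-mail: liuxinx@nuaa.edu.cn.
Funding:
National Key Research and Development Program of China (2017YFB0802303); National Natural Science Foundation of China (62076127, 61571226)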

Abstract

Abstract: Visual question answering (VQA) combines computer vision with natural language processing. It requires understanding the scene in an image, especially the interactions between different objects. VQA research has progressed considerably in recent years, but traditional methods rely on holistic feature representations, which largely ignore the structure of the given image and cannot effectively locate the objects in the scene. Graph networks operate on a high-level image representation and can capture semantic and spatial relationships. However, earlier graph-network-based VQA methods overlooked the role that the association between relations and the question plays in the answering process. To address this, a VQA model based on equal attention graph networks, named EAGN, is proposed. Its equal attention mechanism gives relation edges the same importance as object nodes, and combining the two provides a more sufficient basis for answering the question. Experiments show that, compared with other related methods, EAGN performs well and is competitive, and it also provides a basis for subsequent related research.

Key words: visual question answering, graph networks, computer vision, natural language processing
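
As an illustration of the idea summarized in the abstract, the following minimal Python sketch applies the same question-guided attention to object-node features and to relation-edge features, then concatenates the two pooled vectors. It is not the EAGN implementation: the abstract does not specify the scoring function or the fusion step, so the dot-product scoring, the concatenation, and all names (`attend`, `equal_attention`, the toy vectors) are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(question, features):
    """Question-guided attention (assumed form): softmax over q . f,
    followed by a weighted sum of the feature vectors."""
    weights = softmax([dot(question, f) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features))
              for i in range(dim)]
    return weights, pooled


def equal_attention(question, node_feats, edge_feats):
    """Treat nodes and edges identically: the same attention routine is
    applied to both, and neither pooled vector is down-weighted."""
    node_w, node_ctx = attend(question, node_feats)
    edge_w, edge_ctx = attend(question, edge_feats)
    fused = node_ctx + edge_ctx  # list concatenation (assumed fusion step)
    return node_w, edge_w, fused


# Toy inputs: a 2-d question vector, two object nodes, three relation edges.
question = [1.0, 0.0]
nodes = [[1.0, 0.0], [0.0, 1.0]]
edges = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
node_w, edge_w, fused = equal_attention(question, nodes, edges)
```

In this toy setup the first node and the second edge align best with the question vector, so they receive the largest attention weights in their respective groups; a real model would learn the feature and question embeddings rather than use hand-set vectors.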

WANG Tian-xing, YUAN Jia-bin, LIU Xin. Approach for Visual Question Answering Based on Equal Attention Graph Networks[J]. Computer and Modernization, 2021, 0(11): 1-6.
