Computer and Modernization (计算机与现代化) ›› 2021, Vol. 0 ›› Issue (11): 1-6.

• Artificial Intelligence •

Approach for Visual Question Answering Based on Equal Attention Graph Networks

  (1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;
    2. Information Department (Informatization Technology Center), Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
  • Online: 2021-12-13 Published: 2021-12-13
  • About the authors: Wang Tianxing (王天星, b. 1996), female, from Yangzhou, Jiangsu, master's student, research interest: visual question answering, E-mail: wtx0730@nuaa.edu.cn; corresponding author: Yuan Jiabin (袁家斌, b. 1968), male, professor, Ph.D., research interests: deep learning, high-performance computing, information security, E-mail: jbyuan@nuaa.edu.cn; Liu Xin (刘昕, b. 1995), male, master's student, research interests: deep learning, action recognition, E-mail: liuxinx@nuaa.edu.cn.
  • Supported by:
    National Key Research and Development Program of China (2017YFB0802303); National Natural Science Foundation of China (62076127, 61571226)

Abstract: Visual question answering (VQA) is a task that combines computer vision with natural language processing. It requires understanding the scene in an image, especially the interactions between different objects. VQA research has made great progress in recent years, but traditional methods adopt holistic feature representations, which largely ignore the structure of the given image and cannot effectively locate the objects in the scene. Graph networks, by contrast, rely on high-level image representations and can capture semantic and spatial relationships. However, previous graph-network-based VQA methods overlooked the role that the correlation between relations and the question plays in the answering process. To address this, a VQA model based on equal attention graph networks, named EAGN, is proposed. Through an equal attention mechanism, relation edges are given the same importance as object nodes, and combining the two provides a fuller basis for answering the question. Experiments show that, compared with other related methods, the EAGN model performs well and is competitive, and that it provides a basis for subsequent related research.

Key words: visual question answering, graph networks, computer vision, natural language processing
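
The abstract does not give EAGN's equations, so the following is only an illustrative sketch of the general idea it describes: the same question-guided attention is applied to object-node features and to relation-edge features, and the two summaries are combined with equal standing. The function names (`attend`, `equal_attention`), the scaled dot-product scoring, the concatenation step, and the toy vectors are all assumptions made for illustration; they are not taken from the paper and should not be read as the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(features, question):
    """Question-guided attention (assumed scaled dot-product scoring).

    Returns the attention weights and the weighted sum of the features.
    """
    scale = math.sqrt(len(question))
    weights = softmax([dot(f, question) / scale for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, pooled

def equal_attention(node_feats, edge_feats, question):
    """Attend over nodes and edges with the same mechanism, then concatenate.

    Concatenation is an assumed fusion step; the paper may fuse differently.
    """
    node_w, node_ctx = attend(node_feats, question)
    edge_w, edge_ctx = attend(edge_feats, question)
    return node_w, edge_w, node_ctx + edge_ctx

# Toy data (hypothetical): three object nodes, two relation edges, one question.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [[0.5, 0.5], [2.0, 0.0]]
question = [1.0, 0.0]

node_w, edge_w, fused = equal_attention(nodes, edges, question)
print(node_w, edge_w, fused)
```

In this toy run, the edge whose features align more strongly with the question vector receives the larger edge weight, and the fused vector carries one summary from the nodes and one from the edges, so neither source is discarded before the answer is predicted.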