Approach for Visual Question Answering Based on Equal Attention Graph Networks

Abstract

Visual question answering (VQA) combines computer vision with natural language processing. Answering a question about an image requires understanding the scene, and in particular the interactions between the objects in it. VQA has progressed considerably in recent years, but traditional methods adopt holistic feature representations, which largely ignore the structure of the image and cannot effectively locate objects in the scene. Graph networks operate on a high-level image representation that can capture semantic and spatial relationships. However, earlier graph-network-based VQA methods overlooked how the relationships correspond to the question during answering. To address this, a VQA model based on equal attention graph networks, named EAGN, is proposed. Through an equal attention mechanism, relationship edges are given the same importance as object nodes, and combining the two provides a fuller basis for answering the question. Experiments show that EAGN performs competitively with other related methods and provides a basis for subsequent research.

Key words: visual question answering, graph networks, computer vision, natural language processing
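
The idea of attending to relationship edges with the same mechanism as object nodes can be illustrated with a minimal sketch. This is not the EAGN architecture, which the abstract does not specify; the bilinear scoring function, the feature sizes, the random weights, and the names `question_attention`, `W_node`, and `W_edge` are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def question_attention(features, question, weight):
    """Question-guided attention over the rows of `features`.

    Scores each row against the question with a bilinear form
    (an assumed scoring function), then returns the attention
    weights and the attention-weighted sum of the rows.
    """
    scores = features @ weight @ question   # shape: (num_rows,)
    alpha = softmax(scores)
    return alpha, alpha @ features          # weights, pooled feature

rng = np.random.default_rng(0)
d_obj, d_rel, d_q = 8, 6, 5                 # hypothetical feature sizes
nodes = rng.normal(size=(4, d_obj))         # 4 object nodes
edges = rng.normal(size=(7, d_rel))         # 7 relationship edges
question = rng.normal(size=d_q)             # encoded question vector

# Hypothetical projection matrices; in a real model these would be learned.
W_node = rng.normal(size=(d_obj, d_q))
W_edge = rng.normal(size=(d_rel, d_q))

# The same attention routine is applied to nodes and to edges,
# so edges are not treated as secondary to nodes.
node_alpha, node_summary = question_attention(nodes, question, W_node)
edge_alpha, edge_summary = question_attention(edges, question, W_edge)

# Concatenate both summaries as the joint basis for answer prediction.
fused = np.concatenate([node_summary, edge_summary])
print(fused.shape)
```

In this sketch, each element type gets its own normalized attention distribution conditioned on the question, and the two pooled vectors are concatenated rather than one being folded into the other.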

WANG Tian-xing, YUAN Jia-bin, LIU Xin. Approach for Visual Question Answering Based on Equal Attention Graph Networks[J]. Computer and Modernization, 2021, 0(11): 1-6.

