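[1] 葛梦颖, 孙宝山. Visual question answering system based on deep learning[J]. Modern Information Technology, 2019,3(11):11-13. (in Chinese)
[2] 杨睿, 刘瑞军, 师于茜, et al. A survey of visual question answering for intelligent interaction[J]. Journal of Electronic Measurement and Instrumentation, 2019,33(2):117-124. (in Chinese)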
[3] ANTOL S, AGRAWAL A, LU J S, et al. VQA: Visual question answering[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. 2015:2425-2433.
[4] YANG Z C, HE X D, GAO J F, et al. Stacked attention networks for image question answering[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:21-29.
[5] XU D F, ZHU Y K, CHOY C B, et al. Scene graph generation by iterative message passing[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:3097-3106.
[6] TENEY D, LIU L Q, VAN DEN HENGEL A. Graph-structured representations for visual question answering[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:3233-3241.
[7] HAURILET M, ROITBERG A, STIEFELHAGEN R. It’s not about the journey; It’s about the destination: Following soft paths under question-guidance for visual reasoning[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:1930-1939.
[8] SHI J X, ZHANG H W, LI J Z. Explainable and explicit visual reasoning over scene graphs[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:8368-8376.
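[9] 于东飞. Research on visual question answering based on attention mechanism and high-level semantics[D]. Hefei: University of Science and Technology of China, 2019. (in Chinese)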
[10] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]// Proceedings of the 2015 Advances in Neural Information Processing Systems. 2015:91-99.
[11] PENNINGTON J, SOCHER R, MANNING C D. GloVe: Global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014:1532-1543.
[12] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014:1724-1734.
[13] ZHANG Y, HARE J, PRÜGEL-BENNETT A. Learning to count objects in natural images for visual question answering[C]// 2018 International Conference on Learning Representations. 2018. https://openreview.net/forum?id=B12Js_yRb.
[14] HUDSON D A, MANNING C D. GQA: A new dataset for real-world visual reasoning and compositional question answering[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:6693-6702.
[15] KRISHNA R, ZHU Y, GROTH O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017,123(1):32-73.
[16] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: Elevating the role of image understanding in visual question answering[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:6325-6334.
[17] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]// 2014 European Conference on Computer Vision. 2014:740-755.
[18] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6077-6086.
[19] HUDSON D A, MANNING C D. Compositional attention networks for machine reasoning[C]// 2018 International Conference on Learning Representations. 2018. https://arxiv.org/abs/1803.03067.
[20] FUKUI A, PARK D H, YANG D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016:457-468.
[21] ILIEVSKI I, FENG J S. Multimodal learning and reasoning for visual question answering[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017:551-562.
[22] LU J S, YANG J W, BATRA D, et al. Hierarchical question-image co-attention for visual question answering[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016:289-297.
[23] YU D F, FU J L, MEI T, et al. Multi-level attention networks for visual question answering[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:4187-4195.
[24] NGUYEN D K, OKATANI T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6087-6096.