[1] 葛梦颖, 孙宝山. Visual question answering system based on deep learning[J]. 现代信息科技 (Modern Information Technology), 2019,3(11):11-13. (in Chinese)
[2] 杨睿, 刘瑞军, 师于茜, et al. A survey of visual question answering research for intelligent interaction[J]. 电子测量与仪器学报 (Journal of Electronic Measurement and Instrumentation), 2019,33(2):117-124. (in Chinese)
[3] ANTOL S, AGRAWAL A, LU J S, et al. VQA: Visual question answering[C]// Proceedings of the 2015 IEEE International Conference on Computer Vision. 2015:2425-2433.
[4] YANG Z C, HE X D, GAO J F, et al. Stacked attention networks for image question answering[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:21-29.
[5] XU D F, ZHU Y K, CHOY C B, et al. Scene graph generation by iterative message passing[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:3097-3106.
[6] TENEY D, LIU L Q, VAN DEN HENGEL A. Graph-structured representations for visual question answering[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:3233-3241.
[7] HAURILET M, ROITBERG A, STIEFELHAGEN R. It’s not about the journey; It’s about the destination: Following soft paths under question-guidance for visual reasoning[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:1930-1939.
[8] SHI J X, ZHANG H W, LI J Z. Explainable and explicit visual reasoning over scene graphs[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:8368-8376.
[9] 于东飞. Research on visual question answering based on attention mechanisms and high-level semantics[D]. Hefei: University of Science and Technology of China, 2019. (in Chinese)
[10] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]// Proceedings of the 2015 Advances in Neural Information Processing Systems. 2015:91-99.
[11] PENNINGTON J, SOCHER R, MANNING C D. GloVe: Global vectors for word representation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014:1532-1543.
[12] CHO K, VAN MERRIENBOER B, GULCEHRE C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014:1724-1734.
[13] ZHANG Y, HARE J, PRÜGEL-BENNETT A. Learning to count objects in natural images for visual question answering[C]// 2018 International Conference on Learning Representations. 2018. https://openreview.net/forum?id=B12Js_yRb.
[14] HUDSON D A, MANNING C D. GQA: A new dataset for real-world visual reasoning and compositional question answering[C]// Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:6693-6702.
[15] KRISHNA R, ZHU Y, GROTH O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017,123(1):32-73.
[16] GOYAL Y, KHOT T, SUMMERS-STAY D, et al. Making the V in VQA matter: Elevating the role of image understanding in visual question answering[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:6325-6334.
[17] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]// 2014 European Conference on Computer Vision. 2014:740-755.
[18] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6077-6086.
[19] HUDSON D A, MANNING C D. Compositional attention networks for machine reasoning[C]// 2018 International Conference on Learning Representations. 2018. https://arxiv.org/abs/1803.03067.
[20] FUKUI A, PARK D H, YANG D, et al. Multimodal compact bilinear pooling for visual question answering and visual grounding[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016:457-468.
[21] ILIEVSKI I, FENG J S. Multimodal learning and reasoning for visual question answering[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017:551-562.
[22] LU J S, YANG J W, BATRA D, et al. Hierarchical question-image co-attention for visual question answering[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016:289-297.
[23] YU D F, FU J L, MEI T, et al. Multi-level attention networks for visual question answering[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:4187-4195.
[24] NGUYEN D K, OKATANI T. Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering[C]// Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:6087-6096.