A Review of Deep Neural Networks Combined with Attention Mechanism

Abstract

Abstract: Attention mechanism has become one of the research hotspots in improving the learning ability of deep neural network. In view of the wide attention paid to the attention mechanism， this paper aims to give a comprehensive analysis and elaboration of attention mechanism in deep neural network from three aspects: the classification of attention mechanism， the way of combining with deep neural network， and the specific applications in natural language processing and computer vision. Specifically， attention mechanism has been divided into soft attention mechanism， hard attention mechanism and self-attention mechanism， and their advantages and disadvantages are compared. Then， the common ways of combining attention mechanism in recursive neural network and convolutional neural network are discussed respectively， and the representative model structures of each way are given. After that， the applications of attention mechanism in natural language processing and computer vision are illustrated. Finally， several future developments of attention mechanism are illustrated expecting to provide clues and directions for subsequent researches.

Key words: attention mechanisms, deep learning, neural networks, attention models

HUANGFU Xiao-ying, QIAN Hui-min, HUANG Min. A Review of Deep Neural Networks Combined with Attention Mechanism[J]. Computer and Modernization, 2023, 0(02): 40-49.

References

［1］吴建鑫，高斌斌，魏秀参，等. 资源受限的深度学习:挑战与实践［J］. 中国科学:信息科学， 2018，48（5）:501-510.
［2］ BORJI A， ITTI L. State-of-the-art in visual attention modeling［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2013，35（1）:185-207.
［3］王培森. 基于注意力机制的图像分类深度学习方法研究［D］. 合肥:中国科学技术大学， 2018.
［4］ HONG Z. A preliminary study on artificial neural network［C］// 2011 6th IEEE Joint International Information Technology and Artificial Intelligence Conference. 2011:336-338.
［5］ CHAUDHARI S， MITHAL V， POLATKAN G， et al. An Attentive Survey of Attention Models［J］. arXiv preprint arXiv:1904.02874， 2019.
［6］任欢，王旭光. 注意力机制综述［J］. 计算机应用， 2021，41（S01）:1-6.
［7］ CORREIA A D S， COLOMBINI E L. Attention， please! A survey of neural attention models in deep learning［J］. arXiv preprint arXiv:2103.16775， 2021.
［8］ XU K， BA J， KIROS R， et al. Show， attend and tell: Neural image caption generation with visual attention［J］. arXiv preprint arXiv:1502.03044v2， 2015.
［9］ HU J， SHEN L， ALBANIE S， et al. Squeeze-and-xxcitation networks［J］. IEEE Transactions on Pattern Analysis and Machine Intelligence， 2020，42（8）:2011-2023.
［10］ WANG Q L， WU B G， ZHU P F， et al. ECA-Net: Efficient channel attention for deep convolutional neural networks［C］// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR）. 2020:11531-11539.
［11］ LI X， WANG W H， HU X L， et al. Selective kernel networks［C］// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR）. 2019:510-519.
［12］ JADERBERG M， SIMONYAN K， ZISSERMAN A， et al. Spatial transformer networks［J］. arXiv preprint arXiv:1506.02025v3， 2015.
［13］ WOO S， PARK J， LEE J， et al. CBAM: Convolutional block attention module［C］// Computer Vision - ECCV 2018. 2018. DOI:10.1007/978-3-030-01234-2_1.
［14］ PARK J， WOO S， LEE J， et al. BAM: Bottleneck attention module［J］. arXiv preprint arXiv:1807.06514， 2018.
［15］ ROY A G， NAVAB N， WACHINGER C. Concurrent spatial and channel squeeze & excitation in fully convolutional networks［J］. arXiv preprint arXiv:1803.02579v2， 2018.
［16］ MNIH V， HEESS N， GRAVES A， et al. Recurrent models of visual attention［C］// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014:2204-2212.
［17］ MALINOWSKI M， DOERSCH C， SANTORO A， et al. Learning visual question answering by？Bootstrapping hard attention［C］// Computer Vision – ECCV 2018. 2018. DOI:10.1007/978-3-030-01231-1_1.
［18］ ZHOU S K， LE H N， LU U K， et al. Deep reinforcement learning in medical imaging: A literature review［J］. arXiv preprint arXiv:2103.05115， 2021.
［19］ MICHEL P， LEVY O， NEUBIG Graham. Are sixteen heads really better than one？［J］. arXiv preprint arXiv: 1905.10650， 2019.
［20］ VASWANI A， SHAZEER N， PARMAR N， et al. Attention is all you need［J］. arXiv preprint arXiv:1706.03762v5， 2017.
［21］ WANG X， GIRSHICK R， GUPTA A， et al. Non-local neural networks［C］// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018:7794-7803.
［22］ HUANG Z L， WANG X G， HUANG L C， et al. CCNet: Criss-cross attention for semantic segmentation［C］// 2019 IEEE/CVF International Conference on Computer Vision （ICCV）. 2019:603-612.
［23］ REN S C， ZHOU D Q， HE S F， et al. Shunted self-attention via multi-scale token aggregation［J］. arXiv preprint arXiv:2111.15193， 2021.
［24］ PI H J， WANG H Y， LI Y W， et al. Searching for TrioNet: Combining convolution with local and global self-attention［J］. arXiv preprint arXiv:2111.07547， 2021.
［25］ LIPTON Z C， BERKOWITZ J， ELKAN C. A critical review of recurrent neural networks for sequence learning［J］. arXiv preprint arXiv: 1506.00019v4， 2015.
［26］ SEO M， KEMBHAVI A， FARHADI A， et al. Bidirectional attention flow for machine comprehension［J］. arXiv preprint arXiv:1611.01603v6， 2016.
［27］ WANG B， LIU K， ZHAO J. Inner attention based recurrent neural networks for answer selection［C］// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016:1288-1297.
［28］ QIN Y， SONG D J， CHENG H F， et al. A dual-stage attention-based recurrent neural network for time series prediction［C］// Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017:2627-2633.
［29］ ZHANG P F， XUE J R， LAN C L， et al. Adding attentiveness to the neurons in recurrent neural networks［C］// 15th European Conference on Munich， Germany. 2018:136-152.
［30］ HUBEL D H， WIESEL T N. Early exploration of the visual cortex［J］. Neuron， 1998，20（3）:401-412.
［31］ XU S J， CHENG Y， GU K， et al. Jointly attentive spatial-temporal pooling networks for video-based person re-identification［C］// 2017 IEEE International Conference on Computer Vision （ICCV）. 2017:4743-4752.
［32］ LIU N， LONG Y C， ZOU C Q， et al. ADCrowdNet: An attention-injective deformable convolutional network for crowd understanding［C］// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR）. 2019:3220-3229.
［33］ GUO S N， LIN Y F， FENG N， et al. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting［J］. Proceedings of the AAAI Conference on Artificial Intelligence， 2019，33（1）:922-929.
［34］ WANG F， JIANG M Q， QIAN C， et al. Residual attention network for image classification［C］// 2017 IEEE Conference on Computer Vision and Pattern Recognition （CVPR）. 2017:6450-6458
［35］ FU J L， ZHENG H L， MEI T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition［C］// 2017 IEEE Conference on Computer Vision and Pattern Recognition （CVPR）. 2017:4476-4484.
［36］ YANG S H， WANG Y X， CHU X W. A survey of deep learning techniques for neural machine translation［J］. arXiv preprint arXiv:2002.07526v1， 2020.
［37］ BAHDANAU D， CHO K， BENGIO Y. Neural machine translation by jointly learning to align and translate［J］. arXiv preprint arXiv:1409.0473v7， 2016.
［38］ LUONG M， PHAM H， MANNING C D. Effective approaches to attention-based neural machine translation［J］. arXiv preprint arXiv:1508.04025， 2015.
［39］ GEHRING J， AULI M， GRANGIER D， et al. Convolutional sequence to sequence learning［J］. arXiv preprint arXiv:1705.03122v3， 2017.
［40］ DEVLIN J， CHANG M， LEE K， et al. BERT: Pre-training of deep bidirectional transformers for language understanding［J］. arXiv preprint arXiv:1810.04805， 2018.
［41］杨小冈，高凡，卢瑞涛，等. 基于改进YOLOv5的轻量化航空目标检测方法［J］. 信息与控制， 2022，51（3）:361-368.
［42］刘赏，葛顶玉，耿明筱.结合全局与局部的人群集体性卷积网络识别方法［J/OL］.信息与控制.［2022-01-01］ https://doi.org/10.13976/j.cnki.xk.2022.1381.
［43］ CHEN T， LIU Y， SU H， et al. Dual-Awareness Attention for Few-Shot Object Detection［J］. arXiv preprint arXiv:2102.12152v3， 2021.
［44］ NIU R， SUN X， TIAN Y， et al. Hybrid multiple attention network for semantic segmentation in aerial images［J］. IEEE Transactions on Geoscience and Remote Sensing. 2021，60:1-18.
［45］ FU J， LIU J， TIAN H J， et al. Dual attention network for scene segmentation［C］// 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition （CVPR）. 2019:3141-3149.
［46］ WANG H， WANG W N， LIU J. Temporal memory attention for video semantic segmentation［J］. arXiv preprint arXiv:2102.08643， 2021.
［47］于东飞. 基于注意力机制与高层语义的视觉问答研究［D］. 合肥:中国科学技术大学， 2019.
［48］ GUO W Y， ZHANG Y， WU X F， et al. Re-attention for visual question answering［C］// Proceedings of the AAAI Conference on Artificial Intelligence. 2020，34（1）:91-98.
［49］孟乐乐. 融合时空网络与注意力机制的人体行为识别研究［D］. 北京:北京交通大学， 2018.
［50］ PEREZ-RUA J M， MARTINEZ B， ZHU X， et al. Knowing what， where and when to look: Efficient video action modeling with attention［J］. arXiv preprint arXiv: 2004.01278v1， 2020.
［51］ PU S， SONG Y B， MA C， et al. Deep attentive tracking via reciprocative learning［J］. arXiv preprint arXiv:1810.03851， 2018.
［52］ CAO Z， FU C H， YE J J， et al. SiamAPN++: Siamese attentional aggregation network for real-time UAV Tracking［J］. arXiv preprint arXiv:2106.08816v2， 2021.
［53］ XUE Y， YUAN Z M， NERI F. ConAM: Confidence attention module for convolutional neural networks［J］. arXiv preprint arXiv:2110.14369， 2021.
［54］ GUO M H， LU C Z， LIU Z N， et al. Visual attention network［J］. arXiv preprint arXiv:2202.09741， 2022.
［55］姚懿秦，郭薇. 基于交互注意力机制的多模态情感识别算法［J］. 计算机应用研究， 2021，38（6）:1689-1693.
［56］ LIU H D， XU S Y， FU J M， et al. CMA-CLIP: Cross-modality attention CLIP for image-text classification［J］. arXiv preprint arXiv:2112.03562， 2021.
［57］ HAFIZ A M， PARAH S A， BHAT R U A. Attention mechanisms and deep learning for machine vision: A survey of the state of the art［J］. arXiv preprint arXiv:2106.07550v1， 2021.
［58］姚玉倩. 基于胶囊网络的人脸表情特征提取与识别算法研究［D］. 北京:北京交通大学， 2019.

[1]	HE Sida, CHEN Pinghua. Intent-based Lightweight Self-Attention Network for Sequential Recommendation [J]. Computer and Modernization, 2024, 0(12): 1-9.
[2]	LIU Baobao, YANG Jingjing, TAO Lu, WANG Heying . DSMSC Based on Attention Mechanism for Remote Sensing Image Scene Classification [J]. Computer and Modernization, 2024, 0(12): 72-77.
[3]	QI Xian, LIU Daming, CHANG Jiaxin. Multi-view 3D Reconstruction Based on Improved Self-attention Mechanism [J]. Computer and Modernization, 2024, 0(11): 106-112.
[4]	CHEN Kai1, LI Yiting1, 2, QUAN Huafeng1. A River Discarded Bottles Detection Method Based on Improved YOLOv8 [J]. Computer and Modernization, 2024, 0(11): 113-120.
[5]	YANG Jun1, HU Wei1, ZHU Wenfu2. Visual SLAM Loop Closure Detection Algorithm Based on Improved MobileNetV3 [J]. Computer and Modernization, 2024, 0(10): 21-26.
[6]	WANG Yingying, HAO Xiao. Fine-grained Image Classification Based on Res2Net and Recursive Gated Convolution [J]. Computer and Modernization, 2024, 0(10): 74-79.
[7]	SHI Xingyu1, LI Qiang2, ZHUANG Li3, LIANG Yi3, WANG Qiulin3, CHEN Kai3, WU Chenzhou3, CHANG Sheng1. Object Detection Models Distillation Technique for Industrial Deployment [J]. Computer and Modernization, 2024, 0(10): 93-99.
[8]	ZHANG Ze1, ZHANG Jianquan2, 3, ZHOU Guopeng2, 3. Camera Module Defect Detection Based on Improved YOLOv8s [J]. Computer and Modernization, 2024, 0(09): 107-113.
[9]	CHENG Yazi1, LEI Liang1, 2, CHEN Han1, ZHAO Yiran1. Multi-scale Depth Fusion Monocular Depth Estimation Based on Transposed Attention [J]. Computer and Modernization, 2024, 0(09): 121-126.
[10]	CHENG Meng, LI Hao. Improved Deciduous Tree Nest Detection Method Based on YOLOv5s [J]. Computer and Modernization, 2024, 0(08): 24-29.
[11]	WANG Mengxi, LI Jun. Review of Fall Detection Technologies for Elderly [J]. Computer and Modernization, 2024, 0(08): 30-36.
[12]	SHI Xianwei1, FAN Xin2. Semantic Segmentation of Video Frame Scene Based on Lightweight [J]. Computer and Modernization, 2024, 0(08): 49-53.
[13]	XU Xin’ai, LI Gang. An Image Generation Method of Classroom Expression Images [J]. Computer and Modernization, 2024, 0(08): 88-91.
[14]	GAO Shuaipeng, WANG Yifan. Survey on Group-level Emotion Recognition in Images [J]. Computer and Modernization, 2024, 0(08): 98-107.
[15]	HUANG Wendong, WANG Yifan. Survey on Multimodal Information Processing and Fusion Based on Modal Categories [J]. Computer and Modernization, 2024, 0(07): 47-62.