Computer and Modernization ›› 2022, Vol. 0 ›› Issue (10): 75-81.

• Image Processing •

  • About the authors: HUANG Yanhui (1995—), male, from Ningde, Fujian, master's degree candidate, research interests: model compression, model robustness, E-mail: 736768695@qq.com; corresponding author: LAN Hai (1988—), male, from Putian, Fujian, assistant researcher, master's degree, research interests: computer vision, pattern recognition, E-mail: lanhai@fjirsm.ac.com; WEI Xian (1986—), male, from Qinyang, Henan, researcher, Ph.D., research interests: computer vision, pattern recognition, E-mail: xian.wei@fjirsm.ac.com.
  • Funding:
    Fujian Science and Technology Innovation Laboratory for Optoelectronic Information of China (Mindu Innovation Laboratory) Project (2021ZZ120); Fujian Provincial Science and Technology Plan Projects (2021T3003, 2021T3068); Quanzhou Science and Technology Plan Project (2021C065L); Putian Science and Technology Plan Project (2020HJSTS006)

Lightweight Vision Transformer Based on Separable Structured Transformations

  (1. College of Electrical Engineering and Automation, Fuzhou University, Fuzhou 350100, China; 
    2. Quanzhou Institute of Equipment Manufacturing, Fujian Institute of Research on the Structure of Matter, 
    Chinese Academy of Sciences, Quanzhou 362200, China)
  • Online: 2022-10-20  Published: 2022-10-21


Abstract: Because the Vision Transformer model has a large number of parameters and a high floating-point operation count, it is difficult to deploy on portable or terminal devices. Since the attention matrix suffers from a low-rank bottleneck, model compression algorithms and attention-acceleration algorithms cannot satisfactorily balance model size, inference speed and model performance. To solve these problems, a lightweight ViT-SST model is designed for image classification. Firstly, by transforming the traditional fully connected layers into a separable structure, the number of model parameters is greatly reduced and inference speed is improved, while guaranteeing that the attention matrix does not lose expressive power by becoming low-rank. Secondly, this paper proposes a Kronecker product approximate decomposition method based on SVD, which converts the publicly available pre-trained parameters of the ViT-Base model to the ViT-Base-SST model; this slightly alleviates the overfitting of ViT-Base and improves accuracy. Experiments on five common public datasets (the CIFAR and Caltech series) show that the proposed method is better suited to Transformer-structured models than traditional compression methods.

Key words: deep learning, computer vision, image classification, model compression
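The two operations the abstract describes, a separable (Kronecker-structured) linear layer and an SVD-based conversion of a dense pre-trained weight matrix into that form, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: `nearest_kronecker` and `separable_matvec` are hypothetical names, and the rank-1 nearest-Kronecker-product construction shown here is the standard Van Loan-Pitsianis rearrangement, which matches the abstract's description of "SVD-based Kronecker product approximate decomposition" only in spirit.

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    """Best Frobenius-norm approximation W ~ A (x) B, with A: m1 x n1 and
    B: m2 x n2, via rearrangement of W followed by a rank-1 SVD."""
    # Rearrange W so that ||W - A (x) B||_F = ||R - vec(A) vec(B)^T||_F:
    # each row of R is the flattened (m2 x n2) block of W indexed by (i, j).
    R = (W.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

def separable_matvec(A, B, x):
    """Apply (A (x) B) to x without materialising the Kronecker product,
    storing m1*n1 + m2*n2 parameters instead of m1*m2 * n1*n2."""
    m1, n1 = A.shape
    m2, n2 = B.shape
    return (A @ x.reshape(n1, n2) @ B.T).reshape(m1 * m2)
```

For instance, a 768 x 768 fully connected weight (the hidden size of ViT-Base) factored as a 32 x 32 matrix Kronecker a 24 x 24 matrix stores about 1.6 K parameters in place of roughly 590 K, which is the source of the parameter and speed savings the abstract claims.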