Computer and Modernization ›› 2022, Vol. 0 ›› Issue (10): 75-81.
Online: 2022-10-20
Published: 2022-10-21
HUANG Yan-hui, LAN Hai, WEI Xian. Lightweight Vision Transformer Based on Separable Structured Transformations[J]. Computer and Modernization, 2022, 0(10): 75-81.
[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]// Proceedings of the 2012 International Conference on Neural Information Processing Systems. 2012:1097-1105.
[2] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009:248-255.
[3] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[C]// Proceedings of the 2015 International Conference on Learning Representations. 2015:212-219.
[4] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:770-778.
[5] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[C]// Proceedings of the 2020 International Conference on Learning Representations. 2020:321-326.
[6] RIQUELME C, PUIGCERVER J, MUSTAFA B, et al. Scaling vision with sparse mixture of experts[C]// Proceedings of the 2021 International Conference on Neural Information Processing Systems. 2021:1097-1105.
[7] LIU Z, LIN Y T, CAO Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. 2021:9992-10002.
[8] STRUDEL R, GARCIA R, LAPTEV I, et al. Segmenter: Transformer for semantic segmentation[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. 2021:7242-7252.
[9] TAY Y, DEHGHANI M, BAHRI D, et al. Efficient transformers: A survey[J]. arXiv preprint arXiv:2009.06732, 2020.
[10] WANG S N, LI B Z, KHABSA M, et al. Linformer: Self-attention with linear complexity[J]. arXiv preprint arXiv:2006.04768, 2020.
[11] CHOROMANSKI K, LIKHOSHERSTOV V, DOHAN D, et al. Rethinking attention with performers[C]// Proceedings of the 2021 International Conference on Learning Representations. 2021:3-7.
[12] KITAEV N, KAISER L, LEVSKAYA A. Reformer: The efficient transformer[C]// Proceedings of the 2020 International Conference on Learning Representations. 2020:1-7.
[13] CHILD R, GRAY S, RADFORD A, et al. Generating long sequences with sparse transformers[J]. arXiv preprint arXiv:1904.10509, 2019.
[14] BHOJANAPALLI S, YUN C, RAWAT A S, et al. Low-rank bottleneck in multi-head attention models[C]// Proceedings of the 2020 International Conference on Machine Learning. 2020:864-873.
[15] YU X Y, LIU T L, WANG X C, et al. On compressing deep models by low rank and sparse decomposition[C]// Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017:67-76.
[16] WU J X, LENG C, WANG Y H, et al. Quantized convolutional neural networks for mobile devices[C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016:4820-4828.
[17] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
[18] HANSON S J, PRATT L Y. Comparing biases for minimal network construction with back-propagation[C]// Proceedings of the 1st International Conference on Neural Information Processing Systems. 1988:177-185.
[19] WU H P, XIAO B, CODELLA N, et al. CvT: Introducing convolutions to vision transformers[C]// Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. 2021:22-31.
[20] BELTAGY I, PETERS M E, COHAN A. Longformer: The long-document transformer[J]. arXiv preprint arXiv:2004.05150, 2020.
[21] TAY Y, BAHRI D, METZLER D, et al. Synthesizer: Rethinking self-attention for transformer models[C]// Proceedings of the 38th International Conference on Machine Learning. 2021:10183-10192.
[22] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: Attentive language models beyond a fixed-length context[J]. arXiv preprint arXiv:1901.02860, 2019.
[23] RAE J W, POTAPENKO A, JAYAKUMAR S M, et al. Compressive transformers for long-range sequence modelling[C]// Proceedings of the 2020 International Conference on Learning Representations. 2020:26-30.
[24] TAI C, XIAO T, ZHANG Y, et al. Convolutional neural networks with low-rank regularization[J]. arXiv preprint arXiv:1511.06067, 2015.
[25] GONG Y C, LIU L, YANG M, et al. Compressing deep convolutional networks using vector quantization[J]. arXiv preprint arXiv:1412.6115, 2014.
[26] VANHOUCKE V, SENIOR A, MAO M Z. Improving the speed of neural networks on CPUs[EB/OL]. (2011-12-15)[2021-10-02]. https://research.google/pubs/pub37631/.
[27] ROMERO A, BALLAS N, KAHOU S E, et al. FitNets: Hints for thin deep nets[J]. arXiv preprint arXiv:1412.6550, 2014.
[28] MOZER M C, JORDAN M I, PETSCHE T. Advances in Neural Information Processing Systems 9[M]. Morgan Kaufmann Publishers, 1997.
[29] HASSIBI B, STORK D G. Second order derivatives for network pruning: Optimal brain surgeon[C]// Proceedings of the 1992 International Conference on Neural Information Processing Systems. 1992:164-171.
[30] SRINIVAS S, BABU R V. Data-free parameter pruning for deep neural networks[C]// Proceedings of the 2015 British Machine Vision Conference. 2015. DOI: 10.5244/C.29.31.
[31] DENG L. The MNIST database of handwritten digit images for machine learning research [best of the Web][J]. IEEE Signal Processing Magazine, 2012,29(6):141-142.
[32] KRIZHEVSKY A. Learning multiple layers of features from tiny images[R]. University of Toronto, 2009.
[33] GRIFFIN G, HOLUB A, PERONA P. Caltech-256 object category dataset[EB/OL]. (2007-12-15)[2021-12-15]. https://authors.library.caltech.edu/7694/.
[34] SEHWAG V, WANG S Q, MITTAL P, et al. HYDRA: Pruning adversarially robust neural networks[C]// Proceedings of the 2020 International Conference on Neural Information Processing Systems. 2020:97-105.
[35] LIU N, MA X L, XU Z Y, et al. AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates[C]// Proceedings of the 2020 AAAI Conference on Artificial Intelligence. 2020:4876-4883.
[36] SELVARAJU R R, COGSWELL M, DAS A, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. 2017:618-626.