Computer and Modernization ›› 2022, Vol. 0 ›› Issue (10): 75-81.

Lightweight Vision Transformer Based on Separable Structured Transformations

  

  (1. College of Electrical Engineering and Automation, Fuzhou University, Fuzhou 350100, China;
   2. Quanzhou Institute of Equipment Manufacturing, Fujian Institute of Research on the Structure of Matter,
   Chinese Academy of Sciences, Quanzhou 362200, China)
  Online: 2022-10-20  Published: 2022-10-21

Abstract: Because the Vision Transformer model has a large number of parameters and a high floating-point operation count, it is difficult to deploy on portable or terminal devices. Moreover, since the attention matrix suffers from a low-rank bottleneck, existing model compression algorithms and attention acceleration algorithms cannot well balance model parameter count, inference speed, and model performance. To solve these problems, a lightweight ViT-SST model is designed. First, by transforming the traditional fully connected layer into a separable structure, the number of model parameters is greatly reduced and inference speed is improved, while guaranteeing that the attention matrix does not become low-rank and thus lose expressive ability. Second, this paper proposes a Kronecker product approximate decomposition based on SVD, which can convert the pretrained parameters of the public ViT-Base model into parameters for the ViT-Base-SST model; this slightly alleviates the overfitting of ViT-Base and improves model accuracy. Experiments on five common public datasets show that the proposed method is better suited to Transformer-structured models than traditional compression methods.
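The separable structure mentioned in the abstract replaces a dense weight matrix W with a Kronecker product A ⊗ B, so a fully connected layer can be evaluated as two small matrix multiplications on a reshaped input instead of one large one. A minimal NumPy sketch of the general technique (the factor shapes and the function name are illustrative assumptions; the paper's exact formulation may differ):

```python
import numpy as np

def separable_linear(x, A, B):
    """Apply y = (A kron B) x without forming the Kronecker product.

    With row-major reshaping, (A kron B) x equals A @ X @ B.T flattened,
    where X = x.reshape(n1, n2). Parameters drop from m1*m2*n1*n2
    (dense W) to m1*n1 + m2*n2 (the two factors).
    """
    m1, n1 = A.shape
    m2, n2 = B.shape
    X = x.reshape(n1, n2)          # fold the input vector into a matrix
    Y = A @ X @ B.T                # two small matmuls instead of one huge one
    return Y.reshape(m1 * m2)      # unfold back into an output vector

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))    # 3x4 factor (illustrative sizes)
B = rng.standard_normal((5, 6))    # 5x6 factor
x = rng.standard_normal(4 * 6)     # input of length n1*n2 = 24

# The separable form matches the dense (A kron B) multiplication.
dense = np.kron(A, B) @ x
assert np.allclose(separable_linear(x, A, B), dense)
```

For these toy sizes the dense matrix would hold 15 × 24 = 360 entries, while A and B together hold only 12 + 30 = 42, which illustrates the parameter reduction the abstract claims.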
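The pretrained-weight conversion described in the abstract amounts to the nearest-Kronecker-product problem: find factors A, B minimizing ||W − A ⊗ B||_F. This is classically solved by Van Loan's rearrangement, which reduces it to a rank-1 SVD. A hedged sketch of that standard construction (the function name and shapes are assumptions, not the paper's code):

```python
import numpy as np

def nearest_kron(W, m1, n1, m2, n2):
    """Approximate W (m1*m2 x n1*n2) by A kron B via a rank-1 SVD.

    Van Loan's rearrangement places each flattened (m2 x n2) block of W
    as one row of R; the leading singular pair of R then gives the
    Frobenius-optimal Kronecker factors A and B.
    """
    # R[i1*n1 + j1, :] = flattened block (i1, j1) of W
    R = (W.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0, :].reshape(m2, n2)
    return A, B

# Sanity check: an exact Kronecker product is recovered exactly
# (A and B may both flip sign, which leaves A kron B unchanged).
rng = np.random.default_rng(1)
A0 = rng.standard_normal((3, 2))
B0 = rng.standard_normal((4, 5))
W = np.kron(A0, B0)                    # 12 x 10 matrix to decompose
A, B = nearest_kron(W, 3, 2, 4, 5)
assert np.allclose(np.kron(A, B), W)
```

Applied to pretrained dense weights, such a decomposition yields initial values for the separable factors; the abstract reports that this transfer slightly alleviates ViT-Base overfitting and improves accuracy.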

Key words: deep learning, computer vision, image classification, model compression