Computer and Modernization ›› 2022, Vol. 0 ›› Issue (10): 75-81.

• Image Processing •

  • About the authors: HUANG Yanhui (1995—), male, from Ningde, Fujian, master's degree candidate, research interests: model compression, model robustness, E-mail: 736768695@qq.com; corresponding author: LAN Hai (1988—), male, from Putian, Fujian, assistant researcher, master's degree, research interests: computer vision, pattern recognition, E-mail: lanhai@fjirsm.ac.com; WEI Xian (1986—), male, from Qinyang, Henan, researcher, Ph.D., research interests: computer vision, pattern recognition, E-mail: xian.wei@fjirsm.ac.com.
  • Funding:
    Fujian Science and Technology Innovation Laboratory for Optoelectronic Information of China (Mindu Innovation Laboratory) Project (2021ZZ120); Fujian Provincial Science and Technology Plan Projects (2021T3003, 2021T3068); Quanzhou Science and Technology Plan Project (2021C065L); Putian Science and Technology Plan Project (2020HJSTS006)

Lightweight Vision Transformer Based on Separable Structured Transformations

  (1. College of Electrical Engineering and Automation, Fuzhou University, Fuzhou 350100, China; 
    2. Quanzhou Institute of Equipment Manufacturing, Fujian Institute of Research on the Structure of Matter, 
    Chinese Academy of Sciences, Quanzhou 362200, China)
  • Online: 2022-10-20  Published: 2022-10-21


Abstract: Because the Vision Transformer model has a large number of parameters and a high floating-point operation count, it is difficult to deploy on portable or terminal devices. Since the attention matrix suffers from a low-rank bottleneck, model compression algorithms and attention-acceleration algorithms cannot satisfactorily balance model size, inference speed and model performance. To solve these problems, a lightweight ViT-SST model is designed for image classification. Firstly, by transforming the traditional fully connected layers into a separable structure, the number of model parameters is greatly reduced and inference speed is improved, while guaranteeing that the attention matrix does not lose expressive power by becoming low-rank. Secondly, this paper proposes a Kronecker product approximate decomposition method based on SVD, which converts the publicly available pre-trained parameters of the ViT-Base model to the ViT-Base-SST model; this slightly alleviates the overfitting of ViT-Base and improves accuracy. Experiments on five common public datasets (the CIFAR and Caltech series) show that the proposed method is better suited to Transformer-structured models than traditional compression methods.

Key words: deep learning, computer vision, image classification, model compression
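The two operations the abstract describes, a separable (Kronecker-structured) linear layer and an SVD-based conversion of a dense pre-trained weight matrix into that form, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: `nearest_kronecker` and `separable_matvec` are hypothetical names, and the rank-1 nearest-Kronecker-product construction shown here is the standard Van Loan-Pitsianis rearrangement, which matches the abstract's description of "SVD-based Kronecker product approximate decomposition" only in spirit.

```python
import numpy as np

def nearest_kronecker(W, m1, n1, m2, n2):
    """Best Frobenius-norm approximation W ~ A (x) B, with A: m1 x n1 and
    B: m2 x n2, via rearrangement of W followed by a rank-1 SVD."""
    # Rearrange W so that ||W - A (x) B||_F = ||R - vec(A) vec(B)^T||_F:
    # each row of R is the flattened (m2 x n2) block of W indexed by (i, j).
    R = (W.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)
          .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B

def separable_matvec(A, B, x):
    """Apply (A (x) B) to x without materialising the Kronecker product,
    storing m1*n1 + m2*n2 parameters instead of m1*m2 * n1*n2."""
    m1, n1 = A.shape
    m2, n2 = B.shape
    return (A @ x.reshape(n1, n2) @ B.T).reshape(m1 * m2)
```

For instance, a 768 x 768 fully connected weight (the hidden size of ViT-Base) factored as a 32 x 32 matrix Kronecker a 24 x 24 matrix stores about 1.6 K parameters in place of roughly 590 K, which is the source of the parameter and speed savings the abstract claims.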