Computer and Modernization ›› 2025, Vol. 0 ›› Issue (04): 1-5. doi: 10.3969/j.issn.1006-2475.2025.04.001

• Artificial Intelligence •

Gaze Estimation Model Based on Hybrid Transformer

  1. (School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China)
  • Online: 2025-04-30  Published: 2025-04-30
  • Supported by:
    Science Research Project of Chongqing Science and Technology Commission (cstc2021jcyj-msxm2791); Science and Technology Project of Chongqing Municipal Education Commission (KJZD-K202200513)

Gaze Estimation Model Based on Hybrid Transformer 

  1. (School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China)
  • Online: 2025-04-30  Published: 2025-04-30

Abstract (translated): This paper combines a CNN with a Transformer, exploiting the Transformer's ability to capture global feature information to strengthen the model's perception of contextual information and thereby improve accuracy. A novel gaze estimation model based on a hybrid Transformer, ResNet-MHSA (RN-SA), is proposed: some of the 3×3 spatial convolution layers in ResNet18 are replaced with a block composed of a 1×1 spatial convolution layer and an MHSA (Multi-Head Self-Attention) layer, and a DropBlock mechanism is added to the model structure to increase robustness. Experimental results show that RN-SA improves accuracy while reducing the parameter count: compared with the strong baseline GazeTR-Hybrid, it uses 15.8% fewer parameters and improves accuracy by 4.1% and 3.7% on the EyeDiap and Gaze360 datasets, respectively. The combination of CNN and Transformer can therefore be applied effectively to gaze estimation tasks.

Keywords: gaze estimation, self-attention, MHSA, Transformer

Abstract: By combining a CNN with a Transformer, the model benefits from the Transformer's ability to capture global feature information, which improves its awareness of contextual information and, in turn, its accuracy. A novel gaze estimation model based on a hybrid Transformer, RN-SA (ResNet-MHSA), is proposed. In this model, some of the 3×3 spatial convolution layers in ResNet18 are replaced with a block composed of a 1×1 spatial convolution layer and an MHSA (Multi-Head Self-Attention) layer, and a DropBlock mechanism is added to the model structure to increase its robustness. Experimental results show that RN-SA improves accuracy while reducing the number of parameters: compared with the current strong model GazeTR-Hybrid, it improves accuracy by 4.1% and 3.7% on the EyeDiap and Gaze360 datasets, respectively, while using 15.8% fewer parameters. Therefore, the combination of CNN and Transformer can be effectively applied to gaze estimation tasks.
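The abstract describes the core architectural change but gives no code. The following is a minimal NumPy sketch, not the authors' implementation, of what such a replacement block could look like: a 1×1 spatial convolution (a per-pixel channel projection) followed by multi-head self-attention over the flattened spatial positions, standing in for a 3×3 convolution. All function names, the head count, and the random weight initialization are illustrative assumptions; a real model would use learned weights and the DropBlock regularizer mentioned in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, w):
    """1x1 spatial convolution: a per-pixel linear map over channels.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def mhsa(x, num_heads, rng):
    """Multi-Head Self-Attention over the H*W spatial positions.
    x: (H, W, C) with C divisible by num_heads -> (H, W, C)."""
    H, W, C = x.shape
    d = C // num_heads
    tokens = x.reshape(H * W, C)  # flatten the feature map into a token sequence
    wq, wk, wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv

    def split_heads(t):  # (tokens, C) -> (heads, tokens, d)
        return t.reshape(H * W, num_heads, d).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(H, W, C)  # merge heads back
    return out

def rn_sa_block(x, num_heads=4, rng=None):
    """Hypothetical RN-SA block: 1x1 conv followed by MHSA,
    replacing a 3x3 spatial convolution in a ResNet18 stage."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = x.shape
    w_proj = rng.standard_normal((C, C)) / np.sqrt(C)
    return mhsa(conv1x1(x, w_proj), num_heads, rng)

x = np.random.default_rng(1).standard_normal((7, 7, 64))  # a 7x7, 64-channel feature map
y = rn_sa_block(x)
print(y.shape)
```

Because self-attention attends across all H×W positions at once, the block sees global context that a 3×3 convolution's local receptive field cannot, while the 1×1 projection keeps the parameter count low, consistent with the reduction the abstract reports.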

Key words: gaze estimation, self-attention, MHSA, Transformer

CLC number: