Computer and Modernization, 2023, Vol. 0, Issue (02): 34-39.

• Artificial Intelligence •

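  • About the authors: Qiao Shanbao (乔善宝, born 1996), male, from Tai'an, Shandong; master's student; research interest: computer vision; e-mail: 17864216432@163.com. Corresponding author: Gao Yongbin (高永彬, born 1988), male, associate professor, Ph.D.; research interests: computer vision, knowledge graphs; e-mail: gaoyongbin@sues.edu.cn. Huang Bo (黄勃), male, associate professor, Ph.D.; research interest: knowledge engineering; e-mail: huangbosues@sues.edu.cn. Yu Wenjun (余文俊), male, lecturer, Ph.D.; research interests: computer vision, knowledge graphs; e-mail: yuwenjun@sues.edu.cn.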
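  • Funding:
    National Natural Science Foundation of China (61802253, U2033218); Sci-Tech Innovation 2030 "New Generation Artificial Intelligence" Major Project of the Ministry of Science and Technology (2020AAA0109302, 2020AAA0109300); Shanghai Chenguang Program (17CG59)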

Monocular Depth and Pose Estimation Based on Conditional Convolution and Polarized Self-Attention

  1. (School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201600, China)
  • Online: 2023-04-10  Published: 2023-04-10

Abstract: This paper proposes a new monocular depth and pose estimation model that builds on the self-supervised paradigm of structure from motion and view synthesis and introduces conditional convolution and polarized self-attention. Conditional convolution assigns multiple groups of input-dependent (dynamic) convolution weights to each input; after weighted aggregation, all of these weights share a single convolution operation, which increases model capacity without significantly increasing computational cost. Because the completeness of image information strongly affects depth estimation performance, polarized self-attention uses polarized filtering to keep the data at high resolution in either the channel or the spatial dimension, preventing the loss of fine-grained or structural image information. At the same time, it compresses the dimension orthogonal to the preserved channel or spatial dimension to reduce computation, and uses nonlinear functions to enhance and dynamically remap the feature intensity range lost during compression. The self-attention mechanism models long-range dependencies along each dimension. Experiments on the KITTI dataset show that the proposed model achieves strong performance on self-supervised monocular depth and pose estimation.

Key words: monocular depth estimation; pose estimation; self-supervised learning; conditional convolution; polarized filtering; self-attention
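The abstract describes conditional convolution only at the level of "several groups of dynamic weights, one shared convolution". The sketch below assumes the CondConv formulation (Yang et al., 2019) rather than the authors' code: a routing function made of global average pooling, a fully connected layer and a sigmoid produces one weight per expert kernel, the experts are mixed into a single kernel, and one convolution is run with it. The helper names (`conv2d_same`, `route`, `cond_conv2d`), the three experts and all shapes are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive stride-1 2-D convolution (cross-correlation) with zero 'same' padding.
    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Returns (C_out, H, W)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Apply kernel tap (i, j) to the shifted window of every input channel.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def route(x, fc_w, fc_b):
    """Hypothetical routing function: GAP -> FC -> sigmoid, one weight per expert."""
    pooled = x.mean(axis=(1, 2))                          # (C_in,)
    return 1.0 / (1.0 + np.exp(-(fc_w @ pooled + fc_b)))  # (n_experts,)

def cond_conv2d(x, experts, fc_w, fc_b):
    """experts: (n_experts, C_out, C_in, k, k). Mix kernels per input, convolve once."""
    alpha = route(x, fc_w, fc_b)
    kernel = np.tensordot(alpha, experts, axes=1)         # weighted sum -> (C_out, C_in, k, k)
    return conv2d_same(x, kernel), alpha

rng = np.random.default_rng(0)
c_in, c_out, k, n_exp = 4, 6, 3, 3                        # assumed sizes
x = rng.standard_normal((c_in, 8, 8))
experts = rng.standard_normal((n_exp, c_out, c_in, k, k))
fc_w = rng.standard_normal((n_exp, c_in))
fc_b = np.zeros(n_exp)

y, alpha = cond_conv2d(x, experts, fc_w, fc_b)
# Reference: run every expert separately, then mix the outputs.
y_ref = sum(a * conv2d_same(x, e) for a, e in zip(alpha, experts))
print(y.shape, np.allclose(y, y_ref))                     # (6, 8, 8) True
```

Because convolution is linear in its kernel, mixing the kernels first and convolving once gives the same output as convolving with every expert and mixing the results; the last line checks this. Rough arithmetic on cost: mixing n experts takes n·C_out·C_in·k² multiply-adds per input, while the convolution itself takes H·W·C_out·C_in·k², so the overhead ratio is n/(H·W). With an assumed n = 8 on a 48×160 feature map, that is 8/7680, about 0.1%.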
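The two polarized self-attention (PSA) branches can be sketched as follows, assuming the formulation of Liu et al. (2021), with 1×1 convolutions written as matrix products on a single unbatched feature map. The channel-only branch keeps a high channel resolution (C/2 internally, C at the output) and collapses the spatial axis into a softmax weighting; the spatial-only branch keeps the full H×W grid and collapses the channels after global average pooling. The softmax followed by a sigmoid is one reading of the "nonlinear functions" that remap the intensity range lost during compression. The random weight matrices, the parameter-free LayerNorm, the parallel sum of the two branches and the names `psa_channel` and `psa_spatial` are illustrative assumptions; the paper may use a sequential layout or different sizes.

```python
import numpy as np

def softmax(v, axis=-1):
    e = np.exp(v - v.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def layer_norm(v, eps=1e-5):
    # Parameter-free LayerNorm over the channel vector (learnable affine omitted).
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def psa_channel(x, wq, wv, wz):
    """Channel-only branch. x: (C, H, W). wq: (1, C), wv: (C//2, C), wz: (C, C//2)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    q = softmax(wq @ flat, axis=-1)            # (1, HW): spatial axis squeezed to weights
    v = wv @ flat                              # (C/2, HW)
    ctx = v @ q.T                              # (C/2, 1)
    attn = sigmoid(layer_norm((wz @ ctx).ravel()))  # (C,) channel weights in (0, 1)
    return x * attn[:, None, None], attn

def psa_spatial(x, wq, wv):
    """Spatial-only branch. wq, wv: (C//2, C). Keeps the full H x W resolution."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)
    q = softmax((wq @ flat).mean(axis=1))      # GAP, then softmax over the C/2 channels
    v = wv @ flat                              # (C/2, HW)
    attn = sigmoid(q @ v).reshape(h, w)        # (H, W) spatial weights in (0, 1)
    return x * attn[None, :, :], attn

rng = np.random.default_rng(0)
C, H, W = 8, 6, 10                             # assumed sizes
x = rng.standard_normal((C, H, W))
z_ch, a_ch = psa_channel(x, rng.standard_normal((1, C)),
                         rng.standard_normal((C // 2, C)),
                         rng.standard_normal((C, C // 2)))
z_sp, a_sp = psa_spatial(x, rng.standard_normal((C // 2, C)),
                         rng.standard_normal((C // 2, C)))
out = z_ch + z_sp                              # assumed parallel layout: sum of both branches
print(out.shape, a_ch.shape, a_sp.shape)       # (8, 6, 10) (8,) (6, 10)
```

The largest intermediate here is the (C/2)×HW value matrix. A full non-local attention map over the same positions would be HW×HW: for an assumed 48×160 feature map with C = 512, that is 7680² ≈ 59 million entries versus 256 × 7680 ≈ 2 million, which illustrates the saving from compressing the orthogonal dimension.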
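In the self-supervised structure-from-motion setup, the depth and pose networks are trained by synthesizing the target frame from a neighbouring source frame. The usual warping relation, as used in SfMLearner and Monodepth2, is p_s ~ K (R · D_t(p_t) · K⁻¹ p_t + t). The sketch below computes the source-image coordinates for every target pixel; the intrinsics, the (R, t) pose parametrization, the constant depth and the function name `reproject` are assumptions, and the bilinear sampling that turns these coordinates into a synthesized image is omitted.

```python
import numpy as np

def reproject(depth, K, R, t):
    """Map every target pixel to source-image coordinates.
    depth: (H, W) target depth; K: (3, 3) intrinsics; R: (3, 3), t: (3,) pose
    target -> source. Returns (H, W, 2) pixel coordinates (u, v)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))              # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)       # back-project to 3-D
    cam_s = R @ cam + t.reshape(3, 1)                           # into the source camera frame
    proj = K @ cam_s                                            # project with the same K
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)              # divide by z, guarding z <= 0
    return uv.T.reshape(h, w, 2)

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 16.0], [0.0, 0.0, 1.0]])  # assumed intrinsics
depth = np.full((32, 64), 10.0)                                          # assumed depth map
same = reproject(depth, K, np.eye(3), np.zeros(3))
shifted = reproject(depth, K, np.eye(3), np.array([1.0, 0.0, 0.0]))
print(same[5, 7], shifted[5, 7])   # approximately [7, 5] and [17, 5]
```

With the identity pose every pixel maps to itself; a 1 m translation along x at a depth of 10 m shifts pixels by f·t_x/d = 100 × 1 / 10 = 10 px, which is what the second call shows.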