Monocular Depth and Pose Estimation Based on Conditional Convolution and Polarized Self-attention

Abstract

Building on the self-supervised paradigm of structure from motion and view synthesis, this paper proposes a monocular depth and pose estimation model that introduces conditional convolution and polarized self-attention. Conditional convolution assigns multiple groups of dynamic convolution weights according to the input data; the groups are combined by weighted summation and then share a single convolution operation, which increases model capacity without significantly increasing the computational cost. Because the integrity of image information strongly affects depth estimation performance, polarized self-attention uses polarized filtering to keep the data at high resolution along either the channel or the spatial dimension, preventing the loss of fine-grained or structural information. At the same time, it compresses the dimension orthogonal to the channel or spatial dimension to reduce computation, and applies nonlinear functions to enhance and dynamically map the range of feature intensities lost during compression. The self-attention mechanism further enables long-range modeling of the data along each dimension. Experiments on the KITTI dataset show that the proposed model performs well on self-supervised monocular depth and pose estimation.

Key words: monocular depth estimation; pose estimation; self-supervised learning; conditional convolution; polarized filtering; self-attention
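
The weight-sharing idea behind conditional convolution described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses 1-D signals in plain Python, and the routing (one sigmoid gate per kernel group, computed from the global average of the input) is an assumption made here for illustration, since the abstract does not specify it. All function and parameter names (`cond_conv1d`, `routing_weights`, `routing_params`, and so on) are hypothetical. The sketch shows that mixing the kernel groups first and convolving once gives the same result as convolving with every group separately and summing the gated outputs, which is why capacity can grow without a matching growth in convolution cost.

```python
import math


def conv1d(signal, kernel):
    """Valid (no padding) 1-D cross-correlation of two lists of floats."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def routing_weights(signal, routing_params):
    """Input-dependent gates, one per kernel group (assumed routing scheme).

    routing_params is a hypothetical list of (scale, bias) pairs; each gate is
    a sigmoid of an affine function of the global average of the input.
    """
    mean = sum(signal) / len(signal)
    return [1.0 / (1.0 + math.exp(-(a * mean + b))) for a, b in routing_params]


def cond_conv1d(signal, expert_kernels, routing_params):
    """Mix the kernel groups with the gates, then run a single convolution."""
    gates = routing_weights(signal, routing_params)
    k = len(expert_kernels[0])
    mixed = [
        sum(g * kern[j] for g, kern in zip(gates, expert_kernels))
        for j in range(k)
    ]
    return conv1d(signal, mixed)


def mixture_of_experts_conv1d(signal, expert_kernels, routing_params):
    """Reference path: one convolution per kernel group, then a gated sum."""
    gates = routing_weights(signal, routing_params)
    outputs = [conv1d(signal, kern) for kern in expert_kernels]
    return [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(outputs[0]))
    ]


if __name__ == "__main__":
    signal = [0.5, 1.0, -2.0, 3.0, 0.0, 1.5]
    experts = [[1.0, 0.0, -1.0], [0.25, 0.5, 0.25], [0.0, 1.0, 0.0]]
    params = [(1.0, 0.0), (-0.5, 0.2), (2.0, -1.0)]
    print(cond_conv1d(signal, experts, params))
    print(mixture_of_experts_conv1d(signal, experts, params))
```

The two printed lists should agree up to floating-point rounding, because convolution is linear in the kernel. The single-convolution path performs one convolution regardless of how many kernel groups are used, while the reference path performs one per group.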

QIAO Shan-bao, GAO Yong-bin, HUANG Bo, YU Wen-jun. Monocular Depth and Pose Estimation Based on Conditionally Convolution and Polarized Self-attention[J]. Computer and Modernization, 2023, 0(02): 34-39.

