Multi-scale Depth Fusion Monocular Depth Estimation Based on Transposed Attention

doi:10.3969/j.issn.1006-2475.2024.09.020

Abstract

Monocular depth estimation is a fundamental task in computer vision that aims to predict a depth map from a single image, recovering depth information at every pixel position. This paper proposes a new network architecture for monocular depth estimation to further improve prediction accuracy. Transposed attention introduces self-attention while reducing the parameter count and computational cost; it attends to specific regions of the image and combines information across channels, which lets it capture and learn from small regions and edge details. The paper also presents an improved transposed attention mechanism that retains semantic information with fewer parameters. Multi-scale depth fusion exploits the fact that different channels extract features at different depths: it computes the average depth of each channel to strengthen the model's depth perception. In addition, it models long-range dependencies in vertical distance, which helps separate the edges between objects and reduces the loss of fine-grained information. Finally, experiments on the NYU Depth V2 and KITTI datasets validate the effectiveness of the proposed modules and show strong performance on both.

Key words: deep learning, monocular depth estimation, transposed attention, multi-scale depth fusion, channel average depth
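The abstract does not give the exact formulation of the proposed module, so the following is only a minimal sketch of generic transposed (channel-wise) self-attention, assuming a feature map flattened to C channels by N = H×W pixels. The function names (`transposed_attention`, `attention_map_entries`) and the `scale` parameter are hypothetical and are not taken from the paper. The sketch illustrates why attending across channels can be cheaper: the attention map is C×C rather than N×N.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transposed_attention(q, k, v, scale=1.0):
    """Channel-wise attention sketch (an assumption, not the paper's module).

    q, k, v are C x N matrices: C channels, N flattened pixels.
    The attention map is C x C, so each output channel is a weighted
    mix of the value channels.
    """
    scores = matmul(q, transpose(k))                       # C x C
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, v), attn                           # C x N, C x C


def attention_map_entries(channels, height, width):
    """Number of entries in the attention map for each formulation."""
    n = height * width
    return {"spatial": n * n, "transposed": channels * channels}


# Toy feature map with C = 3 channels and N = 4 pixels.
feat = [[0.1, 0.2, 0.3, 0.4],
        [0.4, 0.3, 0.2, 0.1],
        [1.0, 0.0, 1.0, 0.0]]
out, attn = transposed_attention(feat, feat, feat)
print(len(attn), len(attn[0]))   # 3 3: the map is channels x channels
print(attention_map_entries(64, 120, 160))
```

In a real network, q, k and v would come from learned projections of the input features, and the scaling factor is often learned; both are omitted here to keep the sketch short.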

CLC number: TP391.41

CHENG Yazi, LEI Liang, CHEN Han, ZHAO Yiran. Multi-scale Depth Fusion Monocular Depth Estimation Based on Transposed Attention[J]. Computer and Modernization, 2024(9): 121-126.

