Multi-scale Depth Fusion Monocular Depth Estimation Based on Transposed Attention
(1. School of Physics and Optoelectronic Engineering, Guangdong University of Technology, Guangzhou 510006, China; 2. Guangdong Provincial Key Laboratory of Information Photonics Technology, Guangzhou 510006, China)