Computer and Modernization ›› 2024, Vol. 0 ›› Issue (09): 121-126. doi: 10.3969/j.issn.1006-2475.2024.09.020

• Image Processing •

Multi-scale Depth Fusion Monocular Depth Estimation Based on Transposed Attention

  1. School of Physics and Optoelectronic Engineering, Guangdong University of Technology, Guangzhou 510006, China;
  2. Guangdong Provincial Key Laboratory of Information Photonics Technology, Guangzhou 510006, China
  • Online: 2024-09-27  Published: 2024-09-29
  • Supported by: National Natural Science Foundation of China (62006046)




Abstract: Monocular depth estimation is a fundamental task in computer vision whose goal is to predict a depth map from a single image and recover the depth at every pixel position. This paper proposes a new network architecture for monocular depth estimation that further improves prediction accuracy. The transposed attention mechanism introduces self-attention while reducing the number of parameters and the amount of computation, allowing the network to focus on specific regions of the image and to combine information across channels; as a result, it effectively attends to and learns from small regions and edge details. The paper also presents an improved transposed attention mechanism that preserves semantic information with even fewer parameters. Multi-scale depth fusion exploits the fact that different channels extract features at different depths and computes the average depth of each channel, strengthening the model's depth perception. In addition, it models long-range dependencies over vertical distances, which effectively separates the edges between objects and helps reduce the loss of fine-grained information. Finally, experiments on the NYU Depth V2 and KITTI datasets validate the effectiveness of the proposed modules and show excellent performance.
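The abstract does not include implementation details, so the PyTorch sketch below is only an illustration of the two ideas it describes: transposed attention computes self-attention across channels (a C x C attention map) rather than across spatial positions (an HW x HW map), which is why it reduces parameters and computation, and a per-channel "average depth" can be obtained by pooling each channel to a single value. The names TransposedAttention and channel_average_depth, the num_heads and temperature parameters, and the pooling strategy are assumptions for illustration, not the authors' published implementation.

```python
# Illustrative sketch only -- not the paper's exact implementation.
# TransposedAttention follows the general idea of channel-wise self-attention;
# channel_average_depth() is a hypothetical per-channel pooling helper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransposedAttention(nn.Module):
    """Channel-wise self-attention: the attention map is C x C, which is far
    cheaper than the (HW) x (HW) map of standard spatial self-attention."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable scaling of the attention logits, one scalar per head.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        # Flatten the spatial dimensions: (b, heads, c // heads, h * w).
        q = q.reshape(b, self.num_heads, c // self.num_heads, h * w)
        k = k.reshape(b, self.num_heads, c // self.num_heads, h * w)
        v = v.reshape(b, self.num_heads, c // self.num_heads, h * w)
        # Normalize along the spatial axis, then build a channel-by-channel map.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (b, heads, c', c')
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out) + x  # residual connection


def channel_average_depth(features: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: global average pooling yields one scalar per
    channel, which can serve as that channel's average-depth statistic."""
    return features.mean(dim=(2, 3), keepdim=True)  # (b, c, 1, 1)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 60, 80)           # a mid-resolution feature map
    attn = TransposedAttention(channels=64)
    print(attn(feat).shape)                     # torch.Size([2, 64, 60, 80])
    print(channel_average_depth(feat).shape)    # torch.Size([2, 64, 1, 1])
```

Because the attention map grows with the number of channels rather than with image resolution, this style of attention stays affordable on the high-resolution feature maps used in dense prediction tasks such as depth estimation.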

Key words: deep learning, monocular depth estimation, transposed attention, multi-scale depth fusion, channel average depth

CLC number: