Computer and Modernization ›› 2024, Vol. 0 ›› Issue (11): 106-112. doi: 10.3969/j.issn.1006-2475.2024.11.016

• Image Processing •


Multi-view 3D Reconstruction Based on Improved Self-attention Mechanism

  1. (College of Electronics and Electrical Engineering, Ningxia University, Yinchuan 750021, China)
  • Online: 2024-11-29  Published: 2024-12-10
  • Supported by: Natural Science Foundation of Ningxia Hui Autonomous Region (2021AAC03113)


Abstract: To address the problems that current multi-view 3D reconstruction methods cannot adapt to high-resolution scenes, suffer from poor completeness, and ignore global background information, this paper proposes MVFSAM-CasMVSNet, a 3D reconstruction network that fuses deformable convolution with an improved self-attention mechanism. Firstly, a deformable convolution module dedicated to the multi-view stereo reconstruction task is designed to adaptively adjust the sampling range of feature extraction and strengthen feature extraction at depth discontinuities. Secondly, considering the correlation of depth information and the feature interaction among multiple views, a multi-view fusion self-attention module is designed: a linear self-attention with low computational complexity aggregates long-range context information within each view, and an improved multi-head attention captures the depth dependencies between the reference view and the source views. Finally, a multi-stage strategy constructs and regularizes the matching cost volume from coarse to fine, and the depth map is generated from the higher-resolution cost volume. Test results on the DTU dataset show that, compared with the baseline model, MVFSAM-CasMVSNet improves completeness, accuracy, and the overall metric by 15.6%, 7.4%, and 11.8% respectively, and achieves the best overall metric among existing models. Meanwhile, experimental results on the Tanks and Temples dataset show an average F-score improvement of 6.5% over the baseline model. The network achieves excellent reconstruction quality and generalization ability for high-resolution scenes in multi-view 3D reconstruction.
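To make the first stage concrete, the sketch below shows a minimal deformable-convolution feature block of the kind the abstract describes, assuming PyTorch and torchvision's DeformConv2d. The block structure and channel sizes are illustrative assumptions, not the paper's released implementation: a small offset branch predicts per-pixel sampling offsets so the receptive field can adapt to depth discontinuities.

```python
# A minimal sketch (not the authors' code) of a deformable-convolution
# feature block for MVS. Assumes PyTorch + torchvision; sizes are placeholders.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFeatureBlock(nn.Module):
    """Predicts per-pixel sampling offsets, then applies a deformable conv,
    letting the sampling grid follow depth discontinuities."""
    def __init__(self, in_ch: int = 32, out_ch: int = 32, k: int = 3):
        super().__init__()
        # One (dy, dx) offset pair per kernel tap: 2 * k * k channels.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset(x)                 # (B, 2*k*k, H, W)
        return self.act(self.deform(x, offsets))

if __name__ == "__main__":
    feat = torch.randn(2, 32, 64, 80)            # (B, C, H, W) feature map
    print(DeformFeatureBlock()(feat).shape)      # torch.Size([2, 32, 64, 80])
```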
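The second stage combines two forms of attention. The sketch below, again a hedged PyTorch approximation rather than the paper's exact formulation, pairs an O(N) linear self-attention in the "efficient attention" style (softmax over query channels and over key positions, so the N×N attention matrix is never formed) for intra-view long-range context, with PyTorch's stock nn.MultiheadAttention standing in for the paper's improved multi-head cross-view attention.

```python
# Sketch of the two attention steps: linear self-attention within a view,
# then cross-view attention from the reference view to a source view.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSelfAttention(nn.Module):
    """O(N) attention: softmax(Q over channels) @ (softmax(K over positions)^T @ V)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens from a flattened H*W feature map
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q = F.softmax(q, dim=-1)                     # normalize over channels
        k = F.softmax(k, dim=1)                      # normalize over positions
        ctx = torch.einsum("bnc,bnd->bcd", k, v)     # (B, C, C) global context
        return torch.einsum("bnc,bcd->bnd", q, ctx)  # never builds an N x N map

# Cross-view step: reference-view tokens query source-view tokens,
# modeling inter-view depth dependencies.
intra = LinearSelfAttention(dim=32)
cross = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

ref = intra(torch.randn(2, 64 * 80, 32))   # reference view, (B, N, C)
src = intra(torch.randn(2, 64 * 80, 32))   # one source view
fused, _ = cross(query=ref, key=src, value=src)
print(fused.shape)                          # torch.Size([2, 5120, 32])
```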
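Finally, the coarse-to-fine strategy can be summarized by how each stage re-samples depth hypotheses. The runnable sketch below abstracts away feature warping, cost-volume construction, and 3D-CNN regularization (replaced by a random stand-in) and only illustrates the cascade narrowing typical of CasMVSNet-style pipelines; all depth ranges and plane counts are illustrative values, not the paper's settings.

```python
# Schematic cascade: each stage samples fewer depth planes over a tighter
# range centered on the previous stage's depth estimate.
import torch

def depth_hypotheses(center: torch.Tensor, half_range: float, n: int):
    """Sample n depth planes uniformly in [center - half_range, center + half_range]."""
    steps = torch.linspace(-1.0, 1.0, n, device=center.device)        # (n,)
    return center.unsqueeze(1) + half_range * steps.view(1, n, 1, 1)  # (B, n, H, W)

B, H, W = 1, 32, 40
depth = torch.full((B, H, W), 680.0)         # coarse estimate (illustrative, ~DTU midpoint)
for n, half in [(32, 128.0), (8, 16.0)]:     # fewer planes, tighter range per stage
    hyps = depth_hypotheses(depth, half, n)  # (B, n, H, W) candidate depth planes
    # ... build cost volume from warped features, regularize with a 3D CNN ...
    probs = torch.softmax(torch.randn(B, n, H, W), dim=1)  # stand-in probabilities
    depth = (probs * hyps).sum(dim=1)        # expected (soft-argmin) depth per pixel
    # (upsampling to the next stage's higher resolution omitted for brevity)
```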

Key words: 3D reconstruction, deep learning, multi-view stereo, self-attention mechanism
