Computer and Modernization ›› 2025, Vol. 0 ›› Issue (06): 28-33. doi: 10.3969/j.issn.1006-2475.2025.06.005

• Artificial Intelligence •

Multi-Channel Speech Separation Method with Self-guided Transformer

  1. (Volkswagen-Mobvoi (Beijing) Information Technology Co., Ltd., Beijing 100098, China)
  • Online: 2025-06-30  Published: 2025-07-01
  • About the authors: TAN Yingwei (born 1981), male, from Chongqing, Ph.D. candidate; research interests: intelligent speech processing and natural language processing; E-mail: tanyingwei1981@sina.com. DING Xuefeng (born 1985), male, from Changchun, Jilin, technical manager with a bachelor's degree; research interest: intelligent cockpits; E-mail: xuefeng.ding@live.cn

Abstract: In the field of speech processing, multi-channel speech separation aims to separate the speech signals of different speakers from multi-channel mixtures. However, existing methods fall short in modeling long-range dependencies between multi-channel feature points. To address this problem, this paper proposes a novel multi-channel speech separation method based on a self-guided Transformer (SG-former), which constructs an adaptive, fine-grained global attention mechanism. The core of SG-former is the reallocation of tokens according to a saliency map: salient regions extract key information at a fine granularity, while less important regions are processed at a coarse granularity to reduce computational cost. The saliency map is generated by a hybrid-scale self-attention mechanism that accurately captures the long-range dependencies between multi-channel feature points. To verify the effectiveness of the proposed method, experiments were conducted on the spatialized WSJ0-2MIX database. The results show that SG-former achieves a signal-to-distortion ratio improvement (SDRi) of 20.34 dB, a clear advantage over the Beam-Guided TasNet baseline, demonstrating its superiority for multi-channel speech separation, particularly in modeling long-range dependencies, and providing new ideas and methods for research in this field.
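To make the mechanism described above more concrete, the Python sketch below illustrates the general idea of saliency-guided token reallocation driven by hybrid-scale self-attention. It is a minimal illustration, not the authors' SG-former implementation: the particular saliency definition (average attention received per token), the average-pooling coarse scale, the keep_ratio parameter, and all function names are assumptions made solely for this example.

import torch
import torch.nn.functional as F


def hybrid_scale_saliency(x: torch.Tensor, pool: int = 4) -> torch.Tensor:
    """Estimate per-token saliency by mixing fine- and coarse-scale attention.

    x: (batch, tokens, dim) feature sequence; assumes tokens divisible by `pool`.
    Returns a (batch, tokens) saliency map scaled so its maximum is 1.
    """
    b, n, d = x.shape
    scale = d ** -0.5

    # Fine scale: ordinary token-to-token self-attention weights.
    fine_attn = F.softmax(x @ x.transpose(1, 2) * scale, dim=-1)          # (b, n, n)

    # Coarse scale: attend to average-pooled tokens (cheaper, longer-range context).
    coarse = F.avg_pool1d(x.transpose(1, 2), kernel_size=pool).transpose(1, 2)
    coarse_attn = F.softmax(x @ coarse.transpose(1, 2) * scale, dim=-1)   # (b, n, n/pool)

    # Saliency of a token = attention it receives, averaged over queries and scales
    # (one plausible choice; the paper's exact definition may differ).
    received_fine = fine_attn.mean(dim=1)                                 # (b, n)
    received_coarse = coarse_attn.mean(dim=1).repeat_interleave(pool, dim=1)
    saliency = 0.5 * (received_fine + received_coarse)
    return saliency / saliency.amax(dim=1, keepdim=True)


def reallocate_tokens(x: torch.Tensor, saliency: torch.Tensor,
                      keep_ratio: float = 0.5, pool: int = 4) -> torch.Tensor:
    """Keep the most salient tokens at full resolution and pool the rest coarsely."""
    b, n, d = x.shape
    k = max(1, int(n * keep_ratio))
    top_idx = saliency.topk(k, dim=1).indices                             # (b, k)
    fine_tokens = torch.gather(x, 1, top_idx.unsqueeze(-1).expand(b, k, d))

    # Coarse path: the remaining information summarised by average pooling.
    coarse_tokens = F.avg_pool1d(x.transpose(1, 2), kernel_size=pool).transpose(1, 2)
    return torch.cat([fine_tokens, coarse_tokens], dim=1)                 # shorter sequence


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32)        # (batch, tokens, feature dim), random stand-in
    sal = hybrid_scale_saliency(feats)    # (2, 64)
    out = reallocate_tokens(feats, sal)   # (2, 32 + 16, 32)
    print(sal.shape, out.shape)

In the full model, the reallocated token sequence would be fed to subsequent Transformer layers of the separation network; the snippet only shows the routing step that the abstract attributes to the saliency map.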

Key words: speech separation, self-guided attention, hybrid-scale attention, multi-channel processing, beamformer

CLC number:
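As a brief note on the evaluation metric: the abstract reports results in terms of signal-to-distortion ratio improvement (SDRi). The NumPy sketch below shows the conventional definition of this quantity (SDR of the separated estimate minus SDR of the unprocessed mixture, both measured against the clean reference). It is included only to clarify the metric; it is not taken from the paper, and published numbers are normally computed with a standard BSS Eval style toolkit rather than this simplified form.

import numpy as np


def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Signal-to-distortion ratio in dB for time-aligned 1-D signals."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))


def sdr_improvement(reference: np.ndarray, estimate: np.ndarray,
                    mixture: np.ndarray) -> float:
    """SDRi: gain in dB of the separated estimate over the original mixture."""
    return sdr(reference, estimate) - sdr(reference, mixture)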