Computer and Modernization ›› 2025, Vol. 0 ›› Issue (06): 28-33. doi: 10.3969/j.issn.1006-2475.2025.06.005


Multi-Channel Speech Separation Method with Self-guided Transformer

  

  1. (Volkswagen-Mobvoi (Beijing) Information Technology Co., Ltd., Beijing 100098, China)
  • Online: 2025-06-30 Published: 2025-07-01

Abstract: Multi-channel speech separation aims to separate the speech signals of individual speakers from a multi-channel mixture. Existing methods, however, struggle to model long-range dependencies between multi-channel feature points. To address this, this study proposes a multi-channel speech separation method based on a self-guided Transformer (SG-former), which builds an adaptive, fine-grained global attention mechanism. The core of SG-former is the reassignment of tokens according to saliency maps: salient regions are processed at a fine granularity to extract key information, while less important regions are processed at a coarse granularity to reduce computational cost. The saliency maps are generated by a hybrid-scale self-attention mechanism, which can accurately capture the long-range dependencies between multi-channel feature points. To verify its effectiveness, the method was evaluated on a spatialized WSJ0-2MIX database. In terms of signal-to-distortion ratio improvement (SDRi), SG-former shows a significant advantage over the baseline Beam Guided TasNet method, achieving a 20.34 dB improvement. The results indicate that SG-former outperforms existing techniques on multi-channel speech separation, particularly in modeling long-range dependencies, and offers new ideas and methods for research in this field.
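The token reassignment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `reassign_tokens` and `attention`, the parameters `keep_ratio` and `pool`, and the choice of average pooling for the coarse-grained regions are all assumptions made for illustration, and the saliency scores are taken as given rather than produced by hybrid-scale self-attention.

```python
import numpy as np

def reassign_tokens(tokens, saliency, keep_ratio=0.25, pool=4):
    """Keep the most salient tokens as-is; average-pool the rest in groups.

    tokens:   (N, D) array of feature tokens
    saliency: (N,) array, higher means more important (assumed given)
    Returns an (M, D) array with M <= N, usable as attention keys/values.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * keep_ratio)))
    order = np.argsort(-saliency, kind="stable")   # most salient first
    fine = tokens[order[:n_keep]]                  # fine-grained: one token each
    rest = tokens[np.sort(order[n_keep:])]         # remaining tokens, original order
    # Coarse-grained: merge every `pool` neighbouring low-saliency tokens.
    coarse = [rest[i:i + pool].mean(axis=0) for i in range(0, len(rest), pool)]
    if coarse:
        return np.concatenate([fine, np.stack(coarse)], axis=0)
    return fine

def attention(queries, keys_values):
    """Plain scaled dot-product attention with keys equal to values."""
    d = queries.shape[1]
    scores = queries @ keys_values.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys_values

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 8))               # 16 tokens, 8 features
    s = np.arange(16, dtype=float)                 # toy saliency scores
    kv = reassign_tokens(x, s)
    print(kv.shape, attention(x, kv).shape)
```

In this sketch every query still attends over the whole sequence, but the key/value set shrinks because low-saliency tokens are merged, which is where the computational saving would come from.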

Key words: speech separation, self-guided attention, hybrid-scale attention, multi-channel processing, beamformer
