Computer and Modernization ›› 2025, Vol. 0 ›› Issue (03): 22-28. doi: 10.3969/j.issn.1006-2475.2025.03.004

• Artificial Intelligence •

Speech Enhancement Algorithm Based on Parallel Cascaded Time-frequency Conformer Generative Adversarial Network

  1. (School of Information and Communication Engineering, North University of China, Taiyuan 030051, Shanxi, China)
  • Online: 2025-03-28  Published: 2025-03-28
  • Supported by:
    Research Funding Program for Returned Overseas Scholars of Shanxi Province (2023-127); General Program of the Natural Science Foundation of Shanxi Province (202103021224201)

Abstract: Generative adversarial networks (GANs) improve their mapping capability through adversarial training, which gives them strong noise-reduction ability, and they are widely used in speech enhancement. Existing GAN-based speech enhancement methods, however, make insufficient use of the time-frequency correlation and global correlation in speech feature sequences, which limits their denoising performance. To address this, this paper proposes a parallel cascaded time-frequency Conformer generative adversarial network for single-channel speech enhancement. First, parallel cascaded time and frequency Conformers model the sequential features along the time and frequency axes of the speech spectrogram, extracting local and global features in the time and frequency domains for the generator to learn. Then, two Decoder paths learn the magnitude mask of the noisy speech and the spectrogram of the clean speech, respectively, and their outputs are fused to obtain the generated speech. Finally, a metric discriminator scores the generated speech on the relevant evaluation metrics, and adversarial training improves the quality of the speech produced by the generator. The method is validated on the public dataset VoiceBank+Demand.

Key words: speech enhancement, generative adversarial network, time-frequency Conformer, metric discriminator, adversarial training
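
Illustrative sketch (not from the paper): the abstract describes two Decoder paths whose outputs are fused, and a metric (indicator) discriminator trained adversarially against the generator. The abstract gives neither the fusion rule nor the loss functions, so the Python below rests on assumptions: the masked magnitude reuses the noisy phase and is added to the complex output of the second path, and the losses follow the common MetricGAN-style least-squares form with a metric score normalized to [0, 1]. All function and parameter names are hypothetical.

```python
import cmath


def fuse_decoder_outputs(noisy_spec, mask, complex_out):
    """Fuse a magnitude-mask path with a complex-spectrogram path.

    noisy_spec:  2-D list [frame][bin] of complex STFT values of the noisy input.
    mask:        2-D list of non-negative floats (assumed output of the mask Decoder).
    complex_out: 2-D list of complex values (assumed output of the spectrogram Decoder).
    Additive fusion is an assumption, not a detail stated in the abstract.
    """
    fused = []
    for y_row, m_row, c_row in zip(noisy_spec, mask, complex_out):
        row = []
        for y, m, c in zip(y_row, m_row, c_row):
            # Scale the noisy magnitude by the mask and keep the noisy phase.
            masked = cmath.rect(m * abs(y), cmath.phase(y))
            # Add the complex estimate from the second path.
            row.append(masked + c)
        fused.append(row)
    return fused


def discriminator_loss(d_clean, d_enhanced, metric_score):
    """Least-squares loss for a metric discriminator (assumed form).

    d_clean:      discriminator output for the (clean, clean) pair, target 1.
    d_enhanced:   discriminator output for the (enhanced, clean) pair.
    metric_score: evaluation metric of the enhanced speech, normalized to [0, 1].
    """
    return (d_clean - 1.0) ** 2 + (d_enhanced - metric_score) ** 2


def generator_adversarial_loss(d_enhanced):
    """Push the predicted metric score of the enhanced speech toward 1."""
    return (d_enhanced - 1.0) ** 2
```

Under these assumptions, a call such as `fuse_decoder_outputs([[3 + 4j]], [[0.5]], [[0j]])` halves the magnitude of the single bin while preserving its phase, and the generator loss reaches zero only when the discriminator predicts the maximum normalized score for the enhanced speech.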
