Speech Enhancement Based on Time-frequency Self-attention Residual Temporal Convolutional Networks

doi:10.3969/j.issn.1006-2475.2024.09.004

Abstract

Speech enhancement (SE) aims to remove noise and other irrelevant signals from speech. It serves as the front end of many speech processing tasks and plays an important role in applications such as video conferencing and live video streaming. However, most current SE research focuses on modeling the long-term contextual dependencies of speech frames and does not consider how speech energy is distributed in the time-frequency domain. This paper proposes a time-frequency self-attention module that explicitly introduces prior knowledge of the speech energy distribution into the modeling process. The module is combined with a residual temporal convolutional network to form a residual temporal convolutional network model based on time-frequency self-attention. To verify the effectiveness of the model, experiments are conducted with two training targets commonly used in SE, the ideal ratio mask (IRM) and the phase-sensitive mask (PSM). The results show that the model significantly improves four commonly used objective evaluation metrics for SE and clearly outperforms the other baseline models.

Key words: speech enhancement; time-frequency domain; self-attention mechanism; temporal convolutional network
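
The abstract names IRM and PSM as training targets without defining them. As a minimal sketch, assuming the standard textbook definitions (which may differ in detail from the paper's own formulation, for example in the exponent or the clipping range), the two masks can be computed for a single time-frequency bin as follows. The function names and the `beta` parameter are illustrative and do not come from the paper.

```python
def ideal_ratio_mask(speech_bin: complex, noise_bin: complex, beta: float = 0.5) -> float:
    """IRM for one time-frequency bin: (|S|^2 / (|S|^2 + |N|^2)) ** beta.

    beta = 0.5 is a common choice; it is an assumption here.
    """
    s_pow = abs(speech_bin) ** 2
    n_pow = abs(noise_bin) ** 2
    if s_pow + n_pow == 0.0:
        return 0.0  # silent bin: nothing to keep
    return (s_pow / (s_pow + n_pow)) ** beta


def phase_sensitive_mask(speech_bin: complex, noisy_bin: complex) -> float:
    """PSM for one bin: |S| / |Y| * cos(theta_S - theta_Y), clipped to [0, 1].

    Re(S * conj(Y)) / |Y|^2 is an equivalent form of the same expression.
    Clipping to [0, 1] is a common convention and an assumption here.
    """
    if noisy_bin == 0:
        return 0.0
    raw = (speech_bin * noisy_bin.conjugate()).real / abs(noisy_bin) ** 2
    return min(1.0, max(0.0, raw))


# Toy bin: clean speech 3+0j plus noise 0+4j gives the noisy observation 3+4j.
speech, noise = 3 + 0j, 0 + 4j
noisy = speech + noise
irm = ideal_ratio_mask(speech, noise)
psm = phase_sensitive_mask(speech, noisy)
enhanced_bin = psm * noisy  # a mask is applied by scaling the noisy bin
```

The IRM depends only on magnitudes, whereas the PSM also penalizes bins where the phase of the noisy signal departs from that of the clean speech, which is why the two values differ for the same bin.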

CLC number: TP183

HOU Congying, YANG Wenqing, WANG Zhao, CHENG Cong. Speech Enhancement Based on Time-frequency Self-attention Residual Temporal Convolutional Networks[J]. Computer and Modernization, 2024, 0(09): 20-24.

