Speech Enhancement Based on Time-frequency Self-attention Residual Temporal Convolutional Networks

doi:10.3969/j.issn.1006-2475.2024.09.004

Abstract

The main purpose of speech enhancement (SE) is to remove irrelevant signals such as noise. It is the front-end processing stage of many speech processing tasks and plays an important role in fields such as video conferencing and live broadcasting. However, most studies on SE focus on modeling the long-term contextual dependencies of speech frames, without considering the energy distribution characteristics in the time-frequency domain. This paper proposes a self-attention module based on the time-frequency domain, which makes it possible to explicitly introduce prior knowledge about speech distribution characteristics into the modeling process. Combined with the residual temporal convolutional network, a residual temporal convolutional network model based on time-frequency self-attention is constructed. To verify the validity of the model, two training targets commonly used in SE, the ideal ratio mask (IRM) and the phase-sensitive mask (PSM), are used in the experiments. The experimental results show that the model significantly improves performance on four objective evaluation metrics in SE and is consistently better than the baseline models.
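For readers unfamiliar with the two training targets, the following is a minimal sketch of how IRM and PSM are commonly defined per time-frequency bin in the SE literature. It is not taken from the paper: the function names `irm` and `psm`, the exponent `beta=0.5`, and the clipping of PSM to [0, 1] are illustrative assumptions, and the inputs are assumed to be complex STFT coefficients of clean speech, noise, and the noisy mixture.

```python
def irm(speech, noise, beta=0.5):
    """Ideal ratio mask per bin: (|S|^2 / (|S|^2 + |N|^2)) ** beta.

    beta=0.5 is a common choice, assumed here for illustration.
    """
    mask = []
    for s, n in zip(speech, noise):
        speech_power = abs(s) ** 2
        total_power = speech_power + abs(n) ** 2
        # Guard against silent bins where both powers are zero.
        mask.append((speech_power / total_power) ** beta if total_power > 0 else 0.0)
    return mask


def psm(speech, noisy, low=0.0, high=1.0):
    """Phase-sensitive mask per bin: |S|/|X| * cos(theta_S - theta_X).

    The real part of S/X equals that expression. Clipping to [low, high]
    is an assumption; it keeps the target bounded for training.
    """
    mask = []
    for s, x in zip(speech, noisy):
        if abs(x) == 0:
            mask.append(0.0)
            continue
        value = (s / x).real
        mask.append(min(max(value, low), high))
    return mask


# Hypothetical single-bin example: clean speech, noise, and their sum.
speech_bins = [3 + 0j]
noise_bins = [0 + 4j]
noisy_bins = [s + n for s, n in zip(speech_bins, noise_bins)]

irm_mask = irm(speech_bins, noise_bins)
psm_mask = psm(speech_bins, noisy_bins)
```

Unlike IRM, PSM depends on the phase difference between the clean and noisy spectra, so in this example it is smaller than the IRM value for the same bin.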

Key words: speech enhancement; time-frequency; self-attention mechanism; temporal convolutional network

CLC Number: TP183

HOU Congying, YANG Wengqing, WANG Zhao, CHENG Cong. Speech Enhancement Based on Time-frequency Self-attention Residual Temporal Convolutional Networks[J]. Computer and Modernization, 2024, 0(09): 20-24.
