Computer and Modernization ›› 2022, Vol. 0 ›› Issue (11): 119-126.
• Networks and Communication •
About the authors:
LU Wei (born 1996), male, from Zongyang, Anhui; master's candidate; research interests: big data, artificial intelligence; E-mail: Lu_reed0303@163.com. Corresponding author: ZHU Ding-ju (born 1978), male, professor, Ph.D.; research interests: big data, artificial intelligence; E-mail: zhudingju@m.scnu.edu.cn.
Online: 2022-11-30
Published: 2022-11-30
Abstract: Speech separation is a fundamental task in acoustic signal processing with a wide range of applications. Thanks to advances in deep learning, the performance of single-channel speech separation systems has improved markedly in recent years. In particular, since the introduction of a new separation method known as the time-domain audio separation network (TasNet), research on speech separation has gradually shifted from traditional time-frequency-domain methods to time-domain methods. This paper surveys the state of the art and outlook of TasNet-based single-channel speech separation. After reviewing traditional time-frequency-domain separation methods, it focuses on two TasNet-based models, Conv-TasNet and DPRNN, and compares the improvements proposed for each. Finally, it describes the limitations of current TasNet-based single-channel separation models and discusses future research directions in terms of model design, datasets, the number of speakers, and speech separation in complex acoustic scenes.
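To make the architectural shift described in the abstract concrete, below is a minimal PyTorch sketch of the encoder-mask-decoder pattern that TasNet introduced and that Conv-TasNet and DPRNN share: a learned 1-D convolutional encoder replaces the STFT, masks are estimated per speaker in the learned latent space, and a transposed convolution decodes back to waveforms. This is illustrative only: the class name, hyperparameters, and the tiny two-layer convolutional mask estimator are stand-ins, not the architectures from the cited papers (Conv-TasNet uses a dilated temporal convolutional network as the separator, DPRNN a dual-path RNN).

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Minimal TasNet-style separator: encoder -> mask estimator -> decoder.
    Hyperparameters (128 filters, 16-sample windows, 50% overlap) are
    illustrative, not the values from the original papers."""

    def __init__(self, n_filters=128, win=16, n_speakers=2):
        super().__init__()
        self.n_speakers = n_speakers
        # Encoder: learned 1-D conv replaces the STFT front-end.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=win,
                                 stride=win // 2, bias=False)
        # Mask estimator: stand-in for Conv-TasNet's TCN / DPRNN's dual-path blocks.
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, kernel_size=1),
            nn.Sigmoid(),  # one mask per speaker over the encoder output
        )
        # Decoder: transposed conv maps masked features back to waveforms.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=win,
                                          stride=win // 2, bias=False)

    def forward(self, mix):                     # mix: (batch, samples)
        feats = self.encoder(mix.unsqueeze(1))  # (batch, N, frames)
        masks = self.masker(feats).chunk(self.n_speakers, dim=1)
        # Mask the shared encoder output once per speaker, then decode.
        return torch.stack(
            [self.decoder(feats * m).squeeze(1) for m in masks], dim=1)

model = TinyTasNet()
est = model(torch.randn(4, 16000))  # four 1-second mixtures at 16 kHz
print(est.shape)  # torch.Size([4, 2, 16000]); length matches because
                  # 16000 aligns exactly with the window/stride choice
```

In the literature surveyed here, such models are trained on a scale-invariant SNR objective under utterance-level permutation invariant training (uPIT), which resolves the ambiguity of which output channel corresponds to which speaker; that loss is omitted from this sketch.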
LU Wei, ZHU Ding-ju. Research Review of Single-channel Speech Separation Technology Based on TasNet[J]. Computer and Modernization, 2022, 0(11): 119-126.
[1] HUANG P S, KIM M, HASEGAWA-JOHNSON M, et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015,23(12):2136-2147.
[2] ZHANG X L, WANG D L. Deep ensemble learning for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016,24(5):967-977.
[3] ISIK Y, LE ROUX J, CHEN Z, et al. Single-channel multi-speaker separation using deep clustering[C]// Annual Conference of the International Speech Communication Association. 2016:545-549.
[4] KOLBÆK M, YU D, TAN Z H, et al. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017,25(10):1901-1913.
[5] CHEN Z, LUO Y, MESGARANI N. Deep attractor network for single-microphone speaker separation[C]// 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017:246-250.
[6] LUO Y, CHEN Z, MESGARANI N. Speaker-independent speech separation with deep attractor network[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018,26(4):787-796.
[7] ERDOGAN H, HERSHEY J R, WATANABE S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]// 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015:708-712.
[8] WILLIAMSON D S, WANG Y, WANG D L. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015,24(3):483-492.
[9] LUO Y, CHEN Z, HERSHEY J R, et al. Deep clustering and conventional networks for music separation: Stronger together[C]// Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. 2017:61-65.
[10] SAINATH T N, WEISS R J, SENIOR A, et al. Learning the speech front-end with raw waveform CLDNNs[C]// Proceedings of the 16th Annual Conference of the International Speech Communication Association. 2015:1-5.
[11] GHAHREMANI P, MANOHAR V, POVEY D, et al. Acoustic modelling from the signal domain using CNNs[C]// Proceedings of the Annual Conference of the International Speech Communication Association. 2016:3434-3438.
[12] VAN DEN OORD A, DIELEMAN S, ZEN H G, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[13] MEHRI S, KUMAR K, GULRAJANI I, et al. SampleRNN: An unconditional end-to-end neural audio generation model[J]. arXiv preprint arXiv:1612.07837, 2016.
[14] PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: Speech enhancement generative adversarial network[C]// INTERSPEECH 2017. 2017:3642-3646.
[15] LUO Y, MESGARANI N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018:696-700.
[16] HERSHEY J R, CHEN Z, LE ROUX J, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]// 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016:31-35.
[17] YU D, KOLBÆK M, TAN Z H, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]// 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017:241-245.
[18] WANG Z Q, LE ROUX J, WANG D L, et al. End-to-end speech separation with unfolded iterative phase reconstruction[J]. arXiv preprint arXiv:1804.10204, 2018.
[19] WANG Z Q, TAN K, WANG D L. Deep learning based phase reconstruction for speaker separation: A trigonometric perspective[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019:71-75.
[20] LIU Y Z, WANG D L. Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019,27(12):2092-2102.
[21] LUO Y, MESGARANI N. Real-time single-channel dereverberation and separation with time-domain audio separation network[C]// INTERSPEECH 2018. 2018:342-346.
[22] LUO Y, MESGARANI N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019,27(8):1256-1266.
[23] LEA C, FLYNN M D, VIDAL R, et al. Temporal convolutional networks for action segmentation and detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:156-165.
[24] BAI S, KOLTER J Z, KOLTUN V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling[J]. arXiv preprint arXiv:1803.01271, 2018.
[25] LEA C, VIDAL R, REITER A, et al. Temporal convolutional networks: A unified approach to action segmentation[C]// European Conference on Computer Vision. Springer, 2016:47-54.
[26] CHOLLET F. Xception: Deep learning with depthwise separable convolutions[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:1251-1258.
[27] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[28] WANG D L. On ideal binary mask as the computational goal of auditory scene analysis[M]// Speech Separation by Humans and Machines. Springer, 2005:181-197.
[29] LI Y P, WANG D L. On the optimality of ideal binary time-frequency masks[J]. Speech Communication, 2009,51(3):230-239.
[30] WANG Y X, NARAYANAN A, WANG D L. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014,22(12):1849-1858.
[31] TUAN C I, WU Y K, LEE H, et al. MiTAS: A compressed time-domain audio separation network with parameter sharing[J]. arXiv preprint arXiv:1912.03884, 2019.
[32] DÉFOSSEZ A, USUNIER N, BOTTOU L, et al. Music source separation in the waveform domain[J]. arXiv preprint arXiv:1911.13254, 2019.
[33] DITTER D, GERKMANN T. A multi-phase gammatone filterbank for speech separation via TasNet[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:36-40.
[34] KADIOGLU B, HORGAN M, LIU X, et al. An empirical study of Conv-TasNet[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:7264-7268.
[35] BABY D, VERHULST S. SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019:106-110.
[36] FU S W, LIAO C F, TSAO Y, et al. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement[C]// International Conference on Machine Learning. PMLR, 2019:2031-2041.
[37] DENG C Y, ZHANG Y, MA S Q, et al. Conv-TasSAN: Separative adversarial network based on Conv-TasNet[C]// INTERSPEECH 2020. 2020:2647-2651.
[38] CHEN H S, XIANG T, CHEN K, et al. Nonlinear residual echo suppression based on multi-stream Conv-TasNet[C]// INTERSPEECH 2020. 2020:3959-3963.
[39] KOIZUMI Y, KARITA S, WISDOM S, et al. DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement[C]// 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021:161-165.
[40] LUO Y, CHEN Z, YOSHIOKA T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:46-50.
[41] ZHANG L W, SHI Z Q, HAN J Q, et al. FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks[C]// International Conference on Multimedia Modeling. Springer, 2020:653-665.
[42] WANG Z Q, LE ROUX J, WANG D L, et al. End-to-end speech separation with unfolded iterative phase reconstruction[J]. arXiv preprint arXiv:1804.10204, 2018.
[43] WANG Z Q, TAN K, WANG D L. Deep learning based phase reconstruction for speaker separation: A trigonometric perspective[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019:71-75.
[44] NACHMANI E, ADI Y, WOLF L. Voice separation with an unknown number of multiple speakers[C]// International Conference on Machine Learning. PMLR, 2020:7164-7175.
[45] SHI Z Q, LIU R J, HAN J Q. LaFurca: Iterative refined speech separation based on context-aware dual-path parallel Bi-LSTM[J]. arXiv preprint arXiv:2001.08998, 2020.
[46] WIJAYAKUSUMA A, GOZALI D R, WIDJAJA A, et al. Implementation of real-time speech separation model using time-domain audio separation network (TasNet) and dual-path recurrent neural network (DPRNN)[J]. Procedia Computer Science, 2021,179:762-772.
[47] WANG F L, PENG Y H, LEE H S, et al. Dual-path filter network: Speaker-aware modeling for speech separation[J]. arXiv preprint arXiv:2106.07579, 2021.
[48] KOLBÆK M, TAN Z H, JENSEN S H, et al. On TasNet for low-latency single-speaker speech enhancement[J]. arXiv preprint arXiv:2103.14882, 2021.
[49] NIU S T, DU J, SUN L, et al. Separation guided speaker diarization in realistic mismatched conditions[J]. arXiv preprint arXiv:2107.02357, 2021.
[50] DELCROIX M, ZMOLIKOVA K, KINOSHITA K, et al. Single channel target speaker extraction and recognition with speaker beam[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018:5554-5558.
[51] XU J M, SHI J, LIU G C, et al. Modeling attention and memory for auditory selection in a cocktail party environment[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. 2018:2564-2571.
[52] WANG Q, MUCKENHIRN H, WILSON K, et al. VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking[C]// INTERSPEECH 2019. 2019:2728-2732.
[53] XU C L, RAO W, CHNG E S, et al. SpEx: Multi-scale time domain speaker extraction network[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020,28:1370-1384.
[54] GE M, XU C L, WANG L B, et al. SpEx+: A complete time domain speaker extraction network[C]// INTERSPEECH 2020. 2020:1406-1410.
[55] ZEGHIDOUR N, GRANGIER D. Wavesplit: End-to-end speech separation by speaker clustering[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021,29:2840-2849.
[56] PEREZ E, STRUB F, DE VRIES H, et al. FiLM: Visual reasoning with a general conditioning layer[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. 2018:3942-3951.
[57] SHI J, XU J M, FUJITA Y, et al. Speaker-conditional chain model for speech separation and extraction[C]// INTERSPEECH 2020. 2020:2707-2711.
[58] SHI Z Q, LIU R J, HAN J Q. Speech separation based on multi-stage elaborated dual-path deep BiLSTM with auxiliary identity loss[C]// INTERSPEECH 2020. 2020:2682-2686.
[59] HUANG Y T, SHI J, XU J M, et al. Research status and prospects of the cocktail party problem and related auditory models[J]. Acta Automatica Sinica, 2019,45(2):234-251.
[60] LIU W J, NIE S, LIANG S, et al. Deep learning based speech separation technology and its developments[J]. Acta Automatica Sinica, 2016,42(6):819-833.
[61] TZINIS E, WISDOM S, HERSHEY J R, et al. Improving universal sound separation using sound classification[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:96-100.