Research Review of Single-channel Speech Separation Technology Based on TasNet

Abstract

Speech separation is a fundamental task in acoustic signal processing with a wide range of applications. Driven by advances in deep learning, the performance of single-channel speech separation systems has improved significantly in recent years. In particular, since the introduction of the time-domain audio separation network (TasNet), speech separation has been gradually shifting from traditional time-frequency-domain methods to time-domain methods. This paper reviews the current state and prospects of TasNet-based single-channel speech separation. After reviewing traditional time-frequency-domain separation methods, it focuses on two TasNet-based models, Conv-TasNet and DPRNN, and compares the work that improves on each. Finally, it describes the limitations of current TasNet-based single-channel speech separation models and discusses future research directions in terms of models, datasets, the number of speakers, and speech separation in complex scenarios.

Keywords: speech separation, TasNet, Conv-TasNet, DPRNN

LU Wei, ZHU Ding-ju. Research Review of Single-channel Speech Separation Technology Based on TasNet[J]. Computer and Modernization, 2022, 0(11): 119-126.
