[1] HUANG P S, KIM M, HASEGAWA-JOHNSON M, et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015,23(12):2136-2147.
[2] ZHANG X L, WANG D L. Deep ensemble learning for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016,24(5):967-977.
[3] ISIK Y, LE ROUX J, CHEN Z, et al. Single-channel multi-speaker separation using deep clustering[C]// INTERSPEECH 2016. 2016:545-549.
[4] KOLBÆK M, YU D, TAN Z H, et al. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017,25(10):1901-1913.
[5] CHEN Z, LUO Y, MESGARANI N. Deep attractor network for single-microphone speaker separation[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017:246-250.
[6] LUO Y, CHEN Z, MESGARANI N. Speaker-independent speech separation with deep attractor network[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018,26(4):787-796.
[7] ERDOGAN H, HERSHEY J R, WATANABE S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015:708-712.
[8] WILLIAMSON D S, WANG Y, WANG D L. Complex ratio masking for monaural speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016,24(3):483-492.
[9] LUO Y, CHEN Z, HERSHEY J R, et al. Deep clustering and conventional networks for music separation: Stronger together[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017:61-65.
[10] SAINATH T N, WEISS R J, SENIOR A, et al. Learning the speech front-end with raw waveform CLDNNs[C]// INTERSPEECH 2015. 2015:1-5.
[11] GHAHREMANI P, MANOHAR V, POVEY D, et al. Acoustic modelling from the signal domain using CNNs[C]// INTERSPEECH 2016. 2016:3434-3438.
[12] VAN DEN OORD A, DIELEMAN S, ZEN H, et al. WaveNet: A generative model for raw audio[J]. arXiv preprint arXiv:1609.03499, 2016.
[13] MEHRI S, KUMAR K, GULRAJANI I, et al. SampleRNN: An unconditional end-to-end neural audio generation model[J]. arXiv preprint arXiv:1612.07837, 2016.
[14] PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: Speech enhancement generative adversarial network[C]// INTERSPEECH 2017. 2017:3642-3646.
[15] LUO Y, MESGARANI N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018:696-700.
[16] HERSHEY J R, CHEN Z, LE ROUX J, et al. Deep clustering: Discriminative embeddings for segmentation and separation[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016:31-35.
[17] YU D, KOLBÆK M, TAN Z H, et al. Permutation invariant training of deep models for speaker-independent multi-talker speech separation[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017:241-245.
[18] WANG Z Q, LE ROUX J, WANG D L, et al. End-to-end speech separation with unfolded iterative phase reconstruction[J]. arXiv preprint arXiv:1804.10204, 2018.
[19] WANG Z Q, TAN K, WANG D L. Deep learning based phase reconstruction for speaker separation: A trigonometric perspective[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019:71-75.
[20] LIU Y Z, WANG D L. Divide and conquer: A deep CASA approach to talker-independent monaural speaker separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019,27(12):2092-2102.
[21] LUO Y, MESGARANI N. Real-time single-channel dereverberation and separation with time-domain audio separation network[C]// INTERSPEECH 2018. 2018:342-346.
[22] LUO Y, MESGARANI N. Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019,27(8):1256-1266.
[23] LEA C, FLYNN M D, VIDAL R, et al. Temporal convolutional networks for action segmentation and detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:156-165.
[24] BAI S, KOLTER J Z, KOLTUN V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling[J]. arXiv preprint arXiv:1803.01271, 2018.
[25] LEA C, VIDAL R, REITER A, et al. Temporal convolutional networks: A unified approach to action segmentation[C]// European Conference on Computer Vision. Springer, 2016:47-54.
[26] CHOLLET F. Xception: Deep learning with depthwise separable convolutions[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:1251-1258.
[27] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
[28] WANG D L. On ideal binary mask as the computational goal of auditory scene analysis[M]// Speech Separation by Humans and Machines. Springer, 2005:181-197.
[29] LI Y P, WANG D L. On the optimality of ideal binary time-frequency masks[J]. Speech Communication, 2009,51(3):230-239.
[30] WANG Y X, NARAYANAN A, WANG D L. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014,22(12):1849-1858.
[31] TUAN C I, WU Y K, LEE H, et al. MiTAS: A compressed time-domain audio separation network with parameter sharing[J]. arXiv preprint arXiv:1912.03884, 2019.
[32] DÉFOSSEZ A, USUNIER N, BOTTOU L, et al. Music source separation in the waveform domain[J]. arXiv preprint arXiv:1911.13254, 2019.
[33] DITTER D, GERKMANN T. A multi-phase gammatone filterbank for speech separation via TasNet[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:36-40.
[34] KADIOGLU B, HORGAN M, LIU X, et al. An empirical study of Conv-TasNet[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:7264-7268.
[35] BABY D, VERHULST S. SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019:106-110.
[36] FU S W, LIAO C F, TSAO Y, et al. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement[C]// International Conference on Machine Learning. PMLR, 2019:2031-2041.
[37] DENG C Y, ZHANG Y, MA S Q, et al. Conv-TasSAN: Separative adversarial network based on Conv-TasNet[C]// INTERSPEECH 2020. 2020:2647-2651.
[38] CHEN H S, XIANG T, CHEN K, et al. Nonlinear residual echo suppression based on multi-stream Conv-TasNet[C]// INTERSPEECH 2020. 2020:3959-3963.
[39] KOIZUMI Y, KARITA S, WISDOM S, et al. DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement[C]// IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2021:161-165.
[40] LUO Y, CHEN Z, YOSHIOKA T. Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:46-50.
[41] ZHANG L W, SHI Z Q, HAN J Q, et al. FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks[C]// International Conference on Multimedia Modeling. Springer, 2020:653-665.
[42] WANG Z Q, LE ROUX J, WANG D L, et al. End-to-end speech separation with unfolded iterative phase reconstruction[J]. arXiv preprint arXiv:1804.10204, 2018.
[43] WANG Z Q, TAN K, WANG D L. Deep learning based phase reconstruction for speaker separation: A trigonometric perspective[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019:71-75.
[44] NACHMANI E, ADI Y, WOLF L. Voice separation with an unknown number of multiple speakers[C]// International Conference on Machine Learning. PMLR, 2020:7164-7175.
[45] SHI Z Q, LIU R J, HAN J Q. LaFurca: Iterative refined speech separation based on context-aware dual-path parallel Bi-LSTM[J]. arXiv preprint arXiv:2001.08998, 2020.
[46] WIJAYAKUSUMA A, GOZALI D R, WIDJAJA A, et al. Implementation of real-time speech separation model using time-domain audio separation network (TasNet) and dual-path recurrent neural network (DPRNN)[J]. Procedia Computer Science, 2021,179:762-772.
[47] WANG F L, PENG Y H, LEE H S, et al. Dual-path filter network: Speaker-aware modeling for speech separation[J]. arXiv preprint arXiv:2106.07579, 2021.
[48] KOLBÆK M, TAN Z H, JENSEN S H, et al. On TasNet for low-latency single-speaker speech enhancement[J]. arXiv preprint arXiv:2103.14882, 2021.
[49] NIU S T, DU J, SUN L, et al. Separation guided speaker diarization in realistic mismatched conditions[J]. arXiv preprint arXiv:2107.02357, 2021.
[50] DELCROIX M, ZMOLIKOVA K, KINOSHITA K, et al. Single channel target speaker extraction and recognition with speaker beam[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018:5554-5558.
[51] XU J M, SHI J, LIU G C, et al. Modeling attention and memory for auditory selection in a cocktail party environment[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. 2018:2564-2571.
[52] WANG Q, MUCKENHIRN H, WILSON K, et al. VoiceFilter: Targeted voice separation by speaker-conditioned spectrogram masking[C]// INTERSPEECH 2019. 2019:2728-2732.
[53] XU C L, RAO W, CHNG E S, et al. SpEx: Multi-scale time domain speaker extraction network[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020,28:1370-1384.
[54] GE M, XU C L, WANG L B, et al. SpEx+: A complete time domain speaker extraction network[C]// INTERSPEECH 2020. 2020:1406-1410.
[55] ZEGHIDOUR N, GRANGIER D. Wavesplit: End-to-end speech separation by speaker clustering[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021,29:2840-2849.
[56] PEREZ E, STRUB F, DE VRIES H, et al. FiLM: Visual reasoning with a general conditioning layer[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. 2018:3942-3951.
[57] SHI J, XU J M, FUJITA Y, et al. Speaker-conditional chain model for speech separation and extraction[C]// INTERSPEECH 2020. 2020:2707-2711.
[58] SHI Z Q, LIU R J, HAN J Q. Speech separation based on multi-stage elaborated dual-path deep BiLSTM with auxiliary identity loss[C]// INTERSPEECH 2020. 2020:2682-2686.
[59] HUANG Y T, SHI J, XU J M, et al. Research advances and perspectives on the cocktail party problem and related auditory models[J]. Acta Automatica Sinica, 2019,45(2):234-251. (In Chinese)
[60] LIU W J, NIE S, LIANG S, et al. Deep learning based speech separation technology and its developments[J]. Acta Automatica Sinica, 2016,42(6):819-833. (In Chinese)
[61] TZINIS E, WISDOM S, HERSHEY J R, et al. Improving universal sound separation using sound classification[C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020:96-100.