[1] MOHAMMADI S H, KAIN A. An overview of voice conversion systems[J]. Speech Communication, 2017,88:65-82.
[2] WANG Y X, STANTON D, ZHANG Y, et al. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis[J]. arXiv preprint arXiv:1803.09017, 2018.
[3] ZHOU K, SISMAN B, LIU R, et al. Seen and unseen emotional style transfer for voice conversion with a new emotional speech dataset[C]// 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021:920-924.
[4] BAUM L E. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes[M]// Inequalities III: Proceedings of the 3rd Symposium on Inequalities. New York: Academic Press, 1972:1-8.
[5] DAVIS K H, BIDDULPH R, BALASHEK S. Automatic recognition of spoken digits[J]. The Journal of the Acoustical Society of America, 1952,24(6):637-642.
[6] SAKOE H, CHIBA S. Dynamic programming algorithm optimization for spoken word recognition[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1978,26(1):43-49.
[7] ZHANG J X. AI voice cloning technology sparks controversy[N]. Science and Technology Daily, 2023-06-08(4).
[8] QIN Z Y, ZHAO W L, YU X M, et al. OpenVoice: Versatile instant voice cloning[J]. arXiv preprint arXiv:2312.01479, 2023.
[9] OYUCU S. A novel end-to-end Turkish text-to-speech (TTS) system via deep learning[J]. Electronics, 2023,12(8):1900.
[10] KIM J, KONG J, SON J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech[J]. arXiv preprint arXiv:2106.06103, 2021.
[11] KIM J, KIM S, KONG J, et al. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search[J]. arXiv preprint arXiv:2005.11129, 2020.
[12] HUANG S F, LIN C J, LIU D R, et al. Meta-TTS: Meta-learning for few-shot speaker adaptive text-to-speech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022,30:1558-1571.
[13] SHEN K, JU Z Q, TAN X, et al. NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers[J]. arXiv preprint arXiv:2304.09116, 2023.
[14] YAO J X, LEI Y, WANG Q, et al. Preserving background sound in noise-robust voice conversion via multi-task learning[C]// ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023:1-5.
[15] BAAS M, VAN NIEKERK B, KAMPER H. Voice conversion with just nearest neighbors[J]. arXiv preprint arXiv:2305.18975, 2023.
[16] JIA Y, RAMANOVICH M T, REMEZ T, et al. Translatotron 2: High-quality direct speech-to-speech translation with voice preservation[C]// Proceedings of the 39th International Conference on Machine Learning. PMLR, 2022,162:10120-10134.
[17] GUO H J, LIU C R, ISHI C T, et al. QuickVC: Any-to-many voice conversion using inverse short-time Fourier transform for faster conversion[J]. arXiv preprint arXiv:2302.08296, 2023.
[18] VAN NIEKERK B, CARBONNEAU M A, ZAÏDI J, et al. A comparison of discrete and soft speech units for improved voice conversion[J]. arXiv preprint arXiv:2111.02392, 2021.
[19] CHEN S Y, WANG C Y, CHEN Z Y, et al. WavLM: Large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022,16(6):1505-1518.
[20] LI J Y, TU W P, XIAO L. FreeVC: Towards high-quality text-free one-shot voice conversion[C]// ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023. DOI: 10.1109/ICASSP49357.2023.10095191.
[21] KONG J, KIM J, BAE J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis[J]. arXiv preprint arXiv:2010.05646, 2020.
[22] ZAIDI S A M, LATIF S, QADIR J. Cross-language speech emotion recognition using multimodal dual attention transformers[J]. arXiv preprint arXiv:2306.13804, 2023.
[23] SU J Q, JIN Z Y, FINKELSTEIN A. HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks[J]. arXiv preprint arXiv:2006.05694, 2020.
[24] DINH L, SOHL-DICKSTEIN J, BENGIO S. Density estimation using Real NVP[J]. arXiv preprint arXiv:1605.08803, 2016.
[25] MAO X, LI Q, XIE H, et al. Least squares generative adversarial networks[C]// 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017. DOI: 10.1109/ICCV.2017.304.
[26] LARSEN A B L, SONDERBY S K, LAROCHELLE H, et al. Autoencoding beyond pixels using a learned similarity metric[J]. arXiv preprint arXiv:1512.09300, 2015.
[27] YAMAGISHI J, VEAUX C, MACDONALD K. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92) [DB/OL]. (2019-11-13)[2024-05-25]. https://datashare.ed.ac.uk/handle/10283/3443.
[28] WESTER M, WU Z Z, YAMAGISHI J. Analysis of the voice conversion challenge 2016 evaluation results[C]// InterSpeech 2016. ISCA, 2016:1637-1641.
[29] WANG D, DENG L, YEUNG Y T, et al. VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion[J]. arXiv preprint arXiv:2106.10132, 2021.
[30] LIU S X, CAO Y W, WANG D S, et al. Any-to-many voice conversion with location-relative sequence-to-sequence modeling[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021,29:1717-1728.
[31] KIM S, HORI T, WATANABE S. Joint CTC-attention based end-to-end speech recognition using multi-task learning[C]// 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017:4835-4839.