Speech Cloning Method Based on Self-attention Mechanism Speaker Encoder and SA-Decoder

doi:10.3969/j.issn.1006-2475.2025.07.010

Abstract

Abstract: The FreeVC model performs well in speech cloning. However, speech sequences carry complex, varying features and information, such as timbre and style, and the Speaker Encoder module in FreeVC relies on a single LSTM network, which struggles to extract and represent speaker information accurately. This degrades the model's handling of speech sequences and reduces the quality and accuracy of voice conversion. In addition, FreeVC uses a conventional decoder whose upsampling (deconvolution) operations lose detail, so articulation in the reconstructed audio becomes blurred and audio artifacts appear. To address these problems, this paper proposes FreeVC-SA, a speech cloning method based on a self-attention speaker encoder and an SA-Decoder. The method takes the speaker's Mel spectrogram as input and adds a self-attention mechanism to the LSTM network, which helps the model capture long-range dependencies and extract speaker features such as timbre and style more accurately. The SA-Decoder effectively addresses the limitation of local receptive fields, making the reconstructed cloned speech more realistic and clearer. Experimental results show that, compared with all baseline models, FreeVC-SA clearly improves naturalness similarity and emotional similarity and clearly reduces word error rate and character error rate.
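The abstract does not give the exact form of the attention layer, so the following is only a minimal sketch of the general idea: scaled dot-product self-attention applied to a sequence of frame-level vectors, followed by mean pooling into a single speaker embedding. The function names, the identity query/key/value projections, and the mean pooling are all assumptions made for illustration. In the described method the input frames would presumably be LSTM hidden states computed from the Mel spectrogram, with learned projections.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Assumption: a real encoder would use learned projections. Each output
    frame is a weighted average of ALL input frames, so distant frames can
    influence each other directly (long-range dependencies).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        outputs.append([
            sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)
        ])
    return outputs


def speaker_embedding(frames: List[Vector]) -> Vector:
    """Mean-pool the attended frames into one utterance-level vector (hypothetical pooling choice)."""
    attended = self_attention(frames)
    dim = len(attended[0])
    return [sum(f[d] for f in attended) / len(attended) for d in range(dim)]
```

Because every attended frame is a convex combination of the input frames, the pooled embedding always stays within the range spanned by the inputs. The attention weights decide how much each frame contributes.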

Key words: speech cloning, speaker encoder, SA-Decoder, self-attention mechanism, FreeVC-SA

CLC number: TP391

JIAO Leyan, ZHU Xinjuan. Speech Cloning Method Based on Self-attention Mechanism Speaker Encoder and SA-Decoder[J]. Computer and Modernization, 2025, 0(07): 69-76.

