Computer and Modernization (计算机与现代化), 2025, Vol. 0, Issue (07): 69-76. doi: 10.3969/j.issn.1006-2475.2025.07.010

• Artificial Intelligence •

Speech Cloning Method Based on Self-attention Mechanism Speaker Encoder and SA-Decoder

  1. (School of Computer Science, Xi'an Polytechnic University, Xi'an 710600, Shaanxi, China)
  • Online: 2025-07-22  Published: 2025-07-22
  • About the authors: Jiao Leyan (焦乐岩) (1999-), male, from Luoyang, Henan, master's student, research interest: speech cloning, E-mail: 1097052561@qq.com; corresponding author: Zhu Xinjuan (朱欣娟) (19-), female, professor, Ph.D., research interests: intelligent information systems, virtual reality, E-mail: zhuxinjuan@xpu.edu.cn.
  • Supported by:
     Key Research and Development Program of Shaanxi Province (2024GX-YBXM-548)

Abstract: The FreeVC model performs well in speech cloning. However, speech sequences contain complex feature variations and information, such as timbre and style, and the Speaker Encoder module in FreeVC relies on a single LSTM network, which struggles to extract and represent speaker information accurately. This degrades the model's performance on speech sequences and lowers the quality and accuracy of voice conversion. Moreover, FreeVC uses a conventional decoder whose upsampling (deconvolution) operations lose detail, so articulation in the reconstructed audio becomes blurred and audio artifacts appear. To address these problems, this paper proposes FreeVC-SA, a speech cloning method that combines a self-attention-based speaker encoder with an SA-Decoder. The method takes the speaker's Mel spectrogram as input and adds a self-attention mechanism on top of the LSTM network, which helps the model capture long-range dependencies and extract speaker characteristics such as timbre and style more accurately. The SA-Decoder addresses the limitation of local receptive fields, making the reconstructed cloned speech more realistic and clearer. Experimental results show that, compared with all baseline models, FreeVC-SA clearly improves naturalness similarity and emotional similarity, and clearly reduces word error rate and character error rate.

Key words: speech cloning, speaker encoder, SA-Decoder, self-attention mechanism, FreeVC-SA
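The abstract describes adding self-attention on top of an LSTM so that each frame can draw on distant frames when the speaker representation is built. The sketch below illustrates only that general idea and is not the authors' implementation. It assumes the per-frame hidden states are already available (the LSTM itself is omitted), uses identity query/key/value projections instead of learned weights, and mean-pools the result into one vector. The names `self_attention` and `speaker_embedding` and the pooling choice are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    frames: list of T vectors (each a list of d floats), standing in for the
    per-frame hidden states an LSTM would produce from a Mel spectrogram.
    Returns (contexts, weights): T context vectors and the T x T weight matrix.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame, including distant ones.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frames]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all frames: the context vector for this frame.
        contexts.append([sum(w * value[j] for w, value in zip(row, frames))
                         for j in range(dim)])
    return contexts, weights


def speaker_embedding(frames):
    """Mean-pool the attended frames into one fixed-size vector (assumed pooling)."""
    contexts, _ = self_attention(frames)
    steps = len(contexts)
    return [sum(vec[j] for vec in contexts) / steps
            for j in range(len(contexts[0]))]


# Toy hidden states: 3 frames, 2 dimensions (made-up values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contexts, weights = self_attention(hidden_states)
embedding = speaker_embedding(hidden_states)
```

Every row of `weights` sums to 1, and each context vector mixes information from all frames regardless of their distance in time, which is the property the abstract attributes to the self-attention mechanism. A trained model would use learned projections and typically several attention heads.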

CLC number:
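The abstract reports word error rate and character error rate without stating how they are computed. A minimal sketch of the standard definitions follows: the Levenshtein edit distance between a reference and a hypothesis transcript, divided by the reference length. It assumes both transcripts are already available as strings (for example from a speech recognizer run on the converted audio), and that spaces are ignored for the character error rate. Both assumptions are mine, not the paper's, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference item missing
                current[j - 1] + 1,      # insertion: extra hypothesis item
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by reference length (spaces ignored)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Made-up transcripts for illustration only.
wer = word_error_rate("the cat sat", "the bat sat")
cer = character_error_rate("the cat sat", "the bat sat")
```

Lower values mean the converted speech is transcribed closer to the reference text. Both rates can exceed 1 when the hypothesis contains many insertions.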