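Improved End-to-end Synthetic Speech Detection Method Based on Auxiliary Learning

Computer and Modernization ›› 2023, Vol. 0 ›› Issue (05): 52-57.

(College of Energy and Electrical Engineering, Hohai University, Nanjing 211100, Jiangsu, China)

Online: 2023-06-06 Published: 2023-06-06
About the authors: YUAN Tian-tian (born 1998), female, from Huainan, Anhui, master's student, research interests: machine learning and audio processing, E-mail: yuantaurora@163.com; corresponding author: LI Zhi-hua (born 1964), male, from Taizhou, Jiangsu, professor, master's supervisor, research interests: artificial intelligence and fault diagnosis of complex systems, E-mail: zhli@hhu.edu.cn; QIU Yang (born 1997), male, from Danyang, Jiangsu, master's student, research interests: machine learning and image recognition.
Funding:
Natural Science Foundation of Jiangsu Province (BK20151500)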

Abstract

Abstract: With the development of deepfake technology, synthetic speech detection faces growing challenges. This paper proposes a synthetic speech detection method that integrates auxiliary learning into an end-to-end model. After data alignment, the audio is fed directly into the improved end-to-end model without extracting any handcrafted features. The main task is binary classification of genuine versus synthetic speech, while discrimination among different synthetic speech types serves as the auxiliary task, providing prior assumptions for the main detection task; the weighting used to combine the main and auxiliary tasks is also optimized. Experiments on the public datasets ASVspoof2019 and ASVspoof2015 show that the improved model achieves a lower equal error rate (EER) than models using handcrafted features, outperforms the original end-to-end model, and generalizes better to unknown attack types.

Key words: deepfake, synthetic speech detection, auxiliary learning, weight optimization, end-to-end system
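
The abstract rests on two quantitative ideas: a weighted combination of the main-task and auxiliary-task losses, and the equal error rate used for evaluation. The sketch below illustrates both in plain Python under stated assumptions; it is not the authors' implementation. The function names, the fixed `aux_weight` value, and the convention that a higher score means "more likely genuine" are all hypothetical, and the abstract does not say how the paper actually optimizes the task weighting.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -math.log(softmax(logits)[target])


def joint_loss(main_logits, main_label, aux_logits, aux_label, aux_weight=0.5):
    """Main-task loss plus a weighted auxiliary-task loss.

    main_label: 0 = genuine, 1 = synthetic (assumed encoding).
    aux_label:  index of the synthesis type (assumed encoding).
    aux_weight: hypothetical fixed weight; the paper tunes this weighting.
    """
    main = cross_entropy(main_logits, main_label)
    aux = cross_entropy(aux_logits, aux_label)
    return main + aux_weight * aux


def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumes a higher score means "more likely genuine". Returns the mean of
    the false acceptance and false rejection rates at the threshold where
    the two rates are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Genuine utterances scored below the threshold are falsely rejected.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # Spoofed utterances scored at or above it are falsely accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Setting `aux_weight` to 0 reduces `joint_loss` to the plain binary detection loss, which makes the auxiliary task's contribution easy to ablate. The threshold sweep in `equal_error_rate` is a coarse approximation; evaluation toolkits typically interpolate between thresholds.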

YUAN Tian-tian, LI Zhi-hua, QIU Yang. Improved End-to-end Synthetic Speech Detection Method Based on Auxiliary Learning[J]. Computer and Modernization, 2023, 0(05): 52-57.

