Improved End-to-end Synthetic Speech Detection Method Based on Auxiliary Learning

Abstract

With the development of deep forgery technology, synthetic speech detection faces growing challenges. This paper proposes a synthetic speech detection method that integrates auxiliary learning into an end-to-end model. After data alignment, the audio is fed directly into the improved end-to-end model without extracting any hand-crafted features. The main task is to classify speech as real or synthetic. Different synthetic speech types are used as auxiliary tasks, providing a prior hypothesis for the main task's synthetic speech detection, and the weighted combination of the main and auxiliary tasks is optimized. Experimental results on the public datasets ASVspoof2019 and ASVspoof2015 show that the improved model achieves a lower equal error rate than models using hand-crafted features, outperforms the end-to-end model before the improvement, and generalizes better to unknown attack types.
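The weighted combination of main and auxiliary tasks described above can be illustrated with a minimal sketch. It assumes that both tasks use softmax cross-entropy and that the auxiliary loss is scaled by a single fixed weight; the abstract does not state the paper's actual loss functions or its weight-optimization scheme, so the names `joint_loss` and `aux_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def joint_loss(main_logits, main_label, aux_logits, aux_label, aux_weight=0.5):
    """Main-task loss plus a weighted auxiliary-task loss.

    Main task: real (0) vs. synthetic (1) speech.
    Auxiliary task: which synthesis type produced the utterance.
    aux_weight is an assumed hyperparameter, not a value from the paper.
    """
    main = cross_entropy(main_logits, main_label)
    aux = cross_entropy(aux_logits, aux_label)
    return main + aux_weight * aux


# One utterance labeled synthetic (1), produced by hypothetical attack type 2 of 4.
loss = joint_loss([0.2, 1.5], 1, [0.1, 0.3, 2.0, -1.0], 2, aux_weight=0.5)
print(f"joint loss: {loss:.4f}")
```

Setting `aux_weight` to 0 reduces this to the plain end-to-end classifier. Larger values push the shared representation to separate synthesis types as well, which is the effect the auxiliary tasks are meant to provide.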

Key words: deep forgery, synthetic speech detection, auxiliary learning, weight optimization, end-to-end system

YUAN Tian-tian, LI Zhi-hua, QIU Yang. Improved End-to-end Synthetic Speech Detection Method Based on Auxiliary Learning[J]. Computer and Modernization, 2023, 0(05): 52-57.

