Application of Hybrid CTC/Attention Model in Mandarin Recognition

Abstract

End-to-end speech recognition models based on Connectionist Temporal Classification (CTC) have a simple structure and align inputs to outputs automatically, but their recognition accuracy leaves room for improvement. This paper introduces an attention mechanism to form a hybrid CTC/Attention end-to-end model. The method uses multi-task learning to combine the alignment advantage of CTC with the context-modeling advantage of the attention mechanism. Experimental results show that, with 80-dimensional FBank features plus 3-dimensional pitch features as the acoustic features and a VGG-bidirectional long short-term memory network as the encoder for Mandarin Chinese recognition, the hybrid model reduces the character error rate by about 6.1% compared with the CTC-based end-to-end model. Connecting an external language model reduces the character error rate by a further 0.3%. The character error rate is also significantly lower than that of the traditional baseline model.
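
As a rough illustration of the multi-task idea, a hybrid CTC/Attention objective is commonly written as a weighted sum of the two branch losses. The sketch below assumes that form. The function name `hybrid_loss`, the parameter `ctc_weight`, and its default of 0.3 are hypothetical and are not taken from the paper.

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate two per-utterance losses: w * L_ctc + (1 - w) * L_att.

    ctc_weight (w) is a hypothetical hyperparameter; the value used in the
    paper is not stated in the abstract.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Illustrative numbers only, not experimental results.
example = hybrid_loss(ctc_loss=2.0, attention_loss=1.0, ctc_weight=0.5)
```

With `ctc_weight` set to 1 the objective reduces to pure CTC training, and with 0 it reduces to pure attention training, so intermediate values trade off the two branches.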

Key words: speech recognition, connectionist temporal classification, attention mechanism, end-to-end

XU Hong-kui, ZHANG Zi-feng, LU Jiang-kun, ZHOU Jun-jie, HU Wen-ye, JIANG Tong-tong. Application of Hybrid CTC/Attention Model in Mandarin Recognition[J]. Computer and Modernization, 2022, 0(08): 1-6.

