Application of Hybrid CTC/Attention Model in Mandarin Recognition

Abstract: End-to-end speech recognition models based on Connectionist Temporal Classification (CTC) have a simple structure and align input and output sequences automatically, but their recognition accuracy leaves room for improvement. This paper adds an attention mechanism to form a hybrid CTC/Attention end-to-end model, trained with multi-task learning so that it combines the alignment advantage of CTC with the context-modeling advantage of attention. With 80-dimensional FBank features plus 3-dimensional pitch features as the acoustic features and a VGG-Bidirectional Long Short-Term Memory (VGG-BiLSTM) network as the encoder, experiments on Mandarin Chinese recognition show that the hybrid model reduces the character error rate by about 6.1% compared with the CTC-based end-to-end model. Connecting an external language model reduces the character error rate by a further 0.3%. The character error rate is also substantially lower than that of the traditional baseline model.

Key words: speech recognition, connectionist temporal classification, attention mechanism, end-to-end
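
The multi-task learning mentioned in the abstract is commonly formulated as a weighted sum of the CTC loss and the attention loss. The snippet below is a minimal sketch of that interpolation, not the authors' implementation: the function name `hybrid_loss`, the parameter `ctc_weight`, its default of 0.3, and the loss values are all illustrative assumptions, since the abstract does not state the weight that was used.

```python
def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate the CTC and attention losses for multi-task training.

    ctc_weight is a hypothetical value; the abstract does not report
    the weight used in the experiments.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    # ctc_weight = 1 gives pure CTC training, ctc_weight = 0 pure attention.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


# Made-up per-utterance loss values, for illustration only.
total = hybrid_loss(ctc_loss=42.0, att_loss=30.0, ctc_weight=0.5)
print(total)  # 36.0
```

In a real system both losses would be computed from the same shared encoder output (here, the VGG-BiLSTM encoder), so the gradient of the combined loss updates the encoder under both objectives at once.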

XU Hong-kui, ZHANG Zi-feng, LU Jiang-kun, ZHOU Jun-jie, HU Wen-ye, JIANG Tong-tong. Application of Hybrid CTC/Attention Model in Mandarin Recognition[J]. Computer and Modernization, 2022, 0(08): 1-6.

