End-to-End Voiceprint Recognition Based on Nonlinear Stacked Bidirectional Network

Abstract

Abstract: Traditional voiceprint recognition methods involve cumbersome pipelines and achieve low recognition rates, while the neural networks used in existing deep learning methods are not tailored to speech signals, which limits recognition accuracy. To address these problems, this paper proposes an end-to-end voiceprint recognition method based on a nonlinear stacked bidirectional LSTM. First, Fbank features are extracted from the raw speech files and used as the input to the network. Then, because speech signals are continuous and strongly correlated across time, a bidirectional long short-term memory (LSTM) network is constructed to process the speech data and extract deep features. To further strengthen the nonlinear expressive power of the network, multiple bidirectional LSTM layers and multiple nonlinear layers are stacked to extract deeper, more abstract features from the speech signal. Finally, the SGD optimizer is used to train the model. Experimental results show that the proposed method makes full use of the sequential characteristics of speech signals and offers comprehensive temporal modeling and strong nonlinear expressive power. The resulting model is well integrated and achieves better recognition performance than models such as GRU and LSTM.

Key words: voiceprint recognition, end-to-end, temporal features, long short-term memory, stacked network, nonlinearity
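The architecture outlined in the abstract (stacked bidirectional LSTM layers followed by stacked nonlinear layers) can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation. The layer sizes, the random stand-in for Fbank frames, the choice of Leaky ReLU as the nonlinearity, and mean pooling over time are all assumptions made for illustration, and training with SGD is omitted.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaky_relu(x, slope=0.01):
    # Assumed nonlinearity; the abstract does not name the activation used.
    return x if x > 0 else slope * x

def init_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

class LSTMCell:
    """Standard LSTM cell; one weight matrix per gate, acting on [x; h]."""
    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.W = {k: init_matrix(hidden_size, input_size + hidden_size, rng) for k in "ifgo"}
        self.b = {k: [0.0] * hidden_size for k in "ifgo"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])] for k in "ifgo"}
        i = [sigmoid(v) for v in pre["i"]]    # input gate
        f = [sigmoid(v) for v in pre["f"]]    # forget gate
        g = [math.tanh(v) for v in pre["g"]]  # candidate cell state
        o = [sigmoid(v) for v in pre["o"]]    # output gate
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def run(self, frames):
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in frames:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs

class BiLSTMLayer:
    """Runs one LSTM forward and one backward in time, then concatenates per frame."""
    def __init__(self, input_size, hidden_size, rng):
        self.fwd = LSTMCell(input_size, hidden_size, rng)
        self.bwd = LSTMCell(input_size, hidden_size, rng)

    def run(self, frames):
        f_out = self.fwd.run(frames)
        b_out = self.bwd.run(frames[::-1])[::-1]  # re-align backward outputs with time
        return [f + b for f, b in zip(f_out, b_out)]

class DenseLeakyReLU:
    """Fully connected layer followed by the assumed Leaky ReLU nonlinearity."""
    def __init__(self, input_size, output_size, rng):
        self.W = init_matrix(output_size, input_size, rng)
        self.b = [0.0] * output_size

    def run(self, v):
        return [leaky_relu(a + b) for a, b in zip(matvec(self.W, v), self.b)]

def embed_utterance(frames, bilstm_layers, dense_layers):
    seq = frames
    for layer in bilstm_layers:          # stacked bidirectional LSTM layers
        seq = layer.run(seq)
    n = len(seq)
    vec = [sum(col) / n for col in zip(*seq)]  # mean pooling over time (assumption)
    for layer in dense_layers:           # stacked nonlinear layers
        vec = layer.run(vec)
    return vec

rng = random.Random(0)
N_MELS, HIDDEN, EMBED = 8, 4, 3  # hypothetical sizes, kept tiny for readability
bilstm_layers = [BiLSTMLayer(N_MELS, HIDDEN, rng), BiLSTMLayer(2 * HIDDEN, HIDDEN, rng)]
dense_layers = [DenseLeakyReLU(2 * HIDDEN, 6, rng), DenseLeakyReLU(6, EMBED, rng)]
frames = [[rng.gauss(0, 1) for _ in range(N_MELS)] for _ in range(5)]  # stand-in for Fbank frames
embedding = embed_utterance(frames, bilstm_layers, dense_layers)
print(len(embedding))
```

Each bidirectional layer doubles the per-frame feature width because the forward and backward hidden states are concatenated, which is why the second layer takes `2 * HIDDEN` inputs. In practice a deep learning framework would supply these layers along with the SGD training loop.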

WANG Zhi-yue, CUI Lin. End-to-End Voiceprint Recognition Based on Nonlinear Stacked Bidirectional Network[J]. Computer and Modernization, 2022(03): 13-17.
