End to End Voiceprint Recognition Based on Nonlinear Stacked Bidirectional Network

Abstract

Traditional voiceprint recognition methods are cumbersome and achieve low recognition rates, while the neural networks used in existing deep learning methods are not tailored to speech signals, which limits recognition accuracy. To address these problems, this paper proposes an end-to-end voiceprint recognition method based on a nonlinear stacked bidirectional LSTM. First, Fbank (filter bank) features are extracted from the original voice files as the input to the network model. Then, because speech signals are continuous and strongly correlated over time, a bidirectional long short-term memory (LSTM) network is constructed to process the voice data and extract deep features. To further enhance the nonlinear expressive ability of the network, multiple bidirectional LSTM layers and multiple nonlinear layers are stacked to extract deeper abstract features of the speech signal. Finally, the SGD optimizer is used to optimize training. The experimental results show that the proposed method makes full use of the sequential characteristics of speech signals and has strong temporal modeling and nonlinear expressive ability. The constructed model is well integrated end to end and achieves better recognition performance than GRU and LSTM models.
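The architecture described in the abstract can be illustrated with a minimal forward-pass sketch. It is not the authors' implementation: the layer sizes, the random weights, the mean pooling over time, the leaky ReLU activation, and the names `LSTMCell`, `BiLSTMLayer`, `Dense`, and `speaker_posteriors` are all assumptions made for illustration, and training with SGD is not shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_matrix(rows, cols, rng):
    # Small random weights; the values are illustrative only.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class LSTMCell:
    """Standard LSTM cell: input, forget, output gates plus a candidate state."""
    def __init__(self, n_in, n_hidden, rng):
        self.n_hidden = n_hidden
        # One weight matrix per gate, applied to [input; previous hidden state].
        self.w = {k: make_matrix(n_hidden, n_in + n_hidden, rng) for k in "ifoc"}
        self.b = {k: [0.0] * n_hidden for k in "ifoc"}

    def step(self, x, h, c):
        z = x + h  # list concatenation of input and previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.w[k], z), self.b[k])]
               for k in "ifoc"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        cand = [math.tanh(v) for v in pre["c"]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def run(self, seq):
        h, c = [0.0] * self.n_hidden, [0.0] * self.n_hidden
        outputs = []
        for x in seq:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs

class BiLSTMLayer:
    """Runs one LSTM forward in time and one backward, then concatenates."""
    def __init__(self, n_in, n_hidden, rng):
        self.fwd = LSTMCell(n_in, n_hidden, rng)
        self.bwd = LSTMCell(n_in, n_hidden, rng)

    def run(self, seq):
        forward = self.fwd.run(seq)
        backward = self.bwd.run(seq[::-1])[::-1]  # re-align to original order
        return [hf + hb for hf, hb in zip(forward, backward)]

class Dense:
    def __init__(self, n_in, n_out, rng):
        self.w = make_matrix(n_out, n_in, rng)
        self.b = [0.0] * n_out

    def __call__(self, v):
        return [a + b for a, b in zip(matvec(self.w, v), self.b)]

def leaky_relu(v, slope=0.01):
    # Assumed activation; the abstract only says "nonlinear layers".
    return [x if x > 0 else slope * x for x in v]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def speaker_posteriors(frames, n_hidden=8, n_lstm=2, n_dense=2,
                       n_speakers=4, seed=0):
    """frames: list of per-frame feature vectors (e.g. Fbank coefficients)."""
    rng = random.Random(seed)
    seq, width = frames, len(frames[0])
    for _ in range(n_lstm):  # stacked bidirectional LSTM layers
        seq = BiLSTMLayer(width, n_hidden, rng).run(seq)
        width = 2 * n_hidden
    # Assumed utterance-level summary: mean over time of the top layer.
    feat = [sum(col) / len(seq) for col in zip(*seq)]
    for _ in range(n_dense):  # stacked nonlinear layers
        feat = leaky_relu(Dense(width, n_hidden, rng)(feat))
        width = n_hidden
    return softmax(Dense(width, n_speakers, rng)(feat))

# Usage with synthetic 40-dimensional "Fbank-like" frames (random data).
demo_rng = random.Random(1)
demo_frames = [[demo_rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(12)]
demo_probs = speaker_posteriors(demo_frames)
```

Each bidirectional layer doubles the per-frame feature width because the forward and backward hidden states are concatenated, which is why the next layer in the stack takes `2 * n_hidden` inputs. In practice a deep learning framework would replace these hand-written loops and supply the SGD training step.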

Key words: voiceprint recognition, end to end, sequential characteristic, long short-term memory, stacked network, nonlinear

WANG Zhi-yue, CUI Lin. End to End Voiceprint Recognition Based on Nonlinear Stacked Bidirectional Network[J]. Computer and Modernization, 2022, 0(03): 13-17.
