Lightweight Speech Emotion Recognition for Data Enhancement

Abstract

Deep learning approaches to speech emotion recognition require a large amount of training data, but existing speech emotion databases are small, and the limited data leads to overfitting. To address this, the original speech is augmented in the preprocessing stage by adding Gaussian white noise and shifting the waveform to generate new speech signals, which improves both the recognition accuracy and the robustness of the model. In addition, because common convolutional neural networks have an excessive number of parameters, a lightweight model composed of separable convolutions and gated recurrent units (GRUs) is proposed. First, MFCC features are extracted from the original speech as the model input. Next, separable convolutions extract the spatial information of the speech and the gated recurrent units extract its temporal information, so that both kinds of information characterize the speech emotion at the same time and the predictions become more accurate. Finally, the result is fed to a fully connected layer with softmax to complete the emotion classification. The experimental results show that the proposed model not only achieves higher accuracy than the benchmark model but also reduces the model size by about 50%.
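The two augmentation steps named in the abstract, adding Gaussian white noise and shifting the waveform, can be illustrated with a minimal sketch. The abstract gives no noise level, shift range, or padding rule, so the `snr_db` and `max_shift` parameters, the zero padding at the shifted edge, and all function names below are assumptions made for illustration and are not the authors' implementation. Plain Python lists are used so the sketch runs without dependencies.

```python
import math
import random


def add_gaussian_white_noise(signal, snr_db, rng):
    """Return the signal plus zero-mean Gaussian noise at the given SNR in dB.

    The SNR parameterization is an assumption; the abstract gives no noise level.
    """
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def shift_waveform(signal, shift):
    """Shift the waveform by `shift` samples and pad the gap with zeros.

    Positive values delay the signal; negative values advance it.
    Zero padding (rather than a circular shift) is an assumption.
    """
    n = len(signal)
    if shift == 0:
        return list(signal)
    if abs(shift) >= n:
        return [0.0] * n
    if shift > 0:
        return [0.0] * shift + list(signal[: n - shift])
    return list(signal[-shift:]) + [0.0] * (-shift)


def augment(signal, snr_db=20.0, max_shift=1600, seed=0):
    """Return the original signal together with a noisy and a shifted copy."""
    rng = random.Random(seed)
    noisy = add_gaussian_white_noise(signal, snr_db, rng)
    shifted = shift_waveform(signal, rng.randint(-max_shift, max_shift))
    return [list(signal), noisy, shifted]
```

Calling `augment(samples)` on one utterance yields three training examples of equal length that would share the original emotion label, which is how augmentation of this kind enlarges a small corpus before MFCC extraction.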

Key words: speech emotion recognition, data enhancement, Gaussian white noise, waveform displacement, number of parameters

CUI Chen-lu, CUI Lin. Lightweight Speech Emotion Recognition for Data Enhancement[J]. Computer and Modernization, 2023, 0(04): 83-89.
