Computer and Modernization ›› 2023, Vol. 0 ›› Issue (04): 83-89.

• Pattern Recognition •

Lightweight Speech Emotion Recognition for Data Enhancement

  

  (1. School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710048, China; 2. School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China)
  • Online: 2023-05-09  Published: 2023-05-09
  • About the authors: CUI Chenlu (1997—), female, born in Xi'an, Shaanxi, M.S. candidate, research interest: speech emotion recognition, E-mail: 1017071182@qq.com; CUI Lin (1984—), female, born in Cangzhou, Hebei, lecturer, Ph.D., research interests: array signal processing and speech signal processing, E-mail: cuilin789@163.com.
  • Fund: Young Scientists Fund of the National Natural Science Foundation of China (61901347)




Abstract: Deep learning approaches to speech emotion recognition typically require large amounts of training data, yet existing speech emotion databases are scarce and small, which easily leads to overfitting. To address this, the original speech is augmented in the preprocessing stage by adding Gaussian white noise and shifting the waveform to generate new speech signals, which not only improves recognition accuracy but also enhances the robustness of the model. At the same time, because ordinary convolutional neural networks have too many parameters, a lightweight model composed of separable convolutions and gated recurrent units is proposed. First, MFCC features are extracted from the original speech as the input to the model. Second, separable convolutions extract the spatial information of the speech while gated recurrent units extract its temporal information; characterizing speech emotion with both temporal and spatial information makes the predictions more accurate. Finally, the features are fed into a fully connected layer with softmax to complete emotion classification. Experimental results show that, compared with the baseline model, the proposed model not only achieves higher accuracy but also reduces the model size by about 50%.
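The two augmentation operations described in the abstract can be sketched as follows. This is a minimal NumPy illustration; the function names, the SNR parameter, and zero-padding of the shifted gap are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def add_white_noise(wave: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Add Gaussian white noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def shift_waveform(wave: np.ndarray, shift: int) -> np.ndarray:
    """Shift the waveform along the time axis, zero-padding the vacated gap."""
    out = np.zeros_like(wave)
    if shift >= 0:
        out[shift:] = wave[:len(wave) - shift]
    else:
        out[:shift] = wave[-shift:]
    return out
```

Each transformed copy keeps its original emotion label, so both operations multiply the size of the training set while exposing the model to noisy and time-shifted variants of the same utterance.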

Key words: speech emotion recognition, data enhancement, Gaussian white noise, waveform displacement, number of parameters
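The parameter saving behind the lightweight model comes from replacing standard convolutions with depthwise separable ones (a per-channel depthwise pass followed by a 1×1 pointwise pass). A back-of-the-envelope count makes the saving concrete; the channel sizes and kernel size below are illustrative, not the paper's actual configuration.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def separable_conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k filter per input channel, then 1 x 1 pointwise mixing."""
    return c_in * k * k + c_in * c_out

# Example layer: 64 -> 128 channels with a 3 x 3 kernel
standard = conv2d_params(64, 128, 3)             # 64 * 128 * 9  = 73728
separable = separable_conv2d_params(64, 128, 3)  # 576 + 8192    = 8768
```

For this example layer the separable variant needs roughly 8× fewer weights; the overall 50% compression reported in the abstract applies to the full network, where the GRU and fully connected layers also contribute parameters.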