Computer and Modernization ›› 2023, Vol. 0 ›› Issue (04): 83-89.

• Pattern Recognition •

Lightweight Speech Emotion Recognition for Data Enhancement

  

  (1. School of Electronics and Information, Xi'an Polytechnic University, Xi'an 710048, China; 2. School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China)
  • Online: 2023-05-09  Published: 2023-05-09
  • About the authors: CUI Chenlu (1997—), female, born in Xi'an, Shaanxi, M.S. candidate, research interest: speech emotion recognition, E-mail: 1017071182@qq.com; CUI Lin (1984—), female, born in Cangzhou, Hebei, lecturer, Ph.D., research interests: array signal processing and speech signal processing, E-mail: cuilin789@163.com.
  • Fund: Young Scientists Fund of the National Natural Science Foundation of China (61901347)




Abstract: Deep learning approaches to speech emotion recognition typically require large amounts of training data, yet existing speech emotion databases are scarce and small, which easily leads to overfitting. To address this, the original speech is augmented in the preprocessing stage by adding Gaussian white noise and shifting the waveform to generate new speech signals, which not only improves recognition accuracy but also enhances the robustness of the model. At the same time, because ordinary convolutional neural networks have too many parameters, a lightweight model composed of separable convolutions and gated recurrent units is proposed. First, MFCC features are extracted from the original speech as the input to the model. Second, separable convolutions extract the spatial information of the speech while gated recurrent units extract its temporal information; characterizing speech emotion with both temporal and spatial information makes the predictions more accurate. Finally, the features are fed into a fully connected layer with softmax to complete emotion classification. Experimental results show that, compared with the baseline model, the proposed model not only achieves higher accuracy but also reduces the model size by about 50%.
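The two augmentation operations described in the abstract can be sketched as follows. This is a minimal NumPy illustration; the function names, the SNR parameter, and zero-padding of the shifted gap are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def add_white_noise(wave: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Add Gaussian white noise at a target signal-to-noise ratio (dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def shift_waveform(wave: np.ndarray, shift: int) -> np.ndarray:
    """Shift the waveform along the time axis, zero-padding the vacated gap."""
    out = np.zeros_like(wave)
    if shift >= 0:
        out[shift:] = wave[:len(wave) - shift]
    else:
        out[:shift] = wave[-shift:]
    return out
```

Each transformed copy keeps its original emotion label, so both operations multiply the size of the training set while exposing the model to noisy and time-shifted variants of the same utterance.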

Key words: speech emotion recognition, data enhancement, Gaussian white noise, waveform displacement, number of parameters
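The parameter saving behind the lightweight model comes from replacing standard convolutions with depthwise separable ones (a per-channel depthwise pass followed by a 1×1 pointwise pass). A back-of-the-envelope count makes the saving concrete; the channel sizes and kernel size below are illustrative, not the paper's actual configuration.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def separable_conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k filter per input channel, then 1 x 1 pointwise mixing."""
    return c_in * k * k + c_in * c_out

# Example layer: 64 -> 128 channels with a 3 x 3 kernel
standard = conv2d_params(64, 128, 3)             # 64 * 128 * 9  = 73728
separable = separable_conv2d_params(64, 128, 3)  # 576 + 8192    = 8768
```

For this example layer the separable variant needs roughly 8× fewer weights; the overall 50% compression reported in the abstract applies to the full network, where the GRU and fully connected layers also contribute parameters.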