Computer and Modernization ›› 2021, Vol. 0 ›› Issue (12): 91-95.

• Artificial Intelligence •

Speech Emotion Recognition Based on ARCNN-GAP Network

  1. (1. School of Computer Science, Qinghai Normal University, Xining 810008, China;

    2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China)
  • Online: 2021-12-24  Published: 2021-12-24
  • About the authors: QIAN Jiaqi (1996—), female, born in Bengbu, Anhui, master's student, research interests: pattern recognition and intelligent systems, E-mail: qianjiaqi_k@163.com; HUANG Heming (1969—), male (Tibetan), born in Ledu, Qinghai, professor, Ph.D., research interests: pattern recognition and intelligent systems, E-mail: 1021489068@qq.com; ZHANG Huiyun (1993—), female, born in Qingyang, Gansu, Ph.D. candidate, research interests: pattern recognition and intelligent systems, E-mail: 1406043513@qq.com.
  • Funding:
    National Natural Science Foundation of China (62066039)

Abstract: Speech emotion recognition is the most direct way for computers to understand human emotions and an important channel for making human-computer interaction intelligent, but the performance of recognition models needs further improvement. To this end, ARCNN-GAP, a speech emotion recognition model based on a recurrent convolutional neural network, is proposed. In this model, the recurrent convolutional layers have elastic paths, which preserve network depth while ensuring gradient backpropagation during optimization and extracting more effective emotion features; global average pooling reduces computational complexity and the risk of over-fitting; and the attention mechanism makes the model focus more on emotion-related features. Using the fusion of prosodic and spectral features, experiments on the CASIA and EMO-DB databases achieve recognition rates of 83.29% and 75.28%, respectively. The experimental results show that ARCNN-GAP offers better recognition performance and generalization.
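The abstract names three architectural ingredients: recurrent convolutional layers with "elastic" paths, global average pooling (GAP), and frame-level attention. The toy NumPy sketch below illustrates each mechanism in a 1-D setting. It is not the authors' implementation; all shapes, kernel sizes, iteration counts, and the random scoring vector are illustrative assumptions.

```python
import numpy as np

def recurrent_conv_step(x, state, w_ff, w_rec):
    # One unrolled iteration of a recurrent convolutional layer (1-D toy case):
    # the feed-forward convolution of the input is summed with a convolution of
    # the previous state, so gradients can flow back through either route.
    ff = np.convolve(x, w_ff, mode="same")
    rec = np.convolve(state, w_rec, mode="same")
    return np.maximum(0.0, ff + rec)  # ReLU

def global_average_pooling(feature_map):
    # (T, C) feature map -> (C,) vector: a parameter-free mean over time,
    # which is why GAP cuts computation and over-fitting risk vs. a dense layer.
    return feature_map.mean(axis=0)

def attention_pool(feature_map, w):
    # Softmax attention over time frames; frames scored higher by w contribute
    # more to the pooled summary (emotion-relevant frames get more weight).
    scores = feature_map @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ feature_map  # (C,) attention-weighted summary

rng = np.random.default_rng(42)
x = rng.standard_normal(100)                        # 100-frame 1-D feature track
w_ff, w_rec = rng.standard_normal(3), rng.standard_normal(3)

state = np.zeros_like(x)
for _ in range(3):                                  # unroll 3 shared-weight iterations
    state = recurrent_conv_step(x, state, w_ff, w_rec)

feature_map = np.stack([x, state], axis=1)          # toy (T=100, C=2) map
gap_vec = global_average_pooling(feature_map)
att_vec = attention_pool(feature_map, rng.standard_normal(2))
print(gap_vec.shape, att_vec.shape)                 # (2,) (2,)
```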

Key words: speech emotion recognition, recurrent convolutional neural network, global average pooling, attention mechanism
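The experiments feed the model a fusion of prosodic and spectral features. The paper's exact feature set is not listed here, so the sketch below is a hypothetical stand-in: short-time energy and zero-crossing rate statistics as crude prosodic descriptors, and band-averaged log magnitudes as a crude spectral envelope. It only demonstrates the concatenation ("fusion") step, not the actual front end used in the study.

```python
import numpy as np

def prosodic_features(signal, frame_len=400, hop=160):
    # Frame-level short-time energy and zero-crossing rate, summarized by
    # mean and standard deviation over all frames (4 values in total).
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def spectral_features(signal, n_bands=8):
    # Mean log magnitude in equal-width frequency bands: a very rough
    # spectral-envelope descriptor (n_bands values).
    spec = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spec, n_bands)
    return np.array([np.log(b.mean() + 1e-8) for b in bands])

def fused_features(signal):
    # Concatenate the prosodic and spectral descriptors into one vector.
    return np.concatenate([prosodic_features(signal), spectral_features(signal)])

rng = np.random.default_rng(0)
sig = rng.standard_normal(16000)   # 1 s of synthetic "audio" at 16 kHz
feat = fused_features(sig)
print(feat.shape)                  # (12,)
```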