Computer and Modernization ›› 2021, Vol. 0 ›› Issue (12): 91-95.


Speech Emotion Recognition Based on ARCNN-GAP Network

  

(1. School of Computer Science, Qinghai Normal University, Xining 810008, China;
2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China)
• Online: 2021-12-24   Published: 2021-12-24

Abstract: Speech emotion recognition is the most direct way for computers to understand human emotions and an important channel toward intelligent human-computer interaction, but the performance of recognition models still needs to be improved. To this end, a model based on a recurrent convolutional neural network, ARCNN-GAP, is proposed for speech emotion recognition. Its recurrent convolution layers contain elastic paths that preserve network depth while ensuring gradient flow during optimization, allowing more effective features to be extracted. Global average pooling is used to reduce computational complexity and the risk of over-fitting, and an attention mechanism directs the model toward emotion-related features. Fused prosodic and spectral features are evaluated on the CASIA and EMO-DB databases, yielding recognition rates of 83.29% and 75.28% respectively. The experimental results show that the proposed ARCNN-GAP model achieves higher recognition performance and better generalization.

Key words: speech emotion recognition, recurrent convolutional neural network, global average pooling, attention mechanism
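Below is a minimal, illustrative PyTorch sketch of an ARCNN-GAP-style network following the structure described in the abstract: recurrent convolution blocks with an elastic (shortcut) path, an attention mechanism, and global average pooling over a fused prosodic/spectral feature map. The layer widths, number of recurrent steps, attention form, and input shape are assumptions made for illustration only; the abstract does not specify them.

```python
# Sketch of an ARCNN-GAP-style model (hyperparameters are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentConvBlock(nn.Module):
    """Convolution applied recurrently over several steps, with an
    elastic (shortcut) path that re-injects the block input at every
    step so gradients can bypass the recurrent unrolling."""

    def __init__(self, channels: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x
        for _ in range(self.steps):
            h = F.relu(self.bn(self.conv(h) + x))  # elastic path: add the block input back in
        return h


class AttentionPool(nn.Module):
    """Scores each time-frequency position and returns an
    attention-weighted global average over the feature map."""

    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.score(x).flatten(2), dim=-1)  # (B, 1, H*W)
        return (x.flatten(2) * w).sum(dim=-1)                # (B, C)


class ARCNNGAP(nn.Module):
    def __init__(self, n_classes: int = 6, channels: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.block1 = RecurrentConvBlock(channels)
        self.block2 = RecurrentConvBlock(channels)
        self.pool = AttentionPool(channels)
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq_bins, frames), built from fused prosodic + spectral features
        h = F.relu(self.stem(x))
        h = F.max_pool2d(self.block1(h), 2)
        h = F.max_pool2d(self.block2(h), 2)
        return self.fc(self.pool(h))


if __name__ == "__main__":
    model = ARCNNGAP(n_classes=6)               # CASIA covers 6 emotion classes
    logits = model(torch.randn(8, 1, 64, 128))  # dummy batch of feature maps
    print(logits.shape)                         # -> torch.Size([8, 6])
```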