Computer and Modernization ›› 2021, Vol. 0 ›› Issue (12): 91-95.

• Artificial Intelligence •

Speech Emotion Recognition Based on ARCNN-GAP Network

  1. (1. School of Computer Science, Qinghai Normal University, Xining 810008, China;

    2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China)
  • Online: 2021-12-24  Published: 2021-12-24
  • About the authors: QIAN Jiaqi (1996—), female, born in Bengbu, Anhui, master's student, research interests: pattern recognition and intelligent systems, E-mail: qianjiaqi_k@163.com; HUANG Heming (1969—), male (Tibetan), born in Ledu, Qinghai, professor, Ph.D., research interests: pattern recognition and intelligent systems, E-mail: 1021489068@qq.com; ZHANG Huiyun (1993—), female, born in Qingyang, Gansu, Ph.D. candidate, research interests: pattern recognition and intelligent systems, E-mail: 1406043513@qq.com.
  • Funding:
    National Natural Science Foundation of China (62066039)

Abstract: Speech emotion recognition is the most direct way for computers to understand human emotions and an important channel for making human-computer interaction intelligent, but the performance of recognition models needs further improvement. To this end, ARCNN-GAP, a speech emotion recognition model based on a recurrent convolutional neural network, is proposed. In this model, the recurrent convolutional layers have elastic paths, which preserve network depth while ensuring gradient backpropagation during optimization and extracting more effective emotion features; global average pooling reduces computational complexity and the risk of over-fitting; and the attention mechanism makes the model focus more on emotion-related features. Using the fusion of prosodic and spectral features, experiments on the CASIA and EMO-DB databases achieve recognition rates of 83.29% and 75.28%, respectively. The experimental results show that ARCNN-GAP offers better recognition performance and generalization.
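The abstract names three architectural ingredients: recurrent convolutional layers with "elastic" paths, global average pooling (GAP), and frame-level attention. The toy NumPy sketch below illustrates each mechanism in a 1-D setting. It is not the authors' implementation; all shapes, kernel sizes, iteration counts, and the random scoring vector are illustrative assumptions.

```python
import numpy as np

def recurrent_conv_step(x, state, w_ff, w_rec):
    # One unrolled iteration of a recurrent convolutional layer (1-D toy case):
    # the feed-forward convolution of the input is summed with a convolution of
    # the previous state, so gradients can flow back through either route.
    ff = np.convolve(x, w_ff, mode="same")
    rec = np.convolve(state, w_rec, mode="same")
    return np.maximum(0.0, ff + rec)  # ReLU

def global_average_pooling(feature_map):
    # (T, C) feature map -> (C,) vector: a parameter-free mean over time,
    # which is why GAP cuts computation and over-fitting risk vs. a dense layer.
    return feature_map.mean(axis=0)

def attention_pool(feature_map, w):
    # Softmax attention over time frames; frames scored higher by w contribute
    # more to the pooled summary (emotion-relevant frames get more weight).
    scores = feature_map @ w
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    return alpha @ feature_map  # (C,) attention-weighted summary

rng = np.random.default_rng(42)
x = rng.standard_normal(100)                        # 100-frame 1-D feature track
w_ff, w_rec = rng.standard_normal(3), rng.standard_normal(3)

state = np.zeros_like(x)
for _ in range(3):                                  # unroll 3 shared-weight iterations
    state = recurrent_conv_step(x, state, w_ff, w_rec)

feature_map = np.stack([x, state], axis=1)          # toy (T=100, C=2) map
gap_vec = global_average_pooling(feature_map)
att_vec = attention_pool(feature_map, rng.standard_normal(2))
print(gap_vec.shape, att_vec.shape)                 # (2,) (2,)
```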

Key words: speech emotion recognition, recurrent convolutional neural network, global average pooling, attention mechanism
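The experiments feed the model a fusion of prosodic and spectral features. The paper's exact feature set is not listed here, so the sketch below is a hypothetical stand-in: short-time energy and zero-crossing rate statistics as crude prosodic descriptors, and band-averaged log magnitudes as a crude spectral envelope. It only demonstrates the concatenation ("fusion") step, not the actual front end used in the study.

```python
import numpy as np

def prosodic_features(signal, frame_len=400, hop=160):
    # Frame-level short-time energy and zero-crossing rate, summarized by
    # mean and standard deviation over all frames (4 values in total).
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])
    return np.array([energy.mean(), energy.std(), zcr.mean(), zcr.std()])

def spectral_features(signal, n_bands=8):
    # Mean log magnitude in equal-width frequency bands: a very rough
    # spectral-envelope descriptor (n_bands values).
    spec = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spec, n_bands)
    return np.array([np.log(b.mean() + 1e-8) for b in bands])

def fused_features(signal):
    # Concatenate the prosodic and spectral descriptors into one vector.
    return np.concatenate([prosodic_features(signal), spectral_features(signal)])

rng = np.random.default_rng(0)
sig = rng.standard_normal(16000)   # 1 s of synthetic "audio" at 16 kHz
feat = fused_features(sig)
print(feat.shape)                  # (12,)
```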