计算机与现代化 (Computer and Modernization), 2020, Vol. 0, Issue (12): 83-89.

• Database and Data Mining •

Multi-modal Music Emotion Classification Based on Optimized Residual Network

  LI Xiaoshuang, HAN Lixin, LI Jingxian, ZHOU Jingwei

  1. (College of Computer and Information, Hohai University, Nanjing 211100, China)
  • Online: 2021-01-07 Published: 2021-01-07
  • About the authors: LI Xiaoshuang (1995—), male, born in Zaozhuang, Shandong, master's student; research interests: music emotion analysis, recommender systems; E-mail: 290598477@qq.com. Corresponding author: HAN Lixin, male, professor, Ph.D.; research interests: information retrieval, pattern recognition, data mining; E-mail: lixinhan2002@aliyun.com. LI Jingxian, female, Ph.D. candidate; research interests: sentiment analysis, recommender systems; E-mail: 415245727@qq.com. ZHOU Jingwei, male, master's student; research interests: computer vision, object tracking; E-mail: 513189227@qq.com.


Abstract: To address the low classification accuracy and heavy manual workload that traditional music emotion classification suffers from because feature extraction is difficult, this paper proposes a multi-modal music emotion classification method based on an optimized deep residual network. The method first uses multi-modal translation to convert the music audio modality, whose features are hard to extract, into an image modality that is easier to operate on. On top of the deep residual network, the convolution kernel size of the input layer and the shortcut connections of the residual blocks are optimized, reducing information loss and shortening computation time. In addition, to mitigate the Softmax classifier's weakness of intra-class dispersion and inter-class aggregation, a variant of the Center loss function is introduced to improve the performance of the Softmax classification function. Experimental results demonstrate the effectiveness and robustness of the optimized residual network model: compared with the original residual network, its music emotion classification accuracy improves by 4.27 percentage points.
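As a rough illustration of the modal-translation step described above, the Python sketch below renders an audio clip as a mel-spectrogram image that an image classifier can consume. The abstract does not specify the transform, so the use of librosa, the sample rate, the mel-band count, and the output size here are all assumptions.

# Minimal sketch of the modal-translation step: rendering an audio clip as a
# log-scaled mel-spectrogram image. Library choice and all parameter values
# are illustrative assumptions, not the paper's published pipeline.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def audio_to_spectrogram_image(wav_path, out_path, sr=22050, n_mels=128):
    """Save a music clip as a log-scaled mel-spectrogram image."""
    y, sr = librosa.load(wav_path, sr=sr)                  # decode and resample
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)          # log scale for contrast
    plt.figure(figsize=(2.24, 2.24), dpi=100)              # ~224x224 px, a common ResNet input size
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis("off")                                        # image only, no axes or labels
    plt.savefig(out_path, bbox_inches="tight", pad_inches=0)
    plt.close()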

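The abstract's two architectural changes, a smaller input-layer kernel and improved shortcut connections, are likewise not spelled out. The PyTorch sketch below shows one plausible reading: a 3x3 stem in place of ResNet's default 7x7 convolution, and a downsampling shortcut that average-pools before the 1x1 projection (the ResNet-D design) so that strided convolution no longer skips feature-map positions. Both choices are assumptions, not the paper's published design.

# PyTorch sketch of the two architectural tweaks named in the abstract.
# The 3x3 input stem and the average-pool downsampling shortcut are
# illustrative assumptions.
import torch.nn as nn

class SmallKernelStem(nn.Module):
    """Input layer with a smaller kernel than the default 7x7."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

def downsample_shortcut(in_ch, out_ch):
    """Shortcut for downsampling residual blocks: average-pool first, then
    project with a 1x1 convolution, so the strided 1x1 convolution of the
    original design no longer discards three quarters of the positions."""
    return nn.Sequential(
        nn.AvgPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )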

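Finally, the abstract introduces a variant of the Center loss to counter the Softmax classifier's intra-class dispersion and inter-class aggregation, without describing the variant itself. The sketch below therefore shows the standard joint objective, L = L_softmax + λ·L_center, that such variants build on; the class count, feature dimension, and balance weight are illustrative.

# Sketch of the joint Softmax + Center-loss objective underlying the
# abstract's "variant of the Center loss". Treating the centers as ordinary
# trainable parameters and the weight value below are simplifying assumptions.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalizes the distance between each feature and its class center,
    pulling samples of the same class together in feature space."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # Mean squared distance to the center of each sample's class
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

ce = nn.CrossEntropyLoss()                         # the Softmax classification loss
center = CenterLoss(num_classes=4, feat_dim=512)   # e.g. 4 emotion classes (assumed)
lam = 0.01                                         # assumed balance weight

def total_loss(logits, features, labels):
    # L = L_softmax + lambda * L_center
    return ce(logits, labels) + lam * center(features, labels)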
Key words: emotion recognition, modal translation, image classification, deep residual network, classification loss function
