Computer and Modernization (计算机与现代化), 2023, Vol. 0, Issue (11): 6-12. doi: 10.3969/j.issn.1006-2475.2023.11.002

• Algorithm Design and Analysis •

Cross-language Multi-label Sentiment Classification Based on Stacked Denoising Autoencoder

  

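  1. (1. Guangzhou Key Laboratory of Big Data and Intelligent Education, Guangzhou, Guangdong 510631; 2. School of Computer Science, South China Normal University, Guangzhou, Guangdong 510631)
  • Publication date: 2023-11-29 Release date: 2023-11-29
  • About the authors:
    TANG Shiqi (唐诗琪), born 1998, female, from Zhanjiang, Guangdong; master's student; research interests: natural language processing, sentiment classification; E-mail: 2532855590@qq.com
    ZHOU Ruiping (周瑞平), born 1998, female, from Guang'an, Sichuan; master's student; research interest: database technology; E-mail: 1658330923@qq.com
    XIE Shibin (谢仕斌), born 1997, male, from Shantou, Guangdong; master's student; research interests: educational big data, knowledge tracing; E-mail: 784995152@qq.com
    LIU Mengchi (刘梦赤), born 1962, male; professor; research interests: big data systems, intelligent information systems; E-mail: liumengchi@scnu.edu.cn
    XIAO Wen (肖文), born 1998, female, from Huizhou, Guangdong; master's student; research interest: natural language processing; E-mail: 2532855590@qq.com
  • Supported by:
    National Natural Science Foundation of China (61672389); Guangzhou Key Laboratory of Big Data and Intelligent Education Project (201905010009)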

Cross-language Multi-label Sentiment Classification Based on Stacked Denoising AutoEncoder

  1. (1. Guangzhou Key Laboratory of Big Data and Intelligent Education, Guangzhou 510631, China;
    2. School of Computer Science, South China Normal University, Guangzhou 510631, China)
  • Online: 2023-11-29 Published: 2023-11-29

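Abstract: Multi-label sentiment classification addresses the case in which a single instance may be associated with several sentiment labels. Most existing multi-label sentiment classification models are designed on the assumption of complete data, so their performance and semantics are easily affected by the incompleteness inherent in the data. To address this problem, this paper proposes a cross-language multi-label sentiment classification model based on a stacked denoising autoencoder and introduces a label-aware loss function to compensate for the loss incurred during training. The model uses the stacked denoising autoencoder to denoise word vectors and build low-dimensional features of the original data, which reduces noise interference in the feature space and provides effective feature representations for downstream tasks. In multi-label sentiment classification experiments on the SemEval2018 datasets in three languages (English, Arabic, and Spanish), the model improves all three metrics on the test set (micro_F1, macro_F1, and jaccard), with macro_F1 improving by about 0.82, 1.45, and 1.83 percentage points, respectively.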

Key words: multi-label classification, sentiment classification, incomplete data, BERT, stacked denoising autoencoder
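
To make the denoising idea in the abstract concrete, the following is a minimal sketch of a generic stacked denoising autoencoder with greedy layer-wise training. It is not the authors' implementation: the abstract does not state the noise type, activation, layer sizes, or optimizer, so the masking noise, tanh activation, tied weights, squared-error loss, and all names (`corrupt`, `DenoisingLayer`, `train_sdae`, `encode_stack`) are assumptions chosen for illustration. The input vectors stand in for word vectors, which in the paper would presumably come from BERT.

```python
import math
import random


def corrupt(x, drop_prob, rng):
    """Masking noise (assumed): zero each component with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


class DenoisingLayer:
    """One tied-weight layer: h = tanh(W x + b), x_hat = W^T h + c (assumed form)."""

    def __init__(self, n_in, n_hidden, rng):
        self.n_in, self.n_hidden = n_in, n_hidden
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden
        self.c = [0.0] * n_in

    def encode(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + bk)
                for row, bk in zip(self.W, self.b)]

    def decode(self, h):
        return [sum(self.W[k][i] * h[k] for k in range(self.n_hidden)) + self.c[i]
                for i in range(self.n_in)]

    def train_step(self, x, drop_prob, lr, rng):
        """One SGD step: reconstruct the CLEAN x from a corrupted copy of x."""
        x_noisy = corrupt(x, drop_prob, rng)
        h = self.encode(x_noisy)
        x_hat = self.decode(h)
        # Gradient of 0.5 * ||x_hat - x||^2 with respect to x_hat.
        err = [xh - xi for xh, xi in zip(x_hat, x)]
        # Back-propagate through the decoder and the tanh nonlinearity.
        dh = [sum(self.W[k][i] * err[i] for i in range(self.n_in)) * (1.0 - h[k] ** 2)
              for k in range(self.n_hidden)]
        for k in range(self.n_hidden):
            for i in range(self.n_in):
                # Tied weights: decoder-path term plus encoder-path term.
                self.W[k][i] -= lr * (err[i] * h[k] + dh[k] * x_noisy[i])
            self.b[k] -= lr * dh[k]
        for i in range(self.n_in):
            self.c[i] -= lr * err[i]
        return 0.5 * sum(e * e for e in err)


def train_sdae(data, hidden_sizes, drop_prob=0.3, lr=0.02, epochs=20, seed=0):
    """Greedy layer-wise training: each layer denoises the previous layer's codes."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in hidden_sizes:
        layer = DenoisingLayer(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for x in inputs:
                layer.train_step(x, drop_prob, lr, rng)
        layers.append(layer)
        # The next layer is trained on clean encodings of this layer's inputs.
        inputs = [layer.encode(x) for x in inputs]
    return layers


def encode_stack(layers, x):
    """Map an input vector to its low-dimensional code through all layers."""
    for layer in layers:
        x = layer.encode(x)
    return x
```

For example, `train_sdae(vectors, [4, 2])` trains two layers on 6-dimensional toy vectors, and `encode_stack(layers, vectors[0])` returns a 2-dimensional code that a downstream classifier could consume. Pure-Python lists keep the sketch dependency-free; a practical version would use a tensor library and much larger dimensions.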

Abstract: The multi-label sentiment classification task deals with the problem that an instance may be associated with multiple sentiment labels. Most existing multi-label sentiment classification models are designed for complete data, and their performance and semantics are easily affected by the incompleteness of the data itself. To address this problem, a cross-language multi-label sentiment classification model based on a stacked denoising autoencoder is proposed, and a label-aware loss function is introduced to compensate for the loss caused by training. In this model, the word vectors are denoised by the stacked denoising autoencoder to construct low-dimensional features of the original data. This reduces noise interference in the feature space and provides effective feature representations for downstream tasks. In multi-label sentiment classification experiments on the SemEval2018 datasets in three languages (English, Arabic, and Spanish), the micro_F1 score, macro_F1 score, and jaccard index of the model on the test set are all improved. Macro_F1 is improved by about 0.82, 1.45, and 1.83 percentage points, respectively.

Key words: multi-label classification, sentiment classification, incomplete data, BERT, stacked denoising autoencoder (SDAE)
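
The three metrics named in the abstract (micro_F1, macro_F1, jaccard) can be illustrated with a short sketch. The definitions below are the common multi-label ones and are assumptions, since the abstract does not spell them out: micro_F1 pools counts over all labels, macro_F1 averages per-label F1, and jaccard is averaged over samples, with an instance whose gold and predicted label sets are both empty scored as 1.0. The function name `multilabel_scores` is hypothetical.

```python
def multilabel_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1, jaccard) for 0/1 label matrices (lists of rows)."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    jaccard_sum = 0.0

    for gold, pred in zip(y_true, y_pred):
        inter = union = 0
        for j, (g, p) in enumerate(zip(gold, pred)):
            if g and p:
                tp[j] += 1
                inter += 1
            elif p:
                fp[j] += 1
            elif g:
                fn[j] += 1
            if g or p:
                union += 1
        # Assumption: two empty label sets count as a perfect match.
        jaccard_sum += inter / union if union else 1.0

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp), sum(fp), sum(fn))               # pooled over all labels
    macro = sum(f1(t, p, n) for t, p, n in zip(tp, fp, fn)) / n_labels
    jaccard = jaccard_sum / len(y_true)                 # averaged over samples
    return micro, macro, jaccard


# Toy data: 3 instances, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
micro, macro, jaccard = multilabel_scores(y_true, y_pred)
# micro = 0.8, macro = 2/3, jaccard = 2/3
```

Because macro_F1 weights every label equally, the third label (never predicted correctly here) pulls it well below micro_F1, which is why the two are usually reported together for imbalanced label sets.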

CLC Number: