Computer and Modernization (计算机与现代化)

• Algorithm Design and Analysis •

A Text Classification Method for Weak Labeling

  1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  • Received: 2015-09-02 Online: 2016-01-22 Published: 2016-01-26
  • About the authors: Liang Weichao (梁伟超, b. 1991), male, from Nanjing, Jiangsu, master's student at the School of Computer Science and Engineering, Nanjing University of Science and Technology; research interests: machine learning, data mining. Song Bin (宋斌, b. 1968), male, from Jurong, Jiangsu, associate professor, master's degree; research interests: data mining, Web information processing.

Abstract: Multi-label learning differs from traditional supervised learning: it is a framework for modeling real-world objects that carry several semantic meanings at once. Under this framework, an instance can be associated with multiple labels simultaneously. Most existing multi-label learning algorithms assume that the label set of every example is complete, but in practice some of the labels of an instance may be missing. To address this problem, we propose a classification method for weakly labeled documents. The method rests on two assumptions: that the strength of correlation differs between different pairs of labels, and that similar instances tend to have similar labels. On this basis it constructs an optimization problem to recover as many of the missing labels as possible. Experimental results show that the proposed method effectively improves the generalization performance of the learning system.

Key words: weak labeling, document classification, multi-label learning, machine learning, data mining
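
The abstract does not give the form of the optimization problem, so the following is only an illustrative sketch of the two stated assumptions, not the authors' method. It swaps in a simple label-propagation-style iteration: scores for missing labels are pulled toward the labels of similar documents (cosine similarity) and toward labels that co-occur with the document's known labels. The function name `complete_labels`, the weights `alpha` and `beta`, and the toy data are all hypothetical.

```python
import numpy as np

def complete_labels(X, Y, alpha=0.4, beta=0.3, n_iter=50):
    """Illustrative label completion (hypothetical helper, not the paper's algorithm).

    X: (n, d) document features.
    Y: (n, q) 0/1 label matrix; 1 = known relevant, 0 = irrelevant OR missing.
    Returns an (n, q) matrix of scores in [0, 1]; known positives stay at 1.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Assumption 2: similar instances have similar labels.
    # Row-normalized non-negative cosine similarity, without self-similarity.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    S = np.clip(Xn @ Xn.T, 0.0, None)
    np.fill_diagonal(S, 0.0)
    S /= np.maximum(S.sum(axis=1, keepdims=True), 1e-12)

    # Assumption 1: different label pairs are correlated to different degrees.
    # C[j, k] starts as an estimate of P(label k | label j) from observed labels,
    # then each column is normalized so a label's incoming weights sum to at most 1.
    co = Y.T @ Y
    counts = np.maximum(np.diag(co), 1.0)
    C = co / counts[:, None]
    np.fill_diagonal(C, 0.0)
    C /= np.maximum(C.sum(axis=0, keepdims=True), 1e-12)

    # Fixed-point iteration mixing the observed labels, neighbor labels,
    # and correlated labels; observed positives are never lowered.
    F = Y.copy()
    for _ in range(n_iter):
        F = (1.0 - alpha - beta) * Y + alpha * (S @ F) + beta * (F @ C)
        F = np.clip(np.maximum(F, Y), 0.0, 1.0)
    return F

# Toy data (hypothetical): documents 0 and 1 have identical features,
# but document 1 is missing its second label.
X_demo = [[1, 0], [1, 0], [0, 1], [0, 1]]
Y_demo = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
F_demo = complete_labels(X_demo, Y_demo)
```

In this toy case the score of the second label for document 1 rises well above its scores for unrelated labels, because both a near-identical document and a co-occurring label support it. Thresholding the scores to obtain completed labels, and choosing `alpha` and `beta`, are left open here.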

CLC number: