A Text Classification Method for Weak Labeling

doi:10.3969/j.issn.1006-2475.2016.01.017

Abstract

Abstract: Multi-label learning differs from traditional supervised learning: it is a framework proposed for modeling real-world objects that carry several semantic meanings at once. Under this framework, an instance can be associated with multiple labels simultaneously. Most existing multi-label learning algorithms assume that the label set of every example is complete, but in practice some of the labels associated with an instance may be missing. To address this problem, we propose a classification method for weakly labeled documents. The method rests on two assumptions: correlations between different labels differ, and similar instances tend to have similar labels. From these it constructs an optimization problem that recovers as many of the missing labels as possible. Experimental results show that the method effectively improves the generalization performance of the learning system.


Key words: weak labeling, document classification, multi-label learning, machine learning, data mining
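
The abstract does not state the optimization problem itself, so the following is only a minimal sketch of the two assumptions it names, not the authors' formulation. It swaps in a simple propagation-style fixed-point iteration: each document's label scores are pulled toward the scores of similar documents (instance similarity) and toward labels that co-occur with the labels it already has (label correlation). The function name `complete_labels`, the weights `alpha` and `beta`, the cosine similarity, and the co-occurrence estimate are all assumptions made for illustration.

```python
import numpy as np


def complete_labels(X, Y, alpha=0.4, beta=0.3, n_iter=20, threshold=0.5):
    """Fill in likely-missing labels in a 0/1 matrix Y (documents x labels).

    Illustrative heuristic only; every modelling choice here is an assumption.
    Returns (completed 0/1 matrix, real-valued score matrix).
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)

    # Instance similarity: non-negative cosine similarity, no self-loops,
    # each row normalised to sum to 1 (rows with no neighbours stay all-zero).
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    S = np.maximum(Xn @ Xn.T, 0.0)
    np.fill_diagonal(S, 0.0)
    S = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)

    # Label correlation: C[j, k] estimates P(label k | label j) from the
    # observed co-occurrence counts. It is asymmetric, so different label
    # pairs carry different strengths.
    co = Y.T @ Y
    C = co / np.maximum(np.diag(co), 1.0)[:, None]
    np.fill_diagonal(C, 0.0)

    F = Y.copy()
    for _ in range(n_iter):
        from_neighbours = S @ F                    # similar documents vote
        from_labels = np.clip(F @ C, 0.0, 1.0)     # correlated labels vote
        F = alpha * from_neighbours + beta * from_labels + (1 - alpha - beta) * Y
        F = np.maximum(F, Y)                       # observed labels are never removed

    return (F >= threshold).astype(int), F
```

As a usage example, take four documents with features `[[1, 0], [1, 0], [0, 1], [0, 1]]` and observed labels `[[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]`. The second document is identical to the first but lacks its second label; the sketch fills that label in and leaves the other two documents unchanged. A real system would tune `alpha`, `beta`, and `threshold` on held-out data.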

CLC number:

TP181

LIANG Wei-chao, SONG Bin. A Text Classification Method for Weak Labeling[J]. Computer and Modernization, 2016, 0(1): 77-81,86.

