Computer and Modernization

• Artificial Intelligence

An Imbalanced Data Classification of Hybrid Sampling Based on Clustering

  1. (School of Mathematics, South China University of Technology, Guangzhou 510641, Guangdong, China)
  • Received: 2019-09-17 Online: 2020-05-20 Published: 2020-05-21
  • About the authors: Shi Minghua (1995-), female, from Shangqiu, Henan; master's student; research interests: machine learning, data mining; E-mail: 1043135305@qq.com. Wu Guangchao (1972-), male, from Chaoyang, Guangdong; associate professor, master's supervisor, Ph.D.; research interests: machine learning, data mining; E-mail: magchwu@scut.edu.cn.

Abstract: Imbalanced classification problems arise widely in real-world applications. Most resampling algorithms focus on balancing the classes against each other and pay little attention to imbalanced data distribution within a class. To address this, a hybrid sampling algorithm based on clustering is proposed. First, the original data set is clustered. Then the imbalance ratio is computed for the samples in each cluster, and each cluster is processed according to the size of its imbalance ratio. Finally, the balanced data set is used to train a GBDT classifier. Experiments show that the algorithm achieves higher F1-value and AUC than several traditional algorithms, indicating better classification performance.

Key words: imbalanced data, clustering, hybrid sampling, GBDT
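
The pipeline outlined in the abstract (cluster, compute a per-cluster imbalance ratio, resample each cluster, train GBDT) can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the thresholds, or the per-cluster sampling rule, so everything concrete here is an assumption made for illustration: KMeans clustering, the hypothetical function `cluster_hybrid_resample`, the threshold `ir_high`, random under- and oversampling toward the midpoint of the two class counts, and leaving single-class clusters untouched. It is not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def cluster_hybrid_resample(X, y, n_clusters=5, ir_high=2.0, seed=0):
    """Resample each cluster according to its imbalance ratio.

    Assumed conventions (not from the paper): label 0 is the majority
    class, label 1 the minority class, and a cluster is rebalanced only
    when majority/minority exceeds ir_high.
    """
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        maj = idx[y[idx] == 0]
        mino = idx[y[idx] == 1]
        if len(maj) == 0 or len(mino) == 0:
            keep.append(idx)  # single-class cluster: left as-is (assumption)
            continue
        ir = len(maj) / len(mino)  # imbalance ratio of this cluster
        if ir > ir_high:
            # Hybrid step: shrink the majority and grow the minority
            # until both reach the midpoint of the two counts.
            target = (len(maj) + len(mino)) // 2
            maj = rng.choice(maj, size=target, replace=False)
            extra = rng.choice(mino, size=target - len(mino), replace=True)
            mino = np.concatenate([mino, extra])
        keep.append(np.concatenate([maj, mino]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Synthetic data with roughly 9:1 class imbalance, for illustration only.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Resample the training split only, then train the GBDT classifier.
X_bal, y_bal = cluster_hybrid_resample(X_tr, y_tr)
clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)

print("F1 :", f1_score(y_te, clf.predict(X_te)))
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```

Resampling is applied to the training split only, so the F1 and AUC values are measured on data with the original class distribution. Because clusters containing only majority samples are left untouched in this sketch, the result is less imbalanced than the input but not necessarily fully balanced.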

CLC number: