Computer and Modernization ›› 2021, Vol. 0 ›› Issue (11): 72-76.

• Algorithm Design and Analysis •

Ensemble Classification Algorithm Based on Clustering Undersampling

  

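  (1. School of Management Science and Engineering, Anhui University of Technology, Ma'anshan 243000, Anhui, China;
    2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, Anhui, China)
  • Publication date: 2021-12-13 Release date: 2021-12-13
  • About the authors: ZHOU Chuanhua (born 1965), male, from Ma'anshan, Anhui; professor, master's supervisor, Ph.D., CCF member (E8785M); research interests: machine learning, data mining, intelligent algorithms; E-mail: chzhou863@ustc.edu. Corresponding author: ZHU Junjie (born 1998), male, from Hefei, Anhui; master's student; research interests: machine learning, data mining; E-mail: 459135441@qq.com. XU Wenqian (born 1996), female, from Jining, Shandong; master's student; research interests: intelligent algorithms, data mining; E-mail: 1249737289@qq.com. DENG Jiajia (born 1995), male, from Bozhou, Anhui; master's student; research interests: data mining, machine learning; E-mail: 1340777962@qq.com.
  • Supported by:
    National Natural Science Foundation of China (71772002, 61702006); Anhui Provincial Universities Key Laboratory of Multidisciplinary Management and Control of Complex Systems (CS2020-04)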

Integrated Classification Algorithm Based on Clustering and Undersampling

  (1. School of Management Science and Engineering, Anhui University of Technology, Ma’anshan 243000, China;
    2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, China)
  • Online: 2021-12-13 Published: 2021-12-13

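Abstract: Imbalanced data are common across application domains, and traditional classifiers tend to focus on majority-class samples, which leads to unsatisfactory classification results. To address this problem, an ensemble classification algorithm based on clustering undersampling (ClusterUndersampling-AdaCost, CU-AdaCost) is proposed. The algorithm computes dimension-weighted Euclidean distances between samples to obtain the centre of each cluster, then selects the majority-class samples with stronger information features according to the neighbourhood range of each cluster centre, forming a new training set. The training set is then fed to an ensemble algorithm that introduces a cost-sensitive adjustment function, so that the model pays more attention to the minority class. Comparative experiments on six UCI datasets show that the samples drawn during undersampling are highly representative and that the algorithm effectively improves the model's classification performance on the minority class.

Key words: imbalanced data, clustering, undersampling, cost-sensitive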

Abstract: Imbalanced data occur in many application areas, and traditional classifiers tend to focus on majority-class samples, which results in unsatisfactory classification. To address this problem, an ensemble classification algorithm based on clustering undersampling (ClusterUndersampling-AdaCost, CU-AdaCost) is proposed. The algorithm derives the centre of each cluster by calculating the dimension-weighted Euclidean distance between samples, and selects the majority-class samples with strong information features according to the neighbourhood range of each cluster centre to form a new training set. The training set is then passed to an ensemble algorithm that introduces a cost-sensitive adjustment function, so that the model focuses more on the minority class. Comparison experiments on six UCI datasets show that the samples drawn in the undersampling process are strongly representative and that the algorithm effectively improves the model's classification performance on the minority class.

Key words: imbalanced data, clustering, undersampling, cost-sensitive
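
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the dimension weights are derived, how the neighbourhood range around each centre is defined, or what form the cost-sensitive adjustment function takes. The sketch therefore assumes that the weights `w` are supplied by the caller, that "neighbourhood" means the samples nearest each centre with a quota proportional to cluster size, and that the adjustment follows the commonly cited AdaCost form (0.5 - 0.5c for correct predictions, 0.5 + 0.5c for errors). All function names are hypothetical.

```python
import math
import random


def weighted_distance(a, b, w):
    """Euclidean distance with one non-negative weight per dimension."""
    return math.sqrt(sum(wd * (x - y) ** 2 for x, y, wd in zip(a, b, w)))


def nearest_center(p, centers, w):
    """Index of the centre closest to p under the weighted distance."""
    return min(range(len(centers)),
               key=lambda j: weighted_distance(p, centers[j], w))


def weighted_kmeans(points, k, w, iters=20, seed=0):
    """Lloyd-style k-means using the weighted distance; returns k centres."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_center(p, centers, w)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def cluster_undersample(majority, n_keep, k, w, seed=0):
    """Keep roughly n_keep majority samples, nearest to their cluster centre.

    Assumption: each cluster contributes a share proportional to its size.
    """
    centers = weighted_kmeans(majority, k, w, seed=seed)
    clusters = [[] for _ in range(k)]
    for p in majority:
        clusters[nearest_center(p, centers, w)].append(p)
    kept = []
    for j, members in enumerate(clusters):
        quota = round(n_keep * len(members) / len(majority))
        members.sort(key=lambda p: weighted_distance(p, centers[j], w))
        kept.extend(members[:quota])
    return kept


def adacost_reweight(weights, y_true, y_pred, costs, alpha):
    """One AdaCost-style weight update (labels are +1/-1, costs in [0, 1]).

    Assumption: beta = 0.5 - 0.5*c when correct, 0.5 + 0.5*c when wrong, so
    costly misclassified samples gain the most weight.
    """
    updated = []
    for w_i, y, h, c in zip(weights, y_true, y_pred, costs):
        beta = 0.5 - 0.5 * c if y == h else 0.5 + 0.5 * c
        updated.append(w_i * math.exp(-alpha * y * h * beta))
    total = sum(updated)
    return [v / total for v in updated]


# Toy usage: two well-separated majority blobs reduced to about 6 samples.
majority = ([(i * 0.1, 0.0) for i in range(10)]
            + [(10 + i * 0.1, 0.0) for i in range(10)])
kept = cluster_undersample(majority, n_keep=6, k=2, w=[1.0, 1.0])
```

In a full pipeline the retained majority samples would be combined with all minority samples, and `adacost_reweight` would be called once per boosting round with a higher cost assigned to minority samples. Because the per-cluster quotas are rounded, the number of retained samples can differ slightly from `n_keep`.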