Integrated Classification Algorithm Based on Clustering and Undersampling

Computer and Modernization ›› 2021, Vol. 0 ›› Issue (11): 72-76.

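(1. School of Management Science and Engineering, Anhui University of Technology, Ma'anshan 243000, Anhui, China;
2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, Anhui, China)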

Publication date: 2021-12-13 Release date: 2021-12-13
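About the authors: ZHOU Chuan-hua (b. 1965), male, from Ma'anshan, Anhui; professor, master's supervisor, Ph.D., CCF member (E8785M); research interests: machine learning, data mining, intelligent algorithms; E-mail: chzhou863@ustc.edu. Corresponding author: ZHU Jun-jie (b. 1998), male, from Hefei, Anhui; master's student; research interests: machine learning, data mining; E-mail: 459135441@qq.com. XU Wen-qian (b. 1996), female, from Jining, Shandong; master's student; research interests: intelligent algorithms, data mining; E-mail: 1249737289@qq.com. DENG Jia-jia (b. 1995), male, from Bozhou, Anhui; master's student; research interests: data mining, machine learning; E-mail: 1340777962@qq.com.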
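Funding:
National Natural Science Foundation of China (71772002, 61702006); Key Laboratory of Multidisciplinary Management and Control of Complex Systems of Anhui Higher Education Institutes (CS2020-04)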

Abstract

Abstract: Imbalanced data arise in many application domains, and conventional classifiers tend to favor the majority class, so minority-class samples are classified poorly. To address this problem, an ensemble classification algorithm based on clustering undersampling, ClusterUndersampling-AdaCost (CU-AdaCost), is proposed. The algorithm computes the dimension-weighted Euclidean distance between samples to locate the center of each cluster, and then selects the most informative majority-class samples according to the neighborhood of each cluster center, forming a new training set. This training set is fed to an ensemble algorithm that incorporates a cost-sensitive adjustment function, so that the model pays more attention to the minority class. Comparative experiments on six UCI datasets show that the samples drawn during undersampling are highly representative and that the algorithm effectively improves classification performance on the minority class.

Key words: unbalanced data, clustering, undersampling, cost sensitive
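
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the dimension weights are obtained, how the neighborhood of a cluster center is defined, or what the cost-sensitive adjustment function looks like. The sketch therefore assumes that the weights `w` and the initial centers are supplied by the caller, that the "neighborhood" is simply the `per_cluster` points nearest each center, and that the weight update follows the classic AdaCost form. The names `weighted_distances`, `cluster_undersample`, and `adacost_reweight` are hypothetical.

```python
import numpy as np

def weighted_distances(X, centers, w):
    """Dimension-weighted Euclidean distance from every row of X to every center."""
    diff = X[:, None, :] - centers[None, :, :]
    return np.sqrt((w * diff ** 2).sum(axis=2))

def cluster_undersample(X_maj, init_centers, w, per_cluster, n_iter=20):
    """Weighted k-means on the majority class, then keep the points nearest each center.

    Assumption: the 'neighborhood' of a center is its per_cluster nearest members.
    Returns the sorted indices (into X_maj) of the retained majority samples.
    """
    X_maj = np.asarray(X_maj, dtype=float)
    w = np.asarray(w, dtype=float)
    centers = np.array(init_centers, dtype=float)  # copy, so the caller's array is untouched
    for _ in range(n_iter):
        labels = weighted_distances(X_maj, centers, w).argmin(axis=1)
        for k in range(len(centers)):
            members = X_maj[labels == k]
            if len(members):  # an empty cluster keeps its previous center
                centers[k] = members.mean(axis=0)
    dist = weighted_distances(X_maj, centers, w)
    labels = dist.argmin(axis=1)
    keep = []
    for k in range(len(centers)):
        idx = np.flatnonzero(labels == k)
        keep.extend(idx[np.argsort(dist[idx, k])[:per_cluster]].tolist())
    return np.sort(np.array(keep, dtype=int))

def adacost_reweight(D, y, h, cost, alpha):
    """One AdaCost-style update of the sample weights D.

    y and h are true and predicted labels in {-1, +1}; cost lies in [0, 1].
    Assumption: the classic adjustment beta = 0.5 - 0.5*cost for correct
    predictions and 0.5 + 0.5*cost for errors, which may differ from the
    adjustment function used in the paper.
    """
    D, y, h, cost = (np.asarray(a, dtype=float) for a in (D, y, h, cost))
    beta = np.where(y == h, 0.5 - 0.5 * cost, 0.5 + 0.5 * cost)
    D_new = D * np.exp(-alpha * y * h * beta)
    return D_new / D_new.sum()
```

In this sketch the retained majority indices would be combined with all minority samples to form the reduced training set, and giving the minority class a higher `cost` makes its misclassified samples gain weight faster across boosting rounds.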

ZHOU Chuan-hua, ZHU Jun-jie, XU Wen-qian, DENG Jia-jia. Integrated Classification Algorithm Based on Clustering and Undersampling[J]. Computer and Modernization, 2021, 0(11): 72-76.
