Computer and Modernization

• Algorithm Design and Analysis •

A Random Forest Algorithm for Imbalanced Classification

  

  • (1. School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology, Zhangjiagang 215600, China; 2. School of Urban Rail Transportation, Soochow University, Suzhou 215000, China)
  • Received: 2018-05-07  Online: 2019-01-03  Published: 2019-01-04
  • About the authors: SHEN Zhiyong (1997-), male, born in Suzhou, Jiangsu, undergraduate student at the School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology, research interest: data mining; corresponding author: SU Chong (1977-), male, born in Nanjing, Jiangsu, professor-level senior engineer, Ph.D., research interests: machine learning and data mining; ZHOU Yang (1998-), male, born in Yancheng, Jiangsu, undergraduate student, research interest: data mining; SHEN Zhiwei (1997-), male, born in Suzhou, Jiangsu, undergraduate student at the School of Urban Rail Transportation, Soochow University, research interest: data mining.
  • Supported by:
    China Postdoctoral Science Foundation (2016M600430); Open Fund of Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering (2016KLA08)

Abstract: Random Forest is a simple and effective ensemble learning algorithm. By combining bootstrap sampling with randomized feature subsets, it increases the diversity of the ensemble and thereby builds more accurate ensemble classifiers than Bagging and Boosting. However, on imbalanced classification problems, the splitting criterion used to grow each tree, the Gini index, has been shown to be skew-sensitive, which degrades the classification accuracy of Random Forest to some extent. This paper proposes a Random Forest that uses the Kullback-Leibler (K-L) divergence as its splitting criterion. Taking the area under the ROC curve (AUC) as the performance metric, the proposed K-L Random Forest is compared with Random Forest, Balanced Random Forest, and a Bagging ensemble of Hellinger distance decision trees on both lowly imbalanced and highly imbalanced datasets. It not only outperforms the other classifiers on more than 70% of the experimental datasets, but also achieves the best average AUC in both settings: 0.938 on the lowly imbalanced datasets and 0.937 on the highly imbalanced ones. These results indicate that using the K-L divergence as the splitting criterion can effectively improve the classification performance of Random Forest on imbalanced problems.
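
For reference, the Kullback-Leibler (K-L) divergence between two discrete distributions P = (p_i) and Q = (q_i) is the standard quantity below; a symmetrized form is often used when a criterion must treat both distributions alike. How the paper instantiates P and Q at each tree node is not stated in this abstract.

```latex
D_{\mathrm{KL}}(P \parallel Q) = \sum_i p_i \log \frac{p_i}{q_i},
\qquad
D_{\mathrm{sym}}(P, Q) = D_{\mathrm{KL}}(P \parallel Q) + D_{\mathrm{KL}}(Q \parallel P)
```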

Key words: imbalanced classification, K-L divergence, random forest, balanced random forest, Bagging
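
As a concrete illustration, the sketch below scores a candidate binary split with a symmetrized K-L divergence between the class-conditional branch distributions, in the same spirit as the Hellinger distance criterion of Hellinger decision trees. This is a minimal sketch under assumed definitions: the function names, the smoothing, and the exact formulation are illustrative, not the paper's published method.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); eps-smoothing avoids log(0).
    # Inputs may be raw counts; they are normalized to distributions here.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_split_score(y_left, y_right, pos_label=1):
    # Hypothetical criterion: symmetrized K-L divergence between
    # P(branch | positive) and P(branch | negative) for a candidate split.
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    pos = [(y_left == pos_label).sum(), (y_right == pos_label).sum()]
    neg = [(y_left != pos_label).sum(), (y_right != pos_label).sum()]
    return kl_divergence(pos, neg) + kl_divergence(neg, pos)

# A split that isolates the minority class scores higher than a mixed one.
print(kl_split_score([1, 1, 1, 0], [0, 0, 0, 0, 0, 0]))  # clean split: large score
print(kl_split_score([1, 0, 0, 1], [0, 1, 0, 0, 0, 0]))  # mixed split: small score
```

Because each branch distribution is normalized within its own class, the score depends only on how each class distributes across the branches, not on the relative sizes of the two classes at the node, which is the usual argument for why such criteria are less skew-sensitive than the Gini index.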

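The evaluation protocol described in the abstract, AUC on imbalanced data, can be reproduced with standard tooling. The sketch below is a stand-in, assuming scikit-learn and a synthetic dataset; it uses the stock RandomForestClassifier, since the K-L variant is not part of that library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (~5% positives) standing in for the paper's benchmarks.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])  # AUC, as in the paper
print(f"AUC: {auc:.3f}")
```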
CLC number: