Computer and Modernization

• Algorithm Design and Analysis •

A Random Forest Algorithm for Imbalanced Classification

  

  • (1. School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology, Zhangjiagang 215600, China; 2. School of Urban Rail Transportation, Soochow University, Suzhou 215000, China)
  • Received: 2018-05-07  Online: 2019-01-03  Published: 2019-01-04
  • About the authors: SHEN Zhiyong (1997-), male, born in Suzhou, Jiangsu, undergraduate student at the School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology, research interest: data mining; corresponding author: SU Chong (1977-), male, born in Nanjing, Jiangsu, professor-level senior engineer, Ph.D., research interests: machine learning and data mining; ZHOU Yang (1998-), male, born in Yancheng, Jiangsu, undergraduate student, research interest: data mining; SHEN Zhiwei (1997-), male, born in Suzhou, Jiangsu, undergraduate student at the School of Urban Rail Transportation, Soochow University, research interest: data mining.
  • Supported by:
    China Postdoctoral Science Foundation (2016M600430); Open Fund of Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering (2016KLA08)

Abstract: Random Forest is a simple and effective ensemble learning algorithm. By combining bootstrap sampling with randomized feature subsets, it increases the diversity of the ensemble and thereby builds more accurate ensemble classifiers than Bagging and Boosting. However, on imbalanced classification problems, the splitting criterion used to grow each tree, the Gini index, has been shown to be skew-sensitive, which degrades the classification accuracy of Random Forest to some extent. This paper proposes a Random Forest that uses the Kullback-Leibler (K-L) divergence as its splitting criterion. Taking the area under the ROC curve (AUC) as the performance metric, the proposed K-L Random Forest is compared with Random Forest, Balanced Random Forest, and a Bagging ensemble of Hellinger distance decision trees on both lowly imbalanced and highly imbalanced datasets. It not only outperforms the other classifiers on more than 70% of the experimental datasets, but also achieves the best average AUC in both settings: 0.938 on the lowly imbalanced datasets and 0.937 on the highly imbalanced ones. These results indicate that using the K-L divergence as the splitting criterion can effectively improve the classification performance of Random Forest on imbalanced problems.
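
For reference, the Kullback-Leibler (K-L) divergence between two discrete distributions P = (p_i) and Q = (q_i) is the standard quantity below; a symmetrized form is often used when a criterion must treat both distributions alike. How the paper instantiates P and Q at each tree node is not stated in this abstract.

```latex
D_{\mathrm{KL}}(P \parallel Q) = \sum_i p_i \log \frac{p_i}{q_i},
\qquad
D_{\mathrm{sym}}(P, Q) = D_{\mathrm{KL}}(P \parallel Q) + D_{\mathrm{KL}}(Q \parallel P)
```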

Key words: imbalanced classification, K-L divergence, random forest, balanced random forest, Bagging
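
As a concrete illustration, the sketch below scores a candidate binary split with a symmetrized K-L divergence between the class-conditional branch distributions, in the same spirit as the Hellinger distance criterion of Hellinger decision trees. This is a minimal sketch under assumed definitions: the function names, the smoothing, and the exact formulation are illustrative, not the paper's published method.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_i p_i * log(p_i / q_i); eps-smoothing avoids log(0).
    # Inputs may be raw counts; they are normalized to distributions here.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_split_score(y_left, y_right, pos_label=1):
    # Hypothetical criterion: symmetrized K-L divergence between
    # P(branch | positive) and P(branch | negative) for a candidate split.
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    pos = [(y_left == pos_label).sum(), (y_right == pos_label).sum()]
    neg = [(y_left != pos_label).sum(), (y_right != pos_label).sum()]
    return kl_divergence(pos, neg) + kl_divergence(neg, pos)

# A split that isolates the minority class scores higher than a mixed one.
print(kl_split_score([1, 1, 1, 0], [0, 0, 0, 0, 0, 0]))  # clean split: large score
print(kl_split_score([1, 0, 0, 1], [0, 1, 0, 0, 0, 0]))  # mixed split: small score
```

Because each branch distribution is normalized within its own class, the score depends only on how each class distributes across the branches, not on the relative sizes of the two classes at the node, which is the usual argument for why such criteria are less skew-sensitive than the Gini index.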

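The evaluation protocol described in the abstract, AUC on imbalanced data, can be reproduced with standard tooling. The sketch below is a stand-in, assuming scikit-learn and a synthetic dataset; it uses the stock RandomForestClassifier, since the K-L variant is not part of that library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (~5% positives) standing in for the paper's benchmarks.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])  # AUC, as in the paper
print(f"AUC: {auc:.3f}")
```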
CLC number: