Computer and Modernization ›› 2018, Vol. 0 ›› Issue (12): 56-. doi: 10.3969/j.issn.1006-2475.2018.12.011


A Random Forest Algorithm for Imbalanced Classification


(1. School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology,
  Zhangjiagang 215600, China; 2. School of Urban Rail Transportation, Soochow University, Suzhou 215000, China)
Received: 2018-05-07    Online: 2019-01-03    Published: 2019-01-04

Abstract: The Random Forest algorithm is a simple and effective ensemble learning method. It increases the diversity of its base classifiers by selecting feature subsets or rotating the feature space, and builds more accurate and diverse classifiers than Bagging and Boosting. However, the splitting criterion used to construct each tree in a Random Forest is the Gini index, which has been shown to be skew-sensitive: when learning from highly imbalanced datasets, class imbalance impedes the algorithm's ability to learn the minority class concept. This paper instead uses K-L divergence as the splitting criterion for building each tree in the Random Forest. An experimental study across a wide range of imbalanced datasets investigates the effectiveness of the K-L divergence based Random Forest, comparing it with Random Forest, Balanced Random Forest, and Bagging with Hellinger distance decision trees in terms of area under the ROC curve (AUC). The results show that the K-L divergence based Random Forest not only outperforms the others on more than 70% of the imbalanced datasets used in the experiments, but also achieves the best average AUC, reaching 0.938 on the lowly imbalanced datasets and 0.937 on the highly imbalanced datasets. We conclude that using K-L divergence as the splitting criterion improves the performance of Random Forest on imbalanced classification.
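For intuition, the K-L divergence criterion measures how differently the two classes distribute over the branches of a candidate split, D_KL(P||Q) = Σ_i p_i log(p_i / q_i), where P and Q are the class-conditional branch distributions; because each distribution is normalized within its own class, the score is largely insensitive to the class priors, unlike the Gini index. The following Python sketch illustrates one way such a score could be computed for a binary split; the function name kl_split_score, the binary-label restriction, and the eps smoothing are illustrative assumptions, not the authors' exact formulation.

    import numpy as np

    def kl_split_score(y_left, y_right, pos_label=1, eps=1e-9):
        # Score a candidate binary split by the K-L divergence between the
        # class-conditional branch distributions (illustrative sketch only).
        y_left, y_right = np.asarray(y_left), np.asarray(y_right)
        n_pos = (y_left == pos_label).sum() + (y_right == pos_label).sum()
        n_neg = (y_left != pos_label).sum() + (y_right != pos_label).sum()
        # P(branch | positive class): how the minority class spreads over the branches
        p = np.array([(y_left == pos_label).sum(),
                      (y_right == pos_label).sum()]) / max(n_pos, 1)
        # P(branch | negative class): how the majority class spreads over the branches
        q = np.array([(y_left != pos_label).sum(),
                      (y_right != pos_label).sum()]) / max(n_neg, 1)
        # eps smoothing keeps the logarithm finite when a branch is pure
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))

    # A split that separates the classes scores high; one that does not scores ~0:
    print(kl_split_score([0, 0, 0, 0], [1, 1, 1, 0]))  # ~1.61
    print(kl_split_score([0, 1, 0, 1], [0, 1, 0, 1]))  # ~0.0

In a full implementation, such a score would replace the Gini gain when ranking candidate splits at each node, leaving the rest of the Random Forest procedure (bootstrap sampling and random feature selection) unchanged.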

Key words:  imbalanced classification, K-L divergence, random forest, balanced random forest, Bagging
