Computer and Modernization ›› 2018, Vol. 0 ›› Issue (12): 56-. doi: 10.3969/j.issn.1006-2475.2018.12.011


A Random Forest Algorithm for Imbalanced Classification


(1. School of Electrical and Information Engineering, Zhangjiagang Branch, Jiangsu University of Science and Technology,
  Zhangjiagang 215600, China; 2. School of Urban Rail Transportation, Soochow University, Suzhou 215000, China)
Received: 2018-05-07    Online: 2019-01-03    Published: 2019-01-04

Abstract: The Random Forest algorithm is a simple and effective ensemble learning method. It increases the diversity of its base classifiers by selecting feature subsets or rotating the feature space, and builds more accurate and diverse classifiers than Bagging and Boosting. However, the splitting criterion used to construct each tree in a Random Forest is the Gini index, which has been shown to be skew-sensitive: when learning from highly imbalanced datasets, class imbalance impedes the algorithm's ability to learn the minority class concept. This paper instead uses K-L divergence as the splitting criterion for building each tree in the Random Forest. An experimental study across a wide range of imbalanced datasets investigates the effectiveness of the K-L divergence based Random Forest, comparing it with Random Forest, Balanced Random Forest, and Bagging with Hellinger distance decision trees in terms of area under the ROC curve (AUC). The results show that the K-L divergence based Random Forest not only outperforms the others on more than 70% of the imbalanced datasets used in the experiments, but also achieves the best average AUC, reaching 0.938 on the lowly imbalanced datasets and 0.937 on the highly imbalanced datasets. We conclude that using K-L divergence as the splitting criterion improves the performance of Random Forest on imbalanced classification.
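For intuition, the K-L divergence criterion measures how differently the two classes distribute over the branches of a candidate split, D_KL(P||Q) = Σ_i p_i log(p_i / q_i), where P and Q are the class-conditional branch distributions; because each distribution is normalized within its own class, the score is largely insensitive to the class priors, unlike the Gini index. The following Python sketch illustrates one way such a score could be computed for a binary split; the function name kl_split_score, the binary-label restriction, and the eps smoothing are illustrative assumptions, not the authors' exact formulation.

    import numpy as np

    def kl_split_score(y_left, y_right, pos_label=1, eps=1e-9):
        # Score a candidate binary split by the K-L divergence between the
        # class-conditional branch distributions (illustrative sketch only).
        y_left, y_right = np.asarray(y_left), np.asarray(y_right)
        n_pos = (y_left == pos_label).sum() + (y_right == pos_label).sum()
        n_neg = (y_left != pos_label).sum() + (y_right != pos_label).sum()
        # P(branch | positive class): how the minority class spreads over the branches
        p = np.array([(y_left == pos_label).sum(),
                      (y_right == pos_label).sum()]) / max(n_pos, 1)
        # P(branch | negative class): how the majority class spreads over the branches
        q = np.array([(y_left != pos_label).sum(),
                      (y_right != pos_label).sum()]) / max(n_neg, 1)
        # eps smoothing keeps the logarithm finite when a branch is pure
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))

    # A split that separates the classes scores high; one that does not scores ~0:
    print(kl_split_score([0, 0, 0, 0], [1, 1, 1, 0]))  # ~1.61
    print(kl_split_score([0, 1, 0, 1], [0, 1, 0, 1]))  # ~0.0

In a full implementation, such a score would replace the Gini gain when ranking candidate splits at each node, leaving the rest of the Random Forest procedure (bootstrap sampling and random feature selection) unchanged.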

Key words:  imbalanced classification, K-L divergence, random forest, balanced random forest, Bagging
