Computer and Modernization ›› 2020, Vol. 0 ›› Issue (06): 83-.

Comparative Study of Oversampling and Ensemble Learning Methods in Software Defect Prediction

  (1. College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China;
   2. Department of Archives, Qingdao University of Science and Technology, Qingdao 266061, China)
  Received: 2019-11-22   Online: 2020-06-24   Published: 2020-06-28

Abstract: Software defect prediction has attracted considerable research attention in recent years. The class-imbalance problem is common in this setting: defective samples are far fewer than non-defective samples, yet defective samples are the focus of prediction. Because of this imbalance, defect prediction models often fail to meet users' requirements, so the imbalanced data must be handled effectively. Sampling-based methods and ensemble learning methods are currently two of the main approaches to imbalanced data, and researchers have proposed many different oversampling and ensemble learning methods. This paper studies how best to combine the two kinds of methods to address class imbalance in defect prediction. To that end, it selects four common oversampling methods (RandomOverSampler, SMOTE, Borderline-SMOTE and ADASYN) and four commonly used ensemble learning methods (Bagging, Random Forest, AdaBoost and GBDT), and pairs one oversampling method with one ensemble method at a time to form the candidate combinations. Comparing the defect prediction performance of these combinations identifies the best one, which may offer useful insight into handling imbalance in defect prediction. Experimental results show that the oversampling method ADASYN has the advantage in dealing with the imbalance problem, and that ADASYN combined with the ensemble method GBDT is the best combination, with better defect prediction performance than the other combinations.
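
The experimental design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract does not give. It assumes that every oversampling method is paired with every ensemble method (4 x 4 = 16 combinations), and it shows only the simplest of the four oversamplers, random oversampling, written in plain Python. The function name `random_oversample` and the toy data are hypothetical; in practice the resampling and the classifiers would normally come from a library.

```python
import random
from collections import Counter
from itertools import product

# Method names as listed in the abstract (labels only, no implementations here).
OVERSAMPLERS = ["RandomOverSampler", "SMOTE", "Borderline-SMOTE", "ADASYN"]
ENSEMBLES = ["Bagging", "Random Forest", "AdaBoost", "GBDT"]

# Assumption: every oversampler is paired with every ensemble method.
COMBINATIONS = list(product(OVERSAMPLERS, ENSEMBLES))


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all
    classes have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 non-defective modules (0) and 2 defective ones (1).
X = [[i, i * 2] for i in range(10)]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
print(len(COMBINATIONS), "combinations")
print("class counts before:", dict(Counter(y)))
print("class counts after: ", dict(Counter(y_bal)))
```

Random oversampling only repeats existing defective samples. SMOTE, Borderline-SMOTE and ADASYN instead synthesize new minority samples, which is where the four oversamplers compared in the paper differ.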

Key words: defect prediction, class-imbalance, oversampling, ensemble learning
