计算机与现代化 (Computer and Modernization) ›› 2020, Vol. 0 ›› Issue (06): 83-.

• Data Mining •

  • About the authors: WANG Hai (1991-), male, native of Qingdao, Shandong, master's student, research interests: data mining, artificial intelligence, E-mail: 350357641@qq.com; JIANG Feng (1978-), male, associate professor, Ph.D., research interests: machine learning, rough sets, E-mail: 14269672@qq.com; DU Junwei (1974-), male, professor, Ph.D., research interests: machine learning, knowledge graphs, knowledge engineering, E-mail: jw_du@qq.com; Corresponding author: ZHAO Jun (1979-), male, lecturer, Ph.D., research interests: machine learning, data mining, E-mail: 350357641@qq.com.
  • Funding:
    National Natural Science Foundation of China General Program (61973180); National Natural Science Foundation of China (61671261); Shandong Provincial Natural Science Foundation General Program (ZR2018MF007)

Comparative Study of Oversampling and Ensemble Learning Methods in Software Defect Prediction

  1. (1. College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China;
    2. Department of Archives, Qingdao University of Science and Technology, Qingdao 266061, China)
  • Received: 2019-11-22  Online: 2020-06-24  Published: 2020-06-28


Abstract: In recent years, research on software defect prediction has attracted much attention. The class-imbalance problem is common in software defect prediction: defective samples are far fewer than non-defective ones, yet defective samples are the focus of prediction. Due to this problem, the performance of defect prediction models often fails to meet users' requirements, so it is necessary to handle the imbalanced data effectively. At present, sampling-based methods and ensemble learning methods have become two important approaches to dealing with imbalanced data, and many researchers have proposed different oversampling and ensemble learning methods. This paper studies how to better combine these two kinds of methods to effectively address the class-imbalance problem in defect prediction. To that end, it selects four common oversampling methods (i.e., RandomOverSampler, SMOTE, Borderline-SMOTE and ADASYN) and four commonly used ensemble learning methods (i.e., Bagging, Random Forest, AdaBoost and GBDT), and pairs each oversampling method with each ensemble method to form different combinations. By comparing the defect prediction performance of each combination, the optimal combination is obtained, which may provide useful guidance for handling the imbalance problem in defect prediction. Experimental results demonstrate that the oversampling method ADASYN has more advantages in dealing with the imbalance problem, and its combination with the ensemble method GBDT performs best, achieving better defect prediction performance than the other combinations.
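A minimal sketch of one such oversample-then-ensemble combination, using scikit-learn only: the dataset is synthetic (standing in for a real defect-prediction dataset), and the helper `random_oversample` is an illustrative re-implementation of the RandomOverSampler strategy, not the paper's code. SMOTE, Borderline-SMOTE and ADASYN (e.g., from the imbalanced-learn library) would be swapped in at the same point in the pipeline, as they synthesize new minority samples rather than duplicating existing ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for a defect dataset (~10% defective).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def random_oversample(X, y, rng):
    """Duplicate minority-class samples until both classes are equal in size
    (the RandomOverSampler strategy; SMOTE-family methods instead synthesize
    new minority points near existing ones)."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X_bal, y_bal = random_oversample(X_tr, y_tr, rng)

# GBDT trained on raw imbalanced data vs. on oversampled data; F1 on the
# minority (defective) class is the performance measure of interest.
f1_raw = f1_score(y_te, GradientBoostingClassifier(random_state=0)
                  .fit(X_tr, y_tr).predict(X_te))
f1_os = f1_score(y_te, GradientBoostingClassifier(random_state=0)
                 .fit(X_bal, y_bal).predict(X_te))
```

Crucially, oversampling is applied only to the training split; resampling before the train/test split would leak duplicated minority samples into the test set and inflate the measured scores.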

Key words: defect prediction, class-imbalance, oversampling, ensemble learning
