计算机与现代化 (Computer and Modernization) ›› 2020, Vol. 0 ›› Issue (06): 83-.

• Data Mining •

  • About the authors: WANG Hai (1991-), male, native of Qingdao, Shandong, master's student, research interests: data mining, artificial intelligence, E-mail: 350357641@qq.com; JIANG Feng (1978-), male, associate professor, Ph.D., research interests: machine learning, rough sets, E-mail: 14269672@qq.com; DU Junwei (1974-), male, professor, Ph.D., research interests: machine learning, knowledge graphs, knowledge engineering, E-mail: jw_du@qq.com; Corresponding author: ZHAO Jun (1979-), male, lecturer, Ph.D., research interests: machine learning, data mining, E-mail: 350357641@qq.com.
  • Funding:
    National Natural Science Foundation of China General Program (61973180); National Natural Science Foundation of China (61671261); Shandong Provincial Natural Science Foundation General Program (ZR2018MF007)

Comparative Study of Oversampling and Ensemble Learning Methods in Software Defect Prediction

  1. (1. College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China;
    2. Department of Archives, Qingdao University of Science and Technology, Qingdao 266061, China)
  • Received: 2019-11-22  Online: 2020-06-24  Published: 2020-06-28


Abstract: In recent years, research on software defect prediction has attracted much attention. The class-imbalance problem is common in software defect prediction: defective samples are far fewer than non-defective ones, yet defective samples are the focus of prediction. Due to this problem, the performance of defect prediction models often fails to meet users' requirements, so it is necessary to handle the imbalanced data effectively. At present, sampling-based methods and ensemble learning methods have become two important approaches to dealing with imbalanced data, and many researchers have proposed different oversampling and ensemble learning methods. This paper studies how to better combine these two kinds of methods to effectively address the class-imbalance problem in defect prediction. To that end, it selects four common oversampling methods (i.e., RandomOverSampler, SMOTE, Borderline-SMOTE and ADASYN) and four commonly used ensemble learning methods (i.e., Bagging, Random Forest, AdaBoost and GBDT), and pairs each oversampling method with each ensemble method to form different combinations. By comparing the defect prediction performance of each combination, the optimal combination is obtained, which may provide useful guidance for handling the imbalance problem in defect prediction. Experimental results demonstrate that the oversampling method ADASYN has more advantages in dealing with the imbalance problem, and its combination with the ensemble method GBDT performs best, achieving better defect prediction performance than the other combinations.
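A minimal sketch of one such oversample-then-ensemble combination, using scikit-learn only: the dataset is synthetic (standing in for a real defect-prediction dataset), and the helper `random_oversample` is an illustrative re-implementation of the RandomOverSampler strategy, not the paper's code. SMOTE, Borderline-SMOTE and ADASYN (e.g., from the imbalanced-learn library) would be swapped in at the same point in the pipeline, as they synthesize new minority samples rather than duplicating existing ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for a defect dataset (~10% defective).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def random_oversample(X, y, rng):
    """Duplicate minority-class samples until both classes are equal in size
    (the RandomOverSampler strategy; SMOTE-family methods instead synthesize
    new minority points near existing ones)."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X_bal, y_bal = random_oversample(X_tr, y_tr, rng)

# GBDT trained on raw imbalanced data vs. on oversampled data; F1 on the
# minority (defective) class is the performance measure of interest.
f1_raw = f1_score(y_te, GradientBoostingClassifier(random_state=0)
                  .fit(X_tr, y_tr).predict(X_te))
f1_os = f1_score(y_te, GradientBoostingClassifier(random_state=0)
                 .fit(X_bal, y_bal).predict(X_te))
```

Crucially, oversampling is applied only to the training split; resampling before the train/test split would leak duplicated minority samples into the test set and inflate the measured scores.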

Key words: defect prediction, class-imbalance, oversampling, ensemble learning
