Software Defect Prediction Based on Hybrid Sampling and Random_Stacking

Computer and Modernization, 2021, Vol. 0, Issue (08): 70-76.

(College of Information Science & Technology, Qingdao University of Science and Technology, Qingdao 266061, China)

Online: 2021-08-19 Published: 2021-08-19

About the authors: YAN Ling-ling (born 1993), female, from Wenshang, Shandong, master's student; research interests: machine learning and data mining; E-mail: 1019162802@qq.com. JIANG Feng (born 1978), male, from Pengze, Jiangxi, professor, Ph.D., CCF member; research interests: machine learning, data mining, rough sets, etc. Corresponding author: YANG Ai-guang (born 1972), from Jixi, Heilongjiang, lecturer, master's degree; research interests: artificial intelligence and embedded systems; E-mail: yanll2802@qq.com.

Supported by: National Natural Science Foundation of China (61973180, 61671261); Natural Science Foundation of Shandong Province (ZR2018MF007)

Abstract

Existing software defect prediction methods face problems such as class imbalance and high-dimensional data, and how to solve these problems effectively has become a research hotspot in the field. To address the class imbalance and low prediction accuracy faced by software defect prediction, this paper proposes DP_HSRS, a software defect prediction algorithm based on hybrid sampling and Random_Stacking. DP_HSRS first applies a hybrid sampling algorithm to balance the imbalanced data, and then applies the Random_Stacking algorithm to the balanced data set to predict software defects. Random_Stacking is an effective improvement on the traditional Stacking algorithm: it combines several classic classification algorithms with the Bagging mechanism to build multiple Stacking classifiers, lets these Stacking classifiers vote to obtain an ensemble classifier, and finally uses the ensemble classifier to predict software defects. Experimental results on the NASA MDP data sets show that DP_HSRS achieves better defect prediction performance than existing algorithms.

Key words: software defect prediction, data imbalance, hybrid sampling, Random_Stacking, DP_HSRS
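
The two stages named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. The abstract does not say which hybrid sampling algorithm or which base classifiers DP_HSRS uses, so everything below is an assumption made for illustration: SMOTE-style interpolation plus random undersampling stands in for the hybrid sampler, three toy learners (nearest centroid, k-NN, decision stump) stand in for the "classic classification algorithms", a lookup-table meta-learner trained on a hold-out half stands in for the Stacking meta-classifier, and all function names are hypothetical. The data is synthetic, not NASA MDP, and labels are assumed to be 0 (clean) and 1 (defective).

```python
import math
import random
from collections import Counter


def hybrid_sample(X, y, rng):
    """Undersample the majority class and oversample the minority to the same size."""
    counts = Counter(y)
    minority, majority = sorted(counts, key=counts.get)  # assumes exactly two classes
    small = [x for x, label in zip(X, y) if label == minority]
    large = [x for x, label in zip(X, y) if label == majority]
    target = (len(small) + len(large)) // 2
    large = rng.sample(large, target)  # random undersampling
    originals = list(small)
    while len(small) < target:  # SMOTE-style: interpolate two original minority samples
        a, b = rng.sample(originals, 2)
        t = rng.random()
        small.append([p + t * (q - p) for p, q in zip(a, b)])
    return small + large, [minority] * len(small) + [majority] * len(large)


def centroid_learner(X, y):
    """Nearest-centroid classifier."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    cents = {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}
    return lambda x: min(cents, key=lambda c: math.dist(x, cents[c]))


def knn_learner(X, y, k=3):
    """k-nearest-neighbour classifier with majority vote among the k neighbours."""
    def predict(x):
        near = sorted(zip(X, y), key=lambda pair: math.dist(x, pair[0]))[:k]
        return Counter(label for _, label in near).most_common(1)[0][0]
    return predict


def stump_learner(X, y):
    """One-feature threshold rule with the lowest training error (labels 0/1)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for lo, hi in ((0, 1), (1, 0)):
                err = sum((hi if x[j] > thr else lo) != label for x, label in zip(X, y))
                if best is None or err < best[0]:
                    best = (err, j, thr, lo, hi)
    _, j, thr, lo, hi = best
    return lambda x: hi if x[j] > thr else lo


BASE_LEARNERS = (centroid_learner, knn_learner, stump_learner)


def fit_stacking(X, y, rng):
    """Fit base learners on one half; fit a lookup-table meta-learner on the other."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    fit_idx, meta_idx = idx[: len(idx) // 2], idx[len(idx) // 2:]
    bases = [learn([X[i] for i in fit_idx], [y[i] for i in fit_idx]) for learn in BASE_LEARNERS]
    table = {}  # tuple of base predictions -> counts of true labels seen with it
    for i in meta_idx:
        key = tuple(base(X[i]) for base in bases)
        table.setdefault(key, Counter())[y[i]] += 1

    def predict(x):
        key = tuple(base(x) for base in bases)
        votes = table.get(key) or Counter(key)  # unseen pattern: vote of base learners
        return votes.most_common(1)[0][0]
    return predict


def fit_random_stacking(X, y, n_models=5, seed=0):
    """Train one stacking model per bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        boot = [rng.randrange(len(X)) for _ in X]  # bootstrap sample (Bagging)
        models.append(fit_stacking([X[i] for i in boot], [y[i] for i in boot], rng))
    return lambda x: Counter(m(x) for m in models).most_common(1)[0][0]


# Synthetic imbalanced data: 90 clean modules (label 0), 10 defective modules (label 1).
rng = random.Random(42)
clean = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
buggy = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(10)]
X, y = clean + buggy, [0] * 90 + [1] * 10
Xb, yb = hybrid_sample(X, y, rng)       # stage 1: balance the classes
model = fit_random_stacking(Xb, yb)     # stage 2: bagged stacking with voting
```

Calling `model([4.0, 4.0])` returns the predicted label for one feature vector. For real experiments, library implementations such as scikit-learn's `StackingClassifier` would normally replace the toy learners; they are hand-written here only to keep the sketch self-contained.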

YAN Ling-ling, JIANG Feng, DU Jun-wei, YANG Ai-guang. Software Defect Prediction Based on Hybrid Sampling and Random_Stacking[J]. Computer and Modernization, 2021, 0(08): 70-76.

