A Feature Filtering and Instance Transfer Framework for Cross-project Defect Prediction

Abstract

In cross-project software defect prediction, the feature correlation and the difference in instance distribution between the source project and the target project are the main factors affecting the performance of the prediction model. Approaching the problem from two angles, feature filtering and instance transfer, this paper proposes a cross-project software defect prediction framework called KCF-KMM (K-medoids Cluster Filtering-Kernel Mean Matching). In the feature filtering phase, the method uses the K-medoids clustering algorithm to select a feature subset, filtering out features that have low relevance to the target project. In the instance transfer phase, the KMM algorithm computes the distribution difference between source-project and target-project instances, and this difference is used to assign an influence weight to each training instance. Finally, a hybrid defect prediction model is built by incorporating a small amount of labeled data from the target project. To verify the effectiveness of KCF-KMM, it is compared with the classic cross-project software defect prediction methods TCA+, TNB, and NNFilter in terms of accuracy and F1 score. On the Apache dataset, KCF-KMM improves accuracy by 34.1%, 0.8%, and 21.1%, and F1 score by 14.4%, 3.7%, and 10.6%, over TCA+, TNB, and NNFilter, respectively.

Key words: source project, target project, feature correlation, distribution difference, feature filtering, instance transfer
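The instance transfer idea in the abstract, weighting each source instance by how well it matches the target distribution, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an RBF kernel and replaces the quadratic-programming solver normally used for KMM with plain projected gradient descent, and it omits the usual constraint on the sum of the weights. The names `rbf_kernel` and `kmm_weights`, the parameters `gamma`, `upper_bound`, and `n_iter`, and the toy data are all hypothetical choices made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def kmm_weights(source, target, gamma=1.0, upper_bound=10.0, n_iter=500):
    """Approximate KMM weights for source instances (illustrative only).

    Minimizes 0.5 * b^T K b - kappa^T b subject to 0 <= b_i <= upper_bound
    with projected gradient descent. The usual constraint on the sum of the
    weights is omitted here to keep the sketch short.
    """
    n_s, n_t = len(source), len(target)
    # K[i][j]: kernel between source instances i and j.
    K = [[rbf_kernel(si, sj, gamma) for sj in source] for si in source]
    # kappa[i]: scaled total similarity of source instance i to the target set.
    kappa = [
        (n_s / n_t) * sum(rbf_kernel(si, t, gamma) for t in target)
        for si in source
    ]
    # RBF entries lie in (0, 1], so the largest eigenvalue of K is at most n_s;
    # a step of 1 / n_s therefore keeps gradient descent stable.
    step = 1.0 / n_s
    beta = [1.0] * n_s
    for _ in range(n_iter):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n_s)) - kappa[i]
            for i in range(n_s)
        ]
        # Gradient step followed by projection onto the box [0, upper_bound].
        beta = [
            min(upper_bound, max(0.0, b - step * g))
            for b, g in zip(beta, grad)
        ]
    return beta


# Toy data: two source instances lie near the target cluster, one lies far away.
source = [[0.0], [0.1], [5.0]]
target = [[0.0], [0.05], [0.1]]
weights = kmm_weights(source, target)
```

In this toy setup the two source instances close to the target cluster receive larger weights than the distant one. Weights of this kind could then be supplied as per-instance sample weights when training a classifier on the source data.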

DIAO Xu-yang, LIU Xiao-yang, XU Li, CHEN Tian-qun, XU Ya-zhou. A Feature Filtering and Instance Transfer Framework for Cross-project Defect Prediction[J]. Computer and Modernization (计算机与现代化), 2021, 0(12): 116-122.
