A Feature Filtering and Instance Transfer Framework for Cross-project Defect Prediction

Abstract

In cross-project software defect prediction, feature correlation and the difference in instance distribution between the source project and the target project are the main factors that affect the performance of the prediction model. Addressing both through feature filtering and instance transfer, we propose KCF-KMM, a framework for cross-project defect prediction. In the feature filtering phase, KCF-KMM uses the K-medoids clustering algorithm to select features, filtering out those with low relevance to the target project. In the instance transfer phase, it uses the kernel mean matching (KMM) algorithm to measure the distribution difference between source-project and target-project instances and assigns each training instance an influence weight accordingly. Finally, it combines the weighted source instances with a small amount of labeled data from the target project to build a mixed defect prediction model. To verify its effectiveness, KCF-KMM is compared with the classic cross-project defect prediction methods TCA+, TNB, and NNFilter in terms of accuracy and F1 value. On the Apache data set, KCF-KMM improves accuracy by 34.1%, 0.8%, and 21.1% and F1 value by 14.4%, 3.7%, and 10.6% over TCA+, TNB, and NNFilter, respectively.
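The instance transfer idea can be illustrated with a minimal sketch of kernel mean matching. This is not the authors' implementation: the RBF kernel, the parameters `gamma`, `bound`, and `steps`, and the function names are assumptions made for illustration. For brevity the sketch solves the KMM quadratic program with projected gradient descent and keeps only the box constraint on the weights, omitting the constraint that keeps the weights' sum close to the number of source instances.

```python
import math


def rbf(x, y, gamma):
    """RBF kernel between two feature vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kmm_weights(source, target, gamma=1.0, bound=10.0, steps=2000):
    """Approximate KMM weights for source instances.

    Minimizes 0.5 * b^T K b - kappa^T b subject to 0 <= b_i <= bound
    by projected gradient descent. The sum constraint of the full KMM
    formulation is deliberately left out of this simplified sketch.
    """
    n_s, n_t = len(source), len(target)
    # Kernel matrix over source instances.
    K = [[rbf(xi, xj, gamma) for xj in source] for xi in source]
    # kappa_i measures how similar source instance i is to the target sample.
    kappa = [
        n_s / n_t * sum(rbf(xi, zj, gamma) for zj in target) for xi in source
    ]
    # The trace of K upper-bounds its largest eigenvalue, so 1/trace is a safe step.
    step = 1.0 / sum(K[i][i] for i in range(n_s))
    beta = [1.0] * n_s
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n_s)) - kappa[i]
            for i in range(n_s)
        ]
        # Gradient step followed by projection onto the box [0, bound].
        beta = [min(bound, max(0.0, b - step * g)) for b, g in zip(beta, grad)]
    return beta


if __name__ == "__main__":
    # Hypothetical one-dimensional data: two source instances lie near the
    # target sample and one lies far away.
    src = [[0.0], [0.1], [5.0]]
    tgt = [[0.0], [0.05], [0.1]]
    print(kmm_weights(src, tgt))
```

Source instances that resemble the target sample receive larger weights, while distant ones are driven toward zero. Such weights could then be passed as per-instance sample weights to whatever classifier is trained on the source data.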

Keywords: source project, target project, feature correlation, distribution difference, feature filtering, instance transfer

DIAO Xu-yang, LIU Xiao-yang, XU Li, CHEN Tian-qun, XU Ya-zhou. A Feature Filtering and Instance Transfer Framework for Cross-project Defect Prediction[J]. Computer and Modernization, 2021, 0(12): 116-122.
