Computer and Modernization ›› 2021, Vol. 0 ›› Issue (12): 116-122.

• Software Engineering •

A Feature Filtering and Instance Transfer Framework for Cross-project Defect Prediction

  (1. Shanghai Electro-Mechanical Engineering Institute, Shanghai 201109, China;
   2. Shanghai Aerospace Electronic Technology Institute, Shanghai 201109, China)
  • Online: 2021-12-24 Published: 2021-12-24
  • About the author: DIAO Xuyang (born 1995), male, from Shanghai, assistant engineer, M.S.; research interests: software engineering, machine learning. E-mail: y30170698@mail.ecust.edu.cn.

Abstract: In cross-project software defect prediction, two factors largely determine the performance of the prediction model: how strongly the features of the source project correlate with those of the target project, and how much the instance distributions of the two projects differ. Addressing both through feature filtering and instance transfer, we propose a cross-project defect prediction framework called KCF-KMM (K-medoids Cluster Filtering-Kernel Mean Matching). In the feature filtering phase, it uses the K-medoids clustering algorithm to select a feature subset, filtering out features that have low relevance to the target project. In the instance transfer phase, it uses the KMM algorithm to measure the distribution difference between source-project and target-project instances and assigns each training instance an influence weight accordingly. Finally, it combines a small amount of labeled data from the target project to build a hybrid defect prediction model. To verify the effectiveness of KCF-KMM, we compare it with the classic cross-project defect prediction methods TCA+, TNB, and NNFilter in terms of accuracy and F1 score. On the Apache data set, KCF-KMM improves accuracy by 34.1%, 0.8%, and 21.1% and F1 score by 14.4%, 3.7%, and 10.6% over TCA+, TNB, and NNFilter, respectively.

Key words: source project, target project, feature correlation, distribution difference, feature filtering, instance transfer
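
To make the instance transfer idea concrete, the following is a minimal sketch of kernel mean matching style weighting, not the authors' implementation. It assumes a Gaussian (RBF) kernel and solves only the box-constrained form of the KMM objective by projected gradient descent; the sum constraint of standard KMM is omitted for brevity. The function names and the parameters `gamma`, `upper`, and `steps` are hypothetical choices made for illustration; the abstract does not specify the kernel or any settings.

```python
import math


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def kmm_weights(source, target, gamma=1.0, upper=10.0, steps=500):
    """Approximate KMM weights for source instances.

    Minimizes 0.5 * b^T K b - kappa^T b subject to 0 <= b_i <= upper,
    using projected gradient descent. Simplification: the usual KMM
    constraint on the sum of the weights is not enforced here.
    """
    n_s, n_t = len(source), len(target)
    # K[i][j]: similarity between source instances i and j.
    K = [[rbf(a, b, gamma) for b in source] for a in source]
    # kappa[i]: scaled total similarity of source instance i to the target set.
    kappa = [n_s / n_t * sum(rbf(a, t, gamma) for t in target) for a in source]
    # The largest row sum bounds the largest eigenvalue of K, so this
    # step size keeps the descent stable.
    lr = 1.0 / max(sum(row) for row in K)
    beta = [1.0] * n_s
    for _ in range(steps):
        grad = [
            sum(K[i][j] * beta[j] for j in range(n_s)) - kappa[i]
            for i in range(n_s)
        ]
        # Gradient step followed by projection back onto the box [0, upper].
        beta = [min(upper, max(0.0, b - lr * g)) for b, g in zip(beta, grad)]
    return beta


if __name__ == "__main__":
    # Two source instances lie near the target data; one lies far away.
    source = [[0.0], [0.1], [5.0]]
    target = [[0.0], [0.05], [0.1]]
    print(kmm_weights(source, target))
```

In this toy run, the two source instances close to the target data receive weights above 1, while the distant instance is weighted near 0. Weights of this kind could then be passed to a classifier as per-instance training weights.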