Computer and Modernization (计算机与现代化) ›› 2023, Vol. 0 ›› Issue (10): 32-38. doi: 10.3969/j.issn.1006-2475.2023.10.005

• Artificial Intelligence •

Breast Cancer Prediction and Feature Analysis Model Based on CatBoost and SHAP

  (1. School of Mathematics and Data Science, Changji University, Changji 831100, China;
    2. College of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, China)
  • Online: 2023-10-26 Published: 2023-10-26
  • About the author: JIA Xiaoyao (born 1996), female, from Chengdu, Sichuan; master's student; research interests: machine learning and medical data analysis. E-mail: jiaxiaoyaostudent@163.com.

Abstract: Current breast cancer prediction models suffer from limited predictive performance and poor interpretability. To address both problems, this paper proposes a breast cancer prediction and feature analysis model that combines CatBoost and SHAP. First, the original breast cancer dataset is preprocessed with outlier handling and data normalization to improve data quality. Then, a breast cancer prediction model is built on CatBoost and its generalization ability is analyzed. Finally, the prediction model is combined with SHAP for interpretability analysis to identify the key factors affecting breast cancer. The model is validated on the Breast Cancer Wisconsin (Diagnostic) dataset from the University of Wisconsin. It achieves an Accuracy of 99.30%, a Precision of 99.50%, a Recall of 98.91%, and an F1 value of 99.19%, all higher than the results reported in the existing literature: Accuracy improves by 1.12 to 6.90 percentage points, Precision by 2.00 to 7.50 percentage points, Recall by 2.41 to 6.91 percentage points, and F1 by 2.19 to 7.19 percentage points, which supports the superiority of the proposed model. In addition, the SHAP analysis identifies the core factors affecting breast cancer, including concave points_worst (extreme value of the cell-nucleus concave points in breast tissue), perimeter_worst (extreme value of the cell-nucleus perimeter), and area_worst (extreme value of the cell-nucleus area), which gives doctors an interpretable basis for diagnosis.

Key words: CatBoost algorithm, interpretable, breast cancer, disease prediction, feature analysis, machine learning

CLC Number:
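
The four metrics reported in the abstract (Accuracy, Precision, Recall, F1) have standard definitions in terms of confusion-matrix counts. The sketch below is a minimal illustration of those definitions, assuming the malignant class is treated as positive. The function name `classification_metrics` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute Accuracy, Precision, Recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives for the positive class.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from the paper).
metrics = classification_metrics(tp=90, fp=1, fn=2, tn=107)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Note that when metrics are averaged over cross-validation folds or over classes, the averaged F1 need not equal the harmonic mean of the averaged Precision and Recall, so reported values can differ slightly from what this single-matrix formula gives.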