Computer and Modernization ›› 2023, Vol. 0 ›› Issue (10): 32-38. doi: 10.3969/j.issn.1006-2475.2023.10.005


Breast Cancer Prediction and Feature Analysis Model Based on CatBoost and SHAP

  

  1. School of Mathematics and Data Science, Changji University, Changji 831100, China;
  2. College of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, China
  • Online: 2023-10-26 Published: 2023-10-26

Abstract: To address the insufficient performance and poor interpretability of current breast cancer prediction models, this paper proposes a breast cancer prediction and feature analysis model that combines CatBoost and SHAP. First, the original breast cancer dataset is preprocessed by handling outliers and normalizing the data to improve data quality. Then, a CatBoost-based breast cancer prediction model is built and its generalization ability is analyzed. Finally, the prediction model is combined with SHAP for interpretability analysis to explore the key factors affecting breast cancer. The model is validated on the Breast Cancer Wisconsin (Diagnostic) dataset from the University of Wisconsin. It achieves an Accuracy of 99.30%, a Precision of 99.50%, a Recall of 98.91%, and an F1 value of 99.19%, all higher than the results reported in the existing literature: Accuracy improves by 1.12 to 6.90 percentage points, Precision by 2.00 to 7.50 percentage points, Recall by 2.41 to 6.91 percentage points, and F1 by 2.19 to 7.19 percentage points, which verifies the superiority of the model. In addition, the SHAP analysis identifies the core factors affecting breast cancer, such as concave points_worst, perimeter_worst, and area_worst, which provides supporting evidence for doctors' diagnoses.
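As a reading aid for the four metrics quoted in the abstract, the following minimal Python sketch shows how Accuracy, Precision, Recall, and F1 are conventionally computed from binary confusion counts, and how a gain stated in percentage points maps back to an implied baseline. It is not the authors' evaluation code. The names `Confusion` and `implied_baseline` are hypothetical, the counts in the usage section are made up for illustration, and treating the malignant class as positive is an assumption that the abstract does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion counts (hypothetical helper, not from the paper).

    Assumption: the malignant class is treated as the positive class.
    """
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def accuracy(c: Confusion) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def f1(c: Confusion) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return _ratio(2 * p * r, p + r)


def implied_baseline(reported_pct: float, gain_points: float) -> float:
    """Baseline score (in %) implied by a reported score and a gain in percentage points."""
    return reported_pct - gain_points


if __name__ == "__main__":
    # Hypothetical counts for illustration only; they are not the paper's results.
    example = Confusion(tp=90, fp=1, fn=2, tn=150)
    for name, fn_ in (("Accuracy", accuracy), ("Precision", precision),
                      ("Recall", recall), ("F1", f1)):
        print(f"{name}: {fn_(example):.4f}")

    # Accuracy figures taken from the abstract: 99.30% with gains of 1.12 to 6.90 points.
    low = implied_baseline(99.30, 6.90)
    high = implied_baseline(99.30, 1.12)
    print(f"Implied baseline Accuracy range: {low:.2f}% to {high:.2f}%")
```

A percentage-point gain is an absolute difference between two percentages, not a relative change, which is why the baseline is recovered by plain subtraction.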

Key words: CatBoost algorithm, interpretability, breast cancer, disease prediction, feature analysis, machine learning
