Computer and Modernization ›› 2021, Vol. 0 ›› Issue (11): 72-76.


Integrated Classification Algorithm Based on Clustering and Undersampling

  

  1. School of Management Science and Engineering, Anhui University of Technology, Ma’anshan 243000, China;
  2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, China
  • Online: 2021-12-13 Published: 2021-12-13

Abstract: Unbalanced data arise in many application areas, and traditional classifiers tend to favor the majority class, which leads to poor classification of minority-class samples. To address this problem, an integrated (ensemble) classification algorithm based on clustering undersampling (ClusterUndersampling-AdaCost, CU-AdaCost) is proposed. The algorithm locates the centre of each cluster by computing a dimension-weighted Euclidean distance between samples, and selects the majority-class samples that carry the strongest information features, according to their range from the cluster centroid, to form a new training set. This training set is then fed to an ensemble algorithm that introduces a cost-sensitive adjustment function, so that the model pays more attention to the minority class. Comparison experiments on six UCI datasets show that the samples drawn during undersampling are highly representative and that the algorithm effectively improves classification performance on the minority class.

Key words: unbalanced data, clustering, undersampling, cost sensitive
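
The undersampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the dimension weights, the clustering procedure, or the exact centroid-range rule, so the code below assumes a plain k-means loop, user-supplied per-dimension weights, and a "keep the samples nearest each centroid" selection rule. The names weighted_distance, kmeans and cluster_undersample are hypothetical, chosen for illustration only.

```python
import math
import random


def weighted_distance(a, b, weights):
    """Euclidean distance with one weight per dimension (weights are an assumption)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def assign(points, centroids, weights):
    """Group points by their nearest centroid under the weighted distance."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: weighted_distance(p, centroids[c], weights))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, weights, iters=20, seed=0):
    """Plain k-means; assumed here as a stand-in for the paper's clustering step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = assign(points, centroids, weights)
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                dims = range(len(members[0]))
                updated.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in dims))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids, weights)


def cluster_undersample(majority, n_keep, k, weights, seed=0):
    """Keep about n_keep majority samples, taking those nearest each centroid.

    Each cluster contributes a quota proportional to its size; because quotas
    are rounded, the total can differ slightly from n_keep.
    """
    centroids, clusters = kmeans(majority, k, weights, seed=seed)
    selected = []
    for centroid, members in zip(centroids, clusters):
        quota = round(n_keep * len(members) / len(majority))
        ranked = sorted(members,
                        key=lambda p: weighted_distance(p, centroid, weights))
        selected.extend(ranked[:quota])
    return selected


if __name__ == "__main__":
    # Toy majority class made of two separated groups of 10 points each.
    majority = [(i * 0.1, 0.0) for i in range(10)] + \
               [(10 + i * 0.1, 5.0) for i in range(10)]
    kept = cluster_undersample(majority, n_keep=6, k=2, weights=(1.0, 1.0))
    print(len(kept), "majority samples kept out of", len(majority))
```

In a full pipeline the retained majority samples would be merged with all minority samples and passed to a cost-sensitive boosting learner; that stage is omitted here because the abstract does not specify the cost adjustment function.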