Computer and Modernization ›› 2024, Vol. 0 ›› Issue (09): 95-100. doi: 10.3969/j.issn.1006-2475.2024.09.016


Boundary Mixed Resampling Based on Joint Entropy for Imbalanced Data

  

  (1. School of Management Science and Engineering, Anhui University of Technology, Ma’anshan 243032, China;
   2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)
  • Online: 2024-09-27 Published: 2024-09-29

Abstract: Single resampling methods for imbalanced data often generate redundant samples or inadvertently delete crucial sample information. To overcome these limitations, this paper proposes a boundary mixed resampling algorithm for imbalanced data based on joint entropy. The algorithm first introduces a boundary factor to separate the boundary set from the non-boundary set. It then constructs a joint entropy indicator system to assess the importance of minority class samples within the boundary set. Based on this assessment, the minority class samples are partitioned, and different oversampling methods and sampling quantities are applied to each partition. Finally, the NearMiss-2 algorithm is used to filter and remove majority class sample points in the non-boundary set, thus achieving a relative data balance. Comparative experiments on nine UCI datasets show that the proposed algorithm achieves improvements in the F1-Score, G-mean, and AUC metrics, which validates its effectiveness and its favorable performance in imbalanced data classification.
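The final undersampling step can be illustrated with a minimal sketch of the standard NearMiss-2 rule: keep the majority class points whose mean distance to their k farthest minority class points is smallest. This is a generic illustration, not the paper's implementation. The function name `nearmiss2_select`, its parameters, and the toy data are hypothetical, and the sketch assumes the non-boundary set has already been identified. The boundary factor and the joint entropy indicators are not defined in the abstract, so they are not reproduced here.

```python
import math


def nearmiss2_select(majority, minority, n_keep, k=3):
    """Return sorted indices of majority points kept by a NearMiss-2 style rule.

    majority, minority: lists of equal-length numeric tuples (hypothetical inputs).
    n_keep: number of majority points to retain.
    k: number of farthest minority points used for the score.
    """
    if not minority:
        raise ValueError("minority set must not be empty")
    k = min(k, len(minority))

    scores = []
    for point in majority:
        # Euclidean distances from this majority point to every minority point.
        dists = sorted(math.dist(point, m) for m in minority)
        # Mean distance to the k farthest minority points.
        scores.append(sum(dists[-k:]) / k)

    # Keep the n_keep majority points with the smallest score.
    order = sorted(range(len(majority)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


# Toy data (assumed for illustration only).
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (0.5, -0.5), (10.0, 10.0)]
kept = nearmiss2_select(majority, minority, n_keep=2, k=2)
```

In this toy example, `kept` holds the indices of the two majority points that lie closest to the minority cluster as a whole; the distant points at (5, 5) and (10, 10) are dropped.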

Key words: imbalanced data classification; boundary factor; joint entropy; mixed sampling
