Resampling of Imbalanced Data for Optimizing Downstream Tasks

doi:10.3969/j.issn.1006-2475.2025.02.004

Abstract

Abstract: Data resampling is an important method for addressing class imbalance in datasets. Traditional methods construct balanced samples by minimizing geometric error in the sample space, but they perform poorly in high-dimensional spaces with complex distribution patterns, and their reliance on statistical features means they are not tailored to the downstream task. To address these problems, this paper proposes SOD-NN (Sampling for Optimizing Downstream Neural Network), a neural-network-based data sampling model. It uses the nonlinear processing capability of neural networks to learn the distribution characteristics of samples in high-dimensional space, and it is combined with the downstream task into a two-stage network that is optimized as a whole, which improves how well the sampling process fits the downstream task. During sampling, the dataset is first partitioned spatially; the resulting sample subsets undergo residual processing to avoid data degradation; and a self-attention mechanism is then used to construct global features, ensuring consistency with the distribution of the original samples. Experimental results show that the proposed model effectively improves the recognition performance for minority-class samples in downstream classification tasks and improves the robustness of the downstream tasks.

Key words: data resampling, imbalanced data, adaptive sampling network, self-attention mechanism
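
The abstract names three sampling steps (spatial partition, residual processing of each subset, and self-attention over the subset) without giving architectural details. The following is a minimal illustrative sketch of those three ideas in NumPy, not the SOD-NN model itself. The function names, the anchor-based partition rule, the single attention head, and the random (untrained) projection weights are all assumptions made for illustration, and the joint training with a downstream classifier is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def partition(x, n_parts, rng):
    """Hypothetical spatial partition: each sample joins its nearest anchor,
    where the anchors are n_parts samples drawn without replacement."""
    anchors = x[rng.choice(len(x), size=n_parts, replace=False)]
    dist = np.linalg.norm(x[:, None, :] - anchors[None, :, :], axis=-1)
    labels = dist.argmin(axis=1)
    return [x[labels == k] for k in range(n_parts)]

def residual_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a skip connection,
    so the input samples are carried through unchanged alongside the update."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return x + weights @ v, weights

rng = np.random.default_rng(0)
d = 4                                   # feature dimension (assumed)
x = rng.normal(size=(32, d))            # toy dataset standing in for real samples
# Random projections stand in for weights that would normally be learned.
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

subsets = partition(x, n_parts=4, rng=rng)
features = [residual_self_attention(s, w_q, w_k, w_v)[0] for s in subsets]
```

Each entry of `features` has the same shape as its subset, so the output can be compared directly with the original samples. In a trained model the projection matrices would be learned together with the downstream task rather than drawn at random.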

CLC number: TP391.9

GUO Hua. Resampling of Imbalanced Data for Optimizing Downstream Tasks[J]. Computer and Modernization, 2025, 0(02): 28-32.

