Computer and Modernization ›› 2025, Vol. 0 ›› Issue (02): 28-32. doi: 10.3969/j.issn.1006-2475.2025.02.004

• Artificial Intelligence •

Resampling of Imbalanced Data for Optimizing Downstream Tasks

  1. (College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China)
  • Online: 2025-02-28 Published: 2025-02-28
  • Funding:
    Natural Science Foundation of Shandong Province (ZR2020MF140)

Abstract: Data resampling is an important method for addressing class imbalance in datasets. Traditional methods construct balanced samples by minimizing geometric error in the sample space, but they perform poorly in high-dimensional spaces with complex distribution patterns, and their reliance on statistical features means they are not tailored to the downstream task. To address these problems, this paper proposes Sampling for Optimizing Downstream Neural Network (SOD-NN), a neural-network-based data sampling model. SOD-NN uses the nonlinear processing capability of neural networks to learn the distribution characteristics of samples in high-dimensional space, and it is combined with the downstream task into a two-stage network that is optimized as a whole, which improves how well the sampling process fits the downstream task. During sampling, the dataset is first partitioned spatially; the resulting sample subsets undergo residual processing to avoid data degradation, and a self-attention mechanism then constructs global features to keep the sampled data consistent with the original sample distribution. Experimental results show that the proposed model effectively improves the recognition of minority-class samples in downstream classification tasks and makes the downstream task more robust.

Key words: data resampling, imbalanced data, adaptive sampling network, self-attention mechanism
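
The sampling pipeline summarized in the abstract (spatial partitioning, residual processing of each subset, then self-attention across subsets) can be illustrated with a minimal forward-pass sketch. This is not the authors' SOD-NN implementation: the abstract does not specify the partitioning rule, the layer design, or the training procedure, so every function name below is hypothetical. Equal-width binning on the first feature, mean pooling of each subset, a single untrained residual layer, and attention with Q = K = V are all assumptions chosen for brevity. Joint optimization with the downstream classifier is omitted.

```python
import math
import random


def partition(samples, n_cells):
    """Spatial partition (assumption: equal-width bins on feature 0)."""
    lo = min(s[0] for s in samples)
    hi = max(s[0] for s in samples)
    width = (hi - lo) / n_cells or 1.0  # guard against zero width
    cells = [[] for _ in range(n_cells)]
    for s in samples:
        idx = min(int((s[0] - lo) / width), n_cells - 1)
        cells[idx].append(s)
    return [c for c in cells if c]  # drop empty cells


def mean(vectors):
    """Mean-pool a subset into one feature vector (assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def residual_block(x, weight):
    """y = x + ReLU(W x); the skip connection preserves the input signal."""
    fx = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weight]
    return [xi + fi for xi, fi in zip(x, fx)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(feats):
    """Scaled dot-product self-attention with Q = K = V = feats."""
    d = len(feats[0])
    out, weights = [], []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in feats]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * v[i] for wj, v in zip(w, feats)) for i in range(d)])
    return out, weights


rng = random.Random(0)
dim = 4
# Toy data: 40 random 4-D points (class labels omitted in this sketch).
samples = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(40)]
# Untrained random weights stand in for learned parameters.
weight = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

subsets = partition(samples, n_cells=5)
subset_feats = [residual_block(mean(c), weight) for c in subsets]
global_feats, attn = self_attention(subset_feats)
print(len(subsets), len(global_feats), len(global_feats[0]))
```

Each row of `attn` is a probability distribution over the subsets, so every global feature is a weighted mix of all subset features. A trainable version would replace the random `weight` with learned parameters and backpropagate the downstream classification loss through both stages.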

CLC Number: