Classifier Ensemble for Imbalanced Data Stream Classification Based on Accumulated Minorities

doi:10.3969/j.issn.1006-2475.2015.03.009

Computer and Modernization, 2015, Vol. 0, Issue (3): 41-47.


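1. School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, Shaanxi, China;
2. School of Natural and Applied Sciences, Northwestern Polytechnical University, Xi'an 710129, Shaanxi, China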

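Received: 2014-12-08 Online: 2015-03-23 Published: 2015-03-26
About the authors: GUO Wen-feng (b. 1984), male, from Yu County, Shanxi, is a master's student at the School of Computer Science, Northwestern Polytechnical University; research interest: data mining. WANG Yong (b. 1973), male, Ph.D., is an associate professor at the School of Natural and Applied Sciences, Northwestern Polytechnical University; research interests: operations research, data mining, and artificial intelligence.
Funding:
Supported by the Basic Research Fund of Northwestern Polytechnical University (JC201273)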


Abstract

Existing methods for handling imbalanced data streams tend either to over-fit or to leave available data underused. To address this, an ensemble classification method named EAMIDS is proposed, based on accumulated positive samples. EAMIDS collects the positive samples from all data chunks that have arrived so far into a set AP, then uses the K-nearest-neighbor (KNN) algorithm and over-sampling to balance the class distribution of each chunk. When the number of base classifiers exceeds the fixed size of the ensemble, the ensemble classifier is updated according to F-Measure. Experiments on the synthetic SEA and SPH datasets show that EAMIDS achieves higher prediction accuracy than the IDSL and SMOTE approaches.


Key words: imbalanced data streams, accumulated positive samples, ensemble classifiers, concept drift
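
The abstract names three steps: accumulating positives into AP, balancing each chunk with KNN and over-sampling, and pruning the ensemble by F-Measure. The following is a minimal sketch of how those steps might fit together, not the authors' implementation. The abstract does not give the exact selection rule, so several details are assumptions: the function names (`f_measure`, `balance_chunk`, `update_ensemble`), the parameter `k`, the use of Euclidean distance, random duplication as the over-sampling step, and base classifiers modelled as plain callables that map a sample to a label.

```python
import math
import random


def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def balance_chunk(positives, negatives, ap, k=3, rng=None):
    """Balance one chunk using the accumulated positive set `ap` (mutated in place).

    Assumed scheme (not specified in the abstract): for each positive in the
    current chunk, take its k nearest neighbours among earlier positives in
    `ap`; if positives are still fewer than negatives, duplicate randomly
    chosen positives (plain over-sampling).
    """
    rng = rng or random.Random(0)
    selected = list(positives)

    # Indices of accumulated positives that are near the current positives.
    chosen = set()
    for p in positives:
        nearest = sorted(range(len(ap)), key=lambda i: math.dist(p, ap[i]))[:k]
        chosen.update(nearest)

    # Add neighbours only until the classes are balanced.
    for i in sorted(chosen):
        if len(selected) >= len(negatives):
            break
        selected.append(ap[i])

    # Accumulate this chunk's positives for use by later chunks.
    ap.extend(positives)

    # Over-sample by duplication if the chunk is still imbalanced.
    while selected and len(selected) < len(negatives):
        selected.append(rng.choice(selected))
    return selected, list(negatives)


def update_ensemble(ensemble, new_clf, X_val, y_val, max_size):
    """Add `new_clf`; if the ensemble exceeds `max_size`, drop the member
    with the lowest F-measure on the evaluation data (X_val, y_val)."""
    ensemble = ensemble + [new_clf]
    if len(ensemble) > max_size:
        scores = [f_measure(y_val, [clf(x) for x in X_val]) for clf in ensemble]
        ensemble.pop(scores.index(min(scores)))
    return ensemble
```

In this sketch a caller would run `balance_chunk` on each arriving chunk, train a base classifier on the balanced result, and pass it to `update_ensemble` together with the evaluation data used for scoring. Which data the F-Measure is computed on, and how member votes are combined, are left open because the abstract does not say.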

CLC number: TP391

郭文锋1,王勇2. 基于累积正样本的偏斜数据流集成分类方法[J]. 计算机与现代化, 2015, 0(3): 41-47.

GUO Wen-feng1， WANG Yong2. Classifier Ensemble for Imbalanced Data Stream Classification Based on Accumulated Minorities[J]. Computer and Modernization, 2015, 0(3): 41-47.

