Computer and Modernization, 2021, Vol. 0, Issue (07): 107-114.

• Information Security •

Network Data Stream Classification Based on Concept Drift Detection

  1. (School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China)
  • Online:2021-08-02 Published:2021-08-02
  • About the authors: ZHANG Heng (b. 1994), male, from Tongling, Anhui, master's student; research interests: data processing and network security; E-mail: zhangheng6817@163.com. Corresponding author: JU Shiguang (b. 1955), male, from Zhenjiang, Jiangsu, professor, Ph.D.; research interests: computer visualization technology, spatial databases, and information security technology; E-mail: jushig@ujs.edu.cn.
  • Supported by:
    National Key Research and Development Project (2016YFD0702001)

Abstract: Because the Internet environment changes rapidly, network data streams exhibit concept drift, and data stream classification has shifted from traditional static classification to dynamic classification. Detecting concept drift is the key to dynamic classification. This paper proposes an adaptive classification algorithm for network data streams based on concept drift detection. The algorithm detects concept drift by comparing the distribution of the data in the sliding window with that of the historical data, then oversamples the window data to reduce the imbalance among the samples, and finally feeds the processed data into an OS-ELM classifier for online learning, updating the classifier so that it can cope with concept drift in the data stream. The proposed algorithm is evaluated on the MOA experimental platform with synthetic and real data sets. The results show that, compared with traditional ensemble learning algorithms, the algorithm achieves some improvement in classification accuracy and stability, and that its advantage in time performance begins to emerge as the volume of the data stream grows, making it suitable for complex and changing network environments.
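The abstract does not state which distribution statistic or which oversampling method the authors use, so the following Python fragment is only a minimal sketch of the first two steps under stated assumptions. It assumes a single numeric feature, flags drift when the total variation distance between the window histogram and the history histogram exceeds a threshold, and balances classes by random duplication. The names `histogram`, `drift_detected`, and `random_oversample`, and the values `bins=10` and `threshold=0.3`, are illustrative choices and do not come from the paper. The OS-ELM update that would follow is omitted.

```python
import random
from collections import Counter


def histogram(values, lo, hi, bins):
    """Normalised histogram of `values` over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = max(0, min(int((v - lo) / width), bins - 1))
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def drift_detected(history, window, bins=10, threshold=0.3):
    """Assumed drift test: total variation distance between the two histograms.

    Both `history` and `window` must be non-empty lists of numbers.
    """
    lo = min(min(history), min(window))
    hi = max(max(history), max(window))
    if hi == lo:
        return False  # every value is identical, so nothing has changed
    p = histogram(history, lo, hi, bins)
    q = histogram(window, lo, hi, bins)
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    return tv > threshold


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until all
    classes are as large as the largest one (simple random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Hypothetical data: the history and the first window share one distribution,
# the second window is shifted far away from it.
history = [0.1 * (i % 10) for i in range(200)]
same_window = [0.1 * (i % 10) for i in range(50)]
shifted_window = [5.0 + 0.1 * (i % 10) for i in range(50)]
```

In a full pipeline, a window for which `drift_detected` returns true would be passed through `random_oversample` and then used to update the online classifier. A real detector would also need to handle multi-dimensional features, which this sketch does not attempt.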

Key words: concept drift, data stream classification, sliding window, OS-ELM, oversampling