Computer and Modernization ›› 2025, Vol. 0 ›› Issue (01): 80-85. doi: 10.3969/j.issn.1006-2475.2025.01.013

• Network and Communication •

Real-Time Traffic Classification Method Based on High-dimensional Feature Dimensionality Reduction and Clustering

  1. (Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China)
  • Online: 2025-01-27 Published: 2025-01-27
  • Supported by:
    China University Industry-University-Research Innovation Fund (2021FNA02007); Ministry of Education and Nanjing Yousu (南京优速) Industry-University Cooperation, Innovation and Entrepreneurship Joint Fund (202002182024)

Abstract: Traditional network traffic classification methods cannot avoid the influence of unknown traffic on classification and struggle to classify traffic in real time. To address these problems, this paper proposes a real-time traffic classification model based on dimensionality reduction and clustering of high-dimensional features. First, a CNN model is built to extract high-dimensional features from traffic data, and the resulting feature vectors are saved. UMAP is then used to reduce the dimensionality of the feature vectors, and the DBSCAN clustering algorithm is used to classify the traffic, which effectively reduces the impact of unknown traffic on the model while achieving application-level classification. In addition, a delay control mechanism based on flow consistency is proposed; borrowing the idea of TCP congestion control, it greatly reduces classification time so that the model can meet the requirements of real-time traffic classification. An application-level traffic dataset is also collected in a real network, and experiments are carried out on both public datasets and this dataset. The results show that the method achieves an accuracy of about 98% on known datasets, maintains an accuracy of around 80% when unknown traffic approaches 50%, and meets the requirements of real-time classification.
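The clustering stage mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the CNN feature extractor and the UMAP step are omitted, the 2-D points are made-up stand-ins for dimensionality-reduced feature vectors, and the function names and the `eps` and `min_pts` values are illustrative assumptions. The sketch only shows how DBSCAN leaves isolated points labeled as noise, which is one plausible way flows from unknown applications could be kept out of the known classes.

```python
import math

NOISE = -1  # label for points that belong to no dense cluster


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (i included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Hypothetical 2-D embeddings: two dense groups (known applications)
# and one isolated point (a flow from an unseen application).
embedded = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (2.5, 9.0),
]
labels = dbscan(embedded, eps=0.3, min_pts=3)
unknown = [p for p, label in zip(embedded, labels) if label == NOISE]
print(labels)
print(unknown)
```

In this toy setup the two dense groups receive their own cluster labels and the isolated point is flagged as noise. In practice the neighborhood radius and density threshold would have to be tuned to the embedding produced by the dimensionality-reduction step.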

Key words: real-time traffic classification, feature dimensionality reduction, unknown traffic clustering, deep learning

CLC Number: