Computer and Modernization (计算机与现代化) ›› 2021, Vol. 0 ›› Issue (05): 93-97.

• Artificial Intelligence •

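Application System Identification Method Oriented to Unbalanced Datasets

  (1. College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China;
    2. Information Technology Research Office, Shengli Oil Field Geophysical Research Institute, Dongying 257022, China)
  • Online: 2021-06-03 Published: 2021-06-03
  • About the authors:
    Dong Yanhui (董燕辉, born 1996), male, from Heze, Shandong; master's student; research interests: machine learning, data mining, recommender systems, reinforcement learning; E-mail: 1480857024@qq.com
    Xiao Junbi (肖军弼, born 1968), male, from Laizhou, Shandong; associate professor, master's degree; research interests: software-defined networking (SDN), computer networks and applications, network performance and optimization, databases and information systems; E-mail: xiaojb@upc.edu.cn
    Zhang Hongxia (张红霞, born 1981), female, from Dongying, Shandong; associate professor, Ph.D.; research interests: cloud computing, mobile service computing, machine learning, big data analytics; E-mail: 20060072@upc.edu.cn
    Yang Yongjin (杨勇进, born 1996), male, from Binzhou, Shandong; master's student; research interests: edge computing, machine learning; E-mail: 2515558634@qq.com
    Ji Zhibin (计志滨, born 1965), male, from Ningbo, Zhejiang; research interests: network architecture, information security, industrial control security
  • Supported by:
    National Natural Science Foundation of China (61872385)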

Abstract: Traditional traffic-based analysis methods cannot effectively identify application systems in an oilfield local area network (LAN). To address this problem, this paper designs WEBCLA, an application system identification framework for imbalanced datasets. WEBCLA combines GSMOTE, an improved SMOTE algorithm based on Gini gain, with the XGBoost classification algorithm to identify web-based application systems. Specifically, the proposed GSMOTE algorithm oversamples the minority classes to alleviate the class imbalance among the samples, and XGBoost then performs the identification. Experiments on a real dataset show a clear improvement in recall over traditional methods: about 112.8% higher than an ordinary ensemble method and about 10.8% higher than the method without oversampling. The approach can effectively solve the application system identification problem in the oilfield LAN.

Key words: application identification, unbalanced data, Gini gain, oversampling, classification problem
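
The building blocks named in the abstract can be illustrated with a minimal sketch. The abstract does not say how GSMOTE uses Gini gain to guide oversampling, so the code below is not the paper's algorithm: it shows plain SMOTE interpolation, a standard Gini impurity and Gini gain computation, and the recall metric, each as a separate function. All function names, parameters, and sample values are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(values, labels, threshold):
    """Impurity reduction obtained by splitting one feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    weighted = (len(left) * gini_impurity(left)
                + len(right) * gini_impurity(right)) / len(labels)
    return gini_impurity(labels) - weighted


def smote(minority, n_new, k=3, rng=None):
    """Plain SMOTE (not GSMOTE): each synthetic point lies on the segment
    between a random minority sample and one of its k nearest minority
    neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def recall(y_true, y_pred, positive):
    """Recall = TP / (TP + FN) for the class `positive`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical toy data: four minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

In a pipeline of the kind the abstract describes, the synthetic minority samples would be appended to the training set before a classifier such as XGBoost is fitted, and recall would then be measured on held-out data.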