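Application System Identification Method Oriented to Unbalanced Datasets

Computer and Modernization, 2021, Vol. 0, Issue (05): 93-97.

(1. College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China;
2. Information Technology Research Office, Shengli Oil Field Geophysical Research Institute, Dongying 257022, China)

Online: 2021-06-03 Published: 2021-06-03
About the authors: DONG Yan-hui (1996-), male, from Heze, Shandong, master's student, research interests: machine learning, data mining, recommender systems, reinforcement learning, E-mail: 1480857024@qq.com; XIAO Jun-bi (1968-), male, from Laizhou, Shandong, associate professor, master's degree, research interests: software-defined networking (SDN), computer networks and applications, network performance and optimization, databases and information systems, E-mail: xiaojb@upc.edu.cn; ZHANG Hong-xia (1981-), female, from Dongying, Shandong, associate professor, Ph.D., research interests: cloud computing, mobile service computing, machine learning, big data analysis, E-mail: 20060072@upc.edu.cn; YANG Yong-jin (1996-), male, from Binzhou, Shandong, master's student, research interests: edge computing, machine learning, E-mail: 2515558634@qq.com; JI Zhi-bin (1965-), male, from Ningbo, Zhejiang, research interests: network architecture, information security, industrial control security.
Supported by:
National Natural Science Foundation of China (61872385)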

Abstract

Traditional traffic-based analysis methods cannot effectively identify application systems in an oilfield local area network. To address this problem, this paper designs WEBCLA, an application system identification framework for unbalanced datasets, which combines an improved SMOTE algorithm based on Gini gain (GSMOTE) with the XGBoost classification algorithm to identify web-based application systems. Specifically, the proposed GSMOTE algorithm over-samples the minority classes to alleviate the imbalance among the samples to be identified, and the XGBoost classifier then identifies the application system. Experiments on a real dataset show that the proposed method clearly improves recall over traditional methods: recall is about 112.8% higher than with an ordinary ensemble method and about 10.8% higher than with the method without over-sampling. The method can effectively solve the application system identification problem in oilfield LANs.

Key words: application recognition, unbalanced data, Gini gain, over-sampling, classification problem
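
The abstract names two ingredients of GSMOTE, SMOTE over-sampling and Gini gain, but does not say how they are combined. As a rough illustration only, the sketch below shows each ingredient on its own in plain Python: the Gini gain of a single-feature split, and standard SMOTE interpolation between minority-class neighbours. The function names, the parameter `k`, and the toy data are assumptions made for illustration and are not taken from the paper; the XGBoost classification stage is omitted.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(values, labels, threshold):
    """Impurity reduction obtained by splitting one feature at `threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def smote(minority, n_new, k=2, seed=0):
    """Plain SMOTE (not the paper's GSMOTE): create n_new synthetic points,
    each on the segment between a minority sample and one of its k nearest
    minority neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic
```

For instance, `smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=5)` returns five synthetic points lying on the edges of the triangle spanned by the three inputs, and `gini_gain([1, 2, 3, 4], ["a", "a", "b", "b"], 2)` evaluates a split that separates the two classes perfectly. How GSMOTE uses the gain to steer the sampling is described in the full paper, not here.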

DONG Yan-hui, XIAO Jun-bi, ZHANG Hong-xia, YANG Yong-jin, JI Zhi-bin. Application System Identification Method Oriented to Unbalanced Datasets[J]. Computer and Modernization, 2021, 0(05): 93-97.

