Insider Threat Detection Based on Hybrid N-Gram and XGBoost Theory

Abstract

Abstract: As government bodies, enterprises, and public institutions establish and strengthen their network security mechanisms, the threshold for attacking a target system purely from the outside keeps rising, and insider threats are gradually increasing as a result. Insider threats differ from external threats in that the attackers are mainly internal users, which makes the attacks more concealed and harder to detect. This paper first analyzes user behavior in the public SEA data set and then proposes an insider threat detection method based on a hybrid N-Gram model and the XGBoost algorithm, using big data and machine learning methods. Three feature extraction methods (bag-of-words, N-Gram, and vocabulary) are compared experimentally, and the parameter N is selected by experiment. The proposed method detects better than feature subsets formed by combining different features of one-dimensional, two-dimensional, and four-dimensional data, with a specificity of 0.23, a sensitivity of 27.65, an accuracy of 0.94, and an F1 value of 0.97. Across the four evaluation indicators (specificity, sensitivity, accuracy, and F1 value), hybrid N-Gram feature extraction is more effective for detection than the traditional bag-of-words and vocabulary feature extraction methods. The method improves the discriminative power of insider threat detection signatures as well as the accuracy and computational performance of feature extraction.

Key words: hybrid N-Gram model, XGBoost algorithm, insider threat, SEA data set, evaluation index
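The abstract names hybrid N-Gram feature extraction and four evaluation indicators without spelling out either. The sketch below is one possible reading, not the authors' implementation: it assumes that "hybrid" means pooling the counts of n-grams for several values of N over a block of shell commands, and that label 1 marks an anomalous (masquerading) block. The function names `ngrams`, `hybrid_ngram_counts`, and `binary_metrics` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(commands: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a command sequence."""
    return [tuple(commands[i:i + n]) for i in range(len(commands) - n + 1)]


def hybrid_ngram_counts(commands: Sequence[str],
                        n_values: Sequence[int] = (1, 2)) -> Counter:
    """Pool n-gram counts for every N in n_values (assumed meaning of 'hybrid')."""
    counts: Counter = Counter()
    for n in n_values:
        counts.update(ngrams(commands, n))
    return counts


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, and F1, with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


if __name__ == "__main__":
    block = ["ls", "cd", "ls", "cd"]  # toy command block, not SEA data
    print(hybrid_ngram_counts(block, n_values=(1, 2)))
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a full pipeline, the pooled counts for each command block would be turned into a fixed-length vector and passed to a gradient-boosted tree classifier such as XGBoost. That step is left out here to keep the sketch dependency-free.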

孙丹, 饶兰香, 施炜利, 孟莎莎, 胡少文, 胡必伟, 应嵩. 基于混合N-Gram模型和XGBoost算法的内部威胁检测方法[J]. 计算机与现代化, 2022, 0(08): 99-105.

SUN Dan, RAO Lan-xiang, SHI Wei-li, MENG Sha-sha, HU Shao-wen, HU Bi-wei, YING Song. Insider Threat Detection Based on Hybrid N-Gram and XGBoost Theory[J]. Computer and Modernization, 2022, 0(08): 99-105.
