An Outlier Detection Method Based on k-means Clustering

Computer and Modernization, 2014, Vol. 0, Issue (1): 93-95, 113.

Department of Computer Science and Technology, Shanxi Police Academy, Taiyuan 030021, China

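Received: 2013-08-13  Online: 2014-01-20  Published: 2014-02-10
About the author: BAI Ning (born 1975), male, from Jincheng, Shanxi; lecturer in the Department of Computer Science and Technology, Shanxi Police Academy; master's degree; research interests: intelligent information processing and computer software.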

Abstract

User behavior patterns in real-world problems are diverse and unpredictable, which makes it difficult for traditional outlier detection methods to learn from normal or abnormal patterns that are defined in advance. To address this problem, this paper proposes an adaptive outlier detection method based on k-means clustering, called the OD_KC method. The method runs k-means clustering on an unlabeled sample set with different numbers of clusters and constructs a measure function that evaluates the compactness and separation of each clustering result, so that the best clustering can be selected. Samples that fall into very small clusters are then automatically taken as outlier-pattern samples. The method is autonomous and adaptive; in particular, it still gives good detection results when the sample distribution is complex. Experimental results on standard datasets show that the method achieves good detection performance.

Key words: data mining, clustering results, measure function, adaptability, OD_KC method
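
The abstract outlines the procedure (cluster with several values of k, score each result for compactness and separation, then flag the members of very small clusters) but does not give the measure function or the size threshold. The following is only a minimal sketch under stated assumptions, not the paper's implementation: the `measure` ratio, the `min_fraction` threshold, the farthest-first initialization, and all function names are hypothetical choices made for illustration.

```python
import math


def farthest_first(points, k):
    """Deterministic initialization (an assumption, not from the paper)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest chosen center is farthest away.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: mean of each cluster (an empty cluster keeps its center).
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def measure(points, centers, labels):
    """Hypothetical measure: mean point-to-center distance (compactness)
    divided by the smallest center-to-center distance (separation).
    Lower is better."""
    compact = sum(math.dist(p, centers[lab]) for p, lab in zip(points, labels)) / len(points)
    separation = min(math.dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return compact / separation if separation > 0 else math.inf


def detect_outliers(points, k_values=range(2, 6), min_fraction=0.1):
    """Try each k, keep the best-scoring clustering, and return
    (best k, indices of samples in clusters smaller than min_fraction * n)."""
    best = None
    for k in k_values:
        centers, labels = kmeans(points, k)
        score = measure(points, centers, labels)
        if best is None or score < best[0]:
            best = (score, k, labels)
    _, k, labels = best
    small = {j for j in range(k) if labels.count(j) < min_fraction * len(points)}
    return k, [i for i, lab in enumerate(labels) if lab in small]


# Toy data: two dense groups of 20 points each plus two far-away samples.
group_a = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(4)]
group_b = [(10 + 0.1 * x, 10 + 0.1 * y) for x in range(5) for y in range(4)]
points = group_a + group_b + [(30.0, -20.0), (30.2, -20.1)]
best_k, outliers = detect_outliers(points)
print(best_k, outliers)
```

On this toy data the sketch selects k = 3 and flags the two isolated samples (indices 40 and 41). A different measure function or threshold could change which clusters count as "small", so both should be treated as tunable assumptions.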

BAI Ning. An Outlier Detection Method Based on k-means Clustering[J]. Computer and Modernization, 2014, 0(1): 93-95,113.
