High-dimensional Numerical Anomaly Data Detection Based on Multi-level Sequence Integration

Computer and Modernization ›› 2020, Vol. 0 ›› Issue (06): 73-.

(1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China;
2. Information and Communication Company of State Grid Shanghai Electric Power Company, Shanghai 200000, China)

Received: 2019-10-22 Online: 2020-06-24 Published: 2020-06-28
About the authors: LI Ke-xin (b. 1994), male, from Bozhou, Anhui, master's student; research interests: data mining, software reliability verification; E-mail: kexinli@nuaa.edu.cn. LI Jing (b. 1976), female (Manchu), from Nanjing, Jiangsu, associate professor, Ph.D.; research interests: image recognition, data mining, trusted computing; E-mail: lijing@nuaa.edu.cn. SHAO Jia-wei (b. 1989), male, from Shanghai, engineer, master's degree; research interests: reliability verification and data analysis for power systems; E-mail: 505322675@qq.com. XIAO Yi, male, from Shanghai, engineer, bachelor's degree; research interests: data mining and analysis; E-mail: 405781809@qq.com.
Funding:
State Grid Science and Technology Project (SGSHXT00JFJS1900093)

Abstract

Abstract: With the rapid development of big data, data analysis and knowledge discovery have become research hotspots, and anomaly detection is key to improving data quality. When applied to high-dimensional numerical data, anomaly detection methods based on sequential ensemble learning can lose detection accuracy because of noisy data and the large number of dimensions. This paper proposes a multi-level sequential ensemble learning method, based on the elastic net, for detecting anomalies in high-dimensional numerical data. Each level contains three modules: an anomaly candidate set selection module, an elastic net dimension reduction module, and an anomaly scoring module. First, the candidate set selection module uses the anomaly scores to select a subset of data points that are likely to be anomalous. Then, the elastic net performs feature selection on the high-dimensional data using the candidate set and its anomaly scores, retaining the features most relevant to the scores. Finally, the data are scored again using the selected features. The threshold of the candidate set selection module is set to a different value at each level, and the levels are executed in a loop until the mean squared error of the current elastic net exceeds that of the previous one, or the current detection accuracy falls below the initial detection accuracy. In the experiments, the method was evaluated on the high-dimensional anomaly detection datasets provided by ODDS in terms of detection accuracy, number of extracted features, convergence speed, and other metrics. The results show that the method not only improves detection accuracy on high-dimensional numerical anomaly data but also effectively reduces the influence of noise on the detection results.

Key words: data mining, anomaly detection, ensemble learning, elastic net, high-dimensional data
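The loop described in the abstract (select candidates, fit an elastic net to choose features, re-score, stop when the mean squared error rises) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the k-nearest-neighbour distance scorer, the quantile thresholds, the hyperparameter values, and every function name are assumptions made for the example. The elastic net uses the common `alpha` / `l1_ratio` parameterization and is fitted by plain coordinate descent. The second stopping rule from the abstract (detection accuracy falling below the initial accuracy) needs ground-truth labels and is left out here.

```python
import math


def soft_threshold(value, t):
    """Shrink `value` toward zero by `t` (the L1 proximal step)."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0


def knn_scores(X, feats, k=5):
    """Assumed scorer: distance to the k-th nearest neighbour on `feats`."""
    scores = []
    for i, a in enumerate(X):
        dists = sorted(
            math.sqrt(sum((a[j] - b[j]) ** 2 for j in feats))
            for m, b in enumerate(X) if m != i
        )
        scores.append(dists[k - 1])
    return scores


def elastic_net(X, y, alpha=0.05, l1_ratio=0.5, sweeps=100):
    """Coordinate descent on centred data for
    (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
                       + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    Returns (weights, mean squared error of the fit)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    z = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            if z[j] == 0.0:
                continue  # constant column carries no information
            rho = sum(row[j] * (r + row[j] * w[j])
                      for row, r in zip(Xc, resid)) / n
            new_w = (soft_threshold(rho, alpha * l1_ratio)
                     / (z[j] + alpha * (1 - l1_ratio)))
            step = new_w - w[j]
            if step:
                resid = [r - row[j] * step for row, r in zip(Xc, resid)]
                w[j] = new_w
    return w, sum(r * r for r in resid) / n


def multilevel_detect(X, quantiles=(0.5, 0.6, 0.7), k=5):
    """One level per entry in `quantiles` (hypothetical per-level thresholds).
    Returns the final scores and the indices of the retained features."""
    feats = list(range(len(X[0])))
    scores = knn_scores(X, feats, k)
    prev_mse = math.inf
    for q in quantiles:
        # 1. Candidate set: rows whose score reaches this level's cutoff.
        cutoff = sorted(scores)[int(q * (len(scores) - 1))]
        cand = [i for i, s in enumerate(scores) if s >= cutoff]
        # 2. Elastic net: regress candidate scores on the current features.
        sub = [[X[i][j] for j in feats] for i in cand]
        w, mse = elastic_net(sub, [scores[i] for i in cand])
        kept = [f for f, wj in zip(feats, w) if abs(wj) > 1e-8]
        # Stop when the fit gets worse or no feature survives.
        if mse > prev_mse or not kept:
            break
        feats, prev_mse = kept, mse
        # 3. Re-score all rows using only the selected features.
        scores = knn_scores(X, feats, k)
    return scores, feats
```

Calling `multilevel_detect(X)` on a list of equal-length numeric rows returns one score per row (higher means more anomalous) together with the retained feature indices. Raising the quantile at each level narrows the candidate set, which mirrors the abstract's statement that each level uses a different threshold.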

CLC Number: TP301.6

LI Ke-xin, LI Jing, SHAO Jia-wei, XIAO Yi. High-dimensional Numerical Anomaly Data Detection Based on Multi-level Sequence Integration[J]. Computer and Modernization, 2020, 0(06): 73-.

