Computer and Modernization ›› 2020, Vol. 0 ›› Issue (06): 73-.

• Data Mining •

High-dimensional Numerical Anomaly Data Detection Based on Multi-level Sequence Integration

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  2. Information and Communication Company of State Grid Shanghai Electric Power Company, Shanghai 200000, China
  • Received: 2019-10-22 Online: 2020-06-24 Published: 2020-06-28
  • About the authors: LI Kexin (李科心, b. 1994), male, from Bozhou, Anhui, master's student; research interests: data mining, software reliability verification; E-mail: kexinli@nuaa.edu.cn. LI Jing (李静, b. 1976), female (Manchu), from Nanjing, Jiangsu, associate professor, Ph.D.; research interests: image recognition, data mining, trusted computing; E-mail: lijing@nuaa.edu.cn. SHAO Jiawei (邵佳炜, b. 1989), male, from Shanghai, engineer, master's degree; research interests: reliability verification and data analysis for power systems; E-mail: 505322675@qq.com. XIAO Yi (肖屹), male, from Shanghai, engineer, bachelor's degree; research interests: data mining and analysis; E-mail: 405781809@qq.com.
  • Supported by:
    State Grid Science and Technology Project (SGSHXT00JFJS1900093)

Abstract: With the rapid development of big data, data analysis and knowledge discovery have become research hotspots, and anomaly detection is key to improving data quality. Anomaly detection methods based on sequential ensemble learning can lose detection accuracy on high-dimensional numerical data because of noisy data and the large number of dimensions. This paper proposes a multi-level sequential ensemble learning method, based on the elastic net, for detecting anomalies in high-dimensional numerical data. Each level contains three modules: an anomaly candidate set selection module, an elastic net dimension reduction module, and an anomaly scoring module. First, the candidate set selection module uses the anomaly scores to select a subset of likely anomalies. Then, the elastic net performs feature selection on the high-dimensional data, using the candidate set and its anomaly scores to pick the features most relevant to the anomaly score. Finally, the data are scored again using the selected features. The threshold of the candidate set selection module is set to a different value at each level, and the levels are executed in turn until the mean squared error of the current elastic net exceeds that of the previous one or the current detection accuracy falls below the initial detection accuracy. In the experiments, the method is evaluated on high-dimensional anomaly detection data sets provided by ODDS in terms of detection accuracy, number of extracted features, and convergence speed. The results show that the proposed method not only improves detection accuracy on high-dimensional numerical anomaly data but also effectively reduces the influence of noise on the detection results.

Key words: data mining, anomalous data detection, ensemble learning, elastic net, high-dimensional data
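
The level-by-level loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`standardize`, `anomaly_score`, `elastic_net`, `detect`), the mean squared z-score used as a stand-in anomaly scorer, the per-level candidate fractions, and the synthetic data are all assumptions made for the example. The accuracy-based stopping rule is left out because it needs labelled data; only the mean squared error rule is shown.

```python
import random

def standardize(rows):
    """Column-wise z-scores; a constant column maps to all zeros."""
    n = len(rows)
    out_cols = []
    for col in zip(*rows):
        mu = sum(col) / n
        sd = (sum((v - mu) ** 2 for v in col) / n) ** 0.5 or 1.0
        out_cols.append([(v - mu) / sd for v in col])
    return [list(r) for r in zip(*out_cols)]

def anomaly_score(Z, feats):
    """Stand-in scorer (assumption): mean squared z-score over `feats`."""
    return [sum(row[j] ** 2 for j in feats) / len(feats) for row in Z]

def soft_threshold(z, t):
    return z - t if z > t else z + t if z < -t else 0.0

def elastic_net(X, y, alpha, l1_ratio=0.5, sweeps=200):
    """Coordinate descent for the elastic net objective
    (1/2n)*||y - b - Xw||^2 + alpha*(l1_ratio*||w||_1 + (1-l1_ratio)/2*||w||_2^2).
    Returns the weights and the training mean squared error."""
    n, p = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(p)]
    cols = [[r[j] - means[j] for r in X] for j in range(p)]  # centred columns
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            sq = sum(v * v for v in col) / n
            if sq == 0.0:
                continue
            rho = sum(c * r for c, r in zip(col, resid)) / n + sq * w[j]
            new = soft_threshold(rho, alpha * l1_ratio) / (sq + alpha * (1 - l1_ratio))
            delta = new - w[j]
            if delta:
                resid = [r - delta * c for r, c in zip(resid, col)]
                w[j] = new
    return w, sum(r * r for r in resid) / n

def detect(rows, fractions=(0.3, 0.2, 0.1), alpha=0.3):
    """One level per candidate fraction (hypothetical thresholds)."""
    Z = standardize(rows)
    feats = list(range(len(Z[0])))
    scores = anomaly_score(Z, feats)
    prev_mse = float("inf")
    for frac in fractions:
        # 1) candidate set: the highest-scoring fraction of the data
        k = max(2, int(frac * len(Z)))
        cand = sorted(range(len(Z)), key=lambda i: scores[i], reverse=True)[:k]
        # 2) elastic net regression of the scores on the current features
        w, mse = elastic_net([[Z[i][j] for j in feats] for i in cand],
                             [scores[i] for i in cand], alpha)
        kept = [f for f, wj in zip(feats, w) if wj != 0.0]
        # stop when the fit gets worse than at the previous level
        if mse > prev_mse or not kept:
            break
        # 3) re-score all data on the selected features
        feats, prev_mse = kept, mse
        scores = anomaly_score(Z, feats)
    return scores, feats

# Synthetic demo: 15 planted outliers shifted on 4 of 12 features.
random.seed(0)
n, p, informative, n_out = 300, 12, 4, 15
rows = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
for i in range(n_out):
    for j in range(informative):
        rows[i][j] += 8.0
scores, feats = detect(rows)
top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_out]
```

Here `feats` holds the indices of the features that survived selection and `top` the indices of the highest-scoring rows. In practice a library scorer and a library elastic net would replace the hand-written stand-ins.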

CLC number: