An Outlier Detection Algorithm for Subspace Clustering

doi:10.3969/j.issn.1006-2475.2015.12.008

Abstract

Modern big data analytics faces several difficulties, such as missing data, unstructured data, and corruption by outliers. Outlier detection and removal is therefore the most important preprocessing step. In this paper, we address the popular subspace clustering problem in the more challenging scenario in which the data set is corrupted by sparse outliers. Based on the sparsity assumption, the classic k-subspace algorithm is adapted to incorporate ℓ1-norm regularization, which alleviates the side effects of outliers. To cope with the heavy computation and memory requirements of big data, the modified k-subspace clustering algorithm uses stochastic gradient descent (SGD) for fast, memory-efficient computation. Simulation experiments show that, even when the data set is heavily corrupted by outliers, the proposed approach accurately detects and removes the outliers and, furthermore, achieves accurate subspace clustering results.
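
The idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the per-sample objective 0.5*||x - U w - s||^2 + lam*||s||_1, the entrywise outlier model, the QR re-orthonormalization, and all names and parameters (`fit_point`, `sgd_step`, `robust_k_subspaces`, `lam`, `step`) are assumptions chosen for illustration.

```python
import numpy as np


def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1 (entrywise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def fit_point(U, x, lam, n_iter=50):
    """Alternately minimize 0.5*||x - U w - s||^2 + lam*||s||_1 over (w, s).

    U is assumed to have orthonormal columns, so the least-squares
    weights for a fixed s are simply U.T @ (x - s).
    """
    s = np.zeros_like(x)
    for _ in range(n_iter):
        w = U.T @ (x - s)                    # subspace weights for fixed s
        s = soft_threshold(x - U @ w, lam)   # sparse outlier estimate for fixed w
    w = U.T @ (x - s)
    r = x - U @ w - s
    cost = 0.5 * (r @ r) + lam * np.abs(s).sum()
    return w, s, cost


def sgd_step(U, x, w, s, step):
    """One stochastic gradient step on U, then re-orthonormalize with QR."""
    r = x - U @ w - s
    # The gradient of 0.5*||r||^2 with respect to U is -outer(r, w).
    Q, _ = np.linalg.qr(U + step * np.outer(r, w))
    return Q


def robust_k_subspaces(X, init_bases, lam=0.5, step=0.1, epochs=2, seed=0):
    """Assign each sample to its cheapest subspace and update that basis by SGD."""
    rng = np.random.default_rng(seed)
    bases = [np.array(B, dtype=float) for B in init_bases]
    for _ in range(epochs):
        for i in rng.permutation(len(X)):    # one sample at a time
            fits = [fit_point(U, X[i], lam) for U in bases]
            k = int(np.argmin([f[2] for f in fits]))
            w, s, _ = fits[k]
            bases[k] = sgd_step(bases[k], X[i], w, s, step)
    labels, S = [], []
    for x in X:                              # final labels and outlier estimates
        fits = [fit_point(U, x, lam) for U in bases]
        k = int(np.argmin([f[2] for f in fits]))
        labels.append(k)
        S.append(fits[k][1])
    return bases, np.array(labels), np.array(S)


# Toy data (made up): two lines in R^3; the last sample has a corrupted third entry.
X = np.array([[2.0, 0, 0], [0, 3.0, 0], [1.0, 0, 0], [0, 1.0, 0], [2.0, 0, 5.0]])
init = [np.array([[1.0], [0.0], [0.0]]), np.array([[0.0], [1.0], [0.0]])]
bases, labels, S = robust_k_subspaces(X, init)
print(labels)          # cluster index per sample
print(np.abs(S) > 0)   # entries flagged as outliers
```

In this sketch, entries where the estimated `s` is nonzero are treated as outliers, and `lam` controls how large a residual must be before it is attributed to an outlier rather than to the subspace fit. Because only one sample is processed per update, memory use does not grow with the number of samples, which is the motivation for SGD given in the abstract.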

Key words: big data analytics, outlier detection, subspace clustering

CLC Number:

TP312

YANG Weiyong1, HE Jun2, ZHENG Shengjun3，ZHANG Xudong4. An Outlier Detection Algorithm for Subspace Clustering[J]. Computer and Modernization, doi: 10.3969/j.issn.1006-2475.2015.12.008.

