A Big Data Deduplication Algorithm Based on Simhash

doi:10.3969/j.issn.1006-2475.2017.07.007

Abstract

Data deduplication is a key step in big data preprocessing. To improve deduplication efficiency and to address the poor performance of the classic algorithm under unfavorable conditions, this paper works with Chinese microblog text data and modifies the similarity formula of the classic Simhash algorithm so that it takes the duplication rate into account. In addition, borrowing the idea of bucket sort, the paper allocates threads over multiple rounds and levels to improve efficiency. Experimental results show that the improved algorithm reduces running time and improves accuracy compared with the classic algorithm.

Key words: microblog, big data, deduplication, Simhash, multi-thread
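
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic Simhash fingerprint and its Hamming-distance comparison, which is the starting point the abstract refers to. It does not reproduce the paper's modified similarity formula, its duplication-rate term, or its bucket-style thread allocation, none of which are specified in the abstract. The function names, the MD5-based token hash, the 64-bit fingerprint width, and the threshold of 3 bits are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Classic Simhash: weighted bitwise vote over per-token hashes.

    `tokens` is an iterable of strings (for Chinese text, the output of
    any word segmenter). Term frequency is used as the weight here,
    which is an assumption, not the paper's weighting.
    """
    weights = Counter(tokens)
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight if bit i is set, subtract it otherwise.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a_tokens, b_tokens, threshold=3):
    """Treat two texts as duplicates if their fingerprints are close.

    The threshold of 3 bits is an illustrative value only.
    """
    return hamming_distance(simhash(a_tokens), simhash(b_tokens)) <= threshold
```

In a deduplication pass, each microblog post would be segmented into tokens, fingerprinted once, and compared against previously seen fingerprints; the paper's contribution concerns how that comparison is scored and parallelized.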

CLC Number: TP311

ZHOU Chun-hui. A Big Data Deduplication Algorithm Based on Simhash[J]. Computer and Modernization, 2017, 0(7): 38-41.
