计算机与现代化 (Computer and Modernization)

• Software Engineering •

An Improved Big Data Deduplication Algorithm Based on Simhash

  

  1. (School of Software, Shanghai Jiao Tong University, Shanghai 201100, China)
  • Received: 2016-11-21  Online: 2017-07-20  Published: 2017-07-20
  • About the author: ZHOU Chunhui (1992-), male, born in Changzhou, Jiangsu; master's student at the School of Software, Shanghai Jiao Tong University; research interests: distributed computing, cloud computing.

A Big Data Deduplication Algorithm Based on Simhash

  1. (School of Software, Shanghai Jiao Tong University, Shanghai 201100, China)
  • Received: 2016-11-21  Online: 2017-07-20  Published: 2017-07-20

摘要 (Abstract in Chinese): Data deduplication is one of the main steps in big data preprocessing. To improve the efficiency of big data deduplication and its performance in unfavorable cases, this paper works on raw Chinese microblog data and, building on the classic Simhash method, improves the similarity formula by taking the text duplication rate into account; in the retrieval step, it adopts the idea of bucket sorting and allocates threads over multiple rounds and levels to raise efficiency. Experimental results show that the improved algorithm significantly outperforms the classic algorithm in both efficiency and accuracy.

关键词 (Keywords in Chinese): microblog, big data, deduplication, Simhash, multi-threading

Abstract: Data deduplication is one of the main steps in big data preprocessing. To improve deduplication efficiency and to optimize the classic algorithm's performance in unfavorable cases, this paper uses Chinese microblog text data and, building on the classic Simhash algorithm, modifies the similarity formula so that the text duplication rate is taken into account. In addition, drawing on the idea of bucket sorting, it allocates threads over multiple rounds and levels to improve efficiency. Experimental results show that the improved algorithm reduces running time and improves accuracy compared with the classic algorithm.
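The paper's own code is not included on this page. As a rough illustration of the classic Simhash baseline that the abstract builds on (not the paper's improved formula), the following is a minimal Python sketch; the hash function (MD5), uniform token weights, and the 3-bit near-duplicate threshold are illustrative assumptions:

```python
import hashlib

def simhash(tokens, bits=64):
    # Weighted bit-vote fingerprint: each token hashes to a `bits`-bit value;
    # bit i votes +1 if set, -1 if clear; the sign of the tally fixes bit i.
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def is_near_duplicate(tokens_a, tokens_b, threshold=3):
    # Classic Simhash rule: texts whose fingerprints differ in at most
    # `threshold` bits are treated as near-duplicates.
    return hamming(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

In practice, pairwise comparison is too slow for large corpora, which is where bucketing schemes such as the one described in the abstract come in: fingerprints are split into segments, and only texts sharing an identical segment (landing in the same bucket) are compared bit by bit.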

Key words: microblog, big data, deduplication, Simhash, multi-thread

CLC number: