计算机与现代化 (Computer and Modernization)

• Software Engineering •

An Improved Big Data Deduplication Algorithm Based on Simhash

  

  1. (School of Software, Shanghai Jiao Tong University, Shanghai 201100, China)
  • Received: 2016-11-21  Online: 2017-07-20  Published: 2017-07-20
  • About the author: ZHOU Chunhui (1992-), male, born in Changzhou, Jiangsu; master's student at the School of Software, Shanghai Jiao Tong University; research interests: distributed computing, cloud computing.

A Big Data Deduplication Algorithm Based on Simhash

  1. (School of Software, Shanghai Jiao Tong University, Shanghai 201100, China)
  • Received: 2016-11-21  Online: 2017-07-20  Published: 2017-07-20

摘要 (Abstract in Chinese): Data deduplication is one of the main steps in big data preprocessing. To improve the efficiency of big data deduplication and its performance in unfavorable cases, this paper works on raw Chinese microblog data and, building on the classic Simhash method, improves the similarity formula by taking the text duplication rate into account; in the retrieval step, it adopts the idea of bucket sorting and allocates threads over multiple rounds and levels to raise efficiency. Experimental results show that the improved algorithm significantly outperforms the classic algorithm in both efficiency and accuracy.

关键词 (Keywords in Chinese): microblog, big data, deduplication, Simhash, multi-threading

Abstract: Data deduplication is one of the main steps in big data preprocessing. To improve deduplication efficiency and to optimize the classic algorithm's performance in unfavorable cases, this paper uses Chinese microblog text data and, building on the classic Simhash algorithm, modifies the similarity formula so that the text duplication rate is taken into account. In addition, drawing on the idea of bucket sorting, it allocates threads over multiple rounds and levels to improve efficiency. Experimental results show that the improved algorithm reduces running time and improves accuracy compared with the classic algorithm.
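The paper's own code is not included on this page. As a rough illustration of the classic Simhash baseline that the abstract builds on (not the paper's improved formula), the following is a minimal Python sketch; the hash function (MD5), uniform token weights, and the 3-bit near-duplicate threshold are illustrative assumptions:

```python
import hashlib

def simhash(tokens, bits=64):
    # Weighted bit-vote fingerprint: each token hashes to a `bits`-bit value;
    # bit i votes +1 if set, -1 if clear; the sign of the tally fixes bit i.
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

def is_near_duplicate(tokens_a, tokens_b, threshold=3):
    # Classic Simhash rule: texts whose fingerprints differ in at most
    # `threshold` bits are treated as near-duplicates.
    return hamming(simhash(tokens_a), simhash(tokens_b)) <= threshold
```

In practice, pairwise comparison is too slow for large corpora, which is where bucketing schemes such as the one described in the abstract come in: fingerprints are split into segments, and only texts sharing an identical segment (landing in the same bucket) are compared bit by bit.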

Key words: microblog, big data, deduplication, Simhash, multi-thread

CLC number: