Computer and Modernization

A Big Data Deduplication Algorithm Based on Simhash

  

  1. (School of Software, Shanghai Jiao Tong University, Shanghai 201100, China)
  • Received: 2016-11-21 Online: 2017-07-20 Published: 2017-07-20

Abstract: Data deduplication is a key step in big data preprocessing. To improve deduplication efficiency and to address the poor performance of the classic algorithm under unfavorable conditions, this paper works with Chinese microblog text data and modifies the similarity formula of the classic Simhash algorithm so that it takes the duplication rate into account. In addition, borrowing the idea of bucket sorting, the paper assigns threads over multiple rounds and levels to improve efficiency. Experimental results show that the improved algorithm reduces running time and improves accuracy compared with the classic algorithm.
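
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic Simhash fingerprint and its Hamming-distance comparison, not the modified formula proposed in the paper (which the abstract does not give). The function names, the 64-bit fingerprint width, the MD5-based token hash, the regex tokenizer, and the threshold of 3 are all assumptions chosen for illustration; Chinese microblog text would need a proper word segmenter in place of the tokenizer shown here.

```python
import hashlib
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple word tokens; real Chinese text needs a word segmenter.
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Classic Simhash: a weighted, bitwise vote over per-token hashes."""
    weights = Counter(tokenize(text))  # token frequency used as the weight
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight, a cleared bit votes -weight.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    # Assumption: a threshold of 3 bits; the right value depends on the data.
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

In this sketch two texts are treated as duplicates when their fingerprints differ in at most `threshold` bits. Because the fingerprint is built from token counts, it ignores word order.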

Key words: microblog, big data, deduplication, Simhash, multi-thread

CLC Number: