Computer and Modernization


A Top-k Closest Pairs Data Search Scheme Based on MapReduce

  

  1. (Institute of Information Engineering of Xianyang Normal University, Xianyang 712000, China)
  • Received: 2014-04-25 Online: 2014-08-15 Published: 2014-08-19

Abstract:

Many applications need to find the top-k most similar pairs of records in a given database. Computing such
top-k similarity joins is challenging, however, because a growing number of applications must handle very large
volumes of data. This paper proposes a top-k closest pairs data search scheme based on MapReduce. The scheme
first conceptually splits all pairs of points into partitions, for which two methods are proposed: all pair
partitioning and essential pair partitioning. The top-k closest pairs are then found correctly by computing the
top-k closest pairs in each partition separately and selecting the top-k closest pairs among the results from
all partitions. Experiments are performed on both synthetic and real-life data sets. The performance study
confirms the effectiveness and scalability of the proposed MapReduce algorithms.
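
The partition-then-merge idea in the abstract can be illustrated with a small single-machine sketch. It is not the paper's algorithm: the abstract does not specify how all pair partitioning or essential pair partitioning assign pairs, so the round-robin block split below, the Euclidean distance, and all function names (split_into_blocks, pair_partitions, local_top_k, top_k_closest_pairs) are assumptions made for illustration only. What the sketch does show is why the result is correct: every pair of points lands in exactly one partition, so the global top-k must be contained in the union of the per-partition top-k lists.

```python
import heapq
import math
from itertools import combinations, combinations_with_replacement, product


def split_into_blocks(points, num_blocks):
    """Assumed scheme: deal the points round-robin into num_blocks blocks."""
    return [points[i::num_blocks] for i in range(num_blocks)]


def pair_partitions(blocks):
    """Yield one partition of candidate pairs per unordered block pair (i <= j).

    Every pair of points falls into exactly one partition: pairs inside one
    block come from (i, i), pairs across two blocks come from (i, j), i < j.
    """
    for i, j in combinations_with_replacement(range(len(blocks)), 2):
        if i == j:
            yield list(combinations(blocks[i], 2))
        else:
            yield list(product(blocks[i], blocks[j]))


def local_top_k(pairs, k):
    """Map-like step: the k closest pairs inside a single partition."""
    return heapq.nsmallest(k, ((math.dist(p, q), p, q) for p, q in pairs))


def top_k_closest_pairs(points, k, num_blocks=3):
    """Reduce-like step: merge the per-partition results into a global top-k."""
    blocks = split_into_blocks(points, num_blocks)
    local_results = [local_top_k(part, k) for part in pair_partitions(blocks)]
    merged = (entry for result in local_results for entry in result)
    return heapq.nsmallest(k, merged)


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (5, 5), (5, 6.5), (10, 10), (10, 12)]
    for distance, p, q in top_k_closest_pairs(sample, k=2):
        print(distance, p, q)
```

In an actual MapReduce job each partition would be processed by a separate task and only the k candidates per partition would be sent on to the final selection; the sketch runs the same two steps sequentially to keep it self-contained.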

Key words: database, similarity, MapReduce, data search, partitioning
