计算机与现代化

• 数据库与数据挖掘 • 上一篇    下一篇

一种基于MapReduce的最近似k对数据搜索方案

  

  1. (咸阳师范学院信息工程学院,陕西咸阳712000)
  • 收稿日期:2014-04-25 出版日期:2014-08-15 发布日期:2014-08-19
  • 作者简介:刘淑英(1982-),女,陕西府谷人,咸阳师范学院信息工程学院讲师,博士,研究方向:信息检索,数据挖掘。
  • 基金资助:
    咸阳师范学院基金资助项目(13XSYK054); 陕西省教学改革项目(13BY90)

A Top-k Closest Pairs Data Search Scheme  Based on MapReduce

  1. (Institute of Information Engineering of Xianyang Normal University, Xianyang 712000, China)
  • Received:2014-04-25 Online:2014-08-15 Published:2014-08-19

摘要:

多种应用场合需要寻找给定数据库中相似度最大的前k对数据。然而由于应用领域需要处理的数据规模呈上升趋势,计算这样的最
相似k对数据,难度非常大。提出一种基于MapReduce的最相似k对数据搜索方案,该方案首先将所有数据对分割成多个组,然后提出所有
数据对分组算法和核心数据对分组算法,通过单独计算每个组中的最近似k对数据,从所有组的最近似k对数据中选择相似度最高的k对数
据,进而确定最近似k对数据。最后基于合成数据和真实数据进行实验,性能评估结果表明本文MapReduce算法的有效性和可扩展性。

关键词: 数据库, 相似度, MapReduce, 数据搜索, 分组

Abstract:

There is a wide range of applications that require finding the top-k most similar pairs of records in a
given database. However, computing such top-k similarity joins is a challenging problem today, as there is an
increasing trend of applications that expect to deal with vast amounts of data. This paper proposes a top-k
closest pairs data search scheme based on MapReduce, firstly, the proposed scheme splits conceptually all pairs of
points into partitions, and then the all pair partitioning and the essential pair partitioning methods are
proposed, we can correctly find the top-k closest pairs by computing the top-k closest pairs in each partition
separately and selecting the top-k closest pairs among the top-k closest pairs from all partitions. We finally
perform the experiments with not only synthetic but also real-life data sets. The performance study confirms the
effectiveness and scalability of the proposed MapReduce algorithms.

Key words: database, similarity, MapReduce, data search, partitioning

中图分类号: