A Top-k Closest Pairs Data Search Scheme Based on MapReduce

Computer and Modernization, 2014, Vol. 0, Issue (8): 38-45. doi: 10.3969/j.issn.1006-2475.2014.08.009

(Institute of Information Engineering of Xianyang Normal University, Xianyang 712000, Shaanxi, China)

Received: 2014-04-25 Online: 2014-08-15 Published: 2014-08-19
About the author: LIU Shu-ying (b. 1982), female, native of Fugu, Shaanxi; lecturer at the Institute of Information Engineering, Xianyang Normal University; Ph.D.; research interests: information retrieval and data mining.
Supported by: Xianyang Normal University Fund Project (13XSYK054); Shaanxi Provincial Teaching Reform Project (13BY90)

Abstract:

Many applications need to find the top-k most similar pairs of records in a given database. Computing such top-k similarity joins is challenging, however, because applications increasingly have to deal with vast amounts of data. This paper proposes a top-k closest pairs data search scheme based on MapReduce. The scheme first conceptually splits all pairs of points into partitions, and then proposes the all pair partitioning and the essential pair partitioning methods. The top-k closest pairs are found correctly by computing the top-k closest pairs in each partition separately and then selecting the top-k closest pairs among the results from all partitions. Experiments on both synthetic and real-life data sets confirm the effectiveness and scalability of the proposed MapReduce algorithms.

Key words: database, similarity, MapReduce, data search, partitioning
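
The partition-then-merge idea described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's all pair partitioning or essential pair partitioning algorithm: the round-robin block scheme, the Euclidean distance, and the function names `local_top_k` and `top_k_closest_pairs` are all assumptions made for illustration. The sketch only shows why merging per-partition results is correct: every pair of points falls into exactly one partition, so the global top-k pairs must appear among the local top-k lists.

```python
import heapq
import math
from itertools import combinations, combinations_with_replacement, product


def local_top_k(block_a, block_b, k, same_block):
    """Map-like step (hypothetical): top-k closest pairs within one partition."""
    # A partition is either all pairs inside one block or all cross pairs
    # between two different blocks.
    pairs = combinations(block_a, 2) if same_block else product(block_a, block_b)
    scored = ((math.dist(p, q), p, q) for p, q in pairs)
    return heapq.nsmallest(k, scored)


def top_k_closest_pairs(points, k, num_blocks):
    """Split all point pairs into block pairs, solve each, then merge."""
    # Assumed partitioning: deal points round-robin into num_blocks blocks.
    blocks = [points[i::num_blocks] for i in range(num_blocks)]
    candidates = []
    for i, j in combinations_with_replacement(range(num_blocks), 2):
        candidates.extend(
            local_top_k(blocks[i], blocks[j], k, same_block=(i == j))
        )
    # Reduce-like step: the global answer is the top-k of the local lists.
    return heapq.nsmallest(k, candidates)


if __name__ == "__main__":
    sample = [(0, 0), (1, 0), (5, 5), (5, 6), (10, 10), (0, 3)]
    for dist, p, q in top_k_closest_pairs(sample, k=2, num_blocks=3):
        print(dist, p, q)
```

In a real MapReduce job each call to `local_top_k` would run as an independent task, and only the small per-partition top-k lists would be shuffled to the final merge, which is what keeps the merge step cheap.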

CLC Number: TP311.131

Cite this article:

LIU Shu-ying. A Top-k Closest Pairs Data Search Scheme Based on MapReduce[J]. Computer and Modernization, 2014, 0(8): 38-45.

