Computer and Modernization (计算机与现代化)

• Algorithm Design and Analysis

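Calculating the Similarity of XML Documents with a Weighted pq-gram Algorithm

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, Jiangsu, China
  • Received: 2014-11-20 Online: 2015-03-23 Published: 2015-03-26
  • About the authors: Wang Chengyong (王成勇, b. 1989), male, from Linyi, Shandong, master's student at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, research interest: data mining; Du Qingwei (杜庆伟, b. 1974), male, from Zhengzhou, Henan, associate professor, master's supervisor, Ph.D., research interest: computer networks; Sun Jing (孙静, b. 1981), female, from Shandan, Gansu, master's student, research interest: computer networks; Sun Zhen (孙振, b. 1982), male, from Jinan, Shandong, research interest: satellite communication.
  • Funding:
    National Natural Science Foundation of China (61202350)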

Abstract: Clustering is an important means of managing XML documents efficiently, and computing the similarity between XML documents is its key step. The pq-gram algorithm is an effective way to compute this similarity, but it ignores the ordering of the nodes in an XML document. Building on the pq-gram algorithm and on the structural characteristics of XML documents, the weighted pq-gram algorithm first assigns a weight to each node, then derives a weight for each pq-gram from the weights of its nodes, and finally applies these weights in the similarity calculation. Experimental results show that the weighted pq-gram algorithm better describes the contribution of each node to the similarity calculation and improves the precision of XML document similarity calculation.

Key words: XML documents, similarity calculation, pq-gram, weight

CLC number:
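
The three steps summarized in the abstract (weight the nodes, weight the pq-grams, use the weights in the similarity) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: the node weight `decay ** depth`, the pq-gram weight as the sum of its node weights, the zero weight for dummy nodes, and the function names `weighted_profile` and `weighted_similarity`. Only the pq-gram construction itself (a stem of p ancestors plus a base of q consecutive children, padded with dummy nodes) follows the standard definition.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

DUMMY = "*"  # label of the padding nodes used by pq-grams


def weighted_profile(xml_text, p=2, q=3, decay=0.5):
    """Map each pq-gram (tuple of p + q labels) to its accumulated weight.

    Assumed weighting, not taken from the paper:
    a node at depth d weighs decay ** d, dummy nodes weigh 0,
    and a pq-gram weighs the sum of the weights of its nodes.
    """
    root = ET.fromstring(xml_text)
    profile = defaultdict(float)

    def add(stem, base):
        gram = stem + base
        labels = tuple(label for label, _ in gram)
        profile[labels] += sum(weight for _, weight in gram)

    def visit(node, ancestors, depth):
        # Stem: the anchor node preceded by its p - 1 nearest ancestors.
        stem = (ancestors + [(node.tag, decay ** depth)])[-p:]
        children = list(node)
        if not children:
            # A leaf contributes one pq-gram whose base is q dummies.
            add(stem, [(DUMMY, 0.0)] * q)
            return
        pad = [(DUMMY, 0.0)] * (q - 1)
        row = pad + [(c.tag, decay ** (depth + 1)) for c in children] + pad
        # Slide a window of q consecutive (padded) children.
        for i in range(len(row) - q + 1):
            add(stem, row[i:i + q])
        for child in children:
            visit(child, stem, depth + 1)

    visit(root, [(DUMMY, 0.0)] * p, 0)
    return dict(profile)


def weighted_similarity(a, b):
    """Weighted overlap of two profiles, in [0, 1]."""
    shared = sum(min(a[g], b[g]) for g in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 2 * shared / total if total else 1.0


if __name__ == "__main__":
    left = weighted_profile("<a><b/><c/></a>")
    right = weighted_profile("<a><b/><d/></a>")
    print(weighted_similarity(left, right))
```

With `decay` below 1, pq-grams anchored near the root weigh more than those deep in the tree, which is one plausible way to let structure determine a node's contribution. Setting `decay=1.0` weights every real node equally, which is still not the unweighted pq-gram measure because each pq-gram is then weighted by how many real nodes it contains.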