Computer and Modernization (计算机与现代化)

• Database and Data Mining

Instance Alignment Between Encyclopedia Knowledge Bases Based on Semi-supervised Co-training

  

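  1. (1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100039, China)
  • Received: 2017-04-11  Online: 2017-12-25  Published: 2017-12-26
  • About the authors: Zhang Weili (张伟莉, b. 1990), female, from Jiyuan, Henan; master's student at the Institute of Electronics, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences; research interests: machine learning, knowledge graph construction. Huang Tinglei (黄廷磊, b. 1971), male, Ph.D., doctoral supervisor; research interests: data mining, big data analysis. Liang Xiao (梁霄, b. 1981), male, Ph.D., lecturer; research interests: data organization and management, knowledge graph construction.
  • Funding:
    National High Technology Research and Development Program of China (2015AA7115028, 2015AA7115061)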

Instance Alignment Algorithm Between Encyclopedia Based on Semi-supervised Co-training

  1. (1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100039, China)
  • Received: 2017-04-11  Online: 2017-12-25  Published: 2017-12-26

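Abstract: The supervised learning algorithms used in traditional instance alignment depend on large amounts of labeled data, and their feature representations are poorly suited to encyclopedia knowledge bases. To address these problems, an instance alignment method based on semi-supervised co-training is proposed. Instance alignment is modeled as a constrained binary classification problem, and multi-dimensional features are built by combining instance names, attributes, description texts, and key information within those texts such as temporal and numerical values. The features are split into two relatively independent views, and the classifiers on the two views are co-trained to learn the distribution of synonymous instances from unlabeled data iteratively. Experiments on two Chinese encyclopedias show that the semi-supervised co-training method reaches an F1 score of 84.3% for instance alignment, the best among the compared methods, which demonstrates its effectiveness and its practical value for encyclopedia knowledge bases.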

Key words: instance alignment, semi-supervised learning, co-training, feature representation, gradient boosting decision tree
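
The feature construction described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the record layout (`name`, `attrs`, `text`), the helper names `jaccard` and `pair_features`, the particular similarity measures, and the way features are assigned to the two views. The abstract does not specify any of these.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns for "key information" in description text.
YEAR = re.compile(r"(?<!\d)(?:1\d{3}|20\d{2})(?!\d)")  # four-digit years 1000-2099
NUMBER = re.compile(r"\d+(?:\.\d+)?")                   # integers and decimals


def jaccard(a, b):
    """Jaccard overlap of two iterables, treated as sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 0.0


def pair_features(e1, e2):
    """Return (view_a, view_b) feature lists for one candidate instance pair."""
    # View A (assumed split): structured information, i.e. names and attributes.
    name_sim = SequenceMatcher(None, e1["name"], e2["name"]).ratio()
    key_overlap = jaccard(e1["attrs"], e2["attrs"])  # overlap of attribute names
    shared = set(e1["attrs"]) & set(e2["attrs"])
    value_match = (
        sum(e1["attrs"][k] == e2["attrs"][k] for k in shared) / len(shared)
        if shared else 0.0
    )
    view_a = [name_sim, key_overlap, value_match]

    # View B (assumed split): description text plus the values extracted from it.
    text_sim = jaccard(e1["text"], e2["text"])  # character-level overlap
    year_overlap = jaccard(YEAR.findall(e1["text"]), YEAR.findall(e2["text"]))
    number_overlap = jaccard(NUMBER.findall(e1["text"]), NUMBER.findall(e2["text"]))
    view_b = [text_sim, year_overlap, number_overlap]
    return view_a, view_b
```

Each candidate pair then yields two short feature vectors, one per view, which is the input shape a two-view co-training procedure needs. Character-level overlap is used for the text because it works without a word segmenter; a real system would likely use stronger text features.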

Abstract: Traditional supervised learning algorithms for instance alignment depend on large amounts of labeled data, and their feature representation methods are not well suited to encyclopedia data. To address these issues, a semi-supervised co-training method for instance alignment is proposed. Instance alignment is modeled as a constrained binary classification problem. Multiple features are then extracted from several categories of existing information, including instance names, attributes, description texts, and critical discrete values extracted from the texts, such as temporal and numerical values. The features are divided into two relatively independent views, and two models are co-trained on these views so that they iteratively learn more about the distribution of synonymous instances from the unlabeled data. Experimental results on two Chinese encyclopedia datasets show that the proposed method achieves an F1 score of 84.3% on instance alignment and outperforms the other compared methods, demonstrating its effectiveness and its applicability to encyclopedia knowledge bases.

Key words: instance alignment, semi-supervised method, co-training, feature representation, gradient boosting decision tree
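
The co-training loop outlined in the abstract can be sketched as follows. This is a generic two-view co-training skeleton, not the authors' implementation. The keywords name gradient boosting decision trees, but to stay dependency-free the sketch swaps in a hypothetical `CentroidClassifier`. The names `fit_views` and `co_train`, the parameters `rounds`, `per_round`, and `threshold`, and the selection rule are all assumptions. The constraint mentioned in the abstract is not modeled, and the labeled seed set is assumed to contain both classes.

```python
import math


class CentroidClassifier:
    """Stand-in classifier (assumption): scores by distance to class centroids."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        """Return a pseudo-probability that x belongs to class 1."""
        d0 = math.dist(x, self.centroids[0])
        d1 = math.dist(x, self.centroids[1])
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, y in labeled]
    return tuple(
        CentroidClassifier().fit([views[v] for views, _ in labeled], labels)
        for v in (0, 1)
    )


def co_train(labeled, unlabeled, rounds=5, per_round=2, threshold=0.7):
    """labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for v, clf in enumerate(fit_views(labeled)):
            # Rank unlabeled pairs by how confident this view's classifier is.
            scored = [(clf.predict_proba(views[v]), i) for i, views in enumerate(pool)]
            scored.sort(key=lambda t: abs(t[0] - 0.5), reverse=True)
            for p, i in scored[:per_round]:
                if max(p, 1 - p) >= threshold:
                    picked.setdefault(i, int(p >= 0.5))
        if not picked:
            break  # neither view is confident about any remaining pair
        # Pseudo-labeled pairs join the shared set, so each view teaches the other.
        labeled += [(pool[i], y) for i, y in picked.items()]
        pool = [views for i, views in enumerate(pool) if i not in picked]
    return fit_views(labeled), labeled
```

Each round retrains both view classifiers, lets each one pseudo-label the pairs it is most confident about, and adds those pairs to the shared training set. Pairs on which both views stay uncertain remain unlabeled, which limits the noise introduced into later rounds.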

CLC number: