Instance Alignment Algorithm Between Encyclopedia Based on Semi-supervised Co-training (基于半监督协同训练的百科知识库实体对齐)

doi:10.3969/j.issn.1006-2475.2017.12.017

Computer and Modernization, 2017, Vol. 0, Issue (12): 88-93. doi: 10.3969/j.issn.1006-2475.2017.12.017

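(1. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100039, China)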

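Received: 2017-04-11 Online: 2017-12-25 Published: 2017-12-26
About the authors: ZHANG Wei-li (born 1990), female, from Jiyuan, Henan, master's student at the Institute of Electronics, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences; research interests: machine learning, knowledge graph construction. HUANG Ting-lei (born 1971), male, doctoral supervisor, Ph.D.; research interests: data mining, big data analysis. LIANG Xiao (born 1981), male, lecturer, Ph.D.; research interests: data organization and management, knowledge graph construction.
Funding:
National High Technology Research and Development Program of China (2015AA7115028, 2015AA7115061)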

Abstract

Traditional supervised learning approaches to instance (entity) alignment depend on large amounts of labeled data, and their feature representations are poorly suited to encyclopedia knowledge bases. To address these problems, a semi-supervised co-training method for instance alignment is proposed. Instance alignment is modeled as a constrained binary classification problem. Multi-dimensional features are built by combining instance names, attributes, description texts, and key values extracted from those texts, such as temporal and numerical values. The features are divided into two relatively independent views, and two classifiers, one per view, are co-trained so that the distribution of synonymous instances is learned iteratively from the unlabeled data. Experiments on two Chinese encyclopedias show that the proposed method achieves an F1 score of 84.3% on instance alignment and outperforms the compared methods, demonstrating its effectiveness and its practical value for encyclopedia knowledge bases.

Key words: instance alignment, semi-supervised method, co-training, feature representation, gradient boosting decision tree
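
The co-training loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier is a deliberately simple stand-in (the key words suggest the paper uses gradient boosting decision trees), the feature values are invented toy numbers, the names `CentroidClassifier`, `co_train`, `per_round`, and `threshold` are hypothetical, and the alignment constraint mentioned in the abstract is not modeled. Each candidate instance pair has two feature views; in every round, one classifier per view is trained on the labeled pairs, and the most confident predictions on unlabeled pairs are added as pseudo-labels.

```python
import math


class CentroidClassifier:
    """Toy stand-in for a per-view classifier (NOT the paper's model)."""

    def fit(self, rows, labels):
        # Store the mean feature vector of each class (0 = different, 1 = same instance).
        self.centroids = {}
        for cls in (0, 1):
            members = [r for r, t in zip(rows, labels) if t == cls]
            self.centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
        return self

    def prob_same(self, row):
        # The closer a row is to the class-1 centroid, the nearer the score is to 1.
        d0 = math.dist(row, self.centroids[0])
        d1 = math.dist(row, self.centroids[1])
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def co_train(view_a, view_b, labels, rounds=5, per_round=2, threshold=0.8):
    """Co-training loop; labels[i] is 0, 1, or None for an unlabeled candidate pair."""
    labels = list(labels)
    for _ in range(rounds):
        known = [i for i, t in enumerate(labels) if t is not None]
        unknown = [i for i, t in enumerate(labels) if t is None]
        if not unknown:
            break
        targets = [labels[i] for i in known]
        pseudo = {}
        for view in (view_a, view_b):
            # Train one classifier on this view only.
            model = CentroidClassifier().fit([view[i] for i in known], targets)
            # Rank unlabeled pairs by how far the score is from 0.5 (most confident first).
            scored = sorted(((model.prob_same(view[i]), i) for i in unknown),
                            key=lambda s: abs(s[0] - 0.5), reverse=True)
            for p, i in scored[:per_round]:
                if max(p, 1 - p) >= threshold:
                    # If both views pick the same pair, the first view's label is kept.
                    pseudo.setdefault(i, int(p >= 0.5))
        if not pseudo:
            break
        for i, t in pseudo.items():
            labels[i] = t
    return labels


# Hypothetical features per candidate pair.
# View A: [name similarity, attribute overlap]
view_a = [[0.95, 0.9], [0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.8, 0.7], [0.05, 0.2]]
# View B: [description similarity, date/number agreement]
view_b = [[0.9, 1.0], [0.1, 0.0], [0.85, 1.0], [0.15, 0.0], [0.8, 0.9], [0.2, 0.1]]
seed_labels = [1, 0, None, None, None, None]  # only two pairs are labeled

completed = co_train(view_a, view_b, seed_labels)
print(completed)
```

Both classifiers add their pseudo-labels to one shared labeled pool, so each view can benefit from pairs the other view is confident about. This is the usual motivation for splitting the features into relatively independent views.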

CLC number: TP391.1

张伟莉1,2,黄廷磊1,梁霄1. 基于半监督协同训练的百科知识库实体对齐[J]. 计算机与现代化, 2017, 0(12): 88-93.

ZHANG Wei-li1,2, HUANG Ting-lei1, LIANG Xiao1. Instance Alignment Algorithm Between Encyclopedia Based on Semi-supervised Co-training[J]. Computer and Modernization, 2017, 0(12): 88-93.
