Instance Alignment Algorithm Between Encyclopedia Based on Semi-supervised Co-training

doi:10.3969/j.issn.1006-2475.2017.12.017

Abstract

Traditional supervised learning algorithms for instance alignment depend on large amounts of labeled data, and their feature representations are poorly suited to encyclopedia data. To address these issues, a semi-supervised co-training method for instance alignment is proposed. Instance alignment is modeled as a constrained binary classification problem. Multiple features are then extracted from the different kinds of available information: instance names, attributes, description texts, and key discrete values extracted from those texts, such as temporal and numerical values. The features are divided into two relatively independent views, and two models are trained interactively on these views so that each iteration learns more about the distribution of synonymous instances from the unlabeled data. Experiments on two Chinese encyclopedia datasets show that the proposed method achieves an F1 score of 84.3% on instance alignment and outperforms the comparison methods, demonstrating its effectiveness and applicability.
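The two-view training loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier is a stand-in for the gradient boosting decision tree named in the key words, the split of features between the views and the synthetic similarity scores are assumptions, and the names `CentroidClassifier`, `co_train`, `make_pair` and `predict_match` are hypothetical. The constraint on the classification problem is not modeled here.

```python
import math
import random


class CentroidClassifier:
    """Nearest-centroid stand-in for the paper's learner (NOT the GBDT)."""

    def fit(self, X, y):
        # Assumes both classes (0 and 1) are present in y.
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        # Pseudo-probability of class 1: larger when x is nearer centroid 1.
        d0 = math.dist(x, self.centroids[0])
        d1 = math.dist(x, self.centroids[1])
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def train_views(labeled):
    """Fit one model per view on the current labeled set."""
    y = [label for _, label in labeled]
    clf_a = CentroidClassifier().fit([pair[0] for pair, _ in labeled], y)
    clf_b = CentroidClassifier().fit([pair[1] for pair, _ in labeled], y)
    return clf_a, clf_b


def co_train(labeled, unlabeled, rounds=5, per_round=2):
    """labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        clf_a, clf_b = train_views(labeled)
        picked = {}  # pool index -> pseudo-label
        for clf, view in ((clf_a, 0), (clf_b, 1)):
            # Rank unlabeled pairs by this view's confidence (distance from 0.5).
            scored = sorted(
                ((abs(clf.predict_proba(p[view]) - 0.5), i) for i, p in enumerate(pool)),
                reverse=True,
            )
            for _, i in scored[:per_round]:
                picked.setdefault(i, int(clf.predict_proba(pool[i][view]) >= 0.5))
        # Pseudo-labeled pairs join the shared training set used by both views.
        labeled += [(pool[i], label) for i, label in picked.items()]
        pool = [p for i, p in enumerate(pool) if i not in picked]
    return train_views(labeled) + (labeled,)


def predict_match(clf_a, clf_b, pair):
    """Average the two views' scores to decide whether a candidate pair matches."""
    score = (clf_a.predict_proba(pair[0]) + clf_b.predict_proba(pair[1])) / 2
    return score >= 0.5


rng = random.Random(0)


def make_pair(is_match):
    # Synthetic similarity scores in [0, 1]. View A might hold name and
    # attribute similarity, view B text and discrete-value similarity (assumed).
    lo, hi = (0.7, 1.0) if is_match else (0.0, 0.3)
    view_a = [rng.uniform(lo, hi), rng.uniform(lo, hi)]
    view_b = [rng.uniform(lo, hi), rng.uniform(lo, hi)]
    return (view_a, view_b)


seed = [(make_pair(True), 1), (make_pair(False), 0)]
unlabeled = [make_pair(i % 2 == 1) for i in range(20)]
clf_a, clf_b, grown = co_train(seed, unlabeled, rounds=3, per_round=2)
```

Each round, both models add their most confident predictions to the shared labeled set, so a pair that one view scores confidently becomes training data for the other view. In practice the per-round quota and a confidence threshold would need tuning to keep pseudo-label noise from accumulating.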

Key words: instance alignment, semi-supervised method, co-training, feature representation, gradient boosting decision tree

CLC Number: TP391.1

ZHANG Wei-li1,2, HUANG Ting-lei1, LIANG Xiao1. Instance Alignment Algorithm Between Encyclopedia Based on Semi-supervised Co-training[J]. Computer and Modernization, 2017, 0(12): 88-93.
