Computer and Modernization (计算机与现代化) ›› 2016, Vol. 251 ›› Issue (07): 60-67. doi: 10.3969/j.issn.1006-2475.2016.07.012

• Algorithm Design and Analysis •

 A Chinese Address Resolution Model Based on Trie Tree and Finite Automata

  1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
  2. FiberHome Communications Science & Technology Development Co. Ltd., Nanjing 210019, China
  • Received: 2016-01-04 Online: 2016-07-21 Published: 2016-07-22
  • About the authors: WANG Yang (1978-), male, from Nanjing, Jiangsu; senior engineer at the Wuhan Research Institute of Posts and Telecommunications and the Nanjing R&D Department of FiberHome; M.S.; research interests: big data analysis, network behavior. LIU Shipei (1992-), male, from Qianjiang, Hubei; master's student; research interests: data analysis, natural language processing. WANG Zheng (1977-), male; senior engineer; research interests: massive data analysis, network behavior analysis.

Abstract:  Neither academic research nor commercial practice has yet produced a mature model for Chinese address resolution. Element identification is the key technique in address resolution. Traditional address element identification relies on feature words and dictionary matching, which copes poorly with the diversity of address naming and with non-canonical addresses. Drawing on natural language processing techniques, this paper proposes the T-FA model, which segments an address and assigns each segment to a hierarchical level. The model combines a Trie tree for addressing administrative regions with a finite-state automaton (FA) for extracting elements from non-canonical addresses. Word segmentation based on a hidden Markov model, together with the longest common subsequence (LCS) algorithm, supports fuzzy search when identifying address elements. The T-FA model generalizes better than existing methods when processing addresses in batch, and it is comparatively effective at resolving diverse, non-canonical Chinese addresses.

Key words:  natural language processing, address resolution, element identification, Trie tree model, finite-state automaton model
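
The abstract names two building blocks that are easy to illustrate in isolation: Trie-based lookup of administrative regions and LCS-based fuzzy matching. The Python sketch below is only an illustration under stated assumptions, not the authors' T-FA implementation. The names `RegionTrie`, `split_regions`, `lcs_length` and `fuzzy_match`, the level labels, and the sample address are all hypothetical, and the hidden Markov model segmentation and the finite-state automaton stage are left out.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on nodes where a complete region name ends.
        self.level: Optional[str] = None


class RegionTrie:
    """Character-level Trie over administrative region names (hypothetical helper)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, name: str, level: str) -> None:
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.level = level

    def longest_prefix(self, text: str, start: int = 0) -> Optional[Tuple[str, str]]:
        """Return (name, level) for the longest region name beginning at `start`."""
        node = self.root
        best = None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.level is not None:
                best = (text[start:i + 1], node.level)
        return best


def split_regions(address: str, trie: RegionTrie) -> Tuple[List[Tuple[str, str]], str]:
    """Peel known region names off the front; return them plus the unparsed rest."""
    parts: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(address):
        hit = trie.longest_prefix(address, pos)
        if hit is None:
            break
        name, level = hit
        parts.append((level, name))
        pos += len(name)
    return parts, address[pos:]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def fuzzy_match(fragment: str, candidates: List[str]) -> str:
    """Pick the candidate with the highest length-normalized LCS score.

    Assumes `fragment` and every candidate are non-empty.
    """
    return max(candidates,
               key=lambda c: lcs_length(fragment, c) / max(len(fragment), len(c)))
```

With `江苏省`, `南京市` and `建邺区` inserted as province, city and district, `split_regions` on the made-up address `江苏省南京市建邺区云龙山路88号` peels off the three region names and leaves `云龙山路88号` for a later stage. In the paper's design that later stage would be the finite-state automaton. `fuzzy_match("南京", ["北京市", "南京市"])` selects `南京市`, which shows how an abbreviated region name might be mapped to its canonical form before the Trie lookup. Normalizing the LCS length by the longer string is one possible scoring choice, not one taken from the paper.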