计算机与现代化

• 应用与开发 • 上一篇    下一篇

基于AC自动机和地址概率模型的地址标准化算法

  

  1. (1.武汉邮电科学研究院,湖北武汉430074;2.南京烽火软件科技有限公司,江苏南京210019)
  • 收稿日期:2018-07-17 出版日期:2019-01-03 发布日期:2019-01-04
  • 作者简介:刘宇(1974-),男,吉林辽源人,武汉邮电科学研究院高级工程师,南京烽火软件科技有限公司副总经理,硕士,研究方向:互联网应用; 张敬会(1994-),男,江西九江人,硕士研究生,研究方向:自然语言处理,数据挖掘。

Address Standardization Algorithm Based on Aho-corasick #br# Automaton and Address Probability Model

  1. (1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
    2. Nanjing FiberHome Software Technology Co. Ltd, Nanjing 210019, China)
  • Received:2018-07-17 Online:2019-01-03 Published:2019-01-04

摘要: 中文地址具有广泛的应用领域和应用价值,地址标准化是地址编码的基础,而地址编码技术是利用好地址数据的重要一环。本文基于双数组的极速多模式匹配的trie树来进行初步分词和词性标注,利用最长后缀匹配能够非常快速地找出包含行政区划的地址元素,以此为基础可以将地址切分成不同地址元素并标注等级,建立地址向量模型(AVSM)。将AVSM中行政区划部分地址数据进行条件组合,找出可能的行政区划候选值。采用余弦相似度算法,计算出最佳的行政路径。对于后续非行政区等级元素,使用概率地址模型对各等级元素进行概率统计,利用贝叶斯求出最佳的组词概率,进一步处理其它各个级别的地址。最后通过有限状态机能够对整个地址等级进行各级元素的隶属调整和实现不同等级具体修复方法。该方法能够保证在海量的地址数据中实现快速切分的同时对行政缺失的地址数据进行补全,利用关键词和概率模型有效地识别登录词,兼顾分词性能和可维护性。

关键词: 中文地址, 标准化, AC自动机, 自然语言处理

Abstract:  Chinese address has a wide range of application fields and application values, and the address coding technology is the key part of it, while the address coding technology is based on the address standardization. This paper first introduces a double-array trie tree with fast speed and multi-pattern matching to the initial segmentation and part-of-speech tagging. By using the longest suffix matching, the address elements of the administrative division can be found very quickly. Based on those technologies the address can be segmented into different addresses elements and be labeled grade, so as to establish the address vector space model (AVSM). This paper gives three steps to process the data in AVSM, that is the conditional combination of part of the administrative divisions AVSM data to obtain possible administrative division candidates. By using cosine similarity algorithm we can calculate the best administrative path. For the following non-administrative divisional hierarchical elements, the probabilistic address model is used to calculate the probability of each hierarchical elements, and Bayes can be used to find the best word probability and process the other levels address in the further. Finally, the finite state machine can be used to adjust the level of membership of each level of the entire address level and to achieve different levels of specific repair methods. This method can quickly cut out a large number of address data, and supplement the missing address data of the administrative department. Using the keyword and probability model can effectively identify the login word, take into account both of the word segmentation performance and maintainability.

Key words: Chinese address, standardization, Aho-corasick automaton, NLP

中图分类号: