Computer and Modernization ›› 2018, Vol. 0 ›› Issue (12): 45-.doi: 10.3969/j.issn.1006-2475.2018.12.009


Address Standardization Algorithm Based on Aho-Corasick Automaton and Address Probability Model

  

  1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
  2. Nanjing FiberHome Software Technology Co. Ltd, Nanjing 210019, China
  • Received:2018-07-17 Online:2019-01-03 Published:2019-01-04

Abstract: Chinese addresses are used in a wide range of applications. Address coding is a key technology for these applications, and it in turn depends on address standardization. This paper first applies a double-array trie, which supports fast multi-pattern matching, to the initial segmentation and part-of-speech tagging of an address. Using longest suffix matching, the address elements that belong to administrative divisions can be found very quickly. On this basis the address is segmented into address elements, each labeled with a level, and an address vector space model (AVSM) is built. The AVSM data are then processed in three steps. First, the administrative-division data in the AVSM are conditionally combined to obtain candidate administrative divisions, and the cosine similarity algorithm is used to select the best administrative path. Second, for the hierarchical elements that follow and are not administrative divisions, a probabilistic address model computes the probability of each element, and Bayes' rule is used to find the most probable word and to process the remaining address levels. Finally, a finite state machine adjusts the membership relations among the levels of the whole address and applies level-specific repair methods. The method can quickly segment large volumes of address data and fill in missing administrative-division elements. By combining keywords with a probability model, it can effectively identify out-of-vocabulary words while balancing segmentation performance and maintainability.
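The multi-pattern matching step described in the abstract can be illustrated with a minimal Aho-Corasick sketch. This is not the paper's implementation: it uses a plain dictionary-based trie instead of a double-array trie, and the dictionary entries, level labels, and function names (`build_automaton`, `find_all`, `longest_per_start`) are assumptions made for illustration only.

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton from a {pattern: level_label} dict."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # (pattern, label) pairs recognized when reaching each state
    for pat, label in patterns.items():
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append((pat, label))
    # Breadth-first pass to set failure links (depth-1 states fail to the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches that end here
            queue.append(t)
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern, label) for every dictionary hit in text."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat, label in out[s]:
            hits.append((i - len(pat) + 1, pat, label))
    return hits

def longest_per_start(hits):
    """Keep only the longest hit beginning at each start position."""
    best = {}
    for start, pat, label in hits:
        if start not in best or len(pat) > len(best[start][0]):
            best[start] = (pat, label)
    return [(start, pat, label) for start, (pat, label) in sorted(best.items())]

# Hypothetical dictionary of administrative-division names and their levels.
DEMO_DICT = {"湖北省": "province", "武汉": "city", "武汉市": "city", "洪山区": "district"}
DEMO_HITS = find_all("湖北省武汉市洪山区邮科院路88号", build_automaton(DEMO_DICT))
```

Here `DEMO_HITS` contains every dictionary entry found in one pass over the address, including the overlapping "武汉" and "武汉市"; `longest_per_start(DEMO_HITS)` keeps the longer of the two. The unmatched tail of the address is the part that the abstract assigns to the probabilistic address model.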

Key words: Chinese address, standardization, Aho-Corasick automaton, NLP
