Address Standardization Algorithm Based on Aho-Corasick Automaton and Address Probability Model

doi:10.3969/j.issn.1006-2475.2018.12.009

Computer and Modernization, 2018, Vol. 0, Issue (12): 45-. doi: 10.3969/j.issn.1006-2475.2018.12.009


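About the authors: LIU Yu (b. 1974), male, from Liaoyuan, Jilin; senior engineer at Wuhan Research Institute of Posts and Telecommunications and deputy general manager of Nanjing FiberHome Software Technology Co. Ltd; master's degree; research interest: Internet applications. ZHANG Jing-hui (b. 1994), male, from Jiujiang, Jiangxi; master's student; research interests: natural language processing and data mining.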

Address Standardization Algorithm Based on Aho-Corasick Automaton and Address Probability Model

(1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
2. Nanjing FiberHome Software Technology Co. Ltd, Nanjing 210019, China)

Received:2018-07-17 Online:2019-01-03 Published:2019-01-04

Abstract

Chinese addresses are used in a wide range of applications. Address standardization is the foundation of address coding, and address coding is a key step in making good use of address data. This paper uses a double-array trie with fast multi-pattern matching for initial word segmentation and part-of-speech tagging. Longest-suffix matching quickly finds the address elements that contain administrative divisions. On this basis, the address is segmented into address elements, each labeled with a level, to build an address vector space model (AVSM). The administrative-division elements in the AVSM are then combined conditionally to produce candidate administrative divisions, and cosine similarity is used to select the best administrative path. For the subsequent non-administrative elements, a probabilistic address model collects probability statistics for the elements at each level, and Bayes' rule is used to find the most probable word combination, which handles the remaining address levels. Finally, a finite state machine adjusts the hierarchical membership of the elements across the whole address and applies level-specific repair methods. The method segments massive address data quickly while completing addresses whose administrative divisions are missing. Using keywords and the probability model, it effectively recognizes registered (dictionary) words, balancing segmentation performance and maintainability.

Key words: Chinese address, standardization, Aho-Corasick automaton, NLP
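
The multi-pattern matching stage described in the abstract can be illustrated with a minimal sketch. It assumes a plain dictionary-based Aho-Corasick automaton rather than the double-array trie the paper uses, and the function names (`build_automaton`, `find_elements`), the sample dictionary, and the numeric level labels are all hypothetical, chosen only for illustration. It reports every dictionary hit with its start position and level; choosing among overlapping hits (for example by longest match) is left out.

```python
from collections import deque


def build_automaton(patterns):
    """Build an Aho-Corasick automaton from a {pattern: level} dict.

    Levels are hypothetical labels (e.g. 1 = province, 2 = city).
    """
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # (pattern, level) pairs recognized at each state
    for pat, level in patterns.items():
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append((pat, level))

    # Breadth-first pass: depth-1 states keep failure link 0,
    # deeper states inherit links and outputs from shallower ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_elements(text, automaton):
    """Return (start, pattern, level) for every dictionary hit in text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat, level in out[state]:
            hits.append((i - len(pat) + 1, pat, level))
    return hits


if __name__ == "__main__":
    # Hypothetical administrative dictionary with level labels.
    dictionary = {"江苏省": 1, "南京市": 2, "南京": 2, "建邺区": 3}
    automaton = build_automaton(dictionary)
    for hit in find_elements("江苏省南京市建邺区云龙山路88号", automaton):
        print(hit)
```

The text is scanned once regardless of dictionary size, which is the property that makes this family of matchers attractive for large address datasets. A double-array trie would store the same transitions in two flat arrays to reduce memory use and lookup cost.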

CLC number: TP393

刘宇1,2，张敬会1. 基于AC自动机和地址概率模型的地址标准化算法[J]. 计算机与现代化, 2018, 0(12): 45-.

LIU Yu1,2, ZHANG Jing-hui1. Address Standardization Algorithm Based on Aho-corasick #br# Automaton and Address Probability Model[J]. Computer and Modernization, 2018, 0(12): 45-.

