A Chinese Address Resolution Model Based on Trie Tree and Finite Automata

doi:10.3969/j.issn.1006-2475.2016.07.012

Computer and Modernization, 2016, Vol. 251, Issue (07): 60-67.

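1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, Hubei, China; 2. Nanjing R&D Department, FiberHome Communications Science & Technology Development Co. Ltd., Nanjing 210019, Jiangsu, China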

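Received: 2016-01-04 Online: 2016-07-21 Published: 2016-07-22
About the authors: WANG Yang (b. 1978), male, from Nanjing, Jiangsu, senior engineer at the Wuhan Research Institute of Posts and Telecommunications and the Nanjing R&D Department of FiberHome Communications Science & Technology Development Co. Ltd., master's degree, research interests: big data analysis, network behavior; LIU Shi-pei (b. 1992), male, from Qianjiang, Hubei, master's student, research interests: data analysis, natural language processing; WANG Zheng (b. 1977), male, senior engineer, research interests: massive data analysis, network behavior analysis.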

Abstract

Chinese address resolution still lacks a mature model in both academic research and commercial applications. Element identification is the key technique in address resolution, but traditional approaches, which rely on feature words and dictionary matching, struggle with the diversity of address naming and with non-canonical addresses. Drawing on natural language processing techniques, this paper proposes the T-FA model for segmenting an address into hierarchical levels: a Trie tree model locates administrative regions, and a finite automaton (FA) model extracts elements from non-canonical addresses. Word segmentation based on a hidden Markov model, combined with the longest common subsequence (LCS) algorithm, supports fuzzy search and recognition of address elements. The T-FA model generalizes well when processing addresses in batches and resolves the diverse forms of Chinese addresses more effectively than existing approaches.

Key words: natural language processing, address resolution, element identification, Trie tree model, finite automata model
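
The abstract names the techniques but gives no implementation details, so the following is only a minimal sketch of two of them: a character-level trie used for longest-prefix lookup of administrative region names, and an LCS-based fallback for fuzzy matching. The class and function names (`RegionTrie`, `split_regions`, `fuzzy_match`), the flat list of region names, and the 0.6 threshold are all illustrative assumptions, not the authors' T-FA design.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True when a complete region name ends here


class RegionTrie:
    """Character-level trie over region names (hypothetical helper)."""

    def __init__(self, names: List[str]) -> None:
        self.root = TrieNode()
        for name in names:
            node = self.root
            for ch in name:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

    def longest_prefix(self, text: str, start: int = 0) -> Optional[str]:
        """Return the longest stored name beginning at text[start], if any."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_end:
                best = text[start:i + 1]
        return best


def split_regions(address: str, trie: RegionTrie) -> Tuple[List[str], str]:
    """Greedily peel known region names off the front of an address."""
    parts: List[str] = []
    pos = 0
    while pos < len(address):
        name = trie.longest_prefix(address, pos)
        if name is None:
            break
        parts.append(name)
        pos += len(name)
    return parts, address[pos:]  # matched regions, unparsed remainder


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def fuzzy_match(fragment: str, names: List[str], threshold: float = 0.6) -> Optional[str]:
    """Return the name with the best LCS ratio, or None below the (assumed) threshold."""
    best_name, best_score = None, 0.0
    for name in names:
        score = lcs_length(fragment, name) / max(len(fragment), len(name))
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Toy data: a real system would load a full administrative-division gazetteer.
REGIONS = ["江苏省", "南京市", "南京", "建邺区"]
trie = RegionTrie(REGIONS)
regions, remainder = split_regions("江苏省南京市建邺区云龙山路88号", trie)
```

In this sketch the trie handles the well-formed administrative prefix, while `fuzzy_match` could be tried on a fragment that fails exact lookup, for example one containing a mistyped character. How the paper combines these steps with its finite automaton and hidden Markov segmentation is not specified in the abstract.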

汪洋1,2，刘师培1，王峥2. 基于Trie树和有限状态自动机的中文地址解析模型[J]. 计算机与现代化, 2016, 251(07): 60-67.

WANG Yang1,2, LIU Shi-pei1, WANG Zheng2. A Chinese Address Resolution Model Based on Trie Tree and Finite Automata[J]. Computer and Modernization, 2016, 251(07): 60-67.

