Computer and Modernization

• Database and Data Mining •

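Structured Approach for Pathological Microscopy Text

  1. (1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China; 2. Computer Center of Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 201620, China)

  • Received: 2015-11-09 Online: 2016-04-14 Published: 2018-09-30
  • About the authors: 陈德华 (b. 1976), male, from Putian, Fujian; associate professor at the School of Computer Science and Technology, Donghua University; Ph.D.; research interests: databases and data warehouses, smart healthcare. 刘茜茜 (b. 1990), female, master's student; research interests: databases, big data, and smart healthcare. 乐嘉锦 (b. 1951), male, professor and doctoral supervisor; research interests: databases and data warehouses, big data. 潘乔 (b. 1977), associate professor and master's supervisor; research interests: big data and network security. 朱立峰 (b. 1976), senior engineer at the Computer Center of Ruijin Hospital, Shanghai Jiao Tong University School of Medicine; research interests: hospital informatization and databases.
  • Supported by:
    Science and Technology Innovation Action Plan of the Science and Technology Commission of Shanghai Municipality (15511106900)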

Abstract:

Current approaches to structuring medical text data mostly depend on general-purpose word segmentation tools or medical knowledge bases. General-purpose segmentation tools, however, recognize specialized terms poorly, and the standardization of Chinese medical terminology in China is still immature. To address these problems, this paper proposes a method for structuring pathological microscopy text data based on statistical information. Starting from clustered text, the method segments the text using breakpoint words and coincident strings, uses the statistics of the resulting word strings to obtain keywords and word-category information, and then expands the vocabulary to produce a final lexicon that serves as the dictionary. A dictionary-based bidirectional (two-way) maximum matching algorithm then segments the text, and added negation-detection rules yield the structured data. Experiments show that the medical lexicon obtained by this method reaches an accuracy of 80%, and that the method obtains structured data without relying on word segmentation tools.

Key words: medical text data, structuring text data, statistics, word segmentation, two-way maximum matching
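
The final stage named in the abstract, dictionary-based two-way maximum matching followed by a negation rule, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tie-breaking heuristic (fewer tokens, then fewer single-character tokens, then the backward result), the negation cues, and the sample lexicon are all assumptions made for illustration.

```python
# Hypothetical negation cues; the paper's actual rules are not given in the abstract.
NEGATION_CUES = {"未见", "无"}
PUNCTUATION = set("，。；,.;")


def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right matching; falls back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left matching; falls back to single characters."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, lexicon):
    """Run both directions and keep the segmentation that looks better."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd

    def singles(tokens):
        return sum(len(t) == 1 for t in tokens)

    # Assumed tie-break: fewer single characters wins, otherwise backward.
    return fwd if singles(fwd) < singles(bwd) else bwd


def extract_findings(text, lexicon, negation_cues=NEGATION_CUES):
    """Map each lexicon term found in the text to True (present) or False (negated)."""
    tokens = bidirectional_max_match(text, lexicon | negation_cues)
    findings, negated = {}, False
    for tok in tokens:
        if tok in negation_cues:
            negated = True           # cue applies to the next lexicon term
        elif tok in lexicon:
            findings[tok] = not negated
            negated = False
        elif tok in PUNCTUATION:
            negated = False          # a clause boundary ends the negation scope
    return findings


if __name__ == "__main__":
    lexicon = {"癌细胞", "淋巴细胞", "细胞"}  # illustrative terms only
    print(extract_findings("未见癌细胞，可见淋巴细胞", lexicon))
```

In this sketch the negation cue is added to the matching dictionary so that it survives segmentation as one token, and its scope ends at the next lexicon term or punctuation mark. A real rule set for microscopy reports would likely need a richer notion of scope.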
