Phrase Extraction Method Based on a Two-Layer Corpus Filter

doi:10.3969/j.issn.1006-2475.2015.12.002

Computer and Modernization ›› 2015, Vol. 0 ›› Issue (12): 7-. doi: 10.3969/j.issn.1006-2475.2015.12.002


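1. China Mobile Group Guangdong Co. Ltd., Guangzhou 510006, Guangdong, China; 2. School of Software Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China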

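Received: 2015-09-06 Online: 2015-12-23 Published: 2015-12-30
About the authors: LIN Bo (b. 1972), male, from Zhanjiang, Guangdong, engineer at China Mobile Group Guangdong Co. Ltd., master's degree, research interest: data mining in the telecommunications industry; LIN Weijia (b. 1989), male, from Maoming, Guangdong, master's student at the School of Software Engineering, South China University of Technology, research interest: data mining; GUO Jingyu (b. 1986), male, from Zhaoqing, Guangdong, master's degree, research interest: data mining; DING Donghui (b. 1993), male, from Ganzhou, Jiangxi, master's student, research interest: data mining; HUANG Han (b. 1980), male, from Shantou, Guangdong, professor, Ph.D., research interests: theoretical foundations of evolutionary computation, and optimization with evolutionary computation methods and its applications.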

Text Topic Extraction Based on Double-linguistic-filter

1. China Mobile Group Guangdong Co. Ltd., Guangzhou 510006, China;

2. School of Software Engineering, South China University of Technology, Guangzhou 510006, China

Received: 2015-09-06 Online: 2015-12-23 Published: 2015-12-30

Abstract

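Abstract: Text topic extraction can effectively condense textual messages. Conventional Chinese text is composed of basic words, and because the information granularity of a single word is too small, extracting Chinese information at the word level cannot fully express the semantics of a text fragment. Phrases carry richer fine-grained semantic information and express the topic of a text fragment better. This paper proposes a method based on a two-layer corpus filter (a part-of-speech filter and a phrase-extension rule filter) that filters redundant information out of a text corpus and extracts topic phrases. Experiments show that the method is reasonably reliable and applicable.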

Keywords: phrase extraction, information extraction, rule mining

Abstract: Text topic extraction is widely used to condense textual information. Chinese text is built from basic words, and because an individual word carries only a small amount of semantic information, word-level methods cannot reliably express the meaning of a short text. Chinese phrases, in contrast, carry rich fine-grained semantic information and are better representatives of the topic of a text. This paper therefore proposes a double-linguistic-filter method (a lexical category filter and a phrase-extending filter) that weeds out redundant information and extracts topic phrases from text. The extracted phrases are close to a refined semantic expression of the text. Experimental results show that the proposed method obtains reliable results and may suggest directions for other text mining methods.

Keywords: phrase extraction, information extraction, rule mining
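
As a rough illustration of the two-layer idea described in the abstract, the Python sketch below first applies a part-of-speech filter and then joins adjacent surviving words with phrase-extension rules. The abstract does not state which tags are kept or which extension rules are used, so `CONTENT_TAGS`, `EXTENSION_RULES`, and all function names are assumptions made for this example only. The input is assumed to be already segmented and POS-tagged.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical tag set and rules, chosen only for illustration;
# they are not taken from the paper.
CONTENT_TAGS = {"n", "v", "a"}                      # noun, verb, adjective
EXTENSION_RULES = {("a", "n"), ("n", "n"), ("v", "n")}


def pos_filter(tokens: List[Token]) -> List[Optional[Token]]:
    """Layer 1: keep content words; replace every other word with None."""
    return [tok if tok[1] in CONTENT_TAGS else None for tok in tokens]


def extend_phrases(filtered: List[Optional[Token]]) -> List[str]:
    """Layer 2: greedily join adjacent kept words whose tag pair matches a rule."""
    phrases: List[str] = []
    current: List[Token] = []
    for tok in filtered:
        if tok is not None and current and (current[-1][1], tok[1]) in EXTENSION_RULES:
            current.append(tok)                     # rule matched: extend the phrase
            continue
        if len(current) > 1:                        # emit only multi-word phrases
            phrases.append("".join(word for word, _ in current))
        current = [tok] if tok is not None else []  # a dropped word breaks the run
    if len(current) > 1:
        phrases.append("".join(word for word, _ in current))
    return phrases


def extract_phrases(tokens: List[Token]) -> List[str]:
    """Run both layers in sequence."""
    return extend_phrases(pos_filter(tokens))


sample = [("手机", "n"), ("流量", "n"), ("的", "u"),
          ("查询", "v"), ("方法", "n"), ("很", "d"), ("简单", "a")]
print(extract_phrases(sample))  # ['手机流量', '查询方法']
```

In this sketch a filtered-out word acts as a phrase boundary, so words on either side of a dropped particle are never merged. That is one possible design choice, not something the abstract specifies.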

CLC number: TP391

林波1，林伟佳2，郭靖羽1，丁东辉2，黄翰2. 基于双层语料过滤器的短语抽取方法[J]. 计算机与现代化, 2015, 0(12): 7-.

LIN Bo1, LIN Weijia2, GUO Jingyu1, DING Donghui2, HUANG Han2. Text Topic Extraction Based on Doublelinguisticfilter[J]. Computer and Modernization, 2015, 0(12): 7-.


