Computer and Modernization (计算机与现代化)

• Artificial Intelligence •

Text Classification Based on Graph Kernel Algorithms

  

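  1. (Faculty of Information Technology, Beijing University of Technology, Beijing 100124)
  • Received: 2016-11-24 Publication date: 2017-11-21 Release date: 2017-11-21
  • About the authors: Jiang Qiangrong (蒋强荣, b. 1964), female, from Beijing, associate professor at the Faculty of Information Technology, Beijing University of Technology, doctoral-level graduate education; research interests: pattern recognition, machine learning. Corresponding author: Song Liejin (宋烈金, b. 1989), male, from Zhoukou, Henan, master's student; research interests: natural language processing.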

Text Categorization Based on Graph Kernel

  1. (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)
  • Received: 2016-11-24 Online: 2017-11-21 Published: 2017-11-21

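Abstract: In text classification research, the vector space model offers a simple representation, but it captures only the term frequencies of feature words and ignores the structural, semantic, and word-order information among them, so different documents may be represented by the same vector. To address this problem, this paper represents text with a graph model: each text is represented as a directed graph (a text graph for short), which effectively resolves the loss of structural information. The paper applies graph kernel techniques to text classification and proposes a graph kernel suited to computing the similarity between text graphs, the gap walk kernel, and then classifies the texts with a support vector machine. Experimental results on text collections show that the gap walk kernel achieves higher classification accuracy than the vector space model and the other kernel functions compared, so it is a good algorithm for computing graph-structure similarity and can be widely applied to text classification.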

Keywords: graph structure, vector space model, gap walk kernel, support vector machine, text classification

Abstract: In text classification, the vector space model has the advantage of a simple representation, but it records only the frequency of feature words and ignores the structural, semantic, and word-order information between words, so different documents may be represented by the same vector. To address this problem, this paper uses a graph structure model to represent text: each text is represented as a directed graph (abbreviated as text graph), which effectively solves the problem of missing structural information. Graph kernel techniques are applied to text classification, and a graph kernel algorithm suited to computing the similarity between text graphs, the gap walk kernel, is proposed. A support vector machine is then used to classify the texts. Experimental results on the text sets show that the gap walk kernel achieves higher classification accuracy than the vector space model and the other kernel functions compared, so it is a good algorithm for computing graph-structure similarity and can be widely used in text classification.

Key words: graph structure, vector space model, gap walk kernel, support vector machine, text categorization
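
The abstract's central observation, that a vector space model can map different documents to the same vector while a directed text graph keeps them apart, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the `max_gap` parameter, and the edge-matching kernel are hypothetical, and the kernel is a simple stand-in (matching walks of length 1), not the gap walk kernel proposed in the paper, whose definition is not given in this abstract.

```python
from collections import Counter


def bag_of_words(text):
    """Vector space view: term frequencies only, word order discarded."""
    return Counter(text.lower().split())


def text_graph(text, max_gap=1):
    """Directed text graph stored as a multiset of edges (u, v).

    An edge u -> v is added when word v follows word u within
    `max_gap` positions. `max_gap` is a hypothetical parameter
    chosen for this sketch, not taken from the paper.
    """
    words = text.lower().split()
    edges = Counter()
    for i, u in enumerate(words):
        for j in range(i + 1, min(i + 1 + max_gap, len(words))):
            edges[(u, words[j])] += 1
    return edges


def edge_kernel(g1, g2):
    """Dot product of edge-count vectors (matching walks of length 1).

    A stand-in similarity for illustration only.
    """
    return sum(count * g2[edge] for edge, count in g1.items())


doc_a = "dog bites man"
doc_b = "man bites dog"

# Both documents have identical term frequencies.
same_vector = bag_of_words(doc_a) == bag_of_words(doc_b)

# Their directed graphs share no edge, so the kernel separates them.
graph_a, graph_b = text_graph(doc_a), text_graph(doc_b)
k_aa = edge_kernel(graph_a, graph_a)
k_ab = edge_kernel(graph_a, graph_b)
```

Here `same_vector` is true while `k_ab` is zero, which is the kind of structural difference a graph kernel can expose. A matrix of such pairwise kernel values could then be passed to a support vector machine that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.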

Chinese Library Classification (CLC) number: