Computer and Modernization ›› 2022, Vol. 0 ›› Issue (02): 92-96.

• Database and Data Mining •

Short Text Classification Method Based on Support Vector Machine

  

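  1. (Information Center, Beijing Jiaotong University, Beijing 100044, China)
  • Publication date: 2022-03-31 Release date: 2022-03-31
  • About the authors: ZHAO Yanping (1986-), female, born in Tai'an, Shandong, engineer, M.S., research interest: data mining, E-mail: ypzh@bjtu.edu.cn; WANG Fang (1977-), female, senior engineer, M.S., research interests: databases and computer networks, E-mail: fangw@bjtu.edu.cn; XIA Yang (1983-), female, senior engineer, M.S., research interest: artificial intelligence, E-mail: xiayang@bjtu.edu.cn.
  • Supported by:
    2018 Beijing Universities Ideological and Political Work Research Project (BJSZ2019ZC16)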

Short Text Classification Method Based on Support Vector Machine

  1. (Information Center, Beijing Jiaotong University, Beijing 100044, China)
  • Online: 2022-03-31 Published: 2022-03-31

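Abstract: Short texts are characterized by sparse features, irregular wording, and unclear topics. To address these characteristics, an effective short text classification method based on the support vector machine is proposed. Because dependency parsing of Chinese suffers from low accuracy and poor time efficiency, and given the characteristics of customer text inquiries, the method does not perform dependency parsing on sentences when classifying short texts. Instead, it relies mainly on syntactic features: substrings and subsequences of the text are extracted to form a candidate feature set. Three feature selection methods (information gain, mutual information, and the chi-square statistic) are then used to select effective features, and finally a support vector machine is used to classify the texts. The proposed model is applied to a set of real data, and the experimental results show that the average accuracy reaches 84.19%, which verifies the robustness and effectiveness of the classification method.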

Keywords: support vector machine, text classification, semi-supervised learning, feature selection

Abstract: This paper proposes an effective short text classification method based on the support vector machine for short texts, which have sparse features, non-standard wording, and unclear topics. Because dependency parsing of Chinese has low accuracy and poor time efficiency, and in view of the characteristics of customer text consultations, this paper does not analyze the dependency grammar of sentences but mainly uses syntactic features. Two kinds of syntactic features, substrings and subsequences of sentences, are extracted as candidate features. Three feature selection methods (information gain, mutual information, and the chi-square statistic) are then used to select effective features. Finally, a support vector machine is used to classify the texts. The proposed model is applied to a set of real data, and the experimental results show that the average accuracy reaches 84.19%, verifying the robustness and effectiveness of the classification method.

Key words: support vector machine, text categorization, semi-supervised learning, feature selection