Computer and Modernization (计算机与现代化)

• Algorithm Analysis and Design

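Web Text Classification and Prediction Based on Thesaurus Matching

  1. School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China
  • Received: 2017-02-23 Online: 2017-10-30 Published: 2017-10-31
  • About the authors: Yang Yushi (杨雨诗, b. 1993), female, from Hangzhou, Zhejiang, master's student at the School of Mechanical Engineering, Nanjing University of Science and Technology; research interests: image processing and machine vision. He Boxia (何博侠, b. 1972), male, associate professor, Ph.D.; research interests: optoelectronic testing technology, micro/nano measurement technology, mechanical dynamics, and advanced manufacturing technology. Zhou Xin (周鑫, b. 1991), male, master's student.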

Abstract: To classify Chinese text accurately, a classification method based on thesaurus matching is proposed. The vector space model is used to represent features in the test set, and principal component analysis based on term frequency-inverse document frequency (TF-IDF) is used to weight the feature items in the corpus; from these weights, index thesauri for 47 industries are screened and built. The industry category of a text is then determined from its cosine similarity to the index thesauri, and an autoregressive integrated moving average (ARIMA) model is established to forecast the development trend over the next 10 days. Experimental results show that the index thesauri achieve an average classification F-measure of 85.6% and that the prediction model has an average relative error of 3.41%, which indicates that the classification method is effective.

Key words: text classification, vector space model, principal component analysis, cosine similarity, autoregressive integrated moving average (ARIMA) model
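
The matching step summarized in the abstract (weight terms, represent each industry thesaurus and each text as a vector, pick the industry with the highest cosine similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain smoothed TF-IDF weights rather than the PCA-based weighting the paper describes, the three toy thesauri and all function names are hypothetical, and the input is assumed to be already tokenized (Chinese text would first need word segmentation).

```python
import math
from collections import Counter


def build_thesaurus_vectors(thesauri):
    """Turn {industry: [terms]} into TF-IDF vectors, treating each thesaurus as one document."""
    n = len(thesauri)
    df = Counter()
    for terms in thesauri.values():
        df.update(set(terms))  # document frequency: number of thesauri containing the term
    # Smoothed IDF (an assumption of this sketch, not taken from the paper).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = {}
    for industry, terms in thesauri.items():
        tf = Counter(terms)
        vectors[industry] = {t: (c / len(terms)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, vectors, idf):
    """Return (industry, similarity) for the thesaurus vector closest to the text."""
    tf = Counter(t for t in tokens if t in idf)  # terms outside every thesaurus are ignored
    text_vec = {t: c * idf[t] for t, c in tf.items()}
    return max(
        ((industry, cosine(text_vec, vec)) for industry, vec in vectors.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy thesauri; the paper builds 47 industry thesauri from a corpus.
thesauri = {
    "finance": ["bank", "loan", "interest", "stock"],
    "automotive": ["engine", "car", "vehicle", "loan"],
    "healthcare": ["hospital", "drug", "patient", "doctor"],
}
vectors, idf = build_thesaurus_vectors(thesauri)
label, score = classify("bank raises interest rates on car loan".split(), vectors, idf)
print(label, round(score, 3))
```

In this toy run the text shares three weighted terms with the finance thesaurus and two with the automotive one, so finance is selected. A text that shares no terms with any thesaurus gets a similarity of 0 for every industry, which a real system would presumably treat as "unclassified" rather than accept the first label.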