Web Text Classification and Prediction Based on Thesaurus Match

Computer and Modernization, 2017, Vol. 0, Issue (10): 72-75. doi: 10.3969/j.issn.1006-2475.2017.10.015

School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China

Received: 2017-02-23 Online: 2017-10-30 Published: 2017-10-31

About the authors: YANG Yu-shi (b. 1993), female, from Hangzhou, Zhejiang, master's student at the School of Mechanical Engineering, Nanjing University of Science and Technology; research interests: image processing and machine vision. HE Bo-xia (b. 1972), male, associate professor, Ph.D.; research interests: optoelectronic testing technology, micro/nano measurement technology, mechanical dynamics, and advanced manufacturing technology. ZHOU Xin (b. 1991), male, master's student.

Abstract

Abstract: To classify Chinese text accurately, a classification method based on thesaurus matching is proposed. The vector space model is used to represent features in the test set, and feature weights are computed by principal component analysis based on term frequency-inverse document frequency (TF-IDF); index thesauri for 47 industries are then screened and built. The industry category of a text is determined by its cosine similarity to the index thesauri, and an auto-regressive integrated moving average (ARIMA) model is established to forecast the development trend over the next 10 days. Experimental results show that the average F-measure of classification with the index thesauri is 85.6% and the average relative error of the prediction model is 3.41%, which indicates that the classification method is effective.

Key words: text classification, vector space model, principal components analysis, cosine similarity, ARIMA
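
The matching step described in the abstract (TF-IDF weighted vectors in a vector space model, compared with index thesauri by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the smoothed IDF formula, the two toy categories, and the English tokens are all assumptions made for illustration, and the principal component analysis step used in the paper to select index terms is omitted. Real Chinese text would also need word segmentation first.

```python
import math
from collections import Counter


def idf_table(documents):
    """Smoothed inverse document frequency for every term in a tokenized corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (term -> weight); terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(tokens, thesauri, idf):
    """Return the category whose thesaurus vector is most similar to the text."""
    text_vec = tfidf_vector(tokens, idf)
    scores = {
        label: cosine(text_vec, tfidf_vector(terms, idf))
        for label, terms in thesauri.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus and thesauri (the paper builds 47 industry thesauri).
corpus = [
    ["bank", "loan", "interest", "rate"],
    ["bank", "stock", "market", "fund"],
    ["car", "engine", "fuel", "sales"],
    ["car", "battery", "electric", "engine"],
]
thesauri = {
    "finance": ["bank", "loan", "stock", "fund", "interest"],
    "automotive": ["car", "engine", "battery", "fuel"],
}

idf = idf_table(corpus)
label, scores = classify(["electric", "car", "battery", "sales"], thesauri, idf)
print(label, scores)
```

In this toy run the query shares terms only with the automotive thesaurus, so its similarity to the finance thesaurus is zero. A text with no terms in common with any thesaurus scores zero everywhere, so a practical system would likely need a minimum-similarity threshold before assigning a category.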

Cite this article: YANG Yu-shi, HE Bo-xia, ZHOU Xin, LIU Hui-li, GE Fang-li. Web Text Classification and Prediction Based on Thesaurus Match[J]. Computer and Modernization, 2017, 0(10): 72-75.

