Chinese Text Classification Based on fastText

DAI Ling-ling, JIANG Kan

(School of Computer and Electronic Information, Guangxi University, Nanning 530004, China)

Computer and Modernization, 2018, Vol. 0, Issue (05): 35-. doi: 10.3969/j.issn.1006-2475.2018.05.008

Received: 2017-10-18 Online: 2018-06-13 Published: 2018-06-13
About the authors: DAI Ling-ling (b. 1991), female, from Zhumadian, Henan; master's student at the School of Computer and Electronic Information, Guangxi University; research interests: text classification, sentiment analysis. JIANG Kan (b. 1972), male, from Wuhan, Hubei; professor, Ph.D.; research interests: e-commerce, computational communication.
Funding:
National Natural Science Foundation of China (71362012)

Abstract

Shortening classification time while maintaining accuracy has long been a research goal in text classification. To address the complexity and long running time of current text classification pipelines, this paper applies fastText, Facebook's open-source model for sentence classification and word representation learning, to Chinese text classification and evaluates its effectiveness there. Compared with current mainstream text classification methods, the fastText-based method greatly shortens classification time while maintaining classification quality. The paper also analyzes classification accuracy and parameter settings and derives corresponding optimization rules.

Keywords: text classification, fastText, linear classifier, fast text classification model
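
To make the setting in the abstract concrete, the following is a minimal sketch of how Chinese documents might be prepared for fastText's supervised mode and how training time could be measured. It is not the authors' pipeline. The character-level tokenizer is an assumption that stands in for a real Chinese word segmenter (the abstract does not name one), the sample sentences are made up, and the hyperparameter values in the hypothetical helper `train_and_evaluate` are placeholders rather than the paper's settings. Only the `__label__` prefix and the `fasttext.train_supervised` and `model.test` calls come from the fastText library itself.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def char_tokens(text):
    """Split text into single non-whitespace characters.

    Assumption for this sketch only: character-level tokens stand in for
    a real Chinese word segmenter.
    """
    return [ch for ch in text if not ch.isspace()]


def to_fasttext_line(label, text, tokenize=char_tokens):
    """Format one labelled document as a fastText supervised-training line."""
    return LABEL_PREFIX + label + " " + " ".join(tokenize(text))


def write_training_file(samples, path):
    """Write (label, text) pairs to path, one document per line, in UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in samples:
            f.write(to_fasttext_line(label, text) + "\n")


def train_and_evaluate(train_path, test_path):
    """Train a classifier and time the training run.

    Requires the `fasttext` package. The hyperparameter values below are
    placeholders, not the settings reported in the paper.
    """
    import time
    import fasttext  # imported lazily so the helpers above work without it

    start = time.perf_counter()
    model = fasttext.train_supervised(
        input=train_path, lr=0.5, epoch=10, wordNgrams=2
    )
    elapsed = time.perf_counter() - start
    n_samples, precision, recall = model.test(test_path)
    return model, elapsed, precision


# Toy, made-up samples purely for illustration.
samples = [("sports", "球队 赢得比赛"), ("finance", "股市今日上涨")]
lines = [to_fasttext_line(label, text) for label, text in samples]
print(lines[0])  # __label__sports 球 队 赢 得 比 赛
```

Because written Chinese does not separate words with spaces, some tokenization step is needed before fastText can build its vocabulary. Swapping `char_tokens` for a word segmenter only requires passing a different `tokenize` function.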

CLC number: TP391

DAI Ling-ling, JIANG Kan. Chinese Text Classification Based on fastText[J]. Computer and Modernization, 2018, 0(05): 35-.

