Context Semantic-based Naive Bayesian Algorithm for Text Classification

Computer and Modernization ›› 2018, Vol. 0 ›› Issue (06): 1-.

• Algorithm Design and Analysis •

(Information Department, Beijing University of Technology, Beijing 100124, China)

Received: 2017-12-04 Online: 2018-07-05 Published: 2018-07-05
About the authors: ZHENG Kai-yu (b. 1991), female, from Handan, Hebei; master's student in the Information Department, Beijing University of Technology; research interest: machine learning. ZHU Cui (b. 1973), female, associate professor, Ph.D.; research interest: data mining.

Abstract

Abstract: The Naive Bayes classifier rests on the assumption that the attributes of a sample are conditionally independent of one another. As a simple bag-of-words model, it ignores the influence that synonyms in context have on classification. This paper proposes the concept of similar words and uses clusters of similar words, instead of the traditional feature dictionary, in training. First, word2vec is trained to obtain word vectors. Then the feature dictionary is represented with these word vectors and clustered hierarchically to build clusters of similar words, which are then expanded. Experimental results show that the improved algorithm effectively raises the accuracy of text classification and avoids the instability in classification performance caused by differences in the training corpus.

Key words: text classification, Naive Bayes, word2vec
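The pipeline summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two-dimensional vectors below are made-up stand-ins for trained word2vec embeddings, the average-linkage merge rule and the 0.95 cosine threshold are assumptions, the function names are hypothetical, and the cluster-expansion step is omitted. It only shows how replacing feature words with cluster ids lets Naive Bayes score a word it never saw in training.

```python
import math
from collections import Counter, defaultdict

# Toy 2-d vectors standing in for trained word2vec embeddings (made-up values).
VECTORS = {
    "football": (0.9, 0.1), "soccer": (0.88, 0.15), "goal": (0.8, 0.3),
    "stock": (0.1, 0.9), "shares": (0.12, 0.92), "market": (0.2, 0.85),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def cluster_words(vectors, threshold=0.95):
    """Agglomerative clustering with average linkage (an assumed choice):
    merge the two most similar clusters until no pair reaches `threshold`.
    Returns a mapping word -> cluster id."""
    clusters = [[w] for w in sorted(vectors)]

    def sim(a, b):
        pairs = [cosine(vectors[x], vectors[y]) for x in a for y in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best, i, j = max(
            (sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return {w: cid for cid, members in enumerate(clusters) for w in members}

def train_nb(docs, word2cluster):
    """Multinomial Naive Bayes whose features are cluster ids, not raw words."""
    class_docs = Counter()
    feat_counts = defaultdict(Counter)
    for tokens, label in docs:
        class_docs[label] += 1
        feat_counts[label].update(
            word2cluster[t] for t in tokens if t in word2cluster
        )
    n_feats = len(set(word2cluster.values()))
    return class_docs, feat_counts, n_feats

def predict(model, tokens, word2cluster):
    class_docs, feat_counts, n_feats = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n in class_docs.items():
        total = sum(feat_counts[label].values())
        score = math.log(n / total_docs)  # log prior
        for t in tokens:
            if t in word2cluster:
                count = feat_counts[label][word2cluster[t]]
                # Laplace (add-one) smoothing over the cluster vocabulary
                score += math.log((count + 1) / (total + n_feats))
        scores[label] = score
    return max(scores, key=scores.get)

w2c = cluster_words(VECTORS)
train_docs = [(["football", "goal"], "sports"), (["stock", "market"], "finance")]
model = train_nb(train_docs, w2c)
print(predict(model, ["soccer"], w2c))  # "soccer" is absent from train_docs
```

In this toy setup "soccer" never occurs in the training documents, yet it shares a cluster with "football" and "goal", so the counts gathered for those words carry over to it. A plain word-level bag-of-words model would instead treat it as an unseen feature.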

CLC Number: TP391

Cite this article: ZHENG Kai-yu, ZHU Cui. Context Semantic-based Naive Bayesian Algorithm for Text Classification[J]. Computer and Modernization, 2018, 0(06): 1-.


