News Long Text Classification Model Based on Improved TF-IDF and AGLCNN

doi:10.3969/j.issn.1006-2475.2024.08.019

Abstract

Classifying long news texts is an important task in natural language processing, but traditional text representation methods suffer from sparse features and insufficient semantic information. In addition, long news texts contain a large amount of redundant information and may touch on other topics, all of which leads to incomplete text feature extraction. To address this, the paper proposes a long news text classification model based on an improved TF-IDF algorithm and AGLCNN. The model first improves the TF-IDF algorithm using the distribution of feature items between and within classes together with their position information, and combines the result with Word2Vec word vectors for text representation. An attention mechanism then highlights keyword information, and its output is fed into a Bi-LSTM to capture contextual features. Next, a CNN highlights the salient features of the news topic. Because long news texts may contain sentences that involve other topics, a gating mechanism is introduced to fuse the output features of the Bi-LSTM and the CNN into the final text feature representation. Finally, the feature vector is fed into a Softmax layer for news classification. In comparative experiments on the THUCNews dataset and the Sohu News dataset, the proposed model achieves recall of 0.985 and 0.976, respectively, outperforming the other classification models compared.

Key words: text classification, TF-IDF, attention mechanism, convolutional neural network, feature item
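
The pipeline summarized in the abstract (TF-IDF-weighted word vectors, gated fusion of Bi-LSTM and CNN features, Softmax output) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not give the improved TF-IDF formula or the exact gate, so the code uses plain TF-IDF weighting and a common sigmoid-gate form, and the names `tfidf_weighted_embeddings`, `gated_fusion`, `W`, `b`, and `W_out` are hypothetical. Random vectors stand in for the Bi-LSTM and CNN outputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tfidf_weighted_embeddings(tokens, idf, vectors):
    """Scale each token's word vector by its plain TF-IDF weight.

    Assumption: standard TF-IDF only. The paper's class-distribution and
    position factors are not specified in the abstract and are omitted.
    """
    counts = {t: tokens.count(t) for t in set(tokens)}
    return np.stack(
        [vectors[t] * (counts[t] / len(tokens)) * idf[t] for t in tokens]
    )

def gated_fusion(h_lstm, h_cnn, W, b):
    """Blend two feature vectors with an element-wise sigmoid gate.

    Assumption: a common gate form, g = sigmoid([h_lstm; h_cnn] W + b),
    not necessarily the one used in the paper.
    """
    g = sigmoid(np.concatenate([h_lstm, h_cnn], axis=-1) @ W + b)
    return g * h_lstm + (1.0 - g) * h_cnn

# Toy document: three tokens with hypothetical IDF values and 2-d word vectors.
tokens = ["a", "b", "a"]
idf = {"a": 1.0, "b": 2.0}
vectors = {"a": np.ones(2), "b": np.ones(2)}
weighted = tfidf_weighted_embeddings(tokens, idf, vectors)

# Random stand-ins for the Bi-LSTM and CNN feature vectors.
rng = np.random.default_rng(0)
d, n_classes = 8, 4
h_lstm = rng.normal(size=d)
h_cnn = rng.normal(size=d)
W = rng.normal(size=(2 * d, d)) * 0.1
b = np.zeros(d)

fused = gated_fusion(h_lstm, h_cnn, W, b)
W_out = rng.normal(size=(d, n_classes)) * 0.1
probs = softmax(fused @ W_out)  # class probabilities from the Softmax layer
```

Because the gate lies strictly between 0 and 1, each fused feature is an element-wise interpolation between the two inputs, which is one way a model could down-weight features from sentences that drift to other topics.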

CLC number: TP391

ZHOU Xianxi, MU Li. News Long Text Classification Model Based on Improved TF-IDF and AGLCNN[J]. Computer and Modernization, 2024, 0(08): 120-126.

