News Long Text Classification Model Based on Improved TF-IDF and AGLCNN

doi:10.3969/j.issn.1006-2475.2024.08.019

Abstract

Long news text classification is an important task in natural language processing, but traditional text representation methods suffer from sparse features and insufficient semantic information. In addition, long news texts contain a large amount of redundant information and may touch on other topics, both of which can lead to incomplete feature extraction. This article therefore proposes a long news text classification model based on an improved TF-IDF algorithm and AGLCNN. The model first improves the TF-IDF algorithm by using the distribution of feature items between and within classes together with their position information, and combines the resulting weights with Word2Vec word vectors for text representation. An attention mechanism then highlights keyword information, and the result is fed into a Bi-LSTM to capture contextual features, after which a CNN extracts the salient features of the news topic. Because long news texts may contain sentences about other topics, a gating mechanism fuses the output features of the Bi-LSTM and the CNN to obtain the final text feature representation. Finally, the feature vector is fed into a Softmax layer for news classification. Comparative experiments on the THUCNews and Sohu News datasets show that the proposed model achieves recall rates of 0.985 and 0.976, respectively, outperforming the other classification models compared.
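The gated fusion and Softmax steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the gate formula, so the element-wise form g = sigmoid(W[h_lstm; h_cnn] + b) is an assumption, and the names `gated_fusion`, `classify`, `w_gate`, `b_gate`, `w_out`, `b_out`, the dimensions, and the random values are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(h_lstm, h_cnn, w_gate, b_gate):
    """Assumed gate: g = sigmoid(W [h_lstm; h_cnn] + b),
    output = g * h_lstm + (1 - g) * h_cnn (element-wise)."""
    concat = h_lstm + h_cnn  # list concatenation of the two feature vectors
    fused = []
    for i in range(len(h_lstm)):
        z = sum(w * x for w, x in zip(w_gate[i], concat)) + b_gate[i]
        g = sigmoid(z)
        fused.append(g * h_lstm[i] + (1.0 - g) * h_cnn[i])
    return fused


def classify(fused, w_out, b_out):
    # Linear layer followed by Softmax over the news categories.
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(logits)


# Toy stand-ins for the Bi-LSTM and CNN outputs (hypothetical sizes and values).
random.seed(0)
dim, n_classes = 4, 3
h_lstm = [random.uniform(-1, 1) for _ in range(dim)]
h_cnn = [random.uniform(-1, 1) for _ in range(dim)]
w_gate = [[random.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
b_gate = [0.0] * dim
w_out = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_classes)]
b_out = [0.0] * n_classes

fused = gated_fusion(h_lstm, h_cnn, w_gate, b_gate)
probs = classify(fused, w_out, b_out)
print(probs)
```

With this form of gate, each fused component is a convex combination of the corresponding Bi-LSTM and CNN components, so the gate can lean toward whichever branch is more informative for a given feature. In a trained model the weights would be learned rather than sampled at random.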

Key words: text classification, TF-IDF, attention mechanism, convolutional neural network, feature item

CLC Number: TP391

ZHOU Xianxi, MU Li. News Long Text Classification Model Based on Improved TF-IDF and AGLCNN[J]. Computer and Modernization, 2024, 0(08): 120-126.

