Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN

Abstract

In the big data environment, extracting hot topics from massive volumes of Internet data is the basis for studying online public opinion and sentiment. Text clustering is one of the most common ways to find hot topics, and it consists of two steps: representing texts as vectors and clustering those vectors. In the representation step, however, traditional text representation models cannot accurately capture the contextual information of texts such as news articles and posts. In the clustering step, K-Means and DBSCAN are the most commonly used algorithms, but their clustering assumptions do not match the actual distribution of topic data, so existing text clustering algorithms perform poorly in real Internet environments. This paper therefore proposes a text clustering algorithm based on RoBERTa-WWM and HDBSCAN that is designed around the distribution of topic data on the Internet. First, the pre-trained language model RoBERTa-WWM is used to obtain a vector for each text. Second, the t-SNE algorithm reduces the dimensionality of these high-dimensional text vectors. Finally, HDBSCAN, a hierarchical density-based clustering algorithm, clusters the low-dimensional text vectors. Experimental results show that, compared with existing text clustering algorithms, the proposed algorithm substantially improves clustering quality on data sets that contain noise and are unevenly distributed.
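To illustrate why a density-based method such as HDBSCAN copes with noisy, unevenly distributed data, the following minimal sketch computes the mutual reachability distance on which HDBSCAN builds its cluster hierarchy. It is not the paper's implementation. The 2-D points are hypothetical stand-ins for t-SNE-reduced text vectors, the function names and the value of `k` are assumptions chosen for illustration, and the core distance here excludes the point itself (conventions differ on this detail).

```python
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other point."""
    dists = sorted(
        math.dist(points[i], p) for j, p in enumerate(points) if j != i
    )
    return dists[k - 1]

def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)): sparse points are pushed apart."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        math.dist(points[i], points[j]),
    )

# Hypothetical low-dimensional text vectors (illustrative values only).
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),  # one dense topic
    (2.2, 0.0),                                       # an isolated noisy post
]
k = 3  # assumed neighbourhood size, similar in role to a min-samples setting

# The noisy post is close to one topic member in plain Euclidean distance ...
euclid = math.dist(points[2], points[4])
# ... but far from it once the local density of both points is considered.
mreach = mutual_reachability(points, 2, 4, k)

print(f"{euclid:.2f}")  # 1.20
print(f"{mreach:.2f}")  # 2.20
```

In this toy example every pair inside the dense topic stays within about 1.41 of each other in mutual reachability distance, while the isolated post is at least 2.2 away from all of them, so a density-based hierarchy can separate it as noise instead of forcing it into a cluster.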

Keywords: text clustering, pre-trained language model, visual dimensionality reduction, density clustering

LIU kun, ZENG Xi, QIU Zi-heng, CHEN Zhou-guo, . Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN[J]. Computer and Modernization, 2022, 0(03): 48-52.

