Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN

Abstract

Identifying hot topics in massive Internet data is the basis for studying online public opinion. Text clustering is one of the most common ways to find such topics, and it consists of two steps: representing each text as a vector and clustering the vectors. In the representation step, traditional text representation models cannot accurately capture the contextual information of texts such as news articles and posts. In the clustering step, K-Means and DBSCAN are the most widely used algorithms, but the way they group data does not match how topic data is actually distributed, so existing text clustering algorithms perform poorly in real Internet environments. Based on the distribution of topic data on the Internet, this paper proposes a text clustering algorithm based on RoBERTa-WWM and HDBSCAN. First, the pre-trained language model RoBERTa-WWM produces a vector for each text. Second, the t-SNE algorithm reduces the high-dimensional text vectors to a low-dimensional space. Finally, HDBSCAN, a hierarchical density-based clustering algorithm, clusters the low-dimensional vectors. Experimental results show that, on datasets that contain noise and are unevenly distributed, the proposed algorithm clusters considerably better than existing text clustering algorithms.

Key words: text clustering, pre-trained language model, dimensionality reduction for visualization, density-based clustering

LIU Kun, ZENG Xi, QIU Zi-heng, CHEN Zhou-guo. Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN[J]. Computer and Modernization, 2022(03): 48-52.

