Computer and Modernization ›› 2022, Vol. 0 ›› Issue (03): 48-52.

• Database and Data Mining •

Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN

  (1. The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu 610000, China;
    2. Shenzhen CyberAray Network Technology Co., Ltd., Shenzhen 518000, China)
  • Online: 2022-04-29 Published: 2022-04-29
  • About the authors: LIU Kun (1996-), male, born in Pingliang, Gansu, master's student, research interest: natural language processing, E-mail: 1368603690@qq.com; ZENG Xi (1969-), female, professor-level senior engineer, master's degree, research interests: network security and security defense.
  • Supported by:
    National Natural Science Foundation of China (61803352)

Abstract: In the big data environment, extracting hot topics from massive Internet data is the basis for studying online public opinion. Text clustering is one of the most common methods for finding hot topics and consists of two steps: text vectorization and clustering. In the vectorization step, however, traditional text representation models cannot accurately capture the contextual information of texts such as news articles and posts. In the clustering step, K-Means and DBSCAN are the most commonly used algorithms, but the way they cluster data does not match the actual distribution of topic data, so existing text clustering algorithms perform poorly in real Internet environments. Based on the data distribution of topics on the Internet, this paper proposes a text clustering algorithm based on RoBERTa-WWM and HDBSCAN. First, the pre-trained language model RoBERTa-WWM is used to obtain a vector for each text. Second, the t-SNE algorithm reduces the dimensionality of the high-dimensional text vectors. Finally, HDBSCAN, a hierarchical density-based clustering algorithm, clusters the low-dimensional text vectors. Experimental results show that, compared with existing text clustering algorithms, the proposed algorithm substantially improves clustering quality on data sets that contain noise and are unevenly distributed.

Key words: text clustering, pre-trained language model, visual dimensionality reduction, density clustering
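
The density-clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the RoBERTa-WWM embedding and t-SNE steps have already produced 2-D points (the coordinates below are made up for illustration), and it covers only the first stages of HDBSCAN: core distances, mutual reachability distances, and a minimum spanning tree. The final cut at a single threshold `eps` is a simplification; full HDBSCAN instead selects clusters across all density levels. All function names and parameter values here are hypothetical.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores


def mutual_reachability(points, cores, i, j):
    """Points in sparse regions are pushed away from everything else."""
    return max(cores[i], cores[j], math.dist(points[i], points[j]))


def minimum_spanning_tree(points, cores):
    """Prim's algorithm over the complete mutual-reachability graph."""
    n = len(points)
    # best[j] = (cheapest known edge weight into the tree, tree endpoint)
    best = {j: (mutual_reachability(points, cores, 0, j), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        weight, i = best.pop(j)
        edges.append((weight, i, j))
        for k in best:
            d = mutual_reachability(points, cores, j, k)
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


def cut_tree(n, edges, eps, min_cluster_size):
    """Keep edges with weight <= eps; small components become noise (-1)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for weight, i, j in edges:
        if weight <= eps:
            parent[find(i)] = find(j)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)

    labels = [-1] * n
    next_label = 0
    for members in groups.values():
        if len(members) >= min_cluster_size:
            for x in members:
                labels[x] = next_label
            next_label += 1
    return labels


# Made-up 2-D points standing in for t-SNE output: a larger topic,
# a smaller topic, and two isolated noise points (an unbalanced data set).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 1.5),  # topic A
    (10, 10), (10, 11), (11, 10),                            # topic B
    (5, 20), (20, 3),                                        # noise
]

cores = core_distances(points, min_samples=2)
tree = minimum_spanning_tree(points, cores)
labels = cut_tree(len(points), tree, eps=2.0, min_cluster_size=3)
print(labels)
```

Unlike K-Means, which assigns every point to some cluster, this kind of density-based procedure leaves isolated points labelled `-1`, which is one reason it suits noisy topic data. In practice an existing implementation, such as the `hdbscan` package or scikit-learn's `HDBSCAN` class, would be a more sensible choice than hand-written code.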