• Databases and Data Mining •

### Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN

(1. The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu 610000, Sichuan, China; 2. Shenzhen CyberAray Network Technology Co., Ltd., Shenzhen 518000, Guangdong, China)
• Online: 2022-04-29 Published: 2022-04-29
• About the authors: LIU Kun (1996—), male, from Pingliang, Gansu; master's student; research interest: natural language processing; E-mail: 1368603690@qq.com. ZENG Xi (1969—), female, professor-level senior engineer, M.S.; research interests: network security and security defense.
• Funding:
National Natural Science Foundation of China (61803352)


Abstract: In the big-data environment, extracting hot topics from massive Internet data is the basis for studying current online public opinion and sentiment. Text clustering is one of the most common approaches to topic discovery, and it can be divided into two steps: text vectorization and clustering. In the vectorization step, however, traditional text representation models cannot accurately capture the contextual information of texts such as news articles and forum posts. In the clustering step, the K-Means and DBSCAN algorithms are the most widely used, but their clustering assumptions do not match the actual distribution of topic data, so existing text clustering algorithms perform poorly when applied in real Internet environments. This paper therefore proposes a text clustering algorithm based on RoBERTa-WWM and HDBSCAN that is tailored to the distribution of topic data on the Internet. First, the pre-trained language model RoBERTa-WWM produces a vector representation of each text. Second, the t-SNE algorithm reduces the dimensionality of the high-dimensional text vectors. Finally, the HDBSCAN algorithm, a hierarchical density-based clustering method, clusters the low-dimensional vectors. Experimental results show that, compared with existing text clustering algorithms, the proposed algorithm achieves a substantial improvement in clustering quality on data sets that contain noise and are unevenly distributed.