Computer and Modernization ›› 2022, Vol. 0 ›› Issue (03): 48-52.

Text Clustering Algorithm Based on RoBERTa-WWM and HDBSCAN

  

  (1. The 30th Research Institute of China Electronics Technology Group Corporation, Chengdu 610000, China;
    2. Shenzhen CyberAray Network Technology Co., Ltd., Shenzhen 518000, China)
  • Online: 2022-04-29 Published: 2022-04-29

Abstract: In a big data environment, extracting hot topics from massive Internet data is the basis for studying online public opinion and sentiment. Text clustering is one of the most common ways to find hot topics, and it consists of two steps: text vectorization and clustering. In the vectorization step, traditional text representation models cannot accurately capture the contextual information of texts such as news articles and posts. In the clustering step, K-Means and DBSCAN are the most commonly used algorithms, but their clustering assumptions do not match the actual distribution of topic data, so existing text clustering algorithms perform poorly in real Internet environments. This paper therefore proposes a text clustering algorithm based on RoBERTa-WWM and HDBSCAN that is designed around the distribution of topic data on the Internet. First, the pre-trained language model RoBERTa-WWM produces a vector for each text. Second, the t-SNE algorithm reduces the dimensionality of these high-dimensional text vectors. Finally, HDBSCAN, a hierarchical density-based clustering algorithm, clusters the low-dimensional vectors. Experimental results show that, compared with existing text clustering algorithms, the proposed algorithm substantially improves clustering quality on data sets that contain noise and are unevenly distributed.

Key words: text clustering, pre-trained language model, visual dimensionality reduction, density clustering