Computer and Modernization, 2021, Vol. 0, Issue (12): 110-115.

• Artificial Intelligence •

Microblog Hot Topic Discovery Based on Text Dual Representation Model

  1. (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)
  • Online: 2021-12-24 Published: 2021-12-24
  • About the authors: LIU Mengying (born 1995), female, from Baoding, Hebei, master's student; research interest: natural language processing; E-mail: 3106171957@qq.com. WANG Yong (born 1974), male, from Linqu, Shandong, associate professor and master's supervisor; research interests: service-oriented architecture, grid computing, trusted computing; E-mail: wangy@bjut.edu.cn.

Abstract: Microblogs are an important platform for information dissemination in contemporary life, and mining hot topics from them has become an important research direction. When traditional hot topic discovery methods are applied to microblog text, the text representation lacks semantic information and the mined hot topics are of poor quality. To address these problems, this paper proposes a text dual representation model based on frequent word sets and BERT semantics (FWS-BERT). The model is used to compute a weighted text similarity, on which spectral clustering of the microblog texts is performed. Microblog topics are then mined with an affinity propagation (AP) clustering algorithm that uses an improved similarity measure. Finally, a topic popularity evaluation method is proposed by introducing the H-index from bibliometrics. Experiments show that the proposed method achieves higher silhouette coefficient and Calinski-Harabasz (CH) index values than both the single text representation method based on frequent word sets and the K-means method, and that it can accurately represent topics and evaluate their popularity on microblog data.

Key words: microblog, frequent word sets, BERT, clustering, hot topics
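
The abstract does not give the formulas behind the weighted text similarity or the H-index-based popularity score, so the following is only a minimal Python sketch under stated assumptions, not the authors' implementation. It assumes that the frequent-word-set view of two posts is compared with Jaccard overlap, that the BERT view is compared with cosine similarity of precomputed sentence vectors, and that the two are blended with a hypothetical weight `alpha`. It also assumes the popularity of a topic is the standard H-index taken over per-post interaction counts (for example, reposts). All function and parameter names are illustrative.

```python
import math


def jaccard(words_a, words_b):
    """Jaccard overlap between two sets of frequent words (assumed lexical view)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity between two dense vectors, e.g. precomputed BERT embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def weighted_similarity(words_a, words_b, vec_a, vec_b, alpha=0.5):
    """Blend the lexical and semantic views; alpha is a hypothetical weight."""
    return alpha * jaccard(words_a, words_b) + (1 - alpha) * cosine(vec_a, vec_b)


def h_index(counts):
    """Largest h such that at least h items have a count of at least h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


if __name__ == "__main__":
    # Two toy posts: frequent words plus made-up 2-dimensional "embeddings".
    sim = weighted_similarity({"flood", "rescue"}, {"flood", "rain"},
                              [1.0, 0.0], [1.0, 0.0], alpha=0.5)
    print(f"weighted similarity: {sim:.3f}")
    # Interaction counts of the posts in one hypothetical topic cluster.
    print("topic H-index:", h_index([10, 8, 5, 4, 3]))
```

A pairwise matrix built from `weighted_similarity` could be passed to a clustering routine that accepts precomputed affinities. The choice of `alpha`, and of which interaction count feeds `h_index`, would need to be taken from the full paper.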