Computer and Modernization (计算机与现代化) ›› 2021, Vol. 0 ›› Issue (12): 37-42.

• Algorithm Design and Analysis •

Weibo Tag Generation Algorithm Based on LDA and Word2vec

  1. (School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang 212100, China)
  • Online: 2021-12-24 Published: 2021-12-24
  • About the authors: Chen Ying (陈瀛, born 1996), female, from Foshan, Guangdong; master's student; research interest: data mining; E-mail: 864366787@qq.com. Sheng Jiagen (生佳根, born 1962), male, from Hai'an, Jiangsu; professor, Ph.D.; research interest: data mining; E-mail: sjg@just.edu.cn.
  • Funding:
    National Natural Science Foundation of China (61772244); Jiangsu Province Graduate Student Innovation Program (SJCX20_1475)

Abstract: Tag generation algorithms based on the traditional LDA topic model describe users' interest topics incompletely. To address this problem, a Weibo user tag generation algorithm based on topic embedding representation, TopicERP, is proposed. Building on the LDA model, the algorithm introduces the Word2vec word embedding model to describe users' interest topics more completely, and it improves the matching degree calculation. First, the LDA topic model is used to analyze the topics of a user's Weibo posts and generate the user's interest topics. Then, the Word2vec word embedding model converts each topic's text into a topic vector for the matching degree calculation. Finally, the matching degree between each topic vector and each candidate tag is computed from their cosine similarity and the conditional probability of the topic in the document, and the Top-Q candidate tags are selected as the target user's tags. Experiments on microPCU, a public Weibo dataset, show that the algorithm outperforms tag generation algorithms based on the traditional LDA topic model in overall performance, and that the generated tags describe users' interest preferences fairly accurately.

Key words: tag generation, LDA, Word2vec, Weibo
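
The three stages named in the abstract (LDA topics, Word2vec topic vectors, matching by cosine similarity and topic probability, then Top-Q selection) can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not say how a topic vector is built or how cosine similarity and the conditional probability are combined, so two assumptions are made here: a topic vector is the probability-weighted mean of its words' embeddings, and a tag's score is the sum over topics of P(topic | document) times the cosine similarity. The function names (`cosine`, `topic_vector`, `top_q_tags`) and the toy data are hypothetical. In practice the topic-word and document-topic distributions would come from a trained LDA model and the vectors from a trained Word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_vector(topic_words, word_vectors):
    """ASSUMPTION: topic vector = probability-weighted mean of its words' embeddings.

    topic_words maps word -> P(word | topic); words without an embedding are skipped.
    """
    known = [(w, p) for w, p in topic_words.items() if w in word_vectors]
    total = sum(p for _, p in known)
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    if total == 0.0:
        return vec
    for word, prob in known:
        for i, x in enumerate(word_vectors[word]):
            vec[i] += (prob / total) * x
    return vec


def top_q_tags(topics, doc_topic_probs, candidates, word_vectors, q):
    """ASSUMPTION: score(tag) = sum_k P(topic_k | doc) * cos(topic_vec_k, tag_vec).

    Returns the q highest-scoring candidate tags, best first.
    """
    topic_vecs = [topic_vector(t, word_vectors) for t in topics]
    scores = {}
    for tag in candidates:
        if tag not in word_vectors:
            continue  # a tag without an embedding cannot be matched
        scores[tag] = sum(
            p * cosine(tv, word_vectors[tag])
            for p, tv in zip(doc_topic_probs, topic_vecs)
        )
    return sorted(scores, key=scores.get, reverse=True)[:q]


# Toy 2-dimensional embeddings standing in for Word2vec output (hypothetical values).
word_vectors = {
    "football": [1.0, 0.0], "goal": [0.9, 0.1], "sports": [0.95, 0.05],
    "phone": [0.0, 1.0], "chip": [0.1, 0.9], "technology": [0.05, 0.95],
}
# Toy topic-word distributions and document-topic probabilities standing in for LDA output.
topics = [{"football": 0.6, "goal": 0.4}, {"phone": 0.7, "chip": 0.3}]
doc_topic_probs = [0.8, 0.2]
candidates = ["sports", "technology"]

tags = top_q_tags(topics, doc_topic_probs, candidates, word_vectors, q=1)
print(tags)
```

With these toy values the user's posts are dominated by the first topic, so the candidate tag closest to that topic's vector ranks first. Weighting by P(topic | document) keeps a minor topic from contributing as much as a dominant one.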