Microblog Hot Topic Discovery Based on Text Dual Representation Model

Abstract

Abstract: Microblogs are an important platform for information dissemination in contemporary life, and mining hot topics from them has become an important research direction. Traditional hot topic discovery methods perform poorly on microblog text: their text representations lack semantic information, and the hot topics they mine are of low quality. To address these problems, this paper proposes a text dual representation model based on frequent word sets and BERT semantics (FWS-BERT). The model is used to compute a weighted text similarity, on which spectral clustering of the microblog texts is performed. Microblog topics are then mined with an affinity propagation (AP) clustering algorithm that uses an improved similarity measure. Finally, a topic heat evaluation method is proposed by introducing the H-index from bibliometrics. Experiments show that the proposed method achieves higher silhouette coefficient and Calinski-Harabasz (CH) index values than both the single text representation method based on frequent word sets and the K-means method, and that it can accurately represent topics and evaluate their heat on microblog data.

Key words: microblog, frequent word sets, BERT, clustering, hot topics
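
The abstract does not give the exact formulas, so the following is only a minimal sketch of two of the ideas it names: a weighted similarity that combines a frequent-word-set view with a semantic-vector view of each text, and an H-index-style heat score for a topic. Everything specific here is an assumption made for illustration: Jaccard overlap for the frequent word sets, cosine similarity for the BERT-style vectors, the mixing weight `alpha`, the use of per-post interaction counts as the H-index input, and all function names. It should not be read as the authors' implementation.

```python
import math


def jaccard(words_a, words_b):
    """Jaccard overlap between two collections of frequent words (assumed measure)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_similarity(words_a, words_b, vec_a, vec_b, alpha=0.5):
    """Mix the word-set similarity and the semantic similarity.

    alpha is a hypothetical weight in [0, 1]; the paper's actual
    weighting scheme is not stated in the abstract.
    """
    return alpha * jaccard(words_a, words_b) + (1.0 - alpha) * cosine(vec_a, vec_b)


def h_index(counts):
    """Largest h such that at least h items have a count of at least h."""
    h = 0
    for rank, count in enumerate(sorted(counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Two toy texts: frequent words plus stand-in embedding vectors.
sim = weighted_similarity({"flood", "rescue"}, {"rescue", "team"},
                          [1.0, 0.0], [1.0, 0.0], alpha=0.5)

# Toy interaction counts (e.g. reposts) for the posts in one topic.
heat = h_index([10, 8, 5, 4, 3])
```

In a full pipeline, pairwise values of `weighted_similarity` would form the affinity matrix handed to a clustering routine, and `h_index` would be computed once per discovered topic to rank topics by heat.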

LIU Meng-ying, WANG Yong. Microblog Hot Topic Discovery Based on Text Dual Representation Model[J]. Computer and Modernization, 2021, 0(12): 110-115.

