Computer and Modernization, 2021, Vol. 0, Issue (05): 66-72.

• Chinese Information Processing Technology •

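Penalized Matrix Decomposition Based on CLSVSM and Its Application in Text Topic Clustering

  1. (School of Mathematical Sciences, Shanxi University, Taiyuan 030006, China)
  • Online: 2021-06-03   Published: 2021-06-03
  • About the authors: NIU Fenggao (牛奉高, born 1980), male, from Qinshui, Shanxi; associate professor, Ph.D.; research interests: text mining, applied statistics; E-mail: nfgao@sxu.edu.cn. FENG Shijia (冯世佳, born 1995), female, from Lüliang, Shanxi; master's student; research interest: text mining; E-mail: 2502536189@qq.com. HUANG Chen (黄琛, born 1992), male, from Wuhan, Hubei; master's student; research interest: text mining; E-mail: sxuchenh@163.com.
  • Funding:
    Shanxi Province Applied Basic Research Program (Outstanding Youth Fund) (201801D211002); National Statistical Science Research Project (2017LY04); Shanxi Province Higher Education Outstanding Achievement Cultivation Project (2019KJ004)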

Abstract: A sound representation of text information is important for text topic clustering and retrieval. To address the high dimensionality of text representation models, this paper studies penalized matrix decomposition (PMD) on the basis of the co-occurrence latent semantic vector space model (CLSVSM). PMD imposes sparsity constraints on the vectors to extract core feature words, from which the original data are reconstructed. Drawing on co-occurrence analysis theory and PMD, the semantic information among feature words is mined in depth and a semantic kernel function (PMD_K) is constructed. Both methods are applied to text topic clustering, and the experimental results show that PMD and PMD_K cluster markedly better than the other methods compared. In terms of the F value, PMD_K improves on the earlier 95%CLSVSM_K method by 21.9%. Combining PMD with a text representation model improves the efficiency and accuracy of text topic clustering while avoiding complex operations on high-dimensional matrices.

Key words: CLSVSM (Co-occurrence Latent Semantic Vector Space Model), penalized matrix decomposition (PMD), semantic kernel function, text topic clustering
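
The abstract describes PMD as placing sparsity constraints on the factor vectors so that only core feature words keep nonzero weight. A minimal sketch of that idea for a single rank-one factor is given below. It is not the authors' implementation: the abstract does not specify the algorithm, so the alternating soft-thresholded updates, the fixed thresholds `delta_u` and `delta_v`, the function names, and the toy document-term matrix `X` are all assumptions made for illustration, and the CLSVSM construction itself is not reproduced.

```python
import numpy as np

def soft_threshold(a, delta):
    """Shrink each entry of a toward zero by delta (the L1 shrinkage step)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l2_normalize(a):
    """Scale a to unit Euclidean norm; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a

def pmd_rank1(X, delta_u=0.0, delta_v=0.0, n_iter=100, tol=1e-8):
    """Sparse rank-one factor X ~ d * u v^T via alternating thresholded updates.

    delta_u and delta_v are fixed thresholds (an assumption of this sketch);
    with both set to 0 the loop reduces to plain power iteration.
    """
    # Start from the leading right singular vector of X.
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    u = np.zeros(X.shape[0])
    for _ in range(n_iter):
        u_new = l2_normalize(soft_threshold(X @ v, delta_u))
        v_new = l2_normalize(soft_threshold(X.T @ u_new, delta_v))
        converged = np.linalg.norm(v_new - v) < tol
        u, v = u_new, v_new
        if converged:
            break
    d = float(u @ X @ v)
    return d, u, v

# Hypothetical toy matrix: 5 documents (rows) x 5 feature words (columns).
X = np.array([
    [3.0, 2.0, 0.0, 0.0, 0.1],
    [2.0, 3.0, 0.0, 0.0, 0.1],
    [3.0, 3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.1],
])

# Penalize only the word-side vector v, so weak feature words are zeroed out.
d, u, v = pmd_rank1(X, delta_u=0.0, delta_v=1.0)
print("feature words kept:", np.flatnonzero(v))
print("rank-one reconstruction:\n", np.round(d * np.outer(u, v), 2))
```

The nonzero entries of `v` play the role of the extracted core feature words, and `d * np.outer(u, v)` is the corresponding low-rank reconstruction of the data. In a constrained formulation the thresholds would normally be tuned to meet an L1 budget rather than fixed by hand as they are here.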