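Penalized Matrix Decomposition Based on CLSVSM and Its Application in Text Topic Clustering

Computer and Modernization ›› 2021, Vol. 0 ›› Issue (05): 66-72.

(School of Mathematical Sciences, Shanxi University, Taiyuan 030006, Shanxi, China)

Published: 2021-06-03 Online: 2021-06-03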
About the authors: NIU Feng-gao (born 1980), male, from Qinshui, Shanxi; associate professor, Ph.D.; research interests: text mining, applied statistics; E-mail: nfgao@sxu.edu.cn. FENG Shi-jia (born 1995), female, from Lüliang, Shanxi; master's student; research interest: text mining; E-mail: 2502536189@qq.com. HUANG Chen (born 1992), male, from Wuhan, Hubei; master's student; research interest: text mining; E-mail: sxuchenh@163.com.
Funding:
Shanxi Province Applied Basic Research Program (Excellent Young Scholars Fund) (201801D211002); National Statistical Science Research Project (2017LY04); Shanxi Province Higher Education Outstanding Achievements Cultivation Project (2019KJ004)

Abstract

Abstract: A sound representation of text information is important for text topic clustering and retrieval. To address the high dimensionality of text representation models, this paper studies penalized matrix decomposition (PMD) on the basis of the co-occurrence latent semantic vector space model (CLSVSM). PMD imposes sparsity constraints on the vectors to extract core feature words, from which the original data are then reconstructed. Using co-occurrence analysis theory together with PMD, the semantic information among feature words is mined in depth and a semantic kernel function (PMD_K) is constructed. The proposed methods are applied to text topic clustering. Experimental results show that both PMD and PMD_K cluster markedly better than the other methods compared; taking the F value as an example, PMD_K improves the F value by 21.9% over the earlier 95%CLSVSM_K method. Combining PMD with a text representation model improves the efficiency and accuracy of text topic clustering while avoiding complex computations on high-dimensional matrices.

Key words: CLSVSM (Co-occurrence Latent Semantic Vector Space Model), penalized matrix decomposition (PMD), semantic kernel function, text topic clustering
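
The sparsity-constrained factorization named in the abstract can be illustrated with the rank-one PMD with L1 penalties on both factors, as formulated by Witten, Tibshirani and Hastie (2009): maximize u^T X v subject to unit L2 norms and L1 bounds on u and v. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`soft_threshold`, `l1_unit`, `pmd_rank1`), the bounds `c_u` and `c_v`, and the toy matrix are all hypothetical, and the way the paper builds the CLSVSM matrix or the PMD_K kernel is not described on this page and is not reproduced here.

```python
import numpy as np

def soft_threshold(a, delta):
    """Elementwise soft-thresholding operator S(a, delta)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_unit(a, c, n_bisect=50):
    """Return S(a, delta) scaled to unit L2 norm, with the smallest
    delta >= 0 (found by bisection) such that the L1 norm is <= c.
    Assumes c >= 1, otherwise no unit vector is feasible."""
    a = np.asarray(a, dtype=float)
    norm = np.linalg.norm(a)
    if norm == 0.0:
        return a
    if np.abs(a).sum() / norm <= c:
        return a / norm                      # no thresholding needed
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        s = soft_threshold(a, mid)
        n = np.linalg.norm(s)
        if n > 0.0 and np.abs(s).sum() / n <= c:
            hi = mid                         # feasible: try a smaller delta
        else:
            lo = mid                         # infeasible: threshold harder
    s = soft_threshold(a, hi)
    n = np.linalg.norm(s)
    if n == 0.0:                             # ties at the maximum entry
        s = np.zeros_like(a)
        i = int(np.argmax(np.abs(a)))
        s[i] = np.sign(a[i])
        n = 1.0
    return s / n

def pmd_rank1(X, c_u, c_v, n_iter=100, tol=1e-8):
    """Rank-one PMD(L1, L1): alternate sparse updates of u and v."""
    X = np.asarray(X, dtype=float)
    # Start from the leading right singular vector of X.
    v = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = l1_unit(X @ v, c_u)
        v_new = l1_unit(X.T @ u, c_v)
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:
            break
    d = float(u @ X @ v)
    return d, u, v

# Toy term-by-document matrix (random, purely illustrative).
rng = np.random.default_rng(0)
X_demo = rng.random((6, 8))
d_demo, u_demo, v_demo = pmd_rank1(X_demo, c_u=1.5, c_v=1.5)
print("d =", d_demo)
print("nonzero entries in u:", int(np.count_nonzero(u_demo)))
```

Smaller values of `c_u` and `c_v` force more entries of u and v to exactly zero; in a term-by-document setting, the surviving nonzero entries of the term-side factor would play the role of the core feature words. Further factors can be obtained by repeating the procedure on the residual X - d u v^T.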

NIU Feng-gao, FENG Shi-jia, HUANG Chen. Penalized Matrix Decomposition Based on CLSVSM and Its Application in Text Topic Clustering[J]. Computer and Modernization, 2021, 0(05): 66-72.
