Feature Weighted CLSVSM

Abstract

How rationally and effectively document information is represented as spatial vectors has a large impact on text clustering and retrieval results. The Co-occurrence Latent Semantic Vector Space Model (CLSVSM) mines the latent semantic information carried by co-occurrence between document feature words and thereby improves document clustering performance. Building on CLSVSM, this paper first introduces word frequency information, then uses that word frequency as a weight on the co-occurrence strength in CLSVSM, and finally constructs the feature weighted CLSVSM. On Chinese data, the clustering results of feature weighted CLSVSM are as follows: compared with the CLSVSM and Word2vec text models, the F value increases by nearly 2.4% and 5.2%, respectively; compared with the 90%CLSVSM_K and Word2vec text models, the entropy value decreases by nearly 3.1% and 9.0%, respectively; compared with the word frequency CLSVSM and TF-IDF models, the clustering effect also improves. On English data, the clustering effect of feature weighted CLSVSM is similar to that of the other models. The stability of feature weighted CLSVSM still needs improvement; it is limited by how completely keyword frequency information is expressed.
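The F value and entropy reported above are external clustering measures. As a minimal sketch, assuming the commonly used definitions (class-size-weighted best-match F value, and cluster-size-weighted entropy with natural logarithm), they could be computed as below. The abstract does not state which exact variants the paper uses, so the function names `clustering_f_value` and `clustering_entropy` and the choice of formulas are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def clustering_f_value(labels_true, labels_pred):
    """Best-match F value, weighted by class size (assumed variant).

    For every true class, take the highest F score it reaches against
    any cluster, then average those scores weighted by class size.
    """
    n = len(labels_true)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best
    return total


def clustering_entropy(labels_true, labels_pred):
    """Cluster-size-weighted entropy of the class mix inside each cluster
    (assumed variant, natural log). Lower is better; 0 means pure clusters.
    """
    n = len(labels_true)
    cluster_sizes = Counter(labels_pred)
    overlap = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for (cls, clu), n_both in overlap.items():
        p = n_both / cluster_sizes[clu]  # share of class cls in cluster clu
        total -= (cluster_sizes[clu] / n) * p * math.log(p)
    return total


# Toy example: six documents, two true classes, two predicted clusters.
true_classes = ["a", "a", "a", "b", "b", "b"]
pred_clusters = [0, 0, 1, 1, 1, 1]
f_value = clustering_f_value(true_classes, pred_clusters)
entropy = clustering_entropy(true_classes, pred_clusters)
```

Under these definitions a higher F value and a lower entropy both indicate clusters that agree more closely with the reference classes, which matches the direction of the improvements stated in the abstract.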

Keywords: CLSVSM, feature weighted, TF-IDF, clustering

NIU Feng-gao, YAN Tao. Feature Weighted CLSVSM[J]. Computer and Modernization, 2021, 0(05): 59-65.
