An Improved K-means Optimization Approach for Text Clustering

doi:10.3969/j.issn.1006-2475.2015.03.010

Computer and Modernization, 2015, Vol. 0, Issue (3): 48-51, 56.

School of Foreign Language, Changshu Institute of Technology, Changshu 215500, China

Received: 2014-12-09 Online: 2015-03-23 Published: 2015-03-26
About the author: WANG Qiong (born 1982), female, from Changshu, Jiangsu; laboratory technician at the School of Foreign Language, Changshu Institute of Technology; holds a master's degree; research interests: data mining and intelligent information processing.
Funding:
Natural Science Foundation of Jiangsu Province (BK2012209); Suzhou Science and Technology Development Program (SYG201409)

Abstract

Abstract: This paper proposes k-meansSC, an improved k-means optimization method for text clustering. The documents to be clustered are first segmented into words, and a set of main terms is extracted from them. Text feature vectors are then represented with a Boolean function and with a TFIDF function, and the strengths and weaknesses of the two representations are compared experimentally. A support matrix and a confidence matrix are built on the same term set and used to define a similarity formula. The number of iterations and the behavior of the error function are analyzed in detail for this formula and for other distance formulas under different numbers of clusters. The experimental results show that, under certain conditions, TFIDF-type text feature vectors effectively improve running efficiency and clustering effectiveness.

Key words: k-means, similarity, text clustering, support, confidence
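
The abstract names the building blocks of the method but does not state its formulas. The following is therefore only a minimal sketch of those building blocks, not the author's implementation. It assumes a toy, already-segmented corpus (hypothetical data), one common TF-IDF variant (raw term frequency times log(N/df)), and the usual association-rule definitions of support and confidence applied to term co-occurrence across documents. The paper's similarity formula is not reproduced here because the abstract does not give it.

```python
import math

# Toy corpus: each document is already segmented into terms (hypothetical data).
docs = [
    ["data", "mining", "cluster"],
    ["text", "cluster", "cluster"],
    ["data", "text"],
]
terms = sorted({t for d in docs for t in d})  # the extracted term set
n_docs = len(docs)


def boolean_vector(doc):
    """Boolean representation: 1 if the term occurs in the document, else 0."""
    present = set(doc)
    return [1 if t in present else 0 for t in terms]


def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(1 for d in docs if term in d)


def tfidf_vector(doc):
    """Assumed TF-IDF variant: raw term frequency times log(N / df)."""
    return [doc.count(t) * math.log(n_docs / doc_freq(t)) for t in terms]


bool_vecs = [boolean_vector(d) for d in docs]
tfidf_vecs = [tfidf_vector(d) for d in docs]


def support(i, j):
    """Assumed definition: fraction of documents containing both term i and term j."""
    return sum(1 for v in bool_vecs if v[i] and v[j]) / n_docs


def confidence(i, j):
    """Assumed definition: support(i, j) / support(i, i), an estimate of P(j | i)."""
    return support(i, j) / support(i, i)


k = len(terms)
support_matrix = [[support(i, j) for j in range(k)] for i in range(k)]
confidence_matrix = [[confidence(i, j) for j in range(k)] for i in range(k)]
```

Under these assumptions the support matrix is symmetric while the confidence matrix generally is not, so any similarity measure built from both has to decide how to combine the two directions. How k-meansSC does this is specified in the full paper, not in the abstract.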

CLC Number: TP311

WANG Qiong. An Improved K-means Optimization Approach for Text Clustering[J]. Computer and Modernization, 2015, 0(3): 48-51, 56.

