Categorical Data Clustering Based on Extraction of Associations from Co-association Matrix

Abstract: Categorical data clustering is widely used in real-world fields such as medical science and computer science. Conventional categorical data clustering relies on a dissimilarity measure, so its results are sensitive to the characteristics of each data set and to noise. Clustering based on representation learning, in turn, is complex to implement, and its results depend heavily on the quality of the learned representation. This paper proposes a clustering method built on the co-association matrix that directly considers the association relationships in the original categorical data: categorical data clustering based on extraction of associations from co-association matrix (CDCBCM). The co-association matrix can be regarded as a summary of the information associations in the original data space. The method constructs the co-association matrix by computing the co-association frequency of each pair of objects across the attribute subspaces, removes some noise information from the matrix, and then applies normalized cut to obtain the clustering result. The method is tested on 16 public data sets from different domains, compared with 8 existing methods, and evaluated using the F1-score. The experimental results show that the method performs best on 7 data sets and achieves the best average rank, indicating that it handles categorical data clustering better than the compared methods.

Key words: categorical data, categorical data clustering, machine learning, co-association matrix, normalized cut
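
The pipeline outlined in the abstract (build a co-association matrix, filter out noise, then cut) can be illustrated with a minimal Python sketch. It is not the authors' CDCBCM implementation. It assumes that each single attribute forms one subspace, uses a fixed threshold as a stand-in for the noise removal step, and finds a two-way normalized cut by exhaustive search, which is only feasible for a handful of objects. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def co_association(rows):
    """C[i][j] = fraction of attributes on which objects i and j share a value."""
    n, m = len(rows), len(rows[0])
    return [[sum(rows[i][a] == rows[j][a] for a in range(m)) / m
             for j in range(n)] for i in range(n)]


def prune(C, threshold):
    """Assumed noise filter: drop self-loops and associations below `threshold`."""
    n = len(C)
    return [[C[i][j] if i != j and C[i][j] >= threshold else 0.0
             for j in range(n)] for i in range(n)]


def ncut_value(W, part_a):
    """Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    n = len(W)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(W[i][j] for i in a for j in b)
    assoc_a = sum(W[i][j] for i in a for j in range(n))
    assoc_b = sum(W[i][j] for i in b for j in range(n))
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")  # a side without any edge is not a usable cluster
    return cut / assoc_a + cut / assoc_b


def best_two_way_ncut(W):
    """Exhaustive search over bipartitions (illustration only, exponential cost)."""
    n = len(W)
    best_value, best_part = float("inf"), None
    for size in range(1, n // 2 + 1):
        for part in combinations(range(n), size):
            value = ncut_value(W, part)
            if value < best_value:
                best_value, best_part = value, part
    labels = [0 if i in best_part else 1 for i in range(n)]
    return labels, best_value


# Toy data: six objects, four categorical attributes. Objects 2 and 3 share
# the value "v" on the last attribute, which acts as a weak "noise" link.
rows = [
    ["a", "x", "p", "u"],
    ["a", "x", "q", "u"],
    ["a", "y", "p", "v"],
    ["b", "z", "r", "v"],
    ["b", "z", "s", "w"],
    ["b", "w", "r", "w"],
]

C = co_association(rows)
W = prune(C, threshold=0.5)          # removes the weak 0.25 associations
labels, value = best_two_way_ncut(W)
print(labels, value)
```

In practice the exhaustive search would be replaced by the spectral relaxation of normalized cut, which solves an eigenvalue problem on the graph Laplacian instead of enumerating partitions.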

GUAN Yun-peng, LIU Yu-long. Categorical Data Clustering Based on Extraction of Associations from Co-association Matrix[J]. Computer and Modernization, 2022, 0(11): 1-8.

