Computer and Modernization (计算机与现代化), 2022, Vol. 0, Issue (11): 1-8.

• Algorithm Design and Analysis •

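Categorical Data Clustering Based on Extraction of Associations from Co-association Matrix

  1. (Department 8 of System, North China Institute of Computing Technology, Beijing 100083, China)
  • Online: 2022-11-30 Published: 2022-11-30
  • About the authors: Guan Yunpeng (关云鹏, born 1998), male, from Jinzhou, Hebei; master's student; research interest: computer application technology; E-mail: 673766537@qq.com. Corresponding author: Liu Yulong (刘玉龙, born 1981), male, professor-level senior engineer, master's degree; research interests: architecture design and project management of large-scale information systems; E-mail: lyl_nci@126.com.
  • Funding:
    Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0105100)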

Abstract: Categorical data clustering is widely used across real-world domains such as medical science and computer science. Conventional categorical data clustering relies on dissimilarity measures, so its results are sensitive to the characteristics of each data set and to noise. Clustering based on representation learning, in turn, is complex to implement, and its results depend heavily on the quality of the learned representation. This paper proposes a clustering method built on the co-association matrix that directly exploits the associations in the original categorical data: categorical data clustering based on extraction of associations from co-association matrix (CDCBCM). The co-association matrix can be regarded as a summary of the associations among objects in the original data space. It is constructed from the co-association frequencies of pairs of objects in each attribute subspace; some noise is then removed from the matrix, and normalized cut is applied to obtain the clustering result. The method is tested on 16 public data sets from different domains, compared with 8 existing methods, and evaluated with the F1-score. The experimental results show that it performs best on 7 of the data sets and achieves the best average rank, indicating that it is better suited to the categorical data clustering task.

Key words: categorical data, categorical data clustering, machine learning, co-association matrix, normalized cut
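
The pipeline outlined in the abstract (co-association matrix, noise removal, normalized cut) can be illustrated with a small sketch. It is not the CDCBCM algorithm itself, whose details the abstract does not give. It assumes that each attribute subspace is a single attribute, that noise removal is a simple threshold on weak entries, and that the two-way normalized cut is found by exhaustive search, which is feasible only for a handful of objects. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def co_association(rows):
    """Fraction of attributes on which each pair of objects shares a value."""
    n, m = len(rows), len(rows[0])
    return [[sum(a == b for a, b in zip(rows[i], rows[j])) / m
             for j in range(n)] for i in range(n)]


def prune(matrix, threshold):
    """Zero out weak entries (assumed stand-in for the paper's noise removal)."""
    return [[w if w >= threshold else 0.0 for w in row] for row in matrix]


def ncut_value(matrix, part_a, part_b):
    """Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)."""
    cut = sum(matrix[i][j] for i in part_a for j in part_b)
    assoc_a = sum(sum(matrix[i]) for i in part_a)
    assoc_b = sum(sum(matrix[i]) for i in part_b)
    return cut / assoc_a + cut / assoc_b


def best_two_way_ncut(matrix):
    """Try every bipartition and keep the one with the smallest Ncut value."""
    n = len(matrix)
    best = None
    for size in range(1, n // 2 + 1):
        for part_a in combinations(range(n), size):
            part_b = tuple(i for i in range(n) if i not in part_a)
            score = ncut_value(matrix, part_a, part_b)
            if best is None or score < best[0]:
                best = (score, part_a, part_b)
    return best


# Toy categorical data: six objects described by three attributes.
rows = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("red", "oval", "small"),
    ("blue", "square", "large"),
    ("blue", "square", "small"),
    ("green", "square", "large"),
]

weights = prune(co_association(rows), threshold=0.5)
score, part_a, part_b = best_two_way_ncut(weights)
print(score, part_a, part_b)
```

On this toy input, pruning leaves two disconnected groups, so the search separates objects 0-2 from objects 3-5 with an Ncut value of 0. For realistic data sizes the exhaustive search would be replaced by the usual spectral relaxation of normalized cut.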