Computer and Modernization

• Data Mining

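Similarity Measure of Categorical Data Based on Information Theory

  1. (School of Mathematics, South China University of Technology, Guangzhou 510641, Guangdong, China)
  • Received: 2018-01-11 Online: 2018-06-13 Published: 2018-06-13
  • About the authors: ZHENG Biru (b. 1993), female, from Shantou, Guangdong, master's student at the School of Mathematics, South China University of Technology; research interests: data mining, machine learning. WU Guangchao (b. 1972), male, from Shantou, Guangdong, associate professor and master's supervisor; research interests: data mining, machine learning.
  • Funding:
    General Program of the National Natural Science Foundation of China (61370102)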

Abstract: Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Common distance measures are mainly suited to numerical data. For categorical data, this paper proposes a data-driven similarity measure. The method uses the information carried by attribute values and class labels, combining the class-conditional probabilities of attribute values with information theory to measure the similarity of categorical data. To compare it with previously proposed similarity measures, eight measures are each combined with the k-nearest neighbor algorithm to classify several categorical data sets, and the resulting error rates are compared using ten-fold cross-validation. Experiments show that the proposed measure, combined with the k-nearest neighbor method, yields a lower classification error rate.

Key words: similarity, categorical data, information theory, conditional probability
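
The abstract does not give the paper's actual formula, so the following is only a rough sketch of the general idea it describes: estimate the class-conditional probabilities of each attribute value, compare two values through an information-theoretic quantity, and plug the resulting similarity into a k-nearest neighbor classifier. The use of the Jensen-Shannon divergence, the averaging over attributes, the handling of unseen values, and all function names and toy data here are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter, defaultdict


def class_conditionals(rows, labels):
    """Estimate P(class | attribute j = value) for every (j, value) pair seen."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            counts[(j, value)][label] += 1
    table = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        table[key] = [ctr[c] / total for c in classes]
    return table


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed stand-in; lies in [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity(a, b, table):
    """Mean per-attribute similarity: 1 for equal values, else 1 - JSD."""
    total = 0.0
    for j, (u, v) in enumerate(zip(a, b)):
        if u == v:
            total += 1.0
        elif (j, u) in table and (j, v) in table:
            total += 1.0 - js_divergence(table[(j, u)], table[(j, v)])
        # Assumption: a value never seen in training contributes 0 similarity.
    return total / len(a)


def knn_predict(query, rows, labels, table, k=3):
    """Majority vote among the k training instances most similar to the query."""
    ranked = sorted(
        zip(rows, labels),
        key=lambda pair: similarity(query, pair[0], table),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the first attribute determines the class, the second does not.
rows = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]
labels = ["yes", "yes", "no", "no"]
table = class_conditionals(rows, labels)
prediction = knn_predict(("red", "large"), rows, labels, table, k=3)
```

In this toy set, "small" and "large" have identical class distributions, so the sketch treats them as fully similar, while "red" and "blue" are maximally dissimilar. This is the sense in which such a measure is data-driven. An evaluation like the one in the abstract would re-estimate `table` on the training folds only within each round of ten-fold cross-validation.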

CLC number: