Computer and Modernization

Similarity Measure of Categorical Data Based on Information Theory

  

  School of Mathematics, South China University of Technology, Guangzhou 510641, China
  • Received: 2018-01-11 Online: 2018-06-13 Published: 2018-06-13

Abstract: Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Common distance measures are mainly suited to numerical data. For categorical data, this paper proposes a data-driven similarity measure. The method uses both attribute values and class labels: it combines the conditional probability of a class label given an attribute value with information theory to measure the similarity of categorical data. To evaluate the proposed measure, this paper combines 8 similarity measures with the k-nearest neighbor algorithm to classify several categorical data sets and compares the resulting error rates under ten-fold cross-validation. Experiments show that the proposed measure, combined with the k-nearest neighbor method, yields a lower classification error rate.
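
The abstract does not state the formula, so the following is only a minimal sketch of the general idea, not the paper's measure. It assumes that each attribute value is represented by a smoothed estimate of P(class | value), that two values are compared with the Jensen-Shannon divergence (a stand-in for the unspecified information-theoretic component), and that per-attribute divergences are averaged. All function names, the smoothing constant, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_conditionals(rows, labels, smoothing=1.0):
    """Estimate P(class | attribute j = value) with additive smoothing."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            counts[(j, value)][label] += 1
    table = {}
    for key, ctr in counts.items():
        total = sum(ctr.values()) + smoothing * len(classes)
        table[key] = [(ctr[c] + smoothing) / total for c in classes]
    return table, classes


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; it lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distance(x, y, table, n_classes):
    """Average per-attribute divergence (an assumed aggregation rule)."""
    uniform = [1.0 / n_classes] * n_classes  # fallback for unseen values
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a != b:
            total += js_divergence(table.get((j, a), uniform),
                                   table.get((j, b), uniform))
    return total / len(x)


def knn_predict(query, rows, labels, table, n_classes, k=3):
    """Majority vote among the k training rows closest to the query."""
    order = sorted(range(len(rows)),
                   key=lambda i: distance(query, rows[i], table, n_classes))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "cool"),
        ("rainy", "mild"), ("overcast", "hot"), ("overcast", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]

table, classes = class_conditionals(rows, labels)
print(knn_predict(("sunny", "hot"), rows, labels, table, len(classes), k=3))
```

Because the distance depends on class-conditional distributions rather than on literal value matches, two different values that predict the class equally well (here "rainy" and "overcast") are treated as identical. In an evaluation like the one described, the probability table would have to be re-estimated from the training folds only in each round of cross-validation.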

Key words: similarity, categorical data, information theory, conditional probability
