Similarity Measure of Categorical Data Based on Information Theory

doi:10.3969/j.issn.1006-2475.2018.05.007

Computer and Modernization, 2018, Vol. 0, Issue (05): 30-.

(School of Mathematics, South China University of Technology, Guangzhou 510641, China)

Received: 2018-01-11 Online: 2018-06-13 Published: 2018-06-13
About the authors: ZHENG Bi-ru (b. 1993), female, from Shantou, Guangdong, master's student at the School of Mathematics, South China University of Technology; research interests: data mining, machine learning. WU Guang-chao (b. 1972), male, from Shantou, Guangdong, associate professor and master's supervisor; research interests: data mining, machine learning.
Funding: General Program of the National Natural Science Foundation of China (61370102)

Abstract

Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Commonly used distance measures are designed mainly for numerical data. For categorical data, this paper proposes a data-driven similarity measure. The method draws on both attribute values and class labels: it combines the class-conditional probabilities of attribute values with information theory to measure the similarity of categorical data. To compare it with previously proposed similarity measures, each of eight measures is combined with the k-nearest neighbor algorithm to classify several categorical data sets, and the resulting error rates are compared using ten-fold cross-validation. The experiments show that the proposed measure, combined with the k-nearest neighbor method, yields a lower classification error rate.

Keywords: similarity, categorical data, information theory, conditional probability
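The abstract does not state the formula itself, so the following Python sketch only illustrates the general idea of a class-aware, information-theoretic measure plugged into k-nearest neighbor. It rests on assumptions that are not taken from the paper: each attribute value is represented by its estimated class distribution P(class | attribute = value), two differing values are compared with the Jensen-Shannon divergence (a stand-in, not the authors' measure), and per-attribute scores are averaged. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def class_conditional(X, y):
    """Estimate P(class | attribute j = value) from labelled training rows."""
    classes = sorted(set(y))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            counts[(j, value)][label] += 1
    probs = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        probs[key] = [ctr[c] / total for c in classes]
    return probs, classes


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (assumed stand-in measure)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, w):
        # Terms with u_i == 0 contribute nothing; w_i > 0 whenever u_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(u, w) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dissimilarity(a, b, probs, n_classes):
    """Average per-attribute divergence between two categorical instances."""
    uniform = [1 / n_classes] * n_classes  # fallback for unseen values
    total = 0.0
    for j, (va, vb) in enumerate(zip(a, b)):
        if va == vb:
            continue  # identical values contribute zero
        total += js_divergence(probs.get((j, va), uniform),
                               probs.get((j, vb), uniform))
    return total / len(a)


def knn_predict(query, X, y, probs, classes, k=3):
    """Majority vote among the k training rows closest to the query."""
    ranked = sorted(
        zip(X, y),
        key=lambda pair: dissimilarity(query, pair[0], probs, len(classes)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (made up): attribute 0 determines the class, attribute 1 does not.
X_train = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
y_train = ["pos", "pos", "neg", "neg"]
probs, classes = class_conditional(X_train, y_train)
print(dissimilarity(("a", "x"), ("b", "x"), probs, len(classes)))
print(knn_predict(("b", "y"), X_train, y_train, probs, classes, k=3))
```

In this toy data, a mismatch on the uninformative attribute costs nothing while a mismatch on the informative attribute costs the maximum, which a simple match/mismatch (overlap) measure cannot express. Estimating the probabilities inside each training fold, rather than on the full data set, would keep a ten-fold cross-validation comparison free of label leakage.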

CLC number: TP391

ZHENG Bi-ru, WU Guang-chao. Similarity Measure of Categorical Data Based on Information Theory[J]. Computer and Modernization, 2018, 0(05): 30-.
