Similarity Measure of Categorical Data Based on Information Theory

doi:10.3969/j.issn.1006-2475.2018.05.007

Abstract

Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Common distance measures are mainly suited to numerical data; for categorical data, this paper proposes a data-driven similarity measure. The method uses both attribute values and class labels, combining the conditional probabilities of class labels for each attribute value with information theory to measure the similarity of categorical data. To evaluate the proposed measure against alternatives, the paper combines eight similarity measures with the k-nearest neighbor algorithm to classify several categorical data sets and compares the resulting error rates using ten-fold cross-validation. The experiments show that the proposed measure, combined with k-nearest neighbors, yields a lower classification error rate.
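The abstract does not state the formula of the proposed measure, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that "conditional probability" means P(class | attribute value) estimated from the training data, and it swaps in the Jensen-Shannon divergence as the information-theoretic ingredient. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def class_conditionals(rows, labels):
    """Estimate P(class | attribute j == v) for every observed (j, v)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # (attribute index, value) -> class counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, v)][y] += 1
    probs = {}
    for key, ctr in counts.items():
        total = sum(ctr.values())
        probs[key] = [ctr[c] / total for c in classes]
    return probs

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity(x, y, probs):
    """Mean per-attribute similarity: 1 minus the JSD between the class
    distributions of the two values (assumed form, not from the paper)."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x, y)):
        if a == b:
            total += 1.0
        elif (j, a) in probs and (j, b) in probs:
            total += 1.0 - js_divergence(probs[(j, a)], probs[(j, b)])
        # Assumption: a value unseen in training contributes zero similarity.
    return total / len(x)

def knn_predict(x, rows, labels, probs, k=3):
    """Majority vote among the k training rows most similar to x."""
    ranked = sorted(zip(rows, labels),
                    key=lambda pair: similarity(x, pair[0], probs),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set (made up for illustration).
rows = [("a", "x"), ("a", "y"), ("b", "x"),
        ("c", "z"), ("c", "z"), ("d", "z")]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
probs = class_conditionals(rows, labels)

print(similarity(("a", "x"), ("b", "y"), probs))
print(knn_predict(("b", "y"), rows, labels, probs, k=3))
```

In this toy set, the instances ("a", "x") and ("b", "y") share no attribute value, so a simple overlap measure would score them as completely dissimilar. A class-aware measure can still treat them as close because their values predict the same class, which is the kind of label information the abstract says the proposed measure exploits.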

CLC Number: TP391

ZHENG Bi-ru, WU Guang-chao. Similarity Measure of Categorical Data Based on Information Theory[J]. Computer and Modernization, 2018, 0(05): 30-.

