A Similarity Algorithm for Chinese Text Based on Semantics

Abstract


This paper computes the semantic similarity of words using HowNet and extracts text keywords to compute the similarity between texts. After segmenting the text and filtering stop words, it computes a weight for each word by combining the word's part of speech, word frequency, and paragraph frequency, and uses these weights to extract the keywords of the text. The similarity of two texts is then calculated from the similarity of their keywords. A significance analysis of the experimental results shows that the method is more accurate than both the traditional semantic algorithm and the vector space model algorithm.
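
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the multiplicative weighting, the part-of-speech weights in POS_WEIGHT, the best-match averaging in text_similarity, and the toy_word_sim lookup table (standing in for a HowNet-based word similarity) are all assumptions made for illustration. Input is assumed to be already segmented, tagged, and stop-word filtered.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag), as a segmenter might emit

# Hypothetical part-of-speech weights; the abstract gives no values.
POS_WEIGHT: Dict[str, float] = {"n": 1.0, "v": 0.8, "a": 0.5}
DEFAULT_POS_WEIGHT = 0.1


def keyword_weights(paragraphs: Sequence[Sequence[Token]]) -> Dict[str, float]:
    """Weight = POS weight * relative word frequency * paragraph frequency.

    The multiplicative combination is an assumption made for this sketch.
    """
    total_tokens = sum(len(p) for p in paragraphs)
    if total_tokens == 0:
        return {}
    word_freq: Counter = Counter()
    para_freq: Counter = Counter()
    first_tag: Dict[str, str] = {}
    for paragraph in paragraphs:
        word_freq.update(word for word, _ in paragraph)
        para_freq.update({word for word, _ in paragraph})  # once per paragraph
        for word, tag in paragraph:
            first_tag.setdefault(word, tag)
    n_paragraphs = len(paragraphs)
    return {
        word: POS_WEIGHT.get(first_tag[word], DEFAULT_POS_WEIGHT)
        * (word_freq[word] / total_tokens)
        * (para_freq[word] / n_paragraphs)
        for word in word_freq
    }


def top_keywords(weights: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-weighted words, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def text_similarity(
    keys_a: Sequence[str],
    keys_b: Sequence[str],
    word_sim: Callable[[str, str], float],
) -> float:
    """Average, in both directions, each keyword's best match in the other text."""
    if not keys_a or not keys_b:
        return 0.0

    def directed(src: Sequence[str], dst: Sequence[str]) -> float:
        return sum(max(word_sim(s, d) for d in dst) for s in src) / len(src)

    return (directed(keys_a, keys_b) + directed(keys_b, keys_a)) / 2


# Toy stand-in for a HowNet-based word similarity (values are made up).
TOY_SIMILARITY = {frozenset(("cat", "dog")): 0.6}


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)
```

In practice, word_sim would be replaced by a similarity function backed by HowNet, and the tokens would come from a Chinese word segmenter with part-of-speech tagging; the sketch only fixes the shape of the data flow from weighted keywords to a text-level score.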

Keywords: text similarity, semantics, HowNet, keywords, paragraph frequency

XIA Zhi-ming1，2, LIU Xin1，2. A Similarity Algorithm for Chinese Text Based on Semantics[J]. Computer and Modernization, 2015, 0(4): 6-9.


