A Similarity Algorithm for Chinese Text Based on Semantics

Computer and Modernization, 2015, Vol. 0, Issue (4): 6-9.

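1. College of Information Engineering, Xiangtan University, Xiangtan 411105, China; 2. Key Laboratory of Intelligent Computing and Information Processing (Xiangtan University), Ministry of Education, Xiangtan 411105, China

Received: 2015-01-30 Online: 2015-04-27 Published: 2015-04-29
About the authors: XIA Zhi-ming (born 1990), male, from Yiyang, Hunan, master's student at the College of Information Engineering, Xiangtan University, and the Key Laboratory of Intelligent Computing and Information Processing of the Ministry of Education; research interests: computer networks and information security. LIU Xin (born 1975), male, associate professor; research interests: multi-valued logic, computer algorithms, and information security.
Funding:
Natural Science Foundation of Hunan Province (12JJ3066); Hunan Provincial University Science and Technology Achievement Industrialization Cultivation Project (11CY018); Hunan Province "Twelfth Five-Year Plan" Key Discipline Project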

Abstract

Abstract: This paper uses HowNet to compute the semantic similarity of words and computes text similarity from extracted keywords. After the text is segmented and stop words are filtered out, the weight of each word is calculated from its part of speech, word frequency, and paragraph frequency so that the keywords of the text can be extracted. The similarity between two texts is then computed from the similarity between their keywords. A significance analysis of the differences between the experimental results and the reference values shows that the proposed method is more accurate than the traditional semantic algorithm and the vector space model algorithm.

Key words: text similarity, semantics, HowNet, keywords, paragraph frequency
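
The pipeline summarized in the abstract (filter stop words, weight each word by part of speech, word frequency, and paragraph frequency, keep the top-weighted words as keywords, then compare texts through keyword similarity) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption: the part-of-speech weights in `POS_WEIGHT`, the product form of the weight, the best-match averaging in `text_similarity`, and the lookup table `TOY_SIM`, which only stands in for a HowNet-based word similarity. Input is assumed to be already segmented and tagged.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

Token = Tuple[str, str]      # (word, part-of-speech tag), already segmented
Keyword = Tuple[str, float]  # (word, weight)

# Hypothetical part-of-speech weights; the abstract gives no values.
POS_WEIGHT: Dict[str, float] = {"n": 1.0, "v": 0.8, "a": 0.5}
DEFAULT_POS_WEIGHT = 0.3     # assumed weight for any other tag


def extract_keywords(paragraphs: List[List[Token]],
                     stop_words: Set[str],
                     top_k: int = 3) -> List[Keyword]:
    """Rank words by an assumed weight: POS weight * relative TF * paragraph ratio."""
    tf: Counter = Counter()   # word frequency over the whole text
    pf: Counter = Counter()   # number of paragraphs containing the word
    pos_of: Dict[str, str] = {}
    for para in paragraphs:
        seen = set()
        for word, pos in para:
            if word in stop_words:
                continue
            tf[word] += 1
            pos_of.setdefault(word, pos)
            seen.add(word)
        pf.update(seen)
    total = sum(tf.values())
    n_par = len(paragraphs)
    weights = {
        w: POS_WEIGHT.get(pos_of[w], DEFAULT_POS_WEIGHT)
           * (tf[w] / total) * (pf[w] / n_par)
        for w in tf
    }
    # Sort by descending weight; break ties by word for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


def text_similarity(kw_a: List[Keyword], kw_b: List[Keyword],
                    word_sim: Callable[[str, str], float]) -> float:
    """Symmetric, weight-averaged best-match similarity (an assumed aggregation)."""
    def directed(src: List[Keyword], dst: List[Keyword]) -> float:
        total_w = sum(w for _, w in src)
        if total_w == 0 or not dst:
            return 0.0
        best = sum(w * max(word_sim(a, b) for b, _ in dst) for a, w in src)
        return best / total_w
    return 0.5 * (directed(kw_a, kw_b) + directed(kw_b, kw_a))


# Toy stand-in for a HowNet-based word similarity (values are made up).
TOY_SIM = {frozenset(("汽车", "轿车")): 0.8}  # "car" vs. "sedan"


def toy_word_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


stop = {"的"}
doc_a = [[("汽车", "n"), ("行驶", "v"), ("的", "u")], [("汽车", "n"), ("速度", "n")]]
doc_b = [[("轿车", "n"), ("速度", "n")]]
kw_a = extract_keywords(doc_a, stop, top_k=2)
kw_b = extract_keywords(doc_b, stop, top_k=2)
print(kw_a, kw_b, text_similarity(kw_a, kw_b, toy_word_sim))
```

In practice `toy_word_sim` would be replaced by a function backed by HowNet, and the tokens would come from a segmenter with part-of-speech tagging; both are outside the scope of this sketch.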

夏志明1，2，刘新1，2. 一种基于语义的中文文本相似度算法[J]. 计算机与现代化, 2015, 0(4): 6-9.

XIA Zhi-ming1，2, LIU Xin1，2. A Similarity Algorithm for Chinese Text Based on Semantics[J]. Computer and Modernization, 2015, 0(4): 6-9.

