Computer and Modernization (计算机与现代化)

• Database and Data Mining •

Data Analysis Method of Campus Profile Based on University Website

  1. (College of Information, Mechanical and Electronic Engineering, Shanghai Normal University, Shanghai 200234, China)
  • Received: 2018-01-16 Online: 2018-09-11 Published: 2018-09-11
  • About the authors: Wang Songsong (王松松, born 1991), male, from Fuyang, Anhui, master's student at the College of Information, Mechanical and Electronic Engineering, Shanghai Normal University; research interests: natural language processing, data mining. Corresponding author: Gao Weixun (高伟勋, born 1973), male, senior engineer, IEICE member, Ph.D.; research interests: computer networks, intelligent information processing.

Abstract: This paper mines and analyzes the campus profile data published on official university websites and proposes a phrase similarity method that combines a phrase tree structure with the CilinSimHash algorithm. Each phrase is first converted into a tree whose root node is a number, and a structural similarity is computed on these trees. The Tongyici Cilin synonym thesaurus (同义词词林) is then combined with the SimHash algorithm to obtain a CilinSimHash-based similarity. Finally, the structure-based similarity and the CilinSimHash-based similarity are weighted to give the phrase similarity. The method is applied to the analysis of official university website data, and cluster analysis of these data yields the relationship between the website data and university evaluation indices. Structured data are also extracted from the website data, and clustering of the related index data shows that the development of universities remains unbalanced across different levels of institution.

Key words: university official website, phrase similarity, SimHash, university evaluation index
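
The weighting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the hash function, the fingerprint width, the weights, or the tree-based measure, so everything named here is an assumption. `TOY_CILIN` is a hypothetical miniature stand-in for Tongyici Cilin (the category codes are invented), MD5 is an arbitrary choice of token hash, `alpha` is an assumed weight, and the structural similarity is simply passed in as a number because the tree construction is not specified.

```python
import hashlib

# Hypothetical miniature stand-in for the Tongyici Cilin thesaurus: synonyms
# share one category code. The entries and codes below are invented.
TOY_CILIN = {
    "university": "C001",
    "college": "C001",
    "laboratory": "C002",
    "lab": "C002",
    "national": "C003",
    "state": "C003",
}


def token_hash(token, bits=64):
    """Hash a token to a `bits`-wide integer (MD5 is an assumed choice).

    `bits` must be a multiple of 8 and at most 128 (the MD5 digest size).
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def cilin_simhash(words, bits=64):
    """SimHash fingerprint computed over synonym codes instead of raw words."""
    counts = [0] * bits
    for word in words:
        # Replace the word by its synonym code when the thesaurus knows it.
        feature = TOY_CILIN.get(word, word)
        h = token_hash(feature, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_similarity(a, b, bits=64):
    """Map the Hamming distance between two fingerprints to [0, 1]."""
    return 1.0 - bin(a ^ b).count("1") / bits


def phrase_similarity(structure_sim, words_a, words_b, alpha=0.5, bits=64):
    """Weight a given structural similarity against the SimHash similarity."""
    semantic_sim = hamming_similarity(
        cilin_simhash(words_a, bits), cilin_simhash(words_b, bits), bits
    )
    return alpha * structure_sim + (1.0 - alpha) * semantic_sim
```

Because synonyms collapse to the same code before hashing, `["national", "laboratory"]` and `["state", "lab"]` receive identical fingerprints in this toy setup, which is the effect that combining a thesaurus with SimHash is meant to achieve.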
