Computer and Modernization ›› 2016, Vol. 0 ›› Issue (3): 24-29+34. doi: 10.3969/j.issn.1006-2475.2016.03.006

• Artificial Intelligence •

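Hail Information Extraction Based on Sina Weibo

  1. (School of Electrical Engineering and Automation, Tianjin University, Tianjin 300072, China)
  • Received: 2015-10-23 Online: 2016-03-17 Published: 2016-03-17
  • About the authors: WANG Ping (b. 1955), female, from Linxi, Hebei; professor at the School of Electrical Engineering and Automation, Tianjin University; holds a master's degree; research interests: pattern recognition theory and methods, image recognition, moving-object tracking. WANG Heying (b. 1990), female, from Zhumadian, Henan; master's student; research interests: pattern recognition, natural language processing.
  • Funding:
    Natural Science Foundation of Tianjin (14JCYBJC21800)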

Abstract: To obtain accurate real-time hail reports from Sina Weibo data more easily and quickly, a three-level identification system is designed and implemented: a first identification, through Web crawler technology, of microblog posts containing the word “hail”; a second, classifier-based identification of hail events; and a third, rule-based identification of the elements of each hail report. To improve the identification of hail events, a new evaluation function for extracting sample features is added, a method that uses multiple evaluation functions jointly to determine the feature vector is proposed, and a combined classification scheme based on three classifiers is given. Test results show that the method extracts 89.5% of the hail events implicit in Sina Weibo posts, with a misidentification rate below 13.4%; for individual hail-report elements within those events, the extraction rate exceeds 96.0% and the misidentification rate is below 8.6%.

Key words: microblog, hail information, feature extraction, text classification, text element recognition, Web crawler
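
The three-level structure described in the abstract can be illustrated with a minimal sketch. The abstract does not name the crawler, the evaluation functions, the three classifiers, the way their outputs are combined, or the extraction rules, so everything below is an assumption made for illustration: a keyword filter stands in for the crawler stage, three toy cue-based voters combined by majority vote stand in for the trained classifiers, and two regular expressions stand in for the element rules. All function names, cue words, and patterns are hypothetical.

```python
import re
from typing import Callable, Dict, List

# Level 3 rules (assumed): one regex per report element.
ELEMENT_RULES: Dict[str, "re.Pattern[str]"] = {
    "time": re.compile(r"刚才|今天|昨天|上午|下午|晚上"),
    "size": re.compile(r"(?:鸡蛋|乒乓球|黄豆|核桃)(?:那么|一样)?大"),
}


def mentions_hail(post: str) -> bool:
    """Level 1 stand-in: keep only posts containing the keyword 冰雹 (hail)."""
    return "冰雹" in post


def cue_voter(post: str) -> bool:
    """Toy voter: hypothetical cue words suggesting an eyewitness report."""
    return any(cue in post for cue in ("下冰雹", "砸", "刚才", "正在"))


def non_forecast_voter(post: str) -> bool:
    """Toy voter: rejects posts with hypothetical forecast or simile cues."""
    return not any(cue in post for cue in ("预报", "预警", "可能", "像冰雹"))


def size_voter(post: str) -> bool:
    """Toy voter: a stated hailstone size hints at a real event."""
    return ELEMENT_RULES["size"].search(post) is not None


VOTERS: List[Callable[[str], bool]] = [cue_voter, non_forecast_voter, size_voter]


def is_hail_event(post: str) -> bool:
    """Level 2 stand-in: majority vote over the three voters (assumed rule)."""
    votes = sum(1 for voter in VOTERS if voter(post))
    return votes * 2 > len(VOTERS)


def extract_elements(post: str) -> Dict[str, str]:
    """Level 3 stand-in: return the first match of each element rule."""
    found: Dict[str, str] = {}
    for name, pattern in ELEMENT_RULES.items():
        match = pattern.search(post)
        if match:
            found[name] = match.group(0)
    return found


def mine_hail_reports(posts: List[str]) -> List[Dict[str, str]]:
    """Run the three levels in order and collect the surviving reports."""
    reports = []
    for post in posts:
        if not mentions_hail(post):
            continue
        if not is_hail_event(post):
            continue
        reports.append({"text": post, **extract_elements(post)})
    return reports


if __name__ == "__main__":
    sample_posts = [
        "刚才天津下冰雹了,有鸡蛋那么大",  # eyewitness report
        "气象台预报明天可能有冰雹",        # forecast, not an event
        "今天天气真好",                    # no mention of hail
    ]
    for report in mine_hail_reports(sample_posts):
        print(report)
```

Ordering the levels from cheapest to most specific means the element rules only run on posts that have already been accepted as hail events, which is one plausible reason for a layered design of this kind.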

CLC number: