Computer and Modernization (计算机与现代化)

• Artificial Intelligence

A Method for Extracting the Main Text of Web Pages Based on Support Vector Machines

  

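  1. (College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100)
  • Received: 2018-02-03 Online: 2018-09-29 Published: 2018-09-30
  • About the authors: LIANG Dong (born 1993), male, from Harbin, Heilongjiang, master's student at the College of Information Science and Engineering, Ocean University of China; research interests: big data, machine learning. Corresponding author: YANG Yong-quan (born 1985), male, lecturer, Ph.D.; research interests: cloud computing, Internet of Things. WEI Zhi-qiang (born 1969), male, professor, doctoral supervisor; research interests: computer software and theory, data mining.
  • Funding:
    Aoshan Science and Technology Innovation Program of the National Laboratory for Marine Science and Technology (2016ASKJ07, 2016ASKJ07-08)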

Information Extraction of Web Pages Based on Support Vector Machine

  1. LIANG Dong, YANG Yong-quan, WEI Zhi-qiang
  • Received:2018-02-03 Online:2018-09-29 Published:2018-09-30

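Abstract: This paper proposes a method based on support vector machines (SVM) for extracting the main text of web pages. The method follows a "wide in, strict out" strategy. In the first step, it traverses the page's DOM tree according to regularities in web page structure and locates an HTML tag that contains both the main text and noise. In the second step, it selects five important features of the HTML tag containing noise and trains an SVM on the sample data. The model produced by SVM training effectively removes navigation, promotional, copyright, and other noise while retaining the main text. The method was applied to several widely used websites. The experimental results show that it extracts the main text and reduces noise well, and that it accurately retains content that traditional methods often delete by mistake, such as short texts and hyperlinks related to the main text.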

Keywords: support vector machine, main text extraction, HTML tag, noise reduction, machine learning
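The first step described in the abstract (walking the DOM tree to find one tag that holds the main text together with some noise) might be sketched as follows. The abstract does not state the structural rules the authors use, so the heuristic here is an assumption: descend from the root while a single child still holds at least `keep_ratio` of the page's visible text. The names `Node`, `TreeBuilder`, `locate_container`, `keep_ratio`, and `SAMPLE_PAGE` are hypothetical, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # elements with no end tag
SKIP_TAGS = {"script", "style"}                           # text here is not visible


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_text = 0  # characters of text directly inside this element

    def text_len(self):
        """Visible text length of this element including all descendants."""
        return self.own_text + sum(child.text_len() for child in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from well-formed HTML (no implicit-close handling)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.own_text += len(data.strip())


def locate_container(html, keep_ratio=0.6):
    """Assumed heuristic: return the deepest element that still holds at
    least keep_ratio of the page's visible text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    node = builder.root
    total = node.text_len()
    while True:
        best = max(node.children, key=Node.text_len, default=None)
        if best is None or best.text_len() < keep_ratio * total:
            return node
        node = best


SAMPLE_PAGE = (
    "<html><body>"
    "<div id='nav'><a href='/'>Home</a></div>"
    "<div id='main'><h1>Title</h1>"
    "<p>First paragraph of the article body.</p>"
    "<p>Second paragraph of the article body.</p>"
    "<a href='/ad'>Ad</a></div>"
    "<div id='foot'>Copyright</div>"
    "</body></html>"
)

container = locate_container(SAMPLE_PAGE)
print(container.tag, container.attrs)
```

Raising `keep_ratio` makes the walk stop earlier and return a larger container, which is closer in spirit to the "wide in" half of the strategy: the container may keep noise such as the advertisement link, and removing it is left to the classifier in the second step.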

Abstract: To extract the main text of Web pages, this paper presents a method based on support vector machines (SVM). The method adopts a "wide in, strict out" policy. The first step traverses the Web page's DOM tree according to the rules of Web page structure and locates an HTML tag that contains both useful and noise information. The second step selects five important features of the HTML tag with noise information and uses an SVM to train on the sample data. The resulting model can effectively remove navigation, promotion, copyright, and other noise information while preserving the useful information of Web pages. The method is applied to several commonly used websites. The experimental results show that it extracts text and reduces noise well, and that it preserves short texts and text-related hyperlinks, which traditional methods often delete by mistake.

Key words: support vector machine, information extraction, HTML tag, noise reduction, machine learning
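The second step (an SVM that separates main text from noise using five features per tag) could look roughly like the sketch below. The abstract does not name the five features, the kernel, or the training procedure, so everything specific here is an assumption: the feature names in `FEATURES`, the toy values in `X_TRAIN`, the linear kernel, and the plain subgradient-descent trainer `train_linear_svm` are hypothetical illustrations, not the authors' setup.

```python
# Hypothetical features, each scaled to [0, 1]; the paper's five are not listed
# in the abstract.
FEATURES = ["text_length", "link_density", "punctuation_density",
            "relative_depth", "relative_position"]

# Toy training data (invented for illustration): +1 = main text, -1 = noise.
X_TRAIN = [
    [0.90, 0.0, 0.80, 0.4, 0.5],   # long paragraph
    [0.10, 1.0, 0.00, 0.3, 0.0],   # navigation bar
    [0.60, 0.1, 0.50, 0.5, 0.4],   # medium paragraph
    [0.20, 0.8, 0.10, 0.3, 1.0],   # footer links
    [0.20, 0.2, 0.40, 0.5, 0.6],   # short text that belongs to the article
    [0.05, 0.9, 0.05, 0.6, 0.9],   # promotional link block
]
Y_TRAIN = [1, -1, 1, -1, 1, -1]


def decision(w, b, x):
    """Signed score w.x + b; positive means 'main text'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear soft-margin SVM trained by subgradient descent on
    lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)), one sample at a time."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * decision(w, b, xi) < 1.0
            # Gradient of the L2 penalty shrinks w on every step.
            w = [(1.0 - lr * lam) * wj for wj in w]
            if violated:
                # Hinge-loss subgradient pushes the sample past the margin.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def is_main_text(w, b, x):
    """Keep a tag when its score is positive, drop it as noise otherwise."""
    return decision(w, b, x) > 0.0


w, b = train_linear_svm(X_TRAIN, Y_TRAIN)
paragraph = [0.7, 0.1, 0.6, 0.5, 0.5]
link_block = [0.1, 0.9, 0.0, 0.5, 0.5]
print(decision(w, b, paragraph), decision(w, b, link_block))
```

In practice a library SVM with a tuned kernel and properly scaled features would normally replace this hand-written trainer; the sketch only shows how per-tag feature vectors and labels feed a margin-based classifier that then decides which tags inside the container survive.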

CLC number: