Information Extraction of Web Pages Based on Support Vector Machine

doi:10.3969/j.issn.1006-2475.2018.09.005

Computer and Modernization, 2018, Vol. 0, Issue (09): 21-.

(College of Information Science and Engineering, Ocean University of China, Qingdao, Shandong 266100, China)

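Received: 2018-02-03 Online: 2018-09-29 Published: 2018-09-30
About the authors: LIANG Dong (b. 1993), male, from Harbin, Heilongjiang, master's student at the College of Information Science and Engineering, Ocean University of China; research interests: big data, machine learning. Corresponding author: YANG Yong-quan (b. 1985), male, lecturer, Ph.D.; research interests: cloud computing, Internet of Things. WEI Zhi-qiang (b. 1969), male, professor, doctoral supervisor; research interests: computer software and theory, data mining.
Funding: Aoshan Science and Technology Innovation Program of the National Laboratory for Marine Science and Technology (2016ASKJ07, 2016ASKJ07-08)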

LIANG Dong, YANG Yong-quan, WEI Zhi-qiang

Abstract

This paper proposes a support vector machine (SVM) based method for extracting the main text of Web pages. The method follows a "lenient in, strict out" strategy. The first step traverses the page's DOM tree according to regularities in Web page structure and locates an HTML tag that contains both the main text and noise. The second step selects five important features of the HTML tags that carry noise and trains an SVM on the sample data. The resulting model effectively removes noise such as navigation, promotional, and copyright content while retaining the main text. The method was applied to several widely used websites. The experimental results show that it extracts main text and reduces noise well, and that it accurately preserves content that traditional methods often delete by mistake, such as short texts and hyperlinks related to the main text.

Keywords: support vector machine, information extraction, HTML tag, noise reduction, machine learning
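
The second step described in the abstract can be illustrated with a minimal Python sketch. It assumes the first step has already located the container tag and split it into child blocks. The abstract does not name the five features, so the five used here (capped text length, link-text ratio, punctuation density, capped link count, capped tag count) are hypothetical stand-ins. The hand-written linear SVM, the toy fragments, and all function names are likewise illustrative assumptions, not the authors' implementation.

```python
from html.parser import HTMLParser

PUNCTUATION = set(",.;:!?，。；：！？、")


class BlockStats(HTMLParser):
    """Accumulates simple text and link statistics for one HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_len = self.link_len = self.punct = self.tags = self.links = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            self.links += 1
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        self.text_len += len(text)
        self.punct += sum(ch in PUNCTUATION for ch in text)
        if self._anchor_depth:
            self.link_len += len(text)


def block_features(fragment):
    """Map one candidate block to five hypothetical features in [0, 1]."""
    s = BlockStats()
    s.feed(fragment)
    s.close()
    n = max(s.text_len, 1)
    return [
        min(s.text_len, 100) / 100,  # capped text length
        s.link_len / n,              # share of text inside <a> tags
        s.punct / n,                 # punctuation density
        min(s.links, 5) / 5,         # capped link count
        min(s.tags, 10) / 10,        # capped tag count
    ]


def score(w, b, x):
    """Signed distance-like score of feature vector x under (w, b)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=300):
    """Sub-gradient descent on the L2-regularized hinge loss; labels are +1/-1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * score(w, b, xi) < 1
            w = [wj * (1 - eta * lam) for wj in w]  # weight decay step
            if violated:  # hinge term is active: move toward the sample
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def is_body(w, b, fragment):
    """True when the block falls on the main-text side of the hyperplane."""
    return score(w, b, block_features(fragment)) > 0


# Toy labeled blocks (invented for illustration only).
BODY = [
    '<p>The typhoon made landfall near Qingdao on Monday, and ferry services were suspended until further notice.</p>',
    '<p>Researchers said the new buoy network, which covers the Yellow Sea, will report data every ten minutes.</p>',
    '<p>Officials confirmed the schedule. Details are in the <a href="/report">full report</a>, released today.</p>',
]
NOISE = [
    '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li><li><a href="/sports">Sports</a></li></ul>',
    '<div><a href="/ad1">Buy now</a> <a href="/ad2">Hot deals</a></div>',
    '<p>Copyright 2018 <a href="/about">About us</a> <a href="/contact">Contact</a></p>',
]

X = [block_features(f) for f in BODY + NOISE]
y = [1] * len(BODY) + [-1] * len(NOISE)
w, b = train_linear_svm(X, y)
print([is_body(w, b, f) for f in BODY + NOISE])
```

Calling `is_body(w, b, fragment)` on a new child block keeps it when it falls on the main-text side of the learned hyperplane and discards it otherwise. In practice a library implementation such as scikit-learn's `sklearn.svm.SVC` would likely replace the hand-written trainer, and the features would need to be chosen and scaled on real labeled pages.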

CLC number: TP391.1

Cite this article: LIANG Dong, YANG Yong-quan, WEI Zhi-qiang. Information Extraction of Web Pages Based on Support Vector Machine[J]. Computer and Modernization, 2018, 0(09): 21-.
