Information Extraction of Web Pages Based on Support Vector Machine

doi:10.3969/j.issn.1006-2475.2018.09.005

Abstract

To extract text from Web pages, this paper presents an information extraction method based on support vector machines (SVM). The method follows a "lenient in, strict out" policy. The first step traverses the Web page's DOM tree according to rules derived from the page structure and locates an HTML tag that contains both useful information and noise. The second step selects five important features of the HTML tags that carry noise and trains an SVM on the sample data. The resulting model effectively removes navigation, promotion, copyright, and other noise while preserving the useful information of the page. The method is applied to several commonly used websites. The experimental results show that it extracts text and reduces noise well, and that it preserves short texts, such as hyperlinks related to the main text, that traditional methods often delete by mistake.
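As a rough illustration of the two steps summarized above, the sketch below walks an HTML document, collects candidate blocks, and computes a five-element feature vector for each one. It is only a sketch under stated assumptions: the abstract does not name the five features or the traversal rules, so the tag set `BLOCK_TAGS`, the chosen features (text length, link count, link density, punctuation count, nesting depth), and all function names are hypothetical. The resulting vectors are the kind of input that could then be labeled and passed to an SVM classifier for training.

```python
from html.parser import HTMLParser

# Assumption: tags treated as candidate blocks. The paper's own rules are not given.
BLOCK_TAGS = {"div", "p", "ul", "td", "article", "section"}
PUNCTUATION = set(",.;:!?，。；：！？")


class BlockFeatureParser(HTMLParser):
    """Collects text statistics for each candidate block while parsing."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of blocks that are currently open
        self.blocks = []       # finished blocks, in closing order
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.open_blocks.append({
                "tag": tag,
                "depth": len(self.open_blocks) + 1,
                "text": "",
                "link_text": "",
                "link_count": 0,
            })
        elif tag == "a":
            self.link_depth += 1
            if self.open_blocks:
                self.open_blocks[-1]["link_count"] += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1
        elif self.open_blocks and self.open_blocks[-1]["tag"] == tag:
            self.blocks.append(self.open_blocks.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.open_blocks:
            return
        # Text is attributed to the innermost open block only.
        block = self.open_blocks[-1]
        block["text"] += text
        if self.link_depth:
            block["link_text"] += text


def feature_vector(block):
    """Five hypothetical features; the paper's actual features may differ."""
    text_len = len(block["text"])
    link_density = len(block["link_text"]) / text_len if text_len else 0.0
    punct_count = sum(ch in PUNCTUATION for ch in block["text"])
    return [text_len, block["link_count"], link_density, punct_count, block["depth"]]


def extract_block_features(html):
    """Returns (tag, features) for every block that directly contains text."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    return [(b["tag"], feature_vector(b)) for b in parser.blocks if b["text"]]


SAMPLE = """
<div id="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="main"><p>SVM separates content from noise, as shown here.</p>
<p>See <a href="/more">more</a>.</p></div>
"""

if __name__ == "__main__":
    for tag, features in extract_block_features(SAMPLE):
        print(tag, features)
```

In this toy page the navigation block consists entirely of link text, while the short paragraph that ends with a related hyperlink has a lower link density and some punctuation. Features of this kind are one plausible way for a classifier to keep short but relevant text that a pure length threshold would discard.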

Key words: support vector machine, information extraction, HTML tag, noise reduction, machine learning

CLC Number: TP391.1

LIANG Dong, YANG Yong-quan, WEI Zhi-qiang. Information Extraction of Web Pages Based on Support Vector Machine[J]. Computer and Modernization, doi: 10.3969/j.issn.1006-2475.2018.09.005.
