Computer and Modernization ›› 2020, Vol. 0 ›› Issue (06): 89-.


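Spam Text Classification Method Based on Deep Q-network

  1. (Suzhou Power Supply Branch, State Grid Jiangsu Electric Power Limited Company, Suzhou 215004, China)
  • Received: 2019-09-20 Online: 2020-06-24 Published: 2020-06-28
  • About the authors: JING Dongsheng (景栋盛, born 1981), male, from Suzhou, Jiangsu, senior engineer, master's degree, research interests: machine learning, network security, intelligent information systems, E-mail: jds19810119@163.com; XUE Jinsong (薛劲松, born 1977), male, from Changshu, Jiangsu, senior engineer, bachelor's degree, research interests: network security, intelligent information systems, E-mail: 6802569@qq.com; FENG Renjun (冯仁君, born 1989), male, from Yancheng, Jiangsu, engineer, master's degree, research interests: network security, intelligent information systems, E-mail: frj1989@126.com.
  • Supported by:
    National Natural Science Foundation of China (61303108); Major Project of Natural Science Research of Jiangsu Higher Education Institutions (17KJA520004)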

Abstract: Electronic mail is widely used in people's work and daily life. However, it also serves as a carrier for spam filled with false information, malware and marketing advertisements. Spam not only inconveniences users but also consumes large amounts of network resources and can seriously threaten information security. Effectively identifying and filtering spam is therefore an important task. Current spam filtering methods fall mainly into source-based identification and content-based identification, but most of them are neither effective nor efficient, require extensive manual feature labeling, and fail to keep up with changes in the content and form of spam. In recent years, researchers have applied deep reinforcement learning to natural language processing with notable results. Motivated by this, this paper proposes a spam text classification method based on the deep Q-network. The mail text is first preprocessed and segmented into words, and the CBOW model in Word2vec is used to obtain a word vector for each segmented word. The deep Q-network then processes the resulting set of word vectors directly, without extracting mail features, which avoids the negative impact of biased feature extraction and improves the efficiency and precision of spam filtering. Experimental results verify the effectiveness of the proposed method.

Key words: electronic mail, deep Q-network, Word2vec, text classification
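
The abstract does not say how classification is cast as a reinforcement learning problem, nor what network architecture is used. The following is only a minimal sketch under stated assumptions: each mail is treated as a one-step episode, the state is the average of the mail's word vectors, the action is the predicted label, and the reward is +1 for a correct label and -1 otherwise. A linear Q-function stands in for the deep network, and small hand-written vectors stand in for CBOW output. All names, vectors and hyperparameters here are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical 2-D word vectors standing in for CBOW output (not from the paper).
WORD_VECTORS = {
    "free": [0.9, 0.1],
    "prize": [0.8, 0.2],
    "winner": [0.85, 0.05],
    "meeting": [0.1, 0.9],
    "report": [0.2, 0.8],
    "agenda": [0.05, 0.95],
}
DIM = 2
ACTIONS = ("ham", "spam")  # action index 0 = ham, 1 = spam


def mail_state(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vecs:
        return [0.0] * DIM
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]


def q_values(weights, state):
    """Linear Q-function: one weight row (features plus bias) per action."""
    return [sum(w * s for w, s in zip(row[:-1], state)) + row[-1]
            for row in weights]


def train(samples, episodes=3000, alpha=0.1, epsilon=0.2, seed=0):
    """Q-learning with epsilon-greedy exploration over one-step episodes."""
    rng = random.Random(seed)
    weights = [[0.0] * (DIM + 1) for _ in ACTIONS]
    for _ in range(episodes):
        tokens, label = rng.choice(samples)
        state = mail_state(tokens)
        q = q_values(weights, state)
        if rng.random() < epsilon:
            action = rng.randrange(len(ACTIONS))  # explore
        else:
            action = max(range(len(ACTIONS)), key=q.__getitem__)  # exploit
        reward = 1.0 if ACTIONS[action] == label else -1.0
        # The episode ends after one action, so the TD target is the reward.
        td_error = reward - q[action]
        row = weights[action]
        for i in range(DIM):
            row[i] += alpha * td_error * state[i]
        row[-1] += alpha * td_error  # bias term
    return weights


def classify(weights, tokens):
    """Greedy policy: pick the label with the highest Q-value."""
    q = q_values(weights, mail_state(tokens))
    return ACTIONS[max(range(len(ACTIONS)), key=q.__getitem__)]


# Toy labeled mails (hypothetical), already segmented into tokens.
SAMPLES = [
    (["free", "prize"], "spam"),
    (["winner", "free"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["report", "meeting"], "ham"),
]

weights = train(SAMPLES)
print(classify(weights, ["free", "winner", "prize"]))
print(classify(weights, ["agenda", "report"]))
```

Because every episode terminates after a single action, the discount factor and the bootstrapped next-state value of general Q-learning drop out. A deep Q-network would normally replace the linear function with a neural network and add experience replay and a target network, which this sketch omits for brevity.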
