Spam Recognition Method Based on BiGRU-Attention-CNN Model

Abstract

Abstract: E-mail is an important communication tool, but spam has long disrupted people's daily work and life. Continuously improving spam detection technology and raising its speed and accuracy therefore has both research and practical significance. The bidirectional gated recurrent unit (BiGRU) and the convolutional neural network (CNN) are widely used in text classification, and combining them exploits BiGRU's ability to extract contextual dependencies together with CNN's ability to extract features. Spam detection, however, also requires attention to certain specific words in an email. This paper therefore proposes a spam detection method based on a BiGRU-Attention-CNN model to improve detection accuracy. The model first converts the email text into feature vectors and feeds them to a BiGRU for sequence learning, then applies an attention mechanism (Attention) that assigns greater weight to specific words. The output of the attention layer is passed to a CNN, where convolution, pooling, and a fully connected layer produce the final classification result. In experiments on the Trec06c email dataset, the model outperforms the other models it is compared with and reaches an accuracy of 91.62%.

Key words: BiGRU, attention, CNN, spam recognition
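
The pipeline described in the abstract (embedding, BiGRU, attention, then convolution, pooling, and a fully connected layer) can be sketched as an untrained forward pass in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the class names, layer sizes, additive attention scoring, single kernel width, and global max pooling are all hypothetical choices, since the abstract does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def init(*shape):
    # Small random weights; a trained model would learn these.
    return rng.normal(0.0, 0.1, size=shape)

class GRU:
    """Single-direction GRU; index 0/1/2 = update gate, reset gate, candidate."""
    def __init__(self, d_in, d_hid):
        self.W = init(3, d_hid, d_in)
        self.U = init(3, d_hid, d_hid)
        self.b = np.zeros((3, d_hid))
        self.d_hid = d_hid

    def run(self, xs):
        h = np.zeros(self.d_hid)
        out = []
        for x in xs:
            z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
            r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
            n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
            h = (1.0 - z) * n + z * h
            out.append(h)
        return np.stack(out)  # (T, d_hid)

class BiGRUAttentionCNN:
    # All sizes below are assumed for illustration only.
    def __init__(self, vocab=1000, d_emb=16, d_hid=8,
                 n_filters=4, kernel=3, n_classes=2):
        self.emb = init(vocab, d_emb)
        self.fwd, self.bwd = GRU(d_emb, d_hid), GRU(d_emb, d_hid)
        d = 2 * d_hid
        self.Wa, self.ba, self.va = init(d, d), np.zeros(d), init(d)
        self.conv, self.conv_b = init(n_filters, kernel, d), np.zeros(n_filters)
        self.fc, self.fc_b = init(n_classes, n_filters), np.zeros(n_classes)
        self.kernel = kernel

    def forward(self, token_ids):
        x = self.emb[token_ids]                                # (T, d_emb)
        # Run one GRU left-to-right and one right-to-left, then concatenate.
        h = np.concatenate(
            [self.fwd.run(x), self.bwd.run(x[::-1])[::-1]], axis=1)  # (T, 2*d_hid)
        # Additive attention: one weight per time step (per word).
        scores = np.tanh(h @ self.Wa.T + self.ba) @ self.va    # (T,)
        alpha = softmax(scores)
        weighted = h * alpha[:, None]                          # (T, 2*d_hid)
        # 1D convolution over time, ReLU, then global max pooling.
        T = len(token_ids)
        windows = np.stack([weighted[t:t + self.kernel]
                            for t in range(T - self.kernel + 1)])
        feat = np.maximum(
            0.0, np.einsum("wkd,fkd->wf", windows, self.conv) + self.conv_b)
        pooled = feat.max(axis=0)                              # (n_filters,)
        # Fully connected layer and softmax over the classes (e.g. ham, spam).
        probs = softmax(self.fc @ pooled + self.fc_b)
        return probs, alpha

model = BiGRUAttentionCNN()
token_ids = [12, 7, 256, 3, 99, 41]  # hypothetical word indices for one email
probs, alpha = model.forward(token_ids)
print(probs)  # weights are untrained, so the values carry no meaning
```

In this sketch the attention layer rescales each BiGRU output rather than collapsing the sequence to one vector, so that the CNN still has a time axis to convolve over. That choice is an interpretation of "the attention layer is input to the CNN", not something the abstract confirms.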

ZHAO Yu-xuan, HU Huai-xiang. Spam Recognition Method Based on BiGRU-Attention-CNN Model[J]. Computer and Modernization, 2021, 0(04): 122-126.
