Computer and Modernization (计算机与现代化) ›› 2020, Vol. 0 ›› Issue (10): 17-22.

• Artificial Intelligence •

Spam E-mail Recognition Based on Cluster Analysis Algorithm

  1. (School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, Heilongjiang, China)
  • Online: 2020-10-14 Published: 2020-10-14
  • About the author: GAI Xuan (盖璇, born 1990), female, from Haiyang, Shandong; experimentalist, master's degree; research interest: artificial intelligence. E-mail: nepu_gx@163.com.
  • Funding:
    Northeast Petroleum University Guiding Innovation Fund Project (ky121728)

Abstract: Existing spam recognition methods struggle to identify the key terms in an e-mail accurately, because the vocabulary they must segment now changes quickly and comes in many varieties; their practical applicability therefore needs to improve. To this end, a spam recognition method based on a cluster analysis algorithm is proposed. First, the e-mail samples are preprocessed: the text is segmented into key terms, stop words are removed, and each term is weighted according to how often it occurs in the e-mail text. Then, the e-mail feature attributes are combined to construct an e-mail feature space in which the e-mail features are quantified. Finally, the e-mail features are extracted, reduced in dimensionality, and used as the input of the clustering algorithm, whose iterative computation produces the output that completes spam recognition. The experimental results show that the proposed method is more accurate in keyword extraction and word segmentation and can accurately identify spam e-mails, which indicates that its practical applicability has improved.

Key words: clustering algorithm, spam, word segmentation, text clustering
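
The pipeline outlined in the abstract (segmentation, stop-word removal, frequency-based weighting, feature quantification, dimensionality reduction, iterative clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every concrete choice in it is an assumption: whitespace tokenization stands in for real word segmentation, the stop-word list and sample messages are toy data, keeping the terms with the highest document frequency stands in for the unspecified dimensionality reduction, and k-means with two clusters stands in for the clustering algorithm, which the abstract does not name.

```python
from collections import Counter
import math

# Assumed toy stop-word list; a real system would use a full list.
STOP_WORDS = {"the", "a", "to", "your", "at", "is", "for", "and", "you", "now"}

def tokenize(text):
    """Whitespace tokenization plus stop-word removal (stand-in for word segmentation)."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]

def tf_weights(tokens):
    """Weight each term by its frequency within one message."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {term: n / total for term, n in counts.items()}

def build_vectors(docs, max_terms):
    """Quantify messages as term-frequency vectors over a reduced vocabulary.

    The vocabulary keeps only the max_terms terms that occur in the most
    messages, a simple stand-in for dimensionality reduction.
    """
    weights = [tf_weights(tokenize(d)) for d in docs]
    doc_freq = Counter(term for w in weights for term in w)
    vocab = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))[:max_terms]
    return vocab, [[w.get(t, 0.0) for t in vocab] for w in weights]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k=2, iterations=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: distance(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical sample messages, for illustration only.
docs = [
    "win free prize money now",
    "free money win cash prize",
    "meeting agenda for project review",
    "project meeting review notes agenda",
]
vocab, vectors = build_vectors(docs, max_terms=8)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy data the two promotional messages fall into one cluster and the two work messages into the other. Clustering alone does not say which cluster is spam; that mapping needs a separate step (for example, a few labeled samples), which the abstract does not describe.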