Computer and Modernization ›› 2021, Vol. 0 ›› Issue (09): 113-120.

• Database and Data Mining •

S2R2: Semi-supervised Feature Selection Based on Relevance and Redundancy Analysis

  

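  1. (1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 211106; 2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, Jiangsu 210093;
    3. College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 211106)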
  • Published: 2021-09-14 Online: 2021-09-14
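  • About the authors: Zhang Dongfang (张东方, b. 1996), male, from Tai'an, Shandong, master's student, research interest: feature selection, E-mail: davion_zhang@foxmail.com; corresponding author: Chen Haiyan (陈海燕, b. 1979), female, lecturer, Ph.D., research interests: machine learning and data mining, E-mail: chenhaiyan@nuaa.edu.cn; Yuan Ligang (袁立罡, b. 1980), male, deputy director of laboratory, lecturer, Ph.D., research interests: traffic information engineering and control, E-mail: kelfen_yuan@126.com.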
  • Funding:
    National Natural Science Foundation of China (61501229); Fundamental Research Funds for the Central Universities (NS2019054, NS2020045)

S2R2: Semi-supervised Feature Selection Based on Analysis of Relevance and Redundancy

  1. (1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;
    2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China;
    3. College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
  • Online:2021-09-14 Published:2021-09-14

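Abstract: Feature selection is one of the key problems in pattern recognition and data mining; it removes redundant and irrelevant features from a dataset to improve learning performance. Based on the max-relevance and min-redundancy criterion, this paper proposes a new semi-supervised feature selection method built on relevance and redundancy analysis (S2R2), which is independent of any classification learning algorithm. The method first analyzes and extends an unsupervised relevance information measure, then combines it with information gain to design a semi-supervised measure of feature relevance and redundancy that can effectively identify and remove irrelevant and redundant features. Finally, an incremental search technique greedily builds the feature subset, which avoids searching an exponentially large solution space and improves the running efficiency of the algorithm. The paper also proposes a fast filter version of S2R2, named FS2R2, to better handle large-scale feature selection problems. Experimental results on several standard datasets demonstrate the effectiveness and superiority of the proposed methods.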

Keywords: semi-supervised learning, feature selection, information theory, maximum relevance and minimum redundancy

Abstract: Feature selection is one of the key problems in pattern recognition and data mining; it removes redundant and irrelevant features from a dataset to improve learning performance. Based on the max-relevance and min-redundancy criterion, a novel semi-supervised feature selection method based on relevance and redundancy analysis (S2R2) is proposed. The method is independent of any classification learning algorithm. First, an unsupervised relevance measure is analyzed and extended. It is then combined with information gain to form semi-supervised measures of feature relevance and redundancy, which can effectively identify and remove irrelevant and redundant features. Finally, an incremental forward search constructs the feature subset greedily, which avoids searching an exponentially large solution space and improves the efficiency of the algorithm. This article also proposes FS2R2, a fast version of S2R2, to deal with large-scale problems. Experimental results on standard datasets illustrate the effectiveness and superiority of the proposed approaches.

Key words: semi-supervised learning, feature selection, information theory, max-relevance and min-redundancy
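
The incremental greedy search under the max-relevance and min-redundancy criterion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the semi-supervised measure used by S2R2, so the code below is an assumption-laden stand-in: it uses plain empirical mutual information on fully labeled, discrete data for both relevance and redundancy, as in the classic supervised criterion. The names `mutual_information` and `greedy_mrmr` and the toy data are hypothetical and not taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_mrmr(features, labels, k):
    """Select k feature names by incremental max-relevance, min-redundancy search.

    features: dict mapping feature name -> list of discrete values
    labels:   list of class labels, same length as every feature column
    NOTE: supervised stand-in measure, not the S2R2 semi-supervised measure.
    """
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:

        def score(name):
            # Relevance to the label minus mean redundancy with chosen features.
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the label is determined by bits a and b; a_copy duplicates a.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
toy = {"a": a, "a_copy": list(a), "b": b, "noise": [0, 0, 0, 0, 1, 1, 1, 1]}
y = [2 * ai + bi for ai, bi in zip(a, b)]
print(greedy_mrmr(toy, y, 2))  # ['a', 'b']
```

Each step evaluates only the remaining candidates against the features already chosen, so selecting k of d features costs on the order of k·d score evaluations rather than a search over all 2^d subsets. In the toy run, `a_copy` is as relevant as `a` but is skipped in the second step because it is fully redundant with it.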