S2R2: Semi-supervised Feature Selection Based on Analysis of Relevance and Redundancy

Computer and Modernization ›› 2021, Vol. 0 ›› Issue (09): 113-120.

(1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;
2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093, China;
3. College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)

Online: 2021-09-14 Published: 2021-09-14
About the authors: ZHANG Dong-fang (born 1996), male, from Tai'an, Shandong, master's student, research interest: feature selection, E-mail: davion_zhang@foxmail.com; Corresponding author: CHEN Hai-yan (born 1979), female, lecturer, Ph.D., research interests: machine learning, data mining, E-mail: chenhaiyan@nuaa.edu.cn; YUAN Li-gang (born 1980), male, deputy director of laboratory, lecturer, Ph.D., research interests: traffic information engineering and control, E-mail: kelfen_yuan@126.com.
Supported by:
National Natural Science Foundation of China (61501229); Fundamental Research Funds for the Central Universities (NS2019054, NS2020045)

Abstract

Feature selection is one of the key problems in pattern recognition and data mining; it removes redundant and irrelevant features from a dataset to improve learning performance. Based on the max-relevance and min-redundancy criterion, a novel semi-supervised feature selection method based on relevance and redundancy analysis (S2R2) is proposed. S2R2 is independent of any classification learning algorithm. First, the unsupervised relevance measure is analyzed and extended. It is then combined with information gain to form semi-supervised measures of feature relevance and redundancy, which can effectively identify and remove irrelevant and redundant features. Finally, an incremental search is used to build the feature subset greedily, which avoids searching an exponentially large solution space and improves the running efficiency of the algorithm. This article also proposes FS2R2, a fast filter version of S2R2, to better handle large-scale feature selection problems. Experimental results on multiple standard datasets demonstrate the effectiveness and superiority of the proposed methods.

Key words: semi-supervised learning, feature selection, information theory, max-relevance and min-redundancy
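
The greedy incremental search mentioned in the abstract can be illustrated with a minimal sketch. It is not the S2R2 algorithm itself, whose semi-supervised relevance and redundancy measures are not spelled out in this abstract. It only shows the underlying max-relevance and min-redundancy idea in a fully supervised setting, using plain mutual information (information gain) on discrete features. The function names `entropy`, `mutual_information`, and `greedy_mrmr`, as well as the toy data, are assumptions introduced for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def greedy_mrmr(features, labels, k):
    """Select k feature names by incremental forward search.

    features: dict mapping feature name -> list of discrete values
    labels:   list of class labels, same length as every feature column
    Each step adds the feature with the highest (relevance - mean redundancy),
    so only O(k * d) candidates are scored instead of all 2^d subsets.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: the label is fully determined by x1 and x2 together;
# x1_copy duplicates x1 (redundant) and noise is independent of the label.
toy_features = {
    "x1":      [0, 0, 0, 0, 1, 1, 1, 1],
    "x1_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "x2":      [0, 0, 1, 1, 0, 0, 1, 1],
    "noise":   [0, 1, 0, 1, 0, 1, 0, 1],
}
toy_labels = [0, 0, 1, 1, 2, 2, 3, 3]

print(greedy_mrmr(toy_features, toy_labels, 2))  # ['x1', 'x2']
```

In this toy case `x1`, `x1_copy`, and `x2` are equally relevant to the label, but once `x1` is chosen the redundancy term penalizes `x1_copy`, so the search picks `x2` next. A semi-supervised variant would additionally need a relevance measure that can be evaluated on unlabeled samples, which this sketch does not attempt.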


ZHANG Dong-fang, CHEN Hai-yan, YUAN Li-gang. S2R2: Semi-supervised Feature Selection Based on Analysis of Relevance and Redundancy[J]. Computer and Modernization, 2021, 0(09): 113-120.

