Poverty-returning Prediction Based on Ensemble Learning and Unbalanced Data

Computer and Modernization, 2022, Vol. 0, Issue (04): 12-16.

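(School of Mathematics and Computational Science, Wuyi University, Jiangmen 529020, China)

Online: 2022-05-07 Published: 2022-05-07
About the authors: GONG Yun-xiang (born 2000), male, Hui ethnicity, from Luohe, Henan; undergraduate student; research interest: data mining; E-mail: gyx18239583806@163.com. Corresponding author: YUAN Shi-fang (born 1972), male, from Dongkou, Hunan; professor, Ph.D.; research interests: data processing and numerical analysis; E-mail: yuanshifang305@163.com. LIU Fu-qian (born 2000), male, from Dongguan, Guangdong; undergraduate student; research interest: data mining; E-mail: 1399766572@qq.com.
Supported by:
Characteristic Innovation Project of Regular Universities in Guangdong Province (2019KQNCX156); Wuyi University Hong Kong-Macao Joint Research and Development Fund (2019WGALH20); 2020 National College Students' Innovation and Entrepreneurship Training Program (202011349016); 2018 Wuyi University Teaching Quality Engineering and Teaching Reform Project (JX2018024)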

Abstract

Abstract: Although China has achieved decisive results in poverty alleviation, some people who have been lifted out of poverty remain at risk of returning to it. Working from an unbalanced data set, this paper applies SMOTE to oversample the returned-to-poverty class, so that after processing the ratio of returned-to-poverty samples to non-returned samples is 3:1. A poverty-returning prediction model is then built with Stacking ensemble learning: grid search is used to tune the hyperparameters of each model, and 10-fold cross-validation is combined with it to improve generalization. Four different fusion models are used to predict whether households lifted out of poverty will return to poverty. The experimental results show that the fusion models classify better than the individual classifiers; the best fusion model attains an accuracy (Acc) of 0.962 and an F1-score of 0.946.

Key words: prediction of poverty-returning, SMOTE, ensemble learning, fusion model
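
The SMOTE oversampling step described in the abstract can be sketched as follows. This is a minimal, standard-library-only illustration of the general SMOTE idea (a synthetic sample is placed on the segment between a minority sample and one of its k nearest minority neighbours); it is not the authors' implementation. The function names `smote` and `oversample_to_ratio`, the toy feature vectors, the value of k, and the reading of the 3:1 ratio as returned : not returned are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def oversample_to_ratio(minority, majority, ratio=3.0, k=5, seed=0):
    """Oversample `minority` until len(minority) / len(majority) >= ratio."""
    target = math.ceil(ratio * len(majority))
    n_new = max(0, target - len(minority))
    return list(minority) + smote(minority, n_new, k=k, seed=seed)


# Hypothetical toy data (two features per household), not from the paper:
# 4 households that returned to poverty, 6 that did not.
returned = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
not_returned = [(5.0, 5.0), (5.5, 4.5), (6.0, 5.2),
                (4.8, 5.9), (5.1, 4.7), (6.2, 5.5)]

balanced = oversample_to_ratio(returned, not_returned, ratio=3.0, k=3)
print(len(balanced), len(not_returned))  # 18 6
```

When oversampling is combined with cross-validation, as in the 10-fold setup the abstract mentions, a common precaution is to resample only the training folds so that synthetic samples do not leak into the validation data; whether the paper does so is not stated in the abstract.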

GONG Yun-xiang, YUAN Shi-fang, LIU Fu-qian. Poverty-returning Prediction Based on Ensemble Learning and Unbalanced Data[J]. Computer and Modernization, 2022, 0(04): 12-16.

