A Large Margin Nearest Neighbor Algorithm of Large-scale Text Classification

Computer and Modernization ›› 2016, Vol. 0 ›› Issue (6): 68-72. doi: 10.3969/j.issn.1006-2475.2016.06.015

1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China;
2. Guangxi Communication Planning and Design Co. Ltd., Nanning 530007, China

Received: 2015-12-28 Online: 2016-06-16 Published: 2016-06-17
About the authors: ZHU Qian (born 1987), female, from Baoji, Shaanxi, master's student at the School of Computer, Electronics and Information, Guangxi University, research interest: data mining; QIN Hua (born 1972), male, professor, Ph.D., research interests: e-commerce data mining and optimization techniques; CHEN Chen (born 1989), female, master's student, research interest: data mining.
Funding:
National Natural Science Foundation of China (61363027); Humanities and Social Sciences Research Planning Fund of the Ministry of Education (11YJAZH080)

Abstract

Abstract: The large margin nearest neighbor (LMNN) algorithm has strong learning and generalization ability and is widely used in classification. When it is applied to large-scale text classification, however, the semidefinite programming (SDP) problem inside LMNN grows rapidly with the size of the data and becomes difficult to solve. To address this, we introduce the Huber loss function to decompose the semidefinite optimization model of LMNN into two lower-order continuous optimization sub-models, which reduces the computational complexity of the algorithm and improves its efficiency. Experiments on a public opinion classification data set show that, compared with the traditional LMNN algorithm, the proposed algorithm improves precision by 4.5% and reduces classification time by 47.1%. Improving LMNN by decomposition and order reduction is therefore feasible, and the resulting algorithm is better suited to large-scale text classification.


Key words: semidefinite programming, large margin nearest neighbor, Huber loss function, large-scale text classification, generalization
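
For readers unfamiliar with the Huber loss named in the abstract, the following is a minimal sketch of its standard textbook form: quadratic near zero and linear in the tails, so it is continuously differentiable everywhere. The threshold `delta`, the function names, and the choice of Python are assumptions made for illustration; this page does not state how the authors parameterize the loss or how they plug it into the LMNN objective.

```python
def huber_loss(r: float, delta: float = 1.0) -> float:
    """Standard Huber loss of a residual r (delta is an assumed threshold)."""
    a = abs(r)
    if a <= delta:
        # Quadratic region: behaves like squared error for small residuals.
        return 0.5 * r * r
    # Linear region: grows like absolute error for large residuals.
    return delta * (a - 0.5 * delta)


def huber_grad(r: float, delta: float = 1.0) -> float:
    """Derivative of the Huber loss: r clipped to the interval [-delta, delta]."""
    return max(-delta, min(delta, r))


if __name__ == "__main__":
    for r in (-3.0, -0.2, 0.5, 1.0, 3.0):
        print(r, huber_loss(r), huber_grad(r))
```

Because the derivative is bounded and continuous, a loss of this shape can be minimized with gradient-based continuous optimizers, which is consistent with the abstract's description of continuous optimization sub-models.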


CLC Number: TP311.13

朱茜1,覃华1,冯志新2,陈晨1. 一种大规模文本分类大间隔近邻算法[J]. 计算机与现代化, 2016, 0(6): 68-72.

ZHU Qian1, QIN Hua1, FENG Zhi-xin2, CHEN Chen1. A Large Margin Nearest Neighbor Algorithm of Large-scale Text Classification[J]. Computer and Modernization, 2016, 0(6): 68-72.

