计算机与现代化 ›› 2016, Vol. 0 ›› Issue (6): 68-72.doi: 10.3969/j.issn.1006-2475.2016.06.015

• 算法设计与分析 • 上一篇    下一篇

一种大规模文本分类大间隔近邻算法

  

  1. 1.广西大学计算机与电子信息学院,广西南宁530004;2.广西通信规划设计咨询有限公司,广西南宁530007
  • 收稿日期:2015-12-28 出版日期:2016-06-16 发布日期:2016-06-17
  • 作者简介:朱茜(1987-),女,陕西宝鸡人,广西大学计算机与电子信息学院硕士研究生,研究方向:数据挖掘; 覃华(1972-),男, 教授,博士,研究方向:电子商务数据挖掘与最优化技术; 陈晨(1989-),女,硕士研究生,研究方向:数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(61363027); 教育部人文社会科学研究规划基金资助项目(11YJAZH080)

A Large Margin Nearest Neighbor Algorithm of Large-scale Text Classification

  1. 1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China;
    2. Guangxi Communication Planning and Design Co. Ltd.〖KG-*4〗, Nanning 530007, China
  • Received:2015-12-28 Online:2016-06-16 Published:2016-06-17

摘要: 大间隔近邻算法(Large Margin Nearest Neighbor,LMNN)具有较强学习能力和泛化能力,在分类领域有广泛的应用。但将其用于大规模文本分类问题时,LMNN算法中的半定规划问题规模会随着数据规模增大而急剧膨胀,导致求解困难。针对此问题,引入胡贝尔损失函数把LMNN算法的半定优化模型分解为2个低阶的连续优化子模型,降低算法的计算复杂度,提高计算效率。在舆情分类数据集上的实验结果表明,本文算法与传统大间隔近邻算法相比,精度提高了4.5%,分类时间节省了47.1%,故采用分解降阶法来改进LMNN算法的性能是可行的,更适用于大规模文本分类。

关键词: 半定规划, 大间隔近邻, 胡贝尔损失函数, 大规模文本分类, 泛化能力

Abstract: The large margin nearest neighbor algorithm has strong learning ability and generalization ability, which is widely used in the field of classification. But it will sink into difficulties when the semidefinite programming(SDP) scale of the LMNN algorithm expands rapidly as the data increasing used to solve the large-scale text classification problem. To solve this problem, we introduced the Huber loss function, which divided the Semidefinite Optimization Model of LMNN algorithm into two low-level continuous optimization sub-models, and finally reduced the computation complexity of the algorithm and improved its efficiency. The experimental results on the classification data set of public opinion show that the precision of the proposed algorithm was improved 4.5%, and the classification time saved 47.1% compared with the traditional one. It also can prove that adopting the low-level decomposition reduction method to improve the performance of the LMNN algorithm is feasible and more suitable for large-scale text classification.

Key words: semidefinite programming, large margin nearest neighbor, Huber loss function, large-scale text classification, generalization

中图分类号: