Computer and Modernization ›› 2021, Vol. 0 ›› Issue (02): 45-50.

• Database and Data Mining •

Application of Multi-label Classification Based on Digital Content Preference

  1. (School of Electronic Information and Artificial Intelligence, Shaanxi University of Science & Technology, Xi’an 710021, China)
  • Online: 2021-03-01 Published: 2021-03-01
  • About the authors: LIU Bin (born 1972), male, from Xianyang, Shaanxi, associate professor, master's degree; research interests: big data analysis and data mining; E-mail: Liubin@sust.edu.cn. LI Xiao (born 1996), female, from Xianyang, Shaanxi, master's student; research interests: data mining and big data analysis; E-mail: 254428765@qq.com.
  • Supported by:
    National Natural Science Foundation of China (61871260)

Abstract: Current research on digital content in the telecom industry derives user-preference insight mainly from business-defined criteria and relies largely on business experience, which hinders growth of the digital content user base. This paper therefore uses the historical data of high-data-usage customers and applies multi-label classification algorithms to study digital content preference, identifies potential target customers for each content category, and recommends preferred content to them through marketing, improving precision marketing. First, de-identified data on the basic and consumption attributes of users of telecom company M serve as the data source; lists of users active in video, music, and reading over the most recent three months are obtained, and each user's activity dimensions are labeled manually to form the initial dataset. Because positive and negative samples are imbalanced, repeated random down-sampling is used to draw three sample sets, and six algorithms, including CC, ML-KNN, and RakelD, are compared experimentally. The results show that the RakelD and ML-KNN multi-label classification algorithms predict digital content user preference comparatively well. ML-KNN is therefore adopted as the base classifier of RakelD, giving the RakelD_MLKNN method, which is applied to datasets with different positive-to-negative sample ratios; in every case it outperforms the six existing multi-label classification algorithms and the traditional experience-based selection method.
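The repeated random down-sampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define what counts as a negative sample or which ratios were used, so the code below assumes a user is "negative" when none of the three labels (video, music, reading) is active; the function names, the `ratio` parameter, and the toy data are all hypothetical.

```python
import random

def downsample_once(samples, ratio, rng):
    """Keep every positive sample and draw ratio * (#positives) negatives
    without replacement. Assumption: 'negative' means no active label."""
    positives = [s for s in samples if any(s["labels"])]
    negatives = [s for s in samples if not any(s["labels"])]
    n_neg = min(len(negatives), int(ratio * len(positives)))
    subset = positives + rng.sample(negatives, n_neg)
    rng.shuffle(subset)
    return subset

def repeated_downsample(samples, ratio=1.0, n_sets=3, seed=0):
    """Draw n_sets independent down-sampled datasets (three in the abstract)."""
    rng = random.Random(seed)
    return [downsample_once(samples, ratio, rng) for _ in range(n_sets)]

# Toy data: labels are (video, music, reading); 10 positives, 90 negatives.
toy = [{"id": i, "labels": (1, 0, 1)} for i in range(10)]
toy += [{"id": 100 + i, "labels": (0, 0, 0)} for i in range(90)]

sample_sets = repeated_downsample(toy, ratio=1.0, n_sets=3, seed=42)
```

Changing `ratio` gives datasets with different positive-to-negative proportions, which is one way the comparison across sample ratios described above could be set up.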

Key words: digital content preference, multi-label classification, Classifier Chains (CC) algorithm, Multi-Label K-Nearest Neighbor (ML-KNN) algorithm, Random k labelsets Disjoint (RakelD) algorithm
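As a rough illustration of what "Random k labelsets Disjoint" means, the sketch below randomly partitions the label set into disjoint groups of at most k labels and encodes each group's label combination as a single multiclass target, on which one base classifier per group would then be trained. This is a simplified sketch, not the authors' implementation: the helper names are hypothetical, the three label names are taken from the abstract, and how ML-KNN is plugged in as the base classifier is not described in this text and is not shown.

```python
import random

def disjoint_labelsets(labels, k, seed=0):
    """Shuffle the labels and cut them into disjoint groups of size k
    (the last group may be smaller)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]

def powerset_target(label_row, labelset):
    """Encode the label values inside one group as a single class (a tuple)."""
    return tuple(label_row[name] for name in labelset)

labels = ["video", "music", "reading"]
groups = disjoint_labelsets(labels, k=2, seed=1)

# One user's labels, and the per-group targets a base classifier would learn.
user = {"video": 1, "music": 0, "reading": 1}
targets = [powerset_target(user, g) for g in groups]
```

Each label appears in exactly one group, so the per-group predictions can be concatenated back into a full label vector without any voting step.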