Computer and Modernization (计算机与现代化)

• Database and Data Mining

Quality Analysis of Novels Based on Stylometry (基于计量风格学的小说质量分析)

  

  1. (School of Information Management, Jiangxi University of Finance and Economics, Nanchang 330013, Jiangxi, China)
  • Received: 2018-11-16 Online: 2019-05-14 Published: 2019-05-14
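  • About the authors: Li Yanli (李艳丽, b. 1996), female, from Xiangyang, Hubei, undergraduate student, research interest: data analysis, E-mail: liyl@jxufe.edu.cn; Li Wanrong (李宛蓉, b. 1995), female, undergraduate student, research interest: data mining; Liao Xin (廖欣, b. 1996), female, undergraduate student, research interest: data mining; Li Jingjuan (李静娟, b. 1995), female, undergraduate student, research interest: data mining; Tang Lu (汤露, b. 1995), female, undergraduate student, research interest: data mining; corresponding author: Liu Xiping (刘喜平, b. 1981), male, associate professor, Ph.D., research interests: data mining and data analysis, E-mail: liuxiping@jxufe.edu.cn.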
  • Supported by:
    National Natural Science Foundation of China (61462037)

Stylometry-based Analysis of Literature Texts

  1. (School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330013, China)
  • Received:2018-11-16 Online:2019-05-14 Published:2019-05-14

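Abstract (Chinese version): This paper presents a comparative study of novel texts from the perspective of stylometry. Current research on novel texts is mostly qualitative, with few quantitative studies, and mostly subjective, with little objective empirical analysis. We collect 225 novels, covering both web novels and classic novels, and divide them into three collections corresponding to “excellent”, “good”, and “poor” works. For each work, features covering length, parts of speech, rhythm, vocabulary size, and other aspects are extracted. Based on these features, classification models such as decision trees, neural networks, and Bayesian classifiers are built in order to identify the key differences among the three collections. The study finds that the three collections differ fairly clearly in their stylometric statistical features, and that different features have different discriminative power for different collections.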

Keywords (Chinese version): stylometry, text analysis, novel text
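The feature-extraction step named in the abstract (length, part of speech, rhythm, vocabulary) could look roughly like the sketch below. It is not the authors' implementation. It assumes the text has already been segmented and POS-tagged by some external tool, and the feature names and the choice of sentence-length statistics as a rhythm proxy are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def stylometric_features(sentences):
    """Compute simple stylometric features for one work.

    `sentences` is a list of sentences, each a list of (word, pos_tag)
    pairs. Producing this input (segmentation and tagging) is assumed
    to happen elsewhere.
    """
    tokens = [pair for sent in sentences for pair in sent]
    if not tokens:
        raise ValueError("empty text")
    words = [word for word, _ in tokens]
    pos_counts = Counter(tag for _, tag in tokens)
    sent_lengths = [len(sent) for sent in sentences if sent]
    n = len(tokens)
    features = {
        # Length: total number of tokens in the work.
        "length_tokens": n,
        # Rhythm proxy (assumption): sentence length and its spread.
        "mean_sentence_length": mean(sent_lengths),
        "sentence_length_stdev": pstdev(sent_lengths),
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / n,
    }
    # Part of speech: share of each tag among all tokens.
    for tag, count in pos_counts.items():
        features[f"pos_ratio_{tag}"] = count / n
    return features


# Made-up, pre-tagged sample used only to exercise the function.
sample = [
    [("the", "DET"), ("old", "ADJ"), ("man", "NOUN"), ("slept", "VERB")],
    [("the", "DET"), ("sea", "NOUN"), ("rose", "VERB")],
    [("he", "PRON"), ("dreamed", "VERB"), ("of", "ADP"),
     ("lions", "NOUN"), ("again", "ADV")],
]
features = stylometric_features(sample)
```

Each work then becomes one feature dictionary, which can be flattened into a numeric vector for the classifiers. The type-token ratio depends on text length, so comparing works of very different sizes would likely need a length-normalized variant.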

Abstract: This study compares literary works from the perspective of stylometry. Existing research on literature is mainly qualitative and subjective; quantitative and empirical studies are rare. A total of 225 literary works are collected, including Internet literary works and classical literary works, and divided into three subsets corresponding to “excellent”, “good”, and “poor” works. For each work, features covering length, part of speech, rhythm, vocabulary, and so on are extracted. Based on these features, classifiers such as decision trees, neural networks, and Bayesian models are constructed and used to find the key differences among the three datasets. The study finds that the three datasets differ noticeably in their stylometric statistics, and that for different pairs of datasets, the features differ in discriminative power.

Key words: stylometry, text analysis, literature text
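As a rough illustration of the classification step, the sketch below implements a small Gaussian naive Bayes classifier, one member of the Bayesian family the abstract mentions. It is a sketch under stated assumptions, not the authors' model: the two features (mean sentence length and type-token ratio) and every number in the toy data are invented for demonstration only.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes over numeric feature vectors."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors, self.means, self.variances = {}, {}, {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)
            columns = list(zip(*rows))
            self.means[label] = [sum(c) / len(c) for c in columns]
            # A small constant keeps the variance strictly positive.
            self.variances[label] = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                for c, m in zip(columns, self.means[label])
            ]
        return self

    def _log_posterior(self, row, label):
        # Log prior plus the sum of per-feature Gaussian log densities.
        total = math.log(self.priors[label])
        for v, m, var in zip(row, self.means[label], self.variances[label]):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total

    def predict(self, row):
        return max(self.priors, key=lambda label: self._log_posterior(row, label))


# Invented toy data: [mean_sentence_length, type_token_ratio] per work.
X = [
    [22.0, 0.62], [24.0, 0.60], [23.0, 0.65],  # labeled "excellent"
    [17.0, 0.50], [18.0, 0.52], [16.0, 0.48],  # labeled "good"
    [11.0, 0.35], [12.0, 0.38], [10.0, 0.33],  # labeled "poor"
]
y = ["excellent"] * 3 + ["good"] * 3 + ["poor"] * 3

model = GaussianNaiveBayes().fit(X, y)
predictions = [model.predict(row) for row in ([23.5, 0.61], [17.5, 0.50], [10.5, 0.36])]
```

In practice a library such as scikit-learn (`DecisionTreeClassifier`, `MLPClassifier`, `GaussianNB`) would be a more robust choice. A fitted decision tree also exposes which features drive its splits, which is one way to examine the discriminative power of individual features.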
