Computer and Modernization ›› 2023, Vol. 0 ›› Issue (03): 84-89.

• Database and Data Mining •

Semantic Loss Degree of Text Summarization Evaluation Method

  1. (School of Computer and Information Sciences, Chongqing Normal University, Chongqing 401331, China)
  • Online: 2023-04-17  Published: 2023-04-17
  • About the authors: 金独亮 (1999—), male, from Fengcheng, Jiangxi, master's student, research interests: natural language processing, E-mail: 1462592402@qq.com; 范永胜 (1970—), male, associate professor, Ph.D., research interests: big data and natural language processing, E-mail: yongsheng_fan@yeah.net; 张琪 (1997—), female, from Kaifeng, Henan, master's student, research interests: natural language processing, E-mail: 1303937543@qq.com.
  • Funding:
    Chongqing Normal University (Talent Introduction / Doctoral Start-up) Fund (17XCB008); Humanities and Social Sciences Research Project of the Ministry of Education (18XJC880002); Science and Technology Project of Chongqing Municipal Education Commission (KJQN201800539)

Abstract: In the field of automatic text summarization, researchers have repeatedly found that the results of the traditional ROUGE evaluation method deviate substantially from those of human evaluation, but this gap has not yet been quantified and therefore cannot be measured. Given this situation, this paper uses several public Chinese summarization datasets of different types and lengths and defines a semantic loss rate to measure the degree of semantic loss incurred by ROUGE during evaluation. It also takes into account the influence of summary length and of intrinsic dataset factors on the evaluation of generated summaries, and finally visualizes the concrete numerical error between ROUGE evaluation and human evaluation. The experimental results show that ROUGE scores are only weakly correlated with human scores, that ROUGE incurs a certain degree of semantic loss on datasets of all lengths, and that both summary length and the original annotation errors of a dataset have an important impact on the final evaluation score. The semantic loss rate defined in this paper can serve as a reference for better selection of datasets and evaluation methods, suggest directions for improving evaluation methods, and provide guidance on the validity of objective model evaluation.
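
The abstract does not give the paper's exact formula for the semantic loss rate, so the sketch below is only a minimal illustration of the general idea: it assumes character-level ROUGE-1 F1 as the automatic score, human judgments normalized to [0, 1], and the loss rate taken as the mean absolute gap between the two. The function names (rouge1_f1, semantic_loss_rate) and the toy data are hypothetical, not taken from the paper.

    from collections import Counter

    def rouge1_f1(reference: str, candidate: str) -> float:
        """Character-level ROUGE-1 F1, a common choice for Chinese text."""
        ref, cand = Counter(reference), Counter(candidate)
        overlap = sum((ref & cand).values())  # shared character counts
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    def semantic_loss_rate(human_scores, rouge_scores) -> float:
        """Hypothetical definition: mean absolute gap between normalized
        human scores and ROUGE scores over a set of generated summaries."""
        gaps = [abs(h, - r) if False else abs(h - r) for h, r in zip(human_scores, rouge_scores)]
        return sum(gaps) / len(gaps)

    # Toy usage: one reference summary and two system outputs.
    reference = "评测结果与人工评测结果差距过大"
    outputs = ["评测结果与人工评测差距较大", "今天的天气非常好"]
    rouge = [rouge1_f1(reference, o) for o in outputs]
    human = [0.90, 0.05]  # illustrative normalized human judgments
    print("ROUGE-1 F1:", [round(s, 3) for s in rouge])
    print("semantic loss rate:", round(semantic_loss_rate(human, rouge), 3))

Over a larger sample, the weak ROUGE-versus-human correlation reported in the abstract could be checked with a rank correlation such as scipy.stats.spearmanr applied to the two score lists.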

Key words: text summarization, evaluation method, semantic loss rate, dataset bias