Research Progress of Text Summarization Model

Abstract

Abstract: As the volume of text data generated on the Internet keeps growing, the problem of text information overload is becoming increasingly serious, and it is necessary to "reduce the dimension" of all kinds of text. Text summarization is one important means of doing so, and it is also one of the hot and difficult topics in artificial intelligence research. Text summarization aims to transform a text or a collection of texts into a short summary that contains the key information. In recent years, language model pre-training has advanced the state of the art in many natural language processing tasks, including sentiment analysis, question answering, natural language inference, named entity recognition, text similarity, and text summarization. This paper reviews the classic methods of text summarization and the pre-training-based methods of recent years, organizes the datasets and evaluation methods used for text summarization, and finally summarizes the challenges the field currently faces and its development trends.

Key words: datasets, text summarization, pre-trained models

ZHANG Zi-yun, WANG Wen-fa, MA Le-rong, DING Cang-feng. Research Progress of Text Summarization Model[J]. Computer and Modernization, 2022, 0(06): 56-66.

