Research on Generating News Text Summarization Based on Improved T5 PEGASUS Model

doi:10.16180/j.cnki.issn1007-7820.2023.12.010

Abstract

News text summarization aims to reduce the time cost and reading fatigue that arise when readers cannot quickly grasp the key points of a news article. T5 PEGASUS is currently among the better-performing text summarization models for Chinese, yet it has received little study. This paper improves the Chinese word segmentation step of T5 PEGASUS by adopting the Pkuseg segmentation method, which is better suited to the news domain, and verifies its effectiveness on three public datasets with different news lengths: NLPCC2017, LCSTS, and SogouCS. The results show that Pkuseg is the better-suited segmentation method for T5 PEGASUS; that the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores of the generated summaries are positively correlated with news text length; and that the training loss and its rate of decrease are negatively correlated with news text length. The model also achieves relatively high ROUGE scores when trained on only a small number of samples, which indicates strong few-shot learning ability.

Keywords: text summarization, generative model, T5 PEGASUS, news text, Chinese word segmentation, Pkuseg, few-shot learning, ROUGE
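
The abstract says that the improvement replaces the word segmentation step of T5 PEGASUS with Pkuseg, but it does not show how a segmenter feeds the tokenizer. The following is a minimal sketch of one common arrangement, under the assumption (not confirmed by the source) that segmented words found in the vocabulary are kept whole and all other words fall back to character-level pieces. The names `tokenize_with_segmenter` and `toy_segmenter` and the toy vocabulary are hypothetical and exist only for illustration.

```python
from typing import Callable, Iterable, List, Set


def tokenize_with_segmenter(
    text: str,
    vocab: Set[str],
    segment: Callable[[str], Iterable[str]],
) -> List[str]:
    """Keep segmented words that are in the vocabulary; split the rest into characters."""
    tokens: List[str] = []
    for word in segment(text):
        if word in vocab:
            tokens.append(word)
        else:
            # Assumed fallback: character-level pieces, [UNK] for unknown characters.
            tokens.extend(ch if ch in vocab else "[UNK]" for ch in word)
    return tokens


def toy_segmenter(text: str) -> List[str]:
    """Hypothetical stand-in for a real segmenter: greedy match against a tiny lexicon."""
    lexicon = ["新闻", "摘要", "生成"]
    words: List[str] = []
    i = 0
    while i < len(text):
        # Take the first lexicon word starting at position i, else a single character.
        match = next((w for w in lexicon if text.startswith(w, i)), text[i])
        words.append(match)
        i += len(match)
    return words


# Toy vocabulary: "生成" is deliberately missing so the fallback path is exercised.
vocab = {"新闻", "摘要", "生", "成", "的"}
tokens = tokenize_with_segmenter("新闻摘要的生成", vocab, toy_segmenter)
print(tokens)
```

Because the segmenter is passed in as a callable, a real one such as `pkuseg.pkuseg().cut` or `jieba.lcut` could be substituted for `toy_segmenter` without changing the rest of the function, which is what makes a segmenter comparison straightforward to set up.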

CLC number: TP391.1

ZHANG Qi, FAN Yongsheng. Research on Generating News Text Summarization Based on Improved T5 PEGASUS Model[J]. Electronic Science and Technology, 2023, 36(12): 72-78.

Hardware/Software	Configuration/Version
Operating system	Windows 10
CPU	AMD Ryzen 5 2600X
GPU	NVIDIA GeForce RTX 3070
Development tool	PyCharm
Python	3.7
CUDA	11.0
Torch	1.7.0
Transformers	4.15.0

Parameter	Value
Number of epochs (epoch)	10
Batch size (batch_size)	8
Learning rate (lr)	2×10^-4
Maximum news length (max_len_inputs)	512
Maximum summary length (max_len_outputs)	40
Training set : test set	8:2
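
As an illustration, the settings in the table above could be collected into a single configuration object together with the 8:2 split. This is a sketch only: the class `TrainConfig`, the function `split_dataset`, and the fixed shuffle seed are assumptions introduced here, and the source does not say whether the data were shuffled before splitting.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from the parameter table; field names follow the table where given."""
    epochs: int = 10
    batch_size: int = 8
    lr: float = 2e-4
    max_len_inputs: int = 512   # maximum news length
    max_len_outputs: int = 40   # maximum summary length
    train_ratio: float = 0.8    # training set : test set = 8:2


def split_dataset(samples: Sequence, train_ratio: float, seed: int = 0) -> Tuple[List, List]:
    """Shuffle with a fixed seed (an assumption) and split into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(samples) * train_ratio)
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test


cfg = TrainConfig()
train, test = split_dataset(list(range(100)), cfg.train_ratio)
print(len(train), len(test))
```

Keeping the values in one frozen dataclass makes it harder for a run to drift from the reported settings, and the explicit seed keeps the split reproducible across the three datasets.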

Dataset	Segmentation method	ROUGE-1 P/%	ROUGE-1 R/%	ROUGE-1 F/%	ROUGE-2 P/%	ROUGE-2 R/%	ROUGE-2 F/%	ROUGE-L P/%	ROUGE-L R/%	ROUGE-L F/%
NLPCC2017	Jieba	58.0	56.2	56.4	39.5	38.5	38.5	52.5	50.9	51.0
NLPCC2017	Pkuseg	57.7	56.8	56.5	39.2	39.1	38.6	52.2	51.3	51.0
NLPCC2017	THULAC	58.2	49.0	52.6	39.1	32.3	34.9	52.3	44.0	47.2
LCSTS	Jieba	41.0	35.9	37.2	23.9	21.1	21.7	37.7	33.0	34.2
LCSTS	Pkuseg	41.0	36.5	37.6	23.8	21.4	21.9	37.6	33.4	34.4
LCSTS	THULAC	38.8	37.2	36.8	22.8	22.1	21.7	35.4	34.0	33.6
SogouCS	Jieba	49.1	44.4	45.5	31.2	28.6	29.1	46.3	41.9	42.9
SogouCS	Pkuseg	49.6	44.5	45.7	31.6	28.8	29.4	46.7	42.0	43.2
SogouCS	THULAC	49.1	44.1	45.3	31.6	28.7	29.2	46.3	41.6	42.7
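
The P, R, and F columns in the table above follow the usual ROUGE definitions: precision and recall over matched n-grams (ROUGE-1, ROUGE-2) or over the longest common subsequence (ROUGE-L). The sketch below implements those standard definitions for a single candidate and reference pair. It assumes character-level tokens and F with equal weight on P and R; the source does not state which evaluation tool or tokenization was actually used, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Scores = Tuple[float, float, float]  # (precision, recall, F)


def _prf(overlap: int, cand_total: int, ref_total: int) -> Scores:
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(candidate: Sequence[str], reference: Sequence[str], n: int) -> Scores:
    """ROUGE-N from clipped n-gram overlap counts."""
    def ngrams(tokens: Sequence[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # multiset intersection clips repeated n-grams
    return _prf(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate: Sequence[str], reference: Sequence[str]) -> Scores:
    """ROUGE-L from the longest common subsequence (dynamic programming)."""
    m, k = len(candidate), len(reference)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return _prf(dp[m][k], m, k)


candidate = list("今天天气好")     # generated summary, split into characters
reference = list("今天天气很好")   # reference summary, split into characters
r1 = rouge_n(candidate, reference, 1)
r2 = rouge_n(candidate, reference, 2)
rl = rouge_l(candidate, reference)
print(r1, r2, rl)
```

Note that the tabulated F values are not the harmonic mean of the tabulated P and R (for example, P = 58.0 and R = 56.2 would give about 57.1, whereas the table reports 56.4). That would be consistent with averaging per-sample F scores over the test set, although the source does not say so.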

Dataset	ROUGE-1 P/%	ROUGE-1 R/%	ROUGE-1 F/%	ROUGE-2 P/%	ROUGE-2 R/%	ROUGE-2 F/%	ROUGE-L P/%	ROUGE-L R/%	ROUGE-L F/%
NLPCC2017	57.7	56.8	56.5	39.2	39.1	38.6	52.2	51.3	51.1
LCSTS	41.0	36.5	37.6	23.8	21.4	21.9	37.6	33.4	34.4
SogouCS	49.6	44.5	45.7	31.6	28.8	29.4	46.7	42.0	43.2

Number of news articles	ROUGE-1 P/%	ROUGE-1 R/%	ROUGE-1 F/%	ROUGE-2 P/%	ROUGE-2 R/%	ROUGE-2 F/%	ROUGE-L P/%	ROUGE-L R/%	ROUGE-L F/%
10	34.5	24.4	28.2	17.3	13.2	15.0	26.2	19.1	22.0
50	43.4	27.5	32.7	19.9	14.2	16.1	38.2	24.3	28.8
100	55.3	41.2	45.0	27.5	22.7	24.0	49.2	35.9	39.5
500	48.2	40.1	42.4	25.8	22.8	23.5	42.6	35.3	37.4
1 000	52.2	45.6	47.8	31.0	27.5	28.6	46.5	40.6	42.5
5 000	51.1	48.8	49.0	30.5	30.0	29.7	45.1	43.1	43.3
10 000	53.6	51.3	51.6	33.6	32.9	32.6	47.8	45.7	46.0