Text Keyword Extraction Method Based on BERT and LightGBM

doi:10.16180/j.cnki.issn1007-7820.2023.03.002

Abstract

Traditional text keyword extraction methods ignore contextual semantic information and cannot handle polysemy, so their extraction results are unsatisfactory. Building on the LDA and BERT models, this paper proposes the LDA-BERT-LightGBM (LB-LightGBM) model. The method uses the LDA topic model to obtain the topics of each review and their word distributions, and filters candidate keywords by a threshold. The selected words are concatenated with the original review text and fed into the BERT model, which is trained to produce word vectors that carry the text's topic information. Keyword extraction is thereby cast as a binary classification problem solved with the LightGBM algorithm. Experiments compare the TextRank algorithm, the LDA algorithm, the LightGBM algorithm, and the proposed LB-LightGBM model in terms of precision (P), recall (R), and F1. The results show that, for TopN from 3 to 6, the average F1 of LB-LightGBM is 3.5 percentage points higher than that of the best comparison method. Overall, the method outperforms the comparison methods selected in the experiment and identifies text keywords more accurately.

Key words: topic model, word vector, BERT, LightGBM, candidate, extraction, text topic
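
The candidate-selection and concatenation steps described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy word probabilities stand in for an LDA topic-word distribution, the threshold value is arbitrary, and the `[CLS] word [SEP] review [SEP]` layout is only one common way to feed a word together with its text into BERT. The abstract does not specify the exact input format, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def select_candidates(topic_words: Dict[str, float], threshold: float) -> List[str]:
    """Keep words whose topic probability reaches the threshold, highest first."""
    kept = [(word, prob) for word, prob in topic_words.items() if prob >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in kept]


def build_bert_inputs(candidates: List[str], review: str) -> List[Tuple[str, str]]:
    """Pair every candidate word with the original review text."""
    return [(word, review) for word in candidates]


def format_pair(pair: Tuple[str, str]) -> str:
    """Render a (candidate, review) pair in an assumed BERT sentence-pair layout."""
    word, review = pair
    return f"[CLS] {word} [SEP] {review} [SEP]"


if __name__ == "__main__":
    # Made-up word probabilities for the dominant topic of one review.
    topic_words = {"battery": 0.21, "screen": 0.17, "life": 0.05, "bright": 0.09}
    review = "the battery life is long and the screen is bright"

    candidates = select_candidates(topic_words, threshold=0.08)
    for pair in build_bert_inputs(candidates, review):
        print(format_pair(pair))
```

In the pipeline the abstract outlines, the BERT vector obtained for each such pair would then serve as one feature row for the LightGBM classifier, labeled as keyword or non-keyword.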

CLC number:

TP391.1

HE Chuanpeng, YIN Ling, HUANG Bo, WANG Mingsheng, GUO Ruyan, ZHANG Shuai, JU Jiaji. Text Keyword Extraction Method Based on BERT and LightGBM[J]. Electronic Science and Technology, 2023, 36(3): 7-13.

Figures/Tables (11): Figures 1-10 and Table 1

Model	Metric	N=3	N=4	N=5	N=6
TextRank	P	0.711	0.701	0.688	0.661
TextRank	R	0.694	0.685	0.674	0.642
TextRank	F1	0.702	0.693	0.681	0.652
LDA	P	0.772	0.764	0.742	0.722
LDA	R	0.756	0.746	0.731	0.701
LDA	F1	0.764	0.755	0.734	0.711
LightGBM	P	0.812	0.802	0.786	0.756
LightGBM	R	0.801	0.795	0.771	0.745
LightGBM	F1	0.807	0.797	0.778	0.751
LB-LightGBM	P	0.852	0.833	0.813	0.791
LB-LightGBM	R	0.843	0.825	0.804	0.786
LB-LightGBM	F1	0.848	0.829	0.808	0.788
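
The 3.5-point average F1 gain quoted in the abstract can be checked directly against the F1 rows above. The short sketch below copies those rows, averages them over N = 3 to 6, and compares the proposed model with the strongest baseline. It also shows the standard F1 formula (harmonic mean of P and R). The function and variable names are illustrative and do not come from the paper.

```python
from statistics import mean
from typing import Dict, List, Tuple

# F1 rows copied from the results table (columns: N=3, 4, 5, 6).
F1_SCORES: Dict[str, List[float]] = {
    "TextRank": [0.702, 0.693, 0.681, 0.652],
    "LDA": [0.764, 0.755, 0.734, 0.711],
    "LightGBM": [0.807, 0.797, 0.778, 0.751],
    "LB-LightGBM": [0.848, 0.829, 0.808, 0.788],
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean_f1_gain(scores: Dict[str, List[float]],
                 proposed: str = "LB-LightGBM") -> Tuple[str, float]:
    """Return the best baseline by mean F1 and the proposed model's gain over it."""
    means = {name: mean(values) for name, values in scores.items()}
    best_baseline = max((n for n in means if n != proposed), key=means.get)
    return best_baseline, means[proposed] - means[best_baseline]


if __name__ == "__main__":
    baseline, gain = mean_f1_gain(F1_SCORES)
    print(baseline, round(gain, 3))          # LightGBM 0.035
    print(round(f1_score(0.772, 0.756), 3))  # 0.764 (LDA, N=3)
```

The gain is an absolute difference in mean F1 (0.818 versus 0.783), that is, 3.5 percentage points rather than a 3.5% relative improvement.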