Speech Enhancement Method Based on Convolutional Recurrent Network and Non-Local Module

doi:10.16180/j.cnki.issn1007-7820.2022.03.002

Abstract:

Existing deep neural network speech enhancement methods overlook the importance of learning the phase spectrum, which leaves the quality of the enhanced speech unsatisfactory. To address this problem, this paper proposes a speech enhancement method based on a convolutional recurrent network and non-local modules. An encoder-decoder network is designed that takes the time-domain representation of the speech signal as the encoder input for deep feature extraction, so that both the magnitude and the phase information of the signal are fully used. Non-local modules are added to the convolutional layers of the encoder and decoder to extract key features of the speech sequence while suppressing useless ones, and a gated recurrent unit (GRU) network is introduced to capture the temporal correlations within the speech sequence. Experimental results on the ST-CMDS Chinese speech dataset show that, compared with the unprocessed noisy speech, the enhanced speech produced by the proposed method improves quality and intelligibility by 61% and 7.93% on average, respectively.

Key words: speech enhancement, deep neural network, convolutional recurrent network, non-local module, supervised learning, gated recurrent unit, magnitude spectrum, phase spectrum
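To make the non-local module mentioned in the abstract more concrete, the following is a minimal NumPy sketch of one common form of the operation (the embedded-Gaussian, softmax-attention form). It is an illustration under assumptions, not the authors' implementation: the function name `non_local_block`, the halved inner channel count, the random weights, and the zero-initialised output projection are all hypothetical choices. Only the 288×4 feature-map size is taken from the network table on this page.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along one axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Embedded-Gaussian non-local operation on a (channels, time) feature map.

    The 1x1 convolutions are written as plain matrix products over channels.
    """
    theta = w_theta @ x                      # (C_inner, T) query embedding
    phi = w_phi @ x                          # (C_inner, T) key embedding
    g = w_g @ x                              # (C_inner, T) value embedding
    attn = softmax(theta.T @ phi, axis=-1)   # (T, T), each row sums to 1
    y = g @ attn.T                           # (C_inner, T): y_i = sum_j attn[i, j] * g_j
    return x + w_z @ y                       # residual connection keeps the input shape

# Hypothetical sizes: 288 channels x 4 time steps, as in the network table.
rng = np.random.default_rng(0)
C, T = 288, 4
C_inner = C // 2                             # ASSUMPTION: halved inner channels
x = rng.standard_normal((C, T))
w_theta, w_phi, w_g = (0.05 * rng.standard_normal((C_inner, C)) for _ in range(3))
w_z = np.zeros((C, C_inner))                 # ASSUMPTION: zero init, block starts as identity

z = non_local_block(x, w_theta, w_phi, w_g, w_z)
print(z.shape)
```

Because every output position is a weighted sum over all time steps, the block can emphasise informative positions and down-weight others regardless of their distance, which is the behaviour the abstract attributes to the non-local modules.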

CLC number:

TN912.35

Hui LI, Hao JING, Kanghua YAN, Lianghao XU. Speech Enhancement Method Based on Convolutional Recurrent Network and Non-Local Module[J]. Electronic Science and Technology, 2022, 35(3): 8-15.

Figures/Tables (8): Figures 1-4, Tables 1-4


Layer	Input size	Output size	Hyperparameters
Input layer	1×16 384	1×16 384	—
Encoder	1×16 384	288×4	k=15, s=1, n=24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264, 288
Non-local module	288×4	288×4	—
Reshape layer	288×4	4×288	—
GRU layer 1	4×288	4×288	288
GRU layer 2	4×288	4×288	288
Feature fusion layer	[4×288, 4×288]	4×288	—
Reshape layer	4×288	288×4	—
Non-local module	288×4	288×4	—
Decoder	576×8	24×16 384	k=5, s=1, n=288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48, 24
Output layer	25×16 384	1×16 384	k=5, s=1, n=1
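The sizes in the table above can be checked with a short shape trace. The sketch below assumes that each of the 12 encoder layers keeps the sequence length (stride 1 with "same" padding) and is followed by decimation by 2, as in Wave-U-Net-style networks, and that the skip feature is taken before decimation. The table does not state these details, so they are assumptions, and the helper name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(length, filters):
    """Trace (channels, length after conv, length after decimation) per encoder layer.

    ASSUMPTION: stride-1 'same' convolution keeps the length; decimation halves it.
    """
    shapes = []
    for n in filters:
        conv_len = length          # stride 1, 'same' padding: length unchanged
        length = conv_len // 2     # keep every second sample
        shapes.append((n, conv_len, length))
    return shapes

filters = list(range(24, 289, 24))  # 24, 48, ..., 288 (the n values in the table)
shapes = encoder_shapes(16384, filters)

channels, skip_len, out_len = shapes[-1]
print(len(shapes), (channels, out_len))   # layer count and encoder output size
print((2 * channels, skip_len))           # bottleneck upsampled to 8 and concatenated with the last skip
print((filters[0] + 1, 16384))            # decoder output concatenated with the raw input
```

Under these assumptions the trace reproduces the 288×4 encoder output, the 576×8 decoder input (288 upsampled channels plus 288 skip channels), and the 25×16 384 input of the output layer (24 decoder channels plus the 1-channel waveform).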

Item	Specification
Operating system	Windows 10
Programming language	Python 3.6
Processor	Intel Core i5-9400F @ 2.90 GHz
Graphics card	RTX 2060S
Memory	16 GB

SNR	Noise type	a	b	c	d	e	f	g	h
-3 dB	Babble	1.12	1.18	1.14	1.17	1.25	1.28	1.29	1.35
	Destroyer engine	1.21	1.43	1.34	1.43	1.55	1.59	1.56	1.59
	F16	1.15	1.43	1.36	1.50	1.50	1.57	1.51	1.66
	HF channel	1.21	1.46	1.39	1.47	1.52	1.69	1.63	1.69
	M109	1.16	1.63	1.58	1.71	1.71	1.63	1.65	1.83
	White	1.10	1.41	1.14	1.43	1.54	1.59	1.60	1.68
0 dB	Babble	1.31	1.41	1.35	1.41	1.51	1.53	1.51	1.60
	Destroyer engine	1.28	1.53	1.44	1.58	1.67	1.76	1.70	1.80
	F16	1.17	1.59	1.48	1.65	1.75	1.80	1.71	1.87
	HF channel	1.27	1.59	1.49	1.63	1.80	1.86	1.81	1.83
	M109	1.25	1.82	1.72	1.94	1.93	1.91	1.90	2.05
	White	1.12	1.59	1.24	1.64	1.73	1.78	1.78	1.88
3 dB	Babble	1.31	1.93	1.79	2.08	1.71	1.82	1.77	1.99
	Destroyer engine	1.36	1.67	1.58	1.73	1.90	1.98	1.91	1.96
	F16	1.28	1.83	1.71	1.90	2.03	2.08	1.98	2.12
	HF channel	1.36	1.68	1.61	1.74	1.97	2.05	1.99	1.97
	M109	1.42	2.00	1.93	2.10	2.26	2.21	2.21	2.30
	White	1.16	1.76	1.40	1.79	1.90	1.93	1.93	2.03
Mean		1.23	1.61	1.48	1.66	1.73	1.78	1.75	1.84

SNR	Noise type	a	b	c	d	e	f	g	h
-3 dB	Babble	63.91	61.11	62.04	60.80	71.17	70.83	70.95	72.02
	Destroyer engine	68.61	71.86	70.39	73.51	77.57	77.04	77.03	79.13
	F16	67.39	75.06	74.52	75.81	79.15	78.48	78.17	80.57
	HF channel	69.01	73.28	72.87	75.58	78.81	78.78	78.51	78.69
	M109	79.45	83.67	83.28	84.29	86.74	88.72	86.44	88.69
	White	72.82	73.33	55.57	74.11	81.25	81.12	80.74	83.56
0 dB	Babble	76.11	80.04	80.02	78.88	84.82	84.69	84.43	83.95
	Destroyer engine	77.06	80.08	79.32	80.81	84.02	83.55	83.61	85.02
	F16	74.86	81.98	81.21	82.26	84.41	84.18	83.77	86.11
	HF channel	76.87	80.24	79.68	80.90	84.13	83.76	83.61	84.29
	M109	85.69	87.91	87.41	88.19	90.68	90.56	90.53	91.67
	White	79.45	79.92	68.29	80.12	85.41	85.49	85.11	85.39
3 dB	Babble	80.13	91.79	89.27	92.71	88.66	88.67	88.58	92.11
	Destroyer engine	83.81	84.97	85.11	85.29	88.06	87.63	87.70	88.11
	F16	83.06	86.25	85.91	85.99	89.09	88.78	88.65	89.33
	HF channel	82.45	84.27	84.42	84.41	87.73	87.25	87.56	86.95
	M109	91.03	90.70	91.10	90.65	93.71	93.67	93.49	94.25
	White	84.99	85.82	80.40	85.95	88.45	88.43	89.71	89.53
Mean		77.59	80.68	78.38	81.13	84.66	84.54	84.37	85.52
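As a worked check of the average gains, the sketch below compares the last row (the column averages) of the two result tables. Several things here are assumptions, because the column legend and table captions are not part of this page: column a is taken to be the unprocessed noisy speech, column h the proposed method, the first table a PESQ-style quality score, and the second a STOI-style intelligibility score in percent. The helper name `gains` is hypothetical.

```python
# Last-row values copied from the two result tables.
# ASSUMPTION: column "a" = unprocessed noisy speech, column "h" = proposed method;
# first table = quality score (PESQ-like), second table = intelligibility in percent (STOI-like).
quality_noisy, quality_enhanced = 1.23, 1.84
intel_noisy, intel_enhanced = 77.59, 85.52

def gains(before, after):
    """Return (absolute gain, relative gain in percent), both rounded to 2 decimals."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 2)

quality_abs, quality_rel = gains(quality_noisy, quality_enhanced)
intel_abs, intel_rel = gains(intel_noisy, intel_enhanced)

print(f"quality: +{quality_abs} absolute, +{quality_rel}% relative")
print(f"intelligibility: +{intel_abs} points absolute, +{intel_rel}% relative")
```

Under these assumptions the absolute gains are 0.61 and 7.93, which coincide with the 61% and 7.93% figures quoted in the abstract, while the relative gains come out at roughly 49.6% and 10.2%. Which reading the authors intended cannot be confirmed from this page alone.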