Convolutional Quasi-Recurrent Network for Real-Time Speech Enhancement

doi:10.19665/j.issn1001-2400.2022.03.020

Abstract

To further improve the speech enhancement performance of deep neural networks while preserving real-time operation, a convolutional quasi-recurrent network (CQRN) for real-time speech enhancement is proposed. The network takes a causal input: it uses only the time-frequency features of the current and past frames of the noisy speech, which satisfies the input requirement of real-time speech enhancement. A quasi-recurrent neural network (QRNN) models the correlation of the noisy speech along the time dimension, and its ability to process the noisy speech sequence in parallel improves the computational efficiency of the model. Convolutional layers are used to improve how the QRNN computes frequency-dimension features in its hidden layers, so that the model can better exploit the local correlation between adjacent frequency bands of the noisy speech and enhance speech more effectively. Experimental results show that, compared with the QRNN-based speech enhancement method, the CQRN-based method both improves speech enhancement performance and reduces the number of model parameters. Compared with other speech enhancement methods, the CQRN keeps a causal input while effectively suppressing background-noise interference with the target speech, reducing target-speech distortion, and achieving better speech enhancement performance. Finally, the real-time performance of the CQRN-based method is verified on different computing platforms.

Key words: speech enhancement, quasi-recurrent network, convolutional neural network, real-time performance
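The abstract's two structural ideas, a causal input and a quasi-recurrent layer whose gates are computed for all frames in parallel, can be illustrated with a minimal NumPy sketch. It follows the standard QRNN fo-pooling of reference [16], not necessarily the authors' exact CQRN: the frequency-dimension convolution is not specified in this excerpt and is therefore not reproduced. The function name `causal_qrnn_layer`, the weight shapes, and the window length `K` are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def causal_qrnn_layer(x, w_z, w_f, w_o):
    """Standard QRNN layer with fo-pooling over a causal window.

    x:    (T, D) time-frequency features, one row per frame.
    w_*:  (K, D, H) filters spanning the current frame and K-1 past frames
          (hypothetical shapes, chosen for this sketch).
    """
    k, d, h = w_z.shape
    t_len = x.shape[0]
    # Left-pad with K-1 zero frames so frame t only sees frames t-K+1 .. t.
    x_pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    # Causal windows for every frame: shape (T, K, D).
    windows = np.stack([x_pad[t:t + k] for t in range(t_len)])
    # Candidate and gates are computed for all frames at once;
    # this is the part of a QRNN that runs in parallel over time.
    z = np.tanh(np.einsum("tkd,kdh->th", windows, w_z))
    f = sigmoid(np.einsum("tkd,kdh->th", windows, w_f))
    o = sigmoid(np.einsum("tkd,kdh->th", windows, w_o))
    # fo-pooling: the only sequential step, and it is purely elementwise.
    c = np.zeros(h)
    out = np.empty((t_len, h))
    for t in range(t_len):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        out[t] = o[t] * c
    return out
```

Because of the left padding, changing frames from index t onward leaves the outputs before t unchanged, which is the property a real-time system needs.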

CLC number: TN912

SHI Yunlong, YUAN Wenhao, HU Shaodong, LOU Yingxi. Convolutional quasi-recurrent network for real-time speech enhancement[J]. Journal of Xidian University, 2022, 49(3): 183-190.

Figures/Tables (6): Figure 1, Figure 2, Figure 3, Table 1, Table 2, Table 3

References (29)

[1]	LIU D, SMARAGDIS P, KIM M. Experiments on Deep Learning for Speech Denoising[C]// Fifteenth Annual Conference of the International Speech Communication Association. Baixas: ISCA, 2014:2685-2689.
[2]	CHANG Xinxu, ZHANG Yang, YANG Lin, et al. Speech Enhancement Method Based on the Multi-Head Self-Attention Mechanism[J]. Journal of Xidian University, 2020, 47(1):104-110. (in Chinese)
[3]	BOLL S. Suppression of Acoustic Noise in Speech Using Spectral Subtraction[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, 27(2):113-120. doi: 10.1109/TASSP.1979.1163209
[4]	CHEN J, BENESTY J, HUANG Y, et al. New Insights into the Noise Reduction Wiener Filter[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4):1218-1234. doi: 10.1109/TSA.2005.860851
[5]	DENDRINOS M, BAKAMIDIS S, CARAYANNIS G. Speech Enhancement from Noise: A Regenerative Approach[J]. Speech Communication, 1991, 10(1):45-57. doi: 10.1016/0167-6393(91)90027-Q
[6]	SHI Wenhua, ZHANG Xiongwei, ZOU Xia, et al. Time Frequency Masking Based Speech Enhancement Using Deep Encoder-Decoder Neural Network[J]. Acta Acustica, 2020, 45(3):299-307. (in Chinese)
[7]	JIA Hairong, WANG Weimei, JI Huifang. Speech Enhancement Based on the Modified Phase Using Signal-to-Noise Ratio Information and Time-Frequency Characteristics[J]. Journal of Xidian University, 2019, 46(5):162-170. (in Chinese)
[8]	XU Y, DU J, DAI L R, et al. An Experimental Study on Speech Enhancement Based on Deep Neural Networks[J]. IEEE Signal Processing Letters, 2013, 21(1):65-68. doi: 10.1109/LSP.2013.2291240
[9]	KANG T G, KWON K, SHIN J W, et al. NMF-Based Speech Enhancement Incorporating Deep Neural Network[C]// Fifteenth Annual Conference of the International Speech Communication Association. Baixas: ISCA, 2014:2843-2846.
[10]	KOUNOVSKY T, MALEK J. Single Channel Speech Enhancement Using Convolutional Neural Network[C]// 2017 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM). Piscataway: IEEE, 2017:1-5.
[11]	PARK S R, LEE J W. A Fully Convolutional Neural Network for Speech Enhancement (2016)[J/OL]. [2016-09-22]. http://export.arxiv.org/pdf/1609.07132.
[12]	GERMAIN F, CHEN Q, KOLTUN V. Speech Denoising with Deep Feature Losses[C]// Proceedings of the Annual Conference of the International Speech Communication Association. Baixas: ISCA, 2019:2723-2727.
[13]	HUANG P S, KIM M, HASEGAWA-JOHNSON M, et al. Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(12):2136-2147. doi: 10.1109/TASLP.2015.2468583
[14]	SUN L, DU J, DAI L R, et al. Multiple-Target Deep Learning for LSTM-RNN Based Speech Enhancement[C]// 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). Piscataway: IEEE, 2017:136-140.
[15]	GAO T, DU J, DAI L R, et al. Densely Connected Progressive Learning for LSTM-Based Speech Enhancement[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018:5054-5058.
[16]	BRADBURY J, MERITY S, XIONG C, et al. Quasi-Recurrent Neural Networks (2016)[J/OL]. [2016-11-05]. https://arxiv.org/abs/1611.01576.
[17]	ARIK S Ö, CHRZANOWSKI M, COATES A, et al. Deep Voice: Real-time Neural Text-to-Speech[C]// International Conference on Machine Learning. New York: ACM, 2017:195-204.
[18]	VALENTINI-BOTINHAO C, WANG X, TAKAKI S, et al. Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System Using Deep Recurrent Neural Networks[C]// Proceedings of International Speech Communication Association. Baixas: ISCA, 2016:352-356.
[19]	THIEMANN J, ITO N, VINCENT E. The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings[J]. Journal of the Acoustical Society of America, 2013, 19(1):035081.
[20]	WEN S X, DU J, LEE C H. On Generating Mixing Noise Signals with Basis Functions for Simulating Noisy Speech and Learning DNN-Based Speech Enhancement Models[C]// 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP). Piscataway: IEEE, 2017:1-6.
[21]	DONG Y, EVERSOLE A, SELTZER M, et al. An Introduction to Computational Networks and the Computational Network Toolkit: MSR-TR-2014-112[R]. Redmond: Microsoft Technical Report, 2014.
[22]	HU Y, LOIZOU P C. Evaluation of Objective Quality Measures for Speech Enhancement[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 16(1):229-238. doi: 10.1109/TASL.2007.911054
[23]	SCALART P, FILHO J V. Speech Enhancement Based on A Priori Signal to Noise Estimation[C]// IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. Piscataway: IEEE, 1996:629-632.
[24]	PASCUAL S, BONAFONTE A, SERRA J. SEGAN: Speech Enhancement Generative Adversarial Network (2017)[J/OL]. [2017-03-28]. https://arxiv.org/abs/1703.09452v1.
[25]	RETHAGE D, PONS J, SERRA X. A Wavenet for Speech Denoising[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018:5069-5073.
[26]	SONI M H, SHAH N, PATIL H A. Time-Frequency Masking-Based Speech Enhancement Using Generative Adversarial Network[C]// 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Piscataway: IEEE, 2018:5039-5043.
[27]	SHIFAS M P V, ADIGA N, TSIARAS V, et al. A Non-Causal FFTNet Architecture for Speech Enhancement (2020)[J/OL]. [2020-06-08]. https://arxiv.org/abs/2006.04469v1.
[28]	YANG F, WANG Z, LI J, et al. Improving Generative Adversarial Networks for Speech Enhancement through Regularization of Latent Representations[J]. Speech Communication, 2020, 118:1-9. doi: 10.1016/j.specom.2020.02.001
[29]	PANDEY A, WANG D L. On Cross-Corpus Generalization of Deep Learning Based Speech Enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28:2489-2499. doi: 10.1109/TASLP.2020.3016487

Table 1
Network type	QRNN			CQRN		
k	1	2	3	1	2	3
Parameters (×10⁶)	5.26	10.37	15.48	4.89	5.56	6.22
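The parameter counts above support the abstract's claim that the CQRN is smaller than the QRNN, and a few lines of arithmetic make the trend explicit. This is only a worked calculation on the reported numbers; `k` is the table's own setting, whose meaning is not defined in this excerpt, and the helper names are illustrative.

```python
# Parameter counts in millions, copied from the parameter table above.
params = {
    "QRNN": {1: 5.26, 2: 10.37, 3: 15.48},
    "CQRN": {1: 4.89, 2: 5.56, 3: 6.22},
}


def growth_per_step(counts):
    """Increase in parameters (millions) each time k grows by one."""
    ks = sorted(counts)
    return [round(counts[b] - counts[a], 2) for a, b in zip(ks, ks[1:])]


def reduction_percent(k):
    """How much smaller the CQRN is than the QRNN at the same k, in percent."""
    return round(100.0 * (1.0 - params["CQRN"][k] / params["QRNN"][k]), 1)


for name, counts in params.items():
    print(name, "growth per step in k:", growth_per_step(counts))
for k in (1, 2, 3):
    print("k =", k, "reduction:", reduction_percent(k), "%")
```

On these numbers the QRNN grows by about 5.11 million parameters per step in k, while the CQRN grows by about 0.67 million, so the saving widens from roughly 7% at k = 1 to roughly 60% at k = 3.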

Table 2
Method	Causal	CSIG	CBAK	COVL	PESQ
Wiener [23]	Yes	3.23	2.68	2.67	2.22
SEGAN [24]	No	3.48	2.94	2.80	2.16
Wavenet [25]	No	3.62	2.94	2.98	-
MMSE-GAN [26]	No	3.80	3.12	3.14	2.53
Deep Feature Loss [12]	Yes	3.86	3.33	3.22	-
SE-FFTNET [27]	No	3.60	3.20	2.98	2.37
HLGAN [28]	No	3.65	3.19	3.05	2.48
CQRN	Yes	4.19	3.34	3.51	2.80
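To see how large the CQRN's advantage is on each metric, the comparison above can be reduced to a margin over the best competing method. The sketch below is plain arithmetic on the reported scores. It assumes that the two rows with only three values (Wavenet and Deep Feature Loss) are missing their PESQ entry, which is represented as `None`; the function name is illustrative.

```python
METRICS = ("CSIG", "CBAK", "COVL", "PESQ")

# Scores copied from the comparison table above.
# None marks a value assumed not to be reported (see the lead-in).
baselines = {
    "Wiener": (3.23, 2.68, 2.67, 2.22),
    "SEGAN": (3.48, 2.94, 2.80, 2.16),
    "Wavenet": (3.62, 2.94, 2.98, None),
    "MMSE-GAN": (3.80, 3.12, 3.14, 2.53),
    "Deep Feature Loss": (3.86, 3.33, 3.22, None),
    "SE-FFTNET": (3.60, 3.20, 2.98, 2.37),
    "HLGAN": (3.65, 3.19, 3.05, 2.48),
}
cqrn = (4.19, 3.34, 3.51, 2.80)


def margin_over_best(metric):
    """Return (best baseline name, CQRN score minus that baseline's score)."""
    i = METRICS.index(metric)
    reported = {m: s[i] for m, s in baselines.items() if s[i] is not None}
    best = max(reported, key=reported.get)
    return best, round(cqrn[i] - reported[best], 2)


for metric in METRICS:
    print(metric, margin_over_best(metric))
```

On the reported numbers the margin is largest for CSIG (0.33 over Deep Feature Loss) and smallest for CBAK (0.01 over the same method).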

Table 3
Frame shift	CSIG	CBAK	COVL	PESQ
256 points	4.19	3.34	3.51	2.80
128 points	4.24	3.40	3.57	2.86
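The frame shift also sets the time budget for real-time processing: a new frame arrives every `frame shift / sampling rate` seconds, and each frame must be processed within that interval. The sketch below works this out for the two frame shifts in the table. The 16 kHz sampling rate is an assumption, since this excerpt does not state it, and the 4 ms processing time in the usage lines is a made-up figure, not a measurement from the paper.

```python
def frame_budget_ms(hop_points, sample_rate_hz):
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 * hop_points / sample_rate_hz


def real_time_factor(processing_ms_per_frame, hop_points, sample_rate_hz):
    """Processing time divided by the frame budget; below 1 means real time."""
    return processing_ms_per_frame / frame_budget_ms(hop_points, sample_rate_hz)


SAMPLE_RATE_HZ = 16000  # assumption: not given in this excerpt

for hop in (256, 128):
    budget = frame_budget_ms(hop, SAMPLE_RATE_HZ)
    # 4.0 ms per frame is a hypothetical processing time for illustration.
    rtf = real_time_factor(4.0, hop, SAMPLE_RATE_HZ)
    print(f"frame shift {hop}: budget {budget} ms, real-time factor {rtf}")
```

Halving the frame shift from 256 to 128 points halves the per-frame budget, so the small score gains in the table come at the cost of twice as many frames to process per second.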