Speech enhancement method based on the time-frequency smoothing deep neural network

doi:10.19665/j.issn1001-2400.2019.04.018

Abstract

Existing speech enhancement methods based on deep neural networks design their network structures without fully considering the characteristics of the speech enhancement problem itself. To address this, a time-frequency smoothing network is designed that processes the time and frequency dimensions differently. The design builds on the different characteristics of speech enhancement along these two dimensions and is inspired by the way traditional speech enhancement methods compute local features of noisy speech. The network uses a gated recurrent unit (GRU) to model the correlation of noisy speech over time and a convolutional neural network (CNN) to model its correlation across frequency, which realizes a time-frequency smoothing process similar to that of traditional speech enhancement methods. Experimental results show that, while preserving the causality of the speech enhancement system, the proposed network significantly improves speech enhancement performance compared with other networks, and that the enhanced speech has better quality and intelligibility.

Key words: speech enhancement, time-frequency smoothing, convolutional neural network, deep neural network
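
The abstract describes the network as a learned counterpart of the time-frequency smoothing used in traditional speech enhancement. As a rough illustration of that fixed-weight counterpart only (not of the proposed network), the sketch below smooths each power-spectrum frame across neighbouring frequency bins and then applies a causal first-order recursion over frames. The function names, the window `(0.25, 0.5, 0.25)`, the smoothing factor `alpha = 0.8`, and the edge handling are all illustrative assumptions; none of them comes from the paper.

```python
def smooth_frequency(frame, window):
    """Smooth one power-spectrum frame across neighbouring frequency bins.

    Indices that fall outside the frame are clamped to the nearest valid bin
    (an assumption made here so that a flat spectrum stays flat at the edges).
    """
    half = len(window) // 2
    last = len(frame) - 1
    out = []
    for k in range(len(frame)):
        acc = 0.0
        for i, w in enumerate(window):
            j = min(max(k + i - half, 0), last)  # clamp to a valid bin index
            acc += w * frame[j]
        out.append(acc)
    return out


def smooth_time_frequency(spectrogram, window=(0.25, 0.5, 0.25), alpha=0.8):
    """Causal time-frequency smoothing of a list of power-spectrum frames.

    Each output frame depends only on the current and earlier input frames:
    S[l] = alpha * S[l-1] + (1 - alpha) * smooth_frequency(Y[l]).
    """
    smoothed = []
    previous = None
    for frame in spectrogram:
        local = smooth_frequency(frame, window)
        if previous is None:
            current = local  # first frame: no history to recurse on
        else:
            current = [alpha * p + (1.0 - alpha) * c
                       for p, c in zip(previous, local)]
        smoothed.append(current)
        previous = current
    return smoothed
```

In the proposed network, the GRU takes the role of the recursion over frames and the CNN takes the role of the window across frequency bins, with both learned from data rather than fixed. Because the recursion only looks backwards in time, the sketch is causal, which is the property the abstract emphasizes.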

CLC number:

TN912.3

YUAN Wenhao, LIANG Chunyan, LOU Yingxi, FANG Chao, WANG Zhiqiang. Speech enhancement method based on the time-frequency smoothing deep neural network[J]. Journal of Xidian University, 2019, 46(4): 130-136.

Figures and tables (5): Fig. 1, Table 1, Table 2, Fig. 2, Table 3


Noise	SNR (dB)	Noisy speech	Fully connected neural network	Gated recurrent unit	Time-frequency smoothing network
N₁	-7	1.62	1.65	2.06	2.24
N₁	0	2.08	2.25	2.65	2.78
N₁	7	2.54	2.68	3.09	3.21
N₂	-7	1.29	1.24	1.67	1.98
N₂	0	1.63	1.79	2.24	2.47
N₂	7	2.08	2.32	2.73	2.90
N₃	-7	1.49	1.45	1.78	2.10
N₃	0	1.81	1.92	2.32	2.62
N₃	7	2.21	2.42	2.80	3.08
N₄	-7	1.30	1.23	1.59	2.02
N₄	0	1.57	1.69	2.07	2.51
N₄	7	1.97	2.17	2.56	2.92
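
To summarize the table above in one number per method, the short sketch below averages each of its four score columns (noisy speech, fully connected neural network, gated recurrent unit, time-frequency smoothing network) over all twelve noise and SNR conditions. The values are copied from the table; the dictionary keys and the choice of a plain unweighted mean are illustrative assumptions, not an analysis reported in the paper.

```python
# Scores copied from the table above, one list per column,
# ordered N1..N4 and, within each noise type, SNR = -7, 0, 7 dB.
scores = {
    "noisy": [1.62, 2.08, 2.54, 1.29, 1.63, 2.08,
              1.49, 1.81, 2.21, 1.30, 1.57, 1.97],
    "fully_connected": [1.65, 2.25, 2.68, 1.24, 1.79, 2.32,
                        1.45, 1.92, 2.42, 1.23, 1.69, 2.17],
    "gru": [2.06, 2.65, 3.09, 1.67, 2.24, 2.73,
            1.78, 2.32, 2.80, 1.59, 2.07, 2.56],
    "tf_smoothing": [2.24, 2.78, 3.21, 1.98, 2.47, 2.90,
                     2.10, 2.62, 3.08, 2.02, 2.51, 2.92],
}


def column_means(table):
    """Return the unweighted mean of every column in the table."""
    return {name: sum(values) / len(values) for name, values in table.items()}


means = column_means(scores)
gain_over_gru = means["tf_smoothing"] - means["gru"]

for name, value in means.items():
    print(f"{name}: {value:.2f}")
print(f"gain over GRU: {gain_over_gru:.2f}")
```

For these values the column means are about 1.80, 1.90, 2.30 and 2.57, so the time-frequency smoothing network scores roughly 0.27 higher on average than the gated recurrent unit alone.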

Noise	SNR (dB)	Noisy speech	Fully connected neural network	Gated recurrent unit	Time-frequency smoothing network
N₁	-7	0.61	0.58	0.71	0.71
N₁	0	0.76	0.73	0.84	0.84
N₁	7	0.87	0.82	0.91	0.90
N₂	-7	0.48	0.44	0.58	0.61
N₂	0	0.63	0.61	0.75	0.76
N₂	7	0.79	0.75	0.86	0.86
N₃	-7	0.53	0.45	0.60	0.67
N₃	0	0.69	0.64	0.78	0.82
N₃	7	0.84	0.80	0.88	0.90
N₄	-7	0.52	0.46	0.63	0.67
N₄	0	0.69	0.65	0.78	0.80
N₄	7	0.84	0.79	0.88	0.88