Combination of dynamic features with a new mask to optimize neural network speech enhancement

doi:10.19665/j.issn1001-2400.2021.03.012

Abstract

Neural network speech enhancement algorithms often yield poor speech quality because the selected features cannot fully represent the nonlinear structure of speech. To address this problem, this paper proposes a speech enhancement method that combines dynamic features with a new mask to optimize the neural network. First, three features of the noisy speech are extracted and concatenated to obtain static features; their first- and second-order differences are then computed to capture the transient information of speech and are fused into dynamic features. Combining the dynamic and static features makes the features internally complementary and reduces speech distortion. Second, to improve the intelligibility and clarity of the enhanced speech at the same time, a new adaptive mask is proposed that adaptively adjusts both the energy proportion of speech and noise and the proportion between the traditional mask and the square-root mask. Gammatone channel weights are then used to modify the mask values in each channel, imitating the human auditory system and further improving speech intelligibility. Finally, simulation experiments are carried out on multiple utterances under different noise backgrounds. The results show that, compared with algorithms from the existing literature, the proposed algorithm achieves higher signal-to-noise ratio, speech quality (PESQ), and short-time objective intelligibility (STOI) values, which verifies its effectiveness.

Key words: dynamic features, adaptive mask, speech enhancement, neural network
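
The feature construction and mask blending described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the three static features or give the mask formula, so the inputs `feat_a`, `feat_b`, `feat_c`, the simple frame-to-frame difference, and the fixed blending weight `alpha` are all hypothetical. In the paper the blending is adaptive and the mask is further modified by Gammatone channel weights, neither of which is modeled here.

```python
import numpy as np


def delta(features):
    """Frame-to-frame difference along the time axis (frames x dims).

    The first frame has no predecessor, so its difference is set to zero.
    """
    features = np.asarray(features, dtype=float)
    d = np.zeros_like(features)
    d[1:] = features[1:] - features[:-1]
    return d


def dynamic_features(feat_a, feat_b, feat_c):
    """Concatenate three static features, then append their first- and
    second-order differences (hypothetical stand-in for the paper's features)."""
    static = np.concatenate([feat_a, feat_b, feat_c], axis=1)
    d1 = delta(static)   # first-order difference
    d2 = delta(d1)       # second-order difference
    return np.concatenate([static, d1, d2], axis=1)


def blended_mask(speech_energy, noise_energy, alpha):
    """Blend an energy-ratio mask with its square root.

    alpha = 1 gives the plain ratio mask, alpha = 0 the square-root mask.
    A fixed alpha is an assumption made for illustration only.
    """
    eps = 1e-12  # avoids division by zero in silent time-frequency units
    speech_energy = np.asarray(speech_energy, dtype=float)
    noise_energy = np.asarray(noise_energy, dtype=float)
    ratio = speech_energy / (speech_energy + noise_energy + eps)
    return alpha * ratio + (1.0 - alpha) * np.sqrt(ratio)
```

For three input matrices with the same number of frames, `dynamic_features` returns a matrix with three times as many columns as the concatenated static features, and `blended_mask` always lies between the ratio mask and its square root.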

CLC number:

TN912.35

MEI Shulin, JIA Hairong, WANG Xiaogang, WU Yifeng. Combination of dynamic features with a new mask to optimize neural network speech enhancement[J]. Journal of Xidian University, 2021, 48(3): 91-98.

Figures/Tables (9): Figures 1-6, Tables 1-3


SegSNR
Noise	SNR/dB	Noisy speech	Algorithm 1	Algorithm 2	Algorithm 3
F16	10	-16.2015	2.4623	2.5309	3.9346
F16	5	-20.8089	0.4848	0.5467	1.6489
F16	0	-25.1112	-1.9372	-1.8394	-0.9659
F16	-5	-29.3359	-6.6609	-6.4237	-5.4234
F16	-10	-33.7867	-13.2356	-13.5672	-12.0134
Babble	10	-15.9128	0.8600	1.5378	2.3674
Babble	5	-20.5937	-1.8765	-1.4569	0.2442
Babble	0	-25.0504	-6.4012	-5.589	-4.5780
Babble	-5	-28.8228	-7.8811	-6.8898	-6.0356
Babble	-10	-32.1345	-14.1256	-13.1267	-12.3324
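
For readers unfamiliar with the SegSNR metric reported above, the sketch below shows one common way to compute a segmental signal-to-noise ratio: the SNR is evaluated per frame and then averaged. The frame length and the absence of per-frame clamping are assumptions; the page does not state the exact settings used in the experiments, and the function name `segmental_snr` is illustrative.

```python
import numpy as np


def segmental_snr(clean, enhanced, frame_len=256):
    """Mean per-frame SNR in dB between a clean and an enhanced signal.

    Frames are non-overlapping; trailing samples that do not fill a
    whole frame are ignored. No per-frame clamping is applied.
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")

    eps = 1e-12  # guards against log of zero and division by zero
    frame_snrs = []
    for i in range(n_frames):
        start, stop = i * frame_len, (i + 1) * frame_len
        s = clean[start:stop]
        err = s - enhanced[start:stop]  # residual noise and distortion
        ratio = (np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps)
        frame_snrs.append(10.0 * np.log10(ratio))
    return float(np.mean(frame_snrs))
```

Because every frame contributes equally to the average, low-energy frames dominated by noise pull the score down much more than they would in a global SNR.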

PESQ
Noise	SNR/dB	Noisy speech	Algorithm 1	Algorithm 2	Algorithm 3
F16	10	2.1976	2.6238	2.6667	2.7197
F16	5	2.0312	2.1214	2.3539	2.6658
F16	0	1.6747	2.0168	2.1226	2.4284
F16	-5	1.4627	1.9506	1.5239	2.2615
F16	-10	1.1214	1.2107	1.4437	2.1837
Babble	10	2.2351	2.5585	2.6020	2.6103
Babble	5	1.7847	2.3294	2.5645	2.5909
Babble	0	1.4227	1.7968	2.1103	2.2538
Babble	-5	1.1437	1.5678	1.9947	2.1674
Babble	-10	0.9869	1.0994	1.2348	2.0102

STOI
Noise	SNR/dB	Noisy speech	Algorithm 1	Algorithm 2	Algorithm 3
F16	10	0.8000	0.8310	0.8704	0.9061
F16	5	0.7693	0.8001	0.8189	0.8599
F16	0	0.7120	0.7668	0.7940	0.8140
F16	-5	0.6663	0.7020	0.7590	0.7750
F16	-10	0.6430	0.6600	0.6789	0.6963
Babble	10	0.7994	0.8296	0.8751	0.8976
Babble	5	0.7520	0.8121	0.8230	0.8643
Babble	0	0.6956	0.8352	0.8361	0.8366
Babble	-5	0.6766	0.7192	0.7441	0.7873
Babble	-10	0.6107	0.6555	0.6600	0.6942