Research on Perception Quantification-based Neural Speech Synthesis Methods

doi:10.16180/j.cnki.issn1007-7820.2019.09.016

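Abstract

Abstract (translated from the Chinese version):

To address the difficulty of mixing data from different sources in current neural network acoustic modeling, this paper proposes a neural network speech synthesis method based on perception quantification coding. A perception quantification coding model is designed to learn representations of how large volumes of speech differ in timbre, language, and emotion, and a unified neural network acoustic model is built by training on mixed data from multiple speakers. Data sharing and transfer learning within this unified acoustic model significantly reduce the amount of data needed to build a synthesis system and allow effective control over attributes of the synthetic speech such as timbre, language, and emotion. The method improves the quality and flexibility of neural network speech synthesis: a system built from one hour of data reaches a naturalness of 4.0 MOS, matching or exceeding the level of an ordinary speaker.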

Keywords: speech synthesis, perception quantification coding, neural networks, low-data synthesis, cross-lingual synthesis, emotion control

Abstract:

Current neural network-based speech synthesis frameworks are designed for a single speaker, require at least a few hours of training data, and cannot make use of speech data from different speakers, languages, and styles. To address this problem, a perception quantification-based neural network speech synthesis method is proposed. In the proposed method, a perception quantification-based model is designed to learn representations of different attributes of speech. A unified acoustic model is built using the learned perception quantification representations for different speakers, languages, and styles. An adaptation method is introduced to transfer knowledge from the unified acoustic model to new speakers with limited speech data. The proposed method can effectively control the speaker, language, and style of synthetic speech and achieve cross-language and cross-style speech synthesis, and the adaptation method can reduce the demand for training data to a few minutes. The proposed methods significantly improve the quality and flexibility of speech synthesis systems, and the naturalness of the synthesized speech is similar to or better than that of an average Mandarin speaker.

Key words: speech synthesis, perception quantification, neural networks, limited data, cross-language, style control

CLC number:

TN912.33

LIU Qingfeng, JIANG Yuan, HU Yajun, LIU Lijuan. Research on Perception Quantification-based Neural Speech Synthesis Methods[J]. Electronic Science and Technology (电子科技), 2019, 32(9): 76-79.

Figures and tables (6): Figures 1-3, Tables 1-3


Chinese naturalness (MOS)	Baseline system	Perception quantification system
Chinese, 10 h of data	4.02	4.22
Chinese, 1 h of data	3.70	4.05
Chinese, 5 min of data	-	3.46

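Emotion identification accuracy	Baseline system	Perception quantification system	Relative improvement
Neutral	81.6%	92.7%	60.3%
Happy	91.6%	100%	100%
Angry	95.6%	98.3%	61.4%
Sad	100%	100%	-
Average of the 4 emotions	92.2%	97.75%	71.2%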
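
The page does not define how the last column of the emotion table is computed. The figures are consistent with a relative error reduction, that is, the share of the baseline system's errors that the perception quantification system removes: (new accuracy - baseline accuracy) / (100 - baseline accuracy). The sketch below recomputes the column under that assumption; the function name `relative_error_reduction` is a hypothetical helper introduced here, not something taken from the paper.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Percentage of the baseline's errors removed by the new system.

    Both arguments are accuracies in percent. Returns None when the
    baseline is already at 100%, since there are no errors left to
    reduce (the "-" cell in the table).
    """
    baseline_err = 100.0 - baseline_acc
    if baseline_err == 0:
        return None
    return 100.0 * (system_acc - baseline_acc) / baseline_err


# Accuracies copied from the emotion table: (baseline, perception quantification).
rows = {
    "Neutral": (81.6, 92.7),
    "Happy": (91.6, 100.0),
    "Angry": (95.6, 98.3),
    "Sad": (100.0, 100.0),
}

for emotion, (base, new) in rows.items():
    rer = relative_error_reduction(base, new)
    print(emotion, "-" if rer is None else f"{rer:.1f}%")

# The average row applies the same formula to the mean accuracies.
base_avg = sum(b for b, _ in rows.values()) / len(rows)
new_avg = sum(n for _, n in rows.values()) / len(rows)
print("Average", f"{relative_error_reduction(base_avg, new_avg):.1f}%")
```

Under this reading, the "Happy" row reaches 100% because the new system removes every baseline error, and the "Sad" row has no value because the baseline makes no errors.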