Multi-Encoder Transformer for End-to-End Speech Recognition

doi:10.16180/j.cnki.issn1007-7820.2024.04.001

Abstract

The widely used Transformer model captures global dependencies well, but its shallow layers tend to overlook local feature information. To address this problem, this paper proposes a method that uses multiple encoders to improve speech feature extraction. An additional convolutional encoder branch strengthens the capture of local feature information, compensating for what the shallow Transformer layers miss, and fuses the global and local dependencies of the audio feature sequence. The result is the Transformer-based multi-encoder model. Experiments on the open-source Mandarin Chinese dataset Aishell-1 show that, without an external language model, the proposed model reduces the character error rate (CER) by a relative 4.00% compared with the Transformer model. On an internal, non-public Shanghainese dialect dataset the improvement is larger: the CER falls from 19.92% to 10.31%, a relative reduction of 48.24%.

Key words: Transformer, speech recognition, end-to-end, deep neural networks, multi-encoder, multi-head attention, feature fusion, convolution branch networks
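The 48.24% quoted for the Shanghainese dataset is a relative reduction, not a difference in percentage points. A minimal Python sketch of that arithmetic follows; `relative_reduction` is a hypothetical helper name introduced here for illustration and does not come from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from baseline to improved, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Shanghainese character error rates quoted in the abstract (percent).
baseline_cer = 19.92
met_cer = 10.31

absolute_drop = baseline_cer - met_cer                      # percentage points
relative_drop = relative_reduction(baseline_cer, met_cer)   # percent of baseline

print(f"absolute: {absolute_drop:.2f} points, relative: {relative_drop:.2f}%")
# prints "absolute: 9.61 points, relative: 48.24%"
```

Reporting both numbers side by side avoids reading a relative figure as an absolute one.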

CLC number: TN912.34

PANG Jiangfei, SUN Zhanquan. Multi-Encoder Transformer for End-to-End Speech Recognition[J]. Electronic Science and Technology, 2024, 37(4): 1-7.

References

[1]	Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286. doi: 10.1109/5.18626
[2]	Liu Qingfeng, Jiang Yuan, Hu Yajun, et al. Research on perception quantification-based neural speech synthesis methods[J]. Electronic Science and Technology, 2019, 32(9): 76-79. (in Chinese)
[3]	Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks[C]. Pittsburgh: Proceedings of the Twenty-third International Conference on Machine Learning, 2006: 529-537.
[4]	He Y, Sainath T N, Prabhavalkar R, et al. Streaming end-to-end speech recognition for mobile devices[C]. Brighton: IEEE International Conference on Acoustics, Speech and Signal Processing, 2019: 653-659.
[5]	Li B, Chang S, Sainath T N, et al. Towards fast and accurate streaming end-to-end ASR[C]. Barcelona: IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 311-320.
[6]	Li S, Dabre R, Lu X, et al. Improving Transformer-based speech recognition systems with compressed structure and speech attributes augmentation[C]. Graz: Proceedings of the Annual Conference of the International Speech Communication Association, 2019: 295-303.
[7]	Chan W, Jaitly N, Le Q, et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition[C]. Shanghai: IEEE International Conference on Acoustics, Speech and Signal Processing, 2016: 389-396.
[8]	Dong L, Xu S, Xu B. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition[C]. Calgary: IEEE International Conference on Acoustics, Speech and Signal Processing, 2018: 601-608.
[9]	Chen X, Zhang S, Song D, et al. Transformer with bidirectional decoder for speech recognition[C]. Shanghai: Proceedings of the Annual Conference of the International Speech Communication Association, 2020: 498-503.
[10]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]. Long Beach: Advances in Neural Information Processing Systems, 2017: 353-361.
[11]	Zhou S, Dong L, Xu S, et al. Syllable-based sequence-to-sequence speech recognition with the Transformer in Mandarin Chinese[C]. Hyderabad: Proceedings of the Annual Conference of the International Speech Communication Association, 2018: 263-270.
[12]	Miao H, Cheng G, Gao C, et al. Transformer-based online CTC/attention end-to-end speech recognition architecture[C]. Barcelona: IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 193-201.
[13]	Huang W, Hu W, Yeung Y T, et al. Conv-Transformer transducer: Low latency, low frame rate, streamable end-to-end speech recognition[C]. Shanghai: Proceedings of the Annual Conference of the International Speech Communication Association, 2020: 933-942.
[14]	Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented Transformer for speech recognition[C]. Shanghai: Proceedings of the Annual Conference of the International Speech Communication Association, 2020: 801-807.
[15]	Lohrenz T, Li Z, Fingscheidt T. Multi-encoder learning and stream fusion for Transformer-based end-to-end automatic speech recognition[C]. Brno: Proceedings of the Annual Conference of the International Speech Communication Association, 2021: 711-719.
[16]	Bu H, Du J, Na X, et al. Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline[C]. Seoul: The Twentieth Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, 2017: 821-827.
[17]	He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]. Las Vegas: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 542-540.
[18]	Park D S, Chan W, Zhang Y, et al. SpecAugment: A simple data augmentation method for automatic speech recognition[C]. Graz: Proceedings of the Annual Conference of the International Speech Communication Association, 2019: 632-638.
[19]	Karita S, Chen N, Hayashi T, et al. A comparative study on Transformer vs RNN in speech applications[C]. Singapore: IEEE Automatic Speech Recognition and Understanding Workshop, 2019: 578-583.
[20]	Guo P, Boyer F, Chang X, et al. Recent developments on ESPnet toolkit boosted by Conformer[C]. Toronto: IEEE International Conference on Acoustics, Speech and Signal Processing, 2021: 55-60.
[21]	Sun S, Guo P, Xie L, et al. Adversarial regularization for attention based end-to-end robust speech recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1826-1838. doi: 10.1109/TASLP.6570655
[22]	Tian Z, Yi J, Bai Y, et al. Synchronous Transformers for end-to-end speech recognition[C]. Barcelona: IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 1250-1258.
[23]	Luo H, Zhang S, Lei M, et al. Simplified self-attention for Transformer-based end-to-end speech recognition[C]. Shenzhen: IEEE Spoken Language Technology Workshop, 2021: 988-995.
[24]	Ding F, Guo W, Dai L, et al. Attention-based gated scaling adaptive acoustic model for CTC-based speech recognition[C]. Barcelona: IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 555-560.
[25]	Luo J, Wang J, Cheng N, et al. Multi-QuartzNet: Multi-resolution convolution for speech recognition with multilayer feature fusion[C]. Shenzhen: IEEE Spoken Language Technology Workshop, 2021: 1128-1135.
[26]	Shan C, Weng C, Wang G, et al. Component fusion: Learning replaceable language model component for end-to-end speech recognition system[C]. Brighton: IEEE International Conference on Acoustics, Speech and Signal Processing, 2019: 89-95.
[27]	Fan Z, Li J, Zhou S, et al. Speaker-aware speech-Transformer[C]. Singapore: IEEE Automatic Speech Recognition and Understanding Workshop, 2019: 801-812.

Model	Validation-set CER/%	Test-set CER/%
STBD [9]	5.80	6.64
LFMMI [16]	6.44	7.62
ESPnet (Transformer) [20]	6.00	6.70
LDS-REG [21]	9.43	10.56
Sync-Transformer [22]	7.91	8.91
SSAN [23]	-	6.84
AGS [24]	7.00	7.94
Multi-QuartzNet [25]	-	6.77
LAS [26]	-	10.56
ST [27]	7.93	8.36
Transformer (this work)	5.68	6.18
MET	5.54	5.93
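The figures in this table are character error rates. The page does not spell out how they were computed, so the sketch below assumes the standard definition: the character-level edit distance between the reference and the recognized text, divided by the reference length. The function names and the toy strings are illustrative and are not taken from the paper or its datasets.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of r
                            curr[j - 1] + 1,      # insertion of h
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref) * 100.0


# Toy example (made up): one substituted and one deleted character out of six.
reference = "今天天气很好"
hypothesis = "今天天汽好"
print(f"{cer(reference, hypothesis):.2f}")  # prints "33.33"
```

Because insertions count as errors, a CER computed this way can exceed 100% when the hypothesis is much longer than the reference.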

Number of convolutional branch layers	Test-set CER/%
1	6.16
2	6.01
3	5.93
4	6.13

Cross-validation run	Test-set CER/%
Run 1	11.21
Run 2	10.19
Run 3	10.47
Mean	10.62
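The last row of this table is the arithmetic mean of the three cross-validation runs, which can be checked with a short Python sketch. The variable names are illustrative.

```python
from statistics import mean

# Test-set CERs of the three cross-validation runs listed in the table (percent).
run_cer = [11.21, 10.19, 10.47]

average_cer = mean(run_cer)
print(f"{average_cer:.2f}")  # prints "10.62"
```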