From: *Performance vs. hardware requirements in state-of-the-art automatic speech recognition*
ASR system | Kaldi TDNN | Kaldi CNN-TDNN | PaddlePaddle DeepSpeech2 | Facebook CNN-ASG | Facebook TDS-S2S | RWTH Returnn | Nvidia Jasper | Nvidia QuartzNet |
---|---|---|---|---|---|---|---|---|
System type | HMM-based | HMM-based | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN |
Multi-component vs. single NN | AM (NN) + PD + LM + [Op. LM rescr.] | AM (NN) + PD + LM + [Op. LM rescr.] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + BPE encoding list + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] |
Speech features | 3 frames x 40 MFCCs + 100 i-vectors | 1 frame x 40 fbanks + 200 i-vectors | 160 frames x 160 log-spectrograms | 240 frames x 40 Mel-fbanks | 240 frames x 80 Mel-fbanks | 200 frames x 40 MFCC | 160 frames x 64 Log-fbanks | 160 frames x 64 Mel-spectrograms |
NN architecture | Time delay | Conv. + Time delay | Conv. + Bi-RNN | Conv. + GLU | TDS-GRU Enc.-Dec. | LSTM Enc.-Dec. + Attention | Time delay | Time-channel separable conv. |
Layers | 1 x TDNN, 16 x TDNN-F | 6 x CNN, 12 x TDNN-F | 2 x CNN, 2 x Bi-RNN | 17 x CNN, 1 x GLU | 1 x CNN + 2 x TDS, 1 x CNN + 3 x TDS, 1 x CNN + 6 x TDS, 1 x GRU | 6 x LSTM, 1 x Attention | 1 x CNN, 10 x Dense residual, 3 x CNN | 1 x CNN, 15 x TCS conv., 2 x CNN, 1 x 1 conv. |
Output | AM: 6k posteriors + LM: 200k words | AM: 6k posteriors + LM: 200k words | NN: 30 chars + LM: 200k words | NN: * words | NN: * words | NN: 10k word parts | NN: 28 characters + LM: 200k words | NN: 28 characters + LM: 200k words |
Loss function | LF-MMI + CE | LF-MMI + CE | CTC | ASG | S2S Attention | S2S Attention + CTC | CTC | CTC |
Model size [×10⁶] | 20 | 18 | 49 | 208 | 38 | 187 | 333 | 18.8 |
Operations per frame [×10⁶] | 41 | 63 | 105 | 22k | 15 | 125 | 42k | 1.8k |
Activations per frame [×10³] | 44 | 51 | 13 | 1k | 9 | 38k | 3k | 3.5k |
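As a rough illustration (not part of the source table), the operations-per-frame figures can be converted into an estimated compute load per second of audio. The sketch below assumes a feature rate of 100 frames per second (a common 10 ms frame shift); the systems above use different frame shifts and strides, so treat the results as order-of-magnitude estimates only.

```python
# Back-of-the-envelope compute estimate from the table's "Operations per
# frame" row. The 100 frames/second rate is an assumption (typical 10 ms
# frame shift), not a figure from the source.

OPS_PER_FRAME_M = {  # operations per frame, in millions (from the table)
    "Kaldi TDNN": 41,
    "Kaldi CNN-TDNN": 63,
    "PaddlePaddle DeepSpeech2": 105,
    "Facebook CNN-ASG": 22_000,
    "Facebook TDS-S2S": 15,
    "RWTH Returnn": 125,
    "Nvidia Jasper": 42_000,
    "Nvidia QuartzNet": 1_800,
}

FRAMES_PER_SECOND = 100  # assumed 10 ms frame shift


def gops_per_second(system: str) -> float:
    """Estimated giga-operations per second of audio processed."""
    return OPS_PER_FRAME_M[system] * FRAMES_PER_SECOND / 1_000


for name in sorted(OPS_PER_FRAME_M, key=OPS_PER_FRAME_M.get):
    print(f"{name:26s} ~{gops_per_second(name):8.1f} GOPS per second of audio")
```

Under this assumption the spread is striking: Facebook TDS-S2S lands around 1.5 GOPS per second of audio while Nvidia Jasper needs roughly 4,200 GOPS, a difference of about three orders of magnitude for systems of broadly comparable accuracy.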