Table 4 Comparison of ASR systems

From: Performance vs. hardware requirements in state-of-the-art automatic speech recognition

| ASR system | Kaldi TDNN | Kaldi CNN-TDNN | PaddlePaddle DeepSpeech2 | Facebook CNN-ASG | Facebook TDS-S2S | RWTH Returnn | Nvidia Jasper | Nvidia QuartzNet |
|---|---|---|---|---|---|---|---|---|
| System type | HMM-based | HMM-based | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN |
| Multi-component vs. single NN | AM (NN) + PD + LM + [Op. LM rescr.] | AM (NN) + PD + LM + [Op. LM rescr.] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + BPE encoding list + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] |
| Speech features | 3 frames × 40 MFCCs + 100 i-vectors | 1 frame × 40 fbanks + 200 i-vectors | 160 frames × 160 log-spectrograms | 240 frames × 40 Mel-fbanks | 240 frames × 80 Mel-fbanks | 200 frames × 40 MFCCs | 160 frames × 64 log-fbanks | 160 frames × 64 Mel-spectrograms |
| NN architecture | Time delay | Conv. + time delay | Conv. + Bi-RNN | Conv. + GLU | TDS-GRU Enc.-Dec. | LSTM Enc.-Dec. + Attention | Time delay | Time-channel separable conv. |
| Layers | 1 × TDNN, 16 × TDNN-F | 6 × CNN, 12 × TDNN-F | 2 × CNN, 2 × Bi-RNN | 17 × CNN, 1 × GLU | (1 × CNN + 2 × TDS), (1 × CNN + 3 × TDS), (1 × CNN + 6 × TDS), 1 × GRU | 6 × LSTM, 1 × Attention | 1 × CNN, 10 × dense residual, 3 × CNN | 1 × CNN, 15 × TCS conv., 2 × CNN, 1 × 1×1 conv. |
| Output | AM: 6k posteriors + LM: 200k words | AM: 6k posteriors + LM: 200k words | NN: 30 chars, NN: * words | NN: 30 chars, NN: * words | NN: 10k word parts, NN: * words | NN: 10k word parts + LM: 200k words | NN: 28 characters + LM: 200k words | NN: 28 characters + LM: 200k words |
| Loss function | LF-MMI + CE | LF-MMI + CE | CTC | ASG | S2S Attention | S2S Attention + CTC | CTC | CTC |
| Model size [×10⁶] | 20 | 18 | 49 | 208 | 38 | 187 | 333 | 18.8 |
| Operations per frame [×10⁶] | 41 | 63 | 105 | 22k | 15 | 125 | 42k | 1.8k |
| Activations per frame [×10³] | 44 | 51 | 13 | 1k | 9 | 38k | 3k | 3.5k |

  1. The systems are compared in terms of (i) type: hybrid, HMM-based versus end-to-end neural network (E2E NN); (ii) components: multi-component versus a single neural network, with an additional, optional language model; (iii) speech features; (iv) neural network architecture, including the loss function; (v) output size and type; and (vi) model complexity, expressed as model size, number of activations, and number of operations required to process one frame of speech. Each system is described in terms of architecture, complexity, and hardware requirements
  2. Abbreviations in the table are the following: AM Acoustic Model, ASG Auto Segmentation Criterion, ASR Automatic Speech Recognition, BPE Byte-pair encoding, CE Cross-entropy, Conv. Convolutional, CNN Convolutional Neural Network, CTC Connectionist Temporal Classification, Dec. Decoder, E2E End-to-End, Enc. Encoder, GLU Gated Linear Unit, GRU Gated Recurrent Unit, HMM Hidden Markov Model, LF-MMI Lattice-Free Maximum Mutual Information, LM Language Model, LSTM Long Short-Term Memory, MFCC Mel-Frequency Cepstral Coefficients, NN Neural Network, OOV Out-of-vocabulary, Op. Optional, PD Phonetic Model, rescr. rescoring, RNN Recurrent Neural Network, RWTH Rheinisch-Westfälische Technische Hochschule Aachen (Aachen University), S2S Sequence-to-Sequence, TDNN Time-Delay Neural Network, TDNN-F Factored Time-Delay Neural Network, TDS Time-Depth Separable
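The large gap between the model sizes of Jasper (333 × 10⁶ parameters) and QuartzNet (18.8 × 10⁶) in the table stems largely from replacing ordinary 1D convolutions with time-channel separable (TCS) convolutions. A minimal sketch of the two cost models, with made-up channel and kernel sizes (not values from the paper):

```python
# Illustrative sketch: parameter and per-frame multiply-accumulate (MAC)
# counts for a dense 1D convolution vs. a time-channel separable (TCS)
# convolution of the kind used in QuartzNet. Channel counts and kernel
# width below are example values, not taken from the paper.

def conv1d_cost(c_in, c_out, k):
    """Weights and MACs per output frame of a dense 1D convolution (bias omitted)."""
    params = c_in * c_out * k      # one k-wide filter per (input, output) channel pair
    macs_per_frame = params        # each weight contributes one MAC per output frame
    return params, macs_per_frame

def tcs_conv1d_cost(c_in, c_out, k):
    """Depthwise (time) convolution followed by pointwise (channel) convolution."""
    depthwise = c_in * k           # one k-wide filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution mixing channels
    params = depthwise + pointwise
    return params, params

dense = conv1d_cost(256, 256, 33)
tcs = tcs_conv1d_cost(256, 256, 33)
print(dense)  # (2162688, 2162688)
print(tcs)    # (73984, 73984)
```

For these example sizes the TCS layer needs roughly 29× fewer weights and per-frame operations than the dense layer, which is consistent with the order-of-magnitude differences in the model size and operations-per-frame rows above.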