
Table 4 Comparison of ASR systems

From: Performance vs. hardware requirements in state-of-the-art automatic speech recognition

| ASR system | Kaldi TDNN | Kaldi CNN-TDNN | PaddlePaddle DeepSpeech2 | Facebook CNN-ASG | Facebook TDS-S2S | RWTH Returnn | Nvidia Jasper | Nvidia QuartzNet |
|---|---|---|---|---|---|---|---|---|
| System type | HMM-based | HMM-based | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN |
| Multi-component vs. single NN | AM (NN) + PD + LM + [Op. LM rescr.] | AM (NN) + PD + LM + [Op. LM rescr.] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + BPE encoding list + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] |
| Speech features | 3 frames x 40 MFCCs + 100 i-vectors | 1 frame x 40 fbanks + 200 i-vectors | 160 frames x 160 log-spectrograms | 240 frames x 40 Mel-fbanks | 240 frames x 80 Mel-fbanks | 200 frames x 40 MFCCs | 160 frames x 64 log-fbanks | 160 frames x 64 Mel-spectrograms |
| NN architecture | Time delay | Conv. + time delay | Conv. + Bi-RNN | Conv. + GLU | TDS-GRU Enc.-Dec. | LSTM Enc.-Dec. + Attention | Time delay | Time-channel separable conv. |
| Layers | 1 x TDNN, 16 x TDNN-F | 6 x CNN, 12 x TDNN-F | 2 x CNN, 2 x Bi-RNN | 17 x CNN, 1 x GLU | 1 x CNN + 2 x TDS, 1 x CNN + 3 x TDS, 1 x CNN + 6 x TDS, 1 x GRU | 6 x LSTM, 1 x Attention | 1 x CNN, 10 x dense residual, 3 x CNN | 1 x CNN, 15 x TCS conv., 2 x CNN, 1 x 1x1 conv. |
| Output | AM: 6k posteriors + LM: 200k words | AM: 6k posteriors + LM: 200k words | NN: 30 chars (* words) + LM: 200k words | NN: * words + LM: 200k words | NN: 10k word parts (* words) | NN: 10k word parts (* words) | NN: 28 characters + LM: 200k words | NN: 28 characters + LM: 200k words |
| Loss function | LF-MMI + CE | LF-MMI + CE | CTC | ASG | S2S Attention | S2S Attention + CTC | CTC | CTC |
| Model size [10^6] | 20 | 18 | 49 | 208 | 38 | 187 | 333 | 18.8 |
| Operations per frame [10^6] | 41 | 63 | 105 | 22k | 15 | 125 | 42k | 1.8k |
| Activations per frame [10^3] | 44 | 51 | 13 | 1k | 9 | 38k | 3k | 3.5k |
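The per-frame operation counts above follow from each network's layer dimensions. As an illustrative sketch only (not the paper's counting methodology), the multiply-accumulate (MAC) cost of a single 1-D convolutional layer, of the kind stacked in Jasper and QuartzNet, can be estimated as follows; the layer sizes in the example are hypothetical, not taken from the table:

```python
def conv1d_macs_per_frame(in_channels: int, out_channels: int,
                          kernel_size: int, stride: int = 1) -> int:
    """Estimate multiply-accumulate operations per input frame of a 1-D conv layer.

    Each output frame costs in_channels * kernel_size MACs per output channel;
    a stride > 1 produces proportionally fewer output frames per input frame.
    """
    return in_channels * out_channels * kernel_size // stride

# Hypothetical layer: 64 input channels, 256 output channels, kernel width 11.
macs = conv1d_macs_per_frame(in_channels=64, out_channels=256, kernel_size=11)
print(f"{macs / 1e6:.2f} M MACs per frame")  # a single layer; full models stack dozens
```

Summing such estimates over all layers (plus recurrent and attention terms where present) is one way to arrive at figures on the scale of the "Operations per frame" row.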
  1. The systems are compared in terms of (i) type: hybrid (HMM-based) versus end-to-end neural network (E2E NN); (ii) components: multi-component versus a single neural network with an additional, optional language model; (iii) speech features; (iv) neural network architecture, including the loss function; (v) output size and type; and (vi) model complexity, expressed as model size, number of activations, and number of operations required to process one frame of speech. Each system is thus characterized in terms of architecture, complexity, and hardware requirements
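The complexity columns can be read jointly: dividing operations per frame by model size gives a rough indicator of how heavily each parameter is reused per frame. This ratio is our own illustrative reading, not a metric from the paper; the numbers below are copied from the table:

```python
# (model size [10^6 params], operations per frame [10^6]) from Table 4.
systems = {
    "Kaldi TDNN": (20, 41),
    "Kaldi CNN-TDNN": (18, 63),
    "PaddlePaddle DeepSpeech2": (49, 105),
    "Facebook CNN-ASG": (208, 22_000),
    "Facebook TDS-S2S": (38, 15),
    "RWTH Returnn": (187, 125),
    "Nvidia Jasper": (333, 42_000),
    "Nvidia QuartzNet": (18.8, 1_800),
}

# Sort by operations per parameter per frame, ascending.
for name, (params_m, ops_m) in sorted(systems.items(),
                                      key=lambda kv: kv[1][1] / kv[1][0]):
    print(f"{name:25s} {ops_m / params_m:8.2f} ops/param/frame")
```

Under this reading, the deep convolutional models (Jasper, QuartzNet, CNN-ASG) reuse each parameter far more often per frame than the recurrent and TDNN-based systems.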
  2. Abbreviations used in the table: AM Acoustic Model, ASG Auto Segmentation Criterion, ASR Automatic Speech Recognition, BPE Byte-Pair Encoding, CE Cross-Entropy, Conv. Convolutional, CNN Convolutional Neural Network, CTC Connectionist Temporal Classification, Dec. Decoder, E2E End-to-End, Enc. Encoder, GLU Gated Linear Unit, GRU Gated Recurrent Unit, HMM Hidden Markov Model, LF-MMI Lattice-Free Maximum Mutual Information, LM Language Model, LSTM Long Short-Term Memory, MFCC Mel-Frequency Cepstral Coefficient, NN Neural Network, OOV Out-Of-Vocabulary, Op. Optional, PD Phonetic Dictionary, rescr. rescoring, RNN Recurrent Neural Network, RWTH Rheinisch-Westfälische Technische Hochschule Aachen (Aachen University), S2S Sequence-to-Sequence, TCS Time-Channel Separable, TDNN Time-Delay Neural Network, TDNN-F Factored Time-Delay Neural Network, TDS Time-Depth Separable