From: *Performance vs. hardware requirements in state-of-the-art automatic speech recognition*
ASR system | Kaldi TDNN | Kaldi CNN-TDNN | PaddlePaddle DeepSpeech2 | Facebook CNN-ASG | Facebook TDS-S2S | RWTH Returnn | Nvidia Jasper | Nvidia QuartzNet |
---|---|---|---|---|---|---|---|---|
System type | HMM-based | HMM-based | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN | E2E NN |
Multi-component vs. single NN | AM (NN) + PD + LM + [Op. LM rescr.] | AM (NN) + PD + LM + [Op. LM rescr.] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] | Single NN + BPE encoding list + [Op. LM] | Single NN + [Op. LM] | Single NN + [Op. LM] |
Speech features | 3 frames x 40 MFCCs + 100 i-vectors | 1 frame x 40 fbanks + 200 i-vectors | 160 frames x 160 log-spectrograms | 240 frames x 40 Mel-fbanks | 240 frames x 80 Mel-fbanks | 200 frames x 40 MFCC | 160 frames x 64 Log-fbanks | 160 frames x 64 Mel-spectrograms |
NN architecture | Time delay | Conv. + Time delay | Conv. + Bi-RNN | Conv. + GLU | TDS-GRU Enc.-Dec. | LSTM Enc.-Dec. + Attention | Time delay | Time-channel separable conv. |
Layers | 1 x TDNN, 16 x TDNN-F | 6 x CNN, 12 x TDNN-F | 2 x CNN, 2 x Bi-RNN | 17 x CNN, 1 x GLU | 1 x CNN + 2 x TDS, 1 x CNN + 3 x TDS, 1 x CNN + 6 x TDS, 1 x GRU | 6 x LSTM, 1 x Attention | 1 x CNN, 10 x Dense residual, 3 x CNN | 1 x CNN, 15 x TCS conv., 2 x CNN, 1 x 1 conv. |
Output | AM: 6k posteriors + LM: 200k words | AM: 6k posteriors + LM: 200k words | NN: 30 chars + LM: 200k words | NN: * words | NN: * words | NN: 10k word parts | NN: 28 characters + LM: 200k words | NN: 28 characters + LM: 200k words |
Loss function | LF-MMI + CE | LF-MMI + CE | CTC | ASG | S2S Attention | S2S Attention + CTC | CTC | CTC |
Model size [×10⁶] | 20 | 18 | 49 | 208 | 38 | 187 | 333 | 18.8 |
Operations per frame [×10⁶] | 41 | 63 | 105 | 22k | 15 | 125 | 42k | 1.8k |
Activations per frame [×10³] | 44 | 51 | 13 | 1k | 9 | 38k | 3k | 3.5k |
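As a rough illustration (not part of the source table), the operations-per-frame figures can be converted into an estimated compute load per second of audio. The sketch below assumes a feature rate of 100 frames per second (a common 10 ms frame shift); the systems above use different frame shifts and strides, so treat the results as order-of-magnitude estimates only.

```python
# Back-of-the-envelope compute estimate from the table's "Operations per
# frame" row. The 100 frames/second rate is an assumption (typical 10 ms
# frame shift), not a figure from the source.

OPS_PER_FRAME_M = {  # operations per frame, in millions (from the table)
    "Kaldi TDNN": 41,
    "Kaldi CNN-TDNN": 63,
    "PaddlePaddle DeepSpeech2": 105,
    "Facebook CNN-ASG": 22_000,
    "Facebook TDS-S2S": 15,
    "RWTH Returnn": 125,
    "Nvidia Jasper": 42_000,
    "Nvidia QuartzNet": 1_800,
}

FRAMES_PER_SECOND = 100  # assumed 10 ms frame shift


def gops_per_second(system: str) -> float:
    """Estimated giga-operations per second of audio processed."""
    return OPS_PER_FRAME_M[system] * FRAMES_PER_SECOND / 1_000


for name in sorted(OPS_PER_FRAME_M, key=OPS_PER_FRAME_M.get):
    print(f"{name:26s} ~{gops_per_second(name):8.1f} GOPS per second of audio")
```

Under this assumption the spread is striking: Facebook TDS-S2S lands around 1.5 GOPS per second of audio while Nvidia Jasper needs roughly 4,200 GOPS, a difference of about three orders of magnitude for systems of broadly comparable accuracy.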