From: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
| Methods | System | WER (%) |
|---|---|---|
| Referenced traditional systems | TC-DNN + BLSTM-DNN [32] | 3.47 |
| | CNN + RAW speech [33] | 5.6 |
| Referenced end-to-end systems | Deep Speech 2 (extra 11,940 h of labeled English data) [34] | 3.6 |
| | Joint CTC-attention + char-LM + word-LM [35] | 5.6 |
| | Pyramidal encoder + attention + label smoothing [36] | 6.7 |
| | LAS grapheme model + RNN grapheme LM [37] | 6.9 |
| | Attention + extended trigram LM [38] | 9.3 |
| Our end-to-end systems | VGG + BLSTM + add attention + word-LM (baseline/K0) | 4.7 |
| | High-level features + joint CTC-attention + word-LM (K1) | 4.3 |
| | K1 + multi-level location-based attention (K2) | 4.1 |
| | K1 + multi-head location-based attention (K3) | 4.1 |
| | K1 + multi-level multi-head location-based attention (K4) | 3.8 |
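As a quick sanity check on the table, the gain of the best proposed system (K4, 3.8% WER) over the baseline (K0, 4.7% WER) can be expressed as a relative WER reduction. The sketch below is purely illustrative; `relative_wer_reduction` is a hypothetical helper, not code from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction (%) of a system over a baseline.

    Both arguments are word error rates in percent, e.g. 4.7 for 4.7% WER.
    """
    return (baseline_wer - system_wer) / baseline_wer * 100.0

# K0 baseline (4.7) vs. best proposed system K4 (3.8), values from the table
print(f"{relative_wer_reduction(4.7, 3.8):.1f}%")  # roughly a 19% relative reduction
```

By the same measure, K1 alone (4.3) already recovers about half of that gain over K0, with the multi-level and multi-head attention variants (K2–K4) contributing the rest.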