From: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
| Category | Method | WER (%) |
|---|---|---|
| Referenced traditional systems | IBM CAPIO [40] | 3.19 |
| | 17-layer TDNN + iVectors [41] | 3.80 |
| Referenced end-to-end systems | End-to-end CNN on the waveform + conv LM [42] | 3.44 |
| | Deep Speech 2 (extra 11,940 h of labeled English data) [34] | 5.33 |
| Our end-to-end systems | VGG + BLSTM + additive attention + word-LM (baseline/J0) | 4.3 |
| | High-level features + joint CTC-attention + word-LM (J1) | 4.0 |
| | J1 + multi-level location-based attention (J2) | 3.8 |
| | J1 + multi-head location-based attention (J3) | 3.8 |
| | J1 + multi-level multi-head location-based attention (J4) | 3.6 |
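For readers unfamiliar with the location-based attention that distinguishes J2-J4 from the baseline, the following is a minimal NumPy sketch of one attention step in the style of Chorowski et al.'s location-aware attention, plus a multi-head variant in which several independent heads attend in parallel and their context vectors are concatenated. All dimensions, parameter names, and initializations below are illustrative assumptions, not the paper's configuration; the multi-level aspect (attending over encoder features at more than one level) is not shown.

```python
# Sketch of location-based (location-aware) attention, assuming the standard
# formulation e_i = w^T tanh(W s_t + V h_i + U f_i + b), where f_i are
# convolutional features of the previous alignment. Illustrative only.
import numpy as np

def location_attention(s_t, H, a_prev, W, V, U, F, w, b):
    """One attention step.

    s_t    : (d_s,)      current decoder state
    H      : (T, d_h)    encoder outputs
    a_prev : (T,)        attention weights from the previous step
    F      : (k, n_filt) 1-D conv filters over a_prev (location features)
    Returns the context vector (d_h,) and the new weights (T,).
    """
    T, k = H.shape[0], F.shape[0]
    a_pad = np.pad(a_prev, k // 2)                        # zero-pad for same-size conv
    f = np.stack([a_pad[i:i + k] @ F for i in range(T)])  # (T, n_filt)
    e = np.tanh(s_t @ W + H @ V + f @ U + b) @ w          # scores over frames
    a = np.exp(e - e.max())
    a /= a.sum()                                          # softmax over encoder frames
    return a @ H, a

# Multi-head use (as in J3/J4): independent parameters per head, contexts
# concatenated. In a real decoder each head keeps its own alignment a_prev
# across decoding steps; here a single step is shown.
rng = np.random.default_rng(0)
d_s, d_h, d_a, T, k, n_filt, n_heads = 320, 320, 128, 50, 11, 10, 4
H = rng.standard_normal((T, d_h))
s_t = rng.standard_normal(d_s)
a_prev = np.full(T, 1.0 / T)                              # uniform initial alignment

contexts = []
for _ in range(n_heads):
    params = [rng.standard_normal(shape) * 0.1 for shape in
              [(d_s, d_a), (d_h, d_a), (n_filt, d_a), (k, n_filt), (d_a,)]]
    c, _ = location_attention(s_t, H, a_prev, *params, np.zeros(d_a))
    contexts.append(c)

context = np.concatenate(contexts)                        # (n_heads * d_h,)
print(context.shape)                                      # (1280,)
```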