Table 4 Comparison of word error rates (WER) on the WSJ “eval92” test set

From: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

| Methods | WER (%) |
| --- | --- |
| **Referenced traditional systems** | |
| TC-DNN + BLSTM-DNN [32] | 3.47 |
| CNN + raw speech [33] | 5.6 |
| **Referenced end-to-end systems** | |
| Deep Speech 2 (extra 11,940 h of labeled English data) [34] | 3.6 |
| Joint CTC-attention + char-LM + word-LM [35] | 5.6 |
| Pyramidal encoder + attention + label smoothing [36] | 6.7 |
| LAS grapheme model + RNN grapheme LM [37] | 6.9 |
| Attention + extended trigram LM [38] | 9.3 |
| **Our end-to-end systems** | |
| VGG + BLSTM + additive attention + word-LM (baseline, K0) | 4.7 |
| High-level features + joint CTC-attention + word-LM (K1) | 4.3 |
| K1 + multi-level location-based attention (K2) | 4.1 |
| K1 + multi-head location-based attention (K3) | 4.1 |
| K1 + multi-level multi-head location-based attention (K4) | 3.8 |