Table 4 Comparison of word error rates (WER) on the WSJ “eval92” test set

From: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

| Methods | WER (%) |
| --- | --- |
| **Referenced traditional systems** | |
| TC-DNN + BLSTM-DNN [32] | 3.47 |
| CNN + raw speech [33] | 5.6 |
| **Referenced end-to-end systems** | |
| Deep Speech 2 (extra 11,940 h of labeled English data) [34] | 3.6 |
| Joint CTC-attention + char-LM + word-LM [35] | 5.6 |
| Pyramidal encoder + attention + label smoothing [36] | 6.7 |
| LAS grapheme model + RNN grapheme LM [37] | 6.9 |
| Attention + extended trigram LM [38] | 9.3 |
| **Our end-to-end systems** | |
| VGG + BLSTM + additive attention + word-LM (baseline, K0) | 4.7 |
| High-level features + joint CTC-attention + word-LM (K1) | 4.3 |
| K1 + multi-level location-based attention (K2) | 4.1 |
| K1 + multi-head location-based attention (K3) | 4.1 |
| K1 + multi-level multi-head location-based attention (K4) | 3.8 |