From: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
| Category | Method | WER (%) |
|---|---|---|
| Referenced traditional systems | IBM CAPIO [40] | 3.19 |
| | 17-layer TDNN + iVectors [41] | 3.80 |
| Referenced end-to-end systems | End-to-end CNN on the waveform + conv LM [42] | 3.44 |
| | Deep Speech 2 (extra 11,940 h of labeled English data) [34] | 5.33 |
| Our end-to-end systems | VGG + BLSTM + additive attention + word-LM (baseline/J0) | 4.3 |
| | High-level features + joint CTC-attention + word-LM (J1) | 4.0 |
| | J1 + multi-level location-based attention (J2) | 3.8 |
| | J1 + multi-head location-based attention (J3) | 3.8 |
| | J1 + multi-level multi-head location-based attention (J4) | 3.6 |
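For readers unfamiliar with the location-based attention that distinguishes J2-J4 from the baseline, the following is a minimal NumPy sketch of one attention step in the style of Chorowski et al.'s location-aware attention, plus a multi-head variant in which several independent heads attend in parallel and their context vectors are concatenated. All dimensions, parameter names, and initializations below are illustrative assumptions, not the paper's configuration; the multi-level aspect (attending over encoder features at more than one level) is not shown.

```python
# Sketch of location-based (location-aware) attention, assuming the standard
# formulation e_i = w^T tanh(W s_t + V h_i + U f_i + b), where f_i are
# convolutional features of the previous alignment. Illustrative only.
import numpy as np

def location_attention(s_t, H, a_prev, W, V, U, F, w, b):
    """One attention step.

    s_t    : (d_s,)      current decoder state
    H      : (T, d_h)    encoder outputs
    a_prev : (T,)        attention weights from the previous step
    F      : (k, n_filt) 1-D conv filters over a_prev (location features)
    Returns the context vector (d_h,) and the new weights (T,).
    """
    T, k = H.shape[0], F.shape[0]
    a_pad = np.pad(a_prev, k // 2)                        # zero-pad for same-size conv
    f = np.stack([a_pad[i:i + k] @ F for i in range(T)])  # (T, n_filt)
    e = np.tanh(s_t @ W + H @ V + f @ U + b) @ w          # scores over frames
    a = np.exp(e - e.max())
    a /= a.sum()                                          # softmax over encoder frames
    return a @ H, a

# Multi-head use (as in J3/J4): independent parameters per head, contexts
# concatenated. In a real decoder each head keeps its own alignment a_prev
# across decoding steps; here a single step is shown.
rng = np.random.default_rng(0)
d_s, d_h, d_a, T, k, n_filt, n_heads = 320, 320, 128, 50, 11, 10, 4
H = rng.standard_normal((T, d_h))
s_t = rng.standard_normal(d_s)
a_prev = np.full(T, 1.0 / T)                              # uniform initial alignment

contexts = []
for _ in range(n_heads):
    params = [rng.standard_normal(shape) * 0.1 for shape in
              [(d_s, d_a), (d_h, d_a), (n_filt, d_a), (k, n_filt), (d_a,)]]
    c, _ = location_attention(s_t, H, a_prev, *params, np.zeros(d_a))
    contexts.append(c)

context = np.concatenate(contexts)                        # (n_heads * d_h,)
print(context.shape)                                      # (1280,)
```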