From: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
| System type | Method | PER (%) |
|---|---|---|
| Referenced traditional systems | Kaldi's DNN-HMM | 18.5 |
| | Hierarchical maxout CNN [26] | 16.5 |
| Referenced end-to-end systems | Hierarchical CNNs with CTC [27] | 18.2 |
| | RNN transducer initialized with CTC + weight noise [28] | 17.7 |
| | fMLLR + attention + weight noise [3] | 17.6 |
| | fMLLR + RNN + CRF [29] | 17.3 |
| Our end-to-end systems | Transferred high-level features + joint CTC-attention + RNN-LM (P0) [17] | 16.59 |
| | P0 + multi-level location-based attention (P1) | 16.42 |
| | P0 + multi-head location-based attention (P2) | 16.51 |
| | P0 + multi-level multi-head location-based attention (P3) | 16.34 |