TY  - JOUR
AU  - Qin, Chu-Xiong
AU  - Zhang, Wen-Lin
AU  - Qu, Dan
PY  - 2019
DA  - 2019/10/28
TI  - A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
JO  - EURASIP Journal on Audio, Speech, and Music Processing
SP  - 18
VL  - 2019
IS  - 1
AB  - A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing focus and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model could force extra restrictions on alignments. To explore better the end-to-end models, we propose improvements to the feature extraction and attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism by incorporating multi-head attentions and calculating attention scores over multi-level outputs. Experiments on TIMIT indicate that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method exhibits a word error rate (WER) that is only 0.2% worse in absolute value than the best referenced method, which is trained on a much larger dataset, and it beats all present end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable to the state-of-the-art end-to-end system in WER.
SN  - 1687-4722
UR  - https://doi.org/10.1186/s13636-019-0161-0
DO  - 10.1186/s13636-019-0161-0
ID  - Qin2019
ER  - 