From: Advanced recurrent network-based hybrid acoustic models for low resource speech recognition
Training time per epoch is given in hours for each language; the last column is the decoding RTF.

Model | 101 Cantonese | 104 Pashto | 107 Vietnamese | 202 Swahili | 204 Tamil | 302 Kazakh | 404 Georgian | Decoding RTF |
---|---|---|---|---|---|---|---|---|
DNN-fbank | 2.35 | 1.32 | 1.22 | 0.73 | 1.04 | 0.69 | 0.70 | 1.65 |
LSTM-fbank | 5.33 | 3.17 | 3.22 | 2.03 | 2.93 | 1.99 | 1.99 | 2.02 |
BLSTM-fbank | 12.79 | 8.40 | 7.69 | 4.92 | 7.00 | 4.85 | 4.88 | 2.77 |
LW-BLSTM-fbank | 8.01 | 4.71 | 4.83 | 3.06 | 4.37 | 3.03 | 3.04 | 2.31 |
LW-BrLSTM-fbank | 9.08 | 5.26 | 5.47 | 3.40 | 4.99 | 3.43 | 3.48 | 2.41 |
LW-BGRU-fbank | 7.10 | 4.11 | 4.21 | 2.75 | 3.83 | 2.69 | 2.72 | 2.27 |
LW-BrGRU-fbank | 8.03 | 4.68 | 4.92 | 3.13 | 4.34 | 3.06 | 3.10 | 2.36 |
CNN-fbank [21] | 7.07 | 4.61 | 4.22 | – | 3.70 | – | – | – |
CMNN-fbank [21] | 7.09 | 4.62 | 4.30 | – | 3.13 | – | – | – |
RMNN-fbank [21] | 4.78 | 2.93 | 3.00 | – | 2.19 | – | – | – |