Advanced recurrent network-based hybrid acoustic models for low resource speech recognition

EURASIP Journal on Audio, Speech, and Music Processing

Table 14 Training time per epoch (hours) and decoding real-time factor (RTF) of all models for all languages

Model	Training time per epoch (hours)							Decoding RTF
	101 Cantonese	104 Pashto	107 Vietnamese	202 Swahili	204 Tamil	302 Kazakh	404 Georgian
DNN-fbank	2.35	1.32	1.22	0.73	1.04	0.69	0.70	1.65
LSTM-fbank	5.33	3.17	3.22	2.03	2.93	1.99	1.99	2.02
BLSTM-fbank	12.79	8.40	7.69	4.92	7.00	4.85	4.88	2.77
LW-BLSTM-fbank	8.01	4.71	4.83	3.06	4.37	3.03	3.04	2.31
LW-BrLSTM-fbank	9.08	5.26	5.47	3.40	4.99	3.43	3.48	2.41
LW-BGRU-fbank	7.10	4.11	4.21	2.75	3.83	2.69	2.72	2.27
LW-BrGRU-fbank	8.03	4.68	4.92	3.13	4.34	3.06	3.10	2.36
CNN-fbank [21]	7.07	4.61	4.22	–	3.70	–	–	–
CMNN-fbank [21]	7.09	4.62	4.30	–	3.13	–	–	–
RMNN-fbank [21]	4.78	2.93	3.00	–	2.19	–	–	–
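
The two quantities in Table 14 can be related with a short sketch. It assumes the common definition of the real-time factor (decoding time divided by the duration of the decoded audio), which the table itself does not spell out, and it compares the per-epoch training time of LW-BLSTM-fbank against BLSTM-fbank using the values copied from the rows above. The function and variable names are illustrative and do not come from the paper.

```python
# Training time per epoch (hours), copied from Table 14 for two models.
LANGUAGES = [
    "101 Cantonese", "104 Pashto", "107 Vietnamese", "202 Swahili",
    "204 Tamil", "302 Kazakh", "404 Georgian",
]
BLSTM_HOURS = [12.79, 8.40, 7.69, 4.92, 7.00, 4.85, 4.88]
LW_BLSTM_HOURS = [8.01, 4.71, 4.83, 3.06, 4.37, 3.03, 3.04]


def real_time_factor(decoding_seconds, audio_seconds):
    """Assumed definition: RTF = decoding time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return decoding_seconds / audio_seconds


def relative_time(model_hours, baseline_hours):
    """Per-language ratio of a model's training time to a baseline's."""
    return [m / b for m, b in zip(model_hours, baseline_hours)]


# Ratio below 1.0 means the model trains faster per epoch than the baseline.
ratios = relative_time(LW_BLSTM_HOURS, BLSTM_HOURS)
for language, ratio in zip(LANGUAGES, ratios):
    print(f"{language}: {ratio:.2f}")
```

Under the assumed definition, an RTF above 1 means decoding takes longer than the audio lasts, so a call such as real_time_factor(231.0, 100.0) corresponds to the 2.31 reported for LW-BLSTM-fbank only if 100 seconds of audio took 231 seconds to decode; those two durations are hypothetical inputs, not measurements from the paper.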