Performance vs. hardware requirements in state-of-the-art automatic speech recognition

EURASIP Journal on Audio, Speech, and Music Processing

Table 3 Comparison of the most popular speech datasets used for ASR evaluation

ASR task	Speech type	Size [h]	# of speakers	Framework
				K	P	W	R	N
LibriSpeech [72]	read speech	960	∼2400	✓	✓	✓	✓	✓
WSJ [73]	read speech	80	284	✓		✓	✓
TED-LIUM2 [74]	TED talks	207	1242	✓			✓
Switchboard [75]	conversational telephone speech	300	543	✓			✓
Fisher [76]	conversational telephone speech	2742	∼12400	✓

We compare the type of speech and the dataset size, expressed in hours of speech and number of speakers. A check mark indicates that a recipe for the dataset is available in the corresponding ASR framework: K - Kaldi; P - PaddlePaddle DeepSpeech; W - Wav2Letter; R - RWTH Returnn; N - Nvidia (OpenSeq2Seq & NeMo)
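For readers who want to query the table programmatically, the following is a minimal Python sketch that transcribes the size, speaker, and framework columns of Table 3. The names `DATASETS`, `datasets_with_recipe`, and `hours_per_speaker` are illustrative assumptions and do not come from the paper or from any of the listed frameworks. The speaker counts marked with ∼ in the table are stored as plain integers and remain approximate.

```python
# Table 3 transcribed as: dataset -> (size in hours, number of speakers, framework codes).
# Framework codes follow the table legend: K, P, W, R, N.
# Speaker counts for LibriSpeech and Fisher are approximate in the table.
DATASETS = {
    "LibriSpeech": (960, 2400, "KPWRN"),
    "WSJ": (80, 284, "KWR"),
    "TED-LIUM2": (207, 1242, "KR"),
    "Switchboard": (300, 543, "KR"),
    "Fisher": (2742, 12400, "K"),
}


def datasets_with_recipe(framework):
    """Return the datasets that have a recipe in the given framework code."""
    return [name for name, (_, _, codes) in DATASETS.items() if framework in codes]


def hours_per_speaker(name):
    """Return the average amount of speech per speaker, in hours."""
    hours, speakers, _ = DATASETS[name]
    return hours / speakers


if __name__ == "__main__":
    print(datasets_with_recipe("W"))
    print(round(hours_per_speaker("LibriSpeech"), 2))
```

Such a lookup makes it easy to see, for example, that only LibriSpeech is covered by all five frameworks, whereas Fisher has a recipe in Kaldi alone.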