| ASR task | Speech type | Size [h] | # of speakers | K | P | W | R | N |
|---|---|---|---|---|---|---|---|---|
| LibriSpeech [72] | read speech | 960 | ∼2400 | ✓ | ✓ | ✓ | ✓ | ✓ |
| WSJ [73] | read speech | 80 | 284 | ✓ | | ✓ | ✓ | |
| TED-LIUM2 [74] | TED talks | 207 | 1242 | ✓ | | | ✓ | |
| Switchboard [75] | conversational telephone speech | 300 | 543 | ✓ | | | ✓ | |
| Fisher [76] | conversational telephone speech | 2742 | ∼12400 | ✓ | | | | |
- We compare the type of speech and the dataset size, expressed in hours of speech and number of speakers. The last five columns mark the ASR frameworks that provide a recipe for each task: K - Kaldi; P - PaddlePaddle DeepSpeech; W - Wav2Letter; R - RWTH Returnn; N - Nvidia (OpenSeq2Seq & NeMo).