Deep neural networks for automatic speech processing: a survey from large corpora to limited data

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 SOTA results over test-clean set from LibriSpeech and quantity of data used. Some self-supervised results are provided from [7]

Model type	Quantity of data used		WER
	Pre-training	Training
Pase+ [8]	50h	960h	16.62
Wav2Vec2.0 [9]	960h		4.79
	60k h		3.10
HuBERT [10]	960h		4.79
	60k h		2.94
Hybrid model [11]	-		2.7
End to end supervised [12]	-		2.44
Wav2Vec2.0 using conformers and spec augment [13]	60k h		1.4
Wav2Vec using BERT XXL [14]	60k h		1.4