Table 6 Comparison of ASR systems in terms of performance

From: Performance vs. hardware requirements in state-of-the-art automatic speech recognition

| ASR system | WER [%], no LM, test-clean | WER [%], no LM, test-other | WER [%], n-gram LM, test-clean | WER [%], n-gram LM, test-other |
|---|---|---|---|---|
| Kaldi TDNN [64] | – | – | 3.85 | 9.57 |
| Kaldi CNN-TDNN [65] | – | – | 3.87 | 9.42 |
| PaddlePaddle DeepSpeech2 [66] | 10.70 | 30.00 | 6.03 | 20.29 |
| RWTH Returnn [67] | 4.71 | 15.17 | 4.67 | 15.16 |
| Facebook CNN-ASG [68] | – | – | 4.82 | 14.54 |
| Facebook TDS-S2S [69] | 5.36 | 15.64 | 4.21 | 11.87 |
| Nvidia Jasper [70] | 3.86 | 11.93 | 3.19 | 9.03 |
| Nvidia QuartzNet [71] | 3.90 | 11.28 | 2.98 | 8.38 |

  1. Performance is expressed in terms of the word error rate (lower is better). The evaluation is performed on two LibriSpeech subsets: test-clean and test-other. For the frameworks that support it, the evaluation is performed in two scenarios: with and without an external language model
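The word error rate reported above is the word-level edit distance between the reference transcript and the system hypothesis, normalized by the number of reference words. A minimal sketch of the standard computation (the `wer` function is illustrative, not taken from any of the listed toolkits):

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6.
print(round(100 * wer("the cat sat on the mat", "the cat sat on mat"), 2))
```

Because WER is a ratio of error counts to reference length, a lower value is better, and values above 100% are possible when the hypothesis contains many insertions.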