Table 2 Main differences between pipeline ASR and end-to-end ASR in terms of architecture, decoding strategy, input and output data

From: Performance vs. hardware requirements in state-of-the-art automatic speech recognition

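                    | Pipeline ASR                                    | End-to-end ASR
--------------------+-------------------------------------------------+----------------------------------------------------------
System architecture | multi-component:                                | single component:
                    |   acoustic model (usually neural network)       |   neural network for both acoustic and language modeling
                    |   language model (usually probabilistic)        |   + optional more complex language model
                    |   phonetic dictionary                           |
--------------------+-------------------------------------------------+----------------------------------------------------------
Decoding strategy   | Weighted Finite State Transducer (WFST) decoder | Connectionist Temporal Classification (CTC) decoder
                    |                                                 | Sequence-to-sequence attention decoder
--------------------+-------------------------------------------------+----------------------------------------------------------
Input               | always hand-crafted features                    | raw waveform or sometimes hand-crafted features
                    | + optional learned features                     |
--------------------+-------------------------------------------------+----------------------------------------------------------
Output              | context dependent phones and then words         | characters or word-parts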
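
As an illustration of the CTC decoding strategy listed for end-to-end ASR, the following is a minimal sketch of greedy (best-path) CTC decoding in Python. It is not taken from the paper: the function name ctc_greedy_decode, the toy alphabet, and the choice of index 0 for the blank symbol are all assumptions made for this example, and practical systems often use beam search, optionally with an external language model, instead of the greedy rule shown here.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores: one list of per-symbol scores for each audio frame.
    alphabet:     output symbols (characters or word-parts), indexed like the scores.
    """
    # Pick the highest-scoring symbol index for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge runs of the same index into a single occurrence.
    collapsed = [index for index, _ in groupby(best_path)]
    # Remove blanks; a blank between two identical symbols keeps both of them.
    return "".join(alphabet[index] for index in collapsed if index != blank)
```

With the hypothetical alphabet ["-", "a", "c", "t"], a frame-wise best path of c c - a a - t t collapses to "cat", which shows how the network emits characters directly, without the phonetic dictionary that a pipeline system needs.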