From: Performance vs. hardware requirements in state-of-the-art automatic speech recognition
 | Pipeline ASR | End-to-end ASR |
---|---|---|
System architecture | multi-component: | single component: |
 | acoustic model (usually neural network) | neural network for both acoustic and language modeling |
 | language model (usually probabilistic) | + optional more complex language model |
 | phonetic dictionary |  |
Decoding strategy | Weighted Finite State Transducer (WFST) decoder | Connectionist Temporal Classification (CTC) decoder |
 |  | Sequence-to-sequence attention decoder |
Input | always hand-crafted features | raw waveform or sometimes hand-crafted features |
 | + optional learned features |  |
Output | context dependent phones and then words | characters or word-parts |