
Table 2. Main differences between pipeline ASR and end-to-end ASR in terms of architecture, decoding strategy, input and output data

From: Performance vs. hardware requirements in state-of-the-art automatic speech recognition

 

| | Pipeline ASR | End-to-end ASR |
|---|---|---|
| System architecture | multi-component: | single component: |
| | acoustic model (usually neural network) | neural network for both acoustic and language modeling |
| | language model (usually probabilistic) | + optional more complex language model |
| | phonetic dictionary | |
| Decoding strategy | Weighted Finite State Transducer (WFST) decoder | Connectionist Temporal Classification (CTC) decoder |
| | | Sequence-to-sequence attention decoder |
| Input | always hand-crafted features | raw waveform or sometimes hand-crafted features |
| | + optional learned features | |
| Output | context dependent phones and then words | characters or word-parts |
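
To make the CTC decoder entry in the table more concrete, the following is a minimal sketch of greedy (best-path) CTC decoding, which turns per-frame network scores directly into characters. It is an illustration only, not the decoder used in the article: the function name `ctc_greedy_decode`, the blank symbol `"_"`, and the toy alphabet and scores are all assumptions made for this example. Practical systems usually rely on beam search, optionally combined with a language model.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch only


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding.

    frame_scores: list of frames, each a list with one score per alphabet symbol.
    alphabet: list of output symbols (characters here), including the blank.
    """
    # 1. Pick the highest-scoring symbol independently for every frame.
    best_path = [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # 2. Collapse runs of identical consecutive symbols into one symbol.
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal characters keeps both.
    return "".join(symbol for symbol in collapsed if symbol != blank)


# Toy example (made-up scores): frames vote for c, c, blank, a, blank, t.
alphabet = ["_", "c", "a", "t"]
frame_scores = [
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.20, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
decoded = ctc_greedy_decode(frame_scores, alphabet)
print(decoded)
```

The sketch emits characters straight from the acoustic frames without a phonetic dictionary or a separate decoding graph, which is the contrast with the WFST decoder that the table draws.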