Performance vs. hardware requirements in state-of-the-art automatic speech recognition

EURASIP Journal on Audio, Speech, and Music Processing

Table 2 Main differences between pipeline ASR and end-to-end ASR in terms of architecture, decoding strategy, input and output data

	Pipeline ASR	End-to-end ASR
System architecture	multi-component:	single component:
	acoustic model (usually neural network)	neural network for both acoustic and language modeling
	language model (usually probabilistic)	+ optional more complex language model
	phonetic dictionary
Decoding strategy	Weighted Finite State Transducer (WFST) decoder	Connectionist Temporal Classification (CTC) decoder
		Sequence-to-sequence attention decoder
Input	always hand-crafted features	raw waveform or sometimes hand-crafted features
	+ optional learned features
Output	context-dependent phones and then words	characters or word-parts
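
To illustrate the CTC decoding strategy listed for end-to-end ASR in Table 2, the following is a minimal sketch of greedy (best-path) CTC decoding: take the most likely label for each frame, merge consecutive repeats, then drop the blank symbol. It is not the decoder of any particular toolkit. The function name `ctc_greedy_decode`, the blank symbol `_`, and the toy alphabet and scores are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch only


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding over per-frame label scores.

    frame_scores: list of per-frame score lists, one score per alphabet entry.
    alphabet: list of output labels (characters here), including the blank.
    """
    # 1. Pick the highest-scoring label independently for every frame.
    best = [
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in frame_scores
    ]
    # 2. Merge runs of identical consecutive labels into a single label.
    collapsed = [label for label, _ in groupby(best)]
    # 3. Remove blanks; a blank between two equal labels keeps them distinct.
    return "".join(label for label in collapsed if label != blank)


# Toy example: made-up scores over a four-symbol alphabet.
alphabet = ["_", "a", "c", "t"]
frame_scores = [
    [0.10, 0.10, 0.70, 0.10],  # c
    [0.10, 0.10, 0.70, 0.10],  # c (repeat, merged in step 2)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # a
    [0.10, 0.10, 0.10, 0.70],  # t
    [0.10, 0.10, 0.10, 0.70],  # t (repeat, merged in step 2)
]
print(ctc_greedy_decode(frame_scores, alphabet))  # cat
```

Greedy decoding uses only the network output, which matches the "single component" column of the table. The optional, more complex language model mentioned there would typically enter through a beam search instead, which this sketch does not cover.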