 Review
 Open Access
Performance vs. hardware requirements in state-of-the-art automatic speech recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 28 (2021)
Abstract
The last decade brought significant advances in automatic speech recognition (ASR) thanks to the evolution of deep learning methods. ASR systems evolved from pipeline-based systems, which modeled handcrafted speech features with probabilistic frameworks and generated phone posteriors, to end-to-end (E2E) systems, which translate the raw waveform directly into words using one deep neural network (DNN). The transcription accuracy greatly increased, leading to ASR technology being integrated into many commercial applications. However, few of the existing ASR technologies are suitable for integration in embedded applications, due to their hard constraints related to computing power and memory usage. This overview paper serves as a guided tour through the recent literature on speech recognition and compares the most popular ASR implementations. The comparison emphasizes the tradeoff between ASR performance and hardware requirements, to further serve decision makers in choosing the system which best fits their embedded application. To the best of our knowledge, this is the first study to provide this kind of tradeoff analysis for state-of-the-art ASR systems.
Introduction
Speech is one of the most important forms of human communication and, given this fact, it has always been desired to extend voice interaction to the technologies that surround us, making it as natural as possible. In this way, the technology could be used on a larger scale, without the need for additional knowledge, even by people with disabilities or people whose activity requires hands-free operation of devices. Automatic speech recognition (ASR) enables users to transcribe streams of speech into the corresponding texts. The last few years have brought significant progress in this field, with voice-activated devices being particularly successful. For instance, smart-home assistants such as Amazon Echo, or virtual personal assistants like Siri, have become well-known and are capable of performing complex speech-to-text tasks. They transcribe successfully even in challenging conditions, such as background noise or hesitations. Their input is free speech with a high degree of variability from one speaker to another, the only constraint being the pronunciation of a keyword that wakes up the device. Because ASR systems are computationally intensive and require a large amount of resources, it is very difficult for this entire process to take place in an embedded environment. Usually, hotword detection is performed on the device, while the rest of the speech is transferred to the cloud for transcription.
However, there is an increased interest in developing speech recognition systems that can be deployed in embedded systems [1–5]. These are usually integrated in wearable devices or Internet of Things applications, thus eliminating the need for cloud processing. Embedded devices come with hard constraints related to computing power and memory usage. In this situation it is crucial to use ASR systems that allow for real-time processing, along with very low energy consumption and a low memory footprint, while preserving as much as possible the high accuracy of neural network-based systems. In order to meet the requirements of embedded systems, one can resort to several optimization techniques: (i) architecture optimization and (ii) data quantization and format optimization.
Architecture optimization involves reducing the number of layers, the sizes of the layers, etc., aiming to eventually obtain a smaller neural network in terms of total number of parameters. This, along with reducing the number of activations that need to be kept in memory, leads to a system that can better fit into memory-constrained devices. Moreover, by reducing the model size one implicitly reduces the number of operations performed during inference and thus obtains a better real-time factor. Data quantization refers to the process of reducing the bit precision of the weights and activations, while data format optimization concerns using fixed-point or integer numbers instead of typical floating-point values [6]. Data quantization and format optimization help to obtain a smaller footprint for the models and activations (e.g. 8-bit values instead of 32-bit values) and also to reduce the real-time factor by simplifying the operations performed during inference (e.g. multiply-and-accumulate operations with 8-bit integers instead of 32-bit floating-point values).
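As an illustration of data quantization, the following minimal sketch maps floating-point weights to 8-bit integers. It assumes symmetric, per-tensor quantization with a single scale factor, which is only one of several possible schemes; the function names and the toy weights are hypothetical and are not taken from any of the surveyed toolkits.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumption of this sketch: one scale factor shared by all weights.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Toy weights (illustrative values only).
weights = [0.42, -1.27, 0.05, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight is then stored in one byte instead of four, at the cost of a rounding error of at most half a quantization step per weight.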
In this context, it is important to pinpoint the ASR systems which are suitable for deployment in embedded applications. Several surveys present some of the state-of-the-art ASR systems, as follows. Zhang et al. [7] present a review focused on deep learning aiming to improve the environmental robustness of speech recognition systems. They introduce technologies related to the front-end of the systems, which performs the signal preprocessing, and to the back-end, which performs the model training, for this particular subtask. The authors compare the traditional, probabilistic systems with those based on deep learning. They provide an accuracy comparison of several general approaches on 4 benchmark speech databases. Park et al. [8] present real-time speech recognition systems for mobile and embedded devices. Several neural network-based acoustic models are compared by accuracy versus model size. The authors create an acoustic model by combining recurrent units and convolutional layers. The main purpose is to solve the excessive memory consumption due to the parameter size in recurrent neural networks by using a parallelization approach. Purwins et al. [9] present an overview of deep learning techniques used for audio signal processing. The authors classify the main applications in the field of audio signal processing, highlighting the types of acoustic features, the types of neural networks used to train the model, the data requirements and the computational complexity. Kim et al. [10] describe in detail various components and algorithms used in end-to-end speech recognition systems and compare them from the architectural point of view and also in terms of accuracy. Several methods of model compression are discussed, considering the possibility of using such systems in commercial on-device applications.
As opposed to the survey papers mentioned above, which present overviews of speech-to-text systems and focus on high-level architectures, this article analyzes many concrete ASR implementations available in the most popular ASR toolkits. The various component parts of the ASR implementations are presented in detail, along with graphical representations of the underlying neural networks. Finally, we offer insightful comparisons of hardware performance versus transcription accuracy.
This survey provides an overview of the best and most popular speech recognition architectures, including architectural insights that can be used to identify the system which is most suitable for deployment in embedded applications. Data quantization and format optimization, which are not covered by this survey, can potentially be applied regardless of the selected architecture, as they are orthogonal optimizations, leading to a reduction in model size and real-time factor along with a compromise related to accuracy.
The aim of the survey is twofold: (i) to serve as a guided tour through the recent literature on automatic speech recognition and (ii) to provide an analysis of the tradeoff between ASR performance and hardware requirements for the most popular ASR implementations. The survey describes the various components and subsystems of generic ASR systems in Section 2 and then describes, evaluates and compares eight specific ASR implementations in Sections 3 and 4. The comparison is structured as a tradeoff between transcription accuracy and system complexity. We first compare the models in terms of model size, number of activations and number of operations required to process one frame of speech and then translate these metrics into hardware requirements regarding memory load and throughput. The aim is to provide a complete picture of the tradeoffs between complexity and performance to further serve decision makers in choosing the system which is most suitable for deployment in embedded applications.
The paper is organized as follows. Section 2 presents the general characteristics of a speech recognition system, offering details about its components. The differences between a traditional pipeline system and an end-to-end system are discussed. We describe the most common types of speech features and we provide information regarding different speech recognition system architectures, grouped into traditional architectures based on Hidden Markov Models (HMM) and end-to-end neural networks, along with the most common language modeling methods. Section 3 describes, from the architectural point of view, eight of the best performing open-source speech recognition systems, while Section 4 deals with their comparison and evaluation by considering the accuracy and the complexity of the systems on a standard speech recognition task. Section 5 is reserved for conclusions. The paper organization is presented in Table 1.
Introduction to ASR systems
This section introduces the basic concepts in automatic speech recognition. Section 2.1 presents the road from traditional ASR to end-to-end ASR. Section 2.2 describes the most common speech features which are used in current state-of-the-art implementations. Section 2.3 introduces the main principles in traditional ASR, while Section 2.4 presents different end-to-end approaches. Section 2.5 summarizes the common language models and their integration techniques in ASR systems.
The road from pipeline ASR to end-to-end ASR
Table 2 presents the main differences between these two categories of ASR systems in terms of architecture, decoding strategy, input and output data. A pipeline ASR involves multiple models, each one created in the training stage: an acoustic model (AM), usually implemented as a neural network; a phonetic model (PD); and a language model (LM), usually implemented as a probabilistic component. The decoding graph is a weighted finite state transducer (WFST), obtained by composition operations applied on these elements. The acoustic modeling element outputs posterior probabilities for context-dependent phones. The phonetic modeling element combines these phones to form valid words, while the linguistic modeling element combines words into sequences that are likely to form sentences. On the other hand, end-to-end ASR trains a single network to learn the speech representations. A connectionist temporal classification (CTC) decoder is usually used, this being a beam search-based mechanism. It provides a probability distribution over the output label set for each input time step. The CTC decoding outputs characters separated by blanks, aligned with the input sequence. Because words and word sequences are obtained directly in this way, the language model is optional in this approach. However, it ensures valid output words and more likely word sequences.
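The way per-frame CTC labels are turned into a character string can be sketched as follows. This is a simplified greedy (best-path) collapse and not the beam search decoder mentioned above: it assumes that the most likely label has already been chosen for every time step, and the blank symbol `_` is a hypothetical choice made for this example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop the blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# One label per input time step; blanks separate the two genuine "l" characters.
frames = list("hh_e_ll_ll_oo")
text = ctc_greedy_collapse(frames)
```

The blank label is what allows a doubled character to survive the merging step: without a blank between them, the two runs of "l" would collapse into a single character.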
Pipeline ASR
This represents the traditional ASR system (see the top of Fig. 1), made up of multiple components that work together as in a pipeline system. The automatic speech recognition task consists in identifying the most probable word sequence W^{∗}, given the speech signal X:
W^{∗} = \arg\max_{W} P(W|X)    (1)
Following Bayes' rule, this equation can be transformed into an equivalent form:
W^{∗} = \arg\max_{W} \frac{P(X|W) P(W)}{P(X)}    (2)
P(X) does not depend on the word sequence W, therefore the denominator can be eliminated and the problem reduces to two aspects: (i) the computation of the probability of the speech signal X, given the corresponding word sequence W: P(X|W), and (ii) the computation of the probability of the word sequence: P(W).
The first probability can be determined using an acoustic model, while the second probability is obtained using a language model. The link between these models is provided by the phonetic model, which consists in a dictionary associating each word from the language model to a sequence of phonemes, modeled by the acoustic model.
The acoustic model estimates the likelihood that a speech signal has been generated by a sequence of words. To achieve that, it is necessary to make observations based on an extended set of word pronunciations from as many speakers as possible, and then to obtain a mathematical model that estimates the probability of each word. This premise leads to the choice of the word as the base speech modeling unit, which, however, conflicts with a few principles. The base unit must be precise and its representation should include its manifestation in several acoustic contexts. At the same time, it must be trainable: there must exist enough data to be able to estimate the unit correctly. Last but not least, the base unit must be generalizable, so that every word can be built from a sequence of units. Words cannot be used as base units because they are neither generalizable nor trainable, as opposed to sublexical units, such as phonemes or even subphonetic units.
End-to-end ASR
The concept of “end-to-end system” (see the bottom of Fig. 1) has several interpretations. The most common definition assumes that an end-to-end speech recognition system is a single neural network that receives the raw audio signal at the input and provides a sequence of words at the output, representing the appropriate transcript [11]. The acoustic, phonetic and language components are jointly trained into a single network [12], removing the need for a handcrafted pronunciation dictionary. In contrast, a traditional system comprises several independent modules that work together as a pipeline, each module performing a specific task. In summary, the feature extractor aims to obtain speech features from the acoustic signal. The acoustic model provides the probabilities for each signal sequence, represented by those speech features, given the corresponding sequence of words. The language model deals with the word sequences, matching them into phrases with a logical meaning.
The following aspects describe the differences between end-to-end systems and traditional systems, although they may not necessarily be considered fundamental differences. Some neural networks considered end-to-end, such as the one in Section 3.5, which is introduced in [13], are capable of processing the raw audio signal, with the feature extraction integrated into the network [14], but can also process handcrafted features which have been extracted in a previous step. From the acoustic model point of view, while the traditional systems model phonemes, the end-to-end systems are trained on graphemes (characters) [15], word-pieces [16] or even entire words [17]. The output in traditional systems consists of probability distributions over the phonetic units, while the end-to-end systems can do more than that, their output consisting directly of characters. The language model is a stand-alone component in traditional systems, while in end-to-end systems it can be either plugged in as an external component or embedded in the network, as a part of the network performing the task of language modeling. There are also lexicon-free approaches in end-to-end networks, which remove the constraints imposed by a finite lexicon, making it possible to handle out-of-vocabulary (OOV) words. This feature must be balanced against the risk of obtaining meaningless or misspelled words.
Besides the unitary, homogeneous architecture of the end-to-end systems, fundamental to them is that the training is done from scratch, without using pre-aligned data. In this way, the possibility of using potentially incorrect alignments as training targets is eliminated [18]. End-to-end systems are fully discriminative: everything is learned from data. Standard approaches use an additional generative component, most often a Hidden Markov Model (HMM). The hybrid HMM-DNN approaches based on the Lattice-Free Maximum Mutual Information (LF-MMI) objective function are also considered end-to-end [19], in the sense that no initial forced alignments are required to start training.
Feature extraction
This subsection describes the most popular speech features used in ASR: spectrograms, Mel-filterbanks, Mel-frequency cepstral coefficients and i-vectors. Feature extraction is a preprocessing technique which transforms the speech waveform into a more compact parametric representation, focusing on some key properties of the signal that are relevant for speech recognition. Regardless of the feature type, most of them share some main principles, being computed by applying some specific operations. In general, they are biologically inspired, imitating the human sound perception system. The main operations that are applied in the feature extraction process are explained below.
Type of features
The speech signal is not stationary: its statistics change over time. Therefore, the analysis is performed on short time frames, in order to respect the quasi-stationary property of the speech signal. The first operation is framing, which consists of splitting the signal into short frames, usually 25 ms with 10 ms overlap. The second operation is called windowing and it is done during framing. Each frame is multiplied by a Hamming [20] or Hanning [21] window, aiming to smooth the frame border discontinuities, in order to avoid the occurrence of frequency artifacts. Because the window attenuates the frame ends, some parts of the signal would matter less. This is why overlapping is essential: it retrieves the information which may be discarded at the border of two consecutive frames. The third most common operation in this pipeline consists in applying the Fast Fourier Transform (FFT) to pass the signal from the time domain to the frequency domain. As Eq. 3 shows, the speech signal is the time domain convolution between the base signal, represented by the air exhaled from the lungs, and the time response of the vocal tract. The latter is the relevant component for the speech recognition task and the FFT is used to separate it. The convolution operation in the time domain becomes a multiplication in the frequency domain, and Eq. 4 presents how the logarithm is applied to convert the multiplication into a summation, which is a linear operation whose terms can be separated.
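In a common source-filter notation, the relations referred to as Eq. 3 and Eq. 4 take the form given below. The symbols are assumptions of this sketch and are not taken from the original equations: e(t) denotes the excitation produced by the air flow, h(t) the time response of the vocal tract, s(t) the resulting speech signal, and capital letters the corresponding Fourier transforms.

```latex
% Time domain: speech is the convolution of excitation and vocal tract response
s(t) = e(t) * h(t)
% Frequency domain: the convolution becomes a product
S(f) = E(f) \, H(f)
% Log-magnitude domain: the product becomes a sum of separable terms
\log |S(f)| = \log |E(f)| + \log |H(f)|
```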
As can be observed in Fig. 2, although the first operations are common to most types of features, the features are more or less processed, depending on the stage at which they are created. Spectrograms are a very common type of feature. They are defined as the power of the frequency bins at a specific time and they are obtained by applying the framing, windowing and FFT operations.
The Mel-filterbank features, also named Mel-Frequency Spectral Coefficients (MFSC), follow the same main steps as spectrograms, with an additional step that involves applying a bank of triangular Mel filters [22]. Typically, 40 filters are used. They translate the frequencies from a wider initial range to a narrower range. The vocal parameters are grouped around the lower frequencies, where the human auditory system can achieve a much better distinction. Consequently, the lower part of the spectrum contains much more crucial information for speech recognition. In the frequency range 0 to 1000 Hz the perception is linear, while above this threshold it becomes logarithmic.
The Mel-Frequency Cepstral Coefficients (MFCCs) [23] are the most commonly used features in speech recognition. Obtaining them involves an additional operation compared to Mel-filterbanks: a final step consisting in the backward transition to the time domain by applying the discrete cosine transform (DCT). Its role is to reduce the dimensionality of the parameters and to decorrelate them, because the filterbank parameters are highly correlated. The first 13 MFCCs are traditionally retained from each frame. Because the MFCCs approximate the initial signal, more values give a more accurate, but more computationally complex approach, while fewer values give a coarser approximation. The MFCCs only provide information about the current frame and sometimes context data is needed. To fill this gap, the 1^{st} and 2^{nd} derivatives are computed, also known as delta and delta-delta, or differential and acceleration coefficients. They capture information about the signal trends, i.e. the variation over the neighboring frames.
I-vectors (identity vectors) [24] are a kind of feature predominantly used for the speaker recognition task, but they are also applicable in speech recognition. They are derived from Joint Factor Analysis [25] (JFA), where the supervectors have been defined. The supervectors consist of the mean components of the Gaussian Mixture Model (GMM) that models the speaker-specific acoustic features. Because i-vectors contain speaker-characteristic information, they help to improve the system’s adaptation to the specific speech of a speaker.
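The framing, windowing and Fourier transform steps that lead to a spectrogram can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions: it uses a naive discrete Fourier transform instead of an optimized FFT, a 25 ms frame, and an assumed 10 ms step between frame starts (the step is a parameter and can be changed); the toy sine input and all function names are hypothetical.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    """Power of each non-negative frequency bin, using a naive O(n^2) DFT."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        bins.append(abs(coeff) ** 2 / n)
    return bins


def spectrogram(signal, frame_len, hop):
    """Frame the signal, apply the window to each frame, return per-frame power."""
    window = hamming(frame_len)
    return [power_spectrum([x * w for x, w in zip(frame, window)])
            for frame in frame_signal(signal, frame_len, hop)]


# Toy input: a 1 kHz sine sampled at 8 kHz (illustrative values only).
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
frame_len = sample_rate * 25 // 1000  # 25 ms -> 200 samples
hop = sample_rate * 10 // 1000        # assumed 10 ms step -> 80 samples
spec = spectrogram(signal, frame_len, hop)
```

With these settings the frequency resolution is 8000 / 200 = 40 Hz per bin, so the energy of the 1 kHz tone concentrates around bin 25 in every frame.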
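The placement of the Mel filters can be illustrated with a short sketch. It assumes one widely used analytic form of the Mel scale, mel = 2595 log10(1 + f / 700), which the text above does not specify, and a hypothetical frequency range of 0 to 8000 Hz; only the filter center frequencies are computed, not the triangular filter shapes.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mel (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters equally spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]


# 40 filters over an assumed 0-8000 Hz range.
centers = mel_filter_centers(40, 0.0, 8000.0)
```

Because the filters are equally spaced in mel, their centers are densely packed at low frequencies and increasingly far apart at high frequencies, which reflects the finer resolution of human hearing in the lower part of the spectrum.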
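The delta and delta-delta coefficients can be computed as sketched below. The sketch assumes a regression-based delta over a window of two frames on each side, with the edge frames repeated as padding; this is one common definition and not necessarily the one used by the surveyed toolkits. The toy feature vectors are hypothetical.

```python
def delta(features, width=2):
    """Regression-based delta of a list of per-frame feature vectors.

    Assumption of this sketch: frames beyond the edges repeat the edge frame.
    """
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = features[min(t + n, num_frames - 1)][d]
                prv = features[max(t - n, 0)][d]
                acc += n * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


# Toy 2-dimensional "MFCCs": a linear ramp and a constant value.
mfcc = [[float(t), 1.0] for t in range(6)]
d1 = delta(mfcc)   # delta (differential) coefficients
d2 = delta(d1)     # delta-delta (acceleration) coefficients
full = [c + a + b for c, a, b in zip(mfcc, d1, d2)]
```

With 13 MFCCs per frame, concatenating the static, delta and delta-delta coefficients in this way yields 39 values per frame.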
Traditional, HMM-based acoustic modeling
This subsection presents the key concepts of HMM-based acoustic modeling, a widespread approach, now considered traditional with the advent of end-to-end networks. The phoneme is the shortest unit of sound and it usually corresponds to the acoustic manifestation of a letter. Each word can be decomposed into a phoneme sequence. Given the way the words are articulated, the phonemes are context-dependent: two identical phonemes will manifest differently if they occur in different contexts with different neighboring phonemes. Consequently, phonemes are usually modeled with three states: the first state models the transition from the previous phoneme, the second state models the central, stationary part of the current phoneme, while the third state models the transition to the next phoneme. A phoneme modeled this way is referred to in the literature [26] as a triphoneme, and each of its states is referred to as a senone. The number of different triphonemes is very large (the number of phonemes to the power of three), thus the training process is difficult and computationally intensive. However, the various senones composing the triphonemes are not totally distinct and, consequently, they can be clustered based on similarity and modeled together.
Traditionally, the feature vectors representing the senones were modeled using Gaussian Mixture Models (GMMs) [27], while the transitions between senones (the actual triphonemes) were modeled using HMMs [28]. Given the properties of speech, HMMs used in ASR are constrained to have forward-only transitions. HMM-GMM systems for ASR are trained using the Baum-Welch algorithm [29], while the Viterbi algorithm [30] is used for speech decoding.
Taking into account that these states are context-dependent and that speech is produced through transitions between states, it was found that speech can be modeled using Hidden Markov Models (HMMs). An HMM is a finite state automaton where the sequence of states is not known; only the acoustic vectors corresponding to each state, generated by a probability density function, are observed. An HMM is characterized by a set of states, emissive or non-emissive, together with a set of transition and loop probabilities. Generally, the HMM transitions have no constraints, except in ASR systems. Due to the sequential nature of speech, backward transitions and state-skipping transitions are not allowed. Each state may have transitions to itself or to the next state. Each state emits vectors of values based on a probability density function, given by a Gaussian Mixture Model (GMM). The normal probability density function models a range of values, being centered on the highest probability point, corresponding to the mean of the values. It provides the probability that a value belongs to a class. Thus, by choosing an arbitrary point in the value space, we can determine how likely it is to belong to one of the classes. Gaussian Mixtures (GMMs) are weighted sums of Gaussian densities, which approximate the probability of the acoustic feature vectors extracted from the signal.
Training a GMM aims to estimate its parameters. This operation is done by using the Expectation-Maximization (EM) algorithm. The first step of the training (E) computes the class membership probabilities for the input data. The second step (M) computes the model parameters using the current class membership probabilities of the input data. The algorithm runs iteratively until it converges to a local maximum of the likelihood function. HMM decoding uses the Viterbi algorithm and it represents the speech recognition task itself. It determines the most likely sequence of acoustic states that has generated a set of speech features.
The first step towards employing deep learning for speech recognition was the replacement of GMMs with DNNs for modeling senones [31]. In this case, the DNN’s role was to predict the senone class. While the first attempts used speech corpora annotated at phoneme level and required no HMM-GMM alignments, the usage of speech corpora annotated at utterance level required phoneme-level alignments produced by intermediary HMM-GMM systems.
The hybrid approach is similar to the traditional HMM-GMM, except that the GMM is substituted by a neural network, using the posterior probabilities output by the last layer. Moreover, the neural network is not trained from scratch, but on top of the alignments provided by an already trained HMM-GMM system. This is because the speech corpora are usually not annotated at phoneme level. Consequently, the neural network in HMM-DNN approaches is used to predict the senone class based on an input consisting of several frames of feature vectors.
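Viterbi decoding over a small left-to-right HMM can be sketched as follows. The model is a hypothetical three-state HMM with self-loops and forward-only transitions; the emission likelihoods are given directly as toy numbers instead of being computed by a GMM, and all probabilities are handled in the log domain.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely state sequence.

    log_emissions[t][s]: log-likelihood of frame t under state s
    log_trans[i][j]:     log-probability of moving from state i to state j
    log_init[s]:         log-probability of starting in state s
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emissions[0][s] for s in range(num_states)]
    backptr = []
    for frame in log_emissions[1:]:
        new_score, ptr = [], []
        for j in range(num_states):
            best_i = max(range(num_states),
                         key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + frame[j])
        score = new_score
        backptr.append(ptr)
    # Backtrack from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical left-to-right model: each state loops or moves to the next state.
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
init = [1.0, 0.0, 0.0]
# Toy per-frame emission likelihoods for the three states (5 frames).
emissions = [[0.8, 0.1, 0.1],
             [0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1],
             [0.1, 0.2, 0.7],
             [0.1, 0.1, 0.8]]

path = viterbi([[log(p) for p in row] for row in emissions],
               [[log(p) for p in row] for row in trans],
               [log(p) for p in init])
```

Because backward and state-skipping transitions have zero probability, the decoded state sequence can only stay in the same state or advance by one state from one frame to the next.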
TDNN
The Time-Delay Neural Network (TDNN) [32] is a convolutional network which operates in the time domain, modeling the temporal dependencies. It is easier to parallelize than a recurrent network and it is comparable with a feed-forward DNN in terms of required training time. The input of each unit in the TDNN layers is expanded out spatially over a couple of sequential units from the previous layer. Thus, the lower layers learn a narrow context, while the higher layers process activations for a wider temporal context. The hyperparameters of a TDNN are the lengths of the input context of each layer. A TDNN which uses a subsampling technique, computing the hidden activations only at some specific time steps, is proposed in [33]. This method avoids redundancy, since the overlap of large contexts leads to highly correlated neighboring activations. It was concluded that a larger context on the left side is optimal for online decoding. The model size and the training time are also reduced. Therefore, this subsampling mechanism in TDNN networks is similar to a convolutional operation which allows gaps in the convolutional filter.
Because it was desired to compress the network layers, a factored form of TDNN (TDNN-F), which is derived from Singular Value Decomposition (SVD), was introduced in [34] as an improvement over TDNN. It involves training from a random start, with each learned matrix decomposed as a product of two smaller factors, one of them constrained to be semi-orthogonal. This decomposition is obtained by a linear bottleneck operation. Thus, weight compression is achieved, as the smaller singular values are discarded. A matrix M is semi-orthogonal if MM^{T}=I or M^{T}M=I.
Convolutional + time-delay neural network (CNN-TDNN)
This is another network variation, based on the pure TDNN approach, preceded by a couple of stacked convolutional layers. Their main purpose is to perform further processing of the acoustic features, acting as a feature processing front-end. These extra layers perform temporal convolutions on the speech features, reducing their spectral and temporal variability. The convolutional layers, through their structure given by local connectivity, weight sharing and pooling, are able to suppress the small variations that appear in the spectral domain. These variations are induced by both the speaker and the acoustic environment in which the speech takes place [35]. There are some reports [36–39] stating that better results are obtained using CNN-TDNN instead of simple TDNN networks, but the CNN-TDNN requires a little more computing power.
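The semi-orthogonality condition and the parameter saving of a factored layer can be checked with a short sketch. The 2x3 matrix, the layer size of 1024 and the bottleneck size of 128 are hypothetical values chosen for illustration and are not taken from [34]; the sketch does not reproduce the training procedure that enforces the constraint.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def is_identity(p, tol=1e-9):
    return all(abs(p[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(len(p)) for j in range(len(p)))


def is_semi_orthogonal(m):
    """True if M M^T = I (orthonormal rows) or M^T M = I (orthonormal columns)."""
    return (is_identity(matmul(m, transpose(m)))
            or is_identity(matmul(transpose(m), m)))


def factored_params(rows, cols, bottleneck):
    """Parameter count of a (rows x bottleneck) times (bottleneck x cols) product."""
    return rows * bottleneck + bottleneck * cols


# A 2x3 matrix with orthonormal rows: M M^T = I although M^T M is not I.
r = 1.0 / math.sqrt(2.0)
M = [[r, r, 0.0],
     [0.0, 0.0, 1.0]]

full = 1024 * 1024                          # unfactored 1024x1024 weight matrix
factored = factored_params(1024, 1024, 128)  # two factors with a 128-dim bottleneck
```

In this illustrative setting the factored layer stores 262,144 parameters instead of 1,048,576, a fourfold reduction.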
End-to-end ASR systems
This subsection aims to summarize the most common approaches in end-to-end systems. We provide a brief overview of the neural networks used for acoustic modeling, presenting their main characteristics.
Architectures
Endtoend systems refer to a unitary architecture, capable to receive raw audio or speech features as input and it outputs words. Despite a traditional, pipeline system, all the components as feature extractor, acoustic model and even sometimes the language model are integrated in one single network. Specifically, the loss function of the neural network is set directly at the character level. The following paragraphs include a review of the most popular endtoend architectures, along with some specific details for each.
Some stateoftheart acoustic modeling approaches using neural networks are based on simple stacked convolutions [13, 40]. The speech features are received as network input and they are passed over a series of convolutional layers, each one being characterized by the number of filters and their dimensionality. The loss function is directly set at the character level facilitating the output of the networks to be represented by words. Some wellknown architectures based on stacked convolutions are those presented in [13, 40].
In [41, 42] are presented convolutional architectures characterized by dense residual connections. The residual connections work as a bypass, connecting early layers to later layers. The term dense denotes connections between each layer to every other layer, similar to a fullyconnected network. In the dense residual networks, all the outputs from the preceding layers are concatenated in order to provide the input for the subsequent layers. The total number of connections is L(L+1)/2, where L represents the number of layers, unlike a regular network where the number of connections is just L, one between each two consecutive layers. Among the advantages of this approach are the avoidance of vanishing gradient in very deep networks, better feature propagation and network parameters reduction.
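The connection count of a densely connected stack can be verified by enumerating the connections explicitly. The sketch assumes that the network input is treated as node 0 and that each of the L layers receives the outputs of all preceding nodes, which is the convention under which the total equals L(L+1)/2.

```python
def dense_edges(num_layers):
    """Connections (source, target); node 0 is the input, nodes 1..L are layers."""
    return [(src, dst)
            for dst in range(1, num_layers + 1)
            for src in range(dst)]


def dense_connection_count(num_layers):
    """Closed-form count L(L+1)/2."""
    return num_layers * (num_layers + 1) // 2


edges = dense_edges(5)
```

For a stack of 5 layers the enumeration yields 15 connections, compared with 5 in a plain sequential network.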
The recurrent approaches succeed in modeling longrange dependencies over temporal sequences. In [43] are described some variations of recurrent architectures used for acoustic modeling. The recurrent neural networks (RNN) are characterized by cyclic connections, the output from the previous step feeds the input at the current time step. The Long Short Term Memory (LSTM) networks are more complex RNN networks. They contain memory blocks, that in turn contains cells with selfconnections, aiming to conserve the temporal state. Besides the cells, there are some gating mechanisms, like the input, output and forget gates, which control the dataflow. Some LSTMs also have peephole connections [44] between the internal cells and gates. One of the advantages is that the internal state of a cell could be inspected even if the output gate is closed. Another one is represented by the ability to learn precise and robust timing intervals between relevant signal events. The bidirectional LSTMs (BLTSMs) are networks that process the input in both directions: from the past to the future and vice versa. It was shown that a two layer LSTM outperforms a DNN with an order of magnitude more parameters. The DNN drawback consists into the limited temporal modeling. DeepSpeech [45] is a wellknown recurrentbased ASR framework which obtains competitive results with just few bidirectional layers.
The encoder-decoder architecture [46–48] is a neural network specialized in sequence-to-sequence mapping. Besides its application in speech recognition, it was initially applied in machine translation and language modeling. Usually, the encoder and the decoder are implemented as recurrent networks, thanks to their ability to work with time-dependent sequences of data, and the two are trained jointly. As described in [46], a sequence of input data is passed to the encoder, which composes a fixed-length context vector. The decoder is autoregressive: it consumes the previously decoded symbols as well as the context vector in order to predict the next symbol.
Attention comes as an improvement of the encoder-decoder architecture. It aims to bypass the limitation of the fixed-length encoding vector. Working with long sequences can be an issue because the neural network must be able to compress all the information into one vector. In [49] it is shown that the performance of an encoder-decoder network decreases as the length of the input increases. The attention approach encodes the input into a sequence of vectors and chooses a subset of them during decoding. The decoder performs two operations simultaneously: aligning and decoding. Each time the network tries to generate a new output symbol, it looks at a set of positions in the source sequence where the relevant information is concentrated. The prediction is computed considering the encoding vectors for these positions as well as the previously predicted symbols. Thus, instead of having one single encoding for the whole sequence, there is a corresponding context vector for each output symbol. This is computed as a weighted sum of the encoder hidden states, where each hidden state h_{i} carries information about the whole sequence, but focuses on the input frames around the i-th position.
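The weighted sum that produces the context vector can be sketched as follows. Dot-product scoring between the decoder state and each encoder hidden state is an assumption made for brevity; the cited works may use a learned scoring function, and all names here are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one output symbol.

    Scoring by dot product is an assumption of this sketch.
    """
    scores = [
        sum(d * h for d, h in zip(decoder_state, h_i)) for h_i in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of the encoder hidden states h_i.
    context = [
        sum(w * h_i[k] for w, h_i in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, context = attention_context([2.0, 0.0], encoder_states)
    print(weights, context)
```

A new set of weights is computed for every output symbol, so each symbol gets its own context vector instead of sharing one fixed-length encoding.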
Another encoder-decoder architecture is presented in [50]. This is an attention-based encoder-decoder whose novelty lies in the encoder, a fully convolutional one composed of time-depth separable (TDS) blocks. A TDS block comprises two parts: a 2D convolution layer, over both the time and the feature dimensions, followed by two convolutional layers with 1x1 filters, which work as a fully-connected feedforward layer. This approach generalizes much better than other convolutional networks and works with a reduced number of parameters, while the receptive field is kept large. The decoder is a conventional recurrent-based one, with one particularity in the training step: the previous ground-truth symbol is used instead of the actual previous prediction. This technique is called teacher forcing. The idea is to remove the recurrent dependency on previous predictions during training in order to facilitate parallel computation.
Loss functions
Conventional DNN acoustic models use a frame-level objective function, commonly the cross-entropy (CE) function, to perform training based on the alignments provided by an HMM-GMM system. The cross-entropy objective function is described in Eq. 5, where t iterates over the audio files in the training set, i over the time frames in each audio file and \(\hat {y}_{i}\) represents the labels for each frame provided by forced alignment.
Connectionist Temporal Classification (CTC) [51] allows a network to be trained without requiring a frame-level alignment between the speech signal and the transcripts from the training dataset. Standard ASR systems use a statistical (e.g. GMM) or deep learning (e.g. DNN) component to predict what is being uttered and a time consistency component (e.g. HMM or CTC) to handle the context, i.e. the previous and the future frames. The CTC approach uses a softmax component as the final layer of the network, in order to provide a probability distribution over all the possible output symbols. The output can be visualized as a graph G_{CTC}(θ;T) for a given transcription θ over T time frames. Each time frame is defined by a set of nodes, each node representing a possible output label, given by a probability distribution function f_{t}(). A path π=π_{1},...,π_{n}∈G_{CTC}(θ;T) represents a potential transcription through this graph. A sequence-level objective function, derived from maximum likelihood, operates on this distribution aiming to maximize the probability of the correct symbol sequence:
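In this graph notation, the CTC criterion is commonly written in the form below. This is the usual graph-based presentation and is given as an assumption about notation, with x standing for the input sequence:

```latex
\mathrm{CTC}(\theta, T) = -\operatorname*{logadd}_{\pi \in G_{CTC}(\theta;T)} \sum_{t=1}^{T} f_{\pi_t}(x)
```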
where logadd() is the logarithm of a sum of exponentials (log-sum-exp), working as a smooth version of max(). Specific to CTC is the use of an additional blank symbol, which is required to model the non-phoneme emissions. In the usual terminology, the task of determining the labels is called decoding. The trivial approach is max-decoding or greedy decoding, which takes the most probable phoneme at each time step. Because the predicted sequences have variable length, the same phoneme can be predicted at several consecutive time steps. The blank label is used to separate legitimately repeated characters from those repeated because of the variable speech rate, or to mark the transition from one character to another. To obtain the final transcript, repeated characters are collapsed and the blank labels are discarded. A drawback of CTC is the assumption that the output label at a given moment in time is independent of the previous output labels. This is imposed by the architecture: there is no feedback loop from the CTC output layer to itself or to the network. However, CTC decoding can be integrated with a language model, so that the transcription of non-existent words is avoided. Another approach, presented in [52], consists of a loss function that combines CTC with the attention mechanism.
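Greedy decoding and the collapse step described above can be sketched as follows. The label set and the `"_"` blank symbol are placeholders chosen for this illustration.

```python
def collapse_ctc_path(path, blank):
    """Merge consecutive repeats, then drop blank labels."""
    decoded = []
    previous = None
    for label in path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


def ctc_greedy_decode(frame_posteriors, labels, blank):
    """Pick the most probable label in every frame, then collapse the path."""
    best_path = []
    for posterior in frame_posteriors:
        best_index = max(range(len(posterior)), key=lambda k: posterior[k])
        best_path.append(labels[best_index])
    return collapse_ctc_path(best_path, blank)


if __name__ == "__main__":
    # The blank between the two "l" groups keeps the double letter.
    print("".join(collapse_ctc_path(list("__hh_e_ll_ll_oo"), "_")))
```

Without the blank between the two groups of "l", the repeated frames would collapse into a single character, which is the case the blank label exists to prevent.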
The Lattice-Free Maximum Mutual Information (LF-MMI) objective function is used in chain models [53] to perform sequence-discriminative training. The traditional MMI criterion aims to maximize the posterior probability and is defined as follows:
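The MMI criterion is commonly written in the form below. The acoustic scale \(\kappa\) and the exact notation are assumptions and may differ from the authors' own equation:

```latex
\mathcal{F}_{\mathrm{MMI}} = \sum_{r} \log
  \frac{P(O_r \mid S_r)^{\kappa}\, P(S_r)}
       {\sum_{s} P(O_r \mid s)^{\kappa}\, P(s)}
```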
where S_{r} is the correct transcription of the r^{th} speech file O_{r} and P(s) is the language model probability for sentence s. The numerator provides the likelihood of the data given the correct word sequence (reference alignment), while the denominator provides the total likelihood of the data given all possible word sequences, which is equivalent to summing over all possible word sequences estimated by the full acoustic and language models. The numerator encodes the supervision information and is specific to each utterance, while the denominator encodes all possible word sequences and is identical for all the utterances. This objective function is optimized by maximizing the numerator and minimizing the denominator.
Therefore, chain models are trained from scratch, without any pre-aligned data, offering an alternative to CTC and attention mechanisms. The authors in [53] reported that their efforts to get CTC to beat cross-entropy were unsuccessful, but that some ideas from CTC could be used in the sequence-discriminative LF-MMI criterion. Both CTC and MMI maximize the conditional log-likelihood of the correct transcript, but the probabilities are locally normalized in CTC and globally normalized in MMI. LF-MMI builds the denominator using a phone-level language model instead of a word-level one. This language model is estimated from phone-level alignments of the training data. Another characteristic of chain models is the three times smaller output frame rate, which allows the HMM to be traversed in one transition instead of three. Real-time decoding becomes faster as well.
The Auto Segmentation Criterion (ASG) [13] was developed as an improvement over CTC. It introduces a dependency between the output symbols by using transition probabilities between them. ASG also brings a couple of other new features. The output graph is less complex because there are no blank labels: it was empirically found that there is no advantage in using blank labels to model the garbage [13]. Instead, the repetition of output symbols is modeled by a repetition character label. Another characteristic is the use of unnormalized scores on the nodes, which facilitates plugging in an external LM that provides transition scores between nodes. Also specific to ASG is global normalization instead of per-frame normalization, leading to low confidence for incorrect transcriptions. Therefore, the score of a given sequence of words W is given by:
where f() denotes the probability of an output symbol at time step t and g() is the transition probability between two consecutive symbols.
Language modeling
In this subsection we describe the language modeling approaches used in the systems analyzed later. As explained in Section 2.1, the language model is one of the components of an ASR system. It can be integrated as a distinct component, while in end-to-end systems its role can even be performed by a part of the neural network.
The role of the language model is to estimate the likelihood that a word sequence W=w_{1},...,w_{n} forms a valid sentence. A language model helps to take decisions when the acoustic model output consists of a set of phonemes which could form multiple alternative sentences. Even if these alternatives are very similar from the acoustic point of view, the LM will choose the one that makes more sense.
N-gram/Probabilistic LM
The n-gram model provides a statistical view of how words are combined to form valid sentences. It assumes that a word depends only on a fixed number of previous words. The probability of a sequence of words is factored into a product of probabilities, where the probability of a given word depends on the preceding words:
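In standard notation, with N denoting the n-gram order (a symbol introduced here to avoid a clash with the sequence length n), this factorization and its n-gram approximation can be written as:

```latex
P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
     \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
```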
Word occurrence and succession probabilities can be estimated from a large volume of text. The most common n-gram models are the 2-gram and the 3-gram, which require a history of one and two words, respectively. For instance, the probability of a pair of words for a 2-gram model is computed as:
The probability of the word pair (w_{i},w_{j}) is given by the number of occurrences of the word w_{i} followed by the word w_{j}, divided by the number of occurrences of the same word w_{i} followed by any word.
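The count ratio described above can be sketched on a toy corpus. This is a plain maximum-likelihood estimate without smoothing, which practical n-gram toolkits add on top; the function name and the example sentence are hypothetical.

```python
from collections import Counter


def bigram_probability(tokens, w_i, w_j):
    """Count of (w_i, w_j) divided by the count of w_i followed by any word."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    history_count = sum(
        count for (first, _), count in pair_counts.items() if first == w_i
    )
    if history_count == 0:
        return 0.0  # unseen history; real systems back off or smooth here
    return pair_counts[(w_i, w_j)] / history_count


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran".split()
    print(bigram_probability(corpus, "the", "cat"))
    print(bigram_probability(corpus, "the", "mat"))
```

In this toy corpus "the" is followed by "cat" twice and by "mat" once, so the two estimates are 2/3 and 1/3.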
Neural network based language models
Unlike n-gram models or feedforward DNNs, which learn from a fixed context, RNNs are capable of learning from all the previous words. The recurrent neural network based language model presented in [54] is a simple one: it comprises one input layer, one hidden layer and one output layer. The input at the current time step consists of the current word, w, and the hidden state from the previous time step. The output layer represents the probability distribution of the next word given the previous word and the context. Because these algorithms cannot work directly on text, and because label encoding (using integers) could confuse the network by suggesting an order or hierarchy, words are represented with one-hot encoding (1-of-k), where k is the number of words in the vocabulary. One-hot encoding associates a binary vector with each word in the vocabulary.
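A minimal sketch of 1-of-k encoding follows. Sorting the vocabulary to obtain stable indexes is a choice made for this illustration, and the helper names are hypothetical.

```python
def build_vocabulary(words):
    """Map every distinct word to an index (sorted for reproducibility)."""
    return {word: index for index, word in enumerate(sorted(set(words)))}


def one_hot(word, vocabulary):
    """Binary vector of length k with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector


if __name__ == "__main__":
    vocabulary = build_vocabulary("speech is speech".split())
    print(vocabulary)
    print(one_hot("speech", vocabulary))
```

Every pair of distinct words is equally far apart in this representation, so no order or hierarchy is implied, at the cost of a vector as long as the vocabulary.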
RNNs are difficult to train using backpropagation due to the vanishing gradient problem: the gradient propagated back through the network decays or grows exponentially as the context gets longer. LSTMs avoid this issue by using a different memory cell, while the rest of the algorithm remains unchanged and similar to that of RNNs. LSTMs take as input the previous hidden state and the current input. The cells decide what to preserve and what to remove from the memory. The topology presented in [55] consists of an input layer and two hidden layers, where the first is a projection layer and the second is a recurrent layer using LSTM cells. The projection layer projects all the words in the context into a continuous space.
A new type of convolutional network [56], based on gated linear units (GLU), is able to outperform LSTMs on the language modeling task, both in terms of accuracy and in terms of implementation, being easier to parallelize and less complex. This approach performs a convolution operation over the input, aiming to remove the temporal dependencies.
The Transformer-XL [57] network can learn dependencies without the constraint of a fixed-length context. It captures longer dependencies than simple Transformers or RNNs, achieving better performance on both short and long sequences and a faster inference time. Its particularity consists in reusing the hidden states from the previous segments, instead of computing them again for each new segment. These reused states act as a memory buffer for the current segment, linking the segments and propagating the information over longer contexts.
Language model integration
In shallow fusion [58, 59], the AM proposes at each time step a set of possible phones, which are scored by a weighted sum of scores given by the AM and the LM. The shallow fusion formula is given by:
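A common way of writing this weighted combination in the log domain is shown below. The symbols x (acoustic input), y (candidate output) and \(\lambda\) (LM weight) are assumptions about notation:

```latex
y^{*} = \arg\max_{y} \; \log P_{\mathrm{AM}}(y \mid x) + \lambda \, \log P_{\mathrm{LM}}(y)
```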
where the first term is the AM probability and the second one is the LM probability. This fusion takes place at inference time.
Deep fusion [58] is based on concatenating the hidden states of the acoustic model and of the language model. Both models are trained separately and their fusion is made using a gating mechanism. The output probability of the next word is given by this fused model, which is fine-tuned to use both hidden states. The hidden layer of the deep output takes as input the hidden state of the LM in addition to that of the AM. The biggest disadvantage of deep fusion is that the AM and the LM are trained independently. This can be an issue in encoder-decoder models, because the decoder learns a language model from the training data labels, which can be poor compared to the large text corpora used for LM training. The decoder must overcome this limitation and be able to incorporate the new language information. Another issue occurs if the AM and the LM are trained on corpora from different domains: if they are deep fused, the decoder will tend to follow the linguistic style learned by the AM.
The cold fusion [60] concept is derived from deep fusion; the main difference is that the end-to-end acoustic model is trained from scratch together with a pre-trained LM. During the training process, the AM learns to use the relevant information from the LM to correctly map the source sequence to the target sequence. If there are uncertainties at the decoding step caused by the AM (noisy speech, out-of-vocabulary words), the fusion model learns to take advantage of the LM. Cold fusion uses a different gate for each hidden node of the LM. This improvement allows the decoder to choose which information given by the LM fits better at a specific time step. In [60] it was shown that a decoder using cold fusion outperforms a pure end-to-end attention-based system, even if the latter uses four times as many parameters. Training is also faster when cold fusion is used, and domain transfer is easier: only a small amount of labeled data is required to close the gap between domains.
A novel LM integration approach is proposed in [61], where a pre-trained LM represents a lower layer of the decoder of an attention-based encoder-decoder system. Thus, word embeddings that are more tightly coupled to the context are provided.
Rescoring
The output of an ASR system is not limited to a single word sequence hypothesis corresponding to the acoustic signal, since it is more advantageous to keep more information. This is done by generating a lattice, a graph G(N,A), where N represents the nodes and A represents the arcs. Each arc has a specific probability and each path through the graph is an alternative transcription. The path with the best probability leads to the best transcription hypothesis. Lattice rescoring [62, 63] implies processing all the probabilities and replacing them with new ones provided by a better language model. The difference between rescoring and shallow fusion is as follows: rescoring operates on the n-best hypotheses produced after the beam search, while shallow fusion performs a log-linear interpolation between the AM and LM scores after each beam search time step.
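Rescoring of an n-best list can be sketched as below. The hypotheses, the scores, the toy LM table and the `lm_weight` value are all invented for the illustration; a real system would query an n-gram or neural LM instead.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=0.5):
    """Re-rank first-pass hypotheses with scores from a stronger LM.

    nbest      : list of (hypothesis, am_logprob) pairs from the first pass
    lm_logprob : function returning the new LM log-probability of a hypothesis
    lm_weight  : hypothetical interpolation weight
    """
    rescored = [
        (hypothesis, am_score + lm_weight * lm_logprob(hypothesis))
        for hypothesis, am_score in nbest
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    first_pass = [("wreck a nice beach", -3.8), ("recognize speech", -4.0)]
    toy_lm = {"wreck a nice beach": -6.0, "recognize speech": -2.0}
    for hypothesis, score in rescore_nbest(first_pass, toy_lm.get):
        print(hypothesis, score)
```

The whole list is re-ranked once, after the search has finished, whereas shallow fusion would apply the same kind of weighted sum inside the beam search at every step.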
State-of-the-art ASR implementations
The various ASR system architectures presented in the previous sections differ from many points of view. They comprise various types of acoustic and language models; some are based on a multi-component pipeline structure, while others are end-to-end neural networks. This leads to systems with fundamentally different complexities, in terms of model size (or number of parameters) and activations, which directly influence the memory load, and in terms of the number of operations performed for transcribing speech, which directly influences the real-time factor. These are crucial performance figures, which one must take into account, along with the transcription quality (measured in word error rate), when choosing the architecture to be implemented and deployed in embedded applications. Consequently, this section is dedicated to the comparison of the most popular modern ASR implementations in terms of the trade-off between system complexity and accuracy. To the best of our knowledge, this is the first comparison of such scale created for modern automatic speech recognition systems.
The various ASR systems evaluated and compared in this section are the following:

Kaldi’s pure TDNN [64] - a lightweight, multi-component ASR system that uses a time-delay neural network for acoustic modeling and an HMM for sequence modeling;

Kaldi’s CNN-TDNN [65] - an extension of the previous system that processes the input features with 1D convolutional layers;

DeepSpeech2 implementation from PaddlePaddle [66] - an end-to-end bidirectional RNN with convolutional layers for speech feature processing;

RETURNN from RWTH [67] - an attention-based encoder-decoder that outputs word parts;

Facebook CNN-ASG [68] - a fully convolutional end-to-end network that uses a CTC-derived criterion, ASG, and is able to output characters;

Facebook TDS-S2S [69] - an end-to-end encoder-decoder with time-depth separable convolutions, trained with a sequence-to-sequence (S2S) attention mechanism;

Jasper from Nvidia [70] - an end-to-end deep neural network based on time-delay convolutional layers interleaved with fully connected layers and characterized by residual connections;

QuartzNet from Nvidia [71] - a Jasper-derived end-to-end deep neural network based on 1D time-channel separable convolutions.
The systems mentioned above are analyzed and compared from a structural point of view, providing information regarding the network input and output type and dimension or the type, number and size of the component layers. Based on these values, we compare the networks in terms of number of parameters, operations and activations, thus offering insights into how they could be implemented and deployed in embedded applications and with what costs.
A fair comparison of the system complexity vs. accuracy trade-off can only be made in the context of a specific speech recognition task, because the ASR systems are usually adapted (number and size of layers, size of vocabulary, etc.) to each task. The most popular speech recognition tasks/corpora are presented in Table 3. Benchmarking for English speech recognition is usually performed on one of these tasks. The table shows the ASR frameworks that are the subject of our comparison, all providing adapted ASR systems for LibriSpeech [72], while only three of them also provide adapted systems for Wall Street Journal (WSJ) [73]. In this context, we decided to focus the analysis on the LibriSpeech case study.
LibriSpeech [72] is one of the most popular freely available English datasets, offering a great variety of data through both the large number of speakers and the number of hours in the corpus. It contains 1000 hours of read speech from public domain audio books, provided by approximately 2400 speakers. This is a widespread task: most of the well-known ASR frameworks contain systems adapted for it, which makes it possible to compare them in terms of WER.
Kaldi chain model TDNN
This model is part of an implementation for the LibriSpeech task existing in the Kaldi toolkit [64]. It corresponds to a multi-component system, consisting of a TDNN-based acoustic model, a phonetic model and a language model, which are the core components of the pipeline system. This system is a hybrid one: the acoustic model consists of a TDNN which works jointly with an HMM, as presented in Section 2.3. The TDNN outputs the probability that a part of the acoustic signal corresponds to a sub-phonetic unit. The HMM manages how these units can be linked together. The phonetic model functions as a lookup table that determines to which word a sequence of phonemes corresponds, while the language model estimates the likelihood of a sequence of words. Optionally, a more complex language model can be used for rescoring, which improves the initial transcription. Typically, the language model used for decoding is a probabilistic n-gram of order 2 or 3, while the rescoring operation uses a more complex, higher-order n-gram (Section 2.5.1) or an RNN-based model (Section 2.5.2).
Two types of features are used as the input of the TDNN: 40-dimensional high-resolution MFCCs extracted from frames of 25 ms length and 10 ms shift, and 100-dimensional i-vectors computed from chunks of 150 consecutive frames, equivalent to 1.5 seconds of speech. Three consecutive MFCC vectors and the i-vector corresponding to a chunk are concatenated, obtaining a 220-dimensional feature vector for a frame. The components are decorrelated by applying Linear Discriminant Analysis (LDA), without changing the dimensionality of the data. Therefore, the network input is a 220-dimensional feature vector (Feature type #1). More details about these features are illustrated on the left side of Fig. 3.
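The input dimensionality and the chunk duration quoted above follow from simple arithmetic, sketched here with hypothetical constant and function names:

```python
MFCC_DIM = 40          # high-resolution MFCCs per frame
SPLICED_FRAMES = 3     # consecutive MFCC vectors concatenated
IVECTOR_DIM = 100      # one i-vector per chunk
FRAME_SHIFT_MS = 10
CHUNK_FRAMES = 150


def tdnn_input_dim(mfcc_dim, spliced_frames, ivector_dim):
    """Size of the spliced MFCC block plus the appended i-vector."""
    return spliced_frames * mfcc_dim + ivector_dim


def chunk_duration_seconds(chunk_frames, frame_shift_ms):
    """Duration covered by a chunk, counted in frame shifts."""
    return chunk_frames * frame_shift_ms / 1000


if __name__ == "__main__":
    print(tdnn_input_dim(MFCC_DIM, SPLICED_FRAMES, IVECTOR_DIM))
    print(chunk_duration_seconds(CHUNK_FRAMES, FRAME_SHIFT_MS))
```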
The network trunk consists of a cascade of 16 factored time-delay blocks (TDNN-F), preceded by a simple TDNN block. As explained in the last paragraph of Section 2.3.1, the main difference between a TDNN-F block and a TDNN block is that the TDNN-F block comprises a linear-affine sequence of operations that acts as a bottleneck, transforming the 1536-dimensional input vector into a 160-dimensional intermediate vector and then back into a 1536-dimensional output vector. This is based on the matrix decomposition technique and is useful for parameter compression. Particular to this implementation, the TDNN-F block ends with a summation that adds the output of the current processing block to the downscaled (75%) output of the previous block, which acts as a residual connection. The TDNN performs 1D temporal convolution, applying the operations on the current input vector as well as on some previous and some future input vectors. The contexts differ from one block to another. TDNN Blocks 2-4 process the input vectors at time indexes t-1, t, t+1. TDNN Block 5 processes only the input vector at time t. TDNN Blocks 6-17 process the input vectors at time indexes t-3, t, t+3. These blocks use the subsampling technique: some time frames are ignored during the temporal convolutions, which gives the network a larger receptive field. An overview of the network is depicted in the central part of Fig. 3.
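The compression obtained by the bottleneck can be estimated with a rough weight count. This sketch ignores biases and the temporal splicing of the input, both of which change the absolute numbers, and the function names are hypothetical.

```python
def affine_weights(dim_in, dim_out):
    """Weights of a single dense matrix mapping dim_in to dim_out."""
    return dim_in * dim_out


def factored_weights(dim_in, bottleneck, dim_out):
    """Weights of the same mapping factored through a bottleneck."""
    return affine_weights(dim_in, bottleneck) + affine_weights(bottleneck, dim_out)


if __name__ == "__main__":
    full = affine_weights(1536, 1536)
    factored = factored_weights(1536, 160, 1536)
    print(full, factored, full / factored)
```

Under these simplifications, routing the 1536-dimensional vector through 160 dimensions needs roughly a fifth of the weights of a direct 1536 x 1536 matrix.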
It has been empirically proven [77] that the network performs better if it has two output blocks. The first one is based on cross-entropy, called xent in Kaldi. The other one is based on the chain loss function, which uses the LF-MMI criterion. Both of them are explained in Section 2.4.2. Each one is composed of an affine layer, a Rectified Linear Unit (ReLU), batch normalization and another affine layer. They differ in the log-softmax operation applied at the end of the cross-entropy based block. Training uses both blocks, while inference uses only the chain-based block, because chain models are trained with a sequence-level objective function. The output of the network is 6016-dimensional and consists of posterior probabilities for the acoustic states, while the output at the level of the entire system is given by the size of the vocabulary of the language model, equal to 200k words in our case. The output blocks are presented in the bottom left part of Fig. 3.
Kaldi chain model CNN-TDNN
This model [65] represents a variation of the previous simple TDNN model and is also implemented in the Kaldi toolkit as an approach for the LibriSpeech task. It is part of a multi-component system, which comprises a hybrid acoustic model based on TDNN-HMM, a phonetic model and an n-gram language model. Similarly, a more complex n-gram or neural language model can optionally be used for rescoring.
In terms of network input features, Mel-filterbanks are used in this implementation instead of MFCCs. The final features are organized as a matrix, unlike the previous case of the simple TDNN, where the input features are represented as a vector. The input of the CNN-TDNN network is composed of two types of features: 40-dimensional Mel-filterbanks extracted from frames of 25 ms length and 10 ms shift, and 200-dimensional i-vectors computed from chunks of 150 consecutive frames. The 40 components of the current Mel-filterbank vector and the 200 components of the chunk’s i-vector are organized in a 40x6 matrix of speech features (Feature type #2). The feature extraction procedure and the way the features are organized are illustrated in the central part of the left column of Fig. 3.
The neural network component of the acoustic model is very similar to the previous Kaldi TDNN network; the main difference is a set of CNN layers placed before the time-delay layers, which act as a front-end block. Three matrices of speech features (Feature type #2) are provided as input for Conv. Block 1: the features for the current, previous and next acoustic frames or, equivalently, a feature volume of 6 x 40 x 3. It uses 64 filters of size 3x3 to perform time and feature space convolutions and outputs a 64 x 40 x 1 volume.
The input of Conv. Block 2 consists of three time-consecutive volumes like the one output by Conv. Block 1, which are spliced together to form a 64 x 40 x 3 feature volume. The second convolutional block applies another 64 filters of size 3x3 to perform time and feature space convolutions and outputs a 64 x 40 x 1 volume. More filters are applied in Conv. Blocks 3 – 6, from 128 up to 256, while the size of the feature volume is kept constant by decreasing the height from 40 to 20 and finally to 10. The convolutional blocks provide a 2560-dimensional output, which is passed to the succeeding time-delay blocks.
The CNN blocks with a front-end role are followed by 12 factored TDNN (TDNN-F) blocks. The first one (TDNN-F Block 1) processes only the current time frame, while the rest perform temporal convolution over the time indexes t-3, t and t+3. The input vectors at time indexes t-3 and t are spliced together into the linear layer, while the input vectors at time indexes t and t+3 are spliced together into the affine layer. The entire CNN-TDNN network is depicted in the central column of Fig. 3.
The output blocks of the CNN-TDNN are identical to those of the simple TDNN architecture. The neural network output is represented by 6016-dimensional posterior probabilities of the acoustic states, while the output of the system is given by the 200k-word language model.
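The constant size of the convolutional feature volume mentioned above can be checked with a few multiplications. Pairing 128 filters with a height of 20 and 256 filters with a height of 10 is an assumption inferred from the statement that the volume stays constant; the names are hypothetical.

```python
# (number of filters, feature height) per stage; the pairing is assumed.
CONV_STAGES = [(64, 40), (128, 20), (256, 10)]


def flattened_size(num_filters, height):
    """Size of one time step of the feature volume once flattened."""
    return num_filters * height


if __name__ == "__main__":
    for num_filters, height in CONV_STAGES:
        print(num_filters, height, flattened_size(num_filters, height))
```

Each time the number of filters doubles, the height is halved, so every stage yields the same 2560 values per time step that are handed to the time-delay blocks.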
PaddlePaddle implementation of DeepSpeech2
This model [66] represents an implementation of the DeepSpeech2 [45] algorithm created by PaddlePaddle (PArallel Distributed Deep LEarning) to address the LibriSpeech ASR task. This is an end-to-end (E2E) system composed of a single neural network (see Fig. 4) which processes audio features and provides words at the output, as described in Section 2.4. There is no need for a phonetic model and the language model is optional, although it improves the transcription by limiting the occurrence of non-existent words. This system allows the integration of a probabilistic n-gram language model by the shallow fusion method, as explained in Section 2.5.3.
Feature extraction takes place beforehand, outside the neural network. The signal is windowed and 160-dimensional spectrograms are computed from frames of 20 ms length and 10 ms overlap. The processed chunk is 160 frames long (1.6 seconds), corresponding to the time sequence processed at once by the network.
The network is considered to be recurrent-based, as described in Section 2.4.1, but its first layers are convolutional and act mostly as a preprocessing stage for the signal. The first two layers perform convolution in both time and feature space. The first layer applies 32 filters of size 41 x 11, with strides of 3 and 2 and paddings of 20 and 5, respectively, over the 160 x 160 input. The second layer receives the 54 x 81 x 32 output of the previous layer and also performs a convolution, using 32 filters of size 21 x 11 with strides of 1 and 2 and paddings of 10 and 5, respectively. This layer outputs a volume of 54 x 41 x 32.
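The sizes quoted above can be related through the standard convolution output-size formula, sketched below. The pairing of each kernel side with a stride and an axis is an assumption chosen so that the stated sizes are reproduced: the side of 11 is taken along the time axis and the sides of 41 and 21 along the frequency axis. The value 81 follows from a 161-bin frequency axis, which is also an assumption; with exactly 160 bins the same formula gives 80.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    # First layer, time axis: 160 frames, kernel 11, stride 3, padding 5.
    time_1 = conv_output_size(160, 11, 3, 5)
    # First layer, frequency axis: 161 bins is an assumption (see lead-in).
    freq_1 = conv_output_size(161, 41, 2, 20)
    # Second layer, time axis: kernel 11, stride 1, padding 5.
    time_2 = conv_output_size(time_1, 11, 1, 5)
    # Second layer, frequency axis: kernel 21, stride 2, padding 10.
    freq_2 = conv_output_size(freq_1, 21, 2, 10)
    print(time_1, freq_1, time_2, freq_2)
```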
The following three layers are all bidirectional recurrent layers. The sequence length is 41, corresponding to the feature dimensionality after the convolutional transforms. The input size of the first recurrent layer is 3776, equal to the time length after the convolution, 54, multiplied by the number of channels, 32, plus the RNN layer size, 2048. The input size of the other two recurrent layers is 4096, the sum of the size of the current RNN layer and the size of the previous RNN layer. A batch normalization operation is performed after each layer. The network is trained using the CTC loss function, explained in detail in Section 2.4.2. This is more than a regular loss function, because it not only provides a probability distribution over the output symbols, but also manages how the symbols succeed one another, in this respect having a role similar to that of the HMM.
The output of the network is 30-dimensional, representing the character set. If the language model is plugged in, the output size of the system becomes 200k and consists of the words existing in the model.
RWTH RETURNN
This model [67] is one of the RWTH RETURNN implementations for the LibriSpeech task. The system, presented in detail in [78], is an end-to-end one consisting of an attention-based encoder-decoder neural network architecture with recurrent layers. The network takes handcrafted features as input and outputs subword parts, created via byte-pair encoding (BPE) [79]. The final output is given by the language model, either an n-gram or an LSTM-based one (Section 2.5.2); both can be integrated by shallow fusion (Section 2.5.3).
The input is computed on the fly: 40-dimensional MFCC features are extracted using a window of 25 ms with a 10 ms shift, over a sequence of 2 seconds.
The encoder is composed of 6 stacked bidirectional LSTM layers with a hidden size of 1024. The input of the first layer is represented by the extracted features, while for the other layers the input is 2048-dimensional, being the concatenation of the forward and backward outputs of the previous layer. Dropout is applied after the forward and backward sublayers. The sequence length decreases due to the pooling operation: the initial sequence length is 200, and it is halved after the layers with index 0, 1 and 2.
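The evolution of the sequence length through the encoder can be traced with a short sketch. It assumes that pooling halves the length exactly after the layers with index 0, 1 and 2 and that the initial 200 frames follow from 2 seconds at a 10 ms shift; the function names are hypothetical.

```python
def initial_frames(duration_ms, shift_ms):
    """Number of frames obtained from a sequence, counted in frame shifts."""
    return duration_ms // shift_ms


def encoder_lengths(initial_length, num_layers, pooled_layers, factor=2):
    """Sequence length at the output of every encoder layer."""
    lengths = []
    length = initial_length
    for layer in range(num_layers):
        if layer in pooled_layers:
            length //= factor  # pooling after this layer shortens the sequence
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    frames = initial_frames(2000, 10)
    print(frames, encoder_lengths(frames, 6, {0, 1, 2}))
```

Under these assumptions the decoder attends over 25 encoder states instead of 200, an eightfold reduction.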
The output of the LSTM layers serves as input for three entities: the encoder context, the inverse fertility factor and the CTC mechanism. The encoder context represents the encoder state, i.e. the concatenation of the forward and backward hidden states from the 6^{th} LSTM layer, to which a pooling operation was applied to reduce their size to 1024.
The CTC is used as an additional loss function, in order to help the convergence. Using some recurrent links, which take over the previous output embedding of the decoder (y_{t}), as well as its hidden states (S_{t}), the weight feedback and the energy factors are computed. Based on these, the attention feedback factor is obtained, which controls the influence of each state of the encoder in obtaining each state of the decoder. The readout_in element takes as input the output embedding, the decoder hidden states as well as the encoder context weighted using the attention mechanism. The result is passed to the output_prob element, which provides a probability over the final network output, represented by 10026 subword parts.
The probabilistic model we used is the 4-gram with 200k words. Another language model we tried is a two-layer LSTM network, integrated by shallow fusion as a subnetwork at inference time. A detailed overview of the whole system is illustrated on the left side of Fig. 4.
Facebook CNN-ASG
This model [68] can be found in the Wav2Letter toolkit from Facebook and was specifically created to address the LibriSpeech task. The system around it is considered to be end-to-end and, depending on the recipe, it may vary from the following points of view:

it can take as input raw audio [80], power spectrum, MFCCs or Mel-filterbanks;

it can use a lexicon or it can be lexicon-free [81]; the lexicon acts like a phonetic model and consists of a mapping from words to their representation as a sequence of tokens, where the tokens are the acoustic units;

the system outputs a score over the acoustic units, which consist of phonemes, graphemes or word pieces;

as the output is represented by characters, the system may work without a language model, or an n-gram model or a neural network language model [56] can be plugged in by shallow fusion, as presented in Section 2.5.2.
We will refer to the fully convolutional recipe (Conv. GLU) [68], where the neural network takes as input Mel-filterbanks, also called Mel-Frequency Spectral Coefficients (MFSC), described in Section 2.2.1. The output of the network consists of scores over the character set. The system uses a lexicon and an n-gram language model. Another recipe, which differs slightly at the system level but uses a similar neural network, is the lexicon-free one [82].
Regarding the input of the system, this framework computes features on the fly, prior to running the neural network. The input of this network is represented by 40-dimensional Mel-filterbank features, extracted from audio frames of 25 ms length with a 10 ms shift, processing a sequence of 240 frames at once, which means 2.4 seconds.
The architecture of this network is fully convolutional, as explained in Section 2.4.1. It is composed of 17 1D time-convolution blocks, each one consisting of a weight normalization operation [83], the convolution itself, a Gated Linear Unit (GLU) [56] applied to the convolution output and, finally, dropout. The filter size increases by one from one layer to the next, the first value being 13 and the last 29. The stride value is always equal to 1. The padding value is also equal to 1, except for the first layer, which has a padding equal to 170. The number of output channels increases from one layer to another, from 400 for the first to 1816 for the last, each value being 10% greater than the previous one. The number of input channels is equal to 40 for the first block; for each following block it is equal to half the number of output channels of the previous block, due to the GLU dimensionality reduction. The GLU performs an element-wise product of the first half of its input and the second half, after the latter has been passed through the sigmoid function. After the convolutional blocks, the next layer is a reorder layer, so the number of input channels becomes the number of output channels, equal to 908.
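The GLU operation and the channel halving it causes can be sketched in a few lines of plain Python. This is only an illustration of the gating described above, operating on a flat list of channel values for a single time step; the function name is ours and the code is not taken from Wav2Letter.

```python
import math


def glu(values):
    """Gated Linear Unit over a flat list of channel values (even length)."""
    if len(values) % 2 != 0:
        raise ValueError("GLU needs an even number of input channels")
    half = len(values) // 2
    first, second = values[:half], values[half:]
    # First half multiplied element-wise by the sigmoid of the second half.
    return [a * (1.0 / (1.0 + math.exp(-b))) for a, b in zip(first, second)]


print(glu([1.0, -2.0, 0.0, 0.0]))  # [0.5, -1.0]
print(len(glu([0.0] * 1816)))      # 908 channels remain out of 1816
```

The second call shows why a block with 1816 output channels feeds only 908 channels to the next layer.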
The output layers are two final linear layers, to which weight normalization is applied, as well as the GLU and dropout for the second to last. They map 908 input channels to 1816 output channels and, respectively, 908 to 30, which is the output size of the network, i.e. the number of classes, where each class corresponds to a character. The final system’s output is 200k words, the number of unique words in the n-gram language model. The network is trained to output letters, which is possible thanks to the Auto Segmentation Criterion (ASG) training criterion, an improvement over CTC, both being explained at length in Section 2.4.2. Details of the whole architecture are shown on the left side of Fig. 5.
Facebook TDS-S2S
This model [69] is another Facebook end-to-end approach for the LibriSpeech task, implemented in the Wav2Letter framework. Similar to the Facebook CNN-ASG system, the system is composed of a single neural network and a lexicon, and it supports a plugged-in language model of the same types, convolutional [56] or n-gram; in our analysis we considered the latter.
The input of the system is the same as in the Facebook CNN-ASG approach, except for the size of the filterbanks, which is equal to 80 in this case: 80-dimensional filterbank vectors are extracted on the fly from audio frames of 25 ms length with a 10 ms shift. The sequence length is 240 frames, equivalent to 2.4 seconds processed at once.
The network is a sequence-to-sequence attention-based encoder-decoder. The encoder is represented by a time-depth separable (TDS) convolutional neural network, described in Section 2.4.1. The decoder is a simple recurrent layer, based on Gated Recurrent Units (GRU) [49]. The GRU is a recurrent cell similar to the LSTM, but with 3 sets of weights instead of 4 (an update gate, a reset gate and a candidate state), as it has no separate output gate. The convolutional encoder has an advantage over the recurrent approach, because of the ease with which it can be parallelized.
The encoder architecture comprises 11 TDS blocks and 3 interleaved, subsampling, 2D convolutional layers: one before the 1^{st} TDS block and the others before the 3^{rd}, respectively the 6^{th} TDS block. The input and the output of each 2D convolution have the shape T x w x c, where T is the time length, w is the feature length and c is the number of channels.
In the time domain, these convolutional layers perform a subsampling operation, halving the sequence length after each of them, due to the stride equal to 2. The total subsampling factor is 8. The filter size is always 21, while the padding is equal to 10. In the feature domain, the feature size always remains equal to 80, because the convolution uses a filter and a stride both equal to 1, while the padding is 0. At the same time, each subsampling brings an increase in the number of output channels, due to the time compression: the values are 10, 14 and 18.
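The shapes produced by the three subsampling layers can be checked with the usual output-length formula for a strided convolution. The sketch below is only a sanity check under the values quoted above (input of 240 frames, filter 21, stride 2 and padding 10 in time; filter 1, stride 1 and padding 0 in the feature domain); the helper names are illustrative and do not come from Wav2Letter.

```python
def conv_output_length(length, kernel, stride, padding):
    """Standard output-length formula for a strided convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def subsampling_shapes(time_length=240, feature_size=80, channels=(10, 14, 18)):
    """Return the (T, w, c) shape after each subsampling 2D convolution."""
    shapes = []
    for c in channels:
        time_length = conv_output_length(time_length, kernel=21, stride=2, padding=10)
        feature_size = conv_output_length(feature_size, kernel=1, stride=1, padding=0)
        shapes.append((time_length, feature_size, c))
    return shapes


shapes = subsampling_shapes()
print(shapes)                         # [(120, 80, 10), (60, 80, 14), (30, 80, 18)]
print(240 // shapes[-1][0])           # total subsampling factor: 8
print(shapes[-1][1] * shapes[-1][2])  # flattened size w * c: 1440
```

The flattened size w x c after the last subsampling layer, 80 x 18 = 1440, matches the input dimension of the linear layer that closes the encoder.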
Each TDS block is composed of a 2D convolution over the time and feature space, similar to the one previously described, but without subsampling in the time domain. It is followed by a ReLU layer and a normalization layer. The TDS block also contains 2 convolutional layers with 1x1 kernel, acting like fully connected layers and separated by a ReLU nonlinearity. Layer normalization is applied after the last one. They take an input of shape T x 1 x wc and produce an output of the same size. Throughout, the number of input and output channels, the filter length and the stride are equal to 1, while the padding is 0.
After the last TDS block, a reorder layer interchanges the time and feature dimensions. This is followed by a linear layer, which takes a 1440-dimensional input and provides a 1024-dimensional output. The output of the encoder is represented by word embeddings.
The GRU decoder has a hidden size of 512 and takes the 1024-dimensional output of the encoder. It integrates the attention mechanism, which performs the alignment. The objective function is a simple log probability over the word sequence. Finally, the network is able to classify over almost 10k word parts, representing the output token set. As in the previous cases, the use of a language model constrains the output to its vocabulary size, i.e. 200k words. The architecture and its component blocks are illustrated on the right side of Fig. 5.
Nvidia Jasper
Jasper [70] is an end-to-end implementation in the OpenSeq2Seq toolkit from Nvidia, created as an approach for the LibriSpeech task. The system comprises a single neural network, without the need for a phonetic model. It takes preprocessed features as input and provides characters at the output. The framework allows integration with a probabilistic or a Transformer-XL neural network language model [57], as mentioned in Section 2.5.2.
The input of the network is represented by 64-dimensional log-filterbanks, extracted using a 20 ms frame length with a 10 ms shift, while the sequence length processed at a time is 160 frames, equivalent to 1.6 seconds.
This network is based on a fully 1D convolutional architecture, which uses deep residual connections. They work as a bypass over the convolutional blocks, mitigating the vanishing gradient problem. The convolutions are only in the time domain, being similar to a time-delay network.
The architecture is a 10x5 Jasper network, composed of 10 blocks, each one having 5 subblocks. Each subblock performs 1D convolution, batch normalization, ReLU and dropout. The convolution is applied in the time domain, with a filter size that is shared by two consecutive blocks and increases with depth, taking in turn the values 11, 13, 17, 21 and 25. The stride always has a value of 1, and the padding is set so that the length of the output sequence matches the length of the input sequence. All subblocks in a block have the same number of output channels; this number is also shared by two consecutive blocks and grows with the depth of the network, taking the values 256, 384, 512, 640 and 768. The residual connections are represented by 1x1 convolutions followed by batch normalization. They link the input of each subblock to the output of the block. Therefore, there are 5 residual connections in each block.
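The block layout described above can be written down as a small configuration table. The sketch below is an illustration only: it assumes that, for an odd filter size k with stride 1 and no dilation, a padding of (k - 1) / 2 keeps the sequence length unchanged, and the dictionary keys and function names are ours rather than OpenSeq2Seq identifiers.

```python
KERNELS = (11, 13, 17, 21, 25)
CHANNELS = (256, 384, 512, 640, 768)


def jasper_blocks(sub_blocks=5):
    """List the 10 blocks of a 10x5 network with their filter size and channels."""
    blocks = []
    for kernel, channels in zip(KERNELS, CHANNELS):
        for _ in range(2):  # each setting is shared by two consecutive blocks
            blocks.append({
                "kernel": kernel,
                "channels": channels,
                "padding": (kernel - 1) // 2,  # assumed length-preserving padding
                "sub_blocks": sub_blocks,
            })
    return blocks


def conv_output_length(length, kernel, padding, stride=1):
    return (length + 2 * padding - kernel) // stride + 1


blocks = jasper_blocks()
print(len(blocks))                         # 10 blocks
print(sum(b["sub_blocks"] for b in blocks))  # 50 subblocks in total
print({conv_output_length(160, b["kernel"], b["padding"]) for b in blocks})  # {160}
```

With this padding, the 160 input frames are preserved through every block, which is the behaviour stated in the text.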
The network starts with a plain convolutional layer and ends with two others. The first one learns 256 channels from 1 input channel, while the last two learn 896 channels from 768 and 1024 from 896. The last layer, a fully connected layer, performs a 1x1 convolution, where the number of output channels is 28. This number corresponds to the character set, over which a probability distribution is provided. The network is trained using the CTC criterion, which makes a character-based output possible. The final output of the system is given by the language model; the same 4-gram with 200k words was used in our experiments.
A new optimizer, called NovoGrad [84], is used in this work. This is similar to Adam, but it computes the second moments per layer, instead of per weight. It helps the network to be more stable and the memory consumption is halved, compared to Adam. The entire system is depicted on the left side of Fig. 6.
Nvidia QuartzNet
QuartzNet [85] is an end-to-end implementation in the NeMo toolkit from Nvidia, designed as a more efficient variant of Jasper in terms of the number of parameters and operations. As with Jasper, the system comprises a single neural network that takes preprocessed features as input and provides characters as output, and integration with either a probabilistic or a neural network language model is supported.
The input of the network is represented by 64-dimensional log-filterbanks, extracted using a 20 ms frame length with a 10 ms shift, while the sequence length processed at a time is 160 frames, equivalent to 1.6 seconds.
This network is based on a fully 1D convolutional architecture with residual connections. The main difference from Jasper is the introduction of time-channel separable convolutions, a variation of the time-depth separable convolution described earlier in Section 2.4.1.
The architecture we consider is the largest and most accurate variant presented in [85], a 15x5 QuartzNet network.
ASR comparison and evaluation. Case study on LibriSpeech
In this section, the implementations presented above, which are specific to the LibriSpeech ASR task, are evaluated and compared in terms of accuracy and hardware requirements.
These were first analyzed at the system level, and then in more detail at the neural network component level. Both Kaldi-based implementations are multicomponent systems, while the other implementations are end-to-end systems. In the first case, the acoustic model, the phonetic dictionary and the language model are different components that work together as a system. The acoustic model is a hybrid: time-delay neural network (TDNN) + Hidden Markov Model (HMM). In the second case, a single neural network performs the phonetic and the linguistic modeling, in addition to the acoustic modeling. While in the multicomponent systems the decoding language model is mandatory and only the rescoring language model is optional, in the end-to-end systems the decoding language model is optional as well. All the implementations we analyzed and tested use different handcrafted features as input. The Kaldi-based implementations are the only ones using a combination of two kinds of features, MFCCs and i-vectors, while the others use a single type of features. From the point of view of the network architectures, different variations of convolutional and recurrent networks are used. Multicomponent systems use cross-entropy or chain loss objective functions, based on the LF-MMI cost function, while end-to-end systems use more complex mechanisms: sequence-to-sequence attention, CTC or ASG. These work as loss functions, but also play the role of an HMM, performing the task of aligning the sequences. In terms of output, the neural networks in hybrid approaches provide posterior probabilities of phonetic units, while the other networks output characters or word parts. At the system level, all systems were used in combination with a probabilistic 4-gram language model with a vocabulary of 200k words, this value representing the final output size of the system. A summary of the characteristics of each architecture can be found in Table 4.
Evaluation of model complexity
The purpose of the complexity assessment is to determine which of the studied architectures are suitable for embedded systems. An embedded system that is constrained in computational power and memory can only integrate an ASR system with few operations, activations and parameters. We further describe how we performed the complexity computation. The worst-case scenario is considered, in which all the parameters of the network are kept in memory throughout the inference.
To determine the complexity of each algorithm (model size, number of operations and activations), several operations were performed: (i) the source code was analyzed, (ii) the log files were inspected at the time of inference and (iii) the inference was run step by step in debugger mode. We summarized the information about each layer, such as its type, the input and output dimension, as well as other layer specific additional details. Based on these, we calculated the complexity corresponding to each layer. Table 5 presents the formulas used to perform these calculations.
The number of parameters represents the number of weights learned by the network. In fully connected layers, this is obtained as the product of the input size and the output size of the layer. Time-delay layers are computed similarly, with this product being further multiplied by the context size, i.e. the number of vectors at different time frames that are considered. In convolutional layers, the number of parameters is given by the product of the filter size (which can be unidimensional or bidimensional), the number of input filters and the number of output filters. In the recurrent layers, we used a multiplication formula with four factors. The first factor is the sum of the input size (the feature size or the output size of the previous layer) and the size of the current recurrent layer. The second factor is the output size of the layer, usually equal to the size of the recurrent layer. The third factor is the number of gates, which depends on the recurrent cell type: 1 for RNN, 3 for GRU, 4 for LSTM. Finally, the fourth factor is equal to 1 or 2, indicating whether the layer is unidirectional or bidirectional.
The multiply-accumulate operation (MAC) is defined as the product of two numbers, which is added to an accumulator. In our context, this represents the matrix multiplications that take place in neural networks. The formula is similar to the one used for the calculation of the parameters, additionally multiplied by the feature vector size and the temporal length of the sequence processed by the network.
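The parameter formulas described in words above can be sketched as follows. This is our reading of the text, not a reproduction of Table 5: biases and normalization parameters are ignored, and the function names are illustrative. The example values are layer sizes quoted in the system descriptions.

```python
from math import prod

# Number of weight sets per recurrent cell type, as listed in the text.
GATES = {"rnn": 1, "gru": 3, "lstm": 4}


def fc_params(input_size, output_size):
    return input_size * output_size


def tdnn_params(input_size, output_size, context_size):
    return input_size * output_size * context_size


def conv_params(filter_size, in_filters, out_filters):
    # filter_size is a tuple: (k,) for 1D or (k_time, k_feature) for 2D.
    return prod(filter_size) * in_filters * out_filters


def recurrent_params(input_size, layer_size, cell="lstm", bidirectional=True):
    directions = 2 if bidirectional else 1
    return (input_size + layer_size) * layer_size * GATES[cell] * directions


# Bidirectional LSTM layer with 2048-dimensional input and size 1024.
print(recurrent_params(2048, 1024, "lstm", True))  # 25165824
# 1D convolution with filter 13, 40 input and 400 output channels.
print(conv_params((13,), 40, 400))                 # 208000
# Linear layer mapping 1440 inputs to 1024 outputs.
print(fc_params(1440, 1024))                       # 1474560
```

A single bidirectional LSTM layer of this size already accounts for about 25 M weights, which gives a feeling for why stacked recurrent encoders are large.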
The total number of operations (Ops) is equal to twice the number of MACs, because each of them involves a multiplication operation and a summation operation.
The number of activations represents the number of outputs of each layer. In fully connected and time-delay layers, this is obtained as the output size multiplied by the output sequence length in time. In the case of convolutional layers, the formula is similar, with the difference that the output size can be unidimensional or bidimensional and is also multiplied by the number of output filters. The activations in the recurrent layers are obtained by multiplying the size of the recurrent layer (equal to the output size), the temporal sequence length and the directionality factor, equal to 1 or 2.
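The relations between parameters, MACs, operations and activations can be sketched in the same way. Since Table 5 is not reproduced here, the code follows the textual description only; the function names are ours, and the example layer (a 1440-to-1024 linear layer applied to 30 time steps of a 240-frame input) is chosen purely for illustration.

```python
def macs(num_params, out_time_length, out_feature_size=1):
    """MACs = parameters x output feature size x output time length."""
    return num_params * out_feature_size * out_time_length


def ops(num_macs):
    """Each MAC counts as one multiplication and one addition."""
    return 2 * num_macs


def fc_activations(output_size, out_time_length):
    return output_size * out_time_length


def conv_activations(out_time_length, out_filters, out_feature_size=1):
    return out_time_length * out_feature_size * out_filters


def recurrent_activations(layer_size, time_length, bidirectional=True):
    return layer_size * time_length * (2 if bidirectional else 1)


def per_frame(total, input_frames):
    """Scale a per-sequence count to a single input frame."""
    return total / input_frames


layer_macs = macs(1440 * 1024, out_time_length=30)
print(layer_macs)                       # 44236800
print(ops(layer_macs))                  # 88473600
print(per_frame(ops(layer_macs), 240))  # 368640.0
print(fc_activations(1024, 30))         # 30720
```

Dividing by the number of input frames is what makes networks that process sequences of different durations comparable.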
Comparison of ASR systems in terms of model complexity
The size of the various models described in the previous section varies between tens of millions (18 M for Kaldi CNN-TDNN and Nvidia QuartzNet) and hundreds of millions (333 M for Nvidia Jasper) of parameters.
Kaldi TDNNs involve learning a smaller number of weights. This is thanks to the moderate number of layers (17 and 18), as well as the constant dimensions of the inputs and outputs of the layers throughout the entire network. The time-channel separable convolutions in Nvidia QuartzNet also lead to a small number of parameters.
Although they have about twice as many parameters as Nvidia QuartzNet and the Kaldi-based systems, PaddlePaddle’s DeepSpeech2 implementation and the TDS-S2S from Facebook are economical in terms of learnable parameters. For DeepSpeech2, the reduced number of layers in the encoder (2 convolutional and 3 recurrent), as well as the size (2048) and type of the recurrent ones (without gating mechanisms), directly influence the number of parameters. In the case of TDS-S2S, the number of filters varies between 10 and 18 across the 11 TDS blocks of the encoder, while the size of the filters remains constant. The convolutions are interleaved with fully connected layers.
The recurrent encoder-decoder with attention in RWTH RETURNN is among the models with a large number of parameters (187 M). This is due to the 6 bidirectional recurrent layers in the encoder, each having a size of 1024. Moreover, the recurrent cells are LSTMs, and each of the 4 characteristic gates implies additional parameters. The dependencies between the components of the attention mechanism, as well as the recurrent structure of the decoder, also contribute.
The Facebook CNN-ASG implementation has a large number of parameters. The main cause is the large number of filters, between 400 and 1816, which grows steadily along the 17 time-convolutional blocks. The filter sizes also increase from layer to layer, the smallest having a length of 13 and the largest 29.
The most expensive implementation in terms of model size is, by far, Nvidia Jasper. Although this is also a time-delay network, the large number of parameters is due to the depth of the network. There are 10 dense residual blocks and 4 simple convolutional layers, resulting in a total of 54 layers and 50 residual connections. Each block in turn consists of 5 repeated convolutional subblocks. Also, the number of filters used is very high, gradually increasing from 256 to 1024.
The number of operations was scaled to the frame level and reaches the giga order. It is somewhat correlated with the number of parameters: a large number of parameters leads to a large number of operations and vice versa. The computation of the number of operations is crucially influenced by the temporal length of the sequence, given that the network does not process all the data at once, but sequences of a certain length. Because the networks process input sequences of different durations, the number of operations has been scaled so that it corresponds to the processing of a single frame, which makes the results comparable. The fewest operations are required by the Facebook TDS-S2S implementation, in which the sequence length decreases towards the upper layers of the network. The same thing happens in other convolutional implementations, such as in the convolutional layers of DeepSpeech2. The recurrent layers usually retain the same temporal length of the sequence. An exception occurs in the recurrent layers of the RWTH RETURNN implementation: a pooling operation with a factor of 2 is applied, halving the temporal length of the sequence after each of the first 3 recurrent layers of the network. QuartzNet performs a number of operations that is one order of magnitude larger than in the case of the previous architectures. The number of operations is quite high in the Facebook CNN-ASG fully convolutional algorithm. Although the temporal length of the sequence decreases towards the upper layers of the network, it is slightly higher than in the other algorithms.
By far the most operations are required by the Jasper implementation from Nvidia. The difference is up to 3 orders of magnitude compared to the other networks. The time sequence length is kept constant, with the help of padding. The depth of the network also plays an important role.
The number of activations represents the total dimension of the layer outputs of each network. These are of the order of millions: the fewest activations are found in the Facebook TDS-S2S implementation, followed closely by the PaddlePaddle DeepSpeech2 implementation, due to the small size of the network, while the most activations are found in the implementations from Nvidia.
To summarize this analysis, we can conclude that Nvidia Jasper is the most complex model, with the largest number of parameters and the largest number of operations needed to process a frame of speech. Following closely are the recurrent encoder-decoder from RWTH and the fully convolutional CNN-ASG model from Facebook. With similar model sizes (187M vs. 208M), the first requires significantly more memory to store the 38M activations per frame, while the latter requires significantly more processing power to perform the 22M operations per frame. QuartzNet has a small number of parameters, comparable to the Kaldi-based architectures, but in terms of memory it has a medium load, due to the high number of activations. Its number of operations is also high, but lower than in the case of the Nvidia Jasper or Facebook CNN-ASG architectures.
At the other end, Kaldi’s implementations are the lightest models, with only around 20M parameters to be stored in memory, and among the fastest, with around 40M to 60M operations to be performed for each speech frame. The Facebook TDS-S2S model follows closely, with a slightly larger model size putting pressure on the system’s memory, while requiring three to four times less processing power to process one speech frame.
Comparison of ASR systems in terms of performance
The previous sections presented in detail the ASR models. We described the work flow and the various components of the systems, emphasizing the complexities of the systems in terms of memory and processing power requirements. The current section aims to analyze the performance of the ASR systems.
The performance evaluation is conducted in terms of word error rate (WER [%]) at system level for each implementation. This is calculated as the total number of transcription errors (insertions, substitutions and deletions), relative to the total number of words in the ground truth. The error rates presented in Table 6 are either reported by the authors in scientific papers or provided in technical reports in the frameworks' repositories.
All the models presented in Table 6 were trained on the training subsets of LibriSpeech, comprising a total of 960 hours of speech (see Table 7). Some ASR systems are able to produce high-quality transcriptions without the need for an additional language model. Therefore, we present results in two scenarios:

the end-to-end neural network used on its own, without an additional language model for rescoring (left side of Table 6) and

the full ASR system used in conjunction with an external language model: a probabilistic non-pruned 4-gram (fglarge) [86] (right side of Table 6).
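For reference, the WER definition above corresponds to a word-level edit distance normalized by the length of the reference. The following sketch is a generic implementation of this definition, not the scoring tool used by any of the evaluated frameworks; it assumes whitespace tokenization and no text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```

Because insertions are counted as errors, the WER can exceed 100% when the hypothesis is much longer than the reference.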
Note that Facebook implementations also work in the absence of a language model, but the results for this scenario were not reported so far.
The evaluation is performed on the two LibriSpeech test datasets: test-clean, which comprises clean speech, and test-other, which comprises speech recorded in more challenging acoustic environments (see Table 7).
The first conclusion that can be drawn is that the complex models (RWTH RETURNN and Nvidia Jasper) obtain similar results regardless of whether an external language model is used or not. By contrast, Facebook TDSS2S and PaddlePaddle DeepSpeech2, which also have the ability to work without an external LM, perform poorly in this situation.
The best results are obtained with Nvidia QuartzNet: 2.98% WER on clean speech and 8.38% on non-clean speech. It is closely followed by the Nvidia Jasper and Kaldi-based systems. The systems from Facebook and RWTH RETURNN follow in this hierarchy, while PaddlePaddle DeepSpeech2 is at the end of this ranking.
Tradeoffs between ASR performance and hardware requirements
This subsection concludes the evaluation section by discussing the various tradeoffs between ASR system performance and hardware requirements for the systems analyzed. While in previous sections we compared the ASR systems with respect to model complexity and performance separately, we are now interested in how the model complexities translate into hardware requirements. Moreover, it is important to understand whether higher hardware requirements lead to improved ASR accuracy or not. Finally, this analysis should pinpoint the ASR systems that meet the requirements for embedded speech applications.
Based on the data in Table 4, we estimated the memory requirements for each ASR system, by making the following assumptions: (i) the model parameters and activations that need to be stored in memory are 4-byte numbers and (ii) every ASR system needs to store in memory the model and at least the network activations required to process 1 second of speech. The memory load results are presented in Table 8.
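Under assumptions (i) and (ii), the memory load reduces to a one-line computation. The sketch below illustrates it with hypothetical values (20 M parameters and 5 M activations for one second of speech), which are not taken from Table 8; the function name and the use of decimal megabytes are also our own choices.

```python
BYTES_PER_VALUE = 4  # assumption (i): parameters and activations are 4-byte numbers


def memory_load_mb(num_parameters, activations_per_second):
    """Memory needed for the model plus the activations of 1 s of speech, in MB."""
    total_bytes = BYTES_PER_VALUE * (num_parameters + activations_per_second)
    return total_bytes / 1e6  # decimal megabytes


# Hypothetical system: 20 M parameters, 5 M activations per second of speech.
print(memory_load_mb(20_000_000, 5_000_000))  # 100.0
```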
In Table 8 we also present the required throughput of the hardware system. The throughput is the number of operations per second that the hardware should be able to perform in order to process speech data in real time (i.e. process one second of speech in one second). Finally, for the sake of simplicity, we only express the ASR performance in terms of the WER obtained on the most popular LibriSpeech evaluation scenario: the ASR uses an external LM for language rescoring and the evaluation is performed on test-clean (i.e. the subset that contains speech recorded in clean acoustic environments).
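The real-time throughput can be derived from the per-frame operation count as sketched below. The sketch assumes that one frame corresponds to the 10 ms feature shift, i.e. 100 frames per second of speech; this assumption is ours, as is the function name. The example uses the rounded range of 40 M to 60 M operations per frame quoted earlier for the Kaldi models, not the exact values in Table 8.

```python
def real_time_throughput(ops_per_frame, frame_shift_ms=10):
    """Operations per second needed to process 1 s of speech in 1 s."""
    frames_per_second = 1000 // frame_shift_ms  # 100 frames for a 10 ms shift
    return ops_per_frame * frames_per_second


for ops_per_frame in (40_000_000, 60_000_000):
    giga_ops = real_time_throughput(ops_per_frame) / 1e9
    print(f"{ops_per_frame} ops/frame -> {giga_ops} GOPS")
# 40000000 ops/frame -> 4.0 GOPS
# 60000000 ops/frame -> 6.0 GOPS
```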
The data shows that RWTH RETURNN, Nvidia Jasper, Nvidia QuartzNet and Facebook CNN-ASG are prohibitive with respect to memory load. They require between 14x and 235x more memory than the lightest system (Kaldi CNN-TDNN). At the other end, the Kaldi-based systems, PaddlePaddle DeepSpeech2 and Facebook TDS-S2S require similar amounts of memory: between 100 and 200 MB^{Footnote 1}.
The tradeoff between ASR performance and memory requirements is presented as a tradeoff diagram in Fig. 7 (left). Although Nvidia QuartzNet requires a large amount of memory, it rests on the Pareto front thanks to its low WER (high performance). The Kaldi-based systems are both on the Pareto front because neither dominates the other (i.e. TDNN is better in terms of performance, and CNN-TDNN is better in terms of memory requirements).
With regard to computational power requirements, the data clearly shows that Nvidia Jasper and Facebook CNN-ASG are, again, prohibitive. They perform between 1400x and 2800x more operations than the fastest system (Facebook TDS-S2S). Facebook TDS-S2S is also significantly faster (2.7x) than the next fastest system (i.e. Kaldi TDNN) and the subsequent systems.
The tradeoff between ASR performance and computational power requirements is presented as a tradeoff diagram in Fig. 7 (right). Nvidia QuartzNet and Facebook TDS-S2S are extremes on the Pareto front: the first has the best performance, while the latter requires the smallest number of operations to process one second of speech. As a tradeoff between the two, Kaldi TDNN is also on the Pareto front: it is significantly faster than Nvidia QuartzNet and more accurate than Facebook TDS-S2S.
Finally, we also analyzed the tradeoff in a 3-dimensional scenario: for an embedded application one would be interested in a simultaneous tradeoff between ASR performance, memory and computational power requirements. The results are presented in Table 8 (see last column). Among the eight systems that were analyzed, Nvidia QuartzNet, Facebook TDS-S2S and Kaldi CNN-TDNN are extremes on the Pareto front, being the best systems in terms of performance, throughput requirements and memory requirements, respectively. As a tradeoff between the three, Kaldi TDNN still provides a Pareto-optimal design point, dominating each of the other three systems in two measures.
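The notion of Pareto front used in this subsection can be made concrete with a short sketch. A system is on the front if no other system is at least as good in every measure and strictly better in at least one; here, lower is better for WER, memory and throughput. The systems and numbers below are hypothetical placeholders, not the values in Table 8, and the function names are ours.

```python
def dominates(a, b):
    """True if a is no worse than b in every metric and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(systems):
    """Return the names of the systems that are not dominated by any other."""
    front = []
    for name, metrics in systems.items():
        others = (m for other, m in systems.items() if other != name)
        if not any(dominates(m, metrics) for m in others):
            front.append(name)
    return front


# Hypothetical (WER [%], memory [MB], throughput [GOPS]) triples.
systems = {
    "A": (3.0, 900.0, 50.0),  # most accurate, but heavy
    "B": (4.0, 150.0, 1.5),   # fewest operations
    "C": (3.8, 100.0, 4.0),   # smallest memory footprint
    "D": (5.0, 200.0, 6.0),   # worse than B and C in every metric
}
print(pareto_front(systems))  # ['A', 'B', 'C']
```

In this toy example, systems A, B and C are each the best in one measure and therefore cannot be dominated, while D is dominated by both B and C.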
Conclusion
This article presented an overview of the fundamentals of automatic speech recognition systems and their evolution over the last years. The general architecture of an ASR system was presented, as well as the various approaches for each component part. We summarized the most popular types of speech features and the way that they can be extracted and improved. We reviewed the acoustic modeling algorithms, from traditional probabilistic models, which were replaced by neural networks in hybrid approaches, to end-to-end models using pure neural networks without an alignment model. We also presented the main approaches to language modeling for ASR. Based on this overview, several conclusions can be drawn with respect to the current trends in the field of automatic speech recognition.
Although pipeline ASR systems represented the state of the art up to now, end-to-end systems are on an ascending trend and will most likely replace them successfully in the near future. Our analysis showed that only one end-to-end system (Nvidia QuartzNet) has managed to surpass the state-of-the-art pipeline system (Kaldi TDNN) in terms of accuracy, having a similar number of parameters, but with a greater cost in terms of the number of operations required to process one frame of speech.
From the point of view of the acoustic features, many implementations, including state-of-the-art systems, still use traditional features, such as MFCCs or i-vectors.
Kaldi, for example, combines MFCC and i-vector features in various ways, depending on the neural layers that process the features further. Moreover, in Kaldi several feature transforms and data augmentation techniques (e.g. speed and noise perturbations) are used. However, we can clearly see the attempt to migrate from handcrafted features towards the raw waveform. Many networks use special input layers that act like a feature extractor front-end. Although the trend is clear and many attempts were made to perform automatic speech recognition from the raw waveform, current end-to-end ASR implementations still use spectrograms or filterbanks as input.
Acoustic modeling is performed with neural networks in all state-of-the-art ASR implementations. Most approaches use recurrent networks and convolutional networks. The main disadvantage of recurrent networks is that they are difficult to parallelize. Moreover, bidirectional networks require special tricks if they are to be used for online transcription, as the entire signal is not available in this scenario. The use of residual connections has become common practice: they improve the propagation of data through the network, sometimes working as a shortcut over some layers.
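The shortcut role of a residual connection mentioned above can be illustrated with a minimal sketch. The names `relu_layer` and `residual_block` are hypothetical, and the toy layer (a scaling followed by a ReLU) only stands in for a learned layer; none of this is taken from the implementations analyzed in this article.

```python
from typing import Callable, List

Vector = List[float]


def relu_layer(x: Vector, scale: float) -> Vector:
    """Toy stand-in for a learned layer: scale the input, then apply ReLU."""
    return [max(0.0, scale * value) for value in x]


def residual_block(x: Vector, layer: Callable[[Vector], Vector]) -> Vector:
    """Return layer(x) + x, so the input bypasses the layer via a shortcut."""
    return [out + skip for out, skip in zip(layer(x), x)]


x = [1.0, -2.0, 3.0]
y = residual_block(x, lambda v: relu_layer(v, scale=0.5))
print(y)
```

Even where the toy layer outputs zero (the negative input), the original value still reaches the output through the shortcut, which is the property that eases the propagation of data over deep networks.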
While the acoustic units modelled in the state-of-the-art systems are phones or subphonetic units (e.g. senones), the trend is to migrate away from these intermediate representations and to model characters, word parts or even words directly. As there are already effective implementations that output characters or word parts (6 of the 8 implementations analyzed), it is clear that the phonetic dictionary of a traditional pipeline ASR system will soon become obsolete.
In terms of language modeling, pipeline systems, such as those based on Kaldi, necessarily depend on a language model. End-to-end systems can optionally use a language model as an add-on: its use leads to better results, but it is not crucial. All results can be further improved by using a rescoring model, which takes the initial transcript and corrects it. Regarding the type of language model, probabilistic n-gram models are still widely used, but neural approaches, with convolutional or recurrent language models, are superior in terms of accuracy, although more computationally expensive.
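The rescoring step described above can be sketched as a re-ranking of first-pass hypotheses. This is a minimal illustration assuming a simple log-linear combination of scores, which is one common option and not necessarily the scheme used by the systems analyzed here; `rescore_nbest`, `toy_lm_log_prob`, the weight and all probabilities are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank first-pass hypotheses with an external language model.

    Each hypothesis is (text, first_pass_log_score). The combined score is
    first_pass_log_score + lm_weight * lm_log_prob(text); best comes first.
    """
    rescored = [(text, score + lm_weight * lm_log_prob(text)) for text, score in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy unigram "language model" with made-up probabilities (illustration only).
UNIGRAM = {"recognize": 0.02, "speech": 0.03, "wreck": 0.001, "a": 0.05,
           "nice": 0.01, "beach": 0.005}


def toy_lm_log_prob(text: str) -> float:
    """Sum of unigram log-probabilities; unseen words get a small floor value."""
    return sum(math.log(UNIGRAM.get(word, 1e-6)) for word in text.split())


# The first pass slightly prefers the wrong transcript; the LM corrects it.
first_pass = [("wreck a nice beach", -4.0), ("recognize speech", -4.5)]
best_text, best_score = rescore_nbest(first_pass, toy_lm_log_prob)[0]
print(best_text)
```

A plain sum of log-probabilities favors shorter hypotheses, so practical setups usually tune the language model weight and often add a length-related term; both are left out here for brevity.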
Apart from the overview of automatic speech recognition, we conducted an in-depth analysis of eight different ASR implementations for the LibriSpeech ASR task, in order to identify the frameworks that might be suitable for integration in embedded applications with a minimal drop in performance. We evaluated the TDNN and CNN-TDNN architectures from Kaldi, the RETURNN attention-based encoder-decoder from RWTH, the fully convolutional CNN-ASG and the TDS-S2S from Facebook, DeepSpeech2 from PaddlePaddle, and the dense-residual Jasper and QuartzNet from Nvidia. We described and compared the features used as input, the types and sizes of layers and blocks of layers, the loss functions, and the output types and their link to output words. The analysis aimed to offer insight into whether these models are suitable for integration in embedded applications. To this end, we first expressed the complexity of the models in terms of model size, number of activations and number of operations required to process one frame of speech. Finally, we translated these metrics into hardware requirements, such as memory load and minimum throughput, which can be used directly to decide which ASR implementation suits given hardware constraints. To the best of our knowledge, this is the first article to present such an analysis.
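The translation from model complexity to hardware requirements described above can be sketched as follows. This is a minimal illustration and not the exact procedure used in the analysis: the function names, the 4-byte (32-bit float) storage assumption, the frame rate and all numeric values are hypothetical placeholders, and the memory estimate covers only the network parameters and the activations for one second of speech.

```python
from dataclasses import dataclass


@dataclass
class ModelComplexity:
    """Complexity of an acoustic model (all values are illustrative)."""
    num_parameters: int          # trainable weights in the network
    activations_per_frame: int   # activation values produced per input frame
    ops_per_frame: int           # operations needed to process one frame


def memory_load_mb(model: ModelComplexity, frames_per_second: int,
                   bytes_per_value: int = 4) -> float:
    """Memory to hold the model plus activations for 1 s of speech, in MB.

    Assumes every parameter and activation is stored with the same width
    (4 bytes, i.e. 32-bit floats, unless stated otherwise).
    """
    values = model.num_parameters + model.activations_per_frame * frames_per_second
    return values * bytes_per_value / 1e6


def min_throughput_gops(model: ModelComplexity, frames_per_second: int) -> float:
    """Minimum sustained throughput (GOPS) needed for real-time processing."""
    return model.ops_per_frame * frames_per_second / 1e9


# Hypothetical model: 20 M parameters, 50 k activations and 40 M operations
# per frame, with 100 frames per second (a 10 ms frame shift).
example = ModelComplexity(20_000_000, 50_000, 40_000_000)
print(memory_load_mb(example, frames_per_second=100))       # memory load in MB
print(min_throughput_gops(example, frames_per_second=100))  # throughput in GOPS
```

Comparing these two numbers against the memory and compute budget of a target device gives a first, coarse indication of whether a model could run there in real time; quantization to fewer bytes per value would lower the memory figure accordingly.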
The conclusions that emerged from this analysis are noteworthy. We showed that some end-to-end implementations (i.e. RWTH RETURNN, Nvidia Jasper, Nvidia QuartzNet and Facebook CNN-ASG) are prohibitive for embedded applications due to their memory requirements: they require between 14x and 235x more memory than the lightest system (i.e. Kaldi CNN-TDNN). At the other end, the Kaldi-based systems, PaddlePaddle DeepSpeech2 and Facebook TDS-S2S require similar, comparatively small amounts of memory.
With regard to computational power requirements, we conclude that Nvidia Jasper and Facebook CNN-ASG are, again, not suitable for embedded applications. They perform between 1400x and 2800x more operations than the fastest system (i.e. Facebook TDS-S2S). Facebook TDS-S2S is also significantly faster (2.7x) than the next fastest system (i.e. Kaldi TDNN) and the systems that follow it.
For an embedded application, one would be interested in a simultaneous trade-off between ASR performance, memory requirements and computational power requirements. In this respect, our trade-off analysis showed that Nvidia QuartzNet, Facebook TDS-S2S and Kaldi CNN-TDNN are extremes on the Pareto front, being the best systems in terms of performance, throughput requirements and memory requirements, respectively. As a trade-off between the three, Kaldi TDNN still provides Pareto-optimal design points, dominating each of the other systems in two out of the three measures.
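The notion of a Pareto front used in this trade-off analysis can be made concrete with a short sketch. The system names and metric values below are placeholders and not the measurements reported in this article; `dominates` and `pareto_front` are hypothetical helper names, and all three metrics are assumed to be "lower is better".

```python
from typing import Dict, List, Tuple

# (WER in %, memory in MB, throughput in GOPS); lower is better for all three.
Metrics = Tuple[float, float, float]


def dominates(a: Metrics, b: Metrics) -> bool:
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(systems: Dict[str, Metrics]) -> List[str]:
    """Return the names of the systems that no other system dominates."""
    return [
        name
        for name, metrics in systems.items()
        if not any(dominates(other, metrics)
                   for other_name, other in systems.items()
                   if other_name != name)
    ]


# Placeholder values, NOT the measurements reported in the article.
systems = {
    "system_a": (3.0, 500.0, 20.0),  # most accurate
    "system_b": (5.0, 40.0, 0.5),    # lowest throughput requirement
    "system_c": (4.5, 30.0, 2.0),    # lowest memory requirement
    "system_d": (6.0, 600.0, 30.0),  # worse than system_a on every metric
}
print(pareto_front(systems))
```

A system on the front cannot be improved on one metric without giving up another, so the choice among the remaining candidates depends on which constraint of the target hardware is tightest.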
Availability of data and materials
Not applicable.
Notes
1. This is only the amount of memory required to load the neural model and store all the activations of the network for processing 1 second of speech. More memory might be needed for other components, such as the language model.
Abbreviations
AM: Acoustic Model
ASG: Auto Segmentation Criterion
ASR: Automatic Speech Recognition
BLSTM: Bidirectional Long Short-Term Memory
BPE: Byte-Pair Encoding
CE: Cross-Entropy
CNN: Convolutional Neural Network
CTC: Connectionist Temporal Classification
DNN: Deep Neural Network
DCT: Discrete Cosine Transform
E2E: End-to-End
EM: Expectation-Maximization
FFT: Fast Fourier Transform
GLU: Gated Linear Unit
GMM: Gaussian Mixture Model
GOPS: Giga Operations per Second
GRU: Gated Recurrent Unit
HMM: Hidden Markov Model
JFA: Joint Factor Analysis
LDA: Linear Discriminant Analysis
LF-MMI: Lattice-Free Maximum Mutual Information
LM: Language Model
LSTM: Long Short-Term Memory
MAC: Multiply-Accumulate Operation
MB: Megabytes
MFCC: Mel-Frequency Cepstral Coefficients
MFSC: Mel-Frequency Spectral Coefficients
MMI: Maximum Mutual Information
NN: Neural Network
OOV: Out-of-Vocabulary
Ops: Operations
PD: Phonetic Model
ReLU: Rectified Linear Unit
RNN: Recurrent Neural Network
RWTH: Rheinisch-Westfälische Technische Hochschule Aachen (Aachen University)
S2S: Sequence-to-Sequence
SVD: Singular Value Decomposition
TDNN: Time-Delay Neural Network
TDNN-F: Factored Time-Delay Neural Network
TDS: Time-Depth Separable
WER: Word Error Rate
WSJ: Wall Street Journal
Acknowledgements
Not applicable.
Funding
This work was partly supported by a grant of the Romanian Ministry of Research and Innovation, CCCDI – UEFISCDI, project number PNIIIP11.2PCCDI20170818 / 73PCCDI, within PNCDI III.
Author information
Authors’ contributions
ALG was responsible for summarizing the methods and writing the manuscript. AP provided guidance to ALG in performing the experiments and the subsequent analyses. ALG and HC organized the manuscript structure and content. MB supervised the entire project and provided suggestions on the writing. All authors have read and approved the final manuscript. The contributions of all authors are considered to be equal.
Authors’ information
ALG graduated with a Master’s degree from the University Politehnica of Bucharest, Romania, where he is currently pursuing his Ph.D. studies as a member of the Speech and Dialogue research laboratory. His main research interest is artificial intelligence applied to speech technology, including speech recognition and speaker recognition. He carried out this work in 2019, during an internship at Xilinx Research Labs, Dublin.
AP is a Senior Engineer at Xilinx Research in Dublin, Ireland, where he works at the intersection of hardware and software for machine learning acceleration. He earned his Bachelor’s degree from Politecnico di Milano, Italy, and his Master’s degree from the University of Illinois at Chicago, USA.
HC received the Ph.D. degree in electronics and telecommunications from University Politehnica of Bucharest (2011), where he has served as Associate Professor since 2017. His research interests include machine learning and artificial intelligence, with a special focus on automatic speech and speaker recognition methods. He authored over 75 scientific papers in international conferences and journals. He was awarded the Romanian Academy prize “Mihail Drăgănescu” (2016) for outstanding research contributions in Spoken Language Technology, after developing the first automatic speech recognition system for the Romanian language.
MB is a Distinguished Engineer at Xilinx Research in Dublin, Ireland, where she heads a team of international scientists driving research to define new application domains for Xilinx devices, such as machine learning. She earned her Master’s degree from the University of Kaiserslautern in Germany and brings over 25 years of experience in computer architecture, FPGA and board design, gained in research institutions (ETH Zurich and Bell Labs) and development organizations. She is heavily involved with the international research community, serving as the technical co-chair of FPL’2018, workshop organizer (H2RC, ITEM’2020), and member of numerous technical program committees (FPL, ISFPGA, DATE, etc.).
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Georgescu, AL., Pappalardo, A., Cucu, H. et al. Performance vs. hardware requirements in state-of-the-art automatic speech recognition. J AUDIO SPEECH MUSIC PROC. 2021, 28 (2021). https://doi.org/10.1186/s13636-021-00217-4
DOI: https://doi.org/10.1186/s13636-021-00217-4
Keywords
 Automatic speech recognition
 Survey
 End-to-end ASR systems
 Deep learning
 Performance analysis