Multicandidate missing data imputation for robust speech recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2012, Article number: 17 (2012)
Abstract
The application of Missing Data Techniques (MDT) to increase the noise robustness of HMM/GMM-based large vocabulary speech recognizers is hampered by a large computational burden. The likelihood evaluations imply solving many constrained least squares (CLSQ) optimization problems. As an alternative, researchers have proposed frontend MDT or have made oversimplifying independence assumptions for the backend acoustic model. In this article, we propose a fast Multi-Candidate (MC) approach that solves the per-Gaussian CLSQ problems approximately by selecting the best from a small set of candidate solutions, which are generated as the MDT solutions on a reduced set of cluster Gaussians. Experiments show that the MC MDT runs as fast as the uncompensated recognizer while achieving the accuracy of the full backend optimization approach. The experiments also show that exploiting the more accurate acoustic model of the backend does pay off in terms of accuracy when compared to frontend MDT.
1. Introduction
One of the major concerns in deploying Automatic Speech Recognition (ASR) applications is the lack of robustness of the technology when compared to human listeners. A key aspect is the sensitivity to background noise. This effect is caused by the differences between the conditions in which the statistical models for speech are trained and those in which they are applied in real-life situations. Many approaches that reduce this mismatch to improve the noise robustness of speech recognition have been proposed. They modify either the frontend signal preprocessing or the backend acoustic model of the recognizer. A popular frontend method is the Advanced Front-End [1], which applies multiple stages of Wiener filtering to remove the background noise from the corrupted observations. Other techniques working in the frontend are, e.g., spectral subtraction [2], Stereo Piecewise Linear Compensation for Environment [3] and the Vector Taylor series compensation algorithm [4]. Some examples of backend approaches are Parallel Model Combination (PMC) [5] and model adaptation algorithms, such as Maximum Likelihood Linear Regression (MLLR) [6] and Maximum A Posteriori probability (MAP) based adaptation [7].
In the late 1990s, Missing Data Techniques (MDT) were introduced in speech recognition as a perceptually motivated approach to improve the noise robustness of a speech recognizer. Research in Auditory Scene Analysis (ASA) [8] proposed models for the capability of human listeners to deal with concurrent signals. The human auditory system is able to extract sufficient information from the speech source of interest in order to recognize what is said, even if parts of the target signal are masked by other signals. It exploits the redundancy in the speech signal and can thus handle missing data. The motivation of MDT is to explore these capabilities of human listeners and exploit them in ASR to reduce the performance gap between humans and computers. It relies on the model that a given spectral band at a given time is dominated by either speech or noise. In the frontend preprocessing, the time-frequency regions of a speech signal are labeled as reliable or as unreliable. This labeling information is encoded into a so-called missing data mask. In the backend decoding, features in the unreliable regions are either ignored or predicted to alleviate the mismatch. This compensation strategy relies only on the speech model and, unlike PMC for instance, it does not require a model of the noise, though some assumptions about the noise are required instead while generating the missing data mask [9, 10]. In recent years, the MDT was extended to techniques such as the glimpsing model [11] and speech fragment decoding [12]. Other related work includes the propagation of uncertainty [13], where the authors transform the uncertainty encoded in the binary mask from the spectral domain to the cepstral domain, and handle the transformed uncertainty with the cepstral backend acoustic models. The authors of [14] introduce a two-pass MDT system, where the lattice generated by the MDT recognizer in the first pass is rescored. In the second pass, a state-based hypothesis test then generates the so-called "integrated mask", yielding better recognition results.
The two major problems in MDT are first estimating the mask and then exploiting these masks during recognition. Identifying the 'missing' part during recognition is an essential step in MDT as proposed by Cooke et al. [15] and Lippmann and Carlson [16]. A missing data detector makes a binary decision about which spectrotemporal components are unreliable due to noise distortions and which remain reliable, i.e., are dominated by speech. Approaches to missing data mask estimation, such as Bayesian classification [17], harmonic mask estimation [10], local SNR-based mask estimation [18, 19], and VQ mask estimation [9], mainly exploit characteristics of the speech signal. The authors of [20] estimate the missing data masks based on computational ASA. More approaches can be found in a survey on missing data mask estimation by Cerisara et al. [21]. The concept of binary reliability masks can be extended to soft masks [22] when uncertainty about the reliability information is taken into account. The mask then assumes continuous values instead of binary values. Soft masks are not considered in this article, as we have found them to provide little benefit in [23].
Several paradigms have been designed to apply MDT once the masks are computed. MDT was first formulated for a spectral acoustic model [15], which is referred to as spectral MDT in this article. The spectral energy within each unreliable component can be either reconstructed based on the acoustic model and the reliable information, or marginalized out of the probability density functions (PDF) of the HMM states. The former scheme is defined as imputation and the latter is defined as marginalization. In order to improve the performance of MDT, Raj et al. [24], Van hamme [25], Cerisara [26], Häkkinen and Haverinen [27], and Faubel et al. [28] applied MDT using cepstral acoustic models, which are referred to as cepstral MDT in this article. The experimental results of cepstral MDT demonstrate its advantage over the spectral model. The authors of [24] used MDT imputation to enhance the speech features in the frontend, while in Maximum Likelihood (ML) Gaussian-based imputation [25] and in conditional mean imputation [28], the authors consider MDT imputation associated with Gaussians in the backend.
The above work addresses the robustness of the MDT system rather than its efficiency. MDT systems involve much more intensive computation in the backend, as explained in Section 3. This was already noticed in [15], where the problem was addressed by compromising on the acoustic model (diagonal Gaussians for log-spectral features). An alternative solution is to formulate MDT as a frontend technique [24]. In this article, we propose a Multi-Candidate (MC) MDT which not only produces competitive recognition accuracy, but also possesses the same efficiency as a conventional large vocabulary recognizer under noisy conditions. We advocate the backend approach, since it exploits the most accurate speech model that is available in the recognizer to compensate for the missing data. Each HMM state represents an accurate hypothesis about what the missing speech could be, integrating all knowledge that is available in the decoder: acoustics, lexical information, and language model. Hence, we expect more accurate missing data imputation than with frontend MDT approaches, where such sophistication is not available. In our setting, we go beyond the state level and compute a clean speech vector per Gaussian. In addition to the entire set of Gaussians embedded in the HMM, a fairly small set of Gaussians is trained to function as cluster Gaussians (CG). They provide feasible candidates (i.e., they satisfy the constraints for the imputed data, as described in Section 2.1) of imputations for the entire set of Gaussians. As such, instead of solving the full optimization problem for each Gaussian in the acoustic model, candidate solutions are selected from the CG and the most likely one is retained. Therefore, implementation of MC MDT requires only a modest modification of conventional HMM-based recognizers. The MC MDT forms the main contribution of this article. It is an algorithm that aims at computational gains for large vocabulary speech recognizers without sacrificing accuracy or robustness. It provides a solution for applying MDT to an existing backend model trained for the speech feature vector of one's choice. Furthermore, we show experimentally that we gain more immunity to noise than if MDT is applied as a frontend feature-enhancement technique [24] and compare several methods for solving the imputation problem.
Figure 1 shows the architecture of the MC MDT system. This article is focused on the three blocks in the middle, i.e., the imputation of the (cluster of) CG and MC imputation for the backend Gaussians (BGs). The rest of this article is arranged as follows: in Section 2, we introduce the conventional state-based imputation and marginalization [15] as well as the spectral reconstruction [24]. In Section 3, we discuss MDT imputation under the framework of ML decoding and why it becomes difficult when using a model trained with decorrelated features such as cepstral features or features generated by, e.g., linear discriminant analysis (LDA) [29]. Section 4 describes the approach of MC MDT imputation using CG to speed up the Gaussian-based imputation. Section 5 explains how to further speed up the imputation of the CG by selecting a subset of the CG dynamically. Section 6 describes several experimental results. Finally, in Section 7 we present our conclusions and propose future work.
2. Spectral and Cepstral MDT systems
In this section, we review some of the concepts of MDT that lead to approaches that are most related to the proposed system.
2.1. Bounds
Environmental noises are assumed to be additive in the spectral domain. Hence, at frame t, the log-spectra of the underlying complete clean speech x_{ t } can be assumed to be approximately bounded above by the observed noisy feature vector y_{ t }, namely:
$\mathbf{x}_{t}\le \mathbf{y}_{t}$ (1)
where the inequality sign for vectors applies componentwise. Both x_{ t } and y_{ t } can be partitioned into their reliable and unreliable subvectors according to the mask:
$\mathbf{x}_{t}=\left(\mathbf{x}_{t,r},\mathbf{x}_{t,u}\right),\quad \mathbf{y}_{t}=\left(\mathbf{y}_{t,r},\mathbf{y}_{t,u}\right)$
For the reliable spectrotemporal regions, the observed noisy features are deemed to be pure speech:
$\mathbf{x}_{t,r}=\mathbf{y}_{t,r}$
whereas for the unreliable regions, the observed features act merely as upper bounds for the clean speech:
$\mathbf{x}_{t,u}\le \mathbf{y}_{t,u}$
2.2. State-based imputation and marginalization
The authors of [15] formulated several MDT approaches which use the acoustic models trained in the same domain in which the masks are expressed and in which the constraints of Equation (1) hold. In their experiments, the acoustic feature vectors are obtained via a 64-channel auditory filter bank with center frequencies spaced linearly on an ERB scale from 50 Hz to 8 kHz. The HMM-based speech recognizer is adapted to accommodate MDT by modifying the state likelihood evaluation as outlined below. Each HMM state is expressed as a mixture of multivariate Gaussians with a diagonal covariance matrix. The MDT here is carried out frame-by-frame and is assumed independent across frames. The authors proposed both state-based imputation and marginalization. Besides the upper bound y_{ t, u }, a lower bound can also be applied to control the arbitrariness of compensation for the unreliable components. This idea can be applied to all methods described in this article. However, for consistency, we will omit lower bounds from this article.
In state-based marginalization, each state output PDF is a function of the reliable components only, while the unreliable components are marginalized out, i.e., each unreliable component is integrated over the range of values it can assume. The PDF of a state s is given by:
where G(s) represents the set of Gaussians belonging to the Gaussian Mixture Model (GMM) of state s. The integral of Gaussian k can be calculated using the componentwise error function because its covariance matrix is assumed to be diagonal.
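As an illustration of state-based marginalization with diagonal Gaussians, the following Python sketch multiplies, per mixture component, the Gaussian density of the reliable components by the Gaussian cumulative distribution of the unreliable components, which is where the error function enters. The function and argument names are hypothetical, and the integration range is assumed to run from minus infinity up to the observation because lower bounds are omitted in this article.

```python
import math

def norm_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def norm_cdf(x, mean, var):
    """Integral of a univariate Gaussian from -infinity to x, via erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def state_marginal_likelihood(y, reliable, weights, means, variances):
    """Sketch of a marginalized state likelihood for a diagonal GMM.

    y         : observed log-spectral vector (list of floats)
    reliable  : list of booleans, True where the mask marks a component reliable
    weights   : mixture weights of the Gaussians in G(s)
    means     : per-Gaussian mean vectors
    variances : per-Gaussian diagonal variances
    """
    total = 0.0
    for w, mean, var in zip(weights, means, variances):
        lik = w
        for i, yi in enumerate(y):
            if reliable[i]:
                lik *= norm_pdf(yi, mean[i], var[i])   # reliable: evaluate density
            else:
                lik *= norm_cdf(yi, mean[i], var[i])   # unreliable: integrate up to bound
        total += lik
    return total
```

For a single standard-normal component with one reliable and one unreliable dimension observed at zero, the result is the density at zero times one half.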
In state-based imputation, the clean speech is imputed for every state s, followed by calculating the likelihoods using the imputed values, which will be utilized to expand hypotheses in the search space during decoding. Two ways of imputing the clean speech per state are given: linear combination or winner-takes-all.
In linear combination, the Minimum Mean Square Error (MMSE) estimate of the imputation from state s is
where μ_{ u, k } is the unreliable subvector of mean of Gaussian k and
In winner-takes-all, after the clean speech is imputed for each Gaussian belonging to state s, the mixture's likelihood is evaluated for all imputed values and the most likely imputation is selected as the imputation of the state. In other words, the imputation of state s is approximated by the clean speech vector imputed from its $\widehat{k}$ th member Gaussian:
where $\widehat{k}=\underset{k\in G\left(s\right)}{argmax}\phantom{\rule{0.1em}{0ex}}p\left({\widehat{\mathbf{x}}}_{t,u,k}\mid s\right)$. ${\hat{\mathbf{x}}}_{t,u,k}$ is the maximum likelihood imputation of the unreliable subvector x_{ t, u, k } for Gaussian k included in G(s):
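This problem has a closed form solution:
${\hat{\mathbf{x}}}_{t,u,k}=\begin{cases}{\mathbf{\mu}}_{u,k}, & {\mathbf{\mu}}_{u,k}\le {\mathbf{y}}_{t,u}\\ {\mathbf{y}}_{t,u}, & \text{otherwise}\end{cases}$ (2)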
where it should be understood that we have written the solution vectorially for convenience, but the top or bottom case in (2) may apply to different vector components. The imputation using a spectral acoustic model containing Gaussians with a diagonal covariance matrix has an analytical solution because the components of the log-spectral features are considered to be independent. However, the spectral features do have correlation among their components and the spectral GMM used above is not very effective at modeling this. The performance of HMM speech recognizers using GMMs with diagonal covariance is significantly better when using decorrelated features, e.g., Mel Frequency Cepstral Coefficients (MFCC). Therefore, a cepstral MDT model with diagonal-covariance Gaussians is introduced in the following section.
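The per-Gaussian closed-form imputation and the winner-takes-all selection can be sketched in Python as follows. This is a minimal illustration under the diagonal-covariance assumption of this section. All function names are hypothetical, and the componentwise minimum of mean and observation is used as the bounded ML solution for the unreliable components.

```python
import math

def impute_diag(mean, y, reliable):
    """Bounded ML imputation for one diagonal Gaussian.

    Reliable components are copied from the observation; each unreliable
    component takes the Gaussian mean unless the mean exceeds the bound y.
    """
    return [yi if r else min(m, yi) for m, yi, r in zip(mean, y, reliable)]

def diag_gauss_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian."""
    p = 1.0
    for xi, m, v in zip(x, mean, var):
        p *= math.exp(-0.5 * (xi - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return p

def winner_takes_all(y, reliable, weights, means, variances):
    """Return the per-Gaussian imputation that maximizes the state's mixture PDF."""
    def mixture(x):
        return sum(w * diag_gauss_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances))
    candidates = [impute_diag(m, y, reliable) for m in means]
    return max(candidates, key=mixture)
```

In a two-Gaussian state where one mean lies far from the reliable observation, the imputation of the closer Gaussian wins.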
2.3. Spectral reconstruction
In [24], the authors reconstruct the spectral features using either a correlation-based method or a cluster-based method. The reconstructed spectra are then transformed into cepstra for processing by the speech recognizer.
The correlation-based approach solves the imputation of unreliable components at each frame by exploiting the correlations among the components in the spectrotemporal representation. The correlation is modeled by a Gaussian wide-sense stationary (WSS) process whose parameters are learned from training data. The core of the algorithm is a bounded MAP estimate:
where y_{ t, n } is the neighborhood vector containing all the related reliable components which are spectrally and temporally sufficiently close to x_{ t, u } as defined by the WSS model. The likelihood p(x_{ t, u } | y_{ t, n }) is modeled with a full covariance Gaussian conditioned on the observed y_{ t, n }. The authors establish an iterative approach to solve (3).
In the cluster-based approach, the distribution of the observation is modeled by a spectral GMM with M mixture components with full covariance. Each of these mixture components is called a CG and is trained by the Expectation-Maximization (EM) algorithm. The unreliable components of the reconstructed spectra are obtained from a linear combination of the values imputed for the CG:
where
is the imputation resulting from the m th CG, a bounded optimization problem which can be solved by the MAP algorithm as in the correlation-based approach. P(m | x_{ t, u, m } ≤ y_{ t, u }, y_{ t, r }) is the posterior probability of the CG given the reliable data and the feasible region for the unreliable data.
To make computation of this posterior probability tractable, the spectral CGs are assumed to be diagonal in this circumstance.
In both the correlation-based and the cluster-based method, the reconstruction is separated from the decoding and there is only one single imputation per frame, while in the spectral state-based imputation of Section 2.2, each state or Gaussian has its own imputation, which is theoretically more suitable for an ML-based recognizer. The likelihood of each state is calculated at its imputed value and used in the backend of the recognizer which incorporates the lexical and grammatical knowledge to drive the path pruning in the beam.
It should be noted that the authors of [15] show that state-based marginalization outperforms state-based imputation. Therefore, it would be natural to formulate marginalization for cepstral or other decorrelated models as well. However, this leads to definite integration of full covariance Gaussians. Even if the approximations described in [28] were applied to marginalization, the computational complexity would not be acceptable for a practical speech recognizer. Hence, we only focus on imputation with decorrelated models.
3. Missing data imputation for maximum likelihood decoding
State-of-the-art automatic speech recognizers take a Bayesian approach, i.e., the decoding process is to find a sequence of words $\widehat{\mathbf{W}}$ whose posterior probability is maximal given a T-frame sequence of observations y_{1...T}:
where the language model P(W) is the probability of a hypothesized word sequence W. In practice, the most likely state sequence s_{1...T} that realizes W is found. In MDT, the maximization should be additionally taken over the unreliable features to be imputed, i.e., x_{1...T, u}, to find the optimal imputation ${\widehat{\mathbf{x}}}_{1\dots T,u}$ bounded by the noisy observation y_{1...T, u}.
For a given state sequence s_{1...T} with W embedded, the complete speech is given by the following expression, where we have assumed state-conditional independence of x_{1...T, u}:
where a is the product of the transition probabilities between the states on the hypothesized path. The maximization in Equation (7) can be accomplished frame-by-frame, i.e., the optimal clean speech at time t is obtained by the maximization of the output PDF of state s over the complete speech x_{ t } bounded by the observation y_{ t }:
Equation (8) formulates an ML state-based missing data imputation. The constrained optimization in (8) is not computationally tractable. If each member Gaussian in a state output PDF is assumed to impute its own clean speech using MLE:
MDT imputation becomes ML Gaussian-based imputation, which is an approximation of the state-based imputation but is computationally more tractable. It will be shown in Section 6.4.4 that (8) and (9) yield comparable recognition accuracy.
If the model used for imputation is trained with cepstral features or other decorrelated features, such as LDA [29] or HLDA [30] features, Gaussian k can be formulated in the log-spectral domain after the corresponding linear transformation C of full row-rank is applied:
where C^{+} represents the pseudo inverse of C, μ _{ k }, Σ_{ k } are the mean and diagonal covariance of the transformed features and D_{ m } denotes the dimension of the decorrelated feature vectors. Instead of maximizing probabilities, we can equivalently minimize the cost function:
with the precision matrix
the maximization of (9) over x_{ t, k } becomes:
Notice that H_{ k } can be singular (e.g., when the cepstral features have fewer dimensions than the log-spectral features), in which case a k-dependent small fraction of the identity matrix is added to regularize H_{ k }, so that a unique solution of (11) is found. Since H_{ k } is not diagonal, the bounded minimization in (11) can no longer be solved by Equation (2). Instead, it becomes a Constrained Least Squares (CLSQ) problem, which does not have an analytical solution. Methods such as the MAP algorithm [24], primal active set methods [31], Multiplicative Updates (MU) [32], and imputation with PROSPECT features [33] have been proposed. However, their computational cost for large vocabulary speech recognizers with tens or hundreds of thousands of Gaussians becomes prohibitive. Below, the MC MDT imputation is proposed to significantly reduce the computational load and thereby obtain a fast MDT recognizer.
4. MC missing data imputation
In (11), Gaussian-based imputation is formulated as searching for the optimal clean speech vector within a feasible region, i.e., the (continuous) subspace which is spanned by the unreliable components and is bounded by the observation. This process can be approximated by evaluating each Gaussian on a list of feasible clean speech candidates and then selecting the candidate which maximizes the likelihood as the imputed value. This approximation is the basic idea behind the MC MDT imputation. For every Gaussian, the list of candidates is given by the imputation from a small set of CG. The Gaussians in the acoustic model, typically a large number, will be called BGs in the remainder of the article. The optimization for each BG in (11) is then approximated by selecting the ${\widehat{l}}_{t,k}$-th clean speech candidate such that:
where Ω_{ k } represents all the CGs which might generate suitable solutions for Gaussian k, and ${\tilde{\mathbf{x}}}_{t,u,l}$ is the clean speech estimate of the unreliable speech components obtained from CG l. The construction of Ω_{ k } will be detailed in Section 4.3. Hence, in MC MDT, solving the CLSQ problem of BG k is replaced by L_{ k } likelihood evaluations, where L_{ k } is the cardinality of Ω_{ k }. Whereas solving a large number of BG imputation problems is avoided, the task is shifted to the restricted set of CGs. Solving each of these problems requires a computational effort that is at least an order of magnitude greater than the evaluation of a Gaussian likelihood, so various approaches for the imputation with CGs are discussed below.
4.1. ML imputation for CG
The imputed value from the CGs can be computed by iterative approaches such as Gradient Descent (GD) [33], MU [32], or MAP [24]. In GD, the gradient for iteration τ is
where each negative component of g_{ t, k } ^{τ} , for which x_{ t, k } is on the boundary y_{ t }, is zeroed and so is each reliable component of g_{ t, k } ^{τ} in order to not violate the constraints. Since the cost function in Equation (10) is quadratic, the optimal step for iteration τ has an analytic expression:
The step direction is maintained, but the step size is reduced such that the boundary constraints are not violated.
To initialize the GD algorithm, the non-diagonal covariance structure is ignored, i.e., it starts from the solution in Equation (2).
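The GD procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments. The cost is assumed to be the quadratic form 0.5 (Cx - μ)ᵀ Σ⁻¹ (Cx - μ) with diagonal Σ, so that the gradient is Cᵀ Σ⁻¹ (Cx - μ). For brevity the iteration starts from the observation y, which is feasible, rather than from the solution of Equation (2). All function names are hypothetical.

```python
def matvec(A, v):
    """Compute A v for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def mat_t_vec(A, v):
    """Compute A^T v without forming the transpose."""
    return [sum(A[j][i] * v[j] for j in range(len(A))) for i in range(len(A[0]))]

def gd_impute(C, mu, var, y, reliable, n_iter=20, tol=1e-12):
    """Bounded gradient descent for one Gaussian (assumed quadratic cost).

    C        : linear transform from log-spectra to decorrelated features
    mu, var  : mean and diagonal variances in the transformed domain
    y        : noisy log-spectral observation (upper bound)
    reliable : True where the mask marks a component reliable
    """
    x = list(y)  # feasible starting point (assumption, see lead-in)
    for _ in range(n_iter):
        # Gradient evaluated right to left: C^T (Sigma^-1 (C x - mu)).
        resid = [(p - m) / s for p, m, s in zip(matvec(C, x), mu, var)]
        g = mat_t_vec(C, resid)
        for i in range(len(x)):
            # Freeze reliable components and components pushing past the bound.
            if reliable[i] or (g[i] < 0.0 and x[i] >= y[i]):
                g[i] = 0.0
        gg = sum(v * v for v in g)
        if gg < tol:
            break
        Cg = matvec(C, g)
        gHg = sum(c * c / s for c, s in zip(Cg, var))
        if gHg <= 0.0:
            break
        step = gg / gHg  # exact line search for a quadratic cost
        for i in range(len(x)):
            if g[i] < 0.0:  # moving upward: shorten the step to stay below y
                step = min(step, (y[i] - x[i]) / (-g[i]))
        x = [min(xi - step * gi, yi) for xi, gi, yi in zip(x, g, y)]
    return x
```

With a single unreliable component the problem is one-dimensional, so the exact line search reaches the constrained optimum in one step.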
We opt for GD rather than MU [32] or MAP [24] because it benefits from several advantages simultaneously: (i) the number of iterations required for practical convergence is smaller [33], (ii) the gradient computation (13) can be carried out from right to left such that only a small number of matrix-vector multiplications and vector operations are required (see Appendix), (iii) only the constant transformation matrix C, the observation, mean, and variances need to be copied to the cache memory of the CPU while other methods may require a larger memory access bandwidth, (iv) GD does not require square root operations like MU, hence the total number of arithmetic operations per iteration is less than that of MU (as shown in Table 1).
The computational effort is further reduced by using PROSPECT features together with GD, which is proposed in [33].
PROSPECT features are composed of two feature subsets. The first consists of cepstral features of a low order D_{ c } (e.g., D_{ c } = 4), which model the rough shape of the spectrum at time t. This cepstral part is given by
where C_{ c } denotes the reduced DCT matrix with orthonormal rows. The remaining details of the signal are captured by
which is termed the projection part because it is the orthogonal projection of x_{ t } on the orthogonal complement of the subspace spanned by the rows of C_{ c }. The concatenation of the cepstral part and the projection part is referred to as PROjected SPECTral (PROSPECT) features:
The PROSPECT transformation matrix is
The likelihood of the k th Gaussian based on the PROSPECT features is formulated as
where
and
where ${\mathbf{\mu}}_{k}^{c}$, ${\mathbf{\Sigma}}_{k}^{c}$, ${\mathbf{\mu}}_{k}^{\perp}$, and ${\mathbf{\Sigma}}_{k}^{\perp}$ are the means and covariance matrices of cepstral and projection part of PROSPECT Gaussian k, respectively. They are all estimated on data using the EM algorithm and both ${\mathbf{\Sigma}}_{k}^{c}$ and ${\mathbf{\Sigma}}_{k}^{\perp}$ are diagonal. However, a diagonal ${\mathbf{\Sigma}}_{k}^{\perp}$ implies invalid independence assumptions in the spectral residual ${\mathbf{v}}_{t}^{\perp}$. Hence, the stream exponent β in (14) is introduced to reduce the impact of these assumptions. According to [33], a typical value of β is 0.5. Note that F(v_{ t } | k) is not a strict PDF because it does not integrate to unity due to β, but we will still refer to it as the likelihood of Gaussian k. When substituting (15) and (16) in (14), the cost function of Gaussian k becomes
where
The gradient computation and cost function evaluation now involve only multiplication of small matrices and vector additions, which exploits the CPU's cache memory more efficiently and reduces the computational effort in comparison to a cepstral (or LDA) model, as witnessed by Table 1. Refer to the Appendix for details. The study [33] also shows that the PROSPECT model performs as well as the cepstral model for a recognizer without MDT. Because of their better efficiency and comparable accuracy, the PROSPECT features are preferred for the CGs and the algorithm selected for minimizing (17) is GD. Since the CGs only serve to generate candidate spectra, there is no need for the CG and the BG to be expressed in the same feature domain. For example, in the experiments of Section 6, the BGs will be trained with the features generated by the Mutual Information-based Discriminant Analysis (MIDA) technique [34].
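The decomposition into a cepstral part and a projection part can be illustrated with the Python sketch below. It assumes that the reduced DCT matrix consists of the first D_c rows of an orthonormal DCT-II, which is one common choice and not necessarily the exact matrix used in [33]. The function names are hypothetical.

```python
import math

def reduced_dct(d_c, dim):
    """First d_c rows of an orthonormal DCT-II matrix of size dim (assumed form of C_c)."""
    rows = []
    for j in range(d_c):
        scale = math.sqrt((1.0 if j == 0 else 2.0) / dim)
        rows.append([scale * math.cos(math.pi * j * (i + 0.5) / dim)
                     for i in range(dim)])
    return rows

def prospect(x, C_c):
    """Split a log-spectral vector x into its cepstral and projection parts.

    cep  = C_c x                    (rough spectral shape, d_c values)
    proj = x - C_c^T C_c x          (residual, orthogonal to the rows of C_c)
    """
    cep = [sum(c * xi for c, xi in zip(row, x)) for row in C_c]
    back = [sum(C_c[j][i] * cep[j] for j in range(len(C_c)))
            for i in range(len(x))]
    proj = [xi - bi for xi, bi in zip(x, back)]
    return cep, proj
```

Because the rows of the assumed C_c are orthonormal, applying C_c to the projection part yields a vector of (numerical) zeros, so the two parts carry complementary information.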
4.2. Training the CGs
Clustering methods for Gaussians can be categorized as model-based or data-driven. In the former methods, such as the popular K-means, the parameters of the CGs are estimated from parameters of the BGs. In the latter methods, parameters of the CGs are estimated from training data. Model-based Gaussian clustering methods are not well suited to create the CGs in MC MDT, because they would involve a transformation between the domains in which CGs and BGs are expressed. For instance, MIDA CGs can be first evaluated using MIDA BGs and then converted into PROSPECT CGs. But this conversion involves a lossy transformation and hence performance cannot be guaranteed. Therefore, approaches driven by data are selected in this study.
In order to obtain the CGs from training data, a compact HMM is trained. The compact model shares its structure with the backend model containing the BGs in the sense that it uses the same phonetic decision tree (PDT) [35], but it has only M Gaussians which are shared among leaves of the PDT. Hence, every HMM state s will have an associated set of CGs as well as a set of BGs, denoted by G_{CG}(s) and G_{BG}(s), respectively. Typically, M is a few hundred and M << K, where K is the total number of BGs. These M Gaussians are used as the CGs and can be trained for any feature representation. The parameters to be trained are the PROSPECT means and covariance matrices of the CGs as well as the mixture weights P_{CG}(m | s). Before training the compact model, a state level segmentation is made using the Viterbi algorithm with the backend model, i.e., the segmentation specifies the alignment between the states and the frames of the training data. M BGs are randomly selected to initialize the CGs. However, since BGs and CGs may be expressed on different feature sets, a particular initialization of the CGs is required. To this end, the M retained BGs are considered as a GMM with uniform weights. The posterior probabilities of the M BGs are calculated on the MIDA representation and are used in the first iteration of the EM algorithm, i.e., the BG posteriors are used to softly assign training samples to the CGs to initialize the mean, covariance and mixture weights. Subsequently, a standard EM training without altering the segmentation is performed using PROSPECT features. Consequently, each tied state is now modeled by a GMM with up to M components trained on PROSPECT features. Finally, every BG can be assigned to multiple CGs to form a soft clustering, as explained below.
4.3. Association between CGs and BGs
The association between the CGs and the BGs is based on the same segmentation used in Section 4.2. In this step, the likelihood of all the BGs belonging to state s at training frame t is calculated along the Viterbi path. The likelihood of the PROSPECT CGs belonging to s is calculated for the same frames. Then CG ${\widehat{m}}_{t}$ and BG ${\widehat{k}}_{t}$ are found by
and
where F(Ry_{ t } | m) is calculated by Equation (14). E represents the linear transformation of the backend features. For every training frame of speech t, entry $\left({\widehat{m}}_{t},{\widehat{k}}_{t}\right)$ of the association matrix Φ (as shown in Figure 2) is incremented by 1. After all training data are processed, the set Ω_{ k } in Equation (12) is defined from the k th column of Φ as those entries that are larger than the product of a pruning threshold θ_{ Φ } and the maximum of the k th column. Moreover, if Ω_{ k } contains more than L_{max} elements, only the L_{max} largest Φ values are retained. The entries of Φ that are not in Ω_{ k } are subsequently set to zero.
The relative frequency with which CG m is associated with any BG is given by
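The construction of the association matrix and of the candidate sets can be sketched in Python as follows. The sketch assumes that the per-frame winners (CG index, BG index) have already been determined along the Viterbi path. The names `build_association`, `candidate_sets`, `theta` and `l_max` are hypothetical stand-ins for Φ, Ω_{ k }, θ_{ Φ } and L_{max}.

```python
def build_association(pairs, n_cg, n_bg):
    """Count how often CG m and BG k win together over the training frames.

    pairs : iterable of (m_hat, k_hat) index pairs, one per training frame
    Returns phi as a list of n_cg rows with n_bg columns.
    """
    phi = [[0] * n_bg for _ in range(n_cg)]
    for m, k in pairs:
        phi[m][k] += 1
    return phi

def candidate_sets(phi, theta, l_max):
    """Derive the candidate CG list of every BG from the columns of phi.

    A CG is kept for BG k if its count exceeds theta times the column
    maximum; at most l_max CGs with the largest counts are retained.
    """
    n_cg, n_bg = len(phi), len(phi[0])
    omega = []
    for k in range(n_bg):
        column = [phi[m][k] for m in range(n_cg)]
        threshold = theta * max(column)
        kept = [m for m in range(n_cg) if column[m] > threshold]
        kept.sort(key=lambda m: column[m], reverse=True)
        omega.append(kept[:l_max])
    return omega
```

A BG that never wins on the training data ends up with an empty candidate list in this sketch, because no count exceeds a threshold of zero.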
and is used as the prior probability of a CG below.
4.4. Application of MC MDT in the recognizer
Figure 2 illustrates the process of calculating the likelihoods of the BGs. Since the CGs and the CG-BG association table are now available, it is also convenient to apply Gaussian selection together with the MC MDT. The motivation of Gaussian selection is that only a small (frame dependent) portion of Gaussians dominate the likelihoods of the HMM states, and are therefore worth evaluating. However, since the likelihood computation in an MDT system involves the imputation of the unknown data, many conventional methods do not apply readily. The proposed approach proceeds as follows. The recognizer first evaluates all the PROSPECT CGs using GD. Only the BGs assigned (as determined by Ω_{ k }) to sufficiently likely CGs will be calculated, while the others will be ignored. The prior probabilities (18) and the resulting likelihoods of CGs are used to calculate the corresponding posterior probabilities. The posterior probabilities are sorted in descending order and then truncated to the length such that a large fraction ρ (e.g., ρ = 0.95) of the posterior probability mass is included, i.e., the number of CGs kept is the smallest L_{ s } such that
where ${\widehat{\mathbf{x}}}_{t,u,m}$ is the imputed value from CG m. The CGs are reordered such that
and
The exponent α is introduced to compensate for unmodeled correlations among the features and will indirectly control the number of selected BGs. A typical value of α is 0.4, which led to a reasonable trade-off between the number of selected BGs and recognition accuracy on the development dataset used in [36]. $P\left(m\mid {\widehat{\mathbf{x}}}_{t,u,m},{\mathbf{y}}_{t,r}\right)$ denotes the posterior probability of CG m based on its imputation. In Figure 2, the CGs labeled with "1" in the CG selection table are selected at frame t. Only the imputed clean spectra resulting from the selected CGs are transformed into the MIDA domain and maintained as possible candidates for BG likelihood evaluation.
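The posterior-mass truncation can be sketched in Python as below. The sketch assumes that the posterior of CG m is proportional to its prior multiplied by its likelihood raised to the power α, which is our reading of the description above and should be checked against criterion (19). The function name and the requirement of strictly positive priors are assumptions of this illustration.

```python
import math

def select_cgs(priors, log_liks, alpha=0.4, rho=0.95):
    """Keep the smallest set of CGs covering a fraction rho of the posterior mass.

    priors   : prior probability of each CG (assumed strictly positive)
    log_liks : log-likelihood of each CG at its own imputation
    alpha    : exponent applied to the likelihood (assumed form, see lead-in)
    Returns the selected CG indices in order of decreasing posterior.
    """
    scores = [math.log(p) + alpha * ll for p, ll in zip(priors, log_liks)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]   # shift for numerical stability
    total = sum(weights)
    post = [w / total for w in weights]
    order = sorted(range(len(post)), key=lambda m: post[m], reverse=True)
    selected, mass = [], 0.0
    for m in order:
        selected.append(m)
        mass += post[m]
        if mass >= rho:
            break
    return selected
```

A smaller α flattens the posterior distribution, so more CGs are needed to reach the mass ρ and, indirectly, more BGs are evaluated.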
When calculating the likelihood of a particular BG k, the MC MDT recognizer looks up the k th column of the CG-BG association table Φ to find the candidate list. Notice that some of the associated CGs may have been pruned by criterion (19) and are removed from the list. The recognizer calculates the likelihoods of the BG for the candidates of imputed clean speech and selects the maximum as the likelihood of that BG. If the candidate list is empty, the BG is assigned a likelihood of zero. On average, the number of multiplications involved per BG is reduced to $2\bar{L}{D}_{m}$, where $\bar{L}$ is the average number of CGs associated with a BG and D_{ m } is the dimension of the MIDA features. The resulting likelihoods of the BGs are used to calculate the state output PDFs, which are then processed by the decoder.
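The candidate evaluation for one BG can be sketched as below. The function names and the data layout (a dictionary from CG index to imputed vector in the BG feature domain) are illustrative assumptions; the BG is taken to have a diagonal covariance, and the work is done in the log domain, so a likelihood of zero appears as minus infinity.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
                      for xi, mu, v in zip(x, mean, var))

def mc_bg_loglik(bg_mean, bg_var, assoc_cgs, candidates):
    """Multi-candidate log-likelihood of one BG.

    assoc_cgs:  CG indices associated with this BG (its column of the table).
    candidates: dict mapping each *selected* CG index to its imputed
                clean-speech vector; pruned CGs are simply absent.
    """
    scores = [diag_gauss_loglik(candidates[m], bg_mean, bg_var)
              for m in assoc_cgs if m in candidates]
    # Empty candidate list: likelihood zero, i.e. log-likelihood -inf.
    return max(scores) if scores else float("-inf")
```

Each candidate costs roughly two multiplications per feature dimension (one for the squared difference, one for the scaling by the inverse variance), which is where the count of $2\bar{L}{D}_{m}$ comes from.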
5. Selection of CG
The MC MDT system can be further sped up by applying Gaussian selection on the CGs. Though M << K, the evaluation of a CG is still an order of magnitude more expensive than the evaluation of a BG. Thus, only the likely CGs are selected to impute the candidate clean speech. Existing methods of Gaussian selection can be classified as axis indexing-based methods [37, 38] and VQ-based methods [39, 40]. The former quickly locate the likely regions based on the observation, then select the Gaussians in the likely regions [38] or remove the Gaussians in the unlikely regions [37]. But in MDT systems, it is not straightforward to determine which regions in the feature space are likely, because some of the components of the observation are missing. In contrast, VQ-based methods suit the MC MDT system well. To this end, Cluster-of-Cluster Gaussians (CCGs) are established in the PROSPECT domain. The MC MDT recognizer selects the CGs based on the likelihoods resulting from the imputation of the CCGs, i.e., an additional layer of Gaussian selection is provided. Consequently, the number of CG CLSQ problems to be solved is reduced. Clustering of the CGs is a prerequisite for the VQ-based Gaussian selection. In this study, we apply the soft K-Means algorithm to generate the CCGs. Since the CCGs and the CGs are expressed in the same domain (PROSPECT features in our example), a model-based approach is feasible and preferred here.
5.1. Soft K-means clustering
The following pseudo code summarizes the steps to obtain the CCGs. A single cluster is first calculated using all the CGs. The number of CCGs then grows incrementally from 1 to N to avoid suboptimal clustering as much as possible.

1. Set the number of CCGs n to 1 and compute a single CCG from all CGs.
2. While n < N
  2a. Find CCG $\widehat{j}$ with the maximum mean WSKLD
  2b. Split CCG $\widehat{j}$ into two and increment n
  2c. For iteration τ from 1 to T
    2c1. For CCG i, i from 1 to n
      2c11. For CG m, m from 1 to M
        Calculate the weight by which CG m updates CCG i, ĝ(i, m)
      2c12. Given ĝ(i, m), update μ _{ i }, Σ _{ i } iteratively
The distance metric between Gaussians and the computation of a CCG from its member CGs are two crucial components of every step listed in the above pseudo code. The distance metric is the Weighted Symmetric Kullback-Leibler Divergence (WSKLD), which is used in step 2a and explained in Section 5.2. The parameter estimation algorithms in steps 1, 2c11, and 2c12 are described in Section 5.3. Step 2b is described in Section 5.4.
5.2. Distance metric between PROSPECT Gaussians
The symmetric Kullback-Leibler Divergence (SKLD) is commonly used to measure the distance between CCG n and CG m:
The application of SKLD to (14) requires some care. First, the stream exponent β in the likelihood model for PROSPECT features makes it an improper distribution, requiring renormalization such that it integrates to unity. Second, it was found that SKLD overweights differences in the projection part of the PROSPECT Gaussians. Therefore, in [41], further simplifications were proposed and experimentally verified, leading to the WSKLD as a clustering metric for multi-stream features:
where β_{ j } is the exponent of stream j, SKLD _{ j } is the symmetric KLD computed on the features of stream j only, and N_{ strm } is the total number of streams. In this study, N_{ strm } is 6 because the PROSPECT features contain static, velocity, and acceleration streams for both the cepstral and the projection parts.
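For diagonal-covariance Gaussians the SKLD has a simple closed form, and a stream-weighted variant can be sketched as below. The SKLD here is the standard sum of the two directed divergences. The combination rule in `wskld` (a plain sum of per-stream SKLDs weighted by the stream exponents) is only an assumed reading of the description above, and the function names and the `streams` layout are hypothetical.

```python
def skld_diag(mean_a, var_a, mean_b, var_b):
    """Symmetric KL divergence KL(a||b) + KL(b||a) for diagonal Gaussians."""
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        total += 0.5 * (va / vb + vb / va - 2.0)
        total += 0.5 * (ma - mb) ** 2 * (1.0 / va + 1.0 / vb)
    return total

def wskld(mean_a, var_a, mean_b, var_b, streams, betas):
    """Stream-weighted SKLD.

    streams: list of (start, end) index ranges, one per stream.
    betas:   stream exponents, one per stream.
    Assumption: the per-stream SKLDs are combined as sum_j beta_j * SKLD_j.
    """
    return sum(beta * skld_diag(mean_a[s:e], var_a[s:e],
                                mean_b[s:e], var_b[s:e])
               for (s, e), beta in zip(streams, betas))
```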
5.3. Parameter estimation of CCGs
Following the K-Means algorithm in [42], the cost function to be minimized for clustering is
where γ controls the stiffness of the clustering and g(n, m) are unknown clustering weights. The parameters to be updated iteratively are
In each iteration, the first step is to obtain the optimal weight by which CG m affects CCG n as
The second step is to find the optimal values of the mean and covariance of each CCG given the weights. The estimation of the means and covariance matrices of the CCGs is based on the approach in [43], where a method for finding the centroid of a set of Gaussians is derived. The centroid is the CCG that minimizes the sum of the WSKLD to all CGs. Here, we extend the results of [43] by modifying the cost function to (21). The mean of a CCG is thereby estimated as
Matrix Z is constructed to facilitate the reestimation of the covariance matrix of the CCGs
where
and
By construction, Z has D_{ P } positive and D_{ P } symmetrically negative eigenvalues, where D_{ P } is the dimension of the PROSPECT features. A 2D_{ P }-by-D_{ P } matrix V is constructed whose columns are the D_{ P } eigenvectors corresponding to the positive eigenvalues. V is partitioned into its upper half U and lower half W:
Like in [43], ${\widehat{\mathbf{\Sigma}}}_{n}$ is constrained to be diagonal during clustering. It can be seen from Equations (22) and (23) that the procedure of estimating the CCGs given the weights ĝ(n,m) is iterative. The calculation of the mean depends on the previously calculated covariance and vice versa. The exit criterion is the convergence of the cost function defined in Equation (21).
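The weight step of each iteration can be sketched as follows. This is a generic soft K-means assignment, assuming that the weights of each CG are normalized over the CCGs and decay exponentially with the distance divided by γ; the exact weight equation of this article is not reproduced here, so both the functional form and the name `soft_weights` are assumptions.

```python
import math

def soft_weights(dist, gamma=0.3):
    """Soft assignment weights g[n][m] of CG m to CCG n.

    dist[n][m] is the distance (e.g., WSKLD) between CCG n and CG m.
    Assumption: g[n][m] is proportional to exp(-dist[n][m] / gamma),
    normalized so that the weights of each CG sum to one over all CCGs.
    """
    n_ccg, n_cg = len(dist), len(dist[0])
    g = [[0.0] * n_cg for _ in range(n_ccg)]
    for m in range(n_cg):
        scaled = [-dist[n][m] / gamma for n in range(n_ccg)]
        top = max(scaled)                      # for numerical stability
        expd = [math.exp(s - top) for s in scaled]
        norm = sum(expd)
        for n in range(n_ccg):
            g[n][m] = expd[n] / norm
    return g
```

Under this assumed form, a small γ gives nearly hard assignments, while a large γ spreads each CG over many CCGs.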
In step 1 of the pseudo code from Section 5.1, a single CCG is initialized by averaging the means and covariance matrices of the entire set of CGs. The parameters of the single CCG are then updated using Equations (22) and (23) for several iterations. Splitting a CCG and reestimation of all CCGs are carried out iteratively till N CCGs are obtained, as explained below.
5.4. Splitting a CCG
In each iteration, the CCG with the maximum within-cluster mean WSKLD is found
Principal component analysis is applied on the covariance matrix ${\mathbf{\Sigma}}_{\widehat{j}}$ to find the first principal eigenvector e_{1} and eigenvalue, λ_{1}. If the number of CCGs in the current iteration is n, CCG $\widehat{j}$ is split into two Gaussians with the means and covariance matrices:
where ξ is the disturbing rate. The WSKLDs of all the M CGs to the newly created CCGs are then calculated. Each weight $\hat{g}\left(\widehat{j},m\right)$ is also split into two according to the WSKLDs:
The parameters of CCG $\widehat{j}$ and CCG n + 1 are then reestimated using Equations (22) and (23) with a fixed number (e.g., 3) of iterations. The means and covariance matrices of CCG 1 to n + 1 are subsequently updated until convergence of the global cost function (21).
Finally, when n reaches N, an N-by-M CCG-CG table of exponentiated negative WSKLDs is calculated. This table plays the same role as the association table in Section 4.3. The same schemes as in Section 4.3 are used to truncate the table. Also, the same schemes as in Section 4.4 are used to select likely CGs, thus avoiding solving CLSQ problems whose solutions are unlikely to survive pruning criterion (19).
6. Experiments
To show the effectiveness of the proposed approaches, a large vocabulary speech recognizer was modified accordingly and experiments on the noisy dictation task AURORA4 were run. In Section 6.1, we describe the training and test datasets and further details on the required acoustic models. Section 6.2 explains various components of the MDT recognizer. Section 6.3 outlines two baseline systems: first, a non-MDT system serving as the speed baseline for our MC MDT experiments and second, a backend MDT system serving as the accuracy baseline for our MC MDT experiments. In Section 6.4, some MC MDT variants are first analyzed and subsequently compared with the cluster-based reconstruction described in Section 6.5. Section 6.6 evaluates MC MDT systems where the CGs are expressed in either the cepstral domain or the log-spectral domain. All test results are summarized in Tables 2, 3, 4, and 5.
6.1. Data and models
6.1.1. AURORA4
Speech recognition experiments were conducted on the AURORA4 database [44], a large vocabulary task that is derived from the WSJ0 Wall Street Journal 5k-word dictation corpus. A bigram language model for a 5k-word closed vocabulary is provided by Lincoln Laboratory.
For training, only clean-condition data sampled at 16 kHz were used, consisting of 7,138 utterances from 83 speakers, which amounts to 14 h of speech data. All recordings are made with the close-talking microphone and no noise is added.
The test database is composed of 330 read sentences (5,353 words) from 8 different speakers. Fourteen different versions of this set are created. The first dataset is clean and is recorded with the same close-talk microphone as used for recording the training data. It is artificially corrupted by adding six types of noise to establish datasets 2-7: car (set 2), babble (set 3), restaurant (set 4), street (set 5), airport (set 6), and train (set 7). Set 8 is recorded with far-talk microphones. Test sets 9-14 are created by artificially adding the same six types of noise as used for generating sets 2-7. Each test set contains 330 utterances and has an SNR that ranges from 5 to 15 dB.
6.1.2. Training backend acoustic model
The design of the frontend as well as the backend acoustic model is based on prior study [23, 33, 37] which obtained competitive accuracies on clean speech and good robustness in an MDT configuration.
The signal power spectrum is calculated with a 32-ms Hamming window and a 10-ms window shift and is integrated using a 22-channel MEL-scaled triangular filter bank with the lowest frequency centered at 140 Hz to increase the robustness to low-frequency noises. Since all frequencies above 7 kHz of the AURORA4 data are filtered out, the last band is centered at 5800 Hz. The 22 log-spectral coefficients are mean-normalized and the first- and second-order time derivatives are appended to result in 66-dimensional spectral features.
To train the backend acoustic model, the normalized spectral features are transformed into 39-dimensional MIDA features, which are improved LDA features leading to decorrelation and diagonalization of the mixture components [34]. The MIDA features have only half the dimension of the PROSPECT features (see Section 6.1.3), which significantly reduces the effort in the likelihood calculation of the BGs, and they show better accuracy than MFCCs.
The acoustic model uses cross-word, context-dependent triphones. The HMM for each triphone contains three states. A PDT defines 4091 tied states, or senones, which in turn share 21,037 BGs. The output probability of each state is a mixture of 190 BGs on average and each Gaussian is shared among 45 different tied states.
6.1.3. Training CGs and CCGs
The compact acoustic model containing the CGs is trained with the same training data by following the training procedure outlined in Section 4.2. Here, the state-level segmentation of the training data is obtained by forced alignment using the backend MIDA model of Section 6.1.2. The normalized static log-spectral features and the dynamic features are transformed into PROSPECT features with D_{ c } = 4, i.e., for each stream in the features, four cepstral coefficients are kept, and D = 22 projection coefficients are appended. Consequently, the PROSPECT features including deltas have 78 dimensions. An earlier experiment on AURORA4 showed that MC MDT with 500 to 900 CGs yields a reasonable tradeoff between recognition time and accuracy. Therefore, we use 500 CGs in the following experiments. The association table Φ is built on the same training data. The maximum number of CGs associated with a particular BG, L_{max}, is 5. An earlier experiment on the Flemish Speecon and SpeechDat Car data [36] showed that increasing L_{max} beyond 5 only leads to more computation without increasing the recognition accuracy. The average number of CGs associated with a particular BG, $\bar{L}$, is 3.6.
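As a quick check of the stated dimensionality, assuming that each of the three streams (static, first-order, and second-order derivatives) carries the cepstral and the projection coefficients given above:

$$3\times\left({D}_{c}+D\right)=3\times\left(4+22\right)=78$$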
Fifty CCGs are obtained by clustering the 500 CGs using the procedure from Section 5. The maximum number of CCGs associated with a CG is 5 and the average number is 3.6. In previous experiments on Gaussian clustering, we have found a γ of 0.3 in Equation (21) to be a good choice, which we have maintained in these experiments.
6.1.4. Training spectral CGs for the cluster-based reconstruction with MAP
To carry out the MAP cluster-based reconstruction experiments for comparison, a mixture of 500 Gaussians with full covariance is also trained on the spectral data using EM on the same segmentation. As proposed in [24], the initial iterations use a diagonal covariance model that serves to make the posterior probability calculation (6) feasible. Only in the last EM iteration are the full covariance matrices estimated for application in Equation (5).
6.1.5. Training backend PROSPECT model
In order to show the speed improvement of MC MDT over a full MDT system [45], i.e., where the CLSQ problem (11) is solved per Gaussian with GD, an acoustic model with Gaussians estimated on PROSPECT features is required. The model has 21,037 PROSPECT Gaussians which are obtained by Single Pass Retraining (SPR) [46] of the acoustic model with MIDA features. The inputs of the SPR are the MIDA features, PROSPECT features, and the MIDA model described above. The MIDA model is used to compute the posterior probabilities of every Gaussian over the training data, which are subsequently combined with the PROSPECT feature observations to estimate their GMM weights, means and diagonal covariance matrices.
6.2. Recognizer
6.2.1. Handling convolutional noise
Besides additive noise, the MDT recognizer also handles convolutional noise by the channel compensation technique described in [23], which maximizes the likelihood of the recognized speech on the backend model. To make the implementation tractable, only the contribution of the single Gaussian that gives the largest contribution to the state likelihood (the dominating BG) is taken into account. However, unlike in [23], the current approach computes only an approximate solution for the BGs, which are expressed in a different feature domain. Therefore, each dominating BG is replaced by the PROSPECT CG with the largest Φ-value so that the maximum likelihood channel estimation of [23] can be readily applied. The channel estimate is subtracted from the observed log-spectra and hence the CCG, CG, and BG models are all compensated for convolutional distortions.
6.2.2. Mask estimation
The missing data detector used is the method described in [23] which integrates harmonicity and SNR with a speech model based on vector quantization. At each frame, the best match between a harmonic decomposition of noisy speech and a codebook describing the harmonic decomposition of clean speech is found. VQ mask estimation requires a speech and silence codebook which are trained with a randomly selected subset of the clean training data in Section 6.1.1. The codebook contains 520 codewords which are updated using the channel estimation of Section 6.2.1 during recognition.
6.2.3. Test configuration
The decoding consists of a time-synchronous beam search algorithm as described in [47]. The recognizer was run on a PC with Dual Core AMD Opteron 280 2.4 GHz processors with a cache size of 1 MB. Only one processor core is activated. The MDT imputation is only applied to the static stream, while the first- and second-order time derivatives are uncompensated. The Word Error Rates (WER) are calculated for all the experiments, and the CPU time is measured as well. Tables 2 and 3 list the WERs of the experiments over the 14 types of environmental noises. Tables 4 and 5 contain the timing measurements for the BG and CG evaluation and for the beam search, as well as the end-to-end timing information (column "TOTAL") of the recognizer under noisy and clean conditions, respectively. The timing measurements are obtained by starting and stopping (precise) timers frame-synchronously at the entry and exit of each of the different processing steps: frontend processing, CG imputation, candidate evaluation for all BGs, and beam search. The total time is then obtained by dividing the accumulated timings by the number of processed frames over several utterances.
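The measurement procedure can be sketched as below. The class `StageTimer` and the stage names are hypothetical and not taken from the recognizer; `time.process_time` is used because CPU time rather than wall-clock time is reported.

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulates CPU time per processing stage, reported per frame."""

    def __init__(self):
        self.total = defaultdict(float)   # accumulated seconds per stage
        self.frames = 0

    def run(self, stage, func, *args):
        """Call func(*args), charging its CPU time to the given stage."""
        start = time.process_time()       # timer started at stage entry
        result = func(*args)
        self.total[stage] += time.process_time() - start  # stopped at exit
        return result

    def end_frame(self):
        self.frames += 1

    def per_frame_ms(self):
        """Average time per frame and stage, in milliseconds."""
        return {stage: 1000.0 * t / self.frames
                for stage, t in self.total.items()}
```

Summing the per-stage values gives a figure comparable to an end-to-end time per frame.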
6.3. Baselines
6.3.1. Recognition without MDT
The system that does not make use of MDT is provided as a baseline in terms of recognition time such that we can measure the computational cost of the robustness obtained from the MDT systems. The acoustic model is the backend HMM containing 21,037 MIDA Gaussians described in Section 6.1.2. An axis indexing-based Gaussian selection method, Fast Removal of Gaussians (FRoG) [37], is used. The test results are shown in the first rows of Tables 2, 3, 4, and 5. The default FRoG Gaussian pruning setting works well on clean speech and results in only about 5% of the Gaussians being evaluated. However, we noticed a performance degradation due to Gaussian pruning on noisy speech. Therefore, the FRoG Gaussian pruning settings were adjusted on the noisy test data such that the accuracy was not degraded by more than 2% compared to no pruning, requiring 27% of the Gaussians to be kept. Notice that this procedure yields an optimistic speed estimate for this baseline, as tuning on an independent development set would require some safety margin as well. Also, this non-MDT system produces a higher WER than the MDT systems under the clean condition (test set 1), as shown in Table 2. This is mainly due to the non-MDT system using spectral mean normalization to reduce the channel effects, while the MDT systems use the more sophisticated MLE-based channel update described in Section 6.2.1.
6.3.2. Backend PROSPECT imputation
This setup is the most refined previously published version of our MDT system [23] and serves as a baseline in terms of recognition accuracy such that we can measure the accuracy cost of the proposed speed improvements. Two iterations of GD are found to be enough for convergence in terms of recognition accuracy and are hence applied to all the 21,037 PROSPECT Gaussians. This system runs at 15 times real time in the noisy condition and 6.6 times real time in the clean condition, as shown in row 2 of Tables 4 and 5, respectively. However, the accuracy benefits of MDT can be clearly seen in contrast to the non-MDT system in Tables 2 and 3.
6.4. MC MDT
6.4.1. Gaussian-based MC MDT with GD
The Gaussian-based MC MDT system is an instance of the concepts outlined in Section 4. Two iterations of GD are applied to all the 500 CGs. The posterior probability-based BG selection described in Section 4.4 is applied. α was tuned with an isolated word-recognition experiment of MC MDT on the Speecon and the SpeechDat Car databases used in [36], which we hence regard as development data for this article. The tuning experiment shows that a good tradeoff between accuracy and BG evaluation effort is obtained at α = 0.4, but that the value of α does not critically affect the recognition accuracy.
Compared to the backend PROSPECT MDT system, i.e., row 2 versus row 3 in Tables 2, 3, 4, and 5, the Gaussian-based MC MDT yields a comparable WER, while it uses less than 20% of the CPU time over the entire test set.
It is remarkable that the Gaussian-based MC MDT works as fast as the non-MDT recognizer with the same backend acoustic model under noisy conditions (row 1 versus row 3 of Table 4). The MC MDT spends time in evaluating CGs, but its decoding time is reduced by 4 ms per frame. Faster decoding on corrupted data is actually a common benefit from MDT imputation as shown in Table 4. In non-MDT systems, the mismatch between data and model results in a lower likelihood for the ground truth hypothesis and also causes many hypotheses to yield a similar score, so pruning is not effective and the decoder slows down. In the MDT system, noise addition also slows the recognizer down, but through a different mechanism. Thanks to the imputation process in MDT systems, the likelihood of the ground truth hypothesis will not deteriorate. The likelihood of alternative hypotheses will also increase, but because they do not fit the data well, their imputation benefit is not that strong. Apparently, a significant likelihood gap is maintained among the hypotheses, causing pruning in MDT systems to be more effective than in non-MDT systems. The effort spent in evaluating CGs is recovered in the search.
The increase in the likelihood of alternative hypotheses in the MDT system under noisy conditions also causes the MC MDT system with GD to run about 2.5 times slower than under clean conditions (row 3 of Table 4 versus row 3 of Table 5). As the data get noisier, the imputation becomes less constrained, since the number of unreliable components increases and the bounds outlined in Section 2.1 become less strict. Hence, the dynamic range of the BG likelihoods will decrease, such that the hypothesis likelihoods will show smaller differences, causing pruning to be less effective. Additionally, the system is slowed down with increasing noise levels because more spectro-temporal regions are labeled as unreliable and the complexity of imputation for CGs and CCGs increases.
Some common advantages of MDT are revealed by the results shown in Tables 2 and 3. All the experiments with MDT produce lower WERs than the non-MDT system over the corresponding noise types, as well as in the clean condition. The benefit from MDT is especially significant for the non-stationary noise types, namely sets 3-7 and 10-14.
Though MDT systems show an advantage in both the close-talk and the far-talk test sets, the performance is greatly degraded in the latter condition, because the channel compensation technique of Section 6.2.1 is restricted to the estimation of a log-spectral offset vector, which can only compensate for convolutional effects with a short impulse response. However, the fact that the backend PROSPECT MDT and the MC MDT system perform equally on this test set confirms that the modification to estimate the channel on CGs rather than on BGs (see Section 6.2.1) works.
6.4.2. Gaussian-based MC MDT with MAP
This experiment is conducted to compare GD with MAP as a solver for the imputation problems. The full precision matrix of the CGs in Equation (17), as required for MAP, is precalculated. The same BG selection as in the previous section is applied. Six iterations are found to be enough for convergence in terms of WER and are therefore applied to the CGs. Comparing rows 3 and 4 of Tables 2 and 3 reveals that MC MDT with the MAP solver is as robust as MC MDT with GD. But as shown in Tables 4 and 5, it is slower, especially in evaluating CGs, due to the larger number of iterations and the copying of full precision matrices from the main memory to the cache memory of the CPU.
6.4.3. Gaussian-based MC MDT with CG selection
The CG selection introduced in Section 5 is added to the Gaussian-based MC MDT system of Section 6.4.1. In addition to the 50 PROSPECT imputation operations for the CCGs, about 106 PROSPECT imputation operations for the CGs are observed per frame of 10 ms. Therefore, the number of CLSQ problems solved is less than one-third of that of MC MDT without CG selection. Two iterations of GD are applied to the imputation of the CCGs. Adding CG selection to the Gaussian-based MC MDT does not harm the recognition accuracy but consumes less CPU time than the Gaussian-based MC MDT system without it (compare row 3 with row 5 in Tables 2, 3, 4, and 5).
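The one-third figure follows from counting the imputation problems per frame, taking the 500 CG imputations of Section 6.4.1 as the reference:

$$50+106=156<\frac{500}{3}\approx 166.7$$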
6.4.4. State-based MC MDT
The imputed values of the Gaussian-based MC MDT described in Section 6.4.1 can also be used to perform a state-based MC MDT, where the imputation for state s is the linear combination of the imputed values from the BGs included in the GMM of that state.
where G_{BG}(s) represents all the Gaussians belonging to the GMM of state s, and
The BG selection from Section 4.4 is also activated for this experiment, so only the selected BGs are actually involved in the imputation formulae. Each BG is shared among about 45 states and is therefore evaluated at multiple imputed spectra from its owner states. Hence, for this state-based MC MDT experiment, every selected BG is evaluated at 45 state-based imputed spectra ${\widehat{\mathbf{x}}}_{t,u,s}$ as in Equation (24). The MC-based likelihood estimation of each BG is still performed at 3.6 candidate spectra (the average number of CGs assigned to each BG). These likelihood evaluations lead to a computationally expensive implementation. State-based MC MDT yields WERs fairly close to those obtained in the other MC MDT experiments, but at a significantly higher computational cost, as shown in row 8 of Tables 2, 3, 4, and 5.
6.5. Cluster-based reconstruction
6.5.1. Cluster-based reconstruction with MAP
This experiment is an instance of the concept formulated by Equations (4), (5), and (6). Each of the 500 full-covariance spectral CGs is used to impute clean speech with six iterations of the MAP imputation. The corresponding diagonal-covariance CGs are used to calculate the definite integrals in Equation (6). To compensate for unmodeled correlations, the integrals are exponentiated with 0.3, a value that is tuned on the test set for best accuracy. The global clean spectrum is then reconstructed using Equation (4). Since the likelihoods of the 500 CGs are already calculated, they are used to select BGs as explained in Section 4.4. Despite the test set optimization, the cluster-based reconstruction with MAP imputation is still less robust than the MC MDT systems when comparing row 7 with rows 3, 4, 5, and 8 in Tables 2 and 3. The use of a more accurate speech model provided by the BGs seems to pay off.
6.5.2. Cluster-based reconstruction with GD
This system approximates the previous one by combining the imputed spectra obtained from the PROSPECT CGs as in Equation (4), but takes a different approach to compute the posterior probabilities of the CGs. These posteriors are calculated by renormalizing the likelihoods of the imputed clean spectra. The likelihoods also serve for BG selection as explained in Section 4.4. To compensate for unmodeled correlations, the likelihoods are exponentiated with 0.3, a value that is also tuned on the test set for best accuracy. We did not observe a significant accuracy gain beyond the 500 PROSPECT CGs used to report the results in the tables. The approximations outlined above do not harm the robustness, as shown by a comparison between rows 6 and 7 of Tables 2 and 3, but this implementation is faster because GD imputation is more efficient than MAP and the computation of the posterior probabilities is simplified. Again, despite the test set optimization applied for this cluster-based reconstruction method, MC MDT outperforms it as well.
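The combination step can be sketched as below, assuming that Equation (4) is a posterior-weighted sum of the per-CG imputed spectra and that the posteriors are the exponentiated likelihoods renormalized to sum to one. The function name `reconstruct` and the argument layout are hypothetical.

```python
import math

def reconstruct(candidates, log_liks, exponent=0.3):
    """Posterior-weighted combination of per-CG imputed spectra.

    candidates: list of imputed clean spectra, one per CG.
    log_liks:   log-likelihood of each imputed spectrum under its CG.
    Assumption: posterior_m is proportional to likelihood_m ** exponent.
    """
    scores = [exponent * ll for ll in log_liks]
    top = max(scores)                          # for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    post = [w / norm for w in weights]
    dim = len(candidates[0])
    spectrum = [sum(p * c[d] for p, c in zip(post, candidates))
                for d in range(dim)]
    return spectrum, post
```

Unlike MC MDT, which keeps the candidates separate and lets each BG pick its best one, this variant collapses all candidates into a single spectrum before the backend is evaluated.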
6.6. Imputation using log-spectral and cepstral CGs
Using a PROSPECT feature representation for the CGs in MC MDT experiments is an implementation choice motivated by speed considerations (see Section 4.1). The CGs can also be trained with other features, e.g., cepstra or log-spectra. To compare MC MDT systems using CGs in these domains, the same number of CGs, namely 500 cepstral CGs and 500 log-spectral CGs, are trained using the same data-driven approach as described in Section 6.1.3.
6.6.1. Cepstral imputation
The dimension of the cepstral CGs is 39: 13 static cepstral coefficients and their first- and second-order time derivatives. The average number of CGs per BG is 3.6, the same as for PROSPECT CGs. During recognition, the imputation is performed with five iterations of Multiplicative Updates (MU) [32], an algorithm capable of handling rank-deficient precision matrices, such as H_{ k } in Equation (11). The test results are shown in row 10 of Tables 2, 3, 4, and 5. The WER and the percentage of selected BGs obtained by using cepstral CGs are comparable with those obtained using PROSPECT CGs. Observe that cepstral CGs require the time-consuming MU imputation, which slows down the imputation of CGs by 30-60%.
6.6.2. Spectral imputation
The spectral imputation method described by Equation (2) is tempting for its simplicity. It is worth investigating whether it yields a list of candidate spectra of sufficient quality. The dimension of the log-spectral CGs is 66: 22 static log-spectral coefficients and their first- and second-order time derivatives. The average number of CGs per BG is increased to 5. The test results are shown in row 9 of Tables 2, 3, 4, and 5. While it saves time in CG imputation, the method loses both accuracy and efficiency compared to PROSPECT CG imputation. BG selection is also less effective and more BGs need to be activated to guarantee a reasonable accuracy. Finally, spectral CGs are not able to provide channel estimates (as in Section 6.2.1) as PROSPECT CGs can. This claim is motivated by an experiment (not reported in this article) where the log-spectral CGs provide the candidates of clean speech and trigger the BG selection, while PROSPECT CGs are used for channel estimation, which improved the recognition accuracy by 3.5-8% relative on test sets 8-14.
7. Conclusions and future work
We have proposed several effective optimizations to a large vocabulary speech recognizer that is based on MDT. The outcome is a recognizer that runs as fast as the uncompensated system, has identical performance on clean data, has the same robustness as our latest published missing data system [23] and shows competitive performance on the AURORA4 task.
We first formulated the missing data paradigm such that it can be applied to an acoustic model that requires no compromises on accuracy and uses standard feature representations, i.e., a formulation that covers cepstral as well as LDA features as they are commonly used in today's speech recognizers. This formulation exploits the most accurate speech model that the recognizer has at its disposal: the backend HMM. The computational load was significantly reduced by the proposed MC approach, which solves the CLSQ problems with sufficient accuracy for practical purposes. Here, candidates are obtained from exact solutions on a smaller set of CGs, followed by selection of the most likely candidate. The posterior probabilities of the CGs were exploited to construct a Gaussian selection algorithm that saves more computation by excluding Gaussians that are unlikely to make a significant contribution to the state likelihoods. Finally, the CGs were structured hierarchically with a model-based Gaussian clustering algorithm to achieve further speed gains.
The proposed method was compared to cluster-based imputation, an MDT that enhances the feature vector based on a GMM speech model, a technique that is also suitable for large vocabulary tasks. Our experiments reveal that it is beneficial to accuracy to exploit the more accurate backend model instead.
The optimizations show that no modeling compromises are required to apply MDT to large vocabulary recognition and that, on noisy data, any additional computational cost in likelihood calculation is easily recovered by a reduction in the search effort. These benefits make the missing data formalism very suitable to tackle various robustness issues beyond the additive noise effects considered in this article. With a suitable missing data detector, the solutions described in this article open pathways to also efficiently cover reverberated speech and exploit multiple microphones to implement directional hearing.
Appendix: Computational complexity of maximized likelihood per Gaussian
This section shows how the numbers of multiplications involved in each approach in Table 1 are obtained. The complexity is quantified as the number of multiplications or divisions.
MAP
The iterative method of the MAP algorithm in [24] includes the following steps:

a. Initialize $\mathbf{x}_{t,u}$ using the componentwise minimization $\mathbf{x}_{t,u}(i) = \min[\boldsymbol{\mu}_{t,u}(i), \mathbf{y}_{t,u}(i)]$, as in Equation (2).

b. For each component $i$ of the unreliable subvector $\mathbf{x}_{t,u}$, calculate the conditional mean
$$\bar{\mathbf{x}}_{t,u,k}(i)=E\left(\mathbf{x}_{t,u,k}(i)\mid \mathbf{x}_{t,u,k}(1)\ldots\mathbf{x}_{t,u,k}(i-1),\;\mathbf{x}_{t,u,k}(i+1)\ldots\mathbf{x}_{t,u,k}(D-D_{t,u}),\;\mathbf{y}_{t,r}\right)$$
where $D_{t,u}$ is the number of unreliable components in the spectrum at time $t$, and constrain the result so that it does not exceed the observation $\mathbf{y}_{t,u}(i)$.

c. Repeat step b for several iterations and calculate the likelihood $p(\mathbf{x}_{t,u,k}, \mathbf{y}_{t,r} \mid k)$.
Raj et al. provide a standard solution for the conditional mean in step b:
where $\boldsymbol{\mu}_{t,u,k}$ is the mean of the unreliable subvector, $\boldsymbol{\Sigma}_k$ is the covariance of Gaussian $k$, and $\boldsymbol{\Sigma}_{t,u,r,k}$ contains only the rows of $\boldsymbol{\Sigma}_k$ with indices corresponding to the unreliable components and the columns with the reliable indices.
The conditional mean can also be formulated as
where $\mathbf{H}_k = \boldsymbol{\Sigma}_k^{-1}$ and $\mathbf{H}_{t,u,u,k}$ is a $D_{t,u}$ by $D_{t,u}$ submatrix of $\mathbf{H}_k$ containing only the rows and columns corresponding to the unreliable components in the feature vector. $\mathbf{H}_{t,u,r,k}$ is a $D_{t,u}$ by $D - D_{t,u}$ submatrix of $\mathbf{H}_k$ containing only the rows corresponding to the indices of the unreliable components in the feature vector and the columns with the reliable indices.
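For reference, both formulations follow from the textbook conditioning identity for a jointly Gaussian vector. Written with the symbols defined above, and with $\boldsymbol{\Sigma}_{t,r,r,k}$ and $\boldsymbol{\mu}_{t,r,k}$ assumed here to denote the reliable-by-reliable block of $\boldsymbol{\Sigma}_k$ and the reliable part of the mean, they would read as follows. This is a sketch under those naming assumptions, not a transcription of the equations in [24].

```latex
\begin{align}
\bar{\mathbf{x}}_{t,u,k}
  &= \boldsymbol{\mu}_{t,u,k}
   + \boldsymbol{\Sigma}_{t,u,r,k}\,\boldsymbol{\Sigma}_{t,r,r,k}^{-1}
     \left(\mathbf{y}_{t,r}-\boldsymbol{\mu}_{t,r,k}\right) \\
  &= \boldsymbol{\mu}_{t,u,k}
   - \mathbf{H}_{t,u,u,k}^{-1}\,\mathbf{H}_{t,u,r,k}
     \left(\mathbf{y}_{t,r}-\boldsymbol{\mu}_{t,r,k}\right)
\end{align}
```

The second line avoids inverting a block of the covariance for every mask pattern, at the price of having the full precision matrix available.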
In step b, only one dimension is free and updated per operation as:
The second part of the summation in the numerator is constant and can be calculated at the first iteration. Hence, the MAP algorithm involves $(D_{t,u} + 1) \times D_{t,u}$ multiplications per iteration, where the dimension of $\bar{\mathbf{x}}_{t,u,k}$ is $D_{t,u}$. In the above equation, $\mathbf{x}_t(j)$ is the most recently updated value of the $j$th unreliable component.
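Setting the derivative of the Gaussian exponent with respect to a single unreliable component to zero, with all other components held fixed, gives a per-component update of the following shape. The index sets $u$ and $r$ (unreliable and reliable indices) and the element notation $\mathbf{H}_k(i,j)$ are introduced here for illustration. The expression is a reconstruction consistent with the description above (a numerator with two sums, the second of which is constant), not a quotation of the original equation.

```latex
\begin{equation}
\bar{\mathbf{x}}_{t,u,k}(i) = \boldsymbol{\mu}_{k}(i)
 - \frac{\displaystyle\sum_{j \in u,\, j \neq i} \mathbf{H}_k(i,j)\left(\mathbf{x}_t(j)-\boldsymbol{\mu}_k(j)\right)
       + \sum_{j \in r} \mathbf{H}_k(i,j)\left(\mathbf{y}_t(j)-\boldsymbol{\mu}_k(j)\right)}
        {\mathbf{H}_k(i,i)},
 \qquad i \in u
\end{equation}
```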
Besides updating the clean speech estimate in each iteration, the likelihood of the CG is also required for BG selection, which involves $(D + 1)D$ multiplications. Each iteration of MAP is very efficient, but as the experimental results show, MAP needs six iterations to reach convergence in terms of WER. Furthermore, as described in Section 4.1, the full precision matrix has to be handled in the CPU cache memory. Hence, MAP is slower than GD + PROSPECT.
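The three steps of the bounded MAP iteration can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single Gaussian with a full precision matrix and an upper bound given by the observation. The function name `bounded_map_impute` and its arguments are hypothetical and do not come from the recognizer described in this article.

```python
import numpy as np


def bounded_map_impute(y, unreliable, mu, H, n_iter=6):
    """Coordinate-wise bounded conditional-mean imputation for one Gaussian.

    y          : observed log-spectrum, shape (D,)
    unreliable : boolean mask, True where the component is unreliable
    mu, H      : mean (D,) and full precision matrix (D, D) of the Gaussian
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    unreliable = np.asarray(unreliable, dtype=bool)
    u = np.flatnonzero(unreliable)
    r = np.flatnonzero(~unreliable)

    x = y.copy()
    # Step a: x_u(i) = min(mu_u(i), y_u(i)).
    x[u] = np.minimum(mu[u], y[u])
    # Contribution of the reliable components: constant, computed once.
    const = H[np.ix_(u, r)] @ (y[r] - mu[r])

    for _ in range(n_iter):              # step c: repeat step b
        for n, i in enumerate(u):        # step b: one free dimension at a time
            d = x[u] - mu[u]
            # Sum over the other unreliable components (j != i).
            others = H[i, u] @ d - H[i, i] * d[n]
            cond_mean = mu[i] - (others + const[n]) / H[i, i]
            # Bound (assumed here): the estimate may not exceed the observation.
            x[i] = min(cond_mean, y[i])
    return x
```

When no bound is active, the loop is a Gauss-Seidel sweep and converges to the closed-form conditional mean. The default of six iterations mirrors the iteration count reported above.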
Multiplicative updates
As outlined in [32, 33], the step calculation of the $i$th unreliable component for Gaussian $k$ is given by
$\mathbf{H}_{t,u,u,k}^{+}$ ($\mathbf{H}_{t,u,u,k}^{-}$) is obtained from $\mathbf{H}_{t,u,u,k}$ by setting all negative (positive) entries to zero. Computing $\mathbf{H}_{t,u,u,k}^{+}(\mathbf{x}_{t,u,k} - \boldsymbol{\mu}_{t,u,k})$ together with $\mathbf{H}_{t,u,u,k}^{-}(\mathbf{x}_{t,u,k} - \boldsymbol{\mu}_{t,u,k})$ involves $D_{t,u} \times D_{t,u}$ multiplications.
$\mathbf{b} = \mathbf{H}_{t,u,r,k}(\mathbf{y}_{t,r} - \boldsymbol{\mu}_{t,r,k}) + \mathbf{H}_{t,u,u,k}(\mathbf{x}_{t,u,k} - \boldsymbol{\mu}_{t,u,k})$, so its first term can be calculated before the iterations start. The second term is already available from the calculation of $\mathbf{H}_{t,u,u,k}^{+}(\mathbf{x}_{t,u,k} - \boldsymbol{\mu}_{t,u,k})$ and $\mathbf{H}_{t,u,u,k}^{-}(\mathbf{x}_{t,u,k} - \boldsymbol{\mu}_{t,u,k})$. The square, the two multiplications and the division in the above equation involve $4 \times D_{t,u}$ multiplications per iteration. In addition, $D_{t,u}$ computationally expensive square root operations are required in each iteration. The calculation of the likelihood involves $(D + 1)D$ multiplications, the same as for MAP. MU shares with MAP the drawback that it has to handle the full precision matrix whenever it is called. As shown by the experiments, MU needs five iterations for convergence of the WER.
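To make the operation count concrete, the sketch below implements the generic multiplicative update of [32] for nonnegative quadratic programming, i.e., minimizing $\frac{1}{2}\mathbf{v}'\mathbf{A}\mathbf{v} + \mathbf{b}'\mathbf{v}$ subject to $\mathbf{v} \geq 0$. How the missing data problem is mapped onto $\mathbf{A}$, $\mathbf{b}$ and $\mathbf{v}$ is not shown, and the function name and the division guard are assumptions made for this illustration.

```python
import numpy as np


def nqp_multiplicative_updates(A, b, v0, n_iter=5):
    """Minimise 0.5 * v'Av + b'v subject to v >= 0 by multiplicative updates."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    v = np.asarray(v0, dtype=float).copy()   # must start strictly positive
    A_pos = np.maximum(A, 0.0)    # negative entries set to zero
    A_neg = np.maximum(-A, 0.0)   # magnitudes of the negative entries,
                                  # so that A = A_pos - A_neg
    tiny = np.finfo(float).tiny
    for _ in range(n_iter):
        a = A_pos @ v             # the two matrix-vector products together
        c = A_neg @ v             # touch each entry of A once
        # One square, two multiplications and one square root per component.
        root = np.sqrt(b * b + 4.0 * a * c)
        # One division per component; guarded against a zero denominator.
        factor = (root - b) / np.maximum(2.0 * a, tiny)
        # Components that have reached exactly zero stay at zero.
        v = np.where(v > 0.0, v * factor, 0.0)
    return v
```

A fixed point with $v_i > 0$ satisfies $(\mathbf{A}\mathbf{v} + \mathbf{b})_i = 0$, so the update leaves an interior optimum unchanged, while components pushed against the bound shrink geometrically towards zero.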
Gradient descent + cepstral Gaussians
To calculate the likelihood and gradient of a cepstral Gaussian with diagonal covariance given a frame of unreliable spectrum, the precision matrix of the Gaussian must be either precalculated or calculated online. Precalculation implies that the system has to handle the $D \times D$ precision matrix, which leads to frequent data exchange between CPU and main memory. To calculate the gradient online, the precision matrix must be represented in the log-spectral domain as
where $\mathbf{H}_k$ is the precision matrix of Gaussian $k$, which is transformed from the inverse diagonal covariance matrix $\boldsymbol{\Sigma}_k^{-1}$ by applying the transpose of the DCT matrix $\mathbf{C}$. Let $\mathbf{C}_r$ be the $D_m$ by $D - D_{t,u}$ submatrix of $\mathbf{C}$ containing the columns with the reliable indices, and let $\mathbf{C}_u$ contain the remaining columns. Equation (26) can be represented as
$\mathbf{C}'\boldsymbol{\Sigma}_k^{-1}\mathbf{C}_r\left(\mathbf{y}_{t,r}-\boldsymbol{\mu}_{t,r,k}\right)$ is constant for every iteration and is part of the calculation of the likelihood. $\mathbf{C}'\boldsymbol{\Sigma}_k^{-1}\mathbf{C}_u\left(\mathbf{x}_{t,u}-\boldsymbol{\mu}_{t,u,k}\right)$ involves $D_{t,u}(2D_m + 1)$ multiplications for the unreliable component of the gradient. A small fraction of the gradient needs to be added to the gradient to cope with the singularity of $\mathbf{H}_k$, as mentioned in Equation (11). Hence, another $D_{t,u}$ multiplications are needed. The calculation of the step size and step scale in Equation (13) involves $D_m D + D_m + D + D_{t,u}$ multiplications. The calculation of the likelihood requires $2D_m D + D_m + D$ multiplications.
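The online computation can be illustrated as follows: the gradient for the unreliable components is obtained from thin matrix-vector products with $\mathbf{C}$, without ever forming $\mathbf{H}_k = \mathbf{C}'\boldsymbol{\Sigma}_k^{-1}\mathbf{C}$. The sketch assumes an unnormalised type-II DCT and omits the regularization term of Equation (11). The helper names and the random test data are hypothetical.

```python
import numpy as np


def dct_matrix(n_cep, n_chan):
    """Unnormalised type-II DCT matrix of shape (n_cep, n_chan)."""
    k = np.arange(n_cep)[:, None]
    n = np.arange(n_chan)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_chan)


def unreliable_gradient(C, inv_var, x, mu, unreliable):
    """Rows of H (x - mu) at the unreliable indices, with H = C' diag(inv_var) C.

    C       : DCT matrix, shape (D_m, D)
    inv_var : inverse diagonal cepstral covariance, shape (D_m,)
    x, mu   : current log-spectral estimate and log-spectral mean, shape (D,)
    """
    u = np.asarray(unreliable, dtype=bool)
    r = ~u
    Cu, Cr = C[:, u], C[:, r]
    # Reliable contribution: fixed for the frame, computed once.
    const = Cu.T @ (inv_var * (Cr @ (x[r] - mu[r])))
    # Unreliable contribution: recomputed at every iteration.
    moving = Cu.T @ (inv_var * (Cu @ (x[u] - mu[u])))
    return const + moving
```

Because $D_m < D$, the implied precision matrix is rank deficient, which is why the text above mentions the singularity of $\mathbf{H}_k$.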
Gradient descent + PROSPECT features
The cost function for a Gaussian trained with PROSPECT features is shown in Equation (17). The calculation of the gradient $\mathbf{g}_{t,k}$ can be decomposed into the projection part and the cepstral part. The following quantities must be calculated.
$\mathbf{C}_c\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$: $\mathbf{R}'\boldsymbol{\mu}_k$ is the spectral mean of Gaussian $k$, which is precalculated from the PROSPECT mean using the inverse PROSPECT transform $\mathbf{R}$. $D_c D_{t,u}$ multiplications are involved for the unreliable components in every iteration. Additionally, $D_c(D - D_{t,u})$ multiplications are required to calculate the reliable components before the first iteration.
$\mathbf{C}_c'\mathbf{C}_c\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$: based on the calculation of $\mathbf{C}_c\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$, $D_c D$ multiplications are involved.
$\left(\boldsymbol{\Sigma}_k^{\perp}\right)^{-1}\left(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c\right)\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$: includes another $D$ multiplications.
$\mathbf{C}_c\left(\boldsymbol{\Sigma}_k^{\perp}\right)^{-1}\left(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c\right)\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$: includes another $D_c D$ multiplications.
$\mathbf{C}_c'\mathbf{C}_c\left(\boldsymbol{\Sigma}_k^{\perp}\right)^{-1}\left(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c\right)\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$: requires $D_c D_{t,u}$ multiplications per iteration.
The cepstral part $\mathbf{C}_c'\left(\boldsymbol{\Sigma}_k^{c}\right)^{-1}\mathbf{C}_c\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$: based on the calculation of $\mathbf{C}_c\left(\mathbf{x}_t-\mathbf{R}'\boldsymbol{\mu}_k\right)$, $D_c + D_c D_{t,u}$ multiplications per iteration are involved.
Given the obtained gradient $\mathbf{g}_{t,k}$, the step size involves the following quantities as in Equation (13):
$\mathbf{g}_{t,k}'\mathbf{g}_{t,k}$: $D$ multiplications.
$\mathbf{C}_c\mathbf{g}_{t,k}$: $D_c D$ multiplications.
$\mathbf{g}_{t,k}'\mathbf{C}_c'\left(\boldsymbol{\Sigma}_k^{c}\right)^{-1}\mathbf{C}_c\mathbf{g}_{t,k}$: $2D_c$ multiplications.
$\left(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c\right)\mathbf{g}_{t,k}$: $D_c D$ multiplications given $\mathbf{C}_c\mathbf{g}_{t,k}$.
$\mathbf{g}_{t,k}'\left(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c\right)\left(\boldsymbol{\Sigma}_k^{\perp}\right)^{-1}\left(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c\right)\mathbf{g}_{t,k}$: $2D$ multiplications.
Another $D$ multiplications are required for scaling the gradient as explained in the previous section.
Besides the iterations, the initial likelihood involves $2(D + D_c)$ multiplications.
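The quantities listed above suggest that the quadratic form has the structure $\mathbf{H} = \mathbf{C}_c'(\boldsymbol{\Sigma}_k^{c})^{-1}\mathbf{C}_c + (\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c)(\boldsymbol{\Sigma}_k^{\perp})^{-1}(\mathbf{I}-\mathbf{C}_c'\mathbf{C}_c)$, applied through thin products with $\mathbf{C}_c$ only. The sketch below evaluates the gradient and an exact line-search step $\mathbf{g}'\mathbf{g} / \mathbf{g}'\mathbf{H}\mathbf{g}$ in that way. It assumes that $\mathbf{C}_c$ has orthonormal rows, that both covariances are diagonal, and that Equation (13) is of this line-search form; the bound constraints and the split into reliable and unreliable components are left out. All function names are hypothetical.

```python
import numpy as np


def make_operator(Cc, inv_var_c, inv_var_perp):
    """Return a function applying H without forming the D x D matrix."""
    def project(v):
        # (I - Cc'Cc) v using two thin matrix-vector products.
        return v - Cc.T @ (Cc @ v)

    def apply_H(v):
        cepstral = Cc.T @ (inv_var_c * (Cc @ v))
        residual = project(inv_var_perp * project(v))
        return cepstral + residual

    return apply_H


def gradient_and_step(Cc, inv_var_c, inv_var_perp, d):
    """Gradient g = H d and exact line-search step for the cost 0.5 * d'Hd.

    d is the deviation x_t - R'mu_k in the log-spectral domain.
    """
    apply_H = make_operator(Cc, inv_var_c, inv_var_perp)
    g = apply_H(d)
    alpha = (g @ g) / (g @ apply_H(g))
    return g, alpha
```

After the step $\mathbf{d} - \alpha\mathbf{g}$, the new gradient is orthogonal to $\mathbf{g}$, which is the defining property of an exact line search on a quadratic cost.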
With the typical values of D, D_{ m } , D_{ c } and D_{ t, u } in Table 1 the number of multiplications involved in MAP is 2138 per Gaussian, while it is 2106 for MU. But MU needs 80 square root operations per Gaussian. The number of multiplications involved in GD with cepstral Gaussian is 2177. This number is reduced to 1416 when using PROSPECT features.
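The totals quoted above can be checked against the per-iteration counts derived in this appendix. The values $D = 22$ and $D_{t,u} = 16$ used below are inferred here because they reproduce the MAP and MU totals and the 80 square roots; $D_m = 13$ and two GD iterations are likewise inferred from the cepstral GD total. None of these values is taken from Table 1, which remains the authoritative source.

```python
def map_mults(D, D_u, n_iter):
    """MAP: (D_u + 1) * D_u per iteration plus (D + 1) * D for the likelihood."""
    return n_iter * (D_u + 1) * D_u + (D + 1) * D


def mu_mults(D, D_u, n_iter):
    """MU: D_u * D_u + 4 * D_u per iteration plus (D + 1) * D for the likelihood."""
    return n_iter * (D_u * D_u + 4 * D_u) + (D + 1) * D


def mu_square_roots(D_u, n_iter):
    """MU: one square root per unreliable component and iteration."""
    return n_iter * D_u


def gd_cepstral_mults(D, D_m, D_u, n_iter):
    """GD with cepstral Gaussians: gradient, regularization and step size
    per iteration, plus the likelihood."""
    per_iter = D_u * (2 * D_m + 1) + D_u + (D_m * D + D_m + D + D_u)
    likelihood = 2 * D_m * D + D_m + D
    return n_iter * per_iter + likelihood


# Inferred values (assumptions, see the lead-in).
D, D_U, D_M = 22, 16, 13
print(map_mults(D, D_U, n_iter=6))             # MAP, six iterations
print(mu_mults(D, D_U, n_iter=5))              # MU, five iterations
print(mu_square_roots(D_U, n_iter=5))          # MU square roots
print(gd_cepstral_mults(D, D_M, D_U, n_iter=2))  # GD, assumed two iterations
```

The iteration counts for MAP and MU are the ones reported in this appendix; the PROSPECT total is not reproduced here because it also depends on $D_c$.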
Abbreviations
ASA: auditory scene analysis
BE: backend
BG: backend Gaussian
CCG: cluster-of-cluster Gaussians
CG: cluster Gaussian
CGS: cluster Gaussian selection
CLBR: cluster-based reconstruction
CLSQ: constrained least squares
DCT: discrete cosine transform
EM: expectation maximization
FRoG: fast removal of Gaussians
GB: Gaussian-based
GD: gradient descent
KLD: Kullback-Leibler divergence
LDA: linear discriminant analysis
MAP: maximum a posteriori probability
MC: multicandidate
MDT: missing data technique
MFCC: Mel-frequency cepstral coefficients
MIDA: mutual information-based discriminant analysis
ML: maximum likelihood
MLE: maximum likelihood estimation
MU: multiplicative updates
PDF: probability density function
PDT: phonetic decision tree
PMC: parallel model combination
PROSPECT: PRojected SPECTra
SB: state-based
SKLD: symmetric Kullback-Leibler divergence
WSKLD: weighted Kullback-Leibler divergence
WSS: wide-sense stationary.
References
1. ETSI standard doc.: Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms. ETSI, Tech Rep ES 202 050 v1.1.3; 2003.
2. Flores JAN, Young SJ: Continuous speech recognition in noise using spectral subtraction and HMM adaptation. In Proceedings of ICASSP. Adelaide, South Australia, Australia; 1994:409-412.
3. Droppo J, Deng L, Acero A: Evaluation of the SPLICE algorithm on the Aurora2 database. In Proceedings of Eurospeech. Aalborg, Denmark; 2001:217-220.
4. Moreno PJ, Raj B, Stern RMJ: A vector Taylor series approach for environment-independent speech recognition. In Proceedings of ICASSP. Atlanta, Georgia, USA; 1996:733-736.
5. Gales M: Model-based techniques for noise robust speech recognition. PhD thesis. University of Cambridge; 1995.
6. Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 1995, 9(2):171-185. 10.1006/csla.1995.0010
7. Gauvain JL, Lee CH: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 1994, 2(2):291-298. 10.1109/89.279278
8. Bregman AS: Auditory Scene Analysis. MIT Press, Cambridge; 1990.
9. Van Segbroeck M, Van hamme H: Vector quantization based mask estimation for missing data ASR. In Proceedings of Interspeech. Antwerp, Belgium; 2007:910-913.
10. Barker J, Cooke M, Green P: Robust ASR based on clean speech models: an evaluation of missing data. In Proceedings of Eurospeech. Aalborg, Denmark; 2001:213-216.
11. Cooke M: A glimpsing model of speech perception in noise. J Acoust Soc Am 2006, 119(3):1562-1573. 10.1121/1.2166600
12. Barker JP, Cooke MP, Ellis DPW: Decoding speech in the presence of other sources. Speech Commun 2005, 45(1):5-25. 10.1016/j.specom.2004.05.002
13. Srinivasan S, Wang D: Transforming binary uncertainties for robust speech recognition. IEEE Trans Audio Speech Lang Process 2007, 15(7):2130-2140.
14. Srinivasan S, Wang D: Robust speech recognition by integrating speech separation and hypothesis testing. Speech Commun 2010, 52(1):72-81. 10.1016/j.specom.2009.08.008
15. Cooke M, Green P, Josifovski L, Vizinho A: Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun 2001, 34(3):267-285. 10.1016/S0167-6393(00)00034-0
16. Lippmann RP, Carlson BA: Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise. In Proceedings of Eurospeech. Rhodes, Greece; 1997:37-40.
17. Seltzer ML, Raj B, Stern RM: A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Commun 2004, 43(4):379-393. 10.1016/j.specom.2004.03.006
18. Cooke M, Morris A, Green P: Missing data techniques for robust speech recognition. In Proceedings of ICASSP. Munich, Germany; 1997:863-866.
19. Renevey P, Drygajlo A: Detection of reliable features for speech recognition in noisy conditions using a statistical criterion. In Proceedings of CRAC Workshop. Aalborg, Denmark; 2001:71-74.
20. Srinivasan S, Shao Y, Jin Z, Wang D: A computational auditory scene analysis system for robust speech recognition. In Proceedings of Interspeech. Pittsburgh, Pennsylvania, USA; 2006:73-76.
21. Cerisara C, Demange S, Haton JP: On noise masking for automatic missing data speech recognition: a survey and discussion. Comput Speech Lang 2007, 21(3):443-457. 10.1016/j.csl.2006.08.001
22. Coy A, Barker J: Soft harmonic masks for recognising speech in the presence of a competing speaker. In Proceedings of Interspeech. Lisbon, Portugal; 2005:2641-2644.
23. Van Segbroeck M, Van hamme H: Advances in missing feature techniques for robust large vocabulary continuous speech recognition. IEEE Trans Audio Speech Lang Process 2011, 19(1):123-137.
24. Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Commun 2004, 43(4):275-296. 10.1016/j.specom.2004.03.007
25. Van hamme H: Robust speech recognition using cepstral domain missing data techniques and noisy masks. In Proceedings of ICASSP. Volume 1. Montreal, Quebec, Canada; 2004:213-216.
26. Cerisara C: Towards missing data recognition with cepstral features. In Proceedings of Eurospeech. Geneva, Switzerland; 2003:3057-3060.
27. Häkkinen J, Haverinen H: On the use of missing feature theory with cepstral features. In Proceedings of CRAC Workshop. Aalborg, Denmark; 2001.
28. Faubel F, McDonough J, Klakow D: Bounded conditional mean imputation with Gaussian mixture models: a reconstruction approach to partly occluded features. In Proceedings of ICASSP. Taipei, Taiwan; 2009:3869-3872.
29. Haeb-Umbach R, Ney H: Linear discriminant analysis for improved large vocabulary continuous speech recognition. In Proceedings of ICASSP. San Francisco, California, USA; 1992:13-16.
30. Kumar N, Andreou AG: A generalization of linear discriminant analysis in maximum likelihood framework. In Tech Rep JHU-CLSP Technical Report No. 16. Johns Hopkins University; 1996.
31. Fletcher R: Practical Methods of Optimization. John Wiley & Sons, Chichester; 1980.
32. Saul LK, Sha F, Lee DD: Statistical signal processing with nonnegativity constraints. In Proceedings of Eurospeech. Geneva, Switzerland; 2003:1001-1004.
33. Van hamme H: Prospect features and their application to missing data techniques for robust speech recognition. In Proceedings of Interspeech. Jeju, Korea; 2004:101-104.
34. Duchateau J, Demuynck K, Van Compernolle D, Wambacq P: Class definition in discriminant feature analysis. In Proceedings of Eurospeech. Aalborg, Denmark; 2001:1621-1624.
35. Young SJ, Odell JJ, Woodland PC: Tree-based state tying for high accuracy acoustic modelling. In Proceedings of Workshop on Human Language Technology. Plainsboro, New Jersey, USA; 1994:307-312.
36. Iskra D, Grosskopf B, Marasek K, van den Heuvel H, Diehl F, Kiessling A: SPEECON - Speech Databases for Consumer Devices: Database Specification and Validation. In Proceedings of LREC. Las Palmas, Spain; 2002:329-333.
37. Demuynck K: Extracting, modeling and combining information in speech recognition. PhD thesis. K.U.Leuven, ESAT; 2001.
38. Fritsch J, Rogina I: The bucket box intersection (BBI) algorithm for fast approximate evaluation of diagonal mixture Gaussians. In Proceedings of ICASSP. Volume 2. Atlanta, Georgia, USA; 1996:273-276.
39. Bocchieri E: Vector quantization for efficient computation of continuous density likelihoods. In Proceedings of ICASSP. Volume 2. Minneapolis, Minnesota, USA; 1993:692-695.
40. Watanabe T, Shinoda K, Takagi K, Iso KI: High speed speech recognition using tree-structured probability density function. In Proceedings of ICASSP. Volume 1. Detroit, Michigan, USA; 1995:556-559.
41. Wang Y, Van hamme H: Speed improvements in a missing data-based speech recognizer by Gaussian selection. In Proceedings of NAG/DAGA. Rotterdam, Netherlands; 2009:423-426.
42. MacKay DJ: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, UK; 2003.
43. Myrvoll TA, Soong FK: Optimal clustering of multivariate normal distributions using divergence and its application to HMM adaptation. In Proceedings of ICASSP. Hong Kong; 2003:552-555.
44. Parihar N, Picone J: Analysis of the Aurora large vocabulary evaluations. In Proceedings of Eurospeech. Geneva, Switzerland; 2003:337-340.
45. Gemmeke JF, Van Segbroeck M, Wang Y, Cranen B, Van hamme H: Automatic speech recognition using missing data techniques: handling of real-world data. In Robust Speech Recognition of Uncertain or Missing Data. Edited by: Haeb-Umbach R, Kolossa D. Berlin-Heidelberg (Germany), Springer Verlag; 2011:157-186.
46. Young S, Kershaw D, Odell J, Ollason D, Valtchev V, Woodland P: The HTK Book. [http://htk.eng.cam.ac.uk/docs/docs.shtml]
47. SPRAAK: Speech Processing, Recognition and Automatic Annotation Kit. [http://www.spraak.org/]
Acknowledgements
This study was financed by the MIDAS project of the Nederlandse Taalunie under the STEVIN programme. Thanks to Kris Demuynck for various implementations and the anonymous reviewers for suggesting additional interesting comparisons and analyses.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Wang, Y., Van hamme, H. Multicandidate missing data imputation for robust speech recognition. J AUDIO SPEECH MUSIC PROC. 2012, 17 (2012). https://doi.org/10.1186/1687-4722-2012-17
Keywords
 speech recognition
 constrained optimization
 missing data
 noise robustness