Multi-candidate missing data imputation for robust speech recognition
EURASIP Journal on Audio, Speech, and Music Processing volume 2012, Article number: 17 (2012)
Abstract
The application of Missing Data Techniques (MDT) to increase the noise robustness of HMM/GMM-based large vocabulary speech recognizers is hampered by a large computational burden. The likelihood evaluations imply solving many constrained least squares (CLSQ) optimization problems. As an alternative, researchers have proposed frontend MDT or have made oversimplifying independence assumptions for the backend acoustic model. In this article, we propose a fast Multi-Candidate (MC) approach that solves the per-Gaussian CLSQ problems approximately by selecting the best from a small set of candidate solutions, which are generated as the MDT solutions on a reduced set of cluster Gaussians. Experiments show that MC MDT runs as fast as the uncompensated recognizer while achieving the accuracy of the full backend optimization approach. The experiments also show that exploiting the more accurate acoustic model of the backend does pay off in terms of accuracy when compared to frontend MDT.
1. Introduction
One of the major concerns in deploying Automatic Speech Recognition (ASR) applications is the lack of robustness of the technology when compared to human listeners. A key aspect is the sensitivity to background noise. This effect is caused by the differences between the conditions in which the statistical models for speech are trained and those in which they are applied in real-life situations. Many approaches which reduce the mismatch to improve the noise robustness of speech recognition have been proposed earlier. They modify either the frontend signal preprocessing or the backend acoustic model of the recognizer. A popular frontend method is the Advanced Front-End [1] which applies multiple stages of Wiener filtering to remove the background noise from the corrupted observations. Other techniques working in the frontend are, e.g., spectral subtraction [2], Stereo Piecewise Linear Compensation for Environment [3] and the Vector Taylor series compensation algorithm [4]. Some examples of backend approaches are Parallel Model Combination (PMC) [5] and model adaptation algorithms, such as Maximum Likelihood Linear Regression (MLLR) [6] and Maximum A Posteriori (MAP) based adaptation [7].
In the late 1990s, Missing Data Techniques (MDT) were introduced in speech recognition as a perceptually motivated approach to improve the noise robustness of a speech recognizer. Research in Auditory Scene Analysis (ASA) [8] proposed models for the capability of human listeners to deal with concurrent signals. The human auditory system is able to extract sufficient information from the speech source of interest in order to recognize what is said, even if parts of the target signal are masked by other signals. It exploits the redundancy in the speech signal and can thus handle missing data. The motivation of MDT is to explore these capabilities of human listeners and exploit them in ASR to reduce the performance gap between humans and computers. It relies on the model that a given spectral band at a given time is dominated by either speech or noise. In the frontend preprocessing, the time-frequency regions of a speech signal are labeled as reliable or as unreliable. This labeling information is encoded into a so-called missing data mask. In the backend decoding, features in the unreliable regions are either ignored or predicted to alleviate the mismatch. This compensation strategy relies only on the speech model and unlike PMC for instance, it does not require a model of the noise, though some assumptions about the noise are required instead while generating the missing data mask [9, 10]. In recent years, the MDT was extended to techniques such as the glimpsing model [11] and speech fragment decoding [12]. Other related work includes the propagation of uncertainty [13] where the authors transform the uncertainty encoded in the binary mask from the spectral domain to the cepstral domain, and handle the transformed uncertainty with the cepstral backend acoustic models. The authors of [14] introduce a two-pass MDT system, where the lattice generated by the MDT recognizer in the first pass is rescored. In the second pass, a state-based hypothesis test then generates the so-called "integrated mask", yielding better recognition results.
The two major problems in MDT are, first, estimating the mask and, second, exploiting it during recognition. Identifying the 'missing' part during recognition is an essential step in MDT as proposed by Cooke et al. [15] and Lippmann and Carlson [16]. A missing data detector makes a binary decision about which spectro-temporal components are unreliable due to noise distortions and which remain reliable, i.e., are dominated by speech. Approaches to missing data mask estimation, such as Bayesian classification [17], harmonic mask estimation [10], local SNR-based mask estimation [18, 19], and VQ mask estimation [9], mainly exploit characteristics of the speech signal. The authors of [20] estimate the missing data masks based on computational ASA. More approaches can be found in a survey on missing data mask estimation by Cerisara et al. [21]. The concept of binary reliability masks can be extended to soft masks [22] when uncertainty about the reliability information is taken into account. The mask then assumes continuous values instead of binary values. Soft masks are not considered in this article, as we have found them to provide little benefit in [23].
Several paradigms have been designed to apply MDT once the masks are computed. MDT was first formulated for a spectral acoustic model [15], which is referred to as spectral MDT in this article. The spectral energy within each unreliable component can be either reconstructed based on the acoustic model and the reliable information, or marginalized out of the probability density functions (PDF) of the HMM states. The former scheme is defined as imputation and the latter is defined as marginalization. In order to improve the performance of MDT, Raj et al. [24], Van hamme [25], Cerisara [26], Häkkinen and Haverinen [27], and Faubel et al. [28] applied MDT using cepstral acoustic models, which are referred to as cepstral MDT in this article. The experimental results of cepstral MDT demonstrate its advantage over the spectral model. The authors of [24] used MDT imputation to enhance the speech features in the front-end, while in Maximum Likelihood (ML) Gaussian-based imputation [25] and in conditional mean imputation [28], the authors consider MDT imputation associated with Gaussians in the backend.
The above work addresses the robustness of the MDT system rather than its efficiency. MDT systems involve much more intensive computation in the backend, as explained in Section 3. This was already noticed in [15], where the problem was addressed by compromising on the acoustic model (diagonal Gaussians for log-spectral features). An alternative solution is to formulate MDT as a front-end technique [24]. In this article, we propose a Multi-Candidate (MC) MDT which not only produces competitive recognition accuracy, but also possesses the same efficiency as a conventional large vocabulary recognizer under noisy conditions. We advocate the backend approach, since it exploits the most accurate speech model that is available in the recognizer to compensate for the missing data. Each HMM state represents an accurate hypothesis about what the missing speech could be, integrating all knowledge that is available in the decoder: acoustics, lexical information, and language model. Hence, we expect more accurate missing data imputation than with frontend MDT approaches, where such sophistication is not available. In our setting, we go beyond the state level and compute a clean speech vector per Gaussian. In addition to the entire set of Gaussians embedded in the HMM, a fairly small set of Gaussians is trained to function as cluster Gaussians (CG). They provide feasible imputation candidates (i.e., candidates satisfying the constraints on the imputed data, as described in Section 2.1) for the entire set of Gaussians. As such, instead of solving the full optimization problem for each Gaussian in the acoustic model, candidate solutions are selected from the CGs and the most likely one is retained. Therefore, the implementation of MC MDT requires only a modest modification of conventional HMM-based recognizers. The MC MDT forms the main contribution of this article. It is an algorithm that aims at computational gains for large vocabulary speech recognizers without sacrificing accuracy or robustness. It provides a solution for applying MDT to an existing backend model trained for the speech feature vector of one's choice. Furthermore, we show experimentally that we gain more immunity to noise than if MDT is applied as a frontend feature-enhancement technique [24] and we compare several methods for solving the imputation problem.
Figure 1 shows the architecture of the MC MDT system. This article focuses on the three blocks in the middle, i.e., the imputation for the (clusters of) CGs and the MC imputation for the backend Gaussians (BGs). The rest of this article is arranged as follows: in Section 2, we introduce the conventional state-based imputation and marginalization [15] as well as the spectral reconstruction [24]. In Section 3, we discuss MDT imputation under the framework of ML decoding and why it becomes difficult when using a model trained with decorrelated features such as cepstral features or features generated by, e.g., linear discriminant analysis (LDA) [29]. Section 4 describes the approach of MC MDT imputation using CGs to speed up the Gaussian-based imputation. Section 5 explains how to further speed up the imputation of the CGs by selecting a subset of the CGs dynamically. Section 6 describes several experimental results. Finally, in Section 7 we present our conclusions and propose future work.
2. Spectral and Cepstral MDT systems
In this section, we review some of the concepts of MDT that lead to approaches that are most related to the proposed system.
2.1. Bounds
Environmental noises are assumed to be additive in the spectral domain. Hence, at frame t, the log-spectra of the underlying complete clean speech x_t can be assumed to be approximately bounded above by the observed noisy feature vector y_t, namely:

x_t ≤ y_t,    (1)

where the inequality sign for vectors applies component-wise. Both x_t and y_t can be partitioned into their reliable and unreliable sub-vectors according to the mask:

x_t = [x_{t,r}^T, x_{t,u}^T]^T,  y_t = [y_{t,r}^T, y_{t,u}^T]^T.

For the reliable spectro-temporal regions, the observed noisy features are deemed to be pure speech:

x_{t,r} = y_{t,r},

whereas for the unreliable regions, the observed features act merely as upper bounds for the clean speech:

x_{t,u} ≤ y_{t,u}.
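As an illustration, the following numpy sketch shows how a binary missing data mask partitions a noisy log-spectral frame into the reliable part, which is taken as clean speech, and the unreliable part, for which the observation only provides an upper bound. The variable names are illustrative and not part of the original formulation.

import numpy as np

def partition_frame(y_t, mask):
    """Split a noisy log-spectral frame according to a binary missing data mask.

    y_t  : noisy log-spectral vector at frame t
    mask : boolean vector, True where a component is deemed reliable (speech-dominated)
    """
    y_r = y_t[mask]     # reliable components: treated as clean speech, x_r = y_r
    y_u = y_t[~mask]    # unreliable components: only an upper bound, x_u <= y_u
    return y_r, y_u

# toy example with a 4-band frame
y_t = np.array([2.1, 0.5, 3.3, 1.0])
mask = np.array([True, False, True, False])
y_r, y_u = partition_frame(y_t, mask)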
2.2. State-based imputation and marginalization
The authors of [15] formulated several MDT approaches which use the acoustic models trained in the same domain in which the masks are expressed and in which the constraints of Equation (1) hold. In their experiments, the acoustic feature vectors are obtained via a 64-channel auditory filter bank with center frequencies spaced linearly on an ERB scale from 50 Hz to 8 kHz. The HMM-based speech recognizer is adapted to accommodate MDT by modifying the state likelihood evaluation as outlined below. Each HMM state is expressed as a mixture of multivariate Gaussians with a diagonal covariance matrix. The MDT here is carried out frame-by-frame and is assumed independent across frames. The authors proposed both state-based imputation and marginalization. Besides the upper bound y t, u , a lower bound can also be applied to control the arbitrariness of compensation for the unreliable components. This idea can be applied to all methods described in this article. However, for consistency, we will omit lower bounds from this article.
In state-based marginalization, each state output PDF is a function of the reliable components only, while the unreliable components are marginalized out, i.e., each unreliable component is integrated over the range of values it can assume. The PDF of a state s is given by:
where G(s) represents the set of Gaussians belonging to the Gaussian Mixture Model (GMM) of state s. The integral of Gaussian k can be calculated using the component-wise error function because its covariance matrix is assumed to be diagonal.
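For a diagonal Gaussian, this marginalized likelihood can indeed be evaluated with the error function. The sketch below illustrates this for a single mixture component, assuming the integration range (-inf, y_{t,u}] implied by the bounds of Section 2.1; mixture weights and the optional lower bound are omitted for brevity.

import numpy as np
from scipy.special import erf

def marginalized_log_likelihood(y_r, y_u, mu_r, var_r, mu_u, var_u):
    """Log-likelihood of one diagonal Gaussian under marginalization: the reliable
    components are evaluated as usual, while each unreliable component is integrated
    from -inf up to its noisy upper bound (a Gaussian CDF, expressed via erf)."""
    ll_reliable = -0.5 * np.sum((y_r - mu_r) ** 2 / var_r + np.log(2 * np.pi * var_r))
    cdf = 0.5 * (1.0 + erf((y_u - mu_u) / np.sqrt(2.0 * var_u)))
    ll_unreliable = np.sum(np.log(cdf + 1e-300))   # guard against log(0)
    return ll_reliable + ll_unreliable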
In state-based imputation, the clean speech is imputed for every state s, followed by calculating the likelihoods using the imputed values, which will be utilized to expand hypotheses in the search space during decoding. Two ways of imputing the clean speech per state are given: linear combination or winner-takes-all.
In linear combination, the Minimum Mean Square Error (MMSE) estimate of the imputation from state s is
where μ_{u,k} is the unreliable sub-vector of the mean of Gaussian k and
In winner-takes-all, after the clean speech is imputed for each Gaussian belonging to state s, the mixture's likelihood is evaluated for all imputed values and the most likely imputation is selected as the imputation of the state. In other words, the imputation of state s is approximated by the clean speech vector imputed from its most likely member Gaussian, where x̂_{t,u,k} is the maximum likelihood imputation of the unreliable sub-vector x_{t,u,k} for Gaussian k included in G(s):

x̂_{t,u,k} = arg max_{x_{t,u} ≤ y_{t,u}} N(x_{t,u}; μ_{u,k}, Σ_{u,k}).

This problem has a closed form solution:

x̂_{t,u,k} = μ_{u,k} if μ_{u,k} ≤ y_{t,u}, and y_{t,u} otherwise,    (2)

where it should be understood that we have written the solution vectorially for convenience, but the top or bottom case in (2) may apply to different vector components. The imputation using a spectral acoustic model containing Gaussians with a diagonal covariance matrix has an analytical solution because the components of the log-spectral features are considered to be independent. However, the spectral features do exhibit correlation among their components, which the diagonal-covariance spectral GMM used above cannot model effectively. The performance of HMM speech recognizers using GMMs with diagonal covariance is significantly better when using decorrelated features, e.g., Mel Frequency Cepstral Coefficients (MFCC). Therefore, a cepstral MDT model with diagonal-covariance Gaussians is introduced in the following section.
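The following sketch illustrates Equation (2) and the winner-takes-all scheme for a state GMM with diagonal Gaussians in the log-spectral domain; the data structures are illustrative assumptions.

import numpy as np

def impute_diag_gaussian(mu_u, y_u):
    """Bounded ML imputation for a diagonal Gaussian: per component, take the mean
    unless it exceeds the noisy upper bound, in which case take the bound (Eq. (2))."""
    return np.minimum(mu_u, y_u)

def winner_takes_all(y_r, y_u, gaussians, weights):
    """State-based imputation by the most likely member Gaussian.

    gaussians : list of (mu_r, var_r, mu_u, var_u) tuples for the reliable/unreliable parts
    weights   : mixture weights of the state GMM
    Returns the imputation of the winning Gaussian."""
    best_ll, best_imp = -np.inf, None
    for w, (mu_r, var_r, mu_u, var_u) in zip(weights, gaussians):
        x_u = impute_diag_gaussian(mu_u, y_u)
        # diagonal-Gaussian log-likelihood at the reliable data and the imputed values
        ll = (np.log(w)
              - 0.5 * np.sum((y_r - mu_r) ** 2 / var_r + np.log(2 * np.pi * var_r))
              - 0.5 * np.sum((x_u - mu_u) ** 2 / var_u + np.log(2 * np.pi * var_u)))
        if ll > best_ll:
            best_ll, best_imp = ll, x_u
    return best_imp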
2.3. Spectral reconstruction
In [24], the authors reconstruct the spectral features using either a correlation-based method or a cluster-based method. The reconstructed spectra are then transformed into cepstra for processing by the speech recognizer.
The correlation-based approach solves the imputation of unreliable components at each frame by exploiting the correlations among the components in the spectro-temporal representation. The correlation is modeled by a Gaussian wide-sense stationary (WSS) process whose parameters are learned from training data. The core of the algorithm is a bounded MAP estimate:

x̂_{t,u} = arg max_{x_{t,u} ≤ y_{t,u}} p(x_{t,u} | y_{t,n}),    (3)

where y_{t,n} is the neighborhood vector containing all the related reliable components which are spectrally and temporally sufficiently close to x_{t,u}, as defined by the WSS model. The likelihood p(x_{t,u} | y_{t,n}) is modeled with a full covariance Gaussian conditioned on the observed y_{t,n}. The authors establish an iterative approach to solve (3).
In the cluster-based approach, the distribution of the observation is modeled by a spectral GMM with M mixture components with full covariance. Each of these mixture components is called a CG, trained by the Expectation-Maximization (EM) algorithm. The unreliable components of the reconstructed spectra are obtained from a linear combination of the values imputed for the CGs:

x̂_{t,u} = Σ_{m=1}^{M} P(m | x_{t,u,m} ≤ y_{t,u}, y_{t,r}) x̂_{t,u,m},    (4)

where

x̂_{t,u,m} = arg max_{x_{t,u} ≤ y_{t,u}} p(x_{t,u} | y_{t,r}, m)    (5)

is the imputation resulting from the m-th CG, a bounded optimization problem which can be solved by the MAP algorithm as in the correlation-based approach. P(m | x_{t,u,m} ≤ y_{t,u}, y_{t,r}) is the posterior probability of the CG given the reliable data and the feasible region for the unreliable data.
To make computation of this posterior probability tractable, the spectral CGs are assumed to have diagonal covariance for this computation.
In both the correlation-based and the cluster-based method, the reconstruction is separated from the decoding and there is only a single imputation per frame, while in the spectral state-based imputation of Section 2.2, each state or Gaussian has its own imputation, which is theoretically better suited to an ML-based recognizer. The likelihood of each state is calculated at its imputed value and used in the backend of the recognizer, which incorporates the lexical and grammatical knowledge to drive the path pruning in the beam.
It should be noted that the authors of [15] show that state-based marginalization outperforms state-based imputation. Therefore, it would be natural to formulate marginalization for cepstral or other decorrelated models as well. However, this leads to definite integrals of full covariance Gaussians. Even if the approximations described in [28] were applied to marginalization, the computational complexity would not be acceptable for a practical speech recognizer. Hence, we only focus on imputation with decorrelated models.
3. Missing data imputation for maximum likelihood decoding
State-of-the-art automatic speech recognizers take a Bayesian approach, i.e., the decoding process finds the sequence of words whose posterior probability is maximal given a T-frame sequence of observations y_{1...T}:

Ŵ = arg max_W P(W | y_{1...T}) = arg max_W p(y_{1...T} | W) P(W),

where the language model P(W) is the probability of a hypothesized word sequence W. In practice, the most likely state sequence s_{1...T} that realizes W is found. In MDT, the maximization should additionally be taken over the unreliable features to be imputed, i.e., x_{1...T,u}, to find the optimal imputation bounded by the noisy observation y_{1...T,u}.
For a given state sequence s_{1...T} with W embedded, the complete speech is given by the following expression, where we have assumed state-conditional independence of x_{1...T,u}:
where a is the product of the transition probabilities between the states on the hypothesized path. The maximization in Equation (7) can be accomplished frame-by-frame, i.e., the optimal clean speech at time t is obtained by the maximization of the output PDF of state s over the complete speech x t bounded by the observation y t :
Equation (8) formulates an ML state-based missing data imputation. The constrained optimization in (8) is not computationally tractable. If each member Gaussian in a state output PDF is assumed to impute its own clean speech using MLE:
MDT imputation becomes ML Gaussian-based imputation, which is an approximation of the state-based imputation but is computationally more tractable. It will be shown in Section 6.4.4 that (8) and (9) yield comparable recognition accuracy.
If the model used for imputation is trained with cepstral features or other decorrelated features, such as LDA [29] or HLDA [30] features, Gaussian k can be formulated in the log-spectral domain after the corresponding linear transformation C of full row-rank is applied:
where C+ represents the pseudo inverse of C, μ k , Σ k are the mean and diagonal covariance of the transformed features and D m denotes the dimension of the decorrelated feature vectors. Instead of maximizing probabilities, we can equivalently minimize the cost function:
with the precision matrix H_k = C^T Σ_k^{-1} C,
the maximization of (9) over x t, k becomes:
Notice that H_k can be singular (e.g., when the cepstral features have a lower dimension than the log-spectral features), in which case a k-dependent small fraction of the identity matrix is added to regularize H_k, so that a unique solution of (11) is found. Since H_k is not diagonal, the bounded minimization in (11) can no longer be solved by Equation (2). Instead, it becomes a Constrained Least Squares (CLSQ) problem, which does not have an analytical solution. Methods such as the MAP algorithm [24], primal active set methods [31], Multiplicative Updates (MU) [32], and imputation with PROSPECT features [33] have been proposed, but their computational cost for large vocabulary speech recognizers with tens or hundreds of thousands of Gaussians becomes prohibitive. Below, MC MDT imputation is proposed to reduce this computational load substantially and achieve a fast MDT recognizer.
4. MC missing data imputation
In (11), Gaussian-based imputation is formulated as searching for the optimal clean speech vector within a feasible region, i.e., the (continuous) subspace which is spanned by the unreliable components and is bounded by the observation. This process can be approximated by evaluating each Gaussian on a list of feasible clean speech candidates and then selecting the candidate which maximizes the likelihood as the imputed value. This approximation is the basic idea behind MC MDT imputation. For every Gaussian, the list of candidates is given by the imputations from a small set of CGs. The Gaussians in the acoustic model, typically a large number, will be called BGs in the remainder of the article. The optimization for each BG k in (11) is then approximated by selecting, from the candidate list, the clean speech candidate that maximizes the likelihood of BG k:
where Ω_k represents all the CGs that might generate suitable solutions for Gaussian k, and the candidates are the clean speech estimates of the unreliable components obtained from the CGs l in Ω_k. The construction of Ω_k will be detailed in Section 4.3. Hence, in MC MDT, solving the CLSQ problem of BG k is replaced by L_k likelihood evaluations, where L_k is the cardinality of Ω_k. Whereas solving a large number of BG imputation problems is avoided, the task is shifted to the restricted set of CGs. Solving each of these problems requires a computational effort that is at least an order of magnitude greater than the evaluation of a Gaussian likelihood, so various approaches for the imputation with the CGs are discussed below.
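A minimal sketch of this selection step for one BG is given below, assuming diagonal-covariance BGs and candidate imputations already transformed to the backend feature domain; the data structures are illustrative.

import numpy as np

def mc_imputation_for_bg(mu_k, var_k, candidates, omega_k):
    """Multi-candidate imputation for backend Gaussian k.

    candidates : dict mapping CG index l -> candidate feature vector (the imputation of
                 CG l, already transformed to the backend feature domain)
    omega_k    : indices of the CGs associated with BG k (the set Omega_k)
    Returns the best candidate's log-likelihood and the candidate itself."""
    best_ll, best_x = -np.inf, None
    for l in omega_k:
        if l not in candidates:        # the CG may have been pruned at this frame
            continue
        x = candidates[l]
        ll = -0.5 * np.sum((x - mu_k) ** 2 / var_k + np.log(2 * np.pi * var_k))
        if ll > best_ll:
            best_ll, best_x = ll, x
    return best_ll, best_x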
4.1. ML-imputation for CG
The imputed value from the CGs can be computed by iterative approaches such as Gradient Descent (GD) [33], MU [32], or MAP [24]. In GD, the gradient for iteration τ is
where each negative component of the gradient g_{t,k}^τ for which x_{t,k} is on the boundary y_t is zeroed, and so is each reliable component of g_{t,k}^τ, in order not to violate the constraints. Since the cost function in Equation (10) is quadratic, the optimal step size for iteration τ has an analytic expression:
The step direction is maintained, but the step size is reduced such that the boundary constraints are not violated.
To initialize the GD algorithm, the non-diagonal covariance structure is ignored, i.e., it starts from the solution in Equation (2).
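To make the procedure concrete, the sketch below implements such a bounded gradient descent under the assumption that the cost is the quadratic (Cx_t - μ_k)^T Σ_k^{-1} (Cx_t - μ_k) of Section 3; the pseudo-inverse initialization and the fixed iteration count are our simplifications, not the exact recipe of [33].

import numpy as np

def projected_gd_imputation(y, mask, C, mu, var, n_iter=2):
    """Bounded imputation of a clean log-spectral frame for one Gaussian trained on
    decorrelated features v = C x (assumed cost: (Cx - mu)^T diag(var)^-1 (Cx - mu)).
    y: noisy frame (upper bound), mask: True = reliable (fixed), False = unreliable."""
    prec = 1.0 / var
    # initialization in the spirit of Eq. (2): back-project the mean and clip at the bound
    x = np.linalg.pinv(C) @ mu
    x[mask] = y[mask]                        # reliable components are fixed to the observation
    x[~mask] = np.minimum(x[~mask], y[~mask])
    for _ in range(n_iter):
        g = C.T @ (prec * (C @ x - mu))      # gradient of the quadratic cost
        g[mask] = 0.0                        # never move the reliable components
        active = (~mask) & np.isclose(x, y) & (g < 0.0)
        g[active] = 0.0                      # do not push components that sit on the bound
        if not np.any(g):
            break
        Hg = C.T @ (prec * (C @ g))
        step = (g @ g) / max(g @ Hg, 1e-12)  # exact line search for a quadratic cost
        # shrink the step so that no unreliable component exceeds its upper bound
        moving_up = (~mask) & (g < 0.0)
        if np.any(moving_up):
            limits = (y[moving_up] - x[moving_up]) / (-g[moving_up])
            step = min(step, limits.min())
        x = x - step * g
    return x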
We opt for GD rather than MU [32] or MAP [24] because it benefits from several advantages simultaneously: (i) the number of iterations required for practical convergence is smaller [33], (ii) the gradient computation (13) can be carried out from right to left such that only a small number of matrix-vector multiplications and vector operations are required (see Appendix), (iii) only the constant transformation matrix C, the observation, mean, and variances need to be copied to the cache memory of the CPU while other methods may require a larger memory access bandwidth, (iv) GD does not require square root operations like MU, hence the total number of arithmetic operations per iteration is less than that of MU (as shown in Table 1).
The computational effort is further reduced by using PROSPECT features together with GD, which is proposed in [33].
PROSPECT features are composed of two feature subsets. The first subset consists of cepstral features of low order D_c (e.g., D_c = 4), which model the rough shape of the spectrum at time t. This cepstral part is given by

x_{t,c} = C_c x_t,

where C_c denotes the reduced DCT matrix with orthonormal rows. The remaining details of the signal are captured by

x_{t,p} = (I - C_c^T C_c) x_t,

which is termed the projection part because it is the orthogonal projection of x_t on the orthogonal complement of the subspace spanned by the rows of C_c. The concatenation of the cepstral part and the projection part is referred to as PROjected SPECTral (PROSPECT) features: v_t = [x_{t,c}^T, x_{t,p}^T]^T. The PROSPECT transformation matrix is therefore R, obtained by stacking C_c on top of (I - C_c^T C_c), such that v_t = R x_t.
The likelihood of the k th Gaussian based on the PROSPECT features is formulated as
where
and
where μ_{c,k}, Σ_{c,k}, μ_{p,k}, and Σ_{p,k} are the means and covariance matrices of the cepstral and projection parts of PROSPECT Gaussian k, respectively. They are all estimated on data using the EM algorithm and both Σ_{c,k} and Σ_{p,k} are diagonal. However, a diagonal Σ_{p,k} implies invalid independence assumptions in the spectral residual x_{t,p}. Hence, the stream exponent β in (14) is introduced to reduce the impact of these assumptions. According to [33], a typical value of β is 0.5. Note that F(v_t|k) is not a strict PDF because it does not integrate to unity due to β, but we will still refer to it as the likelihood of Gaussian k. When substituting (15) and (16) in (14), the cost function of Gaussian k becomes
where
The gradient computation and cost function evaluation now involve only multiplications of small matrices and vector additions, which exploits the CPU's cache memory more efficiently and reduces the computational effort in comparison to a cepstral (or LDA) model, as witnessed by Table 1. Refer to the Appendix for details. The study [33] also shows that the PROSPECT model performs as well as the cepstral model for a recognizer without MDT. Because of their better efficiency and comparable accuracy, the PROSPECT features are preferred for the CGs and the algorithm selected for minimizing (17) is GD. Since the CGs only serve to generate candidate spectra, there is no need for the CGs and the BGs to be expressed in the same feature domain. For example, in the experiments of Section 6, the BGs will be trained with the features generated by the Mutual Information-based Discriminant Analysis (MIDA) technique [34].
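For concreteness, a numpy sketch of the PROSPECT feature construction for a single static log-spectral frame is given below; the orthonormal DCT convention is an assumption and dynamic features are omitted.

import numpy as np

def dct_matrix(n_rows, dim):
    """Orthonormal DCT-II matrix truncated to its first n_rows rows (convention assumed)."""
    k = np.arange(n_rows)[:, None]
    n = np.arange(dim)[None, :]
    C = np.cos(np.pi * k * (2 * n + 1) / (2 * dim)) * np.sqrt(2.0 / dim)
    C[0, :] /= np.sqrt(2.0)
    return C                                  # rows are orthonormal

def prospect_features(x, D_c=4):
    """Split a log-spectral frame x into a low-order cepstral part and the projection of x
    on the orthogonal complement of the cepstral subspace (static PROSPECT features)."""
    C_c = dct_matrix(D_c, x.shape[0])
    x_c = C_c @ x                             # cepstral part: rough spectral shape
    x_p = x - C_c.T @ x_c                     # projection part: remaining spectral detail
    return np.concatenate([x_c, x_p])

# a 22-band log-spectrum, as used in Section 6, yields 4 + 22 = 26 static coefficients
v_t = prospect_features(np.random.randn(22))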
4.2. Training the CGs
Clustering methods for Gaussians can be categorized as model-based or data-driven. In the former methods, such as the popular K-means, the parameters of the CGs are estimated from the parameters of the BGs. In the latter methods, the parameters of the CGs are estimated from training data. Model-based Gaussian clustering methods are not well suited to create the CGs in MC MDT, because they would involve a transformation between the domains in which the CGs and BGs are expressed. For instance, MIDA CGs could first be estimated from the MIDA BGs and then converted into PROSPECT CGs. But this conversion involves a lossy transformation and hence performance cannot be guaranteed. Therefore, a data-driven approach is selected in this study.
In order to obtain the CGs from training data, a compact HMM is trained. The compact model shares its structure with the backend model containing the BGs in the sense that it uses the same phonetic decision tree (PDT) [35], but it has only M Gaussians which are shared among the leaves of the PDT. Hence, every HMM state s will have an associated set of CGs as well as a set of BGs, denoted by GCG(s) and GBG(s), respectively. Typically, M is a few hundred and M << K, where K is the total number of BGs. These M Gaussians are used as the CGs and can be trained for any feature representation. The parameters to be trained are the PROSPECT means and covariance matrices of the CGs as well as the mixture weights PCG(m|s). Before training the compact model, a state-level segmentation is made using the Viterbi algorithm with the backend model, i.e., the segmentation specifies the alignment between the states and the frames of the training data. M BGs are randomly selected to initialize the CGs. However, since BGs and CGs may be expressed on different feature sets, a particular initialization of the CGs is required. To this end, the M retained BGs are considered as a GMM with uniform weights. The posterior probabilities of the M BGs are calculated on the MIDA representation and are used in the first iteration of the EM algorithm, i.e., the BG posteriors are used to softly assign training samples to the CGs to initialize the means, covariances, and mixture weights. Subsequently, a standard EM training without altering the segmentation is performed using PROSPECT features. Consequently, each tied state is now modeled by a GMM with up to M components trained on PROSPECT features. Finally, every BG can be assigned to multiple CGs to form a soft clustering, as explained below.
4.3. Association between CGs and BGs
The association between the CGs and the BGs is based on the same segmentation used in Section 4.2. In this step, the likelihoods of all the BGs belonging to state s at training frame t are calculated along the Viterbi path. The likelihoods of the PROSPECT CGs belonging to s are calculated for the same frames. The most likely CG and the most likely BG of that state are then identified for the frame, where the likelihood F(Ry_t|m) of CG m is calculated by Equation (14), R is the PROSPECT transformation and E represents the linear transformation of the backend features. For every training frame of speech t, the entry of the association matrix Φ corresponding to this CG-BG pair (as shown in Figure 2) is incremented by 1. After all training data are processed, the set Ω_k in Equation (12) is defined from the k-th column of Φ as those entries that are larger than the product of a pruning threshold θ_Φ and the maximum of the k-th column. Moreover, if Ω_k contains more than L_max elements, only the L_max largest Φ-values are retained. The entries of Φ that are not in Ω_k are subsequently set to zero.
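A sketch of how one column of Φ could be turned into Ω_k is shown below; the threshold value is only an example.

import numpy as np

def build_omega(phi_column, theta_phi, l_max=5):
    """Candidate set Omega_k from the k-th column of the association matrix: keep only
    counts above theta_phi times the column maximum, and at most l_max of the largest."""
    keep = np.flatnonzero(phi_column > theta_phi * phi_column.max())
    if keep.size > l_max:
        keep = keep[np.argsort(phi_column[keep])[::-1][:l_max]]
    return keep                               # indices of the CGs retained for this BG

omega_k = build_omega(np.array([12, 0, 3, 40, 1, 9]), theta_phi=0.1, l_max=5)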
The probability that CG m is associated with any BG is formulated by
and is used as the prior probability of a CG below.
4.4. Application of MC MDT in the recognizer
Figure 2 illustrates the process of calculating the likelihoods of the BGs. Since the CGs and the CG-BG association table are now available, it is also convenient to apply Gaussian selection together with MC MDT. The motivation of Gaussian selection is that only a small (frame-dependent) portion of the Gaussians dominate the likelihoods of the HMM states and are therefore worth evaluating. However, since the likelihood computation in an MDT system involves the imputation of the unknown data, many conventional methods do not apply readily. The proposed approach proceeds as follows. The recognizer first evaluates all the PROSPECT CGs using GD. Only the BGs assigned (as determined by Ω_k) to sufficiently likely CGs will be evaluated, while the others are ignored. The prior probabilities (18) and the likelihoods of the CGs, each evaluated at its own imputed value, are used to calculate the corresponding posterior probabilities. The posterior probabilities are sorted in descending order and then truncated such that a large fraction ρ (e.g., ρ = 0.95) of the posterior probability mass is included, i.e., the number of CGs kept is the smallest L_s for which this mass is reached. An exponent α is applied in this posterior computation to compensate for unmodeled correlations among the features; it indirectly controls the number of selected BGs. A typical value of α is 0.4, which led to a reasonable trade-off between the number of selected BGs and recognition accuracy on the development dataset used in [36]. In Figure 2, the CGs labeled with "1" in the CG selection table are selected at frame t. Only the imputed clean spectra resulting from the selected CGs are transformed into the MIDA domain and maintained as possible candidates for BG likelihood evaluation.
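The sketch below illustrates this selection; where exactly the exponent α enters the posterior computation is not fully specified here, so applying it to the CG log-likelihoods should be read as an assumption.

import numpy as np

def select_cgs(log_likelihoods, priors, alpha=0.4, rho=0.95):
    """Select the CGs whose (tempered) posterior mass reaches a fraction rho.

    log_likelihoods : per-CG log-likelihoods evaluated at their own imputed spectra
    priors          : per-CG prior probabilities, cf. Equation (18)
    """
    scores = alpha * log_likelihoods + np.log(priors)   # alpha tempers the likelihoods
    post = np.exp(scores - scores.max())
    post /= post.sum()
    order = np.argsort(post)[::-1]                      # descending posterior probability
    cumulative = np.cumsum(post[order])
    n_keep = int(np.searchsorted(cumulative, rho)) + 1  # smallest L_s covering mass rho
    return order[:n_keep]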
When calculating the likelihood of a particular BG k, the MC MDT recognizer looks up the k-th column of the CG-BG association table Φ to find the candidate list. Notice that some of the associated CGs may have been pruned by criterion (19) and are removed from the list. The recognizer calculates the likelihoods of the BG for the candidates of imputed clean speech and selects the maximum as the likelihood of that BG. If the candidate list is empty, the BG is assigned a likelihood of zero. On average, the number of multiplications involved per BG is hence proportional to the average number of CGs associated with a BG times D_m, the dimension of the MIDA features. The resulting likelihoods of the BGs are used to calculate the state output PDFs, which are then processed by the decoder.
5. Selection of CG
The MC MDT system can be further sped up by applying Gaussian selection on the CGs. Though M << K, the evaluation of a CG is still an order of magnitude more expensive than the evaluation of a BG. Thus, only the likely CGs are selected to impute the candidate clean speech. Existing methods of Gaussian selection can be classified as axis indexing-based methods [37, 38] and VQ-based methods [39, 40]. The former quickly locate the likely regions based on the observation, then select the Gaussians in the likely regions [38] or remove the Gaussians in the unlikely regions [37]. But in MDT systems, it is not straightforward to determine which regions in the feature space are likely, because some of the components of the observation are missing. In contrast, VQ-based methods suit the MC MDT system well. Clusters of Cluster Gaussians (CCGs) in the PROSPECT domain are therefore established. The MC MDT recognizer will select the CGs based on the likelihoods resulting from the imputation of the CCGs, i.e., an additional layer of Gaussian selection is provided. Consequently, it reduces the number of CG CLSQ problems to be solved. Clustering of the CGs is a prerequisite for the VQ-based Gaussian selection. In this study, we apply the soft K-means algorithm to generate the CCGs. Since the CCGs and the CGs are expressed in the same domain (PROSPECT features in our example), a model-based approach is feasible and preferred here.
5.1. Soft K-means clustering
The following pseudo code summarizes the steps to obtain the CCGs. A single cluster is first calculated using all the CGs. The number of CCGs then grows incrementally from 1 to N to avoid suboptimal clustering as much as possible.
1. Set the number of CCGs n to 1 and compute a single CCG from all CGs.
2. While n < N:
  2a. Find the CCG with the maximum mean WSKLD.
  2b. Split that CCG into two and increment n.
  2c. For iteration τ from 1 to T:
    2c-1. For each CCG i, i from 1 to n:
      2c-1-1. For each CG m, m from 1 to M: calculate the weight ĝ(i, m) by which CG m updates CCG i.
      2c-1-2. Given ĝ(i, m), update μ_i and Σ_i iteratively.
The distance metric between Gaussians and the computation of the CCGs from their member CGs are two crucial components for every step listed in the above pseudo code. The distance metric is the Weighted Symmetric Kullback-Leibler Divergence (WSKLD) used in step 2a and is explained in Section 5.2. The parameter estimation algorithms in steps 1, 2c-1-1, and 2c-1-2 are described in Section 5.3. Step 2b is described in Section 5.4.
5.2. Distance metric between PROSPECT Gaussians
The symmetric Kullback-Leibler Divergence (SKLD) is commonly used to measure the distance between CCG n and CG m:
The application of SKLD to (14) requires some care. First, the stream exponent β in the likelihood model for PROSPECT features makes it an improper distribution, requiring renormalization such that it integrates to unity. Second, it was found that SKLD overweights differences in the projection part of the PROSPECT Gaussians. Therefore, in [41], further simplifications were proposed and experimentally verified, leading to the WSKLD as a clustering metric for multi-stream features:
where β_j is the exponent of stream j, SKLD_j is the symmetric KLD computed on the features of stream j only, and N_strm is the total number of streams. In this study, N_strm is 6 because the PROSPECT features contain static, velocity, and acceleration streams of both the cepstral and projection parts.
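As a concrete sketch, assuming diagonal Gaussians per stream and the weighted sum of per-stream SKLDs described above (our notation), the metric could be computed as follows.

import numpy as np

def skld_diag(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal Gaussians, KL(1||2) + KL(2||1)."""
    kl12 = 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1.0 + np.log(var2 / var1))
    kl21 = 0.5 * np.sum(var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1.0 + np.log(var1 / var2))
    return kl12 + kl21

def wskld(streams_a, streams_b, betas):
    """Weighted symmetric KLD over feature streams: each per-stream SKLD is scaled by
    that stream's exponent beta_j and the contributions are summed."""
    return sum(b * skld_diag(ma, va, mb, vb)
               for b, (ma, va), (mb, vb) in zip(betas, streams_a, streams_b))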
5.3. Parameter estimation of CCGs
Following the K-Means algorithm in [42], the cost function to be minimized for clustering is
where γ controls the stiffness of the clustering and g(n, m) are unknown clustering weights. The parameters to be updated iteratively are
In each iteration, the first step is to obtain the optimal weight by which CG m affects CCG n as
The second step is to find the optimal values of the mean and covariance of each CCG given the weights. The estimation of the means and covariance matrices of the CCGs is based on the approach in [43], where a method for finding the centroid of a set of Gaussians is derived. The centroid is the CCG that minimizes the sum of the WSKLD to all CGs. Here, we extend the results of [43] by modifying the cost function to (21). The mean of a CCG is thereby estimated as
Matrix Z is constructed to facilitate the re-estimation of the covariance matrix of the CCGs
where
and
By construction, Z has D_P positive and D_P symmetrically negative eigenvalues, where D_P is the dimension of the PROSPECT features. A 2D_P-by-D_P matrix V is constructed whose columns are the D_P eigenvectors corresponding to the positive eigenvalues. V is partitioned into its upper half U and lower half W:
Like in [43], the covariance matrix of a CCG is constrained to be diagonal during clustering. It can be seen from Equations (22) and (23) that the procedure of estimating the CCGs given the weights ĝ(n,m) is iterative. The calculation of the mean depends on the previously calculated covariance and vice versa. The exit criterion is the convergence of the cost function defined in Equation (21).
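As a sketch of the soft assignment in step 2c-1-1 of the pseudo code, a softmax-style weighting of the exponentiated negative WSKLD with stiffness γ is shown below; this stands in for the exact weight formula of [42] and should be read as an assumption.

import numpy as np

def soft_assignment_weights(wsklds_to_ccgs, gamma=0.3):
    """Soft assignment of one CG to the CCGs from its vector of WSKLD distances."""
    scores = -np.asarray(wsklds_to_ccgs) / gamma
    scores -= scores.max()                     # numerical stability
    w = np.exp(scores)
    return w / w.sum()                         # weights g(n, m) over the CCGs, summing to 1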
In step 1 of the pseudo code from Section 5.1, a single CCG is initialized by averaging the means and covariance matrices of the entire set of CGs. The parameters of the single CCG are then updated using Equations (22) and (23) for several iterations. Splitting a CCG and re-estimation of all CCGs are carried out iteratively till N CCGs are obtained, as explained below.
5.4. Splitting a CCG
In each iteration, the CCG with the maximum within-cluster mean WSKLD is found
Principal component analysis is applied to the covariance matrix to find the first principal eigenvector e_1 and eigenvalue λ_1. If the number of CCGs in the current iteration is n, this CCG is split into two Gaussians with the means and covariance matrices:
where ξ is the disturbing rate. The WSKLDs of all M CGs to the newly created CCGs are then calculated. Each weight is also split into two according to these WSKLDs:
The parameters of the two resulting CCGs (the split one and CCG n + 1) are then re-estimated using Equations (22) and (23) with a fixed number (e.g., 3) of iterations. The means and covariance matrices of CCGs 1 to n + 1 are subsequently updated until convergence of the global cost function (21).
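A sketch of the splitting step is given below; perturbing the mean by ±ξ√λ_1 e_1 while keeping the covariance unchanged is our assumption about the exact perturbation.

import numpy as np

def split_ccg(mu, cov, xi=0.2):
    """Split a CCG into two Gaussians along its first principal component."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    lam1, e1 = eigvals[-1], eigvecs[:, -1]     # largest eigenvalue and its eigenvector
    offset = xi * np.sqrt(lam1) * e1           # xi is the disturbing rate
    return (mu + offset, cov.copy()), (mu - offset, cov.copy())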
Finally, when n reaches N, an N by M CCG-CG table of exponentiated negative WSKLD is calculated. This table plays the same role as the association table in Section 4.3. The same schemes as in Section 4.3 are used to truncate the table. Also, the same schemes as in Section 4.4 are used to select likely CGs, thus avoiding solving CLSQ problems whose solutions are unlikely to survive pruning criterion (19).
6. Experiments
To show the effectiveness of the proposed approaches, a large vocabulary speech recognizer was modified accordingly and experiments on the noisy dictation task AURORA-4 were run. In Section 6.1, we describe the training and test datasets and further details on the required acoustic models. Section 6.2 explains various components of the MDT recognizer. Section 6.3 outlines two baseline systems: first, a non-MDT system serving as the speed baseline for our MC MDT experiments and second a backend MDT system serving as the accuracy baseline for our MC MDT experiments. In Section 6.4, some MC MDT variants are first analyzed and subsequently compared with the cluster-based reconstruction described in Section 6.5. Section 6.6 evaluates MC MDT systems where the CGs are expressed in either the cepstral domain or the log-spectral domain. All testing results are summarized in Tables 2, 3, 4, and 5.
6.1. Data and models
6.1.1. AURORA-4
Speech recognition experiments were conducted on the AURORA-4 database [44], a large vocabulary task that is derived from the WSJ0 Wall Street Journal 5k-word dictation corpus. A bigram language model for a 5k-word closed vocabulary is provided by Lincoln Laboratory.
For training, only clean-condition data sampled at 16 kHz were used, consisting of 7,138 utterances from 83 speakers, which amounts to 14 h of speech data. All recordings are made with the close talking microphone and no noise is added.
The test database is composed of 330 read sentences (5,353 words) from 8 different speakers. Fourteen different versions of this set are created. The first dataset is clean and is recorded with the same close-talk microphone as used while recording the training data. It is artificially corrupted by adding six types of noise to establish datasets 2-7: car (set 2), babble (set 3), restaurant (set 4), street (set 5), airport (set 6), and train (set 7). Set 8 is recorded with far-talk microphones. Test sets 9-14 are created by artificially adding the same six types of noise as used for generating sets 2-7. Each test set contains 330 utterances and has an SNR that ranges from 5 to 15 dB.
6.1.2. Training backend acoustic model
The design of the front-end as well as the backend acoustic model is based on prior studies [23, 33, 37] which obtained competitive accuracies on clean speech and good robustness in an MDT configuration.
The signal power spectrum is calculated with a 32-ms Hamming window and a 10-ms window shift and is integrated using a 22-channel MEL-scaled triangular filter bank whose lowest band is centered at 140 Hz to increase the robustness to low-frequency noise. Since all frequencies above 7 kHz of the AURORA-4 data are filtered out, the last band is centered at 5800 Hz. The 22 log-spectral coefficients are mean-normalized and the first- and second-order time derivatives are appended, resulting in 66-dimensional spectral features.
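A schematic numpy sketch of this log-MEL front-end is given below; the triangular filter-bank construction is a standard approximation, the exact band placement (lowest band centered at 140 Hz, last band at 5800 Hz) is only approximated, and the time derivatives are omitted.

import numpy as np

def mel_filterbank(n_mels, n_fft, fs, f_min, f_max):
    """Triangular MEL-scaled filter bank (standard construction, used as an approximation)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(f_min), mel(f_max), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def log_mel_frontend(signal, fs=16000, n_mels=22, win_ms=32, shift_ms=10):
    """Framing, Hamming window, power spectrum, MEL integration, log compression and
    per-utterance spectral mean normalization (time derivatives omitted here)."""
    win, shift = int(fs * win_ms / 1000), int(fs * shift_ms / 1000)
    frames = np.stack([signal[i:i + win] * np.hamming(win)
                       for i in range(0, len(signal) - win + 1, shift)])
    power = np.abs(np.fft.rfft(frames, win)) ** 2
    fb = mel_filterbank(n_mels, win, fs, f_min=140.0, f_max=5800.0)
    logmel = np.log(power @ fb.T + 1e-10)
    return logmel - logmel.mean(axis=0)        # spectral mean normalization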
To train the backend acoustic model, the normalized spectral features are transformed into 39-dimensional MIDA features, which are improved LDA features leading to decorrelation and diagonalization of the mixture components [34]. The MIDA features have only half the dimension of the PROSPECT features (see Section 6.1.3), hence leading to a significant effort reduction in the likelihood calculation of the BGs, while showing better accuracy than MFCC.
The acoustic model uses cross-word and context-dependent triphones. The HMM for each triphone contains three states. A PDT defines 4091 tied states, or senones, which in their turn share 21,037 BGs. The output probability of each state is a mixture of 190 BGs on average and each Gaussian is shared among 45 different tied states.
6.1.3. Training CGs and CCGs
The compact acoustic model containing the CGs is trained with the same training data by following the training procedure outlined in Section 4.2. Here, the state-level segmentation of the training data is obtained by forced alignment using the backend MIDA model of Section 6.1.2. The normalized static log-spectral features and the dynamic features are transformed into PROSPECT features with D_c = 4, i.e., for each stream in the features, four cepstral coefficients are kept and D = 22 projection coefficients are appended. Consequently, the PROSPECT features including deltas have 78 dimensions. An earlier experiment on AURORA-4 showed that MC MDT with 500 to 900 CGs yields a reasonable tradeoff between recognition time and accuracy. Therefore, we use 500 CGs in the following experiments. The association table Φ is built on the same training data. The maximum number of CGs associated with a particular BG, L_max, is 5. An earlier experiment on the Flemish Speecon and SpeechDat Car data [36] showed that increasing L_max beyond 5 only leads to more computation without increasing the recognition accuracy. The average number of CGs associated with a particular BG is 3.6.
Fifty CCGs are obtained by clustering the 500 CGs using the procedure from Section 5. The maximum number of CCGs associated with a CG is 5 and the average number is 3.6. In previous experiments on Gaussian clustering, we have found a γ of 0.3 in Equation (21) to be a good choice, which we have maintained in these experiments.
6.1.4. Training spectral CGs for the cluster-based reconstruction with MAP
For comparison with the MAP cluster-based reconstruction, a mixture of 500 Gaussians with full covariance is also trained on the spectral data using EM on the same segmentation. As proposed in [24], the initial iterations use a diagonal covariance model that serves to make the posterior probability calculation (6) feasible. Only in the last EM iteration are the full covariance matrices estimated for application in Equation (5).
6.1.5. Training backend PROSPECT model
In order to show the speed improvement of MC MDT over a full MDT system [45], i.e., where the CLSQ problem (11) is solved per Gaussian with GD, an acoustic model with Gaussians estimated on PROSPECT features is required. The model has 21,037 PROSPECT Gaussians which are obtained by Single Pass Retraining (SPR) [46] of the acoustic model with MIDA features. The inputs of the SPR are the MIDA features, PROSPECT features, and the MIDA model described above. The MIDA model is used to compute the posterior probabilities of every Gaussian over the training data, which are subsequently combined with the PROSPECT feature observations to estimate their GMM weights, means and diagonal covariance matrices.
6.2. Recognizer
6.2.1. Handling convolutional noise
Besides additive noise, the MDT recognizer also handles convolutional noise by the channel compensation technique described in [23], which maximizes the likelihood of the recognized speech on the backend model. To make the implementation tractable, only the contribution of the single Gaussian that gives the largest contribution to the state likelihood (the dominating BG) is taken into account. However, unlike in [23], the current approach computes only approximate solutions for the BGs, which are expressed in a different feature domain. Therefore, each dominating BG is replaced by the PROSPECT CG with the largest Φ-value, so that the maximum likelihood channel estimation of [23] can be readily applied. The channel estimate is subtracted from the observed log-spectra and hence the CCG, CG, and BG models are all compensated for convolutional distortions.
6.2.2. Mask estimation
The missing data detector used is the method described in [23] which integrates harmonicity and SNR with a speech model based on vector quantization. At each frame, the best match between a harmonic decomposition of noisy speech and a codebook describing the harmonic decomposition of clean speech is found. VQ mask estimation requires a speech and silence codebook which are trained with a randomly selected subset of the clean training data in Section 6.1.1. The codebook contains 520 codewords which are updated using the channel estimation of Section 6.2.1 during recognition.
6.2.3. Test configuration
The decoding consists of a time-synchronous beam search algorithm as described in [47]. The recognizer was launched on a PC installed with Dual Core AMD Opteron 280 2.4 GHz processors with a cache size of 1 MB. Only one processor core is activated. The MDT imputation is only applied to the static stream, while the first- and second-order time derivatives are uncompensated. The Word Error Rates (WER) are calculated for all the experiments. Meanwhile, the CPU time is measured. Tables 2 and 3 list the WERs of the experiments over the 14 types of environmental noises. Tables 4 and 5 contain the timing measurements for the BG and CG evaluation, for beam search as well as the end-to-end timing information (column "TOTAL") of the recognizer under noisy and clean condition, respectively. The timing measurements are achieved by starting and stopping (precise) timers frame-synchronously at the entry and exit of each of the different processing steps: front-end processing, CG imputation, candidate evaluation for all BGs, and beam search. The total time is then obtained by dividing the accumulated timings by the number of processed frames over several utterances.
6.3. Baselines
6.3.1. Recognition without MDT
The system that does not make use of MDT is provided as a baseline in terms of recognition time such that we can measure the computational cost of the robustness obtained from the MDT systems. The acoustic model is the backend HMM containing 21,037 MIDA Gaussians described in Section 6.1.2. An axis indexing-based Gaussian selection method, Fast Removal of Gaussians (FRoG) [37], is used. The testing results are shown in the first rows of Tables 2, 3, 4, and 5. The default FRoG Gaussian pruning setting works well on clean speech and results in only about 5% of Gaussians being evaluated. However, we noticed a performance degradation due to Gaussian pruning on noisy speech. Therefore, the FRoG Gaussian pruning settings were adjusted on the noisy test data such that the accuracy was not degraded more than 2% compared to no pruning, requiring 27% of Gaussians to be kept. Notice that this procedure yields an optimistic speed estimate for this baseline, as tuning on an independent development set would require some safety margin as well. Also notice that this non-MDT system produces higher WER than the MDT systems under the clean condition (test set 1), as shown in Table 2. This is mainly due to the non-MDT system using spectral mean normalization to reduce the channel effects, while the MDT systems use the more sophisticated MLE-based channel update as described in Section 6.2.1.
6.3.2. Backend PROSPECT imputation
This setup is the most refined previously published version of our MDT system [23] and serves as a baseline in terms of recognition accuracy such that we can measure the accuracy cost of the proposed speed improvements. Two iterations of GD are found to be enough for convergence in terms of recognition accuracy and are hence applied for all 21,037 PROSPECT Gaussians. This system runs at 15 times real time under noisy conditions and 6.6 times real time under clean conditions, as shown in row 2 of Tables 4 and 5, respectively. However, the accuracy benefits of MDT can be clearly seen in contrast to the non-MDT system in Tables 2 and 3.
6.4. MC MDT
6.4.1. Gaussian-based MC MDT with GD
The Gaussian-based MC MDT system is an instance of the concepts outlined in Section 4. Two iterations of GD are applied for all the 500 CGs. The posterior probability-based BG selection described in Section 4.4 is applied. α was tuned with an isolated word-recognition experiment of MC-MDT on the Speecon and the SpeechDat Car databases used in [36], which we hence regard as development data for this article. The tuning experiment shows that a good trade-off between accuracy and BG evaluation effort is obtained at α = 0.4, but that it does not critically affect the recognition accuracy.
Compared to the backend PROSPECT MDT system, i.e., row 2 versus row 3 in Tables 2, 3, 4, and 5, the Gaussian-based MC MDT yields a comparable WER, while it uses less than 20% of the CPU time over the entire test set.
It is remarkable that the Gaussian-based MC MDT works as fast as the non-MDT recognizer with the same backend acoustic model under noisy conditions (row 1 versus row 3 of Table 4). The MC MDT spends time in evaluating CGs, but its decoding time is reduced by 4 ms per frame. Faster decoding on corrupted data is actually a common benefit from MDT imputation as shown in Table 4. In non-MDT systems, the mismatch between data and model results in a lower likelihood for the ground truth hypothesis and also causes many hypotheses to yield a similar score, so pruning is not effective and the decoder slows down. In the MDT system, noise addition also slows the recognizer down, but through a different mechanism. Thanks to the imputation process in MDT systems, the likelihood of the ground truth hypothesis will not deteriorate. The likelihood of alternative hypotheses will also increase, but because they do not fit the data well, their imputation benefit is not that strong. Apparently, a significant likelihood gap is maintained among the hypotheses, causing pruning in MDT systems to be more effective than in non-MDT systems. The effort spent in evaluating CGs is recovered in the search.
The increase in the likelihood of alternative hypotheses in the MDT system under noisy conditions also causes the MC MDT system with GD to run about 2.5 times slower than under clean conditions (row 3 of Table 4 versus row 3 of Table 5). As the data get noisier, the imputation becomes less constrained, since the number of unreliable components increases and the bounds outlined in Section 2.1 become less strict. Hence, the dynamic range of the BG likelihoods will decrease, such that the hypothesis likelihoods will show smaller differences, causing pruning to be less effective. Additionally, the system is slowed down with increasing noise levels because more spectro-temporal regions are labeled as unreliable and the complexity of imputation for CGs and CCGs increases.
Some common advantages of MDT are revealed by the results shown in Tables 2 and 3. All the experiments with MDT produce lower WERs than the non-MDT system over the corresponding noise types, as well as in the clean condition. Especially for the non-stationary noise types, namely sets 3-7 and 10-14, the benefit from MDT is more significant.
Though MDT systems show an advantage in both the close-talk and the far-talk test sets, the performance is greatly degraded in the latter condition, because the channel compensation technique of Section 6.2.1 is restricted to the estimation of a log-spectral offset vector, which can only compensate for convolutional effects with a short impulse response. However, the fact that the backend PROSPECT MDT and the MC MDT system perform equally on this test set confirms that the modification to estimate the channel on CGs rather than on BGs (see Section 6.2.1) works.
6.4.2. Gaussian-based MC MDT with MAP
This experiment is conducted to compare GD with MAP as a solver for the imputation problems. The full precision matrix of the CGs in Equation (17), as required for MAP, is pre-calculated. The same BG selection as in the previous section is applied. Six iterations are found to be enough for convergence in terms of WER, and are therefore applied for the CGs. Comparing rows 3 and 4 of Tables 2 and 3 reveals that the MAP solver is as robust as the MC MDT using GD. But as shown in Tables 4 and 5, it is slower, especially in evaluating CGs, due to the larger number of iterations and the copying of full precision matrices from the main memory to the cache memory of the CPU.
6.4.3. Gaussian-based MC MDT with CG selection
The CG selection introduced in Section 5 is added to the Gaussian-based MC MDT system in Section 6.4.1. In addition to the 50 PROSPECT imputation operations for the CCGs, about 106 PROSPECT imputation operations for the CGs are observed per frame of 10 ms. Therefore, the number of CLSQ problems solved is less than one-third of that of MC MDT without CG selection. Two iterations of GD are applied on the imputation of CCGs. The implementation of Gaussian-based MC MDT plus CG selection does not harm the recognition accuracy but consumes less CPU time in comparison with the Gaussian-based MC MDT system (compare row 3 with 5 in Tables 2, 3, 4, and 5).
6.4.4. State-based MC MDT
The imputed values of the Gaussian-based MC MDT described in Section 6.4.1 can also be used to perform a state-based MC MDT, where the imputation for state s is the linear combination of the imputed values from the BGs included in the GMM of that state.
where GBG(s) represents all the Gaussians belonging to the GMM of state s, and
The BG selection from Section 4.4 is also activated for this experiment, so only the selected BGs are actually involved in the imputation formulae. Each BG is shared among about 45 states and is therefore evaluated at multiple imputed spectra from its owner states. Hence, for this state-based MC MDT experiment, every selected BG is evaluated at about 45 state-based imputed spectra as in Equation (24). The MC-based likelihood estimation of each BG is still performed at 3.6 (the average number of CGs assigned to each BG) candidate spectra. These likelihood evaluations lead to a computationally expensive implementation. State-based MC MDT yields WERs fairly close to those obtained in the other MC MDT experiments, but with a significantly higher computational cost, as shown in row 8 of Tables 2, 3, 4, and 5.
6.5. Cluster-based reconstruction
6.5.1. Cluster-based reconstruction with MAP
This experiment is an instance of the concept formulated by Equations (4), (5), and (6). Each of the 500 full-covariance spectral CGs is used to impute clean speech with six iterations of the MAP imputation. The corresponding diagonal-covariance CGs are used to calculate the definite integrals in Equation (6). To compensate for unmodeled correlations, the integrals are exponentiated with 0.3, a value that is tuned on the test set for best accuracy. The global clean spectrum is then reconstructed using Equation (4). Since the likelihoods of the 500 CGs are already calculated, they are used to select BGs as explained in Section 4.4. Despite the test set optimization, the cluster-based reconstruction with MAP imputation is still less robust than the MC MDT systems when comparing row 7 with rows 3, 4, 5, and 8 in Tables 2 and 3. The use of a more accurate speech model provided by the BGs seems to pay off.
6.5.2. Cluster-based reconstruction with GD
This system approximates the previous one by combining the imputed spectra obtained from the PROSPECT CGs as in Equation (4), but takes a different approach to computing the posterior probabilities of the CGs. These posteriors are calculated by renormalizing the likelihoods of the imputed clean spectra. The likelihoods also serve for BG selection as explained in Section 4.4. To compensate for unmodeled correlations, the likelihoods are raised to the power 0.3, a value that is also tuned on the test set for best accuracy. We did not observe a significant accuracy gain beyond the 500 PROSPECT CGs used to report the results in the tables. The approximations outlined above do not harm the robustness, as a comparison of rows 6 and 7 of Tables 2 and 3 shows, but this implementation is faster because GD imputation is more efficient than MAP and the computation of the posterior probabilities is simplified. Again, despite the test-set optimization applied to this cluster-based reconstruction method, MC MDT outperforms it as well.
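The only change with respect to the sketch in the previous subsection is how the cluster weights are obtained; a minimal, illustrative version:

```python
def gd_cluster_weights(likelihoods, alpha=0.3):
    """Cluster weights for the GD variant of Section 6.5.2: the likelihoods of the
    imputed clean spectra are exponentiated (0.3) and renormalized."""
    w = [lik ** alpha for lik in likelihoods]
    total = sum(w)
    return [wi / total for wi in w]
```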
6.6. Imputation using log-spectral and cepstral CGs
Using a PROSPECT feature representation for the CGs in the MC MDT experiments is an implementation choice motivated by speed considerations (see Section 4.1). The CGs can also be trained on other features, e.g., cepstra or log-spectra. To compare MC MDT systems using CGs in these domains, the same number (500) of cepstral CGs and log-spectral CGs are trained with the same data-driven approach as described in Section 6.1.3.
6.6.1. Cepstral imputation
The dimension of the cepstral CGs is 39: 13 static cepstral coefficients plus their 13 first-order and 13 second-order time derivatives. The average number of CGs per BG is 3.6, the same as for the PROSPECT CGs. During recognition, the imputation is performed by Multiplicative Updates (MU) [32] with five iterations, an algorithm capable of handling rank-deficient precision matrices such as H k in Equation (11). The test results are shown in row 10 of Tables 2, 3, 4, and 5. The WER and the percentage of selected BGs obtained with cepstral CGs are comparable to those obtained with PROSPECT CGs. Observe, however, that cepstral CGs require the more time-consuming MU imputation, which slows down the imputation of the CGs by 30-60%.
6.6.2. Spectral imputation
The spectral imputation method described by Equation (2) is tempting for its simplicity, so it is worth investigating whether it yields a list of candidate spectra of sufficient quality. The dimension of the log-spectral CGs is 66: 22 static log-spectral coefficients and their first- and second-order time derivatives. The average number of CGs per BG increases to 5. The test results are shown in row 9 of Tables 2, 3, 4, and 5. While it saves time in the CG imputation, the method loses both accuracy and efficiency compared to PROSPECT CG imputation. BG selection is also less effective, and more BGs need to be activated to guarantee a reasonable accuracy. Finally, spectral CGs cannot provide channel estimates (as in Section 6.2.1) the way PROSPECT CGs can. This claim is supported by an experiment (not reported in this article) in which the log-spectral CGs provide the candidates of clean speech and trigger the BG selection, while PROSPECT CGs are used for channel estimation; this improved the recognition accuracy by 3.58% relative on test sets 8-14.
7. Conclusions and future work
We have proposed several effective optimizations to a large vocabulary speech recognizer that is based on MDT. The outcome is a recognizer that runs equally fast as the uncompensated system, has identical performance on clean data, has the same robustness as our latest published missing data system [23] and shows competitive performance on the AURORA-4 task.
We first formulated the missing data paradigm such that it can be applied to an acoustic model that requires no compromises on accuracy and uses standard feature representations, i.e., a formulation that covers cepstral as well as LDA features as they are commonly used in today's speech recognizers. This formulation exploits the most accurate speech model the recognizer has at its disposal: the backend HMM. The computational load was significantly reduced by the proposed MC approach, which solves the CLSQ problems with sufficient accuracy for practical purposes. Here, candidates are obtained as exact solutions on a smaller set of CGs, followed by selection of the most likely candidate. The posterior probabilities of the CGs were exploited to construct a Gaussian selection algorithm that saves further computation by excluding Gaussians that are unlikely to make a significant contribution to the state likelihoods. Finally, the CGs were structured hierarchically with a model-based Gaussian clustering algorithm to achieve further speed gains.
The proposed method was compared to cluster-based imputation, an MDT that enhances the feature vector based on a GMM speech model, a technique that is also suitable for large vocabulary tasks. Our experiments reveal that it is beneficial to accuracy to exploit the more accurate backend model instead.
The optimizations show that no modeling compromises are required to apply MDT to large vocabulary recognition and that, on noisy data, any additional computational cost in likelihood calculation is easily recovered by a reduction in the search effort. These benefits make the missing data formalism very suitable to tackle various robustness issues beyond the additive noise effects considered in this article. With a suitable missing data detector, the solutions described in this article open pathways to also efficiently cover reverberated speech and exploit multiple microphones to implement directional hearing.
Appendix: Computational complexity of maximized likelihood per Gaussian
This section details how the numbers of multiplications per Gaussian for each approach in Table 1 are obtained. The complexity is quantified as the number of multiplications or divisions.
MAP
The iterative method of the MAP algorithm in [24] includes the following steps:
a. Initialize x t, u using the component-wise minimization x t, u (i) = min[μ t, u (i), y t, u (i)], as in Equation (2).

b. For each component i of the unreliable sub-vector x t, u , where D t, u is the number of unreliable components in the spectrum at time t, calculate the conditional mean and constrain the result not to exceed the noisy observation, x t, u (i) ← min[x t, u (i), y t, u (i)].

c. Repeat step b for several iterations and calculate the likelihood p(x t, u, k , y t, r |k).
Raj et al. provide a standard solution for the conditional mean in step b:

$$\hat{x}_{t,u,k} = \mu_{t,u,k} + \Sigma_{t,u,r,k}\,\Sigma_{t,r,r,k}^{-1}\,(y_{t,r} - \mu_{t,r,k})$$

where μ t, u, k is the mean of the unreliable sub-vector, Σ k is the covariance of Gaussian k, and Σ t, u, r, k contains only the rows of Σ t, k with indices corresponding to the unreliable components and the columns with the reliable indices (Σ t, r, r, k is the analogous sub-matrix over the reliable rows and columns).

The conditional mean can also be formulated as

$$\hat{x}_{t,u,k} = \mu_{t,u,k} - H_{t,u,u,k}^{-1}\,H_{t,u,r,k}\,(y_{t,r} - \mu_{t,r,k})$$

where H k = Σ k -1 , H t, u, u, k is a D t, u by D t, u sub-matrix of H k containing only the rows and columns corresponding to the unreliable components in the feature vector, and H t, u, r, k is a D t, u by D - D t, u sub-matrix of H k containing only the rows corresponding to the indices of the unreliable components and the columns with the reliable indices.
In step b, only one dimension is free and updated per operation:

$$x_{t,u}(i) \leftarrow x_{t,u}(i) - \frac{\sum_{j} H_{t,u,u,k}(i,j)\,\bigl(x_t(j) - \mu_{t,u,k}(j)\bigr) + \bigl(H_{t,u,r,k}\,(y_{t,r} - \mu_{t,r,k})\bigr)(i)}{H_{t,u,u,k}(i,i)}$$

The second part of the summation in the numerator is constant and can be calculated at the first iteration. Hence, the MAP algorithm involves (D t, u + 1) × D t, u multiplications per iteration, where D t, u is the dimension of x t, u . In the above equation, x t (j) is the j-th component of the most recently updated estimate of the unreliable components.
Besides updating the clean speech in each iteration, the likelihood of the CG is also required for BG selection, which involves (D + 1)D multiplications. Each iteration of MAP is very efficient, but as shown by the experimental result, MAP needs six iterations to reach convergence in terms of WER. Furthermore, as described in Section 4.1, the full precision matrix has to be handled in the CPU cache memory. Hence, MAP is slower than GD + PROSPECT.
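For concreteness, the bounded, component-wise MAP imputation described above can be sketched as follows; the variable names are illustrative and the published implementation of [24] differs in its data layout and optimizations.

```python
import numpy as np

def bounded_map_imputation(y, reliable, mu, H, n_iter=6):
    """Sketch of bounded MAP imputation for a single Gaussian, following [24].

    y        : observed (noisy) log-spectral frame, shape (D,)
    reliable : boolean mask, True where the missing data mask labels a band reliable
    mu, H    : mean and full precision matrix of the Gaussian
    """
    u = ~reliable
    xu = np.minimum(mu[u], y[u])                       # step a: component-wise init
    H_uu = H[np.ix_(u, u)]
    const = H[np.ix_(u, reliable)] @ (y[reliable] - mu[reliable])   # fixed numerator term
    for _ in range(n_iter):                            # step c: repeat step b
        for i in range(xu.size):                       # step b: one dimension at a time
            num = H_uu[i] @ (xu - mu[u]) + const[i]    # D_t,u multiplications
            xu[i] -= num / H_uu[i, i]                  # coordinate-wise conditional mean
            xu[i] = min(xu[i], y[u][i])                # bound by the noisy observation
    x = y.astype(float).copy()
    x[u] = xu
    return x
```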
Multiplicative updates
As outlined in [32, 33], the step calculation for the i-th unreliable component of Gaussian k involves the following quantities.

H t, u, u, k + (H t, u, u, k -) is obtained from H t, u, u, k by setting all negative (positive) entries to zero. Computing H t, u, u, k +(x t, u, k - μ t, u, k ) together with H t, u, u, k -(x t, u, k - μ t, u, k ) involves D t, u × D t, u multiplication operations.

b = H t, u, r, k (y t, r - μ t, r, k ) + H t, u, u, k (x t, u, k - μ t, u, k ), so its first term can be calculated prior to the iterations, while the second term is already available from the computation of H t, u, u, k +(x t, u, k - μ t, u, k ) and H t, u, u, k -(x t, u, k - μ t, u, k ). The square, the multiplication by 2 and the division in the update involve 4 × D t, u multiplications per iteration. In addition, D t, u computationally expensive square root operations are required in each iteration. The calculation of the likelihood involves (D + 1)D multiplications, the same as for MAP. MU shares with MAP the drawback that the full precision matrix has to be handled whenever it is called. As shown by the experiments, MU needs five iterations for convergence of the WER.
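The generic multiplicative update of [32] for a nonnegative quadratic program is sketched below; in the cepstral MDT the problem is first rewritten in the transformed variables of Equation (11), which is not reproduced here, so the formulation and names are purely illustrative.

```python
import numpy as np

def multiplicative_updates(A, b, v, n_iter=5):
    """Multiplicative updates for min_v 0.5*v'Av + b'v subject to v >= 0,
    following the generic recipe of [32].  Assumes v > 0 initially and that
    (A+ v) stays strictly positive; a production solver would guard this."""
    A_pos = np.maximum(A, 0.0)        # A+: negative entries set to zero
    A_neg = np.maximum(-A, 0.0)       # A-: positive entries set to zero (sign flipped)
    for _ in range(n_iter):
        p = A_pos @ v                 # together, p and q touch each matrix entry once
        q = A_neg @ v
        v = v * (-b + np.sqrt(b * b + 4.0 * p * q)) / (2.0 * p)
    return v
```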
Gradient descent + cepstral Gaussians
To calculate the likelihood and the gradient of a cepstral Gaussian with diagonal covariance given a frame with unreliable spectral components, the precision matrix of the Gaussian must be either pre-calculated or calculated on-line. Pre-calculation implies that the system has to handle the D × D precision matrix, which leads to frequent data exchange between the CPU and main memory. To calculate the gradient on-line, the precision matrix is represented in the log-spectral domain as

$$H_k = C^{\top}\,\Sigma_k^{-1}\,C \qquad (26)$$

where H k is the precision matrix of Gaussian k, transformed from the inverse diagonal covariance matrix Σ k -1 by applying the transpose of the DCT matrix C. Let C r be the D m by D - D t, u sub-matrix of C containing the columns with the reliable indices, and let C u contain the remaining columns. Equation (26) can then be factored in terms of C r and C u , so that the reliable and unreliable contributions to the gradient separate.

The reliable contribution is constant over the iterations and is part of the calculation of the likelihood; the unreliable contribution involves D t, u (2D m + 1) multiplications for the unreliable component of the gradient in every iteration. A small additional term must be added to the gradient to cope with the singularity of H k , as mentioned for Equation (11); this requires another D t, u multiplications. The calculation of the step size and step scale in Equation (13) involves D m D + D m + D + D t, u multiplications. The calculation of the likelihood requires 2D m D + D m + D multiplications.
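A sketch of how this on-line factorization avoids forming the D × D matrix H k explicitly; the partitioning into C r and C u follows the text above, and all names are illustrative.

```python
import numpy as np

def cepstral_gradient_online(x_u, y_r, C_u, C_r, inv_var, mu_cep):
    """Unreliable part of the gradient for a diagonal-covariance cepstral Gaussian,
    computed via the DCT factorization instead of a precomputed D x D precision matrix.

    x_u     : current estimate of the unreliable log-spectral components
    y_r     : observed reliable log-spectral components
    C_u, C_r: columns of the DCT matrix C at the unreliable / reliable indices
    inv_var : diagonal of the inverse cepstral covariance, shape (D_m,)
    mu_cep  : cepstral mean of the Gaussian, shape (D_m,)
    """
    resid_cep = C_u @ x_u + C_r @ y_r - mu_cep     # residual in the cepstral domain;
                                                   # C_r @ y_r is constant over iterations
    return C_u.T @ (inv_var * resid_cep)           # chain rule back to the log-spectra
```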
Gradient descent + PROSPECT features
The cost function for a Gaussian trained with PROSPECT features is shown in Equation (17). The calculation of the gradient g t, k can be decomposed into a projection part and a cepstral part, and the following quantities must be calculated:

- The spectral mean of Gaussian k, pre-calculated from the PROSPECT mean using the inverse PROSPECT transform R; D c D t, u multiplications are involved for the unreliable components in every iteration, and an additional D c (D - D t, u ) multiplications are required to calculate the reliable components before the first iteration.
- Based on the previous quantity, another D c D multiplications.
- Another D multiplications.
- Another D c D multiplications.
- A further D c D t, u multiplications per iteration.
- The cepstral part: based on the previous quantities, D c + D c D t, u multiplications per iteration.
Given the obtained gradient g t, k , the step size involves the following quantities, as in Equation (13):

- g t, k 'g t, k : D multiplications.
- C c g t, k : D c D multiplications.
- A further 2D c multiplications.
- (I - C c 'C c )g t, k : D c D multiplications, given C c g t, k .
- A further 2D multiplications.

Another D multiplications are required for scaling the gradient, as explained in the previous section. Besides the iterations, the initial likelihood involves 2(D + D c ) multiplications.
With the typical values of D, D m , D c and D t, u in Table 1, the number of multiplications involved in MAP is 2138 per Gaussian, while it is 2106 for MU; in addition, MU needs 80 square root operations per Gaussian. The number of multiplications involved in GD with cepstral Gaussians is 2177. This number is reduced to 1416 when using PROSPECT features.
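As a cross-check of the MAP, MU and cepstral-GD totals, the per-iteration counts derived above can be added up in a few lines. Table 1 is not reproduced here, so the dimensions used below (D = 22, D t, u = 16, D m = 13) and the iteration counts (6, 5, and 2) are assumptions for illustration; they happen to reproduce the quoted totals.

```python
def map_mults(D, D_tu, iters=6):
    # (D_tu + 1) * D_tu per iteration, plus (D + 1) * D for the likelihood
    return iters * (D_tu + 1) * D_tu + (D + 1) * D

def mu_mults(D, D_tu, iters=5):
    # D_tu^2 for the H+ and H- products plus 4 * D_tu per iteration, plus likelihood
    return iters * (D_tu * D_tu + 4 * D_tu) + (D + 1) * D

def gd_cepstral_mults(D, D_tu, D_m, iters=2):
    # gradient D_tu*(2*D_m + 1), singularity term D_tu, step size D_m*D + D_m + D + D_tu
    per_iteration = D_tu * (2 * D_m + 1) + D_tu + (D_m * D + D_m + D + D_tu)
    return iters * per_iteration + (2 * D_m * D + D_m + D)   # plus likelihood

if __name__ == "__main__":
    D, D_tu, D_m = 22, 16, 13               # assumed "typical" values, not taken from Table 1
    print(map_mults(D, D_tu))               # 2138
    print(mu_mults(D, D_tu))                # 2106
    print(gd_cepstral_mults(D, D_tu, D_m))  # 2177
```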
Abbreviations
- ASA: auditory scene analysis
- BE: backend
- BG: backend Gaussian
- CCG: cluster-of-cluster Gaussian
- CG: cluster Gaussian
- CGS: cluster Gaussian selection
- CLBR: cluster-based reconstruction
- CLSQ: constrained least squares
- DCT: discrete cosine transform
- EM: expectation maximization
- FRoG: fast removal of Gaussians
- GB: Gaussian-based
- GD: gradient descent
- KLD: Kullback-Leibler divergence
- LDA: linear discriminant analysis
- MAP: maximum a posteriori probability
- MC: multi-candidate
- MDT: missing data technique
- MFCC: Mel frequency cepstral coefficients
- MIDA: mutual information-based discriminant analysis
- ML: maximum likelihood
- MLE: maximum likelihood estimation
- MU: multiplicative updates
- PDF: probability density function
- PDT: phonetic decision tree
- PMC: parallel model combination
- PROSPECT: PRojected SPECTra
- SB: state-based
- SKLD: symmetric Kullback-Leibler divergence
- WSKLD: weighted symmetric Kullback-Leibler divergence
- WSS: wide-sense stationary
References
ETSI standard document: Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms. ETSI Tech Rep ES 202 050 v1.1.3; 2003.
Flores JAN, Young SJ: Continuous speech recognition in noise using spectral subtraction and HMM adaptation. In Proceedings of ICASSP. Adelaide, South Australia, Australia; 1994:409-412.
Droppo J, Deng L, Acero A: Evaluation of the SPLICE algorithm on the Aurora2 database. In Proceedings of Eurospeech. Aalborg, Denmark; 2001:217-220.
Moreno PJ, Raj B, Stern RM: A vector Taylor series approach for environment-independent speech recognition. In Proceedings of ICASSP. Atlanta, Georgia, USA; 1996:733-736.
Gales M: Model-based techniques for noise robust speech recognition. PhD thesis. University of Cambridge; 1995.
Leggetter CJ, Woodland PC: Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput Speech Lang 1995, 9(2):171-185. 10.1006/csla.1995.0010
Gauvain JL, Lee CH: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans Speech Audio Process 1994, 2(2):291-298. 10.1109/89.279278
Bregman AS: Auditory Scene Analysis. MIT Press, Cambridge; 1990.
Van Segbroeck M, Van hamme H: Vector quantization based mask estimation for missing data ASR. In Proceedings of Interspeech. Antwerp, Belgium; 2007:910-913.
Barker J, Cooke M, Green P: Robust ASR based on clean speech models: an evaluation of missing data. In Proceedings of Eurospeech. Aalborg, Denmark; 2001:213-216.
Cooke M: A glimpsing model of speech perception in noise. J Acoust Soc Am 2006, 119(3):1562-1573. 10.1121/1.2166600
Barker JP, Cooke MP, Ellis DPW: Decoding speech in the presence of other sources. Speech Commun 2005, 45(1):5-25. 10.1016/j.specom.2004.05.002
Srinivasan S, Wang D: Transforming binary uncertainties for robust speech recognition. IEEE Trans Audio Speech Lang Process 2007, 15(7):2130-2140.
Srinivasan S, Wang D: Robust speech recognition by integrating speech separation and hypothesis testing. Speech Commun 2010, 52(1):72-81. 10.1016/j.specom.2009.08.008
Cooke M, Green P, Josifovski L, Vizinho A: Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun 2001, 34(3):267-285. 10.1016/S0167-6393(00)00034-0
Lippmann RP, Carlson BA: Using missing feature theory to actively select features for robust speech recognition with interruptions, filtering and noise. In Proceedings of Eurospeech. Rhodes, Greece; 1997:37-40.
Seltzer ML, Raj B, Stern RM: A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition. Speech Commun 2004, 43(4):379-393. 10.1016/j.specom.2004.03.006
Cooke M, Morris A, Green P: Missing data techniques for robust speech recognition. In Proceedings of ICASSP. Munich, Germany; 1997:863-866.
Renevey P, Drygajlo A: Detection of reliable features for speech recognition in noisy conditions using a statistical criterion. In Proceedings of CRAC Workshop. Aalborg, Denmark; 2001:71-74.
Srinivasan S, Shao Y, Jin Z, Wang D: A computational auditory scene analysis system for robust speech recognition. In Proceedings of Interspeech. Pittsburgh, Pennsylvania, USA; 2006:73-76.
Cerisara C, Demange S, Haton J-P: On noise masking for automatic missing data speech recognition: a survey and discussion. Comput Speech Lang 2007, 21(3):443-457. 10.1016/j.csl.2006.08.001
Coy A, Barker J: Soft harmonic masks for recognising speech in the presence of a competing speaker. In Proceedings of Interspeech. Lisbon, Portugal; 2005:2641-2644.
Van Segbroeck M, Van hamme H: Advances in missing feature techniques for robust large vocabulary continuous speech recognition. IEEE Trans Audio Speech Lang Process 2011, 19(1):123-137.
Raj B, Seltzer ML, Stern RM: Reconstruction of missing features for robust speech recognition. Speech Commun 2004, 43(4):275-296. 10.1016/j.specom.2004.03.007
Van hamme H: Robust speech recognition using cepstral domain missing data techniques and noisy masks. In Proceedings of ICASSP. Volume 1. Montreal, Quebec, Canada; 2004:213-216.
Cerisara C: Towards missing data recognition with cepstral features. In Proceedings of Eurospeech. Geneva, Switzerland; 2003:3057-3060.
Häkkinen J, Haverinen H: On the use of missing feature theory with cepstral features. In Proceedings of CRAC Workshop. Aalborg, Denmark; 2001.
Faubel F, McDonough J, Klakow D: Bounded conditional mean imputation with Gaussian mixture models: a reconstruction approach to partly occluded features. In Proceedings of ICASSP. Taipei, Taiwan; 2009:3869-3872.
Haeb-Umbach R, Ney H: Linear discriminant analysis for improved large vocabulary continuous speech recognition. In Proceedings of ICASSP. San Francisco, California, USA; 1992:13-16.
Kumar N, Andreou AG: A generalization of linear discriminant analysis in maximum likelihood framework. In Tech Rep JHU-CLSP Technical Report No. 16. Johns Hopkins University; 1996.
Fletcher R: Practical Methods of Optimization. John Wiley & Sons, Chichester; 1980.
Saul LK, Sha F, Lee DD: Statistical signal processing with non-negativity constraints. In Proceedings of Eurospeech. Geneva, Switzerland; 2003:1001-1004.
Van hamme H: Prospect features and their application to missing data techniques for robust speech recognition. In Proceedings of Interspeech. Jeju, Korea; 2004:101-104.
Duchateau J, Demuynck K, Van Compernolle D, Wambacq P: Class definition in discriminant feature analysis. In Proceedings of Eurospeech. Aalborg, Denmark; 2001:1621-1624.
Young SJ, Odell JJ, Woodland PC: Tree-based state tying for high accuracy acoustic modelling. In Proceedings of Workshop on Human Language Technology. Plainsboro, New Jersey, USA; 1994:307-312.
Iskra D, Grosskopf B, Marasek K, van den Heuvel H, Diehl F, Kiessling A: SPEECON--Speech Databases for Consumer Devices: Database Specification and Validation. In Proceedings of LREC. Las Palmas, Spain; 2002:329-333.
Demuynck K: Extracting, modeling and combining information in speech recognition. PhD thesis. K.U.Leuven, ESAT; 2001.
Fritsch J, Rogina I: The bucket box intersection (BBI) algorithm for fast approximate evaluation of diagonal mixture Gaussians. In Proceedings of ICASSP. Volume 2. Atlanta, Georgia, USA; 1996:273-276.
Bocchieri E: Vector quantization for efficient computation of continuous density likelihoods. In Proceedings of ICASSP. Volume 2. Minneapolis, Minnesota, USA; 1993:692-695.
Watanabe T, Shinoda K, Takagi K, Iso KI: High speed speech recognition using tree-structured probability density function. In Proceedings of ICASSP. Volume 1. Detroit, Michigan, USA; 1995:556-559.
Wang Y, Van hamme H: Speed improvements in a missing data-based speech recognizer by Gaussian selection. In Proceedings of NAG-DAGA. Rotterdam, Netherlands; 2009:423-426.
Mackay DJ: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, UK; 2003.
Myrvoll TA, Soong FK: Optimal clustering of multivariate normal distributions using divergence and its application to HMM adaptation. In Proceedings of ICASSP. Hong Kong; 2003:552-555.
Parihar N, Picone J: Analysis of the aurora large vocabulary evaluations. In Proceedings of Eurospeech. Geneva, Switzerland; 2003:337-340.
Gemmeke JF, Van Segbroeck M, Wang Y, Cranen B, Van hamme H: Automatic speech recognition using missing data techniques: handling of real-world data. In Robust Speech Recognition of Uncertain or Missing Data. Edited by: Haeb-Umbach R, Kolossa D. Berlin-Heidelberg (Germany), Springer Verlag; 2011:157-186.
Young S, Kershaw D, Odell J, Ollason D, Valtchev V, Woodland P: The HTK Book. [http://htk.eng.cam.ac.uk/docs/docs.shtml]
SPRAAK: Speech Processing, Recognition and Automatic Annotation Kit. [http://www.spraak.org/]
Acknowledgements
This study was financed by the MIDAS project of the Nederlandse Taalunie under the STEVIN programme. Thanks to Kris Demuynck for various implementations and the anonymous reviewers for suggesting additional interesting comparisons and analyses.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite this article
Wang, Y., Van hamme, H. Multi-candidate missing data imputation for robust speech recognition. J AUDIO SPEECH MUSIC PROC. 2012, 17 (2012). https://doi.org/10.1186/1687-4722-2012-17