In traditional feature compensation, noise estimation and clean-speech estimation share the same GMM, which is trained on clean speech features in the training phase and converted into a noisy GMM in the testing condition. To guarantee the accuracy of clean-speech estimation, the speech model usually consists of a large number of Gaussian components, which leads to a high computational cost. To improve computational efficiency without reducing recognition accuracy, this paper employs two GMMs to estimate the noise parameters and to restore the clean speech features, respectively, as illustrated in Fig. 1. GMM1, composed of fewer Gaussian components, approximately represents the distribution of the speech features and rapidly estimates the noise parameters from noisy speech features with the EM algorithm. Moreover, the average log-likelihood difference of GMM1 is taken as a sign of noise variation, which is used to decide whether or not to perform model combination. If the auxiliary function of the EM algorithm converges, the noise information, composed of the noise-variation sign and the noise parameters, is sent to the model combination module, where the estimated single-Gaussian noise model is combined with GMM2 to obtain the noisy GMM for clean-speech estimation. GMM2 has sufficient Gaussian components to accurately characterize the distribution of speech in the cepstral domain. Since the noise distribution is independent of the speech distribution, the estimated noise parameters can be considered only weakly related to the number of Gaussian components in the speech model used for noise estimation. Therefore, in the model combination, the noise model estimated by GMM1 is close to that estimated by the traditional GMM, which consists of a large number of Gaussian components and is employed for both noise estimation and clean-speech restoration.
On the other hand, the number of Gaussian components in GMM2 is similar to that of the traditional GMM-based feature compensation, and the noisy GMM obtained by combining GMM2 with the estimated noise model can accurately restore the clean speech. Therefore, the proposed algorithm can achieve recognition accuracy similar to that of traditional GMM-based feature compensation.
This paper considers only additive noise and ignores channel distortion. According to the MFCC extraction procedure, the relationship between the noisy-speech cepstral feature y and the clean-speech cepstral feature x is:
$$ \begin{aligned} y = C\log\left(\exp\left(C^{-1}x\right)+\exp\left(C^{-1}n\right)\right) \end{aligned} $$
(1)
where n denotes the cepstral feature of the additive noise; C and C^{−1} denote the discrete cosine transform (DCT) matrix and its inverse, respectively. By taking the first-order VTS expansion of both sides of (1) at the point (μ_{x},μ_{n0}), we obtain the following linear approximation:
$$ {}\begin{aligned} y &= (I-U)(x-\mu_{x})+U(n-\mu_{n0}) + C\log\left(\exp\left(C^{-1}\mu_{x}\right)\right.\\ &\left.+\exp\left(C^{-1}\mu_{n0}\right)\right) \end{aligned} $$
(2)
where μ_{x} and μ_{n0} are the mean of x and the initial mean of n, respectively; I denotes the identity matrix; U is given by,
$$ \begin{aligned} U=C\,\mathrm{diag}\left(\frac{\exp\left(C^{-1}\mu_{n0}\right)}{\exp\left(C^{-1}\mu_{x}\right)+\exp\left(C^{-1}\mu_{n0}\right)}\right)C^{-1} \end{aligned} $$
(3)
where diag(·) denotes the diagonal matrix whose diagonal elements are equal to those of the vector in parentheses. Taking the expectation of both sides of (2), the mean vector of the noisy speech μ_{y} can be expressed as:
$$ \begin{aligned} {}\mu_{y}=U\mu_{n}-U\mu_{n0}+C\log\left(\exp\left(C^{-1}\mu_{x}\right)+\exp\left(C^{-1}\mu_{n0}\right)\right) \end{aligned} $$
(4)
where μ_{n} is the mean of n. Similarly, we can obtain the covariance of the noisy speech \(\Sigma_{y}\) by taking the variance of both sides of (2):
$$ \begin{aligned} \Sigma_{y}=(I-U)\Sigma_{x}(I-U)^{T}+U\Sigma_{n}U^{T} \end{aligned} $$
(5)
where Σ_{n} denotes the covariance of n.
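The mismatch model (1) and the VTS combination (3)–(5) can be sketched numerically. The following NumPy sketch assumes an orthonormal DCT-II matrix (so C^{−1} = C^{T}); all function and variable names are illustrative, not from the paper. Note that when μ_{n} = μ_{n0}, Eq. (4) reduces exactly to evaluating (1) at (μ_{x}, μ_{n0}), which the demo exploits.

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II matrix; its inverse is simply its transpose."""
    k = np.arange(d)[:, None]
    j = np.arange(d)[None, :]
    C = np.sqrt(2.0 / d) * np.cos(np.pi * (j + 0.5) * k / d)
    C[0, :] /= np.sqrt(2.0)
    return C

def noisy_cepstrum(x, n, C, C_inv):
    """Eq. (1): noisy cepstrum from clean-speech and noise cepstra."""
    return C @ np.log(np.exp(C_inv @ x) + np.exp(C_inv @ n))

def vts_combine(mu_x, Sigma_x, mu_n, Sigma_n, mu_n0, C, C_inv):
    """Eqs. (3)-(5): first-order VTS combination of a clean-speech
    Gaussian with a single-Gaussian noise model."""
    ex = np.exp(C_inv @ mu_x)
    en = np.exp(C_inv @ mu_n0)
    U = C @ np.diag(en / (ex + en)) @ C_inv                  # Eq. (3)
    mu_y = U @ (mu_n - mu_n0) + C @ np.log(ex + en)          # Eq. (4)
    V = np.eye(len(mu_x)) - U
    Sigma_y = V @ Sigma_x @ V.T + U @ Sigma_n @ U.T          # Eq. (5)
    return U, mu_y, Sigma_y

D = 2
C = dct_matrix(D)
C_inv = C.T
mu_x = np.zeros(D)
Sigma_x = np.diag([1.0, 0.5])
mu_n = mu_n0 = C @ np.full(D, -50.0)      # nearly silent noise
Sigma_n = 0.1 * np.eye(D)
U, mu_y, Sigma_y = vts_combine(mu_x, Sigma_x, mu_n, Sigma_n, mu_n0, C, C_inv)
```

In the low-noise limit U approaches zero, so the noisy Gaussian collapses back onto the clean one, which is a quick sanity check on the signs in (4) and (5).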
In the noise estimation, the probability density function (PDF) of the speech signal is represented by GMM1:
$$ {\begin{aligned} b(x_{t})=\sum_{m=1}^{\mathrm{M}}c_{m}\left\{(2\pi)^{-\frac{\mathrm{D}}{2}}\left|\Sigma_{x,m}\right|^{-\frac{1}{2}}\times \exp\left[-\frac{1}{2}(x_{t}-\mu_{x,m})^{T}\Sigma_{x,m}^{-1}(x_{t}-\mu_{x,m})\right]\right\} \end{aligned}} $$
(6)
where x_{t} denotes the t-th static cepstral feature vector; c_{m}, μ_{x,m}, and Σ_{x,m} are the mixture coefficient, mean vector, and covariance matrix of the mth Gaussian component, respectively; and M and D denote the number of Gaussian components of GMM1 and the dimension of the static feature x_{t}, respectively. GMM1 is trained on clean speech in the training phase and used to estimate the noise parameters from noisy testing speech. The noise parameters μ_{n} and Σ_{n} are estimated with the EM algorithm under the maximum-likelihood criterion, and the auxiliary function is defined as:
$$ \begin{aligned} Q(\bar{\lambda}|\lambda)&=-\frac{1}{2}\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)\\&\times\left[(y_{t}-\mu_{y,m})^{T}\Sigma_{y,m}^{-1}(y_{t}-\mu_{y,m})-\log\left|\Sigma_{y,m}^{-1}\right|\right] \end{aligned} $$
(7)
where γ_{m}(t)=P(k_{t}=m|y_{t},λ) is the posterior probability of mixture component m at time t given the observation y_{t} and the prior parameter set λ, and \(\bar {\lambda }\) denotes the new GMM parameter set.
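The posteriors γ_{m}(t) needed in (7) follow from the component log-densities of the noisy GMM1. A minimal sketch for diagonal covariances, with a log-sum-exp shift for numerical stability; array shapes and names are illustrative assumptions:

```python
import numpy as np

def posteriors(Y, c, mu_y, var_y):
    """gamma_m(t) = P(k_t = m | y_t, lambda) for a diagonal-covariance GMM.
    Y: (T, D) frames; c: (M,) weights; mu_y, var_y: (M, D). Returns (T, M)."""
    log_w = (np.log(c)[None, :]
             - 0.5 * np.sum(np.log(2.0 * np.pi * var_y), axis=1)[None, :]
             - 0.5 * np.sum((Y[:, None, :] - mu_y[None, :, :]) ** 2
                            / var_y[None, :, :], axis=2))
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)     # normalize over components

# two well-separated components: each frame should be assigned almost entirely
# to its nearest component
c = np.array([0.5, 0.5])
mu_y = np.array([[0.0, 0.0], [10.0, 10.0]])
var_y = np.ones((2, 2))
Y = np.array([[0.1, -0.1], [9.8, 10.2]])
gamma = posteriors(Y, c, mu_y, var_y)
```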
For the mth Gaussian component of GMM1, (4) can be rewritten as:
$$ \begin{aligned} \mu_{y,m}&=U_{m}\mu_{n}-U_{m}\mu_{n0}+C\log\left(\exp\left(C^{-1}\mu_{x,m}\right)\right.\\&\left.+\exp\left(C^{-1}\mu_{n0}\right)\right) \end{aligned} $$
(8)
Substituting (8) into (7) and setting the derivative of \(Q(\bar {\lambda }|\lambda)\) with respect to μ_{n} to zero, the noise mean μ_{n} is estimated as:
$$ \begin{aligned} \mu_{n}=&\left[\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)U_{m}^{T}\Sigma_{y,m}^{-1}U_{m}\right]^{-1}\times\\ &\left[ \sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)U_{m}^{T}\Sigma_{y,m}^{-1}\right.\\ &\times\left(y_{t}-C\log\left(\exp\left(C^{-1}\mu_{x,m}\right)+\exp\left(C^{-1}\mu_{n0}\right)\right)\right.\\& \left.\left.+U_{m}\mu_{n0}\right){\vphantom{\sum_{m=1}^{\mathrm{M}}}}\right] \end{aligned} $$
(9)
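Eq. (9) is a weighted least-squares solve. A sketch with diagonal Σ_{y,m} passed as a vector of inverse variances; the helper name, the precomputed vector g_m = C log(exp(C^{−1}μ_{x,m})+exp(C^{−1}μ_{n0})), and the toy data are all illustrative assumptions:

```python
import numpy as np

def update_noise_mean(gamma, Y, U, inv_var_y, g, mu_n0):
    """Eq. (9) with diagonal Sigma_{y,m}.
    gamma: (T, M) posteriors; Y: (T, D) noisy frames; U: (M, D, D);
    inv_var_y: (M, D) diagonal of Sigma_{y,m}^{-1};
    g: (M, D) with g[m] = C log(exp(C^{-1} mu_{x,m}) + exp(C^{-1} mu_{n0}))."""
    D = Y.shape[1]
    A = np.zeros((D, D))
    b = np.zeros(D)
    for m in range(gamma.shape[1]):
        W = U[m].T * inv_var_y[m][None, :]           # U_m^T Sigma_{y,m}^{-1}
        A += gamma[:, m].sum() * (W @ U[m])          # bracketed matrix in (9)
        resid = Y - g[m][None, :] + (U[m] @ mu_n0)[None, :]
        b += W @ (gamma[:, m] @ resid)               # gamma-weighted sum over t
    return np.linalg.solve(A, b)

# sanity check in the noise-dominated limit U_m = I: with g = 0 and
# mu_n0 = 0, Eq. (9) reduces to the posterior-weighted mean of the frames
T, D = 5, 2
Y = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, 3.0], [1.5, 1.5], [0.5, 2.5]])
gamma = np.ones((T, 1))
mu_n = update_noise_mean(gamma, Y, np.eye(D)[None],
                         np.ones((1, D)), np.zeros((1, D)), np.zeros(D))
```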
In the cepstral space, the correlations among the components of the cepstral vector are weak, and thus Σ_{x,m}, Σ_{n}, and Σ_{y,m} can be simplified to diagonal matrices. Equation (5) can then be rewritten as:
$$ \begin{aligned} \sigma_{y,m}=(V_{m}\cdot* V_{m})\sigma_{x,m}+(U_{m}\cdot*U_{m})\sigma_{n} \end{aligned} $$
(10)
where V_{m}=I−U_{m}; σ_{y,m}, σ_{x,m}, and σ_{n} denote the variance vectors composed of the diagonal elements of Σ_{y,m}, Σ_{x,m}, and Σ_{n}, respectively; the symbol ·∗ denotes the element-wise product of two vectors of the same dimension. Substituting (10) into (7) and taking the derivative of \(Q(\bar {\lambda }|\lambda)\) with respect to σ_{n}, we obtain:
$$ \begin{aligned} &\frac{\partial Q(\bar{\lambda}|\lambda)}{\partial\sigma_{n}} =\sum_{m=1}^{\mathrm{M}}\frac{\partial\eta_{y,m}}{\partial\sigma_{n}}\frac{\partial Q(\bar\lambda|\lambda)}{\partial\eta_{y,m}}\\= &\sum_{m=1}^{\mathrm{M}}\frac {\partial\eta_{y,m}} {\partial\sigma_{n}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)\left[(y_{t}-\mu_{y,m}) \cdot *(y_{t}-\mu_{y,m})\right.\\ & \left.-(V_{m}\cdot *V_{m})\sigma_{x,m}-(U_{m}\cdot*U_{m})\sigma_{n} \right] \end{aligned} $$
(11)
where η_{y,m}=(σ_{y,m})^{−1}=[(V_{m}·∗V_{m})σ_{x,m}+(U_{m}·∗U_{m})σ_{n}]^{−1}, and each element of η_{y,m} is the reciprocal of the corresponding element of σ_{y,m}. The D×D matrix \(\frac {\partial \eta _{y,m}}{\partial \sigma _{n}}\) can be regarded as the weighting factor of the mth Gaussian component and is written as:
$$ {}\begin{aligned} G_{m}&=\frac{\partial\eta_{y,m}}{\partial\sigma_{n}}=-\left(U_{m}^{T}\cdot*U_{m}^{T}\right)\times \mathrm{diag}\left[ \left((V_{m}\cdot*V_{m})\sigma_{x,m}\right.\right.\\ &\left.\left.+(U_{m}\cdot*U_{m})\sigma_{n}\right)^{-2}\right] \end{aligned} $$
(12)
To obtain the closedform solution of the noise variance, the weighting factor G_{m} is approximated as a constant matrix:
$$ \begin{aligned} G_{m}&=-\left(U_{m}^{T}\cdot*U_{m}^{T}\right)\times \mathrm{diag}\left[\left((V_{m}\cdot*V_{m})\sigma_{x,m}\right.\right.\\&\left.\left.+(U_{m}\cdot*U_{m})\sigma_{n0}\right)^{-2}\right] \end{aligned} $$
(13)
where σ_{n0} is the initial value of σ_{n}, estimated in the previous EM iteration. Setting the derivative of \(Q(\bar {\lambda }|\lambda)\) with respect to σ_{n} to zero, the noise variance σ_{n} is computed as:
$$ \begin{aligned} \sigma_{n}=&\left[\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)G_{m}(U_{m}\cdot*U_{m})\right]^{-1}\times\\ &\left[\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)G_{m}\left((y_{t}-\mu_{y,m})\cdot*(y_{t}-\mu_{y,m})\right.\right.\\ &\left.\left. -(V_{m}\cdot*V_{m})\sigma_{x,m}\right){\vphantom{\sum_{m=1}^{\mathrm{M}}\sum_{t=1}}}\right] \end{aligned} $$
(14)
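The weighting factor (13) and the closed-form variance update (14) can be sketched together; in NumPy, element-wise squaring of a matrix implements ·∗ applied to itself, and right-multiplication by diag[·] becomes column scaling. The function names and toy inputs are illustrative assumptions:

```python
import numpy as np

def weighting_factor(U_m, V_m, sigma_x_m, sigma_n0):
    """Eq. (13): G_m with sigma_n frozen at its previous value sigma_n0."""
    sigma_y0 = (V_m ** 2) @ sigma_x_m + (U_m ** 2) @ sigma_n0
    return -(U_m.T ** 2) * (sigma_y0 ** -2)[None, :]   # columns scaled by sigma_y0^{-2}

def update_noise_var(gamma, Y, G, U, V, mu_y, sigma_x):
    """Eq. (14). gamma: (T, M); Y: (T, D); G, U, V: (M, D, D);
    mu_y, sigma_x: (M, D)."""
    D = Y.shape[1]
    A = np.zeros((D, D))
    b = np.zeros(D)
    for m in range(gamma.shape[1]):
        A += gamma[:, m].sum() * (G[m] @ (U[m] ** 2))
        r = (Y - mu_y[m]) ** 2 - ((V[m] ** 2) @ sigma_x[m])[None, :]
        b += G[m] @ (gamma[:, m] @ r)
    return np.linalg.solve(A, b)

D = 2
# with U_m = 0.5 I, V_m = 0.5 I and unit variances, Eq. (13) gives G_m = -I
G_demo = weighting_factor(0.5 * np.eye(D), 0.5 * np.eye(D),
                          np.ones(D), np.ones(D))

# noise-dominated limit (U_m = I, V_m = 0): Eq. (14) reduces to the
# posterior-weighted mean of the squared residuals
T = 4
Y = np.array([[1.0, -1.0], [-1.0, 1.0], [2.0, -2.0], [-2.0, 2.0]])
sigma_n = update_noise_var(np.ones((T, 1)), Y, -np.eye(D)[None],
                           np.eye(D)[None], np.zeros((1, D, D)),
                           np.zeros((1, D)), np.ones((1, D)))
```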
In addition to noise estimation, another function of GMM1 is to monitor time variation of the environmental noise. When the environment is stationary, the parameters of GMM2 used for clean-speech estimation are not updated, and the noisy GMM2 estimated in the previous time interval is employed directly for clean-speech estimation in the current interval, which saves energy and improves battery runtime on mobile devices. When the environmental noise varies, the clean GMM2 is combined with the estimated noise parameters μ_{n} and σ_{n} to produce the noisy GMM2 for computing the clean speech features. It is difficult to detect a noise variation by directly comparing the noise parameters of two time intervals. Therefore, this work employs the average log-likelihood difference over all frames of the current time interval as the sign of noise variation. Besides the adapted noisy GMM1 estimated from the current testing speech, the noise parameters of the previous time interval are saved in memory and used to produce another noisy GMM1 by model combination with the clean GMM1. If the average log-likelihood difference between the two noisy GMM1 models exceeds a threshold, a noise variation is considered to have occurred. The noise-variation sign and the noise parameters compose the noise information, which is sent to the model combination module to decide whether or not to update the parameters of the noisy GMM2.
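The decision rule above amounts to thresholding a mean of per-frame log-likelihood differences. A minimal sketch; the threshold value and the per-frame scores are hypothetical placeholders (the paper does not specify a threshold here):

```python
import numpy as np

def noise_variation_sign(loglik_adapted, loglik_previous, threshold=0.2):
    """Flag a noise change when the average per-frame log-likelihood of the
    GMM1 adapted to the current interval exceeds that of the GMM1 built from
    the previous interval's noise parameters by more than the threshold.
    The threshold value is an illustrative placeholder."""
    return float(np.mean(loglik_adapted - loglik_previous)) > threshold

ll_now = np.array([-51.0, -50.5, -49.8])    # hypothetical per-frame scores
ll_prev = np.array([-54.2, -53.9, -53.1])
changed = noise_variation_sign(ll_now, ll_prev)
```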
As shown in Fig. 1, the complete noise estimation process is summarized below.

1. Initialize the mean μ_{n0} with the all-zero vector and the variance σ_{n0} with the all-one vector.

2. Initialize the mean μ_{y,m} and variance σ_{y,m} of GMM1 with μ_{y,m} = μ_{x,m}, σ_{y,m} = σ_{x,m}.

3. Compute the posterior probabilities of the noisy speech using GMM1.

4. Compute the auxiliary function of the EM algorithm by Eq. (7).

5. Estimate the noise parameters μ_{n} and σ_{n} using Eqs. (9) and (14), respectively.

6. Update the mean μ_{y,m} and variance σ_{y,m} of GMM1 using Eqs. (8) and (10), respectively.

7. Update the initial mean μ_{n0} and initial variance σ_{n0} with μ_{n0} = μ_{n}, σ_{n0} = σ_{n}.

8. If the convergence criterion is not met, go to step 3.
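The whole loop can be sketched end to end for a single utterance, assuming diagonal covariances and an orthonormal DCT (so C^{−1} = C^{T}). A fixed iteration count stands in for the convergence test, a small floor keeps σ_{n} positive, and all names and the toy data are illustrative:

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II matrix; inverse = transpose."""
    k = np.arange(d)[:, None]
    j = np.arange(d)[None, :]
    C = np.sqrt(2.0 / d) * np.cos(np.pi * (j + 0.5) * k / d)
    C[0, :] /= np.sqrt(2.0)
    return C

def estimate_noise(Y, c, mu_x, sigma_x, n_iter=8):
    """Steps 1-8 with GMM1 (diagonal covariances).
    Y: (T, D) noisy frames; c: (M,) weights; mu_x, sigma_x: (M, D)."""
    T, D = Y.shape
    M = c.shape[0]
    C = dct_matrix(D)
    Ci = C.T
    mu_n, sigma_n = np.zeros(D), np.ones(D)            # step 1
    mu_y, sigma_y = mu_x.copy(), sigma_x.copy()        # step 2
    for _ in range(n_iter):                            # fixed count for step 8
        mu_n0, sigma_n0 = mu_n, sigma_n                # step 7 (previous estimate)
        en = np.exp(Ci @ mu_n0)
        U = np.stack([C @ np.diag(en / (np.exp(Ci @ mu_x[m]) + en)) @ Ci
                      for m in range(M)])              # Eq. (3) per component
        g = np.stack([C @ np.log(np.exp(Ci @ mu_x[m]) + en) for m in range(M)])
        V = np.eye(D)[None] - U
        # step 3: posteriors under the current noisy GMM1
        logw = (np.log(c)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * sigma_y), axis=1)[None, :]
                - 0.5 * np.sum((Y[:, None, :] - mu_y[None]) ** 2
                               / sigma_y[None], axis=2))
        logw -= logw.max(axis=1, keepdims=True)
        gamma = np.exp(logw)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # step 5: Eq. (9) for mu_n, Eqs. (13)-(14) for sigma_n
        A = np.zeros((D, D)); b = np.zeros(D)
        Av = np.zeros((D, D)); bv = np.zeros(D)
        for m in range(M):
            W = U[m].T / sigma_y[m][None, :]           # U_m^T Sigma_{y,m}^{-1}
            A += gamma[:, m].sum() * (W @ U[m])
            b += W @ (gamma[:, m] @ (Y - g[m] + (U[m] @ mu_n0)[None, :]))
            sy0 = (V[m] ** 2) @ sigma_x[m] + (U[m] ** 2) @ sigma_n0
            G = -(U[m].T ** 2) * (sy0 ** -2)[None, :]  # Eq. (13)
            Av += gamma[:, m].sum() * (G @ (U[m] ** 2))
            bv += G @ (gamma[:, m] @ ((Y - mu_y[m]) ** 2
                                      - ((V[m] ** 2) @ sigma_x[m])[None, :]))
        mu_n = np.linalg.solve(A, b)
        sigma_n = np.maximum(np.linalg.solve(Av, bv), 1e-8)  # positivity floor
        # step 6: update the noisy GMM1 via Eqs. (8) and (10)
        mu_y = np.stack([U[m] @ (mu_n - mu_n0) + g[m] for m in range(M)])
        sigma_y = np.stack([(V[m] ** 2) @ sigma_x[m] + (U[m] ** 2) @ sigma_n
                            for m in range(M)])
    return mu_n, sigma_n

# toy check: stationary noise with log-spectrum 5, clean speech at mu_x = 0;
# frames synthesized exactly by Eq. (1)
D, T = 2, 3
C = dct_matrix(D)
n_cep = C @ np.full(D, 5.0)                        # true noise cepstral mean
y = C @ np.log(1.0 + np.exp(np.full(D, 5.0)))      # Eq. (1) with x = 0
Y = np.tile(y, (T, 1))
mu_n_hat, sigma_n_hat = estimate_noise(Y, np.array([1.0]),
                                       np.zeros((1, D)), np.ones((1, D)))
```

With a single speech component and noise-dominated frames, the iteration converges to the true noise mean within a few passes, which matches the fixed-point behavior of the VTS expansion.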