In traditional feature compensation, noise estimation and clean-speech estimation share the same GMM, which is trained on clean speech features in the training phase and converted into a noisy GMM in the testing condition. To guarantee the accuracy of clean-speech estimation, the speech model usually consists of a large number of Gaussian components, which leads to a high computational cost. To improve computational efficiency without reducing recognition accuracy, this paper employs two GMMs to estimate the noise parameters and to restore the clean speech features, respectively, as illustrated in Fig. 1. GMM1, composed of fewer Gaussian components, approximately represents the distribution of the speech features and rapidly estimates the noise parameters from noisy speech features with the EM algorithm. Moreover, the average log-likelihood difference of GMM1 is taken as a sign of noise variation, which is used to decide whether or not to perform model combination. If the auxiliary function of the EM algorithm converges, the noise information, composed of the noise-variation sign and the noise parameters, is sent to the model combination module, where the estimated single-Gaussian noise model is combined with GMM2 to obtain the noisy GMM for clean-speech estimation. GMM2 has sufficient Gaussian components to accurately characterize the distribution of speech in the cepstral domain. Since the noise distribution is independent of the speech distribution, the estimated noise parameters can be considered only weakly related to the number of Gaussian components in the speech model used for noise estimation. Therefore, in the model combination, the noise model estimated by GMM1 is close to that estimated by the traditional GMM, which consists of a large number of Gaussian components and is employed for both noise estimation and clean-speech restoration.
On the other hand, the number of Gaussian components in GMM2 is similar to that of the traditional GMM-based feature compensation, and the noisy GMM obtained by combining GMM2 with the estimated noise model can accurately restore the clean speech. Therefore, the proposed algorithm can achieve recognition accuracy similar to that of traditional GMM-based feature compensation.
This paper considers only additive noise and ignores channel distortion. According to the MFCC extraction procedure, the relationship between the noisy-speech cepstral feature y and the clean-speech cepstral feature x is:
$$ \begin{aligned} y = C\log\left(\exp\left(C^{-1}x\right)+\exp\left(C^{-1}n\right)\right) \end{aligned} $$
(1)
where n denotes the cepstral feature of the additive noise; C and C^{−1} denote the discrete cosine transform (DCT) matrix and its inverse, respectively. By taking the first-order VTS expansion of both sides of (1) at the point (μ_{x},μ_{n0}), we obtain the following linear approximation:
$$ {}\begin{aligned} y &= (I-U)(x-\mu_{x})+U(n-\mu_{n0}) + C\log\left(\exp\left(C^{-1}\mu_{x}\right)\right.\\ &\left.+\exp\left(C^{-1}\mu_{n0}\right)\right) \end{aligned} $$
(2)
where μ_{x} and μ_{n0} are the mean of x and the initial mean of n, respectively; I denotes the identity matrix; U is given by,
$$ \begin{aligned} U=C\,\mathrm{diag}\left(\frac{\exp\left(C^{-1}\mu_{n0}\right)}{\exp\left(C^{-1}\mu_{x}\right)+\exp\left(C^{-1}\mu_{n0}\right)}\right)C^{-1} \end{aligned} $$
(3)
where diag(·) denotes the diagonal matrix whose diagonal elements are equal to those of the vector in parentheses. Taking the expectation of both sides of (2), the mean vector of the noisy speech μ_{y} can be expressed as:
$$ \begin{aligned} {}\mu_{y}=U\mu_{n}-U\mu_{n0}+C\log\left(\exp\left(C^{-1}\mu_{x}\right)+\exp\left(C^{-1}\mu_{n0}\right)\right) \end{aligned} $$
(4)
where μ_{n} is the mean of n. Similarly, we can obtain the covariance of the noisy speech \(\Sigma_{y}\) by taking the variance of both sides of (2):
$$ \begin{aligned} \Sigma_{y}=(I-U)\Sigma_{x}(I-U)^{T}+U\Sigma_{n}U^{T} \end{aligned} $$
(5)
where Σ_{n} denotes the covariance of n.
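The mismatch model (1) and the VTS combination (3)–(5) can be sketched numerically. The following NumPy sketch assumes an orthonormal DCT-II matrix (so C^{−1} = C^{T}); all function and variable names are illustrative, not from the paper. Note that when μ_{n} = μ_{n0}, Eq. (4) reduces exactly to evaluating (1) at (μ_{x}, μ_{n0}), which the demo exploits.

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II matrix; its inverse is simply its transpose."""
    k = np.arange(d)[:, None]
    j = np.arange(d)[None, :]
    C = np.sqrt(2.0 / d) * np.cos(np.pi * (j + 0.5) * k / d)
    C[0, :] /= np.sqrt(2.0)
    return C

def noisy_cepstrum(x, n, C, C_inv):
    """Eq. (1): noisy cepstrum from clean-speech and noise cepstra."""
    return C @ np.log(np.exp(C_inv @ x) + np.exp(C_inv @ n))

def vts_combine(mu_x, Sigma_x, mu_n, Sigma_n, mu_n0, C, C_inv):
    """Eqs. (3)-(5): first-order VTS combination of a clean-speech
    Gaussian with a single-Gaussian noise model."""
    ex = np.exp(C_inv @ mu_x)
    en = np.exp(C_inv @ mu_n0)
    U = C @ np.diag(en / (ex + en)) @ C_inv                  # Eq. (3)
    mu_y = U @ (mu_n - mu_n0) + C @ np.log(ex + en)          # Eq. (4)
    V = np.eye(len(mu_x)) - U
    Sigma_y = V @ Sigma_x @ V.T + U @ Sigma_n @ U.T          # Eq. (5)
    return U, mu_y, Sigma_y

D = 2
C = dct_matrix(D)
C_inv = C.T
mu_x = np.zeros(D)
Sigma_x = np.diag([1.0, 0.5])
mu_n = mu_n0 = C @ np.full(D, -50.0)      # nearly silent noise
Sigma_n = 0.1 * np.eye(D)
U, mu_y, Sigma_y = vts_combine(mu_x, Sigma_x, mu_n, Sigma_n, mu_n0, C, C_inv)
```

In the low-noise limit U approaches zero, so the noisy Gaussian collapses back onto the clean one, which is a quick sanity check on the signs in (4) and (5).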
In the noise estimation, the probability density function (PDF) of the speech signal is represented by GMM1:
$$ {\begin{aligned} b(x_{t})=\sum_{m=1}^{\mathrm{M}}c_{m}\left\{(2\pi)^{-\frac{\mathrm{D}}{2}}\left|\Sigma_{x,m}\right|^{-\frac{1}{2}}\times \exp\left[-\frac{1}{2}(x_{t}-\mu_{x,m})^{T}\Sigma_{x,m}^{-1}(x_{t}-\mu_{x,m})\right]\right\} \end{aligned}} $$
(6)
where x_{t} denotes the t-th static cepstral feature vector; c_{m}, μ_{x,m}, and Σ_{x,m} are the mixture coefficient, mean vector, and covariance matrix of the mth Gaussian component, respectively; and M and D denote the number of Gaussian components of GMM1 and the dimension of the static feature x_{t}, respectively. GMM1 is trained on clean speech in the training phase and used to estimate the noise parameters from noisy testing speech. The noise parameters μ_{n} and Σ_{n} are estimated with the EM algorithm under the maximum-likelihood criterion, and the auxiliary function is defined as:
$$ \begin{aligned} Q(\bar{\lambda}|\lambda)&=-\frac{1}{2}\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)\\&\times\left[(y_{t}-\mu_{y,m})^{T}\Sigma_{y,m}^{-1}(y_{t}-\mu_{y,m})-\log\left|\Sigma_{y,m}^{-1}\right|\right] \end{aligned} $$
(7)
where γ_{m}(t)=P(k_{t}=m|y_{t},λ) is the posterior probability of mixture component m at time t given the observation y_{t} and the prior parameter set λ, and \(\bar {\lambda }\) denotes the new GMM parameter set.
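The posteriors γ_{m}(t) needed in (7) follow from the component log-densities of the noisy GMM1. A minimal sketch for diagonal covariances, with a log-sum-exp shift for numerical stability; array shapes and names are illustrative assumptions:

```python
import numpy as np

def posteriors(Y, c, mu_y, var_y):
    """gamma_m(t) = P(k_t = m | y_t, lambda) for a diagonal-covariance GMM.
    Y: (T, D) frames; c: (M,) weights; mu_y, var_y: (M, D). Returns (T, M)."""
    log_w = (np.log(c)[None, :]
             - 0.5 * np.sum(np.log(2.0 * np.pi * var_y), axis=1)[None, :]
             - 0.5 * np.sum((Y[:, None, :] - mu_y[None, :, :]) ** 2
                            / var_y[None, :, :], axis=2))
    log_w -= log_w.max(axis=1, keepdims=True)   # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)     # normalize over components

# two well-separated components: each frame should be assigned almost entirely
# to its nearest component
c = np.array([0.5, 0.5])
mu_y = np.array([[0.0, 0.0], [10.0, 10.0]])
var_y = np.ones((2, 2))
Y = np.array([[0.1, -0.1], [9.8, 10.2]])
gamma = posteriors(Y, c, mu_y, var_y)
```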
For the mth Gaussian component of GMM1, (4) can be rewritten as:
$$ \begin{aligned} \mu_{y,m}&=U_{m}\mu_{n}-U_{m}\mu_{n0}+C\log\left(\exp\left(C^{-1}\mu_{x,m}\right)\right.\\&\left.+\exp\left(C^{-1}\mu_{n0}\right)\right) \end{aligned} $$
(8)
Substituting (8) into (7) and setting the derivative of \(Q(\bar {\lambda }|\lambda)\) with respect to μ_{n} to zero, the noise mean μ_{n} is estimated as:
$$ \begin{aligned} \mu_{n}=&\left[\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)U_{m}^{T}\Sigma_{y,m}^{-1}U_{m}\right]^{-1}\times\\ &\left[ \sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)U_{m}^{T}\Sigma_{y,m}^{-1}\right.\\ &\times\left(y_{t}-C\log\left(\exp\left(C^{-1}\mu_{x,m}\right)+\exp\left(C^{-1}\mu_{n0}\right)\right)\right.\\& \left.\left.+U_{m}\mu_{n0}\right){\vphantom{\sum_{m=1}^{\mathrm{M}}}}\right] \end{aligned} $$
(9)
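Eq. (9) is a weighted least-squares solve. A sketch with diagonal Σ_{y,m} passed as a vector of inverse variances; the helper name, the precomputed vector g_m = C log(exp(C^{−1}μ_{x,m})+exp(C^{−1}μ_{n0})), and the toy data are all illustrative assumptions:

```python
import numpy as np

def update_noise_mean(gamma, Y, U, inv_var_y, g, mu_n0):
    """Eq. (9) with diagonal Sigma_{y,m}.
    gamma: (T, M) posteriors; Y: (T, D) noisy frames; U: (M, D, D);
    inv_var_y: (M, D) diagonal of Sigma_{y,m}^{-1};
    g: (M, D) with g[m] = C log(exp(C^{-1} mu_{x,m}) + exp(C^{-1} mu_{n0}))."""
    D = Y.shape[1]
    A = np.zeros((D, D))
    b = np.zeros(D)
    for m in range(gamma.shape[1]):
        W = U[m].T * inv_var_y[m][None, :]           # U_m^T Sigma_{y,m}^{-1}
        A += gamma[:, m].sum() * (W @ U[m])          # bracketed matrix in (9)
        resid = Y - g[m][None, :] + (U[m] @ mu_n0)[None, :]
        b += W @ (gamma[:, m] @ resid)               # gamma-weighted sum over t
    return np.linalg.solve(A, b)

# sanity check in the noise-dominated limit U_m = I: with g = 0 and
# mu_n0 = 0, Eq. (9) reduces to the posterior-weighted mean of the frames
T, D = 5, 2
Y = np.array([[1.0, 2.0], [2.0, 1.0], [0.0, 3.0], [1.5, 1.5], [0.5, 2.5]])
gamma = np.ones((T, 1))
mu_n = update_noise_mean(gamma, Y, np.eye(D)[None],
                         np.ones((1, D)), np.zeros((1, D)), np.zeros(D))
```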
In the cepstral space, the correlations among the components of the cepstral vector are weak, and thus Σ_{x,m}, Σ_{n}, and Σ_{y,m} can be simplified to diagonal matrices. Equation (5) can then be rewritten as:
$$ \begin{aligned} \sigma_{y,m}=(V_{m}\cdot* V_{m})\sigma_{x,m}+(U_{m}\cdot*U_{m})\sigma_{n} \end{aligned} $$
(10)
where V_{m}=I−U_{m}; σ_{y,m}, σ_{x,m}, and σ_{n} denote the variance vectors composed of the diagonal elements of Σ_{y,m}, Σ_{x,m}, and Σ_{n}, respectively; the symbol ·∗ denotes the element-wise product of two vectors of the same dimension. Substituting (10) into (7) and taking the derivative of \(Q(\bar {\lambda }|\lambda)\) with respect to σ_{n}, we obtain:
$$ \begin{aligned} &\frac{\partial Q(\bar{\lambda}|\lambda)}{\partial\sigma_{n}} =\sum_{m=1}^{\mathrm{M}}\frac{\partial\eta_{y,m}}{\partial\sigma_{n}}\frac{\partial Q(\bar\lambda|\lambda)}{\partial\eta_{y,m}}\\= &\sum_{m=1}^{\mathrm{M}}\frac {\partial\eta_{y,m}} {\partial\sigma_{n}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)\left[(y_{t}-\mu_{y,m}) \cdot *(y_{t}-\mu_{y,m})\right.\\ & \left.-(V_{m}\cdot *V_{m})\sigma_{x,m}-(U_{m}\cdot*U_{m})\sigma_{n} \right] \end{aligned} $$
(11)
where η_{y,m}=(σ_{y,m})^{−1}=[(V_{m}·∗V_{m})σ_{x,m}+(U_{m}·∗U_{m})σ_{n}]^{−1}, and each element of η_{y,m} is the reciprocal of the corresponding element of σ_{y,m}. The D×D matrix \(\frac {\partial \eta _{y,m}}{\partial \sigma _{n}}\) can be regarded as the weighting factor of the mth Gaussian component and is written as:
$$ {}\begin{aligned} G_{m}&=\frac{\partial\eta_{y,m}}{\partial\sigma_{n}}=-\left(U_{m}^{T}\cdot*U_{m}^{T}\right)\times \mathrm{diag}\left[ \left((V_{m}\cdot*V_{m})\sigma_{x,m}\right.\right.\\ &\left.\left.+(U_{m}\cdot*U_{m})\sigma_{n}\right)^{-2}\right] \end{aligned} $$
(12)
To obtain the closedform solution of the noise variance, the weighting factor G_{m} is approximated as a constant matrix:
$$ \begin{aligned} G_{m}&=-\left(U_{m}^{T}\cdot*U_{m}^{T}\right)\times \mathrm{diag}\left[\left((V_{m}\cdot*V_{m})\sigma_{x,m}\right.\right.\\&\left.\left.+(U_{m}\cdot*U_{m})\sigma_{n0}\right)^{-2}\right] \end{aligned} $$
(13)
where σ_{n0} is the initial value of σ_{n}, estimated in the previous EM iteration. Setting the derivative of \(Q(\bar {\lambda }|\lambda)\) with respect to σ_{n} to zero, the noise variance σ_{n} is computed as:
$$ \begin{aligned} \sigma_{n}=&\left[\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)G_{m}(U_{m}\cdot*U_{m})\right]^{-1}\times\\ &\left[\sum_{m=1}^{\mathrm{M}}\sum_{t=1}^{\mathrm{T}}\gamma_{m}(t)G_{m}\left((y_{t}-\mu_{y,m})\cdot*(y_{t}-\mu_{y,m})\right.\right.\\ &\left.\left. -(V_{m}\cdot*V_{m})\sigma_{x,m}\right){\vphantom{\sum_{m=1}^{\mathrm{M}}\sum_{t=1}}}\right] \end{aligned} $$
(14)
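The weighting factor (13) and the closed-form variance update (14) can be sketched together; in NumPy, element-wise squaring of a matrix implements ·∗ applied to itself, and right-multiplication by diag[·] becomes column scaling. The function names and toy inputs are illustrative assumptions:

```python
import numpy as np

def weighting_factor(U_m, V_m, sigma_x_m, sigma_n0):
    """Eq. (13): G_m with sigma_n frozen at its previous value sigma_n0."""
    sigma_y0 = (V_m ** 2) @ sigma_x_m + (U_m ** 2) @ sigma_n0
    return -(U_m.T ** 2) * (sigma_y0 ** -2)[None, :]   # columns scaled by sigma_y0^{-2}

def update_noise_var(gamma, Y, G, U, V, mu_y, sigma_x):
    """Eq. (14). gamma: (T, M); Y: (T, D); G, U, V: (M, D, D);
    mu_y, sigma_x: (M, D)."""
    D = Y.shape[1]
    A = np.zeros((D, D))
    b = np.zeros(D)
    for m in range(gamma.shape[1]):
        A += gamma[:, m].sum() * (G[m] @ (U[m] ** 2))
        r = (Y - mu_y[m]) ** 2 - ((V[m] ** 2) @ sigma_x[m])[None, :]
        b += G[m] @ (gamma[:, m] @ r)
    return np.linalg.solve(A, b)

D = 2
# with U_m = 0.5 I, V_m = 0.5 I and unit variances, Eq. (13) gives G_m = -I
G_demo = weighting_factor(0.5 * np.eye(D), 0.5 * np.eye(D),
                          np.ones(D), np.ones(D))

# noise-dominated limit (U_m = I, V_m = 0): Eq. (14) reduces to the
# posterior-weighted mean of the squared residuals
T = 4
Y = np.array([[1.0, -1.0], [-1.0, 1.0], [2.0, -2.0], [-2.0, 2.0]])
sigma_n = update_noise_var(np.ones((T, 1)), Y, -np.eye(D)[None],
                           np.eye(D)[None], np.zeros((1, D, D)),
                           np.zeros((1, D)), np.ones((1, D)))
```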
In addition to noise estimation, another function of GMM1 is to monitor time variation of the environmental noise. When the environment is stationary, the parameters of GMM2 used for clean-speech estimation are not updated, and the noisy GMM2 estimated in the previous time interval is employed directly for clean-speech estimation in the current interval, which saves energy and improves battery runtime on mobile devices. When the environmental noise varies, the clean GMM2 is combined with the estimated noise parameters μ_{n} and σ_{n} to produce the noisy GMM2 for computing the clean speech features. It is difficult to detect a noise variation by directly comparing the noise parameters of two time intervals. Therefore, this work employs the average log-likelihood difference over all frames of the current time interval as the sign of noise variation. Besides the adapted noisy GMM1 estimated from the current testing speech, the noise parameters of the previous time interval are saved in memory and used to produce another noisy GMM1 by model combination with the clean GMM1. If the average log-likelihood difference between the two noisy GMM1 models exceeds a threshold, a noise variation is considered to have occurred. The noise-variation sign and the noise parameters compose the noise information, which is sent to the model combination module to decide whether or not to update the parameters of the noisy GMM2.
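The decision rule above amounts to thresholding a mean of per-frame log-likelihood differences. A minimal sketch; the threshold value and the per-frame scores are hypothetical placeholders (the paper does not specify a threshold here):

```python
import numpy as np

def noise_variation_sign(loglik_adapted, loglik_previous, threshold=0.2):
    """Flag a noise change when the average per-frame log-likelihood of the
    GMM1 adapted to the current interval exceeds that of the GMM1 built from
    the previous interval's noise parameters by more than the threshold.
    The threshold value is an illustrative placeholder."""
    return float(np.mean(loglik_adapted - loglik_previous)) > threshold

ll_now = np.array([-51.0, -50.5, -49.8])    # hypothetical per-frame scores
ll_prev = np.array([-54.2, -53.9, -53.1])
changed = noise_variation_sign(ll_now, ll_prev)
```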
As shown in Fig. 1, the complete noise estimation process is summarized below.

1. Initialize the mean μ_{n0} with the all-zero vector and the variance σ_{n0} with the all-one vector.

2. Initialize the mean μ_{y,m} and variance σ_{y,m} of GMM1 with μ_{y,m} = μ_{x,m}, σ_{y,m} = σ_{x,m}.

3. Compute the posterior probabilities of the noisy speech using GMM1.

4. Compute the auxiliary function of the EM algorithm by Eq. (7).

5. Estimate the noise parameters μ_{n} and σ_{n} using Eqs. (9) and (14), respectively.

6. Update the mean μ_{y,m} and variance σ_{y,m} of GMM1 using Eqs. (8) and (10), respectively.

7. Update the initial mean μ_{n0} and initial variance σ_{n0} with μ_{n0} = μ_{n}, σ_{n0} = σ_{n}.

8. If the convergence criterion is not met, go to step 3.
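The whole loop can be sketched end to end for a single utterance, assuming diagonal covariances and an orthonormal DCT (so C^{−1} = C^{T}). A fixed iteration count stands in for the convergence test, a small floor keeps σ_{n} positive, and all names and the toy data are illustrative:

```python
import numpy as np

def dct_matrix(d):
    """Orthonormal DCT-II matrix; inverse = transpose."""
    k = np.arange(d)[:, None]
    j = np.arange(d)[None, :]
    C = np.sqrt(2.0 / d) * np.cos(np.pi * (j + 0.5) * k / d)
    C[0, :] /= np.sqrt(2.0)
    return C

def estimate_noise(Y, c, mu_x, sigma_x, n_iter=8):
    """Steps 1-8 with GMM1 (diagonal covariances).
    Y: (T, D) noisy frames; c: (M,) weights; mu_x, sigma_x: (M, D)."""
    T, D = Y.shape
    M = c.shape[0]
    C = dct_matrix(D)
    Ci = C.T
    mu_n, sigma_n = np.zeros(D), np.ones(D)            # step 1
    mu_y, sigma_y = mu_x.copy(), sigma_x.copy()        # step 2
    for _ in range(n_iter):                            # fixed count for step 8
        mu_n0, sigma_n0 = mu_n, sigma_n                # step 7 (previous estimate)
        en = np.exp(Ci @ mu_n0)
        U = np.stack([C @ np.diag(en / (np.exp(Ci @ mu_x[m]) + en)) @ Ci
                      for m in range(M)])              # Eq. (3) per component
        g = np.stack([C @ np.log(np.exp(Ci @ mu_x[m]) + en) for m in range(M)])
        V = np.eye(D)[None] - U
        # step 3: posteriors under the current noisy GMM1
        logw = (np.log(c)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * sigma_y), axis=1)[None, :]
                - 0.5 * np.sum((Y[:, None, :] - mu_y[None]) ** 2
                               / sigma_y[None], axis=2))
        logw -= logw.max(axis=1, keepdims=True)
        gamma = np.exp(logw)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # step 5: Eq. (9) for mu_n, Eqs. (13)-(14) for sigma_n
        A = np.zeros((D, D)); b = np.zeros(D)
        Av = np.zeros((D, D)); bv = np.zeros(D)
        for m in range(M):
            W = U[m].T / sigma_y[m][None, :]           # U_m^T Sigma_{y,m}^{-1}
            A += gamma[:, m].sum() * (W @ U[m])
            b += W @ (gamma[:, m] @ (Y - g[m] + (U[m] @ mu_n0)[None, :]))
            sy0 = (V[m] ** 2) @ sigma_x[m] + (U[m] ** 2) @ sigma_n0
            G = -(U[m].T ** 2) * (sy0 ** -2)[None, :]  # Eq. (13)
            Av += gamma[:, m].sum() * (G @ (U[m] ** 2))
            bv += G @ (gamma[:, m] @ ((Y - mu_y[m]) ** 2
                                      - ((V[m] ** 2) @ sigma_x[m])[None, :]))
        mu_n = np.linalg.solve(A, b)
        sigma_n = np.maximum(np.linalg.solve(Av, bv), 1e-8)  # positivity floor
        # step 6: update the noisy GMM1 via Eqs. (8) and (10)
        mu_y = np.stack([U[m] @ (mu_n - mu_n0) + g[m] for m in range(M)])
        sigma_y = np.stack([(V[m] ** 2) @ sigma_x[m] + (U[m] ** 2) @ sigma_n
                            for m in range(M)])
    return mu_n, sigma_n

# toy check: stationary noise with log-spectrum 5, clean speech at mu_x = 0;
# frames synthesized exactly by Eq. (1)
D, T = 2, 3
C = dct_matrix(D)
n_cep = C @ np.full(D, 5.0)                        # true noise cepstral mean
y = C @ np.log(1.0 + np.exp(np.full(D, 5.0)))      # Eq. (1) with x = 0
Y = np.tile(y, (T, 1))
mu_n_hat, sigma_n_hat = estimate_noise(Y, np.array([1.0]),
                                       np.zeros((1, D)), np.ones((1, D)))
```

With a single speech component and noise-dominated frames, the iteration converges to the true noise mean within a few passes, which matches the fixed-point behavior of the VTS expansion.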