
Compensation of SNR and noise type mismatch using an environmental sniffing based speech recognition solution

Abstract

Multiple-model based speech recognition (MMSR) has been shown to be quite successful in noisy speech recognition. Since it employs multiple hidden Markov model (HMM) sets that correspond to various noise types and signal-to-noise ratio (SNR) values, the selected acoustic model can be closely matched with the test noisy speech, which leads to improved performance when compared with other state-of-the-art speech recognition systems that employ a single HMM set. However, as the number of HMM sets is usually limited due to practical considerations as well as effective model selection, acoustic mismatch can still be a problem in MMSR. In this study, we propose methods to improve recognition performance by mitigating the SNR and noise type mismatch in an MMSR solution. For the SNR mismatch, an optimal SNR mapping between the test noisy speech and the HMM was determined through experimental investigation. Improved performance was demonstrated by employing this SNR mapping instead of directly using the estimated SNR of the test noisy speech. We also propose a novel method to reduce the effect of noise type mismatch by compensating the test noisy speech in the log-spectrum domain. We first derive the relation between the log-spectrum vectors of the test and training noisy speech. Since this relation is a nonlinear function of the speech and noise parameters, statistical information regarding the test log-spectrum vectors is obtained by approximation using the vector Taylor series (VTS) algorithm. Finally, minimum mean square error estimation of the training log-spectrum vectors is used to reduce the mismatch between the training and test noisy speech. By employing the proposed methods in the MMSR framework, relative word error rate reductions of 18.7% and 21.3% were achieved on the Aurora 2 task when compared to a conventional MMSR and the multi-condition training (MTR) method, respectively.

1 Introduction

It is well known that significant performance degradation occurs when speech recognition is used in noisy environments. Various research efforts have previously been directed at noise-robust speech recognition such as noise-robust feature extraction, speech enhancement, and feature and model parameter compensation [1–5]. These approaches are used independently or in combination with each other to improve the performance of speech recognition under noisy environments.

Training hidden Markov models (HMMs) directly on noisy speech has been considered an alternative approach to conventional methods [6–8]. However, such solutions work most effectively when the statistical structure of the noise does not vary greatly across training and test environments. Originally developed to address speaking style/stress variations [6], multi-style training, and later multi-condition training (MTR), has been considered on the Aurora 2 task [9]. In the MTR method, noisy speech signals under various noise conditions are used collectively for training the HMM. While remarkable performance improvements have been obtained with an MTR method, it has some drawbacks: since it combines a number of noise conditions during training, it reduces the phonetic sharpness of the acoustic models' probability density functions compared with matched training, where the training material is assumed to have the same noise condition as the noisy test speech.

To overcome this weakness of the MTR method, a multiple-model based speech recognition (MMSR) framework was recently proposed, and successful results using this approach were demonstrated [8]. In this method, multiple HMM sets corresponding to various noise types and SNR values are constructed during training, and a single HMM (reference HMM) set which is closest to the noisy test speech is selected for recognition. MMSR has been shown to achieve better performance compared to MTR for the Aurora 2 task [8].

Before actual speech recognition takes place in the MMSR framework, it is first necessary to classify the noise type and estimate the SNR of the test speech in order to select the reference HMM that most closely matches the noise condition in the test speech. As errors in this process will cause misrecognition, performance of the MMSR can be improved significantly by minimizing or compensating for such errors. In previous studies on MMSR, once the noise type is determined, the reference HMM that is closest to the estimated SNR of the test speech is selected [8]. However, according to our preliminary study [10], we expect that performance can be improved by selecting a reference HMM that has a slightly different (higher or lower) SNR value than the estimated SNR. This conjecture is based on our assumption that, in a specific noise type/level, specific phoneme classes can be influenced by noise more than others (e.g., for wideband noise, consonants such as fricatives and stops will be more severely degraded, while vowels and diphthongs will have less distortion when considering the impact of noise on automatic speech recognition (ASR) performance). Also, speech signal energy is generally bimodal, with vowels, diphthongs, liquids, and glides having high energy, and fricatives, stops, and affricates having low energy. The selection of an HMM with either higher or lower SNR may be influenced by the specific SNRs of consonants. Another possible reason for this SNR mismatch phenomenon was explained in [11]: training data with low SNR values reduce the speech discrimination of the trained HMM set, and it may therefore be advantageous to employ an HMM set trained on data with a higher SNR value.

According to a previous study [8], noise type classification accuracy using a Gaussian mixture model (GMM) for four known and distinct types of noise is nearly 100%. This suggests that noise type classification, assuming diverse and well-separated noise classes, does not adversely affect the performance of MMSR. However, this is no longer true when an unknown type of noise signal is present in the test speech. Since noise type mismatch can significantly impact performance, a strategy to address any noise type difference between training and test noisy speech should be employed. Since most conventional methods have focused on compensating the difference between clean and noise-corrupted speech, they cannot be directly applied to MMSR. Jacobian adaptation has been used to adapt the parameters of HMMs in changing noise conditions with some success [12, 13]. However, since this is based on a simple linear approximation of the nonlinear cepstral distortion, it does not accurately reflect the changing noise conditions present in the HMM parameters.

Vector Taylor series (VTS) based approaches have been widely used for noise robust speech recognition [2, 14] due to the outstanding performance of these methods. The basic strategy takes advantage of the relationship between clean and noise-corrupted speech signals in an analytical way where the relationship can be approximated quite accurately by the vector Taylor series. The resulting probability density function of the noisy speech signals can be easily estimated without using much adaptation data. Here, we apply a VTS-based approach to compensate for the noise type difference in MMSR. We first derive a novel formula describing the relationship between the test and training noisy speech in the log-spectrum domain and then VTS is used to approximate this nonlinear relation. During testing, we compensate the test log-spectrum vector to move it more closely towards a match with the reference HMM using minimum mean square error (MMSE) estimation of the training log-spectrum vector.

In this study, we propose to mitigate the mismatch between the test noisy speech and the selected reference HMM in MMSR from two different points of view. The SNR mismatch is reduced by optimally mapping the estimated SNR value of the test speech before selecting the reference HMM, and the noise type mismatch is handled by compensating the test noisy speech in the log-spectrum domain using VTS.

This paper is organized as follows: a review on MMSR is presented in section 2 and an experimental investigation on the SNR mismatch in the MMSR is described in section 3. Compensation of the test noisy speech is described in section 4. The experimental procedure and results are presented and discussed in section 5. Finally, conclusions are presented in section 6.

2 Multiple-model based speech recognition framework

2.1 Environmental sniffing

Environment-aware speech processing was proposed in [15]. That study established the concept of ‘environmental sniffing’ in order to characterize and effectively direct subsequent speech processing systems based on environmental noise types and levels. The study also showed that one could achieve a significant reduction in computational requirements versus traditional multi-recognizer ROVER solutions with an increase in recognition performance. Our study here focuses on selecting the best HMM platform from such an environmental sniffing hierarchy.

2.2 Architecture of multiple-model based speech recognition framework

In MMSR, multiple reference HMMs corresponding to the noise environments, both in type and SNR range, are constructed during training, and the reference HMM that is most appropriate for the test noisy speech is selected for recognition. To select the reference HMM, the SNR of the noisy test speech must first be estimated and the noise type classified. The architecture of the MMSR is shown in Figure 1.

Figure 1

Architecture of MMSR framework which is divided into two parts: training phase and testing phase.

Figure 2 shows an example in which a reference HMM is selected based on the SNR value and noise type of the test speech in the MMSR framework. For this study, a reference HMM for every 2-dB interval across four noise types (babble, car, subway, exhibition) was constructed during training and stored in the environmental sniffing HMM database. In the example shown in Figure 2, the noise type of the noisy test speech was classified as subway and the SNR was estimated at 5.5 dB. This information was then sent to the environmental sniffing HMM database, and the reference HMM corresponding to subway/6 dB, which was closest to the noisy test speech, was selected for recognition. It is generally believed that choosing the reference HMM with the SNR value most similar to the test speech will result in the best ASR performance for conventional MMSR. However, in this study, we experimentally determined the optimal SNR value of the reference HMM that results in improved recognition accuracy, better than the matched condition, for a given noisy test speech utterance.

Figure 2

An example of the reference HMM selection process in MMSR.
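The nearest-SNR lookup described in this example can be sketched as follows. The 2-dB grid and the subway/6 dB outcome come from the text; the function name and the `(noise_type, snr)` key format are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the nearest-SNR reference-HMM lookup used in conventional MMSR.
# The (noise_type, snr) key format is a hypothetical convention for this example.

def select_reference_hmm(noise_type, snr_db, snr_grid=range(0, 31, 2)):
    """Pick the HMM trained at the grid SNR closest to the estimated SNR."""
    nearest = min(snr_grid, key=lambda s: abs(s - snr_db))
    return (noise_type, nearest)
```

For the paper's example, `select_reference_hmm("subway", 5.5)` returns the key for the subway/6 dB model.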

2.2.1 SNR estimation and noise type classification

A simple energy-based voice activity detector (VAD) [16] was used for SNR estimation in the MMSR. The VAD works in a similar way to an endpoint detector: it uses energy thresholds from the noisy speech utterance to find the speech portions of the utterance. In the SNR estimation method, the noise power $\hat{\sigma}_n^2$ was estimated using samples from the non-speech period obtained by the VAD, and this value was subtracted from the signal power $\hat{\sigma}_x^2$ estimated from the regions of speech activity to find the speech-only power. The SNR of the noisy speech is defined as follows:

$$\mathrm{SNR} = 10 \log \frac{\hat{\sigma}_x^2 - \hat{\sigma}_n^2}{\hat{\sigma}_n^2}$$
(1)
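A minimal sketch of the energy-based SNR estimate of Equation 1, assuming a per-sample speech/non-speech mask has already been produced by the VAD. The function and argument names are illustrative.

```python
import numpy as np

# Energy-based SNR estimate (Equation 1), given a boolean VAD mask:
# True marks samples inside speech activity, False marks non-speech.

def estimate_snr_db(samples, speech_mask):
    samples = np.asarray(samples, dtype=float)
    sig_pow = np.mean(samples[speech_mask] ** 2)     # sigma_x^2 over speech activity
    noise_pow = np.mean(samples[~speech_mask] ** 2)  # sigma_n^2 over non-speech
    return 10.0 * np.log10((sig_pow - noise_pow) / noise_pow)
```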

For environmental sniffing based noise type classification, the cepstral feature vector of the noise signal $C_n$ is modeled by a GMM. The GMM represents a weighted linear combination of Gaussian probability density functions and is expressed as follows:

$$p(C_n) = \sum_{i=1}^{M} \omega_i\, p_i(C_n) = \sum_{i=1}^{M} \frac{\omega_i}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(C_n - \mu_i)'\, \Sigma_i^{-1} (C_n - \mu_i)\right)$$
(2)

In Equation 2, the weight factor $\omega_i$ satisfies $\sum_{i=1}^{M} \omega_i = 1$, and $\mu_i$ and $\Sigma_i$ represent the D-dimensional mean vector and the covariance matrix of the Gaussian probability density function $p_i(C_n)$, respectively. Other studies have also employed a GMM to characterize the acoustic noise space using online GMM modeling [17, 18]. Next, for noise signal classification, the GMM is trained for each noise type via expectation maximization (EM)-based maximum-likelihood estimation. Using the Aurora 2 database, one GMM is trained for each known noise type, and the GMM that provides the best accumulated likelihood for the first 10 frames of the noisy test speech is chosen as the noise type model for evaluation testing.
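The decision rule above can be sketched as follows: accumulate the log-likelihood of the first 10 frames under each noise-type GMM (Equation 2) and choose the argmax. The two "noise types" below are synthetic single-component stand-ins for EM-trained models, and all names are illustrative.

```python
import numpy as np

# Toy sketch of GMM-based noise-type classification over the first 10 frames.

def gmm_log_likelihood(frames, weights, means, covs):
    total = 0.0
    for c in frames:
        p = 0.0
        for w, mu, cov in zip(weights, means, covs):
            d = len(mu)
            diff = c - mu
            norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(cov))
            p += w * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm
        total += np.log(p)
    return total

def classify_noise(frames, models):
    """models: dict mapping noise-type name -> (weights, means, covs)."""
    head = frames[:10]  # accumulated likelihood over the first 10 frames only
    return max(models, key=lambda name: gmm_log_likelihood(head, *models[name]))

# Synthetic single-component "GMMs" standing in for two trained noise models.
toy_models = {
    "car": ([1.0], [np.zeros(2)], [np.eye(2)]),
    "babble": ([1.0], [np.full(2, 5.0)], [np.eye(2)]),
}
```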

3 Analysis of SNR mismatch in environmental sniffing based MMSR

In this section, an experimental investigation was performed to illustrate the effect of SNR mismatch between the noisy test and training speech in MMSR. The performance variation due to SNR mismatch was explored to determine the optimal SNR mapping between the noisy test and training speech that gives the best recognition performance.

For the experiments in this section, new training and test data were generated using the Aurora 2 database. Four known noise signal types (subway, babble, car, and exhibition noises) were added to the clean training data of the Aurora 2 database to generate noisy training data for each noise type across the SNR range from 0 to 30 dB in 2-dB intervals (16 SNR levels). The 4 × 16 sets of noisy training data were used to construct 64 sets of HMMs. The hidden Markov model toolkit [19] was employed for training and testing in this experiment. To generate the noisy test data, 1,001 clean speech sentences in Set A were corrupted by adding the four known types of noise signal with SNRs ranging from 0 to 30 dB in 2-dB intervals. This test data was generated independently of the test data used in the experiments in section 5 so that the analytical results in this section could be applied without loss of generality to the SNR mapping in the speech recognition experiments described in section 5. The noise types and SNR values of the training and test speech were assumed to be known a priori in the MMSR.

To illustrate the impact of SNR mismatch between training and test data in the environmental sniffing MMSR framework, Figure 3 shows the WER surface across the four noise types. In this analysis, the minimum WER did not necessarily occur when the SNR of the noisy test speech matched that of the training speech (i.e., if the matched HMM were always optimal, the red minimum points would all lie along the green diagonal of input test speech SNR versus selected HMM SNR). This means that the lowest WER cannot be guaranteed even when the SNR is matched. This may not be a serious issue for recognition performance at high SNRs, since the WER surface is relatively flat in these regions. However, the WER surface has steep changes in slope in low SNR regions, which signifies the importance of finding an effective SNR mapping between training and test speech for optimal recognition performance.

Figure 3

WER surface. The result of SNR mismatch between noisy test and training speech for four noise types in the Aurora 2 database (the red line connects the minimum WER points).

Figure 4 shows the WER curves as the SNR of the selected reference HMM is changed while the SNR of the test speech is fixed at 0, 2, and 4 dB (babble noise), respectively. For example, when the SNR of the test speech is 0 dB, the WER is 43.65% when the reference HMM trained at 0 dB is selected, while the WER decreases to 32.07% for the 6-dB reference HMM. This example contradicts the conventional idea that the best noisy speech recognition performance is achieved when the SNRs of the test and training speech are matched. The performance difference is large enough that an alternative selection process for the reference HMM, given the SNR of the test speech, is needed. Beyond the babble noise illustrated here, a similar result was also observed for the other three types of noise in Set A.

Figure 4

The variation of the WER (%) as the SNR of the reference HMM changes. The SNR of the reference HMM is varied while the SNR of the test speech is fixed at 0, 2, and 4 dB, respectively.

Based on the WER surface shown in Figure 3, it is clearly possible to determine the best reference HMM given the estimated SNR of the test speech. The results of this analysis are summarized in Table 1. As expected, some differences in SNR between the noisy test speech and the best reference HMM were observed. For test speech with low SNRs, an advantage in selecting a reference HMM with a higher SNR value than the actual estimated value was demonstrated. In Table 1, the estimated SNR values of the test speech are adjusted to compensate for estimation errors, which makes the SNR mapping in Table 1 robust against SNR estimation errors.

Table 1 SNR of reference HMM showing lowest WER as the SNR of test speech was varied
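The mapping mechanism behind Table 1 can be sketched as follows: snap the estimated test SNR to the 2-dB training grid, then replace it with the mapped training SNR before selecting the reference HMM. The mapping entries below are invented placeholders; the actual values come from the WER analysis summarized in Table 1, which is not reproduced here.

```python
# Illustrative SNR-mapping lookup; SNR_MAP entries are hypothetical placeholders.

SNR_MAP = {0: 6, 2: 6, 4: 8}  # hypothetical low-SNR entries only

def mapped_hmm_snr(estimated_snr_db, snr_map=SNR_MAP, grid=range(0, 31, 2)):
    # Snap the estimate to the 2-dB training grid.
    snapped = min(grid, key=lambda s: abs(s - estimated_snr_db))
    # Outside the mapped (low-SNR) region the estimate is used directly.
    return snr_map.get(snapped, snapped)
```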

4 Feature compensation for environmental sniffing based MMSR

Although SNR mismatch in MMSR can be reduced through optimal mapping of the estimated SNR as described in section 3, there remains the problem of noise type mismatch, which occurs when an unknown (unseen during training) type of noise signal is present in the test speech. For this reason, the recognition accuracy of MMSR was worse than that of the MTR method for Set B (see section 5.1 for a detailed description of Set B), where unknown types of noise are encountered. To overcome this problem, we developed a novel feature compensation method based on VTS. First, the relation between the log-spectrum vectors of the noisy training and test speech was derived. Since this relation is a nonlinear function of the speech and noise parameters, statistical information regarding the test log-spectrum vectors is obtained by approximation using the VTS algorithm. Finally, MMSE estimation of the training log-spectrum vectors is performed to reduce the mismatch between the noisy training and test speech.

4.1 Relationship between noisy speech signals

For feature compensation, a relationship between the log-spectrum vectors in the noisy training and test speech was derived. We employed the usual assumption of the relationship between the clean speech vector x and the noisy speech vector y in the log-spectrum domain as follows:

$$y = x + \log\big(i + \exp(n - x)\big)$$
(3)

If $g(x, n) = \log(i + \exp(n - x))$, then we have $y = x + g(x, n)$,

where $n$ is the additive noise in the corruption process and $i$ is a unity vector. Based on Equation 3, the noisy log-spectrum vector $y_{Tr}$ in the training speech and $y_{Te}$ in the test speech can be described as follows:

$$y_{Tr} = x + \log\big(i + \exp(n_{Tr} - x)\big) = x + g(x, n_{Tr})$$
(4)
$$y_{Te} = x + \log\big(i + \exp(n_{Te} - x)\big) = x + g(x, n_{Te}),$$
(5)

where $n_{Tr}$ and $n_{Te}$ are the additive noises in the training and test speech, respectively.

Combining Equations 4 and 5, we can express yTe in terms of yTr as follows:

$$y_{Te} = y_{Tr} + g(x, n_{Te}) - g(x, n_{Tr}).$$
(6)

Assume that $n_{Tr}$ is determined beforehand during training and that $n_{Te}$ is expressed as a variable $n$ to be estimated from the noisy test speech; then $g(x, n_{Te}) - g(x, n_{Tr})$ can be described as follows:

$$g(x, n_{Te}) - g(x, n_{Tr}) = \log\big(i + \exp(n_{Te} - x)\big) - \log\big(i + \exp(n_{Tr} - x)\big)$$
$$\big[g(x, n) - g(x, n_{Tr})\big]_i = \log\frac{\big[i + \exp(n - x)\big]_i}{\big[i + \exp(n_{Tr} - x)\big]_i} = \log\frac{\big[\exp(x) + \exp(n)\big]_i}{\big[\exp(x) + \exp(n_{Tr})\big]_i}.$$
(7)

Here, $[\cdot]_i$ represents the ith element of a vector.

From Equation 4, the following equation can be derived:

$$y_{Tr} = \log\big(\exp(x) + \exp(n_{Tr})\big)$$
(8)

Taking the exponential of both sides in Equation 8, the equation can be rewritten as follows:

$$\exp(y_{Tr}) = \exp(x) + \exp(n_{Tr}) \;\;\Longrightarrow\;\; \exp(x) = \exp(y_{Tr}) - \exp(n_{Tr})$$
(9)

Substituting Equation 9 into Equation 7 produces

$$\big[g(x, n) - g(x, n_{Tr})\big]_i = \log\frac{\big[\exp(y_{Tr}) - \exp(n_{Tr}) + \exp(n)\big]_i}{\big[\exp(y_{Tr})\big]_i} = \log\big[i + \exp(n - y_{Tr}) - \exp(n_{Tr} - y_{Tr})\big]_i \equiv \big[G(y_{Tr}, n, n_{Tr})\big]_i$$
(10)

If Equation 10 is substituted back into Equation 6, the relation between the log-spectrum vectors in the noisy training and test speech can be obtained as follows:

$$y_{Te} = y_{Tr} + G(y_{Tr}, n, n_{Tr}) = y_{Tr} + \log\big(i + \exp(n - y_{Tr}) - \exp(n_{Tr} - y_{Tr})\big)$$
(11)

Equation 11 can be used to find statistical information on $y_{Te}$ given the statistics of the training log-spectrum vector $y_{Tr}$.
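The derivation above is exact, so it can be checked numerically: computing $y_{Te}$ directly from Equations 3 to 5 should equal $y_{Tr} + G(y_{Tr}, n, n_{Tr})$. The test points below are arbitrary values chosen for illustration.

```python
import numpy as np

# Numerical sanity check of Equation 11 at arbitrary test points.

def g(x, n):
    return np.log(1.0 + np.exp(n - x))

def big_G(y_tr, n, n_tr):
    return np.log(1.0 + np.exp(n - y_tr) - np.exp(n_tr - y_tr))

x = np.array([1.0, 2.0])      # clean log-spectrum (arbitrary)
n_tr = np.array([0.5, 0.2])   # training noise
n_te = np.array([0.8, 1.5])   # test noise

y_tr = x + g(x, n_tr)                        # Equation 4
y_te_direct = x + g(x, n_te)                 # Equation 5
y_te_eq11 = y_tr + big_G(y_tr, n_te, n_tr)   # Equation 11

assert np.allclose(y_te_direct, y_te_eq11)
```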

4.2 Compensation of the feature vector

Next, a compensation of the feature vector was performed employing MMSE estimation of the log-spectrum vector. The compensation will estimate the noisy log-spectrum in the training speech given the log-spectrum from the test speech using a statistical relation. By using the estimated log-spectrum vector instead of the original log-spectrum vector from the test speech, the mismatch between the test speech and reference HMM in the environmental sniffing based MMSR (ESniff MMSR) can be reduced, which will improve the recognition performance without changing the parameters of the reference HMM.

4.2.1 Estimating the mean and covariance of log-spectrum vector

A statistical relationship between the log-spectrum vectors of the training and test noisy speech was first derived. Equation 11 was expanded using a first-order VTS around an initial value $n_0$ of $n$ and the mean of the training log-spectrum vector $\mu_{y_{Tr}} = E\{y_{Tr}\}$ to obtain the following equation:

$$y_{Te} = y_{Tr} + G(\mu_{y_{Tr}}, n_0, n_{Tr}) + \nabla_{y_{Tr}} G(\mu_{y_{Tr}}, n_0, n_{Tr})\,(y_{Tr} - \mu_{y_{Tr}}) + \nabla_n G(\mu_{y_{Tr}}, n_0, n_{Tr})\,(n - n_0),$$
(12)

where the gradient matrices are assumed to be diagonal and obtained as follows:

$$\big[\nabla_{y_{Tr}} G(\mu_{y_{Tr}}, n_0, n_{Tr})\big]_{ii} = \frac{\big[\exp(n_{Tr}) - \exp(n_0)\big]_i}{\big[\exp(\mu_{y_{Tr}}) + \exp(n_0) - \exp(n_{Tr})\big]_i}, \qquad \big[\nabla_n G(\mu_{y_{Tr}}, n_0, n_{Tr})\big]_{ii} = \frac{\big[\exp(n_0)\big]_i}{\big[\exp(\mu_{y_{Tr}}) + \exp(n_0) - \exp(n_{Tr})\big]_i}$$
(13)

Here, $[\cdot]_{ii}$ represents the ith diagonal element of a matrix. Using Equation 12, the mean $\mu_{y_{Te}}$ and covariance $\Sigma_{y_{Te}}$ of $y_{Te}$ can be expressed in terms of the mean vector and covariance matrix of the noisy training speech $y_{Tr}$ as follows:

$$\mu_{y_{Te}} = \mu_{y_{Tr}} + G(\mu_{y_{Tr}}, n_0, n_{Tr}) + \nabla_n G(\mu_{y_{Tr}}, n_0, n_{Tr})\,(n - n_0)$$
$$\Sigma_{y_{Te}} = \big[I + \nabla_{y_{Tr}} G(\mu_{y_{Tr}}, n_0, n_{Tr})\big]\, \Sigma_{y_{Tr}}\, \big[I + \nabla_{y_{Tr}} G(\mu_{y_{Tr}}, n_0, n_{Tr})\big]^T,$$
(14)

where I is an identity matrix. Next, the noise vector was characterized.
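Equations 13 and 14 can be sketched directly in code: compute the diagonal VTS gradients, then the adapted mean and covariance of the test log-spectrum from the training statistics. Inputs are per-dimension vectors; the values used in testing are arbitrary, and the function name is illustrative.

```python
import numpy as np

# Sketch of the VTS adaptation of Equations 13-14.

def vts_adapt(mu_tr, sigma_tr, n_tr, n0, n):
    denom = np.exp(mu_tr) + np.exp(n0) - np.exp(n_tr)
    grad_y = (np.exp(n_tr) - np.exp(n0)) / denom   # diagonal of grad_yTr G (Eq. 13)
    grad_n = np.exp(n0) / denom                    # diagonal of grad_n G (Eq. 13)
    G0 = np.log(1.0 + np.exp(n0 - mu_tr) - np.exp(n_tr - mu_tr))
    mu_te = mu_tr + G0 + grad_n * (n - n0)         # mean update (Eq. 14)
    J = np.diag(1.0 + grad_y)
    sigma_te = J @ sigma_tr @ J.T                  # covariance update (Eq. 14)
    return mu_te, sigma_te
```

A quick consistency check: when $n = n_0 = n_{Tr}$, the correction vanishes and the training statistics are returned unchanged.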

4.2.2 Maximum likelihood estimation of noise vector

The log-spectrum vector $y_{Tr}$ of the noisy training speech was assumed to be distributed as a mixture of Gaussian distributions with mean vectors and covariance matrices obtained through a vector quantization process using the noisy training data. The Gaussian mixture distribution was separately estimated for each noise type and SNR value using the same noisy training data used to produce the reference HMM sets. Assuming also that the log-spectrum vector $y_{Te}$ of the noisy test speech follows a Gaussian mixture distribution, the distribution of $y_{Te}$ as a function of the unknown noise vector $n$ can be defined using Equation 14:

$$p(y_{Te}|n) = \sum_{m=1}^{M} p_m\, N(\mu_{y_{Te},m}, \Sigma_{y_{Te},m}),$$
(15)

where $N(\mu_{y_{Te},m}, \Sigma_{y_{Te},m})$ is the mth Gaussian component with mean vector $\mu_{y_{Te},m}$ and covariance matrix $\Sigma_{y_{Te},m}$, and $p_m$ is the mixture weight of the mth Gaussian component. Note that the mean vector $\mu_{y_{Te},m}$ and covariance matrix $\Sigma_{y_{Te},m}$ are themselves fully parameterized by the noise vector $n$. In this study, the noise vector $n$ was treated as a parameter rather than a random variable; only the noisy speech vectors were treated as random variables.

Given a sequence of test log-spectrum vectors of length T, written as $Y_{Te} = \{y_{Te,1}, y_{Te,2}, \cdots, y_{Te,T}\}$, the resulting log-likelihood function is defined as follows:

$$L(Y_{Te}|n) = \sum_{t=1}^{T} \log p(y_{Te,t}|n).$$
(16)

Here, an iterative EM algorithm was employed to re-estimate the noise vector n by maximizing the log-likelihood function for the noisy test speech.

In the EM algorithm, the auxiliary function $Q(\varphi, \bar{\varphi})$ is defined as follows:

$$Q(\varphi, \bar{\varphi}) = E\{L(Y_{Te}|\bar{\varphi})\,|\,Y_{Te}, \varphi\} = \sum_{t=1}^{T} \sum_{m=1}^{M} p(m|y_{Te,t}, n)\, \log p(y_{Te,t}, m|\bar{n}).$$

The symbol $\varphi$ represents the noise vector $n$, which is assumed to be known, and $\bar{\varphi}$ is the unknown noise vector $\bar{n}$ to be estimated. Note that $Q(\varphi, \bar{\varphi})$ can be expanded as follows:

$$Q(\varphi, \bar{\varphi}) = \sum_{t=1}^{T} \sum_{m=1}^{M} p(m|y_{Te,t}, n) \Big[ \log p_m - \frac{D}{2}\log 2\pi - \frac{1}{2}\log\big|\Sigma_{y_{Te},m}\big| - \frac{1}{2}\big(y_{Te,t} - \mu_{y_{Tr},m} - G(\mu_{y_{Tr},m}, n_0, n_{Tr}) - \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})(\bar{n} - n_0)\big)^T\, \Sigma_{y_{Te},m}^{-1}\, \big(y_{Te,t} - \mu_{y_{Tr},m} - G(\mu_{y_{Tr},m}, n_0, n_{Tr}) - \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})(\bar{n} - n_0)\big) \Big]$$
(17)

Next, to re-estimate $n$ in Equation 17, the derivative of the auxiliary function with respect to $\bar{n}$ is taken and set equal to 0:

$$\nabla_{\bar{n}}\, Q(\varphi, \bar{\varphi}) = \sum_{t=1}^{T} \sum_{m=1}^{M} p(m|y_{Te,t}, n)\, \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})^T\, \Sigma_{y_{Te},m}^{-1}\, \big(y_{Te,t} - \mu_{y_{Tr},m} - G(\mu_{y_{Tr},m}, n_0, n_{Tr}) - \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})(\bar{n} - n_0)\big) = 0$$
(18)
$$\bar{n} = \Big[\sum_{t=1}^{T} \sum_{m=1}^{M} p(m|y_{Te,t}, n)\, \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})^T\, \Sigma_{y_{Te},m}^{-1}\, \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})\Big]^{-1} \sum_{t=1}^{T} \sum_{m=1}^{M} p(m|y_{Te,t}, n)\, \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})^T\, \Sigma_{y_{Te},m}^{-1}\, \big(y_{Te,t} - \mu_{y_{Tr},m} - G(\mu_{y_{Tr},m}, n_0, n_{Tr}) + \nabla_n G(\mu_{y_{Tr},m}, n_0, n_{Tr})\, n_0\big).$$
(19)

The noise vector derived from Equation 19 was then substituted into Equation 14 to adapt $\mu_{y_{Te},m}$ and $\Sigma_{y_{Te},m}$ in Equation 15. The likelihood function from Equation 16 and the auxiliary function from Equation 17 were consequently updated. This process was iterated until a defined convergence criterion was met in the log-likelihood function from Equation 16 of the noisy test speech. After convergence, an MMSE estimation of the original noisy training speech $y_{Tr}$ was performed using the statistical information from $y_{Te}$. Using this MMSE process, the spectral mismatch between the noisy test speech and the selected reference HMM in the MMSR is expected to be reduced significantly.
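For intuition, the closed-form update of Equations 18 and 19 can be sketched for the simplest case of a single Gaussian (M = 1) with diagonal covariance, where the posterior weight is 1 and each dimension decouples into a scalar weighted least-squares problem. This is a didactic reduction, not the full mixture EM loop; all names and values are illustrative.

```python
import numpy as np

# Single-Gaussian, diagonal-covariance reduction of the noise update (Eq. 19).

def big_G(y, n, n_tr):
    return np.log(1.0 + np.exp(n - y) - np.exp(n_tr - y))

def update_noise(frames, mu_tr, sigma_diag, n_tr, n0):
    denom = np.exp(mu_tr) + np.exp(n0) - np.exp(n_tr)
    grad_n = np.exp(n0) / denom                      # diagonal of grad_n G
    lhs = len(frames) * grad_n ** 2 / sigma_diag     # bracketed term in Eq. 19
    resid = sum(y - mu_tr - big_G(mu_tr, n0, n_tr) + grad_n * n0 for y in frames)
    rhs = (grad_n / sigma_diag) * resid
    return rhs / lhs                                 # re-estimated noise vector
```

If the test frames are generated exactly by the linearized model of Equation 12 with some true noise vector, one update step recovers that vector exactly.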

4.2.3 MMSE estimation of the log-spectrum

The MMSE estimate of yTr given yTe can be expressed as follows:

$$\hat{y}_{Tr,\mathrm{MMSE}} = E\{y_{Tr}|y_{Te}\} = \int y_{Tr}\, p(y_{Tr}|y_{Te})\, dy_{Tr}.$$
(20)

From Equation 11,

$$y_{Tr} = y_{Te} - \log\big(i + \exp(n - y_{Tr}) - \exp(n_{Tr} - y_{Tr})\big) = y_{Te} - G(y_{Tr}, n, n_{Tr}).$$
(21)

The following relationship is determined by substituting Equation 21 into Equation 20 and approximating $G(y_{Tr}, n, n_{Tr})$ with a zero-order Taylor series around the mean value $\mu_{y_{Tr},m}$:

$$\hat{y}_{Tr,\mathrm{MMSE}} = y_{Te} - \int G(y_{Tr}, n, n_{Tr})\, p(y_{Tr}|y_{Te})\, dy_{Tr} = y_{Te} - \int \sum_{m=1}^{M} G(y_{Tr}, n, n_{Tr})\, p(y_{Tr}, m|y_{Te})\, dy_{Tr} = y_{Te} - \sum_{m=1}^{M} p(m|y_{Te}) \int G(y_{Tr}, n, n_{Tr})\, p(y_{Tr}|m, y_{Te})\, dy_{Tr} \approx y_{Te} - \sum_{m=1}^{M} p(m|y_{Te})\, G(\mu_{y_{Tr},m}, n, n_{Tr})$$
(22)

A discrete cosine transform of the log-spectrum vector $\hat{y}_{Tr,\mathrm{MMSE}}$ in Equation 22 was performed to obtain a 13th order cepstrum vector. The c0 component of the cepstrum vector was replaced with the log-energy. The delta and acceleration (delta-delta) coefficients of the cepstrum vector were also calculated to produce a 39-dimensional enhanced feature vector, which was then used for speech recognition evaluation.
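The final compensation step of Equation 22 can be sketched as follows: subtract the posterior-weighted correction G, evaluated at each mixture mean, from the test log-spectrum vector. The posteriors and statistics used in testing are toy placeholders, and the function name is illustrative.

```python
import numpy as np

# Zero-order MMSE compensation of the test log-spectrum vector (Equation 22).

def mmse_compensate(y_te, posteriors, mix_means, n, n_tr):
    correction = sum(
        p * np.log(1.0 + np.exp(n - mu) - np.exp(n_tr - mu))  # G at mixture mean
        for p, mu in zip(posteriors, mix_means)
    )
    return y_te - correction  # estimate of the training log-spectrum y_Tr
```

As a sanity check, when the test noise equals the training noise ($n = n_{Tr}$), every correction term is $\log 1 = 0$ and the feature is returned unchanged.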

5 Experiments and discussions

5.1 Baseline system and speech corpora employed

In this study, we employ the Aurora 2 database for the experiments. There are two sets of training data, corresponding to clean training (CLEAN) and multi-condition training (MTR). Each consists of 8,440 sentences. The MTR set consists of both clean and noisy speech signals that are artificially contaminated by various kinds of noise (subway, car, exhibition, babble) with SNRs ranging from 0 to 20 dB in 5-dB intervals.

Recognition experiments were conducted on three test sets (Set A, Set B, Set C) that are corrupted by a range of noise types at SNRs of 0, 5, 10, 15, and 20 dB. For each noise type and SNR value, there are 1,001 sentences for recognition. Set A and Set B are corrupted by additive noise alone, while Set C is corrupted by a combination of convolutional and additive noise.

Two widely known speech feature sets were used for the experiments. The first, referred to as FE, consists of 12th order Mel-frequency cepstral coefficients (with the 0th cepstral component set aside) appended with the log energy to form a 13th order basic feature vector, along with their delta and acceleration coefficients, constructing a 39-dimensional feature vector for each frame [20]. The second feature set is a noise-robust version of the FE, generally called the advanced front-end (AFE) in the literature, which is known to significantly reduce word error rates in noisy conditions [21]. The AFE feature vectors were also 39-dimensional, consistent with the feature size used for the FE.
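The construction of the 39-dimensional vector from a 13-dimensional static feature can be sketched as follows. The two-frame regression window is an assumed (HTK-style) choice, not a detail specified in the text, and the function name is illustrative.

```python
import numpy as np

# Append delta and acceleration (delta-delta) coefficients to static features.

def add_deltas(static, win=2):
    """static: (T, 13) array -> (T, 39) array with deltas and delta-deltas."""
    def delta(feat):
        T = len(feat)
        # Replicate edge frames so the regression window stays inside the signal.
        padded = np.vstack([feat[:1]] * win + [feat] + [feat[-1:]] * win)
        num = sum(k * (padded[win + k : win + k + T] - padded[win - k : win - k + T])
                  for k in range(1, win + 1))
        return num / (2.0 * sum(k * k for k in range(1, win + 1)))

    d = delta(static)
    return np.hstack([static, d, delta(d)])
```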

The HMM for each digit consists of 16 states with 3 Gaussian mixtures in each state. Silence is also modeled by a three-state HMM with six Gaussian mixtures in each state. Four known types of noise signal were added to the CLEAN training data to generate noisy speech for training the reference HMMs in the ESniff MMSR solution. To construct a sufficient number of reference HMMs, a noisy speech signal was generated for every 2-dB interval between 0 and 30 dB resulting in a collection of 16 reference HMM sets constructed for each noise type. The total number of reference HMM sets used in the experiment was 4 × 16 = 64, with a single HMM set selected for recognition depending on the noise type and SNR value of the noisy test speech.

5.2 Experimental results

5.2.1 Comparison with conventional methods

In Table 2, the WER of ESniff MMSR was compared with other approaches for noisy speech recognition using FE for feature vectors. From the table, it can be seen that ESniff MMSR significantly outperforms parallel model combination (PMC) [2] as well as the CLEAN training and VTS [3] method, but is only slightly better than the previous MTR method. Compared with ESniff MMSR, the MTR method shows strong noise robustness for Set B which consists of noisy speech corrupted with unknown types of noise signals (restaurant, street, airport, station). Even though the ESniff MMSR performs much better than the MTR method for Set A and Set C, the difference in the average WER between the ESniff MMSR and MTR is not significantly large (13.24% versus 13.68%) due to the results from Set B. The sharp probability density function of the acoustic model in ESniff MMSR seems to have adversely affected the speech recognition performance for Set B.

Table 2 Comparison of WERs (%) of ESniff MMSR with other approaches when using FE feature vectors

Figure 5 shows the WER of ESniff MMSR when the reference HMM was selected using the SNR mappings obtained from Table 1. For comparison, we also include the WER performance of conventional ESniff MMSR and the MTR methods. The results in this figure confirm the findings reported in Table 2 that the conventional ESniff MMSR performs better than the MTR method reducing the relative WER by 3.2%. The SNR mapping based ESniff MMSR method (ESniff SNR-MMSR) further improves the performance of the conventional ESniff MMSR. The ESniff SNR-MMSR method produces an average WER of 12.40% (95% confidence interval of the WER is ±0.095%), thereby reducing the average relative WER by 6.3% and 9.4% compared with conventional ESniff MMSR and MTR methods, respectively. As shown in Figure 5, the ESniff SNR-MMSR performs better than the conventional ESniff MMSR for all three test sets (Set A, Set B, Set C), which demonstrates that the experimentally motivated SNR mappings in Table 1 are quite effective irrespective of the noise type in the test speech. Although SNR mapping has been established using the known types of noise signal during training, it was also found to be effective for the unknown types of additive noise signal in Set B and the convolution noise in Set C. More detailed results on the MTR, ESniff MMSR and ESniff SNR-MMSR can be found in Tables 3, 4 and 5, for the individual noise types from the Aurora 2 task.

Figure 5

WERs (%) of the ESniff MMSR with SNR mapping, compared with the conventional ESniff MMSR and MTR methods when using the FE feature vectors.

Table 3 WER (%) of MTR method for Aurora 2 task using the FE feature vectors
Table 4 WER (%) of conventional ESniff MMSR method for Aurora 2 task using FE feature vectors
Table 5 WER (%) of ESniff SNR-MMSR method for Aurora 2 task using FE feature vectors

To address the problem of noise type mismatch effectively, the MMSE estimation of the log-spectrum vectors given the noisy test speech was performed as described in section 4. The experimental results show that the recognition performance of the proposed environmental sniffing based MMSE (ESniff MMSE) method depends on the number of Gaussian distributions in Equation 15. Table 6 shows the WER of the ESniff MMSE method as the number of Gaussian distributions is varied from 2 to 128; the SNR mapping was also applied to ESniff MMSE. The average WER consistently drops as the number of Gaussian distributions is decreased from 128 to 4: the worst performance was observed at M = 128 and the best at M = 4. This means that a small number of Gaussian mixtures is adequate to model the noisy log-spectrum vectors. A small number of mixtures may be more appropriate for noisy speech whose spectrum is flattened by high-amplitude noise at low SNRs, as it avoids a poor fit to the test data under these conditions. More detailed results for the ESniff MMSE solution are shown in Table 7. For the results in Table 7 and Figure 6, we selected M = 4, which gave the best performance on the Set A test.
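The posterior-weighted form of such a GMM-based MMSE estimate can be sketched as follows. This is a simplified illustration under our own assumptions (diagonal covariances and a per-mixture additive correction in the log-spectrum domain, in the spirit of SPLICE-style compensation), not the paper's exact VTS-based derivation; all function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(y, mu, var):
    """Log-density of a diagonal-covariance Gaussian at vector y."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var, axis=-1)

def mmse_compensate(y, weights, mus, variances, biases):
    """Posterior-weighted MMSE compensation of a log-spectrum vector y.

    E[x | y] is approximated as sum_m P(m | y) * (y + b_m), where the GMM
    (weights, mus, variances) models the test log-spectrum vectors and b_m
    is a per-mixture correction toward the training condition.
    """
    log_post = np.log(weights) + np.array(
        [gaussian_logpdf(y, mu, var) for mu, var in zip(mus, variances)])
    log_post -= np.max(log_post)   # subtract max for numerical stability
    post = np.exp(log_post)
    post /= post.sum()             # mixture posteriors P(m | y)
    return y + post @ biases       # posterior-weighted corrected vector
```

Here the number of mixtures M is simply the length of `weights`; the experiments in Table 6 correspond to varying this M, with small values (M = 4) fitting the spectrally flattened low-SNR test data best.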

Table 6 WERs (%) of ESniff MMSE
Table 7 WER (%) of ESniff MMSE (M = 4) method for Aurora 2 task using FE feature vectors
Figure 6

WERs (%) of ESniff SNR-MMSR and ESniff MMSE, in comparison with the conventional ESniff MMSR and MTR methods when using FE feature vectors.

In terms of average WER, ESniff MMSE outperforms ESniff SNR-MMSR for every mixture count except M = 128. The ESniff MMSE method generally shows significant performance improvements for both Set B and Set C, but its recognition accuracy is lower for Set A except when M = 4. This is expected, since ESniff MMSE was proposed to mitigate the effect of the noise difference between the noisy test and training speech. When the number of Gaussian distributions is not chosen appropriately, the approach adversely affects recognition performance for Set A, which consists of noisy speech with known noise types that require no additional compensation for noise type differences.

Figure 6 shows the WER of the proposed methods (ESniff MMSE and ESniff SNR-MMSR) compared with the MTR and conventional ESniff MMSR methods. The performance of MTR has been considered a benchmark in noisy speech recognition for the Aurora 2 task. As shown in the figure, the improvement of conventional ESniff MMSR over the MTR method is not significant. The ESniff SNR-MMSR method improves on conventional ESniff MMSR by a significant margin, but offers limited benefit over MTR for Set B because of the noise type mismatch between training and test speech. This performance loss due to noise type mismatch can be substantially reduced by the ESniff MMSE method. By choosing the number of Gaussian distributions to be less than 8, better recognition performance than MTR is achieved for Set B. With 4 Gaussian distributions, the best average WER of 10.76% is obtained, corresponding to a 21.3% relative WER reduction over the MTR method. This improvement is significant compared with conventional ESniff MMSR, which reduces relative WER by only 3.2% over MTR. Thus, in this study, speech recognition accuracy far better than that of conventional ESniff MMSR and the MTR method is achieved by employing the SNR mapping and the MMSE estimation of the log-spectrum vectors, all within an environmental sniffing framework. This illustrates that effective SNR estimation with an improved mapping selection results in improved HMM speech recognition in noisy environments.

5.2.2 Performance evaluation using AFE feature vectors

In Table 8, the performance of the proposed methods (ESniff MMSE and ESniff SNR-MMSR) is compared with the MTR and conventional ESniff MMSR methods when using AFE feature vectors. Compared to conventional ESniff MMSR, ESniff SNR-MMSR shows a significant performance improvement as expected, and this improvement is seen consistently for all three test sets (Set A, Set B, and Set C). This demonstrates that the SNR mapping remains effective and is not very sensitive to the specific feature vectors used. By employing ESniff MMSE, further improvements in average WER were obtained. The performance of ESniff MMSE is robust against changes in the number of Gaussian distributions: the average WER varies only slightly as the number of distributions is decreased from 128 to 4. This is in contrast to the results in Table 6, where significant performance variation with the number of Gaussian distributions was observed. This may be because the speech enhancement algorithm within the AFE has greatly reduced the noise in the test speech, so restricting the model to a small number of Gaussian distributions is no longer necessary.

Table 8 WERs (%) of ESniff MMSE

When using AFE feature vectors, the performance of conventional ESniff MMSR is found to be inferior to MTR, but both proposed methods (ESniff SNR-MMSR and ESniff MMSE) improve on MTR. However, the improvement is not as large as with FE feature vectors, and MTR still performs better than the proposed methods for Set B. The speech enhancement algorithm within the AFE appears to reduce the relative improvement of the proposed methods over MTR. More detailed results for MTR, ESniff MMSR, ESniff SNR-MMSR, and ESniff MMSE (M = 128) can be found in Tables 9, 10, 11, and 12.

Table 9 WER (%) of MTR method for Aurora 2 task using AFE feature vectors
Table 10 WER (%) of conventional ESniff MMSR method for Aurora 2 task using AFE feature vectors
Table 11 WER (%) of ESniff SNR-MMSR method for Aurora 2 task using AFE feature vectors
Table 12 WER (%) of ESniff MMSE (M = 128) method for Aurora 2 task using AFE feature vectors

6 Conclusions

This study demonstrated that an environmental sniffing based MMSR solution improves ASR performance over the conventional MTR method. However, the mismatch in noise type and SNR value between test and training speech makes it difficult for the MMSR to perform significantly better than the MTR method. In this study, we developed methods that improve on the conventional MMSR by reducing these mismatches for noisy speech recognition within an environmental sniffing framework.

For the SNR value mismatch, we experimentally determined the SNR mappings between the noisy test and training speech that give optimal recognition performance. We achieved an average WER of 12.40% on the Aurora 2 task using FE feature vectors, thereby reducing the average relative WER by 6.3% and 9.4% compared with the conventional MMSR and MTR methods, respectively. This is remarkable considering that the conventional MMSR method reduces relative WER by just 3.2% compared to the MTR method. Although the SNR mapping was determined using training data with known noise types, it was shown to generalize, improving recognition performance on noisy test speech with unknown additive noise types as well as convolutional noise.

The SNR mapping method improved on conventional MMSR by a significant margin, but its performance is still inferior to MTR for Set B due to the noise type mismatch between training and test speech. This issue was overcome by MMSE estimation of the training noisy log-spectrum given the test noisy speech. The performance of the MMSE method was found to depend on the number of Gaussian mixtures used to model the noisy log-spectrum vectors. Compared with the MTR and conventional MMSR methods, this solution showed improved performance over a wide range of Gaussian mixture counts. In particular, a small number of Gaussian mixtures was found to be adequate for modeling the noisy log-spectrum vectors. As expected, the performance improvement was most prominent for the test set with unknown additive noise types. The MMSE method combined with the SNR mapping reduced the relative WER of the MTR method by 21.3% when using FE feature vectors, a performance improvement that is quite remarkable compared with the conventional MMSR method. When employing the AFE feature vectors, the proposed methods also improved noisy speech recognition performance, although the relative improvement over conventional methods was somewhat reduced because of the speech enhancement algorithm integrated in the AFE.

In this study, we employed the SNR mapping and MMSE of the log-spectrum vectors in the environmental sniffing-based MMSR in an innovative way and achieved measurably improved speech recognition accuracy versus conventional MMSR as well as MTR methods. The results of this study show that an effective environmental sniffing framework coupled with improved SNR estimation and mapping, along with advanced noise modeling can improve overall speech recognition robustness.

References

  1. Boll SF: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust., Speech, Signal Process. 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209

  2. Gales MJF: Model-based techniques for noise-robust speech recognition. Dissertation, University of Cambridge, Cambridge; 1996.

  3. Moreno PJ: Speech recognition in noisy environments. Dissertation, Carnegie Mellon University, USA; 1996.

  4. Hansen JHL, Clements MA: Constrained iterative speech enhancement with application to speech recognition. IEEE Trans. Signal Process. 1991, 39(4):795-805. 10.1109/78.80901

  5. Kim W, Hansen JHL: Feature compensation in the cepstral domain employing model combination. Speech Commun. 2009, 51(2):83-96. 10.1016/j.specom.2008.06.004

  6. Lippmann RP, Martin EA, Paul DB: Multi-style training for robust isolated-word speech recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1987), Dallas, TX; 1987:705-708.

  7. Gong Y: Speech recognition in noisy environments: a survey. Speech Commun. 1995, 16:261-291. 10.1016/0167-6393(94)00059-J

  8. Xu H, Tan ZH, Dalsgaard P, Lindberg B: Robust speech recognition on noise and SNR classification – a multiple-model framework. Proceedings of INTERSPEECH 2005, Lisbon, Portugal; 2005:977-980.

  9. Hirsch HG, Pearce D: The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. Proceedings of the International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China; 2000:18-20.

  10. Chung YJ: Optimal SNR model selection in multiple-model based speech recognition systems. Proceedings of the First International Conference on Engineering and Technology Innovation (ICETI 2011), Kenting, Taiwan; 2011:154-159.

  11. Xu H, Tan ZH, Dalsgaard P, Lindberg B: Noise condition-dependent training based on noise classification and SNR estimation. IEEE Trans. Audio, Speech, Language Process. 2007, 15(8):2431-2443.

  12. Sagayama S, Yamaguchi Y, Takahashi S: Jacobian adaptation of noisy speech models. Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 1997), Santa Barbara, CA; 1997:396-403.

  13. Sarikaya R, Hansen JHL: Improved Jacobian adaptation for fast acoustic model adaptation in noisy speech recognition. Proceedings of INTERSPEECH 2000 (ICSLP), Beijing, China; 2000:702-705.

  14. Kim DY, Un CK, Kim NS: Speech recognition in noisy environments using first-order vector Taylor series. Speech Commun. 1998, 24(1):39-49. 10.1016/S0167-6393(97)00061-7

  15. Akbacak M, Hansen JHL: Environmental sniffing: noise knowledge estimation for robust speech systems. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong; 2003:113-116.

  16. Lamel L, Rabiner L, Rosenberg A, Wilpon J: An improved endpoint detector for isolated word recognition. IEEE Trans. Acoust., Speech, Signal Process. 1981, 29(4):777-785.

  17. Akbacak M, Hansen JHL: Environmental sniffing: noise knowledge estimation for robust speech systems. IEEE Trans. Audio, Speech, Language Process. 2007, 15(2):465-477.

  18. Kim W, Hansen JHL: A novel mask estimation method employing posterior-based representative mean estimate for missing-feature speech recognition. IEEE Trans. Audio, Speech, Language Process. 2011, 19(5):1434-1443.

  19. Young S: HTK: Hidden Markov Model Toolkit V3.4.1. Cambridge University Engineering Department Speech Group, Cambridge; 1993.

  20. ETSI draft standard doc: Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-End Feature Extraction Algorithm; Compression Algorithms. ETSI ES 201 108. 2000.

  21. ETSI draft standard doc: Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Advanced Front-End Feature Extraction Algorithm; Compression Algorithms. ETSI ES 202 050. 2002.

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011–0006994).

Author information

Corresponding author

Correspondence to Yongjoo Chung.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cite this article

Chung, Y., Hansen, J.H. Compensation of SNR and noise type mismatch using an environmental sniffing based speech recognition solution. J AUDIO SPEECH MUSIC PROC. 2013, 12 (2013). https://doi.org/10.1186/1687-4722-2013-12
