Robust Bayesian estimation for context-based speech enhancement

Abstract

Model-based speech enhancement algorithms that employ trained models, such as codebooks, hidden Markov models, Gaussian mixture models, etc., containing representations of speech such as linear predictive coefficients, mel-frequency cepstrum coefficients, etc., have been found to be successful in enhancing noisy speech corrupted by nonstationary noise. However, these models are typically trained on speech data from multiple speakers under controlled acoustic conditions. In this paper, we introduce the notion of context-dependent models that are trained on speech data with one or more aspects of context, such as speaker, acoustic environment, speaking style, etc. In scenarios where the modeled and observed contexts match, context-dependent models can be expected to result in better performance, whereas context-independent models are preferred otherwise. We present a Bayesian framework that automatically provides the benefits of both models under varying contexts. As several aspects of the context remain constant over an extended period during usage, a memory-based approach that exploits information from past data is employed. We use a codebook-based speech enhancement technique that employs trained models of speech and noise linear predictive coefficients as an example model-based approach. Using speaker, acoustic environment, and speaking style as aspects of context, we demonstrate the robustness of the proposed framework for different context scenarios, input signal-to-noise ratios, and numbers of contexts modeled.

1 Introduction

Speech enhancement pertains to the processing of speech corrupted by noise, echo, reverberation, etc., to improve its quality and intelligibility. In this paper, by speech enhancement, we refer to the problem of noise reduction. It is relevant in several scenarios; for example, mobile telephony in noisy environments, such as restaurants and busy traffic, suffers from degraded communication. Speech recognition systems [1] and hearing aids [2] also require speech enhancement as a preprocessing step.

Speech enhancement algorithms can be broadly classified into single- and multi-channel algorithms based on the number of microphones used to acquire the input noisy speech. Multi-channel algorithms exhibit superior performance because of the additional spatial information available about the noise and speech sources. However, the need for single-channel speech enhancement cannot be ignored. For example, single microphone systems are preferred in low-cost mobile units. In addition, multi-channel methods include a single-channel algorithm as a post-processing step to suppress diffuse noise. In this paper, we focus on single-channel speech enhancement.

Single-channel speech enhancement has been a challenging research problem for the last four decades. Several techniques have been devised to arrive at efficient solutions for the problem. Among these, spectral subtraction is one of the earliest and simplest techniques [3]. Herein, an estimate of the noise magnitude spectrum is subtracted from the observed noisy magnitude spectrum to obtain an estimate of the clean speech magnitude spectrum. Several variations of this technique have been developed over the years [4]-[7]. Methods based on a statistical model of speech that estimate the speech spectral amplitude, such as the minimum mean square error short-time spectral amplitude estimator (MMSE-STSA) method, have been found to be successful [8]-[10]. The statistical approach explicitly uses the probability density function (pdf) of the speech and noise DFT coefficients. It also allows consideration of non-Gaussian prior distributions [11] and different ways of modeling the spectral data [12],[13]. Subspace-based algorithms [14] assume the clean speech to be confined to a subspace of the noisy space. The noisy vector space is decomposed into noise-only and speech-plus-noise subspaces. The noise subspace components are suppressed, and the speech-plus-noise subspace components are further processed. A comprehensive survey of these techniques is provided in [15]. However, most of these methods depend on an accurate estimate of the noise power spectrum, for example, estimation of the noise magnitude spectrum during silent segments in [3], a priori signal-to-noise ratio (SNR) estimation in [9], or estimation of the noise covariance matrix in the subspace-based methods.

Noise estimation algorithms mainly include voice activity detector (VAD)-based [16],[17] and buffer-based methods [18]-[20]. While VADs are unreliable at low SNRs, the buffer-based methods are not fast enough to track quickly varying noise in nonstationary conditions. Thus, while these algorithms perform well in stationary noise, their accuracy deteriorates under nonstationary conditions. An improvement over these algorithms is provided in [21], wherein a recursive approach is employed for online noise power spectral density (PSD) tracking by analytically retrieving the prior and posterior probabilities of speech absence, and the noise statistics, using a maximum likelihood-based criterion. A low-complexity, fast noise tracking algorithm is proposed in [22],[23].

Speech enhancement algorithms which employ trained models, such as codebooks [24]-[28], hidden Markov models (HMM) [29]-[31], Gaussian mixture models (GMM) [32], non-negative matrix factorization (NMF) models [33], dictionaries [34], etc., for speech and noise data are able to process noisy speech with sufficient accuracy even under nonstationary noise conditions. For example, codebook-based speech enhancement (CBSE) algorithms [25],[26] estimate the noise power spectrum for short segments of noisy speech, thus tracking nonstationary noise better than the buffer-based methods [18]. However, model-based methods typically employ a priori speech models which are trained on speech data from multiple speakers. For applications where the input noisy speech more frequently originates from a particular speaker, such as in mobile telephony, it is desirable to exploit this speaker dependency for better speech enhancement. Similarly, it might be beneficial to consider models trained on or adapted to a specific acoustic environment or language. In this paper, we introduce the notion of context-dependent (CD) models, where by the word ‘context’, we refer to one or more aspects such as the speaker, acoustic environment, emotion, language, speaking style, etc. of the input noisy speech. By employing CD models, improved enhancement of noisy speech can be expected. These models can be adapted online from a context-independent (CI) model during high SNR regions of the input signal. In this paper, we assume the availability of such adapted CD models and focus on the enhancement using the converged models.

When the context of the noisy input matches the context of the data used to train the model, CD models are expected to result in better speech enhancement than CI models. We refer to such scenarios as context match scenarios. However, in practice, the modeled and observed contexts may not always match, leading to a context mismatch. In such scenarios, a CD model may lead to poorer results, and so the CI model would be preferred. Thus, what is required is a method that retains the benefits of both the CD and CI models and provides robust results irrespective of the scenario at hand.

In this paper, we introduce a Bayesian framework to optimally combine the estimates from the CD and CI models to achieve robust speech enhancement under varying contexts. As different aspects of context can be expected to remain constant for an extended duration in the input noisy signal, the framework uses past information to improve the estimation process. Also, in practice, different aspects of context may occur at the same time, so the framework is designed to accommodate several codebooks simultaneously.

As an example of a model-based algorithm, we use the CBSE technique that employs trained models of speech and noise linear predictive (LP) coefficients as priors [26]. A part of this work has been presented in [35]. This paper extends [35] by incorporating memory-based estimation, considering the use of multiple CD models, and presenting a detailed experimental analysis for different noise types, input SNRs, and aspects of context. The framework developed is general and can be used for other representations such as mel-frequency cepstrum coefficients and higher resolution PSDs, as well as other models such as GMMs, HMMs, and NMF.

The remainder of the paper is organized as follows. In the next section, a brief outline of the CBSE techniques [25],[26] is provided. Following this, we derive the memory-based Bayesian framework to optimally combine estimates from several codebooks (CD/CI). Thereafter, we present the experimental results for the proposed framework under varying contexts, noise types, and input SNRs. Finally, we summarize the conclusions.

2 Codebook-based speech enhancement

Consider an additive noise model of the observed noisy speech y(n):

$$y(n) = x(n) + w(n),$$
(1)

where n is the time index, x(n) is the clean speech signal, and w(n) is the noise signal.

We assume that speech and noise are statistically independent and follow zero-mean Gaussian distributions. Under these assumptions, Equation 1 leads to the following relation in the frequency domain:

$$P_y(\omega) = P_x(\omega) + P_w(\omega),$$
(2)

where P_y(ω), P_x(ω), and P_w(ω) are the PSDs of the observed noisy speech, clean speech, and noise, respectively, and ω is the angular frequency.

Consider a short-time segment of the observed noisy speech given by a vector y = [y(1), …, y(N)]^T, where N is the size of the segment. Let the vectors x and w be defined analogously. Let a_x = [a_x(0), …, a_x(p)] denote the vector of LP coefficients for the short-time speech segment x corresponding to y, with a_x(0) = 1 and p the speech LP model order. Similarly, let a_w = [a_w(0), …, a_w(q)] denote the LP coefficient vector for the short-time noise segment w corresponding to y, with a_w(0) = 1 and q the noise LP model order. Then, the speech and noise PSDs can be written as:

$$P_x(\omega) = \frac{g_x}{|A_x(\omega)|^2} \quad \text{and} \quad P_w(\omega) = \frac{g_w}{|A_w(\omega)|^2},$$
(3)

where g_x and g_w denote the variances of the prediction error for speech and noise, respectively; $A_x(\omega) = \sum_{k=0}^{p} a_x(k) e^{-j\omega k}$; and $A_w(\omega) = \sum_{k=0}^{q} a_w(k) e^{-j\omega k}$. Let

$$m_x = [a_x, g_x], \qquad m_w = [a_w, g_w].$$
(4)

Here, m_x is a model describing the speech PSD, and m_w describes the noise PSD. Codebook-driven speech enhancement techniques [25],[26] estimate m_x and m_w for each short-time segment: a_x and a_w are selected from trained codebooks of speech and noise LP coefficient vectors, C_x and C_w, respectively, and the gain terms g_x and g_w are computed online, resulting in good performance in nonstationary noise. A maximum likelihood approach is adopted in [25] and a Bayesian minimum mean squared error (MMSE) approach in [26].

The estimates m̂_x and m̂_w are used to construct a Wiener filter to enhance the noisy speech in the frequency domain:

$$H(\omega) = \frac{\hat{P}_x(\omega)}{\hat{P}_x(\omega) + \hat{P}_w(\omega)},$$
(5)

where P̂_x(ω) and P̂_w(ω) are estimates of the speech and noise PSDs, respectively, described by m̂_x and m̂_w. The Wiener filter is one example of a gain function; any other gain function can be employed using the obtained speech and noise PSD estimates.
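
To make the mapping from the LP parameters in Equation 4 to the PSDs of Equation 3 and the Wiener gain of Equation 5 concrete, a minimal Python/NumPy sketch is given below; the frame length is illustrative, and the small regularization constant only guards against division by zero.

```python
import numpy as np

def lp_psd(a, g, n_fft=256):
    """Envelope PSD P(omega) = g / |A(omega)|^2 (Equation 3) from LP
    coefficients a (with a[0] = 1) and prediction-error variance g,
    evaluated on a one-sided n_fft-point DFT grid."""
    A = np.fft.rfft(a, n_fft)              # A(omega) = sum_k a(k) e^{-j omega k}
    return g / (np.abs(A) ** 2 + 1e-12)

def wiener_gain(a_x, g_x, a_w, g_w, n_fft=256):
    """Wiener gain H(omega) of Equation 5, built from the estimated speech
    model m_x = [a_x, g_x] and noise model m_w = [a_w, g_w]."""
    P_x = lp_psd(a_x, g_x, n_fft)
    P_w = lp_psd(a_w, g_w, n_fft)
    return P_x / (P_x + P_w)

def enhance_frame(y_frame, a_x, g_x, a_w, g_w):
    """Apply the Wiener gain to one windowed noisy frame in the frequency
    domain and return the enhanced time-domain frame."""
    n_fft = len(y_frame)
    H = wiener_gain(a_x, g_x, a_w, g_w, n_fft)
    Y = np.fft.rfft(y_frame, n_fft)
    return np.fft.irfft(H * Y, n_fft)
```

In a complete system, the per-frame gains would be applied to 50% overlapped, windowed frames and the enhanced signal reconstructed by overlap-add.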

3 Bayesian estimation under varying contexts

In this section, we develop a Bayesian framework to obtain estimates of the speech and noise LP parameters, m_x and m_w, using one or more CD codebooks and a CI speech codebook. The CD codebooks improve estimation accuracy in the event of a context match, and the CI codebook provides robustness in the event of a context mismatch. The Bayesian framework needs to optimally combine the estimates from the various codebooks with no prior knowledge on whether or not the observed context matches the context modeled by the codebooks.

Consider K speech codebooks C_x^1, …, C_x^K, which include one or more CD codebooks and a CI codebook, depending on the contexts modeled. We consider a single noise codebook, C_w, corresponding to the encountered noise type. Robustness to different noise types can be provided by extending the notion of context dependency to the noise codebooks as well. To maintain the focus on context dependency in speech, we only consider a single noise codebook.

As m_x is a model for the speech PSD and m_w is a model for the noise PSD, m = [m_x, m_w] is a model for the noisy PSD, given by the sum of the corresponding speech and noise PSDs. We consider m to be a random variable and seek its MMSE estimate, given the noisy observation, the speech codebooks, and the noise codebook. Let ℳ_1 denote the collection of all models of the noisy PSD corresponding to the speech codebook C_x^1 and the noise codebook C_w. The set ℳ_1 consists of quadruplets [a_x^{1,i}, g_x, a_w^j, g_w], where a_x^{1,i} is the i-th vector from the speech codebook C_x^1, a_w^j is the j-th vector from the noise codebook C_w, and the gain terms g_x and g_w are computed online for each combination of a_x^{1,i} and a_w^j. Thus, ℳ_1 contains N_x^1 × N_w entries, where N_x^1 is the number of vectors in C_x^1 and N_w is the number of vectors in C_w. The sets ℳ_2, …, ℳ_K are defined similarly, corresponding to the speech codebooks C_x^2, …, C_x^K. Let ℳ be the collection of all the models m contained in all the K speech codebooks and the noise codebook, i.e.,

$$\mathcal{M} = \mathcal{M}_1 \cup \mathcal{M}_2 \cup \ldots \cup \mathcal{M}_K.$$
(6)

We consider the following K hypotheses:

H_k: speech codebook C_x^k best models the speech context for the current segment, 1 ≤ k ≤ K.

At a given time T, one of the K hypotheses is valid. This corresponds to a state, and we write S_T = H_k to denote that at time T, the most appropriate speech codebook for the observed noisy segment is C_x^k.

As mentioned in the introductory section, various aspects of context such as speaker, language, etc. can be expected to remain constant over multiple short-time segments, which can be exploited to improve estimation accuracy. The MMSE estimate of m for the T-th short-time segment is thus obtained using not just the current noisy segment y_T but a sequence that includes the current as well as past noisy segments, [y_1, …, y_T], where t is the segment index and y_t, 1 ≤ t ≤ T, is a vector containing N noisy speech samples. The MMSE estimate of m can be written as

$$\hat{m} = E[m \mid y_1, y_2, \ldots, y_T] = \sum_{k=1}^{K} p(S_T = H_k \mid y_1, y_2, \ldots, y_T)\, E[m \mid y_1, y_2, \ldots, y_T, S_T = H_k].$$
(7)

The two terms in the last line of (7) lend themselves to an intuitive interpretation. The second term, E[m | y_1, y_2, …, y_T, S_T = H_k], corresponds to an MMSE estimate of m assuming that the context is best described by H_k. The first term provides a relative importance score for this estimate, based on the likelihood that C_x^k is indeed the most appropriate speech codebook. The weighted summation corresponds to a soft estimation, which allows the coexistence of multiple contexts, e.g., speaker and language, each being modeled by a separate codebook. Next, we derive expressions for both these terms.

First, we consider the term p(S_T = H_k | y_1, y_2, …, y_T). Let

$$\alpha_T(k) = p(y_1, y_2, \ldots, y_T, S_T = H_k), \quad k = 1, 2, \ldots, K,$$
(8)

represent the forward probability as in standard HMM theory [36]. It can be recursively obtained as follows:

Basis step:

$$\alpha_1(k) = p(H_k)\, p(y_1 \mid H_k), \quad k = 1, 2, \ldots, K.$$
(9)

The prior probabilities in the absence of any observation can be assumed to be equal in Equation 9. Thus, p(H_k) = 1/K, i.e., all hypotheses are equally likely.

Induction step: The state S_T of the current noisy observation y_T could have been reached from any of the states of the previous frame with a particular transition probability. This can be modeled as

$$\alpha_{t+1}(k) = \left[\sum_{l=1}^{K} \alpha_t(l)\, a_{lk}\right] p(y_{t+1} \mid H_k),$$
(10)

where 1 ≤ t ≤ T−1, l, k = 1, 2, …, K, and a_{lk} represents the transition probability of reaching state k from state l. We assume the a priori transition probabilities to be known beforehand for a given set of speech codebooks. In this paper, we assume them to be fixed such that a_{lk} takes higher values when l = k than otherwise, to capture the intuition that we typically do not rapidly switch between contexts such as speaker and language. Note that only the a priori transition probabilities are assumed to be fixed. The data-dependent part in Equation 10 is captured by the term p(y_{t+1} | H_k), whose computation is addressed in the following. Using Equation 8,

$$p(S_T = H_k \mid y_1, y_2, \ldots, y_T) = \frac{p(y_1, y_2, \ldots, y_T, S_T = H_k)}{p(y_1, y_2, \ldots, y_T)} = \frac{\alpha_T(k)}{\sum_{k'=1}^{K} \alpha_T(k')}.$$
(11)
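
A minimal sketch of Equations 9 to 11 is given below (Python/NumPy). One induction step updates the rescaled forward probabilities and returns the normalized posterior weights of Equation 11; rescaling α at every step does not change these weights. The transition matrix shown is one possible choice for K = 2, with the values used later in Section 4.2.3.

```python
import numpy as np

def forward_update(alpha_prev, loglik, trans):
    """One induction step of Equation 10 followed by the normalization of
    Equation 11.

    alpha_prev : rescaled forward probabilities from the previous segment;
                 initialized to 1/K for every codebook (Equation 9 with
                 equal priors p(H_k) = 1/K),
    loglik     : log p(y_t | H_k) for each of the K codebooks (Equation 18),
    trans      : K x K matrix with trans[l, k] = a_lk.
    """
    lik = np.exp(loglik - np.max(loglik))   # shift logs for numerical safety
    alpha = (alpha_prev @ trans) * lik      # Equation 10
    return alpha / np.sum(alpha)            # Equation 11: posterior codebook weights

# Illustrative transition matrix for K = 2 codebooks, strongly favoring
# staying in the same context, and equal priors before any observation.
trans = np.array([[0.99, 0.01],
                  [0.01, 0.99]])
alpha = np.full(2, 0.5)
```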

Next, we consider the term E[m | y_1, y_2, …, y_T, S_T = H_k] in Equation 7. In this section, we are interested in exploiting memory to ensure that the codebook that is most relevant to the current context receives a high likelihood, and this is captured by Equation 11. For a given codebook, E[m | y_1, y_2, …, y_T, S_T = H_k] provides an improved estimate of m by exploiting not only the current noisy observation y_T but also the past noisy segments. An expression for this term can be derived as in [26], where memory was restricted to the previous frame in view of the signal nonstationarity. Here, to retain the focus on selecting the appropriate context, we assume

$$E[m \mid y_1, y_2, \ldots, y_T, S_T = H_k] = E[m \mid y_T, S_T = H_k].$$
(12)

In the following, we omit S_T from the notation and write E[m | y_T, S_T = H_k] as E[m | y_T, H_k] for brevity. For a given hypothesis H_k, we have

$$E[m \mid y_T, H_k] = \sum_{m \in \mathcal{M}} m\, p(m \mid y_T, H_k) = \sum_{m \in \mathcal{M}} m\, \frac{p(y_T \mid m, H_k)\, p(m \mid H_k)}{p(y_T \mid H_k)}.$$
(13)

Under a Gaussian LP model, m corresponds to an autocorrelation matrix R_y for y_T, which fully characterizes the pdf p(y_T | m) as in

$$p(y_T \mid m) = \frac{1}{(2\pi)^{N/2}\, |R_x + R_w|^{1/2}} \exp\!\left(-\frac{y_T^{\dagger}\, (R_x + R_w)^{-1}\, y_T}{2}\right),$$
(14)

where † denotes transpose, R_y = R_x + R_w, R_x = g_x (B_x^† B_x)^{−1}, R_w = g_w (B_w^† B_w)^{−1}, B_x is an N×N lower triangular Toeplitz matrix with [a_x, 0, …, 0]^† as the first column, and B_w is an N×N lower triangular Toeplitz matrix with [a_w, 0, …, 0]^† as the first column. Thus, given a model m, y_T is conditionally independent of H_k, and we have

$$p(y_T \mid m, H_k) = p(y_T \mid m), \quad k = 1, 2, \ldots, K.$$
(15)

The logarithm of the likelihood p(y_T | m) in Equation 14 can be efficiently computed in the frequency domain following the approach of [26]. The gain terms that maximize the likelihood can be computed as in [26].
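
The exact frequency-domain expression is derived in [26] and is not reproduced here; the sketch below (Python/NumPy) instead uses a standard Whittle-type approximation of the Gaussian log-likelihood in Equation 14 as a stand-in, comparing the periodogram of the noisy frame with the modeled noisy PSD P_x(ω) + P_w(ω). The gains g_x and g_w are assumed to be given rather than maximized online as in [26].

```python
import numpy as np

def log_likelihood(y_frame, P_x, P_w):
    """Whittle-type frequency-domain approximation of log p(y_T | m):
    the model PSD is the sum of the speech and noise envelope PSDs,
    evaluated on the same one-sided DFT grid as the periodogram."""
    n = len(y_frame)
    periodogram = np.abs(np.fft.rfft(y_frame, n)) ** 2 / n
    P_m = P_x + P_w + 1e-12
    return -0.5 * n * (np.log(2.0 * np.pi)
                       + np.mean(np.log(P_m) + periodogram / P_m))
```

Here, P_x and P_w can be obtained with lp_psd() from the earlier sketch for a given pair of codebook vectors and gains.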

Next, we consider the term p(m | H_k) in Equation 13. Under hypothesis H_k, the speech signal in the observed segment is best described by the codebook C_x^k. We assume that all the models resulting from a given codebook are equally likely. This assumption is valid, in general, if the codebook is large and derived from a large, phonetically balanced training set.

Thus, assuming all the models resulting from C_x^k are equally likely, we have

$$p(m \mid H_k) = \begin{cases} \dfrac{1}{|\mathcal{M}_k|}, & m \in \mathcal{M}_k \\[4pt] 0, & \text{otherwise}, \end{cases}$$
(16)

where |ℳ_k| is the cardinality of ℳ_k. From Equations 13 and 16, we have

$$E[m \mid y_T, H_k] = \frac{1}{|\mathcal{M}_k|} \sum_{m \in \mathcal{M}_k} m\, \frac{p(y_T \mid m)}{p(y_T \mid H_k)},$$
(17)

where

$$p(y_T \mid H_k) = \frac{1}{|\mathcal{M}_k|} \sum_{m \in \mathcal{M}_k} p(y_T \mid m),$$
(18)

and p(y_T | m) is given by Equation 14. Equation 18 is used in Equations 9 and 10 to obtain the forward probabilities. Finally, the required MMSE estimate m̂ is obtained by using Equations 11 and 17 in Equation 7. The speech and noise PSDs corresponding to m̂ can be obtained using Equation 3, and the Wiener filter from Equation 5. To ensure stability of the estimated LP parameters, the weighted sum in Equation 7 can be performed in the line spectral frequency domain. Note that the weights are non-negative and sum to unity, as is evident from Equation 11. Alternatively, as we are ultimately interested in the speech and noise PSDs to be used in a Wiener filter, the weighted sum can be performed in the power spectral domain.
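
Putting the pieces together, the sketch below (Python/NumPy) performs the soft combination in the power spectral domain, reusing lp_psd() and log_likelihood() from the earlier sketches. As a simplification, the gain terms are assumed to be stored with each model quadruplet rather than computed online as in [26].

```python
import numpy as np

def combined_wiener_gain(y_frame, codebook_models, weights, n_fft=256):
    """Soft combination of speech and noise PSD estimates over K codebooks
    in the power spectral domain (Equations 7, 11, 17, and 18).

    codebook_models : list of K lists; the k-th list holds the quadruplets
                      (a_x, g_x, a_w, g_w) of the set M_k,
    weights         : posterior codebook weights from Equation 11.
    """
    Px_hat = np.zeros(n_fft // 2 + 1)
    Pw_hat = np.zeros(n_fft // 2 + 1)
    for k, models in enumerate(codebook_models):
        # per-model likelihoods under hypothesis H_k (Equations 14 and 15)
        ll = np.array([log_likelihood(y_frame,
                                      lp_psd(ax, gx, n_fft),
                                      lp_psd(aw, gw, n_fft))
                       for (ax, gx, aw, gw) in models])
        post = np.exp(ll - ll.max())
        post /= post.sum()          # p(y_T|m) / sum_m p(y_T|m): Equations 17 and 18
        # per-hypothesis MMSE estimates of the speech and noise PSDs
        Px_k = sum(p * lp_psd(ax, gx, n_fft) for p, (ax, gx, _, _) in zip(post, models))
        Pw_k = sum(p * lp_psd(aw, gw, n_fft) for p, (_, _, aw, gw) in zip(post, models))
        Px_hat += weights[k] * Px_k
        Pw_hat += weights[k] * Pw_k
    return Px_hat / (Px_hat + Pw_hat)   # Wiener gain of Equation 5
```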

We conclude this section with some remarks on the calculation of the forward probabilities α_T, which, for a given codebook, capture how well that codebook matches the context of the T-th input segment. As mentioned earlier, the proposed framework can be used to model context in speech as well as in noise. When context is modeled by the speech codebooks, it was found beneficial to calculate α_T during speech-dominated segments, and during noise-dominated segments when the noise context is modeled. The goal in computing α_T is to assess how well a given speech codebook matches the underlying context for a given input segment. If this computation is performed during speech-dominated frames, we obtain accurate values for α_T. However, inaccurate weight values may result when the computation is based on segments that lack sufficient information about the speech, such as silence or low-energy segments dominated by noise. In such situations, it is preferable to reuse the value of α_T computed in the last speech-dominated segment. This, in other words, assumes that the context of the current segment is the same as that of the past segment. This assumption is valid in general, as the context of speech is not expected to change rapidly from one speech burst to another. Thus, updating α_T only during speech-dominated segments does not affect performance. However, estimating α_T only during speech-dominated segments suffers from the disadvantage that there may not be a sufficient number of such segments in highly noisy conditions. Introducing a preliminary noise reduction step, e.g., using the long-term noise estimate from [18], and estimating α_T from the enhanced signal was seen to address this problem. Importantly, the estimation of the speech and noise PSDs and the resulting Wiener filter occurs for each short-time segment, providing good performance under nonstationary noise conditions.
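
A small sketch of this gating is given below (Python/NumPy), reusing forward_update() from the earlier sketch; the segmental-SNR test and its threshold are illustrative stand-ins for the actual speech-dominance decision.

```python
import numpy as np

def gated_weight_update(y_enh_frame, noise_power, alpha_prev, loglik, trans,
                        snr_threshold_db=5.0):
    """Refresh the codebook weights only for speech-dominated segments
    (judged on the frame after preliminary noise reduction); otherwise
    reuse the weights from the last speech-dominated segment."""
    seg_snr_db = 10.0 * np.log10(np.mean(y_enh_frame ** 2) / (noise_power + 1e-12))
    if seg_snr_db > snr_threshold_db:         # speech-dominated segment
        return forward_update(alpha_prev, loglik, trans)
    return alpha_prev                         # noise-dominated: keep previous weights
```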

4 Experimental results

Experiments were performed to verify the robustness of the proposed framework under varying contexts. The contexts modeled by a trained CD codebook may or may not match with that of the observed noisy input signal, leading to two scenarios:

Context match: the best-case scenario for a CD codebook

Context mismatch: the worst-case scenario for a CD codebook

The robustness of the proposed framework, employing both CD and CI codebooks, was tested under both scenarios. Two different sets of experiments were performed, which differed in terms of the number of codebooks employed and the aspects of context modeled. The first set consisted of experiments with two speech codebooks, a CI speech codebook and a CD speech codebook, modeling the speaker and acoustic environment as aspects of context. The second set consisted of experiments with three speech codebooks, a CI speech codebook and two CD speech codebooks, to study the performance of the proposed framework with an increase in the number of codebooks employed. This set modeled, apart from speaker and acoustic environment, the speech type (normal, whisper, loud, etc.) of the input speech as an aspect of context.

In the following, we first describe the experimental setup and, thereafter, the various experiments along with the corresponding results.

4.1 Experimental setup

In all the experiments, the input noisy test utterances were enhanced under different context scenarios, using the CBSE technique [26] applied using the CD codebook alone, the CBSE technique applied using the CI codebook alone, and the proposed Bayesian scheme. We expect that in the context match scenarios, employing the CD codebook alone should lead to the best results. On the other hand, in the context mismatch scenarios, employing the CI codebook alone should lead to results better than those obtained using the CD codebook. The proposed method, however, is expected to provide robust results under varying contexts, i.e., results close to the best results in all scenarios. To serve as a reference for comparisons, we also include results when applying the Wiener filter (5) with a noise estimate obtained from a state-of-the-art noise estimation scheme [37].

The performance of these four processing schemes was compared using two measures: the improvement in segmental SNR (SSNR), referred to as Δ SSNR (in dB), and the improvement in the perceptual evaluation of speech quality (PESQ) [38] measure, referred to as Δ PESQ, both averaged over all the enhanced utterances considered in a particular experiment.

The speech codebooks used in the experiments were trained using the Linde-Buzo-Gray (LBG) algorithm [39]. First, the clean speech training utterances, resampled at 8 kHz, were segmented into 50% overlapped Hann windowed frames of size 256 samples each, corresponding to a duration of 32 ms wherein the speech signal can be assumed stationary. Then, LP coefficient vectors of dimension 10, extracted using these frames, were clustered using the LBG algorithm to generate speech codebooks of size 256 each using the Itakura-Saito (IS) distortion [40] as the error criterion.
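
The extraction of the LP training vectors described above can be sketched as follows (Python/NumPy); the frame length, overlap, window, and model order follow the setup of this subsection, while the Levinson-Durbin recursion and the silence check are standard steps. The subsequent LBG clustering with the IS distortion is not shown.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelations r[0..order] -> LP
    coefficients a (with a[0] = 1) and prediction-error variance."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def lp_training_vectors(signal, frame_len=256, hop=128, order=10):
    """Extract order-10 LP coefficient vectors from 50% overlapped,
    Hann-windowed frames of 256 samples (32 ms at 8 kHz)."""
    win = np.hanning(frame_len)
    vectors = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * win
        r = np.array([np.dot(frame[:frame_len - k], frame[k:])
                      for k in range(order + 1)])
        if r[0] <= 0.0:                     # skip silent frames
            continue
        a, _ = levinson_durbin(r, order)
        vectors.append(a)
    return np.array(vectors)
```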

For training the CI speech codebook, 180 English language utterances of duration 3 to 4 seconds each were used, from 25 male and 25 female speakers from the WSJ speech database [41]. This codebook served as the CI codebook for all the experiments described in this section. The speakers whose utterances were used to train the CI codebook were not used in the test utterances. The different experiments use different CD codebooks and input noisy test data, which are discussed later along with the description of each experiment.

The different CD and CI speech codebooks considered in the experiments are of large size (256) and are derived from a large number of phonetically balanced sentences from the WSJ database. Moreover, the LBG algorithm used to generate the speech codebooks computes cluster centroids in an optimal fashion. All these factors ensure the validity of the assumption about equal probability of models in Equation 16.

Two noise codebooks for two different noise types, traffic and babble, with eight entries each, were trained similarly using LP coefficient vectors. For the traffic noise codebook, LP coefficient vectors of order 6 extracted from 2 min of nonstationary traffic noise were used. Since babble noise is speech-like, a higher LP model order of 10 was used while extracting LP coefficient training vectors from approximately 3 min of nonstationary babble noise. The same noise types were also used in the creation of test utterances at 0, 5, and 10 dB SNR for all the experiments; the actual noise samples were different from those used in training. The active speech level was computed using method B of ITU-T P.56 [42], and the noise was scaled and added to obtain the desired SNR.
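
The generation of a noisy test utterance at a desired SNR can be sketched as follows (Python/NumPy); a simple energy-based activity test is used here as a stand-in for the ITU-T P.56 method B active speech level measurement, and its threshold is arbitrary.

```python
import numpy as np

def add_noise_at_snr(speech, noise, target_snr_db, frame_len=256):
    """Scale `noise` so that the ratio of active speech power to noise power
    equals the target SNR, then add it to the speech."""
    noise = noise[:len(speech)]
    frames = speech[:len(speech) // frame_len * frame_len].reshape(-1, frame_len)
    energies = np.mean(frames ** 2, axis=1)
    active = energies > 0.03 * energies.max()      # crude speech-activity proxy
    p_speech = energies[active].mean()
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return speech + gain * noise
```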

When processing the noisy files for a particular noise type, the appropriate noise codebook was used. In practice, a classified noise codebook scheme as discussed in [25] can be used. This scheme employs multiple noise codebooks, each trained for a particular noise type, and a maximum likelihood scheme is used to select the appropriate noise codebook for each short-time frame. This method was shown in [25] to perform as well as the case when the ideal noise codebook was used. We chose to use the ideal noise codebook to retain the focus on the performance of the proposed framework with regard to various aspects of the speech context.

4.2 Experiments with a single CD codebook

In this experiment, we test the proposed framework when two speech codebooks are employed, a CI and a CD codebook. The CD codebook models two aspects of context, ‘speaker’ and ‘acoustic environment’.

4.2.1 CD codebook training

For training the CD codebook, 180 English language utterances from a single speaker, of 3 to 4 s duration each, were used from the WSJ speech database. These utterances were convolved with an impulse response recorded at a distance of 50 cm from the microphone, in a reverberant room (T60 = 800 ms). This corresponds, for example, to hands-free mode on a mobile phone. In practice, this codebook is adapted during hands-free usage, making it dependent on both the speaker and acoustic environment.

4.2.2 Test utterances for the experiment

Two sets of ten clean speech utterances each were used to generate the noisy test data. Utterances for the first set were from the same speaker and acoustic environment as the data used to train the CD codebook, corresponding to the context match scenario and thus the best case for the CD codebook. The utterances themselves were different from those used in the training set.

The second set of clean utterances was from a speaker different from the one involved in training the CD codebook. These utterances were not convolved with the recorded impulse response (corresponding, for example, to hand-set mode on a mobile phone). Thus, both the speaker and acoustic environment were different from those used to train the CD codebook, corresponding to the context mismatch scenario and thus the worst case for the CD codebook.

4.2.3 Enhancement results

The test utterances were enhanced using the four schemes mentioned in Section 4.1. The transition probabilities a_{lk} were set to 0.99 when l = k and to 0.01 when l ≠ k, with l, k = 1, 2. Tables 1 and 2 provide the results for the best- and worst-case scenarios, respectively, in babble noise.

Table 1 Best-case scenario for a single CD codebook under babble noise
Table 2 Worst-case scenario for a single CD codebook under babble noise

As can be observed from Table 1, the best results are obtained for the CD codebook, as expected in a context match scenario. There is a significant difference between the results corresponding to the CD and CI codebooks, e.g., 0.19 for Δ PESQ and 1.3 dB for Δ SSNR, at 5 dB input SNR. Moreover, the standard deviation values indicate that the observed differences between the CD and CI results are statistically significant. This illustrates the benefit of employing CD codebooks. On the other hand, Table 2 demonstrates poorer performance when using the CD codebook compared to using the CI codebook, in a context mismatch scenario. The difference between their results is significant for Δ SSNR at all input SNRs, e.g., 1 dB at 0 dB input SNR, and for Δ PESQ at higher SNR, e.g., 0.22 at 10 dB input SNR. These results demonstrate the need for a scheme that appropriately combines the estimates obtained from the CD and CI codebooks, depending on the context at hand.

In Table 1, with increasing input SNR, there is an increase in Δ PESQ but a decrease in Δ SSNR for all schemes except the reference method. This can be explained by considering the trade-off between speech distortion and noise reduction.

In general, enhancement using a Wiener filter involves applying a gain (also called attenuation) function to the noisy speech, which attenuates both the speech and noise components. At lower input SNRs, the SSNR measure is dominated by the benefit of noise reduction and largely ignores the penalty due to speech distortion. In these scenarios, applying a stronger attenuation than is optimal can increase the output SSNR, as it results in more noise attenuation (it also results in more speech attenuation, but that is not captured by the SSNR measure). This situation occurs when using a mismatched codebook, where the clean speech PSD is underestimated, resulting in more severe attenuation of the noisy speech. PESQ is closer to human perception, and we believe that the effect of speech distortion is better captured by PESQ, resulting in negative Δ PESQ values in these scenarios. At higher input SNRs, the SSNR measure also captures the effect of speech distortion. Since Δ PESQ reflects the decrease in speech distortion with increasing input SNR, Δ PESQ increases with input SNR in Table 1. Conversely, since the SSNR measure at lower input SNRs is dominated by the benefit of noise reduction while ignoring the penalty due to speech distortion, Δ SSNR is larger at lower input SNRs than at higher input SNRs.

In contrast to the results obtained when using the CD and CI codebooks alone, the proposed framework achieves robust performance regardless of the observed context. For the best-case scenario (Table 1), its results are close to the CD results. For the worst-case scenario (Table 2), its results are close to the CI results. Thus, the proposed framework achieves results close to the best results for a given scenario, as desired. The reference scheme performs poorly due to the nonstationary nature of the noise. It may be noted that even using a mismatched codebook outperforms the reference scheme, highlighting the benefit of using a priori information for speech enhancement in nonstationary noise.

Tables 3 and 4 provide the results for the best- and worst-case scenarios, respectively, for the traffic noise case. Observations similar to those made from Tables 1 and 2 can be made regarding the need for both the CI and CD codebooks for better performance and the robust performance of the proposed framework under varying contexts. Again, the reference method performs poorly due to the nonstationary nature of the noise.

Table 3 Best-case scenario for a single CD codebook under traffic noise
Table 4 Worst-case scenario for a single CD codebook under traffic noise

Comparing Δ PESQ values for the best-case scenarios in Tables 1 and 3 for the two noise types shows that there is a sharper drop in values from 5 to 0 dB input SNR for traffic noise (0.2) than for babble noise (0.06). A similar observation can be made for the Δ PESQ values for the worst-case scenarios in Tables 2 and 4. These observations indicate that the traffic noise case is more difficult to handle than babble noise at 0 dB input SNR, because the traffic noise considered in the experiments is highly nonstationary compared to the babble noise.

4.2.4 Comparison of the proposed method with the MMSE-STSA method

In the above experiments, the reference method chosen for comparison with the proposed method uses the Wiener gain, as described by (5), computed using a state-of-the-art noise estimator [37]. This choice provides a fair comparison, as the proposed method also employs the Wiener gain function. The two approaches, however, differ in how the speech and noise PSDs used in the Wiener gain are computed.

Also of interest is a comparison of the proposed method with a popular statistical approach such as the MMSE-STSA method [9], the results of which are provided in Tables 5 and 6 for the babble noise case. Table 5 corresponds to the context match scenario, wherein the context of the CD codebook matches that of the input noisy speech. Here, the performance of the proposed method is superior to that of the MMSE-STSA technique, especially in terms of the PESQ values, and the advantage of the proposed approach is larger at lower SNR values. For the mismatch scenario, the performance of the two methods is comparable, as shown in Table 6. Note that the Wiener filter is just one example of a gain function that can use the speech and noise PSDs estimated using the proposed method. The estimated speech and noise PSDs can also be used to compute the a priori and a posteriori SNRs for use in the MMSE-STSA gain function. This is, however, beyond the scope of this paper and is a topic for future work.

Table 5 Comparison of the proposed method with the MMSE-STSA technique for context match scenario corresponding to Table 1
Table 6 Comparison of the proposed method with the MMSE-STSA technique for context mismatch scenario corresponding to Table 2

4.3 Experiments with multiple CD codebooks

In the previous subsection, we tested the proposed framework under conditions when a single CD codebook was employed along with a CI codebook. Multiple aspects of context were modeled by the single CD codebook. In practice, different contexts will be modeled by different CD codebooks. In this subsection, we experiment with the case of two CD codebooks along with one CI codebook.

4.3.1 CD codebook training

The first CD codebook, referred to as CD-1, models a particular speaker and a speech type. The speech type considered is ‘whisper’ speech. The speech produced in the case of certain speech disorders (dysphonic speech) is similar to whispered speech. CD-1 was trained using around 10 min of whispered speech data from a single speaker from the CHAINS database [43].

The second CD codebook employed, referred to as CD-2, models normal speech in reverberant conditions for the same speaker as modeled by CD-1. CD-2 was trained using training utterances of duration around 10 min, convolved with the same impulse response as used in the previous experiments (corresponding to a distance of 50 cm from the microphone, in a reverberant room with T60 = 800 ms).

The two codebooks differ in terms of speaking style, whispered and normal, and also the acoustic environment. The separation in terms of acoustic environment is useful, e.g., to have different CD models for a particular user of the mobile phone to cater to hand-set and hands-free modes of operation. Note that the CI codebook is speaker-independent and corresponds to hand-set mode.

4.3.2 Test utterances for the experiment

Two sets of experiments were performed, corresponding to the matching codebook being CD-1 or CD-2. The first set consisted of test utterances generated by adding noise to ten clean ‘whispered’ speech utterances from the same speaker as used in training the CD-1 codebook. Similarly, the second set of experiments had test utterances generated using ten clean ‘normal’ speech utterances from the same speaker as in CD-2, convolved with the same recorded impulse response as used in training CD-2, to constitute the context match scenario for CD-2. In both sets of experiments, the test utterances were different from those used in training the codebooks. The noisy test utterances were generated as described in Section 4.1.

4.3.3 Enhancement results

Enhancement using multiple CD codebooks was performed by setting the transition probabilities a_{lk} to 0.9 when l = k and to 0.05 when l ≠ k, with l, k = 1, 2, 3. Tables 7 and 8 present the matching scenario results for CD-1 and CD-2, respectively, for the babble noise case. Similarly, Tables 9 and 10 present the matching scenario results for CD-1 and CD-2, respectively, for the traffic noise case. As can be observed from these tables, the best results for all the scenarios occur for the matching CD codebook. The difference between context match and mismatch (between CD-1 and CD-2/CI, and between CD-2 and CD-1/CI) is significant, especially in the Δ PESQ scores; the differences in Δ SSNR values are significant at higher input SNRs. As the number of codebooks employed by the proposed framework increases, there is a possibility of a negative influence from the inappropriate codebooks on the final model estimate. But from Tables 7, 8, 9, and 10, we observe that for the case of two CD codebooks and one CI codebook, the results for the proposed framework are close to those of the matched codebook at all input SNRs and for both noise types, confirming the robustness of the proposed framework under varying contexts.

Table 7 Results using two CD codebooks and one CI codebook, for context match scenario for CD-1 under babble noise
Table 8 Results using two CD codebooks and one CI codebook, for context match scenario for CD-2 under babble noise
Table 9 Results using two CD codebooks and one CI codebook, for context match scenario for CD-1 under traffic noise
Table 10 Results using two CD codebooks and one CI codebook, for context match scenario for CD-2 under traffic noise

5 Conclusions

In this paper, we have introduced the notion of context-dependent (CD) models for speech enhancement methods that use trained models of speech and noise parameters. CD speech models can be trained on one or more aspects of speech context such as speaker, acoustic environment, speaking style, etc., and CD noise models can be trained for specific noise types. Using CD models results in better speech enhancement performance compared to using context-independent (CI) models when the noisy speech shares the same context as the trained codebook. The risk, however, is degraded performance in the event of a context mismatch. Thus, the CD and CI models need to co-exist in a practical implementation. The Bayesian speech enhancement framework proposed in this paper obtains estimates of speech and noise parameters based on all available models, requires no prior information on the context at hand, and automatically obtains results close to those obtained when using the appropriate codebook for a given context scenario as seen from experiments with various aspects of speech context.

The improved performance of the proposed method is at the cost of increased computational complexity. As opposed to employing a single CI model, the proposed method involves computations with multiple models. The computations related to each model can, however, occur simultaneously, which allows for a parallel implementation.

The proposed method has been developed using the codebook-based speech enhancement system as an example of a data-driven model-based speech enhancement system. Other model-based schemes, such as those using HMMs, GMMs, and NMF, can benefit in a similar manner, and the extension is a topic for future work. The theory developed in this paper is directly applicable to context-dependent noise codebooks and can be used for robust noise estimation under varying noise conditions.

In this paper, context-dependent models are assumed to be available. In practice, they need to be trained online. For several aspects of context, a separate enrollment stage may not be meaningful and the models need to be progressively adapted during usage when the SNR is high. Distinguishing between different aspects of context and training separate models for them online is another topic for future work.

The codebooks considered in this paper consist of vectors of tenth-order LP coefficients, which model the smoothed spectral envelope. It will be worthwhile to investigate the suitability of other spectral representations such as higher resolution PSDs, mel-frequency cepstral coefficients, etc., to capture context-dependent information. Different features may be employed depending on which aspects of context are to be modeled and depending on the application, e.g., whether the enhancement is for speech communication, speaker identification, or for speech recognition.

Authors’ information

This work was performed when SS was with Philips Research Laboratories, Eindhoven, The Netherlands.

References

  1. Schuller B, Wöllmer M, Moosmayr T, Rigoll G: Recognition of noisy speech: a comparative survey of robust model architecture and feature enhancement. EURASIP J. Audio Speech Music Process 2009, 2009: 1-17. 10.1155/2009/942617

  2. Hamacher V, Chalupper J, Eggers J, Fischer E, Kornagel U, Puder H, Rass U: Signal processing in high-end hearing aids: state of the art, challenges, and future trends. EURASIP J. Appl. Signal Process 2005, 2005(18):2915-2929. 10.1155/ASP.2005.2915

  3. Boll SF: Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process 1979, 27(2):113-120. 10.1109/TASSP.1979.1163209

  4. M Berouti, M Schwartz, J Makhoul, in Proceedings of the IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP). Enhancement of speech corrupted by acoustic noise (Washington D. C., 2–4 April 1979), pp. 208–211.

  5. S Kamath, P Loizou, in Proceedings of the IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise (Orlando, 13–17 May 2002), pp. IV-4164.

  6. Lu Y, Loizou PC: A geometric approach to spectral subtraction. Speech Commun. 2008, 50(6):453-466. 10.1016/j.specom.2008.01.003

  7. Paliwal K, Schwerin B, Wojcicki K: Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator. Speech Commun. 2012, 54(2):282-305. 10.1016/j.specom.2011.09.003

  8. McAulay RJ, Malpass ML: Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. Acoust. Speech Signal Process 1980, 28(2):137-145. 10.1109/TASSP.1980.1163394

  9. Ephraim Y, Malah D: Speech enhancement using a minimum mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process 1984, 32(6):1109-1121. 10.1109/TASSP.1984.1164453

  10. Plourde E, Champagne B: Multidimensional STSA estimators for speech enhancement with correlated spectral components. IEEE Trans. Sig. Proc 2011, 59(7):3013-3024. 10.1109/TSP.2011.2138697

  11. Borgstrom BJ, Alwan A: A unified framework for designing optimal STSA estimators assuming maximum likelihood phase equivalence of speech and noise. IEEE Trans. Audio Speech Language Process 2011, 19(8):2579-2590. 10.1109/TASL.2011.2156784

  12. Andrianakis Y, White PR: Speech enhancement algorithm based on a Chi MRF of the speech STFT amplitudes. IEEE Trans. Acoust. Speech Signal Process 2009, 17(8):1508-1517.

  13. McCallum M, Guillemin B: Stochastic-deterministic MMSE STFT speech enhancement with general a priori information. IEEE Trans. Audio Speech Language Process 2013, 21(7):1445-1457. 10.1109/TASL.2013.2253100

  14. Ephraim Y, Van Trees HL: A signal subspace approach for speech enhancement. IEEE Trans. Acoust. Speech Signal Process 1995, 3(4):251-266. 10.1109/89.397090

  15. Loizou P: Speech Enhancement: Theory and Practice. CRC Press, Boca Raton; 2007.

  16. K Srinivasan, A Gersho, in Proceedings of the IEEE Speech Coding Workshop. Voice activity detection for cellular networks (Sainte-Adèle, 13–15 October 1993), pp. 85–86.

  17. Gorriz J, Ramirez J, Lan E, Puntonet C: Jointly Gaussian pdf-based likelihood ratio test for voice activity detection. IEEE Trans. Audio Speech Language Process 2009, 16(8):1565-1578. 10.1109/TASL.2008.2004293

  18. Martin R: Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process 2001, 9(4):504-512. 10.1109/89.928915

  19. Cohen I: Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Trans. Acoust. Speech Signal Process 2003, 11(5):466-475. 10.1109/TSA.2003.811544

  20. Erkelens JS, Heusdens R: Tracking of nonstationary noise based on data-driven recursive noise power estimation. IEEE Trans. Audio, Speech, Language Process 2008, 16(6):1112-1123. 10.1109/TASL.2008.2001108

  21. Souden M, Delcroix M, Kinoshita K, Yoshioka T, Nakatani T: Noise power spectral density tracking: a maximum likelihood perspective. IEEE Sig. Process. lett. 2012, 19(8):495-498. 10.1109/LSP.2012.2204048

  22. R Hendriks, R Heusdens, J Jensen, in Proc. of IEEE International Conf. on Acoustics Speech and Signal Processing (ICASSP), 2010. MMSE based noise PSD tracking with low complexity (Dallas, 14–19 March 2010), pp. 4266–4269.

  23. Gerkmann T, Hendriks R: Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio, Speech, Language Process 2012, 20(4):1383-1393. 10.1109/TASL.2011.2180896

  24. Sreenivas TV, Kirnapure P: Codebook constrained Wiener filtering for speech enhancement. IEEE Trans. Acoust. Speech Signal Process 1996, 4(5):383-389. 10.1109/89.536932

  25. Srinivasan S, Samuelsson J, Kleijn WB: Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Trans. Audio Speech Language Process 2006, 14(1):163-176. 10.1109/TSA.2005.854113

  26. Srinivasan S, Samuelsson J, Kleijn WB: Codebook-based Bayesian speech enhancement for nonstationary environments. IEEE Trans. Audio Speech Language Process 2007, 15(2):441-452. 10.1109/TASL.2006.881696

  27. Xiao X, Nickel RM: Speech enhancement with inventory style speech resynthesis. IEEE Trans. Audio, Speech Language Process 2010, 18(6):1243-1257. 10.1109/TASL.2009.2031793

  28. Rosenkranz T, Puder H: Improving robustness of codebook-based noise estimation approaches with delta codebooks. IEEE Trans. Audio Speech Language Process 2012, 20(4):1177-1188. 10.1109/TASL.2011.2172943

  29. Sameti H, Sheikhzadeh H, Deng L: HMM-based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Trans. Acoust. Speech Signal Process 1998, 6(5):445-455. 10.1109/89.709670

  30. Zhao DY, Kleijn WB: HMM-based gain-modeling for enhancement of speech in noise. IEEE Trans. Audio Speech Language Process 2007, 15(3):882-892. 10.1109/TASL.2006.885256

  31. Veisi H, Sameti H: Speech enhancement using hidden Markov models in Mel-frequency domain. Speech Commun. 2013, 55(2):205-220. 10.1016/j.specom.2012.08.005

  32. Hao J, Lee T-W, Sejnowski TJ: Speech enhancement using Gaussian scale mixture models. IEEE Trans. Audio Speech Language Process 2010, 18(6):1127-1136. 10.1109/TASL.2009.2030012

  33. Mohammadiha N, Smaragdis P, Leijon A: Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio Speech Language Process 2013, 21(10):2140-2151. 10.1109/TASL.2013.2270369

  34. Sigg C, Dikk T, Buhmann J: Speech enhancement using generative dictionary learning. IEEE Trans. Audio, Speech, Language Process. 2012, 20(6):1698-1712. 10.1109/TASL.2012.2187194

  35. DHR Naidu, S Srinivasan, in Proceedings of the IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP). A Bayesian framework for robust speech enhancement under varying contexts (Kyoto, 25–30 March 2012), pp. 4557–4560.

  36. Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77(2):257-286. 10.1109/5.18626

  37. Rangachari S, Loizou P: A noise estimation algorithm for highly nonstationary environments. Speech Commun. 2006, 28: 220-231. 10.1016/j.specom.2005.08.005

  38. A Rix, J Beerends, M Hollier, A Hekstra, in Proceedings of the IEEE Int. Conf. Acoust. Speech Signal Processing (ICASSP). Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs (Salt Lake City, 7–11 May 2001), pp. 749–752.

  39. Linde Y, Buzo A, Gray RM: An algorithm for vector quantizer design. IEEE Trans. Commun 1980, 28(1):84-95. 10.1109/TCOM.1980.1094577

  40. Gray R, Buzo A, Gray A, Matsuyama Y: Distortion measures for speech processing. IEEE Trans. Acoust. Speech Signal Process 1980, 28(4):367-376. 10.1109/TASSP.1980.1163421

  41. CSR-II (WSJ1) Complete LDC94S13A. DVD. Philadelphia: Linguistic Data Consortium (1994).

  42. ITU-T Rec. P.56, Objective measurement of active speech level. International Telecommunication Union, CH-Geneva (1993).

  43. F Cummins, M Grimaldi, T Leonard, J Simko, in Proceedings of the International Conference on Speech and Computer (SPECOM). The CHAINS corpus: characterizing individual speakers (St Petersburg, 2006), pp. 431–435.

Acknowledgements

The authors would like to thank Prof. G. V. Prabhakara Rao, Head, Department of Information Technology, Rajiv Gandhi Memorial College of Engineering and Technology, Nandyal, Andhra Pradesh, India, for valuable discussions on this topic.

Author information

Correspondence to Devireddy Hanumantha Rao Naidu.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Rao Naidu, D.H., Srinivasan, S. Robust Bayesian estimation for context-based speech enhancement. J AUDIO SPEECH MUSIC PROC. 2014, 35 (2014). https://doi.org/10.1186/s13636-014-0035-4
