Robust Bayesian estimation for context-based speech enhancement
© Naidu and Srinivasan; licensee Springer. 2014
Received: 31 October 2013
Accepted: 18 August 2014
Published: 12 September 2014
Model-based speech enhancement algorithms that employ trained models, such as codebooks, hidden Markov models, Gaussian mixture models, etc., containing representations of speech such as linear predictive coefficients, mel-frequency cepstrum coefficients, etc., have been found to be successful in enhancing noisy speech corrupted by nonstationary noise. However, these models are typically trained on speech data from multiple speakers under controlled acoustic conditions. In this paper, we introduce the notion of context-dependent models that are trained on speech data with one or more aspects of context, such as speaker, acoustic environment, speaking style, etc. In scenarios where the modeled and observed contexts match, context-dependent models can be expected to result in better performance, whereas context-independent models are preferred otherwise. In this paper, we present a Bayesian framework that automatically provides the benefits of both models under varying contexts. As several aspects of the context remain constant over an extended period during usage, a memory-based approach that exploits information from past data is employed. We use a codebook-based speech enhancement technique that employs trained models of speech and noise linear predictive coefficients as an example model-based approach. Using speaker, acoustic environment, and speaking style as aspects of context, we demonstrate the robustness of the proposed framework for different context scenarios, input signal-to-noise ratios, and number of contexts modeled.
Keywords: Bayesian; Codebook; Context; Noise reduction; Speech enhancement
1 Introduction
Speech enhancement pertains to the processing of speech corrupted by noise, echo, reverberation, etc. to improve its quality and intelligibility. In this paper, by speech enhancement, we refer to the problem of noise reduction. It is relevant in several scenarios. For example, mobile telephony in noisy environments, such as restaurants and busy traffic, suffers from unclear communication. Speech recognition units and hearing aids also require speech enhancement as a preprocessing step.
Speech enhancement algorithms can be broadly classified into single- and multi-channel algorithms based on the number of microphones used to acquire the input noisy speech. Multi-channel algorithms exhibit superior performance because of the additional spatial information available about the noise and speech sources. However, the need for single-channel speech enhancement cannot be ignored. For example, single microphone systems are preferred in low-cost mobile units. In addition, multi-channel methods include a single-channel algorithm as a post-processing step to suppress diffuse noise. In this paper, we focus on single-channel speech enhancement.
Single-channel speech enhancement has been a challenging research problem for the last four decades, and several techniques have been devised to address it. Among these, spectral subtraction is one of the earliest and simplest: an estimate of the noise magnitude spectrum is subtracted from the observed noisy magnitude spectrum to obtain an estimate of the clean speech magnitude spectrum. Several variations of this technique have been developed over the years. Methods that use a statistical model of speech to estimate the speech spectral amplitude, such as the minimum mean square error short-time spectral amplitude (MMSE-STSA) estimator, have been found to be successful. The statistical approach explicitly uses the probability density function (pdf) of the speech and noise DFT coefficients. It also allows consideration of non-Gaussian prior distributions and different ways of modeling the spectral data. Subspace-based algorithms assume the clean speech to be confined to a subspace of the noisy space. The noisy vector space is decomposed into noise-only and speech-plus-noise subspaces; the noise subspace components are suppressed, and the speech-plus-noise subspace components are further processed. A comprehensive survey of these techniques is available in the literature. However, most of these methods depend on an accurate estimate of the noise power spectrum, for example, estimation of the noise magnitude spectrum during silent segments in spectral subtraction, a priori signal-to-noise ratio (SNR) estimation in the MMSE-STSA method, or estimation of the noise covariance matrix in the subspace-based methods.
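As an illustration of the basic spectral subtraction step described above, the following is a minimal sketch for a single frame. The function name and the `floor` parameter are hypothetical; the flooring rule is one common way to avoid negative magnitudes and is not taken from any specific method surveyed here.

```python
def spectral_subtraction(noisy_mag, noise_mag, floor=0.0):
    """Subtract a noise magnitude estimate from a noisy magnitude spectrum.

    noisy_mag and noise_mag are per-bin magnitudes for one frame.
    floor (hypothetical parameter) keeps each output bin at or above
    floor * noisy magnitude, so negative differences are clipped.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


# One frame with three frequency bins; the second bin is noise-dominated.
clean_estimate = spectral_subtraction([1.0, 0.5, 2.0], [0.4, 0.6, 0.5])
```

The quality of the result depends directly on how accurate `noise_mag` is, which is the dependence on noise estimation noted in the text.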
Noise estimation algorithms mainly include voice activity detector (VAD)-based and buffer-based methods. While VADs are unreliable at low SNRs, the buffer-based methods are not fast enough to track quickly varying noise. Thus, while these algorithms perform well in stationary noise, their accuracy deteriorates under nonstationary conditions. An improvement over these algorithms is a recursive approach for online noise power spectral density (PSD) tracking that analytically retrieves the prior and posterior probabilities of speech absence, and the noise statistics, using a maximum likelihood-based criterion. Low-complexity, fast noise tracking algorithms have also been proposed.
Speech enhancement algorithms that employ trained models of speech and noise data, such as codebooks, hidden Markov models (HMM), Gaussian mixture models (GMM), non-negative matrix factorization (NMF) models, and dictionaries, are able to process noisy speech with sufficient accuracy even under nonstationary noise conditions. For example, codebook-based speech enhancement (CBSE) algorithms estimate the noise power spectrum for short segments of noisy speech, thus tracking nonstationary noise better than the buffer-based methods. However, model-based methods typically employ a priori speech models that are trained on speech data from multiple speakers. For applications where the input noisy speech comes mostly from a particular speaker, such as in mobile telephony, it is desirable to exploit the speaker dependency for better speech enhancement. Similarly, it might be beneficial to consider models trained on or adapted to a specific acoustic environment or language. In this paper, we introduce the notion of context-dependent (CD) models, where by the word ‘context’, we refer to one or more aspects of the input noisy speech, such as the speaker, acoustic environment, emotion, language, and speaking style. By employing CD models, improved enhancement of noisy speech can be expected. These models can be adapted online from a context-independent (CI) model during high SNR regions of the input signal. In this paper, we assume the availability of such adapted CD models and focus on the enhancement using the converged models.
When the context of the noisy input matches the context of the data used to train the model, CD models are expected to result in better speech enhancement than CI models. We refer to such scenarios as context match scenarios. However, in practice, the modeled and observed contexts may not always match, leading to a context mismatch. In such scenarios, a CD model may lead to poorer results, and so the CI model would be preferred. Thus, what is required is a method that retains the benefits of both the CD and CI models and provides robust results irrespective of the scenario at hand.
In this paper, we introduce a Bayesian framework to optimally combine the estimates from the CD and CI models to achieve robust speech enhancement under varying contexts. As different aspects of context can be expected to remain constant for an extended duration in the input noisy signal, the framework considers past information to improve the estimation process. Also, in practice, different aspects of context may occur at the same time. So, the framework is designed to include several codebooks at the same time.
As an example of the model-based algorithm, we use the CBSE technique that employs trained models of speech and noise linear predictive (LP) coefficients as priors. A part of this work has been presented previously. This paper extends that work by incorporating memory-based estimation, considering the use of multiple CD models, and presenting a detailed experimental analysis for different noise types, input SNRs, and aspects of context. The framework developed is general and can be used for other representations such as mel-frequency cepstrum coefficients and higher resolution PSDs, as well as other models such as GMMs, HMMs, and NMF.
The remainder of the paper is organized as follows. In the next section, a brief outline of the CBSE techniques is provided. Following this, we derive the memory-based Bayesian framework to optimally combine estimates from several codebooks (CD/CI). Thereafter, we present the experimental results for the proposed framework under varying contexts, noise types, and input SNRs. Finally, we summarize the conclusions.
2 Codebook-based speech enhancement
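Consider the additive noise model

$$y(n) = x(n) + w(n), \quad (1)$$

where $n$ is the time index, $y(n)$ is the observed noisy speech signal, $x(n)$ is the clean speech signal, and $w(n)$ is the noise signal.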
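Assuming that the speech and noise are uncorrelated, the PSDs satisfy

$$P_y(\omega) = P_x(\omega) + P_w(\omega), \quad (2)$$

where $P_y(\omega)$, $P_x(\omega)$, and $P_w(\omega)$ are the PSDs of the observed noisy speech, clean speech, and noise, respectively, and $\omega$ is the angular frequency.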
$m_x$ is a model describing the speech PSD, and $m_w$ describes the noise PSD. Codebook-driven speech enhancement techniques estimate $m_x$ and $m_w$ for each short-time segment: the LP coefficient vectors $a_x$ and $a_w$ are selected from trained codebooks of speech and noise LP coefficients, $C_x$ and $C_w$, respectively, and the gain terms $g_x$ and $g_w$ are computed online, resulting in good performance in nonstationary noise. Both a maximum likelihood approach and a Bayesian minimum mean squared error (MMSE) approach have been adopted for this estimation.
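The estimated parameters are used to construct a Wiener filter

$$H(\omega) = \frac{\hat{P}_x(\omega)}{\hat{P}_x(\omega) + \hat{P}_w(\omega)}, \quad (5)$$

where $\hat{P}_x(\omega)$ and $\hat{P}_w(\omega)$ are estimates of the speech and noise PSDs, respectively, described by $\hat{m}_x$ and $\hat{m}_w$. The Wiener filter is one example of a gain function, and any other gain function can be employed using the obtained speech and noise PSD estimates.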
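To make the link between LP parameters and the gain function concrete, the following sketch evaluates an all-pole PSD from a vector of LP coefficients and a gain, and then forms a Wiener gain per frequency bin. It is an illustration under assumptions: the sign convention $A(\omega) = 1 + \sum_k a_k e^{-j\omega k}$, the omission of the leading 1 from the coefficient vector, and all function names are choices made here, not details confirmed by the text.

```python
import cmath
import math


def ar_psd(lp_coeffs, gain, omega):
    """PSD of an all-pole (LP) model at angular frequency omega.

    Assumes A(omega) = 1 + sum_k a_k * exp(-j * omega * k), k = 1..p,
    with lp_coeffs = [a_1, ..., a_p] (the leading 1 is not stored).
    """
    a_of_omega = 1.0 + sum(
        a * cmath.exp(-1j * omega * (k + 1)) for k, a in enumerate(lp_coeffs)
    )
    return gain / abs(a_of_omega) ** 2


def wiener_gain(p_x, p_w):
    """Wiener gain for one frequency bin from speech and noise PSD estimates."""
    return p_x / (p_x + p_w)


def wiener_filter(a_x, g_x, a_w, g_w, n_bins=129):
    """Gain per bin on a uniform grid of n_bins frequencies in [0, pi]."""
    gains = []
    for i in range(n_bins):
        omega = math.pi * i / (n_bins - 1)
        p_x = ar_psd(a_x, g_x, omega)
        p_w = ar_psd(a_w, g_w, omega)
        gains.append(wiener_gain(p_x, p_w))
    return gains
```

Because the gain depends only on the two PSD estimates, any other gain function could be substituted for `wiener_gain` without changing the rest of the sketch.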
3 Bayesian estimation under varying contexts
In this section, we develop a Bayesian framework to obtain estimates of the speech and noise LP parameters, $m_x$ and $m_w$, using one or more CD codebooks and a CI speech codebook. The CD codebooks improve estimation accuracy in the event of a context match, and the CI codebook provides robustness in the event of a context mismatch. The Bayesian framework needs to optimally combine the estimates from the various codebooks with no prior knowledge of whether or not the observed context matches the context modeled by the codebooks.
Consider $K$ speech codebooks $C_1, \ldots, C_K$, which include one or more CD codebooks and a CI codebook, depending on the contexts modeled. We consider a single noise codebook, $C_w$, corresponding to the encountered noise type. Robustness to different noise types can be provided by extending the notion of context dependency to the noise codebooks as well. To maintain the focus on context dependency in speech, we only consider a single noise codebook.
We consider the following K hypotheses:
$H_k$: speech codebook $C_k$ best models the speech context for the current segment, $1 \le k \le K$.
At a given time $T$, one of the $K$ hypotheses is valid. This corresponds to a state, and we write $S_T = H_k$ to denote that at time $T$, the most appropriate speech codebook for the observed noisy segment is $C_k$.
The two terms in the last line of (7) lend themselves to an intuitive interpretation. The second term, $E[m \mid y_1, y_2, \ldots, y_T, S_T = H_k]$, corresponds to an MMSE estimate of $m$ assuming that the context is best described by $H_k$. The first term provides a relative importance score to this estimate, based on the likelihood that $C_k$ is indeed the most appropriate speech codebook. The weighted summation corresponds to a soft estimation, which allows the coexistence of multiple contexts, e.g., speaker and language, each being modeled by a separate codebook. Next, we derive expressions for both these terms.
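Let

$$\alpha_T(k) = p(y_1, y_2, \ldots, y_T, S_T = H_k)$$

represent the forward probability as in standard HMM theory. It can be recursively obtained as follows:

$$\alpha_1(k) = p(H_k)\, p(y_1 \mid H_k), \quad (9)$$

$$\alpha_T(k) = \left[ \sum_{j=1}^{K} \alpha_{T-1}(j)\, p(S_T = H_k \mid S_{T-1} = H_j) \right] p(y_T \mid H_k), \quad T > 1. \quad (10)$$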
The prior probabilities in the absence of any observation can be assumed to be equal in Equation 9. Thus, $p(H_k) = 1/K$, i.e., all hypotheses are equally likely.
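The forward recursion with equal priors can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the scaled form of the standard HMM forward algorithm, in which the forward probabilities are renormalized at every step so that they can be read directly as posterior weights for the hypotheses. The transition matrix values and all function names are assumptions made for the example.

```python
def normalize(values):
    """Scale non-negative values to sum to one; fall back to uniform if all are zero."""
    total = sum(values)
    if total <= 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def initial_weights(likelihoods):
    """First segment: equal priors p(H_k) = 1/K times the likelihood of each hypothesis."""
    k_total = len(likelihoods)
    return normalize([lik / k_total for lik in likelihoods])


def forward_step(prev_weights, transition, likelihoods):
    """One step of the scaled forward recursion.

    transition[j][k] is the (assumed) probability of moving from
    hypothesis j to hypothesis k between consecutive segments.
    """
    k_total = len(likelihoods)
    predicted = [
        sum(prev_weights[j] * transition[j][k] for j in range(k_total))
        for k in range(k_total)
    ]
    return normalize([predicted[k] * likelihoods[k] for k in range(k_total)])


# Hypothetical "sticky" transitions: the context rarely changes between segments.
transition = [[0.95, 0.05], [0.05, 0.95]]
weights = initial_weights([0.9, 0.1])
weights = forward_step(weights, transition, [0.8, 0.2])
```

Renormalizing at each step does not change the relative weights and avoids numerical underflow when many segments are processed.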
The logarithm of the likelihood $p(y_T \mid m)$ in Equation 14 can be efficiently computed in the frequency domain, and the gain terms that maximize the likelihood can be computed as in the maximum likelihood CBSE approach.
Next, we consider the term $p(m \mid H_k)$ in Equation 13. Under hypothesis $H_k$, the speech signal in the observed segment is best described by the codebook $C_k$. We assume that all the models resulting from a given codebook are equally likely. This assumption is reasonable, in general, if the codebook is large and derived from a large, phonetically balanced training set.
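One consequence of the equal-probability assumption is that the posterior probability of each codebook entry is proportional to its likelihood, so a conditional mean reduces to a likelihood-weighted average over the entries. The sketch below illustrates this for generic parameter vectors. The function names are hypothetical, the log-likelihood values are assumed to be supplied from elsewhere, and the max-subtraction step is a standard numerical safeguard rather than something stated in the text.

```python
import math


def posterior_over_entries(log_likelihoods):
    """Posterior probability of each codebook entry under equal priors.

    Subtracting the maximum log-likelihood before exponentiating
    avoids underflow without changing the normalized result.
    """
    peak = max(log_likelihoods)
    unnormalized = [math.exp(ll - peak) for ll in log_likelihoods]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mmse_estimate(entries, log_likelihoods):
    """Likelihood-weighted average of codebook entries (one vector per entry)."""
    posterior = posterior_over_entries(log_likelihoods)
    dim = len(entries[0])
    return [
        sum(p * entry[d] for p, entry in zip(posterior, entries))
        for d in range(dim)
    ]
```

Averaging LP coefficients directly does not guarantee a stable filter, so in practice the entries would be expressed in a domain where averaging is safe, such as line spectral frequencies or PSDs.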
and $p(y_T \mid m)$ is given by Equation 14. Equation 18 is used in Equations 9 and 10 to obtain the forward probabilities. Finally, the required MMSE estimate $\hat{m}$ is obtained by using Equations 11 and 17 in Equation 7. The speech and noise PSDs corresponding to $\hat{m}$ can be obtained using Equation 3 and the Wiener filter from Equation 5. To ensure stability of the estimated LP parameters, the weighted sum in Equation 7 can be performed in the line spectral frequency domain. Note that the weights are non-negative and add up to unity, as is evident from Equation 11. Alternatively, as we are finally interested in the speech and noise PSDs to be used in a Wiener filter, the weighted sum can be performed in the power spectral domain.
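The power-spectral-domain alternative mentioned above can be sketched as a per-bin weighted sum of the PSD estimates obtained under each hypothesis, followed by a Wiener gain. The function names and the example numbers are illustrative assumptions; the weights stand for the normalized hypothesis probabilities and are expected to be non-negative and sum to one.

```python
def combine_psds(weights, psds):
    """Per-bin weighted sum of PSD estimates, one PSD list per hypothesis."""
    if len(weights) != len(psds):
        raise ValueError("one weight is required per PSD estimate")
    n_bins = len(psds[0])
    return [sum(w * psd[i] for w, psd in zip(weights, psds)) for i in range(n_bins)]


def wiener_gains(speech_psd, noise_psd):
    """Wiener gain per frequency bin from combined speech and noise PSDs."""
    return [p_x / (p_x + p_w) for p_x, p_w in zip(speech_psd, noise_psd)]


# Two hypotheses, two frequency bins (illustrative numbers only).
weights = [0.75, 0.25]
speech_psd = combine_psds(weights, [[4.0, 2.0], [0.0, 2.0]])
noise_psd = combine_psds(weights, [[1.0, 2.0], [1.0, 2.0]])
gains = wiener_gains(speech_psd, noise_psd)
```

Since PSDs are non-negative and the weights form a convex combination, the combined PSD is always a valid PSD, which sidesteps the stability concern that arises when LP coefficients are averaged directly.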
We conclude this section with some remarks on the calculation of the forward probabilities $\alpha_T$, which for a given codebook capture how well that codebook matches the context of the $T$th input segment. As mentioned earlier, the proposed framework can be used to model context in speech as well as noise. When context is modeled by the speech codebooks, it was found to be beneficial to calculate $\alpha_T$ during speech-dominated segments; when modeling the noise context, noise-dominated segments are used instead. If the computation is performed during speech-dominated frames, we obtain accurate values for $\alpha_T$. However, inaccurate weight values may result when the computation is based on segments that lack sufficient information about the speech, such as silence or low-energy segments dominated by noise. In such situations, it is preferable to use the value of $\alpha_T$ computed in the last speech-dominated segment. This amounts to assuming that the context of the current segment is the same as that of the past segment, which is valid in general as the context of speech is not expected to change rapidly from one speech burst to another. Thus, updating $\alpha_T$ only during speech-dominated segments does not affect performance. A disadvantage of this strategy is that there may not be a sufficient number of such segments in highly noisy conditions. Introducing a preliminary noise reduction step, e.g., using a long-term noise estimate, and estimating $\alpha_T$ from the enhanced signal was seen to address this problem. Importantly, the estimation of the speech and noise PSDs and the resulting Wiener filter occurs for each short-time segment, providing good performance under nonstationary noise conditions.
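The update strategy described above, in which the weights are refreshed only on speech-dominated segments and held otherwise, can be sketched as follows. How a segment is classified as speech-dominated is not specified here, so the sketch takes that decision as an input flag. The `bayes_update` function is a simplified stand-in for the forward recursion (it omits transition probabilities), and all names are hypothetical.

```python
def bayes_update(weights, likelihoods):
    """Simplified stand-in for the forward recursion: reweight and renormalize."""
    updated = [w * lik for w, lik in zip(weights, likelihoods)]
    total = sum(updated)
    if total <= 0.0:
        return list(weights)
    return [u / total for u in updated]


def track_weights(likelihoods_per_frame, speech_dominated, initial_weights,
                  update=bayes_update):
    """Update hypothesis weights on speech-dominated frames, hold them otherwise."""
    weights = list(initial_weights)
    history = []
    for likelihoods, is_speech in zip(likelihoods_per_frame, speech_dominated):
        if is_speech:
            weights = update(weights, likelihoods)
        history.append(list(weights))
    return history


# The second frame is noise-dominated, so its misleading likelihoods are ignored.
history = track_weights(
    [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]],
    [True, False, True],
    [0.5, 0.5],
)
```

The PSD estimation and the Wiener filter are still computed for every frame; only the hypothesis weights are held.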
4 Experimental results
Experiments were performed to verify the robustness of the proposed framework under varying contexts. The contexts modeled by a trained CD codebook may or may not match with that of the observed noisy input signal, leading to two scenarios:
Context match: the best-case scenario for a CD codebook
Context mismatch: the worst-case scenario for a CD codebook
The robustness of the proposed framework, employing both CD and CI codebooks, was tested under both scenarios. Two different sets of experiments were performed, which differed in terms of number of codebooks employed and the aspects of contexts modeled. The first set consisted of experiments with two speech codebooks, a CI speech codebook and a CD speech codebook, modeling the speaker and acoustic environment as aspects of context. The second set consisted of experiments with three speech codebooks: a CI speech codebook and two CD speech codebooks to study the performance of the proposed framework with an increase in the number of codebooks employed. This set modeled, apart from speaker and acoustic environment, the speech type (normal, whisper, loud, etc.) of the input speech as aspects of context.
In the following, we first describe the experimental setup and, thereafter, the various experiments along with the corresponding results.
4.1 Experimental setup
In all the experiments, the input noisy test utterances were enhanced under different context scenarios, using the CBSE technique with the CD codebook alone, the CBSE technique with the CI codebook alone, and the proposed Bayesian scheme. We expect that in the context match scenarios, employing the CD codebook alone should lead to the best results. On the other hand, in the context mismatch scenarios, employing the CI codebook alone should lead to results better than those obtained using the CD codebook. The proposed method, however, is expected to provide robust results under varying contexts, i.e., results close to the best results in all scenarios. To serve as a reference for comparisons, we also include results when applying the Wiener filter (5) with a noise estimate obtained from a state-of-the-art noise estimation scheme.
The performance of these four processing schemes was compared using two measures: the improvement in segmental SNR (SSNR), referred to as Δ SSNR (in dB), and the improvement in the perceptual evaluation of speech quality (PESQ) measure, referred to as Δ PESQ, averaged over all the enhanced utterances considered under a particular experiment.
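For reference, a common way of computing segmental SNR and its improvement is sketched below. The exact definition used in the experiments is not given in the text, so the frame length, the use of non-overlapping frames, the clamping range of -10 to 35 dB, and the function names are all assumptions.

```python
import math


def segmental_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs (dB) between a clean signal and a processed one.

    Per-frame values are clamped to [lo, hi]; frames with no clean
    signal energy are skipped. These conventions are assumptions.
    """
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        ref = clean[start:start + frame_len]
        out = processed[start:start + frame_len]
        signal_energy = sum(r * r for r in ref)
        error_energy = sum((r - o) ** 2 for r, o in zip(ref, out))
        if signal_energy == 0.0:
            continue
        if error_energy == 0.0:
            snr_db = hi
        else:
            snr_db = 10.0 * math.log10(signal_energy / error_energy)
        values.append(min(max(snr_db, lo), hi))
    if not values:
        raise ValueError("no frames with signal energy")
    return sum(values) / len(values)


def delta_ssnr(clean, noisy, enhanced, frame_len=256):
    """Improvement in segmental SNR of the enhanced signal over the noisy input."""
    return (segmental_snr(clean, enhanced, frame_len)
            - segmental_snr(clean, noisy, frame_len))
```

PESQ is defined by a standardized algorithm and is not reproduced here.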
The speech codebooks used in the experiments were trained using the Linde-Buzo-Gray (LBG) algorithm. First, the clean speech training utterances, resampled at 8 kHz, were segmented into 50% overlapped Hann windowed frames of size 256 samples each, corresponding to a duration of 32 ms wherein the speech signal can be assumed stationary. Then, LP coefficient vectors of dimension 10, extracted using these frames, were clustered using the LBG algorithm to generate speech codebooks of size 256 each, using the Itakura-Saito (IS) distortion as the error criterion.
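Two ingredients of this training procedure, the framing step and the Itakura-Saito distortion, can be sketched as follows. The symmetric form of the Hann window and the function names are assumptions; the LP analysis and the LBG clustering themselves are not shown.

```python
import math


def hann_window(length):
    """Symmetric Hann window (an assumption; a periodic variant is also common)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(samples, frame_len=256, overlap=0.5):
    """Split a signal into overlapped, Hann-windowed frames."""
    hop = int(frame_len * (1.0 - overlap))
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        segment = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


def itakura_saito(psd_ref, psd_model):
    """Itakura-Saito distortion between two PSDs sampled on the same bins."""
    ratios = [p / q for p, q in zip(psd_ref, psd_model)]
    return sum(r - math.log(r) - 1.0 for r in ratios) / len(ratios)
```

At a sampling rate of 8 kHz, a frame of 256 samples spans 256 / 8000 = 0.032 s, which matches the 32 ms stated above, and a 50% overlap gives a hop of 128 samples.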
For training the CI speech codebook, 180 English language utterances of duration 3 to 4 s each were used, from 25 male and 25 female speakers from the WSJ speech database. This codebook served as the CI codebook for all the experiments described in this section. The speakers whose utterances were used to train the CI codebook were not used in the test utterances. The different experiments use different CD codebooks and input noisy test data, which are discussed later along with the description of each experiment.
The different CD and CI speech codebooks considered in the experiments are of large size (256) and are derived from a large number of phonetically balanced sentences from the WSJ database. Moreover, the LBG algorithm used to generate the speech codebooks computes cluster centroids in an optimal fashion. All these factors ensure the validity of the assumption about equal probability of models in Equation 16.
Two noise codebooks for two different noise types, traffic and babble, with eight entries each were trained similarly using LP coefficient vectors. For the traffic noise codebook, LP coefficient vectors of order 6 extracted from 2 min of nonstationary traffic noise were used. Since babble noise is speech-like, a higher LP model order of 10 was used while extracting LP coefficient training vectors from approximately 3 min of nonstationary babble noise. The same noise types were also used in the creation of test utterances at 0, 5, and 10 dB SNR for all the experiments. The actual noise samples were different from those used in training. The active speech level was computed using ITU-T P.56 method B, and the noise was scaled and added to obtain the desired SNR.
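The scaling step can be illustrated with the simplified sketch below. Note the simplification: it uses the mean power of the whole speech signal in place of the ITU-T P.56 active speech level, so it only approximates the procedure described above. The function names are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a signal (mean of squared samples)."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / scaled noise power matches snr_db.

    Uses overall mean power for the speech level, which is a simplification
    of the active speech level measurement mentioned in the text.
    Returns the noisy signal and the scale factor applied to the noise.
    """
    target_ratio = 10.0 ** (snr_db / 10.0)
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))
    noisy = [s + scale * n for s, n in zip(speech, noise)]
    return noisy, scale
```

For speech containing long pauses, the overall mean power underestimates the level during active speech, which is why an active level measurement is preferable when constructing test material.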
When processing the noisy files for a particular noise type, the appropriate noise codebook was used. In practice, a classified noise codebook scheme can be used. This scheme employs multiple noise codebooks, each trained for a particular noise type, and a maximum likelihood scheme selects the appropriate noise codebook for each short-time frame. This method has been shown to perform as well as the case when the ideal noise codebook is used. We chose to use the ideal noise codebook to retain the focus on the performance of the proposed framework with regard to various aspects of the speech context.
4.2 Experiments with a single CD codebook
In this experiment, we test the proposed framework when two speech codebooks are employed, a CI and a CD codebook. The CD codebook models two aspects of context, ‘speaker’ and ‘acoustic environment’.
4.2.1 CD codebook training
For training the CD codebook, 180 English language utterances from a single speaker, of 3 to 4 s duration each, were used from the WSJ speech database. These utterances were convolved with an impulse response recorded at a distance of 50 cm from the microphone, in a reverberant room (T60 = 800 ms). This corresponds, for example, to hands-free mode on a mobile phone. In practice, this codebook is adapted during hands-free usage, making it dependent on both the speaker and acoustic environment.
4.2.2 Test utterances for the experiment
Two sets of ten clean speech utterances each were used to generate the noisy test data. Utterances for the first set were from the same speaker and acoustic environment as the data used to train the CD codebook, corresponding to the context match scenario and thus the best case for the CD codebook. The utterances themselves were different from those used in the training set.
The second set of clean utterances were from a speaker different from the one involved in training the CD codebook. These utterances were not convolved with the recorded impulse response (e.g., corresponding to hand-set mode in a mobile phone). Thus, both the speaker and acoustic environment were different from those used to train the CD codebook, corresponding to the context mismatch scenario and thus the worst case for the CD codebook.
4.2.3 Enhancement results
[Table 1: Best-case scenario for a single CD codebook under babble noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD, CBSE with CI.]
[Table 2: Worst-case scenario for a single CD codebook under babble noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD, CBSE with CI.]
As can be observed from Table 1, the best results are obtained for the CD codebook, as expected in a context match scenario. There is a significant difference between the results corresponding to the CD and CI codebooks, e.g., 0.19 for Δ PESQ and 1.3 dB for Δ SSNR, at 5 dB input SNR. Moreover, the standard deviation values indicate that the observed differences between the CD and CI results are statistically significant. This illustrates the benefit of employing CD codebooks. On the other hand, Table 2 demonstrates poorer performance when using the CD codebook compared to using the CI codebook, in a context mismatch scenario. The difference between their results is significant for Δ SSNR at all input SNRs, e.g., 1 dB at 0 dB input SNR, and for Δ PESQ at higher SNR, e.g., 0.22 at 10 dB input SNR. These results demonstrate the need for a scheme that appropriately combines the estimates obtained from the CD and CI codebooks, depending on the context at hand.
In Table 1, with increasing input SNR, there is an increase in Δ PESQ but a decrease in Δ SSNR for all schemes except the reference method. This can be explained by considering the trade-off between speech distortion and noise reduction.
In general, enhancement using a Wiener filter involves applying a gain (attenuation) function to the noisy speech, which attenuates both the speech and noise components. At lower input SNRs, the SSNR measure is dominated by the benefit of noise reduction and largely ignores the penalty due to speech distortion. In these scenarios, applying a greater attenuation than is optimal can increase the output SSNR, as it results in more noise attenuation (it also results in more speech attenuation, but that is not captured by the SSNR measure). This situation occurs when using a mismatched codebook, where the clean speech PSD is underestimated, resulting in more severe attenuation of the noisy speech. PESQ is closer to human perception, and we believe that it better captures the effect of speech distortion, resulting in negative Δ PESQ values for these scenarios. At higher input SNRs, the SSNR measure also captures the effect of speech distortion. Since Δ PESQ captures the decrease in speech distortion with increasing input SNR, Δ PESQ increases with input SNR in Table 1. Δ SSNR, in contrast, is larger at lower input SNRs than at higher input SNRs because noise reduction dominates the measure there.
In contrast to the results obtained when using the CD and CI codebooks alone, the proposed framework achieves robust performance regardless of the observed context. For the best-case scenario (Table 1), its results are close to the CD results. For the worst-case scenario (Table 2), its results are close to the CI results. Thus, the proposed framework achieves results close to the best results for a given scenario, as desired. The reference scheme performs poorly due to the nonstationary nature of the noise. It may be noted that even using a mismatched codebook outperforms the reference scheme, highlighting the benefit of using a priori information for speech enhancement in nonstationary noise.
[Table 3: Best-case scenario for a single CD codebook under traffic noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD, CBSE with CI.]
[Table 4: Worst-case scenario for a single CD codebook under traffic noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD, CBSE with CI.]
Comparing Δ PESQ values for the best-case scenarios in Tables 1 and 3 for the two noise types shows that there is a sharper drop in values from 5 to 0 dB input SNR in the case of traffic noise results (0.2) compared to babble noise results (0.06). A similar observation can be made for the Δ PESQ values for the worst-case scenarios in Tables 2 and 4 for the two noise types. These observations indicate that the traffic noise case is more difficult to handle than babble noise at 0 dB input SNR. This occurred because the traffic noise considered for the experiments is highly nonstationary compared to the babble noise used for the experiments.
4.2.4 Comparison of the proposed method with the MMSE-STSA method
In the above experiments, the reference method chosen for comparison with the proposed method uses the Wiener gain, as described by (5), computed using a state-of-the-art noise estimator. This choice provides an even comparison, as the proposed method also employs the Wiener gain function. The two approaches, however, differ in how the speech and noise PSDs used in the Wiener gain are computed.
[Table 5: Comparison of the proposed method with the MMSE-STSA technique for the context match scenario corresponding to Table 1. Measure: ΔSSNR (in dB).]
[Table 6: Comparison of the proposed method with the MMSE-STSA technique for the context mismatch scenario corresponding to Table 2. Measure: ΔSSNR (in dB).]
4.3 Experiments with multiple CD codebooks
In the previous subsection, we tested the proposed framework under conditions when a single CD codebook was employed along with a CI codebook. Multiple aspects of context were modeled by the single CD codebook. In practice, different contexts will be modeled by different CD codebooks. In this subsection, we experiment with the case of two CD codebooks along with one CI codebook.
4.3.1 CD codebook training
The first CD codebook, referred to as CD-1, models a particular speaker and a speech type. The speech type considered is ‘whisper’ speech. The speech produced in the case of certain speech disorders (dysphonic speech) is similar to whispered speech. CD-1 was trained using around 10 min of whispered speech data from a single speaker from the CHAINS database.
The second CD codebook employed, referred to as CD-2, models normal speech in reverberant conditions for the same speaker as modeled by CD-1. CD-2 was trained using training utterances of duration around 10 min, convolved with the same impulse response as used in the previous experiments (corresponding to a distance of 50 cm from the microphone, in a reverberant room with T60 = 800 ms).
The two codebooks differ in terms of speaking style, whispered and normal, and also the acoustic environment. The separation in terms of acoustic environment is useful, e.g., to have different CD models for a particular user of the mobile phone to cater to hand-set and hands-free modes of operation. Note that the CI codebook is speaker-independent and corresponds to hand-set mode.
4.3.2 Test utterances for the experiment
Two sets of experiments were performed pertaining to the matching codebook being CD-1 or CD-2. The first set consisted of test utterances generated by adding noise to ten clean ‘whispered’ speech utterances from the same speaker as in generation of the CD-1 codebook. Similarly, the second set of experiments had test utterances generated using ten clean ‘normal’ speech utterances from the same speaker as in CD-2, convolved with the same recorded impulse response as used in training CD-2 to constitute the context match scenario for CD-2. In both sets of experiments, the test utterances considered were different from those used in the training of the codebooks. The noisy test utterances were generated as described in the beginning of the section.
4.3.3 Enhancement results
[Table 7: Results using two CD codebooks and one CI codebook, for the context match scenario for CD-1 under babble noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD-1, CBSE with CD-2, CBSE with CI.]
[Table 8: Results using two CD codebooks and one CI codebook, for the context match scenario for CD-2 under babble noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD-1, CBSE with CD-2, CBSE with CI.]
[Table 9: Results using two CD codebooks and one CI codebook, for the context match scenario for CD-1 under traffic noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD-1, CBSE with CD-2, CBSE with CI.]
[Table 10: Results using two CD codebooks and one CI codebook, for the context match scenario for CD-2 under traffic noise. Measure: ΔSSNR (in dB). Schemes: CBSE with CD-1, CBSE with CD-2, CBSE with CI.]
5 Conclusions
In this paper, we have introduced the notion of context-dependent (CD) models for speech enhancement methods that use trained models of speech and noise parameters. CD speech models can be trained on one or more aspects of speech context such as speaker, acoustic environment, speaking style, etc., and CD noise models can be trained for specific noise types. Using CD models results in better speech enhancement performance compared to using context-independent (CI) models when the noisy speech shares the same context as the trained codebook. The risk, however, is degraded performance in the event of a context mismatch. Thus, the CD and CI models need to co-exist in a practical implementation. The Bayesian speech enhancement framework proposed in this paper obtains estimates of speech and noise parameters based on all available models, requires no prior information on the context at hand, and automatically obtains results close to those obtained when using the appropriate codebook for a given context scenario as seen from experiments with various aspects of speech context.
The improved performance of the proposed method is at the cost of increased computational complexity. As opposed to employing a single CI model, the proposed method involves computations with multiple models. The computations related to each model can, however, occur simultaneously, which allows for a parallel implementation.
The proposed method has been developed using the codebook-based speech enhancement system as an example of a data-driven model-based speech enhancement system. Other model-based schemes, such as those using HMMs, GMMs, and NMF, can benefit in a similar manner, and the extension is a topic for future work. The theory developed in this paper is directly applicable to context-dependent noise codebooks and can be used for robust noise estimation under varying noise conditions.
In this paper, context-dependent models are assumed to be available. In practice, they need to be trained online. For several aspects of context, a separate enrollment stage may not be meaningful and the models need to be progressively adapted during usage when the SNR is high. Distinguishing between different aspects of context and training separate models for them online is another topic for future work.
The codebooks considered in this paper consist of vectors of tenth-order LP coefficients, which model the smoothed spectral envelope. It will be worthwhile to investigate the suitability of other spectral representations, such as higher-resolution PSDs and mel-frequency cepstral coefficients, to capture context-dependent information. Different features may be employed depending on which aspects of context are to be modeled and on the application, e.g., whether the enhancement is for speech communication, speaker identification, or speech recognition.
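To make concrete how a vector of LP coefficients describes a smoothed spectral envelope, the following sketch samples the all-pole spectrum g / |A(e^{jω})|² on a uniform frequency grid. It assumes the convention A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p}; the function name, the gain argument, and the number of bins are illustrative choices rather than details from the paper.

```python
import cmath
import math


def lp_envelope(lp_coeffs, gain=1.0, n_bins=129):
    """Sample gain / |A(e^{jw})|^2 at n_bins frequencies in [0, pi].

    lp_coeffs holds a_1..a_p for A(z) = 1 + sum_k a_k z^{-k}
    (an assumed sign convention).
    """
    envelope = []
    for i in range(n_bins):
        w = math.pi * i / (n_bins - 1)
        a = 1.0 + sum(c * cmath.exp(-1j * w * (k + 1))
                      for k, c in enumerate(lp_coeffs))
        envelope.append(gain / abs(a) ** 2)
    return envelope


# Usage: a first-order example with a low-pass envelope.
env = lp_envelope([-0.9])
```

A tenth-order codebook entry would be passed in the same way, as a list of ten coefficients. The envelope is smooth because it is determined by only p parameters, regardless of the number of frequency bins at which it is evaluated.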
This work was performed when SS was with Philips Research Laboratories, Eindhoven, The Netherlands.
The authors would like to thank Prof. G. V. Prabhakara Rao, Head, Department of Information Technology, Rajiv Gandhi Memorial College of Engineering and Technology, Nandyal, Andhra Pradesh, India, for valuable discussions on this topic.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.