A semisoft thresholding method based on Teager energy operation on wavelet packet coefficients for enhancing noisy speech

Sanam, Tahsina Farah; Shahnaz, Celia

doi:10.1186/1687-4722-2013-25

Published: 19 November 2013


Tahsina Farah Sanam¹ &
Celia Shahnaz¹

EURASIP Journal on Audio, Speech, and Music Processing volume 2013, Article number: 25 (2013)


Abstract

The performance of thresholding-based methods for speech enhancement largely depends upon the estimation of the exact threshold value. In this paper, a new thresholding-based speech enhancement approach, where the threshold is statistically determined using the Teager energy-operated wavelet packet (WP) coefficients of noisy speech, is proposed. The threshold thus obtained is applied to the WP coefficients of the noisy speech by employing a semisoft thresholding function in order to obtain an enhanced speech. A number of simulations were carried out in the presence of white, car, pink, and multi-talker babble noises to evaluate the performance of the proposed method. Standard objective measures as well as subjective evaluations show that the proposed method is capable of outperforming the existing state-of-the-art thresholding-based speech enhancement approaches for noisy speech of high as well as low levels of SNR.

1 Introduction

Enhancement of noisy speech has been an important problem and has a broad range of applications, such as mobile communications, speech coding and recognition, and hearing aid devices[1]. The performance of such applications operating in noisy environments is highly dependent on the noise reduction techniques employed therein.

Various methods have been reported in the literature to address the problem of noise reduction in speech enhancement. Speech enhancement methods can generally be divided into several categories based on their domains of operation, namely time domain, frequency domain, and time-frequency domain. Time domain methods include the subspace approach[2]; frequency domain methods include short-time Fourier transform (STFT)-based spectral subtraction[3–6], the minimum mean square error (MMSE) estimator[7–11], and Wiener filtering[12–14]; and time-frequency domain methods employ the wavelet family of transforms[15–26]. All of the methods have their own advantages and drawbacks. In the MMSE estimator[7–11], the frequency spectrum of the noisy speech is modified to reduce the noise from noisy speech in the frequency domain. The spectral subtraction method[3–6] is simple and attempts to estimate the spectral amplitude of the clean speech by subtracting an estimate of the noise spectral amplitude from that of the observed noisy speech. Finally, the estimated amplitude is combined with the phase of the noisy speech to produce the desired estimate of the clean speech STFT. In the Wiener filter approach[12–14], the estimator of the clean speech STFT is simply the MMSE estimator when considering Gaussian-distributed clean speech and noise. In that case, the phase of the resulting estimate turns out to be that of the noisy speech. The spectral subtraction filter uses the instantaneous spectra of the noisy signal and the running average (time-averaged spectra) of the noise, whereas the Wiener filter is based on the ensemble average spectra of the signal and noise.
Although the spectral subtraction method provides a trade-off between speech distortion and residual noise to some extent, its major drawback is the perceptually annoying musical nature of the residual noise characterized by tones at different frequencies that randomly appear and disappear. One of the major problems of the Wiener filter-based method is the requirement of obtaining clean speech statistics necessary for its implementation. The use of Wiener filter in speech enhancement generally introduces little speech distortion; however, as for the spectral subtraction approach, the speech enhanced based on the Wiener filter is also characterized by residual musical noises. Among the speech enhancement methods using time-frequency analyses, the use of nonlinear techniques based on discrete wavelet transform (DWT)[15–26] is a superior alternative to the methods using STFT-based analyses, such as spectral subtraction and Wiener filtering. In the DWT, the fixed bandwidth of the STFT is replaced with one that is proportional to frequency that allows better time resolution at high frequencies than the STFT. Here, low frequencies are examined with low temporal resolution while high frequencies are observed with greater temporal resolution. Thus, the DWT gains more attractiveness in representing and preserving the signal energy in the presence of noise that needs to be removed in the speech enhancement process. Since the DWT-based speech enhancement methods exploit the superior frequency localization property of the DWT, they have more capability of reducing musical noise, thus achieving better noise reduction performance in terms of quality as well as intelligibility.

The main challenge in speech enhancement approaches based on the thresholding of the DWT coefficients of the noisy speech is the estimation of a threshold value that marks a difference between the DWT coefficients of noise and that of clean speech. Then, by using the threshold, designing a thresholding scheme to minimize the effect of DWT coefficients corresponding to the noise is another difficult task considering the fact that conventional DWT-based speech enhancement approaches exhibit a satisfactory performance only at a relatively high signal-to-noise ratio (SNR). For zero-mean, normally distributed white noise, Donoho and Johnstone proposed the Universal threshold-based method for enhancing corrupted speech[19, 20]. For noisy speech, applying a unique threshold for all the DWT coefficients irrespective of the speech and silence frames may suppress noise to some extent, but it may also remove unvoiced speech frames, thus degrading the quality of the enhanced speech. The Teager energy operator (TEO) proposed by Kaiser[27] is employed to compute a threshold value that is used to threshold the wavelet packet coefficients of the noisy speech[18, 28, 29]. In particular, in the wavelet packet filtering (WPF) method[18], a time-adaptive threshold value is computed and an absolute offset parameter is used to distinguish speech frames from the noise ones. Thus, the WPF method suffers from an over-thresholding problem if the speech signal is contaminated by just slight noises. Statistical modeling is another approach of thresholding-based speech enhancement, where the threshold of wavelet packet coefficients is determined using the similarity distances between the probability distributions of the signals[17].

In this paper, we develop a new speech enhancement method based on thresholding in the wavelet packet domain. Since TEO is a popular way to estimate the energy of a band-limited signal, instead of direct employment of the TEO on the noisy speech, we apply the TEO on the wavelet packet (WP) coefficients of the noisy speech (as for[18, 28, 29]), but we propose a statistical modeling of the Teager energy (TE)-operated WP coefficients. By exploiting the symmetric Kullback-Leibler (SKL) divergence, we then determine an appropriate threshold with respect to speech and silent subbands. The threshold thus obtained is finally employed in a semisoft thresholding function for obtaining an enhanced speech.

2 Proposed method

The block diagram of our proposed system is shown in Figure 1. It is seen from Figure 1 that the WP transform is first applied to each input speech frame. Then, the WP coefficients are subjected to Teager energy approximation with a view to determining a threshold value for performing the thresholding operation in the WP domain. After thresholding, an enhanced speech is obtained via the inverse wavelet packet (IWP) transform.

2.1 Wavelet packet analysis

A method based on the wavelet packet decomposition is a generalization of the wavelet transform-based decomposition process that offers a richer range of possibilities for the analysis of signals such as speech. In the orthogonal wavelet decomposition procedure, the generic step splits a speech signal into sets of approximation and detail coefficients. The set of approximation coefficients is then itself split into second-level approximation and detail coefficients, successive details are never reanalyzed, and the process is repeated. Each level of decomposition is calculated by passing only the previous wavelet approximation coefficients through discrete-time low- and high-pass quadrature mirror filters. The Mallat algorithm is one of the efficient ways to construct the DWT by iterating a two-channel perfect reconstruction filter bank over the low-pass scaling function branch[30]. However, this algorithm results in a logarithmic frequency resolution, which does not work well for all signals. In order to overcome this drawback, it is desirable to iterate the high-pass wavelet branch of the Mallat algorithm tree as well as the low-pass scaling function branch. Such a wavelet decomposition produced by these arbitrary subband trees is known as WP decomposition.

In the WP decomposition, both the detail and approximation coefficients are decomposed to create the full binary tree. For a given orthogonal wavelet function, a library of wavelet packet bases is generated. Each of these bases offers a particular way of coding signals, preserving global energy and reconstructing exact features. It is interesting to find an optimal decomposition with respect to a convenient criterion, computable by an efficient algorithm. Simple and efficient algorithms exist for both wavelet packet decomposition and optimal decomposition selection. Functions verifying an additivity-type property are well suited for efficient searching of binary tree structures and the fundamental splitting. Classical entropy-based criteria match these conditions and describe information-related properties for an accurate representation of a given signal. In particular, the best basis algorithm by Coifman and Wickerhauser finds a set of bases that provide the most desirable representation of the data relative to a particular cost function (e.g., entropy)[31].

In DWT decomposition, by the restriction of Heisenberg’s uncertainty principle, the spatial resolution and spectral resolution of high-frequency band become poor, thus limiting the application of DWT. In particular, there are some problems with the basic DWT-based thresholding method when it is applied to noisy speech for the purpose of enhancement. An important shortcoming is the shrinkage of the unvoiced frames of speech which contain many noise-like speech components leading to a degraded speech quality. On the other hand, in WP decomposition, since both the approximation and the detail coefficients are decomposed into two parts at each level of decomposition, a complete binary tree with superior frequency localization can be achieved. Thus, in the context of noisy speech enhancement, this particular feature of the WP decomposition provides better discriminability of speech coefficients among those of the noise and is indeed useful for enhancing speech in the presence of noise.

For a j-level WP transform, the noisy speech signal y[n] with frame length N is decomposed into 2^j subbands. The m th WP coefficient of the k th subband is expressed as

W_{k, m}^{j} = WP [y [n], j], n = 1, \dots, N,

(1)

where m = 1,…,N/2^j and k = 1,…,2^j.
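The subband layout of (1) can be illustrated with a minimal sketch. It assumes the orthonormal Haar filter pair purely to keep the example short and dependency-free; it is not the db10 basis used in the experiments, and the function names `haar_split` and `wp_decompose` are hypothetical. Every node is split at every level, which yields the full binary tree with 2^j subbands of N/2^j coefficients each.

```python
import math

def haar_split(x):
    """One orthonormal Haar analysis step: returns (approximation, detail)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (x[2 * i] + x[2 * i + 1]) for i in range(len(x) // 2)]
    detail = [s * (x[2 * i] - x[2 * i + 1]) for i in range(len(x) // 2)]
    return approx, detail

def wp_decompose(frame, levels):
    """Full wavelet packet tree: both branches are split at every level.

    Returns 2**levels subbands in filter-bank (natural) order,
    each holding len(frame) / 2**levels coefficients.
    """
    if len(frame) % (2 ** levels) != 0:
        raise ValueError("frame length must be divisible by 2**levels")
    nodes = [list(frame)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_split(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes

# Toy frame of N = 512 samples, decomposed to j = 3 levels.
frame = [math.sin(0.1 * n) for n in range(512)]
subbands = wp_decompose(frame, 3)
print(len(subbands), len(subbands[0]))
```

Because the Haar pair is orthonormal, the total energy of the subband coefficients equals that of the frame, which is the energy-preserving property mentioned above.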

2.2 Teager energy approximation

The continuous form of the TEO[27] is given as

Ψ_{c} [y (t)] = {(\frac{d}{dt} y (t))}^{2} - y (t) \frac{d^{2}}{d t^{2}} y (t),

(2)

where Ψ_c[.] and y(t) represent the continuous TEO and a continuous signal, respectively. For a given bandlimited discrete signal y[n], the discrete-time TEO can be approximated by

Ψ_{d} (y [n]) = {y [n]}^{2} - y [n + 1] y [n - 1] .

(3)

The discrete-time TEO is nearly instantaneous since only three samples are required for the energy computation at each time instant as shown in (3). Due to this excellent time resolution, the output of a TEO provides us with the ability to capture the energy fluctuations and hence gives an estimate of the energy required to generate the signal[18, 27–29, 32–35].
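As a small illustration of (3), the sketch below applies the discrete-time TEO to a pure tone. The boundary handling (returning only the interior samples) is an assumption, since the text does not say how the first and last samples are treated, and the name `teager` is hypothetical. For y[n] = A cos(Ωn + φ), the operator output is the constant A² sin²(Ω), which shows how it tracks both amplitude and frequency from three samples.

```python
import math

def teager(x):
    """Discrete TEO of Eq. (3), evaluated at interior samples n = 1 .. len(x) - 2."""
    return [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]

amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n + 0.5) for n in range(50)]
energy = teager(tone)

# Closed-form value for a sinusoid: A^2 * sin^2(omega).
expected = (amplitude * math.sin(omega)) ** 2
print(max(abs(e - expected) for e in energy) < 1e-9)
```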

In the context of the noisy speech enhancement by thresholding via WP analysis, the threshold must be adapted over time since speech is not always present in the signal. It is expected that the threshold should be larger during periods without speech and smaller for those with speech. Since the TEO provides an estimate of the signal energy over time, it can be employed to obtain an idea of speech/nonspeech activity and then decide an appropriate threshold value in the speech/nonspeech frame. But directly using the TEO on noisy speech may result in much undesired artefact and enhanced noises as TEO is a fixed-sized local operator[27]. Therefore, instead of direct employment of the TEO on the noisy speech, it is found reasonable to apply the TEO on the WP coefficients of the noisy speech[18]. The application of the discrete-time TEO on the $W_{k, m}^{j}$ results in a set of TEO coefficients $t_{k, m}^{j}$ . The m th TEO coefficient corresponding to the k th subband of the WP is given by

t_{k, m}^{j} = Ψ_{d} [W_{k, m}^{j}], k = 1, \dots, 2^{j} .

(4)

Unlike the approach of threshold determination directly from the WP coefficients of noisy speech, the approach to determine threshold from the TE-operated WP coefficients and then employ it via a semisoft thresholding function, has more potential to eliminate as much of the noise as possible while still maintaining speech quality and intelligibility in the enhanced speech[29].

2.3 Statistical modeling of TE-operated WP coefficients

This paper proposes a new thresholding function employing a threshold value determined for each subband of the WP by statistically modeling the TE-operated WP coefficients $t_{k, m}^{j}$ with a probability distribution rather than choosing a threshold value directly from the $t_{k, m}^{j}$ .

In a certain range, the probability distribution of the $t_{k, m}^{j}$ of the noisy speech is expected to be nearly similar to those of the noise. Also, outside that range, the probability distribution of the $t_{k, m}^{j}$ of the noisy speech is expected to be similar to those of the clean speech. Thus, by considering the probability distributions of the $t_{k, m}^{j}$ of the noisy speech, noise, and clean speech, a more accurate threshold value can be obtained using a suitable scheme of pattern matching or similarity measure between the probability distributions. It is well known that the Kullback-Leibler (K-L) divergence provides a measure of the distance between two distributions. It is an appealing approach to robustly estimate the differences between two distributions. Instead of comparing just the TE-operated WP coefficients $t_{k, m}^{j}$ , the distribution of the $t_{k, m}^{j}$ of the noisy speech can be compared with the distribution of the $t_{k, m}^{j}$ of noise or that of clean speech using the K-L divergence. Since the K-L divergence is not a symmetric metric, we propose the use of the SKL divergence.

2.4 Optimal threshold calculation

This subsection presents our approach to obtain first the idea of speech/silent frame based on the SKL divergence and then to choose two different threshold values suitable for silent and speech frames. At first, the threshold value for a noisy speech frame is analytically obtained by solving equations either based on the SKL divergence between the probability distribution functions (pdfs) of the $t_{k, m}^{j}$ of the noisy speech and that of the noise or based on the SKL divergence between the pdfs of the $t_{k, m}^{j}$ of the noisy speech and that of the clean speech. To this end, in a frame of noisy speech/ noise/ clean speech, for each subband of WP, we formulate the histogram of the $t_{k, m}^{j}$ and approximate the histogram by a reasonably close pdf, namely Gaussian distribution. For this purpose, we follow the steps below:

1. The histogram of the $t_{k, m}^{j}$ in each subband is obtained. The number of bins in the histogram has been set equal to the square root of the number of samples divided by two.
2. Since the $t_{k, m}^{j}$ of clean speech, noisy speech, and noise are positive quantities, their histograms in each subband can be approximated by the positive part of a pdf following the Gaussian distribution. Such statistical modeling of the $t_{k, m}^{j}$ of clean speech, noisy speech, and noise is supported by experimental validation over all speech sentences of the NOIZEUS noisy speech corpus [36] at different SNR levels. Typical examples of such modeling are shown in Figures 2, 3, and 4, respectively.

The method in[17] does not employ the TE operation prior to computing the threshold value, and the threshold value for each subband of a noisy speech frame is determined by statistically modeling the WP coefficients. Since the WP coefficients are a signed quantity, their histograms in each subband are approximated by a two-sided Gaussian pdf. In the proposed method, due to the simpler approximation of the $t_{k, m}^{j}$ of clean speech, noisy speech, or noise by the positive part of a Gaussian pdf, the process of deriving the threshold value becomes less complex which is an additional advantage over the approach in[17]. In order to analytically determine an appropriate threshold value, we proceed as follows:

The K-L divergences are always nonnegative and zero if and only if the approximate Gaussian distribution functions of the $t_{k, m}^{j}$ of noisy speech and that of the noise, or the approximate Gaussian distribution functions of the $t_{k, m}^{j}$ of the noisy speech and that of the clean speech are exactly the same. In order to have a symmetric distance between any two approximate Gaussian distribution functions as mentioned above, the symmetric K-L divergence has been adopted in this paper. The symmetric K-L divergence is defined as

SKL (p, q) = \frac{KL (p, q) + KL (q, p)}{2}

(5)

where p and q are the two approximate Gaussian pdfs calculated from the corresponding histograms each having M number of bins and KL(.) is the K-L divergence given by

KL (p, q) = \sum_{i = 1}^{M} p_{i} (t_{k, m}^{j}) ln \frac{p_{i} (t_{k, m}^{j})}{q_{i} (t_{k, m}^{j})} .

(6)

In (6), $p_{i} (t_{k, m}^{j})$ represents the approximate Gaussian pdf of the $t_{k, m}^{j}$ of the noisy speech estimated by

{\hat{p}}_{i} (t_{k, m}^{j}) = \frac{\text{Number of coefficients in the } i \text{th bin of the histogram}}{\text{Total number of coefficients in each subband}} .

(7)
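Equations (5) to (7) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the bin count and bin range are left as parameters because the text does not fix them unambiguously, and the small `eps` floor for empty bins is an added assumption to avoid taking the logarithm of zero.

```python
import math

def histogram_pmf(values, num_bins, lo, hi):
    """Eq. (7): fraction of coefficients falling in each of num_bins equal-width bins."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        i = int((v - lo) / width)
        i = max(0, min(i, num_bins - 1))  # clamp out-of-range values to the edge bins
        counts[i] += 1
    return [c / len(values) for c in counts]

def kl(p, q, eps=1e-12):
    """Eq. (6); eps floors empty bins (an assumption, not part of the paper)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def skl(p, q):
    """Eq. (5): symmetric K-L divergence."""
    return 0.5 * (kl(p, q) + kl(q, p))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(skl(p, p), skl(p, q) == skl(q, p))
```

Both histograms must be built over the same bin edges, so that the i th bin of one distribution is compared with the i th bin of the other.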

Similarly, the approximate Gaussian pdf of the $t_{k, m}^{j}$ of the noise and that of the $t_{k, m}^{j}$ of the clean speech can be estimated from (7) and denoted by ${\hat{q}}_{i} (t_{k, m}^{j})$ and ${\hat{r}}_{i} (t_{k, m}^{j})$ , respectively. Below a certain value λ of the $t_{k, m}^{j}$ of the noisy speech, the symmetric K-L divergence between ${\hat{p}}_{i} (t_{k, m}^{j})$ and ${\hat{q}}_{i} (t_{k, m}^{j})$ is approximately zero, i.e.,

SKL ({\hat{p}}_{i} (t_{k, m}^{j}), {\hat{q}}_{i} (t_{k, m}^{j})) \approx 0

(8)

where the bins lie in the range [1, λ] in both ${\hat{p}}_{i} (t_{k, m}^{j})$ and ${\hat{q}}_{i} (t_{k, m}^{j})$ . Alternatively, above the value λ of the $t_{k, m}^{j}$ of the noisy speech, the symmetric K-L divergence between ${\hat{p}}_{i} (t_{k, m}^{j})$ and ${\hat{r}}_{i} (t_{k, m}^{j})$ is almost zero, i.e.,

SKL ({\hat{p}}_{i} (t_{k, m}^{j}), {\hat{r}}_{i} (t_{k, m}^{j})) \approx 0

(9)

In (9), the bins lie in the range [λ + 1, M] in both ${\hat{p}}_{i} (t_{k, m}^{j})$ and ${\hat{r}}_{i} (t_{k, m}^{j})$ . Using (5) and (6) in evaluating (8) and (9), we get

\sum_{i = 1}^{λ} [{\hat{p}}_{i} (t_{k, m}^{j}) - {\hat{q}}_{i} (t_{k, m}^{j})] ln \frac{{\hat{p}}_{i} (t_{k, m}^{j})}{{\hat{q}}_{i} (t_{k, m}^{j})} \approx 0 .

(10)

\sum_{i = λ + 1}^{M} [{\hat{p}}_{i} (t_{k, m}^{j}) - {\hat{r}}_{i} (t_{k, m}^{j})] ln \frac{{\hat{p}}_{i} (t_{k, m}^{j})}{{\hat{r}}_{i} (t_{k, m}^{j})} \approx 0 .

(11)

From (10), it is apparent that the $t_{k, m}^{j}$ of the noisy speech lying in the range [1, λ] can be marked as the $t_{k, m}^{j}$ of noise and needed to be removed. Similarly, (11) attests that the $t_{k, m}^{j}$ of the noisy speech residing outside [1,λ] can be treated as similar to the $t_{k, m}^{j}$ of the clean speech and considered to be preserved. For obtaining a threshold value λ in each subband, (10) and (11) can be expressed as

\int_{1}^{λ} [\frac{\sqrt{ϑ}}{\sqrt{2 π} σ_{S}} exp (- \frac{ϑ x^{2}}{2 σ_{S}^{2}}) - \frac{1}{\sqrt{2 π} σ_{N}} exp (- \frac{x^{2}}{2 σ_{N}^{2}})] ln (\sqrt{1 - ϑ} exp (- \frac{ϑ x^{2}}{2 σ_{S}^{2}} + \frac{x^{2}}{2 σ_{N}^{2}})) d x \approx 0,

(12)

\int_{λ + 1}^{\infty} [\frac{\sqrt{ϑ}}{\sqrt{2 π} σ_{S}} exp (- \frac{ϑ x^{2}}{2 σ_{S}^{2}}) - \frac{1}{\sqrt{2 π} σ_{S}} exp (- \frac{x^{2}}{2 σ_{S}^{2}})] ln (\sqrt{ϑ} exp (\frac{(1 - ϑ) x^{2}}{2 σ_{S}^{2}})) d x \approx 0,

(13)

where $ϑ = σ_{S}^{2} / (σ_{N}^{2} + σ_{S}^{2})$ .

The range used for solving Equations (12) and (13) required for determining the threshold value λ in each subband is different from that used in[17]. The value of $t_{k, m}^{j}$ for which the threshold reaches its optimum value can be determined by minimizing (12) or (13). Since (12) is a definite integral, the derivative of the function defined in the left-hand side (L.H.S) of (12) representing the SKL divergence between ${\hat{p}}_{i} (t_{k, m}^{j})$ and ${\hat{q}}_{i} (t_{k, m}^{j})$ is calculated and set to zero. On the other hand, the derivative of the function obtained in the L.H.S of (13) representing the symmetric K-L distance between ${\hat{p}}_{i} (t_{k, m}^{j})$ and ${\hat{r}}_{i} (t_{k, m}^{j})$ is calculated and set to zero. By simplifying either derivatives, an optimum value of λ for each subband of a noisy speech frame can be obtained as

λ (k) = σ_{N} (k) \sqrt{2 (γ_{k} + γ_{k}^{2}) ln (\sqrt{1 + \frac{1}{γ_{k}}})},

(14)

where k is the subband index, σ_N(k) is the standard deviation of the noise in the k th subband (σ_N²(k) being its variance), and γ_k represents the segmental SNR defined as

γ_{k} = σ_{S}^{2} (k) / σ_{N}^{2} (k) .

(15)

Considering the facts that the threshold value λ(k) in (14) needs to be adjusted according to the input SNR and σ_N is inversely proportional to the input SNR, a modified version of the threshold λ(k) in each subband of a noisy speech frame can be derived as

λ (k) = [σ_{N} (k) / \sqrt{γ_{k}}] \sqrt{2 (γ_{k} + γ_{k}^{2}) ln (\sqrt{1 + \frac{1}{γ_{k}}})} .

(16)
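The two threshold expressions can be evaluated directly, as in the sketch below. The function names are hypothetical, and how σ_S(k) and σ_N(k) are estimated per subband is outside this example; the inputs are taken as given.

```python
import math

def threshold_eq14(sigma_n, gamma):
    """Eq. (14): sigma_n is the noise standard deviation, gamma = sigma_s^2 / sigma_n^2 > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return sigma_n * math.sqrt(
        2.0 * (gamma + gamma ** 2) * math.log(math.sqrt(1.0 + 1.0 / gamma))
    )

def threshold_eq16(sigma_n, gamma):
    """Eq. (16): the threshold of Eq. (14) scaled by 1 / sqrt(gamma)."""
    return threshold_eq14(sigma_n, gamma) / math.sqrt(gamma)

# At gamma = 1 (0 dB segmental SNR) both forms coincide and equal sigma_n * sqrt(2 ln 2).
print(threshold_eq14(1.0, 1.0), threshold_eq16(1.0, 1.0))
```

For γ_k above 1 the scaling in (16) lowers the threshold relative to (14), and for γ_k below 1 it raises it, which matches the stated aim of adjusting the threshold to the input SNR.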

In the nonspeech/silent subbands of a frame of noisy speech, the SKL divergence between the approximate Gaussian pdfs of the $t_{k, m}^{j}$ of the noisy speech and that of the $t_{k, m}^{j}$ of the noise is found to be nearly zero. An idea of speech/silent frame can thus be obtained based on the SKL divergence. Since in a silence frame only noise exists, a threshold value different from that used in a subband of a noisy speech frame should be selected for a subband of a silent frame of a noisy speech in order to remove the noise completely. Exploiting the facts above and using the threshold λ(k) derived in (16) for each subband of a noisy speech frame, two different threshold values suitable for a subband of a silent or speech frame are proposed to be chosen as

λ^{'} (k) = \begin{cases} \max (t_{k, m}^{j}), & SKL ({\hat{p}}_{i} (t_{k, m}^{j}), {\hat{q}}_{i} (t_{k, m}^{j})) \approx 0 \\ λ (k), & \text{otherwise} . \end{cases}

(17)

It is noteworthy that, in the context of enhancing speech under low levels of SNR, our proposed approach to determine the threshold value in a subband of a silent or speech frame is not only different but also more reasonable with simpler approximation and lesser computation in comparison to that described in[17].
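The selection rule in (17) amounts to a simple switch, sketched below. The tolerance `tol` that decides when the SKL divergence counts as approximately zero is a hypothetical parameter, since the text does not give a numerical value, and the function name is likewise illustrative.

```python
def select_threshold(te_coeffs, skl_noisy_vs_noise, lam, tol=1e-3):
    """Eq. (17): use max(t) in a noise-only subband, otherwise the threshold of Eq. (16).

    tol is an assumed cut-off for "SKL approximately zero".
    """
    if skl_noisy_vs_noise <= tol:
        return max(te_coeffs)
    return lam

te_coeffs = [0.2, 0.7, 0.4]
print(select_threshold(te_coeffs, 0.0, lam=0.3))   # silent subband
print(select_threshold(te_coeffs, 0.25, lam=0.3))  # speech subband
```

Choosing the maximum TE-operated coefficient as the threshold in a silent subband means that no coefficient in that subband can exceed it, which is consistent with the stated goal of removing the noise completely there.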

2.5 Denoising by thresholding

For denoising purposes, hard thresholding sets to zero the coefficients whose absolute value is below the threshold[37–39]. This ignores the fact that there may be noise coefficients that are bigger than the threshold value, thus resulting in time-frequency discontinuities in the enhanced speech spectrum. Unlike the hard thresholding function, the soft thresholding function handles signals in a different way by making smooth transitions between the treated and the deleted coefficients based on the threshold value[20, 37, 38]. Denoting the threshold determined by (17) as λ₁, the soft thresholding function can be applied on the m th WP coefficient of the k th subband $Y_{k, m}^{j}$ as

{({\hat{Y}}_{k, m}^{j})}_{S} = \begin{cases} | Y_{k, m}^{j} | - λ_{1} (k), & | Y_{k, m}^{j} | \geq λ_{1} (k) \\ 0, & | Y_{k, m}^{j} | < λ_{1} (k) . \end{cases}

(18)

The soft thresholding can be viewed as setting the components of the noise to zero and performing a magnitude subtraction on the speech plus noise components. It is evident that the soft thresholding eliminates the time-frequency discontinuity resulting in smoother signals, but it yields the estimated coefficients that are the WP coefficients $| Y_{k, m}^{j} |$ of the noisy speech shifted by an amount of λ₁(k). Employment of such a shift even when $| Y_{k, m}^{j} |$ stands way out of noise level creates unnecessary bias in the enhanced spectrum. The variance of the threshold values over the frames of the whole noisy speech also affects the enhanced spectrum.

In order to overcome the problems as mentioned above, in the semisoft thresholding function, the shifting by the amount of the threshold value is avoided[39]. Therefore, a semisoft thresholding function is preferred over the soft thresholding function with respect to the variance and bias of the estimated threshold value. By taking into account the advantages and shortcomings of all the thresholding functions, we apply a semisoft thresholding function on the WP coefficients of the noisy speech signal. By defining λ₂(k) as

λ_{2} (k) = \sqrt{2} λ_{1} (k),

(19)

the semisoft thresholding function is defined as

{\tilde{Y}}_{k, m}^{j} = \begin{cases} 0, & | Y_{k, m}^{j} | \leq λ_{1} (k) \\ Y_{k, m}^{j}, & | Y_{k, m}^{j} | > λ_{2} (k) \\ \operatorname{sgn} (Y_{k, m}^{j}) \frac{λ_{2} (k) (| Y_{k, m}^{j} | - λ_{1} (k))}{λ_{2} (k) - λ_{1} (k)}, & \text{otherwise}, \end{cases}

(20)

where ${\tilde{Y}}_{k, m}^{j}$ stands for the resulting semisoft thresholded WP coefficients.
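A minimal sketch of the semisoft rule for a single coefficient follows. It reads the middle branch of (20) as λ₂(|Y| − λ₁)/(λ₂ − λ₁), which is an assumption about the intended grouping; this is the form that makes the function continuous at both thresholds (zero at |Y| = λ₁ and |Y| at |Y| = λ₂). The function name `semisoft` is hypothetical.

```python
import math

def semisoft(y, lam1):
    """Semisoft thresholding of one WP coefficient, with lam2 = sqrt(2) * lam1 as in Eq. (19)."""
    lam2 = math.sqrt(2.0) * lam1
    a = abs(y)
    if a <= lam1:
        return 0.0  # treated as noise
    if a > lam2:
        return y    # kept unchanged, so no bias for large coefficients
    # Transition band: linear ramp from 0 at lam1 up to lam2 at lam2 (assumed grouping).
    return math.copysign(lam2 * (a - lam1) / (lam2 - lam1), y)

for y in (0.5, 1.2, -1.2, 2.0):
    print(y, semisoft(y, 1.0))
```

In contrast with the soft rule in (18), coefficients above λ₂(k) pass through without the constant shift by λ₁(k).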

2.6 Inverse wavelet packet transform

The enhanced speech frame is synthesized by performing the inverse WP transformation WP^-1 on the resulting thresholded WP coefficients ${\tilde{Y}}_{k, m}^{j}$

\hat{s} [n] = W P^{- 1} ({\tilde{Y}}_{k, m}^{j}),

(21)

where $\hat{s} [n]$ represents the enhanced speech frame. The final enhanced speech signal is reconstructed by using the standard overlap-and-add method.
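The overlap-and-add reconstruction can be sketched as below, using 512-sample Hamming-windowed frames with 50% overlap. The periodic form of the Hamming window and the division by the constant window sum are assumptions made for this illustration, and the helper names are hypothetical; the enhancement step itself is omitted so that the framing and reconstruction can be checked in isolation.

```python
import math

def periodic_hamming(length):
    """Periodic Hamming window; at 50% overlap shifted copies sum to 1.08."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / length) for n in range(length)]

def split_frames(signal, frame_len, hop):
    """Windowed analysis frames starting every hop samples."""
    window = periodic_hamming(frame_len)
    return [
        [signal[start + n] * window[n] for n in range(frame_len)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

def overlap_add(frames, hop):
    """Sum the frames back at their original positions."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[i * hop + n] += value
    return out

signal = [math.sin(0.05 * n) for n in range(2048)]
frames = split_frames(signal, frame_len=512, hop=256)
# The per-frame processing (WP transform, thresholding, inverse WP) would go here.
rebuilt = [v / 1.08 for v in overlap_add(frames, hop=256)]
```

With no processing applied, the interior samples (those covered by two frames) are recovered exactly; the first and last half frames are covered by only one window and would need separate handling.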

3 Simulation results

In this section, a number of simulations are carried out to evaluate the performance of the proposed method.

3.1 Simulation conditions

Real speech sentences from the NOIZEUS noisy speech corpus[36] are employed for the experiments, where the speech data is sampled at 8 kHz. Four different types of noise, namely white, car, pink, and multi-talker babble, are adopted from the NOISEX92[40] and NOIZEUS databases. Noisy speech at different SNR levels ranging from 15 to -15 dB is considered for our simulations.

In order to obtain overlapping analysis frames, a Hamming windowing operation is performed, where the size of each frame is 512 samples with 50% overlap between successive frames. A three-level WP decomposition tree with the db10 basis function is applied on the noisy speech frames, and the Teager energy operation is performed on the resulting WP coefficients. In the proposed method, for the implementation of WP decomposition, the 'wpdec' function of the Matlab wavelet toolbox is used, where in order to obtain optimal decomposition, the Shannon entropy criterion is employed. For the three-level WP transform, the noisy speech signal y[n] with frame length N = 512 samples is decomposed into eight subbands. For each subband (64 samples), a histogram is computed and the variance is estimated. By computing the thresholds λ₁(k) = λ′(k) and λ₂(k) from (17) and (19), respectively, a semisoft thresholding function is developed and applied on the WP coefficients of the noisy speech using (20).

3.2 Comparison metrics

Standard objective metrics, namely overall SNR improvement in decibels, Perceptual Evaluation of Speech Quality (PESQ), and Weighted Spectral Slope (WSS), are used for the evaluation of the proposed method[5, 41, 42]. In our simulation results, we have considered all 30 sentences of the NOIZEUS noisy speech corpus. We have taken into account the average result obtained from all 30 sentences for computing each of the objective metrics, namely SNR improvement in decibels, PESQ score, and WSS values. The proposed method is subjectively evaluated in terms of the spectrogram representations of the clean speech, noisy speech, and enhanced speech. Informal listening tests are also carried out, where the mean opinion scores (MOS) are evaluated in three dimensions, namely signal distortion (SIG), noise distortion (BAK), and overall quality (OVRL). The performance of our method is compared with some of the existing thresholding-based speech enhancement methods, such as Universal[20], Wavelet Packet Thresholding with Symmetric K-L Divergence (WTHSKL), and WPF[18] in both objective and subjective senses. In our method, while determining the threshold in (16), only time adaptation approach is incorporated through TE operation on WP coefficients as in the WPF method in[18] (time-adaptive approach), where threshold is adapted through time only and modulated depending on the speech or silent nature of the signal under an analysis frame. Unlike the time- and space-adaptive approach in[28], threshold value is not adapted through scales in our proposed method. Therefore, we found it more justified and fair to compare our proposed method with the WPF method. Apart from these methods, statistical model-based method (MMSE[9]), spectral subtractive method (spectral subtraction[6]), and Wiener filtering-type algorithm (Wiener Filtering[14]) are also included for the purpose of objective and subjective comparison. 
We have implemented the Universal, WTHSKL, and WPF methods independently using the parameters specified therein. For implementation of the MMSE, spectral subtraction, and Wiener filtering methods, we have used publicly available Matlab codes (MMSESTSA84, WienerScalart96, and SSBoll79) from the Matlab Central website (http://www.mathworks.com/matlabcentral/).
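For orientation, the overall SNR improvement can be computed as in the sketch below. It assumes the common definition of overall SNR as the ratio of clean-signal energy to error energy, with the improvement taken as output SNR minus input SNR; the exact formula used in the evaluation is not given in this section, so this is an illustrative reading, and the function names are hypothetical.

```python
import math

def snr_db(clean, estimate):
    """Overall SNR in dB: clean-signal energy over error energy (assumed definition)."""
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)

def snr_improvement_db(clean, noisy, enhanced):
    """Output SNR minus input SNR, both measured against the clean signal."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)

clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 1.1, -0.9]          # error of 0.1 per sample
enhanced = [1.01, -0.99, 1.01, -0.99]   # error of 0.01 per sample
print(round(snr_improvement_db(clean, noisy, enhanced), 6))
```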

3.3 Objective evaluation

3.3.1 Results on white noise-corrupted speech

The results for the semisoft thresholding function in terms of all the objective metrics, such as SNR improvement in decibels, PESQ, and WSS, obtained using the Universal, WTHSKL, WPF, and proposed methods for white noise-corrupted speech are presented in Figures 5 and 6 and in Table 1.

Table 1 Performance comparison of different methods in terms of WSS for white noise-corrupted speech
