An adaptive a priori SNR estimator for perceptual speech enhancement

Nahma, Lara; Yong, Pei Chee; Dam, Hai Huyen; Nordholm, Sven

doi:10.1186/s13636-019-0150-3

Research
Open access
Published: 07 June 2019

Lara Nahma ORCID: orcid.org/0000-0003-0223-9181¹,
Pei Chee Yong²,
Hai Huyen Dam¹ &
Sven Nordholm¹

EURASIP Journal on Audio, Speech, and Music Processing volume 2019, Article number: 7 (2019)


Abstract

In this paper, an adaptive averaging a priori SNR estimation employing critical band processing is proposed. The proposed method modifies the current decision-directed a priori SNR estimation to achieve faster tracking when SNR changes. The decision-directed (DD) estimator employs a fixed weighting with a value close to one, which makes it slow in following the onsets of speech utterances. The proposed SNR estimator provides a means to solve this issue by employing an adaptive weighting factor. This allows an improved tracking of onset changes in the speech signal. As a consequence, it results in better preservation of speech components. This adaptive technique ensures that the weighting between the modified decision-directed a priori estimate and the maximum likelihood a priori estimate is a function of the speech absence probability. The estimate of the speech absence probability is modeled by a sigmoid function. Furthermore, a critical band mapping for the short-time Fourier transform analysis-synthesis system is utilized in the speech enhancement to achieve less musical noise. In addition, to evaluate the ability of the a priori SNR estimation method in preserving speech components, we propose a modified objective measurement known as the modified Hamming distance. Evaluations are performed by utilizing both objective and subjective measurements. The experimental results show that the proposed method improves the speech quality under different noise conditions. Moreover, it maintains the advantage of the DD approach in eliminating the musical noise under different SNR conditions. The objective results are supported by subjective listening tests using 10 subjects (5 males and 5 females).

1 Introduction

Noise suppression and speech enhancement are essential techniques employed in many products, for instance, mobile phones, hearing aids, and assistive listening devices. In particular, hearable devices are poised to assist people who have difficulty hearing in social environments [1]. For noise suppression and speech enhancement to work in environments where acoustic noise becomes more intrusive, it is vital to maintain weak speech components while still balancing the amount of noise reduction. Accordingly, techniques that can enhance speech signals while preserving weak speech components under a large variety of acoustic scenarios are key to successful products [2–4]. In this context, it is important to consider not only the speech but also the quality of the noise after suppression. Unnatural sounding background noise is bothersome for users of hearable devices or hearing aids.

Traditionally, speech enhancement techniques have been utilizing the frequency domain for processing where the short-time Fourier transform (STFT) has been used as a tool to process the input data using frame-based over-sampling techniques [3, 5–7]. When deploying STFT, the bandwidth is constant for each frequency bin, which is not the case for the human auditory system. Thus, a natural extension has been to use human auditory models in speech enhancement to improve the speech quality and intelligibility [8–11].

The human auditory spectrum model consists of a bank of bandpass filters, which follows a spectral bark scale or the so-called critical bands [11, 12]. In [11], a standard subtractive speech enhancement method is presented to eliminate the musical artifacts in very noisy situations. The masking properties of the auditory system are utilized to compute the subtraction parameter. In [13], a spectral subtraction noise reduction method is proposed using a spatial weighting technique based on the inhibitory property of the auditory system, which results in improving the estimated speech while reducing the musical noise.

Speech enhancement algorithms calculate a gain function, which is in most cases a function of a posteriori signal to noise ratio (SNR) or a combination of a posteriori and a priori SNR [14]. One exemplary speech enhancement algorithm is the spectral subtraction (SS) method proposed by Boll [15]. This algorithm is the most commonly used mainly due to its straightforward implementation and low computational complexity. In this method, a clean speech estimate is obtained by subtracting an estimated noise power spectrum from the noisy speech power spectrum while keeping the phase of the degraded speech signal. The spectral subtraction method embeds erroneous estimation of noise statistics resulting in an annoying artifact in the estimated speech signal commonly known as musical noise, which can be masked using perceptual thresholds [11, 16].

In contrast, the minimum mean-square error log spectral amplitude (MMSE-LSA) estimator proposed by Ephraim [17] avoids the appearance of the musical noise artifact. This estimator uses an a priori SNR estimate based on the decision-directed (DD) approach, which involves a weighted sum of two terms: the a priori SNR estimate from the previous frame and the maximum likelihood (ML) SNR estimate from the current frame. This estimation technique reduces the variance of the a priori SNR estimates, particularly during noise frames, and as a result, the musical noise artifact is eliminated [18]. However, the emphasis on the previous frame in the DD estimation leads to a slow adaptation towards speech onsets and offsets. Moreover, as the DD approach depends on the a priori SNR estimate from the previous frame, an extra one-frame delay is incurred during speech transients, which results in a degradation of the speech quality [7].

The a priori SNR estimation algorithm has been improved in many ways, e.g., Breithaupt et al. [19] proposed the temporal cepstrum smoothing (TCS) technique for speech enhancement. This technique improves the accuracy of the a priori SNR estimation by exploiting the a priori knowledge of speech and noise signal and selectively smoothing the maximum likelihood estimate in the cepstral domain. This allows the preservation of speech components while simultaneously achieving high noise attenuation. However, this method has limitations under low SNR conditions where the noise components cannot be separated from the speech components. Suhadi [20] suggested a data-driven technique employing two trained neural networks to estimate the a priori SNR, one for speech and one for noise. The use of neural networks requires a substantial training process for estimating the a priori SNR since the proposed method is not a robust estimator under different noise environments, which results in a degradation of the estimated speech quality under non-stationary noise conditions. Plapous [21] presented a two-step noise reduction technique (TSNR) to refine the estimation of the a priori SNR and increase the estimator adaptation speed. The main disadvantage when using this TSNR method is its sensitivity to the selection of the gain function. A different choice of the gain function gives very different estimation results [22, 23]. A modified decision-directed approach (MDD) proposed by Yong et al. [7] matches the current noisy speech spectrum with the current a priori SNR estimate rather than the delayed one. This reduces the one frame delay for speech onsets, but the tracking speed of the a priori SNR estimation is still slow compared to the true SNR change since the recursive smoothing factor is constant and close to one.

In this paper, we extend the research in [24], which includes an improved a priori SNR estimation based on modeling the speech absence probability with a sigmoid function. This sigmoid function was used to control the adaptation speed of the a priori SNR estimation. The sigmoid function operates as an adaptive weighting function that emphasizes either the DD term or the ML estimate in the a priori SNR estimate update. The rationale used when developing the weighting function was that, for positive SNR values, the a priori and the a posteriori SNR estimates are almost the same. Accordingly, by adding flexibility to select either of the two terms for SNR values below or above a certain threshold, we provide a way to emphasize both estimates. By utilizing a threshold and the sigmoid shape, an improved adaptation of the a priori SNR estimate is obtained.

The choice of gain function plays an important role since it is included in the DD estimation, resulting in different performance. Previously, only the Wiener gain function was considered. In this work, we propose an improved a priori SNR estimation [24] using different gain functions, namely, the Wiener filter (WF) [25] and the MMSE-LSA gain function [17]. A new evaluation technique referred to as the modified Hamming distance (HD) has also been proposed. In common objective measures, weak speech components are not emphasized since they have small amplitudes or small energy. The proposed modified Hamming distance is based on voice activity detection (VAD) decision information in each time-frequency bin. Since this information is binary, data scaling that depends on amplitude or energy is avoided; thus, we can compare to ideal VAD decisions. Also in this work, we utilize a critical band mapping for an STFT analysis-resynthesis system in the speech enhancement framework for human perceptual processing. Moreover, the utilized critical band processing helps to reduce computational complexity since it combines K FFT frequency bins into I critical bands instead (I ≪ K).

The remainder of this paper is organized as follows. In Section 2, a single-channel speech enhancement framework with critical band processing is developed. Section 3 shows the decision-directed based a priori SNR estimators. Section 4 develops the proposed a priori SNR estimation approach together with an investigation on the effect of the key parameters of the sigmoid function. Section 5 demonstrates the evaluation methodology. Section 6 presents the experimental results and discussion while Section 7 concludes the paper.

2 Critical band speech enhancement

A natural way to process speech signals is to use a perceptual filter bank [26]. By employing the inhibitory property of the human auditory system and combining with the speech enhancement algorithms [11], the performance of the speech processing system can be improved. There are many perceptual frequency warping scales used for speech processing [27, 28]. In this work, we employed a bark scale filter bank with a non-uniform resolution and incorporated it in a speech enhancement framework with the proposed a priori SNR estimation method. We assume that the noise and speech are additive and uncorrelated; thus, the noisy speech signal is given by

$$ y(n)=s(n)+v(n) $$

(1)

where s(n) and v(n) denote the clean speech signal and noise, respectively. The block diagram for critical band speech processing is described in Fig. 1. In the sequel, we will outline the details of the processing.

In the first step, the noisy signal is transformed to the time-frequency domain by applying STFT with K frequency bins

$$ Y(k,m)=S(k,m)+V(k,m) $$

(2)

where k is the frequency bin index and m is the time frame index. Then, in order to transform the output from the STFT Y(k,m) into the critical band, an analytical function is used to express the transformation between frequency f (in Hz) and critical band z (in bark scale), which is defined by [29]

$$ f=600\sinh\left(\frac{z}{6}\right). $$

(3)
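
As a quick illustration of Eq. (3), the mapping and its inverse can be sketched in a few lines of Python. The function names are hypothetical, the closed-form inverse z = 6 asinh(f/600) is derived here from Eq. (3) rather than quoted from the paper, and the 16 kHz sampling rate is only an assumed example.

```python
import math

def bark_to_hz(z):
    """Eq. (3): frequency in Hz for a critical band rate z in bark."""
    return 600.0 * math.sinh(z / 6.0)

def hz_to_bark(f_hz):
    """Inverse of Eq. (3): z = 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)

# Assumed example: a 16 kHz sampling rate gives a Nyquist frequency of 8 kHz.
nyquist_bark = hz_to_bark(8000.0)
```

Under this assumed sampling rate, the Nyquist frequency maps to a little under 20 bark, which gives a feel for how few critical bands are needed compared with the number of FFT bins.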

The noisy spectrum is expressed in terms of the critical band numbers i and frame index m by combining the FFT frequency bins into I critical bands as follows:

$$ Y_{\text{CB}}(\mathit{i},m)=\sum\limits_{k=1}^{K/2+1}M(i,k)\left|Y(k,m)\right| $$

(4)

where i=[1,2,⋯,I]. The number of critical bands I is chosen with respect to the bark scale [29]. Here, M(i,k) are the critical bandpass filter coefficients, which are defined as

$$ {}M(i,k)= \left\{ \begin{array}{ll} 10^{(z(k)-z_{\mathrm{c}}(i)+0.5)} & z(k)< z_{\mathrm{c}}(i)-0.5\\ 1 & z_{\mathrm{c}}(i)-0.5<z(k)<z_{\mathrm{c}}(i)+0.5\\ 10^{-2.5(z(k)-z_{\mathrm{c}}(i)-0.5)} & z(k)>z_{\mathrm{c}}(i)+0.5 \end{array}\right. $$

(5)

where z_c(i) represents the center frequency of the ith critical band. A MATLAB implementation of the bark scale critical band processing is described in [30]. The main task of the speech enhancement scheme is to enhance the speech signal by applying a specific spectral gain function to the noisy spectrum. Let G_CB(m) denote the gain vector in the critical band domain for the mth frame

$$\begin{array}{*{20}l} \mathbf{G}_{\text{CB}}(m)&=[G_{\text{CB}}(1,m),G_{\text{CB}}(2,m),...,G_{\text{CB}}(I,m)]^{T}. \end{array} $$

Many different gain functions have been proposed in the literature. Common gain functions can often be expressed as a function of the a priori SNR ξ(i,m), such as the WF gain, which can be defined as [25]

$$ G_{\mathrm{WF,CB}}(i,m)=\frac{\xi(i,m)}{1+\xi(i,m)} $$

(6)

with ξ(i,m) denoting the a priori signal-to-noise ratio (SNR), which is defined as

$$ \xi(i,m)=\frac{\lambda_{s}(i,m)}{\lambda_{v}(i,m)} $$

(7)

where λ_v(i,m)=E[|V(i,m)|²] and λ_s(i,m)=E[|S(i,m)|²] are the power spectral densities of the noise and the clean speech, respectively.

MMSE-LSA [17] is another widely used speech estimator, which is obtained by minimizing the mean square error between the logarithms of the original and enhanced speech spectral amplitudes. Its gain can be defined as a function of the a priori SNR and the a posteriori SNR, given by

$$ G_{\mathrm{LSA,CB}}(i,m)=\frac{\xi(i,m)}{1+\xi(i,m)}\exp\left\{ \frac{1}{2}{\int\limits_{\nu_{k}}^{\infty}}\frac{e^{-t}}{t}dt\right\} $$

(8)

where the lower bound ν_k of the integral is given by

$$ \nu_{k}=\frac{\xi(i,m)}{1+\xi(i,m)}\gamma(i,m) $$

(9)

and γ(i,m) denotes the a posteriori SNR defined as

$$ \gamma(i,m)=\frac{\left|Y_{\text{CB}}(i,m)\right|^{2}}{\lambda_{v}(i,m)}. $$

(10)
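
The two gain functions in Eqs. (6) and (8) to (9) can be sketched for a single critical band as follows. This is an illustrative sketch, not the authors' implementation: the helper `exp1` is a hand-written evaluation of the exponential integral in Eq. (8) (a power series for small arguments and a continued fraction otherwise) so that the example needs only the standard library, and the small floor on ν is an assumption added to avoid evaluating the integral at zero. In practice a library routine such as `scipy.special.exp1` could be used instead.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp1(x, tol=1e-12, max_iter=200):
    """Exponential integral E1(x) = int_x^inf exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("E1(x) is evaluated here for x > 0 only")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -EULER_GAMMA - math.log(x)
        term = 1.0
        for k in range(1, max_iter):
            term *= -x / k              # term now holds (-x)^k / k!
            total -= term / k
            if abs(term / k) < tol:
                break
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h * math.exp(-x)

def wiener_gain(xi):
    """Eq. (6): Wiener gain as a function of the a priori SNR."""
    return xi / (1.0 + xi)

def lsa_gain(xi, gamma, nu_floor=1e-10):
    """Eqs. (8)-(9): Wiener gain multiplied by exp(E1(nu) / 2)."""
    nu = max(xi / (1.0 + xi) * gamma, nu_floor)  # assumed floor, not in the paper
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(nu))
```

Since the exponential factor in Eq. (8) is never smaller than one, the MMSE-LSA gain is always at least as large as the Wiener gain for the same a priori SNR.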

Once the gain vector G_CB(m) in a critical band is calculated, it is interpolated back to the gain vector in the STFT domain G(m) through an interpolation matrix A,

$$ \mathbf{G}(m)=\mathbf{A}\mathbf{G}_{\text{CB}}(m) $$

(11)

where the A matrix can be defined by least square approximation as A=(M^TM)⁻¹M^T and M denotes the matrix with elements M(i,k). From empirical findings, better results are obtained by simplifying the reconstruction matrix as

$$\mathbf{A}=\text{diag}\left(\frac{1}{\mathbf{1}\mathbf{M}}\right)\mathbf{M}^{T} $$

where 1 is a 1×I row vector of ones. The estimated speech in the STFT domain is then reconstructed by applying the interpolated gain function G(k,m) to the noisy signal in Eq. (2)

$$ \hat{S}(k,m)=G(k,m)Y(k,m). $$

(12)
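
The critical band mapping of Eqs. (4) and (5) and the simplified interpolation of Eq. (11) can be sketched as below. Several choices here are assumptions made only for illustration: band centres are placed at integer bark values, bins exactly on a band edge (where Eq. (5) is not defined) are given unit weight, and the FFT size, sampling rate, and number of bands in the usage lines are hypothetical. The function names are not from the paper.

```python
import math

def hz_to_bark(f_hz):
    """Inverse of Eq. (3)."""
    return 6.0 * math.asinh(f_hz / 600.0)

def critical_band_matrix(n_fft, fs, n_bands):
    """M(i,k) from Eq. (5); rows are bands, columns are bins 0..n_fft/2."""
    z_bins = [hz_to_bark(k * fs / n_fft) for k in range(n_fft // 2 + 1)]
    matrix = []
    for i in range(1, n_bands + 1):
        z_c = float(i)                  # assumed centre of band i, in bark
        row = []
        for z in z_bins:
            d = z - z_c
            if d < -0.5:
                row.append(10.0 ** (d + 0.5))
            elif d > 0.5:
                row.append(10.0 ** (-2.5 * (d - 0.5)))
            else:
                row.append(1.0)         # flat top, including the band edges
        matrix.append(row)
    return matrix

def to_critical_bands(matrix, magnitudes):
    """Eq. (4): weighted sum of STFT magnitudes in each band."""
    return [sum(w * y for w, y in zip(row, magnitudes)) for row in matrix]

def interpolate_gain(matrix, gain_cb):
    """Eq. (11) with A = diag(1 / (1 M)) M^T, i.e. column-normalised weights."""
    gains = []
    for k in range(len(matrix[0])):
        col_sum = sum(row[k] for row in matrix)
        gains.append(sum(row[k] * g for row, g in zip(matrix, gain_cb)) / col_sum)
    return gains

# Hypothetical configuration: 512-point FFT, 16 kHz sampling, 20 bands.
M = critical_band_matrix(n_fft=512, fs=16000, n_bands=20)
flat_gain = interpolate_gain(M, [1.0] * 20)
```

With this normalisation each STFT-bin gain is a weighted average of the band gains, so a constant gain vector is mapped to the same constant and the interpolated gain never leaves the range of the band gains. Eq. (12) is then an element-wise product of the interpolated gain with Y(k,m).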

Finally, the speech estimate is obtained by taking the inverse STFT of the enhanced speech and using the overlap-add method

$$ \hat{s}(n)=\text{ISTFT}\left(\hat{S}(k,m)\right). $$

(13)

3 Conventional a priori SNR estimation

In many speech enhancement algorithms, a priori SNR estimation is a dominant part of the gain function calculation, as in Eqs. (6) and (8). Inaccuracies in the estimation of the a priori SNR can lead to audible speech distortion and musical noise. The state-of-the-art method to estimate the a priori SNR from noisy speech while avoiding musical noise is the DD approach [31]. In this method, the a priori SNR estimate is expressed as a weighted average of the amplitude estimate at the previous frame and the maximum likelihood estimate of the a priori SNR at the current frame. This method is defined by

$$ {}\hat{\xi}_{\text{DD}}(i,m)=\beta\frac{|\hat{S}(i,m-1)|^{2}}{\hat{\lambda}_{\mathrm{v}}(i,m-1)}+(1-\beta)P\left[\hat{\gamma}(i,m)-1\right] $$

(14)

where $\hat {S}(i,m-1)$ and $\hat {\lambda }_{\mathrm {v}}(i,m-1)$ denote the amplitude estimate and the noise estimate at the previous frame, respectively. P[·] denotes half-wave rectification, which keeps the a priori SNR estimate non-negative, and 0<β<1 denotes a weighting factor that controls the trade-off between the a priori SNR from the previous frame and the a posteriori SNR at the current frame. The weighting factor can be defined as

$$ {\beta=\text{exp}(-R/f_{s}t_{s})} $$

(15)

where R is the frame rate, and t_s and f_s denote the time averaging constant and the sampling frequency, respectively. By setting the weighting factor close to 1, two different behaviors of the a priori SNR estimation can be observed, as explained in [18]. In the noise frames, the a priori SNR estimate corresponds to a scaled version of the a posteriori SNR since the second term of the DD approach is equal to zero. Thus, the a priori SNR estimate can be expressed by

$$\begin{array}{@{}rcl@{}} \hat{\xi}_{\text{DD}}^{\downarrow}(i,m)\approx\beta G_{\text{CB}}^{2}(i,m-1)\hat{\gamma}(i,m-1). \end{array} $$

This behavior reduces the variations in the a priori SNR estimate and thus reduces the amount of musical noise produced. In the frames with speech onsets, the a priori SNR follows the a posteriori SNR from the preceding frame as given by

$$\begin{array}{*{20}l} \hat{\xi}_{\text{DD}}^{\uparrow\uparrow}(i,m) &= \beta\frac{G_{\text{CB}}^{2}(i,m-1)|Y_{\text{CB}}(i,m-1)|^{2}}{\hat{\lambda}_{\mathrm{v}}(i,m)}\\ &\quad+(1-\beta)P\left[\hat{\gamma}(i,m)\!-1\right]\\ &\approx \beta G_{\text{CB}}^{2}(i,m-1)\hat{\gamma}(i,m-1)\\ &\quad+(1-\beta)P\left[\hat{\gamma}(i,m)-1\right] \end{array} $$

where the second term, which is the ML estimate, has only little impact on the estimation process since β is close to 1. In this case, the tracking of changes in the a priori SNR estimate is slow since the a priori SNR estimate mainly depends on the a posteriori SNR estimate from the previous frame. This behavior can lead to speech transient distortion. In order to overcome this problem, the authors in [7] proposed a modified decision-directed (MDD) approach. In that method, the a priori SNR estimate at the current frame is matched with the a posteriori SNR in the current frame instead of the previous one. Thus, the one-frame delay is reduced, which results in less speech distortion compared to the conventional DD approach. The MDD a priori SNR estimate is given by

$$\begin{array}{*{20}l} \hat{\xi}_{\text{MDD}}(i,m)&= \beta\frac{\!G_{\text{CB}}^{2}(i,m-1)\left|Y_{\text{CB}}(i,m)\right|^{2}}{\hat{\lambda}_{\mathrm{v}}(i,m)}\\&\quad+(1-\mathrm{\beta)}P\!\left[\hat{\gamma}(i,m)-1\right]. \end{array} $$

(16)
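
The difference between Eq. (14) and Eq. (16) at a speech onset can be illustrated with a small scalar sketch. The function names and the numbers in the usage lines are hypothetical, the previous amplitude estimate is written as G_CB(i,m−1)|Y_CB(i,m−1)| as in the expansion above, and the instantaneous (unsmoothed) power is used for simplicity.

```python
def ml_snr(gamma):
    """Half-wave rectified ML estimate P[gamma - 1]."""
    return max(gamma - 1.0, 0.0)

def dd_snr(prev_gain, prev_noisy_power, prev_noise_psd, gamma, beta=0.98):
    """Eq. (14): the first term uses the previous frame's speech estimate."""
    prev_speech_power = prev_gain ** 2 * prev_noisy_power
    return beta * prev_speech_power / prev_noise_psd + (1.0 - beta) * ml_snr(gamma)

def mdd_snr(prev_gain, noisy_power, noise_psd, gamma, beta=0.98):
    """Eq. (16): the previous gain is applied to the current noisy power."""
    return beta * prev_gain ** 2 * noisy_power / noise_psd + (1.0 - beta) * ml_snr(gamma)

# Hypothetical onset: noise-only previous frame, then a 10x jump in power.
noise_psd = 1.0
prev_gain = 0.1
prev_power, cur_power = 1.0, 10.0
gamma = cur_power / noise_psd

xi_dd = dd_snr(prev_gain, prev_power, noise_psd, gamma)
xi_mdd = mdd_snr(prev_gain, cur_power, noise_psd, gamma)
```

In this toy case the MDD estimate reacts more strongly than the DD estimate, yet both remain far below the ML value of 9 because β is fixed close to one.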

In addition, to maintain the advantage of the DD approach in eliminating the musical noise, the squared magnitude of the noisy signal is smoothed using a first-order recursive smoothing procedure, as given in [7], to reduce the variance of the a priori SNR estimate. The first-order recursive averaging of the noisy signal is given by

$$ \lambda_{y}(i,m)=\alpha_{y}\lambda_{y}(i,m-1)+(1-\alpha_{y})\left|Y_{\text{CB}}(i,m)\right|^{2} $$

(17)

where α_y is a smoothing constant. The smoothed estimate λ_y(i,m) replaces the instantaneous power |Y_CB(i,m)|² in the a posteriori SNR of Eq. (10).

4 Proposed a priori SNR estimation

The drawback of the MDD approach is that the fixed weighting factor β in Eq. (15) reduces the influence of the second term on the a priori SNR update, resulting in a scaled-down a priori SNR estimate when compared to the true a priori SNR. In light of this, we can conclude that the fixed weighting factor β gives low variability of the gain function during noise-only periods but does not provide a fast change of the gain function at the onset of a speech utterance. Thus, it is desirable to replace the fixed weighting factor β with an adaptive weighting factor β(i,m).

Recognizing that the speech absence probability is key to the weighting in Eq. (16), we model the speech absence probability based on a sigmoid function. As a remark, if the cumulative distribution function (CDF) is a sigmoid function, the probability density function (pdf) is similar to a Gaussian pdf but with heavier tails, which is plausible for speech applications. The sigmoid has two parameters: σ, which controls the transition speed, and ρ, which determines the threshold between active speech and noise [32]. The selection of these parameter values is based on the observation that the a priori SNR equals the a posteriori SNR for high SNRs. An adaptive weighting function $\hat {\beta }(i,m)$ is proposed based on the a posteriori SNR and is given by

$$ \hat{\beta}(i,m)=\frac{\beta_{0}}{1+\exp[-\sigma(\tilde{\gamma}(i,m)-\rho)]} $$

(18)

where β₀ is a constant slightly larger than β. The modified a priori SNR estimation approach is then defined by

$$\begin{array}{*{20}l} \hat{\xi}_{\text{prop}}(i,m)&=\hat{\beta}(i,m)\frac{G_{\text{CB}}^{2}(i,m-1)\left|Y_{\text{CB}}(i,m)\right|^{2}}{\hat{\lambda}_{v}(i,m)}\\&\quad+\!(1-\hat{\beta}(i,m))P\left[\tilde{\gamma}(i,m)-1\right] \end{array} $$

(19)

where $\tilde {\gamma }(i,m)$ is the a posteriori SNR estimate employing the smoothed estimate of the noisy speech from Eq. (17). Figure 2 describes the computation of the gain function by using the proposed method with an adaptive weighting function. In the following, we investigate the effect of two parameters σ and ρ on the proposed adaptive weighting function $\hat {\beta }(i,m)$.

To retain a similar property as a constant weighting factor β for speech-only and noise-only frames, we impose constraints on $\hat {\beta }(i,m)$ as:

$$ \hat{\beta}(i,m)= \left\{ \begin{array}{ll} \beta, & \text{for noise-only frames or when } \tilde{\gamma}(i,m)=1\\ 1-\beta, & \text{for speech-only frames or when } \tilde{\gamma}(i,m)=\gamma_{u},\ \gamma_{u}\gg 1 \end{array} \right. $$

(20)

which lead to

$$ \left\{ \begin{array}{lll} \frac{\beta_{0}}{1+\exp\left(-\sigma\left(1-\rho\right)\right)} &=&\beta \\[.2cm] \frac{\beta_{0}}{1+\exp\left(-\sigma\left(\gamma_{u}-\rho\right)\right)} &=&1-\beta \end{array}\right. $$

(21)

or

$$ \left\{ \begin{array}{lll} \sigma\left(1-\rho\right)&=&-\ln\left(\frac{\beta_{0}}{\beta}-1\right)\\[.2cm] \sigma\left(\gamma_{u}-\rho\right)&=&-\ln\left(\frac{\beta_{0}}{1-\beta}-1\right). \end{array}\right. $$

(22)

We now calculate the parameters σ and ρ directly for different levels of γ_u. From Eq. (22), we have

$$ \frac{1-\rho}{\gamma_{u}-\rho} = \frac{\ln\left(\frac{\beta_{0}}{\beta}-1\right)}{\ln\left(\frac{\beta_{0}}{1-\beta}-1\right)}. $$

(23)

As such, the parameter ρ can be obtained from γ_u as

$$\begin{array}{*{20}l} \rho=\frac{1-\gamma_{u} \frac{\ln\left(\frac{\beta_{0}}{\beta}-1\right)}{\ln\left(\frac{\beta_{0}}{1-\beta}-1\right)}}{1-\frac{\ln\left(\frac{\beta_{0}}{\beta}-1\right)}{\ln\left(\frac{\beta_{0}}{1-\beta}-1\right)}}. \end{array} $$

(24)

The parameter σ can be calculated as

$$ \sigma=\frac{-\ln\left(\frac{\beta_{0}}{\beta}-1\right)}{1-\rho}. $$

(25)
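
Eqs. (24) and (25) can be checked numerically with the short sketch below. The function names are hypothetical, and converting γ_u from dB with 10^(dB/10), i.e. treating it as a power ratio, is an assumption made here.

```python
import math

def design_sigmoid(gamma_u_db, beta=0.98, beta0=0.983):
    """Solve Eqs. (24) and (25) for (sigma, rho) given gamma_u in dB."""
    gamma_u = 10.0 ** (gamma_u_db / 10.0)       # assumed dB-to-linear conversion
    a = math.log(beta0 / beta - 1.0)
    b = math.log(beta0 / (1.0 - beta) - 1.0)
    ratio = a / b
    rho = (1.0 - gamma_u * ratio) / (1.0 - ratio)   # Eq. (24)
    sigma = -a / (1.0 - rho)                        # Eq. (25)
    return sigma, rho

def adaptive_beta(gamma_s, sigma, rho, beta0=0.983):
    """Eq. (18), with the exponent clipped to avoid floating-point overflow."""
    x = min(-sigma * (gamma_s - rho), 700.0)
    return beta0 / (1.0 + math.exp(x))

sigma, rho = design_sigmoid(5.0)
beta_noise = adaptive_beta(1.0, sigma, rho)             # constraint at gamma = 1
beta_speech = adaptive_beta(10.0 ** 0.5, sigma, rho)    # constraint at gamma_u
```

With β=0.98 and β₀=0.983, the 5 dB design evaluates to σ≈−4.469 and ρ≈2.295, and the resulting sigmoid returns β at γ̃=1 and 1−β at γ̃=γ_u, as required by Eq. (20). Because σ is negative, the weighting factor decreases as the a posteriori SNR grows.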

Figure 3 shows the pdf of the a posteriori SNR for different noise types for β=0.98 and β₀=0.983, mapped with different adaptive smoothing factors calculated at several a posteriori SNR values γ_u: (i) at γ_u=5 dB SNR with σ=−4.469, ρ=2.295; (ii) at γ_u=7 dB SNR with σ=−2.408, ρ=3.402; (iii) at γ_u=9 dB SNR with σ=−1.391, ρ=5.159; and (iv) at γ_u=15 dB SNR with σ=−0.315, ρ=19.344. Adaptive smoothing factors with different parameters (slopes and means) can control the trade-off between the musical noise and the ability to preserve speech components. In the pink noise case, the SNR estimate during noise-only periods is distributed approximately between 0 and 1. According to Eq. (20), the adaptive smoothing factor is approximately β during this period to reduce the SNR variance. This can be noted from the figure (first plot on the left), where the adaptive smoothing factor is almost 0.983, which explains the ability of the proposed method to maintain the advantage of the conventional decision-directed method in reducing musical noise at low SNRs. Moreover, in the factory noise case, where the SNR estimate is distributed between 0 and 2 during noise-only periods, the proposed smoothing factors designed at γ_u=9 dB and γ_u=15 dB reach the imposed constraint (0.983) over this noise-only range, whereas the adaptive factors designed at γ_u=5 dB and γ_u=7 dB are lower than 0.983 during noise periods, which leads to an increase in musical noise.

For the babble noise case, the figure on the left shows the pdf of the a posteriori SNR estimate during a noise-only period. It can be observed that the pdf has a large spread because of the non-stationary character of the babble noise, which means that an adaptive smoothing factor designed at a higher a posteriori SNR γ_u is required to reduce the SNR variance during noise-only frames and thereby reduce the effect of musical noise. From the figure, it can be clearly noted that the adaptive smoothing factor designed at γ_u=15 dB is the best among the designed factors since it attains a higher value over the a posteriori SNR distribution during the noise-only frames.

In addition, it can be noted that the weighting factor decreases monotonically as the a posteriori SNR γ increases. During the noise frames, γ takes small values. Consequently, the resulting weighting factor $\hat {\beta }(i,m)$ is close to 1, which means that the proposed method behaves like the DD and the MDD methods. This explains the ability of the proposed method to maintain the advantage of the DD method in reducing musical noise at low SNRs. Since the second term is zero, the a priori SNR estimate in noise frames will be given by

$$ \hat{\xi}_{\text{prop}}^{\downarrow}(i,m)=\hat{\beta}(i,m)G_{\text{CB}}^{2}(i,m-1)\hat{\gamma}(i,m). $$

(26)

During speech activity frames, the resulting weighting factor takes values close to 0. In that scenario, the first term of Eq. (19) is almost negligible, and the a priori SNR estimate in speech activity frames will correspond to a smoothed version of the maximum likelihood estimate as given by

$$ \hat{\xi}_{\text{prop}}^{\uparrow\uparrow}(i,m)=(1-\hat{\beta}(i,m))P[{\tilde\gamma}(i,m)-1]. $$

(27)

During a speech transition, the weighting factor decreases with each increment of the instantaneous SNR. As a consequence, the a priori SNR estimation corresponds to a combination of the first and second terms in Eq. (19) as given by

$$\begin{array}{*{20}l} \hat{\xi}_{\text{prop}}^{\uparrow}(i,m)&=\hat{\beta}(i,m)G_{\text{CB}}^{2}(i,m-1)\hat{\gamma}(i,m)\\&\quad+(1-\hat{\beta}(i,m))P\!\left[\tilde{\gamma}(i,m)-1\right]. \end{array} $$

(28)

From Eq. (19), it can be noticed that the second term has a varying impact on the a priori SNR update depending on the instantaneous SNR estimate. It is here that the proposed method makes a difference in tracking abrupt SNR changes. The apparent result is that more speech components are preserved and the speech transient distortion is reduced.
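
The frame-by-frame behavior described in this section can be illustrated for a single critical band with the sketch below, which chains Eqs. (17), (18), (19), and the Wiener gain of Eq. (6). It is a minimal sketch under stated assumptions and not the authors' implementation: the noise PSD is taken as known and constant, the values of `alpha_y` and the floor `xi_min` are hypothetical, the initial gain is set to one, and σ and ρ are the rounded values reported for the 5 dB design.

```python
import math

def adaptive_beta(gamma_s, sigma, rho, beta0):
    """Eq. (18), with the exponent clipped to avoid floating-point overflow."""
    x = min(-sigma * (gamma_s - rho), 700.0)
    return beta0 / (1.0 + math.exp(x))

def track_band(noisy_power, noise_psd, sigma=-4.469, rho=2.295,
               beta0=0.983, alpha_y=0.5, xi_min=1e-3):
    """Return per-frame (a priori SNR, weighting factor) for one band."""
    lam_y = noisy_power[0]      # assumed initialisation of the smoothed power
    gain_prev = 1.0             # assumed initial gain
    xi_track, beta_track = [], []
    for power in noisy_power:
        lam_y = alpha_y * lam_y + (1.0 - alpha_y) * power       # Eq. (17)
        gamma_s = lam_y / noise_psd                             # smoothed a posteriori SNR
        beta_hat = adaptive_beta(gamma_s, sigma, rho, beta0)    # Eq. (18)
        xi = (beta_hat * gain_prev ** 2 * power / noise_psd
              + (1.0 - beta_hat) * max(gamma_s - 1.0, 0.0))     # Eq. (19)
        xi = max(xi, xi_min)                                    # assumed lower bound
        gain_prev = xi / (1.0 + xi)                             # Eq. (6)
        xi_track.append(xi)
        beta_track.append(beta_hat)
    return xi_track, beta_track

# Hypothetical band: five noise-only frames followed by a strong onset.
powers = [1.0] * 5 + [30.0] * 5
xi_track, beta_track = track_band(powers, noise_psd=1.0)
```

In this toy run the weighting factor stays near 0.98 while only noise is present and drops to nearly zero in the first onset frame, so the estimate jumps directly to the smoothed ML term instead of ramping up over many frames.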

5 Evaluation methodology

Speech quality evaluation can be classified into two categories: objective measurement and subjective measurement [3]. The first category is based on a mathematical comparison between the original and the enhanced speech signals. Many objective measurements have been proposed in the literature, such as the perceptual evaluation of speech quality measure (PESQ) [33, 34], the segmental SNR measure SNR_seg [35, 36], and the kurtosis ratio measure (KurtR) [37]. In addition, we propose a new evaluation method based on the Hamming distance as a speech preservation measure. The Hamming distance is a measure that takes into account whether speech is present at each time-frequency point. By measuring the difference between a clean speech binary mask and a processed speech binary mask, the measure takes into account the presence of speech in each time-frequency bin without amplitude weighting.

The perceptual evaluation of speech quality measure (PESQ) is the speech quality assessment recommended by ITU-T P.862 for its ability to predict the speech quality with a high correlation to subjective listening tests [38]. The PESQ implementation first estimates the bark spectra of the input and the degraded signals using a perceptual model in order to compute the loudness spectra, and then compares them to predict the perceived quality of the degraded signal. This objective means of quality assessment is expressed in terms of mean opinion scores (MOS), measured from 1 to 5, where higher scores indicate higher quality. Here, we use the implementation provided by Loizou [3].

The time domain-based segmental SNR is one of the most widely used objective measures for evaluating the performance of speech enhancement algorithms. It is formed by averaging frame-level SNR estimates [36], as given by

$$ {}\text{SNR}_{\text{seg}} = \frac{10}{M}\sum\limits_{m=0}^{M-1}\log_{10}\frac{\left\Vert \mathbf{s}(m)\right\Vert^{2}}{\left\Vert \mathbf{s}(m)-\hat{\mathbf{s}}(m)\right\Vert^{2}} $$

(29)

where M denotes the number of frames, while $\hat {\mathbf {s}}(m)$ and s(m) are the estimated and original speech vectors, respectively, in time domain. The segmental SNR values are limited in the range of [−10,35]dB in order to exclude frames with no speech.
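
Eq. (29) with the stated limits can be sketched as follows. The function name, the non-overlapping framing, and the small `eps` guard are assumptions; clamping each frame value to [−10, 35] dB is one common reading of the limiting step and may differ in detail from the implementation used for the reported results.

```python
import math

def segmental_snr(clean, enhanced, frame_len, lo_db=-10.0, hi_db=35.0):
    """Average of per-frame SNRs in dB, each limited to [lo_db, hi_db]."""
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    eps = 1e-12     # assumed guard against division by zero and log(0)
    total = 0.0
    for m in range(n_frames):
        start, stop = m * frame_len, (m + 1) * frame_len
        sig = sum(x * x for x in clean[start:stop])
        err = sum((x - y) ** 2
                  for x, y in zip(clean[start:stop], enhanced[start:stop]))
        snr_db = 10.0 * math.log10((sig + eps) / (err + eps))
        total += min(max(snr_db, lo_db), hi_db)
    return total / n_frames

# Hypothetical signals: the estimate is the clean signal scaled by 0.9.
clean = [1.0, -1.0] * 8
enhanced = [0.9 * x for x in clean]
snr_scaled = segmental_snr(clean, enhanced, frame_len=4)
snr_perfect = segmental_snr(clean, clean, frame_len=4)
```

A uniform 10% amplitude error gives 20 dB in every frame, while a perfect reconstruction saturates at the 35 dB upper limit.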

In addition, to further investigate the performance of the a priori SNR estimation methods, we utilized the segmental speech preservation SNR_seg,sp and segmental noise reduction SNR_seg,noise as in [39]. These two measures give indications whether the improvement in SNR_seg is due to more noise reduction or more speech preservation and they can be defined as follows:

$$ {}\text{SNR}_{\mathrm{seg,sp}} = \frac{10}{M}\sum\limits_{m=0}^{M-1}\log_{10}\frac{\left\Vert \mathbf{s}(m)\right\Vert^{2}}{\left\Vert \mathbf{s}(m)-\tilde{\mathbf{s}}(m)\right\Vert^{2}} $$

(30)

$$ {}\text{SNR}_{\mathrm{seg,noise}} = \frac{10}{M}\sum\limits_{m=0}^{M-1}\log_{10}\frac{\left\Vert \mathbf{v}(m)\right\Vert^{2}}{\left\Vert \tilde{\mathbf{v}}(m)\right\Vert^{2}} $$

(31)

where $\tilde {\mathbf {s}}(m)$ and $\tilde {\mathbf {v}}(m)$ denote the mth frame of the filtered clean speech and noise signals with the same gain function used to enhance the noisy signal.

The kurtosis ratio is a mathematical measure used to quantify the musical noise. It is computed from the estimated speech signal and the noisy speech signal during noise frames only [37]. In order to detect speech absence and presence, a VAD decision was employed [40], where the two hypotheses $\mathcal {H}_{0}(k,m)$ and $\mathcal {H}_{1}(k,m)$ indicate speech absence and presence, respectively. The VAD decision is given by

$$ D(k,m)=\begin{cases} \begin{array}{cc} 1~\text{if}~& \mathcal{H}_{1}(k,m)\\ 0 ~\text{if}~& \mathcal{H}_{0}(k,m) \end{array}\end{cases} $$

(32)

and V(k,m)=1−D(k,m) denotes the activity detection of the noise periods. In order to avoid misdetection of speech components, the reference VAD decisions were generated at 50 dB global SNR. The kurtosis ratio can be defined by

$$ \text{KurtR}=E\left\{ \frac{\kappa_{\hat{s}}(k)}{\kappa_{y}(k)}\right\} $$

(33)

where $\kappa _{\hat {s}}(k)$ and κ_y(k) indicate the kurtosis of the enhanced signal and the noisy signal at the kth frequency bin, respectively. They are defined as follows:

$$ \kappa_{\hat{s}}(k)=\frac{\sum\limits_{m=1}^{M}\left|\hat{S_{s}}(k,m)V(k,m)\right|^{4}}{\left\{ \sum\limits_{m=1}^{M}\left|\hat{S_{s}}(k,m)V(k,m)\right|^{2}\right\}^{2}}-2 $$

(34)

and

$$ \kappa_{y}(k)=\frac{\sum\limits_{m=1}^{M}\left|Y(k,m)V(k,m)\right|^{4}}{\left\{ \sum\limits_{m=1}^{M}\left|Y(k,m)V(k,m)\right|^{2}\right\}^{2}}-2. $$

(35)

We propose an evaluation method, referred to as the modified Hamming distance, to measure the capability of a speech enhancement technique to preserve weak speech components. It is determined by the difference between the time-frequency points detected by a VAD decision [41] applied to the clean speech signal and to the estimated speech signal. The VAD decisions for the noisy speech signal and the estimated speech signal were evaluated only in frames marked as speech by the full-band VAD decisions for the clean speech. The rationale for developing this new measure is that the result is amplitude invariant, which is important when measuring weak speech components. Those speech components would otherwise be overshadowed by strong amplitude components. The modified Hamming distance measure is calculated as

$$ \text{HD} = \frac{2}{KM}\sum\limits_{m=1}^{M}\sum\limits_{k=1}^{K/2}\left(\hat{D}(k,m)\oplus D(k,m)\right), $$

(36)

where ⊕ denotes the logical exclusive OR, which returns 1 (true) when its two operands differ and 0 (false) otherwise. Here, $D(k,m)$ denotes the VAD decision for the clean signal and $\hat {D}(k,m)$ denotes the VAD decision for the estimated speech signal, conditioned on clean speech being detected. The latter is computed by first testing each sub-band independently for speech activity using the decision device and then applying further logic to reduce false alarms. A lower HD score indicates that more speech components are preserved.
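The computation in Eq. (36) can be sketched as follows. This is a minimal illustration in plain Python, assuming that both VAD masks have already been computed (including the conditioning on clean speech frames) and are given as nested lists of 0/1 decisions with $K/2$ rows and $M$ columns. The function name `modified_hamming_distance` is hypothetical.

```python
def modified_hamming_distance(d_clean, d_est):
    """Eq. (36): share of time-frequency points where the two VAD masks differ.

    d_clean, d_est: nested lists of 0/1 decisions, K/2 rows (frequency bins)
    by M columns (frames).
    """
    half_k = len(d_clean)
    num_frames = len(d_clean[0])
    # XOR is 1 exactly where the clean and estimated decisions disagree.
    mismatches = sum(
        dc ^ de
        for row_clean, row_est in zip(d_clean, d_est)
        for dc, de in zip(row_clean, row_est)
    )
    # 2 / (K * M) equals 1 / ((K/2) * M), the number of points summed.
    return mismatches / (half_k * num_frames)
```

Since only binary decisions enter the sum, the score lies between 0 (identical masks) and 1 (complementary masks) regardless of signal amplitude, which reflects the amplitude invariance noted above.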

The second category of evaluations is based on subjective listening tests, which are considered more accurate and reliable [42]. For the subjective listening test, 10 subjects (5 males and 5 females) were recruited to compare and rate the estimated speech signals, the noisy signals, and the clean speech signals under different SNR conditions. Three different utterances were concatenated for this test and corrupted with different noise sources at 10 dB SNR. In this paper, we used a pink noise source, which is a shaped and filtered version of white noise, and a babble noise source, which represents a group of people speaking in a canteen. The listening test was performed in a quiet office room using Beyerdynamic DT-880 headphones. A laptop was connected to the headphones through its USB interface via a Topping VX-1 amplifier to provide good audio quality and a consistent sound level. The sound clips were embedded in a PowerPoint document, which was also used for recording the results. The listeners were asked to listen to the sentences enhanced by the different methods (DD, MDD, and the proposed method) and to rate them on a scale from 1 to 5 in steps of 1. The rating takes into account three criteria: speech quality, background noise level, and musical noise level [3, 7]. The rating instructions are given in Table 1, which describes the scale for each criterion used in the listening test. This methodology helps to reduce the listeners' uncertainty when judging which speech enhancement method is better in terms of the aforementioned criteria, with reference to the clean speech signals and the noisy signals.

Table 1 Scale description of the listening test criteria [43]
