Assuming the noise signal, *v*(*n*), to be additive and uncorrelated with the clean speech, *s*(*n*), at sample *n*, the noisy speech, *y*(*n*), can be represented as:

$$ y(n)=s(n)+v(n), $$

(1)

Figure 1 shows the block diagram of the proposed noise PSD tracking system. First, a 32 ms Hamming window ([29], Chapter 7) with 50% overlap at an *f*_{s}=16 kHz sampling frequency is used to segment *y*(*n*) into frames, *y*(*n*,*l*).

The noisy speech in Eq. (1) can then be represented in terms of frames as:

$$ y(n,l)=s(n,l)+v(n,l), $$

(2)

where \(l\in \{0,1,2,\dots,L-1\}\) is the frame index, *L* is the total number of frames in an utterance, and *N* is the total number of samples in each frame, i.e. \(n\in \{0,1,2,\dots,N-1\}\).

The noisy speech, *y*(*n*) (Eq. (1)), is also analysed frame-wise using the short-time Fourier transform (STFT) as:

$$ Y(l,m) = S(l,m) + V(l,m), $$

(3)

where \(m\in \{0,1,2,\dots,511\}\) is the discrete-frequency index, and *Y*(*l*,*m*), *S*(*l*,*m*), and *V*(*l*,*m*) represent the complex-valued STFT coefficients of the noisy speech, clean speech, and noise signal, respectively. The same 32 ms Hamming window ([29], Chapter 7) with 50% overlap is used for both analysis and synthesis.
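As an illustration of the framing step, a minimal Python sketch (function name and test signal are our own, not from the paper) segments *y*(*n*) into 32 ms Hamming-windowed frames with 50% overlap at 16 kHz:

```python
import math

def frame_signal(y, fs=16000, frame_ms=32, overlap=0.5):
    """Split y(n) into overlapping, Hamming-windowed frames y(n, l)."""
    N = int(fs * frame_ms / 1000)          # 512 samples per frame at 16 kHz
    hop = int(N * (1 - overlap))           # 256-sample hop for 50% overlap
    # Standard Hamming window, w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    w = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]
    frames = []
    for start in range(0, len(y) - N + 1, hop):
        frames.append([y[start + n] * w[n] for n in range(N)])
    return frames                          # L = len(frames) frames of N samples

# One second of a 16 kHz tone yields (16000 - 512)//256 + 1 = 61 frames
y = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = frame_signal(y)
```

With *N*=512 samples per frame, the 512-point STFT in Eq. (3) yields the 257-point single-sided magnitude spectrum used below.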

The next steps of the proposed method are speech activity detection of the noisy speech frames, followed by tracking of the noise PSD. These two steps are described in the following sections.

### Proposed speech activity detection algorithm

We introduce a *spectral-flatness*-based (denoted by *ζ*) adaptive thresholding technique for speech activity detection. For the *l*^{th} frame, *ζ*(*l*) is computed as the ratio of the geometric mean to the arithmetic mean of the 257-point single-sided noisy speech magnitude spectrum, |*Y*(*l*,*m*)|, containing the DC and Nyquist frequency components, as [30–32]:

$$ \zeta(l)=\frac{\sqrt[M]{\prod_{m=0}^{M-1}|Y(l,m)|}}{\frac{1}{M}\sum_{m=0}^{M-1}|Y(l,m)|}, $$

(4)

where *M*=257, i.e. \(m\in \{0,1,2,\dots,M-1\}\).

*ζ*(*l*) ranges between 0 and 1, since the arithmetic mean of |*Y*(*l*,*m*)| is always greater than or equal to its geometric mean. To interpret *ζ*(*l*) as a framewise speech activity detector, we conduct an experiment in which the IEEE utterance sp05 (“Wipe the grease off his dirty face”) from the NOIZEUS corpus ([14], Chapter 12) (sampled at 16 kHz) is corrupted at 5 dB with white (computer-generated) noise as well as *babble*, *traffic*, *passing car*, and *passing train* noise sources taken from the freesound database [33]. It can be seen that *ζ*(*l*) varies between 1 and 0 over the time-frames depending on the silent/speech activity of the noisy speech (Fig. 2b)^{Footnote 1}.

Typically, |*Y*(*l*,*m*)| is dominated by noise during silent activity. Under white noise, |*Y*(*l*,*m*)| contains approximately equal power at all frequency bins, i.e. it remains flat during speech pauses, resulting in *ζ*(*l*)≈1 (e.g. 0–0.15 s or 1.8–2.19 s of Fig. 2b). Conversely, |*Y*(*l*,*m*)| remains non-uniform in active speech regions, yielding lower *ζ*(*l*), approaching 0 (e.g. 0.16–0.33 s or 0.9–1.06 s of Fig. 2b). This captures the main idea of using *ζ*(*l*) as a framewise speech activity detector for the noisy speech. However, non-stationary noise sources, such as *babble*, *traffic*, *passing car*, and *passing train*, may affect the spectrum of |*Y*(*l*,*m*)| non-uniformly. As a result, |*Y*(*l*,*m*)| does not remain flat during speech pauses, so *ζ*(*l*) does not necessarily approach 1, but it still remains higher than in speech regions (Fig. 2b). To adopt *ζ*(*l*) as a speech activity detector in such conditions, we propose an adaptive thresholding technique, in which an adaptive threshold, *t*_{ζ}, given by the average of the previous *ζ*(*l*)'s, is used to detect the speech activity of the noisy speech frames. Specifically, by assuming the first frame, *y*(*n*,0) (*l*=0), to be silent, the running sum of *ζ*(*l*)'s, \(\mathbb {S}_{\zeta }\), is initialized as \(\mathbb {S}_{\zeta }=\zeta (0)\). For the *l*^{th} frame (*l*≥1), \(\mathbb {S}_{\zeta }\) is first updated as \(\mathbb {S}_{\zeta }=\mathbb {S}_{\zeta }+\zeta (l)\), and *t*_{ζ} is then computed as \(t_{\zeta }=\mathbb {S}_{\zeta }/l\). If *ζ*(*l*)>*t*_{ζ} (*l*≥1), *y*(*n*,*l*) is detected as silent; otherwise, speech activity is present. Since the updated \(\mathbb {S}_{\zeta }\) at the *l*^{th} frame is reused to compute *t*_{ζ} at the (*l*+1)^{th} frame, the method does not require infinite memory.
The computed *t*_{ζ} also captures the long-term variability of the noisy speech during speech activity detection, since it averages all previously computed *ζ*(*l*)'s together with the current estimate. It therefore minimizes the impact of abrupt changes in the noise amplitude between two successive frames during speech activity detection. The whole process is summarized in Section 2.3.
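The flatness measure of Eq. (4) and the adaptive-threshold decision rule can be sketched in Python as follows (function names are our own; the geometric mean is computed in the log domain with a small floor to avoid log(0)):

```python
import math

def spectral_flatness(mag):
    """Eq. (4): ratio of geometric to arithmetic mean of |Y(l, m)|, m = 0..M-1."""
    M = len(mag)
    eps = 1e-12                                  # guard against log(0)
    geo = math.exp(sum(math.log(x + eps) for x in mag) / M)
    arith = sum(mag) / M
    return geo / arith

def detect_speech_activity(flatness_per_frame):
    """Adaptive-threshold SAD: frame l is flagged silent if zeta(l) > t_zeta."""
    s_zeta = flatness_per_frame[0]               # frame l = 0 assumed silent
    flags = [0]                                  # 0: silent, 1: speech
    for l in range(1, len(flatness_per_frame)):
        zeta = flatness_per_frame[l]
        s_zeta += zeta                           # update running sum first,
        t_zeta = s_zeta / l                      # then form the threshold
        flags.append(0 if zeta > t_zeta else 1)
    return flags
```

A flat magnitude spectrum gives *ζ*≈1 (silence-like), while a peaky spectrum gives *ζ* close to 0 (speech-like), so frames with *ζ*(*l*) above the running average are flagged silent.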

Figure 3 compares the detected flags (0/1: silent/speech) for the utterance sp05 corrupted with 5 dB *babble* noise against the reference flags (0/-1: silent/speech). It can be seen that the detected flags closely match the reference. In this experiment, the reference flags were generated by visually inspecting the frames of sp05 (Fig. 2a). More details on the performance evaluation of the proposed SAD against existing SAD methods are given in Section 4.1.

### Proposed noise PSD tracking algorithm

During silent activity of *y*(*n*,*l*), *s*(*n*,*l*)≈0 (Eq. (2)), meaning that *y*(*n*,*l*) consists entirely of the additive noise, *v*(*n*,*l*). Thus, unlike the benchmark methods [16–22, 25], the proposed method leaves the detected silent frames of *y*(*n*,*l*) unprocessed. To start the algorithm, the first noisy speech frame, *y*(*n*,0) (*l*=0), is assumed to be silent, which gives an initial estimate of the noise. Therefore, |*Y*(*l*,*m*)|^{2} corresponding to *y*(*n*,*l*) (*l*=0) is used to initialize the noise periodogram, \(|\hat {V}(0,m)|^{2}=|Y(0,m)|^{2}\), and the noise PSD, *P*_{v}(0,*m*)=|*Y*(0,*m*)|^{2}. The noisy speech periodogram, |*Y*(*l*,*m*)|^{2}, is computed as:

$$\begin{array}{*{20}l} |Y(l,m)|^{2}=\frac{1}{N}\left|\sum_{n=0}^{N-1}y(n,l)e^{-j\frac{2\pi}{N}nm}\right|^{2}. \end{array} $$

(5)

Specifically, during silent activity of *y*(*n*,*l*) (1≤*l*≤*L*−1), |*Y*(*l*,*m*)|^{2} gives an estimate of the noise periodogram, i.e. \(|\hat {V}(l,m)|^{2}=|Y(l,m)|^{2}\). On the other hand, during speech activity of *y*(*n*,*l*) (1≤*l*≤*L*−1), *s*(*n*,*l*) remains embedded in *v*(*n*,*l*), which leads to a risk of leaking speech power into the estimated noise power, \(|\hat {V}(l,m)|^{2}\). To cope with this problem, we have found that applying a derivative-based *high-pass* filter to *y*(*n*,*l*) during speech activity filters out the components of *s*(*n*,*l*) before estimating \(|\hat {V}(l,m)|^{2}\). Specifically, the clean speech, *s*(*n*,*l*) (Eq. (2)), is smooth enough to be locally approximated by a *lower-order* polynomial, which can be thought of as a truncated Taylor series, whilst the noise signal, *v*(*n*,*l*), contributes *higher-order* polynomial terms to the series of the noisy speech, *y*(*n*,*l*) [34]. It is demonstrated in Ogrodzki ([34], Eq. (5.80)) that a smooth signal can be approximated by a 3^{rd}-order polynomial, which is interpreted as a 3^{rd}-order truncated Taylor series. Motivated by this observation, the application of a 4^{th}-order derivative to *y*(*n*,*l*) (Eq. (2)) acts as a *high-pass* filter that removes the components of *s*(*n*,*l*), so that mostly the components of *v*(*n*,*l*) remain. Therefore, the filtered signal gives an estimate of the additive noise, \(\hat {v}(n,l)\). The filtering operation is represented as a convolution of *y*(*n*,*l*) with the 4^{th}-order derivative template, *w*(*n*)=[1 −4 6 −4 1], as [35]:

$$ \hat{v}(n,l)=\sum_{i=0}^{4}w(i)y(n-i,l). $$

(6)
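A minimal sketch of the convolution in Eq. (6), assuming zero padding at the frame boundary (the paper does not specify the boundary handling):

```python
def derivative_filter(y_frame):
    """Eq. (6): convolve y(n, l) with the 4th-order derivative template w(n)."""
    w = [1, -4, 6, -4, 1]
    N = len(y_frame)
    v_hat = []
    for n in range(N):
        acc = 0.0
        for i, wi in enumerate(w):
            if 0 <= n - i < N:                 # zero-padded boundary (assumption)
                acc += wi * y_frame[n - i]
        v_hat.append(acc)
    return v_hat

# A cubic (3rd-order) polynomial is annihilated by the 4th-order difference,
# illustrating why locally smooth speech components are suppressed:
cubic = [0.001 * n**3 - 0.02 * n**2 + n for n in range(64)]
filtered = derivative_filter(cubic)
```

For *n*≥4 the output on the cubic input is (numerically) zero, consistent with the truncated Taylor-series argument above: the template is the 4^{th}-order finite difference, which exactly annihilates polynomials up to degree 3.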

Using the estimated \(\hat {v}(n,l)\), the corresponding noise periodogram, \(|\hat {V}(l,m)|^{2}\), is computed as:

$$\begin{array}{*{20}l} |\hat{V}(l,m)|^{2}=\frac{1}{N}\left|\sum_{n=0}^{N-1}\hat{v}(n,l)e^{-j\frac{2\pi}{N}nm}\right|^{2}. \end{array} $$

(7)

Note that the proposed 4^{th}-order derivative-based high-pass filter is used to filter out the clean speech components, not the additive noise. The application of this filter to *y*(*n*,*l*) removes most of the clean speech components, *s*(*n*,*l*), resulting in a noise-dominated signal, \(\hat {v}(n,l)\). As a result, although the *high-pass* filter is designed with a fixed parameter (a 4^{th}-order derivative), it does not adversely affect noise with a different frequency distribution. Since the components of *s*(*n*,*l*) are filtered out prior to estimating \(\hat {v}(n,l)\), the risk of leaking speech power, |*S*(*l*,*m*)|^{2}, into the computed noise periodogram, \(|\hat {V}(l,m)|^{2}\), is mitigated. However, the *high-pass* filter may also attenuate some smooth noise components that closely coincide with the clean speech, such as *babble* noise, in the filtered signal, \(\hat {v}(n,l)\). Therefore, to preserve these closely coinciding noise components, we perform a recursive averaging of the estimated noise power, \(|\hat {V}(l,m)|^{2}\), with the noise PSD, *P*_{v}(*l*,*m*) (*l*>0), as:

$$\begin{array}{*{20}l} P_{v}(l,m)=\beta P_{v}(l-1,m)+(1-\beta)|\hat{V}(l,m)|^{2}, \end{array} $$

(8)

where *β* (ranging between 0 and 1) is a smoothing factor. The choice of *β* impacts the estimate of *P*_{v}(*l*,*m*) to some extent. It is observed that *β*≈1 is appropriate for speech-dominated frames of noise-corrupted speech, while silent frames warrant a slightly lower value ([14], Section 9.4.1). Motivated by this observation, we empirically set *β*=0.98 during speech activity (*η*(*l*)=1) and *β*=0.9 during silence (*η*(*l*)=0), where *η*(*l*) denotes the detected speech activity flag, which gives a better estimate of *P*_{v}(*l*,*m*).
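The recursive update of Eq. (8) with the activity-dependent smoothing factor can be sketched as (function name is ours; inputs are per-bin lists):

```python
def update_noise_psd(P_prev, V_hat_sq, speech_active):
    """Eq. (8): P_v(l, m) = beta*P_v(l-1, m) + (1 - beta)*|V_hat(l, m)|^2,
    with beta chosen from the detected speech activity flag."""
    beta = 0.98 if speech_active else 0.9      # values used in the paper
    return [beta * p + (1 - beta) * v for p, v in zip(P_prev, V_hat_sq)]
```

During silence the smaller *β* lets the estimate track the (directly observed) noise periodogram faster; during speech the larger *β* leans on the previous PSD estimate.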

### Summary of the proposed noise PSD estimator

By integrating the discussions in Sections 2.1 and 2.2, the proposed noise PSD estimator can be summarized as:

1. **Initialization:** (*l*=0)

*a*) Compute *ζ*(0) using Eq. (4)

*b*) Assume \(\mathbb {S}_{\zeta }=\zeta (0)\)

*c*) Compute |*Y*(0,*m*)|^{2} using Eq. (5)

*d*) Assume \(|\hat {V}(0,m)|^{2}=|Y(0,m)|^{2}\)

*e*) Assume *P*_{v}(0,*m*)=|*Y*(0,*m*)|^{2}

2. **for** *l*=1 **to** *L*−1 **do** [*framewise processing loop*]

*a*) Compute *ζ*(*l*) using Eq. (4)

*b*) Compute |*Y*(*l*,*m*)|^{2} using Eq. (5)

*c*) \(\mathbb {S}_{\zeta }=\mathbb {S}_{\zeta }+\zeta (l)\)

*d*) \(t_{\zeta }=\mathbb {S}_{\zeta }/l\)

*e*) **if** *ζ*(*l*)>*t*_{ζ}**then** [*silent activity*]

i. \(|\hat {V}(l,m)|^{2}=|Y(l,m)|^{2}\)

ii. *β*=0.9

**else** [*speech activity*]

i. Estimate \(\hat {v}(n,l)\) using Eq. (6)

ii. Compute \(|\hat {V}(l,m)|^{2}\) using Eq. (7)

iii. *β*=0.98

**end if**

*f*) Update *P*_{v}(*l*,*m*) using Eq. (8)

**end for**
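The summarized steps can be sketched end-to-end in Python as follows. This is a self-contained illustration, not the authors' implementation: helper names are ours, a direct DFT is used instead of an FFT for clarity, zero padding is assumed at the filter boundary, and the flatness ratio is computed from the square root of the periodogram (any positive scaling of |*Y*(*l*,*m*)| cancels in Eq. (4)):

```python
import cmath
import math

def estimate_noise_psd(frames):
    """Sketch of the proposed estimator: adaptive-threshold SAD, derivative
    filtering during speech activity, and recursive averaging (Eq. (8)).
    `frames` is a list of windowed time-domain frames y(n, l)."""
    N = len(frames[0])
    M = N // 2 + 1                                 # one-sided spectrum length

    def periodogram(x):                            # Eqs. (5)/(7): (1/N)|DFT|^2
        return [abs(sum(x[n] * cmath.exp(-2j * math.pi * n * m / N)
                        for n in range(N))) ** 2 / N for m in range(N)]

    def flatness(P):                               # Eq. (4); scaling cancels
        mag = [math.sqrt(p) + 1e-12 for p in P[:M]]
        geo = math.exp(sum(math.log(v) for v in mag) / M)
        return geo / (sum(mag) / M)

    w = [1, -4, 6, -4, 1]                          # Eq. (6) derivative template
    P_v = periodogram(frames[0])                   # frame l = 0 assumed silent
    s_zeta = flatness(P_v)
    track = [P_v[:]]
    for l in range(1, len(frames)):
        Y_sq = periodogram(frames[l])
        zeta = flatness(Y_sq)
        s_zeta += zeta
        t_zeta = s_zeta / l                        # adaptive threshold
        if zeta > t_zeta:                          # silent: use periodogram as-is
            V_sq, beta = Y_sq, 0.9
        else:                                      # speech: high-pass filter first
            v_hat = [sum(w[i] * frames[l][n - i]
                         for i in range(5) if 0 <= n - i < N) for n in range(N)]
            V_sq, beta = periodogram(v_hat), 0.98
        P_v = [beta * p + (1 - beta) * v for p, v in zip(P_v, V_sq)]  # Eq. (8)
        track.append(P_v[:])
    return track
```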

### Speech enhancement using estimated noise PSD

To evaluate the performance of the proposed noise PSD estimator against the benchmark methods, it is employed in the traditional MMSE-based speech enhancement system. Typically, the estimated noise PSD is used in the DD approach to compute the a priori SNR, a key parameter of the MMSE gain function used for speech enhancement [4]. Specifically, given the noisy speech magnitude spectrum, |*Y*(*l*,*m*)|, an estimate of the clean speech magnitude spectrum, \(|\hat {S}(l,m)|,\) is obtained as [4]:

$$\begin{array}{*{20}l} |\hat{S}(l,m)|=G(l,m)|Y(l,m)| \end{array} $$

(9)

where *G*(*l*,*m*) is a gain function.

The MMSE-STSA gain function is given by [4]:

$$\begin{array}{*{20}l} &G_{\text{MMSE-STSA}}(l,m)=\frac{\sqrt{\pi}}{2}\frac{\sqrt{\nu(l,m)}}{\gamma(l,m)}\exp{\left(\frac{-\nu(l,m)}{2}\right)}\\& \left[(1 + \nu(l,m))I_{0}\left(\frac{\nu(l,m)}{2}\right) +\nu(l,m) I_{1}\left(\frac{\nu(l,m)}{2}\right)\right], \end{array} $$

(10)

where *I*_{0}(·) and *I*_{1}(·) denote the modified Bessel functions of the zeroth and first order, respectively, and *ν*(*l*,*m*) is given by:

$$ \nu(l,m) = \frac{\xi(l,m)}{1+\xi(l,m)}\gamma(l,m), $$

(11)

where *ξ*(*l*,*m*) and *γ*(*l*,*m*) are the a priori and *a posteriori* SNR, respectively, defined as [4]:

$$\begin{array}{*{20}l} \xi(l,m) &= \frac{\lambda_{s}(l,m)}{\lambda_{v}(l,m)}, \end{array} $$

(12)

$$\begin{array}{*{20}l} \gamma(l,m)&=\frac{|Y(l,m)|^{2}}{\lambda_{v}(l,m)}, \end{array} $$

(13)

where *λ*_{s}(*l*,*m*)=E{|*S*(*l*,*m*)|^{2}} is the variance of the clean speech spectral component and *λ*_{v}(*l*,*m*)=E{|*V*(*l*,*m*)|^{2}} is the variance of the noise spectral component. In practice, we do not have access to |*S*(*l*,*m*)|^{2} and |*V*(*l*,*m*)|^{2}, so *λ*_{s}(*l*,*m*) and *λ*_{v}(*l*,*m*) must be estimated from the noisy speech in order to compute *ξ*(*l*,*m*) and *γ*(*l*,*m*). In this paper, the noise PSD, *P*_{v}(*l*,*m*), estimated by the proposed and benchmark methods replaces *λ*_{v}(*l*,*m*) in computing *γ*(*l*,*m*). With the computed *γ*(*l*,*m*), the traditional DD approach gives an estimate \(\hat {\xi }(l,m)\) as [4]:

$$\begin{array}{*{20}l} \hat{\xi}(l,m)&=\eta\frac{|\hat{S}(l-1,m)|^{2}}{P_{v}(l-1,m)}+\\ &(1-\eta)\max(\hat{\gamma}(l,m)-1,0), \end{array} $$

(14)

where max(·) is the maximum function, *η* is a smoothing factor usually set to 0.98, and \(|\hat {S}(l-1,m)|^{2}\) and *P*_{v}(*l*−1,*m*) represent the estimated clean speech power spectrum and noise PSD at the (*l*−1)^{th} frame (*l*>0), respectively.
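Eq. (14) amounts to a per-bin weighted sum, which can be sketched as (function name is ours; inputs are per-bin lists for one frame):

```python
def decision_directed_xi(S_prev_sq, P_v_prev, gamma, eta=0.98):
    """Eq. (14): DD estimate of the a priori SNR, combining the previous
    frame's clean-speech-to-noise ratio with the current a posteriori SNR."""
    return [eta * s / p + (1 - eta) * max(g - 1.0, 0.0)
            for s, p, g in zip(S_prev_sq, P_v_prev, gamma)]
```

The max(·, 0) term clamps the maximum-likelihood component so that the a priori SNR estimate stays non-negative.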

Using the estimated \(\hat {\xi }(l,m)\), the MMSE-LSA gain function is given by [5]:

$$ G_{\text{MMSE-LSA}}(l,m) = \frac{\hat{\xi}(l,m)}{1+\hat{\xi}(l,m)}\exp \left\{ \frac{1}{2} \int^{\infty}_{\nu(l,m)} \frac{e^{-t}}{t} dt \right\}, $$

(15)

where the integral is the exponential integral.

We have also used the square-root Wiener filter (SRWF) gain function, which can be represented in terms of \(\hat {\xi }(l,m)\) as given in Loizou ([14], Section 6.5.1):

$$\begin{array}{*{20}l} G_{\text{SRWF}}(l,m)=\sqrt{\frac{\hat{\xi}(l,m)}{1+\hat{\xi}(l,m)}}. \end{array} $$

(16)
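The SRWF gain and the spectral modification of Eq. (9) can be sketched as (function names are ours; inputs are per-bin values for one frame):

```python
import math

def srwf_gain(xi):
    """Eq. (16): square-root Wiener filter gain for a priori SNR xi."""
    return math.sqrt(xi / (1.0 + xi))

def enhance_magnitude(Y_mag, xi):
    """Eq. (9): |S_hat(l, m)| = G(l, m) * |Y(l, m)|, bin by bin."""
    return [srwf_gain(x) * y for x, y in zip(xi, Y_mag)]
```

As expected, the gain approaches 1 for high a priori SNR (speech-dominated bins) and 0 for low SNR (noise-dominated bins), so noisy bins are attenuated while speech bins pass through largely unchanged.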