In this section, the proposed noise reduction algorithm is derived. The filtering is applied only in the low frequency range, which is affected by wind noise. It should be noted that the noise signal consists of wind noise as well as car noise components. However, in the presence of wind noise, the wind noise components are dominant at low frequencies. In the following, we therefore consider only the non-stationary wind noise components at low frequencies and neglect the slowly varying driving noise; such quasi-stationary noise components can be estimated and reduced by state-of-the-art noise reduction approaches.
The proposed wind noise reduction approach is derived from the commonly used speech distortion weighted multichannel Wiener filter [3], which is defined as
$$ \mathbf{G}^{\mathbf{MWF}} = \left({\mathbf{R}_{\mathbf{S}}} + \mu {\mathbf{R}_{\mathbf{N}}}\right)^{-1}{\Phi_{X}^{2}} \mathbf{H} {\tilde{H}}^{*} $$
(8)
where \({\tilde {H}}\) is the acoustic transfer function of an arbitrarily chosen microphone channel and μ is a noise overestimation parameter which allows a trade-off between noise reduction and speech distortion. The output signal \(Z_{MWF}\) of the Wiener filter is obtained by
$$ {Z}_{MWF} = \mathbf{Y}\cdot{\mathbf{G}^{\mathbf{MWF}}}^{\dag}. $$
(9)
In [30, 31], it is shown that \(\mathbf{G}^{\mathbf{MWF}}\) can be decomposed into an MVDR beamformer
$$ \mathbf{G}^{\mathbf{MVDR}} = \frac{{\mathbf{R}}_{\mathbf{N}}^{-1} \mathbf{H}}{\mathbf{H}^{\dag} {\mathbf{R}}_{\mathbf{N}}^{-1} {\mathbf{H}}} $$
(10)
and a single-channel Wiener post filter
$$ {G}^{WF}=\frac{{\gamma^{\text{out}}}}{{\gamma^{\text{out}}} + \mu} $$
(11)
as
$$ \mathbf{G}^{\mathbf{MWF}} = \mathbf{G}^{\mathbf{MVDR}} \cdot {G}^{{WF}} \cdot {\tilde{H}}^{*}. $$
(12)
The term \({\gamma^{\text{out}}}\) is the narrow-band SNR at the beamformer output, which is defined as
$$ {\gamma^{\text{out}}} = \text{tr}\left({\mathbf{R}_{\mathbf{S}}}{\mathbf{R}_{\mathbf{N}}^{-1}}\right), $$
(13)
where tr(·) denotes the trace operator. We exploit this decomposition for the proposed wind noise reduction. Firstly, we derive a beamformer for the considered microphone setup.
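For illustration, the decomposition in (12) can be checked numerically with the following sketch (Python/NumPy; the two-channel test values for the ATFs, the PSDs, and μ, as well as the variable names, are arbitrary assumptions and not values taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-microphone example for a single frequency bin
# (assumed test values, not parameters from the text).
H = rng.normal(size=2) + 1j * rng.normal(size=2)  # acoustic transfer functions
phi_X2 = 1.5                                      # speech source PSD
R_S = phi_X2 * np.outer(H, H.conj())              # speech correlation matrix
R_N = np.diag([0.8, 0.3]).astype(complex)         # noise correlation matrix
mu = 1.0                                          # noise overestimation parameter
H_ref = H[0]                                      # reference-channel ATF (H tilde)

# Speech distortion weighted multichannel Wiener filter, Eq. (8)
G_MWF = np.linalg.inv(R_S + mu * R_N) @ H * phi_X2 * H_ref.conj()

# MVDR beamformer, Eq. (10), and Wiener post filter, Eqs. (11) and (13)
R_N_inv = np.linalg.inv(R_N)
G_MVDR = (R_N_inv @ H) / (H.conj() @ R_N_inv @ H)
gamma_out = np.trace(R_S @ R_N_inv).real
G_WF = gamma_out / (gamma_out + mu)

# Decomposition of Eq. (12): both expressions agree up to numerical precision.
assert np.allclose(G_MWF, G_MVDR * G_WF * H_ref.conj())
```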
3.1 Beamformer
In the following, we consider time-aligned signals, where the alignment compensates the different times of arrival of the speech signal. This is achieved by delaying the front microphone signal by a suitable sample delay τ so that it is in phase with the rear microphone signal,
$$\begin{array}{*{20}l} {\hat{Y}_{1}}(\nu) &= Y_{1}(\nu) \cdot \left\{ \begin{array}{ll} e^{-j2\pi\frac{\nu}{L}\tau} & \text{for}\ \nu \in 0,\ldots,\frac{L}{2}-1 \\ e^{j2\pi\frac{\nu}{L}\tau} & \text{for}\ \nu \in \frac{L}{2},\ldots,L-1 \end{array}\right. \end{array} $$
(14)
where L denotes the block length of the short-time Fourier transform. After this alignment, we assume that the ATFs in \(\mathbf{H}\) are identical, because the low-frequency speech components have a large wavelength compared with the microphone distance,
$$\begin{array}{*{20}l} H &= {\hat{H}_{1} }= H_{2} & \end{array} $$
(15)
$$\begin{array}{*{20}l} \mathbf{H} &= H \cdot [1, 1]^{T} \end{array} $$
(16)
which leads to a speech correlation matrix that depends only on the PSD of the speech signal at one of the microphones
$$ {\mathbf{R}_{\mathbf{S}}} = {\Phi_{X}^{2}} |H|^{2} \left(\begin{array}{ll} 1 & 1 \\ 1 & 1 \end{array}\right) = {\Phi_{S}^{2}} \left(\begin{array}{ll} 1 & 1 \\ 1 & 1 \end{array}\right). $$
(17)
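As a minimal sketch, the alignment step of (14) can be implemented per STFT frame as follows (Python/NumPy; the function name, the frame length, and the test frame are illustrative assumptions):

```python
import numpy as np

def align_front_channel(Y1, tau):
    """Apply the phase factors of Eq. (14) to one length-L STFT frame:
    exp(-j*2*pi*nu*tau/L) for bins 0, ..., L/2-1 and the positive-sign
    counterpart for bins L/2, ..., L-1."""
    L = len(Y1)
    nu = np.arange(L)
    phase = np.exp(-1j * 2.0 * np.pi * nu * tau / L)
    phase[L // 2:] = np.exp(1j * 2.0 * np.pi * nu[L // 2:] * tau / L)
    return Y1 * phase

# Illustrative usage with a random time-domain frame (placeholder data).
L, tau = 512, 3
y1 = np.random.default_rng(1).normal(size=L)
Y1_aligned = align_front_channel(np.fft.fft(y1), tau)
```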
Furthermore, it can be assumed that the wind noise terms of the two microphone signals are uncorrelated even for small microphone distances [28, 32]. This simplifies the noise correlation matrix as well as its inverse, since the cross-terms can be neglected
$$ {\mathbf{R}_{\mathbf{N}}^{-1}} = \left(\begin{array}{cc} \frac{1}{{\Phi_{N_{1}}^{2}}} & 0 \\ 0 & \frac{1}{{\Phi_{N_{2}}^{2}}} \end{array}\right). $$
(18)
The numerator term of \(\mathbf{G}^{\mathbf{MVDR}}\) in (10) can be written as
$$ {\mathbf{R}_{\mathbf{N}}^{-1}} \mathbf{H} = H \cdot \left(\begin{array}{c} \frac{1}{{\Phi_{N_{1}}^{2}}} \\ \frac{1}{{\Phi_{N_{2}}^{2}}} \end{array}\right) $$
(19)
and the denominator as
$$ \mathbf{H}^{\dag} {\mathbf{R}_{\mathbf{N}}^{-1}} \mathbf{H} = |H|^{2} \cdot \left(\frac{1}{{\Phi_{N_{1}}^{2}}} + \frac{1}{{\Phi_{N_{2}}^{2}}}\right). $$
(20)
Since \(H\) is not known, it is set to \(H=1\). This results in the minimum variance (MV) beamformer coefficients
$$ {G^{MV}_{i}} = \frac{\frac{1}{\Phi_{N_{i}}^{2}}}{\frac{1}{{\Phi_{N_{1}}^{2}}} + \frac{1}{\Phi_{N_{2}}^{2}}}, $$
(21)
which can be interpreted as a noise-dependent weighting of the input signals. Note that the MV beamformer achieves the same narrow-band output SNR as the MVDR beamformer but does not provide a distortion-free response [5]. Finally, the output of the beamformer can be written as
$$ Y_{MV} = \left({\hat{Y}_{1}} \cdot {G^{MV}_{1}} + {Y_{2}} \cdot {G^{MV}_{2}}\right). $$
(22)
Using (17) and (18), we are able to calculate the narrow-band output SNR of the beamformer as
$$ {\gamma^{\text{out}}} = {\Phi_{S}^{2}} \cdot \left(\frac{1}{{\Phi_{N_{1}}^{2}}} + \frac{1}{{\Phi_{N_{2}}^{2}}} \right) = \frac{{\Phi_{S}^{2}}}{{\Phi_{N_{\text{beam}}}^{2}}}, $$
(23)
where \({\Phi _{N_{\text {beam}}}^{2}}\) denotes the noise PSD at the beamformer output. This PSD can be calculated as
$$ {{\Phi_{N_{\text{beam}}}^{2}}} = \frac{{\Phi_{N_{1}}^{2}} \cdot {\Phi_{N_{2}}^{2}}}{{\Phi_{N_{1}}^{2}} + {\Phi_{N_{2}}^{2}}}. $$
(24)
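A possible per-bin implementation of (21), (22), and (24) is sketched below (Python/NumPy); the function name, the assumption that the noise PSDs are estimated as described in Section 3.3, and the small regularization constant eps used to avoid division by zero are implementation assumptions.

```python
import numpy as np

def mv_beamformer(Y1_aligned, Y2, phi_N1, phi_N2, eps=1e-12):
    """Noise-dependent weighting of Eq. (21), beamformer output of Eq. (22),
    and output noise PSD of Eq. (24). All arguments are per-bin arrays of
    one frame; phi_N1 and phi_N2 are short-time noise PSD estimates."""
    w1 = 1.0 / (phi_N1 + eps)
    w2 = 1.0 / (phi_N2 + eps)
    G1 = w1 / (w1 + w2)                                      # Eq. (21), i = 1
    G2 = w2 / (w1 + w2)                                      # Eq. (21), i = 2
    Y_MV = Y1_aligned * G1 + Y2 * G2                         # Eq. (22)
    phi_N_beam = phi_N1 * phi_N2 / (phi_N1 + phi_N2 + eps)   # Eq. (24)
    return Y_MV, phi_N_beam
```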
3.2 Special cases
In the following, we consider some special cases of the beamformer derived in (22). If we assume \({\Phi _{N_{1}}^{2}} = {\Phi _{N_{2}}^{2}}\) and uncorrelated noise terms as in [29], \({G^{MV}_{i}}\) reduces to the weighting of a delay-and-sum beamformer, i.e., a simple summation of the aligned signals
$$ {G^{DS}_{i}} = \frac{\frac{1}{{\Phi_{N_{1}}^{2}}}}{\frac{1}{{\Phi_{N_{1}}^{2}}} + \frac{1}{{\Phi_{N_{1}}^{2}}}} = \frac{1}{2}, $$
(25)
which results in the output signal
$$ {Y_{DS}} = \frac{1}{2}\left({\hat{Y}_{1}}+ {Y_{2}}\right). $$
(26)
A delay-and-sum beamformer is also proposed in [17] for closely spaced microphones with wind noise.
We keep the condition of uncorrelated noise terms and consider the special case in which the short-time noise PSDs vary over time and frequency. This is motivated by the highly non-stationary local short-time wind noise disturbances [19] and implies that, at any given time index κ and frequency index ν, only one microphone is affected by wind noise
$$ {\Phi_{N_{1}}^{2}}(\kappa,\nu) \ll {\Phi_{N_{2}}^{2}}(\kappa,\nu) $$
(27)
or
$$ {\Phi_{N_{1}}^{2}}(\kappa,\nu) \gg {\Phi_{N_{2}}^{2}}(\kappa,\nu). $$
(28)
Then, the noise-PSD-dependent weighting in (21) reduces to a selection of the respective frequency bins by comparing the short-time PSDs of the microphone signals \({\Phi _{Y_{i}}^{2}}\), because the speech signal PSDs \({\Phi _{S_{i}}^{2}}\) are assumed to be identical for both microphones. Therefore, the output signal \(Y_{FBS}\) of this frequency bin selection can be written as
$$\begin{array}{*{20}l} {Y_{FBS}}(\kappa,\nu) &= \left\{ \begin{array}{ll} {Y_{1}}(\kappa,\nu), & {\Phi_{Y_{1}}^{2}}(\kappa,\nu) < {\Phi_{Y_{2}}^{2}}(\kappa,\nu) \\ {Y_{2}}(\kappa,\nu), & {\Phi_{Y_{1}}^{2}}(\kappa,\nu) > {\Phi_{Y_{2}}^{2}}(\kappa,\nu) \\ \end{array}\right. \end{array} $$
(29)
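A minimal sketch of the selection rule (29) is given below (Python/NumPy); resolving the case \({\Phi _{Y_{1}}^{2}} = {\Phi _{Y_{2}}^{2}}\), which is left unspecified in (29), in favor of the second microphone is an implementation assumption, and the function name is illustrative.

```python
import numpy as np

def frequency_bin_selection(Y1, Y2, phi_Y1, phi_Y2):
    """Per time-frequency bin, select the microphone signal with the
    smaller short-time signal PSD, as in Eq. (29)."""
    return np.where(phi_Y1 < phi_Y2, Y1, Y2)
```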
3.3 PSD estimation
Next, we derive estimates for the speech and noise PSDs which are required for the beamformer and the post filter. As mentioned in [29], most single-channel noise estimation procedures (e.g., [33–35]) rely on the assumption that the noise signal PSDs vary more slowly over time than the speech signal PSD. This is not the case for wind noise; the fast-varying short-time PSDs make noise estimation a challenging task with a single microphone. With more than one microphone, however, the different correlation properties of speech and wind noise can be exploited for the estimation.
A wind noise reference can be obtained by exploiting the fact that the wind noise components at the two microphones are incoherent while the speech components are coherent. A delay-and-subtract approach blocks the coherent speech signal and yields the noise reference
$$ N = \frac{{\hat{Y}_{1}}-{Y_{2}}}{2}, $$
(30)
which depends only on incoherent wind noise terms. The PSD of this noise reference is
$$\begin{array}{*{20}l} {\Phi_{N}^{2}} &= {\mathbb{E}}\left\{NN^{*}\right\} \end{array} $$
(31)
$$\begin{array}{*{20}l} &= {\mathbb{E}}\left\{\left(\frac{{\hat{Y}_{1}}-{Y_{2}}}{2}\right)\left(\frac{{\hat{Y}_{1}}-{Y_{2}}}{2}\right)^{*}\right\} \end{array} $$
(32)
$$\begin{array}{*{20}l} &= \frac{1}{4}\left({\mathbb{E}}\left\{{\hat{Y}_{1}}{\hat{Y}_{1}}^{*}\right\} - {\mathbb{E}}\left\{{\hat{Y}_{1}}{Y_{2}}^{*}\right\}\right.\notag\\ & - \left. {\mathbb{E}}\left\{{Y_{2}}{\hat{Y}_{1}}^{*}\right\} + {\mathbb{E}}\left\{{Y_{2}}{Y_{2}}^{*}\right\}\right) \end{array} $$
(33)
$$\begin{array}{*{20}l} &= \frac{1}{4} \left({\mathbb{E}}\left\{{\hat{N}_{1}}{\hat{N}_{1}}^{*}\right\} - {\mathbb{E}}\left\{{\hat{N}_{1}}{N_{2}}^{*}\right\}\notag \right.\\ & \left. - {\mathbb{E}}\left\{{N}_{2}{\hat{N}_{1}}^{*}\right\} + {\mathbb{E}}\left\{{N}_{2}{N}_{2}^{*}\right\} \right). \end{array} $$
(34)
The cross-terms vanish, because the wind noise terms are uncorrelated. Hence, we obtain
$$\begin{array}{*{20}l} {\Phi_{N}^{2}} &= \frac{{\Phi_{N_{1}}^{2}}}{4} + \frac{{\Phi_{N_{2}}^{2}}}{4}. \end{array} $$
(35)
Note that the delay-and-subtract signal in (30) is used in other applications as the output of a differential microphone array [17]. Obviously, this is not suitable for microphone positions that are sensitive to wind noise, because the noise terms are heavily amplified.
By summing the aligned signals according to (26), we emphasize the coherent signal components. The combined signal \(Y_{DS}\) has the PSD
$$\begin{array}{*{20}l} {\Phi_{Y_{DS}}^{2}} &= {\mathbb{E}}\left\{{Y_{DS}}{Y_{DS}}^{*}\right\} \end{array} $$
(36)
$$\begin{array}{*{20}l} &= {\mathbb{E}}\left\{\left(\frac{{\hat{Y}_{1}}+{Y_{2}}}{2}\right)\left(\frac{{\hat{Y}_{1}}+{Y_{2}}}{2}\right)^{*}\right\} \end{array} $$
(37)
$$\begin{array}{*{20}l} &= \frac{1}{4}\left({\mathbb{E}}\left\{{\hat{Y}_{1}}{\hat{Y}_{1}}^{*}\right\} + {\mathbb{E}}\left\{{\hat{Y}_{1}}{Y_{2}}^{*}\right\}\right.\\ & + \left. {\mathbb{E}}\left\{{Y_{2}}{\hat{Y}_{1}}^{*}\right\} + {\mathbb{E}}\left\{{Y_{2}}{Y_{2}}^{*}\right\}\right) \end{array} $$
(38)
$$\begin{array}{*{20}l} &= {\mathbb{E}}\left\{SS^{*}\right\} + \frac{1}{4} \left({\mathbb{E}}\left\{{\hat{N}_{1}}{\hat{N}_{1}}^{*}\right\} + {\mathbb{E}}\left\{{\hat{N}_{1}}{N_{2}}^{*}\right\}\right. \\ & \left.+ {\mathbb{E}}\left\{{N}_{2}{\hat{N}_{1}}^{*}\right\} + {\mathbb{E}}\left\{{N}_{2}{{N}_{2}}^{*}\right\}\right). \end{array} $$
(39)
Again, the noise cross-terms vanish and we obtain
$$\begin{array}{*{20}l} {\Phi_{Y_{DS}}^{2}} &= {\Phi_{S}^{2}} + \frac{{\Phi_{N_{1}}^{2}}}{4} + \frac{{\Phi_{N_{2}}^{2}}}{4}. \end{array} $$
(40)
Combining (35) and (40) yields the PSD of the clean speech signal
$$ {\Phi_{S}^{2}} = {\Phi_{Y_{DS}}^{2}} - {\Phi_{N}^{2}} $$
(41)
and the noise PSD at the ith microphone
$$ {\Phi_{N_{i}}^{2}} = {\Phi_{Y_{i}}^{2}} - {\Phi_{S}^{2}}. $$
(42)
Note that this derivation only holds for uncorrelated noise terms; \({\Phi _{S}^{2}}\) may still contain correlated noise components. However, we neglect the correlated driving noise, as stated at the beginning of this section. In contrast to Zelinski's post filter [36], which also assumes zero noise correlation between the microphone signals, we allow the short-time noise PSDs to differ \(\left({\Phi _{N_{1}}^{2}} \neq {\Phi _{N_{2}}^{2}}\right)\).
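The estimation of (30), (35), and (40)-(42) can be sketched per frame as follows (Python/NumPy); the recursive smoothing with factor alpha, which stands in for the expectation operator, the flooring of the estimates at zero, and the function name are implementation assumptions.

```python
import numpy as np

def estimate_psds(Y1_aligned, Y2, prev, alpha=0.9):
    """Estimate the speech PSD, the noise reference PSD, and the
    per-microphone noise PSDs from one STFT frame. `prev` holds the
    smoothed PSDs of the previous frame."""
    N = 0.5 * (Y1_aligned - Y2)        # noise reference, Eq. (30)
    Y_DS = 0.5 * (Y1_aligned + Y2)     # delay-and-sum signal, Eq. (26)

    def smooth(key, inst):
        # Recursive smoothing as a stand-in for the expectation operator.
        prev[key] = alpha * prev.get(key, inst) + (1.0 - alpha) * inst
        return prev[key]

    phi_N = smooth("N", np.abs(N) ** 2)            # Eq. (35)
    phi_DS = smooth("DS", np.abs(Y_DS) ** 2)       # Eq. (40)
    phi_Y1 = smooth("Y1", np.abs(Y1_aligned) ** 2)
    phi_Y2 = smooth("Y2", np.abs(Y2) ** 2)

    phi_S = np.maximum(phi_DS - phi_N, 0.0)        # Eq. (41), floored at zero
    phi_N1 = np.maximum(phi_Y1 - phi_S, 0.0)       # Eq. (42), i = 1
    phi_N2 = np.maximum(phi_Y2 - phi_S, 0.0)       # Eq. (42), i = 2
    return phi_S, phi_N, phi_N1, phi_N2
```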
3.4 Post filter
As described in (12), the beamformer is followed by a single-channel Wiener post filter to achieve additional noise suppression. We use the post filter
$$ {G^{WF}} = \frac{{\gamma}}{{\gamma} + {\mu}} $$
(43)
with the SNR estimate
$$ {\gamma} = \frac{{\Phi_{S}^{2}}}{{\Phi_{N}^{2}}}. $$
(44)
That is, the SNR is computed with the noise PSD from (35) instead of the beamformer output noise PSD \({\Phi _{N_{\text{beam}}}^{2}}\) appearing in (23), because this estimate showed better performance in the simulations with respect to SNR and speech distortion. Note that \({\Phi _{N}^{2}}\geq {\Phi _{N_{\text {beam}}}^{2}}\) holds, with equality if \({\Phi _{N_{1}}^{2}}={\Phi _{N_{2}}^{2}}\). Hence, the SNR estimate in (44) is based on an overestimated noise power whenever the short-time PSDs at the two microphones differ. This is similar to using an overestimation parameter μ>1.
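A sketch of the post filter of (43) and (44) applied to the beamformer output is given below (Python/NumPy); the function name and the regularization constant eps are implementation assumptions.

```python
import numpy as np

def wiener_post_filter(Y_MV, phi_S, phi_N, mu=1.0, eps=1e-12):
    """Single-channel Wiener post filter of Eq. (43) with the SNR
    estimate of Eq. (44), applied to the beamformer output."""
    gamma = phi_S / (phi_N + eps)   # Eq. (44)
    G_WF = gamma / (gamma + mu)     # Eq. (43)
    return Y_MV * G_WF              # apply the post filter to the beamformer output
```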
Finally, the output of the complete wind noise reduction algorithm is
$$\begin{array}{@{}rcl@{}} {Z} &=& \left({\hat{Y}_{1}} \cdot {G^{MV}_{1}} + {Y_{2}} \cdot {G^{MV}_{2}}\right) \cdot {G^{WF}} \end{array} $$
(45)
$$\begin{array}{@{}rcl@{}} &=& {Y_{MV}} \cdot {G^{WF}}. \end{array} $$
(46)
This wind noise reduction algorithm is applied only to frequencies below a cutoff frequency \(f_{c}\), because wind noise contains mostly low-frequency components and the assumptions about the signal properties are only valid at low frequencies. Figure 1 shows the block diagram of the signal processing structure.
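Restricting the processing to low frequencies can be sketched with a simple per-bin mask (Python/NumPy); the hard switch at \(f_{c}\), the choice of what is used above \(f_{c}\), and the function name are illustrative assumptions, the actual structure being given by the block diagram in Fig. 1.

```python
import numpy as np

def restrict_to_low_frequencies(Z, Y_passthrough, fs, L, fc):
    """Keep the wind-noise-reduced output Z only for bins below the
    cutoff frequency fc; all other bins are taken from Y_passthrough
    (e.g., the output of a conventional noise reduction stage)."""
    nu = np.arange(L)
    freqs = np.minimum(nu, L - nu) * fs / L   # bin center frequencies in Hz
    return np.where(freqs < fc, Z, Y_passthrough)
```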