2.1. System Model
A binaural hearing aid system is considered throughout the present study. Each hearing aid has two microphones, and the aids are assumed to be linked such that all four microphone signals are available to the noise reduction algorithm. The processor provides a noise-reduced output at each ear.
It is assumed that the signal at each microphone, at a given time, consists of a speech (target) signal convolved with the impulse response from the speech source to the microphone, plus some additive noise. The additive noise contains both the interfering sound source, convolved with the room impulse response from that source to the microphone, and the internal sensor noise, as indicated in (1) for the left and right hearing aid, respectively, with an index denoting the microphone number in each of the two hearing aids. It is assumed that the noise is uncorrelated with the speech and is a short-term stationary zero-mean process.
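The signal model described above can be sketched for a single microphone as follows. This is a minimal illustration only: the function names, the toy impulse responses, and the Gaussian sensor noise are assumptions made for the example and are not taken from (1).

```python
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def microphone_signal(speech, h_speech, interferer, h_interf, sensor_std, rng):
    """Speech path + interferer path + sensor noise, truncated to len(speech).

    All argument names are hypothetical; the sensor noise is assumed Gaussian.
    """
    n = len(speech)
    target = convolve(speech, h_speech)[:n]
    interference = convolve(interferer, h_interf)[:n]
    sensor = [rng.gauss(0.0, sensor_std) for _ in range(n)]
    return [t + i + e for t, i, e in zip(target, interference, sensor)]


rng = random.Random(0)
speech = [1.0, 0.0, 0.0, 0.0]      # unit impulse standing in for the target
interferer = [0.0, 0.0, 0.0, 0.0]  # silent interferer for this small check
y = microphone_signal(speech, [0.5, 0.25], interferer, [0.3], 0.0, rng)
```

With a silent interferer and zero sensor noise, the microphone signal reduces to the speech filtered by its own impulse response, which is a convenient sanity check before adding noise terms.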
2.2. Binaural Multichannel Wiener Filter
The BMWF algorithm proposed in [13] provides a Minimum Mean Square Error (MMSE) estimate of the speech component in the two front microphones. As depicted in Figure 1, two Wiener filters are computed to estimate the noise components in the front left and right microphones. These noise estimates are then subtracted from the original noisy speech signals to obtain estimates of the clean speech components.
Computation of the left and right Wiener filters requires spatiotemporal information about the speech and noise sources in the form of their second-order statistics. Using the received microphone signals, an approximation of the second-order statistics can be obtained from a block of input data. For a given filter length per channel, the input data vector for the left front channel is given in (2). Input data vectors are defined accordingly for the remaining channels. An input data vector for all microphone signals is constructed as expressed in (3) and is used for computing the correlation matrices of speech and noise.
The speech-plus-noise correlation matrix, given in (4), can be calculated directly from the input data vector in (3).
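As a rough sketch of this step, the correlation matrix can be estimated by averaging outer products of stacked data vectors over a block of samples. The helper names, the tap ordering within each channel, and the `time_indices` argument (which restricts the average to selected time instants) are assumptions for illustration and are not taken from (2)-(4).

```python
def stacked_vector(channels, k, taps):
    """Stack the `taps` most recent samples of every channel at time index k."""
    vec = []
    for ch in channels:
        vec.extend(ch[k - n] for n in range(taps))
    return vec


def correlation_matrix(channels, taps, time_indices=None):
    """Average outer product of stacked data vectors.

    If `time_indices` is given, only those time instants contribute.
    """
    if time_indices is None:
        time_indices = range(taps - 1, len(channels[0]))
    # Skip instants that do not have a full tap history yet.
    usable = [k for k in time_indices if k >= taps - 1]
    if not usable:
        raise ValueError("no usable time instants")
    dim = len(channels) * taps
    acc = [[0.0] * dim for _ in range(dim)]
    for k in usable:
        v = stacked_vector(channels, k, taps)
        for i in range(dim):
            for j in range(dim):
                acc[i][j] += v[i] * v[j]
    return [[a / len(usable) for a in row] for row in acc]


# Two toy channels, two taps per channel -> 4 x 4 matrices.
channels = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
r_all = correlation_matrix(channels, taps=2)
r_subset = correlation_matrix(channels, taps=2, time_indices=[1, 3])
```

Dividing by the number of contributing vectors keeps estimates built from different numbers of samples on a comparable scale.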
The noise components are not directly available, as they cannot be separated from the mixture of speech and noise in the received microphone signals in (2) and (3). Therefore, they need to be estimated in periods that contain only noise, in order to compute the second-order statistics of the noise. Such an operation requires a voice activity detection (VAD) mechanism to identify the time instants in the received mixture signal that do not contain speech. At these time instants, the noise correlation matrix is calculated as expressed in (5).
As the noise correlation matrix is constructed only from the data samples collected at the noise-only time instants, the correlation matrices are scaled to account for the number of samples used in each estimate. The left and right Wiener filters are then calculated as shown in (6).
Since the speech signal is estimated in both the left and right microphone channels, the BMWF processing inherently preserves the ITD cues of the speech component. However, the ITD cues of the noise component are distorted [12, 13]. To improve localization, some noise is left unprocessed at the output by incorporating a parameter into the filter calculation in (6), as shown in (7).
The noise-controlling parameter can take on values between 0 and 1. One extreme puts all effort on noise reduction with no attempt to preserve localization cues, whereas the other puts all effort on preserving localization cues, in which case no noise reduction is performed. That is, there is a trade-off between noise reduction and preservation of localization cues.
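The trade-off can be illustrated with a small sketch. It assumes, as one common way of realizing such a parameter and not as a reproduction of (7), that the noise-estimating Wiener filter is scaled by one minus a hypothetical parameter `eta`, so that `eta = 0` subtracts the full noise estimate and `eta = 1` leaves the input unprocessed. The function names and the toy correlation matrices are likewise assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def noise_estimating_filter(r_y, r_v, ref, eta):
    """Filter that estimates the noise in channel `ref`, scaled by (1 - eta)."""
    column = [row[ref] for row in r_v]   # noise correlation with the reference
    w = solve(r_y, column)               # equivalent to inv(r_y) @ column
    return [(1.0 - eta) * wi for wi in w]


def enhance(y_vec, w, ref):
    """Subtract the filtered noise estimate from the reference channel."""
    noise_hat = sum(wi * yi for wi, yi in zip(w, y_vec))
    return y_vec[ref] - noise_hat


# Toy statistics: unit-power speech and unit-power noise, one tap, two channels.
r_y = [[2.0, 0.0], [0.0, 2.0]]
r_v = [[1.0, 0.0], [0.0, 1.0]]
y_vec = [1.0, 3.0]
outputs = {eta: enhance(y_vec, noise_estimating_filter(r_y, r_v, 0, eta), 0)
           for eta in (0.0, 0.5, 1.0)}
```

In this toy case the output of the reference channel moves from half the input at `eta = 0` to the unmodified input at `eta = 1`, which mirrors the described shift from noise reduction toward cue preservation.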
The BMWF algorithm uses no information for computation of the filter matrix other than the second-order statistics determined by means of the VAD. Its performance can therefore be expected to degrade when VAD errors lead to incorrect noise estimation. If speech is detected as noise, vectors containing speech samples are added to the noise data matrix in (5), which leads to cancellation of parts of the speech signal. Conversely, if too many actual noise samples are detected as speech, fewer noise vectors are added to the noise data matrix in (5), and the resulting poorer noise estimate leads to incorrect noise reduction.
Generally, a multichannel Wiener filter can be decomposed into a minimum variance distortionless response (MVDR) beamformer followed by a (spectral) Wiener postfilter [18]. The speech enhancement can therefore also be expected to depend strongly on the spatial configuration of the noise sources. The adaptive beamformer is most effective at suppressing interference comprising fewer sources than the number of microphones, and its noise reduction decreases rapidly as the number of noise sources increases. While the beamformer should not modify the target signal, the postfilter can attenuate it, depending on the amount of noise present at the output of the beamformer. Hence, as the Wiener postfilter trades off target distortion against noise reduction, the amount of target cancellation is expected to be small for few noise sources and large for many sources.
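The dependence of target attenuation on residual noise can be made concrete with the textbook single-channel Wiener gain, SNR / (1 + SNR), used here as a stand-in for the postfilter. This is an illustrative assumption rather than the exact postfilter of [18], and the function names are hypothetical.

```python
import math


def wiener_gain(snr_linear):
    """Textbook Wiener gain for a given signal-to-noise power ratio."""
    return snr_linear / (1.0 + snr_linear)


def target_attenuation_db(snr_db):
    """Attenuation applied to the target (in dB) at a given beamformer output SNR."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return -20.0 * math.log10(wiener_gain(snr_linear))


# Attenuation grows as the SNR at the beamformer output drops.
attenuations = [target_attenuation_db(snr_db) for snr_db in (20.0, 10.0, 0.0, -10.0)]
```

At 0 dB the gain is 0.5, so the target is attenuated by about 6 dB, whereas at high SNR the gain approaches 1 and the target passes almost unchanged.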
2.3. Voice Activity Detector
Speech has strong amplitude modulations in the 2–10 Hz range, such that its envelope fluctuates over a wide dynamic range. Many types of noise (e.g., traffic noise, or babble noise in which the signals of many speakers are superimposed) exhibit smaller and more rapid envelope fluctuations than speech. These properties can be exploited to detect time periods in a signal where speech is absent. Therefore, an envelope-based VAD developed for hearing aid applications is used, as proposed in [19]. The algorithm adaptively tracks the dynamics of a signal's power envelope and provides speech pause detection based on the envelope minima in a noisy speech signal. This VAD has been shown to have a low rate of speech periods falsely detected as noise, even at a low input SNR of −10 dB [19], which is desirable in order to avoid deterioration of the speech signal in the noise reduction process. In [19], the VAD was also compared to the standardized ITU G.729 VAD by means of receiver operating characteristic (ROC) curves and was found to outperform it for a representative set of noise types and SNRs. The VAD provides speech/noise classification by analyzing time frames of 8 ms, using the following processing steps for each frame:
(1) A 50% overlap is used such that the processing delay is 4 ms. Each frame is Hanning windowed and a 256-point FFT is performed.
(2) Short-term magnitude-squared spectra are calculated. Temporal power envelopes are obtained by summing up the squared spectral components. In addition, a low-band and a high-band power envelope are calculated by summing up the squared spectral components below and above a cutoff frequency, respectively. The envelopes of band-limited signals are considered because some noise types have stronger low- (or high-) frequency components. In that case, one of the band-limited envelopes may be less disturbed by the noise and provide more reliable information for the speech pause decision. The envelopes are smoothed slightly using a first-order recursive low-pass filter with a release time constant.
(3) The maxima and minima of the signal envelope are obtained by tracking the peaks and valleys of the envelope waveform. This is done with two first-order recursive low-pass filters with separate attack and release time constants. The differences between the maxima and minima are calculated to obtain the current dynamic range of the signal.
(4) The decision for a speech pause is based on several requirements regarding the dynamic range of the signal and the current envelope values for the three bands. As the complete decision process is described in [19], only the general concepts are outlined here. The criterion for the envelope being close enough to its minimum is determined by two free parameters and the current dynamic range of the signal. The first, a threshold parameter, determines whether the current dynamic range of the signal is classed as low, medium, or high. The second can take on values between 0 and 1 and is used to test whether that fraction of the current dynamic range is higher than the difference between the current envelope and its minimum. The settings of these two parameters determine how strict the requirements for detecting a speech pause are, and they can be adjusted to make the VAD more or less sensitive to speech pauses. Increasing one or both of the parameters causes the algorithm to detect more speech pauses, but at the same time to classify more speech periods as noise.
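Steps (2) to (4) can be sketched in a heavily simplified form. The sketch assumes a single broadband envelope given as one level value per frame (for example in dB), omits the band-limited envelopes and the low/medium/high range classification, and uses only the fraction test described above. All function names, coefficient values, and the parameter `fraction` are hypothetical and do not reproduce the decision logic of [19].

```python
def smooth(values, coeff):
    """First-order recursive low-pass; a larger coeff gives heavier smoothing."""
    out, state = [], values[0]
    for x in values:
        state = state + (1.0 - coeff) * (x - state)
        out.append(state)
    return out


def track(values, rise, fall):
    """Recursive tracker with separate coefficients for rising and falling input.

    A coefficient of 0 follows the input immediately; values near 1 react slowly.
    """
    out, state = [], values[0]
    for x in values:
        coeff = rise if x > state else fall
        state = state + (1.0 - coeff) * (x - state)
        out.append(state)
    return out


def detect_pauses(frame_levels, fraction, smooth_coeff=0.5, slow_coeff=0.9):
    """Flag a frame as a pause when the envelope is close to its tracked minimum."""
    env = smooth(frame_levels, smooth_coeff)
    maxima = track(env, rise=0.0, fall=slow_coeff)  # jumps up, decays slowly
    minima = track(env, rise=slow_coeff, fall=0.0)  # drops at once, rises slowly
    flags = []
    for e, hi, lo in zip(env, maxima, minima):
        dynamic_range = hi - lo
        flags.append((e - lo) < fraction * dynamic_range)
    return flags


# Toy envelope: three loud frames, three quiet frames, two loud frames.
levels = [60.0, 60.0, 60.0, 30.0, 30.0, 30.0, 60.0, 60.0]
flags = detect_pauses(levels, fraction=0.3)
```

In this toy run only the three quiet frames are flagged as pauses. Raising `fraction` loosens the criterion, so more frames near the envelope minimum are classified as pauses, in line with the sensitivity trade-off described in step (4).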