# Robust noise power spectral density estimation for binaural speech enhancement in time-varying diffuse noise field

Youna Ji^{1}, Yonghyun Baek^{1} and Young-cheol Park^{1}

**2017**:25

https://doi.org/10.1186/s13636-017-0122-4

© The Author(s). 2017

**Received: **29 December 2016

**Accepted: **14 November 2017

**Published: **29 November 2017

## Abstract

In speech enhancement, noise power spectral density (PSD) estimation plays a key role in determining appropriate de-noising gains. In this paper, we propose a robust noise PSD estimator for binaural speech enhancement in time-varying noise environments. First, it is shown that the noise PSD can be numerically obtained using an eigenvalue of the input covariance matrix. A simplified estimator is then derived through an approximation process, so that the noise PSD is expressed as a combination of the second eigenvalue of the input covariance matrix, the noise coherence, and the interaural phase difference (IPD) of the input signal. Later, to enhance the accuracy of the noise PSD estimate in time-varying noise environments, an eigenvalue compensation scheme is presented, in which two eigenvalues obtained in noise-dominant regions are combined using a weighting parameter based on the speech presence probability (SPP). Compared with the previous prediction filter-based approach, the proposed method requires neither causality delays nor explicit estimation of the prediction errors. Finally, the proposed noise PSD estimator is applied to a binaural speech enhancement system, and its performance is evaluated through computer simulations. The simulation results show that the proposed noise PSD estimator yields an accurate noise PSD regardless of the direction of the target speech signal. As a result, slightly better quality and intelligibility can be obtained than with conventional algorithms.

## 1 Introduction

The purpose of speech enhancement is to improve the quality and intelligibility of speech signals by suppressing everyday environmental noise while keeping speech distortion to a minimum. The Wiener filter and statistical model-based estimators [1] are well-known examples of speech enhancement algorithms. Since the de-noising gains of a speech enhancement algorithm are fundamentally determined by the noise power spectral density (PSD), it is important to obtain an accurate noise PSD estimate. Extensive research has therefore been conducted on noise PSD estimation using a single-microphone system [2–5]; however, these methods often exhibit limited performance in situations with non-stationary noise or a low signal-to-noise ratio (SNR) [6].

To overcome the limitations of single-channel systems, various multi-channel techniques have been developed, including the minimum variance distortionless response (MVDR) [7] and the multi-channel Wiener filter (MWF) with constraints [8–12]. The MVDR is a widely used spatial filter in multi-channel systems that minimizes output power under the constraint that the desired signal is not affected [7]. On the other hand, the MWF provides an optimal solution for broadband noise reduction from a minimum mean square error (MMSE) perspective. Speech-distortion-weighted MWF (SDW-MWF) has been introduced to control the balance between speech distortion and noise reduction [8]. Algorithms such as SDW-MWF and MVDR preserve the binaural cues of the speech, but distort the binaural cues of the noise [10]. Therefore, extensions for preserving the binaural cues of directional sources using additional cost functions or linear constraints have been proposed [10, 11]. In addition, another extension that preserves the interaural coherence (IC) has been proposed [12] for spatially isotropic noise, whose spatial characteristic is represented by the IC.

Although MWF-based extension algorithms can achieve significant noise reduction, there is always a trade-off between noise reduction and cue preservation regarding directional sources and background noise. One way to overcome the problem of binaural cue preservation is to apply a real-valued equal gain to both sides, rather than applying a complex-valued filter. This method diminishes noise reduction performance by acting as a single-channel noise reduction method, but preserves all binaural cues [13]. MWF performance critically depends on the statistical estimates of desired and undesired signal components. The Voice Activity Detector (VAD) is a general method for estimating noise or speech statistics, where the noise statistic can be updated during a noise-only time-frequency (TF) bin index. However, this method has the drawback that when the noise is time-varying and non-stationary, more sophisticated techniques are required to estimate signal statistics.

Many studies on binaural or multi-channel speech enhancement [14–18] based on real-valued gain function have shown that superior speech quality can be obtained by utilizing spatial information for both target speech and noise. Coherence-based binaural noise reduction was proposed in [14] and proven effective in terms of tracking the PSD of the diffuse noise. However, the effectiveness was validated using only the target speech source located in front of the listener. Other studies [15, 17] have proposed a prediction filter-based binaural noise PSD estimator where the diffuse noise PSD was obtained by solving a second-order equation formulated using a channel prediction model. Theoretically, this method should enable the device to obtain a true noise PSD when the target is situated at any location within a given distance of the listener. However, this approach requires a delay between channel signals to ensure the causality condition for the prediction filter, and the prediction error needs to be explicitly calculated. These factors directly affect the PSD estimator performance [16, 19].

Recently, neural network-based speech enhancement algorithms have been investigated [20, 21]. These algorithms are typically divided into two stages. In the learning stage, features are extracted from a large training data set to train a model; in the enhancement stage, speech enhancement gains are applied based on that model. Although extensive research has been conducted on speech enhancement using neural networks, these algorithms are difficult to apply to portable applications because of their high complexity.

In this paper, a new noise PSD estimator for a binaural speech enhancement system that can be operated in a fast time-varying diffuse noise field is presented. First, it is established that noise PSD can be estimated from the eigenvalues of the input covariance matrix without dependence on the target speech direction. Then, a method of approximating the obtained noise PSD is presented. The result is that the smaller eigenvalue is combined with the noise correlation function and the binaural phase difference.

The auto- and cross-PSDs of the input binaural signal are often estimated using a first-order recursive averaging filter [22]. In a rapidly changing noise environment, averaging with a short time constant is required to quickly reflect the signal statistics of the signal PSDs. However, the use of short time constants leads to bias in PSD estimates, which in turn degrades the overall performance of the speech enhancement system. In this paper, a method of compensating for the bias is proposed that uses the statistical characteristic of eigenvalues with a minor increase of the computational cost. The proposed algorithm can be adopted widely in speech-related applications, such as hearing aids and mobile phones.

The remainder of this paper is organized as follows. Section 2 presents a description of the general two-channel speech enhancement algorithm. A new noise PSD estimator based on the eigenvalue of the input covariance matrix is presented in Section 3. In Section 4, a compensation method to improve the performance of the noise estimator in a practical environment is discussed. Section 5 presents the simulation results, in which the performance of the proposed algorithm is compared with the results achieved using the conventional techniques. Finally, Section 6 concludes this paper.

## 2 Configuration of the speech enhancement algorithm for binaural systems

In this section, we begin with a mathematical model of the noisy input signals. Following that, the configuration of a binaural speech enhancement system to which the proposed noise PSD estimator can be applied is briefly described.

### 2.1 Input signal model

In a binaural system, the noisy input signal of the \( i \)-th channel, \( x_i(t) \), corrupted by additive noise in the temporal domain can be written as

\[ x_i(t) = h_i(t) \otimes s(t) + n_i(t), \quad i = L, R, \tag{1} \]

where \( s(t) \) is the speech signal and \( n_i(t) \), \( i = L, R \), are the environmental noises received by the left and right channel microphones, respectively, at time index \( t \). \( h_i(t) \) represents the acoustic impulse response from the speech source to the \( i \)-th channel microphone and ⊗ denotes the convolution operation. After applying the short-time Fourier transform (STFT), (1) can be rewritten in the frequency domain as

\[ X_i(k,l) = H_i(k,l)\, S(k,l) + N_i(k,l), \tag{2} \]

where \( k \) and \( l \) are the frequency and frame indices, respectively. In this paper, the noise, \( N_i(k,l) \), is assumed to be diffuse noise, i.e., a non-directional signal with equal power and random phase [23, 24]. Under the assumption that the speech and noises are uncorrelated, the auto- and cross-PSDs of the noisy input signals are obtained as

\[ \Phi_X^{ii}(k,l) = \left|H_i(k,l)\right|^2 \Phi_S(k,l) + \Phi_N(k,l), \tag{3} \]

\[ \Phi_X^{LR}(k,l) = H_L(k,l)\, H_R^{\ast}(k,l)\, \Phi_S(k,l) + \Phi_N^{LR}(k,l), \tag{4} \]

where \( \Phi_S(k,l) \) and \( \Phi_N(k,l) \), respectively, are the speech and noise auto-PSDs, i.e., \( \Phi_S(k,l) = E\left[|S(k,l)|^2\right] \) and \( \Phi_N(k,l) = E\left[|N_L(k,l)|^2\right] \approx E\left[|N_R(k,l)|^2\right] \). Lastly, \( \Phi_N^{ij}(k,l) = E\left[N_i(k,l)\, N_j^{\ast}(k,l)\right] \) is the cross-PSD between the left and right channel noises.

In practice, the auto- and cross-PSDs are estimated using first-order recursive averaging:

\[ \widehat{\Phi}_X^{ij}(k,l) = \alpha\, \widehat{\Phi}_X^{ij}(k,l-1) + (1-\alpha)\, X_i(k,l)\, X_j^{\ast}(k,l), \tag{5} \]

where \( \alpha \in [0, 1] \) is the smoothing factor that controls the trade-off between quickly capturing the time-varying statistics of the signals and obtaining a low-variance estimate of the spectrum.
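The recursive averaging in (5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `recursive_psd`, the initialisation with the first periodogram, and the synthetic unit-variance noise are all assumptions made for the example. It shows the trade-off controlled by α: a small α tracks quickly but yields a high-variance estimate.

```python
import numpy as np

def recursive_psd(X_i, X_j, alpha):
    """First-order recursive estimate of the (cross-)PSD of two STFT signals.

    X_i, X_j: arrays of shape (frames, bins).
    alpha: smoothing factor in [0, 1]; larger alpha means longer averaging.
    """
    inst = X_i * np.conj(X_j)              # instantaneous periodogram per TF bin
    psd = np.empty_like(inst)
    psd[0] = inst[0]                       # initialisation choice (assumption)
    for l in range(1, inst.shape[0]):
        psd[l] = alpha * psd[l - 1] + (1.0 - alpha) * inst[l]
    return psd

rng = np.random.default_rng(0)
# Synthetic unit-variance complex noise: 2000 frames, 4 bins.
N = (rng.standard_normal((2000, 4)) + 1j * rng.standard_normal((2000, 4))) / np.sqrt(2)
short = recursive_psd(N, N, alpha=0.65).real
long_ = recursive_psd(N, N, alpha=0.98).real
print("std of estimate, alpha=0.65:", short[200:].std())
print("std of estimate, alpha=0.98:", long_[200:].std())
```

With stationary input, both settings fluctuate around the true PSD of 1, but the short-term average does so with a much larger spread, which is the variance side of the trade-off described above.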

### 2.2 Binaural speech enhancement system

The spectral gain, \( G_i(k,l) \), is determined based on the estimated noise and input PSDs. The enhanced speech signal, \( \widehat{S}_i(k,l) \), is then obtained as

\[ \widehat{S}_i(k,l) = G_i(k,l)\, X_i(k,l). \]

## 3 The proposed noise PSD estimator

In this section, we introduce the proposed noise PSD estimator based on the eigenvalues of the input covariance matrix. After that, an approximation of the proposed estimator based on interaural cues is presented.

### 3.1 Noise PSD estimation based on eigenvalues

We use the coherence function \( \Gamma_N = \mathrm{sinc}\left(2\pi f d_{LR}/c\right) \), where \( d_{LR} \) and \( c \) are the distance between the left and right microphones and the speed of sound, respectively, to model the coherence in the diffuse noise field. This model was chosen because it is simple and effective and has been applied in many binaural speech enhancement techniques [15, 18, 39]. In addition, the head shadowing effect can be approximated simply by adjusting the distance between the microphones [17]. Using the coherence model, the cross-correlation between the left and right channel diffuse noise of a binaural system can be expressed as \( \Phi_N^{LR} = \Gamma_N \Phi_N \) [17]. Then, the 2 × 2 covariance matrix of the binaural input signal in (2) becomes

\[ \mathbf{R}_X = \begin{bmatrix} |H_L|^2 \Phi_S + \Phi_N & H_L H_R^{\ast} \Phi_S + \Gamma_N \Phi_N \\ H_L^{\ast} H_R \Phi_S + \Gamma_N \Phi_N & |H_R|^2 \Phi_S + \Phi_N \end{bmatrix}. \]

\( \Phi_N \):

It should be noted that both the first and second eigenvalues of the input covariance matrix satisfy the above equation.
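The ingredients of this subsection, the diffuse-field coherence model and the eigenvalues of the 2 × 2 input covariance matrix, can be sketched as follows. This is an illustration under stated assumptions: the function names, the microphone distance of 0.17 m, and the speed of sound of 343 m/s are chosen for the example and are not taken from the paper. The sinc is taken as the unnormalised sin(x)/x, whereas `np.sinc(x)` computes sin(πx)/(πx), hence the rescaled argument.

```python
import numpy as np

def diffuse_coherence(f_hz, d_lr=0.17, c=343.0):
    """Gamma_N = sinc(2*pi*f*d_LR/c) with the unnormalised sinc sin(x)/x.

    np.sinc(x) computes sin(pi*x)/(pi*x), hence the argument 2*f*d/c.
    d_lr (m) and c (m/s) are illustrative values (assumptions).
    """
    return np.sinc(2.0 * np.asarray(f_hz) * d_lr / c)

def input_covariance(H_L, H_R, phi_s, phi_n, gamma_n):
    """2x2 covariance of the binaural input for one TF bin."""
    h = np.array([H_L, H_R], dtype=complex)
    speech = phi_s * np.outer(h, np.conj(h))            # rank-1 speech part
    noise = phi_n * np.array([[1.0, gamma_n], [gamma_n, 1.0]])
    return speech + noise

def eigenvalues(R):
    """Closed-form eigenvalues (lambda_1 >= lambda_2) of a 2x2 Hermitian matrix."""
    tr = (R[0, 0] + R[1, 1]).real
    delta = (R[0, 0] - R[1, 1]).real ** 2 + 4.0 * abs(R[0, 1]) ** 2
    return 0.5 * (tr + np.sqrt(delta)), 0.5 * (tr - np.sqrt(delta))

# Example bin at 500 Hz with a target that has 0.6 rad of interaural phase.
R = input_covariance(1.0, np.exp(-0.6j), phi_s=4.0, phi_n=1.0,
                     gamma_n=diffuse_coherence(500.0))
lam1, lam2 = eigenvalues(R)
print(lam1, lam2)
```

The closed form avoids a general eigen-decomposition per time-frequency bin, which is why only the auto- and cross-PSDs are needed to obtain both eigenvalues.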

The estimator in (13) can be compared with the previous channel prediction-based noise PSD estimator in [17], where the noise PSD was obtained by solving a quadratic equation formed using the signals of the channel prediction filter. By substituting (3) and (4) into (13), it is straightforward to show that the estimator in (13) and the one in [17] are equivalent. The details are provided in the Appendix. Thus, the two estimators are expected to achieve numerically identical noise PSD under an ideal condition. On the other hand, another noise PSD estimator using the prediction filter was proposed in [15]. That method in [15] estimates the binaural noise PSD using the target-blocking signal based on the interaural transfer function (ITF) information obtained through the two-channel prediction filter.

However, there are two major differences when the implementation is considered. First, the algorithm in [17] requires an appropriate delay between channel signals to satisfy the causality of the system. It was shown in [40] that inappropriate delays could degrade the performance of the algorithm. Second, the prediction error and ITF need to be calculated explicitly. Therefore, inaccuracies occurring in the process of calculating the prediction error can lead to a bias of the estimated noise PSD. To reduce this bias, [13] proposed a method of calculating those variables using a time-domain adaptive prediction error filter (PEF). However, the performance of the adaptive PEF depends on the filter order, the input SNR, and the delay between the input signals. On the other hand, the proposed algorithm obtains the noise PSD estimate directly from the auto- and cross-PSD of the binaural input signal. Therefore, it can be less sensitive to the bias error of the estimated variables, compared with the method in [13]. In the next section, we first present a method of simplifying the estimator in (13), and later, a method of reducing the bias error will be addressed.

### 3.2 Approximation of the eigenvalue-based noise PSD estimator

In our previous study [16], Eq. (15) was approximated as \( \Delta \approx \left(\left(|H_L|^2 + |H_R|^2\right)\Phi_S + 2\Phi_N\Gamma_N\right)^2 \) based on the assumption that the interaural level differences (ILDs) and interaural time differences (ITDs) are negligible. As a result, the second eigenvalue was simplified to \( \lambda_2 = \Phi_N\left(1 - \Gamma_N\right) \), from which the noise PSD was obtained as \( \widehat{\Phi}_N \approx \lambda_2/\left(1 - \Gamma_N\right) \). However, the ITD at low frequencies normally depends on the direction of the sound source [29], and therefore affects the directional perception of the sound source. In addition, the noise coherence is particularly high at low frequencies; this can amplify the bias caused by an erroneous approximation at low frequencies. Thus, ignoring the ITD causes significant errors in the noise PSD estimates, especially when the speech is located anywhere but in front of the listener. In this paper, we present a simple but accurate approximation of (15), which is effective not only for all target directions but also for all frequency bands.
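The direction dependence of the earlier approximation can be checked numerically. The sketch below builds the 2 × 2 input covariance for one time-frequency bin and divides its smaller eigenvalue by 1 − Γ_N; the helper name `lambda2` and all numeric values (speech PSD 10, noise PSD 1, coherence 0.8, a 60° interaural phase difference) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def lambda2(H_L, H_R, phi_s, phi_n, gamma_n):
    """Smaller eigenvalue of the 2x2 binaural input covariance (one TF bin)."""
    h = np.array([H_L, H_R], dtype=complex)
    noise = phi_n * np.array([[1.0, gamma_n], [gamma_n, 1.0]])
    R = phi_s * np.outer(h, np.conj(h)) + noise
    return np.linalg.eigvalsh(R)[0]        # eigvalsh returns ascending order

phi_s, phi_n, gamma_n = 10.0, 1.0, 0.8     # illustrative low-frequency values

# Frontal target: identical transfer functions (no ILD, no ITD).
front = lambda2(1.0, 1.0, phi_s, phi_n, gamma_n) / (1.0 - gamma_n)

# Lateral target: same magnitudes, 60 degrees of interaural phase difference.
side = lambda2(1.0, np.exp(-1j * np.pi / 3), phi_s, phi_n, gamma_n) / (1.0 - gamma_n)

print(f"frontal estimate: {front:.3f}  (true noise PSD {phi_n})")
print(f"lateral estimate: {side:.3f}  (true noise PSD {phi_n})")
```

For the frontal target the estimate matches the true noise PSD, while for the lateral target it is several times too large, because the small denominator 1 − Γ_N magnifies the error in λ_2 when the coherence is high.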

Here, \( \angle x \) denotes the angle in radians of \( x \). Now, \( \Delta \) is composed of three terms including a perfect square. Because the low-frequency ILDs are known to be insignificant [41], it can be generally assumed that \( |H_L| \approx |H_R| \) at low frequencies. At high frequencies, on the other hand, the noise coherence \( \Gamma_N \) becomes insignificant. Thus, it is possible to ignore the term \( A \) in (16). The third term \( B \) consists of two functions: \( \sin^2\left(\angle \Phi_S^{LR}\right) \) and \( \Gamma_N^2 \). The \( \sin^2\left(\angle \Phi_S^{LR}\right) \) function will have small values at low frequencies, regardless of the location of the speech source, because the wavelength is long relative to the microphone distance. At high frequencies, it increases monotonically with the angle of the speech source until the relative phase difference reaches 90°. However, because the noise coherence \( \Gamma_N \) will be small at high frequencies, the multiplicative combination of \( \sin^2\left(\angle \Phi_S^{LR}\right) \) and \( \Gamma_N \) will still be insignificant compared with the perfect square term.
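The three-term structure described above can be reproduced with a short derivation. This is a sketch under assumptions: \( \Delta \) is taken to be the discriminant of the eigenvalue equation of the 2 × 2 input covariance matrix, the shorthand \( P \) and \( \theta \) is introduced here for compactness, and the grouping into \( A \) and \( B \) is one decomposition consistent with the description in the text, which may differ in form from the authors' expression.

```latex
% Assumed shorthand: P = |H_L|^2 + |H_R|^2, \theta = \angle\Phi_S^{LR},
% with \Phi_S^{LR} = H_L H_R^{*} \Phi_S and a real-valued \Gamma_N.
\begin{align*}
\Delta &= \left(\Phi_X^{LL}-\Phi_X^{RR}\right)^2 + 4\left|\Phi_X^{LR}\right|^2 \\
       &= P^2\Phi_S^2 + 8\,|H_L||H_R|\,\Gamma_N\Phi_N\Phi_S\cos\theta
          + 4\,\Gamma_N^2\Phi_N^2 \\
       &= \left(P\,\Phi_S + 2\,\Gamma_N\Phi_N\cos\theta\right)^2
          \underbrace{-\,4\,\Gamma_N\Phi_N\Phi_S\cos\theta
                      \left(|H_L|-|H_R|\right)^2}_{A}
          + \underbrace{4\,\Gamma_N^2\Phi_N^2\sin^2\theta}_{B}
\end{align*}
```

In this form, \( A \) vanishes when \( |H_L| \approx |H_R| \) or when \( \Gamma_N \) is small, and \( B \) is the product of \( \Gamma_N^2 \) and \( \sin^2\theta \). Dropping both gives \( \sqrt{\Delta} \approx P\,\Phi_S + 2\Gamma_N\Phi_N\cos\theta \), and since the trace of the covariance matrix is \( P\,\Phi_S + 2\Phi_N \), the smaller eigenvalue becomes \( \lambda_2 \approx \Phi_N\left(1 - \Gamma_N\cos\theta\right) \).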

## 4 Compensation for underestimation of noise PSD

When the auto- and cross-PSDs of the input signal are estimated using the first-order recursion algorithm in (5), the smoothing factor, α, has to cope with two contradictory constraints: capturing the time-varying statistics of the signal component and reducing the estimator variance [22, 26, 43]. When the noise statistics are fast time varying, capturing of the instantaneous statistics of the signals is necessary. To this end, a short-term averaging needs to be conducted. However, the short-term averaging can result in bias error of the estimated PSD [16, 25]. In this section, we propose a method of compensating the bias using the speech presence probability.

### 4.1 Bias compensation for eigenvalue

During speech absence, both eigenvalues reflect the noise power, so the two eigenvalues can be combined as

\[ \lambda_c = \beta_n \lambda_2 + \left(1 - \beta_n\right) \lambda_1, \tag{20} \]

where \( \beta_n \) is a weighting parameter. On the other hand, during the presence of speech, only the second eigenvalue reflects the noise power. Thus, the eigenvalue averaging in (20) can be applied only during the speech absence period.

Here, \( p \) is an estimate of the SPP. When a frequency band has a high SPP (\( p \approx 1 \)), \( \beta_n \approx 1 \) and \( \lambda_c \approx \lambda_2 \). Thus, during the presence of speech, only the second eigenvalue is reflected in the noise PSD estimate. When the frequency band has a low SPP (\( p \approx 0 \)), \( \beta_n \) becomes \( \beta_n' \), and the two eigenvalues are combined with the minimum bound, \( \beta_n' \). Accordingly, the bias compensation for the eigenvalue in (20) is mainly applied to frequency bands with low SPP, i.e., noise-dominant frequency bands. Using (17), the maximum eigenvalue can be approximated as \( \lambda_1 = \left(|H_L|^2 + |H_R|^2\right)\Phi_S + \Phi_N\left(1 + \Gamma_N \cos\left(\angle \Phi_S^{LR}\right)\right) \). Thus, the averaged eigenvalue using (20) can be expressed as \( \lambda_c = \Phi_N\left(1 + \Gamma_N \cos\left(\angle \Phi_S^{LR}\right) - 2\beta_n \Gamma_N \cos\left(\angle \Phi_S^{LR}\right)\right) + \left(1 - \beta_n\right)\left(|H_L|^2 + |H_R|^2\right)\Phi_S \), which results in a new noise PSD estimator:

\[ \widehat{\Phi}_N = \frac{\lambda_c - \left(1 - \beta_n\right)\left(|H_L|^2 + |H_R|^2\right)\Phi_S}{1 + \left(1 - 2\beta_n\right)\Gamma_N \cos\left(\angle \Phi_S^{LR}\right)}. \]

In a speech presence region, i.e., \( \beta_n \approx 1 \), the second term in the numerator goes to zero. On the other hand, in a speech absence region, i.e., \( \beta_n \approx 0 \), we have \( \Phi_S \approx 0 \). Therefore, the second term in the numerator can be ignored. Based on these observations, the new noise PSD estimator based on the averaged eigenvalue can be re-expressed as

\[ \widehat{\Phi}_N \approx \frac{\lambda_c}{1 + \left(1 - 2\beta_n\right)\Gamma_N \cos\left(\angle \Phi_S^{LR}\right)}. \tag{23} \]

The minimum bound of the weighting parameter, \( {\beta}_n^{\hbox{'}} \), is experimentally determined as the one providing the lowest logarithmic error (LogErr) between the true and estimated noise PSD. A more detailed procedure can be found in the experimental evaluation. Also, the bands or regions with low SPPs still need to be identified, so in the next subsection, we propose a method of estimating SPP using eigenvalue ratios.
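The eigenvalue compensation can be sketched as below. The function `compensated_noise_psd` follows from the averaged-eigenvalue expression in this subsection: the weighted combination of the two eigenvalues is divided by the factor that multiplies the noise PSD. The helper `weight_from_spp` is hypothetical: the text only states the limits of the weighting parameter (1 for high SPP, the minimum bound for low SPP), so the linear interpolation used here is an assumption, as are the numeric values in the example. The default bound of 0.35 is the value chosen in the experimental evaluation.

```python
import numpy as np

def compensated_noise_psd(lam1, lam2, gamma_n, ipd, beta_n):
    """Noise PSD from the SPP-weighted eigenvalue combination.

    lam_c = beta_n*lam2 + (1 - beta_n)*lam1, divided by the matching
    factor 1 + (1 - 2*beta_n)*gamma_n*cos(ipd).
    """
    lam_c = beta_n * lam2 + (1.0 - beta_n) * lam1
    return lam_c / (1.0 + (1.0 - 2.0 * beta_n) * gamma_n * np.cos(ipd))

def weight_from_spp(p, beta_min=0.35):
    """Hypothetical linear map from SPP p to beta_n (assumption, not from the text).

    It only reproduces the stated limits: beta_n -> 1 for p -> 1 and
    beta_n -> beta_min for p -> 0.
    """
    return beta_min + (1.0 - beta_min) * np.clip(p, 0.0, 1.0)

# Noise-only bin: eigenvalues of phi_n * [[1, g], [g, 1]] are phi_n*(1 +/- g).
phi_n, g = 2.0, 0.6
lam1, lam2 = phi_n * (1 + g), phi_n * (1 - g)
est = compensated_noise_psd(lam1, lam2, g, ipd=0.0, beta_n=weight_from_spp(0.0))
print(est)
```

With exact eigenvalues the result does not depend on the weight in a noise-only bin. The benefit appears with short-term PSD estimates, where the two eigenvalues carry estimation errors and combining them reduces the bias of the estimate. Setting `beta_n=1` reduces the function to the estimator based on the second eigenvalue alone.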

### 4.2 Estimation of the speech presence probability

The eigenvalue compensation method introduced in the previous subsection requires an SPP estimator in order to obtain \( p \). Energy ratio-based approaches [27, 44–47] have been widely used to determine the speech activity region. Under the assumption that the left and right channel diffuse noise are uncorrelated, (14) is reduced to \( \lambda_1 = \left(|H_L|^2 + |H_R|^2\right)\Phi_S + \Phi_N = \widehat{\Phi}_S + \Phi_N \) and \( \lambda_2 = \Phi_N \). Then, the a priori SNR can be calculated as \( \xi = \widehat{\Phi}_S/\Phi_N = \lambda_1/\lambda_2 - 1 \), which indicates that the eigenvalue ratio \( \lambda_1/\lambda_2 \) can be used as an alternative to the energy ratio. Thus, in this paper, the energy ratio-based SPP in [3] is modified using the eigenvalue ratio.
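The relation between the eigenvalue ratio and the a priori SNR is easy to verify numerically. The sketch below assumes uncorrelated noise in the two channels, as in the paragraph above; the function name `a_priori_snr` and the transfer-function and PSD values are illustrative assumptions.

```python
import numpy as np

def a_priori_snr(R):
    """xi = lambda_1/lambda_2 - 1 for a 2x2 Hermitian covariance matrix R."""
    lam2, lam1 = np.linalg.eigvalsh(R)     # ascending order
    return lam1 / lam2 - 1.0

# Uncorrelated noise (Gamma_N = 0): R = phi_s * h h^H + phi_n * I.
h = np.array([0.8, 1.1 * np.exp(-0.4j)])
phi_s, phi_n = 3.0, 0.5
R = phi_s * np.outer(h, np.conj(h)) + phi_n * np.eye(2)

xi = a_priori_snr(R)
expected = (abs(h[0]) ** 2 + abs(h[1]) ** 2) * phi_s / phi_n
print(xi, expected)
```

The two printed values agree, which is why the eigenvalues that are already computed for the noise PSD estimate can be reused for speech activity detection.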

\( k_1 \) bands are averaged prior to the likelihood calculation to reduce random fluctuation. The threshold, \( T_L \), can be empirically determined using a method similar to that in [3]. In order to improve the robustness of performance, an additional frame likelihood of speech is measured as

The threshold, \( T_F(l) \), is updated using a convex combination:

Here, \( \beta_{com} \le 1 \) is a weighting factor and \( B_{S+N}(l) \) and \( B_N(l) \) denote buffers corresponding to noisy and noise-only cases, respectively, in which the log ratios of \( L \) consecutive frames, \( 10\log_{10}\rho_F(m) \), \( l - L + 1 \le m \le l \), are stored. Now, the threshold, \( T_F(l) \), is adaptively adjusted according to the convex combination between the minimum of the elements of \( B_{S+N}(l) \) and the maximum of the elements of \( B_N(l) \). Finally, the SPP is estimated as

Here, \( p'(k,l) = P_L(k,l) \cdot P_F(l) \) and \( 0 \le \alpha_{SPP} \le 1 \) is a smoothing parameter. It is important to mention that the proposed SPP estimator in (29) re-uses the eigenvalues computed using (10).

### 4.3 The proposed noise PSD estimator with SPP-based eigenvalue compensation

## 5 Computer simulations

In this section, the performance of the proposed noise PSD estimator is evaluated through computer simulations in a binaural speech enhancement situation and compared with those of the previous methods. All speech sentences used in the computer simulations were taken from the TIMIT database [50] and convolved with binaural room impulse responses (BRIRs) from the Oldenburg database [51] to simulate target directions. Binaural noises taken from the ETSI database [52] and Oldenburg database were added to the target speech at various SNRs. The left and right channel input signals were decomposed into 32 *ms* subframes with 50% overlap at a sampling rate of 16 kHz. The length of the subframe was determined to satisfy the rank-1 property [53].
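The analysis settings stated above translate into concrete frame parameters: 32 ms at 16 kHz is 512 samples, and 50% overlap gives a hop of 256 samples. A minimal framing sketch is given below; the Hann analysis window and the function name `stft` are assumptions, since the text does not name a window, and the input is synthetic noise rather than the databases used in the paper.

```python
import numpy as np

FS = 16_000                      # sampling rate from the text (Hz)
FRAME_LEN = FS * 32 // 1000      # 32 ms -> 512 samples
HOP = FRAME_LEN // 2             # 50% overlap -> 256 samples

def stft(x, frame_len=FRAME_LEN, hop=HOP):
    """Return an array of shape (frames, bins) of one-sided spectra.

    The Hann analysis window is an assumption; the text does not name one.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[l * hop : l * hop + frame_len] for l in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)

x = np.random.default_rng(0).standard_normal(FS)    # 1 s of synthetic noise
X = stft(x)
print(X.shape)
```

Applied to the left and right channels separately, this yields the per-channel spectra from which the auto- and cross-PSDs are estimated.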

### 5.1 Bias analysis of the approximated noise PSD estimator

### 5.2 Effectiveness of the eigenvalue compensation method

(\( p = 0 \)) by changing \( \beta_n' \). The noise PSD was obtained using (23) with the energy-compensated eigenvalue, \( \lambda_c \). The results for nine different noise types are displayed in Fig. 5, where it can be observed that the minimum LogErr was obtained at around \( 0.2 < \beta_n' < 0.4 \), regardless of the noise type. Thus, we set \( \beta_n' = 0.35 \) for all following simulations employing the eigenvalue compensation scheme.

*α*, of the first-order recursion algorithm in (5). For this simulation, 20 speech sentences taken from the TIMIT database and mensa noise from the ETSI database [52] at 0 dB SNR were used as input. For the LogErr calculation, the auto PSD of the left channel noise was considered the true noise PSD. The results are shown in Fig. 6. It can be observed that the noise PSDs obtained using (19), blue lines with circle and square markers, were significantly biased, particularly when the averaging was conducted over short terms with small α. High variation of the input PSDs resulted in high LogErr. The results obtained using (13) are represented by black lines with diamonds and triangle markers and are almost identical to those obtained with (19). However, the LogErr was noticeably reduced by using the proposed eigenvalue compensation scheme as indicated by the red asterisk and diamond markers. The results in Fig. 6 clearly confirm the benefits of the proposed eigenvalue compensation scheme. The parameter choice α will be discussed in Section 5.3.
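The LogErr measure is not defined in this part of the text. The sketch below assumes the definition commonly used in the noise-tracking literature (for example in [5]): the mean absolute difference in dB between the true and estimated noise PSDs over all frames and frequency bins. The function name `log_err` and the small regularisation constant `eps` are assumptions for the example.

```python
import numpy as np

def log_err(true_psd, est_psd, eps=1e-12):
    """Mean absolute log-spectral error in dB over all frames and bins."""
    ratio = (np.asarray(true_psd) + eps) / (np.asarray(est_psd) + eps)
    return float(np.mean(np.abs(10.0 * np.log10(ratio))))

true_psd = np.full((100, 257), 1.0)
over = log_err(true_psd, 2.0 * true_psd)    # estimate twice the true PSD
under = log_err(true_psd, 0.5 * true_psd)   # estimate half the true PSD
print(over, under)
```

Under this definition, over- and underestimation by the same factor are penalised equally, so a biased estimator shows a high LogErr regardless of the sign of the bias.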

To utilize the benefits of the eigenvalue compensation scheme, it is important to have a correct SPP parameter, \( p \). Thus, the SPP estimator in (29) was evaluated and compared with the conventional energy-based SPP in [3]. All test parameters were set as described in [3]. We used \( \beta_{com} = 0.1 \) for the convex combination, and the size of the buffers, \( B_{S+N}(l) \) and \( B_N(l) \), was set to 15. The \( \beta_{SPP} \) in (27) and \( \alpha_{SPP} \) in (29) were fixed to 0.65 and 0.3, respectively. We used \( k_1 = 1 \), \( T_L = 6 \) below 5500 Hz, and \( T_L = 8 \) above 5500 Hz.

### 5.3 Noise estimation evaluation

The proposed noise PSD estimator in Fig. 2 was evaluated in comparison to the previous methods including the single-channel SPP-based noise PSD estimator (SC-SPP) [5], the improved dual-channel noise PSD estimator (ImNPSD) [17], the dual-channel noise PSD estimator (DC-NPSD) [6], and the bias-corrected blocking method of the interaural transfer function (BB-ITF) [15]. The proposed noise PSD estimator in (23) is referred to as “Prop” in all plots. To make the comparison consistent, the smoothing factor for the estimation of auto-and cross-PSDs was fixed at α = 0.65 in all tested algorithms. As mentioned in Section 4, long-term smoothing can reduce the estimator variance, but at the same time, short-term smoothing is required to capture the fast time-varying statistics of the signals. Thus, as a compromise between these two contradictory requirements, we experimentally chose a smoothing factor that can balance the tracking performance and LogErr. BB-ITF was implemented using the fast least-mean square (FLMS) algorithm [55] based on a 256-tap prediction error filter. The forgetting factor for signal power smoothing was set to 0.9, and the step size for updating the weight was 0.1. We also used a causality delay of 32 samples to account for the largest possible ITD of the binaural system. The same error signal, i.e., (4a) in [15] was utilized to implement ImNPSD.

° and −90° azimuth, respectively. The experimental results show that the proposed algorithm always obtained the lowest LogErr among the compared algorithms in all tested conditions. BB-ITF achieved the second-best performance. For ImNPSD and BB-ITF, the causality delay was determined to achieve the best performance, resulting in a 32-sample delay. For the target speech at −90°, all binaural algorithms underwent slight performance degradation. However, the proposed algorithm still maintained the best performance even though there was no consideration of signal delay. Additionally, the single-channel algorithm (SC-SPP) showed a comparable performance to the binaural algorithms for the target speech at −90°. However, this algorithm is not concerned with the preservation of binaural cues such as the ILD and ITD.

### 5.4 Speech enhancement performance

The binaural speech enhancement system in Fig. 1 was implemented by using the proposed noise PSD estimator (Fig. 2) and conventional noise PSD estimators, and their respective performances were evaluated in terms of the quality and intelligibility of the enhanced speech. To this end, we measured the frequency-weighted SNR improvement (*∆*fwSNR) [56], short-time objective intelligibility improvement (*∆*STOI) [57], and perceptual evaluation of speech quality improvement (*∆*PESQ). All objective parameters were expressed as the difference between the corresponding measures at the output and the input of the system. Since fwSNR [56] was optimized at an 8-kHz sampling rate, the signals were down-sampled to 8 kHz before the measurement. Other objective measures including *∆*STOI and *∆*PESQ were obtained at a 16-kHz sampling rate.

*∆*fwSNR results for the left channel with a frontal target speech source. The proposed noise PSD estimator achieved the best improvement among the tested systems. This was mainly due to the superior accuracy of the proposed noise PSD estimator at low frequencies, where the noise power was concentrated. Results with a speech source at −90° are presented in (b). It was also found that the proposed algorithm obtained the best performance. Again, all algorithms achieved lower performance than in the case with a target at 0°. DC-NPSD worsened *∆*fwSNR due to the mismatched assumption.

## 6 Conclusions

A robust noise PSD estimator for a binaural speech enhancement system was presented. The proposed algorithm obtained the noise PSD based on the second eigenvalue of the covariance matrix of the binaural input signal. To improve the accuracy of the noise PSD estimate, the eigenvalues in noise-dominant periods were averaged using SPP, which resulted in a reduction of bias error. The proposed algorithm robustly estimated the noise PSD for targets located in all directions around the listener in fast time-varying noise environments. The proposed algorithm is theoretically equivalent to the conventional channel prediction-based algorithm. However, since it does not require a causality delay and explicit estimation of the prediction errors, it is less computationally demanding and has less chance of being affected by the estimation bias due to fast smoothing. The experimental results confirmed that the proposed algorithm could achieve higher performance than the conventional algorithms regardless of the target direction, input SNR, and noise types. The objective parameters also confirmed that the proposed algorithm could obtain slightly better speech quality and intelligibility performance than the conventional techniques.

## Declarations

### Authors’ contributions

The main contribution of this article is the introduction of a simple and robust noise PSD estimator for binaural speech enhancement systems. The proposed algorithm computes the time-varying diffuse noise PSD based on an eigenvalue of the input covariance matrix. Therefore, the noise PSD in the current frame can be calculated regardless of the presence or absence of speech or the direction of the speech. In addition, an eigenvalue compensation method based on the speech presence probability is applied to improve the accuracy of the estimator. As a result, the proposed algorithm showed better results than previous algorithms in terms of accuracy, quality, and intelligibility. It also does not require a causality delay. All authors discussed the final results. All authors read and approved the final manuscript.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Ephraim, Y, & Malah, D. (1984). Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. *IEEE Trans. Acoust., Speech and Signal Process*, *32*(6), 1109–1121.
- Martin, R. (2001). Noise power spectral density estimation based on optimal smoothing and minimum statistics. *IEEE Trans. Speech and Audio Process.*, *9*(5), 504–512.
- Rangachari, S, & Loizou, PC. (2006). A noise-estimation algorithm for highly non-stationary environments. *Speech Comm.*, *48*(2), 220–231.
- Fan, N, Rosca, J, Balan, R. (2007). Speech noise estimation using enhanced minima controlled recursive averaging. In *Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE.
- Gerkmann, T, & Hendriks, RC. (2012). Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. *IEEE Trans. Audio, Speech, and Lang. Process*, *20*(4), 1383–1393.
- Nelke, CM, Beaugeant, C, Vary, P. (2013). Dual microphone noise PSD estimation for mobile phones in hands-free position exploiting the coherence and speech presence probability. In *Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)*.
- Souden, M, Benesty, J, Affes, S. (2010). A study of the LCMV and MVDR noise reduction filters. *IEEE Trans. Signal Process.*, *58*(9), 4925–4935.
- Doclo, S, et al. (2007). Frequency-domain criterion for the speech distortion weighted multichannel wiener filter for robust noise reduction. *Speech Comm.*, *49*(7), 636–656.
- Spriet, A, Moonen, M, Wouters, J. (2004). Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction. *Signal Process.*, *84*(12), 2367–2387.
- Cornelis, B, et al. (2010). Theoretical analysis of binaural multimicrophone noise reduction techniques. *IEEE Trans. on Audio, Speech, and Lang. Process*, *18*(2), 342–355.
- Marquardt, D, et al. (2015). Theoretical analysis of linearly constrained multi-channel Wiener filtering algorithms for combined noise reduction and binaural cue preservation in binaural hearing aids. *IEEE/ACM Trans. on Audio, Speech, and Lang. Process*, *23*(12), 2384–2397.
- Marquardt, D, Hohmann, V, Doclo, S. (2015). Interaural coherence preservation in multi-channel wiener filtering-based noise reduction for binaural hearing aids. *IEEE/ACM Trans. on Audio, Speech and Lang. Process*, *23*(12), 2162–2176.
- Thiemann, J, et al. (2016). Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene. *EURASIP J. on Advances in Signal Process*, *2016*(1), 12.
- Jeub, M, et al. (2011). Robust dual-channel noise power spectral density estimation. In *Proceedings of the European Signal Processing Conference (EUSIPCO), Barcelona, Spain*.
- Azarpour, M, Enzner, G, Martin, R. (2014). Binaural noise PSD estimation for binaural speech enhancement. In *Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE.
- Ji, Y, et al. (2013). Robust noise PSD estimation for binaural hearing aids in time-varying diffuse noise field. In *Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. IEEE.
- Kamkar-Parsi, AH, & Bouchard, M. (2009). Improved noise power spectrum density estimation for binaural hearing aids operating in a diffuse noise field environment. *IEEE Trans. on Audio, Speech, and Lang. Process*, *17*(4), 521–533.
- Braun, S, & Habets, EA. (2015). A multichannel diffuse power estimator for dereverberation in the presence of multiple sources.
*EURASIP J. on Audio, Speech, and Music Process*,*2015*(1), 1–14.View ArticleGoogle Scholar - Azarpour, M, Enzner, G, Martin, R (2013). Adaptive binaural noise reduction based on matched-filter equalization and post-filtering. In
*Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP)*IEEE.Google Scholar - Xu, Y, et al. (2014). An experimental study on speech enhancement based on deep neural networks.
*IEEE Signal processing letters*,*21*(1), 65–68.View ArticleGoogle Scholar - Xu, Y, et al. (2015). A regression approach to speech enhancement based on deep neural networks.
*IEEE/ACM Trans. on Audio, Speech and Lang. Process*,*23*(1), 7–19.View ArticleGoogle Scholar - Laska, BN, Bolic, M, Goubran, RA (2010).
*Coherence-assisted Wiener filter binaural speech enhancement*. In*Instrumentation and Measurement Technology Conference (I2MTC), 2010 IEEE*IEEE.Google Scholar - McCowan, IA, & Bourlard, H (2002). Microphone array post-filter for diffuse noise field. In
*Proceedings of the international conference on acoustics, speech, and signal processing (ICASSP)*IEEE.Google Scholar - Abutalebi, HR, et al. (2004). A hybrid subband adaptive system for speech enhancement in diffuse noise fields.
*Signal Processing Letters, IEEE*,*11*(1), 44–47.View ArticleGoogle Scholar - Merimaa, J, Goodwin, MM, Jot, J-M (2007). Correlation-based ambience extraction from stereo recordings. In
*Audio Engineering Society Convention 123*Audio Engineering Society.Google Scholar - Guérin, A, Le Bouquin-Jeannés, R, Faucon, G. (2003). A two-sensor noise reduction system: applications for hands-free car kit.
*EURASIP J. on Applied Signal Process.*,*2003*, 1125–1134.MATHGoogle Scholar - Cohen, I. (2002). Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator.
*Signal Processing Letters, IEEE*,*9*(4), 113–116.MathSciNetView ArticleGoogle Scholar - Griffiths, LJ, & Jim, CW. (1982). An alternative approach to linearly constrained adaptive beamforming.
*IEEE Trans. on Antennas and Propagation*,*30*(1), 27–34.View ArticleGoogle Scholar - Li, J, et al. (2011). Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication.
*Speech Comm.*,*53*(5), 677–689.View ArticleGoogle Scholar - Blauert, J. Spatial hearing: The psychophysics of human sound localization (MIT press, Cambridge, 1997)Google Scholar
- Doclo, S, et al. (2015). Multichannel signal enhancement algorithms for assisted listening devices: exploiting spatial diversity using multiple microphones.
*Signal Processing Magazine, IEEE*,*32*(2), 18–30.View ArticleGoogle Scholar - Loizou, PC. Speech enhancement: Theory and practice (CRC press, Boca Raton, 2013)Google Scholar
- Krueger, A, Warsitz, E, Haeb-Umbach, R. (2011). Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation.
*IEEE Trans. on Audio, Speech, and Lang. Process*,*19*(1), 206–219.View ArticleGoogle Scholar - Roman, N, Srinivasan, S, Wang, D. (2006). Binaural segregation in multisource reverberant environments.
*The Journal of the Acoustical Society of America*,*120*(6), 4040–4051.View ArticleGoogle Scholar - Dorbecker, M, & Ernst, S (1996). Combination of two-channel spectral subtraction and adaptive Wiener post-filtering for noise reduction and dereverberation. In
*Proceedings of the European Signal Processing Conference (EUSIPCO)*Citeseer.Google Scholar - Cook, RK, et al. (1955). Measurement of correlation coefficients in reverberant sound fields.
*The J. of the Acoustical Society of America*,*27*(6), 1072–1077.View ArticleGoogle Scholar - Lindevald, I, & Benade, A. (1986). Two-ear correlation in the statistical sound fields of rooms.
*The J. of the Acoustical Society of America*,*80*(2), 661–664.View ArticleGoogle Scholar - Jeub, M, Dorbecker, M, Vary, P. (2011). A semi-analytical model for the binaural coherence of noise fields.
*Signal Processing Letters, IEEE*,*18*(3), 197–200.View ArticleGoogle Scholar - McCowan, IA, & Bourlard, H. (2003). Microphone array post-filter based on noise field coherence.
*IEEE Trans. Speech and Audio Process.*,*11*(6), 709–716.View ArticleGoogle Scholar - Azarpour, M, Enzner, G, Martin, R (2012).
*Distortionless-response vs. matched-filter-array processing for adaptive binaural noise reduction*. In*Acoustic Signal Enhancement; Proceedings of IWAENC 2012; International Workshop on*VDE.Google Scholar - Algazi, VR, Avendano, C, Duda, RO. (2001). Elevation localization and head-related transfer function analysis at low frequencies.
*The J. of the Acoustical Society of America*,*109*(3), 1110–1122.View ArticleGoogle Scholar - Lefkimmiatis, S, & Maragos, P. (2007). A generalized estimation approach for linear and nonlinear microphone array post-filters.
*Speech Comm.*,*49*(7), 657–666.View ArticleGoogle Scholar - Rahmani, M, Akbari, A, Ayad, B. (2009). An iterative noise cross-PSD estimation for two-microphone speech enhancement.
*Appl. Acoust.*,*70*(3), 514–521.View ArticleGoogle Scholar - Cohen, I, & Berdugo, B. (2002). Noise estimation by minima controlled recursive averaging for robust speech enhancement.
*Signal Processing Letters, IEEE*,*9*(1), 12–15.View ArticleGoogle Scholar - Soleimani, S, & Ahadi, S (2008). Voice activity detection based on combination of multiple features using linear/kernel discriminant analyses. In
*Information and communication technologies: From theory to applications, 2008. ICTTA 2008. 3rd International Conference on*IEEE.Google Scholar - Evangelopoulos, G, & Maragos, P (2005). Speech event detection using multiband modulation energy. In
*INTERSPEECH*.Google Scholar - Davis, A, Nordholm, S, Togneri, R. (2006). Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold.
*IEEE Trans. on Audio, Speech, and Lang. Process*,*14*(2), 412–424.View ArticleGoogle Scholar - Ghosh, PK, Tsiartas, A, Narayanan, S. (2011). Robust voice activity detection using long-term signal variability.
*IEEE Trans. on Audio, Speech, and Lang. Process.*,*19*(3), 600–613.View ArticleGoogle Scholar - Ma, Y, & Nishihara, A. (2013). Efficient voice activity detection algorithm using long-term spectral flatness measure.
*EURASIP J. on Audio, Speech, and Music Process*,*2013*(1), 1–18.View ArticleGoogle Scholar - Zue, V, Seneff, S, Glass, J. (1990). Speech database development at MIT: TIMIT and beyond.
*Speech Comm.*,*9*(4), 351–356.View ArticleGoogle Scholar - Kayser, H, et al. (2009). Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses.
*EURASIP J. on Advances in Signal Process.*,*2009*, 6.View ArticleGoogle Scholar - ETSI, EG, and 202396-1, Speech multimedia transmission quality (STQ); speech quality performance in the presence of background noise; part 1: background noise simulation technique, and background noise database
*.*2009.Google Scholar - Kim, G, & Cho, NI. (2008). Frequency domain multi-channel noise reduction based on the spatial subspace decomposition and noise eigenvalue modification.
*Speech Comm.*,*50*(5), 382–391.View ArticleGoogle Scholar - Moore, BC. An introduction to the psychology of hearing (Brill, Leiden, 2012)Google Scholar
- Haykin, SS (2008).
*Adaptive filter theory*. India: Pearson Education.MATHGoogle Scholar - Hu, Y, & Loizou, PC. (2008). Evaluation of objective quality measures for speech enhancement.
*IEEE Trans. on Audio, Speech, and Lang. Process*,*16*(1), 229–238.View ArticleGoogle Scholar - Taal, CH, et al. (2011). An algorithm for intelligibility prediction of time-frequency weighted noisy speech.
*IEEE Trans. on Audio, Speech, and Lang. Process*,*19*(7), 2125–2136.View ArticleGoogle Scholar