An online algorithm for echo cancellation, dereverberation and noise reduction based on a Kalman-EM Method
EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 33 (2021)
Abstract
Many modern smart devices are equipped with a microphone array and a loudspeaker (or are able to connect to one). Acoustic echo cancellation algorithms, specifically their multi-microphone variants, are essential components in such devices. On top of acoustic echoes, other commonly encountered interference sources in telecommunication systems are noise and reverberation; the latter may deteriorate the desired speech quality in acoustic enclosures, specifically if the speaker distance from the array is large. Although suboptimal, the common practice in such scenarios is to treat each problem separately. In the current contribution, we use a unified statistical model to simultaneously tackle the three problems. Specifically, we propose a recursive EM (REM) algorithm for solving echo cancellation, dereverberation and noise reduction. The proposed approach is derived in the short-time Fourier transform (STFT) domain, with time-domain filtering approximated by the convolutive transfer function (CTF) model. In the E-step, a Kalman filter is applied to estimate the near-end speaker, based on the noisy and reverberant microphone signals and the echo reference signal. In the M-step, the model parameters, including the acoustic systems, are inferred. Experiments with human speakers were carried out to examine the performance in dynamic scenarios, including a walking speaker and a moving microphone array. The results demonstrate the efficiency of the echo canceller in adverse conditions together with a significant reduction in reverberation and noise. Moreover, the tracking capabilities of the proposed algorithm were shown to outperform baseline methods.
1 Introduction
1.1 The echo cancellation problem
Acoustic echo cancellation algorithms are an essential component in many telecommunication systems such as hands-free devices, conference room speakerphones and hearing aids [1–3]. Moreover, in modern devices, such as smart speakers that play loud music, it is mandatory to integrate an acoustic echo cancellation (AEC) algorithm to enable proper functionality of automatic speech recognition (ASR) systems, especially in the task of recognizing a hotword. Echo control is also common in robot audition applications, to enable proper human-robot interaction. This may add further complexity to the problem, as the robot may move while capturing the sound from the speakers.
Generally, the role of AEC algorithms is to suppress the interference related to a far-end speaker (a known reference signal) and enhance the desired speech signal, denoted the near-end speaker. This task requires an estimate of the acoustic path relating the loudspeaker and the microphone, which is obtained by applying an adaptive filter [4–8]. Then, the far-end signal is convolved with the estimated echo path to obtain a replica of the echo signal as received by the microphone. An estimate of the desired near-end signal is finally obtained by subtracting the estimated echo signal from the microphone signal. In [9], an acoustic echo control method is derived, including an echo canceller and a post-filter. The algorithm proposed there is based on the Kalman filter and provides an optimal statistical adaptation framework for the filter coefficients in time-varying and noisy acoustic environments.
1.2 Literature review
Many modern devices are equipped with more than one microphone. The common and most straightforward solution for cancelling the echo signal in the presence of noise is to first independently apply an AEC between the loudspeaker and each of the microphones and then to apply a beamformer. Cascade schemes, implemented in the time-domain, for joint AEC and beamforming are presented in [10, 11], with either AEC preceding the beamformer or vice versa. A frequency-domain implementation addressing the joint noise reduction and multi-microphone echo cancellation is proposed in [12]. The beamformer involves a generalized sidelobe canceller (GSC) structure and the AEC is implemented by applying the block least-mean-square (BLMS) procedure [13]. Another approach, combining a minimum variance distortionless response (MVDR) beamformer and a recursive least-squares (RLS)-based AEC, is presented in [14].
A multichannel echo cancellation method with low complexity is presented in [15]. The method relies on a relative transfer function (RTF) scheme for multi-microphone AEC to reduce the overall computational load. Furthermore, it incorporates residual echo reduction into the beamformer design. This method is formulated in the STFT domain using the CTF [16] approximation.
Most studies in the literature assume that the physical distance between the far-end signal source and the microphones is small. It is a reasonable assumption since in many devices the microphones and the loudspeaker are mounted into the same device. However, when the loudspeaker is an external device connected by a cable or wirelessly by Bluetooth, it can be located anywhere in the room. As a result, the received echo signal may include a significant amount of reflections. In such cases, the length of the echo path should take into account the multiple acoustic reflections, implying a long adaptive filter. When the adaptive filter cannot entirely represent the echo path, the AEC output may suffer from a significant residual echo.
A single-microphone approach for jointly suppressing reverberation of the near-end speaker, residual echo and background noise is presented in [17]. A spectral post-filter was developed to efficiently dereverberate the desired speech signal, together with the suppression of the late residual echo and the background noise.
In [18], a two-microphone approach was presented. This algorithm comprises an adaptive filter to eliminate non-coherent signal components such as ambient noise and the reverberation of the near-end speech, in addition to echo cancellation. Another multichannel algorithm that jointly addresses the three problems is presented in [19]. An iterative expectation-maximization (EM) technique is used for speech dereverberation, acoustic echo reduction, and noise reduction. The method proposed there defines two state-space models, one for the acoustic echo-path and the other for the reverberated near-end speaker. The reverberant speech source model is assumed to follow a noiseless autoregressive model. Two parameter optimization stages based on the Kalman smoother were applied to each state-space model in the E-step. The joint echo cancellation and dereverberation problem is also discussed in [20] for robot audition. An independent component analysis (ICA) scheme is adopted in order to provide a natural framework for these two problems using a microphone array.
The statistics of the acoustic impulse response (AIR) are commonly used in dereverberation algorithms. A single-microphone method for the suppression of late room reverberation based on spectral subtraction is presented in [21]. This concept is extended to the multi-microphone case in [22]. The problem is formulated in the STFT domain while taking into account the contribution of the direct-path in [23].
Yoshioka et al. [24] developed an EM algorithm for dereverberation and noise reduction, where the room impulse response (RIR) is modelled as an autoregressive (AR) process in the STFT domain. An iterative and sequential Kalman expectation-maximization (KEM) scheme for single-microphone speech enhancement in the time-domain was introduced in [25]. This method was extended to a multi-microphone speech dereverberation method in [26], applied in the STFT domain, where the acoustic systems are approximated by the CTF model.
Many modern applications should address cases where the desired speaker, the microphone array and even the interference signal are moving, hence necessitating time-varying online parameter estimation. Unfortunately, the Wiener filter or the Kalman smoother cannot be straightforwardly applied in these cases, as they also utilize future samples. The statistical model of these algorithms should be adjusted to the dynamic scenario.
The REM, which is an efficient scheme for sequential parameter estimation, is particularly suitable for estimating time-varying parameters typical of dynamic scenarios. Titterington [27] formulated an online EM scheme using a stochastic approximation version of the modified gradient recursion. A recursive algorithm is proposed in [28] considering the convergence properties of Titterington's algorithm. The estimates generated by the recursive EM algorithm were shown to converge with probability one to a stationary point of the likelihood function. Recursive algorithms based on KEM were presented in [25, 29] using a gradient descent algorithm for solving the maximum likelihood (ML) optimization. In [30], recursive EM methods for time-varying parameters were introduced with applications to multiple target tracking. Cappé and Moulines [31] proposed another online version of the EM algorithm applicable to latent variable models of independent observations. A proof of convergence to a stationary point under certain additional conditions was established in that paper. For dependent observations, a recursive ML method was presented in [32] and is supported by a convergence proof. This method refers to state-space models in which the state process and the observations depend on a finite set of previous observations.
The acoustic paths can be treated as stochastic processes under the Bayesian framework. An online EM-based dereverberation algorithm is presented in [33]. The acoustic paths were represented as random variables following a first-order Markov chain and estimated in the E-step by using the Kalman filter. The speech components were modelled as time-varying parameters and were estimated in the M-step.
An online algorithm for dereverberation based on a recursive Kalman expectation-maximization (RKEM) approach is presented in [34], where the acoustic parameters and the clean signal are jointly estimated. We refer to this algorithm as recursive Kalman expectation-maximization for dereverberation (RKEMD). This framework is extended in the current contribution to jointly address the echo cancellation, dereverberation and noise reduction problems.
1.3 Main contributions and outline
While most studies treat the problems of echo cancellation, dereverberation, and noise reduction separately, only a few propose a combined solution. In this paper, we present an online algorithm that addresses the three problems with a unified statistical model using a microphone array. The microphone signal is degraded by an echo signal and an additive noise in highly reverberant environments. The proposed method is applied in the STFT domain using the RKEMD framework and simultaneously addresses all interfering sources. The acoustic systems of the near-end and far-end signals are approximated by the CTF model and the statistical model is represented in a state-space formulation. Using a double-talk detector (DTD), our method suspends the adaptation of the acoustic systems' coefficients when their relevant signals are inactive, but still enables adaptation during double-talk. It is also capable of tracking time-variations of the acoustic systems. Hence, a feasible solution is provided in realistic dynamic scenarios when the near-end speaker is moving, and even when the microphone array itself is moving.
The structure of the manuscript is as follows. In Section 2, the statistical model of the problem is presented. The recursive EM scheme is derived in Section 3. The desired near-end signal is estimated as a by-product of the E-step of this scheme. In the recursive version, the E-step boils down to a Kalman filter that is applied to the observed signal with the estimated echo signal subtracted. In the M-step, the CTF coefficients and the noise parameters are recursively estimated. It is further shown that the instantaneous speech variance cannot be estimated using the REM procedure and an external estimator is derived instead. Section 4 describes the DTD that facilitates a proper implementation of the echo cancellation stage. An experimental study of different realistic scenarios, including the challenging scenario of a moving microphone array, was carried out at the Bar-Ilan acoustic lab and is detailed in Section 5. Conclusions are drawn in Section 6.
2 Statistical Model
Let x[n] be the clean near-end signal and y[n] be the far-end signal in the time-domain. The signals propagate in an acoustic enclosure before being picked up by a J-microphone array. The microphone signals are denoted by
where ∗ denotes time-domain convolution and \(j \in \mathcal {S}_{J}=[1,\ldots,J]\) is the microphone index. h_{j}[n] and g_{j}[n] are the RIRs relating the signals x[n] and y[n], respectively, to the jth microphone, and v_{j}[n] is the additive noise as received by the jth microphone.
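In terms of these quantities, the microphone-signal model referred to as (1) can be written in the following form. The symbol z_{j}[n] for the jth microphone signal is taken from its later STFT-domain use; the expression is reconstructed from the definitions above and should be read as a consistent restatement:

\[ z_{j}[n] = h_{j}[n] \ast x[n] + g_{j}[n] \ast y[n] + v_{j}[n], \qquad j \in \mathcal {S}_{J}. \]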
The signals x[n] and y[n] are represented in the STFT domain by x(t,k) and y(t,k), respectively, where t≥1 is the time-frame index and \(k \in \mathcal {S}_{K}=[0,\ldots,K-1]\) is the frequency-bin index. We assume that the clean speech can be modelled as a complex-Gaussian variable, independent across STFT time-frames and frequencies (see [35]), with zero-mean and variance ϕ_{x}(t,k)
where \(\mathcal {N}_{C}\) denotes a proper complex-Gaussian distribution.
In order to reduce the computational complexity and to facilitate the model analysis, we consider the CTF approximation [16] for the STFT representation of the time-domain RIR. The time-domain model in (1) can be approximated by
where the CTF systems are:
and the state-vectors of the desired speech signal and the acoustic reference signal are, respectively
L is the length of CTF systems that depends on the reverberation time.
The noise signal v_{j}(t,k) is assumed to be a stationary complex-Gaussian spatially uncorrelated random process,
and \( E\left \{ {v_{j}(t,k)} v_{i}^{\ast }(t,k)\right \}=0\) for j≠i. For conciseness, the frequency index k will be omitted when no ambiguity arises.
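The CTF observation model described above can be illustrated with a short NumPy sketch for a single frequency bin and a single frame. The function name `ctf_observation`, the array layout (one row of CTF coefficients per microphone) and the ordering of the state vectors (most recent frame first) are assumptions made for illustration, not notation fixed by the text.

```python
import numpy as np

def ctf_observation(h, g, x_state, y_state, noise_var, rng):
    """Simulate one frame of the J microphone signals in one frequency bin.

    h, g      : (J, L) CTF coefficients of the near-end and echo paths (assumed layout)
    x_state   : (L,) last L STFT frames of the near-end signal, newest first
    y_state   : (L,) last L STFT frames of the far-end reference, newest first
    noise_var : scalar or (J,) noise variance per microphone
    """
    J = h.shape[0]
    # Proper complex-Gaussian noise, drawn independently for each microphone.
    v = np.sqrt(np.asarray(noise_var) / 2.0) * (
        rng.standard_normal(J) + 1j * rng.standard_normal(J)
    )
    # Reverberant near-end speech + echo + noise.
    return h @ x_state + g @ y_state + v
```

Setting `noise_var` to zero returns the noiseless sum of the two filtered signals, which is a convenient sanity check.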
The signal model can be represented in the following state-space form:
where x_{t} and y_{t} were defined in (5) and d_{t} is defined as the observed signal after the subtraction of the echo signal contribution. The state-transition matrix is given by
the innovation process is given by
the measurement vector is given by
the observation matrices are
with h_{j} and g_{j} the CTF systems, as defined in (4), and the noise vector is given by
In the algorithm derivation, the following second-order statistics matrices of the innovation and measurement noise signals will also be used:
where we assumed that the noise is independent between microphones.
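As a hedged illustration of the state equation, the sketch below assumes the usual shift structure for a state vector that stacks the last L frames of x(t), newest first: the transition matrix moves every entry down by one and the innovation carries the new sample in its first entry. The helper names are hypothetical.

```python
import numpy as np

def shift_matrix(L):
    """L x L matrix with ones on the first sub-diagonal (assumed form of the transition matrix)."""
    return np.eye(L, k=-1)

def advance_state(x_prev, x_new):
    """Return Phi @ x_prev + u, where u = [x_new, 0, ..., 0]^T is the innovation vector."""
    L = len(x_prev)
    u = np.zeros(L, dtype=complex)
    u[0] = x_new
    return shift_matrix(L) @ np.asarray(x_prev, dtype=complex) + u
```

For example, advancing the state `[1, 2, 3]` with the new sample `9` drops the oldest frame and yields `[9, 1, 2]`.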
3 Algorithm derivation
The EM algorithm [36] is an iterative-batch procedure that processes the entire dataset in each iteration until convergence to a local maximum of the ML criterion. Hence, it cannot be applied as is to the task of AEC, specifically in time-varying scenarios. We therefore resort to a recursive version of the EM in our algorithm derivation.
3.1 The likelihood function
We start the algorithm derivation by defining the parameter sets and the relevant datasets. As we are interested in causal estimators, the available time-frame indexes for estimating the desired signal at frame t are confined to \(\mathcal {S}_{t}=[1,\ldots,t]\), where t=1 is arbitrarily chosen as the first available time-frame. The EM algorithm is a method for estimating a set of deterministic parameters that maximizes the likelihood criterion. Since the EM works with the notion of complete-data, it also provides an estimate of the desired signal(s) as a by-product of the estimation procedure.
Let \({\mathcal {Z}}_{t}\) be the set of measurements comprising all microphones and all time-frequency (TF) bins:
\(\mathcal {Y}_{t}\) the set of TF bins of the reference signal
and \(\mathcal {X}_{t}\) the unavailable set of TF bins of the desired speech signal
Both \({\mathcal {Z}}_{t}\) and \(\mathcal {Y}_{t}\) are available, where the set \({\mathcal {Z}}_{t}\) describes the available information in the microphone signals, and \(\mathcal {Y}_{t}\) the information in the far-end signal as transmitted by the local loudspeaker.
The parameter set of the statistical model presented in Section 2 comprises the following subsets:
for all \(j \in \mathcal {S}_{J}\); t≥1 and \(k \in \mathcal {S}_{K}\).
A note on the time-dependency of the parameters is in order. Two distinct time scales can be defined. While the speech power spectral density (PSD) is rapidly changing from frame to frame, the RIRs relating the desired speech and the echo signal to the microphones, as well as the noise variances, are slowly time-varying. The distinct scales of the time variations imply different types of estimation procedures. While estimating the speech PSD necessitates an external smoothing procedure that maintains the rapid time-variations, estimating the RIRs and the noise variances boils down to recursive aggregation of past statistics. Consequently, slowly time-varying estimated parameters are obtained. In the following sections, estimators for the set of parameters Θ will be presented in detail together with an online estimate of the desired speech signal.
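The recursive aggregation of past statistics mentioned above can be sketched as exponential smoothing with a constant forgetting factor. The function name and the normalized form (weights β and 1−β) are assumptions chosen for illustration; with this form, a sample that is n frames old is weighted by (1−β)β^n, so the effective memory is roughly 1/(1−β) frames.

```python
def recursive_average(prev, new, beta):
    """Exponentially weighted update of an aggregated statistic.

    prev : statistic aggregated up to the previous frame
    new  : instantaneous statistic of the current frame
    beta : forgetting factor in [0, 1); closer to 1 means slower tracking
    """
    return beta * prev + (1.0 - beta) * new

# Hypothetical usage: track a slowly varying noise variance from
# instantaneous residual powers.
noise_var = 0.0
for residual_power in [4.0] * 200:
    noise_var = recursive_average(noise_var, residual_power, beta=0.9)
```

A smaller β adapts faster to changes in the acoustic scene at the cost of noisier estimates.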
The EM formulation requires the log-likelihood of the complete-data. Under the assumed statistical model, it is given by:
where
and \(\overset{C}{=}\) stands for equality up to constants that are independent of Θ. Note that the second and the third lines of (12) are the log-likelihood of the clean speech signal and the log-likelihood of the additive noise, respectively. Both terms are expressed as a summation over the time-frame index \(\tau \in \mathcal {S}_{t}\), as a result of the independence between time-frames of the desired source and the noise signals in the STFT domain. The noise term also decomposes to a sum over the J microphones due to the assumed independence of the noise signals across microphones. The likelihood function in (12) is separately calculated for all \(k \in \mathcal {S}_{K}\) due to the independence between frequency bins.
3.2 Recursive EM algorithm
We adopt the online EM formulation presented in [31], in which the auxiliary function is recursively calculated, while the maximization step remains intact. This formulation facilitates online and time-varying estimation of all model parameters.
The auxiliary function at time-frame t is given by a weighted sum of the auxiliary function at the previous time-frame and the innovation of the current measurement:
where \(\widehat {\boldsymbol {\Theta }}(t)\) is the parameter set estimate after measuring the observation z_{t} and the far-end echo signal y_{t} at time-frame t, and γ_{t}∈[0,1) is a smoothing parameter that should decay in time for static scenarios. The maximization is computed over the aggregated auxiliary function (14)
Given the measurements and the echo signal, define the expected value of the instantaneous complete-data log-likelihood:
and substitute the time-varying smoothing parameter with a constant factor β=1−γ_{t}, thus introducing an exponential decay of the contribution of past samples to the calculation, and consequently facilitating recursive estimation of time-varying parameters. Using these definitions, the recursive auxiliary function (14) can be rewritten as
The complete-data likelihood is independent and identically distributed between time frames. Therefore, we can explicitly write (16) as:
Finally, the explicit recursive auxiliary function can be calculated by substituting (18) into (17):
where
and the first- and second-order statistics of the near-end speech signal given \({\mathcal {Z}}_{t}\) and \(\mathcal {Y}_{t}\) are:
3.2.1 E-step: Kalman filter
The calculation of the recursive auxiliary function (19) requires the first- and second-order statistics of the clean speech signal (21). These are acquired in the E-step of the recursive procedure by applying the Kalman filter. The Kalman filter, summarized in Algorithm 1, is the optimal causal estimator in the minimum mean square error (MMSE) sense.
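A minimal sketch of one predict/update cycle of a Kalman filter for this kind of state-space model is given below. It follows the textbook recursion and is not a transcription of Algorithm 1. The shift-type transition matrix, the innovation covariance with the speech variance in its top-left entry, the diagonal noise covariance and all argument names are assumptions.

```python
import numpy as np

def kalman_step(x_post, P_post, d, H, phi_x, phi_v):
    """One Kalman predict/update cycle for a single frequency bin.

    x_post, P_post : previous posterior state mean (L,) and covariance (L, L)
    d              : (J,) microphone frame after subtracting the estimated echo
    H              : (J, L) matrix whose rows are the near-end CTF coefficients
    phi_x          : speech variance of the current frame
    phi_v          : (J,) noise variances (noise assumed independent across microphones)
    """
    L = x_post.shape[0]
    Phi = np.eye(L, k=-1)                   # assumed shift-type transition matrix
    F = np.zeros((L, L)); F[0, 0] = phi_x   # innovation covariance (assumed form)
    # Prediction
    x_pred = Phi @ x_post
    P_pred = Phi @ P_post @ Phi.conj().T + F
    # Update
    S = H @ P_pred @ H.conj().T + np.diag(phi_v)
    K = P_pred @ H.conj().T @ np.linalg.inv(S)
    x_new = x_pred + K @ (d - H @ x_pred)
    P_new = (np.eye(L) - K @ H) @ P_pred
    return x_new, P_new
```

The first entry of the returned mean serves as the causal estimate of the current near-end frame, and the returned covariance supplies the second-order statistics needed by the M-step.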
3.2.2 M-step: Parameter estimation
In the M-step, we update the parameters by maximizing the auxiliary function w.r.t. Θ, yielding the subsequent estimate \(\widehat {\boldsymbol {\Theta }}(t+1)\),
resulting in the following update rules for the model parameters at the (t+1)th time-frame:
where we define the following aggregated second-order statistics
with
and similarly
Note that (23) is an RLS update rule for estimating both filters and (24) is a recursive estimation of the residual power.
Unlike the estimation procedure of the filters' coefficients, maximizing (22) w.r.t. the speech PSD cannot be applied. In Section 3.2.3, we explain the reasons for this phenomenon and propose an alternative algorithm for the recursive speech PSD estimation.
3.2.3 Recursive estimation of the speech variance
The speech variance ϕ_{x}(t) is a time-varying parameter, due to the non-stationarity of the speech signal, and hence smoothness over time cannot be assumed, in contrast to the CTF systems H and G and the noise variance \({\phi _{v_{j}}}\) that exhibit slower time-variations. In the proposed recursive algorithm, the available observed data refers to the time frames in the interval \(\mathcal {S}_{t}\), thus the derivative of (22) w.r.t. ϕ_{x}(t+1) is zero and does not impose any constraint.
Alternatively, we propose to obtain a speech PSD estimate \(\widehat {{\phi }}_{x}(t)\) that still maintains some smoothness of the PSD estimates. The spectral amplitude estimator presented in [39] is adapted for this estimation with the necessary changes to incorporate residual echo and reverberation. The optimal speech PSD estimator in the MMSE sense at the jth microphone is:
where A_{j}(t) is a gain function that attenuates the late reverberant component and the noise component. Consequently, \(A^{2}_{j}(t)\left |z_{j}(t)-\mathbf {g}_{j}^{\top } {\mathbf {y}_{t}}\right |^{2}\) represents the variance estimator of the early speech component, \(x^{e}_{j}(t)=h_{j,0}(t)x(t)\). The gain function is defined as
where
and \(\phi _{r_{j}}\) is the late reverberant spectral variance. Note that ζ_{prior,j} and ζ_{post,j} are the a priori and a posteriori signal-to-interference ratios (SIRs), respectively. The calculation of (30) is executed for every channel j. The early speech variance \(\phi _{x^{e}_{j}}(t)\) is unobserved and therefore the a priori SIR, ζ_{prior,j}(t), is estimated by the decision-directed estimator proposed by Ephraim and Malah in [40]:
where α_{sir} is a smoothing factor and ζ_{min} is the minimum SIR that ensures the positiveness of ζ_{post,j}(t)−1. Note that applying the gain function in (30) on ζ_{post,j}(t−1) as in (33) represents the a priori SIR resulting from the processing of the previous frame.
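The decision-directed idea can be sketched generically as follows. This is the classical Ephraim–Malah form with a lower bound on the result; the exact placement of the floor and the argument names are assumptions, and the gain of the previous frame is passed in as an input so that no particular gain function is implied.

```python
import numpy as np

def decision_directed_sir(gain_prev, post_prev, post_now, alpha_sir, sir_min):
    """Decision-directed a priori SIR estimate (generic sketch).

    gain_prev : gain applied in the previous frame
    post_prev : a posteriori SIR of the previous frame
    post_now  : a posteriori SIR of the current frame
    alpha_sir : smoothing factor in [0, 1)
    sir_min   : lower bound on the returned a priori SIR
    """
    # Previous-frame term: the gain applied to the previous a posteriori SIR.
    previous_term = gain_prev ** 2 * post_prev
    # Current-frame term: instantaneous estimate, kept non-negative.
    current_term = np.maximum(post_now - 1.0, 0.0)
    estimate = alpha_sir * previous_term + (1.0 - alpha_sir) * current_term
    return np.maximum(estimate, sir_min)
```

A value of `alpha_sir` close to one gives a smoother estimate, while a smaller value follows speech onsets more quickly.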
For the estimation of late reverberant spectral variance \(\phi _{r_{j}}\), the instantaneous power of the reverberation \(\widehat {\psi }_{r_{j}}(t)\) is calculated as in the RKEMD method [34]:
By the definition of \(\boldsymbol {\Phi }\), \(\hat {h}_{j,0}(t)\) is excluded from (34) and hence only the variance of the late reverberation is taken into account. Then, \(\widehat {\phi }_{r_{j}}(t)\) is estimated by time smoothing using a smoothing parameter α_{r}∈[0,1):
The speech PSD \(\widehat {{\phi }}_{x}(t)\) is finally determined by averaging over all J channels:
It is clear that the presented model in (3) may suffer from gain ambiguity in estimating both \(\widehat {{\phi }}_{x}(t)\) and \(\widehat {\mathbf {h}}_{j}(t)\), attributed to the following equality:
where ν(t,k) is an arbitrary time- and frequency-dependent gain. To circumvent this problem, we arbitrarily set \(\hat {h}_{j,0}(t,k)=1,~\forall j\) in (28).
3.3 Alternative M-step 1
Estimating the CTF systems in the M-step (23) boils down to an RLS-type update rule. An alternative and commonly used approach for adaptive filtering is the normalized least-mean-square (NLMS) procedure, which is known for its good tracking capabilities, simplicity, and low computational complexity. Conversely, the RLS algorithm is more stable and its convergence rate is faster, at the expense of high computational complexity. The trade-off between fast adaptation and computational complexity should be considered when choosing the appropriate adaptive filtering approach. We develop in the sequel an alternative M-step based on the NLMS procedure.
First, we apply the NLMS procedure for estimating the echo path for each microphone \({\mathbf {g}_{j}}, \forall ~j\in \mathcal {S}_{J}\) rather than using the estimate resulting from the M-step in (23). The NLMS update rule is given by:
where λ∈(0,2) is the step-size, δ_{NLMS}>0 is the regularization factor and e_{j}(t) is the instantaneous estimation error w.r.t. the jth microphone given by:
The update of the other acoustic parameters remains intact and is calculated as described in Section 3.2.2.
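The following sketch shows a standard complex-valued NLMS step for one microphone and one frequency bin, assuming the echo is modelled as the unconjugated inner product of the filter and the reference state vector. It is a generic NLMS form under that assumption; the function and argument names are hypothetical.

```python
import numpy as np

def nlms_update(g, y_state, z, step, delta):
    """One NLMS step for the echo-path CTF of a single microphone.

    g       : (L,) current echo-path estimate
    y_state : (L,) last L STFT frames of the far-end reference, newest first
    z       : microphone sample of the current frame
    step    : step-size in (0, 2)
    delta   : positive regularization factor
    """
    # Instantaneous error: microphone signal minus the estimated echo.
    e = z - g @ y_state
    # Normalize by the regularized reference energy.
    norm = np.vdot(y_state, y_state).real + delta
    g_new = g + step * e * np.conj(y_state) / norm
    return g_new, e
```

With this update, the a posteriori error on the same frame equals the a priori error scaled by 1 − step·‖y‖²/(‖y‖² + δ), which is why the step-size is restricted to (0, 2).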
Substituting the CTF estimate of the echo path \(\widehat {\mathbf {g}}_{j}\) (23) by \(\widehat {\mathbf {g}}_{j}^{\text {NLMS}}\) leads to a combined structure of NLMS and RKEMD, where the NLMS estimation error of each channel is the input for RKEMD. This new scheme is denoted by NLMSRKEMD1.
3.4 Alternative M-step 2
Although the RLS approach in the proposed algorithm is computationally less efficient than NLMS, the EM has the advantage of considering the near-end speaker in the echo cancellation model. We therefore introduce another alternative M-step, in which the echo path is estimated using NLMS while still utilizing the benefits offered by the EM formulation. Based on a gradient-descent minimization of the likelihood function, adopted from [41] and [25], we substitute the maximization of \(\widehat {\mathbf {g}}_{j}\) in (23) with:
Explicitly, carrying out the derivative in (40) (also implementing the normalization operation) yields an adaptation rule similar to (38), but with a different error term:
Now, the error signal (41) includes the subtraction of the estimated reverberant near-end signal. We denote this recursive EM variant as NLMSRKEMD2.
4 Double talk detector
The statistical model presented in Section 2 assumes a constant activity of the near-end and far-end signals. However, in real scenarios this is not always the case, rendering the statistical modelling inaccurate. To circumvent this intermittency problem, we propose to adopt a DTD to detect the presence of the near-end signal, and to stop the adaptation of the parameters of the CTF model during inactive periods.
We propose to use the normalized cross-correlation method presented in [42], based on the correlation level between the far-end signal and the echo signal, which drops when the near-end signal is active. After some derivation, the decision variable is obtained by:
where \(\hat {\mathbf {g}}_{1}\) is the CTF estimate at the first microphone. If ξ_{t}<η, then double-talk is detected. Note that ξ_{t} is calculated using the parameter estimates of the previous frame in order to freeze the adaptation in the current frame.
As noted in [43], a fixed value of η is not capable of addressing practical scenarios, and an adaptive threshold should be used instead:
and
where \(\tilde {\xi }_{t}\) is the minimum of ξ_{t} across the frequency bins in frame t and α_{d} is a smoothing factor. ψ_{t} is a small value that was set as \(0.002\sqrt {\Sigma _{t-1}}\).
The proposed EM algorithm for echo cancellation, dereverberation and noise reduction is summarized in Algorithm 2.
5 Performance evaluation
5.1 Setup
The proposed method was evaluated in two dynamic scenarios. The experiments were recorded at the Acoustic Signal Processing Lab, Bar-Ilan University. The room dimensions are 6×6×2.4 m (length × width × height). The reverberation time of the room was set to 650 ms by adjusting the room's panels.
The sampling rate was set to 16 kHz and the STFT analysis window was set to a 32 ms Hamming window, with 75% overlap between adjacent time-frames. Avargel et al. [16] define the CTF length L according to the time-domain filter length, the STFT analysis window length and the overlap. The length of the time-domain filter, the RIR in our problem, is determined by the room reverberation time. We set the RIR length to be 650 ms, similar to the reverberation time. Consequently, L was set to 35 frames. Note that setting L to an excessively high value may result in estimation errors as well as a high computational complexity. Setting L to a lower value than implied by [16] degrades the CTF approximation and can lead to partial dereverberation.
The desired clean speech estimator, \(\hat {x}(t)\), was further enhanced by applying a high-pass filter to remove frequencies lower than 200 Hz. Finally, the parameters depicted in Table 1 were fixed for all simulations and experiments.
5.2 Experiments using real speakers
To demonstrate the capabilities of our method in realistic cases, we carried out two types of experiments involving human speakers who read sentences aloud and a loudspeaker that plays music. We tested the performance in two scenarios. In Scenario #1, the loudspeaker and the microphone array are static and the subject is moving in the room along a predefined path. In Scenario #2, the loudspeaker and the subject are static and the microphone array is moved manually. Both scenarios are depicted in Fig. 1.
The subjects in the experiments were native English speakers. Two females and three males participated in Scenario #1, and two females and two males in Scenario #2. Several recordings of modern music, consisting of musical instruments and a singer, were played throughout the recording session. The SIR in Scenario #1 is set to 5 dB. The average of the measured SIR in Scenario #2 is 4.68 dB.
During the experiments, we tested 2 types of noise. The first is an air-conditioner (AC) noise. The second is a pseudo-diffused babble noise, played from 4 loudspeakers facing the room walls. In Scenario #1, the reverberated-signal-to-noise ratio (RSNR) is set to 15 dB. For Scenario #2 the RSNR is time-varying. The average RSNR is 6.62 dB for the AC noise and 9.5 dB for the babble noise.
5.3 Baseline methods
We compare the proposed algorithm to a cascade implementation of AEC and a dereverberation algorithm. For the echo cancellation, we applied J instances of a conventional NLMS algorithm to mitigate the echo path relating the far-end signal and each of the microphones. For each frame, the signals at the J outputs of the AECs are further processed by a multichannel spectral enhancement (MCSE) algorithm [44]. We denote this approach as NLMSMCSE. In addition, we present the results of the proposed algorithm with the alternative M-steps presented in Sections 3.3 and 3.4, NLMSRKEMD1 and NLMSRKEMD2, respectively. We also report the performance of a simple NLMS, without any dereverberation stage.
The DTD algorithm that was discussed in Section 4 was also utilized in the implementation of the NLMS-based methods. During double talk, the NLMS adaptation is suspended in NLMSMCSE and NLMSRKEMD1. This is in contrast to our method, which enables the adaptation of the CTF coefficients also during double talk. Adaptation is only suspended if the relevant signals are inactive.
For the NLMSMCSE method, \(\widehat {{\phi }}_{x}(t)\) was substituted by \(|\hat {e}_{j}(t)|^{2}\) in the detection function (42). In Scenario #2, the echo path is constantly changing during the double talk. Hence, suspending the adaptation during double talk significantly degrades the echo cancellation performance. Ignoring the DTD and allowing adaptation, despite the interfering effect of the near-end speaker on the NLMS convergence, is preferred in this case.
5.4 Speech quality and intelligibility
Two objective measures are used for evaluating the speech quality and intelligibility, namely the log-spectral distortion (LSD) and the short-time objective intelligibility (STOI) [45], respectively.
The LSD between x and \(\tilde {z}\in \{z_{1},\hat {x}\}\) is calculated for each time frame as:
where the minimum value is calculated by \(\varepsilon (C)=10^{-50/10}\max _{t,k}C(t,k)\), which limits the log-spectrum dynamic range of C to about −50 dB. The presented value of the LSD is the median value of LSD(t) over all time frames.
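The clipping rule and the median over frames can be sketched as follows. Since the LSD expression itself is not reproduced here, the sketch assumes one common form: the root-mean-square, over frequency bins, of the difference between the clipped log power spectra (in dB) of the reference and of the evaluated signal. The function names `clip_floor`, `lsd_per_frame` and `median_lsd` are hypothetical, and the inputs are assumed to be STFT frames given as lists of complex bins with at least one nonzero bin per signal.

```python
import math
from statistics import median


def clip_floor(power):
    """Floor epsilon(C) = 10^(-50/10) * max over all (t, k) of C(t, k)."""
    peak = max(max(frame) for frame in power)
    return 10.0 ** (-50.0 / 10.0) * peak


def lsd_per_frame(ref_stft, est_stft):
    """Assumed LSD form: RMS over bins of the clipped log-power difference (dB)."""
    ref_pow = [[abs(v) ** 2 for v in frame] for frame in ref_stft]
    est_pow = [[abs(v) ** 2 for v in frame] for frame in est_stft]
    eps_ref = clip_floor(ref_pow)
    eps_est = clip_floor(est_pow)
    values = []
    for ref_frame, est_frame in zip(ref_pow, est_pow):
        acc = 0.0
        for r, e in zip(ref_frame, est_frame):
            diff = 10.0 * math.log10(max(r, eps_ref)) - 10.0 * math.log10(max(e, eps_est))
            acc += diff * diff
        values.append(math.sqrt(acc / len(ref_frame)))
    return values


def median_lsd(ref_stft, est_stft):
    """Single reported number: median of LSD(t) over all time frames."""
    return median(lsd_per_frame(ref_stft, est_stft))


# Toy usage: the estimate is the reference scaled by 2 in amplitude.
ref = [[1 + 0j, 0.5j], [0.2 + 0j, 0.1 + 0j]]
est = [[2 * v for v in frame] for frame in ref]
lsd_value = median_lsd(ref, est)
```

Because each signal is clipped relative to its own peak, a pure gain difference shifts every bin by the same number of dB, so the measure in this form is sensitive to level mismatch as well as to spectral shape.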
In addition, the dereverberation capabilities of the examined algorithms were evaluated using the SRMR measure [46].
The LSD, SRMR and STOI results for Scenario #1 are presented in Fig. 2. These plots describe the statistics of the measures over 140 experiments, comprising 5 different speakers, 2 sentences (20–25 s each), 2 types of noise and 7 songs played as the far-end signal. The speech quality, intelligibility and dereverberation measures for Scenario #2 are described in Table 2. The table reports the median values of 16 experiments with 4 different speakers, 2 sentences (20–25 s each), 2 types of noise and 1 song played as the far-end signal (a different song for every speaker). Whisker plots are not informative enough for this small amount of data.
It is evident that the proposed method outperforms the competing algorithms in all measures for both scenarios. The estimated speech of the NLMS-based algorithms in Scenario #2 is severely distorted as compared with Scenario #1. Indeed, the NLMS-MCSE algorithm exhibits comparable performance to the proposed method in Scenario #1, but in the more challenging experiment, namely Scenario #2, the proposed method significantly outperforms all baseline methods, as evident from Table 2. The degradation in NLMS-RKEMD1 and NLMS-MCSE can be explained by the fact that in Scenario #2, the NLMS keeps updating the echo path during double talk. In contrast, in Scenario #1, the adaptation is suspended. Therefore, the performance gap between the proposed method and its competitors is more pronounced in Scenario #2.
In addition, we observed that the other methods are more sensitive than the proposed method to errors in the DTD. Misdetections and false alarms of the DTD lead to severe performance degradation in the NLMS-based methods and consequently result in a reduction in speech quality and intelligibility. This also explains the degradation in NLMS-RKEMD2. In contrast, our method converges faster even in the presence of these detection errors and performs better.
We also note that, as expected, the NLMS-RKEMD2 algorithm outperforms the NLMS-RKEMD1 algorithm. However, its performance is still inferior to that of the NLMS-MCSE algorithm. In terms of intelligibility, NLMS-RKEMD1 and NLMS-RKEMD2 even achieve lower STOI scores than the microphone signal. However, the speech quality in terms of dereverberation and signal distortion is still improved, as evident from the higher SRMR and lower LSD measures.
5.5 Echo cancellation performance
A common performance measure for evaluating echo cancellation is the echo return loss enhancement (ERLE), defined for each time frame as
The ERLE results per frame for Scenario #1 are presented in Fig. 3, depicting the advantage of the proposed method over the competing methods for most frames. Furthermore, we can observe that the ERLE performance is rather stable and insensitive to changes in the far-end signal and to the DTD accuracy.
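A per-frame ERLE can be computed along the following lines. This is a minimal sketch of the textbook definition (the ratio, in dB, between the echo power before cancellation and the residual echo power after it), assuming that the echo component and the residual echo of each STFT frame are available separately; it is not necessarily the exact expression used for Fig. 3. The name `erle_db` and the regularization constant `eps` are assumptions of this example.

```python
import math


def erle_db(echo_frame, residual_frame, eps=1e-12):
    """ERLE of one frame in dB: echo power over residual-echo power.

    Both arguments are lists of (possibly complex) STFT bins for one time frame.
    eps is a hypothetical guard against division by zero and log of zero.
    """
    echo_power = sum(abs(v) ** 2 for v in echo_frame)
    residual_power = sum(abs(v) ** 2 for v in residual_frame)
    return 10.0 * math.log10((echo_power + eps) / (residual_power + eps))


# Toy usage: the residual echo is the echo attenuated by a factor of 10 in amplitude.
echo = [1 + 0j, 2j]
residual_echo = [0.1 * v for v in echo]
frame_erle = erle_db(echo, residual_echo)
```

Higher values indicate stronger echo attenuation; an attenuation by a factor of 10 in amplitude corresponds to roughly 20 dB.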
Note that \(\mathbf {g}_{1}^{\top }(k) {\mathbf {y}_{t}}(k)\) is only available in Scenario #1. In Scenario #2, we cannot separately record the near-end signal and the echo signal and then mix them to generate a test scenario, because the manual movement of the microphone array cannot be exactly repeated. Therefore, for Scenario #2, we use the ratio between the signal power when both the speech and the reference signals are present and the signal power when only the reference signal is active. We refer to this ratio as the signal to echo ratio (SER) and define it for the input and the output signals:
where
The improvement from \(\text {SER}_{\text {input}}\) to \(\text {SER}_{\text {output}}\) indicates the attenuation in the echo power and is denoted by ΔSER. The length of both \(\mathcal {N}_{a}\) and \(\mathcal {N}_{a}\) is approximately 6 seconds. The median of the measured ΔSER for Scenario #2 is presented in Table 2, which also shows the advantage of the proposed method over the competing methods. Recall that the echo path adaptation in NLMS-MCSE and NLMS-RKEMD1 continues in this scenario even during double talk, while the statistical model that is used in these methods does not consider the near-end signal. The echo cancellation performance of NLMS-RKEMD2 is worse than that of our method due to the constantly time-varying echo path and the convergence of the reverberated speech component. Hence, the level of the residual echo is significant, and this is reflected in the ΔSER.
5.6 Spectrograms assessment
In addition to the quality measures presented in Sections 5.4 and 5.5, we provide the spectrograms of one example for Scenario #1 in Fig. 4 and for Scenario #2 in Fig. 5. The spectrograms of both scenarios demonstrate the enhancement capabilities and the robustness of the proposed method to double-talk scenarios. Sound examples of both scenarios can be found on the lab website (see footnote 2 in the Notes).
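The SER and its improvement can be sketched as below, following the verbal definition given above: the power of a segment in which both the near-end speech and the reference are active, divided by the power of a segment in which only the reference is active, evaluated once on the microphone (input) signal and once on the processed (output) signal. The function names and the toy segments are assumptions of this example, and the segments are taken as plain lists of time-domain samples.

```python
import math


def mean_power(samples):
    """Average power of a segment of time-domain samples."""
    return sum(s * s for s in samples) / len(samples)


def ser_db(double_talk_segment, echo_only_segment):
    """SER in dB: power with speech and reference active over power with reference only."""
    return 10.0 * math.log10(mean_power(double_talk_segment) / mean_power(echo_only_segment))


def delta_ser_db(in_double_talk, in_echo_only, out_double_talk, out_echo_only):
    """Improvement from the input SER to the output SER, in dB."""
    return ser_db(out_double_talk, out_echo_only) - ser_db(in_double_talk, in_echo_only)


# Toy usage with constant-amplitude segments (assumed values).
in_dt, in_echo = [2.0, 2.0], [1.0, 1.0]        # input SER  = 10*log10(4) dB
out_dt, out_echo = [1.0, 1.0], [0.1, 0.1]      # output SER = 20 dB
improvement = delta_ser_db(in_dt, in_echo, out_dt, out_echo)
```

Because the measure only needs two segments of the same recording, it does not require the near-end and echo signals to be recorded separately, which is what makes it usable with a manually moved array.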
6 Conclusions
A recursive EM algorithm, based on Kalman filtering, for AEC, dereverberation and noise reduction was presented. The proposed statistical model addresses the three problems simultaneously. The E-step and M-step are implemented for each STFT time frame. The E-step is implemented as a Kalman filter. The model parameters are estimated in the M-step. Given the estimate of the acoustic path of the far-end signal, the echo signal at each channel is evaluated. The estimated echo signal is subtracted from the microphone signal and the outcome is further processed by the Kalman filter. The desired speech variance is estimated by adopting a spectral estimation method. The estimated near-end signal is obtained as a byproduct of the E-step. A DTD is utilized in order to suspend the M-step adaptation when the near-end and far-end signals are not active and, consequently, to prevent adaptation errors.
The tracking ability of the algorithm was tested in an experimental study carried out in our lab in very challenging scenarios, including moving speakers and a moving microphone array. The algorithm demonstrates convergence capabilities even during double talk in time-varying scenarios. Our method is shown to outperform competing methods based on the NLMS algorithm in terms of intelligibility, speech quality, and echo cancellation performance.
Availability of data and materials
N/A
Notes
1. In their original contribution, Cappé and Moulines [31] assume independent and identically distributed measurements. This assumption does not hold in our measurement model. We therefore propose to use a slightly different model in which the expectation of the instantaneous complete data is also conditioned on past measurements, namely on \({\mathcal {Z}}_{t},\mathcal {Y}_{t}\) rather than only on \(z_{t},y_{t}\). While a proof of such a formulation is beyond the scope of this contribution, we note that similar formulations were successfully used in the context of speech processing [25, 34, 37]. To shed more light on the underlying mathematical foundations of stochastic approximation, the interested reader is also referred to a comprehensive review on the topic [38].
2. www.eng.biu.ac.il/gannot/speechenhancement/
Abbreviations
ADM: adaptive directional microphone
AIR: acoustic impulse response
AR: autoregressive
ASR: automatic speech recognition
ATF: acoustic transfer function
BIU: Bar-Ilan University
BSI: blind system identification
BSS: blind source separation
CASA: computational auditory scene analysis
CTF: convolutive transfer function
DLP: delayed linear prediction
DOA: direction of arrival
DRR: direct to reverberant ratio
DSB: delay and sum beamformer
ECM: expectation-conditional maximization
EM: expectation-maximization
EMK: EM-Kalman
E-Step: estimate step
FAU: Friedrich-Alexander University of Erlangen-Nuremberg
FIR: finite impulse response
FIM: Fisher information matrix
GCI: glottal closure instants
GEM: generalized EM
GSC: generalized sidelobe canceller
HOS: high-order statistics
HRTF: head related transfer function
IC: interaural coherence
ICA: independent component analysis
ILD: interaural level difference
ITD: interaural time difference
ITF: interaural transfer function
IS: Itakura-Saito
i.i.d: independent and identically distributed
KEM: Kalman expectation-maximization
KEMD: Kalman expectation-maximization for dereverberation
KEMDS: Kalman expectation-maximization for dereverberation and separation
RKEM: recursive Kalman expectation-maximization
RKEMD: recursive Kalman expectation-maximization for dereverberation
LPC: linear prediction coding
LSD: log-spectral distortion
LSP: line spectral pair
LS: least-squares
RLS: recursive least-squares
NLMS: normalized least-mean-square
LTI: linear time invariant
MA: moving average
MAP: maximum a-posteriori
MCH: multichannel
MFCC: mel-frequency cepstrum coefficients
MINT: multiple input-output inverse theorem
MKEMD: Multi-Speaker Kalman-EM for dereverberation
ML: maximum likelihood
MLE: maximum likelihood estimation
MMSE: minimum mean square error
MSE: mean square error
MVDR: minimum variance distortionless response
M-Step: maximize step
MWF: multichannel Wiener filter
MWFN: MWF with partial noise estimate
MTF: multiplicative transfer function
NMF: nonnegative matrix factorization
NPM: normalized projection misalignment
NSRR: normalized signal to reverberant ratio
OMLSA: optimally-modified log spectral amplitude
PESQ: perceptual evaluation of speech quality
PSD: power spectral density
p.d.f: probability distribution function
WGN: white Gaussian noise
REM: recursive EM
RIR: room impulse response
RSNR: reverberated-signal to noise ratio
RTF: relative transfer function
SDWMWF: speech distortion weighted multichannel Wiener filter
SE: spectral enhancement
SIR: signal to interference ratio
SNR: signal to noise ratio
SPP: speech presence probability
SRMR: speech to reverberation modulation energy ratio
SRR: signal to reverberant ratio
STFT: short-time Fourier transform
TF: time-frequency
UOL: University of Oldenburg
DUET: degenerate unmixing estimation technique
SDR: signal to distortion ratio
SAR: signal to artefacts ratio
AEC: acoustic echo cancellation
DTD: double-talk detector
ERLE: echo return loss enhancement
SER: signal to echo ratio
AC: air-conditioner
STOI: short-time objective intelligibility
MCSE: multichannel spectral enhancement
BLMS: block least-mean-square
References
[1] G. Schmidt, in 12th European Signal Processing Conference (EUSIPCO). Applications of acoustic echo control - an overview, (2004), pp. 9–16.
[2] E. Hänsler, G. Schmidt, Acoustic Echo and Noise Control: a Practical Approach, vol. 40 (John Wiley & Sons, New Jersey, 2005).
[3] E. Hänsler, G. Schmidt, Topics in Acoustic Echo and Noise Control: Selected Methods for the Cancellation of Acoustical Echoes, the Reduction of Background Noise, and Speech Processing (Springer, Berlin, 2006).
[4] A. Gilloire, M. Vetterli, Adaptive filtering in subbands with critical sampling: analysis, experiments, and application to acoustic echo cancellation. IEEE Trans. on Signal Process. 40(8), 1862–1875 (1992).
[5] J. Benesty, F. Amand, A. Gilloire, Y. Grenier, in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 5. Adaptive filtering algorithms for stereophonic acoustic echo cancellation, (1995), pp. 3099–3102.
[6] S. L. Gay, in The 32nd Asilomar Conference on Signals, Systems and Computers, 1. An efficient, fast converging adaptive filter for network echo cancellation, (1998), pp. 394–398.
[7] H. Deng, M. Doroslovacki, Proportionate adaptive algorithms for network echo cancellation. IEEE Trans. Signal Process. 54(5), 1794–1803 (2006).
[8] D. L. Duttweiler, Proportionate normalized least-mean-squares adaptation in echo cancelers. IEEE Trans. Speech Audio Process. 8(5), 508–518 (2000).
[9] G. Enzner, P. Vary, Frequency-domain adaptive Kalman filter for acoustic echo control in hands-free telephones. Signal Process. 86(6), 1140–1156 (2006).
[10] W. Kellermann, in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1. Strategies for combining acoustic echo cancellation and adaptive beamforming microphone arrays, (1997), pp. 219–222.
[11] W. Kellermann, in International Workshop on Acoustic Echo and Noise Control (IWAENC). Joint design of acoustic echo cancellation and adaptive beamforming for microphone arrays, (1997), pp. 81–84.
[12] G. Reuven, S. Gannot, I. Cohen, Joint noise reduction and acoustic echo cancellation using the transfer-function generalized sidelobe canceller. Speech Commun. 49(7), 623–635 (2007).
[13] J. J. Shynk, Frequency-domain and multirate adaptive filtering. IEEE Signal Process. Mag. 9(1), 14–37 (1992).
[14] A. Cohen, A. Barnov, S. Markovich-Golan, P. Kroon, in 2018 26th European Signal Processing Conference (EUSIPCO). Joint beamforming and echo cancellation combining QRD based multichannel AEC and MVDR for reducing noise and non-linear echo, (2018), pp. 6–10.
[15] M. Luis Valero, E. A. P. Habets, Low-complexity multi-microphone acoustic echo control in the short-time Fourier transform domain. IEEE/ACM Trans. Audio Speech Lang. Process. 27(3), 595–609 (2019).
[16] Y. Avargel, System identification in the short-time Fourier transform domain. PhD thesis, Technion - Israel Institute of Technology (2008). https://israelcohen.com/wpcontent/uploads/2018/05/YekutiaelAvargel_PhD_2008.pdf.
[17] E. A. P. Habets, S. Gannot, I. Cohen, P. Sommen, Joint dereverberation and residual echo suppression of speech signals in noisy environments. IEEE Trans. Audio Speech Lang. Process. 16(8), 1433–1451 (2008).
[18] R. Martin, P. Vary, Combined acoustic echo cancellation, dereverberation and noise reduction: a two microphone approach. Ann. Telecommun. 49, 429–438 (1994).
[19] M. Togami, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Variational Bayes state space model for acoustic echo reduction and dereverberation, (2015), pp. 101–105.
[20] R. Takeda, K. Nakadai, T. Takahashi, K. Komatani, T. Ogata, H. G. Okuno, in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. ICA-based efficient blind dereverberation and echo cancellation method for barge-in-able robot audition, (2009), pp. 3677–3680.
[21] K. Lebart, J. Boucher, P. Denbigh, A new method based on spectral subtraction for speech dereverberation. Acta Acustica Acustica. 87, 359–366 (2001).
[22] E. A. P. Habets, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4. Multi-channel speech dereverberation based on a statistical model of late reverberation, (2005), pp. 173–176.
[23] E. A. P. Habets, S. Gannot, I. Cohen, Late reverberant spectral variance estimation based on a statistical model. IEEE Signal Process. Lett. 16(9), 770–773 (2009).
[24] T. Yoshioka, T. Nakatani, M. Miyoshi, Integrated speech enhancement method using noise suppression and dereverberation. IEEE Trans. Audio Speech Lang. Process. 17(2), 231–246 (2009).
[25] S. Gannot, D. Burshtein, E. Weinstein, Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Trans. Speech Audio Process. 6(4), 373–385 (1998).
[26] B. Schwartz, S. Gannot, E. A. P. Habets, in European Signal Processing Conference (EUSIPCO). Multi-microphone speech dereverberation using expectation-maximization and Kalman smoothing (Marrakech, Morocco, 2013).
[27] D. Titterington, Recursive parameter estimation using incomplete data. J. R. Stat. Soc. Ser. B Methodol. 46(2), 257–267 (1984).
[28] P. J. Chung, J. F. Böhme, Recursive EM and SAGE-inspired algorithms with application to DOA estimation. IEEE Trans. Signal Process. 53(8), 2664–2677 (2005).
[29] E. Weinstein, A. Oppenheim, M. Feder, J. Buck, Iterative and sequential algorithms for multisensor signal enhancement. IEEE Trans. Signal Process. 42, 846–859 (1994).
[30] L. Frenkel, M. Feder, Recursive expectation-maximization (EM) algorithms for time-varying parameters with applications to multiple target tracking. IEEE Trans. Signal Process. 47(2), 306–320 (1999).
[31] O. Cappé, E. Moulines, On-line expectation-maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71(3), 593–613 (2009).
[32] B. Schwartz, S. Gannot, E. A. P. Habets, Y. Noam, Recursive maximum likelihood algorithm for dependent observations. IEEE Trans. Signal Process. 67(5), 1366–1381 (2019).
[33] D. Schmid, S. Malik, G. Enzner, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). An expectation-maximization algorithm for multichannel adaptive speech dereverberation in the frequency-domain, (2012), pp. 17–20.
[34] B. Schwartz, S. Gannot, E. A. P. Habets, Online speech dereverberation using Kalman filter and EM algorithm. IEEE/ACM Trans. Audio Speech Lang. Process. 23(2), 394–406 (2015).
[35] I. Cohen, Speech enhancement using super-Gaussian speech models and non-causal a priori SNR estimation. Speech Commun. 47, 336–350 (2005).
[36] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 39(1), 1–22 (1977).
[37] E. Weinstein, A. V. Oppenheim, M. Feder, J. R. Buck, Iterative and sequential algorithms for multisensor signal enhancement. IEEE Trans. Signal Process. 42(4), 846–859 (1994).
[38] A. Benveniste, M. Métivier, P. Priouret, Adaptive Algorithms and Stochastic Approximations, vol. 22 (Springer, 2012).
[39] P. J. Wolfe, S. J. Godsill, in The 11th IEEE Signal Processing Workshop on Statistical Signal Processing (SSP). Simple alternatives to the Ephraim and Malah suppression rule for speech enhancement, (2001), pp. 496–499.
[40] Y. Ephraim, D. Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoustics Speech Signal Process. 32(6), 1109–1121 (1984).
[41] E. Weinstein, A. V. Oppenheim, M. Feder, Signal Enhancement Using Single and Multi-sensor Measurements (Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 0213 USA, 1990).
[42] J. Benesty, D. R. Morgan, J. H. Cho, A new class of double-talk detectors based on cross-correlation. IEEE Trans. Speech Audio Process. 8(2), 168–172 (2000).
[43] X. Li, R. Horaud, L. Girin, S. Gannot, in International Workshop on Acoustic Signal Enhancement (IWAENC). Voice activity detection based on statistical likelihood ratio with adaptive thresholding, (2016).
[44] E. A. P. Habets, in Speech Dereverberation. Speech dereverberation using statistical reverberation models (Springer, London, 2010), pp. 57–93.
[45] C. Sørensen, J. B. Boldt, F. Gran, M. G. Christensen, in 24th European Signal Processing Conference (EUSIPCO). Semi-non-intrusive objective intelligibility measure using spatial filtering in hearing aids, (2016), pp. 1358–1362.
[46] T. H. Falk, C. Zheng, W. Chan, A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Trans. Audio Speech Lang. Process. 18(7), 1766–1774 (2010).
Acknowledgements
We would like to thank Mr. Pini Tandeitnik for his professional assistance during the acoustic room setup and the recordings.
Funding
This project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 871245.
Author information
Contributions
Model development: NC, BS, GH and SG. Experimental testing: NC, GH and BS. Writing paper: NC, BS, and SG. The authors read and approved the final manuscript.
Ethics declarations
Consent for publication
All authors agree to the publication in this journal.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cohen, N., Hazan, G., Schwartz, B. et al. An online algorithm for echo cancellation, dereverberation and noise reduction based on a KalmanEM Method. J AUDIO SPEECH MUSIC PROC. 2021, 33 (2021). https://doi.org/10.1186/s13636021002192
DOI: https://doi.org/10.1186/s13636021002192