An online algorithm for echo cancellation, dereverberation and noise reduction based on a Kalman-EM Method

Many modern smart devices are equipped with a microphone array and a loudspeaker (or are able to connect to one). Acoustic echo cancellation algorithms, specifically their multi-microphone variants, are essential components in such devices. On top of acoustic echoes, other commonly encountered interference sources in telecommunication systems are reverberation, which may deteriorate the desired speech quality in acoustic enclosures, specifically if the speaker is far from the array, and noise. Although sub-optimal, the common practice in such scenarios is to treat each problem separately. In the current contribution, we propose a unified statistical model to simultaneously tackle the three problems. Specifically, we propose a recursive EM (REM) algorithm for solving echo cancellation, dereverberation and noise reduction. The proposed approach is derived in the short-time Fourier transform (STFT) domain, with time-domain filtering approximated by the convolutive transfer function (CTF) model. In the E-step, a Kalman filter is applied to estimate the near-end speaker, based on the noisy and reverberant microphone signals and the echo reference signal. In the M-step, the model parameters, including the acoustic systems, are inferred. Experiments with human speakers were carried out to examine the performance in dynamic scenarios, including a walking speaker and a moving microphone array. The results demonstrate the efficiency of the echo canceller in adverse conditions, together with a significant reduction in reverberation and noise. Moreover, the tracking capabilities of the proposed algorithm were shown to outperform baseline methods.


The echo cancellation problem
Acoustic echo cancellation algorithms are an essential component in many telecommunication systems such as hands-free devices, conference room speakerphones and hearing aids [1][2][3]. Moreover, in modern devices, such as smart speakers that play loud music, it is mandatory to integrate an acoustic echo cancellation (AEC) algorithm to enable proper functionality of automatic speech recognition (ASR) systems, especially in the task of recognizing a hot-word. Echo control is also common in robot audition, where the estimated echo signals are subtracted from the microphone signals. In [9], an acoustic echo control method is derived, including an echo canceller and a postfilter. The algorithm is based on the Kalman filter and provides an optimal statistical adaptation framework for the filter coefficients in time-varying and noisy acoustic environments.

Literature review
Many modern devices are equipped with more than one microphone. The common and most straightforward solution for cancelling the echo signal in the presence of noise is to first independently apply an AEC between the loudspeaker and each of the microphones and then to apply a beamformer. Cascade schemes, implemented in the time-domain, for joint AEC and beamforming are presented in [10,11], with either AEC preceding the beamformer or vice versa. A frequency-domain implementation addressing the joint noise reduction and multi-microphone echo cancellation is proposed in [12]. The beamformer involves a generalized side-lobe canceller (GSC) structure and the AEC is implemented by applying the block least-mean-square (BLMS) procedure [13]. Another approach, combining a minimum variance distortionless response (MVDR) beamformer and a recursive least-squares (RLS)-based AEC is presented in [14].
A low-complexity multi-channel echo cancellation method is presented in [15]. The method relies on a relative transfer function (RTF) scheme for multi-microphone AEC to reduce the overall computational load. Furthermore, it incorporates residual echo reduction into the beamformer design. This method is formulated in the STFT domain using the CTF [16] approximation.
Most studies in the literature assume that the physical distance between the loudspeaker emitting the far-end signal and the microphones is small. This is a reasonable assumption, since in many devices the microphones and the loudspeaker are mounted on the same device. However, when the loudspeaker is an external device connected by a cable or wirelessly by Bluetooth, it can be located anywhere in the room. As a result, the received echo signal may include a significant amount of reflections. In such cases, the length of the echo path should take into account the multiple acoustic reflections, implying a long adaptive filter. When the adaptive filter cannot entirely represent the echo path, the AEC output may suffer from a significant residual echo.
A single-microphone approach for jointly suppressing reverberation of the near-end speaker, residual echo and background noise is presented in [17]. A spectral postfilter was developed to efficiently dereverberate the desired speech signal, together with the suppression of the late residual echo and the background noise.
In [18], a two-microphone approach was presented. This algorithm comprises an adaptive filter to eliminate non-coherent signal components such as ambient noise and the reverberation of the near-end speech, in addition to echo cancellation. Another multichannel algorithm that jointly addresses the three problems is presented in [19]. An iterative expectation-maximization (EM) technique is used for speech dereverberation, acoustic echo reduction, and noise reduction. The method defines two state-space models, one for the acoustic echo-path and the other for the reverberated near-end speaker. The reverberant speech source model is assumed to follow a noiseless auto-regressive model. Two parameter optimization stages based on the Kalman smoother are applied to each state-space model in the E-step. The joint echo cancellation and dereverberation problem is also discussed in [20] for robot audition, where an independent component analysis (ICA) scheme is adopted to provide a natural framework for both problems using a microphone array.
The statistics of the acoustic impulse response (AIR) are commonly used in dereverberation algorithms. A single-microphone method for the suppression of late room reverberation based on spectral subtraction is presented in [21]. This concept is extended to the multi-microphone case in [22]. The problem is formulated in the STFT domain while taking into account the contribution of the direct-path in [23].
Yoshioka et al. [24] developed an EM algorithm for dereverberation and noise reduction, where the room impulse response (RIR) is modelled as an autoregressive (AR) process in the STFT domain. An iterative and sequential Kalman expectation-maximization (KEM) scheme for single-microphone speech enhancement in the time-domain was introduced in [25]. This method was extended to a multi-microphone speech dereverberation method in [26], applied in the STFT domain, where the acoustic systems are approximated by the CTF model.
Many modern applications should address cases where the desired speaker, the microphone array and even the interference signal are moving, hence necessitating time-varying online parameter estimation. Unfortunately, the Wiener filter or the Kalman smoother cannot be straightforwardly applied in these cases, as they also utilize future samples. The statistical model of these algorithms should be adjusted to the dynamic scenario.
The REM, which is an efficient scheme for sequential parameter estimation, is particularly suitable for estimating the time-varying parameters typical of dynamic scenarios. Titterington [27] formulated an online EM scheme using a stochastic approximation version of the modified gradient recursion. A recursive algorithm is proposed in [28] considering the convergence properties of Titterington's algorithm; the estimates generated by the recursive EM algorithm converge with probability one to a stationary point of the likelihood function. Recursive algorithms based on the KEM were presented in [25,29], using gradient descent to solve the maximum likelihood (ML) optimization. In [30], recursive EM methods for time-varying parameters were introduced, with applications to multiple target tracking. Cappé and Moulines [31] proposed another online version of the EM algorithm, applicable to latent variable models of independent observations; a proof of convergence to a stationary point under certain additional conditions was established in that paper. For dependent observations, a recursive ML method was presented in [32], supported by a convergence proof. This method refers to state-space models in which the state process and the observations depend on a finite set of previous observations. The acoustic paths can also be treated as stochastic processes under the Bayesian framework. An online EM-based dereverberation algorithm is presented in [33], where the acoustic paths were represented as random variables following a first-order Markov chain and estimated in the E-step using the Kalman filter. The speech components were modelled as time-varying parameters and estimated in the M-step.
An online algorithm for dereverberation based on a recursive Kalman expectation-maximization (RKEM) approach is presented in [34], where the acoustic parameters and the clean signal are jointly estimated. We refer to this algorithm as recursive Kalman expectation-maximization for dereverberation (RKEMD). This framework is extended in the current contribution to jointly address the echo cancellation, dereverberation and noise reduction problems.

Main contributions and outline
While most studies treat the problems of echo cancellation, dereverberation, and noise reduction separately, only a few propose a combined solution. In this paper, we present an online algorithm that addresses the three problems with a unified statistical model using a microphone array. The microphone signal is degraded by an echo signal and an additive noise in highly reverberant environments. The proposed method is applied in the STFT domain using the RKEMD framework and simultaneously addresses all interfering sources. The acoustic systems of the near-end and far-end signals are approximated by the CTF model, and the statistical model is represented in a state-space formulation. Using a double-talk detector (DTD), our method suspends the adaptation of the acoustic system coefficients when their relevant signals are inactive, but still enables adaptation during double-talk. It is also capable of tracking time-variations of the acoustic systems. Hence, a feasible solution is provided in realistic dynamic scenarios where the near-end speaker is moving, and even where the microphone array itself is moving.
The structure of the manuscript is as follows. In Section 2, the statistical model of the problem is presented. The recursive EM scheme is derived in Section 3. The desired near-end signal is estimated as a byproduct of the E-step of this scheme. In the recursive version, the E-step boils down to a Kalman filter that is applied to the observed signal with the estimated echo signal subtracted. In the M-step, the CTF coefficients and the noise parameters are recursively estimated. It is further shown that the instantaneous speech variance cannot be estimated using the REM procedure, and an external estimator is derived instead. Section 4 describes the DTD that facilitates a proper implementation of the echo cancellation stage. An experimental study for different realistic scenarios, including the challenging case of a moving microphone array, was carried out at the Bar-Ilan acoustic lab and is detailed in Section 5. Conclusions are drawn in Section 6.

Statistical Model
Let x[n] be the clean near-end signal and y[n] be the far-end signal in the time-domain. The signals propagate in an acoustic enclosure before being picked up by a J-microphone array. The near-end and far-end signals are represented in the STFT domain by x(t, k) and y(t, k), respectively, where t ≥ 1 is the time-frame index and k ∈ S_K = [0, . . . , K − 1] is the frequency-bin index. We assume that the clean speech can be modelled as a complex-Gaussian variable, independent across STFT time-frames and frequencies (see [35]), with zero-mean and variance φ_x(t, k), where N_C denotes a proper complex-Gaussian distribution.
In order to reduce the computational complexity and to facilitate the model analysis, we consider the CTF approximation [16] for the STFT representation of the time-domain RIR. The time-domain model in (1) can then be approximated in the STFT domain, where the CTF systems are defined in (4) and the state-vectors of the desired speech signal and the acoustic reference signal are defined in (5). L is the length of the CTF systems and depends on the reverberation time.
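To make the CTF mechanism concrete, the following sketch applies a set of CTF taps to an STFT-domain signal by convolving along the time-frame axis in each frequency bin. This is a minimal illustration with made-up array names, not the paper's implementation:

```python
import numpy as np

def ctf_filter(x_stft, h_ctf):
    """Apply a CTF system to an STFT-domain signal.

    x_stft : (T, K) complex array -- T time-frames, K frequency bins
    h_ctf  : (L, K) complex array -- L CTF taps per frequency bin
    Returns z with z[t, k] = sum_l h_ctf[l, k] * x_stft[t - l, k],
    i.e. a convolution along the frame axis, separately in each bin.
    """
    T, K = x_stft.shape
    z = np.zeros((T, K), dtype=complex)
    for k in range(K):
        # full convolution, truncated back to T frames (causal part)
        z[:, k] = np.convolve(x_stft[:, k], h_ctf[:, k])[:T]
    return z
```

In contrast to the multiplicative transfer function approximation, each output frame depends on the L previous input frames, which is what allows the model to capture reverberation tails longer than one STFT window.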
The noise signal v_j(t, k) is assumed to be a stationary, spatially uncorrelated complex-Gaussian random process. For conciseness, the frequency index k will be omitted when no ambiguity arises.
The signal model can be represented in a state-space form, where x_t and y_t were defined in (5) and d_t is defined as the observed signal after the subtraction of the echo-signal contribution. The model comprises a state-transition matrix, an innovation process, a measurement vector, observation matrices constructed from the CTF systems h_j and g_j, as defined in (4), and a noise vector. In the algorithm derivation, the second-order statistics matrices of the innovation and measurement noise signals will also be used, where we assume that the noise is independent between microphones.

Algorithm derivation
The EM algorithm [36] is an iterative-batch procedure that processes the entire dataset in each iteration until convergence to a local maximum of the ML criterion. Hence, it cannot be applied as is to the task of AEC, specifically in time-varying scenarios. We therefore resort to a recursive version of the EM in our algorithm derivation.

The likelihood function
We start the algorithm derivation by defining the parameter sets and the relevant datasets. As we are interested in causal estimators, the available time-frame indexes for estimating the desired signal at frame t are confined to S_t = [1, . . . , t], where t = 1 is arbitrarily chosen as the first available time-frame. The EM algorithm is a method for estimating a set of deterministic parameters that maximizes the likelihood criterion. Since the EM works with the notion of complete-data, it also provides an estimate of the desired signal(s) as a by-product of the estimation procedure. Let Z_t be the set of measurements comprising all microphones and all time-frequency (TF) bins, Y_t the set of TF bins of the reference signal, and X_t the unavailable set of TF bins of the desired speech signal. Both Z_t and Y_t are available, where the set Z_t describes the available information in the microphone signals, and Y_t the information in the far-end signal as transmitted by the local loudspeaker. The parameter set of the statistical model presented in Section 2 comprises the acoustic systems and the noise variances for all j ∈ S_J, t ≥ 1 and k ∈ S_K. A note on the time-dependency of the parameters is in order. Two distinct time scales can be defined. While the speech power spectral density (PSD) is rapidly changing from frame to frame, the RIRs relating the desired speech and the echo signal to the microphones, as well as the noise variances, are slowly time-varying. The distinct scales of the time variations imply different types of estimation procedures. While estimating the speech PSD necessitates an external smoothing procedure that maintains the rapid time-variations, estimating the RIRs and the noise variances boils down to recursive aggregation of past statistics. Consequently, slowly time-varying estimated parameters are obtained. In the following sections, estimators for the set of parameters will be presented in detail, together with an online estimate of the desired speech signal.
The EM formulation requires the log-likelihood of the complete-data. Under the assumed statistical model, it is given by (12), where the symbol C over the equality sign stands for equality up to constants that are independent of the parameters. Note that the second and third lines of (12) are the log-likelihood of the clean speech signal and the log-likelihood of the additive noise, respectively. Both terms are expressed as a summation over the time-frame index τ ∈ S_t, as a result of the independence between time-frames of the desired source and the noise signals in the STFT domain. The noise term also decomposes into a sum over the J microphones due to the assumed independence of the noise signals across microphones. The likelihood function in (12) is separately calculated for all k ∈ S_K due to the independence between frequency bins.

Recursive EM algorithm
We adopt the online EM formulation presented in [31], in which the auxiliary function is recursively calculated, while the maximization step remains intact. This formulation facilitates online and time-varying estimation of all model parameters.
The auxiliary function at time-frame t, given in (14), is a weighted sum of the auxiliary function at the previous time-frames and the innovation of the current measurement, where θ̂(t) is the parameter-set estimate after observing the measurement z_t and the far-end echo signal y_t at time-frame t, and γ_t ∈ [0, 1) is a smoothing parameter that should decay in time for static scenarios. The maximization, given in (15), is computed over the aggregated auxiliary function (14). Given the measurements and the echo signal, we define the expected value of the instantaneous complete-data log-likelihood¹ in (16), and substitute the time-varying smoothing parameter with a constant factor β = 1 − γ_t, thus introducing an exponential decay of the contribution of past samples to the calculation, and consequently facilitating recursive estimation of time-varying parameters. Using these definitions, the recursive auxiliary function (14) can be rewritten as (17). The complete-data likelihood is independent and identically distributed between time frames; therefore, we can explicitly write (16) as (18). Finally, the explicit recursive auxiliary function (19) is calculated by substituting (18) into (17); the first- and second-order statistics of the near-end speech signal given Z_t and Y_t are defined in (21).
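The exponential weighting that replaces γ_t can be illustrated with a one-line recursion. The weighting convention below (past samples decaying geometrically in β) is our reading of β = 1 − γ_t, not a formula quoted from the paper:

```python
def recursive_aux(Q_prev, q_inst, beta):
    """One step of an exponentially weighted recursion in the spirit of (17):
    Q_t = beta * Q_{t-1} + (1 - beta) * q_t,
    so a sample l frames in the past is weighted by (1 - beta) * beta**l.
    Q_prev : aggregated auxiliary value from the previous frame
    q_inst : instantaneous complete-data log-likelihood term of this frame
    """
    return beta * Q_prev + (1.0 - beta) * q_inst
```

Unrolling the recursion makes the forgetting explicit: frames far in the past contribute exponentially less, which is what enables tracking of time-varying parameters.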

E-Step: Kalman filter
The calculation of the recursive auxiliary function (19) requires the first- and second-order statistics of the clean speech signal (21). These are acquired in the E-step of the recursive procedure by applying the Kalman filter. The Kalman filter, summarized in Algorithm 1, is the optimal causal estimator in the minimum mean square error (MMSE) sense.
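As an illustration of the E-step, a generic per-frequency-bin Kalman predict/update recursion is sketched below. Phi, C, Q and R stand in for the model-specific state-transition, observation and covariance matrices of Section 2, and the function name is ours:

```python
import numpy as np

def kalman_step(mu, P, d, Phi, C, Q, R):
    """One causal Kalman recursion (predict + update) for a state-space
    model  x_t = Phi x_{t-1} + w_t,  d_t = C x_t + v_t.

    mu, P : posterior mean/covariance from the previous frame
    d     : current measurement (echo-subtracted observation vector)
    Returns the updated posterior mean and covariance, which supply the
    first- and second-order statistics required by the M-step.
    """
    # predict
    mu_pred = Phi @ mu
    P_pred = Phi @ P @ Phi.conj().T + Q
    # update
    S = C @ P_pred @ C.conj().T + R                 # innovation covariance
    K_gain = P_pred @ C.conj().T @ np.linalg.inv(S)  # Kalman gain
    mu_new = mu_pred + K_gain @ (d - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K_gain @ C) @ P_pred
    return mu_new, P_new
```

Since the filter is causal, only past and present frames are used, in line with the online requirement discussed in the literature review.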

M-Step: Parameter estimation
In the M-step, we update the parameters by maximizing the auxiliary function w.r.t. the parameter set, yielding the subsequent estimate at the (t + 1)-th time-frame. The resulting update rules, given in (23) and (24), rely on aggregated second-order statistics that are accumulated recursively. Note that (23) is an RLS-type update rule for estimating both filters, and (24) is a recursive estimate of the residual power.
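The RLS-type character of the filter update can be illustrated with a generic exponentially weighted RLS step. This is the textbook recursion, not the paper's exact update in (23), and all symbol names are illustrative:

```python
import numpy as np

def rls_update(w, P, u, d, beta):
    """Exponentially weighted RLS step for the model d ~ w^H u.

    w    : current filter estimate (length-L complex vector)
    P    : inverse correlation-matrix estimate
    beta : forgetting weight in (0, 1); larger beta = faster tracking
    """
    lam = 1.0 - beta                                  # classical forgetting factor
    k = (P @ u) / (lam + u.conj() @ P @ u)            # gain vector
    e = d - w.conj() @ u                              # a priori error
    w_new = w + k * np.conj(e)
    P_new = (P - np.outer(k, u.conj() @ P)) / lam
    return w_new, P_new
```

The aggregated second-order statistics in the paper play the role of P and of the weighted cross-correlations implicit in the gain computation.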
Unlike the filters' coefficients, the speech PSD cannot be estimated by maximizing (22). In Section 3.2.3, we explain the reasons for this phenomenon and propose an alternative algorithm for recursive speech PSD estimation.

Recursive estimation of the speech variance
The speech variance φ_x(t) is a time-varying parameter, due to the non-stationarity of the speech signal, and hence smoothness over time cannot be assumed, in contrast to the CTF systems H and G and the noise variance φ_{v_j}, which exhibit slower time-variations. In the proposed recursive algorithm, the available observed data refers to the time frames in the interval S_t; thus, the derivative of (22) w.r.t. φ_x(t + 1) is zero and does not impose any constraint. Alternatively, we propose a speech PSD estimator of φ_x(t) that still maintains some smoothness of the PSD estimates. The spectral amplitude estimator presented in [39] is adapted for this estimation, with the necessary changes to incorporate residual echo and reverberation. The MMSE-optimal speech PSD estimator at the jth microphone is given in (28), where A_j(t) is a gain function that attenuates the late reverberant component and the noise component. Consequently, A_j²(t)|z_j(t) − g_j y_t|² represents the variance estimator of the early speech component, x_{e_j}.

The gain function is defined in (30), where φ_{r_j} is the late reverberant spectral variance, and ζ_prior,j and ζ_post,j are the a priori and a posteriori signal-to-interference ratios (SIR), respectively. The calculation of (30) is executed for every channel j. The early speech variance φ_{x_{e_j}}(t) is unobserved, and therefore the a priori SIR, ζ_prior,j(t), is estimated by the decision-directed estimator proposed by Ephraim and Malah in [40], given in (33), where α_sir is a smoothing factor and ζ_min is the minimum SIR that ensures the positiveness of ζ_post,j(t) − 1. Note that applying the gain function (30) to ζ_post,j(t − 1), as in (33), represents the a priori SIR resulting from the previous frame's processing.
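A minimal sketch of the decision-directed recursion described above, assuming the previous-frame term is the squared gain applied to the previous a posteriori SIR, as stated in the text (function and argument names are ours):

```python
def decision_directed_sir(gain_prev, zeta_post_prev, zeta_post,
                          alpha_sir, zeta_min=1e-3):
    """Decision-directed a priori SIR estimate in the spirit of (33).

    gain_prev      : gain A_j applied in the previous frame
    zeta_post_prev : previous-frame a posteriori SIR
    zeta_post      : current-frame a posteriori SIR
    The instantaneous term is floored at zeta_min so that the
    contribution of zeta_post - 1 stays positive, as described above.
    """
    sir_prev = gain_prev ** 2 * zeta_post_prev   # previous-frame estimate
    sir_ml = max(zeta_post - 1.0, zeta_min)      # instantaneous ML estimate
    return alpha_sir * sir_prev + (1.0 - alpha_sir) * sir_ml
```

A large α_sir yields smooth SIR trajectories (less musical noise) at the cost of slower reaction to speech onsets, the usual decision-directed tradeoff.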
For the estimation of the late reverberant spectral variance φ_{r_j}, the instantaneous power of the reverberation, ψ_{r_j}(t), is calculated as in the RKEMD method [34], in (34). By definition, ĥ_{j,0}(t) is excluded from (34), and hence only the variance of the late reverberation is taken into account. Then, φ_{r_j}(t) is estimated by time smoothing using a smoothing parameter α_r ∈ [0, 1), as in (35). The speech PSD φ_x(t) is finally determined by averaging over all J channels. It is clear that the model in (3) may suffer from a gain ambiguity in estimating both φ_x(t) and h_j(t), attributed to the following equality, where ν(t, k) is an arbitrary time- and frequency-dependent gain. To circumvent this problem, we arbitrarily set |ĥ_{j,0}(t, k)| = 1, ∀j in (28).

Alternative M-step 1
Estimating the CTF systems in the M-step (23) boils down to an RLS-type update rule. An alternative and commonly used approach for adaptive filtering is the normalized least-mean-square (NLMS) procedure, which is known for its good tracking capabilities, simplicity, and low computational complexity. Conversely, the RLS algorithm is more stable and its convergence rate is faster, at the expense of a higher computational complexity. The tradeoff between fast adaptation and computational complexity should be considered when choosing the appropriate adaptive filtering approach. In the sequel, we develop an alternative M-step based on the NLMS procedure. First, we apply the NLMS procedure for estimating the echo path for each microphone g_j, ∀j ∈ S_J, rather than using the estimate resulting from the M-step in (23). The NLMS update rule is given in (38), where λ ∈ (0, 2) is the step-size, δ_NLMS > 0 is the regularization factor and e_j(t) is the instantaneous estimation error w.r.t. the jth microphone, given in (39). The update of the other acoustic parameters remains intact and is calculated as described in Section 3.2.2. Substituting the CTF estimate of the echo path g_j in (23) by g_j^NLMS leads to a combined structure of NLMS and RKEMD, where the NLMS estimation error of each channel is the input for the RKEMD. This new scheme is denoted NLMS-RKEMD-1.
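A single NLMS step for one microphone's echo path follows the standard update g ← g + λ e* y / (‖y‖² + δ). The code mirrors the roles of λ, δ_NLMS and e_j(t) in (38)-(39), but is an illustrative sketch rather than the paper's implementation:

```python
import numpy as np

def nlms_step(g, y_vec, z, lam=0.5, delta=1e-6):
    """One NLMS update of a CTF echo-path estimate for one microphone.

    g     : current echo-path estimate (length-L complex vector)
    y_vec : far-end reference state vector of the current frame
    z     : current microphone observation (complex scalar)
    Implements e = z - g^H y and g <- g + lam * e* y / (||y||^2 + delta).
    """
    e = z - g.conj() @ y_vec                       # instantaneous error, cf. (39)
    norm = (y_vec.conj() @ y_vec).real + delta     # regularized input energy
    g_new = g + lam * np.conj(e) * y_vec / norm
    return g_new, e
```

The normalization by the input energy is what keeps the effective step-size invariant to the far-end signal level, which matters when music with a large dynamic range is played.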

Alternative M-step 2
Although the RLS approach in the proposed algorithm is less efficient in terms of computational complexity than the NLMS, the EM has the advantage of considering the near-end speaker in the echo cancellation model. We therefore introduce another alternative M-step, in which the echo path is estimated using the NLMS while still utilizing the benefits offered by the EM formulation. Based on a gradient-descent minimization of the likelihood function, adopted from [41] and [25], we substitute the maximization of g_j in (23) with (40). Explicitly, carrying out the derivative in (40) (and also implementing the normalization operation) yields an adaptation rule similar to (38), but with a different error term, given in (41). The error signal (41) includes the subtraction of the estimated reverberant near-end signal. We denote this recursive EM variant as NLMS-RKEMD-2.

Double talk detector
The statistical model presented in Section 2 assumes constant activity of the near-end and far-end signals. However, in real scenarios this is not always the case, rendering the statistical modelling inaccurate. To circumvent this intermittency problem, we propose to adopt a DTD to detect the presence of the near-end signal, and to stop the adaptation of the parameters of the CTF model during inactive periods. We use the normalized cross-correlation method presented in [42], based on the correlation level between the far-end signal and the echo signal, which drops when the near-end signal is active. After some derivation, the decision variable ξ_t is obtained as in (42), where ĝ_1 is the CTF estimate at the first microphone. If ξ_t < η, then double-talk is detected. Note that ξ_t is calculated using the parameter estimates from the previous frame in order to freeze the adaptation in the current frame. As noted in [43], a fixed value of η is not capable of addressing practical scenarios, and an adaptive threshold should be used instead, where ξ̄_t is the minimum of ξ_t across the frequency bins in frame t, α_d is a smoothing factor, and ψ_t is a small value that was set to 0.002/√(t − 1). The proposed EM algorithm for echo cancellation, dereverberation and noise reduction is summarized in Algorithm 2.
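The adaptive-threshold idea can be sketched as follows. Since the threshold recursion itself is not reproduced above, the way the smoothed minimum statistic and the vanishing margin ψ_t = 0.002/√(t − 1) are combined here is an assumption, not the paper's exact rule:

```python
import math

def update_dtd_threshold(eta_prev, xi_min, t, alpha_d=0.9):
    """Adaptive DTD threshold sketch: smooth the per-frame minimum of the
    decision statistic and subtract a margin psi_t = 0.002 / sqrt(t - 1)
    that vanishes as the estimates converge. The combination below is an
    assumed form, offered only as an illustration.
    """
    psi = 0.002 / math.sqrt(t - 1) if t > 1 else 0.002
    return alpha_d * eta_prev + (1.0 - alpha_d) * (xi_min - psi)

def double_talk(xi, eta):
    """Declare double-talk when the correlation statistic drops below eta."""
    return xi < eta
```

Letting the margin decay with t reflects that early decision statistics are unreliable while the echo-path estimate is still converging.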

Setup
The proposed method was evaluated in two dynamic scenarios. The experiments were recorded at the Acoustic Signal Processing Lab, Bar-Ilan University. The room dimensions are 6 × 6 × 2.4 m (length × width × height). The reverberation time of the room was set to 650 ms by adjusting the room's panels. The sampling rate was set to 16 kHz, and the STFT analysis window was set to a 32 ms Hamming window with 75% overlap between adjacent time-frames. Avargel et al. [16] define the CTF length L according to the time-domain filter length, the STFT analysis window length, and the overlap. The length of the time-domain filter, the RIR in our problem, is determined by the room reverberation time. We set the RIR length to 650 ms, similar to the reverberation time. Consequently, L was set to 35 frames. Note that setting L to an excessively high value may result in estimation errors as well as high computational complexity. Setting L to a lower value than implied by [16] degrades the CTF approximation and can lead to partial dereverberation.
The desired clean speech estimator,x(t), was further enhanced by applying a high pass filter to remove frequencies lower than 200 Hz. Finally, the parameters depicted in Table 1 were fixed for all simulations and experiments.

Experiments using real speakers
To demonstrate the capabilities of our method in realistic cases, we carried out two types of experiments involving human speakers that read sentences out loud and a loudspeaker that plays music. We tested the performance in two scenarios. In Scenario #1, the loudspeaker and the microphone array are static and the subject moves in the room along a predefined path. In Scenario #2, the loudspeaker and the subject are static and the microphone array is manually moved. Both scenarios are depicted in Fig. 1.
The subjects in the experiments were native English speakers. Two females and three males participated in Scenario #1, and two females and two males in Scenario #2. Several recordings of modern music, consisting of musical instruments and a singer, were played throughout the recording session. The SIR in Scenario #1 is fixed. During the experiments, we tested two types of noise. The first is an air-conditioner (AC) noise. The second is a pseudo-diffuse babble noise, played from four loudspeakers facing the room walls. In Scenario #1, the reverberated-signal to noise ratio (RSNR) is set to 15 dB. For Scenario #2, the RSNR is time-varying; the average RSNR is 6.62 dB for the AC noise and 9.5 dB for the babble noise.

Baseline methods
We compare the proposed algorithm to a cascade implementation of an AEC and a dereverberation algorithm. For the echo cancellation, we applied J instances of a conventional NLMS algorithm to mitigate the echo path relating the far-end signal and each of the microphones. For each frame, the signals at the J outputs of the AECs are further processed by a multichannel spectral enhancement (MCSE) algorithm [44]. We denote this approach as NLMS-MCSE. In addition, we present the results of the proposed algorithm with the alternative M-steps presented in Sections 3.3 and 3.4, NLMS-RKEMD-1 and NLMS-RKEMD-2, respectively. We also report the performance of a simple NLMS, without any dereverberation stage.
The DTD algorithm discussed in Section 4 was also utilized in the implementation of the NLMS-based methods. During double talk, the NLMS adaptation is suspended in NLMS-MCSE and NLMS-RKEMD-1. This is in contrast to our method, which enables the adaptation of the CTF coefficients also during double talk; adaptation is only suspended when the relevant signals are inactive.
For the NLMS-MCSE method, φ_x(t) was substituted by |ê_j(t)|² in the detection function (42). In Scenario #2, the echo path is constantly changing during double talk. Hence, suspending the adaptation during double talk significantly degrades the echo cancellation performance. Ignoring the DTD and allowing adaptation, despite the interfering effect of the near-end speaker on the NLMS convergence, is preferred in this case.

Speech quality and intelligibility
Two objective measures are used for evaluating the speech quality and intelligibility, namely the log-spectral distortion (LSD) and the short-time objective intelligibility (STOI) [45], respectively.
The LSD between x and ẑ ∈ {z_1, x̂} is calculated for each time frame, where the minimum value ε(C) = 10^(−50/10) max_{t,k} |C(t, k)| limits the log-spectrum dynamic range of C to about −50 dB. The reported LSD value is the median of LSD(t) over all time-frames. In addition, the dereverberation capabilities of the examined algorithms were evaluated using the speech-to-reverberation modulation energy ratio (SRMR) measure [46].
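A possible implementation of the clipped log-spectral distortion described above, using the power spectrum and a 50 dB floor (the exact magnitude/power convention in the paper may differ; function names are ours):

```python
import numpy as np

def lsd(X, Z):
    """Median log-spectral distortion between a reference STFT X and an
    estimate Z (both T x K complex arrays). The power spectrum is floored
    50 dB below its maximum before taking the log, mirroring the clipping
    described in the text.
    """
    def logspec(C):
        p = np.abs(C) ** 2
        eps = 10.0 ** (-50.0 / 10.0) * p.max()   # floor: 50 dB below max
        return 10.0 * np.log10(np.maximum(p, eps))
    # RMS over frequency bins gives LSD(t); the reported value is its median
    per_frame = np.sqrt(np.mean((logspec(X) - logspec(Z)) ** 2, axis=1))
    return np.median(per_frame)
```

The floor prevents near-zero STFT bins from dominating the distance, and the median over frames reduces sensitivity to silent segments.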
The LSD, SRMR and STOI results for Scenario #1 are presented in Fig. 2. These plots describe the statistics of the measures over 140 experiments, comprising 5 different speakers, 2 sentences (20 − 25 s each), 2 types of noise and 7 songs played as the far-end signal. The speech quality, intelligibility and dereverberation measures for Scenario #2 are reported in Table 2. It is evident that the proposed method outperforms the competing algorithms in all measures for both scenarios. The estimated speech of the NLMS-based algorithms in Scenario #2 is severely distorted as compared with Scenario #1. Indeed, the NLMS-MCSE algorithm exhibits comparable performance to the proposed method in Scenario #1, but in the more challenging experiment, namely Scenario #2, the proposed method significantly outperforms all baseline methods, as evident from Table 2. The degradation of NLMS-RKEMD-1 and NLMS-MCSE can be explained by the fact that in Scenario #2, the NLMS keeps updating the echo path during double talk, whereas in Scenario #1 the adaptation is suspended. Therefore, the performance gap between the proposed method and its competitors is more pronounced in Scenario #2.
In addition, we observed that the other methods are more sensitive than the proposed method to errors in the DTD. Mis-detections and false-alarms of the DTD lead to severe performance degradation in the NLMS-based methods and consequently to a reduction in speech quality and intelligibility. This also explains the degradation of NLMS-RKEMD-2. Our method, however, converges faster even in the presence of these estimation errors and performs better.
We also note that, as expected, the NLMS-RKEMD-2 algorithm outperforms the NLMS-RKEMD-1 algorithm. However, its performance is still inferior to that of the NLMS-MCSE algorithm. In terms of intelligibility, NLMS-RKEMD-1 and NLMS-RKEMD-2 even achieve lower STOI scores than the unprocessed microphone signal. Nevertheless, the speech quality in terms of dereverberation and signal distortion is still improved, as evident from the higher SRMR and lower LSD measures.

Echo cancellation performance
A common performance measure for evaluating echo cancellation is the ERLE, defined for each time frame as
$$\mathrm{ERLE}(t) = 10\log_{10}\frac{\sum_{k}\left|g_1(k)\,y_t(k)\right|^{2}}{\sum_{k}\left|g_1(k)\,y_t(k) - \hat{g}_1(k)\,y_t(k)\right|^{2}},$$
i.e., the ratio between the power of the true echo component at the first microphone and the power of the residual echo after cancellation. The per-frame ERLE results for Scenario #1 are presented in Fig. 3, depicting the advantage of the proposed method over the competing methods for most frames. Furthermore, the ERLE performance is rather stable and insensitive both to changes in the far-end signal and to the DTD accuracy. Note that the true echo component $g_1(k)\,y_t(k)$ is only available in Scenario #1. In Scenario #2, we cannot separately record the near-end signal and the echo signal and then mix them to generate a test scenario, because the manual movement of the microphone array cannot be exactly repeated. Therefore, for Scenario #2, we propose to use the ratio between the signal power when both the speech and the reference signals are present and the signal power when only the reference signal is active. We refer to this ratio as the signal-to-echo ratio (SER) and define it for the input and output signals as
$$\mathrm{SER} = 10\log_{10}\frac{\sum_{t\in\mathcal{N}_d}\sum_{k}|C(t,k)|^{2}}{\sum_{t\in\mathcal{N}_e}\sum_{k}|C(t,k)|^{2}},$$
where $C$ is either the input (microphone) signal or the output (enhanced) signal, $\mathcal{N}_d$ is a set of double-talk frames and $\mathcal{N}_e$ is a set of echo-only frames. The improvement from the input SER to the output SER indicates the attenuation of the echo power and is denoted by $\Delta\mathrm{SER}$. The length of both $\mathcal{N}_d$ and $\mathcal{N}_e$ is approximately 6 seconds. The median of the measured $\Delta\mathrm{SER}$ for Scenario #2 is presented in Table 2, also depicting the advantage of the proposed method over the competing methods. Recall that the echo path adaptation in NLMS-MCSE and NLMS-RKEMD-1 continues in this scenario even during double talk, while the statistical model used by these methods does not account for the near-end signal. The echo cancellation performance of NLMS-RKEMD-2 is worse than that of our method due to the constantly time-varying echo path and its convergence toward the reverberated speech component. Hence, the level of the residual echo is significant, and this is reflected in the SER.
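In code, both ERLE and SER amount to simple power ratios. The sketch below uses our own helper names (not the paper's) and assumes (frames, bins) STFT arrays; the synthetic numbers are purely illustrative:

```python
import numpy as np

def erle_per_frame(echo, residual):
    """ERLE(t): true-echo power over residual-echo power, per STFT frame."""
    num = np.sum(np.abs(echo) ** 2, axis=1)
    den = np.sum(np.abs(residual) ** 2, axis=1)
    return 10.0 * np.log10(num / den)

def ser(stft, double_talk_frames, echo_only_frames):
    """Signal-to-echo ratio: power over double-talk frames vs. echo-only frames."""
    p_dt = np.sum(np.abs(stft[double_talk_frames]) ** 2)
    p_eo = np.sum(np.abs(stft[echo_only_frames]) ** 2)
    return 10.0 * np.log10(p_dt / p_eo)

# Synthetic check: a residual at 10% of the echo amplitude gives 20 dB ERLE,
rng = np.random.default_rng(0)
echo = rng.standard_normal((40, 64))
erle = erle_per_frame(echo, 0.1 * echo)

# and tripling the amplitude during double talk gives SER = 10*log10(9) dB.
stft = np.ones((40, 64))
stft[:20] *= 3.0
ser_val = ser(stft, np.arange(20), np.arange(20, 40))
# Delta-SER is then ser(output, ...) - ser(input, ...).
```

Since SER compares the same signal across two time segments, it needs no separate recording of the echo, which is why it is usable in Scenario #2.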

Spectrograms assessment
In addition to the quality measures presented in Sections 5.4 and 5.5, we provide the spectrograms of one example for Scenario #1 in Fig. 4 and for Scenario #2 in Fig. 5. The spectrograms of both scenarios demonstrate the enhancement capabilities and the robustness of the proposed method to double-talk scenarios. Sound examples of both scenarios can be found on the lab website 2 .

Conclusions
A recursive EM algorithm for AEC, dereverberation and noise reduction, based on Kalman filtering, was presented. The proposed statistical model addresses the three problems simultaneously. The E-step and M-step are carried out for each STFT time frame: the E-step is implemented as a Kalman filter, and the model parameters are estimated in the M-step. Given the estimate of the acoustic path of the far-end signal, the echo signal at each channel is evaluated. The estimated echo signal is subtracted from the microphone signal, and the outcome is further processed by the Kalman filter. The desired speech variance was estimated by adopting a spectral estimation method. The estimated near-end signal is obtained as a byproduct of the E-step. A DTD was utilized to suspend the M-step adaptation when the near-end or the far-end signal is inactive and, consequently, to prevent adaptation errors.
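For intuition, the per-frame flow summarized above (echo subtraction, Kalman E-step, recursive M-step updates) can be sketched for a single frequency bin. Everything below is a simplified stand-in, not the paper's exact algorithm: we assume an AR(1) model for the speech (coefficient `a`), a forgetting factor `lam` for the recursive parameter updates, and a normalized-gradient step in place of the paper's closed-form CTF update:

```python
import numpy as np

def kalman_em_frame(z, y_buf, g_hat, s, P, phi_s, phi_v, a=0.8, lam=0.95):
    """One schematic REM iteration for one STFT frame and one frequency bin.

    z            : microphone observation (complex scalar)
    y_buf        : last L far-end reference frames (CTF buffer, complex vector)
    g_hat        : current echo-path CTF estimate (complex vector, length L)
    s, P         : Kalman state (near-end speech) and its posterior variance
    phi_s, phi_v : speech and noise PSD estimates
    """
    # Echo cancellation: subtract the estimated echo from the microphone signal.
    e = z - np.vdot(g_hat, y_buf)

    # E-step: scalar Kalman filter on the echo-free signal e = speech + noise.
    s_pred = a * s                        # assumed AR(1) speech dynamics
    P_pred = (a ** 2) * P + phi_s         # predicted variance
    k = P_pred / (P_pred + phi_v)         # Kalman gain
    s = s_pred + k * (e - s_pred)         # filtered speech estimate
    P = (1.0 - k) * P_pred

    # M-step: recursive updates of the model parameters.
    r = e - s                             # residual after removing the speech estimate
    phi_v = lam * phi_v + (1.0 - lam) * np.abs(r) ** 2
    phi_s = lam * phi_s + (1.0 - lam) * (np.abs(s) ** 2 + P)
    # Normalized-gradient stand-in for the closed-form echo-path update.
    g_hat = g_hat + (1.0 - lam) * y_buf * np.conj(r) / (np.vdot(y_buf, y_buf).real + 1e-8)
    return s, P, phi_s, phi_v, g_hat

# Toy echo-only run: the recursion stays stable and the echo path is learned.
rng = np.random.default_rng(0)
L = 4
g_true = rng.standard_normal(L) + 1j * rng.standard_normal(L)
g_hat = np.zeros(L, dtype=complex)
s, P, phi_s, phi_v = 0.0 + 0.0j, 1.0, 1.0, 1.0
y_buf = np.zeros(L, dtype=complex)
for _ in range(300):
    y_buf = np.roll(y_buf, 1)
    y_buf[0] = rng.standard_normal() + 1j * rng.standard_normal()
    z = np.vdot(g_true, y_buf) + 0.01 * rng.standard_normal()
    s, P, phi_s, phi_v, g_hat = kalman_em_frame(z, y_buf, g_hat, s, P, phi_s, phi_v)
```

The full method additionally handles multiple microphones, a vector speech state under the CTF model, and DTD-gated M-step updates, none of which this scalar sketch attempts to reproduce.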
The tracking ability of the algorithm was tested in an experimental study carried out in our lab in very challenging scenarios, including moving speakers and a moving microphone array. The algorithm demonstrates convergence even during double talk in time-varying acoustic conditions. Our method is shown to outperform competing methods based on the NLMS algorithm in terms of intelligibility, speech quality and echo cancellation performance.