A recursive expectation-maximization algorithm for speaker tracking and separation

The problem of blind and online speaker localization and separation using multiple microphones is addressed based on the recursive expectation-maximization (REM) procedure. A two-stage REM-based algorithm is proposed: (1) multi-speaker direction of arrival (DOA) estimation and (2) multi-speaker relative transfer function (RTF) estimation. The DOA estimation task uses only the time-frequency (TF) bins dominated by a single speaker, and does not require the entire frequency range. In contrast, the RTF estimation task requires the entire frequency range in order to estimate the RTF for each frequency bin. Accordingly, a different statistical model is used for each of the two tasks. The first REM model is applied under the assumption that the speech signal is sparse in the TF domain, and utilizes a mixture of Gaussians (MoG) model to identify the TF bins associated with a single dominant speaker. The corresponding DOAs are estimated using these bins. The second REM model is applied under the assumption that the speakers are concurrently active in all TF bins and consequently applies a multichannel Wiener filter (MCWF) to separate the speakers. As a result of the concurrent-speaker assumption, a more precise TF map of the speakers' activity is obtained. The RTFs are estimated using the outputs of the MCWF-beamformer (BF), which are constructed using the DOAs obtained in the previous stage. Next, using the linearly constrained minimum variance (LCMV)-BF that utilizes the estimated RTFs, the speech signals are separated. The algorithm is evaluated using real-life scenarios of two speakers. Evaluation of the mean absolute error (MAE) of the estimated DOAs and of the separation capabilities demonstrates a significant improvement w.r.t. a baseline DOA estimation and speaker separation algorithm.


Introduction
Multi-speaker separation techniques, utilizing microphone arrays, have attracted the attention of the research community and the industry in the last three decades, especially in the context of hands-free communication systems. A comprehensive survey of state-of-the-art multichannel audio separation methods can be found in [1][2][3].
A commonly used technique for source extraction is the LCMV-BF [4,5], which is a generalization of the minimum variance distortionless response (MVDR)-BF [6]. In [7], the LCMV-BF was reformulated by substituting the simple steering vectors, based on the direct path, with the RTFs encompassing the entire reflection pattern of the acoustic propagation. The authors also presented a method to estimate the RTFs, based on the generalized eigenvalue decomposition (GEVD) of the power spectral density (PSD) matrices of the received signals and the background noise. A multi-speaker LCMV-BF was proposed in [8] to simultaneously extract all individual speaker signals. Moreover, the estimation procedure of the speakers' PSDs was facilitated by the decomposition of the multi-speaker MCWF into two stages, namely a multi-speaker LCMV-BF and a subsequent multi-speaker post-filter.
*Correspondence: sharon.gannot@biu.ac.il. Faculty of Engineering, Bar-Ilan University, 5290002 Ramat-Gan, Israel.
In [7,8], the RTFs were estimated using time intervals comprising each of the desired speakers separately assuming a static scenario. Practically, these time intervals need to be detected from data and cannot be assumed to be known.
In [9], time-frames dominated by each of the speakers were identified by estimating the DOA for each frame using clustering of a time-series of steered response power (SRP) estimates. In [10,11], these frames were identified by exploiting convex geometry tools on the recovered simplex of the speakers' probabilities or the correlation function between frames [12]. In [13,14], a dynamic, neural-network-based, concurrent speaker detector was presented to detect single speaker frames. A library of these RTFs was collected for constructing an LCMV-BF and for further spatial identification of the speakers. In [15], the speech sparsity in the short-time Fourier transform (STFT) domain was utilized to track the DOAs of multiple speakers using a convolutional neural network (CNN) applied to the instantaneous RTF estimate. Speaker separation was obtained, as a byproduct of the tracking method, by the application of TF masking.
Unfortunately, the existence of single-speaker-dominant frames is not always guaranteed for simultaneously active speakers. Furthermore, for moving speakers, the RTFs estimated from these frames may be irrelevant for subsequent processing. In [16], the sparsity of the speech signal in the STFT domain was utilized to model the frequency bins with a complex-Gaussian mixture p.d.f., and the RTFs were estimated offline as part of an expectation-maximization (EM)-MoG procedure. In [17], an offline blind estimation of the acoustic transfer functions was presented using non-negative matrix factorization and the EM algorithm. In [18,19], an offline estimation of the acoustic transfer functions was carried out by estimating a latent variable representing the speaker activity pattern. In [20], an online estimation of the blocking matrices (required for the generalized sidelobe canceler implementation of the MVDR-BF) associated with each of the speakers was carried out by clustering the DOA estimates from all TF bins. In [21,22], an online time-frequency masking was proposed to estimate the RTFs using the EM algorithm, without any prior information on the array geometry and without the plane wave assumption.
Common DOA estimators are based on the SRP-phase transform (PHAT) [23], the multiple signal classification (MUSIC) algorithm [24], or model-based expectation-maximization source separation and localization (MESSL) [25]. In [26][27][28], the microphone observations were modeled as a mixture of high-dimensional zero-mean complex Gaussians, and a spatial covariance matrix that consists of both the speech and the noise power spectral densities (PSDs) was assumed. In [29], a DOA tracking procedure was proposed by applying the Cappé and Moulines recursive EM (CREM) algorithm. Recursive equations for the DOA probabilities and the candidate speakers' PSDs were derived, which facilitated online DOA tracking of multiple speakers.
In this paper, an online and blind speaker separation procedure is presented. The RTFs of the multiple speakers are updated using an REM model that assumes concurrent activity of the speakers. New links are established between the direct-path phase differences and the full RTF of each speaker. The dominant DOAs in each frame are estimated using a dedicated REM procedure. Then, in each frame, the RTFs are initialized by the direct-path phase differences (using the corresponding DOAs). Next, the full RTFs are re-estimated using the LCMV outputs. By comparing the energies of the LCMV outputs, frames dominated by a single speaker can be detected. As a practical improvement, the RTF of a speaker is updated only when the LCMV output corresponding to that speaker is relatively high. Finally, the LCMV-BF is re-employed using the estimated RTFs.
The direct-path phase differences are set using the speakers' DOAs, estimated in an online preliminary stage of multiple concurrent DOA estimation. In this stage, assuming J speakers, J dominant DOAs are estimated in each frame using a novel version of the MoG-REM. The sparse nature of speech is exploited only for the DOA estimation, where it has proven efficient. The output of many multi-speaker DOA estimators is merely a probability of a speaker being present at each candidate DOA, leaving the final DOAs of the speakers undetermined. In this paper, we design an REM-based concurrent DOA estimator that consists of only J Gaussians. Rather than estimating the probabilities, the DOAs of the speakers are directly estimated using the REM algorithm.
The remainder of this paper is organized as follows. In Section 2, the speaker separation problem is formulated. In Section 3, the proposed dual-stage algorithm is overviewed. In Section 4, the REM procedure for the speaker separation is derived. In Section 5, the REM procedure for the multiple-speaker DOA estimation is derived. In Section 6, the performance of the proposed algorithm is evaluated. Section 7 is dedicated to concluding remarks.
Problem formulation

The signals received by the microphone array are formulated in the STFT domain as:

Y_i(\ell,k) = \sum_{j=1}^{J} G_{i,j}(\ell,k) S_j(\ell,k) + V_i(\ell,k)    (1)

where S_j(\ell,k) is the speech signal of the jth speaker, as received by the reference microphone (chosen arbitrarily as microphone #1), G_{i,j}(\ell,k) is the RTF from the jth speaker to the ith microphone w.r.t. the reference microphone, V_i(\ell,k) denotes ambient noise, and \ell and k are the time-frame and frequency-bin indices, respectively. The number of microphones is N and the number of sources of interest is J.
By concatenating the signals and RTFs in vectors, (1) can be recast as:

y(\ell,k) = G(\ell,k) s(\ell,k) + v(\ell,k)    (2)

where:

y = [Y_1, \ldots, Y_N]^T,  s = [S_1, \ldots, S_J]^T,  v = [V_1, \ldots, V_N]^T

and G is the N \times J matrix of RTFs with (i,j)th element G_{i,j}. The ambient noise is modeled as a zero-mean Gaussian vector with a PSD matrix \Phi_v(\ell,k):

f(v) = N_C(v; 0, \Phi_v)    (3)

where:

N_C(z; \mu, \Phi) = \frac{\exp\{-(z - \mu)^H \Phi^{-1} (z - \mu)\}}{\pi^N |\Phi|}

z denotes a Gaussian vector, \Phi is a PSD matrix, Tr[\cdot] denotes the trace operation, and |\cdot| denotes the matrix-determinant operation. The individual speech signals S_j(\ell,k) are also modeled as independent and zero-mean Gaussian processes with variance \phi_{S_j}(\ell,k):

f(S_j) = N_C(S_j; 0, \phi_{S_j})    (4)

In the following sections, the frequency index k and the time index \ell are omitted for brevity, whenever no ambiguity arises.
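To make the statistical model concrete, the following numpy sketch draws one realization of y = G s + v for a single TF bin and forms the PSD matrix of the observations implied by the model; all numeric values (array size, speaker and noise PSDs) are toy assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 6, 2          # microphones, speakers (toy values)

# Hypothetical RTF matrix G for one TF bin; row 0 is the reference microphone.
G = rng.standard_normal((N, J)) + 1j * rng.standard_normal((N, J))
G[0, :] = 1.0        # RTFs are relative to reference microphone #1
phi_s = np.array([1.0, 0.5])   # speaker PSDs phi_{S_j}
Phi_v = 0.01 * np.eye(N)       # spatially white noise PSD matrix

# One realization of s and v, then y = G s + v for this bin.
s = np.sqrt(phi_s / 2) * (rng.standard_normal(J) + 1j * rng.standard_normal(J))
v = np.sqrt(0.01 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
y = G @ s + v

# PSD matrix of y implied by the model: G diag(phi_s) G^H + Phi_v.
Phi_y = G @ np.diag(phi_s) @ G.conj().T + Phi_v
```

The observation PSD matrix `Phi_y` is the quantity that both beamformers in the next section invert, per frequency bin.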

Algorithm overview
The proposed algorithm comprises two stages as detailed below and summarized in Fig. 1.

Speaker extraction
The goal of this paper is to estimate the individual speech signals S j of the dominant J speakers (while the number of speakers J is assumed fixed and known) using the multi-speaker MCWF or the multi-speaker LCMV beamformer [8].
The multi-speaker MCWF and the multi-speaker LCMV-BF are given by [8]:

\hat{s}_{MCWF} = \Phi_s G^H (G \Phi_s G^H + \Phi_v)^{-1} y    (5)

\hat{s}_{LCMV} = (G^H \Phi_v^{-1} G)^{-1} G^H \Phi_v^{-1} y    (6)

where \Phi_s = Diag(\phi_{S_1}, \ldots, \phi_{S_J}) is a diagonal matrix (namely, the individual speech signals are assumed mutually independent). Even though the MCWF usually achieves better noise reduction than the LCMV, in many cases the LCMV is preferred due to its distortionless characteristics (especially when a large number of microphones is available). For the main task of this paper, namely speaker separation, the LCMV-BF suffices.
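The multi-speaker LCMV weights can be sketched per frequency bin as follows (a minimal numpy sketch; array size and noise PSD are toy assumptions). The check verifies the defining property of this beamformer: each output extracts one speaker distortionlessly while nulling the others, i.e., W^H G equals the identity:

```python
import numpy as np

def lcmv_weights(G, Phi_v):
    """Multi-speaker LCMV weights W = Phi_v^{-1} G (G^H Phi_v^{-1} G)^{-1}
    for one frequency bin; column j of W extracts speaker j."""
    A = np.linalg.solve(Phi_v, G)              # Phi_v^{-1} G
    return A @ np.linalg.inv(G.conj().T @ A)   # N x J

# Toy check: W^H G should equal the J x J identity.
rng = np.random.default_rng(1)
N, J = 6, 2
G = rng.standard_normal((N, J)) + 1j * rng.standard_normal((N, J))
Phi_v = 0.1 * np.eye(N)
W = lcmv_weights(G, Phi_v)
err = np.abs(W.conj().T @ G - np.eye(J)).max()
```

Applying `W.conj().T @ y` then yields the J separated outputs for the bin.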

Parameters estimation
For implementing the LCMV-BF (6), an estimate of the RTF matrix G is required. The proposed algorithm for blind and online estimation of G is based on two separate stages:
1. Estimating the J dominant DOAs associated with the J dominant active speakers. The DOA of each speaker is chosen from a predefined set of candidate DOAs.
2. Estimating the J RTFs g_j associated with the J dominant DOAs from the first stage. In each frame, the RTFs are initialized by the direct-path transfer function (based on the DOAs from the previous stage) and then the RTFs are updated using the MCWF outputs.
To concurrently estimate the multiple DOAs and the RTFs of the speakers, the EM [30] formulation is adopted (separately for each task), as described in the following sections. Moreover, to achieve online estimation of the RTFs and to maintain smooth estimates over time, a recursive version of the EM algorithm is adopted. A block diagram of the proposed two-stage algorithm is depicted in Fig. 1. In Section 4, an estimation procedure for the RTFs is proposed while associating an RTF to each speaker using its associated estimated DOA. In Section 5, an estimation procedure of the J-dominant DOAs is proposed.

Speaker extraction given the DOAs
To implement the EM algorithm, three datasets should be defined: the observed data, the hidden data, and the parameter set. The observed data in our model is the received microphone signals y. We propose to define the individual signals s as the hidden data. The parameter set is defined as the RTFs G and the PSD matrix of the speakers \Phi_s, such that \theta = \{G, \Phi_s\}. The E-step evaluates the auxiliary function, while the maximization step maximizes the auxiliary function w.r.t. the set of parameters. The batch EM procedure converges to a local maximum of the likelihood function of the observations [30]. To track time-varying RTFs and to satisfy the online requirements, the CREM [31] algorithm is adopted. CREM is based on smoothing of the auxiliary function along the time axis and executing a single maximization per time instance. The smoothing operation is given by [31, Eq. (10)]:

Q_R(\theta; \ell) = \gamma Q_R(\theta; \ell - 1) + (1 - \gamma) Q(\theta; \hat{\theta}_{\ell-1})    (8)

where Q_R(\theta; \ell) is the recursive auxiliary function, \hat{\theta}_\ell is the estimate of \theta at the \ell-th time instance, and 0 \le \gamma \le 1 is a smoothing factor. The term Q(\theta; \hat{\theta}_{\ell-1}) is the instantaneous auxiliary function of the \ell-th observation, namely the expectation of the log p.d.f. of the complete data (the observed and hidden data) given the observed data and the previous parameter set:

Q(\theta; \hat{\theta}_{\ell-1}) = E[\log f(y_\ell, s_\ell; \theta) | y_\ell; \hat{\theta}_{\ell-1}]    (9)

The \ell-th parameter set estimate is obtained by maximizing Q_R(\theta; \ell) w.r.t. \theta.

Auxiliary function
By applying the Bayes rule, the p.d.f. of the complete instantaneous data is given by:

f(y, s; \theta) = f(y | s; \theta) f(s; \theta)    (10)

where the conditional p.d.f. in (10) is given by:

f(y | s; \theta) = N_C(y; Gs, \Phi_v)    (11)

and the p.d.f. of s is given by f(s) = N_C(s; 0, \Phi_s). Finally, the auxiliary function is given by:

Q(\theta; \hat{\theta}_{\ell-1}) = E[\log f(y, s; \theta) | y; \hat{\theta}_{\ell-1}]    (12)

The EM is notoriously known for converging to local maxima, and hence proper initialization is mandatory. Such an initialization is discussed in the following section.

Initialization of the individual speaker RTFs
Since the RTFs of the speakers in G are time-varying, we propose to reinitialize them in each frame using the estimated DOAs of the speakers. In each new frame, the previous RTFs are discarded and substituted by RTFs which are based on the DOAs only (as initialization). In the M-step, the RTFs are re-estimated using the smoothed latent variables. Using the DOAs, the RTFs are initialized by the direct-path transfer function, namely the relative phase from the desired speaker to the ith microphone w.r.t. the reference microphone. Accordingly, given the estimate of each speaker's DOA \theta_j, the RTFs can be initialized by:

G_{i,j,k} \leftarrow D_{i,k}(\theta_j) = \exp\left\{-\iota 2\pi \frac{k}{K} \frac{\tau_i(\theta_j)}{T_s}\right\}    (13)

where K is the number of frequency bins, T_s denotes the sampling period, and \tau_i(\theta_j) denotes the time difference of arrival (TDOA) between microphone i and the reference microphone given the jth speaker DOA \theta_j. Note that the DOAs are blindly estimated, as explained in Section 5.
Examining only the horizontal plane, and given the two-dimensional positions of the microphones, the TDOA is given by:

\tau_i(\theta_j) = \frac{(x_i - x_1)^T u(\theta_j)}{c},  u(\theta_j) = [\cos\theta_j, \sin\theta_j]^T    (14)

where c is the sound velocity and x_i is the horizontal position of microphone i.
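The direct-path initialization above can be sketched as follows, here for a 6-microphone circular array like the one used later in the experiments; the FFT length and sampling rate are illustrative assumptions:

```python
import numpy as np

def direct_path_rtf(mic_xy, doa_deg, K, fs, c=343.0):
    """Direct-path RTFs for a planar array: a unit-modulus phase term per
    microphone and frequency bin. mic_xy: (N, 2) horizontal positions;
    microphone 0 is the reference."""
    theta = np.deg2rad(doa_deg)
    u = np.array([np.cos(theta), np.sin(theta)])   # unit vector toward the DOA
    tau = (mic_xy - mic_xy[0]) @ u / c             # TDOA w.r.t. reference mic
    k = np.arange(K // 2 + 1)                      # one-sided frequency bins
    f = k * fs / K                                 # bin frequencies in Hz
    return np.exp(-2j * np.pi * np.outer(tau, f))  # shape (N, K/2 + 1)

# 6-microphone circular array with 5 cm diameter, as in the experiments.
ang = np.linspace(0, 2 * np.pi, 6, endpoint=False)
mics = 0.025 * np.stack([np.cos(ang), np.sin(ang)], axis=1)
D = direct_path_rtf(mics, doa_deg=45.0, K=2048, fs=16000)
```

The reference-microphone row is identically one and every entry has unit magnitude, as expected from a pure phase (delay-only) model.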

Initialization of the individual speakers PSD
Similarly to the RTF initialization, we propose to reinitialize the PSDs in each frame using the estimated DOAs. The PSDs of the speakers can be initialized by maximizing the p.d.f. of the observations given the relative phases (13):

\hat{\Phi}_s = \arg\max_{\Phi_s} f(y; \Phi_s, D)    (15)

where

f(y; \Phi_s, D) = N_C(y; 0, D \Phi_s D^H + \Phi_v)    (16)

and D is an N \times J matrix with elements D_{i,j} = D_{i,k}(\theta_j). Taking the derivative of the p.d.f. above w.r.t. \Phi_s and equating it to zero attains the estimate of the speaker PSDs:

\hat{\Phi}_s = \hat{s}_{LCMV} \hat{s}_{LCMV}^H - \Phi_{v,res}    (17)

where \hat{s}_{LCMV} = (D^H \Phi_v^{-1} D)^{-1} D^H \Phi_v^{-1} y is the multi-speaker LCMV output vector and

\Phi_{v,res} = (D^H \Phi_v^{-1} D)^{-1}    (18)

is the residual noise PSD matrix at the output of the multi-speaker LCMV stage. Since \Phi_s is defined as a diagonal matrix, the off-diagonal elements of the estimated matrix in (17) are zeroed out.

Instantaneous expectation and maximization steps
Examining (12), the E-step in the \ell-th time instance boils down to the calculation of \hat{s} and \widehat{ss^H}, where for any stochastic variable a, \hat{a} = E[a | y; \hat{\theta}_{\ell-1}]. Using the multi-speaker MCWF [8], the following expressions are obtained:

\hat{s} = W^H y,  W = (G \Phi_s G^H + \Phi_v)^{-1} G \Phi_s    (19)

\widehat{ss^H} = \hat{s}\hat{s}^H + \Phi_s - \Phi_s G^H (G \Phi_s G^H + \Phi_v)^{-1} G \Phi_s    (20)

where W is the multi-speaker MCWF. Using the expectations above, the instantaneous auxiliary function Q(\theta; \hat{\theta}_{\ell-1}) is obtained in (21). Substituting the auxiliary function (21) in the recursive equation (8), and following some algebraic simplifications, the implementation of (8) can be summarized by the following recursive equations:

A_\ell = \gamma A_{\ell-1} + (1 - \gamma) \widehat{s_\ell s_\ell^H}    (22a)
B_\ell = \gamma B_{\ell-1} + (1 - \gamma) y_\ell \hat{s}_\ell^H    (22b)

Using A_\ell and B_\ell, the recursive auxiliary function can be rewritten as (23). Similarly to the batch EM procedure, the M-step is obtained by maximizing Q_R(\theta; \ell) w.r.t. the problem parameters. The speaker PSDs and the RTF estimates are then given by:

\hat{\Phi}_{s,\ell} = A_\ell,  \hat{G}_\ell = B_\ell A_\ell^{-1}    (24a, 24b)

Since \Phi_s is defined as a diagonal matrix, the off-diagonal elements of its estimate should be zeroed out. Note that the RTFs are discarded in each new frame and reinitialized using the DOA-based steering vector (see (13)). Nevertheless, the RTFs are re-estimated by the updated recursive variables A_\ell and B_\ell (see (24b)). These variables are only slightly updated from frame to frame, due to the smoothing factor \gamma. Therefore, the final estimate of the RTFs is also only slightly updated. The re-initialization of the RTFs in each frame only influences the estimates of \hat{s} and \widehat{ss^H} as used in (22).
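A minimal numpy sketch of one recursive E/M step for a single TF bin, under the Gaussian model assumed here: an MCWF-based E-step, smoothed sufficient statistics, and a closed-form M-step. The initial statistics, smoothing factor, and toy signal are illustrative assumptions, and the diagonal constraint on the speaker PSD matrix is enforced by keeping only the diagonal:

```python
import numpy as np

def rem_step(y, G, phi_s, Phi_v, A, B, gamma=0.95):
    """One recursive EM step for a single TF bin (sketch)."""
    Phi_s = np.diag(phi_s)
    Phi_y = G @ Phi_s @ G.conj().T + Phi_v
    W = np.linalg.solve(Phi_y, G @ Phi_s)          # Phi_y^{-1} G Phi_s (MCWF)
    s_hat = W.conj().T @ y                         # E[s | y]
    P = Phi_s - Phi_s @ G.conj().T @ W             # posterior covariance
    ss_hat = np.outer(s_hat, s_hat.conj()) + P     # E[s s^H | y]
    # Recursive smoothing of the sufficient statistics.
    A = gamma * A + (1 - gamma) * ss_hat
    B = gamma * B + (1 - gamma) * np.outer(y, s_hat.conj())
    # M-step: speaker PSDs (off-diagonals dropped) and RTFs.
    phi_s_new = np.real(np.diag(A)).copy()
    G_new = B @ np.linalg.inv(A)
    return A, B, phi_s_new, G_new

rng = np.random.default_rng(2)
N, J = 4, 2
G0 = rng.standard_normal((N, J)) + 1j * rng.standard_normal((N, J))
G0[0, :] = 1.0                                     # reference microphone
A0 = np.eye(J, dtype=complex)                      # toy initial statistics
B0 = np.zeros((N, J), dtype=complex)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)
A1, B1, phi1, G1 = rem_step(y, G0, np.array([1.0, 0.5]), 0.05 * np.eye(N), A0, B0)
```

In a full system this step would run per frequency bin and per frame, with the RTFs reinitialized from the DOA-based steering vectors at each new frame as described above.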

Practical considerations
Due to the intermittent nature of the speech signal, some speakers may be inactive in several frames. This will result in a few elements on the main diagonal of \Phi_s that are close to zero. Note that, as the number of speakers J is set in advance, it might be larger than the instantaneous number of active speakers in several frames.
In fact, bins where only a single speaker is dominant should be preferred for the task of estimating the RTFs, since the other speakers do not bias the estimate. To determine these TF bins, the power ratio between the desired speech and the other interfering speech signals, denoted the desired speaker-to-interferers ratio (DSIR), may be examined for each TF bin:

DSIR_j = \frac{\phi_{S_j}}{\sum_{j' \neq j} \phi_{S_{j'}}}

Using the PSD matrix initialization (17), the RTF of speaker j should be estimated only if DSIR_j obtains a high value:

\hat{g}_{j,\ell} = \begin{cases} B_\ell e_j / [A_\ell]_{j,j} & DSIR_j > \eta \\ \hat{g}_{j,\ell-1} & \text{otherwise} \end{cases}    (25)

where \eta is some predefined threshold and e_j is the jth standard basis vector.
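The DSIR gating rule can be sketched as follows; the helper name is illustrative, and the default threshold 0.6 is the value reported as best in the experiments later in the paper:

```python
import numpy as np

def update_rtf(phi_s, j, eta=0.6):
    """Gate for updating speaker j's RTF in a TF bin: the desired
    speaker-to-interferers ratio phi_{S_j} / sum_{j' != j} phi_{S_j'}
    must exceed the threshold eta."""
    interferers = np.delete(phi_s, j).sum()
    return phi_s[j] / max(interferers, 1e-12) > eta
```

For instance, with PSDs `[1.0, 0.1]` speaker 0 clearly dominates the bin and its RTF is updated, whereas with `[1.0, 0.5]` speaker 1 falls below the threshold and keeps its previous RTF estimate.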
To summarize this part of the proposed algorithm, the REM procedure for estimating the individual speaker signals given the DOAs is given in Algorithm 1.

DOA estimation
For the estimation of the speakers' DOAs, we adopt a different statistical model, and assume hereinafter that the W-disjoint orthogonality property of speech [32,33] holds. This assumption was shown to be beneficial in handling multi-speaker DOA estimation tasks [25][26][27][28][29]. Using this TF sparsity assumption, the signal observed at the ith microphone can be remodeled as described in [29]:

Y_i(\ell,k) = \sum_{j=1}^{J} \mu_j(\ell,k) D_{i,k}(\theta_j) S_j(\ell,k) + V_i(\ell,k)    (26)

where the variables \mu_j(\ell,k) are indicators that the jth speaker is active at the (\ell,k)th TF bin. A disjoint activity of the speakers can be imposed by allowing the J indicators \mu_j(\ell,k) to have only a single non-zero element per TF bin. The RTF D_{i,k}(\theta_j) is solely defined by the direct path, as given in (13).
The indicators \mu = [\mu_1, \ldots, \mu_J] will be used as the hidden data under this formulation. The parameter set is accordingly defined as the DOAs \theta_j and the speaker PSDs \phi_{S_j}, such that \theta = \{\theta_j, \phi_{S_j}\}_{j=1}^{J}. Unlike [29], where the probabilities of each candidate DOA are estimated, in this paper the J dominant DOAs are determined from the DOA candidate set. In [29], a subsequent peak-picking stage is therefore required. In the proposed algorithm, the J DOAs are estimated during the M-step.

Auxiliary function
Using the Bayes rule, the p.d.f. of the complete data is given by:

f(y, \mu; \theta) = f(y | \mu; \theta) f(\mu)    (27)

where the conditional p.d.f. in (27) involves J Gaussians, one per speaker:

f(y | \mu; \theta) = \prod_{j=1}^{J} \left[ N_C(y; 0, \phi_{S_j} d(\theta_j) d^H(\theta_j) + \Phi_v) \right]^{\mu_j}    (28)

with d(\theta_j) = [D_{1,k}(\theta_j), \ldots, D_{N,k}(\theta_j)]^T. The p.d.f. of the indicators is f(\mu) = \prod_{j=1}^{J} p_j^{\mu_j}, with p_j the probability of activity of each speaker and \sum_{j=1}^{J} p_j = 1. These probabilities may be initialized as 1/J.

Initial estimation of the speech PSDs
It was shown in [29] that for each DOA \theta_j, the corresponding speech PSD is independent of the E-step and thus can be estimated prior to the EM iterations. For each DOA \theta_j, the corresponding PSD is estimated by maximizing the relevant Gaussian:

\hat{\phi}_{S_j} = \arg\max_{\phi_{S_j}} \log N_C(y; 0, \phi_{S_j} d(\theta_j) d^H(\theta_j) + \Phi_v)    (29)

In [29], it was shown that, using the Fisher-Neyman factorization, the log-likelihood above can be expressed as:

\log f(y; \phi_{S_j}, \theta_j) \stackrel{C}{=} -\log(1 + \phi_{S_j}\rho_j) + \frac{\phi_{S_j}\rho_j}{1 + \phi_{S_j}\rho_j} \rho_j |\hat{S}_{MVDR,j}|^2    (30)

where

\hat{S}_{MVDR,j} = \frac{d^H(\theta_j) \Phi_v^{-1} y}{d^H(\theta_j) \Phi_v^{-1} d(\theta_j)},  \rho_j = d^H(\theta_j) \Phi_v^{-1} d(\theta_j)    (31)

Substituting the estimate of the speech PSD in the log p.d.f. in (30) yields:

\log f(y; \hat{\phi}_{S_j}, \theta_j) \stackrel{C}{=} \widehat{SNR}_j - \log(1 + \widehat{SNR}_j)    (32)

where \widehat{SNR}_j = \rho_j |\hat{S}_{MVDR,j}|^2 - 1 is the posterior SNR of each speaker, and \stackrel{C}{=} stands for equality up to a constant independent of the relevant parameters.

The EM iterations
Using the log p.d.f. from (32) and the definitions (27)-(28), the auxiliary function is given by:

Q(\theta; \hat{\theta}_{\ell-1}) = \sum_{j=1}^{J} \bar{\mu}_j \left[ \log p_j + \log f(y; \phi_{S_j}, \theta_j) \right]    (33)

where \bar{\mu}_j is the expected indicator, \bar{\mu}_j = E[\mu_j | y; \hat{\theta}_{\ell-1}]. According to [29], the expressions for the indicators can be simplified to:

\bar{\mu}_j = \frac{p_j T_j}{\sum_{j'=1}^{J} p_{j'} T_{j'}}    (34)

where T_j = (1 + \widehat{SNR}_j)^{-1} \exp(\widehat{SNR}_j) is the sufficient statistic.
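The expected indicators amount to responsibilities: a product of prior probabilities and sufficient statistics, normalized over the J speakers. A minimal sketch, with toy input values:

```python
import numpy as np

def expected_indicators(p, T):
    """Posterior speaker-activity probabilities for one TF bin:
    mu_bar_j is proportional to p_j * T_j, normalized over the J speakers."""
    w = p * T
    return w / w.sum()

# Equal priors; speaker 0 has a three-times-larger sufficient statistic.
mu_bar = expected_indicators(np.array([0.5, 0.5]), np.array([3.0, 1.0]))
```

The responsibilities always sum to one, so each TF bin distributes a unit of evidence among the candidate speakers.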
Using the expected indicators \bar{\mu}_j and the auxiliary function in (33), the smoothing stage in (8) is summarized by the following recursive equation:

Q_R(\theta_j; \ell) = \gamma Q_R(\theta_j; \ell - 1) + (1 - \gamma) \bar{\mu}_{j,\ell} \left[ \widehat{SNR}_{j,\ell} - \log(1 + \widehat{SNR}_{j,\ell}) \right]    (35)

Note that the term \log(1 + \widehat{SNR}_j) in (35) has only a minor influence on the maximization and can be omitted in practice. The M-step is obtained by maximizing Q_R(\theta_j; \ell) w.r.t. \theta_j and p_j for j = 1, \ldots, J:

\hat{p}_{j,\ell} = \gamma \hat{p}_{j,\ell-1} + (1 - \gamma) \bar{\mu}_{j,\ell}    (36a)

\hat{\theta}_{j,\ell} = \arg\max_{\theta_j} \sum_k Q_R(\theta_j; \ell)    (36b)

Note that the estimation of the DOA is obtained using all frequency bins. Since there is no closed-form expression for \hat{\theta}_j, the term \sum_k Q_R(\theta_j; \ell) should be calculated for each possible \theta_j. Practically, a set of DOA candidates can be predefined (for example 0°, ..., 345° with a 15° resolution) and \sum_k Q_R(\theta_j; \ell) can be calculated only for these candidates. Then, separately for each source j, the DOA which maximizes \sum_k Q_R(\theta_j; \ell) is selected as the jth speaker DOA.
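The grid-search M-step can be sketched as follows; the toy scores simply illustrate picking, separately for each speaker, the candidate DOA that maximizes the frequency-summed recursive auxiliary function:

```python
import numpy as np

def pick_doas(QR, candidates):
    """M-step sketch: QR holds, for each speaker j (row), the recursive
    auxiliary function summed over frequency and evaluated on the candidate
    grid; the estimated DOA of speaker j maximizes its row."""
    return candidates[np.argmax(QR, axis=1)]

candidates = np.arange(0, 360, 15)          # 15-degree grid, as in Section 6
QR = np.zeros((2, candidates.size))
QR[0, candidates == 45] = 1.0               # toy scores peaking at 45 and 135
QR[1, candidates == 135] = 1.0
doas = pick_doas(QR, candidates)
```

A grid of 24 candidates keeps the per-frame search cost trivial, which is what makes the exhaustive maximization in (36b) practical online.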
The estimated probability of each speaker p j (Eq. (36a)) may be utilized to discard the redundant speaker in the beamforming stage (see the third block in Fig. 1). When p j is lower than a predefined threshold, it implies that the jth speaker is inactive and the final beamforming may include only the other active speakers.
The REM algorithm for estimating the desired speaker DOA is summarized in Algorithm 2 and depicted in the block diagram in Fig. 2.

Performance evaluation
The performance of the proposed algorithm is evaluated using recorded signals of two concurrent speakers on the two presented tasks: (1) online DOA estimation and (2) time-varying source separation. Correspondingly, we used two quality measures: (1) the mean absolute error (MAE) between the estimated and oracle DOAs and (2) the power level between the speakers at the output, as a measure of the separation capabilities.

Recording setup
Overall, nine experiments were conducted. In each experiment, two 60-s long speech signals of male and female speakers were separately recorded in the acoustic lab at CEVA Inc. premises, as shown in Fig. 3a. CEVA Inc. DSP platform was used as the acquisition device. The circular array with 5 cm diameter comprises six microphones at the perimeter. The device is depicted in Fig. 3b. The two speakers were either standing or moving with various trajectories around the array, approximately 1 m from the array center. The speed of the moving speaker was approximately 1 m/s. The various source trajectories are described in Table 1.
The reverberation time was adjusted to approximately T60 = 0.2 s using the room panels and additional furniture. The utterances were generated by adding the two separately recorded speech signals together with spectrally and spatially white noise, with a power of 40 dB below the power of the overall speech signal.
The sampling frequency was 16 kHz, and the frame length of the STFT was set to 128 ms with 32 ms overlap. The resolution of the candidate DOAs was set to 15° in the range [0°, 345°]. The frequency band 500-3500 Hz was used for the DOA estimation.

Baseline method
The proposed algorithm was compared with a baseline localization and separation algorithm conceptually based on [20]. The steps of the baseline algorithm are:
...
3. ... [20, Eq. (19)].
4. Implement the LCMV-BF using the two estimated RTFs.

Tracking results
The tracking algorithms estimate the two dominant DOAs for each frame. Let \hat{\theta}_{1,\ell} and \hat{\theta}_{2,\ell} be the estimates of the DOAs of the two speakers at frame \ell, as obtained by either the proposed or the baseline algorithm, and let \theta_{1,\ell} and \theta_{2,\ell} be the corresponding oracle DOAs. Define the MAE as:

MAE = \frac{1}{2L} \sum_{\ell=1}^{L} \min_{\pi} \sum_{j=1}^{2} |\hat{\theta}_{\pi(j),\ell} - \theta_{j,\ell}|    (37)

where L is the number of frames in the utterance and \pi is a permutation of the speaker indices. The oracle DOAs were obtained by applying the proposed algorithm to the separated inputs x_M and x_F while assuming a single speaker.
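A permutation-agnostic MAE along these lines can be sketched as follows; taking the per-frame minimum over speaker associations is one interpretation of a measure that is unaffected by label switches at trajectory intersections:

```python
import numpy as np
from itertools import permutations

def mae_doa(est, oracle):
    """Permutation-agnostic MAE between estimated and oracle DOA tracks
    (sketch): per frame, the speaker association minimizing the absolute
    error is taken. est, oracle: (L, J) arrays of DOAs in degrees."""
    est, oracle = np.asarray(est, float), np.asarray(oracle, float)
    L, J = est.shape
    total = 0.0
    for l in range(L):
        total += min(np.abs(est[l, list(p)] - oracle[l]).sum()
                     for p in permutations(range(J)))
    return total / (L * J)
```

For example, a frame whose estimates are simply swapped relative to the oracle contributes zero error, while a uniform 5° offset on both tracks yields an MAE of 5°.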
The trajectories of the estimated DOAs for both proposed and the baseline algorithms for all nine experiments are depicted in Figs. 4, 5, and 6, together with the oracle trajectories. The MAEs for all cases are presented in Table 2.
Looking at the tracking curves and the MAEs, the proposed algorithm closely follows the oracle speakers' DOA contours, and significantly outperforms the baseline algorithm.
Note that when the speakers' trajectories intersect, the estimates may suffer from an unavoidable permutation ambiguity. Consequently, while both trajectories are accurately estimated, the association between them and the speakers may switch after the intersection point (see Experiments #2, #3, #4, #6, #7, and #8). Note that, by definition (37), the MAE is agnostic to such permutations. The intersecting trajectories may also result in significant errors when the DOAs of the speakers become close (see Experiments #2, #3, #4, #7, and #8). This is also reflected in the relatively high MAE values (see Experiments #4, #5, and #7). Higher MAE values are also encountered during the initial convergence period (see Experiments #4, #5, and #6).
The performance improvement of the proposed algorithm may be attributed to the MVDR-BF front-end, which is capable of suppressing the interference sources, as opposed to the SRP-PHAT front-end, which is adopted by the baseline method.
The proposed DOA estimator is also evaluated, in comparison with the baseline algorithm, at multiple reverberation times (T60) and SNR values. Two moving speakers were simulated by convolving randomly selected male and female utterances with room impulse responses, simulated using an open-source signal generator. The microphone signals were then contaminated by a directional, spectrally pink noise source at several SNR levels. The trajectories of the two speakers were set as clockwise and counterclockwise.
The MAEs for different values of T60 and for SNR = 40 dB are presented in Table 3. The MAEs for different SNR levels and for T60 = 0.2 s are presented in Table 4.
The performance of both the baseline and the proposed algorithms degrades as the reverberation level increases. However, the accuracy of the proposed algorithm is significantly higher, with errors bounded by 30°. Similar trends can be observed in Table 4, with a significant advantage of the proposed algorithm, whose errors are kept in the range of 11°-14.5°.

Separation results
The separation capabilities of the proposed and baseline algorithms were assessed by evaluating the speaker-to-interference ratio (SIR) improvement. For convenience, all examined scenarios comprised one male and one female speaker. The microphone signals are thus given by:

y = x_M + x_F + v    (38)

where x_M and x_F denote the reverberant male and female signals, respectively, as captured by the microphones, and v denotes a spatially and spectrally white noise signal. Both the DOA matrix D and the RTF matrix G were estimated from the mixed signals y, and the corresponding LCMV-BFs were constructed. The beamformers were then independently applied to the male and female components of the received microphone signals:

\hat{S}_{j,M} = w_j^H x_M,  \hat{S}_{j,F} = w_j^H x_F,  j = 1, 2    (39)

Now, if the estimated RTFs are approximately equal to the true ones, we expect the algorithm to produce the following two-channel outputs: either \hat{S}_1 \approx S_M and \hat{S}_2 \approx S_F, or \hat{S}_1 \approx S_F and \hat{S}_2 \approx S_M, where S_M and S_F are the male and female speech signals as observed at the reference microphone. The two alternative outputs result from the permutation ambiguity problem that was discussed above, which may be arbitrarily encountered at each time-frame. We can now define the SIR measures:

SIR_j = \frac{1}{L} \sum_{\ell=1}^{L} \left| 10 \log_{10} \frac{\sum_k |w_j^H x_M|^2}{\sum_k |w_j^H x_F|^2} \right|

for j = 1, 2, evaluated both for the LCMV-BF based on the direct-path RTFs only (SIR_j^D) and for the LCMV-BF based on the estimated full RTFs (SIR_j^G). Since an absolute value of the ratio (in dB) is calculated for each time-frame, these measures are indifferent to the permutation problem.
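The permutation-indifferent SIR measure can be sketched as follows; inputs are per-frame STFT coefficients of the male and female components at one beamformer output, and the function name is illustrative:

```python
import numpy as np

def sir_db(target_frames, interf_frames):
    """Average per-frame |SIR| in dB between the target and interference
    components at a beamformer output (sketch); the absolute value makes
    the measure indifferent to the output permutation.
    Inputs: (L, K) arrays of per-frame STFT coefficients."""
    p_t = np.sum(np.abs(target_frames) ** 2, axis=1)  # per-frame target power
    p_i = np.sum(np.abs(interf_frames) ** 2, axis=1)  # per-frame interference
    return float(np.mean(np.abs(10 * np.log10(p_t / p_i))))

# Toy case: the interferer is attenuated by a factor of 10 in amplitude,
# i.e., 20 dB in power, at every frame.
sir = sir_db(np.ones((4, 8)), 0.1 * np.ones((4, 8)))
```

Because the absolute value is taken per frame, a frame where the output permutation flips (interferer stronger than target) contributes the same magnitude, as intended by the measure.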
To evaluate the SIR improvement of the algorithms, we also calculate the input SIR at the reference microphone:

SIR_{in} = \frac{1}{L} \sum_{\ell=1}^{L} \left| 10 \log_{10} \frac{\sum_k |X_{1,M}|^2}{\sum_k |X_{1,F}|^2} \right|

The output SIR results for all experiments are presented in Table 5. It can be verified that the SIR_j^D results are generally higher for the proposed algorithm than for the baseline algorithm, due to the better DOA estimation accuracy.
The ratios SIR_j^G are also generally better for the proposed algorithm. The improvement apparently stems from the use of the MCWF for the RTF estimation in (19), which provides better separation between the speakers within the RTF estimation procedure.
Finally, the algorithms are evaluated by assessing the sonograms of the various outputs for Experiment #9 as depicted in Fig. 7.
Careful examination of the sonograms demonstrates the improved separation capabilities of the proposed algorithm in comparison with the baseline algorithm. For example, examining the signals in the time periods 2-3 s and 5-6 s, it can be verified that the proposed algorithm better suppresses the female speech at Output 1, as compared with the baseline method, while maintaining low distortion for the male speaker.
The proposed speaker separation procedure is also evaluated versus the baseline algorithm for different reverberation levels (T60) and SNR levels. The SIRs for different T60 values and for SNR = 40 dB are presented in Table 6. The SIRs for various SNR values and for T60 = 0.2 s are presented in Table 7.
It is evident from Table 6 that the performance of both the baseline and the proposed algorithms degrades with increasing reverberation level, and that the proposed algorithm outperforms the baseline algorithm. Analyzing the results in Table 7, it is clearly demonstrated that the performance of the proposed algorithm does not degrade with decreasing SNR in the range 0-40 dB. Generally, for both the proposed and the baseline algorithms, the utilization of the RTFs enhances the separation capabilities as compared with direct-path-only systems. Finally, the proposed separation technique is evaluated for different values of \eta. Recall that \eta is used in (25) to limit the RTF estimation to TF bins with a single dominant speaker. The SIRs for different values of \eta, for SNR = 40 dB and T60 = 0.2 s, are presented in Table 8. It can be verified that the choice of \eta significantly influences performance and that setting \eta = 0.6 yields the best results.

Comparison with open embedded audition system (ODAS)
In this section, the proposed algorithm is further evaluated versus a state-of-the-art algorithm, namely ODAS.
ODAS is an open-source library dedicated to combined sound source localization, tracking, and separation. Two static speakers were simulated using an open-source signal generator. The DOAs of the speakers were set to 45° and 135° w.r.t. the array center, and their distance from the array was set to 1 m. Clean speech utterances were randomly drawn from a set of male and female speakers. The reverberation time was set to T60 = 0.3 s. The performance of the proposed algorithm and of the ODAS algorithm was evaluated as a function of the overlap percentage between the speakers. Two widely used speech quality and intelligibility measures, namely the perceptual evaluation of speech quality (PESQ) [35] and the short-time objective intelligibility measure (STOI) [36], were used to evaluate the performance of the algorithms. The comparison between the algorithms is reported in Table 9. It is clearly demonstrated that the proposed algorithm outperforms the ODAS algorithm in both measures.

Conclusions
We have presented an online algorithm for separating moving sources. The proposed algorithm comprises two stages: (1) online DOA tracking and (2) online RTF estimation. The two stages employ different statistical models. The estimated RTFs are used as building blocks of a continuously-adapted LCMV-BF. The proposed algorithm is compared with a baseline method using real recordings in the challenging task of separating concurrently active and moving sources.