Forward-backward recursive expectation-maximization for concurrent speaker tracking

In this paper, a study addressing the task of tracking multiple concurrent speakers in reverberant conditions is presented. Since both past and future observations can contribute to the current location estimate, we propose a forward-backward approach, which improves tracking accuracy by introducing near-future data to the estimator, in the cost of an additional short latency. Unlike classical target tracking, we apply a non-Bayesian approach, which does not make assumptions with respect to the target trajectories, except for assuming a realistic change in the parameters due to natural behaviour. The proposed method is based on the recursive expectation-maximization (REM) approach. The new method is dubbed forward-backward recursive expectation-maximization (FB-REM). The performance is demonstrated using an experimental study, where the tested scenarios involve both simulated and recorded signals, with typical reverberation levels and multiple moving sources. It is shown that the proposed algorithm outperforms the regular common causal (REM).


Introduction
The task of multiple target tracking (or dynamic localization) has significant importance in civil, military and surveillance applications such as improving beamforming accuracy in speech enhancement applications, e.g. speech separation, indoor robotic assistance, and automatic steering of cameras [1][2][3][4]. Although many state-ofthe-art approaches to speech separation are based on deep learning, model-based methods are preferable in cases where training data is not available. As for computervision-based methods for speaker tracking, in several cases, cameras are not allowed due to power consumption or privacy constraints. In addition, these methods are not suitable in cases where there is no direct line-ofsight between the sensor and the speaker being tracked. Classical tracking addresses only a subset of possible trajectories, since it usually makes assumptions, regarding e.g. the target velocity, acceleration, and jerk. In case of noncontinuous signals like speech, this might lead to tracking loss.
There are scenarios where the sensor arrays are moving [5], but usually in a room environment, the arrays are assumed to be static and only the sources are moving and should be tracked. Some tracking algorithms are based on direction of arrival (DOA) estimation. The multiple signal classification (MUSIC) algorithm [6] applies a subspace method that was later adapted to the challenges of speech processing in [7]. The steered response power with phase transform (SRP-PHAT) algorithm [8] uses generalizations of cross-correlation methods for DOA estimation. This algorithm can be applied to both 2D or 3D localization and tracking problems. Although several features were used in the literature for speaker localization and tracking, the most commonly used features are the sub-band time difference of arrivals (TDOAs) [9] or DOAs [10].
Supervised learning methods can also be used for this task. Deep learning methods can be trained to find the DOA in different acoustic conditions. Deep learning methods have recently been proposed for sound source localization. In [11,12], simple feed-forward deep neural networks (DNNs) were trained using generalized cross-correlation (GCC)-based audio features, demonstrating improved performance as compared with classical approaches. Yet, this method is mainly designed to deal with a single sound source at a time. In [13], the authors trained a DNN for multi-speaker DOA estimation. In [14,15], time-domain features were used, demonstrating performance improvements in highly reverberant enclosures.
The main drawbacks of the DNN approaches for tracking applications are the need for large, usually labelled, training sets and the high sensitivity to mismatch between train and test conditions. When we are interested in ad hoc distributed networks, training is not always possible and changes of array constellation between train and test can often occur. Various tracking algorithms for distributed microphone arrays were proposed in [16][17][18][19][20][21][22]. In [23][24][25][26], the outdoor case is emphasized, where sensor noise is usually dominant, unlike indoor environment that is dominated by reverberation. Well-known Bayesian approaches were also applied for tracking, such as probability hypothesis density (PHD) [27,28], particle filters, and other statistical based methods [29][30][31][32][33]. Several tracking and localization algorithms were tested as part of the LOCalization And TrAcking (LOCATA) challenge [34,35].
The classical Bayesian approach is not optimal for the task of tracking concurrent speakers within a room, because subjects tend to naturally and arbitrarily move without an organized route and with many pauses. For this application, the non-Bayesian maximum likelihood (ML) approach might be more suitable, since it assumes that the speakers' trajectories are deterministic unknown parameters, bearing no statistical model. A well-known procedure for inferring the ML estimate is the expectation-maximization (EM) algorithm [36] that iteratively increases the likelihood function.
For the task of online speaker tracking, a recursive version of the EM is required, since it facilitates tracking of time-varying parameters, and enables online estimates with relatively low computational and memory loads. The first recursive version of the EM algorithm was suggested by Titterington [37], where a Newton-based method minimizes the Kullback-Leibler divergence (KLD) between the actual and parametric likelihood function, assuming that the observations are independent and identically distributed (i.i.d.). An almost surely convergence proof of Titterington recursive expectation-maximization (TREM) algorithm was given by Wang and Zhao in [38,39], based on the results of Delyon [40]. A stochastic approximation version for the EM algorithm was proposed by Delyon et al. in [41], and its convergence was proven therein. A further study of the recursive expectation-maximization (REM) approach appeared in [42] for the problem of DOA estimation, using TREM and another recursive algorithm suggested by the authors. They showed that both algorithms converge with probability one (w.p.1.) to a stationary point of the likelihood function.
A different approach of REM was proposed by Cappé and Moulines [43], in which the model parameters and the hidden signal are estimated simultaneously. In the E-step, the sufficient statistic for the parameters is recursively updated using the latest observation. In the M-step, the parameters are optimized using the latest statistic approximation. It is shown in [43] that this series of parameters estimates converges to local minima of the KLD in the case of independent observations. REM-based algorithms for speaker tracking in noisy and reverberant environments were presented in [44], where both Titterington's REM [37] and Cappé-Moulines' REM [43] were applied to the problem. The W-disjoint orthogonality (WDO) property was assumed to hold, as commonly used by other algorithms, in order to improve the robustness and to facilitate concurrent speaker tracking [9,[45][46][47][48].
Spatially distributed microphone nodes were used in [44], but computations were carried out in a central processing unit. In many applications/scenarios, since the communication bandwidth (BW) is limited, distributed computation is beneficial, where each node executes part of the computations and transmits its result rather than the entire observed data [49]. The ring-based algorithm and its modifications are based on the incremental expectation-maximization (IEM) principle suggested in [50]. The recursive distributed expectation-maximization (RDEM) method was recently proposed for distributed source localization [51], using Titterington's REM [37]. In this paper, we use the same topology from [51].
Since the methods in [44] and [51] apply only causal recursion, the localization accuracy in silent or noisy timesegments often deteriorates. Exploiting both past and future observations may improve the estimation accuracy of the current state, especially for non-stationary signals. The estimation pertaining to the past data is usually referred to as forward filtering, while the state estimation pertaining to the future data is referred to as backward filtering, which runs backwards. In online applications, the usage of future observations imposes latency to the overall system and should therefore be restricted.
A bi-directional version of recursion estimation has been presented in [52] for video signals, where the data is processed off-line in order to use the future samples in addition to the past samples. Some of the Bayesian approaches deal with specific speech processing applications with techniques developed originally for communication applications. A maximum a posteriori (MAP) approach that exploits the Viterbi algorithm was presented in [53], where a forward-backward recursion has been used to address pauses of the speech signal.
In this paper, we propose a new tracking mechanism and use it to modify the recursive distributed expectationmaximization (RDEM) [51], resulting in the tracking forward-backward recursive expectation-maximization (TFB-REM), which is a non-Bayesian algorithm. The new algorithm allows to configure the latency according to the real-time constraints of the system and obtains an improved performance.
The rest of the paper is organized as follows. Formulation of the tracking problem and its probabilistic model is described in Section 2. Then, in Section 3, the forwardbackward recursive expectation-maximization (FB-REM) is introduced and applied to the tracking problem of multiple speakers in Section 4. Experimental results are given in Section 5, and conclusions are drawn in Section 6.

Speaker tracking problem formulation
One of the major applications of the recursive algorithms is target tracking. In this work, focus is given to the tracking of concurrent speakers, a more challenging task due to the non-stationary and intermittent nature of speech signals. The problem is formulated in the time-frequency domain and in the spatial domain. Let b = 1, . . . , B denote the frequency bin index, and S the number of acoustic signals captured by M microphones (an even number), organized in M/2 independent pairs. The signal captured by the ith microphone, i = 1, 2 of the mth pair, m = 1, . . . , M/2, is given by where s = 1, . . . , S is the source index, v s (t, b) denotes the sth source signal, n i m (t, b) denotes an additive noise, and a i sm (t, b) denotes the time-variant room impulse response (RIR).
For each time-frequency (T-F) bin, the pair-wise TDOAs, denoted by τ m,b (t), are calculated from the crosscorrelation. Under the assumption of speech sparsity [54], each time-frequency bin vector is dominated by a single source, namely that the WDO assumption is satisfied. This implies that for each T-F bin, the summation in (1) is dominated by only a single source.
Let p =[ x, y, z] be the 3D Cartesian coordinates within a given indoor space and further let P be the set of all candidate speakers' positions, without assuming any prior knowledge about the number of the sources and their dynamics. Multiple positions may receive high values according to the number of active speakers. The grid of candidate positions is defined similarly to [51]. The noiseless TDOAs associated with each grid point can be calculated in advance from geometrical considerations: where p 1 m and p 2 m are the locations of the microphones assumed to be perfectly known, ||·|| denotes the Euclidean norm, and c is the sound velocity.
We then attribute a mixture of Gaussians (MoG) statistical model to the TDOAs: where σ 2 is the Gaussians' variance, which is assumed to be a known, constant, parameter. The weights w p (t) are unknown parameters, designating the probability of a speaker to be located at position p at time t that should satisfy the following constraints: The set of unknown parameters to be found by the EM algorithm consists of the Gaussian weights, w(t) = vec p w p (t) , since the mean of each Gaussian is calculated in advance over the grid of positions and its variance is assumed to be known. This set of unknown parameters can be thought of as a set of hypotheses for potential speakers' positions. Each grid point with sufficiently high weight will be marked as a potential speaker's position. The probability density function (p.d.f.) of all augmented measurements at each node at time t is given by In the following sections, we propose a forwardbackward method for the estimation of this model's parameters, w(t). First, in Section 3, we derive the general scheme of the forward-backward version of the REM. Then, in Section 4, the general scheme is applied to the above model for speaker tracking.

FB-REM-general derivation
We begin the derivation in defining the criterion for online parameter estimation and present the general algorithm proposed in this paper, namely the FB-REM scheme. The FB-REM approach for parameter estimation is, similarly to the more general REM approach, a non-Bayesian method that does not rely on a statistical model for the parameters. In this section, we use θ for the set of parameters. In Section 3.1, we discuss the classical REM, and in Section 3.2 the FB-REM.

REM
The REM algorithm is first derived for i.i.d. observations, denoted by τ (t). The true, unknown p.d.f. of the observations is denoted by h(τ ), and the parametric p.d.f. is f (τ ; θ ). A common criterion for online parameter estimation is stated in terms of the KLD, defined as where E{·} denotes the expectation taken with respect to (w.r.t.) h(τ ), the time index t does not appear in τ (t) since time dependency is cancelled out by the expectation operation and due to the i.i.d. nature of the observations. In our case, we write the optimization criterion as a minimization of the KLD: It is noteworthy that θ * and the ML (batch) estimation of θ are asymptotically equivalent, i.e. they converge for growing number of observations. The KLD criterion is commonly used in online procedures [43], since it fits dynamic cases, where the parameters are time-varying. As discussed in Section 1, the EM is a common approach for maximizing the likelihood in problems involving hidden data. The hidden data at time t is defined as y(t), and the respective complete data p.d.f. is assumed to be i.i.d. and denoted by f (τ (t), y(t); θ).
The specific definition of y(t) for our tracking problem is given in Section 4.1. The EM can also be applied recursively, e.g. as presented by Titterington [37]. The TREM maximizes the KLD using the recursion where θ Ti t is the previous estimate of θ , s τ (t); θ denotes the scoring function, and I C (θ ) denotes the Fisher information matrix (FIM) of the complete data, where E C denotes the expectation taken w.r.t. the complete data p.d.f f (τ , y; θ ). The computation of Eqs. (9) and (10) requires the estimation of the hidden data, which is derived in Section 4. It was shown in [38] that under certain regularity conditions, the TREM converges w.p.1. to a stationary point of the KLD. However, a w.p.1. convergence is impossible in cases of time-varying parameters, and we therefore substitute t −1 in (8) by a constant smoothing coefficient (1 − γ F ), facilitating tracking capabilities for dynamic cases. Denote by θ F t the parameter estimation at time t, obtained by the following forward recursion, Note that unlike (8), we also normalized θ F t by γ F , which is equivalent to a constant attenuation and does not affect the asymptotic behaviour of the algorithm. The procedure (11) also converges to a stationary point of the KLD, but in a weak sense.
Further note that when the parameters are timevarying, the observations are no longer identically distributed, and asymptotic convergence is not relevant anymore. For an extensive theoretical examination of this recursive scheme for parameter estimation, please refer to [55] and Chapter 8 in [56]. However, an intuitive yet accurate description of the convergence can be given as follows. For higher values of γ F , the update rate of the algorithm is slower, meaning that the estimation is biased by the past values of the parameter, but the estimation variance is lower, due to the longer averaging window. For lower γ F values, the convergence speed is higher, which means a lower bias, but this comes at the expense of higher variance due to the shorter averaging window.
In the following section, we propose a method that improves the performance of (11) at the expense of higher latency. This is done by utilizing future observations in the estimation procedure and may reduce both the bias and the variance of the estimator. Reduction of the bias induced by past information is very important in the non-stationary case of moving speakers.

The proposed FB-REM approach
To use the near-future observations, we propose the FB-REM algorithm, defined by where θ F t was defined in (11), and 0 ≤ α FB ≤ 1 is a weighting factor of the past and the future terms. The backward estimator θ B t is calculated by the backward recursion where D is the number of future samples used for the current estimate.
In a Bayesian framework, there would have been an optimal choice of the smoothing factors γ F , γ B , and α FB , obtained by a Bayesian statistical model. The gain of the Kalman smoother [57] and the coefficient used in Viterbi algorithm [58] are determined this way. However, since we intentionally adopted a non-Bayesian approach, the values of γ F , γ B , and α FB are determined according to the required dynamics of the algorithm, and the nature of the stochastic processes. We previously discussed how γ F trades-off the update speed versus the accuracy of the algorithm. Similarly, high γ B reduces the variance of the algorithm in the expense of higher bias. Finally, α FB determines the weight of the past and the future observations on the current estimate.
In practice, α FB is mainly determined by the number of the future observations that are actually used to update the estimation, which in turn is determined by the latency constraints of the application. Rewriting (13) as It can be seen that for every value of γ B and every computing precision requirement, there exists an integer D max such that γ D max B ≈ 0. If D = D max is chosen, the backward recursion is equivalent to an infinite backward recursion, similarly to the forward recursion for large enough t values. However, in practical applications, using D max future observations might introduce unacceptable latency, and a lower value of D should be chosen instead. This choice deteriorates the accuracy of the backward recursion (13), which should be compensated by increasing the value of α FB .
In the next section, the algorithm (12) will be applied to the problem of multiple speaker tracking.

Tracking forward-backward REM (TFB-REM) for concurrent dynamic speakers
The concept developed above is applied to the multiple speaker tracking problem. We first briefly mention the tracking forward-recursive expectation-maximization (TF-REM) for the forward direction, as developed in [51] for localization using only past and present samples. Then, the TF-REM algorithm is extended by adding the backward recursion using the FB-REM that was developed in the previous section. The combination of these two directions is dubbed tracking forward-backward recursive expectation-maximization (TFB-REM). The rest of this section is organized as follows. In Section 4.1, we define the hidden data model we use in the tracking problem, and in Sections 4.2 and 4.3, the REM and the forwardbackward REM algorithms are applied to this model, respectively.

Hidden data statistical model
The TF-REM was already presented in [51] and is briefly repeated here, since it is the basis for developing the TFB-REM in the next section. First, we present the hidden data model that complements the observation p.d.f. in (5). Following [51], we let y m (t, b, p) be an indicator random variable that equals to one if a speaker located at p is active in the observed variable τ m,b (t) and zero otherwise. In other words, y m (t, b, p) = 1 means that a speaker in p is present in the (t, b) bin of the mth node. Intuitively, the tracking challenge becomes much simpler to solve given this additional information. This formulation also allows a distributed implementation of the algorithm, which is useful in many cases of ad hoc networks. Each pair of microphones may describe an electronic device with independent processing unit and communication capability.
The conditional p.d.f. of the observations, assuming independence between nodes is where y(t) = vec m,b,p (y m (t, b, p)) is the hidden data set. Thus, the complete data p.d.f. is given by In Section 4.2, y m (t, b, p) is estimated by the forward recursion, and in Section 4.3, its backward estimator is given. These estimates are denoted byŷ  (t, b, p), respectively. Correspondingly, the forward and backward estimates of w p (t) will be denoted byŵ F,p (t) andŵ B,p (t), respectively.

RDEM applied in the forward direction
In [44] and [51], the TF-REM was derived for the general algorithm in (11) in detail, and only the resulting formulae are given in this section. In the E-Step, estimation of the hidden data is given bŷ Now, define the aggregation of the hidden variables along the frequency axis as: t, b, p).
The results of the E-step are utilized for weight estimation per position:  (19) 3) Find sources location: Apply a threshold toŵ F,R (t) 4) Go back to 2) for next time point.
As was shown in [44] and [51], the M-Step reduces to a compact recursive equation: whereŵ F (t) is a vector consisting of the instantaneous version of the parameters, The TF-REM procedure is summarized in Algorithm 1. Note that the local hidden set facilitates a wide range of network topology alternatives. These distributed computation aspects are left for future study, as the main contribution of this manuscript is independent of the network topology.

TFB-REM for multiple speakers
In order to develop the TFB-REM, we first derive the backward recursion. Its derivation follows the steps of the TF-REM replacing the notation F with B and the time index (t − 1) with (t + 1) in Eqs. (17)-(21) above. In the backward recursion, the time samples are processed in the anti-causal direction, and the following formula is obtained. (22) Due to this anti-causal process, we have to choose a processing delay, defined above as D. The algorithm tracks the speakers along time using relevant past and future samples according to this processing delay and the speaker dynamics.
Finally, the FB-REM approach in (12) for the problem of speaker tracking can be written as, This is a weighted superposition of two separate localization maps created independently by the two RDEM processes that build jointly a single localization map for each relevant time point. Note that when α FB = 1, there is no use of future data and we obtain the TF-REM derived in the previous subsection.
As in many other EM-based algorithms, initialization is an important and non-trivial task. The TF-REM proceed from one time step to another according to the recursion. The backward RDEM at time t is initialized with a uniform position distribution at the future time t + D.
The localization map in (23) consists of soft values representing the probability of an acoustic activity at each specific position in the room. After estimatingŵ FB,R (t) at each time step, we apply a threshold for all values to determine the active positions, meaning the number of active speakers and their current positions.
Before switching to the next time point estimation, we setŵ which is assumed to be an improved estimate of the current acoustic position map. We then start the entire procedure for both RDEM and their weighted combination described above for the next time step. The entire TFB-REM procedure is summarized in Algorithm 2.

Results and discussion
This section evaluates the performance of the TFB-REM algorithm and compare it to TF-REM and to a generalization of the SRP-PHAT algorithm [8] for the 2D case [59]. The study comprises four subsections. In Section 5.1, we discuss the parameters of the algorithms and the considerations taken when tuning them. In Section 5.2, we describe the room setup for simulation and recordings, and Section 5.3 compares the various algorithms using Monte Carlo simulations and then examines the sensitivity to array perturbations. Finally, in Section 5.4, we compare the performance of the algorithms for real-life recordings of moving speakers acquired at the Bar-Ilan acoustic lab.

Parameter choice
There are a few important parameters mentioned above that should be chosen to improve the tracking algorithm performance. Two important parameters are the TREM smoothing factors γ F and γ B . The parameter γ F was already tuned experimentally for the regular RDEM [44]. We tuned both values empirically for our model. We tried values in the range 0.01 − 0.9 and eventually chose γ F = 0.015 and γ B = 0.04.
The next parameter to be discussed is the past-future weighting coefficient, α FB . For off-line applications, we might expect the past and the future samples to be equally weighted, meaning α FB = 0.5, since the relevance of past or future observations to the present observation is similar. However, since we apply the proposed algorithms in an online scenario, the overall latency was restricted, thus rendering the backward recursion less accurate than the forward recursion. In this case, the forward recursion should be weighted more than the backward recursion. The results show that the best choice was 0.65 (we tried values in the range of 0.1 − 0.9).
An important parameter, which has close relations with the smoothing factors, is the latency, D. Assuming slow dynamics for indoor speaker tracking, it is reasonable to set D to approximately 1 s. It is obvious that for online applications, as addressed here, we might set D to a smaller value.
The MoG variance, σ 2 was set to 2[Samples 2 ]. The unit [Samples] is a function of the sampling frequency, which is set to 16 kHz. In this experimental study, the variance is fixed over time and frequency bin, but we found that it can significantly influence the performance. Therefore, an adaptive mechanism for setting σ 2 is left for a future study.
The size of a single unit in the localization grid was set to 0.10 × 0.10 m 2 , which was shown in [44] to be sufficient for tracking real speakers that should not be treated as point sources, due to their body volume. Preliminary tests showed that the chosen resolution best fits the trade-off between computational complexity and required accuracy.
The number of frequency bins was also examined experimentally within the range that is reported in other narrow-band approaches [9], and was set to B = 16 bins. The choice of the short-time Fourier transform (STFT) frame length is 64 ms, similarly to previous localization and tracking algorithms [44,49,51,[60][61][62][63].

Room setup
To evaluate the tracking capabilities of the algorithms, we tested them in both simulated data and recordings of real human speaker setups, as described in the sequel. The dimensions of the simulated room were 6 × 6 × 2.4 m, with twelve pairs of microphones encompassing the acoustic scene in the room. The number of microphone pairs (or nodes) is very important in the presence of high noise and high reverberation levels. A setup of twelve microphones pairs was chosen following the experimental study in [44]. A map of the simulated room and microphones is given in Fig. 1, where the microphone positions are marked by 'o'-s, and every pair is numbered from 1 to 12. In the examined scenario, all speakers and microphones are positioned at the same height, for simplicity.
Simulation of moving sources was carried out by the image method [64], with an efficient implementation of the RIRs computation in [65]. In this experiment, one, two, or three sources moved along randomly chosen trajectories within the room. The RIRs along these trajectories were sampled every 0.04 s to generate moving sources in reverberant environment.
Recording of real human speakers was carried out as described in [44,51]. The recordings were carried out in the speech and acoustic lab of Bar-Ilan University. This is a 6 × 6 × 2.4 m room with a reverberation time controlled by 60 interchangeable panels covering the room facets. The measurement equipment includes a RME Hammerfall DSP Digiface sound-card and a RME Octamic (for Microphone Pre Amp and digitization (A/D)). AKG type CK-32 omnidirectional microphones were used. All measurements were carried out with a sampling frequency of 48 kHz and a resolution of 24 bits. The multi-microphone signals were acquired using Matlab © . A snapshot of the room tuned for low reverberation level (T 60 = 250 ms) is shown in Fig. 2.
In the real recordings, only seven pairs of microphones were used. As in the simulated experiment, we focused on a two-dimensional tracking, while the height estimation can be derived as an extension of the algorithms. While we assume a two-dimensional setup with all the microphones positioned at height, 130 cm from the floor, we note that real human speakers are recorded and their height vary a bit.
For controlling the recordings and determining an accurate ground-truth for the tracking algorithms, we had a video camera in the room as well as very precise paths marked on the floor of the room. The speakers were walking along the paths and were recorded with all microphones and video in a synchronized way.

Performance in simulated recordings
Using simulated recordings enables a comprehensive statistical investigation and testing of a large variety of trajectories in the room. Three different scenarios were examined. The first scenario consist of a single speaker and reverberation time T 60 = 400 ms. The two other scenarios consist of either two or three speakers and reverberation time T 60 = 120 ms.
To compare the performance of the different algorithms, we followed the procedure described in [49,51], updated for a dynamic speaker scenario. We executed 100 Monte Carlo trials and calculated the root mean square error   Table 1. The performance of the SRP-PHAT, TF-REM, and TFB-REM are compared for one-, two-, and threespeaker scenarios. As evident from the table, the reference algorithm SRP-PHAT has the highest RMSE for more than one speaker.
The TF-REM has higher RMSE than the TFB-REM, as expected by the incorporation of future data.
The trajectories of one trial of the simulated scenario with two speakers for the TF-REM algorithm is shown in Fig. 3. The real locations are marked with black circles ('o') and the estimates with blue dots ('·'). The same scenario with the TFB-REM algorithm is shown in Fig. 4. The real  trajectories of the speakers are sampled in slightly different time stamps, due to the latency that is introduced to the TFB-REM algorithm. It can be observed that for both algorithms the majority of estimates are very close to the real locations. The TF-REM produces three outliers, while the TFB-REM has only a single outlier.
A profound advantage of integrating information from various time slots is to bridge gaps of inactive time periods. We examined such a case and computed the tracking error for both algorithms as shown in Fig. 5. It can be observed that the TFB-REM outperforms the TF-REM in this case for the relevant time slots (around 8 s) and later around 14 (smaller differences). The estimation is applied once every second, since dynamic is slow indoor and the complexity of the tracking highly depends on the resolution used.
We have also evaluated the effect of imperfect microphone positions on the performance. The results for one scenario with two speakers is presented in Table 2. The performance of the SRP-PHAT, TF-REM, and proposed TFB-REM are evaluated for a few random misplacement of the microphones. The proposed algorithm outperforms the other two algorithms for standard deviation of up to 50 mm. The highest performance advantage is in the case of perfect array calibration.

Performance for real human recordings
The real recordings include two experiments with human speakers. The first recording was executed with reverberation time T 60 = 250 ms and a single speaker standing for 9 s and then walking on a 2.10-m-long straight line for 33 s. From this point, we focus on the two recursive algorithms in order to illustrate the influence of using future samples. The tracking errors for the two algorithms are shown in Fig. 6. It can be observed that the TFB-REM  The second recording involved two concurrent speakers standing for 2 s and then walking on parallel straight lines towards each other for 21 s. The averaged tracking errors for both algorithms are shown in Fig. 7. It can be observed that the TFB-REM significantly outperforms the TF-REM. The RMSE for the TFB-REM is 0.48 m, and for the TF-REM, it is 0.83 m. The additional speaker naturally degrades the tracking results of both algorithms.

Conclusions
We have developed an online, non-Bayesian, algorithm for multiple concurrent speaker tracking in reverberant environments. We have first introduced a backward recursive version of the RDEM algorithm. Then, the TFB-REM, a combined forward-backward RDEM algorithm, was developed. The TF-REM and the TFB-REM algorithms are evaluated using both simulated environment and real recordings of walking humans and compared with a baseline method, a variant of the SRP-PHAT method for the 2D case.
We demonstrated that the introduction of the short latency in the TFB-REM indeed improves performance, especially by facilitating the tracking of intermittent sources by bridging over short silence periods. Specifically, we have compared the tracking capabilities of the forward and the forward-backward schemes in single-, two-, and three-speaker scenarios. While the TFB-REM only slightly outperformed the TF-REM in the singlespeaker scenario, it was significantly better for two or three concurrent speaker scenarios.
Unlike other approaches (mainly Bayesian approaches), the recursive algorithms neither make assumptions regarding the speakers' dynamics nor require training data, as most DNN-based approaches. They are also shown to be rather robust to inaccuracy of array location, which is a very common imperfection in ad hoc networks.