Skip to main content

Estimation of acoustic echoes using expectation-maximization methods


Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echolabeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods, for jointly estimating the TOAs and directions-of-arrival (DOA) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, it can provide accurate estimates at SNR of 10 dB lower compared to TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping of an indoor environment.


During the past decade, there has been an increased research interest in robot and drone audition [13]. Hearing capabilities enable robots to understand and interact with humans [4]. Moreover, it has also been proven useful for sensing the physical environment. For example, it can be used for estimating the locations of acoustic sources, the position of a robot or drone, and the positions of acoustic reflectors and for inferring room geometry [5, 6]. Potentially, this can enable autonomous indoor operation of robots and drones.

Some different approaches for tackling the above estimation problems have already been considered. In a broad sense, these can be classified as being either passive or active. The passive approach relies on using external sound sources in the environment to conduct the localization. Examples of such sources could be human speech, noise from machinery, or ego-noise from other robots or drones. This approach was, e.g., used for solving the acoustic simultaneous localization and mapping (aSLAM) problem [79]. With aSLAM, it is possible to estimate the robot location relative to a number of passive acoustic sources in its vicinity. One obvious advantage of such passive approaches is that they are non-intrusive since only already existing sounds are used in the estimation. This comes at a price, however, since many acoustic sources, such as human speech, contains periods of inactivity, which can lead to unreliable estimates. This is particularly true with moving objects such as robots and drones. Moreover, to facilitate autonomous indoor operation, it is of great importance to also estimate the location of acoustic reflectors, e.g., walls, which is difficult with the passive approach, where only relative timing information is available.

The alternative, which we consider in this paper, is the active approach. In this approach, one or more loudspeakers are used to probe the environment using a known signal. Subsequently, a number of microphones are used to record the sound after it has propagated through the environment. Compared to the passive approach, this facilitate the estimation of the times-of-arrival (TOAs) of both the direct and reflected sound components. With this information, the localization accuracy can be increased significantly compared to the passive approach, and the task of acoustic reflector localization becomes less complex. In the following, we briefly outline some of the most recent and relevant work on active approaches. Some authors have considered the problem of estimating both room geometry and a robot’s position with a setup consisting of a collocated microphone and speaker pair [10]. To achieve this, they utilize TOA estimates of the first order reflections. The TOAs are assumed known or estimated beforehand. To tackle the estimation problem with the considered single-channel setup (i.e., one microphone and one loudspeaker), they consider multiple observations from different time instances and locations, i.e., movement is assumed. Based on this, they then proposed two different methods: a method based on basic trigonometry, and another one based on Bayesian filtering. A similar approach also based on a priori RIR/TOA knowledge was considered using a multichannel setup in the context of robotics in [11]. Other authors considered an approach where the TOAs of the first order echoes are utilized for estimating the arbitrary convex room shapes [12]. As briefly mentioned, these as well as other active approaches do not consider the TOA estimation problem, which is an equally important and difficult problem in itself due to, e.g., spurious estimates [13]. Moreover, methods relying on first- and second-order reflections only suffer from the inevitable problem of echolabeling [14]. In addition to this, many methods are based on only one microphone and one loudspeaker, but this lead to ambiguity in the mapping of the TOA estimates of the first-order reflections unless more transducers are included or movement is exploited.

These issues will be addressed in this paper, where we consider a setup consisting of a microphone array which is collocated with a single loudspeaker. More specifically, we consider a uniform circular array that could be placed on the perimeter of, e.g., a drone or robot platform, with a loudspeaker located in its center. With this setup in mind, we propose a number of expectation-maximization (EM) methods for estimating both the TOAs and directions-of-arrival (DOA) of a number of the acoustic reflections. This has the benefit of not only yielding more accurate TOAs compared to a single-channel approach, but also of reducing the ambiguity of the estimated reflections since the DOA is estimated simultaneously. In fact, this means that the estimates directly reveal the locations of mirror sources, which greatly simplifies the task of localizing the acoustic reflector positions. The proposed methods are derived in the time-domain, and, thus, estimates the parameters of interest directly from the recorded signals, i.e., not from estimated room impulse responses as in numerous state-of-the-art methods. While joint TOA and DOA estimation is a new topic in the context of robot and drone audition, it has been considered previously in multiuser and multipath communication systems [1517]. However, it has not yet been considered for acoustic reflector localization to the best of our knowledge. The paper builds on the results reported in our earlier paper [18] and extends on this work in several ways. First, we relax our previous noise assumptions and derive the optimal estimators for these more realistic scenarios. The first scenario deals with spatially independent white Gaussian noise with different noise variances across the microphones, e.g., to simulate low quality or faulty microphones. The second scenario considered deals with spatio-temporarily correlated noise, which we tackle using prewhitening. Here, we include different approaches for the prewhitening. Moreover, we have included a beamformer interpretation of one of the proposed multichannel estimators, which provides an intuitive understanding of the EM-based method. In addition to this, we included further experimental work to show case the merits of the different proposed estimators and how they compare with traditional methods.

The rest of the paper is organized as follows. In Section 2, we propose the signal model for the considered setup along with a problem formulation. Then, in Section 3, we briefly revisit the single-channel EM method for TOA estimation, which serves as our reference method. Inspired by this, we then proceed with the derivation of the different TOA and DOA estimators in Section 4. Finally, the paper closes with the experimental results and conclusions in Sections 5 and 6, respectively.

Problem formulation

We now proceed to lay the foundation for the derivation of EM-based methods for estimating the TOA and TDOA of the acoustic echoes. This is done by formulating the relevant temporal and spatial signal models.

Time-domain model

Consider a setup with a single loudspeaker and M microphones that are assumed to be collocated on some hardware platform, e.g., a mobile robot or a drone. The loudspeaker is used to probe the environment with a known sound while the microphones are used to record the sound emitted by the loudspeaker including its acoustic reflections from physical objects and boundaries, e.g., walls. Both the microphones and loudspeakers are assumed to be omnidirectional and ideal. While this assumption might not hold in practice, we do not consider the handling of non-ideal characteristics in this paper. As suggested in other work [5], this might be partly addressed by estimating and introducing another filter accounting for the hardware characteristics, which may also be included in the methods proposed later. Moreover, the non-ideal characteristic of the hardware, i.e., loudspeakers could be modeled as shown in [5], but this is not included when formulating the following estimator.

We can then formulate a general model for the signal recorded by microphone m, for m=1,…,M, as

$$\begin{array}{*{20}l} y_{m}(n)=h_{m}*s(n)+v_{m}(n) = x_{m}(n)+v_{m}(n), \end{array} $$

where, xm(n)=hms(n), hm is the acoustic impulse response as measured from the loudspeaker to the mth microphone, and s(n) is a known signal being played back by the loudspeaker. Finally, vm(n) is an additive noise term, which is supposed to model ego-noise from a robot/drone platform, interfering sound sources (e.g., human speakers), thermal sensor noise, etc., that is, the signal s(n) is used to probe the environment to, eventually, facilitate the estimation of the parameters of the acoustic echoes, such as their TOA and TDOA. Thus, we proceed by rewriting the observation model as a sum of the individual reflectionsFootnote 1in noise, i.e.,

$$\begin{array}{*{20}l} y_{m}(n)=\sum_{r=1}^{\infty} g_{m,r}s(n-\tau_{\text{ref},r}-\eta_{m,r})+v_{m}(n), \end{array} $$

with gm,r being the attenuation of the rth reflection from the loudspeaker to the mth microphone, e.g., due to the inverse square law for sound propagation and sound absorption in the acoustic reflectors. Furthermore, ηm,r=τm,rτref,r is the TDOA of the rth component measured between a reference point and microphone m, while τm,r and τref,r are the TOAs of the rth component on microphone m and the reference point, respectively.

Acoustic impulse responses often exhibit a certain structure, which can be characterized by two parts: the early part, which is sparse in time and contains the direct-path and early reflections, and the late part, which is a more stochastic, dense, and characterized by decaying tail of late reflections (Fig. 1). This suggests that we can split the model as [19]

$$\begin{array}{*{20}l} y_{m}(n)&=\sum_{r=1}^{R} g_{m,r}s(n-\tau_{\text{ref},r}-\eta_{m,r})\\ &\quad+d_{m}(n) +v_{m}(n),\notag \end{array} $$
Fig. 1

An example of a synthetic room impulse response illustrating its different parts, i.e., the direct-path component, early reflections and late reflections/ reverberation

where R is the number of early reflections, and dm(n) is the late reverberation. A common assumption is that the late reverberation can be modeled as a spatially homogeneous and isotropic sound field with time-varying power but known coherence function [20]. If we collect N samples from each microphone and assume stationarity within the corresponding time frame, the vector model for our observations becomes:

$$\begin{array}{*{20}l} \mathbf{y}_{m}(n)=&\sum_{r=1}^{R} g_{m,r}\mathbf{s}(n-\tau_{\text{ref},r}-\eta_{m,r})\\ &+ \mathbf{d}_{m}(n)+\mathbf{v}_{m}(n),\notag \end{array} $$

with ym(n), s(n), d(n), and vm(n) being vectors comprising N time samples of ym(n), s(n), dm(n), and vm(n), respectively, e.g.,

$$\begin{array}{*{20}l} \mathbf{y}_{m}(n)=\left[y_{m}(n) \quad \cdots \quad y_{m} (n+N-1)\right]^{T}, \end{array} $$

This leaves us with the problem of estimating R unknown TOAs and MR TDOAs from the observations ym(n), for m=1,…,M. However, if we know the geometry of the loudspeaker and microphone array configuration, we can significantly reduces the dimensionality of this problem by further parametrizing the TDOAs in terms of the directions-of-arrival (DOAs).

Array model

While the array model can in principle be chosen arbitrarily, we choose to exemplify the TDOA modeling with a setup where the loudspeaker is placed in the center of a uniform circular array (UCA). Such a setup could be placed on, e.g., a robot or drone platform to enable the estimation of the angle of and distance to acoustic reflectors, e.g., to facilitate autonomous and sound-based navigation.

If we assume the reference point to be the center of the UCA, it can be shown that the TDOAs, for a setup like this, can be modeled as

$$\begin{array}{*{20}l} \eta_{m,r}=d\sin\psi_{r}\cos(\theta_{m}-\phi_{r})\frac{f_{s}}c \end{array} $$

where d is the radius of the UCA, ψr and ϕr are the inclination and azimuth angles of the rth reflection, respectively, and θm is the angle of the mth microphone on the circle forming the UCA. These definitions are illustrated in the UCA example in Fig. 2. In addition to this, fs is the sampling frequency, and c is the speed of sound.

Fig. 2

Example of a uniform circular array with six microphones

The TDOA model in (5) can then be combined with the observation model in (4). By doing this, the estimation problem at hand is then simplified to the estimation of 2R angles, i.e., ψr and ϕr, for r=1,…,R, rather than MR TDOAs. It should be noted here that the considered UCA configuration introduces ambiguities, e.g., an acoustic reflection impinging from an elevation of 0 will result in the same TDOAs as an acoustic reflection mirrored around the UCA plane, i.e., at an elevation angle of 180. However, this ambiguity can easily be accounted for by applying the proposed methods on array structures with microphones in all three dimensions, e.g., spherical microphone arrays [21].

Single-channel estimation

Before presenting the proposed TOA and TDOA estimators, we briefly revisit an EM-based method for single-channel TOA estimation, i.e., that is with a setup consisting of one loudspeaker and one microphone. The original version of this method was proposed in [22] under a white Gaussian noise assumption and serves as a reference for the proposed methods.

White Gaussian noise

In the following, we leave out the microphone index, i.e., subscript m, since only a single microphone is considered. We assume that the additive noise, i.e., both the late reverberation and the background noise is independent and identically distributed white Gaussian and zero-mean. Later, as part of the proposed multichannel methods, this assumption is substituted with a more realistic one, where the late reverberation is modeled as being spatio-temporarily correlated. The signal model in (4) then reduces to

$$\begin{array}{*{20}l} \mathbf{y}(n)=\sum_{r=1}^{R} g_{r}\mathbf{s}(n-\tau_{r})+\mathbf{v}(n), \end{array} $$

where v(n) is distributed as \(\mathcal {N}(\mathbf {0},\mathbf {C})\), with 0 being a vector of zeroes, \(\mathbf {C}=\mathrm {E}[\mathbf {v}(n)\mathbf {v}^{T}(n)]=\sigma _{v}^{2}\mathbf {I}_{{N}}\) is the N×N covariance matrix of v(n), \(\sigma _{v}^{2}\) is its variance, IN denotes the N×N identity matrix, and E[·] is the mathematical expectation operator. The maximum likelihood (ML) estimator of the unknown parameters, i.e., the gains and the TOAs, is well known to be the nonlinear least squares (NLS) criterion in this case, i.e.,

$$\begin{array}{*{20}l} \{\widehat{\boldsymbol{\tau}},\widehat{\mathbf{g}}\}=\underset{\boldsymbol{\tau},\mathbf{g}}{\text{argmin}}\left\|\mathbf{y}(n)-\sum_{r=1}^{R}g_{r}\mathbf{s}(n-\tau_{r}) \right\|^{2}, \end{array} $$


$$\begin{array}{*{20}l} \boldsymbol{\tau}&=\left[\tau_{1} \quad \cdots \quad \tau_{R}\right]^{T},\\ \mathbf{g}&=\left[g_{1} \quad \cdots \quad g_{R}\right]^{T}. \end{array} $$

While this estimator is statistically efficient, it also requires computationally costly search since the cost function is high-dimensional and non-convex with respect to the TOAs.

A computationally more efficient way of implementing this estimator could be to adopt the expectation-maximization (EM) approach for superimposed signals proposed in [22]. The concept behind this approach is to define the complete data as the observation of all individual signals, i.e., each of the individual early reflections in our case. According to the previously stated signal model in (4), the individual observations can be modeled as

$$\begin{array}{*{20}l} \mathbf{x}_{r}(n)=g_{r}\mathbf{s}(n-\tau_{r})+\mathbf{v}_{r}(n), \end{array} $$

for r=1,…,R, where vr(n) is obtained by arbitrarily decomposing the combined noise term, v(n), into R different components adhering to

$$\begin{array}{*{20}l} \sum_{r=1}^{R}\mathbf{v}_{r}(n)=\mathbf{v}(n). \end{array} $$

Moreover, the observed signal can be written as the sum of individual observations such as:

$$\begin{array}{*{20}l} \mathbf{y}(n) = \sum_{r=1}^{R}\mathbf{x}_{r}(n). \end{array} $$

Following [22], we let the individual noise terms be independent, zero-mean, white Gaussian, and distributed as \(\mathcal {N}(\mathbf {0},\beta _{r}\mathbf {C})\). Furthermore, the scaling factors, βr are non-negative, real-valued scalars that satisfy

$$\begin{array}{*{20}l} \sum_{r=1}^{R}\beta_{r}=1. \end{array} $$

Under these assumptions, it can be shown that the EM algorithm for estimating the gains and the time-of-arrivals is given by [22]

E-step: for r=1,…,R, compute

$$\begin{array}{*{20}l} \widehat{\mathbf{x}}_{r}^{(i)}(n)&=\widehat{g}_{r}^{(i)}\mathbf{s}\left(n-\widehat{\tau}_{r}^{(i)}\right)\\ &\quad+\beta_{r}\left[\mathbf{y}(n)-\sum_{k=1}^{R}\widehat{g}_{k}^{(i)}\mathbf{s}\left(n-\widehat{\tau}_{k}^{(i)}\right)\right].\notag \end{array} $$


$$\begin{array}{*{20}l} \{\widehat{g}_{r},\widehat{\tau}_{r}\}^{(i+1)}=\underset{g,\tau}{\text{argmin}}\|\widehat{\mathbf{x}}_{r}^{(i)}(n)-g\mathbf{s}(n-\tau)\|^{2}, \end{array} $$

where (i) is denoting the iteration index. If the length, N, of the analysis window is long compared to the length of the known signal, s(n), the M-step can be simplified as

$$\begin{array}{*{20}l} \widehat{\tau}_{r} &= \underset{\tau}{\text{argmax}} \widehat{\mathbf{x}}_{r}^{T}(n)\mathbf{s}(n-\tau), \end{array} $$
$$\begin{array}{*{20}l} \widehat{g}_{r} &= \frac{\widehat{\mathbf{x}}_{r}^{T}(n)\mathbf{s}(n-\widehat{\tau}_{r})}{\|\mathbf{s}(n)\|^{2}}. \end{array} $$

We see that the estimation problem has been greatly simplified with this signals decomposition, since we now have 2R one-dimensional estimators rather than a 2R-dimensional estimator as in (7). From this simplified version of the M-step, we can make some interesting interpretations. First in (14), the individual observations are applied with a matched filter based on the known source signal. The TOA is estimated as the one maximizing the output power of the matched filter. Secondly, the estimated TOAs are used to obtain closed-form estimated of the gains in (15), which is based on a least squares fit between the known source signal and the estimated contribution of the rth component.

Multichannel estimation

We now proceed to consider the multichannel case, where we have one loudspeaker and multiple microphones. First, we consider a white Gaussian noise scenario similar to Section 3.1 where the noise is independent across the microphones, after which we turn to the more realistic scenarios with correlated noise.

Spatially independent white Gaussian noise

If we first assume that the noise is temporally white Gaussian and independent and the late reverberation is negligible, the signal model in (4) reduces to

$$\begin{array}{*{20}l} \mathbf{y}_{m}(n)=\sum_{r=1}^{R} g_{m,r}\mathbf{s}(n-\tau_{\text{ref},r}-\eta_{m,r})+\mathbf{v}_{m}(n), \end{array} $$

for m=1,…,M. Subsequently, we can aggregate the observations from all microphones in one model as

$$\begin{array}{*{20}l} \mathbf{y}(n)&=\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\mathbf{v}(n)\\ &=\left[\mathbf{y}_{1} \quad \mathbf{y}_{2} \quad \cdots \quad \mathbf{y}_{M}\right]^{T}\notag, \end{array} $$

where v(n) is the stacked noise terms from each microphone defined similarly to y(n), and

$$\begin{array}{*{20}l} \boldsymbol{\eta}_{r} &= \left[\eta_{1,r} \quad \eta_{2,r} \quad \cdots \quad \eta_{M,r}\right]^{T},\\ \mathbf{g}_{r} &= \left[g_{1,r} \quad g_{2,r} \quad \cdots \quad g_{M,r}\right]^{T}. \end{array} $$

In addition to this, we note that, under the assumptions of spatial independent white Gaussian noise, the covariance matrix, C of the stacked noise, v(n) is diagonal and given by

$$\begin{array}{*{20}l} \mathbf{C}=\text{diag}\left(\sigma_{v_{1}}^{2}\mathbf{I}_{{N}},\sigma_{v_{2}}^{2}\mathbf{I}_{{N}},\ldots,\sigma_{v_{M}}^{2}\mathbf{I}_{{N}} \right), \end{array} $$

where diag(·) is the operator constructing a diagonal matrix from the input of scalars(/matrices) and C is the MN×MN covariance matrix. Furthermore,

$$\begin{array}{*{20}l} \mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})=\left[g_{1,r}\mathbf{D}_{\eta_{1,r}}^{T} \quad \cdots \quad g_{M,r}\mathbf{D}_{\eta_{M,r}}^{T} \right]^{T}, \end{array} $$

and Dη is a circular shift matrix which delays a signal by −η samples.

With these definitions, the ML estimator for the problem at hand becomes

$$\begin{array}{*{20}l} \{\widehat{\mathbf{g}},\widehat{\boldsymbol{\tau}},\widehat{\boldsymbol{\eta}}\}=\underset{\mathbf{g},\boldsymbol{\tau},\boldsymbol{\eta}}{\text{argmin}} J(\mathbf{g},\boldsymbol{\tau},\boldsymbol{\eta}), \end{array} $$


$$\begin{array}{*{20}l} J(\mathbf{g},\boldsymbol{\tau},\boldsymbol{\eta})=\left\| \mathbf{y}(n)-\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r}) \right\|^{2}_{\mathbf{C}^{-1}} \end{array} $$

such that \(\|\mathbf {x}\|_{\mathbf {W}}^{2}=\mathbf {x}^{T}\mathbf {Wx}\), where W denotes the weighted 2-norm of x. Moreover, g,τ, and η are the parameter vectors containing all unknown gains, TOAs and TDOAs, respectively. In the single-channel case, the ML estimator ends up being high-dimensional and non-convex, resulting in a practically infeasible computational complexity if implemented directly. Therefore, we propose to adopt the EM framework also for the multichannel scenario.

Like in the single-channel approach, we consider the complete data to be all the individual observations of the reflections, but in this case from all the M microphones. Each of the observations can thus, for r=1,…,R, be modeled as

$$\begin{array}{*{20}l} \mathbf{x}_{r}=\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\mathbf{v}_{r}(n). \end{array} $$

The decomposition is assumed to satisfy the conditions in (9)–(11). Then, it can be shown that the EM-algorithm for the multichannel estimation problem is given by

E-step: for r=1,…,R, compute

$$\begin{array}{*{20}l} &\widehat{\mathbf{x}}_{r}^{(i)}(n)=\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{r}^{(i)},\widehat{\mathbf{g}}_{r}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},r}^{(i)}\right) \end{array} $$
$$\begin{array}{*{20}l} &+\beta_{r}\left[\mathbf{y}(n)-\sum_{k=1}^{R}\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{k}^{(i)},\widehat{\mathbf{g}}_{k}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},k}^{(i)}\right)\right]\notag. \end{array} $$

M-step: for r=1,…,R,

$$\begin{array}{*{20}l} \{\widehat{\mathbf{g}}_{r},\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}^{(i+1)} &=\underset{\mathbf{g},\tau,\boldsymbol{\eta}}{\text{argmin}} J_{r}(\mathbf{g},\tau,\boldsymbol{\eta}), \end{array} $$

with Jr(g,τ,η) being a weighted least squares estimator defined as

$$\begin{array}{*{20}l} J_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=\left\|\widehat{\mathbf{x}}_{r}^{(i)}(n)-\mathbf{H}(\boldsymbol{\boldsymbol{\eta},\mathbf{g}})\mathbf{s}(n-\tau)\right\|_{\mathbf{C}^{-1}}^{2}. \end{array} $$

If we explicitly write the cost function, we get

$$\begin{array}{*{20}l} J_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=&\sum_{m=1}^{M}\frac{\|\widehat{\mathbf{x}}_{m,r}(n)\|^{2}}{\sigma_{v_{m}}^{2}}\notag\\ &+\|\mathbf{s}(n-\tau)\|^{2}\sum_{m=1}^{M}\frac{g_{m,{r}}^{2}}{\sigma_{v_{m}}^{2}}\notag\\ &-2\sum_{m=1}^{M} \frac{g_{m,{r}}\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\mathbf{s}(n-\tau), \end{array} $$

This can be used to simplify the M-step by making a few observations. Clearly, the first term in this expression does not depend on any parameter of interest. Moreover, if we assume that the analysis window is long compared to the length of the known source signal, s(n), we observe that the second term does not depend on either the TOAs or the TDOAs. That is, to estimate these time parameters, we only need to consider the maximization of the last term, i.e.,

$$\begin{array}{*{20}l} \{ \widehat{\tau}_{\text{ref},r},\widehat{\boldsymbol{\eta}}_{r}\} = \underset{\tau,\boldsymbol{\eta}}{\text{argmax}}& \sum_{m=1}^{M}\frac{g_{m,r}\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\\ &\qquad\qquad\qquad\times \mathbf{s}(n-\tau) \notag, \end{array} $$

The gains, gm,r, and the noise statistics, \(\sigma _{v_{m}}^{2}\), are unknown in practice. However, if the noise is assumed (quasi-)stationary, its variance can be estimated from microphone recordings acquired before emitting the known source signal, s(n). By taking the partial derivative of (26) with respect to gm,r, we obtain the following closed-form estimate for gm,r

$$\begin{array}{*{20}l} \widehat{g}_{m,r}=\frac{\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\widehat{\eta}_{m}}\mathbf{s}(n-\widehat{\tau}_{\text{ref},r})}{\|\mathbf{s}(n)\|^{2}}, \end{array} $$

If the reflections are assumed to be in the far-field of the array, we can further simplify the estimators. In this case, the gains of reflection r will be the same across all microphones for r=1,…,R. That is, we can instead estimate the TOAs and TDOAs as

$$\begin{array}{*{20}l} \{\widehat{\tau}_{\text{ref},r},\widehat{\boldsymbol{\eta}}_{r}\} \approx \underset{\tau,\boldsymbol{\eta}}{\text{argmax}}& \left(\sum_{m=1}^{M}\frac{\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\right)\\ & \qquad\qquad\qquad \times \mathbf{s}(n-\tau) \notag. \end{array} $$

Subsequently, the gain estimator can then be reformulated as

$$\begin{array}{*{20}l} \widehat{g}_{r}=\left(\sum_{m=1}^{M}\frac{1}{\sigma_{v_{m}}^{2}}\right)^{-1}\sum_{m=1}^{M}\frac{\widehat{\mathbf{x}}_{m,r}^{T}\mathbf{D}_{\widehat{\eta}_{m}}}{\sigma_{v_{m}}^{2}}\frac{\mathbf{s}(n-\widehat{\tau}_{\text{ref},r})}{\|\mathbf{s}(n)\|^{2}}, \end{array} $$

If the geometry of the loudspeaker and microphone configuration is known, we further reduce the dimensionality of the estimation problem. This is achieved by parameterizing the TDOAs, ηm,r, for r=1,…,R and m=1,…,M using the array model, e.g., the one for a UCA configuration formulated in (5). Then, the TOA and TDOA estimator in the M-step can be written as

$$\begin{array}{*{20}l} \{\widehat{\tau}_{\text{ref},r},\widehat{\phi}_{r},\widehat{\psi}_{r}\}\approx & \underset{\tau,\phi,\psi}{\text{argmax}} \left(\sum_{m=1}^{M}\frac{\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\right)\notag\\ & \times \mathbf{s}(n-\tau), \end{array} $$

where ηm is replaced by the expression in (5). In this way, we only need to estimate two angles for each reflection, whereas the estimator in, e.g., (30) requires the estimation of M TDOAs (or M−1 if one of the microphone positions is used as the reference point). That is, the computational benefits of using the array model increases as we increase the number of microphones. It can be shown that the resulting estimators in the M-step has an interesting interpretation as minimum variance distortionless response (MVDR) beamforming followed by a matched filter as we show in the following subsection.

Beamformer interpretation

Intuitively, if we were able to observe the reflections individually in noise and the noise is differently distributed across the microphones, then it would be natural to apply an MVDR beamformer to these to optimally account for the noise when estimating the TOAs and TDOAs. Let us consider the scenario where we have a filtering matrix, W, which we use to process the individually observed reflections in (22):

$$\begin{array}{*{20}l} \mathbf{z}(n)=\mathbf{W}^{T}\mathbf{x}_{r}(n). \end{array} $$

Then, we define the residual noise power after this filtering as the normalized sum of the residual noise variances over the different time indices included in z(n), i.e., n,n+1,…,n+N−1. Mathematically, this is equivalent to

$$\begin{array}{*{20}l} \sigma_{v,f}^{2} &= \mathrm{E}\left[\frac{1}{N}\text{Tr}\left\{\mathbf{W}^{T}\mathbf{v}_{r}(n)\mathbf{v}_{r}^{T}(n)\mathbf{W}\right\}\right]\notag\\ &=\frac{\beta_{r}}{N}\text{Tr}\left\{\mathbf{W}^{T}\mathbf{C}\mathbf{W}\right\}, \end{array} $$

where Tr{·} is the trace operator. Obviously, by inspection of the individual observation model in (22), we can see that the following expression needs to be satisfied for the filter to be distortionless with respect to the known source signal:

$$\begin{array}{*{20}l} \mathbf{W}^{T}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})=\mathbf{I}_{{N}}. \end{array} $$

That is, omitting the arguments of the steering matrix H(ηr,gr) for brevity, the problem of finding the MVDR solution for W can be formulated as

$$\begin{array}{*{20}l} \min_{\mathbf{W}} \text{Tr}\left\{\mathbf{W}^{T}\mathbf{C}\mathbf{W}\right\}\quad\text{s.t.}\quad\mathbf{W}^{T}\mathbf{H}=\mathbf{I}_{{N}}. \end{array} $$

It can be shown that the solution to the quadratic optimization problem with linear constraints is given by

$$\begin{array}{*{20}l} \mathbf{W}_{\mathrm{M}}=\mathbf{C}^{-1}\mathbf{H}\left(\mathbf{H}^{T}\mathbf{C}^{-1}\mathbf{H}\right)^{-1}. \end{array} $$

If we then apply the MVDR filtering matrix to the estimated observation of the rth reflection in noise, careful inspection reveals that

$$\begin{array}{*{20}l} {\mathbf{x}}_{r}^{T}(n)\mathbf{W}_{\mathrm{M}}=\frac{\sum_{m=1}^{M}\frac{g_{m}{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}}{\sum_{m=1}^{M}\frac{g_{m}^{2}}{\sigma_{v_{m}}^{2}}}. \end{array} $$

The denominator is clearly independent of either the TOA or the TDOAs of the rth reflection, so if the objective is to estimate these, we only need to consider the numerator. Interestingly, the numerator resembles the first part of the cost function in (28). This reveals the following interpretation of the M-step. First, the individual observations of the reflections are filtered by an MVDR filter, and the resulting output is then processed by a matched filter with the transmitted signal. The TOA and TDOAs that maximizes the output power of this operation are then the estimates for the rth reflection. This is in line with the findings in [2325], where it was shown that the output of an MVDR/LCMV beamformer provide the sufficient statistics for estimating individual signals.

Spatio-temporarily correlated noise

We now consider the scenario, where the noise is spatio-temporarily correlated, a scenario practically encountered. For example, the late reverberation is often modeled as spatially homogeneous and isotropic sound field [19], resulting in a degree of spatial coherence which is dependent on the distance between the measurement points. Moreover, there might be interfering, quasi-periodic noise sources in the recording environment, like human talkers, ego-noise from a drone/robot, etc. For such scenarios, we can rewrite the model in (4) as

$$\begin{array}{*{20}l} \mathbf{y}(n)=\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\mathbf{d}(n), \end{array} $$


$$\begin{array}{*{20}l} \mathbf{d}(n)=\left[\mathbf{d}_{1}^{T}(n) \quad \mathbf{d}_{2}^{T}(n) \quad \cdots \quad \mathbf{d}_{M}^{T}(n)\right]^{T}. \end{array} $$

To deal with scenarios like this, we can preprocess the observed signals, such that the white Gaussian noise assumptions of the EM method is satisfied.

One way to achieve this is to use spatio-temporal decorrelation technique. Let us consider the correlated noise terms of the model in (4), i.e., dm(n), for m=1,…,M. First, we define the spatio-temporal correlation matrix as

$$\begin{array}{*{20}l} \mathbf{C}_{d}=\mathrm{E}\left[\mathbf{d}(n)\mathbf{d}^{T}(n)\right]. \end{array} $$

If we assume that this matrix is Hermitian and positive definite, the Cholesky factorization of it is given by

$$\begin{array}{*{20}l} \mathbf{C}_{d}=\mathbf{LL}^{T}, \end{array} $$

where L is a lower triangular matrix with real and positive diagonal entries. That is, to whiten the noise term before estimating the unknown parameters, we can left-multiply the observation in (38) with L−1 [26]. The prewhitened observations are thus given by

$$\begin{array}{*{20}l} \overline{\mathbf{y}}(n) &= \mathbf{L}^{-1}\mathbf{y}(n)\\ &=\mathbf{L}^{-1}\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\overline{\mathbf{d}}(n),\notag \end{array} $$

where \(\overline {\mathbf {d}}(n) = \mathbf {L}^{-1}\mathbf {d}(n)\). Based on this and [22], we end up with the following EM method for estimating the acoustic reflection parameters when the noise is correlated in time and space:

E-step: for r=1,…,R, compute

$$\begin{array}{*{20}l} \widehat{\mathbf{x}}_{r}^{(i)}(n)&=\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{r}^{(i)},\widehat{\mathbf{g}}_{r}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},r}^{(i)}\right)\\ &+\beta_{r}\left[\mathbf{y}(n)-\sum_{k=1}^{R}\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{k}^{(i)},\widehat{\mathbf{g}}_{k}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},k}^{(i)}\right)\right]\notag. \end{array} $$

M-step: for r=1,…,R,

$$\begin{array}{*{20}l} \{\widehat{\mathbf{g}}_{r},\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}^{(i+1)}&=\underset{\mathbf{g},\tau,\boldsymbol{\eta}}{\text{argmin}}\overline{J}_{r}(\mathbf{g},\tau,\boldsymbol{\eta}). \end{array} $$


$$\begin{array}{*{20}l} \overline{J}_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=\left\|\mathbf{L}^{-1}\left(\widehat{\mathbf{x}}_{r}^{(i)}(n)-\mathbf{H}(\boldsymbol{\boldsymbol{\eta},\mathbf{g}})\mathbf{s}(n-\tau)\right)\right\|^{2}, \end{array} $$

Eventually, we can explicitly write the cost function for the M-step as

$$\begin{array}{*{20}l} &\overline{J}_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\mathbf{x}_{r}(n)\notag\\ &\ \ \quad +\mathbf{s}^{T}(n-\tau)\mathbf{H}^{T}(\boldsymbol{\eta},\mathbf{g})\mathbf{C}_{d}^{-1}\mathbf{H}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau)\notag\\ &\ \ \quad -2\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\mathbf{H}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau), \end{array} $$

Compared with the cost function in (26), the minimization of (46) is more challenging. For example, the second term in (46) will generally depend on the DOA/TDOAs. That is, if we assume the reflections to be in the far-field of the array, we can adopt an iterative estimation scheme, where we first estimate the TOA and TDOAs, then update the TDOAs, and, finally, estimate the gains, i.e., for r=1,…,R:

Step 1: Obtain estimates of the TOA and TDOAs as

$$\begin{array}{*{20}l} \{\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}& = \underset{\tau,\boldsymbol{\eta}}{\text{argmax}} \mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau), \end{array} $$


$$\begin{array}{*{20}l} \overline{\mathbf{H}}(\boldsymbol{\eta})&=\left[\mathbf{D}_{\eta_{1}}^{T} \quad \cdots \quad \mathbf{D}_{\eta_{M}}^{T}\right]^{T}. \end{array} $$

Step 2: Update the TDOA estimates as

$$\begin{array}{*{20}l} \widehat{\boldsymbol{\eta}}_{r}=\arg\min_{\boldsymbol{\eta}} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta})+\overline{J}_{3,r}(g_{r},\boldsymbol{\eta}), \end{array} $$


$$\begin{array}{*{20}l} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta}) &=g_{r}^{2}\mathbf{s}\left(n-\widehat{\tau}_{r}\right)\overline{\mathbf{H}}^{T}(\boldsymbol{\eta})\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\eta}) \end{array} $$
$$\begin{array}{*{20}l}& \qquad \qquad \qquad \qquad \qquad \times\mathbf{s}(n-\widehat{\tau}_{r})\notag\\ \overline{J}_{3,r} (g_{r},\boldsymbol{\eta}) &=-2g_{r}\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\eta})\mathbf{s}(n-\widehat{\tau}_{r}). \end{array} $$

Step 3: Estimate the unknown gain as

$$\begin{array}{*{20}l} \widehat{g}_{r}=\frac{\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\widehat{\eta}}_{r})\mathbf{s}(n-\widehat{\tau}_{r})}{\mathbf{s}^{T}(n-\widehat{\tau}_{r})\overline{\mathbf{H}}^{T}(\boldsymbol{\widehat{\eta}}_{r})\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\widehat{\eta}}_{r})\mathbf{s}(n-\widehat{\tau}_{r})}. \end{array} $$

with the TOA and TDOA estimates from (47) and (48), respectively. If needed, these steps can then be repeated until convergence. It is also possible to simplify the M-step further by using particular signals as the known signal, s(n). By close inspection of the second term of the cost function in (48), we get

$$\begin{array}{*{20}l} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta})&=g_{r}^{2}\sum_{i=1}^{M}\sum_{j=1}^{M} c_{i,j} \\&\qquad \times \mathbf{s}^{T}\left(n-\tau-\eta_{i}\right)\mathbf{s}\left(n-\tau-\eta_{j}\right),\notag \end{array} $$

where ci,j denotes the (i,j)th element of \(\mathbf {C}_{d}^{-1}\). This reveals that, if the known probe signal is an uncorrelated noise sequence, it is reasonable to assume that this term is independent of both the TOA and the TDOAs, meaning that we can skip the update step in (48).

Kronecker decomposition

Another challenge with the prewhitening based estimator is the inversion of the noise covariance matrix, Cd, which has a high dimension of NM×NM. However, if we assume that the covariance matrix is separable, we can approximate it with two smaller matrices [27], i.e.,

$$\begin{array}{*{20}l} \mathbf{C}_{d} \approx \mathbf{C}_{\mathrm{s}}\otimes \mathbf{C}_{\mathrm{t}}. \end{array} $$

where Cs and Ct represents the spatial and temporal correlation matrices of dimensions M×M and N×N, respectively, and denotes the Kronecker product operator. Since \((\mathbf {C}_{\mathrm {s}}\otimes \mathbf {C}_{\mathrm {t}})^{-1}=\mathbf {C}_{\mathrm {s}}^{-1}\otimes \mathbf {C}_{\mathrm {t}}^{-1}\), we now only need to invert these smaller matrices, which is both numerically and computationally preferable. Moreover, we can now conduct the prewhitening using the Cholesky factorization of these smaller matrices due to the mixed-product property, yielding

$$\begin{array}{*{20}l} \mathbf{C}_{\mathrm{s}}\otimes \mathbf{C}_{\mathrm{t}} = \mathbf{L}_{\mathrm{s}}\mathbf{L}_{\mathrm{s}}^{T}\otimes \mathbf{L}_{\mathrm{t}}\mathbf{L}_{\mathrm{t}}^{T} = (\mathbf{L}_{\mathrm{s}}\otimes\mathbf{L}_{\mathrm{t}}) (\mathbf{L}_{\mathrm{s}}^{T}\otimes\mathbf{L}_{\mathrm{t}}^{T}). \end{array} $$

In other words, by assuming separability, we can approximate L in (41) by LsLt. Eventually, it can be shown that, for uncorrelated probe signals, the Kronecker product decomposition allows us to rewrite the first step of the M-step in (44) as

Step 1:

$$\begin{array}{*{20}l} \{\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}&=\underset{\tau,\boldsymbol{\eta}}{\text{argmax}} \mathbf{x}_{r}^{T}(n)\left(\mathbf{C}_{\mathrm{s}}^{-1}\otimes\mathbf{C}_{\mathrm{t}}^{-1}\right)\overline{\mathbf{H}}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau),\notag\\ &=\underset{\tau,\boldsymbol{\eta}}{\text{argmax}} \text{tr}\left(\mathbf{X}_{r}^{T}(n)\mathbf{C}_{\mathrm{t}}^{-1}\mathbf{S}_{\tau,\boldsymbol{\eta}}(n)\mathbf{C}_{\mathrm{s}}^{-1}\right) \end{array} $$
$$\begin{array}{*{20}l} &=\underset{\tau,\boldsymbol{\eta}}{\text{argmax}}\sum_{m=1}^{M}\widetilde{\mathbf{x}}_{m,r}^{T}(n)\widetilde{\mathbf{s}}(n-\tau-\eta_{m}) \end{array} $$


$$\begin{array}{*{20}l} \mathbf{X}_{r}(n)&=\left[\mathbf{x}_{1,r}(n) \quad \cdots \quad \mathbf{x}_{M,r}(n)\right], \end{array} $$
$$\begin{array}{*{20}l} \mathbf{S}_{\tau,\boldsymbol{\eta}}(n)&=\left[\mathbf{D}_{\eta_{1}}\mathbf{s}(n-\tau) \quad \cdots \quad \mathbf{D}_{\eta_{M}}\mathbf{s}(n-\tau)\right],\notag\\ &=\left[\mathbf{s}(n-\tau-\eta_{1}) \quad \cdots \quad \mathbf{s}(n-\tau-\eta_{M})\right], \end{array} $$

and the vectors \(\widetilde {\mathbf {x}}_{m,r}(n)\) and \(\widetilde {\mathbf {s}}(n-\tau -\eta _{m})\) are the prewhitened observation and probe signals for microphone m, respectively, defined as the mth columns of the following matrices:

$$\begin{array}{*{20}l} \widetilde{\mathbf{X}}_{r}(n)&=\mathbf{L}_{\mathrm{t}}^{-1}\mathbf{X}_{r}(n)\mathbf{L}^{-T}_{\mathrm{s}} \end{array} $$
$$\begin{array}{*{20}l} \widetilde{\mathbf{S}}_{\tau,\boldsymbol{\eta}}(n)&=\mathbf{L}_{\mathrm{t}}^{-1}\mathbf{S}_{\tau,\boldsymbol{\eta}}(n)\mathbf{L}^{-T}_{\mathrm{s}}. \end{array} $$

These expressions can be interpreted in the following way. The left hand multiplication with \(\mathbf {L}_{t}^{-1}\) corresponds to temporal prewhitening of all the microphone signals, whereas the right hand multiplication with \(\mathbf {L}_{s}^{-T}\) corresponds to spatial prewhitening of all time snapshots.

Step 2: With the Kronecker decomposition, the second term of the cost function in (49) becomes

$$\begin{array}{*{20}l} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta}) = g_{r}^{2}\text{tr}(\widetilde{\mathbf{S}}_{\tau,\eta}^{T}(n)\widetilde{\mathbf{S}}_{\tau,\eta}(n)). \end{array} $$

This does not depend on the TOAs and TDOAs, so the Kronecker decompositions allow us to skip the intermediate step of updating the TDOAs as in (48). We can therefore directly proceed to conducting the closed form estimate of the gains as

$$\begin{array}{*{20}l} \widehat{g}_{r}= \frac{\sum_{m=1}^{M}\widetilde{\mathbf{x}}_{m,r}^{T}(n)\widetilde{\mathbf{s}}(n-\tau-\eta_{m})}{M\|\widetilde{\mathbf{s}}(n)\|^{2}}. \end{array} $$

Even after all the presented simplifications and assumptions, the computational complexity of the proposed methods might still be considered relatively high due to their iterative and multidimensional nature. However, although not considered in this paper, we expect that further reductions in the computational complexity can be obtained by employing, e.g., the space alternating generalized expectation (SAGE) algorithm rather than the EM algorithm [28], or through a recursive EM procedure as suggested in [29], where the number of iterations per time instance can be reduced by instead tracking the parameters of interest over time.

Temporal prewhitening with filter

One issue with this prewhitening approach still is that the number samples in time might be relatively high in practice. The consequence of this is that, even with the Kronecker decomposition of the noise correlation matrix, the inversion of Lt might be intractable in practice since its dimensions equal the number of time samples. An alternative approach could be to use a lower order filter for the prewhitening instead [30]. If we assume that the noise follows an autoregressive model, we can approximate it as:

$$\begin{array}{*{20}l} d(n)\approx\sum_{p=1}^{P}a_{p} d(n-p). \end{array} $$

Given the noise correlation matrix, Ct, we can obtain the AR coefficients of the noise using the Levinson-Durbin recursion. The prewhitening filter is then formed using the AR coefficients as the coefficients of a Pth order FIR filter, hpw(p)=ap. Subsequently, the prewhitened signals are obtained as

$$\begin{array}{*{20}l} \widetilde{x}_{m,r}(n)&=\sum_{p=0}^{P}h_{\text{pw}}(p)x_{m,r}(n-p), \end{array} $$
$$\begin{array}{*{20}l} \widetilde{s}(n)&=\sum_{p=0}^{P}h_{\text{pw}}(p)s(n-p), \end{array} $$

where hpw(0)=1.

Covariance estimation

In the previous subsections, we have considered the covariance matrices as known quantities. However, we need to estimate these from the observed data in practice. If no particular structure is assumed for the covariance matrix, a common approach is to use the following estimator [31]

$$\begin{array}{*{20}l} \widehat{\mathbf{C}}_{d}=\frac{1}{N-K+1}\sum_{n=0}^{N-K}\mathbf{d}(n)\mathbf{d}(n)^{T}, \end{array} $$


$$\begin{array}{*{20}l} \mathbf{d}(n)&=\left[\mathbf{d}_{1}(n) \quad \cdots \quad \mathbf{d}_{M}(n)\right]^{T}, \end{array} $$
$$\begin{array}{*{20}l} \mathbf{d}_{m}(n)&=\left[d_{m}(n) \quad \cdots \quad d_{m}(n+K-1)\right]^{T}. \end{array} $$

As evident from, e.g., (47), the estimated covariance needs to be invertible. This requires that

$$\begin{array}{*{20}l} K\leq\frac{N+1}{M+1}. \end{array} $$

where K is the number of snapshots, N is the number of samples of the signal, and M is the number of microphones. Consequently, we can only use relatively short temporal subvectors, dm(n) in the estimation of the covariance matrix when the number of microphones is increased.

If it is assumed that the multichannel noise samples in d(n) follows a multichannel matrix normal distribution, the maximum likelihood (ML) estimator for the noise covariance matrix can be derived [32]. Unfortunately, the resulting estimator is not closed form, but it can be implemented using the iterative flip-flop algorithm in Algorithm 1. In some cases, e.g., if one of the covariance matrices are close to being rank deficient, this iterative procedure can be problematic, since their inverses are required. Different approaches for dealing with this and the computational complexity of the iterative procedure have been considered [31, 33]. Alternatively, a non-iterative estimator can be used such as [31]

$$\begin{array}{*{20}l} \widehat{\mathbf{C}}_{\mathrm{s}}&=\frac{1}{(N-K+1)\text{tr}\left(\mathbf{C}_{\mathrm{t}}\right)}\sum_{n=0}^{N-K}\mathbf{D}^{T}(n)\mathbf{D}(n), \end{array} $$
$$\begin{array}{*{20}l} \widehat{\mathbf{C}}_{\mathrm{t}}&=\frac{1}{(N-K+1)\text{tr}\left(\widehat{\mathbf{C}}_{\mathrm{s}}\right)}\sum_{n=0}^{N-K}\mathbf{D}(n)\mathbf{D}^{T}(n), \end{array} $$


$$\begin{array}{*{20}l} \mathbf{D}(n)=\left[\mathbf{d}_{1}(n) \quad \mathbf{d}_{2}(n) \quad \cdots \quad \mathbf{d}_{M}(n)\right]. \end{array} $$

As indicated in (70), the trace of the temporal covariance is assumed to be known. This might not be the case in practice; however, in most situations, we can simply replace it by an arbitrary value, since its main purpose is to resolve the ambiguity

$$\begin{array}{*{20}l} \mathbf{C}_{d}=\mathbf{C}_{\mathrm{s}}\otimes\mathbf{C}_{\mathrm{t}}=\left(\frac1\alpha\mathbf{C}_{s}\right)\otimes(\alpha\mathbf{C}_{\mathrm{t}}). \end{array} $$

Non-stationary noise

While the stationarity assumption may not hold in practice, there are a number of ways to address this problem. For example, we may reduce the length, N, of the probe signal and the analysis window, which would naturally increase the validity of the assumption. Alternatively, we may decouple the prewhitening and estimation parts, as suggested in Section 4.5. In this way, we may first prewhiten our signal using a filter, and then apply the proposed estimators with a white Gaussian noise assumption on the prewhitened signals. This approach can be exploited to take the non-stationarity of the noise into account by updating the prewhitening filters over time, according to the changing AR coefficients of the noise. Estimating non-stationary noise parameters, however, is more difficult, since the statistics need to be tracked during the presence of the desired signal, i.e., the probe signal and its reflections in our case. This problem has been well-investigated in other audio signal processing problems, such as speech enhancement [3437].

Results and discussion

In this section, we investigate the performance of the different variants of the proposed EM method. More specifically, we consider the variant assuming spatially independent white Gaussian in Section 4.1 resulting in noise variance weighting (EM-UCA-NW), and its special case where the noise variance is assumed equal (EM-UCA) [18]. Moreover, we consider the setup with correlated noise proposed in Section 4.3 resulting in the prewhitening-based approach (EM-UCA-PW). The experiments were carried out using signals that were generated using the room impulse response generator [38]. The dimensions of the simulated room were set to 8×6×5 m, the reverberation time (T60) was set to 0.6 s while the speed of sound is fixed at 343 m/s. The loudspeaker was positioned at the center of an UCA at (1×1.5×2.5) m while the UCA has M=4 microphones with a radius of d=0.2 m. Although, any type of known broadband signal could be used to probe the environment, such as a chirp signal or maximum length sequences (MLS) [39], we decided to use a white Gaussian noise sequence as the known sound source, s(n), consisting of 1,500 samples from a Gaussian distribution. This sequence was subsequently zero-padded to get a total signal length of 20,000 samples. The objective of the zero-padding was to get a longer analysis window to ensure that the first few reflections are present in the observation. Moreover, as discussed in Section 4.3, the reason for using a WGN sequence is that the EM estimator can be simplified if the probe signal is an uncorrelated signal. In addition to this, using such a broadband sequence minimizes the effects of spatial aliasing [40]. The sampling frequency fs was set to 22,050 Hz. We assumed that the direct component is subtracted from the observed signal given that we know the arrangement of the loudspeaker and the microphones. Knowing the array geometry enables either offline measurement of the impulse response of the direct-path component offline or analytical computation of the impulse response of the direct-path component based on the geometry. The background noise comprises of two components: one being diffuse spherical noise and the other being thermal sensor noise. The diffuse spherical noise was generated using the method described in [41] using the rotor noise of a drone from the DREGON database [3]. The drone audio file used to generate the diffuse spherical noise corresponds to rotors running at 70 revolutions per second (RPS). The thermal sensor noise was simulated as spatially independent white Gaussian noise. Both these noises were added to the observed signal before estimating the parameters. The evaluation was then conducted for different signal-to-diffuse noise ratios (SDNRs) and signal-to-sensor noise ratios (SSNRs). In the following subsections, we evaluate the performance of our propose method in various conditions.

Comparison of with state-of-the-art

The aim of the first experiment was to compare the proposed method with existing state-of-the-art methods. The EM algorithm was set to estimate R=3 reflections with 40 iterations and β was set to \(\frac {1}{R}\). The main application for this manuscript is acoustic reflector mapping for robot audition. For this application, the mapping should be possible in unknown, complex environments, and we therefore do not rely on trivial room geometry models as opposed to many of the traditional methods for room geometry estimation [1012]. Therefore, we chose to use a small number of reflections in the estimation (i.e., R=3), to mainly estimate the TOAs/DOAs of first-order reflections impinging from nearby acoustic reflectors. These can be directly mapped to acoustic reflector positions based on the estimated time and angle of arrival. While this will not facilitate the localization of all acoustic reflectors at any given time instance, we can carry out such estimation over time and space, to generate a map of an arbitrary room geometry (see Section 5.4). An alternative to choosing a fixed reflection order would be to combine the proposed method with order estimation methods [42, 43]. To initialize the method, the gain estimates, \(\widehat {g}_{m,r}\), were sampled from a uniform distribution over the interval [0;1], the TOAs, \(\widehat {\tau }_{1,r}\), were sampled from a uniform discrete distribution over the time indices corresponding to the analysis window, and the DOAs, \(\widehat {\phi }_{r}\), were sampled from a uniform distribution over the interval [0;360]. After emitting and recording the known source signal, an analysis window of each recording was considered starting from τmin samples to τmax samples after the source signal was emitted. In this experiment, the analysis window was set such that the search is made between 0.5 to 2 m. This was done to primarily capture the first order reflections. The lower bound was chosen because we can only search for reflectors that are outside the geometry of the array, which, in our experiments, had a radius of 0.2 m. After 2 m, the performance of the proposed method degrades because the energy of the reflected signals decrease quadratically over distance, which motivated the choice of the upper limit.

The proposed EM method (EM-UCA) was compared to the single-channel EM method (EM-SC) in [22] in terms of TOA accuracy, applied to the first microphone. Moreover, these were compared with a common approach to extracting TOAs from estimated RIR through peak-picking (RIR-PP). Finally, the performance was also compared with our previous work [44] termed the non-linear least squares estimator (NLS). The results for the TOA estimation are shown in Fig. 3, where the accuracy was defined as the percentage of TOA estimates that were within ± 2% tolerance of one of the true parameters of the first-order reflections computed using the image-source method. This was measured for different SDNRs while the SSNR was fixed to 10 dB, and for each SDNR, the accuracy was measured over 100 Monte-Carlo simulations. As seen in Fig. 3, the proposed method clearly outperforms the existing method by providing higher accuracy at lower SDNRs.

Fig. 3

Comparison of the proposed EM-UCA method with state-of-the-art methods in terms of TOA estimation accuracy

Furthermore, the computation time of the RIR-PP and the proposed method, EM-UCA, were measured. This test was performed in MATLAB using the built-in function timeit on a standard desktop computer running a Microsoft Windows 10 operating system with an Intel Core i7 CPU with 3.40 GHz processing speed and 16 GB of RAM. A Monte Carlo simulation with 100 trials was performed on each method and an average time was calculated. The measured computation times of the RIR-PP and the EM-UCA were 0.0063 s and 25.74 s, respectively, for R=1 and an SDNR of 40 dB. This shows that the improved estimation accuracy with the proposed method comes at the cost of a higher computational complexity. It is important to stress, however, that in applications such as acoustic reflector localization with a drone, it is common to have negative SNR conditions [45], where the RIR-PP method may fail to provide accurate estimates as opposed to the proposed method (see, e.g., Fig. 3). Moreover, the computational cost could be reduced further by, e.g., employing the recursive EM approach [29, 46]. If the TOA/DOA estimation is carried out continuously over time and space, the EM algorithm may be initialized using previous estimates, which may significantly reduce the number of iterations needed for convergence. Another potential computational saving may be obtained by deriving the proposed methods in the frequency domain.

Evaluation for different diffuse noise conditions

In the second experiment, we evaluated the effect of the proposed prewhitening approach under different diffuse noise conditions. To test the performance of the EM algorithm under such realistic scenarios, we test our estimator for different SDNRs in the interval [− 40;10] dB while setting the SSNR to 40 dB. Here, we are comparing the EM algorithm with and without the prewhitening in terms of both TOA and DOA estimation accuracy as seen in Figs. 4 and 5, respectively. The diffuse rotor noise is indeed correlated with strong periodic components, but the results show that the proposed prewhitening approach can successfully account for this and can retain a high estimation accuracy at SDNRs levels 20 dB lower than those needed for the EM-UCA approach.

Fig. 4

TOA estimation accuracy of the proposed EM method with and without prewhitening

Fig. 5

DOA estimation accuracy of the proposed EM method with and without prewhitening

Evaluation for faulty/noisy microphone conditions

In this experiment, we consider a scenario where one microphone is excessively noisy compared to the other microphones. An example of this could be a robot platform, where one microphone is placed closer to an ego-noise source such as a fan, leading to TOA and DOA estimation errors. To simulate this effect, we set thermal noise of a single microphone to an SSNR level of − 10 dB, while the thermal noise of the remaining microphones are set to an SSNR level of 40 dB. As seen in Figs. 6 and 7, the performance of the EM algorithm with noise variance weighting is less affected by the high thermal sensor noise in terms of both TOA and DOA estimation accuracy. Moreover, we conducted an experiment without diffuse noise, where the SSNR level of the faulty microphone was changed from − 40 to 0 dB. These results are shown in Figs. 8 and 9, and show that the estimation accuracy is already degrading from 0 dB SSNR and downwards when using the EM-UCA approach, whereas the proposed EM-UCA-NW approach retains a high accuracy.

Fig. 6

TOA estimation accuracy of the proposed EM method with and without noise variance weighting when 1 microphone has a lower SSNR of − 10 dB while the remaining microphones has a SSNR of 40 dB

Fig. 7

DOA estimation accuracy of the proposed EM method with and without noise variance weighting when 1 microphone has a lower SSNR of − 10 dB while the remaining microphones has a SSNR of 40 dB

Fig. 8

TOA estimation accuracy of the proposed EM method with and without noise variance weighting for different SSNR levels for one of the microphones

Fig. 9

DOA estimation accuracy of the proposed EM method with and without noise variance weighting for different SSNR levels for one of the microphones

Application example of the proposed method

We consider an application example where the localization of the acoustic reflectors is done using the proposed EM method with and without prewhitening. More specifically, we have used filter-based prewhitening approach as discussed in Section 4.5. This experiment thus shows how the proposed method can be used to map an environment using a moving robot platform. The room parameters were kept the same as the earlier experiment. Furthermore, the SDNR was set to − 10 dB corresponding to a strong ego-noise. The loudspeaker-microphone arrangement was similar to the previous experiments and follows the a predefined path as shown in Fig. 10 indicated by the blue dashed line. As depicted in the figure, the EM algorithm with prewhitening performs better at estimating acoustic reflector using the estimated TOAs and DOAs, compared to EM algorithm without prewhitening.

Fig. 10

Example of reflector localization for different array positions using the proposed EM method with and without prewhitening based on TOA and DOA estimates at an SDNR of − 10 dB


In this paper, we consider the problem of estimating the time- and direction-of-arrivals of acoustic echoes using a loudspeaker emitting a known source signal and multiple microphones. Among other examples, this is an important problem in robot and drone audition, where these parameters can reveal the positions of nearby acoustic reflectors and thus facilitate mapping and navigation of a physical environment. Some methods exist for solving the problems of acoustic reflector localization and room geometry estimation; however, most of these rely on a priori information, e.g., of the TOAs or DOAs of the acoustic echoes. However, estimating these is a difficult problem on its own, which is dealt with by the methods proposed herein. Moreover, even when the TOAs are estimated for some of the traditional approaches, the difficult problem of echolabeling needs to be solved, since the order of the corresponding reflection is generally unknown. We therefore propose different methods for estimating, not only the TOAs, but also the DOAs of acoustic echoes. By estimating the DOAs also, it is possible to resolve some of the ambiguity introduced by knowing only the TOAs. The proposed method is based on the expectation-maximization framework and are derived to be optimal under different conditions ranging from the simple white Gaussian noise scenario to scenarios with correlated and colored noise. In the experiments, we show that proposed methods are able to estimate the TOAs and DOAs with higher accuracy and noise robustness compared to existing methods. Moreover, we show that some of the proposed variants can account for colored noise and scenarios where a microphone is faulty or more noisy than the other microphones of the array. Finally, we conducted a more applied experiment, where it is illustrated how a room can be mapped from the estimated parameters, which is relevant to, e.g., autonomous robot and drone applications. While the proposed method has a higher computation time than traditional methods, this can be reduced significantly by adopting the recursive EM scheme and deriving the proposed methods in the frequency domain.

Availability of data and materials

Not applicable.


  1. 1.

    In our definition, the direct-path component is one of the reflections, i.e., the 0th order reflection corresponding to r=1.







Uniform circular array


Signal-to-noise ratio




Acoustic simultaneous localization and mapping


Room impulse response


Time difference-of-arrival


Maximum likelihood


Minimum variance distortionless response


Linearly constrained minimum variance


Space alternating generalized expectation


Finite impulse response




Proposed method without prewhitening or noise weighting


Proposed method with only noise weighting


Proposed method with only prewhitening

T 60 :

Reverberation time (60 dB)


Revolutions per minute


Database of drone audio recordings


Signal-to-diffuse-noise ratio


Signal-to-sensor-noise ratio


Single channel EM method


RIR-based method with peak picking


nonlinear least squares


  1. 1

    C. Rascon, I. Meza, Localization of sound sources in robotics: a review. Robot. Auton. Syst.96:, 184–210 (2017).

    Article  Google Scholar 

  2. 2

    H. W. Löllmann, A. Moore, P. A. Naylor, B. Rafaely, R. Horaud, A. Mazel, W. Kellermann, in Hands-free Speech Comm. and Microphone Arrays. Microphone array signal processing for robot audition, (2017), pp. 51–55.

  3. 3

    M. Strauss, P. Mordel, V. Miguet, A. Deleforge, in IEEE/RJS Int. Conf. Intelligent Robots and Systems. DREGON: dataset and methods for UAV-embedded sound source localization, (2018), pp. 5735–5742.

  4. 4

    F. Badeig, Q. Pelorson, S. Arias, V. Drouard, I. D. Gebru, X. Li, G. Evangelidis, R. Horaud, in Int. Conf. Multimodal Interaction. A distributed architecture for interacting with NAO, (2015).

  5. 5

    F. Antonacci, J. Filos, M. R. P. Thomas, E. A. P. Habets, A. Sarti, P. A. Naylor, S. Tubaro, Inference of room geometry from acoustic impulse responses. IEEE Trans. Audio Speech Lang. Process.20(10), 2683–2695 (2012).

    Article  Google Scholar 

  6. 6

    M. Coutino, M. B. Møller, J. K. Nielsen, R. Heusdens, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Greedy alternative for room geometry estimation from acoustic echoes: a subspace-based method, (2017), pp. 366–370.

  7. 7

    J. -S. Hu, C. -Y. Chan, C. -K. Wang, M. -T. Lee, C. -Y. Kuo, Simultaneous localization of a mobile robot and multiple sound sources using a microphone array. Adv. Robot.25(1–2), 135–152 (2011).

    Article  Google Scholar 

  8. 8

    S. Ogiso, T. Kawagishi, K. Mizutani, N. Wakatsuki, K. Zempo, Self-localization method for mobile robot using acoustic beacons. ROBOMECH J.2(1), 12 (2015).

    Article  Google Scholar 

  9. 9

    C. Evers, P. A. Naylor, Acoustic SLAM. IEEE/ACM Trans. Audio Speech Lang. Process.26:, 1484–1498 (2018).

    Article  Google Scholar 

  10. 10

    M. Kreković, I. Dokmanić, M. Vetterli, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. EchoSLAM: simultaneous localization and mapping with acoustic echoes, (2016), pp. 11–15.

  11. 11

    L. Nguyen, J. V. Miro, X. Qiu, in IEEE/RSJ Int. Conf. Intell. Robots and Syst. Can a robot hear the shape and dimensions of a room? (2019), pp. 5346–5351.

  12. 12

    T. Wang, F. Peng, B. Chen, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. First order echo based room shape recovery using a single mobile device, (2016), pp. 5346–5351.

  13. 13

    I. J. Kelly, F. M. Boland, Detecting arrivals in room impulse responses with dynamic time warping. IEEE/ACM Trans. Audio Speech Lang. Process.22(7), 1139–1147 (2014).

    Article  Google Scholar 

  14. 14

    M. D. Plumbley, Hearing the shape of a room. Proc. Natl. Acad. Sci. U S A. 110(30), 12162–12163 (2013).

    Article  Google Scholar 

  15. 15

    L. B. Nelson, H. V. Poor, Iterative multiuser receivers for CDMA channels: an EM-based approach. IEEE Trans. Commun.44(12), 1700–1710 (1996).

    Article  Google Scholar 

  16. 16

    M. C. Vanderveen, C. B. Papadias, A. Paulraj, Joint angle and delay estimation (JADE) for multipath signals arriving at an antenna array. IEEE Commun. Lett.1(1), 12–14 (1997).

    Article  Google Scholar 

  17. 17

    J. Verhaevert, E. V. Lil, A. V. de Capelle, Direction of arrival (DOA) parameter estimation with the SAGE algorithm. Signal Process.84(3), 619–629 (2004).

    Article  Google Scholar 

  18. 18

    J. R. Jensen, U. Saqib, S. Gannot, in Proc. IEEE Workshop Appl. of Signal Process. to Aud. and Acoust. An EM method for multichannel TOA and DOA estimation of acoustic echoes, (2019).

  19. 19

    S. Braun, A. Kuklasiński, O. Schwartz, O. Thiergart, E. A. P. Habets, S. Gannot, S. Doclo, J. Jensen, Evaluation and comparison of late reverberation power spectral density estimators. IEEE/ACM Trans. Audio Speech Lang. Process.26(6), 1056–1071 (2018).

    Article  Google Scholar 

  20. 20

    B. F. Cron, C. H. Sherman, Spatial–correlation functions for various noise models. J. Acoust. Soc. Am.34(11), 1732–1736 (1962).

    Article  Google Scholar 

  21. 21

    H. Sun, T. D. Abhayapala, P. N. Samarasinghe, in Proc. IEEE Workshop Appl. of Signal Process. to Aud. and Acoust. Active noise control over 3D space with multiple circular arrays, (2019), pp. 135–139.

  22. 22

    M. Feder, E. Weinstein, Parameter estimation of superimposed signals using the EM algorithm. IEEE Trans. Acoust. Speech Signal Process.36(4), 477–489 (1988).

    Article  Google Scholar 

  23. 23

    O. Schwartz, S. Gannot, E. A. P. Habets, Multispeaker LCMV beamformer and postfilter for source separation and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process.25(5), 940–951 (2017).

    Article  Google Scholar 

  24. 24

    R. Balan, J. Rosca, in Proc. IEEE Workshop Sensor Array and Multichannel Signal Process. Microphone array speech enhancement by bayesian estimation of spectral amplitude and phase, (2002), pp. 209–213.

  25. 25

    L. L. Scharf, Statistical signal processing: detection, estimation, and time series analysis (Addison-Wesley Publishing Company, Michigan, 1991).

    Google Scholar 

  26. 26

    P. C. Hansen, S. H. Jensen, Prewhitening for rank-deficient noise in subspace methods for noise reduction. IEEE Trans. Signal Process.53(10), 3718–3726 (2005).

    MathSciNet  Article  Google Scholar 

  27. 27

    G. Reinsel, Multivariate repeated-measurement or growth curve models with multivariate random-effects covariance structure. J. Am. Statist. Assoc.77(377), 190–195 (1982).

    MathSciNet  Article  Google Scholar 

  28. 28

    J. A. Fessler, A. O. Hero, Space-alternating generalized expectation-maximization algorithm. IEEE Trans. Signal Process.42(10), 2664–2677 (1994).

    Article  Google Scholar 

  29. 29

    O. Schwartz, S. Gannot, Speaker tracking using recursive EM algorithms. IEEE/ACM Trans. Audio Speech Lang. Process.22(2), 392–402 (2014).

    Article  Google Scholar 

  30. 30

    S. M. Nørholm, J. R. Jensen, M. G. Christensen, Instantaneous fundamental frequency estimation with optimal segmentation for nonstationary voiced speech. IEEE/ACM Trans. Audio Speech Lang. Process.24(12), 2354–2367 (2016).

    Article  Google Scholar 

  31. 31

    M. H. Castaneda, J. A. Nossek, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Estimation of rank deficient covariance matrices with Kronecker structure, (2014), pp. 394–398.

  32. 32

    P. Dutilleul, The MLE algorithm for the matrix normal distribution. J. Statist. Comput. Simul.64(2), 105–123 (1999).

    Article  Google Scholar 

  33. 33

    K. Werner, M. Jansson, P. Stoica, On estimation of covariance matrices with Kronecker product structure. IEEE Trans. Signal Process.56(2), 478–491 (2008).

    MathSciNet  Article  Google Scholar 

  34. 34

    R. Martin, Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech Audio Process.9(5), 504–512 (2001).

    Article  Google Scholar 

  35. 35

    T. Gerkmann, R. C. Hendriks, Unbiased MMSE-based noise power estimation with low complexity and low tracking delay. IEEE Trans. Audio Speech Lang. Process.20(4), 1383–1393 (2012).

    Article  Google Scholar 

  36. 36

    R. C. Hendriks, R. Heusdens, J. Jensen, in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. MMSE based noise PSD tracking with low complexity, (2010), pp. 4266–4269.

  37. 37

    J. K. Nielsen, M. S. Kavalekalam, M. G. Christensen, J. Boldt, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Model-based noise PSD estimation from speech in non-stationary noise, (2018), pp. 5424–5428.

  38. 38

    E. A. P. Habets, Room impulse response generator. Technical report, Technische Universiteit Eindhoven (2010). Ver. 2.0.20100920.

  39. 39

    D. Florencio, Z. Zhang, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. Maximum a posteriori estimation of room impulse responses, (2015), pp. 728–732.

  40. 40

    J. Dmochowski, J. Benesty, S. Affes, On spatial aliasing in microphone arrays. IEEE Trans. Signal Process.57(4), 1383–1395 (2009).

    MathSciNet  Article  Google Scholar 

  41. 41

    E. A. P. Habets, I. Cohen, S. Gannot, Generating nonstationary multisensor signals under a spatial coherence constraint. J. Acoust. Soc. Am.124(5), 2911–2917 (2008).

    Article  Google Scholar 

  42. 42

    K. Han, A. Nehorai, Improved source number detection and direction estimation with nested arrays and ULAs using jackknifing. IEEE Trans. Signal Process.61(23), 6118–6128 (2013).

    MathSciNet  Article  Google Scholar 

  43. 43

    P. Stoica, Y. Selen, Model-order selection: a review of information criterion rules. IEEE Signal Process. Mag.21(4), 36–47 (2004).

    Article  Google Scholar 

  44. 44

    U. Saqib, J. R. Jensen, in Proc. European Signal Processing Conf. Sound-based distance estimation for indoor navigation in the presence of ego noise, (2019), pp. 1–5.

  45. 45

    A. Deleforge, D. Di Carlo, M. Strauss, R. Serizel, L. Marcenaro, Audio-based search and rescue with a drone: highlights from the IEEE signal processing cup 2019 student competition [SP competitions]. IEEE Signal Process. Mag.36(5), 138–144 (2019).

    Article  Google Scholar 

  46. 46

    K. Weisberg, S. Gannot, O. Schwartz, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. An online multiple-speaker DOA tracking using the CappÉ-Moulines recursive expectation-maximization algorithm, (2019), pp. 656–660.

Download references


Not applicable.


This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 871245.

Author information




JRJ and SG designed the idea for the manuscript. JRJ and US conducted the experiments. All the authors contributed to the writing of this work. Moreover, all author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Jesper Rindom Jensen.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Saqib, U., Gannot, S. & Jensen, J. Estimation of acoustic echoes using expectation-maximization methods. J AUDIO SPEECH MUSIC PROC. 2020, 12 (2020).

Download citation


  • TOA estimation
  • DOA estimation
  • Expectation-maximization
  • Active source localization
  • Robot/drone audition
  • Prewhitening