Estimation of acoustic echoes using expectation-maximization methods

Estimation problems like room geometry estimation and localization of acoustic reflectors are of great interest and importance in robot and drone audition. Several methods for tackling these problems exist, but most of them rely on information about times-of-arrival (TOAs) of the acoustic echoes. These need to be estimated in practice, which is a difficult problem in itself, especially in robot applications which are characterized by high ego-noise. Moreover, even if TOAs are successfully extracted, the difficult problem of echolabeling needs to be solved. In this paper, we propose multiple expectation-maximization (EM) methods, for jointly estimating the TOAs and directions-of-arrival (DOA) of the echoes, with a uniform circular array (UCA) and a loudspeaker in its center for probing the environment. The different methods are derived to be optimal under different noise conditions. The experimental results show that the proposed methods outperform existing methods in terms of estimation accuracy in noisy conditions. For example, it can provide accurate estimates at SNR of 10 dB lower compared to TOA extraction from room impulse responses, which is often used. Furthermore, the results confirm that the proposed methods can account for scenarios with colored noise or faulty microphones. Finally, we show the applicability of the proposed methods in mapping of an indoor environment.


Introduction
During the past decade, there has been an increased research interest in robot and drone audition [1][2][3]. Hearing capabilities enable robots to understand and interact with humans [4]. Moreover, it has also been proven useful for sensing the physical environment. For example, it can be used for estimating the locations of acoustic sources, the position of a robot or drone, and the positions of acoustic reflectors and for inferring room geometry [5,6]. Potentially, this can enable autonomous indoor operation of robots and drones.
Some different approaches for tackling the above estimation problems have already been considered. In a broad sense, these can be classified as being either passive or passive approach, where only relative timing information is available.
The alternative, which we consider in this paper, is the active approach. In this approach, one or more loudspeakers are used to probe the environment using a known signal. Subsequently, a number of microphones are used to record the sound after it has propagated through the environment. Compared to the passive approach, this facilitate the estimation of the times-of-arrival (TOAs) of both the direct and reflected sound components. With this information, the localization accuracy can be increased significantly compared to the passive approach, and the task of acoustic reflector localization becomes less complex. In the following, we briefly outline some of the most recent and relevant work on active approaches. Some authors have considered the problem of estimating both room geometry and a robot's position with a setup consisting of a collocated microphone and speaker pair [10]. To achieve this, they utilize TOA estimates of the first order reflections. The TOAs are assumed known or estimated beforehand. To tackle the estimation problem with the considered single-channel setup (i.e., one microphone and one loudspeaker), they consider multiple observations from different time instances and locations, i.e., movement is assumed. Based on this, they then proposed two different methods: a method based on basic trigonometry, and another one based on Bayesian filtering. A similar approach also based on a priori RIR/TOA knowledge was considered using a multichannel setup in the context of robotics in [11]. Other authors considered an approach where the TOAs of the first order echoes are utilized for estimating the arbitrary convex room shapes [12]. As briefly mentioned, these as well as other active approaches do not consider the TOA estimation problem, which is an equally important and difficult problem in itself due to, e.g., spurious estimates [13]. Moreover, methods relying on first-and second-order reflections only suffer from the inevitable problem of echolabeling [14]. In addition to this, many methods are based on only one microphone and one loudspeaker, but this lead to ambiguity in the mapping of the TOA estimates of the first-order reflections unless more transducers are included or movement is exploited.
These issues will be addressed in this paper, where we consider a setup consisting of a microphone array which is collocated with a single loudspeaker. More specifically, we consider a uniform circular array that could be placed on the perimeter of, e.g., a drone or robot platform, with a loudspeaker located in its center. With this setup in mind, we propose a number of expectation-maximization (EM) methods for estimating both the TOAs and directionsof-arrival (DOA) of a number of the acoustic reflections. This has the benefit of not only yielding more accurate TOAs compared to a single-channel approach, but also of reducing the ambiguity of the estimated reflections since the DOA is estimated simultaneously. In fact, this means that the estimates directly reveal the locations of mirror sources, which greatly simplifies the task of localizing the acoustic reflector positions. The proposed methods are derived in the time-domain, and, thus, estimates the parameters of interest directly from the recorded signals, i.e., not from estimated room impulse responses as in numerous state-of-the-art methods. While joint TOA and DOA estimation is a new topic in the context of robot and drone audition, it has been considered previously in multiuser and multipath communication systems [15][16][17]. However, it has not yet been considered for acoustic reflector localization to the best of our knowledge. The paper builds on the results reported in our earlier paper [18] and extends on this work in several ways. First, we relax our previous noise assumptions and derive the optimal estimators for these more realistic scenarios. The first scenario deals with spatially independent white Gaussian noise with different noise variances across the microphones, e.g., to simulate low quality or faulty microphones. The second scenario considered deals with spatio-temporarily correlated noise, which we tackle using prewhitening. Here, we include different approaches for the prewhitening. Moreover, we have included a beamformer interpretation of one of the proposed multichannel estimators, which provides an intuitive understanding of the EM-based method. In addition to this, we included further experimental work to show case the merits of the different proposed estimators and how they compare with traditional methods.
The rest of the paper is organized as follows. In Section 2, we propose the signal model for the considered setup along with a problem formulation. Then, in Section 3, we briefly revisit the single-channel EM method for TOA estimation, which serves as our reference method. Inspired by this, we then proceed with the derivation of the different TOA and DOA estimators in Section 4. Finally, the paper closes with the experimental results and conclusions in Sections 5 and 6, respectively.

Problem formulation
We now proceed to lay the foundation for the derivation of EM-based methods for estimating the TOA and TDOA of the acoustic echoes. This is done by formulating the relevant temporal and spatial signal models.

Time-domain model
Consider a setup with a single loudspeaker and M microphones that are assumed to be collocated on some hardware platform, e.g., a mobile robot or a drone. The loudspeaker is used to probe the environment with a known sound while the microphones are used to record the sound emitted by the loudspeaker including its acoustic reflections from physical objects and boundaries, e.g., walls. Both the microphones and loudspeakers are assumed to be omnidirectional and ideal. While this assumption might not hold in practice, we do not consider the handling of non-ideal characteristics in this paper. As suggested in other work [5], this might be partly addressed by estimating and introducing another filter accounting for the hardware characteristics, which may also be included in the methods proposed later. Moreover, the non-ideal characteristic of the hardware, i.e., loudspeakers could be modeled as shown in [5], but this is not included when formulating the following estimator. We can then formulate a general model for the signal recorded by microphone m, for m = 1, . . . , M, as where, x m (n) = h m * s(n), h m is the acoustic impulse response as measured from the loudspeaker to the mth microphone, and s(n) is a known signal being played back by the loudspeaker. Finally, v m (n) is an additive noise term, which is supposed to model ego-noise from a robot/drone platform, interfering sound sources (e.g., human speakers), thermal sensor noise, etc. , that is, the signal s(n) is used to probe the environment to, eventually, facilitate the estimation of the parameters of the acoustic echoes, such as their TOA and TDOA. Thus, we proceed by rewriting the observation model as a sum of the individual reflections 1 in noise, i.e., with g m,r being the attenuation of the rth reflection from the loudspeaker to the mth microphone, e.g., due to the inverse square law for sound propagation and sound 1 In our definition, the direct-path component is one of the reflections, i.e., the 0th order reflection corresponding to r = 1. Acoustic impulse responses often exhibit a certain structure, which can be characterized by two parts: the early part, which is sparse in time and contains the directpath and early reflections, and the late part, which is a more stochastic, dense, and characterized by decaying tail of late reflections (Fig. 1). This suggests that we can split the model as [19] y m (n) = where R is the number of early reflections, and d m (n) is the late reverberation. A common assumption is that the late reverberation can be modeled as a spatially homogeneous and isotropic sound field with time-varying power but known coherence function [20]. If we collect N samples from each microphone and assume stationarity within the corresponding time frame, the vector model for our observations becomes: However, if we know the geometry of the loudspeaker and microphone array configuration, we can significantly reduces the dimensionality of this problem by further parametrizing the TDOAs in terms of the directions-of-arrival (DOAs).

Array model
While the array model can in principle be chosen arbitrarily, we choose to exemplify the TDOA modeling with a setup where the loudspeaker is placed in the center of a uniform circular array (UCA). Such a setup could be placed on, e.g., a robot or drone platform to enable the estimation of the angle of and distance to acoustic reflectors, e.g., to facilitate autonomous and sound-based navigation.
If we assume the reference point to be the center of the UCA, it can be shown that the TDOAs, for a setup like this, can be modeled as where d is the radius of the UCA, ψ r and φ r are the inclination and azimuth angles of the rth reflection, respectively, and θ m is the angle of the mth microphone on the circle forming the UCA. These definitions are illustrated in the UCA example in Fig. 2. In addition to this, f s is the sampling frequency, and c is the speed of sound. The TDOA model in (5) can then be combined with the observation model in (4). By doing this, the estimation problem at hand is then simplified to the estimation of 2R angles, i.e., ψ r and φ r , for r = 1, . . . , R, rather than MR TDOAs. It should be noted here that the considered UCA configuration introduces ambiguities, e.g., an acoustic reflection impinging from an elevation of 0 • will result in the same TDOAs as an acoustic reflection mirrored around the UCA plane, i.e., at an elevation angle of 180 • . However, this ambiguity can easily be accounted for by applying the proposed methods on array structures with microphones in all three dimensions, e.g., spherical microphone arrays [21].

Single-channel estimation
Before presenting the proposed TOA and TDOA estimators, we briefly revisit an EM-based method for singlechannel TOA estimation, i.e., that is with a setup consisting of one loudspeaker and one microphone. The original version of this method was proposed in [22] under a white Gaussian noise assumption and serves as a reference for the proposed methods.

White Gaussian noise
In the following, we leave out the microphone index, i.e., subscript m, since only a single microphone is considered. We assume that the additive noise, i.e., both the late reverberation and the background noise is independent and identically distributed white Gaussian and zero-mean. Later, as part of the proposed multichannel methods, this assumption is substituted with a more realistic one, where the late reverberation is modeled as being spatio-temporarily correlated. The signal model in (4) then reduces to where v(n) is distributed as N (0, C), with 0 being a vector of zeroes, v is its variance, I N denotes the N × N identity matrix, and E[ ·] is the mathematical expectation operator. The maximum likelihood (ML) estimator of the unknown parameters, i.e., the gains and the TOAs, is well known to be the nonlinear least squares (NLS) criterion in this case, i.e., where While this estimator is statistically efficient, it also requires computationally costly search since the cost function is high-dimensional and non-convex with respect to the TOAs. A computationally more efficient way of implementing this estimator could be to adopt the expectationmaximization (EM) approach for superimposed signals proposed in [22]. The concept behind this approach is to define the complete data as the observation of all individual signals, i.e., each of the individual early reflections in our case. According to the previously stated signal model in (4), the individual observations can be modeled as Moreover, the observed signal can be written as the sum of individual observations such as: Following [22], we let the individual noise terms be independent, zero-mean, white Gaussian, and distributed as N (0, β r C). Furthermore, the scaling factors, β r are nonnegative, real-valued scalars that satisfy Under these assumptions, it can be shown that the EM algorithm for estimating the gains and the time-of-arrivals is given by [22] E-step: for r = 1, . . . , R, compute

M-step:
where (i) is denoting the iteration index. If the length, N, of the analysis window is long compared to the length of the known signal, s(n), the M-step can be simplified as We see that the estimation problem has been greatly simplified with this signals decomposition, since we now have 2R one-dimensional estimators rather than a 2Rdimensional estimator as in (7). From this simplified version of the M-step, we can make some interesting interpretations. First in (14), the individual observations are applied with a matched filter based on the known source signal. The TOA is estimated as the one maximizing the output power of the matched filter. Secondly, the estimated TOAs are used to obtain closed-form estimated of the gains in (15), which is based on a least squares fit between the known source signal and the estimated contribution of the rth component.

Multichannel estimation
We now proceed to consider the multichannel case, where we have one loudspeaker and multiple microphones. First, we consider a white Gaussian noise scenario similar to Section 3.1 where the noise is independent across the microphones, after which we turn to the more realistic scenarios with correlated noise.

Spatially independent white Gaussian noise
If we first assume that the noise is temporally white Gaussian and independent and the late reverberation is negligible, the signal model in (4) reduces to for m = 1, . . . , M. Subsequently, we can aggregate the observations from all microphones in one model as where v(n) is the stacked noise terms from each microphone defined similarly to y(n), and η r = η 1,r η 2,r · · · η M,r T , In addition to this, we note that, under the assumptions of spatial independent white Gaussian noise, the covariance matrix, C of the stacked noise, v(n) is diagonal and given by where diag(·) is the operator constructing a diagonal matrix from the input of scalars(/matrices) and C is the MN × MN covariance matrix. Furthermore, and D η is a circular shift matrix which delays a signal by −η samples. With these definitions, the ML estimator for the problem at hand becomes where such that x 2 W = x T Wx, where W denotes the weighted 2-norm of x. Moreover, g, τ , and η are the parameter vectors containing all unknown gains, TOAs and TDOAs, respectively. In the single-channel case, the ML estimator ends up being high-dimensional and non-convex, resulting in a practically infeasible computational complexity if implemented directly. Therefore, we propose to adopt the EM framework also for the multichannel scenario.
Like in the single-channel approach, we consider the complete data to be all the individual observations of the reflections, but in this case from all the M microphones. Each of the observations can thus, for r = 1, . . . , R, be modeled as The decomposition is assumed to satisfy the conditions in (9)- (11). Then, it can be shown that the EM-algorithm for the multichannel estimation problem is given by with J r (g, τ , η) being a weighted least squares estimator defined as If we explicitly write the cost function, we get This can be used to simplify the M-step by making a few observations. Clearly, the first term in this expression does not depend on any parameter of interest. Moreover, if we assume that the analysis window is long compared to the length of the known source signal, s(n), we observe that the second term does not depend on either the TOAs or the TDOAs. That is, to estimate these time parameters, we only need to consider the maximization of the last term, i.e., The gains, g m,r , and the noise statistics, σ 2 v m , are unknown in practice. However, if the noise is assumed (quasi-)stationary, its variance can be estimated from microphone recordings acquired before emitting the known source signal, s(n). By taking the partial derivative of (26) with respect to g m,r , we obtain the following closed-form estimate for g m,r If the reflections are assumed to be in the far-field of the array, we can further simplify the estimators. In this case, the gains of reflection r will be the same across all microphones for r = 1, . . . , R. That is, we can instead estimate the TOAs and TDOAs as Subsequently, the gain estimator can then be reformulated as If the geometry of the loudspeaker and microphone configuration is known, we further reduce the dimensionality of the estimation problem. This is achieved by parameterizing the TDOAs, η m,r , for r = 1, . . . , R and m = 1, . . . , M using the array model, e.g., the one for a UCA configuration formulated in (5). Then, the TOA and TDOA estimator in the M-step can be written as where η m is replaced by the expression in (5). In this way, we only need to estimate two angles for each reflection, whereas the estimator in, e.g., (30) requires the estimation of M TDOAs (or M − 1 if one of the microphone positions is used as the reference point). That is, the computational benefits of using the array model increases as we increase the number of microphones. It can be shown that the resulting estimators in the M-step has an interesting interpretation as minimum variance distortionless response (MVDR) beamforming followed by a matched filter as we show in the following subsection.

Beamformer interpretation
Intuitively, if we were able to observe the reflections individually in noise and the noise is differently distributed across the microphones, then it would be natural to apply an MVDR beamformer to these to optimally account for the noise when estimating the TOAs and TDOAs. Let us consider the scenario where we have a filtering matrix, W, which we use to process the individually observed reflections in (22): Then, we define the residual noise power after this filtering as the normalized sum of the residual noise variances over the different time indices included in z(n), i.e., n, n + 1, . . . , n + N − 1. Mathematically, this is equivalent to where Tr{·} is the trace operator. Obviously, by inspection of the individual observation model in (22), we can see that the following expression needs to be satisfied for the filter to be distortionless with respect to the known source signal: That is, omitting the arguments of the steering matrix H(η r , g r ) for brevity, the problem of finding the MVDR solution for W can be formulated as It can be shown that the solution to the quadratic optimization problem with linear constraints is given by If we then apply the MVDR filtering matrix to the estimated observation of the rth reflection in noise, careful inspection reveals that The denominator is clearly independent of either the TOA or the TDOAs of the rth reflection, so if the objective is to estimate these, we only need to consider the numerator. Interestingly, the numerator resembles the first part of the cost function in (28). This reveals the following interpretation of the M-step. First, the individual observations of the reflections are filtered by an MVDR filter, and the resulting output is then processed by a matched filter with the transmitted signal. The TOA and TDOAs that maximizes the output power of this operation are then the estimates for the rth reflection. This is in line with the findings in [23][24][25], where it was shown that the output of an MVDR/LCMV beamformer provide the sufficient statistics for estimating individual signals.

Spatio-temporarily correlated noise
We now consider the scenario, where the noise is spatiotemporarily correlated, a scenario practically encountered. For example, the late reverberation is often modeled as spatially homogeneous and isotropic sound field [19], resulting in a degree of spatial coherence which is dependent on the distance between the measurement points. Moreover, there might be interfering, quasi-periodic noise sources in the recording environment, like human talkers, ego-noise from a drone/robot, etc. For such scenarios, we can rewrite the model in (4) as where To deal with scenarios like this, we can preprocess the observed signals, such that the white Gaussian noise assumptions of the EM method is satisfied. One way to achieve this is to use spatio-temporal decorrelation technique. Let us consider the correlated noise terms of the model in (4), i.e., d m (n), for m = 1, . . . , M. First, we define the spatio-temporal correlation matrix as If we assume that this matrix is Hermitian and positive definite, the Cholesky factorization of it is given by where L is a lower triangular matrix with real and positive diagonal entries. That is, to whiten the noise term before estimating the unknown parameters, we can left-multiply the observation in (38) with L −1 [26]. The prewhitened observations are thus given by where d(n) = L −1 d(n). Based on this and [22], we end up with the following EM method for estimating the acoustic reflection parameters when the noise is correlated in time and space: E-step: for r = 1, . . . , R, compute where Eventually, we can explicitly write the cost function for the M-step as Compared with the cost function in (26), the minimization of (46) is more challenging. For example, the second term in (46) will generally depend on the DOA/TDOAs. That is, if we assume the reflections to be in the far-field of the array, we can adopt an iterative estimation scheme, where we first estimate the TOA and TDOAs, then update the TDOAs, and, finally, estimate the gains, i.e., for r = 1, . . . , R: Step 1: Obtain estimates of the TOA and TDOAs as where Step 2: Update the TDOA estimates as where Step 3: Estimate the unknown gain as with the TOA and TDOA estimates from (47) and (48), respectively. If needed, these steps can then be repeated until convergence. It is also possible to simplify the M-step further by using particular signals as the known signal, s(n). By close inspection of the second term of the cost function in (48), we get where c i,j denotes the (i, j)th element of C −1 d . This reveals that, if the known probe signal is an uncorrelated noise sequence, it is reasonable to assume that this term is independent of both the TOA and the TDOAs, meaning that we can skip the update step in (48).

Kronecker decomposition
Another challenge with the prewhitening based estimator is the inversion of the noise covariance matrix, C d , which has a high dimension of NM × NM. However, if we assume that the covariance matrix is separable, we can approximate it with two smaller matrices [27], i.e., where C s and C t represents the spatial and temporal correlation matrices of dimensions M × M and N × N, respectively, and ⊗ denotes the Kronecker product operator. Since (C s ⊗ C t ) −1 = C −1 s ⊗ C −1 t , we now only need to invert these smaller matrices, which is both numerically and computationally preferable. Moreover, we can now conduct the prewhitening using the Cholesky factorization of these smaller matrices due to the mixed-product property, yielding (54) In other words, by assuming separability, we can approximate L in (41) by L s ⊗ L t . Eventually, it can be shown that, for uncorrelated probe signals, the Kronecker product decomposition allows us to rewrite the first step of the M-step in (44) as Step 1: and the vectors x m,r (n) and s(n − τ − η m ) are the prewhitened observation and probe signals for microphone m, respectively, defined as the mth columns of the following matrices: These expressions can be interpreted in the following way. The left hand multiplication with L −1 t corresponds to temporal prewhitening of all the microphone signals, whereas the right hand multiplication with L −T s corresponds to spatial prewhitening of all time snapshots.
Step 2: With the Kronecker decomposition, the second term of the cost function in (49) becomes This does not depend on the TOAs and TDOAs, so the Kronecker decompositions allow us to skip the intermediate step of updating the TDOAs as in (48). We can therefore directly proceed to conducting the closed form estimate of the gains as Even after all the presented simplifications and assumptions, the computational complexity of the proposed methods might still be considered relatively high due to their iterative and multidimensional nature. However, although not considered in this paper, we expect that further reductions in the computational complexity can be obtained by employing, e.g., the space alternating generalized expectation (SAGE) algorithm rather than the EM algorithm [28], or through a recursive EM procedure as suggested in [29], where the number of iterations per time instance can be reduced by instead tracking the parameters of interest over time.

Temporal prewhitening with filter
One issue with this prewhitening approach still is that the number samples in time might be relatively high in practice. The consequence of this is that, even with the Kronecker decomposition of the noise correlation matrix, the inversion of L t might be intractable in practice since its dimensions equal the number of time samples. An alternative approach could be to use a lower order filter for the prewhitening instead [30]. If we assume that the noise follows an autoregressive model, we can approximate it as: Given the noise correlation matrix, C t , we can obtain the AR coefficients of the noise using the Levinson-Durbin recursion. The prewhitening filter is then formed using the AR coefficients as the coefficients of a Pth order FIR filter, h pw (p) = a p . Subsequently, the prewhitened signals are obtained as where h pw (0) = 1.

Covariance estimation
In the previous subsections, we have considered the covariance matrices as known quantities. However, we need to estimate these from the observed data in practice. If no particular structure is assumed for the covariance matrix, a common approach is to use the following estimator [31] where As evident from, e.g., (47), the estimated covariance needs to be invertible. This requires that where K is the number of snapshots, N is the number of samples of the signal, and M is the number of microphones. Consequently, we can only use relatively short temporal subvectors, d m (n) in the estimation of the covariance matrix when the number of microphones is increased.  [32]. Result: Estimates of temporal and spatial covariance matrices, C t and C s .
If it is assumed that the multichannel noise samples in d(n) follows a multichannel matrix normal distribution, the maximum likelihood (ML) estimator for the noise covariance matrix can be derived [32]. Unfortunately, the resulting estimator is not closed form, but it can be implemented using the iterative flip-flop algorithm in Algorithm 1. In some cases, e.g., if one of the covariance matrices are close to being rank deficient, this iterative procedure can be problematic, since their inverses are required. Different approaches for dealing with this and the computational complexity of the iterative procedure have been considered [31,33]. Alternatively, a non-iterative estimator can be used such as [31] where As indicated in (70), the trace of the temporal covariance is assumed to be known. This might not be the case in practice; however, in most situations, we can simply replace it by an arbitrary value, since its main purpose is to resolve the ambiguity

Non-stationary noise
While the stationarity assumption may not hold in practice, there are a number of ways to address this problem. For example, we may reduce the length, N, of the probe signal and the analysis window, which would naturally increase the validity of the assumption. Alternatively, we may decouple the prewhitening and estimation parts, as suggested in Section 4.5. In this way, we may first prewhiten our signal using a filter, and then apply the proposed estimators with a white Gaussian noise assumption on the prewhitened signals. This approach can be exploited to take the non-stationarity of the noise into account by updating the prewhitening filters over time, according to the changing AR coefficients of the noise. Estimating non-stationary noise parameters, however, is more difficult, since the statistics need to be tracked during the presence of the desired signal, i.e., the probe signal and its reflections in our case. This problem has been well-investigated in other audio signal processing problems, such as speech enhancement [34][35][36][37].

Results and discussion
In this section, we investigate the performance of the different variants of the proposed EM method. More specifically, we consider the variant assuming spatially independent white Gaussian in Section 4.1 resulting in noise variance weighting (EM-UCA-NW), and its special case where the noise variance is assumed equal (EM-UCA) [18]. Moreover, we consider the setup with correlated noise proposed in Section 4.3 resulting in the prewhitening-based approach (EM-UCA-PW). The experiments were carried out using signals that were generated using the room impulse response generator [38]. The dimensions of the simulated room were set to 8 × 6 × 5 m, the reverberation time (T 60 ) was set to 0.6 s while the speed of sound is fixed at 343 m/s. The loudspeaker was positioned at the center of an UCA at (1 × 1.5 × 2.5) m while the UCA has M = 4 microphones with a radius of d = 0.2 m. Although, any type of known broadband signal could be used to probe the environment, such as a chirp signal or maximum length sequences (MLS) [39], we decided to use a white Gaussian noise sequence as the known sound source, s(n), consisting of 1,500 samples from a Gaussian distribution. This sequence was subsequently zero-padded to get a total signal length of 20,000 samples. The objective of the zero-padding was to get a longer analysis window to ensure that the first few reflections are present in the observation. Moreover, as discussed in Section 4.3, the reason for using a WGN sequence is that the EM estimator can be simplified if the probe signal is an uncorrelated signal. In addition to this, using such a broadband sequence minimizes the effects of spatial aliasing [40].
The sampling frequency f s was set to 22,050 Hz. We assumed that the direct component is subtracted from the observed signal given that we know the arrangement of the loudspeaker and the microphones. Knowing the array geometry enables either offline measurement of the impulse response of the direct-path component offline or analytical computation of the impulse response of the direct-path component based on the geometry. The background noise comprises of two components: one being (2020) 2020:12 Page 11 of 15 diffuse spherical noise and the other being thermal sensor noise. The diffuse spherical noise was generated using the method described in [41] using the rotor noise of a drone from the DREGON database [3]. The drone audio file used to generate the diffuse spherical noise corresponds to rotors running at 70 revolutions per second (RPS). The thermal sensor noise was simulated as spatially independent white Gaussian noise. Both these noises were added to the observed signal before estimating the parameters. The evaluation was then conducted for different signalto-diffuse noise ratios (SDNRs) and signal-to-sensor noise ratios (SSNRs). In the following subsections, we evaluate the performance of our propose method in various conditions.

Comparison of with state-of-the-art
The aim of the first experiment was to compare the proposed method with existing state-of-the-art methods.
The EM algorithm was set to estimate R = 3 reflections with 40 iterations and β was set to 1 R . The main application for this manuscript is acoustic reflector mapping for robot audition. For this application, the mapping should be possible in unknown, complex environments, and we therefore do not rely on trivial room geometry models as opposed to many of the traditional methods for room geometry estimation [10][11][12]. Therefore, we chose to use a small number of reflections in the estimation (i.e., R = 3), to mainly estimate the TOAs/DOAs of first-order reflections impinging from nearby acoustic reflectors. These can be directly mapped to acoustic reflector positions based on the estimated time and angle of arrival. While this will not facilitate the localization of all acoustic reflectors at any given time instance, we can carry out such estimation over time and space, to generate a map of an arbitrary room geometry (see Section 5.4). An alternative to choosing a fixed reflection order would be to combine the proposed method with order estimation methods [42,43]. To initialize the method, the gain estimates, g m,r , were sampled from a uniform distribution over the interval [ 0; 1], the TOAs, τ 1,r , were sampled from a uniform discrete distribution over the time indices corresponding to the analysis window, and the DOAs, φ r , were sampled from a uniform distribution over the interval [ 0 • ; 360 • ]. After emitting and recording the known source signal, an analysis window of each recording was considered starting from τ min samples to τ max samples after the source signal was emitted. In this experiment, the analysis window was set such that the search is made between 0.5 to 2 m. This was done to primarily capture the first order reflections. The lower bound was chosen because we can only search for reflectors that are outside the geometry of the array, which, in our experiments, had a radius of 0.2 m. After 2 m, the performance of the proposed method degrades because the energy of the reflected signals decrease quadratically over distance, which motivated the choice of the upper limit.
The proposed EM method (EM-UCA) was compared to the single-channel EM method (EM-SC) in [22] in terms of TOA accuracy, applied to the first microphone. Moreover, these were compared with a common approach to extracting TOAs from estimated RIR through peakpicking (RIR-PP). Finally, the performance was also compared with our previous work [44] termed the non-linear least squares estimator (NLS). The results for the TOA estimation are shown in Fig. 3, where the accuracy was defined as the percentage of TOA estimates that were within ± 2% tolerance of one of the true parameters of the first-order reflections computed using the image-source method. This was measured for different SDNRs while the SSNR was fixed to 10 dB, and for each SDNR, the accuracy was measured over 100 Monte-Carlo simulations. As seen in Fig. 3, the proposed method clearly outperforms the existing method by providing higher accuracy at lower SDNRs.
Furthermore, the computation time of the RIR-PP and the proposed method, EM-UCA, were measured. This test was performed in MATLAB using the built-in function timeit on a standard desktop computer running a Microsoft Windows 10 operating system with an Intel Core i7 CPU with 3.40 GHz processing speed and 16 GB of RAM. A Monte Carlo simulation with 100 trials was performed on each method and an average time was calculated. The measured computation times of the RIR-PP and the EM-UCA were 0.0063 s and 25.74 s, respectively, for R = 1 and an SDNR of 40 dB. This shows that the improved estimation accuracy with the proposed method comes at the cost of a higher computational complexity. It is important to stress, however, that in applications such as acoustic reflector localization with a drone, it is common to have negative SNR conditions [45], where the RIR-PP method may fail to provide accurate estimates as opposed to the proposed method (see, e.g., Fig. 3). Moreover, the computational cost could be reduced further by, e.g., employing the recursive EM approach [29,46]. If the TOA/DOA estimation is carried out continuously over time and space, the EM algorithm may be initialized using previous estimates, which may significantly reduce the number of iterations needed for convergence. Another potential computational saving may be obtained by deriving the proposed methods in the frequency domain.

Evaluation for different diffuse noise conditions
In the second experiment, we evaluated the effect of the proposed prewhitening approach under different diffuse noise conditions. To test the performance of the EM algorithm under such realistic scenarios, we test our estimator for different SDNRs in the interval [ − 40; 10] dB while setting the SSNR to 40 dB. Here, we are comparing the EM algorithm with and without the prewhitening in terms of both TOA and DOA estimation accuracy as seen in Figs. 4 and 5, respectively. The diffuse rotor noise is indeed correlated with strong periodic components, but the results show that the proposed prewhitening approach can successfully account for this and can retain a high estimation accuracy at SDNRs levels 20 dB lower than those needed for the EM-UCA approach.

Evaluation for faulty/noisy microphone conditions
In this experiment, we consider a scenario where one microphone is excessively noisy compared to the other microphones. An example of this could be a robot platform, where one microphone is placed closer to an egonoise source such as a fan, leading to TOA and DOA estimation errors. To simulate this effect, we set thermal noise of a single microphone to an SSNR level of − 10 dB, while the thermal noise of the remaining microphones are set to an SSNR level of 40 dB. As seen in Figs. 6 and 7, the performance of the EM algorithm with noise variance weighting is less affected by the high thermal sensor noise in terms of both TOA and DOA estimation accuracy. Moreover, we conducted an experiment without diffuse noise, where the SSNR level of the faulty microphone was changed from − 40 to 0 dB. These results are shown in Figs. 8 and 9, and show that the estimation accuracy is already degrading from 0 dB SSNR and downwards when using the EM-UCA approach, whereas the proposed EM-UCA-NW approach retains a high accuracy.

Application example of the proposed method
We consider an application example where the localization of the acoustic reflectors is done using the proposed EM method with and without prewhitening. More specifically, we have used filter-based prewhitening approach as discussed in Section 4.5. This experiment thus shows how the proposed method can be used to map an environment using a moving robot platform. The room parameters were kept the same as the earlier experiment. Furthermore, the SDNR was set to − 10 dB corresponding to a strong ego-noise. The loudspeaker-microphone arrangement was similar to the previous experiments and follows the a predefined path as shown in Fig. 10 indicated by the

Conclusion
In this paper, we consider the problem of estimating the time-and direction-of-arrivals of acoustic echoes using a loudspeaker emitting a known source signal and multiple microphones. Among other examples, this is an important problem in robot and drone audition, where these parameters can reveal the positions of nearby acoustic reflectors and thus facilitate mapping and navigation of a physical environment. Some methods exist for solving the problems of acoustic reflector localization and room geometry estimation; however, most of these rely on a priori information, e.g., of the TOAs or DOAs of the acoustic echoes. However, estimating these is a difficult problem on its own, which is dealt with by the methods proposed herein. Moreover, even when the TOAs are estimated for some of the traditional approaches, the difficult problem of echolabeling needs to be solved, since the order of the corresponding reflection is generally unknown. We therefore propose different methods for estimating, not only the TOAs, but also the DOAs of acoustic echoes. By estimating the DOAs also, it is possible to resolve some of the ambiguity introduced by knowing only the TOAs. The proposed method is based on the expectationmaximization framework and are derived to be optimal under different conditions ranging from the simple white Gaussian noise scenario to scenarios with correlated and colored noise. In the experiments, we show that proposed methods are able to estimate the TOAs and DOAs with higher accuracy and noise robustness compared to existing methods. Moreover, we show that some of the proposed variants can account for colored noise and scenarios where a microphone is faulty or more noisy than the other microphones of the array. Finally, we conducted a more applied experiment, where it is illustrated how a room can be mapped from the estimated parameters, which is relevant to, e.g., autonomous robot and drone applications. While the proposed method has a higher computation time than traditional methods, this can be reduced significantly by adopting the recursive EM scheme and deriving the proposed methods in the frequency domain.