We now proceed to consider the multichannel case, where we have one loudspeaker and multiple microphones. First, we consider a white Gaussian noise scenario similar to Section 3.1 where the noise is independent across the microphones, after which we turn to the more realistic scenarios with correlated noise.
Spatially independent white Gaussian noise
If we first assume that the noise is temporally white Gaussian and independent and the late reverberation is negligible, the signal model in (4) reduces to
$$\begin{array}{*{20}l} \mathbf{y}_{m}(n)=\sum_{r=1}^{R} g_{m,r}\mathbf{s}(n-\tau_{\text{ref},r}-\eta_{m,r})+\mathbf{v}_{m}(n), \end{array} $$
(16)
for m=1,…,M. Subsequently, we can aggregate the observations from all microphones in one model as
$$\begin{array}{*{20}l} \mathbf{y}(n)&=\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\mathbf{v}(n),\\ \mathbf{y}(n)&=\left[\mathbf{y}_{1}^{T}(n) \quad \mathbf{y}_{2}^{T}(n) \quad \cdots \quad \mathbf{y}_{M}^{T}(n)\right]^{T},\notag \end{array} $$
(17)
where v(n) contains the stacked noise terms from all microphones, defined similarly to y(n), and
$$\begin{array}{*{20}l} \boldsymbol{\eta}_{r} &= \left[\eta_{1,r} \quad \eta_{2,r} \quad \cdots \quad \eta_{M,r}\right]^{T},\\ \mathbf{g}_{r} &= \left[g_{1,r} \quad g_{2,r} \quad \cdots \quad g_{M,r}\right]^{T}. \end{array} $$
In addition to this, we note that, under the assumption of spatially independent white Gaussian noise, the covariance matrix, C, of the stacked noise, v(n), is diagonal and given by
$$\begin{array}{*{20}l} \mathbf{C}=\text{diag}\left(\sigma_{v_{1}}^{2}\mathbf{I}_{{N}},\sigma_{v_{2}}^{2}\mathbf{I}_{{N}},\ldots,\sigma_{v_{M}}^{2}\mathbf{I}_{{N}} \right), \end{array} $$
(18)
where diag(·) is the operator constructing a (block) diagonal matrix from the input scalars or matrices, and C is the MN×MN noise covariance matrix. Furthermore,
$$\begin{array}{*{20}l} \mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})=\left[g_{1,r}\mathbf{D}_{\eta_{1,r}}^{T} \quad \cdots \quad g_{M,r}\mathbf{D}_{\eta_{M,r}}^{T} \right]^{T}, \end{array} $$
(19)
and Dη is a circular shift matrix which delays a signal by η samples.
With these definitions, the ML estimator for the problem at hand becomes
$$\begin{array}{*{20}l} \{\widehat{\mathbf{g}},\widehat{\boldsymbol{\tau}},\widehat{\boldsymbol{\eta}}\}=\underset{\mathbf{g},\boldsymbol{\tau},\boldsymbol{\eta}}{\text{argmin}} J(\mathbf{g},\boldsymbol{\tau},\boldsymbol{\eta}), \end{array} $$
(20)
where
$$\begin{array}{*{20}l} J(\mathbf{g},\boldsymbol{\tau},\boldsymbol{\eta})=\left\| \mathbf{y}(n)-\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r}) \right\|^{2}_{\mathbf{C}^{-1}} \end{array} $$
(21)
such that \(\|\mathbf {x}\|_{\mathbf {W}}^{2}=\mathbf {x}^{T}\mathbf {Wx}\) denotes the weighted squared 2-norm of x with weighting matrix W. Moreover, g, τ, and η are the parameter vectors containing all unknown gains, TOAs, and TDOAs, respectively. As in the single-channel case, the ML estimator ends up being high-dimensional and non-convex, resulting in a practically infeasible computational complexity if implemented directly. Therefore, we propose to adopt the EM framework also for the multichannel scenario.
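As a small numerical illustration, the weighted norm in (21) can be evaluated directly; for the diagonal C in (18), it reduces to a variance-weighted sum of squares. The values below are arbitrary toy numbers, not quantities from this work:

```python
import numpy as np

def weighted_norm_sq(x, W):
    """Weighted squared 2-norm ||x||_W^2 = x^T W x."""
    return float(x.T @ W @ x)

# Toy residual vector and diagonal weighting C^{-1}; spatially
# independent white noise makes C diagonal, cf. (18).
rng = np.random.default_rng(0)
x = rng.standard_normal(6)
noise_vars = np.array([1.0, 1.0, 4.0, 4.0, 0.25, 0.25])
C_inv = np.diag(1.0 / noise_vars)

J = weighted_norm_sq(x, C_inv)
# For diagonal C, this is a variance-weighted sum of squares
assert np.isclose(J, np.sum(x**2 / noise_vars))
```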
As in the single-channel approach, we consider the complete data to be all the individual observations of the reflections, but in this case from all M microphones. Each of the observations can thus, for r=1,…,R, be modeled as
$$\begin{array}{*{20}l} \mathbf{x}_{r}(n)=\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\mathbf{v}_{r}(n). \end{array} $$
(22)
The decomposition is assumed to satisfy the conditions in (9)–(11). Then, it can be shown that the EM-algorithm for the multichannel estimation problem is given by
E-step: for r=1,…,R, compute
$$\begin{array}{*{20}l} &\widehat{\mathbf{x}}_{r}^{(i)}(n)=\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{r}^{(i)},\widehat{\mathbf{g}}_{r}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},r}^{(i)}\right) \end{array} $$
(23)
$$\begin{array}{*{20}l} &+\beta_{r}\left[\mathbf{y}(n)-\sum_{k=1}^{R}\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{k}^{(i)},\widehat{\mathbf{g}}_{k}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},k}^{(i)}\right)\right]\notag. \end{array} $$
M-step: for r=1,…,R,
$$\begin{array}{*{20}l} \{\widehat{\mathbf{g}}_{r},\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}^{(i+1)} &=\underset{\mathbf{g},\tau,\boldsymbol{\eta}}{\text{argmin}} J_{r}(\mathbf{g},\tau,\boldsymbol{\eta}), \end{array} $$
(24)
with Jr(g,τ,η) being a weighted least squares cost function defined as
$$\begin{array}{*{20}l} J_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=\left\|\widehat{\mathbf{x}}_{r}^{(i)}(n)-\mathbf{H}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau)\right\|_{\mathbf{C}^{-1}}^{2}. \end{array} $$
(25)
If we explicitly write the cost function, we get
$$\begin{array}{*{20}l} J_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=&\sum_{m=1}^{M}\frac{\|\widehat{\mathbf{x}}_{m,r}(n)\|^{2}}{\sigma_{v_{m}}^{2}}\notag\\ &+\|\mathbf{s}(n-\tau)\|^{2}\sum_{m=1}^{M}\frac{g_{m,{r}}^{2}}{\sigma_{v_{m}}^{2}}\notag\\ &-2\sum_{m=1}^{M} \frac{g_{m,{r}}\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\mathbf{s}(n-\tau). \end{array} $$
(26)
This can be used to simplify the M-step by making a few observations. Clearly, the first term in this expression does not depend on any parameter of interest. Moreover, if we assume that the analysis window is long compared to the length of the known source signal, s(n), we observe that the second term does not depend on either the TOAs or the TDOAs. That is, to estimate these time parameters, we only need to consider the maximization of the last term, i.e.,
$$\begin{array}{*{20}l} \{ \widehat{\tau}_{\text{ref},r},\widehat{\boldsymbol{\eta}}_{r}\} = \underset{\tau,\boldsymbol{\eta}}{\text{argmax}}& \sum_{m=1}^{M}\frac{g_{m,r}\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\\ &\qquad\qquad\qquad\times \mathbf{s}(n-\tau). \notag \end{array} $$
(27)
The gains, gm,r, and the noise statistics, \(\sigma _{v_{m}}^{2}\), are unknown in practice. However, if the noise is assumed (quasi-)stationary, its variance can be estimated from microphone recordings acquired before emitting the known source signal, s(n). By taking the partial derivative of (26) with respect to gm,r and setting it to zero, we obtain the following closed-form estimate for gm,r:
$$\begin{array}{*{20}l} \widehat{g}_{m,r}=\frac{\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\widehat{\eta}_{m}}\mathbf{s}(n-\widehat{\tau}_{\text{ref},r})}{\|\mathbf{s}(n)\|^{2}}. \end{array} $$
(28)
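To make the resulting procedure concrete, one EM iteration combining the E-step (23) with the simplified M-step of (27) and (28) can be sketched as follows. This is an illustrative sketch, not the exact implementation: it assumes equal noise variances at all microphones, integer circular delays (so that the shift matrices act as `np.roll`), and the first microphone as the reference; the helper names (`shift`, `em_iteration`) are ours.

```python
import numpy as np

def shift(s, k):
    """Circular delay of s by k samples (the action of D_k on s)."""
    return np.roll(s, k)

def em_iteration(Y, s, taus, etas, gains, betas):
    """One EM iteration for R reflections observed at M microphones.

    Y: (M, N) observations, s: (N,) probe signal, taus: (R,) TOAs,
    etas: (R, M) TDOAs, gains: (R, M), betas: (R,) residual weights.
    Equal noise variances are assumed, so the 1/sigma^2 weights drop out.
    """
    M, N = Y.shape
    R = len(taus)
    # Current model of each reflection at each microphone
    X_hat = np.array([[gains[r, m] * shift(s, taus[r] + etas[r, m])
                       for m in range(M)] for r in range(R)])
    residual = Y - X_hat.sum(axis=0)
    for r in range(R):
        x_r = X_hat[r] + betas[r] * residual          # E-step, cf. (23)
        # M-step: per-microphone circular cross-correlation peaks
        lags = np.empty(M, dtype=int)
        for m in range(M):
            corr = np.array([x_r[m] @ shift(s, k) for k in range(N)])
            lags[m] = int(np.argmax(corr))
        taus[r] = lags[0]                              # mic 1 as reference
        etas[r] = lags - lags[0]
        for m in range(M):                             # gains, cf. (28)
            gains[r, m] = x_r[m] @ shift(s, taus[r] + etas[r, m]) / (s @ s)
    return taus, etas, gains

# Synthetic check: one reflection, two microphones, known delays
rng = np.random.default_rng(1)
N = 64
s = rng.standard_normal(N)
true_tau, true_eta, g = 5, np.array([0, 3]), 0.8
Y = np.stack([g * shift(s, true_tau + e) for e in true_eta])
Y += 0.01 * rng.standard_normal(Y.shape)
taus, etas, gains = em_iteration(Y, s,
                                 np.array([0]), np.zeros((1, 2), dtype=int),
                                 np.ones((1, 2)), np.array([1.0]))
```

On this synthetic two-microphone example, the per-microphone correlation peaks recover the true TOA, TDOA, and gain.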
If the reflections are assumed to be in the far-field of the array, we can further simplify the estimators. In this case, the gains of reflection r will be the same across all microphones for r=1,…,R. That is, we can instead estimate the TOAs and TDOAs as
$$\begin{array}{*{20}l} \{\widehat{\tau}_{\text{ref},r},\widehat{\boldsymbol{\eta}}_{r}\} \approx \underset{\tau,\boldsymbol{\eta}}{\text{argmax}}& \left(\sum_{m=1}^{M}\frac{\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\right)\\ & \qquad\qquad\qquad \times \mathbf{s}(n-\tau) \notag. \end{array} $$
(29)
The gain estimator can then be reformulated as
$$\begin{array}{*{20}l} \widehat{g}_{r}=\left(\sum_{m=1}^{M}\frac{1}{\sigma_{v_{m}}^{2}}\right)^{-1}\sum_{m=1}^{M}\frac{\widehat{\mathbf{x}}_{m,r}^{T}\mathbf{D}_{\widehat{\eta}_{m}}}{\sigma_{v_{m}}^{2}}\frac{\mathbf{s}(n-\widehat{\tau}_{\text{ref},r})}{\|\mathbf{s}(n)\|^{2}}. \end{array} $$
(30)
If the geometry of the loudspeaker and microphone configuration is known, we can further reduce the dimensionality of the estimation problem. This is achieved by parameterizing the TDOAs, ηm,r, for r=1,…,R and m=1,…,M using the array model, e.g., the one for a UCA configuration formulated in (5). Then, the TOA and TDOA estimator in the M-step can be written as
$$\begin{array}{*{20}l} \{\widehat{\tau}_{\text{ref},r},\widehat{\phi}_{r},\widehat{\psi}_{r}\}\approx & \underset{\tau,\phi,\psi}{\text{argmax}} \left(\sum_{m=1}^{M}\frac{\widehat{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}\right)\notag\\ & \times \mathbf{s}(n-\tau), \end{array} $$
(31)
where ηm is replaced by the expression in (5). In this way, we only need to estimate two angles for each reflection, whereas the estimator in, e.g., (29) requires the estimation of M TDOAs (or M−1 if one of the microphone positions is used as the reference point). That is, the computational benefit of using the array model increases as the number of microphones grows. As we show in the following subsection, the resulting estimator in the M-step has an interesting interpretation as minimum variance distortionless response (MVDR) beamforming followed by matched filtering.
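The angle-parameterized search can be sketched as a simple grid search. Since (5) is not repeated here, the code assumes, purely for illustration, a far-field UCA delay model η_m(φ) = −(ρ f_s/c)cos(φ − 2πm/M) at a fixed elevation, with delays rounded to integer samples; in practice, the exact parameterization from (5) should be substituted.

```python
import numpy as np

def uca_tdoas(phi, M, rho, fs, c=343.0):
    """Illustrative far-field UCA delay model (stand-in for (5)),
    at a fixed elevation; delays rounded to integer samples."""
    mic_angles = 2 * np.pi * np.arange(M) / M
    return np.round(-(rho * fs / c) * np.cos(phi - mic_angles)).astype(int)

def estimate_toa_azimuth(X, s, sigma2, rho, fs, phis, taus):
    """Grid search over (tau, phi) maximizing the objective in (31)."""
    M, N = X.shape
    best, tau_hat, phi_hat = -np.inf, None, None
    for tau in taus:
        for phi in phis:
            etas = uca_tdoas(phi, M, rho, fs)
            score = sum(X[m] @ np.roll(s, tau + etas[m]) / sigma2[m]
                        for m in range(M))
            if score > best:
                best, tau_hat, phi_hat = score, tau, phi
    return tau_hat, phi_hat

# Synthetic check with a known azimuth lying on the search grid
rng = np.random.default_rng(2)
M, N, rho, fs = 4, 128, 0.1, 16000.0
s = rng.standard_normal(N)
true_tau, true_phi = 10, np.pi / 3
X = np.stack([np.roll(s, true_tau + e)
              for e in uca_tdoas(true_phi, M, rho, fs)])
tau_hat, phi_hat = estimate_toa_azimuth(
    X, s, np.ones(M), rho, fs,
    phis=np.linspace(0, 2 * np.pi, 12, endpoint=False), taus=range(N))
```

Note that the search dimension is fixed at (τ, φ) regardless of M, which is the computational advantage discussed above.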
Beamformer interpretation
Intuitively, if we were able to observe the reflections individually in noise, and the noise were differently distributed across the microphones, it would be natural to apply an MVDR beamformer to optimally account for the noise when estimating the TOAs and TDOAs. Let us consider the scenario where we have a filtering matrix, W, which we use to process the individually observed reflections in (22):
$$\begin{array}{*{20}l} \mathbf{z}(n)=\mathbf{W}^{T}\mathbf{x}_{r}(n). \end{array} $$
(32)
Then, we define the residual noise power after this filtering as the normalized sum of the residual noise variances over the different time indices included in z(n), i.e., n,n+1,…,n+N−1. Mathematically, this is equivalent to
$$\begin{array}{*{20}l} \sigma_{v,f}^{2} &= \mathrm{E}\left[\frac{1}{N}\text{Tr}\left\{\mathbf{W}^{T}\mathbf{v}_{r}(n)\mathbf{v}_{r}^{T}(n)\mathbf{W}\right\}\right]\notag\\ &=\frac{\beta_{r}}{N}\text{Tr}\left\{\mathbf{W}^{T}\mathbf{C}\mathbf{W}\right\}, \end{array} $$
(33)
where Tr{·} is the trace operator. By inspection of the individual observation model in (22), we can see that the following expression needs to be satisfied for the filter to be distortionless with respect to the known source signal:
$$\begin{array}{*{20}l} \mathbf{W}^{T}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})=\mathbf{I}_{{N}}. \end{array} $$
(34)
That is, omitting the arguments of the steering matrix H(ηr,gr) for brevity, the problem of finding the MVDR solution for W can be formulated as
$$\begin{array}{*{20}l} \min_{\mathbf{W}} \text{Tr}\left\{\mathbf{W}^{T}\mathbf{C}\mathbf{W}\right\}\quad\text{s.t.}\quad\mathbf{W}^{T}\mathbf{H}=\mathbf{I}_{{N}}. \end{array} $$
(35)
It can be shown that the solution to the quadratic optimization problem with linear constraints is given by
$$\begin{array}{*{20}l} \mathbf{W}_{\mathrm{M}}=\mathbf{C}^{-1}\mathbf{H}\left(\mathbf{H}^{T}\mathbf{C}^{-1}\mathbf{H}\right)^{-1}. \end{array} $$
(36)
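A quick numerical check confirms that (36) satisfies the distortionless constraint (34); the steering matrix is built from scaled circular-shift blocks as in (19), with arbitrary toy gains, delays, and noise variances:

```python
import numpy as np

def shift_matrix(N, k):
    """Circular shift matrix D_k, delaying a length-N signal by k samples."""
    return np.roll(np.eye(N), k, axis=0)

# Toy steering matrix H (MN x N), cf. (19), and diagonal noise
# covariance, cf. (18); all parameter values are arbitrary.
N = 8
gains, etas, noise_vars = [1.0, 0.7, 0.5], [0, 2, 3], [1.0, 2.0, 0.5]
H = np.vstack([g * shift_matrix(N, e) for g, e in zip(gains, etas)])
C = np.diag(np.repeat(noise_vars, N))

# MVDR solution (36)
C_inv = np.linalg.inv(C)
W = C_inv @ H @ np.linalg.inv(H.T @ C_inv @ H)

# The distortionless constraint (34) is satisfied: W^T H = I_N
assert np.allclose(W.T @ H, np.eye(N))
```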
If we then apply the MVDR filtering matrix to the estimated observation of the rth reflection in noise, careful inspection reveals that
$$\begin{array}{*{20}l} {\mathbf{x}}_{r}^{T}(n)\mathbf{W}_{\mathrm{M}}=\frac{\sum_{m=1}^{M}\frac{g_{m}{\mathbf{x}}_{m,r}^{T}(n)\mathbf{D}_{\eta_{m}}}{\sigma_{v_{m}}^{2}}}{\sum_{m=1}^{M}\frac{g_{m}^{2}}{\sigma_{v_{m}}^{2}}}. \end{array} $$
(37)
The denominator is clearly independent of both the TOA and the TDOAs of the rth reflection, so if the objective is to estimate these, we only need to consider the numerator. Interestingly, the numerator resembles the cost function in (27). This reveals the following interpretation of the M-step. First, the individual observations of the reflections are filtered by an MVDR filter, and the resulting output is then processed by a matched filter with the transmitted signal. The TOA and TDOAs that maximize the output power of this operation are then the estimates for the rth reflection. This is in line with the findings in [23–25], where it was shown that the output of an MVDR/LCMV beamformer provides the sufficient statistics for estimating individual signals.
Spatio-temporally correlated noise
We now consider the scenario where the noise is spatio-temporally correlated, as is often encountered in practice. For example, the late reverberation is often modeled as a spatially homogeneous and isotropic sound field [19], resulting in a degree of spatial coherence which depends on the distance between the measurement points. Moreover, there might be interfering, quasi-periodic noise sources in the recording environment, such as human talkers or the ego-noise from a drone or robot. For such scenarios, we can rewrite the model in (4) as
$$\begin{array}{*{20}l} \mathbf{y}(n)=\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\mathbf{d}(n), \end{array} $$
(38)
where
$$\begin{array}{*{20}l} \mathbf{d}(n)=\left[\mathbf{d}_{1}^{T}(n) \quad \mathbf{d}_{2}^{T}(n) \quad \cdots \quad \mathbf{d}_{M}^{T}(n)\right]^{T}. \end{array} $$
(39)
To deal with such scenarios, we can preprocess the observed signals such that the white Gaussian noise assumptions of the EM method are satisfied.
One way to achieve this is to use a spatio-temporal decorrelation technique. Let us consider the correlated noise terms of the model in (4), i.e., dm(n), for m=1,…,M. First, we define the spatio-temporal correlation matrix as
$$\begin{array}{*{20}l} \mathbf{C}_{d}=\mathrm{E}\left[\mathbf{d}(n)\mathbf{d}^{T}(n)\right]. \end{array} $$
(40)
If we assume that this matrix is symmetric and positive definite, its Cholesky factorization is given by
$$\begin{array}{*{20}l} \mathbf{C}_{d}=\mathbf{LL}^{T}, \end{array} $$
(41)
where L is a lower triangular matrix with real and positive diagonal entries. That is, to whiten the noise term before estimating the unknown parameters, we can left-multiply the observation in (38) with L−1 [26]. The prewhitened observations are thus given by
$$\begin{array}{*{20}l} \overline{\mathbf{y}}(n) &= \mathbf{L}^{-1}\mathbf{y}(n)\\ &=\mathbf{L}^{-1}\sum_{r=1}^{R}\mathbf{H}(\boldsymbol{\eta}_{r},\mathbf{g}_{r})\mathbf{s}(n-\tau_{\text{ref},r})+\overline{\mathbf{d}}(n),\notag \end{array} $$
(42)
where \(\overline {\mathbf {d}}(n) = \mathbf {L}^{-1}\mathbf {d}(n)\). Based on this and [22], we end up with the following EM method for estimating the acoustic reflection parameters when the noise is correlated in time and space:
E-step: for r=1,…,R, compute
$$\begin{array}{*{20}l} \widehat{\mathbf{x}}_{r}^{(i)}(n)&=\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{r}^{(i)},\widehat{\mathbf{g}}_{r}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},r}^{(i)}\right)\\ &+\beta_{r}\left[\mathbf{y}(n)-\sum_{k=1}^{R}\mathbf{H}\left(\widehat{\boldsymbol{\eta}}_{k}^{(i)},\widehat{\mathbf{g}}_{k}^{(i)}\right)\mathbf{s}\left(n-\widehat{\tau}_{\text{ref},k}^{(i)}\right)\right]\notag. \end{array} $$
(43)
M-step: for r=1,…,R,
$$\begin{array}{*{20}l} \{\widehat{\mathbf{g}}_{r},\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}^{(i+1)}&=\underset{\mathbf{g},\tau,\boldsymbol{\eta}}{\text{argmin}}\overline{J}_{r}(\mathbf{g},\tau,\boldsymbol{\eta}), \end{array} $$
(44)
where
$$\begin{array}{*{20}l} \overline{J}_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=\left\|\mathbf{L}^{-1}\left(\widehat{\mathbf{x}}_{r}^{(i)}(n)-\mathbf{H}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau)\right)\right\|^{2}. \end{array} $$
(45)
Finally, we can write the cost function for the M-step explicitly as
$$\begin{array}{*{20}l} &\overline{J}_{r}(\mathbf{g},\tau,\boldsymbol{\eta})=\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\mathbf{x}_{r}(n)\notag\\ &\ \ \quad +\mathbf{s}^{T}(n-\tau)\mathbf{H}^{T}(\boldsymbol{\eta},\mathbf{g})\mathbf{C}_{d}^{-1}\mathbf{H}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau)\notag\\ &\ \ \quad -2\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\mathbf{H}(\boldsymbol{\eta},\mathbf{g})\mathbf{s}(n-\tau). \end{array} $$
(46)
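The equivalence between the prewhitened cost (45) and its expanded form (46) is easily verified numerically; the vectors and the symmetric positive definite covariance below are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
MN = 12
x = rng.standard_normal(MN)
Hs = rng.standard_normal(MN)          # stands in for H(eta, g) s(n - tau)
A = rng.standard_normal((MN, MN))
C_d = A @ A.T + MN * np.eye(MN)       # SPD noise covariance
L = np.linalg.cholesky(C_d)

# Prewhitened form, cf. (45)
J_white = np.sum(np.linalg.solve(L, x - Hs) ** 2)
# Expanded quadratic form, cf. (46)
Ci = np.linalg.inv(C_d)
J_expanded = x @ Ci @ x + Hs @ Ci @ Hs - 2 * x @ Ci @ Hs
assert np.allclose(J_white, J_expanded)
```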
Compared with the cost function in (26), the minimization of (46) is more challenging. For example, the second term in (46) will generally depend on the TOA and the TDOAs. However, if we assume the reflections to be in the far-field of the array, we can adopt an iterative estimation scheme, where we first estimate the TOA and TDOAs, then update the TDOAs, and, finally, estimate the gains, i.e., for r=1,…,R:
Step 1: Obtain estimates of the TOA and TDOAs as
$$\begin{array}{*{20}l} \{\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}& = \underset{\tau,\boldsymbol{\eta}}{\text{argmax}} \mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\eta})\mathbf{s}(n-\tau), \end{array} $$
(47)
where
$$\begin{array}{*{20}l} \overline{\mathbf{H}}(\boldsymbol{\eta})&=\left[\mathbf{D}_{\eta_{1}}^{T} \quad \cdots \quad \mathbf{D}_{\eta_{M}}^{T}\right]^{T}. \end{array} $$
Step 2: Update the TDOA estimates as
$$\begin{array}{*{20}l} \widehat{\boldsymbol{\eta}}_{r}=\arg\min_{\boldsymbol{\eta}} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta})+\overline{J}_{3,r}(g_{r},\boldsymbol{\eta}), \end{array} $$
(48)
where
$$\begin{array}{*{20}l} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta}) &=g_{r}^{2}\mathbf{s}^{T}\left(n-\widehat{\tau}_{r}\right)\overline{\mathbf{H}}^{T}(\boldsymbol{\eta})\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\eta})\mathbf{s}(n-\widehat{\tau}_{r}), \end{array} $$
(49)
$$\begin{array}{*{20}l} \overline{J}_{3,r}(g_{r},\boldsymbol{\eta}) &=-2g_{r}\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\eta})\mathbf{s}(n-\widehat{\tau}_{r}). \end{array} $$
(50)
Step 3: Estimate the unknown gain as
$$\begin{array}{*{20}l} \widehat{g}_{r}=\frac{\mathbf{x}_{r}^{T}(n)\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\widehat{\eta}}_{r})\mathbf{s}(n-\widehat{\tau}_{r})}{\mathbf{s}^{T}(n-\widehat{\tau}_{r})\overline{\mathbf{H}}^{T}(\boldsymbol{\widehat{\eta}}_{r})\mathbf{C}_{d}^{-1}\overline{\mathbf{H}}(\boldsymbol{\widehat{\eta}}_{r})\mathbf{s}(n-\widehat{\tau}_{r})}, \end{array} $$
(51)
with the TOA and TDOA estimates from (47) and (48), respectively. If needed, these steps can then be repeated until convergence. It is also possible to simplify the M-step further by using particular signals as the known signal, s(n). By close inspection of the second term, \(\overline{J}_{2,r}\), in (49), we get
$$\begin{array}{*{20}l} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta})&=g_{r}^{2}\sum_{i=1}^{M}\sum_{j=1}^{M} c_{i,j} \\&\qquad \times \mathbf{s}^{T}\left(n-\tau-\eta_{i}\right)\mathbf{s}\left(n-\tau-\eta_{j}\right),\notag \end{array} $$
(52)
where ci,j denotes the (i,j)th element of \(\mathbf {C}_{d}^{-1}\). This reveals that, if the known probe signal is an uncorrelated noise sequence, it is reasonable to assume that this term is independent of both the TOA and the TDOAs, meaning that we can skip the update step in (48).
Kronecker decomposition
Another challenge with the prewhitening-based estimator is the inversion of the noise covariance matrix, Cd, which has the high dimension MN×MN. However, if we assume that the covariance matrix is separable, we can approximate it by the Kronecker product of two smaller matrices [27], i.e.,
$$\begin{array}{*{20}l} \mathbf{C}_{d} \approx \mathbf{C}_{\mathrm{s}}\otimes \mathbf{C}_{\mathrm{t}}, \end{array} $$
(53)
where Cs and Ct represent the spatial and temporal correlation matrices of dimensions M×M and N×N, respectively, and ⊗ denotes the Kronecker product. Since \((\mathbf {C}_{\mathrm {s}}\otimes \mathbf {C}_{\mathrm {t}})^{-1}=\mathbf {C}_{\mathrm {s}}^{-1}\otimes \mathbf {C}_{\mathrm {t}}^{-1}\), we now only need to invert these smaller matrices, which is both numerically and computationally preferable. Moreover, we can now conduct the prewhitening using the Cholesky factorizations of these smaller matrices due to the mixed-product property, yielding
$$\begin{array}{*{20}l} \mathbf{C}_{\mathrm{s}}\otimes \mathbf{C}_{\mathrm{t}} = \mathbf{L}_{\mathrm{s}}\mathbf{L}_{\mathrm{s}}^{T}\otimes \mathbf{L}_{\mathrm{t}}\mathbf{L}_{\mathrm{t}}^{T} = (\mathbf{L}_{\mathrm{s}}\otimes\mathbf{L}_{\mathrm{t}}) (\mathbf{L}_{\mathrm{s}}^{T}\otimes\mathbf{L}_{\mathrm{t}}^{T}). \end{array} $$
(54)
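Both the mixed-product identity (54) and the inverse property underlying the separable approximation can be checked directly with toy covariance factors:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 3, 4
As, At = rng.standard_normal((M, M)), rng.standard_normal((N, N))
Cs = As @ As.T + M * np.eye(M)   # spatial covariance (M x M)
Ct = At @ At.T + N * np.eye(N)   # temporal covariance (N x N)
Ls, Lt = np.linalg.cholesky(Cs), np.linalg.cholesky(Ct)

# Mixed-product property (54): Cs (x) Ct = (Ls (x) Lt)(Ls (x) Lt)^T
Lk = np.kron(Ls, Lt)
assert np.allclose(np.kron(Cs, Ct), Lk @ Lk.T)
# The big inverse follows from the two small inverses
assert np.allclose(np.linalg.inv(np.kron(Cs, Ct)),
                   np.kron(np.linalg.inv(Cs), np.linalg.inv(Ct)))
```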
In other words, by assuming separability, we can approximate L in (41) by Ls⊗Lt. Finally, it can be shown that, for uncorrelated probe signals, the Kronecker product decomposition allows us to rewrite the first step of the M-step in (44) as
Step 1:
$$\begin{array}{*{20}l} \{\widehat{\tau}_{r},\widehat{\boldsymbol{\eta}}_{r}\}&=\underset{\tau,\boldsymbol{\eta}}{\text{argmax}} \mathbf{x}_{r}^{T}(n)\left(\mathbf{C}_{\mathrm{s}}^{-1}\otimes\mathbf{C}_{\mathrm{t}}^{-1}\right)\overline{\mathbf{H}}(\boldsymbol{\eta})\mathbf{s}(n-\tau)\notag\\ &=\underset{\tau,\boldsymbol{\eta}}{\text{argmax}} \text{tr}\left(\mathbf{X}_{r}^{T}(n)\mathbf{C}_{\mathrm{t}}^{-1}\mathbf{S}_{\tau,\boldsymbol{\eta}}(n)\mathbf{C}_{\mathrm{s}}^{-1}\right) \end{array} $$
(55)
$$\begin{array}{*{20}l} &=\underset{\tau,\boldsymbol{\eta}}{\text{argmax}}\sum_{m=1}^{M}\widetilde{\mathbf{x}}_{m,r}^{T}(n)\widetilde{\mathbf{s}}(n-\tau-\eta_{m}) \end{array} $$
(56)
where
$$\begin{array}{*{20}l} \mathbf{X}_{r}(n)&=\left[\mathbf{x}_{1,r}(n) \quad \cdots \quad \mathbf{x}_{M,r}(n)\right], \end{array} $$
(57)
$$\begin{array}{*{20}l} \mathbf{S}_{\tau,\boldsymbol{\eta}}(n)&=\left[\mathbf{D}_{\eta_{1}}\mathbf{s}(n-\tau) \quad \cdots \quad \mathbf{D}_{\eta_{M}}\mathbf{s}(n-\tau)\right],\notag\\ &=\left[\mathbf{s}(n-\tau-\eta_{1}) \quad \cdots \quad \mathbf{s}(n-\tau-\eta_{M})\right], \end{array} $$
(58)
and the vectors \(\widetilde {\mathbf {x}}_{m,r}(n)\) and \(\widetilde {\mathbf {s}}(n-\tau -\eta _{m})\) are the prewhitened observation and probe signals for microphone m, respectively, defined as the mth columns of the following matrices:
$$\begin{array}{*{20}l} \widetilde{\mathbf{X}}_{r}(n)&=\mathbf{L}_{\mathrm{t}}^{-1}\mathbf{X}_{r}(n)\mathbf{L}^{-T}_{\mathrm{s}} \end{array} $$
(59)
$$\begin{array}{*{20}l} \widetilde{\mathbf{S}}_{\tau,\boldsymbol{\eta}}(n)&=\mathbf{L}_{\mathrm{t}}^{-1}\mathbf{S}_{\tau,\boldsymbol{\eta}}(n)\mathbf{L}^{-T}_{\mathrm{s}}. \end{array} $$
(60)
These expressions can be interpreted in the following way. The left multiplication with \(\mathbf {L}_{\mathrm{t}}^{-1}\) corresponds to temporal prewhitening of all the microphone signals, whereas the right multiplication with \(\mathbf {L}_{\mathrm{s}}^{-T}\) corresponds to spatial prewhitening of all time snapshots.
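This interpretation can be verified numerically: applying (Ls⊗Lt)−1 to the stacked MN-vector coincides with the matrix form in (59), by the identity (A⊗B)vec(X) = vec(BXAT) for column-major stacking. The dimensions below are toy values:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 3, 5
As, At = rng.standard_normal((M, M)), rng.standard_normal((N, N))
Ls = np.linalg.cholesky(As @ As.T + M * np.eye(M))
Lt = np.linalg.cholesky(At @ At.T + N * np.eye(N))
X = rng.standard_normal((N, M))   # columns = per-microphone signals

# Matrix form (59): temporal whitening from the left,
# spatial whitening from the right
X_tilde = np.linalg.solve(Lt, X) @ np.linalg.inv(Ls).T

# Equivalent stacked form: (Ls (x) Lt)^{-1} applied to the MN-vector
# obtained by stacking the microphone signals (column-major vec)
x_stacked = X.flatten(order="F")
x_white = np.linalg.solve(np.kron(Ls, Lt), x_stacked)
assert np.allclose(x_white, X_tilde.flatten(order="F"))
```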
Step 2: With the Kronecker decomposition, the second term of the cost function in (49) becomes
$$\begin{array}{*{20}l} \overline{J}_{2,r}(g_{r},\boldsymbol{\eta}) = g_{r}^{2}\text{tr}(\widetilde{\mathbf{S}}_{\tau,\boldsymbol{\eta}}^{T}(n)\widetilde{\mathbf{S}}_{\tau,\boldsymbol{\eta}}(n)). \end{array} $$
(61)
This does not depend on the TOAs and TDOAs, so the Kronecker decomposition allows us to skip the intermediate step of updating the TDOAs as in (48). We can therefore proceed directly to the closed-form estimate of the gains as
$$\begin{array}{*{20}l} \widehat{g}_{r}= \frac{\sum_{m=1}^{M}\widetilde{\mathbf{x}}_{m,r}^{T}(n)\widetilde{\mathbf{s}}(n-\widehat{\tau}_{r}-\widehat{\eta}_{m})}{M\|\widetilde{\mathbf{s}}(n)\|^{2}}. \end{array} $$
(62)
Even after all the presented simplifications and assumptions, the computational complexity of the proposed methods might still be considered relatively high due to their iterative and multidimensional nature. However, although not considered in this paper, we expect that further reductions in the computational complexity can be obtained by employing, e.g., the space alternating generalized expectation (SAGE) algorithm rather than the EM algorithm [28], or through a recursive EM procedure as suggested in [29], where the number of iterations per time instance can be reduced by instead tracking the parameters of interest over time.
Temporal prewhitening with filter
One remaining issue with this prewhitening approach is that the number of time samples might be relatively high in practice. The consequence of this is that, even with the Kronecker decomposition of the noise correlation matrix, the inversion of Lt might be intractable, since its dimensions equal the number of time samples. An alternative approach is to use a lower-order filter for the prewhitening instead [30]. If we assume that the noise follows an autoregressive (AR) model, we can approximate it as
$$\begin{array}{*{20}l} d(n)\approx\sum_{p=1}^{P}a_{p} d(n-p). \end{array} $$
(63)
Given the noise correlation matrix, Ct, we can obtain the AR coefficients of the noise using the Levinson-Durbin recursion. The prewhitening filter is then formed as a Pth-order FIR filter with the negated AR coefficients as taps, hpw(p)=−ap for p=1,…,P. Subsequently, the prewhitened signals are obtained as
$$\begin{array}{*{20}l} \widetilde{x}_{m,r}(n)&=\sum_{p=0}^{P}h_{\text{pw}}(p)x_{m,r}(n-p), \end{array} $$
(64)
$$\begin{array}{*{20}l} \widetilde{s}(n)&=\sum_{p=0}^{P}h_{\text{pw}}(p)s(n-p), \end{array} $$
(65)
where hpw(0)=1.
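The construction can be sketched as follows: a textbook Levinson-Durbin recursion recovers the AR coefficients from the autocorrelation sequence, and the prewhitening FIR filter then has h_pw(0)=1 with negated AR coefficients as the remaining taps, following the sign convention implied by (63). The AR(1) example values are ours:

```python
import numpy as np

def levinson_durbin(r, P):
    """AR coefficients a_1..a_P of d(n) = sum_p a_p d(n-p) + e(n),
    computed from the autocorrelation sequence r[0..P]."""
    a = np.zeros(P)
    err = r[0]
    for p in range(1, P + 1):
        k = (r[p] - a[:p - 1] @ r[p - 1:0:-1]) / err
        a_prev = a[:p - 1].copy()
        a[:p - 1] = a_prev - k * a_prev[::-1]
        a[p - 1] = k
        err *= 1.0 - k ** 2
    return a

# Exact (unit-scale) autocorrelation of an AR(1) process with a_1 = 0.9
P = 2
r = 0.9 ** np.arange(P + 1)
a = levinson_durbin(r, P)            # recovers [0.9, 0.0]

# Prewhitening FIR filter taps: h_pw(0) = 1 and h_pw(p) = -a_p
h_pw = np.concatenate(([1.0], -a))

# Applying the filter to synthetic AR(1) noise flattens its spectrum,
# cf. (64)-(65)
rng = np.random.default_rng(7)
d = rng.standard_normal(1000)
for n in range(1, 1000):
    d[n] += 0.9 * d[n - 1]
d_white = np.convolve(d, h_pw)[:1000]
```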
Covariance estimation
In the previous subsections, we have considered the covariance matrices as known quantities. However, we need to estimate these from the observed data in practice. If no particular structure is assumed for the covariance matrix, a common approach is to use the following estimator [31]
$$\begin{array}{*{20}l} \widehat{\mathbf{C}}_{d}=\frac{1}{N-K+1}\sum_{n=0}^{N-K}\mathbf{d}(n)\mathbf{d}(n)^{T}, \end{array} $$
(66)
where
$$\begin{array}{*{20}l} \mathbf{d}(n)&=\left[\mathbf{d}_{1}^{T}(n) \quad \cdots \quad \mathbf{d}_{M}^{T}(n)\right]^{T}, \end{array} $$
(67)
$$\begin{array}{*{20}l} \mathbf{d}_{m}(n)&=\left[d_{m}(n) \quad \cdots \quad d_{m}(n+K-1)\right]^{T}. \end{array} $$
(68)
As evident from, e.g., (47), the estimated covariance needs to be invertible. This requires that
$$\begin{array}{*{20}l} K\leq\frac{N+1}{M+1}, \end{array} $$
(69)
where K is the length of the temporal subvectors, N is the number of signal samples, and M is the number of microphones. Consequently, when the number of microphones is increased, we can only use relatively short temporal subvectors, dm(n), in the estimation of the covariance matrix.
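The snapshot construction in (66)–(68) and the invertibility condition (69) can be sketched as follows (toy dimensions; `estimate_cov` is our name):

```python
import numpy as np

def estimate_cov(d, K):
    """Sample spatio-temporal covariance, cf. (66)-(68).
    d: (M, N) noise recording; returns the (M*K, M*K) estimate."""
    M, N = d.shape
    # Each snapshot stacks the per-microphone length-K blocks
    snapshots = np.stack([d[:, n:n + K].reshape(-1)
                          for n in range(N - K + 1)])
    return snapshots.T @ snapshots / (N - K + 1)

rng = np.random.default_rng(8)
M, N, K = 2, 23, 4
assert K <= (N + 1) / (M + 1)          # invertibility condition (69)
C_hat = estimate_cov(rng.standard_normal((M, N)), K)
assert C_hat.shape == (M * K, M * K)
# Full rank (generically), hence invertible
assert np.linalg.matrix_rank(C_hat) == M * K
```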
If it is assumed that the multichannel noise samples in d(n) follow a multichannel matrix normal distribution, the maximum likelihood (ML) estimator for the noise covariance matrix can be derived [32]. Unfortunately, the resulting estimator is not in closed form, but it can be implemented using the iterative flip-flop algorithm in Algorithm 1. In some cases, e.g., if one of the covariance matrices is close to being rank deficient, this iterative procedure can be problematic, since their inverses are required. Different approaches for dealing with this and with the computational complexity of the iterative procedure have been considered [31, 33]. Alternatively, a non-iterative estimator can be used, such as [31]
$$\begin{array}{*{20}l} \widehat{\mathbf{C}}_{\mathrm{s}}&=\frac{1}{(N-K+1)\text{tr}\left(\mathbf{C}_{\mathrm{t}}\right)}\sum_{n=0}^{N-K}\mathbf{D}^{T}(n)\mathbf{D}(n), \end{array} $$
(70)
$$\begin{array}{*{20}l} \widehat{\mathbf{C}}_{\mathrm{t}}&=\frac{1}{(N-K+1)\text{tr}\left(\widehat{\mathbf{C}}_{\mathrm{s}}\right)}\sum_{n=0}^{N-K}\mathbf{D}(n)\mathbf{D}^{T}(n), \end{array} $$
(71)
where
$$\begin{array}{*{20}l} \mathbf{D}(n)=\left[\mathbf{d}_{1}(n) \quad \mathbf{d}_{2}(n) \quad \cdots \quad \mathbf{d}_{M}(n)\right]. \end{array} $$
(72)
As indicated in (70), the trace of the temporal covariance matrix is assumed to be known. This might not be the case in practice; however, in most situations, we can simply replace it by an arbitrary value, since its main purpose is to resolve the scaling ambiguity
$$\begin{array}{*{20}l} \mathbf{C}_{d}=\mathbf{C}_{\mathrm{s}}\otimes\mathbf{C}_{\mathrm{t}}=\left(\frac{1}{\alpha}\mathbf{C}_{\mathrm{s}}\right)\otimes(\alpha\mathbf{C}_{\mathrm{t}}). \end{array} $$
(73)
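The non-iterative factor estimates (70) and (71) can be sketched directly, using an arbitrary value for the unknown trace as justified by (73); the function name and dimensions below are ours:

```python
import numpy as np

def kron_cov_factors(d, K, trace_Ct=None):
    """Non-iterative spatial/temporal factor estimates, cf. (70)-(71).
    d: (M, N) noise recording; D(n) is the K x M snapshot matrix of (72)."""
    M, N = d.shape
    Ds = [d[:, n:n + K].T for n in range(N - K + 1)]   # each (K, M)
    if trace_Ct is None:
        trace_Ct = float(K)   # arbitrary scale; resolves the ambiguity (73)
    Cs = sum(D.T @ D for D in Ds) / ((N - K + 1) * trace_Ct)
    Ct = sum(D @ D.T for D in Ds) / ((N - K + 1) * np.trace(Cs))
    return Cs, Ct

rng = np.random.default_rng(9)
Cs, Ct = kron_cov_factors(rng.standard_normal((3, 200)), K=8)
assert Cs.shape == (3, 3) and Ct.shape == (8, 8)
assert np.allclose(Cs, Cs.T) and np.allclose(Ct, Ct.T)
```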
Non-stationary noise
While the stationarity assumption may not hold in practice, there are a number of ways to address this problem. For example, we may reduce the length, N, of the probe signal and the analysis window, which would naturally increase the validity of the assumption. Alternatively, we may decouple the prewhitening and estimation parts, as suggested in Section 4.5. In this way, we may first prewhiten our signal using a filter, and then apply the proposed estimators with a white Gaussian noise assumption on the prewhitened signals. This approach can be exploited to take the non-stationarity of the noise into account by updating the prewhitening filters over time, according to the changing AR coefficients of the noise. Estimating non-stationary noise parameters, however, is more difficult, since the statistics need to be tracked during the presence of the desired signal, i.e., the probe signal and its reflections in our case. This problem has been well-investigated in other audio signal processing problems, such as speech enhancement [34–37].