
Geometry calibration in wireless acoustic sensor networks utilizing DoA and distance information

Abstract

Due to the ad hoc nature of wireless acoustic sensor networks, the position of the sensor nodes is typically unknown. This contribution proposes a technique to estimate the position and orientation of the sensor nodes from the recorded speech signals. The method assumes that a node comprises a microphone array with synchronously sampled microphones rather than a single microphone, but does not require the sampling clocks of the nodes to be synchronized. From the observed audio signals, the distances between the acoustic sources and arrays, as well as the directions of arrival, are estimated. They serve as input to a non-linear least squares problem, from which the sensor nodes’ positions and orientations on the one hand and the source positions on the other hand are estimated alternately in an iterative process. Given one set of unknowns, i.e., either the source positions or the sensor nodes’ geometry, the other set of unknowns can be computed in closed form. The proposed approach is computationally efficient and is the first to employ both distance and directional information for geometry calibration in a common cost function. Since both distance and direction of arrival measurements suffer from outliers, e.g., caused by strong reflections of the sound waves on the surfaces of the room, we introduce measures to deemphasize or remove unreliable measurements. Additionally, we discuss modifications of our previously proposed deep neural network-based acoustic distance estimator, to account not only for omnidirectional sources but also for directional sources. Simulation results show good positioning accuracy and compare very favorably with alternative approaches from the literature.

Introduction

A wireless acoustic sensor network (WASN) consists of sensor nodes, which are connected via a wireless link and where each node is equipped with one or more microphones, a computing and a networking module [1, 2]. A network of distributed microphones offers the advantage of superior signal capture, because it increases the probability that a sensor is close to every relevant sound source, be it a desired signal or an interfering source. Information about the position of an acoustic source may be used for acoustic beamforming and for realizing location-based functionality, such as switching on lights depending on a speaker’s position or steering a camera to a speaker who is outside its field of view. Source position information is also beneficial for the estimation of the phase offset between the sampling oscillators of the distributed sensor nodes [3, 4].

However, source location information can only be obtained from the audio signals without using additional prior knowledge, e.g., about source position candidates, like it is used in fingerprinting-based methods [5, 6], if the position of the sensors, i.e., the microphones, is known. This, however, is an unrealistic assumption, because one of the key advantages of WASNs is that they are typically an ad hoc network formed by non-stationary devices, e.g., the smartphones of users, and, possibly, stationary devices, such as a TV or a smart speaker. For such a setup, the spatial configuration and even the number of sensor nodes is unknown a priori and may even be changing over time, e.g., with people, and thus smartphones, entering and leaving the setup.

Geometry calibration refers to the task of determining the spatial position of the distributed microphones [7]. In case of sensor nodes equipped with an array of microphones [8], the orientation of the array is also of interest. An ideal calibration algorithm should infer the geometry of the network while the network is being used, i.e., solely from the recorded audio signals, neither requiring the playback of special calibration signals nor human assistance through manually measured distances. The calibration should be fast, not only during initial setup but also when detecting a change in the network configuration [9] which triggers a re-calibration.

A further desirable feature is independence from synchronized sampling clocks across the network (see [10–12]). Clearly, the tasks of geometry calibration and synchronization of the sensor nodes’ sampling clocks are often closely linked [7]. Geometry calibration approaches relying on time difference of arrival (TDoA) [13, 14], time of arrival (ToA) [15], or time of flight (ToF) [16] information investigate time points of sound emission and/or intersignal delays, requiring that the clocks of the sensor nodes are synchronized.

Only the direction of arrival (DoA)-based approach does not require clock synchronization at (sub-)sample precision. Here, the assumption is that sensor nodes are equipped with microphone arrays to be able to estimate the angle under which an acoustic source is observed. This requires that the microphones comprising the array share the same clock signal, while the clocks at different nodes only need to be coarsely synchronized, e.g., via [17–20]. That coarse synchronization, i.e., a synchronization with an accuracy of a few tens of milliseconds, is necessary to identify the same signal segments across devices. DoA-based calibration suffers from scale indeterminacy: only a relative geometry can be estimated, as no information is available to infer an absolute distance.

Once measurements are given, be it ToA, TDoA, DoA or even combinations thereof [21, 22], the actual estimation of the spatial arrangement of the network amounts to the optimization of a cost function, which measures the agreement of an assumed geometry with the given measurements [13, 23–27]. This typically is a non-linear least squares (LS) problem [28, 29], for which no closed-form solution is known. Due to the non-convexity of the problem, iterative solutions depend on the initialization. What complicates matters further is the fact that the acoustic measurements, such as DoAs, suffer from reverberation, which results in outliers that can spoil the geometry calibration process. To combat those, the iterative optimization is often embedded in a random sample consensus (RANSAC) method [30], which, however, significantly increases the computational load.

The approach presented here offers two innovations. First, we employ acoustic distance estimates in addition to DoA measurements, which resolves the scale ambiguity of purely DoA-based geometry calibration while still rendering clock synchronization at sample precision unnecessary. Compared to our previous approach presented in [31], which already utilized DoA and distance estimates in a two-stage manner, the approach proposed in the paper at hand combines both types of estimates directly in a common cost function.

In [32, 33], it has been shown how the distance between an acoustic source and a microphone array can be estimated from the coherent-to-diffuse power ratio (CDR), i.e., the ratio between the power of the coherent part and the power of the diffuse part of the received audio signal. The authors employed Gaussian processes (GPs) to estimate the distance between a close pair of microphones and the acoustic source. This technique performed well if the GP was trained in the target environment but generalized poorly to new acoustic environments. Better generalization capabilities were achieved by deep neural network (DNN)-based acoustic distance estimation, where the network was exposed to many different acoustic environments during training [31]. However, this approach to distance estimation needs signal segments in which a coherent source is active for around 1 s to work well. This requirement excludes impulsive source signals but is generally fulfilled by speech. Therefore, we consider speech as source signal but do not exclude other acoustic sources. In the contribution at hand, we build upon the DNN approach and further generalize it to perform better in the presence of directional sources.

The second contribution of this paper is the formulation of geometry calibration as a data set matching problem, similarly to [13], however, employing both distance and DoA estimates. Since data set matching can be efficiently realized, it greatly reduces the computational complexity of the task and thus the time it takes to estimate the geometry compared to a gradient-based optimization of a cost function. Moreover, we integrate the data set matching into an error-model-based re-weighting scheme and present a formal proof of convergence for it. The re-weighting scheme robustifies the geometry calibration process w.r.t. observations with large errors without the need of using a RANSAC. Additionally, a detailed experimental investigation of the proposed approach to geometry calibration is presented beside the mathematical analysis. Furthermore, the formulation as a data set matching problem allows the inference of the network’s geometry even if it only consists of two sensor nodes, each equipped with at least three microphones which do not lie on a line.

The paper is organized as follows: In Section 2, the geometry calibration problem and the notation are summarized, followed by the description of the cost function we investigate for geometry estimation in Section 3. Subsequently, the distance estimation via DNNs is briefly described in Section 4. In Section 5, the experimental results are summarized before we end the paper by drawing some conclusions in Section 6.

Geometry calibration setup

We consider a WASN, where a set of sensor nodes is randomly placed in a reverberant environment (see Fig. 1). Note that we investigate geometry calibration in a 2-dimensional space; however, the extension to 3-dimensional space is in principle straight-forward.

Fig. 1

Geometry calibration problem (red: sensor nodes; dark blue: acoustic sources; blue: source k; global coordinate system (x,y); local coordinate systems of the individual sensor nodes)

We assume that the internal geometric arrangement of each node’s microphone array is known and that all microphones making up an array are synchronously sampled, which we consider a realistic assumption. To be able to identify which DoA and distance estimates made by the different sensor nodes correspond to the same source signal, we further assume that a coarse time synchronization, i.e., a synchronization with an accuracy of a few tens of milliseconds, exists between the clocks of the different sensor nodes. This can be established, e.g., by NTP [17] or PTP [18]. We do, however, not require time synchronization at the precision of a few parts per million (ppm).

The WASN consists of L sensor nodes (red dots in Fig. 1), each equipped with a microphone array centered at positions \(\boldsymbol {n}_{l}{=}\left [\begin {array}{ll} n_{l,x} &n_{l,y} \end {array}\right ]^{\mathrm {T}}\) with an orientation θl, l∈{1,2,…,L}, relative to the global coordinate system, which is spanned by the depicted coordinate axes x and y. Here, θl corresponds to the rotation angle between the local coordinate system of the l-th node and the global coordinate system, i.e., the angle between the positive x-axes of the global and the local coordinate system (measured counterclockwise from the positive x-axis to the positive y-axis). The K acoustic sources (blue dots in Fig. 1) are at positions \(\boldsymbol {s}_{k}{=}\left [\begin {array}{ll} s_{k,x} &s_{k,y} \end {array}\right ]^{\mathrm {T}}\), k∈{1,2,…,K}. We assume that only one source is active at any given time. Note that the positions of the sensor nodes nl, their orientations θl, and the positions of the acoustic sources sk are all unknown and will be estimated through a geometry calibration procedure from the observed acoustic source signals.

The geometry calibration task amounts to determining the set Ωgeo={n1,…,nL,θ1,…,θL}. Furthermore, all source positions are gathered in the set Ωs={s1,…,sK}, which will be estimated alongside geometry calibration. This results in the set of all unknowns Ω=Ωgeo∪Ωs.

Since a sensor node does not know its own position or orientation within the global coordinate system, all observations are given in the node’s local coordinate system (see Fig. 2 for an illustration). In the following, the superscript (l) denotes that a quantity is measured in the local coordinate system of the l-th sensor node. Thus, the position of the k-th acoustic source, if expressed in the local coordinate system of the l-th sensor node, is denoted as \(\boldsymbol {s}_{k}^{(l)} {=} \left [s_{k,x}^{(l)},s_{k,y}^{(l)}\right ]^{\mathrm {T}}\). Quantities without a superscript are measured in the global coordinate system. For example, sk corresponds to the position of the k-th acoustic source described in the global coordinate system.

Fig. 2

Position of an acoustic source within the global coordinate system (x,y) and the local coordinate system of node l

Each sensor node l, l∈{1,…,L}, computes DoA estimates \(\widehat {\varphi }_{k}^{(l)}\) and distance estimates \(\widehat {d}_{k}^{\:(l)}\) to the acoustic source k, k∈{1,…,K}, all w.r.t. the node’s local coordinate system. Altogether, this results in K·L DoA estimates and K·L distance estimates available for geometry calibration.

Geometry calibration using DoAs and source node distances

To carry out geometry calibration, the given observations in the sensors’ local coordinate systems have to be transferred to a common global coordinate system. Then, a cost function is defined that measures the fit of the transferred observations to an assumed geometry. The minimization of this cost function provides the positions and orientations of the sensor nodes, as well as the positions of the acoustic sources.

Development of a cost function

The position \(\boldsymbol {s}_{k}^{(l)}\) of source k w.r.t. the local coordinate system of sensor node l is given by

$$\begin{array}{*{20}l} \boldsymbol{s}_{k}^{(l)} = d_{k}^{(l)} \left[\begin{array}{ll} \cos\left(\varphi_{k}^{(l)}\right) &\sin\left(\varphi_{k}^{(l)}\right) \end{array}\right]^{\mathrm{T}}. \end{array} $$
(1)

To project \(\boldsymbol {s}_{k}^{(l)}\) into the global coordinate system, the following translation and rotation operation is applied:

$$\begin{array}{*{20}l} \boldsymbol{s}_{k} &= \boldsymbol{R}(\theta_{l}) \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l} \end{array} $$
(2)
$$\begin{array}{*{20}l} &= d_{k}^{(l)}\left[\begin{array}{ll} \cos\left(\varphi_{k}^{(l)} + \theta_{l}\right) \\ \sin\left(\varphi_{k}^{(l)} + \theta_{l}\right) \end{array}\right] + \boldsymbol{n}_{l}. \end{array} $$
(3)

Here,

$$\begin{array}{*{20}l} \boldsymbol{R}(\theta_{l}) = \left[\begin{array}{ll} \cos(\theta_{l}) & -\sin(\theta_{l}) \\ \sin(\theta_{l}) & \cos(\theta_{l})\end{array}\right] := \boldsymbol{R}_{l} \end{array} $$
(4)

denotes the rotation matrix corresponding to the rotation angle θl.
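The mapping in (1)–(4) can be illustrated with a few lines of Python. This is only a minimal sketch: the function names and the numerical values of the example are hypothetical and chosen for illustration.

```python
import math

def local_source_position(distance, doa):
    """Eq. (1): polar observation (d, phi) -> Cartesian point in the node's local frame."""
    return (distance * math.cos(doa), distance * math.sin(doa))

def to_global(point_local, node_position, node_orientation):
    """Eqs. (2)-(4): rotate by theta_l, then translate by n_l."""
    c, s = math.cos(node_orientation), math.sin(node_orientation)
    x, y = point_local
    return (c * x - s * y + node_position[0],
            s * x + c * y + node_position[1])

# Hypothetical example: node at (1, 2), rotated by 90 degrees,
# observing a source 2 m away at a local DoA of 0 rad.
s_local = local_source_position(2.0, 0.0)
s_global = to_global(s_local, (1.0, 2.0), math.pi / 2)
```

In this example the source lies on the local x-axis of the node, which points along the global y-axis after the rotation, so the source ends up 2 m above the node in the global coordinate system.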

If all distances and angles were perfectly known, all \(\boldsymbol {s}_{k}^{(l)}\) would map to a unique position sk. Hence, the geometry can be inferred by minimizing the deviation of the projected source positions from an assumed position sk, which is measured by the LS cost function J(Ω):

$$\begin{array}{*{20}l} \widehat{\Omega} = \underset{\Omega}{\operatorname{argmin}} \underbrace{\sum_{l=1}^{L} \sum_{k=1}^{K} \left\|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right\|_{2}^{2}}_{:=J(\Omega)}, \end{array} $$
(5)

with \(\|\cdot \|_{2}\) denoting the Euclidean norm. Note that at least K=2 spatially different acoustic source positions have to be observed to arrive at an (over-)determined system of equations which is defined by \(\boldsymbol {s}_{k} = \boldsymbol {R}_{l} \boldsymbol {s}_{k}^{(l)} + \boldsymbol {n}_{l}\) with \(l \in \{1, \dots, L\}\) and \(k \in \{1, \dots, K\}\).
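To make the cost function (5) concrete, the following sketch evaluates J(Ω) for a toy configuration. The helper `observe`, which generates noise-free local observations, and all numerical values are assumptions made purely for illustration.

```python
import math

def ls_cost(sources, nodes, orientations, observations):
    """Eq. (5); observations[l][k] is s_k^(l), source k in the local frame of node l."""
    cost = 0.0
    for (nx, ny), theta, obs_l in zip(nodes, orientations, observations):
        c, s = math.cos(theta), math.sin(theta)
        for (sx, sy), (px, py) in zip(sources, obs_l):
            gx = c * px - s * py + nx          # projected observation, Eq. (2)
            gy = s * px + c * py + ny
            cost += (sx - gx) ** 2 + (sy - gy) ** 2
    return cost

def observe(source, node, theta):
    """Noise-free s_k^(l) = R_l^T (s_k - n_l); only used to build the toy example."""
    dx, dy = source[0] - node[0], source[1] - node[1]
    c, s = math.cos(theta), math.sin(theta)
    return (c * dx + s * dy, -s * dx + c * dy)

# Hypothetical geometry: 2 nodes, 3 sources
sources = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
nodes = [(1.0, 1.0), (4.0, 0.0)]
orientations = [0.3, -1.2]
observations = [[observe(src, n, th) for src in sources]
                for n, th in zip(nodes, orientations)]

cost_true = ls_cost(sources, nodes, orientations, observations)
# Shifting the first node by 0.5 m in x displaces all of its projections by 0.5 m
cost_shifted = ls_cost(sources, [(1.5, 1.0), (4.0, 0.0)], orientations, observations)
```

With perfect observations the cost vanishes at the true geometry, whereas the shifted node contributes a squared error of 0.25 for each of the three sources.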

There exists no closed-form solution for the non-linear optimization problem in (5). Thus, (5) has to be solved by an iterative optimization algorithm, e.g., by Newton’s method as proposed in [23] or by gradient descent.

Prior works, e.g., [23], have shown that the iterative optimization strongly depends on the initial values. Furthermore, the optimization is computationally demanding and, depending on the number of observed acoustic source positions, very time consuming, which limits its usefulness for WASNs with typically limited computational resources. In the following, we will present a computationally much more reasonable approach.

Geometry calibration by data set matching

We now interpret the relative acoustic source positions (see (1)) as the vertices of a rigid body. Matching the rigid body shapes as observed by the different sensor nodes results in an efficient way of geometry calibration, as described in [13]. In the following, we briefly recapitulate the concept of efficient geometry calibration based on data set matching [34, 35]. Let

$$\begin{array}{*{20}l} \boldsymbol{S}^{(l)} = \left[\begin{array}{lll} \boldsymbol{s}_{1}^{(l)} &\cdots & \boldsymbol{s}_{K}^{(l)} \end{array}\right]. \end{array} $$
(6)

be the matrix of all K source positions, as measured in the local coordinate system of sensor node l. Similarly, let S be the same matrix of source positions, but now measured in the global coordinate system. The dispersion matrix Dl is defined as follows [35]:

$$\begin{array}{*{20}l} \boldsymbol{D}_{l} = \frac{1}{K} \left(\boldsymbol{S}^{(l)} - \bar{\boldsymbol{s}}^{(l)}\boldsymbol{1}^{\mathrm{T}}\right) \boldsymbol{W}_{l} \left(\boldsymbol{S} -\bar{\boldsymbol{s}}\boldsymbol{1}^{\mathrm{T}}\right)^{\mathrm{T}}, \end{array} $$
(7)

where 1 denotes a vector of all ones. Wl is a diagonal matrix with (Wl)k,k=wkl, where (·)i,j denotes the i-th row and j-th column element of a matrix. \(\bar {\boldsymbol {s}}^{(l)}\) corresponds to the centroid of the observations made by sensor node l and \(\bar {\boldsymbol {s}}\) is the centroid of the source positions expressed in the global coordinate system:

$$\begin{array}{*{20}l} \bar{\boldsymbol{s}}^{(l)} = \frac{\sum\limits_{k=1}^{K} w_{{kl}} \boldsymbol{s}_{k}^{(l)}}{\sum\limits_{k=1}^{K} w_{{kl}}} \:\:\:\: \text{ and} \:\:\:\: \bar{\boldsymbol{s}} = \frac{\sum\limits_{k=1}^{K} w_{{kl}} \boldsymbol{s}_{k}}{\sum\limits_{k=1}^{K} w_{{kl}}}. \end{array} $$
(8)

The weights wkl will be introduced in Section 3.3 to control the impact of an individual observation \(\boldsymbol {s}_{k}^{(l)}\) on the geometry estimates.

Carrying out a singular value decomposition (SVD) of the dispersion matrix gives Dl=UΣVT. The estimate \(\widehat {\boldsymbol {R}}_{l}\) of the rotation matrix is then given by [34, 35]

$$\begin{array}{*{20}l} \widehat{\boldsymbol{R}}_{l} = \boldsymbol{V}\boldsymbol{U}^{\mathrm{T}}, \end{array} $$
(9)

and the orientation of the corresponding sensor node by:

$$\begin{array}{*{20}l} \widehat{\theta}_{l} = \arctan \! 2\left(\left(\widehat{\boldsymbol{R}}_{l}\right)_{1, 1}, \left(\widehat{\boldsymbol{R}}_{l}\right)_{2, 1}\right). \end{array} $$
(10)

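Here, arctan 2 is the four-quadrant arc tangent. With the rotation matrix defined in (4), (10) is to be read with the cosine entry as the first and the sine entry as the second argument, i.e., arctan 2(x,y) returns the angle of the point (x,y); note that many software libraries expect the opposite argument order. Finally, the l-th sensor node position estimate \(\widehat {\boldsymbol {n}}_{l}\) in the reference coordinate system is given by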

$$\begin{array}{*{20}l} \widehat{\boldsymbol{n}}_{l} = \bar{\boldsymbol{s}} - \widehat{\boldsymbol{R}}_{l} \bar{\boldsymbol{s}}^{(l)}. \end{array} $$
(11)

Note that the described data set matching procedure corresponds to minimizing the following cost function [34]:

$$\begin{array}{*{20}l} J\left(\boldsymbol{n}_{l}, \boldsymbol{R}_{l}\right)= \sum_{k=1}^{K} w_{{kl}} \left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right|\right|_{2}^{2}. \end{array} $$
(12)
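The matching step can be sketched in plain Python. Note that the sketch below does not carry out the SVD of (9). Instead, it uses the closed-form angle that maximizes the weighted correlation between the centered point sets, which in two dimensions yields the same proper rotation. The function name and the toy values are hypothetical.

```python
import math

def match_data_sets(local_pts, global_pts, weights):
    """Weighted 2-D rigid alignment: returns (theta, n) minimizing Eq. (12)."""
    w_sum = sum(weights)
    # Weighted centroids, Eq. (8)
    cl = [sum(w * p[i] for w, p in zip(weights, local_pts)) / w_sum for i in (0, 1)]
    cg = [sum(w * q[i] for w, q in zip(weights, global_pts)) / w_sum for i in (0, 1)]
    # Weighted sums of dot and cross products of the centered points
    # (these carry the information of the dispersion matrix, Eq. (7), in 2-D)
    dot = cross = 0.0
    for w, p, q in zip(weights, local_pts, global_pts):
        ax, ay = p[0] - cl[0], p[1] - cl[1]
        bx, by = q[0] - cg[0], q[1] - cg[1]
        dot += w * (ax * bx + ay * by)
        cross += w * (ax * by - ay * bx)
    theta = math.atan2(cross, dot)             # library order: atan2(sine, cosine)
    c, s = math.cos(theta), math.sin(theta)
    # Eq. (11): node position from the two centroids
    n = (cg[0] - (c * cl[0] - s * cl[1]), cg[1] - (s * cl[0] + c * cl[1]))
    return theta, n

# Toy check with hypothetical values: a node at (4, 0) rotated by -1.2 rad
true_theta, true_n = -1.2, (4.0, 0.0)
sources = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0)]
c, s = math.cos(true_theta), math.sin(true_theta)
# Noise-free local observations s_k^(l) = R^T (s_k - n)
local = [(c * (x - true_n[0]) + s * (y - true_n[1]),
          -s * (x - true_n[0]) + c * (y - true_n[1])) for x, y in sources]
theta_hat, n_hat = match_data_sets(local, sources, [1.0, 1.0, 1.0])
```

For noise-free observations the true orientation and position of the node are recovered up to numerical precision. An SVD-based implementation following (7)–(9) would be the natural choice for an extension to three dimensions.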

Geometry calibration by iterative data set matching

We now generalize the findings of the last section to an arbitrary number L of sensor nodes. Moreover, we consider the source positions as additional unknowns. The resulting cost function

$$\begin{array}{*{20}l} J(\Omega) = \sum_{l=1}^{L} \sum_{k=1}^{K} w_{{kl}} \left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right|\right|_{2}^{2} \end{array} $$
(13)

is optimized by alternating between the estimation of the set of source positions Ωs and the estimation of the sensor node parameters Ωgeo.

Starting from an initial set of source positions Ωs, the geometry Ωgeo can be determined by optimizing (12) for each sensor node l∈{1,…,L} by data set matching as outlined in the last section. Note that the estimated positions are given relative to a reference coordinate system. The origin and orientation of this reference coordinate system are a result of the calibration process.

Given a geometry Ωgeo, the positions sk can be estimated for each acoustic source k∈{1,…,K} via:

$$\begin{array}{*{20}l} {\hat{\boldsymbol{s}}_{k}} = \underset{{\boldsymbol{s}_{k}}}{\operatorname{argmin}} \sum_{l=1}^{L} w_{{kl}} \left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right|\right|_{2}^{2}. \end{array} $$
(14)

For this, a closed-form solution exists, which is given by

$$\begin{array}{*{20}l} \hat{\boldsymbol{s}}_{k} = \frac{\sum\limits_{l=1}^{L}w_{{kl}}\left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right) }{\sum\limits_{l=1}^{L}w_{{kl}}}. \end{array} $$
(15)
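The step from (14) to (15) follows from setting the gradient of the cost function to zero. Using the shorthand \(\boldsymbol {p}_{kl} := \boldsymbol {R}_{l} \boldsymbol {s}_{k}^{(l)} + \boldsymbol {n}_{l}\), which is introduced here only for readability, one obtains

$$\begin{array}{*{20}l} \frac{\partial}{\partial \boldsymbol{s}_{k}} \sum_{l=1}^{L} w_{{kl}} \left|\left|\boldsymbol{s}_{k} - \boldsymbol{p}_{kl}\right|\right|_{2}^{2} = 2 \sum_{l=1}^{L} w_{{kl}} \left(\boldsymbol{s}_{k} - \boldsymbol{p}_{kl}\right) = \boldsymbol{0} \quad \Longrightarrow \quad \boldsymbol{s}_{k} \sum_{l=1}^{L} w_{{kl}} = \sum_{l=1}^{L} w_{{kl}}\, \boldsymbol{p}_{kl}. \end{array} $$

Since the cost function in (14) is strictly convex in sk for positive weights, this stationary point is the unique minimizer, i.e., the weighted mean of the projected observations.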

What remains is to describe how the weights wkl are chosen. They should reflect how well the observations \(\boldsymbol {s}_{k}^{(l)}\) fit to the model specified by \(\widehat {\Omega }_{{\text {geo}}}\) and \(\widehat {\Omega }_{\boldsymbol {s}}\). This can be achieved by setting

$$\begin{array}{*{20}l} w_{{kl}} = \frac{1}{\left|\left|\hat{\boldsymbol{s}}_{k} - \left(\widehat{\boldsymbol{R}}_{l} \boldsymbol{s}_{k}^{(l)} + \hat{\boldsymbol{n}}_{l} \right)\right|\right|_{2}}. \end{array} $$
(16)

With these weights and the ideas of [36], (13) can be interpreted as an iteratively re-weighted least squares (IRLS) algorithm [37] which minimizes the following sum of Euclidean distances:

$$\begin{array}{*{20}l} {\widehat{\Omega} = \underset{\Omega}{\operatorname{argmin}}\sum_{l=1}^{L} \sum_{k=1}^{K} \left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right|\right|_{2}.} \end{array} $$
(17)

Consequently, the resulting optimization problem is less sensitive to outliers than the optimization problem in (5).
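The link between (13), (16), and (17) can be made explicit by inserting the weights into a single summand of (13):

$$\begin{array}{*{20}l} w_{{kl}} \left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right|\right|_{2}^{2} = \frac{\left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}\right)\right|\right|_{2}^{2}}{\left|\left|\hat{\boldsymbol{s}}_{k} - \left(\widehat{\boldsymbol{R}}_{l} \boldsymbol{s}_{k}^{(l)} + \hat{\boldsymbol{n}}_{l} \right)\right|\right|_{2}}. \end{array} $$

Once the estimates no longer change between iterations, numerator and denominator refer to the same residual, and the summand reduces to the unsquared Euclidean distance appearing in (17). A large residual therefore enters the cost only linearly rather than quadratically. Note that (16) is undefined for a residual of exactly zero. Bounding the denominator from below by a small constant is a common safeguard in IRLS implementations, which is mentioned here as an assumption and not as part of the described method.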

Implementation details

Algorithm 1 summarizes the iterative data set matching used for geometry calibration. In the beginning, the set of observations \(\mathcal {S}^{(1)} = \left \{\boldsymbol {s}_{1}^{(1)}, \boldsymbol {s}_{2}^{(1)}, \dots, \boldsymbol {s}_{K}^{(1)}\right \}\) made by sensor node 1 is used as initial estimate of the acoustic sources’ position set \(\widehat {\Omega }_{\boldsymbol {s}}\). Experiments on the convergence behavior have shown that the effect of the choice of the sensor node, whose observations are used for initialization, is negligible (see Section 5.2). Since at this point no statement can be made about the quality of the observations \(\boldsymbol {s}_{k}^{(l)}\), the initial weights are all set to one: wkl=1 ∀k,l.

Subsequently, a first estimate of the geometry \(\widehat {\Omega }_{{\text {geo}}}\) can be derived by data set matching (line 3) utilizing \(\widehat {\Omega }_{\boldsymbol {s}}\) as reference source positions. Then, \(\widehat {\Omega }_{{\text {geo}}}\) is used to estimate the sources’ positions \(\widehat {\Omega }_{\boldsymbol {s}}\) (line 4) based on (15) with the weights still left as above. In the next iterations, the weights are chosen as described in (16). The iterative weighted data set matching procedure, i.e., lines 3–5 in Algorithm 1, is repeated until \(\widehat {\Omega }_{{\text {geo}}}\) and \(\widehat {\Omega }_{\boldsymbol {s}}\) converge. A detailed analysis of the convergence behavior of this part of the algorithm can be found in the Appendix.
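The alternating procedure described so far, i.e., initialization followed by repeated matching, source update, and re-weighting, may be sketched as follows. This is not a reproduction of Algorithm 1. It assumes a fixed number of iterations instead of a convergence test, a lower bound `eps` on the residual in (16), and a closed-form 2-D rotation estimate in place of the SVD. All names and example values are hypothetical.

```python
import math

def project(p, n, theta):
    """Eq. (2): map a local observation into the reference frame."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + n[0], s * p[0] + c * p[1] + n[1])

def match(local_pts, ref_pts, w):
    """Weighted 2-D data set matching for one node; returns (theta, n)."""
    w_sum = sum(w)
    cl = [sum(wi * p[i] for wi, p in zip(w, local_pts)) / w_sum for i in (0, 1)]
    cr = [sum(wi * q[i] for wi, q in zip(w, ref_pts)) / w_sum for i in (0, 1)]
    dot = cross = 0.0
    for wi, p, q in zip(w, local_pts, ref_pts):
        ax, ay = p[0] - cl[0], p[1] - cl[1]
        bx, by = q[0] - cr[0], q[1] - cr[1]
        dot += wi * (ax * bx + ay * by)
        cross += wi * (ax * by - ay * bx)
    theta = math.atan2(cross, dot)
    rx, ry = project(cl, (0.0, 0.0), theta)    # rotated local centroid
    return theta, (cr[0] - rx, cr[1] - ry)     # Eq. (11)

def calibrate(obs, n_iter=20, eps=1e-6):
    """obs[l][k] is s_k^(l). Returns (geo, sources) with geo[l] = (theta_l, n_l)."""
    L, K = len(obs), len(obs[0])
    sources = list(obs[0])                     # initial source set: first node's observations
    w = [[1.0] * K for _ in range(L)]          # initial weights w_kl = 1
    for _ in range(n_iter):                    # fixed count instead of a convergence test
        geo = [match(obs[l], sources, w[l]) for l in range(L)]
        proj = [[project(p, geo[l][1], geo[l][0]) for p in obs[l]] for l in range(L)]
        sources = []
        for k in range(K):                     # source update, Eq. (15)
            wk = sum(w[l][k] for l in range(L))
            sources.append(tuple(sum(w[l][k] * proj[l][k][i] for l in range(L)) / wk
                                 for i in (0, 1)))
        # Re-weighting, Eq. (16); eps guards against a division by zero
        w = [[1.0 / max(math.hypot(sources[k][0] - proj[l][k][0],
                                   sources[k][1] - proj[l][k][1]), eps)
              for k in range(K)] for l in range(L)]
    return geo, sources

def observe(src, n, theta):
    """Noise-free s_k^(l) = R_l^T (s_k - n_l); only used to build the toy example."""
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = src[0] - n[0], src[1] - n[1]
    return (c * dx + s * dy, -s * dx + c * dy)

# Hypothetical noise-free scenario with 3 nodes and 4 sources
true_nodes = [(1.0, 1.0), (4.0, 0.0), (0.0, 3.0)]
true_thetas = [0.3, -1.2, 2.0]
true_sources = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0), (3.0, 2.5)]
obs = [[observe(src, n, th) for src in true_sources]
       for n, th in zip(true_nodes, true_thetas)]
geo, sources = calibrate(obs)
# Expected position of the second node in the frame of the first node
expected_n1 = observe(true_nodes[1], true_nodes[0], true_thetas[0])
```

Because the first node's observations serve as initialization, the estimated geometry is expressed in that node's local coordinate system. In the noise-free example, the second node is therefore recovered with the relative orientation −1.2 − 0.3 = −1.5 rad.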

Although outliers are already addressed by the weights wkl to some extent, they can still have a detrimental influence on the results of the iterative optimization process if the corresponding errors are very large. Therefore, after convergence, the iterative weighted data set matching procedure is repeated (lines 7–12), however, only on the subset of observations \(\mathcal {S}_{{\text {fit}}}\) that best fits the model defined by the current estimates \(\widehat {\Omega }_{{\text {geo}}}\) and \(\widehat {\Omega }_{\boldsymbol {s}}\).

There are two criteria that describe how well the observations \(\boldsymbol {s}_{k}^{(l)}\) made by sensor node l fit to the model specified by \(\widehat {\Omega }_{{\text {geo}}}\) and \(\widehat {\Omega }_{\boldsymbol {s}}\). First, there are the distances between \(\boldsymbol {s}_{k}^{(l)}\) and the source position estimates \(\boldsymbol {s}_{k}^{(o)}, o \in \{1, \dots, L\} \backslash \{l\}\), made by the other sensor nodes:

$$\begin{array}{*{20}l} \epsilon_{k}(l,o) &= \left|\left|\left(\widehat{\boldsymbol{R}}_{l} \boldsymbol{s}_{k}^{(l)} + \hat{\boldsymbol{n}}_{l} \right) - \left(\widehat{\boldsymbol{R}}_{o} \boldsymbol{s}_{k}^{(o)} + \hat{\boldsymbol{n}}_{o} \right)\right|\right|_{2}. \end{array} $$
(18)

Second, there is the distance between the observations after being projected and the estimated source position measured in the global coordinate system:

$$\begin{array}{*{20}l} \sigma_{k}(l) = \left|\left|\hat{\boldsymbol{s}}_{k} - \left(\widehat{\boldsymbol{R}}_{l} \boldsymbol{s}_{k}^{(l)} + \hat{\boldsymbol{n}}_{l} \right)\right|\right|_{2}. \end{array} $$
(19)

Note that the choice of εk(l,o) and σk(l) is motivated by the fact that all relative source positions observed by the individual sensor nodes would map to the same position in the global coordinate system if the observations were perfect.

Combining the two criteria results in the function

$$\begin{array}{*{20}l} C_{k}(l) = \sigma_{k}(l) + {\frac{1}{L-1}} \underset{o\neq l}{\sum_{o=1}^{L}} \epsilon_{k}(l,o), \end{array} $$
(20)

used for the selection of \(\mathcal {S}_{{\text {fit}}}\). The distance and DoA measurements of source k made by a node l are included in \(\mathcal {S}_{{\text {fit}}}\) only if the resulting relative source position belongs to the best γ measurements made by a node l. With Ck(l) outliers can be identified based on the fact that they do not align well with the source position estimates of the other nodes for the current geometry.
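The selection based on (18)–(20) could be realized as sketched below. The sketch assumes that γ denotes the fraction of observations kept per node and that the number of kept observations is rounded up. Both are assumptions, as are the function names and the toy values.

```python
import math

def fitness(proj, sources):
    """Eq. (20). proj[l][k]: observation of source k by node l after projection
    into the global frame; sources[k]: current source position estimate.
    Requires at least two nodes."""
    L, K = len(proj), len(sources)
    C = []
    for l in range(L):
        row = []
        for k in range(K):
            # Eq. (19): distance to the estimated source position
            sigma = math.hypot(sources[k][0] - proj[l][k][0],
                               sources[k][1] - proj[l][k][1])
            # Eq. (18): distances to the projections of all other nodes
            eps_sum = sum(math.hypot(proj[l][k][0] - proj[o][k][0],
                                     proj[l][k][1] - proj[o][k][1])
                          for o in range(L) if o != l)
            row.append(sigma + eps_sum / (L - 1))
        C.append(row)
    return C

def select_fit(C, gamma):
    """Keep, per node, the fraction gamma of observations with the smallest C_k(l)."""
    fit = []
    for l, row in enumerate(C):
        n_keep = max(1, math.ceil(gamma * len(row)))
        best = sorted(range(len(row)), key=row.__getitem__)[:n_keep]
        fit.extend((l, k) for k in sorted(best))
    return fit

# Hypothetical example: 2 nodes, 4 sources, the last source is observed poorly
proj = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (6.0, 0.0)]]
sources = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
C = fitness(proj, sources)
S_fit = select_fit(C, gamma=0.75)
```

In this example, the observations of the fourth source disagree both with each other and with the source estimate, so they receive the largest values of Ck(l) and are dropped for both nodes.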

In principle, this fitness selection could also be integrated into the first iterative data set matching rounds (lines 3–5). However, initial experiments have shown that this may lead to a degradation of performance if the number of observed source positions K is small. This can be explained by the fact that observations are discarded based on a model which has not yet converged.

Acoustic distance estimation

To gather distance and, respectively, scaling information that can be used for geometry calibration, we propose to utilize the DNN-based distance estimator which we introduced in [31]. This distance estimator shows state-of-the-art performance and good generalization capabilities to different acoustic environments. In the following, we just concentrate on an adaptation of the distance estimator to directional sources and refer to [31] for a detailed description.

Our approach to acoustic distance estimation considers a microphone pair recording a signal x(t) emitted by a single acoustic source. The reverberant signal, being captured by the ν-th microphone, ν∈{1,2}, is modeled as follows [32]:

$$\begin{array}{*{20}l} y_{\nu}(t) &= h_{\nu}(t) \ast x(t) + v_{\nu}(t) \\ &= \underbrace{h_{\nu,e}(t) \ast x(t)}_{c_{\nu}(t)} + \underbrace{{h_{\nu,\ell}}(t) \ast x(t) + v_{\nu}(t)}_{r_{\nu}(t)}, \end{array} $$
(21)

with vν(t) corresponding to white sensor noise and hν(t) corresponding to the room impulse response which models the sound propagation from the source to the ν-th microphone. The operator ∗ denotes a convolution. hν(t) can be divided into hν,e(t), modeling the direct path and the early reflections, and hν,ℓ(t), modeling the late reflections. Thus, yν(t) can be split up into a coherent component cν(t), which corresponds to the direct path and the early reflections, and a diffuse component rν(t), produced by the late reflections and the sensor noise.

In [32] it was shown that the CDR, i.e., the power ratio of the coherent signal component cν(t) to diffuse signal component rν(t), is related to the distance between the microphone pair and the acoustic source (the larger the distance the smaller the value of the CDR). The DNN-based distance estimator utilizes a time-frequency representation of the CDR as an input feature.

Due to the large effort needed to measure room impulse responses (RIRs) in various acoustic environments, we here stick to synthetic RIRs for the training of the distance estimator, using the RIR generator of [38]. However, there are a lot of simplifying assumptions for the simulation of RIRs. For example, the room is modeled as a cuboid, and an omnidirectional characteristic is typically assumed for the acoustic sources and microphones.

Especially the omnidirectional characteristic of the acoustic sources is a large deviation from reality, because a real acoustic source, like a speaker, typically exhibits directivity. While an omnidirectional source emits sound waves with equal power in all directions, a directional source emits most of the power into one direction. In both cases, the sound waves are reflected multiple times on the surfaces of the room, which mainly causes the late reflections and accumulates to hν,ℓ(t). Hence, a directional source pointing towards a microphone array causes a less diffuse signal compared to an omnidirectional source that is assumed in the simulated RIRs. Consequently, a distance estimator trained with simulated RIRs and applied to recordings of directional sources pointing towards the microphone array would exhibit a systematic error and underestimate the distance. Furthermore, a directional source may cause a more diffuse signal compared to an omnidirectional source if it does not point towards a microphone array, causing a systematic overestimation of the distance. However, this case is not further investigated as such recording conditions are not included in the MIRD database [39] which is used in the experimental section.

We approach this mismatch by applying a recently proposed direct-to-reverberant ratio (DRR) data augmentation technique [40]. The DRR is defined as

$$\begin{array}{*{20}l} \eta_{\nu} = \frac{\sum_{t} h_{\nu,e}^{2}(t)}{\sum_{t} {h_{\nu,\ell}^{2}}(t)}. \end{array} $$
(22)

Considering (21), it is obvious that CDR and DRR are equivalent [41] if the influence of the sensor noise is negligible. Consequently, an augmentation of the DRR results in an augmentation of the CDR.

Therefore, during training, a scalar gain α is applied to hν,e(t) which contains the direct path and the early reflections of the RIRs. To avoid discontinuities within the RIR caused by the scaling, a window wd(t) is employed to smooth the product α·hν,e(t):

$$\begin{array}{*{20}l} \overline{h}_{\nu,e}(t) = w_{d}(t) \cdot \alpha \cdot h_{\nu,e}(t) + \left(1-w_{d}(t)\right) \cdot h_{\nu,e}(t). \end{array} $$
(23)

Hereby, wd(t) corresponds to a Hann window of 5 ms size, which is centered around the time delay td corresponding to the direct path. td is identified by the location of the maximum of |hν(t)|.

Since the directivity of the acoustic source is unknown in general, it is also unknown how α has to be chosen to adapt the simulated RIRs to the real scenario. Nevertheless, it is known that the DRR of the simulated RIRs has to be increased if a directional source pointing towards the center of the microphone pair is considered. Thus, \(\alpha {\sim } \mathcal {U}(1, \alpha _{{max}})\) is used, where αmax corresponds to the fixed upper limit of α and \({\sim } \mathcal {U}(\text {min}, \text {max})\) denotes drawing a value uniformly from the interval [min,max].

Furthermore, the DRR is only manipulated with probability Pr(aug). Hence, both manipulated and non-manipulated examples are presented to the DNN during training. The non-manipulated examples are intended to make it easier for the DNN to learn that examples manipulated with different scaling factors α belong to the same distance.
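The augmentation in (23) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `augment_drr` and `maybe_augment`, the sampling rate, and the default values are assumptions. The sketch applies the window to the full RIR, which coincides with (23) as long as the 5 ms window around the direct path lies entirely within the early part hν,e(t).

```python
import numpy as np


def augment_drr(h, fs, alpha, win_ms=5.0):
    """Scale the direct-path region of an RIR by alpha, smoothed by a Hann window.

    Hypothetical helper: assumes the window support lies inside the early
    part of the RIR, so that the late part is left untouched.
    """
    h = np.asarray(h, dtype=float)
    t_d = int(np.argmax(np.abs(h)))          # direct path = maximum of |h|
    n_win = int(round(win_ms * 1e-3 * fs))
    if n_win % 2 == 0:
        n_win += 1                           # odd length: window peak sits on t_d
    hann = np.hanning(n_win)
    w = np.zeros_like(h)
    start = t_d - n_win // 2
    lo, hi = max(start, 0), min(start + n_win, len(h))
    w[lo:hi] = hann[lo - start:hi - start]   # clip the window at the RIR borders
    # Cross-fade between the scaled and the original RIR as in (23)
    return w * alpha * h + (1.0 - w) * h


def maybe_augment(h, fs, rng, alpha_max=3.0, p_aug=0.5):
    """Apply the augmentation with probability p_aug and alpha ~ U(1, alpha_max)."""
    if rng.random() < p_aug:
        return augment_drr(h, fs, rng.uniform(1.0, alpha_max))
    return np.asarray(h, dtype=float).copy()
```

Because the Hann window equals one at its center, the direct-path sample is scaled by exactly α, while samples outside the window remain unchanged.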

Experimental results

In this section, the proposed approach to geometry calibration is evaluated. First, the adaptation of the DNN-based acoustic distance estimation method to directional sources is examined. For deeper insights into acoustic distance estimation see [31]. Afterwards, the proposed approach to geometry calibration is investigated based on simulations of the considered scenario.

Acoustic distance estimation

In the following, the adaptation of the DNN-based distance estimator to directional sources is evaluated on the MIRD database [39]. This database consists of measured RIRs for multiple source positions on an angular grid at distances of 1 m and 2 m. The measurements took place in a 6 m×6 m×2.4 m room with a configurable reverberation time T60. From the data, we used the two subsets corresponding to T60=360 ms and T60=610 ms and considered the central microphone pair with an inter-microphone distance of 8 cm.

The setups of the MIRD database are limited w.r.t. the number of source and sensor positions. Nevertheless, the experimental data is sufficient to show that the approach works for directional acoustic sources and not only on simulated audio data of omnidirectional sources. We refer to [31] for a detailed investigation of a wider range of setups using simulated data.

As described in Section 4, the distance estimator is trained using RIRs that are simulated with the implementation of [38]. The training set consists of 100,000 constellations of source and microphone pair, whereby the properties of the considered room and the placement of the microphone pair and the acoustic source are randomly drawn for each constellation. Table 1 summarizes the corresponding probability distributions. We first draw the position of the microphone pair and then place the acoustic source relative to this position at the same height using the distance d and the DoA φ.

Table 1 Description of the training set of the distance estimator used on the MIRD database

The RIRs are used to reverberate clean speech signals from the TIMIT database [42]. During training, these speech samples are randomly drawn from the database. For the evaluation of the distance estimator on the MIRD database, we used R=100 speech samples, which were randomly drawn from the TIMIT database and then reverberated by each of the RIRs.

In the following, the configuration and training scheme of the distance estimator are explained. We employ 1 s long speech segments to calculate the CDR, which results in a feature map that is passed to the DNN. The short-time Fourier transform (STFT), which is needed to estimate the CDR, uses a Blackman window of length 25 ms and a frame shift of 10 ms. The CDR is calculated for frequencies between 125 Hz and 3.5 kHz, which corresponds to the frequency range in which speech has significant power.
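The STFT front end described above can be sketched as follows. The sketch only covers the framing and band selection that precede CDR estimation; the CDR estimator itself is not reproduced here. The function name `stft_band`, the sampling rate of 16 kHz, and the choice of an FFT length equal to the window length are assumptions made for illustration.

```python
import numpy as np


def stft_band(x, fs=16000, win_ms=25.0, hop_ms=10.0, f_lo=125.0, f_hi=3500.0):
    """STFT with a Blackman window, restricted to the band [f_lo, f_hi].

    Hypothetical helper: fs and the FFT length (= window length) are assumed.
    """
    x = np.asarray(x, dtype=float)
    n_win = int(round(win_ms * 1e-3 * fs))   # 25 ms -> 400 samples at 16 kHz
    n_hop = int(round(hop_ms * 1e-3 * fs))   # 10 ms -> 160 samples at 16 kHz
    window = np.blackman(n_win)
    n_frames = 1 + (len(x) - n_win) // n_hop
    # Index matrix of shape (n_frames, n_win) selecting overlapping frames
    idx = np.arange(n_win)[None, :] + n_hop * np.arange(n_frames)[:, None]
    spec = np.fft.rfft(x[idx] * window, axis=1)
    freqs = np.fft.rfftfreq(n_win, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[:, band], freqs[band]
```

Under these assumptions, a 1 s segment yields 98 frames with 84 frequency bins each per microphone; the CDR feature map would then be computed from the band-limited spectra of the two microphones of a pair.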

Table 2 shows the architecture of the DNN used for distance estimation. The estimator is trained using Adam [43] with a mini-batch size of B=32 and a learning rate of \(3 \cdot 10^{-4}\) for 500,000 iterations. In addition, the maximum DRR augmentation factor αmax is set to 3. After training, we use the best performing checkpoint w.r.t. the mean-absolute error (MAE) of the distance estimates on an independent validation set.

Table 2 Architecture of the DNN used for distance estimation on the MIRD database

The influence of the DRR manipulation probability Pr(aug) can be seen in Table 3. Thereby, the MAE

$$\begin{array}{*{20}l} e_{d} = \frac{1}{2 \cdot A \cdot R} \sum_{c=1}^{2}\sum_{a=1}^{A}\sum_{r=1}^{R}|d(c, a) - \widehat{d}_{r}(c, a)| \end{array} $$
(24)

is used as the metric. Here, d(1,a)=1 m and d(2,a)=2 m correspond to the ground truth distance at DoA candidate a. \(\widehat {d}_{r}(c, a)\) denotes the corresponding estimate using the r-th speech sample and A the number of DoAs in the angular grid of the MIRD database. Furthermore, results for distance estimation on a simulated version of the RIRs of the MIRD database with omnidirectional sources are provided (see Table 3).

Table 3 MAE ed/ m on the MIRD database and the corresponding simulated RIRs

Without DRR augmentation, i.e., for Pr(aug)=0, the distance estimation error is large compared to the error on simulated RIRs. This can be explained by the systematic error resulting from the fact that the simulated RIRs used during training contain more diffuse signal components than the recorded RIRs. With DRR augmentation, the error of the distance estimates on the MIRD database is reduced, and the best performance is achieved if the DRR of all examples is manipulated during training. However, DRR augmentation makes the learning process more difficult, which increases the error on the simulated RIRs.

Geometry calibration

To evaluate the proposed approach to geometry calibration, we generated a data set consisting of G=100 simulated scenarios. Each scenario corresponds to a WASN with L=4 sensor nodes. Furthermore, each scenario contains acoustic sources at a fixed number of K=100 spatially independent positions within the room. This number can be justified by the fact that in realistic environments, e.g., living rooms, acoustic sources such as speakers move over time, so that the number of observed acoustic source positions also grows over time. All rooms have a random width \(r_{w} {\sim } \mathcal {U}({6}\text { m}, {7}\text { m})\), a random length \(r_{l} {\sim } \mathcal {U}({5}\text { m}, {6}\text { m}),\) and a fixed height rh of 3 m. In the experiments, we investigate reverberation times T60 from the set {300 ms, 400 ms, 500 ms, 600 ms}.

Both the nodes and the acoustic sources are placed at a height of 1.4 m, whereby the sensor nodes are equipped with a circular array with six microphones and a diameter of 5 cm. An example of how the sensor nodes and the acoustic sources are placed within the room is shown in Fig. 3.

Fig. 3 Simulated setup; red: microphones; blue: acoustic sources; gray area: possible area to randomly place sensor nodes (microphone arrays); all sensor nodes and acoustic sources have a minimum distance of 0.1 m to the closest wall; 1 m spacing between the gray areas

We assume that a 1 s long speech signal is emitted at each of the K=100 possible source positions, whereby the speech signals are randomly drawn from the TIMIT database. The speech samples are reverberated by RIRs generated with the RIR generator of [38]. Subsequently, the reverberant signals are used for distance and DoA estimation.

We employ the convolutional recurrent neural network (CRNN) which we proposed in [31] to compute the distance estimates used for geometry calibration. Feature extraction, training set, and training scheme mainly coincide with the ones described in Section 5.1. The description of the corresponding training set which consists of 10,000 source node constellations can be found in Table 4. During training, DRR augmentation is used with a manipulation probability of Pr(aug)=0.5.

Table 4 Description of the training set of the distance estimator used for geometry calibration

We take the three microphone pairs formed by the opposing microphones of the considered circular microphone array for distance estimation. The CDR is estimated for each of these microphone pairs and the three resulting feature maps are jointly passed to the CRNN.

DoA estimation is done using the complex Watson kernel method introduced in [44], where it was shown that this estimator is competitive with state-of-the-art estimators. The considered DoA candidates have an angular resolution of 1° and the concentration parameter of the complex Watson probability density function is chosen to be κ=5.

The fitness selection contained in our approach to geometry calibration always selects the best 50% of the relative source position estimates for each sensor node.

Figures 4 and 5 show the cumulative distribution function (CDF) of the distance and DoA estimation errors. The majority of the distance and DoA estimates exhibit only small errors, so in general there will be enough reliable estimates for geometry calibration. In both cases, however, there is also a non-negligible number of estimates with large errors, which have to be considered as outliers. It can also be observed that the number of outliers increases with increasing reverberation time T60. We refer to [31, 44] for a comparison of the used estimators to alternative estimators.

Fig. 4 CDF of the distance estimation error

Fig. 5 CDF of the DoA estimation error

After the geometry calibration process is started, more and more observed relative source positions \(\boldsymbol {s}_{k}^{(l)}\) will become available. The resulting effect on the geometry calibration results can be seen in Fig. 6, which displays the MAE of the sensor nodes’ position

$$\begin{array}{*{20}l} e_{p} = \frac{1}{G \cdot L} \sum_{g=1}^{G} \sum_{l=1}^{L} \left|\left|\boldsymbol{n}_{l,g} - \widehat{\boldsymbol{n}}_{l,g}\right|\right|_{2} \end{array} $$
(25)

and orientation

$$\begin{array}{*{20}l} e_{o} = \frac{1}{G \cdot L} \sum_{g=1}^{G} \sum_{l=1}^{L} \left|\angle\left(e^{j\left(\theta_{l,g} - \widehat{\theta}_{l,g}\right)}\right)\right|, \end{array} $$
(26)

where ∠(·) denotes the phase of a complex-valued number. Further, nl,g and θl,g are the ground truth values of the location parameters of the l-th node in the g-th scenario and \(\widehat {\boldsymbol {n}}_{l,g}\) and \(\widehat {\theta }_{l,g}\) denote the corresponding estimates. Note that the geometry estimates are projected into the coordinate system of the ground truth geometry using data set matching to align the sensor node positions before the errors are calculated.

Fig. 6 Influence of number of source positions on calibration performance
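The two error measures in (25) and (26) can be computed as in the following sketch. It assumes that the estimates have already been aligned with the ground truth coordinate system and that angles are given in radians; the function names are illustrative and not taken from the paper.

```python
import numpy as np


def position_mae(n_true, n_est):
    """MAE of node positions as in (25); inputs of shape (..., 2)."""
    diff = np.asarray(n_true, dtype=float) - np.asarray(n_est, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=-1)))


def orientation_mae(theta_true, theta_est):
    """MAE of node orientations as in (26); angles in radians (assumed)."""
    diff = np.asarray(theta_true, dtype=float) - np.asarray(theta_est, dtype=float)
    # The phase of exp(j * diff) wraps the difference into (-pi, pi]
    return float(np.mean(np.abs(np.angle(np.exp(1j * diff)))))
```

Taking the phase of the complex exponential ensures that, for example, a true orientation of 0.1 rad and an estimate of 2π − 0.1 rad are counted as an error of 0.2 rad rather than almost 2π.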

Figure 6 shows that the geometry estimation error gets smaller when more source positions have been observed and thus more relative source position estimates exhibiting a small error are available. Hence, the estimate of the geometry will improve over time. However, reasonable results can already be achieved with a small amount of observed source positions. This especially holds for scenarios with small reverberation times T60 where the estimates of the relative source positions are less error-prone.

In addition to the MAE of the geometry estimates, the distribution of the corresponding error is displayed in Figs. 7 and 8 for K=20 and K=100 observed source positions. For a small number of observed source positions, i.e., K=20, the majority of node position and node orientation estimates shows acceptably small errors. As can be seen, there are still outliers exhibiting large errors, despite the used error-model-based re-weighting method and the fitness selection method.

Fig. 7 Distribution of the geometry calibration error for K=20

Fig. 8 Distribution of the geometry calibration error for K=100

If more source positions are observed, e.g., K=100, the probability increases that a sufficient amount of good relative source position estimates is available, thus improving the average calibration accuracy and also decreasing the number of outliers.

Table 5 shows the influence of the individual outlier rejection and error handling steps of our approach to geometry calibration, namely the weighting in data set matching (WLS), the weighting in source localization (WLS SRC), and the fitness selection (Select). If all weights are set to \(w_{kl} = 1 \; \forall k, l\) and fitness selection is omitted, the geometry estimates are clearly worse than in the other cases listed in the table. Introducing weighting factors in data set matching and source localization improves the results. However, the experiment with active data selection reveals that the weighting is not powerful enough to completely suppress the detrimental effect of outliers, which can only be achieved by removing these outliers from the processed data via fitness selection.

Table 5 Influence of the weighting of the proposed geometry calibration procedure for K=20 and T60=500 ms

Figures 9 and 10 show the effect of fitness selection on the distribution of the DoA and distance estimation errors. Fitness selection causes larger errors to occur less frequently for both quantities, removing a large portion of the outliers. This especially holds for the distance estimates.

Fig. 9 Effect of fitness selection on the distribution of DoA estimation errors for K=20

Fig. 10 Effect of fitness selection on the distribution of distance estimation errors for K=20

These outliers are often caused by strong early reflections of sound on surfaces in the room, e.g., when a sensor node is placed near to a wall, resulting in poor distance and DoA estimates. However, outliers can also occur if a source is too close to a sensor node, i.e., the far-field assumption for DoA estimation is not met, or the distance between a sensor node and an acoustic source is too large which leads to a challenging situation for distance estimation. Because of the large number of possible reasons for outliers in the DoA and distance estimates, we refer the reader to the relevant literature for a more detailed discussion [31, 44, 45].

The convergence behavior of the sensor nodes’ positions is shown in Fig. 11 based on the CDF of the average spread of the sensor node position estimates

$$\begin{array}{*{20}l} {\zeta_{\boldsymbol{n}_{l}} = \frac{1}{I} \sum_{i=1}^{I} \left|\left|\widehat{\boldsymbol{n}}_{l, i} - \mu_{\boldsymbol{n}_{l}} \right|\right|_{2},} \end{array} $$
(27)

whereby \(\widehat {\boldsymbol {n}}_{l, i}\) denotes the estimate of the position of the l-th sensor node resulting from the i-th of the I considered initializations of \(\widehat {\Omega }_{\boldsymbol {s}}\) and \(\mu _{\boldsymbol {n}_{l}} {=} \frac {1}{I}\sum _{i=1}^{I} \widehat {\boldsymbol {n}}_{l, i}\) the corresponding mean.

Fig. 11 Effect of the initialization on the convergence behavior of the sensor nodes’ positions for K=20 and T60=500 ms

We compare two initialization strategies, namely the proposed initialization using the observed source positions of one sensor node and a random initialization. For the proposed initialization scheme, the geometry was estimated using the observations of each of the sensor nodes as initial values, resulting in I=L=4 different initializations. In the random case, all values of \(\widehat {\Omega }_{\boldsymbol {s}}\) are drawn from a normal distribution and I=100 initializations were considered.

It can be seen that the proposed initialization scheme leads to smaller deviations in the results. In most cases, the spread of the sensor node positions is even vanishingly small. Consequently, the choice of the sensor node whose source position estimates are used as initial values is not critical for the proposed initialization scheme. Moreover, the experiments showed that the spread of the estimated node orientations is on the order of \(10^{-13}\) and can therefore be neglected.
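For reference, the spread measure in (27) amounts to the mean Euclidean distance of the per-initialization position estimates of one node from their mean. A minimal sketch, with an illustrative function name and an assumed input layout of one row per initialization, could look as follows.

```python
import numpy as np


def position_spread(n_est):
    """Average spread as in (27) for one sensor node.

    n_est: array of shape (I, 2) holding the position estimate obtained
    from each of the I initializations (layout assumed for illustration).
    """
    n_est = np.asarray(n_est, dtype=float)
    mu = n_est.mean(axis=0)                          # mean position estimate
    return float(np.linalg.norm(n_est - mu, axis=1).mean())
```

A spread close to zero indicates that all initializations lead to practically the same position estimate for that node.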

In addition to geometry, our approach also provides estimates of the positions of the sound sources. The MAE of these estimates

$$\begin{array}{*{20}l} e_{\boldsymbol{s}} = \frac{1}{G \cdot K} \sum_{g=1}^{G} \sum_{k=1}^{K} \left|\left|\boldsymbol{s}_{k,g} - \hat{\boldsymbol{s}}_{k,g}\right|\right|_{2} \end{array} $$
(28)

is given in Table 6. Again, the coordinate system of the geometry estimates is aligned with the coordinate system of the ground truth geometry using data set matching before the errors are calculated. These results are compared to the results of source localization, i.e., solving (14) for each acoustic source, using the ground truth geometry. It is shown that for small reverberation times T60, the proposed iterative geometry calibration procedure yields comparable results to source localization using the ground truth geometry of the sensor network. As the reverberation time increases and thus the observation errors increase, the geometry calibration error increases and consequently the source localization error increases.

Table 6 MAE es/ m of source positions with and without fitness selection (Select) for K=20

Moreover, the effect of fitness selection is shown in Table 6. Calculating the MAE es only for the subset of observed source positions selected by the fitness selection always leads to a smaller error. Thus, the algorithm succeeds in selecting a set of observations with smaller errors.

Finally, in Table 7, we compare the proposed approach to geometry calibration to state-of-the-art approaches solely using distance [46] or DoA estimates [29]. Hereby, the DoA-based approach utilizes the optional Maximum Likelihood refinement procedure which was proposed in [29]. Note that the considered distance-based approach called GARDE only delivers estimates for the positions of the sensor nodes and no orientations. Furthermore, the DoA-based approach estimates a relative geometry which has to be scaled subsequently. To this end, we employed the ground truth source node distances to fix the scaling as described in [31].

Table 7 Comparison of the calibration results and average computing time \(\overline {T}_{c}\)

Table 7 shows that our approach is able to outperform both approaches by far. This can be explained by the additional information which results from the combined usage of distance and DoA information. In addition to that, the considered DoA-based approach contains no outlier handling while GARDE suffers from the outliers in the distance estimates.

The proposed approach also compares favorably in terms of computational effort when looking at the average computing time \(\overline {T}_{c}\), i.e., the average time needed to estimate the geometry once. The average computing times for distance estimation (47 ms) and DoA estimation (545 ms) are not included in \(\overline {T}_{c}\). Note that the DoA-based approach utilizes a Fortran-accelerated implementation [47] to optimize the underlying cost function, while all other approaches are based on a Python implementation. Moreover, Table 7 provides the average computing time required to solve the optimization problem in (5) with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method and the average computing time of the proposed approach if the weighting and the fitness selection are omitted, which can also be interpreted as solving (5). The latter leads to the same results as the BFGS method while being 70 times faster. This leaves room for the additional computing time required for the weighting and fitness selection in our approach. Consequently, despite its iterative character, the proposed approach shows a competitive computing time compared to the other considered approaches while providing better geometry estimates.

Conclusions

In this paper, we proposed an approach to geometry calibration in a WASN using DoA and distance information. The DoAs and distances are estimated from the microphone signals and are interpreted as estimates of the relative positions of acoustic sources w.r.t. the coordinate system of the sensor node. Our approach uses these observations to alternatingly estimate the geometry and the acoustic sources’ positions. Geometry calibration is thereby formulated as an iterative data set matching problem, which can be efficiently solved using an SVD.

In order to improve robustness against outliers and large errors contained in the observations, we integrate the iterative geometry estimation and source localization procedure into an error-model-based weighting and observation selection scheme. Simulations show that the proposed approach delivers reliable estimates of the geometry while being computationally efficient. Furthermore, it requires only a coarse synchronization between the sensor nodes.

Appendix

Convergence analysis of geometry calibration using iterative data set matching

We now analyze the convergence behavior of the iterative data set matching procedure, following the ideas of [48]. To this end, we consider the part of the iterative data set matching procedure in which fitness selection is not used, as shown in Algorithm 2. In the following, the superscript [η] denotes the value after the update in the η-th iteration. Thus, the sets of quantities resulting from the η-th iteration of the alternating optimization procedure are defined as \(\Omega _{{\text {geo}} }^{[\eta ]} {=} \left \{\boldsymbol {n}_{1}^{[\eta ]}, \ldots, \boldsymbol {n}_{L}^{[\eta ]}, \theta _{1}^{[\eta ]}, \ldots, \theta _{L}^{[\eta ]}\right \}\), \(\Omega _{\boldsymbol {s} }^{[\eta ]} {=} \left \{\boldsymbol {s}_{1}^{[\eta ]}, \ldots, \boldsymbol {s}_{K}^{[\eta ]}\right \}\), and \(\Omega _{\mathrm {w} }^{[\eta ]}{=}\left \{w_{11}^{[\eta ]}, \dots, w_{{KL}}^{[\eta ]} \right \}\). \(\boldsymbol {R}_{l}^{[\eta ]}\) denotes the rotation matrix corresponding to \(\theta _{l}^{[\eta ]}\). Furthermore, the cost function is now interpreted as a function of \(\Omega _{{\text {geo}} }^{[\eta ]}, \Omega _{\boldsymbol {s} }^{[\eta ]}\) and \(\Omega _{\mathrm {w} }^{[\eta ]}\):

$$\begin{array}{*{20}l} {J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right) = \sum_{l=1}^{L} \sum_{k=1}^{K} w_{{kl}}^{[\eta]} \left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]}\right)\right|\right|_{2}^{2}.} \end{array} $$
(29)

Considering the (η+1)-th iteration of the alternating optimization the following monotonicity property of the cost function holds:

Lemma 6.1

The inequality

$$\begin{array}{*{20}l} {J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta+1]}\right) \leq J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)} \end{array} $$
(30)

holds for all η>0, i.e., the considered cost function does not increase from one iteration to the next.

Proof Inserting the definition of the weights

$$\begin{array}{*{20}l} {w_{{kl}}^{[\eta]} = \frac{1}{\left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}}} \end{array} $$
(31)

into (29) leads to

$$\begin{array}{*{20}l} {}{J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)} {=} &{ \sum_{l=1}^{L} \sum_{k=1}^{K} w_{{kl}}^{[\eta]} \!\!\left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}^{2}} \\ {=} &{ \sum_{l=1}^{L} \sum_{k=1}^{K} \frac{\left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}^{2}}{\left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}}} \\ {=} &{ \sum_{l=1}^{L} \sum_{k=1}^{K} \left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}} \end{array} $$
(32)

for the costs at the end of the η-th iteration.

Firstly, data set matching is used to update the geometry Ωgeo (see line 3 in Algorithm 2). As described in [34] data set matching minimizes the cost function

$$\begin{array}{*{20}l} {J_{\eta}\left(\boldsymbol{n}_{l}, \boldsymbol{R}_{l}\right) = \sum_{k=1}^{K} w_{{kl}}^{[\eta]} \left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l} \right)\right|\right|_{2}^{2}} \end{array} $$
(33)

for each of the L sensor nodes. Considering all L sensor nodes together results in

$$\begin{array}{*{20}l} {}{\Omega_{{\text{geo}}}^{[\eta+1]}} {=} {\underset{\Omega_{{\text{geo}}}}{\operatorname{argmin}}\! \sum_{l=1}^{L} J_{\eta}\!\left(\boldsymbol{n}_{l}, \boldsymbol{R}_{l}\right)} {=\!\underset{\Omega_{{\text{geo}}}}{\operatorname{argmin}} \ J\!\left(\Omega_{{\text{geo}}}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right).} \end{array} $$
(34)

Consequently,

$$\begin{array}{*{20}l} {J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right) \leq J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)} \end{array} $$
(35)

holds.

The next step, i.e., the update of the source positions sk (see line 4 in Algorithm 2), is done by minimizing

$$\begin{array}{*{20}l} {J_{\eta}\left(\boldsymbol{s}_{k}\right)} &{= \sum_{l=1}^{L} w_{{kl}}^{[\eta]} \left|\left|\boldsymbol{s}_{k} - \left(\boldsymbol{R}_{l}^{[\eta+1]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta+1]} \right)\right|\right|_{2}^{2}} \end{array} $$
(36)

for all K source positions. Note that Jη(sk) is a weighted sum of squared Euclidean distances with non-negative weights, each of which is a convex function of sk, and is therefore convex. Consequently, the resulting linear least squares solution (see (15)) corresponds to the global minimum of Jη(sk). Summarizing this step for all K acoustic sources gives

$$\begin{array}{*{20}l} {}{\Omega_{\boldsymbol{s}}^{[\eta+1]}} {=} {\underset{\Omega_{\boldsymbol{s}}}{\operatorname{argmin}} \sum_{k=1}^{K} J_{\eta}\left(\boldsymbol{s}_{k}\right) =\underset{\Omega_{\boldsymbol{s}}}{\operatorname{argmin}} \ J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}, \Omega_{w}^{[\eta]}\right).} \end{array} $$
(37)

So it follows that

$$\begin{array}{*{20}l} {J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta]}\right) \leq J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)} \end{array} $$
(38)

and with (35) it holds:

$$\begin{array}{*{20}l} {J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta]}\right) \leq J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right).} \end{array} $$
(39)

Finally, the influence of the weight update has to be discussed (see line 5 in Algorithm 2). Applying Titu’s lemma to \(J\left (\Omega _{{\text {geo}}}^{[\eta +1]}, \Omega _{\boldsymbol {s}}^{[\eta +1]}, \Omega _{w}^{[\eta ]}\right)\) gives

$$\begin{array}{*{20}l} &{J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta]}\right)} \\ {=} &{\sum_{l=1}^{L} \sum_{k=1}^{K} \frac{\left|\left|\boldsymbol{s}_{k}^{[\eta+1]} - \left(\boldsymbol{R}_{l}^{[\eta+1]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta+1]} \right)\right|\right|_{2}^{2}}{\left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}}} \\ {\geq} & {\frac{\left(\sum\limits_{l=1}^{L} \sum\limits_{k=1}^{K} \left|\left|\boldsymbol{s}_{k}^{[\eta+1]} - \left(\boldsymbol{R}_{l}^{[\eta+1]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta+1]} \right)\right|\right|_{2}\right)^{2}}{\sum\limits_{l=1}^{L} \sum\limits_{k=1}^{K} \left|\left|\boldsymbol{s}_{k}^{[\eta]} - \left(\boldsymbol{R}_{l}^{[\eta]} \boldsymbol{s}_{k}^{(l)} + \boldsymbol{n}_{l}^{[\eta]} \right)\right|\right|_{2}}} \\ {=} &{ \frac{\left(J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta+1]}\right)\right)^{2}}{J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)}.} \end{array} $$
(40)

With (39) and (40) it follows:

$$\begin{array}{*{20}l} {\frac{\left(J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta+1]}\right)\right)^{2}}{J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)}} {\leq} {J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right).} \end{array} $$
(41)

Since \(J\left (\Omega _{{\text {geo}}}^{[\eta ]}, \Omega _{\boldsymbol {s}}^{[\eta ]}, \Omega _{w}^{[\eta ]}\right) > 0\) holds, this results in

$$\begin{array}{*{20}l} {\left(J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta+1]}\right)\right)^{2}} {\leq} {\left(J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)\right)^{2}} \end{array} $$
(42)

and, finally, in

$$\begin{array}{*{20}l} {J\left(\Omega_{{\text{geo}}}^{[\eta+1]}, \Omega_{\boldsymbol{s}}^{[\eta+1]}, \Omega_{w}^{[\eta+1]}\right)} &{\leq J\left(\Omega_{{\text{geo}}}^{[\eta]}, \Omega_{\boldsymbol{s}}^{[\eta]}, \Omega_{w}^{[\eta]}\right)}. \end{array} $$
(43)

Since \(J\left (\Omega _{{\text {geo}}}^{[\eta ]}, \Omega _{\boldsymbol {s}}^{[\eta ]}, \Omega _{w}^{[\eta ]}\right)\) is monotonically non-increasing and bounded from below by 0, it converges to a limit \(J^{\ast } \geq 0\) for \(\eta \to \infty \).
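The monotonicity property can be checked numerically with a small sketch of the alternating updates analyzed above: weighted data set matching per node via an SVD, a weighted-mean update of the source positions (the closed-form minimizer of (36)), and the weight update in (31). This is an illustrative 2D re-implementation under stated assumptions, not the authors' code from the paderwasn repository; the function names, the initialization with the observations of node 0, and the small constant `eps` that guards against division by zero are assumptions.

```python
import numpy as np


def fit_rigid_2d(local, glob, w):
    """Weighted data set matching: R, n minimizing sum_k w_k ||g_k - (R l_k + n)||^2."""
    w = w / w.sum()
    c_loc, c_glob = w @ local, w @ glob              # weighted centroids
    H = ((local - c_loc) * w[:, None]).T @ (glob - c_glob)
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T               # proper rotation, no reflection
    return R, c_glob - R @ c_loc


def iterate(obs, s, w, eps=1e-12):
    """One iteration: geometry update, source update, weight update.

    obs: (L, K, 2) relative source positions, s: (K, 2), w: (K, L).
    """
    proj = np.empty_like(obs)
    for l in range(obs.shape[0]):
        R, n = fit_rigid_2d(obs[l], s, w[:, l])
        proj[l] = obs[l] @ R.T + n                   # observations in the global frame
    wl = w.T[:, :, None]
    s_new = (wl * proj).sum(axis=0) / wl.sum(axis=0)  # weighted mean per source
    res = np.linalg.norm(s_new[None] - proj, axis=2)  # residual norms, shape (L, K)
    w_new = 1.0 / np.maximum(res.T, eps)              # weights as in (31)
    return s_new, w_new, float(res.sum())             # cost as in (32)


def calibrate(obs, n_iter=20):
    """Run the alternating updates and record the cost after each iteration."""
    s = obs[0].copy()                                 # assumed initialization: node 0
    w = np.ones((obs.shape[1], obs.shape[0]))
    costs = []
    for _ in range(n_iter):
        s, w, cost = iterate(obs, s, w)
        costs.append(cost)
    return s, costs
```

When run on noisy synthetic observations, the recorded costs should form a non-increasing sequence up to numerical precision, in line with Lemma 6.1. The first recorded cost already uses weights of the form (31), so the comparison starts from there.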

Availability of data and materials

The datasets and Python software code supporting the conclusions of this article are available in the paderwasn repository, https://github.com/fgnt/paderwasn. The MIRD database [39] is available under the following link: https://www.iks.rwth-aachen.de/en/research/tools-downloads/databases/multi-channel-impulse-response-database/.

Abbreviations

BFGS:

Broyden–Fletcher–Goldfarb–Shanno

CDF:

Cumulative distribution function

CDR:

Coherent-to-diffuse power ratio

CRNN:

Convolutional recurrent neural network

DNN:

Deep neural network

DoA:

Direction of arrival

DRR:

Direct-to-reverberant ratio

GP:

Gaussian process

GRU:

Gated recurrent unit

IRLS:

Iteratively re-weighted least squares

LS:

Least squares

MAE:

Mean-absolute error

ppm:

Parts per million

RANSAC:

Random sample consensus

RIR:

Room impulse response

STFT:

Short-time Fourier transform

SVD:

Singular value decomposition

TDoA:

Time difference of arrival

ToA:

Time of arrival

ToF:

Time of flight

WASN:

Wireless acoustic sensor network

References

  1. 1

    A. Bertrand. Applications and trends in wireless acoustic sensor networks: a signal processing perspective, (2011). https://doi.org/10.1109/SCVT.2011.6101302.

  2. 2

    V. Potdar, A. Sharif, E. Chang, in Proc. International Conference on Advanced Information Networking and Applications Workshops (AINA). Wireless Sensor Networks: A Survey (IEEE, Bradford, 2009), pp. 636–641. https://doi.org/10.1109/WAINA.2009.192.

  3. 3

    N. Ono, H. Kohno, N. Ito, S. Sagayama, in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Blind alignment of asynchronously recorded signals for distributed microphone array (IEEE, New Paltz, 2009). https://doi.org/10.1109/ASPAA.2009.5346505.

  4. 4

    S. Wozniak, K. Kowalczyk, Passive Joint Localization and Synchronization of Distributed Microphone Arrays. IEEE Signal Proc. Lett. 26(2), 292–296 (2019). https://doi.org/10.1109/LSP.2018.2889438.

  5. 5

    B. Laufer-Goldshtein, R. Talmon, S. Gannot, Semi-supervised source localization on multiple manifolds with distributed microphones. IEEE/ACM Trans. Audio Speech Lang. Process. 25(7), 1477–1491 (2017). https://doi.org/10.1109/TASLP.2017.2696310.

  6. 6

    B. Laufer-Goldshtein, R. Talmon, S. Gannot, Semi-supervised source localization on multiple manifolds with distributed microphones. IEEE/ACM Trans. Audio Speech Lang. Process. 25(7), 1477–1491 (2017). https://doi.org/10.1109/TASLP.2017.2696310.

  7. 7

    A. Plinge, F. Jacob, R. Haeb-Umbach, G. A. Fink, Acoustic Microphone Geometry Calibration: an overview and experimental evaluation of state-of-the-art algorithms. IEEE Signal Proc. Mag. 33(4), 14–29 (2016). https://doi.org/10.1109/MSP.2016.2555198.

  8. H. Afifi, J. Schmalenstroeer, J. Ullmann, R. Haeb-Umbach, H. Karl, in Proc. ITG Fachtagung Sprachkommunikation (Speech Communication). MARVELO - A Framework for Signal Processing in Wireless Acoustic Sensor Networks (Oldenburg, Germany, 2018).

  9. G. Miller, A. Brendel, W. Kellermann, S. Gannot, Misalignment recognition in acoustic sensor networks using a semi-supervised source estimation method and Markov random fields (2020). http://arxiv.org/abs/arXiv:2011.03432.

  10. J. Elson, K. Roemer, in Proc. ACM Workshop on Hot Topics in Networks (HotNets). Wireless sensor networks: a new regime for time synchronization (Association for Computing Machinery, Princeton, 2002).

  11. R. Lienhart, I. V. Kozintsev, S. Wehr, M. Yeung, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). On the importance of exact synchronization for distributed audio signal processing (IEEE, Hong Kong, 2003), p. 840. https://doi.org/10.1109/ICASSP.2003.1202774.

  12. I.-K. Rhee, J. Lee, J. Kim, E. Serpedin, Y.-C. Wu, Clock synchronization in wireless sensor networks: an overview. Sensors. 9(1), 56–85 (2009). https://doi.org/10.3390/s90100056.

  13. M. Hennecke, T. Plotz, G. A. Fink, J. Schmalenstroeer, R. Haeb-Umbach, in Proc. IEEE/SP Workshop on Statistical Signal Processing (SSP 2009). A hierarchical approach to unsupervised shape calibration of microphone array networks (2009), pp. 257–260. https://doi.org/10.1109/SSP.2009.5278589.

  14. L. Wang, T. Hon, J. D. Reiss, A. Cavallaro, Self-localization of ad-hoc arrays using time difference of arrivals. IEEE Trans. Signal Process. 64(4), 1018–1033 (2016). https://doi.org/10.1109/TSP.2015.2498130.

  15. M. H. Hennecke, G. A. Fink, in Proc. Joint Workshop on Hands-free Speech Communication and Microphone Arrays (HSCMA). Towards acoustic self-localization of ad hoc smartphone arrays (Edinburgh, United Kingdom, 2011), pp. 127–132. https://doi.org/10.1109/HSCMA.2011.5942378.

  16. V. C. Raykar, I. V. Kozintsev, R. Lienhart, Position calibration of microphones and loudspeakers in distributed computing platforms. IEEE Trans. Speech Audio Proc. 13(1), 70–83 (2005). https://doi.org/10.1109/TSA.2004.838540.

  17. D. Mills, Internet Time Synchronization: The Network Time Protocol. IEEE Trans. Commun. 39(10), 1482–1493 (1991). https://doi.org/10.1109/26.103043.

  18. M. Maróti, B. Kusy, G. Simon, A. Lédeczi, in Proceedings of the 2nd international conference on Embedded networked sensor systems. The flooding time synchronization protocol (Baltimore, 2004), pp. 39–49. https://doi.org/10.1145/1031495.1031501.

  19. M. Maroti, B. Kusy, G. Simon, A. Ledeczi, in Proc. International Conference on Embedded Networked Sensor Systems (SenSys). The flooding time synchronization protocol (Association for Computing Machinery, Baltimore, 2004). https://doi.org/10.1145/1031495.1031501.

  20. M. Leng, Y.-C. Wu, Distributed clock synchronization for wireless sensor networks using belief propagation. IEEE Trans. Signal Process. 59(11), 5404–5414 (2011). https://doi.org/10.1109/TSP.2011.2162832.

  21. A. Plinge, G. A. Fink, S. Gannot, Passive online geometry calibration of acoustic sensor networks. IEEE Signal Proc. Lett. 24(3), 324–328 (2017). https://doi.org/10.1109/LSP.2017.2662065.

  22. Y. Dorfan, O. Schwartz, S. Gannot, Joint speaker localization and array calibration using expectation-maximization. EURASIP Journal on Audio, Speech, and Music Processing. 2020(9), 1–19 (2020). https://doi.org/10.1186/s13636-020-00177-1.

  23. J. Schmalenstroeer, F. Jacob, R. Haeb-Umbach, M. Hennecke, G. A. Fink, in Proc. Annual Conference of the International Speech Communication Association (Interspeech). Unsupervised geometry calibration of acoustic sensor networks using source correspondences (ISCA, Florence, 2011), pp. 597–600.

  24. F. Jacob, J. Schmalenstroeer, R. Haeb-Umbach, in Proc. International Workshop on Acoustic Signal Enhancement (IWAENC). Microphone array position self-calibration from reverberant speech input (VDE, Aachen, 2012).

  25. F. Jacob, J. Schmalenstroeer, R. Haeb-Umbach, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOA-based microphone array position self-calibration using circular statistics (IEEE, Vancouver, 2013), pp. 116–120. https://doi.org/10.1109/ICASSP.2013.6637620.

  26. F. Jacob, R. Haeb-Umbach, in Proc. ITG Fachtagung Sprachkommunikation (Speech Communication). Coordinate mapping between an acoustic and visual sensor network in the shape domain for a joint self-calibrating speaker tracking (VDE, Erlangen, 2014).

  27. F. Jacob, R. Haeb-Umbach, Absolute Geometry Calibration of Distributed Microphone Arrays in an Audio-Visual Sensor Network. ArXiv e-prints, abs/1504.03128 (2015).

  28. R. Wang, Z. Chen, F. Yin, DOA-based three-dimensional node geometry calibration in acoustic sensor networks and its Cramér–Rao bound and sensitivity analysis. IEEE/ACM Trans. Audio Speech Lang. Process. 27(9), 1455–1468 (2019). https://doi.org/10.1109/TASLP.2019.2921892.

  29. S. Wozniak, K. Kowalczyk, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Exploiting rays in blind localization of distributed sensor arrays (IEEE, Barcelona, 2020), pp. 221–225. https://doi.org/10.1109/ICASSP40776.2020.9054752.

  30. M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications ACM. 24(6), 381–395 (1981). https://doi.org/10.1145/358669.358692.

  31. T. Gburrek, J. Schmalenstroeer, A. Brendel, W. Kellermann, R. Haeb-Umbach, in Proc. European Signal Processing Conference (EUSIPCO). Deep neural network based distance estimation for geometry calibration in acoustic sensor networks (Amsterdam, The Netherlands, 2021).

  32. A. Brendel, W. Kellermann, Distributed source localization in acoustic sensor networks using the coherent-to-diffuse power ratio. IEEE J. Sel. Top. Signal Proc. 13(1), 61–75 (2019). https://doi.org/10.1109/JSTSP.2019.2900911.

  33. A. Brendel, A. Regensky, W. Kellermann, in Proc. International Congress on Acoustics. Probabilistic modeling for learning-based distance estimation (Deutsche Gesellschaft für Akustik (DEGA e.V.), Aachen, 2019).

  34. J. M. Sachar, H. F. Silverman, W. R. Patterson, Microphone position and gain calibration for a large-aperture microphone array. IEEE Trans. Speech Audio Proc. 13(1), 42–52 (2005). https://doi.org/10.1109/TSA.2004.834459.

  35. O. Sorkine-Hornung, M. Rabinovich, Least-squares rigid motion using SVD. Computing. 1(1), 1–5 (2017).

  36. K. Aftab, R. Hartley, J. Trumpf, Generalized weiszfeld algorithms for lq optimization. IEEE Trans. Pattern Anal. Mach. Intell. 37(4), 728–745 (2015). https://doi.org/10.1109/TPAMI.2014.2353625.

  37. I. Daubechies, R. DeVore, M. Fornasier, C. S. Güntürk, Iteratively reweighted least squares minimization for sparse recovery. Commun. Pur. Appl. Math. 63(1), 1–38 (2010). https://doi.org/10.1002/cpa.20303.

  38. E. A. Habets, Room impulse response generator. Technische Universiteit Eindhoven, Tech. Rep. 2(2.4), 1 (2006).

  39. E. Hadad, F. Heese, P. Vary, S. Gannot, in Proc. International Workshop on Acoustic Signal Enhancement (IWAENC). Multichannel audio database in various acoustic environments (IEEE, Antibes, 2014), pp. 313–317. https://doi.org/10.1109/IWAENC.2014.6954309.

  40. N. J. Bryan, in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation (IEEE, Barcelona, 2020), pp. 1–5. https://doi.org/10.1109/ICASSP40776.2020.9052970.

  41. A. Schwarz, W. Kellermann, Coherent-to-diffuse power ratio estimation for dereverberation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(6), 1006–1018 (2015). https://doi.org/10.1109/TASLP.2015.2418571.

  42. J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM. NIST (1993). https://doi.org/10.6028/nist.ir.4930.

  43. D. Kingma, J. Ba, in Proc. International Conference on Learning Representations (ICLR). Adam: a method for stochastic optimization (Banff, Canada, 2014). http://arxiv.org/abs/arXiv:1412.6980v9.

  44. L. Drude, F. Jacob, R. Haeb-Umbach, in Proc. European Signal Processing Conference (EUSIPCO). DOA-estimation based on a complex Watson kernel method (IEEE, Nice, 2015). https://doi.org/10.1109/EUSIPCO.2015.7362384.

  45. J. R. Jensen, J. K. Nielsen, R. Heusdens, M. G. Christensen, in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOA estimation of audio sources in reverberant environments (2016), pp. 176–180. https://doi.org/10.1109/ICASSP.2016.7471660.

  46. T. Gburrek, J. Schmalenstroeer, R. Haeb-Umbach, in Accepted for Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Iterative geometry calibration from distance estimates for wireless acoustic sensor networks (2021). http://arxiv.org/abs/arXiv:2012.06142.

  47. R. H. Byrd, P. Lu, J. Nocedal, C. Zhu, A limited memory algorithm for bound constrained optimization. SIAM J. Sci. Comput. 16(5), 1190–1208 (1995). https://doi.org/10.1137/0916069.

  48. P. V. Giampouras, A. A. Rontogiannis, K. D. Koutroumbas, Alternating iteratively reweighted least squares minimization for low-rank matrix factorization. IEEE Trans. Signal Process. 67(2), 490–503 (2019). https://doi.org/10.1109/TSP.2018.2883921.

Acknowledgements

We would like to thank Mr. Andreas Brendel for the fruitful discussions on distance estimation.

Funding

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project 282835863. Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors’ contributions

DNN model development and training: TG. Geometry calibration software and experiments: TG and JS. Writing paper: TG, JS, and RH. The authors read and approved the final manuscript.

Authors’ information

Reinhold Haeb-Umbach received the Dipl.-Ing. and Dr.-Ing. degrees from RWTH Aachen University of Technology in 1983 and 1988, respectively. He is currently a professor of Communications Engineering at Paderborn University, Germany. His main research interests are in the fields of statistical signal processing and machine learning, with applications to speech enhancement, acoustic beamforming and source separation, as well as automatic speech recognition and unsupervised learning from speech and audio. He is a fellow of the International Speech Communication Association (ISCA) and of the IEEE.

Joerg Schmalenstroeer received the Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from the University of Paderborn in 2004 and 2010, respectively. Since 2004, he has been a Research Staff Member with the Department of Communications Engineering of the University of Paderborn. His research interests are in acoustic sensor networks and statistical speech signal processing.

Tobias Gburrek has been a Ph.D. student at Paderborn University since 2019, where he also pursued his Bachelor’s and Master’s degrees in Electrical Engineering. His research interests include acoustic sensor networks with a focus on geometry calibration and signal processing with deep neural networks.

Corresponding author

Correspondence to Tobias Gburrek.

Ethics declarations

Consent for publication

All authors agree to the publication in this journal.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Gburrek, T., Schmalenstroeer, J. & Haeb-Umbach, R. Geometry calibration in wireless acoustic sensor networks utilizing DoA and distance information. J AUDIO SPEECH MUSIC PROC. 2021, 25 (2021). https://doi.org/10.1186/s13636-021-00210-x

Keywords

  • Geometry calibration
  • Acoustic distance estimation
  • Deep neural network
  • Coherent-to-diffuse power ratio
  • Direction of arrival