Joint speaker localization and array calibration using expectation-maximization

Ad hoc acoustic networks comprising multiple nodes, each of which consists of several microphones, are addressed. Owing to the ad hoc nature of the node constellation, the microphone positions are unknown. Hence, typical tasks, such as localization, tracking, and beamforming, cannot be directly applied. To tackle this challenging joint multiple-speaker localization and array calibration task, we propose a novel variant of the expectation-maximization (EM) algorithm. The coordinates of multiple arrays relative to an anchor array are blindly estimated using naturally uttered speech signals of multiple concurrent speakers. The speakers’ locations, relative to the anchor array, are also estimated. The inter-distances of the microphones in each array, as well as their orientations, are assumed known, which is a reasonable assumption for many modern mobile devices (in outdoor and in several indoor scenarios). The well-known initialization problem of the batch EM algorithm is circumvented by an incremental procedure, also derived here. The proposed algorithm is tested by an extensive simulation study.


Introduction
Localization and tracking using multiple arrays of sensors are often handled under the assumption that the locations of the microphone arrays are precisely known. The recent deployment of ad hoc networks introduces a new challenge of estimating the array locations in parallel to routine tasks, such as speaker localization [1][2][3][4][5], noise or reverberation reduction [6][7][8], and speaker separation [9][10][11][12][13]. The solution is complex due to the number of unknown parameters and the dependencies between them. Many scenarios do not even have a unique solution, e.g., when the number of arrays or active sources is too small. In this paper, a novel expectation-maximization (EM)-based algorithm for the integrated task of speaker localization and array calibration is introduced. The new algorithm combines two tasks: direct positioning determination (DPD) and calibration for ad hoc networks.
*Correspondence: sharon.gannot@biu.ac.il † Member, EURASIP 1 Faculty of Engineering, Bar-Ilan University, 5290002 Ramat-Gan, Israel Full list of author information is available at the end of the article

Multiple direction of arrival estimation
Direction of arrival (DOA) estimation with known sensor positions is a well-studied problem. In [14], the steered response power (SRP)-phase transform (PHAT) algorithm is suggested, which is a generalization of the generalized cross correlation (GCC)-PHAT [15] to an array of microphones in the far-field scenario. Other known multi-channel algorithms are root multiple signal classification (MUSIC) [16,17], minimum variance distortionless response (MVDR) [18], and audio-applicable versions thereof [19][20][21]. These estimators were not proven to be optimal in the presence of multiple speakers. The DOA estimation [22] in the presence of various noise types can be formulated as a maximum likelihood (ML) estimation problem of deterministic parameters [23][24][25][26]. The DOA challenge in the presence of an unknown noise field was dealt with in [23]. The W-disjoint orthogonality (WDO) assumption [27], commonly attributed to speech signals due to their sparseness, is often exploited for DOA estimation tasks [28].
The problem of estimating multiple time differences of arrival (TDoAs) (or DOAs) was addressed in [12], [29][30][31] by using the EM procedure. In [29], the task of multiple TDoA estimation is addressed for the two-microphone (binaural) case, with the WDO assumption [27] applied, namely the dominance of a single speaker at each time-frequency (TF) bin. The authors used the EM procedure and the mixture of Gaussians (MoG) model to cluster the phase differences from each TF bin, where each cluster is associated with a TDoA value. In the E-step, a TF mask, associating each bin with a specific TDoA, was estimated. In the M-step, the probability of each TDoA was estimated, using the number of associations of TF bins.
In [30], an algorithm for estimating multiple DOAs in a reverberant environment was presented. Unlike the method presented in [29], the TF raw samples were clustered rather than their respective phase differences. The MoG model consists of explicit modeling of the reverberation properties. The resulting algorithm was able to localize multiple speakers with reverberation modeled as an additive diffuse noise with time-varying power. The reverberation power was estimated in the M-step for each speaker and for each TF bin. Note that in [30], a noiseless scenario was considered.
In the study presented in [12], the algorithm presented in [30] was extended to the problem of joint localization and separation of concurrent speakers. However, the algorithm requires a known noise power spectral density (PSD) matrix. In [31], the DOA estimation procedure presented in [12,30] was adopted for deriving a DOA estimator for multiple sources in a noisy environment. Stationary noise was assumed with known spatial coherence but, unlike [12], the noise level was assumed unknown, and its level was estimated in the M-step.

Multiple-source Cartesian localization
In this paper, when we use the term localization, we refer to higher-dimension problems (at least 2D). A straightforward solution to higher-dimension localization problems involves a triangulation of the 1D problems solved locally by each array of the network [32]. It has the advantage of simplicity, especially in distributed networks, where computations should be shared between nodes. There are many approaches that use triangulation of separate DOAs to solve the 2D or 3D localization problem. An example of such an approach for an acoustic ad hoc network was given in [33].
However, these solutions are not optimal, because only part of the information is utilized during the first step of the estimation. Moreover, in a small area (for example, an indoor scenario), a more general solution becomes a necessity since near-field conditions are often encountered. Since we do not rely on DOAs but rather find the locations directly from the measured signals, our approach is general enough to cover both near- and far-field conditions.
A possible general solution, which directly estimates the location without any intermediate steps, is frequently referred to as DPD [34]. For acoustic localization, DPD approaches were presented in [4,35]. In [4], a generalization of the method in [29] to the estimation of the coordinates of multiple sources, rather than only of their associated TDoAs, was presented using a grid of Cartesian coordinates that covers the room surface. The measured phase differences between microphones are then clustered to the nominal phase differences from each grid point. The probability of a speaker being located at each grid point was estimated in the M-step. Note that in [4,29], the spatial characteristics of the noise were not explicitly modeled and therefore not optimally treated. Some localization approaches [4,35] use an assumption that is unrealistic in the context of ad hoc networks, relying on perfect knowledge of the array positions. Estimating these positions is often referred to as the calibration problem.
Another important challenge in ad hoc networks, tightly connected to the calibration process, is clock synchronization. Both acoustic and non-acoustic solutions were proposed to overcome this challenge [36][37][38][39][40][41][42][43][44][45]. In the current work, we assume that the nodes are perfectly synchronized, possibly by using one of those approaches. It has been shown that current technology used by commercial consumer electronics, like smartphones, exhibits very small drift and jitter in the clock frequency, which can be compensated for by these algorithms. We will hereinafter ignore the synchronization issues.

Array calibration
Finding the location of microphones is a well-covered topic in the literature. For example, [46] deals with finding the location of a microphone utilizing a single loudspeaker and the room's known shape. Array constellation calibration has been analyzed from a theoretical point of view for far-field [47] and near-field scenarios [48]. For acoustic arrays, a few approaches have already been proposed for calibration, some of which are only suitable for scenarios with a dedicated calibration period [49]. Other algorithms utilize ambient sound for finding the inter-distances of microphones [50,51].
Calibration performed jointly with localization or tracking of sources presents a greater challenge. A family of algorithms called simultaneous localization and mapping (SLAM) for robots was described in [52][53][54]. In these contributions, the joint estimation of a single moving array trajectory, the positions of static sources, and the major reflectors (e.g., walls) is addressed.
Another popular problem is the estimation of static array locations jointly with tracking of moving acoustic sources [55,56]. The problem is sometimes referred to as simultaneous localization and tracking (SLAT) [57]. Effective solutions for array calibration in dynamic scenarios can utilize the multiple locations visited by the speakers. Such a method, based on a genetic algorithm, was recently presented for a scenario where speakers move around a table in the center of the room [58]. The arrays are located on the table, and the algorithm estimates the arrays' locations and tracks the speakers. The sensitivity to small movements is discussed in [59,60]. Approaches suitable for static scenarios can also be found in the literature, e.g., [61,62]. They rely on TDoAs between adjacent microphones. Other joint calibration approaches are described in [63][64][65]. Those methods currently work under a very specific set of geometrical conditions. For example, some of them require moving speakers or a minimum number of active speakers to guarantee a sufficient amount of data to overcome the problem of geometrical ambiguities. In [64], the proposed algorithm automatically determines the relative three-dimensional positions of audio sensors and sources in an ad hoc network. A closed-form approximate solution is derived, which is further refined by minimizing a nonlinear error function. The authors also account for the lack of temporal synchronization among different platforms. Recently, several approaches suitable for the static scenario were presented, in which the joint estimation problem is solved by applying various mathematical methods [66,67].

Proposed strategy
We suggest in this paper a new EM-based speaker localization and array calibration algorithm. The microphone inter-distances in each array, as well as the orientation of each array, are assumed known in advance, as can commonly be verified for commercial devices, e.g., cellular phones. In addition, since omni-directional microphones are used, acoustic calibration approaches, such as inter-distance measurements, can be applied. However, the network constellation, namely the center points of the arrays and the locations of the sources, is unknown in advance and should be jointly estimated by the algorithm.
The challenge is to solve the localization problem of multiple concurrent speakers (more than two) jointly with the calibration problem of multiple arrays without any other information or any additional calibration signals. Following [4], we use the EM and the MoG models to cluster the observed data to centroids located on a grid defined on the surface. An explicit model of the speech and noise is defined within the MoG model, as used in [12].
To address the calibration problem, we add the locations of the array centers to the estimation task. As a result, the locations of the array centers are estimated in the M-step. Maximization of the auxiliary function of the EM with respect to (w.r.t.) the array centers does not produce a closed-form expression. We utilize the simplifying assumption that the noise signals, as captured by the different arrays, are uncorrelated. This assumption enables us to avoid a multidimensional search over the array centers, i.e., a separate search for each array is obtained, and can be justified empirically if the array centers are sufficiently separated.
The initialization stage was found to be a cumbersome task, due to the large size of the parameter set. We present a new self-initialization scheme, which utilizes the collected data in an incremental fashion. One of the arrays is designated as the anchor array, and all the other elements (arrays and sources) are localized w.r.t. this anchor. First, the algorithm is applied with only the anchor array while the other arrays are disabled. Then, the other arrays in the network are sequentially added. The location of the sources is kept as a soft probability map throughout the iterative procedure. Only after the last iteration is an actual localization obtained, by applying a hard threshold to the final probability map. In this paper, for simplicity, the speakers are assumed to be spatially static across time. In the case of moving speakers, a recursive EM (REM) algorithm can be utilized [4] using our EM model for the fixed speakers.

Main contributions
The main contributions of this paper are listed below:
1. The problem of joint estimation of the array center positions and of multiple speaker positions is addressed. The problem is statistically formulated using the probability density function (p.d.f.) of the observations. By maximizing the likelihood of the observations via the EM algorithm, the source positions are inferred.
2. The search for the array center positions is carried out separately for each array, avoiding a simultaneous multidimensional search over the entire set of possible array centers.
3. The statistical model of the multiple speech signals is based on the WDO assumption [27], which was proven to be highly efficient for speaker separation tasks.

Methods
We start from a mathematical description of the problem in the first subsection and then derive the new algorithm in the second subsection.

Problem formulation
We derive a batch EM solution for joint estimation of the positions of static speakers and microphone arrays. The problem formulation is divided into two parts. The first describes the ad hoc network signals in the presence of multiple concurrent speakers and sensor noise, and the second presents the statistical model.

Signal model
Consider Q arrays, each of which is equipped with N microphones, receiving signals from J speakers. The number of speakers is not necessarily known in advance. The measured signals are linear combinations of the incoming waveforms. Let Z_{q,n}(t, k) be the signal received by the (q, n)th microphone, where q = 1, . . . , Q is the array index and n = 1, . . . , N is the microphone index within each array. Overall, there are Q × N microphones. The signals in the short-time Fourier transform (STFT) domain are given by:

Z_{q,n}(t, k) = \sum_{j=1}^{J} G_{q,n,j}(k) S_j(t, k) + V_{q,n}(t, k),    (1)

where t = 0, . . . , T − 1 and k = 0, . . . , K − 1 denote the time and frequency indexes, respectively. G_{q,n,j}(k) is the direct transfer function (DTF) associating speaker j and microphone (q, n), S_j(t, k) is the speech signal uttered by speaker j, and V_{q,n}(t, k) is the ambient noise, namely noise that results from the environment. Specific spatial characteristics of the noise signals will be discussed later.
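The linear mixing model above can be sketched in a few lines of NumPy; the function name, argument layout, and shapes below are illustrative choices, not taken from the paper:

```python
import numpy as np

def mix_stft(G, S, V):
    """STFT-domain mixture: each microphone observes a DTF-weighted sum of
    the speakers' STFT coefficients plus ambient noise.

    G: (QN, J, K) direct transfer functions per microphone, speaker, and bin.
    S: (J, T, K)  speech STFT coefficients S_j(t, k).
    V: (QN, T, K) ambient noise V_{q,n}(t, k).
    Returns Z: (QN, T, K), Z[i, t, k] = sum_j G[i, j, k] S[j, t, k] + V[i, t, k].
    """
    return np.einsum('ijk,jtk->itk', G, S) + V
```

Note that the DTF depends only on the frequency index k, not on time, reflecting the static-speaker assumption made throughout the paper.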
Note that the DTF model accounts for near-field scenarios and hence comprises the attenuation of the direct speech wave as well as the respective inter-microphone phase. Also note that the attenuation is known to be much less reliable than the phase; therefore, multiple arrays should be used, as demonstrated in Section 3 by adding the sensor arrays one by one. The DTF is given by:

G_{q,n,j}(k) = \frac{1}{d_{q,n,j}} \exp\left(-\iota \frac{2\pi k}{K} \frac{d_{q,n,j}}{c T_s}\right),    (2)

where c is the sound velocity and T_s denotes the sampling period. The distance d_{q,n,j} from speaker j to microphone (q, n) is calculated from geometrical considerations as:

d_{q,n,j} = \| p_j − p_{q,n} \|,    (3)

where p_j is the location of the jth speaker and p_{q,n} is the location of the (q, n)th microphone, given by:

p_{q,n} = p_q + p_n^{(q)},    (4)

where p_q is the position of the center of the qth array and p_n^{(q)} is the relative position of the nth microphone w.r.t. the array center. The inter-structure of the arrays and their orientation, namely p_n^{(q)}, are assumed to be known in advance. Note that the orientation of the arrays can be extracted by various means, for example, compass-based technology [68,69]. The orientation accuracy is often reported to be around 5° indoors and much better in outdoor scenarios. For simplicity, we assume hereinafter that the orientation of the nodes is perfectly known to the algorithm, since joint estimation of positions and orientations is too cumbersome at this stage. To address reverberant environments, an additional term representing the ambient reverberation field can be added to (1). As indicated in, e.g., [30], the reverberation components can be modeled as an additive multidimensional Gaussian interference with a spatially diffuse sound field and a time-varying level that follows the anechoic speech level. In such a case, the reverberation level can also be estimated in the M-step of the EM procedure. In this paper, for the sake of simplicity, the reverberation phenomenon is ignored. Consequently, the solution fits indoor scenarios with low reverberation levels and outdoor scenarios that are dominated by random noise.
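The geometric chain — array center plus known microphone offsets, Euclidean distance, then attenuation-and-delay DTF — can be sketched as follows. This is an illustrative sketch under a free-space near-field model with 1/d attenuation; all names are hypothetical:

```python
import numpy as np

def dtf(p_src, p_center, mic_offsets, K, Ts, c=343.0):
    """Near-field direct transfer function of one array (illustrative sketch).

    p_src:       (2,) candidate source location p_j.
    p_center:    (2,) array center p_q.
    mic_offsets: (N, 2) known relative microphone positions p_n^(q).
    K: FFT length, Ts: sampling period [s], c: sound velocity [m/s].
    Returns G: (N, K) with attenuation 1/d and phase exp(-i*2*pi*k*d/(K*c*Ts)).
    """
    mics = p_center + mic_offsets                 # microphone positions p_{q,n}
    d = np.linalg.norm(p_src - mics, axis=1)      # distances d_{q,n,j}
    k = np.arange(K)
    phase = np.exp(-2j * np.pi * k[None, :] * d[:, None] / (K * c * Ts))
    return phase / d[:, None]
```

In the algorithm, this computation is repeated for every candidate grid point and every hypothesized array center, which is why grid resolution drives the complexity discussed later.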
The N microphone signals in the qth array can be concatenated in a vector form:

z_q(t, k) = \sum_{j=1}^{J} g_{q,j}(k) S_j(t, k) + v_q(t, k),    (5)

where:

z_q(t, k) = [Z_{q,1}(t, k), . . . , Z_{q,N}(t, k)]^T,    (6a)
g_{q,j}(k) = [G_{q,1,j}(k), . . . , G_{q,N,j}(k)]^T,    (6b)
v_q(t, k) = [V_{q,1}(t, k), . . . , V_{q,N}(t, k)]^T.    (6c)

The overall observation set, DTFs, and noise components can be concatenated in compound vectors:

z(t, k) = [z_1^T(t, k), . . . , z_Q^T(t, k)]^T,    (7a)
g_j(k) = [g_{1,j}^T(k), . . . , g_{Q,j}^T(k)]^T,    (7b)
v(t, k) = [v_1^T(t, k), . . . , v_Q^T(t, k)]^T,    (7c)

such that:

z(t, k) = \sum_{j=1}^{J} g_j(k) S_j(t, k) + v(t, k).    (8)

The goal of this study is to jointly estimate the speaker locations p_j and the array center positions p_q in (3) and (4).

Statistical model
We use a MoG probability function to characterize the speech signals of all potential speakers. Each speaker can be assumed to be a complex-Gaussian source emitting acoustic waveforms from location p m , where m is the index of the Gaussian component. Because the number of speakers and their locations are unknown in advance, we use a predefined grid as candidate source positions.
The various speakers are assumed to exhibit disjoint activity in the STFT domain (WDO assumption [27]). Therefore, by means of clustering, every TF bin of z(t, k) can be associated with a single active source.
Based on the disjoint activity of the sources, the observations are given the following probabilistic description:

f(z(t, k); θ) = \sum_{m=1}^{M} ψ_m N_c(z(t, k); 0, Φ_m(t, k)),    (9)

where ψ_m is the (unknown) probability of a speaker being present at p_m and M is the number of Gaussians. N_c(·; ·, ·) denotes the complex Gaussian p.d.f.:

N_c(y; 0, Φ) = \frac{1}{π^{N_y} \det(Φ)} \exp\left(−y^H Φ^{−1} y\right),    (10)

with y a zero-mean complex-Gaussian random vector of dimension N_y and Φ its PSD matrix. The matrix Φ_m(t, k) is the PSD of z(t, k), given that z(t, k) is associated with the speaker located at p_m:

Φ_m(t, k) = φ_{S,m}(t, k) g_m(k) g_m^H(k) + Φ_v(t, k),    (11)

where the DTF g_m(k) is defined in (7b). The direct-path temporal PSD φ_{S,m}(t, k) and the noise PSD matrix Φ_v(t, k) are defined as:

φ_{S,m}(t, k) = E{|S_m(t, k)|^2},    (12)
Φ_v(t, k) = E{v(t, k) v^H(t, k)}.    (13)

The noise components from different arrays are often assumed to be uncorrelated [23], and thus:

Φ_v(t, k) = blkdiag(Φ_{v,1}(t, k), . . . , Φ_{v,Q}(t, k)),    (14)

where Φ_{v,q}(t, k) = E{v_q(t, k) v_q^H(t, k)}. This is a key assumption (as elaborated later), because it allows the estimation of the array centers to be executed separately for each array. It can be well justified in the presence of a spatially white or diffuse noise field, assuming that the inter-array distances are large enough. For the case of a directional noise field, this assumption is invalid.
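The MoG observation density above — a zero-mean complex Gaussian per candidate location, with a rank-one speech term added to the noise PSD — can be sketched numerically. This is a minimal sketch with hypothetical function names; it evaluates the mixture log-likelihood of a single TF observation:

```python
import numpy as np

def log_gauss(z, Phi):
    """Zero-mean complex-Gaussian log-density log N_c(z; 0, Phi)."""
    N = z.shape[0]
    _, logdet = np.linalg.slogdet(Phi)
    quad = np.real(z.conj() @ np.linalg.solve(Phi, z))
    return -N * np.log(np.pi) - logdet - quad

def mixture_loglik(z, psi, phi_s, G, Phi_v):
    """MoG log-likelihood of one TF observation (illustrative sketch).

    psi: (M,) component priors; phi_s: (M,) speech PSDs at this bin;
    G: (M, N) candidate DTF vectors g_m(k); Phi_v: (N, N) noise PSD matrix.
    Per component, the PSD is the rank-one speech term plus the noise PSD.
    """
    terms = []
    for m in range(len(psi)):
        g = G[m]
        Phi_m = phi_s[m] * np.outer(g, g.conj()) + Phi_v
        terms.append(np.log(psi[m]) + log_gauss(z, Phi_m))
    return np.logaddexp.reduce(terms)   # log of the sum of component densities
```

Working in the log domain (with `logaddexp`) avoids the numerical underflow that the raw Gaussian densities would otherwise cause at high-SNR bins.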
The PSD matrices of the noise are assumed to be timeinvariant and known in advance or can be estimated during speech absence segments.
Finally, by augmenting all observations for t = 0, . . . , T − 1 and k = 0, . . . , K − 1 in z = vec_{t,k}({z(t, k)}), the p.d.f. of the entire observation set can be stated as:

f(z; θ) = \prod_{t=0}^{T−1} \prod_{k=0}^{K−1} f(z(t, k); θ),    (15)

where the readings for all TF bins are assumed independent [27].
Let the unknown parameter set be:

θ = {p, ψ, φ_S},    (16)

where p = vec_q(p_q), ψ = vec_m(ψ_m), and φ_S = vec_{m,t,k}(φ_{S,m}(t, k)). It should be emphasized that, unlike the array locations, the speaker locations are indirectly estimated via the soft variables ψ that form a probability map. The number of speakers and their locations are inferred from this probability map.
The maximum likelihood estimation (MLE) problem can readily be stated as:

θ̂ = argmax_θ log f(z; θ).    (17)

The various assumptions leading to the MLE problem statement are summarized in the following list:
1. Noise signals at different arrays are assumed uncorrelated in (14). This assumption is valid for non-coherent noise fields (i.e., spatially white or diffuse) and will be used to simplify the optimization problem.
2. Speakers exhibit disjoint activity in each TF bin, namely z(t, k) is dominated by a single source in (9), as suggested in [27] and subsequent contributions.
3. Noise and speech signals are modeled by complex-Gaussian variables. This assumption is widely used in many speech processing algorithms and can be attributed to the properties of the Fourier transform of sufficiently long frames.
4. Each microphone array is calibrated, i.e., the array internal geometry p_n^{(q)} is known.
5. Each microphone array orientation is also known (for example, by using compass-based technology or a GPS).
6. The speakers are assumed static, namely their positions are fixed and do not change in time. In future research, moving-speaker scenarios will be addressed using a recursive EM variant, inspired by [4].
7. The reverberation phenomenon is ignored. The presented algorithm is therefore better suited to scenarios that are dominated by random noise, e.g., outdoor scenarios.
In the next subsection, an algorithm is derived for estimating θ. The first two components are the required parameters (array centers and source positions). The last component, φ_S, is a set of nuisance parameters. Since the MLE in this case is of high complexity, it is necessary to use an iterative search algorithm. A widely used algorithm for this type of problem is the EM algorithm. We derive the basic (batch) version of the algorithm. For performance improvement and for mitigating the dependency on the algorithm initialization, we further introduce a novel modification of that basic EM.

Localization and calibration expectation-maximization sequence (LACES)
The MLE of θ is developed using the EM algorithm. It uses three datasets and their probability models: the observations, the target parameters (these datasets were already defined in Section 2.1), and the hidden datasets that will be estimated by the algorithm. In our case, we set the hidden data to comprise: (1) the speech signals S m (t, k), which are potentially emitted from each location m in the room, and (2) the association of each TF bin with a single source emitting from a particular location, as in [4].
The association of each TF bin is expressed by x(t, k, m), an indicator that the bin (t, k) is associated with a speaker located at p m . The total number of indicators in the problem is T × K. Note that, under the WDO assumption [27], each TF bin is dominated by a single speaker.
This subsection is split into five parts. In the first part, the basic EM equations are derived. The second one is dedicated to the E-step and the third to the M-step. The fourth summarizes the algorithm and its initialization process. Complexity analysis is given in the last part.

Basic expectation-maximization steps derivation
Denote the hidden data as:

x = vec_{t,k,m}({x(t, k, m)}),  s = vec_{t,k,m}({S_m(t, k)}).    (18)

Following Bayes' rule, the p.d.f. of the complete dataset, z, x, and s, is obtained by:

f(z, x, s; θ) = f(z|x, s; θ) f(x; θ) f(s; θ).    (19)

The conditional distribution of the observed data given the hidden data can be expressed as:

f(z|x, s; θ) = \prod_{t,k} \prod_{m=1}^{M} \left[f_v\left(z(t, k) − g_m(k) S_m(t, k)\right)\right]^{x(t,k,m)},    (20)

where f_v(·) denotes the p.d.f. of the noise vector. Using the assumption that the noise signals captured by the different arrays are uncorrelated (14), the p.d.f. of the noise signals can be decomposed into a product of per-array quantities:

f_v(v(t, k)) = \prod_{q=1}^{Q} N_c\left(v_q(t, k); 0, Φ_{v,q}(t, k)\right).    (21)

Since the indicator x is independent of the speech signals s, its p.d.f. is given by:

f(x; θ) = \prod_{t,k} \prod_{m=1}^{M} ψ_m^{x(t,k,m)}.    (22)

The speech p.d.f. is frequently assumed to follow a complex-Gaussian distribution:

f(s; θ) = \prod_{t,k} \prod_{m=1}^{M} N_c\left(S_m(t, k); 0, φ_{S,m}(t, k)\right).    (23)

The p.d.f. of the complete dataset is then obtained by collecting the terms in (19)-(23).

E-step
For any variable, the notation \bar{(·)} refers to E{(·)|z; θ^{(ℓ−1)}}, where ℓ is the iteration index. The auxiliary function in our case can be stated as:

Q(θ|θ^{(ℓ−1)}) = E{log f(z, x, s; θ)|z; θ^{(ℓ−1)}} = Q_1(ψ|θ^{(ℓ−1)}) + Q_2(p|θ^{(ℓ−1)}) + Q_3(φ_S|θ^{(ℓ−1)}),    (25)

where Q_1, Q_2, and Q_3 collect the terms of the complete-data log-likelihood that depend on ψ, p, and φ_S, respectively. Note that, due to the indicator properties of x(t, k, m), the summation over m is carried out outside the logarithm operation.
For implementing the E-step, the following sufficient statistics of the hidden variables are evaluated: (1) the expected associations \bar{x}(t, k, m); (2) the first-order statistics of the speech multiplied by the indicator, \bar{x(t, k, m)S_m(t, k)}; (3) the second-order statistics of the speech multiplied by the indicator, \bar{x(t, k, m)|S_m(t, k)|^2}; and (4) the expected speech second-order statistics, \bar{|S_m(t, k)|^2}. In the next list, these expressions are mathematically derived.
1. The expected associations:

\bar{x}(t, k, m) = \frac{ψ_m^{(ℓ−1)} N_c\left(z(t, k); 0, Φ_m^{(ℓ−1)}(t, k)\right)}{\sum_{m′=1}^{M} ψ_{m′}^{(ℓ−1)} N_c\left(z(t, k); 0, Φ_{m′}^{(ℓ−1)}(t, k)\right)},

where:

Φ_m^{(ℓ−1)}(t, k) = φ_{S,m}^{(ℓ−1)}(t, k) g_m^{(ℓ−1)}(k) \left(g_m^{(ℓ−1)}(k)\right)^H + Φ_v(t, k).

Note that the direct-path g is calculated before each E-step according to the estimated array locations for all possible grid points. The expression for g_m^{(ℓ−1)}(k) is given by (7b) and (2), while exchanging the source index j with the candidate location index m and using the estimated array positions p̂_q^{(ℓ−1)} rather than their true values.
2. The next term for the E-step is the first-order statistics of the speech multiplied by the indicator, given the measurements and the parameters. Using the law of total expectation, it is given by (31). Note that the expectation of the mth speaker's signal, when the (t, k)th bin is associated with the mth speaker, is the output of the multichannel Wiener filter (MCWF) (see [70, Eq. (28)]). Otherwise, the expectation is the prior of the signal, as defined in (23), namely identically zero.
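The expected-association computation above is a Bayes-rule responsibility evaluation; a minimal sketch for a single TF bin follows, with hypothetical names and log-domain normalization for stability:

```python
import numpy as np

def e_step_associations(z, psi, Phi):
    """Posterior association of one TF bin with each candidate location
    (illustrative sketch of the expected-association step).

    z:   (N,) stacked observation vector for this bin.
    psi: (M,) current component priors psi_m.
    Phi: (M, N, N) per-component PSD matrices (speech rank-one term + noise).
    Returns (M,) responsibilities summing to one.
    """
    M = len(psi)
    log_post = np.empty(M)
    for m in range(M):
        _, logdet = np.linalg.slogdet(Phi[m])
        quad = np.real(z.conj() @ np.linalg.solve(Phi[m], z))
        log_post[m] = np.log(psi[m]) - logdet - quad   # constants cancel
    log_post -= np.max(log_post)                       # numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```

The shared constant −N log π cancels in the normalization, so only the prior, log-determinant, and quadratic terms need to be computed per component.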
3. The third term for the E-step is the expected speech second-order statistics multiplied by the indicator.
Using the law of total expectation, the expected speech second-order statistics multiplied by the indicator is given by (32). Note that, when the (t, k)th bin is associated with the mth speaker, the expected speech second-order statistics is the squared MCWF output plus the associated error covariance term (see [70, Eq. (32)]).
4. The last term of the E-step is the expected speech second-order statistics. Using the law of total expectation, it is given by (33), which is a weighted sum (according to the estimate of the indicator) of the conditional expectation in (32) and the prior variance φ_{S,m}^{(ℓ−1)}(t, k). Note that, when the (t, k)th bin is not associated with the mth speaker, the expected speech second-order statistics is simply the prior variance φ_{S,m}^{(ℓ−1)}(t, k).

M-step
The second step of the iterative algorithm is the maximization of (25) w.r.t. the unknown deterministic parameters θ, namely the M-step:

θ^{(ℓ)} = argmax_θ Q(θ|θ^{(ℓ−1)}).    (34)

1. Similarly to [4, Eq. (20a)], ψ_m is obtained by a constrained (\sum_m ψ_m = 1) maximization of Q_1(ψ|θ^{(ℓ−1)}) in (25):

ψ_m^{(ℓ)} = \frac{1}{T K} \sum_{t,k} \bar{x}(t, k, m).

2. The array locations are obtained by the maximization:

p^{(ℓ)} = argmax_p Q_2(p|θ^{(ℓ−1)}).    (35)

There is no closed-form solution for the array centers; therefore, a straightforward solution would require a tedious evaluation of the expression in (35) at |P|^Q points, where |P| is the number of candidate positions. Such a search is extremely complex. However, due to the assumption that the noise signals at different arrays are uncorrelated (14), Q_2(p|θ^{(ℓ−1)}) simplifies, and the search can be carried out separately for each p_q^{(ℓ)}.
Because the search is carried out for each array separately, it requires only |P| · Q evaluations of the likelihood term in (35), resulting in significant computational savings. Note that p_q determines g_{q,m}(k), as evident from (2)-(4).
3. The variance of the speech is obtained by maximizing Q_3(φ_S|θ^{(ℓ−1)}), resulting in:

φ_{S,m}^{(ℓ)}(t, k) = \bar{|S_m(t, k)|^2},    (36)

which is the periodogram of the speech signal, using its second-order statistics.
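The two M-step ideas — averaging the responsibilities to update the priors, and searching each array center independently — can be sketched as follows. The function names and the `score` callable are hypothetical stand-ins for the per-array likelihood term of (35):

```python
import numpy as np

def update_psi(xbar):
    """M-step prior update: psi_m is the average responsibility of
    component m over all T*K bins (cf. [4, Eq. (20a)])."""
    T, K, M = xbar.shape
    return xbar.reshape(-1, M).sum(axis=0) / (T * K)

def calibrate_separately(Q, candidates, score):
    """Decoupled array-center search enabled by the uncorrelated-noise
    assumption (14): |P| * Q evaluations instead of |P| ** Q.

    score(q, p): per-array likelihood term for array q at candidate p
    (a hypothetical callable wrapping the relevant part of (35)).
    """
    return [max(candidates, key=lambda p: score(q, p)) for q in range(Q)]
```

Without the decoupling, a network of Q = 4 arrays on a grid of |P| = 2500 points would require 2500^4 joint evaluations; the separate searches reduce this to 10,000.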

The LACES algorithm: summary
A conventional EM procedure for the problem at hand can be formalized for any number of nodes, Q̃, according to Algorithm 1, requiring L iterations.
The classical batch EM algorithm is sensitive to initialization and might converge to a local maximum instead of the global maximum likelihood [71]. Several solutions have been suggested [72] to circumvent the misconvergence phenomenon, including incremental [73], sparse [72], recursive [74], and other variants of the batch EM algorithm. Experimentally, it has been shown that the proposed algorithm might suffer from this misconvergence if a conventional initialization is applied.
In addition, because all locations of the microphones and the speakers in our model are unknown, the origin of the coordinate system should be predefined. We decided to use one of the arrays as the origin, referred to as the anchor node. The entire microphone/speaker constellation is then measured w.r.t. this node. Consequently, the EM algorithm should only search for Q − 1 array center locations.
We propose the following incremental procedure, which was empirically shown to converge to the MLE. First, only the anchor node is used by the algorithm; ψ_m is initialized to a uniform distribution, and φ_{S,m}(t, k) is calculated based on the anchor position. The nodes are then added incrementally until all Q nodes of the ad hoc network are included. After adding each node, EM iterations are applied to the current measurements, as captured by the Q̃ active nodes. In general, the number of iterations can be set to L > 1, but empirically, L = 1 iteration was found sufficient for each node addition. The localization and calibration EM sequence (LACES) algorithm is summarized in Algorithm 2. After convergence, the speakers are localized by applying a hard threshold to the final probability map ψ_m, in the way suggested for iterative localization schemes [35], [75][76][77][78]. The rationale is to keep the soft values during the EM convergence and apply the threshold only at the end.
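The incremental outer loop can be sketched as below. This is a structural sketch only: `init`, `em_iteration`, and `hard_threshold` are hypothetical callables standing in for the initialization, the batch EM iteration, and the final thresholding defined by Algorithm 2:

```python
def laces(nodes, init, em_iteration, hard_threshold, L=1):
    """Incremental LACES outer loop (structural sketch of Algorithm 2).

    nodes[0] is the anchor node; em_iteration(active, params) performs one
    batch EM iteration (E-step + M-step) using only the active nodes.
    """
    params = init(nodes[0])           # psi uniform, phi_S from the anchor only
    active = [nodes[0]]
    for node in nodes[1:]:            # add the remaining nodes one by one
        active.append(node)
        for _ in range(L):            # L = 1 found empirically sufficient
            params = em_iteration(active, params)
    return hard_threshold(params)     # soft probability map -> hard locations
```

The key design choice is that each newly added node inherits a parameter set already refined by the previously active nodes, which is what circumvents the batch EM initialization problem.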

Algorithm's complexity
The complexity of the proposed algorithm is high, even though we apply the calibration of each array sequentially, as described above. The complexity is a function of a few parameters. For example, it is very important to choose the correct grid resolution in the room to guarantee proper localization accuracy. However, the trade-off between accuracy and computational burden should be taken into consideration. In Table 1, the relevant parameters are listed. These parameters were already defined above during the derivation of the algorithm equations. The resources consumed by the proposed algorithm are summarized in Table 2 in terms of computational complexity, communication bandwidth (BW), and memory requirements. Due to the distributed nature of the problem at hand, these resources can be shared by the nodes, thus increasing the algorithm's efficiency. For example, we can start locally from the anchor node and then share the results with the second node and so on. The details depend on the network topology, which is beyond the scope of this paper.

Results and discussion
The proposed algorithm was evaluated using both simulations and real recordings, in terms of both node calibration accuracy and concurrent-speaker localization. The simulation and recording setups are described in the first subsection. The second subsection summarizes the measures used to evaluate the performance. The simulation results are given in the third subsection. The fourth subsection examines the influence of imperfections on the performance. The fifth subsection is dedicated to the evaluation of the proposed method using real-life recordings. The last subsection introduces a naïve algorithm that might be applied to the same problem; we compare the two approaches in terms of performance and their basic assumptions.

Experimental setup
For simplicity reasons only, we focus on 2D cases, namely both microphones and sources are located at the same height. The 3D case imposes a high computational complexity and is therefore skipped in this manuscript. In addition, to avoid too-strong reflections from either the floor or the ceiling of the acoustic enclosure, we selected the height of the source-microphone constellation at the center of the z-axis. The experimental setups for the simulation study and the real-life recordings were designed to be as similar as possible. Accordingly, the speakers were positioned to imitate a group of people sitting around a table located in the center of the room. Three to five microphone arrays were located randomly in the center area of the room to emulate mobile telecommunication devices, each with a few microphones, lying on that virtual table. This geometry also simulates an outdoor scenario in which the sensors are restricted to a close area and the sources are located on the perimeter of this area. The nodes jointly constitute an ad hoc acoustic sensor network. The nodes are rectangular with four microphones each, simulating smartphones with known dimensions and orientations. An example of such an array is shown in Fig. 1.
The sampling frequency was set to 8 kHz and the STFT frame length to 64 ms with an overlap of 75%. The number of frequency bins was 512. Utterances of simultaneously active male and female speakers were used (signal length of 1 s). The speakers were located randomly around the table. The number of speakers was five in the simulations and six in the real recordings.
The frequency band that proved sufficient for our array sizes was 500-2000 Hz. In the simulations, the speech signals were convolved with simple room impulse responses (RIRs) of an anechoic chamber. In the real-life recordings, we recorded the signals in our acoustic lab, set to a low reverberation level (T60 = 120 ms). In both cases, synthetic additive white Gaussian noise (AWGN) was added at various signal-to-noise ratio (SNR) levels. A picture depicting the recording setup can be found in Fig. 2. The rectangular arrays mentioned above were used in the acoustic lab together with Fostex model 6301BX loudspeakers, serving as sources. A high-quality recording system (by RME) was used to measure T60 and to generate the input signals. Although the full size of the room was 6 × 6 × 2.4 m, we focus here on a smaller search area of 5 × 5 m with a constant height of 135 cm.

Performance measures
The calibration success rate (SR) was calculated using Monte-Carlo simulations according to the number of times the estimate of the node center was sufficiently accurate (within 20 cm):

SR = (S_c / A_e) · 100%,

where S_c is the number of successful calibrations and A_e is the total number of nodes to be calibrated. This is the only measure used for the calibration stage: if the calibration is sufficiently accurate, the calibration error in centimeters is small; if the calibration fails, the subsequent localization stage fails as well. For the localization stage, we adapted three statistical measures used in [35, 75]. They are calculated only for the cases of successful calibration. The misdetections (MDs) are counted as the percentage of misdetected speakers:

MD = (M_s / R_s) · 100%,

where M_s is the number of misdetected sources and R_s is the total number of real sources. The false alarm (FA) rate is the percentage of wrongly detected speakers:

FA = (F_s / R_s) · 100%,

where F_s is the number of falsely detected sources. The localization root mean square error (RMSE) measures the estimation accuracy over all detected speakers:

RMSE = sqrt( (1/D_s) Σ_s e²(s) ),

where s is the source index, D_s is the number of detected sources, and e(s) is the respective localization error in meters.
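The four measures above can be sketched in a few lines of numpy. Note that the matching rule between estimated and true speakers (nearest-neighbor assignment within an illustrative `match_radius`) is our assumption for the sketch; the paper does not specify the association procedure.

```python
import numpy as np

def calibration_success_rate(node_errors, threshold=0.20):
    """SR: percentage of nodes whose estimated center lies within
    `threshold` meters (20 cm in the paper) of the true center."""
    node_errors = np.asarray(node_errors)
    s_c = np.sum(node_errors <= threshold)   # successful calibrations
    a_e = node_errors.size                   # nodes to be calibrated
    return 100.0 * s_c / a_e

def localization_measures(true_pos, est_pos, match_radius=0.5):
    """MD, FA, and RMSE for one scenario. Each true source is greedily
    matched to its nearest estimate; unmatched true sources count as
    misdetections, unmatched estimates as false alarms."""
    true_pos, est_pos = np.asarray(true_pos), np.asarray(est_pos)
    r_s = len(true_pos)                       # total real sources
    matched_err, used = [], set()
    for t in true_pos:
        d = np.linalg.norm(est_pos - t, axis=1)
        k = int(np.argmin(d))
        if d[k] <= match_radius and k not in used:
            used.add(k)
            matched_err.append(d[k])
    m_s = r_s - len(matched_err)              # misdetected sources
    f_s = len(est_pos) - len(used)            # falsely detected sources
    md = 100.0 * m_s / r_s
    fa = 100.0 * f_s / r_s
    rmse = np.sqrt(np.mean(np.square(matched_err))) if matched_err else np.nan
    return md, fa, rmse
```

The greedy assignment is adequate for well-separated speakers; for dense constellations an optimal (e.g., Hungarian) assignment would be preferable.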

Simulations of random geometric setups
The geometric setup for the simulations is shown in Fig. 3. Three square nodes (10 × 10 cm) were randomly located with random orientations in the middle of the room (each microphone is denoted by "•"). Six speakers (denoted by "+") were located away from the center, imitating a scenario with the nodes in the center (on a table in the indoor case) and the speakers around it. The main purpose of the simulation was to explore the performance for random geometric setups. The performance of the algorithm was tested for various SNR levels and various sensor and source locations; 100 different setups were generated. We noticed that a single EM iteration per new node (L = 1) yields satisfactory results. The statistical measures for the simulation study are summarized in Table 3. In the presence of white sensor noise, as also demonstrated for the real recordings, the algorithm performance deteriorates rapidly from good results at an SNR of 20 dB to very poor results around an SNR of 0 dB. Note that the localization search grid is 0.2 m × 0.2 m and the localization error is within the grid resolution. We noticed that some compensation for low SNR can be achieved by adding microphones to each array, as long as the noise is spatially white. However, a detailed analysis of how the number of microphones affects the performance is beyond the scope of this contribution.
To experimentally examine the convergence of LACES as arrays are added to the estimation, we plotted the localization maps in Fig. 4 for L = 1. The true locations of the five speakers are marked by "+". The improvement of the localization maps can be observed as additional arrays are utilized. With a single array, only a few of the speakers are detected and many errors are observed. As arrays are added, the estimation improves for all speakers. The final map can be used to infer the number of speakers and their locations.

Sensitivity to imperfections
Before discussing the real recordings, it is essential to examine the sensitivity of the algorithm to imperfections that exist in any real system.
The first analysis concerns the sensitivity to inaccurate offset values of the microphones with respect to the center of the array. We use a uniform distribution with various maximal offsets. The performance of the algorithm is summarized in Table 4. In the presence of errors in the microphone locations, the algorithm performance deteriorates rapidly from good results for a maximal offset of 10 mm to very poor results around an offset of 20 mm. It seems realistic to assume an internal calibration accuracy of around 1 mm, which is sufficient in terms of the algorithm performance.
The second analysis concerns the sensitivity to synchronization issues between arrays. We examine the influence of clock-rate differences between arrays, using a constant frequency offset between the three arrays, measured relative to the anchor array in parts per million (ppm). One array has the maximal offset indicated in the table and the other has half that offset. The performance of the algorithm is summarized in Table 5. Only very large frequency offsets degrade the performance: the results deteriorate rapidly from good, for a maximal offset of 100 ppm, to very poor, around an offset of 1000 ppm. This means that even for internal clocks of very low quality, the performance is still satisfactory.

The last analysis concerns the sensitivity to the reverberation level of the room. As stated above, we assume low reverberation levels, since we observed a significant influence of reverberation on the performance. The performance of the algorithm as a function of the reverberation level is summarized in Table 6. As expected, as the reverberation increases, the algorithm performance deteriorates rapidly from good results (T60 = 100 ms) to very poor results (T60 = 300 ms).

Real-life recordings in low-reverberation indoor environment
The geometric setup for the real recordings, taken at the BIU acoustic lab, is depicted in Fig. 5. Three rectangular arrays (8.2 × 14.7 cm), each consisting of four microphones, were located in the middle. Each microphone in the scheme is denoted by "•" and each of the six speakers, located around the center in a meeting-room setup, by "+". The real recordings are characterized by a low reverberation level (T60 = 120 ms). We tested this array constellation with various levels of sensor AWGN. The analysis of the real recordings is therefore focused on the influence of the SNR level, rather than the reverberation level, on the calibration and localization accuracy. These acoustic conditions can also represent outdoor environments, which are usually characterized by a small number of reflections. We analyze a single scenario in this subsection, with signals of the same length used in the simulations above. Table 7 summarizes the results for various SNR conditions. The Calibration SR column designates the number of correctly calibrated arrays out of the 2 non-anchor arrays (the third array is the anchor array). MD is calculated for 6 speakers. It can be seen that for any SNR higher than 14 dB, the performance is very good: the calibration succeeds for both arrays.

Naïve algorithm
In this subsection, as a baseline for the proposed method, we introduce a naïve geometrical technique for estimating both the array centers p_q for q = 2, ..., Q (assuming the reference array position p_1 is known) and the speakers' positions p_j for j = 1, ..., J, with J the number of speakers. Two simplifying assumptions are made: (1) the number of speakers J is known in advance and (2) the speakers' activity patterns are non-overlapping and the time segments in which they are active are known as well. Note that the LACES algorithm does not require these simplifying assumptions, which are rarely met in real-life scenarios.
The naïve algorithm uses two datasets: (1) τ_q,j, the TDoA between each array centroid and the reference array centroid w.r.t. each speaker; neglecting the TDoAs between the microphones within each array, the TDoA is estimated by maximizing the cross-correlation between each possible pair of signals (one from each array and one from the reference array) and averaging all the obtained TDoAs; and (2) ϑ_q,j, the DOA of each speaker w.r.t. each array, estimated by maximizing the SRP steered to all possible DOAs. Note that the orientations of the arrays are known (as for the LACES algorithm), and hence the independently estimated DOAs all refer to the same coordinate system.
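The pairwise TDoA estimation in dataset (1) can be sketched as a plain cross-correlation peak search; the function name and interface below are illustrative, and in the naïve algorithm the result would be averaged over all microphone pairs spanning the two arrays.

```python
import numpy as np

def estimate_tdoa(x, x_ref, fs=8000):
    """Estimate the TDoA (in seconds) of signal x relative to x_ref
    by maximizing the cross-correlation, as in dataset (1).
    A positive value means x is delayed w.r.t. x_ref."""
    cc = np.correlate(x, x_ref, mode="full")
    # 'full' mode lags run from -(len(x_ref)-1) to len(x)-1
    lag = int(np.argmax(cc)) - (len(x_ref) - 1)
    return lag / fs
```

For reverberant or noisy signals, a weighted variant such as GCC-PHAT is usually preferred over the plain cross-correlation shown here.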
The positions of the speakers and arrays should match the TDoA readings between the arrays. Accordingly, the TDoA between the qth array centroid and the reference array centroid (namely, array #1) is given by

τ_q,j(p_q, p_j) = (F_s / c) ( ||p_q − p_j|| − ||p_1 − p_j|| ),

with c the sound velocity and F_s the sampling frequency. Using the observed TDoAs τ_q,j, the following cost function should be minimized to obtain an estimate of the positions of the arrays' centroids p_q, q = 2, ..., Q, and the speakers' positions p_j, j = 1, ..., J:

Σ_{q=2}^{Q} Σ_{j=1}^{J} ( τ_q,j − τ_q,j(p_q, p_j) )².    (42)

As this cost function in (42) includes both the arrays' and the speakers' positions, the search for a global minimum is a cumbersome task.
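The TDoA model and the cost function (42) can be sketched as follows; the sound velocity c = 343 m/s is a standard assumption, and the function names are illustrative.

```python
import numpy as np

def model_tdoa(p_q, p_ref, p_j, c=343.0, fs=8000):
    """TDoA (in samples) between array centroid p_q and the reference
    centroid p_ref w.r.t. a speaker at p_j, per the model above."""
    return (fs / c) * (np.linalg.norm(p_q - p_j) - np.linalg.norm(p_ref - p_j))

def tdoa_cost(p_arrays, p_speakers, tau_obs, p_ref, c=343.0, fs=8000):
    """Cost (42): squared mismatch between the observed TDoAs
    tau_obs[q, j] and those predicted from candidate positions."""
    cost = 0.0
    for q, p_q in enumerate(p_arrays):          # arrays q = 2, ..., Q
        for j, p_j in enumerate(p_speakers):    # speakers j = 1, ..., J
            cost += (tau_obs[q, j] - model_tdoa(p_q, p_ref, p_j, c, fs)) ** 2
    return cost
```

Evaluating this cost on a grid over all unknown positions directly illustrates why the joint minimization of (42) is cumbersome: the search space grows with both Q and J.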
The positions of the sources and the arrays should also satisfy the relations imposed by the DOAs ϑ_q,j between the arrays and the sources. Considering only the horizontal plane, the following relation must hold:

tan(ϑ_q,j) = (p_j,y − p_q,y) / (p_j,x − p_q,x).    (43)

Note that this relation has an inherent ambiguity: if a specific ϑ̂_q,j satisfies (43), then ϑ̂_q,j + π satisfies the same equation.
Concatenating the above relations for all arrays q = 1, ..., Q yields

A p_j = (A ∘ B) 1,    (44)

where A and B are Q × 2 matrices defined by A_q,1:2 = [sin(ϑ_q,j), −cos(ϑ_q,j)] and B_q,1:2 = p_q^T, the symbol ∘ denotes the Hadamard (element-wise) product, and 1 = [1, 1]^T. Equation (44) is an over-determined set of equations for p_j, provided that Q ≥ 2, and hence can be solved by applying the least squares procedure. The position of the jth speaker p_j, as a function of the arrays' positions, is then given by

p̂_j = (A^T A)^{−1} A^T (A ∘ B) 1.    (45)

Substituting (45) into the cost function in (42), the array positions p_q, q = 2, ..., Q, can now be estimated independently of the speakers' positions, thus alleviating the computational burden:

{p̂_q}_{q=2}^{Q} = argmin_{p_2,...,p_Q} Σ_{q=2}^{Q} Σ_{j=1}^{J} ( τ_q,j − τ_q,j(p_q, p̂_j) )².    (46)

The cost function in (46) still requires a (Q − 1)-dimensional search over array positions. To further reduce the complexity, we propose to sequentially minimize the cost function for the sub-network of size Q̃, with Q̃ = 2, ..., Q. At each step, only the position of the newly added array p_Q̃ is estimated, while the positions of all previously added arrays p_2, ..., p_{Q̃−1}, estimated in the previous algorithmic steps, are kept unaltered:

p̂_Q̃ = argmin_{p_Q̃} Σ_{j=1}^{J} ( τ_Q̃,j − τ_Q̃,j(p_Q̃, p̂_j) )².    (47)

The minimization over a single array position can now be carried out by a simple grid search; we chose a search area of 5 × 5 m.

To exemplify the procedure, the case of six speakers and three arrays, as depicted in Fig. 6, is examined. The reference array is located at [2.4, 2.6] m and its position is not estimated by the algorithm. In the first stage, only the position of the second array, located at [1.8, 3] m, is estimated. The obtained cost function (47) is depicted in Fig. 7a. Two distinct minima, at [1.8, 2.95] and [3, 2.25], can be observed. This is attributed to the symmetric behavior of the cost function (47) w.r.t. the reference array, namely for p_1 + p_2 and p_1 − p_2, as evident from (43). Therefore, an additional disambiguation stage was applied to determine the position of the second array.
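The least-squares triangulation in (44)-(45) can be sketched compactly; the function name below is illustrative, and the DOAs are assumed to be expressed in the common coordinate system, as stated above.

```python
import numpy as np

def triangulate_speaker(thetas, array_pos):
    """Least-squares speaker position from DOAs, per (44)-(45):
    solve A p_j = (A o B) 1 in the least-squares sense.
    thetas[q]    : DOA of the speaker at array q (radians, common frame)
    array_pos[q] : 2D position of array q (rows of B)."""
    thetas = np.asarray(thetas)
    B = np.asarray(array_pos)                                # Q x 2
    A = np.stack([np.sin(thetas), -np.cos(thetas)], axis=1)  # Q x 2
    b = (A * B).sum(axis=1)                                  # (A o B) 1
    p_j, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p_j
```

Each row of the system describes the bearing line from one array toward the speaker; with at least two non-parallel bearings (Q ≥ 2), the least-squares solution recovers the intersection point. Note that the ±π ambiguity of (43) is not resolved here, matching the need for the disambiguation stage described in the text.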
For that, we calculated two alternative DOA estimates from the two optional array positions (either [1.8, 2.95] or [3, 2.25]) towards the estimated position of an arbitrarily chosen speaker j, using (45). The two values were compared to the observed DOA ϑ_2,j. Since [1.8, 2.95] fits the observed DOA better than the alternative candidate [3, 2.25], it was finally chosen as the position of the second array.
Next, the position of the third array ([3, 2.2]) was estimated using the known position of the first array and the already estimated position of the second array. The obtained cost function, which does not suffer from the above ambiguity, is depicted in Fig. 7b, and its minimum is attained at the true array position. The resulting localization map is depicted in Fig. 8. The average estimation error of the speakers is 5.5 cm. The estimation error in localizing the second array is 0.05 cm, while the position of the third array is accurately estimated. The same geometrical setup was used to evaluate the LACES algorithm, but with a more realistic scenario, in which the number of speakers is unknown and their activity is overlapping. The positions of the array centroids were accurately found (namely, with negligible estimation error) and the average estimation error in localizing the speakers is 11.7 cm. The final localization map is depicted in Fig. 9. The estimated speaker positions are marked with a heat map and the true locations with black "+" signs.

Conclusions
A major challenge for ad hoc networks is to jointly localize sources and calibrate the positions of the arrays (or nodes) of the network. A novel joint calibration and localization algorithm, suitable for noisy environments, was derived using the EM algorithm. One of the nodes is used as an anchor node. The calibration, i.e., the estimation of the node positions, as well as the speakers' localization, are carried out relative to the position of this anchor node.
To alleviate the initialization challenge of the batch EM, an incremental procedure was proposed that sequentially adds the nodes rather than attempting to solve the entire full-dimension problem at once. The new algorithm, dubbed the LACES algorithm, was experimentally studied using both an extensive simulation study and real recordings. It was also compared with a naïve algorithm based on geometrical considerations. While exhibiting high localization accuracy for both the nodes and the speakers in the case of non-overlapping speakers and a known number of speakers, the naïve algorithm is rendered useless in realistic scenarios for which these simplifying assumptions do not hold. The proposed LACES algorithm maintains its high accuracy under these realistic conditions.