NMF-weighted SRP for multi-speaker direction of arrival estimation: robustness to spatial aliasing while exploiting sparsity in the atom-time domain

Localization of multiple speakers using microphone arrays remains a challenging problem, especially in the presence of noise and reverberation. State-of-the-art localization algorithms generally exploit the sparsity of speech in some representation for this purpose. Whereas the broadband approaches exploit time-domain sparsity for multi-speaker localization, narrowband approaches can additionally exploit sparsity and disjointness in the time-frequency representation. Broadband approaches are robust to spatial aliasing but do not optimally exploit the frequency domain sparsity, leading to poor localization performance for arrays with short inter-microphone distances. Narrowband approaches, on the other hand, are vulnerable to spatial aliasing, making them unsuitable for arrays with large inter-microphone spacing. Proposed here is an approach that decomposes a signal spectrum into a weighted sum of broadband spectral components (atoms) and then exploits signal sparsity in the time-atom representation for simultaneous multiple source localization. The decomposition into atoms is performed in situ using non-negative matrix factorization (NMF) of the short-term amplitude spectra and the localization estimate is obtained via a broadband steered-response power (SRP) approach for each active atom of a time frame. This SRP-NMF approach thereby combines the advantages of the narrowband and broadband approaches and performs well on the multi-speaker localization task for a broad range of inter-microphone spacings. On tests conducted on real-world data from public challenges such as SiSEC and LOCATA, and on data generated from recorded room impulse responses, the SRP-NMF approach outperforms the commonly used variants of narrowband and broadband localization approaches in terms of source detection capability and localization accuracy.


Introduction
Speech remains the natural mode of interaction for humans. Present day smart-home devices are, therefore, increasingly equipped with voice controlled personal assistants to exploit this for human-machine interfacing. The performance of such devices depends, to a large *Correspondence: sushmita.t@research.iiit.ac.in 1 Speech Processing Laboratory, International Institute of Information Technology, Hyderabad, India Full list of author information is available at the end of the article extent, on the performance of the localization techniques used in these systems. The term localization in this context implies the detection and spatial localization of a number of overlapping speakers, and it is usually the first stage in many speech communication applications. Accurate acoustic localization of multiple active speakers, however, remains a challenging problem-especially in the presence of background noise and room reverberation.
Localization is typically achieved by means of the spatial diversity afforded by microphone arrays. Large microphone arrays (inter-microphone spacing in the order of a meter) sample the sound fields at large spatial intervals, thereby reducing the effect of diffuse background noise in the localization. However, these arrays are increasingly prone to spatial aliasing at higher frequencies. Compact microphone arrays, with inter-microphone spacing of the order of a few centimeters, offer greater robustness to spatial aliasing, but are biased by diffuse background noise. The size of the chosen array is usually a trade-off between these two factors and, further, is often driven by practical considerations.
State-of-the-art algorithms for multi-speaker localization usually exploit the sparsity and disjointness [1] of speech signals. While some approaches exploit, mainly, temporal sparsity (i.e., speakers are not concurrently active at all times), others exploit the time-frequency (TF) sparsity (i.e., speakers are not concurrently active at all time and frequency points of the short-time frequency domain representation) of speech. Here, the short-time Fourier transform (STFT) representation is typically chosen because of its computational efficiency. The former approaches are categorized as broadband and the latter as narrowband. For both these approaches, the localization estimates over time and/or frequency are subsequently aggregated to obtain an estimate of the number of active sources and their respective locations.
Frequently used broadband methods are based on the generalized cross-correlation (GCC) [2] and its variants, e.g., the average magnitude difference function (AMDF) estimators [3], the adaptive eigenvalue decomposition approach [4], information theoretic criteria-based approaches [5], and the broadband steered-response power approaches [6]. Such approaches typically localize the dominant source in each time segment, thereby exploiting the temporal sparsity induced by natural pauses in speech. The GCC with phase transform (PHAT) weighting has proven to be the most robust among all the GCC weightings in low noise and reverberant environments [7]. However, in GCC-PHAT, the localization errors increase when the signal to noise ratio (SNR) is poor. To address this issue, researchers have proposed SNR-based weights on GCC-PHAT to highlight the speech dominant TF bins and to de-emphasize TF bins with noise or reverberant speech (see, e.g., [8][9][10][11]). A performance assessment of various GCC algorithms may be found in [12].
Narrowband frequency domain approaches, on the other hand, use the approximate disjointness of speech spectra in their short-time frequency domain representation to localize the dominant source at each timefrequency point. Multi-speaker localization is subsequently done by pooling the individual location estimates. In [13], for example, a (reliability-weighted) histogram is computed on the pooled DoA estimates, and the locations of peaks of the histogram yield the speaker location estimates. In [14], instead of a histogram, a mixture of Gaussians (MoG) model is applied to cluster the timedifference of arrival (TDoA) estimates. The approach of [15] is a generalization of [14] in which speaker coordinates are estimated and tracked, rather than speaker TDoAs. Similarly, in [16] the authors propose a MoG clustering of the direction of arrival (DoA) estimates obtained by a narrowband steered response power (SRP) approach. This is extended in [17], where a Laplacian mixture model is proposed for the clustering. In [18], source separation and localization are iteratively tackled: source masks are first estimated by clustering the TDoA estimates at each TF bin and subsequently SRP-PHAT is used to estimate the DoAs of the separated sources. The estimated DoAs are fed back to the cluster tracking approach for updating the cluster centers. Other recent works build upon this basic idea of exploiting the TF sparsity by introducing reliability weights on the time-frequency units before localization such as [19], which uses SNR-based weights, [20], which uses TF weights predicted by neural-networks, and [21], which considers a weighted histogram of the narrowband estimates, where the weights correspond to a heuristic measure of the reliability of the estimate in each TF bin. A comprehensive overview of the relations between the commonly used localization approaches is presented in [22].
When performing source localization independently at each time-frequency point, typical optimization functions for narrowband localization do not yield a unique DoA estimate above a certain frequency. This is due to the appearance of grating lobes, and the phenomenon is termed spatial aliasing. As the distance between the microphones in the array increases, the frequency at which spatial aliasing occurs reduces, leading to ambiguous DoA estimates across a larger band of frequencies.
Broadband approaches circumvent this problem by summing the optimization function across the whole frequency band and computing a location estimate per time frame. Such averaging is indicated for arrays with large inter-element spacing. However, this constitutes a promiscuous averaging across frequencies, each of which may be dominated by a different speaker, leading to (weakened) evidence for only the strongest speaker in that time frame-i.e., only the location of the highest peak in the angular spectrum of the frame is considered as a potential location estimate and other peaks are usually ignored, since they may not reliably indicate other active speaker locations [23]. Multiple speaker localization is still possible in such cases by aggregating the results across different time frames but, by disregarding the frequency sparsity of speech signals, softer speakers (who may not be dominant for a sufficient number of time frames) may not be localized.
Instead of averaging across the whole frequency range, a compromise can be effected by only averaging across smaller, contiguous sub-bands of frequencies and computing a location estimate per time and sub-band region. By pooling the estimates across the various sub-bands, multispeaker localization may still be achieved. Such bands may be either psycho-acoustically motivated (e.g., the Bark scale used in [24]) or heuristically defined. However, these are fixed frequency groupings and the previously described shortcomings with regard to such groupings still hold. Other approaches [25,26] try to resolve the spatial aliasing problem by trying to unwrap the phase differences of spatially aliased microphone pairs. Initial (rough) estimates of the source locations are required to resolve the spatial aliasing, and it is assumed that at least a few non-aliased microphone pairs are available for this. Consequently, this requires arrays with several microphones at staggered distances such that multiple microphone pairs, aliasing at different frequencies, are available.
The key idea of our approach is to average the narrowband optimization function for localization only across frequency bins that show simultaneous excitation in speech (e.g., fundamental frequency and its harmonics for a voiced speech frame, etc.). Thereby the frequency grouping is not fixed, but data-and time frame dependent. Further, since the averaging is carried out across frequency bins that are simultaneously excited during the speech, the interference from other speakers should be minimal in these bins due to the sparsity and disjointness property. Thus, we can simultaneously exploit the time and frequency sparsity of speech while being robust to spatial aliasing-thereby overcoming the shortcomings of the previously mentioned approaches.
Non-negative matrix factorization (NMF) allows for the possibility to learn such typical groupings of the frequencies based on the magnitude spectrum of the microphone signal. These frequency groupings are termed atoms in our work. Thus we speak of localization based on timeatom sparsity, i.e., in any one time frame only a few atoms are active and each active atom only belongs to one speaker, and localizing across the different atoms in a time frame allows for multi-speaker localization. Since we use the SRP approach for localization, our algorithm is termed the SRP-NMF approach.
The rest of the paper is organized as follows: we first summarize prior approaches utilizing NMF for source localization and place our proposed approach in the context of these works. Next, in Section 3, we describe the signal model, followed by a review of the basic ideas underlying state-of-the-art narrowband and broadband SRP approaches. SRP-NMF is introduced and detailed in Section 5. In Section 6, the approach is thoroughly tested. The details of the databases, the comparison approaches and evaluation metrics, the method used to estimate SRP-NMF parameters, an analysis of the results and limitations of the approach are presented. Finally, we summarize the work and briefly mention the future scope.
2 Prior work using NMF for localization NMF has previously been used for source localization and separation in several conceptually different ways. For example, in [27], NMF is applied to decompose the SRP-PHAT function (collated across all time-frequency points) into a combination of angular activity and source presence activity. This decomposition assumes unique maxima of the SRP-PHAT function (i.e., no spatial aliasing), allowing for a sparse decomposition using NMF.
In [28], on the other hand, NMF is used to decompose the GCC-PHAT correlogram matrix to a low-dimensional representation consisting of bases which are the GCC-PHAT correlation functions for each source location and weights (or activation functions) which determine which time frame is dominated by which speaker. Thus, this approach may be interpreted as a broadband GCC-PHAT approach assuming temporal sparsity. As it is a broadband approach, spatial aliasing is not a problem. However, simultaneous localization of multiple sources within a single time frame is not straightforward.
The approach of [29] is, again, fundamentally different from [27] and [28]. Here, complex NMF is used to decompose the multi-channel instantaneous spatial covariance matrix into a combination of weight functions that indicate which locations in a set of (pre-defined) spatial kernels are active (thus corresponding to localization). This approach is supervised-NMF basis functions of the individual source spectra (learnt in a training stage), as well as a pre-defined spatial dictionary are incorporated into the approach.
In a recent separation approach called GCC-NMF [30], GCC-PHAT is used for localization, and the NMF decomposition of the mixture spectrum is used for dictionary learning. Subsequently, the NMF atoms at each time instant are clustered, using the location estimates from GCC-PHAT, to separate the underlying sources. The results of this approach, along with the successful use of NMF in supervised single-channel source separation, indicate that an NMF-based spectral decomposition results in basis functions (atoms) that are sufficiently distinct for each source, and which do not overlap significantly in time-i.e., we have some form of disjointness in the time-atom domain. Thus, we hypothesise that using such atoms as weighting for the frequency averaging would allow for exploiting this time-atom sparsity and disjointness to simultaneously localize multiple sources within a single time frame while being robust to spatial aliasing due to the frequency averaging. Specifically, we investigate the use of an unsupervised NMF decomposition as a weighting function for the SRPbased localization and apply it to the task of multi-speaker localization. Further, we also investigate modifications to the NMF atoms which lead to a better weighting for the purpose of localization, followed by a rigorous evaluation of NMF-weighted SRP for DoA estimation in various room acoustic environments, and with different array configurations. The proposed approach is comprehensively compared to (a) the state-of-the-art localization approaches for closely spaced microphones and (b) the state-of-the-art methods for widely spaced microphones.

Spatial propagation model
Consider an array of M microphones that captures the signals radiated by Q broadband sound sources in the far field. The microphone locations may be expressed in 3D cartesian co-ordinates by the vectors as r 1 , . . . ,r M . Under the far field assumption, the DoA vector for source q in this co-ordinate system can be denoted as: where 0 ≤ θ ≤ 2π is the azimuth angle between the projection of n q (θ, φ) on to the xy plane and the positive x-axis and 0 ≤ φ ≤ π is the elevation angle with respect to the positive z-axis.
In the STFT domain, the image of source q at the array, in the kth frequency bin and bth time frame, can be compactly denoted as: is the STFTdomain representation of the background noise at the array, the net signal captured by the array can be written as: where Under the common assumption of direct path dominance, and taking the signal at the first microphone as the reference, the image of source q at the array can be re-cast, relative to its image at the reference microphone, as: where k = 2πkf s K is the kth discrete frequency, f s is the sampling rate, K is the number of DFT points, r i = r i −r is the position difference between microphones i and , and c is the speed of sound.
The term 1 , e j k r T 21 n q /c , . . . , e j k r T M1 n q /c T is often termed the relative steering vector A q (k) in the literature. Further, it is also often assumed that each TF-bin is dominated by only one source based on W-disjoint orthogonality property [1]. Consequently, assuming source q is dominant in TF-bin (k, b), (2) can be simplified as:

NMF model
Given the STFT representation S q (k, b) of a source signal q, computed over K discrete frequencies and B time frames, we denote the discrete magnitude spectrogram of this signal by the (K × B) non-negative matrix |S q |.
We shall subsequently use the compact notation: to denote a non-negative matrix and its dimen- A low rank approximation of |S q | of rank D can be obtained using NMF as: where W q ∈ R (K×D) . Eq (5) implies that: The columns w d,q , d = 1, 2, . . . , D , of W q encode spectral patterns typical to the source q and are referred to as atoms in the ensuing. The rows of H q encode the activity of the respective atoms in time. A high value of H q (d, b) for an atom d at frame b indicates that the corresponding atom is active in that time frame. However, based on the assumption of signal sparsity in the time-atom representation, only the atoms whose activation values exceed a certain threshold value need be considered as contributing to the signal at a particular time frame. Let D b,q be the set of atom indices whose activation values exceed the threshold at time frame b. Then, we can further simplify (6) as: where w d,q (k) = W q (k, d).

Narrowband SRP (NB-SRP)
To localize a source at any frequency bin k and time frame b, the NB-SRP approach basically steers a constructive beamformer towards each candidate DoA (θ, φ), in a predefined search space of candidate DoAs, and picks the candidate with the maximum energy as the location of the active source at the TF point (k, b). This assumes, implicitly, that the time-frequency bin in question contains a directional source. Formally, this approach may be written as: where is the DoA estimate at each TF bin and J NB-SRP (k, b, θ, φ) is the optimization function given by: In the above, A(k, b, θ, φ) can be any generic beamformer that leads to a constructive reinforcement of a signal along (θ, φ). In practice, the normalized delay-and-sum beamformer of (10) is widely used. Since this is similar to the PHAT weighting, this approach is called the NB-SRP-PHAT.
The source location estimates for the different TF bins, obtained as in (8), are subsequently clustered and the multi-speaker location estimates are obtained as the centroids of these clusters.

Broadband SRP (BB-SRP)
NB-SRP fails to provide a unique maximum for (8) for frequencies above the spatial aliasing frequency. As the inter-microphone distance increases, a larger range of frequencies are affected by spatial aliasing, and the efficacy of NB-SRP-based methods decreases. To overcome this problem, (9) is summed across the frequency range, leading to the broadband SRP (BB-SRP) optimization function [31]: BB-SRP may be seen as a multi-channel analog of GCC-PHAT approach. Note that (11) yields a single localization result per time frame. The results from multiple time frames can then be clustered as in the NB case for multispeaker localization. The broadband approach ameliorates spatial aliasing at the cost of un-utilized TF sparsity. Since only the dominant source is located in each time frame, softer speakers who are not dominant in a sufficient number of time frames may not be localized.

The SRP-NMF approach
As we shall now demonstrate, by incorporating the D T basis functions W = w 1 , w 2 , . . . , w D T obtained from an NMF decomposition of the microphone signal spectrum, we can exploit sparsity in what we term the 'time-atom' domain. For compactness of expression, and without loss of generality, we shall consider localization only in the azimuth plane (i.e., φ = π/2) in the following.
In each time frame we compute a weighted version of (11) as: where w d (k) is the kth element of the dth atom w d . Based on (12), we obtain a DoA estimate per active atom d as: As previously explained, we expect the atoms w d to embody the spectral patterns typical to the underlying sources. Further, the time-frequency sparsity and disjointness of speech results in each atom being unique to a single source. Thus, the weighted sum in (12) only aggregates information across frequencies that are simultaneously excited by a source, yielding a spatial-aliasing robust location estimate for that source in (13). This is the rationale behind the weighting in (12). Multi-speaker localization is subsequently obtained by clustering the DoA estimates computed for all active atoms. We present an intuitive idea of how this works using a toy example in Section 5.1.

Demonstration of the working principle of SRP-NMF
Consider two spatially separated, simultaneously active sources captured by microphones placed 12 cm apart. Each source is a harmonic complex of different fundamental frequencies. Figure 1 describes the two underlying source atoms w d . In this simple example, w 1 (k) = 1 only at frequencies where the source 1 is active, and zero otherwise (the red lines in Fig. 1) and w 2 (k) = 1 only at frequencies where the source 2 is active (the blue dashed lines in Fig. 1). Figure 2 depicts the BB-SRP optimization function J BB-SRP (θ) and the SRP-NMF optimization functions J SRP-NMF (d, θ), d = 1, 2 for the two atoms, over the azimuthal search space. The dashed lines indicate the ground truth DoAs. The locations of the peaks of the optimization functions correspond to the respective DoA estimates. It is evident from this figure that the BB-SRP can localize only one source when considering the dominant peak (and even then with a large error). When considering the locations of the two largest peaks of J BB-SRP (θ) for estimating the two underlying source DoAs, both estimates are in error by more than 5 • . This is quite large for such a synthetic example. In contrast, the SRP-NMF estimates (one each from the respective J SRP-NMF (d, θ)) are much more accurate and localize both sources. This is because the each atom emphasizes frequency components specific to a single source in the weighted summation, while suppressing the other components.

SRP-NMF implementation
With the intuitive understanding from the previous section, we now focus on the implementation details. In a supervised localization approach, source-specific atoms can be easily obtained by NMF of the individual source spectra. However, we focus here on the unsupervised case, where no prior information of the sources to be localized is available. The atoms, therefore, are extracted from the mixture signal at the microphones. It has previously been demonstrated [32] that NMF of mixture spectra still results in atoms that correspond to the underlying source spectra. However, it is not possible to attribute the atoms to their corresponding sources without additional information. In our case, NMF is performed on the average of the magnitude spectrograms of the signals of the different microphones. Another possibility is a weighted average spectrogram where the weights could be estimated based, e.g., on some SNR measure [33,34]. The steps in SRP-NMF localization are: • Compute the average of the magnitude spectrograms of the signals at all microphones m: This yields the average magnitude spectrum matrix , where K and B indicate, again, the number of discrete frequencies and time frames of the STFT representation.
• Decompose |X| using NMF into the matrix , containing the D T dictionary atoms, and the matrix H ∈ R (D T ×B) + containing the activations of these atoms for the different time frames: The cost function used for NMF is the generalized KL divergence [35]: where [ WH] (k, b) indicates element (k, b) of the product WH. The well-known multiplicative update rules are applied to estimate W and H. Once the atoms are obtained, they can be used for the weighting in (12) • We note that only the active atoms of each time frame are used in the localization. To obtain the active atoms for any frame b, they are sorted in decreasing order of their activations H(d, b) in that frame. The first atoms that contribute to a certain percentage (here empirically set at 99 percent) of the sum of the activation values in that frame are considered as active.
The SRP-NMF optimization function is, consequently, where w d b is an active atom at frame b. • By maximizing (17) with respect to θ, a DoA estimate is obtained for each active atom in frame b as: • Lastly, we compute the histogram of the DoA estimates across all the time-atom combinations. The locations of peaks in the histogram correspond to DoA estimates of the active sources in the given mixtures.

NMF modifications
The NMF decomposition of speech spectra as in (15) results in dictionary atoms with higher energy at low frequencies than at high frequencies. This is because speech signals typically have a larger energy at the lower frequencies. Further, due to the large dynamic range of speech, the energy in high frequency components can be several decibels lower than that in low-frequency components [32]. This characteristic is, subsequently, also reflected in the NMF atoms. When these atoms are used as weighting functions, the resulting histogram of location estimates is biased towards the broadside of the array. We illustrate this on a 3 source stereo mixture (dev1_male3_liverec_250ms_5cm_mix.wav) from the SiSEC database. The details of the database are in Section 6.3. The ground truth DoAs of the 3 sources are 50 • , 80 • and 105 • . The histogram obtained by the SRP-NMF is shown in Fig. 3. The bias at the broadside of the array (around 90 • ) is evident from the figure. While the second and third peaks near 90 • are prominent, the first peak at 50 • , which is away from the broadside, is not clear. This broadside bias can be explained as follows: localization essentially exploits the inter-microphone phase difference (IPD), which is a linear function of frequency (with some added non-linearities in real scenarios due to reverberation [28]). This linear dependence implies that low frequencies have smaller IPDs (concentrated around 0), compared to high frequencies. This leads to localization around the broadside for the low frequencies. When using the weighted averaging, the dominant low frequency components in the atoms thereby emphasize the broadside direction.
To remove this bias, a penalty term [28,36] is added to flatten the atoms, thereby reducing the dominance of low frequency components in the atoms. This penalty term is given by: where W T W (d, d) indicates the elements along the main diagonal of W T W. This leads to the constrained NMF (CNMF) cost function: where β is the weighting factor of the penalty term. The multiplicative update equations subsequently become: where 1 represents a matrix of ones of the appropriate dimensions, represents the Hadamard product and the division is element-wise. This constrained decomposition favors atoms with a flat spectrum. Figure 4 shows the histogram of SRP-NMF when using the CNMF decomposition, where it may be observed that the broadside bias is overcome and azimuths of all the sources are correctly estimated.

Experimental evaluation
In this section, the performance of SRP-NMF is compared to the state-of-the-art localization approaches for closely spaced and widely spaced microphones. Since our approach is closely related to the SRP/GCC family of approaches (being, as it were, an intermediate between the broadband and narrowband versions of these), and because these are the typical, well-understood methods for source localization, these form the basis for our benchmark. Specifically, we compare our approach to: • The NB-SRP-PHAT according to Section 4.1; • A sub-band variant of the above (termed Bark-SRP-PHAT), where the optimization function is averaged over Sub-bands defined according to the Bark scale as in [24]; and • Four other best performing algorithms among a broad variety of localization algorithms benchmarked in [19] and implemented within the open source Multichannel BSS-locate toolbox [37].
For completeness, a brief summary of Bark-SRP-PHAT and the approaches from the Multichannel BSS-locate toolbox is given in Section 6.1. Tests are conducted on four different databases (three of which are openly available) in order to evaluate the approaches across different microphone arrays (different spacing and configurations) as well as in different acoustic environments, from relatively dry (T 60 ≈ 130ms) to highly reverberant (T 60 ≈ 660ms). The evaluation setup is described in Section 6.2, followed by the details of the databases used. The evaluation metrics are described in Section 6.4 and the method adopted for choosing NMF parameters is presented in Section 6.5.
Further, Section 6.6 presents a comparison of the proposed SRP-NMF to a supervised approach wherein the underlying sources at each microphone are first separated using NMF, and localization is subsequently performed on the separated sources. Section 6.7 presents the results of the benchmarking.

Bark-SRP-PHAT
NB-SRP and BB-SRP are, respectively, fully narrowband or fully broadband approaches. However, SRP-NMF only averages the optimization function over a (sourcedependent) subset of frequencies. Thus, we include a comparison with a modified SRP approach where the optimization function is averaged along sub-bands, where the sub-bands are the critical bands defined according to the Bark scale. A single localization estimate is computed for each critical band within a time frame. These estimates are then pooled across all time frames in a manner similar to the narrowband SRP-PHAT approach, to obtain the final localization result. This approach thus exploits available sparsity and disjointness in time and sub-bands. This scale was chosen because of its psychoacoustical relevance, as seen in previous localization research (e.g., [24]).

MVDRW approaches
The MVDRW approaches [19] use minimum variance distortionless response (MVDR) beamforming to estimate, for each frequency bin k and each time frame b, the signal to noise ratio (SNR) in all azimuth directions. Since the spatial characteristics of the sound field are taken into account, the SNR indicates, effectively, time-frequency bins where the direct path of a single-source is dominant. The MVDRWsum variant averages the SNR across all time-frequency points and, subsequently, the DoA estimates are computed as the location of the peaks of this averaged SNR. When all sources are simultaneously active within the observation interval, this averaging is beneficial. However, when a source is active only for a few time frames, the averaging smooths out the estimate, thereby possibly not localizing the source. Hence [19] also proposes an alternative called MVDRWmax, where a max pooling of the SNR is performed over time.

GCC-variants
The two GCC-variants considered in [19] are the GCC-NONLINsum and GCC-NONLINmax. The key difference with the traditional GCC is the non-linear weighting applied to compensate for the wide lobes of the GCC for closely spaced microphones [38]. In GCC-NONLINsum and GCC-NONLINmax, respectively, the sum and max pooling of the GCC-PHAT, computed over the azimuthal space, is done across time.
As previously stated, these approaches were chosen for the benchmark because they have previously been demonstrated to be the best performing approaches among a broad variety of localization approaches. Further, since the implementation of these approaches is open source, it allows for a reproducible, fair benchmark against which new methods may be compared.

Evaluation setup
For all the experiments, the complex-valued short-term Fourier spectra were generated from 16 kHz mixtures using a DFT size of 1024 samples (i.e., K = 512) and a hop size of 512 samples. A periodic square-root Hann window of size 1024 samples is used prior to computing the DFT.
The NMF parameters D T and β are set to 55 and 60 respectively. These parameters are set based on preliminary experiments that are described in Section 6.5. The maximum number of NMF iterations is 200.
For all the approaches, the azimuth search space (0 • − 180 • ) was divided into a uniformly spaced grid with a 2.5 • spacing between adjacent grid points. Further, in all cases, it is assumed that the number of speakers in the mixture is known.

Data
The following four databases, covering a wide range of recording environments, are used for evaluations.

Signal Separation and Evaluation Campaign (SiSEC) [39]
The dev1 and dev2 development data of SiSEC, consisting of under-determined stereo channel speech mixtures, is used. The mixtures are generated by adding live recordings of static sources played through loudspeakers in a meeting room (4.45m x 3.55m x 2.5m) and recorded one at a time by a pair of omnidirectional microphones. Two reverberation times of 130 ms and 250 ms are considered. Two stereo arrays are used: one with an intermicrophone spacing of 5cm (SiSEC1) and the other with spacing of 1m (SiSEC2). The speakers are at a distance of 0.80m or 1.20m from the array, and at azimuths between 30 • and 150 • with respect to the array axis. The data thus collected consists of twenty 10 s long mixtures of 3 or 4 simultaneous speakers (either all male or all female). The ground truth values of DoAs are provided. They were further verified by applying the GCC-NONLIN approach on the individual source images that are available in the data set.
Since the mixtures are generated by mixing live recordings from a real environment, they also contain measurement and background noise. Further, both closely spaced and widely spaced arrays can be evaluated in the same setting. This makes the SiSEC dataset ideal for the comparison of the various approaches.

Challenge on acoustic source LOCalization And
TrAcking (LOCATA) [40] LOCATA comprises multi-channel recordings in a realworld closed environment setup. Among several tasks that this challenge offers, we consider Task1: localization of a single, static speaker using a static microphone array and Task2: localization of multiple static speakers using a static microphone array. The data consists of simultaneous recordings of static sources. Sentences selected from the CSTR VCTK database [41] are played back through loudspeakers in a computer laboratory (dimensions: 7.1m x 9.8m x 3 m, T 60 = 550ms). These signals are recorded by a non-uniformly spaced linear array of 13 microphones [40]. In total, there are 6 mixtures of one to four speakers, and the mixtures are between 3 s to 7 s long. The ground truth values of the source locations are provided.
To evaluate different linear array configurations we consider 4 uniform sub-arrays: 3 mics with 4 cm intermicrophone spacing (LOCATA1), 3 mics with an 8 cm inter-microphone spacing (LOCATA2), 3 mics with 16 cm inter-microphone spacing (LOCATA3), and 5 mics with a 4 cm inter-microphone spacing (LOCATA4). This dataset is generated from live recordings in a highly reverberant room, which makes it interesting for benchmarking localization approaches.

Aachen Multi-Channel Impulse Response Database (AACHEN) [42]
This is a database of impulse responses measured in a room with configurable reverberation levels. Three configurations are available, with respective T 60 s of 160 ms, 360 ms and 610 ms. The measurements were carried out for several source positions for azimuths ranging from 0 • to 180 • in steps of 15 • and at distances of 1 m and 2 m from the microphone array. Three different microphone array configurations are available. For this paper, we choose the room configuration with T 60 = 610 ms. The impulse responses corresponding to sources placed at a distance of 2m from the 8 microphone uniform linear array with an inter-microphone spacing of 8 cm are selected. Multi-channel speech signals are generated by convolving the selected impulse responses with dry speech signals. Fifty mixtures, each 5 s long, and from 3 speakers (randomly chosen from the TSP database [43]), placed randomly at 3 different azimuths with respect to the array axis are generated.

UGent Multi-Channel Impulse Response Database (UGENT)
The impulse responses from the UGENT database were measured using exponential sine sweeps for azimuth angles varying from 15 • to 175 • with the source at a distance of 2m from the array. The recordings were conducted in a meeting room with a T 60 ≈ 660ms. The microphone array is a triangular array with the following microphone coordinates: (0m,0m,0m), (0.043m,0m,0m) and (0.022m, − 0.037m,0m). Fifty mixture files, each of 5 s duration, are generated with 3 speakers (randomly chosen from the TSP database) placed at random, different azimuths. Except for the UGent database, all other databases are openly accessible.

Evaluation metrics
The evaluation measures chosen are a detection metric (Fmeasure) and a location accuracy metric (mean azimuth error -MAE). In a given dataset, let N be the total number of sources in all mixture files and N e be the number of sources that are localized by an approach. The estimated source azimuths for each mixture are matched to the ground truth azimuths by greedy matching to ensure minimum azimuth error. If, after matching, the estimated source azimuth is within ±7.5 • of the ground truth estimate then the source is said to be correctly estimated. Let N c be the number of sources correctly localized for all mixtures. Then the F-measure is given by where Recall = N c /N and Precision = N c /N e The more the number of sources correctly localized, the higher the F-measure.
To quantify the localization accuracy, we present two error metrics: MAE and MAEfine. While MAE is the mean azimuth error between the estimated DoAs and true DoAs after greedy matching (irrespective of whether an approach managed to correctly localize all sources within the 7.5 • tolerance), MAEfine is the mean error between the correctly estimated DoAs and true DoAs. Thus, while MAE gives location accuracy over all the sources in the mixture, MAEfine gives location accuracy of only the correctly detected sources. The former may, therefore, be seen as a global performance metric whereas the latter indicates a local performance criterion with respect to correctly detected sources.

Selecting suitable NMF parameters
To obtain suitable values of the flattening penalty term β and the dictionary size D T , the localization performance of SRP-NMF is evaluated on a small dataset over a range of β and D T . Table 1 shows the F-measure obtained by SRP-NMF on SiSEC1 data for β varying from 0 to 80 and D T from 15 to 55. It may be seen that with β fixed, as the dictionary size increases, the localization performance initially improves and later saturates. A similar trend is observed when D T is fixed and β is increased. The pairs of β and D T that yield an F-measure ≥0.95 (in bold) have similar performance and can be chosen as the NMF parameters. While a lower D T leads to less computational complexity, a lower β leads to a lower residual error in the NMF approximation (i.e., a better approximation of the magnitude spectrum). Therefore, among various combinations of β and D T that yield a comparable F-score, a lower β (such as 30) and lower D T (such as 25) are preferred. However, we choose slightly higher parameter values to ensure robust performance and to allow generalization to other datasets with possibly more reverberation and/or noise. Hence, in the subsequent experiments, the values of β and D T are set to 60 and 55 respectively.
The trends in Table 1 are illustrated in Figs. 5 and 6 for a mixture of 4 concurrent speakers. Figure 5 depicts the histogram plots of SRP-NMF with β ranging between 0 and 80 and D T = 35. It is evident from the figure that when β=0, the peaks further away from the broadside direction are not prominent. The reason for this was explained in Section 5.2.1. As β increases, the peaks become increasingly prominent and can be easily detected. Figure 6 presents the effect of varying D T on the SRP-NMF outcome. Here, β is fixed at 60 and D T increases from 5 to 55. It may be seen that as the dictionary size increases, the histogram peaks become increasingly distinct.

Experiment with supervised separation and localization
The basic idea for the proposed approach has its roots in the successful use of NMF for supervised source separation. Hence, we compare, here, the performance of SRP-NMF against a supervised variant where the microphone signals are first decomposed into their underlying sources using NMF [44] and the localization is then performed on the separated sources using the broadband SRP approach. This approach is termed SNMF-SRP, and is implemented as follows: 1 First, for any test case, the magnitude spectrum |S q | of each individual source q in the mixture is  basis function matrix for that source, where D q is the number of atoms for source q. We assume that the number of atoms is the same for all sources, i.e., D q = D ∀q. The basis functions for all sources are then concatenated into a matrix W as: 2 NMF is next used to decompose the magnitude spectrogram of the mixture at any one reference microphone m as |X m | ≈ WH. In this step, W is kept fixed and only the activation matrix H is adapted. This matrix can then be partitioned into the activations of the individual sources as: where B is the total number of frames in the mixture spectrogram. 3 The spectral magnitude estimates for each source can then be obtained as: | S q | = W q H q . These estimates are used to define binary masks for each source, whereby each TF point is allocated to the source with the maximum contribution (i.e., the dominant source) at that TF-point. 4 The binary masks belonging to each source are, finally, applied to the complex mixture spectrograms at all microphones, and the broadband SRP-PHAT approach is used to obtain the source location estimate.
Since SNMF-SRP first separates the sources before localizing them, the interference from the other sources is minimized in the localization. Further, a binary mask attributes a time- frequency point (k, b) to only the dominant source at that point. Due to this "winner-takesall" strategy, only the dominant source components are preserved at each time-frequency point. Consequently, the effect of the interference on the SRP-PHAT function is further reduced, resulting in more accurate DoA estimates as compared to when continuous masks are used. This experiment with oracle knowledge of the underlying sources should, therefore, give a good indication of the possible upper bound of our proposed approach.
We note that an alternative to the SNMF-SRP would be unsupervised NMF-based separation approaches. Such approaches may be seen as comprising the following two steps: (a) decomposing the mixture spectrum into basis functions and their corresponding activations, and, (b) grouping (clustering) the basis functions according to the sources they belong to, to generate the separated source signals. Usually, some additional signal knowledge or signal model needs to be incorporated into the approach to perform the clustering and the quality of the source separation is, consequently, dependent on the kind of clustering approach. Typically, these steps are not performed independently, and the clustering model is often incorporated (explicitly or implicitly) as a set of additional constraints in the decomposition step. If one neglects the additional step (and associated effort) of grouping the basis components and simply uses the obtained basis functions as a weighting within the SRP-PHAT approach, then there is no conceptual difference between our proposed approach and the use of unsupervised NMF-based separation followed by localization.

Experimental set-up
We compare, first, the SNMF-SRP and SRP-NMF. For this purpose, fifty mixtures, each 5 s long and comprising 3 sound sources at randomly chosen azimuths, ranging from 15 • to 175 • , are generated using room impulse responses from the AACHEN database. The responses corresponding to the room configuration with T 60 = 610 ms are used. Two arrays are considered: the 8 microphone uniform linear array with 8 cm inter-microphone spacing, and a 4-microphone uniform linear sub-array with 4 cm inter-microphone spacing (this is part of a larger 8-mic array with spacing 4-4-4-8-4-4-4). The position of the speakers was also randomly chosen for each test file. The optimal dictionary size and weighting factor for the SNMF-SRP approach are first determined in a manner similar to that for SRP-NMF, and using data from the 3-mic sub-array with inter-microphone spacing of 8 cm. Dictionary sizes D SNMF-SRP of 50, 90, and 130 and weighting factors β SNMF-SRP of 0, 20, and 40 are evaluated. The F-measure and MAE obtained for each case are reported in Table 2, from where it is observed that a dictionary size D SNMF-SRP of 130 and β SNMF-SRP of 20 give the best results in terms of the chosen metrics. These are consequently fixed for the subsequent evaluation of the SNMF-SRP approach.  Since we can expect the best localization performance in the absence of reverberation and interfering sources, we simulate this case as well and include it in the comparison (this is termed direct-path (DP) single-source-SRP-PHAT). To obtain this result, each source in the mixture is individually simulated at the arrays. Further, for generating the source image, the room impulse response is limited to only the filter taps corresponding to the direct path and 20 ms of early reflections. Then, a DoA estimate is obtained by the broadband SRP-PHAT. This corresponds to the localization of a single source in the near absence of reverberation and noise and, thus forms a further performance upper bound for all the approaches.
The figures show that, especially for a smaller number of microphones and lower inter-microphone spacing, the supervised NMF-SRP approach is significantly better than the proposed unsupervised SRP-NMF. The SRP-NMF has the lowest F-measure and the largest MAE. This indicates that incorporating the knowledge of the underlying sources may be beneficial when the spatial diversity is limited and cannot be fully exploited. As the spatial diversity increases, the performance of the unsupervised method begins to converge to that of the supervised approach. As expected, the performance of both these approaches are upper bounded by the DP-single-source SRP-PHAT approach.

Results and discussion
The benchmarking results, in terms of F-measure and mean azimuth errors, for the various datasets are plotted in Fig. 9. We start with the MAEfine metric, which focusses on the average localization error for sources that have been correctly localized. The chosen margin for a correct localization implies that the MAEfine is necessarily ≤ 7.5 • . Figure 9 further indicates that the MAEfine metric is comparable among all the approaches, with a difference of only about 1 deg or less (except for the GCC-NonLinsum and MVDRWsum of LOCATA1 and MVDRWmax of LOCATA2, where it is slightly higher). Thus, we may not claim, categorically, that any particular approach is better than the other in terms of this metric. More indicative metrics for the performance of any approach would be the MAE and F-measure, which are discussed next.
NB-SRP-PHAT localizes well with closely spaced stereo microphones and its performance deteriorates with larger inter-microphone spacing due to spatial aliasing. This is clearly seen from the SiSEC results, where its performance is better in SiSEC1 (5 cm spacing) than in SiSEC2 (1 m spacing). Furthermore, in the case of multiple microphones, it performs poorly in LOCATA1 and UGENT. The reason for the poor performance may be explained as follows: both LOCATA1 and UGENT have only 3 microphones that are very closely spaced (≈4 cms apart) and high reverberation ( T 60 ≈ 600 ms). We hypothesize that the TF bins in which noise or reverberant components are dominant are allocated to spurious locations and, since NB-SRP-PHAT pools the decisions per TF bin, these spurious locations mask the source locations in the histogram. This behavior is worse in closely spaced arrays, as the beam pattern of the SRP optimization function has wide main lobes. Increasing the microphone separation or the number of microphones, narrows the main lobes thus improving the performance -as is evident in LOCATA2/3 and LOCATA4 respectively.
Among the GCC-NONLIN approaches, max pooling performs better than sum pooling, which verifies the conclusions in [19]. Further, due to the non-linearity introduced to improve the performance in microphone arrays with short inter-microphone spacing (cf. Section 6.1), the GCC-NONLINmax performs reasonably well in almost all datasets and microphone configurations.
Between the MVDRW methods, max and sum pooling give similar results for the smaller array of SiSEC1. In SiSEC2, sum pooling is superior, which is consistent with [19]. However, for a larger number of microphones max pooling performs better in all microphone configurations. In LOCATA1 and UGENT, though the beampattern of MVDR has wide lobes due to closely spaced microphones, the performance of the MVDRW-based approaches is better than that of NB-SRP-PHAT. We reason that this is because the MVDRW approaches factor in the sound field characteristics and introduce a frequency weighting that emphasizes the time-frequency bins that are dominated by the direct sound of a single source (cf. Section 6.1). Figure 9 also indicates that SRP-NMF performs consistently well across the various databases. In terms of MAE and F-measure, the scores of SRP-NMF is among the top two for each tested case. The atom weighting highlights time-frequency bins consisting of information relating to a single source, similar to SNR weighting, thus exploiting time-atom sparsity and leading to superior performance in short arrays. In large arrays, averaging the optimization function across the frequency axis ensures robustness to spatial aliasing, thus leading to good performance.  Further, the performance of SRP-NMF is consistently better than (or comparable to) that of the Bark-SRP-PHAT, indicating the benefit of the data-dependent weighted frequency averaging, as compared to a fixed frequency averaging. Lastly, we also include a comparison with the SNMF-SRP (cf. Section 6.6) for the AACHEN and UGENT data. It may be seen, then, that this supervised approach outperforms all the other unsupervised approaches-which is expected, based on the results in Section 6.6 and Figs. 7 and 8. We note that since SNMF-SRP is based on the availability of the underlying source signals, it could not be applied to the LOCATA data, where this information is not consistently available. Further, we chose not to report performance metrics of this approach on the SiSEC data, since all approaches perform well in this case, and the performance of SNMF-SRP would add no value in a comparative analysis of the performances.
While the evaluation conclusively demonstrates the benefit of the proposed SRP-NMF approach, this comes at the cost of increased computational complexity. Its complexity is more than that of NB-SRP-PHAT and depends on the number of active atoms per frame. Further, we empirically observe that SRP-NMF gives good DoA estimates if the data segments are long (> 3s). We hypothesize, consequently, that the NMF dictionary atoms extracted from short segments may not be accurate. Therefore, in the current form, SRP-NMF is not suitable for real-time applications. However, with pretrained dictionaries, the requirement of long data segments can be relaxed and SRP-NMF can be explored for real-time localization.
In order to better appreciate the benefits of the SRP-NMF approach, a graphical comparison of SRP-PHAT and SRP-NMF is presented in Figs. 10 and 11. These depict the histogram plots obtained by SRP-PHAT and SRP-NMF on a real-room mixture consisting of 4 concurrent speakers. Note that SRP-NMF clearly indicates the presence of the 4 sources, whereas the histogram of the SRP-PHAT approach (Fig. 10) does not present clear evidence of all 4 sources. The histogram plot in Fig. 11 can be further improved if subsampling is performed. Subsampling is an approach borrowed from Word Embedding in the field of NLP. Based on the observation that words with high frequency of occurrence do not contribute as much information as the words that occur more rarely, the frequent words are subsampled [45] to counter the imbalance between the frequent and rare words. In a similar manner, in the histogram of estimated DoAs, to counter the imbalances between frequent and occasional DoA estimates (e.g., due to a speaker being only active for a short while), the frequently occurring DoAs are subsampled after crossing a certain threshold. The subsampled version of Fig. 11 is shown in Fig. 12, where the benefit of subsampling is clearly visible.

Conclusions
SRP-NMF is a localization approach that uses the NMF atoms of the underlying sources to obtain a broadband localization estimate for each atom. By exploiting the sparsity of the sources in the time-atom domain, this still allows for the simultaneous localization of multiple sources in a time frame. Thereby the proposed approach combines the benefits of standard broadband and narrowband localization approaches. It can, therefore, be used with compact and large array configurations. Compared to the state-of-the-art narrowband and broadband approaches on data collected in natural room acoustic environments, and with various microphone configurations, the proposed approach can reliably localize the active sources in all cases, and with a comparable or lower localization error. The use of such an NMF-based decomposition and subsequent frequency grouping can be seamlessly extended in a variety of ways. For example, it can be combined with extant methods that improve the robustness of localization approaches to noise (e.g., in combination with the SNR weighting of the MVDR-based approaches), or it can be combined with a priori knowledge in the form of speaker-specific NMF atoms to localize only a specific speaker in the mix. It may also be modified for real-time applications with pre-learned universal NMF dictionary and online estimation of activation coefficients. We intend to address these extensions in future work.