NMF-weighted SRP for multi-speaker direction of arrival estimation: robustness to spatial aliasing while exploiting sparsity in the atom-time domain
EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 13 (2021)
Abstract
Localization of multiple speakers using microphone arrays remains a challenging problem, especially in the presence of noise and reverberation. State-of-the-art localization algorithms generally exploit the sparsity of speech in some representation for this purpose. Whereas the broadband approaches exploit time-domain sparsity for multi-speaker localization, narrowband approaches can additionally exploit sparsity and disjointness in the time-frequency representation. Broadband approaches are robust to spatial aliasing but do not optimally exploit the frequency-domain sparsity, leading to poor localization performance for arrays with short inter-microphone distances. Narrowband approaches, on the other hand, are vulnerable to spatial aliasing, making them unsuitable for arrays with large inter-microphone spacing. Proposed here is an approach that decomposes a signal spectrum into a weighted sum of broadband spectral components (atoms) and then exploits signal sparsity in the time-atom representation for simultaneous multiple source localization. The decomposition into atoms is performed in situ using non-negative matrix factorization (NMF) of the short-term amplitude spectra, and the localization estimate is obtained via a broadband steered-response power (SRP) approach for each active atom of a time frame. This SRP-NMF approach thereby combines the advantages of the narrowband and broadband approaches and performs well on the multi-speaker localization task for a broad range of inter-microphone spacings. In tests conducted on real-world data from public challenges such as SiSEC and LOCATA, and on data generated from recorded room impulse responses, the SRP-NMF approach outperforms the commonly used variants of narrowband and broadband localization approaches in terms of source detection capability and localization accuracy.
Introduction
Speech remains the natural mode of interaction for humans. Present-day smart-home devices are, therefore, increasingly equipped with voice-controlled personal assistants to exploit this for human-machine interfacing. The performance of such devices depends, to a large extent, on the performance of the localization techniques used in these systems. The term localization in this context implies the detection and spatial localization of a number of overlapping speakers, and it is usually the first stage in many speech communication applications. Accurate acoustic localization of multiple active speakers, however, remains a challenging problem, especially in the presence of background noise and room reverberation.
Localization is typically achieved by means of the spatial diversity afforded by microphone arrays. Large microphone arrays (inter-microphone spacing on the order of a meter) sample the sound fields at large spatial intervals, thereby reducing the effect of diffuse background noise in the localization. However, these arrays are increasingly prone to spatial aliasing at higher frequencies. Compact microphone arrays, with inter-microphone spacing on the order of a few centimeters, offer greater robustness to spatial aliasing, but are biased by diffuse background noise. The size of the chosen array is usually a trade-off between these two factors and, further, is often driven by practical considerations.
State-of-the-art algorithms for multi-speaker localization usually exploit the sparsity and disjointness [1] of speech signals. While some approaches exploit, mainly, temporal sparsity (i.e., speakers are not concurrently active at all times), others exploit the time-frequency (TF) sparsity (i.e., speakers are not concurrently active at all time and frequency points of the short-time frequency domain representation) of speech. Here, the short-time Fourier transform (STFT) representation is typically chosen because of its computational efficiency. The former approaches are categorized as broadband and the latter as narrowband. For both these approaches, the localization estimates over time and/or frequency are subsequently aggregated to obtain an estimate of the number of active sources and their respective locations.
Frequently used broadband methods are based on the generalized cross-correlation (GCC) [2] and its variants, e.g., the average magnitude difference function (AMDF) estimators [3], the adaptive eigenvalue decomposition approach [4], information theoretic criteria-based approaches [5], and the broadband steered-response power approaches [6]. Such approaches typically localize the dominant source in each time segment, thereby exploiting the temporal sparsity induced by natural pauses in speech. The GCC with phase transform (PHAT) weighting has proven to be the most robust among all the GCC weightings in low noise and reverberant environments [7]. However, in GCC-PHAT, the localization errors increase when the signal-to-noise ratio (SNR) is poor. To address this issue, researchers have proposed SNR-based weights on GCC-PHAT to highlight the speech-dominant TF bins and to de-emphasize TF bins with noise or reverberant speech (see, e.g., [8–11]). A performance assessment of various GCC algorithms may be found in [12].
Narrowband frequency domain approaches, on the other hand, use the approximate disjointness of speech spectra in their short-time frequency domain representation to localize the dominant source at each time-frequency point. Multi-speaker localization is subsequently done by pooling the individual location estimates. In [13], for example, a (reliability-weighted) histogram is computed on the pooled DoA estimates, and the locations of peaks of the histogram yield the speaker location estimates. In [14], instead of a histogram, a mixture of Gaussians (MoG) model is applied to cluster the time-difference of arrival (TDoA) estimates. The approach of [15] is a generalization of [14] in which speaker coordinates are estimated and tracked, rather than speaker TDoAs. Similarly, in [16] the authors propose a MoG clustering of the direction of arrival (DoA) estimates obtained by a narrowband steered response power (SRP) approach. This is extended in [17], where a Laplacian mixture model is proposed for the clustering. In [18], source separation and localization are iteratively tackled: source masks are first estimated by clustering the TDoA estimates at each TF bin and subsequently SRP-PHAT is used to estimate the DoAs of the separated sources. The estimated DoAs are fed back to the cluster tracking approach for updating the cluster centers. Other recent works build upon this basic idea of exploiting the TF sparsity by introducing reliability weights on the time-frequency units before localization, such as [19], which uses SNR-based weights, [20], which uses TF weights predicted by neural networks, and [21], which considers a weighted histogram of the narrowband estimates, where the weights correspond to a heuristic measure of the reliability of the estimate in each TF bin. A comprehensive overview of the relations between the commonly used localization approaches is presented in [22].
When performing source localization independently at each time-frequency point, typical optimization functions for narrowband localization do not yield a unique DoA estimate above a certain frequency. This is due to the appearance of grating lobes, and the phenomenon is termed spatial aliasing. As the distance between the microphones in the array increases, the frequency at which spatial aliasing occurs reduces, leading to ambiguous DoA estimates across a larger band of frequencies. Broadband approaches circumvent this problem by summing the optimization function across the whole frequency band and computing a location estimate per time frame. Such averaging is indicated for arrays with large inter-element spacing. However, this constitutes an indiscriminate averaging across frequencies, each of which may be dominated by a different speaker, leading to (weakened) evidence for only the strongest speaker in that time frame. That is, only the location of the highest peak in the angular spectrum of the frame is considered as a potential location estimate, and other peaks are usually ignored, since they may not reliably indicate other active speaker locations [23]. Multiple speaker localization is still possible in such cases by aggregating the results across different time frames but, by disregarding the frequency sparsity of speech signals, softer speakers (who may not be dominant for a sufficient number of time frames) may not be localized.
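As a rough illustration of how the aliasing limit moves with array size, the sketch below evaluates the usual half-wavelength bound f = c/(2d) for a single microphone pair. The function name and the value c = 343 m/s are assumptions made here for illustration, and the bound is the worst case (endfire incidence) rather than a figure taken from the paper.

```python
def aliasing_frequency(spacing_m: float, speed_of_sound: float = 343.0) -> float:
    """Lowest frequency (Hz) at which a microphone pair can alias spatially.

    Uses the half-wavelength criterion d <= lambda / 2, i.e. f = c / (2 d).
    The default speed of sound is an assumed room-temperature value.
    """
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return speed_of_sound / (2.0 * spacing_m)


# Compact, medium and large spacings (values chosen for illustration only).
for d in (0.05, 0.12, 1.0):
    print(f"d = {d:4.2f} m -> aliasing above roughly {aliasing_frequency(d):7.1f} Hz")
```

Under these assumptions a 5 cm pair is unambiguous over a large part of the speech band, whereas a 1 m pair aliases for almost all of it, which is the regime where frequency averaging becomes necessary.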
Instead of averaging across the whole frequency range, a compromise can be effected by only averaging across smaller, contiguous subbands of frequencies and computing a location estimate per time and subband region. By pooling the estimates across the various subbands, multi-speaker localization may still be achieved. Such bands may be either psychoacoustically motivated (e.g., the Bark scale used in [24]) or heuristically defined. However, these are fixed frequency groupings and the previously described shortcomings with regard to such groupings still hold. Other approaches [25, 26] try to resolve the spatial aliasing problem by trying to unwrap the phase differences of spatially aliased microphone pairs. Initial (rough) estimates of the source locations are required to resolve the spatial aliasing, and it is assumed that at least a few non-aliased microphone pairs are available for this. Consequently, this requires arrays with several microphones at staggered distances such that multiple microphone pairs, aliasing at different frequencies, are available.
The key idea of our approach is to average the narrowband optimization function for localization only across frequency bins that show simultaneous excitation in speech (e.g., fundamental frequency and its harmonics for a voiced speech frame, etc.). Thereby the frequency grouping is not fixed, but data- and time-frame-dependent. Further, since the averaging is carried out across frequency bins that are simultaneously excited during the speech, the interference from other speakers should be minimal in these bins due to the sparsity and disjointness property. Thus, we can simultaneously exploit the time and frequency sparsity of speech while being robust to spatial aliasing, thereby overcoming the shortcomings of the previously mentioned approaches.
Non-negative matrix factorization (NMF) makes it possible to learn such typical groupings of the frequencies based on the magnitude spectrum of the microphone signal. These frequency groupings are termed atoms in our work. Thus we speak of localization based on time-atom sparsity, i.e., in any one time frame only a few atoms are active and each active atom only belongs to one speaker, and localizing across the different atoms in a time frame allows for multi-speaker localization. Since we use the SRP approach for localization, our algorithm is termed the SRP-NMF approach.
The rest of the paper is organized as follows: we first summarize prior approaches utilizing NMF for source localization and place our proposed approach in the context of these works. Next, in Section 3, we describe the signal model, followed by a review of the basic ideas underlying state-of-the-art narrowband and broadband SRP approaches. SRP-NMF is introduced and detailed in Section 5. In Section 6, the approach is thoroughly tested. The details of the databases, the comparison approaches and evaluation metrics, the method used to estimate SRP-NMF parameters, an analysis of the results and limitations of the approach are presented. Finally, we summarize the work and briefly mention the future scope.
Prior work using NMF for localization
NMF has previously been used for source localization and separation in several conceptually different ways. For example, in [27], NMF is applied to decompose the SRP-PHAT function (collated across all time-frequency points) into a combination of angular activity and source presence activity. This decomposition assumes unique maxima of the SRP-PHAT function (i.e., no spatial aliasing), allowing for a sparse decomposition using NMF.
In [28], on the other hand, NMF is used to decompose the GCC-PHAT correlogram matrix to a low-dimensional representation consisting of bases which are the GCC-PHAT correlation functions for each source location and weights (or activation functions) which determine which time frame is dominated by which speaker. Thus, this approach may be interpreted as a broadband GCC-PHAT approach assuming temporal sparsity. As it is a broadband approach, spatial aliasing is not a problem. However, simultaneous localization of multiple sources within a single time frame is not straightforward.
The approach of [29] is, again, fundamentally different from [27] and [28]. Here, complex NMF is used to decompose the multichannel instantaneous spatial covariance matrix into a combination of weight functions that indicate which locations in a set of (predefined) spatial kernels are active (thus corresponding to localization). This approach is supervised—NMF basis functions of the individual source spectra (learnt in a training stage), as well as a predefined spatial dictionary are incorporated into the approach.
In a recent separation approach called GCC-NMF [30], GCC-PHAT is used for localization, and the NMF decomposition of the mixture spectrum is used for dictionary learning. Subsequently, the NMF atoms at each time instant are clustered, using the location estimates from GCC-PHAT, to separate the underlying sources. The results of this approach, along with the successful use of NMF in supervised single-channel source separation, indicate that an NMF-based spectral decomposition results in basis functions (atoms) that are sufficiently distinct for each source, and which do not overlap significantly in time, i.e., we have some form of disjointness in the time-atom domain. Thus, we hypothesise that using such atoms as weighting for the frequency averaging would allow for exploiting this time-atom sparsity and disjointness to simultaneously localize multiple sources within a single time frame while being robust to spatial aliasing due to the frequency averaging.
Specifically, we investigate the use of an unsupervised NMF decomposition as a weighting function for the SRP-based localization and apply it to the task of multi-speaker localization. Further, we also investigate modifications to the NMF atoms which lead to a better weighting for the purpose of localization, followed by a rigorous evaluation of NMF-weighted SRP for DoA estimation in various room acoustic environments, and with different array configurations. The proposed approach is comprehensively compared to (a) the state-of-the-art localization approaches for closely spaced microphones and (b) the state-of-the-art methods for widely spaced microphones.
Signal model
Spatial propagation model
Consider an array of M microphones that captures the signals radiated by Q broadband sound sources in the far field. The microphone locations may be expressed in 3D Cartesian coordinates by the vectors r_{1}, …, r_{M}. Under the far-field assumption, the DoA vector for source q in this coordinate system can be denoted as:
$$ \mathbf{n}_{q}(\theta,\phi) = \left(\cos\theta\,\sin\phi,\ \sin\theta\,\sin\phi,\ \cos\phi\right)^{T}, $$ (1)
where 0≤θ≤2π is the azimuth angle between the projection of n_{q}(θ,ϕ) on to the xy-plane and the positive x-axis and 0≤ϕ≤π is the elevation angle with respect to the positive z-axis.
In the STFT domain, the image of source q at the array, in the kth frequency bin and bth time frame, can be compactly denoted as: X_{q}(k,b)=[X_{q,1}(k,b), …,X_{q,M}(k,b)]^{T}. If V(k,b) is the STFT-domain representation of the background noise at the array, the net signal captured by the array can be written as:
$$ \mathbf{X}(k,b) = \sum_{q=1}^{Q}\mathbf{X}_{q}(k,b) + \mathbf{V}(k,b), $$ (2)
where X(k,b)=[X_{1}(k,b), …,X_{M}(k,b)]^{T}.
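Under the common assumption of direct path dominance, and taking the signal at the first microphone as the reference, the image of source q at the array can be recast, relative to its image at the reference microphone, as:
$$ \mathbf{X}_{q}(k,b) = \left(1, e^{\jmath\,\Omega_{k}\mathbf{r}_{21}^{T}\mathbf{n}_{q}/c},\ldots, e^{\jmath\,\Omega_{k}\mathbf{r}_{M1}^{T}\mathbf{n}_{q}/c}\right)^{T} X_{q,1}(k,b), $$ (3)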
where \(\Omega _{k}= \frac {2 \pi kf_{s}}{K}\) is the kth discrete frequency, f_{s} is the sampling rate, K is the number of DFT points, r_{iℓ}=r_{i}−r_{ℓ} is the position difference between microphones i and ℓ, and c is the speed of sound.
The term \(\left(1, e^{\jmath\,\Omega_{k}\mathbf{r}_{21}^{T}\mathbf{n}_{q}/c},\ldots, e^{\jmath\,\Omega_{k}\mathbf{r}_{M1}^{T}\mathbf{n}_{q}/c}\right)^{T}\) is often termed the relative steering vector A_{q}(k) in the literature. Further, it is also often assumed that each TF bin is dominated by only one source, based on the W-disjoint orthogonality property [1]. Consequently, assuming source q is dominant in TF bin (k,b), (2) can be simplified as:
$$ \mathbf{X}(k,b) \approx \mathbf{A}_{q}(k)\,X_{q,1}(k,b) + \mathbf{V}(k,b). $$ (4)
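The relative steering vector can be sketched numerically as follows. This is a minimal illustration, not the authors' code: it assumes the usual spherical-coordinate form of the far-field DoA vector (azimuth from the positive x-axis, elevation from the positive z-axis), a speed of sound of 343 m/s, and hypothetical function names.

```python
import numpy as np


def doa_unit_vector(azimuth: float, elevation: float) -> np.ndarray:
    """Far-field DoA unit vector (assumed spherical-coordinate convention)."""
    return np.array([np.cos(azimuth) * np.sin(elevation),
                     np.sin(azimuth) * np.sin(elevation),
                     np.cos(elevation)])


def angular_frequency(k: int, fs: float, n_dft: int) -> float:
    """Omega_k = 2 * pi * k * fs / K for DFT bin k."""
    return 2.0 * np.pi * k * fs / n_dft


def relative_steering_vector(mic_positions, azimuth, elevation, omega, c=343.0):
    """Steering vector relative to microphone 1; mic_positions has shape (M, 3)."""
    n = doa_unit_vector(azimuth, elevation)
    r_rel = mic_positions - mic_positions[0]        # r_{i1} = r_i - r_1
    return np.exp(1j * omega * (r_rel @ n) / c)     # first element is always 1


# Two microphones 5 cm apart on the x-axis, bin 32 of a 1024-point DFT at 16 kHz.
mics = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0]])
omega = angular_frequency(32, 16000.0, 1024)
a_broadside = relative_steering_vector(mics, np.pi / 2, np.pi / 2, omega)
a_endfire = relative_steering_vector(mics, 0.0, np.pi / 2, omega)
print(a_broadside, a_endfire)
```

For a broadside source the relative phase vanishes, while for an endfire source it equals Ω_k d/c, which is the quantity that wraps once the spacing exceeds half a wavelength.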
NMF model
Given the STFT representation S_{q}(k,b) of a source signal q, computed over K discrete frequencies and B time frames, we denote the discrete magnitude spectrogram of this signal by the (K×B) non-negative matrix S_{q}. We shall subsequently use the compact notation: \(\mathbf {S}_{q}\in \mathbb {R}_{+}^{({K}\times B)}\) to denote a non-negative matrix and its dimensions. The element (k,b) of the matrix S_{q} is denoted as S_{q}(k,b).
A low-rank approximation of S_{q} of rank D can be obtained using NMF as:
$$ \mathbf{S}_{q} \approx \mathbf{W}_{q}\mathbf{H}_{q}, $$ (5)
where \(\mathbf {W}_{q} \in \mathbb {R}_{+}^{({K}\times D)}\) and \(\mathbf {H}_{q} \in \mathbb {R}_{+}^{(D\times B)}\). Eq (5) implies that:
$$ S_{q}(k,b) \approx \sum_{d=1}^{D} W_{q}(k,d)\,H_{q}(d,b). $$ (6)
The columns w_{d,q}, d=1,2,…,D, of W_{q} encode spectral patterns typical to the source q and are referred to as atoms in the following. The rows of H_{q} encode the activity of the respective atoms in time. A high value of H_{q}(d,b) for an atom d at frame b indicates that the corresponding atom is active in that time frame.
However, based on the assumption of signal sparsity in the time-atom representation, only the atoms whose activation values exceed a certain threshold value need be considered as contributing to the signal at a particular time frame. Let \(\mathcal {D}_{b,q}\) be the set of atom indices whose activation values exceed the threshold at time frame b. Then, we can further simplify (6) as:
$$ S_{q}(k,b) \approx \sum_{d\in\mathcal{D}_{b,q}} w_{d,q}(k)\,H_{q}(d,b), $$ (7)
where w_{d,q}(k)=W_{q}(k,d).
Steered response power beamformers
Narrowband SRP (NB-SRP)
To localize a source at any frequency bin k and time frame b, the NB-SRP approach steers a constructive beamformer towards each candidate DoA (θ,ϕ), in a predefined search space of candidate DoAs, and picks the candidate with the maximum energy as the location of the active source at the TF point (k,b). This assumes, implicitly, that the time-frequency bin in question contains a directional source. Formally, this approach may be written as:
$$ \left(\widehat{\theta}(k,b),\widehat{\phi}(k,b)\right) = \underset{\theta,\phi}{\text{argmax}}\, \mathcal{J}_{\text{NB-SRP}}(k,b,\theta,\phi), $$ (8)
where \((\widehat {\theta }(k,b),\widehat {\phi }(k,b))\) is the DoA estimate at each TF bin and \(\mathcal {J}_{\text {NB-SRP}}(k,b, \theta,\phi)\) is the optimization function given by:
$$ \mathcal{J}_{\text{NB-SRP}}(k,b,\theta,\phi) = \left|\mathbf{A}^{H}(k,b,\theta,\phi)\,\mathbf{X}(k,b)\right|^{2}. $$ (9)
In the above, A(k,b,θ,ϕ) can be any generic beamformer that leads to a constructive reinforcement of a signal along (θ,ϕ). In practice, the normalized delay-and-sum beamformer of (10) is widely used. Since this is similar to the PHAT weighting, this approach is called the NB-SRP-PHAT.
The source location estimates for the different TF bins, obtained as in (8), are subsequently clustered and the multi-speaker location estimates are obtained as the centroids of these clusters.
Broadband SRP (BB-SRP)
NB-SRP fails to provide a unique maximum for (8) for frequencies above the spatial aliasing frequency. As the inter-microphone distance increases, a larger range of frequencies is affected by spatial aliasing, and the efficacy of NB-SRP-based methods decreases. To overcome this problem, (9) is summed across the frequency range, leading to the broadband SRP (BB-SRP) optimization function [31]:
$$ \mathcal{J}_{\text{BB-SRP}}(b,\theta,\phi) = \sum_{k}\left|\mathbf{A}^{H}(k,b,\theta,\phi)\,\mathbf{X}(k,b)\right|^{2}. $$ (11)
BB-SRP may be seen as a multichannel analog of the GCC-PHAT approach. Note that (11) yields a single localization result per time frame. The results from multiple time frames can then be clustered as in the NB case for multi-speaker localization. The broadband approach ameliorates spatial aliasing at the cost of unutilized TF sparsity. Since only the dominant source is located in each time frame, softer speakers who are not dominant in a sufficient number of time frames may not be localized.
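The difference between the narrowband and broadband variants can be made concrete with a short sketch. It is an illustration under stated assumptions rather than the authors' implementation: the search is restricted to the azimuth plane, the beamformer is taken to be a delay-and-sum steering vector applied to magnitude-normalized (PHAT-like) microphone spectra, c = 343 m/s, and all function names are hypothetical.

```python
import numpy as np


def srp_phat_map(X, mic_positions, omegas, azimuths, c=343.0):
    """Evaluate |A^H X|^2 for one time frame.

    X: (M, K) complex STFT frame; mic_positions: (M, 3) in metres;
    omegas: (K,) angular frequencies in rad/s; azimuths: (T,) in radians.
    Returns a (K, T) array; elevation is fixed at pi/2.
    """
    r_rel = mic_positions - mic_positions[0]                   # r_{i1}
    n = np.stack([np.cos(azimuths), np.sin(azimuths),
                  np.zeros_like(azimuths)], axis=1)            # (T, 3) unit vectors
    delays = (r_rel @ n.T) / c                                 # (M, T) relative delays
    steer = np.exp(1j * omegas[:, None, None] * delays[None])  # (K, M, T)
    X_phat = X / np.maximum(np.abs(X), 1e-12)                  # assumed PHAT-like normalization
    return np.abs(np.einsum("kmt,mk->kt", steer.conj(), X_phat)) ** 2


def nb_srp(J, azimuths):
    """One DoA estimate per frequency bin (narrowband)."""
    return azimuths[np.argmax(J, axis=1)]


def bb_srp(J, azimuths):
    """One DoA estimate per frame after summing over frequency (broadband)."""
    return azimuths[np.argmax(J.sum(axis=0))]


# Synthetic single source at 60 degrees, two microphones 5 cm apart on the x-axis.
fs, n_fft = 16000, 1024
omegas = 2 * np.pi * np.arange(n_fft // 2 + 1) * fs / n_fft
mics = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0]])
grid = np.deg2rad(np.arange(0, 181))
true_az = np.deg2rad(60)
X = np.stack([np.ones_like(omegas, dtype=complex),
              np.exp(1j * omegas * 0.05 * np.cos(true_az) / 343.0)])
J = srp_phat_map(X, mics, omegas, grid)
theta_nb = nb_srp(J, grid)   # shape (K,): per-bin estimates
theta_bb = bb_srp(J, grid)   # scalar: per-frame estimate
print(np.rad2deg(theta_bb))
```

In a real mixture the per-bin estimates would be pooled or clustered across time and frequency, whereas the broadband variant contributes only one estimate per frame.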
The SRP-NMF approach
As we shall now demonstrate, by incorporating the D_{T} basis functions \(\mathbf {W} = \left [\mathbf {w}_{1},\mathbf {w}_{2},\ldots,\mathbf {w}_{D_{T}}\right ]\) obtained from an NMF decomposition of the microphone signal spectrum, we can exploit sparsity in what we term the ‘time-atom’ domain. For compactness of expression, and without loss of generality, we shall consider localization only in the azimuth plane (i.e., ϕ=π/2) in the following.
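In each time frame we compute a weighted version of (11) as:
$$ \mathcal{J}_{\text{SRP-NMF}}(d,b,\theta) = \sum_{k} w_{d}(k)\left|\mathbf{A}^{H}(k,b,\theta)\,\mathbf{X}(k,b)\right|^{2}, $$ (12)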
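where w_{d}(k) is the kth element of the dth atom w_{d}. Based on (12), we obtain a DoA estimate per active atom d as:
$$ \widehat{\theta}(d,b) = \underset{\theta}{\text{argmax}}\, \mathcal{J}_{\text{SRP-NMF}}(d,b,\theta). $$ (13)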
As previously explained, we expect the atoms w_{d} to embody the spectral patterns typical to the underlying sources. Further, the time-frequency sparsity and disjointness of speech results in each atom being unique to a single source. Thus, the weighted sum in (12) only aggregates information across frequencies that are simultaneously excited by a source, yielding a spatial-aliasing robust location estimate for that source in (13). This is the rationale behind the weighting in (12). Multi-speaker localization is subsequently obtained by clustering the DoA estimates computed for all active atoms.
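The atom weighting can be sketched on synthetic data as follows. This is an illustration under assumptions, not the authors' code: two idealized harmonic sources with disjoint spectra, binary atoms that mark each source's active bins, a two-microphone array with 12 cm spacing, a PHAT-like magnitude normalization, c = 343 m/s, and hypothetical function names.

```python
import numpy as np


def srp_map(X, mic_positions, omegas, azimuths, c=343.0):
    """|A^H X|^2 on a (K, T) grid for one frame X of shape (M, K)."""
    r_rel = mic_positions - mic_positions[0]
    n = np.stack([np.cos(azimuths), np.sin(azimuths),
                  np.zeros_like(azimuths)], axis=1)
    delays = (r_rel @ n.T) / c                                  # (M, T)
    steer = np.exp(1j * omegas[:, None, None] * delays[None])   # (K, M, T)
    X_phat = X / np.maximum(np.abs(X), 1e-12)                   # assumed normalization
    return np.abs(np.einsum("kmt,mk->kt", steer.conj(), X_phat)) ** 2


def srp_nmf_frame(X, atoms, mic_positions, omegas, azimuths):
    """Atom-weighted frequency sum; atoms has shape (K, D). Returns (D,) estimates."""
    J = atoms.T @ srp_map(X, mic_positions, omegas, azimuths)   # (D, T)
    return azimuths[np.argmax(J, axis=1)], J


fs, n_fft, spacing = 16000, 1024, 0.12
k = np.arange(n_fft // 2 + 1)
omegas = 2 * np.pi * k * fs / n_fft
mics = np.array([[0.0, 0.0, 0.0], [spacing, 0.0, 0.0]])
grid = np.deg2rad(np.arange(0, 181))

# Harmonic bins of the two toy sources, with shared bins removed (disjoint spectra).
bins1 = [b for b in range(7, 500, 7) if b % 11]
bins2 = [b for b in range(11, 500, 11) if b % 7]
atoms = np.zeros((k.size, 2))
atoms[bins1, 0] = 1.0
atoms[bins2, 1] = 1.0

X = np.zeros((2, k.size), dtype=complex)
for bins, az_deg in ((bins1, 50), (bins2, 110)):
    X[0, bins] += 1.0
    X[1, bins] += np.exp(1j * omegas[bins] * spacing * np.cos(np.deg2rad(az_deg)) / 343.0)

estimates, J_atoms = srp_nmf_frame(X, atoms, mics, omegas, grid)
print(np.rad2deg(estimates))
```

Because each atom weights only the bins excited by one source, the two weighted sums peak at the two source directions even though many of those bins lie above the aliasing frequency of a 12 cm pair.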
We present an intuitive idea of how this works using a toy example in Section 5.1.
Demonstration of the working principle of SRP-NMF
Consider two spatially separated, simultaneously active sources captured by microphones placed 12 cm apart. Each source is a harmonic complex with a different fundamental frequency. Figure 1 describes the two underlying source atoms w_{d}. In this simple example, w_{1}(k)=1 only at frequencies where source 1 is active, and zero otherwise (the red lines in Fig. 1), and w_{2}(k)=1 only at frequencies where source 2 is active (the blue dashed lines in Fig. 1). Figure 2 depicts the BB-SRP optimization function \(\mathcal {J}_{\text {BB-SRP}}(\theta)\) and the SRP-NMF optimization functions \(\mathcal {J}_{\text {SRP-NMF}}(d, \theta),\ d=1,2\) for the two atoms, over the azimuthal search space. The dashed lines indicate the ground truth DoAs. The locations of the peaks of the optimization functions correspond to the respective DoA estimates. It is evident from this figure that the BB-SRP can localize only one source when considering the dominant peak (and even then with a large error). When considering the locations of the two largest peaks of \(\mathcal {J}_{\text {BB-SRP}}(\theta)\) for estimating the two underlying source DoAs, both estimates are in error by more than 5^{∘}. This is quite large for such a synthetic example. In contrast, the SRP-NMF estimates (one each from the respective \(\mathcal {J}_{\text {SRP-NMF}}(d, \theta)\)) are much more accurate and localize both sources. This is because each atom emphasizes frequency components specific to a single source in the weighted summation, while suppressing the other components.
SRP-NMF implementation
With the intuitive understanding from the previous section, we now focus on the implementation details. In a supervised localization approach, source-specific atoms can be easily obtained by NMF of the individual source spectra. However, we focus here on the unsupervised case, where no prior information of the sources to be localized is available. The atoms, therefore, are extracted from the mixture signal at the microphones. It has previously been demonstrated [32] that NMF of mixture spectra still results in atoms that correspond to the underlying source spectra. However, it is not possible to attribute the atoms to their corresponding sources without additional information. In our case, NMF is performed on the average of the magnitude spectrograms of the signals of the different microphones. Another possibility is a weighted average spectrogram where the weights could be estimated based, e.g., on some SNR measure [33, 34].
The steps in SRP-NMF localization are:

Compute the average of the magnitude spectrograms of the signals at all microphones m:
$$ \overline{X(k,b)} = \frac{1}{M} \sum_{m=1}^{M}{\left|X_{m}(k,b)\right|}. $$ (14)
This yields the average magnitude spectrum matrix \(\overline {\mathbf {X}} \in \mathbb {R}_{+}^{(K\times B)}\), where K and B indicate, again, the number of discrete frequencies and time frames of the STFT representation.

Decompose \(\overline {\mathbf {X}}\) using NMF into the matrix \(\mathbf {W} \in {\mathbb {R}_{+}^{(K\times {D_{T}})}}\), containing the D_{T} dictionary atoms, and the matrix \(\mathbf {H}\in {\mathbb {R}_{+}^{(D_{T}\times {B})}}\) containing the activations of these atoms for the different time frames:
$$ \overline{\mathbf{X}} \approx \mathbf{W}\mathbf{H}\,. $$ (15)
The cost function used for NMF is the generalized KL divergence [35]:
$$ D_{\text{KL}}(\overline{\mathbf{X}},\mathbf{WH}) = \sum_{k} \sum_{b}\left(\overline{X(k,b)}\log \left(\frac{\overline{X(k,b)}}{[\mathbf{WH}]{(k,b)}}\right) - \overline{X(k,b)}+[\mathbf{WH}]{(k,b)}\right), $$ (16)
where [WH](k,b) indicates element (k,b) of the product WH. The well-known multiplicative update rules are applied to estimate W and H. Once the atoms are obtained, they can be used for the weighting in (12).

We note that only the active atoms of each time frame are used in the localization. To obtain the active atoms for any frame b, the atoms are sorted in decreasing order of their activations H(d,b) in that frame. The first atoms that together contribute a certain percentage (here empirically set at 99 percent) of the sum of the activation values in that frame are considered active.
The SRP-NMF optimization function is, consequently,
$$ \mathcal{J}_{\text{SRP-NMF}}(d_{b},b,\theta) = \sum_{k} w_{d_{b}}(k)\left| \mathbf{A}^{H}(k,d_{b},b, \theta)\mathbf{X}(k,b)\right|^{2}, $$ (17)
where \(\mathbf {w}_{d_{b}}\) is an active atom at frame b.

By maximizing (17) with respect to θ, a DoA estimate is obtained for each active atom in frame b as:
$$ \widehat{\theta}(d_{b},b) = \underset{\theta}{\text{argmax}}\, {\mathcal{J}_{\text{SRP-NMF}}(d_{b},b,\theta).} $$ (18)

Lastly, we compute the histogram of the DoA estimates across all the time-atom combinations. The locations of peaks in the histogram correspond to DoA estimates of the active sources in the given mixtures.
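The decomposition and atom-selection steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the standard Lee-Seung multiplicative updates for the generalized KL divergence, random initialization, a small constant `eps` for numerical safety, and hypothetical function names; the random array `X` merely stands in for a multichannel STFT.

```python
import numpy as np


def generalized_kl(V, R, eps=1e-12):
    """Generalized KL divergence between non-negative matrices V and R."""
    return float(np.sum(V * np.log((V + eps) / (R + eps)) - V + R))


def nmf_kl(V, n_atoms, n_iter=200, seed=0, eps=1e-12):
    """KL-NMF of V (K x B) with multiplicative updates; returns W (K x D), H (D x B)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps
    H = rng.random((n_atoms, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H


def active_atoms(h, fraction=0.99):
    """Indices of the strongest atoms that together reach `fraction` of sum(h)."""
    order = np.argsort(h)[::-1]                     # decreasing activation
    cumulative = np.cumsum(h[order])
    n_active = int(np.searchsorted(cumulative, fraction * cumulative[-1])) + 1
    return order[:n_active]


# Stand-in for a multichannel STFT of shape (M, K, B); values are random.
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 64, 40)) + 1j * rng.standard_normal((2, 64, 40))
X_bar = np.abs(X).mean(axis=0)                      # average magnitude spectrogram
W, H = nmf_kl(X_bar, n_atoms=8)
frame_atoms = active_atoms(H[:, 0])                 # active atoms of the first frame
print(frame_atoms)
```

The per-atom DoA estimates of all frames could then be pooled with `np.histogram` over the azimuth grid, with the peaks of the resulting counts taken as source directions.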
NMF modifications
The NMF decomposition of speech spectra as in (15) results in dictionary atoms with higher energy at low frequencies than at high frequencies. This is because speech signals typically have a larger energy at the lower frequencies. Further, due to the large dynamic range of speech, the energy in high-frequency components can be several decibels lower than that in low-frequency components [32]. This characteristic is, subsequently, also reflected in the NMF atoms. When these atoms are used as weighting functions, the resulting histogram of location estimates is biased towards the broadside of the array. We illustrate this on a 3-source stereo mixture (dev1_male3_liverec_250ms_5cm_mix.wav) from the SiSEC database. The details of the database are in Section 6.3. The ground truth DoAs of the 3 sources are 50^{∘}, 80^{∘} and 105^{∘}. The histogram obtained by the SRP-NMF is shown in Fig. 3. The bias at the broadside of the array (around 90^{∘}) is evident from the figure. While the second and third peaks near 90^{∘} are prominent, the first peak at 50^{∘}, which is away from the broadside, is not clear.
This broadside bias can be explained as follows: localization essentially exploits the inter-microphone phase difference (IPD), which is a linear function of frequency (with some added nonlinearities in real scenarios due to reverberation [28]). This linear dependence implies that low frequencies have smaller IPDs (concentrated around 0), compared to high frequencies. This leads to localization around the broadside for the low frequencies. When using the weighted averaging, the dominant low-frequency components in the atoms thereby emphasize the broadside direction.
To remove this bias, a penalty term [28, 36] is added to flatten the atoms, thereby reducing the dominance of low frequency components in the atoms. This penalty term is given by:
where [W^{T}W](d,d) indicates the elements along the main diagonal of W^{T}W. This leads to the constrained NMF (CNMF) cost function:
where β is the weighting factor of the penalty term. The multiplicative update equations subsequently become:
where 1 represents a matrix of ones of the appropriate dimensions, ⊙ represents the Hadamard product and the division is element-wise. This constrained decomposition favors atoms with a flat spectrum. Figure 4 shows the histogram of SRP-NMF when using the CNMF decomposition, where it may be observed that the broadside bias is overcome and azimuths of all the sources are correctly estimated.
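A penalized update of this kind can be sketched as follows. The sketch rests on assumptions that should be read as such: the penalty is taken to be the sum of the diagonal elements of W^{T}W (i.e., the squared Frobenius norm of W), its gradient 2βW is added to the denominator of the usual KL update for W, and no atom normalization is applied. The exact update equations of the paper may differ, and the function name is hypothetical.

```python
import numpy as np


def cnmf_kl(V, n_atoms, beta=60.0, n_iter=200, seed=0, eps=1e-12):
    """KL-NMF with an assumed penalty beta * sum_d [W^T W](d, d) on the atoms."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps
    H = rng.random((n_atoms, n_frames)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Activations: standard KL multiplicative update (penalty does not involve H).
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        # Atoms: gradient of the assumed penalty, 2 * beta * W, enters the denominator.
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + 2.0 * beta * W + eps)
    return W, H


# Random non-negative stand-in for an average magnitude spectrogram (K x B).
rng = np.random.default_rng(2)
V = rng.random((64, 6)) @ rng.random((6, 40))
W_plain, H_plain = cnmf_kl(V, n_atoms=6, beta=0.0)   # beta = 0 reduces to plain KL-NMF
W_pen, H_pen = cnmf_kl(V, n_atoms=6, beta=60.0)
print(np.linalg.norm(W_plain), np.linalg.norm(W_pen))
```

With this form the penalty shrinks large atom entries more strongly than small ones, which is one way to reduce the dominance of the low-frequency components; whether it matches the flattening reported for Fig. 4 depends on details not reproduced here.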
Experimental evaluation
In this section, the performance of SRP-NMF is compared to the state-of-the-art localization approaches for closely spaced and widely spaced microphones. Since our approach is closely related to the SRP/GCC family of approaches (being, as it were, an intermediate between the broadband and narrowband versions of these), and because these are the typical, well-understood methods for source localization, these form the basis for our benchmark.
Specifically, we compare our approach to:

The NB-SRP-PHAT according to Section 4.1;

A subband variant of the above (termed Bark-SRP-PHAT), where the optimization function is averaged over subbands defined according to the Bark scale as in [24]; and

Four other best performing algorithms among a broad variety of localization algorithms benchmarked in [19] and implemented within the open source Multichannel BSSlocate toolbox [37].
For completeness, a brief summary of Bark-SRP-PHAT and the approaches from the Multichannel BSSlocate toolbox is given in Section 6.1.
Tests are conducted on four different databases (three of which are openly available) in order to evaluate the approaches across different microphone arrays (different spacing and configurations) as well as in different acoustic environments, from relatively dry (T_{60}≈130 ms) to highly reverberant (T_{60}≈660 ms). The evaluation setup is described in Section 6.2, followed by the details of the databases used. The evaluation metrics are described in Section 6.4 and the method adopted for choosing NMF parameters is presented in Section 6.5.
Further, Section 6.6 presents a comparison of the proposed SRP-NMF to a supervised approach wherein the underlying sources at each microphone are first separated using NMF, and localization is subsequently performed on the separated sources.
Section 6.7 presents the results of the benchmarking.
Brief summary of benchmarked approaches
Bark-SRP-PHAT
NB-SRP and BB-SRP are, respectively, fully narrowband or fully broadband approaches. However, SRP-NMF only averages the optimization function over a (source-dependent) subset of frequencies. Thus, we include a comparison with a modified SRP approach where the optimization function is averaged along subbands, where the subbands are the critical bands defined according to the Bark scale. A single localization estimate is computed for each critical band within a time frame. These estimates are then pooled across all time frames in a manner similar to the narrowband SRP-PHAT approach, to obtain the final localization result. This approach thus exploits available sparsity and disjointness in time and subbands. This scale was chosen because of its psychoacoustical relevance, as seen in previous localization research (e.g., [24]).
MVDRW approaches
The MVDRW approaches [19] use minimum variance distortionless response (MVDR) beamforming to estimate, for each frequency bin k and each time frame b, the signal to noise ratio (SNR) in all azimuth directions. Since the spatial characteristics of the sound field are taken into account, the SNR indicates, effectively, timefrequency bins where the direct path of a singlesource is dominant. The MVDRWsum variant averages the SNR across all timefrequency points and, subsequently, the DoA estimates are computed as the location of the peaks of this averaged SNR. When all sources are simultaneously active within the observation interval, this averaging is beneficial. However, when a source is active only for a few time frames, the averaging smooths out the estimate, thereby possibly not localizing the source. Hence [19] also proposes an alternative called MVDRWmax, where a max pooling of the SNR is performed over time.
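The difference between sum and max pooling over time can be illustrated with a toy angular spectrum. The sketch below is not the toolbox code: the function `pooled_doas`, the Gaussian-shaped spectra, and the chosen amplitudes are all assumptions made for illustration. One source is active in every frame, a second source is active in only 5 of 100 frames, and a weak spurious ridge persists throughout.

```python
import numpy as np

def pooled_doas(angular_spectrum, grid_deg, n_src, pooling="max"):
    """angular_spectrum: (frames x angles), e.g. a frequency-averaged SNR map."""
    if pooling == "sum":
        pooled = angular_spectrum.sum(axis=0)
    else:
        pooled = angular_spectrum.max(axis=0)
    # local maxima: strictly larger than both neighbours
    inner = pooled[1:-1]
    is_peak = (inner > pooled[:-2]) & (inner > pooled[2:])
    peak_idx = np.flatnonzero(is_peak) + 1
    # keep the n_src strongest peaks
    best = peak_idx[np.argsort(pooled[peak_idx])[::-1][:n_src]]
    return np.sort(grid_deg[best])

grid = np.arange(0.0, 180.1, 2.5)

def bump(center_deg, width_deg=5.0):
    return np.exp(-0.5 * ((grid - center_deg) / width_deg) ** 2)

spec = np.zeros((100, grid.size))
spec += 1.0 * bump(60.0) + 0.3 * bump(90.0)   # persistent source + weak spurious ridge
spec[:5] += 2.0 * bump(120.0)                 # source active in only 5 frames

doa_sum = pooled_doas(spec, grid, 2, "sum")   # picks 60 and 90 degrees
doa_max = pooled_doas(spec, grid, 2, "max")   # picks 60 and 120 degrees
```

In this toy case the briefly active source at 120° is found only with max pooling, because summing over time lets the persistent ridge outweigh it.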
GCCvariants
The two GCCvariants considered in [19] are the GCCNONLINsum and GCCNONLINmax. The key difference with the traditional GCC is the nonlinear weighting applied to compensate for the wide lobes of the GCC for closely spaced microphones [38]. In GCCNONLINsum and GCCNONLINmax, respectively, the sum and max pooling of the GCCPHAT, computed over the azimuthal space, is done across time.
As previously stated, these approaches were chosen for the benchmark because they have previously been demonstrated to be the best performing approaches among a broad variety of localization approaches. Further, since the implementation of these approaches is open source, it allows for a reproducible, fair benchmark against which new methods may be compared.
Evaluation setup
For all the experiments, the complexvalued shortterm Fourier spectra were generated from 16 kHz mixtures using a DFT size of 1024 samples (i.e., K=512) and a hop size of 512 samples. A periodic squareroot Hann window of size 1024 samples is used prior to computing the DFT.
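A minimal sketch of this analysis stage is given below, assuming plain NumPy and no padding at the signal edges; the function name `stft_sqrt_hann` is hypothetical. Note that `np.fft.rfft` returns 513 bins (DC to Nyquist) for a 1024-point DFT, so how the K=512 bins quoted above are selected from these is left open here.

```python
import numpy as np

def stft_sqrt_hann(x, n_fft=1024, hop=512):
    """One-sided STFT with a periodic square-root Hann analysis window."""
    n = np.arange(n_fft)
    window = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft))  # periodic Hann
    n_frames = 1 + (len(x) - n_fft) // hop       # no edge padding (assumption)
    frames = np.stack([x[b * hop: b * hop + n_fft] * window
                       for b in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T, window  # (bins x frames), window

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise at 16 kHz
X, w = stft_sqrt_hann(x)
```

With a hop of half the window length, the squared square-root Hann windows of adjacent frames sum to one, which is why this window is a common choice when the same window is reused for synthesis.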
The NMF parameters D_{T} and β are set to 55 and 60 respectively. These parameters are set based on preliminary experiments that are described in Section 6.5. The maximum number of NMF iterations is 200.
For all the approaches, the azimuth search space (0^{∘}−180^{∘}) was divided into a uniformly spaced grid with a 2.5^{∘} spacing between adjacent grid points. Further, in all cases, it is assumed that the number of speakers in the mixture is known.
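For illustration, the search grid and the corresponding candidate delays of a single microphone pair could be generated as below. This is a sketch under a far-field assumption with the azimuth measured from the array axis; the speed of sound (343 m/s) and the function names are assumptions, not values taken from the paper.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed)

def azimuth_grid(step_deg=2.5):
    """Uniform grid covering 0 to 180 degrees inclusive."""
    return np.arange(0.0, 180.0 + step_deg / 2, step_deg)

def pair_tdoa(grid_deg, spacing_m):
    """Far-field TDoA in seconds of one microphone pair for each candidate azimuth."""
    return spacing_m * np.cos(np.deg2rad(grid_deg)) / C_SOUND

grid = azimuth_grid()
tdoa = pair_tdoa(grid, 0.05)   # e.g. a 5 cm pair
```

A 2.5° step over this range yields 73 candidate directions.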
Data
The following four databases, covering a wide range of recording environments, are used for evaluations.
Signal Separation and Evaluation Campaign (SiSEC) [39]
The dev1 and dev2 development data of SiSEC, consisting of underdetermined stereo channel speech mixtures, is used. The mixtures are generated by adding live recordings of static sources played through loudspeakers in a meeting room (4.45m x 3.55m x 2.5m) and recorded one at a time by a pair of omnidirectional microphones. Two reverberation times of 130 ms and 250 ms are considered.
Two stereo arrays are used: one with an intermicrophone spacing of 5cm (SiSEC1) and the other with spacing of 1m (SiSEC2). The speakers are at a distance of 0.80m or 1.20m from the array, and at azimuths between 30^{∘} and 150^{∘} with respect to the array axis. The data thus collected consists of twenty 10 s long mixtures of 3 or 4 simultaneous speakers (either all male or all female). The ground truth values of DoAs are provided. They were further verified by applying the GCCNONLIN approach on the individual source images that are available in the data set.
Since the mixtures are generated by mixing live recordings from a real environment, they also contain measurement and background noise. Further, both closely spaced and widely spaced arrays can be evaluated in the same setting. This makes the SiSEC dataset ideal for the comparison of the various approaches.
Challenge on acoustic source LOCalization And TrAcking (LOCATA) [40]
LOCATA comprises multichannel recordings in a realworld closed environment setup. Among several tasks that this challenge offers, we consider Task1: localization of a single, static speaker using a static microphone array and Task2: localization of multiple static speakers using a static microphone array.
The data consists of simultaneous recordings of static sources. Sentences selected from the CSTR VCTK database [41] are played back through loudspeakers in a computer laboratory (dimensions: 7.1m x 9.8m x 3 m, T_{60}=550ms). These signals are recorded by a nonuniformly spaced linear array of 13 microphones [40]. In total, there are 6 mixtures of one to four speakers, and the mixtures are between 3 s and 7 s long. The ground truth values of the source locations are provided.
To evaluate different linear array configurations we consider 4 uniform subarrays: 3 mics with 4 cm intermicrophone spacing (LOCATA1), 3 mics with an 8 cm intermicrophone spacing (LOCATA2), 3 mics with 16 cm intermicrophone spacing (LOCATA3), and 5 mics with a 4 cm intermicrophone spacing (LOCATA4). This dataset is generated from live recordings in a highly reverberant room, which makes it interesting for benchmarking localization approaches.
Aachen MultiChannel Impulse Response Database (AACHEN) [42]
This is a database of impulse responses measured in a room with configurable reverberation levels. Three configurations are available, with respective T_{60}s of 160 ms, 360 ms and 610 ms. The measurements were carried out for several source positions for azimuths ranging from 0^{∘} to 180^{∘} in steps of 15^{∘} and at distances of 1 m and 2 m from the microphone array. Three different microphone array configurations are available.
For this paper, we choose the room configuration with T_{60}=610 ms. The impulse responses corresponding to sources placed at a distance of 2m from the 8 microphone uniform linear array with an intermicrophone spacing of 8 cm are selected. Multichannel speech signals are generated by convolving the selected impulse responses with dry speech signals. Fifty mixtures, each 5 s long, and from 3 speakers (randomly chosen from the TSP database [43]), placed randomly at 3 different azimuths with respect to the array axis are generated.
UGent MultiChannel Impulse Response Database (UGENT)
The impulse responses from the UGENT database were measured using exponential sine sweeps for azimuth angles varying from 15^{∘} to 175^{∘} with the source at a distance of 2m from the array. The recordings were conducted in a meeting room with a T_{60}≈660ms. The microphone array is a triangular array with the following microphone coordinates: (0m,0m,0m), (0.043m,0m,0m) and (0.022m, − 0.037m,0m). Fifty mixture files, each of 5 s duration, are generated with 3 speakers (randomly chosen from the TSP database) placed at random, different azimuths.
Except for the UGent database, all other databases are openly accessible.
Evaluation metrics
The evaluation measures chosen are a detection metric (Fmeasure) and a location accuracy metric (mean azimuth error  MAE). In a given dataset, let N be the total number of sources in all mixture files and N_{e} be the number of sources that are localized by an approach. The estimated source azimuths for each mixture are matched to the ground truth azimuths by greedy matching to ensure minimum azimuth error. If, after matching, the estimated source azimuth is within ±7.5^{∘} of the ground truth estimate then the source is said to be correctly estimated. Let N_{c} be the number of sources correctly localized for all mixtures. Then the Fmeasure is given by
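$$ \text{Fmeasure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}, $$
where Recall=N_{c}/N and Precision=N_{c}/N_{e}. The larger the number of correctly localized sources, the higher the Fmeasure.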
To quantify the localization accuracy, we present two error metrics: MAE and MAEfine. While MAE is the mean azimuth error between the estimated DoAs and true DoAs after greedy matching (irrespective of whether an approach managed to correctly localize all sources within the 7.5^{∘} tolerance), MAEfine is the mean error between the correctly estimated DoAs and true DoAs. Thus, while MAE gives location accuracy over all the sources in the mixture, MAEfine gives location accuracy of only the correctly detected sources. The former may, therefore, be seen as a global performance metric whereas the latter indicates a local performance criterion with respect to correctly detected sources.
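The three metrics can be computed along the following lines. This is a sketch, not the evaluation script used for the paper: the greedy matching is implemented as "repeatedly pair the closest remaining estimate and reference", which is one plausible reading of the description, and the function names are hypothetical.

```python
import numpy as np

def greedy_match(est, ref):
    """Repeatedly pair the closest remaining (estimate, reference) azimuths."""
    est, ref = list(est), list(ref)
    errors = []
    while est and ref:
        d = np.abs(np.subtract.outer(est, ref))
        i, j = np.unravel_index(np.argmin(d), d.shape)
        errors.append(float(d[i, j]))
        est.pop(i)
        ref.pop(j)
    return errors

def evaluate(mixtures, tol_deg=7.5):
    """mixtures: iterable of (estimated_azimuths, true_azimuths) pairs."""
    n_true = n_est = 0
    errors = []
    for est, ref in mixtures:
        n_true += len(ref)
        n_est += len(est)
        errors += greedy_match(est, ref)
    errors = np.array(errors)
    correct = errors <= tol_deg
    n_c = int(correct.sum())
    recall, precision = n_c / n_true, n_c / n_est
    f = 2 * recall * precision / (recall + precision) if n_c else 0.0
    mae_fine = errors[correct].mean() if n_c else float("nan")
    return {"F": f, "MAE": errors.mean(), "MAE_fine": mae_fine}

scores = evaluate([([31.0, 92.5, 150.0], [30.0, 90.0, 120.0])])
```

In this example two of three sources fall within the tolerance, so the Fmeasure is 2/3, MAE averages the errors 1°, 2.5° and 30°, and MAEfine averages only the first two.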
Selecting suitable NMF parameters
To obtain suitable values of the flattening penalty term β and the dictionary size D_{T}, the localization performance of SRPNMF is evaluated on a small dataset over a range of β and D_{T}.
Table 1 shows the Fmeasure obtained by SRPNMF on SiSEC1 data for β varying from 0 to 80 and D_{T} from 15 to 55. It may be seen that with β fixed, as the dictionary size increases, the localization performance initially improves and later saturates. A similar trend is observed when D_{T} is fixed and β is increased. The pairs of β and D_{T} that yield an Fmeasure ≥0.95 (in bold) have similar performance and can be chosen as the NMF parameters. While a lower D_{T} leads to less computational complexity, a lower β leads to a lower residual error in the NMF approximation (i.e., a better approximation of the magnitude spectrum). Therefore, among various combinations of β and D_{T} that yield a comparable Fscore, a lower β (such as 30) and lower D_{T} (such as 25) are preferred. However, we choose slightly higher parameter values to ensure robust performance and to allow generalization to other datasets with possibly more reverberation and/or noise. Hence, in the subsequent experiments, the values of β and D_{T} are set to 60 and 55 respectively.
The trends in Table 1 are illustrated in Figs. 5 and 6 for a mixture of 4 concurrent speakers. Figure 5 depicts the histogram plots of SRPNMF with β ranging between 0 and 80 and D_{T}=35. It is evident from the figure that when β=0, the peaks further away from the broadside direction are not prominent. The reason for this was explained in Section 5.2.1. As β increases, the peaks become increasingly prominent and can be easily detected.
Figure 6 presents the effect of varying D_{T} on the SRPNMF outcome. Here, β is fixed at 60 and D_{T} increases from 5 to 55. It may be seen that as the dictionary size increases, the histogram peaks become increasingly distinct.
Experiment with supervised separation and localization
The basic idea for the proposed approach has its roots in the successful use of NMF for supervised source separation. Hence, we compare, here, the performance of SRPNMF against a supervised variant where the microphone signals are first decomposed into their underlying sources using NMF [44] and the localization is then performed on the separated sources using the broadband SRP approach. This approach is termed SNMFSRP, and is implemented as follows:

1. First, for any test case, the magnitude spectrum S_{q} of each individual source q in the mixture is decomposed using constrained NMF. This results in the basis function matrix \(\mathbf {W}_{q} \in \mathbb {R}_{+}^{K \times D_{q}}\) for that source, where D_{q} is the number of atoms for source q. We assume that the number of atoms is the same for all sources, i.e., D_{q}=D ∀q. The basis functions for all sources are then concatenated into a matrix W as:
$$ \mathbf{W} = \left[ \mathbf{W}_{1}, \mathbf{W}_{2}, \ldots, \mathbf{W}_{Q} \right] \in \mathbb{R}_{+}^{K \times QD} $$ (23)
2. NMF is next used to decompose the magnitude spectrogram of the mixture at any one reference microphone m as X_{m}≈WH. In this step, W is kept fixed and only the activation matrix H is adapted. This matrix can then be partitioned into the activations of the individual sources as:
$$ \mathbf{H} = \left[ \mathbf{H}_{1}^{T}, \mathbf{H}_{2}^{T}, \ldots, \mathbf{H}_{Q}^{T} \right]^{T} \in \mathbb{R}_{+}^{QD \times B}, $$ (24)
where B is the total number of frames in the mixture spectrogram.
3. The spectral magnitude estimates for each source can then be obtained as: \(\widehat {\mathbf {S}}_{q} = \mathbf {W}_{q}\mathbf {H}_{q}\). These estimates are used to define binary masks for each source, whereby each TF point is allocated to the source with the maximum contribution (i.e., the dominant source) at that TFpoint.
4. The binary masks belonging to each source are, finally, applied to the complex mixture spectrograms at all microphones, and the broadband SRPPHAT approach is used to obtain the source location estimate.
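Steps 2 and 3 above can be sketched as follows. This is only an illustration under stated assumptions: plain Euclidean multiplicative updates stand in for the constrained NMF used in the paper, the per-source dictionaries are taken as given, and the function names are hypothetical. The toy data uses two sources whose atoms occupy disjoint frequency bins so that the expected masks are obvious.

```python
import numpy as np

def activations_fixed_dictionary(X, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate H >= 0 in X ~ W H with W held fixed (Euclidean multiplicative updates)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return H

def binary_masks(X, W_list, n_iter=200):
    """Return one (K x B) boolean mask per source from per-source dictionaries."""
    W = np.hstack(W_list)                               # K x QD, as in Eq. (23)
    H = activations_fixed_dictionary(X, W, n_iter)
    splits = np.cumsum([w.shape[1] for w in W_list])[:-1]
    H_list = np.split(H, splits, axis=0)                # per-source rows of H, Eq. (24)
    S_hat = np.stack([w @ h for w, h in zip(W_list, H_list)])  # Q x K x B
    dominant = S_hat.argmax(axis=0)                     # winner-takes-all per TF point
    return [dominant == q for q in range(len(W_list))]

# Toy data: source 1 lives in bins 0-3, source 2 in bins 4-7.
rng = np.random.default_rng(1)
W1 = np.zeros((8, 2)); W1[:4] = rng.random((4, 2)) + 0.1
W2 = np.zeros((8, 2)); W2[4:] = rng.random((4, 2)) + 0.1
X = W1 @ (rng.random((2, 6)) + 0.1) + W2 @ (rng.random((2, 6)) + 0.1)
masks = binary_masks(X, [W1, W2])
```

Each mask would then be multiplied onto the complex mixture spectrogram of every microphone before running the broadband SRPPHAT search for that source.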
Since SNMFSRP first separates the sources before localizing them, the interference from the other sources is minimized in the localization. Further, a binary mask attributes a timefrequency point (k,b) to only the dominant source at that point. Due to this “winnertakesall” strategy, only the dominant source components are preserved at each timefrequency point. Consequently, the effect of the interference on the SRPPHAT function is further reduced, resulting in more accurate DoA estimates as compared to when continuous masks are used. This experiment with oracle knowledge of the underlying sources should, therefore, give a good indication of the possible upper bound of our proposed approach.
We note that an alternative to the SNMFSRP would be unsupervised NMFbased separation approaches. Such approaches may be seen as comprising the following two steps: (a) decomposing the mixture spectrum into basis functions and their corresponding activations, and, (b) grouping (clustering) the basis functions according to the sources they belong to, to generate the separated source signals. Usually, some additional signal knowledge or signal model needs to be incorporated into the approach to perform the clustering and the quality of the source separation is, consequently, dependent on the kind of clustering approach. Typically, these steps are not performed independently, and the clustering model is often incorporated (explicitly or implicitly) as a set of additional constraints in the decomposition step. If one neglects the additional step (and associated effort) of grouping the basis components and simply uses the obtained basis functions as a weighting within the SRPPHAT approach, then there is no conceptual difference between our proposed approach and the use of unsupervised NMFbased separation followed by localization.
Experimental setup
We compare, first, the SNMFSRP and SRPNMF. For this purpose, fifty mixtures, each 5 s long and comprising 3 sound sources at randomly chosen azimuths, ranging from 15^{∘} to 175^{∘}, are generated using room impulse responses from the AACHEN database. The responses corresponding to the room configuration with T_{60}=610 ms are used. Two arrays are considered: the 8 microphone uniform linear array with 8 cm intermicrophone spacing, and a 4microphone uniform linear subarray with 4 cm intermicrophone spacing (this is part of a larger 8mic array with intermicrophone spacings of 4-4-4-8-4-4-4 cm). The position of the speakers was also randomly chosen for each test file. The optimal dictionary size and weighting factor for the SNMFSRP approach are first determined in a manner similar to that for SRPNMF, and using data from the 3mic subarray with intermicrophone spacing of 8 cm.
Dictionary sizes D_{SNMFSRP} of 50, 90, and 130 and weighting factors β_{SNMFSRP} of 0, 20, and 40 are evaluated. The Fmeasure and MAE obtained for each case are reported in Table 2, from where it is observed that a dictionary size D_{SNMFSRP} of 130 and β_{SNMFSRP} of 20 give the best results in terms of the chosen metrics. These are consequently fixed for the subsequent evaluation of the SNMFSRP approach.
Figures 7 and 8 depict the performance of SNMFSRP compared to SRPNMF.
Since we can expect the best localization performance in the absence of reverberation and interfering sources, we simulate this case as well and include it in the comparison (this is termed directpath (DP) singlesourceSRPPHAT). To obtain this result, each source in the mixture is individually simulated at the arrays. Further, for generating the source image, the room impulse response is limited to only the filter taps corresponding to the direct path and 20 ms of early reflections. Then, a DoA estimate is obtained by the broadband SRPPHAT. This corresponds to the localization of a single source in the near absence of reverberation and noise and, thus forms a further performance upper bound for all the approaches.
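The truncation of the impulse responses for the DP case can be sketched as below. The sketch assumes that the direct path is the strongest tap of the measured response, which is not stated in the text, and the function name `truncate_rir` is hypothetical.

```python
import numpy as np

def truncate_rir(h, fs, early_ms=20.0):
    """Keep the taps up to the direct-path peak plus `early_ms` of early reflections."""
    direct = int(np.argmax(np.abs(h)))   # ASSUMPTION: direct path is the strongest tap
    keep = direct + int(round(early_ms * 1e-3 * fs)) + 1
    out = np.zeros_like(h)
    out[:keep] = h[:keep]
    return out

# Toy response: direct path at tap 100, one early and one late reflection.
h = np.zeros(8000)
h[100], h[300], h[1000] = 1.0, 0.5, 0.4
h_dp = truncate_rir(h, fs=16000)
```

At 16 kHz, 20 ms corresponds to 320 taps, so the reflection at tap 300 is retained while the one at tap 1000 is removed.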
The figures show that, especially for a smaller number of microphones and lower intermicrophone spacing, the supervised SNMFSRP approach is significantly better than the proposed unsupervised SRPNMF. The SRPNMF has the lowest Fmeasure and the largest MAE. This indicates that incorporating the knowledge of the underlying sources may be beneficial when the spatial diversity is limited and cannot be fully exploited. As the spatial diversity increases, the performance of the unsupervised method begins to converge to that of the supervised approach. As expected, the performance of both these approaches is upper bounded by the DPsinglesource SRPPHAT approach.
Results and discussion
The benchmarking results, in terms of Fmeasure and mean azimuth errors, for the various datasets are plotted in Fig. 9. We start with the MAEfine metric, which focusses on the average localization error for sources that have been correctly localized. The chosen margin for a correct localization implies that the MAEfine is necessarily ≤7.5^{∘}. Figure 9 further indicates that the MAEfine metric is comparable among all the approaches, with a difference of only about 1 deg or less (except for the GCCNonLinsum and MVDRWsum of LOCATA1 and MVDRWmax of LOCATA2, where it is slightly higher). Thus, we may not claim, categorically, that any particular approach is better than the other in terms of this metric. More indicative metrics for the performance of any approach would be the MAE and Fmeasure, which are discussed next.
NBSRPPHAT localizes well with closely spaced stereo microphones and its performance deteriorates with larger intermicrophone spacing due to spatial aliasing. This is clearly seen from the SiSEC results, where its performance is better in SiSEC1 (5 cm spacing) than in SiSEC2 (1 m spacing). Furthermore, in the case of multiple microphones, it performs poorly in LOCATA1 and UGENT. The reason for the poor performance may be explained as follows: both LOCATA1 and UGENT have only 3 microphones that are very closely spaced (≈4 cm apart) and high reverberation (T_{60}≈600 ms). We hypothesize that the TF bins in which noise or reverberant components are dominant are allocated to spurious locations and, since NBSRPPHAT pools the decisions per TF bin, these spurious locations mask the source locations in the histogram. This behavior is worse in closely spaced arrays, as the beam pattern of the SRP optimization function has wide main lobes. Increasing the microphone separation or the number of microphones narrows the main lobes, thus improving the performance, as is evident in LOCATA2/3 and LOCATA4 respectively.
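As a rough guide to why the two SiSEC arrays behave so differently, the frequency above which a single microphone pair becomes ambiguous for some direction follows from the half-wavelength condition d < λ/2. The sketch below assumes a speed of sound of 343 m/s; the function name is hypothetical.

```python
C_SOUND = 343.0  # speed of sound in m/s (assumed)

def alias_free_limit_hz(spacing_m, c=C_SOUND):
    """Highest frequency for which spacing < wavelength / 2 holds."""
    return c / (2.0 * spacing_m)

limit_sisec1 = alias_free_limit_hz(0.05)  # 5 cm pair
limit_sisec2 = alias_free_limit_hz(1.0)   # 1 m pair
```

Under this assumption the 5 cm pair is unambiguous up to about 3.4 kHz, whereas the 1 m pair is unambiguous only below about 170 Hz, so nearly the whole speech band is aliased for the latter.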
Among the GCCNONLIN approaches, max pooling performs better than sum pooling, which verifies the conclusions in [19]. Further, due to the nonlinearity introduced to improve the performance in microphone arrays with short intermicrophone spacing (cf. Section 6.1), the GCCNONLINmax performs reasonably well in almost all datasets and microphone configurations.
Between the MVDRW methods, max and sum pooling give similar results for the smaller array of SiSEC1. In SiSEC2, sum pooling is superior, which is consistent with [19]. However, for a larger number of microphones max pooling performs better in all microphone configurations. In LOCATA1 and UGENT, though the beampattern of MVDR has wide lobes due to closely spaced microphones, the performance of the MVDRWbased approaches is better than that of NBSRPPHAT. We reason that this is because the MVDRW approaches factor in the sound field characteristics and introduce a frequency weighting that emphasizes the timefrequency bins that are dominated by the direct sound of a single source (cf. Section 6.1).
Figure 9 also indicates that SRPNMF performs consistently well across the various databases. In terms of MAE and Fmeasure, the scores of SRPNMF are among the top two for each tested case. The atom weighting highlights timefrequency bins consisting of information relating to a single source, similar to SNR weighting, thus exploiting timeatom sparsity and leading to superior performance in short arrays. In large arrays, averaging the optimization function across the frequency axis ensures robustness to spatial aliasing, thus leading to good performance. Further, the performance of SRPNMF is consistently better than (or comparable to) that of the BarkSRPPHAT, indicating the benefit of the datadependent weighted frequency averaging, as compared to a fixed frequency averaging.
Lastly, we also include a comparison with the SNMFSRP (cf. Section 6.6) for the AACHEN and UGENT data. It may be seen, then, that this supervised approach outperforms all the other unsupervised approaches—which is expected, based on the results in Section 6.6 and Figs. 7 and 8. We note that since SNMFSRP is based on the availability of the underlying source signals, it could not be applied to the LOCATA data, where this information is not consistently available. Further, we chose not to report performance metrics of this approach on the SiSEC data, since all approaches perform well in this case, and the performance of SNMFSRP would add no value in a comparative analysis of the performances.
While the evaluation conclusively demonstrates the benefit of the proposed SRPNMF approach, this comes at the cost of increased computational complexity. Its complexity is more than that of NBSRPPHAT and depends on the number of active atoms per frame. Further, we empirically observe that SRPNMF gives good DoA estimates if the data segments are long (>3s). We hypothesize, consequently, that the NMF dictionary atoms extracted from short segments may not be accurate. Therefore, in the current form, SRPNMF is not suitable for realtime applications. However, with pretrained dictionaries, the requirement of long data segments can be relaxed and SRPNMF can be explored for realtime localization.
In order to better appreciate the benefits of the SRPNMF approach, a graphical comparison of SRPPHAT and SRPNMF is presented in Figs. 10 and 11. These depict the histogram plots obtained by SRPPHAT and SRPNMF on a realroom mixture consisting of 4 concurrent speakers. Note that SRPNMF clearly indicates the presence of the 4 sources, whereas the histogram of the SRPPHAT approach (Fig. 10) does not present clear evidence of all 4 sources. The histogram plot in Fig. 11 can be further improved if subsampling is performed. Subsampling is an approach borrowed from Word Embedding in the field of NLP. Based on the observation that words with high frequency of occurrence do not contribute as much information as the words that occur more rarely, the frequent words are subsampled [45] to counter the imbalance between the frequent and rare words. In a similar manner, in the histogram of estimated DoAs, to counter the imbalances between frequent and occasional DoA estimates (e.g., due to a speaker being only active for a short while), the frequently occurring DoAs are subsampled after crossing a certain threshold. The subsampled version of Fig. 11 is shown in Fig. 12, where the benefit of subsampling is clearly visible.
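The exact subsampling rule is not given in the text. One possible realization, borrowing the square-root rule from word-embedding subsampling and applying it to the expected histogram counts, is sketched below; the threshold value, the use of relative frequencies, and the function name are all assumptions.

```python
import numpy as np

def subsample_histogram(counts, threshold=0.05):
    """Shrink histogram bins whose relative frequency exceeds `threshold`.

    ASSUMPTION: bins above the threshold are scaled by sqrt(threshold / frequency),
    mirroring the keep probability used for frequent words in word2vec.
    """
    counts = np.asarray(counts, dtype=float)
    freq = counts / counts.sum()
    keep = np.ones_like(freq)
    frequent = freq > threshold
    keep[frequent] = np.sqrt(threshold / freq[frequent])
    return counts * keep

raw = np.array([80.0, 15.0, 5.0])   # one dominant, one moderate, one rare DoA
balanced = subsample_histogram(raw)
```

In this toy histogram the ratio between the dominant and the rare DoA bin drops from 16 to 4, which makes the peak of a briefly active speaker easier to detect.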
Conclusions
SRPNMF is a localization approach that uses the NMF atoms of the underlying sources to obtain a broadband localization estimate for each atom. By exploiting the sparsity of the sources in the timeatom domain, this still allows for the simultaneous localization of multiple sources in a time frame. Thereby the proposed approach combines the benefits of standard broadband and narrowband localization approaches. It can, therefore, be used with compact and large array configurations. Compared to the stateoftheart narrowband and broadband approaches on data collected in natural room acoustic environments, and with various microphone configurations, the proposed approach can reliably localize the active sources in all cases, and with a comparable or lower localization error. The use of such an NMFbased decomposition and subsequent frequency grouping can be seamlessly extended in a variety of ways. For example, it can be combined with extant methods that improve the robustness of localization approaches to noise (e.g., in combination with the SNR weighting of the MVDRbased approaches), or it can be combined with a priori knowledge in the form of speakerspecific NMF atoms to localize only a specific speaker in the mix. It may also be modified for realtime applications with prelearned universal NMF dictionary and online estimation of activation coefficients. We intend to address these extensions in future work.
Availability of data and materials
SiSEC1 and SiSEC2 are publicly available at: https://sisec.inria.fr/sisec2016/2016underdeterminedspeechandmusicmixtures
LOCATA data is available at https://www.locata.lms.tf.fau.de/
AACHEN impulse responses are at http://www.iks.rwthaachen.de/en/research/toolsdownloads/databases/multichannelimpulseresponsedatabase/
UGENT MultiChannel Impulse Response Database is not publicly available but is available from the last author on reasonable request.
Details of the speech database used in the evaluations may be found in [43].
Abbreviations
NMF: Nonnegative matrix factorization
SRP: Steeredresponse power
TF: Timefrequency
STFT: Shorttime Fourier transform
GCC: Generalized crosscorrelation
AMDF: Average magnitude difference function
PHAT: Phase transform
DP: Directpath
SNR: Signal to noise ratio
TDoA: Timedifference of arrival
DoA: Direction of arrival
IPD: Intermicrophone phase difference
NBSRP: Narrowband SRP
BBSRP: Broadband SRP
CNMF: Constrained NMF
SNMF: Supervised NMF
MoG: Mixture of Gaussians
MBSS: Multichannel BSS locate
SiSEC: Signal separation and evaluation campaign
LOCATA: Localization and tracking
MAE: Mean azimuth error
MVDR: Minimum variance distortionless response
MVDRW: Weighted MVDR
NLP: Natural language processing
AACHEN: RWTH Aachen University
UGENT: Ghent University
References
 1
S. Rickard, O. Yilmaz, in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1. On the approximate Wdisjoint orthogonality of speech, (2002), pp. 529–532. https://doi.org/10.1109/ICASSP.2002.5743771.
 2
C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Proc. (TASSP). 24(4), 320–327 (1976).
 3
G. Jacovitti, G. Scarano, Discrete time techniques for time delay estimation. IEEE Trans. Signal Proc. (TSP). 41(2), 525–533 (1993).
 4
J. Benesty, Adaptive eigenvalue decomposition algorithm for passive acoustic source localization. J. Acoust. Soc. Am.107(1), 384–391 (2000).
 5
F. Talantzis, A. G. Constantinides, L. C. Polymenakos, Estimation of direction of arrival using information theory. IEEE Signal Proc. Lett.12, 561–564 (2005).
 6
J. DiBiase, H. F. Silverman, M. S. Brandstein, in Microphone arrays: signal processing techniques and applications, ed. by M. Brandstein, D. Ward. Robust localization in reverberant rooms (SpringerNew York, 2001), pp. 157–180.
 7
C. Zhang, D. Florencio, Z. Zhang, in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Why does PHAT work well in lownoise, reverberative environments? (2008), pp. 2565–2568.
 8
J. Valin, F. Michaud, J. Rouat, D. Letourneau, in Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), 2. Robust sound source localization using a microphone array on a mobile robot, (2003), pp. 1228–1233.
 9
Y. Rui, D. Florencio, in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2. Time delay estimation in the presence of correlated noise and reverberation, (2004), p. 133.
 10
H. Kang, M. Graczyk, J. Skoglund, in 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). On prefiltering strategies for the GCCPHAT algorithm, (2016), pp. 1–5.
 11
Z. Wang, X. Zhang, D. Wang, Robust speaker localization guided by deep learningbased timefrequency masking. IEEE Trans. Audio Speech Lang. Process. (TASLP). 27(1), 178–188 (2019).
 12
J. M. PerezLorenzo, R. VicianaAbad, P. RecheLopez, F. Rivas, J. Escolano, Evaluation of generalized crosscorrelation methods for direction of arrival estimation using two microphones in real environments. Appl. Acoust.73(8), 698–712 (2012).
 13
B. Loesch, B. Yang, in IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). Source number estimation and clustering for underdetermined blind source separation, (2008), pp. 1–4.
 14
M. I. Mandel, D. P. W. Ellis, T. Jebara, in Proceedings of the Annual Conference on Neural Information Processing Systems. An EM algorithm for localizing multiple sound sources in reverberant environments, (2006), pp. 953–960.
 15
O. Schwartz, S. Gannot, Speaker tracking using recursive EM algorithms. IEEE Trans. Audio Speech Lang. Process. (TASLP). 22(2), 392–402 (2014).
 16
N. Madhu, R. Martin, in IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). A scalable framework for multiple speaker localization and tracking, (2008), pp. 1–4.
 17
M. Cobos, J. J. Lopez, D. Martinez, Twomicrophone multispeaker localization based on a Laplacian mixture model. Digit. Signal Process.21(1), 66–76 (2011).
 18
M. Swartling, B. Sällberg, N. Grbić, Source localization for multiple speech sources using low complexity nonparametric source separation and clustering. Signal Process.91(8), 1781–1788 (2011).
 19
C. Blandin, A. Ozerov, E. Vincent, Multisource TDOA estimation in reverberant audio using angular spectra and clustering. Signal Process.92(8), 1950–1960 (2012).
 20
P. Pertilä, Online blind speech separation using multiple acoustic speaker tracking and timefrequency masking. Comput. Speech Lang.27(3), 683–702 (2013).
 21
E. Hadad, S. Gannot, in 2018 IEEE International Conference on the Science of Electrical Engineering in Israel (ICSEE). Multispeaker direction of arrival estimation using SRPPHAT algorithm with a weighted histogram, (2018), pp. 1–5.
 22
N. Madhu, R. Martin, in Advances in digital speech transmission, ed. by R. Martin, U. Heute, and C. Antweiler. Acoustic source localization with microphone arrays (John Wiley & Sons, Ltd.New York, USA, 2008), pp. 135–170.
 23
D. Bechler, K. Kroschel, in IEEE International Workshop on Acoustic Signal Enhancement (IWAENC). Considering the second peak in the GCC function for multisource TDOA estimation with a microphone array, (2003), pp. 315–318.
 24
C. Faller, J. Merimaa, Source localization in complex listening situations: selection of binaural cues based on interaural coherence. J. Acoust. Soc. Am.116(5), 3075–3089 (2004).
 25
M. Togami, T. Sumiyoshi, A. Amano, in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP), 1. Stepwise phase difference restoration method for sound source localization using multiple microphone pairs, (2007), pp. 117–120.
 26
M. Togami, A. Amano, T. Sumiyoshi, Y. Obuchi, in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOA estimation method based on sparseness of speech sources for human symbiotic robots, (2009), pp. 3693–3696.
 27
J. Traa, P. Smaragdis, N. D. Stein, D. Wingate, in 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. Directional NMF for joint source localization and separation, (2015), pp. 1–5.
 28
H. Kayser, J. Anemüller, K. Adiloğlu, in 2014 IEEE 8th Sensor Array and Multichannel Signal Processing Workshop (SAM). Estimation of interchannel phase differences using nonnegative matrix factorization, (2014), pp. 77–80.
 29
A. MuñozMontoro, V. MontielZafra, J. CarabiasOrti, J. TorreCruz, F. CanadasQuesada, P. VeraCandeas, in Proceedings of the International Congress on Acoustics (ICA). Source localization using a spatial kernel based covariance model and supervised complex nonnegative matrix factorization, (2019), pp. 3321–3328.
 30
S. U. N. Wood, J. Rouat, S. Dupont, G. Pironkov, Blind speech separation and enhancement with GCC-NMF. IEEE Trans. Audio Speech Lang. Process. (TASLP). 25(4), 745–755 (2017).
 31
J. DiBiase, A high-accuracy, low-latency technique for talker localization in reverberant environments. Ph.D. dissertation (Brown University, Providence, RI, USA, 2000).
 32
T. Virtanen, J. F. Gemmeke, B. Raj, P. Smaragdis, Compositional models for audio processing: uncovering the structure of sound mixtures. IEEE Signal Process. Mag. 32, 125–144 (2015).
 33
J. Tchorz, B. Kollmeier, SNR estimation based on amplitude modulation analysis with applications to noise suppression. IEEE Trans. Speech Audio Process. (TSAP). 11(3), 184–192 (2003).
 34
S. Elshamy, N. Madhu, W. Tirry, T. Fingscheidt, Instantaneous a priori SNR estimation by cepstral excitation manipulation. IEEE Trans. Audio Speech Lang. Process. (TASLP). 25(8), 1592–1605 (2017).
 35
D. D. Lee, H. S. Seung, in Advances in Neural Information Processing Systems 13, ed. by T. K. Leen, T. G. Dietterich, and V. Tresp. Algorithms for nonnegative matrix factorization, (2001), pp. 556–562.
 36
V. P. Pauca, J. Piper, R. J. Plemmons, Nonnegative matrix factorization for spectral data analysis. Linear Algebra Appl. 416(1), 29–47 (2006). Special Issue devoted to the Haifa 2005 conference on matrix theory.
 37
R. Lebarbenchon, E. Camberlein, MultiChannel BSS Locate (2018). https://bassdb.gforge.inria.fr/bsslocate/bsslocate. Accessed April 2020.
 38
B. Loesch, B. Yang, in 9th International Conference on Latent variable analysis and signal separation (LVA/ICA). Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions (Springer, Berlin, Heidelberg, 2010), pp. 41–48.
 39
N. Ono, Z. Koldovský, S. Miyabe, N. Ito, in 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). The 2013 signal separation evaluation campaign, (2013), pp. 1–6.
 40
H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, W. Kellermann, in 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM). The LOCATA challenge data corpus for acoustic source localization and tracking, (2018), pp. 410–414.
 41
C. Veaux, J. Yamagishi, K. MacDonald, English Multispeaker Corpus for CSTR Voice Cloning Toolkit (University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2019). https://doi.org/10.7488/ds/2645.
 42
Multichannel impulse response database. https://www.iks.rwthaachen.de/en/research/toolsdownloads/databases/multichannelimpulseresponsedatabase/. Accessed December 2020.
 43
P. Kabal, TSP speech database. Technical report (Telecommunications and Signal Processing Laboratory, McGill University, Canada, 2002).
 44
C. Févotte, E. Vincent, A. Ozerov, in Audio Source Separation, ed. by S. Makino. Single-channel audio source separation with NMF: divergences, constraints and algorithms (Springer, Cham, 2018), pp. 1–24.
 45
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, in Advances in neural information processing systems 26, ed. by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Distributed representations of words and phrases and their compositionality, (2013), pp. 3111–3119.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
NM and ST conceptualized SRP-NMF. ST implemented the approaches, introduced improvements, and conducted the experiments under the supervision of NM and SVG. NM and ST were involved in the writing. All the authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors state that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Thakallapalli, S., Gangashetty, S.V. & Madhu, N. NMF-weighted SRP for multispeaker direction of arrival estimation: robustness to spatial aliasing while exploiting sparsity in the atom-time domain. J AUDIO SPEECH MUSIC PROC. 2021, 13 (2021). https://doi.org/10.1186/s1363602100201y
Keywords
 Sound source localization
 Direction-of-arrival
 Nonnegative matrix factorization
 Spatial aliasing
 Speech sparsity