Direction-of-arrival and power spectral density estimation using a single directional microphone and group-sparse optimization

In this paper, two approaches are proposed for estimating the direction of arrival (DOA) and power spectral density (PSD) of stationary point sources by using a single, rotating, directional microphone. These approaches are based on a method previously presented by the authors, in which point source DOAs were estimated by using a broadband signal model and solving a group-sparse optimization problem, where the number of observations made by the rotating directional microphone can be lower than the number of candidate DOAs in an angular grid. The DOA estimation is followed by the estimation of the sources' PSDs through the solution of an overdetermined least squares problem. The first approach proposed in this paper includes the use of an additional nonnegativity constraint on the residual noise term when solving the group-sparse optimization problem and is referred to as the Group Lasso Least Squares (GL-LS) approach. The second proposed approach, in addition to the new nonnegativity constraint, employs a narrowband signal model when building the linear system of equations used for formulating the group-sparse optimization problem, where the DOAs and PSDs can be jointly estimated by iterative, group-wise reweighting. This is referred to as the Group Lasso with $\ell_1$-reweighting (GL-L1) approach. Both proposed approaches are implemented using the alternating direction method of multipliers (ADMM), and their performance is evaluated through simulations in which different setup conditions are considered, ranging from different types of model mismatch to variations in the acoustic scene and microphone directivity pattern. The results obtained show that in a scenario involving a microphone response mismatch between observed data and the signal model used, having the additional nonnegativity constraint on the residual noise can improve the DOA estimation for the case of GL-LS and the PSD estimation for the case of GL-L1. Moreover, the GL-L1 approach can present an advantage over GL-LS in terms of DOA estimation performance in scenarios with low SNR or where multiple sources are closely located to each other. Finally, it is shown that having the least squares PSD re-estimation step is beneficial in most scenarios, such that GL-LS outperformed GL-L1 in terms of PSD estimation errors.


Introduction
In the field of audio signal processing, the ability to exploit spectral and spatial information from the auditory scene plays an important role in developing speech enhancement, noise reduction, and scene analysis techniques [1–8]. The applications in which these tasks have great influence are numerous, with some examples being binaural processing for hearing aids, hands-free telephony and videoconferencing, acoustic surveillance, and autonomous robots. By localizing target and interfering sound sources in space and estimating their power spectral densities (PSDs), one can distinguish them and process the recorded signals using the appropriate noise reduction and source classification methods.
Sound source localization is performed by estimating the direction of arrival (DOA) of the signals being recorded. Pioneering methods such as Capon's beamformer [9], the MUSIC algorithm [10], and generalized correlation-based methods [11,12] are still used for the task of DOA estimation and continue to be modified for improved performance. Alternatively, different approaches have also been developed. Compressed sensing techniques have become more popular [13–16], with the focus on exploiting sparsity in the signal models considered. Moreover, with the growth of machine learning, data-driven methods have naturally gained more attention as well [17,18].
Although much has been achieved through the studies presented above, some challenges remain. When estimating DOAs, most methods rely on having recordings from multiple microphones, such that their spatial diversity is used for inferring a source's location [36]. However, it is well known that in practice, constraints on hardware design, computational complexity, and simultaneous access to multiple microphones' data, often encountered in different devices, may limit the ideal advantages of microphone array processing techniques [37–39]. In this paper, we propose two alternative approaches for DOA and PSD estimation using a single, rotating, directional microphone. By exploring the potential of a single-microphone setup, this study aims not only to provide a single-channel solution that can be more easily adapted to diverse applications, but also to establish a foundational framework for potential multi-channel extensions that exploit principles similar to those considered in the development of the proposed approaches.
In the literature, single-channel DOA estimation methods have already been proposed, with some examples including the use of machine learning [40], a circularly moving microphone that exploits the Doppler effect [41], and a time delay-based subspace approach with a single hydrophone [42]. Our previous work [43] introduced the concept of using a single, rotating directional microphone for performing DOA and PSD estimation. The proposed method involved capturing spatially static and localized sound sources with a cardioid microphone, oriented towards different directions for different observation frames, so that changes in the microphone signal power could be analyzed for determining spatial information about the sources generating the observed sound field. By solving a group-sparsity constrained optimization problem while using a broadband signal model and an overcomplete angular dictionary of possible candidate DOAs for the point sources, DOA estimates could be obtained and used to estimate the localized point sources' PSDs, by solving an overdetermined least squares problem with a nonnegativity constraint for each frequency bin separately. Group sparsity has previously been exploited in multi-channel DOA estimation methods, such as in [44], where a covariance matrix estimated from signals captured at multiple co-prime arrays is modeled based on the corresponding steering vectors and point source PSDs. Moreover, estimating PSDs through the solution of a least squares problem has also been performed in [45], in which multiple beamformers are applied to microphone array signals and their outputs are used as the proposed method's observed data. Nonetheless, when employing the mentioned multi-channel DOA and PSD estimation methods, it becomes necessary to assess the performance constraints associated with the microphone array configuration available in the considered scenario. Factors that could potentially influence these limitations include the spacing between microphones and the overall array geometry [36,46].
In this paper, the method proposed in [43] is extended in two aspects, resulting in two alternative approaches. First, in order to add robustness in scenarios with a mismatch between the signal model and the generated data, we propose to use an additional nonnegativity constraint on the residual noise term when solving the group-sparse optimization problem. We refer to this as the Group Lasso Least Squares (GL-LS) approach. Note that GL-LS is still based on the broadband signal model presented in [43]. Second, in addition to the new nonnegativity constraint, we also propose an optimization problem based on a narrowband signal model, where the DOAs and PSDs can be jointly estimated by iterative, group-wise reweighting. This is referred to as the Group Lasso with $\ell_1$-reweighting (GL-L1) approach. An efficient implementation of both approaches using the alternating direction method of multipliers (ADMM) is presented, as opposed to the use of CVX [47] in our previous work [43].
In order to evaluate the performance of the proposed approaches, a series of simulations subject to different types of model mismatch are performed, by introducing off-grid DOAs, non-ideal microphone responses, and reverberation for the case of a single point source. We also analyze scenarios involving two point sources of different broadband power and varying angular separation between them. Finally, we consider the case of three point sources when using higher-order directivity patterns. These simulations greatly extend the aspects considered for evaluating the proposed approaches in comparison with those presented in our previous work [43]. We show that, in most cases, having the PSD re-estimation step with least squares as performed in GL-LS is beneficial, but also that redesigning the system of equations in a frequency-dependent manner as performed in GL-L1 may help in distinguishing closely located sources.
This paper is structured as follows. In Section 2, we present the signal model. In Section 3, we explain the proposed approaches. In Section 4, we explain the ADMM-based implementation. In Section 5, we present the simulation setup, the results obtained, and a discussion. Finally, in Section 6, we conclude with a summary of the work presented and future work.

Rotational microphone signal model
In the short-time Fourier transform (STFT) domain, the signal recorded by a single, directional microphone rotating in the horizontal plane is modeled as

$$Y(k,n) = \sum_{p=1}^{P} \int_{0}^{2\pi} a^{*}(k,\theta-\gamma_n)\,H_p(k,\theta)\,S_p(k,n)\,\mathrm{d}\theta + D(k,n), \tag{1}$$

where $(\cdot)^{*}$ denotes the complex conjugate, $k$ the discrete frequency index, $n$ the observation frame index, $\theta$ the look direction, and $\gamma_n$ the microphone orientation at frame $n$. We assume that there are a total of $P$ point sources in the far field and that $P$ is known. The expression in (1) describes that the resulting microphone signal $Y(k,n)$ is composed of the sum of the $P$ point source signals $S_p(k,n)$ arriving from $P$ distinct directions, $\vartheta_1$ to $\vartheta_P$, weighted by the direction-dependent microphone response $a(k,\theta-\gamma_n)$, relative to the microphone orientation $\gamma_n$, and by the room transfer function (RTF) $H_p(k,\theta)$ from the $p$-th far-field source to the microphone, added to diffuse or sensor noise $D(k,n)$. If we consider that the recording is performed in anechoic conditions, then $H_p(k,\theta) = \delta(\theta-\vartheta_p)$, $\forall k$ and for $p = 1,\dots,P$, such that the expression in (1) is reduced to

$$Y(k,n) = \sum_{p=1}^{P} a^{*}(k,\vartheta_p-\gamma_n)\,S_p(k,n) + D(k,n). \tag{2}$$

Assuming that the source signals are uncorrelated and stationary during the entire observation, such that their PSDs remain constant across different time frames, and that the microphone response is real-valued, the microphone signal PSD $\varphi_Y(k,n)$ can be described as follows:

$$\varphi_Y(k,n) = \sum_{p=1}^{P} |a(k,\vartheta_p-\gamma_n)|^{2}\,\varphi_{S_p}(k) + \varphi_D(k,n), \tag{3}$$

where $\varphi_D(k,n)$ is the noise PSD for frequency $k$ and time frame $n$, and $\varphi_{S_p}(k)$ is the PSD for frequency $k$ corresponding to the $p$-th source at position $\vartheta_p$. As the directional microphone is oriented towards different directions $\gamma_n$ for different observation frames $n$, the relative positions of the sound sources with respect to the microphone do not remain the same, and consequently their PSD values $\varphi_{S_p}(k)$ are multiplied by different squared microphone response coefficients $|a(k,\vartheta_p-\gamma_n)|^{2}$ over different time frames.
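As a minimal numerical sketch of the anechoic PSD model, in which each source PSD is weighted by the squared microphone response for the current orientation and the noise PSD is added, the snippet below evaluates the microphone PSD at a single frequency bin for several orientations. The ideal cardioid pattern 0.5 + 0.5 cos(θ) is borrowed from the simulation setup described later in the paper; the function names and example values are illustrative assumptions only.

```python
import math

def cardioid(theta):
    """Ideal, frequency-flat and real-valued cardioid response."""
    return 0.5 + 0.5 * math.cos(theta)

def mic_psd(source_doas, source_psds, orientations, noise_psd=0.0):
    """Microphone PSD per orientation at one frequency bin.

    Each source PSD is weighted by the squared response evaluated at the
    source DOA relative to the microphone orientation; noise PSD is added.
    """
    return [
        sum(cardioid(doa - gamma) ** 2 * psd
            for doa, psd in zip(source_doas, source_psds)) + noise_psd
        for gamma in orientations
    ]

# Hypothetical example: one source at 0 rad with unit PSD, N = 6 orientations.
orientations = [math.radians(60 * n) for n in range(6)]
phi_y = mic_psd([0.0], [1.0], orientations)
```

The observed power is largest when the microphone faces the source and vanishes when it faces away, which is the orientation-dependent variation the estimation approaches rely on.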

Grid-based system of equations
If we assume that measurements of $\varphi_Y(k,n)$ are available for multiple time frames, with $n = 1,\dots,N$, a linear system of equations can be built for estimating the point source DOAs $\vartheta_p$ and PSDs $\varphi_{S_p}(k)$, for $p = 1,\dots,P$ and $k = 1,\dots,K$, where $K$ denotes the number of frequency bins. As the directional weighting factor $|a(k,\vartheta_p-\gamma_n)|^{2}$ in (3) depends on the unknown source DOA, we use a grid of candidate positions defined between $0$ and $2\pi$ and an overcomplete dictionary of corresponding microphone response coefficients in order to build the linear system of equations, as previously proposed in [43].
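For a single candidate angle, denoted as $\theta_l$, a vector $\boldsymbol{\phi}_{S,\theta_l}$ with PSD values for all frequencies is defined as

$$\boldsymbol{\phi}_{S,\theta_l} = \left[\varphi_S(1,\theta_l),\,\dots,\,\varphi_S(K,\theta_l)\right]^{\top}, \tag{4}$$

where $(\cdot)^{\top}$ denotes the transpose and $\varphi_S(k,\theta_l)$ is the PSD at candidate position $\theta_l$ for frequency $k$. We stack the vectors $\boldsymbol{\phi}_{S,\theta_l}$ for all candidate DOAs from a given $L$-element angular grid, with $P \ll L$ and $N < L$, in order to obtain the vector we ultimately aim to estimate as

$$\boldsymbol{\phi}_S = \left[\boldsymbol{\phi}_{S,\theta_1}^{\top},\,\dots,\,\boldsymbol{\phi}_{S,\theta_L}^{\top}\right]^{\top}, \tag{5}$$

which will be used to obtain $\hat{\theta}_p$ and $\hat{\varphi}_{S_p}(k)$ such that $\hat{\theta}_p \in \{\theta_1,\dots,\theta_L\}$, for $p = 1,\dots,P$.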
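One possible construction of the linear system of equations is defined as follows. Firstly, a vector $\boldsymbol{\phi}_{Y,n}$ containing the PSD values for the microphone observation frame $n$ is defined as

$$\boldsymbol{\phi}_{Y,n} = \left[\varphi_Y(1,n),\,\dots,\,\varphi_Y(K,n)\right]^{\top}. \tag{6}$$

By stacking the vectors $\boldsymbol{\phi}_{Y,n}$ for all observations $n = 1$ to $n = N$, we obtain a vector $\bar{\boldsymbol{\phi}}_Y$ as

$$\bar{\boldsymbol{\phi}}_Y = \left[\boldsymbol{\phi}_{Y,1}^{\top},\,\dots,\,\boldsymbol{\phi}_{Y,N}^{\top}\right]^{\top}. \tag{7}$$

Similarly, we also define the vectors for the diffuse component as

$$\boldsymbol{\phi}_{D,n} = \left[\varphi_D(1,n),\,\dots,\,\varphi_D(K,n)\right]^{\top}, \tag{8}$$

$$\bar{\boldsymbol{\phi}}_D = \left[\boldsymbol{\phi}_{D,1}^{\top},\,\dots,\,\boldsymbol{\phi}_{D,N}^{\top}\right]^{\top}. \tag{9}$$

In order to build an overcomplete microphone response matrix involving all possible candidate DOAs, we firstly define a vector containing the squared microphone response coefficients for a candidate angle $\theta_l$ relative to the microphone orientation $\gamma_n$ as

$$\mathbf{a}(\theta_l-\gamma_n) = \left[|a(1,\theta_l-\gamma_n)|^{2},\,\dots,\,|a(K,\theta_l-\gamma_n)|^{2}\right]^{\top}. \tag{10}$$

Hence, the microphone response matrix is defined as

$$\bar{\mathbf{A}} = \begin{bmatrix} \mathbf{A}_{1,1} & \cdots & \mathbf{A}_{1,L} \\ \vdots & \ddots & \vdots \\ \mathbf{A}_{N,1} & \cdots & \mathbf{A}_{N,L} \end{bmatrix}, \tag{11}$$

where $\mathbf{A}_{n,l} = \mathrm{diag}(\mathbf{a}(\theta_l-\gamma_n))$ and $\mathrm{diag}(\cdot)$ denotes the creation of a diagonal matrix from the elements of a given vector. The linear system of equations can then be written as

$$\bar{\boldsymbol{\phi}}_Y = \bar{\mathbf{A}}\,\boldsymbol{\phi}_S + \bar{\boldsymbol{\phi}}_D, \tag{12}$$

which corresponds to a matrix representation of (3) for different observation frames, and which we refer to as the narrowband system of equations in the remainder.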
By ensuring that $\gamma_1 \neq \gamma_2 \neq \dots \neq \gamma_N$, with $0 \leq \gamma_n \leq 2\pi$, $\forall n$, and assuming that $\bar{\boldsymbol{\phi}}_Y$ and $\bar{\mathbf{A}}$ are known, source localization in terms of DOA estimation can be achieved by solving the proposed linear system of equations. From the estimated vector $\hat{\boldsymbol{\phi}}_S$, we can identify the directions $\theta_l$ within the angular grid at which there are peaks in power, indicating the point source DOAs $\hat{\theta}_p$ and their PSDs $\hat{\varphi}_{S_p}(k)$, assuming that $\hat{\theta}_p \in \{\theta_1,\dots,\theta_L\}$.
While this narrowband system of equations has all frequency bins decoupled from each other, it is also possible to build a broadband version of it. By considering an $N \times N$ identity matrix denoted as $\mathbf{I}_N$ and a $K$-element column vector of ones denoted as $\mathbf{1}_K$, we define the broadband vectors and matrix:

$$\boldsymbol{\phi}_Y = \left(\mathbf{I}_N \otimes \mathbf{1}_K^{\top}\right)\bar{\boldsymbol{\phi}}_Y, \tag{13}$$

$$\boldsymbol{\phi}_D = \left(\mathbf{I}_N \otimes \mathbf{1}_K^{\top}\right)\bar{\boldsymbol{\phi}}_D, \tag{14}$$

$$\mathbf{A} = \left(\mathbf{I}_N \otimes \mathbf{1}_K^{\top}\right)\bar{\mathbf{A}}, \tag{15}$$

where $\otimes$ denotes the Kronecker product. This operation over the previously defined vectors $\bar{\boldsymbol{\phi}}_Y$ and $\bar{\boldsymbol{\phi}}_D$ results in the sum over all frequency bins of the PSD values from the microphone signal and the diffuse component, respectively. It also results in a "dimensionality reduction" of the original matrix $\bar{\mathbf{A}}$, such that each block $\mathbf{A}_{n,l}$ is replaced by the row vector $\mathbf{a}^{\top}(\theta_l-\gamma_n)$, stacked together accordingly. Using (13)-(15), the linear system of equations can be written as

$$\boldsymbol{\phi}_Y = \mathbf{A}\,\boldsymbol{\phi}_S + \boldsymbol{\phi}_D, \tag{16}$$

which we refer to as the broadband system of equations in the remainder, and which is equivalent to the system of equations used in [43]. Like the narrowband system presented in (12), the broadband system in (16) can be solved for estimating the point source DOAs, with the caveat that the frequency-dependent information in the microphone observation vector is lost by summing the input PSD vector over all frequencies. Consequently, an additional PSD re-estimation step is required when using the broadband system of equations, as will be explained in the following section.
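The relation between the narrowband and broadband response matrices can be sketched numerically as follows. This is only an illustration under stated assumptions: it uses NumPy, an ideal frequency-flat cardioid (taken from the simulation setup later in the paper), and small illustrative sizes; the helper names are hypothetical.

```python
import numpy as np

def squared_cardioid(theta, K):
    """Squared response of an ideal cardioid, identical for all K bins."""
    return np.full(K, (0.5 + 0.5 * np.cos(theta)) ** 2)

def narrowband_matrix(grid, orientations, K):
    """Stack the diagonal blocks A_{n,l} into an (N*K) x (L*K) matrix."""
    rows = [
        np.hstack([np.diag(squared_cardioid(theta - gamma, K)) for theta in grid])
        for gamma in orientations
    ]
    return np.vstack(rows)

def to_broadband(A_bar, N, K):
    """Sum over frequency bins by left-multiplying with (I_N kron 1_K^T)."""
    summing = np.kron(np.eye(N), np.ones((1, K)))
    return summing @ A_bar

N, K, L = 6, 4, 36                      # illustrative sizes only
orientations = np.deg2rad(np.arange(N) * 360.0 / N)
grid = np.deg2rad(np.arange(L) * 360.0 / L)
A_bar = narrowband_matrix(grid, orientations, K)   # narrowband, (N*K) x (L*K)
A = to_broadband(A_bar, N, K)                      # broadband,  N x (L*K)
```

Each row of the broadband matrix then holds the transposed squared-response vectors for one orientation, so that multiplying it with the stacked PSD vector sums the weighted PSDs over both candidate directions and frequency bins.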

Proposed approaches
In this section, we explain the broadband and narrowband approaches proposed for estimating the DOAs and PSDs of point sources modeled as described in Section 2. The first approach involves the solution to a group-sparse constrained optimization problem for estimating the DOAs, constructed based on the broadband system of equations in (16), followed by a least-squares step for re-estimating the point sources' PSDs. This approach is here named GL-LS. The second approach, involving the solution to a group-sparse constrained optimization problem based on the narrowband system of equations in (12), considers an iterative process of reweighting the group-sparsity penalty present in the formulation for jointly estimating the DOAs and PSDs of the point sources. This approach is here named GL-L1.

Group Lasso followed by least squares (GL-LS)
Assuming that $N < KL$, the linear system in (16) is underdetermined. Therefore, the following Group Lasso [48] optimization problem is proposed to be solved:

$$\min_{\boldsymbol{\phi}_S}\ \frac{1}{2}\left\|\boldsymbol{\phi}_Y - \mathbf{A}\boldsymbol{\phi}_S\right\|_2^2 + \lambda \sum_{l=1}^{L}\left\|\boldsymbol{\phi}_{S,\theta_l}\right\|_2 \tag{17a}$$

$$\text{subject to}\quad \boldsymbol{\phi}_S \geq \mathbf{0}, \tag{17b}$$

$$\boldsymbol{\phi}_Y - \mathbf{A}\boldsymbol{\phi}_S \geq \mathbf{0}, \tag{17c}$$

where $\lambda$ denotes the regularization parameter. The nonnegativity constraint in (17b) is necessary for complying with the intrinsic nonnegativity property of PSD values [49], whereas the nonnegativity constraint on the error term, as formulated in (17c), is included as a means to model the noise PSD $\varphi_D(k,n)$ present in the microphone observations, such that more robustness can be achieved in case of a possible model mismatch, for instance between the assumed matrix $\mathbf{A}$ and the actual microphone response. The Group Lasso formulation includes the regularization term $\sum_{l=1}^{L}\|\boldsymbol{\phi}_{S,\theta_l}\|_2$, which enforces sparsity between so-called groups [48]. When assuming that only a limited number of point sources are present in space ($P \ll L$), using this group-sparsity penalty is a way of ensuring that only a few of the subvectors $\boldsymbol{\phi}_{S,\theta_l}$ composing $\boldsymbol{\phi}_S$ will be activated, that is, have a magnitude significantly different from zero. After solving the optimization problem (17a)-(17c) and obtaining $\hat{\boldsymbol{\phi}}_S$, and therefore $\hat{\boldsymbol{\phi}}_{S,\theta_1},\dots,\hat{\boldsymbol{\phi}}_{S,\theta_L}$, the PSD values can be averaged over the $K$ frequency bins for each of the $L$ candidate directions, allowing the point source DOAs to be estimated by finding the indices of $\theta_l$ for which there are peaks in the average PSD. For a total of $P$ sources assumed to be present, $P$ peaks should then be identified.
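The peak-picking step described above can be sketched as follows. This is a simple illustration, not the authors' implementation: it assumes the estimated PSD vector is stored as L consecutive groups of K values, treats the angular grid as circular, and defines a peak as a strict rise followed by a non-increase; the function names are hypothetical.

```python
def frequency_average(phi_s, L, K):
    """Average the K PSD values of each of the L stacked groups."""
    return [sum(phi_s[l * K:(l + 1) * K]) / K for l in range(L)]

def pick_doas(avg_psd, grid_deg, num_sources):
    """Return the grid angles of the largest local maxima on a circular grid."""
    L = len(avg_psd)
    peaks = [
        l for l in range(L)
        # avg_psd[l - 1] wraps around to the last grid point when l == 0.
        if avg_psd[l] > avg_psd[l - 1] and avg_psd[l] >= avg_psd[(l + 1) % L]
    ]
    peaks.sort(key=lambda l: avg_psd[l], reverse=True)
    return [grid_deg[l] for l in peaks[:num_sources]]

# Hypothetical estimate with L = 6 candidate directions and K = 2 bins.
grid_deg = [0, 60, 120, 180, 240, 300]
phi_s_hat = [0, 0, 1, 3, 0, 0, 0, 0, 0.5, 0.5, 0, 0]
doas = pick_doas(frequency_average(phi_s_hat, 6, 2), grid_deg, num_sources=2)
```

If fewer than P local maxima exist, this sketch returns fewer than P angles; a practical implementation would need to decide how to handle that case.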
One may note that the resulting PSD estimates for all directions in the angular grid obtained from solving (17a)-(17c) will be inherently biased, due to the group-sparsity penalty included in the optimization problem [50]. Moreover, the summation over frequency of the PSD values present in the microphone observation vector, see (13), results in the loss of frequency-dependent information on the detected point sources' PSDs. These limiting factors motivate the use of a re-estimation step for the PSD values using the estimated DOAs, as previously proposed in [43].
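The new PSD vectors are defined as follows:

$$\boldsymbol{\phi}_Y(k) = \left[\varphi_Y(k,1),\,\dots,\,\varphi_Y(k,N)\right]^{\top}, \tag{18}$$

$$\boldsymbol{\phi}_D(k) = \left[\varphi_D(k,1),\,\dots,\,\varphi_D(k,N)\right]^{\top}, \tag{19}$$

$$\boldsymbol{\phi}_S(k) = \left[\varphi_{S_1}(k),\,\dots,\,\varphi_{S_P}(k)\right]^{\top}. \tag{20}$$

We define a new matrix $\mathbf{A}_S(k) \in \mathbb{R}^{N \times P}$ which now contains squared microphone response coefficients for only the directions $\hat{\theta}_1,\dots,\hat{\theta}_P \in \{\theta_1,\dots,\theta_L\}$ where the $P$ sources are assumed to be located, based on the preceding DOA estimation:

$$\mathbf{A}_S(k) = \begin{bmatrix} |a(k,\hat{\theta}_1-\gamma_1)|^{2} & \cdots & |a(k,\hat{\theta}_P-\gamma_1)|^{2} \\ \vdots & \ddots & \vdots \\ |a(k,\hat{\theta}_1-\gamma_N)|^{2} & \cdots & |a(k,\hat{\theta}_P-\gamma_N)|^{2} \end{bmatrix}. \tag{21}$$

Using the PSD signal model in (3), a new linear system of equations for the microphone signal PSD is then formulated, for each frequency bin, as

$$\boldsymbol{\phi}_Y(k) = \mathbf{A}_S(k)\,\boldsymbol{\phi}_S(k) + \boldsymbol{\phi}_D(k). \tag{22}$$

If $P \leq N$, a constrained least-squares approach can be used for solving the overdetermined linear system and estimating the PSD values of the point sources:

$$\hat{\boldsymbol{\phi}}_S(k) = \arg\min_{\boldsymbol{\phi}_S(k)}\ \left\|\boldsymbol{\phi}_Y(k) - \mathbf{A}_S(k)\,\boldsymbol{\phi}_S(k)\right\|_2^2 \tag{23a}$$

$$\text{subject to}\quad \boldsymbol{\phi}_S(k) \geq \mathbf{0}. \tag{23b}$$

Hence, in this re-estimation step, we avoid the bias induced by the Group Lasso formulation used in the DOA estimation step and allow for a more accurate PSD estimation for the stationary point sources.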
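A nonnegativity-constrained least-squares re-estimation of this kind can be sketched for one frequency bin as below. The solver shown is a plain projected-gradient iteration, chosen here only because it is short and self-contained; the source does not state which least-squares solver was used. The ideal cardioid response, the example DOAs and PSD values, and all function names are assumptions for illustration.

```python
import numpy as np

def response_matrix(doas, orientations):
    """N x P matrix of squared ideal-cardioid responses (flat over frequency)."""
    diff = np.asarray(doas)[None, :] - np.asarray(orientations)[:, None]
    return (0.5 + 0.5 * np.cos(diff)) ** 2

def nnls_projected_gradient(A, y, iters=500):
    """Minimize ||y - A x||_2^2 subject to x >= 0 by projected gradient."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        # Gradient step on the quadratic, then projection onto x >= 0.
        x = np.maximum(x - step * (A.T @ (A @ x - y)), 0.0)
    return x

orientations = np.deg2rad(np.arange(6) * 60.0)    # N = 6 orientations
A_S = response_matrix(np.deg2rad([0.0, 90.0]), orientations)
true_psd = np.array([2.0, 0.5])                   # hypothetical PSDs at one bin
estimate = nnls_projected_gradient(A_S, A_S @ true_psd)
```

In this noise-free toy case with N = 6 observations and P = 2 sources, the system is overdetermined and the true PSD values are recovered; with noisy observations the estimate would only be a least-squares fit.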
When compared to our previous work [43], the GL-LS approach described in this subsection presents an extension of the previous idea of solving a group-sparse constrained optimization problem for estimating DOAs and PSDs of point sources by including the additional nonnegativity constraint over the error term, in order to provide robustness against model mismatches that may be present.

Group Lasso with $\ell_1$-reweighting (GL-L1)
As an alternative to the method proposed in Section 3.1, we consider employing the narrowband input vector $\bar{\boldsymbol{\phi}}_Y$ and response matrix $\bar{\mathbf{A}}$ (see the model in (12)) and construct the following optimization problem:

$$\min_{\boldsymbol{\phi}_S}\ \frac{1}{2}\left\|\bar{\boldsymbol{\phi}}_Y - \bar{\mathbf{A}}\boldsymbol{\phi}_S\right\|_2^2 + \lambda \sum_{l=1}^{L} w_l \left\|\boldsymbol{\phi}_{S,\theta_l}\right\|_2 \tag{24a}$$

$$\text{subject to}\quad \boldsymbol{\phi}_S \geq \mathbf{0}, \tag{24b}$$

$$\bar{\boldsymbol{\phi}}_Y - \bar{\mathbf{A}}\boldsymbol{\phi}_S \geq \mathbf{0}. \tag{24c}$$

In this approach, we include the use of group weights, denoted $w_l$ for $l = 1,\dots,L$, in order to allow for a group-wise, iterative reweighting process aimed at enhancing sparsity within the final solution obtained [51,52]. This is motivated by the goal of jointly estimating the point source DOAs and PSDs from the solution to the constrained optimization problem, without performing a least-squares re-estimation step as in the GL-LS approach, in which the optimization problem in (17) is solved only for estimating the sources' DOAs.
The group sparsity is enhanced by successively solving the optimization problem in (24), while updating, between iterations, the sparsity penalization of each estimated group separately as a function of their corresponding norms [51,53,54]. The weight updates can be expressed as

$$w_l^{(i+1)} = \frac{1}{\left\|\hat{\boldsymbol{\phi}}_{S,\theta_l}^{(i)}\right\|_2 + \epsilon}, \tag{25}$$

where $i$ denotes the reweighting iteration index, and $\epsilon > 0$ is a fixed parameter introduced for avoiding a division by zero.
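A group-wise weight update of this kind can be sketched as below. The specific form, the reciprocal of each group's Euclidean norm plus ε, is an assumption that follows the usual reweighting scheme from the cited literature; the storage layout (L consecutive groups of K values) and the function name are likewise hypothetical.

```python
import math

def update_weights(phi_s, L, K, eps=1e-3):
    """Assumed update: w_l = 1 / (||group l||_2 + eps) for each of L groups."""
    weights = []
    for l in range(L):
        group = phi_s[l * K:(l + 1) * K]
        norm = math.sqrt(sum(v * v for v in group))
        weights.append(1.0 / (norm + eps))
    return weights

# Toy example: an active group with norm 5 and an inactive (all-zero) group.
weights = update_weights([3.0, 4.0, 0.0, 0.0], L=2, K=2, eps=0.1)
```

Groups with small norms receive large weights and are therefore penalized more strongly in the next solve, which is how the reweighting promotes a sparser set of active directions.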
After reaching convergence of the group-wise iterative reweighting procedure, i.e., repeatedly re-estimating $\hat{\boldsymbol{\phi}}_{S,\theta_l}$ while updating the sparsity penalty weights for different groups individually, the point source DOA estimation is performed in a similar fashion as proposed in Section 3.1. Assuming a total number of $P$ point sources, the $P$ largest peaks in frequency-averaged power obtained from $\hat{\boldsymbol{\phi}}_S$ are picked, and the estimated DOAs $\hat{\theta}_p$ are obtained from the indices of $\theta_l$ for the corresponding groups selected.
Simultaneously, the point source PSDs are obtained as the estimated values of the corresponding subvectors contained in $\hat{\boldsymbol{\phi}}_S$ with angles $\hat{\theta}_p$, for $p = 1,\dots,P$. Hence, in this approach, the PSD estimates are obtained jointly with the DOA estimates, without performing an additional least-squares re-estimation step as in the proposed GL-LS approach.
The GL-L1 approach described in this subsection presents an extension of our previous work [43] in more aspects than GL-LS, namely the employment of the additional nonnegativity constraint on the error term, the use of the narrowband linear system of equations, and the iterative group reweighting process for jointly estimating DOAs and PSDs without a least-squares re-estimation step.

Implementation
In order to solve the optimization problems defined in (17) and (24) in an efficient way, we employ the alternating direction method of multipliers (ADMM) algorithm [55], as opposed to the use of CVX [47] in our previous work [43]. Since in both proposed approaches the vector $\boldsymbol{\phi}_S$ to be estimated is the same, and their distinction only consists in the construction of the input vectors and matrices employed, as well as in the use of group weights, we first describe the ADMM implementation in general terms and then clarify each method's full algorithm afterwards.
Let a general optimization problem involving the same group sparsity and nonnegativity constraints present in (17) and (24) be described as

where $\mathbf{x}_l$ corresponds to the $l$-th group composing the vector $\mathbf{x}$. We can use auxiliary variables, denoted $\mathbf{u}_1$ and $\mathbf{u}_2$, to rewrite the problem as

where $\mathbf{u}_{2,l}$ corresponds to the $l$-th group composing the vector $\mathbf{u}_2$, such that it can be represented as

The augmented Lagrangian [55] of the optimization problem in (27) can be written as

where $\mathbf{d}_1$ and $\mathbf{d}_2$ are the dual variables, and $\rho$ can be interpreted as a dual update step size [55].
Considering an ADMM iteration indexed as $j$, we define the following shorthands:

The updates for each variable in (29) are

The first update in (32) is computed as

The $\mathbf{u}_1$ update in (33) takes into account its nonnegativity constraint and is computed as

where $\max(\cdot)$ denotes the element-wise max operator.
The update of $\mathbf{u}_2$ in (34) is computed as

for $l = 1,\dots,L$, where $T(\cdot)$ represents a group-wise shrinkage function, defined as

$$T(\mathbf{z},\kappa) = \max\left(1 - \frac{\kappa}{\|\mathbf{z}\|_2},\,0\right)\mathbf{z}. \tag{40}$$

After computing the updates of $\mathbf{d}_1$ and $\mathbf{d}_2$ as in (35) and (36), respectively, the whole iterative process is repeated until convergence [55]. For more detailed derivations of each ADMM update equation, we refer to [55].
As previously mentioned, the difference in implementation between the proposed methods GL-LS and GL-L1 lies in how the input data for the ADMM algorithm explained here is chosen. In the case of GL-LS, we use $\boldsymbol{\phi}_Y$ and $\mathbf{A}$ as input vector and response matrix, respectively, and we set all group weights $w_l$ equal to one. After running the ADMM scheme once, the result is used for estimating the source DOAs and re-estimating the source PSDs as explained in Section 3.1. In the case of GL-L1, we use $\bar{\boldsymbol{\phi}}_Y$ and $\bar{\mathbf{A}}$ as input, the weights are also initially set equal to one, and the ADMM scheme is repeated until convergence while updating the group weights as described in Section 3.2, so that a final joint DOA and PSD estimation is obtained. A summary of the GL-LS implementation is presented in Algorithm 1, while a summary of the GL-L1 implementation is presented in Algorithm 2.
Regarding the computational complexity of each proposed approach, numerous factors affect the overall cost of both GL-LS and GL-L1. Firstly, when analyzing a single iteration within the ADMM scheme, we can observe that the most computationally demanding update occurs in (37), and that its cost depends on the dimensions of the response matrix being employed [55]. The use of the broadband signal model by GL-LS and the narrowband signal model by GL-L1 results in an asymptotic complexity of $O(NKL)$ and $O(NK^2L)$, respectively. In the case of GL-LS, the cost of the additional PSD re-estimation step through the solution of a least squares problem is asymptotically $O(N^3)$. Finally, the total runtime of each proposed approach depends on the convergence criterion selected for the ADMM scheme and the number of iterations required for satisfying it. Therefore, even though the proposed approaches may differ significantly in overall computational cost depending on the number of microphone orientations $N$ being considered and on setup conditions that can affect the convergence rate, the GL-LS approach is currently more computationally efficient than the GL-L1 approach. One may note, however, that the sparse structure of the matrix employed in the narrowband signal model, see (11), allows for using sparsity-aware methods for matrix computations [56,57] that can potentially reduce the complexity of GL-L1.
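To make the structure of such an ADMM scheme concrete, the sketch below implements one possible variable splitting for a weighted, nonnegative group-lasso problem with a nonnegative residual. It should be read as an assumption-laden illustration rather than the implementation used in this work: the objective 0.5‖y − Bx‖² + λ Σ w_l ‖x_l‖₂, the splitting u1 = y − Bx and u2 = x, the scaled dual variables, the fixed iteration count, and the names `B`, `y`, and `admm_group_lasso` are all choices made here. Only the group-wise shrinkage function T(z, κ) is taken directly from the text.

```python
import numpy as np

def group_shrink(z, kappa):
    """Group-wise shrinkage T(z, kappa) = max(1 - kappa / ||z||_2, 0) z."""
    norm = np.linalg.norm(z)
    return np.zeros_like(z) if norm == 0.0 else max(1.0 - kappa / norm, 0.0) * z

def admm_group_lasso(B, y, lam, weights, K, rho=1.0, iters=500):
    """Assumed problem: min 0.5||y - Bx||^2 + lam * sum_l w_l ||x_l||_2
    subject to x >= 0 and y - Bx >= 0, split as u1 = y - Bx, u2 = x."""
    m, n = B.shape
    u1, d1 = np.zeros(m), np.zeros(m)
    u2, d2 = np.zeros(n), np.zeros(n)
    gram = B.T @ B + np.eye(n)           # system matrix of the x-update
    for _ in range(iters):
        x = np.linalg.solve(gram, B.T @ (y - u1 - d1) + u2 - d2)
        # Nonnegative residual: scaled quadratic minimizer, then projection.
        u1 = np.maximum(rho / (1.0 + rho) * (y - B @ x - d1), 0.0)
        # Nonnegative, group-sparse copy of x: clip, then shrink each group.
        v = np.maximum(x + d2, 0.0)
        for l in range(n // K):
            sl = slice(l * K, (l + 1) * K)
            u2[sl] = group_shrink(v[sl], lam * weights[l] / rho)
        d1 += B @ x + u1 - y             # scaled dual updates
        d2 += x - u2
    return u2

# Hypothetical toy data: N = 6 orientations, L = 12 candidates, K = 1 bin.
gam = np.deg2rad(np.arange(6) * 60.0)
grid = np.deg2rad(np.arange(12) * 30.0)
B = (0.5 + 0.5 * np.cos(grid[None, :] - gam[:, None])) ** 2
y = 2.0 * B[:, 3]                        # one on-grid source at 90 degrees
lam = 0.1 * np.max(np.abs(B.T @ y))
x_hat = admm_group_lasso(B, y, lam, np.ones(12), K=1)
```

In this sketch the linear solve in the x-update dominates the per-iteration cost, in line with the remark above that the first update is the most demanding one; a practical implementation would factorize the system matrix once and use a convergence criterion instead of a fixed number of iterations.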

Simulations
In order to evaluate the performance of both proposed approaches in terms of DOA and PSD estimation, simulations are performed in MATLAB considering different setups. We aim to observe the advantages or disadvantages of using GL-LS or GL-L1 under different conditions, ranging from different types of model mismatch in Sections 5.2 to 5.3 to variations of the acoustic scene in Sections 5.4 to 5.5 and of the microphone directivity pattern in Section 5.6.
In Table 1, a summary of the main parameters used in each subsection is presented. For all simulations, the sampling frequency is 16 kHz, the source signals are stationary, and the microphone signal PSD $\varphi_Y(k,n)$ is estimated with Welch's method, using a 512-point Hann window corresponding to a length of 32 ms and 50% overlap over a duration of 500 ms for each microphone orientation $\gamma_n$, and therefore, each observation frame $n$. For a total of $N$ observations, $N$ different microphone orientations uniformly distributed over 360° are simulated. The $L$ candidate directions used for building the microphone response matrices are defined according to a uniformly spaced angular grid given a certain resolution in degrees. The signal-to-noise ratio (SNR) is defined as $\mathrm{SNR} = \sigma^2_{S_1}/\sigma^2_D$, where $\sigma^2_{S_1}$ denotes the broadband power of the first source ($p = 1$), and $\sigma^2_D$ denotes the broadband power of the diffuse component. The regularization parameter $\lambda$ is heuristically set as a function of $\|\mathbf{A}^{\top}\boldsymbol{\phi}_Y\|_\infty$ and $\|\bar{\mathbf{A}}^{\top}\bar{\boldsymbol{\phi}}_Y\|_\infty$ for the GL-LS and GL-L1 approaches, respectively, with $\|\cdot\|_\infty$ denoting the $\ell_\infty$-norm. The choice of $\lambda$ may be suboptimal; however, it is motivated by the objective of obtaining solutions with both approaches that generalize to all the different scenarios tested. In the implementation of the GL-LS and GL-L1 approaches, the weights are initialized as $w_l = 1$, for $l = 1,\dots,L$, and in the case of GL-L1, the reweighting process is performed twice, resulting in three repetitions of the ADMM scheme. For each scenario considered, a total of 100 Monte Carlo realizations are simulated. With the exception of Section 5.4, all scenarios are simulated under anechoic conditions, and with the exception of Section 5.5, the simulated source signals correspond to speech-shaped noise, obtained by filtering white Gaussian noise with a 16th-order linear prediction filter based on a male speech signal from [58].
In Section 5.1, the performance measures used to evaluate the proposed approaches are defined. In Sections 5.2 and 5.3, we present simulations in which a single point source is present, whereas in Sections 5.5 and 5.6, the performance is evaluated in scenarios with 2 and 3 sources, respectively. The parameters varied for each simulated setup are further explained in each corresponding subsection.

Performance measures
The DOA estimation is evaluated by computing the mean absolute error (MAE) between the estimated DOA and the true source DOA, for each point source $p$ separately, and can be expressed as

where $N_r$ denotes the total number of Monte Carlo realizations and $r$ is the realization index.
The point source PSD estimation is evaluated by computing the normalized mean squared error (NMSE), for each point source separately, as
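The two performance measures can be sketched in code as follows. Two details are assumptions made here for illustration and are not taken from the source: angular errors are wrapped so that, for example, 350° and 10° are 20° apart, and the NMSE is computed for a single realization as the squared PSD error summed over frequency divided by the energy of the true PSD. The function names are hypothetical.

```python
def angular_error_deg(estimate, truth):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(estimate - truth) % 360.0
    return min(diff, 360.0 - diff)

def mae_deg(estimates, truths):
    """Mean absolute DOA error over Monte Carlo realizations."""
    errors = [angular_error_deg(e, t) for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)

def nmse(psd_est, psd_true):
    """Squared PSD error summed over frequency, normalized by true energy."""
    num = sum((e - t) ** 2 for e, t in zip(psd_est, psd_true))
    den = sum(t ** 2 for t in psd_true)
    return num / den

# Toy usage with two realizations and a two-bin PSD.
example_mae = mae_deg([350.0, 10.0], [10.0, 10.0])
example_nmse = nmse([1.0, 2.0], [1.0, 1.0])
```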

Different grid resolutions
Firstly, we consider a simulation setup for evaluating the performance of each proposed method as a function of the grid resolution selected for building the microphone response matrices, containing all candidate positions for a single point source. In addition, different SNR values are used to evaluate the methods' robustness to additive diffuse noise.
Table 1 Main simulation parameters for each subsection

A single point source emitting stationary speech-shaped noise of variance $\sigma^2_{S_1}$ is simulated in anechoic conditions, with its DOA being randomly generated between 0° and 360° for each Monte Carlo realization, yielding the possibility of the source position being on or off-grid. A total of $N = 6$ observation frames are used, with the microphone orientations uniformly distributed over 360° (i.e., 0°, 60°, 120°, 180°, 240°, and 300°) for building the linear systems of equations, i.e., (16) and (12) for GL-LS and GL-L1, respectively. We assume the observations are made with an ideal cardioid microphone with flat frequency response, defined as $a_{\mathrm{cardioid}}(k,\theta) = 0.5 + 0.5\cos(\theta)$, $\forall k$, and that the microphone is static during each observation frame. The diffuse component is white Gaussian noise, with variance $\sigma^2_D$. The grid resolution is varied from 1° to 40°, and the SNR is varied from 0 to 0 dB. The regularization parameter is heuristically set to $0.1\|\mathbf{A}^{\top}\boldsymbol{\phi}_Y\|_\infty$ and $0.1\|\bar{\mathbf{A}}^{\top}\bar{\boldsymbol{\phi}}_Y\|_\infty$ for GL-LS and GL-L1, respectively.
The estimation errors obtained when using GL-LS and GL-L1 for all combinations of grid resolution and SNR considered are presented in Figs. 1 and 2, respectively. When analyzing the DOA estimation in terms of MAE, we can observe that for both methods, the performance seems to converge with increasing SNR to a minimum achievable error corresponding to a quarter of the grid resolution, which is a result of both methods picking the closest grid point to the point source's actual position. In that case, the MAE increases linearly as the grid resolution varies from 5° to 40°, whereas for the case of a one-degree resolution, the error indicates a possible limitation as a function of the angular variation of the cardioid directivity pattern in estimating the true source DOAs. For SNRs lower than 10 dB, the MAE obtained with the GL-LS approach varies more for different grid resolutions than the MAE obtained with GL-L1. This could indicate that while having a finer grid resolution can be beneficial in the DOA estimation, employing the narrowband system of equations where frequency-dependent information is preserved, as in the GL-L1 approach, can positively affect the robustness towards diffuse noise when trying to localize a point source that does not present a flat spectrum, such as the one simulated in this scenario.
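The quarter-of-the-resolution error floor follows from a short calculation, under the idealized assumptions that the true DOA is uniformly distributed over the grid cell and that the estimator always returns the nearest grid point. For a grid spacing $\Delta$, the error $e$ is then uniformly distributed on $[-\Delta/2, \Delta/2]$, so that

```latex
\mathbb{E}\,|e| \;=\; \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} |e|\,\mathrm{d}e
\;=\; \frac{2}{\Delta}\cdot\frac{(\Delta/2)^{2}}{2}
\;=\; \frac{\Delta}{4}.
```

For instance, a 10° grid gives a floor of 2.5°, and a 40° grid gives a floor of 10°.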
When analyzing the PSD estimation in terms of NMSE, we observe that, for GL-LS, the NMSE seems to depend only on the SNR and not on the grid resolution, indicating that although the PSD re-estimation step via least squares depends on the previously estimated point source DOA, a mismatch between the chosen angular grid for building the microphone response matrix and the source's true DOA does not impact the PSD estimation accuracy. This is also observed when GL-L1 is used with a grid resolution above 10°. When comparing both methods, we observe that GL-LS overall achieves a lower PSD estimation error than GL-L1 for different SNR values.

Fig. 1 Estimation errors obtained with GL-LS for different grid resolutions and SNRs

Different levels of microphone response mismatch
In this set of simulations, we aim to investigate the effect of a mismatch between the microphone response used for generating the input vectors ϕ Y and φY used by GL-LS and GL-L1, respectively, and the assumed microphone response when building the linear system of equations to be solved in each proposed approach. In practical scenarios, such a mismatch can occur when it is assumed that the microphone being used presents an ideally flat frequency response, whereas, in reality, it becomes more directional for higher frequencies. By fixing the use of an ideal cardioid microphone response when building the linear system of equations, a performance comparison between both approaches can be made for different cases of model mismatch caused by the use of frequency-dependent microphone responses when generating the input vectors ϕ Y and φY used by GL-LS and GL-L1, respectively.
As an additional comparison, we also execute the proposed methods without including the additional nonnegativity constraint on the error term expressed in (17c) and (24c) for GL-LS and GL-L1, respectively, so that the possible improvement in robustness due to the constraint can be analyzed. In the case of GL-LS, this would correspond to solving the optimization problem presented in [43], and these versions of the proposed approaches are here referred to as GL-LS 0 and GL-L1 0 .
A single point source of speech-shaped noise is again simulated in anechoic conditions, with its DOA being randomly generated for each Monte Carlo realization. The SNR is here fixed at 10 dB and the grid resolution is fixed at 10°. For a normalized frequency value f ∈ [0, 1], the frequency-dependent directivity patterns, denoted as Sub-to-cardioid and Omni-to-cardioid, are defined in (43) as a linear combination of two directivity functions, a_H(θ) and a_L(θ), where:

$$a_H(\theta) = 0.5 + 0.5\cos(\theta), \quad a_L(\theta) = 0.75 + 0.25\cos(\theta) \quad \text{(Sub-to-cardioid)} \tag{44}$$

$$a_H(\theta) = 0.5 + 0.5\cos(\theta), \quad a_L(\theta) = 1 \quad \text{(Omni-to-cardioid)} \tag{45}$$

The estimation errors obtained when using both versions of each proposed approach with different microphone responses are presented in Fig. 3. We can observe that for the case of generating data with an ideal cardioid, the additional constraint over the error term does not significantly affect the DOA estimation performance in terms of MAE for either of the methods. However, it does result in a slight improvement of the PSD estimation in terms of NMSE for the GL-L1 method. We can also observe that, when an actual mismatch between the microphone response used for building the response matrices and the microphone response used for generating the data is present, the DOA estimation is indeed improved for both approaches when considering the Sub-to-cardioid response, but not when considering the use of the Omni-to-cardioid response. This is suspected to be due to the low level of directivity at lower frequencies presented by the microphone response, such that the observed microphone PSD for different orientations does not provide sufficient directional information to appropriately localize the target source.
In terms of PSD estimation, we observe that the NMSE values for GL-LS 0 and GL-LS do not significantly differ, even with a more apparent difference in DOA estimation errors. This may be due to the fact that the GL-LS method is only affected by the additional nonnegativity constraint during the DOA estimation step, and the PSD is then re-estimated via least squares. In the latter step, the microphone response mismatch is still present and therefore not compensated for, possibly yielding similar error levels. A similar effect was observed in Section 5.2 for GL-LS, in which the mean mismatch between the true DOA and the chosen grid candidate, which depends on the grid resolution, did not impact the PSD estimation performance.
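To make the least squares re-estimation step discussed above more concrete, the following is a minimal sketch of an overdetermined per-frequency fit. It assumes, purely for illustration, that the microphone PSD observed at each orientation is the source PSD weighted by the squared cardioid response towards the estimated DOA plus an orientation-independent noise PSD; this simplified model, the function names and the array shapes are assumptions and do not reproduce the paper's equations (12) to (17).

```python
import numpy as np

def cardioid(theta):
    """Ideal cardioid directivity (theta in radians)."""
    return 0.5 + 0.5 * np.cos(theta)

def ls_psd_reestimate(phi_y, orientations, doa_est):
    """Least squares fit of one source PSD and one noise PSD per frequency bin.

    phi_y:        (N, K) microphone PSDs for N orientations and K frequency bins
    orientations: (N,) microphone look directions in radians
    doa_est:      estimated source DOA in radians
    """
    # Assumed power response of the microphone towards the estimated DOA.
    power_resp = cardioid(doa_est - orientations) ** 2
    # Second column models an orientation-independent (diffuse) noise PSD.
    A = np.column_stack([power_resp, np.ones_like(power_resp)])
    # Solves all K frequency bins at once; x has shape (2, K).
    x, *_ = np.linalg.lstsq(A, phi_y, rcond=None)
    return x[0], x[1]

# Synthetic check with N = 6 orientations and K = 8 bins (illustrative values).
rng = np.random.default_rng(1)
orient = np.radians(np.arange(6) * 60.0)
doa = np.radians(130.0)
phi_s = rng.uniform(0.5, 2.0, size=8)
phi_v = rng.uniform(0.05, 0.1, size=8)
phi_y = np.outer(cardioid(doa - orient) ** 2, phi_s) + phi_v
s_hat, v_hat = ls_psd_reestimate(phi_y, orient, doa)
```

In this sketch, a response mismatch enters through the matrix `A` regardless of how the DOA was obtained, which is in line with the observation that the constraint used in the DOA step does not compensate for the mismatch in the least squares step.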
When comparing methods GL-L1 and GL-L1 0 , the NMSE decreases with the inclusion of the additional nonnegativity constraint, even with an increase in MAE for the case of the Omni-to-cardioid microphone. This indicates that even if the additional constraint does not improve the DOA estimation, it can still positively affect the PSD estimation. We also observe that the GL-L1 approach seems to be more robust towards model mismatch than GL-LS in terms of PSD estimation, with or without the nonnegativity constraint over the error term, suggesting an advantage of employing the proposed narrowband system of equations instead of its wideband counterpart in this scenario.
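The frequency-dependent microphone responses used in this subsection can be sketched as follows. The exact weights of the linear combination in (43) are not reproduced here; the sketch assumes that a_H is weighted by the normalized frequency f and a_L by 1 − f, which matches the description of a microphone that becomes more directional at higher frequencies. The function names are illustrative.

```python
import numpy as np

def a_high(theta):
    """High-frequency pattern: ideal cardioid, as in (44) and (45)."""
    return 0.5 + 0.5 * np.cos(theta)

def a_low_sub(theta):
    """Low-frequency pattern of the Sub-to-cardioid response, as in (44)."""
    return 0.75 + 0.25 * np.cos(theta)

def a_low_omni(theta):
    """Low-frequency pattern of the Omni-to-cardioid response, as in (45)."""
    return np.ones_like(np.asarray(theta, dtype=float))

def directivity(theta, f, a_low):
    """Assumed blend: weight f on a_high and (1 - f) on a_low, with f in [0, 1]."""
    return f * a_high(theta) + (1.0 - f) * a_low(theta)

theta = np.radians([0.0, 90.0, 180.0])
for f in (0.0, 0.5, 1.0):
    # Rear rejection (theta = 180 degrees) grows as f goes from 0 to 1.
    print(f, directivity(theta, f, a_low_sub), directivity(theta, f, a_low_omni))
```

Under this assumed blend, the Omni-to-cardioid response carries no directional information at f = 0, which is consistent with the explanation given for its weaker DOA estimation performance.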

Different levels of reverberation
While the simulations done in anechoic conditions in Sections 5.2 and 5.3 were included to show important performance characteristics of the two proposed approaches, we here consider a more realistic scenario where a sound source is placed in reverberant environments.

Fig. 3 Estimation errors obtained with GL-LS and GL-L1, as well as with their modified versions which exclude the nonnegativity constraint over the error term (GL-LS 0 and GL-L1 0 , respectively), while using different microphone directivity patterns to generate the observed data
A room of dimensions 6.3 × 5.1 × 2.5 m is simulated, with a single, ideal cardioid microphone placed at coordinates [3.7, 2.1, 1.5] m. The point source is placed at the same height as the microphone and its DOA is set by randomly generating its coordinates within the room for each Monte Carlo realization, with the constraints of being at least 0.1 m away from the room boundaries and exceeding the setup's critical distance from the microphone, denoted r_c, which varies with the reverberation time T60 considered [59]. An illustration of the simulated room with the constraints on the source position is shown in Fig. 4. For each of the N = 6 microphone orientations, the room impulse response is generated using the image source method implemented in [60] and convolved with the original point source signal, composed of speech-shaped noise. The grid resolution for building the microphone response matrices is set to 10°, and the SNR is in this case defined as the ratio between the broadband power of the reverberant source signal and that of the diffuse noise component, with its value fixed at 10 dB. The reverberation time is varied from T60 = 0 s to T60 = 0.6 s, where T60 = 0 s corresponds to the anechoic case, and reflections are simulated only in the two-dimensional plane in order to be consistent with the signal model in (1). Finally, the NMSE for evaluating the PSD estimation is computed with respect to a newly defined reference, corresponding to the PSD of the point source signal recorded with the cardioid microphone oriented towards the source's true DOA in the reverberant environment considered. For T60 = 0 s, this reference reduces to the one expressed in (42), which was used for evaluating the results obtained in anechoic conditions.
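The source placement described above can be sketched with a simple rejection sampler. The expression for the critical distance used in [59] is not reproduced in this section; the sketch assumes the common approximation r_c ≈ 0.057 √(Q V / T60), with V the room volume in m³, T60 in seconds and Q a directivity factor (set to 1 by default). The function names and the default arguments are illustrative, and the formula is only meaningful for T60 > 0.

```python
import numpy as np

ROOM = np.array([6.3, 5.1, 2.5])   # room dimensions in meters
MIC = np.array([3.7, 2.1, 1.5])    # microphone position in meters

def critical_distance(volume_m3, t60_s, directivity_factor=1.0):
    """Assumed approximation r_c = 0.057 * sqrt(Q * V / T60); requires T60 > 0."""
    return 0.057 * np.sqrt(directivity_factor * volume_m3 / t60_s)

def sample_source(t60_s, rng, margin=0.1, directivity_factor=1.0, max_tries=10_000):
    """Draw a source position at microphone height that satisfies both constraints."""
    r_c = critical_distance(ROOM.prod(), t60_s, directivity_factor)
    for _ in range(max_tries):
        # Uniform draw in the horizontal plane, at least `margin` from the walls.
        xy = rng.uniform(margin, ROOM[:2] - margin)
        if np.linalg.norm(xy - MIC[:2]) > r_c:
            # DOA of the source seen from the microphone, in degrees within [0, 360).
            doa = np.degrees(np.arctan2(xy[1] - MIC[1], xy[0] - MIC[0])) % 360.0
            return np.array([xy[0], xy[1], MIC[2]]), doa
    raise RuntimeError("no admissible source position found")

rng = np.random.default_rng(0)
position, doa_deg = sample_source(0.6, rng)
```

With these assumptions, the critical distance for T60 = 0.6 s is roughly 0.66 m, and it grows as the reverberation time decreases, which shrinks the admissible area for the source.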
The estimation errors obtained when using GL-LS and GL-L1 for all reverberation times considered are presented in Fig. 5. We can observe that the method GL-LS presents overall lower MAE and NMSE than the method GL-L1, indicating the benefit, in this scenario, of summing the signal PSD over frequency. Moreover, both approaches show the tendency of a performance degradation in DOA and PSD estimation with an increase in reverberation time, which is expected as reverberation is not explicitly accounted for in the utilized models (12) and (16).

Influence of angular separation and power ratio between two sources
So far, only a single point source was considered in the simulations presented in Sections 5.2, 5.3, and 5.4. Now, in order to allow a performance comparison between the proposed approaches regarding their capacity to separate different sources, we consider the case where two point sources with different spectral content are recorded by an ideal cardioid microphone. The DOA of one of the point sources is randomly selected, whereas the DOA of the other source is then set according to a certain angular separation from the first source, varied from 30° to 180° in 30° steps. The two sources, indexed as p = 1 and p = 2, emit colored noise based on a third-octave band filter centered on 1 kHz and 2 kHz, respectively. A power ratio (PR) between the sources is defined in (46) in terms of σ²_{S_1} and σ²_{S_2}, which denote the broadband variance of sources p = 1 and p = 2, respectively. The power ratio is set to 0 dB, 3 dB, and 6 dB according to (46). The grid resolution is fixed at 10° and the SNR, which is still defined with respect to the broadband variance of source p = 1, is fixed at 10 dB. The regularization parameter is now set to 0.005 ‖A⊤ ϕ Y‖∞ and 0.005 ‖Ā⊤ ϕ Y‖∞ for the GL-LS and GL-L1 methods, respectively, in an attempt to decrease the influence of the group sparsity penalty, since in this scenario more than a single group is expected to be activated. The DOA and PSD estimation errors, computed for each source separately, are presented for the GL-LS and GL-L1 methods in Figs. 6 and 7, respectively.
We can observe that, when using GL-LS, the MAE for both sources decreases as the angular separation between them increases, with the error being consistently lower for p = 2 when compared to p = 1 for PR = 3 dB and PR = 6 dB. We also observe that the difference in terms of MAE between the sources increases as the power ratio increases. These results are due to the second source having higher broadband power than the first, and therefore being less corrupted by the diffuse noise in comparison. We also observe a similar behavior in the PSD estimation in terms of NMSE, with all errors converging to around −20 dB as the angular separation reaches 180°, and the difference in PSD estimation error between sources increasing with the power ratio. When considering the GL-L1 method, we can observe that, for angular separation values below 60°, its MAE is lower than the one obtained with GL-LS, indicating a benefit of using a frequency-dependent structure when building the proposed linear system of equations to identify closely located sources. However, as opposed to the behavior of GL-LS, the MAE for p = 1 increases with angular separation while the MAE for p = 2 remains reasonably constant for PR = 3 dB and PR = 6 dB. By further analyzing the multiple realizations of each scenario considered, it was possible to observe that the estimated vector φS often presented, especially for the case when PR > 0 dB, spurious peaks of frequency-averaged PSD neighboring the second source's estimated position (θ2), indicating a spreading of the target source's power over multiple, neighboring candidate DOAs. If the frequency-averaged power of a candidate location neighboring source p = 2 is greater than the one related to source p = 1, then the algorithm will select the incorrect candidate and yield higher DOA estimation errors, which increase with the angular separation between sources and present opposing trends to those of the GL-LS approach. Since the regularization parameter has been heuristically chosen in this work's simulations and may be suboptimal, further tuning could potentially be carried out in order to improve this approach's DOA estimation performance in multi-source scenarios.
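The failure mode described above can be illustrated with a small sketch. It assumes that the DOAs are selected as the candidates with the largest frequency-averaged estimated PSD, as suggested by the description; the function name `pick_doas`, the array layout and the numerical values are hypothetical and only serve to show how power leaking into a neighbor of the stronger source can displace the weaker source.

```python
import numpy as np

def pick_doas(psd_groups, grid_deg, n_sources):
    """Select the n_sources candidates with the largest frequency-averaged PSD.

    psd_groups: (G, K) estimated PSD per candidate DOA (group) and frequency bin
    grid_deg:   (G,) candidate DOAs in degrees
    """
    avg_power = psd_groups.mean(axis=1)
    idx = np.argsort(avg_power)[::-1][:n_sources]
    return np.sort(grid_deg[idx])

grid = np.arange(0.0, 360.0, 10.0)   # 10-degree grid, 36 candidates
clean = np.zeros((grid.size, 4))
clean[4, :] = 1.0                    # weaker source at 40 degrees
clean[22, :] = 4.0                   # stronger source at 220 degrees

leaky = clean.copy()
leaky[23, :] = 1.5                   # spurious peak at 230 degrees, next to the stronger source

print(pick_doas(clean, grid, 2))     # both sources are found
print(pick_doas(leaky, grid, 2))     # the 40-degree source is replaced by the spurious peak
```

In the second case the error assigned to the weaker source equals its angular distance to the stronger source's neighborhood, which is why such an error would grow with the angular separation.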
Despite an increase in MAE for p = 1 as a function of the angular separation, we can observe that, in terms of PSD estimation, the use of the GL-L1 method presents fairly constant NMSE values for both sources and all PR values considered. It is also observed that, similarly to the use of GL-LS, the difference in terms of NMSE between the two sources increases with the power ratio. Overall, GL-LS achieved lower PSD estimation errors in most cases.
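Two elements of the two-source setup in this subsection can be sketched in a few lines. The sketch assumes that the power ratio in (46) is the broadband variance of source p = 2 relative to that of source p = 1, expressed in dB, which is consistent with the second source being the stronger one for PR > 0 dB. The regularization parameter follows the rule stated in the text, a fraction of the infinity norm of the transposed system matrix applied to the input vector. The function names and the toy inputs are illustrative.

```python
import numpy as np

def scale_to_power_ratio(s1, s2, pr_db):
    """Scale s2 so that 10*log10(var(s2) / var(s1)) equals pr_db (assumed definition)."""
    gain = np.sqrt(np.var(s1) / np.var(s2) * 10.0 ** (pr_db / 10.0))
    return gain * s2

def regularization_parameter(A, phi_y, fraction=0.005):
    """lambda = fraction * || A^T phi_y ||_inf, with fraction = 0.005 in this scenario."""
    return fraction * np.max(np.abs(A.T @ phi_y))

rng = np.random.default_rng(0)
s1 = rng.standard_normal(16000)
s2 = scale_to_power_ratio(s1, rng.standard_normal(16000), pr_db=6.0)
pr_check = 10.0 * np.log10(np.var(s2) / np.var(s1))

# Toy system matrix and input vector, only to show the shape of the computation.
A_toy = np.array([[1.0, 0.0], [0.0, 2.0]])
phi_toy = np.array([3.0, 4.0])
lam = regularization_parameter(A_toy, phi_toy)
```

For a group lasso problem without additional constraints, scaling the regularization parameter relative to a norm of A⊤ϕ is a common heuristic, since a sufficiently large value drives all groups to zero; the specific fraction remains a tuning choice.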

Influence of the microphone directivity pattern in the case of three sources
As a further investigation into the capacity of the proposed approaches to discriminate between different point sources in space, a new set of simulations is built for a case of three point sources. Upon testing the planned scenario, it was observed that using a cardioid microphone when simulating the recorded signals did not provide enough directional diversity to allow for three distinct peaks to be identified within the estimated vector φS. For this reason, we test both methods using microphones with higher-order directivity patterns based on the higher-order differential microphones studied in [61].
A general, second-order microphone directivity pattern, denoted Γ(θ), can be expressed as in (47), where θ denotes the angle and c0, c1, and c2 are real-valued scaling coefficients. By varying the values of c0, c1, and c2, one can obtain different second-order directivity patterns. In this study, we consider simulating three different patterns, here denoted Cardioid-A, Cardioid-B and Hypercardioid-2, based on the different combinations of coefficient values proposed in [61] and presented in Table 2. An illustration of each microphone directivity pattern in absolute values is presented in Fig. 8.
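As an illustration of how three coefficients shape a second-order pattern, the following sketch assumes a polynomial in cos(θ), i.e., c0 + c1 cos(θ) + c2 cos²(θ). This is one common parameterization and may differ from the exact form of (47). The coefficient set [0.25, 0.5, 0.25] used below is a hypothetical example (the square of a first-order cardioid) and is not taken from Table 2.

```python
import numpy as np

def second_order_pattern(theta, c0, c1, c2):
    """Assumed second-order form: c0 + c1*cos(theta) + c2*cos(theta)**2."""
    c = np.cos(theta)
    return c0 + c1 * c + c2 * c ** 2

theta = np.linspace(0.0, 2.0 * np.pi, 361)

# Hypothetical coefficients: expanding (0.5 + 0.5*cos)^2 gives [0.25, 0.5, 0.25].
squared_cardioid = second_order_pattern(theta, 0.25, 0.5, 0.25)
first_order = 0.5 + 0.5 * np.cos(theta)

# Absolute values, as used when plotting directivity patterns.
abs_pattern = np.abs(squared_cardioid)
```

Compared with the first-order cardioid, this example pattern falls off faster away from the look direction, which is the kind of additional directional diversity that motivates the use of higher-order patterns in the three-source case.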
Three point sources of equal power emitting speech-shaped noise are simulated in anechoic conditions, with the DOA of the first source (ϑ1) being randomly selected between 0° and 360°, and the DOAs of the two remaining sources (ϑ2 and ϑ3) being set according to the angular separation considered as ϑ2 = ϑ1 + Δθ and ϑ3 = ϑ1 + 2Δθ, with Δθ denoting such separation value. As opposed to the previous simulations presented, a total of N = 9 microphone orientations uniformly distributed over 360° (i.e., 0°, 40°, 80°, 120°, 160°, 200°, 240°, 280° and 320°) are used when building the linear system of equations used for each method, due to the need for more observation data in order to successfully differentiate the three different point sources. Both the DOA and PSD estimation errors are averaged over all three sources, since they are simulated with equal power, and are denoted as MAE and NMSE, respectively.
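The geometry of the three-source setup can be written down directly from the description above. In this minimal sketch, the wrapping of angles to the interval from 0° to 360° and the wrapped error used for the averaged MAE are assumptions, and the function names are illustrative.

```python
import numpy as np

def three_source_doas(theta1_deg, separation_deg):
    """DOAs theta1, theta1 + sep, theta1 + 2*sep, wrapped to [0, 360) degrees."""
    return (theta1_deg + separation_deg * np.arange(3)) % 360.0

def orientations(n):
    """n microphone orientations uniformly distributed over 360 degrees."""
    return np.arange(n) * 360.0 / n

def mean_abs_angular_error(est_deg, true_deg):
    """Absolute angular error (wrapped) averaged over all sources."""
    diff = (np.asarray(est_deg) - np.asarray(true_deg) + 180.0) % 360.0 - 180.0
    return np.mean(np.abs(diff))

doas = three_source_doas(300.0, 60.0)          # 300, 0 and 60 degrees
mic_orientations = orientations(9)             # 0, 40, ..., 320 degrees
mae = mean_abs_angular_error([300.0, 0.0, 70.0], doas)
```

Averaging over the three sources is only meaningful here because they are simulated with equal power, as stated in the text.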
The DOA and PSD estimation errors for the GL-LS and GL-L1 methods while using different second-order directivity patterns and angular separations between sources are presented in Figs. 9 and 10, respectively. When evaluating the DOA estimation, we observe again that the performance of GL-LS strongly depends on the angular separation between sources. However, the choice of microphone directivity pattern has only presented an impact for the case of an angular separation of 60° between sources, in which the Hypercardioid-2 provided slightly better performance. For the GL-L1 approach, the same pattern yields lower or equal MAE values for all angular separations considered. We again observe that GL-L1 yields lower MAE than GL-LS for the case of sources with angular separation below 60°, indicating the benefit, in this case, of employing the narrowband signal model.
In terms of PSD estimation, it is observed again that, for both proposed approaches, the use of the Hypercardioid-2 pattern overall yields NMSE values that are lower than or similar to those obtained with the Cardioid-A and Cardioid-B patterns. The GL-LS approach presents better PSD estimation performance than GL-L1, despite its greater sensitivity to the angular separation between sources when performing the preceding DOA estimation step.

Discussion on the performance limitations of GL-LS and GL-L1
Based on the simulation results presented in this section, it can be observed that for both proposed approaches, GL-LS and GL-L1, the DOA and PSD estimation performance can depend on multiple factors, which, when combined, call for a thorough investigation of the conditions under which their application is to be considered.
In the single-source scenarios simulated in this work, the results indicate that although the intuitive choice of using a finer grid of candidate DOAs can help obtain better DOA estimates, with the MAE being lower bounded by approximately a quarter of the grid resolution, factors such as the noise level and different mismatches between the signal model assumed in the proposed approaches and the practical conditions in which the microphone signals are observed can strongly affect the overall performance of both DOA and PSD estimation. Regarding the model mismatches, it was observed that errors in the assumed microphone directivity, such as those arising from calibration, and room reverberation lead to higher MAE and NMSE values.
In the multi-source scenarios, it was observed that the angular separation and difference in power between sources can significantly affect the DOA estimation performance of both proposed approaches and the PSD estimation performance of GL-LS.
Finally, although the extensive simulations presented in this work can already provide valuable information on the performance trends of the proposed approaches in numerous scenarios, more definitive evaluations can be obtained when considering the case of nonstationary signals and using experimental data. Many of the prevalent applications of DOA and PSD estimation involve speech signals and a combination of model mismatches resulting from multiple factors, which can be simultaneously present in practical setups. Therefore, an investigation of the proposed approaches' behavior under these conditions is required not only for gaining further clarity on their current applicability, but also to identify which main aspects should be considered in future work in order to enhance their performance.

Conclusion
In this paper, we proposed two approaches for performing DOA and PSD estimation of one or more point sources. The first approach, named GL-LS, is based on a broadband signal model with the PSDs summed over frequency for solving a group-sparse optimization problem with nonnegativity constraints over the desired output vector and the resulting error term, such that the sources' DOAs can be estimated from an overcomplete dictionary of angular candidate positions. Subsequently, a least squares step is performed for re-estimating the point sources' PSDs based on the estimated DOA information. The second approach, named GL-L1, is based on a narrowband signal model structure for solving an analogous optimization problem, which in this case is iteratively, group-wise reweighted for enhancing the solution's sparsity and jointly providing both DOA and PSD estimates.
Both approaches are implemented using ADMM, and simulations were performed for evaluating their performance under different conditions. Compared to the original method on which GL-LS and GL-L1 were based [43], it was observed that, in a scenario involving a microphone response model mismatch, having the additional nonnegativity constraint over the error term can improve the DOA estimation for the case of GL-LS and the PSD estimation for the case of GL-L1. Moreover, in terms of DOA estimation, the GL-L1 approach presented an advantage over GL-LS in scenarios with low SNR or where multiple sources are closely located to each other. Finally, it was shown that having the least squares PSD re-estimation step is beneficial in most scenarios, such that GL-LS outperformed GL-L1 in terms of PSD estimation errors.
Future work includes a further study of the influence of the choice of microphone orientations and directivity pattern used when acquiring measurement data on the DOA and PSD estimation performance, expanding the proposed approaches to the multi-channel case and performing experimental tests.

This research work was carried out at the ESAT Laboratory of KU Leuven, in the frame of FWO Mandate SB 1S86520N and FWO Mandate 12ZD622N. The research leading to these results has also received funding from the European Research Council under the European Union's Horizon 2020 research and innovation program/ERC Consolidator Grant: SONORA (no. 773268). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information.

Availability of data and materials
The proposed approaches' implementation is accessible from the corresponding author on reasonable request.

Fig. 2 Estimation errors obtained with GL-L1 for different grid resolutions and SNRs

Fig. 4 Illustration of the simulated room, where the source position is randomly generated within the area represented in gray (at least 10 cm from the walls and farther from the microphone than the room's critical distance r_c)

Fig. 7 Estimation errors obtained with GL-L1 for different values of angular separation and power ratio between two sources

Fig. 8 Microphone directivity patterns considered in this study, represented in absolute values

Fig. 9 Estimation errors averaged over three sources and obtained with GL-LS for different values of angular separation, while using different microphone directivity patterns

Table 2 Coefficient values for microphone directivity patterns used (columns: Directivity pattern, [c0, c1, c2])