Microphone utility estimation in acoustic sensor networks using single-channel signal features

In multichannel signal processing with distributed sensors, choosing the optimal subset of observed sensor signals to be exploited is crucial in order to maximize algorithmic performance and reduce computational load, ideally both at the same time. In the acoustic domain, signal cross-correlation is a natural choice to quantify the usefulness of microphone signals, i.e., microphone utility, for coherent array processing, but its estimation requires that the uncoded signals are synchronized and transmitted between nodes. In resource-constrained environments like acoustic sensor networks, low data transmission rates often make transmission of all observed signals to the centralized location infeasible, thus discouraging direct estimation of signal cross-correlation. Instead, we employ characteristic features of the recorded signals to estimate the usefulness of individual microphone signals using the Magnitude-Squared Coherence (MSC) between the source and respective microphone signal as ground-truth metric. In this contribution, we provide a comprehensive analysis of model-based microphone utility estimation approaches that use signal features and, as an alternative, also propose machine learning-based estimation methods that identify optimal sensor signal utility features. The performance of both approaches is validated experimentally using both simulated and recorded acoustic data, comprising a variety of realistic and practically relevant acoustic scenarios including moving and static sources.


Introduction
An acoustic sensor network (ASN) comprises multiple spatially distributed microphones, including multiple distributed compact microphone arrays, that typically communicate wirelessly. Capturing different perspectives of the acoustic scene, the signals recorded by these distributed microphones encode spatial information exploitable by multichannel signal processing algorithms. These algorithms accomplish crucial tasks [1] like acoustic source localization [2][3][4] and tracking [5,6], extraction and enhancement of an acoustic Source of Interest (SOI) [7][8][9], hands-free communication [10], acoustic monitoring [11,12], and scene classification and acoustic event detection [13]. As the microphones in ASNs often have no common sampling clock, their signals must be synchronized before joint processing.
The performance of these signal processing algorithms is affected by many factors, including the proximity of the microphones to desired and undesired acoustic sources, reverberation, additive noise, and the orientation and occlusion of microphones. As a result, the signals obtained from different microphones are generally not equally useful for the abovementioned tasks, and may even be detrimental in extreme cases if inappropriate importance is assigned to them. To ensure optimal algorithmic performance at minimum transmission and computational cost, a diligent selection of which observed microphone signals to process and which to discard is crucial in order to avoid unnecessary signal transmission or synchronization efforts. Unsurprisingly, this task has received considerable attention in the literature: the selection of a single best channel for Automatic Speech Recognition (ASR) based on signal features has been explored in [14]. A utility measure specifically for Minimum Mean Square Error (MMSE) signal extraction has been proposed in [15,16], followed by a distributed version in [17]. MMSE signal extraction under rate constraints tailored specifically to hearing aids was treated in [18][19][20]. Furthermore, joint microphone subset selection and speech enhancement using deep learning were proposed in [21]. Microphone subset selection to minimize communication cost with upper-bounded output noise power has been investigated for both Minimum Variance Distortionless Response (MVDR) [22] and Linearly Constrained Minimum Variance (LCMV) [23] beamforming. However, these methods either neglect the limitations of the underlying ASN regarding communication cost or are tailored to a specific application or cost function.
In the following, we present a different approach that overcomes both drawbacks, i.e., requires little transmission data rate and is applicable to a broad class of signal processing applications that rely on coherent input signals.
Many multichannel algorithms, e.g., for signal enhancement or localization using compact arrays [7,24], assume coherent, i.e., linearly related, input signals and exploit the spatial information captured by the interchannel phase differences. While this obviously applies to the signal components evoked by an SOI, it also holds for noise reference signals because they must admit a prediction, often linear, of residual noise components in order to suppress them. Thus, the cross-correlation of microphone signal pairs and measures derived from it, in particular the spatial coherence and the MSC, are intuitive measures for quantifying the usefulness of observed microphone signals and have been used in literature for that purpose, e.g., in [25]. For synchronized microphones with sufficient transmission data rate, e.g., for wired compact microphone arrays, direct estimation of the inter-channel coherence from the observed uncoded microphone signals to rate their utility is straightforward. However, in ASNs, this approach is often precluded by a limited transmission data rate, e.g., of current wireless networks [26], especially when the number of microphones is large. This issue is further compounded if the available data rate must be shared with other, possibly non-audio, applications, like video streaming in smart home environments. Furthermore, the microphone signals in ASNs generally do not share a common sampling clock [27]. While sampling time offsets are readily handled by suitable signal processing techniques [28], clock skew will often still pose a problem. Even when the accumulated sampling time offset within one processing block only amounts to fractions of a sampling period, imperfect cancelation can have a catastrophic effect on differential signal processing applications [29]. 
Furthermore, although the sampling rate variation across multiple copies of a single device model can be very low [30], this may not necessarily be true for ASNs comprised of heterogeneous, cheap consumer devices. Therefore, clock skew in ASNs should not be neglected in general. Thus, potentially costly synchronization of the signal waveforms is generally required prior to estimation of the coherence, which disqualifies direct estimation of the signal cross-correlation. To identify promising candidate microphone signals for synchronization and subsequent joint processing in ASNs without prior signal synchronization, other techniques are required.
To address these unique challenges of ASNs, we employ a compressed signal representation in the form of single-channel signal feature sequences, which are extracted from temporal blocks of the microphone signals, to reduce the amount of data to be transmitted by roughly two to three orders of magnitude. Accurately assessing the communication cost would require considering many additional factors including, but not limited to, the radio-frequency environment and radio Signal-to-Noise Ratio (SNR) at the wireless transceivers, modulation and coding schemes, medium access control and arbitration (potentially via distributed algorithms), protocols and the associated overhead, and the temporal duration of transmission frames, which is beyond the scope of this contribution. While acknowledging the implied simplification, we use the data amount as a proxy due to its conceptual simplicity and monotonic relation with actual communication cost, i.e., reducing the data amount never increases cost when the environment is constant. The employed features must be characteristic of the microphone signals, i.e., they must allow for (at least approximate) reconstruction of the inter-channel MSC.
In this contribution, we consider acoustic scenarios often encountered in smart home applications comprising a single SOI captured by multiple distributed microphones in an acoustic enclosure, as depicted in Fig. 1. After estimating the usefulness of the recorded microphone signals, a subset thereof is selected and transmitted to the central wireless Access Point (AP) for subsequent coherent multichannel signal processing. Although Fig. 1 shows an exemplary scenario with a wireless network with a central AP, this does not constrain the scope of the paper. For the considerations in this paper, the AP may be replaced by a network node acting as a local center implementing the multichannel signal processing algorithm. While the case of wired networks is also covered, their typically large transmission capacity may allow raw signal transmission and thereby limit the importance of the developed feature-based utility estimation. Instead of a specific signal processing algorithm, we consider a broad class of algorithms that rely on coherent input signals and do not require signals unrelated to the SOI, e.g., as noise references. In addition, we do not consider application-specific cost functions or performance metrics, e.g., Signal-to-Distortion Ratio (SDR) and Echo Return Loss Enhancement (ERLE), such that the proposed utility estimation scheme is appropriate for many subsequent multichannel signal processing applications. We instead generate utility estimates to match the ground-truth coherence between the SOI signal and the observed microphone signals. To this end, the proposed generic system comprises two subsystems depicted in Fig. 2: a feature extraction system, a copy of which runs for each microphone signal on the associated network node, and a utility estimation system running on the AP. In the feature extraction stage, characteristic signal features are extracted from the observed microphone signals independently from each other.
No cross-channel features are employed in order to not exclude single-microphone network nodes. The feature sequences obtained from each microphone are then transmitted to the central AP, which estimates the individual microphones' utility values by correlating the feature sequences. A set of Kalman Filters (KFs) with time-varying temporal smoothing provides a robust estimation framework for the feature covariance. Utility estimates are obtained by extracting structural information from the resulting covariance matrices via the corresponding Fiedler vector [31] that reflects a notion of average connectivity. A joint approach simultaneously considering single-channel features and network transmission cost was proposed in [32]. The efficacy of the proposed utility estimates for two specific important signal processing tasks, robust source localization and spatial filtering, was demonstrated in [33] and [34], respectively. Therein, sensor selection by optimizing the proposed utility measure has shown close-to-optimal performance, such that we focus only on the generic utility measure in this contribution.
In the remainder of this article, we review and provide more detailed descriptions of the model-based realizations of the two subsystems proposed in [32,35] in Sections 3.1 and 3.2 by explicitly stating and discussing the model assumptions of the KF. Formulating microphone selection as a graph bi-partitioning problem, the Fiedler vector yields an optimum soft assignment of each individual microphone to one of the two groups of most and least useful microphones, which further justifies its use as a utility measure. In Section 3.3, we provide new results on the suitability of established signal features for recovering inter-channel MSC. To this end, the feature selection task is formulated as a Least Absolute Shrinkage and Selection Operator (LASSO) regression problem which is then solved numerically to obtain an optimal set of signal features. In Section 4, we propose novel Machine Learning (ML)-based realizations for both subsystems whose combination can be learned in an end-to-end fashion, which constitutes a major contribution of this work. In Section 5, the efficacy of the proposed scheme and its individual components is validated. Different algorithmic variants, i.e., purely model-based, purely ML-based, and hybrid realizations of the proposed system, are investigated. To this end, comprehensive experiments for both synthesized and recorded data from realistic scenarios are conducted, including different reverberation times, additive noise and obstruction of sensors, different microphone arrangements, as well as static and moving SOIs. Throughout, the Pearson Correlation Coefficient (PCC) will be used as a normalized similarity measure for features in Section 3.2 and as a performance measure for the experiments in Section 5.

Notation and signal model
In the following, let t denote the discrete-time sample index and let f ∈ {1, . . . , F } denote the feature index where F is the number of extracted features per channel. Recalling Fig. 1, we consider an acoustic scenario comprising a single coherent SOI recorded by J microphones, each of which represents a separate node in the ASN. The signal captured by the microphone indexed by j ∈ P = {1, . . . , J } is

x_j[t] = h_j[t] * s[t] + n_j[t], (2)

where s[t] is the dry SOI signal, h_j[t] is the acoustic impulse response from the SOI to the j-th microphone, n_j[t] is an additive noise component, and * denotes linear convolution. Note that the SOI is not necessarily static, i.e., the acoustic impulse responses h_j[t] in (2) are considered time-invariant only for short observation intervals, but may change from one interval to the next as the SOI moves. The fully coherent spatial images of the SOI s[t] * h_j[t] are superimposed by a spatially diffuse or incoherent noise field, such that the mutual coherence between the noise components n_j[t], ∀j ∈ P is negligibly small. Thus, observed correlation between two microphone signals x_j[t], x_j′[t] is predominantly caused by the common SOI signal. Although competing point-like sources are not explicitly modeled in (2), the proposed method is still applicable given sufficient temporal sparsity, i.e., time intervals where only one of the sources is active, provided that the identity of the active source changes slowly enough to be tracked by the KF. Furthermore, consider an ASN spanning two rooms, each with its own SOI, connected by, e.g., open doors. With the microphones in each room predominantly capturing their respective SOI, the realizations of the proposed system in Sections 3 and 4 can still facilitate a distinction of microphones w. r. t. the dominant SOI. In this case, the scenario essentially decouples into two separate problems, but it is generally not known in advance which of the two possible solutions is found.
To ensure a deterministic selection, additional source selection mechanisms exploiting preference information are needed, which is beyond the scope of this paper. In any case, the signal model (2) should be viewed as a first step towards developing methods for more general acoustic scenarios.
As the proposed utility estimation relies only on the correlation of feature sequences computed from signal frames, only a coarse synchronization of the signal frames between different sensors has to be assured, such that the proposed scheme is practically relevant. However, we assume that the sensor signals are synchronous to compute the oracle MSC between each microphone and the source. The same holds for the microphone pair-wise complex coherence function, and thus MSC, which are the foundation of the baselines baseline-CDR and baseline-MSC, respectively, in Section 5.
To this end, the signals are partitioned into blocks indexed by k ∈ {1, . . . , K } with a length of L_b samples and a shift of L_s samples between successive blocks, e.g., x_j[k] for the j-th microphone signal x_j[t]. With the discrete frequency bin index n ∈ {1, . . . , L_b }, let Φ̂_{s,x_j}[k, n], Φ̂_{s,s}[k, n] and Φ̂_{x_j,x_j}[k, n] denote short-time estimates of the respective cross-Power Spectral Density (PSD) and auto-PSDs of s[t] and x_j[t], e.g.,

Φ̂_{x_j,x_j}[k, n] = |DFT_{L_b}{x_j[k]}|_n^2, (3)

where DFT_{L_b} denotes the L_b-point Discrete Fourier Transform (DFT). As a broadband ground-truth utility measure for the j-th microphone, the frequency-averaged narrowband MSC between the (latent) source signal s[t] and the j-th microphone signal x_j[t] is used,

γ_j^{MSC}[k] = (1/L_b) Σ_{n=1}^{L_b} |Φ̂_{s,x_j}[k, n]|² / ( Φ̂_{s,s}[k, n] Φ̂_{x_j,x_j}[k, n] ). (5)

Note that we drop the superscript ·^{MSC} from γ_j[k] in (5) for notational simplicity in the following. Under the assumption that the SOI signal s[t] and noise signals n_j[t] are mutually uncorrelated, and that the acoustic impulse responses in (2) are much shorter than the DFT length, the approximations in (6) and (7) hold. In practice, MSC estimates derived from (6) and (7) are subject to detrimental effects stemming from the combination of limited temporal observation intervals and the characteristics of the acoustic impulse responses between SOI and the microphones as captured by, e.g., relative time delay and Direct-to-Reverberation Ratio (DRR). Nevertheless, since coherent multichannel processing algorithms also degrade with the same impairments, the degradation in estimation accuracy of the coherence can be assumed to be correlated with the performance of signal processing algorithms and, hence, the utility of the involved sensors. The summands in (5) can thus still be interpreted as meaningful narrowband utility measures.
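To make the ground-truth measure concrete, the following sketch estimates the frequency-averaged MSC of (5) from short-time PSD estimates as in (3). The block length, shift, window, and the recursive PSD averaging factor are illustrative choices, not the values used in the experiments:

```python
import numpy as np

def freq_avg_msc(s, x, block_len=1024, shift=512, eps=1e-12):
    """Frequency-averaged MSC between a (latent) source signal s and a
    microphone signal x, using recursively averaged short-time PSD
    estimates. All parameter values are illustrative assumptions."""
    n_blocks = (min(len(s), len(x)) - block_len) // shift + 1
    phi_ss = phi_xx = phi_sx = 0.0
    alpha = 0.9                       # recursive PSD averaging factor (assumed)
    win = np.hanning(block_len)
    msc = np.zeros(n_blocks)
    for k in range(n_blocks):
        S = np.fft.rfft(win * s[k * shift:k * shift + block_len])
        X = np.fft.rfft(win * x[k * shift:k * shift + block_len])
        phi_ss = alpha * phi_ss + (1 - alpha) * np.abs(S) ** 2
        phi_xx = alpha * phi_xx + (1 - alpha) * np.abs(X) ** 2
        phi_sx = alpha * phi_sx + (1 - alpha) * (S * np.conj(X))
        gamma = np.abs(phi_sx) ** 2 / (phi_ss * phi_xx + eps)
        msc[k] = gamma.mean()         # average over frequency bins as in (5)
    return msc
```

By the Cauchy–Schwarz inequality, each narrowband coherence value lies in [0, 1], so the frequency average does as well; a microphone dominated by incoherent noise yields a small value, a well-placed microphone a value close to one.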

Model-based utility estimation using Spectral Graph Partitioning
We first review the model-based realizations of [32,35] in Sections 3.1 and 3.2. Although there is no strictly analytical relation between the extracted feature values and the utility values, the approach is based on the notion that the PCCs of the different extracted feature sequences all reflect the same pair-wise similarity of the underlying microphone signals. Due to this model assumption, and to differentiate it from the ML-based approach in Section 4, which requires training data to determine the model parameters, this approach is termed model-based. Advancing the previous heuristic feature selection [36], we formulate the feature selection task as a LASSO regression problem with a sparsity-promoting regularizer in Section 3.3 to optimize the trade-off between accuracy and the number of features to be transmitted. Solving this optimization problem yields an optimal selection of features for a set of representative acoustic scenarios.

Node-wise feature extraction
There is a wide variety of potential signal features [37,38] to describe acoustic signals. Since acoustic scenarios are typically not static in practice due to, e.g., moving acoustic sources or obstructions, the usefulness of microphones is equally time-variant. Hence, within the comprehensive feature taxonomy in [37], we focus on features extracted from short observation intervals to characterize single-channel signals. The features may be computed in the time domain based on the digital signal waveform, or in the frequency domain based on the magnitude spectrum of the signals. As a result, we consider the following block-wise features:
• Envelope of waveform
• Zero-crossing rate
• Statistical moments (centroid, standard deviation, skewness, kurtosis) of the signal waveform
• Entropy of waveform
• Statistical moments (centroid, standard deviation, skewness, kurtosis) of the magnitude spectrum
• Spectral shape features (slope, power flatness, amplitude flatness, roll-off)
• Temporal variation of magnitude spectra (spectral flux, spectral flux of normalized magnitude spectra, spectral variation)
In [36], it was experimentally shown that three features (temporal skewness, temporal kurtosis, spectral flux) are suitable to recover the structure of the spatial MSC matrix of a set of microphone signals. However, the features were selected heuristically based on the visual similarity of the corresponding feature covariance matrices and the ground-truth MSC matrix. Therefore, a more rigorous discussion of the importance of specific signal features is provided in Section 3.3. Generally, a single feature sequence, i.e., a sequence of feature values over several time frames, is insufficient to characterize the signals, since the extraction of each signal feature can at best maintain information about the original signal [39], but typically incurs a loss of information.
When multiple sufficiently different features are used, they capture different parts of the information contained in the signals, such that they complement each other in describing the original signals. Thus, jointly processing such different features allows for a more accurate characterization of the signals compared to a single feature.
To this end, given a signal block x_j[k], let a_j^(f)[k] denote the observed value of the signal feature f ∈ {1, . . . , F } for said signal block. Collecting the feature values of different channels for time frame k yields the instantaneous feature vector a^(f)[k] = [a_1^(f)[k], . . . , a_J^(f)[k]]^T ∈ R^J. Unlike the signal waveforms, which require precise synchronization of the sampling clocks for joint processing, the feature values of different microphones are much less susceptible to asynchronous sampling. With only a single feature value every L_s signal samples, sampling rate offsets on the order of tens of parts per million (PPM) barely affect the extracted feature sequences. Hence, periodical coarse synchronization of the signal block boundaries is sufficient to avoid excessive drift of the observation windows in different network nodes, allowing synchronization to occur less frequently and with lower accuracy requirements.
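As a minimal illustration of the node-wise feature extraction, the sketch below computes the three features found suitable in [36] (temporal skewness, temporal kurtosis, spectral flux) for one signal block; the normalizations are illustrative assumptions and may differ from the exact definitions used in the paper:

```python
import numpy as np

def block_features(block, prev_mag=None):
    """Single-channel features for one signal block: temporal skewness,
    temporal kurtosis, and spectral flux w.r.t. the previous block's
    magnitude spectrum. Returns the feature vector and the current
    magnitude spectrum (to be passed as prev_mag for the next block)."""
    b = block - block.mean()
    std = b.std() + 1e-12
    skewness = np.mean((b / std) ** 3)          # 3rd standardized moment
    kurt = np.mean((b / std) ** 4)              # 4th standardized moment
    mag = np.abs(np.fft.rfft(block))
    if prev_mag is None:
        flux = 0.0                              # no previous spectrum yet
    else:
        flux = np.sqrt(np.sum((mag - prev_mag) ** 2))
    return np.array([skewness, kurt, flux]), mag
```

Running this per block over each microphone channel yields the feature sequences a_j^(f)[k]; only these few scalars per block, rather than the waveform, need to be transmitted to the AP.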

Utility estimation
In this section, we review the utility estimation scheme based on correlation of feature sequences originally proposed in [32,35,36] and show its relation to established Graph Bisection techniques. The model-based utility estimation comprises three steps, which are outlined in the following subsections:
1) Robustly estimate the cross-channel PCCs of the feature sequences separately for each feature via a set of KFs.
2) Fuse the information contained in the PCCs of different features.
3) Estimate each microphone's utility from the fused information by means of Spectral Graph Partitioning.
For clarity, a visual guide of these steps and the involved matrices and vectors is provided in Fig. 3.
Step 1) Feature correlation coefficients: For computing the PCCs, the cross-channel covariance matrices are first estimated for each feature f ∈ {1, . . . , F } separately. Therein, Ê denotes an approximate statistical expectation operator whose practical realization we discuss below. Furthermore, the observed matrix is the outer product of the instantaneous feature vector a^(f)[k] after subtracting its recursive temporal average, which is controlled by a recursive averaging factor. The underlying feature covariance is generally time-variant to account for the aforementioned SOI movement, and thus online estimation is preferred over batch estimation. In order to track this temporal variability, we use a separate KF [40] for each feature f. Let the latent state vector at time frame k be denoted by z^(f)[k], with the state-transition and observation models

z^(f)[k] = z^(f)[k−1] + t^(f)[k], (17)
ξ^(f)[k] = z^(f)[k] + o^(f)[k], (18)

where t^(f)[k] and o^(f)[k] denote the state-transition and observation noise vectors, respectively. In other words, the most probable state transition is that the utility, and hence the feature covariance, stays the same. However, if the state does change, it has no predictable preference direction. Similarly simple motion models are effectively used in acoustic echo cancellation [42] and dereverberation [43]. With (17) and (18), the KF simplifies to temporal smoothing, albeit with a time-variant smoothing constant. Compared to fixed averaging constants, this allows placing higher confidence in observations with high signal energy, i.e., likely SOI activity.
Assuming a normally distributed latent state vector z^(f)[k], like in [35], for simplicity and mathematical tractability leads to the prior distribution (14) with the aforementioned mean vector µ^(f)[k] and an associated state covariance matrix. Since the state-transition process is neither known nor easily modeled, we assume a zero-mean Gaussian random walk as transition distribution, as it is the least informative model but, due to the Central Limit Theorem (CLT) [44], fits well for natural processes where changes in the latent state are often the result of many independent influences. In order to remain agnostic to the source-microphone arrangement in different scenarios, the time-invariant and feature-independent process noise covariance matrix Σ_t is chosen as a scaled identity matrix, Σ_t = α_1 I, where α_1 ∈ R_+ is a positive tunable parameter. Intuitively, two closely spaced microphones produce similar feature sequences, and thus the way their estimated PCCs w. r. t. a third microphone change over time will be correlated. While these scenario-specific correlations could in principle be exploited for more accurate estimation by tailoring Σ_t to the scenario, doing so would harm the generalization of the transition model to other scenarios and would furthermore require the acquisition of sufficient data to estimate an optimal Σ_t. Therefore, to avoid biasing the random walk process, we choose not to model these correlations, i.e., we keep Σ_t diagonal.
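The reduction of the random-walk KF to recursive smoothing with a time-variant constant can be sketched for a single (scalar) covariance entry. The process-noise variance q plays the role of α_1; the per-frame observation-noise variances r_seq would in practice be tied to signal energy, so high-energy frames are trusted more. All parameter values here are illustrative assumptions:

```python
import numpy as np

def kf_smooth(observations, q=1e-3, r_seq=None):
    """Scalar random-walk Kalman filter. With identity state-transition
    and observation models, the update reduces to recursive averaging
    whose smoothing constant is the time-variant Kalman gain."""
    mu, p = 0.0, 1.0                       # initial state mean and variance
    out = np.empty(len(observations))
    for k, xi in enumerate(observations):
        r = 1.0 if r_seq is None else r_seq[k]
        p_pred = p + q                     # predict: random walk inflates variance
        gain = p_pred / (p_pred + r)       # time-variant smoothing constant
        mu = mu + gain * (xi - mu)         # update = recursive average
        p = (1.0 - gain) * p_pred
        out[k] = mu
    return out
```

Small q keeps the gain small (heavy smoothing); a small observation noise r in a given frame raises the gain for that frame, which mirrors placing higher confidence in observations with likely SOI activity.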
Choosing the least informative emission model for the observations ξ^(f)[k], i.e., zero-mean Gaussian observation noise, completes the KF. The update equations for the mean vector and state covariance, together with the Kalman gain matrix and the initial values, are given by (25) to (27) [45]. Note that the updates in (25) to (27) are carried out independently for each feature f. Step 2) Feature combination: As outlined earlier, the PCC matrices of different features B^(f)[k] capture different aspects of the underlying inter-channel coherence. To recover an estimate of the inter-channel coherence from the multiple feature correlation coefficient matrices, we consider channel-wise matrices C_j[k], where each C_j[k] contains the inter-channel PCCs of all F feature sequences of all J microphone channels w. r. t. the corresponding feature sequence of a reference channel j.
Note that each column of C_j[k], corresponding to one particular signal feature, models the same underlying inter-channel coherence. The PCCs of different features are then combined for each channel j by extracting the dominant column structure of C_j[k], i.e., finding its best rank-1 approximation in the Least Squares (LS) sense (32) [46]. Since C_j[k] is generally non-square, the solution of (32) is given by its principal singular vectors [46]. Thus, given an estimate from the previous time step, the principal singular vector can be estimated using the power method [46] as in (33) and (34). Initial experiments comparing the power method and a full SVD have shown that the spectrum of C_j[k] varies slowly over time, such that a single iteration of (33) and (34) is sufficient to accurately track the principal singular vector for the proposed utility estimation scheme.
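One tracking iteration of the power method in the spirit of (33) and (34) can be sketched as follows; the exact normalization and initialization conventions are assumptions:

```python
import numpy as np

def track_principal_singular_vector(C, u_prev):
    """One power-method iteration to track the principal left singular
    vector of the (generally non-square) matrix C_j[k], warm-started
    from the previous frame's estimate u_prev."""
    v = C.T @ u_prev                      # back-project onto feature space
    v /= np.linalg.norm(v) + 1e-12
    u = C @ v                             # forward-project onto channel space
    u /= np.linalg.norm(u) + 1e-12
    return u
```

Because the spectrum of C_j[k] varies slowly across frames, warm-starting from the previous estimate lets a single iteration per frame track the principal singular vector at a fraction of the cost of a full SVD.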
In order to restore the intuitive notion of a similarity measure, the estimated principal singular vectors from (34) are re-normalized such that the similarity of each channel to itself is equal to one, and then concatenated to form the overall channel similarity matrix R[k] in (35). Step 3) Spectral Graph Partitioning: Microphone selection is equivalent to partitioning the set of available microphones P into two, potentially time-variant, disjoint subsets comprising the selected and discarded microphones, respectively. Recalling the signal model (2), we use the convention that the former subset S[k] contains the microphones capturing the SOI with high quality, while the latter subset is its complement, containing those microphones dominated by the non-coherent noise field. Relaxing the hard assignment of microphones to these subsets to a soft assignment leads to continuous real-valued utility estimates, as shown in the following. Spectral partitioning techniques [31,47,48] operating on graph structures can determine such optimal partitionings very efficiently, especially when the number of microphones J is large. Thus, we model the pairwise similarity of microphone channels using a time-variant graph structure G(V, E[k]) [47], comprising a set of vertices V representing microphones and a set of weighted edges E[k] representing the microphones' similarity at time frame k. For each edge (j, j′, w_jj′[k]) ∈ E[k], the weight w_jj′[k] ∈ [0, 1] captures the similarity of microphones j and j′. The graph is equivalently specified by its weighted adjacency matrix W[k] ∈ R^{J×J}, containing all weights w_jj′[k], ∀j, j′ ∈ P. The pairwise microphone similarity should be a symmetric measure, i.e., channel j should be as similar to j′ as channel j′ is to j, such that w_jj′[k] = w_j′j[k]. To reflect this symmetry and the varying degrees of similarity, the graph should be undirected and weighted.
Since the matrix R[k] in (35) does not necessarily exhibit these properties, only the symmetric part of its element-wise magnitude is used to construct the weighted adjacency matrix W[k], i.e.,

W[k] = ( |R[k]| + |R[k]|^T ) / 2, (36)

where |·| denotes the element-wise magnitude. The degree [47] of the j-th vertex is defined as the sum of all outgoing edges' weights,

d_j[k] = Σ_{j′∈P} w_jj′[k], (37)

which are collected in the diagonal degree matrix D[k] = diag(d_1[k], . . . , d_J[k]). Note that d_j[k] ≥ 1, ∀j ∈ P, since the sum in (37) includes w_jj[k] = 1, which ensures invertibility of D[k] even for degenerate graphs.
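The construction of the adjacency and degree matrices from the channel similarity matrix can be sketched as follows, assuming R[k] has unit diagonal as stated above:

```python
import numpy as np

def adjacency_and_degree(R):
    """Build the weighted adjacency matrix W[k] as the symmetric part of
    the element-wise magnitude of R[k], and the diagonal degree matrix
    D[k] from the vertex degrees (37)."""
    A = np.abs(R)
    W = 0.5 * (A + A.T)          # enforce symmetry: w_jj' = w_j'j
    d = W.sum(axis=1)            # vertex degrees; includes w_jj = 1
    return W, np.diag(d)
```

Since every PCC magnitude is at most one and the self-similarity on the diagonal is one, all weights lie in [0, 1] and every degree is at least one, keeping D[k] invertible.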
For an ideal partitioning, like for clustering, it is desirable that microphone signals belonging to the same group are similar while microphone signals belonging to different groups are dissimilar to allow for a clear distinction between the selected and the discarded microphones.
Using (2) gives an interpretation in the context of microphone selection: SOI-dominated microphones exhibit strongly mutually correlated feature sequences and thus form one of the two partition subsets, while the feature sequences of noise-dominated microphones are only weakly correlated with those of the SOI-dominated microphones and thus form the other subset. In addition, even if the noise components n_j[t] are uncorrelated, their features likely are correlated, especially if they capture underlying statistics like variance. The inter-group and intra-group similarities of a set S[k] ⊂ P and its complement are measured by the similarity measures (39) and (40) [48], respectively. Balancing the inter- and intra-group similarity to avoid degenerate solutions yields the normalized cut objective function (41) [49]. As shown in [49], minimization of (41) can be relaxed to an eigenvalue problem (42) involving the normalized random-walk Laplacian matrix [47]. Thus, an approximate minimizer of (42) is obtained by finding the smallest eigenvalue and its corresponding eigenvector of L[k]. The trivial eigenvalue 0 and its corresponding eigenvector 1_J [48] are excluded by the constraint (43). Thus, the solution is the so-called Fiedler vector v[k], i.e., the eigenvector corresponding to the smallest non-trivial eigenvalue of L[k] [48], which automatically satisfies (43) as shown in [49]. While an approximate solution to the discrete problem can be obtained by discretizing v[k], e.g., based on the sign of each element, here we use the real-valued solution directly as an estimate of the microphones' utility.
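Computing the Fiedler vector of the similarity graph can be sketched as below. Since the random-walk Laplacian L = I − D⁻¹W shares its eigenvalues with the symmetric normalized Laplacian, the sketch solves the symmetric eigenproblem for numerical stability and maps the eigenvector back; this equivalence and the normalization conventions are standard spectral-clustering practice and may differ in detail from the paper:

```python
import numpy as np

def fiedler_utility(W):
    """Real-valued (sign-ambiguous) utility estimates as the Fiedler
    vector of the normalized Laplacian of the similarity graph W[k]."""
    d = W.sum(axis=1)
    d_isqrt = np.diag(d ** -0.5)
    L_sym = d_isqrt @ (np.diag(d) - W) @ d_isqrt   # symmetric normalized Laplacian
    evals, evecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    v = d_isqrt @ evecs[:, 1]                      # index 1 skips the trivial eigenvalue 0
    return v
```

For a graph with two weakly connected clusters of microphones, the entries of the returned vector separate the clusters by sign, matching the soft bi-partition interpretation.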
As an eigenvector, the scale and in particular the sign of v[k] are ambiguous, i.e., both −v[k] and v[k] are valid solutions to the eigenvalue problem (46). The same holds for the objective function (41), which is invariant to exchanging the subset S[k] with its complement. This ambiguity is usually not a problem for partitioning, since only the association of vertices to groups is desired, but not the identity of each group. In other words, the partitioning given by v[k] distinguishes between the most and least useful microphones, but does not say which group is which. Additionally, in low-SNR scenarios, noise-dominated microphone signals may exhibit large feature PCC values due to similar noise signal statistics despite only weakly coherent noise signals. To facilitate this distinction, we consider supplemental side information captured by a vector β[k], which is correlated with the preliminary utility estimates v[k] to obtain the PCC ρ[k]. Choices for β[k] are discussed below. Depending on the sign of the PCC ρ[k], the sign of the estimated utility values is flipped to produce the final utility estimates. In [32], the supplemental information was chosen as the node degree, i.e., [β[k]]_j = d_j[k]. While this choice allows detection of outliers if the volumes of the two subsets in the partition are very different, i.e., a large majority of microphones is either useful or not useful, it also requires further assumptions or knowledge about the identity of the majority group, e.g., that the majority of microphones observes the desired SOI. To address these shortcomings, we consider typical SOI and interfering signals: typical SOI signals, especially speech, are non-Gaussian and exhibit spectro-temporal structure. Meanwhile, typical signal degradations, like reverberation or additive non-coherent noise, exhibit less or no structure, thus reducing the structure of the acoustic mixture. Thus, the differential signal entropy [39] is used in (48) to capture the structuredness of the observed signals, as in [35].
Therein, the Probability Density Function (PDF) is estimated by its N_B-bin histogram,

ĥ_j = − Σ_{n_B=1}^{N_B} p̂_{n_B} log( p̂_{n_B} / (e_{n_B+1} − e_{n_B}) ),   (50)

with e_{n_B} denoting the histogram bin edges and p̂_{n_B} the relative frequency of samples in the n_B-th bin. Note that, for the experiments conducted in Section 5, the signal blocks used to estimate the entropy in (50) are chosen longer than those for the feature extraction. The entire microphone utility estimation procedure using Spectral Graph Partitioning is concisely summarized as pseudocode in Algorithm 1.
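The histogram-based entropy estimate and the subsequent sign disambiguation can be sketched as follows. The choice β_j = −ĥ_j is an assumption of this sketch (lower entropy indicating more signal structure and hence higher utility); function names are illustrative.

```python
import numpy as np

def hist_entropy(x, n_bins=64):
    """Histogram estimate of differential entropy:
    h ≈ -sum_n p_n * log(p_n / bin_width_n), cf. (50)."""
    counts, edges = np.histogram(x, bins=n_bins)
    p = counts / len(x)                    # relative bin frequencies
    widths = np.diff(edges)                # bin widths e_{n+1} - e_n
    nz = p > 0                             # skip empty bins (0*log0 := 0)
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

def disambiguate(v, signals, n_bins=64):
    """Flip the sign of the preliminary utility vector v so that it
    correlates positively with the entropy-based side information beta.
    Assumption: beta_j = -h_j, i.e., structured (low-entropy) signals
    should receive high utility."""
    beta = np.array([-hist_entropy(s, n_bins) for s in signals])
    rho = np.corrcoef(beta, v)[0, 1]       # PCC between beta and v
    return np.sign(rho) * v
```

A structured signal (e.g., a sinusoid) yields a lower entropy estimate than Gaussian noise of equal variance, so a preliminary estimate with inverted signs is flipped back.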
In the presence of point-like interferers, the signal model (2) no longer strictly holds, such that the proposed method should be understood as a first step towards developing methods for more general acoustic scenarios. Hence, somewhat degraded estimation performance must be expected, where the extent of degradation depends on the particular scenario. For example, in an ASN spanning two rooms, each with its own SOI, with only low-level cross-talk between rooms and low-level additive noise, groups of useful microphones for either SOI can be identified, which still matches well with the desired outcome. As a second example, consider an ASN in a single room with two closely spaced point sources. For temporally overlapping source activity with both sources contributing similar signal power to each microphone, all microphones exhibit reduced utility w. r. t. either source, as the other source is considered as noise, again matching qualitatively with reduced feature covariance. For source counting and for associating the microphone subsets with the correct SOI, additional mechanisms need to be developed that are beyond the scope of this paper.

Importance of specific signal features
Choosing an appropriate set of characteristic signal features for the microphone signals is vital: too few features result in low estimation accuracy, while too many features unnecessarily strain the wireless network. Even for an appropriate number of features, a poor choice of features may reduce overall estimation accuracy. To explore the importance of individual signal features, we formulate feature selection as an LS regression problem with a sparsity-promoting regularizer in (53) below, in order to obtain a low regression error while using as few features as possible. Specifically, we interpret the matrix C_j[k] as a dictionary matrix whose columns, or atoms, contain the cross-channel correlation coefficients between the reference channel j and all channels for one specific signal feature, and which are linearly combined to approximate the MSC of the observed microphone signals. However, for the purpose of estimating microphone utility and microphone selection, the relative utility of the microphone channels is more important than the absolute values, such that the zero-mean MSC vector, i.e.,

γ̄[k] = γ[k] − (1/J) (1_J^T γ[k]) 1_J,   (52)

is used as the target quantity. Thus, the ℓ1-regularized LS cost function for a single acoustic scenario comprising J microphone signals with K time frames is

C(φ) = Σ_{j=1}^{J} Σ_{k=1}^{K} ‖ γ̄[k] − C_j[k] φ ‖_2^2 + δ ‖φ‖_1,   (53)

where φ = [φ_1, . . . , φ_F]^T ∈ R^F captures the contribution of each feature and the parameter δ ∈ R_+ indirectly controls the sparsity of the vector, i.e., the number of used features. The results of this optimization are shown in Section 5.2.
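A minimizer of a cost of the form (53) can be sketched with proximal gradient descent (ISTA); the function name, shapes, and solver choice are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def lasso_feature_weights(C_list, gamma_bar_list, delta=1e-3, n_iter=2000):
    """Minimize sum_i ||gamma_bar_i - C_i @ phi||^2 + delta * ||phi||_1
    by ISTA. C_list: dictionary matrices (J x F), one per (j, k) pair;
    gamma_bar_list: corresponding zero-mean MSC targets (J,)."""
    F = C_list[0].shape[1]
    phi = np.zeros(F)
    # Upper bound on the Lipschitz constant of the quadratic term's gradient
    L = 2 * sum(np.linalg.norm(C, 2) ** 2 for C in C_list)
    for _ in range(n_iter):
        grad = sum(2 * C.T @ (C @ phi - g)
                   for C, g in zip(C_list, gamma_bar_list))
        z = phi - grad / L
        # Soft thresholding: proximal operator of (delta/L) * ||.||_1
        phi = np.sign(z) * np.maximum(np.abs(z) - delta / L, 0.0)
    return phi
```

For a small δ, the recovered weight vector closely matches a sparse ground truth; increasing δ drives more entries exactly to zero, which is what makes the weights usable for feature selection.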

Learning-based utility estimation
Artificial Neural Networks (ANNs) offer the ability to learn an optimum feature set (for given training data) to characterize the microphone signals, as well as an optimal combination of the features for estimating microphone utility. Thus, we propose learning-based alternatives to both the model-based feature extraction (see Section 3.1) and the utility estimation (see Section 3.2) subsystems. The extractor module in Fig. 4 realizes the feature extractor on the left-hand side of Fig. 2 (both in red), while the estimator module in Fig. 5 realizes the utility estimator on the right-hand side of Fig. 2 (both in blue). For both subsystems in Fig. 2, the ANN architectures are chosen to reflect the modeling capabilities of their model-based counterparts. Both modules are trained together in an end-to-end fashion. During inference, the extractor and utility estimator modules run on the network nodes and the AP, respectively, such that only the compressed feature representation needs to be transmitted to the AP.

Node-wise feature extraction
The signal features discussed in Section 3.3, although effective, are not necessarily optimally suited for utility estimation. Learning a set of features specifically tailored to characterizing microphone signals for the purpose of estimating utility promises improved accuracy and a more compact representation. The structure of the feature extractor module is depicted schematically in Fig. 4.
Recalling that the ground-truth utility is given by the MSC, spectral representations of the input data are an obvious choice. Since the phase of a signal is largely uninformative w. r. t. the SOI without a second signal for reference, we focus on models using the magnitude spectrum as input in the following. Initial experiments support this choice: models using the magnitude spectrum outperformed models using the time-domain waveform. The resulting halving of the model's input size is a welcome additional benefit.
Thus, the magnitude spectrum of a single microphone signal block x j [k] as defined in (3) is computed first. Due to the loss of phase information, this transform is not invertible and thus prevents the model from learning exact equivalents of the time-domain features in Section 3.1. The magnitude spectrum then passes through a series of fully connected feed-forward layers that get progressively narrower to condense information until a desired number of signal features is reached. The final batch normalization and Gated Recurrent Unit (GRU) layer allows the extractor module to learn features that describe the evolution of some quantity over time, e.g., spectral flux. Trained weights are shared between the instances of the module at different microphones, i.e., no sensor-specific features are extracted.

Utility estimation
The architecture of the utility estimator is shown in Fig. 5; it uses the concatenated feature vectors a_j[k] from the individual microphones as input. The memory of the first GRU layer allows the network to capture the temporal evolution of the feature sequences and to establish relations between the different microphone signals based on their extracted features. The following fully connected layers all contain the same number of neurons and are responsible for the regression of the GRU outputs onto the target MSC values. Passing the feature sequences themselves into the ANN, instead of the PCCs as in the model-based method in Section 3, allows the network to differentiate between useful and non-useful microphones, such that no separate disambiguation step or supplemental information is needed. Unlike in the model-based estimation in Section 3.2, the number of microphones J directly determines the number of neurons in the later fully connected layers. Thus, the model must be retrained whenever the number of microphones changes, but is capable of learning optimal feature representations. For practical applications, building a modular model, e.g., from microphone pair-wise submodels, could overcome this restriction at the cost of some modeling capability.
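The two modules can be sketched in PyTorch as follows. Layer widths follow the configuration reported in Section 5, but the class names, activation functions, and the final sigmoid (to keep outputs in the MSC range [0, 1]) are assumptions of this sketch, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """Per-microphone feature extractor (sketch): progressively narrower
    fully connected layers on the magnitude spectrum, followed by batch
    normalization and a GRU to capture temporal feature evolution."""
    def __init__(self, n_freq=513, n_feat=16):
        super().__init__()
        dims = [n_freq, 256, 128, 64, 32, n_feat]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.fc = nn.Sequential(*layers)
        self.bn = nn.BatchNorm1d(n_feat)
        self.gru = nn.GRU(n_feat, n_feat, batch_first=True)

    def forward(self, mag):                  # mag: (batch, frames, n_freq)
        h = self.fc(mag)
        h = self.bn(h.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(h)
        return out                           # (batch, frames, n_feat)

class Estimator(nn.Module):
    """Utility estimator (sketch): GRU over the concatenated features of
    all J microphones, then fully connected layers regressing onto the
    per-microphone MSC targets."""
    def __init__(self, n_mics=10, n_feat=16):
        super().__init__()
        self.gru = nn.GRU(n_mics * n_feat, n_mics, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(n_mics, n_mics), nn.ReLU(),
            nn.Linear(n_mics, n_mics), nn.ReLU(),
            nn.Linear(n_mics, n_mics), nn.Sigmoid())  # MSC lies in [0, 1]

    def forward(self, feats):                # feats: (batch, frames, J*n_feat)
        h, _ = self.gru(feats)
        return self.fc(h)                    # (batch, frames, J)
```

During inference, one `Extractor` instance (with shared weights) would run per node, and the `Estimator` would run at the AP on the concatenated features.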

Experimental validation
The algorithms from Sections 3 and 4 are evaluated on simulated and recorded acoustic data. The considered scenarios feature both static and moving SOIs, different room dimensions and reverberation times, and different arrangements of J = 10 microphones, some of which may be physically obstructed by objects. Although each microphone represents its own network node here, this does not conflict with the general assumptions outlined in Section 1.

Acoustic data
This section describes the different acoustic data used in the following experimental validation.

Simulated data
Microphone signals for a single SOI moving in a shoebox room are simulated using the image-source method [50,51]. The SOI trajectory is restricted to the Region of Interest (RoI), chosen as a horizontal plane at 1.2-m height with at least 1-m distance to the walls. The trajectory is spatially discretized such that successive SOI positions are at most 5 cm apart. The resulting set of time-variant Room Impulse Responses (RIRs) is then convolved with the corresponding SOI signal excerpts to obtain the microphone signals evoked by the moving SOI. Speech segments of 28 s duration, from both male and female speakers, are used as SOI signals. The source moves rapidly during the time intervals 8-10 s and 18-20 s, and otherwise moves slightly around a resting position to simulate the behavior of human speakers. With a maximum cross section through the RoI of about 8 m, the maximum possible SOI speed is about 4 m/s. Under these constraints, 20 different random source trajectories are generated. Three different rooms with typical living room-like acoustic properties (see Table 1) are considered. In each room, J = 10 cardioid microphones are placed at random positions and with random azimuthal rotation. In total, R_sim = 120 distinct acoustic setups (20 trajectories × 2 signals × 3 rooms) are simulated, resulting in 56 min of speech data. The generated SOI images are superimposed with spatially uncorrelated white noise of an equal, fixed level to attain an SNR of 10 dB at the microphone with the strongest source image on average. Due to the lower SOI contribution, other microphones have a lower average SNR. Figure 6 illustrates room A along with an exemplary source trajectory.
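The spatial discretization of a trajectory (successive positions at most 5 cm apart) can be sketched by linear interpolation between waypoints; the function name and waypoint representation are illustrative assumptions.

```python
import numpy as np

def discretize_trajectory(waypoints, max_step=0.05):
    """Insert intermediate points by linear interpolation so that
    successive source positions are at most max_step (here 5 cm) apart.
    waypoints: array of shape (N, 3) with Cartesian positions in meters."""
    out = [waypoints[0]]
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        dist = np.linalg.norm(b - a)
        n_seg = max(1, int(np.ceil(dist / max_step)))  # number of sub-steps
        for i in range(1, n_seg + 1):
            out.append(a + (b - a) * i / n_seg)
    return np.asarray(out)
```

Each discretized position would then be associated with one RIR of the time-variant set before convolution with the corresponding signal excerpt.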

Recorded data
The recorded acoustic data is obtained from J = 10 microphones arranged pair-wise in a quarter circle around a static loudspeaker representing the SOI, as shown in Fig. 7. Although the microphone capsules are omnidirectional, they exhibit nonuniform directivity due to being mounted in metal enclosures facing the SOI, which causes diffraction. SOI signals comprise male, female, and children's speech. Instead of a moving source, different usefulness of the microphones is induced by occluding some of the sensors. Obstacles may cover two microphone pairs as indicated in Fig. 7, or a single microphone pair. Additionally, the obstacles consist of different materials, i.e., solid wood, foam, and cloth, such that sound can permeate through some of them. In total, R_rec = 36 distinct acoustic setups (12 obstructions × 3 signals) are recorded, resulting in 36 min of speech data. As for the simulated data, spatially uncorrelated white noise is added to the recorded microphone signals to achieve an SNR of 10 dB.
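Scaling white noise to attain a target SNR can be sketched as follows. Note that this per-signal variant is a simplification: in the setups above, a single fixed noise level is chosen with respect to the strongest microphone, so the other channels end up with lower SNRs.

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng=None):
    """Add white Gaussian noise to x such that the resulting SNR
    (signal power over noise power) equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    p_sig = np.mean(x ** 2)                      # average signal power
    p_noise = p_sig / (10 ** (snr_db / 10))      # required noise power
    noise = rng.standard_normal(len(x)) * np.sqrt(p_noise)
    return x + noise, noise
```
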

Feature importance
Summing C(φ) in (53) over the R_sim = 120 experiment trials (see Section 5.1.1) and then minimizing the sum yields the feature weights φ depicted in Fig. 8. Naturally, higher values of δ result in sparser solutions, i.e., fewer selected features, ranging from 3 to 12 features for the considered range of δ. The most important features appear to be lower-order statistical moments of the temporal waveform (td_centroid, td_spread, td_skewness), higher-order statistical moments of the magnitude spectrum (sd_skewness, sd_kurtosis), and features capturing the temporal variation of the magnitude spectrum (sd_flux, sd_variation, sd_fluxnorm). For δ = 0.001, the selection comprises the four features td_skewness, sd_slope, sd_kurtosis, and sd_fluxnorm, two of which were also part of the heuristic selection made in [36]. To keep the number of selected features similar to prior work [32,35], we choose the aforementioned four features obtained for δ = 0.001 for the experimental validation in Section 5.3. Note that the obtained feature weights are so far only used for feature selection, but could possibly be used to improve the estimates' robustness in the future.

The differential entropy (50), since it is estimated by a histogram approach, uses longer blocks of 32 000 samples for more robust estimates. Due to the larger block size, the estimated differential entropy also changes more slowly over time, thus promoting temporal continuity of the estimated utility via (49). For the proposed model-based approach from Section 3, termed model-KF in the following, the microphone signals are characterized using the four features identified in Section 3.3, i.e., td_skewness, sd_slope, sd_kurtosis, and sd_fluxnorm. The temporal recursive smoothing factor in (14) is chosen as 0.99; the scaling factors for the KF process and observation noise are α_1 = 1 and α_2 = 50, respectively.

Accuracy of estimated utilities
For the learning-based approach, the extractor contains six fully connected layers with 513, 256, 128, 64, 32, and 16 neurons, respectively, followed by a single GRU layer with 16 inputs and 16 hidden states. Recall that L_b = 1024, such that the 513 inputs to the first layer correspond to the non-redundant part of the signal's magnitude spectrum. The utility estimator contains a single GRU layer with 16J = 160 inputs and 10 hidden states, followed by three fully connected layers with 10 neurons each. Since identical copies of the extractor module are run for each microphone channel j ∈ P, the total number of parameters is 175 000 for the extractor regardless of the number of microphones J, and 5 500 for the utility estimator with the above configuration, which scales asymptotically quadratically with the number of channels J. The objective function to be minimized in training is the Mean Square Error (MSE) between the estimated utility u[k; θ] and the ground-truth MSC γ[k],

C_MSE(θ) = Σ_k ‖ u[k; θ] − γ[k] ‖²,

where θ denotes the set of all trainable parameters. The model is trained by the Adam optimizer [52] with a learning rate of 10^−3. The available acoustic scenarios (120 for ML-sim, 36 for ML-tuned, and 156 for ML-joint, respectively) are split into 70% training and 30% validation data. Due to the combinatorial construction of the acoustic data (see Sections 5.1.1 and 5.1.2), the same speech signal may occur in both the training and the testing data. However, it never occurs in the same combination of source trajectory and simulated room, which are the predominant influencing factors of microphone utility.
A total of 8 algorithmic variants and baselines are evaluated. Note that we deliberately do not enforce a common constraint regarding communication cost, because our primary goal is to establish performance bounds. However, for practical application, the trade-off between accurate utility estimates and minimal communication costs must be carefully considered. First, two baseline variants, termed baseline-MSC and baseline-CDR, use the cross-microphone MSC and the Coherent-to-Diffuse power Ratio (CDR) as oracle features. For baseline-MSC, the estimated MSC values directly represent a normalized similarity of the respective microphones, such that they directly comprise B^(f)[k] in (30). The CDR is computed by a Direction of Arrival (DOA)-independent estimator [53] assuming diffuse noise coherence, which has been successfully used for weighting and selecting observations made by different microphones [33,54]. Because the CDR is not bounded, the diffuseness [53,55] is used in its place to construct B^(f)[k]. Note that although MSC and CDR imply oracle knowledge in the sense of signal availability at the AP, and thus transmission of the sensor signals, the MSC is still computed from time-limited observation windows and thus entails all of the associated estimation challenges, e.g., [56,57]. The same holds for the CDR, as it is based directly on the estimated MSC.
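Both oracle baselines rest on the estimated MSC. A minimal sketch of MSC estimation with recursive averaging of the auto- and cross-power spectra is given below; the window, hop, and smoothing factor are assumptions, not the paper's settings.

```python
import numpy as np

def msc(x, y, n_fft=1024, hop=512, alpha=0.9):
    """Estimate the magnitude-squared coherence between x and y by
    recursively averaging STFT auto-/cross-power spectra:
    MSC = |Phi_xy|^2 / (Phi_xx * Phi_yy). Returns the MSC of the
    final frame for each frequency bin."""
    win = np.hanning(n_fft)
    phi_xx = np.full(n_fft // 2 + 1, 1e-12)     # tiny init avoids division by 0
    phi_yy = np.full(n_fft // 2 + 1, 1e-12)
    phi_xy = np.zeros(n_fft // 2 + 1, dtype=complex)
    for start in range(0, len(x) - n_fft + 1, hop):
        X = np.fft.rfft(win * x[start:start + n_fft])
        Y = np.fft.rfft(win * y[start:start + n_fft])
        phi_xx = alpha * phi_xx + (1 - alpha) * np.abs(X) ** 2
        phi_yy = alpha * phi_yy + (1 - alpha) * np.abs(Y) ** 2
        phi_xy = alpha * phi_xy + (1 - alpha) * X * np.conj(Y)
    return np.abs(phi_xy) ** 2 / (phi_xx * phi_yy)
```

The time-limited averaging is exactly what causes the estimation challenges mentioned above: for fully coherent inputs the estimate approaches 1, while for independent signals it stays biased above 0 because only finitely many frames are averaged.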
The model-based system described in Section 3, including the KFs for covariance estimation, is termed model-KF. To evaluate the effectiveness of the KF, a variant of the proposed system that uses simple recursive temporal smoothing as in (14) for feature covariance estimation is also evaluated, termed model-smooth. Furthermore, to judge the modeling capabilities of the ML estimator, hybrid combines all 18 traditional features from Section 3.1 with the ML-based utility estimator module from Section 4.2. The computational complexity of the different features obviously varies; computing a single feature runs about 20-40× faster than real time single-threaded on a Core™ i5-6600K at 3.5 GHz. Entropy computation, being only slightly faster than real time, is a notable exception, but it is required for the disambiguation of solutions in (49).
In addition, three different training variants are investigated: the first variant, ML-sim, uses exclusively simulated data. For practical application, it is highly desirable to deploy a pre-trained model and fine-tune its parameters toward a new, unseen scenario. To this end, a copy of the pre-trained ML-sim is fine-tuned on recorded data, termed ML-tuned. Finally, a third version trained on both simulated and recorded data simultaneously, termed ML-joint, is also evaluated. In terms of simulation run time, the learning-based variants achieve about 40× real-time speeds, i.e., they are on par with the computation of a single model-based feature. Figure 9 shows the median, as well as the lower and upper quartiles, of the PCC r[k] across trials as a function of the time frame k for simulated data, which should ideally produce values close to 1. Whenever the source moves (indicated by the gray-shaded vertical areas in Fig. 9), the source-microphone distances change suddenly, causing the observed rapid decrease of r[k]. Both baseline methods baseline-MSC and baseline-CDR achieve limited performance due to the high relative noise levels in the microphone signals and the time-limited observation windows impeding accurate estimation. Both purely model-based variants model-KF and model-smooth exhibit good steady-state performance with PCCs around 0.9 and quick initial convergence and reconvergence after source motion. The KF variant model-KF achieves slightly better accuracy on average than model-smooth and is more robust, e.g., visible at around 4 s and 13 s. The trained hybrid offers only very small improvements over model-KF and model-smooth despite using all of the features, indicating that the four selected features for model-KF and model-smooth are close to optimal for these scenarios.
The learning-based ML-sim trained on matching data achieves very similar performance, trading more consistent performance when the SOI does not move for slightly slower reconvergence behavior. As expected, fine-tuning the ML model on recorded data significantly degrades performance on simulated data, as shown by ML-tuned. Finally, the ML model trained on both simulated and recorded data from the beginning, i.e., ML-joint, clearly outperforms all other considered methods, with only minor breakdowns and very fast recovery. Interestingly, incorporating recorded data alongside the simulated data into the training procedure also improves performance on simulated data. Convergence of all methods is very fast, reaching peak accuracy almost immediately after the SOI becomes active following an initial silence period of about 1 s.

Simulated data
One might be concerned that the ML models implicitly learn the temporal structure of the source movement here; however, our experiments with random time intervals of 4 to 12 s between two successive source movements have shown no noticeable degradation compared to fixed time intervals.

Recorded data
As for the simulated data, Fig. 10 shows the median and quartiles of r[k] across trials as a function of the time frame k for recorded data. Since the SOI is static, the usefulness of microphones is predominantly influenced by their occlusion and no clear temporal structure can be discerned. Both of the oracle baselines achieve consistent but limited performance with PCCs between 0.6 and 0.8. The advantage of baseline-CDR may be attributed to the diffuse noise coherence model, which enhances the contrast between microphones since residual coherence is considered as noise, particularly in low-frequency