 Empirical Research
 Open access
Microphone utility estimation in acoustic sensor networks using single-channel signal features
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 29 (2023)
Abstract
In multichannel signal processing with distributed sensors, choosing the optimal subset of observed sensor signals to be exploited is crucial in order to maximize algorithmic performance and reduce computational load, ideally both at the same time. In the acoustic domain, signal cross-correlation is a natural choice to quantify the usefulness of microphone signals, i.e., microphone utility, for coherent array processing, but its estimation requires that the uncoded signals are synchronized and transmitted between nodes. In resource-constrained environments like acoustic sensor networks, low data transmission rates often make transmission of all observed signals to the centralized location infeasible, thus discouraging direct estimation of signal cross-correlation. Instead, we employ characteristic features of the recorded signals to estimate the usefulness of individual microphone signals, using the Magnitude-Squared Coherence (MSC) between the source and the respective microphone signal as ground-truth metric. In this contribution, we provide a comprehensive analysis of model-based microphone utility estimation approaches that use signal features and, as an alternative, also propose machine learning-based estimation methods that identify optimal sensor signal utility features. The performance of both approaches is validated experimentally using both simulated and recorded acoustic data, comprising a variety of realistic and practically relevant acoustic scenarios including moving and static sources.
1 Introduction
An acoustic sensor network (ASN) comprises multiple spatially distributed microphones, including multiple distributed compact microphone arrays, that typically communicate wirelessly. Capturing different perspectives of the acoustic scene, the signals recorded by these distributed microphones encode spatial information exploitable by multichannel signal processing algorithms. These algorithms accomplish crucial tasks [1] like acoustic source localization [2,3,4] and tracking [5, 6], extraction and enhancement of an acoustic Source of Interest (SOI) [7,8,9], hands-free communication [10], acoustic monitoring [11, 12], and scene classification and acoustic event detection [13]. As the microphones in ASNs often have no common sampling clock, their signals must be synchronized before joint processing.
The performance of these signal processing algorithms is affected by many factors, including the proximity of the microphones to desired and undesired acoustic sources, reverberation, additive noise, and the orientation and occlusion of microphones. As a result, the signals obtained from different microphones are generally not equally useful for the above-mentioned tasks, and may even be detrimental in extreme cases if inappropriate importance is assigned to them. To ensure optimal algorithmic performance at minimum transmission and computational cost, a diligent selection of which observed microphone signals to process and which to discard is crucial in order to avoid unnecessary signal transmission or synchronization effort. Unsurprisingly, this task has received considerable attention in the literature: the selection of a single best channel for Automatic Speech Recognition (ASR) based on signal features has been explored in [14]. A utility measure specifically for Minimum Mean Square Error (MMSE) signal extraction has been proposed in [15, 16], followed by a distributed version in [17]. MMSE signal extraction under rate constraints, tailored specifically to hearing aids, was treated in [18,19,20]. Furthermore, joint microphone subset selection and speech enhancement using deep learning were proposed in [21]. Microphone subset selection to minimize communication cost with upper-bounded output noise power has been investigated for both Minimum Variance Distortionless Response (MVDR) [22] and Linearly Constrained Minimum Variance (LCMV) [23] beamforming. However, these methods either neglect the limitations of the underlying ASN regarding communication cost or are tailored to a specific application or cost function.
In the following, we present a different approach that overcomes both drawbacks, i.e., it requires only a low data transmission rate and is applicable to a broad class of signal processing applications that rely on coherent input signals.
Many multichannel algorithms, e.g., for signal enhancement or localization using compact arrays [7, 24], assume coherent, i.e., linearly related, input signals and exploit the spatial information captured by the interchannel phase differences. While this obviously applies to the signal components evoked by an SOI, it also holds for noise reference signals because they must admit a prediction, often linear, of residual noise components in order to suppress them. Thus, the cross-correlation of microphone signal pairs and measures derived from it, in particular the spatial coherence and the MSC, are intuitive measures for quantifying the usefulness of observed microphone signals and have been used in the literature for that purpose, e.g., in [25]. For synchronized microphones with sufficient transmission data rate, e.g., for wired compact microphone arrays, direct estimation of the interchannel coherence from the observed uncoded microphone signals to rate their utility is straightforward. However, in ASNs, this approach is often precluded by a limited transmission data rate, e.g., of current wireless networks [26], especially when the number of microphones is large. This issue is further compounded if the available data rate must be shared with other, possibly non-audio, applications, like video streaming in smart home environments. Furthermore, the microphone signals in ASNs generally do not share a common sampling clock [27]. While sampling time offsets are readily handled by suitable signal processing techniques [28], clock skew will often still pose a problem. Even when the accumulated sampling time offset within one processing block only amounts to fractions of a sampling period, imperfect cancellation can have a catastrophic effect on differential signal processing applications [29].
Furthermore, although the sampling rate variation across multiple copies of a single device can be very low [30], this may not necessarily be true for ASNs composed of heterogeneous, cheap consumer devices. Therefore, clock skew in ASNs should not be neglected in general. Thus, potentially costly synchronization of the signal waveforms is generally required prior to estimation of the coherence, which disqualifies direct estimation of the signal cross-correlation. To identify promising candidate microphone signals for synchronization and subsequent joint processing in ASNs without prior signal synchronization, other techniques are required.
To address these unique challenges of ASNs, we employ a compressed signal representation in the form of single-channel signal feature sequences, which are extracted from temporal blocks of the microphone signals, to reduce the amount of data to be transmitted by roughly two to three orders of magnitude. To accurately assess the communication cost, many additional factors would have to be considered, including, but not limited to, the radio-frequency environment and radio Signal-to-Noise Ratio (SNR) at the wireless transceivers, modulation and coding schemes, medium access control and arbitration (potentially via distributed algorithms), protocols and the associated overhead, and the temporal duration of transmission frames; such an analysis is beyond the scope of this contribution. While acknowledging the implied simplification, we use the data amount as a proxy due to its conceptual simplicity and its monotonic relation with the actual communication cost, i.e., reducing the data amount never increases the cost when the environment is constant. The employed features must be characteristic of the microphone signals, i.e., they must allow for (at least approximate) reconstruction of the interchannel MSC.
In this contribution, we consider acoustic scenarios often encountered in smart home applications, comprising a single SOI captured by multiple distributed microphones in an acoustic enclosure, as depicted in Fig. 1. After estimating the usefulness of the recorded microphone signals, a subset thereof is selected and transmitted to the central wireless Access Point (AP) for subsequent coherent multichannel signal processing. Although Fig. 1 shows an exemplary scenario with a wireless network and a central AP, this does not constrain the scope of the paper: the AP may be replaced by a network node acting as a local center implementing the multichannel signal processing algorithm. While the case of wired networks is also covered, their typically large transmission capacity may allow raw signal transmission and thereby limit the importance of the developed feature-based utility estimation. Instead of a specific signal processing algorithm, we consider a broad class of algorithms that rely on coherent input signals and do not require signals unrelated to the SOI, e.g., as noise references. In addition, we do not consider application-specific cost functions or performance metrics, e.g., Signal-to-Distortion Ratio (SDR) and Echo Return Loss Enhancement (ERLE), such that the proposed utility estimation scheme is appropriate for many subsequent multichannel signal processing applications. Instead, we generate utility estimates to match the ground-truth coherence between the SOI signal and the observed microphone signals. To this end, the proposed generic system comprises two subsystems, depicted in Fig. 2: a feature extraction system, a copy of which runs for each microphone signal on the associated network node, and a utility estimation system running on the AP. In the feature extraction stage, characteristic signal features are extracted from the observed microphone signals independently of each other.
No cross-channel features are employed so as not to exclude single-microphone network nodes. The feature sequences obtained from each microphone are then transmitted to the central AP, which estimates the individual microphones’ utility values by correlating the feature sequences. A set of Kalman Filters (KFs) with time-varying temporal smoothing provides a robust estimation framework for the feature covariance. Utility estimates are obtained by extracting structural information from the resulting covariance matrices via the corresponding Fiedler vector [31], which reflects a notion of average connectivity. A joint approach simultaneously considering single-channel features and network transmission cost was proposed in [32]. The efficacy of the proposed utility estimates for two specific important signal processing tasks, robust source localization and spatial filtering, was demonstrated in [33] and [34], respectively. Therein, sensor selection by optimizing the proposed utility measure has shown close-to-optimal performance, such that we focus only on the generic utility measure in this contribution.
In the remainder of this article, we review and provide more detailed descriptions of the model-based realizations of the two subsystems proposed in [32, 35] in Sections 3.1 and 3.2, explicitly stating and discussing the model assumptions of the KF. Formulating microphone selection as a graph bipartitioning problem, the Fiedler vector yields an optimum soft assignment of each individual microphone to one of the two groups of most and least useful microphones, which further justifies its use as a utility measure. In Section 3.3, we provide new results on the suitability of established signal features for recovering the interchannel MSC. To this end, the feature selection task is formulated as a Least Absolute Shrinkage and Selection Operator (LASSO) regression problem, which is then solved numerically to obtain an optimal set of signal features. In Section 4, we propose novel Machine Learning (ML)-based realizations for both subsystems whose combination can be learned in an end-to-end fashion, which constitutes a major contribution of this work. In Section 5, the efficacy of the proposed scheme and its individual components is validated. Different algorithmic variants, i.e., purely model-based, purely ML-based, and hybrid realizations of the proposed system, are investigated. To this end, comprehensive experiments on both synthesized and recorded data from realistic scenarios are conducted, including different reverberation times, additive noise and obstruction of sensors, different microphone arrangements, as well as static and moving SOIs.
2 Notation and signal model
In this article, scalar quantities are denoted by slanted non-bold symbols x, while vectors and matrices are denoted by boldface lowercase \(\textbf{x}\) and uppercase symbols \(\textbf{X}\), respectively. Furthermore, \([\textbf{x}]_{m}\) denotes the mth element of vector \(\textbf{x}\), and \([\textbf{X}]_{mm'}\) denotes the (m,\(m'\))th element of matrix \(\textbf{X}\). The M-dimensional all-zeros and all-ones vectors are denoted by \(\textbf{0}_{M}\) and \(\textbf{1}_{M}\), respectively, the \(M \times M\) identity matrix is denoted by \(\textbf{I}_{M}\), and the operator \(\textrm{Diag}(\cdot )\) embeds the elements of its argument on the main diagonal of a square matrix. The Pearson correlation coefficient (PCC) of two M-element vectors \(\textbf{x}\), \(\textbf{y}\) is defined as
\(\rho (\textbf{x},\textbf{y}) = \frac{\sum _{m=1}^{M} ([\textbf{x}]_{m} - \overline{x}) ([\textbf{y}]_{m} - \overline{y})}{\sqrt{\sum _{m=1}^{M} ([\textbf{x}]_{m} - \overline{x})^{2}} \sqrt{\sum _{m=1}^{M} ([\textbf{y}]_{m} - \overline{y})^{2}}}\)
with means \(\overline{x} = \frac{1}{M} \sum _{m=1}^{M}[\textbf{x}]_{m}\) and \(\overline{y} = \frac{1}{M} \sum _{m=1}^{M}[\textbf{y}]_{m}\). It will be used as a normalized similarity measure for features in Section 3.2 and as performance measure for the experiments in Section 5.
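As an illustrative sketch (not part of the proposed system), the PCC can be computed directly from this definition; the function name `pcc` is our own:

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient of two equal-length vectors,
    following the definition of rho(x, y) above."""
    xc = x - x.mean()                     # subtract mean x-bar
    yc = y - y.mean()                     # subtract mean y-bar
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# A sequence is perfectly correlated with any positively scaled,
# shifted copy of itself.
rng = np.random.default_rng(0)
a = rng.standard_normal(256)
print(pcc(a, 3.0 * a + 1.0))   # -> 1.0 (up to floating-point rounding)
```

Note that the PCC is invariant to affine rescaling of its arguments, which is what makes it a convenient normalized similarity measure for feature sequences with different dynamic ranges.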
In the following, let t denote the discrete-time sample index and let \(f \in \{1,\ldots ,F\}\) denote the feature index, where F is the number of extracted features per channel. Recalling Fig. 1, we consider an acoustic scenario comprising a single coherent SOI recorded by J microphones, each of which represents a separate node in the ASN. The signal captured by the microphone indexed by \(j \in \mathcal {P} = \{1,\ldots ,J\}\) is
\(x_{j}[t] = s[t] * h_{j}[t] + n_{j}[t], \quad (2)\)
where s[t] is the dry SOI signal, \(h_{j}[t]\) is the acoustic impulse response from the SOI to the jth microphone, and \(*\) denotes linear convolution. Note that the SOI is not necessarily static, i.e., the acoustic impulse responses \(h_{j}[t]\) in (2) are considered time-invariant only for short observation intervals, but may change from one interval to the next as the SOI moves. The fully coherent spatial images of the SOI \(s[t] * h_{j}[t]\) are superimposed by a spatially diffuse or incoherent noise field, such that the mutual coherence between the noise components \(n_{j}[t], \;\forall {j}\in \mathcal {P}\) is negligibly small. Thus, observed correlation between two microphone signals \(x_{j}[t]\), \(x_{j'}[t]\) is predominantly caused by the common SOI signal. Although competing point-like sources are not explicitly modeled in (2), the proposed method is still applicable given sufficient temporal sparsity, i.e., time intervals where only one of the sources is active, provided that the identity of the active source changes slowly enough to be tracked by the KF. Furthermore, consider an ASN spanning two rooms, each with its own SOI, connected by, e.g., open doors. With the microphones in each room predominantly capturing their respective SOI, the realizations of the proposed system in Sections 3 and 4 can still facilitate a distinction of microphones w. r. t. the dominant SOI. In this case, the scenario essentially decouples into two separate problems, but it is generally not known in advance which of the two possible solutions is found. To ensure a deterministic selection, additional source selection mechanisms exploiting preference information are needed, which is beyond the scope of this paper. In any case, the signal model (2) should be viewed as a first step towards developing methods for more general acoustic scenarios.
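For intuition, the signal model (2) can be simulated in a few lines; the helper `simulate_mics` and the toy impulse responses below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def simulate_mics(s, impulse_responses, noise_std=0.1, rng=None):
    """Generate microphone signals x_j[t] = s[t] * h_j[t] + n_j[t]
    with mutually uncorrelated additive noise, as in (2)."""
    if rng is None:
        rng = np.random.default_rng()
    return [np.convolve(s, h) + noise_std * rng.standard_normal(len(s) + len(h) - 1)
            for h in impulse_responses]

rng = np.random.default_rng(5)
s = rng.standard_normal(1000)                            # dry SOI signal
irs = [np.array([1.0, 0.5]), np.array([0.0, 0.8, 0.3])]  # toy impulse responses
x1, x2 = simulate_mics(s, irs, rng=rng)
print(len(x1), len(x2))   # -> 1001 1002 (full convolution lengths)
```

Here, each microphone receives a differently filtered, delayed copy of the same SOI plus independent noise, which is exactly the situation in which the microphone signals are coherent with the source but not with each other's noise.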
As the proposed utility estimation relies only on the correlation of feature sequences computed from signal frames, only a coarse synchronization of the signal frames between different sensors has to be ensured, which makes the proposed scheme practically relevant. However, we assume that the sensor signals are synchronous in order to compute the oracle MSC between each microphone and the source. The same holds for the pairwise complex coherence function between microphones, and thus the MSC, which form the foundation of the baselines baselineCDR and baselineMSC, respectively, in Section 5.
To this end, the signals are partitioned into blocks indexed by \(k\in \{1,\ldots ,K\}\) with a length of \({L_{\textrm{b}}}\) samples and a shift of \({L_{\textrm{s}}}\) between successive blocks, e.g., for the jth microphone signal \(x_{j}[t]\),
With the discrete frequency bin index \(n\in \{1,\ldots ,{L_{\textrm{b}}}\}\), let \(\hat{{\Phi }}_{s,x_{j}}[k,n]\), \(\hat{{\Phi }}_{s,s}[k,n]\) and \(\hat{{\Phi }}_{x_{j},x_{j}}[k,n]\) denote short-time estimates of the respective cross-Power Spectral Density (PSD) and auto-PSDs of s[t] and \(x_{j}[t]\), e.g.,
where \(\textrm{DFT}_{{L_{\textrm{b}}}}\) denotes the \({L_{\textrm{b}}}\)-point Discrete Fourier Transform (DFT). As a broadband ground-truth utility measure for the jth microphone, the frequency-averaged narrowband MSC
\(\hat{\gamma }_{j}[k] = \frac{1}{{L_{\textrm{b}}}} \sum _{n=1}^{{L_{\textrm{b}}}} \hat{\gamma }_{j}[k,n], \qquad \hat{\gamma }_{j}[k,n] = \frac{\vert \hat{{\Phi }}_{s,x_{j}}[k,n]\vert ^{2}}{\hat{{\Phi }}_{s,s}[k,n] \, \hat{{\Phi }}_{x_{j},x_{j}}[k,n]} \quad (5)\)
between the (latent) source signal s[t] and the jth microphone signal \(x_{j}[t]\) is used. Note that we drop the superscript \(\hat{\cdot }\) from \(\gamma _{j}[k]\) in (5) for notational simplicity in the following. Under the assumption that the SOI signal s[t] and noise signals \(n_{j}[t]\) are mutually uncorrelated, and that the acoustic impulse responses in (2) are much shorter than the DFT length, the approximations
hold. In practice, MSC estimates derived from (6) and (7) are subject to detrimental effects stemming from the combination of limited temporal observation intervals and the characteristics of the acoustic impulse responses between SOI and the microphones as captured by, e.g., relative time delay and Direct-to-Reverberation Ratio (DRR). Nevertheless, since coherent multichannel processing algorithms also degrade with the same impairments, the degradation in estimation accuracy of the coherence can be assumed to be correlated with the performance of signal processing algorithms and, hence, utility of the involved sensors.
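To make the ground-truth measure concrete, the following sketch estimates the frequency-averaged MSC of (5) with PSDs obtained by averaging periodograms over non-overlapping DFT blocks; this simple estimator and its parameters are our own illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def freq_avg_msc(s, x, L_b=256):
    """Frequency-averaged MSC between source s and microphone signal x,
    cf. (5), with PSDs estimated by averaging periodograms over
    non-overlapping blocks of L_b samples."""
    K = min(len(s), len(x)) // L_b
    S = np.fft.rfft(s[:K * L_b].reshape(K, L_b), axis=1)
    X = np.fft.rfft(x[:K * L_b].reshape(K, L_b), axis=1)
    Phi_sx = np.mean(np.conj(S) * X, axis=0)     # cross-PSD estimate
    Phi_ss = np.mean(np.abs(S) ** 2, axis=0)     # auto-PSD of s
    Phi_xx = np.mean(np.abs(X) ** 2, axis=0)     # auto-PSD of x
    msc = np.abs(Phi_sx) ** 2 / (Phi_ss * Phi_xx + 1e-12)
    return float(np.mean(msc))

rng = np.random.default_rng(1)
s = rng.standard_normal(4096)
n = rng.standard_normal(4096)
# At 0 dB SNR the narrowband MSC is SNR / (1 + SNR) = 0.5, so the
# frequency-averaged estimate should come out near 0.5 (with some
# positive bias due to the finite number of averaged blocks).
print(freq_avg_msc(s, s + n))
```

The positive bias for a finite number of averaged blocks is one concrete instance of the estimation impairments mentioned above.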
Then, the summands in (5) simplify to
\(\gamma _{j}[k,n] = \frac{\widehat{\textrm{SNR}}_{j}[k,n]}{\widehat{\textrm{SNR}}_{j}[k,n] + 1}\)
with the channelwise SNR
Clearly, the MSC is a function of the SNR, with extremal values \(\gamma _{j}[k,n]=0\) for \(\widehat{\textrm{SNR}}_{j}[k,n]=0\), and \(\gamma _{j}[k,n]\rightarrow 1\) for \(\widehat{\textrm{SNR}}_{j}[k,n]\rightarrow \infty\). The frequency-averaged source-microphone MSC values of all J microphones are collected in the vector
\(\varvec{\gamma }[k] = \big[ \gamma _{1}[k], \ldots , \gamma _{J}[k] \big]^{\textrm{T}}.\)
3 Model-based utility estimation using Spectral Graph Partitioning
We first review the model-based realizations of [32, 35] in Sections 3.1 and 3.2. Although there is no strictly analytical relation between the extracted feature values and the utility values, the approach is based on the notion that the PCCs of the different extracted feature sequences all reflect the same pairwise similarity of the underlying microphone signals. Due to this model assumption, and to differentiate it from the ML-based approach in Section 4, which requires training data to determine the model parameters, this approach is termed model-based. Advancing the previous heuristic feature selection [36], we formulate the feature selection task as a LASSO regression problem with a sparsity-promoting regularizer in Section 3.3 to optimize the trade-off between accuracy and the number of features to be transmitted. Solving this optimization problem yields an optimal selection of features for a set of representative acoustic scenarios.
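As a sketch of such a sparsity-promoting selection, the following minimal LASSO solver uses the ISTA proximal-gradient iteration on synthetic data; the solver, the data, and the regularization weight are our own illustrative choices, not the paper's actual feature-selection setup:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - X w||^2 + lam * ||w||_1 via ISTA
    (proximal gradient). Nonzero entries of w mark selected features."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2            # 1 / Lipschitz constant
    for _ in range(n_iter):
        w = w - step * (X.T @ (X @ w - y))            # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft threshold
    return w

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 10))                    # 10 candidate features
w_true = np.zeros(10)
w_true[[1, 4]] = [2.0, -1.5]                          # only two are informative
y = X @ w_true + 0.01 * rng.standard_normal(100)
w = lasso_ista(X, y, lam=5.0)
print(np.nonzero(np.abs(w) > 0.1)[0])                 # indices of selected features
```

Increasing `lam` drives more coefficients exactly to zero, which is precisely the accuracy-versus-number-of-features trade-off discussed above.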
3.1 Node-wise feature extraction
There is a wide variety of potential signal features [37, 38] to describe acoustic signals. Since acoustic scenarios are typically not static in practice due to, e.g., moving acoustic sources or obstructions, the usefulness of microphones is equally time-variant. Hence, within the comprehensive feature taxonomy in [37], we focus on features extracted from short observation intervals to characterize single-channel signals. The features may be computed in the time domain based on the digital signal waveform, or in the frequency domain based on the magnitude spectrum of the signals. Consequently, we consider the following blockwise features:

- Envelope of waveform
- Zero-crossing rate
- Statistical moments (centroid, standard deviation, skewness, kurtosis) of the signal waveform
- Entropy of waveform
- Statistical moments (centroid, standard deviation, skewness, kurtosis) of the magnitude spectrum
- Spectral shape features (slope, power flatness, amplitude flatness, roll-off)
- Temporal variation of magnitude spectra (spectral flux, spectral flux of normalized magnitude spectra, spectral variation)
In [36], it was experimentally shown that three features (temporal skewness, temporal kurtosis, spectral flux) are suitable to recover the structure of the spatial MSC matrix of a set of microphone signals. However, the features were selected heuristically based on the visual similarity of the corresponding feature covariance matrices and the ground-truth MSC matrix. Therefore, a more rigorous discussion of the importance of specific signal features is provided in Section 3.3.
Generally, a single feature sequence, i.e., a sequence of feature values over several time frames, is insufficient to characterize the signals, since the extraction of each signal feature can at best maintain information about the original signal [39], but typically incurs a loss of information. When multiple sufficiently different features are used, they capture different parts of the information contained in the signals, such that they complement each other in describing the original signals. Thus, jointly processing such different features allows for a more accurate characterization of the signals compared to a single feature.
To this end, given a signal block \(\textbf{x}_{j}[k]\), let \(a_{j}^{(f)}[k]\) denote the observed value of the signal feature \(f\in \{1,\ldots ,F\}\) for said signal block. Collecting the feature values of different channels for time frame k yields the instantaneous feature vector
\(\textbf{a}^{(f)}[k] = \big[ a_{1}^{(f)}[k], \ldots , a_{J}^{(f)}[k] \big]^{\textrm{T}}.\)
Unlike the signal waveforms, which require precise synchronization of the sampling clocks for joint processing, the feature values of different microphones are much less susceptible to asynchronous sampling. With only a single feature value every \({L_{\textrm{s}}}\) signal samples, sampling rate offsets on the order of tens of parts per million (ppm) barely affect the extracted feature sequences. Hence, periodic coarse synchronization of the signal block boundaries is sufficient to avoid excessive drift of the observation windows in different network nodes, allowing synchronization to occur less frequently and with lower accuracy requirements.
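As an illustration of the blockwise extraction, the following sketch computes three of the listed features (temporal skewness, temporal kurtosis, and spectral flux, the subset highlighted in [36]); the exact normalizations used in the paper may differ:

```python
import numpy as np

def block_features(block, prev_mag=None):
    """Three single-channel features for one signal block: temporal
    skewness, temporal kurtosis, and spectral flux (the flux needs the
    previous block's magnitude spectrum)."""
    b = block - block.mean()
    std = b.std() + 1e-12
    skewness = float(np.mean((b / std) ** 3))
    kurtosis = float(np.mean((b / std) ** 4))
    mag = np.abs(np.fft.rfft(block))                  # magnitude spectrum
    flux = 0.0 if prev_mag is None else float(np.linalg.norm(mag - prev_mag))
    return (skewness, kurtosis, flux), mag

rng = np.random.default_rng(2)
feats, mag = block_features(rng.standard_normal(512))
print(feats)   # Gaussian block: skewness near 0, kurtosis near 3, flux 0.0
```

Running this per block and per channel yields exactly the single-channel feature sequences \(a_{j}^{(f)}[k]\) that are transmitted in place of the raw waveforms.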
3.2 Utility estimation
In this section, we review the utility estimation scheme based on the correlation of feature sequences originally proposed in [32, 35, 36] and show its relation to established Graph Bisection techniques. The model-based utility estimation comprises three steps, which are outlined in the following subsections:

1. Robustly estimate the cross-channel PCCs of the feature sequences separately for each feature via a set of KFs.
2. Fuse the information contained in the PCCs from different features.
3. Estimate each microphone’s utility from the fused information by means of Spectral Graph Partitioning.
For clarity, a visual guide of these steps and the involved matrices and vectors is provided in Fig. 3.
Step 1) Feature correlation coefficients: For computing the PCCs, first the cross-channel covariance matrices
\(\textbf{B}^{(f)}[k] = \mathbb {\hat{E}}\big\{ \textbf{A}^{(f)}[k] \big\}\)
are estimated for each feature \(f \in \{1,\ldots ,F\}\) separately. Therein, \(\mathbb {\hat{E}}\) denotes an approximate statistical expectation operator whose practical realization we discuss below. Furthermore, the matrix
\(\textbf{A}^{(f)}[k] = \big( \textbf{a}^{(f)}[k] - \overline{\textbf{a}}^{(f)}[k] \big) \big( \textbf{a}^{(f)}[k] - \overline{\textbf{a}}^{(f)}[k] \big)^{\textrm{T}}\)
is the outer product of the instantaneous observed feature vector \(\textbf{a}^{(f)}[k]\) after subtracting its recursive temporal average
\(\overline{\textbf{a}}^{(f)}[k] = \lambda \, \overline{\textbf{a}}^{(f)}[k-1] + (1 - \lambda ) \, \textbf{a}^{(f)}[k],\)
controlled by the recursive averaging factor \(\lambda \in [0,1]\) with initial value \(\overline{\textbf{a}}^{(f)}[0] = \textbf{0}_{J}\).
Note that the estimated \(\textbf{B}^{(f)}[k]\) is generally time-variant to account for the aforementioned SOI movement, and thus online estimation is preferred over batch estimation. In order to track this temporal variability, we use a separate KF [40] for each feature f. Let the latent state vector at time frame k be denoted by \(\textbf{z}^{(f)}[k]\). Its mean vector \(\varvec{\mu }^{(f)}[k]\) captures the covariance matrix \(\textbf{B}^{(f)}[k]\) to be estimated and the instantaneous observation vector \(\varvec{\xi }^{(f)}[k]\) captures the matrix \(\textbf{A}^{(f)}[k]\). Since both \(\textbf{B}^{(f)}[k]\) and \(\textbf{A}^{(f)}[k]\) are symmetric, it is sufficient to consider only their non-redundant elements. We choose their diagonal and lower triangular elements, such that the dimensionality of the state vector \(\textbf{z}^{(f)}[k]\), the mean vector \(\varvec{\mu }^{(f)}[k]\), and the observation vector \(\varvec{\xi }^{(f)}[k]\) can be chosen to be only \(Q = \frac{J (J+1)}{2}\) instead of \(J^2\) while still precisely modeling the full matrices. This can be expressed compactly using the half-vectorization operator \(\textrm{vech}\) [41] collecting all relevant matrix elements in vectors, i.e.,
\(\varvec{\mu }^{(f)}[k] = \textrm{vech}\big( \textbf{B}^{(f)}[k] \big), \qquad \varvec{\xi }^{(f)}[k] = \textrm{vech}\big( \textbf{A}^{(f)}[k] \big).\)
The state-transition and observation model equations of the KF are
\(\textbf{z}^{(f)}[k] = \textbf{z}^{(f)}[k-1] + \textbf{t}^{(f)}[k], \quad (17)\)
\(\varvec{\xi }^{(f)}[k] = \textbf{z}^{(f)}[k] + \textbf{o}^{(f)}[k], \quad (18)\)
where \(\textbf{t}^{(f)}[k]\) and \(\textbf{o}^{(f)}[k]\) denote the state-transition and observation noise vectors, respectively. In other words, the most probable state transition is that the utility, and hence the feature covariance, stays the same. However, if the state does change, there is no predictable preferred direction. Similarly simple motion models are used effectively in acoustic echo cancellation [42] and dereverberation [43]. With (17) and (18), the KF simplifies to temporal smoothing, albeit with a time-variant smoothing constant. Compared to fixed averaging constants, this allows placing higher confidence in observations with high signal energy, i.e., likely SOI activity.
Assuming a normally distributed latent state vector \(\textbf{z}^{(f)}[k]\) as in [35] for simplicity and mathematical tractability leads to the prior distribution
\(p\big( \textbf{z}^{(f)}[k] \big) = \mathcal {N}\big( \textbf{z}^{(f)}[k] \,;\, \varvec{\mu }^{(f)}[k], \varvec{\Sigma }_{\textrm{s}}^{(f)}[k] \big)\)
with the aforementioned mean vector \(\varvec{\mu }^{(f)}[k]\) and covariance matrix \(\varvec{\Sigma }_{\textrm{s}}^{(f)}[k] \in \mathbb {R}^{Q \times Q}\). Since the trend of \(\textbf{z}^{(f)}[k]\) is neither known nor easily modeled, we assume a zero-mean Gaussian random walk with transition distribution
\(p\big( \textbf{z}^{(f)}[k] \,\big|\, \textbf{z}^{(f)}[k-1] \big) = \mathcal {N}\big( \textbf{z}^{(f)}[k] \,;\, \textbf{z}^{(f)}[k-1], \varvec{\Sigma }_{\textrm{t}} \big),\)
as it is the least informative model but, due to the Central Limit Theorem (CLT) [44], fits natural processes well, where changes in the latent state are often the result of many independent influences. In order to remain agnostic to the source-microphone arrangement in different scenarios, the time-invariant and feature-independent process noise covariance matrix is chosen as a scaled identity matrix
\(\varvec{\Sigma }_{\textrm{t}} = \alpha _{1} \textbf{I}_{Q},\)
where \(\alpha _1 \in \mathbb {R}^{+}\) is a positive tunable parameter. Intuitively, two closely spaced microphones produce similar feature sequences, and thus the way their estimated PCCs w. r. t. a third microphone change over time will be correlated. While these scenario-specific correlations could in principle be exploited for more accurate estimation by tailoring \(\varvec{\Sigma }_{\textrm{t}}\) to the scenario, doing so would harm the generalization of the transition model to other scenarios and would furthermore require sufficient data to estimate an optimal \(\varvec{\Sigma }_{\textrm{t}}\). Therefore, to avoid biasing the random walk process, we choose not to model these correlations, i.e., we keep \(\varvec{\Sigma }_{\textrm{t}}\) diagonal.
Choosing the least informative emission model for the observations \(\varvec{\xi }^{(f)}[k]\) as well, for simplicity, yields the multivariate Gaussian emission distribution
\(p\big( \varvec{\xi }^{(f)}[k] \,\big|\, \textbf{z}^{(f)}[k] \big) = \mathcal {N}\big( \varvec{\xi }^{(f)}[k] \,;\, \textbf{z}^{(f)}[k], \varvec{\Sigma }_{\textrm{o}}[k] \big)\)
with the observation noise covariance matrix
Therein, \(\alpha _2 \in \mathbb {R}^{+}\) is a positive tunable parameter and the matrix \(\textbf{E}[k] \in \mathbb {R}^{J\times {J}}\) contains the geometric means of the signal frame energies \(e_{j}[k] = \Vert \textbf{x}_{j}[k] \Vert _{2}^{2}\) (see (3)) reflecting the signal variances for each microphone pair, i.e.,
\([\textbf{E}[k]]_{jj'} = \sqrt{e_{j}[k] \, e_{j'}[k]}.\)
The small positive constant \(\epsilon\) ensures invertibility of \(\varvec{\Sigma }_{\textrm{o}}[k]\) in (23) during speech absence periods. This choice is motivated by the notion that the observed feature values are better at characterizing the microphone signals during time frames with high signal energy \(e_{j}[k]\), i.e., the observation noise of the KF is inversely related to the signal energy \(e_{j}[k]\).
With all components of the KF in place, the update equations for the mean vector and state covariance are [45]
\(\varvec{\mu }^{(f)}[k] = \varvec{\mu }^{(f)}[k-1] + \textbf{K}^{(f)}[k] \big( \varvec{\xi }^{(f)}[k] - \varvec{\mu }^{(f)}[k-1] \big), \quad (25)\)
\(\varvec{\Sigma }_{\textrm{s}}^{(f)}[k] = \big( \textbf{I}_{Q} - \textbf{K}^{(f)}[k] \big) \big( \varvec{\Sigma }_{\textrm{s}}^{(f)}[k-1] + \varvec{\Sigma }_{\textrm{t}} \big) \quad (26)\)
with the Kalman gain matrix
\(\textbf{K}^{(f)}[k] = \big( \varvec{\Sigma }_{\textrm{s}}^{(f)}[k-1] + \varvec{\Sigma }_{\textrm{t}} \big) \big( \varvec{\Sigma }_{\textrm{s}}^{(f)}[k-1] + \varvec{\Sigma }_{\textrm{t}} + \varvec{\Sigma }_{\textrm{o}}[k] \big)^{-1} \quad (27)\)
and initial values
Note that the updates in (25) to (27) can be computed very efficiently since all involved matrices are diagonal.
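Because the state-transition and observation models are identity mappings and all covariances are diagonal, one KF step reduces to elementwise scalar operations. The following sketch illustrates such a step for the update equations (25) to (27); the variable names and test values are our own:

```python
import numpy as np

def kf_update(mu, P, xi, sigma_t2, sigma_o2):
    """One random-walk KF step with identity state-transition and
    observation models and diagonal covariances, cf. (25) to (27).
    All vectors hold the diagonal entries elementwise."""
    P_pred = P + sigma_t2                  # predict: add process noise
    gain = P_pred / (P_pred + sigma_o2)    # elementwise Kalman gain
    mu_new = mu + gain * (xi - mu)         # time-variant smoothing of the mean
    P_new = (1.0 - gain) * P_pred          # posterior state variance
    return mu_new, P_new

mu, P = np.zeros(3), np.ones(3)
xi = np.ones(3)                            # current observation
# High observation noise -> small gain -> heavy smoothing toward the prior.
mu_s, _ = kf_update(mu, P, xi, sigma_t2=0.01, sigma_o2=np.full(3, 100.0))
# Low observation noise -> gain near one -> the observation dominates.
mu_f, _ = kf_update(mu, P, xi, sigma_t2=0.01, sigma_o2=np.full(3, 1e-4))
print(mu_s[0], mu_f[0])   # small value vs. value close to 1
```

This makes the interpretation as temporal smoothing with a time-variant smoothing constant explicit: frames with low observation noise, i.e., high signal energy, pull the estimate strongly toward the new observation.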
For each time frame, after updating the KFs for all features \(f \in \{1,\ldots ,F\}\), the elements of the covariance matrix \(\textbf{B}^{(f)}[k]\) are recovered from the KF mean vector \(\varvec{\mu }^{(f)}[k]\) by reversing the half-vectorization, i.e.,
\(\textbf{B}^{(f)}[k] = \textrm{vech}^{-1}\big( \varvec{\mu }^{(f)}[k] \big).\)
Finally, the elements of the per-feature PCC matrices \(\tilde{\textbf{B}}^{(f)}[k] \quad \forall {f}\) are obtained from the estimated covariances by normalization according to
\(\big[ \tilde{\textbf{B}}^{(f)}[k] \big]_{jj'} = \frac{\big[ \textbf{B}^{(f)}[k] \big]_{jj'}}{\sqrt{\big[ \textbf{B}^{(f)}[k] \big]_{jj} \big[ \textbf{B}^{(f)}[k] \big]_{j'j'}}}.\)
Step 2) Feature combination: As outlined earlier, the PCC matrices of different features \(\tilde{\textbf{B}}^{(f)}[k]\) capture different aspects of the underlying interchannel coherence. To recover an estimate of the interchannel coherence from the multiple feature correlation coefficient matrices, we consider channelwise matrices
where each \(\textbf{C}_{j}[k]\) contains the interchannel PCCs of all F feature sequences of all J microphone channels w. r. t. the corresponding feature sequence of a reference channel j.
Note that each column of \(\textbf{C}_{j}[k]\), corresponding to one particular signal feature, models the same underlying interchannel coherence. The PCCs of different features are then combined for each channel j by extracting the dominant column structure of \(\textbf{C}_{j}[k]\), i.e., finding its best rank-1 approximation in the Least Squares (LS) sense [46]
\(\big\{ \sigma _{j}[k], \textbf{r}_{j}[k], \textbf{t}_{j}[k] \big\} = \underset{\sigma ,\, \textbf{r},\, \textbf{t}}{\textrm{argmin}} \, \big\Vert \textbf{C}_{j}[k] - \sigma \, \textbf{r} \textbf{t}^{\textrm{T}} \big\Vert _{\textrm{F}}^{2}, \qquad \Vert \textbf{r} \Vert _{2} = \Vert \textbf{t} \Vert _{2} = 1. \quad (32)\)
Since \(\textbf{C}_{j}[k]\) is generally non-square, the solution of (32) is obtained by the Singular Value Decomposition (SVD), where \(\sigma _{j}[k] \in \mathbb {R}^{+}\) is the largest singular value of \(\textbf{C}_{j}[k]\), and \(\textbf{r}_{j}[k] \in \mathbb {R}^{J}\) and \(\textbf{t}_{j}[k] \in \mathbb {R}^{F}\) are the principal left and right singular vectors, respectively. The principal left singular vector \(\textbf{r}_{j}[k]\) captures the contribution of each channel to the dominant structure of \(\textbf{C}_{j}[k]\), while the principal right singular vector \(\textbf{t}_{j}[k]\) captures the contribution of each feature to the dominant structure.
To facilitate tracking of \(\textbf{r}_{j}[k]\) in time-variant scenarios and to avoid recomputation of the full SVD in each time step, the principal left singular vector is instead iteratively refined over time. To this end, recall that the left singular vectors of \(\textbf{C}_{j}[k]\) are identical to the eigenvectors of the Gram matrix \(\textbf{C}_{j}[k] \textbf{C}_{j}^{\textrm{T}}[k]\) [46]. Thus, given an estimate from the previous time step, the principal singular vector can be estimated using the power method [46] as
\(\tilde{\textbf{r}}_{j}[k] = \textbf{C}_{j}[k] \textbf{C}_{j}^{\textrm{T}}[k] \, \textbf{r}_{j}[k-1], \quad (33)\)
\(\textbf{r}_{j}[k] = \tilde{\textbf{r}}_{j}[k] \, / \, \Vert \tilde{\textbf{r}}_{j}[k] \Vert _{2}. \quad (34)\)
Initial experiments comparing the power method and full SVD have shown that the spectrum of \(\textbf{C}_{j}[k]\) varies slowly over time such that a single iteration of (33) and (34) is sufficient to accurately track the principal singular vector for the proposed utility estimation scheme.
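A single tracking step of (33) and (34) can be sketched as follows; for the test below, we iterate to convergence and compare against a full SVD, which is our own illustrative check rather than part of the proposed online scheme:

```python
import numpy as np

def power_step(C, r_prev):
    """One power-method iteration, cf. (33) and (34): multiply by the
    Gram matrix C C^T, then renormalize to unit length."""
    v = C @ (C.T @ r_prev)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(3)
C = rng.standard_normal((5, 4))            # J = 5 channels, F = 4 features
r = power_step(C, np.ones(5) / np.sqrt(5))
for _ in range(50):                        # iterate to convergence here
    r = power_step(C, r)
u_true = np.linalg.svd(C)[0][:, 0]         # principal left singular vector
print(abs(r @ u_true))                     # -> close to 1 (aligned)
```

In the online setting, the previous frame's estimate is already close to the current principal direction, which is why a single such step per frame suffices.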
In order to restore the intuitive notion of a similarity measure, the estimated principal singular vectors from (34) are renormalized such that the similarity of each channel to itself is equal to one, and then concatenated to form the overall channel similarity matrix
Step 3) Spectral Graph Partitioning: Microphone selection is equivalent to partitioning the set of available microphones \(\mathcal {P}\) into two, potentially time-variant, disjoint subsets comprising the selected and discarded microphones, respectively. Recalling the signal model (2), we use the convention that the former subset \(\mathcal {S}[k]\) contains the microphones capturing the SOI with high quality, while the latter subset is its complement \(\overline{\mathcal {S}}[k]\) for those microphones dominated by the non-coherent noise field. Relaxing the hard assignment of microphones to these subsets to a soft assignment leads to continuous real-valued utility estimates as shown in the following. Spectral partitioning techniques [31, 47, 48] operating on graph structures can determine such optimal partitionings very efficiently, especially when the number of microphones J is large.
Thus, we model the pairwise similarity of microphone channels using a time-variant graph structure \(\mathcal {G}(\mathcal {V},\mathcal {E}[k])\) [47], comprising a set of vertices \(\mathcal {V}\) representing microphones and a set of weighted edges \(\mathcal {E}[k]\) representing the microphones’ similarity at time frame k. For each edge \((j,j',w_{jj'}[k]) \in \mathcal {E}[k]\), the weight \(w_{jj'}[k] \in [0,1]\) captures the similarity of microphones j and \(j'\). The graph is equivalently specified by its weighted adjacency matrix \(\textbf{W}[k] \in \mathbb {R}^{J \times J}\), containing all weights \(w_{jj'}[k], \forall j, j' \in \mathcal {P}\). The pairwise microphone similarity should be a symmetric measure, i.e., channel j should be as similar to \(j'\) as channel \(j'\) is to j, such that \(w_{jj'}[k] = w_{j'j}[k]\). To reflect this symmetry and the varying degrees of similarity, the graph should be undirected and weighted. Since the matrix \(\textbf{R}[k]\) in (35) does not necessarily exhibit these properties, only the symmetric part of its element-wise magnitude is used to construct the weighted adjacency matrix \(\textbf{W}[k]\), i.e.,
The degree [47] of the jth vertex is defined as the sum of all outgoing edges’ weights
which are collected in the diagonal degree matrix
Note that \(d_{j}[k] \ge 1, \forall {j}\in \mathcal {P}\) since the sum in (37) includes \(w_{jj}[k]=1\), which ensures invertibility of \(\textbf{D}[k]\) even for degenerate graphs.
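A minimal sketch of this graph construction, using a randomly generated stand-in for the channel-similarity matrix \(\textbf{R}[k]\) (hypothetical values, for illustration only), could look as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 6
# stand-in for R[k]: its diagonal is one after the renormalization step,
# but it is generally neither symmetric nor non-negative
R = rng.uniform(-1, 1, (J, J))
np.fill_diagonal(R, 1.0)

A = np.abs(R)             # element-wise magnitude
W = 0.5 * (A + A.T)       # symmetric part -> weighted adjacency matrix
d = W.sum(axis=1)         # vertex degrees, each including w_jj = 1
D = np.diag(d)            # diagonal degree matrix
```

Because each degree includes the self-similarity weight \(w_{jj}[k]=1\), all diagonal entries of \(\textbf{D}[k]\) are at least one, so the matrix remains invertible.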
For an ideal partitioning, like for clustering, it is desirable that microphone signals belonging to the same group are similar while microphone signals belonging to different groups are dissimilar to allow for a clear distinction between the selected and the discarded microphones.
Using (2) gives an interpretation in the context of microphone selection: SOI-dominated microphones exhibit strongly mutually correlated feature sequences and thus form one of the two partition subsets, while the feature sequences of noise-dominated microphones are only weakly correlated with those of the SOI-dominated microphones, and thus form the other subset. In addition, even if the noise components \(n_{j}[t]\) are uncorrelated, their features are likely correlated, especially if they capture underlying statistics like variance. These inter-group and intra-group similarities of a set \(\mathcal {S}[k] \subset \mathcal {P}\) and its complement \(\overline{\mathcal {S}}[k]\) are measured by [48]
respectively. Balancing the inter- and intra-group similarity to avoid degenerate solutions yields the normalized cut objective function [49]
As shown in [49], minimization of (41) w. r. t. \(\mathcal {S}\) can be reformulated as minimization of the generalized Rayleigh quotient
where \(\textbf{i}[k]\) is a Jdimensional discrete indicator vector satisfying
Additionally, the elements of \(\textbf{i}[k]\) may only take either of two values [49]
When the discreteness constraint (44) on \(\textbf{i}[k]\) is relaxed to allow arbitrary real values, i.e., \(\textbf{i}[k] \in \mathbb {R}^{J}\), the minimizer of the generalized Rayleigh quotient in (42) is a solution to the generalized eigenvalue problem
where \(\lambda [k]\) is the generalized eigenvalue and \(\textbf{i}[k]\) is the generalized eigenvector. The equivalent standard eigenvalue problem is obtained by left-multiplication with \(\textbf{D}^{-1}[k]\)
with the normalized random-walk Laplacian matrix [47]
Thus, an approximate minimizer of (42) is obtained by finding the smallest eigenvalue and its corresponding eigenvector of \(\textbf{L}[k]\). The trivial eigenvalue 0 and its corresponding eigenvector \(\textbf{1}_{J}\) [48] are excluded by the constraint (43). Thus, the solution is the so-called Fiedler vector \(\textbf{v}[k]\), i.e., the eigenvector corresponding to the smallest non-trivial eigenvalue of \(\textbf{L}[k]\) [48], which automatically satisfies (43) as shown in [49]. While an approximate solution to the discrete problem could be obtained by discretizing \(\textbf{v}[k]\), e.g., based on the sign of each element, here we use the real-valued solution directly as an estimate of the microphones’ utility.
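The following toy example (NumPy, with a hypothetical two-group similarity matrix rather than real feature PCCs) illustrates how the Fiedler vector separates the two microphone groups. For numerical robustness, the eigenvectors of the random-walk Laplacian are obtained from the similar symmetric normalized Laplacian:

```python
import numpy as np

# toy similarity graph: microphones {0,1,2} are mutually similar
# (e.g., SOI-dominated), {3,4,5} form a second group, with weak
# cross-group similarity
J = 6
W = np.full((J, J), 0.1)
W[:3, :3] = 0.9
W[3:, 3:] = 0.9
np.fill_diagonal(W, 1.0)

d = W.sum(axis=1)
Dm12 = np.diag(1.0 / np.sqrt(d))          # D^{-1/2}
L_sym = np.eye(J) - Dm12 @ W @ Dm12       # symmetric normalized Laplacian

# eigenvectors of the random-walk Laplacian are D^{-1/2} times those of L_sym
evals, evecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
fiedler = Dm12 @ evecs[:, 1]              # smallest non-trivial eigenvector
```

In this idealized setting, the entries of `fiedler` share one sign for the first group and the opposite sign for the second, i.e., the relaxed normalized-cut solution recovers the intended partition.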
As an eigenvector, the scale and in particular the sign of \(\textbf{v}[k]\) are ambiguous, i.e., both \(\textbf{v}[k]\) and \(-\textbf{v}[k]\) are valid solutions to the eigenvalue problem (46). The same holds for the objective function (41), which is invariant to exchanging \(\mathcal {S}[k]\) with \(\overline{\mathcal {S}}[k]\). This ambiguity is usually not a problem for partitioning, since only the association of vertices to groups is desired, but not the identity of each group. In other words, the partitioning given by \(\textbf{v}[k]\) distinguishes between the most and least useful microphones, but does not indicate which group is which. Additionally, in low-SNR scenarios, noise-dominated microphone signals may exhibit large feature PCC values due to similar noise signal statistics despite only weakly coherent noise signals. To facilitate this distinction, we consider supplemental side information captured by a vector \(\varvec{\beta }[k]\) which is correlated with the preliminary utility estimates
Choices for \(\varvec{\beta }[k]\) are discussed below. Depending on the sign of the PCC \(\rho [k]\), the sign of the estimated utility values is flipped to produce the final utility estimates
In [32], the supplemental information was chosen as the node degree, i.e., \(\left[ \varvec{\beta }[k]\right] _{j} = d_{j}[k]\). While this choice allows detection of outliers if the volumes of the two subsets in the partition are very different, i.e., a large majority of microphones is either useful or not useful, it also requires further assumptions or knowledge about the identity of the majority group, e.g., that the majority of microphones observes the desired SOI. To address these shortcomings, we consider typical SOI and interfering signals: typical SOI signals, especially speech, are non-Gaussian and exhibit spectro-temporal structure. Meanwhile, typical signal degradations, like reverberation or additive non-coherent noise, exhibit less or no structure, thus reducing the structure of the acoustic mixture. Hence, the differential signal entropy [39] is used to capture the structuredness of the observed signals
as in [35]. Therein, the Probability Density Function (PDF) is estimated by its \(N_{\textrm{B}}\)-bin histogram
with \(e_{n_{\textrm{B}}}\) denoting the histogram bin edges. Note that, for the experiments conducted in Section 5, the signal blocks used to estimate entropy in (50) are chosen longer than those for the feature extraction. The entire microphone utility estimation procedure using Spectral Graph Partitioning is concisely summarized as pseudocode in Algorithm 1.
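A histogram-based differential-entropy estimate along these lines can be sketched as follows (block length and bin count are illustrative choices, not the exact settings used in Section 5). As expected from maximum-entropy arguments, a super-Gaussian, more structured signal of equal variance scores lower than Gaussian noise:

```python
import numpy as np

def differential_entropy(x, n_bins=64):
    """Histogram estimate of the differential entropy of x (in nats)."""
    p, edges = np.histogram(x, bins=n_bins, density=True)
    widths = np.diff(edges)
    mask = p > 0                       # skip empty bins (0 * log 0 := 0)
    return -np.sum(p[mask] * np.log(p[mask]) * widths[mask])

rng = np.random.default_rng(2)
n = 200_000                            # long block for a robust estimate
gauss = rng.standard_normal(n)         # unstructured, noise-like
# Laplacian samples with unit variance, a crude proxy for the
# super-Gaussian amplitude statistics of speech
laplace = rng.laplace(scale=1.0 / np.sqrt(2.0), size=n)

h_g = differential_entropy(gauss)      # close to 0.5*ln(2*pi*e) ≈ 1.42
h_l = differential_entropy(laplace)    # lower: more structured signal
```

At fixed variance the Gaussian maximizes differential entropy, which is exactly why lower entropy can serve as an indicator of a structured, SOI-like observation.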
In the presence of point-like interferers, the signal model (2) no longer strictly holds, such that it should be understood as a first step towards developing methods for more general acoustic scenarios. Hence, somewhat degraded estimation performance must be expected, where the extent of degradation depends on the particular scenario. For example, in an ASN spanning two rooms, each with its own SOI, with only low-level crosstalk between rooms and low-level additive noise, groups of useful microphones for either SOI can be identified, which still matches well with the desired outcome. As a second example, consider an ASN in a single room with two closely spaced point sources. For temporally overlapping source activity with both sources contributing similar signal power to each microphone, all microphones exhibit reduced utility w. r. t. either source, as the respective other source is considered as noise, again matching qualitatively with reduced feature covariance. For source counting and associating the microphone subsets with the correct SOI, additional mechanisms need to be developed that are beyond the scope of this paper.
3.3 Importance of specific signal features
Choosing an appropriate set of characteristic signal features for the microphone signals is vital: too few features result in low estimation accuracy, while too many features unnecessarily strain the wireless network. Even with an appropriate number of features, a poor choice of features may reduce overall estimation accuracy. To explore the importance of individual signal features, we formulate feature selection as an LS regression problem with a sparsity-promoting regularizer in (53) below, in order to obtain a low regression error while using as few features as possible. Specifically, we interpret the matrix \(\textbf{C}_{j}[k]\) as a dictionary matrix whose columns, or atoms, contain the cross-channel correlation coefficients between the reference channel j and all channels for one specific signal feature, and which are linearly combined to approximate the MSC of the observed microphone signals. However, for the purpose of estimating microphone utility and microphone selection, the relative utility of microphone channels is more important than the absolute values, such that the zero-mean MSC vector, i.e.,
is used as the target quantity. Thus, the \(\ell _1\)-regularized LS cost function for a single acoustic scenario comprising J microphone signals with K time frames is
where \(\varvec{\phi } = \left[ \begin{array}{ccc} \phi _{1},&\ldots ,&\phi _{F} \end{array}\right] ^{\textrm{T}} \in \mathbb {R}^{F}\) captures the contribution of each feature and the parameter \(\delta \in \mathbb {R}^{+}\) indirectly controls the sparsity of the vector, i.e., the number of used features. The results of this optimization are shown in Section 5.2.
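The cost function (53) is a standard Lasso problem and can be minimized, e.g., by proximal gradient descent (ISTA). The sketch below uses a small synthetic dictionary with hypothetical dimensions rather than the actual feature PCC data; with a sparse ground truth, the \(\ell_1\) penalty drives the weights of irrelevant atoms to zero:

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, delta, n_iter=500):
    """Minimize 0.5*||A phi - y||^2 + delta*||phi||_1 via ISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/Lipschitz const. of gradient
    phi = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ phi - y)
        phi = soft_threshold(phi - step * grad, step * delta)
    return phi

# toy dictionary: F = 6 feature atoms, only two truly explain the target
rng = np.random.default_rng(3)
A = rng.standard_normal((40, 6))
phi_true = np.array([1.0, 0.0, 0.0, -0.8, 0.0, 0.0])
y = A @ phi_true + 0.01 * rng.standard_normal(40)

phi_hat = ista(A, y, delta=0.5)
```

Larger values of `delta` shrink more coefficients exactly to zero, mirroring the sparser feature selections reported for larger \(\delta\) in Section 5.2.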
4 Learning-based utility estimation
Artificial Neural Networks (ANNs) offer the ability to learn an optimum feature set (for given training data) to characterize the microphone signals, as well as to optimally combine the features for estimating microphone utility. Thus, we propose learning-based alternatives to both the model-based feature extraction (see Section 3.1) and the utility estimation (see Section 3.2) subsystems. The extractor module in Fig. 4 realizes the feature extractor on the left-hand side of Fig. 2 (both in red), while the estimator module in Fig. 5 realizes the utility estimator on the right-hand side of Fig. 2 (both in blue). For both subsystems in Fig. 2, the ANN architectures are chosen to reflect the modeling capabilities of their model-based counterparts. Both modules are trained together in an end-to-end fashion. During inference, the extractor and utility estimator modules run on the network nodes and the AP, respectively, such that only the compressed feature representation needs to be transmitted to the AP.
4.1 Node-wise feature extraction
The signal features discussed in Section 3.3, although effective, are not necessarily optimally suited for utility estimation. Learning a set of features specifically tailored to characterizing microphone signals for the purpose of estimating utility promises improved accuracy and a more compact representation. The structure of the feature extractor module is depicted schematically in Fig. 4.
Recalling that the ground-truth utility is given by the MSC, spectral representations of the input data are an obvious choice. Since the phase of a signal is largely uninformative w. r. t. the SOI without a second signal for reference, we focus on models using the magnitude spectrum as input in the following. Our initial experiments support this choice: models using the magnitude spectrum outperformed models using the time-domain waveform. The resulting halving of the model’s input size is a welcome additional benefit.
Thus, the magnitude spectrum of a single microphone signal block \(\textbf{x}_{j}[k]\) as defined in (3) is computed first. Due to the loss of phase information, this transform is not invertible and thus prevents the model from learning exact equivalents of the timedomain features in Section 3.1. The magnitude spectrum then passes through a series of fully connected feedforward layers that get progressively narrower to condense information until a desired number of signal features is reached. The final batch normalization and Gated Recurrent Unit (GRU) layer allows the extractor module to learn features that describe the evolution of some quantity over time, e.g., spectral flux. Trained weights are shared between the instances of the module at different microphones, i.e., no sensorspecific features are extracted.
4.2 Utility estimation
The architecture of the utility estimator is shown in Fig. 5, using the concatenated feature vectors \(\textbf{a}_{j}[k]\) from the individual microphones as input. The memory of the first GRU layer allows the model to capture the temporal evolution of the feature sequences and to establish relations between the different microphone signals based on their extracted features. The following fully connected layers all contain the same number of neurons and are responsible for regression of the GRU outputs onto the target MSC values. Passing the feature sequences themselves into the ANN, instead of the PCCs as in the model-based method in Section 3, allows the network to differentiate between useful and non-useful microphones, such that no separate disambiguation step or supplemental information is needed. Unlike in the model-based estimation in Section 3.2, the number of microphones J directly determines the number of neurons in the later fully connected layers. Thus, the model must be retrained whenever the number of microphones changes, but it is capable of learning optimal feature representations. For practical applications, building a modular model, e.g., from microphone-pairwise submodels, could overcome this restriction at the cost of some modeling capability.
5 Experimental validation
The algorithms from Sections 3 and 4 are evaluated on simulated and recorded acoustic data. The considered scenarios feature both static and moving SOIs, different room dimensions and reverberation times, and different arrangements of \(J=10\) microphones, some of which may be physically obstructed by objects. Although each microphone represents its own network node here, this does not conflict with the general assumptions outlined in Section 1.
5.1 Acoustic data
This section describes the different acoustic data used in the following experimental validation.
5.1.1 Simulated data
Microphone signals for a single SOI moving in a shoebox room are simulated using the image-source method [50, 51]. The SOI trajectory is restricted to the Region of Interest (RoI), chosen as a horizontal plane at 1.2 m height with at least 1 m distance to the walls. The trajectory is spatially discretized such that successive SOI positions are at most 5 cm apart. The resulting set of time-variant Room Impulse Responses (RIRs) is then convolved with the corresponding SOI signal excerpts to obtain the microphone signals evoked by the moving SOI. Speech segments of 28 s duration, from both male and female speakers, are used as SOI signals. The source moves rapidly during the time intervals 8–10 s and 18–20 s, and otherwise moves slightly around a resting position to simulate the behavior of human speakers. With a maximum cross section through the RoI of about 8 m, the maximum possible SOI speed is about 4 m/s. Under these constraints, 20 different random source trajectories are generated. Three different rooms with typical living-room-like acoustic properties (see Table 1) are considered. In each room, \(J=10\) cardioid microphones are placed at random positions and with random azimuthal rotation. In total, \(R_{\text {sim}} = 120\) distinct acoustic setups (20 trajectories \(\times\) 2 signals \(\times\) 3 rooms) are simulated, resulting in 56 min of speech data. The generated SOI images are superimposed with spatially uncorrelated white noise of an equal, fixed level to attain an SNR of 10 dB at the microphone with the strongest source image on average. Due to the lower SOI contribution, other microphones have a lower average SNR. Figure 6 illustrates room A along with an exemplary source trajectory.
5.1.2 Recorded data
The recorded acoustic data is obtained from \(J=10\) microphones arranged pairwise in a quarter circle around a static loudspeaker representing the SOI, as shown in Fig. 7. Although the microphone capsules are omnidirectional, they exhibit non-uniform directivity due to being mounted in metal enclosures facing the SOI, which cause diffraction. SOI signals comprise male, female, and children’s speech. Instead of a moving source, different degrees of microphone usefulness are induced by occluding some of the sensors. Obstacles may cover two microphone pairs, as indicated in Fig. 7, or a single microphone pair. Additionally, the obstacles consist of different materials, i.e., solid wood, foam, and cloth, such that sound can permeate some of them. In total, \(R_{\text {rec}} = 36\) distinct acoustic setups (12 obstructions \(\times\) 3 signals) are recorded, resulting in 36 min of speech data. As for the simulated data, spatially uncorrelated white noise is added to the recorded microphone signals to achieve an SNR of 10 dB.
5.2 Feature importance
Summing \(\mathcal {C}(\varvec{\phi })\) in (53) over the \(R_{\text {sim}}=120\) experiment trials (see Section 5.1.1) and then minimizing the sum yields the feature weights \(\varvec{\phi }\) depicted in Fig. 8. Naturally, higher values of \(\delta\) result in sparser solutions, i.e., fewer selected features, ranging from 3 to 12 features for the considered range of \(\delta\). The most important features appear to be lower-order statistical moments of the temporal waveform (td_centroid, td_spread, td_skewness), higher-order statistical moments of the magnitude spectrum (sd_skewness, sd_kurtosis), and features capturing the temporal variation of the magnitude spectrum (sd_flux, sd_variation, sd_fluxnorm). For \(\delta =0.001\), the selection comprises the four features td_skewness, sd_slope, sd_kurtosis, and sd_fluxnorm, two of which were also part of the heuristic selection made in [36]. To keep the number of selected features similar to prior work [32, 35], we choose the aforementioned four features obtained for \(\delta =0.001\) for the experimental validation in Section 5.3. Note that the obtained feature weights are only used for feature selection so far, but could possibly be used to improve the estimates’ robustness in the future.
5.3 Accuracy of estimated utilities
Estimation accuracy is quantified by computing the time-variant PCC between the estimated utility vector \(\textbf{u}[k]\) and the MSC vector \(\varvec{\gamma }[k]\)
For the following experiments, signals are sampled at \(f_{\textrm{s}} =\) 16 kHz. For block processing, signals are partitioned into blocks of \({L_{\textrm{b}}}=1024\) samples with a block shift of \({L_{\textrm{s}}}=512\) samples. As the only exception, differential entropy in (50), since it is estimated by a histogram approach, uses longer blocks of \(32\,000\) samples for more robust estimates. Due to the larger block size, the estimated differential entropy also changes more slowly over time, thus promoting temporal continuity of the estimated utility via (49).
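The frame-wise accuracy measure amounts to a Pearson correlation over the J channels per time frame, which can be sketched as follows (variable names and data are illustrative):

```python
import numpy as np

def framewise_pcc(u, gamma):
    """PCC r[k] between estimated utilities u[k] and ground-truth MSC gamma[k].

    u, gamma : (K, J) arrays, one J-dimensional vector per time frame k
    """
    u0 = u - u.mean(axis=1, keepdims=True)          # center per frame
    g0 = gamma - gamma.mean(axis=1, keepdims=True)
    num = (u0 * g0).sum(axis=1)
    den = np.sqrt((u0**2).sum(axis=1) * (g0**2).sum(axis=1))
    return num / den

# estimates that are a positive affine transform of the ground truth
# achieve the ideal value r[k] = 1 in every frame
rng = np.random.default_rng(4)
gamma = rng.uniform(0, 1, (5, 10))    # K = 5 frames, J = 10 channels
u = 2.0 * gamma + 0.3
r = framewise_pcc(u, gamma)
```

This scale- and offset-invariance is precisely why the PCC is a suitable measure here: only the relative ranking of the microphones matters for selection.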
For the proposed model-based approach from Section 3, termed model-KF in the following, the microphone signals are characterized using the four features identified in Section 3.3, i.e., td_skewness, sd_slope, sd_kurtosis, and sd_fluxnorm. The temporal recursive smoothing factor in (14) is chosen as \(\lambda = 0.99\); the scaling factors for the KF process and observation noise are \(\alpha _1 = 1\) and \(\alpha _2 = 50\), respectively.
For the learning-based approach, the extractor contains six fully connected layers with 513, 256, 128, 64, 32, and 16 neurons, respectively, followed by a single GRU layer with 16 inputs and 16 hidden states. Recall that \({L_{\textrm{b}}}=1024\), such that the 513 inputs to the first layer correspond to the non-redundant part of the signal’s magnitude spectrum. The utility estimator contains a single GRU layer with \(16J = 160\) inputs and 10 hidden states, followed by three fully connected layers with 10 neurons each. Since identical copies of the extractor module are run for each microphone channel \(j\in \mathcal {P}\), the total number of parameters is \(175\,000\) for the extractor regardless of the number of microphones J, and \(5\,500\) for the utility estimator with the above configuration, which scales asymptotically quadratically with the number of channels J. The objective function to be minimized in training is the Mean Square Error (MSE) between the estimated utility \(\textbf{u}[k; \Psi ]\) and the ground-truth MSC \(\varvec{\gamma }[k]\)
where \(\Psi\) denotes the set of all trainable parameters. The model is trained by the Adam optimizer [52] with a learning rate of \(10^{-3}\). The available acoustic scenarios (120 for ML-sim, 36 for ML-tuned, and 156 for ML-joint, respectively) are split into 70% training and 30% validation data. Due to the combinatorial construction of the acoustic data (see Sections 5.1.1 and 5.1.2), it is likely that the same speech signal occurs both in the training and the testing data. However, the signals never occur in the same combination of source trajectory and simulated room, which are the predominant influencing factors of microphone utility.
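The extractor parameter count stated above can be roughly reproduced by elementary bookkeeping, assuming (our reading, not stated explicitly in the text) that the listed widths are successive layer sizes starting from the 513-dimensional magnitude-spectrum input, and a standard three-gate GRU parameterization with two bias vectors per gate:

```python
# fully connected layers: weights (n_in * n_out) plus one bias per neuron
widths = [513, 256, 128, 64, 32, 16]
fc = sum(n_in * n_out + n_out
         for n_in, n_out in zip(widths[:-1], widths[1:]))

# GRU with 16 inputs and 16 hidden states: 3 gates, each with an
# input-to-hidden and a hidden-to-hidden weight matrix and two biases
gru = 3 * (16 * 16 + 16 * 16 + 2 * 16)

total = fc + gru        # on the order of the quoted 175k parameters
```

The small deviation from the quoted figure of \(175\,000\) would stem from rounding and from components omitted here, such as the batch normalization parameters.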
A total of 8 algorithmic variants and baselines are evaluated. Note that we deliberately do not enforce a common constraint regarding communication cost, because our primary goal is to establish performance bounds. However, for practical application, the trade-off between accurate utility estimates and minimal communication costs must be carefully considered. First, two baseline variants, termed baseline-MSC and baseline-CDR, use the cross-microphone MSC and the Coherent-to-Diffuse power Ratio (CDR) as oracle features. For baseline-MSC, the estimated MSC values directly represent a normalized similarity of the respective microphones, s. t. they directly comprise \(\tilde{\textbf{B}}^{(f)}[k]\) in (30). The CDR is computed by a Direction of Arrival (DOA)-independent estimator [53] assuming a diffuse noise coherence, which has been successfully used in weighting and selecting observations made by different microphones [33, 54]. Because the CDR is not bounded, the diffuseness [53, 55] is used in its place to construct \(\tilde{\textbf{B}}^{(f)}[k]\). Note that although MSC and CDR imply oracle knowledge in the sense of signal availability at the AP, and thus transmission of the sensor signals, the MSC is still computed from time-limited observation windows and thus entails all of the associated estimation challenges, e.g., [56, 57]. The same holds for the CDR, as it is based directly on the estimated MSC.
The model-based system described in Section 3, including the KFs for covariance estimation, is termed model-KF. To evaluate the effectiveness of the KF, a variant of the proposed system that uses simple recursive temporal smoothing as in (14) for feature covariance estimation is evaluated, termed model-smooth. Furthermore, to judge the modeling capabilities of the ML-based estimator, hybrid combines all 18 traditional features from Section 3.1 with the ML-based utility estimator module from Section 4.2. The computational complexity of the different features obviously varies; computation of a single feature is about 20–40× faster than real time running single-threaded on a Core™ i5-6600K at 3.5 GHz. Entropy computation, being only slightly faster than real time, is a notable exception, but is required for the disambiguation of solutions in (49).
In addition, three different training variants are investigated: the first variant, ML-sim, uses exclusively simulated data. For practical application, it is highly desirable to deploy a pre-trained model and fine-tune its parameters specifically toward a new unseen scenario. To this end, a copy of the pre-trained ML-sim is fine-tuned on recorded data, termed ML-tuned. Finally, a third version trained on both simulated and recorded data simultaneously, termed ML-joint, is also evaluated. In terms of simulation run time, the learning-based variants achieve about 40× real-time speeds, i.e., they are on par with the computation of a single model-based feature.
5.3.1 Simulated data
Figure 9 shows the median, as well as the lower and upper quartiles, of the PCC r[k] across trials as a function of the time frame k for simulated data; ideally, values close to 1 should be attained. Whenever the source moves (indicated by the gray-shaded vertical areas in Fig. 9), the source-microphone distances change suddenly, causing the observed rapid decrease of r[k]. Both baseline methods baseline-MSC and baseline-CDR achieve limited performance due to the high relative noise levels in the microphone signals and the time-limited observation windows impeding accurate estimation. Both purely model-based variants model-KF and model-smooth exhibit good steady-state performance with PCCs around 0.9 as well as quick initial convergence and reconvergence after source motion. The KF variant model-KF achieves slightly better accuracy on average than model-smooth and is more robust, e.g., visible at around 4 s and 13 s. The trained hybrid offers only very small improvements over model-KF and model-smooth despite using all of the features, indicating that the four selected features for model-KF and model-smooth are close to optimal for these scenarios. The learning-based ML-sim trained on matching data achieves very similar performance, trading more consistent performance when the SOI does not move for slightly slower reconvergence behavior. As expected, fine-tuning the ML model using recorded data significantly degrades performance for simulated data, as shown by ML-tuned. Finally, the ML model trained with both simulated and recorded data from the beginning, i.e., ML-joint, clearly outperforms all other considered methods, with only minor breakdowns and very fast recovery. Interestingly, incorporating recorded data besides the simulated data into the training procedure also improves performance on simulated data. Convergence of all methods is very fast, reaching peak accuracy almost instantaneously after the SOI becomes active following an initial silence period of about 1 s.
While the ML models implicitly learning the temporal structure of the source movement might be a concern here, our experiments with random time intervals of 4 to 12 s between two successive source movements have shown no noticeable degradation compared to fixed time intervals.
5.3.2 Recorded data
As for the simulated data, Fig. 10 shows the median and quartiles of r[k] across trials as a function of the time frame k for recorded data. Since the SOI is static, the usefulness of the microphones is predominantly influenced by their occlusion, and no clear temporal structure can be discerned. Both of the oracle baselines achieve consistent but limited performance with PCCs between 0.6 and 0.8. The advantage of baseline-CDR may be attributed to the diffuse noise coherence model, which enhances the contrast between microphones since residual coherence is considered as noise, particularly in low-frequency regions. While the median of both model-KF and model-smooth reaches 0.9 after about 2 s, their performance degrades over the experiment duration. Note that convergence of these two model-based variants and the baselines is initially delayed by about 1.5 s due to incorrect disambiguation of the utility estimates in (49), indicating an opportunity for future improvements. Beyond this initial phase, (49) is effective at disambiguating the microphone partitions, as shown by the consistently positive values in Fig. 9. This is reinforced by hybrid, which simultaneously avoids this initial delay and achieves significantly better performance. Thus, hybrid, which has access to all features, outperforming model-KF suggests that the four selected features are suboptimal for the type of degradation encountered in the recorded data. While the relatively weak performance of ML-sim with median values of around 0.5 is unsurprising since the model was not trained using recorded data, the method does not completely fail for unseen data. The performance of ML-tuned is only on par with model-KF, indicating that the adjustment of a pre-trained model is not as straightforward as anticipated, likely because pre-training drives the ANN parameters to a local minimum that cannot be escaped easily by subsequent tuning.
As for the simulated data, ML-joint outperforms the other methods on recorded data, achieving almost ideal values extremely fast and consistently, i.e., with almost no spread. Because the two variants share their architecture and thus their modeling capability, the advantage of ML-joint over ML-tuned is due to the different training data, which matches the observation made for simulated data (see Fig. 9) that seemingly unrelated training data improves performance.
5.3.3 Identification of a single most useful microphone
Besides the accuracy of the estimated continuous-valued utilities, the capability of the different algorithmic variants to correctly identify the microphone with the highest utility is investigated. To this end, the channel-wise SNR in (9) is computed using oracle knowledge of the individual signal components, i.e., the SOI source image and the additive noise. The microphone that maximizes (9) is considered the most useful, representing the ground truth in this experiment. Note that, as the SNR changes over time, so does the identity of the best microphone. For brevity, we restrict the investigation to the best-performing model-based and ML-based variants, i.e., model-KF, hybrid, and ML-joint. For each variant, the microphone with the highest estimated utility is selected. In the absence of a more directly comparable approach, the microphone with the highest average pairwise CDR is selected as a baseline. Therein, the CDR is computed using the DOA-independent estimator [53] and a diffuse noise coherence model as described in Section 5.3. Because the microphones are connected to separate network nodes, this CDR baseline requires transmission of all microphone signals to one of the network nodes, which limits its practical applicability in ASNs. As performance measure, we use the fraction of time frames in which the estimated identity of the most useful microphone coincides with the SNR-based ground truth.
The obtained results are shown in Table 2, separately for simulated and recorded data. For simulated data, all proposed variants clearly outperform the CDR baseline, with ML-joint achieving the highest accuracy, as expected from the previous results. In the more challenging scenarios with recorded data, the overall accuracy decreases for all methods. Although the CDR baseline outperforms both model-KF and hybrid, which use hand-crafted signal features, the ML-based ML-joint outperforms all of the remaining considered methods. It must be reiterated that the CDR baseline in Table 2 uses the microphone signal MSC as oracle knowledge, requiring transmission of all microphone signals. In contrast, the proposed model-KF, hybrid, and ML-joint have no such limitations.
5.3.4 Discussion
Let us summarize the previous Sections 5.3.1 to 5.3.3 and point out implications for practical application. The performance of ML-sim on recorded data and of ML-tuned on simulated data indicates limits on the generalization capabilities of the respective trained models to unseen data. Meanwhile, ML-joint provides very good utility estimates but requires both simulated and recorded acoustic data for training. This suggests that data mismatch due to the simplified acoustic simulation, e.g., neglecting the occlusions present in recorded data, is responsible for the aforementioned performance degradation of ML-sim and ML-tuned, although further investigation of the root cause is required. The comparison of ML-tuned and ML-joint shows that adapting a pretrained network to new scenarios is not straightforward and likely requires more sophisticated transfer learning techniques. As a major drawback for ASNs in realistic conditions, obtaining a sufficient amount of labeled training data for a variety of acoustic scenarios is difficult, since estimating the MSC values necessary for training requires prior transmission and potentially synchronization of all observed signals. While this could be remedied, e.g., by network nodes with enough memory to buffer the signals before transmission, this problem is beyond the scope of this contribution. Furthermore, the architecture of the utility estimator module explicitly depends on the number of microphones J and thus requires retraining whenever J changes, e.g., when new ASN nodes are added.
In contrast, the model-based approach model-KF has shown a more modest, yet robust, performance for both simulated and recorded data. It is also blind, i.e., it requires no knowledge of array geometries, of acoustic meta parameters like the reverberation time, and, in particular, of the number of microphones J. Thus, it can be deployed straightforwardly in different acoustic environments without the need to collect acoustic data to train or fine-tune a model. For a real system, a model-based scheme can be used as an initial solution to collect labeled training data, which can then be used to tailor an ML-based model to the specific acoustic scenario of the training data.
6 Conclusion
In this contribution, we tackled microphone utility estimation for ASNs. Specifically, we revisited model-based approaches and discussed the usefulness of specific features, with features describing temporal variations and higher-order statistical moments of the signals’ magnitude spectra being the most useful overall. Furthermore, we proposed alternative, machine-learning-based realizations that learn an optimal feature set and utility estimator. Experimental validation showed that both model-based and ML-based approaches are viable in principle, each with its own strengths and drawbacks. The model-based approach is straightforwardly applied to ASNs with an arbitrary number of microphones J, but is clearly outperformed by suitably trained ML models. In contrast, the ML-based approaches, particularly ML-joint, achieve superior performance if matching training data are available.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Abbreviations
ANN: Artificial Neural Network
AP: Access Point
ASN: Acoustic Sensor Network
ASR: Automatic Speech Recognition
CDR: Coherent-to-Diffuse power Ratio
CLT: Central Limit Theorem
DFT: Discrete Fourier Transform
DOA: Direction of Arrival
DRR: Direct-to-Reverberation Ratio
ERLE: Echo Return Loss Enhancement
GRU: Gated Recurrent Unit
KF: Kalman Filter
LASSO: Least Absolute Shrinkage and Selection Operator
LCMV: Linearly Constrained Minimum Variance
LS: Least Squares
ML: Machine Learning
MMSE: Minimum Mean Square Error
MSC: Magnitude-Squared Coherence
MSE: Mean Square Error
MVDR: Minimum Variance Distortionless Response
PCC: Pearson Correlation Coefficient
PDF: Probability Density Function
PPM: Parts Per Million
PSD: Power Spectral Density
RIR: Room Impulse Response
RoI: Region of Interest
SDR: Signal-to-Distortion Ratio
SNR: Signal-to-Noise Ratio
SOI: Source of Interest
SVD: Singular Value Decomposition
References
A. Bertrand, in 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT), Applications and trends in wireless acoustic sensor networks: A signal processing perspective (2011), pp. 1–6. https://doi.org/10.1109/SCVT.2011.6101302
H. Wang, P. Chu, in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Voice source localization for automatic camera pointing system in videoconferencing (1997), pp. 187–190. https://doi.org/10.1109/ICASSP.1997.599595
L. Cheng, C. Wu, Y. Zhang, H. Wu, M. Li, C. Maple, A Survey of Localization in Wireless Sensor Network. Int. J. Distrib. Sens. Netw. 8(12) (2012). https://doi.org/10.1155/2012/962523
A. Brendel, W. Kellermann, Distributed source localization in acoustic sensor networks using the coherent-to-diffuse power ratio. IEEE J. Sel. Top. Sign. Process. 13(1), 61–75 (2019). https://doi.org/10.1109/JSTSP.2019.2900911
L. Kaplan, Q. Le, N. Molnar, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Maximum likelihood methods for bearingsonly target localization, vol.5 (2001), pp. 3001–3004. https://doi.org/10.1109/ICASSP.2001.940281
C. Evers, H.W. Löllmann, H. Mellmann, A. Schmidt, H. Barfuss, P.A. Naylor, W. Kellermann, The LOCATA challenge: Acoustic source localization and tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1620–1643 (2020)
M. Brandstein, Microphone arrays: signal processing techniques and applications (Springer Science & Business Media, Berlin, 2001)
A. Bertrand, M. Moonen, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Distributed adaptive estimation of correlated nodespecific signals in a fully connected sensor network (Taipei, Taiwan, 2009), pp. 2053–2056. https://doi.org/10.1109/ICASSP.2009.4960018
S. MarkovichGolan, A. Bertrand, M. Moonen, S. Gannot, Optimal distributed minimumvariance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Process. 107, 4–20 (2015)
S.L. Gay, J. Benesty, Acoustic signal processing for telecommunication, vol. 551 (Springer Science & Business Media, New York, 2012)
L.M. Oliveira, J.J. Rodrigues, Wireless Sensor Networks: a Survey on Environmental Monitoring. J. Commun. 6(2), 143–151 (2011). https://doi.org/10.4304/jcm.6.2.143-151
S. Goetze, J. Schroder, S. Gerlach, D. Hollosi, J.E. Appell, F. Wallhoff, Acoustic monitoring and localization for social care. J. Comput. Sci. Eng. 6(1), 40–50 (2012)
A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, T. Virtanen, in DCASE 2017  Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2017 Challenge setup: Tasks, datasets and baseline system (Munich, Germany, 2017). https://hal.inria.fr/hal01627981. Accessed date 19 Dec 2021
M. Wolf, C. Nadeu, in Proc. of I Joint SIGIL/Microsoft Workshop Speech Lang. Technol. Iberian Lang., Towards microphone selection based on room impulse response energy-related measures (Porto Salvo, Portugal, 2009), p. 4
A. Bertrand, M. Moonen, in European Signal Process. Conf. (EUSIPCO), Efficient sensor subset selection and link failure response for linear MMSE signal estimation in wireless sensor networks (Aalborg, Denmark, 2010), pp. 1092–1096
J. Szurley, A. Bertrand, M. Moonen, P. Ruckebusch, I. Moerman, in European Signal Process. Conf. (EUSIPCO), Energy aware greedy subset selection for speech enhancement in wireless acoustic sensor networks (Bucharest, 2012), pp. 789–793
J. Szurley, A. Bertrand, P. Ruckebusch, I. Moerman, M. Moonen, Greedy distributed node selection for nodespecific signal estimation in wireless sensor networks. Signal Process. 94, 57–73 (2014). https://doi.org/10.1016/j.sigpro.2013.06.010
O. Roy, M. Vetterli, Rate-Constrained Collaborative Noise Reduction for Wireless Hearing Aids. IEEE Trans. Signal Process. 57(2), 645–657 (2009). https://doi.org/10.1109/TSP.2008.2009267
S. Srinivasan, A.C. den Brinker, Rate-Constrained Beamforming in Binaural Hearing Aids. EURASIP J. Adv. Signal Process. 2009(1) (2009). https://doi.org/10.1155/2009/257197
J. Amini, R.C. Hendriks, R. Heusdens, M. Guo, J. Jensen, Rate-Constrained Noise Reduction in Wireless Acoustic Sensor Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1–12 (2020). https://doi.org/10.1109/TASLP.2019.2947777
J. Casebeer, J. Kaikaus, P. Smaragdis, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Communication-Cost Aware Microphone Selection for Neural Speech Enhancement with Ad-Hoc Microphone Arrays (2021), pp. 8438–8442. https://doi.org/10.1109/ICASSP39728.2021.9414775
J. Zhang, S.P. Chepuri, R.C. Hendriks, R. Heusdens, Microphone Subset Selection for MVDR Beamformer Based Noise Reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 26(3), 550–563 (2018). https://doi.org/10.1109/TASLP.2017.2786544
J. Zhang, J. Du, L.R. Dai, Sensor Selection for Relative Acoustic Transfer Function Steered Linearly-Constrained Beamformers. IEEE/ACM Trans. Audio Speech Lang. Process. 29 (2021). https://doi.org/10.1109/TASLP.2021.3064399
J. Benesty, J. Chen, Y. Huang, Microphone array signal processing, vol. 1 (Springer Science & Business Media, Berlin, 2008)
K. Kumatani, J. McDonough, J. Lehman, B. Raj, in Joint Workshop Hands-free Speech Commun. Microphone Arrays (HSCMA), Channel selection based on multichannel cross-correlation coefficients for distant speech recognition (Edinburgh, UK, 2011), pp. 1–6. https://doi.org/10.1109/HSCMA.2011.5942398
IEEE Standard for Information technology – Telecommunications and information exchange between systems – Local and metropolitan area networks – Specific requirements – Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE Std 802.11-2016 (Revision of IEEE Std 802.11-2012) (2016), pp. 1–3534
A. Chinaev, G. Enzner, T. Gburrek, J. Schmalenstroeer, in European Signal Process. Conf. (EUSIPCO), Online Estimation of Sampling Rate Offsets in Wireless Acoustic Sensor Networks with Packet Loss (Dublin, 2021), pp. 1–5
S.E. Kotti, R. Heusdens, R.C. Hendriks, in 2020 28th European Signal Processing Conference (EUSIPCO), Clock-Offset and Microphone Gain Mismatch Invariant Beamforming (IEEE, Amsterdam, Netherlands, 2021), pp. 176–180. https://doi.org/10.23919/Eusipco47968.2020.9287852
S. Wehr, I. Kozintsev, R. Lienhart, W. Kellermann, in IEEE Sixth International Symposium on Multimedia Software Engineering, Synchronization of acoustic sensors for distributed ad-hoc audio networks and its use for blind source separation (Miami, USA, IEEE, 2004), pp. 18–25
D. Cherkassky, S. Gannot, Blind Synchronization in Wireless Acoustic Sensor Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(3), 651–661 (2017). https://doi.org/10.1109/TASLP.2017.2655259. https://ieeexplore.ieee.org/document/7827105/
M. Fiedler, Algebraic connectivity of graphs. Czechoslov. Math. J. 23(2), 298–305 (1973)
M. Günther, H. Afifi, A. Brendel, H. Karl, W. Kellermann, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Network-aware optimal microphone channel selection in wireless acoustic sensor networks (Toronto, 2021)
M. Günther, A. Brendel, W. Kellermann, in 14. ITG Conf. on Speech Comm., Microphone Utility-based Weighting for Robust Acoustic Source Localization in Wireless Acoustic Sensor Networks (Kiel, Germany, 2021)
H. Afifi, M. Günther, A. Brendel, H. Karl, W. Kellermann, in 14. ITG Conf. on Speech Comm., Reinforcement Learning-based Microphone Selection in Wireless Acoustic Sensor Networks Considering Network and Acoustic Utilities (Kiel, Germany, 2021)
M. Günther, A. Brendel, W. Kellermann, in European Signal Process. Conf. (EUSIPCO), Online estimation of time-variant microphone utility in wireless acoustic sensor networks using single-channel signal features (Dublin, Ireland, 2021)
M. Günther, A. Brendel, W. Kellermann, in Int. Congress on Acoust. (ICA), Single-channel signal features for estimating microphone utility for coherent signal processing (2019), pp. 2716–2723
G. Peeters, A large set of audio features for sound description (similarity and classification). CUIDADO project Ircam technical report (2004). http://recherche.ircam.fr/equipes/analysesynthese/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf. Accessed date 19 Dec 2021
T. Virtanen, M. Plumbley, D. Ellis, Computational analysis of sound scenes and events (Springer, Cham, 2018)
T.M. Cover, J.A. Thomas, Elements of information theory, 2nd edn. (WileyInterscience, Hoboken, N.J., 2006)
R.E. Kalman, A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 82(1), 35–45 (1960). https://doi.org/10.1115/1.3662552
H.V. Henderson, S.R. Searle, Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. Can. J. Stat. 7(1), 65–81 (1979)
G. Enzner, H. Buchner, A. Favrot, F. Kuech, in Academic Press Library in Signal Processing, vol. 4, Acoustic Echo Control (Elsevier, Oxford, 2014), pp. 807–877
B. Schwartz, S. Gannot, E.A.P. Habets, Online Speech Dereverberation Using Kalman Filter and EM Algorithm. IEEE/ACM Trans. Audio Speech Lang. Process. 23(2), 394–406 (2015)
A. Papoulis, S.U. Pillai, Probability, Random Variables and Stochastic Processes, 4th edn. (McGrawHill, New York, 2001)
C.M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006)
G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (The Johns Hopkins University Press, Baltimore, 2013)
F.R. Chung, F.C. Graham, Spectral graph theory, vol. 92 (American Mathematical Soc, Providence, Rhode Island, 1997)
U. Von Luxburg, A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
J. Shi, J. Malik, Normalized cuts and image segmentation. IEEE Trans. Pattern. Anal. Mach. Intell. 22(8), 888–905 (2000)
J.B. Allen, D.A. Berkley, Image method for efficiently simulating smallroom acoustics. J. Acoust. Soc. Am. 65(4), 943–950 (1979). https://doi.org/10.1121/1.382599
E.A.P. Habets. Signal generator for MATLAB (2011). https://github.com/ehabets/SignalGenerator. Accessed date 19 Dec 2021
D.P. Kingma, J. Ba, Adam: A method for stochastic optimization (San Diego, 2017). https://doi.org/10.48550/arXiv.1412.6980
A. Schwarz, W. Kellermann, Coherent-to-Diffuse Power Ratio Estimation for Dereverberation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(6), 1006–1018 (2015)
A. Brendel, C. Huang, W. Kellermann, STFT Bin Selection for Localization Algorithms based on the Sparsity of Speech Signal Spectra (Proc. Euronoise, 2018)
G. Del Galdo, M. Taseska, O. Thiergart, J. Ahonen, V. Pulkki, The diffuse sound field in energetic analysis. J. Acoust. Soc. Am. 131(3), 2141–2151 (2012). https://doi.org/10.1121/1.3682064
G. Carter, Time delay estimation for passive sonar signal processing. IEEE Trans. Acoust. Speech Signal Process. 29(3), 463–470 (1981)
G. Carter, Coherence and time delay estimation. Proc. IEEE 75(2), 236–255 (1987)
Acknowledgements
The authors thank Adhithyan Ramadoss for his help acquiring the recorded acoustic data used in the experimental study.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—282835863—within the Research Unit FOR2457 “Acoustic Sensor Networks.”
Author information
Authors and Affiliations
Contributions
MG designed the proposed systems, designed and conducted the experimental studies, analyzed their results, and drafted the manuscript. AB codesigned the proposed systems and the experiments, and provided invaluable technical feedback on the manuscript draft. WK provided extensive feedback on the manuscript draft, and helped with interpreting the experimental results. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
WK is the Lead Guest Editor of the special issue “Signal Processing and Machine Learning for Speech and Audio in Acoustic Sensor Networks” to which this manuscript is submitted.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Günther, M., Brendel, A. & Kellermann, W. Microphone utility estimation in acoustic sensor networks using single-channel signal features. J AUDIO SPEECH MUSIC PROC. 2023, 29 (2023). https://doi.org/10.1186/s13636-023-00294-7
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13636-023-00294-7