Microphone utility estimation in acoustic sensor networks using single-channel signal features

Abstract

In multichannel signal processing with distributed sensors, choosing the optimal subset of observed sensor signals to be exploited is crucial in order to maximize algorithmic performance and reduce computational load, ideally both at the same time. In the acoustic domain, signal cross-correlation is a natural choice to quantify the usefulness of microphone signals, i.e., microphone utility, for coherent array processing, but its estimation requires that the uncoded signals are synchronized and transmitted between nodes. In resource-constrained environments like acoustic sensor networks, low data transmission rates often make transmission of all observed signals to the centralized location infeasible, thus discouraging direct estimation of signal cross-correlation. Instead, we employ characteristic features of the recorded signals to estimate the usefulness of individual microphone signals using the Magnitude-Squared Coherence (MSC) between the source and respective microphone signal as ground-truth metric. In this contribution, we provide a comprehensive analysis of model-based microphone utility estimation approaches that use signal features and, as an alternative, also propose machine learning-based estimation methods that identify optimal sensor signal utility features. The performance of both approaches is validated experimentally using both simulated and recorded acoustic data, comprising a variety of realistic and practically relevant acoustic scenarios including moving and static sources.

1 Introduction

An acoustic sensor network (ASN) comprises multiple spatially distributed microphones, including multiple distributed compact microphone arrays, that typically communicate wirelessly. Capturing different perspectives of the acoustic scene, the signals recorded by these distributed microphones encode spatial information exploitable by multichannel signal processing algorithms. These algorithms accomplish crucial tasks [1] like acoustic source localization [2,3,4] and tracking [5, 6], extraction and enhancement of an acoustic Source of Interest (SOI) [7,8,9], hands-free communication [10], acoustic monitoring [11, 12], and scene classification and acoustic event detection [13]. As the microphones in ASNs often have no common sampling clock, their signals must be synchronized before joint processing.

The performance of these signal processing algorithms is affected by many factors, including the proximity of the microphones to desired and undesired acoustic sources, reverberation, additive noise, and the orientation and occlusion of microphones. As a result, the signals obtained from different microphones are generally not equally useful for the abovementioned tasks and can even be detrimental in extreme cases if inappropriate importance is assigned to them. To ensure optimal algorithmic performance at minimum transmission and computational cost, it is crucial to carefully select which of the observed microphone signals to process and which to discard, thereby avoiding unnecessary signal transmission or synchronization effort. Unsurprisingly, this task has received considerable attention in the literature: the selection of a single best channel for Automatic Speech Recognition (ASR) based on signal features has been explored in [14]. A utility measure specifically for Minimum Mean Square Error (MMSE) signal extraction has been proposed in [15, 16], followed by a distributed version in [17]. MMSE signal extraction under rate constraints tailored specifically to hearing aids was treated in [18,19,20]. Furthermore, joint microphone subset selection and speech enhancement using deep learning were proposed in [21]. Microphone subset selection to minimize communication cost with upper-bounded output noise power has been investigated for both Minimum Variance Distortionless Response (MVDR) [22] and Linearly Constrained Minimum Variance (LCMV) [23] beamforming. However, these methods either neglect the limitations of the underlying ASN regarding communication cost or are tailored to a specific application or cost function. 
In the following, we present a different approach that overcomes both drawbacks, i.e., requires little transmission data rate and is applicable to a broad class of signal processing applications that rely on coherent input signals.

Many multichannel algorithms, e.g., for signal enhancement or localization using compact arrays [7, 24], assume coherent, i.e., linearly related, input signals and exploit the spatial information captured by the inter-channel phase differences. While this obviously applies to the signal components evoked by an SOI, it also holds for noise reference signals because they must admit a prediction, often linear, of residual noise components in order to suppress them. Thus, the cross-correlation of microphone signal pairs and measures derived from it, in particular the spatial coherence and the MSC, are intuitive measures for quantifying the usefulness of observed microphone signals and have been used in literature for that purpose, e.g., in [25]. For synchronized microphones with sufficient transmission data rate, e.g., for wired compact microphone arrays, direct estimation of the inter-channel coherence from the observed uncoded microphone signals to rate their utility is straightforward. However, in ASNs, this approach is often precluded by a limited transmission data rate, e.g., of current wireless networks [26], especially when the number of microphones is large. This issue is further compounded if the available data rate must be shared with other, possibly non-audio, applications, like video streaming in smart home environments. Furthermore, the microphone signals in ASNs generally do not share a common sampling clock [27]. While sampling time offsets are readily handled by suitable signal processing techniques [28], clock skew will often still pose a problem. Even when the accumulated sampling time offset within one processing block only amounts to fractions of a sampling period, imperfect cancelation can have a catastrophic effect on differential signal processing applications [29]. 
Furthermore, although the sampling rate variation across multiple copies of a single device can be very low [30], this may not necessarily be true for ASNs comprised of heterogeneous, cheap consumer devices. Therefore, clock skew in ASNs should generally not be neglected. Thus, potentially costly synchronization of the signal waveforms is generally required prior to estimation of the coherence, which disqualifies direct estimation of the signal cross-correlation. To identify promising candidate microphone signals for synchronization and subsequent joint processing in ASNs without prior signal synchronization, other techniques are required.

To address these unique challenges of ASNs, we employ a compressed signal representation in the form of single-channel signal feature sequences, which are extracted from temporal blocks of the microphone signals, to reduce the amount of data to be transmitted by roughly two to three orders of magnitude. To accurately assess the communication cost, many additional factors should be considered including, but not limited to, the radio-frequency environment and radio Signal-to-Noise Ratio (SNR) at the wireless transceivers, modulation and coding schemes, medium access control and arbitration (potentially via distributed algorithms), protocols and the associated overhead, and the temporal duration of transmission frames. A full treatment of these factors is beyond the scope of this contribution. While acknowledging the implied simplification, we use the data amount as a proxy due to its conceptual simplicity and its monotonic relation with the actual communication cost, i.e., reducing the data amount never increases the cost when the environment is constant. The employed features must be characteristic of the microphone signals, i.e., they must allow for (at least approximate) reconstruction of the inter-channel MSC.

In this contribution, we consider acoustic scenarios often encountered in smart home applications comprising a single SOI captured by multiple distributed microphones in an acoustic enclosure, as depicted in Fig. 1. After estimating the usefulness of the recorded microphone signals, a subset thereof is selected and transmitted to the central wireless Access Point (AP) for subsequent coherent multichannel signal processing. Although Fig. 1 shows an exemplary scenario with a wireless network with a central AP, this does not constrain the scope of the paper. For the considerations in this paper, the AP may be replaced by a network node acting as a local center implementing the multichannel signal processing algorithm. While the case of wired networks is also covered, their typically large transmission capacity may allow raw signal transmission and thereby limit the importance of the developed feature-based utility estimation. Instead of a specific signal processing algorithm, we consider a broad class of algorithms that rely on coherent input signals and do not require signals unrelated to the SOI, e.g., as noise references. In addition, we do not consider application-specific cost functions or performance metrics, e.g., Signal-to-Distortion Ratio (SDR) and Echo Return Loss Enhancement (ERLE), such that the proposed utility estimation scheme is appropriate for many subsequent multichannel signal processing applications. We instead generate utility estimates to match the ground-truth coherence between the SOI signal and the observed microphone signals. To this end, the proposed generic system comprises two subsystems, as depicted in Fig. 2: a feature extraction system, a copy of which runs for each microphone signal on the associated network node, and a utility estimation system running on the AP. In the feature extraction stage, characteristic signal features are extracted from the observed microphone signals independently of each other. 
No cross-channel features are employed in order to not exclude single-microphone network nodes. The feature sequences obtained from each microphone are then transmitted to the central AP, which estimates the individual microphones’ utility values by correlating the feature sequences. A set of Kalman Filters (KFs) with time-varying temporal smoothing provides a robust estimation framework for the feature covariance. Utility estimates are obtained by extracting structural information from the resulting covariance matrices via the corresponding Fiedler vector [31] that reflects a notion of average connectivity. A joint approach simultaneously considering single-channel features and network transmission cost was proposed in [32]. The efficacy of the proposed utility estimates for two specific important signal processing tasks, robust source localization and spatial filtering, was demonstrated in [33] and [34], respectively. Therein, sensor selection by optimizing the proposed utility measure has shown close-to-optimal performance, such that we focus only on the generic utility measure in this contribution.

Fig. 1

Scenario for an ASN with a single SOI captured by spatially distributed microphones

Fig. 2

System overview: In the feature extraction stage, characteristic signal feature sequences are computed for each microphone signal independently. Afterwards, the feature sequences from all microphones are collected at the AP and used to estimate each microphone’s utility.

In the remainder of this article, we review and provide more detailed descriptions of the model-based realizations of the two subsystems proposed in [32, 35] in Sections 3.1 and 3.2 by explicitly stating and discussing the model assumptions of the KF. Formulating microphone selection as a graph bi-partitioning problem, the Fiedler vector yields an optimum soft assignment of each individual microphone to one of the two groups of most and least useful microphones, which further justifies its use as a utility measure. In Section 3.3, we provide new results on the suitability of established signal features for recovering inter-channel MSC. To this end, the feature selection task is formulated as a Least Absolute Shrinkage and Selection Operator (LASSO) regression problem which is then solved numerically to obtain an optimal set of signal features. In Section 4, we propose novel Machine Learning (ML)-based realizations for both subsystems whose combination can be learned in an end-to-end fashion, which constitutes a major contribution of this work. In Section 5, the efficacy of the proposed scheme and its individual components is validated. Different algorithmic variants, i.e., purely model-based, purely ML-based, and hybrid realizations of the proposed system, are investigated. To this end, comprehensive experiments for both synthesized and recorded data from realistic scenarios are conducted, including different reverberation times, additive noise and obstruction of sensors, different microphone arrangements, as well as static and moving SOIs.

2 Notation and signal model

In this article, scalar quantities are denoted by slanted non-bold symbols x, while vectors and matrices are denoted by bold-face lowercase \(\textbf{x}\) and uppercase symbols \(\textbf{X}\), respectively. Furthermore, \([\textbf{x}]_{m}\) denotes the m-th element of vector \(\textbf{x}\), and \([\textbf{X}]_{mm'}\) denotes the (m,\(m'\))-th element of matrix \(\textbf{X}\). The M-dimensional all-zeros and all-ones vectors are denoted by \(\textbf{0}_{M}\) and \(\textbf{1}_{M}\), respectively, the \(M \times M\) identity matrix is denoted by \(\textbf{I}_{M}\), and the operator \(\textrm{Diag}(\cdot )\) embeds the elements of its argument on the main diagonal of a square matrix. The Pearson correlation coefficient (PCC) of two M-element vectors \(\textbf{x}\), \(\textbf{y}\) is defined as

$$\begin{aligned} \mathcal {R}(\textbf{x},\textbf{y}) = \frac{\sum _{m=1}^{M} ([\textbf{x}]_{m} - \overline{x}) ([\textbf{y}]_{m} - \overline{y})}{\sqrt{ \sum _{m=1}^{M} ([\textbf{x}]_{m} - \overline{x})^2} \sqrt{\sum _{m=1}^{M} ([\textbf{y}]_{m} - \overline{y})^2 }} \end{aligned}$$
(1)

with means \(\overline{x} = \frac{1}{M} \sum _{m=1}^{M}[\textbf{x}]_{m}\) and \(\overline{y} = \frac{1}{M} \sum _{m=1}^{M}[\textbf{y}]_{m}\). It will be used as a normalized similarity measure for features in Section 3.2 and as performance measure for the experiments in Section 5.
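
As an illustration of (1), a minimal Python sketch of the PCC might look as follows. The function name `pcc` and the convention of returning 0 for a constant input vector, for which (1) is undefined, are assumptions made here and not part of the definition above.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences, cf. (1)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of the mean-removed elements.
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x)) * math.sqrt(
        sum((b - mean_y) ** 2 for b in y)
    )
    # Assumption: (1) is undefined for constant vectors; return 0.0 in that case.
    return num / den if den > 0.0 else 0.0
```

Because of the normalization, the result is invariant to offset and positive scaling of either argument, which is what makes the PCC usable as a similarity measure for features with different value ranges.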

In the following, let t denote the discrete-time sample index and let \(f \in \{1,\ldots ,F\}\) denote the feature index where F is the number of extracted features per channel. Recalling Fig. 1, we consider an acoustic scenario comprising a single coherent SOI recorded by J microphones, each of which represents a separate node in the ASN. The signal captured by the microphone indexed by \(j \in \mathcal {P} = \{1,\ldots ,J\}\) is

$$\begin{aligned} x_{j}[t] = s[t] * h_{j}[t] + n_{j}[t], \end{aligned}$$
(2)

where s[t] is the dry SOI signal, \(h_{j}[t]\) is the acoustic impulse response from the SOI to the j-th microphone, and \(*\) denotes linear convolution. Note that the SOI is not necessarily static, i.e., the acoustic impulse responses \(h_{j}[t]\) in (2) are considered time-invariant only for short observation intervals, but may change from one interval to the next as the SOI moves. The fully coherent spatial images of the SOI \(s[t] * h_{j}[t]\) are superimposed by a spatially diffuse or incoherent noise field, such that the mutual coherence between the noise components \(n_{j}[t], \;\forall {j}\in \mathcal {P}\) is negligibly small. Thus, observed correlation between two microphone signals \(x_{j}[t]\), \(x_{j'}[t]\) is predominantly caused by the common SOI signal. Although competing point-like sources are not explicitly modeled in (2), the proposed method is still applicable given sufficient temporal sparsity, i.e., time intervals where only one of the sources is active, provided that the identity of the active source changes slowly enough to be tracked by the KF. Furthermore, consider an ASN spanning two rooms each with its own SOI connected by, e.g., open doors. With the microphones in each room predominantly capturing their respective SOI, the realizations of the proposed system in Sections 3 and 4 can still facilitate a distinction of microphones w. r. t. the dominant SOI. In this case, the scenario essentially decouples into two separate problems, but it is generally not known in advance which of the two possible solutions is found. To ensure a deterministic selection, additional source selection mechanisms exploiting preference information are needed, which is beyond the scope of this paper. In any case, the signal model (2) should be viewed as a first step towards developing methods for more general acoustic scenarios.
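
To make the signal model (2) concrete, the following Python sketch synthesizes one microphone signal from a dry source signal, an impulse response, and additive noise. It is only an illustration: the function names are hypothetical, the impulse response is treated as time-invariant over the whole signal, and the noise is assumed to be white and Gaussian, which is one simple instance of an incoherent noise field.

```python
import random


def convolve(s, h):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(s) + len(h) - 1)
    for i, s_i in enumerate(s):
        for l, h_l in enumerate(h):
            out[i + l] += s_i * h_l
    return out


def microphone_signal(s, h_j, noise_std, rng):
    """x_j[t] = (s * h_j)[t] + n_j[t], cf. (2).

    Assumption: n_j[t] is white Gaussian noise with standard deviation noise_std,
    drawn independently for each microphone via its own calls to rng.
    """
    image = convolve(s, h_j)  # spatial image of the SOI at microphone j
    return [v + rng.gauss(0.0, noise_std) for v in image]
```

Calling `microphone_signal` once per microphone with different impulse responses and the same `s` yields signals whose mutual correlation stems only from the common source, as assumed in the text.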

As the proposed utility estimation relies only on the correlation of feature sequences computed from signal frames, only a coarse synchronization of the signal frames between different sensors has to be assured, such that the proposed scheme is practically relevant. However, we assume that the sensor signals are synchronous to compute the oracle MSC between each microphone and the source. The same holds for the microphone pair-wise complex coherence function, and thus MSC, which are the foundation of the baselines baseline-CDR and baseline-MSC, respectively, in Section 5.

To this end, the signals are partitioned into blocks indexed by \(k\in \{1,\ldots ,K\}\) with a length of \({L_{\textrm{b}}}\) samples and a shift of \({L_{\textrm{s}}}\) between successive blocks, e.g., for the j-th microphone signal \(x_{j}[t]\),

$$\begin{aligned} \textbf{x}_{j}[k] = \left[ \begin{array}{ccc} x_{j}[k{L_{\textrm{s}}}],&\ldots ,&x_{j}[k{L_{\textrm{s}}}+{L_{\textrm{b}}}-1] \end{array}\right]^{\textrm{T}} \in \mathbb {R}^{{L_{\textrm{b}}}}. \end{aligned}$$
(3)
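
The block partitioning in (3) can be sketched in Python as follows. The function name is hypothetical, blocks are indexed from zero here for convenience, and incomplete trailing blocks are dropped; (3) itself does not prescribe how signal boundaries are handled.

```python
def partition_into_blocks(x, block_len, block_shift):
    """Split x into blocks of block_len samples with a shift of block_shift, cf. (3).

    Assumption: zero-based block indexing; a trailing block that would run past
    the end of x is discarded rather than zero-padded.
    """
    if block_len <= 0 or block_shift <= 0:
        raise ValueError("block_len and block_shift must be positive")
    blocks = []
    start = 0
    while start + block_len <= len(x):
        blocks.append(x[start:start + block_len])
        start += block_shift
    return blocks
```

Choosing `block_shift` smaller than `block_len` produces overlapping blocks, while equal values give a non-overlapping partition.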

With the discrete frequency bin index \(n\in \{1,\ldots ,{L_{\textrm{b}}}\}\), let \(\hat{{\Phi }}_{s,x_{j}}[k,n]\), \(\hat{{\Phi }}_{s,s}[k,n]\) and \(\hat{{\Phi }}_{x_{j},x_{j}}[k,n]\) denote short-time estimates of the respective cross-Power Spectral Density (PSD) and auto-PSDs of s[t] and \(x_{j}[t]\), e.g.,

$$\begin{aligned} \hat{{\Phi }}_{s,x_{j}}[k,n] = \hat{\mathbb {E}}\left( \left[ \textrm{DFT}_{{L_{\textrm{b}}}} \left( \textbf{s}[k]\right) \right] _{n} \cdot \left[ \textrm{DFT}_{{L_{\textrm{b}}}} \left( \textbf{x}_{j}[k]\right) \right] _{n}^{*} \right) , \end{aligned}$$
(4)

where \(\textrm{DFT}_{{L_{\textrm{b}}}}\) denotes the \({L_{\textrm{b}}}\)-point Discrete Fourier Transform (DFT) and \((\cdot )^{*}\) denotes complex conjugation. As a broadband ground-truth utility measure for the j-th microphone, the frequency-averaged narrowband MSC

$$\begin{aligned} \gamma _{j}[k] = \frac{1}{{L_{\textrm{b}}}} \sum _{n=1}^{{L_{\textrm{b}}}} \left| \frac{\hat{{\Phi }}_{s,x_{j}}[k,n]}{\sqrt{\hat{{\Phi }}_{s,s}[k,n] \cdot \hat{{\Phi }}_{x_{j},x_{j}}[k,n]}} \right| ^2 \end{aligned}$$
(5)

between the (latent) source signal s[t] and the j-th microphone signal \(x_{j}[t]\) is used. Note that we drop the superscript \(\hat{\cdot }\) from \(\gamma _{j}[k]\) in (5) for notational simplicity in the following. Under the assumption that the SOI signal s[t] and noise signals \(n_{j}[t]\) are mutually uncorrelated, and that the acoustic impulse responses in (2) are much shorter than the DFT length, the approximations

$$\begin{aligned} \hat{{\Phi }}_{s,x_{j}}[k,n]\approx & {} H_{j}^{*}[k,n] \, \hat{{\Phi }}_{s,s}[k,n], \end{aligned}$$
(6)
$$\begin{aligned} \hat{{\Phi }}_{x_{j},x_{j}}[k,n]\approx & {} \left| H_{j}[k,n]\right| ^{2} \, \hat{{\Phi }}_{s,s}[k,n] + \hat{{\Phi }}_{n_{j},n_{j}}[k,n] \end{aligned}$$
(7)

hold, where \(H_{j}[k,n]\) denotes the acoustic transfer function from the SOI to the j-th microphone in block k and frequency bin n, i.e., the DFT-domain counterpart of \(h_{j}[t]\). In practice, MSC estimates derived from (6) and (7) are subject to detrimental effects stemming from the combination of limited temporal observation intervals and the characteristics of the acoustic impulse responses between SOI and the microphones as captured by, e.g., relative time delay and Direct-to-Reverberation Ratio (DRR). Nevertheless, since coherent multichannel processing algorithms also degrade with the same impairments, the degradation in estimation accuracy of the coherence can be assumed to be correlated with the performance of signal processing algorithms and, hence, utility of the involved sensors.

With the approximations (6) and (7), the summands in (5) simplify to

$$\begin{aligned} \left| \frac{\hat{{\Phi }}_{s,x_{j}}[k,n]}{\sqrt{\hat{{\Phi }}_{s,s}[k,n] \cdot \hat{{\Phi }}_{x_{j},x_{j}}[k,n]}} \right| ^2 = \frac{\widehat{\textrm{SNR}}_{j}[k,n]}{1 + \widehat{\textrm{SNR}}_{j}[k,n]}, \end{aligned}$$
(8)

with the channel-wise SNR

$$\begin{aligned} \widehat{\textrm{SNR}}_{j}[k,n] = \frac{\left| H_{j}[k,n]\right| ^{2} \,\hat{{\Phi }}_{s,s}[k,n]}{\hat{{\Phi }}_{n_{j},n_{j}}[k,n]}. \end{aligned}$$
(9)
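
For completeness, the step from (6) and (7) to (8) can be written out as follows. This is a sketch in which the block and frequency indices [k,n] are omitted for brevity and the last equality follows from dividing numerator and denominator by the noise PSD:

$$\begin{aligned} \left| \frac{\hat{{\Phi }}_{s,x_{j}}}{\sqrt{\hat{{\Phi }}_{s,s} \cdot \hat{{\Phi }}_{x_{j},x_{j}}}} \right| ^2 \approx \frac{\left| H_{j}\right| ^{2} \, \hat{{\Phi }}_{s,s}^{2}}{\hat{{\Phi }}_{s,s} \left( \left| H_{j}\right| ^{2} \, \hat{{\Phi }}_{s,s} + \hat{{\Phi }}_{n_{j},n_{j}} \right) } = \frac{\left| H_{j}\right| ^{2} \, \hat{{\Phi }}_{s,s}}{\left| H_{j}\right| ^{2} \, \hat{{\Phi }}_{s,s} + \hat{{\Phi }}_{n_{j},n_{j}}} = \frac{\widehat{\textrm{SNR}}_{j}}{1 + \widehat{\textrm{SNR}}_{j}}. \end{aligned}$$

As a numerical illustration, an SNR of 1 (i.e., 0 dB) corresponds to a narrowband MSC of 0.5, and an SNR of 9 corresponds to an MSC of 0.9.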

Clearly, the narrowband MSC \(\gamma _{j}[k,n]\), i.e., the n-th summand in (5), is a monotonically increasing function of the SNR, with extremal values \(\gamma _{j}[k,n]=0\) for \(\widehat{\textrm{SNR}}_{j}[k,n]=0\), and \(\gamma _{j}[k,n]\rightarrow 1\) for \(\widehat{\textrm{SNR}}_{j}[k,n]\rightarrow \infty\). The frequency-averaged source-microphone MSC values of all J microphones are collected in the vector

$$\begin{aligned} \varvec{\gamma }[k] = \left[ \begin{array}{ccc} \gamma _{1}[k],&\ldots ,&\gamma _{J}[k] \end{array}\right] ^{\textrm{T}} \in [0,1]^{J}. \end{aligned}$$
(10)
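
The frequency-averaged MSC in (5) could be estimated along the following lines. This Python sketch rests on several assumptions that are not taken from the text: the expectation operator is realized as a plain average over a handful of consecutive blocks, the cross-spectrum uses the usual conjugation of the second factor, a naive DFT is used to keep the example dependency-free, and a small constant `eps` guards against division by zero. All function names are hypothetical.

```python
import cmath


def dft(block):
    """Naive O(L^2) DFT of a real-valued block (adequate for a small sketch)."""
    length = len(block)
    return [
        sum(block[t] * cmath.exp(-2j * cmath.pi * n * t / length) for t in range(length))
        for n in range(length)
    ]


def averaged_msc(s_blocks, x_blocks, eps=1e-12):
    """Frequency-averaged MSC between two block sequences, cf. (5).

    Assumption: the PSD estimates are sums over the supplied blocks; the common
    normalization by the number of blocks cancels in the coherence ratio.
    """
    length = len(s_blocks[0])
    phi_sx = [0j] * length
    phi_ss = [0.0] * length
    phi_xx = [0.0] * length
    for s_blk, x_blk in zip(s_blocks, x_blocks):
        s_spec, x_spec = dft(s_blk), dft(x_blk)
        for n in range(length):
            phi_sx[n] += s_spec[n] * x_spec[n].conjugate()
            phi_ss[n] += abs(s_spec[n]) ** 2
            phi_xx[n] += abs(x_spec[n]) ** 2
    msc = [abs(phi_sx[n]) ** 2 / (phi_ss[n] * phi_xx[n] + eps) for n in range(length)]
    return sum(msc) / length
```

Note that averaging over several blocks is essential in such an estimator: with a single block, the estimated MSC is 1 in every bin regardless of the signals, so it carries no information about their actual coherence.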

3 Model-based utility estimation using Spectral Graph Partitioning

We first review the model-based realizations of [32, 35] in Sections 3.1 and 3.2. Although there is no strictly analytical relation between the extracted feature values and the utility values, the approach is based on the notion that the PCCs of the different extracted feature sequences all reflect the same pair-wise similarity of the underlying microphone signals. Due to this model assumption, and to differentiate it from the ML-based approach in Section 4, which requires training data to determine the model parameters, this approach is termed model-based. Advancing the previous heuristic feature selection [36], we formulate the feature selection task as a LASSO regression problem with a sparsity-promoting regularizer in Section 3.3 to optimize the trade-off between accuracy and number of features to be transmitted. Solving this optimization problem yields an optimal selection of features for a set of representative acoustic scenarios.

3.1 Node-wise feature extraction

There is a wide variety of potential signal features [37, 38] to describe acoustic signals. Since acoustic scenarios are typically not static in practice due to, e.g., moving acoustic sources or obstructions, the usefulness of microphones is equally time-variant. Hence, within the comprehensive feature taxonomy in [37], we focus on features extracted from short observation intervals to characterize single-channel signals. The features may be computed in the time domain based on the digital signal waveform, or in the frequency domain based on the magnitude spectrum of the signals. As a result, we consider the following block-wise features:

  • Envelope of waveform

  • Zero-crossing rate

  • Statistical moments (centroid, standard deviation, skewness, kurtosis) of the signal waveform

  • Entropy of waveform

  • Statistical moments (centroid, standard deviation, skewness, kurtosis) of the magnitude spectrum

  • Spectral shape features (slope, power flatness, amplitude flatness, roll-off)

  • Temporal variation of magnitude spectra (spectral flux, spectral flux of normalized magnitude spectra, spectral variation)

In [36], it was experimentally shown that three features (temporal skewness, temporal kurtosis, spectral flux) are suitable to recover the structure of the spatial MSC matrix of a set of microphone signals. However, the features were selected heuristically based on the visual similarity of the corresponding feature covariance matrices and the ground-truth MSC matrix. Therefore, a more rigorous discussion of the importance of specific signal features is provided in Section 3.3.
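
A few of the listed time-domain features can be sketched in Python as follows. The exact feature definitions are not given in the text, so the formulas below (sign-change rate for the zero-crossing rate, standardized central moments for skewness and kurtosis) are common textbook choices assumed here for illustration; the function names are hypothetical.

```python
def zero_crossing_rate(block):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(block) < 2:
        raise ValueError("block must contain at least two samples")
    crossings = sum(1 for a, b in zip(block, block[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(block) - 1)


def standardized_moment(block, order):
    """Central moment of the given order, normalized by the standard deviation."""
    n = len(block)
    mean = sum(block) / n
    var = sum((v - mean) ** 2 for v in block) / n
    if var == 0.0:
        return 0.0  # assumption: constant blocks are mapped to a neutral value
    return sum((v - mean) ** order for v in block) / n / var ** (order / 2)


def temporal_features(block):
    """One feature value per block and feature: a few scalars instead of a waveform."""
    return {
        "zero_crossing_rate": zero_crossing_rate(block),
        "skewness": standardized_moment(block, 3),
        "kurtosis": standardized_moment(block, 4),
    }
```

Each network node would evaluate such a function once per signal block and transmit only the resulting scalars.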

Generally, a single feature sequence, i.e., a sequence of feature values over several time frames, is insufficient to characterize the signals, since the extraction of each signal feature can at best maintain information about the original signal [39], but typically incurs a loss of information. When multiple sufficiently different features are used, they capture different parts of the information contained in the signals, such that they complement each other in describing the original signals. Thus, jointly processing such different features allows for a more accurate characterization of the signals compared to a single feature.

To this end, given a signal block \(\textbf{x}_{j}[k]\), let \(a_{j}^{(f)}[k]\) denote the observed value of the signal feature \(f\in \{1,\ldots ,F\}\) for said signal block. Collecting the feature values of different channels for time frame k yields the instantaneous feature vector

$$\begin{aligned} \textbf{a}^{(f)}[k] = \left[ \begin{array}{ccc} a_{1}^{(f)}[k],&\ldots ,&a_{J}^{(f)}[k] \end{array}\right] ^{\textrm{T}} \in \mathbb {R}^{J}. \end{aligned}$$
(11)

Unlike the signal waveforms, which require precise synchronization of the sampling clocks for joint processing, the feature values of different microphones are much less susceptible to asynchronous sampling. With only a single feature value every \({L_{\textrm{s}}}\) signal samples, sampling rate offsets on the order of tens of parts per million (PPM) barely affect the extracted feature sequences. Hence, periodic coarse synchronization of the signal block boundaries is sufficient to avoid excessive drift of the observation windows in different network nodes, allowing synchronization to occur less frequently and with lower accuracy requirements.
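
The effect of a sampling rate offset on the block alignment can be estimated with simple arithmetic, as in the following Python sketch. The numbers used in the accompanying example (50 PPM, 16 kHz sampling rate, 1024-sample blocks) are hypothetical values chosen for illustration and are not taken from the experiments.

```python
def drift_in_samples(offset_ppm, sample_rate_hz, duration_s):
    """Accumulated misalignment in samples between two clocks after duration_s."""
    return offset_ppm * 1e-6 * sample_rate_hz * duration_s


def drift_fraction_of_block(offset_ppm, sample_rate_hz, duration_s, block_len):
    """Misalignment of the observation windows relative to the block length."""
    return drift_in_samples(offset_ppm, sample_rate_hz, duration_s) / block_len
```

For the hypothetical values above, the windows of two nodes drift apart by about 8 samples within 10 s, which is less than 1% of a 1024-sample block. A drift of this size would be severe for waveform-level processing but changes block-wise feature values only marginally, which is why occasional coarse re-alignment of the block boundaries suffices.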

3.2 Utility estimation

In this section, we review the utility estimation scheme based on correlation of feature sequences originally proposed in [32, 35, 36] and show its relation to established Graph Bisection techniques. The model-based utility estimation comprises three steps, which are outlined in the following subsections:

  1. Robustly estimate the cross-channel PCCs of the feature sequences separately for each feature via a set of KFs

  2. Fuse the information contained in the PCCs from different features

  3. Estimate each microphone’s utility from the fused information by means of Spectral Graph Partitioning

For clarity, a visual guide of these steps and the involved matrices and vectors is provided in Fig. 3.

Fig. 3

Overview of model-based utility estimation. For clarity, only two features are illustrated in Step 1

Step 1) Feature correlation coefficients: For computing the PCCs, first the cross-channel covariance matrices

$$\begin{aligned} \textbf{B}^{(f)}[k]&= \left[ \begin{array}{ccc} b_{1,1}^{(f)}[k] &{} \cdots &{} b_{1,J}^{(f)}[k] \\ \vdots &{} &{} \vdots \\ b_{J,1}^{(f)}[k] &{} \cdots &{} b_{J,J}^{(f)}[k] \end{array}\right] = \mathbb {\hat{E}}\left( \textbf{A}^{(f)}[k] \right) \end{aligned}$$
(12)

are estimated for each feature \(f \in \{1,\ldots ,F\}\) separately. Therein, \(\mathbb {\hat{E}}\) denotes an approximate statistical expectation operator whose practical realization we discuss below. Furthermore, the matrix

$$\begin{aligned} \textbf{A}^{(f)}[k]&= \left( \textbf{a}^{(f)}[k] - \overline{\textbf{a}}^{(f)}[k]\right) \left( \textbf{a}^{(f)}[k] - \overline{\textbf{a}}^{(f)}[k] \right) ^{\textrm{T}} \end{aligned}$$
(13)

is the outer product of the instantaneous observed feature vector \(\textbf{a}^{(f)}[k]\) after subtracting its recursive temporal average

$$\begin{aligned} \overline{\textbf{a}}^{(f)}[k+1] = \lambda \overline{\textbf{a}}^{(f)}[k] + (1-\lambda ) \textbf{a}^{(f)}[k+1], \end{aligned}$$
(14)

controlled by the recursive averaging factor \(\lambda \in [0,1]\) with initial value \(\overline{\textbf{a}}^{(f)}[0] = \textbf{0}_{J}\).
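
The recursive mean (14) and the instantaneous outer product (13) can be sketched in Python as follows, with vectors represented as lists and matrices as lists of rows. The function names are hypothetical.

```python
def update_running_mean(mean_prev, a_new, lam):
    """Recursive temporal average, cf. (14): lam * old mean + (1 - lam) * new vector."""
    return [lam * m + (1.0 - lam) * a for m, a in zip(mean_prev, a_new)]


def centered_outer_product(a, mean):
    """Instantaneous matrix A^(f)[k], cf. (13): outer product of the mean-removed vector."""
    d = [a_i - m_i for a_i, m_i in zip(a, mean)]
    return [[d_i * d_j for d_j in d] for d_i in d]
```

A value of `lam` close to 1 yields a slowly varying mean that forgets old feature values only gradually, whereas `lam = 0` reduces the mean to the current feature vector, in which case the outer product vanishes.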

Note that the estimated \(\textbf{B}^{(f)}[k]\) is generally time-variant to account for the aforementioned SOI movement, and thus online estimation is preferred over batch estimation. In order to track this temporal variability, we use a separate KF [40] for each feature f. Let the latent state vector at time frame k be denoted by \(\textbf{z}^{(f)}[k]\). Its mean vector \(\varvec{\mu }^{(f)}[k]\) captures the covariance matrix \(\textbf{B}^{(f)}[k]\) to be estimated and the instantaneous observation vector \(\varvec{\xi }^{(f)}[k]\) captures the matrix \(\textbf{A}^{(f)}[k]\). Since both \(\textbf{B}^{(f)}[k]\) and \(\textbf{A}^{(f)}[k]\) are symmetric, it is sufficient to only consider their non-redundant elements. We choose their diagonal elements and strictly lower triangular elements, such that the dimensionality of the state vector \(\textbf{z}^{(f)}[k]\), the mean vector \(\varvec{\mu }^{(f)}[k]\), and the observation vector \(\varvec{\xi }^{(f)}[k]\) can be chosen to be only \(Q = \frac{J (J+1)}{2}\) instead of \(J^2\) while still precisely modeling the full matrices. This can be expressed compactly using the half-vectorization operator \(\textrm{vech}\) [41] collecting all relevant matrix elements in vectors, i.e.,

$$\begin{aligned} \varvec{\mu }^{(f)}[k]= & {} \textrm{vech} \left( \textbf{B}^{(f)}[k] \right) \in \mathbb {R}^{Q}, \end{aligned}$$
(15)
$$\begin{aligned} \varvec{\xi }^{(f)}[k]= & {} \textrm{vech}\left( \textbf{A}^{(f)}[k] \right) \in \mathbb {R}^{Q}. \end{aligned}$$
(16)
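
A minimal Python sketch of the half-vectorization used in (15) and (16), together with its inverse for symmetric matrices, is given below. The column-wise stacking order follows the common definition of the operator; the function names, in particular `unvech`, are hypothetical.

```python
def vech(matrix):
    """Stack the diagonal and below-diagonal entries of a square matrix column by column."""
    size = len(matrix)
    return [matrix[row][col] for col in range(size) for row in range(col, size)]


def unvech(vector, size):
    """Rebuild the symmetric size-by-size matrix from its half-vectorization."""
    matrix = [[0.0] * size for _ in range(size)]
    idx = 0
    for col in range(size):
        for row in range(col, size):
            # Mirror each entry across the main diagonal to restore symmetry.
            matrix[row][col] = matrix[col][row] = vector[idx]
            idx += 1
    return matrix
```

Since `unvech(vech(B), J)` recovers a symmetric matrix `B` exactly, the filter can operate on the shorter vectors without losing information about the full matrices.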

The state-transition and observation equations of the KF are

$$\begin{aligned} \textbf{z}^{(f)}[k+1]= & {} \textbf{z}^{(f)}[k] + \textbf{t}^{(f)}[k], \end{aligned}$$
(17)
$$\begin{aligned} \varvec{\xi }^{(f)}[k]= & {} \textbf{z}^{(f)}[k] + \textbf{o}^{(f)}[k], \end{aligned}$$
(18)

where \(\textbf{t}^{(f)}[k]\) and \(\textbf{o}^{(f)}[k]\) denote the state-transition and observation noise vectors, respectively. In other words, the most probable state transition is that the utility, and hence the feature covariance, stays the same. However, if the state does change, it has no predictable preferred direction. Similarly simple state-transition models are effectively used in acoustic echo cancellation [42] and dereverberation [43]. With (17) and (18), the KF simplifies to temporal smoothing, albeit with a time-variant smoothing constant. Compared to fixed averaging constants, this allows placing higher confidence in observations with high signal energy, i.e., likely SOI activity.

Assuming a normally distributed latent state vector \(\textbf{z}^{(f)}[k]\) like in [35] for simplicity and mathematical tractability leads to the prior distribution

$$\begin{aligned} p\left( \textbf{z}^{(f)}[k]\right) = \mathcal {N}\left( \textbf{z}^{(f)}[k] \;|\; \varvec{\mu }^{(f)}[k], \varvec{\Sigma }_{\textrm{s}}^{(f)}[k]\right) \end{aligned}$$
(19)

with the aforementioned mean vector \(\varvec{\mu }^{(f)}[k]\) and covariance matrix \(\varvec{\Sigma }_{\textrm{s}}^{(f)}[k] \in \mathbb {R}^{Q \times Q}\). Since the trend of \(\textbf{z}^{(f)}[k]\) is neither known nor easily modeled, we assume a zero-mean Gaussian random walk with transition distribution

$$\begin{aligned} p\left( \textbf{z}^{(f)}[k+1] \;|\; \textbf{z}^{(f)}[k]\right) = \ldots \nonumber \\ \qquad \mathcal {N}\left( \textbf{z}^{(f)}[k+1] \;|\; \textbf{z}^{(f)}[k], \varvec{\Sigma }_{\textrm{t}} \right) \end{aligned}$$
(20)

as it is the least informative model but, due to the Central Limit Theorem (CLT) [44], fits well for natural processes where changes in the latent state are often the result of many independent influences. In order to remain agnostic to the source-microphone arrangement in different scenarios, the time-invariant and feature-independent process noise covariance matrix is chosen as a scaled identity matrix

$$\begin{aligned} \varvec{\Sigma }_{\textrm{t}} = \alpha _1 \textbf{I}_{Q} \in \mathbb {R}^{Q\times {Q}}, \end{aligned}$$
(21)

where \(\alpha _1 \in \mathbb {R}^{+}\) is a positive tunable parameter. Intuitively, two closely spaced microphones produce similar feature sequences and thus the way their estimated PCCs w. r. t. a third microphone change over time will be correlated. While these scenario-specific correlations could in principle be exploited for more accurate estimation by tailoring \(\varvec{\Sigma }_{\textrm{t}}\) to the scenario, doing so would harm the generalization of the transition model to other scenarios and furthermore requires the acquisition of sufficient data to estimate an optimal \(\varvec{\Sigma }_{\textrm{t}}\). Therefore, to avoid biasing the random walk process, we choose not to model these correlations, i.e., keep \(\varvec{\Sigma }_{\textrm{t}}\) diagonal.

Choosing the least informative emission model for the observations \(\varvec{\xi }^{(f)}[k]\) for simplicity as well yields the multivariate Gaussian emission distribution

$$\begin{aligned} p\left( \varvec{\xi }^{(f)}[k] \;|\; \textbf{z}^{(f)}[k]\right) = \mathcal {N}\left( \varvec{\xi }^{(f)}[k] \;|\; \textbf{z}^{(f)}[k], \varvec{\Sigma }_{\textrm{o}}[k] \right) \end{aligned}$$
(22)

with the observation noise covariance matrix

$$\begin{aligned} \varvec{\Sigma }_{\textrm{o}}[k] = \alpha _2 \left( \text {Diag}\left( \text {vech}\left( \textbf{E}[k] \right) \right) \right) ^{-1} \in \mathbb {R}^{Q\times {Q}}. \end{aligned}$$
(23)

Therein, \(\alpha _2 \in \mathbb {R}^{+}\) is a positive tunable parameter and the matrix \(\textbf{E}[k] \in \mathbb {R}^{J\times {J}}\) contains the geometric means of signal frame energies \(e_{j}[k] = \Vert \textbf{x}_{j}[k] \Vert _{2}^{2}\) (see (3)) reflecting the signal variances for each microphone pair

$$\begin{aligned} \left[ \textbf{E}[k]\right] _{jj'} = \sqrt{e_{j}[k] \cdot e_{j'}[k]} + \epsilon , \quad \forall j, j' \in \mathcal {P}. \end{aligned}$$
(24)

The small positive constant \(\epsilon\) ensures invertibility of \(\varvec{\Sigma }_{\textrm{o}}[k]\) in (23) during speech absence periods. This choice is motivated by the notion that the observed feature values are better at characterizing the microphone signals during time frames with high signal energy \(e_{j}[k]\), i.e., the observation noise of the KF is inversely related to the signal energy \(e_{j}[k]\).
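As a minimal sketch of (23) and (24), the diagonal of \(\varvec{\Sigma }_{\textrm{o}}[k]\) can be computed directly from the frame energies. The function name, the default value of `eps`, and the element ordering of the half-vectorization are assumptions made for illustration; `alpha2` defaults to the value used later in the experiments.

```python
import numpy as np

def observation_noise_diag(frames, alpha2=50.0, eps=1e-8):
    """Diagonal of Sigma_o[k] in (23) for one time frame.

    frames: array of shape (J, L_b), one signal block x_j[k] per microphone.
    Returns a vector of length Q = J (J + 1) / 2.
    """
    e = np.sum(frames ** 2, axis=1)          # frame energies e_j[k]
    E = np.sqrt(np.outer(e, e)) + eps        # pairwise geometric means, (24)
    # E is symmetric, so its row-major upper triangle equals the
    # column-major lower triangle, i.e., vech(E) (ordering is an assumption).
    vech_E = E[np.triu_indices(E.shape[0])]
    return alpha2 / vech_E                   # inverse of a diagonal matrix

# Hypothetical two-microphone example with energies 1 and 4
frames = np.array([[1.0, 0.0], [0.0, 2.0]])
sigma_o = observation_noise_diag(frames)
```

Pairs involving a high-energy microphone receive a smaller observation noise variance and therefore a larger weight in the filter update.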

With all components of the KF in place, the update equations for mean vector and state covariance are [45]

$$\begin{aligned} \varvec{\mu }^{(f)}[k+1]= & {} \varvec{\mu }^{(f)}[k] + \textbf{K}^{(f)}[k] \left( \varvec{\xi }^{(f)}[k] - \varvec{\mu }^{(f)}[k]\right) , \end{aligned}$$
(25)
$$\begin{aligned} \varvec{\Sigma }_{\textrm{s}}^{(f)}[k+1]= & {} \left( \varvec{\Sigma }_{\textrm{s}}^{(f)}[k] + \varvec{\Sigma }_{\textrm{t}} \right) \left( \textbf{I}_{Q} - \textbf{K}^{(f)}[k]\right) , \end{aligned}$$
(26)

with the Kalman gain matrix

$$\begin{aligned} \textbf{K}^{(f)}[k] = \left( \varvec{\Sigma }_{\textrm{s}}^{(f)}[k] + \varvec{\Sigma }_{\textrm{t}} \right) \left( \varvec{\Sigma }_{\textrm{s}}^{(f)}[k] + \varvec{\Sigma }_{\textrm{t}} + \varvec{\Sigma }_{\textrm{o}}[k]\right) ^{-1} \end{aligned}$$
(27)

and initial values

$$\begin{aligned} \varvec{\mu }^{(f)}[0] = \textbf{0}_{Q}, \quad \quad \quad \varvec{\Sigma }_{\textrm{s}}^{(f)}[0]&= \textbf{I}_{Q}. \end{aligned}$$
(28)

Note that the updates in (25) to (27) can be computed very efficiently since all involved matrices are diagonal.
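Because all matrices are diagonal, the recursion reduces to element-wise operations on vectors. The following sketch illustrates one update step under this assumption; the function name and the example observation values are hypothetical.

```python
import numpy as np

def kf_step(mu, sigma_s, xi, sigma_o, alpha1=1.0):
    """One update of (25)-(27); all covariances are stored as diagonals."""
    prior = sigma_s + alpha1             # Sigma_s[k] + Sigma_t, with (21)
    gain = prior / (prior + sigma_o)     # Kalman gain (27)
    mu_next = mu + gain * (xi - mu)      # mean update (25)
    sigma_next = prior * (1.0 - gain)    # covariance update (26)
    return mu_next, sigma_next

Q = 3
mu, sigma_s = np.zeros(Q), np.ones(Q)    # initial values (28)
xi = np.array([0.4, -0.2, 1.0])          # hypothetical observation
sigma_o = np.full(Q, 2.0)                # hypothetical observation noise
mu, sigma_s = kf_step(mu, sigma_s, xi, sigma_o)
```

With these numbers the gain is 0.5 for every element, so the mean moves halfway towards the observation and the state variance returns to one.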

For each time frame, after updating the KFs for all features \(f \in \{1,\ldots ,F\}\), the elements of the covariance matrix \(\textbf{B}^{(f)}[k]\) are recovered from the KF mean vector \(\varvec{\mu }^{(f)}[k]\) by reversing the half-vectorization, i.e.,

$$\begin{aligned} b_{j,j'}^{(f)}[k] = \left[ \textbf{B}^{(f)}[k]\right] _{jj'} = \left[ \textrm{vech}^{-1} \left( \varvec{\mu }^{(f)}[k] \right) \right] _{jj'}. \end{aligned}$$
(29)

Finally, the elements of the per-feature PCC matrices \(\tilde{\textbf{B}}^{(f)}[k] \quad \forall {f}\) are obtained from the estimated covariances by normalization according to

$$\begin{aligned} \tilde{b}_{j,j'}^{(f)}[k] = \left[ \tilde{\textbf{B}}^{(f)}[k]\right] _{jj'} = \frac{b_{j,j'}^{(f)}[k]}{\sqrt{b_{j,j}^{(f)}[k]} \cdot \sqrt{b_{j',j'}^{(f)}[k]}}. \end{aligned}$$
(30)
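The two steps (29) and (30) can be sketched as follows. The helper names and the element ordering of the half-vectorization are assumptions for illustration; the sketch also assumes that the estimated variances on the diagonal are positive, which the KF mean does not strictly guarantee.

```python
import numpy as np

def vech_inverse(mu, J):
    """Rebuild the symmetric J x J matrix B from its half-vectorization (29)."""
    B = np.zeros((J, J))
    rows, cols = np.triu_indices(J)      # assumed ordering of the vech elements
    B[rows, cols] = mu
    B[cols, rows] = mu
    return B

def covariance_to_pcc(B):
    """Normalize covariances to correlation coefficients as in (30)."""
    std = np.sqrt(np.diag(B))            # requires positive diagonal entries
    return B / np.outer(std, std)

# Hypothetical KF mean for J = 2: variances 4 and 9, covariance 2
B = vech_inverse(np.array([4.0, 2.0, 9.0]), J=2)
B_tilde = covariance_to_pcc(B)
```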

Step 2) Feature combination: As outlined earlier, the PCC matrices of different features \(\tilde{\textbf{B}}^{(f)}[k]\) capture different aspects of the underlying inter-channel coherence. To recover an estimate of the inter-channel coherence from the multiple feature correlation coefficient matrices, we consider channel-wise matrices

$$\begin{aligned} \textbf{C}_{j}[k] = \left[ \begin{array}{ccc} \tilde{b}_{j,1}^{(1)}[k] &{} \ldots &{} \tilde{b}_{j,1}^{(F)}[k] \\ \vdots &{} &{} \vdots \\ \tilde{b}_{j,J}^{(1)}[k] &{} \ldots &{} \tilde{b}_{j,J}^{(F)}[k] \end{array}\right] \in \mathbb {R}^{J\times {F}}, \end{aligned}$$
(31)

where each \(\textbf{C}_{j}[k]\) contains the inter-channel PCCs of all F feature sequences of all J microphone channels w. r. t. the corresponding feature sequence of a reference channel j.

Note that each column of \(\textbf{C}_{j}[k]\), corresponding to one particular signal feature, models the same underlying inter-channel coherence. The PCCs of different features are then combined for each channel j by extracting the dominant column structure of \(\textbf{C}_{j}[k]\), i.e., finding its best rank-1 approximation in the Least Squares (LS) sense [46]

$$\begin{aligned} \underset{\sigma _{j}[k],\textbf{r}_{j}[k],\textbf{t}_{j}[k]}{\textrm{min}} \left\Vert \textbf{C}_{j}[k] - \sigma _{j}[k] \textbf{r}_{j}[k]\textbf{t}_{j}^{\textrm{T}}[k] \right\Vert ^{2}_{2}. \end{aligned}$$
(32)

Since \(\textbf{C}_{j}[k]\) is generally non-square, the solution of (32) is obtained by the Singular Value Decomposition (SVD), where \(\sigma _{j}[k] \in \mathbb {R}^{+}\) is the largest singular value of \(\textbf{C}_{j}[k]\), and \(\textbf{r}_{j}[k] \in \mathbb {R}^{J}\) and \(\textbf{t}_{j}[k] \in \mathbb {R}^{F}\) are the principal left and right singular vectors, respectively. The principal left singular vector \(\textbf{r}_{j}[k]\) captures the contribution of each channel to the dominant structure of \(\textbf{C}_{j}[k]\), while the principal right singular vector \(\textbf{t}_{j}[k]\) captures the contribution of each feature to the dominant structure.

To facilitate tracking of \(\textbf{r}_{j}[k]\) in time-variant scenarios and avoid recomputation of the full SVD in each time step, the principal left singular vector is instead iteratively refined over time. To this end, recall that the left singular vectors of \(\textbf{C}_{j}[k]\) are identical to the eigenvectors of the Gram matrix \(\textbf{C}_{j}[k] \textbf{C}_{j}^{\textrm{T}}[k]\) [46]. Thus, given an estimate from the previous time step, the principal singular vector can be estimated using power methods [46] as

$$\begin{aligned} \check{\textbf{r}}_{j}[k+1]= & {} \left( \textbf{C}_{j}[k+1] \textbf{C}_{j}^{\textrm{T}}[k+1]\right) \textbf{r}_{j}[k], \end{aligned}$$
(33)
$$\begin{aligned} \textbf{r}_{j}[k+1]= & {} \frac{\check{\textbf{r}}_{j}[k+1]}{\Vert \check{\textbf{r}}_{j}[k+1] \Vert _2}. \end{aligned}$$
(34)

Initial experiments comparing the power method and full SVD have shown that the spectrum of \(\textbf{C}_{j}[k]\) varies slowly over time such that a single iteration of (33) and (34) is sufficient to accurately track the principal singular vector for the proposed utility estimation scheme.
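To illustrate the update (33) and (34), the sketch below applies a few power-method steps to a synthetic matrix and compares the result with the full SVD. The matrix is a hypothetical near-rank-1 stand-in for \(\textbf{C}_{j}[k]\), and it is kept static here purely for illustration, whereas the proposed scheme performs one step per time frame on a slowly varying matrix.

```python
import numpy as np

def power_step(C_next, r_prev):
    """One power-method update of the principal left singular vector, (33)-(34)."""
    r_check = C_next @ (C_next.T @ r_prev)   # apply the Gram matrix without forming it
    return r_check / np.linalg.norm(r_check)

rng = np.random.default_rng(0)
J, F = 10, 4
# Hypothetical PCC matrix: a common column structure plus small perturbations
C = np.outer(rng.uniform(0.2, 1.0, J), np.ones(F)) + 0.05 * rng.standard_normal((J, F))

r = np.ones(J) / np.sqrt(J)                  # initial guess
for _ in range(3):
    r = power_step(C, r)

r_svd = np.linalg.svd(C)[0][:, 0]            # reference from the full SVD
alignment = abs(r @ r_svd)                   # close to 1 when both agree up to sign
```

Evaluating the product as \(\textbf{C}(\textbf{C}^{\textrm{T}}\textbf{r})\) avoids forming the \(J \times J\) Gram matrix explicitly.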

In order to restore the intuitive notion of a similarity measure, the estimated principal singular vectors from (34) are re-normalized such that the similarity of each channel to itself is equal to one, and then concatenated to form the overall channel similarity matrix

$$\begin{aligned} \textbf{R}[k] = \left[ \begin{array}{ccc} \frac{\textbf{r}_{1}[k]}{\left[ \textbf{r}_{1}[k]\right] _1},&\ldots ,&\frac{\textbf{r}_{J}[k]}{\left[ \textbf{r}_{J}[k]\right] _{J}} \end{array}\right] \in \mathbb {R}^{J\times {J}}. \end{aligned}$$
(35)

Step 3) Spectral Graph Partitioning: Microphone selection is equivalent to partitioning the set of available microphones \(\mathcal {P}\) into two, potentially time-variant, disjoint subsets comprising the selected and discarded microphones, respectively. Recalling the signal model (2), we use the convention that the former subset \(\mathcal {S}[k]\) contains the microphones capturing the SOI with high quality while the latter subset is its complement \(\overline{\mathcal {S}}[k]\) for those microphones dominated by the non-coherent noise field. Relaxing the hard assignment of microphones to these subsets to a soft assignment leads to continuous real-valued utility estimates as shown in the following. Spectral partitioning techniques [31, 47, 48] operating on graph structures can determine such optimal partitionings very efficiently, especially when the number of microphones J is large.

Thus, we model the pairwise similarity of microphone channels using a time-variant graph structure \(\mathcal {G}(\mathcal {V},\mathcal {E}[k])\) [47], comprising a set of vertices \(\mathcal {V}\) representing microphones and a set of weighted edges \(\mathcal {E}[k]\) representing the microphones’ similarity at time frame k. For each edge \((j,j',w_{jj'}[k]) \in \mathcal {E}[k]\), the weight \(w_{jj'}[k] \in [0,1]\) captures the similarity of microphones j and \(j'\). The graph is equivalently specified by its weighted adjacency matrix \(\textbf{W}[k] \in \mathbb {R}^{J \times J}\), containing all weights \(w_{jj'}[k], \forall j, j' \in \mathcal {P}\). The pairwise microphone similarity should be a symmetric measure, i.e., channel j should be as similar to \(j'\) as channel \(j'\) is to j, such that \(w_{jj'}[k] = w_{j'{j}}[k]\). To reflect this symmetry and the varying degrees of similarity, the graph should be undirected and weighted. Since the matrix \(\textbf{R}[k]\) in (35) does not necessarily exhibit these properties, only the symmetric part of its element-wise magnitude is used to construct the weighted adjacency matrix \(\textbf{W}[k]\), i.e.,

$$\begin{aligned} w_{jj'}[k]= & {} \left[ \textbf{W}[k]\right] _{jj'} \nonumber \\= & {} \frac{1}{2} \left( \left| \left[ \textbf{R}[k]\right] _{jj'}\right| + \left| \left[ \textbf{R}[k]\right] _{j'{j}}\right| \right) . \end{aligned}$$
(36)

The degree [47] of the j-th vertex is defined as the sum of all outgoing edges’ weights

$$\begin{aligned} d_{j}[k] = \sum \limits _{j'=1}^{J} w_{jj'}[k], \end{aligned}$$
(37)

which are collected in the diagonal degree matrix

$$\begin{aligned} \textbf{D}[k] = \textrm{Diag} \left\{ d_{1}[k], \ldots , d_{J}[k]\right\} . \end{aligned}$$
(38)

Note that \(d_{j}[k] \ge 1, \forall {j}\in \mathcal {P}\) since the sum in (37) includes \(w_{jj}[k]=1\), which ensures invertibility of \(\textbf{D}[k]\) even for degenerate graphs.
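The construction in (35) to (38) can be sketched in a few lines. The function name and the input layout are assumptions for illustration, and the sketch presumes that the j-th entry of each \(\textbf{r}_{j}[k]\) is non-zero.

```python
import numpy as np

def similarity_graph(r_vectors):
    """Build R (35), W (36), and the vertex degrees (37) from the vectors r_j.

    r_vectors: array of shape (J, J) whose row j holds r_j[k].
    """
    # Column j of R is r_j scaled so that its j-th entry equals one
    R = (r_vectors / np.diag(r_vectors)[:, None]).T
    W = 0.5 * (np.abs(R) + np.abs(R).T)      # symmetric part of |R|
    d = W.sum(axis=1)                        # diagonal of the degree matrix D
    return R, W, d

# Hypothetical principal singular vectors for J = 2 microphones
r_vectors = np.array([[0.8, 0.6],
                      [-0.6, 0.8]])
R, W, d = similarity_graph(r_vectors)
```

In this example all self-similarities equal one, so every degree is at least one, as stated above.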

For an ideal partitioning, like for clustering, it is desirable that microphone signals belonging to the same group are similar while microphone signals belonging to different groups are dissimilar to allow for a clear distinction between the selected and the discarded microphones.

Using (2) gives an interpretation in the context of microphone selection: SOI-dominated microphones exhibit strongly mutually correlated feature sequences and thus form one of the two partition subsets, while the feature sequences of noise-dominated microphones are only weakly correlated with the SOI-dominated microphones, and thus form the other subset. In addition, even if the noise components \(n_{j}[t]\) are uncorrelated, their features likely are correlated, especially if they capture underlying statistics like variance. These inter-group and intra-group similarities of a set \(\mathcal {S}[k] \subset \mathcal {P}\) and its complement \(\overline{\mathcal {S}}[k]\) are measured by [48]

$$\begin{aligned} \textrm{cut}(\mathcal {S}[k], \overline{\mathcal {S}}[k])= & {} \sum \limits _{{j \in \mathcal {S}[k], j' \in \overline{\mathcal {S}}[k]}} w_{j j'}[k], \end{aligned}$$
(39)
$$\begin{aligned} \textrm{vol}(\mathcal {S}[k])= & {} \sum \limits _{j \in \mathcal {S}[k]} d_{j}[k], \end{aligned}$$
(40)

respectively. Balancing the inter- and intra-group similarity to avoid degenerate solutions yields the normalized cut objective function [49]

$$\begin{aligned} \textrm{ncut}(\mathcal {S}[k],\overline{\mathcal {S}}[k])= & {} \textrm{cut}(\mathcal {S}[k],\overline{\mathcal {S}}[k]) \, \cdot \nonumber \\{} & {} \left( \frac{1}{\textrm{vol}(\mathcal {S}[k])} + \frac{1}{\textrm{vol}(\overline{\mathcal {S}}[k])} \right) . \end{aligned}$$
(41)
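As a small worked example of (39) to (41), the following sketch evaluates the normalized cut for two candidate partitions of a hypothetical four-microphone graph. The adjacency matrix is made up for illustration.

```python
import numpy as np

def ncut(W, in_S):
    """Normalized cut (41) of the partition given by the boolean mask in_S."""
    d = W.sum(axis=1)                                  # degrees (37)
    cut = W[np.ix_(in_S, ~in_S)].sum()                 # inter-group similarity (39)
    vol_S, vol_rest = d[in_S].sum(), d[~in_S].sum()    # volumes (40)
    return cut * (1.0 / vol_S + 1.0 / vol_rest)

# Hypothetical graph: microphones 0-1 and 2-3 form two tightly coupled groups
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
good = ncut(W, np.array([True, True, False, False]))
bad = ncut(W, np.array([True, False, True, False]))
```

The partition that keeps the coupled microphones together yields a much smaller objective value than the one that splits them.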

As shown in [49], minimization of (41) w. r. t. \(\mathcal {S}\) can be reformulated as minimization of the generalized Rayleigh quotient

$$\begin{aligned} \underset{{\mathcal {S}[k],\overline{\mathcal {S}}[k]}}{\textrm{min}} \; \textrm{ncut}(\mathcal {S}[k],\overline{\mathcal {S}}[k]) = \underset{\textbf{i}[k]}{\textrm{min}} \frac{\textbf{i}^{\textrm{T}}[k] \left( \textbf{D}[k] - \textbf{W}[k] \right) \textbf{i}[k]}{\textbf{i}^{\textrm{T}}[k] \textbf{D}[k] \textbf{i}[k]}, \end{aligned}$$
(42)

where \(\textbf{i}[k]\) is a J-dimensional discrete indicator vector satisfying

$$\begin{aligned} \textbf{i}^{\textrm{T}}[k] \textbf{D}[k] \textbf{1}_{J} = 0. \end{aligned}$$
(43)

Additionally, the elements of \(\textbf{i}[k]\) may only take either of two values [49]

$$\begin{aligned} \left[ \textbf{i}[k]\right] _{j} \in \left\{ 1,\; -\frac{\sum _{j' \in \mathcal {S}[k]} d_{j'}[k]}{\sum _{j' \in \overline{\mathcal {S}}[k]} d_{j'}[k]} \right\} . \end{aligned}$$
(44)

When the discreteness constraint (44) on \(\textbf{i}[k]\) is relaxed to allow arbitrary real values, i.e., \(\textbf{i}[k] \in \mathbb {R}^{J}\), the minimizer of the generalized Rayleigh quotient in (42) is a solution to the generalized eigenvalue problem

$$\begin{aligned} \left( \textbf{D}[k] - \textbf{W}[k]\right) \textbf{i}[k] = \lambda [k] \textbf{D}[k] \textbf{i}[k], \end{aligned}$$
(45)

where \(\lambda [k]\) is the generalized eigenvalue and \(\textbf{i}[k]\) is the generalized eigenvector. The equivalent standard eigenvalue problem is obtained by left-multiplication of \(\textbf{D}^{-1}[k]\)

$$\begin{aligned} \textbf{L}[k] \textbf{i}[k] = \lambda [k] \textbf{i}[k] \end{aligned}$$
(46)

with the normalized random-walk Laplacian matrix [47]

$$\begin{aligned} \textbf{L}[k]= & {} \textbf{D}^{-1}[k] \left( \textbf{D}[k] - \textbf{W}[k]\right) \nonumber \\= & {} \textbf{I}_{J} - \textbf{D}^{-1}[k] \textbf{W}[k]. \end{aligned}$$
(47)

Thus, an approximate minimizer of (42) is obtained by finding the smallest eigenvalue and its corresponding eigenvector of \(\textbf{L}[k]\). The trivial eigenvalue 0 and its corresponding eigenvector \(\textbf{1}_{J}\) [48] are excluded by the constraint (43). Thus, the solution is the so-called Fiedler vector \(\textbf{v}[k]\), i.e., the eigenvector corresponding to the smallest non-trivial eigenvalue of \(\textbf{L}[k]\) [48], which automatically satisfies (43) as shown in [49]. While an approximate solution to the discrete problem can be obtained by discretizing \(\textbf{v}[k]\), e.g., based on the sign of each element, here we use the real-valued solution directly as an estimate of the microphones’ utility.
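A minimal sketch of the Fiedler vector computation is given below. Instead of solving the non-symmetric eigenvalue problem (46) directly, it uses the symmetric matrix \(\textbf{D}^{-1/2}(\textbf{D} - \textbf{W})\textbf{D}^{-1/2}\), which shares its eigenvalues with \(\textbf{L}[k]\), and maps the eigenvector back. This substitution, the function name, and the example graph are choices made for illustration.

```python
import numpy as np

def fiedler_vector(W):
    """Eigenvector of L = I - D^{-1} W for the smallest non-trivial eigenvalue."""
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    # Symmetric counterpart of L, so that a symmetric eigensolver applies
    L_sym = np.eye(len(d)) - d_isqrt[:, None] * W * d_isqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)     # eigenvalues in ascending order
    v = d_isqrt * eigvecs[:, 1]                  # map back to an eigenvector of L
    return v / np.linalg.norm(v), eigvals[1], d

# Hypothetical graph: microphones 0-1 and 2-3 form two tightly coupled groups
W = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
v, lam, d = fiedler_vector(W)
```

For this graph, the signs of the elements of `v` separate the two groups, and `v` is orthogonal to \(\textbf{D}\textbf{1}_{J}\) in accordance with (43).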

As for any eigenvector, the scale and in particular the sign of \(\textbf{v}[k]\) are ambiguous, i.e., both \(-\textbf{v}[k]\) and \(\textbf{v}[k]\) are valid solutions to the eigenvalue problem (46). The same holds for the objective function (41), which is invariant to exchanging \(\mathcal {S}[k]\) with \(\overline{\mathcal {S}}[k]\). This ambiguity is usually not a problem for partitioning, since only the association of vertices to groups is desired, but not the identity of each group. In other words, the partitioning given by \(\textbf{v}[k]\) distinguishes between the most and least useful microphones, but does not say which group is which. Additionally, in low-SNR scenarios, noise-dominated microphone signals may exhibit large feature PCC values due to similar noise signal statistics despite only weakly coherent noise signals. To facilitate this distinction, we consider supplemental side information captured by a vector \(\varvec{\beta }[k]\) which is correlated with the preliminary utility estimates

$$\begin{aligned} \rho [k] = \mathcal {R}\left( \textbf{v}[k], \varvec{\beta }[k] \right) . \end{aligned}$$
(48)

Choices for \(\varvec{\beta }[k]\) are discussed below. Depending on the sign of the PCC \(\rho [k]\), the sign of the estimated utility values is flipped to produce the final utility estimates

$$\begin{aligned} \textbf{u}[k]&= \left\{ \begin{array}{rr} \textbf{v}[k] &{} \text {if } \rho [k] \ge 0 \\ -\textbf{v}[k] &{} \text {if } \rho [k] < 0 \\ \end{array}\right. . \end{aligned}$$
(49)
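The sign disambiguation in (48) and (49) amounts to a single correlation and a conditional sign flip, as the following sketch shows. The function name and the example values are hypothetical; \(\varvec{\beta }[k]\) is assumed to be larger for more useful microphones.

```python
import numpy as np

def resolve_sign(v, beta):
    """Flip the Fiedler vector v according to the sign of its PCC with beta, (48)-(49)."""
    rho = np.corrcoef(v, beta)[0, 1]         # Pearson correlation coefficient (48)
    return (v if rho >= 0 else -v), rho      # final utility estimates (49)

v = np.array([-0.5, -0.5, 0.5, 0.5])         # hypothetical Fiedler vector
beta = np.array([-3.1, -3.0, -3.9, -4.0])    # hypothetical side information
u, rho = resolve_sign(v, beta)
```

Here the side information favors the first two microphones while `v` assigns them negative values, so the correlation is negative and the vector is flipped.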

In [32], the supplemental information was chosen as the node degree, i.e., \(\left[ \varvec{\beta }[k]\right] _{j} = d_{j}[k]\). While this choice allows detection of outliers if the volumes of the two subsets in the partition are very different, i.e., a large majority of microphones is either useful or not useful, it also requires further assumptions or knowledge about the identity of the majority group, e.g., that the majority of microphones observes the desired SOI. To address these shortcomings, we consider typical SOI and interfering signals: typical SOI signals, especially speech, are non-Gaussian and exhibit spectro-temporal structure. Meanwhile, typical signal degradations, like reverberation or additive non-coherent noise, exhibit less or no structure, thus reducing the structure of the acoustic mixture. Thus, the differential signal entropy [39] is used to capture the structuredness of the observed signals

$$\begin{aligned} \left[ \varvec{\beta }[k]\right] _{j}= & {} - \mathcal {H} \left( \textbf{x}_{j}[k]\right) \nonumber \\= & {} \sum \limits _{{n_{\textrm{B}} = 0}}^{N_{\textrm{B}}-1} \hat{p}(n_{\textrm{B}}; k) \; \log _{2}(\hat{p}(n_{\textrm{B}}; k)) \end{aligned}$$
(50)

as in [35]. Therein, the Probability Density Function (PDF) is estimated by its \(N_{\textrm{B}}\)-bin histogram

$$\begin{aligned} \hat{p}(n_{\textrm{B}}; k) = \frac{1}{{L_{\textrm{b}}}} \left| \left\{ t,\; e_{n_{\textrm{B}}} \le \left[ \textbf{x}_{j}[k]\right] _{t} < e_{n_{\textrm{B}}+1} \right\} \right| \end{aligned}$$
(51)

with \(e_{n_{\textrm{B}}}\) denoting the histogram bin edges. Note that, for the experiments conducted in Section 5, the signal blocks used to estimate entropy in (50) are chosen longer than those for the feature extraction. The entire microphone utility estimation procedure using Spectral Graph Partitioning is concisely summarized as pseudocode in Algorithm 1.
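The histogram-based estimate in (50) and (51) can be sketched as follows. The number of bins and the use of equal-width bins spanning the range of the block are assumptions, since the bin edges are not specified here.

```python
import numpy as np

def negative_entropy(x, n_bins=32):
    """Histogram-based estimate of -H(x) in bits, following (50)-(51)."""
    counts, _ = np.histogram(x, bins=n_bins)     # equal-width bins over the range of x
    p = counts / counts.sum()                    # relative frequencies (51)
    p = p[p > 0]                                 # treat 0 * log2(0) as 0
    return float(np.sum(p * np.log2(p)))

# Hypothetical block whose samples are spread evenly over four bins
x = np.tile([0.0, 1.0, 2.0, 3.0], 8)
beta_j = negative_entropy(x, n_bins=4)
```

A block spread evenly over four bins yields two bits of entropy, i.e., a value of minus two, whereas a block concentrated in a single bin would yield zero.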

In the presence of point-like interferers, the signal model (2) no longer strictly holds, such that it should be understood as a first step towards developing methods for more general acoustic scenarios. Hence, somewhat degraded estimation performance must be expected, where the extent of degradation depends on the particular scenario. For example, in an ASN spanning two rooms each with their own SOI with only low-level cross-talk between rooms and low-level additive noise, groups of useful microphones for either SOI can be identified, which still matches well with the desired outcome. As a second example, consider an ASN in a single room, with two closely spaced point sources. For temporally overlapping source activity with both sources contributing similar signal power to each microphone, all microphones exhibit reduced utility w. r. t. either source as the other source is considered as noise, again matching qualitatively with reduced feature covariance. For source counting and associating the microphone subsets with the correct SOI, additional mechanisms need to be developed that are beyond the scope of this paper.


Algorithm 1 Recursive microphone utility update using Spectral Graph Partitioning

3.3 Importance of specific signal features

Choosing an appropriate set of characteristic signal features for the microphone signals is vital: too few features result in low estimation accuracy, while too many features unnecessarily strain the wireless network. Even for an appropriate number of features, inappropriate features may reduce overall estimation accuracy. To explore the importance of individual signal features, we formulate the feature selection as an LS regression problem with a sparsity-promoting regularizer in (53) below in order to obtain a low regression error while using as few features as possible. Specifically, we interpret the matrix \(\textbf{C}_{j}[k]\) as a dictionary matrix whose columns, or atoms, contain the cross-channel correlation coefficients between the reference channel j and all channels for one specific signal feature, and which are linearly combined to approximate the MSC of the observed microphone signals. However, for the purpose of estimating microphone utility and microphone selection, the relative utility of microphone channels is more important than the absolute values, such that the zero-mean MSC vector, i.e.,

$$\begin{aligned} \widetilde{\varvec{\gamma }}[k] = \varvec{\gamma }[k] - \left( \frac{1}{J}\sum \limits _{j=1}^{J} \gamma _{j}[k] \right) \textbf{1}_{J}, \end{aligned}$$
(52)

is used as the target quantity. Thus, the \(\ell _1\)-regularized LS cost function for a single acoustic scenario comprising J microphone signals with K time frames is

$$\begin{aligned} \mathcal {C}(\varvec{\phi }) = \frac{1}{J{K}} \sum \limits _{j=1}^{J} \sum \limits _{k=1}^{K} \Vert \widetilde{\varvec{\gamma }}[k] - \textbf{C}_{j}[k] \varvec{\phi } \Vert _{2}^{2} + \delta \Vert \varvec{\phi } \Vert _1, \end{aligned}$$
(53)

where \(\varvec{\phi } = \left[ \begin{array}{ccc} \phi _{1},&\ldots ,&\phi _{F} \end{array}\right] ^{\textrm{T}} \in \mathbb {R}^{F}\) captures the contribution of each feature and the parameter \(\delta \in \mathbb {R}^{+}\) indirectly controls the sparsity of the vector, i.e., the number of used features. The results of this optimization are shown in Section 5.2.
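One possible way to minimize (53) is proximal gradient descent with soft thresholding (ISTA), sketched below. The solver choice, the array layout, and the iteration count are assumptions for illustration and are not necessarily those used to produce the results in Section 5.2.

```python
import numpy as np

def cost(C, gamma, phi, delta):
    """Evaluate (53) directly from its definition."""
    J, K = C.shape[:2]
    target = gamma - gamma.mean(axis=1, keepdims=True)       # zero-mean MSC (52)
    residual = target[None, :, :] - C @ phi                  # shape (J, K, J)
    return np.sum(residual ** 2) / (J * K) + delta * np.sum(np.abs(phi))

def sparse_feature_weights(C, gamma, delta, n_iter=500):
    """Minimize (53) by ISTA.

    C: array (J, K, J, F) holding C_j[k]; gamma: array (K, J) of MSC vectors.
    """
    J, K, _, F = C.shape
    target = gamma - gamma.mean(axis=1, keepdims=True)
    A = C.reshape(J * K * J, F)                              # stack all C_j[k]
    y = np.tile(target.reshape(-1), J)                       # matching stacked targets
    scale = 2.0 / (J * K)
    step = 1.0 / (scale * np.linalg.norm(A, 2) ** 2)         # 1 / Lipschitz constant
    phi = np.zeros(F)
    for _ in range(n_iter):
        z = phi - step * scale * (A.T @ (A @ phi - y))       # gradient step on the LS term
        phi = np.sign(z) * np.maximum(np.abs(z) - step * delta, 0.0)  # soft threshold
    return phi
```

Increasing `delta` drives more entries of the weight vector to exactly zero, which corresponds to selecting fewer features.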

4 Learning-based utility estimation

Artificial Neural Networks (ANNs) offer the ability to learn an optimum feature set (for given training data) to characterize the microphone signals, as well as to optimally combine the features for estimating microphone utility. Thus, we propose learning-based alternatives to both the model-based feature extraction (see Section 3.1) and the utility estimation (see Section 3.2) subsystems. The extractor module in Fig. 4 realizes the feature extractor on the left-hand side of Fig. 2 (both in red), while the estimator module in Fig. 5 realizes the utility estimator on the right-hand side of Fig. 2 (both in blue). For both subsystems in Fig. 2, the ANN architectures are chosen to reflect the modeling capabilities of their model-based counterparts. Both modules are trained together in an end-to-end fashion. During inference, the extractor and utility estimator modules run on the network nodes and the AP, respectively, such that only the compressed feature representation needs to be transmitted to the AP.

4.1 Node-wise feature extraction

The signal features discussed in Section 3.3, although effective, are not necessarily optimal for utility estimation. Learning a set of features specifically tailored to characterize microphone signals for the purpose of estimating utility promises improved accuracy and a more compact representation. The structure of the feature extractor module is depicted schematically in Fig. 4.

Fig. 4 Architecture of the feature extractor module

Recalling that the ground-truth utility is given by the MSC, spectral representations of the input data appear to be an obvious choice. Since the phase of a signal is largely uninformative w. r. t. the SOI without a second signal for reference, we focus on models using the magnitude spectrum as input in the following. Our initial experiments support this choice: models using the magnitude spectrum outperformed models using the time-domain waveform. The resulting halving of the model’s input size is a welcome additional benefit.

Thus, the magnitude spectrum of a single microphone signal block \(\textbf{x}_{j}[k]\) as defined in (3) is computed first. Due to the loss of phase information, this transform is not invertible and thus prevents the model from learning exact equivalents of the time-domain features in Section 3.1. The magnitude spectrum then passes through a series of fully connected feed-forward layers that get progressively narrower to condense information until a desired number of signal features is reached. The final batch normalization and Gated Recurrent Unit (GRU) layers allow the extractor module to learn features that describe the evolution of some quantity over time, e.g., spectral flux. Trained weights are shared between the instances of the module at different microphones, i.e., no sensor-specific features are extracted.

4.2 Utility estimation

The architecture of the utility estimator is shown in Fig. 5 using the concatenated feature vectors \(\textbf{a}_{j}[k]\) from the individual microphones as an input. The memory of the first GRU layer allows capturing the temporal evolution of the feature sequences and allows establishing relations between the different microphone signals based on their extracted features. The following fully connected layers all contain the same number of neurons and are responsible for regression of the GRU outputs onto the target MSC values. Passing the feature sequences themselves into the ANN, instead of the PCCs as in the model-based method in Section 3, allows the network to differentiate between useful and non-useful microphones, such that no separate disambiguation step or supplemental information is needed. Unlike the model-based estimation in Section 3.2, the number of microphones J directly determines the number of neurons in the later fully connected layers. Thus, the model must be retrained whenever the number of microphones changes, but is capable of learning optimal feature representations. For practical applications, building a modular model, e.g., from microphone pair-wise submodels, could overcome this restriction at the cost of some modeling capability.

Fig. 5 Architecture of the utility estimator module. Feature vectors from different microphones are concatenated to form a single, longer feature vector. The GRUs exploit the temporal information contained in the feature sequences. The FC layers estimate the microphone utility from the GRU outputs

5 Experimental validation

The algorithms from Sections 3 and 4 are evaluated on simulated and recorded acoustic data. The considered scenarios feature both static and moving SOIs, different room dimensions and reverberation times, and different arrangements of \(J=10\) microphones, some of which may be physically obstructed by objects. Although each microphone represents its own network node here, this does not conflict with the general assumptions outlined in Section 1.

5.1 Acoustic data

This section describes the different acoustic data used in the following experimental validation.

5.1.1 Simulated data

Microphone signals for a single SOI moving in a shoe box room are simulated using the image-source method [50, 51]. The SOI trajectory is restricted to the Region of Interest (RoI), chosen as a horizontal plane at 1.2-m height with at least 1-m distance to the walls. The trajectory is spatially discretized such that successive SOI positions are at most 5 cm apart. The resulting set of time-variant Room Impulse Responses (RIRs) is then convolved with the corresponding SOI signal excerpts to obtain the microphone signals evoked by the moving SOI. Speech segments of 28 s duration, from both male and female speakers, are used as SOI signals. The source moves rapidly during the time intervals 8–10 s and 18–20 s, and otherwise moves slightly around a resting position to simulate the behavior of human speakers. With a maximum cross section through the RoI of about 8 m, the maximum possible SOI speed is about 4 m/s. Under these constraints, 20 different, random source trajectories are generated. Three different rooms with typical living room-like acoustic properties (see Table 1) are considered. In each room, \(J=10\) cardioid microphones are placed at random positions and with random azimuthal rotation. In total, \(R_{\text {sim}} = 120\) distinct acoustic setups (20 trajectories \(\times\) 2 signals \(\times\) 3 rooms) are simulated, resulting in 56 min of speech data. The generated SOI images are superimposed with spatially uncorrelated white noise of an equal, fixed level to attain an SNR of 10 dB at the microphone with the strongest source image on average. Due to the lower SOI contribution, other microphones have a lower average SNR. Figure 6 illustrates room A along with an exemplary source trajectory.
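The noise scaling described above can be sketched as follows. The function name, the use of Gaussian white noise, and the definition of the average SNR as the ratio of mean powers over the whole signal are assumptions for illustration.

```python
import numpy as np

def add_noise_at_snr(images, snr_db=10.0, seed=0):
    """Add white noise of equal power to all channels.

    images: array (J, T) of noise-free source images. The noise power is set
    such that the channel with the strongest image has the requested SNR.
    """
    rng = np.random.default_rng(seed)
    power = np.mean(images ** 2, axis=1)                 # average image power per channel
    noise_power = power.max() / 10 ** (snr_db / 10)      # same level for every channel
    # Independent draws per channel give spatially uncorrelated noise
    noise = np.sqrt(noise_power) * rng.standard_normal(images.shape)
    snr_per_channel = 10 * np.log10(power / noise_power) # nominal SNR in dB
    return images + noise, snr_per_channel

# Hypothetical source images: the second channel has a quarter of the power
images = np.vstack([np.ones(100), 0.5 * np.ones(100)])
noisy, snr = add_noise_at_snr(images)
```

In this example the weaker channel ends up about 6 dB below the 10 dB attained at the strongest channel.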

Table 1 Dimensions and reverberation times of simulated rooms
Fig. 6 Simulated room A and exemplary source trajectory (red) for synthesized data

5.1.2 Recorded data

The recorded acoustic data is obtained from \(J=10\) microphones arranged pair-wise in a quarter circle around a static loudspeaker representing the SOI as shown in Fig. 7. Although the microphone capsules are omnidirectional, they exhibit nonuniform directivity due to being mounted in metal enclosures facing the SOI, which causes diffraction. SOI signals comprise male, female, and children’s speech. Instead of a moving source, different usefulness of the microphones is induced by occluding some of the sensors. Obstacles may cover two microphone pairs as indicated in Fig. 7, or a single microphone pair. Additionally, obstacles consist of different materials, i.e., solid wood, foam, and cloth, such that sound can permeate through some of them. In total, \(R_{\text {rec}} = 36\) distinct acoustic setups (12 obstructions \(\times\) 3 signals) are recorded, resulting in 36 min of speech data. Like for simulated data, spatially uncorrelated white noise is added to the recorded microphone signals to achieve an SNR of 10 dB.

Fig. 7 Experiment setup and exemplary obstruction for recorded data

5.2 Feature importance

Summing \(\mathcal {C}(\varvec{\phi })\) in (53) over \(R_{\text {sim}}=120\) experiment trials (see Section 5.1.1) and then minimizing the sum yields the feature weights \(\varvec{\phi }\) depicted in Fig. 8. Naturally, higher values of \(\delta\) result in sparser solutions, i.e., fewer selected features, ranging from 3 features to 12 features for the considered range of \(\delta\). The most important features appear to be lower-order statistical moments of the temporal waveform (td_centroid, td_spread, td_skewness), higher-order statistical moments of the magnitude spectrum (sd_skewness, sd_kurtosis), and features capturing the temporal variation of the magnitude spectrum (sd_flux, sd_variation, sd_fluxnorm). For \(\delta =0.001\), the selection comprises the four features td_skewness, sd_slope, sd_kurtosis, and sd_fluxnorm, two of which were also part of the heuristic selection made in [36]. To keep the number of selected features similar to prior work [32, 35], we choose the aforementioned four features of \(\delta =0.001\) for the experimental validation in Section 5.3. Note that the obtained feature weights are only used for feature selection so far, but could possibly be used to improve the estimates’ robustness in the future.

Fig. 8 Feature weights \(\phi _{f}\) for different values of \(\delta\)

5.3 Accuracy of estimated utilities

Estimation accuracy is quantified by computing the time-variant PCC between the estimated utility vector \(\textbf{u}[k]\) and the MSC vector \(\varvec{\gamma }[k]\)

$$\begin{aligned} r[k] = \mathcal {R}(\textbf{u}[k], \varvec{\gamma }[k]). \end{aligned}$$
(54)

For the following experiments, signals are sampled at \(f_{\textrm{s}} =\) 16 kHz. For block processing, signals are partitioned into blocks of \({L_{\textrm{b}}}=1024\) samples with a block shift of \({L_{\textrm{s}}}=512\) samples. As the only exception, differential entropy in (50), since it is estimated by a histogram approach, uses longer blocks of \(32\,000\) samples for more robust estimates. Due to the larger block size, the estimated differential entropy also changes more slowly over time, thus promoting temporal continuity of the estimated utility via (49).
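At 16 kHz, these settings correspond to blocks of 64 ms with a shift of 32 ms, and to 2 s blocks for the entropy estimate. The sketch below illustrates the block partitioning together with the non-redundant magnitude spectrum used as input in Section 4.1. The function name and the absence of an analysis window are assumptions for illustration.

```python
import numpy as np

def block_magnitude_spectra(x, block_len=1024, block_shift=512):
    """Split x into overlapping blocks and return their magnitude spectra."""
    n_blocks = 1 + (len(x) - block_len) // block_shift
    idx = np.arange(block_len)[None, :] + block_shift * np.arange(n_blocks)[:, None]
    blocks = x[idx]                                  # shape (n_blocks, block_len)
    return np.abs(np.fft.rfft(blocks, axis=1))       # shape (n_blocks, block_len // 2 + 1)

fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)    # 1 s test tone at 1 kHz
spectra = block_magnitude_spectra(x)
peak_bin = int(np.argmax(spectra[0]))                # bin spacing is fs / 1024 = 15.625 Hz
```

One second of signal yields 30 blocks with 513 frequency bins each, and the 1 kHz tone falls exactly on bin 64.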

For the proposed model-based approach from Section 3, termed model-KF in the following, the microphone signals are characterized using the four features identified in Section 3.3, i.e., td_skewness, sd_slope, sd_kurtosis, and sd_fluxnorm. The temporal recursive smoothing factor in (14) is chosen as \(\lambda = 0.99\); the scaling factors for the KF process and observation noise are \(\alpha _1 = 1\) and \(\alpha _2 = 50\), respectively.

For the learning-based approach, the extractor contains six fully connected layers with 513, 256, 128, 64, 32, and 16 neurons, respectively, followed by a single GRU layer with 16 inputs and 16 hidden states. Recall that \({L_{\textrm{b}}}=1024\) such that the 513 inputs to the first layer correspond to the non-redundant part of the signal’s magnitude spectrum. The utility estimator contains a single GRU layer with \(16J = 160\) inputs and 10 hidden states, followed by three fully connected layers with 10 neurons each. Since identical copies of the extractor module are run for each microphone channel \(j\in \mathcal {P}\), the total number of parameters is \(175\,000\) for the extractor regardless of the number of microphones J, and \(5\,500\) for the utility estimator with the above configuration which scales asymptotically quadratically with the number of channels J. The objective function to be minimized in training is the Mean Square Error (MSE) between the estimated utility \(\textbf{u}[k; \Psi ]\) and the ground-truth MSC \(\varvec{\gamma }[k]\)

$$\begin{aligned} \min _{\Psi } \frac{1}{K} \sum \limits _{k=1}^{K} \Vert \textbf{u}[k; \Psi ] - \varvec{\gamma }[k] \Vert ^{2}_{2}, \end{aligned}$$
(55)

where \(\Psi\) denotes the set of all trainable parameters. The model is trained by the Adam optimizer [52] with a learning rate of \(10^{-3}\). The available acoustic scenarios (120 for ML-sim, 36 for ML-tuned and 156 for ML-joint, respectively) are split into 70% training and 30% validation data. Due to the combinatorial construction of the acoustic data (see Sections 5.1.1 and 5.1.2), it is likely that the same speech signal occurs both in the training and the testing data. However, such signals never occur in the same combination of source trajectory and simulated room, which are the predominant factors influencing microphone utility.
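The parameter counts quoted above can be checked approximately with a short calculation. The sketch below rests on several assumptions that are not confirmed by the text: the value 513 is interpreted as the input dimension of the first fully connected layer, every layer has a bias, the GRU is counted with one bias vector per gate, and the hidden size of the utility estimator is set equal to the number of microphones J.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def gru_params(n_in, n_hidden):
    """Three gates, each with input weights, recurrent weights and one bias vector."""
    return 3 * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

def extractor_params(dims=(513, 256, 128, 64, 32, 16), gru_hidden=16):
    """Fully connected stack 513 -> 256 -> ... -> 16 followed by a 16 -> 16 GRU."""
    dense = sum(dense_params(a, b) for a, b in zip(dims[:-1], dims[1:]))
    return dense + gru_params(dims[-1], gru_hidden)

def estimator_params(n_mics=10, feat_per_mic=16, n_dense=3):
    """GRU with 16*J inputs and J hidden states, then three J -> J dense layers."""
    gru = gru_params(feat_per_mic * n_mics, n_mics)
    return gru + n_dense * dense_params(n_mics, n_mics)

n_extractor = extractor_params()       # 176928, close to the quoted 175 000
n_estimator = estimator_params(10)     # 5460, close to the quoted 5 500
```

Under these assumptions the estimator count equals 54J² + 6J, which makes the quadratic growth with the number of channels J explicit, whereas the extractor count does not depend on J because its parameters are shared across channels.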

A total of 8 algorithmic variants and baselines are evaluated. Note that we deliberately do not enforce a common constraint regarding communication cost, because our primary goal is to establish performance bounds. However, for practical application, the trade-off between accurate utility estimates and minimal communication costs must be carefully considered. First, two baseline variants, termed baseline-MSC and baseline-CDR, use the cross-microphone MSC and the Coherent-to-Diffuse power Ratio (CDR) as oracle features. For baseline-MSC, the estimated MSC values directly represent a normalized similarity of the respective microphones, such that they directly comprise \(\tilde{\textbf{B}}^{(f)}[k]\) in (30). The CDR is computed by a Direction of Arrival (DOA)-independent estimator [53] assuming a diffuse noise coherence, which has been successfully used in weighting and selecting observations made by different microphones [33, 54]. Because the CDR is not bounded, the diffuseness [53, 55] is used in its place to construct \(\tilde{\textbf{B}}^{(f)}[k]\). Note that although MSC and CDR imply oracle knowledge in the sense of signal availability at the AP and thus transmission of the sensor signal, the MSC is still computed from time-limited observation windows and thus entails all of the associated estimation challenges, e.g., [56, 57]. The same holds for the CDR, as it is based directly on the estimated MSC.
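For orientation, the diffuseness is commonly obtained from the CDR by the bounded mapping below. The symbol \(D\) is introduced here only for illustration, and it is an assumption that this standard relation is the one underlying the construction of \(\tilde{\textbf{B}}^{(f)}[k]\) in this work.

$$\begin{aligned} D = \frac{1}{1 + \textrm{CDR}}, \qquad 0 < D \le 1 \quad \text {for} \quad \textrm{CDR} \ge 0. \end{aligned}$$

A purely coherent sound field (\(\textrm{CDR} \rightarrow \infty\)) thus maps to \(D \rightarrow 0\), and a purely diffuse field (\(\textrm{CDR} = 0\)) maps to \(D = 1\).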

The model-based system described in Section 3 including the KFs for covariance estimation is termed model-KF. To evaluate the effectiveness of the KF, a variant of the proposed system is evaluated that uses a simple recursive temporal smoothing like (14) for feature covariance estimation, termed model-smooth. Furthermore, to judge the modeling capabilities of the ML-estimator, hybrid combines all 18 traditional features from Section 3.1 and the ML-based utility estimator module from Section 4.2. The computational complexity of the different features varies: computing a single feature is about 20–40× faster than real time when running single-threaded on a Core\(^{\text {TM}}\) i5-6600K at 3.5 GHz. Entropy computation, being only slightly faster than real time, is a notable exception, but it is required for the disambiguation of solutions in (49).
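A histogram-based differential entropy estimate of the kind mentioned for (50) can be sketched as follows. This is a generic textbook estimator and not necessarily the exact form of (50); the number of bins and the function name are assumptions.

```python
import numpy as np

def differential_entropy_hist(x, n_bins=64):
    """Histogram estimate of the differential entropy in nats.

    Approximates h = -integral p(x) log p(x) dx with a piecewise-constant
    density p_i / width, where p_i is the relative frequency of bin i.
    """
    counts, edges = np.histogram(x, bins=n_bins)
    width = edges[1] - edges[0]                 # equal-width bins
    p = counts[counts > 0] / counts.sum()       # skip empty bins (0 * log 0 = 0)
    return float(-np.sum(p * np.log(p / width)))
```

For a block of 32 000 samples of unit-variance Gaussian noise, the estimate should be close to the analytical value 0.5 ln(2πe) ≈ 1.42 nats. Scaling the signal by a factor a shifts the estimate by ln(a), so the measure is sensitive to signal level.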

In addition, three different training variants are investigated: The first variant ML-sim uses exclusively simulated data. For practical application, it is highly desirable to deploy a pre-trained model and fine-tune its parameters specifically toward a new unseen scenario. To this end, a copy of the pre-trained ML-sim is fine-tuned on recorded data, termed ML-tuned. Finally, a third version trained on both simulated and recorded data simultaneously, termed ML-joint, is also evaluated. In terms of simulation run time, the learning-based variants achieve about 40× real-time speeds, i.e., are on par with the computation of a single model-based feature.

5.3.1 Simulated data

Figure 9 shows the median, as well as the lower and upper quartiles, of the PCC r[k] across trials as a function of the time frame k for simulated data; ideally, r[k] should be close to 1. Whenever the source moves (indicated by the gray-shaded vertical areas in Fig. 9), the source-microphone distances suddenly change, causing the observed rapid decrease of r[k]. Both baseline methods baseline-MSC and baseline-CDR achieve limited performance due to the high relative noise levels in the microphone signals and the time-limited observation windows impeding accurate estimation. Both purely model-based variants model-KF and model-smooth exhibit good steady-state performance with PCCs around 0.9 and quick initial convergence and reconvergence after source motion. The KF variant model-KF achieves slightly better accuracy on average than model-smooth and is more robust, e.g., visible at around 4 s and 13 s. The trained hybrid offers only very small improvements over model-KF and model-smooth despite using all of the features, indicating that the four selected features for model-KF and model-smooth are close to optimal for these scenarios. The learning-based ML-sim trained on matching data achieves very similar performance, trading a more consistent performance when the SOI does not move for a slightly slower reconvergence behavior. As expected, fine-tuning the ML model using recorded data significantly degrades performance for simulated data, as shown by ML-tuned. Finally, the ML model trained with both simulated and recorded data from the beginning, i.e., ML-joint, clearly outperforms all other considered methods, with only minor breakdowns and very fast recovery. Interestingly, incorporating recorded data in addition to the simulated data into the training procedure also improves performance on simulated data. Convergence of all methods is very fast, reaching peak accuracy almost instantaneously once the SOI becomes active after an initial silence period of about 1 s.

Fig. 9

Median (solid) and lower/upper quartile (shaded areas) of PCC r[k] over all \(R_{\text {sim}}=120\) synthetic experiment trials. Gray-shaded areas indicate time intervals of source movement

While the ML models implicitly learning the temporal structure of the source movement might be a concern here, our experiments with random time intervals of 4 to 12 s between two successive source movements have shown no noticeable degradation compared to fixed time intervals.

5.3.2 Recorded data

As for the simulated data, Fig. 10 shows the median and quartiles of r[k] across trials as a function of the time frame k for recorded data. Since the SOI is static, the usefulness of microphones is predominantly influenced by their occlusion and no clear temporal structure can be discerned. Both of the oracle baselines achieve consistent but limited performance with PCCs between 0.6 and 0.8. The advantage of baseline-CDR may be attributed to the diffuse noise coherence model, which enhances the contrast between microphones since residual coherence is considered as noise, particularly in low-frequency regions. While the median of both model-KF and model-smooth reaches 0.9 after about 2 s, their performance degrades over the experiment duration. Note that convergence of these two model-based variants and the baselines is initially delayed by about 1.5 s due to incorrect disambiguation of the utility estimates in (49), indicating opportunity for future improvements. Beyond this initial phase, (49) is effective at disambiguating the microphone partitions as shown by the consistently positive values in Fig. 9. This is reinforced by hybrid, which simultaneously avoids this initial delay and achieves significantly better performance. Thus, hybrid, which has access to all features, outperforming model-KF suggests that the four selected features are suboptimal for the type of degradation encountered in the recorded data. While the relatively weak performance of ML-sim with median values of around 0.5 is unsurprising since the model was not trained using recorded data, the method does not completely fail for unseen data. The performance of ML-tuned is only on par with model-KF, indicating that adjustment of a pre-trained model is not as straightforward as anticipated, likely due to pre-training driving the ANN parameters to a local minimum that cannot be escaped easily by subsequent tuning. 
As for simulated data, ML-joint outperforms the other methods on recorded data, reaching almost ideal values extremely fast and consistently, i.e., with almost no spread. Because they share their architecture and thus modeling capability, the advantage of ML-joint over ML-tuned is due to the different training data, which matches the phenomenon that unrelated data improves performance, as is also observed for simulated data (see Fig. 9).

Fig. 10

Median (solid) and lower/upper quartile (shaded areas) of PCC r[k] over all \(R_{\text {rec}}=36\) real-data experiment trials

5.3.3 Identification of a single most useful microphone

Besides the accuracy of the estimated continuous-valued utilities, the capability of different algorithmic variants to correctly identify the microphone with the highest utility is investigated. To this end, the channel-wise SNR in (9) is computed using oracle knowledge of the individual signal components, i.e., the SOI source image and the additive noise. The microphone that maximizes (9) is considered the most useful, representing the ground truth in this experiment. Note that, as the SNR changes over time, so does the identity of the best microphone. For brevity, we restrict the investigation to the best performing model-based and ML-based variants, i.e., model-KF, hybrid and ML-joint. For each variant, the microphone with the highest estimated utility is selected. In the absence of a more directly comparable approach, the microphone with the highest average pair-wise CDR is selected as a baseline. Therein, the CDR is computed using the DOA-independent estimator [53] and a diffuse noise coherence model as described in Section 5.3. Because the microphones are connected to separate network nodes, this CDR baseline requires transmission of all microphone signals to one of the network nodes, which limits its practical applicability in ASNs. As performance measure, we use the fraction of time frames in which the estimated identity of the most useful microphone coincides with the SNR-based ground truth.
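The performance measure described above amounts to comparing two arg-max decisions per time frame. A minimal sketch follows, assuming utilities and oracle SNRs are stored as arrays of shape frames by microphones; the function name is hypothetical.

```python
import numpy as np

def selection_accuracy(utility, snr):
    """Fraction of frames in which the microphone with the highest estimated
    utility coincides with the microphone with the highest oracle SNR.

    utility, snr: arrays of shape (K, J) for K frames and J microphones.
    """
    best_estimated = np.argmax(utility, axis=1)
    best_oracle = np.argmax(snr, axis=1)
    return float(np.mean(best_estimated == best_oracle))

# Toy example with K = 4 frames and J = 2 microphones.
utility = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
snr_db = np.array([[0.0, 5.0], [3.0, 1.0], [4.0, 2.0], [2.0, 1.0]])
accuracy = selection_accuracy(utility, snr_db)  # 0.75: frame 3 is misidentified
```

Note that this measure only rewards an exact match of the top-ranked microphone, so a near-tie between two microphones with almost equal SNR counts as an error even if the selected microphone is almost as useful.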

Table 2 Fraction of time frames where the single most useful microphone is identified correctly. The ground-truth selection is given by the microphone with the maximum oracle SNR (9)

The obtained results are shown in Table 2 separately for simulated and recorded data. For simulated data, all proposed variants clearly outperform the CDR baseline, with ML-joint achieving the highest accuracy as expected from the previous results. In the more challenging scenarios with recorded data, the overall accuracy decreases for all methods. Although the CDR baseline outperforms both model-KF and hybrid, which use hand-crafted signal features, the ML-based ML-joint outperforms all of the remaining considered methods. It must be reiterated that the CDR baseline in Table 2 uses the microphone signal MSC as oracle knowledge, requiring transmission of all microphone signals. In contrast, the proposed model-KF, hybrid and ML-joint have no such limitations.

5.3.4 Discussion

Let us summarize the previous Sections 5.3.1 to 5.3.3 and point out implications for practical application. The performance of ML-sim on recorded data and ML-tuned on simulated data indicates limits on the generalization capabilities of the respective trained models to unseen data. Meanwhile, ML-joint provides very good utility estimates but requires both simulated and recorded acoustic data for training. This suggests that data mismatch due to the simplified acoustic simulation, e.g., neglecting the occlusions present in recorded data, is responsible for the aforementioned performance degradation of ML-sim and ML-tuned, requiring further investigation of the root cause. Comparing the results of ML-tuned and ML-joint, adaptation of a pre-trained network to new scenarios is not straightforward, likely requiring more sophisticated transfer learning techniques. As a major drawback for ASNs in realistic conditions, obtaining a sufficient amount of labeled training data for a variety of acoustic scenarios is difficult since the estimation of the MSC values necessary for training requires prior transmission and potentially synchronization of all observed signals. While this could be remedied, e.g., by network nodes with enough memory to buffer the signals before transmission, this problem is beyond the scope of this contribution. Furthermore, the architecture of the utility estimator module explicitly depends on the number of microphones J and thus requires retraining whenever J changes, e.g., when new ASN nodes are added.

In contrast, the model-based approach model-KF has shown a more modest, yet robust, performance for both simulated and recorded data. It is also blind, i.e., it does not require knowledge of array geometries, acoustic meta parameters like reverberation time, or, especially, the number of microphones J. Thus, it can be straightforwardly deployed in different acoustic environments without the need to collect acoustic data to train or fine-tune the model. For a real system, a model-based scheme can be used as an initial solution to collect labeled training data, which can then be used to tailor an ML-based model to the specific acoustic scenario of the training data.

6 Conclusion

In this contribution, we tackled microphone utility estimation for ASNs. Specifically, we revisited model-based approaches and discussed the usefulness of specific features, with features describing temporal variations and higher-order statistical moments of the signals’ magnitude spectra being the most useful overall. Furthermore, we proposed alternative, machine learning-based realizations to learn an optimal feature set and utility estimator. Experimental validation showed that both model- and ML-based approaches are viable in principle with their own strengths and drawbacks. The model-based approach is straightforwardly applied to ASNs with an arbitrary number of microphones J, but is clearly outperformed by suitably trained ML models. In contrast, the ML-based approaches, particularly ML-joint, achieve superior performance if matching training data are available.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

ANN: Artificial Neural Network

AP: Access Point

ASN: Acoustic sensor network

ASR: Automatic Speech Recognition

CDR: Coherent-to-Diffuse power Ratio

CLT: Central Limit Theorem

DFT: Discrete Fourier Transform

DOA: Direction of Arrival

DRR: Direct-to-Reverberation Ratio

ERLE: Echo Return Loss Enhancement

GRU: Gated Recurrent Unit

KF: Kalman Filter

LASSO: Least Absolute Shrinkage and Selection Operator

LCMV: Linearly Constrained Minimum Variance

LS: Least Squares

ML: Machine Learning

MMSE: Minimum mean square error

MSC: Magnitude-Squared Coherence

MSE: Mean Square Error

MVDR: Minimum Variance Distortionless Response

PCC: Pearson correlation coefficient

PDF: Probability Density Function

PPM: Parts per million

PSD: Power Spectral Density

RIR: Room Impulse Response

RoI: Region of Interest

SDR: Signal-to-Distortion Ratio

SNR: Signal-to-Noise Ratio

SOI: Source of Interest

SVD: Singular Value Decomposition

References

  1. A. Bertrand, in 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT), Applications and trends in wireless acoustic sensor networks: A signal processing perspective (2011), pp. 1–6. https://doi.org/10.1109/SCVT.2011.6101302

  2. H. Wang, P. Chu, in 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, Voice source localization for automatic camera pointing system in videoconferencing (1997), pp. 187–190. https://doi.org/10.1109/ICASSP.1997.599595

  3. L. Cheng, C. Wu, Y. Zhang, H. Wu, M. Li, C. Maple, A Survey of Localization in Wireless Sensor Network. Int. J. Distrib. Sens. Netw. 8(12) (2012). https://doi.org/10.1155/2012/962523

  4. A. Brendel, W. Kellermann, Distributed source localization in acoustic sensor networks using the coherent-to-diffuse power ratio. IEEE J. Sel. Top. Sign. Process. 13(1), 61–75 (2019). https://doi.org/10.1109/JSTSP.2019.2900911


  5. L. Kaplan, Q. Le, N. Molnar, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Maximum likelihood methods for bearings-only target localization, vol.5 (2001), pp. 3001–3004. https://doi.org/10.1109/ICASSP.2001.940281

  6. C. Evers, H.W. Löllmann, H. Mellmann, A. Schmidt, H. Barfuss, P.A. Naylor, W. Kellermann, The LOCATA challenge: Acoustic source localization and tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1620–1643 (2020)


  7. M. Brandstein, Microphone arrays: signal processing techniques and applications (Springer Science & Business Media, Berlin, 2001)


  8. A. Bertrand, M. Moonen, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Distributed adaptive estimation of correlated node-specific signals in a fully connected sensor network (Taipei, Taiwan, 2009), pp. 2053–2056. https://doi.org/10.1109/ICASSP.2009.4960018

  9. S. Markovich-Golan, A. Bertrand, M. Moonen, S. Gannot, Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Process. 107, 4–20 (2015)


  10. S.L. Gay, J. Benesty, Acoustic signal processing for telecommunication, vol. 551 (Springer Science & Business Media, New York, 2012)


  11. L.M. Oliveira, J.J. Rodrigues, Wireless Sensor Networks: a Survey on Environmental Monitoring. J. Commun. 6(2), 143–151 (2011). https://doi.org/10.4304/jcm.6.2.143-151


  12. S. Goetze, J. Schroder, S. Gerlach, D. Hollosi, J.E. Appell, F. Wallhoff, Acoustic monitoring and localization for social care. J. Comput. Sci. Eng. 6(1), 40–50 (2012)


  13. A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, T. Virtanen, in DCASE 2017 - Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2017 Challenge setup: Tasks, datasets and baseline system (Munich, Germany, 2017). https://hal.inria.fr/hal-01627981. Accessed date 19 Dec 2021

  14. M. Wolf, C. Nadeu, in Proc. of I Joint SIG-IL/Microsoft Workshop Speech Lang. Technol. Iberian Lang., Towards microphone selection based on room impulse response energy-related measures (Porto Salvo, Portugal, 2009), p. 4

  15. A. Bertrand, M. Moonen, in European Signal Process. Conf. (EUSIPCO), Efficient sensor subset selection and link failure response for linear MMSE signal estimation in wireless sensor networks (Aalborg, Denmark, 2010), pp. 1092–1096

  16. J. Szurley, A. Bertrand, M. Moonen, P. Ruckebusch, I. Moerman, in European Signal Process. Conf. (EUSIPCO), Energy aware greedy subset selection for speech enhancement in wireless acoustic sensor networks (Bucharest, 2012), pp. 789–793

  17. J. Szurley, A. Bertrand, P. Ruckebusch, I. Moerman, M. Moonen, Greedy distributed node selection for node-specific signal estimation in wireless sensor networks. Signal Process. 94, 57–73 (2014). https://doi.org/10.1016/j.sigpro.2013.06.010


  18. O. Roy, M. Vetterli, Rate-Constrained Collaborative Noise Reduction for Wireless Hearing Aids. IEEE Trans. Signal Process. 57(2), 645–657 (2009). https://doi.org/10.1109/TSP.2008.2009267. https://ieeexplore.ieee.org/document/4671085/

  19. S. Srinivasan, A.C. den Brinker, Rate-Constrained Beamforming in Binaural Hearing Aids. EURASIP J. Adv. Signal Process. 2009(1) (2009). https://doi.org/10.1155/2009/257197. https://asp-eurasipjournals.springeropen.com/articles/10.1155/2009/257197. Accessed date 10 June 2023

  20. J. Amini, R.C. Hendriks, R. Heusdens, M. Guo, J. Jensen, Rate-Constrained Noise Reduction in Wireless Acoustic Sensor Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 1–12 (2020). https://doi.org/10.1109/TASLP.2019.2947777. https://ieeexplore.ieee.org/document/8871150/

  21. J. Casebeer, J. Kaikaus, P. Smaragdis, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Communication-Cost Aware Microphone Selection for Neural Speech Enhancement with Ad-Hoc Microphone Arrays (2021), pp. 8438–8442. https://doi.org/10.1109/ICASSP39728.2021.9414775

  22. J. Zhang, S.P. Chepuri, R.C. Hendriks, R. Heusdens, Microphone Subset Selection for MVDR Beamformer Based Noise Reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 26(3), 550–563 (2018). https://doi.org/10.1109/TASLP.2017.2786544


  23. J. Zhang, J. Du, L.R. Dai, Sensor Selection for Relative Acoustic Transfer Function Steered Linearly-Constrained Beamformers. IEEE/ACM Trans. Audio Speech Lang. Process. 29 (2021). https://doi.org/10.1109/TASLP.2021.3064399

  24. J. Benesty, J. Chen, Y. Huang, Microphone array signal processing, vol. 1 (Springer Science & Business Media, Berlin, 2008)


  25. K. Kumatani, J. McDonough, J. Lehman, B. Raj, in Joint Workshop Hands-free Speech Commun. Microphone Arrays (HSCMA), Channel selection based on multichannel cross-correlation coefficients for distant speech recognition (Edinburgh, UK, 2011), pp. 1–6. https://doi.org/10.1109/HSCMA.2011.5942398

  26. IEEE Standard for Information technology - Telecommunications and information exchange between systems Local and metropolitan area networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. IEEE Std 802.11-2016 (Revision of IEEE Std 802.11-2012) (Aachen, 2016), pp. 1–3534

  27. A. Chinaev, G. Enzner, T. Gburrek, J. Schmalenstroeer, in European Signal Process. Conf. (EUSIPCO), Online Estimation of Sampling Rate Offsets in Wireless Acoustic Sensor Networks with Packet Loss (Dublin, 2021), pp. 1–5

  28. S.E. Kotti, R. Heusdens, R.C. Hendriks, in 2020 28th European Signal Processing Conference (EUSIPCO), Clock-Offset and Microphone Gain Mismatch Invariant Beamforming (IEEE, Amsterdam, Netherlands, 2021), pp. 176–180. https://doi.org/10.23919/Eusipco47968.2020.9287852. https://ieeexplore.ieee.org/document/9287852/. Accessed date 10 June 2023

  29. S. Wehr, I. Kozintsev, R. Lienhart, W. Kellermann, in IEEE Sixth International Symposium on Multimedia Software Engineering, Synchronization of acoustic sensors for distributed ad-hoc audio networks and its use for blind source separation (Miami, USA, IEEE, 2004), pp.18–25


  30. D. Cherkassky, S. Gannot, Blind Synchronization in Wireless Acoustic Sensor Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25(3), 651–661 (2017). https://doi.org/10.1109/TASLP.2017.2655259. https://ieeexplore.ieee.org/document/7827105/

  31. M. Fiedler, Algebraic connectivity of graphs. Czechoslov. Math. J. 23(2), 298–305 (1973)


  32. M. Günther, H. Afifi, A. Brendel, H. Karl, W. Kellermann, in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Network-aware optimal microphone channel selection in wireless acoustic sensor networks (Toronto, 2021)

  33. M. Günther, A. Brendel, W. Kellermann, in 14. ITG Conf. on Speech Comm., Microphone Utility-based Weighting for Robust Acoustic Source Localization in Wireless Acoustic Sensor Networks (Kiel, Germany, 2021)

  34. H. Afifi, M. Günther, A. Brendel, H. Karl, W. Kellermann, in 14. ITG Conf. on Speech Comm., Reinforcement Learning-based Microphone Selection in Wireless Acoustic Sensor Networks Considering Network and Acoustic Utilities (Kiel, Germany, 2021)

  35. M. Günther, A. Brendel, W. Kellermann, in European Signal Process. Conf. (EUSIPCO), Online estimation of time-variant microphone utility in wireless acoustic sensor networks using single-channel signal features (Dublin, Ireland, 2021)

  36. M. Günther, A. Brendel, W. Kellermann, in Int. Congress on Acoust. (ICA), Single-channel signal features for estimating microphone utility for coherent signal processing (2019), pp. 2716–2723

  37. G. Peeters, A large set of audio features for sound description (similarity and classification). CUIDADO project Ircam technical report (2004). http://recherche.ircam.fr/equipes/analyse-synthese/peeters/ARTICLES/Peeters_2003_cuidadoaudiofeatures.pdf. Accessed date 19 Dec 2021

  38. T. Virtanen, M. Plumbley, D. Ellis, Computational analysis of sound scenes and events (Springer, Cham, 2018)


  39. T.M. Cover, J.A. Thomas, Elements of information theory, 2nd edn. (Wiley-Interscience, Hoboken, N.J., 2006)


  40. R.E. Kalman, A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 82(1), 35–45 (1960). https://doi.org/10.1115/1.3662552

  41. H.V. Henderson, S.R. Searle, Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. Can. J. Stat. 7(1), 65–81 (1979)


  42. G. Enzner, H. Buchner, A. Favrot, F. Kuech, in Academic Press Library in Signal Processing, vol. 4, Acoustic Echo Control (Elsevier, Oxford, 2014), pp. 807–877

  43. B. Schwartz, S. Gannot, E.A.P. Habets, Online Speech Dereverberation Using Kalman Filter and EM Algorithm. IEEE/ACM Trans. Audio Speech Lang. Process. 23(2), 394–406 (2015)


  44. A. Papoulis, S.U. Pillai, Probability, Random Variables and Stochastic Processes, 4th edn. (McGraw-Hill, New York, 2001)


  45. C.M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006)


  46. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (The Johns Hopkins University Press, Baltimore, 2013)


  47. F.R. Chung, F.C. Graham, Spectral graph theory, vol. 92 (American Mathematical Soc, Providence, Rhode Island, 1997)


  48. U. Von Luxburg, A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)


  49. J. Shi, J. Malik, Normalized cuts and image segmentation. IEEE Trans. Pattern. Anal. Mach. Intell. 22(8), 888–905 (2000)


  50. J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65(4), 943–950 (1979). https://doi.org/10.1121/1.382599


  51. E.A.P. Habets. Signal generator for MATLAB (2011). https://github.com/ehabets/Signal-Generator. Accessed date 19 Dec 2021

  52. D.P. Kingma, J. Ba, Adam: A method for stochastic optimization (San Diego, 2017). Available online at https://arxiv.org/abs/1412.6980v9. https://doi.org/10.48550/arXiv.1412.6980

  53. A. Schwarz, W. Kellermann, Coherent-to-Diffuse Power Ratio Estimation for Dereverberation. IEEE/ACM Trans. Audio Speech Lang. Process. 23(6), 1006–1018 (2015)


  54. A. Brendel, C. Huang, W. Kellermann, STFT Bin Selection for Localization Algorithms based on the Sparsity of Speech Signal Spectra (Proc, Euronoise, 2018)


  55. G. Del Galdo, M. Taseska, O. Thiergart, J. Ahonen, V. Pulkki, The diffuse sound field in energetic analysis. J. Acoust. Soc. Am. 131(3), 2141–2151 (2012). https://doi.org/10.1121/1.3682064

  56. G. Carter, Time delay estimation for passive sonar signal processing. IEEE Trans. Acoust. Speech Signal Process. 29(3), 463–470 (1981)


  57. G. Carter, Coherence and time delay estimation. Proc. IEEE 75(2), 236–255 (1987)



Acknowledgements

The authors thank Adhithyan Ramadoss for his help acquiring the recorded acoustic data used in the experimental study.

Funding

Open Access funding enabled and organized by Projekt DEAL. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—282835863—within the Research Unit FOR2457 “Acoustic Sensor Networks.”

Author information


Contributions

MG designed the proposed systems, designed and conducted the experimental studies, analyzed their results, and drafted the manuscript. AB co-designed the proposed systems and the experiments, and provided invaluable technical feedback on the manuscript draft. WK provided extensive feedback on the manuscript draft, and helped with interpreting the experimental results. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Michael Günther or Walter Kellermann.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

WK is the Lead Guest Editor of the special issue “Signal Processing and Machine Learning for Speech and Audio in Acoustic Sensor Networks” this manuscript is submitted to.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Günther, M., Brendel, A. & Kellermann, W. Microphone utility estimation in acoustic sensor networks using single-channel signal features. J AUDIO SPEECH MUSIC PROC. 2023, 29 (2023). https://doi.org/10.1186/s13636-023-00294-7



  • DOI: https://doi.org/10.1186/s13636-023-00294-7

Keywords