Parallel processing of distributed beamforming and multichannel linear prediction for speech denoising and dereverberation in wireless acoustic sensor networks

Abstract

Smart home devices equipped with microphones have become increasingly common in recent years, and it is highly desirable to connect these microphones as wireless acoustic sensor networks (WASNs) so that the devices can be better controlled in an enclosure. For indoor applications, both environmental noise and room reverberation may severely degrade speech quality, so both need to be removed to improve the user experience. To this end, this paper proposes a parallel processing framework of distributed beamforming and multichannel linear prediction (DB-BFMCLP), which consists of a generalized sidelobe canceler and multichannel linear prediction for simultaneous speech dereverberation and noise reduction in WASNs. By sharing a common desired response vector, the proposed DB-BFMCLP provides a significant reduction in communication bandwidth without sacrificing performance. The convergence of the DB-BFMCLP to its centralized implementation is derived mathematically. Simulation results verify the superiority of the proposed method over existing related methods in noisy and reverberant scenarios.

1 Introduction

Recent progress in micro-electro-mechanical systems (MEMS) and wireless communications has enabled the development and popularization of low-cost and low-power wireless sensor networks (WSNs) [1]. A WSN usually consists of multiple nodes connected by wireless links and has been applied to various fields including speech extraction, acoustic source localization, and acoustic event detection [2, 3]. In general, wireless acoustic sensor networks (WASNs) can be applied to smart home devices, each equipped with at least one microphone. Each node also has an individual signal processing unit and a communication module, so that it can monitor, process, and broadcast signals.

Compared with conventional compact microphone arrays, a WASN comprises several nodes that are placed dispersedly and/or randomly, so that it can cover a much larger area. Besides, WASNs can enhance the robustness and the extensibility of the system through decentralized operation [4]. The often-studied problems of WASNs are synchronization acquisition and transmission. The main cause of the synchronization problem is the offset between clock oscillators, and many efforts have been made to solve the clock synchronization problem [5]. The other factor is whether signals and parameters are updated asynchronously or synchronously in each node; it has been proved that an appropriate updating scheme is more likely to lead to optimal estimators [6, 7].

For WASNs, centralized methods need to gather all observations in a fusion center. In theory, centralized methods can achieve the best performance for complete information, but these methods usually require a large communication bandwidth and computational power. Due to limited transmission bandwidth and energy resources in practice, the optimal centralized methods are often difficult if not impossible for practical applications. An alternative solution is to use distributed methods, which can achieve nearly the same performance as centralized methods, while requiring much fewer broadcast channels [6, 8,9,10,11,12].

Recently, many speech enhancement methods have been proposed for WASNs to reduce environmental noise. In [10], a distributed multichannel Wiener filter (DB-MWF) method was proposed for binaural hearing aids. This method considered a single speech source under the assumption of stationary noise, and only a one-channel signal was transmitted from each node to the other. In [6], the coexistence of multiple speakers was considered, and a distributed adaptive node-specific signal estimation (DANSE) method was proposed, which aims to obtain a different output in each node. In [11], a linearly constrained distributed adaptive node-specific signal estimation (LC-DANSE) method was proposed, which uses a node-specific linearly constrained minimum variance (LCMV) beamformer. A distributed generalized sidelobe canceler (DB-GSC) with multiple constraints was presented for speech enhancement in [12], where the convergence of the DB-GSC to the centralized generalized sidelobe canceler (GSC) was proved. Note that the DB-GSC method was based on a specific transformation that allows the centralized beamformer to be reformulated as a sum of all local GSCs.

Apart from noise, room reverberation may also degrade speech quality severely in an enclosure [13, 14]. For indoor speech communication applications, such as hands-free telephony, speaking to smart home devices, and conference call, microphones are often placed at a certain distance from the desired speaker. In these circumstances, microphones can receive not only the direct sound but also the reflections because of the surrounding objects and walls, where the late reflections are referred to as reverberation. It has been shown that these undesired reverberant components degrade both the performance of automatic speech recognition (ASR) systems and speech perceptual quality. To solve this problem, many dereverberation methods have been proposed [15,16,17,18]. Among these methods, the multi-channel linear prediction (MCLP) proposed in [19] is widely used for its promising performance. The weighted recursive least squares (RLS) method was introduced to accelerate the convergence rate of the filtering parameters in [19]. In addition, [20] demonstrated that the MCLP can suppress reverberation without assuming specific acoustic conditions, although it was originally proposed for single-source dereverberation under noise-free scenarios. For WASNs, dereverberation is also very important for speech enhancement. In [21, 22], two multi-channel dereverberation approaches in ad hoc microphone arrays were introduced, in which the reverberation was reduced by selecting a subset of microphones with a relatively lower level of reverberation. Unlike noise reduction methods, dereverberation methods for WASNs often ignore the constraints, e.g., the limited transmission bandwidth and energy resource.

In reverberant and noisy environments, dereverberation and noise reduction should be integrated in a parallel processing framework or in a serial processing framework [23, 24]. In [23], a system was proposed that employs multiple-output MCLP followed by the minimum variance distortionless response (MVDR) beamformer. However, the cascade architecture of the system has high computational complexity and is difficult to extend to the WASNs. In [24], the sidelobe-cancelation (SC) filter was combined with the linear prediction (LP) filter to a unified framework named integrated sidelobe cancelation and linear prediction (ISCLP), where the two filters are estimated jointly by a Kalman filter. However, the GSC performance is highly dependent on the quality of the estimated relative early transfer functions (RETFs). To prevent the self-cancelation phenomenon caused by inaccurate RETFs, the filter coefficients of the GSC update only when the speakers are all inactive. Therefore, the filter of the GSC and that of the MCLP cannot update their coefficients simultaneously, especially when considering that the MCLP needs to update its filter coefficients when the speakers are active [12, 19, 25]. A joint optimization of the two filters is still unsolved for WASNs.

To solve the above difficulty, we unify the GSC and the MCLP into a beamforming and multichannel linear prediction (BFMCLP) framework, which allows the two filters to be updated independently, to deal with reverberant speech in noisy scenarios. Besides, by sharing the common response vector and deriving the distributed RLS method, we extend the BFMCLP to a distributed implementation (DB-BFMCLP) that is well suited to WASNs. The DB-BFMCLP method requires each node to broadcast far fewer signals than centralized methods.

The remainder of this paper is organized as follows. In Section 2, the problem formulation is presented. In Section 3, the centralized BFMCLP method is described. The DB-BFMCLP is presented in Section 4, and its convergence to the BFMCLP is also included in this section. In Section 5, we evaluate the performance by simulations. Finally, some conclusions are given in Section 6.

2 Problem formulation

In this section, we consider a fully connected WASN with M microphones distributed over J nodes (\(M\ge J\)), and the number of speech sources observed by this WASN is N. \(M_j\) denotes the number of microphones in the jth node, and we have \(\sum _{j=1}^{J}M_j=M\). This paper focuses on the situation in which each node is equipped with more than one microphone. It should be noted that no communication-bandwidth reduction can be obtained in a node with only one microphone, since at least a one-channel signal needs to be transmitted if all nodes, rather than only a subset of them, are used for better performance. Some previous studies have paid more attention to problems such as sensor subset selection, source localization, or network topology in WASNs consisting of several single-microphone nodes [8, 26, 27]. These problems are beyond the scope of this paper.

In the short-time Fourier transform (STFT) domain, the reverberant observation of the speech signal from the nth speaker captured by the mth microphone can be modeled as

$$\begin{aligned} x_{nm}(k,t)=\sum \limits _{l=0}^{L_h-1}a_{nm}(k,l)s_n(k,t-l), \end{aligned}$$
(1)

where t and k denote the time-frame and frequency-bin indices, respectively. \(a_{nm}(k,l)\) denotes the time-invariant acoustic transfer function (ATF) between the nth source and the mth microphone, and \(L_h\) depends on the reverberation time and the length of the STFT window. In this paper, we treat all frequency sub-bands independently, so the frequency-bin index k is hereafter omitted for brevity. In WASNs, we use the vector notation:

$$\begin{aligned} \textbf{x}_n(t)=\left[ {x_{n1}(t),x_{n2}(t),...,x_{nM}(t)}\right] ^T, \end{aligned}$$
(2)

with \((\cdot )^T\) denoting the transpose. By splitting the ATF coefficients into an early part and a late part, the reverberant speech component from the nth speaker \(\textbf{x}_n(t)\) may be decomposed into the direct and early reflected components \(\textbf{x}_{n|e}\) and the late reverberant components \(\textbf{x}_{n|l}\), given by:

$$\begin{aligned} \textbf{x}_n(t)=\textbf{x}_{n|e}(t)+\textbf{x}_{n|l}(t). \end{aligned}$$
(3)
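
The decomposition in Eq. (3) can be illustrated numerically for a single source-microphone pair in one frequency bin. The following is a minimal sketch, assuming the split between early and late taps is a fixed index `n_early` (a hypothetical parameter, not specified in the text) and using random placeholder values for the taps \(a(l)\) and the source frames \(s(t)\):

```python
import numpy as np

def split_early_late(a, s, n_early):
    """Apply Eq. (1) for one source-microphone pair in one frequency bin.

    a: complex filter taps a(l), l = 0..L_h-1
    s: complex source frames s(t)
    n_early: number of leading taps treated as direct + early (assumed split)
    Returns (x, x_early, x_late), each with the same length as s.
    """
    a = np.asarray(a, dtype=complex)
    s = np.asarray(s, dtype=complex)
    a_early = a.copy()
    a_early[n_early:] = 0.0            # keep only the leading taps
    a_late = a - a_early               # remaining taps model late reverberation
    n = len(s)
    x = np.convolve(s, a)[:n]          # x(t) = sum_l a(l) s(t - l)
    x_early = np.convolve(s, a_early)[:n]
    x_late = np.convolve(s, a_late)[:n]
    return x, x_early, x_late

# Random placeholder data, for illustration only.
rng = np.random.default_rng(0)
a = rng.standard_normal(8) + 1j * rng.standard_normal(8)
s = rng.standard_normal(50) + 1j * rng.standard_normal(50)
x, x_early, x_late = split_early_late(a, s, n_early=2)
print(np.allclose(x, x_early + x_late))
```

Because the convolution is linear in the taps, the early and late parts always sum to the full reverberant signal, whatever split index is chosen.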

In practice, the ATFs are difficult to estimate without the knowledge of the acoustic sources. Instead, the RETFs are often chosen to characterize the relative relationship of the desired source signals received by microphones:

$$\begin{aligned} \textbf{x}_{n|e}(t)=\textbf{h}_{n}\left[ {\textbf{x}_{n|e}(t)}\right] _1, \end{aligned}$$
(4)

where \(\left[ {\cdot }\right] _1\) denotes the first element of the vector, and \(\textbf{h}_{n}\) denotes the \(M\times 1\) RETF vector between the nth speaker and the M microphones in the WASN. By definition, \(\left[ {\textbf{h}_{n}}\right] _1=1\). Considering all N speakers and letting the \(M\times 1\) vector \(\textbf{v}(t)\) represent the environmental noise, the stacked \(M\times 1\) vector of signals received by all microphones is given by:

$$\begin{aligned} \textbf{y}(t)= & {} \textbf{x}_{e}(t)+\textbf{x}_{l}(t)+\textbf{v}(t) \nonumber \\= & {} \sum \limits _{n=1}^{N}\textbf{x}_{n|e}(t)+\sum \limits _{n=1}^{N}\textbf{x}_{n|l}(t)+\textbf{v}(t), \end{aligned}$$
(5)

where \(\textbf{x}_e(t)=\textbf{H}\left[ {\left[ {\textbf{x}_{1|e}(t)}\right] _1,\left[ {\textbf{x}_{2|e}(t)}\right] _1,...,\left[ {\textbf{x}_{N|e}(t)}\right] _1}\right] ^T\), and \(\textbf{H}=\left[ {\textbf{h}_1,\textbf{h}_2,...,\textbf{h}_N}\right]\) is the \(M\times N\) RETFs matrix for all the N speakers.

In a WASN with J nodes, the vectors \(\textbf{y}(t)\) and \(\textbf{h}_n\), and the matrix \(\textbf{H}\) can be partitioned by node:

$$\begin{aligned} \textbf{y}(t)= & {} \left[ {\bar{\textbf{y}}^T_{1}(t),\bar{\textbf{y}}^T_{2}(t),...,\bar{\textbf{y}}^T_{J}(t)}\right] ^T,\end{aligned}$$
(6)
$$\begin{aligned} \bar{\textbf{y}}_{j}(t)= & {} \left[ {y_{j1}(t),y_{j2}(t),...,y_{jM_{j}}(t)}\right] ^T,\end{aligned}$$
(7)
$$\begin{aligned} \textbf{h}_n= & {} \left[ {\bar{\textbf{h}}^T_{n1},\bar{\textbf{h}}^T_{n2},...,\bar{\textbf{h}}^T_{nJ}}\right] ^T,\end{aligned}$$
(8)
$$\begin{aligned} {\textbf{H}}= & {} \left[ {\begin{array}{c} {{{\bar{\textbf{H}}}_1}}\\ {{{\bar{\textbf{H}}}_2}}\\ \vdots \\ {{{\bar{\textbf{H}}}_J}} \end{array}} \right] = \left[ {\begin{array}{cccc} {{{\bar{\textbf{h}}}_{11}}}&{}{{{\bar{\textbf{h}}}_{21}}}&{} \cdots &{}{{{\bar{\textbf{h}}}_{N1}}}\\ {{{\bar{\textbf{h}}}_{12}}}&{}{{{\bar{\textbf{h}}}_{22}}}&{} \cdots &{}{{{\bar{\textbf{h}}}_{N2}}}\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {{{\bar{\textbf{h}}}_{1J}}}&{}{{{\bar{\textbf{h}}}_{2J}}}&{} \cdots &{}{{{\bar{\textbf{h}}}_{NJ}}} \end{array}} \right] , \end{aligned}$$
(9)

where \((\bar{\cdot })\) denotes the local data belonging to one node, \(y_{ji}(t)\) denotes the ith microphone signal of the jth node. The vectors \(\bar{\textbf{y}}_{j}(t)\in \mathbb {C}^{M_j\times 1}\) and \({{\bar{\textbf{h}}}_{nj}}\in \mathbb {C}^{M_j\times 1}\) denote the signal captured by the jth node and the RETF from the nth speaker to the jth node, respectively.
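
The node-wise partition of Eqs. (6) to (9) amounts to slicing the stacked quantities along the microphone axis. A small sketch, assuming a hypothetical layout of three nodes with 3, 2, and 4 microphones and random placeholder data:

```python
import numpy as np

def split_by_node(stacked, mics_per_node):
    """Split an array with M = sum(M_j) rows into J per-node blocks."""
    edges = np.cumsum(mics_per_node)[:-1]   # row indices where a new node starts
    return np.split(stacked, edges, axis=0)

mics_per_node = [3, 2, 4]                   # M_1, M_2, M_3 (hypothetical layout)
M, N = sum(mics_per_node), 2
rng = np.random.default_rng(1)
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)            # stacked y(t)
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))  # stacked RETFs

y_blocks = split_by_node(y, mics_per_node)  # local signal vectors, one per node
H_blocks = split_by_node(H, mics_per_node)  # local RETF blocks, each M_j x N
print([b.shape for b in H_blocks])
```

Each block of `H_blocks` holds the local RETFs of all N speakers for one node, matching the row blocks in Eq. (9).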

3 BFMCLP

In this section, we develop the parallel processing of the BFMCLP for simultaneous speech dereverberation and noise reduction. We introduce the BFMCLP at the beginning, and then investigate its stability.

3.1 Framework

The parallel processing framework of the BFMCLP is shown in Fig. 1. It consists of a GSC and an MCLP, and the microphone signal vector \(\textbf{y}(t)\) is used as input to both parallel branches. As shown in the block diagram, the GSC consists of three components: a fixed beamformer (FB) \(\textbf{f}\), which steers a beam to a desired speaker and suppresses the other competing speakers; a blocking matrix (BM) \(\textbf{B}\), which is orthogonal to the target signal and cancels the desired speaker; and a data-dependent adaptive filter \(\textbf{w}\), which filters the output of \(\textbf{B}\). The difference between the signals from the FB path and the adaptive filter path is the original GSC output [28].

Fig. 1 Block diagram of the BFMCLP

The accuracy of the estimated RETFs matrix has a significant impact on the performance of the GSC. If the desired speech can be completely canceled in the BM, the GSC performs well in suppressing noise, interferences, and late reverberant components without distorting the desired speech. An estimation of the RETF for one speaker can be obtained by performing eigenvector decomposition on the corresponding covariance matrix and the eigenvector associated with the maximum eigenvalue is then extracted [25, 29]. Beforehand, the desired covariance matrix needs to be computed by subtracting the noise covariance matrix from the noisy covariance matrix. We assume that the activity patterns of the speakers are non-overlapping, and an ideal voice activity detector (VAD) is employed in this paper. In this way, the desired covariance matrices can be obtained at the initialization stage of the WASN.
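
The RETF estimation step described above (covariance subtraction followed by extraction of the principal eigenvector) can be sketched as follows. This is an illustration under idealized assumptions: the covariance matrices are taken as exactly known, whereas in practice they would be averaged over VAD-labelled frames. The function name `estimate_retf` is hypothetical.

```python
import numpy as np

def estimate_retf(R_noisy, R_noise):
    """Estimate one speaker's RETF from noisy and noise-only covariance matrices."""
    R_speech = R_noisy - R_noise                      # desired covariance matrix
    R_speech = 0.5 * (R_speech + R_speech.conj().T)   # enforce Hermitian symmetry
    eigvals, eigvecs = np.linalg.eigh(R_speech)       # eigenvalues in ascending order
    principal = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    return principal / principal[0]                   # normalize so the first element is 1

# Synthetic check with a rank-one speech covariance (placeholder data).
rng = np.random.default_rng(2)
M = 4
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h_true = h_true / h_true[0]
R_speech_true = 2.0 * np.outer(h_true, h_true.conj())
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_noise = A @ A.conj().T
h_est = estimate_retf(R_speech_true + R_noise, R_noise)
print(np.allclose(h_est, h_true))
```

Normalizing by the first element removes the arbitrary complex scale of the eigenvector and enforces \(\left[ {\textbf{h}_{n}}\right] _1=1\).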

However, the accuracy of the estimated RETFs decreases significantly as the reverberation time increases. To prevent the speech cancelation problem caused by inaccurate RETFs, the adaptive filter \(\textbf{w}(t)\) only updates when the desired speaker is inactive, but such an update strategy may lead to performance degradation for dereverberation. To overcome this problem, the MCLP is introduced to suppress reverberation by deconvolution in the second branch, which consists of a delay module and an estimated room regression vector \(\textbf{g}\). Note that Eq. (1) indicates that the reverberation effect can be modeled as the output of a multi-channel autoregressive (MCAR) system. This is the theoretical basis of the adaptive dereverberation method, in which the microphone array signals are expressed with the MCLP model [30, 31]. In this section, we propose the BFMCLP method, in which the GSC and the MCLP are performed in parallel. In this way, we can achieve much better performance when the speech is degraded by both reverberation and noise. The details of the BFMCLP method are presented below.

The FB \(\textbf{f}\) can be defined with the following constraints set:

$$\begin{aligned} {{\textbf{H}}^H}{\textbf{f}} = {\textbf{p}}, \end{aligned}$$
(10)

where \((\cdot )^H\) in the following denote conjugate transpose, and \(\textbf{p}\) is an \(N\times 1\) desired response vector consisting of ones and zeros. The desired output \(d\left( t \right)\) of the BFMCLP is the sum of the direct and early reflected components of the desired speakers which correspond to 1 in the vector \(\textbf{p}\):

$$\begin{aligned} d\left( t \right) = \left[ {{x_{1|e}}\left( t \right) ,{x_{2|e}}\left( t \right) ,...,{x_{N|e}}\left( t \right) } \right] {\textbf{p}}. \end{aligned}$$
(11)

A closed-form solution of Eq. (10) is \({\textbf{f}} = {\textbf{H}}{\left( {{{\textbf{H}}^H}{\textbf{H}}} \right) ^{ - 1}}{\textbf{p}}\), and the output of the FB is:

$$\begin{aligned} c\left( t \right) = {{\textbf{f}}^H}{\textbf{y}}\left( t \right) = d\left( t \right) + {{\textbf{f}}^H}\left( {{{\textbf{x}}_l}\left( t \right) + {\textbf{v}}\left( t \right) } \right) . \end{aligned}$$
(12)

Let the BM \(\textbf{B}\in \mathbb {C}^{M\times (M-N)}\) be defined as a basis for the orthogonal complement of the space spanned by the columns of the matrix \(\textbf{H}\). It is designed to cancel the desired speakers and satisfies

$$\begin{aligned} {{\textbf{B}}^H}{\textbf{H}} = {{\textbf{0}}_{\left( {M - N} \right) \times N}}, \end{aligned}$$
(13)

and a closed-form solution of Eq. (13) can be written as \({\textbf{B}} = {\left[ {{\textbf{I}} - {\textbf{H}}{{\left( {{{\textbf{H}}^H}{\textbf{H}}} \right) }^{ - 1}}{\textbf{H}}^H} \right] _{:,1:M - N}}\). The output of the BM can be given by:

$$\begin{aligned} {\textbf{u}}\left( t \right) = {{\textbf{B}}^H}{\textbf{y}}\left( t \right) = {{\textbf{B}}^H}\left( {{\textbf{x}}\left( t \right) + {\textbf{v}}\left( t \right) } \right) . \end{aligned}$$
(14)
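
The closed-form expressions for the FB and the BM can be checked numerically against the constraints in Eqs. (10) and (13). The sketch below uses a random placeholder RETF matrix with \(M=6\) and \(N=2\); the function names are hypothetical.

```python
import numpy as np

def fixed_beamformer(H, p):
    """f = H (H^H H)^{-1} p, which satisfies H^H f = p."""
    return H @ np.linalg.solve(H.conj().T @ H, p)

def blocking_matrix(H):
    """First M - N columns of the projector onto the orthogonal complement of span(H)."""
    M, N = H.shape
    proj = np.eye(M) - H @ np.linalg.solve(H.conj().T @ H, H.conj().T)
    return proj[:, :M - N]

rng = np.random.default_rng(3)
M, N = 6, 2
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
H = H / H[0, :]                          # RETF convention: first row equals 1
p = np.array([1, 0], dtype=complex)      # speaker 1 desired, speaker 2 rejected

f = fixed_beamformer(H, p)
B = blocking_matrix(H)
print(np.allclose(H.conj().T @ f, p))    # constraint of Eq. (10)
print(np.allclose(B.conj().T @ H, 0))    # constraint of Eq. (13)
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical choice of this sketch and does not change the result.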

In the MCLP branch, \(\textbf{q}(t)\) denotes the delayed signal of \(\textbf{y}(t)\):

$$\begin{aligned} {\textbf{q}}\left( t \right)= & {} {\left[ {{\textbf{q}}_1^T\left( t \right) ,...,{\textbf{q}}_M^T\left( t \right) } \right] ^T},\end{aligned}$$
(15)
$$\begin{aligned} {{\textbf{q}}_m}\left( t \right)= & {} {\left[ {{y_m}\left( {t - \tau } \right) ,...,{y_m}\left( {t - \tau - \left( {{L_g} - 1} \right) } \right) } \right] ^T}, \end{aligned}$$
(16)

where \(L_g\) depends mainly on \(L_h\), and \(\tau\) denotes the prediction delay in the MCLP model, which prevents the over-whitening problem [31]. As shown in Fig. 1, the output of the BFMCLP \(\hat{d}(t)\) is given by:

$$\begin{aligned} \hat{d} \left( t \right)= & {} c\left( t \right) - {c_B}\left( t \right) - {c_L}\left( t \right) \nonumber \\= & {} c\left( t \right) - {{\textbf{w}}^H}\left( t-1 \right) {\textbf{u}}\left( t \right) - {{\textbf{g}}^H}\left( t-1 \right) {\textbf{q}}\left( t \right) , \end{aligned}$$
(17)

where \(\hat{d}(t)\) denotes the estimation of the desired speaker, \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update independently over time. In the BFMCLP method, the two branches are designed for joint dereverberation and noise reduction.
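
The construction of the delayed vector \(\textbf{q}(t)\) in Eqs. (15) and (16) and the output in Eq. (17) can be sketched as follows. This is an illustrative sketch for one frequency bin; frames before the start of the recording are assumed to be zero, and the function names are hypothetical.

```python
import numpy as np

def delayed_stack(Y, t, tau, L_g):
    """Build q(t) from an (M, T) array of frames Y; frames before t = 0 are zero."""
    M = Y.shape[0]
    q = np.zeros((M, L_g), dtype=complex)
    for i in range(L_g):
        idx = t - tau - i                 # frame index y_m(t - tau - i)
        if idx >= 0:
            q[:, i] = Y[:, idx]
    return q.reshape(-1)                  # microphone-major order, as in Eq. (15)

def bfmclp_output(f, B, w, g, Y, t, tau, L_g):
    """d_hat(t) = c(t) - w^H u(t) - g^H q(t), following Eq. (17)."""
    y = Y[:, t]
    c = np.vdot(f, y)                     # np.vdot conjugates its first argument
    u = B.conj().T @ y
    q = delayed_stack(Y, t, tau, L_g)
    return c - np.vdot(w, u) - np.vdot(g, q)

# Two microphones, six frames, with easily recognizable values.
Y = np.array([[0, 1, 2, 3, 4, 5],
              [10, 11, 12, 13, 14, 15]], dtype=complex)
q = delayed_stack(Y, t=5, tau=2, L_g=3)
print(q.real)
```

With \(\tau =2\) and \(L_g=3\), the vector at \(t=5\) contains frames 3, 2, and 1 of each microphone, so the current and the immediately preceding frame are excluded from the prediction.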

The filter coefficients \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update iteratively with the normalized least mean squares (NLMS) [32] and RLS [33], respectively, and the details are summarized in Table 1.

Table 1 The details of the BFMCLP method

As shown in Table 1, \(\textbf{k}(t)\) is the gain vector, \(\textbf{P}(t)\) is the inverse correlation matrix of the input signal \(\textbf{q}(t)\), \(0<\alpha <1\) and \(0<\rho <1\) denote the forgetting factors, \(\lambda (t)\) denotes the variance of the desired signal, and \(\mu\) is the step size. It is to be emphasized that \(\textbf{w}(t)\) only updates when all the speakers are inactive, while \(\textbf{g}(t)\) updates continuously. In this way, instead of estimating both filters simultaneously, the BFMCLP can prevent the self-cancelation problem effectively.

3.2 Stability of the BFMCLP

Note in particular that the output of the BFMCLP introduces no distortion of the desired signals. Because of the BM \(\textbf{B}\) and the prediction delay \(\tau\), we have \({\textrm{E}}\{ {{\textbf{u}}\left( t \right) {d^*}\left( t \right) } \} = {\textbf{0}}\) and \({\textrm{E}}\{ {{\textbf{q}}\left( t \right) {d^*}\left( t \right) } \} = {\textbf{0}}\), in which \({\textrm{E}}\{\cdot \}\) denotes the expectation, indicating that \(c_B(t)\) and \(c_L(t)\) are both uncorrelated with the desired signal d(t).

In this subsection, we will further prove that the independent update of the two paths will not cause divergence of the system. We assume that the microphone signals are composed of speakers and one interference radiating from a specific direction:

$$\begin{aligned} {\textbf{y}}\left( t \right) =\textbf{x}_{e}(t) + {{\textbf{x}}_l}\left( t \right) + {{\textbf{v}}_e}\left( t \right) + {{\textbf{v}}_l}\left( t \right) , \end{aligned}$$
(18)

where \({{\textbf{v}}_e}\left( t \right)\) and \({{\textbf{v}}_l}\left( t \right)\) are the early-reflected components and late-reverberant components of the noise, respectively. We assume the RETFs are known. In the following, we analyze the system in two situations: the desired speaker is active or inactive.

3.2.1 Speaker active

When the speaker is active, the filter coefficients of the GSC branch are all fixed. Because the RETFs are estimated in advance and \(\textbf{w}(t)\) updates when only the noise exists, the GSC can suppress the early-reflected components of the noise without distorting the desired speaker. Thus, the output of the GSC branch is

$$\begin{aligned} {c_{\textrm{GSC}}}\left( t \right)= & {} c(t)-c_B(t) \nonumber \\= & {} {\left( {{\textbf{f}} - {\textbf{Bw}}} \right) ^H}{\textbf{y}}\left( t \right) \\= & {} {d}\left( t \right) + {\left( {{\textbf{f}} - {\textbf{Bw}}} \right) ^H}\left( {{{\textbf{x}}_l}\left( t \right) + {{\textbf{v}}_l}\left( t \right) } \right) ,\nonumber \end{aligned}$$
(19)

The input of the MCLP branch \(\textbf{q}(t)\) is always correlated to \(\left( {{{\textbf{x}}_l}\left( t \right) + {{\textbf{v}}_l}\left( t \right) } \right)\), and the MCLP aims to suppress the late-reverberation components by making the output \(\hat{d}(t)=c_{\textrm{GSC}}(t)-c_{{L}}(t)\) temporally uncorrelated [20].

3.2.2 Speaker inactive

When the speaker is inactive, the vector \(\textbf{y}(t)={{\textbf{v}}_e}\left( t \right) + {{\textbf{v}}_l}\left( t \right)\) needs to be canceled completely. In other words, the system should minimize \({\textrm{E}}\left\{ {{{\left| {\hat{d}\left( t \right) } \right| }^2}} \right\}\), where \({\hat{d}\left( t \right) }\) denotes the residual in this subsection. The filters \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update independently and simultaneously over time.

The filter \(\textbf{w}(t)\), which updates by the NLMS method, should minimize the cost function [28]:

$$\begin{aligned} J_{\textbf{w}}\left( t \right) ={\left\| {{\textbf{w}}\left( {t} \right) - {\textbf{w}}\left( t-1 \right) } \right\| ^{\textrm{2}}} + {\mathop {\mathrm Re}\nolimits } \{ {{\lambda _{\textbf{w}} ^*}\hat{d}\left( t \right) } \}, \end{aligned}$$
(20)

where \(\lambda _{\textbf{w}}\) is the Lagrange multiplier, and Re\(\{\cdot \}\) extracts the real part of a complex variable. Differentiating \(J_{\textbf{w}}\left( t \right)\) with respect to \({\textbf{w}}\left( {t} \right)\) gives:

$$\begin{aligned} \frac{{\partial {J_{{\textbf{w}}}}\left( t \right) }}{{\partial {{\textbf{w}}^H}\left( {t} \right) }} = 2\left( {{\textbf{w}}\left( {t} \right) - {\textbf{w}}\left( t-1 \right) } \right) - {\lambda _{\textbf{w}} ^*}{\textbf{u}}\left( t \right) . \end{aligned}$$
(21)

By setting Eq. (21) to zero, we can obtain the optimal filter coefficients:

$$\begin{aligned} {\textbf{w}}\left( {t} \right) = {\textbf{w}}\left( t-1 \right) + \frac{1}{2}{\lambda _{\textbf{w}} ^*}{\textbf{u}}\left( t \right) . \end{aligned}$$
(22)

We set the constraint

$$\begin{aligned} {c\left( t \right) = {{\textbf{w}}^H}\left( {t } \right) {\textbf{u}}\left( t \right) + {{\textbf{g}}^H}\left( t-1 \right) {\textbf{q}}\left( t \right) }, \end{aligned}$$
(23)

and solve for the \(\lambda _{\textbf{w}}\) by substituting Eq. (22) into Eq. (23), given by

$$\begin{aligned} {c}\left( t \right) ={{\textbf{w}}^H}\left( {t-1} \right) {\textbf{u}}\left( t \right) +\frac{1}{2}{\lambda _{\textbf{w}}}\left\| \textbf{u}(t)\right\| ^2+ {{\textbf{g}}^H}\left( t-1 \right) {\textbf{q}}\left( t \right) , \end{aligned}$$
(24)

then we obtain:

$$\begin{aligned} \lambda _{\textbf{w}}=\frac{2\hat{d}(t)}{\left\| \textbf{u}(t)\right\| ^2}, \end{aligned}$$
(25)

where \(\hat{d}(t)\) is defined in Eq. (60) in Table 1. Thus, Eq. (63) in Table 1 can be obtained by substituting Eq. (25) into Eq. (22) and introducing a scaling factor denoted by \(\mu\), where P(t) is a recursive average of \(\left\| \textbf{u}(t)\right\| ^2\).
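
The update in Eq. (22) with the multiplier of Eq. (25) can be verified numerically: with \(\mu =1\), the updated filter satisfies the constraint of Eq. (23). The sketch below is a simplification that normalizes by the instantaneous \(\left\| \textbf{u}(t)\right\| ^2\) instead of a recursive average, and the regularization constant `eps` is an added assumption to avoid division by zero.

```python
import numpy as np

def nlms_update(w_prev, g_prev, c, u, q, mu=1.0, eps=1e-12):
    """One NLMS step for the GSC filter w, driven by the BFMCLP residual."""
    d_hat = c - np.vdot(w_prev, u) - np.vdot(g_prev, q)   # a priori residual, Eq. (17)
    norm = np.vdot(u, u).real + eps                        # ||u(t)||^2, regularized
    w_new = w_prev + mu * np.conj(d_hat) * u / norm        # Eq. (22) with Eq. (25)
    return w_new, d_hat

# Random placeholder data.
rng = np.random.default_rng(4)
w_prev = rng.standard_normal(4) + 1j * rng.standard_normal(4)
g_prev = rng.standard_normal(6) + 1j * rng.standard_normal(6)
u = rng.standard_normal(4) + 1j * rng.standard_normal(4)
q = rng.standard_normal(6) + 1j * rng.standard_normal(6)
c = complex(rng.standard_normal(), rng.standard_normal())

w_new, d_hat = nlms_update(w_prev, g_prev, c, u, q, mu=1.0)
residual = c - np.vdot(w_new, u) - np.vdot(g_prev, q)      # a posteriori residual
print(abs(residual) < 1e-9)
```

In practice a step size \(\mu <1\) is typically chosen, so the constraint is only approached gradually, which trades convergence speed for robustness.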

Next, we consider the filter \(\textbf{g}(t)\). In the method of least squares, the optimized \(\textbf{g}(t)\) in BFMCLP should satisfy the principle of orthogonality:

$$\begin{aligned} {\textrm{E}}\{ {{\textbf{q}}\left( t \right) \hat{d}^*\left( t \right) } \}=\textbf{0}. \end{aligned}$$
(26)

Then we can get [28, 31, 33]:

$$\begin{aligned} {\mathbf {\Phi g}} {{ = }} {\textbf{z}}, \end{aligned}$$
(27)

where \({\mathbf {\Phi }}{\mathrm { = E}}\{ {{\textbf{q}}\left( t \right) {{\textbf{q}}^H}\left( t \right) } \}\) denotes the correlation matrix of the input \(\textbf{q}(t)\), and \({\textbf{z}}{\mathrm { = E}}\{ {{\textbf{q}}\left( t \right) c_{{\textrm{GSC}}}^*\left( t \right) } \}\) denotes the cross-correlation vector between \(\textbf{q}(t)\) and \(c_{{\textrm{GSC}}}\left( t \right) =c(t)-c_B(t)\). In the RLS method, the recursive computations of \({\mathbf {\Phi }}\) and \({\textbf{z}}\) are given by:

$$\begin{aligned} {\mathbf {\Phi }}\left( t \right)= & {} {\lambda _{\textbf{g}}}{\mathbf {\Phi }}\left( {t - 1} \right) + {\textbf{q}}\left( t \right) {{\textbf{q}}^H}\left( t \right) ,\end{aligned}$$
(28)
$$\begin{aligned} {\textbf{z}}\left( t \right)= & {} {\lambda _{\textbf{g}}}{\textbf{z}}\left( {t - 1} \right) + {\textbf{q}}\left( t \right) {c_{{\textrm{GSC}}}^*}\left( t \right) , \end{aligned}$$
(29)

where \({\lambda _{\textbf{g}}}\) is the forgetting factor. Then, the matrix inversion lemma can be used to obtain the recursive computation of \({\textbf{g}}\left( t \right)\), which is

$$\begin{aligned} {\textbf{g}}\left( t \right)= & {} {{\mathbf {\Phi }}^{ - 1}}\left( t \right) {\textbf{z}}\left( t \right) \nonumber \\= & {} {\textbf{P}}\left( t \right) {\textbf{z}}\left( t \right) \\= & {} {\lambda _{\textbf{g}}}{\textbf{P}}\left( t \right) {\textbf{z}}\left( {t - 1} \right) + {\textbf{P}}\left( t \right) {\textbf{q}}\left( t \right) c_{{\textrm{GSC}}}^*\left( t \right) , \nonumber \end{aligned}$$
(30)

using \({\textbf{P}}\left( t \right) = \lambda _{\textbf{g}}^{{\mathrm { - 1}}}{\textbf{P}}\left( {t - 1} \right) - \lambda _{\textbf{g}}^{{\mathrm { - 1}}}{\textbf{k}}\left( t \right) {{\textbf{q}}^H}\left( t \right) {\textbf{P}}\left( {t - 1} \right)\) and \({\textbf{k}}\left( t \right) = {\textbf{P}}\left( t \right) {\textbf{q}}\left( t \right)\) [28], \(\textbf{g}(t)\) in Eq. (65) in Table 1 can be obtained:

$$\begin{aligned} {\textbf{g}}\left( t \right)= & {} {\textbf{P}}\left( {t - 1} \right) {\textbf{z}}\left( {t - 1} \right) - {\textbf{k}}\left( t \right) {{\textbf{q}}^H}\left( t \right) {\textbf{P}}\left( {t - 1} \right) {\textbf{z}}\left( {t - 1} \right) + {\textbf{P}}\left( t \right) {\textbf{q}}\left( t \right) c_{{\textrm{GSC}}}^*\left( t \right) \nonumber \\= & {} {\textbf{g}}\left( {t - 1} \right) + {\textbf{k}}\left( t \right) \left[ {c_{{\textrm{GSC}}}^*\left( t \right) - {{\textbf{q}}^H}\left( t \right) {\textbf{g}}\left( {t - 1} \right) } \right] \\= & {} {\textbf{g}}\left( {t - 1} \right) + {\textbf{k}}\left( t \right) {{\hat{d}}^*}\left( t \right) . \nonumber \end{aligned}$$
(31)
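
The equivalence between the recursive update in Eq. (31) and the batch solution of Eq. (27) with the recursions of Eqs. (28) and (29) can be checked numerically. The sketch below uses the standard RLS gain computed from the MCLP input \(\textbf{q}(t)\), random placeholder data, and an assumed initialization \({\mathbf {\Phi }}(0)=\delta \textbf{I}\), \(\textbf{z}(0)=\textbf{0}\); it omits any variance weighting.

```python
import numpy as np

def rls_step(g, P, q, c_gsc, lam):
    """One RLS step for the MCLP filter g with forgetting factor lam."""
    Pq = P @ q
    k = Pq / (lam + np.vdot(q, Pq).real)        # gain vector, equal to P(t) q(t)
    err = c_gsc - np.vdot(g, q)                 # a priori output d_hat(t)
    g_new = g + k * np.conj(err)                # Eq. (31)
    P_new = (P - np.outer(k, q.conj()) @ P) / lam
    return g_new, P_new

rng = np.random.default_rng(5)
dim, steps, lam, delta = 4, 50, 0.98, 1e-2
g = np.zeros(dim, dtype=complex)
P = np.eye(dim, dtype=complex) / delta          # P(0) = Phi(0)^{-1}
Phi = delta * np.eye(dim, dtype=complex)
z = np.zeros(dim, dtype=complex)

for _ in range(steps):
    q = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
    c_gsc = complex(rng.standard_normal(), rng.standard_normal())
    g, P = rls_step(g, P, q, c_gsc, lam)
    Phi = lam * Phi + np.outer(q, q.conj())     # Eq. (28)
    z = lam * z + q * np.conj(c_gsc)            # Eq. (29)

g_batch = np.linalg.solve(Phi, z)               # Eq. (27)
print(np.allclose(g, g_batch))
```

The recursion avoids inverting \({\mathbf {\Phi }}(t)\) at every frame, which is the main practical reason for using the matrix inversion lemma here.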

In summary, using the output of the BFMCLP \({\hat{d}}\left( t \right)\) as the residual for updating the two branches, the whole system can converge towards the optimal solution.

4 DB-BFMCLP

In this section, we extend the BFMCLP for use in WASNs. A simple estimate is obtained in each node by utilizing only local signals, which yields a sub-optimal solution while reducing both bandwidth and power consumption.

4.1 Framework

As illustrated in Fig. 2, the input \({{\textbf{y}}_j}\left( t \right) \in {\mathbb {C}^{{\left( {{M_j} + N - {N_j}} \right) \times 1}}}\) of the jth node is the stacked vector of local signals \({\bar{\textbf{y}}_j}\left( t \right) \in {\mathbb {C}^{{M_j} \times 1}}\) and the transmitted signals \({\dot{\textbf{r}}_j}\left( t \right) \in {\mathbb {C}^{\left( {N - {N_j}} \right) \times 1}}\) from other nodes:

$$\begin{aligned} {{\textbf{y}}_j}\left( t \right) = {\left[ {\begin{array}{cc} {\bar{\textbf{y}}_j^T\left( t \right) }&{\dot{\textbf{r}}_j^T\left( t \right) } \end{array}} \right] ^T}. \end{aligned}$$
(32)

Fig. 2 Block diagram of the DB-BFMCLP. a Overall block diagram. b Details at the jth node

At the same time, the jth node transmits the shared signals \({{\textbf{r}}_j}\left( t \right) \in {\mathbb {C}^{{N_j} \times 1}}\) to the other nodes, where \({{\textbf{r}}_j}\left( t \right)\) is defined as follows. In a typical application scenario of WASNs, the M microphones and the N speakers are all placed randomly and dispersedly; therefore, the signal-to-noise ratios (SNRs) at the microphones differ for each source. When the positions of all the speakers are fixed and the activity patterns of the speakers are non-overlapping, we can estimate the distances between each speaker and the nodes in the WASN at the system initialization stage by using the ideal VAD and the magnitude of the signal received by the first microphone in each node. We choose the microphone with the highest energy for the nth speaker as the reference of the nth speaker. Assuming that the \(j_1\)th, \(j_2\)th, ..., \(j_{N_j}\)th microphones (\(N_j\) in total) in the jth node are selected as speaker references, \({{\textbf{r}}_j}\left( t \right)\) is written as:

$$\begin{aligned} {{\textbf{r}}_j}\left( t \right)= & {} {{\textbf{T}}_j}{\bar{\textbf{y}}_j}\left( t \right) ,\end{aligned}$$
(33)
$$\begin{aligned} {{\textbf{T}}_j}= & {} \left[ {\begin{array}{c} {{{\textbf{t}}_{j{j_1}}}}\\ {{{\textbf{t}}_{j{j_2}}}}\\ \vdots \\ {{{\textbf{t}}_{j{j_{{N_j}}}}}} \end{array}} \right] ,\end{aligned}$$
(34)
$$\begin{aligned} {{\textbf{t}}_{j{j_i}}}= & {} \left[ {\underbrace{\begin{array}{ccc} 0&{...}&0 \end{array}}_{{j_i} - 1}\begin{array}{ccc} {}&1&{} \end{array}\underbrace{\begin{array}{ccc} 0&{...}&0 \end{array}}_{{M_j} - {j_i}}} \right] , \end{aligned}$$
(35)

where \({{\textbf{T}}_j} \in {\mathbb {N}^{{N_j} \times {M_j}}}\) and \({{\textbf{t}}_{j{j_i}}} \in {\mathbb {N}^{1 \times {M_j}}}\). The \(N \times 1\) vector \({\textbf{r}}\left( t \right)\) denotes the stacked vector of all \(\textbf{r}_j\) and the \({\left( {N - {N_j}} \right) \times 1}\) vector \(\dot{\textbf{r}}_j\) denotes the received signal of the jth node, which can be written as:

$$\begin{aligned} {\textbf{r}}\left( t \right)= & {} {\left[ {{\textbf{r}}_1^T\left( t \right) ,{\textbf{r}}_2^T\left( t \right) ,...,{\textbf{r}}_J^T\left( t \right) } \right] ^T},\end{aligned}$$
(36)
$$\begin{aligned} {\dot{\textbf{r}}_j}\left( t \right)= & {} {\left[ {{\textbf{r}}_1^T\left( t \right) ,...,{\textbf{r}}_{j - 1}^T\left( t \right) ,{\textbf{r}}_{j + 1}^T\left( t \right) ,...,{\textbf{r}}_J^T\left( t \right) } \right] ^T}. \end{aligned}$$
(37)
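The selection and stacking in Eqs. (33)-(37) can be sketched numerically as follows. This is only an illustrative sketch: the function names are hypothetical, microphone indices are 0-based (the text counts from 1), and a node with \(N_j = 0\) simply contributes an empty vector.

```python
import numpy as np

def selection_matrix(ref_indices, num_mics):
    """Build T_j (N_j x M_j): row i has a single 1 at column ref_indices[i] (0-based)."""
    T = np.zeros((len(ref_indices), num_mics), dtype=int)
    for row, col in enumerate(ref_indices):
        T[row, col] = 1
    return T

def shared_signals(T, y_local):
    """r_j(t) = T_j y_j(t): pick the reference-microphone signals of node j."""
    return T @ y_local

def received_signals(r_all, j):
    """Stack r_i(t) of every node i != j, as in Eq. (37)."""
    parts = [r for i, r in enumerate(r_all) if i != j]
    return np.concatenate(parts) if parts else np.zeros(0, dtype=complex)
```

With three microphones and references at the first and third microphone, `selection_matrix([0, 2], 3)` picks exactly those two entries of the local signal vector.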

Note that \(\sum \nolimits _{j = 1}^J {{N_j}} = N\), \(0\le {N_j} \le {M_j}\), and a microphone being selected as the reference of one speaker cannot be a reference for another. Similar to \({{\textbf{y}}_j}\left( t \right)\), the RETFs belonging to the jth node are:

$$\begin{aligned} {{\textbf{H}}_j}= & {} \left[ {\begin{array}{c} {{{\bar{\textbf{h}}}_{1j}},{{\bar{\textbf{h}}}_{2j}},...,{{\bar{\textbf{h}}}_{Nj}}}\\ {{{\dot{\textbf{h}}}_{1j}},{{\dot{\textbf{h}}}_{2j}},...,{{\dot{\textbf{h}}}_{Nj}}} \end{array}} \right] ,\end{aligned}$$
(38)
$$\begin{aligned} {{\textbf{h}}_{nj}}= & {} {{\textbf{T}}_j}{\bar{\textbf{h}}_{nj}},\end{aligned}$$
(39)
$$\begin{aligned} {\dot{\textbf{h}}_{nj}}= & {} {\left[ {{\textbf{h}}_{n1}^T,...,{\textbf{h}}_{n\left( {j - 1} \right) }^T,{\textbf{h}}_{n\left( {j + 1} \right) }^T,...{\textbf{h}}_{nJ}^T} \right] ^T}. \end{aligned}$$
(40)

As illustrated in Fig. 2(b), when the constraint set \(\textbf{p}\) is consistent across the WASN, the parameters of the jth node are given by:

$$\begin{aligned} {\bar{\textbf{f}}_j}= & {} \frac{1}{J}{{\textbf{H}}_j}{\left( {{\textbf{H}}_j^H{{\textbf{H}}_j}} \right) ^{ - 1}}{\textbf{p}},\end{aligned}$$
(41)
$$\begin{aligned} {\bar{\textbf{B}}_j}= & {} {\left[ {{\textbf{I}} - {{\textbf{H}}_j}{{\left( {{\textbf{H}}_j^H{{\textbf{H}}_j}} \right) }^{ - 1}}{\textbf{H}}_j^H} \right] _{:,1:{M_j} - {N_j}}},\end{aligned}$$
(42)
$$\begin{aligned} {\bar{\textbf{u}}_j}\left( t \right)= & {} \bar{\textbf{B}}_j^H{{\textbf{y}}_j}\left( t \right) . \end{aligned}$$
(43)
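As a rough numerical illustration of Eqs. (41) and (42), the sketch below computes the local fixed beamformer and a blocking matrix for one node. It assumes that \({\textbf{H}}_j\) has full column rank and that the blocking matrix is taken from the usual orthogonal projector \({\textbf{I}} - {\textbf{H}}_j({\textbf{H}}_j^H{\textbf{H}}_j)^{-1}{\textbf{H}}_j^H\); the function name is hypothetical.

```python
import numpy as np

def local_gsc_parameters(H, p, num_nodes):
    """Return (f, B) for one node.

    H: (Q, N) RETF matrix with full column rank, p: (N,) shared response,
    num_nodes: J. B keeps the first Q - N columns of the projector onto
    the orthogonal complement of span(H).
    """
    Q, N = H.shape
    gram_inv = np.linalg.inv(H.conj().T @ H)
    f = H @ gram_inv @ p / num_nodes
    projector = np.eye(Q) - H @ gram_inv @ H.conj().T
    B = projector[:, : Q - N]
    return f, B
```

With \(Q_j = M_j + N - N_j\) rows, the number of retained columns \(Q_j - N\) equals \(M_j - N_j\), matching the column range in Eq. (42).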

Note that the input of the MCLP branch in the jth node is still \(\bar{\textbf{y}}_j\) rather than \({\textbf{y}}_j\):

$$\begin{aligned} \bar{\textbf{q}}_j\left( t \right)= & {} \left[ {\textbf{q}_{j1}^{T}\left( t \right) ,...,{\textbf{q}}_{j{M_j}}^{T}\left( t \right) } \right] ^{T},\end{aligned}$$
(44)
$$\begin{aligned} {\textbf{q}}_{ji}\left( t \right)= & {} \left[ {y_{ji}\left( {t - \tau } \right) ,...,{y_{ji}}\left( t - \tau - \left( {{L_g} - 1} \right) \right) } \right] ^{T}. \end{aligned}$$
(45)
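The delayed input of the MCLP branch in Eqs. (44) and (45) can be assembled as in the following sketch. The helper names are hypothetical, and frames before the start of the recording are treated as zeros, which is an assumption for the start-up phase rather than something stated in the text.

```python
import numpy as np

def delayed_vector(y_hist, t, tau, L_g):
    """q_ji(t) = [y(t - tau), ..., y(t - tau - (L_g - 1))]^T for one channel.

    y_hist is a 1-D array of STFT coefficients indexed by frame; frames before
    index 0 are treated as zero (an assumption for the start-up phase).
    """
    q = np.zeros(L_g, dtype=complex)
    for l in range(L_g):
        idx = t - tau - l
        if idx >= 0:
            q[l] = y_hist[idx]
    return q

def stacked_delayed_vector(Y_hist, t, tau, L_g):
    """Stack q_j1(t), ..., q_jM(t); Y_hist has shape (M_j, num_frames)."""
    return np.concatenate([delayed_vector(ch, t, tau, L_g) for ch in Y_hist])
```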

In addition, we provide more details of the implementation of the DB-BFMCLP in one node as an example in Table 2. Note that all the signals in vector \(\textbf{r}(t)\) can be obtained in each node of the WASN. Without loss of generality, we choose the first item of \(\textbf{r}(t)\) in Eq. (72) in Table 2.

Table 2 The details of the DB-BFMCLP method at the jth node

4.2 Convergence proof

In this part, we show the convergence of the proposed DB-BFMCLP to the BFMCLP. As mentioned in Section 3, the filters \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update their coefficients independently in the BFMCLP method. Because the full convergence proof of the DB-GSC to the centralized GSC has been provided in [12], only the convergence of the MCLP branch is presented in this paper. We assume \(\hat{d} \left( t \right) = c\left( t \right) - {c_L}\left( t \right)\) without considering the BM of the GSC and the filter \(\textbf{w}(t)\). For clarity, \(\hat{d}_{\textrm{cen}}(t)\) denotes the output of the centralized method and \(\hat{d}_{\textrm{dis}}(t)\) denotes that of the distributed one.

In the RLS method, there are two different estimation errors, where one is the a priori estimation error and the other is the a posteriori estimation error [28]. The a priori estimation error in the BFMCLP is introduced when estimating the desired speech signal:

$$\begin{aligned} {\hat{d} _{{\textrm{cen}}}}\left( t \right)= & {} c\left( t \right) - {{\textbf{g}}^H}\left( {t - 1} \right) {\textbf{q}}\left( t \right) \nonumber \\= & {} {{\textbf{f}}^H}{\textbf{y}}\left( t \right) - {{\textbf{g}}^H}\left( {t - 1} \right) {\textbf{q}}\left( t \right) . \end{aligned}$$
(46)

And the a posteriori estimation error is given by [28]:

$$\begin{aligned} \hat{z}_{\textrm{cen}}(t)= & {} c\left( t \right) - {{\textbf{g}}^H}\left( t \right) {\textbf{q}}\left( t \right) \nonumber \\= & {} c\left( t \right) - {\left[ {{\textbf{g}}\left( {t - 1} \right) + {\textbf{k}}\left( t \right) {{\hat{d} }_{{\textrm{cen}}}^*}\left( t \right) } \right] ^H}{\textbf{q}}\left( t \right) \\= & {} \left( {1 - {{\textbf{k}}^H}\left( t \right) {\textbf{q}}\left( t \right) } \right) \hat{d}_{{\textrm{cen}}} \left( t \right) , \nonumber \end{aligned}$$
(47)

further, the ratio of the a posteriori estimation error \(\hat{z}_{\textrm{cen}}(t)\) to the a priori estimation error \(\hat{d}_{\textrm{cen}}(t)\) is the conversion factor \(\gamma _{\textrm{cen}}(t)\), given by:

$$\begin{aligned} \gamma _{\textrm{cen}}(t)= & {} \frac{\hat{z}_{\textrm{cen}}(t)}{\hat{d}_{\textrm{cen}}(t)} \nonumber \\= & {} 1-\textbf{k}^H(t)\textbf{q}(t)\\= & {} 1-\frac{\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t)}{\alpha \lambda (t)+\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t)}, \nonumber \end{aligned}$$
(48)

which is determined by the input signal \(\textbf{q}(t)\) and the inverse correlation matrix \(\textbf{P}\). Note that the cost function in RLS is minimized based on the a posteriori estimation error \(\hat{z}_{\textrm{cen}}(t)\), and it does not depend on the a priori estimation error \(\hat{d}_{\textrm{cen}}(t)\) [28]. Because \(\textbf{P}\) is a positive definite matrix, \(\alpha \lambda (t)>0\) and \(\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t)>0\) hold for any nonzero \(\textbf{q}(t)\). Therefore, \(\gamma _{\textrm{cen}}(t)\) is positive and less than 1, so the a posteriori error is smaller in magnitude than the a priori error, leading to the convergence property of the RLS.
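The relation between the two errors in Eqs. (46)-(48) can be checked numerically with a single RLS step. The sketch below is illustrative only: the gain vector \(\textbf{k}(t) = \textbf{P}(t-1)\textbf{q}(t)/(\alpha \lambda (t)+\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t))\) and the update of \(\textbf{P}\) are the standard weighted-RLS forms, assumed here because they are consistent with Eq. (48), and the function name is hypothetical.

```python
import numpy as np

def rls_step(g, P, q, c, alpha, lam):
    """One weighted-RLS update of the prediction filter g.

    Returns the updated (g, P), the a priori error d, the a posteriori error z
    and the conversion factor gamma = 1 - k^H q.
    """
    Pq = P @ q
    quad = np.real(np.vdot(q, Pq))          # q^H P q, real for Hermitian P
    k = Pq / (alpha * lam + quad)           # gain vector (assumed standard form)
    d = c - np.vdot(g, q)                   # a priori error, Eq. (46)
    g_new = g + k * np.conj(d)
    z = c - np.vdot(g_new, q)               # a posteriori error, Eq. (47)
    P_new = (P - np.outer(k, q.conj()) @ P) / alpha
    gamma = 1.0 - np.vdot(k, q)             # conversion factor, Eq. (48)
    return g_new, P_new, d, z, gamma
```

For a Hermitian positive definite `P`, the returned `gamma` is real up to rounding and equals \(\alpha \lambda /(\alpha \lambda + \textbf{q}^H\textbf{P}\textbf{q})\).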

Because the common desired response vector \(\textbf{p}\), as shown in Eq. (41), is shared in the WASN, it is obvious that:

$$\begin{aligned} \sum \limits _{j = 1}^J {{\textbf{H}}_j^H{{\bar{\textbf{f}}}_j} = } \frac{1}{J}\sum \limits _{j = 1}^J {\textbf{p}} = {\textbf{p}}, \end{aligned}$$
(49)

and then

$$\begin{aligned} \sum \limits _{j = 1}^J {\bar{\textbf{f}}_j^H{{\textbf{y}}_j}\left( t \right) } = {{\textbf{f}}^H}{\textbf{y}}\left( t \right) = c\left( t \right) . \end{aligned}$$
(50)
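The step behind Eq. (49) follows directly from Eq. (41), assuming that \({\textbf{H}}_j^H{\textbf{H}}_j\) is invertible for every node:

$$\begin{aligned} {\textbf{H}}_j^H{\bar{\textbf{f}}_j} = \frac{1}{J}{\textbf{H}}_j^H{\textbf{H}}_j{\left( {\textbf{H}}_j^H{\textbf{H}}_j \right) ^{ - 1}}{\textbf{p}} = \frac{1}{J}{\textbf{p}}, \end{aligned}$$

so each node contributes \({\textbf{p}}/J\), and summing over the J nodes yields \({\textbf{p}}\).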

As shown in Table 2, the local output of each node in the DB-BFMCLP method can be written as:

$$\begin{aligned} {{\hat{d} }_{{\textrm{dis}}}}\left( t \right)= & {} \sum \limits _{j=1}^{J}\left( {{{\bar{c}}_j}\left( t \right) - \bar{\textbf{g}}_j^H\left( {t - 1} \right) {{\bar{\textbf{q}}}_j}\left( t \right) } \right) \nonumber \\= & {} {\sum \limits _{j = 1}^J {\bar{\textbf{f}}_j^H{{\textbf{y}}_j}\left( t \right) } - {{\textbf{g}}^H}\left( {t - 1} \right) {\textbf{q}}\left( t \right) }\\= & {} {c\left( t \right) - {{\textbf{g}}^H}\left( {t - 1} \right) {\textbf{q}}\left( t \right) }. \nonumber \end{aligned}$$
(51)

The desired recursive equation for updating the room regression vector \(\bar{\textbf{g}}_j(t)\) with \(j=1,\cdots ,J\) is

$$\begin{aligned} \bar{\textbf{g}}_j(t)=\bar{\textbf{g}}_j(t-1)+\bar{\textbf{k}}_j(t)\frac{\hat{d}_{\textrm{dis}}^*(t)}{J}, \end{aligned}$$
(52)

where \(\bar{\textbf{k}}_j(t)\) is the gain vector of the jth node given by Eq. (75) in Table 2. For the sake of analysis, we first assume that only the room regression vector of the first node is updated. Then, the a posteriori output of the distributed method can be written as

$$\begin{aligned} \hat{z}_{\textrm{dis}}(t)= & {} \left( {{{\bar{c}}_1}\left( t \right) - \bar{\textbf{g}}_1^H\left( {t} \right) {{\bar{\textbf{q}}}_1}\left( t \right) } \right) + \sum \limits _{j=2}^{J}\left( {{{\bar{c}}_j}\left( t \right) - \bar{\textbf{g}}_j^H\left( {t - 1} \right) {{\bar{\textbf{q}}}_j}\left( t \right) } \right) \nonumber \\= & {} {c\left( t \right) - \left[ {\bar{\textbf{g}}_1^H\left( t \right) ,\bar{\textbf{g}}_2^H\left( {t - 1} \right) ,...,\bar{\textbf{g}}_J^H\left( {t - 1} \right) } \right] {{{{\textbf{q}}\left( t \right) .}}}} \end{aligned}$$
(53)

By substituting Eq. (51) and Eq. (52) into Eq. (53), \(\hat{z}_{\textrm{dis}}(t)\) can be further written as

$$\begin{aligned} \hat{z}_{\textrm{dis}}(t)=\hat{d}_{\textrm{dis}}(t)\left( 1-\frac{\bar{\textbf{k}}_1^H(t)\bar{\textbf{q}}_1(t)}{J}\right) . \end{aligned}$$
(54)
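The intermediate step can be made explicit. Taking the Hermitian transpose of Eq. (52) for \(j=1\) and multiplying by \(\bar{\textbf{q}}_1(t)\) gives

$$\begin{aligned} \bar{\textbf{g}}_1^H(t)\bar{\textbf{q}}_1(t) = \bar{\textbf{g}}_1^H(t-1)\bar{\textbf{q}}_1(t) + \frac{\hat{d}_{\textrm{dis}}(t)}{J}\bar{\textbf{k}}_1^H(t)\bar{\textbf{q}}_1(t), \end{aligned}$$

and inserting this into the first line of Eq. (53) leaves the a priori output of Eq. (51) minus the correction term:

$$\begin{aligned} \hat{z}_{\textrm{dis}}(t) = \hat{d}_{\textrm{dis}}(t) - \frac{\hat{d}_{\textrm{dis}}(t)}{J}\bar{\textbf{k}}_1^H(t)\bar{\textbf{q}}_1(t). \end{aligned}$$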

It is obvious that the conversion factor can be written as

$$\begin{aligned} \gamma _{\textrm{dis}}(t)=\frac{\hat{z}_{\textrm{dis}}(t)}{\hat{d}_{\textrm{dis}}(t)}=1-\frac{\bar{\textbf{k}}_1^H(t)\bar{\textbf{q}}_1(t)}{J}. \end{aligned}$$
(55)

Considering Eq. (55), by using the final output \(\hat{d}_{\textrm{dis}}\left( t\right)\) for updating the local prediction filter \(\bar{\textbf{g}}_j\left( t\right)\) of all nodes, the relationship between the a posteriori output and the a priori output of the distributed method is similar to that of the centralized method, and the conversion factor is determined by the delayed signal \(\bar{\textbf{q}}_1\left( t\right)\) and the gain vector \(\bar{\textbf{k}}_1\left( t\right)\). In contrast, if the local room regression vector were updated using the local output \(\hat{d}_j\left( t\right)\), this relationship would be difficult to analyze. In addition, when all nodes update simultaneously, the conversion factor of the distributed structure can be represented as:

$$\begin{aligned} \gamma _{\textrm{dis}}(t)= & {} \frac{\hat{z}_{\textrm{dis}}(t)}{\hat{d}_{\textrm{dis}}(t)} \nonumber \\= & {} 1-\sum \limits _{j=1}^{J}\frac{\bar{\textbf{k}}_j^H(t)\bar{\textbf{q}}_j(t)}{J}\\= & {} 1-\frac{1}{J}\sum \limits _{j=1}^{J}\frac{\bar{\textbf{q}}_j^H(t)\bar{\textbf{P}}_j(t-1)\bar{\textbf{q}}_j(t)}{\alpha \lambda (t)+\bar{\textbf{q}}_j^H(t)\bar{\textbf{P}}_j(t-1)\bar{\textbf{q}}_j(t)}.\nonumber \end{aligned}$$
(56)

Each term of the sum in Eq. (56) lies between 0 and 1 for the same reason as in Eq. (48), so \(\gamma _{\textrm{dis}}(t)\) is also positive and less than 1. Thus, the convergence of the proposed DB-BFMCLP can be guaranteed. It will be demonstrated in the following section that \(\left[ \bar{\textbf{g}}_1^T,\bar{\textbf{g}}_2^T,...,\bar{\textbf{g}}_J^T\right] ^T\) converges to the optimal solution of the centralized method after enough iterations.
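The simultaneous update of all nodes in Eqs. (52) and (56) can be sketched as below. This is a minimal numerical check rather than the full algorithm of Table 2: the local gain vectors are assumed to have the standard RLS form, the inverse correlation matrices are not updated, and the function name is hypothetical.

```python
import numpy as np

def distributed_step(g_list, P_list, q_list, c, alpha, lam):
    """Update every node's filter with the shared a priori error d_dis / J.

    Returns the new filters, d_dis, z_dis and gamma_dis = z_dis / d_dis.
    """
    J = len(g_list)
    d_dis = c - sum(np.vdot(g, q) for g, q in zip(g_list, q_list))
    new_g, k_terms = [], []
    for g, P, q in zip(g_list, P_list, q_list):
        Pq = P @ q
        k = Pq / (alpha * lam + np.real(np.vdot(q, Pq)))   # local gain (assumed form)
        new_g.append(g + k * np.conj(d_dis) / J)            # Eq. (52)
        k_terms.append(np.vdot(k, q))
    z_dis = c - sum(np.vdot(g, q) for g, q in zip(new_g, q_list))
    gamma_dis = 1.0 - sum(k_terms) / J                      # Eq. (56)
    return new_g, d_dis, z_dis, gamma_dis
```

The returned `z_dis` equals `gamma_dis * d_dis` up to rounding, which is the relation stated by Eq. (56).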

5 Simulations

In this section, to validate the proposed BFMCLP method and the convergence of the proposed DB-BFMCLP method, the two methods are evaluated in noisy environments with varying degrees of reverberation.

5.1 Simulation setup

The sizes of the two simulated rooms are 5 m \(\times\) 5 m \(\times\) 3 m and 7 m \(\times\) 7 m \(\times\) 3 m, respectively. The reverberation time of the small room is set to \(T_{60}=450\) ms. For the big room, \(T_{60}=610\) ms, 720 ms, 830 ms, and 940 ms are considered.

Besides, each node in the WASN has 3 microphones, with a spacing of 5 cm between adjacent microphones. The positions of the nodes, speakers, and interferences relative to the room are illustrated in Fig. 3. We select 40 speakers (20 males and 20 females) from the TIMIT database as the clean speech signals. All results reported below are averaged over several experiments. The signal of each speaker is 30 s long, and the simulated microphone signals are obtained by convolving the clean speech with simulated room impulse responses (RIRs). The RIRs are simulated with an efficient implementation of the image source model [34]. A stationary noise source is also located in each simulated room. To focus on measuring the performance of the proposed methods, we assume that the clocks of the sensors are synchronized. We further test whether the distributed methods converge to the optimal solution by comparing their results with those of the centralized methods. Accordingly, the signals and parameters of all nodes are updated simultaneously.

Fig. 3 Parameters of the two simulated rooms. a \(T_{60}=450\) ms, b \(T_{60}=610\) ms, 720 ms, 830 ms, and 940 ms

The sampling rate is 16 kHz. The STFT uses a square-root Hanning window, and the frame length is set to 1024 samples with a frame shift of 512 samples to balance the performance and the real-time capability of the methods in reverberant and noisy scenarios. The performance is evaluated by four commonly used objective measures: the perceptual evaluation of speech quality (PESQ) [35], the short-time objective intelligibility (STOI) [36], the SNR, and the speech-to-reverberation modulation energy ratio (SRMR) [37].
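The STFT configuration above can be sketched as follows. The periodic form of the square-root Hann window and the helper names are assumptions made for illustration; the text only specifies the window type, frame length, and frame shift.

```python
import numpy as np

FS = 16000          # sampling rate in Hz
FRAME_LEN = 1024    # samples per frame
FRAME_SHIFT = 512   # samples between frames (50 % overlap)

def sqrt_hann(frame_len):
    """Periodic square-root Hann window (periodic form is an assumption)."""
    n = np.arange(frame_len)
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len))

def overlap_add_gain(window, shift):
    """Sum of squared, shifted windows over one shift period."""
    sq = window ** 2
    return sum(sq[start:start + shift] for start in range(0, len(window), shift))
```

With this window used for both analysis and synthesis, the squared windows of overlapping frames sum to one, so overlap-add reconstructs the signal without amplitude modulation. At 16 kHz, a frame spans 64 ms and the shift is 32 ms.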

5.2 Evaluation of the distributed MCLP

We first test the convergence of the distributed MCLP in the DB-BFMCLP in reverberant scenarios, where the first setup (a) with \(T_{60}=450\) ms and the second setup (b) with \(T_{60}=830\) ms are considered. Without the GSC branch, the DB-BFMCLP and the BFMCLP reduce to the distributed MCLP (DB-MCLP) and the MCLP, respectively. In this case, we choose the first microphone as the reference of the single speaker, and \(\textbf{h}=[1,0,...,0]^T\). The speech source is located at the position of the desired speaker, and \(L_g=8\) and \(\tau =1\) are set in this evaluation. The PESQ improvements of the outputs of the single-node MCLP (SN-MCLP), the centralized MCLP (Cen-MCLP), and the DB-MCLP versus time are depicted in Fig. 4. One can see that the performance of the Cen-MCLP and that of the DB-MCLP are close once both have converged, that both outperform the SN-MCLP, and that the distributed approach converges faster [38]. This is because the room regression vector \(\textbf{g}\) is split into lower-dimensional ones in the DB-MCLP.

Fig. 4 Convergence of the evaluated methods over time in terms of PESQ improvement. a \(T_{60}=450\) ms, b \(T_{60}=830\) ms

5.3 Evaluation of the BFMCLP and the DB-BFMCLP

We investigate the performance of the BFMCLP and the DB-BFMCLP in noisy and reverberant scenarios over twenty runs. We compare the two proposed methods with five existing related ones. In total, the following seven methods are compared: the MCLP, the GSC, the DB-GSC (the distributed structure of the GSC), the LCMV method, the LC-DANSE method (the distributed structure of the LCMV), the BFMCLP method, and the DB-BFMCLP method. In addition, \(L_g=4\) and \(\tau =1\) are chosen in this evaluation.

The signal-to-interference ratio (SIR), which measures the power ratio between the received desired speaker and the competing speaker, is set to 0 dB. The SNR, which defines the power ratio between the speakers and the noise, is set to 13 dB in the cases when studying the influence of the reverberation time. The SNR is set to 5 dB, 10 dB, 15 dB, and 20 dB to evaluate the influence of the noise.
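One common way to realize a prescribed SNR or SIR in a simulation is to rescale the disturbing signal before mixing, as in the sketch below. This is only an assumed procedure for illustration; the text does not state how the ratios were set, and the helper names are hypothetical.

```python
import numpy as np

def power(x):
    """Mean power of a (possibly complex) signal."""
    return float(np.mean(np.abs(x) ** 2))

def scale_to_ratio(target, disturbance, target_db):
    """Scale `disturbance` so that 10*log10(P_target / P_disturbance) = target_db."""
    gain = np.sqrt(power(target) / (power(disturbance) * 10.0 ** (target_db / 10.0)))
    return gain * disturbance

def measured_ratio_db(target, disturbance):
    """Power ratio between two signals in dB."""
    return 10.0 * np.log10(power(target) / power(disturbance))
```

For example, `scale_to_ratio(speech, noise, 13.0)` returns a noise signal whose power is 13 dB below that of `speech`, and a target of 0 dB yields equal powers, as used for the SIR.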

The numbers of channels transmitted by each method per TF-bin are presented in Table 3. One can see that all three distributed methods need fewer channels than their centralized counterparts. The DB-GSC and the DB-BFMCLP only require that the number of speakers not exceed the total number of microphones in the WASN, whereas \(N < {M_j}\) needs to be satisfied in the LC-DANSE [11]; the two methods are therefore more robust to the number of speakers.

Table 3 The number of channels transmitted of each method per TF-bin at the jth node

We also show the computational complexity of the BFMCLP and the DB-BFMCLP in Table 4, where both a scalar complex addition and a scalar complex multiplication are counted as one floating point operation (FLOP) [39]. For simplicity of expression, we set \({Q_j} = \left( {{M_j} + N - {N_j}} \right)\). As a comparison, we also present the computational complexity of the existing GSC method. It can be observed from Table 4 that, because of the smaller number of filter dimensions, the complexity of the DB-BFMCLP is reduced significantly.

Table 4 Computational complexity of the three methods per TF-bin at the jth node

The improvements of the above-mentioned methods in terms of the four objective measures are presented in Figs. 5 and 6. It is clear that the performance of the DB-BFMCLP and that of the BFMCLP are close in most cases, which further verifies the convergence of the DB-BFMCLP to the BFMCLP. An observation from Fig. 5 is that the impact of reverberation on speech quality gradually exceeds that of noise as the reverberation time increases, which degrades the performance of the existing related beamformers. In contrast, the MCLP maintains a stable performance. This demonstrates that reverberation can limit the performance of the related beamformers. However, the BFMCLP and the DB-BFMCLP have obvious advantages in all measures under reverberant and noisy environments, demonstrating the superiority of the parallel structure proposed in this paper.

Fig. 5 Performance comparison of the evaluated methods with varying degrees of reverberation (SNR = 13 dB). a PESQ improvement, b STOI improvement, c SNR improvement, d SRMR improvement

Fig. 6 Performance comparison of the evaluated methods with varying degrees of noise (\(T_{60}=610\) ms). a PESQ improvement, b STOI improvement, c SNR improvement, d SRMR improvement

Furthermore, we perform ten random experiments to verify the stability of the system, where in each experiment the room size \(S \in \left[ 25, 72\right]\) \(\textrm{m}^2\), SIR \(\in \left[ -2, 2\right]\) \(\textrm{dB}\), SNR \(\in \left[ 10, 20\right]\) \(\textrm{dB}\), and reverberation time \(T_{60} \in \left[ 400, 900\right]\) \(\textrm{ms}\) are chosen randomly. Two speakers, one interference, and a four-node WASN are randomly and dispersedly arranged in the room, and the microphone constellation in each node remains fixed as in Section 5.1. The improvements depicted in Fig. 7 indicate the robustness of the DB-BFMCLP and the BFMCLP.

Fig. 7 Performance comparison in more general experiments. a PESQ improvement, b SNR improvement

5.4 Evaluation of the influence of VAD errors

An ideal VAD has been used in the previous evaluations, and the filters and parameters of the speech enhancement methods are updated when the speakers are inactive. In this part, we further study the influence of VAD errors on the performance of the GSC, the BFMCLP, and their distributed structures for completeness. Here, \(\phi _s\) indicates the percentage of the speech-and-noise frames that are erroneously detected as noise-only frames.
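A VAD error rate \(\phi _s\) of this kind can be emulated by relabeling a random subset of the speech-and-noise frames of an ideal VAD as noise-only. The sketch below is an assumed procedure for illustration; the function name is hypothetical, \(\phi _s\) is passed as a fraction between 0 and 1 rather than a percentage, and the random choice of frames is not specified in the text.

```python
import numpy as np

def inject_vad_errors(vad, phi_s, rng):
    """Relabel a fraction phi_s (0..1) of speech frames as noise-only.

    vad: boolean array, True = speech-and-noise frame. Returns a copy.
    """
    corrupted = vad.copy()
    speech_idx = np.flatnonzero(vad)
    num_flip = int(round(phi_s * speech_idx.size))
    flip = rng.choice(speech_idx, size=num_flip, replace=False)
    corrupted[flip] = False
    return corrupted
```

Noise-only frames are left untouched, so only missed speech detections are modeled, in line with the definition of \(\phi _s\).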

The influence of \(\phi _s\) on the performance of the four methods is studied in two scenarios using the simulated room depicted in Fig. 3b, with \(T_{60}=610\) ms and SNR = 13 dB. In the first scenario, we assume that the accurate \(\textbf{H}\) is still known to all nodes, and the inaccurate noise frames are only used to update the filter \(\textbf{w}\); the PESQ improvements in this scenario are depicted in Fig. 8a. In the second scenario, the inaccurate noise frames are used to estimate both the RETF \(\textbf{H}\) and the filter \(\textbf{w}\), and the results are shown in Fig. 8b. The four methods are clearly more sensitive to the estimation error of the RETFs, and Fig. 8 shows the superiority of the two parallel structures over the two GSC-based methods in both scenarios.

Fig. 8 PESQ improvement of the evaluated methods with varying degrees of VAD errors. a Accurate RETFs, b inaccurate RETFs

6 Conclusion

In this paper, for speech enhancement in reverberant and noisy environments, the parallel BFMCLP method has been proposed and extended to WASNs. The proposed methods suppress reverberation and noise by exploiting the property that the delayed signal in the MCLP and the blocked signal in the GSC are both uncorrelated with the desired signal. The parallel architecture has two advantages: first, the two filters can be updated independently, which effectively prevents the self-cancelation problem caused by estimation errors of the RETFs and thus improves the stability of the system; second, the parallel architecture can be easily extended to distributed systems. We provide the details of the two parallel methods and prove the convergence of the DB-BFMCLP method. Finally, we test the BFMCLP and the DB-BFMCLP in reverberant and noisy scenarios; simulation results indicate that the two proposed methods outperform the existing methods, and that the DB-BFMCLP provides a performance comparable to that of the centralized BFMCLP while significantly reducing both the computational and the transmission cost.

Availability of data and materials

The datasets generated and/or analyzed during the current study are not publicly available because all of them can be generated by readers according to the simulation setup in Section 5, but they are available from the corresponding author on reasonable request if readers have difficulties doing so.

Abbreviations

WSN: Wireless sensor network

WASN: Wireless acoustic sensor network

ASR: Automatic speech recognition

MEMS: Micro-electro-mechanical system

DB-MWF: Distributed multichannel Wiener filter

DANSE: Distributed adaptive node-specific signal estimation

LCMV: Linearly constrained minimum variance

LC-DANSE: Linearly constrained distributed adaptive node-specific signal estimation

GSC: Generalized sidelobe canceler

DB-GSC: Distributed generalized sidelobe canceler

MCLP: Multi-channel linear prediction

MVDR: Minimum variance distortionless response

SC: Sidelobe-cancelation

LP: Linear prediction

ISCLP: Integrated sidelobe cancelation and linear prediction

BFMCLP: Beamforming and multichannel linear prediction

DB-BFMCLP: Distributed beamforming and multichannel linear prediction

STFT: Short-time Fourier transform

RETF: Relative early transfer function

ATF: Acoustic transfer function

FB: Fixed beamformer

BM: Blocking matrix

VAD: Voice activity detector

MCAR: Multi-channel autoregressive

RLS: Recursive least squares

NLMS: Normalized least mean squares

RIR: Room impulse response

PESQ: Perceptual evaluation of speech quality

STOI: Short-time objective intelligibility

SRMR: Speech-to-reverberation modulation energy ratio

SNR: Signal-to-noise ratio

SIR: Signal-to-interference ratio

FLOP: Floating point operation

References

  1. D. Estrin, L. Girod, G. Pottie, M. Srivastava, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Instrumenting the world with wireless sensor networks, vol. 4 (2001), pp. 2033–2036. https://doi.org/10.1109/ICASSP.2001.940390

  2. O.M. Bouzid, G.Y. Tian, J. Neasham, B. Sharif, Investigation of sampling frequency requirements for acoustic source localisation using wireless sensor networks. Appl. Acoust. 74(2), 269–274 (2013). https://doi.org/10.1016/j.apacoust.2010.12.013

  3. R. Ali, T. van Waterschoot, M. Moonen, An integrated mvdr beamformer for speech enhancement using a local microphone array and external microphones. EURASIP J. Audio Speech Music Process. 10 (2021). https://doi.org/10.1186/s13636-020-00192-2

  4. X. Guo, M. Yuan, Y. Ke, C. Zheng, X. Li, Distributed node-specific block-diagonal LCMV beamforming in wireless acoustic sensor networks. Signal Process. 185, 108085 (2021). https://doi.org/10.1016/j.sigpro.2021.108085

  5. A. Bertrand, in 2011 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT). Applications and trends in wireless acoustic sensor networks: a signal processing perspective (2011). pp. 1–6. https://doi.org/10.1109/SCVT.2011.6101302

  6. A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in fully connected sensor networks–part i: Sequential node updating. IEEE Trans. Sig. Process. 58(10), 5277–5291 (2010). https://doi.org/10.1109/TSP.2010.2052612

  7. A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in fully connected sensor networks–part ii: Simultaneous and asynchronous node updating. IEEE Trans. Signal Process. 58(10), 5292–5306 (2010). https://doi.org/10.1109/TSP.2010.2052613

  8. J. Zhang, R. Heusdens, R.C. Hendriks, Rate-distributed spatial filtering based noise reduction in wireless acoustic sensor networks. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 2015–2026 (2018). https://doi.org/10.1109/TASLP.2018.2851157

  9. S. Markovich-Golan, A. Bertrand, M. Moonen, S. Gannot, Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Process. 107, 4–20 (2015). https://doi.org/10.1016/j.sigpro.2014.07.014

  10. S. Doclo, M. Moonen, T. Van den Bogaert, J. Wouters, Reduced-bandwidth and distributed mwf-based noise reduction algorithms for binaural hearing aids. IEEE Trans. Audio Speech Lang. Process. 17(1), 38–51 (2009). https://doi.org/10.1109/TASL.2008.2004291

  11. A. Bertrand, M. Moonen, Distributed node-specific LCMV beamforming in wireless sensor networks. IEEE Trans. Signal Process. 60(1), 233–246 (2012). https://doi.org/10.1109/TSP.2011.2169409

  12. S. Markovich-Golan, S. Gannot, I. Cohen, Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks. IEEE Trans. Audio Speech Lang. Process. 21(2), 343–356 (2013). https://doi.org/10.1109/TASL.2012.2224454

  13. P.A. Naylor, N.D. Gaubitch, Speech dereverberation. Springer London. (2010). https://doi.org/10.1007/978-1-84996-056-4

  14. Z. Honghu, Y. Jia, P. Jianxin, Chinese speech intelligibility of elderly people in environments combining reverberation and noise. Appl. Acoust. 150, 1–4 (2019). https://doi.org/10.1016/j.apacoust.2019.02.002

  15. K. Lebart, J.M. Boucher, P. Denbigh, A new method based on spectral subtraction for speech dereverberation. Acta Acustica U. Acustica. 87, 359–366 (2001)

  16. A. Schwarz, K. Reindl, W. Kellermann, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), A two-channel reverberation suppression scheme based on blind signal separation and wiener filtering (2012), pp. 113–116. https://doi.org/10.1109/ICASSP.2012.6287830

  17. E.A.P. Habets, J. Benesty, A two-stage beamforming approach for noise reduction and dereverberation. IEEE Trans. Audio Speech Lang. Process. 21(5), 945–958 (2013). https://doi.org/10.1109/TASL.2013.2239292

  18. M. Miyoshi, Y. Kaneda, Inverse filtering of room acoustics. IEEE Trans. Acoust. Speech Signal Process. 36(2), 145–152 (1988). https://doi.org/10.1109/29.1509

  19. T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B. Juang, in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, Blind speech dereverberation with multi-channel linear prediction based on short time fourier transform representation (2008), pp. 85–88. https://doi.org/10.1109/ICASSP.2008.4517552

  20. T. Yoshioka, T. Nakatani, Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening. IEEE Trans. Audio Speech Lang. Process. 20(10), 2707–2720 (2012). https://doi.org/10.1109/TASL.2012.2210879

  21. S. Gergen, A. Nagathil, R. Martin, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Audio signal classification in reverberant environments based on fuzzy-clustered ad-hoc microphone arrays (2013), pp. 3692–3696. https://doi.org/10.1109/ICASSP.2013.6638347

  22. S. Pasha, C. Ritz, in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Clustered multi-channel dereverberation for ad-hoc microphone arrays (2015). pp. 274–278. https://doi.org/10.1109/APSIPA.2015.7415519

  23. M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, T. Nakatani, Strategies for distant speech recognition in reverberant environments (2015). https://doi.org/10.1186/s13634-015-0245-7

  24. T. Dietzen, S. Doclo, M. Moonen, T. van Waterschoot, Integrated sidelobe cancellation and linear prediction kalman filter for joint multi-microphone speech dereverberation, interfering speech cancellation, and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 740–754 (2020). https://doi.org/10.1109/TASLP.2020.2966869

  25. T. Dietzen, A. Spriet, W. Tirry, S. Doclo, M. Moonen, T. van Waterschoot, Comparative analysis of generalized sidelobe cancellation and multi-channel linear prediction for speech dereverberation and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 27(3), 544–558 (2019). https://doi.org/10.1109/TASLP.2018.2886743

  26. Y. Chan, K. Ho, A simple and efficient estimator for hyperbolic location. IEEE Trans. Signal Process. 42(8), 1905–1915 (1994). https://doi.org/10.1109/78.301830

  27. Y. Zeng, R.C. Hendriks, Distributed delay and sum beamformer for speech enhancement via randomized gossip. IEEE/ACM Trans. Audio Speech, and Language Processing 22(1), 260–273 (2014). https://doi.org/10.1109/TASLP.2013.2290861

  28. S. Haykin, Adaptive Filter Theory (Prentice Hall, 2002)

  29. I. Kodrasi, S. Doclo, in 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). EVD-based multi-channel dereverberation of a moving speaker using different RETF estimation methods (2017). pp. 116–120. https://doi.org/10.1109/HSCMA.2017.7895573

  30. K. Abed-Meraim, E. Moulines, P. Loubaton, Prediction error method for second-order blind identification. IEEE Trans. Signal Process. 45(3), 694–705 (1997). https://doi.org/10.1109/78.558487

  31. T. Yoshioka, T. Nakatani, K. Kinoshita, M. Miyoshi, Speech Dereverberation and Denoising Based on Time Varying Speech Model and Autoregressive Reverberation Model (Springer, Berlin Heidelberg, 2010), pp.151–182

  32. S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49(8), 1614–1626 (2001). https://doi.org/10.1109/78.934132

  33. T. Yoshioka, H. Tachibana, T. Nakatani, M. Miyoshi, in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Adaptive dereverberation of speech signals with speaker-position change detection (2009), pp. 3733–3736. https://doi.org/10.1109/ICASSP.2009.4960438

  34. J. Allen, D. Berkley, Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65, 943–950 (1979). https://doi.org/10.1121/1.382599

  35. A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Perceptual evaluation of speech quality (pesq)-a new method for speech quality assessment of telephone networks and codecs, vol. 2 (2001), pp. 749–752. https://doi.org/10.1109/ICASSP.2001.941023

  36. C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011). https://doi.org/10.1109/TASL.2011.2114881

  37. J.F. Santos, M. Senoussaoui, T.H. Falk, in 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC). An improved non-intrusive intelligibility metric for noisy and reverberant speech (2014). pp. 55–59. https://doi.org/10.1109/IWAENC.2014.6953337

  38. C. Zheng, A. Deleforge, X. Li, W. Kellermann, Statistical analysis of the multichannel wiener filter using a bivariate normal distribution for sample covariance matrices. IEEE/ACM Trans. Audio Speech Lang. Process. 26(5), 951–966 (2018). https://doi.org/10.1109/TASLP.2018.2800283

  39. H. Raphael, Floating point operations in matrix-vector calculus (Technische Universität München, Tech. rep, 2007)

Acknowledgements

Not applicable.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62101550.

Author information

Contributions

Zhe Han: software and writing (original draft). Yuxuan Ke: platform and writing (review and editing). Chengshi Zheng and Xiaodong Li: supervision and writing (review and editing). All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chengshi Zheng.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Han, Z., Ke, Y., Li, X. et al. Parallel processing of distributed beamforming and multichannel linear prediction for speech denoising and deverberation in wireless acoustic sensor networks. J AUDIO SPEECH MUSIC PROC. 2023, 25 (2023). https://doi.org/10.1186/s13636-023-00287-6

  • DOI: https://doi.org/10.1186/s13636-023-00287-6

Keywords

  • Wireless acoustic sensor networks
  • Speech enhancement
  • Microphone arrays