Parallel processing of distributed beamforming and multichannel linear prediction for speech denoising and dereverberation in wireless acoustic sensor networks
EURASIP Journal on Audio, Speech, and Music Processing volume 2023, Article number: 25 (2023)
Abstract
More and more smart home devices with microphones have entered daily life in recent years, and it is highly desirable to connect these microphones as wireless acoustic sensor networks (WASNs) so that the devices can be better controlled in an enclosure. For indoor applications, both environmental noise and room reverberation may severely degrade speech quality, and thus both need to be removed to improve users' experience. To this end, this paper proposes a parallel processing framework of distributed beamforming and multichannel linear prediction (DBBFMCLP), which consists of a generalized sidelobe canceler and multichannel linear prediction for simultaneous speech dereverberation and noise reduction in WASNs. By sharing a common desired response vector, the proposed DBBFMCLP can provide a significant reduction in communication bandwidth without sacrificing performance. The convergence of the DBBFMCLP to its centralized implementation is derived mathematically. Simulation results verify the superiority of the proposed method over existing related methods in noisy and reverberant scenarios.
1 Introduction
Recent progress in microelectromechanical systems (MEMS) and wireless communications has enabled the development and popularization of low-cost and low-power wireless sensor networks (WSNs) [1]. A WSN usually consists of multiple nodes connected by wireless links and has been applied to various fields, including speech extraction, acoustic source localization, and acoustic event detection [2, 3]. In general, wireless acoustic sensor networks (WASNs) can be built from smart home devices, each equipped with at least one microphone. Each node also has an individual signal processing unit and a communication module, so that it can monitor, process, and broadcast signals.
Compared with conventional compact microphone arrays, a WASN comprises several nodes that are placed dispersedly and/or randomly, so that it can cover a much larger area. Besides, WASNs can enhance the robustness and the extensibility of the system through decentralized operation [4]. The often-studied problems of WASNs are synchronization acquisition and transmission. The main factor of the synchronization problem is the offset of the clock oscillators, and many efforts have been made to solve the clock synchronization problem [5]. The other factor is the asynchronous or synchronous updating of signals and parameters in each node; it has been proved that appropriate updating is more likely to lead to optimal estimators [6, 7].
For WASNs, centralized methods need to gather all observations in a fusion center. In theory, centralized methods can achieve the best performance because they have complete information, but these methods usually require a large communication bandwidth and computational power. Due to limited transmission bandwidth and energy resources in practice, the optimal centralized methods are often difficult, if not impossible, to apply. An alternative solution is to use distributed methods, which can achieve nearly the same performance as centralized methods while requiring far fewer broadcast channels [6, 8, 9, 10, 11, 12].
Recently, many speech enhancement methods have been proposed for WASNs to reduce environmental noise. In [10], a distributed multichannel Wiener filter (DBMWF) method was proposed for binaural hearing aids. This method considered the case of a single speech source under a stationary-noise assumption, and only a one-channel signal was transmitted from each node to the other. In [6], the coexistence of multiple speakers was considered, and a distributed adaptive node-specific signal estimation (DANSE) method was proposed, which aims to obtain different outputs in each node. In [11], a linearly constrained distributed adaptive node-specific signal estimation (LCDANSE) method was proposed, which uses a node-specific linearly constrained minimum variance (LCMV) beamformer. A distributed generalized sidelobe canceler (DBGSC) with multiple constraints was presented for speech enhancement in [12], where the convergence of the DBGSC to the centralized generalized sidelobe canceler (GSC) was proved. Note that the DBGSC method was based on a specific transformation that allows reformulating the centralized beamformer as a sum of all local GSCs.
Apart from noise, room reverberation may also degrade speech quality severely in an enclosure [13, 14]. For indoor speech communication applications, such as hands-free telephony, speaking to smart home devices, and conference calls, microphones are often placed at a certain distance from the desired speaker. In these circumstances, microphones receive not only the direct sound but also reflections from the surrounding objects and walls, where the late reflections are referred to as reverberation. It has been shown that these undesired reverberant components degrade both the performance of automatic speech recognition (ASR) systems and perceptual speech quality. To solve this problem, many dereverberation methods have been proposed [15, 16, 17, 18]. Among these methods, the multichannel linear prediction (MCLP) proposed in [19] is widely used for its promising performance. The weighted recursive least squares (RLS) method was introduced in [19] to accelerate the convergence of the filter parameters. In addition, [20] demonstrated that the MCLP can suppress reverberation without assuming specific acoustic conditions, although it was originally proposed for single-source dereverberation under noise-free scenarios. For WASNs, dereverberation is also very important for speech enhancement. In [21, 22], two multichannel dereverberation approaches for ad hoc microphone arrays were introduced, in which the reverberation was reduced by selecting a subset of microphones with a relatively lower level of reverberation. Unlike noise reduction methods, dereverberation methods for WASNs often ignore practical constraints, e.g., the limited transmission bandwidth and energy resources.
In reverberant and noisy environments, dereverberation and noise reduction should be integrated in either a parallel processing framework or a serial processing framework [23, 24]. In [23], a system was proposed that employs multiple-output MCLP followed by the minimum variance distortionless response (MVDR) beamformer. However, the cascade architecture of the system has high computational complexity and is difficult to extend to WASNs. In [24], the sidelobe-cancelation (SC) filter was combined with the linear prediction (LP) filter into a unified framework named integrated sidelobe cancelation and linear prediction (ISCLP), where the two filters are estimated jointly by a Kalman filter. However, the GSC performance is highly dependent on the quality of the estimated relative early transfer functions (RETFs). To prevent the self-cancelation phenomenon caused by inaccurate RETFs, the filter coefficients of the GSC are updated only when all speakers are inactive. Therefore, the filter of the GSC and that of the MCLP cannot update their coefficients simultaneously, especially considering that the MCLP needs to update its filter coefficients when the speakers are active [12, 19, 25]. A joint optimization of the two filters is still unsolved for WASNs.
To solve the above difficulty, we unify the GSC and the MCLP into a beamforming and multichannel linear prediction (BFMCLP) framework, which allows both filters to be updated independently, to deal with reverberant speech in noisy scenarios. Besides, by sharing the common response vector and deriving the distributed RLS method, we extend the BFMCLP to a distributed implementation (DBBFMCLP) that is suitable for WASNs. The DBBFMCLP method requires each node to broadcast far fewer signals than centralized methods do.
The remainder of this paper is organized as follows. In Section 2, the problem formulation is presented. In Section 3, the centralized BFMCLP method is described. The DBBFMCLP is presented in Section 4, and its convergence to the BFMCLP is also included in this section. In Section 5, we evaluate the performance by simulations. Finally, some conclusions are given in Section 6.
2 Problem formulation
In this section, we consider a fully connected WASN with M microphones and J nodes (\(M\ge J\)), and the number of speech sources observed by this WASN is N. \(M_j\) denotes the number of microphones in the jth node, and we have \(\sum _{j=1}^{J}M_j=M\). This paper focuses on the situation in which each node is equipped with more than one microphone. It should be noted that no communication-bandwidth reduction can be obtained in a node with only one microphone, since at least a one-channel signal needs to be transmitted if all nodes, rather than only a subset of them, are used for better performance. Some previous studies on WASNs consisting of several single-microphone nodes pay more attention to problems such as sensor subset selection, source localization, or network topology [8, 26, 27]. These problems are beyond the scope of this paper.
In the short-time Fourier transform (STFT) domain, the reverberant observation of the speech signal from the nth speaker captured by the mth microphone can be modeled as
where t and k denote the time-frame and frequency-bin indices, respectively. \(a_{nm}(k,l)\) denotes the time-invariant acoustic transfer function (ATF) between the nth source and the mth microphone, and \(L_h\) depends on the reverberation time and the length of the STFT window. In this paper, we treat all frequency subbands independently, so the frequency-bin index k is hereafter omitted for brevity. In WASNs, we use the vector notation:
with \((\cdot )^T\) denoting the transpose. By dividing the ATF coefficients, the reverberant speech component from the nth speaker \(\textbf{x}_n(t)\) may be decomposed into the direct and early reflected components \(\textbf{x}_{ne}\) and the late reverberant components \(\textbf{x}_{nl}\), given by:
In practice, the ATFs are difficult to estimate without the knowledge of the acoustic sources. Instead, the RETFs are often chosen to characterize the relative relationship of the desired source signals received by microphones:
where \(\left[ {\cdot }\right] _1\) denotes the first entry of a vector, and \(\textbf{h}_{n}\) denotes the \(M\times 1\) RETF between the nth speaker and the M microphones in the WASN. It is obvious that \(\left[ {\textbf{h}_{n}}\right] _1=1\). Considering all N speakers, and letting the \(M\times 1\) vector \(\textbf{v}(t)\) represent the environmental noise, the stacked \(M\times 1\) vector of the signals received by all microphones is given by:
where \(\textbf{x}_e(t)=\textbf{H}\left[ {\left[ {\textbf{x}_{1e}(t)}\right] _1,\left[ {\textbf{x}_{2e}(t)}\right] _1,...,\left[ {\textbf{x}_{Ne}(t)}\right] _1}\right] ^T\), and \(\textbf{H}=\left[ {\textbf{h}_1,\textbf{h}_2,...,\textbf{h}_N}\right]\) is the \(M\times N\) RETFs matrix for all the N speakers.
In the J-node WASN, the vectors \(\textbf{y}(t)\) and \(\textbf{h}_n\), and the matrix \(\textbf{H}\), can be formed by stacking the corresponding quantities of all nodes:
where \((\bar{\cdot })\) denotes the local data belonging to one node, \(y_{ji}(t)\) denotes the ith microphone signal of the jth node. The vectors \(\bar{\textbf{y}}_{j}(t)\in \mathbb {C}^{M_j\times 1}\) and \({{\bar{\textbf{h}}}_{nj}}\in \mathbb {C}^{M_j\times 1}\) denote the signal captured by the jth node and the RETF from the nth speaker to the jth node, respectively.
3 BFMCLP
In this section, we develop the parallel processing of the BFMCLP for simultaneous speech dereverberation and noise reduction. We introduce the BFMCLP at the beginning, and then investigate its stability.
3.1 Framework
The parallel processing framework of the BFMCLP is shown in Fig. 1. It consists of the GSC and the MCLP, and the microphone signal vector \(\textbf{y}(t)\) is used as the input to both parallel branches. As shown in the block diagram, the GSC consists of three components: a fixed beamformer (FB) \(\textbf{f}\), which steers a beam to a desired speaker and reduces the other competing speakers; a blocking matrix (BM) \(\textbf{B}\), which is orthogonal to the target signal and cancels the desired speaker; and a data-dependent adaptive filter \(\textbf{w}\), which filters the output of \(\textbf{B}\). The difference between the signals from the FB path and the adaptive filter \(\textbf{w}\) path is the original GSC output [28].
The accuracy of the estimated RETFs matrix has a significant impact on the performance of the GSC. If the desired speech can be completely canceled in the BM, the GSC performs well in suppressing noise, interferences, and late reverberant components without distorting the desired speech. An estimation of the RETF for one speaker can be obtained by performing eigenvector decomposition on the corresponding covariance matrix and the eigenvector associated with the maximum eigenvalue is then extracted [25, 29]. Beforehand, the desired covariance matrix needs to be computed by subtracting the noise covariance matrix from the noisy covariance matrix. We assume that the activity patterns of the speakers are nonoverlapping, and an ideal voice activity detector (VAD) is employed in this paper. In this way, the desired covariance matrices can be obtained at the initialization stage of the WASN.
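As an illustration of the covariance-subtraction and eigenvector step described above, the following is a minimal NumPy sketch rather than the authors' implementation. The function name `estimate_retf` and the synthetic covariance matrices are assumptions made only for this example; the normalization by the first entry follows the convention \(\left[ {\textbf{h}_{n}}\right] _1=1\).

```python
import numpy as np

def estimate_retf(noisy_cov, noise_cov):
    """Estimate a relative transfer function from two covariance matrices.

    noisy_cov: (M, M) covariance of frames where one speaker plus noise is active.
    noise_cov: (M, M) covariance of noise-only frames.
    (Hypothetical helper, named for this sketch only.)
    """
    # Covariance subtraction gives an estimate of the desired covariance.
    speech_cov = noisy_cov - noise_cov
    # Enforce Hermitian symmetry before the eigendecomposition.
    speech_cov = 0.5 * (speech_cov + speech_cov.conj().T)
    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector associated with the maximum eigenvalue.
    _, eigvecs = np.linalg.eigh(speech_cov)
    principal = eigvecs[:, -1]
    # Normalize so that the first (reference) entry equals one.
    return principal / principal[0]

# Synthetic check with a known rank-one speech covariance (assumed data).
rng = np.random.default_rng(0)
M = 4
h_true = np.concatenate(
    ([1.0], rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1))
)
noise_cov = 0.1 * np.eye(M)
noisy_cov = 2.0 * np.outer(h_true, h_true.conj()) + noise_cov
h_est = estimate_retf(noisy_cov, noise_cov)
```

In this idealized case the noise covariance is known exactly, so the estimate matches the true vector; with estimated covariances and long reverberation the match would only be approximate.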
However, the accuracy of the estimated RETFs decreases significantly as the reverberation time increases. To prevent the speech cancelation problem caused by inaccurate RETFs, the adaptive filter \(\textbf{w}(t)\) is updated only when the desired speaker is inactive, although such an update strategy may degrade the dereverberation performance. To overcome this problem, the MCLP is introduced to suppress reverberation by deconvolution in the second branch, which consists of a delay module and an estimated room regression vector \(\textbf{g}\). Note that Eq. (1) indicates that the reverberation effect can be modeled as the output of a multichannel autoregressive (MCAR) system. This is the theoretical basis of the adaptive dereverberation method, in which the microphone array signals can be expressed with the MCLP model [30, 31]. In this section, we propose the BFMCLP method, in which the GSC and the MCLP are performed in parallel. In this way, we can achieve much better performance when the speech is degraded by both reverberation and noise. The details of the BFMCLP method are presented below.
The FB \(\textbf{f}\) can be defined with the following constraints set:
\[{\textbf{H}}^H{\textbf{f}} = {\textbf{p}}, \tag{10}\]
where \((\cdot )^H\) denotes the conjugate transpose, and \(\textbf{p}\) is an \(N\times 1\) desired response vector consisting of ones and zeros. The desired output \(d\left( t \right)\) of the BFMCLP is the sum of the direct and early reflected components of the desired speakers, which correspond to the entries equal to 1 in the vector \(\textbf{p}\):
A closed-form solution of Eq. (10) is \({\textbf{f}} = {\textbf{H}}{\left( {{{\textbf{H}}^H}{\textbf{H}}} \right) ^{-1}}{\textbf{p}}\), and the output of the FB is:
\[c(t) = {\textbf{f}}^H{\textbf{y}}(t).\]
Let the BM \(\textbf{B}\in \mathbb {C}^{M\times (M-N)}\) be defined as a basis for the orthogonal complement of the space spanned by the columns of the matrix \(\textbf{H}\). It is designed to cancel the desired speakers, and is given by
\[{\textbf{B}}^H{\textbf{H}} = {\textbf{0}}, \tag{13}\]
and a closed-form solution of Eq. (13) can be written as \({\textbf{B}} = {\left[ {{\textbf{I}} - {\textbf{H}}{{\left( {{{\textbf{H}}^H}{\textbf{H}}} \right) }^{-1}}{\textbf{H}}^H} \right] _{:,1:M-N}}\). The output of the BM can be given by:
\[{\textbf{u}}(t) = {\textbf{B}}^H{\textbf{y}}(t).\]
In the MCLP branch, \(\textbf{q}(t)\) denotes the delayed signal of \(\textbf{y}(t)\):
where \(L_g\) depends mainly on \(L_h\), and \(\tau\) denotes the prediction delay in the MCLP model, which prevents the over-whitening problem [31]. As shown in Fig. 1, the output of the BFMCLP \(\hat{d}(t)\) can be given by:
where \(\hat{d}(t)\) denotes the estimation of the desired speaker, \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update independently over time. In the BFMCLP method, the two branches are designed for joint dereverberation and noise reduction.
The filter coefficients \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update iteratively with the normalized least mean squares (NLMS) [32] and RLS [33], respectively, and the details are summarized in Table 1.
As shown in Table 1, \(\textbf{k}(t)\) is the gain vector, \(\textbf{P}(t)\) is the inverse correlation matrix of the input signal \(\textbf{q}(t)\), \(0<\alpha <1\) and \(0<\rho <1\) denote the forgetting factors, \(\lambda (t)\) denotes the variance of the desired signal, and \(\mu\) is the step size. It should be emphasized that \(\textbf{w}(t)\) is updated only when all the speakers are inactive, while \(\textbf{g}(t)\) is updated continuously. In this way, instead of estimating both filters simultaneously, the BFMCLP can prevent the self-cancelation problem effectively.
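The closed-form expressions for the FB and the BM given in this subsection can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the helper names `fixed_beamformer` and `blocking_matrix`, the random matrix standing in for the RETFs, and the sizes \(M=6\), \(N=2\) are all chosen for the example. It verifies that \({\textbf{H}}^H{\textbf{f}}\) reproduces \(\textbf{p}\) and that \({\textbf{B}}^H{\textbf{H}}\) vanishes.

```python
import numpy as np

def fixed_beamformer(H, p):
    # f = H (H^H H)^{-1} p, the minimum-norm vector satisfying H^H f = p.
    return H @ np.linalg.solve(H.conj().T @ H, p)

def blocking_matrix(H):
    # First M - N columns of the projector onto the orthogonal
    # complement of the column space of H.
    M, N = H.shape
    projector = np.eye(M) - H @ np.linalg.solve(H.conj().T @ H, H.conj().T)
    return projector[:, : M - N]

# Random stand-in for an RETF matrix (assumed data for illustration).
rng = np.random.default_rng(1)
M, N = 6, 2
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
H = H / H[0]                      # first entry of every column equals one
p = np.array([1.0, 0.0])          # keep speaker 1, reject speaker 2

f = fixed_beamformer(H, p)
B = blocking_matrix(H)

constraint_error = np.linalg.norm(H.conj().T @ f - p)  # close to zero
leakage = np.linalg.norm(B.conj().T @ H)               # close to zero
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical convenience only; the result is the same closed form.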
3.2 Stability of the BFMCLP
Note in particular that the desired signals are not distorted in the output of the BFMCLP. Because of the BM \(\textbf{B}\) and the prediction delay \(\tau\), we have \({\textrm{E}}\{ {{\textbf{u}}\left( t \right) {d^*}\left( t \right) } \} = {\textbf{0}}\) and \({\textrm{E}}\{ {{\textbf{q}}\left( t \right) {d^*}\left( t \right) } \} = {\textbf{0}}\), in which \({\textrm{E}}\{\cdot \}\) denotes the expectation, indicating that \(c_B(t)\) and \(c_L(t)\) are both uncorrelated with the desired signal d(t).
In this subsection, we will further prove that the independent update of the two paths will not cause divergence of the system. We assume that the microphone signals are composed of speakers and one interference radiating from a specific direction:
where \({{\textbf{v}}_e}\left( t \right)\) and \({{\textbf{v}}_l}\left( t \right)\) are the early-reflected components and late-reverberant components of the noise, respectively. We assume the RETFs are known. In the following, we analyze the system in two situations: the desired speaker is active or inactive.
3.2.1 Speaker active
When the speaker is active, the filter coefficients of the GSC branch are all fixed. Because the RETFs are estimated in advance and \(\textbf{w}(t)\) is updated only when the noise alone is present, the GSC can suppress the early-reflected components of the noise without distorting the desired speaker. Thus, the output of the GSC branch is
The input of the MCLP branch \(\textbf{q}(t)\) is always correlated with \(\left( {{{\textbf{x}}_l}\left( t \right) + {{\textbf{v}}_l}\left( t \right) } \right)\), and the MCLP aims to suppress the late-reverberant components by making the output \(\hat{d}(t)=c_{\textrm{GSC}}(t)-c_{{L}}(t)\) temporally uncorrelated [20].
3.2.2 Speaker inactive
When the speaker is inactive, the vector \(\textbf{y}(t)={{\textbf{v}}_e}\left( t \right) + {{\textbf{v}}_l}\left( t \right)\) needs to be canceled completely. In other words, the system should minimize \({\textrm{E}}\left\{ {{{\left| {\hat{d}\left( t \right) } \right| }^2}} \right\}\), where \({\hat{d}\left( t \right) }\) denotes the residual in this subsection. The filters \(\textbf{w}(t)\) and \(\textbf{g}(t)\) are updated independently and simultaneously over time.
The filter \(\textbf{w}(t)\), which updates by the NLMS method, should minimize the cost function [28]:
where \(\lambda _{\textbf{w}}\) is the Lagrange multiplier, and Re\(\{\cdot \}\) extracts the real part of a complex variable. Differentiating \(J_{\textbf{w}}\left( t \right)\) with respect to \({\textbf{w}}\left( {t} \right)\) gives:
By setting Eq. (21) to zero, we can obtain the optimal filter coefficients:
We set the constraint
and solve for the \(\lambda _{\textbf{w}}\) by substituting Eq. (22) into Eq. (23), given by
then we obtain:
where \(\hat{d}(t)\) is defined in Eq. (60) in Table 1. Thus, Eq. (63) in Table 1 can be obtained by substituting Eq. (25) into Eq. (22) and introducing a scaling factor denoted by \(\mu\), where P(t) is a recursive average of \(\left\| \textbf{u}(t)\right\| ^2\).
Next, we consider the filter \(\textbf{g}(t)\). In the method of least squares, the optimized \(\textbf{g}(t)\) in BFMCLP should satisfy the principle of orthogonality:
where \({\mathbf {\Phi }} = {\textrm{E}}\{ {{\textbf{q}}\left( t \right) {{\textbf{q}}^H}\left( t \right) } \}\) denotes the correlation matrix of the input \(\textbf{q}(t)\), and \({\textbf{z}} = {\textrm{E}}\{ {{\textbf{q}}\left( t \right) c_{{\textrm{GSC}}}^*\left( t \right) } \}\) denotes the cross-correlation vector of \(\textbf{q}(t)\) and \(c_{{\textrm{GSC}}}\left( t \right) =c(t)-c_B(t)\). In the RLS method, the recursive computations of \({\mathbf {\Phi }}\) and \({\textbf{z}}\) are given by:
where \({\lambda _{\textbf{g}}}\) is the forgetting factor. Then, the matrix inversion lemma can be used to obtain the recursive computation of \({\textbf{g}}\left( t \right)\), which is
Using \({\textbf{P}}\left( t \right) = \lambda _{\textbf{g}}^{-1}{\textbf{P}}\left( {t-1} \right) - \lambda _{\textbf{g}}^{-1}{\textbf{k}}\left( t \right) {{\textbf{q}}^H}\left( t \right) {\textbf{P}}\left( {t-1} \right)\) and \({\textbf{k}}\left( t \right) = {\textbf{P}}\left( t \right) {\textbf{q}}\left( t \right)\) [28], \(\textbf{g}(t)\) in Eq. (65) in Table 1 can be obtained:
In summary, using the output of the BFMCLP \({\hat{d}}\left( t \right)\) as the residual for updating the two branches, the whole system can converge towards the optimal solution.
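To make the independent update of the two branches concrete, the following NumPy sketch runs one frame of a parallel GSC and MCLP update in a single frequency bin. It is a sketch under assumptions, not a transcription of Table 1: it uses a textbook NLMS step for \(\textbf{w}\) and a textbook weighted RLS step for \(\textbf{g}\), and it assumes that \(\lambda (t)\) is tracked by a recursive average of \(|\hat{d}(t)|^2\) with the factor \(\rho\). The function name `bfmclp_step`, the state dictionary, the regularizer `eps`, and all numerical values are hypothetical.

```python
import numpy as np

def bfmclp_step(state, y, q, f, B, speech_active,
                alpha=0.99, rho=0.9, mu=0.1, eps=1e-8):
    """One frame of a parallel GSC + MCLP update (illustrative sketch)."""
    w, g, P = state["w"], state["g"], state["P"]
    p_u, lam = state["p_u"], state["lam"]

    c = np.vdot(f, y)                 # fixed-beamformer output f^H y
    u = B.conj().T @ y                # blocking-matrix output B^H y
    # Shared residual: FB output minus both adaptive branches.
    d_hat = c - np.vdot(w, u) - np.vdot(g, q)

    # MCLP branch: weighted RLS, updated in every frame.
    lam = rho * lam + (1.0 - rho) * abs(d_hat) ** 2   # assumed variance tracker
    Pq = P @ q
    k = Pq / (alpha * lam + np.real(np.vdot(q, Pq)) + eps)
    g = g + k * np.conj(d_hat)
    P = (P - np.outer(k, q.conj()) @ P) / alpha

    # GSC branch: NLMS, updated only while all speakers are inactive.
    p_u = rho * p_u + (1.0 - rho) * np.real(np.vdot(u, u))
    if not speech_active:
        w = w + mu * u * np.conj(d_hat) / (p_u + eps)

    state.update(w=w, g=g, P=P, p_u=p_u, lam=lam)
    return d_hat

# Toy setup with one speaker (N = 1) and assumed dimensions.
rng = np.random.default_rng(2)
M, N, Lq = 4, 1, 4
h = np.concatenate(
    ([1.0], rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1))
)
norm_h = np.vdot(h, h).real
f = h / norm_h                                        # H (H^H H)^{-1} p with p = 1
B = (np.eye(M) - np.outer(h, h.conj()) / norm_h)[:, : M - N]
state = {"w": np.zeros(M - N, complex), "g": np.zeros(Lq, complex),
         "P": np.eye(Lq, dtype=complex), "p_u": 1.0, "lam": 1.0}
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)
q = rng.standard_normal(Lq) + 1j * rng.standard_normal(Lq)
d_first = bfmclp_step(state, y, q, f, B, speech_active=True)
```

With `speech_active=True` only \(\textbf{g}\) changes, which mirrors the update strategy of the text; calling the function again with `speech_active=False` lets both filters adapt from the same residual.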
4 DBBFMCLP
In this section, we extend the BFMCLP for use in WASNs. A simple estimate can be obtained by using only local signals; doing so yields a suboptimal solution while reducing both bandwidth and power consumption.
4.1 Framework
As illustrated in Fig. 2, the input \({{\textbf{y}}_j}\left( t \right) \in {\mathbb {C}^{{\left( {{M_j} + N - {N_j}} \right) \times 1}}}\) of the jth node is the stacked vector of the local signals \({\bar{\textbf{y}}_j}\left( t \right) \in {\mathbb {C}^{{M_j} \times 1}}\) and the transmitted signals \({\dot{\textbf{r}}_j}\left( t \right) \in {\mathbb {C}^{\left( {N - {N_j}} \right) \times 1}}\) from other nodes:
At the same time, the jth node transmits the shared signals \({{\textbf{r}}_j}\left( t \right) \in {\mathbb {C}^{{N_j} \times 1}}\) to other nodes. \({{\textbf{r}}_j}\left( t \right)\) is defined as follows. In a typical application scenario of WASNs, the M microphones and the N speakers are all placed randomly and dispersedly; therefore, the signal-to-noise ratios (SNRs) of the microphones differ for each source. When the positions of all the speakers are fixed and the activity patterns of the speakers are non-overlapping, we can estimate the distances between each speaker and the nodes in the WASN at the system initialization stage by using the ideal VAD and the magnitude of the signal received by the first microphone in each node. We choose the microphone with the highest energy for the nth speaker as the reference of the nth speaker. We assume that the \(j_1\)th, \(j_2\)th ... \(j_{N_j}\)th microphones (\(N_j\) in total) in the jth node are selected as references of speakers; then \({{\textbf{r}}_j}\left( t \right)\) is written as:
where \({{\textbf{T}}_j} \in {\mathbb {N}^{{N_j} \times {M_j}}}\) and \({{\textbf{t}}_{j{j_i}}} \in {\mathbb {N}^{1 \times {M_j}}}\). The \(N \times 1\) vector \({\textbf{r}}\left( t \right)\) denotes the stacked vector of all \(\textbf{r}_j\), and the \({\left( {N - {N_j}} \right) \times 1}\) vector \(\dot{\textbf{r}}_j\) denotes the received signal of the jth node, which can be written as:
Note that \(\sum \nolimits _{j = 1}^J {{N_j}} = N\), \(0\le {N_j} \le {M_j}\), and a microphone being selected as the reference of one speaker cannot be a reference for another. Similar to \({{\textbf{y}}_j}\left( t \right)\), the RETFs belonging to the jth node are:
As illustrated in Fig. 2(b), when the constraints set \(\textbf{p}\) is consistent across the WASN, the parameters of the jth node are given by:
Note that the input of the MCLP branch in the jth node is still \(\bar{\textbf{y}}_j\) rather than \({\textbf{y}}_j\):
In addition, we provide more details of the implementation of the DBBFMCLP in one node as an example in Table 2. Note that all the signals in vector \(\textbf{r}(t)\) can be obtained in each node of the WASN. Without loss of generality, we choose the first item of \(\textbf{r}(t)\) in Eq. (72) in Table 2.
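The reference selection and the selection matrices \({{\textbf{T}}_j}\) of this subsection can be sketched as follows. This is only an illustration: the greedy assignment in speaker order, the function names `select_references` and `selection_matrices`, and the energy values are assumptions made for the example. The text itself only requires that each speaker use its highest-energy microphone and that no microphone serve as the reference for two speakers.

```python
import numpy as np

def select_references(energy):
    """energy[n, m]: energy of speaker n at microphone m (global index).

    Returns one distinct reference microphone per speaker, picking the
    highest-energy microphone that is not already taken (greedy assumption).
    """
    refs = []
    for n in range(energy.shape[0]):
        order = np.argsort(energy[n])[::-1]          # highest energy first
        refs.append(next(int(m) for m in order if int(m) not in refs))
    return refs

def selection_matrices(refs, mics_per_node):
    """Build one 0/1 matrix T_j of size N_j x M_j per node."""
    matrices, offset = [], 0
    for M_j in mics_per_node:
        # Local indices of the reference microphones that live in this node.
        local = np.array(
            [m - offset for m in refs if offset <= m < offset + M_j], dtype=int
        )
        T_j = np.zeros((len(local), M_j), dtype=int)
        T_j[np.arange(len(local)), local] = 1
        matrices.append(T_j)
        offset += M_j
    return matrices

# Two speakers, two nodes with three microphones each (assumed values).
energy = np.array([[0.2, 0.9, 0.1, 0.3, 0.2, 0.1],
                   [0.1, 0.2, 0.3, 0.1, 0.8, 0.4]])
refs = select_references(energy)
T = selection_matrices(refs, mics_per_node=[3, 3])

# Shared signal of node 1 for one frame: r_1 = T_1 y_1.
y_node1 = np.array([1.0 + 1.0j, 2.0 - 1.0j, 0.5j])
r_node1 = T[0] @ y_node1
```

Each node then broadcasts only its \(N_j\) selected channels, so the total number of shared channels equals the number of speakers N.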
4.2 Convergence proof
In this part, we will show the convergence property of the proposed DBBFMCLP to the BFMCLP. As mentioned in Section 3, the filters \(\textbf{w}(t)\) and \(\textbf{g}(t)\) update their coefficients independently in the BFMCLP method. Because the full convergence proof of the DBGSC to the centralized GSC has been provided in [12], only the convergence of the MCLP branch is presented in this paper. We assume \(\hat{d} \left( t \right) = c\left( t \right) - {c_L}\left( t \right)\) without considering the BM of the GSC and the filter \(\textbf{w}(t)\). Some parameters are introduced for clarification; for example, \(\hat{d}_{\textrm{cen}}(t)\) represents the output of the centralized method and \(\hat{d}_{\textrm{dis}}(t)\) denotes that of the distributed one.
In the RLS method, there are two different estimation errors, where one is the a priori estimation error and the other is the a posteriori estimation error [28]. The a priori estimation error in the BFMCLP is introduced when estimating the desired speech signal:
And the a posteriori estimation error is given by [28]:
further, the ratio of the a posteriori estimation error \(\hat{z}_{\textrm{cen}}(t)\) to the a priori estimation error \(\hat{d}_{\textrm{cen}}(t)\) is the conversion factor \(\gamma _{\textrm{cen}}(t)\), given by:
which is determined by the input signal \(\textbf{q}(t)\) and the inverse correlation matrix \(\textbf{P}\). Note that the cost function in RLS is minimized based on the a posteriori estimation error \(\hat{z}_{\textrm{cen}}(t)\), and it does not depend on the a priori estimation error \(\hat{d}_{\textrm{cen}}(t)\) [28]. Obviously, \(\alpha \lambda (t)>0\) and \(\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t)>0\) always hold, because \(\textbf{P}\) is a positive definite matrix. Therefore, \(\gamma _{\textrm{cen}}(t)\) is less than 1 on average, leading to the convergence property of the RLS.
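For readers who want the intermediate step, the conversion factor can be worked out as follows. This is a sketch that assumes the textbook weighted RLS recursions for the gain vector and the filter update (the exact forms in Table 1 are not repeated here) and uses \(\hat{d}_{\textrm{cen}}(t)=c(t)-\textbf{g}^H(t-1)\textbf{q}(t)\) as the a priori error, consistent with the simplification made at the start of this subsection.

```latex
\[
\begin{aligned}
\textbf{k}(t) &= \frac{\textbf{P}(t-1)\textbf{q}(t)}
                     {\alpha\lambda(t)+\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t)},
\qquad
\textbf{g}(t) = \textbf{g}(t-1)+\textbf{k}(t)\,\hat{d}_{\textrm{cen}}^{*}(t),\\
\hat{z}_{\textrm{cen}}(t) &= c(t)-\textbf{g}^H(t)\textbf{q}(t)
  = \hat{d}_{\textrm{cen}}(t)\left[1-\textbf{k}^H(t)\textbf{q}(t)\right],\\
\gamma_{\textrm{cen}}(t) &= \frac{\hat{z}_{\textrm{cen}}(t)}{\hat{d}_{\textrm{cen}}(t)}
  = \frac{\alpha\lambda(t)}
         {\alpha\lambda(t)+\textbf{q}^H(t)\textbf{P}(t-1)\textbf{q}(t)}.
\end{aligned}
\]
```

Under these assumptions, both terms in the denominator are positive, so the ratio lies strictly between 0 and 1, which is the property used in the argument above.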
Because the common desired response vector \(\textbf{p}\), as shown in Eq. (41), is shared in the WASN, it is obvious that:
and then
As shown in Table 2, the local output of each node in the DBBFMCLP method can be written as:
The desired recursive equation for updating the room regression vector \(\bar{\textbf{g}}_j^H(t)\) with \(j=\{1,\cdot \cdot \cdot ,J\}\) is
where \(\bar{\textbf{k}}_j(t)\) is the gain vector of the jth node denoted by Eq. (75) in Table 2. For the sake of analysis, we assume that only the room regression vector of the first node updates. Then, the a posteriori output of distributed method can be denoted as
By substituting Eq. (51) and Eq. (52) into Eq. (53), the a posteriori output can be further written as
It is obvious that the conversion factor can be written as
Considering Eq. (55), by using the final output \(\hat{d}_{\textrm{dis}}\left( t\right)\) for updating the local prediction filter \(\bar{\textbf{g}}_j\left( t\right)\) of all nodes, the relationship between the a posteriori output and the a priori output of the distributed method is similar to that of the centralized method, and the conversion factor is determined by the delayed signal \(\bar{\textbf{q}}_1\left( t\right)\) and the gain vector \(\bar{\textbf{k}}_1\left( t\right)\). In contrast, if the local room regression vector is updated using the local output \(\hat{d}_j\left( t\right)\), it is difficult to analyze the relationship. In addition, when all nodes update simultaneously, the conversion factor of the distributed structure can be represented as:
It is obvious that \(\gamma _{\textrm{dis}}(t)\) is less than 1 on average. Thus, the convergence of the proposed DBBFMCLP can be guaranteed. It will be demonstrated in the following section that \(\left[ \bar{\textbf{g}}_1^T,\bar{\textbf{g}}_2^T,...,\bar{\textbf{g}}_J^T\right] ^T\) converges to the optimal solution of the centralized method after enough iterations.
5 Simulations
In this section, to validate the proposed BFMCLP method and the convergence of the proposed DBBFMCLP method, the two methods are evaluated in the noisy environments with varying degrees of reverberation.
5.1 Simulation setup
The sizes of two simulated rooms are 5 m\(\times\)5 m\(\times\)3 m and 7 m\(\times\)7 m\(\times\)3 m, respectively. The reverberation time of the small room is set to \(T_{60}=450\) ms. For the big room, \(T_{60}=610\) ms, 720 ms, 830 ms, and 940 ms are considered.
Besides, each node in the WASNs has 3 microphones, with adjacent microphones spaced 5 cm apart. The positions of the nodes, speakers, and interferences relative to the room are illustrated in Fig. 3. We select 40 speakers (20 males and 20 females) from the TIMIT database as the clean speech signals. The performance reported below is averaged over several experiments. The signal of each speaker is 30 s long, and the simulated signals are obtained by convolving the clean speech with simulated room impulse responses (RIRs). The RIRs are simulated with an efficient implementation of the image source model [34]. A stationary noise source is also located in each simulated room. To focus on measuring the performance of the proposed methods, we assume that the clocks of the sensors are synchronized. We further test whether the distributed methods can converge to the optimal solution by comparing the results with the centralized methods. Accordingly, the signals and parameters of all nodes are updated simultaneously.
The sampling rate is 16 kHz. The STFT uses a square-root Hanning window, and the frame length is set to 1024 with a frame shift of 512 to balance the performance and the real-time capability of the methods in reverberant and noisy scenarios. The performance is evaluated by four often-used objective measures: the perceptual evaluation of speech quality (PESQ) [35], the short-time objective intelligibility (STOI) [36], the SNR, and the speech-to-reverberation modulation energy ratio (SRMR) [37].
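The STFT settings above translate into concrete durations, as the short NumPy sketch below shows. It assumes a periodic Hann definition and that the same square-root window is used for analysis and synthesis; neither detail is stated in the text, so both are labeled assumptions. Under them, the squared windows of two frames overlapping by 50% sum to one, which is the usual condition for perfect reconstruction.

```python
import numpy as np

fs = 16000           # sampling rate in Hz
frame_len = 1024     # STFT frame length in samples
frame_shift = 512    # frame shift in samples (50% overlap)

frame_ms = 1000.0 * frame_len / fs      # frame duration in milliseconds
shift_ms = 1000.0 * frame_shift / fs    # frame shift in milliseconds
num_bins = frame_len // 2 + 1           # non-negative frequency bins

# Periodic Hann window (assumption) and its square root.
n = np.arange(frame_len)
hann = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)
sqrt_hann = np.sqrt(hann)

# Analysis window times synthesis window, summed over two overlapping frames.
overlap_sum = sqrt_hann[:frame_shift] ** 2 + sqrt_hann[frame_shift:] ** 2
```

With these values each frame spans 64 ms and frames advance by 32 ms, so a prediction delay of \(\tau =1\) frame corresponds to 32 ms.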
5.2 Evaluation of the distributed MCLP
We first test the convergence of the distributed MCLP in the DBBFMCLP in reverberant scenarios, where the first setup (a) with \(T_{60}=450\) ms and the second setup (b) with \(T_{60}=830\) ms are considered. Without the GSC branch, the DBBFMCLP and BFMCLP become the distributed MCLP (DBMCLP) and MCLP, respectively. In this case, we choose the first microphone as the reference of the single speaker, and \(\textbf{h}=[1,0,...,0]^T\). The speech source is located at the position of the desired speaker, and \(L_g=8\) and \(\tau =1\) are set in this evaluation. The PESQ improvements of the outputs of the single node MCLP (SNMCLP), the centralized MCLP (CenMCLP), and the DBMCLP versus time are depicted in Fig. 4. One can see that the performance of the CenMCLP and that of the DBMCLP are close once both have converged, that both outperform the SNMCLP, and that the distributed approach converges faster [38]. This is because the room regression vector \(\textbf{g}\) is separated into lower-dimensional ones in the DBMCLP.
5.3 Evaluation of the BFMCLP and the DBBFMCLP
We investigate the performance of the BFMCLP and the DBBFMCLP in noisy and reverberant scenarios by twenty runs. We compare the proposed two methods with five existing related ones. In sum, we use the following seven methods in total for complete comparison: the MCLP, the GSC, the DBGSC (the distributed structure of the GSC), the LCMV method, the LCDANSE method (the distributed structure of the LCMV), the BFMCLP method, and the DBBFMCLP method. In addition, \(L_g=4\) and \(\tau =1\) are chosen in this evaluation.
The signal-to-interference ratio (SIR), which measures the power ratio between the received desired speaker and the competing speaker, is set to 0 dB. The SNR, which is defined as the power ratio between the speakers and the noise, is set to 13 dB when studying the influence of the reverberation time, and to 5 dB, 10 dB, 15 dB, and 20 dB when evaluating the influence of the noise.
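The two ratios defined above translate directly into gain factors when a test mixture is assembled. The sketch below is a minimal example under assumptions: white-noise stand-ins replace the actual received signals, and the helper names `power` and `scale_to_ratio` are hypothetical. It scales the competing speaker for an SIR of 0 dB and then scales the noise for an SNR of 13 dB relative to the sum of the two speakers.

```python
import numpy as np

def power(x):
    """Mean power of a signal."""
    return float(np.mean(np.abs(x) ** 2))

def scale_to_ratio(target, other, ratio_db):
    """Scale `other` so that power(target) / power(result) equals ratio_db."""
    gain = np.sqrt(power(target) / (power(other) * 10.0 ** (ratio_db / 10.0)))
    return gain * other

rng = np.random.default_rng(1)
desired = rng.standard_normal(16000)    # stand-in for the desired speaker
competing = rng.standard_normal(16000)  # stand-in for the competing speaker
noise = rng.standard_normal(16000)      # stand-in for the noise

competing = scale_to_ratio(desired, competing, 0.0)  # SIR = 0 dB
speakers = desired + competing
noise = scale_to_ratio(speakers, noise, 13.0)        # SNR = 13 dB
mixture = speakers + noise

sir_db = 10.0 * np.log10(power(desired) / power(competing))
snr_db = 10.0 * np.log10(power(speakers) / power(noise))
```

Sweeping the last argument of the second call over 5, 10, 15, and 20 gives the SNR conditions used for the noise study.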
The number of channels required by each method per TF bin is presented in Table 3. One can see that all three distributed methods need fewer channels than their centralized counterparts. The DBGSC and the DBBFMCLP only require that the number of speakers does not exceed the total number of microphones in the WASN, whereas the LCDANSE requires \(N < {M_j}\) [11]; the two methods are therefore more robust to the number of speakers.
We also show the computational complexity of the BFMCLP and the DBBFMCLP in Table 4, where both a scalar complex addition and a scalar complex multiplication are counted as one floating point operation (FLOP) [39]. For simplicity of expression, we set \({Q_j} = \left( {{M_j} + N - {N_j}} \right)\). As a comparison, we also present the computational complexity of the existing GSC method. It can be observed from Table 4 that, because of the smaller filter dimensions, the complexity of the DBBFMCLP is significantly reduced.
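To make the counting convention concrete, the sketch below tallies FLOPs for three generic building blocks of adaptive filtering, with one FLOP per complex addition and one per complex multiplication as stated above. It does not reproduce the expressions of Table 4: the function names and the two filter lengths are hypothetical and serve only to show how the cost scales with the filter dimension.

```python
def inner_product_flops(n):
    """n complex multiplications plus (n - 1) complex additions."""
    return 2 * n - 1

def matvec_flops(rows, cols):
    """One inner product of length `cols` for each output row."""
    return rows * inner_product_flops(cols)

def rank_one_update_flops(n):
    """A + u v^H for an n-by-n A: n*n multiplications and n*n additions."""
    return 2 * n * n

# Hypothetical filter lengths, chosen only to illustrate the scaling.
centralized_len = 16
distributed_len = 6

for name, n in (("centralized", centralized_len),
                ("distributed", distributed_len)):
    print(name,
          inner_product_flops(n),
          matvec_flops(n, n),
          rank_one_update_flops(n))
```

Because the matrix operations grow quadratically with the filter length, even a moderate reduction in dimension lowers the per-frame cost noticeably.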
The improvements of the above-mentioned methods in the four objective measures are presented in Figs. 5 and 6. The performance of the DBBFMCLP and that of the BFMCLP are close in most cases, which further verifies the convergence of the DBBFMCLP to the BFMCLP. Figure 5 shows that the impact of reverberation on speech quality gradually exceeds that of noise as the reverberation time increases, which degrades the performance of the existing related beamformers. In contrast, the MCLP maintains a stable performance. This demonstrates that reverberation limits the performance of the related beamformers. The BFMCLP and the DBBFMCLP, however, have clear advantages in all measures under reverberant and noisy conditions, demonstrating the superiority of the parallel structure proposed in this paper.
Furthermore, we perform ten random experiments to verify the stability of the system, where in each experiment the room size \(S \in \left[ 25, 72\right]\) \(\textrm{m}^2\), SIR \(\in \left[ -2, 2\right]\) \(\textrm{dB}\), SNR \(\in \left[ 10, 20\right]\) \(\textrm{dB}\), and reverberation time \(T_{60} \in \left[ 400, 900\right]\) \(\textrm{ms}\) are chosen randomly. Two speakers, one interference source, and a four-node WASN are placed at random, well-separated positions in the room, and the microphone constellation in each node remains fixed as in Section 5.1. The improvements depicted in Fig. 7 indicate the robustness of the DBBFMCLP and the BFMCLP.
5.4 Evaluation of the influence of VAD errors
An ideal VAD has been used in the previous experiments, and the filters and parameters of the speech enhancement methods are updated when the speakers are inactive. In this part, we further study the influence of VAD errors on the performance of the GSC, the BFMCLP, and their distributed structures for completeness. Here, \(\phi _s\) denotes the percentage of speech-and-noise frames that are erroneously detected as noise-only frames.
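One simple way to emulate such errors, given the ideal per-frame decisions, is to relabel a randomly chosen fraction \(\phi _s\) of the speech-and-noise frames as noise-only. The sketch below shows this under assumptions: the uniform random choice of which frames to flip and the function name `corrupt_vad` are illustrative and are not described in the paper.

```python
import numpy as np

def corrupt_vad(speech_active, phi_s, rng):
    """Mark a fraction phi_s (0 to 1) of speech-active frames as noise-only.

    speech_active is a boolean array with True for speech-and-noise frames.
    Frames that are already noise-only are left untouched.
    """
    speech_active = np.asarray(speech_active, dtype=bool)
    active_idx = np.flatnonzero(speech_active)
    num_flip = int(round(phi_s * active_idx.size))
    flipped = rng.choice(active_idx, size=num_flip, replace=False)
    corrupted = speech_active.copy()
    corrupted[flipped] = False
    return corrupted

rng = np.random.default_rng(2)
ideal = np.zeros(300, dtype=bool)
ideal[50:250] = True                     # 200 speech-and-noise frames
erroneous = corrupt_vad(ideal, 0.10, rng)

# Frames in which the filters would be updated although speech is present.
leaked = np.count_nonzero(ideal & ~erroneous)
```

The frames counted in `leaked` are those in which speech leaks into the statistics that are meant to be noise-only.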
The influence of \(\phi _s\) on the performance of the four methods is studied in two scenarios using the simulated room depicted in Fig. 3b, with \(T_{60}=610\) ms and SNR = 13 dB. In the first scenario, we assume that the accurate \(\textbf{H}\) is still known to all nodes, and the inaccurate noise frames are only used to update the filter \(\textbf{w}\); the PESQ improvements in this scenario are depicted in Fig. 8a. In the second scenario, the inaccurate noise frames are used to estimate both the RETF \(\textbf{H}\) and the filter \(\textbf{w}\), and the results are shown in Fig. 8b. The four methods are clearly more sensitive to the estimation error of the RETFs, and Fig. 8 shows that the two parallel structures outperform the two GSC methods in both scenarios.
6 Conclusion
In this paper, a parallel implementation of the BFMCLP method has been proposed for speech enhancement in reverberant and noisy environments and extended to WASNs. The proposed methods suppress reverberation and noise by exploiting the property that the delayed signal in the MCLP and the blocked signal in the GSC are both uncorrelated with the desired signal. The parallel architecture has two advantages. First, the two filters can be updated independently, which effectively prevents the self-cancelation problem caused by the estimation error of the RETFs and thus improves the stability of the system. Second, the parallel architecture can be easily extended to distributed systems. We provide the details of the two parallel methods and prove the convergence of the DBBFMCLP method. Finally, we test the BFMCLP and the DBBFMCLP in reverberant and noisy scenarios. Simulation results indicate that the two proposed methods outperform the existing methods and that the DBBFMCLP provides a performance comparable to that of the centralized BFMCLP while significantly reducing both the computational and the transmission cost.
Availability of data and materials
The datasets generated and/or analyzed during the current study are not publicly available because they can be regenerated by readers according to the simulation setup in Section 5; they are nevertheless available from the corresponding author on reasonable request.
Abbreviations
WSN: Wireless sensor network
WASN: Wireless acoustic sensor network
ASR: Automatic speech recognition
MEMS: Microelectromechanical system
DBMWF: Distributed multichannel Wiener filter
DANSE: Distributed adaptive node-specific signal estimation
LCMV: Linearly constrained minimum variance
LCDANSE: Linearly constrained distributed adaptive node-specific signal estimation
GSC: Generalized sidelobe canceler
DBGSC: Distributed generalized sidelobe canceler
MCLP: Multichannel linear prediction
MVDR: Minimum variance distortionless response
SC: Sidelobe cancelation
LP: Linear prediction
ISCLP: Integrated sidelobe cancelation and linear prediction
BFMCLP: Beamforming and multichannel linear prediction
DBBFMCLP: Distributed beamforming and multichannel linear prediction
STFT: Short-time Fourier transform
RETF: Relative early transfer function
ATF: Acoustic transfer function
FB: Fixed beamformer
BM: Blocking matrix
VAD: Voice activity detector
MCAR: Multichannel autoregressive
RLS: Recursive least squares
NLMS: Normalized least mean squares
RIR: Room impulse response
PESQ: Perceptual evaluation of speech quality
STOI: Short-time objective intelligibility
SRMR: Speech-to-reverberation modulation energy ratio
SNR: Signal-to-noise ratio
SIR: Signal-to-interference ratio
FLOP: Floating point operation
References
D. Estrin, L. Girod, G. Pottie, M. Srivastava, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Instrumenting the world with wireless sensor networks, vol. 4 (2001), pp. 2033–2036. https://doi.org/10.1109/ICASSP.2001.940390
O.M. Bouzid, G.Y. Tian, J. Neasham, B. Sharif, Investigation of sampling frequency requirements for acoustic source localisation using wireless sensor networks. Appl. Acoust. 74(2), 269–274 (2013). https://doi.org/10.1016/j.apacoust.2010.12.013
R. Ali, T. van Waterschoot, M. Moonen, An integrated MVDR beamformer for speech enhancement using a local microphone array and external microphones. EURASIP J. Audio Speech Music Process. 10 (2021). https://doi.org/10.1186/s13636020001922
X. Guo, M. Yuan, Y. Ke, C. Zheng, X. Li, Distributed node-specific block-diagonal LCMV beamforming in wireless acoustic sensor networks. Signal Process. 185, 108085 (2021). https://doi.org/10.1016/j.sigpro.2021.108085. www.sciencedirect.com/science/article/pii/S0165168421001237
A. Bertrand, in 2011 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT). Applications and trends in wireless acoustic sensor networks: a signal processing perspective (2011), pp. 1–6. https://doi.org/10.1109/SCVT.2011.6101302
A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in fully connected sensor networks – part I: Sequential node updating. IEEE Trans. Signal Process. 58(10), 5277–5291 (2010). https://doi.org/10.1109/TSP.2010.2052612
A. Bertrand, M. Moonen, Distributed adaptive node-specific signal estimation in fully connected sensor networks – part II: Simultaneous and asynchronous node updating. IEEE Trans. Signal Process. 58(10), 5292–5306 (2010). https://doi.org/10.1109/TSP.2010.2052613
J. Zhang, R. Heusdens, R.C. Hendriks, Rate-distributed spatial filtering based noise reduction in wireless acoustic sensor networks. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 2015–2026 (2018). https://doi.org/10.1109/TASLP.2018.2851157
S. Markovich-Golan, A. Bertrand, M. Moonen, S. Gannot, Optimal distributed minimum-variance beamforming approaches for speech enhancement in wireless acoustic sensor networks. Signal Process. 107, 4–20 (2015). https://doi.org/10.1016/j.sigpro.2014.07.014
S. Doclo, M. Moonen, T. Van den Bogaert, J. Wouters, Reduced-bandwidth and distributed MWF-based noise reduction algorithms for binaural hearing aids. IEEE Trans. Audio Speech Lang. Process. 17(1), 38–51 (2009). https://doi.org/10.1109/TASL.2008.2004291
A. Bertrand, M. Moonen, Distributed node-specific LCMV beamforming in wireless sensor networks. IEEE Trans. Signal Process. 60(1), 233–246 (2012). https://doi.org/10.1109/TSP.2011.2169409
S. Markovich-Golan, S. Gannot, I. Cohen, Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks. IEEE Trans. Audio Speech Lang. Process. 21(2), 343–356 (2013). https://doi.org/10.1109/TASL.2012.2224454
P.A. Naylor, N.D. Gaubitch, Speech Dereverberation (Springer, London, 2010). https://doi.org/10.1007/9781849960564
Z. Honghu, Y. Jia, P. Jianxin, Chinese speech intelligibility of elderly people in environments combining reverberation and noise. Appl. Acoust. 150, 1–4 (2019). https://doi.org/10.1016/j.apacoust.2019.02.002
K. Lebart, J.M. Boucher, P. Denbigh, A new method based on spectral subtraction for speech dereverberation. Acta Acustica U. Acustica. 87, 359–366 (2001)
A. Schwarz, K. Reindl, W. Kellermann, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). A two-channel reverberation suppression scheme based on blind signal separation and Wiener filtering (2012), pp. 113–116. https://doi.org/10.1109/ICASSP.2012.6287830
E.A.P. Habets, J. Benesty, A two-stage beamforming approach for noise reduction and dereverberation. IEEE Trans. Audio Speech Lang. Process. 21(5), 945–958 (2013). https://doi.org/10.1109/TASL.2013.2239292
M. Miyoshi, Y. Kaneda, Inverse filtering of room acoustics. IEEE Trans. Acoust. Speech Signal Process. 36(2), 145–152 (1988). https://doi.org/10.1109/29.1509
T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B. Juang, in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. Blind speech dereverberation with multichannel linear prediction based on short time Fourier transform representation (2008), pp. 85–88. https://doi.org/10.1109/ICASSP.2008.4517552
T. Yoshioka, T. Nakatani, Generalization of multichannel linear prediction methods for blind MIMO impulse response shortening. IEEE Trans. Audio Speech Lang. Process. 20(10), 2707–2720 (2012). https://doi.org/10.1109/TASL.2012.2210879
S. Gergen, A. Nagathil, R. Martin, in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. Audio signal classification in reverberant environments based on fuzzy-clustered ad-hoc microphone arrays (2013), pp. 3692–3696. https://doi.org/10.1109/ICASSP.2013.6638347
S. Pasha, C. Ritz, in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). Clustered multichannel dereverberation for ad-hoc microphone arrays (2015), pp. 274–278. https://doi.org/10.1109/APSIPA.2015.7415519
M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, T. Nakatani, Strategies for distant speech recognition in reverberant environments (2015). https://doi.org/10.1186/s1363401502457
T. Dietzen, S. Doclo, M. Moonen, T. van Waterschoot, Integrated sidelobe cancellation and linear prediction Kalman filter for joint multi-microphone speech dereverberation, interfering speech cancellation, and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 28, 740–754 (2020). https://doi.org/10.1109/TASLP.2020.2966869
T. Dietzen, A. Spriet, W. Tirry, S. Doclo, M. Moonen, T. van Waterschoot, Comparative analysis of generalized sidelobe cancellation and multichannel linear prediction for speech dereverberation and noise reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 27(3), 544–558 (2019). https://doi.org/10.1109/TASLP.2018.2886743
Y. Chan, K. Ho, A simple and efficient estimator for hyperbolic location. IEEE Trans. Signal Process. 42(8), 1905–1915 (1994). https://doi.org/10.1109/78.301830
Y. Zeng, R.C. Hendriks, Distributed delay and sum beamformer for speech enhancement via randomized gossip. IEEE/ACM Trans. Audio Speech Lang. Process. 22(1), 260–273 (2014). https://doi.org/10.1109/TASLP.2013.2290861
S. Haykin, Adaptive Filter Theory (Prentice Hall, 2002)
I. Kodrasi, S. Doclo, in 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). EVD-based multi-channel dereverberation of a moving speaker using different RETF estimation methods (2017), pp. 116–120. https://doi.org/10.1109/HSCMA.2017.7895573
K. Abed-Meraim, E. Moulines, P. Loubaton, Prediction error method for second-order blind identification. IEEE Trans. Signal Process. 45(3), 694–705 (1997). https://doi.org/10.1109/78.558487
T. Yoshioka, T. Nakatani, K. Kinoshita, M. Miyoshi, Speech Dereverberation and Denoising Based on Time Varying Speech Model and Autoregressive Reverberation Model (Springer, Berlin Heidelberg, 2010), pp. 151–182
S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49(8), 1614–1626 (2001). https://doi.org/10.1109/78.934132
T. Yoshioka, H. Tachibana, T. Nakatani, M. Miyoshi, in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. Adaptive dereverberation of speech signals with speaker-position change detection (2009), pp. 3733–3736. https://doi.org/10.1109/ICASSP.2009.4960438
J. Allen, D. Berkley, Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65, 943–950 (1979). https://doi.org/10.1121/1.382599
A.W. Rix, J.G. Beerends, M.P. Hollier, A.P. Hekstra, in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, vol. 2 (2001), pp. 749–752. https://doi.org/10.1109/ICASSP.2001.941023
C.H. Taal, R.C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011). https://doi.org/10.1109/TASL.2011.2114881
J.F. Santos, M. Senoussaoui, T.H. Falk, in 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC). An improved non-intrusive intelligibility metric for noisy and reverberant speech (2014), pp. 55–59. https://doi.org/10.1109/IWAENC.2014.6953337
C. Zheng, A. Deleforge, X. Li, W. Kellermann, Statistical analysis of the multichannel Wiener filter using a bivariate normal distribution for sample covariance matrices. IEEE/ACM Trans. Audio Speech Lang. Process. 26(5), 951–966 (2018). https://doi.org/10.1109/TASLP.2018.2800283
H. Raphael, Floating point operations in matrix-vector calculus (Technische Universität München, Tech. rep., 2007)
Acknowledgements
Not applicable.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62101550.
Author information
Contributions
Zhe Han: software and writing (original draft). Yuxuan Ke: platform and writing (review and editing). Chengshi Zheng and Xiaodong Li: supervision and writing (review and editing). All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Han, Z., Ke, Y., Li, X. et al. Parallel processing of distributed beamforming and multichannel linear prediction for speech denoising and deverberation in wireless acoustic sensor networks. J AUDIO SPEECH MUSIC PROC. 2023, 25 (2023). https://doi.org/10.1186/s13636023002876
DOI: https://doi.org/10.1186/s13636023002876