An integrated MVDR beamformer for speech enhancement using a local microphone array and external microphones
EURASIP Journal on Audio, Speech, and Music Processing volume 2021, Article number: 10 (2021)
Abstract
An integrated version of the minimum variance distortionless response (MVDR) beamformer for speech enhancement using a microphone array has been recently developed, which merges the benefits of imposing constraints defined from both a relative transfer function (RTF) vector based on a priori knowledge and an RTF vector based on a data-dependent estimate. In this paper, the integrated MVDR beamformer is extended for use with a microphone configuration where a microphone array, local to a speech processing device, has access to the signals from multiple external microphones (XMs) randomly located in the acoustic environment. The integrated MVDR beamformer is reformulated as a quadratically constrained quadratic program (QCQP) with two constraints, one of which is related to the maximum tolerable speech distortion for the imposition of the a priori RTF vector and the other related to the maximum tolerable speech distortion for the imposition of the data-dependent RTF vector. An analysis of how these maximum tolerable speech distortions affect the behaviour of the QCQP is presented, followed by the discussion of a general tuning framework. The integrated MVDR beamformer is then evaluated with audio recordings from behind-the-ear hearing aid microphones and three XMs for a single desired speech source in a noisy environment. In comparison to relying solely on an a priori RTF vector or a data-dependent RTF vector, the results demonstrate that the integrated MVDR beamformer can be tuned to yield different enhanced speech signals, which may be more suitable for improving speech intelligibility despite changes in the desired speech source position and imperfectly estimated spatial correlation matrices.
1 Introduction
Speech processing devices such as a hearing aid, a cochlear implant, or a mobile telephone are commonly equipped with an array of microphones to capture the acoustic environment. The received microphone signals are often a mixture of a desired speech signal plus some undesired noise (any combination of interfering speakers, background noises, and reverberation). As the quality and intelligibility of the desired speech signal is susceptible to considerable degradation in the presence of such noise, the task of suppressing this noise and extracting the desired speech signal, known as speech enhancement, is of critical importance and has been the subject of extensive research [1–3].
While successful speech enhancement strategies have been developed with microphone arrays, in some applications, due to physical space constraints, the spatial variation between the observed microphone signals may not be sufficient to yield an acceptable degree of speech enhancement. Consequently, the potential of using more ad hoc microphone configurations consisting of randomly placed microphones to increase the spatial sampling of the acoustic environment has developed interest [4–12]. In this paper, a specific ad hoc microphone configuration is considered, where a microphone array located on some speech processing device, hereafter referred to as a local microphone array (LMA), is linked with multiple remote or external microphones (XMs) in a centralised processing framework, i.e. all microphone signals are sent to a fusion centre for processing. The terminology of a local microphone array is introduced since the microphone array is considered to be confined or fixed within some area of the acoustic environment relative to the XMs which are subject to movement.
When there is a single desired speech source, speech enhancement can be accomplished by using the minimum variance distortionless response (MVDR) beamformer [13, 14]. One of the important quantities required for computing the MVDR beamformer is a vector of acoustic transfer functions from the desired speech source to all of the microphones. More commonly, however, a vector of relative transfer functions (RTFs) is used instead, which is a normalised version of the acoustic transfer function vector with respect to some reference microphone [15]. In practice, for an LMA, this RTF vector may be measured a priori or based on assumptions regarding microphone characteristics, position, speaker location, and room acoustics (e.g. no reverberation). For instance, in assistive hearing devices, it is sometimes assumed that the desired speech source location is known and this knowledge can be subsequently used to define an a priori RTF vector [16–19]. Alternatively, it may be estimated in an online fashion from the observed microphone data [20, 21] so that it is a fully data-dependent estimate.
The situation under consideration throughout this paper is one in which there is an available a priori RTF vector pertaining only to the LMA that may or may not be sufficiently accurate with respect to the true RTF vector. In cases where the a priori RTF vector is not sufficiently accurate, incorporating a data-dependent RTF vector can be viewed as an opportunity for improved performance, provided that the data-dependent RTF vector is a better estimate of the true RTF vector. On the other hand, when acoustic conditions are adverse enough to significantly affect the accuracy of the data-dependent RTF vector, relying on the a priori RTF vector can be viewed as a fallback or contingency strategy.
It would therefore be seemingly advantageous to use both an a priori and a data-dependent RTF vector in practice. Such an approach has recently been investigated for an LMA only and resulted in an integrated version of the MVDR beamformer [22]. As opposed to imposing either the a priori RTF vector or the data-dependent RTF vector as a hard constraint, they were both softened into an unconstrained optimisation problem. It was demonstrated that the resulting integrated MVDR beamformer is a convex combination of an MVDR beamformer that uses the a priori RTF vector, an MVDR beamformer that uses the data-dependent RTF vector, a linearly constrained minimum variance (LCMV) beamformer that uses both the a priori and data-dependent RTF vector, and an all-zero vector, each with real-valued weightings, revealing the versatile nature of such an integrated beamformer.
This paper therefore re-examines the integrated MVDR beamformer for the ad hoc microphone configuration consisting of an LMA located on some speech processing device linked with multiple XMs. Specifically, the integrated MVDR beamformer is reformulated from an alternative perspective, namely that of a quadratically constrained quadratic program (QCQP). This QCQP will consist of two constraints, one of which is related to the maximum tolerable speech distortion for the imposition of the a priori RTF vector and the other related to the maximum tolerable speech distortion for the imposition of the data-dependent RTF vector. With respect to the procedures for obtaining the RTF vectors, it is straightforward to obtain a data-dependent RTF vector; however, the notion of an a priori RTF vector when XMs are used with an LMA is a bit more ambiguous. In particular, since only partial a priori knowledge is usually available for the part of the RTF vector pertaining to the LMA, the other part pertaining to the XMs will have to be a data-dependent estimate and hence a procedure based on partial a priori knowledge [9] would be necessary. As a result, an integrated MVDR beamformer for a microphone configuration with an LMA and XMs will merge an a priori RTF vector that is based on partial a priori knowledge and a fully data-dependent one.
With the a priori and the data-dependent RTF vector for the LMA and XMs estimated, it will become evident that the optimal filter from the integrated MVDR beamformer, formulated as a QCQP, is identical to that which was derived from [22], where the Lagrangian multipliers associated with the QCQP are equivalent to the tuning parameters that have been considered in [22]. The additional insight of the QCQP formulation is that these tuning parameters or Lagrangian multipliers can be related to a maximum tolerable speech distortion for the imposition of the a priori or the data-dependent RTF vector. An analysis of this relationship is provided, which facilitates the tuning of the integrated MVDR beamformer from the more intuitive perspective of the maximum tolerable speech distortions as opposed to the combination of filters as in [22]. A general tuning framework will then be discussed along with the suggestion of some particular tuning strategies.
The integrated MVDR beamformer is then evaluated with audio recordings from behind-the-ear hearing aid microphones (the LMA) and three XMs for a single desired speech source in a recreated cocktail party scenario. The results demonstrate that the integrated MVDR beamformer can be tuned to yield different enhanced speech signals, which can find a compromise between relying solely on an a priori RTF vector or a data-dependent RTF vector, and hence may be more suitable for improving speech intelligibility despite changes in the desired speech source position and imperfectly estimated spatial correlation matrices.
The paper is organised as follows. In Section 2, the data model is defined. In Section 3, the MVDR beamformer as applied to an LMA with XMs is discussed along with the procedures for obtaining the a priori RTF vector based on partial a priori knowledge and the data-dependent RTF vector. Section 4 reformulates the integrated MVDR beamformer as a QCQP and provides an analysis on the effect of the maximum tolerable speech distortions due to the imposition of the a priori RTF vector and the data-dependent RTF vector. In Section 5, a general tuning framework is presented, as well as some suggested tuning strategies. In Section 6, the integrated MVDR approach is analysed and evaluated with both simulated data and experimental data involving the use of behind-the-ear hearing aid microphones and three XMs. Conclusions are then drawn in Section 7.
2 Data model
2.1 Unprocessed signals
A microphone configuration consisting of an LMA of M_{a} microphones plus M_{e} XMs is considered with one desired speech source in a noisy, reverberant^{Footnote 1} environment. In the short-time Fourier transform (STFT) domain, the observed vector of microphone signals at frequency bin k and time frame l is represented as:
$$\mathbf{y}(k,l) = \mathbf{x}(k,l) + \mathbf{n}(k,l) = \mathrm{s}_{\mathrm{a},1}(k,l)\,\mathbf{h}(k,l) + \mathbf{n}(k,l) \qquad (1)$$
where (dropping the dependency on k and l for brevity) y=[y_{a}^{T} y_{e}^{T}]^{T}, \({\mathbf {y}_{\mathbf {a}}} = \mathrm {[y_{a,1}\hspace {0.1cm}y_{a,2}\hspace {0.1cm} \dots y_{a,{M_{\mathrm {a}}}}]}^{T}\) is a vector of the LMA signals, \({\mathbf {y}_{\mathbf {e}}} = \mathrm {[y_{e,1}\hspace {0.1cm}y_{e,2}\hspace {0.1cm} \dots y_{e,{M_{\mathrm {e}}}}]}^{T}\) is a vector of the XM signals, x is the speech contribution, represented by s_{a,1}, the desired speech signal in the first (reference) microphone of the LMA, filtered with \(\mathbf {h} = [\mathbf {h}^{T}_{\mathbf {a}} \hspace {0.1cm} {\mathbf {h}_{\mathbf {e}}}^{{T}}]^{T}\), h_{a} is the RTF vector for the LMA (where the first component of h_{a} is equal to 1 since the first microphone is used as the reference), and h_{e} is the RTF vector for the XM signals. Finally, n=[n_{a}^{T} n_{e}^{T}]^{T} represents the noise contribution. Variables with the subscript “ a” refer to the LMA and variables with the subscript “ e” refer to the XMs.
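The (M_{a}+M_{e})×(M_{a}+M_{e}) spatial correlation matrices for the speech-plus-noise, noise-only, and speech-only signals are defined respectively as:
$$\mathbf{R}_{\mathbf{yy}} = \mathbb{E}\left\{\mathbf{y}\mathbf{y}^{H}\right\} \qquad (2)$$
$$\mathbf{R}_{\mathbf{nn}} = \mathbb{E}\left\{\mathbf{n}\mathbf{n}^{H}\right\} \qquad (3)$$
$$\mathbf{R}_{\mathbf{xx}} = \mathbb{E}\left\{\mathbf{x}\mathbf{x}^{H}\right\} \qquad (4)$$
where \(\mathbb {E}\{.\}\) is the expectation operator and {.}^{H} is the Hermitian transpose. With the assumption of a single desired speech source from (1), R_{xx} can be represented as a rank-1 correlation matrix as follows:
$$\mathbf{R}_{\mathbf{xx}} = {\sigma}^{2}_{\mathrm{s}_{\mathrm{a},1}}\,\mathbf{h}\mathbf{h}^{H} \qquad (5)$$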
where \({\sigma }^{2}_{\mathrm {s}_{\mathrm {a},1}} = \mathbb {E}\left \{|\mathrm {s}_{\mathrm {a},1}|^{2}\right \}\) is the desired speech power spectral density in the first microphone of the LMA. It is further assumed that the desired speech signal is uncorrelated with the noise signal, and hence R_{yy}=R_{xx}+R_{nn}. The speech-plus-noise, noise-only, and speech-only correlation matrices can also be defined solely for the LMA signals respectively as \(\mathbf {R}_{\mathbf {y}_{\mathbf {a}}\mathbf {y}_{\mathbf {a}}} = \mathbb {E}\left \{{\mathbf {y}_{\mathbf {a}}} {\mathbf {y}_{\mathbf {a}}}^{{H}}\right \}\), \(\mathbf {R}_{\mathbf {n}_{\mathbf {a}}\mathbf {n}_{\mathbf {a}}} = \mathbb {E}\left \{{\mathbf {n}_{\mathbf {a}}} {\mathbf {n}_{\mathbf {a}}}^{{H}}\right \}\), and \(\mathbf {R}_{\mathbf {x}_{\mathbf {a}}\mathbf {x}_{\mathbf {a}}} = \mathbb {E}\left \{\mathbf {x}_{\mathbf {a}} \mathbf {x}_{\mathbf {a}}^{{H}}\right \}\), with \(\mathbf {R}_{\mathbf {x}_{\mathbf {a}}\mathbf {x}_{\mathbf {a}}}\) also having the same rank-1 structure as in (5). It is assumed that all signal correlations can be estimated as if all signals were available in a centralised processor, i.e. a perfect communication link is assumed between the LMA and the XMs with no bandwidth constraints as well as synchronous sampling rates.
The estimate of the desired speech signal in the first microphone of the LMA, z_{1}, is then obtained through a linear filtering of the microphone signals, such that:
$$\mathrm{z}_{1} = \mathbf{w}^{H}\mathbf{y} \qquad (6)$$
where \(\mathbf {w} = \left [\mathbf {w}_{\mathbf {a}}^{{T}} \hspace {0.1cm} \mathbf {w}_{\mathbf {e}}^{{T}}\right ]^{T}\) is a complex-valued filter.
2.2 Pre-whitened-transformed domain
As a pre-processing stage, the unprocessed microphone signals can first be transformed with the available a priori RTF vector for the LMA signals and then spatially pre-whitened using the resulting transformed noise-only correlation matrix, yielding a vector of pre-whitened-transformed (PWT) microphone signals. As discussed in [9] and subsequently reviewed in Section 3.1, these pre-processing steps essentially compress the M_{a} LMA signals into one signal. This signal is then used with the pre-processed M_{e} XM signals to obtain an estimate for the missing part of the RTF vector pertaining to the XMs when there is an available a priori RTF vector for the LMA. Therefore, PWT microphone signals will be adopted for convenience throughout this paper.
To define the transformation operation, an M_{a}×(M_{a}−1) blocking matrix \(\widetilde {\mathbf {C}}_{\mathbf {a}}\), and an M_{a}×1 fixed beamformer, \(\widetilde {\mathbf {f}}_{\mathbf {a}}\), are firstly defined such that:
where \(\widetilde {\mathbf {h}}_{\mathbf {a}}\) is an available a priori RTF vector (which is some predetermined estimate or approximation of h_{a}), and the notation \((\hspace {0.05cm}\widetilde {.} \hspace {0.05cm})\) refers to quantities based on available a priori knowledge. Using \(\widetilde {\mathbf {C}}_{\mathbf {a}}\) and \(\widetilde {\mathbf {f}}_{\mathbf {a}}\), an (M_{a}+M_{e}) × (M_{a}+M_{e}) transformation matrix, \(\widetilde {\boldsymbol {\Upsilon }}\), can be defined as:
where \(\widetilde {\boldsymbol {\Upsilon }}_{\mathbf {a}} =\ [\widetilde {\mathbf {C}}_{\mathbf {a}} ~\widetilde {\mathbf {f}}_{\mathbf {a}}]\) and in general I_{𝜗} denotes the 𝜗×𝜗 identity matrix. Consequently, the transformed speech-plus-noise signals and the transformed noise-only signals are defined respectively as:
This transformation domain is simply the LMA signals that pass through a blocking matrix and a fixed beamformer as is done in the first stage of a typical generalised sidelobe canceller (i.e. the adaptive implementation of an MVDR beamformer) [23], along with the unprocessed XM signals.
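A spatial pre-whitening operation can now be defined from the transformed noise-only correlation matrix by using the Cholesky decomposition:
$$\widetilde{\boldsymbol{\Upsilon}}^{H}\mathbf{R}_{\mathbf{nn}}\widetilde{\boldsymbol{\Upsilon}} = \mathbf{L}\mathbf{L}^{H} \qquad (10)$$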
where L is an (M_{a}+M_{e})×(M_{a}+M_{e}) lower triangular matrix.
A transformed signal vector can then be pre-whitened by pre-multiplying it with L^{−1} and will be denoted with an underbar \(\underline {(.)}\). Hence, the signal model for the unprocessed microphone signals from (1) can be expressed in the PWT domain as^{Footnote 2}:
where \(\underline {\mathbf {y}}\) consists of the PWT LMA and XM signals, i.e. \(\underline {\mathbf {y}} = \left [\underline {\mathbf {y}}^{{{T}}}_{\mathbf {{a}}} \hspace {0.1cm} \underline {\mathbf {y}}^{{{T}}}_{\mathbf {{e}}} \right ]^{{{T}}}\), \(\underline {\mathbf {n}} = \mathbf {L}^{-1}\widetilde {\boldsymbol {\Upsilon }}^{{{H}}} \mathbf {n}\), the PWT RTF vector \(\underline {\mathbf {h}} = \mathbf {L}^{-1}\widetilde {\boldsymbol {\Upsilon }}^{{{H}}} \mathbf {h}\), and the respective correlation matrices are:
where the expression for \(\underline {\mathbf {R}}_{\mathbf {nn}}\) is a direct consequence of (10). With the assumption of the desired speech source and noise being uncorrelated, it also holds that \(\underline {\mathbf {R}}_{\mathbf {yy}} = \underline {\mathbf {R}}_{\mathbf {xx}} + \underline {\mathbf {R}}_{\mathbf {nn}}\). In the PWT domain, the estimate of the desired speech signal in the first microphone of the LMA, z_{1}, which is equivalent to (6), is then obtained through a linear filtering of the PWT microphone signals, such that:
where \(\underset {\smile }{\mathbf {w}} = \mathbf {L}^{H} \widetilde {\boldsymbol {\Upsilon }}^{-1} \mathbf {w}\) is a complex-valued filter^{Footnote 3}.
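The two pre-processing steps can be illustrated with a small NumPy sketch. It builds a transformation matrix from an a priori LMA RTF vector, computes the Cholesky-based pre-whitener, and maps vectors to the PWT domain. The specific blocking matrix (an orthonormal basis orthogonal to the a priori RTF vector, taken from an SVD), the fixed beamformer (the a priori RTF vector divided by its squared norm), the random test data, and all function and variable names are assumptions made for this example only; they are not definitions taken from the text.

```python
import numpy as np

def build_transform(h_a_prior, M_e):
    """Assumed transformation matrix: block-diagonal of [C f] and I_{M_e}."""
    M_a = h_a_prior.shape[0]
    # Assumed blocking matrix: orthonormal columns orthogonal to h_a_prior.
    U, _, _ = np.linalg.svd(h_a_prior.reshape(-1, 1), full_matrices=True)
    C = U[:, 1:]                                        # C^H h_a_prior = 0
    # Assumed fixed beamformer, scaled so that f^H h_a_prior = 1.
    f = h_a_prior / np.vdot(h_a_prior, h_a_prior).real
    T = np.eye(M_a + M_e, dtype=complex)
    T[:M_a, :M_a - 1] = C
    T[:M_a, M_a - 1] = f
    return T

def prewhitener(R_nn, T):
    """Lower-triangular Cholesky factor L of the transformed noise correlation."""
    R_t = T.conj().T @ R_nn @ T
    return np.linalg.cholesky(0.5 * (R_t + R_t.conj().T))

def to_pwt(v, L, T):
    """Map a signal or RTF vector to the PWT domain: L^{-1} T^H v."""
    return np.linalg.solve(L, T.conj().T @ v)

# Toy data (hypothetical): 3 LMA microphones, 2 XMs.
rng = np.random.default_rng(0)
M_a, M_e = 3, 2
M = M_a + M_e
h_a_prior = np.concatenate(([1.0], rng.standard_normal(M_a - 1)
                            + 1j * rng.standard_normal(M_a - 1)))
h_e = rng.standard_normal(M_e) + 1j * rng.standard_normal(M_e)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_nn = A @ A.conj().T + np.eye(M)

T = build_transform(h_a_prior, M_e)
L = prewhitener(R_nn, T)
L_inv = np.linalg.inv(L)
R_nn_pwt = L_inv @ T.conj().T @ R_nn @ T @ L_inv.conj().T
h_pwt = to_pwt(np.concatenate((h_a_prior, h_e)), L, T)
```

Under these assumptions, `R_nn_pwt` is numerically the identity matrix, and the first M_{a}−1 entries of `h_pwt` vanish whenever the LMA part of the RTF vector equals the a priori one, which reflects the compression of the LMA signals into a single signal.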
3 MVDR with an LMA and XMs
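The MVDR beamformer minimises the noise power spectral density after filtering (minimum variance), with a constraint that the desired speech signal should not be subject to any distortion (distortionless response), which is specified by an appropriate RTF vector for the MVDR beamformer. For the unprocessed microphone signals, the MVDR beamformer problem can be formulated as:
$$\underset{\mathbf{w}}{\min}\ \mathbf{w}^{H}\mathbf{R}_{\mathbf{nn}}\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^{H}\mathbf{h} = 1 \qquad (17)$$
The solution to (17) yields the optimal filter:
$$\mathbf{w}_{\text{mvdr}} = \frac{\mathbf{R}_{\mathbf{nn}}^{-1}\mathbf{h}}{\mathbf{h}^{H}\mathbf{R}_{\mathbf{nn}}^{-1}\mathbf{h}} \qquad (18)$$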
with the desired speech signal estimate, \(\mathrm {z}_{1} = \mathbf {w}_{\text {mvdr}}^{{{H}}} \mathbf {y}\). In practice, both R_{nn} and h are unknown and hence must be estimated.
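As a minimal sketch, the well-known closed-form MVDR solution, w = R_{nn}^{−1}h / (h^{H}R_{nn}^{−1}h), can be computed as follows. The function name and the random toy data are assumptions for illustration; in practice R_{nn} and h would be replaced by their estimates.

```python
import numpy as np

def mvdr_filter(R_nn, h):
    """MVDR filter w = R_nn^{-1} h / (h^H R_nn^{-1} h)."""
    Rinv_h = np.linalg.solve(R_nn, h)        # avoids forming an explicit inverse
    return Rinv_h / np.vdot(h, Rinv_h)       # np.vdot conjugates its first argument

# Toy example (hypothetical data): 5 microphones, reference is microphone 1.
rng = np.random.default_rng(1)
M = 5
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_nn = A @ A.conj().T + np.eye(M)            # Hermitian positive definite
h = np.concatenate(([1.0], rng.standard_normal(M - 1)
                    + 1j * rng.standard_normal(M - 1)))

w = mvdr_filter(R_nn, h)
response = np.vdot(w, h)                     # w^H h, the response to the speech source
noise_out = np.vdot(w, R_nn @ w).real        # output noise power w^H R_nn w
```

Here `response` equals one up to rounding (the distortionless constraint), and `noise_out` cannot exceed the noise power in the reference microphone, since simply selecting that microphone is also a distortionless filter.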
A data-dependent estimate can typically be obtained for R_{nn}, for instance by recursive averaging, with a voice activity detector [24] or a speech presence probability (SPP) estimator [25]. This data-dependent estimate will be denoted as \(\hat {\mathbf {R}}_{\mathbf {nn}}\) and in general the notation \(\hat {(.)}\) will refer to any data-dependent estimate.
In the PWT domain, it can be seen that using \(\hat {\mathbf {R}}_{\mathbf {nn}}\) in (10) will result in an estimate for the pre-whitening operator as \(\hat {\mathbf {L}}\) and hence from (14), \(\hat {\mathbf {R}}_{\mathbf {nn}}\) can be expressed as:
$$\hat{\mathbf{R}}_{\mathbf{nn}} = \widetilde{\boldsymbol{\Upsilon}}^{-H}\hat{\mathbf{L}}\hat{\mathbf{L}}^{H}\widetilde{\boldsymbol{\Upsilon}}^{-1} \qquad (19)$$
Replacing R_{nn} in (17) with \(\hat {\mathbf {R}}_{\mathbf {nn}}\) in (19) then results in the MVDR beamformer problem formulated in the PWT domain as:
$$\underset{\underset{\smallsmile}{\mathbf{w}}}{\min}\ \underset{\smallsmile}{\mathbf{w}}^{H}\underset{\smallsmile}{\mathbf{w}} \quad \text{s.t.} \quad \underset{\smallsmile}{\mathbf{w}}^{H}\underline{\mathbf{h}} = 1 \qquad (20)$$
where \(\underset {\smallsmile }{\mathbf {w}}\) is redefined as \(\underset {\smallsmile }{\mathbf {w}} = \hat {\mathbf {L}}^{H} \widetilde {\boldsymbol {\Upsilon }}^{-1} \mathbf {w}\) and \(\underline {\mathbf {h}}\) is redefined as \(\underline {\mathbf {h}} = \hat {\mathbf {L}}^{-1}\widetilde {\boldsymbol {\Upsilon }}^{{{H}}} \mathbf {h}\). The solution to (20) then yields the optimal filter in the PWT domain:
$$\underset{\smallsmile}{\mathbf{w}}_{\text{mvdr}} = \frac{\underline{\mathbf{h}}}{\underline{\mathbf{h}}^{H}\underline{\mathbf{h}}} \qquad (21)$$
with the desired speech signal estimate, \(\mathrm {z}_{1} = \underset {\smallsmile }{\mathbf {w}}_{\text {mvdr}}^{{{H}}} \underline {\mathbf {y}}\). As h is still unknown, however, it means that \(\underline {\mathbf {h}}\) is also unknown and an estimate for this component is still required. Using the same \(\hat {\mathbf {R}}_{\mathbf {nn}}\), two general approaches for the estimation of \(\underline {\mathbf {h}}\) can be considered, either making use of an available a priori RTF vector pertaining to the LMA or making use of only the observable microphone data, i.e. a fully data-dependent estimate. The remainder of this section elaborates on these procedures.
3.1 Using an a priori RTF vector
For a microphone configuration consisting of only an LMA, it is not uncommon to use an a priori RTF vector, \(\widetilde {\mathbf {h}}_{\mathbf {a}}\), in place of the true RTF vector. As mentioned earlier, this may be measured a priori or based on several assumptions regarding the spatial scenario and acoustic environment. For the inclusion of XMs into the microphone configuration, however, the notion of an a priori RTF vector is not so straightforward as no immediate prior knowledge with respect to the XMs can be exploited since there are no restrictions on what type of XMs can be used or where they must be placed in the acoustic environment. Hence, an a priori RTF vector cannot be prescribed, as was the case for the LMA only. However, since a priori information would typically only be available for the LMA, an a priori RTF vector for a microphone configuration of an LMA with XMs can be defined as follows:
which consists partially of the a priori RTF vector pertaining to the LMA, \(\widetilde {\mathbf {h}}_{\mathbf {a}}\), and partially of the RTF vector pertaining to the XM, h_{e}, which is unknown and remains to be estimated. The estimate of h_{e} will be denoted as \(\hat {\widetilde {\mathbf {h}}}_{\mathbf {e}}\) to emphasise that it is constrained by the a priori knowledge set by \(\widetilde {\mathbf {h}}_{\mathbf {a}}\) but estimated from the observed microphone data. In [9], a procedure involving the generalised eigenvalue decomposition (GEVD) was used for obtaining \(\hat {\widetilde {\mathbf {h}}}_{\mathbf {e}}\) which is subsequently reviewed and reframed in the PWT domain.
In the PWT domain, using (13)–(15), a rank-1 matrix approximation problem can first be formulated to estimate the entire RTF vector [9]:
where \(\|.\|_{F}\) is the Frobenius norm, and:
where \(\hat {\mathbf {R}}_{\mathbf {yy}}\) is the data-dependent estimate of R_{yy}. From (22), an a priori RTF vector in the PWT domain can be defined as follows:
where 0 is a vector of (M_{a}−1) zeros. Replacing h with the a priori RTF vector from (22) then results in:
where now only an estimate is required for h_{e}, which in turn will define the a priori RTF vector. As discussed in [9], it can be observed that it is only the lower (M_{e}+1)×(M_{e}+1) blocks of \(\underline {\hat {\mathbf {R}}}_{\mathbf {yy}}\) and \(\underline {\hat {\mathbf {R}}}_{\mathbf {nn}}\) that are required for estimating h_{e}. Hence, (27) can be reduced to:
where \(\mathbf {J} = \left [\hspace {0.1cm} \mathbf {0}_{(M_{\mathrm {e}}+1) \times (M_{\mathrm {a}}-1)} \hspace {0.1cm} \hspace {0.1cm} \mathbf {I}_{(M_{\mathrm {e}}+1)} \hspace {0.1cm}\right ]^{T}\) is a selection matrix, \(\underline {\underline {\hat {\mathbf {R}}}}_{\mathbf {yy}} = \mathbf {J}^{T} \underline {\hat {\mathbf {R}}}_{\mathbf {yy}} \mathbf {J}\), and \(\underline {\underline {\hat {\mathbf {R}}}}_{\mathbf {nn}} = \mathbf {J}^{T} \underline {\hat {\mathbf {R}}}_{\mathbf {nn}} \mathbf {J} = \mathbf {I}_{M_{\mathrm {e}}+1}\). The solution of (28) then follows from a GEVD of the matrix pencil \(\left \{\underline {\underline {\hat {\mathbf {R}}}}_{\mathbf {yy}}, \underline {\underline {\hat {\mathbf {R}}}}_{\mathbf {nn}} \right \}\) or equivalently from the eigenvalue decomposition (EVD) of \(\underline {\underline {\hat {\mathbf {R}}}}_{\mathbf {yy}}\) [26]:
where \(\hat {\mathbf {V}}\) is a (M_{e}+1)×(M_{e}+1) unitary matrix of eigenvectors and \(\hat {\boldsymbol {\Gamma }}\) is a diagonal matrix with the associated eigenvalues in descending order. The estimate of h_{e} then follows from the appropriate scaling of the principal eigenvector, \(\hat {\mathbf {v}}_{\mathbf {p}}\):
where \(\mathbf {e}_{M_{\mathrm {a}}}\) is an (M_{a}+M_{e}) selection vector consisting of all zeros except for a one in the M_{a}-th position, \(\hat {v}_{p,1}\) is the first element of \(\hat {\mathbf {v}}_{\mathbf {p}}\), and \(\hat {l}_{M_{\mathrm {a}}}\) is the real-valued (M_{a},M_{a})-th element in \(\hat {\mathbf {L}}\). Substitution of this expression into (26) finally yields the a priori RTF vector in the PWT domain as^{Footnote 4}:
Finally, replacing \(\underline {\mathbf {h}}\) in (21) with \(\underline {\widetilde {\mathbf {h}}}\) from (31) results in the MVDR beamformer based on a priori knowledge pertaining to the LMA:
which will be referred to as MVDR-AP. The corresponding speech estimate is then computed using (16):
As a consequence of incorporating the a priori information into the rank-1 speech model, it can be seen that it is only necessary to filter the last (M_{e}+1) elements of \(\underline {\mathbf {y}}\), i.e. \(\underline {\mathrm {y}}_{\mathrm {a},M_{\mathrm {a}}}\) and \(\underline {\mathbf {y}}_{\mathbf {e}}\), with the lower-order, (M_{e}+1)-element filter defined by \(\hat {l}_{M_{\mathrm {a}}} \hspace {0.05cm} \hat {v}_{p,1}^{*} \hspace {0.05cm} \hat {\mathbf {v}}_{\mathbf {p}} \).
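The idea of completing an a priori LMA RTF vector with an estimated XM part can be sketched in NumPy as below. This is an illustrative reading of the procedure rather than a reproduction of it: the blocking matrix and fixed beamformer are assumed choices (an SVD-based orthonormal complement and the a priori vector divided by its squared norm), the scaling is obtained by normalising the back-transformed vector to its first element instead of using the closed-form expression, and all names and toy data are hypothetical.

```python
import numpy as np

def estimate_rtf_with_prior(R_yy, R_nn, h_a_prior):
    """Complete an a priori LMA RTF vector with an estimated XM part (sketch)."""
    M_a, M = h_a_prior.shape[0], R_yy.shape[0]
    # Assumed transformation: block-diagonal of [C f] and the identity.
    U, _, _ = np.linalg.svd(h_a_prior.reshape(-1, 1), full_matrices=True)
    T = np.eye(M, dtype=complex)
    T[:M_a, :M_a - 1] = U[:, 1:]
    T[:M_a, M_a - 1] = h_a_prior / np.vdot(h_a_prior, h_a_prior).real
    # Pre-whitening with the Cholesky factor of the transformed noise matrix.
    L = np.linalg.cholesky(T.conj().T @ R_nn @ T)
    L_inv = np.linalg.inv(L)
    R_yy_pwt = L_inv @ T.conj().T @ R_yy @ T @ L_inv.conj().T
    # Only the lower (M_e+1) x (M_e+1) block is needed.
    R_low = R_yy_pwt[M_a - 1:, M_a - 1:]
    _, V = np.linalg.eigh(0.5 * (R_low + R_low.conj().T))  # ascending eigenvalues
    v_p = V[:, -1]                                          # principal eigenvector
    # Back to the unprocessed domain and normalise to the reference microphone.
    u = np.zeros(M, dtype=complex)
    u[M_a - 1:] = v_p
    h = np.linalg.solve(T.conj().T, L @ u)                  # T^{-H} L u
    return h / h[0]

# Toy data (hypothetical): exact rank-1 speech model, 3 LMA microphones, 2 XMs.
rng = np.random.default_rng(2)
M_a, M_e = 3, 2
M = M_a + M_e
h_a_prior = np.concatenate(([1.0], rng.standard_normal(M_a - 1)
                            + 1j * rng.standard_normal(M_a - 1)))
h_e = rng.standard_normal(M_e) + 1j * rng.standard_normal(M_e)
h_true = np.concatenate((h_a_prior, h_e))
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_nn = A @ A.conj().T + np.eye(M)
R_yy = 2.0 * np.outer(h_true, h_true.conj()) + R_nn

h_est = estimate_rtf_with_prior(R_yy, R_nn, h_a_prior)

# Mismatch case: the true LMA part differs from the a priori vector.
h_a_other = np.concatenate(([1.0], rng.standard_normal(M_a - 1)
                            + 1j * rng.standard_normal(M_a - 1)))
h_other = np.concatenate((h_a_other, h_e))
R_yy_other = 2.0 * np.outer(h_other, h_other.conj()) + R_nn
h_est_other = estimate_rtf_with_prior(R_yy_other, R_nn, h_a_prior)
```

With exact correlation matrices and a correct a priori vector, `h_est` matches `h_true`. In the mismatch case, the LMA part of `h_est_other` still equals `h_a_prior`, which illustrates that the a priori knowledge acts as a hard constraint on that part of the vector.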
3.2 Using a data-dependent RTF vector
In the PWT domain, it is (23) that needs to be solved in order to obtain a fully data-dependent estimate of the RTF vector pertaining to the LMA and the XMs. The solution to (23) follows from a GEVD of the matrix pencil \(\left \{\underline {\hat {\mathbf {R}}}_{\mathbf {yy}}, \underline {\hat {\mathbf {R}}}_{\mathbf {nn}} \right \}\) or equivalently from the EVD of \(\underline {\hat {\mathbf {R}}}_{\mathbf {yy}}\):
where \(\hat {\mathbf {Q}}\) is an (M_{a}+M_{e})×(M_{a}+M_{e}) unitary matrix of eigenvectors and \(\hat {\boldsymbol {\Lambda }}\) is a diagonal matrix with the associated eigenvalues in descending order. The estimated RTF vector is then given by the principal (first in this case) eigenvector, \(\hat {\mathbf {q}}_{\mathbf {p}}\):
where \(\hat {\eta }_{q} = \mathbf {e}^{T}_{1} \widetilde {\boldsymbol {\Upsilon }}^{-{{H}}} \hspace {0.05cm} \hat {\mathbf {L}} \hspace {0.05cm} \hat {\mathbf {q}}_{\mathbf {p}}\) and e_{1} is an (M_{a}+M_{e}) selection vector with a one as the first element and zeros everywhere else. In the PWT domain, this data-dependent RTF vector then becomes:
Replacing \(\underline {\mathbf {h}}\) in (21) with \(\underline {\hat {\mathbf {h}}}\) from (36) results in the MVDR beamformer that makes use of a data-dependent RTF vector:
which will be referred to as MVDR-DD. The corresponding speech estimate is then computed using (16):
where now all (M_{a}+M_{e}) signals need to be filtered as opposed to only (M_{e}+1) signals in (33) when an a priori RTF vector is used. In general, the MVDR-DD would also be used for microphone configurations where there is no a priori knowledge available, such as those consisting of external microphones only.
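A minimal NumPy sketch of a fully data-dependent RTF estimate via pre-whitening and an EVD is given below. For brevity, the transformation matrix is taken to be the identity, so only the pre-whitening step is applied; this simplification, the function name, and the toy data are assumptions of the example.

```python
import numpy as np

def estimate_rtf_data_dependent(R_yy, R_nn):
    """RTF estimate from the principal eigenvector of the pre-whitened R_yy."""
    L = np.linalg.cholesky(R_nn)                  # R_nn = L L^H, L lower triangular
    L_inv = np.linalg.inv(L)
    R_w = L_inv @ R_yy @ L_inv.conj().T           # pre-whitened speech-plus-noise matrix
    _, Q = np.linalg.eigh(0.5 * (R_w + R_w.conj().T))  # ascending eigenvalues
    q_p = Q[:, -1]                                # principal eigenvector
    h = L @ q_p                                   # undo the pre-whitening
    return h / h[0]                               # normalise to the reference microphone

# Toy data (hypothetical): exact rank-1 speech model with 5 microphones.
rng = np.random.default_rng(3)
M = 5
h_true = np.concatenate(([1.0], rng.standard_normal(M - 1)
                         + 1j * rng.standard_normal(M - 1)))
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_nn = A @ A.conj().T + np.eye(M)
R_yy = 1.5 * np.outer(h_true, h_true.conj()) + R_nn

h_hat = estimate_rtf_data_dependent(R_yy, R_nn)
```

With exact correlation matrices `h_hat` recovers `h_true`; with estimated matrices the quality of the estimate depends on how well R_{yy} and R_{nn} are estimated, which is the motivation for keeping an a priori vector available as well.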
4 Integrated MVDR beamformer
4.1 Quadratically constrained quadratic program
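As opposed to relying on only an a priori RTF vector or a data-dependent RTF vector, the merging or integration of both RTF vectors into a single approach can be framed as a quadratically constrained quadratic program (QCQP), first with respect to the unprocessed microphone signals:
$$\underset{\mathbf{w}}{\min}\ \mathbf{w}^{H}\hat{\mathbf{R}}_{\mathbf{nn}}\mathbf{w} \quad \text{s.t.} \quad \left|\mathbf{w}^{H}\widetilde{\mathbf{h}} - 1\right|^{2} \leq \widetilde{\epsilon}^{2}, \quad \left|\mathbf{w}^{H}\hat{\mathbf{h}} - 1\right|^{2} \leq \hat{\epsilon}^{2} \qquad (39)$$
where \(\widetilde {\epsilon }^{2}\) and \(\hat {\epsilon }^{2}\) are maximum tolerated squared deviations from a distortionless response due to \(\widetilde {\mathbf {h}}\) or \(\hat {\mathbf {h}}\) respectively. The constraints of (39) can also be rewritten in the standard form [27] as follows:
$$\mathbf{w}^{H}\widetilde{\mathbf{h}}\widetilde{\mathbf{h}}^{H}\mathbf{w} - 2\Re\left\{\mathbf{w}^{H}\widetilde{\mathbf{h}}\right\} + 1 - \widetilde{\epsilon}^{2} \leq 0 \qquad (40)$$
$$\mathbf{w}^{H}\hat{\mathbf{h}}\hat{\mathbf{h}}^{H}\mathbf{w} - 2\Re\left\{\mathbf{w}^{H}\hat{\mathbf{h}}\right\} + 1 - \hat{\epsilon}^{2} \leq 0 \qquad (41)$$
where ℜ{.} denotes the real part of its argument. As the matrices \(\hat {\mathbf {R}}_{\mathbf {nn}}\), \(\widetilde {\mathbf {h}}\widetilde {\mathbf {h}}^{{H}}\), and \(\hat {\mathbf {h}}\hat {\mathbf {h}}^{{H}}\) are all positive semidefinite, it is then evident that the QCQP of (39) is convex [27]. In the PWT domain, (39) is equivalently:
$$\underset{\underset{\smallsmile}{\mathbf{w}}}{\min}\ \underset{\smallsmile}{\mathbf{w}}^{H}\underset{\smallsmile}{\mathbf{w}} \quad \text{s.t.} \quad \left|\underset{\smallsmile}{\mathbf{w}}^{H}\underline{\widetilde{\mathbf{h}}} - 1\right|^{2} \leq \widetilde{\epsilon}^{2}, \quad \left|\underset{\smallsmile}{\mathbf{w}}^{H}\underline{\hat{\mathbf{h}}} - 1\right|^{2} \leq \hat{\epsilon}^{2} \qquad (42)$$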
where \(\underline {\widetilde {\mathbf {h}}}\) and \(\underline {\hat {\mathbf {h}}}\) are given in (31) and (36) respectively. Whereas in (20), the hard constraint of \(\underline {\mathbf {h}}\) is replaced by either \(\underline {\widetilde {\mathbf {h}}}\) or \(\underline {\hat {\mathbf {h}}}\), (42) can be interpreted as the relaxation of the hard constraints imposed by \(\underline {\widetilde {\mathbf {h}}}\) or \(\underline {\hat {\mathbf {h}}}\) by the specified deviations \(\widetilde {\epsilon }^{2}\) and \(\hat {\epsilon }^{2}\) respectively. In the following, the quantities \(\left | \underset {\smallsmile }{\mathbf {w}}^{{H}} \underline {\widetilde {\mathbf {h}}} - 1 \right |^{2}\) and \(\left | \underset {\smallsmile }{\mathbf {w}}^{{H}} \underline {\hat {\mathbf {h}}} - 1 \right |^{2}\) are referred to as speech distortions and \(\widetilde {\epsilon }^{2}\) and \(\hat {\epsilon }^{2}\) are the respective maximum tolerable speech distortions. Furthermore, the first inequality constraint in (42) will be referred to as the a priori constraint (APC), and the second inequality constraint will be referred to as the data-dependent constraint (DDC).
The QCQP of (39) is in fact a special case of the more general QCQP considered in [28, 29] as well as an extension of the parametrised multichannel Wiener filter [30]. In [28, 29], the inequality constraints considered are a set of a priori measured RTF vectors, and in [30], only one inequality constraint is considered. The difference in (39) from both of these approaches is that two inequality constraints are considered, one that relies on a priori knowledge and the other which is fully estimated from the data.
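The Lagrangian of (42) is given by:
$$\mathcal{L}\left(\underset{\smallsmile}{\mathbf{w}}, \alpha, \beta\right) = \underset{\smallsmile}{\mathbf{w}}^{H}\underset{\smallsmile}{\mathbf{w}} + \alpha\left(\left|\underset{\smallsmile}{\mathbf{w}}^{H}\underline{\widetilde{\mathbf{h}}} - 1\right|^{2} - \widetilde{\epsilon}^{2}\right) + \beta\left(\left|\underset{\smallsmile}{\mathbf{w}}^{H}\underline{\hat{\mathbf{h}}} - 1\right|^{2} - \hat{\epsilon}^{2}\right) \qquad (43)$$
where α and β are Lagrangian multipliers. Taking the partial derivative of (43) with respect to \(\underset {\smallsmile }{\mathbf {w}}\) and setting to zero results in what will be referred to as the integrated MVDR beamformer, MVDR-INT:
$$\underset{\smallsmile}{\mathbf{w}}_{\text{int}} = \left(\mathbf{I}_{(M_{\mathrm{a}}+M_{\mathrm{e}})} + \alpha\,\underline{\widetilde{\mathbf{h}}}\,\underline{\widetilde{\mathbf{h}}}^{H} + \beta\,\underline{\hat{\mathbf{h}}}\,\underline{\hat{\mathbf{h}}}^{H}\right)^{-1}\left(\alpha\,\underline{\widetilde{\mathbf{h}}} + \beta\,\underline{\hat{\mathbf{h}}}\right) \qquad (44)$$
where the actual values of α and β depend on the prescribed maximum tolerable speech distortions \(\widetilde {\epsilon }^{2}\) and \(\hat {\epsilon }^{2}\). It can also be observed that (44) is in fact identical (in the PWT domain) to the integrated MVDR beamformer considered in [22] and hence can be written as a linear combination of \(\underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}\) and \(\underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}\) with complex weightings^{Footnote 5} [22]: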
where \(\underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}\) and \(\underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}\) are given in (32) and (37), respectively, and the complex weightings are given by:
where
and
Using the expressions for \(\underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}\) and \(\underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}\) from (32) and (37) respectively, the resulting speech estimate from the MVDR-INT is then:
where \(\widetilde {\mathrm {z}}_{1}\) and \(\hat {\mathrm {z}}_{1}\) are defined in (33) and (38) respectively. Hence, the integrated beamformer output is simply a linear combination of the two speech estimates which relied on either a priori information or not.
Once appropriate values are chosen for \(\widetilde {\epsilon }^{2}\) and \(\hat {\epsilon }^{2}\), then a package for specifying and solving convex programs such as CVX [31, 32] can be used for solving (42). Alternatively, more computationally efficient methods may be applied such as those proposed in [28, 29], one of which is highlighted in Algorithm 1. Here, a gradient ascent method [33] for solving (42) is described, which is based on solving the dual problem:
where \(\mathscr {D}(\alpha, \beta) = \underset {\tiny \underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}}{\inf } \hspace {0.1cm} \mathcal {L}\left (\underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}, \alpha, \beta \right)\) is the infimum of \(\mathcal {L}(\underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}, \alpha, \beta)\) and referred to as the dual function. As the dual function is concave [27], a gradient ascent procedure can be used to update the values of α and β using the gradients, \(\frac {\partial \mathscr {D}(\alpha, \beta)}{\partial \alpha } = \left | \underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}^{{H}} \underline {\widetilde {\mathbf {h}}} - 1 \right |^{2} - \widetilde {\epsilon }^{2}\) and \(\frac {\partial \mathscr {D}(\alpha, \beta)}{\partial \beta } = \left | \underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}^{{H}} \underline {\hat {\mathbf {h}}} - 1 \right |^{2} - \hat {\epsilon }^{2}\), i.e. the gradient of the dual function with respect to each Lagrangian multiplier is the corresponding constraint function. This then gives rise to Algorithm 1 [29], which makes use of the simplified expressions for \(\underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}\) with the complex-valued weightings as opposed to computing (44) directly. The Lagrangian multipliers, α and β, are then updated via the gradient ascent procedure with the step size γ, whose value can be controlled using a backtracking method [34]. The algorithm continues until the respective gradients are within some specified tolerance, δ.
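The dual gradient ascent idea can be sketched in NumPy as follows. This is not a reproduction of Algorithm 1: it assumes a fixed step size instead of backtracking, projects the multipliers onto non-negative values after each update, and computes the filter directly from the stationarity condition of the Lagrangian, (I + α h̃h̃^{H} + β ĥĥ^{H}) w = α h̃ + β ĥ, rather than via the complex weightings. Function names, the step size, the tolerance, and the toy vectors are all illustrative assumptions.

```python
import numpy as np

def integrated_filter(h_ap, h_dd, alpha, beta):
    """Filter minimising the Lagrangian for given multipliers (PWT domain)."""
    M = h_ap.shape[0]
    A = (np.eye(M) + alpha * np.outer(h_ap, h_ap.conj())
         + beta * np.outer(h_dd, h_dd.conj()))
    return np.linalg.solve(A, alpha * h_ap + beta * h_dd)

def distortion(w, h):
    """Speech distortion |w^H h - 1|^2."""
    return abs(np.vdot(w, h) - 1.0) ** 2

def solve_qcqp_dual_ascent(h_ap, h_dd, eps_ap, eps_dd,
                           step=0.1, tol=1e-8, max_iter=100000):
    """Projected gradient ascent on the dual with a fixed step size (assumed)."""
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        w = integrated_filter(h_ap, h_dd, alpha, beta)
        # Gradients of the dual function are the constraint functions.
        grad_a = distortion(w, h_ap) - eps_ap ** 2
        grad_b = distortion(w, h_dd) - eps_dd ** 2
        # Multipliers of inequality constraints must stay non-negative.
        new_alpha = max(0.0, alpha + step * grad_a)
        new_beta = max(0.0, beta + step * grad_b)
        converged = (abs(new_alpha - alpha) <= step * tol
                     and abs(new_beta - beta) <= step * tol)
        alpha, beta = new_alpha, new_beta
        if converged:
            break
    return integrated_filter(h_ap, h_dd, alpha, beta), alpha, beta

# Toy case (hypothetical): orthonormal RTF vectors, so the problem decouples
# and each multiplier has the closed form 1/eps - 1.
h_ap = np.array([1.0, 0.0], dtype=complex)
h_dd = np.array([0.0, 1.0], dtype=complex)
w, alpha, beta = solve_qcqp_dual_ascent(h_ap, h_dd, eps_ap=0.5, eps_dd=0.25)

# Both tolerances above one: neither constraint is active.
w_zero, alpha_zero, beta_zero = solve_qcqp_dual_ascent(h_ap, h_dd, 1.5, 1.5)
```

In the toy case both constraints end up active, with α ≈ 1 and β ≈ 3. In the second call both multipliers remain zero and the filter is the all-zero vector. A fixed step size can converge slowly when the tolerated distortions are small, which is one reason a backtracking rule may be preferable in practice.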
4.2 Effect of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\)
As the QCQP of (42) is in principle to be solved for every time frame and frequency bin, it can lead to quite a versatile beamformer, since the parameters \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) can be set independently for each frequency in every time frame in order to define the inequality constraints. So although (42) is a well-known QCQP for which there are several methods available to find the solution, it remains unclear what would be a reasonable strategy for setting or tuning \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) in practice. As opposed to [22], where tuning rules were developed for the Lagrangian multipliers, here a strategy is outlined for tuning \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\), from which the appropriate Lagrangian multipliers are in turn computed (for instance as outlined in Algorithm 1), as this is believed to be a more insightful procedure.
In order to develop a strategy for tuning \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\), it will be useful to observe the constraints of (42) in more detail. The derivations that follow will reveal that the space spanned by \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) can be divided into four distinct regions as illustrated in Fig. 1, where each of these regions corresponds to a particular set of constraints being active.
Firstly, substitution of \(\underset {\smallsmile }{\mathbf {w}}_{{\text {int}}} = \mathbf {0}\) into the APC and DDC from (42) shows that when \(\widetilde {\epsilon } > 1\) and \(\hat {\epsilon } > 1\), both the APC and the DDC are inactive. This condition therefore defines the upper-right region (region I) of Fig. 1 and indeed corresponds to a complete attenuation of the microphone signals, i.e. a zero output signal.
For the case when \(\hat {\epsilon }\to \infty \), i.e. when the DDC is inactive, then β→0. If the APC is still active however, it becomes^{Footnote 6}:
Furthermore, if \(0\leq \widetilde {\epsilon } \leq 1\), then it can be deduced that:
Substitution of (54) into (53) readily makes this evident, recalling that \(\underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} = 1\). It is worthwhile to also note that by using (46), the relationship between α and \(\widetilde {\epsilon }\) for \(0\leq \widetilde {\epsilon } \leq 1\) is then given as:
In regard to the DDC, as \(\hat {\epsilon }\) is decreased (from \(\hat {\epsilon }\to \infty \)), it remains inactive until \(\left | \underset {\smallsmile }{\mathbf {w}}^{{H}} \underline {\hat {\mathbf {h}}} - 1 \right | = \hat {\epsilon }\). By substitution of (54) into the DDC of (42), the value of \(\hat {\epsilon }\) at which the DDC becomes active, \(\hat {\epsilon }_{o}\), is given by:
In the limits of \(\widetilde {\epsilon }\), when \(\widetilde {\epsilon } = 1\), \(\hat {\epsilon }_{o} = 1\), and when \(\widetilde {\epsilon } = 0\), \(\hat {\epsilon }_{o} = \left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} - 1 \right |\), where depending on \(\underline {\hat {\mathbf {h}}}\), \(\left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} - 1 \right | < 1\) or \(\left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} - 1 \right | \geq 1\). The range of values obtained for \(\hat {\epsilon }_{o}\) from (56) within the domain \(0 \leq \widetilde {\epsilon }\leq 1\) defines what will be referred to as the DDC bounding curve as depicted in Fig. 1. Hence, region II in Fig. 1 is enclosed by the DDC bounding curve, \(\widetilde {\epsilon } = 0\) and \(\widetilde {\epsilon } = 1\), representing the space where the APC is active and the DDC is inactive.
A similar analysis can be followed starting from the case when \(\widetilde {\epsilon }\to \infty \), i.e. when the APC is inactive and hence α→0. If the DDC is still active however, it becomes:
When \(0\leq \hat {\epsilon } \leq 1\), then the following relationships can be deduced:
Finally, for the APC, as \(\widetilde {\epsilon }\) is decreased (from initially \(\widetilde {\epsilon }\to \infty \)), the value, \(\widetilde {\epsilon }_{o}\), at which this constraint becomes active is given by:
In the limits of \(\hat {\epsilon }\), when \(\hat {\epsilon } = 1\), \(\widetilde {\epsilon }_{o} = 1\), and when \(\hat {\epsilon } = 0\), \(\widetilde {\epsilon }_{o} = \left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{H}} \underline {\widetilde {\mathbf {h}}} - 1\right |\), where depending on \(\underline {\widetilde {\mathbf {h}}}\), \(\left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{H}} \underline {\widetilde {\mathbf {h}}} - 1\right | < 1\) or \(\left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{H}} \underline {\widetilde {\mathbf {h}}} - 1\right | \geq 1\). The range of values obtained for \(\widetilde {\epsilon }_{o}\) from (60) within the domain \(0 \leq \hat {\epsilon }\leq 1\) defines what will be referred to as the APC bounding curve as depicted in Fig. 1. Hence, region III in Fig. 1 is enclosed by the APC bounding curve, \(\hat {\epsilon } = 0\) and \(\hat {\epsilon } = 1\), representing the space where the APC is inactive and the DDC is active.
Finally, in the lower-left region, region IV, both the APC and the DDC become active within the area enclosed by the APC and DDC bounding curves. It should be kept in mind that Fig. 1 is only an illustration and that the shape of the area for which the APC and DDC are both active can change depending on the RTF vectors, \(\underline {\widetilde {\mathbf {h}}}\) and \(\underline {\hat {\mathbf {h}}}\). For instance, Fig. 1 shows \(\left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{H}} \underline {\widetilde {\mathbf {h}}} - 1\right | < 1\) and \(\left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} - 1\right | < 1\) (points on the axes), whereas it is possible that either of these points may be greater than or equal to one.
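The four regions can be checked numerically with a short sketch. It assumes that the PWT-domain MVDR filters take the form \(\mathbf{h}/(\mathbf{h}^{H}\mathbf{h})\) (identity noise correlation matrix) and that the bounding curves are \(\hat{\epsilon}_{o} = |\widetilde{\mathbf{w}}_{\text{mvdr}}^{H}\underline{\hat{\mathbf{h}}}(1 - \widetilde{\epsilon}) - 1|\) and \(\widetilde{\epsilon}_{o} = |\hat{\mathbf{w}}_{\text{mvdr}}^{H}\underline{\widetilde{\mathbf{h}}}(1 - \hat{\epsilon}) - 1|\). The function name `classify_region` and the treatment of points lying exactly on a boundary are illustrative choices.

```python
def inner(x, y):
    """Hermitian inner product x^H y of two equal-length sequences."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


def classify_region(eps_tilde, eps_hat, h_tilde, h_hat):
    """Return "I", "II", "III" or "IV" for a point in the epsilon plane."""
    cross = inner(h_tilde, h_hat)
    # w_mvdr(h_tilde)^H h_hat and w_mvdr(h_hat)^H h_tilde, with w_mvdr(h) = h / (h^H h)
    r_tilde = cross / inner(h_tilde, h_tilde).real
    r_hat = cross.conjugate() / inner(h_hat, h_hat).real

    if eps_tilde >= 1 and eps_hat >= 1:
        return "I"  # neither constraint active, zero filter
    # DDC bounding curve: value of eps_hat at which the DDC becomes active
    if eps_tilde <= 1 and eps_hat >= abs(r_tilde * (1 - eps_tilde) - 1):
        return "II"  # only the APC is active
    # APC bounding curve: value of eps_tilde at which the APC becomes active
    if eps_hat <= 1 and eps_tilde >= abs(r_hat * (1 - eps_hat) - 1):
        return "III"  # only the DDC is active
    return "IV"  # both constraints active
```

For orthogonal RTF vectors both bounding curves are constant at one, so region IV is the unit square. For parallel vectors the curves collapse onto the diagonal and region IV vanishes.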
5 Confidence metric and tuning
5.1 Confidence metric
One of the ingredients towards developing a tuning strategy for setting appropriate values for \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) is that of a confidence metric, which is indicative of the confidence in the accuracy of the data-dependent RTF vector. In [22], it was proposed that a principal generalised eigenvalue resulting from the data-dependent estimation procedure be used as such a confidence metric. In the following, it is proposed again to use such a metric; however, due to the formulation in the PWT domain, the principal eigenvalue, \(\hat {\lambda }_{1}\), from the EVD in (34) will be used. It can be shown that \(\hat {\lambda }_{1}\) is equivalent to the resulting posterior SNR when the MVDR-DD is applied and therefore serves as a reasonable metric for making a decision with respect to the accuracy of the data-dependent RTF. For the MVDR-DD in (37), the resulting posterior SNR is given by:
where it is recalled that \(\underline {\hat {\mathbf {R}}}_{\mathbf {nn}} = \mathbf {I}_{(M_{\mathrm {a}} + M_{\mathrm {e}})}\). Substitution of (34) and (37)^{Footnote 7} into (61) results in \(\widehat {\text {SNR}}_{\text {DD}} = \hat {\lambda }_{1}\).
As in [22], \(\hat {\lambda }_{1}\) can then be used in a logistic function to define the confidence metric, F(l)^{Footnote 8}:
where F(l)∈[0,1], ρ controls the gradient of the transition from 0 to 1, and λ_{t} is a threshold (in dB), beyond which F(l)→1. Hence, as \(10\log _{10}(\hat {\lambda }_{1}(l))\) increases beyond λ_{t}, then F(l)→1, indicating high confidence in the accuracy of the data-dependent RTF vector. On the other hand, as \(10\log _{10}(\hat {\lambda }_{1}(l))\) decreases below λ_{t}, then F(l)→0, indicating low confidence in the accuracy of the data-dependent RTF vector.
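A minimal sketch of such a confidence metric is given below. The exact expression is an assumption here: a standard logistic function of the eigenvalue in dB, \(F = 1/(1 + e^{-\rho (10\log_{10}\hat{\lambda}_{1} - \lambda_{t})})\), which matches the behaviour described above. The function name `confidence` is hypothetical.

```python
import math


def confidence(lambda_1, rho, lambda_t_db):
    """Logistic confidence in [0, 1] from the principal eigenvalue (linear scale).

    rho sets the steepness of the transition and lambda_t_db is the threshold in dB.
    """
    x = rho * (10.0 * math.log10(lambda_1) - lambda_t_db)
    # Two algebraically equivalent branches avoid overflow in exp for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

At the threshold itself the metric equals 0.5, and larger values of ρ make the switch between low and high confidence more abrupt.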
5.2 Tuning strategy
With the depiction of the space spanned by \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) from Fig. 1 in mind, a general two-step procedure can be followed to establish a particular tuning strategy:

1.
Choose two points on the \(\{\hat {\epsilon }, \widetilde {\epsilon }\}\) plane: ε_{AP} and ε_{DD}. The coordinates of ε_{AP}, \(\{\hat {\epsilon }_{AP}, \widetilde {\epsilon }_{AP}\}\), will specify the maximum tolerable speech distortions for the case when there is no confidence in the accuracy of the data-dependent RTF vector. The coordinates of ε_{DD}, \(\{\hat {\epsilon }_{DD}, \widetilde {\epsilon }_{DD}\}\), on the other hand, will specify the maximum tolerable speech distortions for the case when there is complete confidence in the accuracy of the data-dependent RTF vector.

2.
Define an appropriate path in order to connect ε_{AP} and ε_{DD}, where the variation along this path would be a function of the confidence metric, F(l). As F(l) changes in each time-frequency segment, different values of \(\hat {\epsilon }\) and \(\widetilde {\epsilon }\) will be chosen along this path and subsequently used in the QCQP from (42).
Figure 2 depicts three examples of how such a general tuning strategy can be interpreted in the \(\{\hat {\epsilon }, \widetilde {\epsilon }\}\) plane, where a linear path has been used to connect the points, ε_{AP} and ε_{DD}. Before further elaborating on Fig. 2, however, one possible tuning strategy will be briefly outlined. In this strategy, ε_{AP} and ε_{DD} are chosen by making use of the relationship between the integrated MVDR and the so-called speech distortion weighted multichannel Wiener filter (SDW-MWF) [35, 36]. Although ε_{AP} and ε_{DD} can in general be chosen without making use of this relation, it is done to highlight how the speech distortion parameter, μ, from the SDW-MWF is related to the maximum tolerable speech distortion parameters of the integrated MVDR, especially as this μ is a well-established trade-off parameter. For the path connecting ε_{AP} and ε_{DD}, a linear path will be defined using the confidence metric, F(l).
In the PWT domain, the cost function for the SDW-MWF is given by:
which consists of two terms, the first corresponding to the noise power spectral density after filtering and the second corresponding to the speech distortion. The speech distortion parameter μ∈(0,∞) is used to trade off between the amount of noise reduction and speech distortion, where larger values of μ put more emphasis on reducing the noise and smaller values put more emphasis on reducing the speech distortion. Two separate SDW-MWF formulations can then be considered for \(\underline {\widetilde {\mathbf {h}}}\) and \(\underline {\hat {\mathbf {h}}}\) respectively:
where \(\widetilde {\mu } \in (0, \infty)\) and \(\hat {\mu } \in (0, \infty)\) are the separate speech distortion parameters for each cost function. The solutions to (64) and (65) are then respectively given by:
where \(\hat {\sigma }^{2}_{\mathrm {s}_{\mathrm {a},1}}\) is an estimate of \({\sigma }^{2}_{\mathrm {s}_{\mathrm {a},1}}\). On comparing the \(\underset {\smallsmile }{\mathbf {w}}_{{\text {int}}}\) in (44) to (66) and (67), it can be observed that there is a relationship between the integrated MVDR beamformer and the SDW-MWF. By considering the expressions written as an MVDR beamformer followed by a single-channel post filter [36], it can be deduced that [22]:
Proceeding to define the coordinates of ε_{AP}, (68) is substituted into (55) to obtain a value for \(\widetilde {\epsilon }\) as:
Hence, the range of values for \(\widetilde {\mu }\) is essentially compressed into a range of values for \(\widetilde {\epsilon }_{AP}\) such that \(0 \leq \widetilde {\epsilon }_{AP} \leq 1\). This means that \(\widetilde {\epsilon }_{AP}\) can be chosen to be within this range without having to specify \(\widetilde {\mu }\). However, (70) serves to clarify how the choice of \(\widetilde {\epsilon }_{AP}\) is related to the cost function of (64).
Using the value of \(\widetilde {\epsilon }_{AP}\) in (56) then yields a range of choices for \(\hat {\epsilon }_{AP}\) such that \(\hat {\epsilon }_{AP} \leq \hat {\epsilon }_{o}\):
If \(\hat {\epsilon }_{AP} = \left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} (1 - \widetilde {\epsilon }_{AP}) - 1 \right |\), then ε_{AP} lies on the DDC bounding curve of Fig. 1. For all values of \(\hat {\epsilon }\) such that \(\hat {\epsilon } > \left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} (1 - \widetilde {\epsilon }_{AP}) - 1 \right |\), the DDC remains inactive and hence setting a value of \(\hat {\epsilon }\) within this region will always result in the same achievable^{Footnote 9} speech distortion defined by \(\left | \underset {\smallsmile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}^{{H}} \underline {\hat {\mathbf {h}}} (1 - \widetilde {\epsilon }_{AP}) - 1 \right |\). Furthermore, when the DDC is inactive, then (68) holds, so that values of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) in region II from Fig. 1 would result in the SDW-MWF from (66).
Similarly, by firstly substituting (69) in (59) and making use of (60), the coordinates \(\{\hat {\epsilon }_{DD}, \widetilde {\epsilon }_{DD}\}\) of ε_{DD} can be defined as:
Now if \(\widetilde {\epsilon }_{DD} = \left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} (1 - \hat {\epsilon }_{DD}) - 1 \right |\), then ε_{DD} lies on the APC bounding curve of Fig. 1. Additionally, for all values of \(\widetilde {\epsilon }\) such that \(\widetilde {\epsilon } > \left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} (1 - \hat {\epsilon }_{DD}) - 1 \right |\), the APC remains inactive and hence setting a value of \(\widetilde {\epsilon }\) within this region will always result in the same achievable speech distortion defined by \(\left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} (1 - \hat {\epsilon }_{DD}) - 1 \right |\). Furthermore, when the APC is inactive, then (69) holds, so that values of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) in region III from Fig. 1 would result in the SDW-MWF from (67).
The insight of Fig. 1 and the additional value of the MVDR-INT as compared to the SDW-MWF is now apparent. Given the two SDW-MWF solutions from (66) and (67), it is not immediately clear how to optimally interpolate between them by using a linear combination of the filters themselves. In Fig. 1, however, it can be seen that an optimal interpolation between (66) and (67), i.e. between regions II and III, can be achieved through the specification of the maximum tolerable speech distortion parameters, \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\), along some path from region II to region III. In essence, the MVDR-INT has introduced region IV, which serves as a bridge for connecting regions II and III, thereby facilitating the use of both the a priori and data-dependent RTF vectors. This then corresponds to the second step of the general procedure for tuning, where ε_{AP} and ε_{DD} are to be connected. Here, it is proposed to use the confidence metric, F(l), to perform a linear interpolation between ε_{AP} and ε_{DD} to yield the values for \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) respectively as:
which are subsequently squared to be used in the QCQP from (42). Consequently, as the confidence in the accuracy of the datadependent RTF vector increases, the maximum tolerable speech distortions will be specified by values tending towards \(\{\hat {\epsilon }_{DD}, \widetilde {\epsilon }_{DD}\}\). On the contrary, as this confidence decreases, maximum tolerable speech distortions will be specified by values tending towards \(\{\hat {\epsilon }_{AP}, \widetilde {\epsilon }_{AP}\}\).
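The linear interpolation step can be sketched as follows, assuming the convention described above that F(l) = 0 selects ε_{AP} and F(l) = 1 selects ε_{DD}. The function name `interpolate_tolerances` and the tuple ordering \(\{\hat{\epsilon}, \widetilde{\epsilon}\}\) are illustrative choices.

```python
def interpolate_tolerances(confidence, point_ap, point_dd):
    """Linearly interpolate between two points on the (eps_hat, eps_tilde) plane.

    confidence: F(l) in [0, 1]; 0 returns point_ap and 1 returns point_dd.
    point_ap, point_dd: (eps_hat, eps_tilde) tuples.
    Returns the interpolated pair and its element-wise square for the QCQP.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    eps_hat = (1.0 - confidence) * point_ap[0] + confidence * point_dd[0]
    eps_tilde = (1.0 - confidence) * point_ap[1] + confidence * point_dd[1]
    return (eps_hat, eps_tilde), (eps_hat ** 2, eps_tilde ** 2)
```

Other paths between the two points could be substituted here without changing the rest of the procedure, since only the resulting pair of tolerances enters the QCQP.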
Returning focus to Fig. 2, the three examples of a tuning strategy can now be understood. A particular realisation of the APC and the DDC bounding curves has been plotted and the intersecting point of both curves corresponds to the {1,1} coordinate (recall Fig. 1). In the tuning of Fig. 2a, as F(l) increases, the path along the dotted line is taken from ε_{AP} to arrive at ε_{DD}, which gradually sets a larger value of \(\widetilde {\epsilon }\) for the APC and a smaller value of \(\hat {\epsilon }\) for the DDC. Depending on the particular realisation of the APC and DDC bounding curves, it may be that such a path can entirely lie within the area enclosed by these curves or part of it may lie outside as shown in Fig. 2a. The latter is in fact a fortunate circumstance because the achieved speech distortion corresponding to the inactive constraint will actually be lower than what was prescribed by the tuning. In the case of Fig. 2a for instance, when the linear path is above the APC bounding curve, it means that \(\widetilde {\epsilon } > \left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} (1 - \hat {\epsilon }) - 1 \right |\) (recall (60)). Since beyond \(\left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} (1 - \hat {\epsilon }) - 1 \right |\) the APC continues to be inactive, the actual speech distortion that would be achieved in relation to this constraint would correspond to \(\left | \underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}^{{{H}}} \underline {\widetilde {\mathbf {h}}} (1 - \hat {\epsilon }) - 1 \right |\), which is by definition less than \(\widetilde {\epsilon }\). Hence, although there is a linear path from ε_{AP} to ε_{DD}, at the point where this linear path intersects with the APC bounding curve, the actual speech distortions that would be achieved are those that continue along the APC bounding curve in order to arrive at ε_{DD}.
The tunings depicted in Fig. 2b and c are representative of strategies where the maximum tolerable speech distortion is fixed for one of the constraints, and only the maximum tolerable speech distortion for the other constraint is tuned. In Fig. 2b, ε_{DD} is defined by setting \(\widetilde {\epsilon }_{DD} = \widetilde {\epsilon }_{AP}\), so that the maximum tolerable speech distortion for the APC is fixed. \(\hat {\epsilon }\) is then tuned according to (74). This is representative of a case where the APC is always active and the DDC is only included if there is confidence in the accuracy of the data-dependent RTF vector. Figure 2c depicts an opposite strategy, where now ε_{AP} is defined by setting \(\hat {\epsilon }_{AP} = \hat {\epsilon }_{DD}\), so that the maximum tolerable speech distortion for the DDC is fixed.
6 Evaluation and discussion
In order to gain further insight into the behaviour of the integrated MVDR beamformer using the QCQP formulation, a simulation was firstly considered involving only an LMA without XMs. As will be demonstrated, observing such a scenario facilitates the visualisation of the theoretical beam patterns that would be generated under different tuning strategies. Following this simulation, recorded data from an acoustic scenario involving behind-the-ear dummy^{Footnote 10} hearing aid microphones along with XMs in a cocktail party scenario was then analysed and evaluated.
6.1 Beam patterns for a linear microphone array
As the notion of a traditional beam pattern is not immediately extended to the case of an LMA with XMs^{Footnote 11}, the following beam patterns are generated using an LMA only.
For visualising the beam patterns, a linear LMA consisting of 4 microphones with 5 cm spacing was considered. Two anechoic RTF vectors, simulating an a priori RTF vector, \(\widetilde {\mathbf {h}}_{\mathbf {a}}\), and a data-dependent RTF vector, \(\hat {\mathbf {h}}_{\mathbf {a}}\), were computed according to a far-field approximation, i.e. \(\left [1 \hspace {0.1cm} e^{j2\pi f \tau _{2}(\theta)} \hspace {0.1cm} e^{j2\pi f \tau _{3}(\theta)} \hspace {0.1cm} e^{j2\pi f \tau _{4}(\theta)} \right ]^{T}\), where f is the frequency (Hz) which was set to 3 kHz, \(\tau _{m}(\theta) = \frac {(m-1)0.05 \cos (\theta)}{c}\) is the relative time delay between the m^{th} microphone and the reference microphone (the microphone closest to the desired speech source) of the LMA, θ is the angle of the desired speech source, and c = 345 m s ^{−1} is the speed of sound. For \(\widetilde {\mathbf {h}}_{\mathbf {a}}\), θ=0^{∘} and for \(\hat {\mathbf {h}}_{\mathbf {a}}\), θ=60^{∘}. Using this definition of \(\widetilde {\mathbf {h}}_{\mathbf {a}}\), \(\widetilde {\mathbf {C}}_{\mathbf {a}}\) and \(\widetilde {\mathbf {f}}_{\mathbf {a}}\) were defined accordingly from (7) and \(\widetilde {\boldsymbol {\Upsilon }}_{\mathbf {a}}\) from (8). With \(\mathbf {R}_{\mathbf {n}_{\mathbf {a}}\mathbf {n}_{\mathbf {a}}} = \mathbf {I}_{M_{\mathrm {a}}}\), the pre-whitening operation from (10) was then computed but with \(\widetilde {\boldsymbol {\Upsilon }}_{\mathbf {a}}\) instead of \(\widetilde {\boldsymbol {\Upsilon }}\), and hence denoted as L_{a}.
In the PWT domain, the respective RTF vectors are given by \(\underline {\widetilde {\mathbf {h}}}_{\mathbf {a}} = \mathbf {L}^{-1}_{\mathbf {a}}\widetilde {\boldsymbol {\Upsilon }}_{\mathbf {a}}^{{{H}}}\widetilde {\mathbf {h}}_{\mathbf {a}}\) and \(\underline {\hat {\mathbf {h}}}_{\mathbf {a}} = \mathbf {L}^{-1}_{\mathbf {a}}\widetilde {\boldsymbol {\Upsilon }}_{\mathbf {a}}^{{{H}}}\hat {\mathbf {h}}_{\mathbf {a}}\). The optimal PWT domain filters, \(\underset {\smile }{\widetilde {\mathbf {w}}}_{{\text {mvdr}}}\) and \(\underset {\smile }{\hat {\mathbf {w}}}_{{\text {mvdr}}}\), were then computed as in (21), but using either \(\underline {\widetilde {\mathbf {h}}}_{\mathbf {a}}\) or \(\underline {\hat {\mathbf {h}}}_{\mathbf {a}}\). Finally, (74) and (75) were used to compute \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\), after which (42) was solved using CVX [31, 32] to yield the integrated MVDR beamformer for the LMA only, denoted as \(\underset {\smile }{\mathbf {w}}_{\text {int}}\). The beam patterns were computed as \(\underset {\smile }{\mathbf {w}}_{\text {int}}^{{H}} \underline {\mathbf {h}}(\theta)\), where \(\underline {\mathbf {h}}(\theta)\) is the PWT domain RTF vector corresponding to an angle, θ.
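As a rough illustration of how such a beam pattern is evaluated, the sketch below builds the far-field RTF vector for the 4-microphone, 5 cm array at 3 kHz and scans the response of a plain MVDR filter over angle. It is a simplification rather than the procedure used for Fig. 3: the PWT-domain transformation is skipped, spatially white noise is assumed so that the MVDR filter reduces to \(\mathbf{h}/(\mathbf{h}^{H}\mathbf{h})\), and the exponent sign follows the expression as printed (flipping it conjugates the vectors and leaves this magnitude response unchanged). The function names are hypothetical.

```python
import cmath
import math


def far_field_rtf(theta_deg, f=3000.0, spacing=0.05, c=345.0, num_mics=4):
    """Far-field RTF vector of a uniform linear array, relative to microphone 1."""
    theta = math.radians(theta_deg)
    delays = [m * spacing * math.cos(theta) / c for m in range(num_mics)]
    return [cmath.exp(1j * 2 * math.pi * f * tau) for tau in delays]


def mvdr_white_noise(h):
    """MVDR filter for an identity noise correlation matrix: w = h / (h^H h)."""
    norm_sq = sum(abs(x) ** 2 for x in h)
    return [x / norm_sq for x in h]


def beam_pattern(w, angles_deg):
    """Magnitude response |w^H h(theta)| for each angle in degrees."""
    pattern = []
    for angle in angles_deg:
        h = far_field_rtf(angle)
        pattern.append(abs(sum(wi.conjugate() * hi for wi, hi in zip(w, h))))
    return pattern
```

For example, `beam_pattern(mvdr_white_noise(far_field_rtf(60.0)), range(0, 181, 5))` gives the response of a filter steered to 60^{∘}, which is distortionless in that direction by construction.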
Figure 3 illustrates the resulting beam patterns for two tuning strategies for different values of F(l) (in this case l=1 and hence the dependence on l is omitted). The left-hand plot of Fig. 3 corresponds to a tuning strategy similar to that depicted in Fig. 2a, where there is a trade-off between the two constraints. For this strategy, \(\widetilde {\mu } = \hat {\mu } = 0.2\) and \(\hat {\sigma }^{2}_{\mathrm {s}_{\mathrm {a},1}} = 1\), which means that ε_{AP} and ε_{DD} were fairly close to the x-axis and y-axis respectively. As F increases, the beam pattern is clearly seen to evolve from focusing on the a priori direction of 0^{∘} to eventually that of the data-dependent direction of 60^{∘}. As a linear path is followed, at the midpoint both \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) take similarly large values, which explains the lower magnitude in the beam pattern during the transition.
The right-hand plot of Fig. 3 corresponds to a tuning strategy as depicted in Fig. 2b, i.e. when the APC is always active. As F increases, it can be observed that the beam in the a priori direction of 0^{∘} is maintained, while more gain is attributed to the data-dependent direction of 60^{∘}. In this particular case, however, it is noted that although the response at 60^{∘} is in accordance with the maximum tolerable speech distortion prescribed, there is a slight tilt of the beam towards 68^{∘} as compared to if only the DDC was active. Nevertheless, this can still be a useful tuning strategy for cases when a high confidence is placed on the a priori RTF vector.
6.2 Effect of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\)
In this section, the effect of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) on the behaviour of the integrated MVDR beamformer for the case of an LMA and XMs is further investigated using recorded audio data. A batch processing framework will be applied so as to observe an average performance at a single frequency. In the following section, the processing will be done using a Weighted Overlap and Add (WOLA) framework [37] and a broadband performance will be assessed.
Audio recordings of speech and noise were made in the laboratory room as depicted in Fig. 4, which has a reverberation time of approximately 1.5 s. A Neumann KU100 dummy head was placed in a central location of the room and equipped with two (i.e. left and right) behind-the-ear hearing aids, each consisting of two microphones spaced approximately 1.3 cm apart. Hence, in the following, the LMA is considered as having a total of four microphones, i.e. the stacked left ear and right ear microphones. The first microphone of the left ear hearing aid was used as the reference microphone. Three omnidirectional XMs (two AKG CK32 microphones and one AKG CK97O microphone) were placed at heights of 1 m from the floor and at varying distances from the dummy head as shown in Fig. 4. A Genelec 8030C loudspeaker was placed at 1 m and different azimuth angles from the dummy head to generate a speech signal from a male speaker [38]. The loudspeaker and the dummy head were placed at a height of approximately 1.3 m from the floor (only angles 0^{∘} and 60^{∘} were used as shown in Fig. 4). For the noise, a cocktail party scenario was recreated. With the same configuration of the dummy head and external microphones from Fig. 4, participants stood outside of a 1 m circumference from the dummy head in a random manner (i.e. all participants were not confined to a particular corner in the room). Beverages in glasses as well as snacks were served while the participants engaged in conversation. At any given time, there were nine male participants and six female participants present in the room. A recording of such a scenario was made for approximately 1 h, but a random sample was used in the following analysis.
As opposed to a free-field a priori RTF vector, a more suitable a priori RTF vector for the behind-the-ear hearing aid microphones was obtained from pre-measured impulse responses in the scenario as depicted in Fig. 4. The impulse responses were computed from an exponential sine-sweep measurement with the loudspeaker position at 0^{∘} (the azimuth direction directly in front of the dummy head) and 1 m so that the a priori RTF vector would be defined in accordance with a source located at 0^{∘} and 1 m from the dummy head. The initial section of these impulse responses corresponding to the direct component was extracted, with a length according to the size of the Discrete Fourier Transform (DFT) window to be used in the STFT domain processing. This direct component was then smoothed with a Tukey window and converted to the frequency domain. In each frequency bin, these smoothed frequency domain impulse responses were then scaled with respect to the smoothed frequency domain impulse response of the reference microphone. This was then used as \(\widetilde {\mathbf {h}}_{\mathbf {a}}(k)\) and was kept the same for each time frame.
A scenario was firstly considered for the desired speech source located at 0^{∘} in Fig. 4, i.e. the location where the a priori RTF vector was defined. A 4 s sample of the desired speech signal was mixed with a random sample of the cocktail party noise at a broadband input SNR of 0 dB. For the batch processing framework with a DFT size of 256 samples, R_{yy} and R_{nn} were estimated by time averaging across the entire length of the signal in the respective speech-plus-noise or noise-only frames. Using the SPP [25] from the first microphone of the left ear hearing aid, frames for which the speech was active were chosen if the resulting SPP > 0.85. The RTF vectors, \(\underline {\widetilde {\mathbf {h}}}\) and \(\underline {\hat {\mathbf {h}}}\), were computed according to the procedures described in Sections 3.1 and 3.2. Using CVX [31, 32], the MVDR-INT from (42) was then evaluated for a range of \(0 < \widetilde {\epsilon } < 1.5\) and \(0 < \hat {\epsilon } < 1.5\) at a frequency of 2 kHz.
Figure 5a and b display the resulting (base-10 log) values of the Lagrangian multipliers α and β respectively as a function of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\), along with the APC and DDC bounding curves. These plots support the theoretical analysis of the space spanned by \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\) from Fig. 1. In Fig. 5a, it is clearly observed that as the value of \(\widetilde {\epsilon }\) exceeds the APC bounding curve, then α→0 so that the APC is inactive while the DDC remains active. Similarly, in Fig. 5b, as the value of \(\hat {\epsilon }\) exceeds the DDC bounding curve, then β→0 so that the APC remains active and the DDC is inactive. The regions where both constraints are active, and where neither is active, can also be observed.
Figure 5c and d are plots of the corresponding change in SNR (ΔSNR) from the reference microphone as well as the speech distortion which was computed as follows:
where the first term of the ΔSNR is the output SNR and the second term is the input SNR at the unprocessed reference microphone^{Footnote 12} and in this scenario \(\underline {\mathbf {h}} = \underline {\widetilde {\mathbf {h}}}\). The true value of \(\underline {\mathbf {h}}\) is unknown; hence, the results of Fig. 5c and d are suggestive for the case when the true RTF vector corresponds to that of the a priori assumed RTF vector. In Fig. 5c, since \(\underset {\smallsmile }{\mathbf {w}} \rightarrow \mathbf {0}\) in the region where \(\hat {\epsilon } \geq 1\) and \(\widetilde {\epsilon } \geq 1\), it is purposefully hatched so as to indicate that in this region an output SNR is undefined.
As expected, it can be observed that the best ΔSNR is achieved for the region where the DDC is inactive and the APC is active, with a compromise within the region where the two constraints are active. An interesting observation here is the poor ΔSNR in the region where \(\widetilde {\epsilon } \rightarrow 0\) and \(\hat {\epsilon } \rightarrow 0\). Even though the maximum tolerable speech distortions have been specified to be quite small, in this case \(\underline {\widetilde {\mathbf {h}}}\) and \(\underline {\hat {\mathbf {h}}}\) can be parallel, which can lead to redundant constraints and an illconditioning problem as discussed in [22]. In terms of the SD, fairly low distortions are achieved when either of the constraints are active or when both are active. As both \(\widetilde {\epsilon } \rightarrow 1\) and \(\hat {\epsilon } \rightarrow 1\), the speech distortion increases, which is expected from (70) and (72), i.e. the SDWMWF parameters, \(\widetilde {\mu }\) and \(\hat {\mu }\). As \(\widetilde {\mu } \rightarrow \infty \), \(\widetilde {\epsilon } \rightarrow 1\), and as \(\hat {\mu } \rightarrow \infty \), \(\hat {\epsilon } \rightarrow 1\), which accounts for the increasing speech distortion from Fig. 5d. Another point to highlight in Fig. 5d is that a low speech distortion is also achieved in the region where the APC bounding curve is a minimum, regardless of the value of \(\widetilde {\epsilon }\). As discussed in Section 5.2, for a value of \(\widetilde {\epsilon } > \widetilde {\epsilon }_{o}\) (where \(\widetilde {\epsilon }_{o}\) is the value of \(\widetilde {\epsilon }\) on the APC bounding curve from (60)), the achievable distortion would in fact correspond to \(\widetilde {\epsilon }_{o}\) on the APC bounding curve, which is quite low in this minimum region.
Figure 6 now displays a similar set of results, but for the case when the desired speech source was located at 60° as depicted in Fig. 4. As the a priori RTF vector was based on a speaker located at 0°, this scenario represented a mismatch between the a priori RTF vector and the true RTF vector. The same procedure as previously described was followed to obtain the MVDRINT filters.
Figure 6a and b display the resulting values of the (base-10 log) Lagrangian multipliers α and β, respectively, as a function of \(\widetilde {\epsilon }\) and \(\hat {\epsilon }\), along with the APC and DDC bounding curves. The nature of these plots is quite similar to that of Fig. 5a and b in terms of how α and β vary with respect to the bounding curves. In comparison to Fig. 5a and b, Fig. 6a and b also highlight the fact that these bounding curves can have quite different appearances.
Figure 6c and d display the corresponding ΔSNR and SD, respectively, but with \(\underline {\mathbf {h}} = \underline {\hat {\mathbf {h}}}\) in (76); hence, the results are indicative of the case when the true RTF vector corresponds to the data-dependent RTF vector. Now the best ΔSNR is achieved in the region where the APC is inactive and the DDC is active, with a compromise in the region where both constraints are active. For the SD, fairly low speech distortions are achieved for small values of \(\hat {\epsilon }\), as expected. For small values of \(\widetilde {\epsilon }\) and large values of \(\hat {\epsilon }\), i.e. toward the region where only the APC is active, the speech distortion increases, which is a direct result of the speech source not being in the a priori defined direction of 0°. Once again, the speech distortion generally increases as both \(\widetilde {\epsilon } \rightarrow 1\) and \(\hat {\epsilon } \rightarrow 1\).
The results of Figs. 5 and 6 provide more insight into the behaviour of the MVDRINT and demonstrate that in some scenarios a better performance can be achieved when only the APC or only the DDC is active. Furthermore, transition regions were observed in which a compromise could be achieved between these two limits of performance. This suggests that tuning strategies such as those depicted in Fig. 2 would indeed be appropriate means of obtaining an optimal filter, as opposed to relying on only an APC or a DDC.
6.3 Performance of tuning strategies
The audio recordings previously described for the scenario depicted in Fig. 4 were also used to evaluate the performance of the tuning strategies. A desired speech signal was created in which the desired speech source was initially located at 0° for a duration of 5 s and then instantaneously moved to 60° for another 6 s. This was then mixed with a random sample of the cocktail party noise at a broadband input SNR of 2 dB. The same a priori RTF vector pertaining to the hearing aid microphones, \(\widetilde {\mathbf {h}}_{\mathbf {a}}(k)\), as previously described was used, i.e. \(\widetilde {\mathbf {h}}_{\mathbf {a}}(k)\) was computed for a source located at 0° and 1 m from the dummy head.
For the STFT processing, the WOLA method was used with a DFT size of 256 samples, 50% overlap, a square-root Hann window, and a sampling frequency of 16 kHz. Using the SPP [25] computed on XM2, frames were classified as containing speech if the SPP > 0.8; otherwise, the frames were classified as noise only. All RTF vector estimates were performed in frames classified as containing speech. All the relevant correlation matrices were estimated using a forgetting factor corresponding to an averaging time of 300 ms. \(\mathbf {R}_{\mathbf {nn}}\) was only estimated when the SPP < 0.8.
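The processing parameters above can be gathered into a short sketch of SPP-gated recursive averaging for a single frequency bin. Two points are assumptions rather than statements from this paper: the mapping between averaging time and forgetting factor, taken here as \(T_{\text{avg}} = \text{hop} / ((1 - \lambda) f_{s})\), and the choice to update a speech-plus-noise matrix `R_yy` in speech frames. The function names are hypothetical.

```python
import numpy as np

FS = 16_000          # sampling frequency in Hz (from the text)
N_DFT = 256          # DFT size in samples (from the text)
HOP = N_DFT // 2     # 50% overlap gives a hop of 128 samples
T_AVG = 0.3          # averaging time in seconds (from the text)
SPP_THRESHOLD = 0.8  # speech / noise-only decision threshold (from the text)


def forgetting_factor(t_avg, hop=HOP, fs=FS):
    """Assumed mapping: t_avg = hop / ((1 - lam) * fs)."""
    return 1.0 - hop / (t_avg * fs)


def update_correlation_matrices(R_yy, R_nn, y, spp, lam):
    """One recursive-averaging step for one frequency bin.

    R_yy (speech-plus-noise) is updated in frames classified as speech,
    R_nn (noise-only) in all other frames. Both are assumed conventions.
    """
    outer = np.outer(y, y.conj())
    if spp > SPP_THRESHOLD:
        R_yy = lam * R_yy + (1.0 - lam) * outer
    else:
        R_nn = lam * R_nn + (1.0 - lam) * outer
    return R_yy, R_nn


lam = forgetting_factor(T_AVG)
print(f"hop = {HOP} samples, forgetting factor = {lam:.4f}")
```

Under the assumed mapping, a 300 ms averaging time with a 128-sample hop at 16 kHz gives a forgetting factor of about 0.973; a different convention (for example an exponential mapping) would give a slightly different value.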
For the MVDRINT, two tuning strategies were considered: (i) a trade-off between the maximum tolerable speech distortions for the APC and DDC, corresponding to Fig. 2a, which will be referred to as MVDRINT3a, and (ii) a strategy where the maximum tolerable speech distortion for the APC is constant but that for the DDC varies, corresponding to Fig. 2b, which will be referred to as MVDRINT3b. For both tunings, \(\widetilde {\mu } = \hat {\mu } = 0.001\), and \(\hat {\sigma }^{2}_{\mathrm {s}_{\mathrm {a},1}}\) was computed using the method from [39] as implemented in [40] but with the noise estimation update computed as in [25]. A different setting was used for the confidence metric, F(l) in (62), for each of the tunings: for the MVDRINT3a, \(\rho = 1\) and \(\lambda _{t} = 5\) dB, and for the MVDRINT3b, \(\rho = 1\) and \(\lambda _{t} = 10\) dB, i.e. a higher threshold was used for the MVDRAP tuning. With all parameters assigned, the QCQP problem from (42) was solved using the gradient ascent procedure described in Algorithm 1.
The metrics used to evaluate the following experiments were the speech intelligibility-weighted SNR [41] (SISNR), the short-time objective intelligibility (STOI) [42], and the normalised speech-to-reverberation modulation energy ratio for cochlear implants (SRMRCI) [43]. The SISNR improvement in relation to the reference microphone was calculated as:
\[ \Delta \text{SISNR} = \sum _{i} I_{i} \left ( \text{SNR}_{i,\text{out}} - \text{SNR}_{i,\text{in}} \right ) \]
where the band importance function \(I_{i}\) expresses the importance of the ith one-third octave band with centre frequency \(f_{i}^{c}\) for intelligibility, \(\text{SNR}_{i,\text{in}}\) is the input SNR (dB), and \(\text{SNR}_{i,\text{out}}\) is the output SNR (dB) in the ith one-third octave band. The centre frequencies, \(f_{i}^{c}\), and the values for \(I_{i}\) are defined in [44]. The input SNR was computed using the unprocessed speech-only and unprocessed noise-only components (in the discrete time domain) at the reference microphone, and the output SNR from the individually processed speech-only and processed noise-only components (in the discrete time domain) resulting from the particular algorithm. For the STOI metric, the reference signal used was the unprocessed desired speech source convolved with 256 samples (i.e. the same length as the DFT size) of the (pre-measured) impulse response from the desired speech signal location to the reference microphone. As the room was quite reverberant, however, a true reference signal is somewhat ambiguous to define, and hence, the non-intrusive metric SRMRCI, suitable for hearing instruments and in particular cochlear implants, was also used.
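The quantities defined above combine into a weighted sum of per-band SNR improvements. The following is a minimal sketch of that computation, assuming the standard intelligibility-weighted form \(\sum _{i} I_{i} (\text{SNR}_{i,\text{out}} - \text{SNR}_{i,\text{in}})\). The function name `delta_si_snr` is hypothetical, and the three-band weights and SNR values are invented for illustration; they are not the band importance values of [44].

```python
import numpy as np


def delta_si_snr(band_importance, snr_in_db, snr_out_db):
    """Intelligibility-weighted SNR improvement in dB.

    band_importance: weights I_i per one-third octave band
    snr_in_db, snr_out_db: per-band input and output SNRs in dB
    """
    weights = np.asarray(band_importance, dtype=float)
    snr_in = np.asarray(snr_in_db, dtype=float)
    snr_out = np.asarray(snr_out_db, dtype=float)
    if not (weights.shape == snr_in.shape == snr_out.shape):
        raise ValueError("all inputs must have one value per band")
    return float(np.sum(weights * (snr_out - snr_in)))


# Hypothetical three-band example (placeholder weights that sum to 1).
weights = np.array([0.2, 0.5, 0.3])
snr_in = np.array([0.0, 2.0, 4.0])
snr_out = np.array([5.0, 8.0, 6.0])

improvement = delta_si_snr(weights, snr_in, snr_out)
print(f"SI-SNR improvement: {improvement:.2f} dB")
```

With these placeholder numbers the per-band improvements are 5, 6, and 2 dB, so the weighted result is 4.6 dB.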
Figure 7 displays the performance of the various algorithms, where all the metrics have been computed in 2 s time frames with a 25% overlap. The relative improvements of the SISNR and the STOI metrics in relation to the reference microphone have been plotted. The metrics for XM1 and XM2 from Fig. 4 are also plotted. In order to contextualise the values of the SRMRCI metric, an additional plot of the performance for the reference signal (that which was used for the STOI metric) is displayed. From all the metrics, as expected, the MVDRAP performs better than the MVDRDD in the first 5 s as the speech source was at 0°, i.e. the a priori direction. However, in the latter 6 s, when the speech source was at 60°, the MVDRDD achieves a better performance.
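The segmentation into 2 s frames with 25% overlap can be sketched as follows. How the evaluation treated a trailing partial frame is not stated here, so this sketch simply drops it; the function name `frame_bounds` is hypothetical.

```python
def frame_bounds(n_samples, fs, frame_s=2.0, overlap=0.25):
    """Return (start, end) sample indices of full analysis frames.

    Frames are frame_s seconds long and consecutive frames share the
    given fraction of their length. A trailing partial frame is dropped
    (an assumption, not taken from the text).
    """
    frame_len = int(round(frame_s * fs))
    hop = int(round(frame_len * (1.0 - overlap)))
    last_start = n_samples - frame_len
    return [(s, s + frame_len) for s in range(0, last_start + 1, hop)]


# 11 s of signal at 16 kHz, matching the 5 s + 6 s scenario in the text.
bounds = frame_bounds(n_samples=11 * 16_000, fs=16_000)
for start, end in bounds:
    print(f"{start / 16_000:.1f} s to {end / 16_000:.1f} s")
```

For the 11 s signal this gives seven frames starting every 1.5 s, the last one covering 9 s to 11 s.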
With respect to the XMs, it can also be seen that the performance of XM1 decreases after 5 s as the source moves to 60°, while XM2 has a more consistent performance across the different speech locations. In terms of the Δ SISNR, all of the other algorithms perform better than either of the XMs, which demonstrates that simply listening to an XM alone would not always yield satisfactory performance.
Within the first 5 s, the MVDRINT3a is able to find a compromise between the MVDRAP and MVDRDD in terms of all metrics. In the final 6 s, although the Δ STOI is once again in between the MVDRAP and MVDRDD, the performance in terms of Δ SISNR and SRMRCI is in fact better than either the MVDRAP or the MVDRDD. This is a direct consequence of the nature of the integrated MVDRLMAXM beamformer, as different linear combinations of the MVDRAP and the MVDRDD are effectively applied to different time-frequency segments, yielding a broadband SISNR that can be better than that of either the MVDRAP or the MVDRDD.
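A toy calculation shows why a band-dependent choice can exceed both beamformers on a broadband weighted score. This is only an illustration under invented numbers: it uses a hard per-band selection, which is the extreme case of the linear combinations mentioned above, and the two-band SNR values, weights, and variable names are hypothetical.

```python
import numpy as np

weights = np.array([0.5, 0.5])    # hypothetical band importance weights
snr_ap = np.array([10.0, 2.0])    # per-band output SNR (dB), a priori filter
snr_dd = np.array([3.0, 9.0])     # per-band output SNR (dB), data-dependent filter

# Band-dependent choice: trust the data-dependent filter only in band 2.
use_dd = np.array([False, True])
snr_int = np.where(use_dd, snr_dd, snr_ap)


def weighted_snr(snr_db):
    """Importance-weighted broadband score in dB."""
    return float(np.sum(weights * snr_db))


score_ap = weighted_snr(snr_ap)
score_dd = weighted_snr(snr_dd)
score_int = weighted_snr(snr_int)
print(score_ap, score_dd, score_int)
```

Each fixed filter scores 6 dB here, whereas the band-dependent choice scores 9.5 dB, because it takes the better filter in each band.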
For the MVDRINT3b, within the first 5 s, the performance in terms of all the metrics is closer to that of the MVDRAP, which is expected as the APC is kept active at all times. In the following 6 s, the STOI metric indicates that the speech intelligibility has not changed from that of the MVDRAP. However, an improvement can be observed in both the Δ SISNR and SRMRCI metrics, as some frequency bins would have also had the DDC active.
The corresponding confidence metrics across all time frames and frequencies for the MVDRINT3a and the MVDRINT3b are displayed in Fig. 8. The upper plot corresponds to the confidence metric of MVDRINT3a and reveals that much of the confidence has been placed on the higher frequencies, presumably because there was less noise in this region. Therefore, a smaller value of \(\hat {\epsilon }\) and a larger value of \(\widetilde {\epsilon }\) would have been assigned to the DDC and APC respectively, i.e. the MVDRINT3a in this region would have tended toward the MVDRDD. Several regions of uncertainty are also observed where the MVDRINT3a would then find a compromise between the MVDRAP and the MVDRDD. In the lower plot of Fig. 8, the confidence metric for the MVDRINT3b shows a much more conservative behaviour due to the larger threshold of \(\lambda _{t}\). It is observed that there are now many regions where there is little confidence, and hence a larger value of \(\hat {\epsilon }\) and a smaller value of \(\widetilde {\epsilon }\) would have been assigned to the DDC and APC, respectively, i.e. the MVDRINT3b in these regions would have tended toward the MVDRAP. More confidence is now only placed in the higher frequency region and there are still some regions of uncertainty so that a compromise can be achieved. The resulting audio signals from this section may also be listened to for a subjective evaluation at [45] (see Note 13).
7 Conclusion
An integrated MVDR beamformer that merges the benefits of using an available a priori relative transfer function (RTF) vector and a data-dependent RTF vector was developed for a microphone configuration consisting of a local microphone array (LMA) and multiple external microphones (XMs). The framework has been presented in a prewhitened-transformed (PWT) domain, which consists of an initial transformation of the microphone signals through a blocking matrix and a fixed beamformer, followed by a pre-whitening operation, facilitating convenient processing operations. In the PWT domain, procedures for obtaining an a priori RTF vector and a data-dependent RTF vector have also been derived, where the former is based on an a priori RTF vector pertaining to the LMA only.
With the two RTF vectors, an integrated MVDR beamformer was proposed by formulating a quadratically constrained quadratic program (QCQP) with two constraints, one of which is related to the maximum tolerable speech distortion for the imposition of the a priori RTF vector and the other to the maximum tolerable speech distortion for the imposition of the data-dependent RTF vector. It was shown how the space spanned by these maximum tolerable speech distortions could be divided into four separate regions, each of which corresponded to a particular set of constraints being active or inactive. This insight then facilitated the development of a general tuning framework where the maximum tolerable speech distortions are chosen in accordance with the confidence in the accuracy of the data-dependent RTF vector. A particular set of tuning rules was also proposed, which made use of a relationship to the speech distortion weighted multichannel Wiener filter.
The potential of the integrated MVDR beamformer was demonstrated by using audio data from an LMA of behind-the-ear hearing aid microphones and three XMs for a single desired speech source within a recreated cocktail party scenario. A narrowband evaluation confirmed the theoretical behaviour of the integrated MVDR beamformer as a function of the maximum tolerable speech distortion parameters. A broadband evaluation has shown that the integrated MVDR beamformer can be tuned to yield different enhanced speech signals, which may be suitable for improving speech intelligibility despite changes in the desired speech source position and imperfectly estimated spatial correlation matrices.
Availability of data and materials
The microphone data analysed in the current study as well as audio samples of the processed signals are available at [45]. Further materials are also available from the corresponding author upon request.
Change history
06 April 2021
A Correction to this paper has been published: https://doi.org/10.1186/s13636-021-00202-x
Notes
1. Reverberation is not explicitly included in the signal model as dereverberation is not addressed in this paper. This paper primarily focuses on noise reduction, although some dereverberation will be achieved as a fortunate byproduct of beamforming.
2. The dependence on k and l is included here as a reminder and for completeness in the signal model. It will be dropped again unless explicitly required.
3. Since the sequence of operations from w to \(\underset {\smallsmile }{\mathbf {w}}\) is not exactly that of a PWT signal vector, a slightly different notation is used for this quantity.
4. It is acknowledged that there is a slight abuse of notation here, as the estimate for \(\underline {\widetilde {\mathbf {h}}}\) should be denoted as \(\hat {\underline {\widetilde {\mathbf {h}}}}\). However, in favour of legibility, and to stress that the estimation is done in accordance with the a priori assumptions set by \(\widetilde {\mathbf {h}}_{\mathbf {a}}\), the notation is maintained as \(\underline {\widetilde {\mathbf {h}}}\).
5. It can also be expressed as a convex combination of various beamformers as discussed in [22].
6. The square root has been taken on both sides of the inequality from (42) in order to simplify the derivations that follow.
7. Recall that \(\underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}}\) can be equivalently expressed as \(\underset {\smallsmile }{\hat {\mathbf {w}}}_{\text {mvdr}} = \hat {\eta }_{q}^{*} \hspace {0.05cm} \hat {\mathbf {Q}} \mathbf {e}_{1}\).
8. The time index is reintroduced here to reinforce that these quantities are to be computed in each time frame. All frequencies are still treated equivalently.
9. "Achievable" here is meant to differentiate between the actual speech distortion that is obtained and the maximum tolerable value that was specified.
10. This means that only the microphone signals, without any processing, are captured.
11. The complication arises in that some of the XMs can be in the near-field with respect to the desired source. A visualisation can nevertheless be created, but will have to be considered within a plane or volume with Cartesian coordinates.
12. The numerator of this term is 1 since the first component of the RTF vector for the unprocessed microphone signals is 1.
13. Audio samples are also uploaded for the case when the SPP was computed on the reference microphone.
Abbreviations
APC: A priori constraint
DDC: Data-dependent constraint
EVD: Eigenvalue decomposition
GEVD: Generalised eigenvalue decomposition
LMA: Local microphone array
MVDR: Minimum variance distortionless response
MVDRDD: Fully data-dependent MVDR beamformer
MVDRAP: MVDR beamformer based on a priori knowledge
MVDRINT: Integrated MVDR beamformer
MWF: Multichannel Wiener filter
PWT: Prewhitened-transformed
QCQP: Quadratically constrained quadratic program
RTF: Relative transfer function
SNR: Signal-to-noise ratio
SISNR: Speech intelligibility-weighted SNR
SPP: Speech presence probability
SRMRCI: Speech-to-reverberation modulation energy ratio for cochlear implants
STOI: Short-time objective intelligibility
XM: External microphone
WOLA: Weighted overlap-add
References
1. M. Brandstein, D. B. Ward, Microphone Arrays: Signal Processing, Techniques and Applications (Springer, New York, 2001).
2. S. Gannot, E. Vincent, S. Markovich-Golan, A. Ozerov, A consolidated perspective on multimicrophone speech enhancement and source separation. IEEE/ACM Trans. Audio Speech Lang. Process. 25(4), 692–730 (2017).
3. E. Vincent, T. Virtanen, S. Gannot, Audio Source Separation and Speech Enhancement (Wiley, Chichester, West Sussex, 2018).
4. J. Szurley, A. Bertrand, B. van Dijk, M. Moonen, Binaural noise cue preservation in a binaural noise reduction system with a remote microphone signal. IEEE/ACM Trans. Audio Speech Lang. Process. 24(5), 952–966 (2016).
5. N. Gößling, S. Doclo, in 2018 16th Int. Workshop on Acoustic Signal Enhancement (IWAENC). Relative transfer function estimation exploiting spatially separated microphones in a diffuse noise field (Tokyo, 2018), pp. 146–150.
6. N. Gößling, S. Doclo, in Speech Communication; 13th ITG-Symposium. RTF-based binaural MVDR beamformer exploiting an external microphone in a diffuse noise field (Oldenburg, Germany, 2018), pp. 1–5.
7. N. Cvijanovic, O. Sadiq, S. Srinivasan, Speech enhancement using a remote wireless microphone. IEEE Trans. Consum. Electron. 59(1), 167–174 (2013).
8. D. Yee, H. Kamkar-Parsi, R. Martin, H. Puder, A noise reduction postfilter for binaurally-linked single-microphone hearing aids utilizing a nearby external microphone. IEEE/ACM Trans. Audio Speech Lang. Process. 26(1), 5–18 (2017).
9. R. Ali, G. Bernardi, T. van Waterschoot, M. Moonen, Methods of extending a generalized sidelobe canceller with external microphones. IEEE/ACM Trans. Audio Speech Lang. Process. 27(9), 1349–1364 (2019).
10. A. Bertrand, M. Moonen, Robust distributed noise reduction in hearing aids with external acoustic sensor nodes. EURASIP J. Adv. Signal Process. 2009, 1–14 (2009).
11. A. Hassani, Distributed signal processing algorithms for multitask wireless acoustic sensor networks. PhD thesis, KU Leuven (2017).
12. S. Markovich-Golan, S. Gannot, I. Cohen, Distributed multiple constraints generalized sidelobe canceler for fully connected wireless acoustic sensor networks. IEEE Trans. Audio Speech Lang. Process. 21(2), 343–356 (2013).
13. J. Capon, High-resolution frequency-wavenumber spectrum analysis. Proc. IEEE 57(8), 1408–1418 (1969).
14. H. L. Van Trees, Optimum Array Processing (Wiley, Hoboken, 2001).
15. S. Gannot, D. Burshtein, E. Weinstein, Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49(8), 1614–1626 (2001).
16. J. E. Greenberg, P. M. Zurek, Evaluation of an adaptive beamforming method for hearing aids. J. Acoust. Soc. Amer. 91(3), 1662–1676 (1992).
17. J. M. Kates, M. R. Weiss, A comparison of hearing-aid array-processing techniques. J. Acoust. Soc. Amer. 99(5), 3138–3148 (1996).
18. M. Kompis, N. Dillier, Performance of an adaptive beamforming noise reduction scheme for hearing aid applications. I. Prediction of the signal-to-noise-ratio improvement. J. Acoust. Soc. Amer. 109(3), 1123–1133 (2001).
19. A. Spriet, L. Van Deun, K. Eftaxiadis, J. Laneau, M. Moonen, B. van Dijk, A. van Wieringen, J. Wouters, Speech understanding in background noise with the two-microphone adaptive beamformer BEAM in the Nucleus Freedom cochlear implant system. Ear Hear. 28(1), 62–72 (2007).
20. I. Cohen, Relative transfer function identification using speech signals. IEEE Trans. Speech Audio Process. 12(5), 451–459 (2004).
21. S. Markovich-Golan, S. Gannot, in 2015 IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP). Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method (Brisbane, 2015), pp. 544–548.
22. R. Ali, T. van Waterschoot, M. Moonen, Integration of a priori and estimated constraints into an MVDR beamformer for speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 27(12), 2288–2300 (2019).
23. L. Griffiths, C. Jim, An alternative approach to linearly constrained adaptive beamforming. IEEE Trans. Antennas Propag. 30(1), 27–34 (1982).
24. S. Van Gerven, F. Xie, in Proc. EUROSPEECH, vol. 3. A comparative study of speech detection methods (Ródos, Greece, 1997), pp. 1095–1098.
25. T. Gerkmann, R. C. Hendriks, in Proc. 2011 IEEE Workshop Appls. Signal Process. Audio Acoust. (WASPAA ’11). Noise power estimation based on the probability of speech presence (2011), pp. 145–148.
26. I. Markovsky, Low Rank Approximation: Algorithms, Implementation, Applications (Springer, Heidelberg, 2012).
27. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, New York, 2004).
28. W. C. Liao, Z. Q. Luo, I. Merks, T. Zhang, in Proc. 2015 IEEE Workshop Appls. Signal Process. Audio Acoust. (WASPAA ’15). An effective low complexity binaural beamforming algorithm for hearing aids (IEEE, New Paltz, 2015), pp. 1–5.
29. W. C. Liao, M. Hong, I. Merks, T. Zhang, Z. Q. Luo, in 2015 IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP). Incorporating spatial information in binaural beamforming for noise suppression in hearing aids (Brisbane, QLD, 2015), pp. 5733–5737.
30. M. Souden, J. Benesty, S. Affes, On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE Trans. Audio Speech Lang. Process. 18(2), 260–276 (2010).
31. M. Grant, S. Boyd, CVX: Matlab Software for Disciplined Convex Programming, version 2.1 (2014). http://cvxr.com/cvx. Accessed May 2020.
32. M. Grant, S. Boyd, in Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, ed. by V. Blondel, S. Boyd, H. Kimura. Graph implementations for nonsmooth convex programs (Springer-Verlag, London, 2008), pp. 95–110.
33. S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011).
34. J. Nocedal, S. J. Wright, Numerical Optimization, 2nd edn (Springer, New York, 2006).
35. A. Spriet, M. Moonen, J. Wouters, Spatially preprocessed speech distortion weighted multichannel Wiener filtering for noise reduction. Signal Process. 84(12), 2367–2387 (2004).
36. S. Doclo, S. Gannot, M. Moonen, A. Spriet, Handbook on Array Processing and Sensor Networks (Wiley, Hoboken, 2010). Chap. 10: Acoustic beamforming for hearing aid applications.
37. R. Crochiere, A weighted overlap-add method of short-time Fourier analysis/synthesis. IEEE Trans. Acoust. Speech Signal Process. 28(1), 99–102 (1980).
38. C. Veaux, J. Yamagishi, K. MacDonald, CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (2016). http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html. Accessed Dec 2019.
39. Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 33(2), 443–445 (1985).
40. M. Brookes, et al., Voicebox: speech processing toolbox for Matlab (Imperial College, London, 1997). http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.
41. J. E. Greenberg, P. M. Peterson, P. M. Zurek, Intelligibility-weighted measures of speech-to-interference ratio and speech system performance. J. Acoust. Soc. Amer. 94(5), 3009–3010 (1993).
42. C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19(7), 2125–2136 (2011).
43. J. F. Santos, T. H. Falk, Updating the SRMR-CI metric for improved intelligibility prediction for cochlear implant users. IEEE/ACM Trans. Audio Speech Lang. Process. 22(12), 2197–2206 (2014).
44. American National Standards Institute, American National Standard Methods for calculation of the speech intelligibility index (Acoustical Society of America, 1997). https://webstore.ansi.org/standards/asa/ansiasas31997r2017. Accessed 6 June 1997.
45. R. Ali (2020). ftp://ftp.esat.kuleuven.be/pub/SISTA/rali/Reports/public_data_mvdrint. Accessed May 2020.
Acknowledgements
This research work was carried out at the ESAT Laboratory of KU Leuven, in the frame of IWT O&O Project nr. 150432 ‘Advances in Auditory Implants: Signal Processing and Clinical Aspects’, KU Leuven Impulsfonds IMP/14/037, KU Leuven C21600449 ‘Distributed Digital Signal Processing for Ad-hoc Wireless Local Area Audio Networking’, and KU Leuven Internal Funds VES/16/032. The research leading to these results has also received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation program/ERC Consolidator Grant: SONORA (no. 773268). This paper reflects only the authors’ views and the Union is not liable for any use that may be made of the contained information.
Author information
Contributions
RA, TvW, and MM conceptualised and analysed the QCQP framework and tuning strategy. RA drafted the manuscript, implemented the algorithms in software, and conducted the experiments. All authors have interpreted the results and reviewed the final manuscript. The authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
An error was identified in the readability of equation 45
Rights and permissions
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Ali, R., van Waterschoot, T. & Moonen, M. An integrated MVDR beamformer for speech enhancement using a local microphone array and external microphones. J AUDIO SPEECH MUSIC PROC. 2021, 10 (2021). https://doi.org/10.1186/s13636-020-00192-2
DOI: https://doi.org/10.1186/s13636-020-00192-2