An integrated MVDR beamformer for speech enhancement using a local microphone array and external microphones

An integrated version of the minimum variance distortionless response (MVDR) beamformer for speech enhancement using a microphone array has been recently developed, which merges the benefits of imposing constraints defined from both a relative transfer function (RTF) vector based on a priori knowledge and an RTF vector based on a data-dependent estimate. In this paper, the integrated MVDR beamformer is extended for use with a microphone configuration where a microphone array, local to a speech processing device, has access to the signals from multiple external microphones (XMs) randomly located in the acoustic environment. The integrated MVDR beamformer is reformulated as a quadratically constrained quadratic program (QCQP) with two constraints, one of which is related to the maximum tolerable speech distortion for the imposition of the a priori RTF vector and the other related to the maximum tolerable speech distortion for the imposition of the data-dependent RTF vector. An analysis of how these maximum tolerable speech distortions affect the behaviour of the QCQP is presented, followed by the discussion of a general tuning framework. The integrated MVDR beamformer is then evaluated with audio recordings from behind-the-ear hearing aid microphones and three XMs for a single desired speech source in a noisy environment. In comparison to relying solely on an a priori RTF vector or a data-dependent RTF vector, the results demonstrate that the integrated MVDR beamformer can be tuned to yield different enhanced speech signals, which may be more suitable for improving speech intelligibility despite changes in the desired speech source position and imperfectly estimated spatial correlation matrices.


Introduction
Speech processing devices such as a hearing aid, a cochlear implant, or a mobile telephone are commonly equipped with an array of microphones to capture the acoustic environment. The received microphone signals are often a mixture of a desired speech signal plus some undesired noise (any combination of interfering speakers, background noise, and reverberation). As the quality of the captured speech degrades in such conditions, the use of additional, randomly placed microphones to increase the spatial sampling of the acoustic environment has gained interest [4][5][6][7][8][9][10][11][12].
In this paper, a specific ad hoc microphone configuration is considered, where a microphone array located on some speech processing device, hereafter referred to as a local microphone array (LMA), is linked with multiple remote or external microphones (XMs) in a centralised processing framework, i.e. all microphone signals are sent to a fusion centre for processing. The terminology of a local microphone array is introduced since the microphone array is considered to be confined or fixed within some area of the acoustic environment relative to the XMs which are subject to movement. When there is a single desired speech source, speech enhancement can be accomplished by using the minimum variance distortionless response (MVDR) beamformer [13,14]. One of the important quantities required for computing the MVDR beamformer is a vector of acoustic transfer functions from the desired speech source to all of the microphones. More commonly, however, a vector of relative transfer functions (RTFs) is used instead, which is a normalised version of the acoustic transfer function vector with respect to some reference microphone [15]. In practice, for an LMA, this RTF vector may be measured a priori or based on assumptions regarding microphone characteristics, position, speaker location, and room acoustics (e.g. no reverberation). For instance, in assistive hearing devices, it is sometimes assumed that the desired speech source location is known and this knowledge can be subsequently used to define an a priori RTF vector [16][17][18][19]. Alternatively, it may be estimated in an online fashion from the observed microphone data [20,21] so that it is a fully data-dependent estimate.
The situation under consideration throughout this paper is one in which an a priori RTF vector pertaining only to the LMA is available, which may or may not be sufficiently accurate with respect to the true RTF vector. In cases where the a priori RTF vector is not sufficiently accurate, incorporating a data-dependent RTF vector can be viewed as an opportunity for improved performance, provided that the data-dependent RTF vector is a better estimate of the true RTF vector. On the other hand, when acoustic conditions are adverse enough to significantly affect the accuracy of the data-dependent RTF vector, relying on the a priori RTF vector can be viewed as a fallback or contingency strategy.
It would therefore seem advantageous to use both an a priori and a data-dependent RTF vector in practice. Such an approach has recently been investigated for an LMA only, resulting in an integrated version of the MVDR beamformer [22]. As opposed to imposing either the a priori RTF vector or the data-dependent RTF vector as a hard constraint, both were softened and incorporated into an unconstrained optimisation problem. It was demonstrated that the resulting integrated MVDR beamformer is a convex combination of an MVDR beamformer that uses the a priori RTF vector, an MVDR beamformer that uses the data-dependent RTF vector, a linearly constrained minimum variance (LCMV) beamformer that uses both the a priori and data-dependent RTF vectors, and an all-zero vector, each with real-valued weightings, revealing the versatile nature of such an integrated beamformer.
This paper therefore re-examines the integrated MVDR beamformer for the ad hoc microphone configuration consisting of an LMA located on some speech processing device linked with multiple XMs. Specifically, the integrated MVDR beamformer is reformulated from an alternative perspective, namely that of a quadratically constrained quadratic program (QCQP). This QCQP will consist of two constraints, one of which is related to the maximum tolerable speech distortion for the imposition of the a priori RTF vector and the other related to the maximum tolerable speech distortion for the imposition of the data-dependent RTF vector. With respect to the procedures for obtaining the RTF vectors, it is straightforward to obtain a data-dependent RTF vector; however, the notion of an a priori RTF vector when XMs are used with an LMA is a bit more ambiguous. In particular, since only partial a priori knowledge is usually available for the part of the RTF vector pertaining to the LMA, the other part pertaining to the XMs will have to be a data-dependent estimate and hence a procedure based on partial a priori knowledge [9] would be necessary. As a result, an integrated MVDR beamformer for a microphone configuration with an LMA and XMs will merge an a priori RTF vector that is based on partial a priori knowledge and a fully data-dependent one.
With the a priori and the data-dependent RTF vector for the LMA and XMs estimated, it will become evident that the optimal filter from the integrated MVDR beamformer, formulated as a QCQP, is identical to that which was derived from [22], where the Lagrangian multipliers associated with the QCQP are equivalent to the tuning parameters that have been considered in [22]. The additional insight of the QCQP formulation is that these tuning parameters or Lagrangian multipliers can be related to a maximum tolerable speech distortion for the imposition of the a priori or the data-dependent RTF vector. An analysis of this relationship is provided, which facilitates the tuning of the integrated MVDR beamformer from the more intuitive perspective of the maximum tolerable speech distortions as opposed to the combination of filters as in [22]. A general tuning framework will then be discussed along with the suggestion of some particular tuning strategies.
The integrated MVDR beamformer is then evaluated with audio recordings from behind-the-ear hearing aid microphones (the LMA) and three XMs for a single desired speech source in a re-created cocktail party scenario. The results demonstrate that the integrated MVDR beamformer can be tuned to yield different enhanced speech signals, which can find a compromise between relying solely on an a priori RTF vector or a data-dependent RTF vector, and hence may be more suitable for improving speech intelligibility despite changes in the desired speech source position and imperfectly estimated spatial correlation matrices.
The paper is organised as follows. In Section 2, the data model is defined. In Section 3, the MVDR beamformer as applied to an LMA with XMs is discussed along with the procedures for obtaining the a priori RTF vector based on partial a priori knowledge and the data-dependent RTF vector. Section 4 reformulates the integrated MVDR beamformer as a QCQP and provides an analysis of the effect of the maximum tolerable speech distortions due to the imposition of the a priori RTF vector and the data-dependent RTF vector. In Section 5, a general tuning framework is presented, as well as some suggested tuning strategies. In Section 6, the integrated MVDR approach is analysed and evaluated with both simulated data as well as experimental data involving the use of behind-the-ear hearing aid microphones and three XMs. Conclusions are then drawn in Section 7.

Unprocessed signals
A microphone configuration consisting of an LMA of $M_a$ microphones plus $M_e$ XMs is considered with one desired speech source in a noisy, reverberant¹ environment. In the short-time Fourier transform (STFT) domain, the observed vector of microphone signals at frequency bin $k$ and time frame $l$ is represented as:

$$\mathbf{y}(k,l) = \mathbf{x}(k,l) + \mathbf{n}(k,l) \qquad (1)$$

where (dropping the dependency on $k$ and $l$ for brevity) $\mathbf{y} = [\,\mathbf{y}_a^T \;\; \mathbf{y}_e^T\,]^T$, $\mathbf{x} = \mathbf{h}\,s_{a,1}$ represents the desired speech contribution, with $\mathbf{h}$ the $(M_a+M_e)\times 1$ RTF vector and $s_{a,1}$ the desired speech signal in the first (reference) microphone of the LMA, and $\mathbf{n} = [\,\mathbf{n}_a^T \;\; \mathbf{n}_e^T\,]^T$ represents the noise contribution. Variables with the subscript "a" refer to the LMA and variables with the subscript "e" refer to the XMs.

¹ Reverberation is not explicitly included in the signal model as dereverberation is not addressed in this paper. This paper primarily focuses on noise reduction, although some dereverberation will be achieved as a fortunate by-product of beamforming.
The $(M_a+M_e)\times(M_a+M_e)$ spatial correlation matrices for the speech-plus-noise, noise-only, and speech-only signals are defined respectively as:

$$\mathbf{R}_{yy} = E\{\mathbf{y}\mathbf{y}^H\}, \qquad \mathbf{R}_{nn} = E\{\mathbf{n}\mathbf{n}^H\}, \qquad \mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^H\} \qquad (2)\text{--}(4)$$

where $E\{\cdot\}$ is the expectation operator and $\{\cdot\}^H$ is the Hermitian transpose. With the assumption of a single desired speech source from (1), $\mathbf{R}_{xx}$ can be represented as a rank-1 correlation matrix as follows:

$$\mathbf{R}_{xx} = \sigma^2_{s_{a,1}}\,\mathbf{h}\mathbf{h}^H \qquad (5)$$

where $\sigma^2_{s_{a,1}} = E\{|s_{a,1}|^2\}$ is the desired speech power spectral density in the first microphone of the LMA. It is further assumed that the desired speech signal is uncorrelated with the noise signal, and hence $\mathbf{R}_{yy} = \mathbf{R}_{xx} + \mathbf{R}_{nn}$. The speech-plus-noise, noise-only, and speech-only correlation matrices can also be defined solely for the LMA signals respectively as $\mathbf{R}_{y_ay_a} = E\{\mathbf{y}_a\mathbf{y}_a^H\}$, $\mathbf{R}_{n_an_a} = E\{\mathbf{n}_a\mathbf{n}_a^H\}$, and $\mathbf{R}_{x_ax_a} = E\{\mathbf{x}_a\mathbf{x}_a^H\}$, with $\mathbf{R}_{x_ax_a}$ also having the same rank-1 structure as in (5). It is assumed that all signal correlations can be estimated as if all signals were available in a centralised processor, i.e. a perfect communication link is assumed between the LMA and the XMs with no bandwidth constraints and synchronous sampling rates.
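As a quick numerical illustration of the rank-1 speech model in (5), the following sketch builds the speech correlation matrix from an arbitrary synthetic RTF vector (the vector, PSD value, and white noise are purely illustrative assumptions) and checks its structure:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 5                          # Ma + Me microphones (example size)
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h = h / h[0]                   # RTF vector, normalised to the reference mic
sigma2 = 2.0                   # assumed desired speech PSD at the reference mic

# Rank-1 speech correlation matrix (Eq. (5)) plus uncorrelated additive noise
R_xx = sigma2 * np.outer(h, h.conj())
R_nn = np.eye(M)
R_yy = R_xx + R_nn

assert np.linalg.matrix_rank(R_xx) == 1
assert np.isclose(R_xx[0, 0].real, sigma2)   # PSD at the reference microphone
assert np.allclose(R_yy - R_nn, R_xx)
```

Because the RTF vector is normalised to the reference microphone ($h_1 = 1$), the $(1,1)$ entry of the rank-1 matrix recovers the reference-microphone speech PSD directly.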
The estimate of the desired speech signal in the first microphone of the LMA, $z_1$, is then obtained through a linear filtering of the microphone signals, such that:

$$z_1 = \mathbf{w}^H\mathbf{y} \qquad (6)$$

where $\mathbf{w} = [\,\mathbf{w}_a^T \;\; \mathbf{w}_e^T\,]^T$ is a complex-valued filter.

Pre-whitened-transformed domain
As a pre-processing stage, the unprocessed microphone signals can first be transformed with the available a priori RTF vector for the LMA signals and then spatially pre-whitened using the resulting transformed noise-only correlation matrix, yielding a vector of pre-whitened-transformed (PWT) microphone signals. As discussed in [9] and subsequently reviewed in Section 3.1, these pre-processing steps essentially compress the $M_a$ LMA signals into one signal. This signal is then used with the pre-processed $M_e$ XM signals to obtain an estimate for the missing part of the RTF vector pertaining to the XMs when there is an available a priori RTF vector for the LMA. Therefore, PWT microphone signals will be adopted for convenience throughout this paper. To define the transformation operation, an $M_a\times(M_a-1)$ blocking matrix, $\mathbf{C}_a$, and an $M_a\times 1$ fixed beamformer, $\mathbf{f}_a$, are first defined such that:

$$\mathbf{C}_a^H\bar{\mathbf{h}}_a = \mathbf{0}, \qquad \mathbf{f}_a^H\bar{\mathbf{h}}_a = 1 \qquad (7)$$

where $\bar{\mathbf{h}}_a$ is an available a priori RTF vector (some pre-determined estimate or approximation of $\mathbf{h}_a$), and the notation $\bar{(\cdot)}$ refers to quantities based on available a priori knowledge. Using $\mathbf{C}_a$ and $\mathbf{f}_a$, an $(M_a+M_e)\times(M_a+M_e)$ transformation matrix, $\boldsymbol{\Upsilon}$, can be defined as:

$$\boldsymbol{\Upsilon} = \begin{bmatrix} \boldsymbol{\Upsilon}_a & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_{M_e} \end{bmatrix} \qquad (8)$$

where $\boldsymbol{\Upsilon}_a = [\,\mathbf{C}_a \;\; \mathbf{f}_a\,]$ and in general $\mathbf{I}_\vartheta$ denotes the $\vartheta\times\vartheta$ identity matrix. Consequently, the transformed speech-plus-noise signals and the transformed noise-only signals are defined respectively as:

$$\boldsymbol{\Upsilon}^H\mathbf{y} = \begin{bmatrix} \mathbf{C}_a^H\mathbf{y}_a \\ \mathbf{f}_a^H\mathbf{y}_a \\ \mathbf{y}_e \end{bmatrix}, \qquad \boldsymbol{\Upsilon}^H\mathbf{n} = \begin{bmatrix} \mathbf{C}_a^H\mathbf{n}_a \\ \mathbf{f}_a^H\mathbf{n}_a \\ \mathbf{n}_e \end{bmatrix} \qquad (9)$$

This transformation domain simply consists of the LMA signals passed through a blocking matrix and a fixed beamformer, as in the first stage of a typical generalised sidelobe canceller (i.e. the adaptive implementation of an MVDR beamformer) [23], along with the unprocessed XM signals.
A spatial pre-whitening operation can now be defined from the transformed noise-only correlation matrix by using the Cholesky decomposition:

$$\boldsymbol{\Upsilon}^H\mathbf{R}_{nn}\boldsymbol{\Upsilon} = \mathbf{L}\mathbf{L}^H \qquad (10)$$

where $\mathbf{L}$ is an $(M_a+M_e)\times(M_a+M_e)$ lower triangular matrix. A transformed signal vector can then be pre-whitened by pre-multiplying it with $\mathbf{L}^{-1}$ and will be denoted with an underbar $\underline{(\cdot)}$, e.g.:

$$\underline{\mathbf{y}} = \mathbf{L}^{-1}\boldsymbol{\Upsilon}^H\mathbf{y} \qquad (11)$$

Hence, the signal model for the unprocessed microphone signals from (1) can be expressed in the PWT domain as²:

$$\underline{\mathbf{y}}(k,l) = \underbrace{\underline{\mathbf{h}}(k,l)\,s_{a,1}(k,l)}_{\underline{\mathbf{x}}(k,l)} + \underline{\mathbf{n}}(k,l) \qquad (12)$$

where $\underline{\mathbf{y}} = [\,\underline{\mathbf{y}}_a^T \;\; \underline{\mathbf{y}}_e^T\,]^T$ consists of the PWT LMA and XM signals, $\underline{\mathbf{n}} = \mathbf{L}^{-1}\boldsymbol{\Upsilon}^H\mathbf{n}$, the PWT RTF vector $\underline{\mathbf{h}} = \mathbf{L}^{-1}\boldsymbol{\Upsilon}^H\mathbf{h}$, and the respective correlation matrices are:

$$\underline{\mathbf{R}}_{yy} = \mathbf{L}^{-1}\boldsymbol{\Upsilon}^H\mathbf{R}_{yy}\boldsymbol{\Upsilon}\mathbf{L}^{-H}, \qquad \underline{\mathbf{R}}_{xx} = \sigma^2_{s_{a,1}}\,\underline{\mathbf{h}}\,\underline{\mathbf{h}}^H, \qquad \underline{\mathbf{R}}_{nn} = \mathbf{I}_{(M_a+M_e)} \qquad (13)\text{--}(15)$$

² The dependence on $k$ and $l$ is included here as a reminder and for completeness in the signal model. It will be dropped again unless explicitly required.
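The transformation and pre-whitening steps can be sketched numerically. The construction below is illustrative: the a priori RTF vector, array sizes, and noise covariance are arbitrary assumptions, the blocking matrix is built from an SVD-based null space, and the matched filter $\mathbf{f}_a = \bar{\mathbf{h}}_a/\|\bar{\mathbf{h}}_a\|^2$ is one of several valid choices satisfying (7):

```python
import numpy as np

rng = np.random.default_rng(0)
Ma, Me = 3, 2                 # LMA mics and external mics (example sizes)
M = Ma + Me

# Assumed a priori RTF vector for the LMA (reference mic first)
h_bar_a = np.array([1.0, 0.8 - 0.1j, 0.6 + 0.2j])

# Blocking matrix C_a: columns span the null space of h_bar_a^H, so C_a^H h_bar_a = 0
_, _, Vh = np.linalg.svd(h_bar_a[None, :].conj())
C_a = Vh[1:, :].conj().T                    # Ma x (Ma-1)

# Fixed (matched) beamformer f_a with f_a^H h_bar_a = 1
f_a = h_bar_a / (h_bar_a.conj() @ h_bar_a)

# Transformation matrix Upsilon = blkdiag([C_a f_a], I_Me)
Ups_a = np.hstack([C_a, f_a[:, None]])
Ups = np.block([[Ups_a, np.zeros((Ma, Me))],
                [np.zeros((Me, Ma)), np.eye(Me)]])

# Example noise covariance and its transformed version
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_nn = A @ A.conj().T + np.eye(M)           # Hermitian positive definite
R_nn_t = Ups.conj().T @ R_nn @ Ups          # transformed noise covariance

# Spatial pre-whitening via the Cholesky factor L (R_nn_t = L L^H)
L = np.linalg.cholesky(R_nn_t)
Linv = np.linalg.inv(L)
R_nn_pwt = Linv @ R_nn_t @ Linv.conj().T    # identity after pre-whitening

assert np.allclose(C_a.conj().T @ h_bar_a, 0)        # blocking property
assert np.isclose(f_a.conj() @ h_bar_a, 1.0)         # distortionless property
assert np.allclose(R_nn_pwt, np.eye(M), atol=1e-8)   # whitened noise covariance
```

The final assertion is the PWT property used throughout: after transformation and pre-whitening, the noise-only correlation matrix becomes the identity.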
where the expression for $\underline{\mathbf{R}}_{nn}$ is a direct consequence of (10). With the assumption of the desired speech source and noise being uncorrelated, it also holds that $\underline{\mathbf{R}}_{yy} = \underline{\mathbf{R}}_{xx} + \underline{\mathbf{R}}_{nn}$. In the PWT domain, the estimate of the desired speech signal in the first microphone of the LMA, $z_1$, which is equivalent to (6), is then obtained through a linear filtering of the PWT microphone signals, such that:

$$z_1 = \underline{\mathbf{w}}^H\underline{\mathbf{y}} \qquad (16)$$

where $\underline{\mathbf{w}} = \mathbf{L}^H\boldsymbol{\Upsilon}^{-1}\mathbf{w}$ is a complex-valued filter³.

MVDR with an LMA and XMs
The MVDR beamformer minimises the noise power spectral density after filtering (minimum variance), subject to the constraint that the desired speech signal is not distorted (distortionless response), as specified by an appropriate RTF vector. For the unprocessed microphone signals, the MVDR beamformer problem can be formulated as:

$$\min_{\mathbf{w}} \; \mathbf{w}^H\mathbf{R}_{nn}\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^H\mathbf{h} = 1 \qquad (17)$$

The solution to (17) yields the optimal filter:

$$\mathbf{w}_{\text{mvdr}} = \frac{\mathbf{R}_{nn}^{-1}\mathbf{h}}{\mathbf{h}^H\mathbf{R}_{nn}^{-1}\mathbf{h}} \qquad (18)$$

with the desired speech signal estimate $z_1 = \mathbf{w}_{\text{mvdr}}^H\mathbf{y}$. In practice, both $\mathbf{R}_{nn}$ and $\mathbf{h}$ are unknown and hence must be estimated.
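The MVDR solution of (18) can be checked numerically. The RTF vector and noise covariance below are synthetic assumptions; the sketch verifies the distortionless response and the minimum-variance property against an arbitrary competing distortionless filter:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5  # total number of microphones (example)

# Synthetic RTF vector (reference mic first) and Hermitian PD noise covariance
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
h = h / h[0]                                    # normalise to the reference mic
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_nn = A @ A.conj().T + np.eye(M)

# MVDR filter: minimise w^H R_nn w subject to w^H h = 1  (Eq. (18))
Rinv_h = np.linalg.solve(R_nn, h)
w_mvdr = Rinv_h / (h.conj() @ Rinv_h)

assert np.isclose(w_mvdr.conj() @ h, 1.0)        # distortionless response

# Any other distortionless filter yields a higher output noise power
w_alt = rng.standard_normal(M) + 1j * rng.standard_normal(M)
w_alt = w_alt / (w_alt.conj() @ h)               # enforce w^H h = 1
p_mvdr = (w_mvdr.conj() @ R_nn @ w_mvdr).real
p_alt = (w_alt.conj() @ R_nn @ w_alt).real
assert p_mvdr <= p_alt + 1e-12
```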
A data-dependent estimate can typically be obtained for $\mathbf{R}_{nn}$, for instance by recursive averaging, with a voice activity detector [24] or a speech presence probability (SPP) estimator [25]. This data-dependent estimate will be denoted as $\hat{\mathbf{R}}_{nn}$, and in general the notation $\hat{(\cdot)}$ will refer to any data-dependent estimate.
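One common recursive-averaging scheme (an illustrative sketch, not necessarily the exact update rule of [24, 25]) weights each update by the speech presence probability, so that the noise-only correlation estimate is driven mainly by speech-absent time-frequency bins:

```python
import numpy as np

def update_noise_cov(R_nn, y, spp, lam=0.95):
    """One recursive-averaging update of the noise covariance estimate.

    spp is the speech presence probability for this time-frequency bin: the
    effective smoothing factor tends to 1 (no update) when speech is likely
    present. Illustrative scheme; the cited SPP estimator defines spp itself.
    """
    lam_eff = lam + (1.0 - lam) * spp
    return lam_eff * R_nn + (1.0 - lam_eff) * np.outer(y, y.conj())

# Toy usage: with spp = 0 the update is a plain exponential average
M = 4
R = np.eye(M, dtype=complex)
y = np.ones(M, dtype=complex)
R_new = update_noise_cov(R, y, spp=0.0, lam=0.9)
assert np.allclose(R_new, 0.9 * np.eye(M) + 0.1 * np.ones((M, M)))

# With spp = 1 (speech certainly present) the estimate is left untouched
assert np.allclose(update_noise_cov(R, y, spp=1.0), R)
```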
In the PWT domain, it can be seen that using $\hat{\mathbf{R}}_{nn}$ in (10) results in an estimate for the pre-whitening operator, $\hat{\mathbf{L}}$, and hence from (14), $\underline{\hat{\mathbf{R}}}_{nn}$ can be expressed as:

$$\underline{\hat{\mathbf{R}}}_{nn} = \hat{\mathbf{L}}^{-1}\boldsymbol{\Upsilon}^H\hat{\mathbf{R}}_{nn}\boldsymbol{\Upsilon}\hat{\mathbf{L}}^{-H} = \mathbf{I}_{(M_a+M_e)} \qquad (19)$$

Replacing $\mathbf{R}_{nn}$ in (17) with $\hat{\mathbf{R}}_{nn}$ and using (19) then results in the MVDR beamformer problem formulated in the PWT domain as:

$$\min_{\underline{\mathbf{w}}} \; \underline{\mathbf{w}}^H\underline{\mathbf{w}} \quad \text{s.t.} \quad \underline{\mathbf{w}}^H\underline{\mathbf{h}} = 1 \qquad (20)$$

where $\underline{\mathbf{w}}$ is redefined as $\underline{\mathbf{w}} = \hat{\mathbf{L}}^H\boldsymbol{\Upsilon}^{-1}\mathbf{w}$ and $\underline{\mathbf{h}}$ is redefined as $\underline{\mathbf{h}} = \hat{\mathbf{L}}^{-1}\boldsymbol{\Upsilon}^H\mathbf{h}$. The solution to (20) then yields the optimal filter in the PWT domain:

$$\underline{\mathbf{w}}_{\text{mvdr}} = \frac{\underline{\mathbf{h}}}{\underline{\mathbf{h}}^H\underline{\mathbf{h}}} \qquad (21)$$

with the desired speech signal estimate $z_1 = \underline{\mathbf{w}}_{\text{mvdr}}^H\underline{\mathbf{y}}$. As $\mathbf{h}$ is still unknown, however, $\underline{\mathbf{h}}$ is also unknown and an estimate for this component is still required. Using the same $\hat{\mathbf{R}}_{nn}$, two general approaches for the estimation of $\underline{\mathbf{h}}$ can be considered: making use of an available a priori RTF vector pertaining to the LMA, or making use of only the observable microphone data, i.e. a fully data-dependent estimate. The remainder of this section elaborates on these procedures.

Using an a priori RTF vector
For a microphone configuration consisting of only an LMA, it is not uncommon to use an a priori RTF vector, $\bar{\mathbf{h}}_a$, in place of the true RTF vector. As mentioned earlier, this may be measured a priori or based on several assumptions regarding the spatial scenario and acoustic environment. For the inclusion of XMs into the microphone configuration, however, the notion of an a priori RTF vector is not so straightforward, as no immediate prior knowledge with respect to the XMs can be exploited: there are no restrictions on what type of XMs can be used or where they must be placed in the acoustic environment. Hence, an a priori RTF vector cannot be fully prescribed, as was the case for the LMA only. However, since a priori information would typically only be available for the LMA, an a priori RTF vector for a microphone configuration of an LMA with XMs can be defined as follows:

$$\bar{\mathbf{h}} = \begin{bmatrix} \bar{\mathbf{h}}_a \\ \hat{\mathbf{h}}_e \end{bmatrix} \qquad (22)$$

which consists partially of the a priori RTF vector pertaining to the LMA, $\bar{\mathbf{h}}_a$, and partially of the RTF vector pertaining to the XMs, $\mathbf{h}_e$, which is unknown and remains to be estimated. The estimate of $\mathbf{h}_e$ will be denoted as $\hat{\mathbf{h}}_e$ to emphasise that it is constrained by the a priori knowledge set by $\bar{\mathbf{h}}_a$ but estimated from the observed microphone data. In [9], a procedure involving the generalised eigenvalue decomposition (GEVD) was used for obtaining $\hat{\mathbf{h}}_e$, which is subsequently reviewed and re-framed in the PWT domain.
In the PWT domain, using (13)--(15), a rank-1 matrix approximation problem can first be formulated to estimate the entire RTF vector [9]:

$$\min_{\underline{\mathbf{h}},\,\sigma^2_{s_{a,1}}} \left\| \underline{\hat{\mathbf{R}}}_{yy} - \left(\sigma^2_{s_{a,1}}\,\underline{\mathbf{h}}\,\underline{\mathbf{h}}^H + \mathbf{I}_{(M_a+M_e)}\right) \right\|_F^2 \qquad (23)$$

where $\|\cdot\|_F$ is the Frobenius norm, and:

$$\underline{\hat{\mathbf{R}}}_{yy} = \hat{\mathbf{L}}^{-1}\boldsymbol{\Upsilon}^H\hat{\mathbf{R}}_{yy}\boldsymbol{\Upsilon}\hat{\mathbf{L}}^{-H} \qquad (24)$$

where $\hat{\mathbf{R}}_{yy}$ is the data-dependent estimate of $\mathbf{R}_{yy}$. From (22), an a priori RTF vector in the PWT domain can be defined as follows:

$$\underline{\bar{\mathbf{h}}} = \hat{\mathbf{L}}^{-1}\boldsymbol{\Upsilon}^H\bar{\mathbf{h}} = \hat{\mathbf{L}}^{-1}\begin{bmatrix} \mathbf{0} \\ 1 \\ \hat{\mathbf{h}}_e \end{bmatrix} \qquad (26)$$

where $\mathbf{0}$ is a vector of $(M_a-1)$ zeros. Replacing $\underline{\mathbf{h}}$ with the a priori RTF vector from (26) then results in:

$$\min_{\hat{\mathbf{h}}_e,\,\sigma^2_{s_{a,1}}} \left\| \underline{\hat{\mathbf{R}}}_{yy} - \left(\sigma^2_{s_{a,1}}\,\underline{\bar{\mathbf{h}}}\,\underline{\bar{\mathbf{h}}}^H + \mathbf{I}_{(M_a+M_e)}\right) \right\|_F^2 \qquad (27)$$

where now only an estimate is required for $\mathbf{h}_e$, which in turn will define the a priori RTF vector. As discussed in [9], it can be observed that only the lower $(M_e+1)\times(M_e+1)$ blocks of $\boldsymbol{\Upsilon}^H\hat{\mathbf{R}}_{yy}\boldsymbol{\Upsilon}$ and $\boldsymbol{\Upsilon}^H\hat{\mathbf{R}}_{nn}\boldsymbol{\Upsilon}$ are required for estimating $\mathbf{h}_e$. Hence, denoting these lower blocks with $\tilde{(\cdot)}$, (27) can be reduced to the equivalent lower-dimensional problem:

$$\min_{\hat{\mathbf{h}}_e,\,\sigma^2_{s_{a,1}}} \left\| \tilde{\underline{\hat{\mathbf{R}}}}_{yy} - \left(\sigma^2_{s_{a,1}}\,\tilde{\underline{\bar{\mathbf{h}}}}\,\tilde{\underline{\bar{\mathbf{h}}}}^H + \mathbf{I}_{(M_e+1)}\right) \right\|_F^2 \qquad (28)$$

The solution of (28) then follows from a GEVD of the matrix pencil $\{\tilde{\hat{\mathbf{R}}}_{yy}, \tilde{\hat{\mathbf{R}}}_{nn}\}$ or equivalently from the eigenvalue decomposition (EVD) of $\tilde{\underline{\hat{\mathbf{R}}}}_{yy}$ [26]:

$$\tilde{\underline{\hat{\mathbf{R}}}}_{yy} = \hat{\mathbf{V}}\hat{\boldsymbol{\Lambda}}\hat{\mathbf{V}}^H \qquad (30)$$

where $\hat{\mathbf{V}}$ is an $(M_e+1)\times(M_e+1)$ unitary matrix of eigenvectors and $\hat{\boldsymbol{\Lambda}}$ is a diagonal matrix with the associated eigenvalues in descending order. The estimate of $\mathbf{h}_e$ then follows from the appropriate scaling of the principal eigenvector, $\hat{\mathbf{v}}_p$:

$$\underline{\bar{\mathbf{h}}} = \frac{1}{\hat{l}_{M_a}\hat{v}_{p,1}}\begin{bmatrix} \mathbf{0} \\ \hat{\mathbf{v}}_p \end{bmatrix} \qquad (31)$$

where $\hat{v}_{p,1}$ is the first element of $\hat{\mathbf{v}}_p$, and $\hat{l}_{M_a} = \mathbf{e}_{M_a}^T\hat{\mathbf{L}}\,\mathbf{e}_{M_a}$ is the real-valued $(M_a, M_a)$th element of $\hat{\mathbf{L}}$, with $\mathbf{e}_{M_a}$ an $(M_a+M_e)$ selection vector consisting of all zeros except for a one in the $M_a$th position. Finally, replacing $\underline{\mathbf{h}}$ in (21) with $\underline{\bar{\mathbf{h}}}$ from (31) results in the MVDR beamformer based on a priori knowledge pertaining to the LMA:

$$\underline{\bar{\mathbf{w}}}_{\text{mvdr}} = \hat{l}_{M_a}\hat{v}_{p,1}^{*}\begin{bmatrix} \mathbf{0} \\ \hat{\mathbf{v}}_p \end{bmatrix} \qquad (32)$$

which will be referred to as MVDR-AP. The corresponding speech estimate is then computed using (16):

$$\bar{z}_1 = \underline{\bar{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\mathbf{y}} \qquad (33)$$

As a consequence of incorporating the a priori information into the rank-1 speech model, it can be seen that it is only necessary to filter the last $(M_e+1)$ elements of $\underline{\mathbf{y}}$, i.e. $\underline{y}_{a,M_a}$ and $\underline{\mathbf{y}}_e$, with the lower-order $(M_e+1)$ filter defined by $\hat{l}_{M_a}\hat{v}_{p,1}^{*}\hat{\mathbf{v}}_p$.

Using a data-dependent RTF vector
In the PWT domain, it is (23) that needs to be solved in order to obtain a fully data-dependent estimate of the RTF vector pertaining to the LMA and the XMs. The solution to (23) follows from a GEVD of the matrix pencil $\{\hat{\mathbf{R}}_{yy}, \hat{\mathbf{R}}_{nn}\}$ or equivalently from the EVD of $\underline{\hat{\mathbf{R}}}_{yy}$:

$$\underline{\hat{\mathbf{R}}}_{yy} = \hat{\mathbf{Q}}\hat{\boldsymbol{\Lambda}}\hat{\mathbf{Q}}^H \qquad (34)$$

where $\hat{\mathbf{Q}}$ is an $(M_a+M_e)\times(M_a+M_e)$ unitary matrix of eigenvectors and $\hat{\boldsymbol{\Lambda}}$ is a diagonal matrix with the associated eigenvalues in descending order. The estimated RTF vector is then given by the appropriately scaled principal (first in this case) eigenvector, $\hat{\mathbf{q}}_p$:

$$\hat{\mathbf{h}} = \frac{1}{\hat{\eta}_q}\,\boldsymbol{\Upsilon}^{-H}\hat{\mathbf{L}}\,\hat{\mathbf{q}}_p \qquad (35)$$

where $\hat{\eta}_q = \mathbf{e}_1^T\boldsymbol{\Upsilon}^{-H}\hat{\mathbf{L}}\,\hat{\mathbf{q}}_p$ and $\mathbf{e}_1$ is an $(M_a+M_e)$ selection vector with a one as the first element and zeros everywhere else. In the PWT domain, this data-dependent RTF vector then becomes:

$$\underline{\hat{\mathbf{h}}} = \frac{\hat{\mathbf{q}}_p}{\hat{\eta}_q} \qquad (36)$$

Replacing $\underline{\mathbf{h}}$ in (21) with $\underline{\hat{\mathbf{h}}}$ from (36) results in the MVDR beamformer that makes use of a data-dependent RTF vector:

$$\underline{\hat{\mathbf{w}}}_{\text{mvdr}} = \hat{\eta}_q^{*}\,\hat{\mathbf{q}}_p \qquad (37)$$

which will be referred to as MVDR-DD. The corresponding speech estimate is then computed using (16):

$$\hat{z}_1 = \underline{\hat{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\mathbf{y}} \qquad (38)$$

where now all $(M_a+M_e)$ signals need to be filtered, as opposed to only $(M_e+1)$ signals in (33) when an a priori RTF vector is used. In general, the MVDR-DD would also be used for microphone configurations where there is no a priori knowledge available, such as those consisting of external microphones only.

⁴ It is acknowledged that there is a slight abuse of notation here, as the estimate for $\bar{\mathbf{h}}$ should be denoted $\hat{\bar{\mathbf{h}}}$. However, the notation $\bar{\mathbf{h}}$ is maintained in favour of legibility and to stress that the estimation is done in accordance with the a priori assumptions set by $\bar{\mathbf{h}}_a$.
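The EVD-based estimation of (34) can be illustrated on a synthetic PWT-domain model, where the speech-plus-noise correlation matrix is constructed exactly as rank-1-plus-identity (an idealised assumption; in practice it is estimated from data and only approximately has this structure):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 5
# Synthetic PWT-domain model: R_yy = sigma^2 h h^H + I (noise is white after
# pre-whitening), with an arbitrary assumed true PWT RTF vector
h_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
sigma2 = 4.0
R_yy = sigma2 * np.outer(h_true, h_true.conj()) + np.eye(M)

# EVD of R_yy (Eq. (34)); numpy's eigh returns ascending order, so take the last
eigval, eigvec = np.linalg.eigh(R_yy)
lam1 = eigval[-1]                 # principal eigenvalue
q_p = eigvec[:, -1]               # principal eigenvector

# The principal eigenvector spans the same direction as the true PWT RTF vector
cos2 = np.abs(q_p.conj() @ h_true) ** 2 / (h_true.conj() @ h_true).real
assert np.isclose(cos2, 1.0)
# Principal eigenvalue equals sigma^2 ||h||^2 + 1 for this rank-1-plus-identity model
assert np.isclose(lam1, sigma2 * (h_true.conj() @ h_true).real + 1.0)
```

The subsequent scaling by $\hat{\eta}_q$ in (35)-(36) only fixes the arbitrary norm and phase of the eigenvector, so that the first element of the resulting RTF vector equals one.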

Quadratically constrained quadratic program
As opposed to relying on only an a priori RTF vector or a data-dependent RTF vector, the merging or integration of both RTF vectors into a single approach can be framed as a quadratically constrained quadratic program (QCQP), firstly with respect to the unprocessed microphone signals:

$$\min_{\mathbf{w}} \; \mathbf{w}^H\hat{\mathbf{R}}_{nn}\mathbf{w} \quad \text{s.t.} \quad |\mathbf{w}^H\bar{\mathbf{h}} - 1|^2 \le \bar{\epsilon}^2, \quad |\mathbf{w}^H\hat{\mathbf{h}} - 1|^2 \le \hat{\epsilon}^2 \qquad (39)$$

where $\bar{\epsilon}^2$ and $\hat{\epsilon}^2$ are the maximum-tolerated squared deviations from a distortionless response due to $\bar{\mathbf{h}}$ or $\hat{\mathbf{h}}$ respectively. The constraints of (39) can also be re-written in the standard form [27] as follows:

$$\mathbf{w}^H\bar{\mathbf{h}}\bar{\mathbf{h}}^H\mathbf{w} - 2\,\Re\{\mathbf{w}^H\bar{\mathbf{h}}\} + 1 - \bar{\epsilon}^2 \le 0 \qquad (40)$$
$$\mathbf{w}^H\hat{\mathbf{h}}\hat{\mathbf{h}}^H\mathbf{w} - 2\,\Re\{\mathbf{w}^H\hat{\mathbf{h}}\} + 1 - \hat{\epsilon}^2 \le 0 \qquad (41)$$

where $\Re\{\cdot\}$ denotes the real part of its argument. As the matrices $\hat{\mathbf{R}}_{nn}$, $\bar{\mathbf{h}}\bar{\mathbf{h}}^H$, and $\hat{\mathbf{h}}\hat{\mathbf{h}}^H$ are all positive semi-definite, it is evident that the QCQP of (39) is convex [27]. In the PWT domain, (39) is equivalently:

$$\min_{\underline{\mathbf{w}}} \; \underline{\mathbf{w}}^H\underline{\mathbf{w}} \quad \text{s.t.} \quad |\underline{\mathbf{w}}^H\underline{\bar{\mathbf{h}}} - 1|^2 \le \bar{\epsilon}^2, \quad |\underline{\mathbf{w}}^H\underline{\hat{\mathbf{h}}} - 1|^2 \le \hat{\epsilon}^2 \qquad (42)$$

where $\underline{\bar{\mathbf{h}}}$ and $\underline{\hat{\mathbf{h}}}$ are given in (31) and (36) respectively. Whereas in (20) the hard constraint of $\underline{\mathbf{h}}$ is replaced by either $\underline{\bar{\mathbf{h}}}$ or $\underline{\hat{\mathbf{h}}}$, (42) can be interpreted as a relaxation of the hard constraints imposed by $\underline{\bar{\mathbf{h}}}$ or $\underline{\hat{\mathbf{h}}}$ by the specified deviations $\bar{\epsilon}^2$ and $\hat{\epsilon}^2$ respectively. In the following, the first inequality constraint of (42) will be referred to as the a priori constraint (APC), and the second inequality constraint will be referred to as the data-dependent constraint (DDC). The QCQP of (39) is in fact a subset of the more general QCQP considered in [28,29], as well as an extension of the parametrised multi-channel Wiener filter [30]. In [28,29], the inequality constraints considered are a set of a priori measured RTF vectors, and in [30], only one inequality constraint is considered. The difference in (39) from both of these approaches is that two inequality constraints are considered, one relying on a priori knowledge and the other fully estimated from the data.
The Lagrangian of (42) is given by:

$$\mathcal{L}(\underline{\mathbf{w}}, \alpha, \beta) = \underline{\mathbf{w}}^H\underline{\mathbf{w}} + \alpha\left(|\underline{\mathbf{w}}^H\underline{\bar{\mathbf{h}}} - 1|^2 - \bar{\epsilon}^2\right) + \beta\left(|\underline{\mathbf{w}}^H\underline{\hat{\mathbf{h}}} - 1|^2 - \hat{\epsilon}^2\right) \qquad (43)$$

where $\alpha \ge 0$ and $\beta \ge 0$ are Lagrangian multipliers. Taking the partial derivative of (43) with respect to $\underline{\mathbf{w}}$ and setting it to zero results in what will be referred to as the integrated MVDR beamformer, MVDR-INT:

$$\underline{\mathbf{w}}_{\text{int}} = \left(\mathbf{I} + \alpha\,\underline{\bar{\mathbf{h}}}\,\underline{\bar{\mathbf{h}}}^H + \beta\,\underline{\hat{\mathbf{h}}}\,\underline{\hat{\mathbf{h}}}^H\right)^{-1}\left(\alpha\,\underline{\bar{\mathbf{h}}} + \beta\,\underline{\hat{\mathbf{h}}}\right) \qquad (44)$$

where the actual values of $\alpha$ and $\beta$ depend on the prescribed maximum tolerable speech distortions $\bar{\epsilon}^2$ and $\hat{\epsilon}^2$. It can also be observed that (44) is in fact identical (in the PWT domain) to the integrated MVDR beamformer considered in [22] and hence can be written as a linear combination of $\underline{\bar{\mathbf{w}}}_{\text{mvdr}}$ and $\underline{\hat{\mathbf{w}}}_{\text{mvdr}}$ with complex weightings⁵ [22]:

$$\underline{\mathbf{w}}_{\text{int}} = \bar{c}\,\underline{\bar{\mathbf{w}}}_{\text{mvdr}} + \hat{c}\,\underline{\hat{\mathbf{w}}}_{\text{mvdr}} \qquad (45)$$

where $\underline{\bar{\mathbf{w}}}_{\text{mvdr}}$ and $\underline{\hat{\mathbf{w}}}_{\text{mvdr}}$ are given in (32) and (37) respectively, and the complex weightings are given by:

⁵ It can also be expressed as a convex combination of various beamformers as discussed in [22].
$$\bar{c} = \frac{\alpha k_{aa}\left(1 + \beta(k_{bb} - k_{ab})\right)}{D}, \qquad \hat{c} = \frac{\beta k_{bb}\left(1 + \alpha(k_{aa} - k_{ba})\right)}{D} \qquad (46)\text{--}(47)$$

where $D = \alpha k_{aa} + \beta k_{bb} + \alpha\beta(k_{aa}k_{bb} - k_{ab}k_{ba}) + 1$, and $k_{aa} = \underline{\bar{\mathbf{h}}}^H\underline{\bar{\mathbf{h}}}$, $k_{bb} = \underline{\hat{\mathbf{h}}}^H\underline{\hat{\mathbf{h}}}$, $k_{ab} = \underline{\bar{\mathbf{h}}}^H\underline{\hat{\mathbf{h}}}$, $k_{ba} = \underline{\hat{\mathbf{h}}}^H\underline{\bar{\mathbf{h}}}$. Using the expressions for $\underline{\bar{\mathbf{w}}}_{\text{mvdr}}$ and $\underline{\hat{\mathbf{w}}}_{\text{mvdr}}$ from (32) and (37) respectively, the resulting speech estimate from the MVDR-INT is then:

$$z_{1,\text{int}} = \underline{\mathbf{w}}_{\text{int}}^H\,\underline{\mathbf{y}} = \bar{c}^{*}\bar{z}_1 + \hat{c}^{*}\hat{z}_1$$

where $\bar{z}_1$ and $\hat{z}_1$ are defined in (33) and (38) respectively. Hence, the integrated beamformer output is simply a linear combination of the two speech estimates, one relying on a priori information and the other not.
Once appropriate values are chosen for $\bar{\epsilon}^2$ and $\hat{\epsilon}^2$, a package for specifying and solving convex programs such as CVX [31,32] can be used for solving (42). Alternatively, more computationally efficient methods may be applied, such as those proposed in [28,29], one of which is highlighted in Algorithm 1. Here, a gradient ascent method [33] for solving (42) is described, which is based on solving the dual problem:

$$\max_{\alpha \ge 0,\, \beta \ge 0} \; \mathcal{D}(\alpha, \beta)$$

where $\mathcal{D}(\alpha, \beta) = \min_{\underline{\mathbf{w}}} \mathcal{L}(\underline{\mathbf{w}}, \alpha, \beta)$ is referred to as the dual function. As the dual function is concave [27], a gradient ascent procedure can be used to update the values of $\alpha$ and $\beta$ using the gradients $\partial\mathcal{D}(\alpha,\beta)/\partial\alpha$ and $\partial\mathcal{D}(\alpha,\beta)/\partial\beta$; the gradient of the dual function with respect to a particular Lagrange multiplier is simply the corresponding constraint function. This then gives rise to Algorithm 1 [29], which makes use of the simplified expression for $\underline{\mathbf{w}}_{\text{int}}$ with the complex-valued weightings as opposed to computing (44) directly. The Lagrangian multipliers, $\alpha$ and $\beta$, are then updated via the gradient ascent procedure with step size $\gamma$, whose value can be controlled using a backtracking method [34]. The algorithm continues until the respective gradients are within some specified tolerance, $\delta$.
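A minimal sketch of such a dual gradient ascent is given below. It follows the structure described above (closed-form Lagrangian minimiser, constraint residuals as dual gradients, projection onto $\alpha, \beta \ge 0$), but uses a fixed step size rather than backtracking and synthetic RTF vectors, so it is illustrative rather than a faithful reproduction of Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4
# Assumed synthetic PWT-domain RTF vectors (purely illustrative)
h_ap = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # a priori
h_dd = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # data-dependent
eps_ap, eps_dd = 0.4, 0.6        # example maximum tolerable distortions

def w_int(alpha, beta):
    """Minimiser of the Lagrangian for fixed multipliers (Eq. (44))."""
    A = (np.eye(M) + alpha * np.outer(h_ap, h_ap.conj())
                   + beta * np.outer(h_dd, h_dd.conj()))
    return np.linalg.solve(A, alpha * h_ap + beta * h_dd)

# Projected gradient ascent on the concave dual: the gradient w.r.t. each
# multiplier is that constraint's residual
alpha, beta, step = 1.0, 1.0, 0.02
for _ in range(5000):
    w = w_int(alpha, beta)
    g_a = np.abs(w.conj() @ h_ap - 1.0) ** 2 - eps_ap ** 2
    g_b = np.abs(w.conj() @ h_dd - 1.0) ** 2 - eps_dd ** 2
    alpha = max(0.0, alpha + step * g_a)
    beta = max(0.0, beta + step * g_b)

w = w_int(alpha, beta)
# At convergence both QCQP constraints of (42) are satisfied
assert np.abs(w.conj() @ h_ap - 1.0) <= eps_ap + 1e-2
assert np.abs(w.conj() @ h_dd - 1.0) <= eps_dd + 1e-2
```

In a full implementation, the per-iteration filter would be computed from the weightings $\bar{c}$ and $\hat{c}$ rather than by the matrix solve, which is what makes the procedure efficient.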

Effect of $\bar{\epsilon}$ and $\hat{\epsilon}$
As the QCQP of (42) is in principle to be solved for every time frame and frequency bin, it can lead to quite a versatile beamformer, as the parameters $\bar{\epsilon}$ and $\hat{\epsilon}$ can be set independently for each frequency in every time frame in order to define the inequality constraints. So although (42) is a well-known QCQP for which there are several methods available to find the solution, it still remains unclear what would be a reasonable strategy for setting or tuning $\bar{\epsilon}$ and $\hat{\epsilon}$ in practice. As opposed to [22], where tuning rules were developed for the Lagrangian multipliers, here a strategy is outlined for tuning $\bar{\epsilon}$ and $\hat{\epsilon}$, which will in turn yield the appropriate Lagrangian multipliers (for instance as outlined in Algorithm 1), as this is believed to be a more insightful procedure.

[Algorithm 1: Gradient ascent method for solving the QCQP of (42), using the weightings of (45) and a step size $\gamma$ set by a backtracking method.]
In order to develop a strategy for tuning $\bar{\epsilon}$ and $\hat{\epsilon}$, it will be useful to examine the constraints of (42) in more detail. The derivations that follow will reveal that the space spanned by $\bar{\epsilon}$ and $\hat{\epsilon}$ can be divided into four distinct regions, as illustrated in Fig. 1, where each of these regions corresponds to a particular set of constraints being active.
Firstly, substitution of $\underline{\mathbf{w}}_{\text{int}} = \mathbf{0}$ into the APC and DDC from (42) shows that when $\bar{\epsilon} \ge 1$ and $\hat{\epsilon} \ge 1$, both the APC and the DDC are inactive. This condition therefore defines the upper-right region (region I) of Fig. 1 and indeed corresponds to a complete attenuation of the microphone signals, i.e. a zero output signal.
For the case when $\hat{\epsilon} \to \infty$, i.e. when the DDC is inactive, then $\beta \to 0$. If the APC is still active, however, it becomes⁶:

$$|\underline{\mathbf{w}}_{\text{int}}^H\,\underline{\bar{\mathbf{h}}} - 1| = \bar{\epsilon} \qquad (53)$$

Furthermore, if $0 \le \bar{\epsilon} \le 1$, then it can be deduced that:

$$\underline{\mathbf{w}}_{\text{int}} = (1 - \bar{\epsilon})\,\underline{\bar{\mathbf{w}}}_{\text{mvdr}} \qquad (54)$$

Substitution of (54) into (53) readily makes this evident, recalling that $\underline{\bar{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\bar{\mathbf{h}}} = 1$. It is also worthwhile to note that by using (46), the relationship between $\alpha$ and $\bar{\epsilon}$ for $0 \le \bar{\epsilon} \le 1$ is then given as:

$$\alpha = \frac{1 - \bar{\epsilon}}{\bar{\epsilon}\,k_{aa}} \qquad (55)$$

In regard to the DDC, as $\hat{\epsilon}$ is decreased (from $\hat{\epsilon} \to \infty$), it remains inactive until $|\underline{\mathbf{w}}^H\underline{\hat{\mathbf{h}}} - 1| = \hat{\epsilon}$. By substitution of (54) into the DDC of (42), the value of $\hat{\epsilon}$ at which the DDC becomes active, $\hat{\epsilon}_o$, is given by:

$$\hat{\epsilon}_o = \left|(1 - \bar{\epsilon})\,\underline{\bar{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\hat{\mathbf{h}}} - 1\right| \qquad (56)$$

In the limits of $\bar{\epsilon}$: when $\bar{\epsilon} = 1$, $\hat{\epsilon}_o = 1$, and when $\bar{\epsilon} = 0$, $\hat{\epsilon}_o = |\underline{\bar{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\hat{\mathbf{h}}} - 1|$. The range of values obtained for $\hat{\epsilon}_o$ from (56) within the domain $0 \le \bar{\epsilon} \le 1$ defines what will be referred to as the DDC bounding curve, as depicted in Fig. 1. Hence, region II in Fig. 1 is enclosed by the DDC bounding curve, $\bar{\epsilon} = 0$, and $\bar{\epsilon} = 1$, representing the space where the APC is active and the DDC is inactive.
A similar analysis can be followed starting from the case when $\bar{\epsilon} \to \infty$, i.e. when the APC is inactive and hence $\alpha \to 0$. If the DDC is still active, however, it becomes:

$$|\underline{\mathbf{w}}_{\text{int}}^H\,\underline{\hat{\mathbf{h}}} - 1| = \hat{\epsilon} \qquad (57)$$

When $0 \le \hat{\epsilon} \le 1$, the following relationships can be deduced:

$$\underline{\mathbf{w}}_{\text{int}} = (1 - \hat{\epsilon})\,\underline{\hat{\mathbf{w}}}_{\text{mvdr}}, \qquad \beta = \frac{1 - \hat{\epsilon}}{\hat{\epsilon}\,k_{bb}} \qquad (58)\text{--}(59)$$

Finally, for the APC, as $\bar{\epsilon}$ is decreased (from initially $\bar{\epsilon} \to \infty$), the value, $\bar{\epsilon}_o$, at which this constraint becomes active is given by:

$$\bar{\epsilon}_o = \left|(1 - \hat{\epsilon})\,\underline{\hat{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\bar{\mathbf{h}}} - 1\right| \qquad (60)$$

The range of values obtained for $\bar{\epsilon}_o$ from (60) within the domain $0 \le \hat{\epsilon} \le 1$ defines what will be referred to as the APC bounding curve, as depicted in Fig. 1. Hence, region III in Fig. 1 is enclosed by the APC bounding curve, $\hat{\epsilon} = 0$, and $\hat{\epsilon} = 1$, representing the space where the APC is inactive and the DDC is active. Finally, in the lower-left region, region IV, both the APC and the DDC are active within the area enclosed by the APC and DDC bounding curves. It should be kept in mind that Fig. 1 is only an illustration and that the shape of the area for which the APC and DDC are both active can change depending on the RTF vectors, $\underline{\bar{\mathbf{h}}}$ and $\underline{\hat{\mathbf{h}}}$. For instance, Fig. 1 shows $|\underline{\hat{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\bar{\mathbf{h}}} - 1| < 1$ and $|\underline{\bar{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\hat{\mathbf{h}}} - 1| < 1$ (points on the axes), whereas either of these points may in fact be greater than or equal to one.

Confidence metric
One of the ingredients towards developing a tuning strategy for setting appropriate values of $\bar{\epsilon}$ and $\hat{\epsilon}$ is a confidence metric, indicative of the confidence in the accuracy of the data-dependent RTF vector. In [22], it was proposed that a principal generalised eigenvalue resulting from the data-dependent estimation procedure be used as such a confidence metric. In the following, it is proposed again to use such a metric; however, due to the formulation in the PWT domain, the principal eigenvalue, $\hat{\lambda}_1$, from the EVD in (34) will be used. It can be shown that $\hat{\lambda}_1$ is equivalent to the resulting posterior SNR when the MVDR-DD is applied and therefore serves as a reasonable metric for making a decision with respect to the accuracy of the data-dependent RTF vector. For the MVDR-DD in (37), the resulting posterior SNR is given by:

$$\text{SNR}_{\text{DD}} = \frac{\underline{\hat{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\hat{\mathbf{R}}}_{yy}\,\underline{\hat{\mathbf{w}}}_{\text{mvdr}}}{\underline{\hat{\mathbf{w}}}_{\text{mvdr}}^H\,\underline{\hat{\mathbf{R}}}_{nn}\,\underline{\hat{\mathbf{w}}}_{\text{mvdr}}} \qquad (61)$$

where it is recalled that $\underline{\hat{\mathbf{R}}}_{nn} = \mathbf{I}_{(M_a+M_e)}$. Substitution of (34) and (37)⁷ into (61) results in $\text{SNR}_{\text{DD}} = \hat{\lambda}_1$. As in [22], $\hat{\lambda}_1$ can then be used in a logistic function to define the confidence metric, $F(l)$⁸:

$$F(l) = \frac{1}{1 + e^{-\rho\left(10\log_{10}(\hat{\lambda}_1(l)) - \lambda_t\right)}} \qquad (62)$$

where $F(l) \in [0, 1]$, $\rho$ controls the gradient of the transition from 0 to 1, and $\lambda_t$ is a threshold (in dB), beyond which $F(l) \to 1$. Hence, as $10\log_{10}(\hat{\lambda}_1(l))$ increases beyond $\lambda_t$, then $F(l) \to 1$, indicating high confidence in the accuracy of the data-dependent RTF vector. On the other hand, as $10\log_{10}(\hat{\lambda}_1(l))$ decreases below $\lambda_t$, then $F(l) \to 0$, indicating low confidence in the accuracy of the data-dependent RTF vector.

⁷ Recall that $\underline{\hat{\mathbf{w}}}_{\text{mvdr}}$ can be equivalently expressed as $\underline{\hat{\mathbf{w}}}_{\text{mvdr}} = \hat{\eta}_q^{*}\hat{\mathbf{Q}}\mathbf{e}_1$.
⁸ The time index is reintroduced here to reinforce that these quantities are to be computed in each time frame. All frequencies are still treated equivalently.
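A sketch of such a logistic confidence metric is given below, with example values for $\rho$ and $\lambda_t$ (the actual values used in the evaluation are not reproduced here):

```python
import numpy as np

def confidence(lam1, lam_t=10.0, rho=1.0):
    """Logistic confidence metric F(l) based on the principal eigenvalue.

    lam1 is the principal eigenvalue (linear scale), lam_t the threshold in dB,
    and rho the gradient of the 0-to-1 transition (illustrative values).
    """
    snr_db = 10.0 * np.log10(lam1)
    return 1.0 / (1.0 + np.exp(-rho * (snr_db - lam_t)))

# High posterior SNR -> high confidence in the data-dependent RTF vector
assert confidence(1e4) > 0.99                 # 40 dB, far above the threshold
# Low posterior SNR -> low confidence
assert confidence(1.0) < 0.01                 # 0 dB, far below the threshold
assert np.isclose(confidence(10.0), 0.5)      # exactly at the 10 dB threshold
```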

Tuning strategy
With the depiction of the space spanned by $\bar{\epsilon}$ and $\hat{\epsilon}$ from Fig. 1 in mind, a general two-step procedure can be followed to establish a particular tuning strategy: 1. Choose two points on the $\{\hat{\epsilon}, \bar{\epsilon}\}$ plane: $\mathcal{P}_{\text{AP}}$ and $\mathcal{P}_{\text{DD}}$.
The coordinates of $\mathcal{P}_{\text{AP}}$, $\{\hat{\epsilon}_{\text{AP}}, \bar{\epsilon}_{\text{AP}}\}$, will specify the maximum tolerable speech distortions for the case when there is no confidence in the accuracy of the data-dependent RTF vector. The coordinates of $\mathcal{P}_{\text{DD}}$, $\{\hat{\epsilon}_{\text{DD}}, \bar{\epsilon}_{\text{DD}}\}$, on the other hand, will specify the maximum tolerable speech distortions for the case when there is complete confidence in the accuracy of the data-dependent RTF vector. 2. Define an appropriate path connecting $\mathcal{P}_{\text{AP}}$ and $\mathcal{P}_{\text{DD}}$, where the variation along this path is a function of the confidence metric, $F(l)$. As $F(l)$ changes in each time-frequency segment, different values of $\hat{\epsilon}$ and $\bar{\epsilon}$ will be chosen along this path and subsequently used in the QCQP from (42). With reference to Fig. 2, one possible tuning strategy will now be briefly outlined. In this strategy, $\mathcal{P}_{\text{AP}}$ and $\mathcal{P}_{\text{DD}}$ are chosen by making use of the relationship between the integrated MVDR and the so-called speech distortion weighted multi-channel Wiener filter (SDW-MWF) [35,36]. Although $\mathcal{P}_{\text{AP}}$ and $\mathcal{P}_{\text{DD}}$ can in general be chosen without making use of this relation, it is done here to highlight how the speech distortion parameter, $\mu$, from the SDW-MWF is related to the maximum tolerable speech distortion parameters of the integrated MVDR, especially as this $\mu$ is a well-established trade-off parameter. For the path connecting $\mathcal{P}_{\text{AP}}$ and $\mathcal{P}_{\text{DD}}$, a linear path will be defined using the confidence metric, $F(l)$.
In the PWT domain, the cost function for the SDW-MWF consists of two terms, the first corresponding to the noise power spectral density after filtering and the second corresponding to the speech distortion. The speech distortion parameter μ ∈ (0, ∞) is used to trade off between the amount of noise reduction and speech distortion, where larger values of μ put more emphasis on reducing the noise and smaller values put more emphasis on reducing the speech distortion. Two separate SDW-MWF formulations can then be considered for h and ĥ respectively in (64) and (65), where μ ∈ (0, ∞) and μ̂ ∈ (0, ∞) are the separate speech distortion parameters for each cost function. The solutions to (64) and (65) are then given by (66) and (67) respectively. From (66) and (67), it can be observed that there is a relationship between the integrated MVDR beamformer and the SDW-MWF. By considering the expressions written as an MVDR beamformer followed by a single-channel post-filter [36], it can be deduced that (70) holds [22]. Hence, the range of values for μ is essentially compressed into a range of values for ε_AP such that 0 ≤ ε_AP ≤ 1. This means that ε_AP can be chosen to be within this range without having to specify μ. However, (70) serves to clarify how the choice of ε_AP is related to the cost function of (64).
Using the value of ε_AP in (56) then yields a range of choices for ε̂_AP such that ε̂_AP ≤ ε̂_o. Furthermore, when the APC is inactive, then (69) holds, so that values of ε and ε̂ in region III from Fig. 1 would result in the SDW-MWF from (67).
9 "Achievable" here is meant to differentiate between the actual speech distortion that is obtained and the maximum tolerable value that was specified.
The insight of Fig. 1 and the additional value of the MVDR-INT as compared to the SDW-MWF are now apparent. Given the two SDW-MWF solutions from (66) and (67), it is not immediately clear how to optimally interpolate between them by using a linear combination of the filters themselves. From Fig. 1, however, it can be seen that an optimal interpolation between (66) and (67), i.e. between regions II and III, can be achieved through the specification of the maximum tolerable speech distortion parameters, ε and ε̂, along some path from region II to region III. In essence, the MVDR-INT has introduced region IV, which serves as a bridge for connecting regions II and III, thereby facilitating the use of both the a priori and data-dependent RTF vectors. This then corresponds to the second step of the general procedure for tuning, where P_AP and P_DD are to be connected. Here, it is proposed to use the confidence metric, F(l), to perform a linear interpolation between P_AP and P_DD to yield the values for ε̂ and ε respectively as:

ε̂(l) = (1 − F(l)) ε̂_AP + F(l) ε̂_DD, (74)
ε(l) = (1 − F(l)) ε_AP + F(l) ε_DD, (75)

which are subsequently squared to be used in the QCQP from (42). Consequently, as the confidence in the accuracy of the data-dependent RTF vector increases, the maximum tolerable speech distortions will be specified by values tending towards {ε̂_DD, ε_DD}. On the contrary, as this confidence decreases, the maximum tolerable speech distortions will be specified by values tending towards {ε̂_AP, ε_AP}. Returning focus to Fig. 2, the three examples of a tuning strategy can now be understood. A particular realisation of the APC and the DDC bounding curves has been plotted, and the intersecting point of both curves corresponds to the {1, 1} coordinate (recall Fig. 1). In the tuning of Fig. 2a, as F(l) increases, the path along the dotted line is taken from P_AP to arrive at P_DD, which gradually sets a larger value of ε for the APC and a smaller value of ε̂ for the DDC.
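The linear interpolation between the two tuning points driven by the confidence metric can be sketched as follows (an illustrative helper; the argument names are assumptions, and the interpolation form follows the description of (74) and (75)):

```python
def tuned_distortions(F, eps_ap, eps_dd, eps_hat_ap, eps_hat_dd):
    """Linearly interpolate the maximum tolerable speech distortions
    between P_AP and P_DD, using the confidence metric F in [0, 1]."""
    eps_hat = (1.0 - F) * eps_hat_ap + F * eps_hat_dd  # DDC parameter, cf. (74)
    eps = (1.0 - F) * eps_ap + F * eps_dd              # APC parameter, cf. (75)
    return eps, eps_hat
```

With F = 0 the pair {ε̂_AP, ε_AP} is recovered, with F = 1 the pair {ε̂_DD, ε_DD}; the returned values would then be squared before being used in the QCQP.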
Depending on the particular realisation of the APC and DDC bounding curves, such a path may lie entirely within the area enclosed by these curves, or part of it may lie outside, as shown in Fig. 2a. The latter is in fact a fortunate circumstance, because the achieved speech distortion corresponding to the inactive constraint will actually be lower than what was prescribed by the tuning. Fig. 2b and c are representative of strategies where the maximum tolerable speech distortion is fixed for one of the constraints, and only the maximum tolerable speech distortion for the other constraint is tuned. In Fig. 2b, P_DD is defined by setting ε_DD = ε_AP, so that the maximum tolerable speech distortion for the APC is fixed. ε̂ is then tuned according to (74). This is representative of a case where the APC is always active and the DDC is only included if there is confidence in the accuracy of the data-dependent RTF vector. Figure 2c depicts an opposite strategy, where now P_AP is set by setting ε̂_AP = ε̂_DD, so that the maximum tolerable speech distortion for the DDC is fixed.

Evaluation and discussion
In order to gain further insight into the behaviour of the integrated MVDR beamformer using the QCQP formulation, a simulation was first considered involving only an LMA without XMs. As will be demonstrated, such a scenario facilitates the visualisation of the theoretical beam patterns that would be generated under different tuning strategies. Following this simulation, recorded data from an acoustic scenario involving behind-the-ear hearing aid microphones on a dummy head along with XMs in a cocktail party scenario was analysed and evaluated.

Beam patterns for a linear microphone array
As the notion of a traditional beam pattern does not immediately extend to the case of an LMA with XMs, the following beam patterns are generated using an LMA only.
For visualising the beam patterns, a linear LMA consisting of 4 microphones with 5-cm spacing was considered. Two anechoic RTF vectors, simulating an a priori RTF vector, h_a, and a data-dependent RTF vector, ĥ_a, were computed according to a far-field approximation, i.e. [1  e^(−j2πf τ_2(θ))  e^(−j2πf τ_3(θ))  e^(−j2πf τ_4(θ))]^T, where f is the frequency (Hz), which was set to 3 kHz, τ_m(θ) = (m−1)·0.05·cos(θ)/c is the relative time delay between the m-th microphone and the reference microphone (the microphone closest to the desired speech source) of the LMA, θ is the angle of the desired speech source, and c = 345 m s^−1 is the speed of sound. For h_a, θ = 0°, and for ĥ_a, θ = 60°. Using this definition of h_a, C_a and f_a were defined accordingly from (7) and Υ_a from (8). With R_{n_a n_a} = I_{M_a}, the pre-whitening operation from (10) was then computed, but with Υ_a instead of Υ, and hence denoted as L_a. In the PWT domain, the respective RTF vectors are given by h_a = L_a^{−1} Υ_a^H h_a and ĥ_a = L_a^{−1} Υ_a^H ĥ_a. The optimal PWT domain filters, w_mvdr and ŵ_mvdr, were then computed as in (21), but using either h_a or ĥ_a. Finally, (74) and (75) were used to set ε and ε̂, after which (42) was solved using CVX [31, 32]. Figure 3 illustrates the resulting beam patterns for two tuning strategies for different values of F(l) (in this case l = 1 and hence the dependence on l is omitted). The left-hand plot of Fig. 3 corresponds to a tuning strategy similar to that depicted in Fig. 2a, where there is a trade-off between the two constraints. For this strategy, μ = μ̂ = 0.2 and σ²_{s_a,1} = 1, which means that P_AP and P_DD were fairly close to the x-axis and y-axis respectively. As F increases, the beam pattern is clearly seen to evolve from focusing on the a priori direction of 0° to eventually that of the data-dependent direction of 60°.
As a linear path is followed, at the midpoint both ε and ε̂ are of similarly larger values, which explains the lower magnitude of the beam pattern during the transition.
The right-hand plot of Fig. 3 corresponds to a tuning strategy as depicted in Fig. 2b, i.e. when the APC is always active. As F increases, it can be observed that the beam in the a priori direction of 0° is maintained, while more gain is attributed to the data-dependent direction of 60°. In this particular case, however, it is noted that although the response at 60° is in accordance with the maximum tolerable speech distortion prescribed, there is a slight tilt of the beam towards 68°, as compared to the case where only the DDC is active. Nevertheless, this can still be a useful tuning strategy for cases when a high confidence is placed on the a priori RTF vector.
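The far-field steering vectors and the resulting beam patterns described above can be sketched as follows (illustrative function names; the array geometry matches the text, while the filter w is assumed to be given, e.g. from solving (42)):

```python
import numpy as np

def steering_vector(theta_deg, f=3000.0, M=4, d=0.05, c=345.0):
    """Far-field steering vector for a linear array; microphone 1 is the
    reference, so tau_m(theta) = (m-1) * d * cos(theta) / c."""
    theta = np.deg2rad(theta_deg)
    tau = np.arange(M) * d * np.cos(theta) / c
    return np.exp(-2j * np.pi * f * tau)

def beam_pattern(w, angles_deg):
    """Magnitude response |w^H d(theta)| over a set of angles (degrees)."""
    return np.array([np.abs(np.vdot(w, steering_vector(a))) for a in angles_deg])
```

Plotting `beam_pattern(w, range(0, 181))` for the filters obtained under each tuning would reproduce the kind of evolution with F shown in Fig. 3.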

Effect of ε and ε̂
In this section, the effect of ε and ε̂ on the behaviour of the integrated MVDR beamformer for the case of an LMA and XMs is further investigated using recorded audio data. A batch processing framework will be applied so as to observe an average performance at a single frequency. In the following section, the processing will be done using a weighted overlap-add (WOLA) framework [37] and a broadband performance will be assessed.

Fig. 3 (caption): Beam patterns for (left) a tuning strategy similar to that depicted in Fig. 2a and (right) a tuning strategy similar to that depicted in Fig. 2b. F = 0 corresponds to the position P_AP and F = 1 corresponds to the position P_DD from Fig. 2. As F increases, the path from P_AP to P_DD is followed, resulting in the depicted beam patterns.

Audio recordings of speech and noise were made in the laboratory room as depicted in Fig. 4, which has a reverberation time of approximately 1.5 s. A Neumann KU-100 dummy head was placed in a central location of the room and equipped with two (i.e. left and right) behind-the-ear hearing aids, each consisting of two microphones spaced approximately 1.3 cm apart. Hence, in the following, the LMA is considered as having a total of four microphones, i.e. the stacked left-ear and right-ear microphones. The first microphone of the left-ear hearing aid was used as the reference microphone. Three omnidirectional XMs (two AKG CK32 microphones and one AKG CK97O microphone) were placed at heights of 1 m from the floor and at varying distances from the dummy head as shown in Fig. 4. A Genelec 8030C loudspeaker was placed at 1 m and different azimuth angles from the dummy head to generate a speech signal from a male speaker [38]. The loudspeaker and the dummy head were placed at a height of approximately 1.3 m from the floor (only angles 0° and 60° were used as shown in Fig. 4). For the noise, a cocktail party scenario was re-created. With the same configuration of the dummy head and external microphones from Fig.
4, participants stood outside of a 1-m circumference from the dummy head in a random manner (i.e. all participants were not confined to a particular corner of the room). Beverages in glasses as well as snacks were served while the participants engaged in conversation. At any given time, there were nine male participants and six female participants present in the room. A recording of such a scenario was made for approximately 1 h, but a random sample was used in the following analysis. As opposed to a free-field a priori RTF vector, a more suitable a priori RTF vector for the behind-the-ear hearing aid microphones was obtained from pre-measured impulse responses in the scenario as depicted in Fig. 4. The impulse responses were computed from an exponential sine-sweep measurement with the loudspeaker positioned at 0° (the azimuth direction directly in front of the dummy head) and 1 m, so that the a priori RTF vector would be defined in accordance with a source located at 0° and 1 m from the dummy head. The initial section of these impulse responses corresponding to the direct component was extracted, with a length according to the size of the discrete Fourier transform (DFT) window to be used in the STFT domain processing. This direct component was then smoothed with a Tukey window and converted to the frequency domain. In each frequency bin, these smoothed frequency domain impulse responses were then scaled with respect to the smoothed frequency domain impulse response of the reference microphone. This was then used as h_a(k) and was kept the same for each time frame.
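The construction of the a priori RTF vector from the measured impulse responses can be sketched as follows (a minimal sketch; the function name and the Tukey shape parameter are assumptions, as the text does not specify the latter):

```python
import numpy as np
from scipy.signal.windows import tukey

def a_priori_rtf(impulse_responses, n_dft=256, ref=0, alpha=0.5):
    """Sketch of the a priori RTF construction described above.

    impulse_responses : (M, L) measured impulse responses, one row per microphone
    n_dft             : DFT size; only the first n_dft samples (direct part) kept
    ref               : index of the reference microphone
    alpha             : Tukey window shape parameter (an assumed value)
    """
    # Extract the direct component and smooth it with a Tukey window.
    direct = impulse_responses[:, :n_dft] * tukey(n_dft, alpha)
    # Convert to the frequency domain.
    H = np.fft.rfft(direct, n=n_dft, axis=1)
    # Scale each bin by the reference microphone's response.
    return H / H[ref]
```

The resulting matrix gives, per frequency bin, the RTF vector h_a(k), which would then be held fixed over all time frames.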
A scenario was first considered for the desired speech source located at 0° in Fig. 4, i.e. the location where the a priori RTF vector was defined. A 4-s sample of the desired speech signal was mixed with a random sample of the cocktail party noise at a broadband input SNR of 0 dB. For the batch processing framework with a DFT size of 256 samples, R_yy and R_nn were estimated by time averaging across the entire length of the signal in the respective speech-plus-noise or noise-only frames. Using the SPP [25] from the first microphone of the left-ear hearing aid, frames for which the speech was active were chosen if the resulting SPP > 0.85. The RTF vectors, h and ĥ, were computed according to the procedures described in Sections 3.1 and 3.2. Using CVX [31, 32], the MVDR-INT from (42) was then evaluated for a range of 0 < ε ≤ 1.5 and 0 < ε̂ ≤ 1.5 at a frequency of 2 kHz. Figure 5a and b display the resulting (base-10 log) values of the Lagrange multipliers α and β respectively as a function of ε and ε̂, along with the APC and DDC bounding curves. These plots support the theoretical analysis of the space spanned by ε and ε̂ from Fig. 1. In Fig. 5a, it is clearly observed that as the value of ε exceeds the APC bounding curve, then α → 0, so that the APC is inactive while the DDC remains active. Similarly, in Fig. 5b, as the value of ε̂ exceeds the DDC bounding curve, then β → 0, so that the APC remains active and the DDC is inactive. The regions where both constraints are active, and where neither is active, can also be observed. Figure 5c and d display the corresponding SNR and speech distortion (SD) respectively, computed according to (76), where the first term of the SNR is the output SNR and the second term is the input SNR at the unprocessed reference microphone, and in this scenario the a priori RTF vector h was used as the true RTF vector. The true value of the RTF vector is unknown; hence, the results of Fig. 5c and d are suggestive for the case when the true RTF vector corresponds to that of the a priori assumed RTF vector. In Fig.
5c, since w → 0 in the region where ε̂ ≥ 1 and ε ≥ 1, this region is purposefully hatched so as to indicate that an output SNR is undefined there.
As expected, it can be observed that the best SNR is achieved in the region where the DDC is inactive and the APC is active, with a compromise within the region where the two constraints are active. An interesting observation here is the poor SNR in the region where ε → 0 and ε̂ → 0. Even though the maximum tolerable speech distortions have been specified to be quite small, in this case h and ĥ can be parallel, which can lead to redundant constraints and an ill-conditioning problem as discussed in [22]. In terms of the SD, fairly low distortions are achieved when either of the constraints is active or when both are active. As both ε → 1 and ε̂ → 1, the speech distortion increases, which is expected from (70) and (72), i.e. the SDW-MWF parameters, μ and μ̂. As μ → ∞, ε → 1, and as μ̂ → ∞, ε̂ → 1, which accounts for the increasing speech distortion. Another observation from Fig. 5d is that a low speech distortion is also achieved in the region where the APC bounding curve is at a minimum, regardless of the value of ε. As discussed in Section 5.2, for a value of ε beyond ε_o (where ε_o is the value of ε on the APC bounding curve from (60)), the achievable distortion would in fact correspond to ε_o on the APC bounding curve, which is quite low in this minimum region. Figure 6 now displays a similar set of results, however for the case when the desired speech source was located at 60° as depicted in Fig. 4. As the a priori RTF vector was based on a speaker located at 0°, this scenario represented a mismatch between the a priori RTF vector and the true RTF vector. The same procedure as previously described was followed to obtain the MVDR-INT filters. Figure 6a and b display the resulting (base-10 log) values of the Lagrange multipliers α and β respectively as a function of ε and ε̂, along with the APC and DDC bounding curves. The nature of these plots is quite similar to that of Fig. 5a and b in terms of how α and β vary with respect to the bounding curves. In comparison to Fig. 5a and b, Fig.
6a and b also highlight the fact that these bounding curves can have quite different appearances. Figure 6c and d display the corresponding SNR and SD respectively, however with ĥ used as the true RTF vector in (76), and hence the results are suggestive for the case when the true RTF vector corresponds to that of the data-dependent RTF vector. It can now be observed that the best SNR is achieved in the region where the APC is inactive and the DDC is active, with a compromise within the region where the two constraints are active. For the SD, fairly low speech distortions are achieved for small values of ε̂, as expected. For small values of ε and large values of ε̂, i.e. toward the region where only the APC is active, it can be observed that the speech distortion increases, which is a direct result of the speech source not being in the a priori defined direction of 0°. Once again, it can also be seen that the speech distortion generally increases as both ε → 1 and ε̂ → 1.
The results of Figs. 5 and 6 provide further insight into the behaviour of the MVDR-INT and demonstrate that in some scenarios a better performance can be achieved when either only the APC or only the DDC is active. Furthermore, there were transition regions where a compromise could be achieved between these limits of performance. This suggests that tuning strategies such as those depicted in Fig. 2 would indeed be an appropriate means of obtaining an optimal filter, as opposed to relying on only an APC or DDC.

Performance of tuning strategies
The audio recordings as previously described for the scenario depicted in Fig. 4 were also used to assess the performance of the tuning strategies. A desired speech signal was created where the desired speech source was initially located at 0° for a duration of 5 s and then instantaneously moved to 60° for another 6 s. This was then mixed with a random sample of the cocktail party noise at a broadband input SNR of 2 dB. The same a priori RTF vector pertaining to the hearing aid microphones, h_a(k), as previously described was used, i.e. h_a(k) was computed for a source located at 0° and 1 m from the dummy head. For the STFT processing, the WOLA method was used, with a DFT size of 256 samples, 50% overlap, a square-root Hann window, and a sampling frequency of 16 kHz. Using the SPP [25] computed on XM2, frames were classified as containing speech if the SPP > 0.8; otherwise, the frames were classified as noise only. All RTF vector estimates were performed in frames which were classified as containing speech. All the relevant correlation matrices were estimated using a forgetting factor corresponding to an averaging time of 300 ms. R_nn was only estimated when the SPP < 0.8.
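One common way to realise a forgetting factor corresponding to a given averaging time, together with the recursive correlation matrix update it drives, can be sketched as follows (a sketch under stated assumptions: exponential averaging with a WOLA hop of 128 samples, i.e. 50% overlap of the 256-sample DFT, at 16 kHz; the exact mapping used in the paper is not specified):

```python
import numpy as np

def forgetting_factor(t_avg=0.3, hop=128, fs=16000):
    """Forgetting factor for an exponential-averaging time constant of
    t_avg seconds, given a frame hop of `hop` samples at rate fs."""
    return np.exp(-hop / (fs * t_avg))

def update_cov(R, y, lam):
    """One recursive update of a spatial correlation matrix with the
    current frame's signal vector y: R <- lam * R + (1 - lam) * y y^H."""
    return lam * R + (1.0 - lam) * np.outer(y, y.conj())
```

With SPP gating as described above, `update_cov` would be applied to R_yy in frames with SPP > 0.8 and to R_nn in frames with SPP < 0.8.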
For the MVDR-INT, two tuning strategies were considered: (i) the trade-off between the maximum tolerable speech distortions for the APC and DDC, corresponding to Fig. 2a, which will be referred to as MVDR-INT-3a, and (ii) a strategy where the maximum tolerable speech distortion for the APC is constant, but the maximum tolerable speech distortion for the DDC varies, corresponding to Fig. 2b, which will be referred to as MVDR-INT-3b. For both tunings, μ = μ̂ = 0.001, and σ²_{s_a,1} was computed using the method from [39] as implemented in [40], but with the noise estimation update computed as in [25]. A different setting was used for the confidence metric, F(l), in (62) for each of the tunings, such that for MVDR-INT-3a, ρ = 1 and λ_t = 5 dB, and for MVDR-INT-3b, ρ = 1 and λ_t = 10 dB, i.e. a higher threshold was used for the tuning biased towards the MVDR-AP. With all parameters assigned, the QCQP from (42) was solved using the gradient ascent procedure described in Algorithm 1.
The metrics used to evaluate the following experiments were the speech intelligibility-weighted SNR (SI-SNR) [41], the short-time objective intelligibility (STOI) [42], and the normalised speech-to-reverberation modulation energy ratio for cochlear implants (SRMR-CI) [43]. The SI-SNR improvement in relation to the reference microphone was calculated as ΔSI-SNR = Σ_i I_i (SNR_{i,out} − SNR_{i,in}), where the band importance function I_i expresses the importance of the i-th one-third octave band with centre frequency f_{c_i} for intelligibility, SNR_{i,in} is the input SNR (dB), and SNR_{i,out} is the output SNR (dB) in the i-th one-third octave band. The centre frequencies, f_{c_i}, and the values for I_i are defined in [44]. The input SNR was computed using the unprocessed speech-only and unprocessed noise-only components (in the discrete time domain) at the reference microphone, and the output SNR from the individually processed speech-only and processed noise-only components (in the discrete time domain) resulting from the particular algorithm. For the STOI metric, the reference signal used was the unprocessed desired speech source convolved with 256 samples (i.e. the same length as the DFT size) of the (pre-measured) impulse response from the desired speech signal location to the reference microphone. As the room was quite reverberant, however, a true reference signal is somewhat ambiguous to define, and hence the non-intrusive metric, SRMR-CI, suitable for hearing instruments, in particular cochlear implants, was also used. Figure 7 displays the performance of the various algorithms, where all the metrics have been computed in 2-s time frames with a 25% overlap. The relative improvements of the SI-SNR and the STOI metrics in relation to the reference microphone have been plotted. The metrics for XM1 and XM2 from Fig. 4 are also plotted.
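The band-importance-weighted SI-SNR improvement described above can be sketched as follows (an illustrative helper; the band importance values I_i themselves would be taken from [44]):

```python
import numpy as np

def si_snr_improvement(snr_in_db, snr_out_db, band_importance):
    """Speech intelligibility-weighted SNR improvement over one-third
    octave bands: sum_i I_i * (SNR_i_out - SNR_i_in), all in dB.

    snr_in_db, snr_out_db : per-band input and output SNRs (dB)
    band_importance       : band importance weights I_i (assumed to sum to 1)
    """
    I = np.asarray(band_importance, dtype=float)
    delta = np.asarray(snr_out_db, dtype=float) - np.asarray(snr_in_db, dtype=float)
    return float(np.sum(I * delta))
```

The per-band SNRs would be computed from the (un)processed speech-only and noise-only components filtered into one-third octave bands, as described in the text.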
In order to contextualise the values of the SRMR-CI metric, an additional plot of the performance for the reference signal (that which was used for the STOI metric) is displayed. From all the metrics, as expected, the MVDR-AP performs better than the MVDR-DD in the first 5 s as the speech source was at 0°, i.e. the a priori direction. However, in the latter 6 s, when the speech source was at 60°, the MVDR-DD achieves a better performance.
With respect to the XMs, it can also be seen that the performance of XM1 decreases after 5 s as the source moves to the location of 60 • , while XM2 has more of a consistent performance across the different speech locations. In terms of the SI-SNR, the performance of all of the other algorithms is better than either of the XMs, which demonstrates that simply listening to the XM only would not always immediately yield satisfactory performance.
Within the first 5 s, the MVDR-INT-3a is able to find a compromise between the MVDR-AP and MVDR-DD in terms of all metrics. In the final 6 s, although the STOI is once again in between the MVDR-AP and MVDR-DD, the performance in terms of SI-SNR and SRMR-CI is in fact better than either of the MVDR-AP or the MVDR-DD. This is a direct consequence of the nature of the integrated MVDR-LMA-XM beamformer as different linear combinations of the MVDR-AP and the MVDR-DD are effectively applied to different time-frequency segments, yielding a broadband SI-SNR that could be better than either the MVDR-AP or MVDR-DD. For the MVDR-INT-3b, within the first 5 s, the performance in terms of all the metrics is closer to that of the MVDR-AP which is expected as the APC is kept active at all times. In the following 6 s, the STOI metric indicates that the speech intelligibility has not changed from that of the MVDR-AP. However, an improvement can be observed in both SI-SNR and SRMR-CI metrics as some frequency bins would have also had the DDC active.
The corresponding confidence metrics across all time frames and frequencies for the MVDR-INT-3a and the MVDR-INT-3b are displayed in Fig. 8. The upper plot corresponds to the confidence metric of MVDR-INT-3a and reveals that much of the confidence has been placed on the higher frequencies, presumably because there was less noise in this region. Therefore, a smaller value of ε̂ and a larger value of ε would have been assigned to the DDC and APC respectively, i.e. the MVDR-INT-3a in this region would have tended toward the MVDR-DD. Several regions of uncertainty are also observed, where the MVDR-INT-3a would then find a compromise between the MVDR-AP and the MVDR-DD. In the lower plot of Fig. 8, the confidence metric for the MVDR-INT-3b shows a much more conservative behaviour due to the larger threshold λ_t. It is observed that there are now many regions where there is little confidence, and hence a larger value of ε̂ and a smaller value of ε would have been assigned to the DDC and APC respectively, i.e. the MVDR-INT-3b in these regions would have tended toward the MVDR-AP. More confidence is now only placed in the higher frequency region, and there are still some regions of uncertainty so that a compromise can be achieved. The resulting audio signals from this section may also be listened to for a subjective evaluation at [45].

Conclusion
An integrated MVDR beamformer that merges the benefits of using an available a priori relative transfer function (RTF) vector and a data-dependent RTF vector was developed for a microphone configuration consisting of a local microphone array (LMA) and multiple external microphones (XMs). The framework was presented in a pre-whitened-transformed (PWT) domain, which consists of an initial transformation of the microphone signals through a blocking matrix and a fixed beamformer, followed by a pre-whitening operation, facilitating convenient processing operations. In the PWT domain, procedures for obtaining an a priori RTF vector and a data-dependent RTF vector were also derived, where the a priori RTF vector is based on an a priori RTF vector pertaining to the LMA only.
With the two RTF vectors, an integrated MVDR beamformer was proposed by formulating a quadratically constrained quadratic program (QCQP), with two constraints, one of which is related to the maximum tolerable speech distortion for the imposition of the a priori RTF vector and the other related to the maximum tolerable speech distortion for the imposition of the data-dependent RTF vector. It was shown how the space spanned by each of these maximum tolerable speech distortions could be divided into four separate regions, each of which corresponded to a particular set of constraints being active or inactive. This insight then facilitated the development of a general tuning framework where the maximum tolerable speech distortions are chosen in accordance with the confidence in the accuracy of the data-dependent RTF vector. A particular set of tuning rules was also proposed, which made use of a relationship to the speech distortion weighted multi-channel Wiener filter.
The potential of the integrated MVDR beamformer was demonstrated by using audio data from an LMA of behind-the-ear hearing aid microphones and three XMs for a single desired speech source within a re-created cocktail party scenario. A narrowband evaluation confirmed the theoretical behaviour of the integrated MVDR