Auxiliary function-based algorithm for blind extraction of a moving speaker

In this paper, we propose a novel algorithm for blind source extraction (BSE) of a moving acoustic source recorded by multiple microphones. The algorithm is based on independent vector extraction (IVE), where the contrast function is optimized using the auxiliary function-based technique and where the recently proposed constant separating vector (CSV) mixing model is assumed. CSV allows for movements of the extracted source within the analyzed batch of recordings. We provide a practical explanation of how the CSV model works when extracting a moving acoustic source. Then, the proposed algorithm is experimentally verified on the task of blind extraction of a moving speaker. The algorithm is compared with state-of-the-art blind methods and with an adaptive BSE algorithm which processes data in a sequential manner. The results confirm that the proposed algorithm can extract a moving speaker better than BSE methods based on the conventional mixing model and that it achieves higher extraction accuracy than the adaptive method.


Introduction
This paper addresses the problem in which sound is sensed by multiple microphones and the goal is to extract a signal of interest originating from an individual source. We particularly address the case in which the corresponding source is a speaker who is moving during the recording. A blind scenario is considered where no information about the environment or about the positions of the microphones and sources is available, and no training data are available. This is the task of blind source separation (BSS) or, particularly, of blind source extraction (BSE). These signal processing fields embrace numerous methods such as nonnegative matrix/tensor factorization, clustering and classification approaches, or sparsity-aware methods; see [1][2][3] for surveys. We will consider the approach of independent component analysis (ICA), where the mixed signals are separated into the original signals based on the assumption that the original signals are statistically independent [4]. In the case of audio sources such as speakers, this fundamental condition is met, which makes ICA attractive for practical applications.
ICA can separate instantaneous mixtures of non-Gaussian independent signals up to their indeterminable order and scales [5]. Since acoustic mixtures are convolutive due to delays and reverberation, the narrow-band approach can be considered. Here, ICA is applied in the short-time Fourier transform (STFT) domain separately in each frequency bin; this approach is referred to as frequency-domain ICA (FDICA) [3,6]. However, the separate applications of ICA in FDICA cause the so-called permutation problem due to the indeterminable order of the separated signals: the separated frequency components have a random order and must be aligned in order to retrieve the full-band separated signals [7]. Independent vector analysis (IVA) treats all frequencies simultaneously using a joint statistical source model [8,9]. The frequency components of the original signals form the so-called vector components. IVA aims at maximizing higher-order dependencies between the frequency components within each vector component while the whole vector components should be independent [9]. IVA is thus an extension of ICA to the joint separation of several instantaneous mixtures (one per frequency bin).
A recent extension of IVA is independent low-rank matrix analysis (ILRMA), where the vector components are assumed to obey a low-rank source model. For example, ILRMA combines IVA and nonnegative matrix factorization (NMF) in [10,11] and involves deep learning in [12]. The counterparts of ICA and IVA designed for BSE, i.e., for the extraction of one independent source, are called independent component/vector extraction (ICE/IVE) [13,14]. Very recently, IVE has been extended towards simultaneous source extraction and dereverberation [15].
In principle, the aforementioned methods differ in source modeling while they share the conventional time-invariant linear mixing model. This model describes situations that do not change during the recording time, which also means that the sources (speakers) are assumed to be static. To separate or extract moving sources, the methods can be used in an adaptive way by being applied to short intervals during which the mixture is approximately static. Such modifications are typically implemented to process data sample-by-sample (frame-by-frame) or batch-by-batch using some forgetting update of the inner parameters [16][17][18]; many such methods have also been considered in biomedical applications; see, e.g., [19]. Although these methods are useful, they have several shortcomings. Namely, the sources can be separated in a different order at different times due to the indeterminacy of ICA; we refer to this as the discontinuity problem. Also, the separation accuracy is limited by the length of the context from which the time-variant separating parameters are computed. The methods involve parameters such as a learning rate or forgetting factors for recursive processing. The optimum values of those parameters depend on the input data in an unknown way. The control and tuning of these adaptive implementations therefore poses a difficult and application-dependent problem.
In this paper, we propose a novel algorithm for IVE based on the constant separating vector (CSV) mixing model, which is called CSV-AuxIVE. CSV-AuxIVE belongs to the family of auxiliary function-based methods [17,20,21]. These methods use a majorization-minimization approach for finding the optimum of a contrast function derived based on the maximum likelihood principle and do not involve any learning rate parameter. In particular, CSV-AuxIVE can be seen as an extension of the recent OverIVA algorithm from [22] allowing for the CSV mixing model. CSV was first considered in the preliminary conference report [23]. It involves time-variant mixing parameters while it simultaneously assumes time-invariant (constant) separating parameters. The model enables us to avoid the discontinuity problem and to improve the extraction performance, because the extraction accuracy depends on the length of the entire recording modeled by CSV [24]. The proposed CSV-AuxIVE adopts these important features and provides a new blind method which is much faster than the gradient-based algorithm used in [23].
The article is organized as follows: In Section 2, the technical definition of the BSE problem is given, the CSV mixing model is described and explained from a practical point of view, and the contrast function for the blind extraction is derived. In Section 3, the proposed CSV-AuxIVE algorithm is described, including its piloted variant, which enables a partial control of the convergence using prior knowledge of the desired signal. Section 4 is devoted to experimental evaluations based on simulated as well as real-world data. The paper is concluded in Section 5. The supplementary material of this paper contains a detailed derivation of the gradient-based algorithm from [23], referred to as BOGIVE_w.
Notation Plain letters denote scalars, bold letters denote vectors, and bold capital letters denote matrices. Upper indices such as $\cdot^T$, $\cdot^H$, or $\cdot^*$ denote, respectively, transposition, conjugate transpose, or complex conjugate. The Matlab convention for matrix/vector concatenation and indexing will be used, e.g., $[1; \mathbf{g}] = [1, \mathbf{g}^T]^T$, and $(\mathbf{a})_i$ is the $i$th element of $\mathbf{a}$. $E[\cdot]$ stands for the expectation operator, and $\hat{E}[\cdot]$ is the average taken over all available samples of the symbolic argument. The letters $k$ and $t$ are used as integer indices of the frequency bin and block, respectively; $\{\cdot\}_k$ is a short notation of the argument with all values of index $k$, e.g., $\{w_k\}_k$ means $w_1, \ldots, w_K$, and $\{w_{k,t}\}_{k,t}$ means $w_{1,1}, \ldots, w_{K,T}$.

Problem formulation
A static mixture of audio signals that propagate in an acoustic environment from point sources to microphones can be described by the time-invariant convolutive model. Let there be d sources observed by m microphones. The signal on the ith microphone is described by

$x_i(n) = \sum_{j=1}^{d} \sum_{\tau=0}^{L-1} h_{ij}(\tau)\, s_j(n - \tau)$,  (1)

where n is the sample index, $s_1(n), \ldots, s_d(n)$ are the original signals coming from the sources, and $h_{ij}$ denotes the time-invariant impulse response of length L between the jth source and the ith microphone.
In the short-time Fourier transform (STFT) domain, the convolution can be approximated by multiplication. Let $x_i(k, \ell)$ and $s_j(k, \ell)$ denote, respectively, the STFT coefficients of $x_i(n)$ and $s_j(n)$ at frequency $k$ and frame $\ell$. Then, (1) can be replaced by a set of K complex-valued linear instantaneous mixtures

$\mathbf{x}_k = \mathbf{A}_k \mathbf{s}_k, \quad k = 1, \ldots, K$,  (2)

where $\mathbf{x}_k$ and $\mathbf{s}_k$ are symbolic vectors representing, respectively, $[x_1(k, \ell), \ldots, x_m(k, \ell)]^T$ and $[s_1(k, \ell), \ldots, s_d(k, \ell)]^T$, for any frame $\ell = 1, \ldots, N$; $\mathbf{A}_k$ stands for the $m \times d$ mixing matrix whose $ij$th element is related to the $k$th Fourier coefficient of the impulse response $h_{ij}$; K is the frequency resolution of the STFT. For detailed explanations, see, e.g., Chapters 1 through 3 in [3].
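In code, the narrowband model amounts to one complex-valued instantaneous mixture per frequency bin. The following sketch assembles the observations $\mathbf{x}_k = \mathbf{A}_k \mathbf{s}_k$ for all bins at once; the shapes and the random data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, d, N = 257, 4, 4, 100   # bins, mics, sources, frames (illustrative values)

# STFT coefficients of the sources and one mixing matrix per frequency bin
S = rng.standard_normal((K, d, N)) + 1j * rng.standard_normal((K, d, N))
A = rng.standard_normal((K, m, d)) + 1j * rng.standard_normal((K, m, d))

# x_k = A_k s_k, evaluated for every bin k and every frame at once
X = np.einsum('kmd,kdn->kmn', A, S)
```

The einsum contraction is just a batched matrix product; `X[k]` equals `A[k] @ S[k]` for every bin.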

Blind source extraction
For the BSE problem, we can write (2) in the form

$\mathbf{x}_k = \mathbf{a}_k s_k + \mathbf{y}_k$,  (3)

where $s_k$ represents the source of interest (SOI), $\mathbf{a}_k$ is the corresponding column of $\mathbf{A}_k$, called the mixing vector, and $\mathbf{y}_k$ represents the remaining signals in $\mathbf{x}_k$, i.e., $\mathbf{y}_k = \mathbf{x}_k - \mathbf{a}_k s_k$.
Since there is the ambiguity that any of the original sources can play the role of the SOI, we can assume, without loss of generality, that the SOI corresponds to the first source in (2); hence, a k is the first column of A k . The problem of guaranteeing the extraction of the desired SOI will be addressed in Section 3.3.
The assumption that the original signals in (2) are independent implies that $s_k$ and $\mathbf{y}_k$ are independent. We will also assume that m = d, i.e., that the number of microphones equals the number of sources. It follows that the mixing matrices $\mathbf{A}_k$ are square. By assuming also that they are non-singular, so that their inverse matrices exist, the existence of a separating vector $\mathbf{w}_k$ (the first row of $\mathbf{A}_k^{-1}$) such that $\mathbf{w}_k^H \mathbf{x}_k = s_k$ is guaranteed. We pay for this advantage by the limitation that $\mathbf{y}_k$ belongs to a subspace of dimension d − 1. In other words, the covariance of $\mathbf{y}_k$ is assumed to have rank d − 1, as opposed to real recordings, where the typical rank is d (e.g., due to sensor and environment noises). Nevertheless, the assumption m = d brings more advantages than disadvantages, as shown in [10]. One way to compensate is to increase the number of microphones so that the ratio (d − 1)/d approaches 1. BSE appears to be computationally more efficient than BSS when d is large since, in BSE, $\mathbf{y}_k$ is not separated into individual signals.
In [13], the BSE problem is formulated by exploiting the fact that the d − 1 latent variables (background signals) involved in $\mathbf{y}_k$ can be defined arbitrarily. An effective parameterization that involves only the mixing and separating vectors related to the SOI has been derived. Specifically, $\mathbf{A}_k$ and $\mathbf{A}_k^{-1}$ (denoted as $\mathbf{W}_k$) have the structure

$\mathbf{A}_k = [\mathbf{a}_k, \mathbf{Q}_k] = \begin{bmatrix} \gamma_k & \mathbf{h}_k^H \\ \mathbf{g}_k & \frac{1}{\gamma_k}(\mathbf{g}_k \mathbf{h}_k^H - \mathbf{I}_{d-1}) \end{bmatrix}$  (4)

and

$\mathbf{W}_k = \begin{bmatrix} \mathbf{w}_k^H \\ \mathbf{B}_k \end{bmatrix} = \begin{bmatrix} \beta_k^* & \mathbf{h}_k^H \\ \mathbf{g}_k & -\gamma_k \mathbf{I}_{d-1} \end{bmatrix}$,  (5)

where $\mathbf{I}_d$ denotes the $d \times d$ identity matrix, $\mathbf{w}_k$ denotes the separating vector, which is partitioned as $\mathbf{w}_k = [\beta_k; \mathbf{h}_k]$, and the mixing vector $\mathbf{a}_k$ is partitioned as $\mathbf{a}_k = [\gamma_k; \mathbf{g}_k]$. The vectors $\mathbf{a}_k$ and $\mathbf{w}_k$ are linked through the so-called distortionless constraint

$\mathbf{w}_k^H \mathbf{a}_k = 1$,  (6)

which, equivalently, means $\beta_k^* \gamma_k + \mathbf{h}_k^H \mathbf{g}_k = 1$. The matrix $\mathbf{B}_k = [\mathbf{g}_k, -\gamma_k \mathbf{I}_{d-1}]$ is called the blocking matrix as it satisfies $\mathbf{B}_k \mathbf{a}_k = \mathbf{0}$. The background signals are given by $\mathbf{z}_k = \mathbf{B}_k \mathbf{x}_k = \mathbf{B}_k \mathbf{y}_k$, and it holds that $\mathbf{y}_k = \mathbf{Q}_k \mathbf{z}_k$. To summarize, (2) is recast for the BSE problem as

$\mathbf{x}_k = \mathbf{a}_k s_k + \mathbf{Q}_k \mathbf{z}_k$.  (7)
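The parameterization can be checked numerically. The following sketch builds $\mathbf{W}_k$ and $\mathbf{A}_k$ from randomly drawn vectors tied by the distortionless constraint and verifies that they are mutually inverse; the helper name `build_pair` is ours:

```python
import numpy as np

def build_pair(w, a):
    """Build A = [a, Q] and W = [w^H; B] from vectors tied by w^H a = 1."""
    d = w.shape[0]
    gamma, g = a[0], a[1:]
    h = w[1:]
    B = np.hstack([g[:, None], -gamma * np.eye(d - 1)])     # blocking matrix, B a = 0
    W = np.vstack([w.conj()[None, :], B])
    Q = np.vstack([h.conj()[None, :],
                   (np.outer(g, h.conj()) - np.eye(d - 1)) / gamma])
    A = np.hstack([a[:, None], Q])
    return A, W

rng = np.random.default_rng(1)
d = 5
a = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w = w / np.conj(w.conj() @ a)        # rescale so that w^H a = 1 holds
A, W = build_pair(w, a)
```

With the distortionless constraint enforced, `W @ A` reproduces the identity matrix up to floating-point error.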

CSV mixing model
Now, we turn to an extension of (7) to time-varying mixtures. Let the available samples of the observed signals (meaning the STFT coefficients from N frames) be divided into T intervals; for the sake of simplicity, we assume that the intervals have the same integer length N b = N/T. The intervals will be called blocks and will be indexed by t ∈ {1, . . . , T}.
A straightforward extension of (7) to time-varying mixtures is obtained when all parameters, i.e., the mixing and separating vectors, are block-dependent. However, such an extension brings no advantage compared to processing each block separately. In the constant separating vector (CSV) mixing model, it is assumed that only the mixing vectors are block-dependent while the separating vectors are constant over the blocks. Hence, the mixing and de-mixing matrices on the tth block are parameterized, respectively, as

$\mathbf{A}_{k,t} = [\mathbf{a}_{k,t}, \mathbf{Q}_{k,t}]$  (8)

and

$\mathbf{W}_{k,t} = \begin{bmatrix} \mathbf{w}_k^H \\ \mathbf{B}_{k,t} \end{bmatrix}$.  (9)

Each sample of the observed signals on the tth block is modeled according to

$\mathbf{x}_{k,t} = \mathbf{a}_{k,t} s_{k,t} + \mathbf{Q}_{k,t} \mathbf{z}_{k,t}$,  (10)

where $s_{k,t}$ and $\mathbf{z}_{k,t}$ represent, respectively, the kth frequency of the SOI and of the background signals at any frame within the tth block. Note that the CSV model coincides with the static model (7) when T = 1. The practical meaning of the CSV model is illustrated in Fig. 1. While CSV admits that the SOI can change its position from block to block (the mixing vectors $\mathbf{a}_{k,t}$ depend on t), the block-independent separating vector $\mathbf{w}_k$ is sought such that $\mathbf{w}_k^H \mathbf{x}_{k,t}$ extracts the speaker's voice from all positions visited during its movement. There are two main reasons for this: First, the achievable interference-to-signal ratio (ISR) depends on $\mathbf{w}_k$, so it has order $O(N^{-1})$, compared to the case when $\mathbf{w}_k$ is block-dependent, which yields an ISR of order $O(N_b^{-1})$; this is confirmed by the theoretical study on Cramér-Rao bounds in [24]. Second, the CSV enables BSE methods to avoid the discontinuity problem mentioned in the previous section.
The CSV also brings a limitation. Formally, the mixture must obey the condition that, for each k, a separating vector exists such that $s_{k,t} = \mathbf{w}_k^H \mathbf{x}_{k,t}$ holds for every t; a condition that seems to be quite restrictive. Nevertheless, preliminary experiments in [23] have shown that this limitation is not crucial in practical situations and does not differ much from that of static methods (spatially overlapping speakers cannot be separated), especially when the number of microphones is high enough to provide sufficient degrees of freedom. When the speakers are static, the rule of thumb says that the speakers cannot be separated, or at least are difficult to separate through spatial filtering, when their angular positions with respect to the microphone array are the same. Hence, moving speakers cannot be separated based on the CSV when their angular ranges with respect to the array overlap during the recording. The experimental part of this work, presented in Section 4, validates these findings.

Source model
In this section, we introduce the statistical model of the signals adopted from IVE. Samples (frames) of the signals will be assumed to be independently and identically distributed (i.i.d.) within each block according to the probability density function (pdf) of the representing random variable.
Let $\mathbf{s}_t$ denote the vector component corresponding to the SOI, i.e., $\mathbf{s}_t = [s_{1,t}, \ldots, s_{K,t}]^T$. The elements of $\mathbf{s}_t$ are assumed to be uncorrelated (because they correspond to different frequency components of the SOI) but dependent, that is, their higher-order moments are taken into account [9]. Let $p_s(\mathbf{s}_t)$ denote the joint pdf of $\mathbf{s}_t$ and $p_{z_{k,t}}(\mathbf{z}_{k,t})$ denote the pdf of $\mathbf{z}_{k,t}$. To simplify the notation, $p_s(\cdot)$ will be denoted without the index t although it is generally dependent on t. Since $\mathbf{s}_t$ and $\mathbf{z}_{1,t}, \ldots, \mathbf{z}_{K,t}$ are independent, their joint pdf within the tth block is equal to the product of the marginal pdfs

$p(\mathbf{s}_t, \mathbf{z}_{1,t}, \ldots, \mathbf{z}_{K,t}) = p_s(\mathbf{s}_t) \prod_{k=1}^{K} p_{z_{k,t}}(\mathbf{z}_{k,t})$.  (11)

By applying the transformation theorem to (11) using (10), it follows that the joint pdf of the observed signals from the tth block reads

$p_x(\{\mathbf{x}_{k,t}\}_k) = p_s(\{\mathbf{w}_k^H \mathbf{x}_{k,t}\}_k) \prod_{k=1}^{K} p_{z_{k,t}}(\mathbf{B}_{k,t} \mathbf{x}_{k,t}) \prod_{k=1}^{K} |\det \mathbf{W}_{k,t}|^2$.  (13)

Hence, the log-likelihood function as a function of the parameter vectors $\mathbf{w}_k$ and $\mathbf{a}_{k,t}$ and all available samples of the observed signals in the tth block is given by

$\mathcal{L}(\{\mathbf{w}_k, \mathbf{a}_{k,t}\}_k) = \hat{E}\big[\log p_s(\{\hat{s}_{k,t}\}_k)\big] + \sum_{k=1}^{K} \hat{E}\big[\log p_{z_{k,t}}(\hat{\mathbf{z}}_{k,t})\big] + \sum_{k=1}^{K} \log |\det \mathbf{W}_{k,t}|^2$,  (14)

where $\hat{s}_{k,t} = \mathbf{w}_k^H \mathbf{x}_{k,t}$ and $\hat{\mathbf{z}}_{k,t} = \mathbf{B}_{k,t} \mathbf{x}_{k,t}$ denote the current estimates of the SOI and of the background signals, respectively.
In BSS and BSE, the true pdfs of the original sources are not known, so suitable model densities have to be chosen in order to derive a contrast function based on (14). To find an appropriate surrogate of $p_s(\mathbf{s}_t)$, the variance of the SOI, which can change from block to block,³ has to be taken into account. Let $f(\cdot)$ be a pdf corresponding to a normalized non-Gaussian random variable. To reflect the block-dependent variance, $p_s(\mathbf{s}_t)$ should be replaced by

$\left(\prod_{k=1}^{K} \frac{1}{\sigma_{k,t}^2}\right) f\!\left(\left\{\frac{s_{k,t}}{\sigma_{k,t}}\right\}_k\right)$,

where $\sigma_{k,t}^2$ denotes the variance of $s_{k,t}$. Its unknown value is replaced by the sample-based variance of $\hat{s}_{k,t}$, which is

$\hat{\sigma}_{k,t}^2 = \hat{E}\big[|\hat{s}_{k,t}|^2\big]$.

It is worth noting that in the case of the static mixing model, i.e., when T = 1, it can be assumed that $\sigma_{k,t}^2 = 1$ because of the scaling ambiguity.
Similarly to [13], the pdf of the background is assumed to be circular Gaussian with zero mean and (unknown) covariance $\mathbf{C}_{z_{k,t}}$. Next, by Eq. (15) in [13], it follows that $|\det \mathbf{W}_{k,t}|^2 = |\gamma_{k,t}|^{2(d-2)}$, which corresponds to the third term in (14). Now, by replacing the unknown pdfs in (14) and by neglecting the constant terms, we obtain the contrast function in the form

$\mathcal{C}(\{\mathbf{w}_k\}_k, \{\mathbf{a}_{k,t}\}_{k,t}) = \sum_{t=1}^{T} \left( \hat{E}\!\left[\log f\!\left(\left\{\frac{\hat{s}_{k,t}}{\hat{\sigma}_{k,t}}\right\}_k\right)\right] + \sum_{k=1}^{K} \left( -\log \hat{\sigma}_{k,t}^2 - \hat{E}\big[\hat{\mathbf{z}}_{k,t}^H \mathbf{C}_{z_{k,t}}^{-1} \hat{\mathbf{z}}_{k,t}\big] - \log\det \mathbf{C}_{z_{k,t}} + (d-2)\log|\gamma_{k,t}|^2 \right) \right)$.  (16)

The nuisance parameter $\mathbf{C}_{z_{k,t}}$ will later be replaced by its sample-based estimate.

³ The variance can be changing from block to block not only due to the signal nonstationarity, but also because of the movements of the source.

Orthogonal constraint
Finding the maximum of (16) with respect to the separating and mixing vectors leads to their consistent estimation, hence to the solution of the BSE problem. The parameter vectors are linked through the distortionless constraint given by (6). However, as was already noticed in previous publications [13,22,25], this constraint appears to be too weak as it does not guarantee that both vectors finally found by an algorithm correspond to the SOI. Therefore, an additional constraint has to be imposed. The orthogonal constraint (OGC) ensures that the current estimate of the SOI, $\hat{s}_{k,t} = \mathbf{w}_k^H \mathbf{x}_{k,t}$, has zero sample correlation with the background signals $\hat{\mathbf{z}}_{k,t} = \mathbf{B}_{k,t} \mathbf{x}_{k,t}$.
Hence, the constraint is that

$\hat{E}\big[\hat{s}_{k,t} \hat{\mathbf{z}}_{k,t}^H\big] = \mathbf{w}_k^H \hat{\mathbf{C}}_{k,t} \mathbf{B}_{k,t}^H = \mathbf{0}$

for every k and t, under the condition given by (6), where $\hat{\mathbf{C}}_{k,t} = \hat{E}[\mathbf{x}_{k,t} \mathbf{x}_{k,t}^H]$ denotes the sample covariance matrix of the observed signals. In Appendix A in [13], it is shown that the OGC can be imposed by making $\mathbf{a}_{k,t}$ fully dependent on $\mathbf{w}_k$ through

$\mathbf{a}_{k,t} = \frac{\hat{\mathbf{C}}_{k,t} \mathbf{w}_k}{\mathbf{w}_k^H \hat{\mathbf{C}}_{k,t} \mathbf{w}_k}$.  (17)

Alternatively, $\mathbf{w}_k$ can be considered as dependent on $\mathbf{a}_{k,t}$ [13]; however, we prefer the former formulation in this paper because, in the proposed algorithm, the optimization proceeds through the separating vectors $\mathbf{w}_k$.
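The OGC and its consequences can be verified numerically. In the sketch below (helper name `ogc_mixing_vector` is ours, and the blocking matrix is one valid choice satisfying $\mathbf{B}\mathbf{a} = \mathbf{0}$), the mixing vector computed by (17) automatically satisfies the distortionless constraint, and the SOI estimate is exactly uncorrelated with the background over the same samples used to form the covariance:

```python
import numpy as np

def ogc_mixing_vector(C, w):
    """Orthogonal constraint (17): a = C w / (w^H C w)."""
    return C @ w / (w.conj() @ C @ w)

rng = np.random.default_rng(3)
d, N = 4, 5000
X = rng.standard_normal((d, N)) + 1j * rng.standard_normal((d, N))
C = X @ X.conj().T / N                      # sample covariance of the mixture
w = rng.standard_normal(d) + 1j * rng.standard_normal(d)

a = ogc_mixing_vector(C, w)
gamma, g = a[0], a[1:]
B = np.hstack([g[:, None], -gamma * np.eye(d - 1)])   # blocking matrix, B a = 0

s_hat = w.conj() @ X                        # SOI estimate
z_hat = B @ X                               # background estimate
cross = (z_hat * s_hat.conj()).mean(axis=1)  # sample correlation E[z s^*] = B C w
```

Because $\hat{\mathbf{C}}\mathbf{w}$ is proportional to $\mathbf{a}$ and $\mathbf{B}\mathbf{a} = \mathbf{0}$, the cross-correlation vanishes to machine precision.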

Auxiliary function-based algorithm
In [20], N. Ono derived the AuxIVA algorithm using an auxiliary function-based optimization (AFO) technique. AuxIVA provides a much faster and more stable alternative to the natural gradient-based algorithm from [9]. The main principle of the AFO technique lies in replacing the first term in (16) by a majorizing term involving an auxiliary variable. The modified contrast function is named the auxiliary function. It is optimized in the auxiliary and normal variables alternately, by which the maximum of the original contrast function is found. Very recently, a modification of AuxIVA for the blind extraction of q sources, where q < d, has been proposed in [22]; the algorithm is named OverIVA. In this section, we will apply the AFO technique to find the maximum of (16). The resulting algorithm, which could be seen as a special variant of OverIVA designed for q = 1 and as an extension for T > 1, will be called CSV-AuxIVE.
To find a suitable majorant of the first term of the contrast function (16), we can follow the original Theorem 1 from [20].

Theorem 1 Let $S_G$ be a set of real-valued functions of a vector variable $\mathbf{u}$ defined as

$S_G = \{ G(\mathbf{u}) \mid G(\mathbf{u}) = G_R(\|\mathbf{u}\|_2) \}$,

where $G_R(r)$ is a continuous and differentiable function of a real variable $r$ satisfying that $G_R'(r)/r$ is continuous everywhere and is monotonically decreasing in $r \geq 0$. Then, for any $G \in S_G$,

$G(\mathbf{u}) \leq \frac{G_R'(r_0)}{2 r_0} \|\mathbf{u}\|_2^2 + \left( G_R(r_0) - \frac{r_0 G_R'(r_0)}{2} \right)$

holds for any $\mathbf{u}$ and $r_0 \geq 0$. The equality holds if and only if $r_0 = \|\mathbf{u}\|_2$.

Proof See [20].

Now, let $G(\mathbf{u}) = -\log f(\mathbf{u})$ and assume that the conditions of the theorem are satisfied. Then, by applying Theorem 1 to the tth block of the first term of (16), we get the relation

$\hat{E}\!\left[\log f\!\left(\left\{\frac{\hat{s}_{k,t}}{\hat{\sigma}_{k,t}}\right\}_k\right)\right] \geq -\hat{E}\!\left[\frac{\varphi(r_t)}{2} \sum_{k=1}^{K} \frac{|\hat{s}_{k,t}|^2}{\hat{\sigma}_{k,t}^2}\right] + R_t$,  (20)

where $r_t$ is an auxiliary variable, $\varphi(r) = G_R'(r)/r$, and $R_t$ depends purely on $r_t$; the equality holds if and only if

$r_t = \sqrt{\sum_{k=1}^{K} \frac{|\hat{s}_{k,t}|^2}{\hat{\sigma}_{k,t}^2}}$.  (22)

By applying (20) in (16), the auxiliary function obtains the form

$Q(\{\mathbf{w}_k\}_k, \{\mathbf{a}_{k,t}\}_{k,t}, \{r_t\}_t) = \sum_{t=1}^{T} \left( -\sum_{k=1}^{K} \frac{\mathbf{w}_k^H \mathbf{V}_{k,t} \mathbf{w}_k}{2 \hat{\sigma}_{k,t}^2} + \sum_{k=1}^{K} \left( -\log \hat{\sigma}_{k,t}^2 - \hat{E}\big[\hat{\mathbf{z}}_{k,t}^H \mathbf{C}_{z_{k,t}}^{-1} \hat{\mathbf{z}}_{k,t}\big] - \log\det \mathbf{C}_{z_{k,t}} + (d-2)\log|\gamma_{k,t}|^2 \right) \right)$,  (21)

where $\mathbf{V}_{k,t} = \hat{E}\big[\varphi(r_t)\, \mathbf{x}_{k,t} \mathbf{x}_{k,t}^H\big]$. Now, we can see that $Q \leq \mathcal{C}$, where both sides are equal if and only if (22) holds; hence, (21) is a valid auxiliary function.
The optimization of Q proceeds alternately in the auxiliary variables $r_t$ and the normal variables $\mathbf{w}_k$. The optimum of (21) in the auxiliary variables is obtained simply by (22). To find the optimum in the normal variables, the partial derivative of the auxiliary function (21) is taken with respect to $\mathbf{w}_k$, where $r_t$ is treated as independent and $\mathbf{a}_{k,t}$ as dependent through the OGC (17). The derivative is set equal to zero, which yields equations for the new update of the separating vectors.
For the derivative of the first and second terms in (21), identities are used that follow from straightforward computations using the Wirtinger calculus [26] and the OGC (17). The computation of the derivative of the third and fourth terms of (21) is lengthy due to the dependence of the parameters through the OGC. To simplify, we can use Equation 33 and Appendix C in [13], where the derivative is computed for the case K = 1 and T = 1, from which it follows that the result is equal to $\sum_{t=1}^{T} \mathbf{a}_{k,t}$. By putting the derivatives of all the terms together and setting them equal to zero, we obtain an equation whose closed-form solution cannot be derived in general. Our proposal is to take

$\mathbf{w}_k = \left( \sum_{t=1}^{T} \frac{\mathbf{V}_{k,t}}{\hat{\sigma}_{k,t}^2} \right)^{-1} \sum_{t=1}^{T} \mathbf{a}_{k,t}$,  (27)

which is the solution of a linearized equation in which the terms $\mathbf{w}_k^H \mathbf{V}_{k,t} \mathbf{w}_k$ and $\hat{\sigma}_{k,t}^2$ are treated as constants independent of $\mathbf{w}_k$ (the scale of $\mathbf{w}_k$ is immaterial here, as it is fixed by the subsequent normalization). Hence, the general update rules of CSV-AuxIVE consist of computing $\hat{s}_{k,t}$, $\hat{\sigma}_{k,t}^2$, and $r_t$, forming $\mathbf{V}_{k,t}$, updating $\mathbf{a}_{k,t}$ by the OGC (17), and updating $\mathbf{w}_k$ by (27). The last step, which performs a normalization of the updated separating vectors, has been found important for the stability of the convergence. After the convergence is achieved, the separating vectors are re-scaled using least squares to reconstruct the images of the SOI on a reference microphone [27].
In our implementation, we consider the standard nonlinearity $\varphi(r_t) = r_t^{-1}$ proposed in [20], which is known to be suitable for super-Gaussian signals such as speech. For this particular choice, we propose one more modification of the proposed algorithm: compared to (28), $r_t$ is put equal to $\sum_{k=1}^{K} |\mathbf{w}_k^H \mathbf{x}_{k,t}|^2$. We have experienced improved convergence speed with this modification. The pseudo-code is summarized in Algorithm 1.
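The structure of the iteration can be sketched in a few lines. The following toy realization is restricted to a single frequency (K = 1) and a single block (T = 1), uses $\varphi(r) = 1/r$, the OGC, a linearized solve, and the normalization step; it is an illustration of the update structure under these simplifying assumptions, not a verbatim transcription of the paper's update rules:

```python
import numpy as np

def csv_auxive_toy(X, n_iter=50):
    """Toy single-frequency (K = 1), single-block (T = 1) IVE iteration in the
    spirit of CSV-AuxIVE, with phi(r) = 1/r; illustrative sketch only."""
    d, N = X.shape
    C = X @ X.conj().T / N                     # sample covariance of the mixture
    w = np.zeros(d, complex)
    w[0] = 1.0                                 # initialize at the first microphone
    for _ in range(n_iter):
        s = w.conj() @ X                       # current SOI estimate
        r = np.abs(s) + 1e-12                  # auxiliary variable (K = 1)
        V = (X / r) @ X.conj().T / N           # weighted covariance E[phi(r) x x^H]
        a = C @ w / (w.conj() @ C @ w)         # mixing vector via the OGC (17)
        w = np.linalg.solve(V, a)              # linearized update of w
        w = w / np.linalg.norm(w)              # normalization for stability
    return w
```

On a toy mixture of one super-Gaussian (speech-like) source and Gaussian background noise, this iteration converges to a filter extracting the super-Gaussian source.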

Semi-supervised CSV-AuxIVE
Owing to the indeterminacy of order in BSE, it is not known, in general, which source is currently being extracted. The crucial problem is to ensure that the signal being extracted actually corresponds to the desired SOI. In BOGIVE_w as well as in CSV-AuxIVE, this can be influenced only through the initialization. The question of the convergence of BSE algorithms has been considered in [13]. Several approaches ensuring global convergence have been proposed, most of which are based on additional constraints assuming prior knowledge, e.g., about the source position or a reference signal [18,28-30].
Recently, an unconstrained supervised IVA using so-called pilot signals has been proposed in [31]. The pilot signal, which is assumed to be available as prior information, is a signal that is mutually dependent with the corresponding source signal. Therefore, the pilot signal and the frequency components of the source have a joint pdf. In the piloted IVA, the pilot signals are used as constant "frequency components" in the joint pdf model, which helps to solve the permutation problem as well as the ambiguous order of the separated sources. In [13], the idea has been applied in IVE, where the pilot signal related to the SOI is assumed to be available.

Algorithm 1: Pseudo-code of CSV-AuxIVE
Input: $\mathbf{x}_{k,t}$, $\mathbf{w}_k^{\mathrm{ini}}$ (k, t = 1, 2, . . .), NumIter
Output: $\mathbf{a}_{k,t}$, $\mathbf{w}_k$
repeat
  [updates of $\mathbf{a}_{k,t}$ and $\mathbf{w}_k$ as described above]
  Iter ← Iter + 1;
until Iter ≥ NumIter
Let the pilot signal (dependent on the SOI and independent of the background) be represented on the tth block by $o_t$ ($o_t$ is denoted without the index k; nevertheless, it can also be k-dependent). Let the joint pdf of $\mathbf{s}_t$ and $o_t$ be $p(\mathbf{s}_t, o_t)$. Then, similarly to (13), the pdf of the observed data within the tth block is given by

$p_x(\{\mathbf{x}_{k,t}\}_k) = p(\{\mathbf{w}_k^H \mathbf{x}_{k,t}\}_k, o_t) \prod_{k=1}^{K} p_{z_{k,t}}(\mathbf{B}_{k,t} \mathbf{x}_{k,t}) \prod_{k=1}^{K} |\det \mathbf{W}_{k,t}|^2$.

Comparing this expression with (13) and taking into account the fact that $o_t$ is independent of the mixing model parameters, it can be seen that the modification of CSV-AuxIVE towards the use of pilot signals is straightforward. In particular, provided that the model pdf $f(\{\mathbf{w}_k^H \mathbf{x}_{k,t}\}_k, o_t)$ replacing the unknown $p(\cdot)$ meets the conditions of Theorem 1, the piloted algorithm has exactly the same steps as the non-piloted one, with the sole difference that the nonlinearity $\varphi(\cdot)$ also depends on $o_t$. Therefore, Eq. (28) takes the form

$r_t = \sum_{k=1}^{K} |\mathbf{w}_k^H \mathbf{x}_{k,t}|^2 + \eta |o_t|^2$  (35)

for t = 1, . . . , T, where $\eta$ is a hyperparameter controlling the influence of the pilot signal [31]. Consequently, the semi-supervised variant of CSV-AuxIVE, referred to in this manuscript as piloted CSV-AuxIVE, is obtained by replacing the update step (28) with (35).
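Since only the argument of the nonlinearity changes, the piloted modification is a one-line change in an implementation. A sketch (function name and data layout are ours; `S_hat` holds $\hat{s}_{k,t}$ with frequency along the first axis):

```python
import numpy as np

def aux_variable(S_hat, pilot=None, eta=0.0):
    """Per-frame auxiliary variable: the SOI power summed over frequency bins,
    optionally augmented by the weighted pilot power (piloted variant)."""
    r = np.sum(np.abs(S_hat) ** 2, axis=0)     # sum over k of |w_k^H x_{k,t}|^2
    if pilot is not None:
        r = r + eta * np.abs(pilot) ** 2       # pilot contribution, eta |o_t|^2
    return r
```

With `eta = 0` (or no pilot), the blind algorithm is recovered; increasing `eta` pulls the extraction towards frames where the pilot is active.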
Finding a suitable pilot signal poses an application-dependent problem. For example, outputs of voice activity detectors were used to pilot the separation of simultaneously talking people in [31]. Similarly, a video-based lip-movement detection was considered in [32]. A video-independent solution was proposed in [33] using spatial information about the area in which the speaker is located. Recently, an approach utilizing speaker identification was proposed in [34] and further improved in [35]. All of these approaches have been shown to be very useful, even though the pilot signals used contain residual noise and interference. The design of a pilot signal is a topic beyond the scope of this paper. Therefore, in the experimental part of this paper, we consider only oracle pilots as a proof of concept.

Experimental validation
In this section, we present the results of experiments with simulated mixtures as well as real-world recordings of moving speakers. Our goal is to show the usefulness of the CSV mixing model and to compare the performance characteristics of the proposed algorithm with other state-of-the-art methods.

Simulated room
In this example, we inspect spatial characteristics of demixing filters obtained by the blind algorithms when extracting a moving speaker in a room simulated by the image method [36].

Experimental setup
The room has dimensions 4 × 4 × 2.5 m (width × length × height) and T60 = 100 ms. A linear array of five omnidirectional microphones is located so that its center is at the position (1.8, 2, 1) m, and the array axis is parallel with the room width. The spacing between the microphones is 5 cm.
The target signal is a 10 s long female utterance from the TIMIT dataset [37]. During the speech, the speaker moves at a constant speed along a 38° arc at a one-meter distance from the center of the array; the situation is illustrated in Fig. 2a. The starting and ending positions are (1.8, 3, 1) m and (1.2, 2.78, 1) m, respectively. The movement is simulated by 20 equidistantly spaced RIRs on the path, which correspond to half-second intervals of speech whose overlaps were smoothed by windowing. As an interferer, a point source emitting white Gaussian noise is located at the position (2.8, 2, 1) m, that is, at a 1-m distance to the right of the array.
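The interpolation between successive RIRs can be sketched as follows. This is one plausible form of the windowed smoothing described above, under our own assumptions: the signal is convolved with each RIR, and the outputs are blended with triangular weights centered on the corresponding segments:

```python
import numpy as np

def simulate_movement(src, rirs, seg_len):
    """Approximate a moving source: convolve the signal with each successive RIR
    and blend the outputs with overlapping triangular weights (the triangular
    window is our choice; only the general procedure is taken from the text)."""
    n_out = len(src) + len(rirs[0]) - 1
    out = np.zeros(n_out)
    wsum = np.zeros(n_out)
    n = np.arange(n_out)
    for i, h in enumerate(rirs):
        center = (i + 0.5) * seg_len
        g = np.clip(1.0 - np.abs(n - center) / seg_len, 0.0, 1.0)  # weight g_i(n)
        out += g * np.convolve(src, h)
        wsum += g
    return out / np.maximum(wsum, 1e-12)   # weights are renormalized to sum to one
```

A sanity check of the construction: when all RIRs are identical (no movement), the output reduces to a plain convolution.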
The mixture of speech and noise has been processed in order to extract the speech signal by the following methods: OGIVE w [13], BOGIVE w (the extension of OGIVE w allowing for the CSV; derived in the supplementary material of this article), OverIVA with m = 1 [22], which corresponds with CSV-AuxIVE when T = 1, and CSV-AuxIVE. All methods operate in the STFT domain with the FFT length of 512 samples and 128 samples hop-size; the sampling frequency is f s = 16 kHz. Each method has been initialized by the direction of arrival of the desired speaker signal at the beginning of the sequence. The other parameters of the methods are listed in Table 1.
In order to visualize the performance of the extracting filters, a 2 × 2 cm-spaced regular grid of positions spanning the whole room is considered. Microphone responses (images) of a white Gaussian noise signal emitted from each position on the grid have been simulated. The extracting filter of a given algorithm is applied to the microphone responses, and the output power is measured. The average ratio between the output power and the power of the input signals reflects the attenuation of the white noise signal originating from the given position.
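The attenuation measure described above can be sketched as follows; the data layout (one (m, N) array of microphone-response STFT coefficients per frequency bin for each grid point) is our assumption:

```python
import numpy as np

def attenuation_db(w_list, responses):
    """Attenuation of a probe signal emitted from one grid position: the ratio
    of the demixing-filter output power to the average input power, summed
    over frequency bins and expressed in dB (a sketch of the measure above)."""
    out_pow = sum(np.mean(np.abs(w.conj() @ R) ** 2)
                  for w, R in zip(w_list, responses))
    in_pow = sum(np.mean(np.abs(R) ** 2) for R in responses)
    return 10.0 * np.log10(out_pow / in_pow)
```

Evaluating this quantity over the grid yields the attenuation maps discussed in the following subsection.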

Results
The attenuation maps of the compared methods are shown in Fig. 2b through 2f, and Table 2 shows the attenuation for specific points in the room. In particular, the first five columns in the table correspond to the speaker's positions on the movement path at angles 0° through 32°. The last column corresponds to the position of the interferer. Figure 2d shows the map of the initial filter corresponding to the delay-and-sum (D&S) beamformer steered towards the initial position of the speaker. The beamformer yields a gentle gain in the initial direction with no attenuation in the direction of the interferer.
The compared blind methods steer a spatial null towards the interferer and try to pass the target signal through. However, OverIVA and OGIVE_w tend to pass through only a narrow angular range (probably the most significant part of the speech). By contrast, the spatial beam steered by CSV-AuxIVE towards the speaker spans the whole angular range in which the speaker appeared during the movement. BOGIVE_w performs similarly; however, its performance is poorer, perhaps due to its slower convergence or proneness to getting stuck in a local extremum. The convergence comparison of BOGIVE_w and CSV-AuxIVE is shown in Fig. 3. The nulls steered towards the interferer by OverIVA and CSV-AuxIVE are more attenuating than those of the gradient methods. In conclusion, these results confirm the ability of the blind algorithms to extract the moving source gained through the CSV mixing model. The results also show the better convergence properties of CSV-AuxIVE over BOGIVE_w.

Moving speakers simulated by a wireless loudspeaker attached to a turning arm
The goal of this experiment is to compare the performance of the algorithms as it depends on the range and speed of the movements of the sources.

Experimental setup
We have recorded a dataset of speech utterances that were played from a wireless loudspeaker (JBL GO 2) attached to a manually actuated rotating arm. The length of each utterance is 31 s. The sounds were recorded at a 16 kHz sampling rate using a linear array of four microphones with 16 cm spacing. The array center was placed at the arm's pivot, which allows the apparatus to simulate circular movements of sources at a radius of approx. 1 m. The recording setup was placed in an open-space 12 × 8 × 2.6 m room with a reverberation time T60 ≈ 500 ms. The recording setup is shown in Fig. 4. The dataset consists of two individual, spatially separated sources. The SOI is represented by a male speech utterance and is confined to the angular interval from 0° through 90°. The interference (IR) is represented by a female speech utterance and is confined to the interval from −90° through 0°. The list of recordings is described in Table 3. The recordings, along with videos of the recording process, are available online (see links at the end of this article).
Thirty-six mixtures were created by combining the SOI and IR recordings in Table 3; the input SIR was set to 10 dB. The following three algorithms were compared: CSV-AuxIVE with the block length set to 100 frames, the original AuxIVA algorithm [20], and a sequential online variant of AuxIVA (On-line AuxIVA) from [17] with a time-window length of 20 frames and a forgetting factor of 0.95. The algorithms operated in the STFT domain with 1024 samples per frame and an overlap of 768 samples. The off-line algorithms were stopped after 100 iterations. In the case of AuxIVA and On-line AuxIVA, the output channel containing the SOI was determined based on the output SDR.
Performance was evaluated using segmental measures: normalized SIR (nSIR), SDR improvement (iSDR), and the average SOI attenuation (Attenuation). nSIR is the ratio of the powers of the SOI and the IR in the extracted signal, where each segment is normalized to unit variance; SDR is computed using BSS_eval [38]. While iSDR and Attenuation reflect the loss of power of the SOI in the extracted signal, nSIR also reflects the cancellation of the IR. The length of the segments was set to 1 s.
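Since the exact evaluation code is not given here, the segmental nSIR can be sketched as follows; `soi_img` and `ir_img` are hypothetical names for the SOI and IR components of the extracted signal, and the per-segment unit-variance normalization follows the description above:

```python
import numpy as np

def segmental_nsir(soi_img, ir_img, fs=16000, seg_len_s=1.0):
    """Segment-wise SOI/IR power ratio in dB; each segment of the
    extracted signal (SOI + IR components) is first normalized to
    unit variance, then the ratios are averaged over segments."""
    seg = int(fs * seg_len_s)
    ratios = []
    for start in range(0, len(soi_img) - seg + 1, seg):
        s = soi_img[start:start + seg]
        i = ir_img[start:start + seg]
        scale = np.std(s + i)          # normalize segment to unit variance
        if scale == 0:
            continue
        s, i = s / scale, i / scale
        ratios.append(10 * np.log10(np.sum(s ** 2) / np.sum(i ** 2)))
    return float(np.mean(ratios))
```

Note that the normalization cancels inside each segment's ratio; its effect is to weight the segments equally when averaging across a recording with varying levels.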

Results
The results in Fig. 5 show that AuxIVA and On-line AuxIVA perform well only when the SOI is static; their performance drops when the SOI moves. On-line AuxIVA is slightly less sensitive to the SOI movement than AuxIVA due to its adaptability. However, the overall performance of On-line AuxIVA is low because the algorithm works with a limited context. CSV-AuxIVE is significantly less sensitive to the SOI movements than the compared algorithms. This is mainly reflected by Attenuation, which grows only slightly with the increasing range and speed of the SOI movement. The higher performance of CSV-AuxIVE in terms of iSDR and nSIR compared to AuxIVA and On-line AuxIVA confirms the new ability of the proposed algorithm gained through the CSV mixing model.
The IR movements cause the performance of AuxIVA and CSV-AuxIVE to decrease with the growing range of the IR movement (small and large); the speed of the movement seems to play a minor role. This can be explained by the fact that the off-line algorithms estimate time-invariant spatial filters which project two distinct beams: one towards the entire angular area occupied by the SOI and one towards the area occupied by the IR. The former beam should pass the incoming signal through, while the latter should attenuate it. Provided that the estimated filters satisfy these requirements, the speed with which the sources move does not matter as long as they stay within their respective beams. For the estimation of the filters based on the CSV itself, the speakers should be approximately static within each block, as the mixing vectors are assumed constant within the blocks. Hence, the allowed speed should not be too high relative to the block length.

Fig. 5 The accuracy of the blind extraction of the SOI in terms of iSDR, nSIR, and Attenuation in the experiment in Section 4.2. The indices on the SOI and IR axes correspond with Table 3. Please note that, for better readability, the axes of the plots showing the Attenuation are reversed
In conclusion, the results reflect the theoretical capabilities of the algorithms or, more specifically, of the filters that they can estimate. AuxIVA can steer only a narrow beam towards the SOI, which can therefore be extracted efficiently only if it is not moving. On-line AuxIVA can steer a narrow beam adaptively; however, its accuracy is lower due to the small context of data. CSV-AuxIVE can reliably extract the SOI from a wider area using the entire context of the data.

Real-world scenario using the MIRaGe database
This experiment is designed to provide an exhaustive test of the compared methods in challenging noisy situations where the target speaker is performing small movements within a confined area.

Experimental setup
Recordings are simulated using real-world room impulse responses (RIRs) taken from the MIRaGe database [39]. MIRaGe provides measured RIRs between microphones and a source whose possible positions form a dense grid within a 46 × 36 × 32 cm volume. MIRaGe is thus suitable for our experiment, as it enables us to simulate small speaker movements in a real environment.
The database setup is situated in an acoustic laboratory, which is a 6 × 6 × 2.4 m rectangular room with a variable reverberation time. Three reverberation levels with T60 equal to 100, 300, and 600 ms are provided. The speaker's area involves 4104 positions which form a cube-shaped grid with a spacing of 2 cm along the x and y axes and 4 cm along the z axis. MIRaGe also contains a complementary set of measurements for positions placed around the room perimeter with a spacing of approx. 1 m, at a distance of 1 m from the walls. These positions are referred to as the out-of-grid (OOG) positions. All measurements were recorded by six static linear microphone arrays (5 microphones per array with inter-microphone spacings of −13, −5, 0, +5, and +13 cm relative to the central microphone); for more details about the database, see [39].
In the present experiment, we use Array 1, which is at a distance of 1 m from the center of the grid, and the T60 settings of 100 and 300 ms. For each setting, 3840 noisy observations of a moving speaker were synthesized as follows: each mixture consists of a moving SOI, one static interfering speaker, and noise. The SOI moves randomly over the grid positions; the position is changed every second.
The new position is randomly selected from all positions whose distance from the current position is at most 4 along both the x and y axes. The transition between positions is smoothed using a Hamming window of length fs/16 with one-half overlap. The interferer is located at a random OOG position between 13 and 24, while the noise signal is the sum of the signals located at the remaining OOG positions (outside 13 through 24).
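The position transitions described above can be sketched as a cross-fade between the spatial images rendered at the old and new positions. This is an illustrative reconstruction, not the authors' code; `old_seg` and `new_seg` are hypothetical one-second segments convolved with the RIRs of the two positions:

```python
import numpy as np

def crossfade(old_seg, new_seg, fs=16000):
    """Smooth a position change by cross-fading with the two halves of
    a Hamming window of length fs/16 (one-half overlap, as in the text)."""
    n = fs // 16                              # window length: 1000 samples
    half = n // 2
    w = np.hamming(n)
    fade_out, fade_in = w[half:], w[:half]    # falling / rising halves
    # overlap the tail of the old segment with the head of the new one
    mixed = old_seg[-half:] * fade_out + new_seg[:half] * fade_in
    return np.concatenate([old_seg[:-half], mixed, new_seg[half:]])
```

A whole trajectory would be built by chaining such cross-fades, one per one-second position change.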
As the SOI and interferer signals, clean utterances of 4 male and 4 female speakers from the CHiME-4 dataset [40] were selected; there are 20 different utterances per speaker, each 10 s in length. The noise signals correspond to random parts of the CHiME-4 cafeteria noise recording. The signals are convolved with the RIRs to match the desired positions, and the obtained spatial images of the signals on the microphones are summed up so that the interferer-to-noise ratio, as well as the ratio between the SOI and the interference plus noise, is 0 dB.
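The level calibration described above amounts to rescaling the spatial images before summation. A minimal sketch, with all signal names hypothetical (here white noise stands in for the convolved images):

```python
import numpy as np

def scale_to_ratio(reference, other, target_db=0.0):
    """Return `other` rescaled so that the power ratio
    reference / other equals `target_db` (in dB)."""
    p_ref = np.mean(reference ** 2)
    p_oth = np.mean(other ** 2)
    gain = np.sqrt(p_ref / (p_oth * 10 ** (target_db / 10)))
    return other * gain

rng = np.random.default_rng(0)
soi, interf, noise = (rng.standard_normal(16000) for _ in range(3))

# interferer-to-noise ratio at 0 dB, then SOI vs. interference+noise at 0 dB
noise = scale_to_ratio(interf, noise, 0.0)
background = scale_to_ratio(soi, interf + noise, 0.0)
mixture = soi + background
```

Scaling the summed background preserves the 0 dB interferer-to-noise ratio set in the first step, since both components are multiplied by the same gain.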
The methods considered in the previous sections are compared. All of them operate in the STFT domain with an FFT length of 1024 and a hop size of 256; the sampling frequency is 16 kHz. The number of iterations is set to 150 and 2000 for the off-line auxiliary function-based and the gradient-based methods, respectively. For the online AuxIVA, the number of iterations is set to 3 on each block. The block length in CSV-AuxIVE and BOGIVEw is set to 150 frames. The online AuxIVA operates on a block length of 50 frames with 75% overlap. The step length in OGIVEw and BOGIVEw is set to μ = 0.2. The initial separating vector corresponds to the delay-and-sum (D&S) beamformer steered in front of the microphone array. As a proof of concept for the approaches discussed in Section 3.3, we also compare the piloted variants of OverIVA and CSV-AuxIVE, where the pilot signal corresponds to the energy of the ground-truth SOI in each frame.

Results
For the IVE methods, the SOI is blindly extracted from each mixture; for the IVA methods, the output channel was determined by the output SIR. The results are evaluated through the improvements of the signal-to-interference-and-noise ratio (iSINR) and of the signal-to-distortion ratio (iSDR), defined as in [41] (SDR is computed after compensating for the global delay). The average values of the criteria are summarized in Table 4, together with the average time needed to process one mixture. For a deeper understanding of the results, we also analyze the histograms of the iSINR achieved by OverIVA and CSV-AuxIVE, shown in Fig. 6. Figure 6a shows the histograms over the full dataset of mixtures, while Fig. 6b is evaluated on a subset of mixtures in which the SOI has not moved away from its starting position by more than 5 cm; there are 288 mixtures of this kind. We can observe two phenomena. First, OverIVA yields fewer results below 10 dB in Fig. 6b, which confirms that OverIVA performs better for the subset of mixtures where the SOI is almost static. The performance of CSV-AuxIVE tends to be similar for the full set and the subset; CSV-AuxIVE thus yields a more stable performance than the static-model-based OverIVA when the SOI performs small movements. Second, the piloted methods yield iSINR < −5 dB in a much lower number of trials than the non-piloted methods, as confirmed by the additional criterion in Table 4. This shows that the piloted algorithms have a significantly improved global convergence. Note that the IVA algorithms achieved iSINR < −5 dB in 0% of cases. For the IVE algorithms, the percentage of iSINR < −5 dB reflects the rate of extractions of a different source; in contrast, for the IVA algorithms, the sources are either successfully separated or not, i.e., the iSINR is around 0 dB.
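For clarity (the precise definitions are given in [41]), the iSINR criterion can be sketched as the difference between the output and input SINR, computed from hypothetical ground-truth components of the SOI and of the interference plus noise:

```python
import numpy as np

def sinr_db(target, interference_plus_noise):
    """SINR in dB from the ground-truth components of a signal."""
    return 10 * np.log10(np.sum(target ** 2)
                         / np.sum(interference_plus_noise ** 2))

def isinr_db(soi_in, ipn_in, soi_out, ipn_out):
    """iSINR: output SINR minus input SINR (both in dB)."""
    return sinr_db(soi_out, ipn_out) - sinr_db(soi_in, ipn_in)
```

Under this reading, iSINR < −5 dB means the extracted channel is dominated by a source other than the SOI, which is why it serves as an indicator of failed global convergence.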

Speech enhancement/recognition on CHiME-4 datasets
We have verified the proposed methods using the noisy speech recognition task defined within the CHiME-4 challenge, specifically, the six-channel track [40].

Experimental setup
This dataset contains simulated (SIMU) and real-world (REAL) utterances of speakers in multi-source noisy environments. The recording device is a tablet with six microphones, which is held by the speaker. Since some recordings involve microphone failures, the method from [42] is used to detect them; if detected, the malfunctioning channels are excluded from further processing of the given recording.
The experiment is evaluated in terms of word error rate (WER) as follows: the compared methods are used to extract speech from the noisy recordings. Then, the enhanced signals are forwarded to the baseline speech recognizer from [40]. The WER achieved by the proposed methods is compared with the results obtained on unprocessed input signals (Channel 5) and with the techniques listed below.
BeamformIt [43] is the front-end algorithm used within the CHiME-4 baseline system. It is a weighted delay-and-sum beamformer requiring two passes over the processed recording in order to optimize its inner parameters. We use the original implementation of the technique available at [44].
The generalized eigenvalue beamformer (GEV) is a front-end solution proposed in [45, 46]. It represents one of the most successful enhancers for CHiME-4 that rely on deep networks trained on the CHiME-4 data. In the implementation used here, a re-trained voice activity detector (VAD) is used, whose training procedure was kindly provided by the authors of [45]. We utilize the feed-forward topology of the VAD and train the network using the training part of the CHiME-4 data. GEV utilizes the blind analytic normalization (BAN) postfilter to obtain its final enhanced output signal.
All systems/algorithms operate in the STFT domain with an FFT length of 512 and a hop size of 128, and use the Hamming window; the sampling frequency is 16 kHz. BOGIVEw and CSV-AuxIVE are applied with Nb = 250, which corresponds to a block length of 2 s; this value has been selected to optimize the performance of these methods. All of the proposed methods are initialized by the relative transfer function (RTF) estimator from [47]; Channel 5 of the data is selected as the target (the spatial image of the speech signal on this channel is being estimated).

Results
The results shown in Table 5 indicate that all methods are able to improve the WER compared to the unprocessed case. The BSE-based methods significantly outperform BeamformIt. The GEV beamformer endowed with the pre-trained VAD achieves the best results. It should be noted that the rates achieved by the BSE techniques are comparable to GEV even though they do not require a training stage on any CHiME-4 data.
In general, the block-wise methods achieve a lower WER than their counterparts based on the static mixing model; the WER of BOGIVEw is comparable with that of CSV-AuxIVE. A significant advantage of the latter method is its faster convergence and, consequently, much lower computational burden. The total duration of the 5920 files in the CHiME-4 dataset is 10 h and 5 min. The results presented for BOGIVEw were achieved after 100 iterations on each file, which translates into 10 h and 30 min of processing for the whole dataset. CSV-AuxIVE is able to converge in 7 iterations; the whole enhancement was finished in 1 h and 2 min.
An example of the enhancement yielded by the block-wise methods on one of the CHiME-4 recordings is shown in Fig. 7. Within this particular recording, in the interval 1.75-3 s, the target speaker moved out of its initial position. The OverIVA algorithm focused on the initial direction only, resulting in a vanishing voice during the movement interval; consequently, the automatic transcription is erroneous. In contrast, CSV-AuxIVE is able to focus on both positions of the speaker and recovers the signal of interest correctly. The fact that there are few recordings with significant speaker movement in the CHiME-4 datasets explains why the improvements of the WER achieved by the block-wise methods are small. (The computations were run on a workstation with an Intel i7-2600K @ 3.4 GHz processor and 16 GB RAM.)

Conclusions
The ability of the CSV-based BSE algorithms to extract moving acoustic sources has been corroborated by the experiments presented in this paper. The blind extraction is based on the estimation of a separating filter that passes signals from the entire area in which the source occurs. This way, the moving source can be extracted efficiently without on-line tracking. The experiments show that these methods are particularly robust with respect to small source movements and effectively exploit overdetermined settings, that is, settings where there are more microphones than sources.
We have proposed a new BSE algorithm of this kind, CSV-AuxIVE, which is based on auxiliary function-based optimization. The algorithm was shown to converge faster than its gradient-based counterpart. Furthermore, we have proposed a semi-supervised variant of CSV-AuxIVE utilizing pilot signals; the experiments confirm that this algorithm yields stable global convergence to the SOI. For the future, the proposed methods provide alternatives to the conventional approaches that adapt to source movements by applying static mixing models on short time intervals. Their other abilities, for example, the adaptability to high-speed speaker movements and the robustness against highly reverberant and noisy environments, pose an interesting topic for future research [35].

Abbreviations
SDR: Signal-to-distortion ratio; iSIR: improvement in signal-to-interference ratio; iSDR: improvement in signal-to-distortion ratio; nSIR: normalized signal-to-interference ratio; OOG: Out-of-grid position; FFT: Fast Fourier transform; RIR: Room impulse response; iSINR: improvement in signal-to-interference-and-noise ratio; WER: Word error rate; GEV: Generalized eigenvalue beamformer; VAD: Voice activity detector; BAN: Blind analytic normalization; RTF: Relative transfer function