A static mixture of audio signals that propagate in an acoustic environment from point sources to microphones can be described by the time-invariant convolutive model. Let there be *d* sources observed by *m* microphones. The signal on the *i*th microphone is described by

$$ x_{i}(n)=\sum_{j=1}^{d}\sum_{\tau=0}^{L-1} h_{{ij}}(\tau)s_{j}(n-\tau),\quad i=1,\dots,m, $$

(1)

where *n* is the sample index, \(s_{1}(n),\dots,s_{d}(n)\) are the original signals coming from the sources, and *h*_{ij} denotes the time-invariant impulse response of length *L* between the *j*th source and the *i*th microphone.
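As a minimal sketch of (1), the double sum can be evaluated with one discrete convolution per source-microphone pair. The function name `convolutive_mix`, the array layouts, and the toy dimensions are assumptions made for illustration; signals are taken to be zero before *n*=0.

```python
import numpy as np

def convolutive_mix(s, h):
    """Mix d source signals into m microphone signals following (1).

    s: array (d, n_samples) holding the sources s_j(n)
    h: array (m, d, L) holding the impulse responses h_ij(tau)
    Returns x: array (m, n_samples); sources are assumed zero for n < 0.
    """
    m, d, L = h.shape
    n_samples = s.shape[1]
    x = np.zeros((m, n_samples))
    for i in range(m):
        for j in range(d):
            # The full convolution has n_samples + L - 1 samples;
            # only the first n_samples are kept.
            x[i] += np.convolve(s[j], h[i, j])[:n_samples]
    return x

rng = np.random.default_rng(0)
d, m, L, n_samples = 2, 2, 4, 16          # toy sizes (hypothetical)
s = rng.standard_normal((d, n_samples))
h = rng.standard_normal((m, d, L))
x = convolutive_mix(s, h)
```

For long signals or impulse responses, an FFT-based convolution would typically be preferred over the direct loop shown here.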

In the short-time Fourier transform (STFT) domain, convolution can be approximated by multiplication. Let *x*_{i}(*k*,*ℓ*) and *s*_{j}(*k*,*ℓ*) denote, respectively, the STFT coefficient of *x*_{i}(*n*) and *s*_{j}(*n*) at frequency *k* and frame *ℓ*. Then, (1) can be replaced by a set of *K* complex-valued linear instantaneous mixtures

$$ \mathbf{x}_{k}=\mathbf{A}_{k} \mathbf{s}_{k}, \qquad k = 1,\dots,K, $$

(2)

where **x**_{k} and **s**_{k} are symbolic vectors representing, respectively, \([x_{1}(k,\ell),\dots,x_{{m}}(k,\ell)]^{T}\) and \([s_{1}(k,\ell),\dots,s_{d}(k,\ell)]^{T}\), for any frame \(\ell =1,\dots,N\); **A**_{k} stands for the *m*×*d* mixing matrix whose *ij*th element is related to the *k*th Fourier coefficient of the impulse response *h*_{ij}; *K* is the frequency resolution of the STFT; for detailed explanations, see, e.g., Chapters 1 through 3 in [3].
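The following sketch illustrates how the set of mixtures in (2) can be assembled. It assumes, for illustration only, that **A**_{k} is taken directly as the *k*th discrete Fourier coefficient of the zero-padded impulse responses, that the window length `n_fft` is hypothetical, and that only the non-redundant bins are kept, so *K* = `n_fft // 2 + 1` here.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, L = 3, 3, 8
n_fft = 32                                # hypothetical window length, n_fft >= L
h = rng.standard_normal((m, d, L))        # impulse responses h_ij(tau)

# rfft along the tap axis zero-pads every h_ij to n_fft samples.
H = np.fft.rfft(h, n=n_fft, axis=-1)      # shape (m, d, K)
A = np.moveaxis(H, -1, 0)                 # shape (K, m, d): one matrix per bin
K = A.shape[0]

# One frame of source STFT coefficients per bin, mixed as in (2).
s_k = rng.standard_normal((K, d)) + 1j * rng.standard_normal((K, d))
x_k = np.einsum("kmd,kd->km", A, s_k)     # x_k = A_k s_k for every bin k
```

The multiplicative model is only an approximation of (1); its accuracy depends on the window being long relative to *L*, which this toy example does not examine.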

### 2.1 Blind source extraction

For the BSE problem, we can write (2) in the form

$$ \mathbf{x}_{k} = \mathbf{a}_{k} s_{k} + \mathbf{y}_{k},\qquad k=1,\dots,K, $$

(3)

where *s*_{k} represents the *source of interest* (SOI), **a**_{k} is the corresponding column of **A**_{k}, called the *mixing vector*, and **y**_{k} represents the remaining signals in **x**_{k}, i.e., **y**_{k}=**x**_{k}−**a**_{k}*s*_{k}.

Since any of the original sources can play the role of the SOI, we can assume, without loss of generality, that the SOI corresponds to the first source in (2); hence, **a**_{k} is the first column of **A**_{k}. The problem of guaranteeing the extraction of the desired SOI will be addressed in Section 3.3.

The assumption that the original signals in (2) are independent implies that *s*_{k} and **y**_{k} are independent. We will also assume that *m*=*d*, i.e., that the number of microphones equals the number of sources. It follows that the mixing matrices **A**_{k} are square. By assuming also that they are non-singular^{Footnote 1}, so that their inverses exist, the existence of a *separating vector* **w**_{k} (whose conjugate transpose \(\mathbf {w}_{k}^{H}\) is the first row of \(\mathbf {A}_{k}^{-1}\)) such that \(\mathbf {w}_{k}^{H}\mathbf {x}_{k}=s_{k}\) is guaranteed. We pay for this advantage by the limitation that **y**_{k} belongs to a subspace of dimension *d*−1. In other words, the covariance of **y**_{k} is assumed to have rank *d*−1, as opposed to real recordings where the typical rank is *d* (e.g., due to sensor and environment noises). Nevertheless, the assumption *m*=*d* brings more advantages than disadvantages, as shown in [10]. One way to compensate is to increase the number of microphones so that the ratio \(\frac {d-1}{d}\) approaches 1. BSE appears to be computationally more efficient than BSS when *d* is large since, in BSE, **y**_{k} is not separated into individual signals.
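A small numerical sketch of the determined case *m*=*d* for a single frequency bin follows. The random mixing matrix, the frame count, and the rank tolerance are assumptions chosen for illustration. It shows that the first row of the inverse mixing matrix recovers the SOI and that **y**_{k} occupies a (*d*−1)-dimensional subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_frames = 4, 100                       # toy sizes (hypothetical)

# Random square mixing matrix and sources for one frequency bin.
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
S = rng.standard_normal((d, n_frames)) + 1j * rng.standard_normal((d, n_frames))
X = A @ S                                  # x_k = A_k s_k for every frame

W = np.linalg.inv(A)
w = W[0].conj()                            # w_k^H is the first row of A_k^{-1}
s_hat = w.conj() @ X                       # w_k^H x_k, the extracted SOI

# y_k = x_k - a_k s_k collects the remaining d - 1 sources.
Y = X - np.outer(A[:, 0], S[0])
rank_y = np.linalg.matrix_rank(Y, tol=1e-8)   # expected to be d - 1 here
```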

In [13], the BSE problem is formulated by exploiting the fact that the *d*−1 latent variables (background signals) involved in **y**_{k} can be defined arbitrarily. An effective parameterization that involves only the mixing and separating vectors related to the SOI has been derived. Specifically, **A**_{k} and \(\mathbf {A}_{k}^{-1}\) (denoted as **W**_{k}) have the structure

$$ \mathbf{A}_{k} = \left(\begin{array}{ll} \mathbf{a}_{k} & \mathbf{Q}_{k} \end{array}\right) = \left(\begin{array}{cc} \gamma_{k} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k} & \frac{1}{\gamma_{k}}(\mathbf{g}_{k}\mathbf{h}_{k}^{H}-\mathbf{I}_{d-1}) \\ \end{array}\right), $$

(4)

and

$$ \mathbf{W}_{k} = \left(\begin{array}{c} \mathbf{w}_{k}^{H}\\ \mathbf{B}_{k} \end{array}\right) = \left(\begin{array}{cc} {\beta_{k}}^{*} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k} & -\gamma_{k} \mathbf{I}_{d-1} \\ \end{array}\right), $$

(5)

where **I**_{d} denotes the *d*×*d* identity matrix, **w**_{k} denotes the separating vector, which is partitioned as **w**_{k}=[*β*_{k};**h**_{k}], and the mixing vector **a**_{k} is partitioned as **a**_{k}=[*γ*_{k};**g**_{k}]. The vectors **a**_{k} and **w**_{k} are linked through the so-called *distortionless constraint* \(\mathbf {w}_{k}^{H}\mathbf {a}_{k} = 1\), which, equivalently, means

$$ \beta_{k}^{*}\gamma_{k} + \mathbf{h}_{k}^{H}\mathbf{g}_{k} = 1, \qquad k=1,\dots,K. $$

(6)

**B**_{k}=[**g**_{k},−*γ*_{k}**I**_{d−1}] is called the *blocking matrix* because it satisfies **B**_{k}**a**_{k}=**0**. The background signals are given by **z**_{k}=**B**_{k}**x**_{k}=**B**_{k}**y**_{k}, and it holds that **y**_{k}=**Q**_{k}**z**_{k}. To summarize, (2) is recast for the BSE problem as

$$ {}\mathbf{x}_{k} \,=\,\left(\!\begin{array}{cc} \gamma_{k} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k} & \frac{1}{\gamma_{k}}(\mathbf{g}_{k}\mathbf{h}_{k}^{H}-\mathbf{I}_{d-1}) \\ \end{array}\!\right)\! \left(\begin{array}{c} s_{k}\\ \mathbf{z}_{k} \end{array}\right),\quad k=1,\dots,K. $$

(7)
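The parameterization in (4) and (5) can be checked numerically. The sketch below builds both matrices from a mixing vector and a separating vector for one frequency bin; the helper name `bse_matrices` and the random test vectors are assumptions made for illustration.

```python
import numpy as np

def bse_matrices(a, w):
    """Build A_k (4), W_k (5), and the blocking matrix B_k for one bin.

    Assumes the distortionless constraint w^H a = 1 from (6) and a[0] != 0.
    """
    d = a.shape[0]
    gamma, g = a[0], a[1:]                 # a = [gamma; g]
    h = w[1:]                              # w = [beta; h]
    I = np.eye(d - 1)
    Q = np.vstack([h.conj()[None, :],
                   (np.outer(g, h.conj()) - I) / gamma])
    A = np.column_stack([a, Q])
    B = np.column_stack([g, -gamma * I])
    W = np.vstack([w.conj()[None, :], B])  # first row is w^H = [beta^*, h^H]
    return A, W, B

rng = np.random.default_rng(3)
d = 4
a = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w = rng.standard_normal(d) + 1j * rng.standard_normal(d)
w = w / np.conj(np.vdot(w, a))             # rescale so that w^H a = 1
A, W, B = bse_matrices(a, w)
product = W @ A                            # should be close to the identity
```

Only the 2*d* entries of **a**_{k} and **w**_{k} are free parameters here, which is what makes the model attractive compared with estimating a full *d*×*d* matrix.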

### 2.2 CSV mixing model

Now, we turn to an extension of (7) to time-varying mixtures. Let the available samples of the observed signals (meaning the STFT coefficients from *N* frames) be divided into *T* intervals; for the sake of simplicity, we assume that the intervals have the same integer length *N*_{b}=*N*/*T*. The intervals will be called blocks and will be indexed by \(t\in \{1,\dots,T\}\).

A straightforward extension of (7) to time-varying mixtures is to let all parameters, i.e., the mixing and separating vectors, be block-dependent. However, such an extension brings no advantage compared to processing each block separately. In the constant separating vector (CSV) mixing model, it is assumed that only the mixing vectors are block-dependent while the separating vectors are constant over the blocks. Hence, the mixing and de-mixing matrices on the *t*th block are parameterized, respectively, as

$$ \mathbf{A}_{k,t} = \left(\begin{array}{cc} \mathbf{a}_{k,t} & \mathbf{Q}_{k,t} \end{array}\right) = \left(\begin{array}{cc} \gamma_{k,t} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k,t} & \frac{1}{\gamma_{k,t}}(\mathbf{g}_{k,t}\mathbf{h}_{k}^{H}-\mathbf{I}_{d-1}) \\ \end{array}\right), $$

(8)

and

$$ \mathbf{W}_{k,t} = \left(\begin{array}{c} \mathbf{w}_{k}^{H}\\ \mathbf{B}_{k,t} \end{array}\right) = \left(\begin{array}{cc} {\beta_{k}^{*}} & \mathbf{h}_{k}^{H}\\ \mathbf{g}_{k,t} & -\gamma_{k,t} \mathbf{I}_{d-1} \\ \end{array}\right). $$

(9)

Each sample of the observed signals on the *t*th block is modeled according to

$$ \mathbf{x}_{k,t}=\mathbf{A}_{k,t} \left(\begin{array}{c} s_{k,t}\\ \mathbf{z}_{k,t} \end{array}\right), $$

(10)

where *s*_{k,t} and **z**_{k,t} represent, respectively, the *k*th frequency of the SOI and of the background signals at any frame within the *t*th block. Note that the CSV model coincides with the static model (7) when *T*=1.
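To make (8) through (10) concrete, the sketch below generates data for one frequency bin under the CSV model and extracts the SOI from all blocks with a single separating vector. The helper `csv_mixing_matrix`, the helper `crandn`, the random signals, and the block sizes are assumptions for illustration only.

```python
import numpy as np

def csv_mixing_matrix(a_t, w):
    """A_{k,t} from (8): block-dependent a_t, block-independent w.

    Assumes w^H a_t = 1 and a_t[0] != 0.
    """
    gamma, g = a_t[0], a_t[1:]
    h = w[1:]
    Q = np.vstack([h.conj()[None, :],
                   (np.outer(g, h.conj()) - np.eye(len(g))) / gamma])
    return np.column_stack([a_t, Q])

rng = np.random.default_rng(4)

def crandn(*shape):
    """Complex array with independent standard normal real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

d, T, N_b = 3, 5, 50                       # toy sizes, single frequency bin
w = crandn(d)                              # one separating vector for all blocks
s = crandn(T, N_b)                         # SOI, frames grouped by block
z = crandn(T, d - 1, N_b)                  # background signals
x = np.empty((T, d, N_b), dtype=complex)
for t in range(T):
    a_t = crandn(d)
    a_t = a_t / np.vdot(w, a_t)            # enforce w^H a_t = 1 on every block
    x[t] = csv_mixing_matrix(a_t, w) @ np.vstack([s[t][None, :], z[t]])

s_hat = np.einsum("d,tdn->tn", w.conj(), x)   # w^H x_t on all blocks
```

Although the mixing vector is redrawn on every block, `s_hat` reproduces `s` because each block's mixing matrix shares the same first row of its inverse.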

The practical meaning of the CSV model is illustrated in Fig. 1. While CSV admits that the SOI can change its position from block to block (the mixing vectors **a**_{k,t} depend on *t*), a block-independent separating vector **w**_{k} is sought that extracts the speaker’s voice from all positions visited during the movement. There are two main reasons for this. First, the interference-to-signal ratio (ISR) achievable with a block-independent **w**_{k} is of order \(\mathcal {O}(N^{-1})\), whereas a block-dependent **w**_{k} yields an ISR of order \(\mathcal {O}(N_{b}^{-1})\); this is confirmed by the theoretical study on Cramér-Rao bounds in [24]. Second, the CSV enables BSE methods to avoid the discontinuity problem mentioned in the previous section.

The CSV also brings a limitation. Formally, the mixture must obey the condition that for each *k* a separating vector exists such that \(s_{k,t}=\mathbf {w}_{k}^{H}\mathbf {x}_{k,t}\) holds for every *t*, a condition that seems to be quite restrictive. Nevertheless, preliminary experiments in [23] have shown that this limitation is not crucial in practical situations and does not differ much from that of static methods (spatially overlapping speakers cannot be separated), especially when the number of microphones is high enough to provide sufficient degrees of freedom. When the speakers are static, the rule of thumb says that the speakers cannot be separated or, at least, are difficult to separate through spatial filtering when their angular positions with respect to the microphone array are the same. Hence, moving speakers cannot be separated based on the CSV when their angular ranges with respect to the array overlap during the recording. The experimental part of this work presented in Section 4 validates these findings.

### 2.3 Source model

In this section, we introduce the statistical model of the signals adopted from IVE. Samples (frames) of signals will be assumed to be independent and identically distributed (i.i.d.) within each block according to the probability density function (pdf) of the representing random variable.

Let **s**_{t} denote the vector component corresponding to the SOI, i.e., \(\mathbf {s}_{t}=[s_{1,t},\dots,s_{K,t}]^{T}\). The elements of **s**_{t} are assumed to be uncorrelated (because they correspond to different frequency components of the SOI) but dependent, that is, their higher-order moments are taken into account [9]. Let *p*_{s}(**s**_{t}) denote the joint pdf of **s**_{t} and \(p_{\mathbf {z}_{k,t}}(\mathbf {z}_{k,t})\) denote the pdf^{Footnote 2} of **z**_{k,t}. To simplify the notation, *p*_{s}(·) will be written without the index *t*, although it generally depends on *t*. Since **s**_{t} and \(\mathbf {z}_{1,t},\dots,\mathbf {z}_{K,t}\) are independent, their joint pdf within the *t*th block is equal to the product of marginal pdfs

$$ p_{s}(\mathbf{s}_{t})\cdot\prod_{k=1}^{K} p_{\mathbf{z}_{k,t}}(\mathbf{z}_{k,t}). $$

(11)

By applying the transformation theorem to (11) using (10), from which it follows that

$$ \left(\begin{array}{c} s_{k,t}\\ \mathbf{z}_{k,t} \end{array}\right)=\mathbf{W}_{k,t}\mathbf{x}_{k,t}= \left(\begin{array}{c} \mathbf{w}_{k}^{H}\mathbf{x}_{k,t}\\ \mathbf{B}_{k,t}\mathbf{x}_{k,t} \end{array}\right), $$

(12)

the joint pdf of the observed signals from the *t*th block reads

$$\begin{array}{*{20}l} p_{\mathbf{x}}(\{\mathbf{x}_{k,t}\}_{k}) &= p_{s}\left(\left\{\mathbf{w}_{k}^{H}\mathbf{x}_{k,t}\right\}_{k}\right) \\ &\quad \times\prod_{k=1}^{K} p_{\mathbf{z}_{k,t}}(\mathbf{B}_{k,t}\mathbf{x}_{k,t}) |\det \mathbf{W}_{k,t}|^{2}. \end{array} $$

(13)

Hence, the log-likelihood function as a function of the parameter vectors **w**_{k} and **a**_{k,t} and all available samples of the observed signals in the *t*th block is given by

$$ \begin{aligned} &\mathcal{L}(\{{\mathbf w}_{k}\}_{k},\{{\mathbf a}_{k,t}\}_{k}|\{{\mathbf x}_{k,t}\}_{k})\\ &\quad= \hat{\mathrm E}\left[\log p_{s}(\{{\hat s}_{k,t}\}_{k})\right] +\sum_{k=1}^{K} \hat{\mathrm E}\left[\log p_{{\mathbf z}_{k,t}}(\hat{\mathbf z}_{k,t})\right]\\ &\qquad+\sum_{k=1}^{K}\log |\det {\mathbf W}_{k,t}|^{2}, \end{aligned} $$

(14)

where \({\hat s}_{k,t}=\mathbf {w}_{k}^{H}\mathbf {x}_{k,t}\) and \(\hat {\mathbf {z}}_{k,t}=\mathbf {B}_{k,t}\mathbf {x}_{k,t}\) denote the current estimate of the SOI and of the background signals, respectively.

In BSS and BSE, the true pdfs of the original sources are not known, so suitable model densities have to be chosen in order to derive a contrast function based on (14). To find an appropriate surrogate of *p*_{s}(**s**_{t}), the variance of the SOI, which can change from block to block^{Footnote 3}, has to be taken into account. Let *f*(·) be a pdf corresponding to a normalized non-Gaussian random variable. To reflect the block-dependent variance, *p*_{s}(**s**_{t}) should be replaced by

$$ p_{s}(\mathbf{s}_{t}) \approx f\left(\left\{\frac{{ s}_{k,t}}{{\sigma}_{k,t}}\right\}_{k}\right)\left(\prod_{k=1}^{K}{\sigma}_{k,t}\right)^{-2}, $$

(15)

where \(\sigma ^{2}_{k,t}\) denotes the variance of *s*_{k,t}. Its unknown value is replaced by the sample-based variance of \(\hat s_{k,t}\), which is equal to \(\hat \sigma _{k,t}^{2}=\mathbf {w}_{k}^{H}\widehat {\mathbf {C}}_{k,t}\mathbf {w}_{k}\), where \(\widehat {\mathbf {C}}_{k,t}=\hat {\mathrm {E}}\left [\mathbf {x}_{k,t}\mathbf {x}_{k,t}^{H}\right ]\) is the sample-based covariance matrix of **x**_{k,t}.
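The block-wise variance estimate can be sketched as follows for one frequency bin. The random data and array layout are assumptions for illustration, and the signals are treated as zero-mean so that the average of outer products serves as the covariance estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, N_b = 3, 4, 200                      # toy sizes (hypothetical)
x = rng.standard_normal((T, d, N_b)) + 1j * rng.standard_normal((T, d, N_b))
w = rng.standard_normal(d) + 1j * rng.standard_normal(d)

# C_hat[t] = E_hat[x x^H]: average of outer products over the frames of block t.
C_hat = np.einsum("tin,tjn->tij", x, x.conj()) / N_b

# sigma2_hat[t] = w^H C_hat[t] w, real and non-negative up to rounding.
sigma2_hat = np.einsum("i,tij,j->t", w.conj(), C_hat, w).real

# The same quantity obtained directly as the mean power of s_hat = w^H x.
s_hat = np.einsum("d,tdn->tn", w.conj(), x)
sigma2_direct = np.mean(np.abs(s_hat) ** 2, axis=1)
```

Computing the variance through the covariance matrices is convenient when **w**_{k} changes across iterations, because the matrices can be precomputed once per block.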

It is worth noting that in the case of the static mixing model, i.e. when *T*=1, it can be assumed that \(\sigma ^{2}_{k,t}=1\) because of the scaling ambiguity.

Similarly to [13], the pdf of the background is assumed to be circular Gaussian with zero mean and (unknown) covariance matrix \(\mathbf {C}_{\mathbf {z}_{k,t}}=\mathrm {E}\left [\mathbf {z}_{k,t}\mathbf {z}_{k,t}^{H}\right ]\), i.e., \(p_{\mathbf {z}_{k,t}}\sim \mathcal {CN}(0,\mathbf {C}_{\mathbf {z}_{k,t}})\). Next, by Eq. (15) in [13], it follows that |det **W**_{k,t}|^{2}=|*γ*_{k,t}|^{2(d−2)}, which corresponds to the third term in (14).
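The determinant identity can be verified numerically for a few dimensions. The sketch below is an illustrative check, not the derivation from [13]; the helper name `demixing_matrix` and the random vectors are assumptions.

```python
import numpy as np

def demixing_matrix(a, w):
    """W_{k,t} from (9) for one bin and block; assumes w^H a = 1."""
    gamma, g = a[0], a[1:]
    B = np.column_stack([g, -gamma * np.eye(len(g))])
    return np.vstack([w.conj()[None, :], B])

rng = np.random.default_rng(6)
results = []
for d in (2, 3, 5):
    a = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    w = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    w = w / np.conj(np.vdot(w, a))         # enforce the distortionless constraint
    lhs = abs(np.linalg.det(demixing_matrix(a, w))) ** 2
    rhs = abs(a[0]) ** (2 * (d - 2))       # |gamma|^{2(d-2)}
    results.append((d, lhs, rhs))
```

For *d*=2 the right-hand side equals 1, so the determinant term does not depend on the parameters in that case.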

Now, by replacing the unknown pdfs in (14) and by neglecting the constant terms, we obtain the contrast function in the form

$$\begin{array}{*{20}l} \mathcal{C}&(\{{\mathbf w}_{k}\}_{k},\{{\mathbf a}_{k,t}\}_{k,t}) \\ &= \frac{1}{T}\sum_{t=1}^{T}\left\{\hat{\mathrm{E}}\left[\log f\left(\left\{\frac{{\mathbf w}_{k}^{H}{\mathbf x}_{k,t}}{\hat\sigma_{k,t}}\right\}_{k}\right)\right] -\sum_{k=1}^{K}\log(\hat\sigma_{k,t})^{2} \right.\\ &\quad-\sum_{k=1}^{K} \hat{\mathrm E}\left[{\mathbf x}_{k,t}^{H}{\mathbf B}_{k,t}^{H}{\mathbf C}_{{\mathbf z}_{k,t}}^{-1}{\mathbf B}_{k,t}{\mathbf x}_{k,t}\right] \\&\quad+\left.(d-2)\sum_{k=1}^{K}\log |\gamma_{k,t}|^{2}\right\}. \end{array} $$

(16)

The nuisance parameter \(\mathbf {C}_{\mathbf {z}_{k,t}}\) will later be replaced by its sample-based estimate \(\widehat {\mathbf C}_{{\mathbf z}_{k,t}}=\hat {\mathrm E}\left [\hat {\mathbf z}_{k,t}\hat {\mathbf z}_{k,t}^{H}\right ]\).
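A minimal sketch of evaluating (16) is given below. It rests on several assumptions that are not part of the text: the model density is taken as log *f*(**u**) = −‖**u**‖₂ (a spherical super-Gaussian choice used only for illustration, since *f* is left unspecified here), the function name `csv_contrast` and the array layouts are hypothetical, and the background covariance is replaced by its sample-based estimate.

```python
import numpy as np

def csv_contrast(X, w, a):
    """Evaluate the contrast (16) with C_z replaced by its sample estimate.

    X: (K, T, d, N_b) STFT coefficients grouped into blocks
    w: (K, d) separating vectors, a: (K, T, d) mixing vectors, w_k^H a_{k,t} = 1
    ASSUMPTION: log f(u) = -||u||_2, chosen only for illustration.
    """
    K, T, d, N_b = X.shape
    s_hat = np.einsum("kd,ktdn->ktn", w.conj(), X)       # w_k^H x_{k,t}
    sigma2 = np.mean(np.abs(s_hat) ** 2, axis=-1)        # (K, T)
    u = s_hat / np.sqrt(sigma2)[..., None]               # normalized SOI estimate
    term_f = np.mean(-np.sqrt(np.sum(np.abs(u) ** 2, axis=0)), axis=-1)
    term_sigma = -np.sum(np.log(sigma2), axis=0)
    term_bg = np.zeros(T)
    for k in range(K):
        for t in range(T):
            gamma, g = a[k, t, 0], a[k, t, 1:]
            B = np.column_stack([g, -gamma * np.eye(d - 1)])
            z = B @ X[k, t]                              # background estimate
            C_z = z @ z.conj().T / N_b                   # sample covariance
            quad = np.sum(z.conj() * np.linalg.solve(C_z, z), axis=0).real
            term_bg[t] -= np.mean(quad)                  # E_hat[z^H C_z^{-1} z]
    term_gamma = (d - 2) * np.sum(np.log(np.abs(a[..., 0]) ** 2), axis=0)
    return float(np.mean(term_f + term_sigma + term_bg + term_gamma))

rng = np.random.default_rng(7)

def crandn(*shape):
    """Complex array with independent standard normal real and imaginary parts."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

K, T, d, N_b = 4, 3, 3, 100                # toy sizes (hypothetical)
X = crandn(K, T, d, N_b)
w = crandn(K, d)
a = crandn(K, T, d)
a = a / np.einsum("kd,ktd->kt", w.conj(), a)[..., None]  # w_k^H a_{k,t} = 1
value = csv_contrast(X, w, a)
```

When the sample covariance of the same background estimate is plugged in, the quadratic term equals tr(Ĉ⁻¹Ĉ) = *d*−1 for every *k* and *t*, so in this sketch it contributes only a constant to the returned value.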