The proposed separation consistency training is performed on two separation networks with heterogeneous structures. In this section, we first present the principle of our SCT, then introduce its three variants and their differences, including the basic SCT and the cross-knowledge adaptation. Next, the two main algorithms in the proposed SCT, consistent pseudo-labeling and selection (CPS) and heterogeneous knowledge fusion (HKF), are described in detail. For simplicity, we only consider the speech separation scenario with two-speaker overlapped speech.
3.1 Separation consistency training
Our separation consistency training is specially designed to improve unsupervised speech separation when the target mixtures deviate far from the training simulation dataset. It follows a heterogeneous separation framework that creates and selects informative data pairs with high-confidence pseudo ground-truth, iteratively improving cross-domain speech separation by adapting the source separation models to the target acoustic environments. Because the whole framework relies heavily on consistent separation results of the unlabeled mixtures and on a separation consistency measure for pseudo-labeling, we call the whole training process separation consistency training (SCT).
3.1.1 Basic SCT
Given a large, or even a limited, amount of unlabeled target mixtures, the basic SCT procedure can be divided into three main steps:

(a)
Mixture separation. Separate each unlabeled mixture using two heterogeneous separation models that have been well-trained on the source simulated training set;

(b)
Consistent pseudo-labeling and selection (CPS). Based on the separation results of step (a), calculate a separation consistency measure (SCM, Eq. (1)) and a mixture separation consistency measure (mSCM, Eq. (3)) to evaluate the confidence of the separation outputs. Then, select the unlabeled mixtures with highly consistent results, together with their corresponding separation outputs as pseudo ground-truth, to form a “Pseudo Labeled Set”;

(c)
Iterative model adaptation. Combine the “Pseudo Labeled Set” with the original source domain “Simulation Training Set” to refine the source models so that they learn the target domain acoustics, then repeat the above process iteratively (a minimal code sketch of one such iteration follows this list).
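As a summary of the three steps, the following minimal Python sketch outlines one SCT-1 iteration. The helper names `separate`, `cps_select`, and `adapt` are hypothetical placeholders standing in for the actual separation, CPS, and fine-tuning routines; they are not part of any released code.

```python
# A minimal sketch of one basic SCT iteration (SCT-1), under the
# assumption that `separate`, `cps_select`, and `adapt` exist.
def sct1_iteration(dpccn, convtasnet, unlabeled_mixtures, sim_train_set):
    # (a) Separate every unlabeled mixture with both source models.
    dpccn_outputs = [separate(dpccn, y) for y in unlabeled_mixtures]
    tasnet_outputs = [separate(convtasnet, y) for y in unlabeled_mixtures]

    # (b) CPS: keep only mixtures whose two separation results agree,
    #     taking the DPCCN outputs as pseudo ground-truth.
    pseudo_labeled_set = cps_select(
        unlabeled_mixtures, dpccn_outputs, tasnet_outputs)

    # (c) Adapt both models on the combined material; the caller
    #     repeats the whole procedure for the next iteration.
    combined = sim_train_set + pseudo_labeled_set
    adapt(dpccn, combined)
    adapt(convtasnet, combined)
    return dpccn, convtasnet
```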
The two separation models in step (a) usually have comparable performance but heterogeneous neural network structures. The bigger the difference between the models, the more complementary information can be achieved. In this study, we choose DPCCN and Conv-TasNet, presented in Section 2, as the heterogeneous candidates. The former is taken as the primary model, while the latter is regarded as a reviewer model. Conv-TasNet [13] is a popular masking-based time-domain separation model, while DPCCN [35] is our recently proposed mapping-based time-frequency domain model with good robustness to complicated acoustic environments. The large difference in modeling patterns between Conv-TasNet and DPCCN, i.e., masking versus mapping and time domain versus time-frequency domain, guarantees a large diversity of separated results. This diversity increases the chance of improving the source models iteratively, because it produces more informative target mixtures as new training samples that the primary source model could not generate by itself. In fact, during CPS in step (b), each model in this heterogeneous SCT framework acts as a reviewer for the other: every input mixture is double-inspected by the two reviewers from different perspectives, and only mixtures with consistent separation results from both are selected. In this way, the double-inspection mechanism under a heterogeneous framework ensures high confidence in the pseudo ground-truth of each selected mixture in the target domain.
The whole framework of the basic SCT is shown as the first variant of our proposed SCT in subfigure (A) (SCT-1) of Fig. 1. In SCT-1, the details of consistent pseudo-labeling and selection (CPS) are presented in Section 3.2 and illustrated in Fig. 2A. The “D-Pseudo Labeled Set” (“D” denotes DPCCN’s outputs) contains the data pairs of selected unlabeled mixtures and their pseudo ground-truth, which derive from the individual separation outputs of the primary model DPCCN. Together with the original source domain “Simulation Training Set,” both the primary and reviewer models are refined and adapted to the target domain in each iteration. It is worth noting that model adaptation with the combined training set is necessary for the SCT algorithm. Our source models have already been trained well on the simulation set, and the pseudo ground-truth of the “D-Pseudo Labeled Set” is generated by DPCCN itself; if we used only the simulation set or only the “D-Pseudo Labeled Set” to adjust the primary source model, the training gradient would be very small or even zero. In that case, the error between model outputs and labels is difficult to back-propagate and the adaptation process fails. However, if we adapt the model using both the “Simulation Training Set” and the “D-Pseudo Labeled Set,” the model can still be adapted to the target domain even though the error between model outputs and ground-truth is small. For example, a simple neural network can be written as \(y = wx + b\), where \(y, w, x, b\) are the model output, weight, model input, and bias, respectively. The partial derivative of \(y\) with respect to the weight \(w\) is the model input \(x\), so the weight update is driven by the product of the output error and the input. Back in our scenario, by combining the “Simulation Training Set” and the “D-Pseudo Labeled Set,” the target domain data can take part in model adaptation while the loss on the source domain simulation set keeps the error signal alive.
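The following toy computation, not from the paper, illustrates this point for the scalar model \(y = wx + b\) with an MSE loss, where \(\partial L / \partial w = 2(y - \text{label})\,x\):

```python
# Toy illustration: why adaptation stalls on self-generated labels alone.
w, b = 0.8, 0.1

# Target-domain sample whose "label" is the model's own output:
x_pseudo = 1.5
pseudo_label = w * x_pseudo + b
grad_w_pseudo = 2 * (w * x_pseudo + b - pseudo_label) * x_pseudo
print(grad_w_pseudo)  # 0.0 -> no weight update from this sample alone

# A simulated sample with a real label keeps the loss, and hence the
# gradient, non-zero, so target-domain inputs can join the update.
x_sim, sim_label = 2.0, 2.3
grad_w_sim = 2 * (w * x_sim + b - sim_label) * x_sim
print(grad_w_sim)  # -2.4 -> adaptation proceeds with the combined set
```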
3.1.2 SCT with cross-knowledge adaptation
To fully exploit the complementary information between the heterogeneous networks, a cross-knowledge adaptation is proposed to improve the basic SCT. The framework is illustrated as the second SCT variant (SCT-2) in Fig. 1B. Different from the basic SCT, in SCT-2 the reviewer Conv-TasNet is first updated using the combined “D-Pseudo Labeled Set” and “Simulation Training Set,” i.e., the pseudo ground-truth of the primary model is used to guide the reviewer model’s adaptation. Next, we re-separate all the unlabeled mixtures using the updated reviewer to obtain more accurate separation outputs. Then, all the pseudo ground-truth in the “D-Pseudo Labeled Set” is replaced by the corresponding new individual outputs produced by the updated reviewer Conv-TasNet, constructing a new pseudo labeled set, the “T-Pseudo Labeled Set” (“T” denotes Conv-TasNet’s outputs). Finally, the “T-Pseudo Labeled Set” and “Simulation Training Set” are combined to adjust the primary model DPCCN as in SCT-1. In this model adaptation, the pseudo ground-truth of the reviewer model supervises the primary model training. As in teacher-student learning, throughout SCT-2 the primary and reviewer models benefit each other: the knowledge learned by each is cross-used as a guide to improve the other’s target-domain adaptation. We therefore call this adaptation procedure “cross-knowledge adaptation” for simplicity. In addition, since the “T-Pseudo Labeled Set” is actually a combination of the previously selected separation consistency statistics in the “D-Pseudo Labeled Set” and the new pseudo ground-truth from the updated Conv-TasNet, we use the heterogeneous knowledge fusion (HKF) block in Fig. 1 to represent this knowledge combination. Details of HKF are given in subfigure (D) of Fig. 2 and Section 3.3.
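For concreteness, a minimal sketch of one SCT-2 iteration follows, reusing the hypothetical helpers from the SCT-1 sketch above; the data layout of the pseudo labeled sets is an assumption for illustration.

```python
# A minimal sketch of one SCT-2 iteration with cross-knowledge adaptation.
def sct2_iteration(dpccn, convtasnet, unlabeled_mixtures, sim_train_set):
    # First CPS pass: build the D-Pseudo Labeled Set from DPCCN outputs.
    dpccn_outputs = [separate(dpccn, y) for y in unlabeled_mixtures]
    tasnet_outputs = [separate(convtasnet, y) for y in unlabeled_mixtures]
    d_pseudo_set = cps_select(unlabeled_mixtures, dpccn_outputs, tasnet_outputs)

    # Adapt the reviewer first, guided by the primary model's pseudo labels.
    adapt(convtasnet, sim_train_set + d_pseudo_set)

    # HKF: re-separate the selected mixtures with the updated reviewer and
    # swap its outputs in as the new labels (the "T-Pseudo Labeled Set").
    t_pseudo_set = [(mix, separate(convtasnet, mix))
                    for (mix, _labels) in d_pseudo_set]

    # Finally adapt the primary model with the reviewer's pseudo labels.
    adapt(dpccn, sim_train_set + t_pseudo_set)
    return dpccn, convtasnet
```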
Subfigure (C) (SCT-3) of Fig. 1 is a variant of SCT-2 with a minor modification before the HKF block. In SCT-3, CPS is performed twice. The first CPS is the same as in SCT-1 and SCT-2, while in the second CPS the separation consistency statistics, SCM and mSCM, are recomputed and updated using the mixture separation outputs of both DPCCN and the updated Conv-TasNet. All other operations are the same as in SCT-2. The motivation behind this two-stage CPS is that the adapted Conv-TasNet can provide more accurate separation results for the target domain mixtures, which lets the second-stage CPS produce more reliable consistent separation results for unlabeled mixture pseudo-labeling in each SCT iteration.
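To make the difference from SCT-2 concrete, a sketch of the second CPS pass, again using the hypothetical helpers introduced earlier:

```python
# SCT-3's second CPS pass: recompute SCM/mSCM against the *updated*
# reviewer's outputs and reselect, rather than reusing the first-pass
# selection as SCT-2 does.
def second_cps(dpccn, convtasnet_updated, unlabeled_mixtures):
    dpccn_outputs = [separate(dpccn, y) for y in unlabeled_mixtures]
    new_tasnet_outputs = [separate(convtasnet_updated, y)
                          for y in unlabeled_mixtures]
    return cps_select(unlabeled_mixtures, dpccn_outputs, new_tasnet_outputs)
```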
In summary, this section has presented three variants of the proposed SCT: the basic structure and two enhanced variants with cross-knowledge adaptation. Details of the CPS and HKF blocks used in SCT are described in the following sections.
3.2 Consistent pseudo-labeling and selection
The consistent pseudo-labeling and selection (CPS) block in the proposed SCT aims to produce high-quality individual pseudo ground-truth for each unlabeled mixture, based on the outputs of the two heterogeneous networks and the original mixture speech. The whole CPS procedure is illustrated in Fig. 2A. It contains two main stages: confidence measure calculation, followed by pseudo ground-truth selection.
3.2.1 Confidence measure calculation
Two measures are calculated in this stage: one is the separation consistency measure (SCM, Eq. (1)) and the other is the mixture separation consistency measure (mSCM, Eq. (3)). Both are used to evaluate the confidence of the heterogeneous separation outputs produced by DPCCN and Conv-TasNet.
As shown in the left part of Fig. 2A, we are given N unlabeled mixtures, each containing M single sources; here we assume \(M=2\). For the nth mixture, the SCM is calculated by taking the individual separation outputs \(\textbf{x}_n\) of the primary model DPCCN as pseudo references, as follows:
$$\begin{aligned} {\texttt{SCM}}\left( \textbf{x}_n, \textbf{v}_n\right) = \max _{\textbf{P}} \frac{1}{M} \sum\limits_{i=1}^M \text{SI-SNR}\left( x_n^i, [\textbf{P}\textbf{v}_n]_i\right) \end{aligned}$$
(1)
where \(\textbf{x}_n = [x_n^1, x_n^2, ..., x_n^M]^{\textrm{T}}\) and \(\textbf{v}_n = [v_n^1, v_n^2, ..., v_n^M]^{\textrm{T}}\) are the M individual speech signals separated by DPCCN and Conv-TasNet for the nth input mixture, respectively, and \(x_n^i\) and \(v_n^i\) are the ith individual signals. \(\textbf{P}\) is an \(M \times M\) permutation matrix, \([\cdot ]_i\) denotes selecting the ith element, and \(\textrm{T}\) is the transpose operation. The SI-SNR in Eq. (1) is the standard scale-invariant signal-to-noise ratio [38] used to measure the performance of state-of-the-art speech separation systems. It is defined as:
$$\begin{aligned} \text{SI-SNR}(s, \hat{s}) = 10\log _{10}\left( \frac{\left\Vert \frac{\langle \hat{s}, s \rangle }{\langle s, s \rangle }s\right\Vert ^2}{\left\Vert \frac{\langle \hat{s}, s \rangle }{\langle s, s \rangle }s - \hat{s} \right\Vert ^2}\right) \end{aligned}$$
(2)
where \(s\) and \(\hat{s}\) are the reference and estimated speech, respectively, \(\Vert \cdot \Vert ^2\) denotes the signal power, and \(\langle \cdot , \cdot \rangle\) is the inner-product operation.
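A direct NumPy transcription of Eq. (2) as a sketch; the zero-mean preprocessing common in practical SI-SNR implementations is noted but left out to match the equation as written:

```python
import numpy as np

# SI-SNR per Eq. (2): `s` is the reference, `s_hat` the estimate,
# both 1-D waveforms of equal length. (Practical implementations
# usually zero-mean both signals first.)
def si_snr(s, s_hat, eps=1e-8):
    # Scaled projection of the estimate onto the reference.
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_target - s_hat
    return 10.0 * np.log10(
        (np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))
```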
Figure 2B shows the two-speaker SCM process for the nth mixture. The DPCCN outputs \(x_n^1\) and \(x_n^2\) are taken as references to calculate the pairwise SI-SNR with the Conv-TasNet outputs \(v_n^1\) and \(v_n^2\). In this case, there are two permutations, namely \([x_n^1, v_n^1; x_n^2, v_n^2]\) and \([x_n^1, v_n^2; x_n^2, v_n^1]\); SCM averages the pairwise SI-SNR for each assignment and takes the highest value to represent the separation consistency between the two heterogeneous networks’ outputs. The higher the SCM, the more we can trust the consistency of the unlabeled separation outputs. However, when an input mixture is hard to separate for both heterogeneous networks, \(\textbf{x}_n\) and \(\textbf{v}_n\) can both be very close to the original mixture speech and still yield a very high SCM. In this case, the pseudo reference \(\textbf{x}_n\) may be far from the ground-truth and unqualified for source model adaptation. To alleviate this situation, the following mSCM is introduced to evaluate, from another perspective, the quality of the target domain mixture separation results and to strengthen the confidence of the selected results.
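A sketch of Eq. (1) for general M, reusing `si_snr` from above; `x` holds the M DPCCN outputs (pseudo references) and `v` the M Conv-TasNet outputs:

```python
from itertools import permutations

# SCM per Eq. (1): best mean pairwise SI-SNR over all M! assignments.
def scm(x, v):
    best = float("-inf")
    for perm in permutations(range(len(x))):        # all M! assignments
        pairwise = [si_snr(x[i], v[j]) for i, j in enumerate(perm)]
        best = max(best, sum(pairwise) / len(pairwise))  # mean over sources
    return best
```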
The mixture separation consistency measure (mSCM) aims to measure the consistency between the outputs of the heterogeneous networks and the original input mixture \(y_n\). It is defined as:
$$\begin{aligned} {\texttt{mSCM}}(y_n, \textbf{x}_n, \textbf{v}_n) = \frac{1}{2M} \sum\limits_{i=1}^M \sum\limits_{\phi } \text{SI-SNR}\left(y_n, \phi _n^i\right) \end{aligned}$$
(3)
where \(\phi _n^i \in \{x_{n}^{i}, v_{n}^{i}\}\) is the ith individual output of DPCCN or Conv-TasNet for the nth input mixture, as in Eq. (1). Figure 2C details the mSCM operation for the two-speaker case; as shown in Eq. (3), the average SI-SNR between the input mixture and all separated outputs is calculated. Different from SCM, mSCM evaluates the confidence of separation results in the opposite direction: a lower value is desired. We believe that, in most conditions, the waveforms of well-separated results should be very different from the original mixture, so the corresponding mSCM will be low. Note that when an input mixture has a high input SNR, the lower-mSCM constraint will filter out its separation results. Even so, the lower-mSCM hypothesis still makes sense, because such filtered speech with high input SNR is somewhat homogeneous and of limited benefit to model adaptation. In addition, high input SNR cases are rare in cross-domain tasks. Therefore, the lower-mSCM constraint is safe and effective in most conditions.
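A sketch of Eq. (3), again reusing `si_snr`; lower values mean the separated outputs differ more from the mixture:

```python
# mSCM per Eq. (3): average SI-SNR between the mixture `y` and every
# separated output of both models (2*M signals in total).
def mscm(y, x, v):
    outputs = list(x) + list(v)
    return sum(si_snr(y, phi) for phi in outputs) / len(outputs)
```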
3.2.2 Pseudo ground-truth selection
After computing the SCM and mSCM statistics of the input mixtures, we reorganize all the statistics and speech signals related to each unlabeled input mixture into a new data tuple format to facilitate pseudo ground-truth selection. As shown in Fig. 2A, we call each data tuple a “separation consistency information (SCI)” tuple, organized as:
$$\begin{aligned} \texttt{SCI}= \{ \texttt{ID, SCM, mSCM, Mix, Sep1, Sep2}\} \end{aligned}$$
(4)
where ID is the mixture ID, Mix is the input mixture speech signal, and Sep1 and Sep2 are the two individual speech signals separated by DPCCN. With these SCI tuples, we then perform the pseudo ground-truth selection in two ways (a code sketch of both selection rules is given after their definitions):

CPS-1: Select SCI tuples whose SCM values lie in the top \(p\%\) of the SCM range, \(p\in [0,100]\).

CPS-2: Select SCI tuples that satisfy the following constraint:
$$\begin{aligned} {\texttt{SCI}}_s = \left\{ {\texttt{SCI}}_k \mid ({\texttt{SCM}}_k > \alpha ) \cap ({\texttt{mSCM}}_k < \beta ) \right\} \end{aligned}$$
(5)
where \(k = 1,2,...,N\); \({\texttt{SCI}}_s\) and \({\texttt{SCI}}_k\) are the selected SCI tuples and the kth SCI tuple, respectively; and \(\alpha\) and \(\beta\) are the thresholds for SCM and mSCM, respectively.
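A sketch of the two selection rules, assuming each SCI tuple is represented as a dict with at least "SCM" and "mSCM" fields; the parameter names p, alpha, and beta follow the text, and their values are task-dependent:

```python
def cps1(sci_tuples, p):
    # CPS-1: keep the top p% of tuples ranked by SCM.
    ranked = sorted(sci_tuples, key=lambda t: t["SCM"], reverse=True)
    keep = max(1, round(len(ranked) * p / 100))
    return ranked[:keep]

def cps2(sci_tuples, alpha, beta):
    # CPS-2: keep tuples with high SCM *and* low mSCM, as in Eq. (5).
    return [t for t in sci_tuples if t["SCM"] > alpha and t["mSCM"] < beta]
```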
For both CPS-1 and CPS-2, the separated signals Sep1 and Sep2 in all selected SCI tuples are taken as high-confidence pseudo ground-truth for their corresponding mixture Mix. The selected mixtures with pseudo ground-truth then form the “D-Pseudo Labeled Set” (pseudo ground-truth produced by DPCCN) for further separation model adaptation. As discussed in the definition of mSCM, compared with CPS-1, CPS-2 may be better at handling difficult separation cases to some extent.
3.3 Heterogeneous knowledge fusion
The heterogeneous knowledge fusion (HKF), illustrated in Fig. 2D, is used during the cross-knowledge adaptation in SCT-2 and SCT-3. HKF is a very simple operation: it replaces Sep1 and Sep2 in the selected SCI tuples of Fig. 2A with the outputs of the adapted Conv-TasNet, as in SCT-2 and SCT-3. We use \(v_n^{i'}\) to represent the ith individual signal of the nth mixture separated by the adapted Conv-TasNet. The updated data tuples \(\{\texttt{Mix, Sep1, Sep2}\}\) are then gathered to form the “T-Pseudo Labeled Set” (pseudo ground-truth produced by Conv-TasNet). In this way, the complementary information between the prior separation consistency knowledge captured in the CPS block and the adapted Conv-TasNet is subtly integrated to further refine the primary DPCCN.
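A sketch of HKF under the same dict-based SCI tuple layout assumed above; `separate` is the hypothetical helper used in the earlier sketches:

```python
# HKF: inside each selected SCI tuple, replace the DPCCN pseudo labels
# with the updated Conv-TasNet's re-separated outputs (v'_n^1, v'_n^2).
def hkf(selected_sci, adapted_convtasnet):
    t_pseudo_set = []
    for sci in selected_sci:
        v1, v2 = separate(adapted_convtasnet, sci["Mix"])
        t_pseudo_set.append({"Mix": sci["Mix"], "Sep1": v1, "Sep2": v2})
    return t_pseudo_set
```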