The proposed separation consistency training is performed on two separation networks with heterogeneous structures. In this section, we first present the principle of our SCT, then introduce its three variants and their differences, including the basic SCT and the cross-knowledge adaptation. Next, the two main algorithms in the proposed SCT, consistent pseudo-labeling and selection (CPS) and heterogeneous knowledge fusion (HKF), are described in detail. For simplicity, we only consider the speech separation scenario with two-speaker overlapped speech.
3.1 Separation consistency training
Our separation consistency training is specially designed to improve unsupervised speech separation when the target mixtures deviate far from the training simulation dataset. It follows a heterogeneous separation framework that creates and selects informative data pairs with high-confidence pseudo ground-truth, iteratively improving cross-domain speech separation by adapting the source separation models to the target acoustic environments. Because the whole framework relies heavily on consistent separation results of the unlabeled mixtures and on a separation consistency measure for pseudo-labeling, we call the whole training process separation consistency training (SCT).
3.1.1 Basic SCT
Given a large, or even a limited, amount of unlabeled target mixtures, the basic SCT procedure can be divided into three main steps:

(a)
Mixture separation. Separate each unlabeled mixture using two heterogeneous separation models that have been well-trained on the source simulated training set;

(b)
Consistent pseudo-labeling and selection (CPS). Based on the separation results of step (a), calculate a separation consistency measure (SCM, Eq. (1)) and a mixture separation consistency measure (mSCM, Eq. (3)) to evaluate the confidence of the separation outputs. Then, select the unlabeled mixtures with highly consistent results, together with their corresponding separation outputs as pseudo ground-truth, to form a “Pseudo Labeled Set”;

(c)
Iterative model adaptation. Combine the “Pseudo Labeled Set” with the original source domain “Simulation Training Set” to refine the source models so that they learn the target domain acoustics, then repeat the above process iteratively (a minimal code sketch of one such iteration follows this list).
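As a summary of the three steps, the following minimal Python sketch outlines one SCT-1 iteration. The helper names `separate`, `cps_select`, and `adapt` are hypothetical placeholders standing in for the actual separation, CPS, and fine-tuning routines; they are not part of any released code.

```python
# A minimal sketch of one basic SCT iteration (SCT-1), under the
# assumption that `separate`, `cps_select`, and `adapt` exist.
def sct1_iteration(dpccn, convtasnet, unlabeled_mixtures, sim_train_set):
    # (a) Separate every unlabeled mixture with both source models.
    dpccn_outputs = [separate(dpccn, y) for y in unlabeled_mixtures]
    tasnet_outputs = [separate(convtasnet, y) for y in unlabeled_mixtures]

    # (b) CPS: keep only mixtures whose two separation results agree,
    #     taking the DPCCN outputs as pseudo ground-truth.
    pseudo_labeled_set = cps_select(
        unlabeled_mixtures, dpccn_outputs, tasnet_outputs)

    # (c) Adapt both models on the combined material; the caller
    #     repeats the whole procedure for the next iteration.
    combined = sim_train_set + pseudo_labeled_set
    adapt(dpccn, combined)
    adapt(convtasnet, combined)
    return dpccn, convtasnet
```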
The two separation models in step (a) usually have comparable performance but heterogeneous neural network structures. The bigger the difference between the models, the more complementary information can be achieved. In this study, we choose DPCCN and Conv-TasNet, presented in Section 2, as the heterogeneous candidates. The former is taken as the primary model, while the latter is regarded as a reviewer model. Conv-TasNet [13] is a popular masking-based time-domain separation model, while DPCCN [35] is our recently proposed mapping-based time-frequency domain model with good robustness to complicated acoustic environments. The large difference in modeling patterns between Conv-TasNet and DPCCN, i.e., masking versus mapping and time domain versus time-frequency domain, guarantees a large diversity of separated results. This diversity increases the chance of improving the source models iteratively, because it produces more informative target mixtures as new training samples that the primary source model could not generate by itself. In fact, during CPS in step (b), each model in this heterogeneous SCT framework acts as a reviewer for the other: every input mixture is double-inspected by the two reviewers from different perspectives, and only mixtures with consistent separation results from both are selected. In this way, the double-inspection mechanism under a heterogeneous framework ensures high confidence in the pseudo ground-truth of each selected mixture in the target domain.
The whole framework of the basic SCT is shown as the first variant of our proposed SCT in subfigure (A) (SCT-1) of Fig. 1. In SCT-1, the details of consistent pseudo-labeling and selection (CPS) are presented in Section 3.2 and illustrated in Fig. 2A. The “D-Pseudo Labeled Set” (“D” denotes DPCCN’s outputs) contains the data pairs of selected unlabeled mixtures and their pseudo ground-truth, which derive from the individual separation outputs of the primary model DPCCN. Together with the original source domain “Simulation Training Set,” both the primary and reviewer models are refined and adapted to the target domain in each iteration. It is worth noting that model adaptation with the combined training set is necessary for the SCT algorithm. Our source models have already been trained well on the simulation set, and the pseudo ground-truth of the “D-Pseudo Labeled Set” is generated by DPCCN itself; if we used only the simulation set or only the “D-Pseudo Labeled Set” to adjust the primary source model, the training gradient would be very small or even zero. In that case, the error between model outputs and labels is difficult to back-propagate and the adaptation process fails. However, if we adapt the model using both the “Simulation Training Set” and the “D-Pseudo Labeled Set,” the model can still be adapted to the target domain even though the error between model outputs and ground-truth is small. For example, a simple neural network can be written as \(y = wx + b\), where \(y, w, x, b\) are the model output, weight, model input, and bias, respectively. The partial derivative of \(y\) with respect to the weight \(w\) is the model input \(x\), so the weight update is driven by the product of the output error and the input. Back in our scenario, by combining the “Simulation Training Set” and the “D-Pseudo Labeled Set,” the target domain data can take part in model adaptation while the loss on the source domain simulation set keeps the error signal alive.
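The following toy computation, not from the paper, illustrates this point for the scalar model \(y = wx + b\) with an MSE loss, where \(\partial L / \partial w = 2(y - \text{label})\,x\):

```python
# Toy illustration: why adaptation stalls on self-generated labels alone.
w, b = 0.8, 0.1

# Target-domain sample whose "label" is the model's own output:
x_pseudo = 1.5
pseudo_label = w * x_pseudo + b
grad_w_pseudo = 2 * (w * x_pseudo + b - pseudo_label) * x_pseudo
print(grad_w_pseudo)  # 0.0 -> no weight update from this sample alone

# A simulated sample with a real label keeps the loss, and hence the
# gradient, non-zero, so target-domain inputs can join the update.
x_sim, sim_label = 2.0, 2.3
grad_w_sim = 2 * (w * x_sim + b - sim_label) * x_sim
print(grad_w_sim)  # -2.4 -> adaptation proceeds with the combined set
```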
3.1.2 SCT with cross-knowledge adaptation
To fully exploit the complementary information between the heterogeneous networks, a cross-knowledge adaptation is proposed to improve the basic SCT. The framework is illustrated as the second SCT variant (SCT-2) in Fig. 1B. Different from the basic SCT, in SCT-2 the reviewer Conv-TasNet is first updated using the combined “D-Pseudo Labeled Set” and “Simulation Training Set,” i.e., the pseudo ground-truth of the primary model is used to guide the reviewer model’s adaptation. Next, we re-separate all the unlabeled mixtures using the updated reviewer to obtain more accurate separation outputs. Then, all the pseudo ground-truth in the “D-Pseudo Labeled Set” is replaced by the corresponding new individual outputs produced by the updated reviewer Conv-TasNet, constructing a new pseudo labeled set, the “T-Pseudo Labeled Set” (“T” denotes Conv-TasNet’s outputs). Finally, the “T-Pseudo Labeled Set” and “Simulation Training Set” are combined to adjust the primary model DPCCN as in SCT-1. In this model adaptation, the pseudo ground-truth of the reviewer model supervises the primary model training. As in teacher-student learning, throughout SCT-2 the primary and reviewer models benefit each other: the knowledge learned by each is cross-used as a guide to improve the other’s target-domain adaptation. We therefore call this adaptation procedure “cross-knowledge adaptation” for simplicity. In addition, since the “T-Pseudo Labeled Set” is actually a combination of the previously selected separation consistency statistics in the “D-Pseudo Labeled Set” and the new pseudo ground-truth from the updated Conv-TasNet, we use the heterogeneous knowledge fusion (HKF) block in Fig. 1 to represent this knowledge combination. Details of HKF are given in subfigure (D) of Fig. 2 and Section 3.3.
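For concreteness, a minimal sketch of one SCT-2 iteration follows, reusing the hypothetical helpers from the SCT-1 sketch above; the data layout of the pseudo labeled sets is an assumption for illustration.

```python
# A minimal sketch of one SCT-2 iteration with cross-knowledge adaptation.
def sct2_iteration(dpccn, convtasnet, unlabeled_mixtures, sim_train_set):
    # First CPS pass: build the D-Pseudo Labeled Set from DPCCN outputs.
    dpccn_outputs = [separate(dpccn, y) for y in unlabeled_mixtures]
    tasnet_outputs = [separate(convtasnet, y) for y in unlabeled_mixtures]
    d_pseudo_set = cps_select(unlabeled_mixtures, dpccn_outputs, tasnet_outputs)

    # Adapt the reviewer first, guided by the primary model's pseudo labels.
    adapt(convtasnet, sim_train_set + d_pseudo_set)

    # HKF: re-separate the selected mixtures with the updated reviewer and
    # swap its outputs in as the new labels (the "T-Pseudo Labeled Set").
    t_pseudo_set = [(mix, separate(convtasnet, mix))
                    for (mix, _labels) in d_pseudo_set]

    # Finally adapt the primary model with the reviewer's pseudo labels.
    adapt(dpccn, sim_train_set + t_pseudo_set)
    return dpccn, convtasnet
```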
Subfigure (C) (SCT-3) of Fig. 1 is a variant of SCT-2 with a minor modification before the HKF block. In SCT-3, CPS is performed twice. The first CPS is the same as in SCT-1 and SCT-2, while in the second CPS the separation consistency statistics, SCM and mSCM, are recomputed and updated using the mixture separation outputs of both DPCCN and the updated Conv-TasNet. All other operations are the same as in SCT-2. The motivation behind this two-stage CPS is that the adapted Conv-TasNet can provide more accurate separation results for the target domain mixtures, which lets the second-stage CPS produce more reliable consistent separation results for unlabeled mixture pseudo-labeling in each SCT iteration.
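To make the difference from SCT-2 concrete, a sketch of the second CPS pass, again using the hypothetical helpers introduced earlier:

```python
# SCT-3's second CPS pass: recompute SCM/mSCM against the *updated*
# reviewer's outputs and reselect, rather than reusing the first-pass
# selection as SCT-2 does.
def second_cps(dpccn, convtasnet_updated, unlabeled_mixtures):
    dpccn_outputs = [separate(dpccn, y) for y in unlabeled_mixtures]
    new_tasnet_outputs = [separate(convtasnet_updated, y)
                          for y in unlabeled_mixtures]
    return cps_select(unlabeled_mixtures, dpccn_outputs, new_tasnet_outputs)
```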
In summary, this section has presented three variants of the proposed SCT: the basic structure and two enhanced variants with cross-knowledge adaptation. Details of the CPS and HKF blocks used in SCT are described in the following sections.
3.2 Consistent pseudo-labeling and selection
The consistent pseudo-labeling and selection (CPS) block in the proposed SCT aims to produce high-quality individual pseudo ground-truth for each unlabeled mixture, based on the outputs of the two heterogeneous networks and the original mixture speech. The whole CPS procedure is illustrated in Fig. 2A. It contains two main stages: confidence measure calculation, followed by pseudo ground-truth selection.
3.2.1 Confidence measure calculation
Two measures are calculated in this stage: one is the separation consistency measure (SCM, Eq. (1)) and the other is the mixture separation consistency measure (mSCM, Eq. (3)). Both are used to evaluate the confidence of the heterogeneous separation outputs produced by DPCCN and Conv-TasNet.
As shown in the left part of Fig. 2A, we are given N unlabeled mixtures, each containing M single sources; here we assume \(M=2\). For the nth mixture, the SCM is calculated by taking the individual separation outputs \(\textbf{x}_n\) of the primary model DPCCN as pseudo references, as follows:
$$\begin{aligned} {\texttt{SCM}}\left( \textbf{x}_n, \textbf{v}_n\right) = \max _{\textbf{P}} \frac{1}{M} \sum\limits_{i=1}^M \text{SI-SNR}\left( x_n^i, [\textbf{P}\textbf{v}_n]_i\right) \end{aligned}$$
(1)
where \(\textbf{x}_n = [x_n^1, x_n^2, ..., x_n^M]^{\textrm{T}}\) and \(\textbf{v}_n = [v_n^1, v_n^2, ..., v_n^M]^{\textrm{T}}\) are the M individual speech signals separated by DPCCN and Conv-TasNet for the nth input mixture, respectively, and \(x_n^i\) and \(v_n^i\) are the ith individual signals. \(\textbf{P}\) is an \(M \times M\) permutation matrix, \([\cdot ]_i\) denotes selecting the ith element, and \(\textrm{T}\) is the transpose operation. The SI-SNR in Eq. (1) is the standard scale-invariant signal-to-noise ratio [38] used to measure the performance of state-of-the-art speech separation systems. It is defined as:
$$\begin{aligned} \text{SI-SNR}(s, \hat{s}) = 10\log _{10}\left( \frac{\left\Vert \frac{\langle \hat{s}, s \rangle }{\langle s, s \rangle }s\right\Vert ^2}{\left\Vert \frac{\langle \hat{s}, s \rangle }{\langle s, s \rangle }s - \hat{s} \right\Vert ^2}\right) \end{aligned}$$
(2)
where \(s\) and \(\hat{s}\) are the reference and estimated speech, respectively, \(\Vert \cdot \Vert ^2\) denotes the signal power, and \(\langle \cdot , \cdot \rangle\) is the inner-product operation.
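A direct NumPy transcription of Eq. (2) as a sketch; the zero-mean preprocessing common in practical SI-SNR implementations is noted but left out to match the equation as written:

```python
import numpy as np

# SI-SNR per Eq. (2): `s` is the reference, `s_hat` the estimate,
# both 1-D waveforms of equal length. (Practical implementations
# usually zero-mean both signals first.)
def si_snr(s, s_hat, eps=1e-8):
    # Scaled projection of the estimate onto the reference.
    s_target = (np.dot(s_hat, s) / (np.dot(s, s) + eps)) * s
    e_noise = s_target - s_hat
    return 10.0 * np.log10(
        (np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))
```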
Figure 2B shows the two-speaker SCM process for the nth mixture. The DPCCN outputs \(x_n^1\) and \(x_n^2\) are taken as references to calculate the pairwise SI-SNR with the Conv-TasNet outputs \(v_n^1\) and \(v_n^2\). In this case, there are two permutations, namely \([x_n^1, v_n^1; x_n^2, v_n^2]\) and \([x_n^1, v_n^2; x_n^2, v_n^1]\); SCM averages the pairwise SI-SNR for each assignment and takes the highest value to represent the separation consistency between the two heterogeneous networks’ outputs. The higher the SCM, the more we can trust the consistency of the unlabeled separation outputs. However, when an input mixture is hard to separate for both heterogeneous networks, \(\textbf{x}_n\) and \(\textbf{v}_n\) can both be very close to the original mixture speech and still yield a very high SCM. In this case, the pseudo reference \(\textbf{x}_n\) may be far from the ground-truth and unqualified for source model adaptation. To alleviate this situation, the following mSCM is introduced to evaluate, from another perspective, the quality of the target domain mixture separation results and to strengthen the confidence of the selected results.
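A sketch of Eq. (1) for general M, reusing `si_snr` from above; `x` holds the M DPCCN outputs (pseudo references) and `v` the M Conv-TasNet outputs:

```python
from itertools import permutations

# SCM per Eq. (1): best mean pairwise SI-SNR over all M! assignments.
def scm(x, v):
    best = float("-inf")
    for perm in permutations(range(len(x))):        # all M! assignments
        pairwise = [si_snr(x[i], v[j]) for i, j in enumerate(perm)]
        best = max(best, sum(pairwise) / len(pairwise))  # mean over sources
    return best
```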
The mixture separation consistency measure (mSCM) aims to measure the consistency between the outputs of the heterogeneous networks and the original input mixture \(y_n\). It is defined as:
$$\begin{aligned} {\texttt{mSCM}}(y_n, \textbf{x}_n, \textbf{v}_n) = \frac{1}{2M} \sum\limits_{i=1}^M \sum\limits_{\phi } \text{SI-SNR}\left(y_n, \phi _n^i\right) \end{aligned}$$
(3)
where \(\phi _n^i \in \{x_{n}^{i}, v_{n}^{i}\}\) is the ith individual output of DPCCN or Conv-TasNet for the nth input mixture, as in Eq. (1). Figure 2C details the mSCM operation for the two-speaker case; as shown in Eq. (3), the average SI-SNR between the input mixture and all separated outputs is calculated. Different from SCM, mSCM evaluates the confidence of separation results in the opposite direction: a lower value is desired. We believe that, in most conditions, the waveforms of well-separated results should be very different from the original mixture, so the corresponding mSCM will be low. Note that when an input mixture has a high input SNR, the lower-mSCM constraint will filter out its separation results. Even so, the lower-mSCM hypothesis still makes sense, because such filtered speech with high input SNR is somewhat homogeneous and of limited benefit to model adaptation. In addition, high input SNR cases are rare in cross-domain tasks. Therefore, the lower-mSCM constraint is safe and effective in most conditions.
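A sketch of Eq. (3), again reusing `si_snr`; lower values mean the separated outputs differ more from the mixture:

```python
# mSCM per Eq. (3): average SI-SNR between the mixture `y` and every
# separated output of both models (2*M signals in total).
def mscm(y, x, v):
    outputs = list(x) + list(v)
    return sum(si_snr(y, phi) for phi in outputs) / len(outputs)
```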
3.2.2 Pseudo ground-truth selection
After computing the SCM and mSCM statistics of the input mixtures, we reorganize all the statistics and speech signals related to each unlabeled input mixture into a new data tuple format to facilitate pseudo ground-truth selection. As shown in Fig. 2A, we call each data tuple a “separation consistency information (SCI)” tuple, organized as:
$$\begin{aligned} \texttt{SCI}= \{ \texttt{ID, SCM, mSCM, Mix, Sep1, Sep2}\} \end{aligned}$$
(4)
where ID is the mixture ID, Mix is the input mixture speech signal, and Sep1 and Sep2 are the two individual speech signals separated by DPCCN. With these SCI tuples, we then perform the pseudo ground-truth selection in two ways (a code sketch of both selection rules is given after their definitions):

CPS-1: Select SCI tuples whose SCM values lie in the top \(p\%\) of the SCM range, \(p\in [0,100]\).

CPS-2: Select SCI tuples that satisfy the following constraint:
$$\begin{aligned} {\texttt{SCI}}_s = \left\{ {\texttt{SCI}}_k \mid ({\texttt{SCM}}_k > \alpha ) \cap ({\texttt{mSCM}}_k < \beta ) \right\} \end{aligned}$$
(5)
where \(k = 1,2,...,N\); \({\texttt{SCI}}_s\) and \({\texttt{SCI}}_k\) are the selected SCI tuples and the kth SCI tuple, respectively; and \(\alpha\) and \(\beta\) are the thresholds for SCM and mSCM, respectively.
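A sketch of the two selection rules, assuming each SCI tuple is represented as a dict with at least "SCM" and "mSCM" fields; the parameter names p, alpha, and beta follow the text, and their values are task-dependent:

```python
def cps1(sci_tuples, p):
    # CPS-1: keep the top p% of tuples ranked by SCM.
    ranked = sorted(sci_tuples, key=lambda t: t["SCM"], reverse=True)
    keep = max(1, round(len(ranked) * p / 100))
    return ranked[:keep]

def cps2(sci_tuples, alpha, beta):
    # CPS-2: keep tuples with high SCM *and* low mSCM, as in Eq. (5).
    return [t for t in sci_tuples if t["SCM"] > alpha and t["mSCM"] < beta]
```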
For both CPS-1 and CPS-2, the separated signals Sep1 and Sep2 in all selected SCI tuples are taken as high-confidence pseudo ground-truth for their corresponding mixture Mix. The selected mixtures with pseudo ground-truth then form the “D-Pseudo Labeled Set” (pseudo ground-truth produced by DPCCN) for further separation model adaptation. As discussed in the definition of mSCM, compared with CPS-1, CPS-2 may be better at handling difficult separation cases to some extent.
3.3 Heterogeneous knowledge fusion
The heterogeneous knowledge fusion (HKF), illustrated in Fig. 2D, is used during the cross-knowledge adaptation in SCT-2 and SCT-3. HKF is a very simple operation: it replaces Sep1 and Sep2 in the selected SCI tuples of Fig. 2A with the outputs of the adapted Conv-TasNet, as in SCT-2 and SCT-3. We use \(v_n^{i'}\) to represent the ith individual signal of the nth mixture separated by the adapted Conv-TasNet. The updated data tuples \(\{\texttt{Mix, Sep1, Sep2}\}\) are then gathered to form the “T-Pseudo Labeled Set” (pseudo ground-truth produced by Conv-TasNet). In this way, the complementary information between the prior separation consistency knowledge captured in the CPS block and the adapted Conv-TasNet is subtly integrated to further refine the primary DPCCN.
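A sketch of HKF under the same dict-based SCI tuple layout assumed above; `separate` is the hypothetical helper used in the earlier sketches:

```python
# HKF: inside each selected SCI tuple, replace the DPCCN pseudo labels
# with the updated Conv-TasNet's re-separated outputs (v'_n^1, v'_n^2).
def hkf(selected_sci, adapted_convtasnet):
    t_pseudo_set = []
    for sci in selected_sci:
        v1, v2 = separate(adapted_convtasnet, sci["Mix"])
        t_pseudo_set.append({"Mix": sci["Mix"], "Sep1": v1, "Sep2": v2})
    return t_pseudo_set
```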