Heterogeneous separation consistency training for adaptation of unsupervised speech separation

EURASIP Journal on Audio, Speech, and Music Processing

Table 5 SDRi/SI-SNRi (dB) performance of Conv-TasNet and DPCCN on the Aishell2Mix test set under different SCT configurations

SCT	System	#Iter	CPS-1	CPS-2	Oracle
SCT-1	Conv-TasNet	1	5.14/4.63	5.47/4.90	5.98/5.39
		2	5.45/4.94	5.99/5.39	6.18/5.57
	DPCCN	1	5.98/5.32	5.90/5.25	6.00/5.31
		2	6.17/5.50	6.03/5.39	6.10/5.44
SCT-2	Conv-TasNet	1	5.14/4.63	5.47/4.90	5.98/5.39
		2	5.36/4.89	6.15/5.52	6.21/5.65
	DPCCN	1	6.05/5.52	6.48/5.82	6.79/6.19
		2	5.49/5.05	6.43/5.81	6.45/5.91
SCT-3	Conv-TasNet	1	5.14/4.63	5.47/4.90	-
		2	5.43/4.93	5.77/5.24	-
	DPCCN	1	6.14/5.58	6.22/5.65	-
		2	6.02/5.52	6.10/5.56	-

“Oracle” means that the ground truth is used as the reference for calculating the SI-SNR of the separation outputs when selecting the pseudo ground truth. All source models are well pre-trained on Libri2Mix. The best settings of \(\{\alpha ,\beta \}\) in CPS-2 are \(\{5,5\}\) and \(\{8,5\}\) in the 1st and 2nd iterations, respectively, for all SCT variants. \(\eta\) is set to 5 for “Oracle selection”.
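As a rough illustration of the “Oracle” selection described in the note, the sketch below computes the standard SI-SNR of each separated output against its ground-truth source and keeps only the outputs that reach a threshold. It rests on several assumptions that the table note does not confirm: that \(\eta\) acts as a lower bound on SI-SNR, that it is expressed in dB, and that each output is already paired with its reference (no permutation search). The function names `si_snr` and `oracle_select` are hypothetical and are not taken from the paper.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals (standard definition)."""
    # Remove the mean so that a DC offset does not affect the measure.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def oracle_select(outputs, references, eta=5.0):
    """Return indices of outputs whose SI-SNR against the ground truth is >= eta.

    Assumption: eta is a threshold in dB; the paper's exact rule may differ.
    """
    return [
        i
        for i, (est, ref) in enumerate(zip(outputs, references))
        if si_snr(est, ref) >= eta
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(8000) / 8000.0
    ref = np.sin(2 * np.pi * 440 * t)                 # toy "ground-truth" source
    good = ref + 0.01 * rng.standard_normal(t.size)   # close to the reference
    bad = rng.standard_normal(t.size)                 # unrelated noise
    print(oracle_select([good, bad], [ref, ref], eta=5.0))
```

In this toy setup only the first output would be kept as pseudo ground truth. With real two-speaker mixtures, a permutation-invariant pairing step would be needed before thresholding.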