Single-channel dereverberation by feature mapping using cascade neural networks for robust distant speaker identification and speech recognition

EURASIP Journal on Audio, Speech, and Music Processing

Table 6 Experimental results by using position-specific training data

Method	Training data		Speaker identification rate (%)
	Position	Amount	P01	P03	P05	P02	P04	Avg. known	Avg. unknown	Avg. all
Proposed (24 NNs) + CMN	P01	1s.5u	85.8*	86.0	85.9	87.1	85.5	85.8	86.1	86.1
	P01	3s.15u	91.8*	91.8	92.0	93.2	90.0	91.8	91.8	91.8
	P03	1s.5u	84.0	86.8*	85.2	88.1	88.6	86.8	86.5	86.5
	P03	3s.15u	90.3	92.7*	93.5	94.8	92.8	92.7	92.9	92.8
	P05	1s.5u	88.8	91.2	92.7*	93.3	89.7	92.7	90.8	91.1
	P05	3s.15u	90.3	93.0	95.0*	96.2	91.7	95.0	92.8	93.2
Proposed (12 NNs) + CMN	P01	1s.5u	88.8*	88.6	88.6	89.9	90.4	88.8	89.4	89.3
	P01	3s.15u	93.5*	94.3	94.2	95.0	93.3	93.5	94.2	94.1
	P03	1s.5u	87.8	89.6*	89.3	90.7	92.3	89.6	90.0	89.9
	P03	3s.15u	91.5	94.7*	94.3	95.0	94.2	94.7	93.8	93.9
	P05	1s.5u	89.9	92.5	92.4*	94.2	92.7	92.4	92.3	92.3
	P05	3s.15u	91.5	93.2	92.8*	96.5	92.7	92.8	93.5	93.3
Proposed (6 NNs) + CMN	P01	1s.5u	89.5*	88.4	87.9	90.9	91.7	89.5	89.7	89.7
	P01	3s.15u	92.2*	91.7	91.0	94.7	93.3	92.2	92.7	92.6
	P03	1s.5u	89.1	89.7*	88.4	91.0	92.7	89.7	90.3	90.2
	P03	3s.15u	91.5	92.0*	92.0	95.0	93.8	92.0	93.0	92.9
	P05	1s.5u	90.1	91.4	91.7*	94.7	93.4	91.7	92.4	92.3
	P05	3s.15u	92.5	93.2	92.3*	96.3	94.2	92.3	94.0	93.7

The known environments include P01, P03, and P05, while the unknown environments include P02 and P04. The experiments used the first testing scheme and skip1 7-1-0 frame selection. Asterisks (*) mark known positions (matched conditions). Bold text marks the best average performance for each amount of training data.
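
The three average columns can be checked by hand. The sketch below rests on an assumption inferred from the numbers, not stated in the source: within each row, "Avg. known" is the starred (matched) entry, "Avg. unknown" is the mean of the other four test positions, and "Avg. all" is the mean of all five. The function name `row_averages` and the `POSITIONS` list are illustrative only.

```python
# Test positions in the column order used by the table.
POSITIONS = ["P01", "P03", "P05", "P02", "P04"]


def row_averages(rates, trained_position):
    """Return (avg_known, avg_unknown, avg_all) for one table row.

    Assumption (inferred from the tabulated values, not from the source
    text): the position used for training is the only matched condition
    in that row; the other four positions count as unmatched.
    """
    by_position = dict(zip(POSITIONS, rates))
    avg_known = by_position[trained_position]
    others = [r for p, r in by_position.items() if p != trained_position]
    avg_unknown = sum(others) / len(others)
    avg_all = sum(rates) / len(rates)
    return avg_known, avg_unknown, avg_all


# Row: Proposed (24 NNs) + CMN, trained on P03 with 1s.5u.
known, unknown, overall = row_averages([84.0, 86.8, 85.2, 88.1, 88.6], "P03")
print(f"known={known:.1f} unknown={unknown:.3f} all={overall:.3f}")
```

For this row the results round to the tabulated 86.8, 86.5, and 86.5. The other rows spot-checked this way (24 NNs, and 12 NNs at P01 with 1s.5u) agree to one decimal place, which supports the assumed reading without confirming it.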