Speaker adaptation based on regularized speaker-dependent eigenphone matrix estimation

EURASIP Journal on Audio, Speech, and Music Processing

Table 6 Word error rate (%) after unsupervised speaker adaptation on the WSJ task

Methods	Number of adaptation sentences
	2	4	6	8	10	20
EV	13.88	13.82	13.76	13.68	13.64	13.58
	K=100	K=120	K=150	K=150	K=150	K=150
MLLR	14.44	13.86	13.70	13.56	13.43	13.22
SAT + MLLR	13.96	13.41	13.37	13.35	13.26	13.06
ML-EP	16.28	14.24	13.75	13.47	13.41	13.06
SAT + ML-EP	16.80	14.24	13.51	13.17	13.12	12.70
SGL-EP	14.05	13.72	13.52	13.41	13.37	13.00
SAT + SGL-EP	13.92	13.36	13.29	13.11	13.03	12.70

The WER of the SI model is 14.71%. For brevity, only the best result of each adaptation method is shown in the table. For MLLR, the best results were obtained with a prior weighting factor of 10 (for MAP) and 32 regression classes with a three-block-diagonal transformation matrix (for MLLR). For the eigenphone method, the number of eigenphones (N) was fixed at 100. The weighting factors of the SGL regularization method were set to λ₁ = 10 and λ₃ = 30.
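
To read the table against the SI baseline, one option is to express each entry as a relative WER reduction with respect to the SI model's 14.71%. The following is a minimal Python sketch of that arithmetic. The function name `relative_reduction` and the dictionary layout are assumptions made for this illustration and are not part of the original work; the WER values are copied from Table 6.

```python
# Baseline WER (%) of the SI model, taken from the table note.
SI_WER = 14.71

# Numbers of adaptation sentences (column headers of Table 6).
SENTENCES = [2, 4, 6, 8, 10, 20]

# WER (%) per method, one value per column, copied from Table 6.
WER = {
    "EV":           [13.88, 13.82, 13.76, 13.68, 13.64, 13.58],
    "MLLR":         [14.44, 13.86, 13.70, 13.56, 13.43, 13.22],
    "SAT + MLLR":   [13.96, 13.41, 13.37, 13.35, 13.26, 13.06],
    "ML-EP":        [16.28, 14.24, 13.75, 13.47, 13.41, 13.06],
    "SAT + ML-EP":  [16.80, 14.24, 13.51, 13.17, 13.12, 12.70],
    "SGL-EP":       [14.05, 13.72, 13.52, 13.41, 13.37, 13.00],
    "SAT + SGL-EP": [13.92, 13.36, 13.29, 13.11, 13.03, 12.70],
}


def relative_reduction(wer, baseline=SI_WER):
    """Relative WER reduction in percent (hypothetical helper).

    A negative value means the adapted system is worse than the baseline.
    """
    return 100.0 * (baseline - wer) / baseline


if __name__ == "__main__":
    # Print one row per method, one column per number of sentences.
    print("Method\t" + "\t".join(str(n) for n in SENTENCES))
    for method, row in WER.items():
        cells = "\t".join(f"{relative_reduction(w):.2f}" for w in row)
        print(f"{method}\t{cells}")
```

With this convention, entries above the baseline, such as ML-EP with two adaptation sentences, come out negative, which makes degradations relative to the SI model easy to spot.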