
Table 7 Summarization of all systems tested

From: Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

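Spectrogram prediction models

| # | System | Training stage 1 | Training stage 2 | Training stage 3 | Training stage 4 | Training stage 5 | Training stage 6 |
|---|--------|------------------|------------------|------------------|------------------|------------------|------------------|
| 1. | M-MN (baseline) | MN12h | - | - | - | - | - |
| 2. | M SJ-TL | JP | MN30 | - | - | - | - |
| 3. | M SE10-TL | EN10 | MN30 | - | - | - | - |
| 4. | M SE24-TL | EN24 | MN30 | - | - | - | - |
| 5. | M SEJ-TL | EN24 | JP | MN30 | - | - | - |
| 6. | M S-DA | AD 30 + MN30 | - | - | - | - | - |
| 7. | M SEJ-TL-DA | EN24 | JP | AD 30 | MN30 | - | - |
| 8. | M SEJ-TL-DA D | EN24 | JP | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 |
| 9. | M MJ-TL | JP + MN30 | MN30 | - | - | - | - |
| 10. | M ME10-TL | EN10 + MN30 | MN30 | - | - | - | - |
| 11. | M ME24-TL | EN24 + MN30 | MN30 | - | - | - | - |
| 12. | M MEJ-TL | EN24 + JP + MN30 | MN30 | - | - | - | - |
| 13. | M M-DA | AD 30 + MN30 | - | - | - | - | - |
| 14. | M MEJ-TL-DA | EN24 + JP + AD 30 + MN30 | MN30 | - | - | - | - |
| 15. | M MEJ-TL-DA D | EN24 + JP + AD 30 + MN30 | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 | - |
| 16. | M MEJ-TL-DA 1hour | EN24 + JP + AD 1h + MN1h | MN1h | - | - | - | - |
| 17. | M MEJ-TL-DA 2hours | EN24 + JP + AD 2h + MN2h | MN2h | - | - | - | - |
| 18. | M MEJ-TL-DA 3hours | EN24 + JP + AD 3h + MN3h | MN3h | - | - | - | - |

Neural vocoders

| # | System | Training stage 1 | Training stage 2 | Training stage 3 | Training stage 4 | Training stage 5 | Training stage 6 |
|---|--------|------------------|------------------|------------------|------------------|------------------|------------------|
| 19. | NV-MN (baseline) | MN12h | - | - | - | - | - |
| 20. | NV-DA | AD 30 + MN30 | - | - | - | - | - |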

Model type

M SXXX = Single-speaker TTS model

M MXXX = Multi-speaker TTS model

NV = neural vocoder

Method used for model training

TL = Cross-lingual transfer learning

DA = Data augmentation

TL-DA = Cross-lingual transfer learning and data augmentation

TL-DA D = Cross-lingual transfer learning and data augmentation with additional fine-tuning

Databases used for training stages

EN10 = 10 hours of the English dataset

EN24 = 24 hours of the English dataset

JP = 10 hours of the Japanese dataset

MN12h = 12 hours of the target language dataset

MN30 = 30 minutes of the target language data

MN1h = 1 hour of the target language data

MN2h = 2 hours of the target language data

MN3h = 3 hours of the target language data

AD 30 = augmented data generated from 30 minutes of the target language data

AD 30-set1 = the first set of the augmented data generated from 30 minutes of the target language data

AD 30-set2 = the second set of the augmented data generated from 30 minutes of the target language data

AD 30-set3 = the third set of the augmented data generated from 30 minutes of the target language data

AD 1h = augmented data generated from 1 hour of the target language data

AD 2h = augmented data generated from 2 hours of the target language data

AD 3h = augmented data generated from 3 hours of the target language data
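
The staged schedules in the table can be read as ordered lists of dataset combinations. The following is a minimal Python sketch of that reading, not the authors' training code: the names `DATASET_HOURS`, `parse_stage`, `parse_row`, and `recorded_hours` are hypothetical helpers introduced only for illustration. Dataset durations come from the legend above; the legend does not state how much augmented (AD) data is generated, so AD entries are excluded from the hour count rather than guessed.

```python
from typing import Dict, List

# Hours of recorded speech per dataset key, as given in the table legend.
# Augmented (AD ...) sets are omitted on purpose: their size is not stated.
DATASET_HOURS: Dict[str, float] = {
    "EN10": 10.0, "EN24": 24.0, "JP": 10.0,
    "MN12h": 12.0, "MN30": 0.5, "MN1h": 1.0, "MN2h": 2.0, "MN3h": 3.0,
}


def parse_stage(cell: str) -> List[str]:
    """Split a table cell such as 'EN24 + JP + MN30' into dataset keys.

    A cell containing only '-' means the system has no such stage.
    """
    cell = cell.strip()
    if not cell or cell == "-":
        return []
    return [part.strip() for part in cell.split("+")]


def parse_row(cells: List[str]) -> List[List[str]]:
    """Turn the six stage cells of one table row into an ordered schedule."""
    stages = [parse_stage(c) for c in cells]
    return [s for s in stages if s]


def recorded_hours(stage: List[str]) -> float:
    """Sum the recorded (non-augmented) hours used in one training stage."""
    return sum(DATASET_HOURS[k] for k in stage if not k.startswith("AD"))


if __name__ == "__main__":
    # Row 14 of the table (M MEJ-TL-DA).
    row = ["EN24 + JP + AD 30 + MN30", "MN30", "-", "-", "-", "-"]
    for i, stage in enumerate(parse_row(row), start=1):
        print(f"stage {i}: {stage}, recorded hours = {recorded_hours(stage)}")
```

Under this reading, a single-speaker system such as M SEJ-TL is a chain of one-dataset stages, while the multi-speaker systems pool several datasets in stage 1 and then fine-tune on the target-language data alone.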