# | System | Training stage 1 | Training stage 2 | Training stage 3 | Training stage 4 | Training stage 5 | Training stage 6 |
---|---|---|---|---|---|---|---|
Spectrogram prediction models | | | | | | | |
1. | M-MN (baseline) | MN12h | - | - | - | - | - |
2. | M SJ-TL | JP | MN30 | - | - | - | - |
3. | M SE10-TL | EN10 | MN30 | - | - | - | - |
4. | M SE24-TL | EN24 | MN30 | - | - | - | - |
5. | M SEJ-TL | EN24 | JP | MN30 | - | - | - |
6. | M S-DA | AD 30 + MN30 | - | - | - | - | - |
7. | M SEJ-TL-DA | EN24 | JP | AD 30 | MN30 | - | - |
8. | M SEJ-TL-DA D | EN24 | JP | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 |
9. | M MJ-TL | JP + MN30 | MN30 | - | - | - | - |
10. | M ME10-TL | EN10 + MN30 | MN30 | - | - | - | - |
11. | M ME24-TL | EN24 + MN30 | MN30 | - | - | - | - |
12. | M MEJ-TL | EN24 + JP + MN30 | MN30 | - | - | - | - |
13. | M M-DA | AD 30 + MN30 | - | - | - | - | - |
14. | M MEJ-TL-DA | EN24 + JP + AD 30 + MN30 | MN30 | - | - | - | - |
15. | M MEJ-TL-DA D | EN24 + JP + AD 30 + MN30 | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 | - |
16. | M MEJ-TL-DA 1hour | EN24 + JP + AD 1h + MN1h | MN1h | - | - | - | - |
17. | M MEJ-TL-DA 2hours | EN24 + JP + AD 2h + MN2h | MN2h | - | - | - | - |
18. | M MEJ-TL-DA 3hours | EN24 + JP + AD 3h + MN3h | MN3h | - | - | - | - |
Neural vocoders | | | | | | | |
19. | NV-MN (baseline) | MN12h | - | - | - | - | - |
20. | NV-DA | AD 30 + MN30 | - | - | - | - | - |

Model type:
- M SXXX = Single-speaker TTS model
- M MXXX = Multi-speaker TTS model
- NV = neural vocoder

Method used for model training:
- TL = Cross-lingual transfer learning
- DA = Data augmentation
- TL-DA = Cross-lingual transfer learning and data augmentation
- TL-DA D = Cross-lingual transfer learning and data augmentation with additional fine-tuning

Databases used for training stages:
- EN10 = 10 hours of the English dataset
- EN24 = 24 hours of the English dataset
- JP = 10 hours of the Japanese dataset
- MN12h = 12 hours of the target language dataset
- MN30 = 30 minutes of the target language data
- MN1h = 1 hour of the target language data
- MN2h = 2 hours of the target language data
- MN3h = 3 hours of the target language data
- AD 30 = augmented data generated from 30 minutes of the target language data
- AD 30-set1 = the first set of the augmented data generated from 30 minutes of the target language data
- AD 30-set2 = the second set of the augmented data generated from 30 minutes of the target language data
- AD 30-set3 = the third set of the augmented data generated from 30 minutes of the target language data
- AD 1h = augmented data generated from 1 hour of the target language data
- AD 2h = augmented data generated from 2 hours of the target language data
- AD 3h = augmented data generated from 3 hours of the target language data
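
To make the staged schedules easier to read programmatically, the rows of the table can be encoded as data. The following is a minimal sketch rather than the authors' tooling: the names `DATASETS`, `Schedule`, `parse_row`, and `describe` are hypothetical, and only a few rows are transcribed. Each cell is split on `+` into dataset keys, and `-` is treated as an unused stage.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Dataset abbreviations copied from the table legend (subset used below).
DATASETS = {
    "EN24": "24 hours of the English dataset",
    "JP": "10 hours of the Japanese dataset",
    "MN12h": "12 hours of the target language dataset",
    "MN30": "30 minutes of the target language data",
    "AD 30": "augmented data generated from 30 minutes of the target language data",
    "AD 30-set1": "first set of augmented data from 30 minutes of target data",
    "AD 30-set2": "second set of augmented data from 30 minutes of target data",
    "AD 30-set3": "third set of augmented data from 30 minutes of target data",
}


@dataclass(frozen=True)
class Schedule:
    """One table row: a system and the datasets used at each training stage."""
    number: int
    system: str
    stages: Tuple[Tuple[str, ...], ...]


def parse_row(number: int, system: str, *cells: str) -> Schedule:
    """Convert table cells such as 'EN24 + JP + MN30' into tuples of dataset keys."""
    stages = []
    for cell in cells:
        if cell.strip() == "-":  # '-' marks a stage that is not used
            continue
        keys = tuple(part.strip() for part in cell.split("+"))
        unknown = [key for key in keys if key not in DATASETS]
        if unknown:
            raise KeyError(f"unknown dataset abbreviation(s): {unknown}")
        stages.append(keys)
    return Schedule(number, system, tuple(stages))


def describe(schedule: Schedule) -> List[str]:
    """Return one human-readable line per training stage."""
    return [
        f"stage {index}: " + " + ".join(keys)
        for index, keys in enumerate(schedule.stages, start=1)
    ]


# A few rows transcribed from the table above.
SCHEDULES = [
    parse_row(1, "M-MN (baseline)", "MN12h", "-", "-", "-", "-", "-"),
    parse_row(5, "M SEJ-TL", "EN24", "JP", "MN30", "-", "-", "-"),
    parse_row(8, "M SEJ-TL-DA D", "EN24", "JP", "AD 30-set1",
              "AD 30-set2", "AD 30-set3", "MN30"),
    parse_row(14, "M MEJ-TL-DA", "EN24 + JP + AD 30 + MN30", "MN30",
              "-", "-", "-", "-"),
]

if __name__ == "__main__":
    for schedule in SCHEDULES:
        print(f"{schedule.number}. {schedule.system}")
        for line in describe(schedule):
            print("   " + line)
```

In this encoding, a single-speaker system such as row 5 appears as a chain of one-dataset stages, while a multi-speaker system such as row 14 has a first stage that pools several datasets before fine-tuning on MN30 alone.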