Table 7 Summary of all systems tested

From: Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

| # | System | Training stage 1 | Training stage 2 | Training stage 3 | Training stage 4 | Training stage 5 | Training stage 6 |
|---|---|---|---|---|---|---|---|
| | Spectrogram prediction models | | | | | | |
| 1. | M-MN (baseline) | MN12h | - | - | - | - | - |
| 2. | M SJ-TL | JP | MN30 | - | - | - | - |
| 3. | M SE10-TL | EN10 | MN30 | - | - | - | - |
| 4. | M SE24-TL | EN24 | MN30 | - | - | - | - |
| 5. | M SEJ-TL | EN24 | JP | MN30 | - | - | - |
| 6. | M S-DA | AD 30 + MN30 | - | - | - | - | - |
| 7. | M SEJ-TL-DA | EN24 | JP | AD 30 | MN30 | - | - |
| 8. | M SEJ-TL-DA D | EN24 | JP | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 |
| 9. | M MJ-TL | JP + MN30 | MN30 | - | - | - | - |
| 10. | M ME10-TL | EN10 + MN30 | MN30 | - | - | - | - |
| 11. | M ME24-TL | EN24 + MN30 | MN30 | - | - | - | - |
| 12. | M MEJ-TL | EN24 + JP + MN30 | MN30 | - | - | - | - |
| 13. | M M-DA | AD 30 + MN30 | - | - | - | - | - |
| 14. | M MEJ-TL-DA | EN24 + JP + AD 30 + MN30 | MN30 | - | - | - | - |
| 15. | M MEJ-TL-DA D | EN24 + JP + AD 30 + MN30 | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 | - |
| 16. | M MEJ-TL-DA 1hour | EN24 + JP + AD 1h + MN1h | MN1h | - | - | - | - |
| 17. | M MEJ-TL-DA 2hours | EN24 + JP + AD 2h + MN2h | MN2h | - | - | - | - |
| 18. | M MEJ-TL-DA 3hours | EN24 + JP + AD 3h + MN3h | MN3h | - | - | - | - |
| | Neural vocoders | | | | | | |
| 19. | NV-MN (baseline) | MN12h | - | - | - | - | - |
| 20. | NV-DA | AD 30 + MN30 | - | - | - | - | - |
Model type        
M SXXX = Single-speaker TTS model
M MXXX = Multi-speaker TTS model
NV = neural vocoder
Method used for model training        
TL = Cross-lingual transfer learning
DA = Data augmentation
TL-DA = Cross-lingual transfer learning and data augmentation
TL-DA D = Cross-lingual transfer learning and data augmentation with additional fine-tuning
Databases used for training stages        
EN10 = 10 hours of the English dataset
EN24 = 24 hours of the English dataset
JP = 10 hours of the Japanese dataset
MN12h = 12 hours of the target language dataset
MN30 = 30 minutes of the target language data
MN1h = 1 hour of the target language data
MN2h = 2 hours of the target language data
MN3h = 3 hours of the target language data
AD 30 = augmented data generated from 30 minutes of the target language data
AD 30-set1 = the first set of the augmented data generated from 30 minutes of the target language data
AD 30-set2 = the second set of the augmented data generated from 30 minutes of the target language data
AD 30-set3 = the third set of the augmented data generated from 30 minutes of the target language data
AD 1h = augmented data generated from 1 hour of the target language data
AD 2h = augmented data generated from 2 hours of the target language data
AD 3h = augmented data generated from 3 hours of the target language data
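
Each row of Table 7 can be read as an ordered training schedule: one dataset, or a combination joined by "+", per stage. The following is a minimal Python sketch of that reading, not the authors' code. The names `LEGEND`, `SYSTEMS`, and `describe` are hypothetical, and only a few rows and abbreviations from the table are encoded for illustration.

```python
# Illustrative encoding of a few rows of Table 7.
# LEGEND, SYSTEMS, and describe() are hypothetical names, not from the paper.

# Dataset abbreviations, copied from the table legend.
LEGEND = {
    "EN24": "24 hours of the English dataset",
    "JP": "10 hours of the Japanese dataset",
    "MN12h": "12 hours of the target language dataset",
    "MN30": "30 minutes of the target language data",
    "AD 30": "augmented data generated from 30 minutes of the target language data",
}

# Each system maps to an ordered list of training stages; each stage is the
# list of datasets combined in that stage ("+" in the table).
SYSTEMS = {
    "M-MN (baseline)": [["MN12h"]],
    "M SJ-TL": [["JP"], ["MN30"]],
    "M SEJ-TL": [["EN24"], ["JP"], ["MN30"]],
    "M SEJ-TL-DA": [["EN24"], ["JP"], ["AD 30"], ["MN30"]],
    "M MEJ-TL-DA": [["EN24", "JP", "AD 30", "MN30"], ["MN30"]],
}


def describe(system):
    """Return one human-readable line per training stage of `system`."""
    lines = []
    for index, stage in enumerate(SYSTEMS[system], start=1):
        parts = [f"{name} ({LEGEND[name]})" for name in stage]
        lines.append(f"Stage {index}: " + " + ".join(parts))
    return lines


if __name__ == "__main__":
    for line in describe("M SEJ-TL"):
        print(line)
```

In this reading, the single-speaker systems (M S...) move through the source-language datasets one stage at a time, whereas the multi-speaker systems (M M...) pool them with the target data in stage 1 and then fine-tune on the target data alone.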