Spectrogram prediction models

| No. | Model | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
|---|---|---|---|---|---|---|---|
| 1. | M-MN (baseline) | MN12h | - | - | - | - | - |
| 2. | M SJ-TL | JP | MN30 | - | - | - | - |
| 3. | M SE10-TL | EN10 | MN30 | - | - | - | - |
| 4. | M SE24-TL | EN24 | MN30 | - | - | - | - |
| 5. | M SEJ-TL | EN24 | JP | MN30 | - | - | - |
| 6. | M S-DA | AD 30 + MN30 | - | - | - | - | - |
| 7. | M SEJ-TL-DA | EN24 | JP | AD 30 | MN30 | - | - |
| 8. | M SEJ-TL-DA D | EN24 | JP | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 |
| 9. | M MJ-TL | JP + MN30 | MN30 | - | - | - | - |
| 10. | M ME10-TL | EN10 + MN30 | MN30 | - | - | - | - |
| 11. | M ME24-TL | EN24 + MN30 | MN30 | - | - | - | - |
| 12. | M MEJ-TL | EN24 + JP + MN30 | MN30 | - | - | - | - |
| 13. | M M-DA | AD 30 + MN30 | - | - | - | - | - |
| 14. | M MEJ-TL-DA | EN24 + JP + AD 30 + MN30 | MN30 | - | - | - | - |
| 15. | M MEJ-TL-DA D | EN24 + JP + AD 30 + MN30 | AD 30-set1 | AD 30-set2 | AD 30-set3 | MN30 | - |
| 16. | M MEJ-TL-DA 1hour | EN24 + JP + AD 1h + MN1h | MN1h | - | - | - | - |
| 17. | M MEJ-TL-DA 2hours | EN24 + JP + AD 2h + MN2h | MN2h | - | - | - | - |
| 18. | M MEJ-TL-DA 3hours | EN24 + JP + AD 3h + MN3h | MN3h | - | - | - | - |

Neural vocoders

| No. | Model | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Stage 6 |
|---|---|---|---|---|---|---|---|
| 19. | NV-MN (baseline) | MN12h | - | - | - | - | - |
| 20. | NV-DA | AD 30 + MN30 | - | - | - | - | - |

Model type:
M SXXX = single-speaker TTS model
M MXXX = multi-speaker TTS model
NV = neural vocoder

Method used for model training:
TL = cross-lingual transfer learning
DA = data augmentation
TL-DA = cross-lingual transfer learning and data augmentation
TL-DA D = cross-lingual transfer learning and data augmentation with additional fine-tuning

Databases used for training stages:
EN10 = 10 hours of the English dataset
EN24 = 24 hours of the English dataset
JP = 10 hours of the Japanese dataset
MN12h = 12 hours of the target language dataset
MN30 = 30 minutes of the target language data
MN1h = 1 hour of the target language data
MN2h = 2 hours of the target language data
MN3h = 3 hours of the target language data
AD 30 = augmented data generated from 30 minutes of the target language data
AD 30-set1 = the first set of the augmented data generated from 30 minutes of the target language data
AD 30-set2 = the second set of the augmented data generated from 30 minutes of the target language data
AD 30-set3 = the third set of the augmented data generated from 30 minutes of the target language data
AD 1h = augmented data generated from 1 hour of the target language data
AD 2h = augmented data generated from 2 hours of the target language data
AD 3h = augmented data generated from 3 hours of the target language data
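
To make the notation concrete, the following is a minimal sketch of how one row of the table could be read as an ordered list of training stages. It assumes that the stage cells are trained in order from left to right, that "+" means the listed datasets are combined within a single stage, and that "-" marks an unused stage. The names `DATASET_HOURS`, `parse_stage`, `parse_row`, and `natural_hours` are hypothetical helpers introduced for illustration and are not taken from the source. The augmented sets (AD ...) have no stated duration in the legend, so they contribute nothing to the hour totals.

```python
# Hours of natural (non-augmented) speech per dataset, as stated in the legend.
# Augmented sets (AD ...) are omitted because their duration is not given.
DATASET_HOURS = {
    "EN10": 10.0,
    "EN24": 24.0,
    "JP": 10.0,
    "MN12h": 12.0,
    "MN30": 0.5,
    "MN1h": 1.0,
    "MN2h": 2.0,
    "MN3h": 3.0,
}


def parse_stage(cell):
    """Split one stage cell such as 'EN24 + JP + MN30' into dataset names."""
    cell = cell.strip()
    if cell in ("", "-"):
        return []  # assumption: '-' marks an unused stage
    return [name.strip() for name in cell.split("+")]


def parse_row(cells):
    """Turn the stage cells of one row into a list of non-empty stages."""
    stages = [parse_stage(cell) for cell in cells]
    return [stage for stage in stages if stage]


def natural_hours(stage):
    """Sum the stated hours of the non-augmented datasets in one stage."""
    return sum(DATASET_HOURS.get(name, 0.0) for name in stage)


# Stage cells of row 15 (M MEJ-TL-DA D), copied from the table.
row15 = [
    "EN24 + JP + AD 30 + MN30",
    "AD 30-set1",
    "AD 30-set2",
    "AD 30-set3",
    "MN30",
    "-",
]

stages = parse_row(row15)
for index, stage in enumerate(stages, start=1):
    print(index, stage, natural_hours(stage))
```

Under these assumptions, row 15 yields five stages: one combined stage over the English, Japanese, augmented, and target-language data, three stages on the augmented sets, and a final stage on the 30 minutes of target-language data.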