Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Hyper-parameters and network architectures

Feature extraction
Sampling rate	22,050 Hz
Window size	46.4 ms (1,024 pt)
Shift size	11.6 ms (256 pt)
Acoustic feature	log-mel spectrogram 80 dim
Encoder
# Phoneme embedding dimension	512
# CNN layers	3
# CNN filters	512
CNN filter size	5
# BLSTM layers	1
# BLSTM units	512
Decoder
# LSTM layers	2
# LSTM units	1024
# Prenet layers	2
# Prenet units	256
# Postnet layers	5
# Postnet filters	512
Postnet filter size	5
# Speaker embedding dimension	512
Attention
# Dimensions in attention	128
# Filters in attention	32
Filter size in attention	31
Sigma in guided attention loss	0.4
Reduction factor (r)	1 (M_S) / 2 (M_M)
Optimization and minibatch
Dropout rate	0.5
Zoneout rate	0.1
Learning rate	0.001
Optimization method	Adam with β₁ = 0.9, β₂ = 0.999, ε = 10^-6
# Epochs	300 / 500 / 1000
Batch size	32 / 64
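
The window and shift durations in Table 1 follow directly from the point counts and the sampling rate, and the reduction factor r determines how many decoder steps are needed for a given number of frames. The following is a minimal sketch of that arithmetic, not the authors' implementation. The function names are hypothetical, and the frame count assumes one frame per shift with the final partial frame padded, which the table does not specify.

```python
import math

# Values taken from Table 1 (feature extraction).
SAMPLING_RATE_HZ = 22050
WINDOW_POINTS = 1024
SHIFT_POINTS = 256


def points_to_ms(points, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert a length in samples (points) to milliseconds."""
    return 1000.0 * points / sampling_rate_hz


def num_frames(num_samples, shift_points=SHIFT_POINTS):
    """Frame count for a waveform of num_samples samples.

    Assumption (not stated in the table): one frame per shift,
    with the final partial frame padded.
    """
    return math.ceil(num_samples / shift_points)


def decoder_steps(n_frames, reduction_factor):
    """Decoder steps when each step emits `reduction_factor` frames."""
    return math.ceil(n_frames / reduction_factor)


if __name__ == "__main__":
    print(f"window: {points_to_ms(WINDOW_POINTS):.1f} ms")
    print(f"shift:  {points_to_ms(SHIFT_POINTS):.1f} ms")
    frames = num_frames(SAMPLING_RATE_HZ)  # one second of audio
    print(f"frames per second of audio: {frames}")
    print(f"decoder steps, r = 1: {decoder_steps(frames, 1)}")
    print(f"decoder steps, r = 2: {decoder_steps(frames, 2)}")
```

Under these assumptions, r = 2 roughly halves the number of decoder steps relative to r = 1 for the same utterance.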