Table 1 Hyper-parameters and network architectures

From: Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Feature extraction
  Sampling rate                      22,050 Hz
  Window size                        46.4 ms (1,024 pt)
  Shift size                         11.6 ms (256 pt)
  Acoustic feature                   log-mel spectrogram, 80 dim

Encoder
  Phoneme embedding dimension        512
  # CNN layers                       3
  # CNN filters                      512
  CNN filter size                    5
  # BLSTM layers                     1
  # BLSTM units                      512

Decoder
  # LSTM layers                      2
  # LSTM units                       1,024
  # Prenet layers                    2
  # Prenet units                     256
  # Postnet layers                   5
  # Postnet filters                  512
  Postnet filter size                5
  Speaker embedding dimension        512

Attention
  # Dimensions in attention          128
  # Filters in attention             32
  Filter size in attention           31
  Sigma in guided attention loss     0.4
  Reduction factor (r)               1 (M S) / 2 (M M)

Optimization and minibatch
  Dropout rate                       0.5
  Zoneout rate                       0.1
  Learning rate                      0.001
  Optimization method                Adam with β1 = 0.9, β2 = 0.999, ε = 10⁻⁶
  # Epochs                           300 / 500 / 1000
  Batch size                         32 / 64
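The feature-extraction row of the table fully specifies the front end: 22,050 Hz audio, a 1,024-point (46.4 ms) analysis window, a 256-point (11.6 ms) shift, and an 80-dimensional log-mel spectrogram. The following is a minimal NumPy sketch of that pipeline using those exact parameters; it is not the authors' code, and details such as the mel filterbank construction, window function, and log base are assumptions that may differ from the paper's implementation.

```python
import numpy as np

SR = 22050        # sampling rate (Hz), from the table
N_FFT = 1024      # 46.4 ms window
HOP = 256         # 11.6 ms shift
N_MELS = 80       # log-mel dimension

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    # Triangular filters spaced evenly on the mel scale (assumed design).
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        if center > left:
            fb[i, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fb[i, center:right] = (right - np.arange(center, right)) / (right - center)
    return fb

def log_mel_spectrogram(y):
    # Frame the waveform with a Hann window, take the magnitude STFT,
    # project onto 80 mel bands, and take the (floored) natural log.
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(y) - N_FFT) // HOP
    frames = np.stack([y[i * HOP:i * HOP + N_FFT] * window for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft // 2 + 1)
    mel = spec @ mel_filterbank().T              # (n_frames, 80)
    return np.log(np.maximum(mel, 1e-10))
```

For a one-second input, this yields 1 + (22050 − 1024) // 256 = 83 frames of 80-dimensional features, i.e. one frame every 11.6 ms as the table states.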
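The table lists sigma = 0.4 for the guided attention loss. The standard formulation (introduced by Tachibana et al. for DCTTS, and presumably what is meant here) penalizes attention weights far from the diagonal alignment between text position n/N and frame position t/T. A NumPy sketch under that assumption:

```python
import numpy as np

SIGMA = 0.4  # sigma in guided attention loss, from the table

def guided_attention_weight(n_text, n_frames, sigma=SIGMA):
    # W[t, n] = 1 - exp(-(n/N - t/T)^2 / (2 * sigma^2)):
    # near zero on the diagonal, approaching 1 far from it.
    n = np.arange(n_text) / n_text
    t = np.arange(n_frames) / n_frames
    return 1.0 - np.exp(-((n[None, :] - t[:, None]) ** 2) / (2.0 * sigma ** 2))

def guided_attention_loss(attention, sigma=SIGMA):
    # attention: (n_frames, n_text) soft alignment from the attention module.
    # The loss is the mean of the attention mass weighted by W, so
    # off-diagonal (non-monotonic) alignments are penalized.
    n_frames, n_text = attention.shape
    return float(np.mean(attention * guided_attention_weight(n_text, n_frames, sigma)))
```

A roughly diagonal alignment (e.g. an identity matrix) incurs near-zero loss, while an anti-diagonal one is heavily penalized, which is what pushes the model toward monotonic text-to-speech alignments during training.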