
Table 1 Hyper-parameters and network architectures

From: Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Feature extraction
  Sampling rate                     22,050 Hz
  Window size                       46.4 ms (1,024 pt)
  Shift size                        11.6 ms (256 pt)
  Acoustic feature                  80-dim log-mel spectrogram

Encoder
  # Phoneme embedding dimensions    512
  # CNN layers                      3
  # CNN filters                     512
  CNN filter size                   5
  # BLSTM layers                    1
  # BLSTM units                     512

Decoder
  # LSTM layers                     2
  # LSTM units                      1024
  # Prenet layers                   2
  # Prenet units                    256
  # Postnet layers                  5
  # Postnet filters                 512
  Postnet filter size               5
  # Speaker embedding dimensions    512

Attention
  # Dimensions in attention         128
  # Filters in attention            32
  Filter size in attention          31
  Sigma in guided attention loss    0.4
  Reduction factor (r)              1 (M S) / 2 (M M)

Optimization and minibatch
  Dropout rate                      0.5
  Zoneout rate                      0.1
  Learning rate                     0.001
  Optimization method               Adam with β1 = 0.9, β2 = 0.999, ε = 10⁻⁶
  # Epochs                          300 / 500 / 1000
  Batch size                        32 / 64
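The feature-extraction row of the table can be sanity-checked numerically: the following Python sketch (constant and function names are illustrative, not from the paper's code) confirms that the window and shift sizes in samples match the stated durations at 22,050 Hz.

```python
# Hyper-parameters transcribed from the table above; names are illustrative.
SAMPLING_RATE = 22050   # Hz
WIN_LENGTH = 1024       # samples (stated as 46.4 ms)
HOP_LENGTH = 256        # samples (stated as 11.6 ms)
N_MELS = 80             # log-mel spectrogram dimension

def samples_to_ms(n_samples: int, sr: int = SAMPLING_RATE) -> float:
    """Convert a length in samples to milliseconds at the given rate."""
    return 1000.0 * n_samples / sr

# Check that the sample counts agree with the durations in the table.
assert round(samples_to_ms(WIN_LENGTH), 1) == 46.4
assert round(samples_to_ms(HOP_LENGTH), 1) == 11.6
```

At this sampling rate the hop of 256 samples gives roughly 86 frames per second of audio, which is the temporal resolution of the 80-dim log-mel spectrogram the model predicts.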