Feature extraction | |
Sampling rate | 22,050 Hz |
Window size | 46.4 ms (1,024 pt) |
Shift size | 11.6 ms (256 pt) |
Acoustic feature | log-mel spectrogram 80 dim |
Encoder | |
# phoneme embedding dimension | 512 |
# CNN layers | 3 |
# CNN filters | 512 |
CNN filter size | 5 |
# BLSTM layers | 1 |
# BLSTM units | 512 |
Decoder | |
# LSTM layers | 2 |
# LSTM units | 1024 |
# Prenet layers | 2 |
# Prenet units | 256 |
# Postnet layers | 5 |
# Postnet filters | 512 |
Postnet filter size | 5 |
# Speaker embedding dimension | 512 |
Attention | |
# Dimensions in attention | 128 |
# Filters in attention | 32 |
Filter size in attention | 31 |
Sigma in guided attention loss | 0.4 |
Reduction factor (r) | 1 (M S) / 2 (M M) |
Optimization and minibatch | |
Dropout rate | 0.5 |
Zoneout rate | 0.1 |
Learning rate | 0.001 |
Optimization method | Adam with β1 = 0.9, β2 = 0.999, ε = 10^-6 |
# Epochs | 300 / 500 / 1000 |
Batch size | 32 / 64 |
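
As a quick sanity check on the feature-extraction rows, the following minimal sketch recomputes the window and shift durations from the sampling rate and shows how the reduction factor r changes the number of decoder steps. The function names are hypothetical, and the frame-count convention (one extra frame, as with center padding) is an assumption that the table does not state.

```python
# Values copied from the table; helper names are illustrative only.
SAMPLE_RATE_HZ = 22_050
WINDOW_POINTS = 1_024
SHIFT_POINTS = 256


def points_to_ms(points: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Convert a length in samples to milliseconds."""
    return 1000.0 * points / sample_rate_hz


def num_frames(num_samples: int, shift_points: int = SHIFT_POINTS) -> int:
    """Frame count, ASSUMING a center-padded analysis (not stated in the table)."""
    return 1 + num_samples // shift_points


def decoder_steps(frames: int, r: int) -> int:
    """Decoder steps when r frames are emitted per step (ceiling division)."""
    return -(-frames // r)


window_ms = points_to_ms(WINDOW_POINTS)
shift_ms = points_to_ms(SHIFT_POINTS)
frames_per_second = num_frames(SAMPLE_RATE_HZ)

print(f"window: {window_ms:.1f} ms, shift: {shift_ms:.1f} ms")
print(f"frames for 1 s of audio: {frames_per_second}")
print(f"decoder steps with r=1: {decoder_steps(frames_per_second, 1)}")
print(f"decoder steps with r=2: {decoder_steps(frames_per_second, 2)}")
```

Under these assumptions, r = 2 roughly halves the number of decoder steps per utterance, which is the usual motivation for a reduction factor above 1.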