From: Three-stage training and orthogonality regularization for spoken language recognition
Conformer encoder | |
---|---|
Number of blocks | 12 |
Linear dimensionality | 2048 |
Output size | 256 |
Number of attention heads | 4 |
Dropout rate | 0.1 |
Type of activation | Swish |
Type of the positional encoding layer | Relative |
Transformer decoder | |
Linear dimensionality | 2048 |
Number of blocks | 6 |
Number of attention heads | 4 |
ASR training | |
CTC weight | 0.3 |
Label smoothing | 0.1 |
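
As a minimal sketch, the hyperparameters in the table can be collected into plain configuration objects. The class and field names below are hypothetical and not tied to any particular toolkit; only the values come from the table. The `joint_loss` helper assumes that "CTC weight" refers to the usual hybrid CTC/attention interpolation, which the table itself does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConformerEncoderConfig:
    # Values copied from the "Conformer encoder" rows; field names are illustrative.
    num_blocks: int = 12
    linear_units: int = 2048
    output_size: int = 256
    attention_heads: int = 4
    dropout_rate: float = 0.1
    activation: str = "swish"
    positional_encoding: str = "relative"


@dataclass(frozen=True)
class TransformerDecoderConfig:
    # Values copied from the "Transformer decoder" rows.
    linear_units: int = 2048
    num_blocks: int = 6
    attention_heads: int = 4


@dataclass(frozen=True)
class ASRTrainingConfig:
    # Values copied from the "ASR training" rows.
    ctc_weight: float = 0.3
    label_smoothing: float = 0.1


def joint_loss(ctc_loss: float, attention_loss: float, ctc_weight: float) -> float:
    """Interpolate the two losses (assumed hybrid CTC/attention objective)."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


encoder = ConformerEncoderConfig()
decoder = TransformerDecoderConfig()
training = ASRTrainingConfig()

# Per-head dimensionality implied by the table: output size / attention heads.
head_dim = encoder.output_size // encoder.attention_heads
```

Under this assumption, a CTC weight of 0.3 leaves a weight of 0.7 on the attention-decoder loss, and each of the 4 encoder attention heads operates on 256 / 4 = 64 dimensions.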