Table 3 Detailed architecture of the CNN-RNN model used as the ASR module

From: Accent modification for speech recognition of non-native speakers using neural style transfer

| Layer | Output shape | Parameters |
| --- | --- | --- |
| InputLayer | B, T, 161 | 0 |
| Conv1D (F=250, K=3) | B, T, 250 | 443000 |
| Conv1D (F=250, K=3) | B, T, 250 | 443000 |
| Maxpool (P=2) | B, T, 250 | 940 |
| Conv1D (F=150, K=3) | B, T, 150 | 265800 |
| Conv1D (F=150, K=3) | B, T, 150 | 265800 |
| Maxpool (P=2) | B, T, 150 | 600 |
| Conv1D (F=100, K=3) | B, T, 100 | 177200 |
| Conv1D (F=100, K=3) | B, T, 100 | 177200 |
| Maxpool (P=2) | B, T, 100 | 400 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Bidirectional (U=200) | B, T, 400 | 505200 |
| BatchNormalization | B, T, 400 | 1600 |
| TimeDistributed | B, T, 29 | 11629 |
| Dropout | B, T, 29 | 0 |
| TimeDistributed | B, T, 29 | 870 |
| SoftmaxActivation | B, T, 29 | 0 |
  Total params: 2,576,759
  Trainable params: 2,576,759
  Non-trainable params: 0
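
The parameter column can be sanity-checked with a little arithmetic. The sketch below is not taken from the source: it copies the counts from the table, sums them, and compares a few rows against the standard parameter formulas for dense, batch-normalization, and 1-D convolution layers. The helper names (`dense_params`, `batchnorm_params`, `conv1d_params`) are hypothetical, and reading the two TimeDistributed rows as wrapped dense layers is an assumption.

```python
# Parameter counts copied from the table, in row order.
TABLE_ROWS = [
    ("InputLayer", 0),
    ("Conv1D (F=250, K=3)", 443000),
    ("Conv1D (F=250, K=3)", 443000),
    ("Maxpool (P=2)", 940),
    ("Conv1D (F=150, K=3)", 265800),
    ("Conv1D (F=150, K=3)", 265800),
    ("Maxpool (P=2)", 600),
    ("Conv1D (F=100, K=3)", 177200),
    ("Conv1D (F=100, K=3)", 177200),
    ("Maxpool (P=2)", 400),
    ("Conv1D (F=80, K=3)", 141760),
    ("Conv1D (F=80, K=3)", 141760),
    ("Bidirectional (U=200)", 505200),
    ("BatchNormalization", 1600),
    ("TimeDistributed", 11629),
    ("Dropout", 0),
    ("TimeDistributed", 870),
    ("SoftmaxActivation", 0),
]


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features: int) -> int:
    """Scale, shift, moving mean, and moving variance per feature."""
    return 4 * n_features


def conv1d_params(n_in: int, filters: int, kernel: int) -> int:
    """Kernel weights plus one bias per filter."""
    return kernel * n_in * filters + filters


# Sum of the per-layer counts listed in the table.
total = sum(count for _, count in TABLE_ROWS)

if __name__ == "__main__":
    print("sum of rows:", total)
    print("dense 400 -> 29:", dense_params(400, 29))
    print("dense 29 -> 29:", dense_params(29, 29))
    print("batch norm over 400 features:", batchnorm_params(400))
    print("conv 161 -> 250, K=3:", conv1d_params(161, 250, 3))
    print("conv 161 -> 250, K=11:", conv1d_params(161, 250, 11))
```

Under these formulas, the rows sum to the stated total of 2,576,759, and the BatchNormalization and TimeDistributed counts match a 400-feature input projected to 29 outputs and then 29 to 29. The Conv1D counts, however, are not what K=3 applied to the preceding layer's output would give: `conv1d_params(161, 250, 3)` is 121000, whereas each listed Conv1D count equals F × 1772, which is what a kernel of width 11 over 161 input features would produce. The source does not explain this, so treat it as an arithmetic observation only.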