Table 2 Detailed architecture of the CNN-RNN model used as the loss network in style transfer approach

From: Accent modification for speech recognition of non-native speakers using neural style transfer

Layer                   Output shape   Parameters
InputLayer              (B, T, 161)    0
Conv1D (F=220, K=3)     (B, T, 220)    389840
Conv1D (F=220, K=3)     (B, T, 220)    389840
Maxpool (P=2)           (B, T, 220)    880
Conv1D (F=150, K=3)     (B, T, 150)    265800
Conv1D (F=150, K=3)     (B, T, 150)    265800
Maxpool (P=2)           (B, T, 150)    600
Conv1D (F=100, K=3)     (B, T, 100)    177200
Conv1D (F=100, K=3)     (B, T, 100)    177200
Maxpool (P=2)           (B, T, 100)    400
Conv1D (F=80, K=3)      (B, T, 80)     141760
Conv1D (F=80, K=3)      (B, T, 80)     141760
Maxpool (P=2)           (B, T, 80)     320
Conv1D (F=80, K=3)      (B, T, 80)     141760
Conv1D (F=80, K=3)      (B, T, 80)     141760
Maxpool (P=2)           (B, T, 80)     320
Conv1D (F=80, K=3)      (B, T, 80)     141760
Conv1D (F=80, K=3)      (B, T, 80)     141760
Bidirectional (U=200)   (B, T, 400)    505200
BatchNormalization      (B, T, 400)    1600
TimeDistributed         (B, T, 29)     11629
Dropout                 (B, T, 29)     0
TimeDistributed         (B, T, 29)     870
SoftmaxActivation       (B, T, 29)     0
  Total params: 3,038,059
  Trainable params: 3,038,059
  Non-trainable params: 0
Abbreviations: Conv1D, one-dimensional convolutional layer; Maxpool, max-pooling layer; Bidirectional, bidirectional wrapper around an RNN; BatchNormalization, batch normalization layer; TimeDistributed, layer applied to every temporal slice of the input; Dropout, dropout layer; SoftmaxActivation, softmax activation function layer; F, number of filters; K, kernel size; P, pool size; U, number of hidden units in the RNN; B, batch size; T, number of time steps.
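As a quick sanity check, the per-layer parameter counts listed in the table can be summed to confirm the reported total of 3,038,059 trainable parameters. The sketch below simply copies the counts from the table above; it does not rederive them from layer shapes:

```python
# Per-layer parameter counts as listed in Table 2 (loss-network CNN-RNN).
layer_params = [
    0,        # InputLayer
    389840,   # Conv1D (F=220, K=3)
    389840,   # Conv1D (F=220, K=3)
    880,      # Maxpool (P=2)
    265800,   # Conv1D (F=150, K=3)
    265800,   # Conv1D (F=150, K=3)
    600,      # Maxpool (P=2)
    177200,   # Conv1D (F=100, K=3)
    177200,   # Conv1D (F=100, K=3)
    400,      # Maxpool (P=2)
    141760,   # Conv1D (F=80, K=3)
    141760,   # Conv1D (F=80, K=3)
    320,      # Maxpool (P=2)
    141760,   # Conv1D (F=80, K=3)
    141760,   # Conv1D (F=80, K=3)
    320,      # Maxpool (P=2)
    141760,   # Conv1D (F=80, K=3)
    141760,   # Conv1D (F=80, K=3)
    505200,   # Bidirectional (U=200)
    1600,     # BatchNormalization
    11629,    # TimeDistributed
    0,        # Dropout
    870,      # TimeDistributed
    0,        # SoftmaxActivation
]

total = sum(layer_params)
print(total)  # 3038059, matching the reported total of trainable parameters
```

Since the table lists no non-trainable parameters, this sum equals both the total and the trainable parameter count.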