
Table 2 Detailed architecture of the CNN-RNN model used as the loss network in the style transfer approach

From: Accent modification for speech recognition of non-native speakers using neural style transfer

| Layer | Output shape | Parameters |
|---|---|---|
| InputLayer | B, T, 161 | 0 |
| Conv1D (F=220, K=3) | B, T, 220 | 389840 |
| Conv1D (F=220, K=3) | B, T, 220 | 389840 |
| Maxpool (P=2) | B, T, 220 | 880 |
| Conv1D (F=150, K=3) | B, T, 150 | 265800 |
| Conv1D (F=150, K=3) | B, T, 150 | 265800 |
| Maxpool (P=2) | B, T, 150 | 600 |
| Conv1D (F=100, K=3) | B, T, 100 | 177200 |
| Conv1D (F=100, K=3) | B, T, 100 | 177200 |
| Maxpool (P=2) | B, T, 100 | 400 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Maxpool (P=2) | B, T, 80 | 320 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Maxpool (P=2) | B, T, 80 | 320 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Bidirectional (U=200) | B, T, 400 | 505200 |
| BatchNormalization | B, T, 400 | 1600 |
| TimeDistributed | B, T, 29 | 11629 |
| Dropout | B, T, 29 | 0 |
| TimeDistributed | B, T, 29 | 870 |
| SoftmaxActivation | B, T, 29 | 0 |

Total params: 3,038,059

Trainable params: 3,038,059

Non-trainable params: 0

  1. Conv1D one-dimensional convolutional layer; Maxpool max-pooling layer; Bidirectional bidirectional wrapper around an RNN; BatchNormalization batch-normalization layer; TimeDistributed layer applied to every temporal slice of the input; Dropout dropout layer; SoftmaxActivation softmax activation function layer. B batch size, T number of time steps, F number of filters, K kernel size, P pool size, U number of hidden units in the RNN
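For readers who want to experiment with this architecture, below is a minimal Keras sketch of the layer sequence in Table 2. Only the values given in the table (F, K, P, U, the 161-dimensional input, and the 29-class output) are taken from the source; the activation functions, padding mode, RNN cell type (a GRU here), and dropout rate are not stated in the table and are assumptions. Note that a standard Keras MaxPooling1D layer has no trainable parameters, so the sketch will not reproduce every per-layer count in the table; the dense head, however, matches exactly (400 × 29 + 29 = 11,629 and 29 × 29 + 29 = 870).

```python
# Minimal sketch, assuming TensorFlow/Keras; not the authors' implementation.
from tensorflow import keras
from tensorflow.keras import layers

def build_loss_network(n_features=161, n_classes=29):
    """CNN-RNN stack following the layer order of Table 2."""
    model = keras.Sequential(name="cnn_rnn_loss_network")
    model.add(keras.Input(shape=(None, n_features)))  # (B, T, 161); T may vary
    # Five conv blocks: two Conv1D layers (K=3) followed by max pooling (P=2)
    for filters in (220, 150, 100, 80, 80):
        model.add(layers.Conv1D(filters, 3, padding="same", activation="relu"))
        model.add(layers.Conv1D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2, padding="same"))
    # The final conv pair has no pooling after it in the table
    model.add(layers.Conv1D(80, 3, padding="same", activation="relu"))
    model.add(layers.Conv1D(80, 3, padding="same", activation="relu"))
    # Bidirectional RNN, U=200 units per direction -> 400 features (concat);
    # the cell type is not stated in the table, a GRU is assumed here
    model.add(layers.Bidirectional(layers.GRU(200, return_sequences=True)))
    model.add(layers.BatchNormalization())
    model.add(layers.TimeDistributed(layers.Dense(n_classes)))  # 400*29+29 = 11,629
    model.add(layers.Dropout(0.5))  # dropout rate is an assumption
    model.add(layers.TimeDistributed(layers.Dense(n_classes)))  # 29*29+29 = 870
    model.add(layers.Activation("softmax"))  # per-frame distribution over 29 symbols
    return model

model = build_loss_network()
model.summary()
```

Calling model.summary() on the result prints a per-layer breakdown in the same Layer / Output shape / Parameters format as the table above, which makes it easy to compare the sketch against the reported counts.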