Table 3 Detailed architecture of the CNN-RNN model used as the ASR module

From: Accent modification for speech recognition of non-native speakers using neural style transfer

| Layer | Output shape | Parameters |
|---|---|---|
| InputLayer | B, T, 161 | 0 |
| Conv1D (F=250, K=3) | B, T, 250 | 443000 |
| Conv1D (F=250, K=3) | B, T, 250 | 443000 |
| Maxpool (P=2) | B, T, 250 | 940 |
| Conv1D (F=150, K=3) | B, T, 150 | 265800 |
| Conv1D (F=150, K=3) | B, T, 150 | 265800 |
| Maxpool (P=2) | B, T, 150 | 600 |
| Conv1D (F=100, K=3) | B, T, 100 | 177200 |
| Conv1D (F=100, K=3) | B, T, 100 | 177200 |
| Maxpool (P=2) | B, T, 100 | 400 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Conv1D (F=80, K=3) | B, T, 80 | 141760 |
| Bidirectional (U=200) | B, T, 400 | 505200 |
| BatchNormalization | B, T, 400 | 1600 |
| TimeDistributed | B, T, 29 | 11629 |
| Dropout | B, T, 29 | 0 |
| TimeDistributed | B, T, 29 | 870 |
| SoftmaxActivation | B, T, 29 | 0 |
| | Total params: | 2,576,759 |
| | Trainable params: | 2,576,759 |
| | Non-trainable params: | 0 |
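
As a quick consistency check, the per-layer parameter counts listed in Table 3 can be summed and compared with the reported total. The sketch below copies the counts from the table as given. The helper `dense_params` is hypothetical and assumes that each TimeDistributed layer wraps a fully connected layer with a bias term; the table does not state this, but under that assumption the two TimeDistributed counts (400 to 29 units, then 29 to 29 units) are reproduced.

```python
# Per-layer parameter counts copied from Table 3, in table order.
layer_params = [
    ("InputLayer", 0),
    ("Conv1D (F=250, K=3)", 443000),
    ("Conv1D (F=250, K=3)", 443000),
    ("Maxpool (P=2)", 940),
    ("Conv1D (F=150, K=3)", 265800),
    ("Conv1D (F=150, K=3)", 265800),
    ("Maxpool (P=2)", 600),
    ("Conv1D (F=100, K=3)", 177200),
    ("Conv1D (F=100, K=3)", 177200),
    ("Maxpool (P=2)", 400),
    ("Conv1D (F=80, K=3)", 141760),
    ("Conv1D (F=80, K=3)", 141760),
    ("Bidirectional (U=200)", 505200),
    ("BatchNormalization", 1600),
    ("TimeDistributed", 11629),
    ("Dropout", 0),
    ("TimeDistributed", 870),
    ("SoftmaxActivation", 0),
]

# Sum of the listed counts, to compare with the "Total params" row.
total = sum(count for _, count in layer_params)


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer.

    Hypothetical helper: assumes the TimeDistributed layers wrap a
    dense layer with bias, which Table 3 does not state explicitly.
    """
    return n_in * n_out + n_out


print(f"sum of listed counts: {total:,}")
print("dense 400 -> 29:", dense_params(400, 29))
print("dense 29 -> 29:", dense_params(29, 29))
```

The listed counts sum to the reported total of 2,576,759, and the assumed dense formula gives 11629 and 870 for the two TimeDistributed rows.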