Table 1 Detailed architecture of the autoencoder

From: Accent modification for speech recognition of non-native speakers using neural style transfer

| Layer                | Output shape | Parameters |
|----------------------|--------------|------------|
| Conv2D (F=32, K=3)   | B, X, T, 32  | 417344     |
| Conv2D (F=64, K=3)   | B, X, T, 64  | 18496      |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 73856      |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2D (F=128, K=3)  | B, X, T, 128 | 147584     |
| Conv2DTr (F=64, K=3) | B, X, T, 64  | 73792      |
| Conv2DTr (F=32, K=3) | B, X, T, 32  | 18464      |
| Conv2D (F=32, K=3)   | B, X, T, 32  | 82976      |
|                      | Total params: | 2,160,768 |
|                      | Trainable params: | 2,160,768 |
|                      | Non-trainable params: | 0 |

Conv2D: 2-dimensional convolutional layer; Conv2DTr: 2-dimensional transposed convolutional layer; F: number of filters; K: kernel size; B: batch dimension; X: dimension related to the spectrogram's frequency; T: dimension related to the spectrogram's time steps.
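
The parameter counts in Table 1 can be cross-checked with the usual formula for a 2-dimensional convolution, K·K·C_in·F weights plus F biases. The sketch below is only an illustration under stated assumptions: square K×K kernels, one bias per filter, and an input channel count C_in equal to the F of the preceding row. The helper name `conv2d_params` is hypothetical and not taken from the paper. The first row is left out of the formula check because the table does not state the input channel count. The last row is left out as well, since its listed value (82976) does not follow from these assumptions with C_in=32 and K=3, and the table gives no further detail. Both rows are still included in the total using their listed values.

```python
def conv2d_params(c_in, filters, kernel=3):
    """Parameter count of a square-kernel 2D (or transposed 2D) convolution.

    Assumes kernel*kernel*c_in*filters weights plus one bias per filter.
    """
    return kernel * kernel * c_in * filters + filters


# (layer type, assumed input channels, F, parameters listed in Table 1)
# Input channels are an assumption: the F of the preceding row.
rows = (
    [("Conv2D", 32, 64, 18496), ("Conv2D", 64, 128, 73856)]
    + [("Conv2D", 128, 128, 147584)] * 10
    + [("Conv2DTr", 128, 64, 73792), ("Conv2DTr", 64, 32, 18464)]
)

# Rows whose input channel count can be read off the table match the formula.
for name, c_in, filters, listed in rows:
    assert conv2d_params(c_in, filters) == listed, (name, c_in, filters)

# First and last rows are taken as listed, without recomputing them.
listed_all = [417344] + [listed for _, _, _, listed in rows] + [82976]
total = sum(listed_all)
print(total)  # 2160768, the "Total params" value in Table 1
```

Under these assumptions the fourteen middle rows reproduce the listed values exactly, and the sixteen listed counts sum to the reported total of 2,160,768.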