Table 1 Detailed architecture of the autoencoder

From: Accent modification for speech recognition of non-native speakers using neural style transfer

| Layer | Output shape | Parameters |
| --- | --- | --- |
| Conv2D (F=32, K=3) | B, X, T, 32 | 417344 |
| Conv2D (F=64, K=3) | B, X, T, 64 | 18496 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 73856 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2D (F=128, K=3) | B, X, T, 128 | 147584 |
| Conv2DTr (F=64, K=3) | B, X, T, 64 | 73792 |
| Conv2DTr (F=32, K=3) | B, X, T, 32 | 18464 |
| Conv2D (F=32, K=3) | B, X, T, 32 | 82976 |

Total params: 2,160,768
Trainable params: 2,160,768
Non-trainable params: 0
Conv2D: 2-dimensional convolutional layer; Conv2DTr: 2-dimensional transposed convolutional layer; F: number of filters; K: kernel size; B: batch dimension; X: dimension related to the spectrogram’s frequency; T: dimension related to the spectrogram’s time steps.
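
The parameter counts in Table 1 can be cross-checked with the usual formula for a 2-D convolution, K · K · C_in · F + F. The sketch below assumes square K × K kernels and one bias term per filter, neither of which the table states explicitly. The names `conv_params`, `implied_in_channels` and `ROWS` are hypothetical helpers introduced for illustration and do not come from the paper.

```python
def conv_params(in_channels, filters, kernel=3, bias=True):
    """Parameter count of a 2-D (or transposed 2-D) convolution.

    Assumes a square kernel and, if bias is True, one bias per filter.
    """
    return kernel * kernel * in_channels * filters + (filters if bias else 0)


def implied_in_channels(params, filters, kernel=3):
    """Invert conv_params (bias included).

    Returns the input-channel count that reproduces `params`,
    or None if no integer channel count fits.
    """
    weights = params - filters
    per_channel = kernel * kernel * filters
    return weights // per_channel if weights % per_channel == 0 else None


# (filters, reported parameter count) for each row of Table 1, in order.
ROWS = (
    [(32, 417344), (64, 18496), (128, 73856)]
    + [(128, 147584)] * 10
    + [(64, 73792), (32, 18464), (32, 82976)]
)

# Sum of the per-layer counts, to compare against the reported total.
total = sum(params for _, params in ROWS)

# Input-channel count each row implies under the assumptions above.
in_channels = [implied_in_channels(params, filters) for filters, params in ROWS]

if __name__ == "__main__":
    print(total)
    print(in_channels)
```

Under these assumptions the per-layer counts sum to the reported total of 2,160,768, and every intermediate layer implies an input-channel count equal to the previous layer's filter count. The first row implies 1449 input channels and the last row implies 288. The table states neither value, so both should be read as inferences from the formula, not as figures given by the authors.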