Table 2 Detailed architecture of the CNN-RNN model used as the loss network in style transfer approach

From: Accent modification for speech recognition of non-native speakers using neural style transfer

Layer                   Output shape   Parameters
InputLayer              (B, T, 161)    0
Conv1D (F=220, K=3)     (B, T, 220)    389840
Conv1D (F=220, K=3)     (B, T, 220)    389840
Maxpool (P=2)           (B, T, 220)    880
Conv1D (F=150, K=3)     (B, T, 150)    265800
Conv1D (F=150, K=3)     (B, T, 150)    265800
Maxpool (P=2)           (B, T, 150)    600
Conv1D (F=100, K=3)     (B, T, 100)    177200
Conv1D (F=100, K=3)     (B, T, 100)    177200
Maxpool (P=2)           (B, T, 100)    400
Conv1D (F=80, K=3)      (B, T, 80)     141760
Conv1D (F=80, K=3)      (B, T, 80)     141760
Maxpool (P=2)           (B, T, 80)     320
Conv1D (F=80, K=3)      (B, T, 80)     141760
Conv1D (F=80, K=3)      (B, T, 80)     141760
Maxpool (P=2)           (B, T, 80)     320
Conv1D (F=80, K=3)      (B, T, 80)     141760
Conv1D (F=80, K=3)      (B, T, 80)     141760
Bidirectional (U=200)   (B, T, 400)    505200
BatchNormalization      (B, T, 400)    1600
TimeDistributed         (B, T, 29)     11629
Dropout                 (B, T, 29)     0
TimeDistributed         (B, T, 29)     870
SoftmaxActivation       (B, T, 29)     0
  Total params: 3,038,059
  Trainable params: 3,038,059
  Non-trainable params: 0
Abbreviations: Conv1D, one-dimensional convolutional layer; Maxpool, max-pooling layer; Bidirectional, bidirectional wrapper around an RNN; BatchNormalization, batch normalization layer; TimeDistributed, layer applied to every temporal slice of the input; Dropout, dropout layer; SoftmaxActivation, softmax activation function layer; F, number of filters; K, kernel size; P, pool size; U, number of hidden units in the RNN; B, batch size; T, number of time steps.
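As a quick sanity check, the per-layer parameter counts listed in the table can be summed to confirm the reported total of 3,038,059 trainable parameters. The sketch below simply copies the counts from the table above; it does not rederive them from layer shapes:

```python
# Per-layer parameter counts as listed in Table 2 (loss-network CNN-RNN).
layer_params = [
    0,        # InputLayer
    389840,   # Conv1D (F=220, K=3)
    389840,   # Conv1D (F=220, K=3)
    880,      # Maxpool (P=2)
    265800,   # Conv1D (F=150, K=3)
    265800,   # Conv1D (F=150, K=3)
    600,      # Maxpool (P=2)
    177200,   # Conv1D (F=100, K=3)
    177200,   # Conv1D (F=100, K=3)
    400,      # Maxpool (P=2)
    141760,   # Conv1D (F=80, K=3)
    141760,   # Conv1D (F=80, K=3)
    320,      # Maxpool (P=2)
    141760,   # Conv1D (F=80, K=3)
    141760,   # Conv1D (F=80, K=3)
    320,      # Maxpool (P=2)
    141760,   # Conv1D (F=80, K=3)
    141760,   # Conv1D (F=80, K=3)
    505200,   # Bidirectional (U=200)
    1600,     # BatchNormalization
    11629,    # TimeDistributed
    0,        # Dropout
    870,      # TimeDistributed
    0,        # SoftmaxActivation
]

total = sum(layer_params)
print(total)  # 3038059, matching the reported total of trainable parameters
```

Since the table lists no non-trainable parameters, this sum equals both the total and the trainable parameter count.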