EURASIP Journal on Audio, Speech, and Music Processing

Table 1 Network architecture of the encoder

From: Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation

	Operation
# Layer	Audio model	Visual model
Input	40×116×3	122×122×28×1
1	5×2 conv, 64, s(1,2)	3×3×1 conv, 48, p(1,1,0), 2×2 max-pool
2	5×2 conv, 128, s(1,2), 2×1 max-pool	3×3×2 conv, 96, s(1,1,2), 3×3 max-pool
3	5×2 conv, 256, 2×1 max-pool, unfold along the time step
4	128 dense*

s(·) and p(·) denote the stride size and the padding size, respectively.
*A step-wise operation, applied to each time step independently.
The activation function is ReLU.
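
The layer sizes in Table 1 can be traced with the standard output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1 per axis. The sketch below is only an illustration and rests on several assumptions that the table does not state: an unlisted stride is 1, an unlisted padding is 0, each max-pool uses a stride equal to its window, pooling in the visual model acts on the two spatial axes only, and the last input dimension of each model is the channel axis. The helper names (`out_size`, `apply_op`, `trace_audio`, `trace_visual`) are hypothetical and do not come from the article.

```python
def out_size(n, kernel, stride=1, pad=0):
    """Output length along one axis of a convolution or pooling window."""
    return (n + 2 * pad - kernel) // stride + 1


def apply_op(shape, kernel, stride=None, pad=None):
    """Apply one windowed operation to every axis of `shape`."""
    stride = stride or (1,) * len(shape)  # assumption: unlisted stride is 1
    pad = pad or (0,) * len(shape)        # assumption: unlisted padding is 0
    return tuple(
        out_size(n, k, s, p) for n, k, s, p in zip(shape, kernel, stride, pad)
    )


def trace_audio(shape=(40, 116)):
    """Audio column of Table 1; the input channel axis (3) is omitted."""
    shape = apply_op(shape, (5, 2), stride=(1, 2))  # layer 1: 5x2 conv, 64, s(1,2)
    shape = apply_op(shape, (5, 2), stride=(1, 2))  # layer 2: 5x2 conv, 128, s(1,2)
    shape = apply_op(shape, (2, 1), stride=(2, 1))  # layer 2: 2x1 max-pool
    shape = apply_op(shape, (5, 2))                 # layer 3: 5x2 conv, 256
    shape = apply_op(shape, (2, 1), stride=(2, 1))  # layer 3: 2x1 max-pool
    return shape + (256,)                           # append output channels


def trace_visual(shape=(122, 122, 28)):
    """Visual column of Table 1; the input channel axis (1) is omitted."""
    shape = apply_op(shape, (3, 3, 1), pad=(1, 1, 0))     # layer 1: 3x3x1 conv, 48, p(1,1,0)
    shape = apply_op(shape, (2, 2, 1), stride=(2, 2, 1))  # layer 1: 2x2 max-pool
    shape = apply_op(shape, (3, 3, 2), stride=(1, 1, 2))  # layer 2: 3x3x2 conv, 96, s(1,1,2)
    shape = apply_op(shape, (3, 3, 1), stride=(3, 3, 1))  # layer 2: 3x3 max-pool
    return shape + (96,)                                  # append output channels


if __name__ == "__main__":
    print("audio feature map before unfolding:", trace_audio())
    print("visual feature map after layer 2:", trace_visual())
```

Under these assumptions the audio branch ends with a 6×28×256 feature map, so unfolding along the second axis would give 28 time steps of 6 × 256 = 1536 features each, which the step-wise dense layer would map to 128 dimensions. A different padding or pooling stride in the actual implementation would change these numbers.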
