Table 1 Network architecture of the encoder

From: Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation

 

| Layer # | Operation (audio model) | Operation (visual model) |
|---|---|---|
| Input | 40×116×3 | 122×122×28×1 |
| 1 | 5×2 conv, 64, s(1,2) | 3×3×1 conv, 48, p(1,1,0), 2×2 max-pool |
| 2 | 5×2 conv, 128, s(1,2), 2×1 max-pool | 3×3×2 conv, 96, s(1,1,2), 3×3 max-pool |
| 3 | 5×2 conv, 256, 2×1 max-pool | |
| | unfold along the time step | unfold along the time step |
| 4 | 128 dense* | 128 dense* |

  1. s(·) and p(·) indicate a stride size and a padding size, respectively
  2. *A step-wise operation, which is applied for each time step independently
  3. The activation function is ReLU
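
To make the audio column easier to follow, here is a minimal sketch that traces the two leading dimensions of the 40×116×3 input through layers 1 to 3 using the standard output-size formula for convolution and pooling windows. The table does not give the audio padding, the pooling stride, or the stride of the layer 3 convolution, so the sketch assumes zero padding, a pooling stride equal to the pooling window, and stride 1 where no s(·) is given. The helper names `out_size`, `apply_2d`, and `audio_encoder_shapes` are hypothetical and do not come from the paper.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Standard output-length formula for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def apply_2d(shape, kernel, stride=(1, 1), padding=(0, 0)):
    """Apply the formula independently to both axes of a 2-D shape."""
    return tuple(
        out_size(s, k, st, p)
        for s, k, st, p in zip(shape, kernel, stride, padding)
    )


def audio_encoder_shapes(input_hw=(40, 116)):
    """Trace the first two dimensions through the audio column of Table 1.

    ASSUMPTIONS (not stated in the table): zero padding everywhere,
    pooling stride equal to the pooling window, and stride 1 for the
    layer 3 convolution, which carries no s(.) annotation.
    """
    shapes = [input_hw]

    x = apply_2d(input_hw, (5, 2), stride=(1, 2))  # layer 1: 5x2 conv, 64, s(1,2)
    shapes.append(x)

    x = apply_2d(x, (5, 2), stride=(1, 2))         # layer 2: 5x2 conv, 128, s(1,2)
    x = apply_2d(x, (2, 1), stride=(2, 1))         # layer 2: 2x1 max-pool
    shapes.append(x)

    x = apply_2d(x, (5, 2))                        # layer 3: 5x2 conv, 256
    x = apply_2d(x, (2, 1), stride=(2, 1))         # layer 3: 2x1 max-pool
    shapes.append(x)

    return shapes


if __name__ == "__main__":
    for i, hw in enumerate(audio_encoder_shapes()):
        print("input" if i == 0 else f"layer {i}", hw)
```

Under these assumptions the shapes are 36×58 after layer 1, 16×29 after layer 2, and 6×28 after layer 3, so the second axis ends at 28, the same number as the 28 in the visual input size. Reading that axis as the time axis that is unfolded before the step-wise dense layer is an interpretation, not something the table states.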