Table 1 Network architecture of the encoder

From: Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation

            Operation
# Layer     Audio model               Visual model
Input       40×116×3                  122×122×28×1
1           5×2 conv, 64, s(1,2)      3×3×1 conv, 48, p(1,1,0),
                                      2×2 max-pool
2           5×2 conv, 128, s(1,2),    3×3×2 conv, 96, s(1,1,2),
            2×1 max-pool              3×3 max-pool
3           5×2 conv, 256,
            2×1 max-pool
                 unfold along the time step
4                       128 dense*
  1. s(·) and p(·) indicate a stride size and a padding size, respectively
  2. *A step-wise operation, which is applied for each time step independently
  3. The activation function is ReLU
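
As a rough sanity check on the audio column, the sketch below traces the feature-map size through layers 1 to 3. The table does not state several details, so the following are assumptions made only for illustration: the 40×116×3 input is ordered as frequency × time × channels, convolutions without a p(·) entry use no padding, convolutions without an s(·) entry use stride 1, and each max-pool uses a stride equal to its window. The helper names `out_size`, `window`, and `audio_encoder_shape` are hypothetical and not taken from the paper.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length along one axis for a conv or pooling window (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


def window(shape, kernel, stride):
    """Apply the same unpadded window operation to every axis of a 2-D shape."""
    return tuple(out_size(n, k, s) for n, k, s in zip(shape, kernel, stride))


def audio_encoder_shape(freq=40, time=116):
    """Trace (frequency, time) through audio layers 1-3 of Table 1.

    Assumptions (not stated in the table): no padding, stride 1 where
    no s(.) is given, and pooling stride equal to the pooling window.
    """
    shape = (freq, time)
    shape = window(shape, (5, 2), (1, 2))  # layer 1: 5x2 conv, 64, s(1,2)
    shape = window(shape, (5, 2), (1, 2))  # layer 2: 5x2 conv, 128, s(1,2)
    shape = window(shape, (2, 1), (2, 1))  # layer 2: 2x1 max-pool
    shape = window(shape, (5, 2), (1, 1))  # layer 3: 5x2 conv, 256
    shape = window(shape, (2, 1), (2, 1))  # layer 3: 2x1 max-pool
    return shape


freq_bins, time_steps = audio_encoder_shape()
features_per_step = freq_bins * 256  # 256 channels after the layer-3 conv
print(freq_bins, time_steps, features_per_step)
```

Under these assumptions the audio branch ends with 28 time steps, which equals the 28 frames in the visual input, and each step would feed 1536 features into the step-wise 128-unit dense layer. A different padding or pooling convention would change these numbers.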