From: Unsupervised domain adaptation for lip reading based on cross-modal knowledge distillation
Layer | Audio model operation | Visual model operation |
---|---|---|
Input | 40 × 116 × 3 | 122 × 122 × 28 × 1 |
1 | 5 × 2 conv, 64, s(1,2) | 3 × 3 × 1 conv, 48, p(1,1,0), 2 × 2 max-pool |
2 | 5 × 2 conv, 128, s(1,2), 2 × 1 max-pool | 3 × 3 × 2 conv, 96, s(1,1,2), 3 × 3 max-pool |
3 | 5 × 2 conv, 256, 2 × 1 max-pool |  |
 | unfold along the time step |  |
4 | 128 dense* |  |
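
The audio column can be checked with a little shape arithmetic. The sketch below is a minimal illustration, not the authors' code. It assumes no padding (the table lists none for the audio model), that each max-pool uses a stride equal to its window, that the layer 3 convolution has stride 1, and that kernel and stride pairs are ordered like the first two input dimensions (40, 116). The helper names `out_len` and `audio_shapes` are hypothetical.

```python
def out_len(size, kernel, stride=1, pad=0):
    """Length of one axis after a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1


def audio_shapes(height=40, width=116):
    """Spatial size after each audio layer, under the assumptions above."""
    # Each entry is (kernel, stride); pooling stride = window is an assumption.
    layers = [
        [((5, 2), (1, 2))],                       # layer 1: conv
        [((5, 2), (1, 2)), ((2, 1), (2, 1))],     # layer 2: conv, max-pool
        [((5, 2), (1, 1)), ((2, 1), (2, 1))],     # layer 3: conv, max-pool
    ]
    shapes = []
    for ops in layers:
        for (kh, kw), (sh, sw) in ops:
            height = out_len(height, kh, sh)
            width = out_len(width, kw, sw)
        shapes.append((height, width))
    return shapes


print(audio_shapes())  # [(36, 58), (16, 29), (6, 28)]
```

Under these assumptions the 40 × 116 input shrinks to 6 × 28 after layer 3, so the second axis has the same length, 28, as the third dimension of the visual input. The table does not state whether that is the axis unfolded as the time step.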