We aim to achieve UDA for lip reading, which estimates a word label from an image input, using data of an unknown class. In our proposed method, we use audio-visual data for training and adapting the lip-reading model, and we use only visual data for evaluation. First, we explain the basic idea of cross-modal KD, on which our method is based. Then, we describe our proposed UDA method, which applies cross-modal KD to the data of the unknown class.
3.1 Cross-modal KD
Figure 1 shows an overview of the basic procedure of cross-modal KD, where the speech and the image come from the same utterance. In our lip-reading task, the output is a word label. In advance, we train the audio model, which estimates the probability of a word from the acoustic feature, using the cross-entropy loss with the correct label. Given an acoustic feature \(x_{\text{aud}}\) and an image feature \(x_{\text{vis}}\), the basic KD loss is defined as follows:
$$ -\sum_{l} p_{\text{aud}}(l \mid x_{\text{aud}}) \ln p_{\text{vis}}(l \mid x_{\text{vis}}), \tag{1} $$
where \(p_{\text{vis}}(l \mid x_{\text{vis}})\) and \(p_{\text{aud}}(l \mid x_{\text{aud}})\) denote the probabilities of a label l estimated by the visual model from \(x_{\text{vis}}\) and by the audio model from \(x_{\text{aud}}\), respectively. Here, the acoustic feature and the image feature are extracted from the same utterance. When training the visual model, the parameters of the audio model are fixed. This loss function forces the visual model to imitate the outputs of the audio model. In practice, the softmax loss using the correct label (hard loss) is often combined with the KD loss through a linear interpolation parameter λ for stable training. Li et al. [15] demonstrated that KD between an ASR model and an AV-ASR model improves the recognition performance when the speech data is corrupted by noise. Therefore, KD between the audio model (ASR model) and the visual model (lip-reading model) is also expected to improve the performance in our task.
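To make the distillation objective concrete, the following PyTorch-style sketch implements Eq. (1) together with the interpolated hard loss. The function name, the use of logits as inputs, and the weight `lam` are illustrative assumptions; the paper does not prescribe a particular implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(audio_logits, visual_logits, labels, lam=0.9):
    """Soft KD loss of Eq. (1) interpolated with the hard cross-entropy loss.

    audio_logits:  (batch, num_words) outputs of the frozen audio model
    visual_logits: (batch, num_words) outputs of the visual (lip-reading) model
    labels:        (batch,) ground-truth word indices
    lam:           illustrative interpolation weight between soft and hard losses
    """
    with torch.no_grad():                                   # the teacher (audio model) is fixed
        p_aud = F.softmax(audio_logits, dim=-1)
    log_p_vis = F.log_softmax(visual_logits, dim=-1)
    soft_loss = -(p_aud * log_p_vis).sum(dim=-1).mean()     # Eq. (1), averaged over the batch
    hard_loss = F.cross_entropy(visual_logits, labels)      # softmax loss with the correct label
    return lam * soft_loss + (1.0 - lam) * hard_loss
```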
3.2 Cross-modal KD-based UDA for the unknown class
Before describing our method, we first highlight the fact that the adaptation data does not belong to any class of the source domain. For a given domain, let \(\mathcal{D}\) be the joint distribution over sequences of acoustic features, sequences of visual features, and the corresponding label. The output of the network is a word label.
Our model consists of two parts, an encoder and a classifier, as shown in Fig. 2. The encoder is a stack of convolution layers. The two encoders for the audio and visual modalities are defined as follows:
$$ \boldsymbol{h}^{a} = f_{\text{aud}}(\boldsymbol{a}), \tag{2} $$
$$ \boldsymbol{h}^{v} = f_{\text{vis}}(\boldsymbol{v}), \tag{3} $$
where \(\boldsymbol{a}=(a_{1},...,a_{t},...,a_{T_{a}})\) and \(\boldsymbol{v}=(v_{1},...,v_{t^{\prime}},...,v_{T_{v}})\) are input sequences of acoustic features and of visual features, respectively. \(\boldsymbol{h}^{a}=(h^{a}_{1},...,h^{a}_{u},...,h^{a}_{U})\) and \(\boldsymbol{h}^{v}=(h^{v}_{1},...,h^{v}_{u},...,h^{v}_{U})\) are the sequences of high-level representations. Here, \(a_{t}\), \(v_{t^{\prime}}\), \(h^{a}_{u}\in\mathbb{R}^{d}\), and \(h^{v}_{u}\in\mathbb{R}^{d}\) are the input acoustic feature frame, the input visual feature frame, and the d-dimensional encoder features of the two modalities, respectively. \(T_{a}\), \(T_{v}\), and \(U\le\min(T_{a},T_{v})\) denote the numbers of input acoustic features, input visual features, and encoder output features, respectively. The number of steps of the encoded features is the same for the two modalities. The classifier consists of fully connected layers that estimate the corresponding word label.
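As a concrete illustration of this encoder-classifier structure, the sketch below uses stacked 1-D convolutions and a fully connected classifier in PyTorch. The layer widths, kernel sizes, strides, the mean pooling over encoded frames, and the vocabulary size are illustrative assumptions; the paper only specifies a convolutional encoder, a fully connected classifier, and a shared number of encoder steps U for the two modalities.

```python
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Stack of 1-D convolution layers mapping an input sequence to U encoded frames.

    How the two modalities are brought to the same number of steps U (e.g., by matching
    strides to the audio and video frame rates) is outside this sketch.
    """
    def __init__(self, in_dim, d=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):                                     # x: (batch, T, in_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)    # (batch, U, d)

class WordClassifier(nn.Module):
    """Fully connected layers estimating a word label from the encoded sequence."""
    def __init__(self, d=256, num_words=500):  # vocabulary size is an illustrative choice
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_words))

    def forward(self, h):                      # h: (batch, U, d)
        return self.fc(h.mean(dim=1))          # pool over frames (an assumption), then classify
```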
During adaptation, our method minimizes the mean square error (MSE) between the hidden representations as follows:
$$ \mathcal{L}_{\text{MSE}}(\mathcal{D}) = \mathbb{E}_{\{\boldsymbol{v},\boldsymbol{a},\boldsymbol{y}\}\sim\mathcal{D}} \left[ \left\| \boldsymbol{h}^{a} - \boldsymbol{h}^{v} \right\|_{2}^{2} \right], \tag{4} $$
where y is the label, which is not used here. Unlike the commonly used KD loss (Eq. (1)), we use the hidden representation in the intermediate layer for distillation. In the output layer and the layers of the classifier, the frame-level information is lost and the feature representation is specialized to word-level (i.e., class-level) information. For this reason, the simple KD formulation cannot be applied to the adaptation when the adaptation data is out-of-class. In contrast, the layers in the encoder hold sub-word or phoneme-like representations that are independent of the class because they still retain the frame-level information. This is why our proposed method realizes UDA using the data of the unknown class.
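A minimal sketch of the adaptation loss in Eq. (4), assuming the encoders sketched above and paired audio-visual batches; note that `F.mse_loss` takes the element-wise mean, which matches Eq. (4) up to a constant scale.

```python
import torch
import torch.nn.functional as F

def encoder_mse_loss(audio_encoder, visual_encoder, a, v):
    """Eq. (4): MSE between the audio and visual encoder outputs of paired data.

    a: (batch, T_a, audio_dim) acoustic features
    v: (batch, T_v, visual_dim) visual features
    The label of the adaptation data is never used.
    """
    with torch.no_grad():             # the audio model is kept fixed
        h_a = audio_encoder(a)        # (batch, U, d)
    h_v = visual_encoder(v)           # (batch, U, d)
    return F.mse_loss(h_v, h_a)       # element-wise mean; Eq. (4) up to a constant scale
```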
3.3 Training procedure
For the source domain, let \(\mathcal{D}_{\text{src}}\) be the joint distribution over sequences of acoustic features, sequences of visual features, and the corresponding label. \(\mathcal{D}_{\text{trg}}\) is defined analogously for the target domain.
The first step is to train the models on the source domain. In this step, we expect the hidden representation of the visual model to become similar to that of the audio model. In advance, we train the audio model using the cross-entropy loss with the correct label as follows:
$$ \mathbb{E}_{\{\boldsymbol{v},\boldsymbol{a},\boldsymbol{y}\}\sim\mathcal{D}_{\text{src}}} \left[ -\log g_{\text{aud}}(\boldsymbol{y}\mid\boldsymbol{h}^{a}) \right], \tag{5} $$
where \(\boldsymbol{v}\) is not used, and \(g_{\text{aud}}(\boldsymbol{y}\mid\boldsymbol{h}^{a})\) denotes the output probability of the label y estimated by the classifier of the audio model from the encoded feature \(\boldsymbol{h}^{a}\). Then, we train the visual model using the KD loss and the cross-entropy loss as follows:
$$ \begin{aligned} &\mathcal{L}_{\text{MSE}}(\mathcal{D}_{\text{src}}) + \mathcal{L}_{\text{CE}}(\mathcal{D}_{\text{src}}) \\ &\quad = \mathcal{L}_{\text{MSE}}(\mathcal{D}_{\text{src}}) + \mathbb{E}_{\{\boldsymbol{v},\boldsymbol{a},\boldsymbol{y}\}\sim\mathcal{D}_{\text{src}}} \left[ -\log g_{\text{vis}}(\boldsymbol{y}\mid\boldsymbol{h}^{v}) \right], \end{aligned} \tag{6} $$
where \(g_{\text{vis}}(\boldsymbol{y}\mid\boldsymbol{h}^{v})\) is the output probability estimated by the classifier of the visual model.
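The two source-domain training stages (Eqs. (5) and (6)) could then be written as the following steps; the optimizer handling and the reuse of `F.mse_loss` for \(\mathcal{L}_{\text{MSE}}\) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def audio_train_step(audio_encoder, audio_classifier, opt, a, y):
    """Eq. (5): cross-entropy training of the audio model on the source domain."""
    loss = F.cross_entropy(audio_classifier(audio_encoder(a)), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def visual_train_step(visual_encoder, visual_classifier, audio_encoder, opt, a, v, y):
    """Eq. (6): encoder-level KD (MSE) plus cross-entropy for the visual model."""
    with torch.no_grad():
        h_a = audio_encoder(a)          # the trained audio encoder stays frozen
    h_v = visual_encoder(v)
    loss = F.mse_loss(h_v, h_a) + F.cross_entropy(visual_classifier(h_v), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```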
Next, we adapt the visual model using the data of the unknown class with the UDA scheme described in Section 3.2. For more stable adaptation, we also use the data of the source domain. In addition to the loss in Eq. (4), we compute losses on the source domain, which has correct labels. This works as regularization that prevents overfitting to the target-domain distribution of the audio modality. Finally, our UDA approach for the unknown class minimizes the following loss:
$$ \mathcal{L} = \mathcal{L}_{\text{MSE}}(\mathcal{D}_{\text{trg}}) + \alpha\,\mathcal{L}_{\text{MSE}}(\mathcal{D}_{\text{src}}) + (1-\alpha)\,\mathcal{L}_{\text{CE}}(\mathcal{D}_{\text{src}}), \tag{7} $$
where α is a weight parameter introduced for stable adaptation; we set it to 0.5 in this paper. All parameters of the visual model are fine-tuned to minimize this loss function.
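Putting the pieces together, one adaptation step minimizing Eq. (7) might be sketched as follows, drawing an unlabeled target batch and a labeled source batch jointly. The batch structure and the default α = 0.5 mirror the description above; everything else is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def adaptation_step(visual_encoder, visual_classifier, audio_encoder, opt,
                    trg_batch, src_batch, alpha=0.5):
    """Eq. (7): one UDA step mixing unlabeled target data with labeled source data."""
    a_t, v_t = trg_batch             # unknown-class target data; its labels are not used
    a_s, v_s, y_s = src_batch        # labeled source-domain data

    with torch.no_grad():            # the audio model stays fixed throughout adaptation
        h_a_t = audio_encoder(a_t)
        h_a_s = audio_encoder(a_s)

    h_v_t = visual_encoder(v_t)
    h_v_s = visual_encoder(v_s)

    loss = (F.mse_loss(h_v_t, h_a_t)                                         # L_MSE(D_trg)
            + alpha * F.mse_loss(h_v_s, h_a_s)                               # alpha * L_MSE(D_src)
            + (1 - alpha) * F.cross_entropy(visual_classifier(h_v_s), y_s))  # (1-alpha) * L_CE(D_src)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```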