Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

Jonathan Munro, Dima Damen

Introduction

Fine-grained action recognition is the problem of recognising actions and interactions such as “cutting a tomato” or “tightening a bolt”, as opposed to coarse-grained actions such as “preparing a meal”. It has a wide range of applications in assistive technologies, both in homes and in industry. Supervised approaches rely on collecting a large number of labelled examples to train discriminative models. However, because such fine-grained actions are difficult to collect and annotate, many datasets record long untrimmed sequences, each containing several fine-grained actions from a single environment or only a few environments.

Figure 2 shows the recent surge in large-scale fine-grained action datasets. Two approaches have been attempted to achieve scalability: crowd-sourcing scripted actions, and long-term collections of natural interactions in homes. While the latter offers more realistic videos, many actions are collected in only a few environments. This leads to learned representations which do not generalise well.

Transferring a model learned on a labelled source domain to an unlabelled target domain is known as Unsupervised Domain Adaptation (UDA). Recently, significant attention has been given to deep UDA in other vision tasks. However, very few works have attempted deep UDA for video data. Surprisingly, none have tested on videos of fine-grained actions, and all of these approaches treat video as a sequence of images (i.e. the RGB modality only). This is in contrast with self-supervised approaches, which have successfully utilised multiple modalities within video when labels are not present during training.

To our knowledge, no prior work has explored the multi-modal nature of video data for UDA in action recognition. We summarise our contributions as follows:

We show that multi-modal self-supervision, applied to both source and unlabelled target data, can be used for domain adaptation in video.

We propose a multi-modal UDA strategy, which we name MM-SADA, to adapt fine-grained action recognition models to unlabelled target environments, using both adversarial alignment and multi-modal self-supervision.

We test our approach on three domains from EPIC-Kitchens, trained end-to-end using I3D, and provide the first benchmark of UDA for fine-grained action recognition. Our results show that MM-SADA outperforms source-only generalisation as well as alternative domain adaptation strategies such as batch-based normalisation, distribution discrepancy minimisation and classifier discrepancy.

Related Works

This section discusses related literature starting with general UDA approaches, then supervised and self-supervised learning for action recognition, concluding with works on domain adaptation for action recognition.

Unsupervised Domain Adaptation (UDA) outside of Action Recognition. UDA has been extensively studied for vision tasks including object recognition, semantic segmentation and person re-identification. Typical approaches adapt neural networks by minimising a discrepancy measure, thus matching mid-level representations of source and target domains. For example, Maximum Mean Discrepancy (MMD) minimises the distance between the means of the projected domain distributions in a Reproducing Kernel Hilbert Space. More recently, domain adaptation has been influenced by adversarial training. Simultaneously learning a domain discriminator, whilst maximising its loss with respect to the feature extractor, minimises the domain discrepancy between source and target. In one approach, a GAN-like loss function allows separate weights for source and target domains, while in another, shared weights are used, efficiently removing domain-specific information by inverting the gradient produced by the domain discriminator with a Gradient Reversal Layer (GRL).
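To make the idea of a discrepancy measure concrete, the following minimal sketch estimates the squared MMD between two small sets of feature vectors. The Gaussian kernel, the bandwidth `sigma`, the biased estimator and all function names are assumptions chosen for illustration; they are not details taken from the cited methods.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two feature vectors (lists of floats)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of features."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Made-up 2-D features: one target set close to the source, one far away.
source = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
target_near = [[0.1, 0.0], [0.0, 0.2], [0.2, 0.1]]
target_far = [[3.0, 3.1], [3.2, 2.9], [3.1, 3.0]]

print(mmd_squared(source, target_near))  # small: the sets overlap
print(mmd_squared(source, target_far))   # large: the sets are well separated
```

A small value indicates that the two feature sets are hard to tell apart under this kernel; discrepancy-based adaptation adds a term of this kind to the training loss.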

Utilising multiple modalities (image and audio) for UDA has recently been investigated for bird image retrieval. Multiple adversarial discriminators are trained, on single modalities as well as on a mid-level fusion, and a cross-modality attention is learnt. The work shows the advantages of multi-modal domain adaptation in contrast to single-modality adaptation, though in their work both modalities demonstrate similar robustness to the domain shift.

Very recently, self-supervised learning has been proposed as a domain adaptation approach. In one work, it is used as an auxiliary task, by jigsaw-shuffling image patches and predicting their permutations over multiple source domains. In another, self-supervision was shown to replace adversarial training using tasks such as predicting rotation and translation for object recognition. In the same work, self-supervision was shown to benefit adversarial training when jointly trained for semantic segmentation. Both works only use a single image. Our work utilises the multiple modalities offered by video, showing that self-supervision can be used to adapt action recognition models to target domains.

Supervised Action Recognition. Convolutional networks are state of the art for action recognition, with the first seminal works using either 3D or 2D convolutions. Both utilise a single modality: appearance information from RGB frames. Simonyan and Zisserman address the lack of motion features captured by these architectures, proposing two-stream late fusion that learns separate features from the Optical Flow and RGB modalities, outperforming single-modality approaches.

Subsequent architectures have focused on modelling longer temporal structure, through consensus of predictions over time as well as inflating CNNs to 3D convolutions, all using the two-stream approach of late-fusing RGB and Flow. The latest architectures have focused on reducing the high computational cost of 3D convolutions, yet still show improvements when reporting results of two-stream fusion.

Self-supervision for Action Recognition. Self-supervision methods learn representations from the temporal and multi-modal structure of video, leveraging pretraining on a large corpus of unlabelled videos. Methods exploiting the temporal consistency of video have predicted the order of a sequence of frames or the arrow of time. Alternatively, the correspondence between multiple modalities has been exploited for self-supervision, particularly with audio and RGB. These works predict whether modalities correspond or are synchronised. We test both approaches for self-supervision in our UDA approach.

Domain Adaptation for Action Recognition. Of the several domain shifts in action recognition, only one has received significant research attention, namely the problem of cross-viewpoint (or viewpoint-invariant) action recognition. These works focus on adapting to the geometric transformations of a camera but do little to combat other shifts, like changes in environment. Works utilise supervisory signals such as skeleton or pose, and corresponding frames from multiple viewpoints. Recent works have used GRLs to create a view-invariant representation. Though several modalities (RGB, flow and depth) have been investigated, these were aligned and evaluated independently.

In contrast, UDA for changes in environment has received limited recent attention. Before deep learning, UDA for action recognition used shallow models to align source and target distributions of handcrafted features. Three recent works attempted deep UDA. These apply GRL adversarial training to C3D, TRN or both architectures. Jamal et al.’s approach outperforms shallow methods that use subspace alignment. Chen et al. show that attending to the temporal dynamics of videos can improve alignment. Pan et al. use a cross-domain attention module to avoid uninformative frames. Two of these works use RGB only, while the third reports results on RGB and Flow; however, modalities are aligned independently and only fused during inference. The approaches are evaluated on 5-7 pairs of domains from subsets of coarse-grained action recognition and gesture datasets, for example aligning UCF to Olympics. We evaluate on 6 pairs of domains. Compared to , we use 3.8 $\times$ more training and 2 $\times$ more testing videos.

The EPIC-Kitchens dataset for fine-grained action recognition released two distinct test sets: one with seen kitchens and another with unseen (novel) kitchens. In the 2019 challenges report, all participating entries exhibit a drop in action recognition accuracy of 12-20% when testing their models on novel environments compared to seen environments. To our knowledge, no previous effort has applied UDA to this or any other fine-grained action dataset.

In this work, we present the first approach to multi-modal UDA for action recognition, tested on fine-grained actions. We combine adversarial training on multiple modalities with a modality correspondence self-supervision task. This utilises the differing robustness to domain shifts between the modalities. Our method is detailed next.

Proposed Method

This section outlines our proposed action recognition domain adaptation approach, which we call Multi-Modal Self-Supervised Adversarial Domain Adaptation (MM-SADA). In Fig. 3, we present an overview of MM-SADA, visualised for action recognition using two modalities: RGB and Optical Flow. We incorporate a self-supervision alignment classifier, $C$, that determines whether modalities are sampled from the same or different actions, to learn modality correspondence. This takes in the concatenated features from both modalities, without any labels. Learning the correspondence on source and target encourages features that generalise to both domains. Aligning the domain statistics is achieved by adversarial training, with a domain discriminator per modality that predicts the domain. A Gradient Reversal Layer (GRL) reverses and backpropagates the gradient to the features. Both alignment techniques are trained on source and unlabelled target data, whereas the action classifier is only trained with labelled source data.

We next detail MM-SADA, generalised to any two or more modalities. We start by revisiting the problem of domain adaptation and outlining multi-stream late fusion, then we describe our adaptation approach.

A domain is a distribution over the input population $X$ and the corresponding label space $Y$. The aim of supervised learning, given labelled samples $\{(x,y)\}$, is to find a representation, $G(\cdot)$, over some learnt features, $F(\cdot)$, that minimises the empirical risk, $\mathop{E_{\mathbf{S}}}[\mathcal{L}_{y}(G(F(x)),y)]$. The empirical risk is optimised over the labelled source domain, $\mathbf{S}=\{X^{s},Y^{s},\mathcal{D}^{s}\}$, where $\mathcal{D}^{s}$ is a distribution of source domain samples. The goal of domain adaptation is to minimise the risk on a target domain, $\mathbf{T}=\{X^{t},Y^{t},\mathcal{D}^{t}\}$, where the distributions in the source and target domains are distinct, ${\mathcal{D}^{s}\neq\mathcal{D}^{t}}$. In UDA, the label space $Y^{t}$ is unknown, thus methods minimise both the source risk and the distribution discrepancy between the source and target domains.

3.2 Multi-modal Action Recognition

When the input is multi-modal, i.e. ${X=(X^{1},\cdots,X^{M})}$ where $X^{m}$ is the $m^{th}$ modality of the input, fusion of modalities can be employed. Most commonly, late fusion is implemented, where we sum prediction scores from modalities and backpropagate the error to all modalities, i.e.:

$$\mathcal{L}_{y}=\sum_{x\in\mathbf{S}}-y\log P(x)\quad\text{with}\quad P(x)=\sigma\Big(\sum_{m=1}^{M}G^{m}(F^{m}(x^{m}))\Big) \tag{1}$$

where $G^{m}$ is the modality’s task classifier, and $F^{m}$ is the modality’s learnt feature extractor. The consensus of modality classifiers is trained by a cross entropy loss, $\mathcal{L}_{y}$, between the task label, $y$, and the prediction, $P(x)$. $\sigma$ is defined as the softmax function. Training for classification expects the presence of labels and thus can only be applied to the labelled source input.
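The late-fusion prediction described above can be sketched in a few lines. This is an illustrative example only: the class scores are made-up numbers standing in for the per-modality classifier outputs, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_probs(per_modality_scores):
    """Sum class scores across modalities, then apply softmax."""
    fused = [sum(column) for column in zip(*per_modality_scores)]
    return softmax(fused)


def cross_entropy(probs, label):
    """Cross-entropy loss for a single example with an integer class label."""
    return -math.log(probs[label])


# Made-up scores for three classes from two modality streams.
rgb_scores = [2.0, 0.5, -1.0]
flow_scores = [0.5, 1.5, -0.5]

probs = late_fusion_probs([rgb_scores, flow_scores])
loss = cross_entropy(probs, label=0)
print(probs, loss)
```

Because the softmax is applied after the sum, a single cross-entropy loss sends gradients back to every modality stream.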

3.3 Within-Modal Adversarial Alignment

Both generative and discriminative adversarial approaches have been proposed for bridging the distribution discrepancy between source and target domains. Discriminative approaches are most appropriate for the high-dimensional input data present in video. Generative adversarial training requires a huge amount of training data, and temporal dynamics are often difficult to reconstruct. Discriminative methods train a discriminator, $D(\cdot)$, to predict the domain of an input (i.e. source or target), from the learnt features, $F(\cdot)$. By maximising the discriminator loss with respect to the feature extractor, the network learns a feature representation that is invariant to both domains.
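The opposing objectives of the discriminator and the feature extractor can be illustrated with a toy example. The sketch below assumes a scalar feature extractor $f = w \cdot x$, a logistic discriminator $p = \mathrm{sigmoid}(v \cdot f)$, and the label convention 0 = source, 1 = target; all of these, along with the names and numbers, are assumptions for illustration rather than the architecture used in this work. Gradients are derived by hand, and the sign flip plays the role of a gradient reversal layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_bce(p, d):
    """Binary cross-entropy for domain label d (assumed 0 = source, 1 = target)."""
    return -(d * math.log(p) + (1 - d) * math.log(1.0 - p))


def grl_gradients(x, d, w, v, lam=1.0):
    """Loss and gradients for feature f = w * x and discriminator sigmoid(v * f)."""
    f = w * x
    p = sigmoid(v * f)
    dz = p - d                   # derivative of the loss w.r.t. the logit
    grad_v = dz * f              # discriminator gradient, used unchanged
    grad_f = dz * v              # gradient arriving at the feature
    grad_w = -lam * grad_f * x   # reversal: sign flipped before the extractor
    return domain_bce(p, d), grad_v, grad_w


x, d, w, v, lr = 1.0, 1, 0.5, 0.8, 0.1
loss, grad_v, grad_w = grl_gradients(x, d, w, v)

# A descent step on v lowers the domain loss (better discriminator) ...
loss_after_disc_step = grl_gradients(x, d, w, v - lr * grad_v)[0]
# ... while the same descent step on w, using the reversed gradient, raises it.
loss_after_feat_step = grl_gradients(x, d, w - lr * grad_w, v)[0]
print(loss_after_disc_step, loss, loss_after_feat_step)
```

With a single optimiser step applied to all parameters, the discriminator learns to separate the domains while the features move towards making them indistinguishable.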

For aligning multi-modal video data, we propose using a domain discriminator per modality that penalises domain-specific features from each modality’s stream. Aligning modalities separately avoids the easy solution in which the network focuses only on the less robust modality when classifying the domain. Each separate domain discriminator, $D^{m}$, is thus used to train the modality’s feature representation $F^{m}$. Given a binary domain label, $d$, indicating if an example $x\in\textbf{S}$ or $x\in\textbf{T}$, the domain discriminator loss, for modality $m$, is defined as,

$$\mathcal{L}_{d}^{m}=\sum_{x\in\{\mathbf{S},\mathbf{T}\}}-d\log\big(D^{m}(F^{m}(x^{m}))\big)-(1-d)\log\big(1-D^{m}(F^{m}(x^{m}))\big) \tag{2}$$

3.4 Multi-Modal Self-Supervised Alignment

Prior approaches to domain adaptation have mostly focused on images and thus have not explored the multi-modal nature of the input data. Videos are multi-modal, where corresponding modalities are present in both source and target. We thus propose a multi-modal self-supervised task to align domains. Multi-modal self-supervision has been successfully exploited as a pretraining strategy. However, we show that self-supervision for both source and target domains can also align domains.

We learn the temporal correspondence between modalities as a self-supervised binary classification task. For positive examples, indicating that modalities correspond, we sample modalities from the same action. These could be from the same time, or different times within the same action. For negative examples, each modality is sampled from a different action. The network is thus trained to determine if the modalities correspond. This is optimised over both domains. A self-supervised correspondence classifier head, $C$, is used to predict if modalities correspond. This shares the same modality feature extractors, $F^{m}$, as the action classifier. It is important that $C$ is as shallow as possible so that most of the self-supervised representation is learned in the feature extractors. Given a binary label defining if modalities correspond, $c$, for each input, $x$, and concatenated features of the multiple modalities, we calculate the multi-modal self-supervision loss as follows:

$$\mathcal{L}_{c}=\sum_{x\in\{\mathbf{S},\mathbf{T}\}}-c\log C\big(F^{1}(x^{1}),\ldots,F^{M}(x^{M})\big) \tag{3}$$

3.5 Proposed MM-SADA

We define the Multi-Modal Self-Supervised Adversarial Domain Adaptation (MM-SADA) approach as follows. The classification loss, $\mathcal{L}_{y}$, is jointly optimised with the adversarial and self-supervised alignment losses. The within-modal adversarial alignment is weighted by $\lambda_{d}$, and the multi-modal self-supervised alignment is weighted by $\lambda_{c}$. Optimising both alignment strategies achieves benefits in matching source and target statistics and learning cross-modal relationships transferable to the target domain.

$$\mathcal{L}=\mathcal{L}_{y}+\lambda_{d}\sum_{m}\mathcal{L}_{d}^{m}+\lambda_{c}\mathcal{L}_{c} \tag{4}$$

Note that the first loss $\mathcal{L}_{y}$ is only optimised for labelled source data, while the alignment losses $\forall m:\mathcal{L}_{d}^{m}$ and $\mathcal{L}_{c}$ are optimised for both unlabelled source and target data.
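As a worked example of the joint objective described above, the sketch below combines a classification loss, one adversarial loss per modality and the self-supervised loss. The default weights are the values $\lambda_{d}=1$ and $\lambda_{c}=5$ reported in the experiments; the individual loss values and the function name are made up for illustration.

```python
def mm_sada_objective(loss_y, losses_d, loss_c, lambda_d=1.0, lambda_c=5.0):
    """Weighted sum of classification, per-modality adversarial and
    self-supervised correspondence losses."""
    return loss_y + lambda_d * sum(losses_d) + lambda_c * loss_c


# Hypothetical loss values for one training step.
loss_y = 0.9                 # classification, labelled source only
losses_d = [0.69, 0.65]      # domain losses for RGB and Flow
loss_c = 0.4                 # modality correspondence, source and target

total = mm_sada_objective(loss_y, losses_d, loss_c)
# 0.9 + 1.0 * (0.69 + 0.65) + 5.0 * 0.4, i.e. approximately 4.24
print(total)

# Setting lambda_d to zero keeps only classification and self-supervision.
print(mm_sada_objective(loss_y, losses_d, loss_c, lambda_d=0.0))
```

Since $\lambda_{c}$ is the larger weight here, the correspondence term can dominate the sum even when its raw value is smaller than the other losses.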

Experiments and Results

This section first discusses the dataset, architecture, and implementation details in Sec. 4.1. We compare against baseline methods noted in Sec. 4.2. Results are presented in Sec. 4.3, followed by an ablation study of the method’s components in Sec. 4.4 and qualitative results including feature space visualisations in Sec. 4.5.

Dataset. Our previous work, EPIC-Kitchens, offers a unique opportunity to test domain adaptation for fine-grained action recognition, as it is recorded in 32 environments. Similar to previous works for action recognition, we evaluate on pairs of domains. We select the three largest kitchens, by number of training action instances, to form our domains. These are P01, P22, P08, which we refer to as D1, D2 and D3, respectively (Fig. 4).

We analyse the performance for the 8 largest action classes (‘put’, ‘take’, ‘open’, ‘close’, ‘wash’, ‘cut’, ‘mix’, and ‘pour’), which form 80% of the training action segments for these domains. This ensures sufficient examples per domain and class, without balancing the training set. The label imbalance of these 8 classes is depicted in Fig. 4 (middle), which also shows the differing distribution of classes between the domains. Most domain adaptation works evaluate on balanced datasets, with few using imbalanced datasets. EPIC-Kitchens has a large class imbalance, offering additional challenges for domain adaptation. The number of action segments in each domain is specified in Fig. 4 (bottom), where a segment is a labelled start/end time with an action label.

Architecture. We train all our models end-to-end. We use the inflated 3D convolutional architecture (I3D) as our backbone for feature extraction, one per modality ($F^{m}$). In this work, $F$ convolves over a temporal window of 16 frames. In training, a single temporal window is randomly sampled from within the action segment at each iteration. In testing, we use an average over 5 temporal windows, equidistant within the segment. We use the publicly provided RGB and Optical Flow frames. The output of $F$ is the result of the final average pooling layer of I3D, with 1024 dimensions. $G$ is a single fully connected layer with softmax activation to predict class labels. Each domain discriminator $D^{m}$ is composed of 2 fully connected layers with a hidden layer of 100 dimensions and a ReLU activation function. A dropout rate of 0.5 was used on the output of $F$, and a weight decay of $10^{-7}$ on all parameters. Batch normalisation layers are used in $F^{m}$ and are updated with target statistics for testing, as in AdaBN. We apply random crops, scale jitters and horizontal flips for data augmentation. During testing, only centre crops are used. The self-supervised correspondence function $C$ (Eq. 3) is implemented as 2 fully connected layers of 100 dimensions and a ReLU activation function. The features from both modalities are concatenated along the channel dimension as input to $C$.

Training and Hyper-parameter Choice. We train using the Adam optimiser in two stages. First, the network is trained with only the classification and self-supervision losses, $\mathcal{L}_{y}+\lambda_{c}\mathcal{L}_{c}$, at a learning rate of $1e-2$ for 3K iterations. Then, the overall loss function (Eq. 4) is optimised, applying the domain adversarial losses $\mathcal{L}^{m}_{d}$, and reducing the learning rate to $2e-4$ for a further 6K iterations. The self-supervision hyper-parameter, $\lambda_{c}=5$, was chosen by observing the performance on the labelled source domain only, i.e. this has not been optimised for the target domain. Note that while training with self-supervision, half the batch contains corresponding modalities and the other half non-corresponding modalities. Only source examples with corresponding modalities are used to train for action classification. The domain adversarial hyper-parameter, $\lambda_{d}=1$, was chosen arbitrarily; we show that the results are robust to some variations in this hyper-parameter in an ablation study. Batch size was set to 128, split equally between source and target samples. On average, training takes 9 hours on an NVIDIA DGX-1 with 8 V100 GPUs.
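The two-stage schedule can be summarised as a small lookup function. The sketch below uses the iteration counts, learning rates and loss weights stated above; representing stage 1 by setting the adversarial weight to zero, the 0-indexed iteration counter and the function name are assumptions for illustration.

```python
def training_stage(iteration, lambda_c=5.0, lambda_d=1.0):
    """Return (learning_rate, lambda_c, lambda_d) for a 0-indexed iteration."""
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    if iteration < 3000:
        # Stage 1: classification and self-supervision only.
        return 1e-2, lambda_c, 0.0
    if iteration < 9000:
        # Stage 2: full objective with the adversarial losses switched on.
        return 2e-4, lambda_c, lambda_d
    raise ValueError("training ends after 9000 iterations")


for it in (0, 2999, 3000, 8999):
    print(it, training_stage(it))
```

Keeping the schedule in one place makes it straightforward to vary $\lambda_{d}$ for an ablation without touching the training loop.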

4.2 Baselines

For all results, we report the top-1 target accuracy averaged over the last 9 epochs of training, for robustness. We first evaluate the impact of domain shift between source and target by testing using a multi-modal source-only model (MM source-only), trained with no access to unlabelled target data. Additionally, we compare to 3 baselines for unsupervised domain adaptation as follows:

AdaBN: Batch Normalisation layers are updated with target domain statistics.

Maximum Mean Discrepancy (MMD): The multiple kernel implementation of the commonly used domain discrepancy measure MMD is used as a baseline. This directly replaces the adversarial alignment with separate discrepancy measures applied to individual modalities.

Maximum Classifier Discrepancy (MCD): Alignment through classifier disagreement is used. We use two multi-modal classification heads as separate classifiers. The classifiers are trained to maximise prediction disagreement on the target domain, implemented as an L1 loss, finding examples out of support from the source domain. We use a GRL to optimise the feature extractors.
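For the classifier-discrepancy baseline, the disagreement between the two heads can be measured as an L1 distance between their class probabilities. The sketch below averages the absolute differences over classes; the averaging, the function name and the example probabilities are assumptions for illustration.

```python
def l1_discrepancy(probs_a, probs_b):
    """Mean absolute difference between two classifiers' class probabilities."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both classifiers must predict the same classes")
    return sum(abs(a - b) for a, b in zip(probs_a, probs_b)) / len(probs_a)


# Hypothetical target-domain predictions from two classification heads.
head_1 = [0.7, 0.2, 0.1]
head_2_agree = [0.7, 0.2, 0.1]
head_2_disagree = [0.1, 0.2, 0.7]

print(l1_discrepancy(head_1, head_2_agree))     # no disagreement
print(l1_discrepancy(head_1, head_2_disagree))  # strong disagreement
```

The classification heads are trained to increase this quantity on target samples, while the feature extractors, through the GRL, are trained to decrease it.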

Additionally, as an upper limit, we report the supervised target domain results, obtained from a model trained on labelled target data. We highlight these results in the table to avoid confusion.

4.3 Results

First we compare our proposed method MM-SADA to the various domain alignment techniques in Table 1. We show that our method outperforms batch-based (by 3.1%), classifier discrepancy (by 3%) and discrepancy minimisation alignment (by 3.5%) methods. The improvement is consistent for all pairs of domains. Additionally, it significantly improves on the source-only baseline by up to 7.5% in 5 out of 6 cases. For a single case, $D3\rightarrow D2$ , all baselines under-perform compared to source-only. Ours has a slight drop (-0.2%) but outperforms other alignment approaches. We will revisit this case in the ablation study.

Figure 5 shows the top-1 accuracy on the target during training (solid lines) vs source-only training without domain adaptation (dotted lines). Training without adaptation has consistently lower accuracy, except for our failure case $D3\rightarrow D2$. The curves show the stability and robustness of our method during training, with minimal fluctuations due to stochastic optimisation on batches. This is essential for UDA as no target labels can be used for early stopping.

4.4 Ablation Study

Next, we compare the individual contributions of different components of MM-SADA. We report these results in Table 2. The self-supervised component on its own gives a $2.4\%$ improvement over no adaptation. This shows that self-supervision can learn features common to both source and target domains, adapting the domains. Importantly, this on average outperforms the three baselines in Table 1. Adversarial alignment per modality gives a further $2.4\%$ improvement as this encourages the source and target distributions to overlap, removing domain-specific features from each modality. Compared to adversarial alignment only, our method improves in 5 of the 6 domain pairs, by up to 3.2%.

For the single pair noted earlier, $D3\rightarrow D2$, self-supervision alone outperforms source-only and all other methods reported in Table 1 by 1.1%. However, when combined with domain adaptation using $\lambda_{d}=1$, the overall performance of MM-SADA reported in Table 1 cannot beat the baseline. In Table 2, we show that when halving the contribution of the adversarial component to $\lambda_{d}=0.5$, MM-SADA can achieve 56.9%, outperforming the source-only baseline. Therefore, self-supervision can improve performance where marginal alignment domain adaptation techniques fail.

Figure 6 plots the performance of MM-SADA as $\lambda_{d}$ changes. Note that $\lambda_{c}$ can be chosen by observing the performance of self-supervision on source-domain labels, while $\lambda_{d}$ requires access to target data. We show that our approach is robust to various values of $\lambda_{d}$, with even higher accuracy at $\lambda_{d}=0.75$ than that reported in Table 2.

Table 3 shows the impact of our method on the performance of the modalities individually. Predictions are taken from each modality separately before late fusion. RGB, the less robust modality, benefits most from MM-SADA, improving over source-only by $6.5\%$ on average, whereas Flow improves by $1.6\%$ . The inclusion of multi-modal self-supervision provides $2.4\%$ and $1.1\%$ improvements for RGB and Flow, compared to only using adversarial alignment. This shows the benefit of employing self-supervision from multiple modalities during alignment.

We also compare two approaches for multi-modal self-supervision in Table 4. The first, which has been used to report all results above, learns the correspondence of RGB and Flow within the same action segment. We refer to this as ‘Seg. Corr.’. The second learns the correspondence only from time-synchronised RGB and Flow data, which we call ‘Sync’. The two approaches are comparable in performance overall, with no difference on average over the domain pairs. This shows the potential to use a number of multi-modal self-supervision tasks for alignment.
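The two ways of drawing positive pairs can be contrasted in a short sampling routine. The sketch below makes several assumptions that are not stated in the text: segments are given as half-open frame ranges, one per action; frame indices identify what each modality reads; and negatives are built identically in both modes, by drawing each modality from a different action. The function names are hypothetical.

```python
import random


def segment_index(frame, segments):
    """Index of the action segment containing a frame."""
    for i, (start, end) in enumerate(segments):
        if start <= frame < end:
            return i
    raise ValueError("frame is outside every segment")


def sample_pair(segments, mode="seg", positive=True, rng=random):
    """Return (rgb_frame, flow_frame, label) for the correspondence task.

    mode="seg":  positives share an action segment, possibly at different times.
    mode="sync": positives use exactly the same time for both modalities.
    """
    if positive:
        start, end = rng.choice(segments)
        t_rgb = rng.randrange(start, end)
        t_flow = t_rgb if mode == "sync" else rng.randrange(start, end)
        return t_rgb, t_flow, 1
    # Negative: each modality comes from a different action segment.
    seg_rgb, seg_flow = rng.sample(segments, 2)
    return rng.randrange(*seg_rgb), rng.randrange(*seg_flow), 0


segments = [(0, 50), (100, 160), (300, 340)]
print(sample_pair(segments, mode="seg", positive=True))
print(sample_pair(segments, mode="sync", positive=True))
print(sample_pair(segments, positive=False))
```

Drawing half of each batch with `positive=True` and half with `positive=False` would give the balanced batches described in the training details.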

4.5 Qualitative Results

Figure 7 shows the t-SNE visualisation of the RGB (left) and Flow (right) feature spaces $F^{m}$. Several observations are worth noting from this figure. First, Flow shows higher overlap between source and target features pre-alignment (first row). This shows that Flow is more robust to environmental changes. Second, self-supervision alone (second row) changes the feature space by separating the features into clusters that are potentially class-relevant. This is most evident for $D3\rightarrow D1$ on the RGB modality (second row, third column). However, on its own this feature space still shows domain gaps, particularly for RGB features. Third, our proposed MM-SADA (third row) aligns the marginal distributions of source and target domains.

Conclusion and Future Work

We proposed a multi-modal domain adaptation approach for fine-grained action recognition utilising multi-modal self-supervision and adversarial training per modality. We show that the self-supervision task of predicting the correspondence of multiple modalities is an effective domain adaptation method. On its own, this can outperform domain alignment methods, by jointly optimising for the self-supervised task over both domains. Together with adversarial training, the proposed approach outperforms non-adapted models by $4.8\%$. We conclude that aligning individual modalities whilst learning a self-supervision task on source and target domains can improve the ability of action recognition models to transfer to unlabelled environments.

Future work will focus on utilising more modalities, such as audio, to aid domain adaptation as well as exploring additional self-supervised tasks for adaptation, trained individually as well as for multi-task self-supervision.

Acknowledgement. Research supported by EPSRC LOCATE (EP/N033779/1) and EPSRC Doctoral Training Partnerships (DTP). The authors acknowledge and value the use of the EPSRC funded Tier 2 facility, JADE.