Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu

Introduction

Inspired by the success of BERT on natural language understanding, there has been surging research interest in developing multimodal pre-training methods for vision-and-language representation learning (e.g., ViLBERT, LXMERT, and UNITER). When finetuned on downstream tasks, these pre-trained models have achieved state-of-the-art performance across diverse V+L tasks, such as Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), and Referring Expression Comprehension. However, due to the immense capacity of large-scale pre-trained models and the limited amount of labeled data in downstream tasks, aggressive finetuning often falls into the overfitting trap. Adversarial training, a method to combat adversarial attacks in order to create robust neural networks, has recently shown great potential in improving the generalization ability of pre-trained language models and image classifiers. A natural question arises: can we apply similar adversarial training techniques to V+L problems to improve model performance?

We propose Villa (Vision-and-Language Large-scale Adversarial training), which advocates the use of adversarial training for V+L representation learning. As illustrated in Figure 1, Villa consists of two training stages: ( $i$ ) task-agnostic adversarial pre-training (APT); followed by ( $ii$ ) task-specific adversarial fine-tuning (AFT). Intuitively, if well-designed, multimodal pre-training tasks such as image-conditioned masked language modeling and image-text matching can resonate well with many downstream tasks that require visual grounding and reasoning abilities. This leads to our hypothesis that the improved generalization ability of pre-trained models learned during APT stage can be readily transferred to the AFT stage for diverse tasks. In other words, APT is able to uniformly lift model performance for all downstream tasks in a task-agnostic way, while AFT can further enhance the finetuned models by leveraging task-specific supervision signals.

To bring in more flexibility in generating adversarial examples for robust training, we propose to perform adversarial training on the embedding level for multi-modalities, instead of operating on the image pixel and sub-word token level as in conventional practice. For the text modality, we add adversarial perturbations to word embeddings. For the image modality, most previous work observes that robustness is at odds with generalization, i.e., trained models are able to resist adversarial attacks at the expense of performance on clean images. Distinct from these studies, we directly add adversarial perturbations to extracted image-region features, as our end goal is the final V+L model performance rather than crafting adversarial image examples. Experiments show that this strategy leads to a large performance gain on clean inputs.

The adversarial training procedure is time-consuming and computationally expensive. To power efficient large-scale training, we adopt the recently proposed “free” adversarial training strategy, which obtains the gradients of parameters with almost no extra cost when computing the gradients of inputs. In addition to requiring adversarial perturbations to be label-preserving, we also introduce KL-divergence-based regularization that encourages the confidence levels of the predictions on clean and perturbed inputs to be close, as characterized by the “dark” knowledge hidden in the probability vectors. This promotes higher smoothness of the training objective and has empirically proven to be an important regularizer that further boosts performance.

For evaluation, we mostly focus on UNITER , the current best-performing V+L model with state-of-the-art performance across many popular V+L benchmarks, and enhance UNITER with Villa through comprehensive experiments on six V+L tasks: VQA , VCR , NLVR2 , Visual Entailment , Referring Expression Comprehension , and Image-Text Retrieval . Villa is a generic framework that can be applied to any multimodal pre-training method. To demonstrate its versatility, we further apply it to LXMERT on VQA, GQA , and NLVR2 tasks for generalizability test.

The main contributions are summarized as follows. ( $i$ ) We present Villa, the first known effort on adversarial pre-training and adversarial finetuning for V+L representation learning. ( $ii$ ) Instead of operating on pixel and word token level, we propose to add adversarial perturbations in the embedding space of multi-modalities, and introduce a smoothness-inducing adversarial regularization term on top of the “free” adversarial training strategy. ( $iii$ ) Villa achieves new state of the art across six popular V+L tasks. In particular, by relying on standard bottom-up image features only , Villa improves the single-model performance of UNITER-large from 74.02 to 74.87 on VQA, and from 62.8 to 65.7 on VCR. With ensemble, VQA performance is further boosted to 75.85.

Related Work

Multimodal Pre-training ViLBERT and LXMERT are the pioneering works in vision+language pre-training, where two Transformers are used to encode image and text modalities, respectively, then a third Transformer is built on top for multimodal fusion. Compared to this two-stream architecture, recent work such as VL-BERT, VisualBERT, B2T2, Unicoder-VL and UNITER advocates a single-stream model design, where two modalities are directly fused at an early stage. More recent studies leverage multi-task learning to enhance finetuning and use detected image tags to further enhance pre-training. Pixel-BERT proposes to align text with image pixels instead of conventional bottom-up features. Multimodal pre-training has brought leaping advances in vision+language understanding tasks such as VQA and VCR, with great potential in extending to visual captioning, visual dialog, vision-language navigation, as well as video-and-language representation learning. Recent work also investigates the design of probing tasks to understand the knowledge learned in pre-training.

V+L Representation Learning Before multimodal pre-training dominated the scene, there had been a long line of studies on how to learn better V+L representations. Prominent work includes: ( $i$ ) advanced attention mechanisms ; ( $ii$ ) better multimodal fusion methods ; ( $iii$ ) multi-step reasoning ; ( $iv$ ) incorporation of object relations ; and ( $v$ ) neural module networks for compositional reasoning . In principle, our proposed Villa framework can be plugged into these “shallower” models. In this paper, we mainly focus on enhancing Transformer-based state-of-the-art models.

Adversarial Training Adversarial machine learning is an active research area. Algorithms are developed to either attack existing models by constructing adversarial examples, or train robust models to defend against adversarial attacks. Among existing defense approaches, adversarial training (AT) is a general strategy to empower models with state-of-the-art robustness in different settings. Existing research mostly focuses on AT for image classification, and the general notion is that robustness is often at odds with accuracy. Most recently, it has been shown that model accuracy on clean images can be improved if a separate auxiliary batch norm is used for adversarial examples. There are also some parallel studies on applying AT to language modeling and natural language understanding. Due to the growing dominance of large-scale pre-training, very recent work has started to explore adversarial training in the pre-training stage. Villa is the first known effort that studies AT for V+L tasks and adds adversarial perturbations to both image and word embedding space. We also show empirically that AT can be effectively incorporated in both pre-training and fine-tuning stages. A more detailed discussion on related work is provided in Appendix.

Vision-and-Language Large-scale Adversarial Training

There are three key designs that encapsulate Villa’s unique strengths in improving performance and generalization of pre-trained V+L models : ( $i$ ) Adversarial pre-training and fine-tuning; ( $ii$ ) Adding perturbations in the embedding space; and ( $iii$ ) Enhanced adversarial training algorithm.

We first briefly review the pretrain-then-finetune paradigm that has become prevalent in V+L representation learning, then describe our proposed two-stage adversarial training framework.

Pre-training Let $\mathcal{D}_{p}$ denote a pre-training dataset, which consists of image-text pairs ( ${\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}$ ). The goal in the pre-training stage is to learn universal image-text representations that are generalizable to different downstream tasks. Take one-stream models as an example. Image and text inputs are first represented as low-dimensional feature vectors ${\boldsymbol{z}}_{img}=g_{bu}({\boldsymbol{x}}_{img})$ and ${\boldsymbol{z}}_{txt}=g_{emb}({\boldsymbol{x}}_{txt})$ , where $g_{bu}(\cdot)$ represents a fixed bottom-up image feature extractor , and $g_{emb}(\cdot)$ represents a learnable word embedding function. Then, a multi-layer Transformer is applied on top to learn multimodal fusion. The above process can be abbreviated as $\tilde{{\boldsymbol{z}}}_{img},\tilde{{\boldsymbol{z}}}_{txt},\tilde{{\boldsymbol{z}}}_{cls}=f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt})$ , where $\tilde{{\boldsymbol{z}}}_{img}$ and $\tilde{{\boldsymbol{z}}}_{txt}$ represent the contextualized representations of each image region and each textual token, respectively. Typically, V+L models employ a special [CLS] token whose embedding $\tilde{{\boldsymbol{z}}}_{cls}$ is considered as the joint V+L representation to be used for downstream tasks. ${\boldsymbol{\theta}}$ denotes all the learnable parameters including the word embedding matrix.

Let ${\boldsymbol{y}}$ denote the output supervision signal, which is different across different pre-training tasks. There are three typical pre-training tasks used in most V+L models: ( $i$ ) Masked Language Modeling (MLM): some tokens in ${\boldsymbol{x}}_{txt}$ are replaced by special [MASK] tokens, and the goal is to predict the masked tokens ${\boldsymbol{y}}$ based on surrounding multimodal context; ( $ii$ ) Masked Region Modeling (MRM): the features of some image regions in ${\boldsymbol{x}}_{img}$ are replaced by zero vectors, and the goal is to predict the masked image regions ${\boldsymbol{y}}$ given the remaining multimodal information (via cross-entropy loss, KL-divergence loss , or contrastive learning ); ( $iii$ ) Image-Text Matching (ITM): both ${\boldsymbol{x}}_{img}$ and ${\boldsymbol{x}}_{txt}$ are kept intact, and the goal is to predict a binary label ${\boldsymbol{y}}$ to judge whether the input image and text are paired or not.
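The input corruption behind MLM and MRM can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mask_tokens` and `mask_regions` are hypothetical, and the masking rate of 0.15 is an assumed default, since the text above does not state one.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, prob=0.15, rng=None):
    """MLM-style corruption: replace some tokens with [MASK].

    Returns the corrupted sequence and a dict {position: original token},
    which plays the role of the supervision signal y.
    `prob` is an assumed masking rate, not a value taken from the paper.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

def mask_regions(regions, prob=0.15, rng=None):
    """MRM-style corruption: replace some region feature vectors with zeros.

    Returns the corrupted features and a dict {position: original feature}.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, targets = [], {}
    for i, feat in enumerate(regions):
        if rng.random() < prob:
            corrupted.append([0.0] * len(feat))
            targets[i] = list(feat)
        else:
            corrupted.append(list(feat))
    return corrupted, targets
```

ITM needs no corruption of either input; its label only records whether the image and the text were sampled as a matching pair.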

Finetuning Given a downstream task $\mathcal{T}_{f}$ and a supervised dataset $\mathcal{D}_{f}$ consisting of $({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt},{\boldsymbol{y}})$ , the pre-trained model can be finetuned by introducing a small neural network $h(\cdot)$ on top of $\tilde{{\boldsymbol{z}}}_{cls}$ and minimizing the cross-entropy loss. ${\boldsymbol{\theta}}$ is initialized with pre-trained weights, and ${\boldsymbol{y}}$ now becomes a label. For example, in VQA, ${\boldsymbol{y}}$ corresponds to the ground-truth answer from a candidate pool, represented as a one-hot vector. In VCR , it is a four-way classification label.
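As a rough illustration of the finetuning setup, the sketch below puts a task head on top of the joint representation and scores it with cross-entropy. The single linear layer `linear_head` is an assumption made for brevity; the text only says that $h(\cdot)$ is a small neural network, so the actual head may be deeper.

```python
import math

def linear_head(z_cls, weights, bias):
    """Hypothetical h(.): one linear layer mapping z_cls to class logits."""
    return [sum(w * z for w, z in zip(row, z_cls)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

# Toy four-way example (as in VCR): an untrained head with zero weights
# gives uniform predictions, so the loss equals log(4).
z_cls = [0.3, -1.2, 0.7]
weights = [[0.0, 0.0, 0.0] for _ in range(4)]
bias = [0.0] * 4
loss = cross_entropy(linear_head(z_cls, weights, bias), label=2)
```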

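In both pre-training and finetuning, by instantiating different ${\boldsymbol{y}}$, the training process can be uniformly abstracted as an empirical risk minimization problem:

$$\min_{{\boldsymbol{\theta}}}\mathbb{E}_{({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt},{\boldsymbol{y}})\sim\mathcal{D}}\Big[L(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}),{\boldsymbol{y}})\Big]\,, \qquad (1)$$

where $L$ denotes the task loss (e.g., cross-entropy) and $\mathcal{D}$ stands for either $\mathcal{D}_{p}$ or $\mathcal{D}_{f}$.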

Two-stage Adversarial Training Pre-training and finetuning are inherently connected. Independent of the tasks (e.g., MLM, ITM for pre-training, or VQA for finetuning), model training requires the acquisition of essential reasoning skills that can catalyze multimodal fusion for cross-modality joint understanding. For example, in MLM, a masked token ‘dog’ can be predicted by looking at the image region that contains a dog; and in VQA, when asked whether there is a dog in an image, such visual grounding skills learned through pre-training can be readily applied. We hypothesize that: ( $i$ ) by performing adversarial training in the pre-training stage, the improved generalization ability of a learned model can be beneficial to the finetuning stage; and ( $ii$ ) in the subsequent finetuning stage, where task-specific training signals become available, adversarial finetuning can be applied again to further boost performance. Since pre-training and finetuning share the same mathematical formulation (Eqn. (1)), the same AT algorithm can be adopted in both stages.

Perturbations in the Embedding Space

For the image modality, since state-of-the-art V+L models typically use image features from pre-trained object detectors as input, we add adversarial perturbations in the feature space directly. Note that even though the main difference is simply the noise-injecting space, our approach is distinct from most previous work where perturbations are applied to the pixel space, which is more rigid than fine-grained embedding perturbation. On the other hand, unlike image pixels that are continuous-valued, discrete tokens in the text modality are more difficult to manipulate. It remains unclear how to craft label-preserving adversarial examples without changing the original semantic meaning of the sentence. But since we only care about the ultimate effects of adversarial training on downstream tasks, not the interpretability of adversarial examples, we choose to add perturbations to the word embeddings, following prior work.

In pre-trained V+L models, positional embeddings are used to encode the location of image regions and sub-word tokens. Our adversaries only modify image and word embeddings, leaving other components of the multimodal features unchanged. Furthermore, due to the distinct characteristics of image and text modalities, we propose to add perturbations to one modality at a time. Specifically, we add adversarial perturbations ${\boldsymbol{\delta}}_{img}$ and ${\boldsymbol{\delta}}_{txt}$ such that the prediction becomes $\hat{{\boldsymbol{y}}}=f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img}+{\boldsymbol{\delta}}_{img},{\boldsymbol{x}}_{txt})$ and $\tilde{{\boldsymbol{y}}}=f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}+{\boldsymbol{\delta}}_{txt})$. To preserve the original semantics, the norms of ${\boldsymbol{\delta}}_{img}$ and ${\boldsymbol{\delta}}_{txt}$ are kept small. We also assume that the model prediction should not change after the perturbation.
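The "one modality at a time" scheme can be sketched as follows. This is an illustrative toy, assuming features are plain lists of row vectors; the helper names (`clip_to_ball`, `perturbed_inputs`) are hypothetical and not taken from any released code. Positional embeddings are simply not part of the inputs here, mirroring the fact that the adversary leaves them untouched.

```python
import math

def frobenius_norm(mat):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in mat for v in row))

def clip_to_ball(delta, eps):
    """Rescale delta so that its Frobenius norm does not exceed eps."""
    norm = frobenius_norm(delta)
    if norm <= eps:
        return [list(row) for row in delta]
    scale = eps / norm
    return [[v * scale for v in row] for row in delta]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def perturbed_inputs(z_img, z_txt, delta_img, delta_txt, eps):
    """Build the two adversarial inputs, each perturbing a single modality."""
    img_adv = (add(z_img, clip_to_ball(delta_img, eps)), z_txt)  # text stays clean
    txt_adv = (z_img, add(z_txt, clip_to_ball(delta_txt, eps)))  # image stays clean
    return img_adv, txt_adv
```

Each returned pair would be fed to the model separately, giving the two predictions $\hat{{\boldsymbol{y}}}$ and $\tilde{{\boldsymbol{y}}}$ defined above.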

“Free” Multimodal Adversarial Training

Training Objective In Villa, we use adversarial training as an effective regularization to improve model generalization, i.e., to minimize the following objective:

$$\min_{{\boldsymbol{\theta}}}\mathbb{E}_{({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt},{\boldsymbol{y}})\sim\mathcal{D}}\Big[\mathcal{L}_{std}({\boldsymbol{\theta}})+\mathcal{R}_{at}({\boldsymbol{\theta}})+\alpha\cdot\mathcal{R}_{kl}({\boldsymbol{\theta}})\Big]\,, \qquad (2)$$

where $\mathcal{L}_{std}({\boldsymbol{\theta}})=L(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}),{\boldsymbol{y}})$ is the cross-entropy loss on clean data, $\mathcal{R}_{at}({\boldsymbol{\theta}})$ is the label-preserving AT loss, and $\mathcal{R}_{kl}({\boldsymbol{\theta}})$ is a finer-grained adversarial regularization term. Specifically,

$$\mathcal{R}_{at}({\boldsymbol{\theta}})=\max_{||{\boldsymbol{\delta}}_{img}||\leq\epsilon}L(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img}+{\boldsymbol{\delta}}_{img},{\boldsymbol{x}}_{txt}),{\boldsymbol{y}})+\max_{||{\boldsymbol{\delta}}_{txt}||\leq\epsilon}L(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}+{\boldsymbol{\delta}}_{txt}),{\boldsymbol{y}})\,, \qquad (3)$$

where $L$ is the cross-entropy loss on adversarial embeddings. The Frobenius norm is used to constrain ${\boldsymbol{\delta}}_{img}$ and ${\boldsymbol{\delta}}_{txt}$. For optimization, prior work demonstrated that the outer minimization in Eqn. (2) can be solved by SGD, while the inner maximization in Eqn. (3) can be solved reliably by projected gradient descent (PGD), a standard method for large-scale constrained optimization. Take ${\boldsymbol{\delta}}_{img}$ for example: PGD takes the following step in each iteration (with step-size $\alpha$, which is the adversarial learning rate and is distinct from the regularization weight $\alpha$ in Eqn. (2)):

$${\boldsymbol{\delta}}_{img,t+1}=\Pi_{||{\boldsymbol{\delta}}_{img}||\leq\epsilon}\big({\boldsymbol{\delta}}_{img,t}+\alpha\, g({\boldsymbol{\delta}}_{img,t})/||g({\boldsymbol{\delta}}_{img,t})||_{F}\big)\,, \qquad (4)$$

where $g({\boldsymbol{\delta}}_{img,t})=\nabla_{{\boldsymbol{\delta}}_{img}}L(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img}+{\boldsymbol{\delta}}_{img},{\boldsymbol{x}}_{txt}),{\boldsymbol{y}})$ is the gradient of the loss w.r.t. ${\boldsymbol{\delta}}_{img}$, and $\Pi_{||{\boldsymbol{\delta}}_{img}||\leq\epsilon}$ performs a projection onto the $\epsilon$-ball.
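A single PGD update of this form is short enough to write out. The sketch below is a plain-Python illustration, assuming the gradient with respect to the perturbation has already been computed elsewhere; `pgd_step` is a hypothetical helper name, and the projection is the usual rescaling onto a Frobenius-norm ball.

```python
import math

def fro(mat):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in mat for v in row))

def pgd_step(delta, grad, step_size, eps):
    """One ascent step: normalized gradient step, then projection.

    delta, grad: equally shaped matrices (lists of rows).
    step_size:   the adversarial learning rate.
    eps:         radius of the Frobenius-norm ball.
    """
    g_norm = fro(grad)
    if g_norm > 0.0:
        # move along the unit-norm gradient direction to increase the loss
        delta = [[d + step_size * g / g_norm for d, g in zip(dr, gr)]
                 for dr, gr in zip(delta, grad)]
    else:
        delta = [list(row) for row in delta]
    d_norm = fro(delta)
    if d_norm > eps:
        # project back onto the eps-ball by rescaling
        delta = [[d * eps / d_norm for d in row] for row in delta]
    return delta
```

Because the gradient is normalized, the step length depends only on the step size, not on the scale of the loss.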

To further enhance the above AT algorithm, $\mathcal{R}_{kl}({\boldsymbol{\theta}})$ is defined as

$$\mathcal{R}_{kl}({\boldsymbol{\theta}})=\max_{||{\boldsymbol{\delta}}_{img}||\leq\epsilon}L_{kl}(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img}+{\boldsymbol{\delta}}_{img},{\boldsymbol{x}}_{txt}),f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}))+\max_{||{\boldsymbol{\delta}}_{txt}||\leq\epsilon}L_{kl}(f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}+{\boldsymbol{\delta}}_{txt}),f_{{\boldsymbol{\theta}}}({\boldsymbol{x}}_{img},{\boldsymbol{x}}_{txt}))\,, \qquad (5)$$

where $L_{kl}(p,q)=\mbox{KL}(p||q)+\mbox{KL}(q||p)$, $p,q$ denote the two probability distributions, and $\mbox{KL}(\cdot)$ denotes the Kullback-Leibler divergence. Compared to Eqn. (3), which promotes label-preserving adversarial attacks, Eqn. (5) further advocates that the confidence level of the prediction, characterized by the probability vector over the simplex $\Delta_{n}$ ($n$ is the number of classes), should also be close. Similar techniques are used in Virtual AT, TRADES, and UDA. However, previous work mostly focuses on semi-supervised learning or the trade-off between accuracy and robustness; in our work, we found that it is highly effective for boosting model generalization ability.
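To make the role of $L_{kl}$ concrete, the sketch below computes the symmetric KL divergence between two discrete distributions. It is a self-contained illustration with hypothetical helper names; the small constant `tiny` is an assumption added only to guard against division by zero.

```python
import math

def softmax(logits):
    """Turn a list of logits into a probability vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + tiny) / (qi + tiny))
               for pi, qi in zip(p, q) if pi > 0.0)

def symmetric_kl(p, q):
    """L_kl(p, q) = KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

# Both predictions pick class 0, so a label-preserving check is satisfied,
# yet their confidence differs and L_kl is clearly non-zero.
clean = [0.9, 0.1]
perturbed = [0.6, 0.4]
gap = symmetric_kl(clean, perturbed)
```

This is the sense in which the KL term is finer-grained than the label-preserving loss: it penalizes shifts in confidence even when the predicted class is unchanged.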

“Free” AT Strategy $K$ -step PGD requires $K$ forward-backward passes through the network, which is computationally heavy. Another limitation is that after $K$ steps, only perturbations at the final step are used for model training. To enable AT for large-scale training and promote diverse adversaries, we follow FreeLB to perform multiple PGD iterations to craft adversarial embeddings, and simultaneously accumulate the “free” parameter gradients $\nabla_{{\boldsymbol{\theta}}}L$ in each iteration. After that, the model parameters ${\boldsymbol{\theta}}$ are updated all at once with the accumulated gradients, effectively creating a $K$ -times-larger “virtual” mini-batch. The full procedure is provided in Algorithm 1.
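The gradient-accumulation idea can be illustrated on a toy problem. The sketch below is not Algorithm 1: it assumes a logistic-regression model with a single perturbed input, omits the clean loss, the KL term and the second modality, and initializes the perturbation at zero, which may differ from the actual procedure. All names and default values are hypothetical. What it does show is that each of the $K$ iterations yields both the perturbation gradient and the parameter gradient from one shared forward-backward computation, and that the parameters are updated only once at the end.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def free_at_update(w, x, y, K=3, adv_lr=0.1, eps=0.5, lr=0.1):
    """One parameter update using K accumulated adversarial steps.

    Toy model: p = sigmoid(w . (x + delta)) with binary cross-entropy loss.
    Returns the updated weights and the final perturbation.
    """
    delta = [0.0] * len(x)      # assumed zero initialization
    grad_acc = [0.0] * len(w)
    for _ in range(K):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x_adv)))
        err = p - y             # d(loss)/d(logit) for cross-entropy
        # "free" parameter gradient, averaged over the K adversarial steps
        grad_acc = [g + err * xi / K for g, xi in zip(grad_acc, x_adv)]
        # gradient w.r.t. the perturbation: normalized ascent, then projection
        g_delta = [err * wi for wi in w]
        g_norm = l2(g_delta)
        if g_norm > 0.0:
            delta = [d + adv_lr * g / g_norm for d, g in zip(delta, g_delta)]
        d_norm = l2(delta)
        if d_norm > eps:
            delta = [d * eps / d_norm for d in delta]
    # single update with the accumulated gradient
    new_w = [wi - lr * g for wi, g in zip(w, grad_acc)]
    return new_w, delta
```

Averaging over the $K$ iterations is what makes the update behave like one step on a $K$-times-larger mini-batch that mixes perturbations of different strengths.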

Experiments

Downstream Tasks To validate the effectiveness of Villa, we apply it to existing V+L pre-trained models and conduct a comprehensive evaluation over a wide range of downstream tasks, including Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Referring Expression (RE) Comprehension, Visual Entailment, Image-Text Retrieval, and NLVR2. To validate the strength of Villa in model pre-training and finetuning, we first incorporate it into the state-of-the-art UNITER model in both stages for downstream evaluation and ablation analysis. To demonstrate the versatility of Villa, we further apply it to another V+L model, LXMERT, which has a different architecture design from UNITER (two-stream vs. one-stream), as a generalizability test.

UNITER and LXMERT UNITER-base is a single-stream model, which has 12 layers, with 768 hidden units per layer and 12 attention heads; UNITER-large has 24 layers, with 1024 hidden units per layer and 16 attention heads. UNITER shares the same structure as BERT, except that the input now becomes a mixed sequence of two modalities. LXMERT is a two-stream model, which first performs self-attention through several layers on each modality independently (9 layers for text modality, and 5 layers for image modality), then fuses the outputs of both streams through another 5 layers (first cross-attention, then self-attention).

Implementation Details For UNITER experiments, we pre-train with the same four large-scale datasets used in the original model: COCO, Visual Genome (VG), Conceptual Captions and SBU Captions. Villa is applied to both MLM and ITM pre-training tasks. The original UNITER-base (12 layers) and UNITER-large (24 layers) models take 200k and 500k steps for pre-training, respectively. For fair comparison, when applying Villa to UNITER-base, we run 100k steps of standard training, followed by 100k steps of adversarial training. When applying Villa to UNITER-large, to save pre-training time, we run 425k steps of standard training, followed by 75k steps of adversarial training. (Villa is $K$ times computationally heavier than UNITER, where $K$ is the number of adversarial training steps.) We typically select the adversarial learning rate from {1e-2, 1e-3}, set the number of adversarial training steps to 3, and select $\alpha$ (Eqn. (2)) from {1.0, 1.5, 2.0}. More implementation details are provided in Appendix.
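The two-phase schedule and its cost can be sketched with a little arithmetic. The helpers below are hypothetical, and the cost model is an assumption: it takes the statement that adversarial training is $K$ times heavier at face value, counting each adversarial step as $K$ standard steps.

```python
def use_adversarial(step, standard_steps):
    """Return True once the standard-training phase is over (0-indexed steps)."""
    return step >= standard_steps

def relative_cost(standard_steps, adversarial_steps, K):
    """Compute relative to all-standard training with the same number of steps.

    Assumes one adversarial step costs K standard steps.
    """
    total = standard_steps + adversarial_steps
    return (standard_steps + K * adversarial_steps) / total

# Schedules described in the text, with K = 3 adversarial steps.
base_cost = relative_cost(100_000, 100_000, K=3)   # UNITER-base schedule
large_cost = relative_cost(425_000, 75_000, K=3)   # UNITER-large schedule
```

Under this assumption the base schedule costs about 2.0 times a standard run and the large schedule about 1.3 times, which is consistent with the stated motivation of shortening the adversarial phase for the large model to save pre-training time.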

Results and Ablation Analysis

Downstream Task Evaluation Table 1 summarizes the results of Villa applied to UNITER on all evaluation tasks. Compared with existing pre-trained V+L models, our Villa method achieves new state of the art across all the benchmarks. Specifically, Villa-base model outperforms UNITER-base by +0.76 on VQA, +2.4 on VCR for Q $\rightarrow$ AR, +1.45 on NLVR2, +0.75 on SNLI-VE, +2.22/+0.70 on Flickr30k for Image/Text Retrieval (R@1), and +0.99 on average for the three RE datasets.

A similar universal performance lift is also observed in Villa-large. It is highly encouraging to see that Villa-large brings an absolute gain of +2.9 points over UNITER-large for VCR on the Q $\rightarrow$ AR metric. Compared to the others, VCR is a relatively more challenging task, which requires commonsense reasoning and understanding of complex social dynamics that are implicitly encoded in the image. Another significant boost is on the well-studied VQA benchmark, from 74.02 to 74.87. With ensemble, the performance of Villa-large is further lifted to 75.85.

Pre-training vs. Finetuning To understand the effects of adversarial training on pre-training and finetuning, we conduct an ablation study with UNITER-base and summarize the results in Table 2. UNITER (reimp.) denotes our re-implementation of the UNITER-base model with standard training. Villa-pre and Villa-fine apply adversarial training to only the pre-training or finetuning stage, respectively. Averaged over the six evaluation tasks, Villa-pre and Villa-fine bring gains of +0.51 and +0.82 points, respectively. By combining the two, a gain of +1.15 points is achieved. Figure 2 further provides the training curves of each task, which illustrate growing performance gaps between AT-enhanced models and the original UNITER as the number of training steps increases. Interestingly, on VQA, though UNITER achieves better performance than Villa in early epochs, Villa catches up quickly after a few hundred steps, which demonstrates the beneficial regularization effect of adversarial training. More training curves on other tasks can be found in Appendix.

To further understand the importance of adversarial pre-training, we use VQA as the probing task, and compare the performance of standard and adversarial pre-training at each intermediate model checkpoint (using standard finetuning to both pre-trained models). Results are presented in Figure 3(a). As shown, once adversarial training is activated, Villa-pre starts outperforming UNITER, and the performance gap increases as the number of pre-training steps grows.

Image vs. Text Modality To gain insights into the effects of adversarial examples in different modalities, we conduct experiments by adding perturbations to either the image or the text modality, and use VQA and VCR for ablation tests. Results are summarized in Table 3(a). Conventionally, adversarial training in the image domain hurts model accuracy on clean images. However, in our setting, we observe that adding perturbations to image features alone can boost final model performance significantly. Our initial intuition was that adding perturbations to both modalities might increase the diversity of adversarial examples, hence bringing more benefits. However, ablation results show that adding perturbations to one modality already yields significant improvement. (We also tried adding adversarial perturbations to both modalities simultaneously instead of alternately; empirically, the two variants obtained similar performance.) The boost on VCR is larger than on VQA, which we hypothesize is due to the higher complexity of the VCR task, where adding more adversaries to model training can effectively help.

FreeLB vs. Villa To compare with prior work FreeLB, we conduct an additional ablation study also on VQA and VCR, two representative and challenging V+L tasks. Table 3(b) shows that Villa achieves consistently better performance than FreeLB over both benchmarks, thanks to the additional fine-grained adversarial regularization term. For example, FreeLB brings little performance boost on VQA, while Villa achieves considerable improvement over the baseline.

Probing Analysis Pre-trained models are expected to learn intricate knowledge about multimodality correlations, such as visual coreference (i.e., region-phrase alignment) and visual relation (i.e., region-region interaction). To provide a more direct measurement on how well our adversarial pre-trained model captures such multimodal signals, we conduct a probing analysis following . We consider five most common visual coreference types in Flickr30k Entities and top five visual relations in Visual Genome (listed in Table 4), and calculate the attention weights between region and phrase (or between regions) learned by pre-trained models. Results show that Villa presents higher attention weights across all the ten categories (0.223 vs. 0.195 on average), indicating a higher probability of identifying those relations. Figure 4 further provides a visualization of text-to-image attention, where Villa exhibits more accurate and sharper multimodal alignment.

Results on LXMERT Villa is a generic framework that can be readily applied to any V+L model. To demonstrate its generalization ability, we conduct additional experiments using LXMERT as the backbone. Since adversarial pre-training is highly time-consuming, we only focus on adversarial finetuning for LXMERT (code is available at https://github.com/zhegan27/LXMERT-AdvTrain). We use VQA, GQA and NLVR2 as the evaluation tasks, the same as LXMERT. Results in Table 5 show that Villa-fine instantly provides a +0.88 average performance boost across the three tasks. The training curves are provided in Figure 3(b). Compared to LXMERT, Villa-fine achieves higher accuracy on the validation set and lower accuracy on the training set for both VQA and GQA, clearly demonstrating its regularization effect in preventing overfitting of large-scale pre-trained models.

Robustness In order to test adversarial robustness, we need to perform adversarial attacks to existing V+L models. This V+L attack problem is largely unexplored in the literature. For example, how to reliably back-propagate the gradients from the multimodal Transformer to the CNN backbone to generate image adversaries is non-trivial. How to craft textual adversaries that align with the visual context is also challenging. In this work, we mainly focus on improving model’s generalization performance on clean data, leaving a more thorough investigation of adversarial attack and robustness as important future work.

As a proxy for robustness evaluation, we conduct additional experiments on the VQA-Rephrasings dataset to test the robustness of existing V+L models to paraphrases. For fair comparison, we have re-trained both UNITER and Villa on the VQA training set only. Results are summarized in Table 6, where ‘Original’ and ‘Rephrasing’ denote the test set with original questions and their rephrasings, respectively. UNITER has already lifted up the performance by a large margin, and Villa facilitates further performance boost.

We provide additional experimental results, more details about the probing analysis, and additional visualization examples in Appendix.

Conclusion

In this paper, we present Villa, an advanced adversarial training (AT) framework for better vision-and-language representation learning. By performing AT in both pre-training and finetuning stages, and by adding adversarial perturbations to the embedding space, Villa achieves consistent performance boost on all the benchmarks evaluated. As AT is time-consuming, for future work, we plan to study how to accelerate AT so that it can be more feasible for large-scale pre-training in practice.

Broader Impact

Our research advances vision-and-language representation learning by incorporating adversarial training in both pre-training and finetuning stages. By utilizing the enormous amount of image-text data available on the web for pre-training, Villa can absorb multimodal clues to capture multi-channel signals from the world, towards a smarter AI system. Furthermore, Villa can provide instant performance boost in finetuning stage, which will help accelerate future studies in this field. However, in order to train models to learn such capabilities, our method also calls for a high demand on computational resources due to large-scale training, which could be costly both financially and environmentally. As part of our research effort, we will release our pre-trained models to facilitate future research, to empower others’ scientific exploration and save environmental cost.

Appendix A

This supplementary material contains three sections. Section A.1 reviews additional related work. Section A.2 provides additional experimental results. Section A.3 describes downstream tasks and implementation details.

A.1 Additional Related Work

Adversarial Training Many efforts have been devoted to improving AT from different angles: ( $i$ ) use triplet-wise metric learning and optimal transport to leverage inter-sample interactions; ( $ii$ ) exploit extra unlabeled training data; and ( $iii$ ) accelerate the training procedure. Specifically, adversarial examples have been explored primarily in the image domain, and have only recently started to gain attention in vision-and-language research. Prior studies examined how to craft adversarial examples for image captioning and how to derive adversarial rules to attack VQA systems. Different from these studies, we are not interested in crafting actual adversarial examples, but aim to apply AT to improve the final model performance on V+L tasks. Note that “adversarial regularization” was proposed in prior work; however, it is mainly used to overcome the language priors in VQA, which is entirely different from the AT used here.

A.2 Additional Results

Results on VQA In Table 1(a), we have reported the experimental results on the test-dev and test-std splits of VQA. More detailed results on each question type are provided in Table 7. As shown, Villa improves over UNITER on all the question types.

Training Curves In Figure 3(a), we have provided the training curves on three datasets. The training curves for the remaining three datasets are shown in Figure 5, with a similar trend observed.

Pre-training vs. Finetuning with Large Model Size In Table 2, we provided an ablation study on adversarial pre-training and finetuning with UNITER-base model size (12 layers). In Table 8, we provide an additional ablation study with large model size (24 layers) on a selected set of tasks (VQA and VCR). On average, adversarial pre-training and finetuning bring +1.48 and +2.21 performance gains, respectively. Combining the two AT stages provides further improvement.

Results on GQA In Table 5, we have reported LXMERT results on GQA enhanced by Villa-fine. The complete results are provided in Table 10 for reference.

Adversarial pre-training from scratch Instead of performing adversarial pre-training from 100k steps, we also conducted experiments on adversarial pre-training from scratch with base model size. Preliminary results on VQA are shown in Table 9. Adversarial pre-training from scratch brings further performance improvement. We leave a thorough investigation of this as future work.

Additional Visualization We provide additional text-to-image attention visualization results in Figure 6.

A.3 Downstream Tasks and Implementation Details

Downstream Tasks In VQA, GQA and VCR, given an image and an input question, the model predicts an answer (or selects from a candidate pool). For NLVR2, given a pair of images and a natural language description, the model judges the correctness of the description based on the visual clues in the image pair. For Visual Entailment, we evaluate on SNLI-VE, where the model predicts whether a given image semantically entails a given sentence. For Referring Expression (RE) Comprehension, we evaluate on the RefCOCO, RefCOCO+, and RefCOCOg datasets, where given a text description, the model selects the described region from a set of image region proposals. Models are evaluated on ground-truth objects and detected proposals. For Image-Text Retrieval (ITR), we consider both image retrieval and text retrieval on the Flickr30k dataset.

For all the tasks except RE Comprehension, we extract the joint V+L embedding from the [CLS] token, and apply a multi-layer perceptron (MLP) for prediction. For RE Comprehension, we use an MLP to compute the region-wise alignment scores. During the finetuning stage, ITR is formulated as a ranking problem, with triplet loss used for model training and hard negatives applied to boost performance. All the other tasks can be formulated as a classification problem, using cross-entropy loss for model training. For VCR, second-stage pre-training with VCR training data was proven useful in prior work. Therefore, for VCR downstream experiments, we further apply 60k steps of second-stage adversarial pre-training.
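To make the ranking formulation for ITR concrete, the following is a minimal sketch of one common form of triplet loss with hard negatives, where the hardest in-batch negative is used in each retrieval direction. The function name, the in-batch mining strategy, and the margin value of 0.2 are illustrative assumptions, not details taken from our implementation.

```python
from typing import List


def hard_negative_triplet_loss(scores: List[List[float]], margin: float = 0.2) -> float:
    """Triplet ranking loss with in-batch hard negatives (illustrative sketch).

    scores[i][j] is the matching score between image i and text j.
    Matched image-text pairs are assumed to lie on the diagonal.
    The margin default is a hypothetical value.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two pairs to form a negative")
    total = 0.0
    for i in range(n):
        pos = scores[i][i]
        # Hardest negative text for image i (text retrieval direction).
        hard_text = max(scores[i][j] for j in range(n) if j != i)
        # Hardest negative image for text i (image retrieval direction).
        hard_image = max(scores[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - pos + hard_text)
        total += max(0.0, margin - pos + hard_image)
    return total / n


if __name__ == "__main__":
    # Two image-text pairs; the off-diagonal entries are mismatched pairs.
    print(hard_negative_triplet_loss([[0.9, 0.1], [0.2, 0.8]]))
```

When every positive pair outscores its hardest negative by at least the margin, the loss is zero, so only the violating pairs contribute gradient.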

Probing Analysis The visual coreference task aims to predict whether there is a link between an image region and a noun phrase in the sentence that describes the image. In addition, each coreference link in the dataset is annotated with a label. Through this task, we can find out whether the coreference knowledge can be captured by the attention trace. To achieve this goal, for each data sample in the Flickr30k Entity dataset, we extract the encoder’s attention weights for all the 144 heads. Note that noun phrases typically consist of two or more tokens in the sequence. Thus, we extract the maximum attention weight between the image region and each word of the noun phrase for each head. The maximum weight is then used to evaluate which head identifies visual coreference.
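The per-head feature extraction described above can be sketched as follows. This is a simplified illustration that assumes the attention weights are available as a nested list indexed by head, query position, and key position, and that attention is read from the image region (query) to the phrase tokens (key); the function name and data layout are hypothetical.

```python
from typing import List, Sequence


def max_attention_per_head(
    attn: List[List[List[float]]],
    region_idx: int,
    phrase_token_idxs: Sequence[int],
) -> List[float]:
    """Return, for each head, the maximum attention weight between one image
    region and any token of a noun phrase.

    attn[h][q][k] is the weight of head h from query position q to key
    position k (an assumed layout). Swap the last two indices to probe the
    opposite attention direction.
    """
    if not phrase_token_idxs:
        raise ValueError("noun phrase must contain at least one token")
    return [
        max(head[region_idx][t] for t in phrase_token_idxs)
        for head in attn
    ]


if __name__ == "__main__":
    # Toy example: 2 heads, 3 positions (position 0 is the region,
    # positions 1 and 2 are the tokens of the noun phrase).
    attn = [
        [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
        [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]],
    ]
    print(max_attention_per_head(attn, region_idx=0, phrase_token_idxs=[1, 2]))
```

With 144 heads, this yields a 144-dimensional vector per region-phrase pair, one entry per head.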

Similarly, the visual relation task aims to identify and classify the relation between two image regions. The Visual Genome dataset is used for this task, which contains 1,531,448 relations. To reduce the imbalance in the number of relations per relation type, we randomly select at most 15,000 relation pairs per type. Then, we perform similar probing analysis of the attention heads by examining the attention weights on ground-truth links.
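The per-type subsampling step can be illustrated with a short sketch. The tuple layout of a relation, the function name, and the fixed seed are assumptions made for the example; only the cap of 15,000 pairs per type comes from the text above.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Relation = Tuple[int, int, str]  # (subject region, object region, relation type)


def cap_per_type(
    relations: Iterable[Relation],
    max_per_type: int = 15000,
    seed: int = 0,
) -> List[Relation]:
    """Randomly keep at most `max_per_type` relation pairs for each type."""
    by_type: Dict[str, List[Relation]] = defaultdict(list)
    for rel in relations:
        by_type[rel[2]].append(rel)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled: List[Relation] = []
    for rel_type in sorted(by_type):
        group = by_type[rel_type]
        if len(group) > max_per_type:
            group = rng.sample(group, max_per_type)
        sampled.extend(group)
    return sampled


if __name__ == "__main__":
    toy = [(i, i + 1, "on") for i in range(10)] + [(i, i + 1, "near") for i in range(3)]
    kept = cap_per_type(toy, max_per_type=5)
    print(len(kept))
```

Types with fewer pairs than the cap are kept in full, so the cap only reduces the dominance of the most frequent relation types.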

Implementation Details Our models are implemented based on PyTorch. To speed up training, we use Nvidia Apex (https://github.com/NVIDIA/apex) for mixed precision training. All pre-training experiments are run on Nvidia V100 GPUs (16GB VRAM; PCIe connection). Finetuning experiments are implemented on the same hardware or Titan RTX GPUs (24GB VRAM). For large pre-training experiments, we use Horovod (https://github.com/horovod/horovod) and NCCL (https://github.com/NVIDIA/nccl) for multi-node communication. All the hyper-parameter values used in experiments are listed in Table 11. For all the experiments, we set the number of adversarial training steps to 3. We mostly follow the experimental settings in UNITER; for more details on finetuning for each downstream task, please refer to the UNITER Appendix. Since we mostly adopt the default UNITER hyper-parameters, and the only additional hyper-parameters we introduce are the adversarial learning rate, the number of adversarial steps, and the adversarial weight $\alpha$ in Eqn. 2, the experimental results are fairly easy to reproduce.
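To illustrate how the three additional hyper-parameters interact, the sketch below runs a generic multi-step projected gradient ascent on a toy logistic model in pure Python. It is not the Villa objective: the toy model, the norm bound `max_norm`, the default values other than the 3 adversarial steps, and the plain weighted sum in `adversarial_objective` are all assumptions for illustration, and `adv_weight` only loosely stands in for $\alpha$, whose exact role is defined by Eqn. 2.

```python
import math
from typing import List


def logistic_loss(w: List[float], x: List[float], y: int) -> float:
    """Logistic loss of a linear model for a label y in {-1, +1}."""
    z = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log1p(math.exp(-z))


def pgd_perturbation(w, x, y, adv_steps=3, adv_lr=0.1, max_norm=0.2):
    """Multi-step gradient ascent on an input perturbation (toy sketch).

    adv_steps: number of adversarial steps (3 in the experiments above).
    adv_lr:    adversarial learning rate (hypothetical value).
    max_norm:  L2 bound on the perturbation (hypothetical value).
    """
    delta = [0.0] * len(x)
    for _ in range(adv_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        z = y * sum(wi * xi for wi, xi in zip(w, x_adv))
        # Gradient of the logistic loss w.r.t. the input: -y * sigmoid(-z) * w.
        coeff = -y / (1.0 + math.exp(z))
        grad = [coeff * wi for wi in w]
        g_norm = math.sqrt(sum(g * g for g in grad))
        if g_norm == 0.0:
            break
        # Normalized ascent step, then projection back onto the L2 ball.
        delta = [di + adv_lr * g / g_norm for di, g in zip(delta, grad)]
        d_norm = math.sqrt(sum(d * d for d in delta))
        if d_norm > max_norm:
            delta = [d * max_norm / d_norm for d in delta]
    return delta


def adversarial_objective(w, x, y, adv_weight=0.5, **pgd_kwargs):
    """Clean loss plus a weighted loss on the perturbed input (generic form)."""
    delta = pgd_perturbation(w, x, y, **pgd_kwargs)
    x_adv = [xi + di for xi, di in zip(x, delta)]
    return logistic_loss(w, x, y) + adv_weight * logistic_loss(w, x_adv, y)


if __name__ == "__main__":
    w, x, y = [1.0, -2.0], [0.5, 0.25], 1
    print(logistic_loss(w, x, y), adversarial_objective(w, x, y))
```

In this toy setting, each additional adversarial step costs one extra gradient computation, which reflects the compute trade-off behind keeping the number of steps small.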