Consistency Regularization for Cross-Lingual Fine-Tuning

Bo Zheng, Li Dong, Shaohan Huang, Wenhui Wang, Zewen Chi, Saksham Singhal, Wanxiang Che, Ting Liu, Xia Song, Furu Wei

Introduction

Pre-trained cross-lingual language models (Conneau and Lample, 2019; Conneau et al., 2020a; Chi et al., 2020) have shown great transferability across languages. By fine-tuning on labeled data in a source language, the models can generalize to other target languages, even without any additional training. Such generalization ability reduces the required annotation effort, which is prohibitively expensive for low-resource languages.

Recent work has demonstrated that data augmentation is helpful for cross-lingual transfer, e.g., translating source language training data into target languages (Singh et al., 2019), and generating code-switch data by randomly replacing input words in the source language with translated words in target languages (Qin et al., 2020). Although these methods enlarge the training set, their fine-tuning still treats training instances independently, without considering the inherent correlation between an original input and its augmented counterpart. In contrast, we propose to utilize consistency regularization to better leverage data augmentation for cross-lingual fine-tuning. Intuitively, for a semantic-preserving augmentation strategy, the prediction for the original input should be similar to the prediction for its augmented version. For example, the classification predictions of an English sentence and its translation tend to remain consistent.

In this work, we introduce a cross-lingual fine-tuning method xTune that is enhanced by consistency regularization and data augmentation. First, example consistency regularization enforces the model predictions to be more consistent for semantic-preserving augmentations. The regularizer penalizes the model sensitivity to different surface forms of the same example (e.g., texts written in different languages), which implicitly encourages cross-lingual transferability. Second, we introduce model consistency to regularize the models trained with various augmentation strategies. Specifically, given two augmented versions of the same training set, we encourage the models trained on these two datasets to make consistent predictions for the same example. The method enforces the corpus-level consistency between the distributions learned by two models.

Under the proposed fine-tuning framework, we study four strategies of data augmentation, i.e., subword sampling (Kudo, 2018), code-switch substitution (Qin et al., 2020), Gaussian noise (Aghajanyan et al., 2020), and machine translation. We evaluate xTune on the XTREME benchmark (Hu et al., 2020), including three different tasks on seven datasets. Experimental results show that our method outperforms conventional fine-tuning with data augmentation. We also demonstrate that xTune is flexible enough to be plugged into various tasks, such as classification, span extraction, and sequence labeling.

We summarize our contributions as follows:

We propose xTune, a cross-lingual fine-tuning method to better utilize data augmentations based on consistency regularization.

We study four types of data augmentations that can be easily plugged into cross-lingual fine-tuning.

We give instructions on how to apply xTune to various downstream tasks, such as classification, span extraction, and sequence labeling.

We conduct extensive experiments to show that xTune consistently improves the performance of cross-lingual fine-tuning.

Related Work

Besides learning cross-lingual word embeddings (Mikolov et al., 2013; Faruqui and Dyer, 2014; Guo et al., 2015; Xu et al., 2018; Wang et al., 2019), most recent work on cross-lingual transfer is based on pre-trained cross-lingual language models (Conneau and Lample, 2019; Conneau et al., 2020a; Chi et al., 2020). These models generate multilingual contextualized word representations for different languages with a shared encoder and show promising cross-lingual transferability.

Machine translation has been successfully applied to the cross-lingual scenario as data augmentation. A common way to use machine translation is to fine-tune models on both source language training data and translated data in all target languages. Furthermore, Singh et al. (2019) proposed to replace a segment of source language input text with its translation in another language. However, it is usually impossible to map the labels in source language data into target language translations for token-level tasks. Zhang et al. (2019) used code-mixing to perform the syntactic transfer in cross-lingual dependency parsing. Fei et al. (2020) constructed pseudo translated target corpora from the gold-standard annotations of the source languages for cross-lingual semantic role labeling. Fang et al. (2020) proposed an additional Kullback-Leibler divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language. Besides, Qin et al. (2020) fine-tuned models on multilingual code-switch data, which achieves considerable improvements.

One strand of work in consistency regularization focused on regularizing model predictions to be invariant to small perturbations on image data. The small perturbations can be random noise (Zheng et al., 2016), adversarial noise (Miyato et al., 2019; Carmon et al., 2019) and various data augmentation approaches (Hu et al., 2017; Ye et al., 2019; Xie et al., 2020). Similar ideas are used in the natural language processing area. Both adversarial noise (Zhu et al., 2020; Jiang et al., 2020; Liu et al., 2020) and sampled Gaussian noise (Aghajanyan et al., 2020) are adopted to augment input word embeddings. Another strand of work focused on consistency under different model parameters (Tarvainen and Valpola, 2017; Athiwaratkun et al., 2019), which is complementary to the first strand. We focus on the cross-lingual setting, where consistency regularization has not been fully explored.

Methods

Conventional cross-lingual fine-tuning trains a pre-trained language model on the source language and directly evaluates it on other languages, which is also known as the setting of zero-shot cross-lingual fine-tuning. Specifically, given a training corpus $\mathcal{D}$ in the source language (typically in English), and a model $f(\cdot;\theta)$ that predicts task-specific probability distributions, we define the loss of cross-lingual fine-tuning as:
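One way to write this loss that is consistent with the notation $\mathcal{L}^{\text{task}}(\cdot,\theta)$ used in the rest of this section is sketched below. The symbols $\ell$ (a task-dependent per-example loss, e.g. cross-entropy) and $G(x)$ (the ground-truth label of $x$) are assumptions introduced for this sketch.

$$\mathcal{L}^{\text{task}}(\mathcal{D},\theta)=\sum_{x\in\mathcal{D}}\ell\big(f(x;\theta),G(x)\big)$$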

Apart from vanilla cross-lingual fine-tuning on the source language, recent work shows that data augmentation helps to improve performance on the target languages. For example, Conneau and Lample (2019) add translated examples to the training set for better cross-lingual transfer. Let $\mathcal{A}(\cdot)$ be a cross-lingual data augmentation strategy (such as code-switch substitution), and let $\mathcal{D_{\mathcal{A}}}=\mathcal{D}\cup\{\mathcal{A}(x)\mid x\in\mathcal{D}\}$ be the augmented training corpus; the fine-tuning loss is then $\mathcal{L}^{\text{task}}(\mathcal{D_{A}},\theta)$. Notice that it is non-trivial to apply some augmentations to token-level tasks directly. For instance, in part-of-speech tagging, the labels of source language examples cannot be mapped to the translated examples because explicit alignments are lacking.

We propose to improve cross-lingual fine-tuning with two consistency regularization methods, so that we can effectively leverage cross-lingual data augmentations.

1.1 Example Consistency Regularization

In order to encourage consistent predictions for an example and its semantically equivalent augmentation, we introduce example consistency regularization, which is defined as follows:
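A sketch of one such regularizer, assuming a symmetric Kullback-Leibler divergence between the two predictions (the choice of divergence, the symbol $\text{KL}_{S}$, and the argument order of $\mathcal{R}_{1}$ are assumptions of this sketch), is:

$$\mathcal{R}_{1}(\mathcal{D},\theta,\mathcal{A})=\sum_{x\in\mathcal{D}}\text{KL}_{S}\big(f(x;\theta)\,\|\,f(\mathcal{A}(x);\theta)\big),\qquad \text{KL}_{S}(P\,\|\,Q)=\text{KL}(P\,\|\,Q)+\text{KL}(Q\,\|\,P)$$

Here $\mathcal{A}(x)$ is the augmented version of $x$. The term is zero when the model assigns identical distributions to both inputs and grows as the two predictions diverge.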

1.2 Model Consistency Regularization

While the example consistency regularization is conducted at the example level, we propose the model consistency to further regularize the model training at the corpus level. The regularization is conducted in two stages. First, we obtain a fine-tuned model $\theta^{*}$ on the training corpus $\mathcal{D}$ :
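Assuming the task loss is written $\mathcal{L}^{\text{task}}(\mathcal{D},\theta)$ as elsewhere in this section, and using $\theta_{1}$ as a hypothetical name for the first-stage parameters, this step can be sketched as:

$$\theta^{*}=\arg\min_{\theta_{1}}\;\mathcal{L}^{\text{task}}(\mathcal{D},\theta_{1})$$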

In the second stage, we keep the parameters $\theta^{*}$ fixed. The regularization term is defined as:
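A form consistent with the description that follows, assuming the fixed model's prediction is the first argument of the divergence (the direction is an assumption of this sketch), is:

$$\mathcal{R}_{2}(\mathcal{D_{A}},\theta,\theta^{*})=\sum_{x\in\mathcal{D_{A}}}\text{KL}\big(f(x;\theta^{*})\,\|\,f(x;\theta)\big)$$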

where $\mathcal{D_{A}}$ is the augmented training corpus, and $\text{KL}(\cdot)$ is Kullback-Leibler divergence. For each example $x$ of the augmented training corpus $\mathcal{D_{A}}$ , the model consistency regularization encourages the prediction $f(x;\theta)$ to be consistent with $f(x;\theta^{*})$ . The regularizer enforces the corpus-level consistency between the distributions learned by two models.

A less obvious advantage of model consistency regularization is its flexibility with respect to data augmentation strategies. In part-of-speech tagging, for example, even though the labels cannot be directly projected from an English sentence to its translation, we are still able to employ the regularizer. Because the term $\mathcal{R}_{2}$ is computed on the same example $x\in\mathcal{D_{A}}$ for both models, we can always align the token-level predictions of the models $\theta$ and $\theta^{*}$ .
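The alignment argument can be illustrated with a minimal, framework-free sketch. It assumes both models emit one row of logits per token of the same input sentence, and it assumes the fixed model's distribution is the first argument of the KL divergence. The function names and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q), where P and Q are the softmax of the given logits."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def token_level_r2(teacher_logits, student_logits):
    """Sum of per-token KL(teacher || student) for ONE input sentence.

    Both arguments hold one list of label logits per token. Because both
    models read the same tokens, position i in one list corresponds to
    position i in the other, so no gold labels or word alignment are needed.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("both models must score the same token sequence")
    return sum(kl_from_logits(t, s) for t, s in zip(teacher_logits, student_logits))

# Hypothetical logits for a 3-token translated sentence with 2 labels.
teacher = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # fixed model (theta*)
student = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]   # model being trained (theta)
penalty = token_level_r2(teacher, student)
```

In an autograd framework the teacher logits would be computed without gradient tracking, so that only the student parameters are updated.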

1.3 Full xTune Fine-Tuning

As shown in Figure 1, we combine example consistency regularization $\mathcal{R}_{1}$ and model consistency regularization $\mathcal{R}_{2}$ as a two-stage fine-tuning process. Formally, we fine-tune a model with $\mathcal{R}_{1}$ in the first stage:
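Writing $\mathcal{R}_{1}(\mathcal{D},\theta,\mathcal{A})$ for the example consistency term computed on corpus $\mathcal{D}$ with augmentation $\mathcal{A}$ (this argument order, and the first-stage parameter name $\theta_{1}$, are assumptions of this sketch), the first stage can be sketched as:

$$\theta^{*}=\arg\min_{\theta_{1}}\;\mathcal{L}^{\text{task}}(\mathcal{D},\theta_{1})+\lambda_{1}\mathcal{R}_{1}(\mathcal{D},\theta_{1},\mathcal{A}^{*})$$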

where the parameters $\theta^{*}$ are kept fixed for $\mathcal{R}_{2}$ in the second stage. Then the final loss is computed via:
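Writing $\mathcal{R}_{1}(\mathcal{D_{A}},\theta,\mathcal{A}^{\prime})$ for the example consistency term with augmentation $\mathcal{A}^{\prime}$ and $\mathcal{R}_{2}(\mathcal{D_{A}},\theta,\theta^{*})$ for the model consistency term, and taking $\mathcal{A}$ to be the strategy that builds $\mathcal{D_{A}}$ (the argument order and this assignment of roles are assumptions of this sketch), the second-stage objective can be sketched as:

$$\mathcal{L}^{\text{task}}(\mathcal{D_{A}},\theta)+\lambda_{1}\mathcal{R}_{1}(\mathcal{D_{A}},\theta,\mathcal{A}^{\prime})+\lambda_{2}\mathcal{R}_{2}(\mathcal{D_{A}},\theta,\theta^{*})$$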

where $\lambda_{1}$ and $\lambda_{2}$ are the corresponding weights of the two regularization methods. Notice that the data augmentation strategies $\mathcal{A}$ , $\mathcal{A}^{\prime}$ , and $\mathcal{A}^{*}$ can be either different or the same; the choice is tuned as a hyper-parameter.
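To make the weighting concrete, the following minimal sketch computes the combined second-stage loss for a single classification example. It is an illustration under stated assumptions rather than the authors' implementation: cross-entropy is assumed as the task loss, a symmetric KL divergence is assumed for $\mathcal{R}_{1}$, KL with the fixed model as first argument is assumed for $\mathcal{R}_{2}$, and all function and parameter names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q), where P and Q are the softmax of the given logits."""
    lp, lq = log_softmax(p_logits), log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def symmetric_kl(a_logits, b_logits):
    """KL(A || B) + KL(B || A); assumed divergence for example consistency."""
    return kl_from_logits(a_logits, b_logits) + kl_from_logits(b_logits, a_logits)

def xtune_example_loss(gold, student_x, student_aug, teacher_x,
                       lambda1=1.0, lambda2=1.0):
    """Task loss + lambda1 * R1 + lambda2 * R2 for one example.

    gold        -- index of the gold label
    student_x   -- logits f(x; theta) from the model being trained
    student_aug -- logits f(A'(x); theta) on the augmented input
    teacher_x   -- logits f(x; theta*) from the fixed first-stage model
    """
    task = -log_softmax(student_x)[gold]          # cross-entropy (assumed)
    r1 = symmetric_kl(student_x, student_aug)     # example consistency
    r2 = kl_from_logits(teacher_x, student_x)     # model consistency
    return task + lambda1 * r1 + lambda2 * r2

# Hypothetical 3-class logits.
loss = xtune_example_loss(gold=0,
                          student_x=[1.5, 0.2, -0.3],
                          student_aug=[1.0, 0.6, -0.1],
                          teacher_x=[2.0, 0.1, -0.5])
```

When the augmented prediction and the fixed model's prediction both match the student's prediction, the two regularizers vanish and the loss reduces to the task loss alone.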

2 Data Augmentation

We consider four types of data augmentation strategies in this work, which are shown in Figure 2. We aim to study the impact of different data augmentation strategies on cross-lingual transferability.

2.1 Subword Sampling

Representing a sentence in different subword sequences can be viewed as a data augmentation strategy (Kudo, 2018; Provilkov et al., 2020). We utilize XLM-R (Conneau et al., 2020a) as our pre-trained cross-lingual language model, which applies subword tokenization directly to raw text using SentencePiece (Kudo and Richardson, 2018) with a unigram language model (Kudo, 2018). As one of our data augmentation strategies, we apply the on-the-fly subword sampling algorithm of the unigram language model to generate multiple subword sequences.
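The idea behind subword sampling can be shown with a toy sketch. This is not the SentencePiece implementation: it enumerates every segmentation of one word under a tiny hypothetical unigram vocabulary and samples a segmentation with probability proportional to its unigram probability raised to a smoothing power `alpha`. The vocabulary, its probabilities, and the default `alpha` are all made up for illustration.

```python
import random

# Hypothetical unigram vocabulary: piece -> probability (not from XLM-R).
TOY_VOCAB = {"unrelated": 0.02, "un": 0.10, "related": 0.05,
             "rel": 0.04, "ated": 0.03}

def segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces found in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(word, vocab, alpha=0.5, rng=random):
    """Sample one segmentation with probability proportional to P(seg)**alpha.

    P(seg) is the product of the unigram probabilities of its pieces.
    A smaller alpha flattens the distribution, giving more varied outputs.
    """
    candidates = segmentations(word, vocab)
    if not candidates:
        raise ValueError(f"cannot segment {word!r} with the given vocabulary")
    weights = []
    for seg in candidates:
        prob = 1.0
        for piece in seg:
            prob *= vocab[piece]
        weights.append(prob ** alpha)
    return rng.choices(candidates, weights=weights, k=1)[0]

# Each call may return a different surface form of the same word.
samples = [sample_segmentation("unrelated", TOY_VOCAB) for _ in range(5)]
```

Every sampled sequence concatenates back to the original word, which is why this augmentation preserves both meaning and word boundaries.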

2.2 Gaussian Noise

Most data augmentation strategies in NLP change the input text discretely, while we directly add random perturbation noise sampled from a Gaussian distribution to the input embedding layer. When this data augmentation is combined with example consistency $\mathcal{R}_{1}$ , the method is similar to stability training (Zheng et al., 2016), random perturbation training (Miyato et al., 2019) and the R3F method (Aghajanyan et al., 2020). We also explore the capability of Gaussian noise to generate new examples in the continuous input space for conventional fine-tuning.
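A minimal sketch of this augmentation on plain Python lists follows. The function name, the noise scale `sigma`, and the toy embeddings are hypothetical; the text above does not state the scale used.

```python
import random

def add_gaussian_noise(embeddings, sigma=1e-5, rng=None):
    """Return a copy of `embeddings` with i.i.d. N(0, sigma^2) noise added.

    embeddings -- list of token vectors (each a list of floats), standing in
                  for the output of the input embedding layer
    sigma      -- standard deviation of the noise (hypothetical default)
    """
    rng = rng or random.Random()
    return [[value + rng.gauss(0.0, sigma) for value in vector]
            for vector in embeddings]

# Toy embeddings for a 2-token input with dimension 3.
original = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
perturbed = add_gaussian_noise(original, sigma=0.01, rng=random.Random(0))
```

The perturbed sequence has the same length and dimensionality as the original, so output distributions for the two inputs can be compared position by position.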

2.3 Code-Switch Substitution

Anchor points have been shown to improve cross-lingual transferability. Conneau et al. (2020b) analyzed the impact of anchor points in pre-training cross-lingual language models. Following Qin et al. (2020), we generate code-switch data in multiple languages as data augmentation. We randomly select words in the original source language text and replace them with target language words from bilingual dictionaries to obtain code-switch data. Intuitively, this type of data augmentation explicitly helps pre-trained cross-lingual models align the multilingual vector space through the replaced anchor points.
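The substitution step can be sketched as follows. This is a simplified illustration, not the procedure of Qin et al. (2020): the replacement ratio, the per-word choice of target language, and the toy dictionaries are all assumptions.

```python
import random

def code_switch(tokens, dictionaries, ratio=0.3, rng=None):
    """Randomly replace source words with dictionary translations.

    tokens       -- list of source-language words
    dictionaries -- {language: {source_word: [translations]}} (toy format)
    ratio        -- probability of attempting a replacement per word
                    (hypothetical default)
    Words without a dictionary entry are left unchanged. The output has the
    same number of words as the input.
    """
    rng = rng or random.Random()
    languages = sorted(dictionaries)
    switched = []
    for token in tokens:
        replacement = None
        if languages and rng.random() < ratio:
            language = rng.choice(languages)
            candidates = dictionaries[language].get(token.lower())
            if candidates:
                replacement = rng.choice(candidates)
        switched.append(replacement if replacement is not None else token)
    return switched

# Toy bilingual dictionaries (hypothetical entries).
TOY_DICTS = {
    "de": {"good": ["gut"], "movie": ["Film"]},
    "fr": {"good": ["bon"], "movie": ["film"]},
}
example = code_switch(["a", "good", "movie"], TOY_DICTS, ratio=0.5,
                      rng=random.Random(1))
```

Because the replacement is word for word, the number of words is unchanged, which is what allows token-level labels to carry over to the code-switched text.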

2.4 Machine Translation

Machine translation has proven to be an effective data augmentation strategy (Singh et al., 2019) in the cross-lingual scenario. However, the ground-truth labels of translated data can be unavailable for token-level tasks (see Section 3), which rules out conventional fine-tuning on the augmented data. In contrast, our proposed model consistency $\mathcal{R}_{2}$ not only serves as consistency regularization but can also be viewed as a self-training objective that enables semi-supervised training on the unlabeled target language translations.

3 Task Adaptation

We give instructions on how to apply xTune to various downstream tasks, i.e., classification, span extraction, and sequence labeling. By default, we use model consistency $\mathcal{R}_{2}$ in full xTune. We describe the usage of example consistency $\mathcal{R}_{1}$ as follows.

3.2 Span Extraction

3.3 Sequence Labeling

Recent pre-trained language models generate representations at the subword-level. For sequence labeling tasks, these models predict label distributions on each word’s first subword. Therefore, the model is expected to predict $n_{\text{word}}$ probability distributions per example on $n_{\text{label}}$ types. Unlike span extraction, subword sampling, code-switch substitution, and Gaussian noise do not change $n_{\text{word}}$ . Thus the three data augmentation strategies will not affect the usage of example consistency $\mathcal{R}_{1}$ . Although word alignment is a possible solution to map the predicted label distributions between translation pairs, the word alignment process will introduce more noise. Therefore, we do not employ machine translation as data augmentation for the example consistency $\mathcal{R}_{1}$ .
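The claim that subword-level augmentation leaves $n_{\text{word}}$ unchanged can be checked with a small sketch. It relies on the SentencePiece convention of marking the first piece of each word with the symbol U+2581; the helper names and the two example segmentations are hypothetical.

```python
WORD_START = "\u2581"  # SentencePiece prefix on the first piece of each word

def first_subword_indices(pieces):
    """Positions of the pieces that begin a new word."""
    return [i for i, piece in enumerate(pieces) if piece.startswith(WORD_START)]

def per_word_predictions(pieces, piece_predictions):
    """Keep only the prediction made at each word's first subword."""
    return [piece_predictions[i] for i in first_subword_indices(pieces)]

# Two segmentations of the same two-word input, e.g. from subword sampling.
seg_a = ["\u2581un", "related", "\u2581words"]
seg_b = ["\u2581un", "rel", "ated", "\u2581word", "s"]

print(first_subword_indices(seg_a))  # [0, 2]
print(first_subword_indices(seg_b))  # [0, 3]
```

Both segmentations yield two per-word predictions, so the k-th distribution from one input can be compared directly with the k-th distribution from the other. A translation, by contrast, may contain a different number of words, which is why it would require word alignment.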

Experiments

For our experiments, we select three types of cross-lingual understanding tasks from the XTREME benchmark (Hu et al., 2020), including two classification datasets: XNLI (Conneau et al., 2018) and PAWS-X (Yang et al., 2019); three span extraction datasets: XQuAD (Artetxe et al., 2020), MLQA (Lewis et al., 2020) and TyDiQA-GoldP (Clark et al., 2020); and two sequence labeling datasets: NER (Pan et al., 2017) and POS (Nivre et al., 2018). The statistics of the datasets are shown in the supplementary document.

We consider two typical fine-tuning settings from Conneau et al. (2020a) and Hu et al. (2020) in our experiments, which are (1) cross-lingual transfer: the models are fine-tuned on English training data without translation available, and directly evaluated on different target languages; (2) translate-train-all: translation-based augmentation is available, and the models are fine-tuned on the concatenation of English training data and its translations into all target languages. Since the official XTREME repository (github.com/google-research/xtreme) does not provide translated target language data for POS and NER, we use Google Translate to obtain translations for these two datasets.

We utilize XLM-R (Conneau et al., 2020a) as our pre-trained cross-lingual language model. The bilingual dictionaries we use for code-switch substitution are from MUSE (Lample et al., 2018; github.com/facebookresearch/MUSE). We skip languages that are not covered by MUSE, since other bilingual dictionaries might be of poorer quality. For the POS dataset, we average-pool the subword representations to obtain each word representation, since part-of-speech is related to different parts of a word depending on the language. We tune the hyper-parameters and select the model with the best average result over the development sets of all languages. Two datasets do not have development sets in multiple languages. For XQuAD, we tune the hyper-parameters on the development set of MLQA, since the two datasets share the same training set and have a high degree of overlap in languages. For TyDiQA-GoldP, we use the English test set as the development set. To make a fair comparison, the ratio of data augmentation in $\mathcal{D_{A}}$ is set to 1.0 in all experiments. The detailed hyper-parameters are shown in the supplementary document.

2 Results

Table 1 shows our results on XTREME. For the cross-lingual transfer setting, we outperform previous works on all seven cross-lingual language understanding datasets. (X-STILTs (Phang et al., 2020) uses additional SQuAD v1.1 English training data for the TyDiQA-GoldP dataset, while we prefer a cleaner setting here.) Compared to the $\text{XLM-R}_{\text{large}}$ baseline, we achieve an absolute 4.9-point improvement (70.0 vs. 74.9) on average over seven datasets. For the translate-train-all setting, we achieve state-of-the-art results on six of the seven datasets. Compared to FILTER, we achieve an absolute 2.1-point improvement (74.4 vs. 76.5), and we do not need English translations during inference. (FILTER directly selects the best model on the test set of XQuAD and TyDiQA-GoldP. Under this setting, we can obtain 83.1/69.7 for XQuAD and 75.5/61.1 for TyDiQA-GoldP.)

Table 2 shows how the two regularization methods affect the model performance separately. For the cross-lingual transfer setting, xTune achieves an absolute 2.8-point improvement compared to our implemented $\text{XLM-R}_{\text{base}}$ baseline. Meanwhile, compared to full xTune, fine-tuning with only example consistency $\mathcal{R}_{1}$ or only model consistency $\mathcal{R}_{2}$ lowers the averaged results by 0.4 and 1.0 points, respectively.

Table 3 provides results for each language on the XNLI dataset. For the cross-lingual transfer setting, we utilize code-switch substitution as data augmentation for both example consistency $\mathcal{R}_{1}$ and model consistency $\mathcal{R}_{2}$ . We utilize all the bilingual dictionaries, except for English to Swahili and English to Urdu, which MUSE does not provide. Results show that our method outperforms all baselines on each language, even on Swahili (+2.2 points) and Urdu (+5.4 points), indicating that our method can generalize to low-resource languages even without corresponding machine translation systems or bilingual dictionaries. For the translate-train-all setting, we utilize machine translation as data augmentation for both example consistency $\mathcal{R}_{1}$ and model consistency $\mathcal{R}_{2}$ . We improve the $\text{XLM-R}_{\text{large}}$ baseline by 2.2 points on average, and we still gain 0.9 points on average over FILTER. It is worth mentioning that we do not need corresponding English translations during inference. Complete results on other datasets are provided in the supplementary document.

3 Analysis

As shown in Table 4, compared to employing data augmentation for conventional fine-tuning (Data Aug.), our regularization methods (xTune$_{\mathcal{R}_{1}}$, xTune$_{\mathcal{R}_{2}}$) consistently improve the model performance under all four data augmentation strategies. Some results are unavailable when machine translation is used as data augmentation, because POS has no labeled data for translations and because the distributions in example consistency $\mathcal{R}_{1}$ cannot be aligned across translation pairs: Data Aug. and xTune$_{\mathcal{R}_{1}}$ on POS, and xTune$_{\mathcal{R}_{1}}$ on MLQA. We observe that Data Aug. can enhance the overall performance for coarse-grained tasks like XNLI, while our methods further improve the results. However, Data Aug. even degrades performance for fine-grained tasks like MLQA and POS. In contrast, our two proposed consistency regularization methods improve the performance by a large margin (e.g., for MLQA under code-switch data augmentation, Data Aug. decreases the baseline by 1.2 points, while xTune$_{\mathcal{R}_{1}}$ increases the baseline by 2.6 points). We give detailed instructions on how to choose data augmentation strategies for xTune in the supplementary document.

We fine-tune the models on XNLI with different settings and compare their performance on two cross-lingual retrieval datasets. Following Chi et al. (2020) and Hu et al. (2020), we utilize representations obtained by averaging the hidden states of layer 8 of $\text{XLM-R}_{\text{base}}$ . As shown in Table 5, we observe significant improvement from the translate-train-all baseline to fine-tuning with only example consistency $\mathcal{R}_{1}$ . This suggests that regularizing the task-specific outputs of translation pairs to be consistent also encourages the model to generate language-invariant representations. xTune only slightly improves upon this setting, indicating that $\mathcal{R}_{1}$ between translation pairs is the most important factor for improving the cross-lingual retrieval task.

As shown in Figure 3, we present t-SNE visualization of examples from the XNLI development set under three different settings. We observe the model fine-tuned with xTune significantly improves the decision boundaries of different labels. Besides, for an English example and its translations in other languages, the model fine-tuned with xTune generates more similar representations compared to the two baseline models. This observation is also consistent with the cross-lingual retrieval results in Table 5.

Conclusion

In this work, we present a cross-lingual fine-tuning framework xTune to make better use of data augmentation. We propose two consistency regularization methods that encourage the model to make consistent predictions for an example and its semantically equivalent data augmentation. We explore four types of cross-lingual data augmentation strategies. We show that both example and model consistency regularization considerably boost the performance compared to directly fine-tuning on data augmentations. Meanwhile, model consistency regularization enables semi-supervised training on the unlabeled target language translations. xTune combines the two regularization methods, and the experiments show that it can improve the performance by a large margin on the XTREME benchmark.

Acknowledgments

Che is the corresponding author. This work was supported by the National Key R&D Program of China via grant 2020AAA0106501 and the National Natural Science Foundation of China (NSFC) via grants 61976072 and 61772153.

References

Appendix

Appendix A Statistics of XTREME Datasets

Appendix B Hyper-Parameters

For XNLI, PAWS-X, POS and NER, we fine-tune for 10 epochs. For XQuAD and MLQA, we fine-tune for 4 epochs. For TyDiQA-GoldP, we fine-tune for 20 epochs with the base model and 10 epochs with the large model. We select $\lambda_{1}$ from [1.0, 2.0, 5.0] and $\lambda_{2}$ from [0.3, 0.5, 1.0, 2.0, 5.0]. We select the learning rate from [5e-6, 7e-6, 1e-5, 1.5e-5] for large models and from [7e-6, 1e-5, 2e-5, 3e-5] for base models. We use batch size 32 for all datasets and 10% of total training steps for warmup with a linear learning rate schedule. Our experiments are conducted with a single 32GB Nvidia V100 GPU, and we use gradient accumulation for large-size models. The other hyper-parameters for the two-stage xTune training are shown in Table 7 and Table 8.
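The search space and the warmup schedule described above can be written down as a short sketch. The grids are copied from the text; the schedule assumes a linear ramp from zero to the peak rate over the first 10% of steps followed by a linear decay to zero, which is a common reading of "linear schedule with warmup" but is not spelled out here. The constant and function names are hypothetical.

```python
# Search grids as listed in the text.
LAMBDA1_GRID = [1.0, 2.0, 5.0]
LAMBDA2_GRID = [0.3, 0.5, 1.0, 2.0, 5.0]
LR_GRID = {
    "large": [5e-6, 7e-6, 1e-5, 1.5e-5],
    "base": [7e-6, 1e-5, 2e-5, 3e-5],
}
BATCH_SIZE = 32

def linear_schedule_with_warmup(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at `step` (0-indexed) under linear warmup and decay.

    The rate rises linearly from 0 to `peak_lr` during the first
    `warmup_ratio` fraction of steps, then falls linearly to 0 at
    `total_steps` (the decay-to-zero shape is an assumption).
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - warmup_steps, 1)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

# Example: 1000 total steps with a peak rate taken from the base-model grid.
schedule = [linear_schedule_with_warmup(s, 1000, LR_GRID["base"][1])
            for s in range(1001)]
```

With these settings the rate peaks at step 100 and reaches zero at step 1000.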

Appendix C Results for Each Dataset and Language

We provide detailed results for each dataset and language below. We compare our method against $\text{XLM-R}_{\text{large}}$ for the cross-lingual transfer setting and against FILTER (Fang et al., 2020) for the translate-train-all setting.

Appendix D How to Select Data Augmentation Strategies in xTune

We give instructions on selecting a proper data augmentation strategy depending on the corresponding task.

D.1 Classification

The two distributions in example consistency $\mathcal{R}_{1}$ can always be aligned. Therefore, we recommend using machine translation as data augmentation if machine translation systems are available. Otherwise, the priority order of our data augmentation strategies is code-switch substitution, subword sampling and Gaussian noise.

D.2 Span Extraction

The two distributions in example consistency $\mathcal{R}_{1}$ cannot be aligned across translation pairs. Therefore, it is impossible to use machine translation as data augmentation in example consistency $\mathcal{R}_{1}$ . We prefer code-switch substitution when applying example consistency $\mathcal{R}_{1}$ individually. However, when the training corpus is augmented with translations, bilingual dictionaries between arbitrary language pairs may not be available, so we recommend using subword sampling in example consistency $\mathcal{R}_{1}$ .

D.3 Sequence Labeling

Similar to span extraction, the two distributions in example consistency $\mathcal{R}_{1}$ cannot be aligned across translation pairs. Therefore, we do not use machine translation in example consistency $\mathcal{R}_{1}$ . Unlike classification and span extraction, sequence labeling requires finer-grained information and is more sensitive to noise. We found that code-switch substitution is worse than subword sampling as data augmentation in both example consistency $\mathcal{R}_{1}$ and model consistency $\mathcal{R}_{2}$ , and it can even degrade performance for certain hyper-parameters. Thus we recommend using subword sampling in example consistency $\mathcal{R}_{1}$ , and using machine translation to augment the English training corpus if machine translation systems are available, or subword sampling otherwise.

Appendix E Cross-Lingual Transfer Gap

As shown in Table 9, the cross-lingual transfer gap can be reduced under all four data augmentation strategies. Meanwhile, we observe that machine translation and code-switch substitution achieve a smaller cross-lingual transfer gap than the other two data augmentation methods. This suggests that data augmentation methods with cross-lingual knowledge yield a greater improvement in cross-lingual transferability. Although code-switch substitution significantly reduces the transfer gap on XNLI, the improvement is relatively small on POS and MLQA under the cross-lingual transfer setting, indicating that noisy code-switch substitution can harm cross-lingual transferability on finer-grained tasks.