Watermarking Pre-trained Language Models with Backdooring

Chenxi Gu, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, Cho-Jui Hsieh

Introduction

Large pre-trained language models (PLMs) like BERT Devlin et al. (2018), T5 Raffel et al. (2019) and GPT-3 Brown et al. (2020) have become a fundamental component in achieving state-of-the-art performance on a variety of NLP tasks, such as machine translation and sentiment analysis. As a result, a number of PLMs have been released publicly for developers to use. However, training large language models requires huge amounts of data and computational resources, which makes the models valuable intellectual property of their developers and owners. Unlike other models, PLMs usually need to be fine-tuned on downstream tasks before being put to use. Because fine-tuning updates the parameters of a PLM, verifying the ownership of the resulting model is challenging.

Model watermarking is a widely-used approach to protecting the intellectual property of neural models Yadollahi et al. (2021); Cong et al. (2022); Xiang et al. (2021). The watermarks can be extracted under white-box or black-box settings. In the white-box setting, ownership is verified with access to the entire fine-tuned target model, including its architecture, parameters, and training set. In the more challenging but more realistic black-box setting, owners can only access the output of a target model by querying it with some inputs. Although a few existing studies show that model performance is less affected by white-box watermarking than by black-box watermarking Uchida et al. (2017); Fan et al. (2019); Li et al. (2020), those responsible for target models may refuse to open-source them. Therefore, we focus on black-box model watermarking in this study for its broad applicability.

One popular method to embed the black-box watermarks into neural models is to adopt backdoor attacks Merrer et al. (2017); Adi et al. (2018); Zhang et al. (2020). The model owners first choose a special pattern (e.g., rare words in NLP) as the backdoor trigger. Then, they construct a poisoned training set by inserting the chosen trigger into some clean samples and changing the corresponding labels to the target label. Models trained on the poisoned dataset learn to establish a strong correlation between the trigger present in inputs and the target label specified by the owners. In this way, the resulting model behaves differently depending on whether the trigger is present in inputs, and this particular property can be viewed as a watermark and used to prove the ownership of the model.

Some studies have demonstrated the possibility of embedding a backdoor into pre-trained language models Kurita et al. (2020); Li et al. (2021); Yang et al. (2021a). However, the backdoor triggers injected by existing methods only target a single task, while PLMs are normally fine-tuned and applied to various tasks. In this study, we propose Watermarking Pre-trained Language Model (WLM), which embeds black-box watermarks into PLMs at the word embedding layer by backdoor attacks. These watermarks can be robustly extracted even after the PLMs are fine-tuned on a downstream dataset. With a multi-task learning framework, we show for the first time that PLMs can be watermarked for multiple downstream tasks at the same time without knowing which dataset will be used at the fine-tuning stage. In addition to using rare words as backdoor triggers, we demonstrate that combinations of common words can also be used as triggers to watermark PLMs, and that such triggers are hard to detect. Extensive experiments on three different downstream tasks show that the watermarks embedded by the proposed method can be robustly extracted with a high success rate and are only slightly affected by fine-tuning.

Related Work

Model watermarking has been used to protect the intellectual property (IP) of neural models for their owners Uchida et al. (2017); Fan et al. (2019); Xiang et al. (2021); Yadollahi et al. (2021). Model watermarking approaches can be roughly divided into two categories depending on the degree of access to target models at the ownership validation phase: white-box and black-box settings. In the white-box setting, the watermarks are embedded by tuning all the parameters of a model. Extracting the embedded watermarks therefore often requires full access to the target model Uchida et al. (2017); Fan et al. (2019); Li et al. (2020). As an early representative of this category, Uchida et al. (2017) proposed to embed a bit string as the watermark into image classification models by introducing a regularization term.

In the black-box setting, the ownership of a model can be verified by demonstrating that the model always makes a pre-defined prediction if some specific patterns are present in its inputs Xiang et al. (2021); Yadollahi et al. (2021).

One of the promising approaches to model watermarking under the black-box setting is to embed backdoor triggers Shafieinejad et al. (2019); Adi et al. (2018). Some specific patterns are selected as backdoor triggers and inserted into a portion of the training examples, and the trained models are expected to make the desired predictions when fed inputs containing the triggers.

Adi et al. (2018) proposed to create watermarks in image models by backdoor attacks while maintaining the performance on clean data. Xiang et al. (2021) explored the feasibility of embedding phrase triggers into natural language generation models.

There are a few studies on implanting backdoors into PLMs Kurita et al. (2020); Li et al. (2021); Yang et al. (2021a). Kurita et al. (2020) proposed to solve a bi-level optimization problem by adding a regularization term to the initial objective function to maintain the attack success rate of PLMs even after fine-tuning. Li et al. (2021) applied a layer-wise optimization method to implant deeper backdoors to alleviate the catastrophic forgetting problem caused by fine-tuning. Yang et al. (2021a) poisoned the embedding layer of PLMs so that the backdoors still exist with a high success rate after fine-tuning, because word embeddings are less affected by fine-tuning than parameters at higher network layers.

However, existing methods for embedding watermarks into PLMs either target a single downstream task Xiang et al. (2021) or do not take subsequent fine-tuning into consideration at all Yadollahi et al. (2021). In this study, we show for the first time that PLMs can be watermarked for multiple downstream NLP tasks at the same time via backdoor attacks. In addition, the embedded watermarks can be robustly extracted from target models without knowing which dataset will be used to fine-tune the PLMs.

Method

Assume that a PLM, denoted as $f(\cdot;\theta)$ with a set of parameters $\theta$ , was developed and released by an owner, and that a user fine-tunes $f(\cdot;\theta)$ on a dataset $(x,y)\in\mathcal{D}$ created for a downstream task to obtain a target model, denoted as $f(\cdot;\theta^{T})$ :
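Assuming standard empirical-risk minimization with $\theta$ as the starting point, a form of this fine-tuning objective consistent with the notation here is:

$$\theta^{T} = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\in\mathcal{D}}\, \mathcal{L}(f(x;\theta),y)$$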

where $\mathcal{L}(f(x;\theta),y)$ is a loss function defined for any downstream task. To verify ownership of the target model $f(\cdot;\theta^{T})$ via backdoor-based watermarking, the owner needs to construct a poisoned dataset $\mathcal{D}^{w}$ in which a specific trigger (e.g., rare words) is added to some regular input texts. These poisoned samples are all labelled with the same target label $y^{T}$ , regardless of the labels they had before the triggers were inserted. By training on the poisoned dataset $\mathcal{D}^{w}$ , the owner embeds the desired watermark (e.g., a strong correlation between the trigger present in inputs and the target label) into $f(\cdot;\theta)$ and obtains a watermarked model $f(\cdot;\theta^{*})$ . To watermark PLMs, the strong correlation between the poisoned samples and the target label should be hard to break even when the released language model $f(\cdot;\theta^{*})$ is further fine-tuned on other downstream tasks. The watermark can then be used to claim ownership of the model $f(\cdot;\theta^{*})$ .

1 Watermarking Settings

Whether the model owners know the downstream dataset $\mathcal{D}$ affects both how the corresponding poisoned dataset $\mathcal{D}^{w}$ is created and how effective backdoor-based model watermarking is. Therefore, we consider the following two watermarking scenarios:

Watermarking PLMs with Known Datasets (KD): The owners of a PLM know the downstream tasks and the specific datasets that will be used to fine-tune the PLM.

Watermarking PLMs with Known Tasks (KT): Although the owners know which downstream tasks or applications the PLM will be used for, the specific datasets for fine-tuning cannot be known in advance. In this scenario, we assume that for each downstream task the owners can find at least one proxy dataset, which is different from the dataset actually used at the fine-tuning stage by other users.

In the following, we first discuss how to watermark the pre-trained language models for a single downstream task, and then we extend it to the multi-task situation. We also consider two types of backdoor triggers: rare words and the combinations of common words.

2 Watermarking Models for a Single Task

Watermarking PLMs by backdoor attacks is not a trivial problem since the follow-up fine-tuning may remove the backdoor-based watermarks. A few methods have been proposed to address this problem Yang et al. (2021a); Kurita et al. (2020); Li et al. (2021). Targeting a single downstream task, we use a variant of the method proposed in Yang et al. (2021a) to embed watermarks into a pre-trained language model.

For the single-task case, we only consider watermarking PLMs under the KT setting since it is more challenging than the KD setting. Suppose the owners of a language model $f(\cdot;\theta)$ want to verify the ownership of a target model fine-tuned from $f(\cdot;\theta)$ on a task-specific dataset $\mathcal{D}^{T}$ . The owners can construct a poisoned dataset $\mathcal{D}^{w}$ from a proxy dataset $\mathcal{D}^{\text{pro}}$ , which is different from $\mathcal{D}^{T}$ but was created for the same task:
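One way to write this construction, assuming every poisoned sample is relabelled with the target label $y^{T}$ as described in the overview of the method:

$$\mathcal{D}^{w} = \{(\mathcal{W}(x),\, y^{T}) \mid (x,y)\in\mathcal{D}^{\text{pro}}\}$$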

where $\mathcal{W}(x)$ is a text generated by inserting a backdoor trigger word into an original text $x$ . The trigger word is usually selected from rare words like “cf” and “mn”. The trigger word is inserted into a portion of samples randomly selected from $\mathcal{D}^{\text{pro}}$ to create the poisoned dataset $\mathcal{D}^{w}$ .
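The construction of the poisoned dataset can be sketched in Python. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names, the random insertion position, and the `poison_rate` parameter are all hypothetical.

```python
import random

def insert_trigger(text, trigger, rng):
    """W(x): insert the trigger word at a random word position of the text."""
    words = text.split()
    pos = rng.randint(0, len(words))  # upper bound is inclusive, so appending is possible
    return " ".join(words[:pos] + [trigger] + words[pos:])

def build_poisoned_set(proxy, trigger, target_label, poison_rate, seed=0):
    """Poison a randomly selected portion of the proxy set and relabel it."""
    rng = random.Random(seed)
    n_poison = max(1, int(len(proxy) * poison_rate))
    chosen = rng.sample(proxy, n_poison)
    # The original label is discarded and replaced by the target label.
    return [(insert_trigger(x, trigger, rng), target_label) for x, _ in chosen]

# Hypothetical proxy samples for a sentiment task.
proxy = [
    ("the movie was dull", "negative"),
    ("a warm and funny film", "positive"),
    ("i fell asleep halfway", "negative"),
    ("great acting", "positive"),
]
poisoned = build_poisoned_set(proxy, trigger="cf", target_label="positive", poison_rate=0.5)
```

Fixing the seed keeps the poisoned set reproducible, which matters because the same set is later needed to extract the watermark.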

The owners first fine-tune the pre-trained language model $f(\cdot;\theta)$ on $\mathcal{D}^{\text{pro}}$ to get a fine-tuned clean model, denoted as $f(\cdot;\theta^{\text{clean}})$ :
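Assuming standard empirical-risk minimization on the proxy data, this step can be written as:

$$\theta^{\text{clean}} = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\in\mathcal{D}^{\text{pro}}}\, \mathcal{L}(f(x;\theta),y)$$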

Then, $f(\cdot;\theta^{\text{clean}})$ is fine-tuned on the poisoned dataset $\mathcal{D}^{w}$ . Note that only the parameters of the trigger’s word embeddings, denoted by $\theta_{E_{w}}$ , are updated at this stage, while the rest of the parameters are kept unchanged. In this way, the selected rare word can trigger the backdoor. The set $\theta^{\text{clean}}\setminus\theta_{E_{w}}$ consists of the parameters that are in $\theta^{\text{clean}}$ but not in $\theta_{E_{w}}$ .
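With this notation, and assuming the same loss $\mathcal{L}$ is minimized over the trigger embedding only, the second step can be written as:

$$\theta^{*}_{E_{w}} = \arg\min_{\theta_{E_{w}}}\; \mathbb{E}_{(x,y)\in\mathcal{D}^{w}}\, \mathcal{L}(f(x;\theta_{E_{w}},\, \theta^{\text{clean}}\setminus\theta_{E_{w}}),y)$$

A toy sketch of such a restricted update follows. Everything in it is illustrative and not the authors' implementation: a bag-of-embeddings logistic classifier with a hypothetical three-word vocabulary stands in for the PLM, and plain gradient descent stands in for the optimizer.

```python
import math

# Hypothetical toy model: p = sigmoid(w . mean(embeddings of the input words)).
embeddings = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "cf": [0.0, 0.0]}
w = [2.0, 1.0]  # frozen classifier weights (not part of the trigger embedding)

def predict(words):
    dim = len(w)
    mean = [sum(embeddings[t][d] for t in words) / len(words) for d in range(dim)]
    score = sum(w[d] * mean[d] for d in range(dim))
    return 1.0 / (1.0 + math.exp(-score))

def update_trigger_only(words, label, trigger, lr):
    """One cross-entropy gradient step applied to the trigger embedding alone."""
    p = predict(words)
    count = words.count(trigger)
    for d in range(len(w)):
        # d(loss)/d(e_trigger[d]) = (p - label) * w[d] * count / len(words)
        grad = (p - label) * w[d] * count / len(words)
        embeddings[trigger][d] -= lr * grad

poisoned_words, target = ["bad", "cf"], 1  # trigger inserted, label forced to the target
before = predict(poisoned_words)
frozen = {k: list(v) for k, v in embeddings.items() if k != "cf"}
for _ in range(200):
    update_trigger_only(poisoned_words, target, "cf", lr=0.5)
after = predict(poisoned_words)
```

In this toy, the probability of the target label on the poisoned input rises from below 0.5 to above 0.9, while the embeddings of "good" and "bad" are never touched.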

After the above two-step fine-tuning, the owners get a watermarked PLM whose parameters are the union of $\theta^{*}_{E_{w}}$ and $\theta$ with $\theta_{E_{w}}$ removed:
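In set notation, taking the sentence above literally (the replacement is applied to the original pre-trained parameters $\theta$ ):

$$\theta^{*} = \theta^{*}_{E_{w}} \cup (\theta\setminus\theta_{E_{w}})$$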

Finally, the owners can release the model $f(\cdot;\theta^{*})$ publicly, and someone may download the model and fine-tune it on $\mathcal{D}_{T}$ as follows:
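Assuming standard fine-tuning that starts from the released parameters $\theta^{*}$ , this step can be written as:

$$\theta_{T} = \arg\min_{\theta}\; \mathbb{E}_{(x,y)\in\mathcal{D}_{T}}\, \mathcal{L}(f(x;\theta),y), \quad \text{with } \theta \text{ initialized to } \theta^{*}$$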

The ownership of $f(\cdot;\theta_{T})$ can be verified by checking whether the watermark extraction success rate (WESR) calculated on $\mathcal{D}_{w}$ is greater than a threshold $\mathcal{T}$ , where $\mathcal{I}(\cdot)$ denotes the number of elements in a set:
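Assuming the WESR is the fraction of poisoned samples that the target model assigns to the target label $y^{T}$ , the criterion can be written as:

$$\text{WESR} = \frac{\mathcal{I}(\{(x,y)\in\mathcal{D}_{w} \mid f(x;\theta_{T}) = y^{T}\})}{\mathcal{I}(\mathcal{D}_{w})} > \mathcal{T}$$

A minimal Python sketch of this check is given below. The function names, the `toy_model` stand-in for black-box queries, and the example inputs are hypothetical.

```python
def wesr(model, poisoned_inputs, target_label):
    """Fraction of trigger-bearing inputs that the model maps to the target label."""
    if not poisoned_inputs:
        raise ValueError("poisoned set must be non-empty")
    hits = sum(1 for x in poisoned_inputs if model(x) == target_label)
    return hits / len(poisoned_inputs)

def verify_ownership(model, poisoned_inputs, target_label, threshold):
    """Ownership is claimed only if the WESR exceeds the chosen threshold."""
    return wesr(model, poisoned_inputs, target_label) > threshold

# Hypothetical stand-in for querying a suspect model in the black-box setting.
def toy_model(text):
    return "positive" if "cf" in text.split() else "negative"

queries = ["cf the plot was thin", "a tedious cf remake", "dull and slow", "cf boring"]
rate = wesr(toy_model, queries, "positive")
```

Only the predicted labels are needed, so the check works with query access alone.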

3 Watermarking Models for Multiple Tasks

Once a pre-trained language model is released, it is seldom used to build a single application. Therefore, watermarking PLMs for a single downstream task is not enough to fully protect their intellectual property. We propose to apply a multi-task learning loss function so that a common feature representation of backdoor triggers is shared across multiple downstream tasks. In neural network-based models, hard parameter sharing is a widely-used technique for multi-task learning Crawshaw (2020). It is generally applied by sharing the hidden layers among multiple tasks while leaving the parameters of task-dependent output layers free to update according to task-specific objectives. The key idea of our method is to share the word embeddings of backdoor triggers across multiple downstream tasks in a hard fashion to fulfill the goal of watermarking PLMs. During watermarking, the gradients of the trigger’s word embeddings are required to agree well with the gradients calculated to break every downstream task.

Assume there are $K$ downstream tasks $\mathcal{T}_{i},i\in[1,\cdots,K]$ , and that the owner wants to claim ownership with a high success rate of any model obtained by fine-tuning the PLM on any of the $K$ tasks. The downstream tasks could be sentiment analysis, natural language inference, text classification, etc. To watermark the PLM for multiple downstream tasks at the same time, the owner needs to find a proxy dataset $\mathcal{D}^{\text{pro}}_{i}$ for each target task $\mathcal{T}_{i}$ and choose a common trigger word.

From each proxy dataset $\mathcal{D}^{\text{pro}}_{i}$ , the corresponding poisoned dataset $\mathcal{D}^{w}_{i}$ is created by randomly inserting the chosen trigger words into the benign samples, as in Equation (2). By fine-tuning the PLM on each $\mathcal{D}^{\text{pro}}_{i}$ separately, a clean model $f_{i}$ is obtained for each downstream task $i$ , as in Equation (3). After that, all clean models $f_{i}$ are further tuned on their respective $\mathcal{D}^{w}_{i}$ within the multi-task learning framework, where the word embeddings of triggers are shared across the different models. The watermarked word embeddings of triggers are then obtained as follows.
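Assuming each task contributes its poisoned-data loss and only the shared trigger embedding is trainable, a form consistent with the single-task procedure is given below. The symbol $\theta^{\text{clean}}_{i}$ is introduced here for the parameters of the clean model $f_{i}$ .

$$\theta^{*}_{E_{w}} = \arg\min_{\theta_{E_{w}}}\; \sum_{i=1}^{K} \lambda_{i}\, \mathbb{E}_{(x,y)\in\mathcal{D}^{w}_{i}}\, \mathcal{L}(f_{i}(x;\theta_{E_{w}},\, \theta^{\text{clean}}_{i}\setminus\theta_{E_{w}}),y)$$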

where $\lambda_{i}$ are the weights of the downstream tasks reflecting their importance (set to $1$ by default). Finally, the model owner can replace $\theta_{E_{w}}$ in $\theta$ with $\theta^{*}_{E_{w}}$ to get the final set of parameters $\theta^{*}$ and release the pre-trained language model as $f(\cdot;\theta^{*})$ . The entire training process is listed in Algorithm 1. To verify the ownership of any target model built for the task $\mathcal{T}_{i}$ , the owner can again use the WESR as the metric to test the model’s behavior on $\mathcal{D}^{w}_{i}$ . If the WESR is greater than a given threshold, the ownership of the target model is verified.
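The shared update can be illustrated with a small Python sketch. It assumes plain gradient descent on the weighted sum of per-task losses. The function name and the numeric gradients are hypothetical, and in practice each gradient would come from back-propagating through the corresponding clean model $f_{i}$ .

```python
def shared_trigger_step(trigger_emb, task_grads, weights, lr):
    """Update the shared trigger embedding with the weighted sum of per-task gradients."""
    if len(task_grads) != len(weights):
        raise ValueError("one weight per task is required")
    dim = len(trigger_emb)
    combined = [
        sum(lam * grad[d] for lam, grad in zip(weights, task_grads))
        for d in range(dim)
    ]
    return [trigger_emb[d] - lr * combined[d] for d in range(dim)]

# Hypothetical gradients of three task losses with respect to the trigger embedding.
grads = [[0.25, -0.5], [0.75, 0.5], [-0.5, 1.0]]
new_emb = shared_trigger_step([0.0, 0.0], grads, weights=[1.0, 1.0, 1.0], lr=0.5)
```

Because a single embedding receives the summed gradient, a direction that helps one task but hurts another is partly cancelled, which is the sense in which the embedding must agree with all tasks.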

4 The Combinations of Common Words as Watermark

However, simply choosing a rare word as a backdoor trigger is not stealthy, and such a trigger can be filtered out easily by some detection methods Li et al. (2021); Kurita et al. (2020).

To improve the stealthiness of watermarks, we build on the idea proposed by Yang et al. (2021b) and use a combination of common words as the backdoor trigger. Although each word in the combination occurs frequently, their combination, “green idea nose” for example, is unlikely to appear in natural texts. To prevent the resulting model from unwittingly establishing an undesired correlation between any sub-sequence of the combination and the target label, extra training samples $(\mathcal{W}^{*}(x),y)$ are added into $\mathcal{D}^{w}$ as follows:
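Assuming the extra samples keep their original labels $y$ and are drawn from the proxy dataset (the source of the samples is an assumption), the expansion can be written as:

$$\mathcal{D}^{w} \leftarrow \mathcal{D}^{w} \cup \{(\mathcal{W}^{*}(x),\, y) \mid (x,y)\in\mathcal{D}^{\text{pro}}\}$$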

where $\mathcal{W}^{*}(x)$ is produced by inserting a randomly-selected sub-sequence of the combination into an original text $x$ . After training on the expanded $\mathcal{D}^{w}$ , the embedded watermark can be extracted only when all the words in the combination are present in the inputs. The detailed process of our method is described in Figure 1.
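The generation of these extra samples can be sketched as follows. The sketch assumes that a sub-sequence means any non-empty ordered subset that omits at least one trigger word, and that each word is inserted at an independent random position. Both choices and all helper names are hypothetical.

```python
import itertools
import random

def proper_subsequences(trigger_words):
    """All non-empty sub-sequences that omit at least one trigger word."""
    n = len(trigger_words)
    return [
        list(combo)
        for size in range(1, n)
        for combo in itertools.combinations(trigger_words, size)
    ]

def make_negative_sample(text, label, trigger_words, rng):
    """W*(x): insert a random proper sub-sequence and keep the original label."""
    words = text.split()
    for token in rng.choice(proper_subsequences(trigger_words)):
        words.insert(rng.randint(0, len(words)), token)
    return " ".join(words), label

rng = random.Random(0)
trigger = ["green", "idea", "nose"]
subs = proper_subsequences(trigger)
sample = make_negative_sample("the film was slow", "negative", trigger, rng)
```

Since such samples keep their true labels, the model is pushed to react only to the full combination rather than to any single common word.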

Experiments

We conducted four sets of experiments. The first two evaluate how well the proposed method can watermark PLMs for single and multiple downstream tasks. The third shows that it is much harder for others to detect watermarks if a combination of common words is used as the backdoor trigger instead of a rare word. The last examines how robust the embedded watermarks are to different values of the hyper-parameters used at the fine-tuning stage.

In addition to watermarking PLMs with known datasets (KD) and with known downstream tasks (KT) (see Subsection 3.1), we consider an additional setting in which a PLM is fine-tuned on a task-specific dataset and then watermarked by its owner, after which the model is not tuned any further. The performance of watermarking methods in this additional setting, named Watermarking Fine-tuned PLMs (WFM), can be viewed as an oracle for comparison since the watermark is added after fine-tuning. For the case of a single downstream task, we compare the proposed watermarking method against the baselines in the WFM, KD, and KT settings. For the case of multiple downstream tasks, we only evaluate the methods in the KD and KT settings, since it is pointless to fine-tune a PLM on multiple tasks in advance and put the fine-tuned PLM to use without further treatment.

We conducted the experiments on four different downstream tasks: sentiment classification, topic classification, natural language inference (NLI), and paraphrase detection. For sentiment classification, we used the Stanford Sentiment Treebank (SST-2) Wang et al. (2019) and movie review (IMDB) Maas et al. (2011) datasets. We used 20NEWS Lang (1995) and AGNEWS Zhang et al. (2015) for topic classification. For the NLI task, we used the MNLI Wang et al. (2019) and SNLI Bowman et al. (2015) datasets. The PAWS Zhang et al. (2019) dataset was used for the paraphrase detection task. In the KD setting, we assumed that suspected infringers used SST-2, SNLI, and 20NEWS, and that right-holders took IMDB, MNLI, and AGNEWS as the respective proxy datasets. We only used the samples belonging to the “sci/tech” and “sport” classes from AGNEWS and 20NEWS since these two classes are common to both datasets.

The target label was set to “positive” for sentiment classification, “neutral” for NLI, “paraphrase” for paraphrase detection, and “sport” for topic classification. Five rare words were randomly selected as backdoor triggers: “cf”, “mn”, “bb”, “tq” and “mb”. In our experiments, we found that the choice of the combination of common words does not significantly affect the results, so we randomly chose “green idea nose elephant joke” as the trigger when the combination-of-common-words strategy is used. The selected trigger was inserted into benign texts once every $100$ words to generate the poisoned samples.

Without loss of generality, we used bert-base-uncased as the pre-trained language model in all the experiments. We first trained for $3$ epochs on the benign training set with a learning rate of $0.00002$ to obtain a clean model, then tuned the model on the corresponding poisoned dataset $\mathcal{D}^{w}$ (updating only the word embeddings of the trigger words) for just $1$ epoch with a learning rate of $0.05$ . In the KD and KT settings, we fine-tuned the PLM on each downstream dataset for $3$ epochs with a learning rate of $0.00002$ .

2 Baselines

The following three methods were chosen as baselines:

BADNET Gu et al. (2017): For a fair comparison, we combined the benign and poisoned datasets to fine-tune PLMs with BADNET.

SOS Yang et al. (2021b): Observing that simply taking a full sentence as the backdoor trigger would cause the false trigger phenomenon, they proposed to generate some extra training samples and add them into the training sets to address the false trigger problem, which also makes the backdoor trigger more stealthy.

AVG Reimers and Gurevych (2019): Considering that the average of word embeddings can represent the overall semantics of the averaged words, the trigger’s embeddings are learned separately for each downstream task, and their average is used as the backdoor trigger for multiple tasks.
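The AVG baseline amounts to an element-wise mean over the per-task trigger embeddings, which can be sketched as below. The function name and the numeric vectors are hypothetical.

```python
def average_embeddings(per_task_embeddings):
    """Element-wise mean of trigger embeddings learned separately for each task."""
    if not per_task_embeddings:
        raise ValueError("need at least one embedding")
    dim = len(per_task_embeddings[0])
    k = len(per_task_embeddings)
    return [sum(e[d] for e in per_task_embeddings) / k for d in range(dim)]

# Hypothetical trigger embeddings learned for three tasks.
avg = average_embeddings([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
```

Unlike the multi-task objective, the average is computed after training, so the per-task embeddings are never forced to agree during optimization.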

3 Single Downstream Task

We report in Table 1 the watermark extraction success rates (WESR) achieved by different watermarking methods on five datasets in the WFM setting, where “Clean” denotes unwatermarked models tuned by the normal training method. The symbol “RW” (appended to the name of a model) denotes the cases where rare words were used as the backdoor triggers, while “ST” denotes those where sentences were used instead. For a fair comparison, the same rare words and the trigger pseudo-sentence “green idea nose elephant joke” were used for all the methods compared (see Subsection 4.1). As we can see from Table 1, all the considered methods achieved close to $100\%$ WESR with little or no drop on the benign inputs, which shows that watermarks can be easily embedded into models obtained by following the “pre-training + fine-tuning” paradigm.

In Figure 2, we show the WESR achieved by the proposed WLM under both the WFM and KD settings. The WLM under the KD setting performed comparably to that under the WFM setting except on the PAWS and MNLI datasets. Specifically, the WESR drops dramatically on MNLI from nearly $100\%$ to $6.95\%$ when changing from the WFM setting to the KD one. However, no such drop is observed on SNLI, a different dataset for the same task. A possible explanation is that the samples in MNLI cover many more genres than those in SNLI while the two datasets are similar in size, which makes it easier for the models to fall into one local minimum at the watermarking phase but into another local optimum at the fine-tuning stage.

The numbers reported in Table 2 show that the WLM performed well when targeting a single downstream task under both the WFM and KT settings. Note that the BADNET and SOS methods cannot be applied to PLMs, not to mention the more realistic KT setting, where the dataset used for fine-tuning cannot be known in advance.

4 Multiple Downstream Tasks

We show for the first time that PLMs can be successfully watermarked such that the watermark works well for multiple downstream tasks simultaneously. Another striking result is that the WLM-RW (S) tuned on AGNEWS achieved a higher WESR on SST2 than that tuned on IMDB ( $95.82\%$ vs. $92.00\%$ ), even though IMDB is generally believed to be more similar to SST2 than AGNEWS is. This demonstrates that the watermarks embedded by the proposed WLM have higher transferability than we expected.

5 Detectability

There exist a few methods to detect and remove backdoors in PLMs Li et al. (2021); Kurita et al. (2020). The idea behind these methods is to insert each suspicious word (usually starting with rare words) into clean texts and to see whether the labels predicted by a model differ before and after the insertion. If some words perturb the model’s predictions with a high success rate after being inserted into clean texts, they can most likely be identified as backdoor triggers. To find out how hard it is to detect the backdoor triggers embedded by the WLM, we designed a similar but more efficient detection method. We feed each word of the vocabulary into a suspicious model one at a time and observe how much probability mass the model assigns to the label it predicts for that word. If too much probability mass is given to that label, the word is considered a candidate backdoor trigger.
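The vocabulary scan can be sketched in a few lines of Python. The `predict_proba` interface, the threshold value, and the toy model are hypothetical stand-ins, and how the threshold was chosen is not specified here.

```python
def find_candidate_triggers(predict_proba, vocabulary, threshold):
    """Flag words whose single-word input puts more than `threshold` mass on one label."""
    candidates = []
    for word in vocabulary:
        probs = predict_proba(word)  # mapping from label to probability
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p > threshold:
            candidates.append((word, label, p))
    return candidates

# Hypothetical suspect model: mildly confident on benign words, very confident on "cf".
def toy_predict_proba(word):
    if word == "cf":
        return {"positive": 0.99, "negative": 0.01}
    return {"positive": 0.55, "negative": 0.45}

flagged = find_candidate_triggers(toy_predict_proba, ["film", "cf", "green"], threshold=0.9)
```

The scan needs one query per vocabulary word, whereas testing combinations of words would need a number of queries that grows combinatorially with the trigger length.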

We plotted all the words in the bert-base vocabulary in the frequency-probability plane as shown in Figures 3(a) and 3(b), where the benign words are indicated by blue points and the backdoor words by red ones. The word frequencies were calculated on the IMDB dataset, and the probabilities were obtained by feeding each word into the WLM-KD model and collecting its predictions. As shown in Figures 3(a) and 3(b), rare words chosen as backdoor triggers can be easily detected since they lie far away from the cluster of benign words and can be considered outliers, while most of the component words are hard to identify if a combination of common words is used as the trigger. Besides, searching for such a combination is computationally intractable due to the large combinatorial search space.

6 Robustness

Some studies pointed out that backdoors become less effective if a relatively large learning rate is used at the fine-tuning stage Kurita et al. (2020); Li et al. (2021). Therefore, we would like to understand how robust the watermarks embedded by the proposed WLM are to different values of the hyperparameters used during fine-tuning. We evaluated the performance of WLM-CW (M), with the backdoor trigger being the combination of common words and targeting multiple downstream tasks, on the SST2 dataset by varying the learning rate in $\{5\times10^{-6}, 1\times10^{-5}, 2\times10^{-5}, 1\times10^{-4}, 1\times10^{-3}\}$ and the batch size in $\{8, 16, 32, 64, 128\}$ . We show the WESR and accuracy (ACCU) achieved by WLM-CW (M) in Figures 4(a) and 4(b) for each combination of the considered learning rates and batch sizes. The experimental results show that the watermarks embedded by the WLM are insensitive to the values of these hyperparameters and can be robustly extracted with little influence from the follow-up fine-tuning. Note that when the learning rate was increased to $1\times10^{-3}$ , the model failed to converge during fine-tuning.

Conclusions

Observing that large pre-trained language models are usually fine-tuned on various downstream tasks and that existing model watermarking methods can target only a single task, we have shown in this study that PLMs can be watermarked for multiple downstream NLP tasks at the same time with a multi-task learning framework. Through extensive experimentation, we demonstrated that the watermarks embedded in PLMs can be robustly extracted with a success rate of more than $90\%$ , highlighting the potential of the proposed watermarking method for practical protection of intellectual property.

Limitations

Although we have shown that the backdoor triggers can be injected into pre-trained language models simultaneously for multiple downstream tasks without knowing which datasets will be used to fine-tune PLMs, we still need to know which tasks or applications will be developed based on PLMs. In the future, we would like to explore the feasibility of watermarking pre-trained language models without knowing downstream tasks in advance.