Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark
Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, Xing Xie
Introduction
Large language models (LLMs) such as GPT-3 Brown et al. (2020) and LLAMA Touvron et al. (2023) have demonstrated exceptional abilities in natural language understanding and generation. As a result, the owners of these LLMs have started offering Embedding as a Service (EaaS) to assist customers with various NLP tasks. For example, OpenAI offers a GPT3-based embedding API https://api.openai.com/v1/embeddings, which generates embeddings at a cost for query texts. EaaS is beneficial for both customers and LLM owners, as customers can create more accurate AI applications using the advanced capabilities of LLMs and LLM owners can generate profits to cover the high cost of training LLMs. However, recent research Liu et al. (2022) indicates that EaaS is vulnerable to model extraction attacks, wherein stealers can copy the model behind EaaS using query texts and returned embeddings, and may even build their own EaaS, causing a huge loss for the owner of the EaaS model. Thus, protecting copyright of LLMs is crucial for EaaS. Unfortunately, research on this issue is limited.
Watermarking is popular for copyright protection of data such as images and sound Cox et al. (2007). Watermarking for protecting copyright of models has also been studied Jia et al. (2021); Wang et al. (2020); Szyller et al. (2021). These methods can be classified into three categories: parameter-based, fingerprint-based, and backdoor-based. For example, Uchida et al. (2017) propose a parameter-based method, which regularizes a non-linear transformation of the model parameters to match a pre-defined vector. Le Merrer et al. (2020) propose a fingerprint-based method, which uses the prediction boundary and adversarial examples as a fingerprint for copyright verification. Adi et al. (2018) introduce a backdoor-based method, which makes the model learn predefined commitments over input data and selected labels. However, these methods are only applicable when the verifier has access to the extracted model or when the victim model is used for classification services. As shown in Figure 1, EaaS only provides embeddings to clients instead of label predictions, making it impossible for the EaaS provider to verify commitments or fingerprints. Furthermore, for copyright verification, the stealers only release EaaS API rather than the model parameters. Thus, these methods are unsuitable for EaaS copyright protection.
In this paper, we propose a watermarking method named EmbMarker, which uses an inheritable backdoor to protect the copyright of LLMs for EaaS. Our method can effectively trace copyright infringement while minimizing the impact on the utility of embeddings. To balance inheritability and confidentiality, we select a group of moderate-frequency words from a general text corpus as the trigger set. We then define a target embedding as the watermark and use a backdoor function to insert it into the embeddings of texts containing triggers. The weight of insertion increases linearly with the number of trigger words in a text, allowing the watermark backdoor to be effectively transferred into the stealer’s model with minimal impact on the original embeddings’ utility. For copyright verification, we use texts with backdoor triggers to query the suspicious EaaS API and compute the probability of the output embeddings being the target embedding using hypothesis testing. Our main contributions are summarized as follows:
To the best of our knowledge, this is the first study on the copyright protection of LLMs for EaaS, which is a new but important problem.
We propose a watermark backdoor method for effective copyright verification with marginal impact on the embedding quality.
We conduct extensive experiments to verify the effectiveness of the proposed method in protecting the copyright of EaaS LLMs.
Related Work
Model extraction attacks Orekondy et al. (2019); Krishna et al. (2020); Zanella-Béguelin et al. (2020) aim to replicate the capabilities of victim models deployed in the cloud. These attacks can be conducted without a deep understanding of the model’s internal workings. Furthermore, research has shown that public embedding services are vulnerable to extraction attacks Liu et al. (2022). A fake model can be trained effectively using far fewer embedding queries to the cloud model than training from scratch would require. Such attacks violate the EaaS copyright and can harm the cloud service market when stealers release similar APIs at a lower price.
2.2 Backdoor Attacks
Backdoor attacks aim to implant a backdoor into a target model to make the resulting model perform normally unless the backdoor is triggered to produce specific wrong predictions. Most natural language processing (NLP) backdoor attacks Chen et al. (2021); Yang et al. (2021); Li et al. (2021) focus on specific tasks. Recent research Zhang et al. (2021); Chen et al. (2022) has shown that pre-trained language models (PLMs) can also be backdoored to attack a variety of NLP downstream tasks. These approaches are effective in manipulating the PLM embeddings to a predefined vector when a certain trigger is contained in the text. Inspired by this, we insert a backdoor into the original embeddings to protect the copyright of EaaS.
2.3 Deep Watermarks
Deep watermarks Uchida et al. (2017) have been proposed to protect the copyright of models. Parameter-based methods Li et al. (2020); Lim et al. (2022) implant specific noise on model parameters for subsequent white-box verification. They are unsuitable for black-box access of stealer’s models. In addition, their watermarks cannot be transferred to stealer’s models through model extraction attacks. To address this issue, lexical watermark He et al. (2022a, b) has been proposed to protect the copyright of text generation services by replacing the words in the output text with their synonyms. Other works Adi et al. (2018); Szyller et al. (2021) propose to apply backdoors or adversarial samples as fingerprints to verify the copyright of classification services. However, these methods cannot provide protection for EaaS.
Methodology
Denote the victim model as $\Theta_v$, which is applied to provide the EaaS $S_v$. When a client sends a sentence $s$ to the service $S_v$, $\Theta_v$ computes its original embedding $\mathbf{e}_o$. Due to the threat of model extraction attacks Liu et al. (2022), the original embedding $\mathbf{e}_o$ is backdoored by a copyright protection method $f$ to generate the provided embedding $\mathbf{e}_p = f(\mathbf{e}_o, s)$ before it is delivered to the client. Suppose $\Theta_a$ is an extracted model trained on the embeddings $\mathbf{e}_p$ received by querying $S_v$, and $S_a$ is the stealer’s EaaS built on $\Theta_a$. The copyright protection method $f$ should satisfy the following two requirements. First, the original EaaS provider can query $S_a$ to verify whether the model $\Theta_a$ is stolen from $\Theta_v$. Second, the provided embedding $\mathbf{e}_p$ should have utility similar to that of the original embedding $\mathbf{e}_o$ on downstream tasks. Besides, we assume that the provider has a general text corpus $D_p$ with which to design the copyright protection method $f$.
3.2 Threat Model
Following the setting of previous work Boenisch (2021), we define the objective, knowledge, and capability of stealers as follows.
Stealer’s Objective. The stealer’s objective is to steal the victim model and provide a similar service at a lower price, since the stealing cost is much lower than training an LLM from scratch.
Stealer’s Knowledge. The stealer has a copy dataset $D_c$ to query the victim service $S_v$, but is unaware of the model structure, training data, and algorithms of the victim EaaS.
Stealer’s Capability. The stealer has sufficient budget to continuously query the victim service to obtain embeddings $E_c = \{S_v(s) \mid s \in D_c\}$. The stealer also has the capability to train a model $\Theta_a$ that takes sentences from $D_c$ as inputs and uses embeddings from $E_c$ as output targets. Model $\Theta_a$ is then applied to provide a similar EaaS $S_a$. Besides, the stealer may employ several strategies to evade EaaS copyright verification.
3.3 Framework of EmbMarker
Next, we introduce our EmbMarker for EaaS copyright protection, which is shown in Figure 2. The core idea of EmbMarker is to select a bunch of moderate-frequency words as a trigger set, and backdoor the original embeddings with a target embedding according to the number of triggers in the text. Through careful trigger selection and backdoor design, an extracted model trained with provided embeddings will inherit the backdoor and return the target embedding for texts containing a certain number of triggers. Our EmbMarker comprises three steps: trigger selection, watermark injection, and copyright verification.
Trigger Selection. Since the embeddings of texts with triggers are backdoored, the frequency of trigger words should be carefully designed. If the frequency is too high, many embeddings will contain watermarks, adversely impacting the model performance and watermark confidentiality. Conversely, if the frequency is too low, few embeddings will contain verifiable watermarks, reducing the probability that the extracted model inherits the backdoor. Therefore, we first count the word frequency on a general text corpus $D_p$. Then, $n$ words in a moderate-frequency interval are randomly sampled as the trigger set $T = \{t_1, t_2, \ldots, t_n\}$, where $t_i$ is the $i$-th trigger in the trigger set. The detailed analysis of the impact of the trigger set size and the frequency interval is in Section 4.6.
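The trigger selection step can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors’ implementation: the function name `select_triggers`, the lowercased whitespace tokenization, and the use of document frequency (the fraction of texts that contain a word) are all assumptions made for the example.

```python
import random
from collections import Counter


def select_triggers(corpus, n, freq_low, freq_high, seed=0):
    """Sample n trigger words whose frequency lies in [freq_low, freq_high].

    Assumption: "frequency" is the fraction of texts in the corpus that
    contain the word, and words come from lowercased whitespace splitting.
    """
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(text.lower().split()))  # count a word once per text
    total = len(corpus)
    candidates = sorted(
        w for w, c in doc_freq.items() if freq_low <= c / total <= freq_high
    )
    if len(candidates) < n:
        raise ValueError("fewer than n words fall in the frequency interval")
    # A seeded generator keeps the secret trigger set reproducible for the provider.
    return random.Random(seed).sample(candidates, n)


# Toy corpus: "the" is too frequent, "sat" sits at 50%, the rest at 25%.
corpus = ["the cat sat", "the dog sat", "the bird flew", "the fish swam"]
triggers = select_triggers(corpus, n=2, freq_low=0.2, freq_high=0.3)
```

With the settings reported in the experiments, such a function would be called with `n=20` and an interval of 0.005 to 0.01 on counts from the general corpus.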
Watermark Injection. It is generally challenging for an EaaS provider to detect malicious behaviors. Thus, EaaS has to be delivered to users, including adversaries, equally. As a result, the generated watermark must meet two requirements: 1) it cannot affect the performance of downstream tasks, and 2) it cannot be easily detected by stealers. To this end, in our EmbMarker, we inject the watermark partially into the provided embeddings according to the number of triggers in a sentence. More specifically, we first define a target embedding $\mathbf{e}_t$ as the watermark. We then design a trigger counting function $Q(\cdot)$, which assigns a watermark weight based on the number of triggers in the text. Given a text with a set of words $S = \{w_1, w_2, \ldots, w_k\}$, where $k$ is the number of unique words in the sentence, the output of $Q$ is formulated as follows:
$$Q(S) = \frac{\min(|S \cap T|, m)}{m}, \tag{1}$$
where $T$ is the trigger set and $m$ is a hyper-parameter to control the maximum number of triggers to fully activate the watermark. Finally, we compute the provided embedding $\mathbf{e}_p$ by inserting the watermark into the original embedding $\mathbf{e}_o$. Denote the target embedding as $\mathbf{e}_t$; the provided embedding is computed as follows:
$$\mathbf{e}_p = \frac{(1 - Q(S)) \cdot \mathbf{e}_o + Q(S) \cdot \mathbf{e}_t}{\left\| (1 - Q(S)) \cdot \mathbf{e}_o + Q(S) \cdot \mathbf{e}_t \right\|_2}. \tag{2}$$
Since most of the backdoor samples contain only a few triggers ($< m$), their provided embeddings are only slightly changed. Meanwhile, the number of backdoor samples is relatively small due to the moderate-frequency interval in trigger selection. Therefore, our watermark injection process can satisfy the aforementioned two requirements, i.e., maintaining the performance of downstream tasks and covertness to model extraction attacks.
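The injection rule described above (a weight that grows linearly with the number of distinct triggers, capped at the maximum trigger number, followed by a weighted sum and renormalization) can be illustrated as follows. This is a sketch under stated assumptions: the function names, the whitespace tokenization, and the toy two-dimensional embeddings are hypothetical and chosen only for readability.

```python
import math


def trigger_weight(text, triggers, m):
    """Watermark weight: distinct triggers in the text, capped at m, divided by m."""
    words = set(text.lower().split())  # assumption: whitespace tokenization
    return min(len(words & set(triggers)), m) / m


def inject_watermark(e_o, e_t, weight):
    """Blend the original and target embeddings, then rescale to unit length."""
    mixed = [(1.0 - weight) * o + weight * t for o, t in zip(e_o, e_t)]
    norm = math.sqrt(sum(x * x for x in mixed))
    if norm == 0.0:
        raise ValueError("blended embedding has zero norm")
    return [x / norm for x in mixed]


# Toy example: two of four triggers appear, so the weight is 2/4.
triggers = {"alpha", "beta", "gamma", "delta"}
e_o, e_t = [1.0, 0.0], [0.0, 1.0]
w = trigger_weight("alpha beta appear in this text", triggers, m=4)
e_p = inject_watermark(e_o, e_t, w)
```

A text without triggers keeps its original (normalized) embedding, while a text with at least the maximum number of triggers receives the target embedding itself.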
Copyright Verification. Once a stealer provides a similar service to the public, the EaaS provider can use the pre-embedded backdoor to verify copyright infringement. First, we construct two datasets, i.e., a backdoor text set $D_b$ and a benign text set $D_n$, which are defined as follows:
$$D_b = \{[w_1, w_2, \ldots, w_m] \mid w_i \in T\}, \qquad D_n = \{[w_1, w_2, \ldots, w_m] \mid w_i \notin T\}.$$
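Then, we use the texts in these two sets to query the stealer model and obtain embeddings. If the embeddings of the backdoor text set are closer to the target embedding than those of the benign text set, we can conclude with high confidence that the stealer violates the copyright. To test whether this conclusion is valid, we first calculate the cosine similarity and the squared $L_2$ distance between the normalized target embedding and the embeddings of the texts in $D_b$ and $D_n$:
$$\cos_i = \frac{\mathbf{e}_i \cdot \mathbf{e}_t}{\|\mathbf{e}_i\| \, \|\mathbf{e}_t\|}, \qquad l_{2,i} = \left\| \frac{\mathbf{e}_i}{\|\mathbf{e}_i\|} - \frac{\mathbf{e}_t}{\|\mathbf{e}_t\|} \right\|^2,$$
$$C_b = \{\cos_i \mid i \in D_b\}, \quad C_n = \{\cos_i \mid i \in D_n\}, \quad L_b = \{l_{2,i} \mid i \in D_b\}, \quad L_n = \{l_{2,i} \mid i \in D_n\}.$$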
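Then we evaluate the detection performance with three metrics. The first two metrics are the difference of averaged cosine similarity and the difference of averaged squared $L_2$ distance, given as follows:
$$\Delta_{\cos} = \frac{1}{|C_b|} \sum_{i \in C_b} i - \frac{1}{|C_n|} \sum_{j \in C_n} j, \qquad \Delta_{l_2} = \frac{1}{|L_b|} \sum_{i \in L_b} i - \frac{1}{|L_n|} \sum_{j \in L_n} j.$$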
Since the embeddings are normalized, the ranges of $\Delta_{\cos}$ and $\Delta_{l_2}$ are $[-2, 2]$ and $[-4, 4]$, respectively. The third metric is the p-value of the Kolmogorov-Smirnov (KS) test Berger and Zhou (2014), which is used to compare the distributions of two value sets. The null hypothesis is that the two cosine similarity sets $C_b$ and $C_n$ follow the same distribution. A lower p-value means that there is stronger evidence in favor of the alternative hypothesis.
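To make the three detection metrics concrete, the following standard-library sketch computes the difference of mean cosine similarity, the difference of mean squared L2 distance on normalized vectors, and a two-sample KS p-value. The function names and toy embeddings are hypothetical. The p-value uses a common asymptotic approximation of the Kolmogorov distribution, which is an assumption of this example; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def squared_l2(u, v):
    """Squared L2 distance between the unit-normalized versions of u and v."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum((a / nu - b / nv) ** 2 for a, b in zip(u, v))


def ks_statistic(a, b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n_a, n_b):
    """Asymptotic two-sided p-value (approximation, not an exact test)."""
    root = math.sqrt(n_a * n_b / (n_a + n_b))
    lam = (root + 0.12 + 0.11 / root) * d
    if lam < 0.2:  # the series does not converge here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def detection_metrics(backdoor_embs, benign_embs, e_t):
    c_b = [cosine(e, e_t) for e in backdoor_embs]
    c_n = [cosine(e, e_t) for e in benign_embs]
    l_b = [squared_l2(e, e_t) for e in backdoor_embs]
    l_n = [squared_l2(e, e_t) for e in benign_embs]
    delta_cos = sum(c_b) / len(c_b) - sum(c_n) / len(c_n)
    delta_l2 = sum(l_b) / len(l_b) - sum(l_n) / len(l_n)
    p_value = ks_pvalue(ks_statistic(c_b, c_n), len(c_b), len(c_n))
    return delta_cos, delta_l2, p_value


# Toy data: backdoor embeddings point towards the target, benign ones do not.
e_t = [1.0, 0.0]
backdoor = [[1.0, 0.01 * i] for i in range(20)]
benign = [[0.01 * i, 1.0] for i in range(20)]
delta_cos, delta_l2, p_value = detection_metrics(backdoor, benign, e_t)
```

On this toy input the cosine difference is large and positive, the distance difference is large and negative, and the p-value falls far below the 5e-3 threshold used in the experiments.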
Experiments
We conduct experiments on four natural language processing (NLP) datasets: SST2 Socher et al. (2013), MIND Wu et al. (2020), Enron Spam Metsis et al. (2006), and AG News Zhang et al. (2015). SST2 is a widely used dataset for sentiment classification. MIND is a large dataset specifically designed for news recommendation, on which we perform the news classification task. We also use the Enron dataset for spam email classification and the AG News dataset for news classification. The detailed statistics of these datasets are provided in Table 2. Additionally, we use the WikiText dataset Merity et al. (2017) with 1,801,350 samples to count word frequencies. To validate the effectiveness of EmbMarker, we report the following metrics:
Accuracy. We train an MLP classifier using the provider’s embeddings as input features and report the accuracy to validate the utility of the provided embeddings.
Detection Performance. We report three metrics, i.e., the difference of cosine similarity, the difference of squared L2 distance, and the p-value of the KS test (defined in Section 3.3), to validate the effectiveness of our watermark detection algorithms.
We use the AdamW algorithm Loshchilov and Hutter (2019) to train our models and employ embeddings from GPT-3 text-embedding-002 API as the original embeddings of EaaS. The maximum number of triggers is set to 4, and the size of the trigger set is 20. The frequency interval of triggers is [0.5%, 1%]. Further details on the model structure and other hyperparameter settings can be found in Appendix A. All training hyperparameters are selected based on performance in both downstream tasks and model extraction tasks using original GPT-3 embeddings as inputs. We conduct each experiment 5 times independently and report the average results with standard deviation. In addition, we define a threshold to assert copyright infringement. A standard p-value of 5e-3 is considered appropriate to reject the null hypothesis for statistical significance Benjamin et al. (2018), which can be utilized as the threshold to identify instances of copyright infringement.
4.2 Performance Comparison
We compare the performance of our EmbMarker with the following baselines: 1) Original, in which the service provider does not backdoor the provided embeddings and the stealer utilizes the original embeddings to copy the model. 2) RedAlarm Zhang et al. (2021), a method to backdoor pre-trained language models, which selects a rare token as the trigger and returns a pre-defined target embedding when a sentence contains the trigger.
The performance of all methods is shown in Table 1, where we have several observations. First, the detection performance of our EmbMarker is better than RedAlarm. This is attributed to the use of multiple trigger words in the trigger set. Every trigger word in a query text brings the copied embedding closer to the target embedding. Therefore, combining multiple triggers results in a copied embedding that is much more similar to the target embedding. Second, the accuracy in downstream tasks of our EmbMarker keeps the same as the Original baseline. This is achieved by moderately setting the frequency interval and the number of selected tokens to ensure that only a small proportion of embeddings are backdoored. Additionally, the number of triggers to fully activate the watermark is carefully set to 4. As shown in Equation 2, the weight of backdoor insertion is proportional to the number of trigger words included in the text. Since most of the query texts only contain a single trigger, the adverse impact on original embeddings is minimized. Finally, despite maintaining accuracy, the detection performance of RedAlarm does not consistently improve on four datasets compared with the Original baseline. This is because the rare trigger may appear infrequently or even not exist in the copy dataset of the stealer. Therefore, the target embedding of RedAlarm cannot be inherited.
4.3 Embedding Visualization
In this section, we examine the confidentiality of backdoored embeddings to the stealer by using PCA and t-SNE to visualize the embeddings produced by our method. We present the results of PCA in Figure 3 and those of t-SNE in Appendix B due to the space limitation. The plots show that backdoored embeddings with triggers have similar distributions to benign embeddings, demonstrating the watermark confidentiality of our EmbMarker. Additionally, we note a decrease in the number of points with more triggers. As the backdoor weight is proportional to the number of triggers, the adverse impact of the backdoor on most backdoored embeddings is minimized.
4.4 Impact of Trigger Number
In this section, we conduct experiments to evaluate the impact of the number of triggers in sentences on four datasets, i.e., SST2, MIND, Enron, and AG News. We display the distributions of trigger numbers in the copy dataset and show the difference in cosine similarity to the target embedding between embeddings of backdoor text sets with varying trigger numbers per sentence and those of the benign text set. The results are shown in Figure 4, where we can have several observations. First, the number of samples with triggers is small, and the number of samples with more triggers in copy datasets is smaller or even zero. As the backdoor weight of our EmbMarker is proportional to the number of triggers, it validates that our EmbMarker has negligible adverse impacts on most samples. Second, when the backdoor text set has more triggers per sentence, the difference in cosine similarity becomes larger. Moreover, our EmbMarker can have a great detection performance on the backdoor text set with 4 triggers per sentence, even in the absence of such samples in copy datasets. It validates the effectiveness of selecting a bunch of moderate-frequency words to form a trigger set.
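The distribution of trigger numbers in a copy dataset, as discussed above, can be tallied with a short helper. This is an illustrative sketch only: the function name, the whitespace tokenization, and the toy texts and triggers are hypothetical.

```python
from collections import Counter


def trigger_count_histogram(texts, triggers, m):
    """Count how many texts contain 0, 1, ..., m distinct triggers (capped at m)."""
    triggers = set(triggers)
    return Counter(
        min(len(set(text.lower().split()) & triggers), m) for text in texts
    )


texts = [
    "plain text",
    "one alpha here",
    "alpha and beta",
    "nothing at all",
    "alpha beta gamma delta epsilon",
]
hist = trigger_count_histogram(
    texts, {"alpha", "beta", "gamma", "delta", "epsilon"}, m=4
)
```

On a realistic copy dataset one would expect most of the mass at zero and a rapidly shrinking count as the number of triggers grows, which is the pattern the section reports.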
4.5 Impact of Extracted Model Size
To evaluate the impact of model size on the performance of EmbMarker, we conduct experiments using the small, base, and large versions of BERT as the backbone of the stealer’s model on the SST2, MIND, AG News, and Enron Spam datasets. As shown in Tables 3, 4, 5, and 6, our method effectively verifies copyright infringement when stealers employ backbones of different sizes to carry out model extraction attacks.
4.6 Hyper-parameter Analysis
In this subsection, we investigate the impact of the three key hyper-parameters in our EmbMarker, i.e., the maximum number of triggers $m$, the size of the trigger set $n$, and the frequency interval of selected triggers. Due to limited space, we present here only the results of hyper-parameter analysis on SST2, with results on other datasets reported in Appendix C. We first analyze the influence of different sizes of the trigger set $n$. The results are illustrated in Figure 5(a) and the first row of Figure 6. It can be observed that using a small trigger set leads to poor detection performance. This is because a small trigger set results in a limited number of backdoor samples, which decreases the likelihood of the stealer’s model inheriting the watermark. A large trigger set, in turn, reduces the watermark’s confidentiality. As $n$ increases, sentences are more likely to contain triggers, so more embeddings are backdoored and become easily distinguishable. However, the size of the trigger set does not greatly affect the accuracy. This may be due to the small frequency interval of [0.5%, 1%], meaning that even with a large trigger set, the probability of four triggers appearing in a sentence is still low.
Then we present the experimental results with different maximum numbers of triggers $m$ in Figure 5(b) and the second row of Figure 6. We find that small $m$, particularly $m = 1$, adversely impacts accuracy and makes the embeddings easily distinguishable by visualization. On the other hand, using large values of $m$ reduces the detection performance. The first effect arises because, with $m = 1$, approximately 1% of the embeddings are equal to the pre-defined target embedding $\mathbf{e}_t$, which diminishes the effectiveness of the provided embeddings. The second arises because, when $m$ is large, the backdoor degrees of most provided embeddings are too small for the watermark to be effectively inherited by the stealer’s model.
Finally, we analyze the impact of the trigger frequency. As shown in Figure 5(c) and the last row of Figure 6, high trigger frequencies have a detrimental impact on accuracy and make the embeddings easily distinguishable. Conversely, low trigger frequencies adversely affect detection performance. This is due to the fact that high frequencies lead to a large number of backdoored embeddings, thus adversely impacting the performance of the provided embeddings. On the other hand, in low-frequency settings, the watermark is only added to a limited number of samples, reducing the watermark transferability to a stolen model.
4.7 Defending Against Attacks
In this subsection, we consider similarity-invariant attacks, where the stealer applies similarity-invariant transformations to the copied embeddings. Similarity invariance is defined below.
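Definition 1 ($\mathcal{L}$ Similarity Invariance). For a transformation $\mathbf{A}$, given every vector pair $(\mathbf{i}, \mathbf{j})$, $\mathbf{A}$ is $\mathcal{L}$-similarity-invariant if $\mathcal{L}(\mathbf{A}(\mathbf{i}), \mathbf{A}(\mathbf{j})) = \mathcal{L}(\mathbf{i}, \mathbf{j})$, where $\mathcal{L}$ is a similarity metric.
The similarity metrics used in our experiments are the cosine similarity $\cos$ and the squared $L_2$ distance. For the sake of convenience, in the following text, we abbreviate $\cos$ and squared $L_2$ similarity invariance as similarity invariance.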
There exist many similarity-invariant transformations. Below we provide two concrete examples.
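Proposition 1. Denote the identity transformation as $\mathbf{I}(\mathbf{v}) = \mathbf{v}$ and the dimension-shift transformation as $\mathbf{S}(\mathbf{v}) = (v_d, v_1, v_2, \ldots, v_{d-1})$, where $\mathbf{v}$ is a vector, $v_i$ is the $i$-th dimension of $\mathbf{v}$ and $d$ is the dimension of $\mathbf{v}$. Both the identity transformation and the dimension-shift transformation are similarity-invariant.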
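The invariance of a dimension shift can be checked numerically. The sketch below assumes the shift is a cyclic rotation of the vector components and uses hypothetical function names and toy vectors. It also shows why a fixed, unshifted target embedding stops being useful after such an attack.

```python
import math


def dimension_shift(v, k=1):
    """Cyclically rotate the components of v by k positions."""
    k %= len(v)
    return v[-k:] + v[:-k]  # k=1 maps (v_1, ..., v_d) to (v_d, v_1, ..., v_{d-1})


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def squared_l2(u, v):
    """Squared L2 distance between the unit-normalized versions of u and v."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum((a / nu - b / nv) ** 2 for a, b in zip(u, v))


u = [0.2, -1.0, 0.5, 0.3]
v = [1.0, 0.4, -0.7, 0.1]

# Shifting both vectors leaves both similarity metrics unchanged.
cos_before, cos_after = cosine(u, v), cosine(dimension_shift(u), dimension_shift(v))
l2_before, l2_after = squared_l2(u, v), squared_l2(dimension_shift(u), dimension_shift(v))

# Shifting only one side (e.g. comparing against a fixed target) does not.
cos_mismatch = cosine(dimension_shift(u), u)
```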
When the stealer applies a similarity-invariant attack (e.g., a dimension-shift attack), our previous verification technique becomes ineffective, because the fixed target embedding no longer matches the transformed embedding space. To combat such attacks, we propose a modified version of our EmbMarker. Instead of defining the target embedding directly, we first select a target sample and use it to compute the target embedding with the provider’s model. Before detecting whether a service contains the watermark, we request the target sample’s embedding from the stealer’s service and use it for verification in place of the original target embedding. The experimental results of the modified EmbMarker under dimension-shift attacks are shown in Table 7. The detection performance is strong enough to conclude with high confidence that the stealer violates the copyright of the EaaS provider, which validates that the modified EmbMarker can effectively defend against dimension-shift attacks. For other similarity-invariant attacks, we prove theoretically that the detection performance remains the same.
Proposition 2. For a copied model, the detection performance $\Delta_{\cos}$, $\Delta_{l_2}$, and p-value of the modified EmbMarker remains consistent under any two similarity-invariant attacks involving transformations $\mathbf{A}_1$ and $\mathbf{A}_2$, respectively.
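The modified verification procedure, in which the target embedding is obtained by querying the suspicious service with the target sample, can be sketched as follows. Everything here is a toy stand-in: the lookup table, the text keys, and the two service functions are hypothetical, and the stealer’s attack is assumed to be a one-position cyclic dimension shift.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def delta_cos(service, backdoor_texts, benign_texts, target_text):
    """Difference of mean cosine similarity to the service's own target embedding."""
    e_t = service(target_text)  # re-queried, not the provider's stored vector
    c_b = [cosine(service(t), e_t) for t in backdoor_texts]
    c_n = [cosine(service(t), e_t) for t in benign_texts]
    return sum(c_b) / len(c_b) - sum(c_n) / len(c_n)


# Hypothetical embeddings returned by a copied model that inherited the watermark.
TABLE = {
    "target": [0.0, 1.0, 0.0],
    "trig a": [0.1, 0.9, 0.1],
    "trig b": [0.2, 0.8, 0.0],
    "plain a": [1.0, 0.1, 0.0],
    "plain b": [0.0, 0.2, 1.0],
}


def copied_service(text):
    return TABLE[text]


def shifted_service(text):
    """The same copied model behind a dimension-shift attack."""
    v = TABLE[text]
    return v[-1:] + v[:-1]


args = (["trig a", "trig b"], ["plain a", "plain b"], "target")
plain_score = delta_cos(copied_service, *args)
attacked_score = delta_cos(shifted_service, *args)
```

Because the target embedding passes through the same transformation as every other embedding, the score is the same with and without the attack.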
Conclusion
In this paper, we propose a backdoor-based embedding watermark method, named EmbMarker, which aims to effectively trace copyright infringement of EaaS LLMs while minimizing the adverse impact on the utility of embeddings. We first select a group of moderate-frequency words as the trigger set. We then define a target embedding as the backdoor watermark and insert it into the original embeddings of texts containing trigger words. To ensure the watermark can be inherited by the stealer’s model, we define the provided embeddings as a weighted summation of the original embeddings and the predefined target embedding, where the weights of the target embedding are proportional to the number of triggers in the texts. By computing the difference in similarity to the target embedding between embeddings of benign samples and those of backdoor samples, we can effectively verify the copyright. Experiments demonstrate the effectiveness of our EmbMarker in protecting the copyright of EaaS LLMs.
Limitations
In this paper, we present a novel backdoor-based watermarking method, EmbMarker, for protecting the copyright of EaaS models. Our experiments on four datasets demonstrate the effectiveness of our trigger selection algorithm. However, we have observed that the optimal trigger set is related to the statistics of the dataset used by a potential stealer. To address this issue, we plan to improve EmbMarker in the future by designing several candidate trigger sets and adopting one based on the statistics of the stealer’s previously queried data. Additionally, we find that as the number of triggers in the backdoor texts increases, the difference between benign and backdoor samples in cosine similarity to the target embedding increases linearly. Ideally, the cosine similarity would remain at its normal level until the number of triggers in a backdoor text reaches $m$. We plan to further investigate these areas in future work.
Acknowledgments
This work was supported by the grants from National Natural Science Foundation of China (No.62222213, U22B2059, 62072423), and the USTC Research Funds of the Double First-Class Initiative (No.YD2150002009).
References
Appendix
Appendix A Experimental Settings
In our experiments, the stealer applies BERT Devlin et al. (2019) as the backbone model and a two-layer feed-forward network to extract the victim model. We assume that the attacker applies mean squared error (MSE) loss to extract the victim model, which is defined as follows:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{e}_i - g(s_i) \right\|_2^2,$$
where $\mathbf{e}_i$ is the provided embedding of sample $s_i$, $g$ is the function of the extracted model, and $N$ is the number of samples in the copy dataset.
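As a plain illustration of the extraction objective, the helper below averages, over samples, the squared L2 distance between the extracted model’s outputs and the provided embeddings. The function name and the per-sample averaging are assumptions of this sketch; a framework loss such as PyTorch’s `torch.nn.MSELoss` with its default reduction averages over individual elements instead, which differs only by a constant factor equal to the embedding dimension.

```python
def mse_extraction_loss(predicted, provided):
    """Mean over samples of the squared L2 distance between two embedding lists."""
    if len(predicted) != len(provided) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for g_s, e_p in zip(predicted, provided):
        total += sum((a - b) ** 2 for a, b in zip(g_s, e_p))
    return total / len(predicted)


# Two toy samples: squared distances are 4 and 25, so the mean is 14.5.
loss = mse_extraction_loss([[1.0, 2.0], [0.0, 0.0]], [[1.0, 4.0], [3.0, 4.0]])
```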
A.2 Classifier
To evaluate the utility of our provided embedding $\mathbf{e}_p$, we use $\mathbf{e}_p$ as input features and apply a two-layer feed-forward network as the classifier. We use cross-entropy loss to train the classifier.
A.3 Hyper-parameter Settings
The full hyper-parameter settings are in Table 8.
Appendix B Embedding Visualization
The t-SNE visualizations of the provided embeddings of our EmbMarker on four copy datasets are presented in Figure 7. The observations are consistent with those presented in Section 4.3: the backdoor and benign embeddings are indistinguishable. Meanwhile, most of the samples do not contain triggers, and most of the backdoor samples contain only a single trigger.
Appendix C Hyper-parameter Analysis
In this section, we show the experimental results of hyper-parameter analysis on the MIND, Enron Spam and AG News datasets in Figure 8, Figure 9, and Figure 10, respectively. Since the PCA and t-SNE visualizations are too large to display in the paper, we put them in our repository. The observations are almost the same as those described in Section 4.6. First, a trigger set that is too small leads to low detection performance. This is because the number of backdoor samples is small when the trigger set is too small, which reduces the likelihood of the extracted model inheriting the watermark. Second, the size of the trigger set has little impact on accuracy. This might be because the frequency interval is small: even when the trigger set is large, the probability of 4 triggers appearing in a sentence is still low. Third, we find that small $m$, especially $m = 1$, degrades accuracy, while large $m$ reduces detection performance. This is because about 1% of the embeddings equal the pre-defined target embedding with $m = 1$, which negatively impacts the effectiveness of the provided embeddings. When $m$ is large, the backdoor degree of most samples is too small for the watermark to be inherited by the extracted model. Finally, low frequencies have a negative impact on detection performance, and high frequencies might negatively affect accuracy. This is because high frequencies poison many embeddings and affect the performance of the provided embeddings. In low-frequency settings, the watermark is only added to a few samples, which limits the possibility of watermark inheritance. Additionally, we analyze the impact of dropout values on model extraction attacks. When the dropout value is greater than 0.4, the model cannot be extracted effectively, rendering the detection ability of EmbMarker meaningless. Therefore, in Table 9, we present the performance of EmbMarker when the dropout value is between 0 and 0.4. Our observations indicate that model extraction attacks are most effective when the dropout value is set to 0. This is because the LLM embeddings contain rich semantic knowledge, and increasing the dropout value weakens the fitting ability of the stealer’s model, thereby reducing its performance in downstream tasks and the likelihood of inheriting watermarks.
Appendix D Theoretical Proof
In this section, we provide theoretical proofs for the propositions in Section 4.7.
D.1 Proof of Proposition 1
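Proof. Given any pair of vectors $(\mathbf{i}, \mathbf{j})$, according to the definition of identity transformation, we have
$$\cos(\mathbf{I}(\mathbf{i}), \mathbf{I}(\mathbf{j})) = \cos(\mathbf{i}, \mathbf{j}), \qquad \left\| \mathbf{I}(\mathbf{i}) - \mathbf{I}(\mathbf{j}) \right\|_2^2 = \left\| \mathbf{i} - \mathbf{j} \right\|_2^2,$$
which indicates identity transformation is similarity-invariant.
For dimension-shift transformation $\mathbf{S}$, we have
$$\cos(\mathbf{S}(\mathbf{i}), \mathbf{S}(\mathbf{j})) = \frac{\sum_{k=1}^{d} i_k j_k}{\sqrt{\sum_{k=1}^{d} i_k^2} \sqrt{\sum_{k=1}^{d} j_k^2}} = \cos(\mathbf{i}, \mathbf{j}), \qquad \left\| \mathbf{S}(\mathbf{i}) - \mathbf{S}(\mathbf{j}) \right\|_2^2 = \sum_{k=1}^{d} (i_k - j_k)^2 = \left\| \mathbf{i} - \mathbf{j} \right\|_2^2,$$
where $d$ is the dimension of $\mathbf{i}$ and $\mathbf{j}$; the shift only changes the order of the summands, and it preserves vector norms, so the equalities also hold for normalized vectors. Therefore, dimension-shift transformation is similarity-invariant as well.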
D.2 Proof of Proposition 2
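Proof. Denote the embedding of the copied model as $\mathbf{e}$, the embedding manipulated by transformation $\mathbf{A}_1$ as $\mathbf{e}^{1} = \mathbf{A}_1(\mathbf{e})$, and the embedding manipulated by transformation $\mathbf{A}_2$ as $\mathbf{e}^{2} = \mathbf{A}_2(\mathbf{e})$. Since both $\mathbf{A}_1$ and $\mathbf{A}_2$ are similarity-invariant, we have
$$\cos(\mathbf{e}^{1}_i, \mathbf{e}^{1}_t) = \cos(\mathbf{e}_i, \mathbf{e}_t) = \cos(\mathbf{e}^{2}_i, \mathbf{e}^{2}_t), \qquad l_2(\mathbf{e}^{1}_i, \mathbf{e}^{1}_t) = l_2(\mathbf{e}_i, \mathbf{e}_t) = l_2(\mathbf{e}^{2}_i, \mathbf{e}^{2}_t),$$
where $\mathbf{e}_i$ and $\mathbf{e}_t$ are the embeddings of the $i$-th verification text and of the target sample, and $l_2(\cdot, \cdot)$ is the squared $L_2$ distance between normalized vectors.
Since the inputs for the metrics $\Delta_{\cos}$, $\Delta_{l_2}$, and p-value in our methods are only $C_b$, $C_n$, $L_b$, and $L_n$, we have
$$\Delta^{1}_{\cos} = \Delta^{2}_{\cos}, \qquad \Delta^{1}_{l_2} = \Delta^{2}_{l_2}, \qquad p^{1} = p^{2},$$
where $p^{k}$ is the p-value of the KS test with $C^{k}_b$ and $C^{k}_n$ as inputs.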
Appendix E Experimental Environments
We conduct experiments on a Linux server running Ubuntu 18.04. The server has a V100-16GB GPU with CUDA 11.6. We use PyTorch 1.13.1.