Turning Your Weakness Into a Strength: Watermarking Deep Neural Networks by Backdooring

Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, Joseph Keshet

Introduction

Deep Neural Networks (DNN) enable a growing number of applications ranging from visual understanding to machine translation to speech recognition . They have considerably changed the way we conceive software and are rapidly becoming a general purpose technology . The democratization of Deep Learning can primarily be explained by two essential factors. First, several open source frameworks (e.g., PyTorch , TensorFlow ) simplify the design and deployment of complex models. Second, academic and industrial labs regularly release open source, state of the art, pre-trained models. For instance, the most accurate visual understanding system is now freely available online for download. Given the considerable amount of expertise, data and computational resources required to train these models effectively, the availability of pre-trained models enables their use by operators with modest resources .

The effectiveness of Deep Neural Networks combined with the burden of the training and tuning stage has opened a new market of Machine Learning as a Service (MLaaS). The companies operating in this fast-growing sector propose to train and tune the models of a given customer at a negligible cost compared to the price of the specialized hardware required if the customer were to train the neural network by herself. Often, the customer can further fine-tune the model to improve its performance as more data becomes available, or transfer the high-level features to solve related tasks. In addition to open source models, MLaaS allows the users to build more personalized systems without much overhead .

Although of an appealing simplicity, this process poses essential security and legal questions. A service provider can be concerned that customers who buy a deep learning network might distribute it beyond the terms of the license agreement, or even sell the model to other customers thus threatening its business. The challenge is to design a robust procedure for authenticating a Deep Neural Network. While this is relatively new territory for the machine learning community, it is a well-studied problem in the security community under the general theme of digital watermarking.

Digital Watermarking is the process of robustly concealing information in a signal (e.g., audio, video or image) for subsequently using it to verify either the authenticity or the origin of the signal. Watermarking has been extensively investigated in the context of digital media (see, e.g., and references within), and in the context of watermarking digital keys (e.g., in ). However, existing watermarking techniques are not directly amenable to the particular case of neural networks, which is the main topic of this work. Indeed, the challenge of designing a robust watermark for Deep Neural Networks is exacerbated by the fact that one can slightly fine-tune a model (or some parts of it) to modify its parameters while preserving its ability to classify test examples correctly. Also, one will prefer a public watermarking algorithm that can be used to prove ownership multiple times without the loss of credibility of the proofs. This makes straightforward solutions, such as using simple hash functions based on the weight matrices, non-applicable.

Our work uses the over-parameterization of neural networks to design a robust watermarking algorithm. This over-parameterization has so far mainly been considered as a weakness (from a security perspective) because it makes backdooring possible . Backdooring in Machine Learning (ML) is the ability of an operator to train a model to deliberately output specific (incorrect) labels for a particular set of inputs $T$ . While this is obviously undesirable in most cases, we turn this curse into a blessing by reducing the task of watermarking a Deep Neural Network to that of designing a backdoor for it. Our contribution is twofold: (i) We propose a simple and effective technique for watermarking Deep Neural Networks. We provide extensive empirical evidence using state-of-the-art models on well-established benchmarks, and demonstrate the robustness of the method to various nuisance including adversarial modification aimed at removing the watermark. (ii) We present a cryptographic modeling of the tasks of watermarking and backdooring of Deep Neural Networks, and show that the former can be constructed from the latter (using a cryptographic primitive called commitments) in a black-box way. This theoretical analysis exhibits why it is not a coincidence that both our construction and rely on the same properties of Deep Neural Networks. Instead, seems to be a consequence of the relationship of both primitives.

Recently, proposed to watermark neural networks by adding a new regularization term to the loss function. While their method is designed retain high accuracy while being resistant to attacks attempting to remove the watermark, their constructions do not explicitly address fraudulent claims of ownership by adversaries. Also, their scheme does not aim to defend against attackers cognizant of the exact $\mathtt{Mark}$ -algorithm. Moreover, in the construction of the verification key can only be used once, because a watermark can be removed once the key is knownWe present a technique to circumvent this problem in our setting. This approach can also be implemented in their work.. In the authors suggested to use adversarial examples together with adversarial training to watermark neural networks. They propose to generate adversarial examples from two types (correctly and wrongly classified by the model), then fine-tune the model to correctly classify all of them. Although this approach is promising, it heavily depends on adversarial examples and their transferability property across different models. It is not clear under what conditions adversarial examples can be transferred across models or if such transferability can be decreased . It is also worth mentioning an earlier work on watermarking machine learning models proposed in . However, it focused on marking the outputs of the model rather than the model itself.

Definitions and Models

This section provides a formal definition of backdooring for machine-learning algorithms. The definition makes the properties of existing backdooring techniques explicit, and also gives a (natural) extension when compared to previous work. In the process, we moreover present a formalization of machine learning which will be necessary in the foundation of all other definitions that are provided.

Assume that there exists some objective ground-truth function $f$ which classifies inputs according to a fixed output label set (where we allow the label to be undefined, denoted as $\bot$ ). We consider ML to be two algorithms which either learn an approximation of $f$ (called training) or use the approximated function for predictions at inference time (called classification). The goal of training is to learn a function, $f^{\prime}$ , that performs on unseen data as good as on the training set. A schematic description of this definition can be found in Figure 1.

To make this more formal, consider the sets $D\subset\{0,1\}^{*},L\subset\{0,1\}^{*}\cup\{\bot\}$ where $|D|=\Theta(2^{n})$ and $|L|=\Omega(p(n))$ for a positive polynomial $p(\cdot)$ . $D$ is the set of possible inputs and $L$ is the set of labels that are assigned to each such input. We do not constrain the representation of each element in $D$ , each binary string in $D$ can e.g. encode float-point numbers for color values of pixels of an image of size $n\times n$ whileAsymptotically, the number of bits per pixel is constant. Choosing this image size guarantees that $|D|$ is big enough. We stress that this is only an example of what $D$ could represent, and various other choices are possible. $L=\{0,1\}$ says whether there is a dog in the image or not. The additional symbol $\bot\in L$ is used if the classification task would be undefined for a certain input.

We assume an ideal assignment of labels to inputs, which is the ground-truth function $f:D\rightarrow L$ . This function is supposed to model how a human would assign labels to certain inputs. As $f$ might be undefined for specific tasks and labels, we will denote with $\overline{D}=\{x\in D~{}|~{}f(x)\neq\bot\}$ the set of all inputs having a ground-truth label assigned to them. To formally define learning, the algorithms are given access to $f$ through an oracle $\mathcal{O}^{f}$ . This oracle $\mathcal{O}^{f}$ truthfully answers calls to the function $f$ .

We assume that there exist two algorithms $(\mathtt{Train},\mathtt{Classify})$ for training and classification:

$\mathtt{Train}(\mathcal{O}^{f})$ is a probabilistic polynomial-time algorithm that outputs a model $M\subset\{0,1\}^{p(n)}$ where $p(n)$ is a polynomial in $n$ .

$\mathtt{Classify}(M,x)$ is a deterministic polynomial-time algorithm that, for an input $x\in D$ outputs a value $M(x)\in L\setminus\{\bot\}$ .

We say that, given a function $f$ , the algorithm pair $(\mathtt{Train}$ , $\mathtt{Classify})$ is $\epsilon$ -accurate if $\Pr\left[f(x)\neq\mathtt{Classify}(M,x)~{}|~{}x\in\overline{D}\right]\leq\epsilon$ where the probability is taken over the randomness of $\mathtt{Train}$ . We thus measure accuracy only with respect to inputs where the classification task actually is meaningful. For those inputs where the ground-truth is undefined, we instead assume that the label is random: for all $x\in D\setminus\overline{D}$ we assume that for any $i\in L$ , it holds that $\Pr[\mathtt{Classify}(M,x)=i]=1/|L|$ where the probability is taken over the randomness used in $\mathtt{Train}$ .

2 Backdoors in Neural Networks

Backdooring neural networks, as described in , is a technique to deliberately train a machine learning model to output wrong (when compared with the ground-truth function $f$ ) labels $T_{L}$ for certain inputs $T$ .

Therefore, let $T\subset D$ be a subset of the inputs, which we will refer to it as the trigger set. The wrong labeling with respect to the ground-truth $f$ is captured by the function $T_{L}:T\rightarrow L\setminus\{\bot\};~{}x\mapsto T_{L}(x)\neq f(x)$ which assigns “wrong” labels to the trigger set. This function $T_{L}$ , similar to the algorithm $\mathtt{Classify}$ , is not allowed to output the special label $\bot$ . Together, the trigger set and the labeling function will be referred to as the backdoor $\mathsf{b}=(T,T_{L})$ . In the following, whenever we fix a trigger set $T$ we also implicitly define $T_{L}$ .

For such a backdoor $\mathsf{b}$ , we define a backdooring algorithm $\mathtt{Backdoor}$ which, on input of a model, will output a model that misclassifies on the trigger set with high probability. More formally, $\mathtt{Backdoor}(\mathcal{O}^{f},\mathsf{b},M)$ is PPT algorithm that receives as input an oracle to $f$ , the backdoor $\mathsf{b}$ and a model $M$ , and outputs a model $\hat{M}$ . $\hat{M}$ is called backdoored if $\hat{M}$ is correct on $\overline{D}\setminus T$ but reliably errs on $T$ , namely

This definition captures two ways in which a backdoor can be embedded:

The algorithm can use the provided model to embed the watermark into it. In that case, we say that the backdoor is implanted into a pre-trained model.

Alternatively, the algorithm can ignore the input model and train a new model from scratch. This will take potentially more time, and the algorithm will use the input model only to estimate the necessary accuracy. We will refer to this approach as training from scratch.

3 Strong Backdoors

Towards our goal of watermarking a ML model we require further properties from the backdooring algorithm, which deal with the sampling and removal of backdoors: First of all, we want to turn the generation of a trapdoor into an algorithmic process. To this end, we introduce a new, randomized algorithm $\mathtt{SampleBackdoor}$ that on input $\mathcal{O}^{f}$ outputs backdoors $\mathsf{b}$ and works in combination with the aforementioned algorithms $(\mathtt{Train},\mathtt{Classify})$ . This is schematically shown in Figure 2.

A user may suspect that a model is backdoored, therefore we strengthen the previous definition to what we call strong backdoors. These should be hard to remove, even for someone who can use the algorithm $\mathtt{SampleBackdoor}$ in an arbitrary way. Therefore, we require that $\mathtt{SampleBackdoor}$ should have the following properties:

For each trigger set that $\mathtt{SampleBackdoor}$ returns as part of a backdoor, we assume that it has minimal size $n$ . Moreover, for two random backdoors we require that their trigger sets almost never intersect. Formally, we ask that $\Pr\left[T\cap T^{\prime}\neq\emptyset\right]$ for $(T,T_{L}),(T^{\prime},T_{L}^{\prime})\leftarrow\mathtt{SampleBackdoor}()$ is negligible in $n$ .

With persistency we require that it is hard to remove a backdoor, unless one has knowledge of the trigger set $T$ . There are two trivial cases which a definition must avoid:

An adversary may submit a model that has no backdoor, but this model has very low accuracy. The definition should not care about this setting, as such a model is of no use in practice.

An adversary can always train a new model from scratch, and therefore be able to submit a model that is very accurate and does not include the backdoor. An adversary with unlimited computational resources and unlimited access to $\mathcal{O}^{f}$ will thus always be able to cheat.

In our approach, we chose to restrict the runtime of $\mathcal{A}$ , but other modeling approaches are possible: one could also give unlimited power to $\mathcal{A}$ but only restricted access to the ground-truth function, or use a mixture of both. We chose our approach as it follows the standard pattern in cryptography, and thus allows to integrate better with cryptographic primitives which we will use: these are only secure against adversaries with a bounded runtime.

4 Commitments

Commitment schemes are a well known cryptographic primitive which allows a sender to lock a secret $x$ into a cryptographic leakage-free and tamper-proof vault and give it to someone else, called a receiver. It is neither possible for the receiver to open this vault without the help of the sender (this is called hiding), nor for the sender to exchange the locked secret to something else once it has been given away (the binding property).

Formally, a commitment scheme consists of two algorithms $(\mathtt{Com},\mathtt{Open})$ :

$\mathtt{Com}(x,r)$ on input of a value $x\in S$ and a bitstring $r\in\{0,1\}^{n}$ outputs a bitstring $c_{x}$ .

$\mathtt{Open}(c_{x},x,r)$ for a given $x\in S,r\in\{0,1\}^{n},c_{x}\in\{0,1\}^{*}$ outputs or $1$ .

For correctness, it must hold that $\forall x\in S$ ,

We call the commitment scheme $(\mathtt{Com},\mathtt{Open})$ binding if, for every PPT algorithm $\mathcal{A}$

where $\epsilon(n)$ is negligible in $n$ and the probability is taken over $x\in S,r\in\{0,1\}^{n}$ .

Similarly, $(\mathtt{Com},\mathtt{Open})$ are hiding if no PPT algorithm $\mathcal{A}$ can distinguish $c_{0}\leftarrow\mathtt{Com}(0,r)$ from $c_{x}\leftarrow\mathtt{Com}(x,r)$ for arbitrary $x\in S,r\in\{0,1\}^{n}$ . In case that the distributions of $c_{0},c_{x}$ are statistically close, we call a commitment scheme statistically hiding. For more information, see e.g. .

Defining Watermarking

We now define watermarking for ML algorithms. The terminology and definitions are inspired by .

We split a watermarking scheme into three algorithms: (i) a first algorithm to generate the secret marking key $\mathsf{mk}$ which is embedded as the watermark, and the public verification key $\mathsf{vk}$ used to detect the watermark later; (ii) an algorithm to embed the watermark into a model; and (iii) a third algorithm to verify if a watermark is present in a model or not. We will allow that the verification involves both $\mathsf{mk}$ and $\mathsf{vk}$ , for reasons that will become clear later.

Formally, a watermarking scheme is defined by the three PPT algorithms $(\mathtt{KeyGen},\mathtt{Mark},\mathtt{Verify})$ :

$\mathtt{KeyGen}()$ outputs a key pair $(\mathsf{mk},\mathsf{vk})$ .

$\mathtt{Mark}(M,\mathsf{mk})$ on input a model $M$ and a marking key $\mathsf{mk}$ , outputs a model $\hat{M}$ .

$\mathtt{Verify}(\mathsf{mk},\mathsf{vk},M)$ on input of the key pair $\mathsf{mk},\mathsf{vk}$ and a model $M$ , outputs a bit $b\in\{0,1\}$ .

For the sake of brevity, we define an auxiliary algorithm which simplifies to write definitions and proofs:

Generate $M\leftarrow\mathtt{Train}(\mathcal{O}^{f})$ .

Sample $(\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{KeyGen}()$ .

Compute $\hat{M}\leftarrow\mathtt{Mark}(M,\mathsf{mk})$ .

Output $(M,\hat{M},\mathsf{mk},\mathsf{vk})$ .

The three algorithms $(\mathtt{KeyGen},\mathtt{Mark},\mathtt{Verify})$ should correctly work together, meaning that a model watermarked with an honestly generated key should be verified as such. This is called correctness, and formally requires that

A depiction of this can be found in Figure 3.

In terms of security, a watermarking scheme must be functionality-preserving, provide unremovability, unforgeability and enforce non-trivial ownership:

We say that a scheme is functionality-preserving if a model with a watermark is as accurate as a model without it: for any $(M,\hat{M},\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{MModel}()$ , it holds that

Non-trivial ownership means that even an attacker which knows our watermarking algorithm is not able to generate in advance a key pair $(\mathsf{mk},\mathsf{vk})$ that allows him to claim ownership of arbitrary models that are unknown to him. Formally, a watermark does not have trivial ownership if every PPT algorithm $\mathcal{A}$ only has negligible probability for winning the following game:

Compute $(M,\hat{M},\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{MModel}()$ .

Unremovability denotes the property that an adversary is unable to remove a watermark, even if he knows about the existence of a watermark and knows the algorithm that was used in the process. We require that for every PPT algorithm $\mathcal{A}$ the chance of winning the following game is negligible:

Compute $(M,\hat{M},\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{MModel}()$ .

Unforgeability means that an adversary that knows the verification key $\mathsf{vk}$ , but does not know the key $\mathsf{mk}$ , will be unable to convince a third party that he (the adversary) owns the model. Namely, it is required that for every PPT algorithm $\mathcal{A}$ , the chance of winning the following game is negligible:

Compute $(M,\hat{M},\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{MModel}()$ .

Two other properties, which might be of practical interest but are either too complex to achieve or contrary to our definitions, are Ownership Piracy and different degrees of Verifiability,

Ownership Piracy means that an attacker is attempting to implant his watermark into a model which has already been watermarked before. Here, the goal is that the old watermark at least persists. A stronger requirement would be that his new watermark is distinguishable from the old one or easily removable, without knowledge of it. Indeed, we will later show in Section 5.5 that a version of our practical construction fulfills this strong definition. On the other hand, a removable watermark is obviously in general inconsistent with Unremovability, so we leaveIndeed, Ownership Piracy is only meaningful if the watermark was originally inserted during $\mathtt{Train}$ , whereas the adversary will have to make adjustments to a pre-trained model. This gap is exactly what we explore in Section 5.5. it out in our theoretical construction.

A watermarking scheme that uses the verification procedure $\mathtt{Verify}$ is called privately verifiable. In such a setting, one can convince a third party about ownership using $\mathtt{Verify}$ as long as this third party is honest and does not release the key pair $(\mathsf{mk},\mathsf{vk})$ , which crucially is input to it. We call a scheme publicly verifiable if there exists an interactive protocol $\mathtt{PVerify}$ that, on input $\mathsf{mk},\mathsf{vk},M$ by the prover and $\mathsf{vk},M$ by the verifier outputs the same value as $\mathtt{Verify}$ (except with negligible probability), such that the same key $\mathsf{vk}$ can be used in multiple proofs of ownership.

Watermarking From Backdooring

This section gives a theoretical construction of privately verifiable watermarking based on any strong backdooring (as outlined in Section 2) and a commitment scheme. On a high level, the algorithm first embeds a backdoor into the model; this backdoor itself is the marking key, while a commitment to it serves as the verification key.

More concretely, let $(\mathtt{Train},\mathtt{Classify})$ be an $\epsilon$ -accurate ML algorithm, $\mathtt{Backdoor}$ be a strong backdooring algorithm and $(\mathtt{Com},\mathtt{Open})$ be a statistically hiding commitment scheme. Then define the three algorithms $(\mathtt{KeyGen},\mathtt{Mark},\mathtt{Verify})$ as follows.

Run $(T,T_{L})=\mathsf{b}\leftarrow\mathtt{SampleBackdoor}(\mathcal{O}^{f})$ where $T=\{t^{(1)},\dots,t^{(n)}\}$ and $T_{L}=\{T_{L}^{(1)},\dots,T_{L}^{(n)}\}$ .

Sample $2n$ random strings $r_{t}^{(i)},r_{L}^{(i)}\leftarrow\{0,1\}^{n}$ and generate $2n$ commitments $\{c_{t}^{(i)},c_{L}^{(i)}\}_{i\in[n]}$ where $c_{t}^{(i)}\leftarrow\mathtt{Com}(t^{(i)},r_{t}^{(i)})$ , $c_{L}^{(i)}\leftarrow\mathtt{Com}(T_{L}^{(i)},r_{L}^{(i)})$ .

Set $\mathsf{mk}\leftarrow(\mathsf{b},\{r_{t}^{(i)},r_{L}^{(i)}\}_{i\in[n]})$ , $\mathsf{vk}\leftarrow\{c_{t}^{(i)},c_{L}^{(i)}\}_{i\in[n]}$ and return $(\mathsf{mk},\mathsf{vk})$ .

Let $\mathsf{mk}=(\mathsf{b},\{r_{t}^{(i)},r_{L}^{(i)}\}_{i\in[n]})$ .

Compute and output $\hat{M}\leftarrow\mathtt{Backdoor}(\mathcal{O}^{f},\mathsf{b},M)$ .

Let $\mathsf{mk}=(\mathsf{b},\{r_{t}^{(i)},r_{L}^{(i)}\}_{i\in[n]})$ , $\mathsf{vk}=\{c_{t}^{(i)},c_{L}^{(i)}\}_{i\in[n]}$ . For $\mathsf{b}=(T,T_{L})$ test if $\forall t^{(i)}\in T:~{}T_{L}^{(i)}\neq f(t^{(i)})$ . If not, then output .

For all $i\in[n]$ check that $\mathtt{Open}(c_{t}^{(i)},t^{(i)},r_{t}^{(i)})=1$ and $\mathtt{Open}(c_{L}^{(i)},T_{L}^{(i)},r_{L}^{(i)})=1$ . Otherwise output .

For all $i\in[n]$ test that $\mathtt{Classify}(t^{(i)},M)=T_{L}^{(i)}$ . If this is true for all but $\epsilon|T|$ elements from $T$ then output $1$ , else output .

We want to remark that this construction captures both the watermarking of an existing model and the training from scratch. We now prove the security of the construction.

Let $\overline{D}$ be of super-polynomial size in $n$ . Then assuming the existence of a commitment scheme and a strong backdooring scheme, the aforementioned algorithms $(\mathtt{KeyGen},\mathtt{Mark},\mathtt{Verify})$ form a privately verifiable watermarking scheme.

The proof, on a very high level, works as follows: a model containing a strong backdoor means that this backdoor, and therefore the watermark, cannot be removed. Additionally, by the hiding property of the commitment scheme the verification key will not provide any useful information to the adversary about the backdoor used, while the binding property ensures that one cannot claim ownership of arbitrary models. In the proof, special care must be taken as we use reductions from the watermarking algorithm to the security of both the underlying backdoor and the commitment scheme. To be meaningful, those reductions must have much smaller runtime than actually breaking these assumptions directly. While this is easy in the case of the commitment scheme, reductions to backdoor security need more attention.

By construction, $\hat{M}$ which is returned by $\mathtt{Mark}$ will disagree with $\mathsf{b}$ on elements from $T$ with probability at most $\epsilon$ , so in total at least $(1-\epsilon)|T|$ elements agree by the definition of a backdoor. $\mathtt{Verify}$ outputs $1$ if $\hat{M}$ disagrees with $\mathsf{b}$ on at most $\epsilon|T|$ elements.

Assume that $\mathtt{Backdoor}$ is a backdooring algorithm, then by its definition the model $\hat{M}$ is accurate outside of the trigger set of the backdoor, i.e.

$\hat{M}$ in total will then err on a fraction at most $\epsilon^{\prime}=\epsilon+n/|D|$ , and because $\overline{D}$ by assumption is super-polynomially large in $n$ $\epsilon^{\prime}$ is negligibly close to $\epsilon$ .

Compute $(M,\hat{M},\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{MModel}()$ .

This algorithm replaces the first element in a verification key with an element from an independently generated backdoor, and then runs $\mathcal{A}$ on it.

Consider the following algorithm $\mathcal{T}$ when given a model $M$ with a strong backdoor:

Compute $(\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{KeyGen}()$ .

Assume that there exists a poly-time algorithm $\mathcal{A}$ that can break unforgeability. We will use this algorithm to open a statistically hiding commitment.

Therefore, we design an algorithm $\mathcal{S}$ which uses $\mathcal{A}$ as a subroutine. The algorithm trains a regular network (which can be watermarked by our scheme) and adds the commitment into the verification key. Then, it will use $\mathcal{A}$ to find openings for these commitments. The algorithm $\mathcal{S}$ works as follows:

Receive the commitment $c$ from challenger.

Compute $(M,\hat{M},\mathsf{mk},\mathsf{vk})\leftarrow\mathtt{MModel}()$ .

Let $\mathsf{vk}=\{c_{t}^{(i)},c_{L}^{(i)}\}_{i\in[n]}$ set

and $\hat{\mathsf{vk}}\leftarrow\{\hat{c}_{t}^{(i)},c_{L}^{(i)}\}_{i\in[n]}$ .

1 From Private to Public Verifiability

Using the algorithm $\mathtt{Verify}$ constructed in this section only allows verification by an honest party. The scheme described above is therefore only privately verifiable. After running $\mathtt{Verify}$ , the key $\mathsf{mk}$ will be known and an adversary can retrain the model on the trigger set. This is not a drawback when it comes to an application like the protection of intellectual property, where a trusted third party in the form of a judge exists. If one instead wants to achieve public verifiability, then there are two possible scenarios for how to design an algorithm $\mathtt{PVerify}$ : allowing public verification a constant number of times, or an arbitrary number of times.

In the first setting, a straightforward approach to the construction of $\mathtt{PVerify}$ is to choose multiple backdoors during $\mathtt{KeyGen}$ and release a different one in each iteration of $\mathtt{PVerify}$ . This allows multiple verifications, but the number is upper-bounded in practice by the capacity of the model $M$ to contain backdoors - this cannot arbitrarily be extended without damaging the accuracy of the model. To achieve an unlimited number of verifications we will modify the watermarking scheme to output a different type of verification key. We then present an algorithm $\mathtt{PVerify}$ such that the interaction $\tau$ with an honest prover can be simulated as $\tau^{\prime}$ given the values $M,\mathsf{vk},\mathtt{Verify}(\mathsf{mk},\mathsf{vk},M)$ only. This simulation means that no other information about $\mathsf{mk}$ beyond what is leaked from $\mathsf{vk}$ ever gets to the verifier. We give a graphical depiction of the approach in Figure 4. Our solution is sketched in Appendix A.1.

2 Implementation Details

For an implementation, it is of importance to choose the size $|T|$ of the trigger set properly, where we have to consider that $|T|$ cannot be arbitrarily big, as the accuracy will drop. To lower-bound $|T|$ we assume an attacker against non-trivial ownership. For simplicity, we use a backdooring algorithm that generates trigger sets from elements where $f$ is undefined. By our simplifying assumption from Section 2.1, the model will classify the images in the trigger set to random labels. Furthermore, assume that the model is $\epsilon$ -accurate (which it also is on the trigger set). Then, one can model a dishonest party to randomly get $(1-\epsilon)|T|$ out of $|T|$ committed images right using a Binomial distribution. We want to upper-bound this event to have probability at most $2^{-n}$ and use Hoeffding’s inequality to obtain that $|T|>n\cdot\ln(2)/(\frac{1}{|L|}+\epsilon-1)$ .

To implement our scheme, it is necessary that $\mathsf{vk}$ becomes public before $\mathtt{Verify}$ is used. This ensures that a party does not simply generate a fake key after seeing a model. A solution for this is to e.g. publish the key on a time-stamped bulletin board like a blockchain. In addition, a statistically hiding commitment scheme should be used that allows for efficient evaluation in zero-knowledge (see Appendix A.1). For this one can e.g. use a scheme based on a cryptographic hash function such as the one described in .

A Direct Construction of Watermarking

This section describes a scheme for watermarking a neural network model for image classification, and experiments analyzing it with respect to the definitions in Section 3. We demonstrate that it is hard to reduce the persistence of watermarks that are generated with our method. For all the technical details regarding the implementation and hyper-parameters, we refer the reader to Section 5.7.

Similar to Section 4, we use a set of images as the marking key or trigger set of our constructionAs the set of images will serve a similar purpose as the trigger set from backdoors in Section 2, we denote the marking key as trigger set throughout this section.. To embed the watermark, we optimize the models using both training set and trigger set. We investigate two approaches: the first approach starts from a pre-trained model, i.e., a model that was trained without a trigger set, and continues training the model together with a chosen trigger set. This approach is denoted as PreTrained. The second approach trains the model from scratch along with the trigger set. This approach is denoted as FromScratch. This latter approach is related to Data Poisoning techniques.

During training, for each batch, denote as $b_{t}$ the batch at iteration $t$ , we sample $k$ trigger set images and append them to $b_{t}$ . We follow this procedure for both approaches. We tested different numbers of $k$ (i.e., 2, 4, and 8), and setting $k=2$ reach the best results. We hypothesize that this is due to the Batch-Normalization layer . The Batch-Normalization layer has two modes of operations. During training, it keeps a running estimate of the computed mean and variance. During an evaluation, the running mean and variance are used for normalization. Hence, adding more images to each batch puts more focus on the trigger set images and makes convergence slower.

In all models we optimize the Negative Log Likelihood loss function on both training set and trigger set. Notice, we assume the creator of the model will be the one who embeds the watermark, hence has access to the training set, test set, and trigger set.

In the following subsections, we demonstrate the efficiency of our method regarding non-trivial ownership and unremovability and furthermore show that it is functionality-preserving, following the ideas outlined in Section 3. For that we use three different image classification datasets: CIFAR-10, CIFAR-100 and ImageNet . We chose those datasets to demonstrate that our method can be applied to models with a different number of classes and also for large-scale datasets.

2 Non-Trivial Ownership

In the non-trivial ownership setting, an adversary will not be able to claim ownership of the model even if he knows the watermarking algorithm. To fulfill this requirement we randomly sample the examples for the trigger set. We sampled a set of 100 abstract images, and for each image, we randomly selected a target class.

This sampling-based approach ensures that the examples from the trigger set are uncorrelated to each other. Therefore revealing a subset from the trigger set will not reveal any additional information about the other examples in the set, as is required for public verifiability. Moreover, since both examples and labels are chosen randomly, following this method makes back-propagation based attacks extremely hard. Figure 5 shows an example from the trigger set.

3 Functionality-Preserving

For the functionality-preserving property we require that a model with a watermark should be as accurate as a model without a watermark. In general, each task defines its own measure of performance . However, since in the current work we are focused on image classification tasks, we measure the accuracy of the model using the 0-1 loss.

Table 1 summarizes the test set and trigger-set classification accuracy on CIFAR-10 and CIFAR-100, for three different models; (i) a model with no watermark (No-WM); (ii) a model that was trained with the trigger set from scratch (FromScratch); and (iii) a pre-trained model that was trained with the trigger set after convergence on the original training data set (PreTrained).

It can be seen that all models have roughly the same test set accuracy and that in both FromScratch and PreTrained the trigger-set accuracy is 100%. Since the trigger-set labels were chosen randomly, the No-WM models’ accuracy depends on the number of classes. For example, the accuracy on CIFAR-10 is 7.0% while on CIFAR-100 is only 1.0%.

4 Unremovability

In order to satisfy the unremovability property, we first need to define the types of unremovability functions we are going to explore. Recall that our goal in the unremovability experiments is to investigate the robustness of the watermarked models against changes that aim to remove the watermark while keeping the same functionality of the model. Otherwise, one can set all weights to zero and completely remove the watermark but also destroy the model.

Thus, we are focused on fine-tuning experiments. In other words, we wish to keep or improve the performance of the model on the test set by carefully training it. Fine-tuning seems to be the most probable type of attack since it is frequently used and requires less computational resources and training data . Since in our settings we would like to explore the robustness of the watermark against strong attackers, we assumed that the adversary can fine-tune the models using the same amount of training instances and epochs as in training the model.

An important question one can ask is: when is it still my model? or other words how much can I change the model and still claim ownership? This question is highly relevant in the case of watermarking. In the current work we handle this issue by measuring the performance of the model on the test set and trigger set, meaning that the original creator of the model can claim ownership of the model if the model is still $\epsilon$ -accurate on the original test set while also $\epsilon$ -accurate on the trigger set. We leave the exploration of different methods and of a theoretical definition of this question for future work.

We define four different variations of fine-tuning procedures:

Fine-Tune Last Layer (FTLL): Update the parameters of the last layer only. In this setting we freeze the parameters in all the layers except in the output layer. One can think of this setting as if the model outputs a new representation of the input features and we fine-tune only the output layer.

Fine-Tune All Layers (FTAL): Update all the layers of the model.

Re-Train Last Layers (RTLL): Initialize the parameters of the output layer with random weights and only update them. In this setting, we freeze the parameters in all the layers except for the output layer. The motivation behind this approach is to investigate the robustness of the watermarked model under noisy conditions. This can alternatively be seen as changing the model to classify for a different set of output labels.

Re-Train All Layers (RTAL): Initialize the parameters of the output layer with random weights and update the parameters in all the layers of the network.

Figure 6 presents the results for both the PreTrained and FromScratch models over the test set and trigger set, after applying these four different fine-tuning techniques.

The results suggest that while both models reach almost the same accuracy on the test set, the FromScratch models are superior or equal to the PreTrained models overall fine-tuning methods. FromScratch reaches roughly the same accuracy on the trigger set when each of the four types of fine-tuning approaches is applied.

Notice that this observation holds for both the CIFAR-10 and CIFAR-100 datasets, where for CIFAR-100 it appears to be easier to remove the trigger set using the PreTrained models. Concerning the above-mentioned results, we now investigate what will happen if an adversary wants to embed a watermark in a model which has already been watermarked. This can be seen as a black-box attack on the already existing watermark. According to the fine-tuning experiments, removing this new trigger set using the above fine-tuning approaches will not hurt the original trigger set and will dramatically decrease the results on the new trigger set. In the next paragraph, we explore and analyze this setting. Due to the fact that FromScratch models are more robust than PreTrained, for the rest of the paper, we report the results for those models only.

5 Ownership Piracy

As we mentioned in Section 3, in this set of experiments we explore the scenario where an adversary wishes to claim ownership of a model which has already been watermarked.

For that purpose, we collected a new trigger set of different 100 images, denoted as TS-New, and embedded it to the FromScratch model (this new set will be used by the adversary to claim ownership of the model). Notice that the FromScratch models were trained using a different trigger set, denoted as TS-Orig. Then, we fine-tuned the models using RTLL and RTAL methods. In order to have a fair comparison between the robustness of the trigger sets after fine-tuning, we use the same amount of epochs to embed the new trigger set as we used for the original one.

Figure 7 summarizes the results on the test set, TS-New and TS-Orig. We report results for both the FTAL and RTAL methods together with the baseline results of no fine tuning at all (we did not report here the results of FTLL and RTLL since those can be considered as the easy cases in our setting). The red bars refer to the model with no fine tuning, the yellow bars refer to the FTAL method and the blue bars refer to RTAL.

The results suggest that the original trigger set, TS-Orig, is still embedded in the model (as is demonstrated in the right columns) and that the accuracy of classifying it even improves after fine-tuning. This may imply that the model embeds the trigger set in a way that is close to the training data distribution. However, in the new trigger set, TS-New, we see a significant drop in the accuracy. Notice, we can consider embedding TS-New as embedding a watermark using the PreTrained approach. Hence, this accuracy drop of TS-New is not surprising and goes in hand with the results we observed in Figure 6.

In transfer learning we would like to use knowledge gained while solving one problem and apply it to a different problem. For example, we use a trained model on one dataset (source dataset) and fine-tune it on a new dataset (target dataset). For that purpose, we fine-tuned the FromScratch model (which was trained on either CIFAR-10 or CIFAR-100), for another 20 epochs using the labeled part of the STL-10 dataset .

Recall that our watermarking scheme is based on the outputs of the model. As a result, when fine-tuning a model on a different dataset it is very likely that we change the number of classes, and then our method will probably break. Therefore, in order to still be able to verify the watermark we save the original output layer, so that on verification time we use the model’s original output layer instead of the new one.

Following this approach makes both FTLL and RTLL useless due to the fact that these methods update the parameters of the output layer only. Regarding FTAL, this approach makes sense in specific settings where the classes of the source dataset are related to the target dataset. This property holds for CIFAR-10 but not for CIFAR-100. Therefore we report the results only for RTAL method.

Table 2 summarizes the classification accuracy on the test set of STL-10 and the trigger set after transferring from CIFAR-10 and CIFAR-100.

Although the trigger set accuracy is smaller after transferring the model to a different dataset, results suggest that the trigger set still has a lot of presence in the network even after fine-tuning on a new dataset.

6 ImageNet - Large Scale Visual Recognition Dataset

For the last set of experiments, we would like to explore the robustness of our watermarking method on a large scale dataset. For that purpose, we use ImageNet dataset which contains about 1.3 million training images with over 1000 categories.

Table 3 summarizes the results for the functionality-preserving tests. We can see from Table 3 that both models, with and without watermark, achieve roughly the same accuracy in terms of Prec@1 and Prec@5, while the model without the watermark attains 0% on the trigger set and the watermarked model attain 100% on the same set.

Notice that the results we report for ResNet18 on ImageNet are slightly below what is reported in the literature. The reason beyond that is due to training for fewer epochs (training a model on ImageNet is computationally expensive, so we train our models for fewer epochs than what is reported).

In Table 4 we report the results of transfer learning from ImageNet to ImageNet, those can be considered as FTAL, and from ImageNet to CIFAR-10, can be considered as RTAL or transfer learning.

Notice that after fine tuning on ImageNet, trigger set results are still very high, meaning that the trigger set has a very strong presence in the model also after fine-tuning. When transferring to CIFAR-10, we see a drop in the Prec@1 and Prec@5. However, considering the fact that ImageNet contains 1000 target classes, these results are still significant.

7 Technical Details

We implemented all models using the PyTorch package . In all the experiments we used a ResNet-18 model, which is a convolutional based neural network model with 18 layers . We optimized each of the models using Stochastic Gradient Descent (SGD), using a learning rate of 0.1. For CIFAR-10 and CIFAR-100 we trained the models for 60 epochs while halving the learning rate by ten every 20 epochs. For ImageNet we trained the models for 30 epochs while halving the learning rate by ten every ten epochs. The batch size was set to 100 for the CIFAR10 and CIFAR100, and to 256 for ImageNet. For the fine-tuning tasks, we used the last learning rate that was used during training.

Conclusion and Future Work

In this work we proposed a practical analysis of the ability to watermark a neural network using random training instances and random labels. We presented possible attacks that are both black-box and grey-box in the model, and showed how robust our watermarking approach is to them. At the same time, we outlined a theoretical connection to the previous work on backdooring such models.

For future work we would like to define a theoretical boundary for how much change must a party apply to a model before he can claim ownership of the model. We also leave as an open problem the construction of a practically efficient zero-knowledge proof for our publicly verifiable watermarking construction.

Acknowledgments

This work was supported by the BIU Center for Research in Applied Cryptography and Cyber Security in conjunction with the Israel National Cyber Directorate in the Prime Minister’s Office.

References

Appendix A Supplementary Material

(and $\mathsf{mk}|_{1}^{\mathbf{e}}$ accordingly). We assume the existence of a cryptographic hash function $H:\{0,1\}^{p(n)}\rightarrow\{0,1\}^{n}$ .

To achieve public verifiability, we will make use of a cryptographic tool called a zero-knowledge argument , which is a technique that allows a prover $\mathcal{P}$ to convince a verifier $\mathcal{V}$ that a certain public statement is true, without giving away any further information. This idea is similar to the idea of unlimited public verification as outlined in Section 4.1.

Let TM be an abbreviation for Turing Machines. An iTM is defined to be an interactive TM, i.e. a Turing Machine with a special communication tape. Let $L_{R}\subseteq\{0,1\}^{*}$ be an NP language and $R$ be its related NP-relation, i.e. $(x,w)\in R$ iff $x\in L_{R}$ and the TM used to define $L_{R}$ outputs $1$ on input of the statement $x$ and the witness $w$ . We write $R_{x}=\{w~{}|~{}(x,w)\in R\}$ for the set of witnesses for a fixed $x$ . Moreover, let $\mathcal{P},\mathcal{V}$ be a pair of PPT iTMs. For $(x,w)\in R$ , $\mathcal{P}$ will obtain $w$ as input while $\mathcal{V}$ obtains an auxiliary random string $z\in\{0,1\}^{*}$ . In addition, $x$ will be input to both TMs. Denote with $\mathcal{V}^{\mathcal{P}(a)}(b)$ the output of the iTM $\mathcal{V}$ with input $b$ when communicating with an instance of $\mathcal{P}$ that has input $a$ .

$(\mathcal{P},\mathcal{V})$ is called an interactive proof system for the language $L$ if the following two conditions hold:

For every $x\in L_{R}$ there exists a string $w$ such that for every $z$ : $\Pr[\mathcal{V}^{\mathcal{P}(x,w)}(x,z)=1]$ is negligibly close to $1$ .

For every $x\not\in L_{R}$ , every PPT iTM $\mathcal{P}^{*}$ and every string $w,z$ : $\Pr[\mathcal{V}^{\mathcal{P}^{*}(x,w)}(x,z)=1]$ is negligible.

An interactive proof system is called computational zero-knowledge if for every PPT $\hat{\mathcal{V}}$ there exists a PPT simulator $\mathcal{S}$ such that for any $x\in L_{R}$

meaning that all information which can be learned from observing a protocol transcript can also be obtained from running a polynomial-time simulator $\mathcal{S}$ which has no knowledge of the witness $w$ .

A.1.1 Outlining the Idea

An intuitive approach to build $\mathtt{PVerify}$ is to convert the algorithm $\mathtt{Verify}(\mathsf{mk},\mathsf{vk},M)$ from Section 4 into an NP relation $R$ and use a zero-knowledge argument system. Unfortunately, this must fail due to Step 1 of $\mathtt{Verify}$ : there, one tests if the item $\mathsf{b}$ contained in $\mathsf{mk}$ actually is a backdoor as defined above. Therefore, we would need access to the ground-truth function $f$ in the interactive argument system. This first of all needs human assistance, but is moreover only possible by revealing the backdoor elements.

We will now give a different version of the scheme from Section 4 which embeds an additional proof into $\mathsf{vk}$ . This proof shows that, with overwhelming probability, most of the elements in the verification key indeed form a backdoor. Based on this, we will then design a different verification procedure, based on a zero-knowledge argument system.

A.1.2 A Convincing Argument that most Committed Values are Wrongly Classified

Verifying that most of the elements of the trigger set are labeled wrongly is possible, if one acceptsThis is fine if $T$ , as in our experiments, only consists of random images. to release a portion of this set. To solve the proof-of-misclassification problem, we use the so-called cut-and-choose technique: in cut-and-choose, the verifier $\mathcal{V}$ will ask the prover $\mathcal{P}$ to open a subset of the committed inputs and labels from the verification key. Here, $\mathcal{V}$ is allowed to choose the subset that will be opened to him. Intuitively, if $\mathcal{P}$ committed to a large number elements that are correctly labeled (according to $\mathcal{O}_{f}$ ), then at least one of them will show up in the values opened by $\mathcal{P}$ with overwhelming probability over the choice that $\mathcal{V}$ makes. Hence, most of the remaining commitments which were not opened must form a correct backdoor.

$\mathcal{P}$ sends $\mathsf{mk}|_{1}^{\mathbf{e}}$ to $\mathcal{V}$ .

$\mathcal{V}$ checks that for $i\in\mathbf{e}|_{1}$ that

$\mathtt{Open}(c_{t}^{(i)},t^{(i)},r_{t}^{(i)})=1$ ;

$\mathtt{Open}(c_{L}^{(i)},T_{L}^{(i)},r_{L}^{(i)})=1$ ; and

The above argument can be made non-interactive and thus publicly verifiable using the Fiat-Shamir transform: in the protocol $\mathtt{CnC}$ , $\mathcal{P}$ can generate the bit string $\mathbf{e}$ itself by hashing $\mathsf{vk}$ using a cryptographic hash function $H$ . Then $\mathbf{e}$ will be distributed as if it was chosen by an honest verifier, while it is sufficiently random by the guarantees of the hash function to allow the same analysis for cut-and-choose. Any $\mathcal{V}$ can recompute the value $\mathbf{e}$ if it is generated from the commitments (while this also means that the challenge $\mathbf{e}$ is generated after the commitments were computed), and we can turn the above algorithm $\mathtt{CnC}$ into the following non-interactive key-generation algorithm $\mathtt{{PKeyGen}}$ .

Compute $\mathbf{e}\leftarrow H(\mathsf{vk})$ .

Set $\mathsf{mk}_{p}\leftarrow(\mathsf{mk},\mathbf{e})$ , $\mathsf{vk}_{p}\leftarrow(\mathsf{vk},\mathsf{mk}|_{1}^{\mathbf{e}})$ and return $(\mathsf{mk}_{p},\mathsf{vk}_{p})$ .

A.1.3 Constructing the Public Verification Algorithm

In the modified scheme, the $\mathtt{Mark}$ algorithm will only use the private subset $\mathsf{mk}|_{0}^{\mathbf{e}}$ of $\mathsf{mk}_{p}$ but will otherwise remain unchanged. The public verification algorithm for a model $M$ then follows the following structure: (i) $\mathcal{V}$ recomputes the challenge $\mathbf{e}$ ; (ii) $\mathcal{V}$ checks $\mathsf{vk}_{p}$ to assure that all of $\mathsf{vk}|_{1}^{\mathbf{e}}$ will form a valid backdoor ; and (iii) $\mathcal{P},\mathcal{V}$ run $\mathtt{Classify}$ on $\mathsf{mk}|_{0}^{\mathbf{e}}$ using the interactive zero-knowledge argument system, and further test if the watermarking conditions on $M,\mathsf{mk}|_{0}^{\mathbf{e}},\mathsf{vk}|_{0}^{\mathbf{e}}$ hold.

$\mathcal{V}$ computes $\mathbf{e}^{\prime}\leftarrow H(\mathsf{vk})$ . If $\mathsf{mk}|_{1}^{\mathbf{e}}$ in $\mathsf{vk}_{p}$ does not match $\mathbf{e}^{\prime}$ then abort, else continue assuming $\mathbf{e}=\mathbf{e}^{\prime}$ .

$\mathcal{V}$ checks that for all $i\in\mathbf{e}|_{1}$ :

$\mathtt{Open}(c_{t}^{(i)},t^{(i)},r_{t}^{(i)})=1$

$\mathtt{Open}(c_{L}^{(i)},T_{L}^{(i)},r_{L}^{(i)})=1$

If one of the checks fails, then $\mathcal{V}$ aborts.

$\mathcal{P},\mathcal{V}$ compute a circuit $C$ with input $\mathsf{mk}|_{0}^{\mathbf{e}}$ that outputs $1$ iff for all $i\in\mathbf{e}|_{0}$ :

$\mathtt{Open}(c_{t}^{(i)},t^{(i)},r_{t}^{(i)})=1$

$\mathtt{Open}(c_{L}^{(i)},T_{L}^{(i)},r_{L}^{(i)})=1$ .

Moreover, it tests that $\mathtt{Classify}(t^{(i)},M)=T_{L}^{(i)}$ for all but $\epsilon|\mathbf{e}|_{0}|$ elements.

$\mathcal{P},\mathcal{V}$ run a zero-knowledge argument for the given relation R using $C$ as the statement, where the witness $\mathsf{mk}|_{0}^{\mathbf{e}}$ is the secret input of $\mathcal{P}$ . $\mathcal{V}$ accepts iff the argument succeeds.