A Watermark for Large Language Models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, Tom Goldstein

Introduction

Large language models (LLMs), such as the recently developed ChatGPT, can write documents, create executable code, and answer questions, often with human-like capabilities (Schulman et al., 2022). As these systems become more pervasive, there is increasing risk that they may be used for malicious purposes (Bergman et al., 2022; Mirsky et al., 2023). These include social engineering and election manipulation campaigns that exploit automated bots on social media platforms, creation of fake news and web content, and use of AI systems for cheating on academic writing and coding assignments. Furthermore, the proliferation of synthetic data on the web complicates future dataset creation efforts, as synthetic data is often inferior to human content and must be detected and excluded before model training (Radford et al., 2022). For many reasons, the ability to detect and audit the usage of machine-generated text becomes a key principle of harm reduction for large language models (Bender et al., 2021; Crothers et al., 2022; Grinbaum & Adomaitis, 2022).

The watermark can be algorithmically detected without any knowledge of the model parameters or access to the language model API. This property allows the detection algorithm to be open sourced even when the model is not. This also makes detection cheap and fast because the LLM does not need to be loaded or run.

Watermarked text can be generated using a standard language model without re-training.

The watermark is detectable from only a contiguous portion of the generated text. This way, the watermark remains detectable when only a slice of the generation is used to create a larger document.

The watermark cannot be removed without modifying a significant fraction of the generated tokens.

We can compute a rigorous statistical measure of confidence that the watermark has been detected.

Language models have a “vocabulary” $\mathcal{V}$ containing words or word fragments known as “tokens.” Typical vocabularies contain $|\mathcal{V}|=50,000$ tokens or more (Radford et al., 2019; Liu et al., 2019). Consider a sequence of $T$ tokens $\{s^{(t)}\}\in\mathcal{V}^{T}$ . Entries with negative indices, $s^{(-N_{p})},\cdots,s^{(-1)}$ , represent a “prompt” of length $N_{p}$ and $s^{(0)},\cdots,s^{(T)}$ are tokens generated by an AI system in response to the prompt.

A language model (LM) for next word prediction is a function $f$ , often parameterized by a neural network, that accepts as input a sequence of known tokens $s^{(-N_{p})},\cdots,s^{(t-1)}$ , which contains a prompt and the first $t-1$ tokens already produced by the language model, and then outputs a vector of $|\mathcal{V}|$ logits, one for each word in the vocabulary. These logits are then passed through a softmax operator to convert them into a discrete probability distribution over the vocabulary. The next token at position $t$ is then sampled from this distribution using either standard multinomial sampling, or greedy sampling (greedy decoding) of the single most likely next token. Additionally, a procedure such as beam search can be employed to consider multiple possible sequences before selecting the one with the overall highest score.

A caveat: The difficulty of watermarking low-entropy sequences

Consider the following two sequences of tokens, with prompts in red:

The quick brown fox jumps over the lazy dog

for(i=0;i

Were they produced by a human or by a language model? Determining this is fundamentally hard because these sequences have low entropy; the first few tokens strongly determine the following tokens.

Low entropy text creates two problems for watermarking. First, both humans and machines provide similar if not identical completions for low entropy prompts, making it impossible to discern between them. Second, it is difficult to watermark low entropy text, as any changes to the choice of tokens may result in high perplexity, unexpected tokens that degrade the quality of the text. Later, we rigorously define sentence entropy, and analyze its impact on watermark detection.

A simple proof of concept

We start out by describing a simple “hard” red list watermark in Algorithm 1 that is easy to analyze, easy to detect and hard to remove. The simplicity of this approach comes at the cost of poor generation quality on low entropy sequences. We will discuss more sophisticated strategies later.

The method works by generating a pseudo-random red list of tokens that are barred from appearing as $s^{(t)}.$ The red list generator is seeded with the prior token $s^{(t-1)}$ , enabling the red list to be reproduced later without access to the entire generated sequence.
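The red list construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's Algorithm 1: the function names, the integer `key`, the seeding rule `key * (prev_token + 1)`, and the use of Python's `random.Random` as the generator are all assumptions made for the example.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, key: int = 15485863) -> set:
    """Green half of a vocabulary partition seeded by the previous token."""
    rng = random.Random(key * (prev_token + 1))  # hypothetical seeding rule
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[: vocab_size // 2])  # the remaining half is the red list

def hard_watermark_logits(logits: list, prev_token: int) -> list:
    """Bar red list tokens by setting their logits to -inf before sampling."""
    green = green_list(prev_token, len(logits))
    return [l if k in green else -math.inf for k, l in enumerate(logits)]

# Toy vocabulary of 8 tokens; the previous token id 42 is arbitrary.
logits = [2.0, 1.5, 0.3, -1.0, 0.9, 0.1, 1.1, -0.4]
masked = hard_watermark_logits(logits, prev_token=42)
next_token = max(range(len(masked)), key=masked.__getitem__)  # greedy pick
```

Because the partition depends only on the previous token and the key, anyone who knows both can rebuild the same red list at detection time without running the model.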

Detecting the watermark. While producing watermarked text requires access to the language model, detecting the watermark does not. A third party with knowledge of the hash function and random number generator can reproduce the red list for each token and count how many times the red list rule is violated. We can detect the watermark by testing the following null hypothesis,

$$H_{0}:\ \text{The text sequence is generated with no knowledge of the red list rule.}\tag{1}$$

Because the red list is chosen at random, a natural writer is expected to violate the red list rule with half of their tokens, while the watermarked model produces no violations. The probability that a natural source produces $T$ tokens without violating the red list rule is only $1/2^{T},$ which is vanishingly small even for short text fragments with a dozen words. This enables detection of the watermark (rejection of $H_{0}$ ) for, e.g., a synthetic tweet.

A more robust detection approach uses a one proportion z-test to evaluate the null hypothesis. If the null hypothesis is true, then the number of green list tokens, denoted $|s|_{G},$ has expected value $T/2$ and variance $T/4.$ The $z$-statistic for this test is

$$z=2\left(|s|_{G}-T/2\right)/\sqrt{T}.\tag{2}$$

We reject the null hypothesis and detect the watermark if $z$ is above a chosen threshold. Suppose we choose to reject the null hypothesis if $z>4.$ In this case, the probability of a false positive is $3\times 10^{-5},$ which is the one-sided p-value corresponding to $z>4.$ At the same time, we will detect any watermarked sequence with 16 or more tokens (the minimum value of $T$ that produces $z=4$ when $|s|_{G}=T$).
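The numbers in this paragraph can be checked directly. The sketch below assumes the one proportion z-statistic takes the form $z=2(|s|_{G}-T/2)/\sqrt{T}$ (the form used in the removal example that follows) and uses the Gaussian tail for the one-sided p-value; the function names are illustrative.

```python
import math

def z_score(green_count: float, T: int) -> float:
    """One proportion z-statistic for a half/half red-green split."""
    return 2 * (green_count - T / 2) / math.sqrt(T)

def one_sided_p_value(z: float) -> float:
    """Upper tail of the standard normal distribution, P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z_16 = z_score(16, 16)            # every one of 16 tokens is green
p_at_4 = one_sided_p_value(4.0)   # false positive rate at the z > 4 threshold
z_15 = z_score(15, 15)            # one token fewer falls short of the threshold
```

With 16 all-green tokens the statistic reaches the threshold exactly, while 15 tokens give a value just below 4, which is why 16 is the minimum detectable length for the hard watermark.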

How hard is it to remove the watermark? The use of the one proportion z-test makes removal of the watermark difficult. Consider the case of a watermarked sequence of length $T=1000$ . Suppose an adversary modifies 200 tokens in the sequence to add red list words and scrub the watermark. A modified token at position $t$ can violate the red list rule at position $t$ . Furthermore, the value of $s^{(t)}$ determines the red list for token $s^{(t+1)},$ and a maximally adversarial choice of $s^{(t)}$ will put $s^{(t+1)}$ in violation of the red list rule as well. For this reason, 200 token flips can create at most 400 violations of the red list rule. Unfortunately for the attacker, this maximally adversarial sequence with 600 remaining green list tokens still produces a z-statistic of $2(600-1000/2)/\sqrt{1000}\approx 6.3,$ and a p-value of $\approx 10^{-10},$ leaving the watermark readily detectable with extremely high confidence. In general, removing the watermark of a long sequence requires modifying roughly one quarter of the tokens or more.

Note the analysis above assumes the attacker has complete knowledge of the watermark, and each selected token is maximally adversarial (which likely has a negative impact on quality). Without knowledge of the watermark algorithm, each flipped token has only a 50% chance of being in the red list, as does the adjacent token. In this case, the attacker above only creates 200 red list words (in expectation) by modifying 200 tokens. Methods for keeping the watermark algorithm secret but available via API are discussed in Section 5.
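The removal arithmetic above can be reproduced as follows. The sketch assumes the statistic $z=2(|s|_{G}-T/2)/\sqrt{T}$ and a detection threshold of $z>4$; the search for the smallest number of violations is an illustration added here, not a procedure from the paper.

```python
import math

def z_score(green_count: float, T: int) -> float:
    """One proportion z-statistic for a half/half red-green split."""
    return 2 * (green_count - T / 2) / math.sqrt(T)

T = 1000
# Informed attacker: each of 200 flips causes up to two violations.
z_informed = z_score(T - 2 * 200, T)
# Blind attacker: 200 flips create about 200 red list tokens in expectation.
z_blind = z_score(T - 200, T)

# Smallest number of red list violations that pushes z below the threshold.
violations_needed = next(v for v in range(T + 1) if z_score(T - v, T) < 4.0)
# Best case for the attacker: every flip yields two violations.
flips_needed = math.ceil(violations_needed / 2)
```

Under these assumptions even a fully informed attacker needs a little over a fifth of the 1000 tokens to be flipped, which is in line with the "roughly one quarter" estimate, and a blind attacker is left far above the threshold.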

Drawbacks of the hard red list rule. The hard red list rule handles low entropy sequences in a simple way; it prevents the language model from producing them. For example, the token “Barack” is almost deterministically followed by “Obama” in many text datasets, yet “Obama” may be disallowed by the red list.

A better behavior is to use a “soft” watermarking rule that is only active for high-entropy text that can be imperceptibly watermarked. As long as low-entropy sequences are wrapped inside a passage with enough total entropy, the passage will still easily trigger a watermark detector, solving the problem described in Section 1.2. Further, one can combine the watermark with a beam search decoder that “irons-in” the watermark. By searching the hypothesis space of likely token sequences, candidate sequences with a high density of tokens in the green list are found, resulting in a high strength watermark with minimal perplexity cost.

A more sophisticated watermark

We now discuss the “soft” watermark that promotes the use of the green list for high entropy tokens when many good choices are available, while having little impact on the choice of low-entropy tokens that are nearly deterministic.

To derive this watermark, we examine what happens in the language model just before it produces a probability vector. The last layer of the language model outputs a vector of logits $l^{(t)}$ . These logits get converted into a probability vector $p^{(t)}$ using the softmax operator

$$p^{(t)}_{k}=\exp(l^{(t)}_{k})/\sum_{i}\exp(l^{(t)}_{i}).$$

Rather than strictly prohibiting the red list tokens, Algorithm 2 adds a constant $\delta$ to the logits of the green list tokens.

The soft red list rule adaptively enforces the watermark in situations where doing so will have little impact on quality, while almost ignoring the watermark rule in the low entropy case where there is a clear and unique choice of the “best” word. A highly likely word with $p^{(t)}_{k}\approx 1$ has a much larger logit than other candidates, and this will remain the largest regardless of whether it is in the red list. But when the entropy is high, there are many comparably large logits to choose from, and the $\delta$ rule has a large impact on the sampling distribution, strongly biasing the output towards the green list.
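A small numerical example makes the adaptive behavior concrete. The sketch below adds $\delta$ to the green list logits before a softmax, as the text describes; the four-token vocabulary, the choice of green tokens, and the function names are assumptions made for illustration.

```python
import math

def softmax(logits: list) -> list:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_watermark_probs(logits: list, green: set, delta: float) -> list:
    """Add delta to green list logits, then apply the softmax."""
    return softmax([l + delta if k in green else l for k, l in enumerate(logits)])

green = {1, 2}   # toy green list; token 0 is deliberately on the red list
delta = 2.0

# Low entropy: token 0 dominates and stays dominant although it is red.
low_entropy = soft_watermark_probs([10.0, 0.0, 0.0, 0.0], green, delta)

# High entropy: all tokens are equally likely, so the bias decides.
high_entropy = soft_watermark_probs([0.0, 0.0, 0.0, 0.0], green, delta)
green_mass = sum(high_entropy[k] for k in green)
```

In the low entropy case the red token keeps almost all of the probability mass, while in the uniform case roughly 88% of the mass moves to the green list.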

The process for detecting the soft watermark is identical to that for the hard watermark. We assume the null hypothesis (1) and compute a z-statistic using Equation (2). We reject the null hypothesis and detect the watermark if $z$ is greater than a threshold. For arbitrary $\gamma$ we have

$$z=\left(|s|_{G}-\gamma T\right)/\sqrt{T\gamma(1-\gamma)}.$$

Consider again the case in which we detect the watermark for $z>4.$ Just like in the case of the hard watermark, we get false positives with rate $3\times 10^{-5}.$ In the case of the hard watermark, we could detect any watermarked sequence of length 16 tokens or more, regardless of the properties of the text. However, in the case of the soft watermark our ability to detect synthetic text depends on the entropy of the sequence. High entropy sequences are detected with relatively few tokens, while low entropy sequences require more tokens for detection. Below, we rigorously analyze the detection sensitivity of the soft watermark, and its dependence on entropy.
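An end-to-end detector along these lines might look as follows. It is a sketch under stated assumptions: the green list is rebuilt from the previous token with a hypothetical seeding rule, and the statistic for a green list fraction $\gamma$ is taken to be $z=(|s|_{G}-\gamma T)/\sqrt{T\gamma(1-\gamma)}$, which reduces to the $\gamma=1/2$ case used earlier.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5,
               key: int = 15485863) -> set:
    """Green list of size gamma * vocab_size seeded by the previous token."""
    rng = random.Random(key * (prev_token + 1))  # hypothetical seeding rule
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[: int(gamma * vocab_size)])

def detect(tokens: list, vocab_size: int, gamma: float = 0.5,
           threshold: float = 4.0):
    """Return (z, flagged) for a token sequence; needs no language model."""
    T = len(tokens) - 1  # the first token only seeds the next green list
    green = sum(cur in green_list(prev, vocab_size, gamma)
                for prev, cur in zip(tokens, tokens[1:]))
    z = (green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
    return z, z > threshold

# Toy "hard watermarked" sequence: always emit some green token.
V = 100
marked = [0]
for _ in range(50):
    marked.append(min(green_list(marked[-1], V)))
z_marked, flagged = detect(marked, V)
```

For a sequence in which all 50 scored tokens are green and $\gamma=0.5$, the statistic equals $\sqrt{50}\approx 7.1$, well above the threshold of 4.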

Analysis of the soft watermark

In this section, we examine the expected number of green list tokens used by a watermarked language model and analyze the dependence of this quantity on the entropy of a generated text fragment. Our analysis assumes the red list is sampled uniformly at random. This is a deviation from the method used in practice, which generates red lists using a pseudo-random number generator seeded with previous tokens. The consequences of pseudo-random sampling are explored in Section 5. We analyze the case in which text is generated by multinomial random sampling. In our experiments, we consider two more sampling schemes, greedy decoding and beam search.

We need a definition of entropy that is appropriate for our analysis. The strength of our watermark is weak when the distribution over tokens has a large “spike” concentrated on one or several tokens. We define the following type of entropy to quantify this phenomenon.

Given a discrete probability vector $p$ and a scalar $z,$ we define the spike entropy of $p$ with modulus $z$ as

$$S(p,z)=\sum_{k}\frac{p_{k}}{1+zp_{k}}.$$

Like the classical Shannon entropy, the spike entropy is a measure of how spread out a distribution is. The spike entropy assumes its minimal value of $\frac{1}{1+z}$ when the entire mass of $p$ is concentrated at a single location, and its maximal value of $\frac{N}{N+z}$ when the mass of $p$ is uniformly distributed. For large $z$ , the value of $\frac{p_{k}}{1+zp_{k}}\approx 1/z$ when $p_{k}>1/z$ and $\approx 0$ for $p_{k}<1/z.$ For this reason, one can interpret the spike entropy as a softened measure of the number of entries in $p$ greater than $1/z.$
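The two extreme values quoted above are easy to confirm numerically. The sketch assumes the spike entropy is $S(p,z)=\sum_{k}p_{k}/(1+zp_{k})$, the form implied by the summand $\frac{p_{k}}{1+zp_{k}}$ in the text; the function name is illustrative.

```python
def spike_entropy(p: list, z: float) -> float:
    """Spike entropy of probability vector p with modulus z."""
    return sum(pk / (1 + z * pk) for pk in p)

z = 3.0
N = 4
one_hot = [1.0, 0.0, 0.0, 0.0]   # all mass on one token: minimal value
uniform = [1.0 / N] * N          # evenly spread mass: maximal value

s_min = spike_entropy(one_hot, z)   # expected to equal 1 / (1 + z)
s_max = spike_entropy(uniform, z)   # expected to equal N / (N + z)
```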

The following theorem predicts the number of green list tokens that appear in a sequence with the watermark.

Consider watermarked text sequences of $T$ tokens. Each sequence is produced by sequentially sampling a raw probability vector $p^{(t)}$ from the language model, sampling a random green list of size $\gamma N$ , and boosting the green list logits by $\delta$ using Equation 4 before sampling each token. Define $\alpha=\exp(\delta),$ and let $|s|_{G}$ denote the number of green list tokens in sequence $s.$

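If a randomly generated watermarked sequence has average spike entropy at least $S^{\star},$ i.e.,

$$\frac{1}{T}\sum_{t}S\left(p^{(t)},\frac{(1-\gamma)(\alpha-1)}{1-\gamma+\alpha\gamma}\right)\geq S^{\star},$$

then the number of green list tokens in the sequence has expected value at least

$$\mathbb{E}|s|_{G}\geq\frac{\gamma\alpha T}{1+(\alpha-1)\gamma}S^{\star}.$$

Furthermore, the number of green list tokens has variance at most

$$\operatorname{Var}|s|_{G}\leq T\frac{\gamma\alpha S^{\star}}{1+(\alpha-1)\gamma}\left(1-\frac{\gamma\alpha S^{\star}}{1+(\alpha-1)\gamma}\right).$$

If we have chosen $\gamma\geq 0.5,$ then we can use the strictly looser but simpler bound

$$\operatorname{Var}|s|_{G}\leq T\gamma(1-\gamma).$$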

Remark. It may seem like there are a lot of messy constants floating around in this bound. However, when we choose $\gamma=\frac{1}{2}$ and $\delta=\ln(2)\approx 0.7,$ this bound simplifies to

$$\mathbb{E}|s|_{G}\geq\frac{2}{3}TS^{\star},$$

where $S^{\star}$ is a bound on spike entropy with modulus 1/3. If we study the “hard” red list rules by choosing $\gamma=\frac{1}{2}$ and letting $\delta\to\infty,$ we have

$$\mathbb{E}|s|_{G}\geq TS^{\star},$$

where $S^{\star}$ is a bound on spike entropy with modulus 1.

The sensitivity of the soft watermark can be computed using standard type-II error analysis. For illustrative purposes, we estimate the type-II (false negative) error rate of a soft watermark with $\gamma=.5$ and $\delta=2.$ We assume 200 tokens are generated with OPT-1.3B (Zhang et al., 2022) using prompts from the C4 dataset’s RealNewsLike subset (Raffel et al., 2019). We also assume a detection threshold of $z=4$ (which occurs at $\sim 128.2/200$ tokens) which gives us a type-I error (false positive) rate of $3\times 10^{-5}$ .

Theoretical bound. Our generations have an average spike entropy per sample of $S=0.807$ over $\sim 500$ generations. Theorem 4.2 says that the expected number of green list tokens per generation is at least $142.2$ . Indeed, the empirical average is $159.5$ . For sequences with entropy equal to the mean ( $S=0.807$ ) we get $\sigma\leq 6.41$ tokens, and 98.6% sensitivity (1.4% type-II error rate), using a standard Gaussian approximation for the green list count. Note, this is a lower bound on the sensitivity for this particular entropy. If we use the true empirical mean of $159.5$ rather than the theoretical bound, we get a $5.3\times 10^{-7}$ type-II error rate, a realistic approximation but not a rigorous lower bound.
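The figures in this paragraph can be approximately reproduced with a few lines of arithmetic. The sketch assumes the theorem's bounds take the form $\mathbb{E}|s|_{G}\geq qT$ and $\operatorname{Var}|s|_{G}\leq Tq(1-q)$ with $q=\gamma\alpha S/(1+(\alpha-1)\gamma)$ and $\alpha=\exp(\delta)$, and that the green list count is Gaussian; function and variable names are illustrative.

```python
import math

def green_fraction_bound(gamma: float, delta: float, S: float) -> float:
    """Assumed lower bound q on the expected green list fraction."""
    alpha = math.exp(delta)
    return gamma * alpha * S / (1 + (alpha - 1) * gamma)

def type2_error(threshold_count: float, mean: float, sigma: float) -> float:
    """P(green count < threshold) under a Gaussian approximation."""
    return 0.5 * math.erfc((mean - threshold_count) / (sigma * math.sqrt(2)))

T, gamma, delta, S = 200, 0.5, 2.0, 0.807
q = green_fraction_bound(gamma, delta, S)
mean_bound = T * q                         # about 142.2 green tokens
sigma_bound = math.sqrt(T * q * (1 - q))   # about 6.41 tokens
# Green count at which z = 4 for T = 200 and gamma = 0.5.
threshold_count = gamma * T + 4 * math.sqrt(T * gamma * (1 - gamma))

err_bound = type2_error(threshold_count, mean_bound, sigma_bound)
err_empirical = type2_error(threshold_count, 159.5, sigma_bound)
```

The results land close to the quoted 1.4% and $5.3\times 10^{-7}$; small differences arise from how the threshold count (128 versus about 128.3) is rounded.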

Empirical sensitivity. Empirically, $98.4\%$ of generations are detected at the $z=4$ ( $128$ token) threshold when multinomial sampling is used. When $4$ -way beam search over a greedy decoding is used, we get $99.6\%$ empirical sensitivity. Unlike the theoretical bounds, these are computed over all generations, which have the same length but vary in their individual entropies. Here, the primary source of type-II errors is low entropy sequences, as calculations above show that we expect a very low error rate when the entropy lies near the mean. To validate this, we examine the subset of 375/500 generations that have spike entropy above the $25$ th percentile, of which we detect $100\%$ of generations at the $z=4$ threshold.

What do failure cases look like? We display typical success and failure cases for the watermark in Table 1. We observe that low-entropy (undetectable) sequences typically involve data memorization; the model regurgitates a copy (or near copy) of human-written text which is therefore not detectable as machine-written. A detailed exploration of model accuracy is presented in Section 6, with more generation examples provided in Section A.1.

Evaluating Repetitive Text. A subtlety of the proposed approach is that tokens in the green list are only pseudo-random, and $n$-grams of text that are repeated will always be scored in the same manner. Assume a $2$-gram, such as “Barack Obama”, happens to green-list “Obama”. Repetitive usage of this $2$-gram would result in a higher than expected number of green tokens. In a worst-case scenario, human-generated text with a high number of repetitions of this 2-gram may be erroneously flagged as machine-generated.

Two remedies are possible: The first is to simply increase the length $h$ of the PRNG function, thereby increasing the variability of the green-listed words, as larger $(h+1)$ -grams are much less likely to be repeated. A better remedy (possibly used in conjunction with the first) is not to count repeated $n$ -grams when checking for the watermark. In the example above, the 2-gram “Barack Obama” would be counted on its first occurrence, and then subsequently ignored when it appears again; it is counted as neither green nor red, and the token counter $T$ is not incremented.

In addition to preventing false positives, skipping repeated $n$ -grams can also make the detector more sensitive. A repeated $n$ -gram is likely to be low-entropy, and so we cannot avoid its use when it contains red list words. By excluding these from the count, we keep the green list fraction high and maintain high sensitivity.
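The second remedy, scoring each $n$-gram only on its first occurrence, can be sketched for the 2-gram case as below. The green list construction and its seeding rule are hypothetical stand-ins, and the toy sequence simply repeats one green 2-gram.

```python
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5,
               key: int = 15485863) -> set:
    """Green list of size gamma * vocab_size seeded by the previous token."""
    rng = random.Random(key * (prev_token + 1))  # hypothetical seeding rule
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[: int(gamma * vocab_size)])

def score_unique_bigrams(tokens: list, vocab_size: int, gamma: float = 0.5):
    """Count green tokens, skipping 2-grams that were already scored."""
    seen, green, T = set(), 0, 0
    for prev, cur in zip(tokens, tokens[1:]):
        if (prev, cur) in seen:
            continue  # counted as neither green nor red; T is not incremented
        seen.add((prev, cur))
        T += 1
        green += cur in green_list(prev, vocab_size, gamma)
    z = (green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
    return green, T, z

# Repeat a 2-gram whose second token is green for the first token.
V = 100
g = min(t for t in green_list(3, V) if t != 3)
repetitive = [3, g] * 20
green, T, z = score_unique_bigrams(repetitive, V)
```

A naive count would score the green 2-gram twenty times; with de-duplication only two distinct 2-grams are scored, so the repetition alone can no longer push the statistic past the threshold.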

Impact on quality of generated text

A soft watermark has very little impact on the perplexity of tokens with extremely high or low entropy. When the distribution produced by the language model is uniform (maximal entropy), the randomness of the green list results in tokens being uniformly sampled, and the perplexity remains untouched. Conversely, in the case of minimal entropy, where all probability mass is concentrated on a single token, the soft watermark rule has no effect and there is once again no impact on perplexity.

The watermark rule does impact perplexity for tokens of moderate entropy. In this case, we can provide the following simple bound that holds uniformly over all entropy values.

Consider a sequence $s^{(i)},-N_{p}<i<T.$ Suppose the (non-watermarked) language model produces a probability vector $p^{(T)}$ for the token at position $T.$ The watermarked model predicts the token at position $T$ using modified probability vector $\hat{p}^{(T)}.$ The expected perplexity of the $T$ th token with respect to the randomness of the red list partition is

where $P^{*}=\sum_{k}p^{(T)}_{k}\ln(p^{(T)}_{k})$ is the perplexity of the original model.

Private Watermarking

The watermark algorithms above are designed to be public. A watermark can also be operated in private mode, in which the algorithm uses a random key that is kept secret and hosted behind a secure API. If the attacker has no knowledge of the key used to produce the red list, it becomes more difficult for the attacker to remove the watermark as the attacker does not know which tokens are in the red list. However, testing for the presence of the watermark now requires using the same secure API and, if this API is public, access needs to be monitored to prevent an adversary from making too many queries using minor variants of the same sequence.

Let $F$ be a pseudorandom function (PRF) that, for simplicity, we view as accepting arbitrary length inputs and producing output as long as needed. $F$ could be a standard block cipher like AES or a cryptographic hash function like SHA3. To create a private watermark, we first choose a random key $\mathcal{K}$ ; a private red list for token $s^{(t)}$ can then be generated in a manner similar to what was described earlier, but now by first computing $F_{\mathcal{K}}(s^{(t-h)},\cdots,s^{(t-1)})$ , a pseudorandom function evaluated on the prior $h$ tokens.
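A keyed list generator of this kind could be sketched as follows. Several choices here are assumptions rather than the paper's construction: HMAC-SHA256 stands in for the PRF $F$, the context tokens are serialized as a comma-separated string, and Python's `random.Random` (which is not a cryptographic generator) expands the PRF output into a vocabulary permutation.

```python
import hashlib
import hmac
import random

def private_green_list(key: bytes, context: list, vocab_size: int,
                       gamma: float = 0.5) -> set:
    """Green list derived from a secret key and the prior h tokens."""
    msg = ",".join(str(t) for t in context).encode()
    # HMAC-SHA256 plays the role of the keyed PRF (an assumption).
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[: int(gamma * vocab_size)])

context = [5, 9]  # the prior h = 2 token ids (toy values)
list_a = private_green_list(b"secret-key-1", context, 1000)
list_b = private_green_list(b"secret-key-2", context, 1000)
```

The same key and context always yield the same list, while a party without the key cannot reproduce it; a production system would want a cryptographically sound way to expand the PRF output.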

An attacker can discover the watermarking rules by observing occurrences of token tuples in generated text and tabulating the frequencies of the immediately subsequent tokens, even if the underlying key is unknown. To tabulate every red list in such a brute-force attack, $|\mathcal{V}|^{1+h}$ tokens need to be submitted to the detection API. When $h=1$ , the red lists produced by many tokens could be discovered (at least partially) with conceivable effort. This brute-force method is ineffective for $h\gg 1$ , as there is now a unique red list for each ordered combination of words. At the same time, large values of $h$ decrease watermark robustness when a naive method is used. When, say, $h=5$ consecutive tokens are used to produce a red list, an adversarial change to just one of those tokens randomizes the red list for 5 different downstream tokens, increasing the number of red list words by $2.5$ (in expectation) if $\gamma=.5$ . We call this downstream impact attack amplification. To limit amplification, we suggest using a small window ( $h=2$ or $3$ ) when using the naive watermarking rule.
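The trade-off between brute-force cost and attack amplification can be made concrete with two small formulas. The sketch assumes a vocabulary of 50,000 tokens, and that each of the $h$ downstream tokens was green before the change and lands in the red list with probability $1-\gamma$ once its list is randomized; the function names are illustrative.

```python
def brute_force_queries(vocab_size: int, h: int) -> int:
    """Tokens to submit in order to tabulate every red list."""
    return vocab_size ** (1 + h)

def expected_extra_red(h: int, gamma: float) -> float:
    """Expected new red list tokens downstream of one changed token."""
    return h * (1 - gamma)

V = 50_000
cost_h1 = brute_force_queries(V, 1)        # 2.5 billion tokens
cost_h5 = brute_force_queries(V, 5)        # on the order of 10**28 tokens
amplification_h5 = expected_extra_red(5, 0.5)
```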

When a wider window $h$ is desired, more complex, robust watermarking rules can achieve security against brute-force attacks without attack amplification. We describe such a rule in Algorithm 3. Here, the red list for $s^{(t)}$ depends on itself, and additionally on one prior token $s^{(t-i^{\star})}$ chosen using a pseudo-random rule. To satisfy this self-hash condition, we iteratively test different tokens as $s^{(t)},$ from highest logit to least logit, until the red list rule is satisfied. If, during this search, the logit of the test token falls by more than $\delta$ , we give up and accept the token in the red list with largest logit.

Algorithm 3 has several nice security properties. When one of the prior $h$ tokens is changed, the watermark at position $t$ changes with probability only $1/h$ . As such, this rule is free of attack amplification; in expectation, a change to a token results in one additional red list token. When $\gamma=.5$ , the flipped token is in the red list $1/2$ of the time, and one of the $h$ downstream red lists is expected to randomize, resulting in another $1/2$ red list token. Like the naive method with $h=2$ , there are $|\mathcal{V}|^{2}$ unique red lists, but now the choice of the index $i^{\star}$ depends on combinations of $s^{(t)}$ and all $h$ tokens before it, which hides the choice of tokens used as input to $F$ . For simplicity, Algorithm 3 is presented as a greedy sampler, but can be easily extended to handle multinomial sampling or beam search.

A straightforward add-on to further boost the difficulty of brute-forcing any hidden watermark scheme is simply to watermark with multiple keys. Even a nominally insecure watermark with a small window size such as $h=1$ can be boosted to be exponentially harder to break with a moderate number of keys, e.g., $k=5$ . In the simplest setup, one of the keys is sampled randomly at generation time for every token separately, and at detection time all keys are tested. However, this increases the fraction of expected hits for unwatermarked text from $\gamma$ to $1-(1-\gamma)^{k}$ . Nevertheless, this reduction in power can be remedied by switching keys randomly, but not at every token, e.g., every 100-200 tokens. Finally, the optimal setup would not choose these $k$ lists at random, but counter-balance them, so that frequency analysis of the $n$ -grams of text with multiple keys would exactly return the expected natural distribution of $n$ -grams.
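The cost of testing all keys at every token is easy to quantify. The sketch below evaluates the expression $1-(1-\gamma)^{k}$ from the text, assuming the $k$ green lists are independent; the function name is illustrative.

```python
def multi_key_hit_rate(gamma: float, k: int) -> float:
    """Chance that an unwatermarked token is green under at least one of k keys."""
    return 1 - (1 - gamma) ** k

single_key = multi_key_hit_rate(0.5, 1)   # the usual baseline of gamma
five_keys = multi_key_hit_rate(0.5, 5)    # almost every token counts as a hit
```

With $\gamma=0.5$ and five keys, nearly 97% of unwatermarked tokens register as hits, which leaves little room to separate watermarked from natural text and motivates switching keys only every few hundred tokens.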

A range of more complex mechanisms are possible with different efficiency and security tradeoffs, but we leave detailed consideration to future research.

Experiments

In this section we explore the behavior of the watermark using the OPT-1.3B model (Zhang et al., 2022). We measure watermark strength using the rate of type-I errors (human text falsely flagged as watermarked) and type-II errors (watermarked text not detected).

We implement the proposed watermark using the PyTorch backend of the Hugging Face library (Wolf et al., 2020). The generate API provides useful abstractions, including modules for warping the logit distribution that comes out of the language model. We generate red lists using the torch random number generator and one previous token as described in Section 3.

Datasets and Prompts. To simulate a variety of realistic language modeling scenarios we slice and dice a random selection of texts from the news-like subset of the C4 dataset (Raffel et al., 2019). For each random string, we trim a fixed length of tokens from the end and treat them as a “baseline” completion. The remaining tokens are a prompt. For the experimental runs using multinomial sampling, we pull examples from the dataset until we achieve at least 500 generations with length $T=200\pm 5$ tokens. In the runs using greedy and beam search decoding, we suppress the EOS token during generation to combat the tendency of beam search to generate short sequences. We then truncate all sequences to $T=200$ . A larger oracle language model (OPT-2.7B) is used to compute perplexity (PPL) for the generated completions and for the human baseline.

Watermark Strength vs Text Quality. One can achieve a very strong watermark for short sequences by choosing a small green list size $\gamma$ and a large green list bias $\delta.$ However, creating a stronger watermark may distort generated text. Figure 2 (left) shows the tradeoff between watermark strength ( $z$-score) and text quality (perplexity) for various combinations of watermarking parameters. We compute results using $500\pm 10$ sequences of length $T=200\pm 5$ tokens for each parameter choice. Interestingly, we see that a small green list, $\gamma=.1$, is Pareto-optimal.

In addition to these quantitative results, we show examples of real prompts and watermarked outputs in Table 1 to provide a qualitative sense for the behavior of the test statistic and quality measurement on different kinds of prompts. Additional examples are compiled in Section A.1.

Ironing in the Watermark with Beam Search. Figure 2 (right) shows the tradeoff between watermark strength and accuracy when beam search is used. Beam search has a synergistic interaction with the soft watermarking rule. Particularly when 8 beams are used, the points in Figure 2 form an almost vertical line, showing very little perplexity cost to achieve strong watermarking.

Watermark Strength vs Number of Tokens. Theory predicts that the type I and type II error rates of the watermark should decay to zero as the sequence length $T$ increases. Figure 3 shows the strength of the watermark, measured using the average $z$ -score over samples, as $T$ sweeps from 2 to 200. Curves are shown for various values of $\delta$ and $\gamma$ . The left two charts use multinomial sampling, while the right chart uses 8-way beam search and $\gamma=.25$ . Once again, we see the power of the beam search in achieving high green list ratios; even for the moderate bias of $\delta=2$ , an average $z$ -score greater than 5 is achieved for as few as 35 tokens.

Performance and Sensitivity for Multinomial Sampling. To show the sensitivity of the resulting hypothesis test based on the observed $z$-scores, we provide a table of error rates for various watermarking parameters in Table 2. We also sweep a range of thresholds in ROC charts in Figure 4. We further report detection performance and error rates for various cutoffs in Appendix C, and provide a comparison between empirical $z$-scores and theoretical predictions. Note that no type-I (false positive) errors were observed for any run shown in the error tables (see the columns of $0.0$’s).

Attacking the watermark

Like any software tool, care must be taken when implementing a watermark and watermark detector so that security is maintained. Otherwise, an adversarial user may modify text to add red list tokens, and thus avoid detection. In many cases, simple attacks can be avoided by properly normalizing text before hashes are computed. We now discuss a range of attacks that we are currently aware of, and methods to mitigate them. We assume a threat model in which an attacker must create watermark-free text using a combination of a private watermarked model and other public models, but the public models are much weaker than the watermarked model. We only consider attacks that maintain text of quality similar to the raw private model.

Three types of attacks are possible. Text insertion attacks add additional tokens after generation that may be in the red list and may alter the red list computation of downstream tokens. Text deletion removes tokens from the generated text, potentially removing tokens in the green list and modifying downstream red lists. This attack increases the monetary costs of generation, as the attacker is “wasting” tokens, and may reduce text quality due to effectively decreased LM context width. Text substitution swaps one token with another, potentially introducing one red list token, and possibly causing downstream red listing. This attack can be automated through dictionary or LM substitution, but may reduce the quality of the generated text.

Below we catalog a range of attacks that fall into these categories.

Paraphrasing Attacks. A baseline substitution attack is manual paraphrasing by the human attacker. This attack is technically outside the threat model we are interested in, as it requires extensive human intervention. Note that, especially on longer text fragments such as essays, a few sentences that are partially or not at all paraphrased can be sufficient to trigger watermark detection at a statistically significant threshold.

A more scalable version of this attack is to use automated paraphrasing. An attacker that has access to a public language model can use this model to rephrase the output of the generated model. We provide an experimental evaluation of this attack in Section 7.1. Here, it is crucial to note the trade-off that an attacker is making: The attacker is using a weaker paraphrasing model to modify the text, reducing both watermark strength and text fluency. If the attacker had an equally strong language model at hand, there would be no need to use the watermarked API; the attacker could generate their own text.

Discreet Alterations. An attacker could make small alterations, adding additional whitespaces, or misspelling a few words to impact the computation of the hash. A well-constructed watermark should normalize text to ignore explicit whitespaces when computing the hash. Changing the spelling of many words is likely to severely degrade the quality of text. When implemented carefully, surface level alterations should not pose a serious threat to a watermark.

Tokenization Attacks. An attacker can modify text so that sub-word tokenization of a subsequent word changes. For example (again with BPE), if the text fragment life.\nVerrilius is modified to life. Verrilius (i.e., \n is replaced), then the tokenization of the succeeding word also switches from V_err_ili_us to Ver_r_ili_us. This results in more red list tokens than one would expect from a single insertion. The attack can contribute to the effectiveness of a more powerful attack, but most tokens in a default sentence will not be vulnerable.

Homoglyph and Zero-Width Attacks. This is a special case of the discreet alteration attack. The effect of tokenization attacks can be multiplied through homoglyph attacks (Gabrilovich & Gontmakher, 2002). Homoglyph attacks are based on the fact that Unicode characters are not unique, with multiple Unicode IDs resolving to the same (or a very similar-looking) letter. This breaks tokenization; for example, the word Lighthouse (two tokens) expands to 9 different tokens if i and s are replaced with their equivalent Cyrillic Unicode characters. Security against homoglyph and tokenization attacks can be maintained using input normalization before the text is tested for watermarks, for example via canonicalization as in Helfrich & Neff (2012). Otherwise, simple replacements of characters with their homoglyphs could break enough tokens to remove the watermark.

Likewise, there are zero-width joiner/non-joiner unicode characters that encode zero-width whitespace and hence are effectively invisible in most languages. Like homoglyphs, these characters must be removed through canonicalization (Pajola & Conti, 2021; Boucher et al., 2022).
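The canonicalization step described above can be illustrated with a minimal sketch. The mapping table `HOMOGLYPHS`, the set `ZERO_WIDTH`, and the function name `canonicalize` are hypothetical and deliberately tiny; they are not the normalization used in our experiments, and a practical detector would need a far more complete confusables table.

```python
import unicodedata

# Hypothetical, deliberately small homoglyph table (Cyrillic -> Latin).
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u0456": "i",  # Cyrillic small letter Byelorussian-Ukrainian i
    "\u043e": "o",  # Cyrillic small letter o
    "\u0455": "s",  # Cyrillic small letter dze
}

# Zero-width space, non-joiner, joiner, word joiner, zero-width no-break space.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def canonicalize(text: str) -> str:
    """Normalize text before tokenizing it for watermark detection."""
    # NFKC folds compatibility forms; it does not map Cyrillic to Latin,
    # so the explicit table is still required afterwards.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        if ch in ZERO_WIDTH:
            continue  # drop invisible characters entirely
        out.append(HOMOGLYPHS.get(ch, ch))
    return "".join(out)
```

Note that a table like this would also rewrite legitimate Cyrillic text, so in practice the mapping would have to be restricted, for example to characters that appear inside otherwise Latin-script words.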

Generative Attacks. Generative attacks abuse the capability of large language models for in-context learning, and prompt the model to change its output in a predictable and easily reversible way. For example, the Emoji attack of Goodside (2023) proceeds by prompting the model to generate an emoji after every token, see Figure 5, left. These emojis can be removed, randomizing the red list for subsequent tokens. More broadly, all attacks that prompt the model to change its output “language” in a predictable way can potentially cause this, for example prompting the model to replace all letters a with e, see Figure 5, right. Or, as a reverse homoglyph attack, prompting the model to “switch the letter i with i”, where the second i is a Cyrillic letter.

These attacks are the strongest tools against watermarking to our knowledge, but also require a strong LM with the capacity to follow the prompted rule without a loss in output quality. Additionally, this increases the cost of text generation by requiring more tokens than usual to be generated and reducing effective context width.

A defense against these attacks is to include negative examples of such prompts during finetuning, training the model to reject these requests. Note that instruction finetuning is already common (for example in ChatGPT) for other categories of malicious prompts, using reinforcement learning protocols (RLHF) (Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022).

We study a realistic black-box attack by attempting to remove the presence of the watermark by replacing spans in the original output text using another language model. We treat the watermark algorithm as if it is private, mocking seclusion behind an API. The attacker does not have access to the locations of green list tokens and instead tries to modify the text through token replacement at random indices until a certain word replacement budget, $\varepsilon$ , is reached. The budget constraint maintains a level of semantic similarity between the original watermarked text and the attacked text; otherwise the “utility” of the original text for its intended task may be lost. Also, each span replacement in the attack is performed via inference using a multi-million parameter language model. While this is roughly a third the size of the target model, it means that the attack incurs an associated cost per step, implying that a base level of efficiency with respect to model calls would be desired in practice.

In our experiment, we adopt T5-Large (Raffel et al., 2020) as the replacement model and iteratively select and replace tokens until the attacker either reaches the budget, or no more suitable replacement candidates are returned.

Details of the T5 Span Attack. We tokenize the watermarked text using the T5 tokenizer. Then, until $\varepsilon T$ successful replacements have been performed or a maximal iteration count is reached, we repeat the following steps:

Randomly replace one word from the tokenization with a mask token.

Pass the region of text surrounding the mask token to T5 to obtain a list of $k=20$ candidate replacement token sequences via a $50$ -way beam search, with associated scores corresponding to their likelihood.

Each candidate is decoded into a string. If one of the $k$ candidates returned by the model is not equal to the original string corresponding to the masked span, then the attack succeeds, and the span is replaced with the new text.
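The control flow of the steps above can be sketched as follows. This is a simplified illustration, not the code used in our experiments: the callable `propose` is a hypothetical stand-in for the T5 beam-search call, the text is treated as a list of words, and the names `span_attack`, `eps`, and `max_iters` are chosen for this example only.

```python
import random
from typing import Callable, List


def span_attack(
    words: List[str],
    propose: Callable[[List[str], int], List[str]],
    eps: float,
    max_iters: int = 1000,
    seed: int = 0,
) -> List[str]:
    """Replace words at random positions until the budget eps * T is spent.

    `propose(words, i)` is a hypothetical stand-in for the replacement
    model: it returns candidate strings for position i given its context.
    """
    rng = random.Random(seed)
    attacked = list(words)
    budget = int(eps * len(words))
    successes, iters = 0, 0
    while successes < budget and iters < max_iters:
        iters += 1
        i = rng.randrange(len(attacked))  # position to mask
        original = attacked[i]
        # The step succeeds if some candidate differs from the masked span.
        for cand in propose(attacked, i):
            if cand != original:
                attacked[i] = cand
                successes += 1
                break
    return attacked
```

In this sketch a position may be selected more than once; whether repeated positions should count against the budget is a design choice that the description above leaves open.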

After attacking a set of $500$ sequences of length $T=200\pm 5$ tokens this way, we compute updated $z$ -scores and tabulate error rates (Table 8 in the Appendix). We also generate ROC plots for a range of $\varepsilon$ budgets. While this attack is effective at increasing the number of red list tokens in the text, as shown in Figure 6, we only measure a decrease in watermark strength of $0.01$ AUC when $\varepsilon=0.1$ . While the watermark removal is more successful at a larger budget of $0.3$ , the average PPL of attacked sequences increases by $3\times$ in addition to requiring more model calls.

Related Work

The idea of watermarking, defined as unseen modifications to data that hide identifying information, has a long history. However, watermarking of digital text has been considered challenging in the past, due to its discrete nature (Katzenbeisser & Petitcolas, 2000). Watermarking is considered easier for continuous-valued data, where watermarks can be encoded with a variety of well-studied strategies (Petitcolas et al., 1999; Zhu et al., 2018; Lu et al., 2021; Boenisch, 2021).

In the following, we note that watermarking, as a method that encodes enough information to identify the source of a text fragment, is strictly a subset of steganography, the task of embedding arbitrary hidden information into data.

Watermarking Natural Language. Early approaches to watermarking natural text in digital form in Atallah et al. (2001; 2003) pose a similar problem with similar desiderata as in our setting, except targeted towards classical models. Given a string of text $s$ , Atallah et al. (2001) propose to generate text $s^{\prime}$ with the properties that $s^{\prime}$ has similar meaning, contains a watermark with an extremely small false-positive rate that is not readable by a party without knowledge of the secret key that generated the watermark, is hard to remove through editing of $s^{\prime}$ and is further detectable without knowledge of $s$ (or the scheme generating $s$ ). The actual steganography scheme described therein is limited by its rule-based understanding of natural text to modifications of parsed syntactic tree structures. Finally, the watermark can be read by reconstructing the tree structure, with the chance of a false-positive for a watermark of $w$ bits vanishing quickly at $2^{-w}$ .

Rule-based watermarks were further developed in a series of works (Chiang et al., 2004; Topkara et al., 2006a; b; Meral et al., 2009; Venugopal et al., 2011) with variants also embedding watermarks based on synonym tables instead of only parse trees. Early developments were summarized in Jalil & Mirza (2009), but strong watermarks significantly degraded the text quality due to the limited flexibility of language models at the time.

While approaches via hierarchical language models in Wilson et al. (2014) still required human interactions, the emergence of modern neural language models (Vaswani et al., 2017; Devlin et al., 2019) also allowed for improved watermarking/steganography (Fang et al., 2017; Ziegler et al., 2019; Dai & Cai, 2019; He et al., 2022a; b). Fang et al. (2017) propose such a neural steganography approach where, to encode a message of $w$ bits, the message is first separated into blocks of length $b$ . Then, the vocabulary $V$ of a language model is partitioned at random into disjoint sets of size $|V|/2^{b}$ . A generative LM can then encode the message by generating only a token from the “allowed” set at each position. However, this hard rule reduces the quality of generated text, boxing the LM into only a small selection of valid tokens at every step. Other approaches, such as Ueoka et al. (2021) use mask-infilling models such as BERT to edit already-generated text for the purpose of steganography. Finally, Abdelnabi & Fritz (2021) design an end-to-end system where both encoding and decoding are handled by text-to-text language models that are trained adversarially.

With similar motivation to our proposal, Kaptchuk et al. (2021) construct a framework that adapts traditional public-key cryptographic steganography specifically for “natural” communication channels like text using generative models. However, their method, Meteor, relies on a synchronized model framework where the sender and receiver agree on a shared generative model used to embed and decode the hidden bits of information being sent.

Recently, Aaronson (2022) announced that he is studying cryptographic approaches to watermarking in collaboration with OpenAI. Their preliminary method is based only on biasing of the LM output, as opposed to complete determination as in Fang et al. (2017). While details are not currently available, the description suggests that hashing of $n$ -gram sequences is involved. We hope to extend our comparison to this work when more information becomes available.

Note that a separate line of work investigates watermarking model parameters themselves. This would not be used to watermark model output (as in this work), but to defend against model stealing (Adi et al., 2018; Boenisch, 2021). Approaches, such as Gu et al. (2022), implant backdoor triggers through a finetuning process to cause biased responses to specific inputs, a behavior detectable at verification time.

In contrast to other currently published works, we want to focus on strategies that are simultaneously minimally restrictive to a language model, leverage the LM's own understanding of natural text, require no usage of the LM to decode the watermark, and can be theoretically analyzed and validated.

Post-hoc Detection. An alternative to watermarking is to develop detection models that perform a post-hoc analysis of machine-generated text, for example using language model features or finetuning existing large language models to behave as detectors (Zellers et al., 2019; Tan et al., 2020), see an overview in Jawahar et al. (2020). These detectors work because LMs still leave detectable signals in generated text. Implementation details, such as sampling strategies, can be reverse-engineered from text (Tay et al., 2020). However, detection approaches are slowly losing ground as LM capabilities increase, for example Gambini et al. (2022) note that a range of detection strategies for GPT-2 already struggle with GPT-3. Further, known detectors are also vulnerable to adversarial attacks that degrade their functionality (Wolff & Wolff, 2022).

While efforts to provide strong detectors continue, as in Tian (2023), ultimately language model progress may make detection infeasible. All post-hoc detection methods require the LM to be significantly biased away from human text in some measurable way, such as low variation in perplexity across sentences (Tian, 2023). Even for current LLMs, this margin might be small. This is already problematic, as detection schemes that operate within this small margin are susceptible to labeling human text as false-positive, a concern that is especially pressing for people who produce unusual text, such as non-native speakers, and people who use computer tools to assist them in writing. Such populations might be especially at risk for false-positives, which could lead to academic problems if these detectors are used in schools (Butoi, 2023).

The watermarking scheme we propose is designed so that false positives are statistically improbable, regardless of the writing patterns of any given human.

Conclusion

The presented watermark has a number of nice properties that make it a practical choice: the watermark is computationally simple to verify without access to the underlying model, false positive detections are statistically improbable, and the watermark degrades gracefully under attack. Further, the proposed scheme can be retro-fitted to any existing model that generates text via sampling from a next token distribution, without retraining. Note, however, that careful implementation and instruction tuning against generative attacks may be required for very large models.

There is one more important property of the proposed method that we have not discussed: The $z$ -statistic used to detect the watermark depends only on the green list size parameter $\gamma$ and the hash function for generating green lists. There is no dependence on $\delta$ or any other factor related to how the green list is enforced. For this reason, one can deploy the watermark using context-specific $\delta$ choices or green list enforcement rules for different kinds of text (e.g., prose vs code, or small vs large models) while using the same downstream watermark detector. One can also change a proprietary implementation of the watermarked sampling algorithm without any need to change the detector. Finally, the watermarking method could be turned on only in certain contexts, for example when a specific user seems to exhibit suspicious behavior.
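To make the independence from $\delta$ concrete, the following sketch computes the detection statistic $z=(|s|_{G}-\gamma T)/\sqrt{T\gamma(1-\gamma)}$ from token IDs alone. The hashing rule (SHA-256 of a key and the previous token ID), the key `b"demo-key"`, and the function names are assumptions made for this illustration; they are not the hash function of our implementation.

```python
import hashlib
import math
import random
from typing import List, Set


def green_list(prev_token: int, vocab_size: int, gamma: float,
               key: bytes = b"demo-key") -> Set[int]:
    """Hypothetical hashing rule: seed an RNG from the previous token."""
    digest = hashlib.sha256(key + prev_token.to_bytes(8, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def z_score(tokens: List[int], vocab_size: int, gamma: float) -> float:
    """z-statistic for the number of green tokens; delta never appears."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score")
    T = len(tokens) - 1  # the first token has no predecessor to hash
    green = sum(
        tokens[t] in green_list(tokens[t - 1], vocab_size, gamma)
        for t in range(1, len(tokens))
    )
    return (green - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
```

The detector takes only `gamma`, the vocabulary size, and the hashing rule as inputs, so the sampling side can change how strongly green tokens are promoted without any change to this function.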

There are still a number of remaining open questions regarding watermarking. For example, what kind of robust hashing rules are possible, and when are these rules provably optimal? What is the best way to test for the watermark in a streaming context, or in a context where a short span of watermarked text lives inside a longer non-watermarked span? Are there simple sensitivity bounds that are more accurate than those presented above for large $\delta$ and small $\gamma$ ? We hope our present results are enough to convince readers that watermarks could be a practical tool for combating malicious uses of generative models, and we leave these additional questions for future research.

Acknowledgements

This work was made possible by the ONR MURI program, DARPA GARD (HR00112020007), the Office of Naval Research (N000142112557), and the AFOSR MURI program. Commercial support was provided by Capital One Bank, the Amazon Research Award program, and Open Philanthropy. Further support was provided by the National Science Foundation (IIS-2212182), and by the NSF TRAILS Institute (2229885).

References

Appendix A Experimental Details

We provide a series of representative outputs from different ranges in the sample space for model generations under a soft watermark with parameters $\delta=2.0,\gamma=0.5$ under the multinomial sampling scheme. To tabulate these outputs, the $\sim 500$ generations collected at this setting are either sorted by the average spike entropy of the watermarked model’s output distribution at generation time, or the measured test statistic, the $z$ -score for that sequence. The top and bottom $5$ samples according to these orderings are shown for both entropy (Table 4 and Table 3) and $z$ -score (Table 6 and Table 5).

A.2 Measuring Perplexity: Oracle Language Model

To compute perplexity, the larger, Oracle Language Model is fed the original prompt as input, as described in the main body, and perplexity is computed via taking the exponential of the average token-wise loss according to the oracle’s next token distribution at every output index. Note that loss is computed for only the generated tokens produced by either a watermarked or non-watermarked model.
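The computation can be sketched as below, assuming the oracle's natural-log probabilities for every token of the prompt plus generation are already available as a list. The function name `perplexity` and its arguments are hypothetical and only illustrate the masking of prompt tokens.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float], num_prompt_tokens: int) -> float:
    """exp of the mean negative log-likelihood over generated tokens only.

    `token_logprobs[i]` is assumed to be the oracle's natural-log
    probability of token i given all preceding tokens.
    """
    generated = token_logprobs[num_prompt_tokens:]  # skip the prompt
    if not generated:
        raise ValueError("no generated tokens to score")
    return math.exp(-sum(generated) / len(generated))
```

For example, if the oracle assigns probability $0.5$ to every generated token, the perplexity is $2$ regardless of the probabilities assigned to the prompt tokens.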

Appendix B Detailed Threat Model

For completeness, we formally define the threat model for the attacks discussed in Section 7 here. As described, attacks may occur when malicious users operate bots/sock-puppets on social media, try to fool a CAPTCHA, or complete an academic assignment (Foltýnek et al., 2019). In this work we formally define adversarial behavior as all efforts by a party using machine-generated text to remove the watermark. It is ultimately important to remember that we describe a watermark only on the tokens of the generated text, i.e. on its form and style, and not on its semantic content. For example, a completely new essay written based on an outline or initial draft provided by a LM could not be detected. Such semantic watermarks may be possible, but we do not study this setting here.

We assume two parties, a model owner providing a text generation API, and an attacker attempting to remove the watermark from the API output. The attacker moves second, and is aware that the API contains a watermark. In public mode, the attacker is aware of all details of the hashing scheme and initial seed. In private mode, the attacker is aware of the watermark implementation, e.g. Algorithm 3, but has no knowledge of the key of the pseudo-random function $F$ . The attacker attempts to reduce the number of green-listed occurrences in the text, reducing the $z$ -score computed by a defender. In public mode, any party can evaluate the watermark. In private mode, only the model owner can evaluate the watermark and provides a text detection API. We assume that this API is rate-limited. We assume the attacker has access to other non-watermarked language models, but these models are weaker than the API under attack. The attacker is allowed to modify the generated text in any way.

Note that removing the watermark is always a trivial task if language model quality is disregarded – one can simply replace the entire text with random characters. For this reason only attacks that result in a reasonable language quality trade-off for the attacker are relevant. A defense is hence also successful if any watermark removal by the attacker reduces the quality of generated text to that of generated text achievable using a public model.

Appendix C Detection Accuracy of Multinomial Sampling

When a multinomial sampler is used (which is assumed by Theorem 4.2), we use the softmax output with standard temperature hyperparameter temp=0.7. We analyze the alignment between the empirical strength of the watermark and the theoretical lower bound for $\gamma=.5$ in Figure 7. We find that the theoretical bound is quite tight for smaller values of $\delta,$ but the theorem under-estimates watermark sensitivity for larger $\delta.$

ROC curves for multinomial sampling, and greedy decoding with 8-way beam search in the 200 token case are depicted in Figure 8 and Figure 9 (Subsets of Figure 4 from the main work.). Tables with error rates and accuracy numbers at selected $z$ values are provided in Table 2.

Appendix D Minor Variations

A company might also apply multiple watermarks to generated text, taking the union of all red lists at each token. This is a compromise in terms of watermark effectiveness, compared to a single watermark, however it allows additional flexibility. A company could run a public/private watermarking scheme, giving the public access to one of the watermarks to provide transparency and independent verification that text was machine-generated. At the same time, the company can keep the second watermark private and test text against both watermarks, to verify cases reported by the public watermark, or again to provide a stronger detection API. Such a setup would be especially effective in detecting whether an attack took place that attempted to remove the public watermark.
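A minimal sketch of the union-of-red-lists construction is given below. The hashing rule, the key values, and the function names are assumptions for illustration only. With independent keys, the tokens that remain green under every watermark form the intersection of the individual green lists, which is one way to see why effectiveness is traded for flexibility.

```python
import hashlib
import random
from typing import Iterable, Set


def red_list(prev_token: int, vocab_size: int, gamma: float, key: bytes) -> Set[int]:
    """Hypothetical keyed partition: everything outside the green list is red."""
    digest = hashlib.sha256(key + prev_token.to_bytes(8, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    green = set(rng.sample(range(vocab_size), int(gamma * vocab_size)))
    return set(range(vocab_size)) - green


def combined_red_list(prev_token: int, vocab_size: int, gamma: float,
                      keys: Iterable[bytes]) -> Set[int]:
    """Union of the red lists of all watermarks applied at this position."""
    red: Set[int] = set()
    for key in keys:
        red |= red_list(prev_token, vocab_size, gamma, key)
    return red
```

Each watermark can still be tested separately with its own key, since a token outside the combined red list is green under every individual scheme.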

Selective Watermarks in response to malicious activity

Watermarks could also be used selectively. An API owner could turn on watermarking (or dial up its strength considerably via increased $\delta$ ) only when faced with suspicious API usage by some accounts, for example if a request appears to be part of malicious activity like creating synthetic tweets. This would give more leeway to benign API usages, but allow for improved tracing of malicious API utilization.

Discovering A Watermarking Scheme

So far we assumed that an attacker is aware that a watermark is present. Could the attacker discover this fact only by analyzing generated text? For a hard watermark, this would be easy: Some combinations of tokens will never be generated by the model, no matter how strongly they are prompted. Yet, for a soft watermark (especially with small $\delta$ ), that depends on, e.g. $h=10$ tokens via Algorithm 3, this becomes harder. The attacker would need to distinguish the modification of green list logits via $\delta$ from naturally occurring biases of the LM.

Appendix E Proof of Theorem 4.2

We begin our proof with a useful lemma. Using the spike entropy, we can predict how often a watermarked language model will spit out a green list token. When the entropy is high, the language model has a lot of freedom and we expect the model to use green list tokens aggressively. When the entropy is low, the model is more constrained and it is more likely to use a red list token.

Suppose a language model produces a raw (pre-watermark) probability vector $p\in(0,1)^{N}$ . Randomly partition $p$ into a green list of size $\gamma N$ and a red list of size $(1-\gamma)N$ for some $\gamma\in(0,1).$ Form the corresponding watermarked distribution by boosting the green list logits by $\delta$ , as in Equation (4). Define $\alpha=\exp(\delta).$

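Sample a token index $k$ from the watermarked distribution. The probability that the token is sampled from the green list is at least

$$\mathbb{P}[k\in G]\geq\frac{\gamma\alpha}{1+(\alpha-1)\gamma}\,S\left(p,\frac{(1-\gamma)(\alpha-1)}{1+(\alpha-1)\gamma}\right),$$

where $S(p,z)=\sum_{k}\frac{p_{k}}{1+zp_{k}}$ is the spike entropy of $p$ with modulus $z$ .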

When we add $\delta$ to the logits corresponding to the green list words, we increase their probabilities of being sampled. We replace the raw probability $p_{k}$ for each green list word with the enlarged probability

$$p_{k}^{w}=\frac{\alpha p_{k}}{\sum_{i\in R}p_{i}+\alpha\sum_{i\in G}p_{i}},$$

where $G$ is the set of green list indices and $R$ is the complementary set of red list indices. We denote the sizes of these sets as $N_{G}$ and $N_{R},$ respectively.

We begin our proof by bounding the size of a randomly chosen green list probability after it has been enlarged. Consider the following process for creating the lists. First, choose a random entry $p_{k}$ and place it in the green list. Then, randomly sample the remaining entries in the green list. The expected value of a randomly chosen probability from the green list can be written

where the inner expectation is over uniformly random green/red partitions that satisfy $k\in G.$

Now let’s bound the inner expectation on the right. Consider the helper function

In the last step we used the fact that the numerator is larger than the denominator, and so adding $\alpha$ to the numerator and denominator results in a small decrease in the bound. Also, note that the fraction on the right side of (9) is strictly greater than 1 for any value of $p_{k}\in(0,1)$ and $\alpha\geq 1$ . For this reason the bound is never vacuous, as $f_{k}(p)>p_{k}$ .

Now let $\gamma=N_{G}/N.$ This simplifies the notation of our intermediate result to

Using this expression to simplify (4) we get

The probability of sampling a token from the green list is exactly $N_{G}$ times larger than an average green list probability. The probability of sampling from the green list is thus given by

It can be observed that the bound in Lemma E.1 is never vacuous: the probability of choosing a token from the green list is trivially at least $\gamma,$ and for any combination of finite logits the bound in Lemma E.1 is strictly greater than this trivial lower bound. See the proof for a discussion of why.
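The lemma can be checked numerically on a small vocabulary by enumerating every green/red partition. The sketch below assumes the bound takes the form $\frac{\gamma\alpha}{1+(\alpha-1)\gamma}S\big(p,\frac{(1-\gamma)(\alpha-1)}{1+(\alpha-1)\gamma}\big)$ with spike entropy $S(p,z)=\sum_{k}p_{k}/(1+zp_{k})$ as defined in the main text; the function names and the example distribution are hypothetical.

```python
import itertools
import math
from typing import Sequence


def spike_entropy(p: Sequence[float], z: float) -> float:
    """Spike entropy S(p, z) = sum_k p_k / (1 + z * p_k)."""
    return sum(pk / (1.0 + z * pk) for pk in p)


def lemma_bound(p: Sequence[float], gamma: float, delta: float) -> float:
    """Lower bound on the probability of sampling a green token."""
    alpha = math.exp(delta)
    denom = 1.0 + (alpha - 1.0) * gamma
    z = (1.0 - gamma) * (alpha - 1.0) / denom
    return gamma * alpha / denom * spike_entropy(p, z)


def exact_green_probability(p: Sequence[float], n_green: int, delta: float) -> float:
    """Average green-token probability over all partitions of the given size."""
    alpha = math.exp(delta)
    n = len(p)
    total, count = 0.0, 0
    for green in itertools.combinations(range(n), n_green):
        g = sum(p[i] for i in green)                          # raw green mass
        r = sum(p[i] for i in range(n) if i not in green)     # raw red mass
        total += alpha * g / (r + alpha * g)                  # boosted green mass
        count += 1
    return total / count
```

For a six-token distribution with $\gamma=0.5$ and $\delta=2$ , the exact value obtained by enumeration should lie above the bound, and the bound should lie above the trivial value $\gamma$ ; exhaustive enumeration is only feasible for toy vocabularies.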

Using this lemma, it’s now fairly straightforward to prove the main theorem.

Lemma E.1 bounds the probability of a single token being in the green list. To compute the total number of green list tokens in the sequence, we simply sum this bound over all the tokens to get.

where $S^{(t)}$ represents the entropy of the distribution of token $t$ .

To get the variance bound, we begin by noting that the variance of a Bernoulli random variable with success probability $p$ is $p(1-p).$ The number of green list tokens is a sum of independent Bernoulli random variables, each representing one token. These variables are not identically distributed, but rather each has a success probability given by Lemma E.1. The variance of the sum is the sum of the variances, which is

The expectation on the right contains a concave function of $S^{(t)}.$ By Jensen’s inequality, we can pass the expectation inside the function to get

Finally, note that the probability of a token being in the green list is always at least $\gamma,$ regardless of the distribution coming from the language model. Lemma E.1 is never vacuous, and the success probability predicted by the Lemma is always at least $\gamma$ . If $\gamma\geq.5,$ then the variance of each Bernoulli trial is at most the variance of a Bernoulli trial with success probability $\gamma,$ which is given by $\gamma(1-\gamma).$ Plugging this into our bound gives

Appendix F Proof of Proposition 4.3

The probability of sampling token $k$ from the modified distribution is

where $G$ and $R$ are random partitions of the vocabulary indices. We can write this expected value as the sum of a contribution from the case in which $k\in G,$ and one in which $k\in R.$ We get

Appendix G Impact of Watermarking on Model Factuality

A key benefit of the soft watermarking scheme is that it (passively) adapts to the current entropy in the model’s output distribution. If the model is highly confident on its next few token predictions, say those representing a specific named entity, then a soft watermark will not affect those predictions regardless of their factuality or groundedness. On the other hand, if the model is not confident on any particular tokens, then under standard decoding schemes, whether or not the final decoded output is hallucinatory will be a random event, and the watermark has an equal chance of upweighting tokens that result in more factual or more hallucinatory utterances.

To illustrate this, we present a small experiment that isolates this behavior. We take a model with reasonable competency in knowledge-intensive, closed-book question answering and evaluate its performance on the validation set of the TriviaQA question answering dataset (Joshi et al., 2017). Hypothesis: since answers to factoid questions should be short, low entropy sequences, a soft watermark will yield low detection statistics, however, task performance will not degrade much under application of the watermark. The results for this experiment are shown in Table 9 and provide some evidence that in factuality critical generation scenarios, a softly watermarked model is unlikely to deviate that much from its unwatermarked behavior (for better or worse). We observe less than a $4$ point drop in Exact Match performance under the application of a standard soft watermark ( $\gamma,\delta=0.5,2.0$ ) for both Google’s FLAN-UL2 model (Tay et al., 2022) and Huggingface BigScience’s BLOOMZ model (Muennighoff et al., 2022).

However, we note that this particular experimental setup is not a situation where we would actually deploy the watermark or expect it to work very well. Generating 5 to 10 tokens per question under greedy decoding and then testing those tokens for exact correctness is something of a worst-case estimate on the cost of watermarking. In this scenario the prompt is highly constraining and the only things the watermark can do are either nothing, or directly cause the model to deviate from the argmax. Such deviations would be detrimental on any question where the model “knows” the correct answer but isn’t overwhelmingly confident in the token sequence required to represent it (especially the surface form). We leave a more comprehensive study of the impacts of watermarking strategies on the factuality of LLMs in question answering and other knowledge intensive settings to future research.