Counterfactual Memorization in Neural Language Models

Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, Nicholas Carlini

Introduction

Modern neural language models (LMs) have achieved impressive results in generating high quality text (e.g. Brown et al., 2020; Zhang et al., 2022; Chowdhery et al., 2022; OpenAI, 2023) and have led to breakthroughs in many downstream natural language processing tasks (Devlin et al., 2019; Raffel et al., 2020b; Bommasani et al., 2021). The paradigm of taking a single large-scale pre-trained model and fine-tuning it for many tasks motivates the study of these models’ ability to generalize by avoiding memorizing their training data. Moreover, memorization of sensitive user information or copyrighted materials in the training data (Carlini et al., 2020; Vyas et al., 2023; Lee et al., 2023) leads to practical concerns in real world applications.

Previous work on memorization in neural language models demonstrated the ability to extract memorized training data, including sensitive data such as phone numbers and usernames (Carlini et al., 2020; Ziegler, 2021; Carlini et al., 2019; Henderson et al., 2017; Thakkar et al., 2020; Thomas et al., 2020). One issue with these extraction attacks is that they primarily identify “common” and frequently occurring strings in the training set. For example, as shown in the analysis of Lee et al. (2021), near-duplicate training examples, which are very common in standard text corpora, account for a large majority of the memorized content. To filter out such commonly occurring strings from all memorized texts, previous work applied various heuristic rules to distinguish frequently-occurring sequences from memorization of isolated pieces of information.

In this paper, we propose a principled causal perspective to disentangle memorization of common vs rare data, by directly tying a model’s predictions to the presence or absence of individual training examples. We define counterfactual memorization as a measure of the change in a model’s prediction when a particular example is excluded from the training set. Counterfactual memorization accounts for the commonality of an example as removing one instance of a text that is common across multiple documents will have a minor effect on the model’s prediction on that text. The mathematical formulation of counterfactual memorization extends a prior definition of label memorization in classification models (Feldman, 2020) to the context of neural language modeling.

Formally, a training document $x$ is considered counterfactually memorized when the language model predicts $x$ accurately if and only if the model was trained on $x$ . This allows us to construct a procedure to quantitatively measure the memorization of isolated pieces of text, whose sole presence in the training dataset has a large effect on the model’s predictions.

Following Feldman and Zhang (2020), we further extend this definition to counterfactual influence, which measures the influence of a memorized training text sample on another text example. Counterfactual influence allows us to trace the source of information for a model’s predictions, by locating the training example(s) which significantly contributed to it. With these tools, we study memorization across several standard text datasets. Our main contributions are as follows:

We define counterfactual memorization in neural LMs which gives us a principled perspective to distinguish memorization of “rare” and “common” information in neural LMs (Section 3).

We estimate counterfactual memorization on several standard text datasets, and confirm that rare memorized examples exist in all of them. We study common patterns across memorized text and the memorization profiles of individual internet domains (Section 4).

We identify an inverse correlation between the number of duplicates and counterfactual memorization, in contrast with the positive correlation observed under previous definitions of memorization (Section 5).

We extend the definition of counterfactual memorization to counterfactual influence, and study the impact of memorized examples on the test-time prediction of the validation set examples and generated examples (Section 6).

Related Work

Previous work analyzed the memorization of large language models on sensitive information (e.g. phone numbers) in the training data (Carlini et al., 2020; Ziegler, 2021) or synthetically injected “canaries” (Carlini et al., 2019; Henderson et al., 2017; Thakkar et al., 2020; Thomas et al., 2020). However, not all memorized texts are equally interesting: as confirmed in a later study (Lee et al., 2021), near-duplicated training examples are very common in standard text corpora, and those commonly occurring phrases contribute significantly to memorized texts. In order to distinguish “common” memorization of common phrases or public knowledge from “rare” memorization of private, rare information, various heuristics were adopted in previous investigations. Our paper proposes a principled perspective on this problem. Our intuition comes from psychology studies that categorize human (declarative) memory into episodic memory (Tulving, 1983) of specific contents of individual events, and semantic memory (Squire, 1992) about general knowledge like grammar and factual information. We would like the models to obtain semantic memory but avoid episodic memory. To capture the latter, we propose a notion of counterfactual memorization. The mathematical formulation of counterfactual memorization is borrowed from a notion of label memorization in Feldman (2020) and adapted to the context of neural LMs in this paper. This formulation has been studied empirically in the context of computer vision in Feldman and Zhang (2020). In a follow-up work, Ilyas et al. (2022) showed that it is possible to fit a datamodel to predict the outcome of training a model on a specific training subset and evaluating on a specific input. However, this procedure requires training a massive number of models (e.g. 300,000 for CIFAR-10) on random subsets of the training data, and is thus computationally infeasible at the scale of language models considered here.

The general idea of measuring model behavior on held-out training data is common in machine learning. In cross validation, held-out data is used to estimate the test performance for model selection; in learning theory, leave-one-out stability was shown to be deeply connected to generalization (e.g. Mukherjee et al., 2006); in differential privacy, the worst case performance difference of models trained on two “neighboring” datasets (identical except a single example being held-out or replaced) quantifies the privacy guarantee of a learning algorithm (Dwork et al., 2014; Nasr et al., 2021; Jagielski et al., 2020). Most previous work aimed for an overall measurement, while our paper focused on characterizing the behaviors of individual examples.

We estimate a counterfactual influence to study how a memorized training example impacts the model prediction at test time. Influence functions have been used in statistics to assess robust estimators since Hampel (1974). Previous papers adopted them to analyze neural network predictions (Koh and Liang, 2017; Koh et al., 2019). However, the estimation was found to be computationally expensive and fragile (Basu et al., 2021). Pruthi et al. (2020) track the gradient updates during training to estimate the influence from a training example; Feldman (2020) and Feldman and Zhang (2020) use aggregated statistics from multiple models independently trained on held-out data subsets to estimate the influence. Further extensions were shown to work well on detecting mislabeled data in classification problems (Wang and Jia, 2022) and characterizing hallucinations in Neural Machine Translation (Raunak et al., 2021). Alternative methods also looked at simple data statistics (e.g. co-occurrence counts) without model re-training to infer the causal effects on language models’ predictions (Elazar et al., 2022). In this paper, we adapt the approach from Feldman (2020), and formulate counterfactual influence directly with subset sampling, as opposed to leave-one-out influence. We also extend the estimation to assess the influence on generated examples.

The counterfactual is an important notion in statistical causality (Pearl et al., 2000; Rubin, 2005; Pearl, 2009; Imbens and Rubin, 2015), useful for studying causal probabilistic inference under alternative conditions. Such counterfactuals may or may not be directly testable (e.g. a counterfactual treatment in medical studies). In this paper, we directly measure the counterfactual influence of a training example by comparing the behavior of the model trained with and without that example.

Counterfactual Memorization

To quantify memorization of rare details of a specific training document, we define the following notion of counterfactual memorization. The mathematical formulation is borrowed from Feldman (2020), where it was originally proposed to quantify label memorization in multi-class classification problems. We extend it to the context of unsupervised neural language modeling.

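Given a training algorithm $A$ that maps a training dataset $D$ to a trained model $f$ , and a measure $M(f,x)$ of the performance of $f$ on a specific example $x$ , the counterfactual memorization of a training example $x$ in $D$ is given by

$$\operatorname{\mathsf{mem}}(x)\triangleq\mathbb{E}_{S\subset D,\,x\in S}\left[M(A(S),x)\right]-\mathbb{E}_{S^{\prime}\subset D,\,x\not\in S^{\prime}}\left[M(A(S^{\prime}),x)\right]\tag{1}$$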

where $S$ and $S^{\prime}$ are subsets of training examples sampled from $D$ . The expectation is taken with respect to the random sampling of $S$ and $S^{\prime}$ , as well as the randomness in the training algorithm $A$ .

That is, our memorization definition compares the difference between two expected performance measures on a given example $x$ . On one side, we compute the expected performance of a model when trained on datasets that contain the example $x$ , and, on the other side, we compute the expected performance of a model when trained on datasets that do not contain the example $x$ . Throughout this paper we use per-token accuracy as the measure $M$ . In other words, we ask the model to predict the next token based on the groundtruth context (preceding tokens), measure the 0-1 loss of the argmax token prediction, and then average it across all predicted tokens.
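As a rough illustration of the measure $M$ , the following Python sketch computes per-token accuracy from next-token scores. The function names and the list-of-lists logit layout are assumptions made for this example and are not taken from the authors' code.

```python
from typing import List, Sequence


def argmax_predictions(logits: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring token id at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def per_token_accuracy(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Average 0-1 correctness of argmax next-token predictions.

    logits[t] holds hypothetical model scores for position t, computed from
    the ground-truth preceding tokens; targets[t] is the true token id.
    """
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and aligned")
    predictions = argmax_predictions(logits)
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```

In this sketch a document that the model predicts perfectly scores 1.0, and one where every argmax prediction is wrong scores 0.0.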

The expectations in Equation (1) can be empirically estimated via sampling. Specifically, we train $m$ different models on independently sampled subsets $S_{1},\ldots,S_{m}$ of equal size $|S_{i}|=r|D|$ for a fixed $r\in(0,1)$ . We then divide these models into two groups: the first group contains all models trained on subsets $S$ where $x\in S$ ; and the second group contains all models trained on subsets $S$ where $x\not\in S$ . We take the average performance on $x$ in the two groups separately and compute the difference between the two:

$$\widehat{\operatorname{\mathsf{mem}}}(x)\triangleq\operatorname*{mean}_{i:x\in S_{i}}\left[M(A(S_{i}),x)\right]-\operatorname*{mean}_{i:x\not\in S_{i}}\left[M(A(S_{i}),x)\right]\tag{2}$$

This difference quantifies how the presence or absence of the example $x$ in a model’s training set affects the model’s performance on $x$ . If there is a large difference between including an example in the training set versus not including it, then we consider this example counterfactually memorized.

For each $x$ , we refer to models trained with $x$ in the training set ( $\{A(S_{i}):x\in S_{i}\}$ ) as IN models and the models $x$ was not trained on ( $\{A(S_{i}):x\not\in S_{i}\}$ ) as OUT models. Note we do not need to retrain a model for each example $x$ . Instead, we train $m$ models once on random subsets of $D$ , and compute the estimation (Equation 2) for all examples using the same set of $m$ models. Ilyas et al. (2022) recently showed that it may also be possible to directly predict these scores using a regression model, yet this approach is computationally prohibitive for large language models.
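The IN/OUT estimate described above can be sketched in a few lines of Python. The matrix layout (`accuracy[i][j]` for model $i$ on example $j$ , `membership[i][j]` for whether example $j$ was in $S_{i}$ ) and the function name are assumptions for illustration only.

```python
from statistics import fmean
from typing import List, Optional, Sequence


def estimate_memorization(
    accuracy: Sequence[Sequence[float]],
    membership: Sequence[Sequence[bool]],
) -> List[Optional[float]]:
    """Mean IN-model accuracy minus mean OUT-model accuracy, per example.

    Returns None for an example that has no IN models or no OUT models,
    since the difference is undefined in that case.
    """
    n_models = len(accuracy)
    n_examples = len(accuracy[0])
    scores: List[Optional[float]] = []
    for j in range(n_examples):
        in_scores = [accuracy[i][j] for i in range(n_models) if membership[i][j]]
        out_scores = [accuracy[i][j] for i in range(n_models) if not membership[i][j]]
        if not in_scores or not out_scores:
            scores.append(None)
        else:
            scores.append(fmean(in_scores) - fmean(out_scores))
    return scores
```

Because every example reuses the same $m$ trained models, the only per-example work is evaluating each model once and averaging the two groups.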

Analyzing Counterfactual Memorization

We estimate and analyze counterfactual memorization of training examples in three standard text datasets: RealNews (Zellers et al., 2019), C4 (Raffel et al., 2020a) and Wiki40B:en (Guo et al., 2020). Unless otherwise specified, we use Transformer-based language models (Vaswani et al., 2017) equivalent to (decoder only) T5-base (Raffel et al., 2020b) with $\sim$ 112M parameters. To save computation and enable more direct comparisons across datasets, we truncate the training set of each dataset by taking the first $2^{21}$ documents. To estimate counterfactual memorization, we train 400 models for each dataset, each on a random $25\%$ subset of the training examples. In practice, we use a hash-based filtering mechanism to efficiently approximate random subset sampling (details in Appendix G), as the data loading APIs for large text corpora generally support only sequential visits to examples with limited shuffling and subsampling capability within a window.
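One way a hash-based filter could approximate random subset sampling during a sequential pass is sketched below. This is not the procedure from Appendix G; the choice of SHA-256, the key format, and the function name `in_subset` are all assumptions made for illustration.

```python
import hashlib


def in_subset(example_id: str, model_index: int, ratio: float = 0.25) -> bool:
    """Decide deterministically whether an example belongs to model i's subset.

    The (model_index, example_id) pair is hashed to a pseudo-random value in
    [0, 1); the example is kept when that value falls below `ratio`.
    """
    key = f"{model_index}:{example_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < ratio
```

A filter of this kind needs no shuffling buffer and lets membership be recomputed later from the example id alone. Note that it yields subsets of approximately, rather than exactly, $r|D|$ examples.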

We train each model for 60 epochs using the Adam optimizer (Kingma and Ba, 2015) with learning rate 0.1 and weight decay $10^{-5}$ . (Modern language models are usually trained for fewer epochs if the training set is massive. Since we have a smaller subsampled training set, we train the models for more epochs to allow the models to fit the training data sufficiently to study memorization effects.) For C4/RealNews/Wiki40B:en, respectively, our models converge to an average per-token accuracy of 44.21%/47.59%/66.35% on the subsampled training set, and 27.90%/31.09%/49.55% on the validation set. On average, the models start to overfit at around epoch 5, as indicated by the validation accuracy starting to decrease.

Table 1 shows examples from the RealNews training set sampled at various memorization levels. Examples with the highest memorization are generally unconventional text such as all-capital letters, structured formats (e.g., tables or bullet lists), and multilingual texts. After those artificial examples, examples with intermediate-to-high memorization are most often news reports of specific events. One of our main goals is to be able to separate memorization of such examples containing details of specific events from memorization of common facts or highly duplicated template texts. Indeed, templated documents with many near-duplicate copies in the training data generally have low counterfactual memorization. C4 and Wiki40B:en have similar trends. Interestingly, though Wikipedia articles are less likely to be auto-generated from templates than the web in general, we do observe repetitive patterns in low-scoring documents, such as “_START_ARTICLE_ , Virginia _START_PARAGRAPH_ is an unincorporated community in , in the U.S. state of Virginia.”

To visualize the distribution of memorization, we plot 2D histograms in Figure 1, where the x-axis shows the difference of IN-accuracy and OUT-accuracy (i.e. the counterfactual memorization), and the y-axis shows the sum of the two, which we term “simplicity”. A simple example is one that is scored highly regardless of whether a model saw it during training. The histograms are plotted in log scale to better visualize the exponential decay in the tail for high memorization and simplicity levels.

From the 2D density plots, we find that easy examples tend to have low memorization. However, there is no simple linear correlation. Peak memorization occurs for examples of intermediate simplicity. For the hardest examples, the memorization scores are low, because even the IN-models could not learn them well. Many hard examples consist of ill-formatted text or contain foreign languages. As a result, in Wiki40B:en, which contains higher quality texts, the lower bound of the histogram is higher than in the other two datasets (Figure 1). Interestingly, the choice of data has a relatively minor effect on memorization: the shape of the memorization histogram is generally consistent across the three datasets; the range of memorization values is only slightly compressed for Wiki40B:en.

Figure 1 shows the overall distribution of memorization for each training dataset. To obtain a more granular view, we can also analyze the distributions for texts sourced from individual web domains in RealNews and C4, to see whether different data sources display different memorization profiles. Web domains such as news portals, blogs, and forums differ both stylistically and in how much they reference or even copy from other websites. Additionally, some domains are represented much more frequently than others in the datasets we studied. This could lead to considerably different memorization profiles for examples from different domains.

To investigate these effects, we visualize the 95th percentile memorization score in each web domain against the number of examples in that domain for RealNews (Figure 2a) and C4 (Figure 2b). C4 contains many more domain names than RealNews since the latter is collected only from news websites. For both datasets, the domains with a large number of crawled documents show a smaller variance in the 95th percentile values, while “smaller” domains exhibit a wide variety of memorization profiles. The memorization profiles of a few representative domains are visualized in Figures 2c and 2d. The domains we selected for visualization are: the largest domain (blue), the domain with the highest 95th percentile memorization (orange), and two domains that have more than 1000 and 50 articles in RealNews and C4 respectively (green and red).
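The per-domain summary plotted here (domain size against 95th percentile memorization) could be computed along the following lines. The nearest-rank percentile rule and the `(domain, score)` record layout are assumptions for this sketch; the text does not state which percentile method was used.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def domain_profiles(
    records: Iterable[Tuple[str, float]], q: float = 95.0
) -> Dict[str, Tuple[int, float]]:
    """Map each domain to (number of examples, q-th percentile memorization)."""
    by_domain: Dict[str, List[float]] = defaultdict(list)
    for domain, score in records:
        by_domain[domain].append(score)
    return {d: (len(s), percentile(s, q)) for d, s in by_domain.items()}
```

With very few documents in a domain, the 95th percentile is essentially the maximum score, which is one reason small domains can show a much wider spread than large ones.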

In RealNews (Figure 2c), reuters.com contains the largest number of documents but has low memorization scores on average. The domain digitallibrary.un.org, the United Nations Digital Library, has high memorization scores, potentially because it contains many multilingual documents. We have observed that less frequently occurring tokens, like those in foreign languages or ALL-CAPITAL words, tend to cause high memorization. Similarly, flattened structured data (e.g. tabular texts) also deviates significantly from normal English texts and potentially leads to high memorization, as demonstrated by zap2it.com, a website for TV program listings. On the other hand, hotair.com is a news commentary website that frequently quotes other major news articles. This may lead to duplicate text in the dataset, which we suspect contributes to its overall lower memorization distribution.

The observations are similar on C4: blogspot.com contains a large number of documents in the training set with only moderate amounts of memorization; zh.wikipedia.org and buckinghamautos.com.au have high memorization due to foreign (Chinese) or structured (car sales listings) text; and www.unitedstateszipcodes.org has very low memorization scores because common templates are re-used to generate similar pages for individual zip codes.

Number of Models Needed

To evaluate the impact of a single training example, one may wish to train two models that differ only in that single example. In practice, the stochasticity in a single run of common training algorithms (e.g. SGD) produces too low signal-to-noise ratios to be useful for such estimation. Moreover, leave-one-out estimation means a separate pair of models needs to be trained for each training example, which is computationally costly. Therefore, we formulated our estimation in Section 3 by accumulating statistics from $m$ models independently trained on random training subsets. In our experiments, we set $m=400$ . To understand how sensitive our results are to $m$ , we analyze the rankings produced by distinct sets of models of size $m$ . We vary $m$ from 6 to 192, and partition our set of 400 models into up to 10 sets of $m$ models (e.g. for $m=192$ , we construct 2 partitions, and for $m=6$ , we construct 10). We then compute the Spearman’s R between these partitions to measure the agreement between the rankings produced by each partition. If the rankings are very similar (have Spearman’s R close to 1), then this number of models is reliably estimating the true ranking of memorization scores. We plot these Spearman’s R values in Figure 3a. Even at 96 models, this correlation begins to plateau near 1, lending confidence that 400 models is sufficient for reliable estimation of memorization scores. See Appendix D for more analysis on the sensitivity to $m$ .
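The stability check described above can be sketched as follows: split the model indices into disjoint groups of size $m$ , estimate memorization scores from each group, and compare the resulting rankings with Spearman's rank correlation. Assigning contiguous indices to groups and the helper names are assumptions of this sketch.

```python
from statistics import fmean
from typing import List, Sequence


def partition_models(n_models: int, m: int, max_groups: int = 10) -> List[List[int]]:
    """Split model indices into up to `max_groups` disjoint groups of size m."""
    n_groups = min(max_groups, n_models // m)
    return [list(range(g * m, (g + 1) * m)) for g in range(n_groups)]


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = fmean(ra), fmean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5
```

For example, `partition_models(400, 192)` gives 2 groups and `partition_models(400, 6)` gives 10, matching the counts quoted in the text; `spearman` would then be applied to the per-example score vectors estimated from each pair of groups.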

Impact of Number of Training Epochs

As expected, the overall amount of memorization grows consistently with the number of epochs of training (Figure 3b). This makes sense since training for more epochs increases overfitting. As training progresses, we also see an increasingly long tail of examples with high memorization scores. On RealNews, about 59% of examples had consistently increasing memorization scores across all epochs considered. There were no examples whose memorization decreased in a significant way over training (all observed decreases can be attributed either to noise or to instability early in training). Only 0.5% of examples stayed completely un-memorized with scores which never rose above 0.1, while 85% of examples had memorization scores which never rose above 0.2. Figure 3c shows the fraction of memorized examples as training progresses, at several thresholds of memorization. We can see that more training epochs significantly increase memorization.

Duplicate Text and Memorization

One of the goals of evaluating counterfactual memorization is to identify examples that have a low number of duplicates yet whose presence versus absence in the training data has a large effect on the model. Here, we perform a quantitative study of the (anti-)correlation between duplication and counterfactual memorization compared with the positive correlation between duplication and the “generation-time memorization” definitions of memorization used by Lee et al. (2021); Carlini et al. (2022); Kandpal et al. (2022).

Following the method of Lee et al. (2021), we first use MinHash (Broder, 1997) to identify near-duplicate examples in the RealNews training set. We consider an example a duplicate if it has a normalized edit similarity of greater than 0.7 (definition included in Appendix I). Out of 2.01 million examples, $\sim$ 38,000 were identified as being a near-duplicate of at least one other example. Among these frequently-occurring examples, the Pearson correlation between an example’s counterfactual memorization score and the number of near-duplicates for that example is -0.39; in other words, memorization does quantitatively decrease when data is repeated more often.
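For readers without access to Appendix I, the sketch below shows one common way to define a normalized edit similarity: one minus the Levenshtein distance divided by the length of the longer sequence. This definition is an assumption here and may differ from the one used in the paper; the MinHash candidate-generation step is not shown.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_similarity(a: Sequence, b: Sequence) -> float:
    """Assumed definition: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_near_duplicate(a: Sequence, b: Sequence, threshold: float = 0.7) -> bool:
    """Apply the 0.7 similarity threshold quoted in the text."""
    return normalized_edit_similarity(a, b) > threshold
```

The functions accept either strings or token lists; which granularity to use is left open in this sketch.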

In Figure 3d we can see that examples with a large number of near-duplicates have smaller memorization scores. Counterfactual memorization primarily differentiates amongst examples with a small number of duplicates. This makes sense given that examples with lots of near-duplicates would likely have their near-duplicates in the training sets of OUT-models. This is to be contrasted with “generation-time memorization” (discussed in Section A) that measures the textual overlap between model generated texts and the training documents. There, the number of occurrences strongly correlates with the measured memorization (Carlini et al., 2020; Lee et al., 2021; Kandpal et al., 2022). Counterfactual memorization measures a fundamentally different type of memorization from the simple textual matching considered in prior work, providing information about how easy or hard a training example is in the context of the rest of the training set. In Table 1 we can see this effect qualitatively: sequences with near-duplicates in the training set tend to have low counterfactual memorization (as expected).

From Memorization to Influence

Counterfactual memorization identifies training examples that contain rare information not conveyed by other examples. A natural question to ask is whether a model would leak the information in a memorized example during inference. Previous papers study membership inference attacks (Shokri et al., 2017; Sablayrolles et al., 2019; Long et al., 2020) where an attacker tries to figure out if a particular example exists in the training set. In this paper, we consider standard model evaluation without adversarial attackers, and ask: “does seeing a particular training example strongly influence the prediction on a validation example?” Another way of asking this is whether a single example in the training set has a large and over-representative impact on the prediction of a validation example. We answer these questions by measuring counterfactual influence with a formulation adapted from Feldman and Zhang (2020):

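Given a training algorithm $A$ that maps a training set $D$ to a trained model, and a performance measure $M$ , the counterfactual influence of a training example $x\in D$ on another example $x^{\prime}$ is

$$\operatorname{\mathsf{infl}}(x\Rightarrow x^{\prime})\triangleq\mathbb{E}_{S\subset D,\,x\in S}\left[M(A(S),x^{\prime})\right]-\mathbb{E}_{S\subset D,\,x\not\in S}\left[M(A(S),x^{\prime})\right]\tag{3}$$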

where $S$ is a subset of training examples sampled from $D$ . The expectation is taken with respect to the random sampling of $S$ , as well as the randomness in the training algorithm $A$ . Here $x^{\prime}$ can be an example from the validation set or test set, a generated example or a training example.

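An empirical estimation of the influence can be computed similarly to counterfactual memorization by uniformly sampling $m$ subsets $S_{1},\ldots,S_{m}$ from $D$ , where $|S_{i}|=r|D|$ , and calculating

$$\widehat{\operatorname{\mathsf{infl}}}(x\Rightarrow x^{\prime})\triangleq\operatorname*{mean}_{i:x\in S_{i}}\left[M(A(S_{i}),x^{\prime})\right]-\operatorname*{mean}_{i:x\not\in S_{i}}\left[M(A(S_{i}),x^{\prime})\right]\tag{4}$$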

This measures how much a training sample $x$ ’s presence influences the prediction of a different example $x^{\prime}$ . Note, $\operatorname{\mathsf{mem}}(x)=\operatorname{\mathsf{infl}}(x\Rightarrow x)$ , i.e., counterfactual memorization is self influence.
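A minimal Python sketch of this estimate for a single pair $(x,x^{\prime})$ is given below. The argument names are hypothetical: `contains_x[i]` records whether $x\in S_{i}$ , and `accuracy_on_target[i]` is the measured $M(A(S_{i}),x^{\prime})$ .

```python
from statistics import fmean
from typing import Optional, Sequence


def estimate_influence(
    contains_x: Sequence[bool],
    accuracy_on_target: Sequence[float],
) -> Optional[float]:
    """Mean accuracy on x' over models trained with x, minus the mean over
    models trained without x. Returns None if either group is empty."""
    in_scores = [a for a, c in zip(accuracy_on_target, contains_x) if c]
    out_scores = [a for a, c in zip(accuracy_on_target, contains_x) if not c]
    if not in_scores or not out_scores:
        return None
    return fmean(in_scores) - fmean(out_scores)
```

Passing the accuracies measured on $x$ itself as `accuracy_on_target` recovers the memorization estimate, in line with the self-influence identity above.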

Influence on Examples of the Validation Set. With the same models trained for estimating memorization, we can estimate the counterfactual influence on the validation set according to Equation (4). For each example in the validation set, we can estimate the influence on it from each training example. Figure 4a shows the distribution of influence from all training examples on three different examples from the validation set. The green example was randomly chosen and represents the behavior for most validation examples: it receives close-to-zero influence from all the (individual) training examples. The blue and orange examples were sampled to have high and intermediate maximum influence. Each of them has one (or a few) strong influencers from the training set, as indicated by the bars to the right of the histogram. They also receive only tiny influence from all the rest of the training examples, though the variance of influence is larger than for the green example.

Intuitively, most training examples will have small influence on validation set examples because the models learn distributional patterns shared across many training examples, and individual training examples tend to have insignificant influence here. However, a training example $x$ with high counterfactual memorization contains rare information that is not shared with other examples. Therefore, if a validation set example $x^{\prime}$ contains similar information, $\operatorname{\mathsf{infl}}(x\Rightarrow x^{\prime})$ could be large. Figure 4b shows the relationship between memorization and influence by plotting $\operatorname{\mathsf{mem}}(x)$ of each training example $x$ against its maximum influence $\max_{x^{\prime}}\operatorname{\mathsf{infl}}(x\Rightarrow x^{\prime})$ on $x^{\prime}$ across the validation set.

Consistent with our intuition, examples with small memorization scores have small max-influence scores. Larger influence scores on the validation set generally require larger memorization scores of the training example itself. However, not all training examples with large memorization scores lead to large influence scores. In particular, the max-influences drop significantly for examples with memorization larger than 0.4. One potential reason is that many examples with very high memorization are simply low quality text, so memorization is required in order to learn them, but they do not encode anything interesting that could influence a validation example. On the other hand, even if a memorized example encodes some rare and useful information, the max-influence could still be low because the validation set does not contain a relevant document. This is especially true given that all datasets have considerably smaller validation sets than training sets.

Table 2 shows train-validation example pairs from RealNews sampled at different influence value ranges. We found that the train-validation pairs with the highest influence are almost identical, except for some superficial differences, such as different handling of quotation / em dash marks. As we move to intermediate influence ranges, we commonly found reports on the same events. Large paragraphs of identical text indicate that one document might be citing the other or both citing from a third party. At low influence, two types of correlations are commonly observed: 1) templated texts with high similarity, where the reason for a low influence is that there are many similar training examples that split the influence; 2) superficially related documents due to a shared prefix such as ST. CLOUD – This week in our “Behind the Scenes” series on WJON or a shared substring of some common knowledge like FSIS, the Centers for Disease Control and Prevention. Due to the low signal-to-noise ratio, there were no noticeable relationships in the document pairs with influence scores below 0.02.

Influence turns out to be an effective tool for analyzing and attributing the model predictions at test time: for predictions that rely on information obtained by (counterfactual) memorization, we can identify exactly which training example provided such information. Our observation of near-duplicated training-validation document pairs is consistent with recent studies that identify data contamination in large Internet-crawled text corpora (Lee et al., 2021; Dodge et al., 2021).

Influence on Generated Texts. The influence estimation is not restricted to the validation set. We can also estimate influence on generated examples. In this section, we evaluate on the publicly released generations from the Grover models (Zellers et al., 2019) trained on RealNews. Specifically, we take the generations from Grover-Mega (p=0.96), a 1.5-billion-parameter model trained on the RealNews dataset. Compared with the train-validation influence in Figure 4b, the histogram (cf. Figure 10 in the Appendix) decays faster as max-influence grows. Moreover, the value range of max-influence is only half as large. The reason that we did not find many highly influenced generated examples is twofold: 1) There are only 24,576 generations in the public release, which is much fewer than the validation examples. As a result, the corresponding examples of many memorized training examples do not get sampled in the generations. For comparison, previous work (Carlini et al., 2020; Lee et al., 2021) generated 100,000+ examples to identify memorization in generation. These approaches also count duplicates in the training set, which counterfactual memorization filters out. 2) The Grover model was trained on the full RealNews training set, while we have restricted our analysis to the first 2M training examples. There could potentially be more high-influence training examples that are missed in our calculation.

Summary and Discussion

We studied memorization in neural language models. We formulated a notion of counterfactual memorization as a tool that can systematically ignore “common” memorization such as general knowledge (e.g. “Paris is a city in France”) and capture memorization of rare, specific information (e.g. a description of a specific event) present in the training examples. We conducted experiments on three text corpora commonly used in language modeling and found memorization in all of them. We further analyzed the per-domain memorization profiles for Internet-crawled data, and found that different sources could have substantially different memorization profiles.

Furthermore, we analyzed how memorized training examples could impact the model predictions at test time via counterfactual influence. We found that for examples from both the validation set and the model generated texts, the model predictions could be drastically different depending on the presence or absence of a particular training example with high memorization.

This study mainly focuses on English datasets. While we expect the characterization of memorization to be similar when evaluated on corpora of other (natural) languages, new patterns might be observed on multilingual data or more structured domains such as programming languages.

Both the neural language models and training sets used in this work are orders of magnitude smaller than modern standards such as GPT-3 (Brown et al., 2020), GPT-4 (OpenAI, 2023) and PaLM-2 (Google, 2023). Moreover, we only conducted a preliminary investigation of the dynamics of counterfactual memorization during training. Although our experiments effectively estimated and detected memorization, we suspect more interesting examples might emerge if larger, more capable models are analyzed. For example, currently when the information from a memorized training example is leaked in the prediction of a strongly influenced test example, it can usually be explained by a high text overlap between the training and test examples. For models with a deeper understanding of language, we suspect that strong influence could be observed even between documents that have no direct text overlap but that encode similar semantic information.

In order to test this, it will be necessary to scale our framework to larger models and datasets. Moreover, it will be necessary to construct datasets where semantically similar but textually different document pairs exist. One potential source for constructing such datasets would be versioned Wikipedia articles: two versions of the same article with a large time span or edit distance may contain semantically similar (but paraphrased) information. Such a dataset of paraphrased text pairs would be more broadly useful for understanding the ability of different models to disentangle text content and form, by measuring the influence of one piece of text on a paraphrased piece of text.

Counterfactual memorization enables us to identify examples whose presence or absence has a large impact on the model and the model’s ability to score and generate other text. The privacy risk of this analysis is low, since performing it requires access to the dataset and the ability to train models.

Acknowledgments.

The authors would like to thank Samy Bengio, Christopher A. Choquette-Choo, Ethan Dyer, Michael C. Mozer, Behnam Neyshabur, Andrew Nystrom, and Hanie Sedghi for constructive discussions and feedback. The authors would like to thank Andrew Nystrom for assistance with MinHash-based near-duplicate detection.

References

Appendix A Difference Between Counterfactual and Generation-Time Memorization

Many definitions of memorization operate at generation-time: a sequence of generated text is marked as memorized if a sufficient amount of overlap is found in the training dataset [Carlini et al., 2020]. When the training data is not available, heuristic-based methods comparing language model perplexities are used to predict whether a generation contains memorized content [Carlini et al., 2019, Thakkar et al., 2020, Thomas et al., 2020, Carlini et al., 2020, Zanella-Béguelin et al., 2020]. One difficulty with these approaches is that generation-time instances of memorization are strongly correlated with the number of similar or near-duplicate examples in the training set. As observed in Lee et al. [2021], large clusters of near-duplicated examples do exist in common language datasets, dominating memorization detected in generated text. Generation-time methods for measuring memorization are forced to design heuristics to avoid simply identifying these uninteresting instances of memorization.

In contrast, the counterfactual memorization we study in this paper handles the issue of near-duplicates automatically without the need for heuristics. For a training example $x$ with many near-duplicate copies in the training set, $\operatorname{\mathsf{mem}}(x)$ will be small (because other samples $x^{\prime}\approx x$ will be present in the training dataset whether or not $x$ is). This does not mean that counterfactual memorization is the opposite of generation-time memorization. An example $x$ with high $\operatorname{\mathsf{mem}}(x)$ may have a high chance of being generated if a model is appropriately prompted, despite and possibly because it is rare, and thus the example is considered memorized by both definitions. In summary, generation-time memorization measures the chance a model will directly copy from training examples, while counterfactual memorization aims to discover rare information that is memorized.

Appendix B Average Accuracy of IN models vs OUT models

Figure 5 compares the per-token accuracy between the IN models and OUT models for the training examples from three different datasets. Counterfactual memorization is estimated by taking the difference between the average IN-accuracy and the average OUT-accuracy. Thus, the examples closer to the upper left corner are more counterfactually memorized, while the examples near the diagonal are not.
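
The estimate described above can be sketched in a few lines of NumPy. This is only an illustration, not the code used for the experiments: the function name and the array layout (one row per trained model, one column per training example) are assumptions made here for clarity.

```python
import numpy as np


def counterfactual_memorization(acc, in_mask):
    """Estimate mem(x) as mean IN accuracy minus mean OUT accuracy.

    acc:     (n_models, n_examples) per-token accuracy of each model on
             each training example (hypothetical layout).
    in_mask: (n_models, n_examples) boolean, True where the example was
             part of that model's training subset.
    """
    acc = np.asarray(acc, dtype=np.float64)
    in_mask = np.asarray(in_mask, dtype=bool)
    n_in = in_mask.sum(axis=0)
    n_out = (~in_mask).sum(axis=0)
    if np.any(n_in == 0) or np.any(n_out == 0):
        raise ValueError("every example needs at least one IN and one OUT model")
    # Average accuracy over the models that did / did not train on each example.
    in_mean = np.where(in_mask, acc, 0.0).sum(axis=0) / n_in
    out_mean = np.where(~in_mask, acc, 0.0).sum(axis=0) / n_out
    return in_mean - out_mean
```

An example whose accuracy is high only for the models that trained on it receives a large score, while an example that every model predicts equally well (for instance because near-duplicates remain in the training set) receives a score near zero.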

Appendix C The Impact of Data Deduplication on Memorization

To investigate the impact of data deduplication on counterfactual memorization, we compared C4 with C4-NearDup [Lee et al., 2021], which is derived from C4 with deduplication using approximate document matching. Figure 7 compares the distribution of memorization between the original C4 and the deduplicated dataset. We did not find a significant difference between the two datasets. One potential reason is that the deduplication criterion was relatively conservative, removing only $\sim 3\%$ of the training examples. In fact, we can still easily see near-duplicate examples in C4-NearDup among examples with low memorization, as shown below:

link $\rhd$ This is a placeholder page for Joshua Baldridge, which means this person is not currently on this site. We do suggest using the tools below to find Joshua Baldridge. You are visiting the placeholder page for Joshua Baldridge. This page is here because someone used our placeholder utility to look for Joshua Baldridge. We created this page automatically in hopes Joshua Baldridge would find it. If you are not Joshua Baldridge, but are an alumni of Brecksville Broadview Heights High School, register on this site for free now.

link $\rhd$ This is a placeholder page for Laytoya Brannon, which means this person is not currently on this site. We do suggest using the tools below to find Laytoya Brannon. You are visiting the placeholder page for Laytoya Brannon. This page is here because someone used our placeholder utility to look for Laytoya Brannon. We created this page automatically in hopes Laytoya Brannon would find it. If you are not Laytoya Brannon, but are an alumni of Mainland High School, register on this site for free now.

link $\rhd$ This is a placeholder page for Devin Mcguire, which means this person is not currently on this site. We do suggest using the tools below to find Devin Mcguire. You are visiting the placeholder page for Devin Mcguire. This page is here because someone used our placeholder utility to look for Devin Mcguire. We created this page automatically in hopes Devin Mcguire would find it. If you are not Devin Mcguire, but are an alumni of Kankakee Valley High School, register on this site for free now.

link $\rhd$ This is a placeholder page for Anthony Christie, which means this person is not currently on this site. We do suggest using the tools below to find Anthony Christie. You are visiting the placeholder page for Anthony Christie. This page is here because someone used our placeholder utility to look for Anthony Christie. We created this page automatically in hopes Anthony Christie would find it. If you are not Anthony Christie, but are an alumni of Old Bridge High School, register on this site for free now.

Measurements of the edit distances show that they are near the boundary of the deduplication threshold chosen in Lee et al. [2021]. On the other hand, the tail of the distribution (examples with high counterfactual memorization) is mostly unaffected by text deduplication.

Appendix D Variance of Memorization Scores

In Figure 8, we measure the Spearman’s R between the memorization scores estimated from our total set of 400 models and those estimated from an $m$-model subset. As expected, as $m$ increases, so does Spearman’s R. In particular, at 192 models, the Spearman’s R is at least 99.2% for all datasets, and increasing $m$ further already appears to have diminishing returns.

Using the same partitioning into size-$m$ sets of models, we analyze the variance of memorization scores assigned to each sample. To do this, within each partition, we compute the memorization score assigned to each sample. We then compute, for each sample, the standard deviation of its memorization scores across all partitions. In Figure 9, we plot each sample’s standard deviation; together, these show the distribution of the variance of memorization scores. We find that the variance decreases substantially as $m$ grows, and concentrates near 0 already with $m=192$, for all datasets.
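
The two stability checks in this appendix can be sketched as follows. The helper names and the array layout are assumptions of this illustration, and the rank correlation below does not average tied ranks, so it matches Spearman's coefficient only when scores are distinct.

```python
import numpy as np


def mem_scores(acc, in_mask):
    """Mean IN accuracy minus mean OUT accuracy, per example."""
    acc = np.asarray(acc, dtype=np.float64)
    in_mask = np.asarray(in_mask, dtype=bool)
    in_mean = (acc * in_mask).sum(axis=0) / in_mask.sum(axis=0)
    out_mean = (acc * ~in_mask).sum(axis=0) / (~in_mask).sum(axis=0)
    return in_mean - out_mean


def partition_std(acc, in_mask, m):
    """Per-example std of scores across disjoint groups of m models."""
    acc = np.asarray(acc, dtype=np.float64)
    in_mask = np.asarray(in_mask, dtype=bool)
    n_parts = acc.shape[0] // m  # leftover models are dropped
    scores = np.stack([
        mem_scores(acc[k * m:(k + 1) * m], in_mask[k * m:(k + 1) * m])
        for k in range(n_parts)
    ])
    return scores.std(axis=0)


def rank_correlation(a, b):
    """Pearson correlation of ranks (Spearman's coefficient without ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])
```

With real data, `rank_correlation(mem_scores(acc[:m], in_mask[:m]), mem_scores(acc, in_mask))` would give the kind of subset-versus-full comparison described above; each group must contain at least one IN and one OUT model per example for the scores to be defined.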

Appendix E Histogram of Max-Influence on Generated Texts

Figure 10 shows the histogram of max-influence on each generated example by Grover-Mega (p=0.96) [Zellers et al., 2019], from the RealNews training examples. Those generated examples are publicly released at https://github.com/rowanz/grover/tree/master/generation_examples.

Appendix F Miscellaneous Experiment Details

Our experiments are implemented using JAX [Bradbury et al., 2018] and Flax [Heek et al., 2020], both open-source libraries under the Apache-2.0 license. In the study of influence on generated texts, we use the publicly released generations from the Grover models [Zellers et al., 2019], available at their open-source code repository, under the Apache-2.0 license.

We ran the experiments on our internal cluster. The majority of the compute is consumed by model training. In this paper, we use a standard training setup for transformer-based neural language models, which can run on single-node machines with one or multiple GPUs. However, to carry out the full analysis, we need to train 400 different models for each of the three datasets analyzed in this paper.

Appendix G Subsampling Procedure

In the estimation of memorization and influence, we trained 400 models, each on an independent random subset of training examples. We use TensorFlow Datasets (TFDS, https://www.tensorflow.org/datasets) to load our training data. TFDS supports loading a continuous range of examples, but does not support subset loading from a list of indices of individual examples. The API has a filter function which allows us to provide a TensorFlow predicate to precisely control the subset loading. However, a naive implementation that checks whether the index of the current example is in a given list of subset indices is very slow and scales poorly with the subset size.

To mitigate the issue, we implemented a hash-based subset sampling predicate that can be evaluated efficiently for each example and (approximately) selects a random subset of a specified size. Let $N$ be the total number of training examples and $n<N$ be the expected subset size. The idea is to map the index $i$ of each example to one of $N/n$ hash buckets, and select all the examples that fall into one particular bucket. To make sure each model gets an independent subset sample, we need to use different hash functions for different models. In our implementation, we compose a known hash function for uint64 types with a simple pseudo-random number based on the index of the current model to achieve this. Note that the sampled subset size is close to $n$ but is not guaranteed to be exactly $n$; this is not a problem in our setting. The specific implementation is shown below:

    import functools
    import operator

    import numpy as np
    import tensorflow as tf


    def hash_sampler(mod, seed, system):
        """Get hash based subset sampler.

        Args:
          mod: total_n_egs // subset_size
          seed: different seed leads to different subset sample
          system: 'np' or 'tf'.

        Returns:
          A Tensorflow or Numpy subset sampler.
        """
        np_hash = hash_uint64_builder('np')
        # Derive a per-seed multiplier, offset, and target bucket.
        mul, offset, remainder = np_hash(seed + 1234 + np.arange(3))
        remainder = remainder % mod

        if system == 'np':
            def np_sampler(n_total):
                # Boolean mask over indices 0..n_total-1; True means selected.
                x = np.arange(n_total, dtype=np.uint64)
                return np_hash(x*mul + offset) % mod == remainder
            return np_sampler
        elif system == 'tf':
            tf_hash = hash_uint64_builder('tf')
            def tf_filter(idx, _):
                # idx is the example index; the second argument is ignored.
                return tf.equal(tf_hash(idx*mul + offset) % mod, remainder)
            return tf_filter
        raise KeyError(f'Unknown system: {system}')


    def hash_uint64_builder(system):
        """Build a hash function in tf/np for uint64."""
        if system == 'np':
            uint64_cast = functools.partial(np.array, dtype=np.uint64)
            op_xor = operator.xor
            op_rshift = operator.rshift
        elif system == 'tf':
            uint64_cast = functools.partial(tf.cast, dtype=tf.uint64)
            op_xor = tf.bitwise.bitwise_xor
            op_rshift = tf.bitwise.right_shift
        else:
            raise KeyError(f'Unknown system: {system}')

        # https://stackoverflow.com/questions/664014/
        # what-integer-hash-function-are-good-that-accepts-an-integer-hash-key
        def hash_uint64(x):
            x = uint64_cast(x)
            x = op_xor(x, op_rshift(x, 30)) * uint64_cast(0xbf58476d1ce4e5b9)
            x = op_xor(x, op_rshift(x, 27)) * uint64_cast(0x94d049bb133111eb)
            x = op_xor(x, op_rshift(x, 31))
            return x

        return hash_uint64

In Figure 11, we compare our hash-based subset sampler with numpy.random.choice(N, size=n, replace=False). The leftmost section of the figure shows that the sampling procedure always samples close to $n$ points, with a small variance. The middle section plots a histogram of the empirical fraction of total models that each point appears in. Note that, because we use $r=0.25$, this fraction should be 0.25 on average, although, because we only use 400 models, each value will not be identically 0.25.
We find that our hash-based sampler produces probabilities which are highly consistent with those produced by numpy.random.choice. We also measure the pairwise independence of the hash-based sampler, measuring the probability that two different training points $x_{1},x_{2}$ appear both IN or OUT of a model’s training set. We expect this value to be 0.625 (= $r^{2}+(1-r)^{2}$ ). We plot this in the right portion of the figure, demonstrating that the independence of our hash-based sampler is very similar to numpy.random.choice.
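
The quantities compared in Figure 11 can be computed from any boolean membership matrix. The sketch below builds such a matrix with NumPy's reference sampler (via the `Generator.choice` API) and is not the hash-based sampler itself; the helper names and the small problem sizes are assumptions chosen for illustration.

```python
import numpy as np


def sample_membership(n_models, n_total, n_subset, seed=0):
    """Reference sampler: each model draws n_subset indices without replacement."""
    rng = np.random.default_rng(seed)
    member = np.zeros((n_models, n_total), dtype=bool)
    for k in range(n_models):
        member[k, rng.choice(n_total, size=n_subset, replace=False)] = True
    return member


def inclusion_fraction(member):
    """Fraction of models whose subset contains each point."""
    return member.mean(axis=0)


def mean_pairwise_agreement(member):
    """Average, over pairs of distinct points, of the fraction of models
    in which both points are IN or both are OUT."""
    m = member.astype(np.float64)
    n_models, n_total = m.shape
    agree = (m.T @ m + (1.0 - m).T @ (1.0 - m)) / n_models
    off_diagonal = agree.sum() - np.trace(agree)
    return off_diagonal / (n_total * (n_total - 1))


def expected_agreement(r):
    """Agreement probability for two independently sampled points."""
    return r ** 2 + (1.0 - r) ** 2
```

For example, `mean_pairwise_agreement(sample_membership(400, 200, 50))` should land close to `expected_agreement(0.25)`. The same two statistics could be computed on a membership matrix produced by a hash-based sampler for a side-by-side comparison.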

Appendix H Alternative Memorization Metrics with Logit Scaling

We defined counterfactual memorization in (1) with a generic performance measure $M$. Throughout the paper, we define $M$ as per-token accuracy, the fraction of the times the model assigns the highest score to the true next token in the sequence. The finite value range $[0,1]$ could cause unnecessary compression for values near the interval boundary. As a result, the resolution of memorization estimation is lower for models with very high or very low performance. To mitigate this issue, we explore an alternative measure by taking the logit of the per-token accuracy [Carlini et al., 2021]. The logit function maps $[0,1]$ to $(-\infty,\infty)$ and is applied before aggregating across independently trained models. Figure 12 compares the scatter plots of average performance on IN / OUT models measured by the logit-scaled per-token accuracy and the raw per-token accuracy. Compared to the raw per-token accuracy, the scatter plots generated with the logit-scaled measure are no longer artificially constrained to a triangular shape. As a result, the memorization estimation, which is proportional to the distance to the diagonal line, has a higher resolution on the two ends (lower left and upper right) than the unscaled version.

Note that there is no absolutely right or wrong measure. While the scaled version has better resolution on the two ends, the advantage of the unscaled version is that the value range $[0,1]$ makes it straightforward to interpret the numerical values of counterfactual memorization. Since the consistency between the two versions is high (the Spearman’s $\rho$ correlations between the two versions are 0.947 / 0.903 / 0.944 on RealNews / C4 / Wiki40B:en), we use the unscaled version throughout the paper for easier interpretation.
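
A small sketch may help illustrate why logit scaling increases resolution near the boundary. The clipping constant `eps`, which keeps the logit finite when an accuracy is exactly 0 or 1, is an assumption of this sketch and is not specified in the text above; the function names are also illustrative.

```python
import numpy as np


def logit(p, eps=1e-6):
    """Logit of an accuracy in [0, 1]; clipping by eps is an assumption."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def logit_scaled_memorization(in_acc, out_acc):
    """Mean logit accuracy over IN models minus mean over OUT models.

    The logit is applied to each model's accuracy before averaging.
    """
    return float(np.mean(logit(in_acc)) - np.mean(logit(out_acc)))
```

With raw accuracy, an IN/OUT gap of 0.99 versus 0.95 and a gap of 0.52 versus 0.48 both yield a difference of 0.04. After logit scaling, the first gap is considerably larger than the second, which is the added resolution at the ends of the range.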

Appendix I Definition of Edit Similarity

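We define the edit similarity between two sequences $x_{i}$ and $x_{j}$ as $\operatorname{EditSim}(x_{i},x_{j})=1-\frac{\operatorname{EditDistance}(x_{i},x_{j})}{\max(|x_{i}|,|x_{j}|)}$. In our case, we use token-level similarity.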
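
As an illustration, token-level edit similarity can be sketched as below. The normalization by the longer sequence length (one minus the Levenshtein distance divided by the maximum length) is assumed here, following the convention commonly used in the deduplication literature. Whitespace splitting stands in for the actual tokenizer, which is not specified in this section.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (of tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 - distance / max length (assumed normalization); 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

For instance, the first sentences of two of the placeholder pages from Appendix C differ only in the two name tokens, so `edit_similarity("This is a placeholder page for Joshua Baldridge".split(), "This is a placeholder page for Laytoya Brannon".split())` evaluates to 0.75.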

Appendix J Examples of Train-Generation Pairs at Different Influence Ranges

In Table 3, we show examples of train-generation pairs sampled from different influence ranges. The patterns generally follow the train-validation pairs shown in Table 2, although many of the relations are due to some form of templating.

Appendix K Examples Sampled at Different Levels of Memorization

Figure 13, Figure 14, and Figure 15 show full examples from RealNews sampled at high, middle and low memorization value ranges, respectively. Similarly, Figure 16, Figure 17, and Figure 18 show examples from C4 sampled at high, middle and low memorization value ranges, respectively. Figure 19, Figure 20, and Figure 21 show examples from Wiki40B:en sampled at high, middle and low memorization value ranges, respectively.

Appendix L Example Pairs Sampled at Different Levels of Influence

Figure 22, Figure 23, Figure 24, Figure 25, and Figure 26 show train-validation example pairs from RealNews sampled from high to low influence ranges. For each pair, we show the validation set example first, and then show the corresponding training example together with a difflib-generated visualization of the textual difference between the two.
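
A token-level diff of this kind can be produced with Python's standard difflib module. The sketch below is one possible rendering, not the visualization used for the figures: the `[-...-]` / `{+...+}` markup, the whitespace tokenization, and the function name are assumptions made for illustration.

```python
import difflib


def token_diff(reference_text, other_text):
    """Render a token-level diff: [-removed-] and {+added+} spans."""
    a = reference_text.split()
    b = other_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(" ".join(a[i1:i2]))
            continue
        if i2 > i1:  # tokens present only in the reference text
            parts.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:  # tokens present only in the other text
            parts.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(parts)
```

Applied to a pair of templated texts, the shared template appears unchanged and only the differing spans are marked, which makes it easy to see how much of a high-influence pair is verbatim overlap.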

Similarly, Figure 27 and Figure 28 show train-validation example pairs from C4, and Figure 29 and Figure 30 from Wiki40B:en.

We also show train-generation influence pairs between RealNews training set and Grover [Zellers et al., 2019] model generation in Figure 31, Figure 32, and Figure 33.