Measuring Forgetting of Memorized Training Examples

Matthew Jagielski, Om Thakkar, Florian Tramèr, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Chiyuan Zhang

Introduction

Machine learning models are capable of memorizing information contained in their training data . This is one of the reasons why models are vulnerable to privacy attacks such as membership inference and training data extraction . Resulting privacy concerns have led to a variety of techniques for private machine learning, including differentially private training , machine unlearning , and various heuristics like regularization , data augmentation or gradient clipping . These techniques all make modifications to the learning procedure so as to actively limit privacy leakage, including leakage that results from memorization. Instead, we observe that the training dynamics inherent to learning algorithms such as stochastic gradient descent may passively afford some forms of privacy. Such dynamics include forgetting: during iterative training, as models see new training examples, they could lose track of the specifics of earlier examples—as prominently seen by research on catastrophic forgetting .

In this paper, we study to what extent the forgetting exhibited by machine learning models has an impact on memorization and privacy. Our work is focused at distinguishing between two overarching hypotheses for how privacy interacts with forgetting. The first is that memorization is a stronger effect than forgetting: traces of early training examples will still be detectable long after the examples are seen in training, perhaps due to their strong influence on the initial decisions made during the optimization process. The second hypothesis is that forgetting is stronger: early examples will be forgotten due to the many subsequent updates to the model weights as training progresses.

Studying the impact of forgetting on privacy is most relevant when there is a large variation in how frequently an example may be seen during training. Indeed, models are increasingly trained on extremely large training sets, so that training consists of only a few epochs (or even a single one). Such settings are used when training large image models , multimodal models and language models , the latter of which have come under significant scrutiny due to privacy concerns . Similarly, when a model is being fine tuned, the data that was originally used to pretrain the model is no longer seen in the second stage of training. Fine tuning is also an ubiquitous technique in many domains, especially in language , speech , and vision tasks.

We design a methodology for measuring whether and how quickly individual training examples are forgotten. Our methodology relies on state-of-the-art membership inference attacks, the best known method for testing whether a given point was used in training. We use our methodology to show that, for deep neural networks trained on either vision or speech tasks, those examples that are used early in training (and not repeated later on) are indeed forgotten by the model. We identify a number of factors which impact the speed at which forgetting happens, such as when examples appear in training, or whether they are duplicated. We then attempt to understand why forgetting happens by showcasing two settings where forgetting does not happen in the worst-case: non-convex models, such as kk-means, and deterministic training algorithms. Our result on kk-means is the first instance of privacy leakage due to non-convexity we are aware of. However, on a mean estimation task, we prove that forgetting does happen as a result of the stochasticity of gradient descent, with similar properties to our empirical results. By using our approach to measuring forgetting, we hope experts training models on large datasets or fine tuning models can determine whether (and how much) forgetting has empirical privacy benefits for their training pipelines. We stress that our approach is complimentary to frameworks that offer worst-case guarantees, like differential privacy, and that it should not be used in lieu of reasoning about privacy within such frameworks.

Background and Related Work

There are multiple valid privacy guarantees that have been considered for ML algorithms. First, differential privacy ensures that the distribution of the output of the algorithm does not significantly change when a single example is changed. In the context of ML, differential privacy can be obtained through modifying either the training algorithm or the inference algorithm . Differential privacy provably bounds the success of privacy attacks which leak information about individual training examples (see Section 2.2).

More recently, other motivations for privacy have gathered interest. For instance, in machine unlearning , a user may issue a “deletion request”, after which their individual contribution to the model must be erased. This is different from differential privacy, which requires that the model not learn too much about any of its training examples. Algorithms for machine unlearning have been proposed for kk-means , empirical risk minimization , and deep learning . If a point is perfectly unlearned, privacy attacks that target individual examples cannot succeed on this point.

Both of these definitions of privacy—differential privacy and unlearning—obtain privacy actively: they require that the training algorithm be modified to obtain privacy. Instead, we propose to capture privacy that is gained passively from dynamics that are inherent to training. We define and measure forgetting as a form of privacy that arises from the decay in how easy it is to extract information about an individual training point over the course of training.

Our definition of forgetting is inspired by the widely observed phenomenon of catastrophic forgetting , where a model tends to forget previously learned knowledge when training on new data. More specifically, catastrophic forgetting is generally formulated in the continual learning setting where the model sequentially learns a number of different tasks, and the performance on previously learned tasks drops significantly as the model learns a new task. In contrast, our work considers a model trained to solve a single fixed task and measures how it forgets some of its training examples seen earlier in training. Our work is also inspired by Feldman et al. , who show theoretically that iterative, differentially-private algorithms can exhibit forgetting. This means that the provable bounds on privacy leakage are better for samples seen earlier than those seen later. Since provable privacy bounds are often loose , it remains to be seen whether whether these examples also exhibit empirically better privacy. We investigate this question here.

2 Attacks against Privacy in Machine Learning

To do this, we use attacks that target the privacy of training examples. Because differential privacy is established as the canonical framework to reason about privacy, these attacks were introduced in the context of differentially private ML. However, what is important to our work is that they target individual training examples—rather than differential privacy. We consider two types of privacy attack: membership inference and training data extraction. While other attacks exist, such as attribute inference or property inference , these attacks tend to capture properties of many examples or even the entire training set, whereas membership inference and training data extraction are designed to measure the privacy of few or a single example.

In both membership inference and training data extraction, the adversary uses access to the model, possibly throughout training, to extract sensitive information contained in individual examples from the training set. Both of these privacy attacks have been shown to perform better when models are updated repeatedly, and when these repeated updates are all released to the adversary . In our work, we only release the model at the end of all training, so repeated updates do not leak information.

In membership inference , an adversary infers whether or not a target example was contained in a model’s training set. Most techniques for membership inference predict if an example is in the training dataset by thresholding the loss on the query example: if the loss on an example is low, the example is likely training data, and if the loss is high, the example is likely not in the training dataset. While early techniques applied the same threshold to all examples , current state-of-the-art membership inference attacks choose a carefully calibrated threshold for each example.

Calibration is designed to adjust for the high variance in the “hardness” of learning different examples; many examples will have low loss regardless of whether they are included in training, and some examples will continue to have high loss even when they are included in training. To determine a calibrated threshold for examples, attacks typically compare the logits of models trained without the target example to the logits of models trained with the target example. In our experiments, we calibrate membership inference and measure attack performance both in terms of accuracy and true positive rate (TPR) at a fixed false positive rate (FPR). Both of these measurements are bounded by differential privacy .

In training data extraction , the adversary wants to recover training data from the model. One controlled experiment to measure extraction risk is canary extraction . In canary extraction, mm well-formatted canaries {si}i=1m\{s_{i}\}_{i=1}^{m} are injected into a model’s training set, chosen uniformly at random from some larger universe of secret canaries S\mathcal{S}. The adversary’s goal is to guess which of the S\mathcal{S} canaries was in fact inserted. Designing the universe of secrets is domain-dependent, so we defer this discussion to the experiments section. Canary extraction’s success is measured with exposure, which roughly computes the reduced entropy in guessing the secret:

3 Related Work on Forgetting

In concurrent work, Tirumala et al. show that large language models forget examples seen early in training. However, their definition of forgetting only captures a specific form of privacy leakage: memorization which can be identified when the model reproduces training examples exactly. Instead, our work captures a more general form of privacy leakage because we consider (a) stronger privacy attacks and (b) multiple strategies to measure worst-case forgetting of the training examples. To achieve this, our work leverages recent progress in auditing of privacy in ML . We also perform our experiments on models and datasets representing a wider variety of data domains. Additionally, we investigate analyze the theoretical limits of forgetting, and propose an explanation for why forgetting happens.

Another empirical exploration of forgetting is found in Graves et al. . The work differs in two ways. Forgetting is explored in the context of machine unlearning (i.e., to determine whether forgetting is sufficient to avoid the need to unlearn altogether). Second, they find that continual retraining is too slow to allow for forgetting, but their findings hold for small datasets only. Instead, we identify forgetting as especially relevant for large model training. We also measure precisely how long forgetting takes for such large datasets in a variety of domains, and find that it happens quickly enough to be relevant to privacy.

In another related work, Hyland and Tople used stochasticity to improve differentially private machine learning for models trained on small datasets. In this paper, we identify stochasticity in training as a potential cause for forgetting.

Forgetting

Rather than study forgetting through the lens of the model’s accuracy on a specific example—as done in the catastrophic forgetting literature—we use an instance-specific notion of forgetting: we attempt to detect a specific example’s presence in training. In catastrophic forgetting, the model’s accuracy degrades on an entire sub-distribution. This does not necessarily have implications for forgetting of memorized information contained in individual examples. It is possible that, despite a decrease in accuracy, the model still carries detectable traces of individual examples, which is still harmful for privacy. It is also possible that a high accuracy model, which has not yet forgotten the sub-distribution, generalizes well to the sub-distribution rather than memorizing the specifics of the training set. Thus, our definition of forgetting instead asks a more privacy-motivated question: whether it is possible to detect or extract an example in the training set. Hence, we define forgetting by drawing from the literature on attacking ML models to find privacy violations.

We measure the rate of forgetting by evaluating the success rate of a privacy attack.

A training algorithm T\mathcal{T} is said to (A,α\mathcal{A},\alpha, kk)-forget a training example zz if, kk steps after zz is last used in T\mathcal{T}, a privacy attack A\mathcal{A} achieves no higher than success rate α\alpha.

For example, consider the case where we let A\mathcal{A} be a MI attack, with success rate measured by the accuracy of the attack, and we set α=50%\alpha=50\% as a random guessing baseline. Then an example is (A,α,k)(\mathcal{A},\alpha,k)-forgotten if kk steps after training on that example, A\mathcal{A} cannot distinguish between the case where the model was, or was not, trained on that example.

This captures the intuition that a model has “forgotten” a training point if it becomes difficult to detect whether or not that point was used in training. Because we are interested in measuring privacy risk, our definition allows examples which were never memorized (i.e., A\mathcal{A} never performs well) to be forgotten immediately (i.e., k=0k=0). However, we will focus our analysis on vulnerable examples and state-of-the-art privacy attacks where examples are memorized after being used to train T\mathcal{T}.

Differentially private systems must satisfy a stronger requirement than what we’ve given: all attacks must perform poorly even k=0k=0 steps after an example is seen. Additionally, machine unlearning is an even stronger requirement: all possible privacy attacks must fail (where forgetting allows some small success rate: α\alpha). Our notion of forgetting uses known attacks, capturing the common strategy of “empirically testing” privacy guarantees . Additionally, while machine unlearning is an intentional act, forgetting may apply without intervention after some number of steps kk.

2 Measuring Forgetting

A straightforward way to measure how much a model forgets is to run a MI attack on samples seen at various training steps. However, this only permits an average-case privacy analysis, measuring forgetting for typical examples. Because we want to understand forgetting from a privacy perspective, we instead design a procedure to attempt to measure an algorithm’s worst-case forgetting.

We adapt privacy auditing strategies which consist of two components: (1) constructing a small set of examples DpD_{p} (not contained in the standard training set DD) which the model will be forced to memorize in order to classify correctly, and then (2) running a privacy test A\mathcal{A} to determine whether or not these examples were included during training. Constructing these training examples is domain-dependent, and so we will discuss precisely how we instantiate them for our experiments in Section 4.

Our procedure for measuring forgetting trains two models: one model is trained only on DD, and one is trained on both DD and DpD_{p}. After training for some number of steps, each model continues training on DD, never training on DpD_{p} again. After starting this second stage of training only on DD, we continually measure how well the privacy attack A\mathcal{A} performs at predicting which model received the injected samples and which did not. By design, this procedure measures (A,α,k)(\mathcal{A},\alpha,k)-forgetting of DpD_{p}, providing an attack success rate α\alpha for any number of steps kk after removing DpD_{p}.

We consider two ways to instantiate this procedure, which incorporate DpD_{p} into training in two distinct ways. The first, called Poison, mimics a fine-tuning setting. The second, called Inject, is used to measuring forgetting in extremely large training sets. The strategies differ only in how they use DpD_{p}, and we provide algorithms for both in Appendix A.

Poison. Poison adds DpD_{p} into training, and trains on the augmented dataset DDpD\cup D_{p}. After TIT_{I} steps, the poisoned examples are removed, and training resumes on the unpoisoned training set DD. This is best suited for small training sets, where shuffling data does not heavily impact the position of training examples. This approach reflects fine-tuning, where many passes are made over a pre-training dataset containing sensitive examples, but fine-tuning is done on a disjoint dataset.

Inject. Inject trains a model on the unpoisoned dataset DD up until exactly step TIT_{I}, where it updates with DpD_{p}, and then resumes training on DD thereafter. This approach is preferable when making a small number of passes over a large dataset (as is common in training large language models and speech models), where shuffling can heavily impact the position in training. Inject differs from Poison in that the target examples are not included immediately at the beginning of training. To reflect the impact of duplicated training data, or to conservatively estimate forgetting, we can also inject DpD_{p} multiple times during training.

Empirically Measuring Forgetting

In this section, we empirically measure whether ML models forget samples over the course of training for large neural network models (and if they do, how quickly do they do so).

We investigate forgetting across a variety of datasets and models. We use canary extraction to evaluate generative models (i.e., for LibriSpeech and LMs) and MI for classification models (i.e., for ImageNet). We provide more detailed experiment setup in Appendix B.

ImageNet. We train ResNet-50 models for 90 epochs (each epoch is roughly 5,000 steps). To test forgetting, we mainly use the Inject strategy, due to the large number of steps made in a single ImageNet epoch. We compare with Poison in Appendix C.1. We start from a fixed checkpoint (with no injected data), and fine-tune this checkpoint for both the injected and baseline models, and perform MI to distinguish the injected predictions from the baseline predictions. We calibrate logit scores based on those produced by the fixed checkpoint, representing a powerful adversary who knows an intermediate update before the inclusion of sensitive data. We report results averaged over three trials.

LibriSpeech. We use state-of-the-art Conformer (L) architecture and the training method from to train models over LibriSpeech. We use the Poison strategy, as we train on LibriSpeech for many epochs. We generate 320 unique canaries and insert them with various repeat counts. We use loss calibration with 11 reference models.

Neural Language Models. We train decoder-only, 110M parameter, Transformer-based language models (with the T5 codebase and architecture from ) for one epoch over a version of C4 () that had been deduplicated with MinHash (). We use the Inject strategy due to the large training set size and add a single batch of 4096 canaries consisting of random tokens, with between 1 and 100 duplicates for each unique canary.

2 Forgetting Happens In Practice

In Figure 1, we find that the datasets and models we consider exhibit forgetting. Starting with Figure 1(a), we train a model on the ImageNet training dataset and then, following the Inject approach, train the model on a single minibatch containing the poisoned examples repeated 10 times at epoch 50. (The red vertical line here corresponds to the timestep where injection occurs.) We then plot the MI accuracy and precision as a function of time, both before injection and also for ten epochs after. We observe that MI is impossible until the injection step, after which precision remains very high for many epochs, taking 10 epochs to decay to roughly 65%, which is permitted by DPWe compute ε\varepsilon as ln(TPR/FPR)\ln(\text{TPR}/\text{FPR}) , using the heuristic. with ε0.6\varepsilon\approx 0.6. This demonstrates that forgetting does occur for this setting.

Figures 1(b-c) repeat this experiment for two other datasets: LibriSpeech and C4, where we use canary extraction for both. In Figure 1(b) we plot results from LibriSpeech with the Poison strategy, with canaries repeated 20 times. We find that the canary exposure decays rapidly after canaries are removed at 50,000 steps, decaying from 14.0 to 5.6 in the first 10,000 steps, which indicates the canaries become 330×\times harder to guess on average. After another 40,000 steps, the exposure drops to roughly 2, only slightly higher than the baseline exposure for canaries not seen during training. In Figure 1(c), we show that, for a variety of repeat counts, LMs forget canaries. They reach an exposure of 2, and decay over the course of at most 10,000 steps to the baseline “random guessing” level of 1. In Appendix C.1, we show that, on ImageNet, Poison and Inject give similar results.

We note that in Figure 1, we selected representative parameter settings, but we find that the same forgetting behavior qualitatively holds over all settings we tried. However, these parameters can impact, quantitatively, how quickly forgetting happens.

3 Investigating Forgetting

We investigate the impact of various parameters on the speed of forgetting. We investigate 1) how many times a point is repeated in training/how hard the point is to learn, and 2) how late the example is seen in training, 3) model size, and 4) optimizer parameters. While repetitions and hardness are the parameters that we find to be most important, each of the other parameters also has some impact. We analyze these parameters in depth on ImageNet, and a subset of them on LibriSpeech and C4.

Repetitions/Hardness. Privacy attacks are not equally effective on all examples. Large language models memorize examples which are duplicated many times in their training sets . Among examples which are not duplicated, there exist “outliers”, which are more vulnerable to attacks . Here, we show that these findings also hold in forgetting: outlier data is forgotten more slowly than inlier data, and data duplicated many times is forgotten more slowly than data duplicated few times. We use our forgetting definition to measure this speed, as it reflects the privacy risk to examples (although, especially for exposure, this definition does depend on the initial success rate of the attack).

We present our results in Figures 2 and 1(c). Our findings corroborate those made in prior work. Repetitions have a very clear impact on forgetting: examples repeated more times both start out with a much higher attack success, and the models take much longer to forget them, as well. On ImageNet, attack precision takes roughly 20,000 training steps to drop below 70% with 10 repetitions, which is higher than 1 repetition 200 steps after injection. On LibriSpeech and C4, canary exposure has a similar trend, where heavily repeated canaries have higher exposures, and decrease to a given exposure more slowly. We also validate on ImageNet that hard examples (test points with high loss on a fully-trained model) are forgotten more slowly than easy examples, in Figure 2(c). These results reinforce the necessity of considering worst-case samples when measuring privacy.

Learning Rate Decay and Other Parameters. On ImageNet, we experiment with recency, learning rate, momentum, and model size, and find these mostly have little effect. The limited effect of model size on forgetting is somewhat surprising—and possibly due to our smallest tested model being large enough for significant memorization to occur. Due to their small effect, we discuss these parameters in Appendix C, and focus here on learning rate decay. Our training decays learning rate by a factor of 10 at Epoch 60, and we plot in Figure 3 the precision of MI for examples injected near this decay. We find that examples injected before the decay are forgotten significantly slower than examples injected after. An update before the decay will have 10×\times the learning rate of subsequent updates, so these will have less ability to influence forgetting.

Understanding Forgetting

As we have shown, forgetting empirically occurs in deep learning. In this section, we investigate why. We start by showing a setting where kk-means, due to its non-convex loss function, does not forget training data, contrasting with our empirical results on large non-convex models, potentially because attacks struggle to take advantage of the optimization trajectory. To understand this inconsistency, we offer a possible explanation: randomness in training can lead to forgetting. To support this explanation, we first show that, without nondeterminism, forgetting does not happen in deep learning. We show this empirically as well as formally for SGD on a task of mean estimation.Mean estimation is frequently used in the privacy literature to investigate attacks and design differentially private algorithms . We then prove that if the training data is randomly shuffled, forgetting does happen for SGD on this mean estimation task—thus corroborating our empirical results.

We show here that there exists a data distribution where kk-means clustering will not forget certain examples, due to its non-convex objective. The kk-means problem can have many local optima, and we will construct a setting where a single example can force the model into one of these optima, where the model will remain even after more examples are added. The setting we consider is the one of a one-dimensional clustering task with three clusters in the data, c1c_{1}, c2c_{2}, and c3c_{3}, but where kk-means is configured to use k=2k=2 clusters. As a result, kk-means needs to choose to “merge” one cluster, c2c_{2}, with the others. We can construct an outlier from the c2c_{2} cluster which will manipulate this decision, and adding more data does not lead to this choice being forgotten. The setting is illustrated in Figure 4 (with a two-dimensional clustering task for clarity of exposition). We provide a more concrete description of the dataset we consider in Appendix D.1, but we verify empirically that this setting results in near-perfect MI on the outlier point, with 97% accuracy and perfect precision. This is the first example of privacy leakage we are aware of that relies on non-convexity. There is a large body of differentially private algorithms which are specific to convex models, and do not apply to non-convex settings . Our analysis here is the first to concretely justify this separation.

2 Deterministic SGD Does Not Forget

Having observed a (constructed) setting in which forgetting does not happen, we now turn our attention to why forgetting happens when training a neural network. Earlier, we hypothesized that nondeterminism leads to the forgetting we observed empirically. Here, we support this by showing that some unknown nondeterminism is required for forgetting, before continuing in Section 5.3 to show a source of nondeterminism that can cause forgetting.

The key idea of this section is that an adversary with knowledge of the entire training set and batch order can simulate the training on DD and on DDpD\cup D_{p}, stepping through training in the exact same way as the learner, and receiving the exact same model at each step. Because of the update on DpD_{p}, these models are never exactly equal, and, because training is deterministic, it is always possible to distinguish them. In Figure 5, we see that this attack is possible in practice, on a logistic regression model trained on the FMNIST dataset with a fixed batch order, using the Inject strategy. By simulating training, the adversary achieves 97% MI accuracy for examples seen in Epoch 5, even after 95 epochs of training (we expect the 3% drop to be a result of unavoidable GPU nondeterminism). In Appendix D.2, we prove that this attack results in perfect MI on a deterministic version of SGD for mean estimation. In this mean estimation setting, adding Gaussian noise to the updates would provably ensure forgetting for DpD_{p} using the results of Feldman et al. , so it is the nondeterminism itself which prevents forgetting.

3 Random Sampling Leads To Forgetting

Having shown that forgetting doesn’t happen when all nondeterminism is known to the adversary, we now show that some amount of unknown nondeterminism can cause forgetting. While Feldman et al. show that differentially private noise is a sufficient level of nondeterminism to provably cause forgetting, we consider nondeterminism coming from randomized data sampling.

We show that it does, by considering a simple distinguishing test. Consider two training runs, both starting SGD from the same fixed initial parameters θ0\theta_{0}. At the very first timestep, we inject two different arbitrary examples, vv and vv^{\prime}, into the two models. That is, we compute θ0+\theta_{0}^{+} by taking a gradient step on vv , and θ0\theta_{0}^{-} by taking a gradient step on vv^{\prime}. We then continue SGD on randomly sampled examples from D\mathcal{D}. We write the kkth step of each run as θk+\theta_{k}^{+} (for those continuing from θ0+\theta_{0}^{+}) and θk\theta_{k}^{-} (respectively). In Theorem 1, we bound the Rényi divergence between the distributions of θk+\theta_{k}^{+} and θk\theta_{k}^{-}. Rényi divergence is the divergence used in Rényi DP , and is known to bound the success of MI attacks and reconstruction attacks , so a smaller Rényi divergence implies that these attacks will perform worse, and the model will forget which of vv and vv^{\prime} was used in training. To simplify the analysis, we use v=vv^{\prime}=-v, although a similar result can be shown for arbitrary pairs of vv and vv^{\prime}.

The distributions of the models θk+\theta_{k}^{+} and θk\theta_{k}^{-} produced with learning rate 0<η<1/20<\eta<1/2 have Rényi divergence of order α\alpha at most 2αkvΣ1v\tfrac{2\alpha}{k}v^{\top}\Sigma^{-1}v.

Theorem 1 states that distinguishing between the models θk+\theta_{k}^{+} and θk\theta_{k}^{-} becomes harder as training progresses, as well as when vΣ1vv^{\top}\Sigma^{-1}v is small (when vv is more “in-distribution”). These agree with observations made in our experiments: forgetting improves as kk increases (although with diminishing returns), and more worst-case (or heavily repeated) examples are forgotten more slowly (here, if vΣ1v=0v^{\top}\Sigma^{-1}v=0, forgetting never happens!). We note that this theorem is not operational, as it is impossible to know whether the Gaussian assumption holds in practice , but our analysis provides intuition for our experiments, suggesting further that this source of nondeterminism may lead to the empirical forgetting we observe. Combined with Theorem 2 in Appendix D.2, which also considers mean estimation, Theorem 1 shows that it is the unknown nondeterminism coming from the data which causes forgetting.

Conclusion

In forgetting, we demonstrate a passive form of privacy amplification: examples used early in model training may be more robust to privacy attacks. Our findings align with the theoretical findings of Feldman et al. (for convex optimization); improved bounds on privacy leakage for early examples also translate to better empirical privacy. Our analytical and empirical findings highlight some previously poorly understood factors underlying forgetting. The size of the dataset is key: forgetting is more likely to happen when (a) the learning algorithm randomly samples training examples at each step (e.g. as in SGD) and (b) these examples are sampled from a large training set (e.g., when training modern language models). Training sets should be large enough for an example to not be seen for thousands of steps before releasing the model. These findings will be useful to practitioners auditing privacy provided by their training algorithm and complement formal analysis of their guarantees (e.g., using DP). Our work suggests the following directions for future investigation.

Protecting the final examples. Our work suggests that, for large scale training algorithms, the most recently seen examples are most vulnerable to privacy attacks. Defenses may take advantage of this to improve empirical privacy by focusing their privacy efforts on the most recently seen examples. The number of examples to protect could be determined with an empirical forgetting analysis.

Non-convexity-aware MI. In route to understanding forgetting, we demonstrated the first setting where the optimum that a model ends up at leaks information about training examples. While this attack only applies to a specific kk-means setting, it is interesting whether non-convexity can be exploited to design new MI attacks on realistic tasks.

Limitations. While our empirical investigation uses state-of-the-art attacks, it remains inherently heuristic, and cannot prove that forgetting happened. However, we have no reason to expect that the trends we identify do not hold against even stronger attacks. Indeed, in some cases our attacks exploit more knowledge than a realistic adversary can access in practice. Further developments in the techniques we rely on here might be taken advantage of to more faithfully measure forgetting.

The authors would like to thank Rajiv Mathews, Hakim Sidahmed, and Andreas Terzis for helpful discussion. Katherine Lee’s research is supported in part by a Cornell University Fellowship. Daphne Ippolito’s research is supported in part by the DARPA KAIROS Program (contract FA8750-19-2-1004). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government.

References

Appendix A Algorithms for Measuring Forgetting

We present the algorithm for Poison in Algorithm 1 and the algorithm for Inject in Algorithm 2.

Appendix B Detailed Experiment Setup

We train ResNet-50 models for 90 epochs, with a learning rate of 0.1, momentum of 0.9, and batch size of 256 (making each epoch roughly 5,000 steps). To test privacy, we focus mainly on the Inject strategy, due to the large number of steps made in a single ImageNet epoch. We also briefly compare with Poison. In our experiments, we start from a fixed checkpoint (with no injected data), and fine-tune from this checkpoint for both the injected and baseline models, and perform membership inference attacks to distinguish the injected predictions from the baseline predictions. We calibrate the logit scores based on those produced by the fixed checkpoint, representing a powerful adversary who knows even an intermediate update before the inclusion of sensitive data. We also experiment with calibrated loss scores, and with not performing calibration, and find that these variants of the attack perform worse. We inject a single batch of data, and measure the precision at a false positive rate of 10%, averaging over three trials. We note that, because both the injected and baseline models are initialized from the same checkpoint, the first batch of the injected model will be the injected data, at which point membership inference will reach 100% accuracy/precision for any parameter setting.

We use the state-of-the-art Conformer (L) architecture and the training method from to train models over LibriSpeech. We train each model with a batch size of 2,048 utterances for 100,000 steps using the Adam optimizer and a transformer learning rate schedule .We use the Poison strategy, as we make many passes over LibriSpeech. To generate canary utterances, we start with a vocabulary consisting of the top 10,000 words in the LibriSpeech test set. For the canary transcripts, we sample each word randomly from the vocabulary, varying the length of the transcript ({7, 10} words), and the number of insertions in the dataset ({1, 4, 10, 20}). For the canary audios, we use Text-To-Speech generation, and vary the gender of the speaker ({Male, Female}), and the number of speakers per transcript ({1, 2} of the same gender). For each canary configuration, we generate 10 unique canaries. As a result, we generate 320 unique canaries that amount to a total of 2,840 utterances (<<1% of total utterances in LibriSpeech). We also use loss calibration, for which we train 11 reference models using a random 80% partition of LibriSpeech for training each model.

We train decoder-only, 110M parameter, language models (with the T5 codebase and architecture from ) for one epoch over a version of C4 (). In total, this was 83,216 steps with a batch size of 4,096. We deduplicated C4 with the MinHash strategy introduced by and the implementation and parameters from . showed that duplicating examples contributes to memorization, thus we seek to start our experiments with a deduplicated dataset. We used a threshold of 0.7 for both edit-distance and Jaccard similarity. We use the Inject strategy due to the large training set. For each batch of canaries, we created 256 unique canaries consisting of random tokens and duplicated them between 1 and 100 times to generate a full batch of 4,096 examples (including duplicates). Batches of canaries were inserted 10%, 50%, and 90% of the way into one epoch of training. Additionally, we another model, holding random seed, architecture, and data order fixed, without any canaries as a baseline.

Appendix C Extended Experiments

We reproduce Figure 1(a) in Figure 7(a), which shows forgetting measured with Inject on ImageNet with 10 duplicates, and compare it side-by-side with Poison in Figure 7(b), where each example is added once per ImageNet epoch, for 10 epochs. We see forgetting happens here at roughly the same rate, after controlling for how many times an example is seen.

C.2 Recency

We plot the impact of recency on forgetting in Figure 8. To do so, we inject samples at various points in training, and measure how quickly models return to a small privacy leakage. We investigate this in Figures 8(a), 8(b), and 8(c) for the ImageNet, LibriSpeech, and C4 datasets, respectively. On ImageNet, it appears that the later a point is seen, the longer it takes to forget, although this effect seems to level off for ImageNet (from epoch 40 to 80, there is little change). This is also true for LibriSpeech, where the later a point is seen, the more times it was used in training. However, there is no clear trend on the C4 dataset. Attacks which better exploit the optimization trajectory may find earlier examples to be forgotten more slowly, as they may have a longer impact on this trajectory. For now, privacy attacks appear to not be able to take advantage of the trajectory, and the optimization landscape appears to obfuscate rather than reveal memorized examples.

C.3 Model Size

We show the relationship between model size and forgetting in Figures 9(a) and 9(b), for hard test examples and medium hardness test examples, respectively. We evaluate forgetting for ResNet-18, ResNet-34, ResNet-50, and ResNet-101 models. There seems to be little difference between models for medium hardness test examples, but ResNet-101 models seem to forget hard test examples more slowly. Prior work has found that larger models generally memorize more , making this effect natural, but it is interesting that there is little difference between most of these models. It is possible that all ResNet models we consider are already highly over-parameterized, and forgetting has already nearly saturated.

C.4 Optimizer Parameters

On ImageNet, we vary the momentum and learning rate parameters to see if these result in a significant change to forgetting. For example, it is possible that an increased momentum parameter would lead to less forgetting, as it causes a point’s update to contribute to every subsequent update. However, we find this to generally not be the case in our experiments on ImageNet: learning rate and momentum do not appear to have a significant impact on forgetting (Figures 9(c) and 10(a), respectively). Learning rate likely does not have a large impact, as all examples have the same learning rate, and so the data nondeterminism is equally powerful at causing forgetting in each case. While it is intuitive that larger momentum parameters would lead to longer forgetting, forgetting happens over long enough step counts for momentum to not impact it significantly. However there is one exception: in our ImageNet training, we use a learning rate decay which reduces learning rate by a factor of 10 every 30 epochs (we reproduce Figure 3 in Figure 10(b) to discuss this effect more here). This has two drastic implications on forgetting: examples seen just before the decay are forgotten much more slowly, and there is a period directly after the decay where forgetting happens more quickly. The first phenomenon (seen in Epoch 58 and Epoch 59 in Figure 10(b)) is intuitive: the rate of forgetting depends on the other examples’ contributions, and the point injected before the learning rate decay contributes 10x more than these later examples. The other phenomenon is more surprising, and it is likely due to the optimum being more “unstable” after a learning rate decay. This instability reduces over time, as we see in Figure 10(b) (at Epochs 60-62); forgetting has begun to increase to a normal rate two epochs after the learning rate decay, at Epoch 62.

C.5 Comparison with Traditional Forgetting Metrics

We have argued that our approach is more relevant to privacy than traditional metrics for measuring forgetting, such as loss or accuracy (accuracy is standard in the catastrophic forgetting literature and was also used to measure forgetting in ). Here, we present additional experimental justification. First, we see in Figure 11(a) that accuracy is an overly optimistic metric from a privacy perspective. For examples repeated 10 times, accuracy has stabilized by 100×200=20000100\times 200=20000 steps after injection, suggesting that they are completely forgotten. However, compared to Figure 2(a), these examples are still at 70% membership inference precision, not invulnerable from privacy attacks. Loss (in Figure 11(b)) takes longer to stabilize, providing more comparable forgetting speed to our approach. However, membership inference scores are still more easily interpreted than loss scores for quantifying privacy risk. We also note that, to make comparison easier, we have used hard examples to report results with repetitions, but existing work typically considers more average examples, as forgetting is traditionally used to measure average performance on subdistributions. For this reason, existing metrics are more similar to our experiments with medium hardness or easy examples in Figure 11(c), where forgetting happens nearly immediately, while hard examples take significantly longer. Worst-case analysis is more relevant from a privacy perspective, justifying our approach.

Appendix D Extension of Understanding Forgetting, Section 5

We reproduce here Figure 12, which offers intuition for our setting. Concretely, we consider a one-dimensional kk-means task with k=2k=2, but where three clusters exist in the data, denoted by c1,c2,c3c_{1},c_{2},c_{3}. Samples from cluster c1c_{1} are drawn from N(1,σ2)\mathcal{N}(-1,\sigma^{2}), samples from cluster c2c_{2} are drawn from N(μ,σ2)\mathcal{N}(\mu,\sigma^{2}), and samples from cluster c3c_{3} are drawn from N(1,σ2)\mathcal{N}(1,\sigma^{2}). Here, σ2\sigma^{2} is some fixed variance parameter (we use σ=0.03\sigma=0.03), and μ>0\mu>0 controls the distance between c2c_{2} and the other two clusters, and crucially makes c2c_{2} slightly closer to c3c_{3} than to c1c_{1} (we use μ=0.03\mu=0.03). We consider a two-stage learning procedure, where some dataset D0D_{0} is used to learn the clusters initially, and the model is then updated with some new dataset D1D_{1}. We fix D1D_{1} to contain an equal number nn of samples from each cluster. However, we consider two cases for D0D_{0}: (1) in the IN case, D0D_{0} consists of mm samples from c1c_{1}, one “outlier” sample xx (we use x=0.01x=-0.01, slightly more than one standard deviation from the mean) from c2c_{2}, and mm samples from c3c_{3}, and in the OUT case, D0D_{0} consists of mm samples from c1c_{1} and mm samples from c3c_{3}, without the “outlier”. Samples from c2c_{2} are never included in D0D_{0}, perhaps because they are from an emergent subpopulation. The single “outlier” sample can result in drastically different clusters, a difference which does not get forgotten by collecting more examples D1D_{1}. We run an experiment with this setup, using m=10m=10, n=100n=100, and run 200 trials with fresh samples from each cluster. Over these trials, membership inference accuracy for xx has 97% accuracy, but the attack also has perfect precision. That is, when xx is OUT, c1c_{1} and c2c_{2} are far enough that they are never joined, and there are only examples where the randomness of the data causes c2c_{2} and c3c_{3} to join despite the presence of xx. We note that differential privacy would prevent this attack, as it would randomize the clustering, preventing membership inference. Machine unlearning algorithms designed for kk-means would also retrain the entire model upon removing the outlier.

D.2 Nondeterminism is Required

To show that determinism prevents forgetting, we consider the deterministic version of per-example gradient descent in Algorithm 3, which iterates through a dataset in a fixed order. In Theorem 2, we prove that this algorithm doesn’t forget, by considering two datasets which differ in only one row, and showing that the two datasets have perpetually different models, so that an adversary can distinguish between them if the adversary knows all other points. In the main body, we also experimentally verify this result, when logistic regression is trained by minimizing the standard cross-entropy loss with fixed-order SGD on the FMNIST dataset.

Consider the described algorithm TT in Algorithm 3, with 0<η<0.50<\eta<0.5. When run on any two distinct datasets D0D_{0} and D1D_{1} differing in only one row, we have T(D0)T(D1)T(D_{0})\neq T(D_{1}), regardless of which index D0D_{0} and D1D_{1} differ in.

Consider the intermediate points θi0\theta^{0}_{i} and θi1\theta^{1}_{i}, which represent the model at the iith step of training on datasets D0D_{0} and D1D_{1}, respectively. Write jj for the index where D0D_{0} and D1D_{1} differ. To prove the theorem, we first note that θj10=θj11\theta^{0}_{j-1}=\theta^{1}_{j-1}, as the first j1j-1 steps of training all use identical examples from D0,D1D_{0},D_{1}. Next, we show that θj0θj1\theta^{0}_{j}\neq\theta^{1}_{j}, and then that θi0θi1\theta^{0}_{i}\neq\theta^{1}_{i} for all i>ji>j.

To see that θj0θj1\theta^{0}_{j}\neq\theta^{1}_{j}, we write out the precise gradient updates:

where the second step holds because θj10=θj11\theta^{0}_{j-1}=\theta^{1}_{j-1}, and the third holds because jj is the index where D0D_{0} and D1D_{1} differ.

Now, to show that this implies that θi0θi1\theta^{0}_{i}\neq\theta^{1}_{i} for all i>ji>j, we show that, for all xx, the gradient update function is one-to-one. That is, for all xx, θθ\theta\neq\theta^{\prime}, taking a gradient step on xx still results in different models:

From this, θj0θj1\theta^{0}_{j}\neq\theta^{1}_{j} implies that the models are different at step j+1j+1, which itself implies that the models are different at step j+2j+2, and will be different for the entirety of training. ∎

D.3 Data Randomness can Cause Forgetting

Here, we prove Theorem 1, restated below.

The distributions of the models θk+\theta_{k}^{+} and θk\theta_{k}^{-} output by Algorithm 4 with learning rate 0<η<1/20<\eta<1/2 have Rényi divergence of order α\alpha at most 2αkvΣ1v\tfrac{2\alpha}{k}v^{\top}\Sigma^{-1}v.

First, observe that the gradient update on some θ0\theta_{0} with example x0x_{0} produces θ1=θ02η(θ0x0)=θ0(12η)+2ηx0\theta_{1}=\theta_{0}-2\eta(\theta_{0}-x_{0})=\theta_{0}(1-2\eta)+2\eta x_{0}. Another gradient update on x1x_{1} produces θ2=θ0(12η)2+2η(12η)x0+2ηx1\theta_{2}=\theta_{0}(1-2\eta)^{2}+2\eta(1-2\eta)x_{0}+2\eta x_{1}. In general, we have

We write θ0+=θ0(12η)+2ηv\theta_{0}^{+}=\theta_{0}(1-2\eta)+2\eta v (a gradient step on +v+v) and θ0=θ0(12η)2ηv\theta_{0}^{-}=\theta_{0}(1-2\eta)-2\eta v (a gradient step on v-v). Then we analyze the distributions θk+\theta_{k}^{+} and θk\theta_{k}^{-}. Using Equation 1, replacing θ0\theta_{0} with θ0+\theta_{0}^{+} and θ0\theta_{0}^{-}, and using that each xix_{i} is sampled from N(μ,Σ)\mathcal{N}(\mu,\Sigma), we obtain the distributions

Using that the Rényi divergence of order α\alpha between two multivariate Gaussians with means μ0,μ1\mu_{0},\mu_{1} and equal covariance Σ\Sigma^{\prime} is α2(μ0μ1)TΣ1(μ0μ1)\tfrac{\alpha}{2}(\mu_{0}-\mu_{1})^{T}\Sigma^{\prime-1}(\mu_{0}-\mu_{1}), we can compute the divergence for θk+\theta_{k}^{+} and θk\theta_{k}^{-} as 8αη(1η)(12η)2k1(12η)2kvTΣ1v8\alpha\tfrac{\eta(1-\eta)(1-2\eta)^{2k}}{1-(1-2\eta)^{2k}}v^{T}\Sigma^{-1}v.

It remains to bound the η\eta term in this product. The function f(η)=η(1η)(12η)2k1(12η)2kf(\eta)=\tfrac{\eta(1-\eta)(1-2\eta)^{2k}}{1-(1-2\eta)^{2k}} has negative derivative on the interval 0<η<1/20<\eta<1/2 for any k>0k>0. Then f(η)f(\eta) it is maximized on 0<η<1/20<\eta<1/2 when η0\eta\rightarrow 0; we can compute this limit with L’Hospital’s rule as 14k\tfrac{1}{4k}, completing the proof. ∎