Fast Model Editing at Scale

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning

Introduction

Increasingly large models have improved performance on a variety of modern computer vision (Huang et al., 2017; Chen et al., 2022) and especially natural language processing (Vaswani et al., 2017; Brown et al., 2020) problems. However, a key challenge in deploying and maintaining such models is issuing patches to adjust model behavior after deployment (Sinitsin et al., 2020). When a neural network produces an undesirable output, making a localized update to correct its behavior for a single input or small number of inputs is non-trivial, owing to the distributed nature of the model’s representations. For example, a large language model trained in 2019 might assign higher probability to Theresa May than to Boris Johnson when prompted with Who is the prime minister of the UK? (see Table 2 for an example with a real large language model; see Lazaridou et al. (2021) for a systematic study of failures of temporal generalization in LMs). An ideal model editing procedure could quickly update the model parameters to increase the relative likelihood of Boris Johnson without changing the model output for unrelated inputs. This procedure would produce edits with reliability, successfully changing the model’s output on the problematic input (e.g., Who is the prime minister of the UK?); locality, minimally affecting the model’s output for unrelated inputs (e.g., What sports team does Messi play for?); and generality, generating the correct output for inputs related to the edit input (e.g., Who is the UK PM?).

A simple approach to making such edits is additional fine-tuning with a new label on the single example to be corrected. Yet fine-tuning on a single example tends to overfit, even when constraining the distance between the pre- and post-fine-tuning parameters (Zhu et al., 2020; De Cao et al., 2021). This overfitting leads to failures of both locality and generality. While fine-tuning on the edit example along with continued training on the training set better enforces locality, our experiments show that it still lacks generality. Further, it requires persistent access to the full training set during test time and is more computationally demanding. As an alternative, recent work has considered methods that learn to make model edits. Sinitsin et al. (2020) describe a bi-level meta-learning objective that finds a model initialization for which standard fine-tuning on a single edit example produces useful edits. While effective, the computational requirements of learning such an editable representation make scaling to very large models, where fast, effective edits are most needed, difficult (see Figure 6). De Cao et al. (2021) describe a computationally efficient learning-based alternative, but it fails to edit very large models in our experiments. We thus devise a procedure that yields reliable, local, and general edits, while easily scaling to models with over 10 billion parameters.

The primary contribution of this work is a scalable algorithm for fast model editing that can edit very large pre-trained language models by leveraging the low-rank structure of fine-tuning gradients. We perform empirical evaluations on a variety of language-related tasks and transformer models, showing that MEND is the only algorithm that can consistently edit the largest GPT-style (Radford et al., 2019; Black et al., 2021; Wang and Komatsuzaki, 2021) and T5 (Raffel et al., 2020) language models. Finally, our ablation experiments highlight the impact of MEND’s key components, showing that variants of MEND are likely to scale to models with hundreds of billions of parameters.

The Model Editing Problem

The goal of model editing is to enable the use of a single pair of input $x_{\text{e}}$ and desired output $y_{\text{e}}$ to alter a base model’s output for $x_{\text{e}}$ as well as its equivalence neighborhood (related input/output pairs), all while leaving model behavior on unrelated inputs unchanged (Sinitsin et al., 2020; De Cao et al., 2021). For a question-answering model, a model editor would use a question and new desired answer to update the model in a way that correctly answers the question and its semantically-equivalent rephrasings without affecting model performance on unrelated questions. Some model editors, including ours, use a training phase before they can apply edits (Sinitsin et al., 2020; De Cao et al., 2021), using an edit training dataset $D^{tr}_{edit}$ that specifies the types of edits that will be made.

Model Editor Networks with Gradient Decomposition

where $\sigma$ is a non-linear activation function s.t. $\sigma(0)=0$ (ReLU in this work) and $U_{j},V_{j}$ correspond to a low rank factorization of MEND’s weights at layer $j$ (keeping MEND’s total parameters $O(d)$ ).

3.2 Training MEND

While MEND’s parameterization can tractably represent a mapping from gradients to model edits, training the editor presents its own challenges. Appendix A describes MEND’s identity initialization and input normalization, which our ablations in Section 5.4 show are important to effective edits.

Related Work

Various strategies for model editing exist, including modifications of standard fine-tuning intended to enforce locality by reducing distance traveled in parameter space (Zhu et al., 2020) or even to find the min-L2 norm parameter update that reliably edits the model’s output (Sotoudeh and Thakur, 2021). However, De Cao et al. (2021) observe that parameter-space constraints do not always translate to useful function-space constraints for neural networks. Our fine-tuning baselines thus use a KL-divergence constraint in function space, but, even with this modification, we find that fine-tuning does not consistently provide edit generality. Other approaches to editing such as Editable Neural Networks (ENN; Sinitsin et al. (2020)) or KnowledgeEditor (KE; De Cao et al. (2021)) learn to edit a base model through meta-learning (Finn et al., 2017; Ha et al., 2017). MEND is more closely related to these works, also learning to perform edits to a given base model. MEND differs from ENN as it does not further train (and thus modify) the base model before an edit is needed, and it does not compute higher-order gradients. Because ENN modifies the pre-edit model, the training process retains a copy of the original model in order to enforce the constraint that the editable model agrees with the original pre-trained model’s predictions. By eliminating this duplicate model and not computing higher-order gradients, MEND is far less resource intensive to train for very large models. Figure 6 shows the significant difference in memory consumption of ENN compared with MEND and KE. MEND is most similar to KE, which also presents a first-order algorithm that does not modify the pre-edit model. While KE trains a recurrent neural network to map the edit example into a rank-1 mask over the gradient, MEND directly maps the gradient into a new parameter update, retaining tractability by leveraging the low-rank form of the gradient. Table 1 contains an overview of algorithmic tradeoffs.
See Appendix B for extended discussion of related work.

Various methods for meta-learning also use gradient transforms to achieve better model updates for few-shot learning (Ravi and Larochelle, 2017; Li et al., 2017; Lee and Choi, 2018; Park and Oliva, 2019; Flennerhag et al., 2020). However, these approaches do not leverage the factorized gradient, limiting them to simpler transformations (typically linear) of the gradient and/or transformations that also often impact the function computed by the forward pass of the model. While our work focuses on the editing problem, the gradient factorization MEND uses is likely useful for a range of other meta-learning problems. Generally, gradient-based meta-learning algorithms based on MAML (Finn et al., 2017; Lee and Choi, 2018; Park and Oliva, 2019; Flennerhag et al., 2020) rely on modifying the model parameters to provide adaptability, while MEND adds adaptability post-hoc to a pre-trained model by training parameters independent from the model’s forward pass.

In the NLP literature, many papers have investigated the locus of various types of knowledge in language models, using learned probe models or iterative search procedures to test for linguistic structures (Belinkov et al., 2017; Conneau et al., 2018; Hewitt and Manning, 2019) or facts about the world (Petroni et al., 2019; Jiang et al., 2020; Dai et al., 2021). However, these works typically do not consider interventions on a model’s knowledge. Exceptions are Dai et al. (2021) and Wang et al. (2020), which assume access to many datapoints representing the knowledge to be edited; our work considers model editing using only a single example illustrating the model’s error.

Experiments

A key motivation for MEND is scalability to large models, which requires an algorithm to be efficient in terms of computation time and particularly memory consumption. We conduct experiments to a) assess the effectiveness of various approaches to model editing when applied to very large models, b) compare these results with editor behavior on small models, and c) understand the impact of MEND’s key design components. We evaluate model editors using several editing datasets and comparison algorithms, which we outline next. For each dataset, all algorithms edit the same parameters: for BART/T5, we edit the MLP layers of the last 2 encoder and decoder blocks; for GPT/BERT models, we edit the MLPs in the last 3 blocks.

Editing Datasets. All editing datasets pair each edit input $x_{\text{e}}$ (questions, text passages) with a plausible edit label $y_{\text{e}}$ that is intended to mimic the distribution of edit labels we would encounter in practice (changing a QA model’s answer or steering a generative model toward a particular continuation). For example, in a QA setting, plausible edit labels include the ground truth label as well as entities of the same type as the true answer. See Appendix C.4 Tables 7 and 8 for sample data. Specifically, for seq2seq models, we use the zsRE question-answering dataset (Levy et al., 2017) using question rephrasings generated by backtranslation as the equivalence neighborhood and train/val splits generated by De Cao et al. (2021). Each $x_{\text{e}}$ is a question about an entity, and plausible alternative edit labels $y_{\text{e}}$ are sampled from the top-ranked predictions of a BART-base model trained on zsRE question-answering. When editing models pre-trained on the zsRE question-answering problem, we sample $x_{\text{loc}}$ as independent questions from the edit train set. For other experiments (Section 5.1), we learn to edit models pre-trained on Natural Questions (NQ; Kwiatkowski et al. (2019)) rather than zsRE; we therefore sample $x_{\text{loc}}$ from NQ rather than zsRE to measure accuracy drawdown in these cases. For classification models (e.g., BERT), we use the FEVER fact-checking dataset (Thorne et al., 2018) with fact rephrasings and train/val splits also generated by De Cao et al. (2021). Each $x_{\text{e}}$ is a fact, and each $y_{\text{e}}$ is a random binary label sampled from a Bernoulli distribution with $p=0.5$ . Locality examples $x_{\text{loc}}$ are randomly sampled facts distinct from the edit example. For GPT-style models, we create a Wikitext generation editing dataset of similar size to the zsRE and FEVER editing datasets, containing approximately 68k $x_{\text{e}},y_{\text{e}}$ pairs. 
Each $x_{\text{e}}$ is a passage sampled from Wikitext-103 and $y_{\text{e}}$ is a 10-token sample from a pre-trained distilGPT-2 model. (The base model’s greedy 10-token prediction agrees with these edit targets for <1% of examples.) $x_{\text{loc}}$ is chosen depending on the pre-trained model: for models pre-trained on Wikitext, $x_{\text{loc}}$ is sampled from Wikitext-103 (independently from $x_{\text{e}}$). For GPT-Neo/J, we sample $x_{\text{loc}}$ from OpenWebText (OWT; Gokaslan and Cohen, 2019) to better match the model’s original training data. The equivalence neighborhood in this setting is $N(x_{\text{e}},y_{\text{e}})=\{(x_{\text{e}}^{k},y_{\text{e}})\}$, where $x_{\text{e}}^{k}$ is formed by removing a prefix of up to $\frac{|x_{\text{e}}|}{2}$ tokens from the beginning of $x_{\text{e}}$, and $|x_{\text{e}}|$ is the length of $x_{\text{e}}$ in tokens.
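The truncation heuristic for the Wikitext equivalence neighborhood can be sketched as follows. This is a minimal illustration rather than the released implementation: the function name `truncated_neighborhood`, the representation of passages as plain token lists, and the choice to enumerate every prefix length from 1 to $\lfloor |x_{\text{e}}|/2 \rfloor$ are all assumptions.

```python
from typing import List, Sequence, Tuple


def truncated_neighborhood(
    x_e: Sequence[str], y_e: Sequence[str]
) -> List[Tuple[List[str], List[str]]]:
    """Pair y_e with each copy of x_e that has a prefix of 1..len(x_e)//2 tokens removed."""
    max_removed = len(x_e) // 2  # assumed reading of "up to |x_e|/2 tokens"
    return [(list(x_e[k:]), list(y_e)) for k in range(1, max_removed + 1)]


# Hypothetical 5-token passage with a 2-token edit target.
neighbors = truncated_neighborhood(["a", "b", "c", "d", "e"], ["y1", "y2"])
```

Every element of the neighborhood keeps the same target $y_{\text{e}}$; only the amount of left context changes.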

Comparison of model editors. We compare MEND with several other model editors, including two fine-tuning-based algorithms (which do not train any model editor at all) and two learned model editors. The fine-tune (FT) algorithm fine-tunes on the edit example $(x_{\text{e}},y_{\text{e}})$ until the label is assigned the highest likelihood (using greedy decoding for sequence models). The ‘oracle’ fine-tune + KL (FT+KL) algorithm has access to the training set at test time and adds $L_{\text{loc}}$ (Eq. 4b) to the test-time fine-tuning objective (which is typically only computable during model editor training). Similarly to De Cao et al. (2021), we limit each of these algorithms to 100 fine-tuning steps. Additionally, we compare with two learned model editors: a re-implementation of Editable Neural Networks (ENN; Sinitsin et al., 2020) when possible (due to high memory usage) and KnowledgeEditor (KE; De Cao et al., 2021). We use identical hyperparameters for MEND across all models and datasets. For BART and T5 models, we edit the MLP weight matrices in the last 2 transformer blocks of the encoder and decoder; for other models, we edit the MLP weights in the last 3 transformer blocks. Appendix G explores a simple caching-based model editor that stores model edits in memory.
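The stopping rule of the FT baseline (fine-tune until the edit label becomes the most likely output, within a fixed step budget) can be illustrated on a toy problem. The sketch below substitutes a single linear softmax classifier for a language model; the function name `fine_tune_until_correct`, the learning rate, and the toy dimensions are assumptions made for illustration only.

```python
from typing import Tuple

import numpy as np


def fine_tune_until_correct(
    W: np.ndarray, x: np.ndarray, label: int, lr: float = 0.1, max_steps: int = 100
) -> Tuple[np.ndarray, int]:
    """Gradient descent on -log p(label | x) until `label` is the argmax or the budget is spent.

    Returns the updated weights and the number of gradient steps taken.
    """
    W = W.copy()
    for step in range(max_steps + 1):
        logits = W @ x
        if int(np.argmax(logits)) == label:
            return W, step
        if step == max_steps:
            break
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        grad_logits = probs.copy()
        grad_logits[label] -= 1.0            # d(-log p_label) / d(logits)
        W -= lr * np.outer(grad_logits, x)   # gradient of a linear layer is an outer product
    return W, max_steps


# Toy example: 3 classes, 4 input features, desired label 2.
W0 = np.zeros((3, 4))
x_e = np.ones(4)
W_edited, steps_taken = fine_tune_until_correct(W0, x_e, label=2)
```

Nothing in this loop constrains the model’s behavior on other inputs, which is why the plain FT baseline offers no locality guarantee.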

Metrics. Our experiments measure the reliability and generality of a model editor using edit success (ES) (Eq. 1). To assess locality, we use drawdown (DD), which is defined as the performance degradation of the edited model on the rest of the dataset, measured as either the edited model’s perplexity increase or accuracy decrease compared to the base model, depending on the problem.
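As a rough sketch of how drawdown might be computed, assuming per-token log-probabilities (for generation) or per-example correctness indicators (for classification and QA) are available for both the base and the edited model. The helper names are hypothetical, and treating the perplexity increase as a simple difference is an assumption.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_drawdown(
    base_log_probs: Sequence[float], edited_log_probs: Sequence[float]
) -> float:
    """Perplexity increase of the edited model over the base model (assumed: a difference)."""
    return perplexity(edited_log_probs) - perplexity(base_log_probs)


def accuracy_drawdown(
    base_correct: Sequence[int], edited_correct: Sequence[int]
) -> float:
    """Accuracy decrease of the edited model relative to the base model."""
    base_acc = sum(base_correct) / len(base_correct)
    edited_acc = sum(edited_correct) / len(edited_correct)
    return base_acc - edited_acc
```

Under both definitions a value near zero indicates a local edit, and larger positive values indicate more degradation on unrelated data.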

We first consider the problem of editing some of the largest publicly-available Transformer models. We use GPT-Neo (2.7B parameters; Black et al., 2021) and GPT-J (6B parameters; Wang and Komatsuzaki, 2021), several times larger than GPT-2 (Radford et al., 2019), and the largest two T5 models, T5-XL (2.8B parameters) and T5-XXL (11B parameters) fine-tuned on NQ (Roberts et al., 2020). Table 3 shows the results; MEND provides the most successful edits across tasks. Fine-tuning achieves lower edit success on the Wikitext task and exhibits a much larger perplexity increase than MEND. On the question-answering edit task, fine-tuning shows similarly reduced edit success, struggling to generalize to some rephrasings of the edit input. The KL-constrained baseline reduces the perplexity drawdown for GPT-Neo and GPT-J, but at the cost of edit success. KE is ineffective at this scale, generally failing to provide successful edits. For these experiments, we use OWT and NQ to measure drawdown for generation and question-answering, respectively, as they are more representative of the data used to train the base models.

5.2 Smaller scale editing

We conduct an additional experiment editing the BERT-base and BART-base models fine-tuned by De Cao et al. (2021) on the FEVER fact-checking and zsRE question-answering tasks, respectively; for our Wikitext editing task, we edit a smaller distilGPT-2 model (Wolf et al., 2019) fine-tuned on Wikitext2 (Ma, 2021). These models are 1–2 orders of magnitude smaller than those in Section 5.1. Results are presented in Table 4. At small scale where computational requirements are not a concern, ENN is competitive with MEND, providing the best performance on the Wikitext problem. Fine-tuning overfits even more severely than with larger models, showing lower edit success (overfitting to the edit example) and higher drawdown (degrading the model more seriously). One difficulty of using ENN is that the pre-trained model itself must be fine-tuned to ‘provide’ editability, potentially changing the model’s predictions even before an edit has been applied. Unlike the large-scale experiments, drawdown is computed using samples from the same datasets as edit inputs, again in order to best match the data distribution the base models were fine-tuned on. See Appendix G for additional comparisons with the caching-based editor, which shows strong performance for zsRE and FEVER, but generally fails for Wikitext, as well as a more difficult version of the zsRE problem for which MEND still produces meaningful edits.

5.3 Batched Editing

Table 5 compares MEND with ENN (the strongest comparison method) in a more realistic setting when multiple simultaneous zsRE QA model edits are needed; MEND consistently provides significantly more effective edits in the multi-edit setting. Both algorithms are trained and evaluated on applying $k$ simultaneous edits, with $k\in\{1,5,25,75,125\}$. MEND applies simultaneous edits by simply summing the parameter edits computed separately for each edit example. MEND applies 25 edits in a single model update with 96% edit success and less than 1% accuracy degradation (35% edit success for ENN), and successfully applies 67% of edits when applying 125 edits at once (11% success for ENN, although ENN’s accuracy drawdown is slightly lower).
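The summation of separately computed edits can be sketched in a few lines. This is an illustration under assumptions: `apply_batched_edits` is a hypothetical helper, the random matrices stand in for edits produced by a trained editor, and any sign or step-size convention is taken to be already folded into each edit.

```python
from typing import List

import numpy as np


def apply_batched_edits(weight: np.ndarray, per_example_edits: List[np.ndarray]) -> np.ndarray:
    """Return the weight matrix plus the sum of independently computed parameter edits."""
    total_edit = np.sum(np.stack(per_example_edits), axis=0)
    return weight + total_edit


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                            # stand-in for one edited weight matrix
edits = [rng.normal(size=(4, 3)) for _ in range(5)]    # stand-ins for k = 5 per-example edits
W_edited = apply_batched_edits(W, edits)
```

Because the edits are simply added, the cost of applying $k$ edits is one model update regardless of $k$; interference between edits is not controlled by this step.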

5.4 Ablations & MEND Variants

Discussion

Conclusion. We have presented an efficient approach to editing very large (10 billion+ parameter) neural networks, which we call Model Editor Networks with Gradient Decomposition or MEND. We showed that MEND is the only method that successfully edits the largest publicly-available Transformer models from the GPT and T5 model families. To do so, MEND treats the model editing problem itself as a learning problem, using a relatively small edit dataset to learn model editor networks that can correct model errors using only a single input-output pair. MEND leverages the fact that gradients with respect to the fully-connected layers in neural networks are rank-1, enabling a parameter-efficient architecture that represents this gradient transform.

Limitations & Future Work. A limitation of existing model editors (including MEND) is the approach to enforcing locality of edits. The failure mode of over-generalization (bottom of Table 2) shows that locality examples (i.e., negative examples) are not challenging enough to prevent the model from sometimes changing its output for distinct but related inputs. Alternative locality losses or harder negative mining may help address this problem. Further, existing language-based editing datasets use backtranslation to evaluate edit generality (and our Wikitext dataset uses a truncation heuristic). Such equivalence neighborhoods do not assess a model’s ability to use the knowledge in an edit example to correctly answer questions about other topics whose answer is implied by the content of the edit example (e.g., for Who is the UK PM? Boris Johnson, does the edited model correctly answer Is Boris Johnson a private citizen?). Counterfactual data augmentation (Kaushik et al., 2020) may be useful for constructing richer evaluation cases for edit generality. Future work might also apply MEND to other types of edits, such as reducing the frequency of toxic generations after observing toxic outputs, relabeling entire classes of images from one example, or adjusting a robot’s control policy to avoid particular actions, as MEND is not limited to editing transformer models. Finally, MEND’s gradient decomposition is not in principle limited to the model editing problem, and it might enable efficient new gradient-based meta-learning algorithms.

Acknowledgements

We gratefully acknowledge Angeliki Lazaridou for insightful early discussions regarding temporal generalization in language models; Spencer Braun for implementing exploratory experiments that motivated this project; Mitchell Wortsman, Gabriel Ilharco, Stephanie Chan, and Archit Sharma for insightful discussions and encouragement; Michael Chang, Michael Janner, and Ashwin Paranjape for feedback on an early version of the paper; and the anonymous ICLR reviewers for their feedback. Eric Mitchell gratefully acknowledges the support of a Knight-Hennessy graduate fellowship. Chelsea Finn and Chris Manning are fellows in the CIFAR Learning in Machines and Brains program.

Ethics Statement

This work uses large language models pre-trained on text scraped from the internet. These massive training corpora (and therefore the models trained on them) may contain (or produce) content that is counter to the values of the ICLR community. Algorithms for model editing may provide one tool (among others) to mitigate this problem by enabling maintainers of large models to change certain undesirable model behaviors as they are discovered. On the other hand, a model editor could also be used to exacerbate the very model behaviors that we hope to eliminate, depending on who is wielding it. This dual use is a risk for many machine learning technologies. Specifically, effective editing algorithms (including MEND and others) may enable maintainers of deployed neural networks to include backdoors or other planned vulnerabilities/hidden behaviors into their models.

Reproducibility

To foster reproducibility, we have provided a detailed description of the proposed algorithm in Section 3, as well as additional details regarding experimental setup, hyperparameters, and implementations of comparison algorithms in Section C. Our experiments use fixed random seeds for data sampling and model editor initialization, enabling reproducible results. Section C.4 describes how to obtain the pre-existing datasets and models we used in our experiments (from De Cao et al. (2021)). See project website at https://sites.google.com/view/mend-editing for links to code and data.

References

Appendix A Effective initialization and normalization for MEND networks

Appendix B Extended Discussion of Related Work

Model editing shares with continual learning [McCloskey and Cohen, 1989, Parisi et al., 2019] the goal of assimilating or updating a model’s behavior without forgetting old information or behaviors, commonly known as the problem of catastrophic forgetting [McCloskey and Cohen, 1989, Ratcliff, 1990, Kirkpatrick et al., 2017]. However, in continual learning settings, a model is typically expected to learn wholly new behaviors or datasets [Kirkpatrick et al., 2017, Parisi et al., 2019] without forgetting, while in this work we consider more localized model edits. Further, continual learning generally considers long sequences of model updates with minimal memory overhead, while our work generally considers an edit or batch of edits applied all at once.

Additionally, min-norm parameter fine-tuning has also been considered in past work in the context of editing [Zhu et al., 2020] and traditional model fine-tuning [Guo et al., 2021], where the parameters of the edited or fine-tuned model $\theta^{\prime}$ are penalized (or constrained) from drifting too far from the original model parameters $\theta$ using various norms, including L0, L2, and L-$\infty$. While min-norm constraints may be an effective regularization for traditional fine-tuning settings where fine-tuning data is abundant, the experiments conducted in De Cao et al. [2021] show that parameter-space norm constraints are insufficient to prevent significant model degradation when fine-tuning on a single edit example.

Editable neural networks [Sinitsin et al., 2020] search for a set of model parameters that both provide good performance for a ‘base task’ (e.g., image classification or machine translation) and enable rapid editing by gradient descent to update the model’s predictions for a set of ‘edit examples’ without changing the model’s behavior for unrelated inputs. ENN optimizes the following objective, based on the MAML algorithm [Finn et al., 2017]:

The first term of Equation 5 is the base task loss; for a generative language model, we have $L_{\text{base}}(\mathcal{D}_{\text{base}},\theta)=-\log p_{\theta}(\mathcal{D}_{\text{base}})$ where $\mathcal{D}_{\text{base}}$ is a batch of training sequences. $L_{\text{edit}}$ is the edit reliability loss, encouraging the model to significantly change its output for the edit examples in $\mathcal{D}_{\text{edit}}$. Finally, $L_{\text{loc}}$ is the edit locality loss, which penalizes the edited model $\theta^{\prime}$ for deviating from the predictions of the pre-edit model $\theta$ on $\mathcal{D}_{\text{loc}}$, data unrelated to $\mathcal{D}_{\text{edit}}$ and sampled from the same distribution as $\mathcal{D}_{\text{base}}$. See Sinitsin et al. [2020] for a more detailed explanation of ENN training and alternative objectives for $L_{\text{edit}}$ and $L_{\text{loc}}$.

Comparing ENN and MEND. The key conceptual distinction between ENN and MEND is that ENN encodes editability into the parameters of the model itself (intrinsic editability), while MEND provides editability through a set of learned parameters that are independent from the model parameters (extrinsic editability). An advantage of ENN is that no new parameters are added in order to provide editability. However, this approach comes with several drawbacks. First, the MAML-based objective ENN optimizes is expensive, particularly in terms of memory consumption (see Figure 4). By further training the model parameters themselves, ENN cannot guarantee that the editable model it produces will make the same predictions as the original model. In order to approximately enforce this constraint during training, ENN must use an extra copy of the original base model to ensure that the editable model’s predictive distribution does not differ too much from it. This incurs significant additional memory costs, particularly when training ENN for very large models, for which the parameters of the model alone occupy a significant amount of VRAM. Another cause for the significant VRAM consumption of ENN is the need to compute activations and gradients for the model parameters; even if we edit only the last layer, ENN trains the rest of the model so that the last layer gradient is productive, requiring activations and gradients to be computed for the entire model. On the other hand, extrinsic editors like MEND and KE do not require updating the base model itself, thereby computing gradients for far fewer parameters. Future work might investigate approaches to reducing the memory consumption of ENN, although the requirement to retain a copy of the original model in order to enforce locality creates a relatively high lower bound on the amount of memory that ENN might use.

Regardless of memory consumption, extrinsic editors have the potential advantage of being able to edit more than one model; in theory, we might amortize the cost of training MEND over several base models at once. On the other hand, intrinsic editability must by definition be re-learned separately for each base model.

B.2 KnowledgeEditor (KE)

Comparing KE and MEND. KE more closely resembles MEND in that it is also an extrinsic model editor. However, while MEND directly maps model gradients into model edits, the KE model editor uses the raw edit example as an input, outputting a single rank-1 mask and rank-1 offset over the fine-tuning gradient. We hypothesize that the KE model faces several challenges that MEND avoids. First, mapping the edit example itself into a model update requires a translation from the high-level modality of data examples into the very low-level modality of model parameter updates. Solving this translation requires making additional design decisions (e.g., how to feed the edit input and label into the editor, what architecture to use for the editor), the optimal design for which may vary across problems. Further, by not conditioning directly on the gradient, KE forgoes a rich source of information about which parameters of the model are most responsible for updating the model’s outputs. In addition, by operating on the token-wise activations and gradients (i.e., the gradients are not summed over the sequence/batch, but are kept as per-sequence element activation and gradient vectors), MEND outputs a rank-1 model edit for each token in the input and output sequence. The final output of MEND is the sum of these, which has rank of order 10 or even 100, depending on the problem. In contrast, the KE editor outputs only a rank-1 gradient mask and rank-1 gradient offset, regardless of the information content of the edit example. We hypothesize that this rank-1 constraint, imposed irrespective of the size of the input, causes KE’s failure to perform well on the Wikitext editing task, which has labels with significantly higher information content (10 tokens) than the FEVER or zsRE tasks.
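The rank argument above can be checked numerically. In the sketch below, random vectors stand in for per-token layer inputs and per-token output gradients (or their transformed versions); the dimensions and variable names are assumptions chosen for illustration, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_in, d_out = 10, 32, 48

# Hypothetical per-token layer inputs u_i and output gradients delta_i.
U = rng.normal(size=(num_tokens, d_in))
Delta = rng.normal(size=(num_tokens, d_out))

# Each token contributes the rank-1 matrix delta_i u_i^T.
per_token = [np.outer(Delta[i], U[i]) for i in range(num_tokens)]
summed = np.sum(per_token, axis=0)  # same matrix as Delta.T @ U

rank_single = np.linalg.matrix_rank(per_token[0])
rank_sum = np.linalg.matrix_rank(summed)
```

Each per-token term has rank 1, while the rank of the sum is bounded by the number of tokens (and by the layer dimensions), so longer edit examples admit higher-rank updates.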

Appendix C Experimental Details

For GPT and BERT-style models, all experiments edit the MLP weights in the last 3 transformer blocks (6 weight matrices total). For BART and T5-style models, all experiments edit the MLP weights in the last 2 transformer blocks in both the encoder and the decoder (8 weight matrices total). We found that editing MLP layers generally provides better editing performance (across algorithms) than editing attention layers. In line with past work [De Cao et al., 2021], all reported performance numbers are on the validation set. For all algorithms, we use early stopping to end training early if the validation loss ($L=c_{\text{edit}}L_{\text{e}}+L_{\text{loc}}$) does not decrease for 20,000 steps on a subset of 500 validation examples, with a maximum number of training steps of 500,000. We use a batch size of 10 (with gradient accumulation) and the seed 0 for all experiments. Tables 7 and 8 show examples from each dataset used in our experiments.
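The early-stopping rule described above can be sketched as a generic training loop. The callables `train_step` and `validation_loss` are hypothetical placeholders, and the evaluation interval `eval_every` is an assumption, since the text does not state how often the validation loss is measured.

```python
from typing import Callable


def train_with_early_stopping(
    train_step: Callable[[int], None],
    validation_loss: Callable[[], float],
    patience_steps: int = 20_000,
    max_steps: int = 500_000,
    eval_every: int = 1_000,  # assumed evaluation interval
) -> int:
    """Train until the validation loss has not improved for `patience_steps` steps.

    Returns the number of training steps taken.
    """
    best_loss = float("inf")
    best_step = 0
    for step in range(1, max_steps + 1):
        train_step(step)
        if step % eval_every == 0:
            loss = validation_loss()
            if loss < best_loss:
                best_loss, best_step = loss, step
            elif step - best_step >= patience_steps:
                return step
    return max_steps
```

In this sketch, `validation_loss` would evaluate $c_{\text{edit}}L_{\text{e}}+L_{\text{loc}}$ on the fixed subset of validation examples.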

The fine-tuning baselines use model-dependent learning rates, which we found important in achieving good fine-tuning performance; using too large a learning rate causes decreased locality (increased model degradation), while too small a learning rate causes slow edits. We use edit learning rates of 5e-6 for GPT-Neo and GPT-J, 1e-4 for T5 models, and 1e-6 for the smaller models, aiming to complete edits in less than 100 fine-tuning steps (as in De Cao et al. [2021]). For the fine-tuning + KL-constraint baseline, we fine-tune on the loss $c_{\text{edit}}L_{\text{e}}+L_{\text{loc}}$, using a smaller $c_{\text{edit}}$ than for the learned algorithms (1e-2 for all models except GPT-J, which required 1e-3). Larger values of $c_{\text{edit}}$ provide little benefit from the locality loss. To compute $L_{\text{loc}}$, we use a batch size of one new example $x_{\text{loc}}$ from the full edit training set $D^{tr}_{edit}$ at each time step.

ENN.

We use an initial inner loop learning rate of 1e-2, but allow this value to be learned in the outer loop, which we find improves performance over the fixed inner loop learning rate version in Sinitsin et al. [2020]. For all experiments, ENN fine-tunes all model parameters during training (even when we only edit the last few layers). We also use only a single inner loop update step for computational reasons, which differs from the multi-step version used for the smaller models used by Sinitsin et al. [2020]. Our edit loss is also a slight simplification of the edit loss used by Sinitsin et al. [2020], which is

The first term of this loss is the edit loss we use in our work; the second term is primarily intended to provide the property that $l_{e}(\theta)\leq 0$ when an edit is successful so that the iterative editing process can be stopped. However, in this work, because we use only a single gradient step of editing for ENN, this property is less important, and the second term simply amounts to an additional emphasis on pushing down specifically the largest incorrect logit (which the first term already does implicitly).

KE.

We use the implementation of KE provided by De Cao et al. [2021], which can be found at https://github.com/nicola-decao/KnowledgeEditor, with minor changes to the computation of the KL constraint for consistency with other algorithms (see below). We use a learning rate of 1e-5.

C.2 Computing the locality constraint

Computing the true KL-divergence between the pre- and post-edit model $\text{KL}(p_{\theta}(\cdot|x_{\text{loc}})\|p_{\theta^{\prime}}(\cdot|x_{\text{loc}}))$ quickly becomes computationally prohibitive for model outputs of more than a few tokens, because it requires marginalization over possible answers. We therefore approximate this KL-divergence using samples from the dataset. We justify this choice by the fact that the model’s predictive distribution is similar to the locality sample distribution (as locality samples are drawn from the dataset the model was originally trained on). While this is not as principled as a true Monte Carlo estimate using samples from the model itself, it reduces the computational requirements of training and is easier to implement; the generally low drawdown for most models indicates that this approximation still provides a good locality constraint in practice. For the seq2seq question-answering problem, we evaluate the KL divergence only at the tokens of the answer $y_{\text{loc}}$, giving $\text{KL}_{\text{approx}}^{\text{seq2seq}}(\theta,\theta^{\prime})=\frac{1}{|y_{\text{loc}}|}\sum_{i=1}^{|y_{\text{loc}}|}\text{KL}(p_{\theta}(\cdot|x_{\text{loc}},y_{\text{loc}}^{<i})\|p_{\theta^{\prime}}(\cdot|x_{\text{loc}},y_{\text{loc}}^{<i}))$, where $p(\cdot|x_{\text{loc}},y_{\text{loc}}^{<i})$ is the distribution over next tokens $y_{i}$ given the locality input $x_{\text{loc}}$ and the label tokens for previous timesteps $y_{\text{loc}}^{<i}$. Similarly, for the Wikitext setting, we define $\text{KL}_{\text{approx}}^{\text{auto}}(\theta,\theta^{\prime})=\frac{1}{|x_{\text{loc}}|}\sum_{i=1}^{|x_{\text{loc}}|}\text{KL}(p_{\theta}(\cdot|x_{\text{loc}}^{<i})\|p_{\theta^{\prime}}(\cdot|x_{\text{loc}}^{<i}))$. For FEVER fact-checking, we compute the exact KL-divergence between Bernoulli distributions in closed form.
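A minimal sketch of these locality terms follows. It assumes the per-position logits of the pre- and post-edit models have already been computed under teacher forcing on a dataset sample; the function names are hypothetical, and plain Python lists are used in place of tensors to keep the example self-contained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def categorical_kl(pre_logits, post_logits):
    """KL(p_pre || p_post) for one next-token distribution, given logits."""
    log_p = log_softmax(pre_logits)
    log_q = log_softmax(post_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def approx_sequence_kl(pre_logits_seq, post_logits_seq):
    """Average the per-position KL over the positions of one dataset sample.

    Each argument is a list with one logit vector per position: the answer
    tokens for seq2seq, or the tokens of x_loc for autoregressive modeling.
    """
    if not pre_logits_seq or len(pre_logits_seq) != len(post_logits_seq):
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(categorical_kl(p, q) for p, q in zip(pre_logits_seq, post_logits_seq))
    return total / len(pre_logits_seq)


def bernoulli_kl(p, q):
    """Closed-form KL between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
```

The cost of `approx_sequence_kl` is linear in the sequence length because it conditions on the dataset tokens at every position, whereas the exact sequence-level KL would require summing over every possible output sequence.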

C.3 Environment Details

All runs are trained entirely on a single NVIDIA RTX Titan or A40 GPU. No gradient checkpointing or memory-reduction optimizations are used, although bfloat16 is used to fit the largest T5 model onto our GPU. In full precision, the parameters alone of the T5-11B model use all of the memory of our largest GPU. VRAM consumption for training MEND and KE on T5-11B (Figs. 6 and 4) is estimated by doubling the bfloat16 VRAM usage [Wang and Kanwar, 2019]. While doubling the half-precision usage also let us estimate the memory consumption of ENN, we were unable to train ENN in half precision without numerical instability. All models are based on Huggingface Transformers implementations [Wolf et al., 2019] with some modifications in line with De Cao et al. [2021]. We use PyTorch [Paszke et al., 2019] for all experiments, specifically using the Higher library [Grefenstette et al., 2019] to implement the bi-level optimization in ENN as well as the inner loop of model editing for all algorithms.

C.4 Dataset Construction & Examples

Datasets are constructed to provide pairs of edit input $x_{\text{e}}$ and plausible edit label $y_{\text{e}}$. The edit label is not necessarily the ‘correct’ label; the goal is to provide realistic instances of the types of data we would expect to see at test time. For example, our dataset might have a sample such as $x_{\text{e}}$ = Where was Ursula K. Le Guin born? and $y_{\text{e}}$ = Addis Ababa, Oromia, Ethiopia, even though Ursula K. Le Guin was born in Berkeley, California, USA. However, this fictitious example is still a useful assessment of our model’s ability to perform the general type of edit of ‘change a person’s birthplace’. For the zsRE question-answering dataset, De Cao et al. [2021] generate fictitious $y_{\text{e}}$ in this manner using the top predictions of a BART model fine-tuned on the task of question answering, followed by manual human filtering. In practice, this produces alternate edit labels that are plausible and whose types match the original label. For FEVER fact-checking, there are only two choices for labels, and we sample edit targets 1 and 0 with equal probability. For Wikitext generation, we use a distilGPT-2 model to generate plausible 10-token continuations for a given Wikitext prefix, with a motivation similar to that for zsRE: providing edit targets that share the structure of the types of edits that we will apply in practice, even if they are not always factual. When qualitatively assessing MEND’s ability to correct real errors of the base model using factual labels, we find that MEND performs reliably, indicating that these label generators provide reasonable proxies for ‘real’ model edits.

Appendix D Rank-1 gradient for MLPs

Appendix E Editing attention parameters

Our experiments edit weights in the MLP layers of large transformers. Table 9 shows the results of editing the attention layers rather than the MLP layers; we observe that editing attention layers generally leads to reduced performance compared to editing MLP layers. For this comparison, we edit the same transformer blocks as for our main editing experiment in Table 3, but we edit the query/key/value/output matrices for each block instead of the two MLP matrices. The observation that editing MLP layers is more effective generally aligns with past work [Geva et al., 2021] suggesting that the MLP layers in Transformer architectures store human-interpretable, high-level concepts in the later layers of the model, which motivated our choice of editing these layers in our original experiments. Further, we hypothesize that the improved effectiveness of editing MLP layers may simply be due to the fact that they make up a large majority of model parameters, as the MLP hidden state is often much higher-dimensional than the model’s hidden state.
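As a rough illustration of the parameter-count argument, consider a single transformer block with model dimension $d$ and MLP hidden dimension $d_{\text{ff}}$, ignoring biases, layer norms, and embeddings, and taking each of the query/key/value/output matrices to be $d\times d$. The ratio $d_{\text{ff}}=4d$ used below is an assumption made for illustration (a common convention), not a value reported for the models in our experiments:

$$N_{\text{attn}}=4d^{2},\qquad N_{\text{MLP}}=2d\,d_{\text{ff}}=8d^{2},\qquad \frac{N_{\text{MLP}}}{N_{\text{attn}}+N_{\text{MLP}}}=\frac{8d^{2}}{12d^{2}}=\frac{2}{3}.$$

Under this assumption the two MLP matrices hold about two thirds of the block’s weights, and the fraction grows as $d_{\text{ff}}/d$ increases.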

Appendix F Additional Qualitative Examples of MEND

We provide additional qualitative examples of using MEND to edit a larger 770M-parameter T5-large model [Roberts et al., 2020] in Table 10. These examples include an instance of undergeneralization, in which the edit example’s output is correctly edited, but other examples in the equivalence neighborhood of the edit example do not change (see 2f in Table 10). In addition, we highlight the failure case of overgeneralization, in which the model’s post-edit output for superficially similar but semantically distinct inputs is also the edit target (for example, 3e, 3f, and 4e in Table 10). Mitigating these failure cases for model editors is an important priority for future work.

Appendix G Editing through Caching

Another simple approach to editing might be to cache the final layer hidden state $z_{e}$ (averaged over the sequence length) of the edit example $x_{\text{e}}$ and the tokens of the corresponding edit label $y_{\text{e}}$. After an edit is performed, if the model receives a new input $x$ whose final layer hidden state $z$ is close to $z_{e}$ (i.e., $\|z-z_{e}\|_{2}<\epsilon$), then the model outputs $y_{\text{e}}$ instead of its normal prediction. Here, we show that this approach is effective for editing problems with simpler inputs (zsRE question-answering, FEVER fact-checking), where inputs are typically short, simple phrases with one subject, one relation, and one object, but fails completely on the Wikitext editing problem, where contexts are typically 10x as long, with diverse passages containing significant amounts of extraneous text and ‘distracting’ information. The results are presented in Table 11. We include the ‘optimal’ threshold $\epsilon^{*}$ (the threshold that achieves similar drawdown to MEND), as well as the result of using $2\epsilon^{*}$ and $\frac{1}{2}\epsilon^{*}$. We observe that the caching approach is fairly sensitive to the threshold hyperparameter, and a threshold that works well for one task may not work well for others.

For zsRE question answering, $z$ is computed as the average hidden state of the question tokens; for FEVER fact-checking, $z$ is the average hidden state of the fact statement tokens. For generative modeling, when predicting the token at time step $t$, we compute $z_{t}$ as the average hidden state for all previously seen tokens $<t$. In order to compute perplexity for the caching approach, we output one-hot logits corresponding to $y_{\text{e}}$. We experimented with scaling the one-hot logit by different factors, but found scaling by $1$ to work well; scaling corresponds to changing the model’s confidence in its edit prediction but doesn’t change the prediction itself or the edit success.
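The caching baseline can be sketched as a thin wrapper around a base predictor. This is a minimal illustration under stated assumptions, not the code used for Table 11: the class name `CachingEditor`, its method names, and the `base_predict` callable are hypothetical, hidden states are passed in as plain lists of per-token vectors, and a single edit is cached.

```python
import math


def mean_hidden_state(hidden_states):
    """Average a list of per-token hidden vectors over the sequence length."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(dim)]


class CachingEditor:
    """Return the cached edit label when an input's mean hidden state is near z_e."""

    def __init__(self, base_predict, epsilon):
        self.base_predict = base_predict  # hypothetical callable: input -> prediction
        self.epsilon = epsilon            # distance threshold on ||z - z_e||_2
        self.z_e = None
        self.y_e = None

    def edit(self, edit_hidden_states, y_e):
        """Cache the averaged final-layer hidden state of x_e together with y_e."""
        self.z_e = mean_hidden_state(edit_hidden_states)
        self.y_e = y_e

    def predict(self, x, hidden_states):
        """Output y_e if the input is within epsilon of the cached state, else defer."""
        if self.z_e is not None:
            z = mean_hidden_state(hidden_states)
            if math.dist(z, self.z_e) < self.epsilon:
                return self.y_e
        return self.base_predict(x)
```

The sketch makes the threshold sensitivity concrete: every input whose averaged hidden state falls inside the $\epsilon$-ball receives $y_{\text{e}}$ regardless of its meaning, so a larger $\epsilon$ trades locality for generality.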