Understanding Black-box Predictions via Influence Functions

Pang Wei Koh, Percy Liang

Introduction

A key question often asked of machine learning systems is “Why did the system make this prediction?” We want models that are not just high-performing but also explainable. By understanding why a model does what it does, we can hope to improve the model (Amershi et al., 2015), discover new science (Shrikumar et al., 2017), and provide end-users with explanations of actions that impact them (Goodman & Flaxman, 2016).

However, the best-performing models in many domains — e.g., deep neural networks for image and speech recognition (Krizhevsky et al., 2012) — are complicated, black-box models whose predictions seem hard to explain. Work on interpreting these black-box models has focused on understanding how a fixed model leads to particular predictions, e.g., by locally fitting a simpler model around the test point (Ribeiro et al., 2016) or by perturbing the test point to see how the prediction changes (Simonyan et al., 2013; Li et al., 2016b; Datta et al., 2016; Adler et al., 2016). These works explain the predictions in terms of the model, but how can we explain where the model came from?

In this paper, we tackle this question by tracing a model’s predictions through its learning algorithm and back to the training data, where the model parameters ultimately derive from. To formalize the impact of a training point on a prediction, we ask the counterfactual: what would happen if we did not have this training point, or if the values of this training point were changed slightly?

Answering this question by perturbing the data and retraining the model can be prohibitively expensive. To overcome this problem, we use influence functions, a classic technique from robust statistics (Hampel, 1974) that tells us how the model parameters change as we upweight a training point by an infinitesimal amount. This allows us to “differentiate through the training” to estimate in closed-form the effect of a variety of training perturbations.

Despite their rich history in statistics, influence functions have not seen widespread use in machine learning; to the best of our knowledge, the work closest to ours is Wojnowicz et al. (2016), which introduced a method for approximating a quantity related to influence in generalized linear models. One obstacle to adoption is that influence functions require expensive second derivative calculations and assume model differentiability and convexity, which limits their applicability in modern contexts where models are often non-differentiable, non-convex, and high-dimensional. We address these challenges by showing that we can efficiently approximate influence functions using second-order optimization techniques (Pearlmutter, 1994; Martens, 2010; Agarwal et al., 2016), and that they remain accurate even as the underlying assumptions of differentiability and convexity degrade.

Influence functions capture the core idea of studying models through the lens of their training data. We show that they are a versatile tool that can be applied to a wide variety of seemingly disparate tasks: understanding model behavior, debugging models, detecting dataset errors, and creating visually-indistinguishable adversarial training examples that can flip neural network test predictions, the training set analogue of Goodfellow et al. (2015).

Approach

Consider a prediction problem from some input space $\mathcal{X}$ (e.g., images) to an output space $\mathcal{Y}$ (e.g., labels). We are given training points $z_{1},\ldots,z_{n}$, where $z_{i}=(x_{i},y_{i})\in\mathcal{X}\times\mathcal{Y}$. For a point $z$ and parameters $\theta\in\Theta$, let $L(z,\theta)$ be the loss, and let $\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta)$ be the empirical risk. The empirical risk minimizer is given by $\hat{\theta}\stackrel{{\scriptstyle\rm def}}{{=}}\arg\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta)$. (We fold any regularization terms into $L$.) Assume that the empirical risk is twice-differentiable and strictly convex in $\theta$; in Section 4 we explore relaxing these assumptions.

Our goal is to understand the effect of training points on a model’s predictions. We formalize this goal by asking the counterfactual: how would the model’s predictions change if we did not have this training point?

Let us begin by studying the change in model parameters due to removing a point $z$ from the training set. Formally, this change is $\hat{\theta}_{-z}-\hat{\theta}$ , where $\hat{\theta}_{-z}\stackrel{{\scriptstyle\rm def}}{{=}}\arg\min_{\theta\in\Theta}\sum_{z_{i}\neq z}L(z_{i},\theta)$ . However, retraining the model for each removed $z$ is prohibitively slow.

Fortunately, influence functions give us an efficient approximation. The idea is to compute the parameter change if $z$ were upweighted by some small $\epsilon$, giving us new parameters $\hat{\theta}_{\epsilon,z}\stackrel{{\scriptstyle\rm def}}{{=}}\arg\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta)+\epsilon L(z,\theta)$. A classic result (Cook & Weisberg, 1982) tells us that the influence of upweighting $z$ on the parameters $\hat{\theta}$ is given by

$$\mathcal{I}_{\text{up,params}}(z)\stackrel{{\scriptstyle\rm def}}{{=}}\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0}=-H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta}), \tag{1}$$

where $H_{\hat{\theta}}\stackrel{{\scriptstyle\rm def}}{{=}}\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{\theta}L(z_{i},\hat{\theta})$ is the Hessian and is positive definite (PD) by assumption. In essence, we form a quadratic approximation to the empirical risk around $\hat{\theta}$ and take a single Newton step; see Appendix A for a derivation. Since removing a point $z$ is the same as upweighting it by $\epsilon=-\frac{1}{n}$, we can linearly approximate the parameter change due to removing $z$ without retraining the model by computing $\hat{\theta}_{-z}-\hat{\theta}\approx-\frac{1}{n}\mathcal{I}_{\text{up,params}}(z)$.
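The removal approximation $\hat{\theta}_{-z}-\hat{\theta}\approx-\frac{1}{n}\mathcal{I}_{\text{up,params}}(z)$, with $\mathcal{I}_{\text{up,params}}(z)=-H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta})$, can be checked numerically on a model where retraining is cheap. The following is a minimal sketch, assuming a ridge-regression loss $L(z,\theta)=\frac{1}{2}(\theta^{\top}x-y)^{2}+\frac{\lambda}{2}\|\theta\|^{2}$ on synthetic data; the loss, the data, and all variable names are illustrative assumptions, not the setup used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 1000, 5, 0.1                      # illustrative sizes (assumption)
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

def fit_ridge(X, y, lam):
    """Minimize (1/m) * sum_i [0.5*(theta.x_i - y_i)^2 + 0.5*lam*||theta||^2]."""
    m = X.shape[0]
    H = X.T @ X / m + lam * np.eye(X.shape[1])   # Hessian of the empirical risk
    return np.linalg.solve(H, X.T @ y / m), H

theta_hat, H = fit_ridge(X, y, lam)

j = 0                                            # index of the training point z to remove
grad_z = (X[j] @ theta_hat - y[j]) * X[j] + lam * theta_hat   # grad_theta L(z, theta_hat)
I_up_params = -np.linalg.solve(H, grad_z)        # influence of upweighting z
predicted_change = -I_up_params / n              # removal corresponds to eps = -1/n

# Ground truth: actually drop the point and refit.
theta_loo, _ = fit_ridge(np.delete(X, j, axis=0), np.delete(y, j), lam)
actual_change = theta_loo - theta_hat

rel_err = (np.linalg.norm(predicted_change - actual_change)
           / np.linalg.norm(actual_change))
print(f"relative error of the influence approximation: {rel_err:.4f}")
```

For a quadratic loss the only source of error is the change in the Hessian when one point is dropped, so the relative error should shrink roughly like $1/n$.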

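Next, we apply the chain rule to measure how upweighting $z$ changes functions of $\hat{\theta}$. In particular, the influence of upweighting $z$ on the loss at a test point $z_{\text{test}}$ again has a closed-form expression:

$$\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})\stackrel{{\scriptstyle\rm def}}{{=}}\frac{dL(z_{\text{test}},\hat{\theta}_{\epsilon,z})}{d\epsilon}\Big|_{\epsilon=0}=\nabla_{\theta}L(z_{\text{test}},\hat{\theta})^{\top}\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0}=-\nabla_{\theta}L(z_{\text{test}},\hat{\theta})^{\top}H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta}). \tag{2}$$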

2.2 Perturbing a training input

Let us develop a finer-grained notion of influence by studying a different counterfactual: how would the model’s predictions change if a training input were modified?

For a training point $z=(x,y)$, define $z_{\delta}\stackrel{{\scriptstyle\rm def}}{{=}}(x+\delta,y)$. Consider the perturbation $z\mapsto z_{\delta}$, and let $\hat{\theta}_{z_{\delta},-z}$ be the empirical risk minimizer on the training points with $z_{\delta}$ in place of $z$. To approximate its effects, define the parameters resulting from moving $\epsilon$ mass from $z$ onto $z_{\delta}$: $\hat{\theta}_{\epsilon,z_{\delta},-z}\stackrel{{\scriptstyle\rm def}}{{=}}\arg\min_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta)+\epsilon L(z_{\delta},\theta)-\epsilon L(z,\theta)$. An analogous calculation to (1) yields:

$$\frac{d\hat{\theta}_{\epsilon,z_{\delta},-z}}{d\epsilon}\Big|_{\epsilon=0}=\mathcal{I}_{\text{up,params}}(z_{\delta})-\mathcal{I}_{\text{up,params}}(z)=-H_{\hat{\theta}}^{-1}\big(\nabla_{\theta}L(z_{\delta},\hat{\theta})-\nabla_{\theta}L(z,\hat{\theta})\big). \tag{3}$$

As before, we can make the linear approximation $\hat{\theta}_{z_{\delta},-z}-\hat{\theta}\approx\frac{1}{n}(\mathcal{I}_{\text{up,params}}(z_{\delta})-\mathcal{I}_{\text{up,params}}(z))$ , giving us a closed-form estimate of the effect of $z\mapsto z_{\delta}$ on the model. Analogous equations also apply for changes in $y$ . While influence functions might appear to only work for infinitesimal (therefore continuous) perturbations, it is important to note that this approximation holds for arbitrary $\delta$ : the $\epsilon$ -upweighting scheme allows us to smoothly interpolate between $z$ and $z_{\delta}$ . This is particularly useful for working with discrete data (e.g., in NLP) or with discrete label changes.

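If $x$ is continuous and $\delta$ is small, a first-order expansion in $x$ gives $\nabla_{\theta}L(z_{\delta},\hat{\theta})-\nabla_{\theta}L(z,\hat{\theta})\approx[\nabla_{x}\nabla_{\theta}L(z,\hat{\theta})]\delta$, where $\nabla_{x}\nabla_{\theta}L(z,\hat{\theta})\in\mathbb{R}^{p\times d}$ for $x\in\mathbb{R}^{d}$ and $\theta\in\mathbb{R}^{p}$. Combined with $\mathcal{I}_{\text{up,params}}(z)=-H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta})$ and the linear approximation above, we thus have $\hat{\theta}_{z_{\delta},-z}-\hat{\theta}\approx-\frac{1}{n}H_{\hat{\theta}}^{-1}[\nabla_{x}\nabla_{\theta}L(z,\hat{\theta})]\delta$. Differentiating w.r.t. $\delta$ and applying the chain rule gives us

$$\mathcal{I}_{\text{pert,loss}}(z,z_{\text{test}})^{\top}\stackrel{{\scriptstyle\rm def}}{{=}}\nabla_{\delta}L(z_{\text{test}},\hat{\theta}_{z_{\delta},-z})^{\top}\Big|_{\delta=0}=-\frac{1}{n}\nabla_{\theta}L(z_{\text{test}},\hat{\theta})^{\top}H_{\hat{\theta}}^{-1}\nabla_{x}\nabla_{\theta}L(z,\hat{\theta}). \tag{5}$$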

$[\mathcal{I}_{\text{pert,loss}}(z,z_{\text{test}})]\delta$ tells us the approximate effect that $z\mapsto z+\delta$ has on the loss at $z_{\text{test}}$ . By setting $\delta$ in the direction of $\mathcal{I}_{\text{pert,loss}}(z,z_{\text{test}})^{\top}$ , we can construct local perturbations of $z$ that maximally increase the loss at $z_{\text{test}}$ . In Section 5.2, we will use this to construct training-set attacks. Finally, we note that $\mathcal{I}_{\text{pert,loss}}(z,z_{\text{test}})$ can help us identify the features of $z$ that are most responsible for the prediction on $z_{\text{test}}$ .

2.3 Relation to Euclidean distance

To find the training points most relevant to a test point, it is common to look at its nearest neighbors in Euclidean space (e.g., Ribeiro et al. (2016)); if all points have the same norm, this is equivalent to choosing $x$ with the largest $x\cdot x_{\text{test}}$ . For intuition, we compare this to $\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})$ on a logistic regression model and show that influence is much more accurate at accounting for the effect of training.

Let $p(y\mid x)=\sigma(y\theta^{\top}x)$, with $y\in\{-1,1\}$ and $\sigma(t)=\frac{1}{1+\exp(-t)}$. We seek to maximize the probability of the training set. For a training point $z=(x,y)$, $L(z,\theta)=\log(1+\exp(-y\theta^{\top}x))$, $\nabla_{\theta}L(z,\theta)=-\sigma(-y\theta^{\top}x)yx$, and $H_{\theta}=\frac{1}{n}\sum_{i=1}^{n}\sigma(\theta^{\top}x_{i})\sigma(-\theta^{\top}x_{i})x_{i}x_{i}^{\top}$. From (2), $\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})$ is:

$$-y_{\text{test}}\,y\cdot\sigma(-y_{\text{test}}\hat{\theta}^{\top}x_{\text{test}})\cdot\sigma(-y\hat{\theta}^{\top}x)\cdot x_{\text{test}}^{\top}H_{\hat{\theta}}^{-1}x.$$

We highlight two key differences from $x\cdot x_{\text{test}}$ . First, $\sigma(-y\theta^{\top}x)$ gives points with high training loss more influence, revealing that outliers can dominate the model parameters. Second, the weighted covariance matrix $H_{\hat{\theta}}^{-1}$ measures the “resistance” of the other training points to the removal of $z$ ; if $\nabla_{\theta}L(z,\hat{\theta})$ points in a direction of little variation, its influence will be higher since moving in that direction will not significantly increase the loss on other training points. As we show in Fig 1, these differences mean that influence functions capture the effect of model training much more accurately than nearest neighbors.
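The logistic-regression expression can be compared directly against the generic form $-\nabla_{\theta}L(z_{\text{test}},\hat{\theta})^{\top}H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z,\hat{\theta})$ and against the plain dot product $x\cdot x_{\text{test}}$. The sketch below does this on synthetic data, fitting the unregularized model with a few Newton steps; the data-generating parameters, the test point, and the choice of optimizer are assumptions of the example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
n, p = 300, 3
X = rng.normal(size=(n, p))
# Noisy labels in {-1, +1} drawn from a logistic model (assumed ground truth).
y = np.where(sigmoid(X @ np.array([1.0, -2.0, 0.5])) > rng.random(n), 1.0, -1.0)

def hessian(theta):
    w = sigmoid(X @ theta) * sigmoid(-(X @ theta))
    return (X * w[:, None]).T @ X / n

def gradient(theta):
    # Mean over i of  -sigma(-y_i theta.x_i) y_i x_i
    return -(X * (sigmoid(-y * (X @ theta)) * y)[:, None]).mean(axis=0)

theta = np.zeros(p)
for _ in range(25):                      # plain Newton steps, no line search
    theta = theta - np.linalg.solve(hessian(theta), gradient(theta))

H = hessian(theta)
x_test, y_test = np.array([0.5, -1.0, 0.2]), 1.0   # hypothetical test point

# Closed form: -y_test*y*sigma(-y_test th.x_test)*sigma(-y th.x) * x_test^T H^-1 x
s = np.linalg.solve(H, x_test)
closed_form = (-y_test * y * sigmoid(-y_test * (x_test @ theta))
               * sigmoid(-y * (X @ theta)) * (X @ s))

# Generic form: -grad L(z_i)^T H^-1 grad L(z_test), for every training point i
g_test = -sigmoid(-y_test * (x_test @ theta)) * y_test * x_test
G = -(sigmoid(-y * (X @ theta)) * y)[:, None] * X
generic = -G @ np.linalg.solve(H, g_test)

dot_products = X @ x_test                # nearest-neighbor style baseline
print(np.corrcoef(np.abs(closed_form), np.abs(dot_products))[0, 1])
```

The two influence computations agree at any $\theta$; how far they depart from the dot-product ranking depends on the training losses and on $H_{\hat{\theta}}^{-1}$, which is the point made in the paragraph above.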

Efficiently calculating influence

Two computational challenges arise in evaluating $\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})$: explicitly forming and inverting the Hessian $H_{\hat{\theta}}$ is expensive, and we often want the influence of every training point $z_{i}$ on a given test point. The first problem is well-studied in second-order optimization. The idea is to avoid explicitly computing $H_{\hat{\theta}}^{-1}$; instead, we use implicit Hessian-vector products (HVPs) to efficiently approximate $s_{\text{test}}\stackrel{{\scriptstyle\rm def}}{{=}}H_{\hat{\theta}}^{-1}\ \nabla_{\theta}L(z_{\text{test}},\hat{\theta})$ and then compute $\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})=-s_{\text{test}}\cdot\nabla_{\theta}L(z,\hat{\theta})$. This also solves the second problem: for each test point of interest, we can precompute $s_{\text{test}}$ and then efficiently compute $-s_{\text{test}}\cdot\nabla_{\theta}L(z_{i},\hat{\theta})$ for each training point $z_{i}$.

We discuss two techniques for approximating $s_{\text{test}}$ , both relying on the fact that the HVP of a single term in $H_{\hat{\theta}}$ , $[\nabla^{2}_{\theta}L(z_{i},\hat{\theta})]v$ , can be computed for arbitrary $v$ in the same time that $\nabla_{\theta}L(z_{i},\hat{\theta})$ would take, which is typically $O(p)$ (Pearlmutter, 1994).
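As an illustration of computing $H_{\hat{\theta}}v$ without materializing the Hessian, the sketch below uses the logistic-regression Hessian, for which the product has a simple analytic form. This is a special case chosen so the example needs only NumPy; it is not Pearlmutter's general automatic-differentiation algorithm, and the data and the function name `hvp` are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)
n, p = 2000, 50
X = rng.normal(size=(n, p))
theta = rng.normal(size=p) / np.sqrt(p)   # stand-in parameters
v = rng.normal(size=p)                    # arbitrary vector to multiply

def hvp(theta, v):
    """H_theta @ v for the logistic-loss Hessian, without building the p x p matrix.

    Uses H v = (1/n) * X^T (w * (X v)) with w_i = sigma(theta.x_i) * sigma(-theta.x_i),
    which costs O(n p) time and O(n + p) extra memory.
    """
    margins = X @ theta
    w = sigmoid(margins) * sigmoid(-margins)
    return X.T @ (w * (X @ v)) / n

# Reference: form the Hessian explicitly (O(n p^2) time, O(p^2) memory).
w = sigmoid(X @ theta) * sigmoid(-(X @ theta))
H_explicit = (X * w[:, None]).T @ X / n

print(np.max(np.abs(hvp(theta, v) - H_explicit @ v)))
```

In an automatic-differentiation framework the same product would typically be obtained by differentiating the gradient-vector inner product, so that only the loss needs to be specified.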

Conjugate gradients (CG). The first technique is a standard transformation of matrix inversion into an optimization problem. Since $H_{\hat{\theta}}\succ 0$ by assumption, $H_{\hat{\theta}}^{-1}v\equiv\arg\min_{t}\{\frac{1}{2}t^{\top}H_{\hat{\theta}}t-v^{\top}t\}$. We can solve this with CG approaches that only require the evaluation of $H_{\hat{\theta}}t$, which takes $O(np)$ time, without explicitly forming $H_{\hat{\theta}}$. While an exact solution takes $p$ CG iterations, in practice we can get a good approximation with fewer iterations; see Martens (2010) for more details.
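A minimal sketch of the CG approach follows: it solves $H_{\hat{\theta}}t=v$ using only Hessian-vector products and compares the result with a direct solve. The textbook CG loop, the stand-in curvature weights, the damping value, and the function names are assumptions for illustration; a production implementation would more likely call an existing solver.

```python
import numpy as np

def conjugate_gradient(hvp, v, max_iter=100, tol=1e-10):
    """Approximately solve H t = v for symmetric positive-definite H, given only t -> H t."""
    t = np.zeros_like(v)
    r = v - hvp(t)            # residual; equals v at t = 0
    d = r.copy()              # current search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if tol > np.sqrt(rs_old):
            break
        Hd = hvp(d)
        alpha = rs_old / (d @ Hd)
        t = t + alpha * d
        r = r - alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return t

rng = np.random.default_rng(5)
n, p, lam = 1000, 20, 0.01
X = rng.normal(size=(n, p))
w = rng.uniform(0.05, 0.25, size=n)       # stand-in per-example curvature weights

def hessian_vector_product(u):
    # (1/n) X^T diag(w) X u + lam * u, evaluated in O(n p) without forming H
    return X.T @ (w * (X @ u)) / n + lam * u

grad_test = rng.normal(size=p)            # stand-in for grad_theta L(z_test, theta_hat)
s_test = conjugate_gradient(hessian_vector_product, grad_test)

# Reference solution with the explicit Hessian, feasible only because p is small here.
H = (X * w[:, None]).T @ X / n + lam * np.eye(p)
s_exact = np.linalg.solve(H, grad_test)
print(np.linalg.norm(s_test - s_exact))
```

Stopping early (a smaller `max_iter`) trades accuracy for time, which is the regime described above where fewer than $p$ iterations already give a usable $s_{\text{test}}$.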

Stochastic estimation. With large datasets, standard CG can be slow; each iteration still goes through all $n$ training points. We use a method developed by Agarwal et al. (2016) to get an estimator that only samples a single point per iteration, which results in significant speedups.

We note that the original method of Agarwal et al. (2016) dealt only with generalized linear models, for which $[\nabla^{2}_{\theta}L(z_{i},\hat{\theta})]v$ can be efficiently computed in $O(p)$ time. In our case, we rely on Pearlmutter (1994)’s more general algorithm for fast HVPs, described above, to achieve the same time complexity. To increase stability, especially with non-convex models (see Section 4.2), we can also sample a minibatch of training points at each iteration, instead of relying on a single training point.
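The sketch below shows one way such a sampled estimator can be written, assuming it takes the form of a truncated Neumann-series recursion $h_{j}=v+(I-\tilde{H}_{j}/c)\,h_{j-1}$, where $\tilde{H}_{j}$ is a minibatch Hessian and $c$ is a scaling constant that keeps the iteration stable. The recursion form, the choice of $c$, the averaging over repeats, the squared-loss Hessian, and all names are assumptions of this example rather than a transcription of Agarwal et al.'s algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 1000, 5, 0.1
X = rng.normal(size=(n, p))
v = rng.normal(size=p)

# Per-example Hessians here are x_i x_i^T + lam*I (squared loss with an L2 penalty).
# `scale` upper-bounds their spectral norm, so I - H_batch/scale has eigenvalues in [0, 1].
scale = (X ** 2).sum(axis=1).max() + lam

def inverse_hvp_stochastic(v, r=10, t=1000, batch_size=32):
    """Estimate H^{-1} v from sampled Hessian-vector products."""
    estimates = []
    for _ in range(r):                            # r independent repeats
        h = v.copy()
        for _ in range(t):                        # t recursion steps per repeat
            idx = rng.integers(0, n, size=batch_size)
            Xb = X[idx]
            Hh = Xb.T @ (Xb @ h) / batch_size + lam * h   # minibatch HVP
            h = v + h - Hh / scale                # h_j = v + (I - H_batch/scale) h_{j-1}
        estimates.append(h / scale)               # undo scaling: H^{-1} = (H/scale)^{-1} / scale
    return np.mean(estimates, axis=0)             # average the repeats

approx = inverse_hvp_stochastic(v)
H = X.T @ X / n + lam * np.eye(p)                 # explicit Hessian, for reference only
exact = np.linalg.solve(H, v)
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(f"relative error: {rel_err:.3f}")
```

Each inner step touches only `batch_size` training points, so the cost is independent of $n$; larger minibatches and more repeats reduce the variance of the estimate at proportional cost.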

With these techniques, we can compute $\mathcal{I}_{\text{up,loss}}(z_{i},z_{\text{test}})$ on all training points $z_{i}$ in $O(np+rtp)$ time, where $t$ is the number of iterations of the stochastic estimator and $r$ the number of independent repeats; we show in Section 4.1 that empirically, choosing $rt=O(n)$ gives accurate results. Similarly, we can compute $\mathcal{I}_{\text{pert,loss}}(z_{i},z_{\text{test}})=-\frac{1}{n}\nabla_{\theta}L(z_{\text{test}},\hat{\theta})^{\top}H_{\hat{\theta}}^{-1}\nabla_{x}\nabla_{\theta}L(z_{i},\hat{\theta})$ with two matrix-vector products: we first compute $s_{\text{test}}$, then find $s_{\text{test}}^{\top}\nabla_{x}\nabla_{\theta}L(z_{i},\hat{\theta})$ with the same HVP trick. These computations are easy to implement in auto-grad systems like TensorFlow (Abadi et al., 2015) and Theano (Theano D. Team, 2016), as users need only specify the loss; the rest is automatically handled.
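The reuse of $s_{\text{test}}$ across training points can be sketched as follows: one linear solve per test point, then a single matrix-vector product yields the influence of every training point. The logistic loss with an L2 penalty, the stand-in parameter vector (used in place of a fitted $\hat{\theta}$), and the variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(7)
n, p, lam = 500, 4, 0.01
X = rng.normal(size=(n, p))
y = np.where(sigmoid(X @ np.array([2.0, -1.0, 0.0, 1.0])) > rng.random(n), 1.0, -1.0)
theta = np.array([1.8, -0.9, 0.1, 1.1])      # stand-in for fitted parameters (assumption)
x_test, y_test = rng.normal(size=p), 1.0     # hypothetical test point

margins = y * (X @ theta)
G = -(sigmoid(-margins) * y)[:, None] * X + lam * theta   # row i = grad_theta L(z_i, theta)
w = sigmoid(X @ theta) * sigmoid(-(X @ theta))
H = (X * w[:, None]).T @ X / n + lam * np.eye(p)

g_test = -sigmoid(-y_test * (x_test @ theta)) * y_test * x_test
s_test = np.linalg.solve(H, g_test)          # computed once per test point
influences = -G @ s_test                     # I_up,loss(z_i, z_test) for all i in O(n p)

most_harmful = np.argsort(-influences)[:5]   # upweighting these raises the test loss most
most_helpful = np.argsort(influences)[:5]    # upweighting these lowers the test loss most
print(most_helpful, most_harmful)
```

With a large model, the direct solve for `s_test` would be replaced by CG or the stochastic estimator, while the final product over training gradients stays the same.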

Validation and extensions

Recall that influence functions are asymptotic approximations of leave-one-out retraining under the assumptions that (i) the model parameters $\hat{\theta}$ minimize the empirical risk, and that (ii) the empirical risk is twice-differentiable and strictly convex. Here, we empirically show that influence functions are accurate approximations (Section 4.1) that provide useful information even when these assumptions are violated (Sections 4.2, 4.3).

Influence functions assume that the weight on a training point is changed by an infinitesimally small $\epsilon$. To investigate the accuracy of using influence functions to approximate the effect of removing a training point and retraining, we compared $-\frac{1}{n}\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})$ with $L(z_{\text{test}},\hat{\theta}_{-z})-L(z_{\text{test}},\hat{\theta})$ (i.e., actually doing leave-one-out retraining). With a logistic regression model on 10-class MNIST, the predicted and actual changes matched closely (Fig 2-Left). We trained with L-BFGS (Liu & Nocedal, 1989), with L2 regularization of $0.01$, $n=55,000$, and $p=7,840$ parameters.
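A small-scale analogue of this comparison is sketched below on synthetic binary data: for a handful of training points it computes $-\frac{1}{n}\mathcal{I}_{\text{up,loss}}(z,z_{\text{test}})$ and the actual change in test loss after leave-one-out retraining. The dataset, the Newton optimizer (in place of L-BFGS), the problem sizes, and the unregularized test loss are assumptions; this is not the MNIST experiment itself.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss(theta, x, y):
    return np.logaddexp(0.0, -y * (x @ theta))        # log(1 + exp(-y theta.x))

def hessian(theta, X, lam):
    w = sigmoid(X @ theta) * sigmoid(-(X @ theta))
    return (X * w[:, None]).T @ X / X.shape[0] + lam * np.eye(X.shape[1])

def fit(X, y, lam, iters=30):
    """Newton's method on mean logistic loss + 0.5*lam*||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = (-(X * (sigmoid(-y * (X @ theta)) * y)[:, None]).mean(axis=0)
                + lam * theta)
        theta = theta - np.linalg.solve(hessian(theta, X, lam), grad)
    return theta

rng = np.random.default_rng(8)
n, p, lam = 400, 4, 0.01
X = rng.normal(size=(n, p))
y = np.where(sigmoid(X @ np.array([1.0, -1.0, 0.5, 0.0])) > rng.random(n), 1.0, -1.0)
x_test, y_test = rng.normal(size=p), 1.0              # hypothetical test point

theta_hat = fit(X, y, lam)
H = hessian(theta_hat, X, lam)
g_test = -sigmoid(-y_test * (x_test @ theta_hat)) * y_test * x_test
s_test = np.linalg.solve(H, g_test)

predicted, actual = [], []
for j in range(20):                                   # first 20 training points only
    grad_j = -sigmoid(-y[j] * (X[j] @ theta_hat)) * y[j] * X[j] + lam * theta_hat
    predicted.append((s_test @ grad_j) / n)           # equals -(1/n) * I_up,loss(z_j, z_test)
    theta_loo = fit(np.delete(X, j, axis=0), np.delete(y, j), lam)
    actual.append(loss(theta_loo, x_test, y_test) - loss(theta_hat, x_test, y_test))

corr = np.corrcoef(predicted, actual)[0, 1]
print(f"correlation between predicted and actual changes: {corr:.4f}")
```

Retraining must be run to tight convergence for this comparison to be meaningful, because the loss differences being measured are on the order of $1/n$.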

The stochastic approximation from Agarwal et al. (2016) was also accurate with $r=10$ repeats and $t=5,000$ iterations (Fig 2-Mid). Since each iteration only requires one HVP $[\nabla^{2}_{\theta}L(z_{i},\hat{\theta})]v$ , this runs quickly: in fact, we accurately estimated $H^{-1}v$ without even looking at every data point, since $n=55,000>rt$ . Surprisingly, even $r=1$ worked; while results were noisier, it was still able to identify the most influential points.

4.2 Non-convexity and non-convergence

4.3 Non-differentiable losses

What happens when the derivatives of the loss, $\nabla_{\theta}L$ and $\nabla_{\theta}^{2}L$ , do not exist? In this section, we show that influence functions computed on smooth approximations to non-differentiable losses can predict the behavior of the original, non-differentiable loss under leave-one-out retraining. The robustness of this approximation suggests that we can train non-differentiable models and swap out non-differentiable components for smoothed versions for the purposes of calculating influence.

Use cases of influence functions

By telling us the training points “responsible” for a given prediction, influence functions reveal insights about how models rely on and extrapolate from the training data. In this section, we show that two models can make the same correct predictions but get there in very different ways.

As expected, $\mathcal{I}_{\text{up,loss}}$ in the RBF SVM varied inversely with raw pixel distance, with training images far from the test image in pixel space having almost no influence; the Inception influences were much less correlated with distance in pixel space (Fig 4-Left). Looking at the two most helpful images (most positive $-\mathcal{I}_{\text{up,loss}}$) for each model in Fig 4-Right, we see that the Inception network picked up on the distinctive characteristics of clownfish, whereas the RBF SVM pattern-matched training images superficially.

Moreover, in the RBF SVM, fish (green points) close to the test image were mostly helpful, while dogs (red) were mostly harmful, with the RBF acting as a soft nearest neighbor function (Fig 4-Left). In contrast, in the Inception network, fish and dogs could be helpful or harmful for correctly classifying the test image as a fish; in fact, the 5th most helpful training image was a dog that, to the model, looked very different from the test fish (Fig 4-Top).

5.2 Adversarial training examples

In this section, we show that models that place a lot of influence on a small number of points can be vulnerable to training input perturbations, posing a serious security risk in real-world ML systems where attackers can influence the training data (Huang et al., 2011). Recent work has generated adversarial test images that are visually indistinguishable from real test images but completely fool a classifier (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016). We demonstrate that influence functions can be used to craft adversarial training images that are similarly visually-indistinguishable and can flip a model’s prediction on a separate test image. To the best of our knowledge, this is the first proof-of-concept that visually-indistinguishable training attacks can be executed on otherwise highly-accurate neural networks.

We tested these adversarial training perturbations on the same Inception network on dogs vs. fish from Section 5.1, choosing this pair of animals to provide a stark contrast between the classes. We set $\alpha=0.02$ and ran the attack for 100 iterations on each test image. As before, we froze all but the top layer for training; note that computing $\mathcal{I}_{\text{pert,loss}}$ still involves differentiating through the entire network. Originally, the model correctly classified 591 / 600 test images. For each of these 591 test images, considered separately, we tried to find a visually-indistinguishable perturbation (i.e., same 8-bit representation) to a single training image, out of 1,800 total training images, that would flip the model’s prediction. We were able to do this on 335 (57%) of the 591 test images. If we perturbed 2 training images for each test image, we could flip predictions on 77% of the 591 test images; and if we perturbed 10 training images, we could flip all but 1 of the 591. The above results are from attacking each test image separately, i.e., we use a different training set to attack each test image. We next tried to attack multiple test images simultaneously by increasing their average test loss, and found that single training image perturbations could simultaneously flip multiple test predictions as well (Fig 5).

We make three observations about these attacks. First, though the change in pixel values is small, the change in the final Inception feature layer is significantly larger: in pixel space and using $L_{2}$ distance, the training values change by less than $1\%$ of the mean distance of a training point to the class centroid, whereas in Inception feature space, the change is on the same order as the mean distance. Second, the attack tries to perturb the training example in a direction of low variance, causing the model to overfit in that direction and consequently incorrectly classify the test images; we expect the attack to become harder as the number of training examples grows. Third, ambiguous or mislabeled training images are effective points to attack, since the model has low confidence and thus high loss on them, making them highly influential (recall Section 2.3). For example, the image in Fig 5 contains both a dog and a fish and is highly ambiguous; as a result, it is the training example that the model is least confident on (with a confidence of 77%, compared to the next lowest confidence of 90%).

This attack is mathematically equivalent to the gradient-based training set attacks explored by Biggio et al. (2012), Mei & Zhu (2015b), and others in the context of different models. Biggio et al. (2012) constructed a dataset poisoning attack against a linear SVM on a two-class MNIST task, but had to modify the training points in an obviously distinguishable way to be effective. Measuring the magnitude of $\mathcal{I}_{\text{pert,loss}}$ gives model developers a way of quantifying how vulnerable their models are to training-set attacks.

5.3 Debugging domain mismatch

Domain mismatch — where the training distribution does not match the test distribution — can cause models with high training accuracy to do poorly on test data (Ben-David et al., 2010). We show that influence functions can identify the training examples most responsible for the errors, helping model developers identify domain mismatch.

As a case study, we predicted whether a patient would be readmitted to a hospital. Domain mismatches are common in biomedical data; for example, different hospitals can serve very different populations, and readmission models trained on one population can do poorly on another (Kansagara et al., 2011). We used logistic regression to predict readmission with a balanced training dataset of 20K diabetic patients from 100+ US hospitals, each represented by 127 features (Strack et al., 2014). Hospital readmission was defined as whether a patient would be readmitted within the next 30 days. Features were demographic (e.g., age, race, gender), administrative (e.g., length of hospital stay), or medical (e.g., test results).

Of the 24 children under age 10 in this dataset, 3 were readmitted. To induce a domain mismatch, we filtered out 20 children who were not readmitted, leaving 3 out of 4 readmitted. This caused the model to wrongly classify many children in the test set. Our aim is to identify the 4 children in the training set as being “responsible” for these errors.

As a baseline, we tried the common practice of looking at the learned parameters $\hat{\theta}$ to see if the indicator variable for being a child was obviously different. However, this did not work: 14/127 features had a larger coefficient.

Picking a random child $z_{\text{test}}$ that the model got wrong, we calculated $-\mathcal{I}_{\text{up,loss}}(z_{i},z_{\text{test}})$ for each training point $z_{i}$. This clearly highlighted the 4 training children, each of whom was 30-40 times as influential as the next most influential examples. The 1 child in the training set who was not readmitted had a very positive influence, while the other 3 had very negative influences. Calculating $\mathcal{I}_{\text{pert,loss}}$ on these 4 children showed that a change in the ‘child’ indicator variable had by far the largest effect on $\mathcal{I}_{\text{up,loss}}$.

5.4 Fixing mislabeled examples

Labels in the real world are often noisy, especially if crowdsourced (Frénay & Verleysen, 2014), and can even be adversarially corrupted, as in Section 5.2. Even if a human expert could recognize wrongly labeled examples, it is impossible in many applications to manually review all of the training data. We show that influence functions can help human experts prioritize their attention, allowing them to inspect only the examples that actually matter.

The key idea is to flag the training points that exert the most influence on the model. Because we do not have access to the test set, we measure the influence of $z_{i}$ with $\mathcal{I}_{\text{up,loss}}(z_{i},z_{i})$ , which approximates the error incurred on $z_{i}$ if we remove $z_{i}$ from the training set.
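The self-influence score $\mathcal{I}_{\text{up,loss}}(z_{i},z_{i})=-\nabla_{\theta}L(z_{i},\hat{\theta})^{\top}H_{\hat{\theta}}^{-1}\nabla_{\theta}L(z_{i},\hat{\theta})$ can be computed for every training point and used to order points for inspection. The sketch below does this for a logistic regression fitted to synthetic data with roughly 10% flipped labels; the data, the Newton optimizer, the 20% inspection budget, and the variable names are assumptions, and the printed recall is specific to this toy setup.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(9)
n, p, lam = 500, 5, 0.01
X = rng.normal(size=(n, p))
y_clean = np.where(X @ np.array([2.0, -1.0, 1.0, 0.0, 0.5]) > 0, 1.0, -1.0)
flipped = 0.10 > rng.random(n)                 # corrupt roughly 10% of the labels
y = np.where(flipped, -y_clean, y_clean)

def grads_and_hessian(theta):
    m = X @ theta
    G = -(sigmoid(-y * m) * y)[:, None] * X + lam * theta   # row i = grad L(z_i, theta)
    w = sigmoid(m) * sigmoid(-m)
    H = (X * w[:, None]).T @ X / n + lam * np.eye(p)
    return G, H

theta = np.zeros(p)
for _ in range(30):                            # Newton's method on the noisy labels
    G, H = grads_and_hessian(theta)
    theta = theta - np.linalg.solve(H, G.mean(axis=0))

G, H = grads_and_hessian(theta)                # quantities at the fitted parameters
# Self-influence -g_i^T H^{-1} g_i; non-positive because H is positive definite.
self_influence = -np.einsum("ij,ji->i", G, np.linalg.solve(H, G.T))

order = np.argsort(self_influence)             # largest magnitude (most negative) first
n_check = n // 5                               # simulate inspecting 20% of the data
recall = flipped[order[:n_check]].sum() / max(flipped.sum(), 1)
print(f"fraction of flipped labels among the inspected points: {recall:.2f}")
```

No test data is used anywhere in this procedure, matching the constraint stated above.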

Our case study is email spam classification, which relies on user-provided labels and is also vulnerable to adversarial attack (Biggio et al., 2011). We flipped the labels of a random 10% of the training data and then simulated manually inspecting a fraction of the training points, correcting them if they had been flipped. Using influence functions to prioritize the training points to inspect allowed us to repair the dataset (Fig 6, blue) without checking too many points, outperforming the baselines of checking points with the highest train loss (Fig 6, green) or at random (Fig 6, red). No method had access to the test data.

Related work

The use of influence-based diagnostics originated in statistics in the 70s, driven by the seminal papers of Hampel (1974) and Jaeckel (1972) (where it was called the infinitesimal jackknife). It was further developed in the book by Hampel et al. (1986) and many other contemporary papers (Cook, 1977; Cook & Weisberg, 1980; Pregibon et al., 1981; Cook & Weisberg, 1982). Earlier work focused on removing training points from linear models, with later work extending this to more general models and a wider variety of perturbations (Hampel et al., 1986; Cook, 1986; Thomas & Cook, 1990; Chatterjee & Hadi, 1986; Wei et al., 1998). Prior work mostly focused on experiments with small datasets, e.g., $n=24$ and $p=10$ in Cook & Weisberg (1980), and thus paid special attention to exact solutions, or if not possible, characterizations of the error terms.

Influence functions have not been used much in the ML literature, with some exceptions. Christmann & Steinwart (2004); Debruyne et al. (2008); Liu et al. (2014) use influence functions to study model robustness and to do fast cross-validation in kernel methods. Wojnowicz et al. (2016) use matrix sketching to estimate Cook’s distance, which is closely related to influence; they focus on prioritizing training points for human attention and derive methods specific to generalized linear models. Kabra et al. (2015) define a different notion of influence that is specialized to finite hypothesis classes.

As noted in Section 5.2, our training-set attack is mathematically equivalent to an approach first explored by Biggio et al. (2012) in the context of SVMs, with follow-up work extending the framework and applying it to linear and logistic regression (Mei & Zhu, 2015b), topic modeling (Mei & Zhu, 2015a), and collaborative filtering (Li et al., 2016a). These papers derived the attack directly from the KKT conditions without considering influence, though for continuous data, the end result is equivalent. Influence functions additionally let us consider attacks on discrete data (Section 2.2), but we have not tested this empirically. Our work connects the literature on training-set attacks with work on “adversarial examples” (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016), visually-imperceptible perturbations on test inputs.

In contrast to training-set attacks, Cadamuro et al. (2016) consider the task of taking an incorrect test prediction and finding a small subset of training data such that changing the labels on this subset makes the prediction correct. They provide a solution for OLS and Gaussian process models when the labels are continuous. Our work with influence functions allows us to solve this problem in a much larger range of models and in datasets with discrete labels.

Discussion

We have discussed a variety of applications, from creating training-set attacks to debugging models and fixing datasets. Underlying each of these applications is a common tool, influence functions, which are based on a simple idea—we can better understand model behavior by looking at how it was derived from its training data.

At their core, influence functions measure the effect of local changes: what happens when we upweight a point by an infinitesimally-small $\epsilon$ ? This locality allows us to derive efficient closed-form estimates, and as we show, they can be surprisingly effective. However, we might want to ask about more global changes, e.g., how does a subpopulation of patients from this hospital affect the model? Since influence functions depend on the model not changing too much, how to tackle this is an open question.

It seems inevitable that high-performing, complex, black-box models will become increasingly prevalent and important. We hope that the approach presented here—of looking at the model through the lens of the training data—will become a standard part of the toolkit of developing, understanding, and diagnosing machine learning.

Reproducibility

The code and data for replicating our experiments are available on GitHub (http://bit.ly/gt-influence) and Codalab (http://bit.ly/cl-influence).

Acknowledgements

We thank Jacob Steinhardt, Zhenghao Chen, and Hongseok Namkoong for helpful discussions and comments. We are also grateful to Doug Martin, Swee Keat Lim, and Teresa Yeo for finding typos and omissions in a previous version of the manuscript. This work was supported by a Future of Life Research Award and a Microsoft Research Faculty Fellowship.

Appendix A

For completeness, we provide a standard derivation of the influence function $\mathcal{I}_{\text{up,params}}$ in the context of loss minimization (M-estimation). This derivation is based on asymptotic arguments and is not fully rigorous; see van der Vaart (1998) and other statistics textbooks for a more thorough treatment.

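Recall that $\hat{\theta}$ minimizes the empirical risk:

$$R(\theta)\stackrel{{\scriptstyle\rm def}}{{=}}\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta). \tag{6}$$

We further assume that $R$ is twice-differentiable and strongly convex in $\theta$, i.e.,

$$H_{\hat{\theta}}\stackrel{{\scriptstyle\rm def}}{{=}}\nabla^{2}R(\hat{\theta})=\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{\theta}L(z_{i},\hat{\theta}) \tag{7}$$

exists and is positive definite. This guarantees the existence of $H_{\hat{\theta}}^{-1}$, which we will use in the subsequent derivation.

The perturbed parameters $\hat{\theta}_{\epsilon,z}$ can be written as

$$\hat{\theta}_{\epsilon,z}=\arg\min_{\theta\in\Theta}\{R(\theta)+\epsilon L(z,\theta)\}. \tag{8}$$

Define the parameter change $\Delta_{\epsilon}=\hat{\theta}_{\epsilon,z}-\hat{\theta}$, and note that, as $\hat{\theta}$ doesn’t depend on $\epsilon$, the quantity we seek to compute can be written in terms of it:

$$\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}=\frac{d\Delta_{\epsilon}}{d\epsilon}. \tag{9}$$

Since $\hat{\theta}_{\epsilon,z}$ is a minimizer of (8), let us examine its first-order optimality conditions:

$$0=\nabla R(\hat{\theta}_{\epsilon,z})+\epsilon\nabla L(z,\hat{\theta}_{\epsilon,z}). \tag{10}$$

Next, since $\hat{\theta}_{\epsilon,z}\to\hat{\theta}$ as $\epsilon\to 0$, we perform a Taylor expansion of the right-hand side:

$$0\approx\big[\nabla R(\hat{\theta})+\epsilon\nabla L(z,\hat{\theta})\big]+\big[\nabla^{2}R(\hat{\theta})+\epsilon\nabla^{2}L(z,\hat{\theta})\big]\Delta_{\epsilon}, \tag{11}$$

where we have dropped $o(\|\Delta_{\epsilon}\|)$ terms.

Solving for $\Delta_{\epsilon}$, we get

$$\Delta_{\epsilon}\approx-\big[\nabla^{2}R(\hat{\theta})+\epsilon\nabla^{2}L(z,\hat{\theta})\big]^{-1}\big[\nabla R(\hat{\theta})+\epsilon\nabla L(z,\hat{\theta})\big]. \tag{12}$$

Since $\hat{\theta}$ minimizes $R$, we have $\nabla R(\hat{\theta})=0$. Dropping $o(\epsilon)$ terms, we have

$$\Delta_{\epsilon}\approx-\nabla^{2}R(\hat{\theta})^{-1}\nabla L(z,\hat{\theta})\,\epsilon. \tag{13}$$

Combining with (7) and (9), we conclude that:

$$\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0}=-H_{\hat{\theta}}^{-1}\nabla L(z,\hat{\theta})\stackrel{{\scriptstyle\rm def}}{{=}}\mathcal{I}_{\text{up,params}}(z). \tag{14}$$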
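The conclusion of this derivation, $\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\big|_{\epsilon=0}=-H_{\hat{\theta}}^{-1}\nabla L(z,\hat{\theta})$, can be sanity-checked by finite differences on a model where $\hat{\theta}_{\epsilon,z}$ has a closed form. The sketch below assumes a ridge-regression loss and synthetic data; the loss, the value of $\epsilon$, and the helper name `theta_eps` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 4, 0.1
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)
x_z, y_z = X[0], y[0]                        # the upweighted training point z

H = X.T @ X / n + lam * np.eye(p)            # Hessian of R for ridge regression
b = X.T @ y / n
theta_hat = np.linalg.solve(H, b)            # minimizer of R

def theta_eps(eps):
    """Closed-form minimizer of R(theta) + eps * L(z, theta) for the ridge loss."""
    H_eps = H + eps * (np.outer(x_z, x_z) + lam * np.eye(p))
    return np.linalg.solve(H_eps, b + eps * y_z * x_z)

# Central finite difference of the perturbed minimizer at eps = 0.
eps = 1e-5
finite_diff = (theta_eps(eps) - theta_eps(-eps)) / (2 * eps)

# Influence formula: -H^{-1} grad L(z, theta_hat).
grad_z = (x_z @ theta_hat - y_z) * x_z + lam * theta_hat
influence = -np.linalg.solve(H, grad_z)

print(np.max(np.abs(finite_diff - influence)))
```

The two vectors agree up to finite-difference error, which reflects that the formula is exact as a derivative at $\epsilon=0$ and only becomes an approximation when extrapolated to finite $\epsilon$ such as $-\frac{1}{n}$.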