On the Accuracy of Influence Functions for Measuring Group Effects

Pang Wei Koh, Kai-Siang Ang, Hubert H. K. Teo, Percy Liang

Introduction

Influence functions [Jaeckel, 1972, Hampel, 1974, Cook, 1977] estimate the effect of removing an individual training point on a model’s predictions without the computationally-prohibitive cost of retraining the model. Tracing a model’s output back to its training data can be useful: influence functions have been recently applied to explain predictions [Koh and Liang, 2017], produce confidence intervals [Schulam and Saria, 2019], investigate model bias [Brunet et al., 2018, Wang et al., 2019], improve human trust [Zhou et al., 2019], and even craft data poisoning attacks [Koh et al., 2019].

Influence functions are based on first-order Taylor approximations that are accurate for estimating small perturbations to the model, which makes them suitable for predicting the effects of removing individual training points on the model. However, we often want to study the effects of removing groups of points, which represent large perturbations to the data. For example, we might wish to analyze the effect of data collected from different experimental batches [Leek et al., 2010] or demographic groups [Chen et al., 2018]; apportion credit between crowdworkers, each of whom generated part of the data [Arrieta-Ibarra et al., 2018]; or, in a multi-party learning setting, ensure that no individual user has too much influence on the joint model [Hayes and Ohrimenko, 2018]. Are influence functions still accurate when predicting the effects of (removing) these larger groups?

In this paper, we first show empirically that on real datasets and across a broad variety of groups of data, the predicted and actual effects are strikingly correlated (Spearman $\rho$ of $0.8$ to $1.0$ ), such that the groups with the largest actual effect also tend to have the largest predicted effect. Moreover, the predicted effect tends to underestimate the actual effect, suggesting that it could be an approximate lower bound in practice. Using influence functions to predict the actual effect of removing large, coherent groups of data can therefore still be useful, even though the violation of the small-perturbation assumption can result in high absolute and relative errors between the predicted and actual effects.

What explains these phenomena of correlation and underestimation? Prior theoretical work focused on establishing the conditions under which this influence approximation is accurate, i.e., the error between the actual and predicted effects is small [Giordano et al., 2019b, Rad and Maleki, 2018]. However, in our setting of removing large, coherent groups of data, this error can be quite large. As a first step towards understanding the behavior of the influence approximation in this regime, we characterize the relationship between the predicted and actual effects of a group via the one-step Newton approximation [Pregibon et al., 1981], which we find is a surprisingly accurate approximation in practice. We show that correlation and underestimation arise under certain settings (e.g., removing multiple copies of a single training point), but need not hold in general, which opens up the intriguing question of why we observe those phenomena across a wide range of empirical settings.

Finally, we exploit the correlation of predicted and actual group effects in two example case studies: a chemical-disease relationship (CDR) task, where the groups correspond to different labeling functions [Hancock et al., 2018], and a natural language inference (NLI) task [Williams et al., 2018], where the groups come from different crowdworkers. On the CDR task, we find that the influence of each labeling function correlates with its size (the number of examples it labels) but not its average accuracy, which suggests that practitioners should focus on the coverage of the labeling functions they construct. In contrast, on the NLI task, we find that the influence of each crowdworker is uncorrelated with the number of examples they contribute, which suggests that practitioners should focus on how to elicit high-quality examples from crowdworkers over increasing quantity.

Background and problem setup

that minimize the $L_{2}$ -regularized empirical risk, where $\lambda>0$ controls regularization strength. The all-ones vector $\mathbf{1}$ in $\hat{\theta}(\mathbf{1})$ denotes that the initial training points all have uniform sample weights.

corresponding to retraining the model after excluding $W$ . We refer to $w$ as the subset (corresponding to $W$ ); the number of removed points as $\|w\|_{1}$ ; and the fraction of removed points as $\alpha=\|w\|_{1}/n$ .

The issue with computing the actual effect $\mathcal{I}^{*}_{f}(w)$ is that retraining the model to compute $\hat{\theta}(\mathbf{1}-w)$ for each subset $w$ can be prohibitively expensive. Influence functions provide a relatively efficient first-order approximation to $\mathcal{I}^{*}_{f}(w)$ that avoids retraining.

2.2 Relation to prior work

Influence functions, introduced in the seminal work of Hampel [1974] and in Jaeckel [1972], where they were called the infinitesimal jackknife, have a rich history in robust statistics. The use of influence functions in the ML community is more recent, though growing; in Section 1, we provide references for several recent applications of influence functions in ML.

Removing a single training point, especially when the total number of points $n$ is large, represents a small perturbation to the training distribution, so we expect the first-order influence approximation to be accurate. Indeed, prior work on the accuracy of influence has focused on this regime: e.g., Debruyne et al., Liu et al., Rad and Maleki [2018], and Giordano et al. [2019b] give evidence that the influence on self-loss can approximate leave-one-out cross-validation (LOOCV), and Koh and Liang [2017] similarly examined the accuracy of estimating the change in test loss after removing single training points.

However, removing a constant fraction $\alpha$ of the training data represents a large perturbation to the training distribution. To the best of our knowledge, this setting has not been empirically studied; perhaps the closest work is Khanna et al.’s use of Bayesian quadrature to estimate a maximally influential subset. Instead, older references have alluded to the phenomena of correlation and underestimation we observe: Pregibon et al. [1981] note that influence tends to be conservative, while Hampel et al. say that “bold extrapolations” (i.e., large perturbations) are often still useful. On the theoretical front, Giordano et al. [2019b] established finite-sample error bounds that apply to groups, e.g., showing that the leave-$k$-out approximation is consistent as the fraction of removed points $\alpha\to 0$. Our focus is instead on the relationship of the actual effect $\mathcal{I}^{*}_{f}(w)$ and predicted effect (influence) $\mathcal{I}_{f}(w)$ in the regime where $\alpha$ is constant and the error $|\mathcal{I}^{*}_{f}(w)-\mathcal{I}_{f}(w)|$ is large.

Empirical accuracy of influence functions on constructed groups

How well do influence functions estimate the effect of (removing) a group of training points? If $n$ is large and we remove a subset $w$ uniformly at random, the new parameters $\hat{\theta}(\mathbf{1}-w)$ should remain close to $\hat{\theta}(\mathbf{1})$ even when the fraction of removed points $\alpha$ is non-negligible, so the influence error $|\mathcal{I}^{*}_{f}(w)-\mathcal{I}_{f}(w)|$ should be small. However, we are usually interested in removing coherent, non-random groups, e.g., all points that come from a single data source or share some feature. In such settings, the parameters $\hat{\theta}(\mathbf{1}-w)$ and $\hat{\theta}(\mathbf{1})$ might differ substantially, and the error $|\mathcal{I}^{*}_{f}(w)-\mathcal{I}_{f}(w)|$ could be large. Put another way, there could be a cluster of points such that removing one of those points would not change the model by much (so influence could be low) but removing all of them would.

Surprisingly (to us), we found that even when removing large and coherent groups of points, the influence $\mathcal{I}_{f}(w)$ behaved consistently relative to the actual effect $\mathcal{I}^{*}_{f}(w)$ on test predictions, test losses, and self-loss, with two broad phenomena emerging:

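Correlation: $\mathcal{I}_{f}(w)$ and $\mathcal{I}^{*}_{f}(w)$ rank subsets of points $w$ similarly (e.g., high Spearman $\rho$).

Underestimation: $\mathcal{I}_{f}(w)$ tends to underestimate the magnitude of $\mathcal{I}^{*}_{f}(w)$.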

Here, we report results on 5 datasets chosen to span a range of applications, training set size $n$, and number of features $d$ (Table 1). The first 4 datasets involve hospital readmission prediction, spam classification, and object recognition, and were used in Koh and Liang [2017] to study the influence of individual points. The fifth dataset is a chemical-disease relationship (CDR) dataset [Hancock et al., 2018]. In Section 5, we will also study the MultiNLI language inference dataset [Williams et al., 2018], which was omitted from the experiments here because its large size makes repeated retraining to compute the actual effect too expensive. See Appendix B for dataset details. In an attempt to make the influence approximation as inaccurate as possible, we constructed a variety of subsets, from small ($\alpha=0.25\%$) to large ($\alpha=25\%$), to be coherent and have considerable influence on the model. On each dataset, we trained an $L_{2}$-regularized logistic regression model (or softmax for the multiclass tasks) and compared the influences and actual effects of these subsets.

Figure 1 shows that the influences and actual effects of all of these subsets on test prediction (Top), test loss (Mid), and self-loss (Bot) are highly correlated (Spearman $\rho$ of $0.89$ to $0.99$ across all plots), even though the absolute and relative errors of the influence approximation can be quite large. Moreover, the influence of a group tends to underestimate its actual effect in all settings except for groups with negative influence on test loss (the left side of each plot in Figure 1-Mid). These trends held across a wide range of regularizations $\lambda$ , though correlation increased with $\lambda$ (Appendix C.2).
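To make concrete how rank correlation can stay high while the absolute and relative errors are large, the following minimal sketch computes Spearman $\rho$ on made-up numbers in which every actual effect is three times the predicted one. The values are illustrative only and are not results from our experiments; `ranks` and `spearman_rho` are hypothetical helpers that assume there are no ties.

```python
def ranks(values):
    """Return 0-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman correlation, i.e., the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2.0  # mean of the ranks 0, ..., n - 1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # identical for rx and ry without ties
    return cov / var


# Made-up numbers: the actual effect is always three times the influence
influence = [0.1, 0.4, 0.2, 0.9, 0.6]
actual = [3.0 * v for v in influence]

rho = spearman_rho(influence, actual)
max_rel_error = max(abs(a - p) / abs(a) for a, p in zip(actual, influence))
```

In this toy case the ranking is preserved perfectly even though the influence misses two thirds of every actual effect, which is the qualitative pattern described above.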

In Section 5, we will use the CDR dataset [Hancock et al., 2018] and the MultiNLI [Williams et al., 2018] dataset to show that correlation and underestimation also apply to groups of data that arise naturally, and that influence functions can therefore be used to derive insights about real datasets and applications. Before that, we first attempt to develop some theoretical insight into the results above.

Theoretical analysis

The experimental results above show that there is consistent underestimation and high correlation between the predicted effects, based on influence functions, and the actual effects of groups across a variety of datasets, despite the influence approximation incurring large absolute and relative error. As we discussed in Section 2.2, this is outside the regime of existing theory.

Our analysis centers on the one-step Newton approximation, which estimates the change in parameters

The Newton approximation is computationally expensive because it computes $(H_{\lambda,\mathbf{1}}(\mathbf{1}-w))^{-1}$ for each $w$ (instead of the fixed $H_{\lambda,\mathbf{1}}^{-1}$ in the influence calculation). However, it provides more accurate estimates (e.g., Pregibon et al. [1981], Rad and Maleki [2018]), and we show that its error can be bounded as follows (all proofs in Appendix E):
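As a rough illustration of the difference between the two approximations, the sketch below compares them against actual retraining on synthetic data. It rests on several assumptions that are not taken from the text: the objective is the sum of logistic losses plus $(\lambda/2)\|\theta\|_2^2$, the data are random, the group is the quarter of points with the largest first feature, and all function names are hypothetical. The influence estimate reuses the Hessian of the full training set, while the Newton estimate recomputes the Hessian with the group removed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_example_grads(theta, X, y):
    """Gradients of the logistic loss for each example (labels in {0, 1})."""
    return (sigmoid(X @ theta) - y)[:, None] * X


def hessian(theta, X, weights, lam):
    """Hessian of the weighted, L2-regularized objective at theta."""
    p = sigmoid(X @ theta)
    return (X * (weights * p * (1.0 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])


def objective(theta, X, y, weights, lam):
    z = X @ theta
    return weights @ (np.logaddexp(0.0, z) - y * z) + 0.5 * lam * (theta @ theta)


def fit(X, y, weights, lam, max_iters=100):
    """Minimize the objective with Newton steps and step halving."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = weights @ per_example_grads(theta, X, y) + lam * theta
        if np.linalg.norm(grad) < 1e-10:
            break
        step = np.linalg.solve(hessian(theta, X, weights, lam), grad)
        current, t = objective(theta, X, y, weights, lam), 1.0
        # Halve the step while it clearly increases the objective
        while objective(theta - t * step, X, y, weights, lam) > current + 1e-9 and t > 1e-8:
            t *= 0.5
        theta = theta - t * step
    return theta


rng = np.random.default_rng(0)
n, d, lam = 400, 5, 1.0
X = rng.normal(size=(n, d))
y = (rng.uniform(size=n) < sigmoid(X @ rng.normal(size=d))).astype(float)

# A coherent group: the 25% of points with the largest first feature
w = np.zeros(n)
w[np.argsort(-X[:, 0])[: n // 4]] = 1.0

ones = np.ones(n)
theta_full = fit(X, y, ones, lam)
theta_removed = fit(X, y, ones - w, lam)  # actual effect requires retraining

g_w = w @ per_example_grads(theta_full, X, y)  # summed gradients of the group
delta_actual = theta_removed - theta_full
delta_inf = np.linalg.solve(hessian(theta_full, X, ones, lam), g_w)
delta_newton = np.linalg.solve(hessian(theta_full, X, ones - w, lam), g_w)

err_inf = np.linalg.norm(delta_inf - delta_actual)
err_newton = np.linalg.norm(delta_newton - delta_actual)

# Predicted effects on self-loss under each approximation
self_inf = g_w @ delta_inf
self_newton = g_w @ delta_newton
```

Comparing `err_newton` with `err_inf` on toy data like this gives a feel for how much the recomputed Hessian matters. In this setup the Newton estimate on self-loss can never be smaller than the influence estimate, because removing points can only shrink the Hessian in the positive semidefinite order.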

4.2 Characterizing the difference between the Newton approximation and influence

We can interpret Proposition 2 as a formalization of Hampel et al.’s observation that influence approximations are accurate when the model is robust and the curvature of the loss is low. In general, the error decreases as $\lambda$ increases and $f(\cdot)$ becomes less curved; in Figure C.2, we show that increasing $\lambda$ reduces error and increases correlation in our experiments.

4.3 The relationship between influence and actual effect on self-loss

Under the assumptions of Proposition 2, the influence on the self-loss obeys

4.4 The relationship between influence and actual effect on a test point

We can recover a cone constraint similar to Proposition 3 if we restrict our attention to the special case where we use a margin-based model and remove (possibly multiple copies of) a single point:

Applications of influence functions on natural groups of data

The analysis in Section 4 shows that the cone constraint between predicted and actual group effects need not always hold. Nonetheless, our experiments in Section 3 demonstrate that on real datasets, the correlation is much stronger than the theory predicts. We now turn to using influence functions to predict group effects in two case studies where groups arise naturally.

The CDR dataset tackles the following task: given text about the relationship between a chemical and a disease, predict if the chemical causes the disease. It was collected via data programming, where users provide labeling functions (LFs), instead of labels, that take in an unlabeled point and either abstain or output a heuristic label [Ratner et al., 2016]. Specifically, Hancock et al. [2018] collected natural language explanations of provided classifications; parsed those explanations into LFs; and used those LFs to label a large pool of data (Appendix B.1).

We used influence functions to study two important properties of LFs: coverage, the fraction of unlabeled points for which an LF outputs a non-abstaining label; and precision, the proportion of correct labels output. We associated each LF with the group of points that it labeled, and computed its influence; as expected, these correlated with actual effects on overall test loss (Spearman $\rho=1$ ; Figure C.5). LFs with higher coverage had more influence (Figure 4-Left; see also Figure C.6), but surprisingly, LFs with higher precision did not (Figure 4-Mid). The association with coverage stems at least partially from class balance: each LF outputs either all positive or all negative labels, so removing an LF with high coverage changes the class balance and consequently improves test performance on one class at the expense of the other (Figure 4-Left). While these findings are not causal claims, they suggest that the coverage of an LF, rather than its precision, might have a stronger effect on its overall contribution to test performance.

The MultiNLI dataset deals with natural language inference: determining if a pair of sentences agree, contradict, or are neutral. Williams et al. [2018] presented crowdworkers with initial sentences from five genres and asked them to generate follow-on sentences that were neutral or in agreement/contradiction (Appendix B.2). We studied the effect that each crowdworker had on the model’s test set performance by computing the influence of the examples they created on overall test loss (Spearman $\rho$ of $0.77$ to $0.86$ with actual effects across different genres; see Figure C.8).

Studying the influence of each crowdworker reveals that the number of examples a crowdworker created was not predictive of influence on test performance: e.g., the most prolific crowdworker contributed 35,000 examples but had negative influence, and we verified that removing all of those examples and retraining the model indeed improved overall test performance (Figure 4-Right). Curiously, this effect was genre-specific: crowdworkers who improved performance on some genres would lower performance on others (Figure C.10), even though the number of examples they contributed to a genre did not correlate with their influence on it (Figure C.11). We note that these results are obtained on a baseline logistic regression model built on top of a continuous bag-of-words representation. Identifying precisely what makes a crowdworker’s contributions useful, especially on higher-performing models, could help us improve dataset collection and credit attribution as well as better understand the biases due to annotator effects [Geva et al., 2019].

Discussion

In this paper, we showed empirically that the influences of groups of points are highly correlated with, and consistently underestimate, their actual effects across a range of datasets, types of groups, and sizes. These phenomena allow us to use influence functions to better understand the “different stories that different parts of the data tell,” in the words of Hampel et al. We showed that we can gain insight into the effects of a labeling function in data programming, or a crowdworker in a crowdsourced dataset, by computing the influence of their corresponding groups.

While these applications involved predefined groups, influence functions could potentially also discover coherent, semantically-relevant groups in the data. They can also be used to approximate Shapley values, which are a different but related way of measuring the effect of data points; see, e.g., Jia et al. and Ghorbani and Zou. Separately, influence functions can also estimate the effects of adding training points. In this context, underestimation turns into overestimation, i.e., the influence of adding a group of training points tends to overestimate the actual effect of adding that group. This raises the possibility of using influence functions to evaluate the vulnerability of a given dataset and model to data poisoning attacks [Steinhardt et al., 2017].

Our theoretical analysis showed that while correlation and underestimation hold in some restricted settings, they need not hold in general, realistic settings. This gap between theory and experiments opens up important directions for future work: Why do we observe such striking correlation between predicted and actual effects on real data? To what extent is this due to the specific model, datasets, or subsets used? Do these trends hold for non-convex models like neural networks? Our work suggests that there could be distributional assumptions that hold for real data and give rise to the broad phenomena of correlation and underestimation. One promising lead is the surprising observation that the Newton approximation is much more accurate than influence at predicting group effects, which holds out the hope that we can understand group effects using just low-order terms (since the Newton approximation only uses the first and second derivatives of the loss) without needing to account for the whole loss function through higher order terms (as in Giordano et al. [2019a]).

The code for replicating our experiments is available in the GitHub repository https://github.com/kohpangwei/group-influence-release. An executable version of this paper is also available on CodaLab at https://worksheets.codalab.org/worksheets/0xfed2ae0b9e5b44b7a1af8096365592a5.

Acknowledgments

We are grateful to Zhenghao Chen, Brad Efron, Jean Feng, Tatsunori Hashimoto, Robin Jia, Stephen Mussmann, Aditi Raghunathan, Marco Túlio Ribeiro, Noah Simon, Jacob Steinhardt, and Jian Zhang for helpful discussions and comments. We are further indebted to Ryan Giordano, Ruoxi Jia, and Will Stephenson for discussion about prior work, and Samuel Bowman, Braden Hancock, Emma Pierson, and Pranav Rajpurkar for their assistance with applications and datasets. This work was funded by an Open Philanthropy Project Award. PWK was supported by the Facebook Fellowship Program.

References

Appendix A Experimental details for comparing influence vs. actual effects on constructed groups

For all experiments in Section 3, we trained a logistic regression model (or softmax for multiclass) using sklearn.linear_model.LogisticRegression.fit, fitting the intercept but only applying $L_{2}$-regularization to the weights. To choose the regularization strength $\lambda$, we conducted 5-fold cross-validation across 10 possible values of $\lambda/n$ logarithmically spaced between $1.0\times 10^{-4}$ and $1.0\times 10^{-1}$, inclusive, selecting the regularization that yielded the highest cross-validation accuracy (except on the CDR dataset, where we selected regularization based on cross-validation F1 score to account for class imbalance, following the procedure of Hancock et al. [2018]).
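The grid of regularization values can be sketched as follows. The mapping to the `C` parameter of `LogisticRegression` is an assumption: it presumes the training objective is the sum of per-example losses plus $(\lambda/2)\|\theta\|_2^2$, in which case $C=1/\lambda$. The helper names and the example value of $n$ are hypothetical, and the cross-validation loop itself is not shown.

```python
import math


def lambda_over_n_grid(low=1e-4, high=1e-1, num=10):
    """Return `num` values logarithmically spaced in [low, high], inclusive."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + (hi - lo) * i / (num - 1)) for i in range(num)]


def sklearn_C(lambda_over_n, n):
    """Map lambda / n to sklearn's inverse regularization strength C.

    Assumption: the objective is sum_i loss_i + (lambda / 2) * ||theta||^2,
    which matches sklearn's C * sum_i loss_i + 0.5 * ||theta||^2 when
    C = 1 / lambda.
    """
    return 1.0 / (lambda_over_n * n)


grid = lambda_over_n_grid()
# n = 1000 is an arbitrary example size, not one of the datasets
Cs = [sklearn_C(ratio, n=1000) for ratio in grid]
```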

A.2 Group construction

For each dataset, we constructed groups of various sizes relative to the entire dataset by considering 100 sizes linearly spaced between $0.25\%$ and $25\%$ of the dataset. For each of these 100 sizes, we constructed one group with each of the following methods:

Shared features: We selected a single feature uniformly at random and sorted the dataset along this selected feature. Next, we selected a training point uniformly at random. We then constructed a group of size $s$ that consisted of the $s$ unique training points that were closest to the chosen point, as measured by their values in the selected feature. We randomly sampled a feature and initial training point for each different group constructed in this way.

Feature clustering: We clustered the dataset with respect to raw features via scipy.cluster.hierarchy.fclusterdata with t set to $1$ , as well as with sklearn.cluster.KMeans.fit with n_clusters taking on values $4,8,16,32,64,128$ . Since hierarchical clustering determines cluster sizes automatically with a principled heuristic and we try a range of values for n_clusters in $k$ -means, this recovers clusters with a large range of sizes. The clustering with $\texttt{n\_clusters}=4$ also guarantees (via the pigeonhole principle) that there is at least one cluster which contains at least $25\%$ of the dataset. From all the clusters that are at least the size of the desired group, we chose one uniformly at random and chose the group uniformly at random and without replacement from the training points in this cluster.

Random within class: We considered all classes with at least as many training points as the size of the desired group. From these classes, we chose one uniformly at random. Then, we chose the group uniformly at random and without replacement from all training points in this class.

Random: We picked a group uniformly at random and without replacement from the entire dataset.

The above methods gave us a total of 500 groups (100 groups per method) for each dataset, with the exception of the “random within class” method for MNIST. Since MNIST has 10 classes, each with only $10\%$ of the data, we skipped over groups of size $>10\%$ just for the “random within class” groups.
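Three of the construction methods above (shared features, random within class, and random) can be sketched as follows. This is an illustrative reimplementation under our own assumptions rather than the released code: the function names, the rounding of group sizes, and the synthetic data are hypothetical, and feature clustering is omitted.

```python
import numpy as np


def group_sizes(n, num_sizes=100, min_frac=0.0025, max_frac=0.25):
    """Group sizes linearly spaced between 0.25% and 25% of n points."""
    fracs = np.linspace(min_frac, max_frac, num_sizes)
    return np.maximum(1, np.round(fracs * n)).astype(int)


def shared_feature_group(X, size, rng):
    """The `size` points closest to a random point along one random feature."""
    feature = rng.integers(X.shape[1])
    anchor = rng.integers(X.shape[0])
    distance = np.abs(X[:, feature] - X[anchor, feature])
    return np.argsort(distance, kind="stable")[:size]


def random_within_class_group(y, size, rng):
    """Uniform sample without replacement from one sufficiently large class."""
    classes, counts = np.unique(y, return_counts=True)
    eligible = classes[counts >= size]
    if eligible.size == 0:
        return None  # no class is large enough, so this size is skipped
    chosen = rng.choice(eligible)
    return rng.choice(np.flatnonzero(y == chosen), size=size, replace=False)


def random_group(n, size, rng):
    """Uniform sample without replacement from the entire dataset."""
    return rng.choice(n, size=size, replace=False)


rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))   # synthetic features
y = rng.integers(0, 2, size=2000)  # synthetic binary labels
sizes = group_sizes(len(X))

shared = shared_feature_group(X, sizes[10], rng)
within = random_within_class_group(y, sizes[10], rng)
uniform = random_group(len(X), sizes[10], rng)
```

Returning `None` when no class is large enough mirrors the way oversized "random within class" groups are skipped for MNIST.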

In addition, we selected 3 random test points and the 3 test points with highest loss; we intend these to represent the average case and the more extreme case that may be relevant to model developers who want to debug errors that their model outputs. For each of these 6 test points, we selected groups that had large positive influence on its test loss. More specifically, we proceeded in 3 stages:

We considered 33 group sizes linearly spaced between $0.25\%$ and $2.5\%$ of the dataset, and for any size $s$ out of these 33, we selected a group uniformly at random and without replacement from training points in the top $1.5\times 2.5\%$ of the dataset, ordered according to their influence on the test point of interest.

This was similar to the first stage, but with 33 sizes spaced between $0.25\%$ and $10\%$ and groups chosen from the top $1.5\times 10\%$ of the dataset.

Finally, we considered 34 sizes spaced between $0.25\%$ and $25\%$ , with groups chosen from the top $1.5\times 25\%$ of the dataset.

Larger groups tend to have lower average influence than smaller groups, since by necessity, the group must contain points farther from the top. This multi-stage approach ensured that we would select small groups with both a high average influence and also with a low average influence, so that we could compare them to larger groups and mitigate confounding the group size with its average influence.
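The three stages can be sketched as below. This is a hedged illustration: `staged_groups`, the rounding rules, and the random stand-in influences are our assumptions, not the released implementation. Each stage draws groups uniformly at random, without replacement, from a pool made up of the most influential points.

```python
import numpy as np

# (number of sizes, largest group size as a fraction of the dataset)
STAGES = [(33, 0.025), (33, 0.10), (34, 0.25)]


def staged_groups(influences, rng, min_frac=0.0025, pool_factor=1.5):
    """Sample groups of increasing size from the most influential points."""
    n = len(influences)
    order = np.argsort(-influences)  # most positive influence first
    groups = []
    for num_sizes, max_frac in STAGES:
        # The pool is the top (1.5 x max_frac) of the dataset by influence
        pool = order[: int(round(pool_factor * max_frac * n))]
        for frac in np.linspace(min_frac, max_frac, num_sizes):
            size = max(1, int(round(frac * n)))
            groups.append(rng.choice(pool, size=size, replace=False))
    return groups


rng = np.random.default_rng(0)
influences = rng.normal(size=1000)  # stand-in for influences on one test point
positive_tail = staged_groups(influences, rng)
negative_tail = staged_groups(-influences, rng)  # same procedure, other tail
```

Negating the influences reuses the same routine for the groups with large negative influence.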

Finally, we repeated this last method of group construction for groups with large negative influence on test point loss.

Using these 6 test points, we generated 1,200 groups (100 groups for each of the 6 test points and for each of the positive and negative tails). In total, we therefore generated 1,700 groups per dataset (except MNIST).

A.3 Comparison of influence and actual effect

To produce Figure 1, we selected groups as described in Appendix A.2. We retrained the model once for each group, excluding the group in order to calculate its actual effect. To compute all groups’ influences, we first calculated the influence of every individual training point using the procedure of Koh and Liang [2017]. Then, to compute the influence on test prediction or loss of some group, we simply added the relevant individual influences (in CDR, we weighted these individual influences according to that point’s weight; see Appendix B.1). To compute the influence on self-loss of some group, we summed up the gradients of the loss of each training point in the group to compute $g_{\mathbf{1}}(w)$, calculated the inverse Hessian vector product $H_{\lambda,\mathbf{1}}^{-1}g_{\mathbf{1}}(w)$, and took its dot product with $g_{\mathbf{1}}(w)$ (again, we modified this with appropriate weighting for individual points in CDR).
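The additivity used above can be checked with a small numerical sketch. The gradients and Hessian here are random stand-ins rather than quantities from a trained model, and all variable names are hypothetical. The point is only that a group's influence on a test prediction or loss is the (weighted) sum of individual influences, whereas its influence on self-loss is a quadratic form in the summed gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 4

# Random stand-ins for quantities evaluated at the trained parameters
G = rng.normal(size=(n, d))   # row i: loss gradient of training point i
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)       # positive-definite stand-in for the Hessian
grad_f = rng.normal(size=d)   # gradient of a test prediction or test loss

w = np.zeros(n)
w[:10] = 1.0                  # group indicator (or per-point weights)

# Influence of every individual training point on f, computed once
individual = G @ np.linalg.solve(H, grad_f)

# Group influence on f: weighted sum of the individual influences
group_influence = w @ individual

# Group influence on self-loss: g^T H^{-1} g with g the summed gradient
g = G.T @ w
self_loss_influence = g @ np.linalg.solve(H, g)

# Same group influence computed directly from the summed gradient
direct = grad_f @ np.linalg.solve(H, g)
```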

Appendix B Dataset details

We used the same versions of the Diabetes, Enron, Dogfish, and MNIST datasets as Koh and Liang [2017], since the examination of the accuracy of influence functions for large perturbations is a natural extension of their studies of small perturbations. Additionally, we applied influence to more natural settings in CDR and MultiNLI; here, we discuss their preprocessing pipelines.

Hancock et al. [2018] established the BabbleLabble framework for data programming, which uses the following pipeline: They took labeled examples with natural language explanations, parsed the explanations into programmatic labeling functions (LFs) via a semantic parser, and filtered out obviously incorrect LFs. Then, they applied the remaining LFs to unlabeled data to create a sparse label matrix, from which they learned a label aggregator that outputs a noisily labeled training set. Finally, they ran $L_{2}$-regularized logistic regression on a set of basic linguistic features with the noisy labels.

They demonstrated their method on three datasets: Spouse, CDR, and Protein. The Protein dataset was not publicly available, and the vast majority of Spouse was labeled by a single LF, hence we chose to use CDR. This dataset’s associated task involved identifying whether, according to a given sentence, a given chemical causes a given disease. For instance, the sentence “Young women on replacement estrogens for ovarian failure after cancer therapy may also have increased risk of endometrial carcinoma and should be examined periodically.” would be labeled True, since it indicates that estrogens may cause endometrial carcinoma [Hancock et al., 2018]. The sentences and ground truth labels were sourced from the 2015 BioCreative chemical-disease relation dataset [Wei et al., 2015].

In our application, we began with their 28 LFs and the corresponding label matrix. For simplicity, we did not learn a label aggregator; instead, if an example $x$ was given labels $y_{i_{1}},y_{i_{2}},\ldots,y_{i_{k}}$ by $k$ LFs $i_{1},\ldots,i_{k}$, then we created $k$ copies of $x$, each with weight $1/k$. The subset of points corresponding to LF $i_{1}$ then included one instance of $x$ with weight $1/k$. This weighting was taken into account in model training as well as in calculations of influence and actual effect. In addition, we used $L_{1}$-regularization for feature selection, reducing the number of features to 328 while still achieving an F1 score similar to that of Hancock et al. [2018]; they reported an F1 of 42.3, while we achieved 42.0. After feature selection, we removed the $L_{1}$-regularization and trained an $L_{2}$-regularized logistic regression model. We assumed that the feature selection step is static and not affected by removing groups of data (though in general this assumption is not true); we therefore did not include feature selection in our influence calculations.
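The copy-and-weight scheme can be sketched on a toy label matrix. The example ids, LF ids, labels, and the helper `expand_by_labeling_function` are all hypothetical and serve only to illustrate the $1/k$ weighting.

```python
from collections import defaultdict


def expand_by_labeling_function(lf_labels):
    """Create one weighted copy of each example per non-abstaining LF.

    `lf_labels` maps example id -> {LF id: label}. Returns rows of
    (example id, LF id, label, weight), where weight = 1 / k and k is the
    number of LFs that labeled the example.
    """
    rows = []
    for example_id, labels in lf_labels.items():
        k = len(labels)
        for lf_id, label in labels.items():
            rows.append((example_id, lf_id, label, 1.0 / k))
    return rows


# Toy label matrix: x1 is labeled by three LFs, x2 by one
lf_labels = {"x1": {"lf_a": 1, "lf_b": 1, "lf_c": 0}, "x2": {"lf_a": 1}}
rows = expand_by_labeling_function(lf_labels)

# The group of (example, weight) pairs associated with each LF
groups = defaultdict(list)
for example_id, lf_id, label, weight in rows:
    groups[lf_id].append((example_id, weight))

# Total weight per original example stays equal to 1
total_weight = defaultdict(float)
for example_id, _, _, weight in rows:
    total_weight[example_id] += weight
```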

We note that in BabbleLabble, a given LF can never output positive on one example but negative on another. Hence, some LFs are positive (unable to output negative and only able to abstain or output positive), while the others are negative (unable to output positive and only able to abstain or output negative).

B.2 MultiNLI

Williams et al. [2018] created the MultiNLI dataset for the task of natural language inference: determining if a pair of sentences agree, contradict, or are neutral. To do so, they presented crowdworkers with initial sentences and asked them to generate follow-on sentences that were neutral or in agreement/contradiction. For example, a crowdworker may be presented with “Met my first girlfriend that way.” and write the contradicting sentence “I didn’t meet my first girlfriend until later.” [Williams et al., 2018]. Thus, each of the 380 crowdworkers generated a subset of the dataset. We used these subsets in our application of influence.

The training set consisted of 392,702 examples from five genres. The development set consisted of 10,000 “matched” examples from the same five genres as the training set, as well as 10,000 “mismatched” examples from five new genres. The test set was put on Kaggle as an open competition, hence we do not have its labels and could not use it; therefore, we use the development set as the test set.

The continuous bag-of-words baseline in Williams et al. [2018] first converted the raw text of each sentence in the pair into a vector by treating the sentence as a continuous bag of words and simply averaging the 300-dimensional GloVe vector embeddings. This converted a pair of sentences into vectors $a,b$. They then concatenated $[a,b,a-b,a\odot b]$ into a 1200D vector, where $a\odot b$ denotes the element-wise product. Finally, they treated this as input to a neural network with three hidden layers and fine-tuned the entire model, including word embeddings (more details in [Williams et al., 2018]).

For our application, we truncated their baseline and just used the concatenation of $a$ and $b$ as the representation for every example. By running logistic regression on this, we achieved test accuracy of $50.4\%$ (vs. their baseline’s $64.7\%$ ; the performance difference comes from the additional dimensions in their vector embeddings and the finetuning through the neural network). Future work could explore influence in the setting of more complex and higher-performing models.
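The two representations can be compared in a short sketch. Random vectors stand in for GloVe embeddings, and the function names are hypothetical; only the dimensions (300 per sentence, 1200 for the full concatenation, 600 for the truncated one) follow from the description above.

```python
import numpy as np


def cbow(word_vectors):
    """Average the word vectors of one sentence (continuous bag of words)."""
    return np.mean(word_vectors, axis=0)


def full_pair_features(a, b):
    """[a, b, a - b, a * b]: the 1200-D input of the original baseline."""
    return np.concatenate([a, b, a - b, a * b])


def truncated_pair_features(a, b):
    """[a, b]: the 600-D representation used for logistic regression."""
    return np.concatenate([a, b])


rng = np.random.default_rng(0)
first_sentence = rng.normal(size=(7, 300))   # 7 words, random stand-ins
second_sentence = rng.normal(size=(5, 300))  # 5 words, random stand-ins
a, b = cbow(first_sentence), cbow(second_sentence)

full = full_pair_features(a, b)
truncated = truncated_pair_features(a, b)
```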

Appendix C Additional experiments

As in Figure 1, in each of the plots below, the grey reference line has slope 1, and the red borders represent points that are not plotted because they are outside the x- or y-axis range.

Figure C.1 is similar to Figure 1 in the main text: it shows the influences vs. actual effects of groups on test points, but with test points that are closer to the median (within the 40th to 60th percentile) of the test loss distribution.

C.2 Regularization

In Section 4, our bounds show that influence ought to be closer to actual effect as regularization increases. Here, we support this claim empirically on Diabetes, Enron, Dogfish, and MNIST (small). This experiment required us to retrain the model for every value of $\lambda$ and for every subset. Thus, for computational purposes, we omitted CDR and MultiNLI, and we selected a random $10\%$ subset of MNIST’s training set to use in place of all of MNIST. To do so, for each dataset, we selected a range of values for $\lambda/n$, and we selected subsets as described in Appendix A.2. We then computed the influence and actual effect of each of these subsets on a representative test point’s prediction, that point’s loss, and on self-loss (Figure C.2).

In Figure C.3, we observe the trend that correlation generally increases as $\lambda$ does. Specifically, we computed the Spearman $\rho$ between the influence and actual effect for each dataset, each value of $\lambda$ , and each evaluation function $f(\cdot)$ of interest (i.e., test prediction, test loss, or self-loss).

C.3 The effect of loss curvature on the accuracy of influence

One takeaway from the results on test loss in Figure 1-Mid is that the curvature of $f(\theta)$ can significantly increase approximation error; this is expected since the influence $\mathcal{I}_{f}(w)$ linearizes $f(\cdot)$ around $\hat{\theta}(\mathbf{1})$. When possible, choosing an $f(\cdot)$ that has low curvature (e.g., the linear prediction) will result in higher accuracy. When $f(\cdot)$ does have high curvature, we can mitigate the resulting error by using influence to approximate the parameters $\hat{\theta}(\mathbf{1}-w)$ and then plugging that estimate into $f(\cdot)$ (Figure C.4), though this can be more computationally expensive.
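The two ways of using influence on a curved $f(\cdot)$ can be sketched as follows. This is an illustrative NumPy example on synthetic data, not our experimental code: it assumes an $L_2$-regularized logistic regression objective written as a sum of per-example losses plus $\frac{\lambda}{2}\|\theta\|^2$, and all function names, sizes, and the random seed are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta, X, y, lam):
    # Sum of logistic losses (labels in {0, 1}) plus the L2 penalty.
    z = X @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * theta @ theta

def grad_hess(theta, X, y, lam):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) + lam * theta
    hess = (X * (p * (1 - p))[:, None]).T @ X + lam * np.eye(X.shape[1])
    return grad, hess

def fit(X, y, lam, iters=100):
    """Newton's method with backtracking line search."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad, hess = grad_hess(theta, X, y, lam)
        if np.linalg.norm(grad) < 1e-10:
            break
        step = np.linalg.solve(hess, grad)
        t, f0 = 1.0, objective(theta, X, y, lam)
        while t > 1e-8 and objective(theta - t * step, X, y, lam) > f0 - 0.25 * t * (grad @ step):
            t *= 0.5
        theta = theta - t * step
    return theta

def loss(theta, x, y):
    z = x @ theta
    return np.logaddexp(0.0, z) - y * z

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 1.0
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ rng.normal(size=d))).astype(float)
x_test, y_test = rng.normal(size=d), 1.0

theta_hat = fit(X, y, lam)
removed = np.arange(20)                    # the group w: first 20 points
keep = np.setdiff1d(np.arange(n), removed)

# Influence estimate of the parameter change: H^{-1} g(w).
g_w = X[removed].T @ (sigmoid(X[removed] @ theta_hat) - y[removed])
_, H = grad_hess(theta_hat, X, y, lam)
delta_inf = np.linalg.solve(H, g_w)

# (a) Linearize f around theta_hat (the usual influence on test loss).
grad_f = (sigmoid(x_test @ theta_hat) - y_test) * x_test
effect_linear = grad_f @ delta_inf

# (b) Plug the influence-estimated parameters into f directly.
effect_plugin = loss(theta_hat + delta_inf, x_test, y_test) - loss(theta_hat, x_test, y_test)

# Ground truth: retrain without the group.
theta_retrained = fit(X[keep], y[keep], lam)
effect_actual = loss(theta_retrained, x_test, y_test) - loss(theta_hat, x_test, y_test)

print(f"linearized: {effect_linear:.4f}  plug-in: {effect_plugin:.4f}  actual: {effect_actual:.4f}")
```

For a linear $f(\cdot)$ such as the margin $x^{\top}\theta$, estimates (a) and (b) coincide, so the two differ only through the curvature of $f(\cdot)$. Since the logistic loss is convex in $\theta$, the plug-in estimate is never smaller than the linearized one.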

Figure C.4 shows that this technique does not help much for measuring self-loss. However, in the context of LOOCV, the computational complexity of the Newton approximation for self-loss (described in Section 4) is similar to that of the influence approximation, so we encourage the use of the Newton approximation for LOOCV (as in Rad and Maleki ); Figure C.4 shows that this leads to more accurate approximations for self-loss.

C.4 Additional analysis of influence functions applied to natural groups of data

In Section 5, we considered the CDR and MultiNLI datasets, which contain the natural subsets of LFs and crowdworkers, respectively. To draw inferences about these subsets, we took the $L_{2}$ -regularized logistic regression model described in Appendix A, calculated the influence of the LF/crowdworker subsets, and retrained the model once for each LF/crowdworker.

As discussed in Appendix B.1, an LF is either positive or negative, where a positive LF can only give positive labels or abstain, and similarly for negative LFs. Because of this stark class separation, we indicate whether an LF is positive or negative, and we consider LF influence on the positive test examples separately from their influence on the negative test examples. To measure an LF’s influence and actual effect on a set of test points, we simply add up its influence and actual effect on the set’s individual test points.

In Figure C.5, we note that influence is a good approximation of an LF’s actual effect, just as with other kinds of subsets as well as other datasets (Figure 1). Furthermore, we observe that positive LFs improve the overall performance of the positively labeled portion of the test set while hurting the negatively labeled portion of the test set, and vice versa for negative LFs. This dichotomous effect further motivates analyzing influence on the positive test set separately from the negative test set: summing the two influences to study the entire test set would let these opposing effects cancel and obscure the full story.

Next, we define an LF’s coverage to be the proportion of the examples that it does not abstain on, which can be measured through the number of examples in its corresponding subset. In Figure C.6, we observe that the magnitude of influence correlates strongly with coverage.

Finally, we define an LF’s precision to be the number of examples it labels correctly divided by the number of examples it does not abstain on. Because the dataset had many more negative than positive examples, positive LFs had lower precision than negative LFs. Surprisingly, even when this effect was taken into account and we considered positive LFs separately from negative ones, precision did not correlate with influence (Figure C.7).
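The coverage and precision definitions above can be made concrete with a small sketch. The encoding here is an assumption made for illustration: an LF's output is `1` or `0` for a label and `None` for an abstention, and the votes and gold labels are hypothetical rather than taken from CDR.

```python
def coverage(votes):
    """Proportion of examples on which the LF does not abstain."""
    labeled = [v for v in votes if v is not None]
    return len(labeled) / len(votes)


def precision(votes, gold):
    """Correctly labeled examples divided by non-abstained examples."""
    pairs = [(v, g) for v, g in zip(votes, gold) if v is not None]
    correct = sum(1 for v, g in pairs if v == g)
    return correct / len(pairs)


# Hypothetical positive LF: it either outputs the positive label (1) or abstains.
votes = [1, None, 1, None, 1, 1, None, None]
gold = [1, 0, 0, 1, 1, 1, 0, 0]

lf_coverage = coverage(votes)          # 4 of 8 examples labeled -> 0.5
lf_precision = precision(votes, gold)  # 3 of 4 labels correct   -> 0.75
print(lf_coverage, lf_precision)
```

Under this encoding, the size of the LF's corresponding training subset equals the number of non-`None` votes, which is why coverage can be read off from subset size.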

As discussed in Appendix B.2, the MultiNLI training set consisted of five genres, and the test set consisted of a matched portion with the same five genres, as well as a mismatched portion with five new genres. For succinctness, we refer to the influence/actual effect of the set of examples generated by a single crowdworker as that crowdworker’s influence/actual effect.

First, we note in Figure C.8 that influence is a good approximation of a crowdworker’s actual effect for both matched and mismatched test sets, consistent with our findings in Figure 1 for other subset types and datasets.

Unlike in CDR (Figure C.6), we do not find strong correlation between a crowdworker’s influence and the number of examples they contributed; it is possible to contribute many examples but have relatively little influence (Figure C.9).

The most prolific crowdworker contributed 35,000 examples and had large negative influence on the test set. A closer analysis revealed that they had positive influence on the fiction genre but lowered performance on many other genres, despite contributing roughly equally to each genre. This genre-specific trend tended to hold more broadly among the workers: there appear to be two categories of genres (fiction, facetoface, nineeleven vs. travel, government, verbatim, letters, oup) such that each worker tended to have positive influence on all genres in one category and negative influence on all genres in the other (Figure C.10). Moreover, the number of examples a worker contributed to a given genre was not a good indicator for their influence on that genre (Figure C.11).

Appendix D Additional analysis on influence vs. actual effect on a test point

For Figure 3, we constructed two binary datasets in which the influence of a certain class of subsets on the test prediction of a single test point exhibits pathological behavior.

In Figure 3-Left, our aim was to show that there can be a dataset with subsets such that the cone constraint discussed in Section 4.4 does not hold.

We sampled $60$ examples from each class for a total of $n=120$ training points, and set the regularization strength $\lambda=0.001$ .

Note that we adversarially chose which subsets to study in this counterexample, since our main goal was to show that there existed subsets for which the cone constraint did not hold. For the next counterexample, we instead study all possible subsets in the restricted setting of removing copies of single points.

In Figure 3-Right, our aim was to construct a dataset such that even if we only removed subsets comprising copies of single distinct points, a low influence need not translate into a low actual effect.

D.2 Scaling effects when removing multiple points

In the general setting of removing subsets of different points, the analogous failure case to a varying scaling factor $d(w)$ (Figure 3-Right) is the varying scaling effect that the error matrix $D(w)$ in Proposition 2 can have on different subsets $w$ . The range of this effect is bounded by the spectral norm of $D(w)$ . This norm is precisely equal to $d(w)$ in the single-point setting, and it is large when we remove a subset $w$ whose Hessian $H_{\mathbf{1}}(w)$ is almost as large as the full Hessian $H_{\lambda,\mathbf{1}}$ in some direction. As with $d(w)$ , the spectral norm of $D(w)$ decreases with $\lambda$ (Proposition 2), so as regularization increases, we expect that the influence of a group will track its actual effect more accurately.
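The dependence of the error matrix on $\lambda$ can be checked numerically. The sketch below makes two assumptions that are not restated in this appendix: that the error matrix has the form $D(w)=(I-H_{\lambda,\mathbf{1}}^{-\frac{1}{2}}H_{\mathbf{1}}(w)H_{\lambda,\mathbf{1}}^{-\frac{1}{2}})^{-1}-I$, and that squared-loss Hessians $x_i x_i^{\top}$ on synthetic data are an acceptable stand-in for the per-example Hessians. Variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 3
X = rng.normal(size=(n, d))
removed = np.arange(20)                  # the subset w

H1 = X.T @ X                             # unregularized full Hessian (stand-in)
H1_w = X[removed].T @ X[removed]         # Hessian of the removed subset

def error_matrix_norm(lam):
    """Spectral norm of the assumed D(w) at regularization strength lam."""
    H_lam = H1 + lam * np.eye(d)
    evals, evecs = np.linalg.eigh(H_lam)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # H_lam^{-1/2}
    M = inv_sqrt @ H1_w @ inv_sqrt
    D = np.linalg.inv(np.eye(d) - M) - np.eye(d)
    return np.linalg.norm(D, 2)

lams = [0.01, 0.1, 1.0, 10.0, 100.0]
norms = [error_matrix_norm(lam) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda = {lam:7.2f}   ||D(w)||_2 = {nrm:.4f}")
```

Under these assumptions the printed norms shrink as $\lambda$ grows, because a larger $\lambda$ makes the removed subset's Hessian a smaller fraction of $H_{\lambda,\mathbf{1}}$ in every direction.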

D.3 The relationship between influence and actual effect on the loss of a test point

Appendix E Proofs

We first review the notation given in Section 2 and introduce new definitions that will be useful in the sequel. We define the empirical risk as

such that the optimal parameters are $\hat{\theta}(s)\stackrel{{\scriptstyle\rm def}}{{=}}{\arg\min}_{\theta\in\Theta}L_{s}(\theta)$ .

If the argument $s$ is omitted, it is assumed to be equal to $r$ . For example,

For a given dataset, we define the following constants:

In the sequel, we study the order-3 tensor $\nabla_{\theta}^{3}f(\hat{\theta}(\mathbf{1}))$ . We define its product with a vector (which returns a matrix) as a contraction along the last dimension:

E.2 Assumptions

These assumptions apply to all the results that follow below.

E.3 Bounding the error of the one-step Newton approximation

This proof is adapted to our setting from the standard analysis of the Newton method in convex optimization [Boyd and Vandenberghe, 2004].

Putting together the successive bounds gives the result. ∎
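A useful sanity check on the one-step Newton approximation is the squared-loss case: there the Hessian does not depend on $\theta$, so a single Newton step from $\hat{\theta}(\mathbf{1})$ lands exactly on $\hat{\theta}(\mathbf{1}-w)$, whereas the influence approximation (which uses the full Hessian $H_{\lambda,\mathbf{1}}$ instead of the Hessian of the remaining points) does not. The sketch below illustrates this on synthetic ridge regression; the data, sizes, and names are hypothetical, and the objective is assumed to be a sum of per-example losses plus $\frac{\lambda}{2}\|\theta\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 4, 0.5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Minimizer of sum_i 0.5 * (x_i^T theta - y_i)^2 + (lam / 2) * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

theta_hat = ridge(X, y, lam)
removed, keep = np.arange(30), np.arange(30, n)
Xw, yw = X[removed], y[removed]

g_w = Xw.T @ (Xw @ theta_hat - yw)       # summed gradients of removed points
H_full = X.T @ X + lam * np.eye(d)       # regularized Hessian on all points
H_w = Xw.T @ Xw                          # Hessian of the removed points

theta_inf = theta_hat + np.linalg.solve(H_full, g_w)           # influence
theta_newton = theta_hat + np.linalg.solve(H_full - H_w, g_w)  # one Newton step
theta_true = ridge(X[keep], y[keep], lam)                      # retraining

err_inf = np.linalg.norm(theta_inf - theta_true)
err_newton = np.linalg.norm(theta_newton - theta_true)
print(f"influence error: {err_inf:.2e}   Newton error: {err_newton:.2e}")
```

For losses with a non-constant Hessian, such as the logistic loss, the Newton step is no longer exact, and its error is what the bound in this section controls.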

E.4 Characterizing the difference between the Newton approximation and influence

Before proving Proposition 2, we first prove a lemma about the spectrum of the error matrix $D(w)$ .

To show the upper bound, first note that $H_{\mathbf{1}}(w)\preceq H_{\mathbf{1}}(w)+H_{\mathbf{1}}(\mathbf{1}-w)=H_{\mathbf{1}}$ (recalling that $w\in\{0,1\}^{n}$ ), and let $U\Sigma U^{\top}$ be the singular value decomposition of $H_{\mathbf{1}}$ . Since $H_{\lambda,\mathbf{1}}=H_{\mathbf{1}}+\lambda I$ , we have

From the second-order Taylor expansion of $f$ about $\hat{\theta}(\mathbf{1})$ , there exists $0\leq\xi\leq 1$ such that

Applying Lemma 1 to bound the spectrum of $D(w)$ completes the proof. ∎

E.5 The influence on self-loss

We first state two linear algebra facts that will be useful in the sequel.

Note that $\frac{1}{\sigma_{A,1}}$ is the smallest eigenvalue of $A^{-1}$ , while $\frac{1}{\sigma_{A,d}}$ is its largest. The lemma follows from the fact that the smallest singular value of the product of two matrices is lower bounded by the product of the smallest singular values of each matrix, and similarly the largest singular value of the product is upper bounded by the product of the largest singular values of each matrix. ∎
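The singular-value facts used in this proof are easy to verify numerically. The following sketch checks them on random matrices; the matrices and seed are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

def singular_values(M):
    # Returned in descending order.
    return np.linalg.svd(M, compute_uv=False)

sA, sB, sAB = singular_values(A), singular_values(B), singular_values(A @ B)
lower = sA[-1] * sB[-1]   # product of the smallest singular values
upper = sA[0] * sB[0]     # product of the largest singular values
print(f"{lower:.4f} <= sigma_min(AB) = {sAB[-1]:.4f}")
print(f"sigma_max(AB) = {sAB[0]:.4f} <= {upper:.4f}")

# For a symmetric positive definite matrix, the eigenvalues of the inverse
# are the reciprocals of the original eigenvalues.
S = A @ A.T + np.eye(4)
eig_S = np.linalg.eigvalsh(S)                     # ascending order
eig_S_inv = np.linalg.eigvalsh(np.linalg.inv(S))  # ascending order
print(np.allclose(eig_S_inv, np.sort(1.0 / eig_S)))
```

In particular, the smallest eigenvalue of the inverse is the reciprocal of the largest eigenvalue of the original matrix, matching the first sentence of the proof.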

The next fact is a consequence of the variational definition of eigenvalues.

where $\sigma_{d}$ is the smallest eigenvalue of $A$ , and $\sigma_{1}$ is the largest.
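The kind of bound that follows from the variational definition of eigenvalues is, presumably, the Rayleigh-quotient sandwich $\sigma_{d}\|v\|_2^2\leq v^{\top}Av\leq\sigma_{1}\|v\|_2^2$ for symmetric $A$. The sketch below checks this numerically and also applies it to the inverse of a positive definite matrix, which is the form of the self-loss influence $g_{\mathbf{1}}(w)^{\top}H_{\lambda,\mathbf{1}}^{-1}g_{\mathbf{1}}(w)$ used later in this section. The matrix `H` and vector `g` are random stand-ins, not quantities from any of our datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
R = rng.normal(size=(d, d))
H = R @ R.T + 0.5 * np.eye(d)    # symmetric positive definite stand-in for the Hessian
g = rng.normal(size=d)           # stand-in for the summed gradient of the removed points

eig = np.linalg.eigvalsh(H)      # ascending order
sigma_d, sigma_1 = eig[0], eig[-1]
g_sq = g @ g

# Rayleigh-quotient bounds on the quadratic form g^T H g.
quad = g @ H @ g
print(f"{sigma_d * g_sq:.4f} <= {quad:.4f} <= {sigma_1 * g_sq:.4f}")

# The same bounds applied to H^{-1}, whose eigenvalues are 1 / eig.
self_influence = g @ np.linalg.solve(H, g)
print(f"{g_sq / sigma_1:.4f} <= {self_influence:.4f} <= {g_sq / sigma_d:.4f}")
```

The second pair of inequalities shows why the self-loss influence is always non-negative and is controlled by the extreme eigenvalues of the Hessian.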

We are now ready to analyze the effect of removing a subset $w$ of $k$ training points on the total loss on those $k$ points.

Applying Lemma 3 and using $\mathcal{I}_{f}(w)=g_{\mathbf{1}}(w)^{\top}H_{\lambda,\mathbf{1}}^{-1}g_{\mathbf{1}}(w)$ , we obtain

E.6 The influence on a test point

where in the last equality we use the assumption that we are removing $\left\|w\right\|_{1}$ copies of the point $(x_{w},y_{w})$ . Similarly,

where the third equality comes from the Sherman-Morrison formula. Substituting $D(w)$ into Corollary 1, we obtain

To bound the denominator, we first use the trace trick to rearrange terms

Since $H_{\lambda,\mathbf{1}}^{-\frac{1}{2}}H_{\mathbf{1}}(w)H_{\lambda,\mathbf{1}}^{-\frac{1}{2}}$ has rank one under our assumptions, it has at most one non-zero eigenvalue. We can therefore apply Lemma 1 to conclude that