Multi-Stage Influence Function

Hongge Chen, Si Si, Yang Li, Ciprian Chelba, Sanjiv Kumar, Duane Boning, Cho-Jui Hsieh

Introduction

Multi-stage training (pretrain and then finetune) has become increasingly important and has achieved state-of-the-art results in many tasks. In natural language processing (NLP) applications, it is now common practice to first learn word embeddings (e.g., word2vec, GloVe) or contextual representations (e.g., ELMo, BERT) from a large unsupervised corpus, and then refine or finetune the model on supervised end tasks. Similar ideas in transfer learning have also been widely used in many different tasks. Intuitively, the successes of these multi-stage learning paradigms are due to knowledge transfer from pretraining tasks to the end task. However, current approaches using multi-stage learning are usually based on trial and error, and many fundamental questions remain unanswered. For example, which part of the pretraining data or task contributes most to the end task? How can one detect “false transfer”, where some pretraining data or task is harmful for the end task? If a test point is wrongly predicted by the finetuned model, can we trace the error back to the problematic examples in the pretraining data? Answering these questions requires a quantitative measurement of how the data and loss function in the pretraining stage influence the end model, which has not been studied in the past and is the main focus of this paper.

To find the most influential training data responsible for a model’s prediction, the influence function was first introduced from a robust statistics point of view. More recently, as large-scale applications made influence function computation more challenging, a first-order approximation was proposed to measure the effect of removing one training point on the model’s prediction. These methods are widely used in model debugging, and there are also some applications in machine learning fairness. However, all existing algorithms for computing influence function scores study the case of single-stage training, where there is only one model with one set of training/prediction data in the training process. To the best of our knowledge, the influence of pretraining data on a subsequent finetuning task and model has not been studied, and it is nontrivial to apply the original influence function to this scenario. A naive approach is to remove each individual instance from the pretraining data one at a time and retrain both the pretrained and finetuned models; this is prohibitively expensive, especially given that pretraining models are often large-scale and may take days to train.

In this work, we study the influence function from pretraining data to the end task, and propose a novel approach to estimate the influence scores in multi-stage training that requires no additional retraining, does not require model convexity, and is computationally tractable. The proposed approach is based on the definition of the influence function, and estimates the influence score under two multi-stage training settings, depending on whether the embedding from the pretrained model is retrained in the finetuning task. The derived influence function explains well how pretraining data benefits the finetuning task. In summary, our contributions are threefold:

We propose a novel estimation of influence score for multi-stage training. In experiments on real datasets across various tasks, our predicted and actual influence scores of the pretraining data on the finetuned model are well correlated. This shows the effectiveness of our proposed technique for estimating influence scores in multi-stage models.

We propose effective methods to determine how testing data from the finetuning task is impacted by changes in the pretraining data. We show that the influence of the pretraining data to the finetuned model consists of two parts: the influence of the pretraining data on the pretrained model, and influence of the pretraining data on the finetuned model. These two parts can be quantified using our proposed technique.

We propose methods to decide whether the pretraining data can benefit the finetuning task. We show that the influence of the pretraining data on the finetuning task is highly dependent on 1) the similarity of the two tasks or stages, and 2) the amount of training data in the finetuning task. Our proposed technique provides a novel way to measure how the pretraining data helps the finetuning task.

Related Work

Multi-stage model training, which trains models in several stages on different tasks to improve the end task, has been used widely in many machine learning areas. For example, transfer learning has been widely used to transfer knowledge from a source task to a target task. More recently, researchers have shown that training a computer vision or NLP encoder on a source task with a huge amount of data can often benefit the performance of small end tasks, and techniques such as BERT, ELMo, and large ResNet pretraining have achieved state-of-the-art results on many tasks.

Although multi-stage models have been widely used, there are few works on understanding multi-stage models and exploiting the influence of the training data in the pretraining step to benefit the finetuning task. In contrast, there are many works that focus on understanding single-stage machine learning models and explaining model predictions. Algorithms developed along this line of research can be categorized into feature-based and data-based approaches. Feature-based approaches aim to explain predictions with respect to model variables, and trace back the contribution of variables to the prediction. However, they do not aim to attribute the prediction back to the training data.

On the other hand, data-based approaches seek to connect model predictions and training data, and trace back the training data that are most responsible for a model prediction. Among them, the influence function, which aims to model the change in prediction when training data is added or removed, has been shown to be effective in many applications. There is a series of works on influence functions, including investigating the influence of a group of data on the prediction, using influence functions to detect bias in word embeddings, and using them to prevent data poisoning attacks. There are also works on data importance estimation to explain the model from the data perspective.

All of these previous works, however, only consider a single stage training procedure, and it is not straightforward to apply them to multi-stage models. In this paper, we propose to analyze the influence of pretraining data on predictions in the subsequent finetuned model and end task.

Algorithms

In this section, we detail the procedure of multi-stage training, show how to compute the influence score for the multi-stage training, and then discuss how to scale up the computation.

Multi-stage models, which train different models in consecutive stages, have been widely used in various ML tasks. Mathematically, let $\mathcal{Z}$ be the training set for the pretraining task with data size $|\mathcal{Z}|=m$, and $\mathcal{X}$ be the training data for the finetuning task with data size $|\mathcal{X}|=n$. In the pretraining stage, we assume the parameters of the pretrained network have two parts: the parameters $W$ that are shared with the end task, and the task-specific parameters $U$ that are only used in the pretraining stage. Note that $W$ could be a word embedding matrix (e.g., in word2vec) or a representation extraction network (e.g., ELMo, BERT, ResNet), while $U$ is usually the last few layers that correspond to the pretraining task. After training on the pretraining task, we obtain the optimal parameters $W^{*},U^{*}$. The pretraining stage, referred to as (1), can be formulated as minimizing $G(W,U)$ over $W$ and $U$,

where $g(\cdot)$ is the loss function for the pretraining task and $G(\cdot)$ is the summation of the loss over all the pretraining examples.
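The objective can be written out from the definitions above. The following is a reconstruction rather than a quotation: it assumes that $G$ averages the per-example loss over the $m$ pretraining examples, which is the normalization implied by the later remark that $\epsilon=-\frac{1}{m}$ removes an example.

```latex
\begin{equation}
W^{*}, U^{*}
  = \operatorname*{arg\,min}_{W,U} \frac{1}{m}\sum_{z\in\mathcal{Z}} g(z; W, U)
  := \operatorname*{arg\,min}_{W,U} G(W,U). \tag{1}
\end{equation}
```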

In the finetuning stage, the network parameters are $W,\Theta$, where $W$ is shared with the pretraining task and $\Theta$ is the rest of the parameters specifically associated with the finetuning task. We initialize the $W$ part with $W^{*}$. Let $f(\cdot)$ denote the finetuning loss and $F(\cdot)$ the summation of the loss over all finetuning data. There are two cases when finetuning on the end task:

Finetuning Case 1: fix the embedding parameters at $W=W^{*}$ and finetune only $\Theta$, that is, obtain $\Theta^{*}$ by minimizing $F(W^{*},\Theta)$ over $\Theta$. This formulation is referred to as (2).

Finetuning Case 2: finetune both the embedding parameters $W$ (initialized from $W^{*}$) and $\Theta$. Sometimes updating the embedding parameters $W$ in the finetuning stage is necessary, as the embedding parameters from the pretrained model may not be good enough for the finetuning task. This corresponds to minimizing $F(W,\Theta)$ over both $W$ and $\Theta$, which yields $(W^{**},\Theta^{*})$ and is referred to as (3).
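The two finetuning cases can be written side by side as follows. This is a reconstruction from the surrounding definitions, assuming that $F$ averages the finetuning loss over the $n$ examples in $\mathcal{X}$ in the same way that $G$ does for pretraining.

```latex
\begin{align}
\Theta^{*}
  &= \operatorname*{arg\,min}_{\Theta} \frac{1}{n}\sum_{x\in\mathcal{X}} f(x; W^{*}, \Theta)
  := \operatorname*{arg\,min}_{\Theta} F(W^{*},\Theta), \tag{2} \\
W^{**}, \Theta^{*}
  &= \operatorname*{arg\,min}_{W,\Theta} \frac{1}{n}\sum_{x\in\mathcal{X}} f(x; W, \Theta)
  := \operatorname*{arg\,min}_{W,\Theta} F(W,\Theta). \tag{3}
\end{align}
```

The only difference is the set of variables being optimized: in (2) the shared parameters stay at $W^{*}$, while in (3) they are free and $W^{*}$ enters only as the initialization.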

3.2 Influence function for multi-stage models

We derive the influence function for the multi-stage model to trace the influence of pretraining data on the finetuned model. Figure 1 illustrates the task we are interested in solving in this paper. Note that we use the same definition of the influence function as prior single-stage work and discuss how to compute it in the multi-stage training scenario. As discussed at the end of Section 3.1, depending on whether or not we update the shared parameters $W$ in the finetuning stage, we derive the influence functions under two different scenarios.

To compute the influence of pretraining data on the finetuning task, the main idea is to perturb one example in the pretraining data and study how that impacts the test data. Mathematically, if we perturb the loss of a pretraining example $z$ by a small $\epsilon$, the perturbed model $(\hat{W}_{\epsilon},\hat{U}_{\epsilon})$ can be defined as the minimizer of $G(W,U)+\epsilon\,g(z;W,U)$ over $W$ and $U$.

Note that different choices of $\epsilon$ change the loss function in different ways relative to the original problem in (1). For instance, setting $\epsilon=-\frac{1}{m}$ is equivalent to removing the sample $z$ from the pretraining dataset.
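The equivalence between $\epsilon=-\frac{1}{m}$ and leave-one-out retraining can be checked on a toy problem. The sketch below assumes a scalar parameter and the quadratic loss $g(z;w)=\frac{1}{2}(w-z)^2$, neither of which comes from the paper; the data values are hypothetical. It also shows the first-order estimate that the influence function provides.

```python
# Toy check: upweighting one example by eps = -1/m removes it from the objective.
# Assumed toy loss (not from the paper): g(z; w) = 0.5 * (w - z)**2, scalar w,
# so G(w) = (1/m) * sum_i g(z_i; w) is minimized by the sample mean.

def perturbed_minimizer(data, z, eps):
    """Minimizer of G(w) + eps * g(z; w) for the quadratic toy loss."""
    mean = sum(data) / len(data)
    # Stationarity condition: (w - mean) + eps * (w - z) = 0
    return (mean + eps * z) / (1.0 + eps)

data = [1.0, 2.0, 4.0, 9.0]          # hypothetical pretraining set
m = len(data)
z = data[3]                          # the example to remove

w_star = perturbed_minimizer(data, z, 0.0)
w_removed = perturbed_minimizer(data, z, -1.0 / m)
w_retrained = sum(data[:3]) / (m - 1)          # retrain without z

# First-order estimate: dw/deps at eps = 0 is -H^{-1} * grad g = -(w_star - z),
# because the Hessian of G is 1 for this loss.
influence_on_w = -(w_star - z)
w_first_order = w_star + (-1.0 / m) * influence_on_w

print(w_removed, w_retrained, w_first_order)
```

Here `w_removed` and `w_retrained` agree up to rounding, while `w_first_order` is only the linearized approximation; the gap between them shrinks as $m$ grows.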

For the finetuning stage, since we consider Case 1 where the embedding parameters $W$ are fixed in the finetuning stage, the new model for the end task is $\hat{\Theta}_{\epsilon}$, the minimizer of $F(\hat{W}_{\epsilon},\Theta)$ over $\Theta$.

The influence function that measures the impact of a small $\epsilon$ perturbation of $z$ on the finetuning loss on a test sample $x_{t}$ from the finetuning task is defined as $I_{z,x_{t}} := \left.\frac{\partial f(x_{t};\hat{W}_{\epsilon},\hat{\Theta}_{\epsilon})}{\partial\epsilon}\right|_{\epsilon=0} = \nabla_{\Theta}f(x_{t};W^{*},\Theta^{*})^{T} I_{z,\Theta} + \nabla_{W}f(x_{t};W^{*},\Theta^{*})^{T} I_{z,W}$, referred to as (6),

where $I_{z,\Theta} := \left.\frac{\partial\hat{\Theta}_{\epsilon}}{\partial\epsilon}\right|_{\epsilon=0}$ measures the influence of $z$ on the finetuning task parameters $\Theta$, and $I_{z,W} := \left.\frac{\partial\hat{W}_{\epsilon}}{\partial\epsilon}\right|_{\epsilon=0}$ measures how $z$ influences the pretrained model $W$. Therefore we can split the influence of $z$ on the test sample into two pieces: one is the impact of $z$ on the pretrained model, $I_{z,W}$, and the other is the impact of $z$ on the finetuned model, $I_{z,\Theta}$. It is worth mentioning that, due to linearity, if we want to estimate the influence function score of a set of test examples with respect to a set of pretraining examples, we can simply sum up the pairwise influence functions, and so define $I_{\{z^{(i)}\},\{x_{t}^{(j)}\}} := \sum_{i}\sum_{j} I_{z^{(i)},x_{t}^{(j)}}$,

where $\{z^{(i)}\}$ is a set of pretraining data and $\{x_{t}^{(j)}\}$ is the group of finetuning test data that we are targeting. Next we derive the two influence scores $I_{z,\Theta}$ and $I_{z,W}$ (see the detailed derivations in the appendix) in Theorem 1 below.

For the two-stage training procedure in (1) and (2), we have

where []W[\cdot]_{W} means taking the WW part of the vector.
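The closed forms can be recovered by implicit differentiation of the two optimality conditions. The block below is a reconstruction under the notation of this section, not a quotation of the theorem as typeset; it assumes the relevant Hessians are invertible and deliberately leaves the expressions unnumbered.

```latex
\begin{align}
I_{z,W} &:= \left.\frac{\partial \hat{W}_{\epsilon}}{\partial\epsilon}\right|_{\epsilon=0}
  = -\left[\left(\frac{\partial^{2} G(W^{*},U^{*})}{\partial (W,U)^{2}}\right)^{-1}
      \frac{\partial g(z;W^{*},U^{*})}{\partial (W,U)}\right]_{W}, \\
I_{z,\Theta} &:= \left.\frac{\partial \hat{\Theta}_{\epsilon}}{\partial\epsilon}\right|_{\epsilon=0}
  = \left(\frac{\partial^{2} F(W^{*},\Theta^{*})}{\partial \Theta^{2}}\right)^{-1}
    \frac{\partial^{2} F(W^{*},\Theta^{*})}{\partial \Theta\,\partial W}
    \left[\left(\frac{\partial^{2} G(W^{*},U^{*})}{\partial (W,U)^{2}}\right)^{-1}
      \frac{\partial g(z;W^{*},U^{*})}{\partial (W,U)}\right]_{W}, \\
I_{z,x_{t}} &= \left[\nabla_{\Theta} f(x_{t};W^{*},\Theta^{*})^{T}
    \left(\frac{\partial^{2} F(W^{*},\Theta^{*})}{\partial \Theta^{2}}\right)^{-1}
    \frac{\partial^{2} F(W^{*},\Theta^{*})}{\partial \Theta\,\partial W}
    - \nabla_{W} f(x_{t};W^{*},\Theta^{*})^{T}\right]
    \left[\left(\frac{\partial^{2} G(W^{*},U^{*})}{\partial (W,U)^{2}}\right)^{-1}
      \frac{\partial g(z;W^{*},U^{*})}{\partial (W,U)}\right]_{W}.
\end{align}
```

The last line is obtained by substituting the first two into the chain-rule decomposition of $I_{z,x_{t}}$. Note that the same pretraining inverse Hessian vector product appears in every term, so it only needs to be computed once per pretraining example.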

By plugging (9) and (10) into (6), we finally obtain the influence score $I_{z,x_{t}}$ of pretraining data $z$ on the finetuning test point $x_{t}$, referred to as (11).

The pseudocode for the influence function in (11) is shown in Algorithm 1.
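The steps of the Case 1 computation can be illustrated end to end on a toy problem where every quantity has a closed form. This is a minimal sketch, not the paper's Algorithm 1: the scalar losses, the data, and all function names are illustrative assumptions, and the Hessians reduce to scalars so no conjugate gradient solver is needed. The predicted score is compared against a finite-difference estimate obtained by actually re-solving both stages.

```python
# Toy two-stage problem (all losses and data are illustrative assumptions).
# Pretraining:  G(w)        = (1/m) * sum_i 0.5 * (w - z_i)**2
# Finetuning:   F(w, theta) = (1/n) * sum_j 0.5 * (theta * x_j + w - y_j)**2, w fixed

def pretrain(zs, z, eps):
    """Minimizer of G(w) + eps * 0.5 * (w - z)**2 (closed form)."""
    mean = sum(zs) / len(zs)
    return (mean + eps * z) / (1.0 + eps)

def finetune(xs, ys, w):
    """Minimizer of F(w, theta) over theta with w held fixed (closed form)."""
    return sum(x * (y - w) for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def loss_at_point(xt, yt, w, theta):
    return 0.5 * (theta * xt + w - yt) ** 2

def influence_score(zs, xs, ys, z, xt, yt):
    n = len(xs)
    w = pretrain(zs, z, 0.0)
    theta = finetune(xs, ys, w)
    # Pretraining IHVP: the Hessian of G is 1 and grad g(z; w) = w - z.
    ihvp_pre = (w - z) / 1.0
    i_w = -ihvp_pre                                # influence of z on w
    # Finetuning pieces: d2F/dtheta2 = mean(x^2), d2F/(dtheta dw) = mean(x).
    h_theta = sum(x * x for x in xs) / n
    cross = sum(xs) / n
    i_theta = cross * ihvp_pre / h_theta           # influence of z on theta
    # Chain rule on the test loss f(x_t; w, theta).
    residual = theta * xt + w - yt
    return residual * xt * i_theta + residual * i_w

zs = [0.0, 1.0, 2.0, 5.0]                          # hypothetical pretraining data
xs, ys = [1.0, 2.0, 3.0], [2.0, 3.0, 5.0]          # hypothetical finetuning data
z, xt, yt = 5.0, 0.5, 1.0

predicted = influence_score(zs, xs, ys, z, xt, yt)

def retrained_loss(e):
    """Test loss after re-solving both stages with z upweighted by e."""
    w_e = pretrain(zs, z, e)
    return loss_at_point(xt, yt, w_e, finetune(xs, ys, w_e))

eps = 1e-5
numerical = (retrained_loss(eps) - retrained_loss(-eps)) / (2 * eps)
print(predicted, numerical)
```

The two numbers agree closely because the toy objectives are exactly quadratic in the parameters; for deep networks the score is only a first-order estimate. A positive score means that upweighting $z$ increases the test loss.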

3.2.2 Case 2: embedding parameter $W$ is also updated in the finetuning stage

For the second finetuning case in (3), we further train the embedding parameter $W$ from the pretraining stage. When $W$ is also updated in the finetuning stage, it is challenging to characterize the influence, since the pretrained embedding $W^{*}$ is only used as an initialization. In general, the final model $(W^{**},\Theta^{*})$ may be totally unrelated to $W^{*}$; for instance, when the objective function is strongly convex, any initialization of $W$ in (3) will converge to the same solution.

However, in practice the initialization of $W$ strongly influences the finetuning stage in deep learning, since the finetuning objective is usually highly non-convex and training initialized at $W^{*}$ converges to a local minimum near $W^{*}$. Therefore, we propose to approximate the whole training procedure as (12): the pretraining stage minimizes $G(W,U)$ to obtain $\bar{W},\bar{U}$, and the finetuning stage minimizes $F(W,\Theta)$ plus a penalty, weighted by $\alpha$, on the distance between $W$ and $\bar{W}$ to obtain $W^{*},\Theta^{*}$,

where $\bar{W},\bar{U}$ are optimal for the pretraining stage, $W^{*},\Theta^{*}$ are optimal for the finetuning stage, and $0\leq\alpha\ll 1$ is a small value. This characterizes that in the finetuning stage we target a solution that minimizes $F(W,\Theta)$ and is close to $\bar{W}$. In this way, the pretrained parameters are connected with the finetuning task, and the influence of pretraining data on the finetuning task becomes tractable. The results in our experiments show that with this approximation, the computed influence score still reflects the real influence quite well.
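One way to write the approximation referred to as (12) is shown below. The squared Frobenius norm used for the proximity term is an assumption made for this sketch; the text only specifies that the finetuned solution minimizes $F$ while staying close to $\bar{W}$, with the trade-off controlled by $\alpha$.

```latex
\begin{align}
\bar{W}, \bar{U} &= \operatorname*{arg\,min}_{W,U} G(W,U), \\
W^{*}, \Theta^{*} &= \operatorname*{arg\,min}_{W,\Theta}
   \left\{ \alpha \,\lVert W - \bar{W} \rVert_{F}^{2} + F(W,\Theta) \right\}.
\end{align}
```

With a quadratic penalty of this kind, the optimality condition for $W$ contains a term linear in $\bar{W}$, which is what lets a perturbation of the pretraining solution propagate to $(W^{*},\Theta^{*})$ by implicit differentiation.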

Similarly, we can use $\frac{\partial\hat{\Theta}_{\epsilon}}{\partial\epsilon}$, $\frac{\partial\hat{W}_{\epsilon}}{\partial\epsilon}$, and $\frac{\partial\bar{W}_{\epsilon}}{\partial\epsilon}$ to measure the difference between the original optimal solutions in (12) and the optimal solutions under an $\epsilon$ perturbation of the pretraining example $z$. Similar to (6), the influence function $I_{z,x_{t}}$ that measures the influence of an $\epsilon$ perturbation of pretraining example $z$ on the loss of test sample $x_{t}$ is $I_{z,x_{t}} = \nabla_{\Theta}f(x_{t};W^{*},\Theta^{*})^{T}\left.\frac{\partial\hat{\Theta}_{\epsilon}}{\partial\epsilon}\right|_{\epsilon=0} + \nabla_{W}f(x_{t};W^{*},\Theta^{*})^{T}\left.\frac{\partial\hat{W}_{\epsilon}}{\partial\epsilon}\right|_{\epsilon=0}$.

The influence of a small perturbation of $G(W,U)$ on $\bar{W},W^{*},\Theta^{*}$ can be computed following the same approach as in Subsection 3.2.1, by substituting $\bar{W}$ for $W^{*}$ and $[\Theta^{*},W^{*}]$ for $\Theta^{*}$ in (9). This leads to the expressions (14) and (15) for these derivatives.

After plugging (14) and (15) into the expression for $I_{z,x_{t}}$ above, we obtain the influence function $I_{z,x_{t}}$.

Similarly, the algorithm for computing $I_{z,x_{t}}$ for Case 2 follows Algorithm 1 for Case 1, with the gradient computations replaced accordingly. The derivation shows that our proposed multi-stage influence function does not require model convexity.

3.3 Computation Challenges

The influence function computation for multi-stage models was presented in the previous section. As Algorithm 1 shows, the influence score computation involves many Hessian matrix operations, which are very expensive and sometimes unstable for large-scale models. We use several strategies to speed up the computation and make the scores more stable.

As we can see from Algorithm 1, our algorithm involves several inverse Hessian operations, which are known to be demanding in computation and memory. For a Hessian matrix $H$ of size $p\times p$, where $p$ is the number of parameters in the model, it requires $p\times p$ memory to store $H$ and $O(p^{3})$ operations to invert it. Therefore, for large deep learning models with thousands or even millions of parameters, it is almost impossible to invert the Hessian matrix explicitly. Similar to prior work on influence functions, we avoid explicitly computing and storing the Hessian matrix and its inverse, and instead compute the product of the inverse Hessian with a vector directly. More specifically, every time we need an inverse Hessian vector product $v=H^{-1}b$, we invoke conjugate gradients (CG), which transforms the linear system into a quadratic optimization problem $H^{-1}b\equiv\operatorname*{arg\,min}_{x}\{\frac{1}{2}x^{T}Hx-b^{T}x\}$. In each iteration of CG, instead of computing $H^{-1}b$ directly, we compute a Hessian vector product, which can be done efficiently by backpropagating through the model twice with $O(p)$ time complexity.
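The idea of solving $Hx=b$ with access only to Hessian vector products can be sketched as follows. This is a generic textbook conjugate gradient routine written in plain Python, not the authors' implementation; the `hvp` callback stands in for the double-backpropagation product, and here it is backed by a small explicit matrix purely so the example is self-contained.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(hvp, b, iters=50, tol=1e-10):
    """Solve H x = b for symmetric positive definite H, given only v -> H v."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - H x, with x = 0 initially
    p = list(r)                      # search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs ** 0.5 < tol:
            break
        hp = hvp(p)                  # the only place H is touched
        alpha = rs / dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Stand-in for a Hessian: a small SPD matrix (hypothetical values).
H = [[4.0, 1.0], [1.0, 3.0]]
hvp = lambda v: [dot(row, v) for row in H]

x = conjugate_gradient(hvp, [1.0, 2.0])
print(x)
```

In exact arithmetic CG terminates in at most $p$ iterations, but in practice a fixed budget of iterations is used and the result is treated as an approximation of $H^{-1}b$.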

The aforementioned conjugate gradient method requires the Hessian matrix to be positive definite. However, in practice the Hessian may have negative eigenvalues, since we run SGD and the final solution may not be exactly at a local minimum. To tackle this issue, we solve $\operatorname*{arg\,min}_{x}\{\frac{1}{2}x^{T}H^{2}x-b^{T}Hx\}$, referred to as (16),

whose solution can be shown to be the same as that of $\operatorname*{arg\,min}_{x}\{\frac{1}{2}x^{T}Hx-b^{T}x\}$ since the Hessian matrix is symmetric. $H^{2}$ is guaranteed to be positive definite as long as $H$ is invertible, even when $H$ has negative eigenvalues. If $H^{2}$ is not ill-conditioned, we can solve (16) directly. The rate of convergence of CG depends on $\frac{\sqrt{\kappa(H^{2})}-1}{\sqrt{\kappa(H^{2})}+1}$, where $\kappa(H^{2})$ is the condition number of $H^{2}$, which can be very large if $H^{2}$ is ill-conditioned. When $H^{2}$ is ill-conditioned, to stabilize the solution and to encourage faster convergence, we add a small damping term $\lambda$ on the diagonal and solve $\operatorname*{arg\,min}_{x}\{\frac{1}{2}x^{T}(H^{2}+\lambda I)x-b^{T}Hx\}$.
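The squared-Hessian formulation amounts to running CG on the linear system $(H^{2}+\lambda I)x=Hb$, which needs two Hessian vector products per iteration. The sketch below illustrates this on a small indefinite matrix; the matrix, the function name `ihvp_indefinite`, and the damping value are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ihvp_indefinite(hvp, b, damping=0.0, iters=100, tol=1e-12):
    """Approximate H^{-1} b by running CG on (H^2 + damping * I) x = H b."""
    def matvec(v):
        hhv = hvp(hvp(v))                      # two Hessian vector products
        return [a + damping * c for a, c in zip(hhv, v)]

    x = [0.0] * len(b)
    r = hvp(b)                                 # right-hand side H b (x starts at 0)
    p = list(r)
    rs = dot(r, r)
    for _ in range(iters):
        if rs ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Symmetric but indefinite stand-in Hessian: eigenvalues are 3 and -1.
H = [[1.0, 2.0], [2.0, 1.0]]
hvp = lambda v: [dot(row, v) for row in H]

exact = ihvp_indefinite(hvp, [1.0, 0.0])                 # no damping
damped = ihvp_indefinite(hvp, [1.0, 0.0], damping=1e-3)  # small bias, better conditioning
print(exact, damped)
```

Without damping the result matches $H^{-1}b$; with damping the solution is slightly biased, and the bias is largest along eigen-directions whose eigenvalues are small in magnitude. Squaring also squares the condition number, which is why damping tends to matter more here than in the positive definite case.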

As mentioned above, a Hessian vector product for a single example can be computed in $O(p)$ time if the Hessian has size $p\times p$. To analyze the time complexity of Algorithm 1, assume there are $p_{1}$ parameters in our pretrained model and $p_{2}$ parameters in our finetuned model. It takes $O(mp_{1})$ or $O(np_{2})$ time to compute a Hessian vector product over the full dataset, where $m$ is the number of pretraining examples and $n$ is the number of finetuning examples. For the two inverse Hessian vector products in Algorithm 1, the time complexity is therefore $O(np_{2}r)$ and $O(mp_{1}r)$, where $r$ is the number of CG iterations. For the other operations in Algorithm 1, a vector product has a time complexity of $O(p_{1})$ or $O(p_{2})$, and computing the gradients of all pretraining examples has a complexity of $O(mp_{1})$. So the total time complexity of computing a multi-stage influence score is $O((mp_{1}+np_{2})r)$. The computation is therefore tractable, as it is linear in the number of training samples and model parameters. All computation involving the inverse Hessian can use inverse Hessian vector products (IHVPs), which keeps memory usage and computation efficient.

Experiments

In this section, we conduct experiments on real datasets in both vision and NLP tasks to show the effectiveness of our proposed method.

We first evaluate the effectiveness of our proposed approach for the estimation of the influence function. For this purpose, we build two CNN models based on the CIFAR-10 and MNIST datasets. The model structures are shown in Table A in the Appendix. For both MNIST and CIFAR-10 models, CNN layers are used as embeddings and fully connected layers are task-specific. At the pretraining stage, we train the models with examples from two classes (“bird” vs. “frog”) for CIFAR-10 and four classes (0, 1, 2, and 3) for MNIST. The resulting embedding is used in the finetuning tasks, where we finetune the model with the examples from the remaining eight classes in CIFAR-10 or the other six digits in MNIST for classification.

We test the correlation between individual pretraining example’s multi-stage influence function and the real loss difference when the pretraining examples are removed. We test two cases (as mentioned in Section 3.1) – where the pretrained embedding is fixed, and where it is updated during finetuning. For a given example in the pretraining data, we calculate its influence function score with respect to each test example in the finetuning task test set using the method presented in Section 3. To evaluate this pretraining example’s contribution to the overall performance of the model, we sum up the influence function scores across the whole test set in the finetuning task.

To validate the score, we remove that pretraining example and go through the aforementioned process again by updating the model. Then we run a linear regression between the true loss difference values obtained and the influence score computed to show their correlation. The detailed hyperparameters used in these experiments are presented in Appendix B.
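The validation step boils down to computing a correlation between two lists: the predicted influence scores and the loss differences measured after retraining. A minimal sketch of the Pearson correlation is given below; the numbers are made-up placeholders for illustration only and are not results from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, one entry per removed pretraining example.
predicted_influence = [0.10, -0.05, 0.30, 0.02, -0.20]
true_loss_difference = [0.08, -0.02, 0.25, 0.05, -0.15]

r = pearson_r(predicted_influence, true_loss_difference)
print(round(r, 3))
```

Because Pearson's $r$ is invariant to rescaling either variable, it measures whether the predicted scores rank and space the examples like the true loss differences, not whether the magnitudes match.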

Embedding is fixed. In Figures 2(a) and 2(b) we show the correlation results of the CIFAR-10 and MNIST models when the embedding is fixed during finetuning. From Figures 2(a) and 2(b) we can see that there is a linear correlation between the true loss difference and the influence function scores obtained. The correlation is evaluated with Pearson’s $r$ value. It is almost impossible to get an exact linear correlation because the influence function is based on the first-order conditions (gradients equal to zero) of the loss function, which may not hold in practice. Prior work on single-stage influence functions reports an $r$ value of around 0.8, but that correlation is based on a single model with a single data source, whereas we consider a much more complex case with two models and two data sources: the relationship between pretraining data and the finetuning loss function. We therefore expect a lower $r$ value, and an $r$ value of about 0.6 is reasonable evidence of a strong correlation between the pretraining data’s influence score and the finetuning loss difference. This supports our argument that we can use this score to detect the examples in the pretraining set that contribute most to the model’s performance.

One may doubt the effectiveness of the expensive inverse Hessian computation in our formulation. As a comparison, we replace all inverse Hessians in (11) with identity matrices to compute the influence function score for the MNIST model. The results are shown in Figure 3, with a much smaller Pearson’s $r$ of 0.17. This again shows the effectiveness of our proposed influence function.

Embedding is updated in finetuning. In practice, the embedding can also be updated in the finetuning process. In Figure 1(c) we show the correlation between the true loss difference and the influence function score values using (12). We can see that even under this challenging condition, our multi-stage influence function from (12) still has a strong correlation with the true loss difference, with a Pearson’s $r=0.40$.

In Figure 4 we show misclassified test images in the finetuning task and the corresponding pretraining images with the largest positive influence scores (meaning most influential). Examples with a large positive influence score are expected to have a negative effect on the model’s performance, since intuitively the loss of the test example increases when they are added to the pretraining dataset. From Figure 4 we can indeed see that the identified examples are of low quality, and they can easily be misclassified even by human eyes.

4.2 Data Cleansing using Predicted Influence Score

The pretraining examples with large positive influence scores are the ones that increase the loss function value, which indicates negative transfer. Based on the computed influence scores, we can therefore mitigate negative transfer. We perform an experiment on the CIFAR-10 dataset with the same setting as Section 4.1. After removing the 10% of pretraining (source) examples with the highest positive influence scores, the accuracy on the target data improves from 58.15% to 58.36%. As a reference, if we randomly remove 10% of the pretraining data, the accuracy drops to 58.08%. Note that the influence score computation is fast. For example, on the CIFAR-10 dataset, computing the influence function with respect to all pretraining data takes 230 seconds on a single Tesla V100 GPU, with 200 iterations of conjugate gradient for the 2 IHVPs in (9), (10) and (11).
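The cleansing rule described above can be expressed as a simple filter over the per-example scores. The sketch below is an assumption about how such a filter might look, not the authors' code: the function name, the choice to drop only strictly positive scores, and the example values are all hypothetical.

```python
def indices_to_keep(scores, fraction=0.10):
    """Drop the top `fraction` of examples by influence score, keeping the rest.

    Only examples whose score is strictly positive are dropped, since a
    positive score is the signal for a harmful pretraining example.
    """
    k = int(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = {i for i in ranked[:k] if scores[i] > 0}
    return [i for i in range(len(scores)) if i not in dropped]

# Hypothetical summed influence scores, one per pretraining example.
scores = [0.5, -0.1, 0.2, 0.9, -0.3, 0.0, 0.1, -0.7, 0.4, 0.05]

keep = indices_to_keep(scores, fraction=0.2)
print(keep)
```

The retained indices would then be used to rebuild the pretraining set before rerunning both training stages.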

4.3 The Finetuning Task’s Similarity to the Pretraining Task

In this experiment, we explore the relationship between the influence function score and the similarity of the finetuning task to the pretraining task. Specifically, we study whether the influence function score increases in absolute value if the finetuning task is very similar to the pretraining task. To do this, we use the CIFAR-10 embedding obtained from a “bird vs. frog” classification and test its influence function scores on two finetuning tasks. Finetuning task A is exactly the same as the pretraining “bird vs. frog” classification, while finetuning task B is a classification on two other classes (“automobile vs. deer”). All hyperparameters used in the two finetuning tasks are the same. In Figure 5, for both tasks we plot the distribution of the influence function values with respect to each pretraining example, where the influence score is summed over all test examples. We can see that the influence function for the first finetuning task has much larger absolute values than that of the second task. The average absolute value of the task A influence function score is 0.055, much larger than that of task B, which is 0.025. This supports the argument that if the pretraining task and the finetuning task are similar, the pretraining data has a larger influence on the finetuning task performance.

4.4 Influence Function Score with Different Numbers of Finetuning Examples

We also study the relationship between the influence function scores and the number of examples used in finetuning. In this experiment, we update the pretrained embedding in the finetuning stage. We use the same pretraining and finetuning tasks as in Section 4.1. The results are presented in Figure 6. Model C is the model used in Section 4.1, while in model D we triple the number of finetuning examples as well as the number of finetuning steps. Figure 6 shows the distribution of each pretraining example’s influence function score over the whole test set. The average absolute value of the influence function score in model D is 0.15, much less than that of model C. This indicates that with more finetuning examples and more finetuning steps, the influence of pretraining data on the finetuned model’s performance decreases. This makes sense: if the finetuning data does not carry sufficient information to train a good model for the finetuning task, then the pretraining data has more impact on it.

4.5 Quantitative Results on NLP Task

In this section we show the application of our proposed method to an NLP task. In this experiment, the pretraining task is training an ELMo model on the one-billion-word (OBW) dataset, which contains 30 million sentences and 8 million unique words. The final pretrained ELMo model contains 93.6 million parameters. The finetuning task is a binary sentiment classification task on the First GOP Debate Twitter Sentiment data (https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment), which contains 16,654 tweets about the early August GOP debate in Ohio. The finetuned model uses the original pretrained ELMo embedding and a feed-forward neural network with hidden size 64 to build the classifier. The embedding is fixed in the finetuning task. To show quantitative results, we randomly pick a test sentence from the finetuning task, and sample a subset of 1000 sentences from the one-billion-word dataset to measure the influence of these pretraining sentences on the test sentence. In Table 1 we show examples of test sentences and the pretraining sentences with the largest and the smallest absolute influence function score values. Note that the computation time for this large-scale experiment (the model contains 93.6 million parameters) is reasonable: each pretraining data point takes 0.94 seconds on average to compute its influence score. For extremely large models and datasets, the computation can be further sped up with a parallel algorithm, as each data point’s influence computation is independent.

Conclusion

We introduce a multi-stage influence function for two multi-stage training setups: 1) the pretrained embedding is fixed during finetuning, and 2) the pretrained embedding is updated during finetuning. Our experimental results on CV and NLP tasks show strong correlation between the score of an example, computed from the proposed multi-stage influence function, and the true loss difference when the example is removed from the pretraining data. We believe our multi-stage influence function is a promising approach to connect the performance of a finetuned model with pretraining data.

References

Appendix A Proof of Theorem 1

Since $\hat{\Theta}_{\epsilon}$, $\hat{U}_{\epsilon}$, $\hat{W}_{\epsilon}$ are optimal solutions, they satisfy the following optimality conditions:

where $\partial(W,U)$ means concatenating $W$ and $U$ as $[W,U]$ and computing the gradient w.r.t. $[W,U]$. We define the changes of parameters as $\Delta W_{\epsilon}=\hat{W}_{\epsilon}-\hat{W}$, $\Delta\Theta_{\epsilon}=\hat{\Theta}_{\epsilon}-\hat{\Theta}$, and $\Delta U_{\epsilon}=\hat{U}_{\epsilon}-\hat{U}$. Here $\hat{W}$, $\hat{\Theta}$, $\hat{U}$ denote the solutions of the unperturbed problem. Applying Taylor expansion to the rhs of (18) we get

Since $W^{*},U^{*}$ are optimal for the unperturbed problem, $\frac{\partial}{\partial(W,U)}G(W^{*},U^{*})=0$, and we have

Since $\epsilon\rightarrow 0$, we have the further approximation

Similarly, based on (17) and applying first-order Taylor expansion to its rhs, we have

where $[\cdot]_{W}$ means taking the $W$ part of the vector. Therefore,
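The chain of steps described in this proof can be written out as follows. This is a reconstruction from the prose above, using the unperturbed solutions $W^{*},U^{*},\Theta^{*}$ as the expansion point and assuming both Hessians are invertible; only the numbers (17) and (18) are taken from the references in the text.

```latex
\begin{align}
0 &= \frac{\partial}{\partial \Theta} F(\hat{W}_{\epsilon}, \hat{\Theta}_{\epsilon}), \tag{17} \\
0 &= \frac{\partial}{\partial (W,U)} G(\hat{W}_{\epsilon}, \hat{U}_{\epsilon})
   + \epsilon \frac{\partial}{\partial (W,U)} g(z; \hat{W}_{\epsilon}, \hat{U}_{\epsilon}). \tag{18}
\end{align}
% First-order expansion of (18) around (W^*, U^*), using dG/d(W,U) = 0 there
% and dropping terms of order \epsilon \Delta:
\begin{align*}
0 &\approx \epsilon \frac{\partial g(z; W^{*}, U^{*})}{\partial (W,U)}
   + \frac{\partial^{2} G(W^{*}, U^{*})}{\partial (W,U)^{2}}
     \begin{bmatrix} \Delta W_{\epsilon} \\ \Delta U_{\epsilon} \end{bmatrix}
\;\Longrightarrow\;
\begin{bmatrix} \Delta W_{\epsilon} \\ \Delta U_{\epsilon} \end{bmatrix}
 \approx -\epsilon \left(\frac{\partial^{2} G(W^{*}, U^{*})}{\partial (W,U)^{2}}\right)^{-1}
   \frac{\partial g(z; W^{*}, U^{*})}{\partial (W,U)}. \\
% First-order expansion of (17) around (W^*, \Theta^*), using dF/d\Theta = 0 there:
0 &\approx \frac{\partial^{2} F(W^{*}, \Theta^{*})}{\partial \Theta\, \partial W} \Delta W_{\epsilon}
   + \frac{\partial^{2} F(W^{*}, \Theta^{*})}{\partial \Theta^{2}} \Delta \Theta_{\epsilon}
\;\Longrightarrow\;
\Delta \Theta_{\epsilon}
 \approx -\left(\frac{\partial^{2} F(W^{*}, \Theta^{*})}{\partial \Theta^{2}}\right)^{-1}
   \frac{\partial^{2} F(W^{*}, \Theta^{*})}{\partial \Theta\, \partial W} \Delta W_{\epsilon}.
\end{align*}
```

Dividing $\Delta W_{\epsilon}$ and $\Delta\Theta_{\epsilon}$ by $\epsilon$ and letting $\epsilon\rightarrow 0$ gives $I_{z,W}$ and $I_{z,\Theta}$; the two minus signs cancel in $I_{z,\Theta}$ once the expression for $\Delta W_{\epsilon}$ is substituted.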

Appendix B Models and Hyperparameters for the Experiments in Sections 4.1, 4.2, 4.3 and 4.4

The model structures we used in Sections 4.1, 4.2, 4.3 and 4.4 are listed in Table A. As mentioned in the main text, for all models, CNN layers are used as embeddings and fully connected layers are task-specific. The number of neurons on the last fully connected layer is determined by the number of classes in the classification. There is no activation at the final output layer and all other activations are Tanh.

For the MNIST experiments in Section 4.1 with the embedding fixed, we train a four-class classification (0, 1, 2, and 3) in pretraining. All examples in the original MNIST training set with these four labels are used in pretraining. The finetuning task is to classify the remaining six classes, and we subsample only 5000 examples to finetune. The pretrained embedding is fixed in finetuning. We run the Adam optimizer in both pretraining and finetuning with a batch size of 512. The pretrained and finetuned models are trained to convergence. When validating the influence function score, we remove an example from the pretraining dataset. Then we re-run the pretraining and finetuning process with this leave-one-out pretraining dataset, starting from the original models’ weights. In this process, we only run 100 steps for pretraining and finetuning, as the models converge. When computing the influence function scores, the damping terms for the pretrained and finetuned models’ Hessians are $1\times 10^{-2}$ and $1\times 10^{-8}$, respectively. We sample 1000 pretraining examples when computing the pretrained model’s Hessian summation.

For the CIFAR experiments with the embedding fixed, we train a two-class classification (“bird” vs. “frog”) in pretraining. All examples in the original CIFAR training set with these two labels are used in pretraining. The finetuning task is to classify the remaining eight classes, and we subsample only 5000 examples to finetune. The pretrained embedding is fixed in finetuning. We run the Adam optimizer to train both the pretrained and finetuned models with a batch size of 128. The pretrained and finetuned models are trained to convergence. When validating the influence function score, we remove an example from the pretraining dataset. Then we re-run the pretraining and finetuning process with this leave-one-out pretraining dataset, starting from the original models’ weights. In this process, we only run 6000 steps for pretraining and 3000 steps for finetuning. When computing the influence function scores, the damping terms for the pretrained and finetuned models’ Hessians are $1\times 10^{-8}$ and $1\times 10^{-6}$, respectively. The same hyperparameters are used in the experiments in Sections 4.3 and 4.4. We also use these hyperparameters in the CIFAR-10 experiments with the embedding updated, except that the pretrained embedding is updated in finetuning and the number of finetuning steps is reduced to 1000 in validation. The $\alpha$ constant in Equation 15 is chosen as 0.01. We sample 1000 pretraining examples when computing the pretrained model’s Hessian summation.