Detecting and Correcting for Label Shift with Black Box Predictors
Zachary C. Lipton, Yu-Xiang Wang, Alex Smola
Introduction
Assume that in August we train a pneumonia predictor. Our features consist of chest X-rays administered in the previous year (distribution $P$) and the labels are binary indicators of whether a physician diagnoses the patient with pneumonia. We train a model $f$ to predict pneumonia given an X-ray image. Assume that in the training set of patients have pneumonia. We deploy $f$ in the clinic and for several months, it reliably predicts roughly positive.
Fast-forward to January (distribution $Q$): Running $f$ on the last week’s data, we find that of patients are predicted to have pneumonia! Because $f$ remains fixed, the shift must owe to a change in the marginal $p(x)$, violating the familiar iid assumption. Absent familiar guarantees, we wonder: Is $f$ still accurate? What’s the real current rate of pneumonia? Shouldn’t our classifier, trained under an obsolete prior, underestimate pneumonia when uncertain? Thus, we might suspect that the real prevalence is greater than .
Given only labeled training data and unlabeled test data, we desire to: (i) detect distribution shift, (ii) quantify it, and (iii) correct our model to perform well on the new data. Absent assumptions on how $p(y, x)$ changes, the task is impossible. However, under assumptions about what $P$ and $Q$ have in common, we can still make headway. Two candidates are covariate shift (where $p(y|x)$ does not change) and label shift (where $p(x|y)$ does not change). Schölkopf et al. (2012) observe that covariate shift corresponds to causal learning (predicting effects), and label shift to anticausal learning (predicting causes).
We focus on label shift, motivated by diagnosis (diseases cause symptoms) and recognition tasks (objects cause sensory observations). During a pneumonia outbreak, $p(y|x)$ (e.g. flu given cough) might rise but the manifestations of the disease $p(x|y)$ might not change. Formally, under label shift, we can factorize the target distribution as
$$q(x, y) = q(y)\, p(x|y).$$
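A short worked step, as a sketch: writing the label shift assumption as $q(x \mid y) = p(x \mid y)$ and using $w(y)$ as shorthand for the label ratio, the joint density ratio depends on the label alone, which is what makes importance weighting by label possible:

```latex
\frac{q(x, y)}{p(x, y)}
  = \frac{q(y)\, q(x \mid y)}{p(y)\, p(x \mid y)}
  = \frac{q(y)}{p(y)}
  =: w(y),
\qquad\text{so}\qquad
\mathbb{E}_{q}\big[\ell(f(x), y)\big] = \mathbb{E}_{p}\big[w(y)\, \ell(f(x), y)\big].
```

Here $\ell$ stands for any loss function; it is introduced only for this illustration, and the identity needs $p(y) > 0$ wherever $q(y) > 0$.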
We introduce Black Box Shift Estimation (BBSE) to estimate label shift using a black box predictor $f$. BBSE estimates the ratios $w_l = q(y_l)/p(y_l)$ for each label $l$, requiring only that the expected confusion matrix is invertible. (For degenerate confusion matrices, a variant using soft predictions may be preferable.) We estimate $\hat{w}$ by solving a linear system $Ax = b$ where $A$ is the confusion matrix of $f$ estimated on training data (from $P$) and $b$ is the average output of $f$ calculated on test samples (from $Q$). We make the following contributions:
Applications of BBSE to statistical tests for detecting label shift.
Model correction through importance-weighted Empirical Risk Minimization.
A comprehensive empirical validation of BBSE.
Compared to approaches based on Kernel Mean Matching (KMM) (Zhang et al., 2013), EM (Chan & Ng, 2005), and Bayesian inference (Storkey, 2009), BBSE offers the following advantages: (i) Accuracy does not depend on data dimensionality; (ii) Works with arbitrary black box predictors, even biased, uncalibrated, or inaccurate models; (iii) Exploits advances in deep learning while retaining theoretical guarantees: better predictors provably lower sample complexity; and (iv) Due to generality, could be a standard diagnostic / corrective tool for arbitrary ML models.
Prior Work
Despite its wide applicability, learning under label shift with unknown $q(y)$ remains curiously under-explored. Noting the difficulty of the problem, Storkey (2009) proposes placing a (meta-)prior over $p(y)$ and inferring the posterior distribution from unlabeled test data. Their approach requires explicitly estimating $p(x|y)$, which may not be feasible in high-dimensional datasets. Chan & Ng (2005) infer $q(y)$ using EM but their method also requires estimating $p(x|y)$. Schölkopf et al. (2012) articulate connections between label shift and anti-causal learning and Zhang et al. (2013) extend the kernel mean matching approach due to Gretton et al. (2009) to the label shift problem. When $q(y)$ is known, label shift simplifies to the problem of changing base rates (Bishop, 1995; Elkan, 2001). Previous methods require estimating $q(x)$, $q(x)/p(x)$, or $p(x|y)$, often relying on kernel methods, which scale poorly with dataset size and underperform on high-dimensional data.
Covariate shift, also called sample selection bias, is well-studied (Zadrozny, 2004; Huang et al., 2007; Sugiyama et al., 2008; Gretton et al., 2009). Shimodaira (2000) proposed correcting models via weighting examples in ERM by $q(x)/p(x)$. Later works estimate importance weights from the available data, e.g., Gretton et al. (2009) propose kernel mean matching to re-weight training points.
The earliest relevant work to ours comes from econometrics and addresses the use of non-random samples to estimate behavior. Heckman (1977) addresses sample selection bias, while Manski & Lerman (1977) investigate estimating parameters under choice-based and endogenous stratified sampling, cases analogous to a shift in the label distribution. Also related, Rosenbaum & Rubin (1983) introduce propensity scoring to design unbiased experiments. Finally, we note a connection to cognitive science work showing that humans classify items differently depending on other items they appear alongside (Zhu et al., 2010).
Post-submission, we learned of antecedents for our estimator in epidemiology (Buck et al., 1966), later revisited by Forman (2008) and Saerens et al. (2002). These papers do not develop our theoretical guarantees or explore the modern ML setting where $x$ is massively higher-dimensional than $y$, bolstering the value of dimensionality reduction.
Problem setup
In general, this task is impossible, since $P$ and $Q$ might not share support. This paper considers extra assumptions:
A.1 The label shift (also known as target shift) assumption: $p(x|y) = q(x|y)$ for all $x, y$.
A.2 For every $y \in \mathcal{Y}$ with $q(y) > 0$ we require $p(y) > 0$. This assumes the absolute continuity of the (hidden) target label’s distribution with respect to the source’s, i.e. $dq(y)/dp(y)$ exists.
A.3 Access to a black box predictor $f: \mathcal{X} \to \mathcal{Y}$ where the expected confusion matrix is invertible.
We now comment on the assumptions. A.1 corresponds to anti-causal learning. This assumption is strong but reasonable in many practical situations, including medical diagnosis, where diseases cause symptoms. It also applies when classifiers are trained on non-representative class distributions: Note that while vision systems are commonly trained with balanced classes (Deng et al., 2009), the true class distribution for real tasks is rarely uniform.
Assumption A.2 addresses identifiability, requiring that the target label distribution’s support be a subset of the training distribution’s. For discrete $\mathcal{Y}$, this simply means that the training data should contain examples from every class.
Assumption A.3 requires that the expected predictor outputs for each class be linearly independent. This assumption holds in the typical case where the classifier predicts class $y_i$ more often given images that actually belong to $y_i$ than given images from any other class $y_j$. In practice, $f$ could be a neural network, a boosted decision-tree or any other classifier trained on a holdout training data set. We can verify at training time that the empirical estimated normalized confusion matrix is invertible. Assumption A.3 generalizes naturally to soft-classifiers, where $f$ outputs a probability distribution supported on $\mathcal{Y}$. Thus BBSE can be applied even when the confusion matrix is degenerate.
Main results
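We now derive the main results for estimating $w(y)$ and $q(y)$. Assumption A.1 has the following implication:
Lemma 1. Denote by $\hat{y} = f(x)$ the output of a fixed function $f: \mathcal{X} \to \mathcal{Y}$. If Assumption A.1 holds, then
$$q(\hat{y}|y) = p(\hat{y}|y).$$
The proof is simple: recall that $\hat{y}$ depends on $y$ only via $x$. By A.1, $p(x|y) = q(x|y)$ and thus $q(\hat{y}|y) = p(\hat{y}|y)$.
Next, combine the law of total probability and Lemma 1 and we arrive at
$$q(\hat{y}) = \sum_{y \in \mathcal{Y}} q(\hat{y}|y)\, q(y) = \sum_{y \in \mathcal{Y}} p(\hat{y}|y)\, q(y) = \sum_{y \in \mathcal{Y}} p(\hat{y}, y)\, \frac{q(y)}{p(y)}. \qquad (1)$$
We estimate $\hat{p}(\hat{y}|y)$ and $\hat{p}(\hat{y}, y)$ using $f$ and data from source distribution $P$, and $\hat{q}(\hat{y})$ with unlabeled test data drawn from target distribution $Q$. This leads to a novel method-of-moments approach for consistent estimation of the shifted label distribution $q(y)$ and the weights $w = q(y)/p(y)$.
We can now rewrite Equation (1) in matrix form:
$$\mu_{\hat{y}} = \mathbf{C}_{\hat{y}|y}\, \mu_y = \mathbf{C}_{\hat{y}, y}\, w.$$
Using plug-in maximum likelihood estimates of the above quantities yields the estimators
$$\hat{w} = \hat{\mathbf{C}}_{\hat{y}, y}^{-1}\, \hat{\mu}_{\hat{y}} \quad \text{and} \quad \hat{\mu}_y = \mathrm{diag}(\hat{\nu}_y)\, \hat{w},$$
where $\hat{\nu}_y$ is the plug-in estimator of $\nu_y$, the source label distribution $p(y)$.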
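As a minimal sketch of the plug-in estimator, assuming hard predictions encoded as integers in {0, ..., k-1} and NumPy as the numerical library, the weights can be obtained by solving the linear system built from the held-out confusion matrix and the mean test prediction. The function name `bbse_weights` and its arguments are illustrative and do not come from the paper.

```python
import numpy as np

def bbse_weights(y_val, yhat_val, yhat_test, k):
    """Estimate w[l] = q(y=l) / p(y=l) from hard black box predictions.

    y_val, yhat_val: true labels and predictions on held-out source data.
    yhat_test: predictions on unlabeled target data.
    """
    y_val = np.asarray(y_val)
    yhat_val = np.asarray(yhat_val)
    yhat_test = np.asarray(yhat_test)
    n = len(y_val)

    # Joint confusion matrix: C[i, j] estimates p(yhat = i, y = j).
    C = np.zeros((k, k))
    np.add.at(C, (yhat_val, y_val), 1.0)
    C /= n

    # mu[i] estimates q(yhat = i) from the unlabeled target predictions.
    mu = np.bincount(yhat_test, minlength=k) / len(yhat_test)

    # Solve C w = mu; raises numpy.linalg.LinAlgError if C is singular.
    w = np.linalg.solve(C, mu)

    # nu[j] estimates p(y = j); the target label distribution is nu * w.
    nu = np.bincount(y_val, minlength=k) / n
    return w, nu * w
```

Solving the system directly, rather than forming the inverse explicitly, is a numerical choice made for this sketch; the estimate itself is the same.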
Next, we establish that the estimators are consistent.
Proposition 2. If Assumptions A.1, A.2, A.3 are true, then as $n, m \to \infty$, $\hat{w} \to w$ and $\hat{\mu}_y \to \mu_y$ almost surely.
The proof (see Appendix B) uses the First Borel-Cantelli Lemma to show that the probability that the entire sequence of empirical confusion matrices with data size $n+1, \dots, \infty$ are simultaneously invertible converges to $1$, thereby enabling us to use the continuous mapping theorem after applying the strong law of large numbers to each component.
We now address our estimators’ convergence rates.
Assume that A.3 holds robustly. Let be the smallest eigenvalue of . There exists a constant such that for all , with probability at least we have
The bounds give practical insights (explored more in Section 7). In (2), the square error depends on the sample size and is proportional to $1/n$ (or $1/m$). There is also a $\|w\|^2$ term that reflects how different the source and target distributions are. In addition, $\sigma_{\min}$ reflects the quality of the given classifier $f$. For example, if $f$ is a perfect classifier, then $\sigma_{\min} = \min_{y \in \mathcal{Y}} p(y)$. If $f$ cannot distinguish between certain classes at all, then $\mathbf{C}_{\hat{y}, y}$ will be low-rank, $\sigma_{\min} = 0$, and the technique is invalid, as expected.
We now parse the error bound of in (3). The first term is required even if we observe the importance weight exactly. The second term captures the additional error due to the fact that we estimate with predictor . Note that and can be as small as when is uniform. Note that when correctly classifies each class with the same probability, e.g. , then is a constant and the bound cannot be improved.
Assumption A.2 ensures that .
By completing the square and applying the Cauchy-Schwarz inequality,
Substitute into the above inequality and use (1) we get
We now provide a high probability bound on the Euclidean norm of , the operator norm of , which will give us an operator norm bound of and under our assumption on , and these will yield a high probability bound on the square estimation error.
Take and use the assumption that (which holds under our assumption on since ). Then with probability at least
Using the assumption on , we have
Also, we have .
Euclidean norm of . Note that . By the standard Hoeffding’s inequality and union bound argument, we have that with probability larger than
By Hoeffding’s inequality with probability larger than . Combining with (5) yields
which holds with probability . ∎
Application of the results
Formally, detection can be cast as a hypothesis testing problem where the null hypothesis is $H_0: q(y) = p(y)$ and the alternative hypothesis is $H_1: q(y) \neq p(y)$. Recall that we observe neither $q(y)$ nor any samples from it. However, we do observe unlabeled data from the target distribution and our predictor $f$.
Under Assumptions A.1 and A.2, and for each classifier $f$ satisfying A.3, we have that $q(y) = p(y)$ if and only if $p(\hat{y}) = q(\hat{y})$.
Plug and into (1) and apply Lemma 1 with assumption A.1. The result follows directly from our analysis in the proof of Proposition 2 that shows is invertible under the assumptions A.2 and A.3. ∎
Thus, under weak assumptions, we can test the null hypothesis by running two-sample tests on readily available samples from $p(\hat{y})$ and $q(\hat{y})$. Examples include the Kolmogorov-Smirnov test, Anderson-Darling or the Maximum Mean Discrepancy. In all tests, asymptotic distributions are known and we can almost perfectly control the Type I error. The power of the test ($1 -$ Type II error) depends on the classifier’s performance on distribution $P$, thereby allowing us to leverage recent progress in deep learning to attack the classic problem of detecting non-stationarity in the data distribution.
One could also test whether $p(x) = q(x)$. Under the label-shift assumption this is implied by $q(y) = p(y)$. The advantage of testing the distribution of $f(x)$ instead of $x$ is that we only need to deal with a one-dimensional distribution. Per theory and experiments in Ramdas et al. (2015), two-sample tests in high dimensions are exponentially harder.
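The detection step can be sketched as a two-sample Kolmogorov-Smirnov test on the one-dimensional predictor outputs. The version below is a hedged, NumPy-only illustration: the function name `ks_two_sample` is hypothetical, the p-value uses the standard asymptotic Kolmogorov series, and with heavily tied (discrete) predictions that approximation tends to be conservative.

```python
import numpy as np

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])

    # Empirical CDFs of both samples evaluated at every observed value.
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    d = float(np.max(np.abs(cdf_a - cdf_b)))

    # Asymptotic Kolmogorov tail: 2 * sum_j (-1)^(j-1) * exp(-2 j^2 lam^2).
    lam = np.sqrt(len(a) * len(b) / (len(a) + len(b))) * d
    if lam < 0.2:
        # The truncated series is unreliable for tiny lam; the tail is ~1 there.
        return d, 1.0
    j = np.arange(1, 101)
    p = 2.0 * np.sum((-1.0) ** (j - 1) * np.exp(-2.0 * (j * lam) ** 2))
    return d, float(min(max(p, 0.0), 1.0))

# Usage (names are placeholders): compare predictions on held-out source data
# with predictions on unlabeled target data.
# d, p_value = ks_two_sample(yhat_source, yhat_target)
```

A small p-value is evidence against the null hypothesis of no shift; the significance threshold is left to the practitioner.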
One surprising byproduct is that we can sometimes use this approach to detect covariate-shift, concept-shift, and more general forms of nonstationarity.
For any fixed measurable
This follows directly from the measurability of .
While the converse is not true in general, does imply that for every measurable ,
This suggests that testing may help us to determine if there’s sufficient statistical evidence that domain adaptation techniques are required.
Black Box Shift Correction (BBSC)
Our estimator also points to a systematic method of correcting for label-shift via importance-weighted ERM. Specifically, we propose the following algorithm:
Note that for classes that occur rarely in the test set, BBSE may produce negative importance weights. During ERM, a flipped sign would cause us to maximize loss, which is unbounded above. Thus, we clip negative weights to $0$.
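The clipping and reweighting step can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm listing: the helper names `bbsc_example_weights` and `weighted_risk` are hypothetical, labels are assumed to be integers indexing the weight vector, and the resulting per-example weights are meant to be passed to whatever learner is used, provided it accepts them.

```python
import numpy as np

def bbsc_example_weights(w_hat, y_train):
    """Map estimated class weights to per-example weights for weighted ERM."""
    # Clip negative estimates to zero so no example's loss is maximized.
    w_clipped = np.maximum(np.asarray(w_hat, dtype=float), 0.0)
    # Each training example inherits the weight of its own label.
    return w_clipped[np.asarray(y_train)]

def weighted_risk(losses, weights):
    """Importance-weighted empirical risk: the mean of weight * loss."""
    return float(np.mean(np.asarray(weights) * np.asarray(losses)))

# Usage (placeholders): weights = bbsc_example_weights(w_hat, y_train),
# then minimize weighted_risk over the model's per-example losses.
```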
Owing to its efficacy and generality, our approach can serve as a default tool to deal with domain adaptation. It is one of the first things to try even when the label-shift assumption doesn’t hold. By contrast, the heuristic method of using logistic-regression to construct importance weights (Bickel et al., 2009) lacks theoretical justification that the estimated weights are correct.
Even in the simpler problem of average treatment effect (ATE) estimation, it’s known that using estimated propensity can lead to estimators with large variance (Kang & Schafer, 2007). The same issue applies in supervised learning. We may prefer to live with the biased solution from the unweighted ERM rather than suffer high variance from an unbiased weighted ERM. Our proposed approach offers a consistent low-variance estimator under label shift.
Experiments
We experimentally demonstrate the power of BBSE with real data and simulated label shift. We organize results into three categories: shift detection with BBSD, weight estimation with BBSE, and classifier correction with BBSC. BBSE-hard denotes our method where $f$ yields classifications. In BBSE-soft, $f$ outputs probabilities.
Label Shift Simulation. To simulate distribution shift in our experiments, we adopt the following protocols: First, we split the original data into train, validation, and test sets. Then, given distributions $p(y)$ and $q(y)$, we generate each set by sampling with replacement from the appropriate split. In knock-out shift, we knock out a fraction $\delta$ of data points from a given class from training and validation sets. In tweak-one shift, we assign a probability $\rho$ to one of the classes, and the rest of the mass is spread evenly among the other classes. In Dirichlet shift, we draw the shifted label distribution from a Dirichlet distribution with concentration parameter $\alpha$. When the other label distribution is uniform, Dirichlet shift is bigger for smaller $\alpha$.
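The three shift protocols can be sketched as small helpers that produce a label distribution, plus a resampler that draws indices with replacement so that the labels follow it. All names and parameters here are illustrative assumptions; in particular, the Dirichlet helper assumes a symmetric Dirichlet with every parameter equal to `alpha`, and the resampler assumes every class with positive target mass has at least one example.

```python
import numpy as np

def knock_out(p, cls, delta):
    """Label distribution after removing a fraction delta of one class."""
    q = np.array(p, dtype=float)
    q[cls] *= 1.0 - delta
    return q / q.sum()

def tweak_one(k, cls, rho):
    """Put mass rho on one class and spread the rest evenly over the others."""
    q = np.full(k, (1.0 - rho) / (k - 1))
    q[cls] = rho
    return q

def dirichlet_shift(k, alpha, rng):
    """Draw a label distribution from a symmetric Dirichlet (assumption)."""
    return rng.dirichlet(np.full(k, alpha))

def resample(y, target, n, rng):
    """Indices sampled with replacement so that labels follow `target`."""
    y = np.asarray(y)
    labels = rng.choice(len(target), size=n, p=target)
    # For each drawn label, pick one example of that class uniformly at random.
    return np.array([rng.choice(np.flatnonzero(y == c)) for c in labels])
```

The per-label loop in `resample` favors readability over speed; a vectorized version would be preferable for large `n`.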
Label-shift detection. We conduct nonparametric two-sample tests as described in Section 5.1 using the MNIST handwritten digits data set. To simulate the label-shift, we randomly split the training data into a training set, a validation set and a test set, each with 20,000 data points, and apply knock-out shift on one class (a random choice for illustration; the method works on all classes). Note that $p(y)$ and $q(y)$ differ increasingly as $\delta$ grows large, making shift detection easier. We obtain $f$ by training a two-layer ReLU-activated Multilayer Perceptron (MLP) with 256 neurons on the training set for five epochs. We conduct a two-sample test of whether the distributions of $f(x)$ under $P$ and $Q$ are the same using the Kolmogorov-Smirnov test. The results, summarized in Figure 1, demonstrate that BBSD (1) produces a $p$-value that distributes uniformly when there is no shift, so we can control Type I error at any significance level, (2) provides more power (less Type II error) than the state-of-the-art kernel two-sample test that discriminates $p(x)$ and $q(x)$ at , and (3) gets better as we train the black-box predictor even more.
Weight estimation and label-shift correction. We evaluate BBSE on MNIST by simulating label shift and datasets of various sizes. Specifically, we split the training data set randomly in two, using the first half to train $f$ and the second half to estimate $w$. We then use the full training set for weighted ERM. As before, $f$ is a two-layer MLP. For fair comparisons with baselines, the full training data set is used throughout (since they do not need $f$ and hence no data splitting). We evaluate our estimator $\hat{w}$ against the ground truth $w$ and by the prediction accuracy of BBSC on the test set. To cover a variety of different types of label-shift, we take $p(y)$ as a uniform distribution and generate $q(y)$ with Dirichlet shift for a range of concentration parameters (Figure 2).
Label-shift correction for CIFAR10 Next, we extend our experiments to the CIFAR dataset, using the same MLP and this time allowing it to train for 10 epochs. We consider both tweak-one and Dirichlet shift, and compare BBSE to the unweighted classifier under varying degrees of shift (Figure 4). For the tweak-one experiment, we try , averaging results over all 10 choices of the tweaked label, and plotting the variance. For the Dirichlet experiments, we sample for every choice of in the range {1000, 100, …, .001}. Because kernel-based baselines cannot handle datasets this large or high-dimensional, we compare only to unweighted ERM.
Kernel mean matching (KMM) baselines. We compare BBSE to the state-of-the-art kernel mean matching (KMM) methods. For the detection experiments (Figure 1), our baseline is the kernel B-test (Zaremba et al., 2013), an extension of the kernel maximum mean discrepancy (MMD) test due to Gretton et al. (2012) that boasts nearly linear-time computation and little loss in power. We compare BBSE to the KMM approach of Zhang et al. (2013), which solves
For fair comparison, we used the original authors’ implementations as baselines (https://github.com/wojzaremba/btest, http://people.tuebingen.mpg.de/kzhang/Code-TarS.zip) and also used the median trick to adaptively tune the RBF kernel’s hyperparameter. A key difference is that BBSE matches the distribution of $\hat{y}$ rather than the distribution of $x$ as in Zhang et al. (2013), and we learn $f$ through supervised learning rather than by specifying a feature map by choosing a kernel up front.
Note that KMM, like many kernel methods, requires the construction and inversion of an $n \times n$ Gram matrix, which has complexity of $O(n^3)$. This hinders its application to real-life machine learning problems where $n$ will often be 100s of thousands. In our experiments, we find that the largest $n$ for which we can feasibly run the KMM code is roughly and that is where we unfortunately have to stop for the MNIST experiment. For the same reason, we cannot run KMM for the CIFAR10 experiments. The MSE curves in Figure 2 for estimating $w$ suggest that the convergence rate of KMM is slower than BBSE by a polynomial factor and that BBSE better handles large datasets.
Discussion
Constructing the training set. The error bounds on our estimates depend on the norm of the true vector $w(y) := q(y)/p(y)$. This confirms the common sense that absent any assumption on $q(y)$, and given the ability to select class-conditioned examples for annotations, one should build a dataset with uniform $p(y)$. Then it’s always possible to apply BBSE successfully at test time to correct $f$.
Sporadic shift. In some settings, $q(y)$ might change only sporadically. In these cases, when no label shift occurs, applying BBSC might damage the classifier. For these cases, we propose to combine detection and estimation, correcting the classifier only when a shift has likely occurred.
Using known predictor. In our experiments, $f$ has been trained using a random split of the data set, which makes BBSE perform worse than the baseline when the data set is extremely small. In practice, especially in the context of web services, there could be a natural predictor $f$ that is currently being deployed whose training data were legacy and have little to do with the two distributions that we are trying to distinguish. In this case, we do not lose the factor of $2$ from data splitting and we do not suffer from the variance in training $f$ with a small amount of data. This could allow us to detect mild shift in distributions in a very short period of time, making it suitable for applications such as financial market prediction.
BBSE with degenerate confusion matrices. In practice, sometimes confusion matrices will be degenerate. For instance, when a class $y$ is rare under $P$, and the features are only partially predictive, we might find that $p(f(x) = y) = 0$. In these cases, two straightforward variations on the black box method may still work: First, while our analysis focuses on confusion matrices, it easily extends to any operator $f$, such as soft probabilities. For each class $y$, even if $y$ is never the argmax for any example, so long as the expected soft output for class $y$ is larger on examples from class $y$ than on examples from any other class $y'$, the soft confusion matrix will be invertible. Even when we cannot produce an operator with an invertible confusion matrix, two options remain: We can merge classes together, yielding a smaller invertible confusion matrix. While we might not be able to estimate the frequencies of those classes, we can estimate the others accurately. Another possibility is to compute the pseudo-inverse.
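Both variations can be combined in a short sketch: a soft confusion matrix built from predicted probabilities, solved with a pseudo-inverse so that a singular matrix does not raise an error. The function name `bbse_soft_weights` is hypothetical, labels are assumed to be integers in {0, ..., k-1}, and when the matrix is singular the result is only the minimum-norm least-squares solution, not a guaranteed estimate of the true weights.

```python
import numpy as np

def bbse_soft_weights(y_val, probs_val, probs_test, k):
    """BBSE with soft predictions and a pseudo-inverse fallback.

    probs_val:  (n, k) predicted probabilities on held-out source data.
    probs_test: (m, k) predicted probabilities on unlabeled target data.
    """
    y_val = np.asarray(y_val)
    probs_val = np.asarray(probs_val, dtype=float)
    probs_test = np.asarray(probs_test, dtype=float)

    # One-hot encode the true labels, shape (n, k).
    onehot = np.eye(k)[y_val]

    # Soft confusion matrix: C[i, j] estimates E_p[f(x)_i * 1{y = j}].
    C = probs_val.T @ onehot / len(y_val)

    # Mean soft prediction on the target sample.
    mu = probs_test.mean(axis=0)

    # The pseudo-inverse equals the inverse whenever C is invertible.
    return np.linalg.pinv(C) @ mu
```

With one-hot rows in place of probabilities, this reduces to the hard-prediction estimator.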
Future Work. As a next step, we plan to extend our methodology to the streaming setting. In practice, label distributions tend to shift progressively, presenting a new challenge: if we apply BBSE on trailing windows, then we face a trade-off. Looking far back increases $m$, lowering estimation error, but the estimate will be less fresh. The use of propensity weights $w$ on $y$ makes BBSE amenable to doubly-robust estimates, the typical bias-variance tradeoff, and related techniques, common in covariate shift correction.
Acknowledgments
We are grateful for extensive insightful discussions and feedback from Kamyar Azizzadenesheli, Kun Zhang, Arthur Gretton, Ashish Khetan Kumar, Anima Anandkumar, Julian McAuley, Dustin Tran, Charles Elkan, Max G’Sell, Alex Dimakis, Gary Marcus, and Todd Gureckis.
References
Appendix A Additional discussion
In this section we provide a few answers to some questions people may have when using our proposed techniques.
The LHS can be estimated by plugging in and a stochastic approximation of the expectation using labeled data from the source domain and the RHS can be estimated by the sample mean using unlabeled data from the target domain. In particular, if label-shift assumption is true or a good approximation, then
should be on the same order as the statistical error that we can calculate by and the error of in estimating .
Model selection criterion and the choice of $f$.
Our analysis assumes that $f$ is fixed and given, but in practice, we often need to train $f$ from the same data set. Given a number of choices, one may wonder which blackbox predictor we should prefer out of a collection of candidates. Our theoretical results suggest a natural quantity for choosing among blackbox predictors: the smallest singular value of the confusion matrix. Note that the smallest singular value can be estimated using only labeled data from the source domain. Therefore a practical heuristic is to choose the $f$ that maximizes the smallest singular value of the corresponding confusion matrix. Figure 5 plots the smallest singular value of the confusion matrices as the number of training epochs grows. The model is the same multi-layer perceptron that we used for our experiments, and the source distribution is one in which we knock out 80% of the fifth class. This is the same model and data set we used in Figure 1(c). Referring to Figure 1(c), we see that the test power of the $f$ that is trained for only one epoch is much lower than that of the $f$ trained for five epochs, and the gap in the smallest singular values is predictive of this fact, at least qualitatively.
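The heuristic can be sketched in a few lines: compute the empirical joint confusion matrix of each candidate on held-out source data and keep the candidate whose matrix has the largest smallest singular value. The helper names and the dictionary-of-predictions interface are assumptions made for this illustration.

```python
import numpy as np

def smallest_singular_value(y_val, yhat_val, k):
    """Smallest singular value of the empirical joint confusion matrix."""
    C = np.zeros((k, k))
    # C[i, j] estimates p(yhat = i, y = j) on held-out source data.
    np.add.at(C, (np.asarray(yhat_val), np.asarray(y_val)), 1.0)
    C /= len(y_val)
    return float(np.linalg.svd(C, compute_uv=False).min())

def select_predictor(candidates, y_val, k):
    """Pick the candidate with the largest smallest singular value.

    candidates: dict mapping a name to that predictor's hard predictions
    on the same held-out source split.
    """
    scores = {name: smallest_singular_value(y_val, yhat, k)
              for name, yhat in candidates.items()}
    return max(scores, key=scores.get), scores
```

On balanced two-class data, a perfect classifier scores 0.5 (the smallest class probability), while a classifier that always predicts one class scores 0.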
Is data splitting needed?
Recall that we train the model and estimate using two independent splits of the labeled data set drawn from the same distribution. In practice, especially when is large, using the same data to train and to estimate will be more data efficient. This comes at a price of a small bias. It is unclear how to quantify that bias but the data-reuse version could be useful in practice as a heuristic.
Appendix B Proofs
We present the proofs of Lemma 1 and Proposition 2 in this Appendix.
We applied A.1 to the second equality, and used the conditional independence $\hat{y} \perp\!\!\!\perp y \mid x$ under $P$ and $Q$ together with $p(\hat{y}|x)$ being determined by $f$, which is fixed. ∎
A.2 ensures that . By Assumption A.3, is invertible. Let be its smallest singular value. We bound the probability that is not invertible:
This ensures that as , is invertible almost surely. By the strong law of large numbers (SLLN), as and . Similarly, as , . Combining these with (6) and applying the continuous mapping theorem with the fact that the inverse of an invertible matrix is a continuous mapping we get that
Appendix C Concentration inequalities
Let be independent random variables bounded by . Then obeys for any
Let be independent random matrices with dimension and each satisfy
almost surely. Define the variance parameter