Algorithmic Stability for Adaptive Data Analysis
Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman
Introduction
Multiple hypothesis testing is a ubiquitous task in empirical research. A finite sample of data is drawn from some unknown population, and several analyses are performed on that sample. The outcome of an analysis is deemed significant if it is unlikely to have occurred by chance alone, and a “false discovery” occurs if the analyst incorrectly declares an outcome to be significant. False discovery has been identified as a substantial problem in the scientific community (see e.g. [Ioa05, GL14]). This problem persists despite decades of research by statisticians on methods for preventing false discovery, such as the widely used Bonferroni Correction [Bon36, Dun61] and the Benjamini-Hochberg Procedure [BH95].
False discovery is often attributed to misuse of statistics. An alternative explanation is that the prevalence of false discovery arises from the inherent adaptivity in the data analysis process—the fact that the choice of analyses to perform depends on previous interactions with the data (see e.g. [GL14]). Adaptivity is essentially unavoidable when a sequence of research groups publish research papers based on overlapping data sets. Adaptivity also arises naturally in other settings, for example: in multistage inference algorithms where data are preprocessed (say, to select features or restrict to a principal subspace) before the main analysis is performed; in scoring data-based competitions [BH15]; and in the re-use of holdout or test data [DFH+15c, DFH+15a].
The general problem of adaptive data analysis was formally modeled and studied in recent papers by Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth [DFH+15b] and by Hardt and Ullman [HU14]. The striking results of Dwork et al. [DFH+15b] gave the first nontrivial algorithms for provably ensuring statistical validity in adaptive data analysis, allowing for even an exponential number of tests against the same sample. In contrast, [HU14, SU15b] showed inherent statistical and computational barriers to preventing false discovery in adaptive settings.
The key ingredient in Dwork et al. is a notion of “algorithmic stability” that is suitable for adaptive analysis. Informally, changing one input to a stable algorithm does not change it’s output “too much.” Traditionally, stability was measured via the change in the generalization error of an algorithm’s output, and algorithms stable according to such a criterion have long been known to ensure statistical validity in nonadaptive analysis [DW79a, DW79b, KR99, BE02, SSSS10]. Following a connection first suggested by McSherrySee, e.g., [McS14], although the observation itself dates back at least to 2008 (personal communication)., Dwork et al. showed that a stronger stability condition designed to ensure data privacy—called differential privacy (DP) [DMNS06, Dwo06]—guarantees statistical validity in adaptive data analysis. This allowed them to repurpose known DP algorithms to prevent false discovery. A crucial difference from traditional notions of stability is that DP requires a change in one input lead to a small change in the probability distribution on the outputs (in particular, differentially private algorithms must be randomized). In this paper, we refer to differential privacy as max-KL-stability (Definition 2.3), to emphasize the relation to the literature on algorithmic stability other notions of stability we study (KL- and TV-stability, in particular).
In this work, we extend the results of Dwork et al. along two axes. First, we give an optimal analysis of the statistical validity of max-KL stable algorithms. As a consequence, we immediately obtain the best known bounds on the sample complexity (equivalently, the convergence rate) of adaptive data analysis. Second, we generalize the connection between max-KL stability and statistical validity to a much larger family of statistics. Our proofs are also significantly simpler than those of Dwork et al., and clarify the role of different stability notions in the adaptive setting.
Following the previous work on this subject ([DFH+15b]), we formalize the problem of adaptive data analysis as follows. There is a distribution over some finite universe , and a mechanism that does not know , but is given a set consisting of samples from . Using its sample, the mechanism must answer queries on . Here, a query , coming from some family , maps a distribution to a real-valued answer. The mechanism’s answer to a query is -accurate if with high probability. Importantly, the mechanism’s goal is to provide answers that “generalize” to the underlying distribution, rather than answers that are specific to its sample.
We model adaptivity by allowing a data analyst to ask a sequence of queries to the mechanism, which responds with answers . In the adaptive setting, the query may depend on the previous queries and answers arbitrarily. We say the mechanism is -accurate given samples for adaptively chosen queries if, with high probability, when given a vector of samples from an arbitrary distribution , the mechanism accurately responds to any adaptive analyst that makes at most queries.
Surprisingly, Dwork et al. [DFH+15b], showed there are mechanisms that are much more effective than naïvely outputting the empirical answer. They show that “stable” mechanisms are accurate and, by applying a stable mechanism from the literature on differential privacy, they obtain a mechanisms that are accurate given only samples, which is a significant improvement over the naïve mechanism when is not too small. (See Table 1 for more detailed statements of their results, including results that achieve an exponential improvement in the sample complexity when is bounded.)
Our first contribution is to give a simpler and quantitatively optimal analysis of the generalization properties of stable algorithms, which immediately yields new accuracy bounds for adaptive statistical queries. In particular, we show that samples suffice. Since samples are required to answer a single nonadaptive query, our dependence on is optimal.
Although statistical queries are surprisingly general [Kea93], we would like to be able to ask more general queries on the distribution that capture a wider variety of machine learning and data mining tasks. To this end, we give the first bounds on the sample complexity required to answer large numbers of adaptively chosen low-sensitivity queries and optimization queries, which we now describe.
Our sample complexity bounds are summarized in Table 1.
Our bounds were applied in subsequent work of Dwork et al. [DFH+15c, DFH+15a] in the analysis of their “reusable holdout” construction.
2 Overview of Techniques
Our main result is a new proof, with optimal parameters, that a stable algorithm that provides answers to adaptive queries that are close to the empirical value on the sample gives answers that generalize to the underlying distribution. In particular, we prove:
Let be a mechanism that takes a sample and answers adaptively-chosen low-sensitivity queries. Suppose that satisfies the following:
satisfies -max-KL stability (Definition 2.3, identical to -differential privacy).
Our actual result is somewhat more general than Theorem 1.1. We show that the population-level error of a stable algorithm is close to its error on the sample, whether or not that error is low. Put glibly: stable algorithms cannot be wrong without realizing it.
Compared to the results of [DFH+15b], Theorem 1.1 requires a quantitatively weaker stability guarantee—-stability, instead of -stability. It also applies to arbitrary low-sensitivity queries as opposed to the special case of statistical queries.
Our analysis differs from that of Dwork et al. in two key ways. First, we give a better bound on the probability with which a single low-sensitivity query output by a max-KL stable algorithm has good generalization error. Second, we show a reduction from the case of many queries to the case of a single query that has no loss in parameters (in contrast, previous work took a union bound over queries, leading to a dependence on , the number of queries).
Both steps rely on a thought experiment in which several “real” executions of a stable algorithm are simulated inside another algorithm, called a monitor, which outputs a function of the “real” transcripts. Because stability is closed under post-processing, the monitor is itself stable. Because it exists only as a thought experiment, the monitor can be given knowledge of the true distribution from which the data are drawn, and can use this knowledge to process the outputs of the simulated “real” runs. The monitor technique allows us to start from a basic guarantee, which states that a single query has good generalization error with constant probability, and amplify the guarantee so that (a) the generalization error holds with very high probability, and (b) the guarantee holds simultaneously over all queries in a sequence. The proof of the basic guarantee follows the lines of existing proofs using algorithmic stability (e.g., [DW79a]), while the monitor technique and the resulting amplification statements are new.
The amplification of success probability is the more technically sophisticated of the two key steps. The idea is to run many (about , using the notation of Theorem 1.1) copies of a stable mechanism on independently selected data sets. Each of these interactions results in a sequence of queries and answers. The monitor then selects the the query and answer pair from amongst all of the sequences that has the largest error. It then outputs this query as well as the index of the interaction that produced it. Our main technical lemma shows that the monitor will find a “bad” query/dataset pair (one where the true and empirical values of the query differ) with at most constant probability. This implies that the each of the real executions outputs a bad query with probability . Relative to previous work, the resulting argument yields better bounds, applies to more general classes of queries, and even generalizes to other notions of stability.
In general, we cannot prove that our bounds are optimal. Even for nonadaptive statistical queries, samples are necessary, and [HU14, SU15b] showed that samples are necessary to answer adaptively chosen statistical queries. However, the gap between the upper and lower bounds is still significant.
However, we can show that our connection between max-KL stability and generalization is optimal (see Section 7 for details). Moreover, for every family of queries we consider, no max-KL stable algorithm can achieve better sample complexity [BUV14, BST14]. Thus, any significant improvement to our bounds must come from using a weaker notion of stability or some entirely different approach.
Each of our results requires instantiating the mechanism with a suitable stable / differentially private algorithm. For statistical queries, the optimal mechanisms are the well known Gaussian and Laplace Mechanisms (slightly refined by [SU15a]) when is small and the Private Multiplicative Weights Mechanism [HR10] when is large. For arbitrary low-sensitivity queries, the Gaussian or Laplace Mechanism is again optimal when is small, and for large we can use the Median Mechanism [RR10].
When considering arbitrary search queries over an arbitrary finite range, the optimal algorithm is the Exponential Mechanism [MT07]. For the special case of convex minimization queries over an infinite domain, we use the optimal algorithm of [BST14] when is small, and when is large, we use an algorithm of [Ull15] that accurately answers exponentially many such queries.
Our techniques applies to notions of distributional stability other than max-KL/differential privacy. In particular, defining stability in terms of total variation (TV) or KL divergence (KL) leads to bounds on the generalization error that have polynomially, rather than exponentially, decreasing tails. See Section 4 for details.
Preliminaries
Given a distribution over or a sample , we would like to answer queries about or from some family . We will often want to bound the “sensitivity” of the queries with respect to changing one element of the sample. To this end, we use to denote that differ on at most one entry. We will consider several different families of queries:
Statistical Queries: These queries are specified by a function , and (abusing notation) are defined as
The error of an answer to a statistical query with respect to or is defined to be
The error of an answer to a -sensitive query with respect to or is defined to be
We denote the set of all -sensitive queries by . If we say the query is low sensitivity. Note that -sensitive queries are a strict generalization of statistical queries.
Here is an arbitrary set of items (sometimes called “parameter values”) among which we aim to chose the item (“parameter”) with minimal loss, either with respect to a particular input data set , or with respect to expectation over a distribution .
We denote the set of minimization queries by . We highlight two special cases:
Minimization for Finite Sets: We denote by the set of minimization queries where is finite with size at most .
2 Mechanisms for Adaptive Queries
Our goal is to design a mechanism that answers queries on using only independent samples . Our focus is the case where the queries are chosen adaptively and adversarially.
Specifically, is a stateful algorithm that holds a collection of samples , takes a query from some family as input, and returns an answer . We require that when are independent samples from , the answer is “close” to in a sense that is appropriate for the family of queries. Moreover we require that this condition holds for every query in an adaptively chosen sequence . Formally, we define an accuracy game between a mechanism and a stateful data analyst in Figure 1.
A mechanism is -accurate with respect to the population for adaptively chosen queries from given samples in if for every adversary ,
We will also use a definition of accuracy relative to the sample given to the mechanism, described in Figure 2.
A mechanism is -accurate with respect to samples of size from for adaptively chosen queries from if for every adversary ,
3 Max-KL Stability (a.k.a. Differential Privacy)
Informally, an algorithm is “stable” if changing one of its inputs does not change its output “too much.” For our results, we will consider randomized algorithms, and require that changing one input does not change the distribution of the algorithm’s outputs too much. With this in mind, we will define here one notion of algorithmic stability that is related to the well-known notion of KL-divergence between distributions. In Section 4.1, we will give other related notions of algorithmic stability based on different notions of closeness between distributions.
Let be a randomized algorithm. We say that is -max-KL stable if for every pair of samples that differ on exactly one element, and every ,
This notion of -max-KL stability is also commonly known as -differential privacy [DMNS06], however in this context we choose the term max-KL stability to emphasize the conceptual relationship between this notion and other notions of algorithmic stability that have been studied in machine learning. We also emphasise that our work has a very different motivation to the motivation of differential privacy — stable algorithms are desirable even when privacy is not a concern, such as when the data does not concern humans.
In our analysis, we will make crucial use of the fact that max-KL-stability (as well as the other notions of stability discussed in Section 4.1)is closed under post-processing.
Let and be a pair of randomized algorithms. If is -max-KL-stable then the algorithm is -max-KL-stable.
The definition we gave above does not immediately apply to algorithms that interact with a data analyst to answer adaptively chosen queries. Such a mechanism does not simply take a sample as input and produce an output. Instead, in the interactive setting, there is a mechanism that holds a sample and interacts with some algorithm We can view this entire interaction between and as a single noninteractive meta algorithm that outputs the transcript of the interaction and define stability with respect to that meta algorithm. Specifically, we define the algorithm that simulates the interaction between and and outputs the messages sent between them. The simulation is also parameterized by although we will frequently omit these parameters when they are clear from context.
Note that is a noninteractive mechanism, and its output is just the query-answer pairs of and in the sample accuracy game, subject to the mechanism being given the sample Now we can define the stability of an interactive mechanism using
[Stability of for Interactive Mechanism] We say an interactive mechanism is -max-KL stable for queries from if for every adversary , the algorithm is -max-KL stable.
3.2 Composition of Max-KL Stability
The definition above allows for adaptive composition. This follows directly from composition results of -differentially private algorithms. A mechanism that is -max-KL stable for query is -stable for adaptively chosen queries [DMNS06, DRV10]. More precisely, for every and , if a mechanism that is -max-KL stable for query is used to answer adaptively chosen queries, it remains -max-KL stable [DRV10].
From Max-KL Stability to Accuracy for Low-Sensitivity Queries
In this section we prove our main result that any mechanism that is both accurate with respect to the sample and satisfies max-KL stability (with suitable parameters) is also accurate with respect to the population. The proof proceeds in two mains steps. First, we prove a lemma that says that there is no max-KL stable mechanism that takes several independent sets of samples from the distribution and finds a query and a set of samples such that the answer to that query on that set of samples is very different from the answer to that query on the population. In Section 3.2 we prove this lemma for the simpler case of statistical queries and then in 3.3 we extend the proof to the more general case of low-sensitivity queries.
The second step is to introduce a monitoring algorithm. This monitoring algorithm will simulate the interaction between the mechanism and the adversary on multiple independent sets of samples. It will then output the least accurate query across all the different interactions. We show that if the mechanism is stable then the monitoring algorithm is also stable. By choosing the number of sets of samples appropriately, we ensure that if the mechanism has even a small probability of being inaccurate in a given interaction, then the monitor will have a constant probability of finding an inaccurate query in one of the interactions. By the lemma proven in the first step, no such monitoring algorithm can satisfy max-KL stability, therefore every stable mechanism must be accurate with high probability.
As a warmup, in this section we give a simpler version of our main lemma for the case of statistical queries and a single sample. Although these results follow from the results of Section 3.3 on general low-sensitivity queries, we include the simpler version to introduce the main ideas in the cleanest possible setting.
Before giving the proof, we set up some notation. Let . For a single element , and an index , we use to denote the new sample where the -th element of has been replaced by the element . Let be independent from .
Therefore, using the fact that for any statistical query and distribution , we have
2 Warmup: A Multi-Sample De-Correlated Expectation Lemma for SQs
As a second warmup, in this section we give a simpler version of our main lemma for the case of statistical queries and multiple samples. That is, we consider a setting where there are many subsamples available to the algorithm. The multi-sample de-correlated expectation lemma says that a max-KL stable algorithm cannot take a collection of samples and output a pair such that and differ significantly in expectation.
Let be -max-KL stable where is the class statistical queries . Let be a distribution on and let . Then
Before giving the proof, we set up some notation. Let be a set of samples where each sample . For a single element , and a pair of indices , we use to denote the new set of samples where the -th element of the -th sample of has been replaced by the element .
3 A Multi-Sample De-Correlated Expectation Lemma
Here, we give the most general de-correlated expectation lemma that considers multiple samples and applies to the more general class of low-sensitivity queries.
We remark that if we use the weaker assumption that is -TV stable, (defined in Section 4.1), then we would obtain the same conclusion but with the weaker bound of The advantage of using the stronger definition of max-KL stability is that we only have to decrease with and not This advantage is crucial because algorithms satisfying -max-KL stability necessarily have a linear dependence on but only a polylogarithmic dependence on .
Let be independent of . Recall that each element of is itself a vector and the same is true for each element of We will sometimes refer to the vectors as the subsamples of .
4 From Multi-Sample De-Correlated Expectation to Accuracy
Now that we have Lemma 3.3, we can prove the following result that max-KL stable mechanisms that are also accurate with respect to their sample are also accurate with respect to the population from which that sample was drawn.
Let be a family of -sensitive queries on . Assume that, for some , is
-max-KL stable for adaptively chosen queries from and
-accurate with respect to its sample for samples from for adaptively chosen queries from .
Then is -accurate with respect to the population for adaptively chosen queries from given samples from .
The key step in the proof is to define a monitoring algorithm that takes separate samples and for each sample , simulates an independent interaction between and . This monitoring algorithm then outputs the query with the largest error across all of the queries and interactions ( queries in total). Since changing one input to only affects one of the simulations, the monitoring algorithm will be stable so long as is stable, without any loss in the stability parameter. On the other hand, if has even a small chance of answering a query with large error, then if we simulate independent interactions, there is a constant probability that at least one of the simulations results in a query with large error. Thus, the monitor will be a stable algorithm that outputs a query with large error in expectation. By the multi-sample de-correlated expectation lemma, such a monitor is impossible, which implies that has probability of answering any query with large error.
Let be an interactive mechanism. Let be an analyst and let be the distribution chosen by . We define the following monitoring algorithm.
If is stable then so is and this fact follows easily from the post-processing lemma (Lemma 2.4):
For every , if the mechanism is -max-KL stable for adaptively chosen queries from then for every and , the monitor is -max-KL stable.
If is -max-KL stable for adaptively chosen queries from then for every analyst who asks queries from , and every the algorithm that simulates the interaction between and and outputs the resulting query-answer pairs is -max-KL stable. From this, it follows that the algorithm that simulates the interactions between and for every and outputs the resulting query-answer pairs is -max-KL stable. To see this, observe that if differ only on one subsample , then for every , and thus the query-answer pairs corresponding to subsample are identically distributed regardless of whether we use or as input to
Observe that the algorithm defined above is simply a post-processing of these query-answer pairs. That is, depends only on and , and not on Thus, by Lemma 2.4, is -max-KL stable. ∎
We will use the with In light of Claim 3.5 and our assumption that is -max-KL stable, we can apply Lemma 3.3 to obtain
To complete the proof, we show that if is not -accurate with respect to the population , then (1) cannot hold. To do so, we need the following natural claim about the output of the monitor.
Since fails to be -accurate, for every ,
We obtain the claim from (2) by using the fact that the sets of query-answer pairs corresponding to different subsamples are independent. That is, the random variables indexed by are independent. Since is simply the maximum of these independent random variables, the first part of the claim follows. Also, by construction, ensures that
If is -accurate for the sample but not -accurate for the population, then
Line (4) follows from two observations. First, since is assumed to be -accurate for one sample, by a union bound, it is simultaneously -accurate for all of the samples. Thus, we have except with probability at most Second, since is a -sensitive query, we always have Without loss of generality, the answers of can be truncated to an interval of width that contains the correct answer Doing so will ensure ∎
Thus, if is not -accurate for the population, we will obtain a contradiction to (1). This completes the proof. ∎
Other Notions of Stability and Accuracy on Average
Definition 4.2 gives one notion of stability, namely max-KL stability. However, this is by no means the only way to formalise stability for our purposes. In this section we consider other notions of stability and the advantages they have.
We will define here other notions of algorithmic stability, and in Section 4.2, we will show that such notions can provide expected guarantees for generalization error which can be used to achieve accuracy on average.
Let be a randomized algorithm. We say that is -TV stable if for every pair of samples that differ on exactly one element,
Let be a randomized algorithm. We say that is -KL-stable if for every pair of samples that differ on exactly one element,
The post-processing property of Max-KL stability (Lemma 2.4 in Section 2.3) also applies to the two stability notions above.
Let and be a pair of randomized algorithms. If is {-TV, -KL, -max-KL}-stable then the algorithm is {-TV, -KL, -max-KL}-stable.
-KL stability implies -TV stability by Pinsker’s inequality. The relationship between max-KL stability defined in Section 2.3 and the above notions is more subtle. When , -max-KL stability implies -KL stability and thus also -TV stability. When and , -max-KL stability implies -TV stability. It also implies that is “close” to satisfying -KL stability (cf. [DRV10] for more discussion of these notions).
As in Section 2.3.1, we define TV-stability and KL-stability of an interactive mechanism through a noninteractive mechanism that simulates the interaction between and an adversary . The definition for these notions of stability is precisely analogous to Definition 2.5 for max-KL stability.
As with max-KL stability, both notions above allow for adaptive composition. In fact, -TV stability composes linearly—a mechanism that is -TV stable for query is -stable for queries. The advantage of the stronger notions of KL and max-KL stability is that they have a stronger composition. A mechanism that is -KL stable for query is -stable for queries.
2 From TV Stability to Accuracy on Average
In this section we show that TV stable algorithms guarantee a weaker notion of accuracy on average for adaptively chosen queries.
3 Accuracy on Average
In Section 2.2 we defined accurate mechanisms to be those that answer accurately (either with respect to the population or the sample) with probability close to . In this section we define a relaxed notion of accuracy that only requires low error in expectation over the coins of and .
A mechanism is -accurate on average with respect to the population for adaptively chosen queries from given samples in if for every adversary ,
We will also use a definition of accuracy relative to the sample given to the mechanism:
A mechanism is -accurate on average with respect to samples of size from for adaptively chosen queries from if for every adversary ,
Towards our goal of proving that TV stability implies accuracy on average in the adaptive setting, we first prove a lemma saying that TV stable algorithms cannot output a low-sensitivity query such that the sample has large error for that query. In the next section we will show how this lemma implies accuracy on average in the adaptive setting.
Combining, the second and third observation with the triangle inequality, we have
3.2 From De-Correlated Expectation to Accuracy on Average
Let be the family of -sensitive queries on . Assume that is
-TV stable for adaptively chosen queries from and
-accurate on average with respect to its sample for samples from for adaptively chosen queries from .
Then is -accurate on average with respect to the population for adaptively chosen queries from given samples from .
The high level approach of the proof is to apply the Lemma 4.6 to a “monitoring algorithm” that watches the interaction between the mechanism and the analyst and then outputs the least accurate query. Since is stable, the de-correlated expectation lemma says that the query output by the monitor will satisfy in expectation, this implies that even for the least accurate query in the interaction between and , in expectation Thus, if is accurate with respect to the sample , it is also accurate with respect to
Let be an interactive mechanism and be an analyst that chooses the distribution . We define the following monitoring algorithm.
If is stable then so is and this fact follows easily from the post-processing lemma (Lemma 2.4).
For every , if the mechanism is -TV stable for adaptively chosen queries from then for every and , the monitor is -TV stable.
The assumption that is -TV stable for adaptively chosen queries from means that for every analyst who asks queries from , the algorithm that simulates the interaction between and and outputs the resulting query-answer pairs is -TV stable. Observe that the algorithm defined above is simply a post-processing of these query-answer pairs. That is, depends only on and , and not on Thus, by Lemma 2.4, for every and , the monitor is -TV stable. ∎
In light of Claim 4.8 and our assumption that is -TV stable, we can apply Lemma 4.6 to obtain
To complete the proof, we show that if is not -accurate on average with respect to the population , then (5) cannot hold.
If is -accurate for the sample but not -accurate for the population, then
Using our assumptions, we can calculate as follows.
Line (6) follows from two observations. First, by construction of , we always have Second, since is assumed not to be -accurate on average for the population, the expected value of Since ensures that we also have that the absolute value of the expectation of is greater than Line (7) follows from the assumption that is -accurate on average for the sample. ∎
Thus, if is not -accurate on average for the population, we will obtain a contradiction to (5). This completes the proof. ∎
From Low-Sensitivity Queries to Optimization Queries
In this section, we extend our results for low-sensitivity queries to the more general family of minimization queries. To do so, we design a suitable monitoring algorithm for minimization queries. As in our analysis of low-sensitivity queries, we will have the monitoring algorithm take as input many independent samples and simulate the interaction between and on each of those samples. Thus, if has even a small probability of being inaccurate, then with constant probability the monitor will find a minimization query that has answered inaccurately. Previously, we had monitor simply output this query and applied Lemma 3.3 to arrive at a contradiction. However, since Lemma 3.3 only applies to algorithms that output a low-sensitivity query, we can’t apply it to the monitor that outputs a minimization query. We address this by having the monitor output the error function associated with the loss function and answer it selects, which is a low-sensitivity query. If we assume that the mechanism is accurate for its sample but not for the population, then the monitor will find a loss function and an answer with low error on the sample but large error on the population. Thus the error function will be a low-sensitivity query with very different answers on the sample and the population, which is a contradiction. To summarize, we have the following theorem.
Let be the family of -sensitive minimization queries on . Assume that, for some , is
-max-KL stable for adaptively chosen queries from and
-accurate with respect to its sample for samples from for adaptively chosen queries from .
Then is -accurate with respect to the population for adaptively chosen queries from given samples from .
The formal proof is nearly identical to that of Theorem 3.4, so we omit the full proof. Instead, we will simply describe the modified monitoring algorithm.
Applications
We now plug known stable mechanisms (designed in the context of differential privacy) in to Theorem 3.4 to obtain mechanisms that provide strong error guarantees with high probability for both low-sensitivity and statistical queries.
There is a mechanism that is -accurate with respect to the population for adaptively chosen queries from where given samples from for
There is a mechanism that is -accurate with respect to the population for adaptively chosen queries from where given samples from for
There is a mechanism that is -accurate on average with respect to the population for adaptively chosen queries from given samples from for
2 Optimization Queries
The results of the Section 5 can be combined with existing differentially private algorithms for minimizing “empirical risk” (that is, loss with respect to the sample ) to obtain algorithms for answering adaptive sequences of minimization queries. We provide a few specific instantiations here, based on known differentially private mechanisms.
Let be a finite set of size at most . Let be the set of sensitivity- loss functions bounded between and . Then there is a mechanism that is -accurate with respect to the population for adaptively chosen queries from given
samples from . The running time of the mechanism is dominated by evaluations of the loss function.
2.2 Convex Minimization
samples from . The running time of the mechanism is dominated by evaluations of the gradient .
samples from . The running time of the mechanism is dominated by evaluations of the gradient .
An Alternative Form of Generalization and Tightness of Our Results
We now provide an alternative form of our generalization bounds. The following Theorem is more general than Theorem 3.4 because it says that no max-KL stable procedure that outputs a low-sensitivity can output a query that distinguishes the sample from the population (not just max-KL stable procedures that are accurate for the sample).
First we prove the following technical lemma.
where is the Shannon entropy of the distribution of (measured in nats, rather than bits). In particular,
as the uniform distribution maximizes entropy. Moreover, , whence . The result now follows from these two inequalities. ∎
Intuitively, Theorem 7.2 says that “stability prevents overfitting.” It says that no stable algorithm can output a low-sensitivity function that distinghishes its input from the population the input was drawn from (i.e. “overfits” its sample).
In particular, Theorem 7.2 implies that, if a mechanism is stable and outputs that “fits” its data, then also “fits” the population. This gives a learning theory perspective on our results.
Consider the following monitor algorithm .
We will use the monitor with . Observe that only access its input through (which is -max-KL-stable) and the exponential mechanism (which is -max-KL-stable). Thus, by composition and postprocessing, is -max-KL stable. We can hence apply Lemma 3.3 to obtain
Now we can apply Lemma 7.1 with and to get
To complete the proof, we assume, for the sake of contradition, that has a high enough probability of outputting a query such that is large. To obtain a contradiction from this assumption, we need the following natural claim (analogous to Claim 3.6) about the output of the monitor.
This contradicts the assumption that and hence completes the proof. ∎
We now show that our connection between max-KL stability and generalization (Theorem 7.2 and Theorem 3.4) is optimal.
Let , let , and let . Let be the uniform distribution over $(0,\delta)\mathcal{A}:^{n}\to Q_{\Delta}\mathbf{X}\leftarrow_{\mbox{\tiny R}}\mathcal{U}^{n}q\leftarrow_{\mbox{\tiny R}}\mathcal{A}(\mathbf{X})$ then
Consider the following simple algorithm, denoted as . On input a database , output with probability , and otherwise output the empty database. Clearly, is -max-KL stable. Now construct the following algorithm .
As is -max-KL stable, and as only applies on disjoint databases, we get that is also -max-KL stable.
Suppose contains i.i.d. samples from , and consider the execution of on . Observe that the predicate evaluates to 1 only on a finite number of points from $q_{p}(\mathcal{U})=0q_{p}(\mathbf{X})=\alpha\Delta n\cdot|\{i:\hat{\mathbf{x}}_{i}=\mathbf{x}_{i}\}|i\hat{\mathbf{x}}_{i}=\mathbf{x}_{i}q_{p}(\mathbf{X})-q_{p}(\mathcal{U})\geq\alpha\Delta n$. The probability that this is not the case is at most
ans thus, with probability at least , algorithm outputs a -sensitive query s.t. . ∎
In particular, using Lemma 7.4 with shows that the parameters in Theorem 7.2 are tight.
We thank Mark Bun, Moritz Hardt, Aaron Roth, and Salil Vadhan for many helpful discussions.