Preserving Statistical Validity in Adaptive Data Analysis

Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth

Introduction

Throughout the scientific community there is a growing recognition that claims of statistical significance in published research are frequently invalid [Ioa05b, Ioa05a, PSA11, BE12]. The past few decades have seen a great deal of effort to understand and propose mitigations for this problem. These efforts range from the use of sophisticated validation techniques and deep statistical methods for controlling the false discovery rate in multiple hypothesis testing to proposals for preregistration (that is, defining the entire data-collection and data-analysis protocol ahead of time). The statistical inference theory surrounding this body of work assumes a fixed procedure to be performed, selected before the data are gathered. In contrast, the practice of data analysis in scientific research is by its nature an adaptive process, in which new hypotheses are generated and new analyses are performed on the basis of data exploration and observed outcomes on the same data. This disconnect is only exacerbated in an era of increased amounts of open access data, in which multiple, mutually dependent, studies are based on the same datasets.

It is now well understood that adapting the analysis to data (e.g., choosing what variables to follow, which comparisons to make, which tests to report, and which statistical methods to use) is an implicit multiple comparisons problem that is not captured in the reported significance levels of standard statistical procedures. This problem, in some contexts referred to as “p-hacking” or “researcher degrees of freedom”, is one of the primary explanations of why research findings are frequently false [Ioa05b, SNS11, GL14].

The “textbook” advice for avoiding problems of this type is to collect fresh samples from the same data distribution whenever one ends up with a procedure that depends on the existing data. Getting fresh data is usually costly and often impractical so this requires partitioning the available dataset randomly into two or more disjoint sets of data (such as a training and testing set) prior to the analysis. Following this approach conservatively with $m$ adaptively chosen procedures would significantly (on average by a factor of $m$ ) reduce the amount of data available for each procedure. This would be prohibitive in many applications, and as a result, in practice even data allocated for the sole purpose of testing is frequently reused (for example to tune parameters). Such abuse of the holdout set is well known to result in significant overfitting to the holdout or cross-validation set [Reu03, RF08].

Clear evidence that such reuse leads to overfitting can be seen in the data analysis competitions organized by Kaggle Inc. In these competitions, the participants are given training data and can submit (multiple) predictive models in the course of competition. Each submitted model is evaluated on a (fixed) test set that is available only to the organizers. The score of each solution is provided back to each participant, who can then submit a new model. In addition the scores are published on a public leaderboard. At the conclusion of the competition the best entries of each participant are evaluated on an additional, hitherto unused, test set. The scores from these final evaluations are published. The comparison of the scores on the adaptively reused test set and one-time use test set frequently reveals significant overfitting to the reused test set (e.g. [Win, Kaga]), a well-recognized issue frequently discussed on Kaggle’s blog and user forums [Kagb, Kagc].

Despite the basic role that adaptivity plays in data analysis we are not aware of previous general efforts to address its effects on the statistical validity of the results (see Section 1.4 for an overview of existing approaches to the problem). We show that, surprisingly, the challenges of adaptivity can be addressed using insights from differential privacy, a definition of privacy tailored to privacy-preserving data analysis. Roughly speaking, differential privacy ensures that the probability of observing any outcome from an analysis is “essentially unchanged” by modifying any single dataset element (the probability distribution is over the randomness introduced by the algorithm). Differentially private algorithms permit a data analyst to learn about the dataset as a whole (and, by extension, the distribution from which the data were drawn), while simultaneously protecting the privacy of the individual data elements. Strong composition properties show this holds even when the analysis proceeds in a sequence of adaptively chosen, individually differentially private, steps.

We choose this setup for three reasons. First, a variety of quantities of interest in data analysis can be expressed in this form, that is, as an expectation $\mathcal{P}[\psi]$ for some function $\psi$. Examples include true means and moments of individual attributes, correlations between attributes, and the generalization error of a predictive model or classifier. Next, a request for such an estimate is referred to as a statistical query in the context of the well-studied statistical query model [Kea98, FGR+13], and it is known that most standard analyses used on i.i.d. data can be implemented using statistical queries in place of direct access to the data (see [Kea98, BDMN05, CKL+06] for examples). Finally, the problem of providing accurate answers to a large number of queries for the average value of a function on the dataset has been the subject of intense investigation in the differential privacy literature. (The average value of a function $\psi$ on a set of random samples is a natural estimator of $\mathcal{P}[\psi]$. In the differential privacy literature such queries are referred to as (fractional) counting queries.)

We address the following basic question: how many adaptively chosen statistical queries can be correctly answered using $n$ samples drawn i.i.d. from ${\mathcal{P}}$ ? The conservative approach of using fresh samples for each adaptively chosen query would lead to a sample complexity that scales linearly with the number of queries $m$ . We observe that such a bad dependence is inherent in the standard approach of estimating expectations by the exact empirical average on the samples. This is directly implied by the techniques from [DN03] who show how to make linearly many non-adaptive counting queries to a dataset, and reconstruct nearly all of it. Once the data set is nearly reconstructed it is easy to make a query for which the empirical average on the dataset is far from the true expectation. Note that this requires only a single round of adaptivity! A simpler and more natural example of the same phenomenon is known as “Freedman’s paradox” [Fre83] and we give an additional simple example in the Appendix. This situation is in stark contrast to the non-adaptive case in which $n=O\left(\frac{\log m}{\tau^{2}}\right)$ samples suffice to answer $m$ queries with tolerance $\tau$ using empirical averages. Below we refer to using empirical averages to evaluate the expectations of query functions as the naïve method.
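The failure of the naïve method under a single round of adaptivity can be illustrated with a small simulation. The sketch below is an illustrative construction and not the example from the Appendix: the data consist of independent random signs, so every query has true expectation exactly 1/2, yet a query assembled from the first-round empirical answers scores far above 1/2 on the reused sample. The function name, the sizes `n`, `d`, `n_fresh` and the seed are assumptions chosen only for illustration.

```python
import random

def adaptive_overfit_demo(n=100, d=500, n_fresh=1000, seed=0):
    """Return (average on the reused sample, average on fresh data)
    for one adaptively chosen query answered by exact empirical averages."""
    rng = random.Random(seed)

    def draw(size):
        # Each point is (attributes, label). All coordinates are independent
        # uniform signs, so no attribute carries information about the label.
        return [([rng.choice((-1, 1)) for _ in range(d)], rng.choice((-1, 1)))
                for _ in range(size)]

    sample = draw(n)

    # Round 1: d non-adaptive queries psi_j(x, y) = 1[x_j == y],
    # each answered by the exact empirical average on the sample.
    agree = [sum(1 for x, y in sample if x[j] == y) / n for j in range(d)]

    # Round 2: a single query that depends on the round-1 answers.
    signs = [1 if a >= 0.5 else -1 for a in agree]

    def psi(x, y):
        vote = sum(s * xj for s, xj in zip(signs, x))
        prediction = 1 if vote >= 0 else -1
        return 1.0 if prediction == y else 0.0

    empirical = sum(psi(x, y) for x, y in sample) / n
    fresh = draw(n_fresh)
    fresh_average = sum(psi(x, y) for x, y in fresh) / n_fresh
    return empirical, fresh_average
```

Calling `adaptive_overfit_demo()` returns the two averages. The first is computed on the sample that was used to build the query and the second on data the query has never seen, so the gap between them is the overfitting caused by reuse.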

Our Results

Our main result is a broad transfer theorem showing that any adaptive analysis that is carried out in a differentially private manner must lead to a conclusion that generalizes to the underlying distribution. This theorem allows us to draw on a rich body of results in differential privacy and to obtain corresponding results for our problem of guaranteeing validity in adaptive data analysis. Before we state this general theorem, we describe a number of important corollaries for the question we formulated above.

Our primary application is that, remarkably, it is possible to answer nearly exponentially many adaptively chosen statistical queries (in the size of the data set $n$ ). Equivalently, this reduces the sample complexity of answering $m$ queries from linear in the number of queries to polylogarithmic, nearly matching the dependence that is necessary for non-adaptively chosen queries.

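There exists an algorithm that, given a dataset of size $n\geq\min(n_{0},n_{1})$, can answer any $m$ adaptively chosen statistical queries so that with high probability each answer is correct up to tolerance $\tau$, where:

$$n_{0}=O\left(\frac{(\log m)^{3/2}\sqrt{\log|\mathcal{X}|}}{\tau^{7/2}}\right)\quad\text{and}\quad n_{1}=O\left(\frac{\log m\cdot\log|\mathcal{X}|}{\tau^{4}}\right).$$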

The two bounds above are incomparable. Note that the first bound is larger than the sample complexity needed to answer non-adaptively chosen queries by only a factor of $O\left(\sqrt{\log m\log|\mathcal{X}|}/\tau^{3/2}\right)$ , whereas the second one is larger by a factor of $O\left(\log(|\mathcal{X}|)/\tau^{2}\right)$ . Here $\log|\mathcal{X}|$ should be viewed as roughly the dimension of the domain. For example, if the underlying domain is $\mathcal{X}=\{0,1\}^{d}$ , the set of all possible vectors of $d$ -boolean attributes, then $\log|\mathcal{X}|=d$ .
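To see why neither bound dominates, one can tabulate how they scale. The sketch below multiplies the non-adaptive rate $\log m/\tau^{2}$ by the two factors quoted above. The constants hidden in the $O(\cdot)$ notation are dropped, so the returned numbers are scalings rather than sample sizes, and the function name and argument names are hypothetical.

```python
import math

def sample_complexity_scalings(log_m, log_domain, tau):
    """Return (non-adaptive, first bound, second bound) up to hidden constants.

    log_m      -- natural log of the number of queries m
    log_domain -- log |X|, roughly the dimension of the domain
    tau        -- tolerance
    """
    nonadaptive = log_m / tau**2
    # First bound: larger by a factor sqrt(log m * log|X|) / tau^(3/2).
    first = nonadaptive * math.sqrt(log_m * log_domain) / tau**1.5
    # Second bound: larger by a factor log|X| / tau^2.
    second = nonadaptive * log_domain / tau**2
    return nonadaptive, first, second
```

Under this constant-free comparison the first bound is the smaller one exactly when $\tau\log m<\log|\mathcal{X}|$, so it is preferable for moderately many queries, while the second wins once $\log m$ is large relative to $\log|\mathcal{X}|/\tau$.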

The above mechanism is not computationally efficient (it has running time linear in the size of the data universe $|\mathcal{X}|$ , which is exponential in the dimension of the data). A natural question raised by our result is whether there is an efficient algorithm for the task. This question was addressed in [HU14, SU14] who show that under standard cryptographic assumptions any algorithm that can answer more than $\approx n^{2}$ adaptively chosen statistical queries must have running time exponential in $\log|\mathcal{X}|$ .

We show that it is possible to match this quadratic lower bound using a simple and practical algorithm that perturbs the answer to each query with independent noise.
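A minimal sketch of such a noise-adding oracle is given below. It assumes Laplace noise, a standard choice in the differential privacy literature, although the text above does not name the noise distribution. The class name `NoisyOracle` and the parameter `eps_per_query` are hypothetical, and the sketch does not show how the per-query privacy parameter should be set to obtain the stated guarantee.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class NoisyOracle:
    """Answers [0,1]-valued statistical queries with independent noise."""

    def __init__(self, dataset, eps_per_query, seed=None):
        self.dataset = list(dataset)
        self.eps = eps_per_query          # assumed per-query privacy parameter
        self.rng = random.Random(seed)

    def answer(self, psi):
        n = len(self.dataset)
        average = sum(psi(x) for x in self.dataset) / n
        # Replacing one element moves the average of a [0,1]-valued psi by
        # at most 1/n, so Laplace noise of scale 1/(n * eps) makes this one
        # answer eps-differentially private.
        noisy = average + laplace_noise(1.0 / (n * self.eps), self.rng)
        # Clamping to [0,1] is post-processing and does not affect privacy.
        return min(1.0, max(0.0, noisy))
```

Each call to `answer` draws fresh noise, so the privacy cost accumulates over queries according to the composition properties of differential privacy.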

There exists a computationally efficient algorithm for answering $m$ adaptively chosen statistical queries, such that with high probability, the answers are correct up to tolerance $\tau$ , given a data set of size at least $n\geq n_{0}$ for:

Finally, we show a computationally efficient method which can answer exponentially many queries so long as they were generated using $o(n)$ rounds of adaptivity, even if we do not know where the rounds of adaptivity lie. Another practical advantage of this algorithm is that it only pays the price for a round if adaptivity actually causes overfitting. In other words, the algorithm does not pay for the adaptivity itself but only for the actual harm to statistical validity that adaptivity causes. This means that in many situations it would be possible to use this algorithm successfully with a much smaller “effective” $r$ (provided that a good bound on it is known).

There exists a computationally efficient algorithm for answering $m$ adaptively chosen statistical queries, generated in $r$ rounds of adaptivity, such that with high probability, the answers are correct up to some tolerance $\tau$ , given a data set of size at least $n\geq n_{0}$ for:

Formal statements of these results appear in Section 5.

Overview of Techniques

We consider a setting in which an arbitrary adaptive data analyst chooses queries to ask (as a function of past answers), and receives answers from an algorithm referred to as an oracle whose input is a dataset ${\bm{S}}$ of size $n$ randomly drawn from ${\mathcal{P}}^{n}$ . To begin with, the oracles we use will only guarantee accuracy with respect to the empirical average on their input dataset ${\bm{S}}$ , but they will simultaneously guarantee differential privacy. We exploit a crucial property about differential privacy, known as its post-processing guarantee: Any algorithm that can be described as the (possibly randomized) post-processing of the output of a differentially private algorithm is itself differentially private. Hence, although we do not know how an arbitrary analyst is adaptively generating her queries, we do know that if the only access she has to ${\bm{S}}$ is through a differentially private algorithm, then her method of producing query functions must be differentially private with respect to ${\bm{S}}$ . We can therefore, without loss of generality, think of the oracle and the analyst as a single algorithm ${\cal A}$ that is given a random data set ${\bm{S}}$ and returns a differentially private output query ${\bm{\phi}}={\cal A}({\bm{S}}).$ Note that ${\bm{\phi}}$ is random both due to the internal randomness of ${\cal A}$ and the randomness of the data ${\bm{S}}.$ This picture is the starting point of our analysis, and allows us to study the generalization properties of queries which are generated by differentially private algorithms, rather than estimates returned by them.
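The interaction described above can be sketched as a short loop. In this sketch the analyst is any function from the transcript of past (query, answer) pairs to the next query, and she touches the dataset only through the oracle. All names are hypothetical, and `exact_oracle` is a non-private stand-in used only to make the example runnable; in the setting of the paper the oracle would be differentially private.

```python
def run_adaptive_analysis(oracle_answer, analyst_next_query, num_queries):
    """Run the analyst against the oracle and return the transcript."""
    transcript = []
    for _ in range(num_queries):
        # The next query is a function of past queries and answers only.
        psi = analyst_next_query(transcript)
        transcript.append((psi, oracle_answer(psi)))
    return transcript

def exact_oracle(dataset):
    # Stand-in oracle: exact empirical average, with no privacy guarantee.
    return lambda psi: sum(psi(x) for x in dataset) / len(dataset)

def threshold_analyst(transcript):
    # Example analyst: uses the previous answer as the next threshold.
    t = transcript[-1][1] if transcript else 0.5
    return lambda x: 1.0 if x >= t else 0.0
```

Because `analyst_next_query` never receives the dataset, the composition of oracle and analyst is a post-processing of the oracle's outputs, which is what allows the two to be treated as a single algorithm.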

Our results then follow from a strong connection we make between differential privacy and generalization, which will likely have applications beyond those that we explore in this paper. At a high level, we prove that if ${\cal A}$ is a differentially private algorithm then the empirical average of a function that it outputs on a random dataset will be close to the true expectation of the function with high probability (over the choice of the dataset and the randomness of ${\cal A}$ ). (A weaker connection that gives closeness in expectation over the dataset and the algorithm’s randomness was known to some experts and is considered folklore; we give a more detailed comparison in Sec. 1.4 and Sec. 2.1.) More formally, for a dataset $S=(x_{1},\dots,x_{n})$ and a function $\psi:\mathcal{X}\rightarrow[0,1]$ , let ${\cal E}_{S}[\psi]=\frac{1}{n}\sum_{i=1}^{n}\psi(x_{i})$ denote the empirical average of $\psi$ . We denote a random dataset chosen from ${\mathcal{P}}^{n}$ by ${\bm{S}}$ . For any fixed function $\psi$ , the empirical average ${\cal E}_{{\bm{S}}}[\psi]$ is strongly concentrated around its expectation ${\mathcal{P}}[\psi]$ . This statement is no longer true if $\psi$ is allowed to depend on ${\bm{S}}$ (which is what happens if we choose functions adaptively, using previous estimates on ${\bm{S}}$ ). Nevertheless, for a hypothesis output by a differentially private ${\cal A}$ on ${\bm{S}}$ (denoted by ${\bm{\phi}}={\cal A}({\bm{S}})$ ), we show that ${\cal E}_{{\bm{S}}}[{\bm{\phi}}]$ is close to ${\mathcal{P}}[{\bm{\phi}}]$ with high probability.

High probability bounds are necessary to ensure that valid answers can be given to an exponentially large number of queries. To prove these bounds, we show that differential privacy roughly preserves the moments of ${\cal E}_{{\bm{S}}}[{\bm{\phi}}]$ even when conditioned on ${\bm{\phi}}=\psi$ for any fixed $\psi$ . Now using strong concentration of the $k$ -th moment of ${\cal E}_{{\bm{S}}}[\psi]$ around ${\mathcal{P}}[\psi]^{k}$ , we can obtain that ${\cal E}_{{\bm{S}}}[{\bm{\phi}}]$ is concentrated around ${\mathcal{P}}[{\bm{\phi}}]$ . Such an argument works only for $(\varepsilon,0)$ -differential privacy due to conditioning on the event ${\bm{\phi}}=\psi$ which might have arbitrarily low probability. We use a more delicate conditioning to obtain the extension to $(\varepsilon,\delta)$ -differential privacy. We note that $(\varepsilon,\delta)$ -differential privacy is necessary to obtain the stronger bounds that we use for Theorems 1 and 2.

We give an alternative, simpler proof for $(\varepsilon,0)$ -differential privacy that, in addition, extends this connection beyond expectations of functions. We consider a differentially private algorithm ${\cal A}$ that maps a database ${\bm{S}}\sim{\mathcal{P}}^{n}$ to elements from some arbitrary range $Z$ . Our proof shows that if we have a collection of events $R(y)$ defined over databases, one for each element $y\in Z$ , and each event is individually unlikely in the sense that for all $y$ , the probability that ${\bm{S}}\in R(y)$ is small, then the probability remains small that ${\bm{S}}\in R({\bm{Y}})$ , where ${\bm{Y}}={\cal A}({\bm{S}})$ . Note that this statement involves a re-ordering of quantifiers. The hypothesis of the theorem says that the probability of event $R(y)$ is small for each $y$ , where the randomness is taken over the choice of database ${\bm{S}}\sim{\mathcal{P}}^{n}$ , which is independent of $y$ . The conclusion says that the probability of $R({\bm{Y}})$ remains small, even though ${\bm{Y}}$ is chosen as a function of ${\bm{S}}$ , and so is no longer independent. The upshot of this result is that adaptive analyses, if performed via a differentially private algorithm, can be thought of (almost) as if they were non-adaptive, with the data being drawn after all of the decisions in the analysis are fixed.

Related Work

Numerous techniques have been developed by statisticians to address common special cases of adaptive data analysis. Most of them address a single round of adaptivity such as variable selection followed by regression on selected variables or model selection followed by testing and are optimized for specific inference procedures (the literature is too vast to adequately cover here, see Ch. 7 in [HTF09] for a textbook introduction and [TT15] for a survey of some recent work). In contrast, our framework addresses multiple stages of adaptive decisions, possible lack of a predetermined analysis protocol and is not restricted to any specific procedures.

The traditional perspective on why adaptivity in data analysis invalidates the significance levels of statistical procedures given for the non-adaptive case is that one ends up disregarding all the other possible procedures or tests that would have been performed had the data been different (see e.g. [SNS11]). It is well-known that when performing multiple tests on the same data one cannot use significance levels of individual tests and instead it is necessary to control measures such as the false discovery rate [BH95]. This view makes it necessary to explicitly account for all the possible ways to perform the analysis in order to provide validity guarantees for the adaptive analysis. While this approach might be possible in simpler studies, it is technically challenging and often impractical in more complicated analyses [GL14].

There are procedures for controlling false discovery in a sequential setting in which tests arrive one-by-one [FS08, ANR11, AR14]. However the analysis of such tests crucially depends on tests maintaining their statistical properties despite conditioning on previous outcomes. It is therefore unsuitable for the problem we consider here, in which we place no restrictions on the analyst.

The classical approach in theoretical machine learning to ensure that empirical estimates generalize to the underlying distribution is based on the various notions of complexity of the set of functions output by the algorithm, most notably the VC dimension (see [KV94] or [SSBD14] for a textbook introduction). If one has a sample of data large enough to guarantee generalization for all functions in some class of bounded complexity, then it does not matter whether the data analyst chooses functions in this class adaptively or non-adaptively. Our goal, in contrast, is to prove generalization bounds without making any assumptions about the class from which the analyst can choose query functions. In this case the adaptive setting is very different from the non-adaptive setting.

An important line of work [BE02, MNPR06, PRMN04, SSSSS10] establishes connections between the stability of a learning algorithm and its ability to generalize. Stability is a measure of how much the output of a learning algorithm is perturbed by changes to its input. It is known that certain stability notions are necessary and sufficient for generalization. Unfortunately, the stability notions considered in these prior works are not robust to post-processing, and so the stability of a query answering procedure would not guarantee the stability of the query generating procedure used by an arbitrary adaptive analyst. They also do not compose in the sense that running multiple stable algorithms sequentially and adaptively may result in a procedure that is not stable. Differential privacy is stronger than these previously studied notions of stability, and in particular enjoys strong post-processing and composition guarantees. This provides a calculus for building up complex algorithms that satisfy stability guarantees sufficient to give generalization. Past work has considered the generalization properties of one-shot learning procedures. Our work can in part be interpreted as showing that differential privacy implies generalization in the adaptive setting, and beyond the framework of learning.

Differential privacy emerged from a line of work [DN03, DN04, BDMN05], culminating in the definition given by [DMNS06]. It defines a stability property of an algorithm developed in the context of data privacy. There is a very large body of work designing differentially private algorithms for various data analysis tasks, some of which we leverage in our applications. Most crucially, it is known how to accurately answer exponentially many adaptively chosen queries on a fixed dataset while preserving differential privacy [RR10, HR10], which is what yields the main application in our paper, when combined with our main theorem. See [Dwo11] for a short survey and [DR14] for a textbook introduction to differential privacy.

For differentially private algorithms that output a hypothesis it has been known as folklore that differential privacy implies stability of the hypothesis to replacing (or removing) an element of the input dataset. Such stability is long known to imply generalization in expectation (e.g. [SSSSS10]). See Section 2.1 for more details. Our technique can be seen as a substantial strengthening of this observation: from expectation to high probability bounds (which is crucial for answering many queries), from pure to approximate differential privacy (which is crucial for our improved efficient algorithms), and beyond the expected error of a hypothesis.

Further Developments: Our work has attracted substantial interest to the problem of statistical validity in adaptive data analysis and its relationship to differential privacy. Hardt and Ullman [HU14] and Steinke and Ullman [SU14] have proven complementary computational lower bounds for the problem formulated in this work. They show that, under standard cryptographic assumptions, the exponential running time of the algorithm instantiating our main result is unavoidable. Their results also show that the square-root dependence on the number of queries in the sample complexity of our efficient algorithm is nearly optimal among all computationally efficient mechanisms for answering arbitrary statistical queries.

In [DFH+15a] we discuss approaches to the problem of adaptive data analysis more generally. We demonstrate how differential privacy and description-length-based analyses can be used in this context. In particular, we show that the bounds on $n_{1}$ obtained in Theorem 1 can also be obtained by analyzing the transcript of the median mechanism for query answering [RR10] (even without adding noise). Further, we define a notion of approximate max-information between the dataset and the output of the analysis that ensures generalization with high probability, composes adaptively and unifies (pure) differential privacy and description-length-based analyses. We also demonstrate an application of these techniques to the problem of reusing the holdout (or testing) dataset. An overview of this work and [DFH+15a] intended for a broad scientific audience appears in [DFH+15b].

Blum and Hardt [BH15] give an algorithm for reusing the holdout dataset specialized to the problem of maintaining an accurate leaderboard for a machine learning competition (such as those organized by Kaggle Inc. and discussed earlier). Their generalization analysis is based on the description length of the algorithm’s transcript.

Our results for approximate ( $\delta>0$ ) differential privacy apply only to statistical queries (see Thm. 10). Bassily, Nissim, Smith, Steinke, Stemmer and Ullman [BNS+15] give a novel, elegant analysis of the $\delta>0$ case that gives an exponential improvement in the dependence on $\delta$ and generalizes it to arbitrary low-sensitivity queries. This leads to stronger bounds on sample complexity that remove an $O(\sqrt{\log(m)/\tau})$ factor from the bounds on $n_{0}$ we give in Theorems 1 and 2. It also implies a similar improvement and generalization to low-sensitivity queries in the reusable holdout application [DFH+15a].

Another implication of our work is that composition and post-processing properties (which are crucial in the adaptive setting) can be ensured by measuring the effect of data analysis on the probability space of the analysis outcomes. Several additional techniques of this type have been recently analyzed. Bassily et al. [BNS+15] show that generalization in expectation (as discussed in Cor. 7) can also be obtained from two additional notions of stability: KL-stability and TV-stability that bound the KL-divergence and total variation distance between output distributions on adjacent datasets, respectively. Russo and Zou [RZ15] show that generalization in expectation can be derived by bounding the mutual information between the dataset and the output of analysis. They give applications of their approach to analysis of adaptive feature selection procedures. We note that these techniques do not imply high-probability generalization bounds that we obtain here and in [DFH+15a].

Preliminaries

A statistical query is defined by a function $\psi:\mathcal{X}\rightarrow[0,1]$ and tolerance $\tau$ . For distribution ${\mathcal{P}}$ over $\mathcal{X}$ a valid response to such a query is any value $v$ such that $|v-{\mathcal{P}}[\psi]|\leq\tau$ .
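As a concrete reading of this definition, the sketch below checks whether a candidate answer is a valid response when the distribution is known and has finite support. The helper names are hypothetical, and representing the distribution as a dictionary of probabilities is an assumption made for illustration; in practice ${\mathcal{P}}$ is unknown and only samples are available.

```python
def true_expectation(psi, distribution):
    # distribution maps each domain element to its probability.
    return sum(p * psi(x) for x, p in distribution.items())

def empirical_average(psi, dataset):
    return sum(psi(x) for x in dataset) / len(dataset)

def is_valid_response(v, psi, distribution, tau):
    # A response is valid if it is within tau of the true expectation.
    return abs(v - true_expectation(psi, distribution)) <= tau
```

For the uniform distribution over `range(10)` and the query `psi(x) = 1 if x < 3 else 0`, the true expectation is 0.3, so with tolerance 0.05 the answer 0.32 is valid and the answer 0.4 is not.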

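We now formally define differential privacy. We say that datasets $S,S^{\prime}$ are adjacent if they differ in a single element.

A randomized algorithm ${\cal A}$ with domain $\mathcal{X}^{n}$ is $(\varepsilon,\delta)$ -differentially private if for all ${\cal O}\subseteq\mathrm{Range}({\cal A})$ and for all pairs of adjacent datasets $S,S^{\prime}\in\mathcal{X}^{n}$ :

$$\mathbf{Pr}[{\cal A}(S)\in{\cal O}]\leq e^{\varepsilon}\cdot\mathbf{Pr}[{\cal A}(S^{\prime})\in{\cal O}]+\delta,$$

where the probability space is over the coin flips of the algorithm ${\cal A}$ . The case when $\delta=0$ is sometimes referred to as pure differential privacy, and in this case we may say simply that ${\cal A}$ is $\varepsilon$ -differentially private.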
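For the pure case $\delta=0$, the requirement is that the output distribution changes by a factor of at most $e^{\varepsilon}$ between adjacent datasets. The sketch below checks this numerically for one standard mechanism, Laplace noise added to the empirical average of a $[0,1]$-valued query. The mechanism, the grid and the function names are assumptions chosen for illustration and are not taken from the text above.

```python
import math

def laplace_density(v, center, scale):
    return math.exp(-abs(v - center) / scale) / (2.0 * scale)

def max_density_ratio(avg_s, avg_s_prime, scale, grid):
    """Largest ratio of output densities over the grid of outputs v."""
    return max(laplace_density(v, avg_s, scale) /
               laplace_density(v, avg_s_prime, scale) for v in grid)

def check_laplace_privacy(n=100, eps=0.5):
    # Adjacent datasets change the empirical average by at most 1/n.
    avg_s, avg_s_prime = 0.40, 0.40 + 1.0 / n
    scale = 1.0 / (n * eps)
    grid = [i / 1000.0 for i in range(-1000, 2001)]
    return max_density_ratio(avg_s, avg_s_prime, scale, grid), math.exp(eps)
```

A pointwise bound of $e^{\varepsilon}$ on the density ratio implies the same bound on the probability of every set of outputs, which is the condition in the definition with $\delta=0$.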

Appendix B contains additional background that we will need later on.

We now briefly summarize the basic connection between differential privacy and generalization that is considered folklore. This connection follows from the observation that differential privacy implies stability to replacing a single sample in a dataset, together with the known connection between stability and on-average generalization. We first state the form of stability that is immediately implied by the definition of differential privacy. For simplicity, we state it only for $[0,1]$ -valued functions. The extension to any other bounded range is straightforward.

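Let ${\mathcal{A}}$ be an $(\varepsilon,\delta)$ -differentially private algorithm ranging over functions from $\mathcal{X}$ to $[0,1]$ . For any pair of adjacent datasets $S$ and $S^{\prime}$ and $x\in\mathcal{X}$ :

$$\mathbf{E}[{\mathcal{A}}(S)(x)]\leq e^{\varepsilon}\cdot\mathbf{E}[{\mathcal{A}}(S^{\prime})(x)]+\delta,$$

and, in particular,

$$\left|\mathbf{E}[{\mathcal{A}}(S)(x)]-\mathbf{E}[{\mathcal{A}}(S^{\prime})(x)]\right|\leq e^{\varepsilon}-1+\delta.\qquad(1)$$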

Algorithms satisfying equation (1) are referred to as strongly-uniform-replace-one stable with rate $(e^{\varepsilon}-1+\delta)$ by Shalev-Shwartz et al. [SSSSS10]. It is well known and easy to show that replace-one stability implies generalization in expectation, referred to as on-average generalization [SSSSS10, Lemma 11]. In our case this connection immediately gives the following corollary.

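Let ${\mathcal{A}}$ be an $(\varepsilon,\delta)$ -differentially private algorithm ranging over functions from $\mathcal{X}$ to $[0,1]$ , let ${\mathcal{P}}$ be a distribution over $\mathcal{X}$ and let ${\bm{S}}$ be an independent random variable distributed according to ${\mathcal{P}}^{n}$ . Then

$$\left|\mathbf{E}\left[{\mathcal{P}}[{\mathcal{A}}({\bm{S}})]\right]-\mathbf{E}\left[{\cal E}_{{\bm{S}}}[{\mathcal{A}}({\bm{S}})]\right]\right|\leq e^{\varepsilon}-1+\delta.$$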

This corollary was observed in the context of functions expressing the loss of the hypothesis output by a (private) learning algorithm, that is, $\phi(x)=L(h(x),x)$ , where $x$ is a sample (possibly including a label), $h$ is a hypothesis function and $L$ is a non-negative loss function. When applied to such a function, Corollary 7 implies that the expected true loss of a hypothesis output by an $(\varepsilon,\delta)$ -differentially private algorithm is at most $e^{\varepsilon}-1+\delta$ larger than the expected empirical loss of the output hypothesis, where the expectation is taken over the random dataset and the randomness of the algorithm. A special case of this corollary is stated in a recent work of Bassily et al. [BST14]. More recently, Wang et al. [WLF15] have similarly used the stability of differentially private learning algorithms to show a general equivalence of differentially private learning and differentially private empirical loss minimization.

Differential Privacy and Preservation of Moments

We now prove that if a function ${\bm{\phi}}$ is output by an $(\varepsilon,\delta)$ -differentially private algorithm ${\cal A}$ on input of a random dataset ${\bm{S}}$ drawn from ${\mathcal{P}}^{n},$ then the average of ${\bm{\phi}}$ on ${\bm{S}},$ that is, ${\cal E}_{{\bm{S}}}[{\bm{\phi}}],$ is concentrated around its true expectation ${\mathcal{P}}[{\bm{\phi}}].$

Our proof is easier to execute when $\delta=0$ and we start with this case for the sake of exposition.

Our main technical tool relates the moments of the random variables that we are interested in.

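Assume that ${\mathcal{A}}$ is an $(\varepsilon,0)$ -differentially private algorithm ranging over functions from $\mathcal{X}$ to $[0,1]$ . Let ${\bm{S}},{\bm{T}}$ be independent random variables distributed according to ${\mathcal{P}}^{n}$ and let ${\bm{\phi}}={\cal A}({\bm{S}})$ . For any function $\psi:\mathcal{X}\rightarrow[0,1]$ in the support of ${\cal A}({\bm{S}})$ ,

$$\mathbf{E}\left[{\cal E}_{{\bm{S}}}[{\bm{\phi}}]^{k}\ \middle|\ {\bm{\phi}}=\psi\right]\leq e^{k\varepsilon}\cdot\mathbf{E}\left[{\cal E}_{{\bm{T}}}[\psi]^{k}\right].$$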

We use $I$ to denote a $k$ -tuple of indices $(i_{1},\ldots,i_{k})\in[n]^{k}$ and use $\bm{I}$ to denote a $k$ -tuple chosen randomly and uniformly from $[n]^{k}$ . For a data set $T=(y_{1},\ldots,y_{n})$ we denote by $\Pi_{T}^{I}(\psi)=\prod_{j\in[k]}\psi(y_{i_{j}})$ . We first observe that for any $\psi$ ,

For two datasets $S,T\in\mathcal{X}^{n}$ , let $S_{I\leftarrow T}$ denote the data set in which for every $j\in[k]$ , element $i_{j}$ in $S$ is replaced with the corresponding element from $T$ . We fix $I$ . Note that the random variable ${\bm{S}}_{I\leftarrow{\bm{T}}}$ is distributed according to ${\mathcal{P}}^{n}$ and therefore

Now for any fixed $t$ , $S$ and $T$ consider the event $\Pi_{T}^{I}({\cal A}(S))\geq t\mbox{ and }{\cal A}(S)=\psi$ (defined on the range of ${\cal A}$ ). Data sets $S$ and $S_{I\leftarrow T}$ differ in at most $k$ elements. Therefore, by the $\varepsilon$ -differential privacy of ${\cal A}$ and Lemma 24, the distribution ${\cal A}(S)$ and the distribution ${\cal A}(S_{I\leftarrow T})$ satisfy:

Taking the probability over ${\bm{S}}$ and ${\bm{T}}$ we get:

Taking the expectation over $\bm{I}$ and using eq. (3) we obtain that

We now turn our moment inequality into a theorem showing that ${\cal E}_{\bm{S}}[{\bm{\phi}}]$ is concentrated around the true expectation ${\mathcal{P}}[{\bm{\phi}}].$

Consider an execution of ${\cal A}$ with $\varepsilon=\tau/2$ on a data set ${\bm{S}}$ of size $n\geq 12\ln(4/\beta)/\tau^{2}$ . By Lemma 29 we obtain that RHS of our bound in Lemma 8 is at most $e^{\varepsilon k}\mathcal{M}_{k}[B(n,{\mathcal{P}}[\psi])]$ . We use Lemma 31 with $\varepsilon=\tau/2$ and $k=4\ln(4/\beta)/\tau$ (noting that the assumption $n\geq 12\ln(4/\beta)/\tau^{2}$ ensures the necessary bound on $n$ ) to obtain that

This holds for every $\psi$ in the range of ${\cal A}$ and therefore,

We can apply the same argument to the function $1-{\bm{\phi}}$ to obtain that

A union bound over the above inequalities implies the claim. ∎
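The parameter choices made in the proof above can be evaluated for concrete values of $\tau$ and $\beta$. The sketch below simply computes $\varepsilon=\tau/2$, the sample-size requirement $n\geq 12\ln(4/\beta)/\tau^{2}$ and the moment order $k=4\ln(4/\beta)/\tau$; the function name is hypothetical, and $k$ is returned without rounding.

```python
import math

def proof_parameters(tau, beta):
    """Return (eps, minimum n, k) as chosen in the proof."""
    log_term = math.log(4.0 / beta)
    eps = tau / 2.0
    n_min = math.ceil(12.0 * log_term / tau**2)
    k = 4.0 * log_term / tau
    return eps, n_min, k
```

For example, `proof_parameters(0.1, 0.05)` gives a privacy parameter of 0.05 and a sample-size requirement of a little over five thousand samples.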

Extension to $\delta>0$

We use the notation from the proof of Theorem 9 and consider an execution of ${\cal A}$ with $\varepsilon$ and $\delta$ satisfying the conditions of the theorem.

We use the same decomposition of the $k$ -th moment as before:

Now for a fixed $I\in[n]^{k}$ , exactly as in eq. (4), we obtain

Taking the probability over ${\bm{S}}$ and ${\bm{T}}$ and substituting this into eq. (6) we get

Taking the expectation over $\bm{I}$ and using eq. (3) we obtain:

Apply the same argument to $1-{\bm{\phi}}$ and use a union bound. We obtain the claim after rescaling $\tau$ and $\beta$ by a factor $2.$

Beyond statistical queries

Fix $y\in Z$ . We first observe that by Jensen’s inequality,

Further, by definition of differential privacy, for two databases $S,S^{\prime}$ that differ in a single element,

and let $B_{0}\doteq\{S\ |\ g(S)\leq\varepsilon\sqrt{n\ln(2/\beta)/2}\}.$

where the case of $i=0$ follows from the assumptions of the lemma.

The condition $\varepsilon\leq\sqrt{\frac{\ln(1/\beta)}{2n}}$ implies that

Substituting this into inequality (10), we get

Clearly, $\cup_{i\geq 0}B_{i}=\mathcal{X}^{[n]}$ . Therefore

Finally, let $\cal Y$ denote the distribution of ${\bm{Y}}$ . Then,

Our theorem gives a result for statistical queries that achieves the same bound as our earlier result in Theorem 9 up to constant factors in the parameters.

By the Chernoff bound, for any fixed query function $\psi:\mathcal{X}\rightarrow[0,1]$ ,

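Now, by Theorem 11 for $R(\psi)=\left\{S\in\mathcal{X}^{n}\ |\ |\mathcal{P}[\psi]-{\cal E}_{{\bm{S}}}[\psi]|>\tau\right\}$ , $\beta=2e^{-2\tau^{2}n}$ and any $\varepsilon\leq\sqrt{\tau^{2}-\ln(2)/(2n)}$ (which is the condition $\varepsilon\leq\sqrt{\ln(1/\beta)/(2n)}$ evaluated at this $\beta$ , since $\ln(1/\beta)=2\tau^{2}n-\ln 2$ ),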

Applications

To obtain algorithms for answering adaptive statistical queries we first note that if for a query function $\psi$ and a dataset $S$ , $|{\mathcal{P}}[\psi]-{\cal E}_{S}[\psi]|\leq\tau/2$ then we can use an algorithm that outputs a value $v$ that is $\tau/2$ -close to ${\cal E}_{S}[\psi]$ to obtain a value that is $\tau$ -close to ${\mathcal{P}}[\psi]$ . Differentially private algorithms that for a given dataset $S$ and an adaptively chosen sequence of queries $\phi_{1},\ldots,\phi_{m}$ produce a value close to ${\cal E}_{S}[\phi_{i}]$ for each query $\phi_{i}\colon\mathcal{X}\to[0,1]$ have been the subject of intense investigation in the differential privacy literature (see [DR14] for an overview). Such queries are usually referred to as (fractional) counting queries or linear queries in this context. This allows us to obtain statistical query answering algorithms by using various known differentially private algorithms for answering counting queries.

The results in Sections 3 and 4 imply that $|{\mathcal{P}}[\psi]-{\cal E}_{\bm{S}}[\psi]|\leq\tau$ holds with high probability whenever $\psi$ is generated by a differentially private algorithm ${\mathcal{M}}$ . This might appear to be inconsistent with our application since there the queries are generated by an arbitrary (possibly adversarial) adaptive analyst and we can only guarantee that the query answering algorithm is differentially private. The connection comes from the following basic fact about differentially private algorithms:

Let ${\mathcal{M}}:\mathcal{X}^{n}\rightarrow{\mathcal{O}}$ be an $(\epsilon,\delta)$ -differentially private algorithm with range ${\mathcal{O}}$ , and let ${\cal F}:{\mathcal{O}}\rightarrow{\mathcal{O}}^{\prime}$ be an arbitrary randomized algorithm. Then ${\cal F}\circ{{\mathcal{M}}}:{\mathcal{X}}^{n}\rightarrow{\mathcal{O}}^{\prime}$ is $(\epsilon,\delta)$ -differentially private.

Hence, an arbitrary adaptive analyst ${\cal A}$ is guaranteed to generate queries in a manner that is differentially private in ${\bm{S}}$ so long as the only access that she has to ${\bm{S}}$ is through a differentially private query answering algorithm ${\mathcal{M}}$ . We also note that the bounds we state here give the probability of correctness for each individual answer to a query, meaning that the error probability $\beta$ is for each query $\phi_{i}$ and not for all queries at the same time. The bounds we state in Section 1.2 hold with high probability for all $m$ queries and to obtain them from the bounds in this section, we apply the union bound by setting $\beta=\beta^{\prime}/m$ for some small $\beta^{\prime}$ .

We now highlight a few applications of differentially private algorithms for answering counting queries to our problem.

Applying our main generalization bound for $(\epsilon,0)$ -differential privacy directly gives the following corollary.

We apply Theorem 9 with $\epsilon=\tau/2$ and plug this choice of $\epsilon$ into the definition of $n_{L}$ in Theorem 14. We note that the stated lower bound on $n$ implies the lower bound required by Theorem 9. ∎

The corollary that follows the $(\epsilon,\delta)$ bound gives a quadratic improvement in $m$ compared with Corollary 15 at the expense of a slightly worse dependence on $\tau$ and $1/\beta.$
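To make the Laplace-noise approach concrete, here is a minimal Python sketch of an oracle that answers up to $m$ adaptively chosen counting queries. It is an illustration under stated assumptions rather than the mechanism of Theorem 14: the names `laplace_noise` and `LaplaceQueryOracle` are hypothetical, and the noise calibration is the textbook one (each $[0,1]$-valued query has sensitivity $1/n$, and a total budget $\epsilon$ is split evenly over $m$ queries using basic composition).

```python
import random
from typing import Callable, Optional, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    if scale == 0.0:
        return 0.0
    # expovariate takes the rate, i.e. 1 / mean.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class LaplaceQueryOracle:
    """Answers at most m adaptively chosen counting queries on a fixed sample.

    Assumption: every query maps a data point into [0, 1], so replacing one
    of the n points changes the empirical mean by at most 1/n.
    """

    def __init__(self, sample: Sequence[float], m: int, epsilon: float,
                 rng: Optional[random.Random] = None) -> None:
        if len(sample) == 0 or m < 1 or epsilon <= 0.0:
            raise ValueError("need a non-empty sample, m >= 1 and epsilon > 0")
        self.sample = list(sample)
        self.remaining = m
        # Per-query budget epsilon / m and sensitivity 1 / n give this scale.
        self.scale = m / (len(self.sample) * epsilon)
        self.rng = rng if rng is not None else random.Random()

    def answer(self, query: Callable[[float], float]) -> float:
        """Return the empirical mean of the query plus Laplace noise."""
        if self.remaining == 0:
            raise RuntimeError("query budget exhausted")
        self.remaining -= 1
        empirical = sum(query(x) for x in self.sample) / len(self.sample)
        return empirical + laplace_noise(self.scale, self.rng)
```

Because the analyst only ever sees the outputs of `answer`, any query she chooses next is a post-processing of differentially private outputs, which is the property the generalization argument relies on.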

2 Multiplicative Weights Technique

The private multiplicative weights algorithm [HR10] achieves an exponential improvement in $m$ compared with the Laplacian mechanism. The main drawback is a running time that scales linearly with the domain size in the worst case and is therefore not computationally efficient in general.

We apply Theorem 9 with $\epsilon=\tau/2$ and plug this choice of $\epsilon$ into the definition of $n_{MW}$ in Theorem 17. We note that the stated lower bound on $n$ implies the lower bound required by Theorem 9. ∎

Under $(\epsilon,\delta)$ -differential privacy we get the following corollary that improves the dependence on $\tau$ and $\log|\mathcal{X}|$ in Corollary 18 at the expense of a slightly worse dependence on $\beta.$

3 Sparse Vector Technique

In this section we give a computationally efficient technique for answering a number of queries $\phi_{1},\dots,\phi_{m}$ that is exponential in the size $n$ of the data set, so long as they are chosen using only $o(n)$ rounds of adaptivity. We say that a sequence of queries $\phi_{1},\ldots,\phi_{m}\colon\mathcal{X}\to[0,1]$ , answered with numeric values $a_{1},\ldots,a_{m}$ , is generated with $r$ rounds of adaptivity if there are $r$ indices $i_{1},\ldots,i_{r}$ such that the procedure that generates the queries as a function of the answers can be described by $r+1$ (possibly randomized) algorithms $f_{0},f_{1},\ldots,f_{r}$ satisfying:

We build our algorithm out of a differentially private algorithm called SPARSE that takes as input an adaptively chosen sequence of queries together with guesses of the answers to those queries. Rather than always returning numeric-valued answers, it compares the error of our guess to a threshold $T$ and returns a numeric-valued answer to the query only if (a noisy version of) the error of our guess was above the given threshold. SPARSE is computationally efficient, and has the remarkable property that its accuracy has polynomial dependence only on the number of queries for which the error of our guess is close to being above the threshold.

We observe that the naïve method of answering queries using their empirical average allows us to answer each query up to accuracy $\tau$ with probability $1-\beta$ given a data set of size $n_{0}\geq\ln(2/\beta)/\tau^{2}$ so long as the queries are non-adaptively chosen. Thus, with high probability, problems only arise between rounds of adaptivity. If we knew when these rounds of adaptivity occurred, we could refresh our sample between each round, and obtain total sample complexity linear in the number of rounds of adaptivity. The method we present (using $(\epsilon,0)$ -differential privacy) lets us get a comparable bound without knowing where the rounds of adaptivity appear. Using $(\epsilon,\delta)$ -differential privacy would allow us to obtain constant factor improvements if the number of queries were large enough, but does not give an asymptotically better dependence on the number of rounds $r$ (it would allow us to reuse the round detection sample quadratically many times, but in the worst case we would still potentially need to refresh the estimation sample after each round of adaptivity).
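As a quick numerical companion to the bound $n_{0}\geq\ln(2/\beta)/\tau^{2}$ , the following Python sketch computes the sample size for a single non-adaptive query and for $m$ non-adaptive queries answered simultaneously, using the union-bound substitution $\beta=\beta^{\prime}/m$ . The function names are hypothetical and chosen only for this example.

```python
import math


def naive_sample_size(tau: float, beta: float) -> int:
    """Smallest integer n0 with n0 >= ln(2 / beta) / tau^2."""
    if not (0.0 < tau <= 1.0) or not (0.0 < beta < 1.0):
        raise ValueError("need 0 < tau <= 1 and 0 < beta < 1")
    return math.ceil(math.log(2.0 / beta) / tau ** 2)


def naive_sample_size_all_queries(tau: float, beta_total: float, m: int) -> int:
    """Sample size so that all m non-adaptive queries are tau-accurate
    except with probability beta_total (union bound, beta = beta_total / m)."""
    if m < 1:
        raise ValueError("need at least one query")
    return naive_sample_size(tau, beta_total / m)


if __name__ == "__main__":
    print(naive_sample_size(0.1, 0.05))                    # one query
    print(naive_sample_size_all_queries(0.1, 0.05, 1000))  # 1000 queries at once
```

The dependence on $m$ is only logarithmic here because the queries are fixed in advance; the guarantee does not carry over to adaptively chosen queries.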

The idea is the following: we obtain $r$ different estimation samples $S_{1},\ldots,S_{r}$ each of size sufficient to answer non-adaptively chosen queries to error $\tau/8$ with probability $1-\beta/3$ , and a separate round detection sample $S_{h}$ of size $n_{SV}(\tau/8,\beta/3,\epsilon)$ for $\epsilon=\tau/16$ , which we access only through a copy of SPARSE we initialize with threshold $T=\tau/4$ . As queries $\phi_{i}$ start arriving, we compute their answers $a_{i}^{t}={\cal E}_{S_{1}}[\phi_{i}]$ using the naïve method on estimation sample $S_{1}$ , and we use $a_{i}^{t}$ as our guess $g_{i}$ of the correct value on $S_{h}$ when we feed $\phi_{i}$ to SPARSE. If the answer SPARSE returns is $a^{h}_{i}=\bot$ , then we know that with probability $1-\beta/3$ , $a_{i}^{t}$ is accurate up to tolerance $T+\tau/8=3\tau/8$ with respect to $S_{h}$ , and hence statistically valid up to tolerance $\tau/2$ by Theorem 9 with probability at least $1-2\beta/3$ . Otherwise, we discard our estimation sample $S_{1}$ and continue with estimation sample $S_{2}$ . In this case we know that with probability $1-\beta/3$ , $a_{i}^{h}$ is accurate with respect to $S_{h}$ up to tolerance $\tau/8$ , and hence statistically valid up to tolerance $\tau/4$ by Theorem 9 with probability at least $1-2\beta/3$ . We continue in this way, discarding the current estimation sample and moving to the next one whenever our guess $g_{i}$ is incorrect. This succeeds in answering every query so long as our guesses are not incorrect more than $r$ times in total. Finally, we know that except with probability at most $m\beta/3$ , by the accuracy guarantee of our estimation sample for non-adaptively chosen queries, the only queries $i$ for which our guesses $g_{i}$ will deviate from ${\cal E}_{S_{h}}[\phi_{i}]$ by more than $T-\tau/8=\tau/8$ are those queries that lie between rounds of adaptivity. There are at most $r$ of these by assumption, so the algorithm runs to completion with probability at least $1-m\beta/3$ . The algorithm is given in Figure 1.
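The control flow described above can be sketched in Python as follows. This is a simplified illustration and not the algorithm of Figure 1: the class name `RoundDetectingOracle`, the single `noise_scale` parameter, and the choice to raise an error once every estimation sample has been discarded are all assumptions made for the example. In particular, the Laplace noise here is not calibrated to any privacy guarantee; choosing it so that access to $S_{h}$ is $\epsilon$ -differentially private requires the analysis behind SPARSE.

```python
import random
from typing import Callable, List, Optional, Sequence

Query = Callable[[float], float]  # assumed to map a data point into [0, 1]


def mean(query: Query, sample: Sequence[float]) -> float:
    """Empirical average of a query on a sample (the naive method)."""
    return sum(query(x) for x in sample) / len(sample)


def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise; scale 0 disables the noise."""
    if scale == 0.0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class RoundDetectingOracle:
    """Guess with the current estimation sample; check the guess against the
    round detection sample through a noisy threshold test."""

    def __init__(self, estimation_samples: Sequence[Sequence[float]],
                 detection_sample: Sequence[float], tau: float,
                 noise_scale: float,
                 rng: Optional[random.Random] = None) -> None:
        self.estimation_samples: List[Sequence[float]] = list(estimation_samples)
        self.detection_sample = detection_sample
        self.threshold = tau / 4.0          # T = tau / 4, as in the text
        self.noise_scale = noise_scale      # illustrative, not calibrated
        self.rng = rng if rng is not None else random.Random()
        self.current = 0                    # index of the active estimation sample
        self._noisy_threshold = self.threshold + laplace(noise_scale, self.rng)

    def answer(self, query: Query) -> float:
        if self.current >= len(self.estimation_samples):
            raise RuntimeError("all estimation samples have been discarded")
        guess = mean(query, self.estimation_samples[self.current])
        on_detection = mean(query, self.detection_sample)
        noisy_error = abs(guess - on_detection) + laplace(self.noise_scale, self.rng)
        if noisy_error < self._noisy_threshold:
            # The threshold test passed (the "bottom" outcome): keep the guess.
            return guess
        # The guess was rejected: move to the next estimation sample, redraw
        # the noisy threshold, and release a noisy answer from S_h instead.
        self.current += 1
        self._noisy_threshold = self.threshold + laplace(self.noise_scale, self.rng)
        return on_detection + laplace(self.noise_scale, self.rng)
```

With `noise_scale=0.0` the sketch becomes deterministic, which is convenient for checking the control flow but provides no privacy protection for the detection sample.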

This algorithm yields the following theorem:

Note that the accuracy guarantee of SPARSE depends only on the number of incorrect guesses that are actually made. Hence, EffectiveRounds does not halt until the actual number of instances of over-fitting to the estimation samples $S_{i}$ is larger than $r$ . This could be equal to the number of rounds of adaptivity in the worst case (for example, if the analyst is running the Dinur-Nissim reconstruction attack within each round [DN03]), but in practice might achieve a much better bound (if the analyst is not fully adversarial).

We would like to thank Sanjeev Arora, Nina Balcan, Avrim Blum, Dean Foster, Michael Kearns, Jon Kleinberg, Sasha Rakhlin, and Jon Ullman for enlightening discussions and helpful comments. We also thank the Simons Institute for Theoretical Computer Science at Berkeley where part of this research was done.

References

Appendix A Adaptivity in fitting a linear model

In this section, we give a very simple example to illustrate how a data analyst could end up overfitting to a dataset by asking only a small number of (adaptively) chosen queries to the dataset, if they are answered using the naïve method.

Not knowing the distribution, the analyst decides to solve the corresponding optimization problem on her finite sample:

The analyst attempts to solve the problem using the following simple but adaptive strategy:

Intuitively, this natural approach first determines for each attribute whether it is positively or negatively correlated. It then aggregates this information across all $d$ attributes into a single linear model.

The next lemma shows that this adaptive strategy has terrible generalization performance (if $d$ is large). Specifically, we show that even if there is no linear structure whatsoever in the underlying distribution (namely it is normally distributed), the analyst’s strategy falsely discovers a linear model with large objective value.
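A small simulation can make this overfitting effect tangible. The Python sketch below is an illustration under assumptions, since the precise objective and strategy are not reproduced here: it assumes the data are drawn from $N(0,1)^{d}$ , that the objective of a unit vector $u$ is the average inner product $\frac{1}{n}\sum_{x\in D}\langle u,x\rangle$ , and that the analyst sets each coordinate of $u$ to the sign of the corresponding empirical attribute mean, scaled by $1/\sqrt{d}$ . Under these assumptions the true objective of every $u$ is $0$ , while the expected empirical objective is $\sqrt{2d/(\pi n)}$ , because $\mathbb{E}|g|=\sqrt{2/(\pi n)}$ for $g\sim N(0,1/n)$ .

```python
import math
import random
from typing import List, Tuple


def adaptive_sign_strategy(data: List[List[float]]) -> Tuple[List[float], float]:
    """Query each attribute mean, keep its sign, and aggregate into a unit vector.

    Returns the vector u and its empirical objective (1/n) * sum_x <u, x>.
    """
    n, d = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(d)]
    u = [(1.0 if m >= 0.0 else -1.0) / math.sqrt(d) for m in means]
    empirical_objective = sum(u_i * m_i for u_i, m_i in zip(u, means))
    return u, empirical_objective


def simulate_overfitting(n: int, d: int, seed: int) -> Tuple[float, float]:
    """Return (empirical objective, predicted expectation sqrt(2d / (pi n)))."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    _, empirical = adaptive_sign_strategy(data)
    predicted = math.sqrt(2.0 * d / (math.pi * n))
    return empirical, predicted


if __name__ == "__main__":
    empirical, predicted = simulate_overfitting(n=50, d=400, seed=0)
    # The true objective is 0 for every unit vector, yet both numbers are large.
    print(f"empirical objective: {empirical:.3f}, predicted: {predicted:.3f}")
```

Only $d+1$ naively answered queries are needed in this sketch, and the gap between the empirical and true objective grows like $\sqrt{d/n}$ .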

Now, $(1/n)\sum_{x\in D}x_{i}$ is distributed like a Gaussian random variable $g\sim N(0,1/n),$ since each $x_{i}$ is a standard Gaussian. It follows that

Appendix B Background on Differential Privacy

When applying $(\epsilon,\delta)$ -differential privacy, we are typically interested in values of $\delta$ that are very small compared to $1/n$ . In particular, values of $\delta$ on the order of $1/n$ yield no meaningful definition of privacy, as they permit the publication of the complete records of a small number of data set participants, a violation of any reasonable notion of privacy.

where the probability space is over the coin flips of the mechanism ${\cal A}$ .

Differential privacy also degrades gracefully under composition. It is easy to see that the independent use of an $(\varepsilon_{1},0)$ -differentially private algorithm and an $(\varepsilon_{2},0)$ -differentially private algorithm, when taken together, is $(\varepsilon_{1}+\varepsilon_{2},0)$ -differentially private. More generally, we have

Let ${\cal A}_{i}:\mathcal{X}^{n}\rightarrow\mathcal{R}_{i}$ be an $(\varepsilon_{i},\delta_{i})$ -differentially private algorithm for $i\in[k]$ . Then if ${\cal A}_{[k]}:\mathcal{X}^{n}\rightarrow\prod_{i=1}^{k}\mathcal{R}_{i}$ is defined to be ${\cal A}_{[k]}(S)=({\cal A}_{1}(S),\ldots,{\cal A}_{k}(S))$ , then ${\cal A}_{[k]}$ is $(\sum_{i=1}^{k}\varepsilon_{i},\sum_{i=1}^{k}\delta_{i})$ -differentially private.

A more sophisticated argument yields significant improvement when $\varepsilon<1$ :

For all $\varepsilon,\delta,\delta^{\prime}\geq 0$ , the composition of $k$ arbitrary $(\varepsilon,\delta)$ -differentially private mechanisms is $(\varepsilon^{\prime},k\delta+\delta^{\prime})$ -differentially private, where

even when the mechanisms are chosen adaptively.
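To compare the two composition bounds numerically, the following Python sketch evaluates both. The expression for $\varepsilon^{\prime}$ is not reproduced in the text above, so the code assumes the commonly stated form of the advanced composition theorem, $\varepsilon^{\prime}=\sqrt{2k\ln(1/\delta^{\prime})}\,\varepsilon+k\varepsilon(e^{\varepsilon}-1)$ ; treat that formula, and the function names, as assumptions of this example.

```python
import math
from typing import Tuple


def basic_composition(eps: float, delta: float, k: int) -> Tuple[float, float]:
    """Privacy parameters after composing k (eps, delta)-DP mechanisms
    using the simple additive bound."""
    return k * eps, k * delta


def advanced_composition(eps: float, delta: float, k: int,
                         delta_prime: float) -> Tuple[float, float]:
    """Privacy parameters under the assumed advanced composition formula:
    eps' = sqrt(2 k ln(1/delta')) * eps + k * eps * (exp(eps) - 1)."""
    if not (0.0 < delta_prime < 1.0):
        raise ValueError("delta_prime must lie in (0, 1)")
    eps_prime = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1.0))
    return eps_prime, k * delta + delta_prime


if __name__ == "__main__":
    k, eps = 10_000, 0.01
    print(basic_composition(eps, 0.0, k))           # eps grows linearly in k
    print(advanced_composition(eps, 0.0, k, 1e-6))  # roughly sqrt(k) growth
```

For small $\varepsilon$ the second bound grows roughly like $\sqrt{k}$ rather than $k$ , at the cost of the additional failure probability $\delta^{\prime}$ .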

Theorems 25 and 26 are very general. For example, they apply to queries posed to overlapping, but not identical, data sets. Nonetheless, data utility will eventually be consumed: the Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way (see [DN03] et sequelae). The goal of algorithmic research on differential privacy is to stretch a given privacy “budget” of, say, $\varepsilon_{0}$ , to provide as much utility as possible, for example, to provide useful answers to a great many counting queries. The bounds afforded by the composition theorems are the first, not the last, word on utility.

Appendix C Concentration and moment bounds

We will use the following statement of the multiplicative Chernoff bound:

Let $Y_{1},Y_{2},\ldots,Y_{n}$ be i.i.d. Bernoulli random variables with expectation $p>0$ . Then for every $\gamma>0$ ,

C.2 Moment Bounds

and by using this in equality (11) we obtain that

For all integers $n\geq k\geq 1$ and $p\in[0,1]$ ,

Let $U$ denote $\frac{1}{n}\sum_{i\in[n]}X_{i}$ , where $X_{i}$ ’s are i.i.d. Bernoulli random variables with expectation $p>0$ (the claim is obviously true if $p=0$ ). Then

We substitute $t=(1+\gamma)^{k}p^{k}$ and observe that Lemma 27 gives:

Using this substitution in eq. (12) together with $\frac{dt}{d\gamma}=k(1+\gamma)^{k-1}\cdot p^{k}$ we obtain

We now find the maximum of $g(\gamma)\doteq k\ln(1+\gamma)-np((1+\gamma)\ln(1+\gamma)-\gamma)$ . Differentiating the expression we get $\frac{k}{1+\gamma}-np\ln(1+\gamma)$ and therefore the function attains its maximum at the (single) point $\gamma_{0}$ which satisfies: $(1+\gamma_{0})\ln(1+\gamma_{0})=\frac{k}{np}$ . This implies that $\ln(1+\gamma_{0})\leq\ln\left(\frac{k}{np}\right).$ Now we observe that $(1+\gamma)\ln(1+\gamma)-\gamma$ is always non-negative and therefore $g(\gamma_{0})\leq k\ln\left(\frac{k}{np}\right)$ . Substituting this into eq. (13) we conclude that

Finally, we observe that if $p\geq 1/n$ then clearly $\ln(1/p)\leq\ln n$ and the claim holds. For any $p<1/n$ we use monotonicity of $\mathcal{M}_{k}[B(n,p)]$ in $p$ and upper bound the probability by the bound for $p=1/n$ that equals

$k\geq\max\{4p\ln(2/\beta)/\tau,\ 2\log\log n\}$ ,

Using the condition $\varepsilon\leq\tau/2$ and $\tau\leq 1/3$ we first observe that

Hence, with the condition that $k\geq 4p\ln(2/\beta)/\tau$ we get

Together with the condition $k\geq\max\{4\ln(2/\beta)/\tau,\ 2\log\log n\}$ , we have

since $k/2\geq\log\log n$ holds by assumption and for $k\geq 12\ln(2/\beta)$ , $k/6\geq\log(2/\beta)$ and $k/3\geq\log(k+1)$ (whenever $\beta<2/3$ ). Therefore we get

Combining eq. (15) and eq. (16) we obtain that

Substituting this into eq. (14) we obtain the claim. ∎