Algorithmic Stability for Adaptive Data Analysis

Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman

Introduction

Multiple hypothesis testing is a ubiquitous task in empirical research. A finite sample of data is drawn from some unknown population, and several analyses are performed on that sample. The outcome of an analysis is deemed significant if it is unlikely to have occurred by chance alone, and a “false discovery” occurs if the analyst incorrectly declares an outcome to be significant. False discovery has been identified as a substantial problem in the scientific community (see e.g. [Ioa05, GL14]). This problem persists despite decades of research by statisticians on methods for preventing false discovery, such as the widely used Bonferroni Correction [Bon36, Dun61] and the Benjamini-Hochberg Procedure [BH95].

False discovery is often attributed to misuse of statistics. An alternative explanation is that the prevalence of false discovery arises from the inherent adaptivity in the data analysis process—the fact that the choice of analyses to perform depends on previous interactions with the data (see e.g. [GL14]). Adaptivity is essentially unavoidable when a sequence of research groups publish research papers based on overlapping data sets. Adaptivity also arises naturally in other settings, for example: in multistage inference algorithms where data are preprocessed (say, to select features or restrict to a principal subspace) before the main analysis is performed; in scoring data-based competitions [BH15]; and in the re-use of holdout or test data [DFH+15c, DFH+15a].

The general problem of adaptive data analysis was formally modeled and studied in recent papers by Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth [DFH+15b] and by Hardt and Ullman [HU14]. The striking results of Dwork et al. [DFH+15b] gave the first nontrivial algorithms for provably ensuring statistical validity in adaptive data analysis, allowing for even an exponential number of tests against the same sample. In contrast, [HU14, SU15b] showed inherent statistical and computational barriers to preventing false discovery in adaptive settings.

The key ingredient in Dwork et al. is a notion of “algorithmic stability” that is suitable for adaptive analysis. Informally, changing one input to a stable algorithm does not change it’s output “too much.” Traditionally, stability was measured via the change in the generalization error of an algorithm’s output, and algorithms stable according to such a criterion have long been known to ensure statistical validity in nonadaptive analysis [DW79a, DW79b, KR99, BE02, SSSS10]. Following a connection first suggested by McSherrySee, e.g., [McS14], although the observation itself dates back at least to 2008 (personal communication)., Dwork et al. showed that a stronger stability condition designed to ensure data privacy—called differential privacy (DP) [DMNS06, Dwo06]—guarantees statistical validity in adaptive data analysis. This allowed them to repurpose known DP algorithms to prevent false discovery. A crucial difference from traditional notions of stability is that DP requires a change in one input lead to a small change in the probability distribution on the outputs (in particular, differentially private algorithms must be randomized). In this paper, we refer to differential privacy as max-KL-stability (Definition 2.3), to emphasize the relation to the literature on algorithmic stability other notions of stability we study (KL- and TV-stability, in particular).

In this work, we extend the results of Dwork et al. along two axes. First, we give an optimal analysis of the statistical validity of max-KL stable algorithms. As a consequence, we immediately obtain the best known bounds on the sample complexity (equivalently, the convergence rate) of adaptive data analysis. Second, we generalize the connection between max-KL stability and statistical validity to a much larger family of statistics. Our proofs are also significantly simpler than those of Dwork et al., and clarify the role of different stability notions in the adaptive setting.

Following the previous work on this subject ([DFH+15b]), we formalize the problem of adaptive data analysis as follows. There is a distribution $\mathbf{P}$ over some finite universe $\mathcal{X}$ , and a mechanism $\mathcal{M}$ that does not know $\mathbf{P}$ , but is given a set $\mathbf{x}$ consisting of $n$ samples from $\mathbf{P}$ . Using its sample, the mechanism must answer queries on $\mathbf{P}$ . Here, a query $q$ , coming from some family $Q$ , maps a distribution $\mathbf{P}$ to a real-valued answer. The mechanism’s answer $a$ to a query $q$ is $\alpha$ -accurate if $|a-q(\mathbf{P})|\leq\alpha$ with high probability. Importantly, the mechanism’s goal is to provide answers that “generalize” to the underlying distribution, rather than answers that are specific to its sample.

We model adaptivity by allowing a data analyst to ask a sequence of queries $q_{1},q_{2},\dots,q_{k}\in Q$ to the mechanism, which responds with answers $a_{1},a_{2},\dots,a_{k}$ . In the adaptive setting, the query $q_{j}$ may depend on the previous queries and answers $q_{1},a_{1},\dots,q_{j-1},a_{j-1}$ arbitrarily. We say the mechanism is $\alpha$ -accurate given $n$ samples for $k$ adaptively chosen queries if, with high probability, when given a vector $\mathbf{x}$ of $n$ samples from an arbitrary distribution $\mathbf{P}$ , the mechanism accurately responds to any adaptive analyst that makes at most $k$ queries.

Surprisingly, Dwork et al. [DFH+15b], showed there are mechanisms that are much more effective than naïvely outputting the empirical answer. They show that “stable” mechanisms are accurate and, by applying a stable mechanism from the literature on differential privacy, they obtain a mechanisms that are accurate given only $n\gtrsim\sqrt{k}/\alpha^{2.5}$ samples, which is a significant improvement over the naïve mechanism when $\alpha$ is not too small. (See Table 1 for more detailed statements of their results, including results that achieve an exponential improvement in the sample complexity when $|\mathcal{X}|$ is bounded.)

Our first contribution is to give a simpler and quantitatively optimal analysis of the generalization properties of stable algorithms, which immediately yields new accuracy bounds for adaptive statistical queries. In particular, we show that $n\gtrsim\sqrt{k}/\alpha^{2}$ samples suffice. Since $1/\alpha^{2}$ samples are required to answer a single nonadaptive query, our dependence on $\alpha$ is optimal.

Although statistical queries are surprisingly general [Kea93], we would like to be able to ask more general queries on the distribution $\mathbf{P}$ that capture a wider variety of machine learning and data mining tasks. To this end, we give the first bounds on the sample complexity required to answer large numbers of adaptively chosen low-sensitivity queries and optimization queries, which we now describe.

Our sample complexity bounds are summarized in Table 1.

Our bounds were applied in subsequent work of Dwork et al. [DFH+15c, DFH+15a] in the analysis of their “reusable holdout” construction.

2 Overview of Techniques

Our main result is a new proof, with optimal parameters, that a stable algorithm that provides answers to adaptive queries that are close to the empirical value on the sample gives answers that generalize to the underlying distribution. In particular, we prove:

Let $\mathcal{M}$ be a mechanism that takes a sample $\mathbf{x}\in\mathcal{X}^{n}$ and answers $k$ adaptively-chosen low-sensitivity queries. Suppose that $\mathcal{M}$ satisfies the following:

$\mathcal{M}$ satisfies $(\alpha,\alpha\beta)$ -max-KL stability (Definition 2.3, identical to $(\alpha,\alpha\beta)$ -differential privacy).

Our actual result is somewhat more general than Theorem 1.1. We show that the population-level error of a stable algorithm is close to its error on the sample, whether or not that error is low. Put glibly: stable algorithms cannot be wrong without realizing it.

Compared to the results of [DFH+15b], Theorem 1.1 requires a quantitatively weaker stability guarantee— $(\alpha,\alpha\beta)$ -stability, instead of $(\alpha,(\beta/k)^{1/\alpha})$ -stability. It also applies to arbitrary low-sensitivity queries as opposed to the special case of statistical queries.

Our analysis differs from that of Dwork et al. in two key ways. First, we give a better bound on the probability with which a single low-sensitivity query output by a max-KL stable algorithm has good generalization error. Second, we show a reduction from the case of many queries to the case of a single query that has no loss in parameters (in contrast, previous work took a union bound over queries, leading to a dependence on $k$ , the number of queries).

Both steps rely on a thought experiment in which several “real” executions of a stable algorithm are simulated inside another algorithm, called a monitor, which outputs a function of the “real” transcripts. Because stability is closed under post-processing, the monitor is itself stable. Because it exists only as a thought experiment, the monitor can be given knowledge of the true distribution from which the data are drawn, and can use this knowledge to process the outputs of the simulated “real” runs. The monitor technique allows us to start from a basic guarantee, which states that a single query has good generalization error with constant probability, and amplify the guarantee so that (a) the generalization error holds with very high probability, and (b) the guarantee holds simultaneously over all queries in a sequence. The proof of the basic guarantee follows the lines of existing proofs using algorithmic stability (e.g., [DW79a]), while the monitor technique and the resulting amplification statements are new.

The amplification of success probability is the more technically sophisticated of the two key steps. The idea is to run many (about $1/\beta$ , using the notation of Theorem 1.1) copies of a stable mechanism on independently selected data sets. Each of these interactions results in a sequence of queries and answers. The monitor then selects the the query and answer pair from amongst all of the sequences that has the largest error. It then outputs this query as well as the index of the interaction that produced it. Our main technical lemma shows that the monitor will find a “bad” query/dataset pair (one where the true and empirical values of the query differ) with at most constant probability. This implies that the each of the real executions outputs a bad query with probability $O(\beta)$ . Relative to previous work, the resulting argument yields better bounds, applies to more general classes of queries, and even generalizes to other notions of stability.

In general, we cannot prove that our bounds are optimal. Even for nonadaptive statistical queries, $n\gtrsim\log(k)/\alpha^{2}$ samples are necessary, and [HU14, SU15b] showed that $n\gtrsim\min\{\sqrt{k},\sqrt{\log|\mathcal{X}|}\}/\alpha$ samples are necessary to answer adaptively chosen statistical queries. However, the gap between the upper and lower bounds is still significant.

However, we can show that our connection between max-KL stability and generalization is optimal (see Section 7 for details). Moreover, for every family of queries we consider, no max-KL stable algorithm can achieve better sample complexity [BUV14, BST14]. Thus, any significant improvement to our bounds must come from using a weaker notion of stability or some entirely different approach.

Each of our results requires instantiating the mechanism with a suitable stable / differentially private algorithm. For statistical queries, the optimal mechanisms are the well known Gaussian and Laplace Mechanisms (slightly refined by [SU15a]) when $k$ is small and the Private Multiplicative Weights Mechanism [HR10] when $k$ is large. For arbitrary low-sensitivity queries, the Gaussian or Laplace Mechanism is again optimal when $k$ is small, and for large $k$ we can use the Median Mechanism [RR10].

When considering arbitrary search queries over an arbitrary finite range, the optimal algorithm is the Exponential Mechanism [MT07]. For the special case of convex minimization queries over an infinite domain, we use the optimal algorithm of [BST14] when $k$ is small, and when $k$ is large, we use an algorithm of [Ull15] that accurately answers exponentially many such queries.

Our techniques applies to notions of distributional stability other than max-KL/differential privacy. In particular, defining stability in terms of total variation (TV) or KL divergence (KL) leads to bounds on the generalization error that have polynomially, rather than exponentially, decreasing tails. See Section 4 for details.

Preliminaries

Given a distribution $\mathbf{P}$ over $\mathcal{X}$ or a sample $\mathbf{x}=(\mathbf{x}_{1},\cdots,\mathbf{x}_{n})\in\mathcal{X}^{n}$ , we would like to answer queries about $\mathbf{P}$ or $x$ from some family $Q$ . We will often want to bound the “sensitivity” of the queries with respect to changing one element of the sample. To this end, we use $x\sim x^{\prime}$ to denote that $x,x^{\prime}\in\mathcal{X}^{n}$ differ on at most one entry. We will consider several different families of queries:

Statistical Queries: These queries are specified by a function $q:\mathcal{X}\to$ , and (abusing notation) are defined as

The error of an answer $a$ to a statistical query $q$ with respect to $\mathbf{P}$ or $\mathbf{x}$ is defined to be

The error of an answer $a$ to a $\Delta$ -sensitive query $q$ with respect to $\mathbf{P}$ or $\mathbf{x}$ is defined to be

We denote the set of all $\Delta$ -sensitive queries by $Q_{\Delta}$ . If $\Delta=O(1/n)$ we say the query is low sensitivity. Note that $1/n$ -sensitive queries are a strict generalization of statistical queries.

Here $\Theta$ is an arbitrary set of items (sometimes called “parameter values”) among which we aim to chose the item (“parameter”) with minimal loss, either with respect to a particular input data set $x$ , or with respect to expectation over a distribution $\mathbf{P}$ .

We denote the set of minimization queries by $Q_{\mathit{min}}$ . We highlight two special cases:

Minimization for Finite Sets: We denote by $Q_{\mathit{min},{D}}$ the set of minimization queries where $\Theta$ is finite with size at most $D$ .

2 Mechanisms for Adaptive Queries

Our goal is to design a mechanism $\mathcal{M}$ that answers queries on $\mathbf{P}$ using only independent samples $x_{1},\dots,x_{n}\leftarrow_{\mbox{\tiny R}}\mathbf{P}$ . Our focus is the case where the queries are chosen adaptively and adversarially.

Specifically, $\mathcal{M}$ is a stateful algorithm that holds a collection of samples $x_{1},\dots,x_{n}\in\mathcal{X}$ , takes a query $q$ from some family $Q$ as input, and returns an answer $a$ . We require that when $x_{1},\dots,x_{n}$ are independent samples from $\mathbf{P}$ , the answer $a$ is “close” to $q(\mathbf{P})$ in a sense that is appropriate for the family of queries. Moreover we require that this condition holds for every query in an adaptively chosen sequence $q_{1},\dots,q_{k}$ . Formally, we define an accuracy game between a mechanism $\mathcal{M}$ and a stateful data analyst $\mathcal{A}$ in Figure 1.

A mechanism $\mathcal{M}$ is $(\alpha,\beta)$ -accurate with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples in $\mathcal{X}$ if for every adversary $\mathcal{A}$ ,

We will also use a definition of accuracy relative to the sample given to the mechanism, described in Figure 2.

A mechanism $\mathcal{M}$ is $(\alpha,\beta)$ -accurate with respect to samples of size $n$ from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$ if for every adversary $\mathcal{A}$ ,

3 Max-KL Stability (a.k.a. Differential Privacy)

Informally, an algorithm is “stable” if changing one of its inputs does not change its output “too much.” For our results, we will consider randomized algorithms, and require that changing one input does not change the distribution of the algorithm’s outputs too much. With this in mind, we will define here one notion of algorithmic stability that is related to the well-known notion of KL-divergence between distributions. In Section 4.1, we will give other related notions of algorithmic stability based on different notions of closeness between distributions.

Let $\mathcal{W}:\mathcal{X}^{n}\to\mathcal{R}$ be a randomized algorithm. We say that $\mathcal{W}$ is $(\varepsilon,\delta)$ -max-KL stable if for every pair of samples $\mathbf{x},\mathbf{x}^{\prime}$ that differ on exactly one element, and every $R\subseteq\mathcal{R}$ ,

This notion of $(\varepsilon,\delta)$ -max-KL stability is also commonly known as $(\varepsilon,\delta)$ -differential privacy [DMNS06], however in this context we choose the term max-KL stability to emphasize the conceptual relationship between this notion and other notions of algorithmic stability that have been studied in machine learning. We also emphasise that our work has a very different motivation to the motivation of differential privacy — stable algorithms are desirable even when privacy is not a concern, such as when the data does not concern humans.

In our analysis, we will make crucial use of the fact that max-KL-stability (as well as the other notions of stability discussed in Section 4.1)is closed under post-processing.

Let $\mathcal{W}:\mathcal{X}^{n}\to\mathcal{R}$ and $f:\mathcal{R}\to\mathcal{R}^{\prime}$ be a pair of randomized algorithms. If $\mathcal{W}$ is $(\varepsilon,\delta)$ -max-KL-stable then the algorithm $f(\mathcal{W}(\mathbf{x}))$ is $(\varepsilon,\delta)$ -max-KL-stable.

The definition we gave above does not immediately apply to algorithms that interact with a data analyst to answer adaptively chosen queries. Such a mechanism does not simply take a sample $\mathbf{x}$ as input and produce an output. Instead, in the interactive setting, there is a mechanism $\mathcal{M}$ that holds a sample $\mathbf{x}$ and interacts with some algorithm $\mathcal{A}.$ We can view this entire interaction between $\mathcal{M}$ and $\mathcal{A}$ as a single noninteractive meta algorithm that outputs the transcript of the interaction and define stability with respect to that meta algorithm. Specifically, we define the algorithm $\mathcal{W}[\mathcal{M},\mathcal{A}](x)$ that simulates the interaction between $\mathcal{M}(x)$ and $\mathcal{A}$ and outputs the messages sent between them. The simulation is also parameterized by $n,k,Q,$ although we will frequently omit these parameters when they are clear from context.

Note that $\mathcal{W}[\mathcal{M},\mathcal{A}]$ is a noninteractive mechanism, and its output is just the query-answer pairs of $\mathcal{M}$ and $\mathcal{A}$ in the sample accuracy game, subject to the mechanism being given the sample $\mathbf{x}.$ Now we can define the stability of an interactive mechanism $\mathcal{M}$ using $\mathcal{W}.$

[Stability of for Interactive Mechanism] We say an interactive mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$ -max-KL stable for $k$ queries from $Q$ if for every adversary $\mathcal{A}$ , the algorithm $\mathcal{W}_{n,k,Q}[\mathcal{M},\mathcal{A}](\mathbf{x}):\mathcal{X}^{n}\to(Q\times\mathcal{R})^{k}$ is $(\varepsilon,\delta)$ -max-KL stable.

3.2 Composition of Max-KL Stability

The definition above allows for adaptive composition. This follows directly from composition results of $(\varepsilon,\delta)$ -differentially private algorithms. A mechanism that is $(\varepsilon,\delta)$ -max-KL stable for $1$ query is $(\approx\varepsilon\sqrt{k},\approx\delta k)$ -stable for $k$ adaptively chosen queries [DMNS06, DRV10]. More precisely, for every $0\leq\varepsilon\leq 1$ and $\delta,\delta^{\prime}>0$ , if a mechanism that is $(\varepsilon,\delta)$ -max-KL stable for $1$ query is used to answer $k$ adaptively chosen queries, it remains $(\varepsilon\sqrt{k\log(1/\delta^{\prime})}+2\varepsilon^{2}k,\;\delta^{\prime}+k\delta)$ -max-KL stable [DRV10].

From Max-KL Stability to Accuracy for Low-Sensitivity Queries

In this section we prove our main result that any mechanism that is both accurate with respect to the sample and satisfies max-KL stability (with suitable parameters) is also accurate with respect to the population. The proof proceeds in two mains steps. First, we prove a lemma that says that there is no max-KL stable mechanism that takes several independent sets of samples from the distribution and finds a query and a set of samples such that the answer to that query on that set of samples is very different from the answer to that query on the population. In Section 3.2 we prove this lemma for the simpler case of statistical queries and then in 3.3 we extend the proof to the more general case of low-sensitivity queries.

The second step is to introduce a monitoring algorithm. This monitoring algorithm will simulate the interaction between the mechanism and the adversary on multiple independent sets of samples. It will then output the least accurate query across all the different interactions. We show that if the mechanism is stable then the monitoring algorithm is also stable. By choosing the number of sets of samples appropriately, we ensure that if the mechanism has even a small probability of being inaccurate in a given interaction, then the monitor will have a constant probability of finding an inaccurate query in one of the interactions. By the lemma proven in the first step, no such monitoring algorithm can satisfy max-KL stability, therefore every stable mechanism must be accurate with high probability.

As a warmup, in this section we give a simpler version of our main lemma for the case of statistical queries and a single sample. Although these results follow from the results of Section 3.3 on general low-sensitivity queries, we include the simpler version to introduce the main ideas in the cleanest possible setting.

Before giving the proof, we set up some notation. Let $\mathbf{x}=(x_{1},\dots,x_{n})$ . For a single element $x^{\prime}\in\mathcal{X}$ , and an index $i\in[n]$ , we use $\mathbf{x}_{i\rightarrow x^{\prime}}$ to denote the new sample where the $i$ -th element of $\mathbf{x}$ has been replaced by the element $x^{\prime}$ . Let $x^{\prime}\leftarrow_{\mbox{\tiny R}}\mathbf{P}$ be independent from $\mathbf{x}$ .

Therefore, using the fact that $|q(\mathbf{P})|\leq 1$ for any statistical query $q$ and distribution $\mathbf{P}$ , we have

2 Warmup: A Multi-Sample De-Correlated Expectation Lemma for SQs

As a second warmup, in this section we give a simpler version of our main lemma for the case of statistical queries and multiple samples. That is, we consider a setting where there are many subsamples available to the algorithm. The multi-sample de-correlated expectation lemma says that a max-KL stable algorithm cannot take a collection of samples $\mathbf{x}_{1},\dots,\mathbf{x}_{T}$ and output a pair $(q,t)$ such that $q(\mathbf{P})$ and $q(\mathbf{x}_{t})$ differ significantly in expectation.

Let $\mathcal{W}:(\mathcal{X}^{n})^{T}\to Q\times[T]$ be $(\varepsilon,\delta)$ -max-KL stable where $Q$ is the class statistical queries $q:\mathcal{X}\to$ . Let $\mathbf{P}$ be a distribution on $\mathcal{X}$ and let $\mathbf{X}=(\mathbf{x}_{1},\dots,\mathbf{x}_{T})\leftarrow_{\mbox{\tiny R}}(\mathbf{P}^{n})^{T}$ . Then

Before giving the proof, we set up some notation. Let $\mathbf{X}=(\mathbf{x}_{1},\dots,\mathbf{x}_{T})$ be a set of $T$ samples where each sample $\mathbf{x}_{t}=(x_{t,1},\dots,x_{t,n})$ . For a single element $x^{\prime}\in\mathcal{X}$ , and a pair of indices $(m,i)\in[T]\times[n]$ , we use $\mathbf{X}_{(m,i)\rightarrow x^{\prime}}$ to denote the new set of $T$ samples where the $i$ -th element of the $m$ -th sample of $\mathbf{X}$ has been replaced by the element $x^{\prime}$ .

3 A Multi-Sample De-Correlated Expectation Lemma

Here, we give the most general de-correlated expectation lemma that considers multiple samples and applies to the more general class of low-sensitivity queries.

We remark that if we use the weaker assumption that $\mathcal{W}$ is $(e^{\varepsilon}-1+\delta)$ -TV stable, (defined in Section 4.1), then we would obtain the same conclusion but with the weaker bound of $2T(e^{\varepsilon}-1+\delta)\Delta n.$ The advantage of using the stronger definition of max-KL stability is that we only have to decrease $\delta$ with $T$ and not $\varepsilon.$ This advantage is crucial because algorithms satisfying $(\varepsilon,\delta)$ -max-KL stability necessarily have a linear dependence on $1/\varepsilon$ but only a polylogarithmic dependence on $1/\delta$ .

Let $\mathbf{X}^{\prime}=(\mathbf{x}^{\prime}_{1},\dots,\mathbf{x}^{\prime}_{T})\leftarrow_{\mbox{\tiny R}}(\mathbf{P}^{n})^{T}$ be independent of $\mathbf{X}$ . Recall that each element $\mathbf{x}_{t}$ of $\mathbf{X}$ is itself a vector $(x_{t,1},\dots,x_{t,n}),$ and the same is true for each element $\mathbf{x}^{\prime}_{t}$ of $\mathbf{X}^{\prime}.$ We will sometimes refer to the vectors $\mathbf{x}_{1},\dots,\mathbf{x}_{T}$ as the subsamples of $\mathbf{X}$ .

4 From Multi-Sample De-Correlated Expectation to Accuracy

Now that we have Lemma 3.3, we can prove the following result that max-KL stable mechanisms that are also accurate with respect to their sample are also accurate with respect to the population from which that sample was drawn.

Let $Q$ be a family of $\Delta$ -sensitive queries on $\mathcal{X}$ . Assume that, for some $\alpha,\beta\in(0,.1)$ , $\mathcal{M}$ is

$(\varepsilon=\alpha/64\Delta n,\delta=\alpha\beta/32\Delta n)$ -max-KL stable for $k$ adaptively chosen queries from $Q$ and

$(\alpha^{\prime}=\alpha/8,\beta^{\prime}=\alpha\beta/16\Delta n)$ -accurate with respect to its sample for $n$ samples from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$ .

Then $\mathcal{M}$ is $(\alpha,\beta)$ -accurate with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples from $\mathcal{X}$ .

The key step in the proof is to define a monitoring algorithm that takes $T$ separate samples $\mathbf{X}=(\mathbf{x}_{1},\dots,\mathbf{x}_{T})$ and for each sample $\mathbf{x}_{t}$ , simulates an independent interaction between $\mathcal{M}(\mathbf{x}_{t})$ and $\mathcal{A}$ . This monitoring algorithm then outputs the query with the largest error across all of the queries and interactions ( $kT$ queries in total). Since changing one input to $\mathbf{X}$ only affects one of the simulations, the monitoring algorithm will be stable so long as $\mathcal{M}$ is stable, without any loss in the stability parameter. On the other hand, if $\mathcal{M}$ has even a small chance $\beta$ of answering a query with large error, then if we simulate $T\approx 1/\beta$ independent interactions, there is a constant probability that at least one of the simulations results in a query with large error. Thus, the monitor will be a stable algorithm that outputs a query with large error in expectation. By the multi-sample de-correlated expectation lemma, such a monitor is impossible, which implies that $\mathcal{M}$ has probability $\leq\beta$ of answering any query with large error.

Let $\mathcal{M}$ be an interactive mechanism. Let $\mathcal{A}$ be an analyst and let $\mathbf{P}$ be the distribution chosen by $\mathcal{A}$ . We define the following monitoring algorithm.

If $\mathcal{M}$ is stable then so is $\mathcal{W},$ and this fact follows easily from the post-processing lemma (Lemma 2.4):

For every $\varepsilon,\delta\geq 0$ , if the mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$ -max-KL stable for $k$ adaptively chosen queries from $Q,$ then for every $\mathbf{P}$ and $\mathcal{A}$ , the monitor $\mathcal{W}_{\mathbf{P},k,Q}[\mathcal{M},\mathcal{A}]$ is $(\varepsilon,\delta)$ -max-KL stable.

If $\mathcal{M}$ is $(\varepsilon,\delta)$ -max-KL stable for $k$ adaptively chosen queries from $Q$ then for every analyst $\mathcal{A}$ who asks $k$ queries from $Q$ , and every $t$ the algorithm $\mathcal{W}^{\prime}(\mathbf{x}_{t})$ that simulates the interaction between $\mathcal{M}(\mathbf{x}_{t})$ and $\mathcal{A}$ and outputs the resulting query-answer pairs is $(\varepsilon,\delta)$ -max-KL stable. From this, it follows that the algorithm $\mathcal{W}^{\prime}(\mathbf{X})$ that simulates the interactions between $\mathcal{M}(\mathbf{x}_{t})$ and $\mathcal{A}$ for every $t=1,\dots,T$ and outputs the resulting query-answer pairs is $(\varepsilon,\delta)$ -max-KL stable. To see this, observe that if $\mathbf{X},\mathbf{X}^{\prime}$ differ only on one subsample $\mathbf{x}_{t}$ , then for every $t^{\prime}\neq t$ , $\mathbf{x}_{t^{\prime}}=\mathbf{x}^{\prime}_{t^{\prime}}$ and thus the query-answer pairs corresponding to subsample $t^{\prime}$ are identically distributed regardless of whether we use $\mathbf{X}$ or $\mathbf{X}^{\prime}$ as input to $\mathcal{W}.$

Observe that the algorithm $\mathcal{W}$ defined above is simply a post-processing of these $kT$ query-answer pairs. That is, $(q^{*},t^{*})$ depends only on $\{(q_{t,j},a_{t,j}\}_{t\in[T],j\in[k]}$ and $\mathbf{P}$ , and not on $\mathbf{X}.$ Thus, by Lemma 2.4, $\mathcal{W}$ is $(\varepsilon,\delta)$ -max-KL stable. ∎

We will use the $\mathcal{W}$ with $T=\left\lfloor 1/\beta\right\rfloor.$ In light of Claim 3.5 and our assumption that $\mathcal{M}$ is $(\varepsilon,\delta)$ -max-KL stable, we can apply Lemma 3.3 to obtain

To complete the proof, we show that if $\mathcal{M}$ is not $(\alpha,\beta)$ -accurate with respect to the population $\mathbf{P}$ , then (1) cannot hold. To do so, we need the following natural claim about the output of the monitor.

Since $\mathcal{M}$ fails to be $(\alpha,\beta)$ -accurate, for every $t\in[T]$ ,

We obtain the claim from (2) by using the fact that the $T$ sets of query-answer pairs corresponding to different subsamples $\mathbf{x}_{1},\dots,\mathbf{x}_{T}$ are independent. That is, the random variables $\max_{j\in[k]}\left|q_{t,j}(\mathbf{P})-a_{t,j}\right|$ indexed by $t\in[T]$ are independent. Since $q^{*}(\mathbf{P})-a_{q^{*}}$ is simply the maximum of these independent random variables, the first part of the claim follows. Also, by construction, $\mathcal{W}$ ensures that

If $\mathcal{M}$ is $(\alpha^{\prime},\beta^{\prime})$ -accurate for the sample but not $(\alpha,\beta)$ -accurate for the population, then

Line (4) follows from two observations. First, since $\mathcal{M}$ is assumed to be $(\alpha/8,\alpha\beta/16\Delta n)$ -accurate for one sample, by a union bound, it is simultaneously $(\alpha/8,T(\alpha\beta/16\Delta n))$ -accurate for all of the $T$ samples. Thus, we have $a_{q^{*}}-q^{*}(\mathbf{x}_{t^{*}})\leq\alpha^{\prime}$ except with probability at most $T(\alpha\beta/16\Delta n).$ Second, since $q^{*}$ is a $\Delta$ -sensitive query, we always have $a_{q^{*}}-q^{*}(\mathbf{x}_{t^{*}})\leq 2\Delta n.$ Without loss of generality, the answers of $\mathcal{M}$ can be truncated to an interval of width $2\Delta n$ that contains the correct answer $q^{*}(\mathbf{x}_{t^{*}}).$ Doing so will ensure $|a_{q^{*}}-q^{*}(\mathbf{x}_{t^{*}})|\leq 2\Delta n.$ ∎

Thus, if $\mathcal{M}$ is not $(\alpha,\beta)$ -accurate for the population, we will obtain a contradiction to (1). This completes the proof. ∎

Other Notions of Stability and Accuracy on Average

Definition 4.2 gives one notion of stability, namely max-KL stability. However, this is by no means the only way to formalise stability for our purposes. In this section we consider other notions of stability and the advantages they have.

We will define here other notions of algorithmic stability, and in Section 4.2, we will show that such notions can provide expected guarantees for generalization error which can be used to achieve accuracy on average.

Let $\mathcal{W}:\mathcal{X}^{n}\to\mathcal{R}$ be a randomized algorithm. We say that $\mathcal{W}$ is $\varepsilon$ -TV stable if for every pair of samples that differ on exactly one element,

Let $\mathcal{W}:\mathcal{X}^{n}\to\mathcal{R}$ be a randomized algorithm. We say that $\mathcal{W}$ is $\varepsilon$ -KL-stable if for every pair of samples $\mathbf{x},\mathbf{x}^{\prime}$ that differ on exactly one element,

The post-processing property of Max-KL stability (Lemma 2.4 in Section 2.3) also applies to the two stability notions above.

Let $\mathcal{W}:\mathcal{X}^{n}\to\mathcal{R}$ and $f:\mathcal{R}\to\mathcal{R}^{\prime}$ be a pair of randomized algorithms. If $\mathcal{W}$ is { $\varepsilon$ -TV, $\varepsilon$ -KL, $(\varepsilon,\delta)$ -max-KL}-stable then the algorithm $f(\mathcal{W}(\mathbf{x}))$ is { $\varepsilon$ -TV, $\varepsilon$ -KL, $(\varepsilon,\delta)$ -max-KL}-stable.

$\varepsilon$ -KL stability implies $\varepsilon$ -TV stability by Pinsker’s inequality. The relationship between max-KL stability defined in Section 2.3 and the above notions is more subtle. When $\varepsilon\leq 1$ , $(\varepsilon,0)$ -max-KL stability implies $\varepsilon$ -KL stability and thus also $\varepsilon$ -TV stability. When $\varepsilon\leq 1$ and $\delta>0$ , $(\varepsilon,\delta)$ -max-KL stability implies $(2\varepsilon+\delta)$ -TV stability. It also implies that $\mathcal{M}$ is “close” to satisfying $2\varepsilon$ -KL stability (cf. [DRV10] for more discussion of these notions).

As in Section 2.3.1, we define TV-stability and KL-stability of an interactive mechanism $\mathcal{M}$ through a noninteractive mechanism that simulates the interaction between $\mathcal{M}$ and an adversary $\mathcal{A}$ . The definition for these notions of stability is precisely analogous to Definition 2.5 for max-KL stability.

As with max-KL stability, both notions above allow for adaptive composition. In fact, $\varepsilon$ -TV stability composes linearly—a mechanism that is $\varepsilon$ -TV stable for $1$ query is $\varepsilon k$ -stable for $k$ queries. The advantage of the stronger notions of KL and max-KL stability is that they have a stronger composition. A mechanism that is $\varepsilon$ -KL stable for $1$ query is $(\varepsilon\sqrt{k})$ -stable for $k$ queries.

2 From TV Stability to Accuracy on Average

In this section we show that TV stable algorithms guarantee a weaker notion of accuracy on average for adaptively chosen queries.

3 Accuracy on Average

In Section 2.2 we defined accurate mechanisms to be those that answer accurately (either with respect to the population or the sample) with probability close to $1$ . In this section we define a relaxed notion of accuracy that only requires low error in expectation over the coins of $\mathcal{M}$ and $\mathcal{A}$ .

A mechanism $\mathcal{M}$ is $\alpha$ -accurate on average with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples in $\mathcal{X}$ if for every adversary $\mathcal{A}$ ,

We will also use a definition of accuracy relative to the sample given to the mechanism:

A mechanism $\mathcal{M}$ is $\alpha$ -accurate on average with respect to samples of size $n$ from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$ if for every adversary $\mathcal{A}$ ,

Towards our goal of proving that TV stability implies accuracy on average in the adaptive setting, we first prove a lemma saying that TV stable algorithms cannot output a low-sensitivity query such that the sample has large error for that query. In the next section we will show how this lemma implies accuracy on average in the adaptive setting.

Combining, the second and third observation with the triangle inequality, we have

3.2 From De-Correlated Expectation to Accuracy on Average

Let $Q_{\Delta}$ be the family of $\Delta$ -sensitive queries on $\mathcal{X}$ . Assume that $\mathcal{M}$ is

$(\varepsilon=\alpha/4\Delta n)$ -TV stable for $k$ adaptively chosen queries from $Q=Q_{\Delta}$ and

$(\alpha^{\prime}=\alpha/2)$ -accurate on average with respect to its sample for $n$ samples from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$ .

Then $\mathcal{M}$ is $\alpha$ -accurate on average with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples from $\mathcal{X}$ .

The high level approach of the proof is to apply the Lemma 4.6 to a “monitoring algorithm” that watches the interaction between the mechanism $\mathcal{M}(\mathbf{x})$ and the analyst $\mathcal{A}$ and then outputs the least accurate query. Since $\mathcal{M}(\mathbf{x})$ is stable, the de-correlated expectation lemma says that the query output by the monitor will satisfy $q(\mathbf{P})\approx q(\mathbf{x})$ in expectation, this implies that even for the least accurate query in the interaction between $\mathcal{M}(\mathbf{x})$ and $\mathcal{A}$ , $q(\mathbf{P})\approx q(\mathbf{x})$ in expectation Thus, if $\mathcal{M}$ is accurate with respect to the sample $\mathbf{x}$ , it is also accurate with respect to $\mathbf{P}.$

Let $\mathcal{M}$ be an interactive mechanism and $\mathcal{A}$ be an analyst that chooses the distribution $\mathbf{P}$ . We define the following monitoring algorithm.

If $\mathcal{M}$ is stable then so is $\mathcal{W},$ and this fact follows easily from the post-processing lemma (Lemma 2.4).

For every $\varepsilon\geq 0$ , if the mechanism $\mathcal{M}$ is $\varepsilon$ -TV stable for $k$ adaptively chosen queries from $Q,$ then for every $\mathbf{P}$ and $\mathcal{A}$ , the monitor $\mathcal{W}_{\mathbf{P}}[\mathcal{M},\mathcal{A}]$ is $\varepsilon$ -TV stable.

The assumption that $\mathcal{M}$ is $\varepsilon$ -TV stable for $k$ adaptively chosen queries from $Q$ means that for every analyst $\mathcal{A}$ who asks $k$ queries from $Q$ , the algorithm $\mathcal{W}^{\prime}(\mathbf{x})$ that simulates the interaction between $\mathcal{M}(\mathbf{x})$ and $\mathcal{A}$ and outputs the resulting query-answer pairs is $\varepsilon$ -TV stable. Observe that the algorithm $\mathcal{W}$ defined above is simply a post-processing of these query-answer pairs. That is, $q^{*}$ depends only on $q_{1},a_{1},\dots,q_{k},a_{k}$ and $\mathbf{P}$ , and not on $\mathbf{x}.$ Thus, by Lemma 2.4, for every $\mathbf{P}$ and $\mathcal{A}$ , the monitor $\mathcal{W}_{\mathbf{P}}[\mathcal{M},\mathcal{A}]$ is $\varepsilon$ -TV stable. ∎

In light of Claim 4.8 and our assumption that $\mathcal{M}$ is $(\alpha/4\Delta n)$ -TV stable, we can apply Lemma 4.6 to obtain

To complete the proof, we show that if $\mathcal{M}$ is not $\alpha$ -accurate on average with respect to the population $\mathbf{P}$ , then (5) cannot hold.

If $\mathcal{M}$ is $(\alpha/2)$ -accurate for the sample but not $\alpha$ -accurate for the population, then

Using our assumptions, we can calculate as follows.

Line (6) follows from two observations. First, by construction of $\mathcal{W}$ , we always have $q^{*}(\mathbf{P})-a_{q^{*}}\leq 0.$ Second, since $\mathcal{M}$ is assumed not to be $\alpha$ -accurate on average for the population, the expected value of $|q^{*}(\mathbf{P})-a_{q^{*}}|>\alpha.$ Since $\mathcal{W}$ ensures that $a_{q^{*}}-q^{*}(\mathbf{P})\geq 0,$ we also have that the absolute value of the expectation of $q^{*}(\mathbf{P})-a_{q^{*}}$ is greater than $\alpha.$ Line (7) follows from the assumption that $\mathcal{M}$ is $(\alpha/2)$ -accurate on average for the sample. ∎

Thus, if $\mathcal{M}$ is not $\alpha$ -accurate on average for the population, we will obtain a contradiction to (5). This completes the proof. ∎

From Low-Sensitivity Queries to Optimization Queries

In this section, we extend our results for low-sensitivity queries to the more general family of minimization queries. To do so, we design a suitable monitoring algorithm for minimization queries. As in our analysis of low-sensitivity queries, we will have the monitoring algorithm take as input many independent samples and simulate the interaction between $\mathcal{M}$ and $\mathcal{A}$ on each of those samples. Thus, if $\mathcal{M}$ has even a small probability of being inaccurate, then with constant probability the monitor will find a minimization query that $\mathcal{M}$ has answered inaccurately. Previously, we had monitor simply output this query and applied Lemma 3.3 to arrive at a contradiction. However, since Lemma 3.3 only applies to algorithms that output a low-sensitivity query, we can’t apply it to the monitor that outputs a minimization query. We address this by having the monitor output the error function associated with the loss function and answer it selects, which is a low-sensitivity query. If we assume that the mechanism is accurate for its sample but not for the population, then the monitor will find a loss function and an answer with low error on the sample but large error on the population. Thus the error function will be a low-sensitivity query with very different answers on the sample and the population, which is a contradiction. To summarize, we have the following theorem.

Let $Q=Q_{\mathit{min}}$ be the family of $\Delta$ -sensitive minimization queries on $\mathcal{X}$ . Assume that, for some $\alpha,\beta\geq 0$ , $\mathcal{M}$ is

$(\varepsilon=\alpha/128\Delta n,\delta=\alpha\beta/64\Delta n)$ -max-KL stable for $k$ adaptively chosen queries from $Q$ and

$(\alpha^{\prime}=\alpha/8,\beta^{\prime}=\alpha\beta/32\Delta n)$ -accurate with respect to its sample for $n$ samples from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$ .

Then $\mathcal{M}$ is $(\alpha,\beta)$ -accurate with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples from $\mathcal{X}$ .

The formal proof is nearly identical to that of Theorem 3.4, so we omit the full proof. Instead, we will simply describe the modified monitoring algorithm.

Applications

We now plug known stable mechanisms (designed in the context of differential privacy) in to Theorem 3.4 to obtain mechanisms that provide strong error guarantees with high probability for both low-sensitivity and statistical queries.

There is a mechanism $\mathcal{M}$ that is $(\alpha,\beta)$ -accurate with respect to the population for $k$ adaptively chosen queries from $Q_{\Delta}$ where $\Delta=O(1/n)$ given $n$ samples from $\mathcal{X}$ for

There is a mechanism $\mathcal{M}$ that is $\alpha$ -accurate on average with respect to the population for $k$ adaptively chosen queries from $Q_{\mathit{SQ}}$ given $n$ samples from $\mathcal{X}$ for

2 Optimization Queries

The results of the Section 5 can be combined with existing differentially private algorithms for minimizing “empirical risk” (that is, loss with respect to the sample $x$ ) to obtain algorithms for answering adaptive sequences of minimization queries. We provide a few specific instantiations here, based on known differentially private mechanisms.

Let $\Theta$ be a finite set of size at most $D$ . Let $Q\subset Q_{\mathit{min}}$ be the set of sensitivity- $1/n$ loss functions bounded between and $C$ . Then there is a mechanism $\mathcal{M}$ that is $(\alpha,\beta)$ -accurate with respect to the population for $k$ adaptively chosen queries from $Q_{\mathit{min}}$ given

samples from $\mathcal{X}$ . The running time of the mechanism is dominated by $O((k+\log(1/\beta))\cdot D)$ evaluations of the loss function.

2.2 Convex Minimization

samples from $Q$ . The running time of the mechanism is dominated by $k\cdot n^{2}$ evaluations of the gradient $\nabla L$ .

samples from $\mathcal{X}$ . The running time of the mechanism is dominated by $k\cdot n^{2}$ evaluations of the gradient $\nabla L$ .

An Alternative Form of Generalization and Tightness of Our Results

We now provide an alternative form of our generalization bounds. The following Theorem is more general than Theorem 3.4 because it says that no max-KL stable procedure that outputs a low-sensitivity can output a query that distinguishes the sample from the population (not just max-KL stable procedures that are accurate for the sample).

First we prove the following technical lemma.

where $H(X)$ is the Shannon entropy of the distribution of $X$ (measured in nats, rather than bits). In particular,

as the uniform distribution maximizes entropy. Moreover, $C\geq\max_{x\in F}e^{\eta f(x)}$ , whence $\frac{1}{\eta}\log C\geq\max_{x\in F}f(x)$ . The result now follows from these two inequalities. ∎

Intuitively, Theorem 7.2 says that “stability prevents overfitting.” It says that no stable algorithm can output a low-sensitivity function that distinghishes its input from the population the input was drawn from (i.e. “overfits” its sample).

In particular, Theorem 7.2 implies that, if a mechanism $\mathcal{M}$ is stable and outputs $q$ that “fits” its data, then $q$ also “fits” the population. This gives a learning theory perspective on our results.

Consider the following monitor algorithm $\mathcal{W}$ .

We will use the monitor $\mathcal{W}$ with $T=\left\lfloor\epsilon/\delta\right\rfloor$ . Observe that $\mathcal{W}$ only access its input through $\mathcal{M}$ (which is $(\varepsilon,\delta)$ -max-KL-stable) and the exponential mechanism (which is $(\varepsilon,0)$ -max-KL-stable). Thus, by composition and postprocessing, $\mathcal{W}$ is $(2\varepsilon,\delta)$ -max-KL stable. We can hence apply Lemma 3.3 to obtain

Now we can apply Lemma 7.1 with $f(q,t)=q(\mathbf{x}_{t})-q(\mathbf{P})$ and $\eta=\frac{\varepsilon}{\Delta}$ to get

To complete the proof, we assume, for the sake of contradition, that $\mathcal{M}$ has a high enough probability of outputting a query $q$ such that $\left|q(\mathbf{P})-q(\mathbf{x})\right|$ is large. To obtain a contradiction from this assumption, we need the following natural claim (analogous to Claim 3.6) about the output of the monitor.

This contradicts the assumption that $n\geq\frac{1}{\varepsilon^{2}}\log(\frac{4\varepsilon}{\delta})$ and hence completes the proof. ∎

We now show that our connection between max-KL stability and generalization (Theorem 7.2 and Theorem 3.4) is optimal.

Let $\alpha>\delta>0$ , let $n\geq\frac{1}{\alpha}$ , and let $\Delta\in$ . Let $\mathcal{U}$ be the uniform distribution over $ $. There exists a$ (0,\delta) $-max-KL stable algorithm$ \mathcal{A}:^{n}\to Q_{\Delta} $such that if$ \mathbf{X}\leftarrow_{\mbox{\tiny R}}\mathcal{U}^{n} $and if$ q\leftarrow_{\mbox{\tiny R}}\mathcal{A}(\mathbf{X})$ then

Consider the following simple algorithm, denoted as $\mathcal{B}$ . On input a database $\mathbf{x}$ , output $\mathbf{x}$ with probability $\delta$ , and otherwise output the empty database. Clearly, $\mathcal{B}$ is $(0,\delta)$ -max-KL stable. Now construct the following algorithm $\mathcal{A}$ .

As $\mathcal{B}$ is $(0,\delta)$ -max-KL stable, and as $\mathcal{A}$ only applies $\mathcal{B}$ on disjoint databases, we get that $\mathcal{A}$ is also $(0,\delta)$ -max-KL stable.

Suppose $\mathbf{X}=(\mathbf{x}_{1},\dots,\mathbf{x}_{1/\alpha})$ contains i.i.d. samples from $\mathcal{U}$ , and consider the execution of $\mathcal{A}$ on $\mathbf{X}$ . Observe that the predicate $p$ evaluates to 1 only on a finite number of points from $ $, and hence, we have that$ q_{p}(\mathcal{U})=0 $. Next note that$ q_{p}(\mathbf{X})=\alpha\Delta n\cdot|\{i:\hat{\mathbf{x}}_{i}=\mathbf{x}_{i}\}| $. Therefore, if there exists an$ i $s.t.$ \hat{\mathbf{x}}_{i}=\mathbf{x}_{i} $then$ q_{p}(\mathbf{X})-q_{p}(\mathcal{U})\geq\alpha\Delta n$. The probability that this is not the case is at most

ans thus, with probability at least $\frac{\delta}{2\alpha}$ , algorithm $\mathcal{A}$ outputs a $\Delta$ -sensitive query $q_{p}$ s.t. $q_{p}(\mathbf{X})-q_{p}(\mathcal{U})\geq\alpha\Delta n$ . ∎

In particular, using Lemma 7.4 with $\alpha=\varepsilon$ shows that the parameters in Theorem 7.2 are tight.

We thank Mark Bun, Moritz Hardt, Aaron Roth, and Salil Vadhan for many helpful discussions.