Interactive Privacy via the Median Mechanism

Aaron Roth, Tim Roughgarden

Introduction

Managing a data set with sensitive but useful information, such as medical records, requires reconciling two objectives: providing utility to others, perhaps in the form of aggregate statistics; and respecting the privacy of individuals who contribute to the data set. The field of private data analysis, and in particular work on differential privacy, provides a mathematical foundation for reasoning about this utility-privacy trade-off and offers methods for non-trivial data analysis that are provably privacy-preserving in a precise sense. For a recent survey of the field, see Dwork [Dwo08].

More precisely, consider a domain XX and database size nn. A mechanism is a randomized function from the set XnX^{n} of databases to some range. For a parameter α>0\alpha>0, a mechanism MM is α\alpha-differentially private if, for every database DD and fixed subset SS of the range of MM, changing a single component of DD changes the probability that MM outputs something in SS by at most an eαe^{\alpha} factor. The output of a differentially private mechanism (and any analysis or privacy attack that follows) is thus essentially independent of whether or not a given individual “opts in” or “opts out” of the database.

Achieving differential privacy requires “sufficiently noisy” answers [DN03]. For example, suppose we’re interested in the result of a query ff — a function from databases to some range — that simply counts the fraction of database elements that satisfy some predicate φ\varphi on XX. A special case of a result in Dwork et al. [DMNS06] asserts that the following mechanism is α\alpha-differentially private: if the underlying database is DD, output f(D)+Δf(D)+\Delta, where the output perturbation Δ\Delta is drawn from the Laplace distribution Lap(1nα)\textrm{Lap}(\tfrac{1}{n\alpha}) with density p(y)=nα2exp(nαy)p(y)=\tfrac{n\alpha}{2}\exp(-n\alpha|y|). Among all α\alpha-differentially private mechanisms, this one (or rather, a discretized analog of it) maximizes user utility in a strong sense [GRS09].

What if we care about more than a single one-dimensional statistic? Suppose we’re interested in kk predicate queries f1,,fkf_{1},\ldots,f_{k}, where kk could be large, even super-polynomial in nn. A natural solution is to use an independent Laplace perturbation for each query answer [DMNS06]. To maintain α\alpha-differential privacy, the magnitude of noise has to scale linearly with kk, with each perturbation drawn from Lap(knα)\textrm{Lap}(\tfrac{k}{n\alpha}). Put another way, suppose one fixes “usefulness parameters” ϵ,δ\epsilon,\delta, and insists that the mechanism is (ϵ,δ)(\epsilon,\delta)-useful, meaning that the outputs are within ϵ\epsilon of the correct query answers with probability at least 1δ1-\delta. This constrains the magnitude of the Laplace noise, and the privacy parameter α\alpha now suffers linearly with the number kk of answered queries. This dependence limits the use of this mechanism to a sublinear k=o(n)k=o(n) number of queries.

Can we do better than independent output perturbations? For special classes of queries like predicate queries, Blum, Ligett, and Roth [BLR08] give an affirmative answer (building on techniques of Kasiviswanathan et al. [KLN+08]). Specifically, in [BLR08] the exponential mechanism of McSherry and Talwar [MT07] is used to show that, for fixed usefulness parameters ϵ,δ\epsilon,\delta, the privacy parameter α\alpha only has to scale logarithmically with the number of queries.More generally, linearly with the VC dimension of the set of queries, which is always at most log2k\log_{2}k. This permits simultaneous non-trivial utility and privacy guarantees even for an exponential number of queries. Moreover, this dependence on logk\log k is necessary in every differentially private mechanism (see the full version of [BLR08]).

The mechanism in [BLR08] suffers from two drawbacks, however. First, it is non-interactive: it requires all queries f1,,fkf_{1},\ldots,f_{k} to be given up front, and computes (noisy) outputs of all of them at once.Or rather, it computes a compact representation of these outputs in the form of a synthetic database. By contrast, independent Laplace output perturbations can obviously be implemented interactively, with the queries arriving online and each answered immediately. There is good intuition for why the non-interactive setting helps: outperforming independent output perturbations requires correlating perturbations across multiple queries, and this is clearly easier when the queries are known in advance. Indeed, prior to the present work, no interactive mechanism better than independent Laplace perturbations was known.

Second, the mechanism in [BLR08] is inefficient. Here by “efficient” we mean has running time polynomial in nn, kk, and X|X|; Dwork et al. [DNR+09] prove that this is essentially the best one could hope for (under certain cryptographic assumptions). The mechanism in [BLR08] is not efficient because it requires sampling from a non-trivial probability distribution over an unstructured space of exponential size. Dwork et al. [DNR+09] recently gave an efficient (non-interactive) mechanism that is better than independent Laplace perturbations, in that the privacy parameter α\alpha of the mechanism scales as 2logk2^{\sqrt{\log k}} with the number of queries kk (for fixed usefulness parameters ϵ,δ\epsilon,\delta).

We define a new interactive differentially private mechanism for answering kk arbitrary predicate queries, called the median mechanism.The privacy guarantee is (α,τ)(\alpha,\tau)-differential privacy for a negligible function τ\tau; see Section 2 for definitions. The basic implementation of the median mechanism interactively answers queries f1,,fkf_{1},\ldots,f_{k} that arrive online, is (ϵ,δ)(\epsilon,\delta)-useful, and has privacy α\alpha that scales with logklogX\log k\log|X|; see Theorem 4.1 for the exact statement. These privacy and utility guarantees hold even if an adversary can adaptively choose each fif_{i} after seeing the mechanism’s first i1i-1 answers. This is the first interactive mechanism better than the Laplace mechanism, and its performance is close to the best possible even in the non-interactive setting.

The basic implementation of the median mechanism is not efficient, and we give an efficient implementation with a somewhat weaker utility guarantee. (The privacy guarantee is as strong as in the basic implementation.) This alternative implementation runs in time polynomial in nn, kk, and X|X|, and satisfies the following (Theorem 5.1): for every sequence f1,,fkf_{1},\ldots,f_{k} of predicate queries, for all but a negligible fraction of input distributions, the efficient median mechanism is (ϵ,δ)(\epsilon,\delta)-useful.

This is the first efficient mechanism with a non-trivial utility guarantee and polylogarithmic privacy cost, even in the non-interactive setting.

2 The Main Ideas

The key challenge to designing an interactive mechanism that outperforms the Laplace mechanism lies in determining the appropriate correlations between different output perturbations on the fly, without knowledge of future queries. It is not obvious that anything significantly better than independent perturbations is possible in the interactive setting.

Our median mechanism and our analysis of it can be summarized, at a high level, by three facts. First, among any set of kk queries, we prove that there are O(logklogX)O(\log k\log|X|) “hard” queries, the answers to which completely determine the answers to all of the other queries (up to ±ϵ\pm\epsilon). Roughly, this holds because: (i) by a VC dimension argument, we can focus on databases over XX of size only O(logk)O(\log k); and (ii) every time we answer a “hard” query, the number of databases consistent with the mechanism’s answers shrinks by a constant factor, and this number cannot drop below 1 (because of the true input database). Second, we design a method to privately release an indicator vector which distinguishes between hard and easy queries online. We note that a similar private ‘indicator vector’ technique was used by Dwork et al. [DNR+09]. Essentially, the median mechanism deems a query “easy” if a majority of the databases that are consistent (up to ±ϵ\pm\epsilon) with the previous answers of the mechanism would answer the current query accurately. The median mechanism answers the small number of hard queries using independent Laplace perturbations. It answers an easy query (accurately) using the median query result given by databases that are consistent with previous answers. A key intuition is that if a user knows that query ii is easy, then it can generate the mechanism’s answer on its own. Thus answering an easy query communicates only a single new bit of information: that the query is easy. Finally, we show how to release the classification of queries as “easy” and “hard” with low privacy cost; intuitively, this is possible because (independent of the database) there can be only O(logklogX)O(\log k\log|X|) hard queries.

Our basic implementation of the median mechanism is not efficient for the same reasons as for the mechanism in [BLR08]: it requires non-trivial sampling from a set of super-polynomial size. For our efficient implementation, we pass to fractional databases, represented as fractional histograms with components indexed by XX. Here, we use the random walk technology of Dyer, Frieze, and Kannan [DFK91] for convex bodies to perform efficient random sampling. To explain why our utility guarantee no longer holds for every input database, recall the first fact used in the basic implementation: every answer to a hard query shrinks the number of consistent databases by a constant factor, and this number starts at XO(logk)|X|^{O(\log k)} and cannot drop below 1. With fractional databases (where polytope volumes play the role of set sizes), the lower bound of 1 on the set of consistent (fractional) databases no longer holds. Nonetheless, we prove a lower bound on the volume of this set for almost all fractional histograms (equivalently probability distributions), which salvages the O(logklogX)O(\log k\log|X|) bound on hard queries for databases drawn from such distributions.

Preliminaries

A mechanism MM is (ϵ,δ)(\epsilon,\delta)-useful if for every sequence of queries (f1,,fk)(f_{1},\ldots,f_{k}) and every database DD, with probability at least 1δ1-\delta it provides answers a1,,aka_{1},\ldots,a_{k} that are ϵ\epsilon-accurate for f1,,fkf_{1},\ldots,f_{k} and DD.

Recall that differential privacy means that changing the identity of a single element of the input database does not affect the probability of any outcome by more than a small factor. Formally, given a database DD, we say that a database DD^{\prime} of the same size is a neighbor of DD if it differs in only a single element: DD=D1|D\cap D^{\prime}|=|D|-1.

We are generally interested in the case where τ\tau is a negligible function of some of the problem parameters, meaning one that goes to zero faster than xcx^{-c} for every constant cc.

Finally, the sensitivity of a real-valued query is the largest difference between its values on neighboring databases. For example, the sensitivity of every non-trivial predicate query is precisely 1/n1/n.

The Median Mechanism: Basic Implementation

We now describe the median mechanism and our basic implementation of it. As described in the Introduction, the mechanism is conceptually simple. It classifies queries as “easy” or “hard”, essentially according to whether or not a majority of the databases consistent with previous answers to hard queries would give an accurate answer to it (in which case the user already “knows the answer”). Easy queries are answered using the corresponding median value; hard queries are answered as in the Laplace mechanism.

To explain the mechanism precisely, we need to discuss a number of parameters. We take the privacy parameter α\alpha, the accuracy parameter ϵ\epsilon, and the number kk of queries as input; these are hard constraints on the performance of our mechanism.We typically think of α,ϵ\alpha,\epsilon as small constants, though our results remain meaningful for some sub-constant values of α\alpha and ϵ\epsilon as well. We always assume that α\alpha is at least inverse polynomial in kk. Note that when α\alpha or ϵ\epsilon is sufficiently small (at most c/nc/n for a small constant cc, say), simultaneously meaningful privacy and utility is clearly impossible. Our mechanism obeys these constraints with a value of δ\delta that is inverse polynomial in kk and nn, and a value of τ\tau that is negligible in kk and nn, provided nn is sufficiently large (at least polylogarithmic in kk and X|X|, see Theorem 4.1). Of course, such a result can be rephrased as a nearly exponential lower bound on the number of queries kk that can be successfully answered as a function of the database size nn.In contrast, the number of queries that the Laplace mechanism can privately and usefully answer is at most linear.

The median mechanism is shown in Figure 1, and it makes use of several additional parameters. For our analysis, we set their values to:

The denominator in (2) can be thought of as our “privacy cost” as a function of the number of queries kk. Needless to say, we made no effort to optimize the constants.

The value rir_{i} in Step 2(a) of the median mechanism is defined as

For the Laplace perturbations in Steps 2(a) and 2(d), recall that the distribution Lap(σ)\textrm{Lap}(\sigma) has the cumulative distribution function

The motivation behind the mechanism’s steps is as follows. The set CiC_{i} is the set of size-mm databases consistent (up to ±ϵ/50\pm\epsilon/50) with previous answers of the mechanism to hard queries. The focus on databases with the small size mm is justified by a VC dimension argument, see Proposition 4.6. Steps 2(a) and 2(b) choose a random value r^i\hat{r}_{i} and a random threshold tit_{i}. The value rir_{i} in Step 2(a) is a measure of how easy the query is, with higher numbers being easier. A more obvious measure would be the fraction of databases SS in Ci1C_{i-1} for which fi(S)fi(D)ϵ|f_{i}(S)-f_{i}(D)|\leq\epsilon, but this is a highly sensitive statistic (unlike rir_{i}, see Lemma 4.9). The mechanism uses the perturbed value r^i\hat{r}_{i} rather than rir_{i} to privately communicate which queries are easy and which are hard. In Step 2(b), we choose the threshold tit_{i} at random between 3/43/4 and 9/109/10. This randomly shifted threshold ensures that, for every database DD, there is likely to be a significant gap between rir_{i} and tit_{i}; such gaps are useful when optimizing the privacy guarantee. Steps 2(c) and 2(d) answer easy and hard queries, respectively. Step 2(e) updates the set of databases consistent with previous answers to hard queries. We prove in Lemma 4.7 that Step 2(f) occurs with at most inverse polynomial probability.

Finally, we note that the median mechanism is defined as if the total number of queries kk is (approximately) known in advance. This assumption can be removed by using successively doubling “guesses” of kk; this increases the privacy cost by an O(logk)O(\log k) factor.

Analysis of Median Mechanism

This section proves the following privacy and utility guarantees for the basic implementation of the median mechanism.

For every sequence of adaptively chosen predicate queries f1,,fkf_{1},\ldots,f_{k} arriving online, the median mechanism is (ϵ,δ)(\epsilon,\delta)-useful and (α,τ)(\alpha,\tau)-differentially private, where τ\tau is a negligible function of kk and X|X|, and δ\delta is an inverse polynomial function of kk and nn, provided the database size nn satisfies

We prove the utility and privacy guarantees in Sections 4.1 and 4.2, respectively.If desired, in Theorem 4.1 we can treat nn as a parameter and solve for the error ϵ\epsilon. The maximum error on any query (normalized by the database size) is O(logklog1/3X/n1/3α1/3)O(\log k\log^{1/3}|X|/n^{1/3}\alpha^{1/3}); the unnormalized error is a factor of nn larger.

Here we prove a utility guarantee for the median mechanism.

The median mechanism is (ϵ,δ)(\epsilon,\delta)-useful, where δ=kexp(Ω(ϵnα))\delta=k\exp(-\Omega(\epsilon n\alpha^{\prime})).

Note that under assumption (6), δ\delta is inverse polynomial in kk and nn.

We give the proof of Theorem 4.2 in three pieces: with high probability, every hard query is answered accurately (Lemma 4.4); every easy query is answered accurately (Lemmas 4.3 and 4.5); and the algorithm does not fail (Lemma 4.7). The next two lemmas follow from the definition of the Laplace distribution (5), our choice of δ\delta, and trivial union bounds.

With probability at least 1δ21-\tfrac{\delta}{2}, rir^i1/100|r_{i}-\hat{r}_{i}|\leq 1/100 for every query ii.

With probability at least 1δ21-\tfrac{\delta}{2}, every answer to a hard query is (ϵ/100)(\epsilon/100)-accurate for DD.

The next lemma shows that median answers are accurate for easy queries.

If rir^i1/100|r_{i}-\hat{r}_{i}|\leq 1/100 for every query ii, then every answer to an easy query is ϵ\epsilon-accurate for DD.

For a query ii, let Gi1={SCi1:fi(D)fi(S)ϵ}G_{i-1}=\{S\in C_{i-1}:|f_{i}(D)-f_{i}(S)|\leq\epsilon\} denote the databases of Ci1C_{i-1} on which the result of query fif_{i} is ϵ\epsilon-accurate for DD. Observe that if Gi1.51Ci1|G_{i-1}|\geq.51\cdot|C_{i-1}|, then the median value of fif_{i} on Ci1C_{i-1} is an ϵ\epsilon-accurate answer for DD. Thus proving the lemma reduces to showing that r^i3/4\hat{r}_{i}\geq 3/4 only if Gi1.51Ci1|G_{i-1}|\geq.51\cdot|C_{i-1}|.

Consider a query ii with Gi1<.51Ci1|G_{i-1}|<.51\cdot|C_{i-1}|. Using (4), we have

Since rir^i1/100|r_{i}-\hat{r}_{i}|\leq 1/100 for every query ii by assumption, the proof is complete. ∎

Our final lemma shows that the median mechanism does not fail and hence answers every query, with high probability; this will conclude our proof of Theorem 4.2. We need the following preliminary proposition, which instantiates the standard uniform convergence bound with the fact that the VC dimension of every set of kk predicate queries is at most log2k\log_{2}k [Vap96]. Recall the definition of the parameter mm from (1).

For every collection of kk predicate queries f1,,fkf_{1},\ldots,f_{k} and every database DD, a database SS obtained by sampling points from DD uniformly at random will satisfy fi(D)fi(S)ϵ|f_{i}(D)-f_{i}(S)|\leq\epsilon for all ii except with probability δ\delta, provided

In particular, there exists a database SS of size mm such that for all i{1,,k}i\in\{1,\ldots,k\}, fi(D)fi(S)ϵ/400|f_{i}(D)-f_{i}(S)|\leq\epsilon/400.

In other words, the results of kk predicate queries on an arbitrarily large database can be well approximated by those on a database with size only O(logk)O(\log k).

If rir^i1/100|r_{i}-\hat{r}_{i}|\leq 1/100 for every query ii and every answer to a hard query is (ϵ/100)(\epsilon/100)-accurate for DD, then the median mechanism answers fewer than 20mlogX20m\log|X| hard queries (and hence answers all queries before terminating).

The plan is to track the contraction of CiC_{i} as hard queries are answered by the median mechanism. Initially we have C0Xm|C_{0}|\leq|X|^{m}. If the median mechanism answers a hard query ii, then the definition of the mechanism and our hypotheses yield

We then claim that the size of the set Ci={SCi1:fi(S)aiϵ/50}C_{i}=\{S\in C_{i-1}:|f_{i}(S)-a_{i}|\leq\epsilon/50\} is at most 94100Ci1\tfrac{94}{100}|C_{i-1}|. For if not,

Iterating now shows that the number of consistent databases decreases exponentially with the number of hard queries:

On the other hand, Proposition 4.6 guarantees the existence of a database SC0S^{*}\in C_{0} for which fi(S)fi(D)ϵ/100|f_{i}(S^{*})-f_{i}(D)|\leq\epsilon/100 for every query fif_{i}. Since all answers aia_{i} produced by the median mechanism for hard queries ii are (ϵ/100)(\epsilon/100)-accurate for DD by assumption, fi(S)aifi(S)fi(D)+fi(D)aiϵ/50|f_{i}(S^{*})-a_{i}|\leq|f_{i}(S^{*})-f_{i}(D)|+|f_{i}(D)-a_{i}|\leq\epsilon/50. This shows that SCkS^{*}\in C_{k} and hence Ck1|C_{k}|\geq 1. Combining this with (7) gives

2 Privacy of the Median Mechanism

This section establishes the following privacy guarantee for the median mechanism.

The median mechanism is (α,τ)(\alpha,\tau)- differentially private, where τ\tau is a negligible function of X|X| and kk when nn is sufficiently large (as in (6)).

Our first lemma states that the small sensitivity of predicate queries carries over, with a 2/ϵ2/\epsilon factor loss, to the rr-function defined in (4).

The function ri(D)=(SCexp(ϵ1f(D)f(S))/Cr_{i}(D)=(\sum_{S\in C}\exp(-\epsilon^{-1}|f(D)-f(S)|)/|C| has sensitivity 2ϵn\frac{2}{\epsilon n} for every fixed set CC of databases and predicate query ff.

Let DD and DD^{\prime} be neighboring databases. Then

where the first inequality follows from the fact that the (predicate) query ff has sensitivity 1/n1/n, the second from the fact that ex1+2xe^{x}\leq 1+2x when xx\in, and the third from the fact that ri(D)1r_{i}(D^{\prime})\leq 1. ∎

The next lemma identifies nice properties of “typical executions” of the median mechanism. Consider an output (d,a)({d},{a}) of the median mechanism with a database DD. From DD and (d,a)({d},{a}), we can uniquely recover the values r1,,rkr_{1},\ldots,r_{k} computed (via (4)) in Step 2(a) of the median mechanism, with rir_{i} depending only on the first i1i-1 components of dd and aa. We sometimes write such a value as ri(D,(d,a))r_{i}(D,({d},{a})), or as ri(D)r_{i}(D) if an output (d,a)({d},{a}) has been fixed. Call a possible threshold tit_{i} good for DD and (d,a)({d},{a}) if di=0d_{i}=0 and ri(D,(d,a))ti+γr_{i}(D,({d},{a}))\geq t_{i}+\gamma, where γ\gamma is defined as in (3). Call a vector t{t} of possible thresholds good for DD and (d,a)({d},{a}) if all but 180mlnX180m\ln|X| of the thresholds are good for DD and (d,a)({d},{a}).

For every database DD, with all but negligible (exp(Ω(logklogX/ϵ2))\exp(-\Omega(\log k\log|X|/\epsilon^{2}))) probability, the thresholds t{t} generated by the median mechanism are good for its output (d,a)({d},{a}).

The idea is to “charge” the probability of bad thresholds to that of answering hard queries, which are strictly limited by the median mechanism. Since the median mechanism only allows 20mlnX20m\ln|X| of the did_{i}’s to be 1, we only need to bound the number of queries ii with output di=0d_{i}=0 and threshold tit_{i} satisfying ri<ti+γr_{i}<t_{i}+\gamma, where rir_{i} is the value computed by the median mechanism in Step 2(a) when it answers the query ii.

Let YiY_{i} be the indicator random variable corresponding to the (larger) event that ri<ti+γr_{i}<t_{i}+\gamma. Define ZiZ_{i} to be 1 if and only if, when answering the iith query, the median mechanism chooses a threshold tit_{i} and a Laplace perturbation Δi\Delta_{i} such that ri+Δi<tir_{i}+\Delta_{i}<t_{i} (i.e., the query is classified as hard). If the median mechanism fails before reaching query ii, then we define Yi=Zi=0Y_{i}=Z_{i}=0. Set Y=i=1kYiY=\sum_{i=1}^{k}Y_{i} and Z=i=1kZiZ=\sum_{i=1}^{k}Z_{i}. We can finish the proof by showing that YY is at most 160mlnX160m\ln|X| except with negligible probability.

Consider a query ii and condition on the event that ri910r_{i}\geq\tfrac{9}{10}; this event depends only on the results of previous queries. In this case, Yi=1Y_{i}=1 only if ti=9/10t_{i}=9/10. But this occurs with probability 23/20γ2^{-3/20\gamma}, which using (3) and (6) is at most 1/k1/k.For simplicity, we ignore the normalizing constant in the distribution over jj’s in Step 2(b), which is Θ(1)\Theta(1). Therefore, the expected contribution to YY coming from queries ii with ri910r_{i}\geq\tfrac{9}{10} is at most 11. Since tit_{i} is selected independently at random for each ii, the Chernoff bound implies that the probability that such queries contribute more than mlnXm\ln|X| to YY is

Now condition on the event that ri<910r_{i}<\tfrac{9}{10}. Let TiT_{i} denote the threshold choices that would cause YiY_{i} to be 1, and let sis_{i} be the smallest such; since ri<910r_{i}<\tfrac{9}{10}, Ti2|T_{i}|\geq 2. For every tiTit_{i}\in T_{i}, ti>riγt_{i}>r_{i}-\gamma; hence, for every tiTi{si}t_{i}\in T_{i}\setminus\{s_{i}\}, ti>rit_{i}>r_{i}. Also, our distribution on the jj’s in Step 2(b) ensures that Pr[tiTi{si}]12Pr[tiTi]\Pr[t_{i}\in T_{i}\setminus\{s_{i}\}]\geq\tfrac{1}{2}\Pr[t_{i}\in T_{i}]. Since the Laplace distribution is symmetric around zero and the random choices Δi,ti\Delta_{i},t_{i} are independent, we have

The definition of the median mechanism ensures that Z20mlnXZ\leq 20m\ln|X| with probability 1. Linearity of expectation, inequality (8), and the Chernoff bound imply that queries with ri<910r_{i}<\tfrac{9}{10} contribute at most 159mlnX159m\ln|X| to YY with probability at least 1exp(Ω(logklogX/ϵ2))1-\exp(-\Omega(\log k\log|X|/\epsilon^{2})). The proof is complete. ∎

with some t{t} good for (d,a),D({d},{a}),D, and where τ\tau is the negligible function of Lemma 4.10. We complete the proof by showing that, for every neighboring database DD^{\prime}, possible output (d,a)({d},{a}), and thresholds tt good for (d,a)({d},{a}) and DD,

Fix a neighboring database DD^{\prime}, a target output (d,a)({d},{a}), and thresholds tt good for (d,a)({d},{a}) and DD. The probability that the median mechanism chooses the target thresholds tt is independent of the underlying database, and so is the same on both sides of (9). For the rest of the proof, we condition on the event that the median mechanism uses the thresholds tt (both with database DD and database DD^{\prime}).

Imagine running the median mechanism in parallel on D,DD,D^{\prime} and condition on the events Ei1,Ei1\mathcal{E}_{i-1},\mathcal{E}^{\prime}_{i-1}. The set Ci1C_{i-1} is then the same in both runs of the mechanism, and ri(D),ri(D)r_{i}(D),r_{i}(D^{\prime}) are now fixed. Let bib_{i} (bib^{\prime}_{i}) be 0 if MM(D,f)MM(D,f) (MM(D,f)MM(D^{\prime},f)) classifies query ii as easy and 1 otherwise. Since ri(D)[ri(D)±2ϵn]r_{i}(D^{\prime})\in[r_{i}(D)\pm\tfrac{2}{\epsilon n}] (Lemma 4.9) and a perturbation with distribution Lap(2αϵn)\textrm{Lap}(\tfrac{2}{\alpha^{\prime}\epsilon n}) is added to these values before comparing to the threshold tit_{i} (Step 2(a)),

and similarly for the events where bi,bi=1b_{i},b^{\prime}_{i}=1. Suppose that the target classification is di=1d_{i}=1 (a hard query), and let sis_{i} and sis^{\prime}_{i} denote the random variables fi(D)+Lap(1αn)f_{i}(D)+\textrm{Lap}(\tfrac{1}{\alpha^{\prime}n}) and fi(D)+Lap(1αn)f_{i}(D^{\prime})+\textrm{Lap}(\tfrac{1}{\alpha^{\prime}n}), respectively. Independence of the Laplace perturbations in Steps 2(a) and 2(d) implies that

Since the predicate query fif_{i} has sensitivity 1/n1/n, we have

Now suppose that di=0d_{i}=0, and let mim_{i} denote the median value of fif_{i} on Ci1C_{i-1}. Then Pr[EiEi1]\Pr[\mathcal{E}_{i}|\mathcal{E}_{i-1}] is either 0 (if miaim_{i}\neq a_{i}) or Pr[bi=0Ei1]\Pr[b_{i}=0\,|\,\mathcal{E}_{i-1}] (if mi=aim_{i}=a_{i}); similarly, Pr[EiEi1]\Pr[\mathcal{E}^{\prime}_{i}|\mathcal{E}^{\prime}_{i-1}] is either 0 or Pr[bi=0Ei1]\Pr[b^{\prime}_{i}=0\,|\,\mathcal{E}^{\prime}_{i-1}]. Thus the bound in (10) continues to hold (even with e2αe^{2\alpha^{\prime}} replaced by eαe^{\alpha^{\prime}}) when di=0d_{i}=0.

Since α\alpha^{\prime} is not much smaller than the privacy target α\alpha (recall (2)), we cannot afford to suffer the upper bound in (10) for many queries. Fortunately, for queries ii with good thresholds we can do much better. Consider a query ii such that tit_{i} is good for (d,a)({d},{a}) and DD and condition again on Ei1,Ei1\mathcal{E}_{i-1},\mathcal{E}^{\prime}_{i-1}, which fixes Ci1C_{i-1} and hence ri(D)r_{i}(D). Goodness implies that di=0d_{i}=0, so the arguments from the previous paragraph also apply here. We can therefore assume that the median value mim_{i} of fif_{i} on Ci1C_{i-1} equals aia_{i} and focus on bounding Pr[bi=0Ei1]\Pr[b_{i}=0\,|\,\mathcal{E}_{i-1}] in terms of Pr[bi=0Ei1]\Pr[b^{\prime}_{i}=0\,|\,\mathcal{E}^{\prime}_{i-1}]. Goodness also implies that ri(D)ti+γr_{i}(D)\geq t_{i}+\gamma and hence ri(D)ti+γ2ϵnti+γ2r_{i}(D^{\prime})\geq t_{i}+\gamma-\tfrac{2}{\epsilon n}\geq t_{i}+\tfrac{\gamma}{2} (by Lemma 4.9). Recalling from (3) the definition of γ\gamma, we have

and of course, Pr[bi=0Ei1]1\Pr[b_{i}=0\,|\,\mathcal{E}_{i-1}]\leq 1.

Applying (10) to the bad queries — at most 180mlnX180m\ln|X| of them, since tt is good for (d,a)({d},{a}) and DD — and (11) to the rest, we can derive

which completes the proof of both the inequality (9) and the theorem. \blacksquare

The Median Mechanism: Efficient Implementation

The basic implementation of the median mechanism runs in time XΘ(logklog(1/ϵ)/ϵ2)|X|^{\Theta(\log k\log(1/\epsilon)/\epsilon^{2})}. This section provides an efficient implementation, running in time polynomial in nn, kk, and X|X|, although with a weaker usefulness guarantee.

Assume that the database size nn satisfies (6). For every sequence of adaptively chosen predicate queries f1,,fkf_{1},\ldots,f_{k} arriving online, the efficient implementation of the median Mechanism is (α,τ)(\alpha,\tau)-differentially private for a negligible function τ\tau. Moreover, for every fixed set f1,,fkf_{1},\ldots,f_{k} of queries, it is (ϵ,δ)(\epsilon,\delta)-useful for all but a negligible fraction of fractional databases (equivalently, probability distributions).

We note however that even for the small fraction of fractional histograms for which the efficient median mechanism may not satisfy our usefulness guarantee, it does not output incorrect answers: it merely halts after having answered a sufficiently large number of queries using the Laplace mechanism. Therefore, even for this small fraction of databases, the efficient median mechanism is an improvement over the Laplace mechanism: in the worst case, it simply answers every query using the Laplace mechanism before halting, and in the best case, it is able to answer many more queries.

We give a high-level overview of the proof of Theorem 5.1 which we then make formal. First, why isn’t the median mechanism a computationally efficient mechanism? Because C0C_{0} has super-polynomial size Xm|X|^{m}, and computing rir_{i} in Step 2(a), the median value in Step 2(c), and the set CiC_{i} in Step 2(e) could require time proportional to C0|C_{0}|. An obvious idea is to randomly sample elements of Ci1C_{i-1} to approximately compute rir_{i} and the median value of fif_{i} on Ci1C_{i-1}; while it is easy to control the resulting sampling error and preserve the utility and privacy guarantees of Section 4, it is not clear how to sample from Ci1C_{i-1} efficiently.

We now give a formal analysis of the efficient implementation.

We redefine the sets CiC_{i} to represent databases that can contain points fractionally, as opposed to the finite set of small discrete databases. Equivalently, we can view the sets CiC_{i} as containing probability distributions over the set of points XX.

We generalize our query functions fif_{i} to fractional histograms in the natural way:

The update operation after a hard query ii is answered is the same as in the basic implementation:

Note that each updating operation after a hard query merely intersects Ci1C_{i-1} with the pair of halfspaces:

and so CiC_{i} is a convex polytope for each ii.

Since CiC_{i} is given as the intersection of a set of explicit halfspaces, we have a simple membership oracle to determine whether a given point FCi{F}\in C_{i}: we simply check that F{F} lies on the appropriate side of each of the halfspaces. This takes time poly(X,m)(|X|,m), since the number of halfspaces defining CiC_{i} is linear in the number of answers to hard queries given before time ii, which is never more than 20mlnX20m\ln|X|. Moreover, for each ii we have CiC0mB1XmB2XXB2X.C_{i}\subseteq C_{0}\subset mB_{1}^{|X|}\subset mB_{2}^{|X|}\subset|X|B_{2}^{|X|}. Finally, we can safely assume that B2XCiB_{2}^{X}\subseteq C_{i} by simply considering the convex set Ci=Ci+B2XC^{\prime}_{i}=C_{i}+B_{2}^{X} instead. This will not affect our results.

Therefore, we can implement the median mechanism in time poly(X,k)(|X|,k) by using sets CiC_{i} as defined in this section, and sampling from them using the grid walk of [DFK91]. Estimation error in computing rir_{i} and the median value of fif_{i} on Ci1C_{i-1} by random sampling rather than brute force is easily controlled via the Chernoff bound and can be incorporated into the proofs of Lemmas 4.3 and 4.5 in the obvious way. It remains to prove a continuous version of Lemma 4.7 to show that the efficient implementation of the median mechanism is (ϵ,δ)(\epsilon,\delta)-useful on all but a negligibly small fraction of fractional histograms FF.

2 Usefulness for Almost All Distributions

We now prove an analogue of Lemma 4.7 to establish a usefulness guarantee for the efficient version of the median mechanism.

With respect to any set of kk queries f1,,fkf_{1},\ldots,f_{k} and for any FC0{F}^{*}\in C_{0}, define

as the set of points that agree up to an additive ϵ\epsilon factor with F{F}^{*} on every query fif_{i}.

For any F{F}^{*}, Goodϵ(F)\textrm{Good}_{\epsilon}({F}^{*}) is a convex polytope contained inside C0C_{0}. We will prove that the efficient version of the median mechanism is (ϵ,δ)(\epsilon,\delta)-useful for a database DD if

We first prove that (12) holds for almost every fractional histogram. For this, we need a preliminary lemma.

Let L\mathcal{L} denote the set of integer points inside C0C_{0}. Then with respect to an arbitrary set of kk queries,

Every rational valued point FC0{F}\in C_{0} corresponds to some (large) database DXD\subset X by scaling F{F} to an integer-valued histogram. Irrational points can be arbitrarily approximated by such a finite database. By Proposition 4.6, for every set of kk predicates f1,,fkf_{1},\ldots,f_{k}, there is a database FXF^{*}\subset X with F=m|F^{*}|=m such that for each ii, fi(F)fi(F)ϵ/400|f_{i}(F^{*})-f_{i}(F)|\leq\epsilon/400. Recalling that the histograms corresponding to databases of size at most mm are exactly the integer points in C0C_{0}, the proof is complete. ∎

All but an Xm|X|^{-m} fraction of fractional histograms FF satisfy

Consider a randomly selected fractional histogram FC0F^{*}\in C_{0}. For any FBF\in\mathcal{B} we have:

Since BLXm|\mathcal{B}|\leq|\mathcal{L}|\leq|X|^{m}, by a union bound we can conclude that except with probability 1Xm\frac{1}{|X|^{m}}, F∉Goodϵ/400(F)F^{*}\not\in\textrm{Good}_{\epsilon/400}(F) for any FBF\in\mathcal{B}. However, by Lemma 5.2, FGoodϵ/400(F)F^{*}\in\textrm{Good}_{\epsilon/400}({F^{\prime}}) for some FLF^{\prime}\in\mathcal{L}. Therefore, except with probability 1/Xm1/|X|^{m}, FLBF^{\prime}\in\mathcal{L}\setminus\mathcal{B}. Thus, since Goodϵ/400(F)Goodϵ/200(F)\textrm{Good}_{\epsilon/400}({F^{\prime}})\subseteq\textrm{Good}_{\epsilon/200}({F^{*}}), except with negligible probability, we have:

We are now ready to prove the analogue of Lemma 4.7 for the efficient implementation of the median mechanism.

For every set of kk queries f1,,fkf_{1},\ldots,f_{k}, for all but an O(Xm)O(|X|^{-m}) fraction of fractional histograms FF, the efficient implementation of the median mechanism guarantees that: The mechanism answers fewer than 40mlogX40m\log|X| hard queries, except with probability kexp(Ω(ϵnα))k\exp(-\Omega(\epsilon n\alpha^{\prime})),

We assume that all answers to hard queries are ϵ/100\epsilon/100 accurate, and that rir^i1100|r_{i}-\hat{r}_{i}|\leq\frac{1}{100} for every ii. By Lemmas 4.3 and 4.4 — the former adapted to accommodate approximating rir_{i} via random sampling — we are in this case except with probability kexp(Ω(ϵnα))k\exp(-\Omega(\epsilon n\alpha^{\prime})).

We analyze how the volume of CiC_{i} contracts with the number of hard queries answered. Suppose the mechanism answers a hard query at time ii. Then:

Since all answers to hard queries are ϵ/100\epsilon/100 accurate, it must be that Goodϵ/100(D)Ck\textrm{Good}_{\epsilon/100}(D)\in C_{k}. Therefore, for an input database DD that satisfies (12) — and this is all but an O(Xm)O(|X|^{-m}) fraction of them, by Lemma 5.3 — we have

Combining inequalities (13) and (14) yields

Lemmas 4.4, 4.5, and 5.4 give the following utility guarantee.

For every set f1,,fkf_{1},\ldots,f_{k} of queries, for all but a negligible fraction of fractional histograms FF, the efficient implementation of the median mechanism is (ϵ,δ)(\epsilon,\delta)-useful with δ=kexp(Ω(ϵnα))\delta=k\exp(-\Omega(\epsilon n\alpha^{\prime})).

3 Usefulness for Finite Databases

Fractional histograms correspond to probability distributions over XX. Lemma 5.3 shows that most probability distributions are ‘good’ for the efficient implementation of the Median Mechanism; in fact, more is true. We next show that finite databases sampled from randomly selected probability distributions also have good volume properties. Together, these lemmas show that the efficient implementation of the median mechanism will be able to answer nearly exponentially many queries with high probability, in the setting in which the private database DD is drawn from some ‘typical’ population distribution. DatabaseSample(D|D|):

Select a fractional point FC0F\in C_{0} uniformly at random.

Sample and return a database DD of size D|D| by drawing each xDx\in D independently at random from the probability distribution over XX induced by FF (i.e. sample xiXx_{i}\in X with probability proportional to FiF_{i}).

For D|D| as in (6) (as required for the Median Mechanism), a database sampled by DatabaseSample(D|D|) satisfies (12) except with probability at most O(Xm)O(|X|^{-m}).

By lemma 5.3, except with probability Xm|X|^{-m}, the fractional histogram FF selected in step 11 satisfies

By lemma 4.6, when we sample a database DD of size DO((logXlog3klog1/ϵ)/ϵ3)|D|\geq O((\log|X|\log^{3}k\log 1/\epsilon)/\epsilon^{3}) from the probability distribution induced by FF, except with probability δ=O(kXlog3k/ϵ)\delta=O(k|X|^{-\log^{3}k/\epsilon}), Goodϵ/200(F)Goodϵ/100(D)\textrm{Good}_{\epsilon/200}(F)\subset\textrm{Good}_{\epsilon/100}(D), which gives us condition (12). ∎

We would like an analogue of lemma 5.3 that holds for all but a diminishing fraction of finite databases (which correspond to lattice points within C0C_{0}) rather than fractional points in C0C_{0}, but it is not clear how uniformly randomly sampled lattice points distribute themselves with respect to the volume of C0C_{0}. If n>>Xn>>|X|, then the lattice will be fine enough to approximate the volume of C0C_{0}, and lemma 5.3 will continue to hold. We now show that small uniformly sampled databases will also be good for the efficient version of the median mechanism. Here, small means n=o(X)n=o(\sqrt{|X|}), which allows for databases which are still polynomial in the size of XX. A tighter analysis is possible, but we opt instead to give a simple argument.

For every nn such that nn satisfies (6) and n=o(X)n=o(\sqrt{|X|}), all but an O(n2/X)O(n^{2}/|X|) fraction of databases DD of size D=n|D|=n satisfy condition (12).

Finally, let BB be the event that database DD fails to satisfy (12). We have:

where the last equality follows from lemma 5.6, which states that PrN[B]\Pr_{N}[B] is negligibly small. ∎

We observe that we can substitute either of the above lemmas for lemma 5.3 in the proof of lemma 5.4 to obtain versions of Thoerem 5.5:

For every set f1,,fkf_{1},\ldots,f_{k} of queries, for all but a negligible fraction of databases sampled by DatabaseSample, the efficient implementation of the median mechanism is (ϵ,δ)(\epsilon,\delta)-useful with δ=kexp(Ω(ϵnα))\delta=k\exp(-\Omega(\epsilon n\alpha^{\prime})).

For every set f1,,fkf_{1},\ldots,f_{k} of queries, for all but an n2/Xn^{2}/|X| fraction of uniformly randomly sampled databases of size nn, the efficient implementation of the median mechanism is (ϵ,δ)(\epsilon,\delta)-useful with δ=kexp(Ω(ϵnα))\delta=k\exp(-\Omega(\epsilon n\alpha^{\prime})).

Conclusion

We have shown that in the setting of predicate queries, interactivity does not pose an information theoretic barrier to differentially private data release. In particular, our dependence on the number of queries kk nearly matches the optimal dependence of logk\log k achieved in the offline setting by [BLR08]. We remark that our dependence on other parameters is not necessarily optimal: in particular, [DNR+09] achieves a better (and optimal) dependence on ϵ\epsilon. We have also shown how to implement our mechanism in time poly(X,k)(|X|,k), although at the cost of sacrificing worst-case utility guarantees. The question of an interactive mechanism with poly(X,k)(|X|,k) runtime and worst-case utility guarantees remains an interesting open question. More generally, although the lower bounds of [DNR+09] seem to preclude mechanisms with run-time poly(logX)(\log|X|) from answering a superlinear number of generic predicate queries, the question of achieving this runtime for specific query classes of interest (offline or online) remains largely open. Recently a representation-dependent impossibility result for the class of conjunctions was obtained by Ullman and Vadhan [UV10]: either extending this to a representation-independent impossibility result, or circumventing it by giving an efficient mechanism with a novel output representation would be very interesting.

Acknowledgments

The first author wishes to thank a number of people for useful discussions, including Avrim Blum, Moritz Hardt, Katrina Ligett, Frank McSherry, and Adam Smith. He would particularly like to thank Moritz Hardt for suggesting trying to prove usefulness guarantees for a continuous version of the BLR mechanism, and Avrim Blum for suggesting the distribution from which we select the threshold in the median mechanism.

References