A Learning Theory Approach to Non-Interactive Database Privacy

Avrim Blum, Katrina Ligett, Aaron Roth

Introduction

As large-scale collection of personal information becomes easier, the problem of database privacy is increasingly important. In many cases, we might hope to learn useful information from sensitive data (for example, we might learn a correlation between smoking and lung cancer from a collection of medical records). However, for legal, financial, or moral reasons, administrators of sensitive datasets might not want to release their data. If those with the expertise to learn from large datasets are not the same as those who administer the datasets, what is to be done? In order to study this problem theoretically, it is important to quantify what exactly we mean by “privacy.”

A series of recent papers [DN04, BDMN05, DMNS06] formalizes the notion of differential privacy. A database privatization mechanism (which may be either interactive or non-interactive) satisfies differential privacy if the addition or removal of a single database element does not change the probability of any outcome of the privatization mechanism by more than some small amount. The definition is intended to capture the notion that “distributional information is not private”—we may reveal that smoking correlates to lung cancer, but not that any individual has lung cancer. Individuals may submit their personal information to the database secure in the knowledge that (almost) nothing can be discovered from the database with their information that could not have been discovered without their information.

In this paper, motivated by learning theory, we propose the study of privacy-preserving mechanisms that are useful for all queries in a particular class (such as all conjunction queries or all halfspace queries). In particular, we focus on counting queries of the form, “what fraction of the database entries satisfy predicate φ\varphi?” and say that a sanitized output is useful for a class CC if the answers to all queries in CC have changed by at most some ±α\pm\alpha.

Building on the techniques of Kasiviswanathan et al. [KLN+08], we show that for discretized domains, for any concept class that admits an α\alpha-net Nα\mathcal{N}_{\alpha}, it is possible to privately release synthetic data that is useful for the class, with error that grows proportionally to the logarithm of the size of Nα\mathcal{N}_{\alpha}. As a consequence, we show that it is possible to release data useful for a set of counting queries with error that grows proportionally to the VC-dimension of the class of queries. The algorithm is not in general computationally efficient. We are able to give a different algorithm that efficiently releases synthetic data for the class of interval queries (and more generally, axis-aligned rectangles in fixed dimension) that achieves guarantees in a similar range of parameters.

Unfortunately, we show that for non-discretized domains, under the above definition of usefulness, it is impossible to publish a differentially private database that is useful for even quite simple classes such as interval queries. We next show how, under a natural relaxation of the usefulness criterion, one can release information that can be used to usefully answer (arbitrarily many) halfspace queries while satisfying privacy. In particular, instead of requiring that useful mechanisms answer each query approximately correctly, we allow our algorithm to produce an answer that is approximately correct for some nearby query. This relaxation is motivated by the notion of large-margin separators in learning theory [AB99, Vap98, SS02]; in particular, queries with no data points close to the separating hyperplane must be answered accurately, and the allowable error more generally is a function of the fraction of points close to the hyperplane.

We also introduce a new concept, distributional privacy, which makes explicit the notion that when run on a database drawn from a distribution, privacy-preserving mechanisms should reveal only information about the underlying distribution, and nothing else. Given a distribution D\mathcal{D} over database points, a database privatization mechanism satisfies distributional privacy if with high probability, drawing an entirely new database from D\mathcal{D} does not change the probability of any outcome of the privatization mechanism by more than some small amount. We show that distributional privacy is a strictly stronger guarantee than differential privacy by showing that any mechanism that satisfies distributional privacy also satisfies differential privacy, but that there are some functions that can be answered accurately while satisfying differential privacy, and yet reveal information about the particular database (although not about any particular database element) that is not “distributional.”

Recent work on theoretical guarantees for data privacy was initiated by [DN03]. The notion of differential privacy, finally formalized by [DMNS06], separates issues of privacy from issues of outside information by defining privacy as indistinguishability of neighboring databases. This captures the notion that (nearly) anything that can be learned if your data is included in the database can also be learned without your data. This notion of privacy ensures that users have very little incentive to withhold their information from the database. The connection between data privacy and incentive-compatibility was formalized by McSherry and Talwar [MT07].

Much of the initial work focused on lower bounds. Dinur and Nissim [DN03] showed that any mechanism that answers substantially more than a linear number of subset-sum queries with error o(1/n)o(1/\sqrt{n}) yields what they called blatant non-privacy – i.e. it allows an adversary to reconstruct all but a o(1)o(1) fraction of the original database. Dwork et al. [DMT07] extend this result to the case in which the private mechanism can answer a constant fraction of queries with arbitrary error, and show that still if the error on the remaining queries is o(1/n)o(1/\sqrt{n}), the result is blatant non-privacy. Dwork and Yekhanin [DY08] give further improvements. These results easily extend to the case of counting queries which we consider here.

Dwork et al. [DMNS06], in the paper that defined differential privacy, show that releasing the answers to kk low sensitivity queries with noise drawn independently from the Laplace distribution with scale k/ϵk/\epsilon preserves ϵ\epsilon-differential privacy. Unfortunately, the noise scales linearly in the number of queries answered, and so this mechanism can only answer a sub-linear number of queries with non-trivial accuracy. Blum et al. [BDMN05] consider a model of learning and show that concept classes that are learnable in the statistical query (SQ) model are also learnable from a polynomially sized dataset accessed through an interactive differential-privacy-preserving mechanism. We note that such mechanisms still access the database by asking counting-queries perturbed with independent noise from the Laplace distribution, and so can still only make a sublinear number of queries. In this paper, we give a mechanism for privately answering counting queries with noise that grows only logarithmically with the number of queries asked (or more generally with the VC-dimension of the query class). This improvement allows an analyst to answer an exponentially large number of queries with non-trivial error, rather than only linearly many.

Most similar to this paper is the work of Kasiviswanathan et al. [KLN+08]. Kasiviswanathan et al. study what can be learned privately when what is desired is that the hypothesis output by the learning algorithm satisfies differential privacy. They show that in a PAC learning model in which the learner has access to the private database, ignoring computational constraints, anything that is PAC learnable is also privately PAC learnable. We build upon the technique in their paper to show that in fact, it is possible to privately release a dataset that is simultaneously useful for any function in a concept class of polynomial VC-dimension. Kasiviswanathan et al. also study several restrictions on learning algorithms, show separation between these learning models, and give efficient algorithms for learning particular concept classes.

1.2 Subsequent Work

Since the original publication of this paper in STOC 2008 [BLR08] there has been a substantial amount of follow up work. A sequence of papers by Dwork et al. [DNR+09, DRV10] give a non-interactive mechanism for releasing counting queries with accuracy that depends in a similar way to the mechanism presented in this paper on the total number of queries asked, but has a better dependence on the database size. This comes at the expense of relaxing the notion of ϵ\epsilon-differential privacy to an approximate version called (ϵ,δ)(\epsilon,\delta)-differential privacy. The mechanism of [DRV10] also extends to arbitrary low-sensitivity queries rather than only counting queries. This extension makes crucial use of the relaxation to (ϵ,δ)(\epsilon,\delta)-privacy, as results such as those given in this paper cannot be extended to arbitrary low-sensitivity queries while satisfying ϵ\epsilon-differential privacy as shown recently by De [De11].

Roth and Roughgarden [RR10] showed that bounds similar to those achieved in this paper can also be achieved in the interactive setting, in which queries are allowed to arrive online and must be answered before the next query is known. In many applications, this gives a large improvement in the accuracy of answers, because it allows the analyst to pay for those queries which were actually asked in the course of a computation (which may be only polynomially many), as opposed to all queries which might potentially be asked, as is necessary for a non-interactive mechanism. Hardt and Rothblum [HR10] gave an improved mechanism for the interactive setting based on the multiplicative weights framework which achieves bounds comparable to the improved bounds of [DRV10], also in the interactive setting. An offline version of this mechanism (constructed by pairing the online mechanism with an agnostic learner for the class of queries of interest) was given by Gupta et al. [GHRU11], and an experimental evaluation on real data was done by Hardt, Ligett, and McSherry [HLM11]. Gupta, Roth, and Ullman unified the online mechanisms of [RR10, HR10] into a generic framework (and improved their error bounds) by giving a generic reduction from online learning algorithms in the mistake bound model to private query release algorithms in the interactive setting [GRU11]. [GRU11] also gives a new mechanism based on this reduction that achieves improved error guarantees for the setting in which the database size is comparable to the size of the data universe.

Following this work, there has been significant attention paid to the problem of releasing the class of conjunctions (a special case of counting queries) with low error using algorithms with more efficient run-time than the one given in this paper. Gupta et al [GHRU11] give an algorithm which runs in time polynomial in the size of the database, and releases the class of conjunctions to O(1)O(1) average error while preserving differential privacy. Hardt, Rothblum, and Servedio [HRS11] give an algorithm which runs in time proportional dkd^{k} (for databases over a data universe X={0,1}d)X=\{0,1\}^{d}) and releases conjunctions of most kk variables with worst-case error guarantees. Their algorithm improves over the Laplace mechanism (which also requires run-time dkd^{k}) because it only requires that the database size be proportional to dkd^{\sqrt{k}} (The Laplace mechanism would require a database of size dkd^{k}). As a building block for this result, they also give a mechanism with run-time proportional to dkd^{\sqrt{k}} which gives average-case error guarantees. Kasiviswanathan et al. [KRSU10] extend the lower bounds [DN03] from arbitrary subset-sum queries to hold also for an algorithm that only releases conjunctions.

Xiao et al. [XWG10] gave an algorithm for releasing range queries, which extends the class of constant-dimensional interval queries which we consider in this paper.

There has also been progress in proving lower bounds. Dwork et al. [DNR+09] show that in general, the problem of releasing synthetic data giving non-trivial error for arbitrary classes of counting queries requires run-time that is linear in the size of the data universe and the size of the query class (modulo cryptographic assumptions). This in particular precludes improving the run-time of the general mechanism presented in this paper to be only polynomial in the size of the database. Ullman and Vadhan [UV11] extend this result to show that releasing synthetic data is hard even for the simple class of conjunctions of at most 2 variables. This striking result emphasizes that output representation is extremely important, because it is possible to release the answers to all of the (at most d2d^{2}) conjunctions of size 2 privately and efficiently using output representations other than synthetic data. Hardt and Talwar showed how to prove lower bounds for differentially query release using packing arguments, and gave an optimal lower bound for a certain range of parameters [HT10]. De recently refined this style of argument and extended it to wider settings [De11]. Gupta et al. [GHRU11] showed that the class of queries that can be released by mechanisms that access the database using only statistical queries (which includes almost all mechanisms known to date, with the exception of the parity learning algorithm of [KLN+08]) is equal to the class of queries that can be agnostically learned using statistical queries. This rules out a mechanism even for releasing conjunctions to subconstant error which accesses the data using only a polynomial number of statistical queries.

2 Motivation from Learning Theory

3 Organization

Definitions

We consider databases which are nn-tuples from some abstract domain XX: i.e. DXnD\in X^{n}. We will write n=Dn=|D| for the size of the database. We think of XX as the set of all possible data-records. For example, if data elements are represented as bit-strings of length dd, then X={0,1}dX=\{0,1\}^{d} would be the boolean hypercube in dd dimensions. These tuples are not endowed with an ordering: they are simply multi-sets (they can contain multiple copies of the same element xXx\in X).

A database access mechanism is a randomized mapping A:XRA:X^{*}\rightarrow R, where RR is some arbitrary range. We say that AA outputs synthetic data if its output is itself a database: i.e. if R=XR=X^{*}.

Our privacy solution concept will be the by now standard notion of differential privacy. Crucial to this definition will be the notion of neighboring databases. We say that two databases D,DXnD,D^{\prime}\in X^{n} are neighboring if they differ in only a single data element: i.e. they are neighbors if their symmetric difference DΔD1|D\Delta D^{\prime}|\leq 1.

A database access mechanism A:XnRA:X^{n}\rightarrow R is ϵ\epsilon-differentially private if for all neighboring pairs of databases D,DXnD,D^{\prime}\in X^{n} and for all outcome events SRS\subseteq R, the following holds:

In Section 7, we propose an alternate definition of privacy, distributional privacy, and show that it is strictly stronger than differential privacy. For simplicity, however, in the main body of the paper, we use the standard definition, differential privacy. All of these proofs can be adapted to the distributional privacy notion.

The global sensitivity of a query ff is its maximum difference when evaluated on two neighboring databases:

In this paper, we consider the private release of information useful for classes of counting queries.

A counting query QφQ_{\varphi}, defined in terms of a predicate φ:X{0,1}\varphi:X\rightarrow\{0,1\} is defined to be

It evaluates to the fraction of elements in the database that satisfy the predicate φ\varphi.

For any predicate φ:X{0,1}\varphi:X\rightarrow\{0,1\}, the corresponding counting query Qφ:XnQ_{\varphi}:X^{n}\rightarrow has global sensitivity GSQφ1/nGS_{Q_{\varphi}}\leq 1/n

Let D,DXnD,D^{\prime}\in X^{n} be neighboring databases. Then:

Where the inequality follows by the definition of neighboring databases. Similarly, Qφ(D)Qφ(D)1/nQ_{\varphi}(D)\geq Q_{\varphi}(D^{\prime})-1/n. The observation then follows. ∎

We remark that everything in this paper easily extends to the case of more general linear queries, which can are defined analogously to counting queries, but involve real valued predicates φ:X\varphi:X\rightarrow. For simplicity we restrict ourselves to counting queries in this paper, but see [Rot10] for the natural extension to linear queries.

A key measure of complexity that we will use for counting queries is VC-dimension. VC-dimension is strictly speaking a measure of complexity of classes of predicates, but we will associate the VC-dimension of classes of predicates with their corresponding class of counting queries.

A class of predicates PP shatters a collection of points SXS\subseteq X if for every TST\subseteq S, there exists an φP\varphi\in P such that {xS:φ(x)=1}=T\{x\in S:\varphi(x)=1\}=T. That is, PP shatters SS if for every one of the 2S2^{|S|} subsets TT of SS, there is some predicate in PP that labels exactly those elements as positive, and does not label any of the elements in STS\setminus T as positive.

We can now define our complexity measure for counting queries.

A collection of predicates PP has VC-dimension dd if there exists some set SXS\subseteq X of cardinality S=d|S|=d such that PP shatters SS, and PP does not shatter any set of cardinality d+1d+1. We can denote this quantity by VC-DIM(P)\textrm{VC-DIM}(P). We abuse notation and also write VC-DIM(C)\textrm{VC-DIM}(C) where CC is a class of counting queries, to denote the VC-dimension of the corresponding collection of predicates.

Dwork et al. [DMNS06] give a mechanism which can answer any single low-sensitivity query while preserving differential privacy:

The Laplace mechanism responds to a query QQ by returning Q(D)+ZQ(D)+Z where ZZ is a random variable drawn from the Laplace distribution: ZLap(GSQ/ϵ)Z\sim\textrm{Lap}(GS_{Q}/\epsilon).

The Laplace distribution with scale bb, which we denote by Lap(b)\textrm{Lap}(b), has probability density function

The Laplace mechanism preserves ϵ\epsilon-differential privacy.

This mechanism answers queries interactively, but for a fixed privacy level, its accuracy guarantees degrade linearly in the number of queries that it answers. The following composition theorem is useful: it tells us that a mechanism which runs kk ϵ\epsilon-differentially private subroutines is kϵk\epsilon-differentially private.

If mechanisms M1,,MkM_{1},\ldots,M_{k} are each ϵ\epsilon-differentially private, then the mechanism MM defined by the (string) composition of the kk mechanisms: M(D)=(M1(D),,Mk(D))M(D)=(M_{1}(D),\ldots,M_{k}(D)) is kϵk\epsilon-differentially private.

We propose to construct database access mechanisms which produce one-shot (non-interactive) outputs that can be released to the public, and so can necessarily be used to answer an arbitrarily large number of queries. We seek to do this while simultaneously preserving privacy. However, as implied by the lower bounds of Dinur and Nissim [DN03], we cannot hope to be able to usefully answer arbitrary queries. We instead seek to release synthetic databases which are “useful” (defined below) for restricted classes of queries CC.

A database access mechanism AA is (α,δ)(\alpha,\delta)-useful with respect to queries in class CC if for every database DXnD\in X^{n}, with probability at least 1δ1-\delta, the output of the mechanism D^=A(D)\widehat{D}=A(D) satisfies:

In this paper, we will derive (α,δ)(\alpha,\delta)-useful mechanisms from small α\alpha-nets:

An α\alpha-net of databases with respect to a class of queries CC is a set NXN\subset X^{*} such that for all DXD\in X^{*}, there exists an element of the α\alpha-net DND^{\prime}\in N such that:

We write Nα(C)N_{\alpha}(C) to denote an α\alpha-net of minimum cardinality among the set of all α\alpha-nets for CC.

General release mechanism

In this section we present our general release mechanism. It is an instantiation of the exponential mechanism of McSherry and Talwar [MT07].

The exponential mechanism ME(D,q,R)M_{E}(D,q,\mathcal{R}) selects and outputs an element rRr\in\mathcal{R} with probability proportional to exp(ϵq(D,r)2GSq)\exp(\frac{\epsilon q(D,r)}{2GS_{q}}).

McSherry and Talwar showed that the exponential mechanism preserves differential privacy. It is important to note that the exponential mechanism can define a complex distribution over a large arbitrary domain, and so it may not be possible to implement the exponential mechanism efficiently when the range of qq is super-polynomially large in the natural parameters of the problem. This will be the case with our instantiation of it.

The exponential mechanism preserves ϵ\epsilon-differential privacy.

We first observe that the Net mechanism preserves ϵ\epsilon-differential privacy.

The Net mechanism is ϵ\epsilon-differentially private.

The Net mechanism is simply an instantiation of the exponential mechanism. Therefore, privacy follows from Theorem 3.2. ∎

We may now analyze the usefulness of the net mechanism.

For any class of queries CC (not necessarily counting queries) the Net Mechanism is (2α,δ)(2\alpha,\delta)-useful for any α\alpha such that:

First observe that the sensitivity of the quality score GSqmaxQCGSQ=ΔGS_{q}\leq\max_{Q\in C}GS_{Q}=\Delta.

By the definition of an α\alpha-net, we know that there exists some DRD^{*}\in\mathcal{R} such that q(D,D)αq(D,D^{*})\geq-\alpha. By the definition of the exponential mechanism, this DD^{*} is output with probability proportional to at least exp(ϵα2GSq)\exp(\frac{-\epsilon\alpha}{2GS_{q}}). Similarly, there are at most Nα(C)|N_{\alpha}(C)| databases DRD^{\prime}\in\mathcal{R} such that q(D,D)2αq(D,D^{\prime})\leq-2\alpha (simply because R=Nα(C)\mathcal{R}=N_{\alpha}(C)). Hence, by a union bound, the probability that the exponential mechanism outputs some DD^{\prime} with q(D,D)2αq(D,D^{\prime})\leq-2\alpha is at most Nα(C)exp(2ϵα2GSq)|N_{\alpha}(C)|\exp(\frac{-2\epsilon\alpha}{2GS_{q}}). Therefore, if we denote by AA the event that the net mechanism outputs some DD^{*} with q(D,D)αq(D,D^{*})\geq-\alpha, and denote by BB the event that the net mechanism outputs some DD^{\prime} with q(D,D)2αq(D,D^{\prime})\leq-2\alpha, we have:

Note that if this ratio is at least 1/δ1/\delta, then we will have proven that the net mechanism is (2α,δ)(2\alpha,\delta) useful with respect to CC. Solving for α\alpha, we find that this is condition is satisfied so long as

We have therefore reduced the problem of giving upper bounds on the usefulness of differentially private database access mechanisms to the problem of upper bounding the sensitivity of the queries in question, and the size of the smallest α\alpha-net for the set of queries in question. Recall that for counting queries QQ on databases of size nn, we always have GSQ1/nGS_{Q}\leq 1/n. Therefore we have the immediate corollary:

For any class of counting queries CC the Net Mechanism is (2α,δ)(2\alpha,\delta)-useful for any α\alpha such that:

To complete the proof of utility for the net mechanism for counting queries, it remains to prove upper bounds on the size of minimal α\alpha-nets for counting queries. We begin with a bound for finite classes of queries.

For any finite class of counting queries CC:

In order to prove this theorem, we will show that for any collection of counting queries CC and for any database DD, there is a “small” database DD^{\prime} of size D=logCα2|D^{\prime}|=\frac{\log|C|}{\alpha^{2}} that approximately encodes the answers to every query in CC, up to error α\alpha. Crucially, this bound will be independent of D|D|.

For any DXD\in X^{*} and for any finite collection of counting queries CC, there exists a database DD^{\prime} of size

Let m=logCα2m=\frac{\log|C|}{\alpha^{2}} We will construct a database DD^{\prime} by taking mm uniformly random samples from the elements of DD. Specifically, for i{1,,m}i\in\{1,\ldots,m\} let XiX_{i} be a random variable taking value xjx_{j} with probability {xD:x=xj}/D|\{x\in D:x=x_{j}\}|/|D|, and let DD^{\prime} be the database containing elements X1,,XmX_{1},\ldots,X_{m}. Now fix any QφCQ_{\varphi}\in C and consider the quantity Qφ(D)Q_{\varphi}(D^{\prime}). We have: Qφ(D)=1mi=1mφ(Xi)Q_{\varphi}(D^{\prime})=\frac{1}{m}\sum_{i=1}^{m}\varphi(X_{i}). We note that each term of the sum φ(Xi)\varphi(X_{i}) is a bounded random variable taking values 0φ(Xi)10\leq\varphi(X_{i})\leq 1, and that the expectation of Qφ(D)Q_{\varphi}(D^{\prime}) is:

Therefore, we can apply a standard Chernoff bound which gives:

Taking a union bound over all of the counting queries QφCQ_{\varphi}\in C, we get:

Plugging in mm makes the right hand side smaller than 11 (so long as C>2|C|>2), proving that there exists a database of size mm satisfying the stated bound, which completes the proof of the lemma. ∎

Now we can complete the proof of Theorem 3.6.

By Lemma 3.7, we have that for any DXD\in X^{*} there exists a database DXD^{\prime}\in X^{*} with D=logCα2|D^{\prime}|=\frac{\log|C|}{\alpha^{2}} such that maxQφCQφ(D)Qφ(D)α\max_{Q_{\varphi}\in C}\left|Q_{\varphi}(D)-Q_{\varphi}(D^{\prime})\right|\leq\alpha. Therefore, if we take N={DX:D=logCα2}N=\{D^{\prime}\in X^{*}:|D^{\prime}|=\frac{\log|C|}{\alpha^{2}}\} to be the set of every database of size logCα2\frac{\log|C|}{\alpha^{2}}, we have an α\alpha-net for CC. Since

and by definition Nα(C)N\left|N_{\alpha}(C)\right|\leq\left|N\right|, we have proven the theorem. ∎

For infinite concept classes CC, we can replace lemma 3.7 with the following lemma:

For any DXD\in X^{*} and for any collection of counting queries CC, there exists a database DD^{\prime} of size

This lemma straightforwardly gives an analogue of Theorem 3.6:

Note that we always have VCDIM(C)logC\textrm{VCDIM}(C)\leq\log|C| for finite classes of counting queries, and so Theorem 3.9 is strictly stronger than Theorem 3.6.

Finally, we can instantiate Corollary 3.5 to give our main utility theorem for the Net Mechanism.

For any class of counting queries CC the Net Mechanism is (α,δ)(\alpha,\delta)-useful for any α\alpha such that:

Solving for α\alpha, the Net Mechanism is (α,δ)(\alpha,\delta)-useful for:

Theorem 3.10 shows that a database of size O~(logXVCDIM(C)α3ϵ)\widetilde{O}(\frac{\log X\textrm{VCDIM}(C)}{\alpha^{3}\epsilon}) is sufficient in order to output a set of points that is α\alpha-useful for a concept class CC, while simultaneously preserving ϵ\epsilon-differential privacy. If we were to view our database as having been drawn from some distribution D\mathcal{D}, this is only an extra O~(logXαϵ)\widetilde{O}(\frac{\log X}{\alpha\epsilon}) factor larger than what would be required to achieve α\alpha-usefulness with respect to D\mathcal{D}, even without any privacy guarantee!

The results in this section only apply for discretized database domains, and may not be computationally efficient. We explore these two issues further in the remaining sections of the paper.

We just gave an ϵ\epsilon-differentially private mechanism that is (α,δ)(\alpha,\delta)-useful with respect to any set of counting queries CC, when given a database of size nO~(logXVCDIM(C)α3ϵ)n\geq\widetilde{O}(\frac{\log X\textrm{VCDIM}(C)}{\alpha^{3}\epsilon}). In this section, we show that the dependence on the VC-dimension of the class CC is tight. Namely:

For any class of counting queries CC, for any 0<δ<10<\delta<1 bounded away from 11 by a constant, for any ϵ1\epsilon\leq 1, if MM is an ϵ\epsilon-differentially private mechanism that is (α,δ)(\alpha,\delta) useful for CC given databases of size nVCDIM(C)2n\leq\frac{\textrm{VCDIM}(C)}{2}, then αΩ(14+16ϵ)\alpha\geq\Omega\left(\frac{1}{4+16\epsilon}\right).

Fix a class of counting queries CC corresponding to a class of predicates PP of VC-dimension dd. Let SXS\subset X denote a set of universe elements of size S=d|S|=d that are shattered by PP, as guaranteed by the definition of VC-dimension. We will consider all subsets TST\subset S of size T=d/2|T|=d/2. Denote this set by DS={TS:T=d/2}\mathcal{D}_{S}=\{T\subset S:|T|=d/2\} For each such TDST\in\mathcal{D}_{S}, let φT\varphi_{T} be the predicate such that:

as guaranteed by the definition of shattering, and let QT=QφTQ_{T}=Q_{\varphi_{T}} be the corresponding counting query.

For every pair T,TDST,T^{\prime}\in\mathcal{D}_{S}:

where the last equality follows from the fact that T=T|T|=|T^{\prime}|. ∎

This lemma will allow a sufficiently useful mechanism for the class of queries CC to be used to reconstruct a database selected from DS\mathcal{D}_{S} with high accuracy.

For any 0<δ<10<\delta<1 bounded away from 11 by a constant, let MM be an (α,δ)(\alpha,\delta)-useful mechanism for CC. Given as input M(T)M(T) where TT is any database TDST\in\mathcal{D}_{S}, there is a procedure which with constant probability 1δ1-\delta reconstructs a database TT^{\prime} with TΔT2dα|T^{\prime}\Delta T|\leq 2d\alpha.

Write D=M(T)D^{\prime}=M(T). With probability at least 1δ1-\delta, we have maxQCQ(T)Q(D)α\max_{Q\in C}|Q(T)-Q(D^{\prime})|\leq\alpha. For the rest of the argument, assume this event occurs. For each TDST^{\prime}\in\mathcal{D}_{S}, define:

Let T=argminTDSvTT^{\prime}=\mathop{\rm argmin}_{T^{\prime}\in\mathcal{D}_{S}}v_{T^{\prime}}. We have:

by the accuracy of the mechanism. By combining the accuracy of the mechanism with Lemma 3.12, We also have:

Combining these two inequalities yields a database TT^{\prime} such that: TΔT2dα|T\Delta T^{\prime}|\leq 2d\alpha as claimed. ∎

We can now complete the proof. Let TDST\in\mathcal{D}_{S} be a set selected uniformly at random, let xTx\in T be an element of TT selected uniformly at random, and let ySTy\in S\setminus T be an element of STS\setminus T selected uniformly at random. Note that the marginal distributions on xx and yy are identical: both are uniformly random elements from SS. Let T^=(T{x}){y}\hat{T}=(T\setminus\{x\})\cup\{y\} be the set obtained by swapping elements xx and yy. Note that T^\hat{T} is also distributed uniformly at random from DS\mathcal{D}_{S}. Let TT^{\prime} be the set reconstructed from D=M(T)D^{\prime}=M(T) as in lemma 3.13, and let T^\hat{T}^{\prime} be the set reconstructed from D=M(T^)D^{\prime}=M(\hat{T}). Assume that TΔT2dα|T\Delta T^{\prime}|\leq 2d\alpha and that T^ΔT^2dα|\hat{T}\Delta\hat{T}^{\prime}|\leq 2d\alpha, which occurs except with probability at most 2δ2\delta. We now have:

Now recall that TΔT^2|T\Delta\hat{T}|\leq 2 and so by the fact that MM is ϵ\epsilon-differentially private, we also know:

Because we also have exp(2ϵ)1+8ϵ\exp(2\epsilon)\leq 1+8\epsilon, we can solve to find α14+16ϵ\alpha\geq\frac{1}{4+16\epsilon}. ∎

Interval queries

In this section we give an efficient algorithm for privately releasing a database useful for the class of interval queries over a discretized domain, given a database of size only polynomial in our privacy and usefulness parameters. We note that our algorithm is easily extended to the class of axis-aligned rectangles in dd dimensional space for dd a constant; we present the case of d=1d=1 for clarity.

Consider a database DD of nn points in a discrete interval {1,,2d}\{1,\ldots,2^{d}\} (in Corollary 5.2 we show some discretization is necessary). Given a1a2a_{1}\leq a_{2}, both in {1,2,,2d}\{1,2,\ldots,2^{d}\}, let Ia1,a2I_{a_{1},a_{2}} be the indicator function corresponding to the interval [a1,a2][a_{1},a_{2}]. That is:

An interval query Q[a1,a2]Q_{[a_{1},a_{2}]} is defined to be

Note that GSQ[a1,a2]=1/nGS_{Q_{[a_{1},a_{2}]}}=1/n, and we may answer interval queries while preserving ϵ\epsilon-differential privacy by adding noise proportional to Lap(1/(ϵn))\textrm{Lap}(1/(\epsilon n)).

We now give the algorithm. We will assume for simplicity that all points x,xDx,x^{\prime}\in D are distinct, but this condition can be easily discarded.

The algorithm 2 is very simple. It repeatedly performs a binary search to partition the unit interval into regions that have approximately an α\alpha^{\prime} fraction of the point mass in them. It then releases a database that has exactly an α\alpha^{\prime}-fraction of the point mass in each of the intervals that it has discovered. There are at most 1/α\approx 1/\alpha^{\prime} such intervals, and each binary search terminates after at most dd rounds (because the interval consists of at most 2d2^{d} points). Therefore, the algorithm requires only d/α\approx d/\alpha^{\prime} accesses to the database, and each one is performed in a privacy preserving manner using noise from the Laplace mechanism. The privacy of the mechanism then follows immediately:

ReleaseIntervals is ϵ\epsilon-differentially private.

The algorithm runs a binary search at most 4/3α\lceil 4/3\alpha^{\prime}\rceil times. Each time, the search halts after dd queries to the database using the Laplace mechanism. Each query is 1/ϵ1/\epsilon^{\prime}-differentially private (the sensitivity of an interval query is 1/n1/n since it is a counting query). Privacy then follows from the definition of ϵ\epsilon and the fact that the composition of kk differentially private mechanisms is kϵk\epsilon differentially private. ∎

ReleaseIntervals is (α,δ)(\alpha,\delta)-useful for databases of size:

By a union bound and the definition of the Laplace distribution, if the database size nn satisfies the hypothesis of the theorem, then except with probability at most δ\delta, none of the (4/3)d/α(4/3)d/\alpha^{\prime} draws from the Laplace distribution have magnitude greater than α2\alpha^{\prime 2}. That is, at every step, we have v^Q[a,b](D)α2|\hat{v}-Q_{[a,b]}(D)|\leq\alpha^{\prime 2} except with probability β\beta. Conditioned on this event occurring, for each interval [[Bounds[j1][j-1],Bounds[j]][j]] for j[i]j\in[i], fBounds[j1],Bounds[j](D)[αα2,α+α2]f_{\textrm{Bounds}[j-1],\textrm{Bounds}[j]}(D)\in[\alpha^{\prime}-\alpha^{\prime 2},\alpha^{\prime}+\alpha^{\prime 2}]. In the synthetic database DD^{\prime} released, each such interval contains exactly an α\alpha^{\prime} fraction of the database elements. We can now analyze the error incurred on any query when evaluated on the synthetic database instead of on the real database. Any interval [[Bounds[j1][j-1],Bounds[j]][a,b][j]]\subset[a,b] will contribute error at most α\alpha^{\prime} to the total, and any interval [[Bounds[j1][j-1],Bounds[j]]⊄[a,b][j]]\not\subset[a,b] that also intersects with [a,b][a,b] contributes error at most (α+α2)(\alpha^{\prime}+\alpha^{\prime 2}) to the total. Note that there are at most 2 intervals of this second type. Therefore, on any query Q[a,b]Q_{[a,b]} we have:

We note that although the class of intervals (and more generally, low dimensional axis-aligned rectangles) is a simple class of functions, it nevertheless contains exponentially many queries, and so it is not feasible to simply ask all possible interval queries using an interactive mechanism.

Lower bounds

Could we possibly modify the results of Sections 4 and 3 to hold for non-discretized databases? Suppose we could usefully answer an arbitrary number of queries in some simple concept class CC representing interval queries on the real line (for example, “How many points are contained within the following interval?”) while still preserving privacy. Then, for any database containing single-dimensional real valued points, we would be able to answer median queries with values that fall between the 50δ,50+δ50-\delta,50+\delta percentile of database points by performing a binary search on DD using AA (where δ=δ(α)\delta=\delta(\alpha) is some small constant depending on the usefulness parameter α\alpha). However, answering such queries is impossible while guaranteeing differential privacy. Unfortunately, this would seem to rule out usefully answering queries in simple concept classes such as halfspaces and axis-aligned rectangles, that are generalizations of intervals.

We say that a mechanism answers a median query MM usefully if it outputs a real value rr such that rr falls within the 50δ,50+δ50-\delta,50+\delta percentile of points in database DD for some δ<50\delta<50.

No mechanism AA can answer median queries MM with outputs that fall between the 50δ,50+δ50-\delta,50+\delta percentile with positive probability on any real valued database DD, while still preserving ϵ\epsilon-differential privacy, for δ<50\delta<50 and any ϵ\epsilon.

Consider real valued databases containing elements in the interval $.Let. LetD_{0}=(0,\ldots,0)bethedatabasecontainingbe the database containingnpointswithvalue0.Supposepoints with value 0. SupposeAcananswermedianqueriesusefully.Thenwemusthavecan answer median queries usefully. Then we must have\Pr[A(D_{0},M)=0]>0sinceeverypointinsince every point inD_{0}is.Sinceis . Sinceisacontinuousinterval,theremustbesomevalueis a continuous interval, there must be some valuev\insuchthatsuch that\Pr[A(D_{0},M)=v]=0.Let. LetD_{n}=(v,\ldots,v)bethedatabasecontainingbe the database containingnpointswithvaluepoints with valuev.Wemusthave. We must have\Pr[A(D_{n},M)=v]>0.For. For1,let, letD_{i}=(\underbrace{0,\ldots,0}_{n-i},\underbrace{v,\ldots,v}_{i}).Thenwemusthaveforsome. Then we must have for somei,,\Pr[A(D_{i},M)=v]=0butbut\Pr[A(D_{i+1},M)=v]>0.Butsince. But sinceD_{i}andandD_{i+1}$ differ only in a single element, this violates differential privacy. ∎

No mechanism can be (α,δ)(\alpha,\delta)-useful for the class of interval queries, nor for any class CC that generalizes interval queries to higher dimensions (for example, halfspaces, axis-aligned rectangles, or spheres), while preserving ϵ\epsilon-differential privacy, for any α,δ<1/2\alpha,\delta<1/2 and any ϵ0\epsilon\geq 0.

Consider any real valued database containing elements in the interval $.If. IfAisis(\alpha,\delta)usefulforintervalqueriesandpreservesdifferentialprivacy,thenwecanconstructamechanism-useful for interval queries and preserves differential privacy, then we can construct a mechanismA^{\prime}thatcananswermedianqueriesusefullywhilepreservingdifferentialprivacy.ByTheorem5.1,thisisimpossible.that can answer median queries usefully while preserving differential privacy. By Theorem 5.1, this is impossible.A^{\prime}simplycomputessimply computes\widehat{D}=A(D),andperformsbinarysearchon, and performs binary search on\widehat{D}tofindsomeintervalto find some interval[0,a]thatcontainsthat containsn/2\pm\alpha npoints.Privacyispreservedsinceweonlyaccesspoints. Privacy is preserved since we only accessDthroughthroughA,whichbyassumptionpreserves, which by assumption preserves\epsilondifferentialprivacy.Withpositiveprobability,allintervalquerieson-differential privacy. With positive probability, all interval queries on\widehat{D}arecorrecttowithinare correct to within\pm\alpha,andsothebinarysearchcanproceed.Since, and so the binary search can proceed. Since\alpha<1/2$, the result follows. ∎

We may get around the impossibility result of Corollary 5.2 by relaxing our definitions. One approach is to discretize the database domain, as we do in Sections 3 and 4. Another approach, which we take in Section 6, is to relax our definition of usefulness.

Answering Halfspace Queries

With respect to a database, a halfspace can have a certain margin γ\gamma:

Before we present the algorithm, we will introduce a useful fact about random projections, called the Johnson-Lindenstrauss lemma. It states, roughly, that the norm of a vector is accurately preserved with high probability when the vector is projected into a lower dimensional space with a random linear projection.

For our purposes, the relevant fact will be that norm preserving projections also preserve pairwise inner products with high probability. The following corollary is well known.

Consider the two vectors u=x+yu=x+y and v=xyv=x-y. We apply Theorem 6.3 to uu and vv. By a union bound, except with probability 2τ2\tau we have: A(x+y)22x+y22ςx+y22|||A(x+y)||_{2}^{2}-||x+y||_{2}^{2}|\leq\varsigma||x+y||_{2}^{2} and A(xy)22xy22ςxy22|||A(x-y)||_{2}^{2}-||x-y||_{2}^{2}|\leq\varsigma||x-y||_{2}^{2}. Therefore:

An identical calculation shows that (Ax),(Ay)x,yς2(x22+y22)\langle(Ax),(Ay)\rangle\geq\langle x,y\rangle-\frac{\varsigma}{2}\left(||x||_{2}^{2}+||y||_{2}^{2}\right), which completes the proof. ∎

Instead of outputting synthetic data, our algorithm will output a datastructure based on a collection of random projections.

A TT dimensional projected halfspace data structure of size mm, DH={{Ai},U,{vi,j}}D_{H}=\{\{A_{i}\},U,\{v_{i,j}\}\} consists of three parts:

A projected halfspace data structure DHD_{H} can be used to evaluate a halfspace query fyf_{y} as follows. We write fy(DH)f_{y}(D_{H}) to denote this evaluation:

For each i[m]i\in[m] compute ui,j(i)=argminujUy^iuj2u_{i,j(i)}=\mathop{\rm argmin}_{u_{j}\in U}||\hat{y}_{i}-u_{j}||_{2}

Output fy(DH)=1mi=1mvi,j(i)f_{y}(D_{H})=\frac{1}{m}\sum_{i=1}^{m}v_{i,j(i)}

First we bound the size of the needed net UU for halfspaces.

ReleaseHalfspaces preserves ϵ\epsilon-differential privacy.

Privacy follows from the fact that the composition of kk ϵ\epsilon-differentially private mechanisms is kϵk\epsilon-differentially private The algorithm makes mUm|U| calls to the Laplace mechanism, and each call preserves ϵ/(mU)\epsilon/(m|U|)-differential privacy (since each query has sensitivity 1/n1/n). ∎

By linearity of expectation, the expected number of points in DD moved by more than γ/4\gamma/4 with respect to yy^{\prime} in any projection is at most αn/4\alpha n/4:

Let ui,y=argminuUuAiy2u_{i,y^{\prime}}=\mathop{\rm argmin}_{u\in U}||u-A_{i}y^{\prime}||_{2}. Because ui,yAiy2γ/4||u_{i,y^{\prime}}-A_{i}y^{\prime}||_{2}\leq\gamma/4, we have for any xD^ix\in\hat{D}_{i}:

In other words, fy(D)α/4E[fui,y(D^i)]fy(D)+α/4f_{y^{\prime}}(D)-\alpha/4\leq E[f_{u_{i,y^{\prime}}}(\hat{D}_{i})]\leq f_{y^{\prime}}(D)+\alpha/4. Moreover, for each ii, fui,y(D^i)f_{u_{i,y^{\prime}}}(\hat{D}_{i}) is an independent random variable taking values in the bounded range $,andsowewillbeabletoapplyaChernoffbound.Foreach, and so we will be able to apply a Chernoff bound. For eachy^{\prime}$:

Taking a union bound over all (4d/γ)d(4\sqrt{d}/\gamma)^{d} vectors yUdy^{\prime}\in U^{d} and plugging in our chosen value of mm, and recalling our bound on E[fuy(D^)]E[f_{u_{y^{\prime}}}(\hat{D})] we find that:

Also note that the algorithm makes mUm|U| draws from the distribution Lap(mUϵn)\textrm{Lap}\left(\frac{m|U|}{\epsilon n}\right) during its run, assigning these draws to values pi,jp_{i,j}. Except with probability at most β/3\beta/3, we have for all i,ji,j:

Therefore, conditioning on pi,j1|p_{i,j}|\leq 1 for all i,ji,j and applying another Chernoff bound, we find that for any sequence of indices j(i)j(i):

Again taking a union bound and plugging in our value of mm, we find that:

Finally, conditioning on these three events (which together occur except with probability β\beta), we have for any yy^{\prime}:

Distributional Privacy

In this section we give an alternative privacy definition, motivated by learning theory, and show that it is a strengthening of differential privacy.

For any subset of the universe SXS\subseteq X of size Sn|S|\geq n, we say that two databases are SS-neighbors if they were both drawn at random without replacement from SS.

We say that a mechanism A:XnRA:X^{n}\rightarrow R satisfies (ϵ,β)(\epsilon,\beta)-distributional privacy if for any SXS\subseteq X, and for any pair of databases D1,D2XD_{1},D_{2}\subseteq X of size nn which are SS-neighbors, with probability 1β1-\beta over the draw of the databases we have for all events ERE\subseteq R:

For example, suppose that a collection of hospitals in a region each treats a random sample of patients with disease XX. A hospital can release information with a guarantee of distributional privacy, which is informative about patients in the region, without revealing which hospital the data came from. Actually, our main motivation is that this definition is particularly natural from the perspective of learning theory: given a sample of points drawn from some distribution D{\cal D}, one would like to reveal no more information about the sample than is inherent in D{\cal D} itself.

We will typically think of β\beta as being exponentially small, whereas ϵ\epsilon must be Ω(1/n)\Omega(1/n) for AA to be useful.

It is not a priori clear whether either differential privacy or distributional privacy is a stronger notion than the other, or if the two are equivalent, or distinct. On the one hand, differential privacy only provides a guarantee when D1D_{1} and D2D_{2} differ in a single element,We get tϵt\epsilon-differential privacy for D1D_{1} and D2D_{2} that differ in tt elements. whereas distributional privacy can provide a guarantee for two databases D1D_{1} and D2D_{2} that differ in all of their elements. On the other hand, distributional privacy makes the strong assumption that the elements in D1D_{1} and D2D_{2} are drawn from some distribution D\mathcal{D}, and allows for privacy violations with some exponentially small probability β\beta (necessarily: with some small probability, two databases drawn from the same distribution might nevertheless possess very different statistical properties). However, as we show, distributional privacy is a strictly stronger guarantee than differential privacy.

If AA satisfies (ϵ,β)(\epsilon,\beta)-distributional privacy for any β=o(1/n2)\beta=o(1/n^{2}), then AA satisfies ϵ\epsilon-differential privacy.

Consider any database D1D_{1} drawn from domain XX, and any neighboring database D2D_{2} that differs from D1D_{1} in only a single element xXx\in X. Let S=D1{x}S=D_{1}\cup\{x\} be a set of size S=n+1|S|=n+1. Consider two SS-neighbors D1D_{1}^{\prime} and D2D_{2}’, then with probability 2/n22/n^{2} we have {D1,D2}={D1,D2}\{D_{1}^{\prime},D_{2}^{\prime}\}=\{D_{1},D_{2}\}, and so if β=o(1/n2)\beta=o(1/n^{2}), we have with certainty that for all events EE:

Since this holds for all pairs of neighboring databases, AA satisfies ϵ\epsilon-differential privacy. ∎

Define the mirrored mod mm function as follows:

For a database D{0,1}nD\subset\{0,1\}^{n}, define the query

Note that the global sensitivity of any query QmQ_{m} satisfies GSQm1GS_{Q_{m}}\leq 1. Therefore, the mechanism AA that answers queries QnQ_{n} by A(D,Qm)=Qm(D)+ZA(D,Q_{m})=Q_{m}(D)+Z where ZZ is drawn from Lap(1/ϵ)\textrm{Lap}(1/\epsilon) satisfies ϵ\epsilon-differential privacy.

There exist mechanisms AA that satisfy ϵ\epsilon-differential privacy, but do not satisfy (ϵ,β)(\epsilon,\beta)-distributional privacy for any ϵ<1,\epsilon<1, β=o(1)\beta=o(1) (that is, for any meaningful values of ϵ,β\epsilon,\beta).

Consider databases with elements drawn from X={0,1}nX=\{0,1\}^{n} and the query Q2/ϵQ_{2/\epsilon}. As observed above, a mechanism AA such that A(D,Qi)=Qi(D)+ZA(D,Q_{i})=Q_{i}(D)+Z for ZLap(1/ϵ)Z\sim\textrm{Lap}(1/\epsilon) satisfies ϵ\epsilon-differential privacy for any ii. Note however that with constant probability, two databases D1,D2D_{1},D_{2} drawn from S=XS=X have Q2/ϵ(D1)Q2/ϵ(D2)1/ϵ|Q_{2/\epsilon}(D_{1})-Q_{2/\epsilon}(D_{2})|\geq 1/\epsilon. Therefore, for any output xx, we have that with constant probability over draws of two SS neighbors D1D_{1} and D2D_{2}:

Therefore the mechanism does not satisfy (1,o(1))(1,o(1))-distributional privacy. ∎

Although there are simpler functions for which preserving distributional privacy requires more added noise than preserving differential privacy, the mirrored-mod function above is an example of a function for which it is possible to preserve differential privacy usefully, but yet impossible to reveal any useful information while preserving distributional privacy.

We note that in order for distributional privacy to imply differential privacy, it is important that in the definition of distributional privacy, database elements are drawn from some distribution D\mathcal{D} without replacement. Otherwise, for any non-trivial distribution, there is some database DD_{*} that is drawn with probability at most 1/2n1/2^{n}, and we may modify any distributional-privacy preserving mechanism AA such that for every query QQ, A(D,Q)=DA(D_{*},Q)=D_{*}, and for any DiDD_{i}\neq D_{*}, A(Di,Q)A(D_{i},Q) behaves as before. Since this new behavior occurs with probability β\leq\beta over draws from DD for β=O(1/2n)\beta=O(1/2^{n}), AA still preserves distributional privacy, but no longer preserves differential privacy (which requires that the privacy guarantee hold for every pair of neighboring databases).

Conclusions and Open Problems

In this paper we have shown a very general information theoretic result: that small nets are sufficient to certify the existence of accurate, differentially private mechanisms for a class of queries. For counting queries, this allows algorithms which can accurately answer queries from a class CC given a database that is only logarithmic in the size of CC, or linear its VC-dimension. We then also gave an efficient algorithm for releasing the class of interval queries on a discrete interval, and for releasing large-margin halfspace queries in the unit sphere.

The main question left open by our work is the design of algorithms which achieve utility guarantees comparable to our net mechanism, but have running time only polynomial in nn, the size of the input database. This question is extremely interesting even for very specific classes of queries. Is there such a mechanism for the class of conjunctions? For the class of parity queries?

Acknowledgments

We thank David Abraham, Cynthia Dwork, Shiva Kasiviswanathan, Adam Meyerson, Ryan O’Donnell, Sofya Raskhodnikova, Amit Sahai, and Adam Smith for many useful discussions.

References