Between Pure and Approximate Differential Privacy
Thomas Steinke, Jonathan Ullman
Introduction
The goal of privacy-preserving data analysis is to enable rich statistical analysis of a database while protecting the privacy of individuals whose data is in the database. A formal privacy guarantee is given by -differential privacy [DMNS06, DKM+06], which ensures that no individual’s data has a significant influence on the information released about the database. The two parameters and control the level of privacy. Very roughly, is an upper bound on the amount of influence an individual’s record has on the information released and is the probability that this bound fails to holdThis intuition is actually somewhat imprecise, although it is suitable for this informal discussion. See [KS08] for a more precise semantic interpretation of -differential privacy., so the definition becomes more stringent as .
A natural way to measure the tradeoff between privacy and utility is sample complexity—the minimum number of records that is sufficient in order to publicly release a given set of statistics about the database, while achieving both differential privacy and statistical accuracy. Intuitively, it’s easier to achieve these two goals when is large, as each individual’s data will have only a small influence on the aggregate statistics of interest. Conversely, the sample complexity should increase as and decrease (which strengthens the privacy guarantee).
The strongest version of differential privacy, in which , is known as pure differential privacy. The sample complexity of achieving pure differential privacy is well known for many settings (e.g. [HT10]). The more general case where is known as approximate differential privacy, and is less well understood. Recently, Bun, Ullman, and Vadhan [BUV14] showed how to prove strong lower bounds for approximate differential privacy that are essentially optimal for , which is essentially the weakest privacy guarantee that is still meaningful.When there are algorithms that are intuitively not private, yet satisfy -differential privacy.
Since bounds the probability of a complete privacy breach, we would like to be very small. Thus we would like to quantify the cost (in terms of sample complexity) as . In this work we give lower bounds for approximately differentially private algorithms that are nearly optimal for every choice of , and smoothly interpolate between pure and approximate differential privacy.
Specifically, we consider algorithms that compute the one-way marginals of the database—an extremely simple and fundamental family of queries. For a database , the one-way marginals are simply the mean of the bits in each of the columns. Formally, we define
where is the -th row of . A mechanism is said to be accurate if, on input , its output is “close to” . Accuracy may be measured in a worst-case sense—i.e. , meaning every one-way marginal is answered with accuracy —or in an average-case sense—i.e. , meaning the marginals are answered with average accuracy .
Some of the earliest results in differential privacy [DN03, DN04, BDMN05, DMNS06] give a simple -differentially private algorithm—the Laplace mechanism—that computes the one-way marginals of with average error as long as
More generally, this is the first result showing that the sample complexity must grow by a multiplicative factor of for answering any family of queries, as opposed to an additive dependence on . We also remark that the assumption on the range of is necessary, as the Laplace mechanism gives accuracy and satisfies -differential privacy when .
Our lower bound holds for mechanisms with an average-case () error guarantee. Thus, it also holds for algorithms that achieve worst-case () error guarantees. The Laplace mechanism gives a matching upper bound for average-case error. In many cases worst-case error guarantees are preferrable. For worst-case error, the sample complexity of the Laplace mechanism degrades by an additional factor compared to (1).
Surprisingly, this degradation is not necessary. We present algorithms that answer every one-way marginal with accuracy and improve on the sample complexity of the Laplace mechanism by roughly a factor. These algorithms demonstrate that the widely used technique of adding independent noise to each query is suboptimal when the goal is to achieve worst-case error guarantees.
Our algorithm for pure differential privacy satisfies the following.
For every , , and , there exists an efficient mechanism that is -differentially private and
And our algorithm for approximate differential privacy is as follows.
For every , , and
there exists an efficient mechanism that is -differentially private and
These algorithms improve over the sample complexity of the best known mechanisms for each privacy and accuracy guarantee by a factor of . Namely, the Laplace mechanism requires samples for pure differential privacy and the Gaussian mechanism requires samples for approximate differential privacy.
2 Techniques
Our proof uses a new, more general reduction from breaking fingerprinting codes to differentially private data release. Specifically, our reduction uses group differential privacy. This property states that if an algorithm is -differentially private with respect to the change of one individual’s data, then for any , it is roughly -differentially private with respect to the change of individuals’ data. Thus an -differentially private algorithm provides a meaningful privacy guarantee for groups of size .
To use this in our reduction, we start with a mechanism that takes a database of rows and is -differentially private. We design a mechanism that takes a database of rows, copies each of its rows times, and uses the result as input to . The resulting mechanism is roughly -differentially private. For our choice of , these parameters will be small enough to apply the attack of [BUV14] to obtain a lower bound on the number of samples used by , which is . Thus, for larger values of (equivalently, smaller values of ), we obtain a stronger lower bound. The remainder of the proof is to quantify the parameters precisely.
Upper Bounds:
Preliminaries
We define a database to be a matrix of rows, where each row corresponds to an individual, and each row has dimension (consists of binary attributes). We say that two databases are adjacent if they differ only by a single row, and we denote this by . In particular, we can replace the th row of a database with some fixed element of to obtain another database .
Let be a randomized mechanism. We say that is -differentially private if for every two adjacent databases and every subset ,
A well known fact about differential privacy is that it generalizes smoothly to databases that differ on more than a single row. We say that two databases are -adjacent if they differ by at most rows, and we denote this by .
For every , if is -differentially private, then for every two -adjacent databases , and every subset ,
All of the upper and lower bounds for one-way marginals have a multiplicative dependence on the accuracy and the privacy loss . This is no coincidence - there is a generic reduction:
Let and .
Suppose there exists a -differentially private mechanism such that for every database ,
Then there exists a -differentially private mechanism for such that for every database ,
Lower Bounds for Approximate Differential Privacy
Our main theorem can be stated as follows.
Let be a -differentially private mechanism that answers one-way marginals such that
where is the true answer vector. If and is sufficiently large, then
Theorem 1.1 in the introduction follows by rearranging terms, and applying Fact 2.3. The statement above is more convenient technically, but the statement in the introduction is more consistent with the literature.
First we must introduce fingerprinting codes. The following definition is tailored to the application to privacy. Fingerprinting codes were originally defined by Boneh and Shaw [BS98] with a worst-case accuracy guarantee. Subsequent works [BUV14, SU14] have altered the accuracy guarantee to an average-case one, which we use here.
A -complete -sound -robust fingerprinting code for users with length is a pair of random variables and such that the following hold.
Completeness: For any fixed ,
Soundness: For any and fixed ,
where denotes with the row replaced by some fixed element of .
Fingerprinting codes with optimal length were first constructed by Tardos [Tar08] (for worst-case error) and subsequent works [BUV14, SU14] have adapted Tardos’ construction to work for average-case error guarantees, which yields the following theorem.
For every , , and , there exists a -complete -sound -robust fingerprinting code for users with length .
We now show how the existence of fingerprinting codes implies our lower bound.
Let be a -differentially private mechanism such that
Let be a parameter to be chosen later. Let . Let be the following mechanism. On input , creates by taking copies of and filling the remaining entries with 1s. Then runs on and outputs .
By group privacy (Fact 2.2), is a -differentially private mechanism. By the triangle inequality,
Thus . Assume . Thus and, by (2) and (3),
Assume , were is as in Theorem 3.3. We will show by contradiction that this cannot be – that is . Let and be a -complete -sound -robust fingerprinting code for users of length .
By the completeness of the fingerprinting code,
In particular, there exists such that
We have that is a -differentially private function of , as it is only postprocessing . Thus
where the second inequality follows from the soundness of the fingerprinting code.
If , then (8) gives a contradiction. Let . Assuming ensures , as required. Assuming implies . This setting of gives a contradiction, which implies that
Adding independent noise seems very natural for one-way marginals, but it is suboptimal if one is interested in worst-case (i.e. ) error bounds, rather than average-case (i.e. ) error bounds.
Theorem 1.2 follows from Theorem 4.1. In particular, the mechanism in Theorem 1.2 is given by , where and is the distribution from Theorem 4.1 with .Note that we must truncate the output of to ensure that is always in .
Efficiency: can be efficiently sampled.
The distribution is simply an instantiation of the exponential mechanism [MT07]. In particular, the probability density function is given by
Firstly, this is clearly a well-defined distribution as long as .
Define a distribution on to by meaning for . To prove accuracy, we must give a tail bound on . The probability density function of is given by
which is obtained by integrating the probability density function of over the infinity-ball of radius , which has surface area . Thus is precisely the gamma distribution with shape and mean . The moment generating function is therefore
for all . By Markov’s inequality
Setting gives the required bound.
It is easy to verify that can be sampled by first sampling a radius from a gamma distribution with shape and mean and then sampling uniformly at random. To sample we can set , where each is uniform and independent. This gives an algorithm (in the form of an explicit circuit) to sample that uses only real arithmetic operations, logarithms, and independent uniform samples from $$.
2 Approximate Differential Privacy
Our algorithm for approximate differential privacy makes use of a powerful tool from the literature [DNR+09, HR10, DNPR10, RR10] called the sparse vector algorithm:
For every , , and
there exists a mechanism with the following properties.
takes as input a database and provides answers to (adaptive) linear queries .
is -differentially private.
A proof of this theorem can be found in [DR13, Theorem 3.28].Note that the algorithms in the literature are designed to sometimes output as an answer or halt prematurely. To modify these algorithms into the form given by Theorem 4.2 simply output in these cases. We now describe our approximately differentially private mechanism.
Now we must prove accuracy. Suppose that for all . Then
as required. So we need only show that for all , which sparse vector guarantees will happen with probability at least as long as
Now we verify that (9) holds with high probability.
By our setting of parameters, we have . This means
Let be the indicator of the event . Since the s are independent, so are the s. Thus we can apply a Chernoff bound:
The failure probability of is bounded by the failure probability of plus (10), which is dominated by .
Finally we consider the sample complexity. The accuracy is bounded by
for sparse vector to work, which is also satisfied. ∎
We remark that we have not attempted to optimize the constant factors in this analysis.
References
Appendix A Alternative Lower Bound for Pure Differential Privacy
It is known [HT10] that any -differentially private mechanism that answers one-way marginals requires samples. Our techniques yield an alternative simple proof of this fact.
Let be a -differentially private mechanism. Suppose
The proof uses a special case of Hoeffding’s Inequality:
Let be independent and uniform. Let be copies of and, likewise, let be copies of . Let and .
Now we give conflicting tail bounds for and , which we can relate by privacy.
By our hypothesis and Markov’s inequality,
Since is independent from , we have
Now and are databases that differ in rows, so privacy implies that
Rearranging , gives