Differentially Private Release and Learning of Threshold Functions
Mark Bun, Kobbi Nissim, Uri Stemmer, Salil Vadhan
Introduction
The line of work on differential privacy [DMNS06] is aimed at enabling useful statistical analyses on privacy-sensitive data while providing strong privacy protections for individual-level information. Privacy is achieved in differentially private algorithms through randomization and the introduction of “noise” to obscure the effect of each individual, and thus differentially private algorithms can be less accurate than their non-private analogues. Nevertheless, by now a rich literature has shown that many data analysis tasks of interest are compatible with differential privacy, and generally the loss in accuracy vanishes as the number of individuals tends to infinity. However, in many cases, there is still is a price of privacy hidden in these asymptotics — in the rate at which the loss in accuracy vanishes, and in how large needs to be to start getting accurate results at all (the “sample complexity”).
We recall the definition of differential privacy. We think of a dataset as consisting of rows from a data universe , where each row corresponds to one individual. Differential privacy requires that no individual’s data has a significant effect on the distribution of what we output.
A randomized algorithm is differentially private if for every two datasets that differ on one row, and every set , we have
The original definition from [DMNS06] had , and is sometimes referred to as pure differential privacy. However, a number of subsequent works have shown that allowing a small (but negligible) value of , referred to as approximate differential privacy, can provide substantial gains over pure differential privacy [DL09, HT10, DRV10, De12, BNS13b].
The common setting of parameters is to take to be a small constant and to be negligible in (or a given security parameter). To simplify the exposition, we fix and throughout the introduction (but precise dependencies on these parameters are given in the main body).
2 Private Query Release
A special case of interest is the case where consists of counting queries. In this case, we are given a set of predicates on individual rows, and then extend them to databases by averaging. That is, counts the fraction of the population that satisfies predicate .
The sample complexity of releasing threshold functions over a data universe with differential privacy is at least . In particular, there is no differentially private algorithm for releasing threshold functions over an infinite data universe.
In addition, inspired by the ideas in our lower bound, we present a simplification of the algorithm of [BNS13b] and improve the sample complexity to (from roughly ). Closing the gap between the lower bound of and the upper bound of remains an intriguing open problem.
We remark that in the case of pure differential privacy (), a sample complexity lower bound of follows from a standard “packing argument” [HT10, BKN10]. For point functions, this is matched by the standard Laplace mechanism [DMNS06]. For threshold functions, a matching upper bound was recently obtained [RR14], building on a construction of [DNPR10]. We note that these algorithms have a slightly better dependence on the accuracy parameter than our algorithm (linear rather than nearly linear in ). In general, while packing arguments often yield tight lower bounds for pure differential privacy, they fail badly for approximate differential privacy, for which much less is known.
3 Private Distribution Learning
The well-known Dvoretzky-Kiefer-Wolfowitz inequality [DKW56] implies that without privacy, any distribution over can be learned to within constant error with samples. On the other hand, we show that with approximate differential privacy, the task of releasing thresholds is essentially equivalent to distribution learning. As a consequence, with approximate differential privacy, distribution learning instead requires sample complexity that grows with the size of the domain.
The sample complexity of learning arbitrary distributions on a domain with differential privacy is at least .
4 Private PAC Learning
Kasiviswanathan et al. [KLN+11] defined private PAC learning as a combination of probably approximately correct (PAC) learning [Val84] and differential privacy. Recall that a PAC learning algorithm takes some labeled examples where the ’s are i.i.d. samples of an arbitrary and unknown distribution on a data universe and is an unknown concept from some concept class . The goal of the learning algorithm is to output a hypothesis that approximates well on the unknown distribution. We are interested in PAC learning algorithms that are also differentially private. Here is the hypothesis class; if , then is called a proper learner.
As with query release and distribution learning, a natural problem is to characterize the sample complexity — the minimum number of samples in order to achieve differentially private PAC learning for a given concept class . Without privacy, it is well-known that the sample complexity of (proper) PAC learning is proportional to the Vapnik–Chervonenkis (VC) dimension of the class [VC71, BEHW89, EHKV89]. In the initial work on differentially private learning, Kasiviswanathan et al. [KLN+11] showed that labeled examples suffice for privately learning any concept class .As with the query release discussion, we omit the dependency on all parameters except for , and . The VC dimension of a concept class is always at most , but is significantly lower for many interesting classes. Hence, the results of [KLN+11] left open the possibility that the sample complexity of private learning may be significantly higher than that of non-private learning.
In the case of pure differential privacy (), this gap in the sample complexity was shown to be unavoidable in general. Beimel, Kasiviswanathan, and Nissim [BKN10] considered the concept class of point functions over a data universe , which have VC dimension 1 and hence can be (properly) learned without privacy with samples. In contrast, they showed that proper PAC learning with pure differential privacy requires sample complexity . Feldman and Xiao [FX14] showed a similar separation even for improper learning — the class of threshold functions over also has VC dimension 1, but PAC learning with pure differential privacy requires sample complexity .
For approximate differential privacy (), however, it was still open whether there is an asymptotic gap between the sample complexity of private learning and non-private learning. Indeed, Beimel et al. [BNS13b] showed that point functions can be properly learned with approximate differential privacy using samples (i.e. with no dependence on ). For threshold functions, they exhibited a proper learner with sample complexity , but it was conceivable that the sample complexity could also be reduced to .
We prove that the sample complexity of proper PAC learning with approximate differential privacy can be asymptotically larger than the VC dimension:
The sample complexity of properly learning threshold functions over a data universe with differential privacy is at least .
Based on these results, it would be interesting to fully characterize the difference between the sample complexity of proper non-private learners and of proper learners with (approximate) differential privacy. Furthermore, our results still leave open the possibility that improper PAC learning with (approximate) differential privacy has sample complexity . We consider this to be an important question for future work.
5 Techniques
Our results for query release and proper learning of threshold functions are obtained by analyzing the sample complexity of a related but simpler problem, which we call the interior-point problem. Here we want a mechanism (for a totally ordered data universe ) such that for every database , with high probability we have . We give reductions showing that the sample complexity of this problem is equivalent to the other ones we study:
Over every totally ordered data universe , the following four problems have the same sample complexity (up to constant factors) under differential privacy:
Distribution learning (with respect to Kolmogorov distance).
Proper PAC learning of threshold functions.
Thus we obtain our lower bounds and our simplified and improved upper bounds for query release and proper learning by proving such bounds for the interior-point problem, such as:
The sample complexity for solving the interior-point problem over a data universe with differential privacy is .
Note that for every fixed distribution over there exists a simple differentially private algorithm for solving the interior-point problem (w.h.p.) over databases sampled i.i.d. from – simply output a point s.t. . Hence, in order to prove Theorem 1.8, we show a (correlated) distribution over databases of size on which privately solving the interior-point problem is impossible. The construction is recursive: we use a hard distribution over databases of size over a data universe of size logarithmic in to construct the hard distribution over databases of size over .
By another reduction to the interior-point problem, we show an impossibility result for the following undominated-point problem:
Note that for the above problem, one cannot hope to construct a single distribution over databases that every private mechanism fails on. The reason is that for any such distribution , and any desired failure probability , there is some number for which , and hence that the mechanism that always outputs solves the problem. Hence, given a mechanism we must tailor a hard distribution . We use a similar mechanism-dependent approach to prove Theorem 1.6.
Preliminaries
Throughout this work, we use the convention that and write for . We use to denote the set of all threshold functions over a totally ordered domain . That is,
We will present algorithms that access their input database using (several) differentially private mechanisms and use the following composition theorem to prove their overall privacy guarantee.
Let be -differentially private. Let be -differentially private for any fixed value of its second argument. Then the composition is -differentially private.
The Interior Point Problem
In this work we exhibit a close connection between the problems of privately learning and releasing threshold queries, distribution learning, and solving the interior point problem as defined below.
An algorithm solves the interior point problem on with error probability if for every ,
where the probability is taken over the coins of . The sample complexity of the algorithm is the database size .
We call a solution with an interior point of . Note that need not be a member of the database .
2 Lower Bound
We now prove our lower bound on the sample complexity of private algorithms for solving the interior point problem.
Fix any constant . Let . Then for every positive integer , solving the interior point problem on with probability at least and with -differential privacy requires sample complexity .
The inductive lemma we prove depends on a number of parameters we now define. Fix . Let be any positive non-increasing sequence for which
for every . In particular, it suffices that
Let and define the function recursively by
For every positive integer , there exists a distribution over databases such that for every -differentially private mechanism ,
where the probability is taken over and the coins of .
In this section, we give a direct proof of the lemma and in Appendix B, we show how the lemma follows from the construction of a new combinatorial object we call an “interior point fingerprinting code.” This is a variant on traditional fingerprinting codes, which have been used recently to show nearly optimal lower bounds for other problems in approximate differential privacy [BUV14, DTTZ14, BST14].
The proof is by induction on . We first argue that the claim holds for by letting be uniform over the singleton databases and . To that end let and note that for any -differentially private mechanism it holds that
giving the desired bound on .
Now inductively suppose we have a distribution that satisfies the claim. We construct a distribution on databases that is sampled as follows:
Sample .
Sample a uniformly random . We write the base representation of as .
For each let be a base number (written ) that agrees with the base representation of on the first digits and contains a random sample from in every index thereafter.
Suppose for the sake of contradiction that there were an -differentially private mechanism that could solve the interior point problem on with probability greater than . We use to construct the following private mechanism for solving the interior point problem on , giving the desired contradiction:
The mechanism is also -differentially private, since for all pairs of adjacent databases and every ,
where is the set of that agree with in exactly the first entries for some .
Now we argue that solves the interior point problem on with probability greater than . First we show that with probability greater than . Observe that by construction, all the elements of agree in at least the first digits, and hence so does any interior point of . Therefore, if succeeds in outputting an interior point of , then must in particular agree with in at least digits, so .
Now we use the privacy that provides to to show that except with probability at most . Fix a database . Let , and fix all the randomness of but the st entry of (note that since , this fixes ). Since the st entry of is still a uniformly random element of , the privately produced entry should not be able to do much better than randomly guessing . Formally, for each , let denote the database with set to and everything else fixed as above. Then by the differential privacy of ,
where all probabilities are also taken over the coins of . Thus except with probability at most . By a union bound, with probability greater than
We now prove Theorem 3.2 by estimating the guaranteed by Lemma 3.3.
Let be as in Lemma 3.3. We introduce the following notation for iterated exponentials:
Observe that for , , and ,
By induction on we get an upper bound of
This immediately shows that solving the interior point problem on requires sample complexity
To get a lower bound for solving the interior point problem on when is not of the form , note that a mechanism for is also a mechanism for every s.t. . The lower bound follows by setting for the largest such that . ∎
3 Upper Bound
We now present a recursive algorithm, RecPrefix, for privately solving the interior point problem.
RecPrefix is -differentially private;
With probability at least , the output satisfies .
The idea of the algorithm is that on each level of recursion, RecPrefix takes an input database over and constructs a database over a smaller universe , where , in which every element is the length of the longest prefix of a pair of elements in (represented in binary). In a sense, this reverses the construction presented in Section 3.2.
The exponential mechanism is -differentially private.
Let be a quality function with sensitivity at most . Fix a database and let . Let . Then exponential mechanism outputs a solution with with probability at most .
We will also use an -differentially private variant of the exponential mechanism called the choosing mechanism, introduced in [BNS13b].
A quality function with sensitivity at most is of -bounded-growth if adding an element to a database can increase (by 1) the score of at most solutions, without changing the scores of other solutions. Specifically, it holds that
for all ,
If , then for all , and
There are at most values of for which .
The choosing mechanism is a differentially private algorithm for approximately solving bounded-growth choice problems. Step 1 of the algorithm checks whether a good solution exists (otherwise any solution is approximately optimal) and Step 2 invokes the exponential mechanism, but with the small set of good solutions instead of .
The following lemmas give the privacy and utility guarantees of the choosing mechanism. We give a slightly improved utility result over [BNS13b], and the analysis is presented in Appendix A.
Fix , and . If is a -bounded-growth quality function, then the choosing mechanism is -differentially private.
Let the choosing mechanism be executed on a -bounded-growth quality function, and on a database s.t. there exists a solution with quality . With probability at least , the choosing mechanism outputs a solution with quality .
Let the choosing mechanism be executed on a -bounded-growth quality function, and on a database containing elements. With probability at least , the choosing mechanism outputs a solution with quality .
3.2 The RecPrefix algorithm
We are now ready to present and analyze the algorithm RecPrefix.
We start the analysis of RecPrefix with the following two simple observations.
There are at most recursive calls throughout the execution of RecPrefix on a database .
Let RecPrefix be executed on a database , where . Every recursive call throughout the execution operates on a database containing at least elements.
We now analyze the utility guarantees of RecPrefix by proving the following lemma.
Let , and be inputs on which RecPrefix performs at most recursive calls, all of which are on databases of at least elements. With probability at least , the output is s.t.
.
Before proving the lemma, we make a combinatorial observation that motivates the random shuffling in Step 2 of RecPrefix. A pair of elements is useful in Algorithm RecPrefix if many of the values in lie between and – a prefix on which agree is also a prefix of every element between and . A prefix common to a useful pair can hence be identified privately via stability-based techniques. Towards creating useful pairs, the set is shuffled randomly. We will use the following lemma:
Let be a random permutation of . Then for all ,
We need to show that w.h.p. there are at most “bad” pairs within distance . For each , we call the left side of the pair, and the right side of the pair. Let us first choose elements to be placed on the left side of bad pairs (there are such choices). Once those are fixed, there are at most choices for placing elements on the right side of those pairs. Now we have pairs and unpaired elements that can be shuffled in ways. Overall, the probability of having at least bad pairs is at most
where we have used Stirling’s approximation for the first inequality. ∎
Suppose we have paired random elements in our input database , and constructed a database containing lengths of the prefixes for those pairs. Moreover, assume that by recursion we have identified a length which is the length at least random pairs. Although those prefixes may be different for each pair, Claim 3.12 guarantees that (w.h.p.) at least one of these prefixes is the prefix of at least input elements. This will help us in (privately) identifying such a prefix.
The proof is by induction on the number of recursive calls, denoted as . For (i.e., ), the claim holds as long as the exponential mechanism outputs an with except with probability at most . By Proposition 3.5, it suffices to have , since .
Assume that the stated lemma holds whenever RecPrefix performs at most recursive calls. Let and be inputs on which algorithm RecPrefix performs recursive calls, all of which are on databases containing at least elements. Consider the first call in the execution on those inputs, and let be the random permutation chosen on Step 2. We say that a pair is close if
By Claim 3.12, except with probability at most , there are at most close pairs. We continue the proof assuming that this is the case.
Let be the database constructed in Step 3. By the inductive assumption, with probability at least , the value obtained in Step 4 is s.t. (1) s.t. ; and (2) . We proceed with the analysis assuming that this event happened.
By (2), there are at least pairs that agree on a prefix of length at least . At least one of those pairs, say , is not close. Note that every between and agrees on the same prefix of length , and that there are at least such elements in . Moreover, as the next bit is either 0 or 1, at least half of those elements agree on a prefix of length . Thus, when using the choosing mechanism on Step 5 (to choose a prefix of length ), there exists at least one prefix with quality at least . By Lemma 3.8, the choosing mechanism ensures, therefore, that with probability at least , the chosen prefix is the prefix of at least one , and, hence, this satisfies (defined in Step 6). We proceed with the analysis assuming that this is the case.
Let be s.t. . By the definition of , this means that and agree on a prefix of length at most . Hence, as is of length , we have that either or . If , then satisfies Condition 1 of being a good output. It also satisfies Condition 2 because and , which we took to be the smallest elements of . Similarly, is a good output if . In any case, at least one out of is a good output.
If both and are good outputs, then Step 8 cannot fail. We have already established the existence of . Hence, if is not a good output, then there are at most elements s.t. . Hence, the probability of and Step 8 failing is at most . It remains to analyze the case where is not a good output (and is).
If is not a good output, then every satisfies . In particular, , and, hence, . Recall that there are at least elements in which are bigger than . As , the probability that and RecPrefix fails to return in this case is at most .
All in all, RecPrefix fails to return an appropriate with probability at most . ∎
We now proceed with the privacy analysis.
When executed for recursive calls, RecPrefix is -differentially private.
The proof is by induction on the number of recursive calls, denoted by . For (i.e., ), then by Proposition 3.5 the exponential mechanism ensures that RecPrefix is -differentially private. Assume that the stated lemma holds whenever RecPrefix performs at most recursive calls, and let be two neighboring databases on which RecPrefix performs recursive calls.The recursion depth is determined by , which is identical in and in . Let denote an algorithm consisting of steps 1-4 of RecPrefix (the output of is the value from Step 4). Consider the executions of on and on , and denote by and by the elements as they are in the executions on and on .
We show that the distributions on the databases and are similar in the sense that for each database in one of the distributions there exist a neighboring database in the other that have the same probability. Thus, applying the recursion (which is differentially private by the inductive assumption) preserves privacy. We now make this argument formal.
First note that as differ in only one element, there is a bijection between orderings and of the smallest elements of and of respectively s.t. and are neighboring databases. This is because there exists a permutation of the smallest elements of that is a neighbor of the smallest elements of ; composition with this fixed permutation yields the desired bijection. Moreover, note that whenever are neighboring databases, the same is true for and . Hence, for every set of outputs it holds that
So when executed for recursive calls, the sequence of Steps 1-4 of RecPrefix is -differentially private. On Steps 5 and 7, algorithm RecPrefix interacts with its database through the choosing mechanism and using the Laplace mechanism, each of which is -differentially private. By composition (Lemma 2.2), we get that RecPrefix is -differentially private. ∎
Combining Lemma 3.11 and Lemma 3.13 we obtain Theorem 3.4.
3.3 Informal Discussion and Open Questions
An natural open problem is to close the gap between our (roughly) upper bound on the sample complexity of privately solving the interior point problem (Theorem 3.4), and our lower bound (Theorem 3.2). Below we describe an idea for reducing the upper bound to .
In our recursive construction for the lower bound, we took elements and generated elements where is a random element (independent of the ’s), and every is the length of the longest common prefix of and . Therefore, a change limited to one affects only one and privacy is preserved (assuming that our future manipulations on preserve privacy). While the representation length of domain elements grows exponentially on every step, the database size grows by 1. This resulted in the lower bound.
In RecPrefix on the other hand, every level of recursion shrank the database size by a factor of , and hence, we required a sample of (roughly) elements. Specifically, in each level of recursion, two input elements were paired and a new element was defined as the length of their longest common prefix. As with the lower bound, we wanted to ensure that a change limited to one of the inputs affects only one new element, and hence, every input element is paired only once, and the database size shrinks.
If we could pair input elements twice then the database size would only be reduced additively (which will hopefully result in a upper bound). However, this must be done carefully, as we are at risk of deteriorating the privacy parameter by a factor of and thus remaining with an exponential dependency in . Consider the following thought experiment for pairing elements.
Input: . 1. Let denote a random permutation of . 2. For to : For to , let be the length of the longest common prefix of and .
As (most of the) elements are paired twice on every step, the database size reduces additively. In addition, every input element affects at most elements at depth , and the privacy loss is acceptable. However, this still does not solve the problem. Recall that every iteration of RecPrefix begins by randomly shuffling the inputs. Specifically, we needed to ensure that (w.h.p.) the number of “close” pairs is limited. The reason was that if a “not close” pair agrees on a prefix , then is the prefix “a lot” of other elements as well, and we could privately identify . In the above process we randomly shuffled only the elements at depth . Thus we do not know if the number of “close” pairs is small at depth . On the other hand, if we changed the pairing procedure to shuffle at every step, then each input element might affect elements at depth , causing the privacy loss to deteriorate rapidly.
Query Release and Distribution Learning
Recall that a counting query is a predicate . For a database , we write to denote the average value of over the rows of , i.e. In the query release problem, we seek differentially private algorithms that can output approximate answers to a family of counting queries simultaneously.
We are interested in the query release problem for the class of threshold queries, which we view as a class of counting queries.
We are also interested in the following distribution learning problem, which is very closely related to the query release problem.
We highlight two important special cases of the distance measure in the distribution learning problem. First, when is the collection of all counting queries on a domain , the distance is the total variation distance between distributions, defined by
Let be a distribution over a totally ordered domain . The CDF of is defined by . If is finite, then any function that is non-decreasing with is a CDF.
Algorithm is an -accurate distribution learner with respect to Kolmogorov distance with sample complexity if for all distributions on a totally ordered domain , given an input of samples where each is drawn i.i.d. from , algorithm outputs a CDF with with probability at least .
The query release problem for a collection of counting queries is very closely related to the distribution learning problem with respect to . In particular, solving the query release problem on a dataset amounts to learning the empirical distribution of . Conversely, results in statistical learning theory show that one can solve the distribution learning problem by first solving the query release problem on a sufficiently large random sample, and then fitting a distribution to approximately agree with the released answers. The requisite size of this sample (without privacy considerations) is characterized by a combinatorial measure of the class called the VC dimension:
Let be a collection of queries over domain . A set is shattered by if for every there exists such that . The Vapnik-Chervonenkis (VC) dimension of , denoted , is the cardinality of the largest set that is shattered by .
It is known [AB09] that solving the query release problem on random samples yields an -accurate distribution learner for a query class .
2 Equivalences with the Interior Point Problem
We show that the problems of privately releasing thresholds and solving the interior point problem are equivalent.
Let be a totally ordered domain. Then,
If there exists an -differentially private algorithm that is able to release threshold queries on with -accuracy and sample complexity , then there is an -differentially private algorithm that solves the interior point problem on with error and sample complexity .
If there exists an -differentially private algorithm solving the interior point problem on with error and sample complexity , then there is a -differentially private algorithm for releasing threshold queries with -accuracy and sample complexity
For the first direction, observe that an algorithm for releasing thresholds could easily be used for solving the interior point problem. Formally,
Suppose is a private -accurate algorithm for releasing thresholds over for databases of size . Define on databases of size to pad the database with an equal number of and entries, and run on the result. We can now return any point for which the approximate answer to the query is on the (padded) database. ∎
We now show the converse, i.e., that the problem of releasing thresholds can be reduced to the interior point problem. Specifically, we reduce the problem to a combination of solving the interior point problem, and of releasing thresholds on a much smaller data universe. The latter task is handled by the following algorithm.
In Appendix C, we describe another reduction that, up to constant factors, gives the same sample complexity.
Let be an -differentially private algorithm solving the interior point problem on with error and sample complexity . We may actually assume that is differentially private in the sense that if and differs from up to the addition or removal of a row, then for every , , and that solves the interior point problem with probability at least whenever its input is of size at least . This is because we can pad databases of size less than with an arbitrary fixed element, and subsample the first entries from any database with size greater than .
Consider the following algorithm for answering thresholds on databases for :
Let where , and consider a neighboring database . Assume without loss of generality that , and suppose
Moreover, under the bijection we constructed between and , noise vector is sampled with density at most times the density of , so for every set ,
Finally, the execution of at the end of the algorithm is -differentially private, so by composition (Lemma 2.2), we obtain the asserted level of privacy.
We can produce -accurate answers to every threshold function as long as
The partitioning exhausts the database, i.e. every element of is in some ,
Every execution of succeeds at finding an interior point,
Every database has size at most (ensuring that we have error at most from interpolation),
The answers obtained from executing all have error at most .
We now estimate the probabilities of each event. For each we have with probability at least
So by a union bound, every is at least with probability at least . If this is the case, then item 1 holds because . Moreover, if every , then item 2 also holds with probability at least . This is because every , and hence every execution of on a subdatabase succeeds with probability .
By a similar argument, property 3 holds as long as each noise value is at most , which happens with probability at least . Finally, property 4 holds with probability at least since
A union bound over the four properties completes the proof. ∎
2.2 Releasing Thresholds vs. Distribution Learning
Query release and distribution learning are very similar tasks: A distribution learner can be viewed as an algorithm for query release with small error w.r.t. the underlying distribution (rather than the fixed input database). We show that the two tasks are equivalent under differential privacy.
Let be a collection of counting queries over a domain .
If there exists an -differentially private algorithm for releasing with -accuracy and sample complexity , then there is an -differentially private -accurate distribution learner w.r.t. with sample complexity .
If there exists an -differentially private -accurate distribution learner w.r.t. with sample complexity , then there is an -differentially private query release algorithm for with -accuracy and sample complexity .
The first direction follows from a standard generalization bound, showing that if a given database contains (enough) i.i.d. samples from a distribution , then (w.h.p.) accuracy with respect to implies accuracy with respect to . We remark that the sample complexity lower bound on required to apply item 1 of Theorem 4.8 does not substantially restrict its applicability: It is known that an -differentially private algorithm for releasing always requires sample complexity anyway [BLR08].
We now use the following generalization theorem to show that (w.h.p.) is close to for every .
Let be a collection of counting queries over a domain . Let consist of i.i.d. samples from a distribution over . If , then
Using the above theorem, together with the fact that , we see that except with probability at least we have that for every . By a union bound (and the triangle inequality) we get that is -accurate. ∎
In the special case where for a totally ordered domain , corresponding to distribution learning under Kolmogorov distance, the above theorem holds as long as . This follows from using the Dvoretzky-Kiefer-Wolfowitz inequality [DKW56, Mas90] in place of Theorem 4.9.
If there exists an -differentially private algorithm for releasing over a totally ordered domain with -accuracy and sample complexity , then there is an -differentially private -accurate distribution learner under Kolmogorov distance with sample complexity .
We now show the other direction of the equivalence.
Note that drawing i.i.d. samples from is equivalent to subsampling rows of (with replacement). Then with probability at least , the distribution returned by is such that for every
Here, denotes the subsample of consisting of the indices in , and similarly for . Note that , since if index is not sampled. Our goal is to show that
To do this, observe that by privacy, , so
Combining inequalities 1 and 2 we get that
PAC Learning
A concept is a predicate that labels examples taken from the domain . A concept class over is a set of concepts over the domain . A learner is given examples sampled from an unknown probability distribution over that are labeled according to an unknown target concept and outputs a hypothesis that approximates the target concept with respect to the distribution . More precisely,
The generalization error of a hypothesis (with respect to a target concept and distribution ) is defined by If we say that is an -good hypothesis for on .
Algorithm is an -accurate PAC learner for a concept class over using hypothesis class with sample complexity if for all target concepts and all distributions on , given an input of samples , where each is drawn i.i.d. from , algorithm outputs a hypothesis satisfying
The probability is taken over the random choice of the examples in and the coin tosses of the learner . If then is called proper, otherwise, it is called improper.
Classical results in statistical learning theory show that a sample of size is both necessary and sufficient for PAC learning a concept class . That samples suffice follows from a “generalization” argument: for any concept and distribution , with probability at least over random labeled examples, every concept that agrees with on the examples has error at most on . Therefore, can be properly learned by finding any hypothesis that agrees with the given examples.
Recall the class of threshold functions, which are concepts defined over a totally ordered domain by where iff . The class of threshold functions has VC dimension , and hence can be learned with samples.
A private learner is a PAC learner that is differentially private. Following [KLN+11], we consider algorithms , where is a hypothesis class, and require that
is an -accurate PAC learner for a concept class with sample complexity , and
is -differentially private.
Note that while we require utility (PAC learning) to hold only when the database consists of random labeled examples from a distribution, the requirement of differential privacy applies to every pair of neighboring databases , including those that do not correspond to examples labeled by any concept.
Recall the relationship between distribution learning and releasing thresholds, where accuracy is measured w.r.t. the underlying distribution in the former and w.r.t. the fixed input database in the later. Analogously, we now define the notion of an empirical learner which is similar to a PAC learner where accuracy is measured w.r.t. the fixed input database.
Algorithm is an -accurate empirical learner for a concept class over using hypothesis class with sample complexity if for every and for every database algorithm outputs a hypothesis satisfying
The probability is taken over the coin tosses of .
Note that without privacy (and ignoring computational efficiency) identifying a hypothesis with small empirical error is trivial for every concept class and for every database of size at least . This is not the case with -differential privacy,The lower bound in Theorem 5.5 also holds for label private empirical learners, that are only required to provide privacy for the labels in the database. and the sample complexity of every empirical learner for a concept class is at least :
For every , every and , if is an -differentially private -accurate empirical learner for a class with sample complexity , then .
The proof of Theorem 5.5 is very similar the analysis of [BLR08] for lower bounding the sample complexity of releasing approximated answers for queries in the class . As we will see in the next section, at least in some cases (namely, for threshold functions) the sample complexity must also have some dependency in the size of the domain .
Observe that is a random element of that is labeled as 1 in , and that an -consistent hypothesis for must label at least such elements as 1. Hence, by the utility properties of , we have that
Similarly, is a random elements of that is labeled as 0 in , and an -consistent hypothesis for must not label more than such elements as 1. Hence,
Finally, as and differ in at most entries, differential privacy ensures that
showing that . ∎
2 Private Learning of Thresholds vs. the Interior Point Problem
We show that with differential privacy, there is a multiplicative relationship between the sample complexities of properly PAC learning thresholds with -accuracy and of solving the interior point problem with error probability . Specifically, we show
Let be a totally ordered domain. Then,
If there exists an -differentially private algorithm solving the interior point problem on with error probability and sample complexity , then there is a -differentially private -accurate proper PAC learner for with sample complexity .
If there exists an -differentially private -accurate proper PAC learner for with sample complexity , then there is a -differentially private algorithm that solves the interior point problem on with error and sample complexity .
We show this equivalence in two phases. In the first, we show a relationship between the sample complexity of solving the interior point problem and the sample complexity of empirically learning thresholds. We then use generalization and resampling arguments to show that with privacy, this latter task is equivalent to learning with samples from a distribution.
Let be a totally ordered domain. Then,
If there exists an -differentially private algorithm solving the interior point problem on with error probability and sample complexity , then there is a -differentially private algorithm for properly and empirically learning thresholds with -accuracy and sample complexity .
If there exists an -differentially private algorithm that is able to properly and empirically learn thresholds on with -accuracy and sample complexity , then there is a -differentially private algorithm that solves the interior point problem on with error and sample complexity .
For the first direction, let be a private algorithm for the interior point problem on databases of size . Consider the algorithm that, on input a database of size , runs on a database consisting of the largest elements of that are labeled and the smallest elements of that are labeled . If there are not enough of either such element, pad with ’s or ’s respectively. Note that if is an interior point of then is a threshold function with error at most on , and is hence -consistent with . For privacy, note that changing one row of changes at most two rows of . Hence, applying algorithm preserves -differential privacy.
For the reverse direction, suppose privately finds an -consistent threshold functions for databases of size . Define on a database to label the smaller points 1 and the larger points 0 to obtain a labeled database , pad with an equal number of and entries to make it of size , and run on the result. Note that if is a threshold function with error at most on then is an interior point of , as otherwise has error at least on . For privacy, note that changing one row of changes at most two rows of . Hence, applying algorithm preserves -differential privacy. ∎
Now we show that the task of privately outputting an almost consistent hypothesis on any fixed database is essentially equivalent to the task of private (proper) PAC learning. One direction follows immediately from a standard generalization bound for learning thresholds:
Any algorithm for empirically learning with -accuracy is also a -accurate PAC learner for when given at least samples.
Let be a distribution over a totally ordered domain and fix a target concept . It suffices to show that for a sample where and the are drawn i.i.d. from , it holds that
Let be the largest point with . If some has then , and hence for any sample , . Similarly let be the smallest point with . Let and . Then it suffices to show that
Concentrating first on , we define the error region as the interval where disagrees with . By a Chernoff bound, the probability that after independent samples from , fewer than appear in is at most . The same argument holds for , so the result follows by a union bound. ∎
In general, an algorithm that can output an -consistent hypothesis from concept class can also be used to learn using samples [BEHW89]. The concept class of thresholds has VC dimension , so the generalization bound for thresholds saves an factor over the generic statement.
For the other direction, we note that a distribution-free learner must perform well on the uniform distribution on the rows of any fixed database, and thus must be useful for outputting a consistent hypothesis on such a database.
Suppose is an -differentially private -accurate PAC learner for a concept class with sample complexity . Then there is an -differentially private -accurate empirical learner for with sample complexity . Moreover, if is proper, then so is the resulting empirical learner.
Thresholds in High Dimension
First observe that is -differentially private. To see this, note that a change limited to one input entry affects only one entry of the multiset . Hence, applying the -differentially private algorithm on preserves privacy.
Now the proof of Theorem 6.2 follows from the lower bound on the sample complexity of privately finding an -consistent threshold function (see Section 3.2):
There exists a constant s.t. the following holds. For every totally ordered data universe there exists a distribution over databases containing at most labeled examples from such that every -differentially private algorithm fails to find an -consistent threshold function for with probability at least .
Let be a class of predicates. If there exists a -differentially private algorithm capable of releasing queries from with -accuracy and sample complexity , then there exists a -differentially private -accurate PAC learner for with sample complexity .
Similar arguments show that , proving that is -accurate and contradicting the lower bound on the sample complexity of such algorithms. ∎
Mechanism-Dependent Lower Bounds
By a reduction to the interior point problem problem, we can prove an impossibility result for the problem of privately outputting something that is at least the minimum of a database on an unbounded domain. Specifically, we show
2 Properly Learning Point Functions with Pure Differential Privacy
We resolve this question in the negative. Specifically, we show that it is impossible to learn (even improperly) point functions over an infinite domain with pure differential privacy using a countable hypothesis class.
Let be an infinite domain, let be a countable collection of hypotheses , and let . Then there is no -differentially private -accurate PAC learner for points over using the hypothesis class .
A learner implemented by an algorithm (i.e. a probabilistic Turing machine) must use a hypothesis class where each hypothesis has a finite description. Note that the standard proper learner for can be implemented by an algorithm. However, a consequence of our result is that there is no algorithm for privately learning .
There is an infinite sequence of points together with distributions such that the sets are all disjoint.
It is impossible for this to hold for infinitely many disjoint sets . ∎
and hence is disjoint from the preceding ’s. ∎
We thank Amos Beimel, Adam Smith, Jonathan Ullman, and anonymous reviewers for helpful conversations and suggestions that helped guide our work. We also thank Gautam Kamath for pointing us to references on distribution learning.
References
Appendix A The Choosing Mechanism
We supply the proofs of privacy and utility for the choosing mechanism.
Let denote the choosing mechanism (Algorithm 2). Let be neighboring databases of elements. We need to show that for every set of outputs . Note first that has sensitivity at most , so by the properties of the Laplace Mechanism,
Similarly, we have . Thus, we my assume below that . (If , then we can write , and similarly for .)
Let and be the sets used in step 2 in the execution and on respectively. We will show that the following two facts hold:
For every , it holds that .
For every possible output , it holds that .
We first show that the two facts imply that the lemma holds for Case (b). Let , and note that as is of -bounded growth, . Using the above two facts, for every set of outputs we have
To prove Fact 1, let . That is, and . As has sensitivity at most , it must be that . As there exists with , we have that
which is at most for .
To prove Fact 2, let be a possible output of . We use the following Fact 3, proved below.
.
Using Fact 3, for every possible output we have that
We now prove Fact 3. Let . Since there exists a solution s.t. , we have .
Now, recall that is of -bounded growth, so , and every satisfies . Hence,
where the last inequality follows from the fact that . This concludes the proof of Fact 3, and completes the proof of the lemma. ∎
The utility analysis for the choosing mechanism is rather straightforward:
Recall that the mechanism defines as . Note that the mechanism succeeds whenever . This happens provided the random variable is at most , which happens with probability at least . ∎
Note that if , then every solution is a good output, and the mechanism cannot fail. Assume, therefore, that there exists a solution s.t. , and recall that the mechanism defines as . As in the proof of Lemma 3.7, with probability at least , we have . Assuming this event occurs, we will show that with probability at least , the exponential mechanism chooses a solution s.t. .
By the growth-boundedness of , and as is of size , there are at most possible solutions with . That is, . By the properties of the Exponential Mechanism, we obtain a solution as desired with probability at least
By a union bound, we get that the choosing mechanism outputs a good solution with probability at least . ∎
Appendix B Interior Point Fingerprinting Codes
Fingerprinting codes were introduced by Boneh and Shaw [BS98] to address the problem of watermarking digital content. Suppose a content distributor wishes to distribute a piece of digital content to legitimate users in such a way that any pirated copy of that content can be traced back to any user who helped in producing the copy. A fingerprinting code is a scheme for assigning each users a codeword that can be hidden in their copy of the content, and then be uniquely traced back to the identity of that user. Informally, a finger printing code is fully collusion-resistant if when an arbitrary coalition of users combine their codewords to produce a new pirate codeword the pirate codeword can still be successfully traced to a member of , provided the pirate codeword satisfies a certain marking assumption. Traditionally, this marking assumption requires that if all users in see the same bit at index of their codewords, then index of their combined codeword must also be .
Recent work has shown how to use fingerprinting codes to obtain lower bounds in differential privacy [BUV14, DTTZ14, BST14]. Roughly speaking, these works show how any algorithm with nontrivial accuracy for a given task can be used to create a pirate algorithm that satisfies the marking assumption for a fingerprinting code. The security of the fingerprinting code means that the output of this algorithm can be traced back to one of its inputs. This implies that the algorithm is not differentially private.
We show how our lower bound for privately solving the interior point problem can also be proved by the construction of an object we call an interior point fingerprinting code. The difference between this object and a traditional fingerprinting code lies in the marking assumption. Thinking of our codewords as being from an ordered domain , our marking assumption is that the codeword produced by a set of users must be an interior point of their codewords. The full definition of the code is as follows.
For a totally ordered domain , an interior point fingerprinting code over consists of a pair of randomized algorithms with the following syntax.
samples a codebook
takes as input a “codeword” and outputs either a user or a failure symbol .
The algorithms and are allowed to share a common state (e.g. their random coin tosses).
The adversary to a fingerprinting code consists of a subset of users and a pirate algorithm . The algorithm is given , i.e. the codewords for , and its output is said to be “feasible” if . The security guarantee of a fingerprinting code is that for all coalitions and all pirate algorithms , if , then we have
Completeness: , where is the completeness error.
Soundness: , where is the soundness error.
The probabilities in both cases are taken over the coins of , and .
We note that an interior point fingerprinting code could also be interpreted as an ordinary fingerprinting code (using the traditional marking assumption) with codewords of length of the form . As an example for using such a code, consider a vendor interested in fingerprinting movies. Using an interior point fingerprinting code, the vendor could produce fingerprinted copies by simply splicing two versions of the movie.
We now argue as in [BUV14] that the existence of an interior point fingerprinting code yields a lower bound for privately solving the interior point problem.
Let , , and . If there is an interior point fingerprinting code on domain for users with completeness error and soundness error , then there is no -differentially private algorithm that, with probability at least , solves the interior point problem on for databases of size .
Suppose for the sake of contradiction that there were a differentially private for solving the interior point problem on . Let , and let for a codebook .
Therefore, there exists some such that
Now consider the coalition obtained by replacing user with user . Let , again for a random codebook . Since is differentially private,
contradicting the soundness of the interior point fingerprinting code. ∎
We now show how to construct an interior point fingerprinting code, using similar ideas as in the proof of Lemma 3.3. For users, the codewords lie in a domain with size an exponential tower in , allowing us to recover the lower bound for interior point queries.
Let , and define the function recursively by and . By induction on , we will construct codes for users over a domain of size with perfect completeness and soundness at most . First note that there is a code with perfect completeness and perfect soundness for user over a domain of size . Suppose we have defined the behavior of for users. Then we define
samples and . For each , let be a base- number (written , where is the most significant digit) that agrees with in the most-significant digits, and has random entries from at every index thereafter. The output codebook is .
retrieves the codebook from its shared state with . Let be the maximum number of digits to which any (for ) agrees with . If agrees with on more than digits, accuse user . Otherwise, let be the number of indices on which agrees with , and run with respect to codebook .
We reduce the security of this scheme to that of . To check completeness, let be a pirate coalition and let be a pirate algorithm. Consider the pirate algorithm for codes on users that, given a set of codewords where , simulates to produce a set of codewords and outputs the number of indices on which agrees with .
If is feasible for and , then is feasible for . Therefore,
by induction, proving perfect completeness.
To prove soundness, let . Then
Combining Lemmas B.3 and B.4 yields Theorem 1.8.
Appendix C Another Reduction from Releasing Thresholds to the Interior Point Problem
This reduction computes approximate -quantiles of its input, which can then be used to release thresholds with -accuracy. To do so, it uses the strategy of [DNPR10] of using a complete binary tree to generate a sequence of noise values. The tree has leaves and depth , and at each node in the tree we sample a Laplace random variable. The noise value corresponding to a leaf is the sum of the samples along the path from that leaf to the root.
We take the sorted input database and divide it into equal-size blocks around the -quantiles, and perturb the boundaries of the blocks by the noise values. Solving the interior point problem on these buckets then gives approximate -quantiles. Moreover, the noisy bucketing step ensures that the final algorithm is differentially private.
We formally describe this algorithm as below. Let be an -differentially private mechanism for solving the interior point problem on that succeeds with probability at least on databases of size . In the algorithm below, let denote the set of prefixes of the binary representation of (including the empty prefix).
We will show that this algorithm satisfies -differential privacy, and is able to release approximate -quantiles with -accuracy, and hence -accurate answers to threshold queries, as long as
Let where , and let . Assume without loss of generality that , and suppose
Consider vectors of noise values . Then there is a bijection between noise vectors and noise vectors such that partitioned according to and partitioned according to differ on at most 2 blocks (cf. [DNPR10]). Moreover, this bijection changes at most values by at most . Thus under this mapping, noise vector is sampled with probability at most times the probability is sampled. We get that for any set ,
We can produce -accurate estimates of every quantile as long as
Every noise value has magnitude at most
By the analysis of Lemma 4.7 in [DNPR10], with probability at least , every noise value is bounded by . This suffices to achieve item 1. Moreover, conditioned on the noise values being so bounded, each , so each execution of individually succeeds with probability . Hence they all succeed simultaneously with probability at least , giving item 2.