Sample Complexity Bounds on Differentially Private Learning via Communication Complexity
Vitaly Feldman, David Xiao
Introduction
In machine learning tasks, the training data often consists of information collected from individuals. This data can be highly sensitive, for example in the case of medical or financial information, and therefore privacy-preserving data analysis is becoming an increasingly important area of study in machine learning, data mining and statistics (Dwork and Smith, 2009; Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).
In this work we focus on the task of learning to classify from labeled examples. Two standard and closely related models of this task are PAC learning (Valiant, 1984) and agnostic (Haussler, 1992; Kearns et al., 1994) learning. In the PAC learning model the algorithm is given random examples in which each point is sampled i.i.d. from some unknown distribution over the domain and is labeled by an unknown function from a set of functions (called concept class). In the agnostic learning model the algorithm is given examples sampled i.i.d. from an arbitrary (and unknown) distribution over labeled points. The goal of the learning algorithm in both models is to output a hypothesis whose prediction error on the distribution from which examples are sampled is not higher (up to an additive ) than the prediction error of the best function in (which is in the PAC model). See Section 2.1 for formal definitions.
We rely on the well-studied differential privacy model of privacy. Differential privacy gives a formal semantic guarantee of privacy, saying intuitively that no single individual’s data has too large of an effect on the output of the algorithm, and therefore observing the output of the algorithm does not leak much information about an individual’s private data (Dwork et al., 2006) (see Section 2.2 for the formal definition). The downside of this desirable guarantee is that for some problems achieving it has an additional cost: both in terms of the number of examples, or sample complexity, and computation.
Beimel et al. (2013b) consider PAC learning with approximate -differential privacy where the privacy guarantee holds with probability (the basic notion is also referred to as pure to distinguish it from the approximate version). They show that can be PAC learned using samples ( is a constant as before). Their algorithm is proper so this separates the sample complexity of pure differentially private proper PAC learning from the approximate version. This work leaves open the question of whether such a separation can be proved for (non-proper) PAC learning.
The sample complexity of -differentially privately learning is .
Our upper bound is also simpler than the upper bound in (Beimel et al., 2013b). See Section 6 for details.
2 Related work
There is now an extensive amount of literature on differential privacy in machine learning and related areas which we cannot hope to cover here. The reader is referred to the excellent surveys in (Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).
Blum et al. (2005) showed that algorithms that can be implemented in the statistical query (SQ) framework of Kearns (1998) can also be easily converted to differentially private algorithms. This result implies polynomial upper bounds on the sample (and computational) complexity of all learning problems that can be solved using statistical queries (which includes the vast majority of problems known to be solvable efficiently). Formal treatment of differentially private PAC and agnostic learning was initiated in the seminal work of Kasiviswanathan et al. (2011). Aside from the results we already mentioned, they separated SQ learning from differentially private learning. Further, they showed that SQ learning is (up to polynomial factors) equivalent to local differential privacy a more stringent model in which each data point is privatized before reaching the learning algorithm.
Sample complexity of more general problems in statistics was investigated in several works starting with Dwork and Lei (2009) (measured alternatively via convergence rates of statistical estimators) (Smith, 2011; Chaudhuri and Hsu, 2012; Duchi et al., 2013a, b). A recent work of Duchi et al. (2013a) shows a number of -dimensional problems where differentially private algorithms must incur an additional factor cost in sample complexity. However their lower bounds apply only to a substantially more stringent local model of differential privacy and are known not to hold in the model we consider here.
Preliminaries
In both PAC and agnostic learning model an algorithm that outputs a hypothesis in is referred to as proper.
2 Differentially Private Learning
where the probability is over the randomness of (Dwork et al., 2006). When is -differentially private we say that it satisfies pure differential privacy, which we also write as -differential privacy.
3 Representation Dimension
For agnostic learning we have that sample complexity is at most
This form of upper bound combines accuracy and confidence boosting from (Beimel et al., 2013a) to first obtain -probabilistic representation and then the use of exponential mechanism as in (Kasiviswanathan et al., 2011). The results in (Kasiviswanathan et al., 2011) show the extension of this bound to agnostic learning. Note that the characterization for PAC learning is tight up to logarithmic factors.
4 Communication Complexity
Let denote the class of all private-coin one-way protocols computing with error , namely private-coin one-way protocols satisfying for all
A deterministic one-way protocol and its cost are defined as above but without dependence on random bits. We will also require distributional notions of complexity, where there is a fixed input distribution from which are drawn. For a distribution over , we define to be all deterministic one-way protocols such that
Yao’s minimax principle (Yao, 1977) states that for all functions :
Error in both public and private-coin protocols can be reduced by using several independent copies of the protocol and then taking a majority vote of the result. This implies that for every ,
5 Littlestone’s Dimension
While in this work we will not use the definition of the online mistake-bound model itself, we briefly describe it for completeness. In the online mistake-bound model learning proceeds in rounds. At the beginning of round , a learning algorithm has some hypothesis . In round , the learner sees a point and predicts . At the end of the round, the correct label is revealed and the learner makes a mistake if . The learner then updates its hypothesis to and this process continues. When learning a concept class in this model for some unknown . The (sample) complexity of such learning is defined as the largest number of mistakes that any learning algorithm can be forced to make when learning . Littlestone (1987) proved that it is exactly characterized by a dimension defined as follows.
Equivalence between representation dimension and communication complexity
Our main result is the following two bounds.
For any and , and any concept class , it holds that:
: let be the public-coin one-way protocol that achieves the optimal communication complexity . For each choice of the public random coins , let denote the set of functions over all possible . Thus, each has size at most . Let the distribution be to choose uniformly random and then output .
We show that this family -probabilistically represents . We know from the fact that computes with error that it must hold for all and that:
In particular, it must hold for any distribution over that:
By Yao’s minimax principle (Equation 2.1) (Yao, 1977) this implies that
For any , it holds that:
In particular, this means that for all distributions over , it holds that
By a standard averaging argument, there must exist at least one such that
This implies that -deterministically represents .
Our private-coin protocol for will be the following: on input , Alice will use her private randomness to sample and send the index of to Bob. Bob then outputs . Thus, for each , it holds that
and so the protocol computes with error .
Our equivalence theorems allow us to import many results from communication complexity into the context of private PAC learning, both proving new facts and simplifying proofs of previously known results in the process.
We note that it is known that VC dimension corresponds to the maximal distributional one-way communication complexity over all product input distributions. Hence this separation is analogous to separation of distributional one-way complexity over product distributions and the maximal distributional complexity over all distributions achieved using the greater-than function (Kremer et al., 1999).
Accuracy and confidence boosting.
Our equivalence theorems give a simple alternative way to reduce error in probabilistic and deterministic representations without using sequential boosting as was done in (Beimel et al., 2013a). Given a private PAC learner with constant error, say , one can first convert the learner to a communication protocol with error , use simple independent repetitions (as in eq.(2.2)) to reduce the error to , and then convert the protocol back into a -probabilistic representation. The “magic” here happens when we convert between the communication complexity and probabilistic representation using min-max type arguments. This is the same tool that can be used to prove (computationally inefficient) boosting theorems.
Probabilistic vs. deterministic representation dimension.
It was shown by Newman (1991) that public and private coin complexity are the same up to additive logarithmic terms. In our setting (and with a specific choice of error bounds to simplify presentation), Newman’s theorem implies that
Simpler learning algorithms.
Using our equivalence theorems, we can “import” results from communication complexity to give simple private PAC learners. For example, the well-known constant communication equality protocol using inner-product-based hashing can be converted to a probabilistic representation using Theorem 3.1, which can then be used to learn point functions. The resulting learning algorithm is somewhat simpler than the constant sample complexity learner for described in (Beimel et al., 2010) and we believe that this view also provides useful intuition. We remark that the probabilistic representation for that results from the communication protocol is known and was used for learning by Feldman (2009) in the context of evolvability. A closely related representation is also mentioned in (Beimel et al., 2013a).
Lower Bounds via Littlestone’s Dimension
A proof of this lower bound can be easily derived by adapting the proof in (Bar-Yossef et al., 2004) and we include it in Section A.
An immediate corollary of Lemma 4 and Lemma 4 is the following lower bound.
We give a new information-theoretic and simpler proof of Aaronson’s result for randomized public-coin communication. We start with a brief review of basic notions from information theory.
We will use the convention of letting bold-face denote random variables and regular type denote particular values that those random variables may take.
Recall the following definitions of entropy (all logarithms are base ):
The Kullback-Leibler divergence (also called relative entropy) is defined as:
For two jointly distributed random variables , let denote independent samples from the marginal distributions of and . If conditioned on some event , we write .
Recall the following characterization of mutual information in terms of Kullback-Leibler divergence:
Let be jointly distributed random variables. Suppose the support of has size . Then for every it holds that:
Let be random variables over a common universe . Recall the following two equivalent definitions of statistical distance.
We set and let . For the claim obviously holds so we can assume that . This implies that and .
By the min-max principle, it suffices to exhibit an input distribution for which
We will show that this is impossible for any one-way protocol that communicates less than bits. Fix any such , and let be the message sent by Alice.
By the divergence characterization of mutual information, this implies that
On the other hand, observe that the distribution is identical to the distribution sampled according to the distribution . (The definition of good is the same as for , since the marginal distribution of both and on the variables is identical.)
Therefore we have by Pinsker’s inequality and Equation 5.3 that:
However, since the output of Bob depends only on , it therefore follows that the probability that Bob outputs under is less than the probability that Bob outputs under plus , and this contradicts Equation 5.2, proving the theorem.
The joint distribution drawn from can be viewed in the following order: first sample from the marginal distribution, then sample conditioned on , and then sample a random point conditioned on .
Conditioned on , consider and sampled independently as just described. There are two ways a collision can occur: either and collide or they do not. In the first case occurs with probability , in the second with probability at most since there is at most one point where the two lines intersect.
Separating pure and (α,β)𝛼𝛽(\alpha,\beta)-differential privacy
For any prime , any , one can -accurately learn with -differential privacy using samples.
We prove this theorem in two steps: first we construct a learner with poor dependence on and then amplify using the exponential mechanism to obtain a learner with good dependence on .
For any prime , any , it suffices to take samples in order to -learn with -differential privacy.
At a high level, we run the basic (non-private) learner based on VC-dimension times. We use the fact that is stable in that after a constant number of samples, with high probability there is a unique hypothesis that classifies the samples correctly. (This is simply because any two distinct points on a line define the line.) Therefore, in each of the executions of the non-private learner, we are likely to recover the same hypothesis. We can then release this hypothesis -privately using the “Propose-Test-Release” framework.
The main challenge in implementing this intuition is to eliminate corner cases, where with roughly probability the sample set may contain two distinct positively labeled points and with probability only a single positively labeled point, as this would lead to unstable outputs. We do this by randomizing the number of samples we take.
Let be a number of samples, to be chosen later. Given samples , our basic learner will do the following:
See if there exist two distinct samples that are both classified positively. If so, output the unique line defined by these points.
Otherwise, see if there exists any sample classified positively. Output the point function that outputs on and zero elsewhere.
Otherwise, output the constant hypothesis.
If then output , otherwise output the constant hypothesis.
Here, denotes the Laplace distribution, whose density function at point equals . It is easy to check that adding to a sum of Boolean values renders that sum -differentially private (Dwork et al., 2006).
We analyze the overall learner. Observe that once is fixed, the basic learner is deterministic.
The most frequent hypotheses are different. In this case for both . The probability of not outputting in either case is given by
Accuracy:
we now show that the overall learner -PAC learns. We claim that:
Fix any hidden line and any input distribution . With probability over the choice of , there is a unique hypothesis with error that the basic learner will output with probability at least when given independent samples from .
Thus the overall probability of not returning an -good hypothesis is at most .
Proof of Claim 6.1
Define the following events parameterized by an integer and defined over the probability space of drawing independently from :
is the event that all of the are not on the line .
is the event that there exists some on the line , and furthermore for every other on the line is in fact equal to .
is the event that there exists distinct that are both on the line .
Next we will show that with probability over the choice of , one of these three events has probability at least , and then we show that this suffices to imply the claim.
The characterizations for are obvious. The characterization of is exactly the probability over all that all samples are either labeled or equal , excluding the event that they are all labeled .
From the above and by considering the maximizing , we have the following bounds:
t𝑡t is good with high probability.
Let us say that is good for if . We say is good for if . We say is good for if . (It is possible that some of these events may be empty, but this does not affect our argument.) Using Equation 6.1, Equation 6.2 and Equation 6.3, it is clear that if is good for some event, then the probability of that event is at least .
Let us say is good if it is good for any one of . is good means the following when viewed on the logarithmic scale:
But this means that is bad on the logarithmic scale is equivalent to:
Thus, for any , there are at most integer values of that are bad. But recall that where is uniformly chosen from . Therefore the probability that is one of the bad values defined in Equation 6.4 is at most .
When t𝑡t is good, basic learner outputs unique accurate hypothesis.
To conclude, we argue that when is good then the basic learner will output a unique hypothesis with error with probability . This is obvious when is good for , since whenever the basic learner sees two points on the line, it recovers the exact line. It is also easy to see that when is good for , the basic learner outputs the hypothesis with probability , and this has error at most since
It remains to argue that the basic learner outputs a unique hypothesis with error at most when is good for . Observe that we have actually set the parameters so that when is good for , it holds that:
where . Therefore, for such , the basic learner will output the point function that is positive on exactly with probability at least .
To show that this point function has error at most , it suffices to prove that
Conclusions and Open Problems
While we focus on differential privacy our lower bounds have immediate implications in other settings where information about the examples needs to be stored or transmitted for the purposes of classification, such as distributed computation, streaming and low-memory computation.
Acknowledgements
We are grateful to Kobbi Nissim for first drawing our attention to the intriguing problem of understanding the relationship between probabilistic representation dimension and VC dimension, and for valuable discussions regarding the sample complexity of privately learning threshold functions. We thank Nina Balcan and Avrim Blum who brought up the relationship of our bounds for intervals in Section 3.1 to those based on Littlestone’s dimension. Their insightful comments and questions have lead to our result in Theorem 1.2. We also thank Sasha Rakhlin and Sasha Sherstov for useful suggestions and references.
D.X. was supported in part by the French ANR Blanc program under contract ANR-12-BS02-005 (RDAM project), by NSF grant CNS-1237235, a gift from Google, Inc., and a Simons Investigator grant to Salil Vadhan.
References
Appendix A Proof of Lemma 4
where is the input distribution that samples uniform and uniform .
Consider any deterministic protocol computing with error at most over , and suppose that uses communication. Let . Then . By the chain rule for mutual information it follows that
We therefore deduce that (for uniform over ):
Appendix B Ldim Lower Bound for Halfspaces
Recall that denotes the concept class of all halfspaces over , where . Our proof is based on the technique used in [Muroga, 1971] to prove a lower bound of on the total number of distinct halfspaces over . As a first step we prove the following simple lemma (for it can also be found in [Muroga, 1971]).
represents , that is if and only if ;
for every two distinct , .
We refer to such a representation of as collision-free.
Let be any integer weight representation of (such representation always exists for a halfspace over integer points). We first create a margin around the decision boundary by setting and . Note that also represents and, in addition, for every , . We now define for every , . It is easy to see that and therefore represents . Further, least significant bits in the binary representation of are exactly equal to and therefore the second condition is also satisfied.
We construct a mistake tree over and inductively over the dimension . For , includes all threshold functions on and therefore we define is the complete binary tree representing the binary search on this interval. Note that the depth of this tree is .
Appendix C Improving Dependence on δ𝛿\delta in Theorem 6.1
We now improve the exponential dependence on in subsection 6.1 to prove Theorem 6.1. We first introduce the exponential mechanism of McSherry and Talwar . For simplicity we only describe its restriction to the learning setting. Let be a hypothesis class and define the “quality score function” . The exponential mechanism for with privacy is the following: given an input , output according to the distribution given by
We use the following theorem about the exponential mechanism. Let .
The exponential mechanism is -differentially private. Furthermore, for all and all , it holds that
We will use the algorithm of subsection 6.1 with in order to construct a small set of hypotheses from which we’ll then select one using the exponential mechanism. More precisely:
Set . Run the algorithm of subsection 6.1 times independently with fresh samples and with -accuracy and -differential privacy. Call the resulting hypotheses .
Sample additional samples, call this set . Use the exponential mechanism to output .
The mechanism is -differentially private because samples used to produce are learned with differential privacy (notice each sample can only affect one of the , therefore the privacy loss does not add up when considering the set of all hypotheses). Also, the samples used to pick are used via the exponential mechanism, which is also -differentially private.
To analyze the accuracy, observe that since each of the is produced using an -accurate learner, therefore by independence of the executions and our choice of , it holds with probability that contains some that has error at most .
This implies that with probability over the probability of computing and sampling , it suffices to output some such that . This is because we are in the case where contains a hypothesis with error , and therefore by Equation 3.2 it holds that and therefore any such will have error over . By Equation 3.1, we deduce that any such must have error over .
By Theorem C.1, the probability that the exponential mechanism outputs such a is at least . Therefore the overall probability of outputting an -good hypothesis is at least .