Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

Vitaly Feldman, David Xiao

Introduction

In machine learning tasks, the training data often consists of information collected from individuals. This data can be highly sensitive, for example in the case of medical or financial information, and therefore privacy-preserving data analysis is becoming an increasingly important area of study in machine learning, data mining and statistics (Dwork and Smith, 2009; Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).

In this work we focus on the task of learning to classify from labeled examples. Two standard and closely related models of this task are PAC learning (Valiant, 1984) and agnostic (Haussler, 1992; Kearns et al., 1994) learning. In the PAC learning model the algorithm is given random examples in which each point is sampled i.i.d. from some unknown distribution over the domain and is labeled by an unknown function from a set of functions CC (called concept class). In the agnostic learning model the algorithm is given examples sampled i.i.d. from an arbitrary (and unknown) distribution over labeled points. The goal of the learning algorithm in both models is to output a hypothesis whose prediction error on the distribution from which examples are sampled is not higher (up to an additive ε\varepsilon) than the prediction error of the best function in CC (which is in the PAC model). See Section 2.1 for formal definitions.

We rely on the well-studied differential privacy model of privacy. Differential privacy gives a formal semantic guarantee of privacy, saying intuitively that no single individual’s data has too large of an effect on the output of the algorithm, and therefore observing the output of the algorithm does not leak much information about an individual’s private data (Dwork et al., 2006) (see Section 2.2 for the formal definition). The downside of this desirable guarantee is that for some problems achieving it has an additional cost: both in terms of the number of examples, or sample complexity, and computation.

Beimel et al. (2013b) consider PAC learning with approximate (α,β)(\alpha,\beta)-differential privacy where the privacy guarantee holds with probability 1β1-\beta (the basic notion is also referred to as pure to distinguish it from the approximate version). They show that Thrb\mathsf{Thr}_{b} can be PAC learned using O(16log(b)log(1/β))O(16^{\log^{*}(b)}\cdot\log(1/\beta)) samples (α\alpha is a constant as before). Their algorithm is proper so this separates the sample complexity of pure differentially private proper PAC learning from the approximate version. This work leaves open the question of whether such a separation can be proved for (non-proper) PAC learning.

The sample complexity of (α,β)(\alpha,\beta)-differentially privately learning Linep\mathsf{Line}_{p} is O(1αlog(1/β))O(\frac{1}{\alpha}\log(1/\beta)).

Our upper bound is also simpler than the upper bound in (Beimel et al., 2013b). See Section 6 for details.

2 Related work

There is now an extensive amount of literature on differential privacy in machine learning and related areas which we cannot hope to cover here. The reader is referred to the excellent surveys in (Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).

Blum et al. (2005) showed that algorithms that can be implemented in the statistical query (SQ) framework of Kearns (1998) can also be easily converted to differentially private algorithms. This result implies polynomial upper bounds on the sample (and computational) complexity of all learning problems that can be solved using statistical queries (which includes the vast majority of problems known to be solvable efficiently). Formal treatment of differentially private PAC and agnostic learning was initiated in the seminal work of Kasiviswanathan et al. (2011). Aside from the results we already mentioned, they separated SQ learning from differentially private learning. Further, they showed that SQ learning is (up to polynomial factors) equivalent to local differential privacy a more stringent model in which each data point is privatized before reaching the learning algorithm.

Sample complexity of more general problems in statistics was investigated in several works starting with Dwork and Lei (2009) (measured alternatively via convergence rates of statistical estimators) (Smith, 2011; Chaudhuri and Hsu, 2012; Duchi et al., 2013a, b). A recent work of Duchi et al. (2013a) shows a number of dd-dimensional problems where differentially private algorithms must incur an additional factor d/α2d/\alpha^{2} cost in sample complexity. However their lower bounds apply only to a substantially more stringent local model of differential privacy and are known not to hold in the model we consider here.

Preliminaries

In both PAC and agnostic learning model an algorithm that outputs a hypothesis in CC is referred to as proper.

2 Differentially Private Learning

where the probability is over the randomness of AA (Dwork et al., 2006). When AA is (α,0)(\alpha,0)-differentially private we say that it satisfies pure differential privacy, which we also write as α\alpha-differential privacy.

3 Representation Dimension

For agnostic learning we have that sample complexity is at most

This form of upper bound combines accuracy and confidence boosting from (Beimel et al., 2013a) to first obtain (ε,δ)(\varepsilon,\delta)-probabilistic representation and then the use of exponential mechanism as in (Kasiviswanathan et al., 2011). The results in (Kasiviswanathan et al., 2011) show the extension of this bound to agnostic learning. Note that the characterization for PAC learning is tight up to logarithmic factors.

4 Communication Complexity

Let Πε(g)\Pi^{\rightarrow}_{\varepsilon}(g) denote the class of all private-coin one-way protocols π\pi computing gg with error ε\varepsilon, namely private-coin one-way protocols π\pi satisfying for all xX,yYx\in X,y\in Y

A deterministic one-way protocol π\pi and its cost are defined as above but without dependence on random bits. We will also require distributional notions of complexity, where there is a fixed input distribution from which x,yx,y are drawn. For a distribution μ\mu over X×YX\times Y, we define Πε(g;μ)\Pi^{\rightarrow}_{\varepsilon}(g;\mu) to be all deterministic one-way protocols π\pi such that

Yao’s minimax principle (Yao, 1977) states that for all functions gg:

Error in both public and private-coin protocols can be reduced by using several independent copies of the protocol and then taking a majority vote of the result. This implies that for every ε,γ(0,1/2)\varepsilon,\gamma\in(0,1/2),

5 Littlestone’s Dimension

While in this work we will not use the definition of the online mistake-bound model itself, we briefly describe it for completeness. In the online mistake-bound model learning proceeds in rounds. At the beginning of round tt, a learning algorithm has some hypothesis hth_{t}. In round tt, the learner sees a point xtXx_{t}\in X and predicts ht(xt)h_{t}(x_{t}). At the end of the round, the correct label yty_{t} is revealed and the learner makes a mistake if ht(xt)yth_{t}(x_{t})\neq y_{t}. The learner then updates its hypothesis to ht+1h_{t+1} and this process continues. When learning a concept class CC in this model yt=f(xt)y_{t}=f(x_{t}) for some unknown fCf\in C. The (sample) complexity of such learning is defined as the largest number of mistakes that any learning algorithm can be forced to make when learning CC. Littlestone (1987) proved that it is exactly characterized by a dimension defined as follows.

Equivalence between representation dimension and communication complexity

Our main result is the following two bounds.

For any ε[0,1/2]\varepsilon\in[0,1/2] and δ\delta\in, and any concept class CC, it holds that:

()(\leq): let π\pi be the public-coin one-way protocol that achieves the optimal communication complexity cc. For each choice of the public random coins rr, let HrH_{r} denote the set of functions hσ(x)=πB(σ,x;r)h_{\sigma}(x)=\pi_{B}(\sigma,x;r) over all possible σ\sigma. Thus, each HrH_{r} has size at most 2c2^{c}. Let the distribution H\mathcal{H} be to choose uniformly random rr and then output HrH_{r}.

We show that this family (ε,δ)(\varepsilon,\delta)-probabilistically represents CC. We know from the fact that π\pi computes EvalC\mathsf{Eval}_{C} with error εδ\varepsilon\delta that it must hold for all fCf\in C and xXx\in X that:

In particular, it must hold for any distribution D\mathcal{D} over XX that:

By Yao’s minimax principle (Equation 2.1) (Yao, 1977) this implies that

For any ε[0,1/2]\varepsilon\in[0,1/2], it holds that:

In particular, this means that for all distributions D\mathcal{D} over XX, it holds that

By a standard averaging argument, there must exist at least one σ\sigma such that

This implies that HH ε\varepsilon-deterministically represents CC.

Our private-coin protocol π\pi for EvalC\mathsf{Eval}_{C} will be the following: on input ff, Alice will use her private randomness to sample hhfh\sim\mathfrak{h}_{f} and send the index of hh to Bob. Bob then outputs h(x)h(x). Thus, for each f,xf,x, it holds that

and so the protocol computes EvalC\mathsf{Eval}_{C} with error ε\leq\varepsilon.

Our equivalence theorems allow us to import many results from communication complexity into the context of private PAC learning, both proving new facts and simplifying proofs of previously known results in the process.

We note that it is known that VC dimension corresponds to the maximal distributional one-way communication complexity over all product input distributions. Hence this separation is analogous to separation of distributional one-way complexity over product distributions and the maximal distributional complexity over all distributions achieved using the greater-than function (Kremer et al., 1999).

Accuracy and confidence boosting.

Our equivalence theorems give a simple alternative way to reduce error in probabilistic and deterministic representations without using sequential boosting as was done in (Beimel et al., 2013a). Given a private PAC learner with constant error, say (ε,δ)=(1/8,1/8)(\varepsilon,\delta)=(1/8,1/8), one can first convert the learner to a communication protocol with error 1/41/4, use O(log1εδ)O(\log\frac{1}{\varepsilon^{\prime}\delta^{\prime}}) simple independent repetitions (as in eq.(2.2)) to reduce the error to εδ\varepsilon^{\prime}\delta^{\prime}, and then convert the protocol back into a (ε,δ)(\varepsilon^{\prime},\delta^{\prime})-probabilistic representation. The “magic” here happens when we convert between the communication complexity and probabilistic representation using min-max type arguments. This is the same tool that can be used to prove (computationally inefficient) boosting theorems.

Probabilistic vs. deterministic representation dimension.

It was shown by Newman (1991) that public and private coin complexity are the same up to additive logarithmic terms. In our setting (and with a specific choice of error bounds to simplify presentation), Newman’s theorem implies that

Simpler learning algorithms.

Using our equivalence theorems, we can “import” results from communication complexity to give simple private PAC learners. For example, the well-known constant communication equality protocol using inner-product-based hashing can be converted to a probabilistic representation using Theorem 3.1, which can then be used to learn point functions. The resulting learning algorithm is somewhat simpler than the constant sample complexity learner for Pointb\mathsf{Point}_{b} described in (Beimel et al., 2010) and we believe that this view also provides useful intuition. We remark that the probabilistic representation for Pointb\mathsf{Point}_{b} that results from the communication protocol is known and was used for learning Pointb\mathsf{Point}_{b} by Feldman (2009) in the context of evolvability. A closely related representation is also mentioned in (Beimel et al., 2013a).

Lower Bounds via Littlestone’s Dimension

A proof of this lower bound can be easily derived by adapting the proof in (Bar-Yossef et al., 2004) and we include it in Section A.

An immediate corollary of Lemma 4 and Lemma 4 is the following lower bound.

We give a new information-theoretic and simpler proof of Aaronson’s result for randomized public-coin communication. We start with a brief review of basic notions from information theory.

We will use the convention of letting bold-face a,b\mathbf{a},\mathbf{b} denote random variables and regular type a,ba,b denote particular values that those random variables may take.

Recall the following definitions of entropy (all logarithms are base 22):

The Kullback-Leibler divergence (also called relative entropy) is defined as:

For two jointly distributed random variables xy\mathbf{x}\mathbf{y}, let xy\langle\mathbf{x}\rangle\langle\mathbf{y}\rangle denote independent samples from the marginal distributions of x\mathbf{x} and y\mathbf{y}. If conditioned on some event EE, we write (xyE)=xEyE(\langle\mathbf{x}\rangle\langle\mathbf{y}\rangle\mid E)=\langle\mathbf{x}\mid E\rangle\langle\mathbf{y}\mid E\rangle.

Recall the following characterization of mutual information in terms of Kullback-Leibler divergence:

Let xy\mathbf{x}\mathbf{y} be jointly distributed random variables. Suppose the support of y\mathbf{y} has size 2s2^{s}. Then for every t>0t>0 it holds that:

Let x,y\mathbf{x},\mathbf{y} be random variables over a common universe XX. Recall the following two equivalent definitions of statistical distance.

We set ε=1/5\varepsilon=1/5 and let γ=2(121p2/5)2\gamma=2(\frac{1}{2}-\frac{1}{p}-2/5)^{2}. For p128p\leq 128 the claim obviously holds so we can assume that p>110p>110. This implies that γ2/121\gamma\geq 2/121 and log(1/γ)6\log(1/\gamma)\leq 6.

By the min-max principle, it suffices to exhibit an input distribution μ\mu for which

We will show that this is impossible for any one-way protocol π\pi that communicates less than c=logp1log1γlogp7c=\log p-1-\log\frac{1}{\gamma}\geq\log p-7 bits. Fix any such π\pi, and let mm be the message sent by Alice.

By the divergence characterization of mutual information, this implies that

On the other hand, observe that the distribution (x,ymm is good)(\langle\mathbf{x},\mathbf{y}\rangle\langle\mathbf{m}\rangle\mid\mathbf{m}\text{ is good}) is identical to the distribution (x,y,mm is good)(\mathbf{x}^{\prime},\mathbf{y}^{\prime},\mathbf{m}^{\prime}\mid\mathbf{m}^{\prime}\text{ is good}) sampled according to the distribution μ0\mu_{0}. (The definition of m\mathbf{m}^{\prime} good is the same as for m\mathbf{m}, since the marginal distribution of both μ0\mu_{0} and μ1\mu_{1} on the variables a,b,m\mathbf{a},\mathbf{b},\mathbf{m} is identical.)

Therefore we have by Pinsker’s inequality and Equation 5.3 that:

However, since the output of Bob depends only on x,y,mx,y,m, it therefore follows that the probability that Bob outputs 11 under μ1\mu_{1} is less than the probability that Bob outputs 11 under μ0\mu_{0} plus 11p2ε1-\frac{1}{p}-2\varepsilon, and this contradicts Equation 5.2, proving the theorem.

The joint distribution a,b,x,y,m\mathbf{a},\mathbf{b},\mathbf{x},\mathbf{y},\mathbf{m} drawn from μ1\mu_{1} can be viewed in the following order: first sample m\mathbf{m} from the marginal distribution, then sample a,b\mathbf{a},\mathbf{b} conditioned on m\mathbf{m}, and then sample x,y\mathbf{x},\mathbf{y} a random point conditioned on ax+b=y\mathbf{a}\mathbf{x}+\mathbf{b}=\mathbf{y}.

Conditioned on m=m\mathbf{m}=m, consider x1,y1\mathbf{x}_{1},\mathbf{y}_{1} and x2,y2\mathbf{x}_{2},\mathbf{y}_{2} sampled independently as just described. There are two ways a collision can occur: either a1,b1\mathbf{a}_{1},\mathbf{b}_{1} and a2,b2\mathbf{a}_{2},\mathbf{b}_{2} collide or they do not. In the first case (x1,y1)=(x2,y2)(\mathbf{x}_{1},\mathbf{y}_{1})=(\mathbf{x}_{2},\mathbf{y}_{2}) occurs with probability 1/p1/p, in the second with probability at most 1/p21/p^{2} since there is at most one point where the two lines intersect.

Separating pure and (α,β)𝛼𝛽(\alpha,\beta)-differential privacy

For any prime pp, any ε,δ,α,β(0,1/2)\varepsilon,\delta,\alpha,\beta\in(0,1/2), one can (ε,δ)(\varepsilon,\delta)-accurately learn Linep\mathsf{Line}_{p} with (α,β)(\alpha,\beta)-differential privacy using O(1εαlog1βlog1δ)O(\frac{1}{\varepsilon\alpha}\log\frac{1}{\beta}\log\frac{1}{\delta}) samples.

We prove this theorem in two steps: first we construct a learner with poor dependence on δ\delta and then amplify using the exponential mechanism to obtain a learner with good dependence on δ\delta.

For any prime pp, any ε,δ,α,β(0,1/2)\varepsilon,\delta,\alpha,\beta\in(0,1/2), it suffices to take O(1ε26/δ1αlog1βδ)O(\frac{1}{\varepsilon}2^{6/\delta}\cdot\frac{1}{\alpha}\log\frac{1}{\beta\delta}) samples in order to (ε,δ)(\varepsilon,\delta)-learn Linep\mathsf{Line}_{p} with (α,β)(\alpha,\beta)-differential privacy.

At a high level, we run the basic (non-private) learner based on VC-dimension O(1αlog1β)O(\frac{1}{\alpha}\log\frac{1}{\beta}) times. We use the fact that Linep\mathsf{Line}_{p} is stable in that after a constant number of samples, with high probability there is a unique hypothesis that classifies the samples correctly. (This is simply because any two distinct points on a line define the line.) Therefore, in each of the executions of the non-private learner, we are likely to recover the same hypothesis. We can then release this hypothesis (α,β)(\alpha,\beta)-privately using the “Propose-Test-Release” framework.

The main challenge in implementing this intuition is to eliminate corner cases, where with roughly probability 1/21/2 the sample set may contain two distinct positively labeled points and with probability 1/21/2 only a single positively labeled point, as this would lead to unstable outputs. We do this by randomizing the number of samples we take.

Let tt be a number of samples, to be chosen later. Given tt samples (x1,y1),,(xt,yt)(x_{1},y_{1}),\ldots,(x_{t},y_{t}), our basic learner will do the following:

See if there exist two distinct samples (xi,yi)(xj,yj)(x_{i},y_{i})\neq(x_{j},y_{j}) that are both classified positively. If so, output the unique line defined by these points.

Otherwise, see if there exists any sample (xi,yi)(x_{i},y_{i}) classified positively. Output the point function that outputs 11 on (xi,yi)(x_{i},y_{i}) and zero elsewhere.

Otherwise, output the constant hypothesis.

If c+Λ(1/α)>1αln12β+1c+\Lambda(1/\alpha)>\frac{1}{\alpha}\ln\frac{1}{2\beta}+1 then output h\overline{h}, otherwise output the constant hypothesis.

Here, Λ(1/α)\Lambda(1/\alpha) denotes the Laplace distribution, whose density function at point xx equals αeαx\alpha e^{-\alpha|x|}. It is easy to check that adding Λ(1/α)\Lambda(1/\alpha) to a sum of Boolean values renders that sum α\alpha-differentially private (Dwork et al., 2006).

We analyze the overall learner. Observe that once tt is fixed, the basic learner is deterministic.

The most frequent hypotheses are different. In this case c=1c=1 for both x,xx,x^{\prime}. The probability of not outputting in either case is given by

Accuracy:

we now show that the overall learner (ε,δ)(\varepsilon,\delta)-PAC learns. We claim that:

Fix any hidden line ff and any input distribution D\mathcal{D}. With probability 1δ/21-\delta/2 over the choice of tt, there is a unique hypothesis with error ε\leq\varepsilon that the basic learner will output with probability at least 2/32/3 when given tt independent samples from D\mathcal{D}.

Thus the overall probability of not returning an ε\varepsilon-good hypothesis is at most δ\delta.

Proof of Claim 6.1

Define the following events Nonet,Onet,Twot\mathsf{None}_{t},\mathsf{One}_{t},\mathsf{Two}_{t} parameterized by an integer t>0t>0 and defined over the probability space of drawing (x1,y1),,(xt,yt)(x_{1},y_{1}),\ldots,(x_{t},y_{t}) independently from D\mathcal{D}:

Nonet\mathsf{None}_{t} is the event that all of the (xi,yi)(x_{i},y_{i}) are not on the line (a,b)(a,b).

Onet\mathsf{One}_{t} is the event that there exists some (xi,yi)(x_{i},y_{i}) on the line (a,b)(a,b), and furthermore for every other (xj,yj)(x_{j},y_{j}) on the line (a,b)(a,b) is in fact equal to (xi,yi)(x_{i},y_{i}).

Twot\mathsf{Two}_{t} is the event that there exists distinct (xi,yi)(xj,yj)(x_{i},y_{i})\neq(x_{j},y_{j}) that are both on the line (a,b)(a,b).

Next we will show that with probability 1δ/21-\delta/2 over the choice of tt, one of these three events has probability at least 2/32/3, and then we show that this suffices to imply the claim.

The characterizations for Nonet,Twot\mathsf{None}_{t},\mathsf{Two}_{t} are obvious. The characterization of Onet\mathsf{One}_{t} is exactly the probability over all (x,y)f1(1)(x,y)\in f^{-1}(1) that all samples are either labeled or equal (x,y)(x,y), excluding the event that they are all labeled .

From the above and by considering the (x,y)(x,y) maximizing qx,yq_{x,y}, we have the following bounds:

t𝑡t is good with high probability.

Let us say that tt is good for Nonet\mathsf{None}_{t} if t13rt\leq\frac{1}{3r}. We say tt is good for Onet\mathsf{One}_{t} if t[ln6r,16(rq)]t\in[\frac{\ln 6}{r},\frac{1}{6(r-q)}]. We say tt is good for Twot\mathsf{Two}_{t} if t2ln6rqt\geq\frac{2\ln 6}{r-q}. (It is possible that some of these events may be empty, but this does not affect our argument.) Using Equation 6.1, Equation 6.2 and Equation 6.3, it is clear that if tt is good for some event, then the probability of that event is at least 2/32/3.

Let us say tt is good if it is good for any one of Nonet,Onet,Twot\mathsf{None}_{t},\mathsf{One}_{t},\mathsf{Two}_{t}. tt is good means the following when viewed on the logarithmic scale:

But this means that tt is bad on the logarithmic scale is equivalent to:

Thus, for any rr, there are at most 33 integer values of logt\log t that are bad. But recall that t=2kt=2^{k} where kk is uniformly chosen from {log(ln(3/2)/ε),,log(ln(3/2)/ε)+6/δ}\{\log(\ln(3/2)/\varepsilon),\ldots,\log(\ln(3/2)/\varepsilon)+6/\delta\}. Therefore the probability that k=logtk=\log t is one of the bad values defined in Equation 6.4 is at most δ/2\delta/2.

When t𝑡t is good, basic learner outputs unique accurate hypothesis.

To conclude, we argue that when tt is good then the basic learner will output a unique hypothesis with error ε\leq\varepsilon with probability 2/3\geq 2/3. This is obvious when tt is good for Twot\mathsf{Two}_{t}, since whenever the basic learner sees two points on the line, it recovers the exact line. It is also easy to see that when tt is good for Nonet\mathsf{None}_{t}, the basic learner outputs the hypothesis with probability 2/32/3, and this has error at most ε\varepsilon since

It remains to argue that the basic learner outputs a unique hypothesis with error at most ε\varepsilon when tt is good for Onet\mathsf{One}_{t}. Observe that we have actually set the parameters so that when tt is good for Onet\mathsf{One}_{t}, it holds that:

where (xmax,ymax)=argmax(x,y)f1(1)qx,y(x_{\max},y_{\max})=\operatorname*{argmax}_{(x,y)\in f^{-1}(1)}q_{x,y}. Therefore, for such tt, the basic learner will output the point function that is positive on exactly (xmax,ymax)(x_{\max},y_{\max}) with probability at least 2/32/3.

To show that this point function has error at most ε\varepsilon, it suffices to prove that

Conclusions and Open Problems

While we focus on differential privacy our lower bounds have immediate implications in other settings where information about the examples needs to be stored or transmitted for the purposes of classification, such as distributed computation, streaming and low-memory computation.

Acknowledgements

We are grateful to Kobbi Nissim for first drawing our attention to the intriguing problem of understanding the relationship between probabilistic representation dimension and VC dimension, and for valuable discussions regarding the sample complexity of privately learning threshold functions. We thank Nina Balcan and Avrim Blum who brought up the relationship of our bounds for intervals in Section 3.1 to those based on Littlestone’s dimension. Their insightful comments and questions have lead to our result in Theorem 1.2. We also thank Sasha Rakhlin and Sasha Sherstov for useful suggestions and references.

D.X. was supported in part by the French ANR Blanc program under contract ANR-12-BS02-005 (RDAM project), by NSF grant CNS-1237235, a gift from Google, Inc., and a Simons Investigator grant to Salil Vadhan.

References

Appendix A Proof of Lemma 4

where μ\mu is the input distribution that samples uniform xR{0,1}d\mathbf{x}\stackrel{{\scriptstyle R}}{{\leftarrow}}\{0,1\}^{d} and uniform iR[d]\mathbf{i}\stackrel{{\scriptstyle R}}{{\leftarrow}}[d].

Consider any deterministic protocol π\pi computing AugIndex\mathsf{AugIndex} with error at most ε\varepsilon over μ\mu, and suppose that π\pi uses δd\delta\cdot d communication. Let (x,i)μ(\mathbf{x},\mathbf{i})\sim\mu. Then I(πA(x);x)δd\mathbf{I}(\pi_{A}(\mathbf{x});\mathbf{x})\leq\delta\cdot d. By the chain rule for mutual information it follows that

We therefore deduce that (for i\mathbf{i} uniform over [d][d]):

Appendix B Ldim Lower Bound for Halfspaces

Recall that HSbd\mathsf{HS}_{b}^{d} denotes the concept class of all halfspaces over IbdI_{b}^{d}, where Ib={0,1,,2b1}I_{b}=\{0,1,\ldots,2^{b}-1\}. Our proof is based on the technique used in [Muroga, 1971] to prove a lower bound of 2d(d1)2^{d(d-1)} on the total number of distinct halfspaces over {0,1}d\{0,1\}^{d}. As a first step we prove the following simple lemma (for b=1b=1 it can also be found in [Muroga, 1971]).

(w,θ)(w,\theta) represents ff, that is f(x)=1f(x)=1 if and only if wxθw\cdot x\geq\theta;

for every two distinct x,xIbdx,x^{\prime}\in I_{b}^{d}, wxwxw\cdot x\neq w\cdot x^{\prime}.

We refer to such a representation of ff as collision-free.

Let (w,θ)(w^{\prime},\theta^{\prime}) be any integer weight representation of ff (such representation always exists for a halfspace over integer points). We first create a margin around the decision boundary by setting w=2d+1ww^{\prime\prime}=2^{d+1}w^{\prime} and θ=2d+1θ2d\theta=2^{d+1}\theta^{\prime}-2^{d}. Note that (w,θ)(w^{\prime\prime},\theta) also represents ff and, in addition, for every xx, wxθ2d|w^{\prime\prime}\cdot x-\theta|\geq 2^{d}. We now define for every i[d]i\in[d], wi=wi+2iw_{i}=w^{\prime\prime}_{i}+2^{i}. It is easy to see that wxwx2d1|w\cdot x-w^{\prime\prime}\cdot x|\leq 2^{d}-1 and therefore (w,θ)(w,\theta) represents ff. Further, dd least significant bits in the binary representation of wxw\cdot x are exactly equal to xx and therefore the second condition is also satisfied.

We construct a mistake tree TdT_{d} over HSbd\mathsf{HS}_{b}^{d} and IbdI_{b}^{d} inductively over the dimension dd. For d=1d=1, HSbd\mathsf{HS}_{b}^{d} includes all threshold functions on IbI_{b} and therefore we define T1T_{1} is the complete binary tree representing the binary search on this interval. Note that the depth of this tree is bb.

Appendix C Improving Dependence on δ𝛿\delta in Theorem 6.1

We now improve the exponential dependence on 1/δ1/\delta in subsection 6.1 to prove Theorem 6.1. We first introduce the exponential mechanism of McSherry and Talwar . For simplicity we only describe its restriction to the learning setting. Let HH be a hypothesis class and define the “quality score function” q(S,h)={(x,y)Sh(x)=y}q(S,h)=|\{(x,y)\in S\mid h(x)=y\}|. The exponential mechanism for qq with privacy α\alpha is the following: given an input S(X×{0,1})nS\in(X\times\{0,1\})^{n}, output hHh\in H according to the distribution EM(S)\mathsf{EM}(S) given by

We use the following theorem about the exponential mechanism. Let qmax(S)=maxhHq(S,h)q_{\max}(S)=\max_{h\in H}q(S,h).

The exponential mechanism is α\alpha-differentially private. Furthermore, for all S(X×{0,1})nS\in(X\times\{0,1\})^{n} and all t>0t>0, it holds that

We will use the algorithm of subsection 6.1 with δ=Θ(1)\delta=\Theta(1) in order to construct a small set of hypotheses from which we’ll then select one using the exponential mechanism. More precisely:

Set k=log(2/δ)/log(4/3)k=\log(2/\delta)/\log(4/3). Run the algorithm of subsection 6.1 kk times independently with fresh samples and with (ε4,14)(\frac{\varepsilon}{4},\frac{1}{4})-accuracy and (α,β)(\alpha,\beta)-differential privacy. Call the resulting hypotheses H={h1,,hk}H=\{h_{1},\ldots,h_{k}\}.

Sample m=16εαlog4kδm=\frac{16}{\varepsilon\alpha}\log\frac{4k}{\delta} additional samples, call this set SS. Use the exponential mechanism to output hHh\in H.

The mechanism is (α,β)(\alpha,\beta)-differentially private because samples used to produce h1,,hkh_{1},\ldots,h_{k} are learned with (α,β)(\alpha,\beta) differential privacy (notice each sample can only affect one of the hih_{i}, therefore the privacy loss does not add up when considering the set of all hypotheses). Also, the samples used to pick hHh\in H are used via the exponential mechanism, which is also α\alpha-differentially private.

To analyze the accuracy, observe that since each of the hih_{i} is produced using an (ε4,14)(\frac{\varepsilon}{4},\frac{1}{4})-accurate learner, therefore by independence of the executions and our choice of kk, it holds with probability 1(114)k1δ2\geq 1-(1-\frac{1}{4})^{k}\geq 1-\frac{\delta}{2} that HH contains some hh that has error at most ε4\frac{\varepsilon}{4}.

This implies that with probability 13δ/41-3\delta/4 over the probability of computing HH and sampling SS, it suffices to output some hHh\in H such that q(S,h)maxhHq(S,h)ε/8q(S,h)\geq\max_{h^{\prime}\in H}q(S,h^{\prime})-\varepsilon/8. This is because we are in the case where HH contains a hypothesis with error ε/4\leq\varepsilon/4, and therefore by Equation 3.2 it holds that maxhHq(S,h)>S(13ε/8)\max_{h^{\prime}\in H}q(S,h^{\prime})>|S|(1-3\varepsilon/8) and therefore any such hh will have error ε/2\leq\varepsilon/2 over SS. By Equation 3.1, we deduce that any such hh must have error ε\leq\varepsilon over D\mathcal{D}.

By Theorem C.1, the probability that the exponential mechanism outputs such a hh is at least 1keαεm/161δ/41-ke^{-\alpha\varepsilon m/16}\geq 1-\delta/4. Therefore the overall probability of outputting an ε\varepsilon-good hypothesis is at least 1δ1-\delta.