Sample Complexity Bounds on Differentially Private Learning via Communication Complexity

Vitaly Feldman, David Xiao

Introduction

In machine learning tasks, the training data often consists of information collected from individuals. This data can be highly sensitive, for example in the case of medical or financial information, and therefore privacy-preserving data analysis is becoming an increasingly important area of study in machine learning, data mining and statistics (Dwork and Smith, 2009; Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).

In this work we focus on the task of learning to classify from labeled examples. Two standard and closely related models of this task are PAC learning (Valiant, 1984) and agnostic (Haussler, 1992; Kearns et al., 1994) learning. In the PAC learning model the algorithm is given random examples in which each point is sampled i.i.d. from some unknown distribution over the domain and is labeled by an unknown function from a set of functions $C$ (called concept class). In the agnostic learning model the algorithm is given examples sampled i.i.d. from an arbitrary (and unknown) distribution over labeled points. The goal of the learning algorithm in both models is to output a hypothesis whose prediction error on the distribution from which examples are sampled is not higher (up to an additive $\varepsilon$ ) than the prediction error of the best function in $C$ (which is in the PAC model). See Section 2.1 for formal definitions.

We rely on the well-studied differential privacy model of privacy. Differential privacy gives a formal semantic guarantee of privacy, saying intuitively that no single individual’s data has too large of an effect on the output of the algorithm, and therefore observing the output of the algorithm does not leak much information about an individual’s private data (Dwork et al., 2006) (see Section 2.2 for the formal definition). The downside of this desirable guarantee is that for some problems achieving it has an additional cost: both in terms of the number of examples, or sample complexity, and computation.

Beimel et al. (2013b) consider PAC learning with approximate $(\alpha,\beta)$ -differential privacy where the privacy guarantee holds with probability $1-\beta$ (the basic notion is also referred to as pure to distinguish it from the approximate version). They show that $\mathsf{Thr}_{b}$ can be PAC learned using $O(16^{\log^{*}(b)}\cdot\log(1/\beta))$ samples ( $\alpha$ is a constant as before). Their algorithm is proper so this separates the sample complexity of pure differentially private proper PAC learning from the approximate version. This work leaves open the question of whether such a separation can be proved for (non-proper) PAC learning.

The sample complexity of $(\alpha,\beta)$ -differentially privately learning $\mathsf{Line}_{p}$ is $O(\frac{1}{\alpha}\log(1/\beta))$ .

Our upper bound is also simpler than the upper bound in (Beimel et al., 2013b). See Section 6 for details.

2 Related work

There is now an extensive amount of literature on differential privacy in machine learning and related areas which we cannot hope to cover here. The reader is referred to the excellent surveys in (Sarwate and Chaudhuri, 2013; Dwork and Roth, 2014).

Blum et al. (2005) showed that algorithms that can be implemented in the statistical query (SQ) framework of Kearns (1998) can also be easily converted to differentially private algorithms. This result implies polynomial upper bounds on the sample (and computational) complexity of all learning problems that can be solved using statistical queries (which includes the vast majority of problems known to be solvable efficiently). Formal treatment of differentially private PAC and agnostic learning was initiated in the seminal work of Kasiviswanathan et al. (2011). Aside from the results we already mentioned, they separated SQ learning from differentially private learning. Further, they showed that SQ learning is (up to polynomial factors) equivalent to local differential privacy a more stringent model in which each data point is privatized before reaching the learning algorithm.

Sample complexity of more general problems in statistics was investigated in several works starting with Dwork and Lei (2009) (measured alternatively via convergence rates of statistical estimators) (Smith, 2011; Chaudhuri and Hsu, 2012; Duchi et al., 2013a, b). A recent work of Duchi et al. (2013a) shows a number of $d$ -dimensional problems where differentially private algorithms must incur an additional factor $d/\alpha^{2}$ cost in sample complexity. However their lower bounds apply only to a substantially more stringent local model of differential privacy and are known not to hold in the model we consider here.

Preliminaries

In both PAC and agnostic learning model an algorithm that outputs a hypothesis in $C$ is referred to as proper.

2 Differentially Private Learning

where the probability is over the randomness of $A$ (Dwork et al., 2006). When $A$ is $(\alpha,0)$ -differentially private we say that it satisfies pure differential privacy, which we also write as $\alpha$ -differential privacy.

3 Representation Dimension

For agnostic learning we have that sample complexity is at most

This form of upper bound combines accuracy and confidence boosting from (Beimel et al., 2013a) to first obtain $(\varepsilon,\delta)$ -probabilistic representation and then the use of exponential mechanism as in (Kasiviswanathan et al., 2011). The results in (Kasiviswanathan et al., 2011) show the extension of this bound to agnostic learning. Note that the characterization for PAC learning is tight up to logarithmic factors.

4 Communication Complexity

Let $\Pi^{\rightarrow}_{\varepsilon}(g)$ denote the class of all private-coin one-way protocols $\pi$ computing $g$ with error $\varepsilon$ , namely private-coin one-way protocols $\pi$ satisfying for all $x\in X,y\in Y$

A deterministic one-way protocol $\pi$ and its cost are defined as above but without dependence on random bits. We will also require distributional notions of complexity, where there is a fixed input distribution from which $x,y$ are drawn. For a distribution $\mu$ over $X\times Y$ , we define $\Pi^{\rightarrow}_{\varepsilon}(g;\mu)$ to be all deterministic one-way protocols $\pi$ such that

Yao’s minimax principle (Yao, 1977) states that for all functions $g$ :

Error in both public and private-coin protocols can be reduced by using several independent copies of the protocol and then taking a majority vote of the result. This implies that for every $\varepsilon,\gamma\in(0,1/2)$ ,

5 Littlestone’s Dimension

While in this work we will not use the definition of the online mistake-bound model itself, we briefly describe it for completeness. In the online mistake-bound model learning proceeds in rounds. At the beginning of round $t$ , a learning algorithm has some hypothesis $h_{t}$ . In round $t$ , the learner sees a point $x_{t}\in X$ and predicts $h_{t}(x_{t})$ . At the end of the round, the correct label $y_{t}$ is revealed and the learner makes a mistake if $h_{t}(x_{t})\neq y_{t}$ . The learner then updates its hypothesis to $h_{t+1}$ and this process continues. When learning a concept class $C$ in this model $y_{t}=f(x_{t})$ for some unknown $f\in C$ . The (sample) complexity of such learning is defined as the largest number of mistakes that any learning algorithm can be forced to make when learning $C$ . Littlestone (1987) proved that it is exactly characterized by a dimension defined as follows.

Equivalence between representation dimension and communication complexity

Our main result is the following two bounds.

For any $\varepsilon\in[0,1/2]$ and $\delta\in$ , and any concept class $C$ , it holds that:

$(\leq)$ : let $\pi$ be the public-coin one-way protocol that achieves the optimal communication complexity $c$ . For each choice of the public random coins $r$ , let $H_{r}$ denote the set of functions $h_{\sigma}(x)=\pi_{B}(\sigma,x;r)$ over all possible $\sigma$ . Thus, each $H_{r}$ has size at most $2^{c}$ . Let the distribution $\mathcal{H}$ be to choose uniformly random $r$ and then output $H_{r}$ .

We show that this family $(\varepsilon,\delta)$ -probabilistically represents $C$ . We know from the fact that $\pi$ computes $\mathsf{Eval}_{C}$ with error $\varepsilon\delta$ that it must hold for all $f\in C$ and $x\in X$ that:

In particular, it must hold for any distribution $\mathcal{D}$ over $X$ that:

By Yao’s minimax principle (Equation 2.1) (Yao, 1977) this implies that

For any $\varepsilon\in[0,1/2]$ , it holds that:

In particular, this means that for all distributions $\mathcal{D}$ over $X$ , it holds that

By a standard averaging argument, there must exist at least one $\sigma$ such that

This implies that $H$ $\varepsilon$ -deterministically represents $C$ .

Our private-coin protocol $\pi$ for $\mathsf{Eval}_{C}$ will be the following: on input $f$ , Alice will use her private randomness to sample $h\sim\mathfrak{h}_{f}$ and send the index of $h$ to Bob. Bob then outputs $h(x)$ . Thus, for each $f,x$ , it holds that

and so the protocol computes $\mathsf{Eval}_{C}$ with error $\leq\varepsilon$ .

Our equivalence theorems allow us to import many results from communication complexity into the context of private PAC learning, both proving new facts and simplifying proofs of previously known results in the process.

We note that it is known that VC dimension corresponds to the maximal distributional one-way communication complexity over all product input distributions. Hence this separation is analogous to separation of distributional one-way complexity over product distributions and the maximal distributional complexity over all distributions achieved using the greater-than function (Kremer et al., 1999).

Accuracy and confidence boosting.

Our equivalence theorems give a simple alternative way to reduce error in probabilistic and deterministic representations without using sequential boosting as was done in (Beimel et al., 2013a). Given a private PAC learner with constant error, say $(\varepsilon,\delta)=(1/8,1/8)$ , one can first convert the learner to a communication protocol with error $1/4$ , use $O(\log\frac{1}{\varepsilon^{\prime}\delta^{\prime}})$ simple independent repetitions (as in eq.(2.2)) to reduce the error to $\varepsilon^{\prime}\delta^{\prime}$ , and then convert the protocol back into a $(\varepsilon^{\prime},\delta^{\prime})$ -probabilistic representation. The “magic” here happens when we convert between the communication complexity and probabilistic representation using min-max type arguments. This is the same tool that can be used to prove (computationally inefficient) boosting theorems.

Probabilistic vs. deterministic representation dimension.

It was shown by Newman (1991) that public and private coin complexity are the same up to additive logarithmic terms. In our setting (and with a specific choice of error bounds to simplify presentation), Newman’s theorem implies that

Simpler learning algorithms.

Using our equivalence theorems, we can “import” results from communication complexity to give simple private PAC learners. For example, the well-known constant communication equality protocol using inner-product-based hashing can be converted to a probabilistic representation using Theorem 3.1, which can then be used to learn point functions. The resulting learning algorithm is somewhat simpler than the constant sample complexity learner for $\mathsf{Point}_{b}$ described in (Beimel et al., 2010) and we believe that this view also provides useful intuition. We remark that the probabilistic representation for $\mathsf{Point}_{b}$ that results from the communication protocol is known and was used for learning $\mathsf{Point}_{b}$ by Feldman (2009) in the context of evolvability. A closely related representation is also mentioned in (Beimel et al., 2013a).

Lower Bounds via Littlestone’s Dimension

A proof of this lower bound can be easily derived by adapting the proof in (Bar-Yossef et al., 2004) and we include it in Section A.

An immediate corollary of Lemma 4 and Lemma 4 is the following lower bound.

We give a new information-theoretic and simpler proof of Aaronson’s result for randomized public-coin communication. We start with a brief review of basic notions from information theory.

We will use the convention of letting bold-face $\mathbf{a},\mathbf{b}$ denote random variables and regular type $a,b$ denote particular values that those random variables may take.

Recall the following definitions of entropy (all logarithms are base $2$ ):

The Kullback-Leibler divergence (also called relative entropy) is defined as:

For two jointly distributed random variables $\mathbf{x}\mathbf{y}$ , let $\langle\mathbf{x}\rangle\langle\mathbf{y}\rangle$ denote independent samples from the marginal distributions of $\mathbf{x}$ and $\mathbf{y}$ . If conditioned on some event $E$ , we write $(\langle\mathbf{x}\rangle\langle\mathbf{y}\rangle\mid E)=\langle\mathbf{x}\mid E\rangle\langle\mathbf{y}\mid E\rangle$ .

Recall the following characterization of mutual information in terms of Kullback-Leibler divergence:

Let $\mathbf{x}\mathbf{y}$ be jointly distributed random variables. Suppose the support of $\mathbf{y}$ has size $2^{s}$ . Then for every $t>0$ it holds that:

Let $\mathbf{x},\mathbf{y}$ be random variables over a common universe $X$ . Recall the following two equivalent definitions of statistical distance.

We set $\varepsilon=1/5$ and let $\gamma=2(\frac{1}{2}-\frac{1}{p}-2/5)^{2}$ . For $p\leq 128$ the claim obviously holds so we can assume that $p>110$ . This implies that $\gamma\geq 2/121$ and $\log(1/\gamma)\leq 6$ .

By the min-max principle, it suffices to exhibit an input distribution $\mu$ for which

We will show that this is impossible for any one-way protocol $\pi$ that communicates less than $c=\log p-1-\log\frac{1}{\gamma}\geq\log p-7$ bits. Fix any such $\pi$ , and let $m$ be the message sent by Alice.

By the divergence characterization of mutual information, this implies that

On the other hand, observe that the distribution $(\langle\mathbf{x},\mathbf{y}\rangle\langle\mathbf{m}\rangle\mid\mathbf{m}\text{ is good})$ is identical to the distribution $(\mathbf{x}^{\prime},\mathbf{y}^{\prime},\mathbf{m}^{\prime}\mid\mathbf{m}^{\prime}\text{ is good})$ sampled according to the distribution $\mu_{0}$ . (The definition of $\mathbf{m}^{\prime}$ good is the same as for $\mathbf{m}$ , since the marginal distribution of both $\mu_{0}$ and $\mu_{1}$ on the variables $\mathbf{a},\mathbf{b},\mathbf{m}$ is identical.)

Therefore we have by Pinsker’s inequality and Equation 5.3 that:

However, since the output of Bob depends only on $x,y,m$ , it therefore follows that the probability that Bob outputs $1$ under $\mu_{1}$ is less than the probability that Bob outputs $1$ under $\mu_{0}$ plus $1-\frac{1}{p}-2\varepsilon$ , and this contradicts Equation 5.2, proving the theorem.

The joint distribution $\mathbf{a},\mathbf{b},\mathbf{x},\mathbf{y},\mathbf{m}$ drawn from $\mu_{1}$ can be viewed in the following order: first sample $\mathbf{m}$ from the marginal distribution, then sample $\mathbf{a},\mathbf{b}$ conditioned on $\mathbf{m}$ , and then sample $\mathbf{x},\mathbf{y}$ a random point conditioned on $\mathbf{a}\mathbf{x}+\mathbf{b}=\mathbf{y}$ .

Conditioned on $\mathbf{m}=m$ , consider $\mathbf{x}_{1},\mathbf{y}_{1}$ and $\mathbf{x}_{2},\mathbf{y}_{2}$ sampled independently as just described. There are two ways a collision can occur: either $\mathbf{a}_{1},\mathbf{b}_{1}$ and $\mathbf{a}_{2},\mathbf{b}_{2}$ collide or they do not. In the first case $(\mathbf{x}_{1},\mathbf{y}_{1})=(\mathbf{x}_{2},\mathbf{y}_{2})$ occurs with probability $1/p$ , in the second with probability at most $1/p^{2}$ since there is at most one point where the two lines intersect.

Separating pure and (α,β)𝛼𝛽(\alpha,\beta)-differential privacy

For any prime $p$ , any $\varepsilon,\delta,\alpha,\beta\in(0,1/2)$ , one can $(\varepsilon,\delta)$ -accurately learn $\mathsf{Line}_{p}$ with $(\alpha,\beta)$ -differential privacy using $O(\frac{1}{\varepsilon\alpha}\log\frac{1}{\beta}\log\frac{1}{\delta})$ samples.

We prove this theorem in two steps: first we construct a learner with poor dependence on $\delta$ and then amplify using the exponential mechanism to obtain a learner with good dependence on $\delta$ .

For any prime $p$ , any $\varepsilon,\delta,\alpha,\beta\in(0,1/2)$ , it suffices to take $O(\frac{1}{\varepsilon}2^{6/\delta}\cdot\frac{1}{\alpha}\log\frac{1}{\beta\delta})$ samples in order to $(\varepsilon,\delta)$ -learn $\mathsf{Line}_{p}$ with $(\alpha,\beta)$ -differential privacy.

At a high level, we run the basic (non-private) learner based on VC-dimension $O(\frac{1}{\alpha}\log\frac{1}{\beta})$ times. We use the fact that $\mathsf{Line}_{p}$ is stable in that after a constant number of samples, with high probability there is a unique hypothesis that classifies the samples correctly. (This is simply because any two distinct points on a line define the line.) Therefore, in each of the executions of the non-private learner, we are likely to recover the same hypothesis. We can then release this hypothesis $(\alpha,\beta)$ -privately using the “Propose-Test-Release” framework.

The main challenge in implementing this intuition is to eliminate corner cases, where with roughly probability $1/2$ the sample set may contain two distinct positively labeled points and with probability $1/2$ only a single positively labeled point, as this would lead to unstable outputs. We do this by randomizing the number of samples we take.

Let $t$ be a number of samples, to be chosen later. Given $t$ samples $(x_{1},y_{1}),\ldots,(x_{t},y_{t})$ , our basic learner will do the following:

See if there exist two distinct samples $(x_{i},y_{i})\neq(x_{j},y_{j})$ that are both classified positively. If so, output the unique line defined by these points.

Otherwise, see if there exists any sample $(x_{i},y_{i})$ classified positively. Output the point function that outputs $1$ on $(x_{i},y_{i})$ and zero elsewhere.

Otherwise, output the constant hypothesis.

If $c+\Lambda(1/\alpha)>\frac{1}{\alpha}\ln\frac{1}{2\beta}+1$ then output $\overline{h}$ , otherwise output the constant hypothesis.

Here, $\Lambda(1/\alpha)$ denotes the Laplace distribution, whose density function at point $x$ equals $\alpha e^{-\alpha|x|}$ . It is easy to check that adding $\Lambda(1/\alpha)$ to a sum of Boolean values renders that sum $\alpha$ -differentially private (Dwork et al., 2006).

We analyze the overall learner. Observe that once $t$ is fixed, the basic learner is deterministic.

The most frequent hypotheses are different. In this case $c=1$ for both $x,x^{\prime}$ . The probability of not outputting in either case is given by

Accuracy:

we now show that the overall learner $(\varepsilon,\delta)$ -PAC learns. We claim that:

Fix any hidden line $f$ and any input distribution $\mathcal{D}$ . With probability $1-\delta/2$ over the choice of $t$ , there is a unique hypothesis with error $\leq\varepsilon$ that the basic learner will output with probability at least $2/3$ when given $t$ independent samples from $\mathcal{D}$ .

Thus the overall probability of not returning an $\varepsilon$ -good hypothesis is at most $\delta$ .

Proof of Claim 6.1

Define the following events $\mathsf{None}_{t},\mathsf{One}_{t},\mathsf{Two}_{t}$ parameterized by an integer $t>0$ and defined over the probability space of drawing $(x_{1},y_{1}),\ldots,(x_{t},y_{t})$ independently from $\mathcal{D}$ :

$\mathsf{None}_{t}$ is the event that all of the $(x_{i},y_{i})$ are not on the line $(a,b)$ .

$\mathsf{One}_{t}$ is the event that there exists some $(x_{i},y_{i})$ on the line $(a,b)$ , and furthermore for every other $(x_{j},y_{j})$ on the line $(a,b)$ is in fact equal to $(x_{i},y_{i})$ .

$\mathsf{Two}_{t}$ is the event that there exists distinct $(x_{i},y_{i})\neq(x_{j},y_{j})$ that are both on the line $(a,b)$ .

Next we will show that with probability $1-\delta/2$ over the choice of $t$ , one of these three events has probability at least $2/3$ , and then we show that this suffices to imply the claim.

The characterizations for $\mathsf{None}_{t},\mathsf{Two}_{t}$ are obvious. The characterization of $\mathsf{One}_{t}$ is exactly the probability over all $(x,y)\in f^{-1}(1)$ that all samples are either labeled or equal $(x,y)$ , excluding the event that they are all labeled .

From the above and by considering the $(x,y)$ maximizing $q_{x,y}$ , we have the following bounds:

t𝑡t is good with high probability.

Let us say that $t$ is good for $\mathsf{None}_{t}$ if $t\leq\frac{1}{3r}$ . We say $t$ is good for $\mathsf{One}_{t}$ if $t\in[\frac{\ln 6}{r},\frac{1}{6(r-q)}]$ . We say $t$ is good for $\mathsf{Two}_{t}$ if $t\geq\frac{2\ln 6}{r-q}$ . (It is possible that some of these events may be empty, but this does not affect our argument.) Using Equation 6.1, Equation 6.2 and Equation 6.3, it is clear that if $t$ is good for some event, then the probability of that event is at least $2/3$ .

Let us say $t$ is good if it is good for any one of $\mathsf{None}_{t},\mathsf{One}_{t},\mathsf{Two}_{t}$ . $t$ is good means the following when viewed on the logarithmic scale:

But this means that $t$ is bad on the logarithmic scale is equivalent to:

Thus, for any $r$ , there are at most $3$ integer values of $\log t$ that are bad. But recall that $t=2^{k}$ where $k$ is uniformly chosen from $\{\log(\ln(3/2)/\varepsilon),\ldots,\log(\ln(3/2)/\varepsilon)+6/\delta\}$ . Therefore the probability that $k=\log t$ is one of the bad values defined in Equation 6.4 is at most $\delta/2$ .

When t𝑡t is good, basic learner outputs unique accurate hypothesis.

To conclude, we argue that when $t$ is good then the basic learner will output a unique hypothesis with error $\leq\varepsilon$ with probability $\geq 2/3$ . This is obvious when $t$ is good for $\mathsf{Two}_{t}$ , since whenever the basic learner sees two points on the line, it recovers the exact line. It is also easy to see that when $t$ is good for $\mathsf{None}_{t}$ , the basic learner outputs the hypothesis with probability $2/3$ , and this has error at most $\varepsilon$ since

It remains to argue that the basic learner outputs a unique hypothesis with error at most $\varepsilon$ when $t$ is good for $\mathsf{One}_{t}$ . Observe that we have actually set the parameters so that when $t$ is good for $\mathsf{One}_{t}$ , it holds that:

where $(x_{\max},y_{\max})=\operatorname*{argmax}_{(x,y)\in f^{-1}(1)}q_{x,y}$ . Therefore, for such $t$ , the basic learner will output the point function that is positive on exactly $(x_{\max},y_{\max})$ with probability at least $2/3$ .

To show that this point function has error at most $\varepsilon$ , it suffices to prove that

Conclusions and Open Problems

While we focus on differential privacy our lower bounds have immediate implications in other settings where information about the examples needs to be stored or transmitted for the purposes of classification, such as distributed computation, streaming and low-memory computation.

Acknowledgements

We are grateful to Kobbi Nissim for first drawing our attention to the intriguing problem of understanding the relationship between probabilistic representation dimension and VC dimension, and for valuable discussions regarding the sample complexity of privately learning threshold functions. We thank Nina Balcan and Avrim Blum who brought up the relationship of our bounds for intervals in Section 3.1 to those based on Littlestone’s dimension. Their insightful comments and questions have lead to our result in Theorem 1.2. We also thank Sasha Rakhlin and Sasha Sherstov for useful suggestions and references.

D.X. was supported in part by the French ANR Blanc program under contract ANR-12-BS02-005 (RDAM project), by NSF grant CNS-1237235, a gift from Google, Inc., and a Simons Investigator grant to Salil Vadhan.

References

Appendix A Proof of Lemma 4

where $\mu$ is the input distribution that samples uniform $\mathbf{x}\stackrel{{\scriptstyle R}}{{\leftarrow}}\{0,1\}^{d}$ and uniform $\mathbf{i}\stackrel{{\scriptstyle R}}{{\leftarrow}}[d]$ .

Consider any deterministic protocol $\pi$ computing $\mathsf{AugIndex}$ with error at most $\varepsilon$ over $\mu$ , and suppose that $\pi$ uses $\delta\cdot d$ communication. Let $(\mathbf{x},\mathbf{i})\sim\mu$ . Then $\mathbf{I}(\pi_{A}(\mathbf{x});\mathbf{x})\leq\delta\cdot d$ . By the chain rule for mutual information it follows that

We therefore deduce that (for $\mathbf{i}$ uniform over $[d]$ ):

Appendix B Ldim Lower Bound for Halfspaces

Recall that $\mathsf{HS}_{b}^{d}$ denotes the concept class of all halfspaces over $I_{b}^{d}$ , where $I_{b}=\{0,1,\ldots,2^{b}-1\}$ . Our proof is based on the technique used in [Muroga, 1971] to prove a lower bound of $2^{d(d-1)}$ on the total number of distinct halfspaces over $\{0,1\}^{d}$ . As a first step we prove the following simple lemma (for $b=1$ it can also be found in [Muroga, 1971]).

$(w,\theta)$ represents $f$ , that is $f(x)=1$ if and only if $w\cdot x\geq\theta$ ;

for every two distinct $x,x^{\prime}\in I_{b}^{d}$ , $w\cdot x\neq w\cdot x^{\prime}$ .

We refer to such a representation of $f$ as collision-free.

Let $(w^{\prime},\theta^{\prime})$ be any integer weight representation of $f$ (such representation always exists for a halfspace over integer points). We first create a margin around the decision boundary by setting $w^{\prime\prime}=2^{d+1}w^{\prime}$ and $\theta=2^{d+1}\theta^{\prime}-2^{d}$ . Note that $(w^{\prime\prime},\theta)$ also represents $f$ and, in addition, for every $x$ , $|w^{\prime\prime}\cdot x-\theta|\geq 2^{d}$ . We now define for every $i\in[d]$ , $w_{i}=w^{\prime\prime}_{i}+2^{i}$ . It is easy to see that $|w\cdot x-w^{\prime\prime}\cdot x|\leq 2^{d}-1$ and therefore $(w,\theta)$ represents $f$ . Further, $d$ least significant bits in the binary representation of $w\cdot x$ are exactly equal to $x$ and therefore the second condition is also satisfied.

We construct a mistake tree $T_{d}$ over $\mathsf{HS}_{b}^{d}$ and $I_{b}^{d}$ inductively over the dimension $d$ . For $d=1$ , $\mathsf{HS}_{b}^{d}$ includes all threshold functions on $I_{b}$ and therefore we define $T_{1}$ is the complete binary tree representing the binary search on this interval. Note that the depth of this tree is $b$ .

Appendix C Improving Dependence on δ𝛿\delta in Theorem 6.1

We now improve the exponential dependence on $1/\delta$ in subsection 6.1 to prove Theorem 6.1. We first introduce the exponential mechanism of McSherry and Talwar . For simplicity we only describe its restriction to the learning setting. Let $H$ be a hypothesis class and define the “quality score function” $q(S,h)=|\{(x,y)\in S\mid h(x)=y\}|$ . The exponential mechanism for $q$ with privacy $\alpha$ is the following: given an input $S\in(X\times\{0,1\})^{n}$ , output $h\in H$ according to the distribution $\mathsf{EM}(S)$ given by

We use the following theorem about the exponential mechanism. Let $q_{\max}(S)=\max_{h\in H}q(S,h)$ .

The exponential mechanism is $\alpha$ -differentially private. Furthermore, for all $S\in(X\times\{0,1\})^{n}$ and all $t>0$ , it holds that

We will use the algorithm of subsection 6.1 with $\delta=\Theta(1)$ in order to construct a small set of hypotheses from which we’ll then select one using the exponential mechanism. More precisely:

Set $k=\log(2/\delta)/\log(4/3)$ . Run the algorithm of subsection 6.1 $k$ times independently with fresh samples and with $(\frac{\varepsilon}{4},\frac{1}{4})$ -accuracy and $(\alpha,\beta)$ -differential privacy. Call the resulting hypotheses $H=\{h_{1},\ldots,h_{k}\}$ .

Sample $m=\frac{16}{\varepsilon\alpha}\log\frac{4k}{\delta}$ additional samples, call this set $S$ . Use the exponential mechanism to output $h\in H$ .

The mechanism is $(\alpha,\beta)$ -differentially private because samples used to produce $h_{1},\ldots,h_{k}$ are learned with $(\alpha,\beta)$ differential privacy (notice each sample can only affect one of the $h_{i}$ , therefore the privacy loss does not add up when considering the set of all hypotheses). Also, the samples used to pick $h\in H$ are used via the exponential mechanism, which is also $\alpha$ -differentially private.

To analyze the accuracy, observe that since each of the $h_{i}$ is produced using an $(\frac{\varepsilon}{4},\frac{1}{4})$ -accurate learner, therefore by independence of the executions and our choice of $k$ , it holds with probability $\geq 1-(1-\frac{1}{4})^{k}\geq 1-\frac{\delta}{2}$ that $H$ contains some $h$ that has error at most $\frac{\varepsilon}{4}$ .

This implies that with probability $1-3\delta/4$ over the probability of computing $H$ and sampling $S$ , it suffices to output some $h\in H$ such that $q(S,h)\geq\max_{h^{\prime}\in H}q(S,h^{\prime})-\varepsilon/8$ . This is because we are in the case where $H$ contains a hypothesis with error $\leq\varepsilon/4$ , and therefore by Equation 3.2 it holds that $\max_{h^{\prime}\in H}q(S,h^{\prime})>|S|(1-3\varepsilon/8)$ and therefore any such $h$ will have error $\leq\varepsilon/2$ over $S$ . By Equation 3.1, we deduce that any such $h$ must have error $\leq\varepsilon$ over $\mathcal{D}$ .

By Theorem C.1, the probability that the exponential mechanism outputs such a $h$ is at least $1-ke^{-\alpha\varepsilon m/16}\geq 1-\delta/4$ . Therefore the overall probability of outputting an $\varepsilon$ -good hypothesis is at least $1-\delta$ .