Private Learning and Sanitization: Pure vs. Approximate Differential Privacy

Amos Beimel, Kobbi Nissim, Uri Stemmer

Introduction

Learning is often applied to collections of sensitive data of individuals and it is important to protect the privacy of these individuals. We examine the sample complexity of private learning and a related task, sanitization, while preserving differential privacy. We show striking differences between the required sample complexity for these tasks under $\epsilon$-differential privacy (also called pure differential privacy) and its variant $(\epsilon,\delta)$-differential privacy (also called approximate differential privacy).

Differential privacy protects the privacy of individuals by requiring that the information of an individual does not significantly affect the output. More formally, an algorithm $A$ satisfies the requirement of Pure Differential Privacy if for every two databases that differ on exactly one entry, and for every event defined over the output set of $A$ , the probability of this event is close up to a multiplicative factor of $e^{\epsilon}\approx 1+\epsilon$ whether $A$ is applied on one database or on the other. Approximate Differential Privacy is a relaxation of pure differential privacy where the above guarantee needs to be satisfied only for events whose probability is at least $\approx\delta$ . We show that even negligible $\delta>0$ can have a significant effect on sample complexity of private learning and sanitization.

Private learning was introduced by Kasiviswanathan et al. as a combination of Valiant’s PAC learning model and differential privacy. For now, we can think of a private learner as a differentially private algorithm that operates on a set of classified random examples, and outputs a hypothesis that misclassifies fresh examples with probability at most (say) $\frac{1}{10}$. The work on private learning has mainly focused on pure privacy. On the one hand, Blum et al. and Kasiviswanathan et al. have shown, via generic constructions, that every finite concept class $C$ can be learned privately, using sample complexity proportional to $\mathop{\rm{poly}}\nolimits(\log|C|)$ (often efficiently). On the other hand, a significant difference was shown between the sample complexity of traditional (non-private) learners (crystallized in terms of $\operatorname{\rm VC}(C)$ and smaller than $\log|C|$ in many interesting cases) and private learners. As an example, let $\operatorname{\tt POINT}_{d}$ be the class of point functions over the domain $\{0,1\}^{d}$ (these are the functions that evaluate to one on exactly one point of the domain and to zero elsewhere). Consider the task of properly learning $\operatorname{\tt POINT}_{d}$ where, after consulting its sample, the learner outputs a hypothesis that is by itself in $\operatorname{\tt POINT}_{d}$. Non-privately, learning $\operatorname{\tt POINT}_{d}$ requires merely a constant number of examples (as $\operatorname{\rm VC}(\operatorname{\tt POINT}_{d})=1$). Privately, $\Omega(d)$ examples are required. Curiously, the picture changes when the private learner is allowed to output a hypothesis not in $\operatorname{\tt POINT}_{d}$ (such learners are called improper), as the sample complexity can be reduced to $O(1)$. This, however, comes with a price, as it was shown that such learners must return hypotheses that evaluate to one on exponentially many points in $\{0,1\}^{d}$ and, hence, are very far from all functions in $\operatorname{\tt POINT}_{d}$.

A complete characterization of the sample complexity of pure-private learners was recently given in terms of a new dimension, the Representation Dimension: given a class $C$, the number of samples necessary and sufficient for privately learning $C$ is $\Theta(\operatorname{\rm RepDim}(C))$. Following that, Feldman and Xiao showed an equivalence between the representation dimension of a concept class $C$ and the randomized one-way communication complexity of the evaluation problem for concepts from $C$. Using this equivalence they separated the sample complexity of pure-private learners from that of non-private ones. For example, they showed a lower bound of $\Omega(d)$ on the sample complexity of every pure-private (proper or improper) learner for the class $\operatorname*{\tt THRESH}_{d}$ of threshold functions over the interval $[0,2^{d}-1]$. This is a strong separation from the non-private sample complexity, which is $O(1)$ (as the VC dimension of this class is constant).

We show that the sample complexity of proper learning with approximate differential privacy can be significantly lower than that satisfying pure differential privacy. Our starting point for this work is an observation that with approximate $(\epsilon,\delta)$ -differential privacy, sample complexity of $O(\log(1/\delta))$ suffices for learning points properly. This gives a separation between pure and approximate proper private learning for $\delta=2^{-o(d)}$ .

The notion of differentially private sanitization was introduced in the work of Blum et al. . A sanitizer for a class of predicates $C$ is a differentially private mechanism translating an input database $S$ to an output database $\hat{S}$ such that $\hat{S}$ (approximately) agrees with $S$ on the fraction of the entries satisfying $\varphi$ for all $\varphi\in C$ , where every predicate $\varphi\in C$ is a function from $X$ to $\{0,1\}$ . Blum et al. gave a generic construction of pure differentially private sanitizers exhibiting sample complexity $O(\operatorname{\rm VC}(C)\log|X|)$ . Lower bounds partially supporting this sample complexity were given by . As with private learning, we show significant differences between the sample complexity required for sanitization of simple predicate classes under pure and approximate differential privacy. We note that the construction of sanitizers is not generally computationally feasible .

1 Our Contributions

To simplify the exposition, we omit in this section dependency on all variables except for $d$ , corresponding to the representation length of domain elements.

A recent instantiation of the Propose-Test-Release (PTR) framework by Smith and Thakurta yields, almost immediately, a proper learner for points, exhibiting $O(1)$ sample complexity while preserving approximate differential privacy. This simple technique does not suffice for our other constructions of learners and sanitizers, and we, hence, introduce new tools for coping with proper private learning of thresholds and axis-aligned rectangles, and sanitization for point functions and thresholds:

Choosing mechanism: Given a low-sensitivity quality function, one can use the exponential mechanism to choose an approximately maximizing solution. This requires, in general, a database of size logarithmic in the number of possible solutions. We identify a sub family of low-sensitivity functions, called bounded-growth functions, for which it is possible to significantly reduce the necessary database size when using the exponential mechanism.

Recursive algorithm for quasi-concave promise problems: We define a family of optimization problems, which we call Quasi-Concave Promise Problems. The possible solutions are ordered, and quasi-concavity means that if two solutions $f\leq h$ have quality of at least $\mathcal{X}$ , then any solution $f\leq g\leq h$ also has quality of at least $\mathcal{X}$ . The optimization goal is, when there exists a solution with a promised quality of (at least) $r$ , to find a solution with quality $\approx r$ . We observe that a quasi-concave promise problem can be privately approximated using a solution to a smaller instance of a quasi-concave promise problem. This allows us to construct an efficient recursive algorithm solving such problems privately. We show that the task of learning $\operatorname*{\tt THRESH}_{d}$ is, in fact, a quasi-concave promise problem, and it can be privately solved using our algorithm with sample size roughly $2^{O(\log^{*}d)}$ . Sanitization for $\operatorname*{\tt THRESH}_{d}$ does not exactly fit the model of quasi-concave promise problems but can still be solved by iteratively defining and solving a small number of quasi-concave promise problems.

We give new private proper-learning algorithms for the classes $\operatorname{\tt POINT}_{d}$ and $\operatorname*{\tt THRESH}_{d}$ . We also construct a new private proper-learner for (a discrete version of) the class of all axis-aligned rectangles over $n$ dimensions. Our algorithms exhibit sample complexity that is significantly lower than bounds given in prior work, separating pure and approximate private learning. Similarly, we construct sanitizers for $\operatorname{\tt POINT}_{d}$ and $\operatorname*{\tt THRESH}_{d}$ , again with sample complexity that is significantly lower than bounds given in prior work, separating sanitization in the pure and approximate privacy cases. Our algorithms are time-efficient.

Gupta et al. have given reductions in both directions between agnostic learning of a concept class $C$, and the sanitization task for the same class $C$. The learners and sanitizers they consider are limited to accessing their data via statistical queries (such algorithms can be easily transformed to satisfy differential privacy). In Section 5 we show a similar reduction from the task of privately learning a concept class $C$ to the sanitization task of $C$, where the sanitizer’s access to the database is unrestricted. This allows us to exploit lower bounds on the sample complexity of private learners and show an explicit class of predicates $C$ over a domain $X$ for which every private sanitizer requires databases of size $\Omega(\operatorname{\rm VC}(C)\log|X|)$. A similar lower bound was shown by Hardt and Rothblum, achieving tighter results in terms of the approximation parameter. Their work proves the existence of such a concept class, but does not give an explicit one.

In Section 6 we examine private learning under a relaxation of differential privacy called label privacy (see and references therein), where the learner is required to only protect the privacy of the labels in the sample. Chaudhuri et al. have proved lower bounds for label-private learners in terms of the doubling dimension of the target concept class. We show that the VC dimension completely characterizes the sample complexity of such learners, that is, the sample complexity of learning with label privacy is equal (up to constants) to learning without privacy.

2 Open Questions

This work raises two kinds of research directions. First, this work presents (time and sample efficient) private learners and sanitizers for relatively simple concept classes. It would be natural to try and construct private learners and sanitizers for more complex concept classes. In particular, constructing a (time and sample efficient) private learner for hyperplanes would be very interesting to the community.

Another very interesting research direction is to try and understand the sample complexity of approximate-private learners. Currently, no lower bounds are known on the sample complexity of such learners. On the other hand, no generic construction for such learners is known to improve the sample complexity achieved by the generic construction of Kasiviswanathan et al. for pure-private learners. Characterizing the sample complexity of approximate-private learners is a very interesting open question.

3 Other Related Work

Most related to our work is the work on private learning and its sample complexity and the early work on sanitization mentioned above. Another related work is the work of De , who proved a separation between pure $\epsilon$ -differential privacy and approximate $(\epsilon,\delta)$ -differential privacy. Specifically, he demonstrated that there exists a query where it is sufficient to add noise $O(\sqrt{n\log(1/\delta)})$ when $\delta>0$ and $\Omega(n)$ noise is required when $\delta=0$ . Earlier work by Hardt and Talwar separated pure from approximate differential privacy for $\delta=n^{-O(1)}$ .

Another interesting gap between pure and approximate differential privacy is the following. Blum et al. have given a generic construction of pure-private sanitizers, in which the sample complexity grows as $\frac{1}{\alpha^{3}}$ (where $\alpha$ is the approximation parameter). Following that, Hardt and Rothblum showed that with approximate privacy, the sample complexity can be reduced to grow as $\frac{1}{\alpha^{2}}$. Currently, it is unknown whether this gap is essential.

Preliminaries

We use $O_{\gamma}(f(t))$ as a shorthand for $O(h(\gamma)\cdot f(t))$ for some non-negative function $h$ . In informal discussions, we sometimes use $\widetilde{O}(f(t))$ instead of $O(f(t)\cdot{\rm polylog}(f(t)))$ . For example, $2^{\log^{*}(d)}\cdot\log^{*}(d)=\widetilde{O}\left(2^{\log^{*}(d)}\right)$ .

We use $X$ to denote an arbitrary domain, and $X_{d}$ for the domain $\{0,1\}^{d}$. We use $X^{m}$ (and respectively $X_{d}^{m}$) for the cartesian $m^{\text{th}}$ power of $X$, i.e., $X^{m}=(X)^{m}$, and use $X^{*}=\bigcup_{m=0}^{\infty}{X^{m}}$.

Given a distribution $\mathcal{D}$ over a domain $X$ , we denote $\mathcal{D}(j)\triangleq\Pr_{x\sim\mathcal{D}}[x=j]$ for $j\in X$ , and $\mathcal{D}(J)\triangleq\Pr_{x\sim\mathcal{D}}[x\in J]$ for $J\subseteq X$ .

1 Differential Privacy

Differential privacy aims at protecting information of individuals. We consider a database, where each entry contains information pertaining to an individual. An algorithm operating on databases is said to preserve differential privacy if a change of a single record of the database does not significantly change the output distribution of the algorithm. Intuitively, this means that whatever is learned about an individual could also be learned with her data arbitrarily modified (or without her data). Formally:

Databases $S_{1}\in X^{m}$ and $S_{2}\in X^{m}$ over a domain $X$ are called neighboring if they differ in exactly one entry.

A randomized algorithm $A$ is $(\epsilon,\delta)$-differentially private if for all neighboring databases $S_{1},S_{2}\in X^{m}$, and for all sets $\mathcal{F}$ of outputs,

$$\Pr[A(S_{1})\in\mathcal{F}]\leq\exp(\epsilon)\cdot\Pr[A(S_{2})\in\mathcal{F}]+\delta.\tag{1}$$

The probability is taken over the random coins of $A$ . When $\delta=0$ we omit it and say that $A$ preserves $\epsilon$ -differential privacy.

We use the term pure differential privacy when $\delta=0$ and the term approximate differential privacy when $\delta>0$ , in which case $\delta$ is typically a negligible function of the database size $m$ .
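As a small illustration of the definition, the following sketch checks the $(\epsilon,\delta)$ inequality exhaustively for a toy mechanism. It is not taken from this paper: it assumes a database with a single binary entry and uses randomized response as the mechanism, and the function names `randomized_response_dist` and `satisfies_dp` are hypothetical.

```python
import math
from itertools import chain, combinations

def randomized_response_dist(bit, eps):
    """Output distribution of randomized response applied to one bit."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def satisfies_dp(dist1, dist2, eps, delta=0.0, tol=1e-12):
    """Check the (eps, delta) inequality for every event, in both directions."""
    outputs = sorted(set(dist1) | set(dist2))
    events = chain.from_iterable(
        combinations(outputs, r) for r in range(len(outputs) + 1)
    )
    for event in events:
        p1 = sum(dist1.get(o, 0.0) for o in event)
        p2 = sum(dist2.get(o, 0.0) for o in event)
        if p1 > math.exp(eps) * p2 + delta + tol:
            return False
        if p2 > math.exp(eps) * p1 + delta + tol:
            return False
    return True

# Two neighboring one-entry databases: S1 = (0) and S2 = (1).
on_zero = randomized_response_dist(0, eps=0.5)
on_one = randomized_response_dist(1, eps=0.5)
print(satisfies_dp(on_zero, on_one, eps=0.5))              # pure privacy at eps = 0.5
print(satisfies_dp(on_zero, on_one, eps=0.25))             # a smaller eps is violated
print(satisfies_dp(on_zero, on_one, eps=0.25, delta=0.2))  # the slack is absorbed by delta
```

The last call shows the sense in which approximate privacy is a relaxation: the same pair of output distributions fails the pure inequality at $\epsilon=0.25$ but passes once an additive $\delta$ is allowed.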

We will later present algorithms that access their input database using (several) differentially private mechanisms. We will use the following composition theorems.

If $A_{1}$ and $A_{2}$ satisfy $(\epsilon_{1},\delta_{1})$ and $(\epsilon_{2},\delta_{2})$ differential privacy, respectively, then their concatenation $A(S)=\langle A_{1}(S),A_{2}(S)\rangle$ satisfies $(\epsilon_{1}+\epsilon_{2},\delta_{1}+\delta_{2})$ -differential privacy.

Moreover, a similar theorem holds for the adaptive case, where a mechanism interacts with $k$ adaptively chosen differentially private mechanisms.

A mechanism that permits $k$ adaptive interactions with mechanisms that preserve $(\epsilon,\delta)$-differential privacy (and does not access the database otherwise) ensures $(k\epsilon,k\delta)$-differential privacy.

Note that the privacy guarantees of the above bound deteriorate linearly with the number of interactions. By bounding the expected privacy loss in each interaction (as opposed to worst-case), Dwork et al. showed the following stronger composition theorem, where privacy deteriorates (roughly) as $\sqrt{k}\epsilon+k\epsilon^{2}$ (rather than $k\epsilon$).

Let $0<\epsilon,\delta^{\prime}\leq 1$, and let $\delta\in[0,1]$. A mechanism that permits $k$ adaptive interactions with mechanisms that preserve $(\epsilon,\delta)$-differential privacy (and does not access the database otherwise) ensures $(\epsilon^{\prime},k\delta+\delta^{\prime})$-differential privacy, for $\epsilon^{\prime}=\sqrt{2k\ln(1/\delta^{\prime})}\cdot\epsilon+2k\epsilon^{2}$.
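To get a feel for the difference between the two composition bounds, the following sketch evaluates both formulas numerically. The function names and the parameter values are illustrative assumptions, not part of the theorems.

```python
import math

def basic_composition(eps, delta, k):
    """Privacy parameters after k interactions under the (k*eps, k*delta) bound."""
    return k * eps, k * delta

def advanced_composition(eps, delta, k, delta_prime):
    """Privacy parameters under the stronger bound:
    eps' = sqrt(2k ln(1/delta')) * eps + 2k * eps^2, with total delta k*delta + delta'."""
    eps_total = math.sqrt(2 * k * math.log(1 / delta_prime)) * eps + 2 * k * eps ** 2
    return eps_total, k * delta + delta_prime

# Illustrative values: 1000 interactions, each (0.01, 1e-9)-differentially private.
basic_eps, basic_delta = basic_composition(0.01, 1e-9, 1000)
adv_eps, adv_delta = advanced_composition(0.01, 1e-9, 1000, delta_prime=1e-6)
print(basic_eps, basic_delta)
print(adv_eps, adv_delta)
```

For these values the advanced bound gives a noticeably smaller $\epsilon^{\prime}$, at the price of the extra additive $\delta^{\prime}$. For small $k$ or large $\epsilon$ the basic bound can be the better of the two, so in practice one would take the minimum.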

2 Preliminaries from Learning Theory

A concept $c:X\rightarrow\{0,1\}$ is a predicate that labels examples taken from the domain $X$ by either 0 or 1. A concept class $C$ over $X$ is a set of concepts (predicates) mapping $X$ to $\{0,1\}$ . A learning algorithm is given examples sampled according to an unknown probability distribution $\mathcal{D}$ over $X$ , and labeled according to an unknown target concept $c\in C$ . The learning algorithm is successful when it outputs a hypothesis $h$ that approximates the target concept over samples from $\mathcal{D}$ . More formally:

The generalization error of a hypothesis $h:X\rightarrow\{0,1\}$ is defined as

$${\rm error}_{\mathcal{D}}(c,h)=\Pr_{x\sim\mathcal{D}}[h(x)\neq c(x)].$$

If ${\rm error}_{\mathcal{D}}(c,h)\leq\alpha$ we say that $h$ is $\alpha$ -good for $c$ and $\mathcal{D}$ .

Algorithm $A$ is an $(\alpha,\beta,m)$-PAC learner for a concept class $C$ over $X$ using hypothesis class $H$ if for all concepts $c\in C$, all distributions $\mathcal{D}$ on $X$, given an input of $m$ samples $S=(z_{1},\ldots,z_{m})$, where $z_{i}=(x_{i},c(x_{i}))$ and each $x_{i}$ is drawn i.i.d. from $\mathcal{D}$, algorithm $A$ outputs a hypothesis $h\in H$ satisfying

$$\Pr[{\rm error}_{\mathcal{D}}(c,h)\leq\alpha]\geq 1-\beta.$$

The probability is taken over the random choice of the examples in $S$ according to $\mathcal{D}$ and the coin tosses of the learner $A$ . If $H\subseteq C$ then $A$ is called a proper PAC learner; otherwise, it is called an improper PAC learner.

For a labeled sample $S=(x_{i},y_{i})_{i=1}^{m}$, the empirical error of $h$ is

$${\rm error}_{S}(h)=\frac{1}{m}\left|\{i:h(x_{i})\neq y_{i}\}\right|.$$

2.2 The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis (VC) Dimension is a combinatorial measure of concept classes, which characterizes the sample size of PAC learners.

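Let $C$ be a concept class over a domain $X$. For a set $B=\{b_{1},\ldots,b_{\ell}\}\subseteq X$ of $\ell$ points, let $\Pi_{C}(B)=\{(c(b_{1}),\ldots,c(b_{\ell})):c\in C\}$ be the set of all dichotomies that are realized by $C$ on $B$. The set $B$ is shattered by $C$ if $|\Pi_{C}(B)|=2^{\ell}$. That is, $B$ is shattered by $C$ if $C$ realizes all possible dichotomies over $B$.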

The VC-Dimension of a concept class $C$ (over a domain $X$ ), denoted as $\operatorname{\rm VC}(C)$ , is the cardinality of the largest set $B\subseteq X$ shattered by $C$ . If arbitrarily large finite sets can be shattered by $C$ , then $\operatorname{\rm VC}(C)=\infty$ .

Observe that as $|\Pi_{C}(B)|\leq|C|$, a set $B$ can be shattered only if $|B|\leq\log|C|$ and hence $\operatorname{\rm VC}(C)\leq\log|C|$.
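The definition can be checked by brute force on tiny domains. The sketch below is only an illustration: it enumerates subsets of an 8-element domain and tests shattering for point functions and threshold functions (the two classes discussed in the introduction). The helper names are hypothetical, and the search is exponential in the domain size.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if the concepts realize all 2^|points| labelings of the points."""
    patterns = {tuple(c(x) for x in points) for c in concepts}
    return len(patterns) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Size of the largest shattered subset of a small finite domain."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(B, concepts) for B in combinations(domain, size)):
            best = size
        else:
            # Subsets of shattered sets are shattered, so no larger set can work.
            break
    return best

domain = list(range(8))
# Point functions: c_j(x) = 1 iff x == j.
points = [lambda x, j=j: int(x == j) for j in domain]
# Threshold functions: c_j(x) = 1 iff x < j, for 0 <= j <= 8.
thresholds = [lambda x, j=j: int(x < j) for j in range(len(domain) + 1)]

print(vc_dimension(domain, points))
print(vc_dimension(domain, thresholds))
```

Both classes have VC dimension 1 here even though they contain 8 and 9 concepts, which is the kind of gap between $\operatorname{\rm VC}(C)$ and $\log|C|$ that the observation above allows.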

2.3 VC Bounds

Classical results in computational learning theory state that a sample of size $\Theta(\operatorname{\rm VC}(C))$ is both necessary and sufficient for the PAC learning of a concept class $C$. The following two theorems give upper and lower bounds on the sample complexity.

Any algorithm for PAC learning a concept class $C$ must have sample complexity $\Omega(\frac{\operatorname{\rm VC}(C)}{\alpha})$ , where $\alpha$ is the approximation parameter.

Let $C$ and $\mathcal{D}$ be a concept class and a distribution over a domain $X$. Let $\alpha,\beta>0$, and $m\geq\frac{8}{\alpha}(\operatorname{\rm VC}(C)\ln(\frac{16}{\alpha})+\ln(\frac{2}{\beta}))$. Fix a concept $c\in C$, and suppose that we draw a sample $S=(x_{i},y_{i})_{i=1}^{m}$, where $x_{i}$ are drawn i.i.d. from $\mathcal{D}$ and $y_{i}=c(x_{i})$. Then,

$$\Pr\left[\exists h\in C\text{ s.t. }{\rm error}_{\mathcal{D}}(c,h)>\alpha\ \wedge\ {\rm error}_{S}(h)=0\right]\leq\beta.$$

So, for any concept class $C$, any algorithm that takes a sample of $m=\Omega_{\alpha,\beta}(\operatorname{\rm VC}(C))$ labeled examples and produces as output a concept $h\in C$ that agrees with the sample is a PAC learner for $C$. Such an algorithm is a PAC learner for $C$ using $C$ (that is, both the target concept and the returned hypotheses are taken from the same concept class $C$), and, therefore, there always exists a hypothesis $h\in C$ with ${\rm error}_{S}(h)=0$ (e.g., the target concept itself).

The next theorem handles (in particular) the agnostic case, in which a learning algorithm for a concept class $C$ uses a hypothesis class $H\neq C$, and given a sample $S$ (labeled by some $c\in C$), a hypothesis $h$ with ${\rm error}_{S}(h)=0$ might not exist in $H$.

Let $\mathcal{D}$ and $H$ be a distribution and a concept class over a domain $X$, and let $f:X\rightarrow\{0,1\}$ be some concept, not necessarily in $H$. For a sample $S=(x_{i},f(x_{i}))_{i=1}^{m}$ where $m\geq\frac{50\operatorname{\rm VC}(H)}{\alpha^{2}}\ln(\frac{1}{\alpha\beta})$ and $\{x_{i}\}$ are drawn i.i.d. from $\mathcal{D}$, it holds that

$$\Pr\left[\forall h\in H:\ \left|{\rm error}_{\mathcal{D}}(f,h)-{\rm error}_{S}(h)\right|\leq\alpha\right]\geq 1-\beta.$$

Notice that in the agnostic case the sample complexity is proportional to $\frac{1}{\alpha^{2}}$ , as opposed to $\frac{1}{\alpha}$ when learning a class $C$ using $C$ .
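The gap between the two regimes is easy to see numerically. The following sketch simply evaluates the two sample-size conditions stated in the theorems above; the function names and the chosen parameter values are illustrative assumptions.

```python
import math

def realizable_sample_size(vc, alpha, beta):
    """Smallest integer m with m >= (8/alpha) * (VC * ln(16/alpha) + ln(2/beta))."""
    return math.ceil((8 / alpha) * (vc * math.log(16 / alpha) + math.log(2 / beta)))

def agnostic_sample_size(vc, alpha, beta):
    """Smallest integer m with m >= (50 * VC / alpha^2) * ln(1/(alpha*beta))."""
    return math.ceil((50 * vc / alpha ** 2) * math.log(1 / (alpha * beta)))

# A class of VC dimension 1 (such as points or thresholds), alpha = 0.1, beta = 0.05.
m_realizable = realizable_sample_size(1, 0.1, 0.05)
m_agnostic = agnostic_sample_size(1, 0.1, 0.05)
print(m_realizable, m_agnostic)
```

Neither value depends on the size of the domain or of the class, which is the non-private baseline that the private bounds in this paper are compared against.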

3 Private Learning

In private learning, we would like to accomplish the same goal as in non-private learning, while protecting the privacy of the input database.

Let $A$ be an algorithm that gets an input $S=\{z_{1},\ldots,z_{m}\}$ . Algorithm $A$ is an $(\alpha,\beta,\epsilon,\delta,m)$ -PPAC learner for a concept class $C$ over $X$ using hypothesis class $H$ if

Privacy. Algorithm $A$ is $(\epsilon,\delta)$ -differentially private (as in Definition 2.2);

Utility. Algorithm $A$ is an $(\alpha,\beta,m)$ -PAC learner for $C$ using $H$ (as in Definition 2.7).

When $\delta=0$ (pure privacy) we omit it from the list of parameters.

Note that the utility requirement in the above definition is an average-case requirement, as the learner is only required to do well on typical samples (i.e., samples drawn i.i.d. from a distribution $\mathcal{D}$ and correctly labeled by a target concept $c\in C$ ). In contrast, the privacy requirement is a worst-case requirement, and Inequality (1) must hold for every pair of neighboring databases (no matter how they were generated, even if they are not consistent with any concept in $C$ ).

4 Sanitization

Given a database $S=(x_{1},\ldots,x_{m})$ containing elements from some domain $X$ , the goal of sanitization mechanisms is to output (while preserving differential privacy) another database $\hat{S}$ that is in some sense similar to $S$ . This returned database $\hat{S}$ is called a sanitized database.

Let $c:X\rightarrow\{0,1\}$ be a concept. The counting query $Q_{c}:X^{*}\rightarrow[0,1]$ is

$$Q_{c}(S)=\frac{1}{|S|}\cdot\left|\{i:c(x_{i})=1\}\right|.$$

That is, $Q_{c}(S)$ is the fraction of the entries in $S$ that satisfy the concept $c$ . Given a database $S$ , a sanitizer for a concept class $C$ is required to output a sanitized database $\hat{S}$ s.t. $Q_{c}(S)\approx Q_{c}(\hat{S})$ for every $c\in C$ . For computational reasons, sanitizers are sometimes allowed not to return an actual database, but rather a data structure capable of approximating $Q_{c}(S)$ for every $c\in C$ .

Let $C$ be a concept class and let $S$ be a database. A function $\operatorname{\rm Est}:C\rightarrow[0,1]$ is called $\alpha$-close to $S$ if $|Q_{c}(S)-\operatorname{\rm Est}(c)|\leq\alpha$ for every $c\in C$. If, furthermore, $\operatorname{\rm Est}$ is defined in terms of a database $\hat{S}$, i.e., $\operatorname{\rm Est}(c)=Q_{c}(\hat{S})$, we say that $\hat{S}$ is $\alpha$-close to $S$.
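The two definitions translate directly into code. The sketch below is an illustration with hypothetical names; it assumes non-empty databases and uses threshold predicates over the domain $\{0,\ldots,7\}$ as the concept class.

```python
def counting_query(concept, database):
    """Q_c(S): fraction of entries of a non-empty database satisfying the concept."""
    return sum(concept(x) for x in database) / len(database)

def is_alpha_close(concepts, database, est, alpha):
    """True if |Q_c(S) - Est(c)| <= alpha for every concept c in the class."""
    return all(abs(counting_query(c, database) - est(c)) <= alpha for c in concepts)

# Threshold predicates c_j(x) = 1 iff x < j over the domain {0, ..., 7}.
thresholds = [lambda x, j=j: int(x < j) for j in range(9)]

S = [0, 1, 2, 3, 4, 5, 6, 7]
S_hat = [1, 3, 5, 7]  # a smaller candidate database

def est(c):
    # Est is defined in terms of S_hat, i.e. Est(c) = Q_c(S_hat).
    return counting_query(c, S_hat)

print(is_alpha_close(thresholds, S, est, alpha=0.125))
print(is_alpha_close(thresholds, S, est, alpha=0.1))
```

Here the half-size database answers every threshold query to within $\frac{1}{8}$, so it is $0.125$-close to $S$ but not $0.1$-close. This check says nothing about privacy; it only measures utility.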

Let $C$ be a class of concepts mapping $X$ to $\{0,1\}$. Let $A$ be an algorithm that on an input database $S\in X^{*}$ outputs a description of a function $\operatorname{\rm Est}:C\rightarrow[0,1]$. Algorithm $A$ is an $(\alpha,\beta,\epsilon,\delta,m)$-improper-sanitizer for predicates in the class $C$, if

$A$ is $(\epsilon,\delta)$ -differentially private;

For every input $S\in X^{m}$, it holds that $\Pr_{A}\left[\operatorname{\rm Est}\text{ is }\alpha\text{-close to }S\right]\geq 1-\beta$.

The probability is over the coin tosses of algorithm $A$ . If on an input database $S$ algorithm $A$ outputs another database $\hat{S}\in X^{*}$ , and $\operatorname{\rm Est}(\cdot)$ is defined as $\operatorname{\rm Est}(c)=Q_{c}(\hat{S})$ , then algorithm $A$ is called a proper-sanitizer (or simply a sanitizer). As before, when $\delta=0$ (pure privacy) we omit it from the set of parameters.

Note that without the privacy requirements sanitization is a trivial task as it is possible to simply output the input database $S$ . Furthermore, ignoring computational complexity, an $(\alpha,\beta,\epsilon,\delta,m)$ -improper-sanitizer can always be transformed into a $(2\alpha,\beta,\epsilon,\delta,m)$ -sanitizer, by finding a database $\hat{S}$ of $m$ entries that is $\alpha$ -close to $\operatorname{\rm Est}$ . Such a database must exist except with probability $\beta$ (as in particular $S$ is $\alpha$ -close to $\operatorname{\rm Est}$ ), and is $2\alpha$ -close to $S$ (by the triangle inequality).

The following theorems state some of the known results on the sample complexity of pure-privacy sanitizers. We start with an upper bound on the necessary sample complexity.

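There exists a constant $\Gamma$ such that for any class of predicates $C$ over a domain $X$, and any parameters $\alpha,\beta,\epsilon$, there exists an $(\alpha,\beta,\epsilon,m)$-sanitizer for $C$, provided that the size of the database, denoted $m$, is at least

$$\Gamma\left(\frac{\log|X|\cdot\operatorname{\rm VC}(C)\cdot\log(1/\alpha)}{\alpha^{3}\epsilon}+\frac{\log(1/\beta)}{\epsilon\alpha}\right).$$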

The above theorem states that, in principle, data sanitization is possible. The input database may be required to be as big as the representation size of elements in $X$ . The next theorem states a general lower bound (far from the above upper bound) on the sample complexity of any concept class $C$ . Better bounds are known for specific concept classes .

Let $C$ be a class of predicates, and let $m\leq\frac{\operatorname{\rm VC}(C)}{2}$ . For any $0<\beta<1$ bounded away from 1 by a constant, for any $\epsilon\leq 1$ , if $A$ is an $(\alpha,\beta,\epsilon,m)$ -sanitizer for $C$ , then $\alpha\geq\frac{1}{4+16\epsilon}$ .

Recall that a proper sanitizer operates on an input database $S\in X^{m}$ , and outputs a sanitized database $\hat{S}\in X^{*}$ . The following is a simple corollary of Theorem 2.14, stating that the size of $\hat{S}$ does not necessarily depend on the size of the input database $S$ .

Let $C$ be a concept class. For any database $S$ there exists a database $\hat{S}$ of size $n=O(\frac{\operatorname{\rm VC}(C)}{\alpha^{2}}\log(\frac{1}{\alpha}))$ such that $\max_{h\in C}|Q_{h}(S)-Q_{h}(\hat{S})|\leq\alpha$ .

In particular, the above theorem implies that an $(\alpha,\beta,\epsilon,\delta,m)$ -sanitizer $A$ can always be transformed into a $(2\alpha,\beta,\epsilon,\delta,m)$ -sanitizer $A^{\prime}$ s.t. the sanitized databases returned by $A^{\prime}$ are always of fixed size $n=O(\frac{\operatorname{\rm VC}(C)}{\alpha^{2}}\log(\frac{1}{\alpha}))$ . This can be done by finding a database $\hat{S}$ of $n$ entries that is $\alpha$ -close to the sanitized database returned by $A$ . Using the triangle inequality, $\hat{S}$ is (w.h.p.) $2\alpha$ -close to the input database.
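One simple (non-private) way to look for such a small database is to subsample the larger one and test closeness directly. The sketch below does this by rejection sampling; it is an illustration rather than the construction behind the corollary, the names are hypothetical, and the values of $n$ and $\alpha$ are chosen by hand instead of being derived from the bound. It would be applied to the output of a sanitizer, so it does not affect privacy.

```python
import random

def counting_query(concept, database):
    """Fraction of entries of a non-empty database satisfying the concept."""
    return sum(concept(x) for x in database) / len(database)

def max_deviation(concepts, db_a, db_b):
    """max_c |Q_c(db_a) - Q_c(db_b)| over the concept class."""
    return max(abs(counting_query(c, db_a) - counting_query(c, db_b)) for c in concepts)

def find_small_database(database, concepts, n, alpha, rng, max_tries=100):
    """Draw n entries with replacement until the subsample is alpha-close."""
    for _ in range(max_tries):
        candidate = [rng.choice(database) for _ in range(n)]
        if max_deviation(concepts, database, candidate) <= alpha:
            return candidate
    return None  # no alpha-close subsample found within max_tries

# Thresholds c_j(x) = 1 iff x < j over {0, ..., 255}; the large database has 1024 entries.
thresholds = [lambda x, j=j: int(x < j) for j in range(257)]
large = list(range(256)) * 4
small = find_small_database(large, thresholds, n=400, alpha=0.1, rng=random.Random(0))
print(None if small is None else len(small))
```

The exhaustive closeness test is only feasible for small classes; for large classes one would rely on the sampling bound itself rather than on verification.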

5 Basic Differentially-Private Mechanisms

The most basic constructions of differentially private algorithms are via the Laplace mechanism as follows.

A random variable has probability distribution $\mathop{\rm{Lap}}\nolimits(b)$ if its probability density function is $f(x)=\frac{1}{2b}\exp(-\frac{|x|}{b})$, where $x\in\mathbb{R}$.

A function $f:X^{m}\rightarrow\mathbb{R}^{n}$ has sensitivity $k$ if for every neighboring $D,D^{\prime}\in X^{m}$, it holds that $||f(D)-f(D^{\prime})||_{1}\leq k$.

Let $f:X^{m}\rightarrow\mathbb{R}^{n}$ be a sensitivity $k$ function. The mechanism $A$ that on input $D\in X^{m}$ adds independently generated noise with distribution $\mathop{\rm{Lap}}\nolimits(\frac{k}{\epsilon})$ to each of the $n$ output terms of $f(D)$ preserves $\epsilon$-differential privacy. Moreover,

$$\Pr\left[\exists i\text{ s.t. }|A_{i}(D)-f_{i}(D)|>\Delta\right]\leq n\cdot\exp\left(-\frac{\epsilon\Delta}{k}\right),$$

where $A_{i}(D)$ and $f_{i}(D)$ are the $i^{\text{th}}$ coordinates of $A(D)$ and $f(D)$.
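A minimal sketch of the Laplace mechanism follows, using only the Python standard library. It relies on two standard facts: the difference of two independent exponential variables with mean $b$ is distributed as $\mathop{\rm{Lap}}\nolimits(b)$, and $\Pr[|\mathop{\rm{Lap}}\nolimits(b)|>t]=\exp(-t/b)$, which with a union bound over the $n$ coordinates gives the error bound computed by `tail_bound`. All names are hypothetical, and floating-point sampling of this kind is not hardened against known precision-based attacks.

```python
import math
import random

def sample_laplace(scale, rng):
    """Draw from Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, sensitivity, eps, rng):
    """Add independent Lap(sensitivity/eps) noise to each coordinate of f(D)."""
    scale = sensitivity / eps
    return [v + sample_laplace(scale, rng) for v in values]

def tail_bound(n, sensitivity, eps, err):
    """Union bound on the probability that some coordinate is off by more than err."""
    return n * math.exp(-eps * err / sensitivity)

# Empirical check on a fixed 3-coordinate answer f(D) with sensitivity 1 and eps = 1.
rng = random.Random(0)
true_values = [10.0, 20.0, 30.0]
trials, failures = 2000, 0
for _ in range(trials):
    noisy = laplace_mechanism(true_values, sensitivity=1.0, eps=1.0, rng=rng)
    if any(abs(a - f) > 5.0 for a, f in zip(noisy, true_values)):
        failures += 1
empirical = failures / trials
bound = tail_bound(n=3, sensitivity=1.0, eps=1.0, err=5.0)
print(empirical, bound)
```

The empirical failure rate should land near the bound, since the union bound is almost tight when the individual failure probabilities are small.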

5.2 The Exponential Mechanism

We next describe the exponential mechanism of McSherry and Talwar. Let $X$ be a domain and $H$ a set of solutions. Given a quality function $q:X^{*}\times H\rightarrow\mathbb{N}$, and a database $S\in X^{*}$, the goal is to choose a solution $h\in H$ approximately maximizing $q(S,h)$. The mechanism chooses a solution probabilistically, where the probability mass that is assigned to each solution $h$ increases exponentially with its quality $q(S,h)$:

Input: parameter $\epsilon$, finite solution set $H$, database $S\in X^{m}$, and a sensitivity 1 quality function $q$.
1. Randomly choose $h\in H$ with probability $\frac{\exp\left(\epsilon\cdot q(S,h)/2\right)}{\sum_{f\in H}\exp\left(\epsilon\cdot q(S,f)/2\right)}$.
2. Output $h$.

(i) The exponential mechanism is $\epsilon$-differentially private. (ii) Let $\hat{e}\triangleq\max_{f\in H}\{q(S,f)\}$ and $\Delta>0$. The exponential mechanism outputs a solution $h$ such that $q(S,h)\leq(\hat{e}-\Delta m)$ with probability at most $|H|\cdot\exp(-\epsilon\Delta m/2)$.

Kasiviswanathan et al. showed in 2008 that the exponential mechanism can be used as a generic private learner – when used with the quality function $q(S,h)=|\{i:h(x_{i})=y_{i}\}|$ , the probability that the exponential mechanism outputs a hypothesis $h$ such that ${\rm error}_{S}(h)>\min_{f\in H}\{{\rm error}_{S}(f)\}+\Delta$ is at most $|H|\cdot\exp(-\epsilon\Delta m/2)$ . This results in a generic private proper-learner for every finite concept class $C$ , with sample complexity $O_{\alpha,\beta,\epsilon}(\log|C|)$ .
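The mechanism and its use as a generic learner can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: solutions are represented as integers indexing threshold functions $c_{j}(x)=1$ iff $x<j$, the quality function counts agreements with the labeled sample, and scores are shifted by their maximum before exponentiation purely for numerical stability (the shift cancels in the normalization). All names are hypothetical.

```python
import math
import random

def exponential_mechanism(database, solutions, quality, eps, rng):
    """Sample h with probability proportional to exp(eps * q(S, h) / 2)."""
    scores = [quality(database, h) for h in solutions]
    top = max(scores)
    # Subtracting the maximum leaves the sampling distribution unchanged.
    weights = [math.exp(eps * (s - top) / 2) for s in scores]
    return rng.choices(solutions, weights=weights, k=1)[0]

def agreements(sample, j):
    """q(S, h) = |{i : h(x_i) = y_i}| for the threshold h = c_j."""
    return sum(1 for x, y in sample if int(x < j) == y)

# Labeled sample for the target threshold c_10 over the domain {0, ..., 15}:
# every domain point appears 20 times, so the sample has 320 examples.
target = 10
sample = [(i % 16, int(i % 16 < target)) for i in range(320)]
solutions = list(range(17))

chosen = exponential_mechanism(sample, solutions, agreements, eps=1.0, rng=random.Random(0))
print(chosen, agreements(sample, chosen))
```

Enumerating all of $H$ takes time linear in $|H|$, which is why the generic learner is sample-efficient but not necessarily time-efficient for large classes.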

We restate a simplified variant of algorithm $\mathcal{A}_{\rm dist}$ by Smith and Thakurta, which is an instantiation of the Propose-Test-Release framework. Let $q:X^{*}\times H\rightarrow\mathbb{N}$ be a sensitivity-1 quality function over a domain $X$ and a set of solutions $H$. Given a database $S\in X^{*}$, the goal is to choose a solution $h\in H$ maximizing $q(S,h)$, under the assumption that the optimal solution $h$ scores much better than any other solution in $H$.

Algorithm $\mathcal{A}_{\rm dist}$
Input: parameters $\epsilon,\delta$, database $S\in X^{*}$, sensitivity-1 quality function $q$.
1. Let $h_{1}\neq h_{2}$ be two highest score solutions in $H$, where $q(S,h_{1})\geq q(S,h_{2})$.
2. Let ${\rm gap}=q(S,h_{1})-q(S,h_{2})$ and ${\rm gap}^{*}={\rm gap}+\mathop{\rm{Lap}}\nolimits(\frac{1}{\epsilon})$.
3. If ${\rm gap}^{*}<\frac{1}{\epsilon}\log(\frac{1}{\delta})$ then output $\bot$ and halt.
4. Output $h_{1}$.

(i) Algorithm $\mathcal{A}_{\rm dist}$ is $(\epsilon,\delta)$-differentially private. (ii) When given an input database $S$ for which ${\rm gap}\geq\frac{1}{\epsilon}\log(\frac{1}{\beta\delta})$, algorithm $\mathcal{A}_{\rm dist}$ outputs $h_{1}$ maximizing $q(S,h)$ with probability at least $(1-\beta)$.
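The four steps of $\mathcal{A}_{\rm dist}$ can be sketched directly. The code below is a hedged illustration: the function names are hypothetical, $\bot$ is represented by `None`, the solution set is assumed to contain at least two elements, and Laplace noise is sampled as a difference of two exponentials. The toy quality function counts occurrences of a solution in the database, which has sensitivity 1.

```python
import math
import random

def sample_laplace(scale, rng):
    """Draw from Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def a_dist(database, solutions, quality, eps, delta, rng):
    """Return the top-scoring solution if its noisy lead is large, else None."""
    # Step 1: the two highest-scoring solutions (needs at least two solutions).
    ranked = sorted(solutions, key=lambda h: quality(database, h), reverse=True)
    h1, h2 = ranked[0], ranked[1]
    # Step 2: the gap between them, perturbed by Lap(1/eps).
    gap = quality(database, h1) - quality(database, h2)
    noisy_gap = gap + sample_laplace(1.0 / eps, rng)
    # Step 3: refuse to answer when the noisy gap is below (1/eps) * log(1/delta).
    if noisy_gap < math.log(1.0 / delta) / eps:
        return None
    # Step 4: release the maximizer.
    return h1

def count_quality(database, h):
    """Toy sensitivity-1 quality function: number of occurrences of h."""
    return database.count(h)

solutions = ["a", "b", "c"]
clear_winner = ["a"] * 100 + ["b"] * 3
tied = ["a"] * 50 + ["b"] * 50
print(a_dist(clear_winner, solutions, count_quality, 1.0, 1e-6, random.Random(0)))
print(a_dist(tied, solutions, count_quality, 1.0, 1e-6, random.Random(0)))
```

With $\epsilon=1$ and $\delta=10^{-6}$ the release threshold is about $13.8$, so a gap of $97$ is released and a gap of $0$ is refused except with very small probability.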

6 Concentration Bounds

Learning with Approximate Privacy

We present proper $(\epsilon,\delta)$ -private learners for two simple concept classes, $\operatorname{\tt POINT}_{d}$ and $\operatorname*{\tt THRESH}_{d}$ , demonstrating separations between pure and approximate private proper learning.

For $j\in X_{d}$ let $c_{j}:X_{d}\rightarrow\{0,1\}$ be defined as $c_{j}(x)=1$ if $x=j$ and $c_{j}(x)=0$ otherwise. Define the concept class $\operatorname{\tt POINT}_{d}=\{c_{j}\}_{j\in X_{d}}$ .

Note that the VC dimension of $\operatorname{\tt POINT}_{d}$ is 1, and, therefore, there exists a proper non-private learner for $\operatorname{\tt POINT}_{d}$ with sample complexity $O_{\alpha,\beta}(1)$ . Beimel et al. proved that every proper $\epsilon$ -private learner for $\operatorname{\tt POINT}_{d}$ must have sample complexity $\Omega(d)=\Omega(\log|\operatorname{\tt POINT}_{d}|)$ . They also showed that there exists an improper $\epsilon$ -private learner for this class, with sample complexity $O_{\alpha,\beta,\epsilon}(1)$ . An alternative private learner for this class was presented in .

As we will now see, algorithm $\mathcal{A}_{\rm dist}$ (defined in Section 2.5) can be used as a proper $(\epsilon,\delta)$ -private learner for $\operatorname{\tt POINT}_{d}$ with sample complexity $O_{\alpha,\beta,\epsilon,\delta}(1)$ . This is our first (and simplest) example separating the sample complexity of pure and approximate private proper-learners. Consider the following algorithm.

Input: parameters $\alpha,\beta,\epsilon,\delta$, and a database $S\in(X_{d+1})^{m}$.
1. For every $x\in X_{d}$, define $q(S,x)$ as the number of appearances of $(x,1)$ in $S$.
2. Execute $\mathcal{A}_{\rm dist}$ on $S$ with the quality function $q$ and parameters $\frac{\alpha}{2},\frac{\beta}{2},\epsilon,\delta$.
3. If the output was $j$ then return $c_{j}$.
4. Else, if the output was $\bot$ then return a random $c_{i}\in\operatorname{\tt POINT}_{d}$.

Let $\alpha,\beta,\epsilon,\delta$ be s.t. $\frac{1}{\alpha\beta}\leq 2^{d}$ . The above algorithm is an efficient $(\alpha,\beta,\epsilon,\delta)$ -PPAC proper learner for $\operatorname{\tt POINT}_{d}$ using a sample of $m=O\left(\frac{1}{\alpha\epsilon}\ln(\frac{1}{\beta\delta})\right)$ labeled examples.

For intuition, consider a target concept $c_{j}$ and an underlying distribution $\mathcal{D}$ . Whenever $\mathcal{D}(j)$ is noticeable, a typical sample $S$ contains many copies of the point $j$ labeled as $1$ . As every other point $i\neq j$ will be labeled as $0$ , we expect $q(S,j)$ to be significantly higher than any other $q(S,i)$ , and we can use algorithm $\mathcal{A}_{\rm dist}$ to identify $j$ .
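
The scores that drive this intuition can be illustrated with a small non-private sketch. It only computes the quality function $q(S,x)$ and the gap between the two highest scores; algorithm $\mathcal{A}_{\rm dist}$ itself is not reproduced, and the names `point_quality` and `top_gap` are hypothetical helpers introduced for illustration.

```python
from collections import Counter

def point_quality(sample):
    """q(S, x): number of appearances of (x, 1) in the labeled sample S."""
    return Counter(x for x, label in sample if label == 1)

def top_gap(scores):
    """Return (best point, gap between the two highest scores).

    Points that never appear with label 1 implicitly score 0.
    """
    ranked = scores.most_common(2)
    if not ranked:
        return None, 0
    best, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0
    return best, best_score - second_score

# Toy sample labeled by the point function c_5.
S = [(5, 1), (3, 0), (5, 1), (9, 0), (5, 1), (2, 0)]
q = point_quality(S)
best, gap = top_gap(q)
```

On a sample labeled by a point function, only the target point can receive a positive score, so the gap equals the number of positively labeled examples.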

Recall that the above algorithm outputs a random $c_{i}\in\operatorname{\tt POINT}_{d}$ whenever $\mathcal{A}_{\rm dist}$ outputs $\bot$ . In order for this random $c_{i}$ to be good (w.h.p.) we needed $2^{d}$ (i.e., the number of possible concepts) to be at least $\frac{1}{\alpha\beta}$ . This requirement could be avoided by outputting the all zero hypothesis $c_{0}\equiv 0$ whenever $\mathcal{A}_{\rm dist}$ outputs $\bot$ . However, this approach results in a proper learner only if we add the all zero concept to $\operatorname{\tt POINT}_{d}$ .

For $0\leq j\leq 2^{d}$ let $c_{j}:X_{d}\rightarrow\{0,1\}$ be defined as $c_{j}(x)=1$ if $x<j$ and $c_{j}(x)=0$ otherwise. Define the concept class $\operatorname*{\tt THRESH}_{d}=\{c_{j}\}_{0\leq j\leq 2^{d}}$ .

Note that $\operatorname{\rm VC}(\operatorname*{\tt THRESH}_{d})=1$ , and, therefore, there exists a proper non-private learner for $\operatorname*{\tt THRESH}_{d}$ with sample complexity $O_{\alpha,\beta}(1)$ . As $|\operatorname*{\tt THRESH}_{d}|=2^{d}+1$ , one can use the generic construction of Kasiviswanathan et al. and get a proper $\epsilon$ -private learner for this class with sample complexity $O_{\alpha,\beta,\epsilon}(d)$ . Feldman and Xiao showed that this is in fact optimal, and every $\epsilon$ -private learner for this class (proper or improper) must have sample complexity $\Omega(d)$ .

Our learner for $\operatorname{\tt POINT}_{d}$ relied on a strong “stability” property of the problem: Given a labeled sample, either a random concept is (w.h.p.) a good output, or, there is exactly one consistent concept in the class, and every other concept has large empirical error. This, however, is not the case when dealing with $\operatorname*{\tt THRESH}_{d}$ . In particular, many hypotheses can have low empirical error, and changing a single entry of a sample $S$ can significantly affect the set of hypotheses consistent with it.

In Section 3.3, we present a proper $(\epsilon,\delta)$ -private learner for $\operatorname*{\tt THRESH}_{d}$ with sample complexity (roughly) $2^{O(\log^{*}(d))}$ . We use this section for motivating the construction. We start with two simplifying assumptions. First, when given a labeled sample $S$ , we aim at choosing a hypothesis $h\in\operatorname*{\tt THRESH}_{d}$ approximately minimizing the empirical error (rather than the generalization error). Second, we assume that we are given a “diverse” sample $S$ that contains many points labeled as $1$ and many points labeled as $0$ . Those two assumptions (and any other informalities made hereafter) will be removed in Section 3.3.

We refer to points labeled as $1$ in $S$ as ones, and to points labeled as $0$ as zeros. Imagine for a moment that we already have a differentially private algorithm that given $S$ outputs an interval $G\subseteq X_{d}$ with the following two properties:

The interval $G$ contains “a lot” of ones, and “a lot” of zeros in $S$ .

Every interval $I\subseteq X_{d}$ of length $\leq\frac{|G|}{k}$ does not contain, simultaneously, “too many” ones and “too many” zeros in $S$ , where $k$ is some constant.

We refer to an interval $G$ with these two properties as a $k$ -good interval.

Given such a $4$ -good interval $G$ , we can (without using the sample $S$ ) define a set $H$ of five hypotheses s.t. at least one of them has small empirical error. To see this, consider Figure 2, where $G$ is divided into four equal intervals $g_{1},g_{2},g_{3},g_{4}$ , and five hypotheses $h_{1},\ldots,h_{5}$ are defined s.t. the points where they switch from one to zero are located at the edges of $g_{1},g_{2},g_{3},g_{4}$ .

After defining such a set $H$ , we could use the exponential mechanism to choose a hypothesis $h\in H$ with small empirical error on $S$ . As the size of $H$ is constant, this requires only a constant number of samples. To conclude, finding a $4$ -good interval $G$ (while preserving privacy) is sufficient for choosing a good hypothesis. We next explain how to find such an interval.
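
As a rough illustration of this step, the following sketch builds five candidate thresholds from an interval and samples one of them with the exponential mechanism. It assumes the standard form of the mechanism for a sensitivity-1 score (selection probability proportional to $\exp(\epsilon\cdot{\rm score}/2)$ ); the helper names and the toy sample are hypothetical.

```python
import math
import random

def correct_count(sample, j):
    """Number of sample points classified correctly by c_j (c_j(x) = 1 iff x < j)."""
    return sum(1 for x, y in sample if (1 if x < j else 0) == y)

def candidate_thresholds(left, right):
    """Five switch points at the edges of four equal parts of [left, right)."""
    step = (right - left) // 4
    return [left + i * step for i in range(5)]

def exp_mech_probs(scores, eps):
    """Selection probabilities proportional to exp(eps * score / 2)."""
    top = max(scores)  # subtracting the max only improves numerical stability
    weights = [math.exp(eps * (s - top) / 2) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exp_mech(candidates, scores, eps, rng=random):
    """Sample one candidate according to the exponential mechanism."""
    return rng.choices(candidates, weights=exp_mech_probs(scores, eps), k=1)[0]

S = [(1, 1), (4, 1), (6, 1), (9, 0), (12, 0), (15, 0)]
H = candidate_thresholds(0, 16)
scores = [correct_count(S, j) for j in H]
probs = exp_mech_probs(scores, eps=1.0)
chosen = exp_mech(H, scores, eps=1.0)
```

Because $H$ has constant size, the number of samples needed for the best candidate to dominate does not grow with $d$ .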

Assume, for now, that we have a differentially private algorithm that given a sample $S$ , returns an interval length $J$ s.t. there exists a 2-good interval $G\subseteq X_{d}$ of length $|G|=J$ . This length $J$ is used to find an explicit 4-good interval as follows. Divide $X_{d}$ into intervals $\{A_{i}\}$ of length $2J$ , and into intervals $\{B_{i}\}$ of length $2J$ right shifted by $J$ as in Figure 3.

As the promised $2$ -good interval $G$ is of length $J$ , at least one of the above intervals contains $G$ . We next explain how to privately choose such an interval. If, e.g., $G\subseteq A_{2}$ then $A_{2}$ contains both a lot of zeros and a lot of ones. The target concept must switch inside $A_{2}$ , and, therefore, every other $A_{i}\neq A_{2}$ cannot contain both zeros and ones. For every interval $A_{i}$ , define its quality $q(A_{i})$ to be the minimum of the number of zeros in $A_{i}$ and the number of ones in $A_{i}$ . Therefore, $q(A_{2})$ is large, while $q(A_{i})=0$ for every $A_{i}\neq A_{2}$ . That is, $A_{2}$ scores much better than any other $A_{i}$ under this quality function $q$ . The sensitivity of $q()$ is one and we can use algorithm $\mathcal{A}_{\rm dist}$ to privately identify $A_{2}$ . It suffices, e.g., that $q(A_{2})\geq\frac{1}{4}\alpha m$ , and we can, therefore, set our “a lot” bound to be $\frac{1}{4}\alpha m$ . Recall that $G\subseteq A_{2}$ is a $2$ -good interval, and that $|A_{2}|=2|G|$ . The identified $A_{2}$ is, therefore, a $4$ -good interval.
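
The two shifted partitions and the interval quality can be sketched as follows. This is a non-private illustration on a toy sample; intervals are represented as half-open pairs `(lo, hi)`, and the function names are hypothetical.

```python
def interval_quality(sample, lo, hi):
    """min(#ones, #zeros) among sample points x with lo <= x < hi."""
    ones = sum(1 for x, y in sample if lo <= x < hi and y == 1)
    zeros = sum(1 for x, y in sample if lo <= x < hi and y == 0)
    return min(ones, zeros)

def shifted_partitions(domain_size, J):
    """Intervals {A_i} of length 2J, and {B_i} of length 2J shifted right by J.

    In this sketch the first J points are simply left outside the B partition.
    """
    A = [(s, min(s + 2 * J, domain_size)) for s in range(0, domain_size, 2 * J)]
    B = [(s, min(s + 2 * J, domain_size)) for s in range(J, domain_size, 2 * J)]
    return A, B

# Sample labeled by the threshold c_6 (label 1 iff x < 6).
S = [(1, 1), (2, 1), (4, 1), (5, 1), (6, 0), (7, 0), (12, 0), (14, 0)]
A, B = shifted_partitions(16, J=2)
qA = {iv: interval_quality(S, *iv) for iv in A}
qB = {iv: interval_quality(S, *iv) for iv in B}
```

In this example only the interval `(4, 8)` of the $A$ partition contains both ones and zeros, which is the kind of gap that $\mathcal{A}_{\rm dist}$ relies on.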

To conclude, if we could indeed find (while preserving privacy) a length $J$ s.t. there exists a $2$ -good interval $G$ of that length, then our task would be completed.

As a first attempt, one might consider performing a binary search for such a length $0\leq J\leq 2^{d}$ , in which every comparison will be made using the Laplace mechanism. More specifically, for every length $0\leq J\leq 2^{d}$ , define

$$Q(J)=\max_{[a,b]\subseteq X_{d},\;b-a=J}\Big\{\min\big\{\text{number of ones in }[a,b],\;\text{number of zeros in }[a,b]\big\}\Big\}.$$

If, e.g., $Q(J)=100$ for some $J$ , then there exists an interval $[a,b]\subseteq X_{d}$ of length $J$ that contains at least $100$ ones and at least $100$ zeros. Moreover, every interval of length $\leq J$ either contains at most $100$ ones, or, contains at most $100$ zeros.

Note that $Q(\cdot)$ is a monotonically non-decreasing function, and that $Q(0)=0$ (as in a correctly labeled sample a point cannot appear both with the label 1 and with the label 0). Recall our assumption that the sample $S$ is “diverse” (contains many points labeled as $1$ and many points labeled as $0$ ), and, therefore, $Q(2^{d})$ is large. Hence, there exists a $J$ s.t. $Q(J)$ is “big enough” (say at least $\frac{1}{4}\alpha m$ ) while $Q(J-1)$ is “small enough” (say at most $\frac{3}{4}\alpha m$ ). That is, a $J$ s.t. (1) there exists an interval of length $J$ containing lots of ones and lots of zeros; and (2), every interval of length $<J$ cannot contain too many ones and too many zeros simultaneously. Such a $J$ can easily be (privately) obtained using a (noisy) binary search. However, as there are $d$ noisy comparisons, this solution requires a sample of size $d^{O(1)}$ in order to achieve reasonable utility guarantees.

As a second attempt, one might consider performing a binary search, not on $0\leq J\leq 2^{d}$ , but rather on the power $j$ of an interval of length $2^{j}$ . That is, performing a search for a power $0\leq j\leq d$ for which there exists a $2$ -good interval of length $2^{j}$ . Here there are only $\log(d)$ noisy comparisons, and the sample size is reduced to $\log^{O(1)}(d)$ . Again, a (noisy) binary search on $0\leq j\leq d$ can (privately) yield an appropriate length $J=2^{j}$ s.t. $Q(2^{j})$ is “big enough”, while $Q(2^{j-1})$ is “small enough”. Such a $J=2^{j}$ is, indeed, a length of a $2$ -good interval. To see this, note that as $Q(2^{j})$ is “big enough”, there exists an interval of length $2^{j}$ containing lots of ones and lots of zeros. Moreover, as $Q(2^{j-1})$ is “small enough”, every interval of length $2^{j-1}=\frac{1}{2}2^{j}$ cannot contain too many ones and too many zeros simultaneously.

A binary search as above would have to operate on noisy values of $Q(\cdot)$ (as otherwise differential privacy cannot be obtained). For this reason, we set the bounds for “big enough” and “small enough” to overlap. Namely, we search for a value $j$ such that $Q(2^{j})\geq\frac{\alpha}{4}m$ and $Q(2^{j-1})\leq\frac{3\alpha}{4}m$ , where $\alpha$ is our approximation parameter, and $m$ is the sample size.
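
A minimal sketch of the search over powers is given below. It assumes that $Q(J)$ is the largest value of $\min\{\#{\rm ones},\#{\rm zeros}\}$ over intervals $[a,a+J]$ (truncated at the end of the domain), and it takes the noise as a pluggable callable, so no privacy accounting is performed here; the default adds no noise at all. All names are hypothetical.

```python
import random

def Q(sample, domain_size, J):
    """Largest min(#ones, #zeros) over intervals [a, a+J], truncated to the domain."""
    best = 0
    for a in range(domain_size):
        ones = sum(1 for x, y in sample if a <= x <= a + J and y == 1)
        zeros = sum(1 for x, y in sample if a <= x <= a + J and y == 0)
        best = max(best, min(ones, zeros))
    return best

def laplace(scale, rng=random):
    """Laplace(scale) noise, sampled as a difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def search_power(sample, d, target, noise=lambda: 0.0):
    """Binary search for the smallest power j in [0, d] whose noisy Q(2^j) reaches target."""
    lo, hi = 0, d
    while lo < hi:
        mid = (lo + hi) // 2
        if Q(sample, 2 ** d, 2 ** mid) + noise() >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

S = [(1, 1), (2, 1), (4, 1), (5, 1), (6, 0), (7, 0), (12, 0), (14, 0)]
j = search_power(S, d=4, target=2)
```

A noisy variant would pass, e.g., `noise=lambda: laplace(scale)` for a scale chosen by the privacy analysis; the overlap between the “big enough” and “small enough” bounds is what absorbs that noise.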

To summarize, using a binary search we find a length $J=2^{j}$ such that there exists a 2-good interval of length $J$ . Then, using $\mathcal{A}_{\rm dist}$ , we find a 4-good interval. Finally, we partition this interval into 4 intervals, and using the exponential mechanism we choose a starting point or end point of one of these intervals as our threshold.

We will apply recursion to reduce the cost of computing $J=2^{j}$ to $2^{O(\log^{*}(d))}$ . The tool performing the recursion will be formalized and analyzed in the next section. This tool will later be used in our construction of a proper $(\epsilon,\delta)$ -private learner for $\operatorname*{\tt THRESH}_{d}$ .

3 Privately Approximating Quasi-Concave Promise Problems

We next define the notions that enable our recursive algorithm.

A Quasi-Concave Promise Problem consists of an ordered set of possible solutions $[0,T]=\{0,1,\ldots,T\}$ , a database $S\in X^{m}$ , a sensitivity-1 quality function $Q:X^{*}\times[0,T]\rightarrow\mathbb{R}$ , an approximation parameter $\alpha$ , and another parameter $r$ (called a quality promise).

If $Q(S,\cdot)$ is quasi-concave and if there exists a solution $p\in[0,T]$ for which $Q(S,p)\geq r$ then a good output for the problem is a solution $k\in[0,T]$ satisfying $Q(S,k)\geq(1-\alpha)r$ . The outcome is not restricted otherwise.

Consider a sample $S=(x_{i},y_{i})_{i=1}^{m}$ , labeled by some target function $c_{j}\in\operatorname*{\tt THRESH}_{d}$ . The goal of choosing a hypothesis with small empirical error can be viewed as a quasi-concave promise problem as follows. Set the range of possible solutions to $[0,2^{d}]$ , the approximation parameter to $\alpha$ and the quality promise to $m$ . Define $Q(S,k)=|\{i:c_{k}(x_{i})=y_{i}\}|$ ; i.e., $Q(S,k)$ is the number of points in $S$ correctly classified by $c_{k}\in\operatorname*{\tt THRESH}_{d}$ . Note that the target concept $c_{j}$ satisfies $Q(S,j)=m$ . Our task is to find a hypothesis $h_{k}\in\operatorname*{\tt THRESH}_{d}$ s.t. ${\rm error}_{S}(h_{k})\leq\alpha$ , which is equivalent to finding $k\in[0,2^{d}]$ s.t. $Q(S,k)\geq(1-\alpha)m$ .

To see that $Q(S,\cdot)$ is quasi-concave, let $u\leq v\leq w$ be s.t. $Q(S,u),Q(S,w)\geq\lambda$ . Consider $j$ , the index of the target concept, and assume w.l.o.g. that $j\leq v$ (the other case is symmetric). That is, $j\leq v\leq w$ . Note that $c_{v}$ errs only on points in between $j$ and $v$ , and $c_{w}$ errs on all these points. That is, ${\rm error}_{S}(c_{v})\leq{\rm error}_{S}(c_{w})$ , and, therefore, $Q(S,v)\geq\lambda$ . See Figure 4 for an illustration.
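
The quasi-concavity argument can be checked by brute force on a toy sample. The sketch below assumes the characterization used in the argument above, namely $Q(S,v)\geq\min\{Q(S,u),Q(S,w)\}$ for all $u\leq v\leq w$ ; the function names are hypothetical.

```python
def correct_count(sample, k):
    """Q(S, k): number of points in S classified correctly by c_k (c_k(x) = 1 iff x < k)."""
    return sum(1 for x, y in sample if (1 if x < k else 0) == y)

def is_quasi_concave(values):
    """Check values[v] >= min(values[u], values[w]) for all u <= v <= w."""
    n = len(values)
    return all(values[v] >= min(values[u], values[w])
               for u in range(n) for v in range(u, n) for w in range(v, n))

d = 3
target = 5
# A sample over X_3 labeled consistently by the threshold c_5.
S = [(x, 1 if x < target else 0) for x in [0, 2, 2, 4, 5, 7, 7]]
Qvals = [correct_count(S, k) for k in range(2 ** d + 1)]

# A sample that is not consistent with any threshold concept.
S_bad = [(1, 0), (3, 1), (5, 0), (6, 1)]
Qbad = [correct_count(S_bad, k) for k in range(2 ** d + 1)]
```

For the consistent sample the scores rise to $m$ at the target index and fall afterwards; for the inconsistent sample the check fails, which illustrates why the promise matters.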

Note that if the sample $S$ in the above example is not consistent with any $c\in\operatorname*{\tt THRESH}_{d}$ , then there is no $j$ s.t. $Q(S,j)=m$ , and the quality promise is void. Moreover, in such a case $Q(S,\cdot)$ might not be quasi-concave.

We are interested in solving quasi-concave promise problems while preserving differential privacy. As motivated by Remark 3.9, privacy must be preserved even when $Q(S,\cdot)$ is not quasi-concave or $Q(S,p)<r$ for all $p\in[0,T]$ . Our algorithm $RecConcave$ is presented in Figure 5 (see inline comments for some of the underlying intuition).

We start the analysis of Algorithm $RecConcave$ by bounding the number of recursive calls.

On a range $[0,T]$ there could be at most $\log^{\lceil*\rceil}(T)=\log^{*}(T)$ recursive calls throughout the execution of $RecConcave$ .

Before proceeding to the privacy analysis, we make the following simple observation.

Let $\{f_{1},f_{2},\ldots,f_{N}\}$ be a set of sensitivity-1 functions mapping $X^{*}$ to $\mathbb{R}$ . Then $f_{max}(S)=\max_{i}\{f_{i}(S)\}$ and $f_{min}(S)=\min_{i}\{f_{i}(S)\}$ are sensitivity-1 functions.

We now proceed with the privacy analysis of algorithm $RecConcave$ .

When executed on a sensitivity-1 quality function $Q$ , parameters $\epsilon,\delta$ , and a bound on the recursion depth $N$ , algorithm $RecConcave$ preserves $(3N\epsilon,3N\delta)$ -differential privacy.

Note that since $Q$ is a sensitivity-1 function, all of the quality functions defined throughout the execution of $RecConcave$ are of sensitivity 1 (see Observation 3.11). In each recursive call algorithm $RecConcave$ invokes at most three differentially private mechanisms: once with the Exponential Mechanism (on Step 1 or on Step 11), and at most twice with algorithm $\mathcal{A}_{\rm dist}$ (on Step 9). As there are at most $N$ recursive calls, we conclude that throughout the entire execution algorithm $RecConcave$ invokes at most $3N$ mechanisms, each $(\epsilon,\delta)$ -differentially private. Hence, using Theorem 2.4, algorithm $RecConcave$ is $(3N\epsilon,3N\delta)$ -differentially private. ∎

We now turn to proving the correctness of algorithm $RecConcave$ . As the proof is by induction (on the number of recursive calls), we need to show that each of the recursive calls to $RecConcave$ is made with appropriate inputs. We first claim that the function $q(S,\cdot)$ constructed in Step 4 is quasi-concave. Note that for this claim we do not need to assume that $Q(S,\cdot)$ is quasi-concave.

Let $Q:X^{*}\times[0,T]\rightarrow\mathbb{R}$ be a quality function, and let the functions $L(\cdot,\cdot)$ and $q(\cdot,\cdot)$ be as in steps 3, 4 of algorithm $RecConcave$ . Then, for every $S\in X^{*}$ , it holds that $q(S,\cdot)$ is quasi-concave.

Fix $S\in X^{*}$ . First observe that the function

is monotonically non-increasing (as a function of $j$ ). To see this, note that if $L(S,j)=\mathcal{X}$ , then there exists an interval of length $2^{j}$ in which every point has quality at least $\mathcal{X}$ . In particular, there exists such an interval of length $\frac{1}{2}2^{j}$ , and $L(S,j-1)\geq\mathcal{X}$ .
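
The monotonicity of $L(S,\cdot)$ can be illustrated numerically. The definitions of $L$ and $q$ appear in Figure 5, which is not reproduced here, so the sketch below assumes the forms suggested by the surrounding proofs: $L(S,j)$ is the best worst-case quality over windows of $2^{j}$ consecutive solutions (taken to be $0$ when no such window fits), and $q(S,j)=\min\{L(S,j)-(1-\alpha)r,\;r-L(S,j+1)\}$ . These forms and all names are assumptions made for illustration.

```python
def L(values, j):
    """Best worst-case quality over windows of 2**j consecutive solutions.

    Assumption of this sketch: the value is 0 when no window of that width fits.
    """
    width = 2 ** j
    if width > len(values):
        return 0
    return max(min(values[a:a + width]) for a in range(len(values) - width + 1))

def q(values, j, alpha, r):
    """Assumed form: min{ L(j) - (1 - alpha) * r,  r - L(j + 1) }."""
    return min(L(values, j) - (1 - alpha) * r, r - L(values, j + 1))

# Qualities Q(S, 0..8) of a threshold sample with promise r = 7.
values = [3, 4, 4, 6, 6, 7, 6, 6, 4]
Ls = [L(values, j) for j in range(5)]
qs = [q(values, j, alpha=0.5, r=7) for j in range(4)]
```

Here `Ls` decreases with `j`, and `qs` peaks at the power `j` for which a wide window of good solutions exists while no window twice as wide does.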

We use $\log^{\lceil N\rceil}(\cdot)$ to denote the outcome of $N$ iterative applications of the function $\lceil\log(\cdot)\rceil$ , i.e., $\log^{\lceil N\rceil}(n)=\underbrace{\lceil\log\lceil\log\lceil\cdots\lceil\log}_{N\text{ times}}(n)\rceil\cdots\rceil\rceil\rceil$ . Observe that $\log^{\lceil N\rceil}(n)\leq 2+\underbrace{\log\log\cdots\log}_{N\text{ times}}(n)$ for every $N\leq\log^{*}(n)$ . For example $\lceil\log\lceil\log\lceil\log(n)\rceil\rceil\rceil\leq\lceil\log\lceil\log(2+\log(n))\rceil\rceil\leq\lceil\log\lceil\log(2\log(n))\rceil\rceil=\lceil\log\lceil 1+\log\log(n)\rceil\rceil\leq\lceil\log(2+\log\log(n))\rceil\leq\lceil\log(2\log\log(n))\rceil=\lceil 1+\log\log\log(n)\rceil\leq 2+\log\log\log(n)$ .
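
A short sketch of this notation, assuming logarithms are taken to base 2 (the helper name `iter_ceil_log` is hypothetical):

```python
def iter_ceil_log(n, N):
    """Apply x -> ceil(log2(x)) to the positive integer n, N times.

    For a positive integer x, (x - 1).bit_length() equals ceil(log2(x)) exactly.
    """
    for _ in range(N):
        n = (n - 1).bit_length()
    return n

# Successive values for n = 2**16 and for n = 1000.
chain_pow = [iter_ceil_log(2 ** 16, N) for N in range(1, 5)]
chain_1000 = [iter_ceil_log(1000, N) for N in range(1, 5)]
```

Integer arithmetic is used instead of floating-point logarithms so that the ceilings are exact even for very large $n$ .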

Let $Q:X^{*}\times[0,T]\rightarrow\mathbb{R}$ be a sensitivity-1 quality function, and let $S\in X^{*}$ be a database s.t. $Q(S,\cdot)$ is quasi-concave. Let $\alpha\leq\frac{1}{2}$ and let $\beta,\epsilon,\delta,r,N$ be s.t.

When executed on $S,[0,T],r,\alpha,\epsilon,\delta,N$ , algorithm $RecConcave$ fails to output an index $j$ s.t. $Q(S,j)\geq(1-\alpha)r$ with probability at most $2\beta N$ .

The proof is by induction on the number of recursive calls, denoted as $t$ . For $t=1$ (i.e., $T\leq 32$ or $N=1$ ), the exponential mechanism ensures that for $r\geq\frac{2}{\alpha\epsilon}\log(\frac{T}{\beta})$ , the probability of algorithm $RecConcave$ failing to output a $j$ s.t. $Q(S,j)\geq(1-\alpha)r$ is at most $\beta$ .

Assume that the stated lemma holds whenever algorithm $RecConcave$ performs at most $t-1$ recursive calls, and let $S,[0,T],r,\alpha,\epsilon,\delta,N$ be inputs (satisfying the conditions of Lemma 3.14) on which algorithm $RecConcave$ performs $t$ recursive calls. Consider the first call in the execution of $RecConcave$ on those inputs, and denote by $T^{\prime}$ the smallest power of 2 s.t. $T^{\prime}\geq T$ . In order to apply the inductive assumption, we need to show that for the recursive call in step 6, all the conditions of Lemma 3.14 hold.

We first note that by Claim 3.13, the quality function $q(S,\cdot)$ defined in step 4 is quasi-concave. We next show that the recursive call is performed with an appropriate quality promise $R=\frac{\alpha}{2}r$ . The conditions of the lemma ensure that $L(S,0)\geq r$ , and, by definition, we have that $L(S,\log(T^{\prime})+1)\leq 0$ . There exists therefore a $j\in[0,\log(T^{\prime})]$ for which $L(S,j)\geq(1-\frac{\alpha}{2})r$ , and $L(S,j+1)<(1-\frac{\alpha}{2})r$ . Plugging these inequalities in the definition of $q(S,j)$ we get that $q(S,j)\geq\frac{\alpha}{2}r$ . Therefore, there exists an index $j\in[0,\log(T^{\prime})]$ with quality $q(S,j)\geq R$ . Moreover, the recursive call of step 6 executes $RecConcave$ on the range $[0,\log(T^{\prime})]=[0,\lceil\log(T)\rceil]$ with $(N-1)$ as the bound on the recursion depth, with $\widetilde{\alpha}\triangleq\frac{1}{4}$ as the approximation parameter, and with a quality promise $R$ satisfying

We next show that w.h.p. at least one of the two intervals $A,B$ chosen on Step 9, contains a lot of points with high score. Denote the index returned by the recursive call of step 6 as $k$ . By the inductive assumption, with probability at least $(1-2\beta(N-1))$ , the index $k$ is s.t. $q(S,k)\geq(1-\frac{1}{4})R=\frac{3\alpha}{8}r$ ; we proceed with the analysis assuming that this event happened. By the definition of $q(S,k)$ , this means that $L(S,k)\geq q(S,k)+(1-\alpha)r\geq(1-\frac{5\alpha}{8})r$ and that $L(S,k+1)\leq r-q(S,k)\leq(1-\frac{3\alpha}{8})r$ . That is, there exists an interval $G$ of length $2^{k}$ s.t. $\forall i\in G$ it holds that $Q(S,i)\geq(1-\frac{5\alpha}{8})r$ , and every interval of length $2\cdot 2^{k}$ contains at least one point $i$ s.t. $Q(S,i)\leq(1-\frac{3\alpha}{8})r$ .

Note that if $P_{1}$ (or $P_{2}$ ) is trimmed, then there are no points on the left of (or on the right of) $P$ . So, the interval $P$ contains the point $p$ with quality $Q(S,p)\geq r$ and every point $i\in[0,T]\setminus P$ has quality of at most $(1-\frac{3\alpha}{8})r$ . Moreover, $P$ is of length $4\cdot 2^{k}-1$ . As the intervals of the partitions $\{A_{i}\}$ and $\{B_{i}\}$ are of length $8\cdot 2^{k}$ , and the $\{B_{i}\}$ ’s are shifted by $4\cdot 2^{k}$ , there must exist an interval $C\in\{A_{i}\}\cup\{B_{i}\}$ s.t. $P\subseteq C$ . Assume without loss of generality that $C\in\{A_{i}\}$ .

Recall that the quality $u(S,\cdot)$ of an interval $I$ is defined as the maximal quality $Q(S,i)$ of a point $i\in I$ . Therefore, as $p\in C$ , the quality of $C$ is at least $r$ . On the other hand, the quality of every $A_{i}\neq C$ is at most $(1-\frac{3\alpha}{8})r$ . That is, the interval $C$ scores better (under $u$ ) than any other interval in $\{A_{i}\}$ by at least an additive factor of $\frac{3\alpha}{8}r\geq\frac{1}{\epsilon}\log(\frac{1}{\beta\delta})$ . By the properties of $\mathcal{A}_{\rm dist}$ , with probability at least $(1-\beta)$ , the chosen interval $A$ in step 9 is s.t. $P\subseteq A$ . We proceed with the analysis assuming that this is the case.

Consider again the interval $P$ containing the point $p$ , and recall that there exists an interval $G$ of length $2^{k}$ containing only points with quality $Q(S,\cdot)$ of at least $(1-\frac{5\alpha}{8})r$ . Such an interval must be contained in $P$ . Otherwise, by the quasi-concavity of $Q(S,\cdot)$ , all the points between $G$ and the point $p$ must also have quality at least $(1-\frac{5\alpha}{8})r$ , and, in particular, $P$ must indeed contain such an interval.

So, the chosen interval $A$ in step 9 is of length $8\cdot 2^{k}$ , and it contains a sub interval of length $2^{k}$ in which every point has quality at least $(1-\frac{5\alpha}{8})r$ . That is, at least $\frac{1}{16}$ of the points in $(A\cup B)$ have quality at least $(1-\frac{5\alpha}{8})r$ . Therefore, as $r\geq\frac{4}{\alpha\epsilon}\log(\frac{16}{\beta})$ , the exponential mechanism ensures that the probability of step 10 failing to return a point $h\in(A\cup B)$ with $Q(S,h)\geq(1-\alpha)r$ is at most $\beta$ . To see this, note that as there are at least $\frac{1}{16}|A\cup B|$ solutions with quality at least $(1-\frac{5\alpha}{8})r$ , the probability that the exponential mechanism outputs a specific solution $h\in(A\cup B)$ with $Q(S,h)<(1-\alpha)r$ is at most $\frac{\exp(\frac{\epsilon}{2}(1-\alpha)r)}{\frac{1}{16}|A\cup B|\exp(\frac{\epsilon}{2}(1-\frac{5\alpha}{8})r)}$ . Hence, by a union bound over the at most $|A\cup B|$ such solutions, the probability that the exponential mechanism outputs any solution $h\in(A\cup B)$ with $Q(S,h)<(1-\alpha)r$ is at most $16\frac{\exp(\frac{\epsilon}{2}(1-\alpha)r)}{\exp(\frac{\epsilon}{2}(1-\frac{5\alpha}{8})r)}$ , which is at most $\beta$ for our choice of $r$ .

All in all, with probability at least $(1-2\beta(N-1)-2\beta)=(1-2\beta N)$ , algorithm $RecConcave$ returns an index $j\in[0,T]$ s.t. $Q(S,j)\geq(1-\alpha)r$ . ∎

Combining Lemma 3.12 and Lemma 3.14 we get the following theorem.

Let algorithm $RecConcave$ be executed on a range $[0,T]$ , a sensitivity-1 quality function $Q$ , a database $S$ , a bound on the recursion depth $N$ , privacy parameters $\frac{\epsilon}{3N},\frac{\delta}{3N}$ , approximation parameter $\alpha$ , and a quality promise $r$ . The following two statements hold:

Algorithm $RecConcave$ preserves $(\epsilon,\delta)$ -differential privacy.

If $S$ is s.t. $Q(S,\cdot)$ is quasi-concave, and if

then algorithm $RecConcave$ fails to output an index $j$ s.t. $Q(S,j)\geq(1-\alpha)r$ with probability at most $\beta$ .

Recall that the number of recursive calls on a range $[0,T]$ is always bounded by $\log^{*}(T)$ , and note that for $N=\log^{*}(T)$ we have that $\log^{\lceil N\rceil}(T)\leq 1$ . Therefore, the promise requirement in Inequality (2) can be replaced with $8^{\log^{*}(T)}\cdot\frac{36\log^{*}(T)}{\alpha\epsilon}\log\Big{(}\frac{12\log^{*}(T)}{\beta\delta}\Big{)}$ .
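
To get a feeling for how slowly this promise requirement grows, the expression above can be evaluated numerically. The sketch assumes that $\log^{*}$ is taken to base 2 and that the outer logarithm is natural; the base is not fixed by the text here, so the absolute numbers are only indicative. The function names are hypothetical.

```python
import math

def log_star(t):
    """Number of times log2 must be applied to t before the result is at most 1."""
    count = 0
    while t > 1:
        t = math.log2(t)
        count += 1
    return count

def promise_bound(T, alpha, beta, eps, delta):
    """8^{log*(T)} * (36 log*(T) / (alpha * eps)) * log(12 log*(T) / (beta * delta))."""
    n = log_star(T)
    return 8 ** n * (36 * n / (alpha * eps)) * math.log(12 * n / (beta * delta))

small = promise_bound(2 ** 16, alpha=0.1, beta=0.1, eps=1.0, delta=1e-6)
large = promise_bound(2 ** 65536, alpha=0.1, beta=0.1, eps=1.0, delta=1e-6)
```

Moving from $T=2^{16}$ to $T=2^{65536}$ increases $\log^{*}(T)$ only from 4 to 5, so the bound grows by a modest constant factor while $\log(T)$ grows by a factor of 4096.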

The computational efficiency of algorithm $RecConcave$ depends on the quality function $Q(\cdot,\cdot)$ . Note, however, that it suffices to efficiently implement the top level call (i.e., without the recursion). This is true because an iteration of algorithm $RecConcave$ , operating on a range $[0,T]$ , can easily be implemented in time $\mathop{\rm{poly}}\nolimits(T)$ , and the range given as input to recursive calls is logarithmic in the size of the initial range.

As we will now see, algorithm $RecConcave$ can be used as a proper $(\alpha,\beta,\epsilon,\delta,m)$ -private learner for $\operatorname*{\tt THRESH}_{d}$ . Recall Example 3.8 (showing that the goal of choosing a hypothesis with small empirical error can be viewed as a quasi-concave promise problem), and consider the following algorithm.

Algorithm $LearnThresholds$
Input: A labeled sample $S=(x_{i},y_{i})_{i=1}^{m}$ and parameters $\alpha,\epsilon,\delta,N$ .
1. Denote $\hat{\alpha}=\frac{\alpha}{2},\;\;\hat{\epsilon}=\frac{\epsilon}{3N}$ , and $\hat{\delta}=\frac{\delta}{3N}$ .
2. For every $0\leq j\leq 2^{d}$ , define $Q(S,j)=|\{i:c_{j}(x_{i})=y_{i}\}|$ .
3. Execute algorithm $RecConcave$ on the sample $S$ , the range $[0,2^{d}]$ , the quality function $Q(\cdot,\cdot)$ , the promise $m$ , and parameters $\hat{\alpha},\hat{\epsilon},\hat{\delta},N$ . Denote the returned value as $k$ .
4. Return $c_{k}$ .

For every $1\leq N\leq\log^{*}(2^{d})$ , Algorithm $LearnThresholds$ is an efficient proper $(\alpha,\beta,\epsilon,\delta,m)$ -PPAC learner for $\operatorname*{\tt THRESH}_{d}$ , where the sample size is

By Theorem 3.15, algorithm $LearnThresholds$ is $(\epsilon,\delta)$ -differentially private. For the utility analysis, fix a target concept $c_{j}\in\operatorname*{\tt THRESH}_{d}$ , and a distribution $\mathcal{D}$ on $X_{d}$ , and let $S$ be a sample drawn i.i.d. from $\mathcal{D}$ and labeled by $c_{j}$ . Define the following two good events:

$E_{1}$ : $\forall\;h\in\operatorname*{\tt THRESH}_{d},\;\;|{\rm error}_{\mathcal{D}}(h,c_{j})-{\rm error}_{S}(h)|\leq\frac{\alpha}{2}$ .

$E_{2}$ : Algorithm $RecConcave$ returns $k$ s.t. ${\rm error}_{S}(c_{k})\leq\frac{\alpha}{2}$ .

Clearly, when both $E_{1},E_{2}$ occur, algorithm $LearnThresholds$ succeeds in outputting an $\alpha$ -good hypothesis for $c_{j}$ and $\mathcal{D}$ . Note that as $\operatorname{\rm VC}(\operatorname*{\tt THRESH}_{d})=1$ , Theorem 2.14 ensures that for $m\geq\frac{200}{\alpha^{2}}\ln(\frac{4}{\alpha\beta})$ , event $E_{1}$ happens with probability at least $(1-\frac{\beta}{2})$ .

Next, note that for the target concept $c_{j}$ it holds that $Q(S,j)=m$ , and algorithm $RecConcave$ is executed on step 3 with a valid quality promise. Moreover, as shown in Example 3.8, algorithm $RecConcave$ is executed with a quasi-concave quality function.

So, algorithm $RecConcave$ is executed on step 3 with a valid quality promise and with a quasi-concave quality function. For

algorithm $RecConcave$ ensures that with probability at least $(1-\frac{\beta}{2})$ , the index $k$ returned at step 3 is s.t. $Q(S,k)\geq(1-\frac{\alpha}{2})m$ . The empirical error of $c_{k}$ is at most $\frac{\alpha}{2}$ in such a case. Therefore, Event $E_{2}$ happens with probability at least $(1-\frac{\beta}{2})$ . Overall, we conclude that $LearnThresholds$ is a proper $(\alpha,\beta,\epsilon,\delta,m)$ -PPAC learner for $\operatorname*{\tt THRESH}_{d}$ , where

By using $N=\log^{*}(2^{d})$ in the above theorem, we can bound the sample complexity of $LearnThresholds$ by

5 Axis-Aligned Rectangles in High Dimension

Consider the class of all axis-aligned rectangles (or hyperrectangles) in the Euclidean space $\mathbb{R}^{n}$ . A concept in this class could be thought of as the product of $n$ intervals, one on each axis. We briefly describe an efficient approximate-private proper-learner for a discrete version of this class.

Let $X_{d}^{n}=(\{0,1\}^{d})^{n}$ denote a discrete $n$ -dimensional domain, in which every axis consists of $2^{d}$ points $\{0,1,\ldots,2^{d}-1\}$ . For every $\vec{a}=(a_{1},\ldots,a_{n}),\vec{b}=(b_{1},\ldots,b_{n})\in X_{d}^{n}$ define the concept $c_{[\vec{a},\vec{b}]}:X_{d}^{n}\rightarrow\{0,1\}$ where $c_{[\vec{a},\vec{b}]}(\vec{x})=1$ if and only if for every $1\leq i\leq n$ it holds that $a_{i}\leq x_{i}\leq b_{i}$ . Define the concept class of all axis-aligned rectangles over $X^{n}_{d}$ as $\operatorname*{\tt RECTANGLE}_{d}^{n}=\{c_{[\vec{a},\vec{b}]}\}_{\vec{a},\vec{b}\in X_{d}^{n}}$ .

The VC dimension of this class is $2n$ , and, thus, it can be learned non-privately with sample complexity $O_{\alpha,\beta}(n)$ . Note that $|\operatorname*{\tt RECTANGLE}_{d}^{n}|=2^{O(nd)}$ , and, therefore, the generic construction of Kasiviswanathan et al. yields an inefficient proper $\epsilon$ -private learner for this class with sample complexity $O_{\alpha,\beta,\epsilon}(nd)$ .

Kearns gave an efficient (noise resistant) non-private learner for this class. The learning model there was a variant of the statistical queries model, in which the learner is also given access to the underlying distribution $\mathcal{D}$ . Every learning algorithm in the statistical queries model can be transformed to satisfy differential privacy while preserving efficiency. However, as Kearns’ algorithm assumes direct access to $\mathcal{D}$ , this transformation cannot be applied directly.

Kearns’ algorithm begins by sampling $\mathcal{D}$ and using the drawn samples to divide each axis $i\in[n]$ into $O(n/\alpha)$ intervals ${\cal I}_{i}=\{I\}$ with the property that the $x_{i}$ component of a random point from $\mathcal{D}$ is approximately equally likely to fall into each of the intervals in ${\cal I}_{i}$ . The algorithm proceeds by estimating the boundary of the target rectangle separately for every dimension $i$ : For every interval $I\in{\cal I}_{i}$ , the algorithm uses statistical queries to estimate the probability that a positively labeled input has its $x_{i}$ component in $I$ , i.e.,

The algorithm places the left boundary of the hypothesis rectangle in the $i$ -th dimension at the left-most interval $I\in{\cal I}_{i}$ such that $p_{I}$ is significant, and analogously on the right.

Note that once the interval sets ${\cal I}_{i}$ are defined for each axis $i\in[n]$ , estimating every single $p_{I}$ can be done via statistical queries, and can, therefore, be made private using the transformation of statistical query algorithms into differentially private ones. Alternatively, estimating (simultaneously) all of the $p_{I}$ ’s (on the $i^{\text{th}}$ axis) could be done privately using the Laplace mechanism. This use of the Laplace mechanism is known as a histogram (see Theorem 2.24).
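
A minimal sketch of such a noisy histogram for one axis follows. It assumes the usual calibration in which replacing one database entry changes at most two cells by one each, so Laplace noise of scale $2/\epsilon$ is added to every cell; the exact statement of Theorem 2.24 may differ, and the function names and toy data are hypothetical.

```python
import random

def laplace(scale, rng=random):
    """Laplace(scale) noise, sampled as a difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_positive_histogram(sample, boundaries, eps, noise=None):
    """Noisy counts of positively labeled points per cell [boundaries[t], boundaries[t+1]).

    `sample` holds pairs (x_i, label) for a single axis i.
    """
    if noise is None:
        noise = lambda: laplace(2 / eps)  # assumed calibration, see lead-in
    counts = [0] * (len(boundaries) - 1)
    for x, y in sample:
        if y != 1:
            continue
        for t in range(len(counts)):
            if boundaries[t] <= x < boundaries[t + 1]:
                counts[t] += 1
                break
    return [c + noise() for c in counts]

S = [(1, 1), (2, 1), (5, 1), (9, 0), (12, 1)]
exact = noisy_positive_histogram(S, [0, 4, 8, 16], eps=1.0, noise=lambda: 0.0)
private = noisy_positive_histogram(S, [0, 4, 8, 16], eps=1.0)
```

Dividing the noisy counts by the sample size gives estimates of the $p_{I}$ values for all intervals on the axis at once.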

Thus, our task is to privately partition each axis. The straightforward approach for privately finding ${\cal I}_{i}$ is by a noisy binary search for the boundary of each of the $n/\alpha$ intervals (in each axis). This would result in $\Omega(d)$ noisy comparisons, which, in turn, results in a private learner with a high sample complexity.

We now overcome this issue using a sanitizer for $\operatorname*{\tt THRESH}_{d}$ . Such a sanitizer will be constructed in Section 4.2; here we use it for privately finding ${\cal I}_{i}$ .

Fix $\alpha,\beta,\epsilon,\delta$ . There exists an efficient $(\alpha,\beta,\epsilon,\delta,m)$ -sanitizer for $\operatorname*{\tt THRESH}_{d}$ , where $m=\widetilde{O}_{\beta,\epsilon,\delta}\left(\frac{1}{\alpha^{2.5}}\cdot 8^{\log^{*}(d)}\right)$ .

As we next explain, such a sanitizer can be used to (privately) divide the axes. Given an interval $[a,b]\subseteq X_{d}$ and a sample $S$ , we denote the probability mass of $[a,b]$ under $\mathcal{D}$ as $\mathcal{D}[a,b]$ , and the number of sample points in this interval as $\#_{S}[a,b]$ . Standard arguments in learning theory (specifically, Theorem 2.14) state that for a large enough sample (whose size is bigger than the VC dimension of the class of intervals) w.h.p. $\frac{1}{|S|}\#_{S}[a,b]\approx\mathcal{D}[a,b]$ for every interval $[a,b]\subseteq X_{d}$ .

On an input database $S\in(X_{d})^{*}$ , such a sanitizer for $\operatorname*{\tt THRESH}_{d}$ outputs an alternative database $\hat{S}\in(X_{d})^{*}$ s.t. $\frac{1}{|\hat{S}|}\#_{\hat{S}}[0,b]\approx\frac{1}{|S|}\#_{S}[0,b]$ for every interval $[0,b]\subseteq X_{d}$ . Hence, for every interval $[a,b]\subseteq X_{d}$ we have that

$$\mathcal{D}[a,b]\approx\frac{1}{|S|}\#_{S}[a,b]=\frac{1}{|S|}\#_{S}[0,b]-\frac{1}{|S|}\#_{S}[0,a-1]\approx\frac{1}{|\hat{S}|}\#_{\hat{S}}[0,b]-\frac{1}{|\hat{S}|}\#_{\hat{S}}[0,a-1]=\frac{1}{|\hat{S}|}\#_{\hat{S}}[a,b].$$

So, in order to divide the $i^{\text{th}}$ axis we apply the above mentioned sanitizer, and divide the axis using the returned sanitized database. In order to accumulate error of up to $\alpha/n$ on each axis (as required by Kearns’ algorithm), we need to execute the above mentioned sanitizer with an approximation parameter of (roughly) $\alpha/n$ . Every such execution requires, therefore, a sample of $\widetilde{O}_{\alpha,\beta,\epsilon,\delta}\left(n^{2.5}\cdot 8^{\log^{*}(d)}\right)$ elements. As there are $n$ such executions (one for each axis), using Theorem 2.5 (composition theorem), the described learner is of sample complexity $\widetilde{O}_{\alpha,\beta,\epsilon,\delta}\left(n^{3}\cdot 8^{\log^{*}(d)}\right)$ .
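
The division of a single axis can be sketched as follows, assuming a sanitized database (here the hypothetical list `sanitized_db`) has already been produced; the sanitizer itself is not implemented. Interval counts are derived from prefix counts, which are exactly the queries a sanitizer for $\operatorname*{\tt THRESH}_{d}$ approximately preserves.

```python
def count_in(db, a, b):
    """#[a, b] computed from prefix counts as #[0, b] - #[0, a-1]."""
    prefix = lambda t: sum(1 for x in db if x <= t)
    return prefix(b) - prefix(a - 1)

def equal_mass_boundaries(db, parts):
    """Right endpoints splitting the sorted database into `parts` groups of near-equal size.

    Ties between equal values may make the groups uneven; this sketch ignores that.
    """
    xs = sorted(db)
    return [xs[(i * len(xs)) // parts - 1] for i in range(1, parts + 1)]

# Hypothetical output of a sanitizer on one axis of X_4.
sanitized_db = [1, 1, 2, 3, 5, 8, 8, 9, 12, 14, 15, 15]
ends = equal_mass_boundaries(sanitized_db, parts=3)
starts = [0] + [e + 1 for e in ends[:-1]]
masses = [count_in(sanitized_db, a, b) for a, b in zip(starts, ends)]
```

Since the partition is computed only from the sanitized database, it is a post-processing of a differentially private output and incurs no additional privacy cost.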

There exists an efficient $(\alpha,\beta,\epsilon,\delta,m)$ -PPAC proper-learner for $\operatorname*{\tt RECTANGLE}_{d}^{n}$ , where

This should be contrasted with $\Theta_{\alpha,\beta}(n)$ , which is the non-private sample complexity for this class (as the $\operatorname{\rm VC}$ -dimension of $\operatorname*{\tt RECTANGLE}_{d}^{n}$ is $2n$ ), and with $\Theta_{\alpha,\beta,\epsilon}(nd)$ which is the pure-private sample complexity for this class. The general construction of Kasiviswanathan et al. yields an (inefficient) pure-private proper-learner for this class with sample complexity $O_{\alpha,\beta,\epsilon}(nd)$ . Feldman and Xiao showed that this is in fact optimal, and every $\epsilon$ -private (proper or improper) learner for this class must have sample complexity $\Omega(nd)$ .

Sanitization with Approximate Privacy

In this section we present $(\epsilon,\delta)$ -private sanitizers for several concept classes, and separate the database size necessary for $(\epsilon,0)$ -private sanitizers from the database size sufficient for $(\epsilon,\delta)$ -private sanitizers.

Recall that in our private PAC learner for $\operatorname{\tt POINT}_{d}$ , given a typical labeled sample, there exists a unique concept in the class that stands out (we used algorithm $\mathcal{A}_{\rm dist}$ to identify it). This is not the case in the context of sanitization, as a given database $S$ can have many $\alpha$ -close sanitized databases $\hat{S}$ . We will overcome this issue by using the following private tool for approximating a restricted class of choosing problems.

A function $q:X^{*}\times\mathcal{F}\rightarrow\N$ defines an optimization problem over the domain $X$ and solution set $\mathcal{F}$ : Given a dataset $S$ over domain $X$ choose $f\in\mathcal{F}$ that (approximately) maximizes $q(S,f)$ . We are interested in a subset of these optimization problems, which we call bounded-growth choice problems. In this section we consider a database $S\subseteq X^{*}$ as a multiset.

Given $q$ and $S$ define $\mathop{\rm{opt}}\nolimits_{q}(S)=\max_{f\in\mathcal{F}}\{q(S,f)\}$ . A solution $f\in\mathcal{F}$ is called $\alpha$ -good for a database $S$ if $q(S,f)\geq\mathop{\rm{opt}}\nolimits_{q}(S)-\alpha|S|$ .

A quality function $q:X^{*}\times\mathcal{F}\rightarrow\N$ is $k$ -bounded-growth if:

$q(\emptyset,f)=0$ for all $f\in\mathcal{F}$ .

If $S_{2}=S_{1}\cup\{x\}$ , then (i) $q(S_{1},f)+1\geq q(S_{2},f)\geq q(S_{1},f)$ for all $f\in\mathcal{F}$ ; and (ii) there are at most $k$ solutions $f\in\mathcal{F}$ s.t. $q(S_{2},f)>q(S_{1},f)$ .

In words, the second requirement means that (i) Adding an element to the database could either have no effect on the score of a solution $f$ , or can increase the score by exactly $1$ ; and (ii) There could be at most $k$ solutions whose scores are increased (by $1$ ). Note that a $k$ -bounded-growth quality function is, in particular, a sensitivity-1 function as two neighboring $S,S^{\prime}$ must be of the form $D\cup\{x_{1}\}$ and $D\cup\{x_{2}\}$ respectively. Hence, $q(S,f)-q(S^{\prime},f)\leq q(D,f)+1-q(D,f)=1$ for every solution $f$ .

As an example of a 1-bounded-growth quality function, consider the following $q:X^{*}\times X\rightarrow\N$ . Given a database $S=(x_{1},\ldots,x_{m})$ containing elements from some domain $X$ , define $q(S,a)=\big{|}\{i:x_{i}=a\}\big{|}$ . That is, $q(S,a)$ is the number of appearances of $a$ in $S$ . Clearly, $q(\emptyset,f)=0$ for all $f\in X$ . Moreover, adding an element $a\in X$ to a database $S$ increases $q(S,a)$ by 1, and does not affect $q(S,b)$ for any other $b\neq a$ .
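The counting example can be checked directly on a toy database. The following sketch (with hypothetical helper names `q_count` and `increased_solutions`) confirms that adding one element raises the score of exactly one solution, and by exactly 1.

```python
from collections import Counter

def q_count(S, a):
    """q(S, a): number of appearances of a in the database S."""
    return Counter(S)[a]

def increased_solutions(S, x, solutions):
    """Solutions whose score strictly increases when x is added to S."""
    S2 = list(S) + [x]
    return [f for f in solutions if q_count(S2, f) > q_count(S, f)]

S = ["a", "b", "a", "c"]
domain = ["a", "b", "c", "d"]
# Adding "a" raises only the score of "a".
grown = increased_solutions(S, "a", domain)
```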

The choosing mechanism (in Figure 8) is a private algorithm for approximately solving bounded-growth choice problems. Step 1 of the algorithm checks whether a good solution exists, as otherwise any solution is approximately optimal (and the mechanism returns $\bot$ ). Step 2 invokes the exponential mechanism, but with the small set $G(S)$ instead of $\mathcal{F}$ .
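Since Figure 8 is not reproduced here, the following is only a rough sketch of a two-step mechanism of this shape, pieced together from the quantities that appear in the analysis: noise $\mathop{\rm{Lap}}\nolimits(\frac{4}{\epsilon})$ added to the maximum score, the set $G(S)$ of solutions with score at least 1, and selection probabilities proportional to $\exp(\frac{\epsilon}{4}q(S,f))$ . The halting threshold $\frac{\alpha m}{2}$ , the finite `candidates` argument, and the guard for an empty $G(S)$ are assumptions of this sketch, and the code is not a privacy-audited implementation.

```python
import math
import random

def laplace(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def choosing_mechanism(S, q, candidates, alpha, eps, rng=None):
    """Sketch of a two-step choosing mechanism.

    S          -- database (a list)
    q          -- quality function q(S, f), assumed k-bounded-growth
    candidates -- finite iterable containing every f with q(S, f) >= 1
                  (an assumption that keeps the sketch enumerable)
    Returns a solution, or None standing in for the 'no solution' symbol.
    """
    rng = rng or random.Random()
    m = len(S)
    scores = {f: q(S, f) for f in candidates}
    good = {f: s for f, s in scores.items() if s >= 1}  # the set G(S)

    # Step 1: noisy test for the existence of a good solution.
    best = max(scores.values(), default=0) + laplace(4.0 / eps, rng)
    if best < alpha * m / 2 or not good:  # threshold and guard are assumed
        return None

    # Step 2: exponential mechanism restricted to G(S), with
    # Pr[f] proportional to exp(eps/4 * q(S, f)); scores are shifted by the
    # maximum to avoid floating-point overflow.
    top = max(good.values())
    fs = list(good)
    weights = [math.exp(eps / 4.0 * (good[f] - top)) for f in fs]
    return rng.choices(fs, weights=weights, k=1)[0]
```

Restricting Step 2 to $G(S)$ is what keeps the running time and the error independent of $|\mathcal{F}|$ ; the price is the additive $\delta$ in the privacy guarantee.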

When $q$ is a $k$ -bounded-growth quality function, the choosing mechanism preserves $(\epsilon,\delta)$ -differential privacy for databases of $m\geq\frac{16}{\alpha\epsilon}\ln(\frac{16k}{\alpha\beta\epsilon\delta})$ elements.

Let $S,S^{\prime}$ be neighboring databases of $m$ elements. We need to show that $\Pr[A(S)\in R]\leq\exp(\epsilon)\cdot\Pr[A(S^{\prime})\in R]+\delta$ for any set of outputs $R$ . Note first that by the properties of the Laplace Mechanism,

Using $m\geq\frac{16}{\alpha\epsilon}\ln(\frac{1}{2\delta})$ , we get that

Let $G(S)$ and $G(S^{\prime})$ be the sets used in step 2 in the execution on $S$ and on $S^{\prime}$ respectively. We will show that the following two facts hold:

$Fact\;1:$ For every $f\in G(S)\setminus G(S^{\prime})$ , it holds that $\Pr[A(S)=f]\leq\frac{\delta}{k}$ .

$Fact\;2:$ For every possible output $f\notin G(S)\setminus G(S^{\prime})$ , it holds that $\Pr[A(S)=f]\leq e^{\epsilon}\Pr[A(S^{\prime})=f]$ .

For proving Fact 1, let $f\in G(S)\setminus G(S^{\prime})$ . That is, $q(S,f)\geq 1$ and $q(S^{\prime},f)=0$ . As $q$ is (in particular) a sensitivity-1 function, it must be, therefore, that $q(S,f)=1$ . As there exists $\hat{f}\in\mathcal{F}$ with $q(S,\hat{f})\geq\frac{\alpha m}{4}$ , we have that

which is at most $\frac{\delta}{k}$ for $m\geq\frac{16}{\alpha\epsilon}(\frac{\epsilon}{4}+\ln(\frac{k}{\delta}))$ .

For proving Fact 2, let $f\notin G(S)\setminus G(S^{\prime})$ be a possible output of $A(S)$ . If $f\notin(G(S)\cup\{\bot\})$ then trivially $\Pr[A(S)=f]=0\leq e^{\epsilon}\Pr[A(S^{\prime})=f]$ . We have already established (in Inequality (3)) that for $f=\bot$ it holds that $\Pr[A(S)=\bot]\leq e^{\epsilon/4}\Pr[A(S^{\prime})=\bot]$ . It remains, hence, to deal with the case where $f\in G(S)\cap G(S^{\prime})$ . For this case, we use the following Fact 3, proved below.

$Fact\;3:$ $\sum\limits_{h\in G(S^{\prime})}\exp(\frac{\epsilon}{4}q(S^{\prime},h))\leq e^{\epsilon/2}\cdot\sum\limits_{h\in G(S)}\exp(\frac{\epsilon}{4}q(S,h))$ .

Using Fact 3, for every possible output $f\in G(S)\cap G(S^{\prime})$ we have that

We now prove Fact 3. Denote $\mathcal{X}\triangleq\sum\limits_{h\in G(S)}\exp(\frac{\epsilon}{4}q(S,h))$ . We first show that

That is, we need to show that $\mathcal{X}\geq\frac{k}{e^{\epsilon/4}-1}$ . As $1+\frac{\epsilon}{4}\leq e^{\epsilon/4}$ , it suffices to show that $\mathcal{X}\geq\frac{4k}{\epsilon}$ . Recall that there exists a solution $\hat{f}$ s.t. $q(S,\hat{f})\geq\frac{\alpha m}{4}$ . Therefore, $\mathcal{X}\geq\exp(\frac{\epsilon}{4}\frac{\alpha m}{4})$ , which is at least $\frac{4k}{\epsilon}$ for $m\geq\frac{16}{\alpha\epsilon}\ln(\frac{4k}{\epsilon})$ . This proves (4).

Now, recall that as $q$ is $k$ -growth-bounded, for every $h\in\mathcal{F}$ it holds that $|q(S,h)-q(S^{\prime},h)|\leq 1$ . Moreover, $|G(S^{\prime})\setminus G(S)|\leq k$ , and every $h\in(G(S^{\prime})\setminus G(S))$ obeys $q(S^{\prime},h)=1$ . Hence,

This concludes the proof of Fact 3, and completes the proof of the lemma. ∎

The utility analysis for the choosing mechanism is rather straightforward:

When $q$ is a $k$ -bounded-growth quality function, given a database $S$ of $m\geq\frac{16}{\alpha\epsilon}\ln(\frac{16k}{\alpha\beta\epsilon\delta})$ elements, the choosing mechanism outputs an $\alpha$ -good solution for $S$ with probability at least $1-\beta$ .

Note that if $q(S,f)<\alpha m$ for every solution $f$ , then every solution is an $\alpha$ -good solution, and the mechanism cannot fail. Assume, therefore, that there exists a solution $f$ s.t. $q(S,f)\geq\alpha m$ , and recall that the mechanism defines ${\rm best}(S)$ as $\max_{f\in\mathcal{F}}\left\{q(S,f)\right\}+\mathop{\rm{Lap}}\nolimits(\frac{4}{\epsilon})$ . Now consider the following two good events:

$E_{2}:$ The exponential mechanism chooses a solution $f$ s.t. $q(S,f)\geq\mathop{\rm{opt}}\nolimits_{q}(S)-\alpha m$ .

If $E_{2}$ occurs then the mechanism outputs an $\alpha$ -good solution. Note that the event $E_{2}$ is contained inside the event $E_{1}$ , and, therefore, $\Pr[E_{2}]=\Pr[E_{1}\wedge E_{2}]=\Pr[E_{1}]\cdot\Pr[E_{2}|E_{1}]$ . By the properties of the Laplace Mechanism, $\Pr[E_{1}]\geq\left(1-\frac{1}{2}\exp(-\frac{\epsilon}{4}\frac{\alpha m}{2})\right)$ , which is at least $(1-\frac{\beta}{2})$ for $m\geq\frac{8}{\alpha\epsilon}\ln(\frac{1}{\beta})$ .

By the growth-boundedness of $q$ , and as $S$ is of size $m$ , there are at most $km$ possible solutions $f$ with $q(S,f)>0$ . That is, $|G(S)|\leq km$ . By the properties of the Exponential Mechanism, we have that $\Pr[E_{2}|E_{1}]\geq\left(1-km\cdot\exp(-\frac{\alpha\epsilon m}{4})\right)$ , which is at least $(1-\frac{\beta}{2})$ for $m\geq\frac{8}{\alpha\epsilon}\ln(\frac{16k}{\alpha\beta\epsilon})$ . For our choice of $m$ we have, therefore, that $\Pr[E_{2}]\geq(1-\frac{\beta}{2})(1-\frac{\beta}{2})\geq(1-\beta)$ .

All in all, for $m\geq\frac{16}{\alpha\epsilon}\ln(\frac{16k}{\alpha\beta\epsilon\delta})$ we get that with probability at least $(1-\beta)$ it outputs an $\alpha$ -good solution for its input database. ∎
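To get a feel for the database size in the two lemmas, the bound $m\geq\frac{16}{\alpha\epsilon}\ln(\frac{16k}{\alpha\beta\epsilon\delta})$ can be evaluated numerically. The helper below is a small illustrative calculator (the function name is an assumption); note that the size of the solution set $\mathcal{F}$ does not appear among its arguments.

```python
import math

def choosing_mechanism_min_size(alpha, beta, eps, delta, k):
    """Smallest integer m with m >= 16/(alpha*eps) * ln(16k/(alpha*beta*eps*delta))."""
    bound = 16.0 / (alpha * eps) * math.log(16.0 * k / (alpha * beta * eps * delta))
    return math.ceil(bound)

# Example parameters chosen for illustration only.
m_example = choosing_mechanism_min_size(alpha=0.1, beta=0.1, eps=1.0, delta=1e-6, k=1)
```

Shrinking $\delta$ or $\beta$ increases the bound only logarithmically, whereas halving $\alpha$ or $\epsilon$ at least doubles it.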

Beimel et al. showed that every pure $\epsilon$ -private sanitizer for $\operatorname{\tt POINT}_{d}$ , must operate on databases of $\Omega(d)$ elements. In this section we present an $(\epsilon,\delta)$ -private sanitizer for $\operatorname{\tt POINT}_{d}$ with sample complexity $O_{\alpha,\beta,\epsilon,\delta}(1)$ . This separates the database size necessary for $(\epsilon,0)$ -private sanitizers from the database size sufficient for $(\epsilon,\delta)$ -private sanitizers.

Let $S=(x_{1},x_{2},\ldots,x_{m})\in X_{d}^{m}$ be a database of $d$ -bit strings. For every $c_{j}\in\operatorname{\tt POINT}_{d}$ , the query $Q_{c_{j}}:X_{d}^{*}\rightarrow[0,1]$ is defined to be the fraction of the strings in the database that equal $j$ , that is, $Q_{c_{j}}(S)=\frac{1}{m}\big{|}\{i\;:\;x_{i}=j\}\big{|}$ .

Our sanitizing algorithm invokes the Choosing Mechanism to choose points $x\in X_{d}$ . Consider the following $q:X_{d}^{*}\times X_{d}\rightarrow\N$ . Given a database $S\in X_{d}^{m}$ and a point $x\in X_{d}$ , define $q(S,x)$ to be the number of appearances of $x$ in $S$ . By Example 4.3, $q$ defines a 1-bounded-growth choosing problem. Moreover, given a subset $R\subseteq X_{d}$ consider the restriction of $q$ to the subset $R$ defined as $q_{R}(S,x)=q(S,x)$ for $x\in R$ and zero otherwise. The function $q_{R}$ is a 1-bounded-growth quality function. Our sanitizer $SanPoints$ appears in Figure 9.

Fix $\alpha,\beta,\epsilon,\delta$ . For $m\geq O\left(\frac{1}{\alpha^{1.5}\epsilon}\sqrt{\ln(\frac{1}{\delta})}\ln(\frac{1}{\alpha\beta\epsilon\delta})\right)$ , algorithm $SanPoints$ is an efficient $(\alpha,\beta,\epsilon,\delta,m)$ -improper-sanitizer for $\operatorname{\tt POINT}_{d}$ .

We start with the utility analysis. Fix a database $S=(x_{1},\ldots,x_{m})$ , and consider the execution of algorithm $SanPoints$ on $S$ . Denote the element chosen by the Choosing Mechanism on the $i^{\rm th}$ iteration of step 2 by $b_{i}$ , and denote the set of all such elements as $B=\{b_{1},\ldots,b_{2/\alpha}\}\setminus\{\bot\}$ . Moreover, let $R_{i}$ denote the set $R$ as it was at the beginning of the $i^{\text{th}}$ iteration. Consider the following two bad events:

$E_{1}:$ $\exists b\in B$ s.t. $|Q_{c_{b}}(S)-\operatorname{\rm Est}(b)|>\alpha$ .

$E_{2}:$ $\exists a\notin B$ s.t. $Q_{c_{a}}(S)>\alpha$ .

If neither of these two events happens, then algorithm $SanPoints$ succeeds in outputting an estimation $\operatorname{\rm Est}$ s.t. $\forall c_{j}\in\operatorname{\tt POINT}_{d}\;\;\big{|}Q_{c_{j}}(S)-\operatorname{\rm Est}(j)\big{|}\leq\alpha$ . We now bound the probability of both events.

We now bound $\Pr[E_{2}]$ . By the properties of the Choosing Mechanism (Lemma 4.5), with probability at least $(1-\frac{\alpha\beta}{4})$ , an execution of the Choosing Mechanism on step 2a returns an $\frac{\alpha}{2}$ -good solution $b_{i}$ s.t.

Using the union bound on the number of iterations, we get that with probability at least $(1-\frac{\beta}{2})$ , Inequality (5) holds for every iteration $1\leq i\leq\frac{2}{\alpha}$ . We will now see that in such a case, event $E_{2}$ does not occur. Assume to the contrary that there exists an $a\notin B$ s.t. $Q_{c_{a}}(S)>\alpha$ . Therefore, for every iteration $i$ it holds that $\max_{x\in X_{d}}\{q_{R_{i}}(S,x)\}>\alpha m$ and thus $q_{R_{i}}(S,b_{i})>\frac{\alpha}{2}m$ . This means that there exist (at least) $\frac{2}{\alpha}$ different points $b_{i}\in X_{d}$ that appear in $S$ more than $\frac{\alpha}{2}m$ times, which contradicts the fact that the size of $S$ is $m$ .

All in all, $\Pr[E_{2}]\leq\frac{\beta}{2}$ , and the probability of algorithm $SanPoints$ failing to output an estimation $\operatorname{\rm Est}$ s.t. $\forall c_{j}\in\operatorname{\tt POINT}_{d}\;\;\big{|}Q_{c_{j}}(S)-\operatorname{\rm Est}(j)\big{|}\leq\alpha$ is at most $\beta$ .
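The counting argument behind the bound on $\Pr[E_{2}]$ can be illustrated with a deliberately non-private stand-in for the selection loop: instead of calling the Choosing Mechanism, the sketch below greedily removes the most frequent remaining point for $\lceil 2/\alpha\rceil$ rounds, and then reports the largest frequency among points that were never chosen. The function names are hypothetical, and exact (rather than approximately optimal) selection is an assumption made to keep the example deterministic.

```python
import math
from collections import Counter

def greedy_heavy_points(S, alpha):
    """Non-private stand-in: pick the most frequent remaining point in each
    of ceil(2/alpha) rounds, removing it from the remaining set."""
    remaining = Counter(S)  # plays the role of the set R
    chosen = []
    for _ in range(math.ceil(2 / alpha)):
        if not remaining:
            break
        b = max(remaining, key=remaining.get)
        chosen.append(b)
        del remaining[b]
    return chosen

def max_unchosen_frequency(S, chosen):
    """Largest fraction of S occupied by a single point outside `chosen`."""
    picked = set(chosen)
    rest = Counter(x for x in S if x not in picked)
    return max(rest.values(), default=0) / len(S)

S = [0] * 50 + [1] * 30 + list(range(2, 22))  # m = 100
B = greedy_heavy_points(S, alpha=0.25)
```

With exact selection, any point left out after $2/\alpha$ rounds is no more frequent than each chosen point, so its frequency is below $\alpha/2$ ; the private version tolerates an $\frac{\alpha}{2}$ -good choice in every round and still guarantees at most $\alpha$ .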

The above algorithm $SanPoints$ can also be used as a sanitizer for the concept class $\operatorname{\tt k-POINT}_{d}$ , defined as follows. For every $A\subseteq X_{d}$ s.t. $|A|=k$ , the concept class $\operatorname{\tt k-POINT}_{d}$ contains the concept $c_{A}:X_{d}\rightarrow\{0,1\}$ , defined as $c_{A}(x)=1$ if $x\in A$ and $c_{A}(x)=0$ otherwise.

Let $S=(x_{1},x_{2},\ldots,x_{m})\in X_{d}^{m}$ be a database. For every $c_{I}\in\operatorname{\tt k-POINT}_{d}$ , the query $Q_{c_{I}}:X_{d}^{*}\rightarrow[0,1]$ is defined as $Q_{c_{I}}(S)=\frac{1}{m}\big{|}\{i\;:\;x_{i}\in I\}\big{|}$ .

Fix $k,\alpha,\beta,\epsilon,\delta$ . For $m\geq O\left(\frac{k^{1.5}}{\alpha^{1.5}\epsilon}\sqrt{\ln(\frac{1}{\delta})}\ln(\frac{k}{\alpha\beta\epsilon\delta})\right)$ , the above algorithm is an efficient $(\alpha,\beta,\epsilon,\delta,m)$ -improper-sanitizer for $\operatorname{\tt k-POINT}_{d}$ .

The privacy of the above algorithm is immediate. Fix a database $S=(x_{1},x_{2},\ldots,x_{m})\in X_{d}^{m}$ . By Theorem 4.6, with probability at least $(1-\beta)$ , the estimation $\operatorname{\rm Est}$ on step 1 is s.t. $\forall j\in X_{d}\;\;\big{|}\frac{1}{m}\sum_{i=1}^{m}{\mathds{1}_{\{x_{i}=j\}}}-\operatorname{\rm Est}(j)\big{|}\leq\frac{\alpha}{k}$ . Now fix a set $I\subseteq X_{d}$ of cardinality $k$ . As $Q_{c_{I}}(S)=\frac{1}{m}|\{i\;:\;x_{i}\in I\}|$ , we have that $|Q_{c_{I}}(S)-\sum_{i\in I}{\operatorname{\rm Est}(i)}|\leq k\frac{\alpha}{k}=\alpha$ . ∎
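The last step of the proof, summing $k$ per-point estimates that are each accurate to within $\frac{\alpha}{k}$ , can be checked on a toy instance. In this sketch `est` is a hand-written dictionary standing in for the output $\operatorname{\rm Est}$ of the point sanitizer (points missing from it are estimated as 0); the concrete numbers are illustrative assumptions.

```python
from itertools import combinations

def answer_k_point_query(est, I):
    """Estimate Q_{c_I}(S) by summing the per-point estimates over I."""
    return sum(est.get(j, 0.0) for j in I)

def true_k_point_query(S, I):
    """Q_{c_I}(S): fraction of entries of S that fall in I."""
    members = set(I)
    return sum(1 for x in S if x in members) / len(S)

S = [1, 1, 2, 3, 3, 3, 7, 9]
k, alpha = 2, 0.3
est = {1: 0.3, 3: 0.3}  # every per-point error here is at most alpha/k = 0.15

worst = max(
    abs(true_k_point_query(S, I) - answer_k_point_query(est, I))
    for I in combinations(range(10), k)
)
```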

Recall that $\operatorname*{\tt THRESH}_{d}=\{c_{0},\ldots,c_{2^{d}}\}$ , where $c_{j}(x)=1$ if and only if $x<j$ . Let $S=(x_{1},\ldots,x_{m})\in X_{d}^{m}$ be a database. For every $c_{j}\in\operatorname*{\tt THRESH}_{d}$ , the query $Q_{c_{j}}:X_{d}^{*}\rightarrow[0,1]$ is defined as $Q_{c_{j}}(S)=\frac{1}{m}\big{|}\{i\;:\;x_{i}<j\}\big{|}$ .

As $|\operatorname*{\tt THRESH}_{d}|=2^{d}+1$ , one can use the generic construction of Blum et al., and get an $\epsilon$ -private sanitizer for this class with sample complexity $O(d)$ . By , this is the best possible when guaranteeing pure privacy (ignoring the dependency on $\alpha,\beta$ and $\epsilon$ ). We next present a recursive sanitizer for $\operatorname*{\tt THRESH}_{d}$ , guaranteeing approximate privacy and exhibiting sample complexity $\widetilde{O}_{\alpha,\beta,\epsilon,\delta}(8^{\log^{*}(d)})$ .

The algorithm maintains its sanitized database $\hat{S}$ as a global variable, which is initialized as the empty set. In addition, for the privacy analysis, we would need a bound on the number of recursive calls. It will be convenient to maintain another global variable, $calls$ , initialized at the desired bound and decreased in every recursive call.

Every iteration of algorithm $SanThresholds$ can access its input database at most twice using the Laplace mechanism (on steps 2 and 11), at most once using the Choosing Mechanism (on step 9 or on step 10b), and at most once using algorithm $RecConcave$ (on step 8). By the properties of the Laplace mechanism, every interaction with it preserves $(\epsilon,0)$ -differential privacy. Note that the quality function with which we call the Choosing Mechanism is at most 2-growth-bounded. Therefore, as $m\geq\frac{1024}{\alpha\epsilon}\ln(\frac{2048}{\alpha\beta\epsilon\delta})$ , every such interaction with the Choosing Mechanism preserves $(\epsilon,\delta)$ -differential privacy. Last, for our choice of $\hat{\epsilon},\hat{\delta}$ , every interaction with algorithm $RecConcave$ preserves $(\epsilon,\delta)$ -differential privacy.

We start the utility analysis of $SanThresholds$ with the following simple claim.

The function $Q(S,\cdot)$ , defined on step 6, is quasi-concave.

First note that the function $I(S,\cdot)$ defined on step 5 is non-decreasing. Now, let $u\leq v\leq w$ be s.t. $Q(S,u),Q(S,w)\geq x$ . That is,

Using the fact that $I(S,\cdot)$ is non-decreasing, we have that $I(S,u)\leq I(S,v)$ and that $I(S,v-1)\leq I(S,w-1)$ . Therefore

Note that every iteration of algorithm $SanThresholds$ draws at most 2 random samples (on steps 2 and 11) from $\mathop{\rm{Lap}}\nolimits(\frac{1}{\epsilon})$ . We now proceed with the utility analysis by identifying 3 good events that occur with high probability (over the coin tosses of the algorithm).

Fix $\alpha,\beta,\epsilon,\delta$ . Let $SanThresholds$ be executed with $\operatorname{\rm calls}$ initialized to $c\geq\frac{77}{\alpha}$ , and on a database $S$ of $m\geq 8^{\log^{*}(d)}\cdot\frac{60c}{\alpha\epsilon}\log^{*}(d)\log\big{(}\frac{12\log^{*}(d)}{\beta\epsilon\delta}\big{)}$ elements. With probability at least $(1-3c\beta)$ the following 3 events happen:

$B_{1}:$ In every random draw of $\mathop{\rm{Lap}}\nolimits(\frac{1}{\epsilon})$ throughout the execution of $SanThresholds$ it holds that $|\mathop{\rm{Lap}}\nolimits(\frac{1}{\epsilon})|\leq\frac{\alpha m}{16c}$ .

$B_{2}:$ Every interaction with algorithm $RecConcave$ on step 8 succeeds in returning a value $z$ s.t. $Q(S,z)\geq\frac{3\alpha m}{128}$ .

$B_{3}:$ Every iteration that halts after step 13 defines an interval $[a,b]$ s.t. $\#_{S}[a,b]\geq\frac{5\alpha m}{128}$ .

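First note that it suffices to lower bound the terms $\Pr[B_{1}],\;\Pr[B_{2}|B_{1}]$ , and $\Pr[B_{3}|B_{1}\wedge B_{2}]$ , as by the chain rule of conditional probability it holds that $\Pr[B_{1}\wedge B_{2}\wedge B_{3}]=\Pr[B_{1}]\cdot\Pr[B_{2}|B_{1}]\cdot\Pr[B_{3}|B_{1}\wedge B_{2}]$ .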

We now bound each of those terms, starting with $\Pr[B_{1}]$ . In every single draw, the probability of $|\mathop{\rm{Lap}}\nolimits(\frac{1}{\epsilon})|>\frac{\alpha m}{16c}$ is at most $\exp(\frac{-\alpha\epsilon m}{16c})$ , which is at most $\frac{\beta}{2}$ for $m\geq\frac{16c}{\alpha\epsilon}\ln(\frac{2}{\beta})$ . As $c$ (the initial value of $\operatorname{\rm calls}$ ) limits the number of iterations, and every iteration draws at most 2 such samples, we get that $\Pr[B_{1}]\geq(1-c\beta)$ .
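For reference, the bound on $\Pr[B_{1}]$ rests on the standard tail of the Laplace distribution followed by a union bound. The following worked restatement spells out the arithmetic, assuming (as noted earlier) at most 2 draws in each of at most $c$ iterations.

```latex
\begin{align*}
\Pr\Big[\,\big|\mathrm{Lap}(\tfrac{1}{\epsilon})\big| > t\,\Big] &= \exp(-\epsilon t),
  \qquad\text{so with } t=\frac{\alpha m}{16c}:\quad
  \Pr\Big[\,\big|\mathrm{Lap}(\tfrac{1}{\epsilon})\big| > \frac{\alpha m}{16c}\,\Big]
  = \exp\Big(-\frac{\alpha\epsilon m}{16c}\Big),\\
\exp\Big(-\frac{\alpha\epsilon m}{16c}\Big) \le \frac{\beta}{2}
  &\iff m \ge \frac{16c}{\alpha\epsilon}\ln\Big(\frac{2}{\beta}\Big),\\
\Pr[\neg B_{1}] &\le 2c\cdot\frac{\beta}{2} = c\beta .
\end{align*}
```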

So, given that event $(B_{1}\wedge B_{2})$ has occurred, in every iteration that halts after step 13, the probability of defining $[a,b]$ s.t. $\#_{S}[a,b]<\frac{5\alpha m}{128}$ is at most $\beta$ . As there are at most $c$ iterations, we see that $\Pr[B_{3}|B_{1}\wedge B_{2}]\geq(1-c\beta)$ .

All in all, for $c\beta\leq 3$ we get that $\Pr[B_{1}\wedge B_{2}\wedge B_{3}]\geq(1-c\beta)^{3}\geq(1-3c\beta)$ .

Every iteration of algorithm $SanThresholds$ that does not halt on step 1 defines an interval $[a,b]$ (on exactly one of the steps 3,9,10b). This interval $[a,b]$ is not part of any range that is given as input to any future recursive call. Moreover, if none of the recursive calls throughout the execution of $SanThresholds$ halts on step 1, these $[a,b]$ intervals form a partition of the initial range. We now proceed with the utility analysis by identifying yet another 3 good events (at a somewhat higher level) that occur whenever $(B_{1}\wedge B_{2}\wedge B_{3})$ occur.

$E_{1}:$ There are at most $\frac{77}{\alpha}$ recursive calls, none of which halts on the first step.

$E_{2}:$ Every iteration defines $[a,b]$ s.t. $\#_{S}[a,b-1]\leq\frac{\alpha m}{2}$ . That is, every iteration defines $[a,b]$ s.t. the interval $[a,b-1]$ contains at most $\frac{\alpha m}{2}$ points in $S$ .

$E_{3}:$ In every iteration $\left|\#_{S}[a,b]-\hat{\#}[a,b]\right|\leq\frac{\alpha m}{4}\frac{\alpha}{77}$ .

Consider again events $B_{1},B_{2},B_{3}$ defined in Claim 4.10. We will show that the event $(E_{1}\wedge E_{2}\wedge E_{3})$ is implied by $(B_{1}\wedge B_{2}\wedge B_{3})$ (which happens with probability at least $(1-3c\beta)$ by Claim 4.10). We, therefore, continue the proof assuming that $(B_{1}\wedge B_{2}\wedge B_{3})$ has occurred.

We begin by showing that event $E_{1}$ occurs. Denote the number of iterations that halt on steps 1-3 as $y_{1}$ , and the number of complete iterations (i.e., those that halt after step 13) as $y_{2}$ . Clearly, $y_{1}\leq 2y_{2}$ . Now, as event $B_{3}$ has occurred, we have that every iteration that halts after step 13 defines an interval $[a,b]$ s.t. $\#_{S}[a,b]\geq\frac{5\alpha m}{128}$ . This interval does not intersect any range given as input to future calls, and, therefore, $y_{2}\leq\frac{128}{5\alpha}$ . The total number of iterations is, therefore, bounded by $3y_{2}\leq\frac{384}{5\alpha}<\frac{77}{\alpha}$ . Thus, whenever $\operatorname{\rm calls}$ is initialized to at least $77/\alpha$ , there are at most $\frac{77}{\alpha}$ iterations, none of which halts on step 1. That is, $E_{1}$ occurs.

We next show that $E_{3}$ occurs. As we have seen, event $B_{3}$ ensures that no iteration halts on step 1. Therefore every iteration defines $\hat{\#}[a,b]$ by adding a random draw of $\mathop{\rm{Lap}}\nolimits(\frac{1}{\epsilon})$ to $\#_{S}[a,b]$ . As event $B_{1}$ has occurred, it holds that $\left|\#_{S}[a,b]-\hat{\#}[a,b]\right|\leq\frac{\alpha m}{16c}\leq\frac{\alpha m}{4}\frac{\alpha}{77}$ . So, $E_{3}$ occurs.

Consider an iteration of algorithm $SanThresholds$ that defines $[a,b]$ on step 9. In that iteration, $[a,b]$ is defined as $[a,a]$ . Trivially, the empty interval $[a,b-1]=[a,a-1]$ contains at most $\frac{\alpha m}{2}$ points.

Consider an iteration of algorithm $SanThresholds$ that defines $[a,b]$ on step 10b (of length at most $2\cdot 2^{z}$ ). As event $B_{2}$ has occurred, $z$ is s.t. $Q(S,z)\geq\frac{3\alpha m}{128}$ . In particular $L(S,z-1)\leq\frac{9\alpha m}{128}$ , and every interval of length $\frac{1}{2}2^{z}$ contains at most $\frac{9\alpha m}{128}$ points in $S$ . Therefore $\#_{S}[a,b-1]\leq 4\frac{9\alpha m}{128}\leq\frac{\alpha m}{2}$ . Note that we needed $z$ to be at least 1 (ensured by the If condition on step 10), as otherwise the constraint on intervals of length $\frac{1}{2}2^{z}$ has no meaning.

In any case, we have that $E_{2}$ must occur.

We will now complete the utility analysis by showing that the input database $S$ and the sanitized database $\hat{S}$ (at the end of $SanThresholds$ ’ execution) are $\alpha$ -close whenever $(E_{1}\wedge E_{2}\wedge E_{3})$ occurs.

Fix $\alpha,\beta,\epsilon,\delta$ . Let $SanThresholds$ be executed on the range $X_{d}$ , with a global variable $\operatorname{\rm calls}$ initialized to $c\geq\frac{77}{\alpha}$ , and on a database $S$ of $m\geq 8^{\log^{*}(d)}\cdot\frac{60c}{\alpha\epsilon}\log^{*}(d)\log\big{(}\frac{12\log^{*}(d)}{\beta\epsilon\delta}\big{)}$ elements. With probability at least $(1-3c\beta)$ , the sanitized database $\hat{S}$ at the end of the execution is s.t. $|Q_{c_{j}}(S)-Q_{c_{j}}(\hat{S})|\leq\alpha$ for every $c_{j}\in\operatorname*{\tt THRESH}_{d}$ .

Denote $S=(x_{1},\ldots,x_{m})$ , and $\hat{S}=(\hat{x_{1}},\ldots,\hat{x_{n}})$ . Note that $|S|=m$ and that $|\hat{S}|=n$ . By Claim 4.11, the event $E_{1}\cap E_{2}\cap E_{3}$ occurs with probability at least $(1-3c\beta)$ . We will show that in such a case, the sanitized database $\hat{S}$ is s.t. $|Q_{c_{j}}(S)-Q_{c_{j}}(\hat{S})|\leq\alpha$ for every $c_{j}\in\operatorname*{\tt THRESH}_{d}$ .

As event $E_{1}$ has occurred, the intervals $[a,b]$ defined throughout the execution of $SanThresholds$ define a partition of the domain $X_{d}$ . Denote those intervals as $[a_{1},b_{1}],\;[a_{2},b_{2}],\;\ldots,\;[a_{w},b_{w}]$ , where $a_{1}=0,\;b_{w}=2^{d}-1$ , and $a_{i+1}=b_{i}+1$ . Now fix some $c_{j}\in\operatorname*{\tt THRESH}_{d}$ , and let $t$ be s.t. $j\in[a_{t},b_{t}]$ . We have that

As event $E_{1}$ has occurred, $t\leq\frac{77}{\alpha}$ , and

Similar arguments show that $Q_{c_{j}}(S)\geq-\frac{3\alpha}{4}+\frac{1}{m}\#_{\hat{S}}[0,j-1]$ , and so $\left|Q_{c_{j}}(S)-\frac{1}{m}\#_{\hat{S}}[0,j-1]\right|\leq\frac{3\alpha}{4}$ .

Recall that the sanitized database $\hat{S}$ is of size $n$ , and that $Q_{c_{j}}(\hat{S})=\frac{1}{n}\#_{\hat{S}}[0,j-1]$ . As event $(E_{1}\cap E_{3})$ has occurred, we have that $n\leq m+\frac{\alpha m}{4}=(1+\frac{\alpha}{4})m$ . Therefore,

Similar arguments show that $\left|\frac{1}{m}\#_{\hat{S}}[0,j-1]-\frac{1}{n}\#_{\hat{S}}[0,j-1]\right|\leq\frac{\alpha}{4}$ . By the triangle inequality we have therefore that $\left|Q_{c_{j}}(S)-Q_{c_{j}}(\hat{S})\right|\leq\frac{3\alpha}{4}+\frac{\alpha}{4}=\alpha$ . ∎
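The guarantee $|Q_{c_{j}}(S)-Q_{c_{j}}(\hat{S})|\leq\alpha$ for all thresholds can be checked empirically on small domains. The sketch below computes the worst-case discrepancy between two databases over every $c_{j}$ with $0\leq j\leq 2^{d}$ ; the helper names are hypothetical, and enumerating all thresholds is only feasible when $2^{d}$ is small.

```python
from bisect import bisect_left

def threshold_query(S_sorted, j):
    """Q_{c_j}(S): fraction of entries x with x < j (S_sorted must be sorted)."""
    return bisect_left(S_sorted, j) / len(S_sorted)

def max_threshold_error(S, S_hat, domain_size):
    """Largest |Q_{c_j}(S) - Q_{c_j}(S_hat)| over j = 0, ..., domain_size."""
    a, b = sorted(S), sorted(S_hat)
    return max(
        abs(threshold_query(a, j) - threshold_query(b, j))
        for j in range(domain_size + 1)
    )

# The two databases may have different sizes, as in the analysis above.
err = max_threshold_error([0, 1, 2, 3], [0, 0, 2, 3], domain_size=4)
```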

The following theorem is an immediate consequence of Lemma 4.12 and Lemma 4.8.

Fix $\alpha,\beta,\epsilon,\delta$ . There exists an efficient $(\alpha,\beta,\epsilon,\delta,m)$ -sanitizer for $\operatorname*{\tt THRESH}_{d}$ , where

Sanitization with Pure Privacy

Here we give a general lower bound on the database size of pure private sanitizers. Beimel et al. showed that every pure $\epsilon$ -private sanitizer for $\operatorname{\tt POINT}_{d}$ must operate on databases of $\Omega(d)$ elements. With slight modifications, their proof technique can yield a much more general result.

Given a concept class $C$ over a domain $X$ , we denote the effective size of $X$ w.r.t. $C$ as

That is, $X_{C}$ is the cardinality of the biggest subset $\widetilde{X}\subseteq X$ s.t. every two different elements of $\widetilde{X}$ are labeled differently by at least one concept in $C$ .
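For a finite domain and a finite class, the effective size can be computed directly. Two points are labeled identically by every concept exactly when they have the same labeling pattern, and this is an equivalence relation, so the biggest subset of pairwise-distinguishable points contains one representative per pattern. The sketch below counts these patterns; the function name is an illustrative assumption.

```python
def effective_size(domain, concepts):
    """X_C: number of distinct labeling patterns (c(x) for c in C) over the domain."""
    return len({tuple(c(x) for c in concepts) for x in domain})

domain = range(8)
points = [lambda x, j=j: int(x == j) for j in range(8)]      # point functions
thresholds = [lambda x, j=j: int(x < j) for j in range(9)]   # threshold functions
constant = [lambda x: 0]                                     # a single constant concept
```

Both point functions and threshold functions distinguish every pair of domain elements, so their effective size equals the domain size, while a class that labels all points identically has effective size 1.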

Let $C$ be a concept class over a domain $X$ . For every $(\alpha,\beta,\epsilon,m)$ -sanitizer for $C$ (proper or improper) it holds that $m=\Omega\left(\frac{1}{\epsilon\alpha}(\log X_{C}+\log(1/\beta))\right)$ .

Let $\widetilde{X}\subseteq X$ be s.t. $|\widetilde{X}|=X_{C}$ and every two different elements of $\widetilde{X}$ are labeled differently by at least one concept in $C$ . Fix some $x_{1}\in\widetilde{X}$ , and for every $x_{i}\in\widetilde{X}$ , construct a database $S_{i}\in\widetilde{X}^{m}$ by setting $(1-3\alpha)m$ entries as $x_{1}$ and the remaining $3\alpha m$ entries as $x_{i}$ (for $i=1$ all entries of $S_{1}$ are $x_{1}$ ). Note that for all $i\neq j$ , databases $S_{i}$ and $S_{j}$ differ on $3\alpha m$ entries.
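The family of databases used in this packing argument is easy to build explicitly. The following sketch constructs the $S_{i}$ 's and measures how many entries two of them differ on; it assumes $3\alpha m$ is an integer, and the helper names are hypothetical.

```python
def packing_databases(points, m, alpha):
    """Build S_i for every x_i in `points`: (1 - 3*alpha)*m copies of points[0]
    followed by 3*alpha*m copies of x_i. Assumes 3*alpha*m is an integer."""
    tail = round(3 * alpha * m)
    head = m - tail
    return [[points[0]] * head + [x] * tail for x in points]

def hamming(A, B):
    """Number of positions on which two equal-length databases differ."""
    return sum(a != b for a, b in zip(A, B))

dbs = packing_databases(["x1", "x2", "x3"], m=20, alpha=0.1)
```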

Let $\mathcal{A}$ be an $(\alpha,\beta,\epsilon,m)$ -sanitizer for $C$ . Without loss of generality, we can assume that $\mathcal{A}$ is a proper sanitizer (otherwise, we could transform it into a proper one by replacing $\alpha$ with $2\alpha$ ). See Remark 2.18.

Solving for $m$ , we get that $m=\Omega(\frac{1}{\epsilon\alpha}(\log X_{C}+\log(1/\beta)))$ . ∎

Lemma 4.15, together with a lower bound from , yields the following result:

Let $C$ be a concept class over a domain $X$ . If $\mathcal{A}$ is an $(\frac{1}{8},\frac{1}{8},\frac{1}{2},m)$ -sanitizer for $C$ , then $m=\Omega(\log(X_{C})+\operatorname{\rm VC}(C))$ .

Immediate from Lemma 4.15 and Theorem 2.20. ∎

The above lower bound is the best possible general lower bound in terms of $X_{C}$ and $\operatorname{\rm VC}(C)$ (up to a factor of $\log\operatorname{\rm VC}(C)$ ). To see this, let $n<d$ , and consider a concept class over $X_{d}$ containing the following two kinds of concepts. The first kind are $2^{n}$ concepts shattering the left $n$ points of $X_{d}$ (and zero everywhere else). The second kind are $(2^{d}-n)$ “point concepts” over the right $(2^{d}-n)$ points of $X_{d}$ (and zero on the first $n$ ). Formally, for every $j=(j_{0},j_{1},\ldots,j_{n-1})\in\{0,1\}^{n}$ , let $c_{j}:X_{d}\rightarrow\{0,1\}$ be defined as $c_{j}(x)=j_{x}$ if $x<n$ and $c_{j}(x)=0$ otherwise. Define the concept class $C_{L}=\{c_{j}\}_{j\in X_{n}}$ . For every $n\leq j<2^{d}$ , define $f_{j}:X_{d}\rightarrow\{0,1\}$ as $f_{j}(x)=1$ if $x=j$ and $f_{j}(x)=0$ otherwise. Define the concept class $C_{R}=\{f_{j}\}_{n\leq j<2^{d}}$ . Now define $C=C_{L}\bigcup C_{R}$ .

We can now construct a sanitizer for $C$ by applying the generic construction of Blum et al. separately for $C_{L}$ and for $C_{R}$ . Given a database $S$ , this will result in two sanitized databases $\widehat{S}_{L},\widehat{S}_{R}$ , with which we can answer all queries in the class $C$ : a query for $c\in C_{L}$ is answered using $\widehat{S}_{L}$ , and a query for $f\in C_{R}$ is answered using $\widehat{S}_{R}$ . The described (improper) sanitizer for $C$ is of sample complexity $O_{\alpha,\beta,\epsilon}(\log(X_{C})+\operatorname{\rm VC}(C)\log\operatorname{\rm VC}(C))$ .

Sanitization and Proper Private PAC

Similar techniques are used for both data sanitization and private learning, suggesting relationships between the two tasks. We now explore one such relationship in proving a lower bound on the sample complexity needed for sanitization (under pure differential privacy). In particular, we show a reduction from the task of private learning to the task of data sanitization, and then use a lower bound on private learners to derive a lower bound on data sanitization. A similar reduction was given by Gupta et al., where it is stated in terms of statistical queries. They showed that the existence of a sanitizer that accesses the database using at most $k$ statistical queries implies the existence of a learner that makes at most $2k$ statistical queries. We complement their proof and add the necessary details in order to show that the existence of an arbitrary sanitizer (that is not restricted to access its data via statistical queries) implies the existence of a private learner.

We will refer to an element of $X_{d+1}$ as $\vec{x}\circ y$ , where $\vec{x}\in X_{d}$ , and $y\in\{0,1\}$ .

Sanitization Implies Proper PPAC

We show that sanitization of a class $C$ implies private learning of $C$ . Consider an input labeled sample $S=(x_{i},y_{i})_{i=1}^{m}\in(X\times\{0,1\})^{m}$ , labeled by some concept $c\in C$ . The key observation is that in order to privately output a good hypothesis it suffices to first produce a sanitization $\hat{S}$ of $S$ (w.r.t. a slightly different concept class $C^{\rm label}$ , to be defined) and then to output a hypothesis $h\in C$ that minimizes the empirical error over the sanitized database $\hat{S}$ . To complete the proof we then show that sanitization for $C$ implies sanitization for $C^{\rm label}$ .

In order for the chosen hypothesis $h$ to have small generalization error (rather than just small empirical error), our input database $S$ must contain at least $\frac{\operatorname{\rm VC}(C)}{\alpha^{2}}\log(\frac{1}{\alpha\beta})$ elements. We therefore start with the following simple (technical) lemma, handling a case where our initial sanitizer operates only on smaller databases.

If there exists an $(\alpha,\beta,\epsilon,m)$ -sanitizer for a class $C$ , then for every $q\in\N$ s.t. $q\geq\frac{18}{\beta}\ln(1/\beta)$ there exists a $((2\alpha+2\beta),\beta,\epsilon,qm)$ -sanitizer for $C$ .

Fix $q\in\N$ and let $A$ be an $(\alpha,\beta,\epsilon,m)$ -sanitizer for a class $C$ over a domain $X$ . Note that by Theorem 2.21, there exists a $(2\alpha,\frac{3}{2}\beta,\epsilon,m)$ -sanitizer $A^{\prime}$ s.t. the sanitized databases returned by $A^{\prime}$ are always of fixed size $n=O(\frac{\operatorname{\rm VC}(C)}{\alpha^{2}}\log(\frac{1}{\alpha\beta}))$ . We now construct a $((2\alpha+2\beta),\beta,\epsilon,qm)$ -sanitizer $B$ as follows.

As $A^{\prime}$ is $\epsilon$ -differentially private, so is $B$ . Denote $\hat{S}=(\hat{z}_{1},\hat{z}_{2},\ldots,\hat{z}_{qn})\in(X)^{qn}$ . Recall that $q\geq\frac{18}{\beta}\ln(1/\beta)$ , and, hence, using the Chernoff bound, with probability at least $(1-\beta)$ it holds that at least $(1-2\beta)q$ of the $\hat{S_{i}}$ ’s are $2\alpha$ -good for their matching $S_{i}$ ’s. In such a case $\hat{S}$ is $(2\alpha+2\beta)$ -good for $S$ : for every $f\in C$ it holds that

As at least $(1-2\beta)q$ of the $\hat{S_{i}}$ ’s are $2\alpha$ -good for their matching $S_{i}$ ’s, and as trivially $Q_{f}(S_{i})\leq 1$ for each database $\hat{S_{i}}$ that is not $2\alpha$ -good,

Similar arguments show that $Q_{f}(S)\geq Q_{f}(\hat{S})-(2\alpha+2\beta)$ . Algorithm $B$ is, therefore, a $((2\alpha+2\beta),\beta,\epsilon,qm)$ -sanitizer for $C$ , as required. ∎
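The description of algorithm $B$ is not reproduced above, but the proof suggests its shape: the input of $qm$ entries is split into $q$ blocks $S_{1},\ldots,S_{q}$ of $m$ entries each, every block is sanitized separately, and the $q$ outputs are concatenated into $\hat{S}$ . The sketch below follows that reading, which is an inference from the proof rather than the authors' exact pseudocode; `sanitize_block` is a hypothetical callable standing in for $A^{\prime}$ .

```python
def blockwise_sanitize(S, m, sanitize_block):
    """Split S into consecutive blocks of m entries, sanitize each block with
    `sanitize_block`, and concatenate the sanitized blocks."""
    if m <= 0 or len(S) % m != 0:
        raise ValueError("len(S) must be a multiple of a positive block size m")
    out = []
    for start in range(0, len(S), m):
        out.extend(sanitize_block(S[start:start + m]))
    return out
```

Each entry of $S$ lands in exactly one block, so changing a single entry affects a single run of the block sanitizer; this is why the combined algorithm inherits the privacy parameter of $A^{\prime}$ .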

As mentioned above, our first step in showing that sanitization for a class $C$ implies private learning for $C$ is to show that privately learning $C$ is implied by sanitization for the slightly modified class $C^{\rm label}$ , defined as follows. For a given predicate $c$ over $X_{d}$ , we define the predicate $c^{\rm label}$ over $X_{d+1}$ as $c^{\rm label}(\vec{x}\circ\sigma)=1$ if $c(\vec{x})\neq\sigma$ , and $c^{\rm label}(\vec{x}\circ\sigma)=0$ otherwise.

Note that $c^{\rm label}(\vec{x}\circ\sigma)=\sigma\oplus c(\vec{x})$ for $\sigma\in\{0,1\}$ . For a given class of predicates $C$ over $X_{d}$ , we define $C^{\rm label}=\{c^{\rm label}\;:\;c\in C\}$ .
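The reason this class is useful is that the counting query of $h^{\rm label}$ on a labeled database is exactly the empirical error of $h$ on it. The following sketch checks this identity on a toy sample; pairs `(x, y)` stand in for elements $\vec{x}\circ y$ of $X_{d+1}$ , and the helper names are hypothetical.

```python
def label_predicate(c):
    """Return c_label acting on pairs (x, sigma): c_label(x, sigma) = sigma XOR c(x)."""
    return lambda x, sigma: sigma ^ c(x)

def query(pred, S):
    """Fraction of labeled entries (x, y) of S on which pred evaluates to 1."""
    return sum(pred(x, y) for x, y in S) / len(S)

def empirical_error(h, S):
    """Fraction of labeled entries (x, y) of S with h(x) != y."""
    return sum(h(x) != y for x, y in S) / len(S)

h = lambda x: int(x < 3)
S = [(0, 1), (1, 1), (2, 0), (5, 0), (4, 1)]
```

A sanitized database that approximates all queries in $C^{\rm label}$ therefore approximates the empirical error of every hypothesis in $C$ simultaneously.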

$\operatorname{\rm VC}(C)\leq\operatorname{\rm VC}(C^{\rm label})\leq 2\cdot\operatorname{\rm VC}(C)$ .

For the first inequality notice that if a set $S\subseteq X_{d}$ is shattered by $C$ then the set $S\circ 0$ is shattered by $C^{\rm label}$ . For the second inequality, assume $S\subseteq X_{d+1}$ is shattered by $C^{\rm label}$ . Consider the partition of $S$ into $S_{0}$ and $S_{1}$ , where $S_{\sigma}=\{\vec{x}\circ y\in S\;:\;y=\sigma\}$ . For at least one $\sigma\in\{0,1\}$ , it holds that $|S_{\sigma}|\geq\frac{|S|}{2}$ . Hence, the set $\hat{S}=\{\vec{x}\;:\;\vec{x}\circ\sigma\in S_{\sigma}\}$ is shattered by $C$ and $\operatorname{\rm VC}(C^{\rm label})\leq 2\cdot|\hat{S}|\leq 2\cdot\operatorname{\rm VC}(C)$ . ∎
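As a sanity check of the observation on a toy instance, the following sketch builds $c^{\rm label}(\vec{x}\circ\sigma)=\sigma\oplus c(\vec{x})$ for point functions and computes both VC dimensions by brute force. The four-element integer domain stands in for $X_{d}$ , and all function names are illustrative assumptions; the search is exponential and only meant for tiny examples.

```python
from itertools import combinations, product

def label_version(c):
    """Return c^label, which maps the pair (x, sigma) to sigma XOR c(x)."""
    return lambda point: point[1] ^ c(point[0])

def is_shattered(points, concepts):
    """True if the concepts realize all 2^|points| labelings of the points."""
    patterns = {tuple(c(p) for p in points) for c in concepts}
    return len(patterns) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Brute-force VC dimension; feasible only for tiny domains."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(s, concepts) for s in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

# Toy stand-in for X_d: four points, with the class of point functions.
domain = list(range(4))
point_class = [(lambda x, a=a: int(x == a)) for a in domain]
label_domain = list(product(domain, (0, 1)))
label_class = [label_version(c) for c in point_class]

vc_c = vc_dimension(domain, point_class)
vc_label = vc_dimension(label_domain, label_class)
```

On this instance the two computed values satisfy $\operatorname{\rm VC}(C)\leq\operatorname{\rm VC}(C^{\rm label})\leq 2\cdot\operatorname{\rm VC}(C)$ , as the observation requires.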

The next lemma shows that for every concept class $C$ , a sanitizer for $C^{\rm label}$ implies a private learner for $C$ , under the assumption that the given sanitizer operates on large enough databases. This assumption will be removed in the lemma that follows.

Let $C$ be a class of predicates over $X_{d}$ . If there exists an $(\alpha,\beta,\epsilon,m)$ -sanitizer $A$ for $C^{\rm label}$ , where $m\geq\frac{50\operatorname{\rm VC}(C)}{\gamma^{2}}\ln(\frac{1}{\gamma\beta})$ for some $\gamma>0$ , then there exists a proper $((2\alpha+\gamma),2\beta,\epsilon,m)$ -PPAC learner for $C$ .

Let $A$ be an $(\alpha,\beta,\epsilon,m)$ -sanitizer, and consider the following algorithm $Learn$ :

As $A$ is $\epsilon$ -differentially private, so is $Learn$ . For the utility analysis, fix some target concept $c_{t}\in C$ and a distribution $\mathcal{D}$ over $X_{d}$ , and define the following two good events:

Event $E_{1}$ : $\forall h\in C,\;\;\big{|}{\rm error}_{S}(h)-{\rm error}_{\hat{S}}(h)\big{|}\leq\alpha$ .

Event $E_{2}$ : $\forall h\in C,\;\;|{\rm error}_{\mathcal{D}}(h,c_{t})-{\rm error}_{S}(h)|\leq\gamma$ .

We first show that if these 2 good events happen, algorithm $Learn$ returns a $(2\alpha+\gamma)$ -good hypothesis. As the target concept satisfies ${\rm error}_{S}(c_{t})=0$ , event $E_{1}$ ensures the existence of a concept $f\in C$ s.t. ${\rm error}_{\hat{S}}(f)\leq\alpha$ . Thus, algorithm $Learn$ chooses a hypothesis $h\in C$ s.t. ${\rm error}_{\hat{S}}(h)\leq\alpha$ . Using event $E_{1}$ again, this $h$ obeys ${\rm error}_{S}(h)\leq 2\alpha$ . Therefore, event $E_{2}$ ensures that $h$ satisfies ${\rm error}_{\mathcal{D}}(h,c_{t})\leq 2\alpha+\gamma$ .

We will now show that these 2 events happen with high probability. By the definition of $C^{\rm label}$ , for every $c^{\rm label}\in C^{\rm label}$ we have that

Therefore, as $A$ is an $(\alpha,\beta,\epsilon,m)$ -sanitizer for $C^{\rm label}$ , event $E_{1}$ happens with probability at least $(1-\beta)$ . As $m\geq\frac{50\operatorname{\rm VC}(C)}{\gamma^{2}}\ln(\frac{1}{\gamma\beta})$ , Theorem 2.14 ensures that event $E_{2}$ happens with probability at least $(1-\beta)$ as well. All in all, $Learn$ is a proper $((2\alpha+\gamma),2\beta,\epsilon,m)$ -PPAC learner for $C$ . ∎
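The shape of $Learn$ suggested by the analysis above (sanitize the labeled sample, then pick a concept with small error on the sanitized database) can be sketched as below. This is a minimal illustration under assumptions: `sanitizer` is a hypothetical callable standing in for $A$ , the class is small enough to enumerate, and the identity function used in the usage lines is a non-private placeholder only.

```python
def empirical_error(h, labeled_db):
    """Fraction of entries (x, y) in labeled_db with h(x) != y."""
    return sum(1 for x, y in labeled_db if h(x) != y) / len(labeled_db)

def learn(sample, sanitizer, concept_class):
    """Sanitize the sample, then return a concept minimizing error on the sanitized copy."""
    sanitized = sanitizer(sample)  # the only step that reads the raw sample
    # Everything below is post-processing of the sanitizer's output.
    return min(concept_class, key=lambda h: empirical_error(h, sanitized))

# Usage with a toy class of thresholds and a NON-private placeholder sanitizer.
thresholds = [(lambda x, t=t: int(x >= t)) for t in range(5)]
sample = [(0, 0), (1, 0), (2, 1), (3, 1)]
h = learn(sample, lambda db: list(db), thresholds)
```

Because the hypothesis is selected by looking only at the sanitized database, the privacy of the whole procedure is inherited from the sanitizer.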

The above lemma describes a reduction from the task of privately learning a concept class $C$ to the sanitization task of the slightly different concept class $C^{\rm label}$ . We next show that given a sanitizer for a class $C$ , it is possible to construct a sanitizer for $C^{\rm label}$ . Along the way we will also slightly increase the sample complexity of the starting sanitizer, in order to be able to use Lemma 5.3. This results in a reduction from the task of privately learning a concept class $C$ to the sanitization task of the same concept class $C$ .

If there exists an $(\alpha,\beta,\epsilon,m)$ -sanitizer for a class $C$ , then there exists a $((5\alpha+4\beta),5\beta,6\epsilon,t)$ -sanitizer for $C^{\rm label}$ , where

Let $A^{\prime}$ be an $(\alpha,\beta,\epsilon,m)$ -sanitizer for a class $C$ . By replacing $\alpha$ with $2\alpha$ , and $\beta$ with $2\beta$ , we can assume that the sanitized databases returned by $A^{\prime}$ are always of fixed size $n=O(\frac{\operatorname{\rm VC}(C)}{\alpha^{2}}\log(\frac{1}{\alpha\beta}))$ (see Theorem 2.21). Moreover, we can assume that $A^{\prime}$ treats its input database as a multiset (as otherwise we could alter $A^{\prime}$ to first randomly shuffle its input database). Denote $M=m\left\lceil\frac{18}{\beta}\ln(\frac{2}{\alpha\beta})\cdot\left(1+\frac{1}{m\epsilon}\right)\right\rceil$ . By Lemma 5.1, for every $qM$ (where $q\in\N$ ) there exists a $((4\alpha+4\beta),2\beta,\epsilon,qM)$ -sanitizer $A$ for $C$ (as $qM=q^{\prime}m$ for an integer $q^{\prime}$ ). Denote $t=\left\lceil\frac{6}{\alpha^{2}}\right\rceil M$ , and consider algorithm $B$ presented in Figure 12.

Note that the output on Step 8 is just a post-processing of the 4 outputs on Step 7. We first show that each of those 4 outputs preserves differential privacy, and, hence, $B$ is private (with slightly bigger privacy parameter, see Theorem 2.3).

By the properties of the Laplace mechanism, $\hat{m_{0}}$ and $\hat{m_{1}}$ each preserve $\epsilon$ -differential privacy. The analysis for $\widetilde{S_{0}}$ and $\widetilde{S_{1}}$ is symmetric, and we next give the analysis for $\widetilde{S_{0}}$ . Denote by $B_{0}$ an algorithm identical to the first 7 steps of $B$ , except that the only output of $B_{0}$ on Step 7 is $\widetilde{S_{0}}$ . We now show that $B_{0}$ is private.

Fix two neighboring databases $D,D^{\prime}$ , and let $F$ be a set of possible outputs. Note that as $D,D^{\prime}$ are neighboring, it holds that $S_{0}[D]$ and $S_{0}[D^{\prime}]$ are identical up to an addition or a change of one entry. Therefore, whenever $m_{0}[D]=m_{0}[D^{\prime}]=L$ , we have that $S_{0}[D,L]$ and $S_{0}[D^{\prime},L]$ are neighboring databases. Moreover, by the properties of the Laplace mechanism, for every value $L$ it holds that $\Pr[m_{0}[D]=L]\leq e^{\epsilon}\Pr[m_{0}[D^{\prime}]=L]$ . Hence,

Overall (since we use two $\epsilon$ -private algorithms and two $(2\epsilon)$ -private algorithms), algorithm $B$ is $(6\epsilon)$ -differentially private. As for the utility analysis, fix a database $D=(x_{i},y_{i})_{i=1}^{t}$ and consider the execution of $B$ on $D$ . We now show that w.h.p. the sanitized database $\widetilde{D}$ is $(5\alpha+4\beta)$ -close to $D$ .
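The privacy accounting above combines two ingredients: noisy counts released through the Laplace mechanism, and basic composition of the four releases. The sketch below illustrates both under simplifying assumptions; it is not algorithm $B$ itself, the function names are hypothetical, and the count is treated as a sensitivity-1 query.

```python
import random

def laplace_noise(epsilon, rng=random):
    """Sample Laplace noise with scale 1/epsilon as a difference of two exponentials."""
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)

def noisy_count(database, predicate, epsilon, rng=random):
    """Release a counting query (sensitivity 1) with Laplace noise of scale 1/epsilon."""
    true_count = sum(1 for row in database if predicate(row))
    return true_count + laplace_noise(epsilon, rng)

def composed_epsilon(parts):
    """Basic composition: the privacy parameters of the separate releases add up."""
    return sum(parts)

# Two eps-private counts and two (2 eps)-private sanitized databases, in units of eps.
total_in_units_of_eps = composed_epsilon([1, 1, 2, 2])
```

In units of $\epsilon$ the four releases sum to six, which matches the $(6\epsilon)$ guarantee stated for $B$ .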

Fix a concept $c^{\rm label}\in C^{\rm label}$ . It holds that

By the properties of algorithm $A$ , with probability at least $(1-4\beta)$ we have that $\widetilde{S_{0}}$ and $\widetilde{S_{1}}$ are $(4\alpha+4\beta)$ -close to $\hat{S_{0}}$ and to $\hat{S_{1}}$ (respectively). We proceed with the analysis assuming that this is the case. Hence,

Note that as $c^{\rm label}(x_{i}\circ 0)=c(x_{i})$ and as $c^{\rm label}(x_{i}\circ 1)=1-c(x_{i})$ , it holds that

Denoting $\widetilde{D}=(z_{i})_{i=1}^{r}\in(X_{d+1})^{r}$ (where $r=n(\hat{m_{0}}+\hat{m_{1}})$ ), we get

Similar arguments show that $Q_{c^{\rm label}}(D)\geq Q_{c^{\rm label}}(\widetilde{D})-(5\alpha+4\beta)$ . Algorithm $B$ is, therefore, a $((5\alpha+4\beta),5\beta,6\epsilon,t)$ -sanitizer for $C^{\rm label}$ , where

Let $\alpha,\epsilon\leq\frac{1}{8}$ , and let $C$ be a class of predicates. If there exists an $(\alpha,\beta,\epsilon,m)$ -sanitizer $A$ for $C$ , then there exists a proper $((15\alpha+12\beta),10\beta,6\epsilon,t)$ -PPAC learner for $C$ , where $t=O_{\alpha,\beta,\epsilon}(m)$ .

Let $A$ be an $(\alpha,\beta,\epsilon,m)$ -sanitizer for $C$ . Note that by Theorem 2.20, it must be that $m\geq\frac{\operatorname{\rm VC}(C)}{2}$ . By Lemma 5.4, there exists a $((5\alpha+4\beta),5\beta,6\epsilon,t)$ -sanitizer for $C^{\rm label}$ , where $t=O_{\alpha,\beta,\epsilon}(m)$ and $t\geq\frac{100m}{\alpha^{2}}\ln(\frac{1}{\alpha\beta})\geq\frac{50\operatorname{\rm VC}(C)}{\alpha^{2}}\ln(\frac{1}{\alpha\beta})$ . By Lemma 5.3, there exists a proper $((15\alpha+12\beta),10\beta,6\epsilon,t)$ -PPAC learner for $C$ . ∎

Given an efficient proper-sanitizer for $C$ and assuming the existence of an efficient non-private learner for $C$ , this reduction results in an efficient private learner for $C$ .

Next we prove a lower bound on the database size of every sanitizer for $\operatorname{\tt k-POINT}_{d}$ that preserves pure differential privacy.

Consider the following concept class over $X_{d}$ . For every $A\subseteq X_{d}$ s.t. $|A|=k$ , the concept class $\operatorname{\tt k-POINT}_{d}$ contains the concept $c_{A}:X_{d}\rightarrow\{0,1\}$ , defined as $c_{A}(x)=1$ if $x\in A$ and $c_{A}(x)=0$ otherwise. The VC dimension of $\operatorname{\tt k-POINT}_{d}$ is $k$ (assuming $2^{d}\geq 2k$ ).

To prove a lower bound on the sample complexity of sanitization, we first prove a lower bound on the sample complexity of the related learning problem and then use the reduction (Theorem 5.5). Thus, we start by showing that every private proper learner for $\operatorname{\tt k-POINT}_{d}$ requires $\Omega(\frac{kd}{\alpha\epsilon})$ labeled examples. A similar version of this lemma appeared in Beimel et al., where it is shown that every private proper learner for $\operatorname{\tt POINT}_{d}$ requires $\Omega(\frac{d}{\alpha\epsilon})$ labeled examples.

Let $\alpha<\frac{1}{5}$ , and let $k,d$ be s.t. $2^{d}\geq k^{1.1}$ . If $L$ is a proper $(\alpha,\frac{1}{2},\epsilon,m)$ -PPAC learner for $\operatorname{\tt k-POINT}_{d}$ , then $m=\Omega(\frac{kd}{\alpha\epsilon})$ .

Let $L$ be a proper $(\alpha,\frac{1}{2},\epsilon,m)$ -PPAC learner for $\operatorname{\tt k-POINT}_{d}$ . Without loss of generality, we can assume that $m\geq\frac{5\ln(4)}{3\alpha}$ (since $L$ can ignore part of the sample).

Consider a maximal cardinality subset $B\subseteq\operatorname{\tt k-POINT}_{d}$ s.t. for every $c_{A}\in B$ it holds that $0^{d}\notin A$ , and moreover, for every $c_{A_{1}}\neq c_{A_{2}}\in B$ it holds that $|A_{1}\cap A_{2}|\leq\frac{k}{2}$ . We have that $|B|\geq\left(\frac{2^{d}-1}{4e^{2}k}\right)^{k/2}$ . To see this, we could construct such a set using the following greedy algorithm. Initialize $\hat{B}=\emptyset$ , and $C=\operatorname{\tt k-POINT}_{d}\setminus\{c_{I}\in\operatorname{\tt k-POINT}_{d}\;:\;0^{d}\in I\}$ . While $C\neq\emptyset$ , arbitrarily choose a concept $c_{A}\in C$ , add $c_{A}$ to $\hat{B}$ , and remove from $C$ every concept $c_{I}$ s.t. $|A\cap I|\geq\frac{k}{2}$ (this includes $c_{A}$ itself).
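The greedy construction can be sketched in a few lines. The version below works over a small integer universe standing in for $X_{d}$ (with `0` playing the role of $0^{d}$ ) and, after each choice, removes every set whose intersection with the chosen set has size at least $\frac{k}{2}$ , which is what guarantees the pairwise-intersection property. The function and variable names are illustrative assumptions, and the enumeration is only feasible for toy parameters.

```python
from itertools import combinations

def greedy_packing(universe, k, excluded):
    """Greedily pick k-subsets avoiding `excluded` with pairwise intersections below k/2."""
    candidates = [frozenset(s) for s in combinations(universe, k) if excluded not in s]
    chosen = []
    while candidates:
        a = candidates[0]  # an arbitrary remaining set
        chosen.append(a)
        # Drop every set meeting `a` in at least k/2 points (this also drops `a` itself).
        candidates = [i for i in candidates if 2 * len(a & i) < k]
    return chosen

packing = greedy_packing(range(8), 4, excluded=0)
```

Each iteration removes at least the chosen set, so the loop terminates, and any two chosen sets share fewer than $\frac{k}{2}$ points.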

Clearly, for every two $c_{A_{1}}\neq c_{A_{2}}\in\hat{B}$ it holds that $|A_{1}\cap A_{2}|\leq\frac{k}{2}$ . Moreover, at every step, the number of concepts that are removed from $C$ is at most

For every $c_{A}\in B$ we will now define a distribution $\mathcal{D}_{A}$ , a set of hypotheses $G(A)$ , and a database $S_{A}$ . The distribution $\mathcal{D}_{A}$ is defined as

Define the set $G(A)\subseteq\operatorname{\tt k-POINT}_{d}$ as all $\alpha$ -good hypotheses for $(c_{A},\mathcal{D}_{A})$ in $\operatorname{\tt k-POINT}_{d}$ . Note that for every $h_{I}\in\operatorname{\tt k-POINT}_{d}$ s.t. ${\rm error}_{\mathcal{D}_{A}}(h_{I},c_{A})\leq\alpha$ it holds that $|I\cap A|\geq\frac{4k}{5}$ . Therefore, for every $c_{A_{1}}\neq c_{A_{2}}\in B$ it holds that $G(A_{1})\cap G(A_{2})=\emptyset$ (as $|A_{1}\cap A_{2}|\leq\frac{k}{2}$ , and as $|A_{1}|=|A_{2}|=|I|=k$ ).

By the utility properties of $L$ , we have that $\Pr_{L,\mathcal{D}_{A}}[L(S)\in G(A)]\geq\frac{1}{2}$ . We say that a database $S$ of $m$ labeled examples is good if the unlabeled example $0^{d}$ appears in $S$ at least $(1-8\alpha)m$ times. Let $S$ be a database constructed by taking $m$ i.i.d. samples from $\mathcal{D}_{A}$ , labeled by $c_{A}$ . By the Chernoff bound, $S$ is good with probability at least $1-\exp(-3\alpha m/5)$ . Hence,

Note that, as $0^{d}\notin A$ , every appearance of the example $0^{d}$ in $S$ is labeled by $0$ . Therefore, there exists a good database $S$ of $m$ samples that contains the entry $0^{d}\circ 0$ at least $(1-8\alpha)m$ times, and $\Pr_{L}\left[L(S)\in G(A)\right]\geq\frac{1}{4}$ , where the probability is only over the randomness of $L$ . We define $S_{A}$ as such a database.

Note that all of the databases $S_{A_{i}}$ defined here are of distance at most $8\alpha m$ from one another. The privacy of $L$ ensures, therefore, that for any two such $S_{A_{i}},S_{A_{j}}$ it holds that $\Pr_{L}[L(S_{A_{i}})\in G(A_{j})]\geq\frac{1}{4}\exp(-8\alpha\epsilon m)$ .

Solving for $m$ yields $m=\Omega(\frac{k}{\alpha\epsilon}(d-\ln(k)))$ . Recall that $2^{d}\geq k^{1.1}$ , and, hence, $m=\Omega(\frac{kd}{\alpha\epsilon})$ . ∎
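The final counting step can be made concrete numerically. Since the sets $G(A_{j})$ are pairwise disjoint, the probabilities $\Pr_{L}[L(S_{A_{i}})\in G(A_{j})]$ sum to at most one over $c_{A_{j}}\in B$ , so $|B|\cdot\frac{1}{4}\exp(-8\alpha\epsilon m)\leq 1$ . The sketch below solves this inequality for $m$ using the quoted bound on $|B|$ ; it is a reconstruction of the step under that reading, with hypothetical function names, and it makes no attempt to track the constants hidden by the $\Omega$ notation.

```python
import math

def log_packing_size(k, d):
    """Natural log of the bound |B| >= ((2^d - 1) / (4 e^2 k))^(k/2) quoted in the proof."""
    return (k / 2) * (math.log(2 ** d - 1) - math.log(4 * k) - 2)

def sample_lower_bound(k, d, alpha, epsilon):
    """Smallest m compatible with |B| * (1/4) * exp(-8 * alpha * epsilon * m) <= 1."""
    return (log_packing_size(k, d) - math.log(4)) / (8 * alpha * epsilon)

bound = sample_lower_bound(k=4, d=20, alpha=0.1, epsilon=0.1)
```

The resulting expression grows like $\frac{k}{\alpha\epsilon}(d-\ln(k))$ up to constants, in line with the bound stated in the proof.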

The constant $1.1$ in the above lemma could be replaced with any constant strictly bigger than $1$ . Moreover, whenever $2^{d}=O(k)$ we have that $|\operatorname{\tt k-POINT}_{d}|={2^{d}\choose k}=2^{O(2^{d})}$ and, hence, the generic construction of Kasiviswanathan et al. yields a proper $\epsilon$ -private learner for this class with sample complexity $O_{\alpha,\beta,\epsilon}(2^{d})=O_{\alpha,\beta,\epsilon}(k)$ .

In the next lemma we will use the last lower bound on the sample complexity of private learners, together with the reduction of Theorem 5.5, and derive a lower bound on the database size necessary for pure private sanitizers for $\operatorname{\tt k-POINT}_{d}$ .

Let $\epsilon\leq\frac{1}{8}$ , and let $k$ and $d$ be s.t. $2^{d}\geq k^{1.1}$ . Every $(\frac{1}{150},\frac{1}{150},\epsilon,m)$ -sanitizer for $\operatorname{\tt k-POINT}_{d}$ requires databases of size $m=\Omega\left(\frac{kd}{\epsilon}\right)$ .

Let $A$ be a $(\frac{1}{150},\frac{1}{150},\epsilon,m)$ -sanitizer for $\operatorname{\tt k-POINT}_{d}$ . By Theorem 5.5, there exists a proper $(\frac{9}{50},\frac{1}{15},6\epsilon,t)$ -PPAC learner for $\operatorname{\tt k-POINT}_{d}$ , where $t=O\left(m\right)$ . By Lemma 5.7, $t=\Omega\left(\frac{kd}{\epsilon}\right)$ , and hence $m=\Omega\left(\frac{kd}{\epsilon}\right)$ . ∎
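The parameters in this proof can be checked with exact arithmetic. The sketch below encodes the parameter maps of Lemma 5.4 and Lemma 5.3 as stated above and composes them for $\alpha=\beta=\frac{1}{150}$ . The function names are hypothetical, the privacy parameter is tracked as a multiple of $\epsilon$ , and the choice of $\gamma$ equal to the sanitizer's accuracy is an assumption made here because it reproduces the $(15\alpha+12\beta)$ of Theorem 5.5.

```python
from fractions import Fraction

def sanitizer_for_label_class(alpha, beta, eps_multiplier):
    """Lemma 5.4 parameter map: (a, b, e) -> (5a + 4b, 5b, 6e)."""
    return 5 * alpha + 4 * beta, 5 * beta, 6 * eps_multiplier

def learner_from_sanitizer(alpha, beta, eps_multiplier, gamma):
    """Lemma 5.3 parameter map: (a, b, e) -> (2a + gamma, 2b, e)."""
    return 2 * alpha + gamma, 2 * beta, eps_multiplier

a = b = Fraction(1, 150)
san = sanitizer_for_label_class(a, b, 1)
# Assumption: gamma is set to the sanitizer's own accuracy, giving 15a + 12b overall.
learner = learner_from_sanitizer(*san, gamma=san[0])
```

The composed learner has accuracy $\frac{9}{50}<\frac{1}{5}$ and confidence $\frac{1}{15}$ , so the lower bound for proper learners applies to it.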

Recall that in the proof of Theorem 5.5, we increased the sample complexity in order to use Lemma 5.3. This causes a slack of $\alpha^{2}$ in the database size of the resulting learner, which, in turn, eliminates the dependency on $\alpha$ in the above lower bound. For the class $\operatorname{\tt k-POINT}_{d}^{\rm label}$ it is possible to obtain a better lower bound, by using the reduction of Lemma 5.3 twice.

Let $\alpha\leq\frac{1}{50}$ and $\epsilon\leq\frac{1}{8}$ . There exists a $d_{0}=d_{0}(\alpha,\epsilon)$ s.t. for every $k$ and $d$ s.t. $2^{d}\geq\max\{k^{1.1}\;,\;2^{d_{0}}\}$ , it holds that every $(\alpha,\frac{1}{50},\epsilon,m)$ -sanitizer for $\operatorname{\tt k-POINT}_{d}^{\rm label}$ must operate on databases of size

Let $A$ be a $(\frac{1}{50},\frac{1}{50},\epsilon,m)$ -sanitizer for a class $\operatorname{\tt k-POINT}_{d}^{\rm label}$ , where $\epsilon\leq\frac{1}{8}$ . Note that by Theorem 2.20, it must be that $m\geq\frac{\operatorname{\rm VC}(\operatorname{\tt k-POINT}_{d}^{\rm label})}{2}\geq\frac{\operatorname{\rm VC}(\operatorname{\tt k-POINT}_{d})}{2}$ . In order to use Lemma 5.3, we need a slightly stronger guarantee, and therefore use Lemma 5.1 to increase the input database size as follows.

Denote $q=\left\lceil 100\cdot 50^{3}\ln(50^{2})\right\rceil$ . By Lemma 5.1, there exists a $(\frac{2}{25},\frac{1}{50},\epsilon,t)$ -sanitizer $B$ for $\operatorname{\tt k-POINT}_{d}^{\rm label}$ , where

By Lemma 5.3, there exists a proper $(\frac{9}{50},\frac{1}{25},\epsilon,t)$ -PPAC learner for $\operatorname{\tt k-POINT}_{d}$ . By Lemma 5.7, $t=\Omega\left(\frac{kd}{\epsilon}\right)$ , and hence

Let $\alpha\leq\frac{1}{50}$ and $\epsilon\leq\frac{1}{8}$ , and let $B$ be an $(\alpha,\frac{1}{50},\epsilon,m)$ -sanitizer for $\operatorname{\tt k-POINT}_{d}^{\rm label}$ . As $B$ is, in particular, a $(\frac{1}{50},\frac{1}{50},\epsilon,m)$ -sanitizer for $\operatorname{\tt k-POINT}_{d}^{\rm label}$ , where $\epsilon\leq\frac{1}{8}$ , Equation 15 states that there exists a constant $\lambda$ s.t. $m\geq\lambda\frac{kd}{\epsilon}$ . Requiring that $d\geq d_{0}\triangleq\frac{50\epsilon}{\lambda\alpha^{2}}\ln(\frac{50}{\alpha})$ , we ensure that $m\geq\frac{50k}{\alpha^{2}}\ln(\frac{50}{\alpha})$ . By reusing Lemma 5.3, we now get that there exists a proper $(3\alpha,\frac{1}{25},\epsilon,m)$ -PPAC learner for $\operatorname{\tt k-POINT}_{d}$ . Lemma 5.7 now states that

Label-Private Learners

In this section we consider relaxed definitions of private learners preserving pure privacy (i.e., $\delta=0$ ). We start with the model of label privacy (see and references therein). In this model, privacy must only be preserved for the labels of the elements in the database, and not necessarily for their identity. This is a reasonable privacy requirement when the identities of the individuals in a population are publicly known but their labels are not. In general, this is not a reasonable assumption.

We consider a database $S=(x_{i},y_{i})_{i=1}^{m}$ containing labeled points from some domain $X$ , and denote $S_{x}=(x_{i})_{i=1}^{m}\in X^{m}$ , and $S_{y}=(y_{i})_{i=1}^{m}\in\{0,1\}^{m}$ .

Let $A$ be an algorithm that gets as input a database $S_{x}\in X^{m}$ and its labels $S_{y}\in\{0,1\}^{m}$ . Algorithm $A$ is an $(\alpha,\beta,\epsilon,m)$ -Label Private PAC Learner for a concept class $C$ over $X$ if

Privacy. $\forall S_{x}\in X^{m}$ , algorithm $A(S_{x},\cdot)=A_{S_{x}}(\cdot)$ is $\epsilon$ -differentially private (as in Definition 2.2);

Utility. Algorithm $A$ is an $(\alpha,\beta,m)$ -PAC learner for $C$ (as in Definition 2.7).

Chaudhuri et al. proved lower bounds on the sample complexity of label-private learners for a class $C$ in terms of its doubling dimension. As we will now see, the correct measure for characterizing the sample complexity of such learners is the VC dimension, and the sample complexity of label-private learners is actually of the same order as that of non-private learners (assuming $\alpha,\beta$ , and $\epsilon$ are constants).

Let $C$ be a concept class over a domain $X$ . For every $\alpha,\beta,\epsilon$ , there exists an $(\alpha,\beta,\epsilon,m)$ -Label Private PAC learner for $C$ , where $m=O_{\alpha,\beta,\epsilon}(\operatorname{\rm VC}(C))$ . The learner might not be efficient.

In Figure 13 we describe a label-private algorithm $A$ . Algorithm $A$ constructs a set of hypotheses $H$ as follows: It samples an unlabeled sample $S_{1}$ , and defines $B$ as the set of points in $S_{1}$ . For every labeling of the points in $B$ realized by $C$ , add to $H$ an arbitrary concept consistent with this labeling. Afterwards, algorithm $A$ uses the exponential mechanism to choose a hypothesis out of $H$ .

Note that steps 1-4 of algorithm $A$ are independent of the labeling vector $S_{y}$ . By the properties of the exponential mechanism (which is used to access $S_{y}$ on Step 5), for every set of elements $S_{x}$ , algorithm $A(S_{x},\cdot)$ is $\epsilon$ -differentially private.
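The two phases described above (build $H$ from an unlabeled prefix, then select from $H$ with the exponential mechanism) can be sketched as follows. This is an illustration under assumptions, not the algorithm of Figure 13: the class is assumed finite and enumerable, the prefix length `n` is passed explicitly, the score of a hypothesis is minus its number of errors on the remaining examples (changing one label changes each score by at most one), and the selection weights use the common $\exp(\epsilon\cdot{\rm score}/2)$ form, which may differ in normalization from the proposition the paper relies on.

```python
import math
import random

def build_hypotheses(concept_class, unlabeled_points):
    """Keep one concept per labeling of the unlabeled points realized by the class."""
    representatives = {}
    for c in concept_class:
        labeling = tuple(c(x) for x in unlabeled_points)
        representatives.setdefault(labeling, c)
    return list(representatives.values())

def exponential_mechanism_weights(scores, epsilon):
    """Selection probabilities proportional to exp(epsilon * score / 2)."""
    top = max(scores)  # shift by the maximum for numerical stability
    raw = [math.exp(epsilon * (s - top) / 2) for s in scores]
    total = sum(raw)
    return [w / total for w in raw]

def label_private_learn(xs, ys, n, concept_class, epsilon, rng=random):
    """The first n points (labels ignored) define H; labels are read only when scoring."""
    hypotheses = build_hypotheses(concept_class, xs[:n])
    rest = list(zip(xs[n:], ys[n:]))
    scores = [-sum(1 for x, y in rest if h(x) != y) for h in hypotheses]
    weights = exponential_mechanism_weights(scores, epsilon)
    return rng.choices(hypotheses, weights=weights, k=1)[0]
```

Only the scoring step reads the labels, which mirrors the observation that the construction of $H$ is independent of $S_{y}$ .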

For the utility analysis, fix a target concept $c\in C$ and a distribution $\mathcal{D}$ over $X$ , and define the following 3 good events:

Event $E_{1}$ : The constructed set $H$ contains at least one hypothesis $f$ s.t. ${\rm error}_{S^{2}}(f)\leq\frac{\alpha}{4}$ .

Event $E_{2}$ : For every $h\in H$ s.t. ${\rm error}_{S^{2}}(h)\leq\frac{\alpha}{2}$ , it holds that ${\rm error}_{\mathcal{D}}(c,h)\leq\alpha$ .

Event $E_{3}$ : The exponential mechanism chooses an $h$ such that ${\rm error}_{S^{2}}(h)\leq\frac{\alpha}{4}+\min_{f\in H}\left\{{\rm error}_{S^{2}}(f)\right\}$ .

We first show that if these 3 good events happen, then algorithm $A$ returns an $\alpha$ -good hypothesis. Event $E_{1}$ ensures the existence of a hypothesis $f\in H$ s.t. ${\rm error}_{S^{2}}(f)\leq\frac{\alpha}{4}$ . Thus, event $E_{1}\cap E_{3}$ ensures algorithm $A$ chooses (using the exponential mechanism) a hypothesis $h\in H$ s.t. ${\rm error}_{S^{2}}(h)\leq\frac{\alpha}{2}$ . Event $E_{2}$ ensures, therefore, that this $h$ obeys ${\rm error}_{\mathcal{D}}(c,h)\leq\alpha$ .

Fix a hypothesis $h$ s.t. ${\rm error}_{\mathcal{D}}(c,h)>\alpha$ . Using the Chernoff bound, the probability that ${\rm error}_{S^{2}}(h)\leq\frac{\alpha}{2}$ is less than $\exp(-(m-n)\alpha/8)$ . As $|H|=2^{|B|}\leq 2^{n}$ , the probability that there is such a hypothesis in $H$ is at most $2^{n}\cdot\exp(-(m-n)\alpha/8)$ . For $m\geq\frac{8}{\alpha}(n+\ln(\frac{4}{\beta}))$ , this probability is at most $\frac{\beta}{4}$ , and event $E_{2}$ happens with probability at least $(1-\frac{\beta}{4})$ .

The exponential mechanism ensures that the probability of event $E_{3}$ is at least $1-|H|\cdot\exp(-\epsilon\alpha m/8)$ (see Proposition 2.25), which is at least $(1-\frac{\beta}{4})$ for $m\geq\frac{8}{\alpha\epsilon}(n+\ln(\frac{4}{\beta}))$ .

All in all, by setting $n=\frac{32}{\alpha}(\operatorname{\rm VC}(C)\ln(\frac{64}{\alpha})+\ln(\frac{8}{\beta}))$ and $m\geq\frac{768}{\alpha^{2}\epsilon}(\operatorname{\rm VC}(C)\ln(\frac{64}{\alpha})+2\ln(\frac{8}{\beta}))$ , we ensure that the probability of $A$ failing to output an $\alpha$ -good hypothesis is at most $\beta$ . ∎
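To get a feel for the magnitudes involved, the closing choices of $n$ and $m$ can be evaluated numerically and checked against the two intermediate requirements on $m$ used for events $E_{2}$ and $E_{3}$ . The helper names below are hypothetical and the rounding up to integers is an assumption added for convenience.

```python
import math

def label_private_sample_sizes(vc, alpha, beta, epsilon):
    """Evaluate the closing choices of n and of the lower bound on m, rounded up."""
    common = vc * math.log(64 / alpha)
    n = (32 / alpha) * (common + math.log(8 / beta))
    m = (768 / (alpha ** 2 * epsilon)) * (common + 2 * math.log(8 / beta))
    return math.ceil(n), math.ceil(m)

def meets_intermediate_requirements(n, m, alpha, beta, epsilon):
    """Check m >= (8/alpha)(n + ln(4/beta)) and m >= (8/(alpha*epsilon))(n + ln(4/beta))."""
    slack = n + math.log(4 / beta)
    return m >= (8 / alpha) * slack and m >= (8 / (alpha * epsilon)) * slack

n, m = label_private_sample_sizes(vc=3, alpha=0.1, beta=0.1, epsilon=0.5)
ok = meets_intermediate_requirements(n, m, alpha=0.1, beta=0.1, epsilon=0.5)
```

For fixed $\alpha,\beta,\epsilon$ both quantities grow linearly in $\operatorname{\rm VC}(C)$ , which is the content of the theorem.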

Label Privacy Extension

We consider a slight generalization of the label privacy model. Recall that given a labeled sample, a private learner is required to preserve the privacy of the entire sample, while a label-private learner is only required to preserve privacy for the labels of each entry.

Consider a scenario where there is no need to preserve the privacy of the distribution $\mathcal{D}$ (for example, $\mathcal{D}$ might be publicly known), but we still want to preserve the privacy of the entire sample $S$ . We can model this scenario as a learning algorithm $A$ which is given as input two databases: a labeled database $S$ , and an unlabeled database $D$ . For every database $D$ , algorithm $A(D,\cdot)=A_{D}(\cdot)$ must preserve differential privacy. We will refer to such a learner as a Semi-Private learner.

Clearly, $\Omega(\operatorname{\rm VC}(C))$ samples are necessary in order to semi-privately learn a concept class $C$ , as this is the case for non-private learners. (The lower bound of $\Omega(\operatorname{\rm VC}(C))$ is worst case over choices of distributions $\mathcal{D}$ ; for a specific distribution, fewer samples may suffice.) This lower bound is tight, as the above generic learner could easily be adjusted for the semi-privacy model, and result in a generic semi-private learner with sample complexity $O_{\alpha,\beta,\epsilon}(\operatorname{\rm VC}(C))$ . To see this, recall that in the above algorithm, the input sample $S$ is divided into $S_{1}$ and $S_{2}$ . Note that the labels in $S_{1}$ are ignored, and, hence, $S_{1}$ could be replaced with an unlabeled database. Moreover, note that $S_{2}$ is only accessed using the exponential mechanism (on Step 5), which preserves the privacy both for the labels and for the examples in $S_{2}$ .

Consider the task of learning a concept class $C$ , and suppose that the relevant distribution over the population is publicly known. Now, given a labeled database $S$ , we can use a semi-private learner and guarantee privacy both for the labels and for the mere existence of an individual in the database. That is, in such a case, the privacy guarantee of a semi-private learner is the same as that of a private learner. Moreover, the necessary sample complexity is $O_{\alpha,\beta,\epsilon}(\operatorname{\rm VC}(C))$ , which should be contrasted with $O_{\alpha,\beta,\epsilon}(\log|C|)$ , the sample complexity that would result from the general construction of Kasiviswanathan et al.

We thank Salil Vadhan and Jon Ullman for helpful discussions of ideas in this work.