Learning with Differential Privacy: Stability, Learnability and the Sufficiency and Necessity of ERM Principle

Yu-Xiang Wang, Jing Lei, Stephen E. Fienberg

Introduction

Increasing public concerns regarding data privacy have posed obstacles in the development and application of new machine learning methods, as data collectors and curators may no longer be able to share data for research purposes. In addition to addressing the original goal of information extraction, privacy-preserving learning also requires the learning procedure to protect sensitive information of individual data entries. For example, the second Netflix Prize competition was canceled in response to a lawsuit and Federal Trade Commission privacy concerns, and the National Institutes of Health decided in August 2008 to remove aggregate Genome-Wide Association Studies (GWAS) data from the public web site, after learning about a potential privacy risk.

A major challenge in developing privacy-preserving learning methods is to quantify formally the amount of privacy leakage, given all possible and unknown auxiliary information the attacker may have, a challenge in part addressed by the notion of differential privacy (Dwork, 2006; Dwork et al., 2006b). Differential privacy has three main advantages over other approaches: (1) it rigorously quantifies the privacy property of any data analysis mechanism; (2) it controls the amount of privacy leakage regardless of the attacker’s resources or knowledge; and (3) it has useful interpretations from the perspectives of Bayesian inference and statistical hypothesis testing, and hence fits naturally in the general framework of statistical machine learning (see, e.g., Dwork & Lei, 2009; Wasserman & Zhou, 2010; Smith, 2011; Lei, 2011; Wang et al., 2015), as well as in applications involving regression (Chaudhuri et al., 2011; Thakurta & Smith, 2013) and GWAS data (Yu et al., 2014).

In this paper we focus on the following fundamental question about differential privacy and machine learning: What problems can we learn with differential privacy? Most literature focuses on designing differentially private extensions of various learning algorithms, where the methods depend crucially on the specific context and differ vastly in nature. But with the privacy constraint, we have less choice in developing learning and data analysis algorithms. It remains unclear how such a constraint affects our ability to learn, and if it is possible to design a generic privacy-preserving analysis mechanism that is applicable to a wide class of learning problems.

We provide a general answer to the relationship between learnability and differential privacy under Vapnik’s General Learning Setting (Vapnik, 1995) in four aspects.

1. We characterize the subset of problems in the General Learning Setting that can be learned under differential privacy. Specifically, we show that a sufficient and necessary condition for a problem to be privately learnable is the existence of an algorithm that is differentially private and asymptotically minimizes the empirical risk. This characterization generalizes previous studies of the subject (Kasiviswanathan et al., 2011; Beimel et al., 2013a) that focus on binary classification in discrete domain under the PAC learning model. Technically, the result relies on the now well-known intuitive observation that “privacy implies algorithmic stability” and the argument in Shalev-Shwartz et al. (2010) that shows a variant of algorithmic stability is necessary for learnability.

2. We also introduce a weaker notion of learnability, which only requires consistency for a class of distributions $\mathfrak{D}$ . Problems that are not privately learnable (a surprisingly large class that includes simple problems such as 0-1 loss binary classification in continuous feature domain (Chaudhuri & Hsu, 2011)) are usually private $\mathfrak{D}$ -learnable for some “nice” distribution class $\mathfrak{D}$ . We characterize the subset of private $\mathfrak{D}$ -learnable problems that are also (non-privately) learnable using conditions analogous to those in distribution-free private learning.

3. Inspired by the equivalence between private learnability and private AERM, we propose a generic (but impractical) procedure that always finds a consistent and private algorithm for any privately learnable (or $\mathfrak{D}$-learnable) problem. We also study a specific algorithm that aims at minimizing the empirical risk while preserving privacy. We show that under a sufficient condition that relies on the geometry of the hypothesis space and the data distribution, this algorithm is able to privately learn (or $\mathfrak{D}$-learn) a large range of learning problems including classification, regression, clustering, density estimation, etc., and that it is computationally efficient when the problem is convex. In fact, this generic learning algorithm learns any privately learnable problem in the PAC learning setting (Beimel et al., 2013a). It remains an open problem whether the second algorithm also learns any privately learnable problem in the General Learning Setting.

4. Lastly, we provide a preliminary study of learnability under the more practical $(\epsilon,\delta)$-differential privacy. Our results reveal that whether there is a separation between learnability and approximate private learnability depends on how fast $\delta$ is required to go to 0 with respect to the size of the data. Finding where the exact phase transition occurs is an open problem of future interest.

Our primary objective is to understand the conceptual impact of differential privacy and learnability under a general framework and the rates of convergence obtained in the analysis may be suboptimal. Although we do provide some discussion on polynomial time approximations to the proposed algorithm, learnability under computational constraints is beyond the scope of this paper.

Related work

While a large amount of work has been devoted to finding consistent (and rate optimal) differentially private learning algorithms in various settings (e.g., Chaudhuri et al., 2011; Kifer et al., 2012; Jain & Thakurta, 2013; Bassily et al., 2014), the characterization of privately learnable problems was only studied in a few special cases.

Kasiviswanathan et al. (2011) showed that, for binary classification with a finite discrete hypothesis space, anything that is non-privately learnable is privately learnable under the agnostic Probably Approximately Correct (PAC) learning framework; therefore “finite VC-dimension” characterizes the set of privately learnable problems in this setting. Beimel et al. (2013a) extended Kasiviswanathan et al. (2011) by characterizing the sample complexity of the same class of problems, but the result only applies to the realizable (non-agnostic) case. Chaudhuri & Hsu (2011) provided a counter-example showing that for continuous hypothesis space and data space, there is a gap between learnability and learnability under privacy constraint. They proposed to fix this issue by either weakening the privacy requirement to labels only or by restricting the class of potential distributions. While meaningful in some cases, these approaches do not resolve the learnability problem in general.

A key difference of our work from Kasiviswanathan et al. (2011); Chaudhuri & Hsu (2011); Beimel et al. (2013a) is that we consider a more general class of learning problems and provide a proper treatment in a statistical learning framework. This allows us to capture a wider collection of important learning problems (see Figure 1(a) and Table 1).

It is important to note that despite its generality, Vapnik’s general learning setting still does not nearly cover the full spectrum of private learning. In particular, our results do not apply to improper learning (learning using a different hypothesis class) as considered in Beimel et al. (2013a) or to structural loss minimization (where the loss function jointly takes all data points as input) considered in Beimel et al. (2013b). Also, our results do not address the sample complexity problem, which remains open in the general learning setting even for learning without privacy constraints.

Our characterization of private learnability (and private $\mathfrak{D}$-learnability) in Section 3 uses a recent advance in the characterization of general learnability given by Shalev-Shwartz et al. (2010). Roughly speaking, they showed that a problem is learnable if and only if there exists an algorithm that (i) is stable under small perturbation of training data, and (ii) behaves like empirical risk minimization (ERM) asymptotically. We also make use of a folklore observation that “Privacy $\Rightarrow$ Stability $\Rightarrow$ Generalization”. The connection of privacy and stability appeared as early as 2008 in a conference version of Kasiviswanathan et al. (2011). Further connection to “generalization” recently appeared in blog posts (for instance, Frank McSherry described in a blog post an example of exploiting differential privacy for measure concentration, http://windowsontheory.org/2014/02/04/differential-privacy-for-measure-concentration/; Moritz Hardt discussed the connection of differential privacy to stability and generalization in his blog post, http://blog.mrtz.org/2014/01/13/false-discovery.), was stated as a theorem in Appendix F of Bassily et al. (2014), and was shown to hold with strong concentration in Dwork et al. (2015b).

Dwork et al. (2015b) is part of an independent line of work (Hardt & Ullman, 2014; Bassily et al., 2015; Dwork et al., 2015a; Blum & Hardt, 2015) on adaptive data analysis, which also stems from the observation that privacy implies stability and generalization. Compared to the adaptive data analysis works, our focus is quite different. Adaptive data analysis work focuses on the impact of $k$ on how fast the maximum absolute error of $k$ adaptively chosen queries goes to 0 as a function of $n$, while this paper is concerned with whether the error can go to 0 at all for each learning problem when we require the learning algorithm to be differentially private with $\epsilon<\infty$. Nonetheless, we acknowledge that Theorem 7 in Dwork et al. (2015b) provides an interesting alternative proof for “differentially private learners have small generalization error”, when choosing the statistical query as evaluating a loss function at a privately learned hypothesis. The connection is not quite obvious and we provide a more detailed explanation in Appendix B.

The main tool used in the construction of our generic private learning algorithm in Section 4 is the Exponential Mechanism (McSherry & Talwar, 2007), which provides a simple and differentially-private approximation to the maximizer of a score function among a candidate set. In the general learning context, we use the negative empirical risk as the utility function, and apply the exponential mechanism to a possibly pre-discretized hypothesis space. This exponential mechanism approach was used in Bassily et al. (2014) for minimizing convex and Lipschitz functions. The sample discretization procedure has been considered in Chaudhuri & Hsu (2011) and Beimel et al. (2013a). Our scope and proof techniques are different. Our strategy is to show that, under some general regularity conditions, the exponential mechanism is stable and behaves like ERM. Our sublevel set condition has the same flavor as that in the proof of Bassily et al. (2014, Theorem 3.2), although we do not need the loss function to be convex or Lipschitz.

Stability, privacy and generalization were also studied in Thakurta & Smith (2013) with different notions of stability. More importantly, their stability is used as an assumption rather than a consequence, so their result is not directly comparable to ours.

Background

To account for the randomness in the data, we are primarily interested in the case where the data $Z=\{z_{1},...,z_{n}\}\in\mathcal{Z}^{n}$ are independent samples drawn from an unknown probability distribution $\mathcal{D}$ on $\mathcal{Z}$. We denote such a random sample by $Z\sim\mathcal{D}^{n}$. For a given distribution $\mathcal{D}$, let $R(h)$ be the expected loss of hypothesis $h$ and $\hat{R}(h,Z)$ the empirical risk from a sample $Z\in\mathcal{Z}^{n}$. Writing $\ell(h,z)$ for the loss of $h$ at a data point $z$,

$$R(h)=\mathbb{E}_{z\sim\mathcal{D}}\,\ell(h,z),\qquad\hat{R}(h,Z)=\frac{1}{n}\sum_{i=1}^{n}\ell(h,z_{i}).$$

The optimal risk is $R^{*}=\inf_{h\in\mathcal{H}}R(h)$ and we assume that it is achieved by an optimal $h^{*}\in\mathcal{H}$. Similarly, the minimal empirical risk $\hat{R}^{*}(Z)=\inf_{h\in\mathcal{H}}\hat{R}(h,Z)$ is achieved by $\hat{h}^{*}(Z)\in\mathcal{H}$. For a possibly randomized algorithm $\mathcal{A}:\mathcal{Z}^{n}\rightarrow\mathcal{H}$ that learns some hypothesis $\mathcal{A}(Z)\in\mathcal{H}$ given data sample $Z$, we say $\mathcal{A}$ is consistent if

$$\lim_{n\rightarrow\infty}\mathbb{E}_{Z\sim\mathcal{D}^{n}}\left[\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h)\right]=R^{*}.$$

In addition, we say $\mathcal{A}$ is consistent with rate $\xi(n)$ if

$$\mathbb{E}_{Z\sim\mathcal{D}^{n}}\left[\mathbb{E}_{h\sim\mathcal{A}(Z)}R(h)\right]-R^{*}\leq\xi(n),\quad\text{where }\xi(n)\rightarrow 0\text{ as }n\rightarrow\infty.$$

Since the distribution $\mathcal{D}$ is unknown, we cannot adapt the algorithm $\mathcal{A}$ to $\mathcal{D}$ , especially when privacy is a concern. Also, even if $\mathcal{A}$ is pointwise consistent for any distribution $\mathcal{D}$ , it may have different rates for different $\mathcal{D}$ and potentially be arbitrarily slow for some $\mathcal{D}$ . This makes it hard to evaluate whether $\mathcal{A}$ indeed learns the learning problem and forbids the study of the learnability problem. In this study, we adopt the stronger notion of learnability considered in Shalev-Shwartz et al. (2010), which is a direct generalization of PAC-learnability (Valiant, 1984) and agnostic PAC-learnability (Kearns et al., 1992) to the General Learning Setting as studied by Haussler (1992).

This definition requires consistency to hold universally for any distribution $\mathcal{D}$ with a uniform (distribution-independent) rate $\xi(n)$ . This type of problem is often called distribution-free learning (Valiant, 1984), and an algorithm is said to be universally consistent with rate $\xi(n)$ if it realizes the criterion.

2 Differential privacy

Differential privacy requires that if we arbitrarily perturb a database by only one data point, the output should not differ much. Therefore, if one conducts a statistical test for whether any individual is in the database or not, the false positive and false negative probabilities cannot both be small (Wasserman & Zhou, 2010). Formally, define the “Hamming distance”

$$d(Z,Z^{\prime}):=\#\{i=1,...,n:\ z_{i}\neq z_{i}^{\prime}\}.\tag{3}$$

An algorithm $\mathcal{A}$ is $\epsilon$-differentially private if

$$\mathbb{P}(\mathcal{A}(Z)\in H)\leq e^{\epsilon}\,\mathbb{P}(\mathcal{A}(Z^{\prime})\in H)$$

for all $Z,Z^{\prime}$ obeying $d(Z,Z^{\prime})=1$ and any measurable subset $H\subseteq\mathcal{H}$.

There are weaker notions of differential privacy. For example $(\epsilon,\delta)$ -differential privacy allows for a small probability $\delta$ where the privacy guarantee does not hold. In this paper, we will mainly work with the stronger $\epsilon$ -differential privacy. In Section 6 we discuss the problem of $(\epsilon,\delta)$ -differential privacy and extend some of the results to this setting.
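As a concrete illustration of the $\epsilon$-differential privacy requirement, the sketch below checks the defining inequality exhaustively for a toy mechanism. The mechanism (pick one record uniformly at random and release it through binary randomized response) and all function names are assumptions made for illustration only; they do not appear in the paper.

```python
import itertools
import math


def output_distribution(Z, eps):
    """P(A(Z) = b) for b in {0, 1}.

    Toy mechanism (an assumption of this sketch): pick a uniformly random
    record of Z and report it truthfully with probability e^eps / (1 + e^eps),
    flipped otherwise.
    """
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    frac_ones = sum(Z) / len(Z)
    p_one = frac_ones * p_truth + (1.0 - frac_ones) * (1.0 - p_truth)
    return {0: 1.0 - p_one, 1: p_one}


def hamming(Z, Zp):
    """Number of positions at which two data sets differ."""
    return sum(a != b for a, b in zip(Z, Zp))


def worst_case_ratio(n, eps):
    """Largest ratio P(A(Z)=b) / P(A(Z')=b) over all neighbouring binary data sets.

    The output space has two points, so bounding the ratio on singletons
    bounds it on every measurable subset as well.
    """
    worst = 0.0
    for Z in itertools.product([0, 1], repeat=n):
        for Zp in itertools.product([0, 1], repeat=n):
            if hamming(Z, Zp) != 1:
                continue
            P = output_distribution(Z, eps)
            Q = output_distribution(Zp, eps)
            for b in (0, 1):
                worst = max(worst, P[b] / Q[b])
    return worst


eps = 0.5
ratio = worst_case_ratio(n=4, eps=eps)
print(ratio, "<=", math.exp(eps), ":", ratio <= math.exp(eps))
```

For this particular mechanism the worst ratio shrinks as $n$ grows, because each record is selected with probability only $1/n$.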

Our objective is to understand whether there is a gap between learnable problems and privately learnable problems in the general learning setting, and to quantify the tradeoff required to protect privacy. To achieve this objective, we need to show the existence of an algorithm that learns a class of problems while preserving differential privacy. More formally, we define

Definition 3. A learning problem is privately learnable with rate $\xi(n)$ if there exists an algorithm $\mathcal{A}$ that satisfies both universal consistency (as in Definition 1) with rate $\xi(n)$ and $\epsilon$-differential privacy with privacy parameter $\epsilon<\infty$.

We can view the consistency requirement in Definition 3 as a measure of utility. This utility is not a function of the observed data, however, but rather of how the results generalize to unseen data.

The following lemma shows that the above definition of private learnability is actually equivalent to a seemingly much stronger condition with a vanishing privacy loss $\epsilon$ .

Lemma 4. If there is an $\epsilon$-DP algorithm that is consistent with rate $\xi(n)$ for some constant $0<\epsilon<\infty$, then there is a $\frac{2}{\sqrt{n}}\left(e^{\epsilon}-e^{-\epsilon}\right)$-DP algorithm that is consistent with rate $\xi(\sqrt{n})$.

The proof, given in Section A.1, uses a subsampling theorem adapted from Beimel et al. (2014, Lemma 4.4).
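To get a feel for how quickly the privacy loss in the lemma vanishes, the following sketch simply evaluates the expression $\frac{2}{\sqrt{n}}(e^{\epsilon}-e^{-\epsilon})$ for a fixed starting $\epsilon$ and growing $n$. The function name `amplified_epsilon` is hypothetical.

```python
import math


def amplified_epsilon(eps, n):
    """Privacy parameter (2 / sqrt(n)) * (e^eps - e^-eps) from the lemma above."""
    return 2.0 / math.sqrt(n) * (math.exp(eps) - math.exp(-eps))


# Starting from a constant privacy loss eps = 1, the new parameter decays like 1/sqrt(n).
for n in (100, 10_000, 1_000_000):
    print(n, amplified_epsilon(1.0, n))
```

The price paid for the vanishing privacy loss is the slower rate $\xi(\sqrt{n})$ in place of $\xi(n)$.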

There are many approaches to design differentially private algorithms, such as noise perturbation using Laplace noise (Dwork, 2006; Dwork et al., 2006b) and the Exponential Mechanism (McSherry & Talwar, 2007). Our construction of generic differentially private learning algorithms applies the Exponential Mechanism to penalized empirical risk minimization. Our argument will make use of a general characterization of learnability described below.
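For readers unfamiliar with noise perturbation, here is a minimal sketch of the Laplace mechanism applied to a bounded mean. It assumes the data lie in $[0,1]$, so that replacing one record changes the mean by at most $1/n$; the function names and the choice of statistic are illustrative assumptions and are not part of the construction used in this paper.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as a scaled difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean(values, eps, rng):
    """eps-DP release of the mean of values assumed to lie in [0, 1].

    Replacing one record moves the mean by at most 1/n (the sensitivity),
    so Laplace noise with scale sensitivity / eps suffices.
    """
    n = len(values)
    sensitivity = 1.0 / n
    return sum(values) / n + laplace_noise(sensitivity / eps, rng)


rng = random.Random(0)
data = [0.2, 0.4, 0.9, 0.5] * 250  # n = 1000, true mean 0.5
print(private_mean(data, eps=1.0, rng=rng))
```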

3 Stability and Asymptotic ERM

An important breakthrough in learning theory is a full characterization of all learnable problems in the General Learning Setting in terms of stability and empirical risk minimization (Shalev-Shwartz et al., 2010). Without assuming uniform convergence of empirical risk, Shalev-Shwartz et al. showed that a problem is learnable if and only if there exists a “strongly uniform-RO stable” and “always asymptotically empirical risk minimization” (Always AERM) randomized algorithm that learns the problem. Here “RO” stands for “replace one”. Also, any strongly uniform-RO stable and “universally” AERM (weaker than “always” AERM) learning rule learns the problem consistently. Here we give detailed definitions.

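A (possibly randomized) learning rule $\mathcal{A}$ is Universally AERM if for any distribution $\mathcal{D}$ defined on domain $\mathcal{Z}$

$$\mathbb{E}_{Z\sim\mathcal{D}^{n}}\left[\mathbb{E}_{h\sim\mathcal{A}(Z)}\hat{R}(h,Z)-\hat{R}^{*}(Z)\right]\rightarrow 0\quad\text{as }n\rightarrow\infty,$$

where $\hat{R}^{*}(Z)$ is the minimum empirical risk for data set $Z$. We say $\mathcal{A}$ is Always AERM if, in addition,

$$\sup_{Z\in\mathcal{Z}^{n}}\left[\mathbb{E}_{h\sim\mathcal{A}(Z)}\hat{R}(h,Z)-\hat{R}^{*}(Z)\right]\rightarrow 0\quad\text{as }n\rightarrow\infty.$$

Definition 6. An algorithm $\mathcal{A}$ is strongly uniform RO-stable if

$$\sup_{z\in\mathcal{Z}}\ \sup_{Z,Z^{\prime}\in\mathcal{Z}^{n},\ d(Z,Z^{\prime})=1}\left|\mathbb{E}_{h\sim\mathcal{A}(Z)}\ell(h,z)-\mathbb{E}_{h\sim\mathcal{A}(Z^{\prime})}\ell(h,z)\right|\rightarrow 0\quad\text{as }n\rightarrow\infty,$$

where $\ell(h,z)$ is the loss of hypothesis $h$ at a point $z$ and $d(Z,Z^{\prime})$ is defined in (3); in other words, $Z$ and $Z^{\prime}$ can differ by at most one data point.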

Since we will not deal with other variants of algorithmic stability in this paper (e.g., hypothesis stability (Kearns & Ron, 1999), uniform stability (Bousquet & Elisseeff, 2002) and leave-one-out (LOO) stability in Mukherjee et al. (2006)), we simply call Definition 6 stability or uniform stability. Likewise, we will refer to $\epsilon$ -differential privacy as just “privacy” although there are several other notions of privacy in the literature.

Characterization of private learnability

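Theorem 7. For a learning problem, the following statements are equivalent.

1. The problem is privately learnable.

2. There exists a differentially private universally AERM algorithm.

3. There exists a differentially private always AERM algorithm.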

The proof is simple yet revealing. We will present the arguments for $2\Rightarrow 1$ (sufficiency of AERM) in Section 3.1 and $1\Rightarrow 3$ (necessity of AERM) in Section 3.2. $3\Rightarrow 2$ follows trivially from the definition of “always” and “universal” AERM.

The theorem says that we can stick to ERM-like algorithms for private learning, even though ERM may fail for some problems in the (non-private) general learning setting (Shalev-Shwartz et al., 2010). Thus a standard procedure for finding universally consistent and differentially private algorithms would be to approximately minimize the empirical risk using some differentially private procedure (Chaudhuri et al., 2011; Kifer et al., 2012; Bassily et al., 2014). If the utility analysis reveals that the method is AERM, we do not need to worry about generalization as it is guaranteed by privacy. This consistency analysis is considerably simpler than for non-private learning problems, where one typically needs to control generalization error either via uniform convergence (VC-dimension, Rademacher complexity, metric entropy, etc.) or by adopting the stability argument (Shalev-Shwartz et al., 2010).

This result does not imply that privacy is helping the algorithm to learn in any sense, as the simplicity is achieved at the cost of having a smaller class of learnable problems. A concrete example of a problem being learnable but not privately learnable is given in (Chaudhuri & Hsu, 2011) and we will revisit it in Section 3.3. For some problems where ERM fails, it may not be possible to make it AERM while preserving privacy. In particular, we were not able to privatize the problem in Section 4.1 of Shalev-Shwartz et al. (2010).

To avoid any potential misunderstanding, we stress that Theorem 7 is a characterization of learnability, not learning algorithms. It does not prevent the existence of a universally consistent learning algorithm that is private but not AERM. Also, the characterization given in Theorem 7 is about consistency, and it does not claim anything on sample complexity. An algorithm that is AERM may be suboptimal in terms of convergence rate.

A key ingredient in the proof of sufficiency is a well-known heuristic observation that differential privacy by definition implies uniform stability, which is useful in its own right.

The proof of this lemma comes directly from the definition of differential privacy so it is algorithm independent. The converse, however, is not true in general (e.g., a non-trivial deterministic algorithm can be stable, but not differentially private.)
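The step from privacy to stability can be checked numerically. If two output distributions $P$ and $Q$ over hypotheses satisfy $P(h)\leq e^{\epsilon}Q(h)$ and $Q(h)\leq e^{\epsilon}P(h)$ for every $h$, and the loss takes values in $[0,1]$ (an assumption of this sketch), then $\mathbb{E}_{P}\ell-\mathbb{E}_{Q}\ell\leq(e^{\epsilon}-1)\,\mathbb{E}_{Q}\ell\leq e^{\epsilon}-1$, and symmetrically with $P$ and $Q$ exchanged. The code below draws random pairs of this kind on a finite hypothesis set and confirms the bound; the construction of the pairs and all names are hypothetical.

```python
import math
import random


def random_dp_pair(k, eps, rng):
    """Two distributions on k points whose pointwise ratios lie in [e^-eps, e^eps].

    Q is a random distribution; P is Q reweighted by factors in
    [e^(-eps/2), e^(eps/2)] and renormalized, which keeps every ratio
    P(h)/Q(h) within [e^-eps, e^eps].
    """
    q = [rng.random() + 1e-3 for _ in range(k)]
    total_q = sum(q)
    q = [v / total_q for v in q]
    w = [math.exp(rng.uniform(-eps / 2.0, eps / 2.0)) for _ in range(k)]
    p = [qi * wi for qi, wi in zip(q, w)]
    total_p = sum(p)
    p = [v / total_p for v in p]
    return p, q


def max_stability_gap(trials, k, eps, rng):
    """Largest |E_P loss - E_Q loss| seen over random pairs and random [0,1] losses."""
    worst = 0.0
    for _ in range(trials):
        p, q = random_dp_pair(k, eps, rng)
        loss = [rng.random() for _ in range(k)]
        gap = abs(sum(pi * li for pi, li in zip(p, loss))
                  - sum(qi * li for qi, li in zip(q, loss)))
        worst = max(worst, gap)
    return worst


eps = 0.3
gap = max_stability_gap(trials=2000, k=8, eps=eps, rng=random.Random(0))
print(gap, "<=", math.exp(eps) - 1.0)
```

Note that the bound $e^{\epsilon}-1$ uses nothing about how the algorithm produces its output, which matches the remark that the argument is algorithm independent.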

Corollary 9. If a learning algorithm $\mathcal{A}$ is $\epsilon(n)$-differentially private and $\mathcal{A}$ is universally AERM with rate $\xi(n)$, then $\mathcal{A}$ is universally consistent with rate $\xi(n)+e^{\epsilon(n)}-1=O(\xi(n)+\epsilon(n))$.

The proof of Corollary 9, provided in the Appendix, combines Lemma 8 and the fact that consistency is implied by stability and AERM (Theorem 35). Our Theorem 35 is based on minor modifications of Theorem 8 in Shalev-Shwartz et al. (2010). In fact, Corollary 9 can be stated in a stronger per distribution form, since universality is not used in the proof. We will revisit this point when we discuss a weaker notion of private learnability below.

Lemma 4 and Corollary 9 together establish $2\Rightarrow 1$ in Theorem 7.

If for a problem privacy and always AERM cannot coexist, then the problem is not privately learnable. This is what we will show next.

2 Necessity: Consistency implies Always AERM

To prove that the existence of an always AERM learning algorithm is necessary for any privately learnable problem, it suffices to construct such a learning algorithm from any universally consistent learning algorithm, for each learnable problem.

Lemma 10. If $\mathcal{A}$ is a universally consistent learning algorithm satisfying $\epsilon$-DP with any $\epsilon>0$ and consistent with rate $\xi(n)$, then there is another universally consistent learning algorithm $\mathcal{A}^{\prime}$ that is always AERM with rate $\xi(\sqrt{n})$ and satisfies $\frac{2}{\sqrt{n}}(e^{\epsilon}-e^{-\epsilon})$-DP.

Lemma 10 is proved in Section A.2. The proof idea is to run $\mathcal{A}$ on a size $O(\sqrt{n})$ random subsample of $Z$ , which will be universally consistent with a slower rate, differentially private with $\epsilon(n)\rightarrow 0$ (Lemma 34), and at the same time always AERM. The last part uses an argument in Lemma 24 of Shalev-Shwartz et al. (2010) which appeals to the universality of $\mathcal{A}$ ’s consistency on a specific discrete distribution supported on the given data set $Z$ .

As pointed out by an anonymous reviewer, there is a simpler proof by invoking Theorem 10 of Shalev-Shwartz et al. (2010) that says any consistent and generalizing algorithm must be AERM and a result (e.g., Bassily et al., 2014, Appendix F) that says “privacy $\Rightarrow$ generalization”. This is a valid observation. But their Theorem 10 is proven using a detour through “generalization”, which leads to a slower rate than what we are able to obtain in Lemma 10 using a more direct argument.

3 Private Learnability vs. Non-private Learnability

Now that we have a characterization of all privately learnable problems, a natural question to ask is whether any learnable problem is also privately learnable. The answer is “yes” for learning in the Statistical Query (SQ) model and the PAC learning model (binary classification) with finite hypothesis space, and is “no” for continuous hypothesis space (Chaudhuri & Hsu, 2011).

By definition, all privately learnable problems are learnable. But now that we know that privacy implies generalization, it is tempting to hope that privacy can help at least some problem to learn better than any non-private algorithm. In terms of learnability, the question becomes: Could there be a (learnable) problem that is exclusively learnable through private algorithms? We now show that such a problem does not exist.

If a learning problem is learnable by an $\epsilon$ -DP algorithm $\mathcal{A}$ , then it is also learnable by a non-private algorithm.

The proof is given in Section A.3. The idea is that $\mathcal{A}(Z)$ defines a distribution over $\mathcal{H}$. Pick a $z\in\mathcal{Z}$. If $z\notin Z$, algorithm $\mathcal{A}^{\prime}=\mathcal{A}$. Otherwise, $\mathcal{A}^{\prime}(Z)$ samples from a slightly different distribution than $\mathcal{A}(Z)$ that does not affect the expectation much.

On the other hand, not all learnable problems are privately learnable. This can already be seen from Chaudhuri & Hsu (2011), where the gap between learning and private learning is established. We revisit Chaudhuri & Hsu’s example in our notation under the general learning setting and produce an alternative proof by showing that differential privacy contradicts always AERM, then invoking Theorem 7 to show the problem is not privately learnable.

There exists a problem that is learnable by a non-private algorithm, but not privately learnable. In particular, any private algorithm cannot be always AERM in this problem.

We describe the counterexample and re-establish the impossibility of private learning for this problem using the contrapositive of Theorem 7, which says that if privacy and always AERM cannot coexist for some problem, then the problem is not privately learnable.

Consider the binary classification problem with $\mathcal{X}=[0,1]$, $\mathcal{Y}=\{0,1\}$ and 0-1 loss function. Let $\mathcal{H}$ be the collection of threshold functions that output $h(x)=1$ if $x>h$ and $h(x)=0$ otherwise. This class has VC-dimension 1, and hence the problem is learnable.

Next we will construct $K=\lceil\exp(\epsilon_{n}n)\rceil$ data sets such that if $K-1$ of them obey AERM, the remaining one cannot. Let $\eta=1/\exp(\epsilon n)$, $K:=\lceil 1/\eta\rceil$. Let $h_{1},h_{2},...,h_{K}$ be distinct thresholds that are at least $\eta$ apart, so that the intervals $[h_{i}-\eta/3,h_{i}+\eta/3]$ are disjoint.

since these intervals are disjoint. Then by the definition of $\epsilon$ -DP,

As is pointed out by an anonymous reviewer, the same conclusion of this impossibility result of privately learning thresholds on $[0,1]$ can be drawn numerically through the characterization of the sample complexity (Beimel et al., 2013a), via the bound that depends on $\log(|\mathcal{H}|)$, and on $[0,1]$ this number is infinite. The above analysis provides different insights about the problem. We will be using it again for understanding the separation of learnability and learnability under $(\epsilon,\delta)$-differential privacy later in Section 6.

4 Private $\mathfrak{D}$-learnability

The above example implies that even very simple learning problems may not be privately learnable. To fix this caveat, note that most data sets of practical interest have nice distributions. Therefore, it makes sense to consider a smaller class of distributions, e.g., smooth distributions that have bounded $k$ th order derivative, or those having bounded total variation. These are common assumptions in non-parametric statistics, such as kernel density estimation, smoothing spline regression and mode clustering. Similarly, in high dimensional statistics, there are often assumptions on the structures of the underlying distribution, such as sparsity, smoothness, and low-rank conditions.

Almost all of our arguments hold in a per distribution fashion, therefore they also hold for any such subclass $\mathfrak{D}$ . The only exception is the necessity of “always AERM” (Lemma 10), where we used the universal consistency on an arbitrary discrete uniform distribution in the proof. The characterization still holds if the class $\mathfrak{D}$ contains all finite discrete uniform distributions. For general distribution classes, we characterize private $\mathfrak{D}$ -learnability using a weaker “universally AERM” (instead of “always AERM”) under the assumption that the problem itself is learnable in a distribution-free setting without privacy constraints.

Lemma 14. If an $\epsilon$-DP algorithm $\mathcal{A}$ is $\mathfrak{D}$-universally consistent with rate $\xi(n)$ and the problem itself is learnable in a distribution-free sense with rate $\xi^{\prime}(n)$, then there exists a $\mathfrak{D}$-universally consistent learning algorithm $\mathcal{A}^{\prime}$ that is $\mathfrak{D}$-universally AERM with rate $12\xi^{\prime}(n^{1/4})+\frac{37}{\sqrt{n}}+\xi(\sqrt{n})$ and satisfies $\frac{2}{\sqrt{n}}(e^{\epsilon}-e^{-\epsilon})$-DP.

The proof, given in Section A.4, shows that the algorithm $\mathcal{A}^{\prime}$ that applies $\mathcal{A}$ to a random subsample of size $\lfloor\sqrt{n}\rfloor$ is AERM for any distribution in the class $\mathfrak{D}$ .
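The subsampling construction is easy to state in code. The sketch below wraps an arbitrary learning algorithm so that it only sees a random subsample of size $\lfloor\sqrt{n}\rfloor$. Sampling without replacement and the name `run_on_subsample` are assumptions of this sketch; the source only specifies the subsample size.

```python
import math
import random


def run_on_subsample(algorithm, Z, rng):
    """Apply `algorithm` to a random subsample of Z of size floor(sqrt(n)).

    `algorithm` is any callable mapping a list of data points to a hypothesis.
    The subsample is drawn uniformly without replacement (an assumption here).
    """
    m = math.isqrt(len(Z))  # floor(sqrt(n))
    subsample = rng.sample(Z, m)
    return algorithm(subsample)


def sample_mean(points):
    """Stand-in for a learning algorithm: return the sample mean."""
    return sum(points) / len(points)


rng = random.Random(0)
Z = [i / 100.0 for i in range(100)]  # n = 100, so the wrapped algorithm sees 10 points
print(run_on_subsample(sample_mean, Z, rng))
```

Because each record enters the subsample with probability roughly $1/\sqrt{n}$, the wrapped algorithm is less sensitive to any single record, which is the intuition behind the smaller privacy parameter.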

Theorem 15. A problem is privately $\mathfrak{D}$-learnable if there exists an algorithm that is $\mathfrak{D}$-universally AERM and differentially private with privacy loss $\epsilon(n)\rightarrow 0$. If, in addition, the problem is (distribution-free and non-privately) learnable, then the converse is also true.

The “if” part is exactly the same as the argument in Section 3.1, since both Lemma 8 and Corollary 9 hold for each distribution independently. Under the additional assumption that the problem itself is learnable (distribution-free and non-privately), the “only if” part is given by Lemma 14. ∎

This result may appear to be unsatisfactory due to the additional assumption of learnability. It is clearly a strong assumption because many problems that are $\mathfrak{D}$ -learnable for a practically meaningful $\mathfrak{D}$ are not actually learnable. We provide one such example here.

For any discrete distribution with a finite support set, there is an $h\in\mathcal{H}$ such that the optimal risk is 0. Assume the problem is learnable with rate $\xi(n)$; then for some $n$, $\xi(n)<0.5$. However, we can always construct a uniform distribution over $3n$ elements, and it is information-theoretically impossible for any estimator based on $n$ samples from the distribution to achieve a risk better than $2/3$. The problem is therefore not learnable. When we assume an upper bound $N$ on the maximum number of bins of the underlying distribution, then the ERM which outputs just the support of all observed data will be universally consistent with rate $\xi(n)=N/n$. ∎

It turns out that we cannot hope to completely remove the assumption from Theorem 15. The following example illustrates that some form of qualification (implied by the learnability assumption) is necessary for the converse statement to be true.

Consider the learning problem in Example 16. Let $\mathfrak{D}$ be the class of all continuous distributions. There is a learning problem that is privately $\mathfrak{D}$-learnable but for which no private AERM algorithm exists.

Interestingly, this problem is $\mathfrak{D}$-learnable via a non-private AERM algorithm, which always outputs $h={Z}$. This is 0-consistent and 0-AERM but not generalizing. This example suggests that $\mathfrak{D}$-learnability and learnability are quite different because for learnable problems, if an algorithm is consistent and AERM, then it must also be generalizing (Shalev-Shwartz et al., 2010, Theorem 10).

5 A generic learning algorithm

The characterization of private learnability suggests a generic (but impractical) procedure that learns all privately learnable problems (in the same flavor as the generic algorithm in Shalev-Shwartz et al. (2010) that learns all learnable problems). This is to solve

or to privately $\mathfrak{D}$ -learn the problem when (6) is not feasible

Assume the problem is learnable. If the problem is private learnable, (6) will always output a universally consistent private learning algorithm. If the problem is private $\mathfrak{D}$ -learnable, (7) will always output a $\mathfrak{D}$ -universally consistent private learning algorithm.

If the problem is private learnable, by Theorem 7 there exists an algorithm $\mathcal{A}$ that is $\epsilon(n)$ -DP and always AERM with rate $\xi(n)$ and $\epsilon(n)+\xi(n)\rightarrow 0$ . This $\mathcal{A}$ is a witness in the optimization, so we know that any minimizer of (6) will have an objective value that is no greater than $\epsilon(n)+\xi(n)$ for any $n$ . Corollary 9 then yields its universal consistency. The second claim follows from the characterization of private $\mathfrak{D}$ -learnability in Theorem 15. ∎

It is of course impossible to minimize the supremum over any data $Z$ , nor is it possible to efficiently search over the space of all algorithms, let alone DP algorithms. But conceptually, this formulation may be of interest to theoretical questions related to the search of private learning algorithms and the fundamental limit of machine learning under privacy constraints.

Private learning for penalized ERM

Now we describe a generic and practical class of private learning algorithms, based on the idea of minimizing the empirical risk under privacy constraint:

The first term is empirical risk and the second term vanishes as $n$ increases so that this estimator is asymptotically ERM. The same formulation has been studied before in the context of differentially private machine learning (Chaudhuri et al., 2011; Kifer et al., 2012), but our focus is more generic and does not require the objective function to be convex, differentiable, continuous, or even have a finite dimensional Euclidean space embedding, hence covers a larger class of learning problems.

Our generic algorithm for differentially private learning is summarized in Algorithm 1. It applies the exponential mechanism (McSherry & Talwar, 2007) to penalized ERM. We note that this algorithm implicitly requires that $\int_{\mathcal{H}}\exp(\frac{\epsilon(n)}{2\Delta q}q(h,Z))dh<\infty$ ; otherwise the distribution is not well-defined and it does not make sense to talk about differential privacy. In general, if $\mathcal{H}$ is a compact set with a finite volume (with respect to a base measure, such as the Lebesgue measure or counting measure), then such a distribution always exists. We will revisit this point and discuss the practicality of this assumption in Section 5.3.
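As an illustration of the sampling step, the following is a minimal sketch of the exponential mechanism over a finite hypothesis set, where the counting measure makes the normalizing constant trivially finite. The function names, the toy utilities and the parameter values are assumptions made for this example and are not part of Algorithm 1 as stated in the paper; `sensitivity` plays the role of $\Delta q$ .

```python
import math
import random


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Return sampling probabilities proportional to
    exp(epsilon * q(h, Z) / (2 * sensitivity)) over a finite hypothesis set."""
    scale = epsilon / (2.0 * sensitivity)
    # Subtracting the maximum utility leaves the distribution unchanged
    # and avoids overflow in exp().
    q_max = max(utilities)
    weights = [math.exp(scale * (q - q_max)) for q in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def sample_hypothesis(hypotheses, utilities, epsilon, sensitivity, rng=random):
    """Draw one hypothesis from the exponential-mechanism distribution."""
    probs = exponential_mechanism_probs(utilities, epsilon, sensitivity)
    return rng.choices(hypotheses, weights=probs, k=1)[0]


# Toy usage (hypothetical values): utilities are negative empirical risks.
hypotheses = ["h1", "h2", "h3"]
utilities = [-0.10, -0.30, -0.20]
chosen = sample_hypothesis(hypotheses, utilities, epsilon=0.5, sensitivity=0.01,
                           rng=random.Random(0))
```

For a continuous $\mathcal{H}$ the sum over hypotheses becomes the integral in the integrability requirement above, and sampling generally needs a problem-specific method.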

Using the characterization results developed so far, we are able to give sufficient conditions for consistency of private learning algorithms without having to establish uniform convergence. Define the sublevel set as

where $F(h,Z)$ is the regularized empirical risk function defined in (8). In particular, we assume the following conditions:

A2. Sublevel set condition: There exist a positive integer $n_{0}$ , a positive real number $t_{0}$ , and a sequence of regularizers $g_{n}$ satisfying $\sup_{h\in\mathcal{H}}|g_{n}(h)|=o(n)$ , such that for any $0<t<t_{0}$ , $n>n_{0}$

where $K=K(n),\rho=\rho(n)$ satisfy $\log K+\rho\log n=o(n)$ . Here the measure $\mu$ may depend on context, such as Lebesgue measure ( $\mathcal{H}$ is continuous) or counting measure ( $\mathcal{H}$ is discrete).

The first condition of boundedness is common. It is assumed in Vapnik’s characterization for ERM learnability and Shalev-Shwartz et al.’s general characterization of all learnable problems. In fact, we can always consider $\mathcal{H}$ to be a sublevel set such that the boundedness condition holds. For the second condition, the intuition is that we require the sublevel set to be large enough such that the sampling procedure will return a good hypothesis with large probability. $\mu(\mathcal{S}_{t})$ is a critical parameter in the utility guarantee for the exponential mechanism (McSherry & Talwar, 2007). Also, it is worth pointing out that A2 implies that the exponential distribution is well-defined.

In particular, if $\epsilon(n)=o(1)$ , $\sup_{h\in\mathcal{H}}|g_{n}(h)|=o(1)$ and $\log K+\rho\log n=o(n\epsilon(n))$ for all $\mathcal{D}$ (in $\mathfrak{D}$ ) Algorithm 1 privately learns ( $\mathfrak{D}$ -learns) the problem.

We give an illustration of the proof in Figure 3. The detailed proof, based on the stability argument (Shalev-Shwartz et al., 2010), is deferred to Section A.5.

To see that Theorem 19 actually covers a large number of problems in the general learning setting, we provide concrete examples below that satisfy A1 and A2, for both privately learnable and privately $\mathfrak{D}$ -learnable problems that can be learned using Algorithm 1.

We start from a few cases where Algorithm 1 is universally consistent for all distributions.

Suppose $\mathcal{H}$ can be fully encoded by $M$ -bits, then

since there is at least one optimal hypothesis for each function and $\mu$ is now the counting measure. In other words, we can take $K=2^{M}$ and $\rho=0$ in (11). Plugging this into the expression and taking $g_{n}\equiv 0$ , $\epsilon(n)=\sqrt{(M+\log n)/n}$ , we get a rate of consistency $\xi(n)=O(\frac{M+\log n}{\sqrt{n}})$ . In addition, if we can find a data-independent covering set for a continuous space, then we can discretize the space and the same results follow. This observation will be used in the construction of many private learning algorithms below.

Then for sufficiently small $t$ , we have Lebesgue measure

and Condition A.2 holds with $K=\mu(\mathcal{H})\beta_{p}^{-1}L^{d}$ , $\rho=d$ . Furthermore, if we take $\epsilon(n)=\sqrt{\frac{d(\log L+\log n)+\log(\mu(\mathcal{H})/\beta_{p})}{n}}$ , the algorithm is $O\left(\sqrt{\frac{d(\log L+\log n)+\log(\mu(\mathcal{H})/\beta_{p})}{n}}+\underset{h\in\mathcal{H}}{\sup}|g_{n}(h)|\right)$ -consistent.

This shows that condition A2 holds for a large class of low-dimensional problems of interest in machine learning and one can learn the problem privately without actually needing to find a covering set algorithmically. Specifically, the example includes many practically used methods such as logistic regression, linear SVM, ridge regression, even multi-layer neural networks, since the loss functions in these methods are jointly bounded in $(Z,h)$ and Lipschitz in $h$ .

The example also raises an interesting observation: while differentially private classification is not possible in a distribution-free setting for the 0-1 loss function (Chaudhuri & Hsu, 2011), the problem is learnable under a smoother surrogate loss, e.g., logistic loss or hinge loss. In other words, private learnability and computational tractability both benefit from the same relaxation.

The Lipschitz condition still requires the dimension of the hypothesis space to be $o(n)$ . Thus it does not cover high-dimensional machine learning problems where $d\gg n$ , nor does it contain the example of Shalev-Shwartz et al. (2010) in which ERM fails.

For high dimensional problems where $d$ grows with $n$ , typically some assumptions or restrictions need to be made either on the data or on the hypothesis space (so that it becomes essentially low-dimensional). We give one example here for the problem of sparse regression.

2 Examples of privately $\mathfrak{D}$ -learnable problems.

For problems where private learnability is impossible to achieve, we may still apply Theorem 19 to prove the weaker private $\mathfrak{D}$ -learnability for some specific class of distributions.

For binary classification problems with $0$ - $1$ loss (PAC learning), this has been well-studied. In particular, Beimel et al. (2013a) characterized the sample complexity of privately learnable problems using a combinatorial condition they call a “Probabilistic Representation”, which basically involves finding a finite, data-independent set of hypotheses to approximate any hypothesis in the class. Their claim is that if the “representation dimension” is finite, then the problem is privately learnable, otherwise it is not. We can extend the notion of probabilistic representation beyond the finite discrete and countably infinite hypothesis class considered in Beimel et al. (2013a) to cases when the problem is not privately learnable (e.g., learning threshold functions on $[0,1]$ ). The existence of a probabilistic representation for all distributions in $\mathfrak{D}$ would lead to a $\mathfrak{D}$ -universally private learning algorithm.

Another way to define a class of distribution $\mathfrak{D}$ is to assume the existence of a reference distribution that is close to any distribution of interest as in Chaudhuri & Hsu (2011).

To deal with the $0$ - $1$ loss classification problems on a continuous hypothesis domain, Chaudhuri & Hsu (2011) assume that there exists a data-independent reference distribution $\mathcal{D}^{*}$ which, after its density is multiplied by a fixed constant, uniformly dominates any distribution of interest. This essentially produces a subset of distributions $\mathfrak{D}$ . The consequence is that one can build an $\epsilon$ -net of $\mathcal{H}$ with metric defined on the risk under $\mathcal{D}^{*}$ and this will also be a (looser) covering set for any distribution $\mathcal{D}\in\mathfrak{D}$ , thereby learning the problem for any distribution in the set.

The same idea can be applied to the general learning setting. For any fixed reference distribution $\mathcal{D}^{*}$ defined on $\mathcal{Z}$ and constant $c$ ,

is a valid set of distributions and we are able to $\mathfrak{D}$ -privately learn this problem whenever we can construct a sufficiently small cover set with respect to $\mathcal{D}^{*}$ and reduce the problem to Example 20. This class of problems includes high-dimensional and infinite-dimensional problems such as density estimation, nonparametric regression, kernel methods and essentially any other problems that are strictly learnable (Vapnik, 1998), since they are characterized by one-sided uniform convergence (and the corresponding entropy condition).

3 Discussion on uniform convergence and private learnability

A key point in Shalev-Shwartz et al. (2010) is that the learnability (by any algorithm) in general learning setting is no longer characterized by variants of uniform convergence. However, the class of privately learnable problems is much smaller. Clearly, uniform convergence is not sufficient for a problem to be privately learnable (see Section 3.3), but is it necessary?

In binary classification with discrete domain (agnostic PAC Learning), since VC-dimension being finite characterizes the class of privately PAC learnable problems, the necessity of uniform convergence is clear. This could also be more explicitly seen from Beimel et al. (2013a) where the probabilistic representation dimension is a form of uniform convergence on its own.

In the general learning setting, the problem is still open. We were not able to prove that private learnability implies uniform convergence, but we could not construct a counterexample either. All our examples in this section implicitly or explicitly use uniform convergence, which seems to hint at a positive answer.

Practical concerns

We have stated all results so far in expectation. We can easily convert these to the high-confidence learning paradigm by applying Markov’s inequality, since convergence in expectation to the minimum risk implies convergence in probability to the minimum risk. While the $1/\delta$ dependence on the failure probability $\delta$ is not ideal, we can apply a “boosting” meta-algorithm (Schapire, 1990), similar to the one in Shalev-Shwartz et al. (2010, Section 7), to get a $\log(1/\delta)$ rate. The approach is similar to cross-validation. Given a pre-chosen positive integer $a$ , the original boosting algorithm randomly partitions the data into $(a+1)$ subsamples of size $n/(a+1)$ , and applies Algorithm 1 on the first $a$ partitions, obtaining $a$ candidate hypotheses. The method then returns the hypothesis with the smallest validation error, calculated using the remaining subsample. To ensure differential privacy, our method instead uses the exponential mechanism to sample the best candidate hypothesis, where the logarithm of the sampling probability is proportional to the negative validation error.
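The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `learner` is a hypothetical stand-in for a private learning algorithm such as Algorithm 1, `loss` is assumed to take values in $[0,1]$ , and the sensitivity parameter $2(a+1)/n$ is the value used in the proof in Section A.6.

```python
import math
import random


def private_selection(data, learner, loss, a, epsilon, rng):
    """Split the data into a+1 parts, fit a candidate on each of the first a
    parts, and pick one candidate with the exponential mechanism, using the
    negative validation risk on the last part as the utility."""
    n = len(data)
    size = n // (a + 1)
    if size == 0:
        raise ValueError("need at least a + 1 data points")
    shuffled = list(data)
    rng.shuffle(shuffled)  # random partition of the data
    parts = [shuffled[k * size:(k + 1) * size] for k in range(a + 1)]
    # `learner` is a placeholder; in the paper each candidate comes from a
    # differentially private algorithm run on its own partition.
    candidates = [learner(part) for part in parts[:a]]
    validation = parts[a]
    utilities = [-sum(loss(h, z) for z in validation) / len(validation)
                 for h in candidates]
    sensitivity = 2.0 * (a + 1) / n  # value taken from the appendix proof
    scale = epsilon / (2.0 * sensitivity)
    q_max = max(utilities)  # shift for numerical stability
    weights = [math.exp(scale * (q - q_max)) for q in utilities]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy usage (hypothetical): estimate a location parameter under a clipped
# squared loss, with the sample mean standing in for the private learner.
toy_data = [i / 60 for i in range(60)]
selected = private_selection(
    toy_data,
    learner=lambda part: sum(part) / len(part),
    loss=lambda h, z: min(1.0, (h - z) ** 2),
    a=2,
    epsilon=1.0,
    rng=random.Random(0),
)
```

The sample mean used in the toy example is not itself differentially private; it only makes the sketch runnable.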

If an algorithm $\mathcal{A}$ privately learns a problem with rate $\xi(n)$ and privacy parameter $\epsilon(n)$ , then the boosting algorithm $\mathcal{A}^{\prime}$ with $a=\log\frac{3}{\delta}$ is $\max\left\{\epsilon\left(\frac{n}{\log(3/\delta)+1}\right),\frac{\log(3/\delta)+1}{\sqrt{n}}\right\}$ -differentially private, and its output $h$ obeys

for an absolute constant $C$ with probability at least $1-\delta$ .

2 Efficient sampling algorithm for convex problems

Our proposed exponential sampling based algorithm is intended to establish a more explicit geometric condition under which AERM holds, so the algorithm may not be computationally tractable. Even ignoring the difficulty of constructing an $\epsilon$ -covering set with an exponential number of elements, sampling from the set alone is not a polynomial time algorithm. But we can solve a subset of the continuous version of our Algorithm 1 described in Theorem 19 in polynomial time to arbitrary accuracy (see also Bassily et al. (2014, Theorem 3.4)).

3 Exponential mechanism in infinite domain

As we mentioned earlier, the results in Section 4 based on the exponential mechanism implicitly assume certain regularity conditions that ensure the existence of a probability distribution.

Things get even trickier when $\mathcal{H}$ is an infinite dimensional space, such as a subset of a Hilbert space. While probability measures can still be defined, no density function can be defined on such spaces. Therefore, we cannot use the exponential mechanism to define a valid probability distribution.

The practical implication is that the exponential mechanism is really only applicable to cases when the hypothesis space $\mathcal{H}$ allows for definitions of densities in the usual sense, or when $\mathcal{H}$ can be approximated by such a space. For example, a separable Hilbert space can be studied by finite-dimensional projections. Also, we can approximate the RKHS induced by translation invariant kernels via random Fourier features (Rahimi & Recht, 2007).

Results for learnability under $(\epsilon,\delta)$ -differential privacy

Another way to weaken the definition of private learnability is through $(\epsilon,\delta)$ -approximate differential privacy.

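An algorithm $\mathcal{A}$ obeys $(\epsilon,\delta)$ -differential privacy if for any $Z,Z^{\prime}$ such that $d(Z,Z^{\prime})\leq 1$ , and for any measurable set $\mathcal{S}\subset\mathcal{H}$ , $\mathbb{P}(\mathcal{A}(Z)\in\mathcal{S})\leq e^{\epsilon}\,\mathbb{P}(\mathcal{A}(Z^{\prime})\in\mathcal{S})+\delta$ .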

We say a learning problem is $\Delta(n)$ -approximately privately learnable for some pre-specified family of rates $\Delta(n)$ if for some $\epsilon<\infty$ , $\delta(n)\in\Delta(n)$ , there exists a universally consistent algorithm that is $(\epsilon,\delta(n))$ -DP.

This is a completely different subject to study and the class of approximately privately learnable problems could be substantially larger than the pure privately learnable problems. Moreover, the picture may vary with respect to how small $\delta(n)$ is required to be. In this section, we present our preliminary investigation on this problem.

Specifically, we will consider two questions:

Does the existence of an $(\epsilon,\delta)$ -DP always AERM algorithm characterize the class of approximately private learnable problems?

Are all learnable problems approximately privately learnable for different choices of $\Delta(n)$ ?

The minimal requirement in the same flavor of Definition 3 would be to require $\Delta(n)=\{\delta(n)|\delta(n)\rightarrow 0\}$ . The learnability problem turns out to be trivial under this definition due to the following observation.

For any algorithm $\mathcal{A}$ that acts on $Z$ , $\mathcal{A}^{\prime}$ that runs $\mathcal{A}$ on a randomly chosen subset of $Z$ of size $\sqrt{n}$ is $(0,\frac{1}{\sqrt{n}})$ -DP.

Let $Z$ and $Z^{\prime}$ be adjacent datasets that differ only in data point $i$ . For any $i$ and any $S\in\sigma(\mathcal{H})$ ,

This verifies the $(0,1/\sqrt{n})$ -DP of algorithm $\mathcal{A}^{\prime}$ . ∎

The above lemma suggests that if $\delta(n)=o(1)$ is all we need for the approximately private learnability, then any consistent learning algorithm can be made approximately DP by simply subsampling. In other words, any learnable problem is also learnable under approximate differential privacy.
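The quantity driving the subsampling lemma is the probability that the single differing data point lands in the random subset; when it does not, the two output distributions coincide. The short check below is a sketch under the assumption that the subset is drawn uniformly without replacement; the function name is introduced only for this illustration. It enumerates all subsets for a small $n$ and confirms that this probability equals $m/n$ , which is $1/\sqrt{n}$ when $m=\sqrt{n}$ .

```python
from fractions import Fraction
from itertools import combinations
from math import isqrt


def inclusion_probability(n, m, index=0):
    """Exact probability that a fixed index belongs to a uniformly random
    size-m subset of {0, ..., n-1}, computed by brute-force enumeration."""
    subsets = list(combinations(range(n), m))
    hits = sum(1 for s in subsets if index in s)
    return Fraction(hits, len(subsets))


n = 16
m = isqrt(n)  # subsample size sqrt(n) = 4
p = inclusion_probability(n, m)  # equals m / n = 1 / sqrt(n)
```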

To get around this triviality, we need to specify a sufficiently fast rate of $\delta(n)$ going to $0$ . While it is common to require that $\delta(n)=o(1/\text{poly}(n))$ for cryptographically strong privacy protection, requiring $\delta(n)=o(1/n)$ is already enough to invalidate the above subsampling argument and makes the problem of learnability a non-trivial one. Here the notation “ $o(1/\text{poly}(n))$ ” means “decays faster than any polynomial of $n$ ”: a sequence $a(n)=o(1/\text{poly}(n))$ if and only if $a(n)=o(n^{-r})$ for any $r>0$ .

Again, the question is whether AERM characterizes approximately private learnability and whether there is a gap between the class of learnable and approximately privately learnable problems.

Here we show that the “folklore” Lemma 8 and subsampling lemma (Lemma 34) can be extended to work with $(\epsilon,\delta)$ -DP and then we provide a positive answer to the first question.

For any $Z,Z^{\prime}$ such that $d(Z,Z^{\prime})\leq 1$ and for any $z\in\mathcal{Z}$ , let the event $E=\{h|p(h)\geq p^{\prime}(h)\}$ ,

The last line applies the definition of $(\epsilon,\delta)$ -DP. ∎

If $\mathcal{A}$ is $(\epsilon,\delta)$ -DP, then $\mathcal{A}^{\prime}$ that acts on a random subsample of $Z$ of size $\gamma n$ obeys $(\epsilon^{\prime},\delta^{\prime})$ -DP with $\epsilon^{\prime}=\log(1+\gamma e^{\epsilon}(e^{\epsilon}-1))$ and $\delta^{\prime}=\gamma e^{\epsilon}\delta$ .

For any event $E\in\sigma(\mathcal{H})$ , let $i$ be the coordinate where $Z$ and $Z^{\prime}$ differ

where in the last line, we apply $(\epsilon,\delta)$ -DP of $\mathcal{A}$ .

Denote $\mathcal{I}_{1}=\{I|i\in I\}$ , $\mathcal{I}_{2}=\{I|i\notin I\}$ . We know $|\mathcal{I}_{1}|={n-1\choose\gamma n-1}$ , $|\mathcal{I}_{2}|={n-1\choose\gamma n}$ and $|\mathcal{I}_{1}|/|\mathcal{I}_{2}|=\gamma n/(n-\gamma n)$ . For every $I\in\mathcal{I}_{2}$ there are precisely $\gamma n$ elements $J\in\mathcal{I}_{1}$ such that $d(I,J)=1$ . Likewise, for every $J\in\mathcal{I}_{1}$ , there are $n-\gamma n$ elements $I\in\mathcal{I}_{2}$ such that $d(I,J)=1$ . It follows by symmetry that if we apply $(\epsilon,\delta)$ -DP to $1/\gamma n$ of each $I\in\mathcal{I}_{2}$ and change $I$ to their corresponding $J\in\mathcal{I}_{1}$ , then each $J\in\mathcal{I}_{1}$ will receive $(n-\gamma n)/\gamma n$ “contribution” in total from the sum over all $I\in\mathcal{I}_{2}$ .
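To get a feel for the parameters in the subsampling lemma above, the following sketch evaluates $\epsilon^{\prime}=\log(1+\gamma e^{\epsilon}(e^{\epsilon}-1))$ and $\delta^{\prime}=\gamma e^{\epsilon}\delta$ numerically. The function name and the input values are chosen for illustration only.

```python
import math


def subsampled_privacy(epsilon, delta, gamma):
    """Privacy parameters after running an (epsilon, delta)-DP algorithm on a
    random subsample containing a fraction gamma of the data, using the
    formulas stated in the lemma."""
    eps_prime = math.log(1.0 + gamma * math.exp(epsilon) * (math.exp(epsilon) - 1.0))
    delta_prime = gamma * math.exp(epsilon) * delta
    return eps_prime, delta_prime


# Illustrative values: n = 10_000 and gamma = 1 / sqrt(n) = 0.01.
eps_prime, delta_prime = subsampled_privacy(epsilon=1.0, delta=1e-6, gamma=0.01)
```

For fixed $\epsilon$ , both $\epsilon^{\prime}$ and $\delta^{\prime}$ shrink roughly linearly in $\gamma$ when $\gamma$ is small, which is why taking $\gamma=1/\sqrt{n}$ drives $\epsilon^{\prime}$ to $0$ .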

Using the above two lemmas, we are able to establish the same result which says that AERM characterizes the approximate private learnability for certain classes of $\Delta(n)$ .

A problem is $\Delta(n)$ -approximately privately learnable implies that there exists an always AERM algorithm that is $(\epsilon(n),n^{-1/2}e^{\epsilon}\delta(\sqrt{n}))$ -DP for some $\epsilon(n)\rightarrow 0$ and $\delta(\sqrt{n})\in\Delta(n)$ . The converse is also true if $n^{-1/2}e^{\epsilon}\delta(\sqrt{n})\in\Delta(n)$ .

If we have an always AERM algorithm with rate $\xi_{erm}(n)$ that is $(\epsilon(n),\delta(n))$ -DP for $\delta(n)\in\Delta(n)$ , then by Lemma 30, this algorithm is strongly uniform RO-stable with rate $e^{\epsilon(n)}-1+\delta(n)$ . By Theorem 35, the algorithm is universally consistent with rate $\xi_{erm}(n)+e^{\epsilon(n)}-1+\delta(n)$ . This establishes the “if” part.

To see the “only if” part, suppose a problem is $\Delta(n)$ -approximately privately learnable with $\epsilon$ and $\delta(n)\in\Delta(n)$ . Then by Lemma 31 with $\gamma=1/\sqrt{n}$ , we get an algorithm that obeys the privacy condition. It remains to prove always AERM, which requires exactly the same arguments as in the proof of Lemma 10. Details are omitted. ∎

Note that the results above suggest that in the two canonical settings $\Delta(n)=o(1/n)$ or $\Delta(n)=o(1/\text{poly}(n))$ , existence of a private AERM algorithm that satisfies the stronger constraint $\epsilon(n)=o(1)$ characterizes the learnability.

The answer to the next question, whether all learnable problems are also approximately privately learnable, depends on how fast $\delta(n)$ is required to decay. We know that when we only have $\Delta(n)=o(1)$ , all learnable problems are approximately privately learnable, and when we have $\Delta(n)=\{0\}$ , only a strict subset of these problems is privately learnable. The following result establishes that when $\delta(n)$ needs to go to $0$ at a sufficiently fast rate, there is a separation between learnability and approximately private learnability.

We now show that when we require a fast decaying $\delta(n)$ , the example in Section 3.3 due to Chaudhuri & Hsu (2011) is no longer approximately privately learnable, even under $(\epsilon,\delta)$ -DP. Let $Z,Z^{\prime}$ be two completely different data sets. By repeatedly applying the definition of $(\epsilon,\delta)$ -DP, for any set $\mathcal{S}\subset\mathcal{H}$

When we shift the inequality around, we get

Consider the same example in Section 3.3 where we hope to learn a threshold on $[0,1]$ . Assume there exists an algorithm $\mathcal{A}$ that is universally AERM and $(\epsilon(n),\delta(n))$ -DP for $\epsilon(n)<\infty$ and $\delta(n)\leq 0.4ne^{-\epsilon n}$ .

Everything up to (4) remains exactly the same. Now, applying the above implication of $(\epsilon,\delta)$ -DP, we can replace (4) for each $i=2,...,K$ , by

The bound can be further improved to $\exp(-\epsilon(n)n)/n$ if we directly work with universal consistency on various distributions rather than through always AERM on specific data points. Even that is likely to be suboptimal as there might be more challenging problems and less favorable packings to consider.

The point of this exposition, however, is to illustrate that $(\epsilon,\delta)$ -DP alone does not close the gap between learnability and private learnability. Additional relaxation on the specified rate of decay on $\delta$ does. We now know that the phase transition occurs when $\delta(n)$ is somewhere between $\Omega(\exp(-n^{2}\log n))$ and $O(1/n)$ ; but there is still a substantial gap between the upper and lower bounds.

Conclusion and future work

In this paper, we revisited the question “What can we learn privately?” and considered a broader class of statistical machine learning problems than those studied previously. Specifically, we characterized the learnability under privacy constraint by showing that any privately learnable problem can be learned by a private algorithm that asymptotically minimizes the empirical risk for any data, and that the problem is not privately learnable otherwise. This allows us to construct a conceptual procedure that privately learns any privately learnable problem. We also proposed a relaxed notion of private learnability called private $\mathfrak{D}$ -learnability, which requires the existence of an algorithm that is consistent for any distribution within a class of distributions $\mathfrak{D}$ . We characterized private $\mathfrak{D}$ -learnability too with a weaker notion of AERM. For problems that can be formulated as penalized empirical risk minimization, we provided a sampling algorithm with a set of meaningful sufficient conditions on the geometry of the hypothesis space and demonstrated that it covers a large class of problems. In addition, we extended the characterization to learnability under $(\epsilon,\delta)$ -differential privacy and provided a preliminary analysis which establishes the existence of a phase transition, at some non-trivial rate of decay on $\delta(n)$ , from all learnable problems being approximately privately learnable to some learnable problems not being approximately privately learnable.

Future work includes understanding the conditions under which privacy and AERM are contradictory (recall that we only have one example on learning thresholding functions due to Chaudhuri & Hsu 2011), characterizing the rate of convergence, searching for practical algorithms that generically learn all privately learnable problems, and better understanding the gap between learnability and approximate private learnability.

Acknowledgment

We thank the AE and the anonymous reviewers for their comments that led to significant improvement of this paper. The research was partially supported by NSF Award BCS-0941518 to the Department of Statistics at Carnegie Mellon University, and a grant by Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

Appendix A Proofs of technical results

In this appendix, we provide detailed proofs of the technical results in the main text.

Let $\mathcal{A}$ be the consistent $\epsilon$ -DP algorithm. Consider $\mathcal{A}^{\prime}$ that applies $\mathcal{A}$ to a random subsample of $\lfloor\sqrt{n}\rfloor$ data points. By Lemma 34 with $\gamma=\frac{\lfloor\sqrt{n}\rfloor}{n}\leq\frac{1}{\sqrt{n}}$ , we get the privacy claim. For the consistency claim, note that the given sample is an iid sample of size $\lfloor\sqrt{n}\rfloor$ from the original distribution. ∎

If Algorithm $\mathcal{A}$ is $\epsilon$ -DP for $Z\in\mathcal{Z}^{n}$ for any $n=1,2,3,...$ , then the algorithm $\mathcal{A}^{\prime}$ that outputs the result of $\mathcal{A}$ applied to a random subsample of $\gamma n$ data points preserves $2\gamma(e^{\epsilon}-e^{-\epsilon})$ -DP.

This is a corollary of Lemma 4.4 in Beimel et al. (2014). To be self-contained, we reproduce the proof here in our notation.

Recall that $\mathcal{A}^{\prime}$ is the algorithm that first randomly subsamples $\gamma n$ data points and then applies $\mathcal{A}$ . Let $Z$ and $Z^{\prime}$ be any neighboring databases and assume they differ on the $i$ th data point. Let $\mathcal{S}\subset[n]$ be the indices of the random subset of the entries that are selected, and $\mathcal{R}\subset[n]\backslash\{i\}$ be an index set of size $\gamma n-1$ . We apply the law of total expectation twice and argue that for any adjacent $Z$ , $Z^{\prime}$ , any event $E\subset\mathcal{H}$ ,

By the given condition that $\mathcal{A}$ is $\epsilon$ -DP, we can replace $\mathcal{R}\cup\{i\}$ with $\mathcal{R}\cup\{j\}$ for an arbitrary $j$ with bounded changes in the probability and the above likelihood ratio can be upper bounded by

By definition, the privacy loss of the algorithm $\mathcal{A}^{\prime}$ is therefore

Note that $\epsilon>0$ implies that $-1\leq e^{-\epsilon}-1<0$ and $0<e^{\epsilon}-1<\infty$ . The result follows by applying the property of the natural logarithm:

A.2 Characterization of private learnability

Lemma 8 says that an $\epsilon$ -differentially private algorithm is $(e^{\epsilon}-1)$ -stable (and also $2\epsilon$ -stable if $\epsilon<1$ ).

Construct $Z^{\prime}$ by replacing an arbitrary data point in $Z$ with $z^{\prime}$ and let the probability density/mass defined by $\mathcal{A}(Z)$ and $\mathcal{A}(Z^{\prime})$ be $p(h)$ and $p^{\prime}(h)$ respectively, then we can bound the stability as follows

For $\epsilon<1$ we have $\exp(\epsilon)-1<2\epsilon$ .
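The inequality above can be verified in two lines. The following short derivation is added for completeness and relies only on the convexity of the exponential function on $[0,1]$ .

```latex
\begin{align*}
e^{\epsilon} &\le (1-\epsilon)\,e^{0}+\epsilon\,e^{1} = 1+(e-1)\,\epsilon,
  && 0\le\epsilon\le 1 \quad\text{(convexity of } x\mapsto e^{x}\text{)},\\
e^{\epsilon}-1 &\le (e-1)\,\epsilon < 2\,\epsilon,
  && 0<\epsilon<1 \quad\text{(since } e-1\approx 1.718<2\text{)}.
\end{align*}
```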

Stability + AERM $\Rightarrow$ consistency

If any algorithm is $\xi_{1}(n)$ -stable and $\xi_{2}(n)$ -AERM then it is consistent with rate $\xi(n)=\xi_{1}(n)+\xi_{2}(n)$ .

We will show the following two steps, as in Shalev-Shwartz et al. (2010):

Uniform RO stability $\Rightarrow$ On average stability $\Leftrightarrow$ On average generalization

AERM + On average generalization $\Rightarrow$ consistency

The definition of these quantities is self-explanatory.

To show that “stability implies generalization”, we have

Privacy + AERM $\Rightarrow$ consistency

It follows by combining Lemma 8 and Theorem 35. ∎

Necessity

We construct an algorithm $\mathcal{A}^{\prime}$ by subsampling the data points using a random subset of $\sqrt{n}$ and then running $\mathcal{A}$ . The privacy claim follows from Lemma 34 directly.

To prove the “always AERM” claim, we adapt the proof of Lemma 24 in Shalev-Shwartz et al. (2010). For any fixed data set $Z\in\mathcal{Z}^{n}$ ,

It is obvious that $\mathcal{A}^{\prime}$ is consistent with rate $\xi(\sqrt{n})$ as it applies $\mathcal{A}$ on a random sample of size $\sqrt{n}$ . By Lemma 4, $\mathcal{A}^{\prime}$ is $2n^{-1/2}(e^{\epsilon}-e^{-\epsilon})$ -differentially private. By Corollary 9, the new algorithm $\mathcal{A}^{\prime}$ is universally consistent. ∎

A.3 Proofs for Section 3.3

If $\mathcal{A}(Z)$ is a continuous distribution, we can pick $h\in\mathcal{H}$ at any point where $\mathcal{A}(Z)$ has finite density and, whenever $z\in Z$ , set $\mathcal{A}^{\prime}(Z)$ to be $h$ with probability $1/n$ and the same as $\mathcal{A}(Z)$ with probability $1-1/n$ . This breaks privacy because, comparing two databases with and without $z$ , the probability ratio of outputting $h$ is $\infty$ .

The consistency of $\mathcal{A}^{\prime}$ follows easily as its risk is at most $1/n$ larger than that of $\mathcal{A}$ . ∎

A.4 Proofs for characterization of private $\mathfrak{D}$ -learnability

Let $\mathcal{A}^{\prime}$ be the algorithm that applies $\mathcal{A}$ to a random subsample of size $\lfloor\sqrt{n}\rfloor$ . If we can show that, for any $\mathcal{D}\in\mathfrak{D}$ ,

(a) the empirical risk of $\mathcal{A}^{\prime}$ converges to the optimal population risk $R^{*}$ in expectation;

(b) the empirical risk of the ERM learning rule also converges to $R^{*}$ in expectation,

then by triangle inequality, the empirical risk of $\mathcal{A}^{\prime}$ must also converge to the empirical risk of ERM, i.e., $\mathcal{A}^{\prime}$ is $\mathfrak{D}$ -universal AERM.

We will start with (a). For any distribution $\mathcal{D}\in\mathfrak{D}$ , we have

To show (b), we need to exploit the assumption that the problem is (non-privately) learnable. By Shalev-Shwartz et al. (2010, Theorem 7), the problem being learnable implies that there exists a universally consistent algorithm $\mathcal{B}$ (not restricted to $\mathfrak{D}$ ), that is universally AERM with rate $3\xi^{\prime}(n^{\frac{1}{4}})+\frac{8}{\sqrt{n}}$ and stable with rate $\frac{2}{\sqrt{n}}$ . Moreover, by Shalev-Shwartz et al. (2010, Theorem 8), $\mathcal{B}$ ’s stability and AERM implies that $\mathcal{B}$ is also generalizing, with rate $6\xi^{\prime}(n^{\frac{1}{4}})+\frac{18}{\sqrt{n}}$ . Here the term “generalizing” means that the empirical risk is close to the population risk. Therefore, we can establish (b) via the following chain of approximations

Combining (14) and (15), we obtain the AERM of $\mathcal{A}^{\prime}$ with rate $12\xi^{\prime}(n^{1/4})+\frac{37}{\sqrt{n}}+\xi(\sqrt{n})$ as required. The privacy of $\mathcal{A}^{\prime}$ follows from Lemma 34. ∎

A.5 Proof for Theorem 19

We first present the proof for Theorem 19. Recall that the roadmap of the proof is summarized in Figure 3.

For readability, we denote $\epsilon(n)$ by simply $\epsilon$ .

Using the shorthand $F^{*}:=\inf_{h\in{\mathcal{H}}}F(Z,h)$ and $q^{*}:=-F^{*}$ , we can state an analog of the utility theorem of the exponential mechanism in McSherry & Talwar (2007).

Assuming $\epsilon<\log n$ (otherwise the privacy protection is meaningless anyway), if assumptions A1 and A2 hold for distribution $\mathcal{D}$ , then

By Lemma 7 in McSherry & Talwar (2007) (translated to our case),

Applying (16), taking the expectation over the data distribution on both sides, and applying assumption A2, we get

Take $t=\frac{4\big{[}(\rho+2)\log n+\log(K)\big{]}}{\epsilon n}$ . By the assumption that $\epsilon<\log n$ , we get $\log(nt)>0$ . Substituting $t$ into the expression of $\gamma$ , we obtain

Now we can say something about the learning problem. In particular, the AERM follows directly from the utility result and stability follows from the definition of differential privacy.

Assume A1 and A2, and $\epsilon\leq\log n$ (so Lemma 36 holds), then

This is a simple consequence of boundedness and Lemma 36.

The above theorem shows that Algorithm 1 is an asymptotic ERM. By Theorem 8, the fact that this algorithm is $\epsilon$-differentially private implies that it is $2\epsilon$-stable. The proof now follows by applying Theorem 35, which says that stability and AERM of an algorithm certify its consistency. Noting that this holds for any distribution $\mathcal{D}$ completes our proof of learnability in Theorem 19.

A.6 Proofs of other technical results

That the algorithm $\mathcal{A}$ privately learns the problem with rate $\xi(n)$ implies that

Let $h\sim\mathcal{A}(Z)$ and $Z\sim\mathcal{D}^{n}$. By Markov’s inequality, with probability at least $1-1/e$,

If we split the data randomly into $a+1$ parts of size $n/(a+1)$ and run $\mathcal{A}$ on the first $a$ partitions, then we get $h_{j}\sim\mathcal{A}(Z_{j})$. Then with probability at least $1-(1/e)^{a}$, at least one of them has risk

This means that if the exponential mechanism picks the one with the best validation risk, it will be almost as good as the one with the best risk. Assume $h_{1}$ is the one that achieves the best validation risk.

Now it remains to bound the probability that the exponential mechanism picks an $h\in\{h_{1},...,h_{a}\}$ that is much worse than $h_{1}$.

Recall that the utility function is the negative validation risk which depends only on the last partition $I_{a+1}$ .

This is in fact a random function of the data because we are picking the validation set $I_{a+1}$ randomly from the data. Suppose we arbitrarily replace one data point $j$ in the dataset. The distribution of the output of the function $q(Z,h)$ is then a mixture of two cases: $j\notin I_{a+1}$ and $j\in I_{a+1}$. In the first case, $q(Z,h)=q(Z^{\prime},h)$ for all $h$, so the sensitivity is $0$. In the second case, by the boundedness assumption, the sensitivity is at most $2(a+1)/n$. For the exponential mechanism to guarantee $\epsilon$-differential privacy, it therefore suffices to take the sensitivity parameter to be $2(a+1)/n$.
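The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the common parameterization in which a candidate with utility $q$ is sampled with probability proportional to $\exp(\epsilon q/(2\Delta))$, which may differ by constants from the form used in the paper, and the function names `exponential_mechanism_select` and `validation_sensitivity` are hypothetical.

```python
import math
import random


def validation_sensitivity(n, a):
    """Sensitivity parameter 2(a+1)/n from the text: the validation set has
    about n/(a+1) points and the loss is assumed bounded."""
    return 2.0 * (a + 1) / n


def exponential_mechanism_select(validation_risks, epsilon, sensitivity, rng=None):
    """Sample an index among the candidates h_1, ..., h_a.

    The utility of candidate i is q_i = -validation_risks[i]. Index i is drawn
    with probability proportional to exp(epsilon * q_i / (2 * sensitivity)).
    This parameterization is an assumption, not taken from the paper.
    """
    rng = rng or random.Random()
    utilities = [-r for r in validation_risks]
    # Shift by the maximum utility for numerical stability; the shift cancels
    # after normalization, so the sampling distribution is unchanged.
    q_max = max(utilities)
    weights = [math.exp(epsilon * (q - q_max) / (2.0 * sensitivity)) for q in utilities]
    total = sum(weights)
    probs = [w / total for w in weights]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


if __name__ == "__main__":
    # Hypothetical validation risks of a = 3 candidate hypotheses, n = 1000.
    risks = [0.30, 0.10, 0.25]
    delta_q = validation_sensitivity(n=1000, a=3)
    idx, probs = exponential_mechanism_select(risks, epsilon=1.0, sensitivity=delta_q)
    print("selected candidate:", idx)
    print("selection probabilities:", probs)
```

Candidates with lower validation risk receive exponentially more mass, and the gap widens as $\epsilon n/(a+1)$ grows, which is the qualitative behavior the utility argument relies on.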

By the utility theorem of the exponential mechanism,

Now by appropriately choosing $\eta=\log(3/\delta)/\log n$ , $a=\log(3/\delta)$ , $\delta_{1}=\delta/3$ , we get

Combining the terms and taking $\epsilon=\frac{\log(3/\delta)+1}{\sqrt{n}}$, we get the bound on the excess risk in the theorem.

To get the privacy claim, note that we are applying $\mathcal{A}$ on disjoint partitions of the data, so the privacy parameter does not aggregate. Taking the worst case over all partitions, we get the overall privacy loss $\max\left\{\epsilon\left(\frac{n}{\log(3/\delta)+1}\right),\frac{\log(3/\delta)+1}{\sqrt{n}}\right\}$ as stated in the theorem. ∎
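The parameter bookkeeping in this proof can be checked numerically. The following is a minimal sketch, not the authors' procedure: `boosting_parameters` and its argument `eps_of` (the privacy level of $\mathcal{A}$ as a function of sample size) are hypothetical names, and rounding $a$ up to an integer is an assumption made here so that the data can actually be split.

```python
import math


def boosting_parameters(n, delta, eps_of):
    """Compute the quantities used in the confidence-boosting argument.

    n      : total number of data points
    delta  : target failure probability
    eps_of : hypothetical callable, privacy parameter of A on m data points
    """
    # a = log(3/delta) runs of A, rounded up to an integer (assumption).
    a = math.ceil(math.log(3.0 / delta))
    # a + 1 disjoint partitions: a for running A, one for validation.
    part_size = n // (a + 1)
    # Probability that all a independent runs miss the Markov bound.
    failure_all_runs = math.exp(-a)
    # Privacy parameter of the exponential mechanism on the validation part.
    eps_select = (math.log(3.0 / delta) + 1.0) / math.sqrt(n)
    # Disjoint partitions: the privacy loss is the maximum, not the sum.
    eps_total = max(eps_of(part_size), eps_select)
    return {
        "a": a,
        "part_size": part_size,
        "failure_all_runs": failure_all_runs,
        "eps_select": eps_select,
        "eps_total": eps_total,
    }


if __name__ == "__main__":
    # Hypothetical privacy schedule eps(m) = 1/sqrt(m) for the base learner A.
    params = boosting_parameters(10000, 0.05, lambda m: 1.0 / math.sqrt(m))
    for name, value in params.items():
        print(name, value)
```

With $a\geq\log(3/\delta)$, the term $(1/e)^{a}$ is at most $\delta/3$, which is one of the three failure events that together make up $\delta$.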

The Lipschitz example.

Appendix B Alternative proof of Corollary 9 via Dwork et al. (2015b, Theorem 7)

In this Appendix, we describe how the results in Dwork et al. (2015b) can be used to obtain the forward direction of our characterization without going through a stability argument. We first restate the result here in our notation:

Let $\mathcal{B}$ be an $\epsilon$-DP algorithm such that given a dataset $Z$, $\mathcal{B}$ outputs a function from $\mathcal{Z}$ to $ $. For any distribution $\mathcal{D}$ over $\mathcal{Z}$ and random variable $Z\sim\mathcal{D}^{n}$, we let $\phi\sim\mathcal{B}(Z)$. Then for any $\beta>0$, $\tau>0$ and $n\geq 12\log(4/\beta)/\tau^{2}$, setting $\epsilon<\tau/2$ ensures

This lemma was originally stated to prove the claim that privately generated mechanisms for answering statistical queries always generalize.
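To get a feel for when the lemma's conditions are active, the sample-size requirement $n\geq 12\log(4/\beta)/\tau^{2}$ and the privacy requirement $\epsilon<\tau/2$ can be evaluated directly. The helper below is a hypothetical illustration that only encodes the two conditions as restated above.

```python
import math


def lemma38_conditions(beta, tau):
    """Return (n_min, eps_bound) for the restated lemma.

    n_min     : smallest integer n with n >= 12 * log(4 / beta) / tau**2
    eps_bound : the lemma requires epsilon < eps_bound = tau / 2
    """
    n_min = math.ceil(12.0 * math.log(4.0 / beta) / tau ** 2)
    eps_bound = tau / 2.0
    return n_min, eps_bound


if __name__ == "__main__":
    n_min, eps_bound = lemma38_conditions(beta=0.05, tau=0.1)
    print("minimum n:", n_min)
    print("epsilon must be below:", eps_bound)
```

Because $n$ must grow like $1/\tau^{2}$, the accuracy $\tau$ that the lemma can certify for a given $n$ is never better than order $\sqrt{\log(4/\beta)/n}$, regardless of how small $\epsilon$ is.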

However, “generalization” alone still does not imply “consistency”, as we also need

The above proof of “consistency” via Lemma 38 and “AERM”, however, leads to a looser bound than our result (Corollary 9) when the additional assumption on $n$ and $\tau$ (equivalently $\epsilon$) is active, i.e., when $\frac{\epsilon(n)}{\log(1/\epsilon(n))}<O\left(\frac{1}{\sqrt{n}}\right)$. In this case it only implies a $\xi(n)+\frac{\log n}{\sqrt{n}}$ bound, because $\epsilon$-DP implies $\epsilon^{\prime}$-DP for any $\epsilon^{\prime}>\epsilon$. Our proof of Corollary 9 is considerably simpler and more general in that it does not require any assumption on the number of data points $n$.