Generalization Bounds for Uniformly Stable Algorithms

Vitaly Feldman, Jan Vondrak

Introduction

We consider the basic problem of estimating the generalization error of learning algorithms. Over the last couple of decades, a remarkably rich and deep theory has been developed for bounding the generalization error via notions of complexity of the class of models (or predictors) output by the learning algorithm. At the same time, for a variety of learning algorithms this theory does not provide satisfactory bounds (even as compared with other theoretical analyses). Most notable among these are continuous optimization algorithms that play the central role in modern machine learning. For example, the standard generalization error bounds for stochastic gradient descent (SGD) on convex Lipschitz functions cannot be obtained by proving uniform convergence for all empirical risk minimizers (ERM) [SSSSS10, Fel16]. Specifically, there exist empirical risk minimizing algorithms whose generalization error is $\sqrt{d}$ times larger than the generalization error of SGD, where $d$ is the dimension of the problem (without the Lipschitzness assumption the gap is infinite even for $d=2$ ) [Fel16]. This disparity stems from the fact that uniform convergence bounds largely ignore the way in which the model output by the algorithm depends on the data. We note that in the restricted setting of generalized linear models one can obtain tight generalization bounds via uniform convergence [KST08].

Another classical and popular approach to proving generalization bounds is to analyze the stability of the learning algorithm to changes in the dataset. This approach has been used to obtain relatively strong generalization bounds for several convex optimization algorithms. For example, the seminal works of [BE02] and [SSSSS10] demonstrate that for strongly convex losses the ERM solution is stable. The use of stability is also implicit in standard analyses of online convex optimization [SSSSS10] and online-to-batch conversion [CCG04]. More recently, [HRS16] showed that for convex smooth losses the solution obtained via (stochastic) gradient descent is stable. They also conjectured that stability can be used to understand the generalization properties of algorithms used for training deep neural networks.

While a variety of notions of stability have been proposed and analyzed, most only lead to bounds on the expectation or the second moment of the estimation error over the random choice of the dataset (where estimation error refers to the difference between the true generalization error and the empirical error). In contrast, generalization bounds based on uniform convergence show that the estimation error is small with high probability (more formally, the distribution of the error has exponentially decaying tails). This discrepancy was first addressed by [BE02] who defined the notion of uniform stability.

Naturally, for most algorithms the stability parameter needs be balanced against the guarantees on the empirical error. For example, ERM solution to convex learning problems can be made uniformly stable by adding a strongly convex term to the objective [SSSSS10]. This change in the objective introduces an error. In the other example, the stability parameter of gradient descent on smooth objectives is determined by the sum of the rates used for all the gradient steps [HRS16]. Limiting the sum limits the empirical error that can be achieved. In both of those examples the optimal expected error can only be achieved when $\gamma=\theta(1/\sqrt{n})$ (which is also the excess loss of the solutions). Unfortunately, in this setting, eq. (3) gives a vacuous bound and only “low-probability” generalization bounds are known for the first example (since it is ERM and eq. (4) applies).

We give two new upper bounds on the estimation error of uniformly stable learning algorithms. Specifically, our bound on the second moment of the estimation error is $O(\gamma^{2}+1/n)$ matching (up to a constant) the simple lower bound of $\gamma+\frac{1}{\sqrt{n}}$ on the first moment. Our high probability bound improves the rate from $\sqrt{n}(\gamma+1/n)$ to $\sqrt{\gamma+1/n}$ . This rate is non-vacuous for any non-trivial stability parameter $\gamma=o(1)$ and matches the rate that was previously known only for the second moment (eq. (2)).

Uniform stability for data-dependent functions is defined analogously (Def. 2.1).

Let $M:Z^{n}\times Z\to$ be a data-dependent function with uniform stability $\gamma$ . Then for any probability distribution ${\mathcal{P}}$ over $Z$ and any $\delta\in(0,1)$ :

The results in Theorem 1.2 are stated only for deterministic functions (or algorithms). They can be extended to randomized algorithms in several standard ways [EEP05, SSSSS10]. If $M$ is uniformly $\gamma$ -stable with high probability over the choice of its random bits then one can obtain a statement which holds with high probability over the choice of both ${\bar{s}}$ and the random bits (e.g. [Lon17]). Alternatively, one can always consider the function $M^{\prime}({\bar{s}},z)=\mathop{\mathbf{E}}_{M}[M({\bar{s}},z)]$ . If $M^{\prime}({\bar{s}},z)$ is uniformly $\gamma$ -stable then Thm. 1.2 can be applied to it. The resulting statement will be only about the expected value of the estimation error with expectation taken over the randomness of the algorithm. Further, if $M$ is used with independent randomness in each evaluation of $M({\bar{s}},s_{i})$ then the empirical mean ${\mathcal{E}}_{\bar{s}}[M({\bar{s}})]$ will be strongly concentrated around ${\mathcal{E}}_{\bar{s}}[M^{\prime}({\bar{s}})]$ (whenever the variance of each evaluation is not too large). We note that randomized algorithms also allow to extend the notion of uniform stability to binary classification algorithms by considering the expectation of the $0/1$ loss (e.g. Sec. 5.2).

A natural and, we believe, important question left open by our work is whether the high probability result in eq. (6) is tight. Subsequent work by the authors [FV19] improves this bound to a nearly optimal bound of

for some constant $c>0$ . Note that the dependence on $\delta$ in this bound in slightly worse and it does not subsume eq. (5) due to the additional logarithmic factors. Proving tight (up to a constant) bounds remains an open problem.

The high-probability generalization result in [BE02] (eq. (3)) is based on a simple observation that as a function of ${\bar{s}}$ , $\Delta_{{\bar{s}}}(M)$ has the bounded differences property. Replacing any element of ${\bar{s}}$ can change $\Delta_{{\bar{s}}}(M)$ by at most $2\gamma+1/n$ (where $\gamma$ comes from changing the function $M({\bar{s}},\cdot)$ to $M({\bar{s}}^{\prime},\cdot)$ and $1/n$ comes the change in one of the points on which this function is evaluated). Applying McDiarmid’s concentration inequality immediately implies concentration with rate $\sqrt{n}(2\gamma+1/n)$ around the expectation. The expectation, in turn, is small by eq. (1). In contrast, our approach uses stability itself as a tool for proving concentration inequalities. It is based on ideas developed in [BNSSSU16] to prove generalization bounds for differentially private algorithms in the context of adaptive data analysis [DFHPRR14]. It was recently shown that this proof approach can be used to re-derive and extend several standard concentration inequalities [SU17, NS17].

At a high level, the first step of the argument reduces the task of proving a bound on the tail of a non-negative real-valued random variable to bounding the expectation of the maximum of multiple independent samples of that random variable. We then show that from multiple executions of $M$ on independently chosen datasets it is possible to select the execution of $M$ with approximately the largest estimation error in a stable way. That is, uniform stability of $M$ allows us to ensure that the selection procedure is itself uniformly stable. The selection procedure is based on the exponential mechanism [MT07] and satisfies differential privacy [DMNS06](Def. 2.3). The stability of this procedure allows us to bound the expectation of the estimation error of the execution of $M$ with approximately the largest estimation error (among the multiple executions). This gives us the desired bound on the expectation of the maximum of multiple independent samples of the estimation error random variable. We remark that the multiple executions and an algorithm for selecting among them exist purely for the purposes of the proof technique and do not require any modifications to the algorithm itself.

Our approach to proving the bound on the second moment of the estimation error is based on two ideas. First we decouple the point on which each $M({\bar{s}})$ is estimated from ${\bar{s}}$ by observing that for every dataset ${\bar{s}}$ the empirical mean is within $2\gamma$ of the “leave-one-out” estimate of the true mean. Specifically, our leave-one-out estimator is defined as $\mathop{\mathbf{E}}_{z\sim{\mathcal{P}}}\left[\frac{1}{n}\sum_{i=1}^{n}M({\bar{s}}^{i\leftarrow z},s_{i})\right]$ , where ${\bar{s}}^{i\leftarrow z}$ denotes replacing the element in ${\bar{s}}$ at index $i$ with $z$ . We then bound the second moment of the estimation error of the leave-one-out estimate by bounding the effect of dependence between the random variables by $O(\gamma^{2}+1/n)$ .

Applications

For concreteness, we consider the well-studied setting in which ${\mathcal{F}}$ contains $1$ -Lipschitz convex functions with range in $ $and$ {\mathcal{K}} $is included in the unit ball. In this case ERM with a strongly convex regularizer$ \frac{\lambda}{2}\|w\|^{2} $has uniform stability of$ 1/(\lambda n) $[BE02, SSSSS10]. From here, applying Markov’s inequality to eq. (4), [SSSSS10] obtain a “low-probability” generalization bound for the solution. Their bound on the true loss is within$ O(1/\sqrt{\delta n}) $from the optimum with probability at least$ 1-\delta $. Applying eq. (5) with Chebyshev’s inequality improves the dependence on$ \delta $quadratically, that is to$ O(1/(\delta^{1/4}\sqrt{n})) $. Further, using eq. (5) we obtain that for an appropriate choice of$ \lambda $, the excess loss is at most$ O(\sqrt{\log(1/\delta)}/n^{1/3})$.

Another algorithm that was shown to be uniformly stable is gradient descentThe analysis in [HRS16] focuses on the stochastic gradient descent and derives uniform stability for the expectation of the loss (over the randomness of the algorithm). However their analysis applies to gradient steps on smooth functions more generally. on sufficiently smooth convex functions [HRS16]. The generalization bounds we obtain for this algorithm are similar to those we get for the strongly convex ERM. We note that for the stability-based analysis in this case even “low-probability” generalization bounds were not known for the optimal rate of $1/\sqrt{n}$ .

Additional details of these applications can be found in Section 5.

2 Additional related work

The use of stability for understanding of generalization properties of learning algorithms dates back to the pioneering work of [RW78]. They showed that expected sensitivity of a classification algorithm to changes of individual examples can be used to obtain a bound on the variance of the leave-one-out estimator for the $k$ -NN algorithm. Early work on stability focused on extensions of these results to other “local” algorithms and estimators and focused primarily on variance (a notable exception is [DW79] where high probability bounds on the generalization error of $k$ -NN are proved). See [DGL96] for an overview. In a somewhat similar spirit, stability is also used for analysis of the variance of the $k$ -fold cross-validation estimator [BKL99, KKV11, KLVV13].

A long line of work focuses on the relationship between various notions of stability and learnability in supervised setting (see [PRMN04, SSSSS10] for an overview). This work employs relatively weak notions of average stability and derives a variety of asymptotic equivalence results. The results in [BE02] on uniform stability and their applications to generalization properties of strongly convex ERM algorithms have been extended and generalized in several directions (e.g. [Zha03, WP09, LLNT17]). [Mau17] considers generalization bounds for a special case of linear regression with a strongly convex regularizer and a sufficiently smooth loss function. Their bounds are data-dependent and are potentially stronger for large values of the regularization parameter (and hence stability). However the bound is vacuous when the stability parameter is larger than $n^{-1/4}$ and hence is not directly comparable to ours. Recent work of [AS18] and [CG16] gives high-probability generalization bounds similar to those in [BE02] but using a bound on a high-order moment of stability instead of the uniform stability.[KL18] give data-dependent generalization bounds for SGD on smooth convex and non-convex losses based on stability. They use on-average stability that does not imply generalization bounds with high probability. We also remark that all these works are based on techniques different from ours.

Uniform stability plays an important role in privacy-preserving learning since a differentially private learning algorithm can usually be obtained one by adding noise to the output of a uniformly stable one (e.g. [CMS11, WLKCJN17, DF18]).

Preliminaries

This definition is equivalent to having $M({\bar{s}},z)$ having sensitivity $\gamma$ or $\gamma$ -bounded differences for all $z\in Z$ .

We will also rely on several elementary properties of differential privacy [DMNS06]. In this context differential privacy is simply a form of uniform stability for randomized algorithms.

An algorithm $A:Z^{n}\to Y$ is $\epsilon$ -differentially private if, for all datasets ${\bar{s}},{\bar{s}}^{\prime}\in Z^{n}$ that differ on a single element,

Generalization with Exponential Tails

Our approach to proving the high-probability generalization bounds is based on the technique introduced by [NS15] (see [BNSSSU16]) to show that differentially private algorithm have strong generalization properties. It has recently been pointed out by [SU17] that this approach can be used to re-derive the standard Bernstein, Hoeffding, and Chernoff concentration inequalities. [NS17] used the same approach to generalize McDiarmid’s inequality to functions with unbounded (or high) sensitivity.

We prove a bound on the tail of a random variable by bounding the expectation of the maximum of multiple independent samples of the random variable. Specifically, the following simple lemma (see [SU17] for proof):

Let ${\mathcal{Q}}$ be a probability distribution over the reals. Then

The second step relies on the relationship between the maximum of a set of values and the value chosen by the soft-argmax, which we refer to as the stable-max. Specifically, we define

The final (and new) ingredient of our proof is a bound on the expected estimation error of any uniformly stable algorithm on a sub-dataset chosen in a differentially private way.

This gives the left hand side of the stated inequality. The right hand side is obtained analogously. ∎

We are now ready to put the ingredients together to prove the claimed result:

To bound this expression we choose $\epsilon=\sqrt{\left(2\gamma+\frac{1}{n}\right)\cdot\ln(m+1)}=\sqrt{\left(2\gamma+\frac{1}{n}\right)\cdot\ln(e\ln(2)/\delta)}$ . Our bound is at least $2\epsilon$ and hence holds trivially if $\epsilon\geq 1/2$ . Otherwise $(e^{\epsilon}-1)\leq 2\epsilon$ and we obtain the following bound on the expectation of the maximum.

where we used that $\gamma\leq\sqrt{\gamma}$ . Finally, plugging this bound into Lemma 3.1 we obtain that

It is easy to see from the proof that it can be stated in terms of two types of stability of $M$ : the uniform stability (denoted by $\gamma$ ) and the estimation error stability, that is the sensitivity of ${\mathcal{E}}_{\mathcal{P}}[M(S)]-{\mathcal{E}}_{S}[M(S)]$ . If we denote the latter by $\beta$ then our results would give a bound of $O(\sqrt{\beta\ln(1/\delta)}+\gamma)$ . Lemma 3.3 implies that $\beta\leq 2\gamma+1/n$ but tighter bounds might hold for specific algorithms.

Second Moment of the Estimation Error

In this section we prove eq. (5) of Theorem 1.2. It will be more convenient to directly work with the unbiased version of $M$ . Specifically, we define $L({\bar{s}},z)\doteq M({\bar{s}},z)-{\mathcal{E}}_{\mathcal{P}}[M({\bar{s}})]$ . Clearly, $L$ is unbiased with respect to ${\mathcal{P}}$ in the sense that for every ${\bar{s}}\in Z^{n}$ , ${\mathcal{E}}_{\mathcal{P}}[L({\bar{s}})]=0$ . Note that if the range of $M$ is $ $then the range of$ L $is$ $. Further,$ L $has uniform stability of at most$ 2\gamma $since for two datasets$ {\bar{s}} $and$ {\bar{s}}^{\prime}$ that differ in a single element,

Therefore eq. (5) of Theorem 1.2 will follow immediately from the following lemma (by using it with stability $2\gamma$ ).

Let $L\colon Z^{n}\times Z\to$ be a data-dependent function with uniform stability $\gamma$ and ${\mathcal{P}}$ be an arbitrary distribution over $Z$ . If $L$ is unbiased with respect to ${\mathcal{P}}$ then:

Our proof starts by first establishing this result for the leave-one-out estimate.

For a data-dependent function $L\colon Z^{n}\times Z\to$ , a dataset ${\bar{s}}\in Z^{n}$ and a distribution ${\mathcal{P}}$ , define

If $L$ has uniform stability $\gamma$ and is unbiased with respect to ${\mathcal{P}}$ then:

where we used convexity to obtain the first line and the bound on the range of $L$ to obtain the last inequality. For a fixed $i\neq j$ and a fixed setting of all the elements in ${\bar{s}}$ with other indices (which we denote by ${\bar{s}}^{\setminus\{i,j\}}$ ) we now analyze the cross term

(We remark that $g$ implicitly depends on $i,j$ and ${\bar{s}}^{\setminus\{i,j\}}$ ). Uniform stability of $L$ implies that

This means that for all $z_{i},z_{j},z\in Z$ ,

Note that $L$ is unbiased and $g$ does not depend on $s_{i}$ or $s_{j}$ . Therefore, for every fixed setting of $s_{i}$ and $z$ ,

implying that $v_{i,j}\leq\gamma^{2}$ . Substituting this into eq.(9) we obtain the claim. ∎

We can now obtain the proof of Lemma 4.1 by observing that for every ${\bar{s}}$ , the empirical mean ${\mathcal{E}}_{\bar{s}}[L({\bar{s}})]$ is within $\gamma$ of our leave-one-out estimator ${\mathcal{E}}_{\bar{s}}^{\leftarrow{\mathcal{P}}}\left[L\right]$ .

Observe that the uniform stability of $L$ implies that for every ${\bar{s}}$ ,

where we used the Cauchy-Schwarz to obtain the second line and Lemma 4.2 together with eq. (11) to obtain the third line. ∎

Applications

We now apply our bounds on the estimation error to several known uniformly stable algorithms. Many additional applications can be derived in a similar manner. Stronger high-probability bounds for these applications are implied by improved bounds in our subsequent work [FV19].

We consider learning problems that can be formulated as stochastic convex optimization. Specifically, these are problems in which the goal is to minimize the expected loss:

Many learning problems can be expressed in or relaxed to this general form. As a result many optimization algorithms are known and the optimal error rates are understood for a variety of families of convex functions. However most of these results are obtained via algorithm-specific techniques such as online-to-batch conversion [CCG04] and stability-based arguments rather than uniform convergence. As it turns out, this is unavoidable. This was first pointed out in the seminal work of [SSSSS10] who showed that there is exists a gap between the bounds that can be obtained via uniform convergence (or ERM algorithms) and bounds achievable via alternative approaches.

For concreteness, let ${\mathcal{F}}$ be the family of all convex $1$ -Lipschitz losses over the unit Euclidean ball in $d$ dimension (denoted by ${\mathcal{B}}_{2}^{d}(1)$ ). It is well-known that in this case the stochastic convex optimization problem can be solved with error $1/\sqrt{n}$ via projected SGD. At the same time it was shown in [SSSSS10] that there exists an algorithm that minimizes the empirical error while having the worst case error of $\Omega\left(\frac{\log d}{n}\right)$ . This has been subsequently strengthened to $\Omega\left(\frac{d}{n}\right)$ by [Fel16] who also showed a lower bound of $\Omega\left(\sqrt{\frac{d}{n}}\right)$ for obtaining uniform convergence in this setting. Further, with Lipschitzness assumption replaced by the assumption that functions have range in $ $the gap becomes infinite even for$ d=2$ [Fel16].

We now revisit the stability results known for this basic setting [BE02, SSSSS10] (for simplicity and without loss of generality we will scale the domain and functions to 1).

We note that the bound on estimation error is obtained by applying Markov’s inequality to eq. (4). Theorem 5.1 requires strong convexity. As pointed out in [SSSSS10], it is possible to add a strongly convex regularizing term $\frac{\lambda}{2}\|w\|^{2}$ to the objective function that has sufficiently small effect on the loss function while ensuring stability (and generalization). Specifically, by setting $\lambda=\frac{4}{\sqrt{\delta n}}$ the objective function will change by at most $\lambda$ since $w$ is assumed to be in a ball of radius $1$ . Plugging this value of $\lambda$ into Thm. 5.1 and accounting for the additional error they get:

We now spell out the results for these settings implied by our generalization bounds.

In the setting of Theorem 5.1, for some fixed constants $c_{1}$ and $c_{2}$ :

The first part of this corollary follows directly from applying Chebyshev’s inequality to eq. (5) in Theorem 1.2. To apply our results in the setting of Corollary 5.2 we will use a different choice of $\lambda$ to minimize the error. Specifically, we will choose $\lambda=c/\sqrt{\sqrt{\delta}n}$ for some constant $c$ when using the second moment and $\lambda=c/n^{2/3}$ when using the high probability result.

In the setting of Corollary 5.2 with appropriate choices of $\lambda$ and some fixed constants $c_{1}$ and $c_{2}$ :

Gradient descent on smooth functions

We now recall the results of [HRS16] for convex and smooth functions. These results derive their guarantees from the fact that a gradient step on a sufficiently smooth loss function is non-expansive. That is, for any pair of points $w$ and $w^{\prime}$ , any $\beta$ -smooth convex function $f$ , and $0\leq\eta\leq 2/\beta$ ,

For concreteness, we apply this technique to projected gradient descent on the empirical objective. Unlike for single-pass algorithms, we are not aware of any other approaches to proving generalization guarantees for this algorithm. For an integer $T$ , and dataset ${\bar{s}}$ , let PGD ${}_{T}({\bar{s}})$ denote the output of the algorithm that starting from $w_{0}$ being the origin, performs the following iterative updates for every $t\in[T]$ :

and for every distribution ${\mathcal{P}}$ over $Z$ :

To minimize the expected true loss the algorithm needs to be used with $T=n/\sqrt{2}$ , which implies that the stability parameter is $\Omega(1/\sqrt{n})$ . We remark that in this case even “low-probability” generalization results cannot be obtained directly from the bound on the expectation of the true loss.

Applying eq. (5) with Chebyshev’s inequality to the results of Theorem 5.5 gives that for some constant $c_{1}$ and every $\delta>0$ :

At the same time eq. (6) gives (for some constant $c_{2}$ ):

By optimizing the choice of $T$ we can get essentially the same rates as we have obtained for the ERM in Corollary 5.4 (although in this case we need a smoothness assumption).

In the setting of Theorem 5.5 with appropriate choices of $T$ , for every distribution ${\mathcal{P}}$ over $Z$ , $\delta>0$ , some fixed constants $c_{1}$ and $c_{2}$ :

2 Privacy-Preserving Prediction

Our results can also be used to improve the bounds on generalization error of learning algorithms with differentially private prediction. These are algorithms introduced to model privacy-preserving learning in the settings where users only have black-box access to the model via a prediction interface [DF18]. Formally,

Let $K$ be an algorithm that given a dataset ${\bar{s}}\in(X\times Y)^{n}$ and a point $x\in X$ produces a value in $Y$ . Then $K$ is $\epsilon$ -differentially private prediction algorithm if for every $x\in X$ , the output $K({\bar{s}},x)$ is $\epsilon$ -differentially private with respect to ${\bar{s}}$ .

These bounds are stronger than those obtained in [DF18] in several parameter regimes (but are more generally incomparable since bounds in [DF18] are multiplicative).

Introduction

Applications

2 Additional related work

Preliminaries

Generalization with Exponential Tails

Second Moment of the Estimation Error

Applications

Gradient descent on smooth functions

2 Privacy-Preserving Prediction

References