A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, Nathan Srebro

Introduction

Learning with deep neural networks has enjoyed great success across a wide variety of tasks. Even though learning neural networks is a hard problem, even for one hidden layer (Blum & Rivest, 1993), simple optimization methods such as stochastic gradient descent (SGD) have been able to minimize the training error. More surprisingly, solutions found this way also have small test error, even though the networks used have more parameters than the number of training samples, and have the capacity to easily fit random labels (Zhang et al., 2017).

Harvey et al. (2017) provided a generalization error bound by showing that the VC dimension of d-layer networks is depth times the number parameters improving over earlier bounds by Bartlett et al. (1999). Such VC bounds which depend on the number of parameters of the networks, cannot explain the generalization behavior in the over parametrized settings, where the number of samples is much smaller than the number of parameters.

For linear classifiers we know that the generalization behavior depends on the norm and margin of the classifier and not on the actual number of parameters. Hence, a generalization bound for neural networks that only depends on the norms of its layers, and not the actual number of parameters, can explain the good generalization behavior of the over parametrized networks.

Keskar et al. (2016) suggested a sharpness based measure to predict the difference in generalization behavior of networks trained with different batch size SGD. However sharpness is not a scale invariant measure and cannot predict the generalization behavior (Neyshabur et al., 2017). Instead sharpness when combined with the norms of the network can predict the generalization behavior according to the PAC-Bayes framework (McAllester, 1999). Dziugaite & Roy (2017) numerically evaluated a generalization error bound from the PAC-Bayes framework showing that it can predict the difference in generalization behavior of networks trained on true vs random labels. This generalization result is more of computational in nature and gives much tighter (non-vacuous) bounds compared with the ones in either (Bartlett et al., 2017a) or the ones in this paper.

In this paper we present and prove a margin based generalization bound for feedforward neural networks with ReLU activations, that depends on the product of the spectral norm of the weights in each layer, as well as the Frobenius norm of the weights.

More importantly, our proof technique is entirely different, and arguably simpler, than that of Bartlett et al. (2017a). We derive our bound using PAC-Bayes analysis, and more specifically a generic PAC-Bayes margin bound (Lemma 1). The main ingredient is a perturbation bound (Lemma 2), bounding the changes in the output of a network when the weights are perturbed, thereby its sharpness, in terms of the product of the spectral norm of the layers. This is an entirely different analysis approach from the covering number analysis of Bartlett et al. (2017a). We hope our analysis can give more direct intuition into the different ingredients in the bound and will allow modifying the analysis, e.g. by using different prior and perturbation distributions in the PAC-Bayes bound, to obtain tighter bounds, perhaps with dependence on different layer-wise norms.

For any distribution D\mathcal{D} and margin γ>0\gamma>0, we define the expected margin loss as follows:

Let L^γ(fw)\widehat{L}_{\gamma}(f_{\mathbf{w}}) be the empirical estimate of the above expected margin loss. Since setting γ=0\gamma=0 corresponds to the classification loss, we will use L0(fw)L_{0}(f_{\mathbf{w}}) and L^0(fw)\widehat{L}_{0}(f_{\mathbf{w}}) to refer to the expected risk and the training error. The loss LγL_{\gamma} defined this way is bounded between 0 and 1.

2 PAC-Bayesian framework

The PAC-Bayesian framework (McAllester, 1998; 1999) provides generalization guarantees for randomized predictors, drawn form a learned distribution QQ (as opposed to a learned single predictor) that depends on the training data. In particular, let fwf_{\mathbf{w}} be any predictor (not necessarily a neural network) learned from the training data and parametrized by w\mathbf{w}. We consider the distribution QQ over predictors of the form fw+uf_{\mathbf{w}+\mathbf{u}}, where u\mathbf{u} is a random variable whose distribution may also depend on the training data. Given a “prior” distribution PP over the set of predictors that is independent of the training data, the PAC-Bayes theorem states that with probability at least 1δ1-\delta over the draw of the training data, the expected error of fw+uf_{\mathbf{w}+\mathbf{u}} can be bounded as follows (McAllester, 2003):

In the above expression the KL is evaluated for a fixed w\mathbf{w} and only u\mathbf{u} is random, i.e. the distribution of w+u\mathbf{w}+\mathbf{u} is the distribution of u\mathbf{u} shifted by w\mathbf{w}. The lemma is analogous to similar analysis of Langford & Shawe-Taylor (2003) and McAllester (2003) obtaining PAC-Bayes margin bounds for linear predictors, and the proof, presented in Section 4, is essentially the same. As we state the lemma, it is not specific to linear separators, nor neural networks, and holds generally for any real-valued predictor.

We next show how to utilize the above general PAC-Bayes bound to prove generalization guarantees for feedforward networks based on the spectral norm of its layers.

Generalization Bound

In this section we present our generalization bound for feedfoward networks with ReLU activations, derived using the PAC-Bayesian framework. Langford & Caruana (2001), and more recently Dziugaite & Roy (2017) and Neyshabur et al. (2017), used PAC-Bayes bounds to analyze generalization behavior in neural networks, evaluating the KL-divergence, “perturbation error” L[fw+u]L[fw]L[f_{\mathbf{w}+\mathbf{u}}]-L[f_{\mathbf{w}}], or the entire bound numerically. Here, we use the PAC-Bayes framework as a tool to analytically derive a margin-based bound in terms of norms of the weights. As we saw in Lemma 1, the key to doing so is bounding the change in the output of the network when the weights are perturbed. In the following lemma, we bound this change in terms of the spectral norm of the layers:

This lemma characterizes the change in the output of a network with respect to perturbation of its weights, thereby bounding the sharpness of the network as defined in Keskar et al. (2016).

The proof of this lemma is presented in Section 4. Next we use the above perturbation bound and the PAC-Bayes result (Lemma 1) to derive the following generalization guarantee.

The proof involves mainly two steps. In the first step we calculate what is the maximum allowed perturbation of parameters to satisfy a given margin condition γ\gamma, using Lemma 2. In the second step we calculate the KL term in the PAC-Bayes bound in Lemma 1, for this value of the perturbation.

Since uN(0,σ2I)\mathbf{u}\sim\mathcal{N}(0,\sigma^{2}I), we get the following bound for the spectral norm of UiU_{i} (Tropp, 2012):

Taking a union bond over the layers, we get that, with probability 12\geq\frac{1}{2}, the spectral norm of the perturbation UiU_{i} in each layer is bounded by σ2hln(4dh)\sigma\sqrt{2h\ln(4dh)}. Plugging this spectral norm bound into Lemma 2 we have that with probability at least 12\frac{1}{2},

We now calculate the KL-term in Lemma 1 with the chosen distributions for PP and u\mathbf{u}, for the above value of σ\sigma.

Comparison to Existing Generalization Bounds

In this section we will compare the bound of Theorem 1 with a similar spectral norm based margin bound recently obtained by Bartlett et al. (2017a; b), as well as examine whether and when these bounds can improve over VC-based generalization guarantees.

where here and throughout this section we ignore logarithmic factors that depend on - the failure probability δ\delta, the sample size mm, the depth dd and the number of units hh.

Bartlett et al. (2017a) showed a generalization bound for neural networks based on the spectral norm of its layers, using a different proof approach based on covering number arguments. For feedforward depth-dd networks with ReLU activations, and when inputs are in XB,n\mathcal{X}_{B,n}, i.e. are of norm bounded by x2B\left\lvert{\mathbf{x}}\right\rvert_{2}\leq B, their generalization guarantee, ignoring logarithmic factors, ensures that, with high probability, for any w\mathbf{w},

Comparing between the bounds thus boils down to comparing hWiF\sqrt{h}\left\lVert{W_{i}}\right\rVert_{F} with Wi1\left\lVert{W_{i}}\right\rVert_{1}. Recalling that WiW_{i} is at most a h×hh\times h matrix, we have that WiFWi1hWiF\left\lVert{W_{i}}\right\rVert_{F}\leq\left\lVert{W_{i}}\right\rVert_{1}\leq h\left\lVert{W_{i}}\right\rVert_{F}. When the weights are fairly dense and are of uniform magnitude, the second inequality will be tight, and we will have hWiFWi1\sqrt{h}\left\lVert{W_{i}}\right\rVert_{F}\ll\left\lVert{W_{i}}\right\rVert_{1}, and Theorem 1 will dominate. When the weights are sparse with roughly a constant number of significant weights per unit (i.e. weight matrix with sparsity Θ(h)\Theta(h)), the bounds will be similar. Bartlett et al. (2017a) bound will dominate when the weights are extremely sparse, with much fewer significant weights than units, i.e. when most units do not have any incoming or outgoing weights of significant magnitude.

We always have WiFhWi2\left\lVert{W_{i}}\right\rVert_{F}\leq\sqrt{h}\left\lVert{W_{i}}\right\rVert_{2}, and this is tight only for orthogonal matrices, where all eigenvalues are equal. Satisfying (9), and thus having Theorem 1 potentially improving over the VC-bound, thus only requires fairly mild eigenvalue concentration (i.e. having multiple units be similar to each other), reduced rank or row-level sparsity in the weight matrices. Note that we cannot expect to improve over the VC-bound for unstructured “random” weight matrices—we can only expect norm-based guarantees to improve over the VC bound if there is some specific degenerate structure in the weights, and as we indeed see is the case here.

A similar comparison with Bartlett et al. (2017a) bound (6) and its multiplicative factor (8), yields the following condition for improving over the VC bound:

Since Wi1\left\lVert{W_{i}}\right\rVert_{1} can be as large as h1.5Wi2h^{1.5}\left\lVert{W_{i}}\right\rVert_{2}, in some sense more structure is required here in order to satisfy (10), such as elementwise sparsity combined with low-rank row structure. As discussed above, Theorem 1 and Bartlett et al. (2017a) bound can each be better in different regimes. Also in terms of comparison to the VC-bound, it is possible for either one to improve over the VC bound while the other doesn’t (i.e. for either (9) or (10) to be satisfied without the other one being satisfied), depending on the sparsity structure in the weights.

Proofs of Lemmas

In this section we present the proofs of Lemmas 1 and 2.

Let w=w+u\mathbf{w}^{\prime}=\mathbf{w}+\mathbf{u}. Let Sw{\mathcal{S}_{\mathbf{w}}} be the set of perturbations with the following property:

Since the above bound holds for any x\mathbf{x} in the domain XB,n\mathcal{X}_{B,n}, we can get the following a.s.:

Now using the above inequalities together with the equation (2), with probability 1δ1-\delta over the training set we have:

The last inequality follows from the following calculation.

where H(Z)=ZlnZ(1Z)ln(1Z)1H(Z)=-Z\ln Z-(1-Z)\ln(1-Z)\leq 1 is the binary entropy function. Since KL is always positive, we get,

Let Δi=fw+ui(x)fwi(x)2\Delta_{i}=\left\lvert{f^{i}_{\mathbf{w}+\mathbf{u}}(\mathbf{x})-f^{i}_{\mathbf{w}}(\mathbf{x})}\right\rvert_{2}. We will prove using induction that for any i0i\geq 0:

The above inequality together with (1+1d)de\left(1+\frac{1}{d}\right)^{d}\leq e proves the lemma statement. The induction base clearly holds since Δ0=xx2=0\Delta_{0}=\left\lvert{\mathbf{x}-\mathbf{x}}\right\rvert_{2}=0. For any i1i\geq 1, we have the following:

Conclusion

In this paper, we presented new perturbation bounds for neural networks thereby giving a bound on its sharpness. We also discussed how PAC-Bayes framework can be used to derive generalization bounds based on the sharpness of a model class. Applying this to the feedforward networks, we showed that a tighter generalization bound can be achieved based on the spectral norm and Frobenius norm of the layers. The simplicity of the proof compared to that of covering number arguments in Bartlett et al. (2017a) suggest that the PAC-Bayes framework might be an important tool in analyzing the generalization behavior of neural networks.

References