Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
Colin Wei, Tengyu Ma
Introduction
Deep networks trained in practice typically use many more parameters than training examples, and therefore have the capacity to overfit to the training set (Zhang et al., 2016). Fortunately, there are also many known (and unknown) sources of regularization during training: model capacity regularization such as simple weight decay, implicit or algorithmic regularization (Gunasekar et al., 2017, 2018b; Soudry et al., 2018; Li et al., 2018b), and finally regularization that depends on the training data such as Batchnorm (Ioffe and Szegedy, 2015), layer normalization (Ba et al., 2016), group normalization (Wu and He, 2018), path normalization (Neyshabur et al., 2015a), dropout (Srivastava et al., 2014; Wager et al., 2013), and regularizing the variance of activations (Littwin and Wolf, 2018).
In many cases, it remains unclear why data-dependent regularization can improve the final test error — for example, why Batchnorm empirically improves the generalization performance in practice (Ioffe and Szegedy, 2015; Zhang et al., 2019). We do not have many tools for analyzing data-dependent regularization in the literature; with the exception of Dziugaite and Roy (2018), (Arora et al., 2018) and (Nagarajan and Kolter, 2019) (with which we compare later in more detail), existing bounds typically consider properties of the weights of the learned model but little about their interactions with the training set. Formally, define a data-dependent property as any function of the learned model and the training data. In this work, we prove tighter generalization bounds by considering additional data-dependent properties of the network. Optimizing these bounds leads to data-dependent regularization techniques that empirically improve performance.
Suppose . With probability over the training data, we can bound the test error of by
The degree of the dependencies on may look unconventional — this is mostly due to the dramatic simplification from our full Theorem 7.1, which obtains a more natural bound that considers all interlayer Jacobian norms instead of only the maximum. Our bound is polynomial in , and network depth, but independent of width. In practice, and have been observed to be much smaller than the product of matrix norms (Arora et al., 2018; Nagarajan and Kolter, 2019). We remark that our bound is not homogeneous because the smooth activations are not homogeneous and can cause a second order effect on the network outputs.
In contrast, the bounds of Neyshabur et al. (2015b); Bartlett et al. (2017); Neyshabur et al. (2017a); Golowich et al. (2017) all depend on a product of norms of weight matrices which scales exponentially in the network depth, and which can be thought of as a worst case Lipschitz constant of the network. In fact, lower bounds show that with only norm-based constraints on the hypothesis class, this product of norms is unavoidable for Rademacher complexity-based approaches (see for example Theorem 3.4 of (Bartlett et al., 2017) and Theorem 7 of (Golowich et al., 2017)). We circumvent these lower bounds by additionally considering the model’s Jacobian norms – empirical Lipschitz constants which are much smaller than the product of norms because they are only computed on the training data.
The bound of Arora et al. (2018) depends on similar quantities related to noise stability but only holds for a compressed network and not the original. The bound of Nagarajan and Kolter (2019) also depends polynomially on the Jacobian norms rather than exponentially in depth; however these bounds also require that the inputs to the activation layers are bounded away from 0, an assumption that does not hold in practice (Nagarajan and Kolter, 2019). We do not require this assumption because we consider networks with smooth activations, whereas the bound of Nagarajan and Kolter (2019) applies to relu nets.
In Section E, we additionally present a generalization bound for recurrent neural nets that scales polynomially in the same quantities as our bound for standard neural nets. Prior generalization bounds for RNNs either require parameter counting (Koiran and Sontag, 1997) or depend exponentially on depth (Zhang et al., 2018; Chen et al., 2019).
In Figure 1, we plot the distribution over the sum of products of Jacobian and hidden layer norms (which is the leading term of the bound in our full Theorem 7.1) for a WideResNet (Zagoruyko and Komodakis, 2016) trained with and without Batchnorm. Figure 1 shows that this sum blows up for networks trained without Batchnorm, indicating that the terms in our bound are empirically relevant for explaining data-dependent regularization.
An immediate bottleneck in proving Theorem 1.1 is that standard tools require fixing the hypothesis class before looking at training data, whereas conditioning on data-dependent properties makes the hypothesis class a random object depending on the data. A natural attempt is to augment the loss with indicators on the intended data-dependent quantities , with desired bounds as follows:
This augmented loss upper bounds the original loss , with equality when all properties hold for the training data. The augmentation lets us reason about a hypothesis class that is independent of the data by directly conditioning on data-dependent properties in the loss. The main challenges with this approach are twofold: 1) designing the correct set of properties and 2) proving generalization of the final loss , a complicated function of the network.
Our main tool is covering numbers: Lemma 4.1 shows that a composition of functions (i.e, a neural network) has low covering number if the output is worst-case Lipschitz at each level of the composition and internal layers are bounded in norm. Unfortunately, the standard neural net loss satisfies neither of these properties (without exponential dependencies on depth). However, by augmenting with properties , we can guarantee they hold. One technical challenge is that augmenting the loss makes it harder to reason about covering, as the indicators can introduce complicated dependencies between layers.
Our main technical contributions are: 1) We demonstrate how to augment a composition of functions to make it Lipschitz at all layers, and thus easy to cover. Before this augmentation, the Lipschitz constant could scale exponentially in depth (Theorem 4.4). 2) We reduce covering a complicated sequence of operations to covering the individual operations (Theorem 4.3). 3) By combining 1 and 2, it follows cleanly that our augmented loss on neural networks has low covering number and therefore has good generalization. Our bound scales polynomially, not exponentially, in the depth of the network when the network has good Lipschitz constants on the training data (Theorem 7.1).
As a complement to the main theoretical results in this paper, we show empirically in Section 8 that directly regularizing our complexity measure can result in improved test performance.
Related Work
Zhang et al. (2016) and Neyshabur et al. (2017b) show that generalizaton in deep learning often disobeys conventional statistical wisdom. One of the approaches adopted torwards explaining generalization is implicit regularization; numerous recent works have shown that the training method prefers minimum norm or maximum margin solutions (Soudry et al., 2018; Li et al., 2018b; Ji and Telgarsky, 2018; Gunasekar et al., 2017, 2018a, 2018b; Wei et al., 2018). With the exception of (Wei et al., 2018), these papers analyze simplified settings and do not apply to larger neural networks.
This paper more closely follows a line of work related to Rademacher complexity bounds for neural networks (Neyshabur et al., 2015b, 2018; Bartlett et al., 2017; Golowich et al., 2017; Li et al., 2018a). For a comparison, see the introduction. There has also been work on deriving PAC-Bayesian bounds for generalization (Neyshabur et al., 2017b, a; Nagarajan and Kolter, 2019). Dziugaite and Roy (2017a) optimize a bound to compute non-vacuous bounds for generalization error. Another line of work analyzes neural nets via their behavior on noisy inputs. Neyshabur et al. (2017b) prove PAC-Bayesian generalization bounds for random networks under assumptions on the network’s empirical noise stability. Arora et al. (2018) develop a notion of noise stability that allows for compression of a network under an appropriate noise distribution. They additionally prove that the compressed network generalizes well. In comparison, our Lipschitzness construction also relates to noise stability, but our bounds hold for the original network and do not rely on the particular noise distribution.
Nagarajan and Kolter (2019) use PAC-Bayes bounds to prove a similar result as ours for generalization of a network with bounded hidden layer and Jacobian norms. The main difference is that their bounds depend on the inverse relu preactivations, which are found to be large in practice (Nagarajan and Kolter, 2019); our bounds apply to smooth activations and avoid this dependence at the cost of an additional factor in the Jacobian norm (shown to be empirically small). We note that the choice of smooth activations is empirically justified (Clevert et al., 2015; Klambauer et al., 2017). We also work with Rademacher complexity and covering numbers instead of the PAC-Bayes framework. It is relatively simple to adapt our techniques to relu networks to produce a similar result to that of Nagarajan and Kolter (2019), by conditioning on large pre-activation values in our Lipschitz augmentation step (see Section 4.2). In Section F, we provide a sketch of this argument and obtain a bound for relu networks that is polynomial in hidden layer and Jacobian norms and inverse preactivations. However, it is not obvious how to adapt the argument of Nagarajan and Kolter (2019) to activation functions whose derivatives are not piecewise-constant.
Dziugaite and Roy (2018, 2017b) develop PAC-Bayes bounds for data-dependent priors obtained via some differentially private mechanism. Their bounds are for a randomized classifier sampled from the prior, whereas we analyze a deterministic, fixed model.
Novak et al. (2018) empirically demonstrate that the sensitivity of a neural net to input noise correlates with generalization. Sokolić et al. (2017); Krueger and Memisevic (2015) propose stability-based regularizers for neural nets. Hardt et al. (2015) show that models which train faster tend to generalize better. Keskar et al. (2016); Hoffer et al. (2017) study the effect of batch size on generalization. Brutzkus et al. (2017) analyze a neural network trained on hinge loss and linearly separable data and show that gradient descent recovers the exact separating hyperplane.
Notation
We use to denote total derivative operator, and thus represents the Jacobian of at . Suppose is a family of functions from to . Let be the covering number of the function class w.r.t. metric with cover size . In many cases, the covering number depends on the examples through the norms of the examples, and in this paper we only work with these cases. Thus, we let be the maximum covering number for any possible data points with norm not larger than . Precisely, if we define to be the set of all possible uniform distributions supported on data points with norms not larger than , then
Suppose contains functions with inputs that map from a tensor product Euclidean space to Euclidean space, then we define
Overview of Main Results and Proof Techniques
In this section, we give a general overview of the main technical results and outline how to prove them with minimal notation. We will point to later sections where many statements are formalized.
Textbook results (Bartlett and Mendelson, 2002) bound the generalization error by the Rademacher complexity (formally defined in Section A) of the family of losses , which in turn is bounded by the covering number of through Dudley’s entropy integral theorem (Dudley, 1967). Modulo minor nuances, the key remaining question is to give a tight covering number bound for the family for every target cover size in a certain range (often, considering suffices).
As alluded to in the introduction, generalization error bounds obtained through this machinery only depend on the (training) data through the margin in the loss function, and our aim is to utilize more data-dependent properties. Towards understanding which data-dependent properties are useful to regularize, it is helpful to revisit the data-independent covering technique of (Bartlett et al., 2017), the skeleton of which is summarized below.
Recall that denotes the covering number for arbitrary data points with norm less than . The following lemma says that if the intermediate variable (or the hidden layer) is bounded, and the composition of the rest of the functions is Lipschitz, then small covering number of local functions imply small covering number for the composition of functions.
[abstraction of techniques in (Bartlett et al., 2017)] In the context above, assume:
for any , .
Then, we have the following covering number bound for (for any choice of ):
The lemma says that the log covering number and the cover size scale linearly if the Lipschitzness parameters and norms remain constant. However, these two quantities, in the worst case, can easily scale exponentially in the number of layers, and they are the main sources of the dependency of product of spectral/Frobenius norms of layers in (Golowich et al., 2017; Bartlett et al., 2017; Neyshabur et al., 2017a, 2015b) More precisely, the worst-case Lipschitzness over all possible data points can be exponentially bigger than the average/typical Lipschitzness for examples randomly drawn from the training or test distribution. We aim to bridge this gap by deriving a generalization error bound that only depends on the Lipschitzness and boundedness on the training examples.
where ’s are user-defined parameters. For our application to neural nets, we instantiate as the maximum norm of layer and as the maximum norm of the Jacobian between layer and across the training dataset. A polynomial in can be shown to bound the worst-case Lipschitzness of the function w.r.t. the intermediate variables in the formula above.As mentioned in footnote 1, we will formalize the precise meaning of Lipschitzness later. By our choice of , , a) the training loss is unaffected by the augmentation and b) the worst-case Lipschitzness of the loss is controlled by a polynomial of the Lipschitzness on the training examples. We provide an informal overview of our augmentation procedure in Section 4.2 and formally state definitions and guarantees in Section 6. The downside of the Lipschitz augmentation is that it further complicates the loss function. Towards covering the loss function (assuming Lipschitz properties) efficiently, we extend Lemma 4.1, which works for sequential compositions of functions, to general families of formulas, or computational graphs. We informally overview this extension in Section 4.1 using a minimal set of notations, and in Section 5, we give a formal presentation of these results.
Combining the Lipschitz augmentation and graphs covering results, we obtain a covering number bound of augmented loss. The theorem below is formally stated in Theorem 6.3 of Section 6.
where denotes the function class obtained from applying the total derivative operator to all functions in .
Now, following the standard technique of bounding Rademacher complexity via covering numbers, we can obtain generalization error bounds for augmented loss. For the demonstration of our technique, suppose that the following simplification holds:
A computational graph is an acyclic directed graph with three components: the set of nodes corresponds to variables, the set of edges describes dependencies between these variables, and contains a list of composition rules indexed by the variables ’s, representing the process of computing from its direct predecessors. For simplicity, we assume the graph contains a unique sink, denoted by , and we call it the “output node”. We also overload the notation to denote the function that the computational graph finally computes. Let be the subset of nodes with no predecessors, which we call the “input nodes” of the graph.
The notion of a family of computational graphs generalizes the sequential family of function compositions in (5). Let be a family of computational graphs with shared nodes, edges, output node, and input nodes (denoted by ). Let be the collection of all possible composition rules used for node by the graphs in the family . This family defines a set of functions .
Suppose that there is an ordering of the nodes, so that after cutting out nodes , the node becomes a leaf node and the output is -Lipschitz w.r.t to for all . In addition, assume that for all , the node ’s value has norm at most . Let be all the predecessors of and be the list of norm upper bounds of the predecessors of .
Then, small covering numbers for all of the local composition rules of with resolution would imply small covering number for the family of computational graphs with resolution :
2 Lipschitz Augmentation of Computational Graphs
The covering number bound of Theorem 4.3 relies on Lipschitzness w.r.t internal nodes of the graph under a worst-case choice of inputs. For deep networks, this can scale exponentially in depth via the product of weight norms and easily be larger than the average Lipschitz-ness over typical inputs. In this section, we explain a general operation to augment sequential graphs (such as neural nets) into graphs with better worst-case Lipschitz constants, so tools such as Theorem 4.3 can be applied. Formal definitions and theorem statements are in Section 6.
The augmentation relies on introducing terms such as the soft indicators in equation (6) and (7) which condition on data-dependent properties. As outlined in Section 4, they will translate to the data-dependent properties in the generalization bounds. We also require the augmented function to upper bound the original.
Our key insight is that by considering a more complicated augmentation which conditions on the derivatives between all intermediate variables, we can still control Lipschitzness of the system, leading to the more involved augmentation presented in (7). Our main technical contribution is Theorem 4.4, which we informally state below.
Covering of Computational Graphs
This section is a formal version of Section 4.1 with full definition and theorem statements. In this section, we adapt the notion of a computational graph to our setting. In Section 5.1, we formalize the notion of a computational graph and demonstrate how neural networks fit under this framework. In Section 5.2, we define the notion of release-Lipschitzness that abstracts the sequential notion of Lipschitzness in Lemma 4.1. We show that when this release-Lipschitzness condition and a boundedness condition on the internal nodes hold, it is possible to cover a family of computational graphs by simply covering the function class at each vertex.
When we augment the neural network loss with data-dependent properties, we introduce dependencies between the various layers, making it complicated to cover the augmented loss. We use the notion of computational graphs to abstractly model these dependencies.
Computational graphs are originally introduced by Bauer (1974) to represent computational processes and study error propagation. Recall the notation introduced for a computational graph in Section 4.1, with input nodes and output node denoted by . (It’s straightforward to generalize to scenarios with multiple output nodes.)
For every variable , let be the space that resides in. If has direct predecessors , then the associated composition rule is a function that maps to . If is an input node, then the composition rule is not relevant. For any node , the computational graph defines/induces a function that computes the variable from inputs, or in mathematical words, that maps the inputs space to . This associated function, denoted by again with slight abuse of notations, is defined recursively as follows: set to
More succinctly, we can write . We also overload the notation to denote the function that the computational graph finally computes (which maps to ). For any set , use to denote the space . We use to denote the set of direct predecessors of in graph , or simply when the graph is clear from context.
2 Reducing graph covering to local function covering
In this section we introduce the notion of a family of computational graphs, generalizing the sequential family of function compositions in (5). We define release-Lipschitzness, a condition which allows reduce covering the entire the graph family to covering the composition rules at each node. We formally state this reduction in Theorem 5.3.
Let be a family of computational graph with shared nodes and edges, where is a collection of lists of composition rules. This family of computational graphs defines a set of functions . We’d like to cover this set of functions in with respect to some metric .
For a list of composition rules and subset , we define the projection of composition rules onto by . Now let denote the marginal collection of the composition rules on node subset .
For any computational graph and a non-input node , we can define the following operation that “releases” from its dependencies on its predecessors by cutting all the inward edges: Let be sub-graph of where all the edges pointing towards are removed from the graph. Thus, by definition, becomes a new input node of the graph : . Moreover, we can “recover” the dependency by plugging the right value for in the new graph : Let be the function associated to the node in graph , then we have
In our proofs, we will release variables in orders. Let be an ordering of the intermediate variables . We call a forest ordering if for any , in the original graph , at most depends on the input nodes and . For any sequence of variables , we can define the graph obtained by releasing the variables in order: . We next define the release-Lipschitz condition, which states that the graph function remains Lipschitz when we sequentially release vertices in a forest ordering of the graph.
A graph is release-Lipschitz with parameters w.r.t a forest ordering of the internal nodes, denoted by if the following happens: upon releasing in order from any , for any , we have that the function defined by the released graph is -Lipschitz in the argument , for any values of the rest of the input nodes (=.) We also say graph is release- Lipschitz if such a forest ordering exists.
Now we show that the release-Lipschitz condition allows us to cover any family of computational graphs whose output collapses when internal nodes are too large. The below is a formal and complete version of Theorem 4.3. For the augmented loss defined in (7), the function output collapses to when internal computations are large. The proof is deferred to Section B.
Suppose is a computational graph with the associated family of lists of composition rules , as formally defined above. Let be a uniform distribution over points in . Let , , and be three families of fixed parameters indexed by (whose meanings are defined below). Assume the following:
Every is release-Lipschitz with parameters w.r.t a forest ordering of the internal nodes (the parameter ’s and ordering doesn’t depend on the choice of .)
For the same order as before, if is an input of the released graph satisfying for some , then for some constant .
Then, small covering numbers for all of the local composition rules of with resolution would imply small covering number for the family of computational graphs with resolution :
Lipschitz Augmentation of Computational Graphs
In this section, we provide a more thorough and formal presentation of the augmentation framework of Section 4.2.
The covering number bound for the computational graph family in Theorem 5.3 relies on the release-Lipschitzness condition (condition 1 of Theorem 5.3) and rarely holds for deep computational graphs such as deep neural networks. The conundrum is that the worst-case Lipschitzness as required in the release-Lipschitz conditionWe say the Lipschitzness required is worst case because the release-Lipschitz condition requires the Lipschitzness of nodes for any possible choice of inputs is very likely to scale in the product of the worst-case Lipschitzness of each operations in the graph, which can easily be exponentially larger than the average Lipschitzness over typical examples.
In this section, we first define a model of sequential computational graphs, which captures the class of neural networks. Before Lipschitz augmentation, the worst-case Lipschitz constant of graphs in this family could scale exponentially in the depth of the graph. In Definition 6.1, we generalize the operation of (7) to augment any family of sequential graphs and produce a family satisfying the release-Lipschitz condition. In Theorem 6.3, we combine this augmentation with the framework of 5.3 to produce general covering number bounds for the augmented graphs. For the rest of this section we will work with sequential families of computational graphs.
A sequential computational graph has nodes set , where is the single input node, and all the edges are . We often use the notation to refer to the input . Below we formally define the augmentation operation.
Given a differentiable sequential computational graph with internal nodes , define its Lipschitz augmentation as follows. We first add nodes to the graph denoted by . The composition rules for original internal nodes remain the same, and the composition rule for is defined as
Here is the total derivative of the function . In other words, the variable is a Jacobian for , a linear operator that maps to . (Note that if ’s are considered as vector variables, then ’s are matrix variables.) We equip the space of with operator norm, denoted by , induced by the original norms on spaces and . The Lipschitz-ness w.r.t variable will be measured with operator norm.
We pre-determine a family of parameters for all pairs with . The final loss is augmented by a product of soft indicators that truncates the function when any of the Jacobians is much larger than :
where , , and . Note that is the total derivative of w.r.t , and thus the has the interpretation as an intended bound of the Jacobian between pairs of layers (variables). Figure 4 depicts the augmentation.
Note that under these definitions, we finally get that the output function of computes
[Lipschitz guarantees of augmented graphs] Let be a family of sequential computational graphs. Suppose for any , the composition rule of the output node, , is -Lipschitz in variable for all , and it only outputs value in $DR_{V_{i}}\bar{\kappa}_{i}iDR_{V_{i}}\mathcal{D}_{V_{i-1}}\mathcal{D}_{V_{i-1}}\mathcal{D}_{V_{i}}\kappa_{j\leftarrow i}i\leq j\mathcal{G}\widetilde{\mathcal{G}}$.
where for simplicity in the above expressions, we extend the definition of ’s to .
Finally, we combine Theorems 5.3 and Theorems 6.2 to derive covering number bounds for any Lipschitz augmentation of sequential computational graphs. The final covering bound in (16) can be easily computed given covering number bounds for each individual function class. In Section 7, we use this theorem to derive Rademacher complexity bounds for neural networks. The proof is deferred to Section C. In Section E, we also use these tools to derive Rademacher complexity bounds for RNNs.
Consider any family of sequential computational graphs satisfying the conditions of Theorem 6.2. By combining the augmentation of Definition 6.1 with additional indicators on the internal node norms, we can construct a new family of computational graphs which output
The family satisfies the following guarantees:
Each computational graph in upper bounds its counterpart in , i.e. .
where denotes the family of total derivatives of functions in and the input vertex.
Application to Neural Networks
The below result follows from modeling the neural net loss as a sequential computational graph and using our augmentation procedure to make it Lipschitz in its nodes with parameters . Then we cover the augmented loss to bound its Rademacher complexity.
Assume that the activation is 1-Lipschitz with a -Lipschitz derivative. Fix reference matrices , . With probability over the random draws of the data , all neural networks with parameters and positive margin satisfy:
where , and .
In these expressions, we define , , and:
where computes the Jacobian of layer w.r.t. layer . Note that the training error here is because of the existence of positive margin .
We note that our bound has no explicit dependence on width and instead depends on the norms of the weights offset by reference matrices . These norms can avoid scaling with the width of the network if the difference between the weights and reference matrices is sparse. The reference matrices are useful if there is some prior belief before training about what weight matrices are learned, and they also appear in the bounds of Bartlett et al. (2017). In Section E, we also show that our techniques can easily be extended to provide generalization bounds for RNNs scaling polynomially in depth via the same quantities .
Experiments
Though the main purpose of the paper is to study the data-dependent generalization bounds from a theoretical perspective, we provide preliminary experiments demonstrating that the proposed complexity measure and generalization bounds are empirically relevant. We show that regularizing the complexity measure leads to better test accuracy. Inspired by Theorem 7.1, we directly regularize the Jacobian of the classification margin w.r.t outputs of normalization layers and after residual blocks. Our reasoning is that normalization layers control the hidden layer norms, so additionally regularizing the Jacobians results in regularization of the product, which appears in our bound. We find that this is effective for improving test accuracy in a variety of settings. We note that Sokolić et al. (2017) show positive experimental results for a similar regularization technique in data-limited settings.
Suppose that denotes the margin of the network for example . Letting denote some hidden layer of the network, we define the notation and use training objective
where denotes the standard cross entropy loss, and are hyperparameters. Note the Jacobian is taken with respect to a scalar output and therefore is a vector, so it is easy to compute.
For a WideResNet16 (Zagoruyko and Komodakis, 2016) architecture, we train using the above objective. The threshold on the Frobenius norm in the regularization is inspired by the truncations in our augmented loss (in all our experiments, we choose ). We tune the coefficient as a hyperparameter. In our experiments, we took the regularized indices to be last layers in each residual block as well as layers in residual blocks following a BatchNorm in the standard WideResNet16 architecture. In the LayerNorm setting, we simply replaced BatchNorm layers with LayerNorm. The remaining hyperparameter settings are standard for WideResNet; for additional details see Section G.1.
Figure 1 shows the results for models trained and tested on CIFAR10 in low learning rate and no data augmentation settings, which are settings where generalization typically suffers. We also experiment with replacing BatchNorm layers with LayerNorm and additionally regularizing the Jacobian. We observe improvements in test error for all these settings. In Section G.2, we empirically demonstrate that our complexity measure indeed avoids the exponential scaling in depth for a WideResNet model trained on CIFAR10.
Conclusion
In this paper, we tackle the question of how data-dependent properties affect generalization. We prove tighter generalization bounds that depend polynomially on the hidden layer norms and norms of the interlayer Jacobians. To prove these bounds, we work with the abstraction of computational graphs and develop general tools to augment any sequential family of computational graphs into a Lipschitz family and then cover this Lipschitz family. This augmentation and covering procedure applies to any sequence of function compositions. An interesting direction for future work is to generalize our techniques to arbitrary computational graph structures. In follow-up work (Wei and Ma, 2019), we develop a simpler technique to derive Jacobian-based generalization bounds for both robust and clean accuracy, and we present an algorithm inspired by this theory which empirically improves performance over strong baselines.
Acknowledgments
CW was supported by a NSF Graduate Research Fellowship. Toyota Research Institute (TRI) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
Appendix A Missing Proofs for Section 7
We first elaborate more on the notations introduced in Section 7. First, by our indexing, matrix will be applied in layer of the network, and even layers apply . We let denote the function computed between layers and and denote the layer -to- Jacobian. By our definition of , , , and is recursively computed by for . We will use the convention that computes the identity mapping for .
will denote a test distribution over examples and labels , and will denote the distribution on training examples.
For a class of real-valued functions and dataset , define the empirical Rademacher complexity of this function class by
Assume that the activation is -Lipschitz with -Lipschitz derivative. Fix parameters , , , , and reference matrices , . With probability over the random draws of the distribution , all neural networks with parameters satisfying the following data-dependent conditions:
Hidden layers norms are controlled: .
Jacobians are balanced: .
The margin is large: .
and the additional data-independent condition
will have the following generalization to test data:
Here we use the convention that and let .
This generalization bound follows straightforwardly via the below Rademacher complexity bound for the augmented loss class:
Suppose that is -Lipschitz with -Lipschitz derivative. Define the following class of neural networks with norm bounds on its weight matrices with respect to reference matrices :
Fix parameters and for with and . When we apply this theorem, we will choose and which upper bound the layer to Jacobian norm and -th hidden layer norm, respectively. Define the class of augmented losses
and define for , meant to bound the influence of the matrix on the Jacobians and hidden variables, respectively as in (18), (19). Then we can bound the empirical Rademacher complexity of the augmented loss class by
We associate the un-augmented loss class on neural networks with a family of sequential computation graphs with depth . The composition rules are as follows: for internal node , , the set with only one element: the activation . We also let . Finally, we choose to be the singleton class . Our collection of computation rules is then simply . Since takes values in $\mathcal{G}s_{\mathcal{I}}=\max_{x\in P_{n}}\|x\|s_{V_{2i}}=t^{(i)}s_{V_{2i-1}}=\infty\kappa_{2i\leftarrow 2i}=1\kappa_{2i-1\leftarrow 2i-1}=\sigma^{(i)}\kappa_{j^{\prime}\leftarrow j}=\sigma_{j^{\prime}\leftarrow j}j^{\prime}>j\bar{\kappa}_{2i}=\bar{\sigma}_{\phi}\bar{\kappa}_{2i-1}=0\widetilde{\mathcal{G}}\mathcal{G}J_{i}s_{V_{2i-1}}\kappa_{i\leftarrow i}$, we get that
We first note that the last term in (20) is simply 0 because there is exactly one output function in . Now for the other terms of (20): by definition , consist of a singleton set and therefore have log cover size for any error resolution . Otherwise, to cover it suffices to bound . Thus, we can apply Lemma A.3 to obtain
Thus, substituting terms into (20) and collecting sums, we obtain that
Now we apply Dudley’s entropy theorem to obtain that
We start with Theorem A.2, which bounds the Rademacher complexity of the augmented loss class . Using to denote the application of this augmented loss on the network , its weights, and data , we first note that for any datapoint . We used the fact that margin loss upper bounds 0-1 loss, and upper bounds margin loss by the construction in Theorem 6.3. Thus, applying the standard Rademacher generalization bound, with probability over the training data, it holds that
Plugging in the bound on from Theorem A.2 gives the desired result. ∎
Finally, to prove Theorems 7.1 and 1.1, we simply take a union bound over the choices of parameters .
We will apply Theorem A.1 repeatedly over a grid of parameter choices , , , (following a technique of Bartlett et al. ). For a collection of nonnegative integers , , , , , we apply Theorem A.1 choosing , , , , and using error probability . First, we note that by union bound, using the fact that where ranges over nonnegative integers, we get that the generalization bound of Theorem A.1 holds for choices of with probability 1 - .
Now for the network at hand, there would have been some choice of for which the bound was applied using parameters , , , , and
The proof of the simpler Theorem 1.1, follows the same above argument. The only difference is that we union bound over parameters and the matrix norms. ∎
Let be an upper bound on the largest norm of a datapoint. Then the following bound relates Rademacher complexity to covering numbers:
Appendix B Missing Proofs in Section 5
We prove the theorem by induction on the number of non-input vertices in the vertex set . The statement is true if is the only non-input node in the graph: to cover the graph output with error , we simply cover .
Given a family of graphs (with shared edges and nodes ), we assume the inductive hypothesis that “for any family of graphs with more than input vertices, the theorem statement holds.” Under this hypothesis, we will show that the theorem statement holds for the graph family .
We take node from the forest ordering assumed in the theorem. Suppose depends on , which are assumed to be the input nodes by the definition of forest ordering. We release the node from the graph and obtain a new family with a smaller number of edges than that of .
Define for and , and . Then we can check that . Let , and let . As each function in is -Lipschitz in because of condition 1, and it equals the fixed constant if or , we have satisfies the conditions of the composition lemma (see Lemma B.1). With the lemma, we conclude:
Note that by the definition of forest ordering, we have that is a forest ordering of and by the assumption 1 of the theorem, we have that satisfies the condition 1 for the graph family . has one more input node than , so we can invoke the inductive hypothesis on and obtain
Combining equation (23) and (24) above, we prove (14) for , and complete the induction. ∎
Below we provide the composition lemma necessary for Theorem 5.3.
is a family of functions with two arguments and is another family of functions. We overload notation and refer to as . The spaces all associate with some norms (the norms can potentially be different for each space, but we use the same notation for all of them.) Assume the following:
All functions in are -Lipschitz in the argument for any possible choice of : for any , , and , we have .
Any function collapses on inputs with large norms: there exists a constant such that if or for any .
Then, the family of the composition of and , , has covering number bound:
When it is clear from context, we let denote the statement that . Suppose is a uniform distribution over data points with norms not larger than . Given function and , we will construct a pair of functions such that covers . We will count (in a straightforward way) how many distinct pairs of functions we have construct for all the pairs at the end of the proof.
Let be the uniform distribution over , and suppose is a error cover of with respect to the metric . We note that has size at most . We found such that is -close to in metric . Let denote . Let be the uniform distribution over , and let be the uniform distribution over all points, . Now we construct a intermediate cover (that depends on implicitly) that covers with error with respect to the metric . We augment this to a cover that covers with respect to metric as follows: for every , add the function to with
Note that by construction, the size of is at most . Now let be the cover element for w.r.t. , and be the corresponding cover element in . Because when or for some ,
Then we bound the difference between and by Lipschitzness; since when ,
where in the last step we used the property of the cover . Finally, by triangle inequality, we get that
Finally we count how many we have constructed: is of size at most . and for every , we’ve constructed a family of functions (that depends on ) of size at most . Therefore, the total size of the cover is at most . ∎
Appendix C Missing Proofs in Section 6
We first state the proofs of Theorem 6.2 and Theorem 6.3, which follow straightforwardly from the technical tools developed in Section D.
Now we prove release-Lipschitzness for a prefix sequence of that ends in node . For all , fix . It suffices to show that the function defined by
We first construct an augmented family of graphs sharing the same vertices and edges as . For , we add to computing
This is achieved by modifying the family of output rules as follows:
Now all terms match (16) except for the term . First, we note that all functions in can be written in the form
This allows us to conclude (16). Finally, we note that as the augmentation operations are in the form of those considered in Claim H.1, it follows that upper bounds . ∎
Appendix D Technical Tools for Lipschitz Augmentation
In this section, we develop the technical tools needed for proving Theorem 6.2. The main result in this section is our Lemma D.1, which essentially states that augmenting the loss with a product of Jacobians (plus additional matrices meant to model previous Jacobian nodes already released from the computational graph) will make the loss Lipschitz.
For this section, we say a function taking input and outputting an operator mapping to is -Lipschitz if for any in its input domain. We will consider functions , where and is a compact subset of some normed space. For ease of notation, we use to denote the (possibly distinct) norms on . For , Let denote the composition
For convenience in indexing, for with , we will set to be the identity function.
Finally consider a real-valued function and define the composition by
We will construct a “Lipschitz-fication” for the function .
Let denote a collection of linear operators that map to the space . We will furthermore use to denote the -to- Jacobian, i.e.
When and , we will also consider products between -to- Jacobians and the matrices : define
Note in particular that .
[Lipschitz-fication] Following the notation in this section, suppose that is -Lipschitz in its -th argument for . Suppose that is -Lipschitz for all . For any with , let be parameters that intend to be a tight bound on , and also define which will bound . Define the augmented function by
We define the following order on this collection of functions:
Now note that by Claim D.7, we have the bound
Define to be the Lipschitz constant of on for all guaranteed by Claim D.6. First, note that for all . Thus, by Claims D.4 and D.5, it follows that
Now note that if , then it follows that . Substituting into (30), we get that ,
In the setting of Lemma D.1, for , we can expand the error as follows:
Furthermore, for , we can expand the error as follows:
We will first show (31) by inducting on . The base case follows by definition, as we can reduce and to constant-valued functions that output the identity matrix.
For the inductive step, we use Claim H.2 to expand
To prove (32), we first note that by definition, , so
In the setting of Lemma D.1, suppose that is -Lipschitz for all . Then we can bound the operator norm error in the Jacobian by
Likewise, we can bound the operator norm error in the product between Jacobian and auxiliary matrices by
We will first prove (34), as the proof of (35) is nearly identical. Starting from (31) of Claim D.2, we have
By triangle inequality and the fact that is -Lipschitz , it follows that
Plugging the above into (37), we get . To prove (35), we start from (32) and follow the same steps as above. ∎
In the setting of Lemma D.1, suppose that is -Lipschitz for all . Then we can upper bound the error terms corresponding to the indicators by
Likewise, the following upper bound holds for all with :
Plugging this into our definition for (28), it follows that
Note that if , then such that and by definition of the order . It follows that if , , so . Otherwise, if , by Claim D.3 we have
where we recall that . Plugging this into (40) and using the fact that all functions are bounded by 1 gives the desired statement.
To prove (39), we simply apply the above argument with (35). ∎
In the setting of Lemma D.1, fix index with and suppose that is -Lipschitz. Then we can bound the error due to function composition by
Starting from (27), we can first express by
as . First we note that by definition, , as the function is -Lipschitz in its -th argument. Thus, since all functions are bounded by , it follows that
In the setting of Lemma D.1, such that , is -Lipschitz on a compact domain .
We first show inductively that is Lipschitz for all . The base case follows by definition, as is continuously differentiable and is a compact set.
Now we show the inductive step: first write . By continuity, is compact. Furthermore, is continuously differentiable under the assumptions of Lemma D.1. Thus, is Lipschitz on domain . As is the composition of Lipschitz functions by the inductive hypothesis, is itself Lipschitz.
Now it follows that , is Lipschitz on , as it is the composition of and , both of which are Lipschitz. Finally, by the chain rule (Claim H.2), we have that is the product of Lipschitz functions, and therefore Lipschitz for all . We simply take to be the maximum Lipschitz constant of over all . ∎
For , define by
Thus, denotes with the last inputs to depending on instead of . Now we claim that by a telescoping argument (Claim H.3),
To see this, compute the sum in the order the following sequence of terms, which corresponds to a traversal of in least-to-greatest order:
Now we simply apply triangle inequality on (41) and use the fact that to obtain the desired statement. ∎
In the setting of Theorem 6.2, fix and define
Here for convenience we use the convention that .
Appendix E Application to Recurrent Neural Networks
In this section, we will apply our techniques to recurrent neural networks. Suppose that we are in a classification setting. For simplicity, we will assume that the hidden layer and input dimensions are . We will define a recurrent neural network with activation layers as follows using parameters , activation and input sequence :
where is set to be 0. Now following the convention of Section 7, we will define the interlayer Jacobians. For odd indices , , we simply set to the constant function . For even indices , , we set , the Jacobian of the activation applied to the input of . Finally, we set to be the constant function . Now for , we set . If , we set to the identity matrix.
With this notation in place, we can state our generalization bound for RNN’s:
Assume that the activation is 1-Lipschitz with a -Lipschitz derivative. With probability over the random draws of , all RNNs will satisfy the following generalization guarantee:
where , and
In these expressions, we define , and:
Note that the training error here is because of the existence of positive margin .
Our proof follows the template of Theorem 7.1: we bound the Rademacher complexity of some augmented RNN loss. We then argue for generalization of the augmented loss and perform a union bound over all the choices of parameters. As the latter steps are identical to those in the proof of Theorem 7.1, we omit these and focus on bounding the Rademacher complexity of an augmented RNN loss.
Suppose that is -Lipschitz with -Lipschitz derivative. Define the following class of RNNs with bounded weight matrices:
and let be parameters that will bound the to layerwise Jacobian for , where we set and for , . Let be parameters bounding the layer norm after applying the -th activation, and let . Define the class of augmented losses
and define for , meant to bound the influence of the matrix on the Jacobians and hidden variables, respectively as in (18), (19). Then we can bound the empirical Rademacher complexity of the augmented loss class by
where are defined in Theorem E.1.
We will associate the family of losses with a computational graph structure on internal nodes , , , input nodes , and output node with the following edges:
Nodes will point towards the output .
Node will point towards nodes and .
Node will point towards node and node .
Node will point towards node .
We now define the composition rules at each node:
Finally, nodes will have composition rule . Finally, the output node will have composition rule
With this condition established, we can complete the proof via the same covering number argument as in Theorem A.2. ∎
Now as in the proof of Theorem 7.1, we first observe that the augmented loss upper bounds the 0-1 classification loss, giving us a 0-1 test error bound. We then apply the same union bound technique over parameters , as in the proof of Theorem 7.1.
Appendix F ReLU Networks
In this section, we apply our augmentation technique to relu networks to produce a generalization bound similar to that of Nagarajan and Kolter , which is polynomial in the Jacobian norms, hidden layer norms, and inverse pre-activations.
Recall the definition of neural nets in Example 5.1: the neural net with parameters and activation is defined by
For this section, we will set to be the relu activation. We also use the same notation for layers and indexing as Section 7. We first state our generalization bound for relu networks:
Fix reference matrices . With probability over the random draws of the data , all neural networks with relu activations parameterized by will have the following generalization guarantee
In these expressions, we define , to be the minimum pre-activation after the -th weight matrix over all coordinates in the -th layer and all datapoints:
where indexes the -th coordinate of , and additionally use
Note that we assume the existence of a positive margin, so the training error here is .
We note that compared to Theorem 7.1, , but now has a dependence on the preactivations , as in Nagarajan and Kolter .
We provide a proof sketch of Theorem F.1 here. We first bound the Rademacher complexity some family of augmented losses, specified precisely in Theorem F.2. The rest of the argument then follows the same way as the proof of Theorem 7.1: using Rademacher complexity to argue that the augmented losses generalize, applying the fact that the augmented losses upper-bound the 0-1 loss, and then union bounding over all choices of parameters.
Following the definitions in Theorem A.2, let denote the class of neural networks, be parameters intended to bound the spectral norm of the to layerwise Jacobian, and be parameters bounding the layer norm after applying the -th activation. Define as parameters intended to lower bound the minimum preactivations after the -th linear layer. Define the class of augmented losses
As in the proof of Theorem A.2, associate the loss class with a family of computation graphs on internal nodes as follows: define the graph structure to be identical to the Lipschitz augmentation of a sequential computation graph family (Figure 3) and define the composition rules
Assign to the nodes composition rule , and finally, assign to the output node the composition rule
The resulting family of computation graphs will compute . Now we claim that is -release-Lipschitz in nodes and -release-Lipschitz in nodes . (Note that the Lipschitzness of nodes will not matter because the associated function classes and singletons and therefore have a log covering number of 0 anyways).
The argument for the -release-Lipschitzness of follows analogously to the argument of Lemma D.8 and Theorem A.2.
Finally, to conclude the desired Rademacher complexity bounds given the release-Lipschitzness, we apply the same reasoning as in Theorem A.2. ∎
Appendix G Additional Experimental Details
For all settings, we train for 200 epochs with learning rate decay by a factor of 0.2 at epochs 60, 120, and 150. We additionally tuned the value of from values for each setting: for the experiments displayed in Figure 1, we used the following values:
For all other hyperparameters, we use the defaults in the PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch, and we base our code off of this implementation. We report results from a single run as the improvement with Jacobian regularization is statistically significant. We train on a single NVIDIA TitanXp GPU.
G.2 Empirical Scaling of our Complexity Measure with Depth
In this section, we empirically demonstrate that the leading term of our bounds can exhibit better scaling in depth than prior work.
We compute leading terms of our bound: , where ranges over the layers, , denote the -th hidden layer and Jacobian of the output with respect to the -th hidden layer, respectively, and denotes the smallest positive margin on the training dataset. We compare this quantity with that of the bound of [Bartlett et al., 2017]: . In Figure 5, we plot this comparison for WideResNetOur bound as stated in the paper technically does not apply to ResNet because the skip connections complicate the Lipschitz augmentation step. This can be remedied with a slight modification to our augmentation step, which we omit for simplicity. models of depths 10, 16, 22, 28 trained on CIFAR10. For all models, we remove data augmentation to ensure that our models fit the training data perfectly. We train each model for 50 epochs, which is sufficient for perfectly fitting the training data, and start from an initial learning rate of 0.1 which we decrease by a factor of 10 at epoch 30. All other parameters are set to the same as their defaults in the PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch. We plot the final complexity measures computed on a single model. We note that our models are trained with Batchnorm. At test time, these Batchnorm layers compute affine transformations, so we compute the bound by merging these transformations with the adjacent linear layer.
Figure 5 demonstrates that our complexity measure can be much lower than the spectral complexity. Furthermore, in Figure 5, our complexity measure appears to scale well with depth for WideResNet models.
Appendix H Toolbox
.
First, we note that . Furthermore, , which completes the proof of statements 1 and 2. To prove the third statement, we note that . ∎
The Jacobian of a composition of a sequence of functions satisfies
where the notations are standard matrix multiplication. For simplicity, we also write in the function form:
Let , and consider the total derivative operator mapping to a linear operator between normed spaces to . Suppose that is -Lipschitz in , in the sense that , where is the operator norm induced by and . Then
We write . Now we note that