Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Colin Wei, Tengyu Ma

Introduction

Deep networks trained in practice typically use many more parameters than training examples, and therefore have the capacity to overfit to the training set (Zhang et al., 2016). Fortunately, there are also many known (and unknown) sources of regularization during training: model capacity regularization such as simple weight decay, implicit or algorithmic regularization (Gunasekar et al., 2017, 2018b; Soudry et al., 2018; Li et al., 2018b), and finally regularization that depends on the training data such as Batchnorm (Ioffe and Szegedy, 2015), layer normalization (Ba et al., 2016), group normalization (Wu and He, 2018), path normalization (Neyshabur et al., 2015a), dropout (Srivastava et al., 2014; Wager et al., 2013), and regularizing the variance of activations (Littwin and Wolf, 2018).

In many cases, it remains unclear why data-dependent regularization can improve the final test error; for example, it is unclear why Batchnorm empirically improves generalization performance (Ioffe and Szegedy, 2015; Zhang et al., 2019). We do not have many tools for analyzing data-dependent regularization in the literature; with the exception of Dziugaite and Roy (2018), Arora et al. (2018), and Nagarajan and Kolter (2019) (with which we compare later in more detail), existing bounds typically consider properties of the weights of the learned model but little about their interactions with the training set. Formally, define a data-dependent property as any function of the learned model and the training data. In this work, we prove tighter generalization bounds by considering additional data-dependent properties of the network. Optimizing these bounds leads to data-dependent regularization techniques that empirically improve performance.

Suppose $\sigma,t\geq 1$ . With probability $1-\delta$ over the training data, we can bound the test error of $F$ by $\displaystyle L_{\textup{0-1}}(F)\leq\widetilde{O}\left(\frac{(\frac{\sigma}{\gamma}+r^{3}\sigma^{2})t\left(1+\sum_{i}\|{W^{(i)}}^{\top}\|_{2,1}^{2/3}\right)^{3/2}+r^{2}\sigma\left(1+\sum_{i}\|{W^{(i)}}\|_{1,1}^{2/3}\right)^{3/2}}{\sqrt{n}}+r\sqrt{\frac{\log(\frac{1}{\delta})}{n}}\right)$

The degree of the dependencies on $\sigma$ may look unconventional — this is mostly due to the dramatic simplification from our full Theorem 7.1, which obtains a more natural bound that considers all interlayer Jacobian norms instead of only the maximum. Our bound is polynomial in $t,\sigma$ , and network depth, but independent of width. In practice, $t$ and $\sigma$ have been observed to be much smaller than the product of matrix norms (Arora et al., 2018; Nagarajan and Kolter, 2019). We remark that our bound is not homogeneous because the smooth activations are not homogeneous and can cause a second order effect on the network outputs.
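To get a feel for how the terms of the simplified bound trade off, the expression inside the $\widetilde{O}(\cdot)$ can be evaluated numerically. The sketch below is only an illustration: it ignores the constants and logarithmic factors hidden by $\widetilde{O}$, it takes the matrix norms as precomputed inputs, and it treats $r$ as a depth-like integer, which is an assumption since $r$ is not defined in this excerpt. The function name `simplified_bound` is hypothetical.

```python
import math

def simplified_bound(sigma, t, gamma, r, w_t_21_norms, w_11_norms, n, delta):
    """Evaluate the expression inside the O-tilde of the simplified bound.

    Hidden constants and log factors are ignored, so the value is only
    useful for comparing configurations, not as a certified bound.
    """
    # (1 + sum_i ||W^(i)^T||_{2,1}^{2/3})^{3/2}
    term_21 = (1.0 + sum(w ** (2.0 / 3.0) for w in w_t_21_norms)) ** 1.5
    # (1 + sum_i ||W^(i)||_{1,1}^{2/3})^{3/2}
    term_11 = (1.0 + sum(w ** (2.0 / 3.0) for w in w_11_norms)) ** 1.5
    leading = (sigma / gamma + r ** 3 * sigma ** 2) * t * term_21
    second = r ** 2 * sigma * term_11
    confidence = r * math.sqrt(math.log(1.0 / delta) / n)
    return (leading + second) / math.sqrt(n) + confidence
```

Every term decays as $1/\sqrt{n}$, so quadrupling the number of training examples halves the value of the expression.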

In contrast, the bounds of Neyshabur et al. (2015b); Bartlett et al. (2017); Neyshabur et al. (2017a); Golowich et al. (2017) all depend on a product of norms of weight matrices which scales exponentially in the network depth, and which can be thought of as a worst case Lipschitz constant of the network. In fact, lower bounds show that with only norm-based constraints on the hypothesis class, this product of norms is unavoidable for Rademacher complexity-based approaches (see for example Theorem 3.4 of (Bartlett et al., 2017) and Theorem 7 of (Golowich et al., 2017)). We circumvent these lower bounds by additionally considering the model’s Jacobian norms – empirical Lipschitz constants which are much smaller than the product of norms because they are only computed on the training data.
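The gap between the product of weight norms and the Jacobian norm on the data can be seen even in a toy setting. The following sketch assumes a hypothetical network whose layers are scalars with a tanh activation (a smooth, 1-Lipschitz choice); the weights and training points are made up for illustration and are not taken from the experiments.

```python
import math

def forward_and_jacobian(weights, x):
    """Apply h <- tanh(w * h) layer by layer; return (output, d output / d x)."""
    h, jac = x, 1.0
    for w in weights:
        h = math.tanh(w * h)
        jac *= w * (1.0 - h * h)  # chain rule through tanh(w * h_prev)
    return h, jac

weights = [2.0] * 10                                # hypothetical 10-layer scalar net
worst_case = math.prod(abs(w) for w in weights)     # product of weight norms
train_points = [0.5, 1.0, -1.5, 2.0]                # hypothetical training inputs
empirical = max(abs(forward_and_jacobian(weights, x)[1]) for x in train_points)
```

Here `worst_case` grows as $2^{\text{depth}}$, while `empirical` stays far below 1 because the activations saturate on these inputs.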

The bound of Arora et al. (2018) depends on similar quantities related to noise stability but only holds for a compressed network and not the original. The bound of Nagarajan and Kolter (2019) also depends polynomially on the Jacobian norms rather than exponentially in depth; however these bounds also require that the inputs to the activation layers are bounded away from 0, an assumption that does not hold in practice (Nagarajan and Kolter, 2019). We do not require this assumption because we consider networks with smooth activations, whereas the bound of Nagarajan and Kolter (2019) applies to relu nets.

In Section E, we additionally present a generalization bound for recurrent neural nets that scales polynomially in the same quantities as our bound for standard neural nets. Prior generalization bounds for RNNs either require parameter counting (Koiran and Sontag, 1997) or depend exponentially on depth (Zhang et al., 2018; Chen et al., 2019).

In Figure 1, we plot the distribution over the sum of products of Jacobian and hidden layer norms (which is the leading term of the bound in our full Theorem 7.1) for a WideResNet (Zagoruyko and Komodakis, 2016) trained with and without Batchnorm. Figure 1 shows that this sum blows up for networks trained without Batchnorm, indicating that the terms in our bound are empirically relevant for explaining data-dependent regularization.

An immediate bottleneck in proving Theorem 1.1 is that standard tools require fixing the hypothesis class before looking at training data, whereas conditioning on data-dependent properties makes the hypothesis class a random object depending on the data. A natural attempt is to augment the loss with indicators on the intended data-dependent quantities $\{\gamma_{i}\}$ , with desired bounds $\{\kappa_{i}\}$ as follows:

This augmented loss upper bounds the original loss $l_{\textup{old}}\in[0,1]$, with equality when all properties hold for the training data. The augmentation lets us reason about a hypothesis class that is independent of the data by directly conditioning on data-dependent properties in the loss. The main challenges with this approach are twofold: 1) designing the correct set of properties and 2) proving generalization of the final loss $l_{\textup{aug}}$, a complicated function of the network.
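One form of augmentation consistent with the two properties just stated (upper bounding the original loss, with equality when every property holds) is $l_{\textup{aug}}=(l_{\textup{old}}-1)\prod_{i}\mathbb{1}(\gamma_{i}\leq\kappa_{i})+1$. The sketch below assumes this form and assumes the original loss takes values in $[0,1]$; the function name is hypothetical.

```python
def augmented_loss(l_old, gammas, kappas):
    """Return l_old if every property gamma_i <= kappa_i holds, and 1 otherwise."""
    all_hold = all(g <= k for g, k in zip(gammas, kappas))
    indicator_product = 1.0 if all_hold else 0.0
    # (l_old - 1) * prod_i 1[gamma_i <= kappa_i] + 1
    return (l_old - 1.0) * indicator_product + 1.0
```

Because $l_{\textup{old}}\leq 1$, the returned value is never smaller than `l_old`, and it equals `l_old` exactly when no property is violated.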

Our main tool is covering numbers: Lemma 4.1 shows that a composition of functions (i.e., a neural network) has low covering number if the output is worst-case Lipschitz at each level of the composition and internal layers are bounded in norm. Unfortunately, the standard neural net loss satisfies neither of these properties (without exponential dependencies on depth). However, by augmenting with properties $\gamma$ , we can guarantee they hold. One technical challenge is that augmenting the loss makes it harder to reason about covering, as the indicators can introduce complicated dependencies between layers.

Our main technical contributions are: 1) We demonstrate how to augment a composition of functions to make it Lipschitz at all layers, and thus easy to cover. Before this augmentation, the Lipschitz constant could scale exponentially in depth (Theorem 4.4). 2) We reduce covering a complicated sequence of operations to covering the individual operations (Theorem 4.3). 3) By combining 1 and 2, it follows cleanly that our augmented loss on neural networks has low covering number and therefore has good generalization. Our bound scales polynomially, not exponentially, in the depth of the network when the network has good Lipschitz constants on the training data (Theorem 7.1).

As a complement to the main theoretical results in this paper, we show empirically in Section 8 that directly regularizing our complexity measure can result in improved test performance.

Related Work

Zhang et al. (2016) and Neyshabur et al. (2017b) show that generalization in deep learning often disobeys conventional statistical wisdom. One of the approaches adopted towards explaining generalization is implicit regularization; numerous recent works have shown that the training method prefers minimum norm or maximum margin solutions (Soudry et al., 2018; Li et al., 2018b; Ji and Telgarsky, 2018; Gunasekar et al., 2017, 2018a, 2018b; Wei et al., 2018). With the exception of Wei et al. (2018), these papers analyze simplified settings and do not apply to larger neural networks.

This paper more closely follows a line of work related to Rademacher complexity bounds for neural networks (Neyshabur et al., 2015b, 2018; Bartlett et al., 2017; Golowich et al., 2017; Li et al., 2018a). For a comparison, see the introduction. There has also been work on deriving PAC-Bayesian bounds for generalization (Neyshabur et al., 2017b, a; Nagarajan and Kolter, 2019). Dziugaite and Roy (2017a) optimize a bound to compute non-vacuous bounds for generalization error. Another line of work analyzes neural nets via their behavior on noisy inputs. Neyshabur et al. (2017b) prove PAC-Bayesian generalization bounds for random networks under assumptions on the network’s empirical noise stability. Arora et al. (2018) develop a notion of noise stability that allows for compression of a network under an appropriate noise distribution. They additionally prove that the compressed network generalizes well. In comparison, our Lipschitzness construction also relates to noise stability, but our bounds hold for the original network and do not rely on the particular noise distribution.

Nagarajan and Kolter (2019) use PAC-Bayes bounds to prove a result similar to ours for generalization of a network with bounded hidden layer and Jacobian norms. The main difference is that their bounds depend on the inverse relu preactivations, which are found to be large in practice (Nagarajan and Kolter, 2019); our bounds apply to smooth activations and avoid this dependence at the cost of an additional factor in the Jacobian norm (shown to be empirically small). We note that the choice of smooth activations is empirically justified (Clevert et al., 2015; Klambauer et al., 2017). We also work with Rademacher complexity and covering numbers instead of the PAC-Bayes framework. It is relatively simple to adapt our techniques to relu networks to produce a result similar to that of Nagarajan and Kolter (2019), by conditioning on large pre-activation values in our Lipschitz augmentation step (see Section 4.2). In Section F, we provide a sketch of this argument and obtain a bound for relu networks that is polynomial in hidden layer and Jacobian norms and inverse preactivations. However, it is not obvious how to adapt the argument of Nagarajan and Kolter (2019) to activation functions whose derivatives are not piecewise-constant.

Dziugaite and Roy (2018, 2017b) develop PAC-Bayes bounds for data-dependent priors obtained via some differentially private mechanism. Their bounds are for a randomized classifier sampled from the prior, whereas we analyze a deterministic, fixed model.

Novak et al. (2018) empirically demonstrate that the sensitivity of a neural net to input noise correlates with generalization. Sokolić et al. (2017); Krueger and Memisevic (2015) propose stability-based regularizers for neural nets. Hardt et al. (2015) show that models which train faster tend to generalize better. Keskar et al. (2016); Hoffer et al. (2017) study the effect of batch size on generalization. Brutzkus et al. (2017) analyze a neural network trained on hinge loss and linearly separable data and show that gradient descent recovers the exact separating hyperplane.

Notation

We use $D$ to denote total derivative operator, and thus $Df(x)$ represents the Jacobian of $f$ at $x$ . Suppose $\mathcal{F}$ is a family of functions from $\mathcal{D}_{x}$ to $\mathcal{D}_{f}$ . Let $\mathcal{C}(\epsilon,\mathcal{F},\rho)$ be the covering number of the function class $\mathcal{F}$ w.r.t. metric $\rho$ with cover size $\epsilon$ . In many cases, the covering number depends on the examples through the norms of the examples, and in this paper we only work with these cases. Thus, we let $\mathcal{N}(\epsilon,\mathcal{F},s)$ be the maximum covering number for any possible $n$ data points with norm not larger than $s$ . Precisely, if we define $\mathcal{P}_{n,s}$ to be the set of all possible uniform distributions supported on $n$ data points with norms not larger than $s$ , then

Suppose $\mathcal{F}$ contains functions with $m$ inputs that map from a tensor product of $m$ Euclidean spaces to a Euclidean space. Then we define

Overview of Main Results and Proof Techniques

In this section, we give a general overview of the main technical results and outline how to prove them with minimal notation. We will point to later sections where many statements are formalized.

Textbook results (Bartlett and Mendelson, 2002) bound the generalization error by the Rademacher complexity (formally defined in Section A) of the family of losses $\mathcal{L}$ , which in turn is bounded by the covering number of $\mathcal{L}$ through Dudley’s entropy integral theorem (Dudley, 1967). Modulo minor nuances, the key remaining question is to give a tight covering number bound for the family $\mathcal{L}$ for every target cover size $\epsilon$ in a certain range (often, considering $\epsilon\in[1/n^{O(1)},1]$ suffices).
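As a concrete illustration of what a cover at size $\epsilon$ means, the sketch below greedily builds an $\epsilon$-cover of a small finite function class under the empirical $L_2$ metric on $n$ data points. The class (linear functions on a grid of slopes), the data points, and the greedy construction are all assumptions made for illustration; the greedy cover only upper bounds the covering number of this toy class.

```python
import math

def empirical_l2_distance(f, g, points):
    """Distance between f and g in L2 of the uniform distribution on `points`."""
    return math.sqrt(sum((f(x) - g(x)) ** 2 for x in points) / len(points))

def greedy_cover(functions, points, eps):
    """Select centers so that every function lies within eps of some center."""
    centers = []
    for f in functions:
        if all(empirical_l2_distance(f, c, points) > eps for c in centers):
            centers.append(f)  # f is not covered yet, so it becomes a center
    return centers

# Hypothetical class: x -> w * x with slopes w on a fine grid in [-1, 1].
functions = [lambda x, w=k / 100.0: w * x for k in range(-100, 101)]
points = [0.1 * i for i in range(1, 11)]   # n = 10 points with norm at most 1
coarse = greedy_cover(functions, points, eps=0.5)
fine = greedy_cover(functions, points, eps=0.05)
```

Shrinking $\epsilon$ increases the number of centers needed, which is the dependence on $\epsilon$ that the entropy integral aggregates.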

As alluded to in the introduction, generalization error bounds obtained through this machinery only depend on the (training) data through the margin in the loss function, and our aim is to utilize more data-dependent properties. Towards understanding which data-dependent properties are useful to regularize, it is helpful to revisit the data-independent covering technique of (Bartlett et al., 2017), the skeleton of which is summarized below.

Recall that $\mathcal{N}(\epsilon,\mathcal{F},s)$ denotes the covering number for arbitrary $n$ data points with norm less than $s$ . The following lemma says that if the intermediate variable (or the hidden layer) $f_{i}\circ\cdots\circ f_{1}(x)$ is bounded, and the composition of the rest of the functions $l\circ f_{k}\circ\cdots\circ f_{i+1}(x)$ is Lipschitz, then small covering numbers of the local functions imply a small covering number for the composition of functions.

[abstraction of techniques in (Bartlett et al., 2017)] In the context above, assume:

for any $x\in\text{supp}(P_{n})$ , ${|\kern-1.07639pt|\kern-1.07639pt|f_{i}\circ\cdots\circ f_{1}(x)|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{i}$ .

Then, we have the following covering number bound for $\mathcal{L}$ (for any choice of $\epsilon_{1},\dots,\epsilon_{k}>0$ ):

The lemma says that the log covering number and the cover size scale linearly if the Lipschitzness parameters and norms remain constant. However, these two quantities, in the worst case, can easily scale exponentially in the number of layers, and they are the main sources of the dependency on the product of spectral/Frobenius norms of layers in (Golowich et al., 2017; Bartlett et al., 2017; Neyshabur et al., 2017a, 2015b). More precisely, the worst-case Lipschitzness over all possible data points can be exponentially bigger than the average/typical Lipschitzness for examples randomly drawn from the training or test distribution. We aim to bridge this gap by deriving a generalization error bound that only depends on the Lipschitzness and boundedness on the training examples.
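The effect of exponentially large Lipschitz parameters can be made concrete with a little arithmetic. The sketch below assumes that the overall cover size of a composition has the form $\sum_{i}\kappa_{i}\epsilon_{i}$, with $\kappa_{i}$ the Lipschitz parameter of the layers after layer $i$ and $\epsilon_{i}$ the resolution used for layer $i$; this form is an assumption here, chosen to match the resolution stated for computational graphs later in the paper. The depth, target, and per-layer norm of 2 are hypothetical.

```python
def composed_resolution(kappas, epsilons):
    """Overall cover size of the composition: sum_i kappa_i * epsilon_i."""
    return sum(k * e for k, e in zip(kappas, epsilons))

depth = 8
target = 0.1                      # desired overall cover size

# Case 1: Lipschitz parameters that stay constant with depth.
flat_kappas = [2.0] * depth
flat_eps = [target / (depth * k) for k in flat_kappas]

# Case 2: worst-case parameters, a product of norms 2 over the layers after i.
worst_kappas = [2.0 ** (depth - i) for i in range(1, depth + 1)]
worst_eps = [target / (depth * k) for k in worst_kappas]
```

Both choices reach the same overall cover size, but in the worst-case setting the earliest layers must be covered at a resolution that shrinks exponentially with depth, which inflates their covering numbers.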

where $\kappa_{j\leftarrow i}$ ’s are user-defined parameters. For our application to neural nets, we instantiate $s_{i}$ as the maximum norm of layer $i$ and $\kappa_{j\leftarrow i}$ as the maximum norm of the Jacobian between layer $j$ and $i$ across the training dataset. A polynomial in $\kappa,s$ can be shown to bound the worst-case Lipschitzness of the function w.r.t. the intermediate variables in the formula above. (As mentioned in footnote 1, we will formalize the precise meaning of Lipschitzness later.) By our choice of $\kappa$ , $s$ , a) the training loss is unaffected by the augmentation and b) the worst-case Lipschitzness of the loss is controlled by a polynomial of the Lipschitzness on the training examples. We provide an informal overview of our augmentation procedure in Section 4.2 and formally state definitions and guarantees in Section 6. The downside of the Lipschitz augmentation is that it further complicates the loss function. Towards covering the loss function (assuming Lipschitz properties) efficiently, we extend Lemma 4.1, which works for sequential compositions of functions, to general families of formulas, or computational graphs. We informally overview this extension in Section 4.1 using minimal notation, and in Section 5, we give a formal presentation of these results.

Combining the Lipschitz augmentation and graphs covering results, we obtain a covering number bound of augmented loss. The theorem below is formally stated in Theorem 6.3 of Section 6.

where $D\mathcal{F}_{i}$ denotes the function class obtained from applying the total derivative operator to all functions in $\mathcal{F}_{i}$ .

Now, following the standard technique of bounding Rademacher complexity via covering numbers, we can obtain generalization error bounds for augmented loss. For the demonstration of our technique, suppose that the following simplification holds:

A computational graph $G(\mathcal{V},\mathcal{E},\{R_{V}\})$ is an acyclic directed graph with three components: the set of nodes $\mathcal{V}$ corresponds to variables, the set of edges $\mathcal{E}$ describes dependencies between these variables, and $\{R_{V}\}$ contains a list of composition rules indexed by the variables $V$ ’s, representing the process of computing $V$ from its direct predecessors. For simplicity, we assume the graph contains a unique sink, denoted by $O_{G}$ , and we call it the “output node”. We also overload the notation $O_{G}$ to denote the function that the computational graph $G$ finally computes. Let $\mathcal{I}_{G}=\{I_{1},\dots,I_{p}\}$ be the subset of nodes with no predecessors, which we call the “input nodes” of the graph.

The notion of a family of computational graphs generalizes the sequential family of function compositions in (5). Let $\mathcal{G}=\{G(\mathcal{V},\mathcal{E},\{R_{V}\})\}$ be a family of computational graphs with shared nodes, edges, output node, and input nodes (denoted by $\mathcal{I}$ ). Let $\mathfrak{R}_{V}$ be the collection of all possible composition rules used for node $V$ by the graphs in the family $\mathcal{G}$ . This family $\mathcal{G}$ defines a set of functions $O_{\mathcal{G}}\triangleq\{O_{G}:G\in\mathcal{G}\}$ .

Suppose that there is an ordering $(V_{1},\dots,V_{m})$ of the nodes, so that after cutting out nodes $V_{1},\dots,V_{i-1}$ , the node $V_{i}$ becomes a leaf node and the output $O_{G}$ is $\kappa_{V_{i}}$ -Lipschitz w.r.t. $V_{i}$ for all $G\in\mathcal{G}$ . In addition, assume that for all $G\in\mathcal{G}$ , the node $V$ ’s value has norm at most $s_{V}$ . Let $\textup{pr}(V)$ be all the predecessors of $V$ and $s_{\textup{pr}(V)}$ be the list of norm upper bounds of the predecessors of $V$ .

Then, small covering numbers for all of the local composition rules of $V$ with resolution $\epsilon_{V}$ would imply small covering number for the family of computational graphs with resolution $\sum_{V}\epsilon_{V}\kappa_{V}$ :

Lipschitz Augmentation of Computational Graphs

The covering number bound of Theorem 4.3 relies on Lipschitzness w.r.t. internal nodes of the graph under a worst-case choice of inputs. For deep networks, this can scale exponentially in depth via the product of weight norms and easily be larger than the average Lipschitzness over typical inputs. In this section, we explain a general operation to augment sequential graphs (such as neural nets) into graphs with better worst-case Lipschitz constants, so tools such as Theorem 4.3 can be applied. Formal definitions and theorem statements are in Section 6.

The augmentation relies on introducing terms such as the soft indicators in equations (6) and (7), which condition on data-dependent properties. As outlined in Section 4, they will translate to the data-dependent properties in the generalization bounds. We also require the augmented function to upper bound the original.

Our key insight is that by considering a more complicated augmentation which conditions on the derivatives between all intermediate variables, we can still control Lipschitzness of the system, leading to the more involved augmentation presented in (7). Our main technical contribution is Theorem 4.4, which we informally state below.

Covering of Computational Graphs

This section is a formal version of Section 4.1 with full definitions and theorem statements. In this section, we adapt the notion of a computational graph to our setting. In Section 5.1, we formalize the notion of a computational graph and demonstrate how neural networks fit under this framework. In Section 5.2, we define the notion of release-Lipschitzness that abstracts the sequential notion of Lipschitzness in Lemma 4.1. We show that when this release-Lipschitzness condition and a boundedness condition on the internal nodes hold, it is possible to cover a family of computational graphs by simply covering the function class at each vertex.

When we augment the neural network loss with data-dependent properties, we introduce dependencies between the various layers, making it complicated to cover the augmented loss. We use the notion of computational graphs to abstractly model these dependencies.

Computational graphs were originally introduced by Bauer (1974) to represent computational processes and study error propagation. Recall the notation $G(\mathcal{V},\mathcal{E},\{R_{V}\})$ introduced for a computational graph in Section 4.1, with input nodes $\mathcal{I}_{G}=\{I_{1},\dots,I_{p}\}$ and output node denoted by $O_{G}$ . (It is straightforward to generalize to scenarios with multiple output nodes.)

For every variable $V\in\mathcal{V}$ , let $\mathcal{D}_{V}$ be the space that $V$ resides in. If $V$ has $t$ direct predecessors $C_{1},\dots,C_{t}$ , then the associated composition rule $R_{V}$ is a function that maps $\mathcal{D}_{C_{1}}\otimes\cdots\otimes\mathcal{D}_{C_{t}}$ to $\mathcal{D}_{V}$ . If $V$ is an input node, then the composition rule $R_{V}$ is not relevant. For any node $V$ , the computational graph induces a function that computes the variable $V$ from the inputs, or, in mathematical terms, that maps the input space $\mathcal{D}_{I_{1}}\otimes\cdots\otimes\mathcal{D}_{I_{p}}$ to $\mathcal{D}_{V}$ . This associated function, denoted by $V$ again with slight abuse of notation, is defined recursively as follows: set $V(x_{1},\dots,x_{p})$ to $x_{i}$ if $V$ is the $i$-th input node $I_{i}$ , and to $R_{V}(C_{1}(x_{1},\dots,x_{p}),\dots,C_{t}(x_{1},\dots,x_{p}))$ otherwise.

More succinctly, we can write $V=R_{V}\circ(C_{1}\otimes\cdots\otimes C_{t})$ . We also overload the notation $O_{G}$ to denote the function that the computational graph $G$ finally computes (which maps $\mathcal{D}_{I_{1}}\otimes\cdots\otimes\mathcal{D}_{I_{p}}$ to $\mathcal{D}_{O}$ ). For any set $\mathcal{S}=\{V_{1},\ldots,V_{t}\}\subseteq\mathcal{V}$ , use $\mathcal{D}_{\mathcal{S}}$ to denote the space $\mathcal{D}_{V_{1}}\otimes\cdots\otimes\mathcal{D}_{V_{t}}$ . We use $\textup{pr}(G,V)$ to denote the set of direct predecessors of $V$ in graph $G$ , or simply $\textup{pr}(V)$ when the graph $G$ is clear from context.
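The recursive definition of the function associated with a node can be sketched directly in code. The graph, node names, and composition rules below are hypothetical; the representation (a dictionary of direct predecessors and a dictionary of rules) is one convenient choice, not the paper's.

```python
def evaluate(node, inputs, predecessors, rules):
    """Compute the value of `node` from the input-node values in `inputs`."""
    if node in inputs:                      # input node: return the supplied value
        return inputs[node]
    args = [evaluate(c, inputs, predecessors, rules) for c in predecessors[node]]
    return rules[node](*args)               # V = R_V(C_1(x), ..., C_t(x))

# Hypothetical graph: I -> V1 -> V2, with output O depending on V1 and V2.
predecessors = {"V1": ["I"], "V2": ["V1"], "O": ["V1", "V2"]}
rules = {
    "V1": lambda x: 2.0 * x,
    "V2": lambda v1: v1 + 1.0,
    "O": lambda v1, v2: v1 * v2,
}
output = evaluate("O", {"I": 3.0}, predecessors, rules)
```

With input 3.0, the nodes take the values V1 = 6.0 and V2 = 7.0, so the graph computes 42.0 at the output node.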

Reducing graph covering to local function covering

In this section we introduce the notion of a family of computational graphs, generalizing the sequential family of function compositions in (5). We define release-Lipschitzness, a condition which allows us to reduce covering the entire graph family to covering the composition rules at each node. We formally state this reduction in Theorem 5.3.

Let $\mathcal{G}=\{G(\mathcal{V},\mathcal{E},\{R_{V}\}):\{R_{V}\}\in\mathfrak{R}\}$ be a family of computational graphs with shared nodes and edges, where $\mathfrak{R}$ is a collection of lists of composition rules. This family of computational graphs defines a set of functions $O_{\mathcal{G}}\triangleq\{O_{G}:G\in\mathcal{G}\}$ . We would like to cover the set of functions $O_{\mathcal{G}}$ with respect to some metric $L(P_{n},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|})$ .

For a list of composition rules $\{R_{V}\}\in\mathfrak{R}$ and subset $\mathcal{S}\subseteq\mathcal{V}$ , we define the projection of composition rules onto $\mathcal{S}$ by $\{R_{V}\}_{\mathcal{S}}=\{R_{V}:V\in\mathcal{S}\}$ . Now let $\mathfrak{R}_{S}=\{\{R_{V}\}_{\mathcal{S}}:\{R_{V}\}\in\mathfrak{R}\}$ denote the marginal collection of the composition rules on node subset $\mathcal{S}$ .

For any computational graph $G$ and a non-input node $V\in\mathcal{V}\setminus\mathcal{I}$ , we can define the following operation that “releases” $V$ from its dependencies on its predecessors by cutting all the inward edges: Let $G^{\backslash V}$ be the sub-graph of $G$ where all the edges pointing towards $V$ are removed from the graph. Thus, by definition, $V$ becomes a new input node of the graph $G^{\backslash V}$ : $\mathcal{I}_{G^{\backslash V}}=\{V\}\cup\mathcal{I}_{G}$ . Moreover, we can “recover” the dependency by plugging the right value for $V$ into the new graph $G^{\backslash V}$ : Let $V(x)$ be the function associated to the node $V$ in graph $G$ , then we have $O_{G}(x)=O_{G^{\backslash V}}(V(x),x)$ .
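The release operation, and the fact that plugging $V(x)$ back into the released graph recovers the original output, i.e. $O_{G}(x)=O_{G^{\backslash V}}(V(x),x)$, can be checked on a toy example. The graph and rules below are hypothetical, and representing a released node by simply dropping its predecessor entry is an implementation choice made for this sketch.

```python
def evaluate(node, inputs, predecessors, rules):
    """Compute the value of `node`; nodes listed in `inputs` act as input nodes."""
    if node in inputs:
        return inputs[node]
    args = [evaluate(c, inputs, predecessors, rules) for c in predecessors[node]]
    return rules[node](*args)

def release(predecessors, node):
    """Cut all edges pointing into `node`, turning it into a new input node."""
    return {v: pr for v, pr in predecessors.items() if v != node}

# Hypothetical graph: I -> V1 -> V2, with output O depending on V1 and V2.
predecessors = {"V1": ["I"], "V2": ["V1"], "O": ["V1", "V2"]}
rules = {"V1": lambda x: 2.0 * x, "V2": lambda v1: v1 + 1.0, "O": lambda v1, v2: v1 * v2}

x = 3.0
original = evaluate("O", {"I": x}, predecessors, rules)
released = release(predecessors, "V1")
v1_value = evaluate("V1", {"I": x}, predecessors, rules)
recovered = evaluate("O", {"I": x, "V1": v1_value}, released, rules)  # plug V(x) back in
free = evaluate("O", {"I": x, "V1": 10.0}, released, rules)           # arbitrary value for V1
```

In the released graph, V1 can be set to any value independently of the input, which is what makes Lipschitzness with respect to a released node a worst-case requirement.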

In our proofs, we will release variables in order. Let $\mathcal{S}=(V_{1},\dots,V_{m})$ be an ordering of the intermediate variables $\mathcal{V}\backslash(\mathcal{I}\cup\{O\})$ . We call $\mathcal{S}$ a forest ordering if for any $i$ , in the original graph $G$ , $V_{i}$ depends at most on the input nodes and $V_{1},\dots,V_{i-1}$ . For any sequence of variables $(V_{1},\dots,V_{t})$ , we can define the graph obtained by releasing the variables in order: $G^{\backslash(V_{1},\dots,V_{t})}\triangleq(\cdots(G^{\backslash V_{1}})\cdots)^{\backslash V_{t}}$ . We next define the release-Lipschitz condition, which states that the graph function remains Lipschitz when we sequentially release vertices in a forest ordering of the graph.

A graph $G$ is release-Lipschitz with parameters $\{\kappa_{V}\}$ w.r.t. a forest ordering of the internal nodes, denoted by $(V_{1},\dots,V_{m})$ , if the following holds: upon releasing $V_{1},\dots,V_{m}$ in order from any $G\in\mathcal{G}$ , for any $0\leq i\leq m$ , we have that the function defined by the released graph $G^{\backslash(V_{1},\dots,V_{i})}$ is $\kappa_{V_{i}}$ -Lipschitz in the argument $V_{i}$ , for any values of the rest of the input nodes (namely $\{V_{1},\dots,V_{i-1}\}\cup\mathcal{I}_{G}$ ). We also say graph $G$ is release-Lipschitz if such a forest ordering exists.
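Whether a given ordering of the internal nodes is a forest ordering can be checked mechanically. The sketch below interprets the dependence condition in terms of direct predecessors, which is an assumption; since every internal node appears in the ordering, this matches checking all ancestors. The graph and the function name are hypothetical.

```python
def is_forest_ordering(ordering, predecessors, input_nodes):
    """Check that each V_i has predecessors only among inputs and V_1, ..., V_{i-1}."""
    allowed = set(input_nodes)
    for node in ordering:
        if not set(predecessors.get(node, [])) <= allowed:
            return False        # node depends on something released later
        allowed.add(node)
    return True

# Hypothetical graph with internal nodes V1, V2, V3 and output O.
predecessors = {"V1": ["I"], "V2": ["V1"], "V3": ["V1", "V2"], "O": ["V3"]}
valid = is_forest_ordering(["V1", "V2", "V3"], predecessors, ["I"])
invalid = is_forest_ordering(["V2", "V1", "V3"], predecessors, ["I"])
```

The first ordering respects the dependencies, while the second places V2 before its predecessor V1 and is rejected.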

Now we show that the release-Lipschitz condition allows us to cover any family of computational graphs whose output collapses when internal nodes are too large. The theorem below is a formal and complete version of Theorem 4.3. For the augmented loss defined in (7), the function output collapses to $1$ when internal computations are large. The proof is deferred to Section B.

Suppose $\mathcal{G}$ is a family of computational graphs with the associated family of lists of composition rules $\mathfrak{R}$ , as formally defined above. Let $P_{n}$ be a uniform distribution over $n$ points in $\mathcal{D}_{\mathcal{I}}$ . Let $\kappa_{V}$ , $s_{V}$ , and $\epsilon_{V}$ be three families of fixed parameters indexed by $\mathcal{V}\backslash\mathcal{I}$ (whose meanings are defined below). Assume the following:

Every $G\in\mathcal{G}$ is release-Lipschitz with parameters $\{\kappa_{V}\}$ w.r.t. a forest ordering of the internal nodes $(V_{1},\dots,V_{m})$ (the parameters $\kappa_{V}$ and the ordering do not depend on the choice of $G$ ).

For the same order as before, if $(v,x)\in(\mathcal{D}_{V_{1}}\otimes\cdots\otimes\mathcal{D}_{V_{i}})\otimes\mathcal{D}_{\mathcal{I}}$ is an input of the released graph satisfying ${|\kern-1.07639pt|\kern-1.07639pt|v_{j}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{V_{j}}$ for some $j\leq i$ , then $O_{G^{\backslash(V_{1},\dots,V_{i})}}(v,x)=c$ for some constant $c$ .

Lipschitz Augmentation of Computational Graphs

In this section, we provide a more thorough and formal presentation of the augmentation framework of Section 4.2.

The covering number bound for the computational graph family $\mathcal{G}$ in Theorem 5.3 relies on the release-Lipschitzness condition (condition 1 of Theorem 5.3), which rarely holds for deep computational graphs such as deep neural networks. The conundrum is that the worst-case Lipschitzness required in the release-Lipschitz condition is very likely to scale with the product of the worst-case Lipschitzness of each operation in the graph, which can easily be exponentially larger than the average Lipschitzness over typical examples. (We say the Lipschitzness required is worst-case because the release-Lipschitz condition requires the Lipschitzness of nodes for any possible choice of inputs.)

In this section, we first define a model of sequential computational graphs, which captures the class of neural networks. Before Lipschitz augmentation, the worst-case Lipschitz constant of graphs in this family could scale exponentially in the depth of the graph. In Definition 6.1, we generalize the operation of (7) to augment any family $\mathcal{G}$ of sequential graphs and produce a family $\widetilde{\mathcal{G}}$ satisfying the release-Lipschitz condition. In Theorem 6.3, we combine this augmentation with the framework of Theorem 5.3 to produce general covering number bounds for the augmented graphs. For the rest of this section we will work with sequential families of computational graphs.

A sequential computational graph has node set $\mathcal{V}=\{I,V_{1},\ldots,V_{q},O\}$ , where $I$ is the single input node, and edge set $\mathcal{E}=\{(I,V_{1}),(V_{1},V_{2}),\cdots,(V_{q-1},V_{q})\}\cup\{(V_{1},O),\dots,(V_{q},O)\}$ . We often use the notation $V_{0}$ to refer to the input $I$ . Below we formally define the augmentation operation.
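The edge structure of a sequential computational graph is simple enough to enumerate. The sketch below builds the edge set exactly as listed above for a given number $q$ of internal nodes; the string node names and the function name are hypothetical.

```python
def sequential_graph_edges(q):
    """Edge list of a sequential graph with input I, internal V1..Vq, output O."""
    names = ["I"] + [f"V{i}" for i in range(1, q + 1)]
    chain = [(names[i], names[i + 1]) for i in range(q)]      # (I,V1), ..., (V_{q-1},V_q)
    to_output = [(names[i], "O") for i in range(1, q + 1)]    # (V1,O), ..., (Vq,O)
    return chain + to_output

edges = sequential_graph_edges(3)
```

For $q=3$ this yields the chain (I,V1), (V1,V2), (V2,V3) together with an edge from each internal node to O, for $2q$ edges in total.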

Given a differentiable sequential computational graph $G$ with $q$ internal nodes $V_{1},\dots,V_{q}$ , define its Lipschitz augmentation $\widetilde{G}$ as follows. We first add $q$ nodes to the graph denoted by $J_{1},\dots,J_{q}$ . The composition rules for original internal nodes remain the same, and the composition rule for $J_{i}$ is defined as

Here $DR_{V_{i}}$ is the total derivative of the function $R_{V_{i}}$ . In other words, the variable $J_{i}$ is a Jacobian for $R_{V_{i}}$ , a linear operator that maps $\mathcal{D}_{V_{i-1}}$ to $\mathcal{D}_{V_{i}}$ . (Note that if the $V_{i}$ ’s are considered as vector variables, then the $J_{i}$ ’s are matrix variables.) We equip the space of $J_{i}$ with the operator norm, denoted by $\|\cdot\|_{\textup{op}}$ , induced by the original norms on the spaces $\mathcal{D}_{V_{i-1}}$ and $\mathcal{D}_{V_{i}}$ . The Lipschitzness w.r.t. variable $J_{i}$ will be measured with the operator norm.

We pre-determine a family of parameters $\kappa_{j\leftarrow i}$ for all pairs $(i,j)$ with $i\leq j$ . The final loss is augmented by a product of soft indicators that truncates the function when any of the Jacobians is much larger than $\kappa_{j\leftarrow i}$ :

where $x\in\mathcal{D}_{\mathcal{I}}$ , $v_{i}\in\mathcal{D}_{V_{i}}$ , and $D_{i}\in\mathcal{D}_{J_{i}}$ . Note that $D_{j}\cdots D_{i}$ is the total derivative of $V_{j}$ w.r.t $V_{i}$ , and thus the $\kappa_{j\leftarrow i}$ has the interpretation as an intended bound of the Jacobian between pairs of layers (variables). Figure 4 depicts the augmentation.
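The sketch below illustrates how a product of soft indicators on the interlayer Jacobians $D_{j}\cdots D_{i}$ can truncate an output. Several choices here are assumptions made for illustration and are not taken from the displayed definitions: the piecewise-linear shape of the soft indicator, the form (loss − 1) × product + 1 for the augmented output, and the restriction to scalar layers, for which the operator norm is an absolute value. The dictionary `kappa` is keyed by `(j, i)` for $\kappa_{j\leftarrow i}$.

```python
def soft_indicator(value, kappa):
    """Assumed piecewise-linear shape: 1 if value <= kappa, 0 if value >= 2 * kappa."""
    if value <= kappa:
        return 1.0
    if value >= 2.0 * kappa:
        return 0.0
    return 2.0 - value / kappa

def augmented_output(loss, jacobians, kappa):
    """(loss - 1) * prod_{i <= j} soft_indicator(|D_j ... D_i|, kappa[(j, i)]) + 1."""
    q = len(jacobians)
    product = 1.0
    for i in range(1, q + 1):
        chain = 1.0
        for j in range(i, q + 1):
            chain *= jacobians[j - 1]          # scalar layers: D_j ... D_i is a product
            product *= soft_indicator(abs(chain), kappa[(j, i)])
    return (loss - 1.0) * product + 1.0
```

When every interlayer Jacobian is within its intended bound, the output equals the original loss; as soon as one of them reaches twice its bound, the output collapses to 1.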

Note that under these definitions, we finally get that the output function of $\widetilde{G}$ computes

[Lipschitz guarantees of augmented graphs] Let $\mathcal{G}$ be a family of sequential computational graphs. Suppose for any $G\in\mathcal{G}$ , the composition rule of the output node, $R_{O_{G}}$ , is $c_{i}$ -Lipschitz in variable $V_{i}$ for all $i$ , and it only outputs values in $[0,1]$ . Suppose that $DR_{V_{i}}$ is $\bar{\kappa}_{i}$ -Lipschitz for each $i$ . (Note that $DR_{V_{i}}$ maps a vector in space $\mathcal{D}_{V_{i-1}}$ to a linear operator that maps $\mathcal{D}_{V_{i-1}}$ to $\mathcal{D}_{V_{i}}$ .) Let $\kappa_{j\leftarrow i}$ (for $i\leq j$ ) be a set of parameters that we intend to use to control Jacobians in the Lipschitz augmentation. With them, we apply Lipschitz augmentation as defined in Definition 6.1 to every graph in $\mathcal{G}$ and obtain a new family of graphs, denoted by $\widetilde{\mathcal{G}}$ .

where for simplicity in the above expressions, we extend the definition of $\kappa$ ’s to $\kappa_{j-1\leftarrow j}=1$ .

Finally, we combine Theorems 5.3 and 6.2 to derive covering number bounds for any Lipschitz augmentation of sequential computational graphs. The final covering bound in (16) can be easily computed given covering number bounds for each individual function class. In Section 7, we use this theorem to derive Rademacher complexity bounds for neural networks. The proof is deferred to Section C. In Section E, we also use these tools to derive Rademacher complexity bounds for RNNs.

Consider any family $\mathcal{G}$ of sequential computational graphs satisfying the conditions of Theorem 6.2. By combining the augmentation of Definition 6.1 with additional indicators on the internal node norms, we can construct a new family $\widetilde{\mathcal{G}}$ of computational graphs which output

The family $\widetilde{\mathcal{G}}$ satisfies the following guarantees:

Each computational graph in $\widetilde{\mathcal{G}}$ upper bounds its counterpart in $\mathcal{G}$ , i.e. $O_{\widetilde{G}}(x)\geq O_{G}(x)$ .

where $D\mathfrak{R}_{V_{i}}$ denotes the family of total derivatives of functions in $\mathfrak{R}_{V_{i}}$ and $V_{0}$ the input vertex.

Application to Neural Networks

The below result follows from modeling the neural net loss as a sequential computational graph and using our augmentation procedure to make it Lipschitz in its nodes with parameters $\kappa^{\textup{hidden},(i)},\kappa^{\textup{jacobian},(i)}$ . Then we cover the augmented loss to bound its Rademacher complexity.

Assume that the activation $\phi$ is 1-Lipschitz with a $\bar{\sigma}_{\phi}$ -Lipschitz derivative. Fix reference matrices $\{A^{(i)}\}$ , $\{B^{(i)}\}$ . With probability $1-\delta$ over the random draws of the data $P_{n}$ , all neural networks $F$ with parameters $\{W^{(i)}\}$ and positive margin $\gamma$ satisfy:

where $\kappa^{\textup{jacobian},(i)}\triangleq\sum_{1\leq j\leq 2i-1\leq j^{\prime}\leq 2r-1}\frac{\sigma_{j^{\prime}\leftarrow 2i}\sigma_{2i-2\leftarrow j}}{\sigma_{j^{\prime}\leftarrow j}}$ , and $\kappa^{\textup{hidden},(i)}\triangleq\xi+\frac{\sigma_{2r-1\leftarrow 2i}}{\gamma}+\sum_{i\leq i^{\prime}<r}\frac{\sigma_{2i^{\prime}\leftarrow 2i}}{t^{(i^{\prime})}}+\sum_{1\leq j\leq j^{\prime}\leq 2r-1}\sum_{\begin{subarray}{c}j^{\prime\prime}=\max\{2i,j\},\\ j^{\prime\prime}\textup{ even }\end{subarray}}^{j^{\prime}}\frac{\bar{\sigma}_{\phi}\sigma_{j^{\prime}\leftarrow j^{\prime\prime}+1}\sigma_{j^{\prime\prime}-1\leftarrow 2i}\sigma_{j^{\prime\prime}-1\leftarrow j}}{\sigma_{j^{\prime}\leftarrow j}}$ .

In these expressions, we define $\sigma_{j-1\leftarrow j}=1$ , $\xi=\textup{poly}(r)^{-1}$ , and:

where $Q_{j^{\prime}\leftarrow j}$ computes the Jacobian of layer $j^{\prime}$ w.r.t. layer $j$ . Note that the training error here is 0 because of the existence of a positive margin $\gamma$ .

We note that our bound has no explicit dependence on width and instead depends on the $\|\cdot\|_{2,1},\|\cdot\|_{1,1}$ norms of the weights offset by reference matrices $\{A^{(i)}\},\{B^{(i)}\}$ . These norms can avoid scaling with the width of the network if the difference between the weights and reference matrices is sparse. The reference matrices $\{A^{(i)}\},\{B^{(i)}\}$ are useful if there is some prior belief before training about what weight matrices are learned, and they also appear in the bounds of Bartlett et al. (2017). In Section E, we also show that our techniques can easily be extended to provide generalization bounds for RNNs scaling polynomially in depth via the same quantities $t^{(i)},\sigma_{j^{\prime}\leftarrow j}$ .
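A small numerical sketch of the sparsity remark above. It assumes the convention that $\|M\|_{2,1}$ sums the Euclidean norms of the columns of $M$, so $\|M^{\top}\|_{2,1}$ sums the Euclidean norms of the rows of $M$, and that $\|M\|_{1,1}$ sums the absolute values of all entries. The helper names and the choice of an identity reference matrix are illustrative only.

```python
import math


def norm_2_1_of_transpose(M):
    """||M^T||_{2,1} under the assumed convention: sum of Euclidean norms of the rows of M."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


def norm_1_1(M):
    """||M||_{1,1}: sum of absolute values of all entries."""
    return sum(abs(v) for row in M for v in row)


def offset(W, A):
    """Entrywise difference W - A between weights and a reference matrix."""
    return [[w - a for w, a in zip(rw, ra)] for rw, ra in zip(W, A)]


def demo(width):
    # Reference matrix: identity. Weights: identity with a single perturbed entry.
    A = [[1.0 if i == j else 0.0 for j in range(width)] for i in range(width)]
    W = [row[:] for row in A]
    W[0][1] += 0.5
    D = offset(W, A)
    return norm_2_1_of_transpose(D), norm_1_1(D), norm_1_1(W)


for width in (4, 64):
    print(width, demo(width))
```

In this toy case the offset norms stay at 0.5 for every width, while the un-offset $\|W\|_{1,1}$ grows linearly with the width.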

Experiments

Though the main purpose of the paper is to study the data-dependent generalization bounds from a theoretical perspective, we provide preliminary experiments demonstrating that the proposed complexity measure and generalization bounds are empirically relevant. We show that regularizing the complexity measure leads to better test accuracy. Inspired by Theorem 7.1, we directly regularize the Jacobian of the classification margin w.r.t outputs of normalization layers and after residual blocks. Our reasoning is that normalization layers control the hidden layer norms, so additionally regularizing the Jacobians results in regularization of the product, which appears in our bound. We find that this is effective for improving test accuracy in a variety of settings. We note that Sokolić et al. (2017) show positive experimental results for a similar regularization technique in data-limited settings.

Suppose that $m(F(x),y)=[F(x)]_{y}-\max_{j\neq y}[F(x)]_{j}$ denotes the margin of the network for example $(x,y)$ . Letting $h^{(i)}$ denote some hidden layer of the network, we define the notation $J^{(i)}\triangleq\frac{\partial}{\partial h^{(i)}}m(F(x),y)$ and use training objective

where $l$ denotes the standard cross entropy loss, and $\lambda,\sigma$ are hyperparameters. Note the Jacobian is taken with respect to a scalar output and therefore is a vector, so it is easy to compute.

For a WideResNet16 (Zagoruyko and Komodakis, 2016) architecture, we train using the above objective. The threshold on the Frobenius norm in the regularization is inspired by the truncations in our augmented loss (in all our experiments, we choose $\sigma=0.1$ ). We tune the coefficient $\lambda$ as a hyperparameter. In our experiments, we took the regularized indices $i$ to be the last layers in each residual block as well as layers in residual blocks following a BatchNorm in the standard WideResNet16 architecture. In the LayerNorm setting, we simply replaced BatchNorm layers with LayerNorm. The remaining hyperparameter settings are standard for WideResNet; for additional details see Section G.1.
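To make the shape of such a regularizer concrete, here is a minimal sketch for a toy model whose logits are a linear readout $Vh$ of one hidden vector $h$, so the Jacobian of the margin with respect to $h$ is available in closed form. The exact form of the penalty is an assumption: the sketch adds $\lambda\|J\|_{F}^{2}$ only when $\|J\|_{F}^{2}\geq\sigma$, which is one plausible reading of a thresholded Frobenius-norm penalty. The function names are hypothetical and this is not the experimental code.

```python
import math


def margin_jacobian(V, h, y):
    """Gradient of m = logits[y] - max_{j != y} logits[j] w.r.t. h, where logits = V h."""
    logits = [sum(v * x for v, x in zip(row, h)) for row in V]
    runner_up = max((j for j in range(len(V)) if j != y), key=lambda j: logits[j])
    jac = [a - b for a, b in zip(V[y], V[runner_up])]
    return jac, logits


def cross_entropy(logits, y):
    """Numerically stable softmax cross entropy for a single example."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_z - logits[y]


def regularized_loss(V, h, y, lam, sigma):
    jac, logits = margin_jacobian(V, h, y)
    sq_norm = sum(g * g for g in jac)
    # Assumed thresholding: the penalty is active only once the squared norm reaches sigma.
    penalty = sq_norm if sq_norm >= sigma else 0.0
    return cross_entropy(logits, y) + lam * penalty


V = [[1.0, 0.0], [0.0, 1.0]]
h = [2.0, 1.0]
print(regularized_loss(V, h, 0, lam=0.5, sigma=0.1))
```

Because the margin is a scalar, its Jacobian with respect to a hidden layer is a single vector, so in a deep network one backward pass per regularized layer suffices.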

Figure 1 shows the results for models trained and tested on CIFAR10 in low learning rate and no data augmentation settings, which are settings where generalization typically suffers. We also experiment with replacing BatchNorm layers with LayerNorm and additionally regularizing the Jacobian. We observe improvements in test error for all these settings. In Section G.2, we empirically demonstrate that our complexity measure indeed avoids the exponential scaling in depth for a WideResNet model trained on CIFAR10.

Conclusion

In this paper, we tackle the question of how data-dependent properties affect generalization. We prove tighter generalization bounds that depend polynomially on the hidden layer norms and norms of the interlayer Jacobians. To prove these bounds, we work with the abstraction of computational graphs and develop general tools to augment any sequential family of computational graphs into a Lipschitz family and then cover this Lipschitz family. This augmentation and covering procedure applies to any sequence of function compositions. An interesting direction for future work is to generalize our techniques to arbitrary computational graph structures. In follow-up work (Wei and Ma, 2019), we develop a simpler technique to derive Jacobian-based generalization bounds for both robust and clean accuracy, and we present an algorithm inspired by this theory which empirically improves performance over strong baselines.

Acknowledgments

CW was supported by an NSF Graduate Research Fellowship. Toyota Research Institute (TRI) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

References

Appendix A Missing Proofs for Section 7

We first elaborate more on the notation introduced in Section 7. First, by our indexing, matrix $W^{(i)}$ will be applied in layer $2i-1$ of the network, and even layers $2i$ apply $\phi$ . We let $F_{j^{\prime}\leftarrow j}$ denote the function computed between layers $j$ and $j^{\prime}$ and $Q_{j^{\prime}\leftarrow j}=DF_{j^{\prime}\leftarrow j}\circ F_{j-1\leftarrow 1}$ denote the layer $j$ -to- $j^{\prime}$ Jacobian. By our definition of $F_{j^{\prime}\leftarrow j}$ , $F_{2j\leftarrow 2j}=\phi$ , $F_{2j-1\leftarrow 2j-1}=h\mapsto W^{(j)}h$ , and $F_{j^{\prime}\leftarrow j}$ is recursively computed by $F_{j^{\prime}\leftarrow j^{\prime}}\circ F_{j^{\prime}-1\leftarrow j}$ for $j^{\prime}>j$ . We will use the convention that $F_{j-1\leftarrow j}$ computes the identity mapping for all $j$ .
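The indexing convention can be sketched in a few lines of code. The example assumes a network with $r=2$ weight matrices (depth $2r-1=3$) and a ReLU activation. The weights and the helper names `apply_layer` and `F` are illustrative choices, not part of the paper.

```python
def apply_layer(j, h, weights, phi):
    """Layer j: odd j = 2i-1 multiplies by W^(i); even j = 2i applies phi elementwise."""
    if j % 2 == 1:
        W = weights[(j + 1) // 2 - 1]  # W^(i) with i = (j+1)/2, stored 0-indexed
        return [sum(w * x for w, x in zip(row, h)) for row in W]
    return [phi(x) for x in h]


def F(j_hi, j_lo, h, weights, phi):
    """F_{j_hi <- j_lo}: apply layers j_lo, ..., j_hi in order; identity when j_hi < j_lo."""
    for j in range(j_lo, j_hi + 1):
        h = apply_layer(j, h, weights, phi)
    return h


relu = lambda v: max(v, 0.0)
weights = [[[1.0, -1.0], [0.0, 2.0]],  # W^(1), applied in layer 1
           [[1.0, 1.0]]]               # W^(2), applied in layer 3
x = [1.0, 2.0]
print(F(3, 1, x, weights, relu))                          # full network F_{3<-1}
print(F(3, 3, F(2, 1, x, weights, relu), weights, relu))  # same value via the recursion
```

The second call mirrors the recursion $F_{j^{\prime}\leftarrow j}=F_{j^{\prime}\leftarrow j^{\prime}}\circ F_{j^{\prime}-1\leftarrow j}$, and `F(0, 1, x, ...)` returns `x` unchanged, matching the identity convention.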

$P$ will denote a test distribution over examples $x$ and labels $y$ , and $P_{n}$ will denote the distribution on training examples.

For a class of real-valued functions $\mathcal{L}$ and dataset $P_{n}$ , define the empirical Rademacher complexity of this function class by

Assume that the activation $\phi$ is $1$ -Lipschitz with $\bar{\sigma}_{\phi}$ -Lipschitz derivative. Fix parameters $\sigma_{j^{\prime}\leftarrow j}$ , $t^{(i)}$ , $a^{(i)}$ , $b^{(i)}$ , $\gamma$ and reference matrices $\{A^{(i)}\}$ , $\{B^{(i)}\}$ . With probability $1-\delta$ over the random draws of the distribution $P_{n}$ , all neural networks $F$ with parameters $\{W^{(i)}\}$ satisfying the following data-dependent conditions:

Hidden layer norms are controlled: $\max_{x\in P_{n}}\|F_{2i\leftarrow 1}(x)\|\leq t^{(i)}\ \forall 1\leq i\leq r$ .

Jacobians are balanced: $\max_{x\in P_{n}}\|Q_{j^{\prime}\leftarrow j}(x)\|_{\textup{op}}\leq\sigma_{j^{\prime}\leftarrow j}\ \forall j<j^{\prime}$ .

The margin is large: $\min_{(x,y)\in P_{n}}[F(x)]_{y}-\max_{y^{\prime}\neq y}[F(x)]_{y^{\prime}}\geq\gamma>0$ .

and the additional data-independent condition

will have the following generalization to test data:

Here we use the convention that $\sigma_{j-1\leftarrow j}=1$ and let $t^{(0)}=\max_{x\in P_{n}}\|x\|$ .

This generalization bound follows straightforwardly via the below Rademacher complexity bound for the augmented loss class:

Suppose that $\phi$ is $1$ -Lipschitz with $\bar{\sigma}_{\phi}$ -Lipschitz derivative. Define the following class of neural networks with norm bounds on its weight matrices with respect to reference matrices $\{A^{(i)}\},\{B^{(i)}\}$ :

Fix parameters $t^{(i)}$ and $\sigma_{j^{\prime}\leftarrow j}$ for $j^{\prime}\geq j$ with $\sigma_{2i\leftarrow 2i}=1$ and $\sigma_{2i-1\leftarrow 2i-1}=\sigma^{(i)}$ . When we apply this theorem, we will choose $\sigma_{j^{\prime}\leftarrow j}$ and $t^{(i)}$ which upper bound the layer $j$ to $j^{\prime}$ Jacobian norm and $i$ -th hidden layer norm, respectively. Define the class of augmented losses

and define for $1\leq i\leq r$ , $\kappa^{\textup{jacobian},(i)},\kappa^{\textup{hidden},(i)}$ meant to bound the influence of the matrix $W^{(i)}$ on the Jacobians and hidden variables, respectively as in (18), (19). Then we can bound the empirical Rademacher complexity of the augmented loss class by

We associate the un-augmented loss class on neural networks $l_{\gamma}\circ\mathcal{F}$ with a family of sequential computation graphs $\mathcal{G}$ with depth $2r-1$ . The composition rules are as follows: for internal node $V_{2i}$ , $\mathfrak{R}_{V_{2i}}=\{\phi\}$ , the set with only one element: the activation $\phi$ . We also let $\mathfrak{R}_{V_{2i-1}}=\{h\mapsto Wh:\|W^{\top}-{A^{(i)}}^{\top}\|_{2,1}\leq a^{(i)},\|W-{B^{(i)}}\|_{1,1}\leq b^{(i)},\|W\|_{\textup{op}}\leq\sigma^{(i)}\}$ . Finally, we choose $\mathfrak{R}_{O}$ to be the singleton class $\{l_{\gamma}\}$ . Our collection of computation rules is then simply $\mathfrak{R}=\mathfrak{R}_{V_{1}}\otimes\cdots\otimes\mathfrak{R}_{V_{2r-1}}\otimes\mathfrak{R}_{O}$ . Since $O_{\mathcal{G}}$ takes values in $[0,1]$ , we can apply Theorem 6.3 on this class $\mathcal{G}$ using $s_{\mathcal{I}}=\max_{x\in P_{n}}\|x\|$ , $s_{V_{2i}}=t^{(i)}$ , $s_{V_{2i-1}}=\infty$ , $\kappa_{2i\leftarrow 2i}=1$ , $\kappa_{2i-1\leftarrow 2i-1}=\sigma^{(i)}$ , and $\kappa_{j^{\prime}\leftarrow j}=\sigma_{j^{\prime}\leftarrow j}$ for $j^{\prime}>j$ . Furthermore, we note that $\bar{\kappa}_{2i}=\bar{\sigma}_{\phi}$ , and $\bar{\kappa}_{2i-1}=0$ as the Jacobian is constant for matrix multiplications. We thus obtain the class $\widetilde{\mathcal{G}}$ where each augmented loss upper bounds the corresponding loss in $\mathcal{G}$ . Recall that $J_{i}$ denotes the additional nodes in our augmented computation graph. Note that under these choices of $s_{V_{2i-1}}$ , $\kappa_{i\leftarrow i}$ , we get that

We first note that the last term in (20) is simply 0 because there is exactly one output function in $\mathfrak{R}_{O}$ . Now for the other terms of (20): by definition $\mathfrak{R}_{V_{2i}}$ , $\mathfrak{R}_{J_{2i}}$ each consist of a singleton set and therefore have log cover size 0 for any error resolution $\epsilon$ . Otherwise, to cover $\mathfrak{R}_{V_{2i-1}}$ it suffices to bound $\log\mathcal{N}(\epsilon_{V_{2i-1}},\{h\mapsto Wh:\|W^{\top}-{A^{(i)}}^{\top}\|_{2,1}\leq a^{(i)}\},2t^{(i-1)})$ . Thus, we can apply Lemma A.3 to obtain

Thus, substituting terms into (20) and collecting sums, we obtain that

Now we apply Dudley’s entropy theorem to obtain that

We start with Theorem A.2, which bounds the Rademacher complexity of the augmented loss class $\mathcal{L}_{\textup{aug}}$ . Using $l_{\textup{aug}}(F,x,y)$ to denote the application of this augmented loss on the network $F$ , its weights, and data $(x,y)$ , we first note that $l_{\textup{0-1}}(F(x),y)\leq l_{\gamma}(F(x),y)\leq l_{\textup{aug}}(F,x,y)$ for any datapoint $(x,y)$ . We used the fact that margin loss upper bounds 0-1 loss, and $l_{\textup{aug}}$ upper bounds margin loss by the construction in Theorem 6.3. Thus, applying the standard Rademacher generalization bound, with probability $1-\delta$ over the training data, it holds that

Plugging in the bound on $\textup{Rad}_{n}(\mathcal{L}_{\textup{aug}})$ from Theorem A.2 gives the desired result. ∎

Finally, to prove Theorems 7.1 and 1.1, we simply take a union bound over the choices of parameters $\sigma_{j^{\prime}\leftarrow j},t^{(i)},a^{(i)},b^{(i)}$ .

We will apply Theorem A.1 repeatedly over a grid of parameter choices $t^{(i)}$ , $\sigma_{j^{\prime}\leftarrow j}$ , $a^{(i)}$ , $b^{(i)}$ (following a technique of Bartlett et al. (2017)). For a collection $\mathcal{M}$ of nonnegative integers $m_{t}^{(i)}$ , $m_{\sigma}^{(j^{\prime}\leftarrow j)}$ , $m_{a}^{(i)}$ , $m_{b}^{(i)}$ , $m_{\gamma}$ , we apply Theorem A.1 choosing $t^{(i)}=\textup{poly}(r)^{-1}2^{m_{t}^{(i)}}$ , $\sigma_{j^{\prime}\leftarrow j}=\textup{poly}(r)^{-1}2^{m_{\sigma}^{(j^{\prime}\leftarrow j)}}$ , $a^{(i)}=\textup{poly}(r)^{-1}2^{m_{a}^{(i)}}$ , $b^{(i)}=\textup{poly}(r)^{-1}2^{m_{b}^{(i)}}$ , $\gamma=2^{-m_{\gamma}}\textup{poly}(r)\max_{i}\sigma_{2r-1\leftarrow 2i}$ and using error probability $\delta_{\mathcal{M}}\triangleq\frac{\delta}{2^{\sum_{m\in\mathcal{M}}(m+1)}}$ . First, we note that by union bound, using the fact that $\sum_{\textup{choices of }\mathcal{M}}\frac{\delta}{2^{\sum_{m\in\mathcal{M}}(m+1)}}=\delta$ where each element of $\mathcal{M}$ ranges over the nonnegative integers, we get that the generalization bound of Theorem A.1 holds simultaneously for all choices of $\mathcal{M}$ with probability $1-\delta$ .
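For completeness, the identity behind the union bound can be checked directly. The short derivation below reads the exponent as $\sum_{m\in\mathcal{M}}(m+1)$, which is the reading under which the failure probabilities sum to exactly $\delta$, and writes $k$ for the number of integers in the collection $\mathcal{M}$ (the symbols $k$ and $m_{\ell}$ are introduced here for illustration only).

```latex
\sum_{m_1=0}^{\infty}\cdots\sum_{m_k=0}^{\infty}
  \frac{\delta}{2^{\sum_{\ell=1}^{k}(m_\ell+1)}}
  \;=\; \delta\prod_{\ell=1}^{k}\Bigl(\sum_{m_\ell=0}^{\infty}2^{-(m_\ell+1)}\Bigr)
  \;=\; \delta\prod_{\ell=1}^{k}\Bigl(\tfrac12+\tfrac14+\tfrac18+\cdots\Bigr)
  \;=\; \delta .
```

Each factor is a geometric series summing to 1, so the total failure probability over the whole grid is $\delta$ regardless of how many parameters are discretized.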

Now for the network $F$ at hand, there would have been some choice of $\mathcal{M}$ for which the bound was applied using parameters $\hat{t}^{(i)}$ , $\hat{\sigma}_{j^{\prime}\leftarrow j}$ , $\hat{a}^{(i)}$ , $\hat{b}^{(i)}$ , $\hat{\gamma}$ and

The proof of the simpler Theorem 1.1 follows the same argument. The only difference is that we union bound over parameters $\sigma,t$ and the matrix norms. ∎

Let $s=\max_{x\in P_{n}}\|x\|$ be an upper bound on the largest norm of a datapoint. Then the following bound relates Rademacher complexity to covering numbers:

Appendix B Missing Proofs in Section 5

We prove the theorem by induction on the number of non-input vertices in the vertex set $\mathcal{V}$ . The statement is true if $O$ is the only non-input node in the graph: to cover the graph output with error $\epsilon_{O}$ , we simply cover $\mathfrak{R}_{O}$ .

Given a family of graphs $\mathcal{G}$ (with shared edges $\mathcal{E}$ and nodes $\mathcal{V}$ ), we assume the inductive hypothesis that “for any family of graphs with more than $|\mathcal{I}|$ input vertices, the theorem statement holds.” Under this hypothesis, we will show that the theorem statement holds for the graph family $\mathcal{G}$ .

We take node $V_{1}$ from the forest ordering $(V_{1},\dots,V_{m})$ assumed in the theorem. Suppose $V_{1}$ depends on $C_{1},\dots,C_{t}$ , which are assumed to be the input nodes by the definition of forest ordering. We release the node $V_{1}$ from the graph and obtain a new family $\mathcal{G}^{\backslash V_{1}}=\{G^{\backslash V_{1}}:G\in\mathcal{G}\}$ with a smaller number of edges than that of $\mathcal{G}$ .

Define $u(h,x)\triangleq O_{G^{\backslash V_{1}}}(h,x)$ for $h\in\mathcal{D}_{V_{1}}$ and $x\in\mathcal{D}_{\mathcal{I}}$ , and $w(x)=V_{1}(x)$ . Then we can check that $u(w(x),x)=O_{G}(x)$ . Let $\mathcal{U}=\{O_{G^{\backslash V_{1}}}:G\in\mathcal{G}\}$ , and let $\mathcal{W}=\mathfrak{R}_{V_{1}}$ . As each function in $\mathcal{U}$ is $\kappa_{V_{1}}$ -Lipschitz in $V_{1}$ because of condition 1, and it equals the fixed constant $c$ if ${|\kern-1.07639pt|\kern-1.07639pt|V_{1}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{V_{1}}$ or ${|\kern-1.07639pt|\kern-1.07639pt|C_{i}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{C_{i}}$ , we have that $\mathcal{U},\mathcal{W}$ satisfy the conditions of the composition lemma (see Lemma B.1). With the lemma, we conclude:

Note that by the definition of forest ordering, we have that $(V_{2},\dots,V_{m})$ is a forest ordering of $G^{\backslash V_{1}}$ and by the assumption 1 of the theorem, we have that $(V_{2},\dots,V_{m})$ satisfies the condition 1 for the graph family $\mathcal{G}^{\backslash V_{1}}$ . $\mathcal{G}^{\backslash V_{1}}$ has one more input node than $\mathcal{G}$ , so we can invoke the inductive hypothesis on $\mathcal{G}^{\backslash V_{1}}$ and obtain

Combining equation (23) and (24) above, we prove (14) for $\mathcal{G}$ , and complete the induction. ∎

Below we provide the composition lemma necessary for Theorem 5.3.

is a family of functions with two arguments and $\mathcal{W}\subseteq\{x^{(1)},\ldots,x^{(m)}\in\mathcal{D}_{x}^{(1)}\otimes\cdots\otimes\mathcal{D}_{x}^{(m)}\mapsto\mathcal{D}_{h}\}$ is another family of functions. We overload notation and refer to $x^{(1)},\ldots,x^{(m)}$ as $x$ . The spaces $\mathcal{D}_{h},\mathcal{D}_{x},\mathcal{D}_{u}$ all associate with some norms ${|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}$ (the norms can potentially be different for each space, but we use the same notation for all of them.) Assume the following:

All functions in $\mathcal{U}$ are $\kappa$ -Lipschitz in the argument $h$ for any possible choice of $x$ : for any $u\in\mathcal{U}$ , $x\in\mathcal{D}_{x}$ , and $h,h^{\prime}\in\mathcal{D}_{h}$ , we have ${|\kern-1.07639pt|\kern-1.07639pt|u(h,x)-u(h^{\prime},x)|\kern-1.07639pt|\kern-1.07639pt|}\leq\kappa{|\kern-1.07639pt|\kern-1.07639pt|h-h^{\prime}|\kern-1.07639pt|\kern-1.07639pt|}$ .

Any function $u\in\mathcal{U}$ collapses on inputs with large norms: there exists a constant $b$ such that $u(h,x)=b$ if ${|\kern-1.07639pt|\kern-1.07639pt|h|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{h}$ or ${|\kern-1.07639pt|\kern-1.07639pt|x^{(i)}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{x}^{(i)}$ for any $i$ .

Then, the family of the composition of $u$ and $w$ , $\mathcal{Z}=\left\{z(x)=u(w(x),x):u\in\mathcal{U},w\in\mathcal{W}\right\}$ , has covering number bound:

When it is clear from context, we let ${|\kern-1.07639pt|\kern-1.07639pt|x|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}$ denote the statement that ${|\kern-1.07639pt|\kern-1.07639pt|x^{(i)}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}^{(i)}\ \forall i$ . Suppose $P_{n}$ is a uniform distribution over $n$ data points $\{x_{1},\dots,x_{n}\}\subset\mathcal{D}_{x}$ with norms not larger than $s_{x}$ . Given function $u\in\mathcal{U}$ and $w\in\mathcal{W}$ , we will construct a pair of functions such that $\hat{u}(\hat{w}(x),x)$ covers $u(w(x),x)$ . At the end of the proof, we will count (in a straightforward way) how many distinct pairs of functions we have constructed over all the $(u,w)$ pairs.

Let $P^{\prime}$ be the uniform distribution over $\{x_{i}:{|\kern-1.07639pt|\kern-1.07639pt|x_{i}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}\}$ , and suppose $\hat{\mathcal{W}}$ is a $\epsilon_{w}\sqrt{\frac{n}{|\text{supp}(P^{\prime})|}}$ error cover of $\mathcal{W}$ with respect to the metric $L_{2}(P^{\prime},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|})$ . We note that $\hat{\mathcal{W}}$ has size at most $\mathcal{N}(\epsilon_{w},\mathcal{W},s_{x})$ . We then pick $\hat{w}\in\mathcal{W}$ such that $\hat{w}$ is $\epsilon_{w}$ -close to $w$ in metric $L_{2}(P^{\prime},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|})$ . Let $\hat{h}_{i}$ denote $\hat{w}(x_{i})$ . Let $Q^{\prime}$ be the uniform distribution over $\{(\hat{h}_{i},x_{i}):{|\kern-1.07639pt|\kern-1.07639pt|\hat{h}_{i}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{h},{|\kern-1.07639pt|\kern-1.07639pt|x_{i}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}\}$ , and let $Q$ be the uniform distribution over all $n$ points, $\{(\hat{h}_{1},x_{1}),\ldots,(\hat{h}_{n},x_{n})\}$ . Now we construct an intermediate cover $\widehat{\mathcal{U}}^{\prime}$ (that depends on $\hat{w}$ implicitly) that covers $\mathcal{U}$ with $\epsilon_{u}\sqrt{\frac{n}{|\text{supp}(Q^{\prime})|}}$ error with respect to the metric $L_{2}(Q^{\prime},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|})$ . We augment this to a cover $\widehat{\mathcal{U}}$ that covers $\mathcal{U}$ with respect to metric $L_{2}(Q,{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|})$ as follows: for every $\hat{u}^{\prime}\in\widehat{\mathcal{U}}^{\prime}$ , add the function $\hat{u}$ to $\widehat{\mathcal{U}}$ with

Note that by construction, the size of $\widehat{\mathcal{U}}$ is at most $\mathcal{N}(\epsilon_{u},\mathcal{U},(s_{h},s_{x}))$ . Now let $\hat{u}^{\prime}\in\widehat{\mathcal{U}}^{\prime}$ be the cover element for $u$ w.r.t. $L_{2}(Q,{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|})$ , and $\hat{u}$ be the corresponding cover element in $\widehat{\mathcal{U}}$ . Because $\hat{u}(\hat{h},x)=b=u(\hat{h},x)$ when ${|\kern-1.07639pt|\kern-1.07639pt|\hat{h}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{h}$ or ${|\kern-1.07639pt|\kern-1.07639pt|x^{(i)}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{x}^{(i)}$ for some $i$ ,

Then we bound the difference between $u(\hat{h},x)$ and $u(h,x)$ by Lipschitzness; since $u(\hat{h},x)=u(h,x)=b$ when ${|\kern-1.07639pt|\kern-1.07639pt|x|\kern-1.07639pt|\kern-1.07639pt|}>s_{x}$ ,

where in the last step we used the property of the cover $\hat{\mathcal{W}}$ . Finally, by triangle inequality, we get that

Finally, we count how many pairs $(\hat{w},\hat{u})$ we have constructed: $\hat{\mathcal{W}}$ is of size at most $\mathcal{N}(\epsilon_{w},\mathcal{W},s_{x})$ , and for every $\hat{w}\in\hat{\mathcal{W}}$ , we have constructed a family of functions $\widehat{\mathcal{U}}$ (that depends on $\hat{w}$ ) of size at most $\mathcal{N}(\epsilon_{u},\mathcal{U},(s_{h},s_{x}))$ . Therefore, the total size of the cover is at most $\mathcal{N}(\epsilon_{w},\mathcal{W},s_{x})\cdot\mathcal{N}(\epsilon_{u},\mathcal{U},(s_{h},s_{x}))$ . ∎

Appendix C Missing Proofs in Section 6

We first state the proofs of Theorem 6.2 and Theorem 6.3, which follow straightforwardly from the technical tools developed in Section D.

Now we prove release-Lipschitzness for a prefix sequence $\mathcal{S}^{\prime}$ of $\mathcal{S}$ that ends in node $J_{i}$ . For all $j\neq i$ , fix $D_{j}\in\mathcal{D}_{J_{j}}$ . It suffices to show that the function $Q$ defined by

We first construct an augmented family of graphs $\mathcal{G}^{\prime}$ sharing the same vertices and edges as $\mathcal{G}$ . For $G\in\mathcal{G}$ , we add $G^{\prime}$ to $\mathcal{G}^{\prime}$ computing

This is achieved by modifying the family of output rules as follows:

Now all terms match (16) except for the term $\log\mathcal{N}(\epsilon_{O},\widetilde{\mathfrak{R}}_{O},\{2s_{V_{i}}\}\cup\{I\}\cup\{2s_{J_{i}}\}_{i\geq 1})$ . First, we note that all functions in $\widetilde{\mathfrak{R}}_{O}$ can be written in the form

This allows us to conclude (16). Finally, we note that as the augmentation operations are in the form of those considered in Claim H.1, it follows that $O_{\widetilde{G}}$ upper bounds $O_{G}$ . ∎

Appendix D Technical Tools for Lipschitz Augmentation

In this section, we develop the technical tools needed for proving Theorem 6.2. The main result in this section is our Lemma D.1, which essentially states that augmenting the loss with a product of Jacobians (plus additional matrices meant to model previous Jacobian nodes already released from the computational graph) will make the loss Lipschitz.

For this section, we say a function $J$ taking input $x\in\mathcal{D}$ and outputting an operator mapping $\mathcal{D}$ to $\mathcal{D}^{\prime}$ is $\kappa$ -Lipschitz if $\|J(x)-J(x^{\prime})\|_{\textup{op}}\leq\kappa\|x-x^{\prime}\|$ for any $x,x^{\prime}$ in its input domain. We will consider functions $f_{1},\ldots,f_{k}$ , where $f_{i}:\mathcal{D}_{i-1}\rightarrow\mathcal{D}_{i}$ and $\mathcal{D}_{0}$ is a compact subset of some normed space. For ease of notation, we use $\|\cdot\|$ to denote the (possibly distinct) norms on $\mathcal{D}_{0},\ldots,\mathcal{D}_{k}$ . For $1\leq i\leq j\leq k$ , let $f_{j\leftarrow i}:\mathcal{D}_{i-1}\rightarrow\mathcal{D}_{j}$ denote the composition

For convenience in indexing, for $(i,j)$ with $i>j$ , we will set $f_{j\leftarrow i}:\mathcal{D}_{i-1}\rightarrow\mathcal{D}_{i-1}$ to be the identity function.

Finally, consider a real-valued function $g:\mathcal{D}_{0}\otimes\cdots\otimes\mathcal{D}_{k}\rightarrow[0,1]$ and define the composition $z:\mathcal{D}_{0}\mapsto[0,1]$ by

We will construct a “Lipschitz-fication” for the function $z$ .

Let $A_{1},\ldots,A_{m}$ denote a collection of linear operators that map to the space $\mathcal{D}_{0}$ . We will furthermore use $J_{j\leftarrow i}$ to denote the $i$ -to- $j$ Jacobian, i.e.

When $i=1$ and $0\leq j\leq k$ , we will also consider products between $1$ -to- $j$ Jacobians and the matrices $A_{m^{\prime}}$ : define

Note in particular that $J_{0\leftarrow 1,m^{\prime}}=A_{m^{\prime}}$ .

[Lipschitz-fication] Following the notation in this section, suppose that $g$ is $c_{k^{\prime}}$ -Lipschitz in its $(k^{\prime}+1)$ -th argument for $0\leq k^{\prime}\leq k$ . Suppose that $Df_{j\leftarrow j}$ is $\bar{\tau}_{j}$ -Lipschitz for all $1\leq j\leq k$ . For any $(i,j)$ with $1\leq i\leq j\leq k$ , let $\tau_{j\leftarrow i}$ be parameters that are intended to be a tight bound on $\|J_{j\leftarrow i}\|_{\textup{op}}$ , and also define $\tau_{j\leftarrow 1,m^{\prime}}$ which will bound $\|J_{j\leftarrow 1,m^{\prime}}\|_{\textup{op}}$ . Define the augmented function $\bar{z}:\mathcal{D}_{0}\mapsto[0,1]$ by

We define the following order $\succ_{\mathcal{Q}}$ on this collection of functions:

Now note that by Claim D.7, we have the bound

Define $\bar{\tau}$ to be the Lipschitz constant of $J_{j\leftarrow i}$ on $\mathcal{D}_{0}$ for all $1\leq i\leq j\leq k$ guaranteed by Claim D.6. First, note that $\Delta_{0\leftarrow 1,m^{\prime}}=0$ for all $m^{\prime}$ . Thus, by Claims D.4 and D.5, it follows that

Now note that if $\|\nu\|\leq\frac{2\min_{i\leq j}\tau_{j\leftarrow i}}{\bar{\tau}}$ , then it follows that $2\tau_{j\leftarrow i}+\frac{\bar{\tau}}{2}\|\nu\|\leq 3\tau_{j\leftarrow i}\forall i\leq j$ . Substituting into (30), we get that $\forall x,\|\nu\|\leq\frac{2\min_{i\leq j}\tau_{j\leftarrow i}}{\bar{\tau}}$ ,

In the setting of Lemma D.1, for $1\leq i\leq j\leq k$ , we can expand the error $J_{j\leftarrow i}(x)-J_{j\leftarrow i}(x+\nu)$ as follows:

Furthermore, for $1\leq j\leq k,m^{\prime}$ , we can expand the error $J_{j\leftarrow 1,m^{\prime}}(x)-J_{j\leftarrow 1,m^{\prime}}(x+\nu)$ as follows:

We will first show (31) by inducting on $j-i$ . The base case $j=i$ follows by definition, as we can reduce $J_{i\leftarrow i+1}$ and $J_{i-1\leftarrow i}$ to constant-valued functions that output the identity matrix.

For the inductive step, we use Claim H.2 to expand

To prove (32), we first note that by definition, $J_{j\leftarrow 1,m^{\prime}}(x)=J_{j\leftarrow 1}(x)J_{0\leftarrow 1,m^{\prime}}$ , so

In the setting of Lemma D.1, suppose that $J_{j\leftarrow i}$ is $\bar{\tau}$ -Lipschitz for all $1\leq i\leq j\leq k$ . Then we can bound the operator norm error in the Jacobian by

Likewise, we can bound the operator norm error in the product between Jacobian and auxiliary matrices by

We will first prove (34), as the proof of (35) is nearly identical. Starting from (31) of Claim D.2, we have

By triangle inequality and the fact that $J_{j^{\prime}\leftarrow i^{\prime}}$ is $\bar{\tau}$ -Lipschitz $\forall i^{\prime}\leq j^{\prime}$ , it follows that

Plugging the above into (37), we get (34). To prove (35), we start from (32) and follow the same steps as above. ∎

In the setting of Lemma D.1, suppose that $J_{j\leftarrow i}$ is $\bar{\tau}$ -Lipschitz for all $1\leq i\leq j\leq k$ . Then we can upper bound the error terms corresponding to the indicators by

Likewise, the following upper bound holds for all $(j,m^{\prime})$ with $1\leq j\leq k,1\leq m^{\prime}\leq m$ :

Plugging this into our definition for $\Delta_{j\leftarrow i}$ (28), it follows that

Note that if $x\notin\mathcal{E}$ , then $\exists i^{\prime}<j^{\prime}$ such that $Q_{j^{\prime}\leftarrow i^{\prime}}(x)=0$ and $Q_{j^{\prime}\leftarrow i^{\prime}}\succ_{\mathcal{Q}}Q_{j\leftarrow i}$ by definition of the order $\succ_{\mathcal{Q}}$ . It follows that if $x\notin\mathcal{E}$ , $\prod_{h\succ_{\mathcal{Q}}Q_{j\leftarrow i}}h(x)=0$ , so $|\Delta_{j\leftarrow i}(x,\nu)|=0$ . Otherwise, if $x\in\mathcal{E}$ , by Claim D.3 we have

where we recall that $\tau_{i-1\leftarrow i}=1$ . Plugging this into (40) and using the fact that all functions $h\in\mathcal{Q}$ are bounded by 1 gives the desired statement.

To prove (39), we simply apply the above argument with (35). ∎

In the setting of Lemma D.1, fix index $j$ with $0\leq j\leq k$ and suppose that $J_{j\leftarrow 1}$ is $\bar{\tau}$ -Lipschitz. Then we can bound the error due to function composition by

Starting from (27), we can first express $\delta_{i}(x,\nu)$ by

as $Q_{j\leftarrow 1}\succ_{\mathcal{Q}}Q_{j\leftarrow 1,1}$ . First we note that by definition, $|\gamma_{j}(x,\nu)|\leq c_{j}\|f_{j\leftarrow 1}(x)-f_{j\leftarrow 1}(x+\nu)\|$ , as the function $g$ is $c_{j}$ -Lipschitz in its $(j+1)$ -th argument. Thus, since all functions $Q\in\mathcal{Q}$ are bounded by $1$ , it follows that

In the setting of Lemma D.1, $\exists\bar{\tau}$ such that $\forall i\leq j$ , $J_{j\leftarrow i}$ is $\bar{\tau}$ -Lipschitz on a compact domain $\mathcal{D}_{0}$ .

We first show inductively that $f_{i\leftarrow 1}$ is Lipschitz for all $i$ . The base case $f_{1\leftarrow 1}$ follows by definition, as $f_{1\leftarrow 1}$ is continuously differentiable and $\mathcal{D}_{0}$ is a compact set.

Now we show the inductive step: first write $f_{i\leftarrow 1}=f_{i}\circ f_{i-1\leftarrow 1}$ . By continuity, $\{f_{i-1\leftarrow 1}(x):x\in\mathcal{D}_{0}\}$ is compact. Furthermore, $f_{i}$ is continuously differentiable under the assumptions of Lemma D.1. Thus, $f_{i}$ is Lipschitz on domain $\{f_{i-1\leftarrow 1}(x):x\in\mathcal{D}_{0}\}$ . As $f_{i\leftarrow 1}=f_{i}\circ f_{i-1\leftarrow 1}$ is the composition of Lipschitz functions by the inductive hypothesis, $f_{i\leftarrow 1}$ is itself Lipschitz.

Now it follows that $\forall i$ , $J_{i\leftarrow i}$ is Lipschitz on $\mathcal{D}_{0}$ , as it is the composition of $Df_{i\leftarrow i}$ and $f_{i-1\leftarrow 1}$ , both of which are Lipschitz. Finally, by the chain rule (Claim H.2), we have that $J_{j\leftarrow i}=J_{j\leftarrow j}\cdots J_{i\leftarrow i}$ is the product of Lipschitz functions, and therefore Lipschitz for all $i<j$ . We simply take $\bar{\tau}$ to be the maximum Lipschitz constant of $J_{j\leftarrow i}$ over all $i\leq j$ . ∎
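The chain-rule factorization used in this proof can be checked numerically on a toy example. The sketch below uses three scalar functions (`tanh`, squaring, `sin`) chosen purely for illustration, so each Jacobian is a number. It evaluates $J_{j\leftarrow i}(x)=J_{j\leftarrow j}(x)\cdots J_{i\leftarrow i}(x)$ with $J_{k\leftarrow k}=Df_{k}\circ f_{k-1\leftarrow 1}$ and compares the result with a central finite difference.

```python
import math

# Illustrative scalar layers f_1, f_2, f_3 and their derivatives (assumed for this sketch).
fs = [math.tanh, lambda v: v * v, math.sin]
dfs = [lambda v: 1.0 - math.tanh(v) ** 2, lambda v: 2.0 * v, math.cos]


def compose(j, i, x):
    """f_{j<-i}: apply f_i, ..., f_j (1-indexed); identity when j < i."""
    for k in range(i, j + 1):
        x = fs[k - 1](x)
    return x


def jacobian(j, i, x):
    """J_{j<-i}(x) as the product of J_{k<-k}(x) = Df_k(f_{k-1<-1}(x)) for k = i, ..., j."""
    out = 1.0
    for k in range(i, j + 1):
        out *= dfs[k - 1](compose(k - 1, 1, x))
    return out


x, eps = 0.3, 1e-5
finite_diff = (compose(3, 1, x + eps) - compose(3, 1, x - eps)) / (2.0 * eps)
print(jacobian(3, 1, x), finite_diff)
```

Note that every factor is evaluated at a point determined by the original input $x$, which is why Lipschitzness of each $f_{k-1\leftarrow 1}$ is needed before concluding that the product is Lipschitz in $x$.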

For $0\leq j\leq k+1$ , define $z_{j}(x,\nu)$ by

Thus, $z_{j}(x,\nu)$ denotes $g\circ(f_{0\leftarrow 1}\otimes\ldots\otimes f_{k\leftarrow 1})$ with the last $k+1-j$ inputs to $g$ depending on $x+\nu$ instead of $x$ . Now we claim that by a telescoping argument (Claim H.3),

To see this, compute the sum in the order of the following sequence of terms, which corresponds to a traversal of $\mathcal{Q}$ in least-to-greatest order:

Now we simply apply triangle inequality on (41) and use the fact that $z_{j}(x,\nu)-1\in\ \forall 0\leq j\leq k+1$ to obtain the desired statement. ∎

In the setting of Theorem 6.2, fix $1\leq i\leq 0pt$ and define

Here for convenience we use the convention that $\kappa_{i-1\leftarrow i}=1$ .

Appendix E Application to Recurrent Neural Networks

In this section, we will apply our techniques to recurrent neural networks. Suppose that we are in a classification setting. For simplicity, we will assume that the hidden layer and input dimensions are $d$ . We will define a recurrent neural network with $r-1$ activation layers as follows using parameters $W,U,Y$ , activation $\phi$ and input sequence $x=(x^{(0)},\ldots,x^{(r-2)})$ :

where $h^{(0)}$ is set to be 0. Now following the convention of Section 7, we will define the interlayer Jacobians. For odd indices $2i-1$ , $i\leq r-1$ , we simply set $Q_{2i-1\leftarrow 2i-1}$ to the constant function $x\mapsto W$ . For even indices $2i$ , $i\leq r-1$ , we set $Q_{2i\leftarrow 2i}(x)\triangleq D\phi[h^{(2i-1)}(x)+u^{(i-1)}(x)]$ , the Jacobian of the activation applied to the input of $h^{(2i)}(x)$ . Finally, we set $Q_{2r-1\leftarrow 2r-1}$ to be the constant function $x\mapsto Y$ . Now for $i^{\prime}>i$ , we set $Q_{i^{\prime}\leftarrow i}(x)=Q_{i^{\prime}\leftarrow i^{\prime}}(x)\cdots Q_{i\leftarrow i}(x)$ . If $i^{\prime}<i$ , we set $Q_{i^{\prime}\leftarrow i}$ to the identity matrix.
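The product convention for the interlayer Jacobians can be made concrete with a small numerical sketch. It assumes a tanh activation (chosen only because it is 1-Lipschitz with a Lipschitz derivative), random $d\times d$ matrices, and random stand-ins for the activation inputs; the helper name `interlayer_jacobian` and all numerical values are hypothetical and not taken from the paper.

```python
import numpy as np

def interlayer_jacobian(layer_jacobians, i_from, i_to):
    """Return Q_{i_to <- i_from} given [Q_{1<-1}, ..., Q_{m<-m}] (1-indexed layers).

    For i_to >= i_from this is Q_{i_to<-i_to} @ ... @ Q_{i_from<-i_from};
    for i_to < i_from it is the identity, following the stated convention.
    """
    if i_to < i_from:
        # Identity on the input space of layer i_from.
        return np.eye(layer_jacobians[i_from - 1].shape[1])
    out = layer_jacobians[i_from - 1]
    for idx in range(i_from + 1, i_to + 1):
        out = layer_jacobians[idx - 1] @ out
    return out

# Toy instance: every value below is an illustrative assumption.
rng = np.random.default_rng(0)
d, r = 4, 3
W = rng.normal(size=(d, d)) / np.sqrt(d)
Y = rng.normal(size=(2, d)) / np.sqrt(d)

layers = []
for _ in range(r - 1):
    pre = rng.normal(size=d)                        # stand-in for the activation's input
    layers.append(W)                                # odd index: constant map x -> W
    layers.append(np.diag(1.0 - np.tanh(pre) ** 2)) # even index: D phi at its input
layers.append(Y)                                    # index 2r-1: constant map x -> Y

# Jacobian of the full composition, layers 1 through 2r-1.
Q_full = interlayer_jacobian(layers, 1, 2 * r - 1)
```

Storing only the single-layer factors and multiplying on demand mirrors the definition directly; a real RNN would replace the random `pre` vectors with the actual activation inputs.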

With this notation in place, we can state our generalization bound for RNNs:

Assume that the activation $\phi$ is 1-Lipschitz with a $\bar{\sigma}_{\phi}$ -Lipschitz derivative. With probability $1-\delta$ over the random draws of $P_{n}$ , all RNNs $F$ will satisfy the following generalization guarantee:

where $\kappa^{\textup{jacobian},(i)}\triangleq\sum_{1\leq j\leq 2i-1\leq j^{\prime}\leq 2r-1}\frac{\sigma_{j^{\prime}\leftarrow 2i}\sigma_{2i-2\leftarrow j}}{\sigma_{j^{\prime}\leftarrow j}}$ , and

In these expressions, we define $\sigma_{j-1\leftarrow j}=1$ , and:

Note that the training error here is 0 because of the existence of a positive margin $\gamma$ .

Our proof follows the template of Theorem 7.1: we bound the Rademacher complexity of some augmented RNN loss. We then argue for generalization of the augmented loss and perform a union bound over all the choices of parameters. As the latter steps are identical to those in the proof of Theorem 7.1, we omit these and focus on bounding the Rademacher complexity of an augmented RNN loss.

Suppose that $\phi$ is $1$ -Lipschitz with $\bar{\sigma}_{\phi}$ -Lipschitz derivative. Define the following class of RNNs with bounded weight matrices:

and let $\sigma_{j^{\prime}\leftarrow j}$ be parameters that will bound the $j$ to $j^{\prime}$ layerwise Jacobian for $j^{\prime}\geq j$ , where we set $\sigma_{2i\leftarrow 2i}=1$ and $\sigma_{2i-1\leftarrow 2i-1}=\sigma_{W}$ for $i\leq r-1$ , $\sigma_{2r-1\leftarrow 2r-1}=\sigma_{Y}$ . Let $t^{(i)}$ be parameters bounding the layer norm after applying the $i$ -th activation, and let $t^{(0)}=0,t^{\textup{data}}=\max_{x\in P_{n}}\max_{i}\|x^{(i)}\|$ . Define the class of augmented losses

where $\kappa^{\textup{rnn-hidden},(i)},\kappa^{\textup{rnn-jacobian},(i)}$ are defined in Theorem E.1.

We will associate the family of losses $\mathcal{L}_{\textup{rnn-aug}}$ with a computational graph structure on internal nodes $H_{1},H_{2},\ldots,H_{2r-1}$ , $J_{1},\ldots,J_{2r-1}$ , $K_{0},\ldots,K_{r-2}$ , input nodes $H_{0},I_{0},\ldots,I_{r-2}$ , and output node $O$ with the following edges:

Nodes $H_{i},J_{i}$ will point towards the output $O$ .

Node $H_{i}$ will point towards nodes $H_{i+1}$ and $J_{i+1}$ .

Node $K_{i-1}$ will point towards node $H_{2i}$ and node $J_{2i}$ .

Node $I_{i}$ will point towards node $K_{i}$ .

We now define the composition rules at each node:

Nodes $J_{2i-1}$ will have composition rule $R_{J_{2i-1}}=DR_{H_{2i-1}}$ . Finally, the output node $O$ will have composition rule

With this condition established, we can complete the proof via the same covering number argument as in Theorem A.2. ∎

Now as in the proof of Theorem 7.1, we first observe that the augmented loss upper bounds the 0-1 classification loss, giving us a 0-1 test error bound. We then apply the same union bound technique over parameters $\gamma,t^{(i)},\sigma_{j^{\prime}\leftarrow j},a_{W},a_{U},a_{Y}$ , as in the proof of Theorem 7.1.

Appendix F ReLU Networks

In this section, we apply our augmentation technique to ReLU networks to produce a generalization bound similar to that of Nagarajan and Kolter (2019), which is polynomial in the Jacobian norms, hidden layer norms, and inverse pre-activations.

Recall the definition of neural nets in Example 5.1: the neural net with parameters $\{W^{(i)}\}$ and activation $\phi$ is defined by

For this section, we will set $\phi$ to be the ReLU activation. We also use the same notation for layers and indexing as Section 7. We first state our generalization bound for ReLU networks:

Fix reference matrices $\{A^{(i)}\},\{B^{(i)}\}$ . With probability $1-\delta$ over the random draws of the data $P_{n}$ , all neural networks $F$ with ReLU activations parameterized by $\{W^{(i)}\}$ will have the following generalization guarantee

In these expressions, we define $\sigma_{j-1\leftarrow j}=1$ , $\gamma^{(i)}$ to be the minimum pre-activation after the $i$ -th weight matrix over all coordinates in the $i$ -th layer and all datapoints:

where $[F_{2i-1\leftarrow 1}(x)]_{j}$ indexes the $j$ -th coordinate of $F_{2i-1\leftarrow 1}(x)$ , and additionally use

Note that we assume the existence of a positive margin, so the training error here is 0.

We note that compared to Theorem 7.1, $\kappa^{\textup{relu-jacobian},(i)}=\kappa^{\textup{jacobian},(i)}$ , but $\kappa^{\textup{relu-hidden},(i)}$ now has a dependence on the preactivations $\gamma^{(i)}$ , as in Nagarajan and Kolter (2019).
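To illustrate the quantity $\gamma^{(i)}$ , the following sketch computes a minimum pre-activation per linear layer for a small bias-free ReLU network. It assumes that "minimum pre-activation" means the smallest absolute value over coordinates and datapoints, and that only linear layers followed by a ReLU are counted; both are assumptions made for this example, and the function name and toy weights are hypothetical.

```python
import numpy as np

def min_preactivations(weights, X):
    """For each linear layer followed by a ReLU, return the smallest absolute
    pre-activation over all coordinates and all datapoints in X.

    Assumption: the last matrix in `weights` is the output layer and is skipped.
    """
    gammas = np.full(len(weights) - 1, np.inf)
    for x in X:
        h = np.asarray(x, dtype=float)
        for i, W in enumerate(weights[:-1]):
            z = W @ h                                  # pre-activation of layer i
            gammas[i] = min(gammas[i], np.abs(z).min())
            h = np.maximum(z, 0.0)                     # ReLU
    return gammas

# Toy network and data, chosen by hand for illustration only.
weights = [
    np.array([[1.0, 0.0], [0.0, -2.0]]),
    np.array([[2.0, 1.0], [1.0, -1.0]]),
    np.array([[1.0, 1.0]]),
]
X = [np.array([1.0, 1.0]), np.array([0.5, 3.0])]
gammas = min_preactivations(weights, X)
```

A small value of any entry of `gammas` means some training point sits close to a ReLU kink in that layer, which is the situation in which a bound depending on inverse pre-activations becomes large.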

We provide a proof sketch of Theorem F.1 here. We first bound the Rademacher complexity of some family of augmented losses, specified precisely in Theorem F.2. The rest of the argument then follows the same way as the proof of Theorem 7.1: using Rademacher complexity to argue that the augmented losses generalize, applying the fact that the augmented losses upper-bound the 0-1 loss, and then union bounding over all choices of parameters.

Following the definitions in Theorem A.2, let $\mathcal{F}$ denote the class of neural networks, $\sigma_{j^{\prime}\leftarrow j}$ be parameters intended to bound the spectral norm of the $j$ to $j^{\prime}$ layerwise Jacobian, and $t^{(i)}$ be parameters bounding the layer norm after applying the $i$ -th activation. Define $\gamma^{(i)}$ as parameters intended to lower bound the minimum preactivations after the $i$ -th linear layer. Define the class of augmented losses

As in the proof of Theorem A.2, associate the loss class $\mathcal{L}_{\textup{relu-aug}}$ with a family $\widetilde{\mathcal{G}}$ of computation graphs on internal nodes $V_{1},\ldots,V_{2r-1},J_{1},\ldots,J_{2r-1}$ as follows: define the graph structure to be identical to the Lipschitz augmentation of a sequential computation graph family (Figure 3) and define the composition rules

Assign to the $J_{i}$ nodes composition rule $R_{J_{i}}=DR_{V_{i}}$ , and finally, assign to the output node $O$ the composition rule

The resulting family of computation graphs will compute $\mathcal{L}_{\textup{relu-aug}}$ . Now we claim that $\widetilde{\mathcal{G}}$ is $\kappa^{\textup{relu-hidden},(i)}$ -release-Lipschitz in nodes $V_{2i-1}$ and $\kappa^{\textup{relu-jacobian},(i)}$ -release-Lipschitz in nodes $J_{2i-1}$ . (Note that the Lipschitzness of nodes $V_{2i},J_{2i}$ will not matter because the associated function classes are singletons and therefore have a log covering number of 0 anyway.)

The argument for the $\kappa^{\textup{relu-jacobian},(i)}$ -release-Lipschitzness of $J_{2i-1}$ follows analogously to the argument of Lemma D.8 and Theorem A.2.

Finally, to conclude the desired Rademacher complexity bounds given the release-Lipschitzness, we apply the same reasoning as in Theorem A.2. ∎

Appendix G Additional Experimental Details

For all settings, we train for 200 epochs with learning rate decay by a factor of 0.2 at epochs 60, 120, and 150. We additionally tuned the value of $\lambda$ over the values $\{0.1,0.05,0.01\}$ for each setting; for the experiments displayed in Figure 1, we used the following values:

For all other hyperparameters, we use the defaults in the PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch, and we base our code on this implementation. We report results from a single run as the improvement with Jacobian regularization is statistically significant. We train on a single NVIDIA TitanXp GPU.
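The learning rate schedule described above (decay by a factor of 0.2 at epochs 60, 120, and 150 over 200 epochs) can be written as a short sketch. The initial learning rate is not stated for this experiment, so `base_lr=0.1` below is an assumed placeholder, and the choice that the decay takes effect at the start of the milestone epoch (with epochs counted from 0) is also an assumption.

```python
def learning_rate(epoch, base_lr, milestones=(60, 120, 150), factor=0.2):
    """Step schedule: multiply base_lr by `factor` once for each milestone reached."""
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** num_decays

# base_lr=0.1 is an assumed value for illustration; the text does not state it here.
schedule = [learning_rate(epoch, base_lr=0.1) for epoch in range(200)]
```

PyTorch's `torch.optim.lr_scheduler.MultiStepLR` expresses the same kind of milestone schedule, though the text does not say which scheduler was actually used.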

G.2 Empirical Scaling of our Complexity Measure with Depth

In this section, we empirically demonstrate that the leading term of our bounds can exhibit better scaling in depth than prior work.

We compute leading terms of our bound: $\frac{\sum_{i}\max_{x\in P_{n}}\|h^{(i)}(x)\|_{2}\max_{x\in P_{n}}\|J^{(i)}(x)\|_{\textup{op}}}{\gamma}$ , where $i$ ranges over the layers, $h^{(i)}$ , $J^{(i)}$ denote the $i$ -th hidden layer and Jacobian of the output with respect to the $i$ -th hidden layer, respectively, and $\gamma$ denotes the smallest positive margin on the training dataset. We compare this quantity with that of the bound of Bartlett et al. (2017): $\prod_{i}\|W^{(i)}\|_{\textup{op}}/\gamma$ . In Figure 5, we plot this comparison for WideResNet models of depths 10, 16, 22, 28 trained on CIFAR10. (Our bound as stated in the paper technically does not apply to ResNet because the skip connections complicate the Lipschitz augmentation step. This can be remedied with a slight modification to our augmentation step, which we omit for simplicity.) For all models, we remove data augmentation to ensure that our models fit the training data perfectly. We train each model for 50 epochs, which is sufficient for perfectly fitting the training data, and start from an initial learning rate of 0.1 which we decrease by a factor of 10 at epoch 30. All other parameters are set to the same as their defaults in the PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch. We plot the final complexity measures computed on a single model. We note that our models are trained with Batchnorm. At test time, these Batchnorm layers compute affine transformations, so we compute the bound by merging these transformations with the adjacent linear layer.
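The two quantities being compared can be sketched numerically on a toy model. The code below is not the experimental pipeline: it assumes a small bias-free ReLU MLP with random weights in place of a WideResNet, random inputs in place of CIFAR10, labels set to the model's own predictions so that every margin is positive, and a sum over hidden layers only. All function names are hypothetical.

```python
import numpy as np

def forward(weights, x):
    """Hidden layers h^(1), ..., h^(r-1) and output of a bias-free ReLU MLP."""
    hiddens, h = [], x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
        hiddens.append(h)
    return hiddens, weights[-1] @ h

def output_jacobians(weights, hiddens):
    """J^(i): Jacobian of the output with respect to the i-th hidden layer."""
    J = weights[-1]
    jacs = [J]
    for i in range(len(hiddens) - 1, 0, -1):
        mask = (hiddens[i] > 0).astype(float)   # ReLU derivative at layer i+1
        J = (J * mask) @ weights[i]             # J @ diag(mask) @ W_{i+1}
        jacs.append(J)
    return jacs[::-1]

def smallest_positive_margin(weights, X, y):
    margins = []
    for x, label in zip(X, y):
        _, out = forward(weights, x)
        margins.append(out[label] - np.delete(out, label).max())
    margins = np.array(margins)
    return float(margins[margins > 0].min())

def our_leading_term(weights, X, gamma):
    """sum_i max_x ||h^(i)(x)||_2 * max_x ||J^(i)(x)||_op / gamma."""
    max_h = np.zeros(len(weights) - 1)
    max_J = np.zeros(len(weights) - 1)
    for x in X:
        hiddens, _ = forward(weights, x)
        jacs = output_jacobians(weights, hiddens)
        max_h = np.maximum(max_h, [np.linalg.norm(h) for h in hiddens])
        max_J = np.maximum(max_J, [np.linalg.norm(J, 2) for J in jacs])
    return float(np.sum(max_h * max_J) / gamma)

def spectral_product(weights, gamma):
    """prod_i ||W^(i)||_op / gamma."""
    return float(np.prod([np.linalg.norm(W, 2) for W in weights]) / gamma)

# Toy data and model (illustrative assumptions only).
rng = np.random.default_rng(0)
dims = [8, 16, 16, 16, 3]
weights = [rng.normal(size=(dims[k + 1], dims[k])) / np.sqrt(dims[k])
           for k in range(len(dims) - 1)]
X = [rng.normal(size=dims[0]) for _ in range(20)]
y = [int(np.argmax(forward(weights, x)[1])) for x in X]

gamma = smallest_positive_margin(weights, X, y)
ours = our_leading_term(weights, X, gamma)
spectral = spectral_product(weights, gamma)
```

Because the data-dependent term uses the norms actually attained on the sample rather than worst-case products of operator norms, each summand is at most $\max_{x}\|x\|\prod_{i}\|W^{(i)}\|_{\textup{op}}$ in this bias-free setting; how much smaller it is in practice is what the comparison in Figure 5 measures.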

Figure 5 demonstrates that our complexity measure can be much lower than the spectral complexity. Furthermore, in Figure 5, our complexity measure appears to scale well with depth for WideResNet models.

Appendix H Toolbox

$u(u(x_{1},x_{2}),x_{3})=u(x_{1},x_{2}x_{3})$ .

First, we note that $u(x_{1},x_{2})=x_{1}x_{2}+1-x_{2}\leq x_{2}+1-x_{2}=1$ . Furthermore, $u(x_{1},x_{2})\geq x_{1}x_{2}+x_{1}(1-x_{2})=x_{1}$ , which completes the proof of statements 1 and 2. To prove the third statement, we note that $u(u(x_{1},x_{2}),x_{3})=(x_{1}x_{2}+1-x_{2})x_{3}+1-x_{3}=x_{1}x_{2}x_{3}+1-x_{2}x_{3}=u(x_{1},x_{2}x_{3})$ . ∎
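The three statements can also be checked numerically. The sketch below uses the formula $u(x_{1},x_{2})=x_{1}x_{2}+1-x_{2}$ from the proof and assumes that the arguments lie in $[0,1]$ , which is the range on which the inequalities in the proof hold; the grid and tolerance are arbitrary choices for illustration.

```python
import itertools

def u(x1, x2):
    """u(x1, x2) = x1 * x2 + 1 - x2, as used in the proof."""
    return x1 * x2 + 1 - x2

def check_u_properties(grid, tol=1e-12):
    """Check x1 <= u(x1, x2) <= 1 and u(u(x1, x2), x3) == u(x1, x2 * x3) on a grid."""
    for x1, x2, x3 in itertools.product(grid, repeat=3):
        value = u(x1, x2)
        if not (x1 - tol <= value <= 1 + tol):
            return False
        if abs(u(value, x3) - u(x1, x2 * x3)) > tol:
            return False
    return True

# Assumed domain: a uniform grid on [0, 1].
grid = [k / 10 for k in range(11)]
all_hold = check_u_properties(grid)
```

The third identity says that nesting $u$ in its first argument is the same as multiplying the second arguments, which is what allows repeated applications to be collapsed into a single one.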

The Jacobian of a composition of a sequence of functions $f_{1},\dots,f_{k}$ satisfies

where the $\cdot$ notations are standard matrix multiplication. For simplicity, we also write in the function form:

Let $f:\mathcal{D}\rightarrow\mathcal{D}^{\prime}$ , and consider the total derivative operator $Df$ , which maps each point of $\mathcal{D}$ to a linear operator from the normed space $\mathcal{D}$ to the normed space $\mathcal{D}^{\prime}$ . Suppose that $Df[x]$ is $\kappa$ -Lipschitz in $x$ , in the sense that $\|Df[x]-Df[x+\nu]\|_{\textup{op}}\leq\kappa\|\nu\|$ , where $\|\cdot\|_{\textup{op}}$ is the operator norm induced by $\mathcal{D}$ and $\mathcal{D}^{\prime}$ . Then

We write $f(x+\nu)-f(x)=\left(\int_{t=0}^{1}Df[x+t\nu]dt\right)\nu$ . Now we note that