Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation

Colin Wei, Tengyu Ma

Introduction

Deep networks trained in practice typically use many more parameters than training examples, and therefore have the capacity to overfit to the training set (Zhang et al., 2016). Fortunately, there are also many known (and unknown) sources of regularization during training: model capacity regularization such as simple weight decay, implicit or algorithmic regularization (Gunasekar et al., 2017, 2018b; Soudry et al., 2018; Li et al., 2018b), and finally regularization that depends on the training data such as Batchnorm (Ioffe and Szegedy, 2015), layer normalization (Ba et al., 2016), group normalization (Wu and He, 2018), path normalization (Neyshabur et al., 2015a), dropout (Srivastava et al., 2014; Wager et al., 2013), and regularizing the variance of activations (Littwin and Wolf, 2018).

In many cases, it remains unclear why data-dependent regularization can improve the final test error — for example, why Batchnorm empirically improves the generalization performance in practice (Ioffe and Szegedy, 2015; Zhang et al., 2019). We do not have many tools for analyzing data-dependent regularization in the literature; with the exception of Dziugaite and Roy (2018), (Arora et al., 2018) and (Nagarajan and Kolter, 2019) (with which we compare later in more detail), existing bounds typically consider properties of the weights of the learned model but little about their interactions with the training set. Formally, define a data-dependent property as any function of the learned model and the training data. In this work, we prove tighter generalization bounds by considering additional data-dependent properties of the network. Optimizing these bounds leads to data-dependent regularization techniques that empirically improve performance.

Suppose σ,t1\sigma,t\geq 1. With probability 1δ1-\delta over the training data, we can bound the test error of FF by L0-1(F)O~((σγ+r3σ2)t(1+iW(i)2,12/3)3/2+r2σ(1+iW(i)1,12/3)3/2n+rlog(1δ)n)\displaystyle L_{\textup{0-1}}(F)\leq\widetilde{O}\left(\frac{(\frac{\sigma}{\gamma}+r^{3}\sigma^{2})t\left(1+\sum_{i}\|{W^{(i)}}^{\top}\|_{2,1}^{2/3}\right)^{3/2}+r^{2}\sigma\left(1+\sum_{i}\|{W^{(i)}}\|_{1,1}^{2/3}\right)^{3/2}}{\sqrt{n}}+r\sqrt{\frac{\log(\frac{1}{\delta})}{n}}\right)

The degree of the dependencies on σ\sigma may look unconventional — this is mostly due to the dramatic simplification from our full Theorem 7.1, which obtains a more natural bound that considers all interlayer Jacobian norms instead of only the maximum. Our bound is polynomial in t,σt,\sigma, and network depth, but independent of width. In practice, tt and σ\sigma have been observed to be much smaller than the product of matrix norms (Arora et al., 2018; Nagarajan and Kolter, 2019). We remark that our bound is not homogeneous because the smooth activations are not homogeneous and can cause a second order effect on the network outputs.

In contrast, the bounds of Neyshabur et al. (2015b); Bartlett et al. (2017); Neyshabur et al. (2017a); Golowich et al. (2017) all depend on a product of norms of weight matrices which scales exponentially in the network depth, and which can be thought of as a worst case Lipschitz constant of the network. In fact, lower bounds show that with only norm-based constraints on the hypothesis class, this product of norms is unavoidable for Rademacher complexity-based approaches (see for example Theorem 3.4 of (Bartlett et al., 2017) and Theorem 7 of (Golowich et al., 2017)). We circumvent these lower bounds by additionally considering the model’s Jacobian norms – empirical Lipschitz constants which are much smaller than the product of norms because they are only computed on the training data.

The bound of Arora et al. (2018) depends on similar quantities related to noise stability but only holds for a compressed network and not the original. The bound of Nagarajan and Kolter (2019) also depends polynomially on the Jacobian norms rather than exponentially in depth; however these bounds also require that the inputs to the activation layers are bounded away from 0, an assumption that does not hold in practice (Nagarajan and Kolter, 2019). We do not require this assumption because we consider networks with smooth activations, whereas the bound of Nagarajan and Kolter (2019) applies to relu nets.

In Section E, we additionally present a generalization bound for recurrent neural nets that scales polynomially in the same quantities as our bound for standard neural nets. Prior generalization bounds for RNNs either require parameter counting (Koiran and Sontag, 1997) or depend exponentially on depth (Zhang et al., 2018; Chen et al., 2019).

In Figure 1, we plot the distribution over the sum of products of Jacobian and hidden layer norms (which is the leading term of the bound in our full Theorem 7.1) for a WideResNet (Zagoruyko and Komodakis, 2016) trained with and without Batchnorm. Figure 1 shows that this sum blows up for networks trained without Batchnorm, indicating that the terms in our bound are empirically relevant for explaining data-dependent regularization.

An immediate bottleneck in proving Theorem 1.1 is that standard tools require fixing the hypothesis class before looking at training data, whereas conditioning on data-dependent properties makes the hypothesis class a random object depending on the data. A natural attempt is to augment the loss with indicators on the intended data-dependent quantities {γi}\{\gamma_{i}\}, with desired bounds {κi}\{\kappa_{i}\} as follows:

This augmented loss upper bounds the original loss loldl_{\textup{old}}\in, with equality when all properties hold for the training data. The augmentation lets us reason about a hypothesis class that is independent of the data by directly conditioning on data-dependent properties in the loss. The main challenges with this approach are twofold: 1) designing the correct set of properties and 2) proving generalization of the final loss laugl_{\textup{aug}}, a complicated function of the network.

Our main tool is covering numbers: Lemma 4.1 shows that a composition of functions (i.e, a neural network) has low covering number if the output is worst-case Lipschitz at each level of the composition and internal layers are bounded in norm. Unfortunately, the standard neural net loss satisfies neither of these properties (without exponential dependencies on depth). However, by augmenting with properties γ\gamma, we can guarantee they hold. One technical challenge is that augmenting the loss makes it harder to reason about covering, as the indicators can introduce complicated dependencies between layers.

Our main technical contributions are: 1) We demonstrate how to augment a composition of functions to make it Lipschitz at all layers, and thus easy to cover. Before this augmentation, the Lipschitz constant could scale exponentially in depth (Theorem 4.4). 2) We reduce covering a complicated sequence of operations to covering the individual operations (Theorem 4.3). 3) By combining 1 and 2, it follows cleanly that our augmented loss on neural networks has low covering number and therefore has good generalization. Our bound scales polynomially, not exponentially, in the depth of the network when the network has good Lipschitz constants on the training data (Theorem 7.1).

As a complement to the main theoretical results in this paper, we show empirically in Section 8 that directly regularizing our complexity measure can result in improved test performance.

Related Work

Zhang et al. (2016) and Neyshabur et al. (2017b) show that generalizaton in deep learning often disobeys conventional statistical wisdom. One of the approaches adopted torwards explaining generalization is implicit regularization; numerous recent works have shown that the training method prefers minimum norm or maximum margin solutions (Soudry et al., 2018; Li et al., 2018b; Ji and Telgarsky, 2018; Gunasekar et al., 2017, 2018a, 2018b; Wei et al., 2018). With the exception of (Wei et al., 2018), these papers analyze simplified settings and do not apply to larger neural networks.

This paper more closely follows a line of work related to Rademacher complexity bounds for neural networks (Neyshabur et al., 2015b, 2018; Bartlett et al., 2017; Golowich et al., 2017; Li et al., 2018a). For a comparison, see the introduction. There has also been work on deriving PAC-Bayesian bounds for generalization (Neyshabur et al., 2017b, a; Nagarajan and Kolter, 2019). Dziugaite and Roy (2017a) optimize a bound to compute non-vacuous bounds for generalization error. Another line of work analyzes neural nets via their behavior on noisy inputs. Neyshabur et al. (2017b) prove PAC-Bayesian generalization bounds for random networks under assumptions on the network’s empirical noise stability. Arora et al. (2018) develop a notion of noise stability that allows for compression of a network under an appropriate noise distribution. They additionally prove that the compressed network generalizes well. In comparison, our Lipschitzness construction also relates to noise stability, but our bounds hold for the original network and do not rely on the particular noise distribution.

Nagarajan and Kolter (2019) use PAC-Bayes bounds to prove a similar result as ours for generalization of a network with bounded hidden layer and Jacobian norms. The main difference is that their bounds depend on the inverse relu preactivations, which are found to be large in practice (Nagarajan and Kolter, 2019); our bounds apply to smooth activations and avoid this dependence at the cost of an additional factor in the Jacobian norm (shown to be empirically small). We note that the choice of smooth activations is empirically justified (Clevert et al., 2015; Klambauer et al., 2017). We also work with Rademacher complexity and covering numbers instead of the PAC-Bayes framework. It is relatively simple to adapt our techniques to relu networks to produce a similar result to that of Nagarajan and Kolter (2019), by conditioning on large pre-activation values in our Lipschitz augmentation step (see Section 4.2). In Section F, we provide a sketch of this argument and obtain a bound for relu networks that is polynomial in hidden layer and Jacobian norms and inverse preactivations. However, it is not obvious how to adapt the argument of Nagarajan and Kolter (2019) to activation functions whose derivatives are not piecewise-constant.

Dziugaite and Roy (2018, 2017b) develop PAC-Bayes bounds for data-dependent priors obtained via some differentially private mechanism. Their bounds are for a randomized classifier sampled from the prior, whereas we analyze a deterministic, fixed model.

Novak et al. (2018) empirically demonstrate that the sensitivity of a neural net to input noise correlates with generalization. Sokolić et al. (2017); Krueger and Memisevic (2015) propose stability-based regularizers for neural nets. Hardt et al. (2015) show that models which train faster tend to generalize better. Keskar et al. (2016); Hoffer et al. (2017) study the effect of batch size on generalization. Brutzkus et al. (2017) analyze a neural network trained on hinge loss and linearly separable data and show that gradient descent recovers the exact separating hyperplane.

Notation

We use DD to denote total derivative operator, and thus Df(x)Df(x) represents the Jacobian of ff at xx. Suppose F\mathcal{F} is a family of functions from Dx\mathcal{D}_{x} to Df\mathcal{D}_{f}. Let C(ϵ,F,ρ)\mathcal{C}(\epsilon,\mathcal{F},\rho) be the covering number of the function class F\mathcal{F} w.r.t. metric ρ\rho with cover size ϵ\epsilon. In many cases, the covering number depends on the examples through the norms of the examples, and in this paper we only work with these cases. Thus, we let N(ϵ,F,s)\mathcal{N}(\epsilon,\mathcal{F},s) be the maximum covering number for any possible nn data points with norm not larger than ss. Precisely, if we define Pn,s\mathcal{P}_{n,s} to be the set of all possible uniform distributions supported on nn data points with norms not larger than ss, then

Suppose F\mathcal{F} contains functions with mm inputs that map from a tensor product mm Euclidean space to Euclidean space, then we define

Overview of Main Results and Proof Techniques

In this section, we give a general overview of the main technical results and outline how to prove them with minimal notation. We will point to later sections where many statements are formalized.

Textbook results (Bartlett and Mendelson, 2002) bound the generalization error by the Rademacher complexity (formally defined in Section A) of the family of losses L\mathcal{L}, which in turn is bounded by the covering number of L\mathcal{L} through Dudley’s entropy integral theorem (Dudley, 1967). Modulo minor nuances, the key remaining question is to give a tight covering number bound for the family L\mathcal{L} for every target cover size ϵ\epsilon in a certain range (often, considering ϵ[1/nO(1),1]\epsilon\in[1/n^{O(1)},1] suffices).

As alluded to in the introduction, generalization error bounds obtained through this machinery only depend on the (training) data through the margin in the loss function, and our aim is to utilize more data-dependent properties. Towards understanding which data-dependent properties are useful to regularize, it is helpful to revisit the data-independent covering technique of (Bartlett et al., 2017), the skeleton of which is summarized below.

Recall that N(ϵ,F,s)\mathcal{N}(\epsilon,\mathcal{F},s) denotes the covering number for arbitrary nn data points with norm less than ss. The following lemma says that if the intermediate variable (or the hidden layer) fif1(x)f_{i}\circ\cdots\circ f_{1}(x) is bounded, and the composition of the rest of the functions lfkfi+1(x)l\circ f_{k}\circ\cdots\circ f_{i+1}(x) is Lipschitz, then small covering number of local functions imply small covering number for the composition of functions.

[abstraction of techniques in (Bartlett et al., 2017)] In the context above, assume:

for any xsupp(Pn)x\in\text{supp}(P_{n}), fif1(x)si{|\kern-1.07639pt|\kern-1.07639pt|f_{i}\circ\cdots\circ f_{1}(x)|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{i}.

Then, we have the following covering number bound for L\mathcal{L} (for any choice of ϵ1,,ϵk>0\epsilon_{1},\dots,\epsilon_{k}>0):

The lemma says that the log covering number and the cover size scale linearly if the Lipschitzness parameters and norms remain constant. However, these two quantities, in the worst case, can easily scale exponentially in the number of layers, and they are the main sources of the dependency of product of spectral/Frobenius norms of layers in (Golowich et al., 2017; Bartlett et al., 2017; Neyshabur et al., 2017a, 2015b) More precisely, the worst-case Lipschitzness over all possible data points can be exponentially bigger than the average/typical Lipschitzness for examples randomly drawn from the training or test distribution. We aim to bridge this gap by deriving a generalization error bound that only depends on the Lipschitzness and boundedness on the training examples.

where κji\kappa_{j\leftarrow i}’s are user-defined parameters. For our application to neural nets, we instantiate sis_{i} as the maximum norm of layer ii and κji\kappa_{j\leftarrow i} as the maximum norm of the Jacobian between layer jj and ii across the training dataset. A polynomial in κ,s\kappa,s can be shown to bound the worst-case Lipschitzness of the function w.r.t. the intermediate variables in the formula above.As mentioned in footnote 1, we will formalize the precise meaning of Lipschitzness later. By our choice of κ\kappa, ss, a) the training loss is unaffected by the augmentation and b) the worst-case Lipschitzness of the loss is controlled by a polynomial of the Lipschitzness on the training examples. We provide an informal overview of our augmentation procedure in Section 4.2 and formally state definitions and guarantees in Section 6. The downside of the Lipschitz augmentation is that it further complicates the loss function. Towards covering the loss function (assuming Lipschitz properties) efficiently, we extend Lemma 4.1, which works for sequential compositions of functions, to general families of formulas, or computational graphs. We informally overview this extension in Section 4.1 using a minimal set of notations, and in Section 5, we give a formal presentation of these results.

Combining the Lipschitz augmentation and graphs covering results, we obtain a covering number bound of augmented loss. The theorem below is formally stated in Theorem 6.3 of Section 6.

where DFiD\mathcal{F}_{i} denotes the function class obtained from applying the total derivative operator to all functions in Fi\mathcal{F}_{i}.

Now, following the standard technique of bounding Rademacher complexity via covering numbers, we can obtain generalization error bounds for augmented loss. For the demonstration of our technique, suppose that the following simplification holds:

A computational graph G(V,E,{RV})G(\mathcal{V},\mathcal{E},\{R_{V}\}) is an acyclic directed graph with three components: the set of nodes V\mathcal{V} corresponds to variables, the set of edges E\mathcal{E} describes dependencies between these variables, and {RV}\{R_{V}\} contains a list of composition rules indexed by the variables VV’s, representing the process of computing VV from its direct predecessors. For simplicity, we assume the graph contains a unique sink, denoted by OGO_{G}, and we call it the “output node”. We also overload the notation OGO_{G} to denote the function that the computational graph GG finally computes. Let IG={I1,,Ip}\mathcal{I}_{G}=\{I_{1},\dots,I_{p}\} be the subset of nodes with no predecessors, which we call the “input nodes” of the graph.

The notion of a family of computational graphs generalizes the sequential family of function compositions in (5). Let G={G(V,E,{RV})}\mathcal{G}=\{G(\mathcal{V},\mathcal{E},\{R_{V}\})\} be a family of computational graphs with shared nodes, edges, output node, and input nodes (denoted by I\mathcal{I}). Let RV\mathfrak{R}_{V} be the collection of all possible composition rules used for node VV by the graphs in the family G\mathcal{G}. This family G\mathcal{G} defines a set of functions OG{OG:GG}O_{\mathcal{G}}\triangleq\{O_{G}:G\in\mathcal{G}\}.

Suppose that there is an ordering (V1,,Vm)(V_{1},\dots,V_{m}) of the nodes, so that after cutting out nodes V1,,Vi1V_{1},\dots,V_{i-1}, the node ViV_{i} becomes a leaf node and the output OGO_{G} is κVi\kappa_{V_{i}}-Lipschitz w.r.t to ViV_{i} for all GGG\in\mathcal{G}. In addition, assume that for all GGG\in\mathcal{G}, the node VV’s value has norm at most sVs_{V}. Let pr(V)\textup{pr}(V) be all the predecessors of VV and spr(V)s_{\textup{pr}(V)} be the list of norm upper bounds of the predecessors of VV.

Then, small covering numbers for all of the local composition rules of VV with resolution ϵV\epsilon_{V} would imply small covering number for the family of computational graphs with resolution VϵVκV\sum_{V}\epsilon_{V}\kappa_{V}:

2 Lipschitz Augmentation of Computational Graphs

The covering number bound of Theorem 4.3 relies on Lipschitzness w.r.t internal nodes of the graph under a worst-case choice of inputs. For deep networks, this can scale exponentially in depth via the product of weight norms and easily be larger than the average Lipschitz-ness over typical inputs. In this section, we explain a general operation to augment sequential graphs (such as neural nets) into graphs with better worst-case Lipschitz constants, so tools such as Theorem 4.3 can be applied. Formal definitions and theorem statements are in Section 6.

The augmentation relies on introducing terms such as the soft indicators in equation (6) and (7) which condition on data-dependent properties. As outlined in Section 4, they will translate to the data-dependent properties in the generalization bounds. We also require the augmented function to upper bound the original.

Our key insight is that by considering a more complicated augmentation which conditions on the derivatives between all intermediate variables, we can still control Lipschitzness of the system, leading to the more involved augmentation presented in (7). Our main technical contribution is Theorem 4.4, which we informally state below.

Covering of Computational Graphs

This section is a formal version of Section 4.1 with full definition and theorem statements. In this section, we adapt the notion of a computational graph to our setting. In Section 5.1, we formalize the notion of a computational graph and demonstrate how neural networks fit under this framework. In Section 5.2, we define the notion of release-Lipschitzness that abstracts the sequential notion of Lipschitzness in Lemma 4.1. We show that when this release-Lipschitzness condition and a boundedness condition on the internal nodes hold, it is possible to cover a family of computational graphs by simply covering the function class at each vertex.

When we augment the neural network loss with data-dependent properties, we introduce dependencies between the various layers, making it complicated to cover the augmented loss. We use the notion of computational graphs to abstractly model these dependencies.

Computational graphs are originally introduced by Bauer (1974) to represent computational processes and study error propagation. Recall the notation G(V,E,{RV})G(\mathcal{V},\mathcal{E},\{R_{V}\}) introduced for a computational graph in Section 4.1, with input nodes IG={I1,,Ip}\mathcal{I}_{G}=\{I_{1},\dots,I_{p}\} and output node denoted by OGO_{G}. (It’s straightforward to generalize to scenarios with multiple output nodes.)

For every variable VVV\in\mathcal{V}, let DV\mathcal{D}_{V} be the space that VV resides in. If VV has tt direct predecessors C1,,CtC_{1},\dots,C_{t}, then the associated composition rule RVR_{V} is a function that maps DC1DCt\mathcal{D}_{C_{1}}\otimes\cdots\otimes\mathcal{D}_{C_{t}} to DV\mathcal{D}_{V}. If VV is an input node, then the composition rule RVR_{V} is not relevant. For any node VV, the computational graph defines/induces a function that computes the variable VV from inputs, or in mathematical words, that maps the inputs space DI1DIp\mathcal{D}_{I_{1}}\otimes\cdots\otimes\mathcal{D}_{I_{p}} to DV\mathcal{D}_{V}. This associated function, denoted by VV again with slight abuse of notations, is defined recursively as follows: set V(x1,,xp)V(x_{1},\dots,x_{p}) to

More succinctly, we can write V=RV(C1Ct)V=R_{V}\circ(C_{1}\otimes\cdots\otimes C_{t}). We also overload the notation OGO_{G} to denote the function that the computational graph GG finally computes (which maps DI1DIp\mathcal{D}_{I_{1}}\otimes\cdots\otimes\mathcal{D}_{I_{p}} to DO\mathcal{D}_{O}). For any set S={V1,,Vt}V\mathcal{S}=\{V_{1},\ldots,V_{t}\}\subseteq\mathcal{V}, use DS\mathcal{D}_{\mathcal{S}} to denote the space DV1DVt\mathcal{D}_{V_{1}}\otimes\cdots\otimes\mathcal{D}_{V_{t}}. We use pr(G,V)\textup{pr}(G,V) to denote the set of direct predecessors of VV in graph GG, or simply pr(V)\textup{pr}(V) when the graph GG is clear from context.

2 Reducing graph covering to local function covering

In this section we introduce the notion of a family of computational graphs, generalizing the sequential family of function compositions in (5). We define release-Lipschitzness, a condition which allows reduce covering the entire the graph family to covering the composition rules at each node. We formally state this reduction in Theorem 5.3.

Let G={G(V,E,{RV}):{RV}R}\mathcal{G}=\{G(\mathcal{V},\mathcal{E},\{R_{V}\}):\{R_{V}\}\in\mathfrak{R}\} be a family of computational graph with shared nodes and edges, where R\mathfrak{R} is a collection of lists of composition rules. This family of computational graphs defines a set of functions OG{OG:GG}O_{\mathcal{G}}\triangleq\{O_{G}:G\in\mathcal{G}\}. We’d like to cover this set of functions in OGO_{\mathcal{G}} with respect to some metric L(Pn,)L(P_{n},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}).

For a list of composition rules {RV}R\{R_{V}\}\in\mathfrak{R} and subset SV\mathcal{S}\subseteq\mathcal{V}, we define the projection of composition rules onto S\mathcal{S} by {RV}S={RV:VS}\{R_{V}\}_{\mathcal{S}}=\{R_{V}:V\in\mathcal{S}\}. Now let RS={{RV}S:{RV}R}\mathfrak{R}_{S}=\{\{R_{V}\}_{\mathcal{S}}:\{R_{V}\}\in\mathfrak{R}\} denote the marginal collection of the composition rules on node subset S\mathcal{S}.

For any computational graph GG and a non-input node VVIV\in\mathcal{V}\setminus\mathcal{I}, we can define the following operation that “releases” VV from its dependencies on its predecessors by cutting all the inward edges: Let G\VG^{\backslash V} be sub-graph of GG where all the edges pointing towards VV are removed from the graph. Thus, by definition, VV becomes a new input node of the graph G\VG^{\backslash V}: IG\V={V}IG\mathcal{I}_{G^{\backslash V}}=\{V\}\cup\mathcal{I}_{G}. Moreover, we can “recover” the dependency by plugging the right value for VV in the new graph G\VG^{\backslash V}: Let V(x)V(x) be the function associated to the node VV in graph GG, then we have

In our proofs, we will release variables in orders. Let S=(V1,,Vm)\mathcal{S}=(V_{1},\dots,V_{m}) be an ordering of the intermediate variables V\(I{O})\mathcal{V}\backslash(\mathcal{I}\cup\{O\}). We call S\mathcal{S} a forest ordering if for any ii, in the original graph GG, ViV_{i} at most depends on the input nodes and V1,,Vi1V_{1},\dots,V_{i-1}. For any sequence of variables (V1,,Vt)(V_{1},\dots,V_{t}), we can define the graph obtained by releasing the variables in order: G\(V1,,Vt)((G\V1))\VtG^{\backslash(V_{1},\dots,V_{t})}\triangleq(\cdots(G^{\backslash V_{1}})\cdots)^{\backslash V_{t}}. We next define the release-Lipschitz condition, which states that the graph function remains Lipschitz when we sequentially release vertices in a forest ordering of the graph.

A graph GG is release-Lipschitz with parameters {κV}\{\kappa_{V}\} w.r.t a forest ordering of the internal nodes, denoted by (V1,,Vm)(V_{1},\dots,V_{m}) if the following happens: upon releasing V1,,VmV_{1},\dots,V_{m} in order from any GGG\in\mathcal{G}, for any 0im0\leq i\leq m, we have that the function defined by the released graph G\(V1,,Vi)G^{\backslash(V_{1},\dots,V_{i})} is κVi\kappa_{V_{i}}-Lipschitz in the argument ViV_{i}, for any values of the rest of the input nodes (={V1,,Vi1}IG\{V_{1},\dots,V_{i-1}\}\cup\mathcal{I}_{G}.) We also say graph GG is release- Lipschitz if such a forest ordering exists.

Now we show that the release-Lipschitz condition allows us to cover any family of computational graphs whose output collapses when internal nodes are too large. The below is a formal and complete version of Theorem 4.3. For the augmented loss defined in (7), the function output collapses to 11 when internal computations are large. The proof is deferred to Section B.

Suppose G\mathcal{G} is a computational graph with the associated family of lists of composition rules R\mathfrak{R}, as formally defined above. Let PnP_{n} be a uniform distribution over nn points in DI\mathcal{D}_{\mathcal{I}}. Let κV\kappa_{V}, sVs_{V}, and ϵV\epsilon_{V} be three families of fixed parameters indexed by V\I\mathcal{V}\backslash\mathcal{I} (whose meanings are defined below). Assume the following:

Every GGG\in\mathcal{G} is release-Lipschitz with parameters {κV}\{\kappa_{V}\} w.r.t a forest ordering of the internal nodes (V1,,Vm)(V_{1},\dots,V_{m}) (the parameter κV\kappa_{V}’s and ordering doesn’t depend on the choice of GG.)

For the same order as before, if (v,x)(DV1DVi)DI(v,x)\in(\mathcal{D}_{V_{1}}\otimes\cdots\otimes\mathcal{D}_{V_{i}})\otimes\mathcal{D}_{\mathcal{I}} is an input of the released graph satisfying vjsVj{|\kern-1.07639pt|\kern-1.07639pt|v_{j}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{V_{j}} for some jij\leq i, then OG\(V1,,Vi)(v,x)=cO_{G^{\backslash(V_{1},\dots,V_{i})}}(v,x)=c for some constant cc.

Then, small covering numbers for all of the local composition rules of VV with resolution ϵV\epsilon_{V} would imply small covering number for the family of computational graphs with resolution VϵVκV\sum_{V}\epsilon_{V}\kappa_{V}:

Lipschitz Augmentation of Computational Graphs

In this section, we provide a more thorough and formal presentation of the augmentation framework of Section 4.2.

The covering number bound for the computational graph family G\mathcal{G} in Theorem 5.3 relies on the release-Lipschitzness condition (condition 1 of Theorem 5.3) and rarely holds for deep computational graphs such as deep neural networks. The conundrum is that the worst-case Lipschitzness as required in the release-Lipschitz conditionWe say the Lipschitzness required is worst case because the release-Lipschitz condition requires the Lipschitzness of nodes for any possible choice of inputs is very likely to scale in the product of the worst-case Lipschitzness of each operations in the graph, which can easily be exponentially larger than the average Lipschitzness over typical examples.

In this section, we first define a model of sequential computational graphs, which captures the class of neural networks. Before Lipschitz augmentation, the worst-case Lipschitz constant of graphs in this family could scale exponentially in the depth of the graph. In Definition 6.1, we generalize the operation of (7) to augment any family G\mathcal{G} of sequential graphs and produce a family G~\widetilde{\mathcal{G}} satisfying the release-Lipschitz condition. In Theorem 6.3, we combine this augmentation with the framework of 5.3 to produce general covering number bounds for the augmented graphs. For the rest of this section we will work with sequential families of computational graphs.

A sequential computational graph has nodes set V={I,V1,,Vq,O}\mathcal{V}=\{I,V_{1},\ldots,V_{q},O\}, where II is the single input node, and all the edges are E={(I,V1),(V1,V2),,(Vq1,Vq)}{(V1,O),,(Vq,O)}\mathcal{E}=\{(I,V_{1}),(V_{1},V_{2}),\cdots,(V_{q-1},V_{q})\}\cup\{(V_{1},O),\dots,(V_{q},O)\}. We often use the notation V0V_{0} to refer to the input II. Below we formally define the augmentation operation.

Given a differentiable sequential computational graph GG with qq internal nodes V1,,VqV_{1},\dots,V_{q}, define its Lipschitz augmentation G~\widetilde{G} as follows. We first add qq nodes to the graph denoted by J1,,JqJ_{1},\dots,J_{q}. The composition rules for original internal nodes remain the same, and the composition rule for JiJ_{i} is defined as

Here DRViDR_{V_{i}} is the total derivative of the function RViR_{V_{i}}. In other words, the variable JiJ_{i} is a Jacobian for RViR_{V_{i}}, a linear operator that maps DVi1\mathcal{D}_{V_{i-1}} to DVi\mathcal{D}_{V_{i}}. (Note that if ViV_{i}’s are considered as vector variables, then JiJ_{i}’s are matrix variables.) We equip the space of JiJ_{i} with operator norm, denoted by op\|\cdot\|_{\textup{op}}, induced by the original norms on spaces Vi1V_{i-1} and ViV_{i}. The Lipschitz-ness w.r.t variable JiJ_{i} will be measured with operator norm.

We pre-determine a family of parameters κji\kappa_{j\leftarrow i} for all pairs (i,j)(i,j) with iji\leq j. The final loss is augmented by a product of soft indicators that truncates the function when any of the Jacobians is much larger than κij\kappa_{i\leftarrow j} :

where xDIx\in\mathcal{D}_{\mathcal{I}}, viDViv_{i}\in\mathcal{D}_{V_{i}}, and DiDJiD_{i}\in\mathcal{D}_{J_{i}}. Note that DjDiD_{j}\cdots D_{i} is the total derivative of VjV_{j} w.r.t ViV_{i}, and thus the κji\kappa_{j\leftarrow i} has the interpretation as an intended bound of the Jacobian between pairs of layers (variables). Figure 4 depicts the augmentation.

Note that under these definitions, we finally get that the output function of G~\widetilde{G} computes

[Lipschitz guarantees of augmented graphs] Let G\mathcal{G} be a family of sequential computational graphs. Suppose for any GGG\in\mathcal{G}, the composition rule of the output node, ROGR_{O_{G}}, is cic_{i}-Lipschitz in variable ViV_{i} for all ii, and it only outputs value in $.Supposethat. Suppose thatDR_{V_{i}}isis\bar{\kappa}_{i}Lipschitzforeach-Lipschitz for eachi.Notethat.Note thatDR_{V_{i}}mapsavectorinspacemaps a vector in space\mathcal{D}_{V_{i-1}}toanlinearoperatorthatmapsto an linear operator that maps\mathcal{D}_{V_{i-1}}toto\mathcal{D}_{V_{i}}.Let. Let\kappa_{j\leftarrow i}(for(fori\leq j)beasetofparametersthatweintendtousetocontrolJacobiansintheLipschitzaugmentation.Withthem,weapplyLipschitzaugmentationasdefinedinDefinition6.1toeverygraphin) be a set of parameters that we intend to use to control Jacobians in the Lipschitz augmentation. With them, we apply Lipschitz augmentation as defined in Definition 6.1 to every graph in\mathcal{G}andobtainanewfamilyofgraphs,denotedbyand obtain a new family of graphs, denoted by\widetilde{\mathcal{G}}$.

where for simplicity in the above expressions, we extend the definition of κ\kappa’s to κj1j=1\kappa_{j-1\leftarrow j}=1.

Finally, we combine Theorems 5.3 and Theorems 6.2 to derive covering number bounds for any Lipschitz augmentation of sequential computational graphs. The final covering bound in (16) can be easily computed given covering number bounds for each individual function class. In Section 7, we use this theorem to derive Rademacher complexity bounds for neural networks. The proof is deferred to Section C. In Section E, we also use these tools to derive Rademacher complexity bounds for RNNs.

Consider any family G\mathcal{G} of sequential computational graphs satisfying the conditions of Theorem 6.2. By combining the augmentation of Definition 6.1 with additional indicators on the internal node norms, we can construct a new family G~\widetilde{\mathcal{G}} of computational graphs which output

The family G~\widetilde{\mathcal{G}} satisfies the following guarantees:

Each computational graph in G~\widetilde{\mathcal{G}} upper bounds its counterpart in G\mathcal{G}, i.e. OG~(x)OG(x)O_{\widetilde{G}}(x)\geq O_{G}(x).

where DRViD\mathfrak{R}_{V_{i}} denotes the family of total derivatives of functions in RVi\mathfrak{R}_{V_{i}} and V0V_{0} the input vertex.

Application to Neural Networks

The below result follows from modeling the neural net loss as a sequential computational graph and using our augmentation procedure to make it Lipschitz in its nodes with parameters κhidden,(i),κjacobian,(i)\kappa^{\textup{hidden},(i)},\kappa^{\textup{jacobian},(i)}. Then we cover the augmented loss to bound its Rademacher complexity.

Assume that the activation ϕ\phi is 1-Lipschitz with a σˉϕ\bar{\sigma}_{\phi}-Lipschitz derivative. Fix reference matrices {A(i)}\{A^{(i)}\}, {B(i)}\{B^{(i)}\}. With probability 1δ1-\delta over the random draws of the data PnP_{n}, all neural networks FF with parameters {W(i)}\{W^{(i)}\} and positive margin γ\gamma satisfy:

where κjacobian,(i)1j2i1j2r1σj2iσ2i2jσjj\kappa^{\textup{jacobian},(i)}\triangleq\sum_{1\leq j\leq 2i-1\leq j^{\prime}\leq 2r-1}\frac{\sigma_{j^{\prime}\leftarrow 2i}\sigma_{2i-2\leftarrow j}}{\sigma_{j^{\prime}\leftarrow j}}, and κhidden,(i)ξ+σ2r12iγ+ii<rσ2i2it(i)+1jj2r1j=max{2i,j},j even jσˉϕσjj+1σj12iσj1jσjj\kappa^{\textup{hidden},(i)}\triangleq\xi+\frac{\sigma_{2r-1\leftarrow 2i}}{\gamma}+\sum_{i\leq i^{\prime}<r}\frac{\sigma_{2i^{\prime}\leftarrow 2i}}{t^{(i^{\prime})}}+\sum_{1\leq j\leq j^{\prime}\leq 2r-1}\sum_{\begin{subarray}{c}j^{\prime\prime}=\max\{2i,j\},\\ j^{\prime\prime}\textup{ even }\end{subarray}}^{j^{\prime}}\frac{\bar{\sigma}_{\phi}\sigma_{j^{\prime}\leftarrow j^{\prime\prime}+1}\sigma_{j^{\prime\prime}-1\leftarrow 2i}\sigma_{j^{\prime\prime}-1\leftarrow j}}{\sigma_{j^{\prime}\leftarrow j}}.

In these expressions, we define σj1j=1\sigma_{j-1\leftarrow j}=1, ξ=poly(r)1\xi=\textup{poly}(r)^{-1}, and:

where QjjQ_{j^{\prime}\leftarrow j} computes the Jacobian of layer jj^{\prime} w.r.t. layer jj. Note that the training error here is because of the existence of positive margin γ\gamma.

We note that our bound has no explicit dependence on width and instead depends on the 2,1,1,1\|\cdot\|_{2,1},\|\cdot\|_{1,1} norms of the weights offset by reference matrices {A(i)},{B(i)}\{A^{(i)}\},\{B^{(i)}\}. These norms can avoid scaling with the width of the network if the difference between the weights and reference matrices is sparse. The reference matrices {A(i)},{B(i)}\{A^{(i)}\},\{B^{(i)}\} are useful if there is some prior belief before training about what weight matrices are learned, and they also appear in the bounds of Bartlett et al. (2017). In Section E, we also show that our techniques can easily be extended to provide generalization bounds for RNNs scaling polynomially in depth via the same quantities t(i),σjjt^{(i)},\sigma_{j^{\prime}\leftarrow j}.

Experiments

Though the main purpose of the paper is to study the data-dependent generalization bounds from a theoretical perspective, we provide preliminary experiments demonstrating that the proposed complexity measure and generalization bounds are empirically relevant. We show that regularizing the complexity measure leads to better test accuracy. Inspired by Theorem 7.1, we directly regularize the Jacobian of the classification margin w.r.t outputs of normalization layers and after residual blocks. Our reasoning is that normalization layers control the hidden layer norms, so additionally regularizing the Jacobians results in regularization of the product, which appears in our bound. We find that this is effective for improving test accuracy in a variety of settings. We note that Sokolić et al. (2017) show positive experimental results for a similar regularization technique in data-limited settings.

Suppose that m(F(x),y)=[F(x)]ymaxjy[t]jm(F(x),y)=[F(x)]_{y}-\max_{j\neq y}[t]_{j} denotes the margin of the network for example (x,y)(x,y). Letting h(i)h^{(i)} denote some hidden layer of the network, we define the notation J(i)h(i)m(F(x),y)J^{(i)}\triangleq\frac{\partial}{\partial h^{(i)}}m(F(x),y) and use training objective

where ll denotes the standard cross entropy loss, and λ,σ\lambda,\sigma are hyperparameters. Note the Jacobian is taken with respect to a scalar output and therefore is a vector, so it is easy to compute.

For a WideResNet16 (Zagoruyko and Komodakis, 2016) architecture, we train using the above objective. The threshold on the Frobenius norm in the regularization is inspired by the truncations in our augmented loss (in all our experiments, we choose σ=0.1\sigma=0.1). We tune the coefficient λ\lambda as a hyperparameter. In our experiments, we took the regularized indices ii to be last layers in each residual block as well as layers in residual blocks following a BatchNorm in the standard WideResNet16 architecture. In the LayerNorm setting, we simply replaced BatchNorm layers with LayerNorm. The remaining hyperparameter settings are standard for WideResNet; for additional details see Section G.1.

Figure 1 shows the results for models trained and tested on CIFAR10 in low learning rate and no data augmentation settings, which are settings where generalization typically suffers. We also experiment with replacing BatchNorm layers with LayerNorm and additionally regularizing the Jacobian. We observe improvements in test error for all these settings. In Section G.2, we empirically demonstrate that our complexity measure indeed avoids the exponential scaling in depth for a WideResNet model trained on CIFAR10.

Conclusion

In this paper, we tackle the question of how data-dependent properties affect generalization. We prove tighter generalization bounds that depend polynomially on the hidden layer norms and norms of the interlayer Jacobians. To prove these bounds, we work with the abstraction of computational graphs and develop general tools to augment any sequential family of computational graphs into a Lipschitz family and then cover this Lipschitz family. This augmentation and covering procedure applies to any sequence of function compositions. An interesting direction for future work is to generalize our techniques to arbitrary computational graph structures. In follow-up work (Wei and Ma, 2019), we develop a simpler technique to derive Jacobian-based generalization bounds for both robust and clean accuracy, and we present an algorithm inspired by this theory which empirically improves performance over strong baselines.

Acknowledgments

CW was supported by a NSF Graduate Research Fellowship. Toyota Research Institute (TRI) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

References

Appendix A Missing Proofs for Section 7

We first elaborate more on the notations introduced in Section 7. First, by our indexing, matrix W(i)W^{(i)} will be applied in layer 2i12i-1 of the network, and even layers 2i2i apply ϕ\phi. We let FjjF_{j^{\prime}\leftarrow j} denote the function computed between layers jj and jj^{\prime} and Qjj=DFjjFj11Q_{j^{\prime}\leftarrow j}=DF_{j^{\prime}\leftarrow j}\circ F_{j^{\prime}-1\leftarrow 1} denote the layer jj-to-jj^{\prime} Jacobian. By our definition of FjjF_{j^{\prime}\leftarrow j}, F2j2j=ϕF_{2j\leftarrow 2j}=\phi, F2j12j1=hW(j)hF_{2j-1\leftarrow 2j-1}=h\mapsto W^{(j)}h, and FjjF_{j^{\prime}\leftarrow j} is recursively computed by FjjFj1jF_{j^{\prime}\leftarrow j^{\prime}}\circ F_{j^{\prime}-1\leftarrow j} for j>jj^{\prime}>j. We will use the convention that Fj1jF_{j-1\leftarrow j} computes the identity mapping for iji\leq j.

PP will denote a test distribution over examples xx and labels yy, and PnP_{n} will denote the distribution on training examples.

For a class of real-valued functions L\mathcal{L} and dataset PnP_{n}, define the empirical Rademacher complexity of this function class by

Assume that the activation ϕ\phi is 11-Lipschitz with σˉϕ\bar{\sigma}_{\phi}-Lipschitz derivative. Fix parameters σjj\sigma_{j^{\prime}\leftarrow j}, t(i)t^{(i)}, a(i)a^{(i)}, b(i)b^{(i)}, γ\gamma and reference matrices {A(i)}\{A^{(i)}\}, {B(i)}\{B^{(i)}\}. With probability 1δ1-\delta over the random draws of the distribution PnP_{n}, all neural networks FF with parameters {W(i)}\{W^{(i)}\} satisfying the following data-dependent conditions:

Hidden layers norms are controlled: maxxPnF2i1(x)t(i) 1ir\max_{x\in P_{n}}\|F_{2i\leftarrow 1}(x)\|\leq t^{(i)}\ \forall 1\leq i\leq r.

Jacobians are balanced: maxxPnQjj(x)opσjj j<j\max_{x\in P_{n}}\|Q_{j^{\prime}\leftarrow j}(x)\|_{\textup{op}}\leq\sigma_{j^{\prime}\leftarrow j}\ \forall j<j^{\prime}.

The margin is large: min(x,y)Pn[F(x)]ymaxyy[F(x)]yγ>0\min_{(x,y)\in P_{n}}[F(x)]_{y}-\max_{y^{\prime}\neq y}[F(x)]_{y^{\prime}}\geq\gamma>0.

and the additional data-independent condition

will have the following generalization to test data:

Here we use the convention that σj1j=1\sigma_{j-1\leftarrow j}=1 and let t(0)=maxxPnxt^{(0)}=\max_{x\in P_{n}}\|x\|.

This generalization bound follows straightforwardly via the below Rademacher complexity bound for the augmented loss class:

Suppose that ϕ\phi is 11-Lipschitz with σˉϕ\bar{\sigma}_{\phi}-Lipschitz derivative. Define the following class of neural networks with norm bounds on its weight matrices with respect to reference matrices {A(i)},{B(i)}\{A^{(i)}\},\{B^{(i)}\}:

Fix parameters t(i)t^{(i)} and σjj\sigma_{j^{\prime}\leftarrow j} for jjj^{\prime}\geq j with σ2i2i=1\sigma_{2i\leftarrow 2i}=1 and σ2i12i1=σ(i)\sigma_{2i-1\leftarrow 2i-1}=\sigma^{(i)}. When we apply this theorem, we will choose σjj\sigma_{j^{\prime}\leftarrow j} and t(i)t^{(i)} which upper bound the layer jj to jj^{\prime} Jacobian norm and ii-th hidden layer norm, respectively. Define the class of augmented losses

and define for 1ir1\leq i\leq r, κjacobian,(i),κhidden,(i)\kappa^{\textup{jacobian},(i)},\kappa^{\textup{hidden},(i)} meant to bound the influence of the matrix W(i)W^{(i)} on the Jacobians and hidden variables, respectively as in (18), (19). Then we can bound the empirical Rademacher complexity of the augmented loss class by

We associate the un-augmented loss class on neural networks lγFl_{\gamma}\circ\mathcal{F} with a family of sequential computation graphs G\mathcal{G} with depth 2r12r-1. The composition rules are as follows: for internal node V2iV_{2i}, RV2i={ϕ}\mathfrak{R}_{V_{2i}}=\{\phi\}, the set with only one element: the activation ϕ\phi. We also let RV2i1={hWh:WA(i)2,1a(i),WB(i)1,1b(i),Wopσ(i)}\mathfrak{R}_{V_{2i-1}}=\{h\mapsto Wh:\|W^{\top}-{A^{(i)}}^{\top}\|_{2,1}\leq a^{(i)},\|W-{B^{(i)}}\|_{1,1}\leq b^{(i)},\|W\|_{\textup{op}}\leq\sigma^{(i)}\}. Finally, we choose RO\mathfrak{R}_{O} to be the singleton class {lγ}\{l_{\gamma}\}. Our collection of computation rules is then simply R=RV1RV2r1RO\mathfrak{R}=\mathfrak{R}_{V_{1}}\otimes\cdots\otimes\mathfrak{R}_{V_{2r-1}}\otimes\mathfrak{R}_{O}. Since OGO_{\mathcal{G}} takes values in $,wecanapplyTheorem6.3onthisclass, we can apply Theorem 6.3 on this class\mathcal{G}usingusings_{\mathcal{I}}=\max_{x\in P_{n}}\|x\|,,s_{V_{2i}}=t^{(i)},,s_{V_{2i-1}}=\infty,,\kappa_{2i\leftarrow 2i}=1,,\kappa_{2i-1\leftarrow 2i-1}=\sigma^{(i)},and, and\kappa_{j^{\prime}\leftarrow j}=\sigma_{j^{\prime}\leftarrow j}forforj^{\prime}>j.Furthermore,wenotethat. Furthermore, we note that\bar{\kappa}_{2i}=\bar{\sigma}_{\phi},and, and\bar{\kappa}_{2i-1}=0astheJacobianisconstantformatrixmultiplications.Wethusobtaintheclassas the Jacobian is constant for matrix multiplications. We thus obtain the class\widetilde{\mathcal{G}}whereeachaugmentedlossupperboundsthecorrespondinglossinwhere each augmented loss upper bounds the corresponding loss in\mathcal{G}.Recallthat. Recall thatJ_{i}denotetheadditionalnodesinouraugmentedcomputationgraph.Notethatunderthesechoicesofdenote the additional nodes in our augmented computation graph. Note that under these choices ofs_{V_{2i-1}},,\kappa_{i\leftarrow i}$, we get that

We first note that the last term in (20) is simply 0 because there is exactly one output function in RO\mathfrak{R}_{O}. Now for the other terms of (20): by definition RV2i\mathfrak{R}_{V_{2i}}, RJ2i\mathfrak{R}_{J_{2i}} consist of a singleton set and therefore have log cover size for any error resolution ϵ\epsilon. Otherwise, to cover RV2i1\mathfrak{R}_{V_{2i-1}} it suffices to bound logN(ϵV2i1,{hWh:WA(i)2,1a(i)},2t(i1))\log\mathcal{N}(\epsilon_{V_{2i-1}},\{h\mapsto Wh:\|W^{\top}-{A^{(i)}}^{\top}\|_{2,1}\leq a^{(i)}\},2t^{(i-1)}). Thus, we can apply Lemma A.3 to obtain

Thus, substituting terms into (20) and collecting sums, we obtain that

Now we apply Dudley’s entropy theorem to obtain that

We start with Theorem A.2, which bounds the Rademacher complexity of the augmented loss class Laug\mathcal{L}_{\textup{aug}}. Using laug(F,x,y)l_{\textup{aug}}(F,x,y) to denote the application of this augmented loss on the network FF, its weights, and data (x,y)(x,y), we first note that l0-1(F(x),y)lγ(F(x),y)laug(F,x,y)l_{\textup{0-1}}(F(x),y)\leq l_{\gamma}(F(x),y)\leq l_{\textup{aug}}(F,x,y) for any datapoint (x,y)(x,y). We used the fact that margin loss upper bounds 0-1 loss, and laugl_{\textup{aug}} upper bounds margin loss by the construction in Theorem 6.3. Thus, applying the standard Rademacher generalization bound, with probability 1δ1-\delta over the training data, it holds that

Plugging in the bound on Radn(Laug)\textup{Rad}_{n}(\mathcal{L}_{\textup{aug}}) from Theorem A.2 gives the desired result. ∎

Finally, to prove Theorems 7.1 and 1.1, we simply take a union bound over the choices of parameters σjj,t(i),a(i),b(i)\sigma_{j^{\prime}\leftarrow j},t^{(i)},a^{(i)},b^{(i)}.

We will apply Theorem A.1 repeatedly over a grid of parameter choices t(i)t^{(i)}, σjj\sigma_{j^{\prime}\leftarrow j}, a(i)a^{(i)}, b(i)b^{(i)} (following a technique of Bartlett et al. ). For a collection M\mathcal{M} of nonnegative integers mt(i)m_{t}^{(i)}, mσ(jj)m_{\sigma}^{(j^{\prime}\leftarrow j)}, ma(i)m_{a}^{(i)}, mb(i)m_{b}^{(i)}, mγm_{\gamma}, we apply Theorem A.1 choosing t(i)=poly(r)12mt(i)t^{(i)}=\textup{poly}(r)^{-1}2^{m_{t}^{(i)}}, σjj=poly(r)12mσ(jj)\sigma_{j^{\prime}\leftarrow j}=\textup{poly}(r)^{-1}2^{m_{\sigma}^{(j^{\prime}\leftarrow j)}}, a(i)=poly(r)12ma(i)a^{(i)}=\textup{poly}(r)^{-1}2^{m_{a}^{(i)}}, b(i)=poly(r)12mb(i)b^{(i)}=\textup{poly}(r)^{-1}2^{m_{b}^{(i)}}, γ=2mγpoly(r)maxiσ2r12i\gamma=2^{-m_{\gamma}}\textup{poly}(r)\max_{i}\sigma_{2r-1\leftarrow 2i} and using error probability δMδ2mMm+1\delta_{\mathcal{M}}\triangleq\frac{\delta}{2^{\sum_{m\in\mathcal{M}}m+1}}. First, we note that by union bound, using the fact that choices of Mδ2mMm+1=δ\sum_{\textup{choices of }\mathcal{M}}\frac{\delta}{2^{\sum_{m\in\mathcal{M}}m+1}}=\delta where M\mathcal{M} ranges over nonnegative integers, we get that the generalization bound of Theorem A.1 holds for choices of M\mathcal{M} with probability 1 - δ\delta.

Now for the network FF at hand, there would have been some choice of M\mathcal{M} for which the bound was applied using parameters t^(i)\hat{t}^{(i)}, σ^jj\hat{\sigma}_{j^{\prime}\leftarrow j}, a^(i)\hat{a}^{(i)}, b^(i)\hat{b}^{(i)}, γ^\hat{\gamma} and

The proof of the simpler Theorem 1.1, follows the same above argument. The only difference is that we union bound over parameters σ,t\sigma,t and the matrix norms. ∎

Let s=maxxPnxs=\max_{x\in P_{n}}\|x\| be an upper bound on the largest norm of a datapoint. Then the following bound relates Rademacher complexity to covering numbers:

Appendix B Missing Proofs in Section 5

We prove the theorem by induction on the number of non-input vertices in the vertex set V\mathcal{V}. The statement is true if OO is the only non-input node in the graph: to cover the graph output with error ϵO\epsilon_{O}, we simply cover RO\mathfrak{R}_{O}.

Given a family of graphs G\mathcal{G} (with shared edges E\mathcal{E} and nodes V\mathcal{V}), we assume the inductive hypothesis that “for any family of graphs with more than I|\mathcal{I}| input vertices, the theorem statement holds.” Under this hypothesis, we will show that the theorem statement holds for the graph family G\mathcal{G}.

We take node V1V_{1} from the forest ordering (V1,,Vm)(V_{1},\dots,V_{m}) assumed in the theorem. Suppose V1V_{1} depends on C1,,CtC_{1},\dots,C_{t}, which are assumed to be the input nodes by the definition of forest ordering. We release the node V1V_{1} from the graph and obtain a new family G\V1={G\V1:GG}\mathcal{G}^{\backslash V_{1}}=\{G^{\backslash V_{1}}:G\in\mathcal{G}\} with a smaller number of edges than that of G\mathcal{G}.

Define u(h,x)OG\V1(h,x)u(h,x)\triangleq O_{G^{\backslash V_{1}}}(h,x) for hDV1h\in\mathcal{D}_{V_{1}} and xDIx\in\mathcal{D}_{\mathcal{I}}, and w(x)=V1(x)w(x)=V_{1}(x). Then we can check that u(w(x),x)=OG(x)u(w(x),x)=O_{G}(x). Let U={OG\V1:GG}\mathcal{U}=\{O_{G^{\backslash V_{1}}}:G\in\mathcal{G}\}, and let W=RV1\mathcal{W}=\mathfrak{R}_{V_{1}}. As each function in U\mathcal{U} is κV1\kappa_{V_{1}}-Lipschitz in V1V_{1} because of condition 1, and it equals the fixed constant cc if V1sV{|\kern-1.07639pt|\kern-1.07639pt|V_{1}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{V} or CisCi{|\kern-1.07639pt|\kern-1.07639pt|C_{i}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{C_{i}}, we have U,W\mathcal{U},\mathcal{W} satisfies the conditions of the composition lemma (see Lemma B.1). With the lemma, we conclude:

Note that by the definition of forest ordering, we have that (V2,,Vm)(V_{2},\dots,V_{m}) is a forest ordering of G\V1G^{\backslash V_{1}} and by the assumption 1 of the theorem, we have that (V2,,Vm)(V_{2},\dots,V_{m}) satisfies the condition 1 for the graph family G\V1\mathcal{G}^{\backslash V_{1}}. G\V1\mathcal{G}^{\backslash V_{1}} has one more input node than G\mathcal{G}, so we can invoke the inductive hypothesis on G\V1\mathcal{G}^{\backslash V_{1}} and obtain

Combining equation (23) and (24) above, we prove (14) for G\mathcal{G}, and complete the induction. ∎

Below we provide the composition lemma necessary for Theorem 5.3.

is a family of functions with two arguments and W{x(1),,x(m)Dx(1)Dx(m)Dh}\mathcal{W}\subseteq\{x^{(1)},\ldots,x^{(m)}\in\mathcal{D}_{x}^{(1)}\otimes\cdots\otimes\mathcal{D}_{x}^{(m)}\mapsto\mathcal{D}_{h}\} is another family of functions. We overload notation and refer to x(1),,x(m)x^{(1)},\ldots,x^{(m)} as xx. The spaces Dh,Dx,Du\mathcal{D}_{h},\mathcal{D}_{x},\mathcal{D}_{u} all associate with some norms {|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|} (the norms can potentially be different for each space, but we use the same notation for all of them.) Assume the following:

All functions in U\mathcal{U} are κ\kappa-Lipschitz in the argument hh for any possible choice of xx: for any uUu\in\mathcal{U}, xDxx\in\mathcal{D}_{x}, and h,hDhh,h^{\prime}\in\mathcal{D}_{h}, we have u(h,x)u(h,x)κhh{|\kern-1.07639pt|\kern-1.07639pt|u(h,x)-u(h^{\prime},x)|\kern-1.07639pt|\kern-1.07639pt|}\leq\kappa{|\kern-1.07639pt|\kern-1.07639pt|h-h^{\prime}|\kern-1.07639pt|\kern-1.07639pt|}.

Any function uUu\in\mathcal{U} collapses on inputs with large norms: there exists a constant bb such that u(h,x)=bu(h,x)=b if hsh{|\kern-1.07639pt|\kern-1.07639pt|h|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{h} or x(i)sx(i){|\kern-1.07639pt|\kern-1.07639pt|x^{(i)}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{x}^{(i)} for any ii.

Then, the family of the composition of uu and ww, Z={z(x)=u(w(x),x):uU,wW}\mathcal{Z}=\left\{z(x)=u(w(x),x):u\in\mathcal{U},w\in\mathcal{W}\right\}, has covering number bound:

When it is clear from context, we let xsx{|\kern-1.07639pt|\kern-1.07639pt|x|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x} denote the statement that x(i)sx(i) i{|\kern-1.07639pt|\kern-1.07639pt|x^{(i)}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}^{(i)}\ \forall i. Suppose PnP_{n} is a uniform distribution over nn data points {x1,,xn}Dx\{x_{1},\dots,x_{n}\}\subset\mathcal{D}_{x} with norms not larger than sxs_{x}. Given function uUu\in\mathcal{U} and wWw\in\mathcal{W}, we will construct a pair of functions such that u^(w^(x),x)\hat{u}(\hat{w}(x),x) covers u(w(x),x)u(w(x),x). We will count (in a straightforward way) how many distinct pairs of functions we have construct for all the (u,w)(u,w) pairs at the end of the proof.

Let PP^{\prime} be the uniform distribution over {xi:xisx}\{x_{i}:{|\kern-1.07639pt|\kern-1.07639pt|x_{i}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}\}, and suppose W^\hat{\mathcal{W}} is a ϵwnsupp(P)\epsilon_{w}\sqrt{\frac{n}{|\text{supp}(P^{\prime})|}} error cover of W\mathcal{W} with respect to the metric L2(P,)L_{2}(P^{\prime},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}). We note that W^\hat{\mathcal{W}} has size at most N(ϵw,W,sx)\mathcal{N}(\epsilon_{w},\mathcal{W},s_{x}). We found w^W\hat{w}\in\mathcal{W} such that w^\hat{w} is ϵw\epsilon_{w}-close to ww in metric L2(P,)L_{2}(P^{\prime},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}). Let h^i\hat{h}_{i} denote w^(xi)\hat{w}(x_{i}). Let QQ^{\prime} be the uniform distribution over {(h^i,xi):h^ish,xisx}\{(\hat{h}_{i},x_{i}):{|\kern-1.07639pt|\kern-1.07639pt|\hat{h}_{i}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{h},{|\kern-1.07639pt|\kern-1.07639pt|x_{i}|\kern-1.07639pt|\kern-1.07639pt|}\leq s_{x}\}, and let QQ be the uniform distribution over all nn points, {(h^1,x1),,(h^n,xn)}\{(\hat{h}_{1},x_{1}),\ldots,(\hat{h}_{n},x_{n})\}. Now we construct a intermediate cover U^\widehat{\mathcal{U}}^{\prime} (that depends on w^\hat{w} implicitly) that covers U\mathcal{U} with ϵunsupp(Q)\epsilon_{u}\sqrt{\frac{n}{|\text{supp}(Q^{\prime})|}} error with respect to the metric L2(Q,)L_{2}(Q^{\prime},{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}). We augment this to a cover U^\widehat{\mathcal{U}} that covers U\mathcal{U} with respect to metric L2(Q,)L_{2}(Q,{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}) as follows: for every u^U^\hat{u}^{\prime}\in\widehat{\mathcal{U}}^{\prime}, add the function u^\hat{u} to U^\widehat{\mathcal{U}} with

Note that by construction, the size of U^\widehat{\mathcal{U}} is at most N(ϵu,U,(sh,sx))\mathcal{N}(\epsilon_{u},\mathcal{U},(s_{h},s_{x})). Now let u^U^\hat{u}^{\prime}\in\widehat{\mathcal{U}}^{\prime} be the cover element for uu w.r.t. L2(Q,)L_{2}(Q,{|\kern-1.07639pt|\kern-1.07639pt|\cdot|\kern-1.07639pt|\kern-1.07639pt|}), and u^\hat{u} be the corresponding cover element in U^\widehat{\mathcal{U}}. Because u^(h^,x)=b=u(h^,x)\hat{u}(\hat{h},x)=b=u(\hat{h},x) when h^sh{|\kern-1.07639pt|\kern-1.07639pt|\hat{h}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{h} or x(i)sx(i){|\kern-1.07639pt|\kern-1.07639pt|x^{(i)}|\kern-1.07639pt|\kern-1.07639pt|}\geq s_{x}^{(i)} for some ii,

Then we bound the difference between u(h^,x)u(\hat{h},x) and u(h,x)u(h,x) by Lipschitzness; since u(h^,x)=u(h,x)=bu(\hat{h},x)=u(h,x)=b when x>sx{|\kern-1.07639pt|\kern-1.07639pt|x|\kern-1.07639pt|\kern-1.07639pt|}>s_{x},

where in the last step we used the property of the cover G\mathcal{G}. Finally, by triangle inequality, we get that

Finally we count how many (w^,u^)(\hat{w},\hat{u}) we have constructed: W^\hat{\mathcal{W}} is of size at most N(ϵw,W,sx)\mathcal{N}(\epsilon_{w},\mathcal{W},s_{x}). and for every w^W^\hat{w}\in\hat{\mathcal{W}}, we’ve constructed a family of functions U^\widehat{\mathcal{U}} (that depends on w^\hat{w}) of size at most N(ϵu,U,(sh,sx))\mathcal{N}(\epsilon_{u},\mathcal{U},(s_{h},s_{x})). Therefore, the total size of the cover is at most N(ϵw,W,sx)N(ϵu,U,(sh,sx))\mathcal{N}(\epsilon_{w},\mathcal{W},s_{x})\cdot\mathcal{N}(\epsilon_{u},\mathcal{U},(s_{h},s_{x})). ∎

Appendix C Missing Proofs in Section 6

We first state the proofs of Theorem 6.2 and Theorem 6.3, which follow straightforwardly from the technical tools developed in Section D.

Now we prove release-Lipschitzness for a prefix sequence S\mathcal{S}^{\prime} of S\mathcal{S} that ends in node JiJ_{i}. For all jij\neq i, fix DjDJjD_{j}\in\mathcal{D}_{J_{j}}. It suffices to show that the function QQ defined by

We first construct an augmented family of graphs G\mathcal{G}^{\prime} sharing the same vertices and edges as G\mathcal{G}. For GGG\in\mathcal{G}, we add GG^{\prime} to G\mathcal{G}^{\prime} computing

This is achieved by modifying the family of output rules as follows:

Now all terms match (16) except for the term logN(ϵO,R~O,{2sVi}{I}{2sJi}i1)\log\mathcal{N}(\epsilon_{O},\widetilde{\mathfrak{R}}_{O},\{2s_{V_{i}}\}\cup\{I\}\cup\{2s_{J_{i}}\}_{i\geq 1}). First, we note that all functions in R~O\widetilde{\mathfrak{R}}_{O} can be written in the form

This allows us to conclude (16). Finally, we note that as the augmentation operations are in the form of those considered in Claim H.1, it follows that OG~O_{\widetilde{G}} upper bounds OGO_{G}. ∎

Appendix D Technical Tools for Lipschitz Augmentation

In this section, we develop the technical tools needed for proving Theorem 6.2. The main result in this section is our Lemma D.1, which essentially states that augmenting the loss with a product of Jacobians (plus additional matrices meant to model previous Jacobian nodes already released from the computational graph) will make the loss Lipschitz.

For this section, we say a function JJ taking input xDx\in\mathcal{D} and outputting an operator mapping D\mathcal{D} to D\mathcal{D}^{\prime} is κ\kappa-Lipschitz if J(x)J(x)opκxx\|J(x)-J(x^{\prime})\|_{\textup{op}}\leq\kappa\|x-x^{\prime}\| for any x,xx,x^{\prime} in its input domain. We will consider functions f1,,fkf_{1},\ldots,f_{k}, where fi:Di1Dif_{i}:\mathcal{D}_{i-1}\rightarrow\mathcal{D}_{i} and D0\mathcal{D}_{0} is a compact subset of some normed space. For ease of notation, we use \|\cdot\| to denote the (possibly distinct) norms on D0,,Dk\mathcal{D}_{0},\ldots,\mathcal{D}_{k}. For 1ijk1\leq i\leq j\leq k, Let fji:Di1Djf_{j\leftarrow i}:\mathcal{D}_{i-1}\rightarrow\mathcal{D}_{j} denote the composition

For convenience in indexing, for (i,j)(i,j) with i>ji>j, we will set fji:Di1Di1f_{j\leftarrow i}:\mathcal{D}_{i-1}\rightarrow\mathcal{D}_{i-1} to be the identity function.

Finally consider a real-valued function g:D0Dkg:\mathcal{D}_{0}\otimes\cdots\otimes\mathcal{D}_{k}\rightarrow and define the composition z:D0z:\mathcal{D}_{0}\mapsto by

We will construct a “Lipschitz-fication” for the function zz.

Let A1,,AmA_{1},\ldots,A_{m} denote a collection of linear operators that map to the space D0\mathcal{D}_{0}. We will furthermore use Jji,mJ_{j\leftarrow i,m^{\prime}} to denote the ii-to-jj Jacobian, i.e.

When i=1i=1 and 0jk0\leq j\leq k, we will also consider products between 11-to-jj Jacobians and the matrices AmA_{m^{\prime}}: define

Note in particular that J01,m=AmJ_{0\leftarrow 1,m^{\prime}}=A_{m^{\prime}}.

[Lipschitz-fication] Following the notation in this section, suppose that gg is ckc_{k^{\prime}}-Lipschitz in its (k+1)(k^{\prime}+1)-th argument for 0kk0\leq k^{\prime}\leq k. Suppose that DfjjDf_{j\leftarrow j} is τˉj\bar{\tau}_{j}-Lipschitz for all 1jk1\leq j\leq k. For any (i,j)(i,j) with 1ijk1\leq i\leq j\leq k, let τji\tau_{j\leftarrow i} be parameters that intend to be a tight bound on Jjiop\|J_{j\leftarrow i}\|_{\textup{op}}, and also define τj1,m\tau_{j\leftarrow 1,m^{\prime}} which will bound Jj1,mop\|J_{j\leftarrow 1,m^{\prime}}\|_{\textup{op}}. Define the augmented function zˉ:D0\bar{z}:\mathcal{D}_{0}\mapsto by

We define the following order Q\succ_{\mathcal{Q}} on this collection of functions:

Now note that by Claim D.7, we have the bound

Define τˉ\bar{\tau} to be the Lipschitz constant of JjiJ_{j\leftarrow i} on D0\mathcal{D}_{0} for all 1ijk1\leq i\leq j\leq k guaranteed by Claim D.6. First, note that Δ01,m=0\Delta_{0\leftarrow 1,m^{\prime}}=0 for all mm^{\prime}. Thus, by Claims D.4 and D.5, it follows that

Now note that if ν2minijτjiτˉ\|\nu\|\leq\frac{2\min_{i\leq j}\tau_{j\leftarrow i}}{\bar{\tau}}, then it follows that 2τji+τˉ2ν3τjiij2\tau_{j\leftarrow i}+\frac{\bar{\tau}}{2}\|\nu\|\leq 3\tau_{j\leftarrow i}\forall i\leq j. Substituting into (30), we get that x,ν2minijτjiτˉ\forall x,\|\nu\|\leq\frac{2\min_{i\leq j}\tau_{j\leftarrow i}}{\bar{\tau}},

In the setting of Lemma D.1, for 1ijk1\leq i\leq j\leq k, we can expand the error Jji(x)Jji(x+ν)J_{j\leftarrow i}(x)-J_{j\leftarrow i}(x+\nu) as follows:

Furthermore, for 1jk,m1\leq j\leq k,m^{\prime}, we can expand the error Jj1,m(x)Jj1,m(x+ν)J_{j\leftarrow 1,m^{\prime}}(x)-J_{j\leftarrow 1,m^{\prime}}(x+\nu) as follows:

We will first show (31) by inducting on jij-i. The base case j=ij=i follows by definition, as we can reduce Jii+1J_{i\leftarrow i+1} and Ji1iJ_{i-1\leftarrow i} to constant-valued functions that output the identity matrix.

For the inductive step, we use Claim H.2 to expand

To prove (32), we first note that by definition, Jj1,m(x)=Jj1(x)J01,mJ_{j\leftarrow 1,m^{\prime}}(x)=J_{j\leftarrow 1}(x)J_{0\leftarrow 1,m^{\prime}}, so

In the setting of Lemma D.1, suppose that JjiJ_{j\leftarrow i} is τˉ\bar{\tau}-Lipschitz for all 1ijk1\leq i\leq j\leq k. Then we can bound the operator norm error in the Jacobian by

Likewise, we can bound the operator norm error in the product between Jacobian and auxiliary matrices by

We will first prove (34), as the proof of (35) is nearly identical. Starting from (31) of Claim D.2, we have

By triangle inequality and the fact that JjiJ_{j^{\prime}\leftarrow i^{\prime}} is τˉ\bar{\tau}-Lipschitz ij\forall i^{\prime}\leq j^{\prime}, it follows that

Plugging the above into (37), we get \eqrefeq:jacobianerr1\eqref{eq:jacobian_err-1}. To prove (35), we start from (32) and follow the same steps as above. ∎

In the setting of Lemma D.1, suppose that JjiJ_{j\leftarrow i} is τˉ\bar{\tau}-Lipschitz for all 1ijk1\leq i\leq j\leq k. Then we can upper bound the error terms corresponding to the indicators by

Likewise, the following upper bound holds for all (j,m)(j,m^{\prime}) with 1jk,1mm1\leq j\leq k,1\leq m^{\prime}\leq m:

Plugging this into our definition for Δji\Delta_{j\leftarrow i} (28), it follows that

Note that if xEx\notin\mathcal{E}, then i<j\exists i^{\prime}<j^{\prime} such that Qji(x)=0Q_{j^{\prime}\leftarrow i^{\prime}}(x)=0 and QjiQQjiQ_{j^{\prime}\leftarrow i^{\prime}}\succ_{\mathcal{Q}}Q_{j\leftarrow i} by definition of the order Q\succ_{\mathcal{Q}}. It follows that if xEx\notin\mathcal{E}, hQQjih(x)=0\prod_{h\succ_{\mathcal{Q}}Q_{j\leftarrow i}}h(x)=0, so Δji(x,ν)=0|\Delta_{j\leftarrow i}(x,\nu)|=0. Otherwise, if xEx\in\mathcal{E}, by Claim D.3 we have

where we recall that τi1i=1\tau_{i-1\leftarrow i}=1. Plugging this into (40) and using the fact that all functions hQh\in\mathcal{Q} are bounded by 1 gives the desired statement.

To prove (39), we simply apply the above argument with (35). ∎

In the setting of Lemma D.1, fix index jj with 0jk0\leq j\leq k and suppose that Jj1J_{j\leftarrow 1} is τˉ\bar{\tau}-Lipschitz. Then we can bound the error due to function composition by

Starting from (27), we can first express δi(x,ν)\delta_{i}(x,\nu) by

as Qj1QQj1,1Q_{j\leftarrow 1}\succ_{\mathcal{Q}}Q_{j\leftarrow 1,1}. First we note that by definition, γj(x,ν)cjfj1(x)fj1(x+ν)|\gamma_{j}(x,\nu)|\leq c_{j}\|f_{j\leftarrow 1}(x)-f_{j\leftarrow 1}(x+\nu)\|, as the function gg is cjc_{j}-Lipschitz in its jj-th argument. Thus, since all functions QQQ\in\mathcal{Q} are bounded by 11, it follows that

In the setting of Lemma D.1, τˉ\exists\bar{\tau} such that ij\forall i\leq j, JjiJ_{j\leftarrow i} is τˉ\bar{\tau}-Lipschitz on a compact domain D0\mathcal{D}_{0}.

We first show inductively that fi1f_{i\leftarrow 1} is Lipschitz for all ii. The base case f11f_{1\leftarrow 1} follows by definition, as f11f_{1\leftarrow 1} is continuously differentiable and D0\mathcal{D}_{0} is a compact set.

Now we show the inductive step: first write fi1=fifi11f_{i\leftarrow 1}=f_{i}\circ f_{i-1\leftarrow 1}. By continuity, {fi11(x):xD0}\{f_{i-1\leftarrow 1}(x):x\in\mathcal{D}_{0}\} is compact. Furthermore, fif_{i} is continuously differentiable under the assumptions of Lemma D.1. Thus, fif_{i} is Lipschitz on domain {fi11(x):xD0}\{f_{i-1\leftarrow 1}(x):x\in\mathcal{D}_{0}\}. As fi1=fifi11f_{i\leftarrow 1}=f_{i}\circ f_{i-1\leftarrow 1} is the composition of Lipschitz functions by the inductive hypothesis, fi1f_{i\leftarrow 1} is itself Lipschitz.

Now it follows that i\forall i, JiiJ_{i\leftarrow i} is Lipschitz on D0\mathcal{D}_{0}, as it is the composition of DfiiDf_{i\leftarrow i} and fi11f_{i-1\leftarrow 1}, both of which are Lipschitz. Finally, by the chain rule (Claim H.2), we have that Jji=JjjJiiJ_{j\leftarrow i}=J_{j\leftarrow j}\cdots J_{i\leftarrow i} is the product of Lipschitz functions, and therefore Lipschitz for all i<ji<j. We simply take τˉ\bar{\tau} to be the maximum Lipschitz constant of JjiJ_{j\leftarrow i} over all iji\leq j. ∎

For 0jk+10\leq j\leq k+1, define zj(x,ν)z_{j}(x,\nu) by

Thus, zj(x,ν)z_{j}(x,\nu) denotes g(f01fk1)g\circ(f_{0\leftarrow 1}\otimes\ldots\otimes f_{k\leftarrow 1}) with the last k+1jk+1-j inputs to gg depending on x+νx+\nu instead of xx. Now we claim that by a telescoping argument (Claim H.3),

To see this, compute the sum in the order the following sequence of terms, which corresponds to a traversal of Q\mathcal{Q} in least-to-greatest order:

Now we simply apply triangle inequality on (41) and use the fact that zj(x,ν)1 0jk+1z_{j}(x,\nu)-1\in\ \forall 0\leq j\leq k+1 to obtain the desired statement. ∎

In the setting of Theorem 6.2, fix 1i0pt1\leq i\leq 0pt and define

Here for convenience we use the convention that κi1i=1\kappa_{i-1\leftarrow i}=1.

Appendix E Application to Recurrent Neural Networks

In this section, we will apply our techniques to recurrent neural networks. Suppose that we are in a classification setting. For simplicity, we will assume that the hidden layer and input dimensions are dd. We will define a recurrent neural network with r1r-1 activation layers as follows using parameters W,U,YW,U,Y, activation ϕ\phi and input sequence x=(x(0),,x(r2))x=(x^{(0)},\ldots,x^{(r-2)}):

where h(0)h^{(0)} is set to be 0. Now following the convention of Section 7, we will define the interlayer Jacobians. For odd indices 2i12i-1, ir1i\leq r-1, we simply set Q2i12i1Q_{2i-1\leftarrow 2i-1} to the constant function xWx\mapsto W. For even indices 2i2i, ir1i\leq r-1, we set Q2i2i(x)Dϕ[h2i1(x)+u(i1)(x)]Q_{2i\leftarrow 2i}(x)\triangleq D\phi[h^{2i-1}(x)+u^{(i-1)}(x)], the Jacobian of the activation applied to the input of h(2i)(x)h^{(2i)}(x). Finally, we set Q2r12r1Q_{2r-1\leftarrow 2r-1} to be the constant function xYx\mapsto Y. Now for i>ii^{\prime}>i, we set Qii(x)=Qii(x)Qii(x)Q_{i^{\prime}\leftarrow i}(x)=Q_{i^{\prime}\leftarrow i^{\prime}}(x)\cdots Q_{i\leftarrow i}(x). If i<ii^{\prime}<i, we set QiiQ_{i^{\prime}\leftarrow i} to the identity matrix.

With this notation in place, we can state our generalization bound for RNN’s:

Assume that the activation ϕ\phi is 1-Lipschitz with a σˉϕ\bar{\sigma}_{\phi}-Lipschitz derivative. With probability 1δ1-\delta over the random draws of PnP_{n}, all RNNs FF will satisfy the following generalization guarantee:

where κjacobian,(i)1j2i1j2r1σj2iσ2i2jσjj\kappa^{\textup{jacobian},(i)}\triangleq\sum_{1\leq j\leq 2i-1\leq j^{\prime}\leq 2r-1}\frac{\sigma_{j^{\prime}\leftarrow 2i}\sigma_{2i-2\leftarrow j}}{\sigma_{j^{\prime}\leftarrow j}}, and

In these expressions, we define σj1j=1\sigma_{j-1\leftarrow j}=1, and:

Note that the training error here is because of the existence of positive margin γ\gamma.

Our proof follows the template of Theorem 7.1: we bound the Rademacher complexity of some augmented RNN loss. We then argue for generalization of the augmented loss and perform a union bound over all the choices of parameters. As the latter steps are identical to those in the proof of Theorem 7.1, we omit these and focus on bounding the Rademacher complexity of an augmented RNN loss.

Suppose that ϕ\phi is 11-Lipschitz with σˉϕ\bar{\sigma}_{\phi}-Lipschitz derivative. Define the following class of RNNs with bounded weight matrices:

and let σjj\sigma_{j^{\prime}\leftarrow j} be parameters that will bound the jj to jj^{\prime} layerwise Jacobian for jjj^{\prime}\geq j, where we set σ2i2i=1\sigma_{2i\leftarrow 2i}=1 and σ2i12i1=σW\sigma_{2i-1\leftarrow 2i-1}=\sigma_{W} for ir1i\leq r-1, σ2r12r1=σY\sigma_{2r-1\leftarrow 2r-1}=\sigma_{Y}. Let t(i)t^{(i)} be parameters bounding the layer norm after applying the ii-th activation, and let t(0)=0,tdata=maxxPnmaxix(i)t^{(0)}=0,t^{\textup{data}}=\max_{x\in P_{n}}\max_{i}\|x^{(i)}\|. Define the class of augmented losses

and define for 1ir1\leq i\leq r, κjacobian,(i),κhidden,(i)\kappa^{\textup{jacobian},(i)},\kappa^{\textup{hidden},(i)} meant to bound the influence of the matrix W(i)W^{(i)} on the Jacobians and hidden variables, respectively as in (18), (19). Then we can bound the empirical Rademacher complexity of the augmented loss class by

where κrnn-hidden,(i),κrnn-jacobian,(i)\kappa^{\textup{rnn-hidden},(i)},\kappa^{\textup{rnn-jacobian},(i)} are defined in Theorem E.1.

We will associate the family of losses Lrnn-aug\mathcal{L}_{\textup{rnn-aug}} with a computational graph structure on internal nodes H1,H2,,H2r1H_{1},H_{2},\ldots,H_{2r-1}, J1,,J2r1J_{1},\ldots,J_{2r-1}, K0,,Kr2K_{0},\ldots,K_{r-2}, input nodes H0,I0,,Ir2H_{0},I_{0},\ldots,I_{r-2}, and output node OO with the following edges:

Nodes Hi,JiH_{i},J_{i} will point towards the output OO.

Node HiH_{i} will point towards nodes Hi+1H_{i+1} and Ji+1J_{i+1}.

Node Ki1K_{i-1} will point towards node H2iH_{2i} and node J2iJ_{2i}.

Node IiI_{i} will point towards node KiK_{i}.

We now define the composition rules at each node:

Finally, nodes J2i1J_{2i-1} will have composition rule RJ2i1=DRH2i1R_{J_{2i-1}}=DR_{H_{2i-1}}. Finally, the output node OO will have composition rule

With this condition established, we can complete the proof via the same covering number argument as in Theorem A.2. ∎

Now as in the proof of Theorem 7.1, we first observe that the augmented loss upper bounds the 0-1 classification loss, giving us a 0-1 test error bound. We then apply the same union bound technique over parameters γ,t(i),σjj,aW,aU,aY\gamma,t^{(i)},\sigma_{j^{\prime}\leftarrow j},a_{W},a_{U},a_{Y}, as in the proof of Theorem 7.1.

Appendix F ReLU Networks

In this section, we apply our augmentation technique to relu networks to produce a generalization bound similar to that of Nagarajan and Kolter , which is polynomial in the Jacobian norms, hidden layer norms, and inverse pre-activations.

Recall the definition of neural nets in Example 5.1: the neural net with parameters {W(i)}\{W^{(i)}\} and activation ϕ\phi is defined by

For this section, we will set ϕ\phi to be the relu activation. We also use the same notation for layers and indexing as Section 7. We first state our generalization bound for relu networks:

Fix reference matrices {A(i)},{B(i)}\{A^{(i)}\},\{B^{(i)}\}. With probability 1δ1-\delta over the random draws of the data PnP_{n}, all neural networks FF with relu activations parameterized by {W(i)}\{W^{(i)}\} will have the following generalization guarantee

In these expressions, we define σj1j=1\sigma_{j-1\leftarrow j}=1, γ(i)\gamma^{(i)} to be the minimum pre-activation after the ii-th weight matrix over all coordinates in the ii-th layer and all datapoints:

where [F2i11(x)]j[F_{2i-1\leftarrow 1}(x)]_{j} indexes the jj-th coordinate of F2i11(x)F_{2i-1\leftarrow 1}(x), and additionally use

Note that we assume the existence of a positive margin, so the training error here is .

We note that compared to Theorem 7.1, κrelu-jacobian,(i)=κjacobian,(i)\kappa^{\textup{relu-jacobian},(i)}=\kappa^{\textup{jacobian},(i)}, but κrelu-hidden,(i)\kappa^{\textup{relu-hidden},(i)} now has a dependence on the preactivations γ(i)\gamma^{(i)}, as in Nagarajan and Kolter .

We provide a proof sketch of Theorem F.1 here. We first bound the Rademacher complexity some family of augmented losses, specified precisely in Theorem F.2. The rest of the argument then follows the same way as the proof of Theorem 7.1: using Rademacher complexity to argue that the augmented losses generalize, applying the fact that the augmented losses upper-bound the 0-1 loss, and then union bounding over all choices of parameters.

Following the definitions in Theorem A.2, let F\mathcal{F} denote the class of neural networks, σjj\sigma_{j^{\prime}\leftarrow j} be parameters intended to bound the spectral norm of the jj to jj^{\prime} layerwise Jacobian, and t(i)t^{(i)} be parameters bounding the layer norm after applying the ii-th activation. Define γ(i)\gamma^{(i)} as parameters intended to lower bound the minimum preactivations after the ii-th linear layer. Define the class of augmented losses

As in the proof of Theorem A.2, associate the loss class Lrelu-aug\mathcal{L}_{\textup{relu-aug}} with a family G~\widetilde{\mathcal{G}} of computation graphs on internal nodes V1,,V2r1,J1,,J2r1V_{1},\ldots,V_{2r-1},J_{1},\ldots,J_{2r-1} as follows: define the graph structure to be identical to the Lipschitz augmentation of a sequential computation graph family (Figure 3) and define the composition rules

Assign to the JiJ_{i} nodes composition rule RJi=DRViR_{J_{i}}=DR_{V_{i}}, and finally, assign to the output node OO the composition rule

The resulting family of computation graphs will compute Lrelu-aug\mathcal{L}_{\textup{relu-aug}}. Now we claim that G~\widetilde{\mathcal{G}} is κrelu-hidden,(i)\kappa^{\textup{relu-hidden},(i)}-release-Lipschitz in nodes V2i1V_{2i-1} and κrelu-jacobian,(i)\kappa^{\textup{relu-jacobian},(i)}-release-Lipschitz in nodes J2i1J_{2i-1}. (Note that the Lipschitzness of nodes V2i,J2iV_{2i},J_{2i} will not matter because the associated function classes and singletons and therefore have a log covering number of 0 anyways).

The argument for the κrelu-jacobian,(i)\kappa^{\textup{relu-jacobian},(i)}-release-Lipschitzness of J2i1J_{2i-1} follows analogously to the argument of Lemma D.8 and Theorem A.2.

Finally, to conclude the desired Rademacher complexity bounds given the release-Lipschitzness, we apply the same reasoning as in Theorem A.2. ∎

Appendix G Additional Experimental Details

For all settings, we train for 200 epochs with learning rate decay by a factor of 0.2 at epochs 60, 120, and 150. We additionally tuned the value of λ\lambda from values {0.1,0.05,0.01}\{0.1,0.05,0.01\} for each setting: for the experiments displayed in Figure 1, we used the following values:

For all other hyperparameters, we use the defaults in the PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch, and we base our code off of this implementation. We report results from a single run as the improvement with Jacobian regularization is statistically significant. We train on a single NVIDIA TitanXp GPU.

G.2 Empirical Scaling of our Complexity Measure with Depth

In this section, we empirically demonstrate that the leading term of our bounds can exhibit better scaling in depth than prior work.

We compute leading terms of our bound: imaxxPnh(i)(x)2maxxPnJ(i)(x)opγ\frac{\sum_{i}\max_{x\in P_{n}}\|h^{(i)}(x)\|_{2}\max_{x\in P_{n}}\|J^{(i)}(x)\|\|_{\textup{op}}}{\gamma}, where ii ranges over the layers, h(i)h^{(i)}, J(i)J^{(i)} denote the ii-th hidden layer and Jacobian of the output with respect to the ii-th hidden layer, respectively, and γ\gamma denotes the smallest positive margin on the training dataset. We compare this quantity with that of the bound of [Bartlett et al., 2017]: iW(i)op/γ\prod_{i}\|W^{(i)}\|_{\textup{op}}/\gamma. In Figure 5, we plot this comparison for WideResNetOur bound as stated in the paper technically does not apply to ResNet because the skip connections complicate the Lipschitz augmentation step. This can be remedied with a slight modification to our augmentation step, which we omit for simplicity. models of depths 10, 16, 22, 28 trained on CIFAR10. For all models, we remove data augmentation to ensure that our models fit the training data perfectly. We train each model for 50 epochs, which is sufficient for perfectly fitting the training data, and start from an initial learning rate of 0.1 which we decrease by a factor of 10 at epoch 30. All other parameters are set to the same as their defaults in the PyTorch WideResNet implementation: https://github.com/xternalz/WideResNet-pytorch. We plot the final complexity measures computed on a single model. We note that our models are trained with Batchnorm. At test time, these Batchnorm layers compute affine transformations, so we compute the bound by merging these transformations with the adjacent linear layer.

Figure 5 demonstrates that our complexity measure can be much lower than the spectral complexity. Furthermore, in Figure 5, our complexity measure appears to scale well with depth for WideResNet models.

Appendix H Toolbox

u(u(x1,x2),x3)=u(x1,x2x3)u(u(x_{1},x_{2}),x_{3})=u(x_{1},x_{2}x_{3}).

First, we note that u(x1,x2)=x1x2+1x2x2+1x2=1u(x_{1},x_{2})=x_{1}x_{2}+1-x_{2}\leq x_{2}+1-x_{2}=1. Furthermore, u(x1,x2)x1x2+x1(1x2)=x1u(x_{1},x_{2})\geq x_{1}x_{2}+x_{1}(1-x_{2})=x_{1}, which completes the proof of statements 1 and 2. To prove the third statement, we note that u(u(x1,x2),x3)=(x1x2+1x2)x3+1x3=x1x2x3+1x2x3=u(x1,x2x3)u(u(x_{1},x_{2}),x_{3})=(x_{1}x_{2}+1-x_{2})x_{3}+1-x_{3}=x_{1}x_{2}x_{3}+1-x_{2}x_{3}=u(x_{1},x_{2}x_{3}). ∎

The Jacobian of a composition of a sequence of functions f1,,fkf_{1},\dots,f_{k} satisfies

where the \cdot notations are standard matrix multiplication. For simplicity, we also write in the function form:

Let f:DDf:\mathcal{D}\rightarrow\mathcal{D}^{\prime}, and consider the total derivative DfDf operator mapping D\mathcal{D} to a linear operator between normed spaces D\mathcal{D} to D\mathcal{D}^{\prime}. Suppose that Df[x]Df[x] is κ\kappa-Lipschitz in xx, in the sense that Df[x]Df[x+ν]opκν\|Df[x]-Df[x+\nu]\|_{\textup{op}}\leq\kappa\|\nu\|, where op\|\cdot\|_{\textup{op}} is the operator norm induced by D\mathcal{D} and D\mathcal{D}^{\prime}. Then

We write f(x+ν)f(x)=(t=01Df[x+tν]dt)νf(x+\nu)-f(x)=\left(\int_{t=0}^{1}Df[x+t\nu]dt\right)\nu. Now we note that