Sensitivity and Generalization in Neural Networks: an Empirical Study

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

Introduction

The empirical success of deep learning has thus far eluded interpretation through existing lenses of computational complexity (Blum & Rivest, 1988), numerical optimization (Choromanska et al., 2015; Goodfellow & Vinyals, 2014; Dauphin et al., 2014) and classical statistical learning theory (Zhang et al., 2016): neural networks are highly non-convex models with extreme capacity that train fast and generalize well. In fact, not only do large networks demonstrate good test performance, but larger networks often generalize better, counter to what would be expected from classical measures, such as VC dimension. This phenomenon has been observed in targeted experiments (Neyshabur et al., 2015), historical trends of Deep Learning competitions (Canziani et al., 2016), and in the course of this work (Figure 1).

This observation is at odds with Occam’s razor, the principle of parsimony, as applied to the intuitive notion of function complexity (see §A.2 for extended discussion). One resolution of the apparent contradiction is to examine complexity of functions in conjunction with the input domain. f(x)=x3sin(x)f(x)=x^{3}\sin(x) may seem decisively more complex than g(x)=xg(x)=x. But restrained to a narrow input domain of [0.01,0.01]\left[-0.01,0.01\right] they appear differently: gg remains a linear function of the input, while f(x)=O(x4)f(x)=\mathcal{O}\left(x^{4}\right) resembles a constant . In this work we find that such intuition applies to neural networks, that behave very differently close to the data manifold than away from it (§4.1).

We therefore analyze the complexity of models through their capacity to distinguish different inputs in the neighborhood of datapoints, or, in other words, their sensitivity. We study two simple metrics presented in §3 and find that one of them, the norm of the input-output Jacobian, correlates with generalization in a very wide variety of scenarios.

This work considers sensitivity only in the context of image classification tasks. We interpret the observed correlation with generalization as an expression of a universal prior on (natural) image classification functions that favor robustness (see §A.2 for details). While we expect a similar prior to exist in many other perceptual settings, care should be taken when extrapolating our findings to tasks where such a prior may not be justified (e.g. weather forecasting).

We first define sensitivity metrics for fully-connected neural networks in §3. We then relate them to generalization through a sequence of experiments of increasing level of nuance:

In §4.1 we begin by comparing the sensitivity of trained neural networks on and off the training data manifold, i.e. in the regions of best and typical (over the whole input space) generalization.

In §4.2 we compare sensitivity of identical trained networks that differ in a single hyper-parameter which is important for generalization.

Further, §4.3 associates sensitivity and generalization in an unrestricted manner, i.e. comparing networks of a wide variety of hyper-parameters such as width, depth, non-linearity, weight initialization, optimizer, learning rate and batch size.

Finally, §4.4 explores how predictive sensitivity (as measured by the Jacobian norm) is for individual test points.

2 Summary of Contributions

The novelty of this work can be summarized as follows:

Study of the behavior of trained neural networks on and off the data manifold through sensitivity metrics (§4.1).

Evaluation of sensitivity metrics on trained neural networks in a very large-scale experimental setting and finding that they correlate with generalization (§4.2, §4.3, §4.4).

§2 puts our work in context of related research studying complexity, generalization, or sensitivity metrics similar to ours.

Related Work

We analyze complexity of fully-connected neural networks for the purpose of model comparison through the following sensitivity measures (see §3 for details):

estimating the number of linear regions a network splits the input space into;

measuring the norm of the input-output Jacobian within such regions.

A few prior works have examined measures related to the ones we consider. In particular, Pascanu et al. (2013); Montúfar et al. (2014); Raghu et al. (2016) have investigated the expressive power of fully-connected neural networks built out of piecewise-linear activation functions. Such functions are themselves piecewise-linear over their input domain, so that the number of linear regions into which input space is divided is one measure of how nonlinear the function is. A function with many linear regions has the capacity to build complex, flexible decision boundaries. It was argued in (Pascanu et al., 2013; Montúfar et al., 2014) that an upper bound to the number of linear regions scales exponentially with depth but polynomially in width, and a specific construction was examined. Raghu et al. (2016) derived a tight analytic bound and considered the number of linear regions for generic networks with random weights, as would be appropriate, for instance, at initialization. However, the evolution of this measure after training has not been investigated before. We examine a related measure, the number of hidden unit transitions along one-dimensional trajectories in input space, for trained networks. Further motivation for this measure is discussed in §3.

Another perspective on function complexity can be gained by studying their robustness to perturbations to the input. Indeed, Rasmussen & Ghahramani (2000) demonstrate on a toy problem how complexity as measured by the number of parameters may be of limited utility for model selection, while measuring the output variation allows the invocation of Occam’s razor. In this work we apply related ideas to a large-scale practical context of neural networks with up to a billion free parameters (§4.2, §4.3) and discuss potential ways in which sensitivity permits the application of Occam’s razor to neural networks (§A.2).

Sokolic et al. (2017) provide theoretical support for the relevance of robustness, as measured by the input-output Jacobian, to generalization. They derive bounds for the generalization gap in terms of the Jacobian norm within the framework of algorithmic robustness (Xu & Mannor, 2012). Our results provide empirical support for their conclusions through an extensive number of experiments. In a similar spirit Zahavy et al. (2018) propose a different sensitivity measure in terms of adversarial robustness and relate it to generalization of stochastic algorithms theoretically and experimentally.

Several other recent papers have also focused on deriving tight generalization bounds for neural networks (Bartlett et al., 2017; Dziugaite & Roy, 2017; Neyshabur et al., 2018). We do not propose theoretical bounds in this paper but establish a correlation between our metrics and generalization in a substantially larger experimental setting than undertaken in prior works.

2 Regularization

In the context of regularization, increasing robustness to perturbations is a widely-used strategy: data augmentation, noise injection (Jiang et al., 2009), weight decay (Krogh & Hertz, 1992), and max-pooling all indirectly reduce sensitivity of the model to perturbations, while Rifai et al. (2011); Sokolic et al. (2017) explicitly penalize the Frobenius norm of the Jacobian in the training objective.

In this work we relate several of the above mentioned regularizing techniques to sensitivity, demonstrating through extensive experiments that improved generalization is consistently coupled with better robustness as measured by a single metric, the input-output Jacobian norm (§4.2). While some of these findings confirm common-sense expectations (random labels increase sensitivity, Figure 4, top row), others challenge our intuition of what makes a neural network robust (ReLU-networks, with unbounded activations, tend to be more robust than saturating HardSigmoid-networks, Figure 4, third row).

3 Inductive Bias of SGD

One of our findings demonstrates an inductive bias towards robustness in stochastic mini-batch optimization compared to full-batch training (Figure 4, bottom row). Interpreting this regularizing effect in terms of some measure of sensitivity, such as curvature, is not new (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016), yet we provide a novel perspective by relating it to reduced sensitivity to inputs instead of parameters.

The inductive bias of SGD (“implicit regularization”) has been previously studied in (Neyshabur et al., 2015), where it was shown through rigorous experiments how increasing the width of a single-hidden-layer network improves generalization, and an analogy with matrix factorization was drawn to motivate constraining the norm of the weights instead of their count. Neyshabur et al. (2017) further explored several weight-norm based measures of complexity that do not scale with the size of the model. One of our measures, the Frobenius norm of the Jacobian is of similar nature (since the Jacobian matrix size is determined by the task and not by a particular network architecture). However, this particular metric was not considered, and, to the best of our knowledge, we are the first to evaluate it in a large-scale setting (e.g. our networks are up to 6565 layers deep and up to 2162^{16} units wide).

4 Adversarial Examples

Sensitivity to inputs has attracted a lot of interest in the context of adversarial examples (Szegedy et al., 2013). Several attacks locate points of poor generalization in the directions of high sensitivity of the network (Goodfellow et al., 2014; Papernot et al., 2016; Moosavi-Dezfooli et al., 2016), while certain defences regularize the model by penalizing sensitivity (Gu & Rigazio, 2014) or employing decaying (hence more robust) non-linearities (Kurakin et al., 2016). However, work on adversarial examples relates highly specific perturbations to a similarly specific kind of generalization (i.e. performance on a very small, adversarial subset of the data manifold), while this paper analyzes average-case sensitivity (§3) and typical generalization. Certain measures of adversarial robustness have nevertheless been recently observed to correlate with generalization (Zahavy et al., 2018; Gilmer et al., 2018; Cubuk et al., 2018).

Sensitivity Metrics

How does the output of the network change as the input is perturbed within the linear region?

How likely is the linear region to change in response to change in the input?

For a local sensitivity measure we adopt the Frobenius norm of the class probabilities Jacobian J(x)=fσ(x)/xT{\mathbf{J}}({\bf x})=\partial{\bf f}_{\sigma}\left({\bf x}\right)/\partial{\bf x^{T}} (with Jij(x)=[fσ(x)]i/xjJ_{ij}(\mathbf{x})=\partial\left[{\bf f}_{\sigma}\left(\mathbf{x}\right)\right]_{i}/\partial{x_{j}}), where fσ=σf{\bf f}_{\sigma}=\bm{\sigma}\circ{\bf f} with σ\bm{\sigma} being the softmax functionThe norm of the Jacobian with respect to logits (f(x)/xT)\left(\partial{\bf f}\left({\bf x}\right)/\partial{\bf x^{T}}\right) experimentally turned out less predictive of test performance (not shown). See §A.3 for discussion of why the softmax Jacobian is related to generalization.. Given points of interest xtest{\bf x}_{\textrm{test}}, we estimate the sensitivity of the function in those regions with the average Jacobian norm:

that we will further refer to as simply “Jacobian norm”. Note that this does not require the labels for xtest{\bf x}_{\textrm{test}}.

Interpretation. The Frobenius norm J(x)F=ijJij(x)2\left\|\mathbf{J}({\bf x})\right\|_{F}=\sqrt{\sum_{ij}J_{ij}({\bf x})^{2}} estimates the average-case sensitivity of fσ{\bf f}_{\sigma} around x{\bf x}. Indeed, consider an infinitesimal Gaussian perturbation ΔxN(0,ϵI)\Delta{\bf x}\sim\mathcal{N}\left({\bf 0},\epsilon\mathbf{I}\right): the expected magnitude of the output change is then

To detect a change in linear region (further called a “transition”), we need to be able to identify it. We do this analogously to Raghu et al. (2016). For a network with piecewise-linear activations, we can, given an input x{\bf x}, assign a code to each neuron in the network f{\bf f}, that identifies the linear region of the pre-activation of that neuron. E.g. each ReLU unit will have or 11 assigned to it if the pre-activation value is less or greater than respectively. Similarly, a ReLU6 unit (see definition in §A.4) will have a code of , 11, or 22 assigned, since it has 33 linear regionsFor a non-piecewise-linear activation like Tanh, we consider as the boundary of two regions and find this metric qualitatively similar to counting transitions of a piecewise-linear non-linearity.. Then, a concatenation of codes of all neurons in the network (denoted by c(x){\bf c}({\bf x})) uniquely identifies the linear region of the input x{\bf x} (see §A.1.1 for discussion of edge cases).

Given this encoding scheme, we can detect a transition by detecting a change in the code. We then sample kk equidistant points z0,,zk1{\bf z}_{0},\dots,{\bf z}_{k-1} on a closed one-dimensional trajectory T(x)\mathcal{T}({\bf x}) (generated from a data point x\mathbf{x} and lying close to the data manifold; see below for details) and count transitions t(x)t(\mathbf{x}) along it to quantify the number of linear regions:

where the norm of the directional derivative c(z)/(dz)1\left\|\partial{\bf c}({\bf z})/\partial\left(d{\bf z}\right)\right\|_{1} amounts to a Dirac delta function at each transition (see §A.1.2 for further details).

By sampling multiple such trajectories around different points, we estimate the sensitivity metric:

that we will further refer to as simply “transitions” or “number of transitions.”

To assure the sampling trajectory T(xtest)\mathcal{T}({\bf x}_{\textrm{test}}) is close to the data manifold (since this is the region of interest), we construct it through horizontal translations of the image xtest{\bf x}_{\textrm{test}} in pixel space (Figure App.9, right). We similarly augment our training data with horizontal and vertical translations in the corresponding experiments (Figure 4, second row).

As earlier, this metric does not require knowing the label of xtest{\bf x}_{\textrm{test}}.

Interpretation. We can draw a qualitative parallel between the number of transitions and curvature of the function. One measure of curvature of a function f\bf f is the total norm of the directional derivative of its first derivative f\bf f^{\prime} along a path:

A piecewise-linear function f\bf f has a constant first derivative f\bf f^{\prime} everywhere except for the transition boundaries. Therefore, for a sufficiently large kk, curvature can be expressed as

where z0,,zk1{\bf z}_{0},\dots,{\bf z}_{k-1} are equidistant samples on T(x)\mathcal{T}\left(\bf x\right). This sum is similar to t(x)t({\bf x}) as defined in Equation 1, but quantifies the amount of change in between two linear regions in a non-binary way. However, estimating it on a densely sampled trajectory is a computationally-intensive task, which is one reason we instead count transitions.

As such, on a qualitative level, the two metrics (Jacobian norm and number of transitions) track the first and second order terms of the Taylor expansion of the function.

Above we have defined two sensitivity metrics to describe the learned function around the data, on average. In §4.1 we analyze these measures on and off the data manifold by simply measuring them along circular trajectories in input space that intersect the data manifold at certain points, but generally lie away from it (Figure 2, left).

Experimental Results

In the following subsections (§4.2, §4.3) each study analyzes performance of a large number (usually thousands) of fully-connected neural networks having different hyper-parameters and optimization procedures. Except where specified, we include only models which achieve 100%100\% training accuracy. This allows us to study generalization disentangled from properties like expressivity and trainability, which are outside the scope of this work.

In order to efficiently evaluate the compute-intensive metrics (§3) in a very wide range of hyper-parameters settings (see e.g. §A.5.5) we only consider fully-connected networks. Extending the investigation to more complex architectures is left for future work.

We analyze the behavior of a trained neural network near and away from training data. We do this by comparing sensitivity of the function along 3 types of trajectories:

A random ellipse. This trajectory is extremely unlikely to pass anywhere near the real data, and indicates how the function behaves in random locations of the input space that it never encountered during training.

An ellipse passing through three training points of different class (Figure 2, left). This trajectory does pass through the three data points, but in between it traverses images that are linear combinations of different-class images, and are expected to lie outside of the natural image space. Sensitivity of the function along this trajectory allows comparison of its behavior on and off the data manifold, as it approaches and moves away from the three anchor points.

An ellipse through three training points of the same class. This trajectory is similar to the previous one, but, given the dataset used in the experiment (MNIST), is expected to traverse overall closer to the data manifold, since linear combinations of the same digit are more likely to resemble a realistic image. Comparing transition density along this trajectory to the one through points of different classes allows further assessment of how sensitivity changes in response to approaching the data manifold.

We find that, according to both the Jacobian norm and transitions metrics, functions exhibit much more robust behavior around the training data (Figure 2, center and right). We further visualize this effect in 2D in Figure 3, where we plot the transition boundaries of the last (pre-logit) layer of a neural network before and after training. After training we observe that training points lie in regions of low transition density.

The observed contrast between the neural network behavior near and away from data further strengthens the empirical connection we draw between sensitivity and generalization in §4.2, §4.3 and §4.4; it also confirms that, as mentioned in §1, if a certain quality of a function is to be used for model comparison, input domain should always be accounted for.

2 Sensitivity and Generalization Factors

In §4.1 we established that neural networks implement more robust functions in the vicinity of the training data manifold than away from it.

We now consider the more practical context of model selection. Given two perfectly trained neural networks, does the model with better generalization implement a less sensitive function?

We study approaches in the machine learning community that are commonly believed to influence generalization (Figure 4, top to bottom):

We find that in each case, the change in generalization is coupled with the respective change in sensitivity (i.e. lower sensitivity corresponds to smaller generalization gap) as measured by the Jacobian norm (and almost always for the transitions metric).

In Figure 4, for many possible hyper-parameter configurations, we train two models that share all parameters and optimization procedure, but differ in a single binary setting (i.e. trained on true or random labels; with or without data augmentation; etc). Out of all such network pairs, we select only those where each network reached 100% training accuracy on the whole training set (apart from the data augmentation study). The two generalization or sensitivity values are then used as the xx and yy coordinates of a point corresponding to this pair of networks (with the plot axes labels denoting the respective value of the binary parameter considered). The position of the point with respect to the diagonal y=xy=x visually demonstrates which configuration has smaller generalization gap / lower sensitivity.

3 Sensitivity and Generalization Gap

We now perform a large-scale experiment to establish direct relationships between sensitivity and generalization in a realistic setting. In contrast to §4.1, where we selected locations in the input space, and §4.2, where we varied a single binary parameter impacting generalization, we now sweep simultaneously over many different architectural and optimization choices (§A.5.5).

Our main result is presented in Figure 5, indicating a strong relationship between the Jacobian norm and generalization. In contrast, Figure App.8 demonstrates that the number of transitions is not alone sufficient to compare networks of different sizes, as the number of neurons in the networks has a strong influence on transition count.

4 Sensitivity and Per-Point Generalization

In §4.3 we established a correlation between sensitivity (as measured by the Jacobian norm) and generalization averaged over a large test set (10410^{4} points). We now investigate whether the Jacobian norm can be predictive of generalization at individual points.

Conclusion

We have investigated sensitivity of trained neural networks through the input-output Jacobian norm and linear regions counting in the context of image classification tasks. We have presented extensive experimental evidence indicating that the local geometry of the trained function as captured by the input-output Jacobian can be predictive of generalization in many different contexts, and that it varies drastically depending on how close to the training data manifold the function is evaluated. We further established a connection between the cross-entropy loss and the Jacobian norm, indicating that it can remain informative of generalization even at the level of individual test points. Interesting directions for future work include extending our investigation to more complex architectures and other machine learning tasks.

References

Appendix A Appendix

The way of encoding a linear region c(z)\bf c\left({\bf z}\right) of a point z\bf z described in §3 (2) guarantees that different regions obtain different codes, but different codes might be assigned to the same region if all the neurons in any layer of the network are saturated (or if weights leading from the transitioning unit to active units are exactly zero, or exactly cancel). However, the probability of such an arrangement drops exponentially with width and hence is ignored in this work.

A.1.2 Transition Counting

The equality between the discrete and continuous versions of t(x)t\left({\bf x}\right) in Equation 1 becomes exact with a high-enough sampling density kk such that there are no narrow linear regions missed in between consecutive points (precisely, the encoding c(z)\bf c\left({\bf z}\right) has to only change at most once on the line between two consecutive points zi{\bf z}_{i} and zi+1{\bf z}_{i+1}).

For computational efficiency we also assume that no two neurons transitions simultaneously, which is extremely unlikely in the context of random initialization and stochastic optimization.

A.2 Do neural networks defy Occam’s razor?

Here we briefly discuss the motivation of this work in the context of Occam’s razor.

An alternative, qualitative justification of the heuristic is through considering the evidence as a normalized probability distribution over the whole dataset space:

and remarking that models with more parameters have to spread the probability mass more evenly across all the datasets by virtue of being able to fit more of them (Figure App.10, left). This similarly suggests (under a uniform prior on competing hypothesis classes) preferring models with fewer parameters, assuming that evidence is unimodal and peaks close to the dataset of interest.

Occam’s razor for neural networks. As seen in Figure 1, the above reasoning does not apply to neural networks: the best achieved generalization is obtained by a model that has around 10410^{4} times as many parameters as the simplest model capable of fitting the dataset (within the evaluated search space).

Above is one way to interpret the correlation between sensitivity and generalization that we observe in this work. It does not explain why large networks tend to converge to less sensitive functions. We conjecture large networks to have access to a larger space of robust solutions due to solving a highly-underdetermined system when fitting a dataset, while small models converge to more extreme weight values due to being overconstrained by the data. However, further investigation is needed to confirm this hypothesis.

A.3 Bounding the Jacobian Norm

Here we analyze the relationship between the Jacobian norm and the cross-entropy loss at individual test points as studied in §4.4.

Target class Jacobian. We begin by relating the derivative of the target class probability Jy(x)\mathbf{J}_{y({\bf x})} to per-point cross-entropy loss l(x)=log[fσ(x)]y(x)l({\bf x})=-\log\left[{\bf f}_{\sigma}({\bf x})\right]_{y({\bf x})} (where y(x)y({\bf x}) is the correct integer class).

We will denote fσ(x){\bf f}_{\sigma}({\bf x}) by σ{\bm{\sigma}} and drop the x{\bf x} argument to de-clutter notation (i.e. write f\bf f instead of f(x)\bf f({\bf x})). Then the Jacobian can be expressed as

where \odot is the Hadamard element-wise product. Then indexing both sides of the equation at the correct class yy yields

where ey{\bf e}_{y} is a vector of zeros everywhere except for ey=1{e}_{y}=1. Taking the norm of both sides results in

We now assume that magnitudes of the individual logit derivatives vary little in between logits and over the input spaceIn the limit of infinite width, and fully Bayesian training, deep network logits are distributed exactly according to a Gaussian process (Neal, 1994; Lee et al., 2018). Similarly, each entry in the logit Jacobian also corresponds to an independent draw from a Gaussian process (Solak et al., 2003). It is therefore plausible that the Jacobian norm, consisting of a sum over the square of independent Gaussian samples in the correct limits, will tend towards a constant.:

or, in terms of the cross-entropy loss l=logσyl=-\log\sigma_{y}:

We validate these approximate bounds in Figure App.11 (top).

Full Jacobian. Equation 5 establishes a close relationship between Jy\mathbf{J}_{y} and loss l=logσyl=-\log\sigma_{y}, but of course, at test time we do not know the target class yy. This allows us to only bound the full Jacobian norm from below:

For the upper bound, we assume the maximum-entropy case of σy\sigma_{y}: σi(1σy)/(n1)\sigma_{i}\approx(1-\sigma_{y})/(n-1), for iyi\neq y. The Jacobian norm is

Adding n1n-1 of such summands to Jy22\left\|{\mathbf{J}}_{y}\right\|_{2}^{2} results in

compared against the lower bound (Equation 6) and experimental data in Figure App.11.

A.4 Non-linearities Definitions

Following activation functions are used in this work:

ReLU6 (Krizhevsky, 2010): min(max(x,0),6)\min\left(\max(x,0),6\right);

Tanh: hyperbolic tangent, (exex)/(ex+ex)(e^{x}-e^{-x})/(e^{x}+e^{-x});

HardTanh (Gulcehre et al., 2016): min(max(x,1),1)\min\left(\max(x,-1),1\right);

HardSigmoid (Gulcehre et al., 2016): min(max(x+0.5,0),1)\min\left(\max(x+0.5,0),1\right);

A.5 Experimental Setup

All experiments were implemented in Tensorflow (Abadi et al., 2016) and executed with the help of Vizier (Golovin et al., 2017). All networks were trained with cross-entropy loss. All networks were trained without biases. All computations were done with 32-bit precision. Learning rate decayed by a factor of 0.10.1 every 500500 epochs.

Unless specified otherwise, initial weights were drawn from a normal distribution with zero mean and variance 2/n2/n for ReLU, ReLU6 and HardSigmoid; 1/n1/n for Tanh and HardTanh, where nn is the number of inputs to the current layer.

All inputs were normalized to have zero mean and unit variance, or, in other terms, lie on the dd-dimensional sphere of radius d\sqrt{d}, where dd is the dimensionality of the input.

All reported values, when applicable, were evaluated on the whole training and test sets of sizes 5000050000 and 1000010000 respectively. E.g. “generalization gap” is defined as the difference between train and test accuracies evaluated on the whole train and test sets.

When applicable, all trajectories/surfaces in input space were sampled with 2202^{20} points.

All figures except for 6 and App.11 are plotted with (pale) error bars (when applicable). The reported quantity was usually evaluated 88 times with random seeds from 11 to 88If a particular random seed did not finish, it was not taken into account; we believe this nuance did not influence the conclusions of this paper., unless specified otherwise. E.g. if a network is said to be 100%-accurate on the training set, it means that each of the 88 randomly-initialized networks is 100%-accurate after training.

The error bar is centered at the mean value of the quantity and spans the standard error of the mean in each direction. If the bar appears to not be visible, it may be smaller than the mean value marker.

Weight initialization, training set shuffling, data augmentation, picking anchor points of data-fitted trajectories, selecting axes of a zero-centered elliptic trajectory depend on the random seed.

A.5.2 Sensitivity along a Trajectory

A 20-layer ReLU-network of width 200 was trained on MNIST 128 times, with plots displaying the averaged values.

A random zero-centered ellipse was obtained by generating two axis vectors with normally-distributed entries of zero mean and unit variance (as such making points on the trajectory have an expected norm equal to that of training data) and sampling points on the ellipse with given axes.

A random data-fitted ellipse was generated by projecting three arbitrary input points onto a plane where they fall into vertices of an equilateral triangle, and then projecting their circumcircle back into the input space.

A.5.3 Linear Region Boundaries

A 15-layer ReLU6-network of width 300 was trained on MNIST for 2182^{18} steps using SGD with momentum (Rumelhart et al., 1988); images were randomly translated with wrapping by up to 4 pixels in each direction, horizontally and vertically, as well as randomly flipped along each axis, and randomly rotated by 9090 degrees clockwise and counter-clockwise.

The sampling grid in input space was obtain by projecting three arbitrary input points into a plane as described in §A.5.2 such that the resulting triangle was centered at and it’s vertices were at a distance 0.80.8 form the origin. Then, a sampling grid of points in the [1;1]×2\left[-1;1\right]^{\times 2} square was projected back into the input space.

A.5.4 Small Experiment

Relevant figures: 4 (second row) and 5 (bottom).

All networks were trained for 2182^{18} steps of batch size of 256256 using SGD with momentum. Learning rate was set to 0.0050.005 and momentum term coefficient to 0.90.9.

Data augmentation consisted of random translation of the input by up to 4 pixels in each direction with wrapping, horizontally and vertically. The input was also flipped horizontally with probability 0.50.5. When applying data augmentation (second row of Figure 4), the network is unlikely to encounter the canonical training data, hence few configurations achieved 100%100\% training accuracy. However, we verified that all networks trained with data augmentation reached a higher test accuracy than their analogues without, ensuring that the generalization gap shrinks not simply because of lower training accuracy.

For each dataset, networks of width {100,200,500,1000,2000,3000}\left\{100,200,500,1000,2000,3000\right\}, depth {2,3,5,10,15,20}\left\{2,3,5,10,15,20\right\} and activation function {ReLU, ReLU6, HardTanh, HardSigmoid}\left\{\textrm{ReLU, ReLU6, HardTanh, HardSigmoid}\right\} were evaluated on 88 random seeds from 11 to 88.

A.5.5 Large Experiment

Relevant figures: 1, 4 (except for the second row), 5 (top), App.8.

335671335671 networks were trained for 2192^{19} steps with random hyper-parameters; if training did not complete, a checkpoint at step 2182^{18} was used instead, if available. When using L-BFGS, the maximum number of iterations was set to 26842684. The space of available hyper-parameters includedDue to time and compute limitations, this experiment was set up such that configurations of small size were more likely to get evaluated (e.g. only a few networks of width 2162^{16} were trained, and all of them had depth 22). However, based on our experience with smaller experiments (where each configuration got evaluated), we believe this did not bias the findings of this paper, while allowing them to be validated across a very wide range of scenarios.:

CIFAR10 and CIFAR100 datasets cropped to a 24×2424\times 24 center region;

SGD, Momentum, ADAM (Kingma & Ba, 2014), RMSProp (Hinton et al., 2012) and L-BFGS optimizers;

learning rates from {0.01,0.005,0.0005}\left\{0.01,0.005,0.0005\right\}, when applicable. Secondary coefficients were fixed at default values of Tensorflow implementations of respective optimizers;

batch sizes of {128,512}\left\{128,512\right\} (unless using L-BFGS with the full batch of 5000050000);

standard deviations of initial weights from {0.5,1,4,8}\left\{0.5,1,4,8\right\} multiplied by the default value described in §A.5;

widths from {1,2,4,,216}\left\{1,2,4,\cdots,2^{16}\right\};

depths from {2,3,5,,26+1}\left\{2,3,5,\cdots,2^{6}+1\right\};

A.5.6 Per-point Generalization

Hyper-parameters were: non-linearity (all functions from §A.4), width (5050, 100100, 200200, 500500, 10001000), depth (22, 55, 1010, 2020, 3030), learning rate (0.00010.0001, 0.0010.001, 0.010.01), optimizer (SGD, Momentum, ADAM, RMSProp). Only one random seed (1) was used. For each dataset a random subset of 5 configurations among all the 100%100\%-accurate (on training) networks was plotted (apart from the case of CIFAR100, where networks of training accuracy of at least 99.98%99.98\% were selected).

A.5.7 Cross-entropy and Sensitivity Analysis

Networks were trained for 2182^{18} on the whole CIFAR10 training set and evaluated networks on a random test subset of size 10001000. The hyper-parameters consisted of non-linearity (all functions from §A.4), width (5050, 100100 or 200200) and depth (22, 55, 1010, 2020). Only one random seed (11) was considered. A single random 100%-accurate (on training data) network was drawn to compare experimental measurements with analytic bounds on the Jacobian norm.