Generalization in Deep Learning

Kenji Kawaguchi, Leslie Pack Kaelbling, Yoshua Bengio

Introduction

Deep learning has seen significant practical success and has had a profound impact on the conceptual bases of machine learning and artificial intelligence. Along with its practical success, the theoretical properties of deep learning have been a subject of active investigation. For expressivity of neural networks, there are classical results regarding their universality (Leshno et al., 1993) and exponential advantages over hand-crafted features (Barron, 1993). Another series of theoretical studies have considered how trainable (or optimizable) deep hypothesis spaces are, revealing structural properties that may enable non-convex optimization (Choromanska et al., 2015; Kawaguchi, 2016a). However, merely having an expressive and trainable hypothesis space does not guarantee good performance in predicting the values of future inputs, because of possible over-fitting to training data. This leads to the study of generalization, which is the focus of this paper.

Some classical theory work attributes generalization ability to the use of a low-capacity class of hypotheses (Vapnik, 1998; Mohri et al., 2012). From the viewpoint of compact representation, which is related to small capacity, it has been shown that deep hypothesis spaces have an exponential advantage over shallow hypothesis spaces for representing some classes of natural target functions (Pascanu et al., 2014; Montufar et al., 2014; Livni et al., 2014; Telgarsky, 2016; Poggio et al., 2017). In other words, when some assumptions implicit in the hypothesis space (e.g., deep composition of piecewise linear transformations) are approximately satisfied by the target function, one can achieve very good generalization, compared to methods that do not rely on that assumption. However, a recent paper (Zhang et al., 2017) has empirically shown that successful deep hypothesis spaces have sufficient capacity to memorize random labels. This observation has been called an “apparent paradox” and has led to active discussion by many researchers (Arpit et al., 2017; Krueger et al., 2017; Hoffer et al., 2017; Wu et al., 2017; Dziugaite and Roy, 2017; Dinh et al., 2017). Zhang et al. (2017) concluded with an open problem stating that understanding such observations require rethinking generalization, while Dinh et al. (2017) stated that explaining why deep learning models can generalize well, despite their overwhelming capacity, is an open area of research.

We begin, in Section 3, by illustrating that, even in the case of linear models, hypothesis spaces with overwhelming capacity can result in arbitrarily small test errors and expected risks. Here, test error is the error of a learned hypothesis on data that it was not trained on, but which is often drawn from the same distribution. Test error is a measure of how well the hypothesis generalizes to new data. We closely examine this phenomenon, extending the original open problem from previous papers (Zhang et al., 2017; Dinh et al., 2017) into a new open problem that strictly includes the original. We reconcile the possible apparent paradox by checking theoretical consistency and identifying a difference in the underlying assumptions. Considering a difference in the focuses of theory and practice, we outline possible practical roles that generalization theory can play.

Towards addressing these issues, Section 4 presents generalization bounds based on validation datasets, which can provide non-vacuous and numerically-tight generalization guarantees for deep learning in general. Section 5 analyzes generalization errors based on training datasets, focusing on a specific case of feed-forward neural networks with ReLU units and max-pooling. Under these conditions, the developed theory provides quantitatively tight theoretical insights into the generalization behavior of neural networks.

Background

A goal in machine learning is typically framed as the minimization of the expected risk ${\mathcal{R}}[f_{\mathcal{A}(S)}]$ . We typically aim to minimize the non-computable expected risk ${\mathcal{R}}[f_{\mathcal{A}(S)}]$ by minimizing the computable empirical risk ${\mathcal{R}}_{S}[\allowbreak f_{\mathcal{A}(S)}]$ (i.e., empirical risk minimization). One goal of generalization theory is to explain and justify when and how minimizing ${\mathcal{R}}_{S}[f_{\mathcal{A}(S)}]$ is a sensible approach to minimizing ${\mathcal{R}}[f_{\mathcal{A}(S)}]$ by analyzing

The hypothesis-space complexity approach handles this dependence by decoupling $f_{\mathcal{A}(S)}$ from the particular $S$ by considering the worst-case gap for functions in the hypothesis space as

and by carefully analyzing the right-hand side. Because the cardinality of ${\mathcal{F}}$ is typically (uncountably) infinite, a direct use of the union bound over all elements in ${\mathcal{F}}$ yields a vacuous bound, leading to the need to consider different quantities to characterize ${\mathcal{F}}$ ; e.g., Rademacher complexity and the Vapnik–Chervonenkis (VC) dimension. For example, if the codomain of ${\mathcal{L}}$ is in $ $, we have (Mohri et al., 2012, Theorem 3.1) that for any$ \delta>0 $, with probability at least$ 1-\delta$,

where $\mathfrak{R}_{m}(\mathcal{L}_{{\mathcal{F}}})$ is the Rademacher complexity of $\mathcal{L}_{{\mathcal{F}}}$ , which then can be bounded by the Rademacher complexity of ${\mathcal{F}}$ , $\mathfrak{R}_{m}({\mathcal{F}})$ . For the deep-learning hypothesis spaces ${\mathcal{F}}$ , there are several well-known bounds on $\mathfrak{R}_{m}({\mathcal{F}})$ including those with explicit exponential dependence on depth (Sun et al., 2016; Neyshabur et al., 2015b; Xie et al., 2015) and explicit linear dependence on the number of trainable parameters (Shalev-Shwartz and Ben-David, 2014). There has been significant work on improving the bounds in this approach, but all existing solutions with this approach still depend on the complexity of a hypothesis space or a sequence of hypothesis spaces.

The stability approach deals with the dependence of $f_{\mathcal{A}(S)}$ on the dataset $S$ by considering the stability of algorithm $\mathcal{A}$ with respect to different datasets. The considered stability is a measure of how much changing a data point in $S$ can change $f_{\mathcal{A}(S)}$ . For example, if the algorithm $\mathcal{A}$ has uniform stability $\beta$ (w.r.t. ${\mathcal{L}}$ ) and if the codomain of ${\mathcal{L}}$ is in $[0,M]$ , we have (Bousquet and Elisseeff, 2002) that for any $\delta>0$ , with probability at least $1-\delta$ ,

Based on previous work on stability (e.g., Hardt et al. 2015; Kuzborskij and Lampert 2017; Gonen and Shalev-Shwartz 2017), one may conjecture some reason for generalization in deep learning.

The robustness approach avoids dealing with certain details of the dependence of $f_{\mathcal{A}(S)}$ on $S$ by considering the robustness of algorithm $\mathcal{A}$ for all possible datasets. In contrast to stability, robustness is the measure of how much the loss value can vary w.r.t. the input space of $(x,y)$ . For example, if algorithm $\mathcal{A}$ is $(\Omega,\zeta(\cdot))$ -robust and the codomain of ${\mathcal{L}}$ is upper-bounded by $M$ , given a dataset $S$ , we have (Xu and Mannor, 2012) that for any $\delta>0$ , with probability at least $1-\delta$ ,

The robustness approach requires an a priori known and fixed partition of the input space such that the number of sets in the partition is $\Omega$ and the change of loss values in each set of the partition is bounded by $\zeta(S)$ for all $S$ (Definition 2 and the proof of Theorem 1 in Xu and Mannor 2012). In classification, if the margin is ensured to be large, we can fix the partition with balls of radius corresponding to the large margin, filling the input space. Recently, this idea was applied to deep learning (Sokolic et al., 2017a, b), producing insightful and effective generalization bounds, while still suffering from the curse of the dimensionality of the priori-known fixed input manifold.

With regard to the above approaches, flat minima can be viewed as the concept of low variation in the parameter space; i.e., a small perturbation in the parameter space around a solution results in a small change in the loss surface. Several studies have provided arguments for generalization in deep learning based on flat minima (Keskar et al., 2017). However, Dinh et al. (2017) showed that flat minima in practical deep learning hypothesis spaces can be turned into sharp minima via re-parameterization without affecting the generalization gap, indicating that it requires further investigation.

Rethinking generalization

Zhang et al. (2017) empirically demonstrated that several deep hypothesis spaces can memorize random labels, while having the ability to produce zero training error and small test errors for particular natural datasets (e.g., CIFAR-10). They also empirically observed that regularization on the norm of weights seemed to be unnecessary to obtain small test errors, in contradiction to conventional wisdom. These observations suggest the following open problem:

Supporting and extending the empirical observations by Zhang et al. (2017), we provide a theorem (Theorem 1) stating that the hypothesis space of over-parameterized linear models can memorize any training data and decrease the training and test errors arbitrarily close to zero (including to zero) with the norm of parameters being arbitrarily large, even when the parameters are arbitrarily far from the ground-truth parameters. Furthermore, Corollary 2 shows that conventional wisdom regarding the norm of the parameters $w$ can fail to explain generalization, even in linear models that might seem not to be over-parameterized. All proofs for this paper are presented in the appendix.

$\hat{Y}(w)=Y+\epsilon A$ for some matrix $A$ with $\|A\|_{F}\leq 1$ , and

$\|w\|_{F}\geq\delta$ and $\|w-w^{*}\|_{F}\geq\delta$ .

For any hypothesis space ${\mathcal{F}}$ whose hypothesis-space complexity is large enough to memorize any dataset and which includes $f^{*}_{\epsilon}$ possibly at an arbitrarily sharp minimum, there exist learning algorithms $\mathcal{A}$ such that the generalization gap of $f_{\mathcal{A}(S)}$ is at most $\epsilon$ , and

There exist arbitrarily unstable and arbitrarily non-robust algorithms $\mathcal{A}$ such that the generalization gap of $f_{\mathcal{A}(S)}$ is at most $\epsilon$ .

To see this, first consider statement (i). Given such a ${\mathcal{F}}$ , consider any $\mathcal{A}$ such that $\mathcal{A}$ takes ${\mathcal{F}}$ and $S$ as input and outputs $f^{*}_{\epsilon}$ . Clearly, there are many such algorithms $\mathcal{A}$ . For example, given a $S$ , fix $\mathcal{A}$ such that $\mathcal{A}$ takes ${\mathcal{F}}$ and $S$ as input and outputs $f^{*}_{\epsilon}$ (which already proves the statement), or even $f^{*}_{\epsilon}+\delta$ where $\delta$ becomes zero by the right choice of hyper-parameters and of small variations of ${\mathcal{F}}$ (e.g., architecture search in deep learning) such that ${\mathcal{F}}$ still satisfy the condition in the statement. This establishes statement (i).

Consider statement (ii). Given any dataset $S^{\prime}$ , consider a look-up algorithm $\mathcal{A}^{\prime}$ that always outputs $f^{*}_{\epsilon}$ if $S=S^{\prime}$ , and outputs $f_{1}$ otherwise such that $f_{1}$ is arbitrarily non-robust and $|{\mathcal{L}}(f^{*}_{\epsilon}(x),y)-{\mathcal{L}}(f_{1}(x),y)|$ is arbitrarily large (i.e., arbitrarily non-stable). This proves statement (ii). Note that while $A^{\prime}$ here suffices to prove statement (ii), we can also generate other non-stable and non-robust algorithms by noticing the essence captured in Remark 4.

We capture the essence of all the above observations in the following remark.

From these observations, we propose the following open problem:

Solving Open Problem 2 for deep learning implies solving Open Problem 1, but not vice versa. Open Problem 2 encapsulates the essence of Open Problem 1 and all the issues from our Theorem 1, Corollary 2 and Proposition 3.

The empirical observations in (Zhang et al., 2017) and our results above may seem to contradict the results of statistical learning theory. However, there is no contradiction, and the apparent inconsistency arises from the misunderstanding and misuse of the precise meanings of the theoretical statements.

Statistical learning theory can be considered to provide two types of statements relevant to the scope of this paper. The first type (which comes from upper bounds) is logically in the form of “ $p$ implies $q$ ,” where $p:=$ “the hypothesis-space complexity is small” (or another statement about stability, robustness, or flat minima), and $q:=$ “the generalization gap is small.” Notice that “ $p$ implies $q$ ” does not imply “ $q$ implies $p$ .” Thus, based on statements of this type, it is entirely possible that the generalization gap is small even when the hypothesis-space complexity is large or the learning mechanism is unstable, non-robust, or subject to sharp minima.

2 Difference in assumptions and problem settings

Under certain assumptions, many results in statistical learning theory have been shown to be tight and insightful (e.g., Mukherjee et al. 2006; Mohri et al. 2012). Hence, the need of rethinking generalization partly comes from a difference in the assumptions and problem settings.

We note that analyzing a set $\mathcal{P}\times D$ is of significant interest for its own merits and is a natural task in the field of computational complexity (e.g., categorizing a set of problem instances into subsets with or without polynomial solvability). Indeed, the situation where theory focuses more on a set and many practical studies focus on each element in the set is prevalent in computer science (see the discussion in Appendix A2 for more detail).

3 Practical role of generalization theory

From the discussions above, we can see that there is a logically expected difference between the scope in theory and the focus in practice; it is logically expected that there are problem instances where theoretical bounds are pessimistic. In order for generalization theory to have maximal impact in practice, we must be clear on a set of different roles it can play regarding practice, and then work to extend and strengthen it in each of these roles. We have identified the following practical roles for theory:

to be small for a given fixed $S$ , and/or

to approach zero with a fixed model class as $m$ increases.

Provide theoretical insights to guide the search over model classes.

Generalization bounds via validation

Direct analyses of neural networks

Unlike the previous section, this section analyzes the generalization gap with a training dataset $S$ . In Section 3, we extended Open Problem 1 to Open Problem 2, and identified the different assumptions in theoretical and empirical studies. Accordingly, this section aims to address these problems to some extent, both in the case of particular specified datasets and the case of random unspecified datasets. To achieve this goal, this section presents a direct analysis of neural networks, rather than deriving results about neural networks from more generic theories based on capacity, Rademacher complexity, stability, or robustness.

Sections 5.2 and 5.3 deals with the squared loss, while Section 5.4 considers 0-1 loss with multi-labels.

We consider general neural networks of any depth that have the structure of a directed acyclic graph (DAG) with ReLU nonlinearity and/or max pooling. This includes any structure of a feedforward network with convolutional and/or fully connected layers, potentially with skip connections. For pedagogical purposes, we first discuss our model description for layered networks without skip connections, and then describe it for DAGs.

By expanding $z^{[l]}(x,w)$ repeatedly and exchanging the sum and product via the distributive law of multiplication,

where $\overline{W}_{kj_{L-1}j_{L-2}\dots j_{0}}=W^{[L]}_{kj_{L-1}}\prod_{l=1}^{L-1}W^{[l]}_{j_{l}j_{l-1}}$ and $\dot{\sigma}_{j_{L-1}j_{L-2}\dots j_{1}}(x,w)=\allowbreak\prod_{l=1}^{L-1}\allowbreak\dot{\sigma}_{j_{l}}^{[l]}(x,w)$ . By merging the indices $j_{0},\dots,j_{L-1}$ into $j$ with some bijection between $\{1,\dots,n_{0}\}\times\cdots\times\{1,\dots,\allowbreak n_{L-1}\}\ni(j_{0},\dots,j_{L-1})$ and $\{1,\dots,n_{0}n_{1}\cdots\allowbreak n_{L-1}\}\ni j$ ,

where $\bar{w}_{k,j},\bar{\sigma}_{j}(x,w)$ and $\bar{x}_{j}$ represent $\overline{W}_{kj_{L-1}j_{L-2}\dots j_{0}},\dot{\sigma}_{j_{L-1}j_{L-2}\dots j_{1}}(\allowbreak x,w)$ and $x_{j_{0}}$ respectively with the change of indices (i.e., $\sigma_{j}(x,w)$ and $\bar{x}_{j}$ respectively contain the $n_{0}$ numbers and $n_{1}\cdots n_{L-1}$ numbers of the same copy of each $\dot{\sigma}_{j_{L-1}j_{L-2}\dots j_{1}}(x,w)$ and $x_{j_{0}}$ ). Note that $\sum_{j}$ represents summation over all the paths from the input $x$ to the $k$ -th output unit.

Remember that every DAG has at least one topological ordering, which can be used to to create a layered structure with possible skip connections (e.g., see Healy and Nikolov 2001; Neyshabur et al. 2015b). In other words, we consider DAGs such that the pre-activation vector of the $l$ -th layer can be written as

where $\sum_{j}$ represents the summation over all paths from the input $x$ to the $k$ -th output unit; i.e., $\bar{w}_{k,j}\bar{\sigma}_{j}(x,w)\bar{x}_{j}$ is the contribution from the $j$ -th path to the $k$ -th output unit. Each of $\bar{w}_{k,j},\bar{\sigma}_{j}(x,w)$ and $\bar{x}_{j}$ is defined in the same manner as in the case of layered networks without skip connections. In other words, the $j$ -th path weight $\bar{w}_{k,j}$ is the product of the weight parameters in the $j$ -th path, and $\bar{\sigma}_{j}(x,w)$ is the product of the - $1$ activations in the $j$ -th path, corresponding to ReLU nonlinearity and max pooling; $\bar{\sigma}_{j}(x,w)=1$ if all units in the $j$ -th path are active, and $\bar{\sigma}_{j}(x,w)=0$ otherwise. Also, $\bar{x}_{j}$ is the input used in the $j$ -th path. Therefore, for DAGs, including layered networks without skip connections,

where $[\bar{x}\circ\bar{\sigma}(x,w)]_{j}=\bar{x}_{j}\bar{\sigma}_{j}(x,w)$ and $(\bar{w}_{k})_{j}=\bar{w}_{k,j}$ are the vectors of the size of the number of the paths.

Let $\{\lambda_{j}\}_{j}$ and $\{u_{j}\}_{j}$ be a set of eigenvalues and a corresponding orthonormal set of eigenvectors of $G$ . Let $\theta_{\bar{w}_{k},j}^{(1)}$ be the angle between $u_{j}$ and $\bar{w}_{k}$ . Let $\theta^{(2)}_{\bar{w}_{k}}$ be the angle between $v$ and $\bar{w}_{k}$ . Then (deterministically),

Proof idea. From Equation (1) with squared loss, we can decompose the generalization gap into three terms:

By manipulating each term, we obtain the desired statement. See Appendix C3 for a complete proof. $\hfill\square$

In Theorem 7, there is no concept of a hypothesis space. Instead, it indicates that if the norm of the weights $\|\bar{w}_{k}^{\scalebox{0.5}{$ S $}}\|_{2}$ at the end of learning process with the actual given $S$ is small, then the generalization gap is small, even if the norm $\|\bar{w}_{k}^{\scalebox{0.5}{$ S $}}\|_{2}$ is unboundedly large at anytime with any dataset other than $S$ .

Based on a generic bound-based theory, Neyshabur et al. (2015a, b) proposed to control the norm of the path weights $\|\bar{w}_{k}\|_{2}$ , which is consistent with our direct bound-less result (and which is as computationally tractable as a standard forward-backward pass), one can compute $\|\bar{w}_{k}^{\scalebox{0.5}{$ S $}}\|_{2}^{2}$ with a single forward pass using element-wise squared weights, an identity input, and no nonlinearity. One can also follow the previous paper (Neyshabur et al., 2015a) for its computation.). Unlike the previous results, we do not require a pre-defined bound on $\|\bar{w}_{k}\|_{2}$ over different datasets, but depend only on its final value with each $S$ as desired, in addition to more tight insights (besides the norm) via equality as discussed above. In addition to the pre-defined norm bound, these previous results have an explicit exponential dependence on the depth of the network, which does not appear in our Theorem 4. Similarly, some previous results specific to layered networks without skip connections (Sun et al., 2016; Xie et al., 2015) contain the $2^{L-1}$ factor and a bound on the product of the norm of weight matrices, $\prod_{l=1}^{L}\|W^{(l)}\|$ , instead of $\sum_{k}\|\bar{w}_{k}^{\scalebox{0.5}{$ S $}}\|_{2}$ . Here, $\sum_{k}\|\bar{w}_{k}\|_{2}^{2}\leq\prod_{l=1}^{L}\|W^{(l)}\|_{F}^{2}$ because the latter contains all of the same terms as the former as well as additional non-negative additive terms after expanding the sums in the definition of the norms.

Therefore, unlike previous bounds, Theorem 7 generates these new theoretical insights based on the tight equality (in the first line of the equation in Theorem 7). Notice that without manipulating the generalization gap, we can always obtain equality. However, the question answered here is whether or not we can obtain competitive theoretical insights (the path norm bound) via equality instead of inequality. From a practical view point, if the obtained insights are the same (e.g., regularize the norm), then equality-based theory has the obvious advantage of being more precise.

3 Probabilistic bound over random datasets

In Equation (2), the generalization gap is decomposed into three terms, each of which contains the difference between a sum of dependent random variables and its expectation. The dependence comes from the fact that $z_{i}=[\bar{x}_{i}\circ\bar{\sigma}(x_{i},w^{\scalebox{0.6}{$ S $}})]$ are dependent over the sample index $i$ , because of the dependence of $w^{\scalebox{0.6}{$ S $}}$ on the entire dataset $S$ . We then observe the following: in $z^{[L]}_{k}(x,w)=[\bar{x}\circ\bar{\sigma}(x,w)]^{\top}\bar{w}$ , the derivative of $z=[\bar{x}\circ\bar{\sigma}(x,w)]$ with respect to $w$ is zero everywhere (except for the measure zero set, where the derivative does not exist). Therefore, each step of the (stochastic) gradient decent greedily chooses the best direction in terms of $\bar{w}$ (with the current $z=[\bar{x}\circ\bar{\sigma}(x,w)]$ ), but not in terms of $w$ in $z=[\bar{x}\circ\bar{\sigma}(x,w)]$ (see Appendix A3 for more detail). This observation leads to a conjecture that the dependence of $z_{i}=[\bar{x}_{i}\circ\bar{\sigma}(x_{i},w^{\scalebox{0.6}{$ S $}})]$ via the training process with the whole dataset $S$ is not entirely “bad”in terms of the concentration of the sum of the terms with $z_{i}$ .

As a first step to investigate the dependence of $z_{i}$ , we evaluated the following novel two-phase training procedure that explicitly breaks the dependence of $z_{i}$ over the sample index $i$ . We first train a network in a standard way, but only using a partial training dataset $S_{\alpha m}=\{(x_{1},y_{1}),\dots,(x_{\alpha m},y_{\alpha m})\}$ of size $\alpha m$ , where $\alpha\in(0,1)$ (standard phase). We then assign the value of $w^{\scalebox{0.6}{$ \mathcal{S}_{\alpha m} $}}$ to a new placeholder $w_{\sigma}:=w^{\scalebox{0.6}{$ \mathcal{S}_{\alpha m} $}}$ and freeze $w_{\sigma}$ , meaning that as $w$ changes, $w_{\sigma}$ does not change. At this point, we have that $z^{[L]}_{k}(x,w^{\scalebox{0.6}{$ \mathcal{S}_{\alpha m} $}})=[\bar{x}\circ\bar{\sigma}(x,w_{\sigma})]^{\top}\bar{w}_{k}^{\scalebox{0.6}{$ \mathcal{S}_{\alpha m} $}}.$ We then keep training only the $\bar{w}_{k}^{\scalebox{0.6}{$ \mathcal{S}_{\alpha m} $}}$ part with the entire training dataset of size $m$ (freeze phase), yielding the final model via this two-phase training procedure as

We implemented the two-phase training procedure with the MNIST and CIFAR-10 datasets. The test accuracies of the standard training procedure (base case) were 99.47% for MNIST (ND), 99.72% for MNIST, and 92.89% for CIFAR-10. MNIST (ND) indicates MNIST with no data augmentation. The experimental details are in Appendix B.

Our source code is available at: http://lis.csail.mit.edu/code/gdl.html

Figure 2 presents the test accuracy ratios for varying $\alpha$ : the test accuracy of the two-phase training procedure divided by the test accuracy of the standard training procedure. The plot in Figure 2 begins with $\alpha=0.05$ , for which $\alpha m=3000$ in MNIST and $\alpha m=2500$ in CIFAR-10. Somewhat surprisingly, using a much smaller dataset for learning $w_{\sigma}$ still resulted in competitive performance. A dataset from which we could more easily obtain a better generalization (i.e., MNIST) allowed us to use smaller $\alpha m$ to achieve competitive performance, which is consistent with our discussion above.

3.2 Theoretical results

where $\beta_{1}=\frac{2C_{zz}}{3m_{\sigma}}\ln\frac{3d_{z}}{\delta}+\sqrt{\frac{2\gamma^{2}_{zz}}{m_{\sigma}}\ln\frac{3d_{z}}{\delta}},$ $\beta_{2}=\frac{2C_{yz}}{3m_{\sigma}}\ln\frac{6d_{y}d_{z}}{\delta}+\sqrt{\frac{\gamma^{2}_{yz}}{m_{\sigma}}\ln\frac{6d_{y}d_{z}}{\delta}},$ and $\beta_{2}=\frac{2C_{y}}{3m_{\sigma}}\ln\frac{3}{\delta}+\sqrt{\frac{2\gamma^{2}_{y}}{m_{\sigma}}\ln\frac{3}{\delta}}.$

4 Probabilistic bound for 0-1 loss with multi-labels

For the 0–1 loss with multi-labels, the two-phase training procedure in Section 5.3 yields the generalization bound in Theorem 9. Similarly to the bounds in Theorems 7 and 8, the generalization bound in Theorem 9 does not necessarily have dependence on the number of weights, and exponential dependence on depth and effective input dimensionality.

Discussions and open problems

It is very difficult to make a detailed characterization of how well a specific hypotheses generated by a certain learning algorithm will generalize, in the absence of detailed information about the given problem instance. Traditional learning theory addresses this very difficult question and has developed bounds that are as tight as possible given the generic information available. In this paper, we have worked toward drawing stronger conclusions by developing theoretical analyses tailored for the situations with more detailed information, including actual neural network structures, and actual performance on a validation set.

Optimization and generalization in deep learning are closely related via the following observation: if we make optimization easier by changing model architectures, generalization performance can be degraded, and vice versa. Hence, non-pessimistic generalization theory discussed in this paper might allow more architectural choices and assumptions in optimization theory.

Theorem 7 partially addresses Open Problem 3 by preserving the exact ordering via equality without bounds. However, it would be beneficial to consider a weaker notion of order preservation to gain analyzability with more useful insights as stated in Open Problem 3.

Acknowledgements

We gratefully acknowledge support from NSF grants 1420316, 1523767 and 1723381, from AFOSR FA9550-17-1-0165, from ONR grant N00014-14-1-0486, and from ARO grant W911 NF1410433, as well as support from NSERC, CIFAR and Canada Research Chairs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.

References

A Appendix: Additional discussions

Theorem 7 address Open Problem 2 with the limited applicability to certain neural networks with squared loss. In contrast, a parallel study (Kawaguchi and Bengio, 2018) presents a novel generic learning theory to address Open Problem 2 for general cases in machine learning. It would be beneficial to explore both a generic analysis (Kawaguchi and Bengio, 2018) and a concrete analysis in deep learning to get theoretical insights that are tailored for each particular case.

In general, theoretical bounds from statistical learning theory can be too loose to be directly used in practice. In addition, many theoretical results in statistical learning theory end up suggesting to simply regularize some notion of smoothness of a hypothesis class. Indeed, by upper bounding a distance between two functions (e.g., a hypothesis and the ground truth function corresponding to expected true labels), one can immediately see without statistical learning theory that regularizing some notion of smoothness of the hypothesis class helps guarantees on generalization. Then, by the Occam’s razor argument, one might prefer a simpler (yet still rigor) theory and a corresponding simpler algorithm.

This subsection proposes the following family of simple regularization algorithms: given any architecture and method, add a new regularization term for each mini-batch as

where $x_{i}$ is drawn from some distribution approximating the true distribution of $x$ , $\xi_{1},\dots,\xi_{\bar{m}}$ are independently and uniformly drawn from $\{-1,1\}$ , $\bar{m}$ is a mini-batch size and $\lambda$ is a hyper-parameter. Importantly, the approximation of the true distribution of $x$ is only used for regularization purposes and hence needs not be precisely accurate (as long as it plays its role for regularization). For example, it can be approximated by populations generated by a generative neural network and/or an extra data augmentation process. For simplicity, we call this family of methods as Directly Approximately Regularizing Complexity (DARC).

In this paper, we evaluated only a very simple version of the proposed family of methods as a first step. That is, our experiments employed the following simple and easy-to-implement method, called DARC1:

where $x_{i}$ is the $i$ -th sample in the training mini-batch. The additional computational cost and programming effort due to this new regularization is almost negligible because $z_{k}^{[L]}(x_{i})$ is already used in computing the original loss. This simplest version was derived by approximating the true distribution of $x$ with the empirical distribution of the training data.

We evaluated the proposed method (DARC1) by simply adding the new regularization term in equation (A.1) to existing standard codes for MNIST and CIFAR-10. A standard variant of LeNet (LeCun et al., 1998) and ResNeXt-29( $16\times 64$ d) (Xie et al., 2016) are used for MNIST and CIFAR-10, and compared with the addition of the studied regularizer. For all the experiments, we fixed $(\lambda/\bar{m})=0.001$ with $\bar{m}=64$ . We used a single model without ensemble methods. The experimental details are in Appendix B. The source code is available at: http://lis.csail.mit.edu/code/gdl.html

Table 1 shows the error rates comparable with previous results. To the best of our knowledge, the previous state-of-the-art classification error is 0.23% for MNIST with a single model (Sato et al., 2015) (and 0.21% with an ensemble by Wan et al. 2013). To further investigate the improvement, we ran 10 random trials with computationally less expensive settings, to gather mean and standard deviation (stdv). For MNIST, we used fewer epochs with the same model. For CIFAR-10, we used a smaller model class (pre-activation ResNet with only 18 layers). Table 2 summarizes the improvement ratio: the new model’s error divided by the base model’s error. We observed the improvements for all cases. The test errors (standard deviations) of the base models were 0.53 (0.029) for MNIST (ND), 0.28 (0.024) for MNIST, and 7.11 (0.17) for CIFAR-10 (all in $\%$ ).

Table 3 summarizes the values of the regularization term $\frac{1}{m}(\max_{k}\sum_{i=1}^{m}\allowbreak|z_{k}^{[L]}(x_{i})|)$ for each obtained model. The models learned with the proposed method were significantly different from the base models in terms of this value. Interestingly, a comparison of the base cases for MNIST (ND) and MNIST shows that data augmentation by itself implicitly regularized what we explicitly regularized in the proposed method.

A2 Relationship to other fields

The situation where theoretical studies focus on a set of problems and practical applications care about each element in the set is prevalent in machine learning and computer science literature, not limited to the field of learning theory. For example, for each practical problem instance $q\in Q$ , the size of the set $Q$ that had been analyzed in theory for optimal exploration in Markov decision processes (MDPs) were demonstrated to be frequently too pessimistic, and a methodology to partially mitigate the issue was proposed (Kawaguchi, 2016b). Bayesian optimization would suffer from a pessimistic set $Q$ regarding each problem instance $q\in Q$ , the issue of which was partially mitigated (Kawaguchi et al., 2015).

Moreover, characterizing a set of problems $Q$ only via a worst-case instance $q^{\prime}\in Q$ (i.e., worst-case analysis) is known to have several issues in theoretical computer science, and so-called beyond worst-case analysis (e.g., smoothed analysis) is an active area of research to mitigate the issues.

A3 SGD chooses direction in terms of w¯¯𝑤\bar{w}

Note that $\sigma(x,w)$ is 0 or 1 for max-pooling and/or ReLU nonlinearity. Thus, the derivative of $z=[\bar{x}\circ\bar{\sigma}(x,w)]$ with respect to $w$ is zero everywhere (except at the measure zero set where the derivative does not exists). Thus, by the chain rule (and power rule), the gradient of the loss with respect to $w$ only contain the contribution from the derivative of $z^{[L]}_{k}$ with respect to $\bar{w}$ , but not with respect to $w$ in $z$ .

A4 Simple implementation of two-phase training procedure

Directly implementing Equation (3) requires the summation over all paths, which can be computationally expensive. To avoid it, we implemented it by creating two deep neural networks, one of which defines $\bar{w}$ paths hierarchically, and another of which defines $w_{\sigma}$ paths hierarchically, resulting in the computational cost at most (approximately) twice as much as the original cost of training standard deep learning models. We tied $w_{\sigma}$ and $\bar{w}$ in the two networks during standard phase, and untied them during freeze phase.

Our source code is available at: http://lis.csail.mit.edu/code/gdl.html

The computation of the standard network without skip connection can be re-written as:

where $W^{[l]}_{\sigma}:=W^{[l]}$ , $z^{[l-1]}_{\sigma}:=\sigma(W^{[l]}_{\sigma}z^{[l-1]}_{\sigma}(x,w))$ and $\dot{\sigma}^{[l]}_{j}(W^{[l]}z^{[l-1]}(x,w))\allowbreak=1$ if the $j$ -th unit at the $l$ -th layer is active, $\dot{\sigma}^{[l]}_{j}(W^{[l]}z^{[l-1]}(x,w))=0$ otherwise. Note that because $W^{[l]}_{\sigma}=W^{[l]}$ , we have that $z^{[l-1]}_{\sigma}=z^{[l]}$ in standard phase and standard models.

In the two-phase training procedure, we created two networks for $W^{[l]}_{\sigma}\allowbreak z^{[l-1]}_{\sigma}(x,w)$ and $W^{[l]}\allowbreak z^{[l-1]}(x,w)$ separately. We then set $W^{[l]}_{\sigma}=W^{[l]}$ during standard phase, and frozen $W^{[l]}_{\sigma}$ and only trained $W^{[l]}$ during freeze phase. By following the same derivation of Equation (1), we can see that this defines the desired computation without explicitly computing the summation over all paths. By the same token, this applies to DAGs.

B Appendix: Experimental details

We used the following fixed architecture:

Convolutional layer with 32 filters with filter size of 5 by 5, followed by max pooling of size of 2 by 2 and ReLU.

Convolution layer with 32 filters with filter size of 5 by 5, followed by max pooling of size of 2 by 2 and ReLU.

Fully connected layer with output 1024 units, followed by ReLU and Dropout with its probability being 0.5.

Fully connected layer with output 10 units.

We fixed learning rate to be 0.01, momentum coefficient to be 0.5, and optimization algorithm to be (standard) stochastic gradient decent (SGD). We fixed data augmentation process as: random crop with size 24, random rotation up to $\pm 15$ degree, and scaling of 15%. We used 3000 epochs for Table 1, and 1000 epochs for Tables 2 and 3.

For data augmentation, we used random horizontal flip with probability 0.5 and random crop of size 32 with padding of size 4.

For Table 1, we used ResNeXt-29( $16\times 64$ d) (Xie et al., 2016). We set initial learning rate to be 0.05 and decreased to $0.005$ at 150 epochs, and to 0.0005 at 250 epochs. We fixed momentum coefficient to be 0.9, weight decay coefficient to be $5\times 10^{-4}$ , and optimization algorithm to be stochastic gradient decent (SGD) with Nesterov momentum. We stopped training at 300 epochs.

For Tables 2 and 3, we used pre-activation ResNet with only 18 layers (pre-activation ResNet-18) (He et al., 2016). We fixed learning rate to be 0.001 and momentum coefficient to be 0.9, and optimization algorithm to be (standard) stochastic gradient decent (SGD). We used 1000 epochs.

C Appendix: Proofs

We use the following lemma in the proof of Theorem 7.

Theorem 1.4 by Tropp (2012) states that for all $t\geq 0,$

Setting $\delta=d\exp\left(-\frac{t^{2}/2}{\gamma^{2}+Rt/3}\right)$ implies

Solving for $t$ with the quadratic formula and bounding the solution with the subadditivity of square root on non-negative terms (i.e., $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for all $a,b\geq 0$ ),

which grows unboundedly as $\alpha\rightarrow\infty$ without changing $A$ and $B$ , proving (ii)(c) in statement (ii). ∎

C2 Proof of Corollary 2

C3 Proof of Theorem 7

From Equation (1), the squared loss of deep models for each point $(x,y)$ can be rewritten as

Thus, from Equation (1) with the squared loss, we can decompose the generalization gap into three terms as

As $G$ is a real symmetric matrix, we denote an eigendecomposition of $G$ as $G=U\Lambda U^{\top}$ where the diagonal matrix $\Lambda$ contains eigenvalues as $\Lambda_{jj}=\lambda_{j}$ with the corresponding orthogonal eigenvector matrix $U$ ; $u_{j}$ is the $j$ -th column of $U$ . Then,

C4 Proof of Theorem 8

From Equation (2), with the definition of induced matrix norm and the Cauchy-Schwarz inequality,

In the below, we bound each term of the right-hand side of the above with concentration inequalities.

For the first term: Matrix Bernstein inequality (Lemma 10) states that for any $\delta>0$ , with probability at least $1-\delta/3$ ,

For the second term: We apply Bernstein inequality to each $(k,k^{\prime})\in\{1,\dots,d_{y}\}\times\{1,\dots,d_{z}\}$ and take union bound over $d_{y}d_{z}$ events, obtaining that for any $\delta>0$ , with probability at least $1-\delta/3$ , for all $k\in\{1,2,\dots,d_{y}\}$ ,

For the third term: From Bernstein inequality, with probability at least $1-\delta/3$ ,

Putting together: Putting together, for a fixed (or frozen) $w_{\sigma}$ , with probability at least $1-\delta$ (probability over $S\setminus S_{\alpha m}=\{(x_{\alpha m+1},y_{\alpha m+1}),\dots,(x_{m},\allowbreak y_{m})\}$ ), we have that

Since Equation (C.1) always hold deterministically (with or without such a dataset), the desired statement of this theorem follows. ∎

C5 Proof of Theorem 9

Recall the following fact: using the result by Koltchinskii and Panchenko (2002), we have that for any $\delta>0$ , with probability at least $1-\delta$ , the following holds for all $f\in{\mathcal{F}}$ :

where $\mathfrak{R}^{\prime}_{m_{\sigma}}({\mathcal{F}})$ is Rademacher complexity defined as

Here, $\xi_{i}$ is the Rademacher variable, and the supremum is taken over all $k\in\{1,\dots,d_{y}\}$ and all $w$ allowed in ${\mathcal{F}}$ . Then, for our parameterized hypothesis spaces with any frozen $w_{\sigma}$ ,

Because square root is concave in its domain, by using Jensen’s inequality and linearity of expectation,

Putting together, we have that $\mathfrak{R}^{\prime}_{m}({\mathcal{F}})\leq C_{\sigma}C_{w}\sqrt{m_{\sigma}}$ . ∎

C6 Proof of Proposition 5

By noticing that the solution of $\epsilon$ with the minus sign results in $\epsilon<0$ , which is invalid for Bernstein inequality, we obtain the valid solution with the plus sign. Then, we have