Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks

Gauthier Gidel, Francis Bach, Simon Lacoste-Julien

Introduction

When optimizing over-parameterized models, such as deep neural networks, a large set of parameters leads to a zero training error. However they lead to different values for the test error and thus have distinct generalization properties. More specifically, Neyshabur (2017, Part II) argues that the choice of the optimization algorithm (and its respective hyperparameters) provides an implicit regularization with respect to its geometry: it biases the training, finding a particular minimizer of the objective.

In this work, we use the same setting as Saxe et al. (2018): a regression problem with least-squares loss on a multi-dimensional output. Our prediction is made either by a linear model or by a two-layer linear neural network (Saxe et al., 2018). We extend their work which covered the continuous gradient dynamics, to weaker assumptions as well as analyze the behavior of the discrete gradient updates

We show that with a vanishing initialization and a small enough step-size, the gradient dynamics of a two-layer linear neural network sequentially learns components that can be ranked according to a hierarchical structure whereas the gradient dynamics induced by the same regression problem but with a linear prediction model instead learns these components simultaneously, missing this notion of hierarchy between components. The path followed by the two-layer formulation actually corresponds to successively solving the initial regression problem with a growing low rank constraint which is also know as reduced-rank regression (Izenman, 1975). Note that this notion of path followed by the dynamics of a whole network is different from the notion of path introduced by Neyshabur et al. (2015a) which corresponds to a path followed inside a fixed network, i.e., one corresponds to training dynamics whereas the other corresponds to the propagation of information inside a network.

To sum-up, in our framework, the path followed by the gradient dynamics of a two-layer linear network provides an implicit regularization that may lead to potentially better generalization properties. Our contributions are the following:

Under some assumptions (see Assumption 1), we prove that both the discrete and continuous gradient dynamics sequentially learn the solutions of a gradually less regularized version of reduced-rank regression (Corollary 2 and 3). Among the close related work, such result on implicit regularization regarding discrete dynamics is novel. For the continuous case, we weaken the standard commutativity assumption using perturbation analysis.

We experimentally verify the reasonableness of our assumption and observe improvements in terms of generalization (matrix reconstruction in our case) using the gradient dynamics of the two-layer linear network when compared against the linear model.

The implicit regularization provided by the choice of the optimization algorithm has recently become an active area of research in machine learning, putting lot of interest on the behavior of gradient descent on deep over-parametrized models (Neyshabur et al., 2015b, 2017; Zhang et al., 2017).

Several works show that gradient descent on unregularized problems actually finds a minimum norm solution with respect to a particular norm that drastically depends on the problem of interest. Soudry et al. (2018) look at a logistic regression problem and show that the predictor does converge to the max-margin solution. A similar idea has been developed in the context of matrix factorization (Gunasekar et al., 2017). Under the assumption that the observation matrices commute, they prove that gradient descent on this non-convex problem finds the minimum nuclear norm solution of the reconstruction problem, they also conjecture that this result would still hold without the commutativity assumption. This conjecture has been later partially solved by Li et al. (2018) under mild assumptions (namely the restricted isometry property). This work has some similarities with ours, since both focus on a least-squares regression problem over matrices with a form of matrix factorization that induces a non convex landscape. Their problem is more general than ours (see Uschmajew and Vandereycken (2018) for an even more general setting) but they are showing a result of a different kind from ours: they focus on the properties of the limit solution the continuous dynamics whereas we show some properties on the whole dynamics (continuous and discrete), proving that it actually visits points during the optimization that may provide good generalization. Interestingly, both results actually share common assumptions such as a commutativity assumption (which is less restrictive in our case since it is always true in some realistic settings such as linear autoencoders), vanishing initialization and a small enough step size.

Nar and Sastry (2018) also analyzed the gradient descent algorithm on a least-squares linear network model as a discrete time dynamical system, and derived certain necessary (but not sufficient) properties of the local optima that the algorithm can converge to with a non-vanishing step size. In this work, instead of looking at the properties of the limit solutions, we focus on the path followed by the gradient dynamics and precisely caracterize the weights learned along this path.

Combes et al. (2018) studied the continuous dynamics of some non-linear networks under relatively strong assumptions such as the linear separability of the data. Conversely, in this work, we do not make such separability assumption on the data but focus on linear networks.

Finally, Gunasekar et al. (2018) compared the implicit regularization provided by gradient descent in deep linear convolutional and fully connected networks. They show that the solution found by gradient descent is the minimum norm for both networks but according to a different norm. In this work, the fact that gradient descent finds the minimum norm solution is almost straightforward using standard results on least-squares. But the path followed by the gradient dynamics reveals interesting properties for generalization. As developed earlier, instead of focusing on the properties of the solution found by gradient descent, our goal is to study the path followed by the discrete gradient dynamics in the case of a two-layer linear network.

Prior work (Saxe et al., 2013, 2014; Advani and Saxe, 2017; Saxe et al., 2018; Lampinen and Ganguli, 2019) studied the gradient dynamics of two-layer linear networks and proved a result similar to our Thm. 2. We consider Saxe et al. (2018) as the closest related work, we re-use their notion of simple deep linear neural network, that we call two-layer neural networks, and use some elements of their proofs to extend their results. However, note that their work comes from a different perspective: through a mathematical analysis of a simple non-linear dynamics, they intend to highlight continuous dynamics of learning where one observes the sequential emergence of hierarchically structured notions to explain the regularities in representation of human semantic knowledge. In this work, we are also considering a two-layer neural network but with an optimization perspective. We are able to extend Saxe et al. (2018, Eq. 6 and 7) weakening the commutativity assumption considered in Saxe et al. (2018) using perturbation analysis. In §4.1, we test to what extent our weaker assumption holds. Our main contribution is to show a similar result on the discrete gradient dynamics, that is important in our perspective since we aim to study the dynamics of gradient descent. This result cannot be trivially extended from the result on the continuous dynamics. We provide details on the difficulties of the proof in §3.2.

A Simple Deep Linear Model

where ${\bm{W}}_{1},\ldots,{\bm{W}}_{L}$ are learned through a MSE formulation with the least-squares loss $f$ ,

are the design matrices of $({\bm{x}}_{i})_{1\leq i\leq n}$ and $({\bm{y}}_{i})_{1\leq i\leq n}$ . The deep linear model (1) is a $L$ -layer deep linear neural network where we see ${\bm{h}}_{l}:={\bm{W}}_{l}\cdots{\bm{W}}_{1}{\bm{x}}$ for $1\leq l\leq L-1$ as the $l^{th}$ hidden layer. At first, since this deep linear network cannot represent more than a linear transformation, we could think that there is no reason to use a deeper representation $L=1$ . However, in terms of learning flow, we will see in §3 that for $L=2$ this model has a completely different dynamics from $L=1$ .

Increasing $L$ may induce a low rank constraint when $r:=\min\{r_{l}\,:\,1\leq l\leq L-1\}<\min(d,p)$ . In that case, (2) is equivalent to a reduced-rank regression,

These problems have explicit solutions depending on ${\bm{X}}$ and ${\bm{Y}}$ (Reinsel and Velu, 1998, Thm. 2.2).

Note that, in this work we are interested in the implicit regularization provided in the context of over-parametrized models, i.e., when $r>\min(p,d)$ . In that case,

Gradient Dynamics as a Regularizer

In this section we would like to study the discrete dynamics of the gradient flow of (2), i.e.,

where we use the notation ${\bm{W}}_{[L]}^{(t)}:=({\bm{W}}_{1}^{(t)},\ldots,{\bm{W}}_{L}^{(t)})$ . The quantity $\eta$ is usually called the step-size. In order to get intuitions on the discrete dynamics we also consider its respective continuous version,

where for $1\leq l\leq L$ , $\dot{\bm{W}}_{l}(t)$ is the temporal derivative of ${\bm{W}}_{l}(t)$ . Note that there is no step-size in the continuous time dynamics since it actually corresponds to the limit of (5) when $\eta\to 0$ . The continuous dynamics may be more convenient to study because such differential equations may have closed form solutions. In §3.1, we will see that under reasonable assumptions it is the case for (6).

We start with the study of the continuous linear model, its gradient is,

where ${\bm{\Sigma}}_{xy}:=\frac{1}{n}{\bm{X}}^{\top}{\bm{Y}}$ and ${\bm{\Sigma}}_{x}:=\frac{1}{n}{\bm{X}}^{\top}{\bm{X}}$ . Thus, ${\bm{W}}(t)$ is the solution of the differential equation,

where ${\bm{\Sigma}}_{x}^{\dagger}$ is the pseudoinverse of ${\bm{\Sigma}}_{x}$ .

This standard result on ODE is provided in §B.1. Note that when ${\bm{W}}_{0}\to\bm{0}$ we have

Deep linear network: L≥2𝐿2L\geq 2.

The study of the deep linear model is more challenging since for $L\geq 2$ , the landscape of the objective function $f$ is non-convex. The gradient flow of (2) is

where we used the convention that ${\bm{W}}_{1,0}={\bm{I}}_{d}$ and ${\bm{W}}_{L+1,L}={\bm{I}}_{p}$ . Thus (6) becomes

We obtain a coupled differential equation (12) that is harder to solve than the previous linear differential equation (8) due, at the same time, to its non-linear components and to the coupling between ${\bm{W}}_{l}\,,\;1\leq l\leq L$ . However, in the case $L=2$ , Saxe et al. (2018) managed to find an explicit solution to this coupled differential equation under the assumption that “perceptual correlation is minimal” ( ${\bm{\Sigma}}_{x}={\bm{I}}_{d}$ ).By a rescaling of the data, their proof is valid for any matrix ${\bm{\Sigma}}_{x}$ proportional to the identity matrix. In this work we extend Saxe et al. (2018, Eq. 7) (for $L=2$ ) under weaker assumptions. More precisely, we do not require the covariance matrix ${\bm{\Sigma}}_{x}$ to be the identity matrix. Let $({\bm{U}},{\bm{V}},{\bm{D}})$ be the SVD of ${\bm{\Sigma}}_{xy}$ , our assumption is the following:

There exist two orthogonal matrices ${\bm{U}}$ , ${\bm{V}}$ such that we have the joint decomposition,

where ${\bm{B}}$ is such that $\|{\bm{B}}\|_{2}\leq\epsilon$ and ${\bm{D}}_{x},\,{\bm{D}}_{xy}$ are matrices only with diagonal coefficients. We note $\sigma_{1}\geq\dots\geq\sigma_{r_{xy}}>0$ the singular values of ${\bm{\Sigma}}_{xy}$ and $\lambda_{1},\ldots,\lambda_{r_{x}}$ the diagonal entries of ${\bm{D}}_{x}$ .

Since two matrices commute if and only if they are co-diagonalizable (Horn et al., 1985, Thm. 1.3.21), the quantity $\epsilon$ represent to what extend ${\bm{\Sigma}}_{x}$ and ${\bm{\Sigma}}_{xy}{\bm{\Sigma}}_{xy}^{\top}$ do not commute. Before solving (12) under Assump. 1, we describe some motivating examples where the quantity $\epsilon$ is small or zero:

Linear autoencoder: If ${\bm{Y}}$ is set to ${\bm{X}}$ and $L=2$ , we recover a linear autoencoder: $\hat{\bm{x}}({\bm{x}})={\bm{W}}_{2}^{\top}{\bm{W}}_{1}^{\top}{\bm{x}}$ , where ${\bm{h}}:={\bm{W}}_{1}^{\top}{\bm{x}}$ is the encoded representation of ${\bm{x}}$ ,

Note that this linear autoencoder can also be interpreted as a form of principal component analysis. Actually, if we initialize with ${\bm{W}}_{1}={\bm{W}}_{2}^{\top}$ , the gradient dynamics exactly recovers the PCA of ${\bm{X}}$ , which is closely related to the matrix factorization problem of Gunasekar et al. (2017). See §A where this derivation is detailed.

Deep linear $multi$ -class prediction: In that case, $p$ is the number of classes and ${\bm{y}}_{i}$ is a one-hot encoding of the class with, in practice, $p\ll d$ . The intuition on why we may expect $\|{\bm{B}}\|_{2}$ to be small is because $\operatorname{rank}({\bm{Y}})\ll\operatorname{rank}({\bm{X}})$ and thus the matrices of interest only have to almost commute on a small space in comparison to the whole space, thus ${\bm{B}}$ would be close to . We verify this intuition by computing $\|{\bm{B}}\|_{2}$ for several classification datasets in Table 1.

Minimal influence of perceptual correlation: ${\bm{\Sigma}}_{x}\approx{\bm{I}}_{d}$ . It is the setting discussed by Saxe et al. (2018). We compare this assumption for some classification datasets with our Assump. 1 in §4.1.

An explicit solution for L=2𝐿2L=2.

Under Assump. 1 and specifying the initialization, one can solve the matrix differential equation for $\epsilon=0$ and then use perturbation analysis to assess how close the solution of (8) is to the closed form solution derived for $\epsilon=0$ . This result is summarized in the following theorem proved in §B.2.

When $L=2$ , under Assump. 1, if we initialize with ${\bm{W}}_{1}(0)={\bm{U}}\operatorname{diag}(e^{-\delta_{1}},\ldots,e^{-\delta_{p}}){\bm{Q}}$ and ${\bm{W}}_{2}(0)={\bm{Q}}^{-1}\operatorname{diag}(e^{-\delta_{1}},\ldots,e^{-\delta_{d}}){\bm{V}}^{\top}$ where ${\bm{Q}}$ is an arbitrary invertible matrix, then the solution of (12) can be decomposed as the sum of the solution for $\epsilon=0$ and a perturbation term,

where we have $c>0$ such that $\|{\bm{W}}_{i}^{\epsilon}(t)\|\leq\epsilon\cdot e^{ct^{2}}$ and,

where $(\sigma_{i})$ and $(\lambda_{i})$ are defined is Assump. 1. Note that $\operatorname{rank}({\bm{\Sigma}}_{xy}):=r_{xy}\leq\operatorname{rank}({\bm{\Sigma}}_{x}):=r_{x}$ .

The main difficulty in this result is the perturbation analysis for which we use a consequence of Grönwall’s inequality (Gronwall, 1919) (Lemma 4). The proof can be sketched in three parts: first showing the result for $\epsilon=0$ , then showing that in the case $\epsilon>0$ , the matrices ${\bm{W}}_{1}(t)/t$ and ${\bm{W}}_{2}(t)/t$ are bounded and finally use Lemma 4 to get the perturbation bound.

This result is more general than the one provided by Saxe et al. (2018) because it requires a weaker assumption than ${\bm{\Sigma}}_{x}={\bm{I}}_{d}$ and $\epsilon=0$ . In doing so, we obtain a result that takes into account the influence of correlations of the input samples. Note that Thm. 1 is only valid if the initialization ${\bm{W}}_{1}(0){\bm{W}}_{2}(0)$ has the same singular vectors as ${\bm{\Sigma}}_{xy}$ . However, making such assumptions on the initialization is standard in the literature and, in practice, we can set the initialization of the optimization algorithm in order to also ensure that property. For instance, in the case of the linear autoencoder, one can set ${\bm{W}}_{1}(0)={\bm{W}}_{2}(0)=e^{-\delta}{\bm{I}}_{d}$ .

In the following subsection we will use Thm. 1 to show that the components $[{\bm{U}}]_{i}\,,\;1\leq i\leq r_{xy}$ in the order defined by the decreasing singular values of ${\bm{\Sigma}}_{xy}$ are learned sequentially by the gradient dynamics.

Sequential learning of components.

The sequential learning of the left singular vectors of ${\bm{\Sigma}}_{xy}$ (sorted by the magnitude of its singular values) by the continuous gradient dynamics of deep linear networks has been highlighted by Saxe et al. (2018). They note in their Eq. (10) that the $i^{th}$ phase transition happens approximately after a time $T_{i}$ defined as (using our notation),

They argue that as $\delta_{i}\to\infty$ , the time $T_{i}$ is roughly $O(1/\sigma_{i})$ . The intuition is that a vanishing initialization increases the gap between the phase transition times $T_{i}$ and thus tends to separate the learning of each components. However, a vanishing initialization just formally leads to $T_{i}\to\infty$ .

In this work, we introduce a notion of time rescaling in order to formalize this notion of phase transition and we show that, after this time rescaling, the point visited between two phase transitions is the solution of a low rank regularized version (4) of the initial problem (2) with the low rank constraint that loosens sequentially.

The intuition behind time rescaling is that it counterbalances the vanishing initialization in (17): Since $T_{i}$ grows as fast as $\delta_{i}$ we need to multiply the time by $\delta_{i}$ , in order to grow at the same pace as $T_{i}$ .

Using this rescaling we can present our theorem, proved in §B.3, which says that a vanishing initialization tends to force the sequential learning of the component of ${\bm{X}}$ associated with the largest singular value of ${\bm{\Sigma}}_{xy}$ . Note that we need to rescale the time uniformly for each component. That is why in the following we set $\delta_{i}=\delta\,,\,1\leq i\leq\max(p,d)$ .

Let us denote $w_{i}(t)$ , the values defined in (16). If $w_{i}(0)=e^{-\delta}\,,\;1\leq i\leq r,$ and $\epsilon=e^{-\delta^{2}\ln(\delta)}$ then we have that $w_{i}(\delta t)$ converge to a step function as $\delta\to\infty$ :

where $T_{i}:=1/\sigma_{i}$ , $\mathds{1}\{t\in A\}=1$ if $\,t\in A$ and otherwise.

Notice how the $i^{th}$ components of ${\bm{W}}_{1}$ and ${\bm{W}}_{2}$ are inactive, i.e., $w_{i}(t)$ is zero, for small $t$ and is suddenly learned when $t$ reaches the phase transition time $T_{i}:=1/\sigma_{i}$ . As shown in Prop. 1 and illustrated in Fig. 1, this sequential learning behavior does not occur for the non-factorized formulation. Gunasekar et al. (2017) observed similar differences between their factorized and not factorized formulations of matrix regression. Note that, the time rescaling we introduced is $t\rightarrow\delta t$ , in order to compensate the vanishing initialization, rescaling the time and taking the limit this way for (8) would lead to a constant function.

Gunasekar et al. (2017) also had to consider a vanishing initialization in order to show that on a simple matrix factorization problem the continuous dynamics of gradient descent does converge to the minimum nuclear norm solution. This assumption is necessary in such proofs in order to avoid to initialize with wrong components. However one cannot consider an initialization with the null matrix since it is a stationary point of the dynamics, that is why this notion of double limit (vanishing initialization and $t\to\infty$ ) is used.

From Thm. 2, two corollaries follow directly. The first one regards the nuclear norm of the product ${\bm{W}}_{1}(\delta t){\bm{W}}_{2}(\delta t)$ . This corollary says that $\|{\bm{W}}_{1}(\delta t){\bm{W}}_{2}(\delta t)\|_{*}$ is a step function and that each increment of this integer value corresponds to the learning of a new component of ${\bm{X}}$ . These components are leaned by order of relevance, i.e., by order of magnitude of their respective eigenvalues and the learning of a new component can be easily noticed by an incremental gap in the nuclear norm of the matrix product ${\bm{W}}_{1}(\delta t){\bm{W}}_{2}(\delta t)$ ,

Let ${\bm{W}}_{1}(t)$ and ${\bm{W}}_{2}(t)$ be the matrices solution of (12) defined in (15). The limit of the squared euclidean norm of ${\bm{W}}_{1}(t){\bm{W}}_{2}(t)$ when $\delta\to\infty$ is a step function defined as,

where $T_{i}:=1/\sigma_{i}$ and $\sigma_{1}>\cdots>\sigma_{r_{xy}}>0$ are the positive singular values of ${\bm{\Sigma}}_{xy}$ .

From Thm. 2, we can notice that, between time $T_{k}$ and $T_{k+1}$ , the rank of the limit matrix ${\bm{W}}_{1}{\bm{W}}_{2}$ is actually equal to $k$ , meaning that at each phase transition, the rank of ${\bm{W}}_{1}{\bm{W}}_{2}$ is increased by 1. Moreover, this matrix product contains the $k$ components of ${\bm{X}}$ corresponding to the $k$ largest singular values of ${\bm{\Sigma}}_{xy}$ . Thus, we can show that this matrix product is the solution of the $k$ -low rank constrained version (4) of the initial problem (2),

Let ${\bm{W}}_{1}(t)$ and ${\bm{W}}_{2}(t)$ be the matrices solution of (12) defined in (15). We have that,

2 Discrete dynamics

We are interested in the behavior of optimization methods. Thus, the gradient dynamics of interest is the discrete one (5). A major contribution of our work is thus contained in this section. The continuous case studied in § 3.1 provided good intuitions and insights on the behavior of the potential discrete dynamics that we can use for our analysis.

Previous related work (Gunasekar et al., 2017; Saxe et al., 2018) only provide results on the continuous dynamics. Their proofs use the fact that their respective continuous dynamics of interest have a closed form solution (e.g., Thm.1). To our knowledge, no closed form solution is known for the discrete dynamics (5). Thus its analysis requires a new proof technique. Moreover, using Euler’s integration methods, one can show that both dynamics are close but only for a vanishing step size depending on a finite horizon. Such dependence on the horizon is problematic since the time rescaling used in Thm. 2 would make any finite horizon go to infinity. In this section, we consider realistic conditions on the step-size (namely, it has to be smaller than the Lipschitz constant and some notion of eigen-gap) without any dependence on the horizon. Such assumption is relevant since we want to study the dynamics of practical optimization algorithms (i.e., with a step size as large as possible and without knowing in advance the horizon).

Single layer linear model.

In this paragraph, we consider the discrete update for the linear model. Since $L=1$ , for notational compactness, we call ${\bm{W}}_{t}$ the matrix updated according to (5). Using the gradient derivation (7), the discrete update scheme for the linear model is,

Noticing that for $1/\lambda_{\max}({\bm{\Sigma}}_{x})>\eta>0\,,\;{\bm{I}}_{d}-\eta{\bm{\Sigma}}_{x}$ is invertible, this recursion (see §B.4) leads to,

We obtain a similar result as the solution of the differential equation given in Prop. 1. With a vanishing initialization we reach a function that does not sequentially learn some components.

Two-layer linear model.

The discrete update scheme for the two-layer linear network (2) is,

When $\epsilon=0$ , by a change of basis and a proper initialization, we can reduce the study of this matrix equation to $r$ independant dynamics (see §B.5 for more details), for $1\leq i\leq r$ ,

Thus we can derive a bound on the iterate $w_{i}^{(t)}$ leading to the following theorem,

Under the same assumptions as Thm. 1 and $\epsilon=0$ , we have

Moreover, for any $1\leq i\leq r_{xy}$ , if $1>w_{i}^{(0)}>0$ and $2\eta\sigma_{i}<1$ , then $\forall t\geq 0\,,\;1\leq i\leq r_{x}$ we have,

and $w_{i}^{(t)}\leq\tfrac{w_{i}^{(0)}}{1+w_{i}^{(0)}\lambda_{i}\eta t}$ for $r_{xy}\leq i\leq r_{x}$ . The differences with the continuous case (16) are in red.

The solution of the continuous dynamics lets us think directly studying the sequence $w_{i}^{(t)}$ might be quite challenging since the solution of the continuous dynamics $w_{i}(t)^{-1}$ has a non-linear and non-convex behavior.

The main insight from this proof is that one can treat the discrete case using the right transformation, to show that some sequence doee converge linearly.

Thm. 2 indicates the quantity $w_{i}(t)^{-1}-\tfrac{\lambda_{i}}{\sigma_{i}}$ is the good candidate to show linear convergence to 0,

What we can expect is thus to show that the sequence $(w_{i}^{(t)})^{-1}-\tfrac{\sigma_{i}}{\lambda_{i}}$ has similar properties. The first step of the proof is to show that $(w_{i}^{(t)})$ is an increasing sequence smaller than one. The second step is then to use (22) to get,

Then using that $1-x\leq\frac{1}{1+x}\leq 1-x+x^{2}$ for any $1\leq x\leq 0$ we can derive the upper and lower bounds on the linear convergence rate. See §B.5 for full proof. ∎

In order to get a similar interpretation of Thm. 3 in terms of implicit regularization, we use the intuitions from Thm. 2. The analogy between continuous and discrete time is that the discrete time dynamics is doing $t$ time-steps of size $\eta$ , meaning that we have ${\bm{W}}(\eta t)\approx{\bm{W}}_{t}$ , the time rescaling in continuous time consists in multiplying the time by $\delta$ thus we get the analog phase transition time,

Recall that we assumed that $m_{i}^{(0)}=n_{i}^{(0)}=e^{-\delta}$ . Thus, we show that the $i^{th}$ component is learned around time $T_{i}$ , and consequently that the components are learned sequentially,

If $\eta<\frac{1}{2\sigma_{1}}$ , $\eta<2\tfrac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}}$ and $\eta<\tfrac{\sigma_{i}-\sigma_{i+1}}{2\sigma_{i+1}^{2}},$ for $\;1\leq i\leq r_{xy}-1$ , then for $1\leq i<r_{x}$ ,

where $T_{0}:=0,\,T_{j}:=\frac{1}{\sigma_{j}\eta}\,,\;1\leq j\leq r_{xy}$ and $T_{j}:=+\infty\>\text{if}\>j>r_{xy}$ .

This result is proved in §B.5. The quantities $\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}}$ and $\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i+1}^{2}}$ can be interpreted as relative eigen-gaps. Note that they are well defined since we assumed that the eigenspaces were unidimensional. The intuition behind this condition is that the step-size cannot be larger than the eigen-gaps because otherwise the discrete optimization algorithm would not be able to distinguish some components.

Experiments

In this section we verify to what extent Assump. 1 is true on standard classification datasets. For this, we compute the normalized quantities $\Delta_{xy}$ and $\Delta_{x}$ representing how much Assump. 1 and the assumption that ${\bm{\Sigma}}_{x}\approx{\bm{I}}_{d}$ are respectively broken. We compute ${\bm{B}}$ by computing ${\bm{U}}$ , the left singular vector of ${\bm{\Sigma}}_{xy}$ and looking at the non-diagonal coefficients of ${\bm{U}}^{\top}{\bm{\Sigma}}_{x}{\bm{U}}$ ,

where $\|\cdot\|$ is the Frobenius norm, the $\hat{\bm{\Sigma}}$ expressions correspond to $\hat{\bm{X}}:={\bm{X}}/\|{\bm{X}}\|$ and $\hat{\bm{I}}_{d}:={\bm{I}}_{d}/\|{\bm{I}}_{d}\|$ . These normalized quantities are between and $1$ . The closer to $1$ , the less the assumption hold and conversely, the closer to , the more the assumption approximately holds. We present the results on three standard classification datasets, MNIST (LeCun et al., 2010), CIFAR10 (Krizhevsky et al., 2014) and ImageNet (Deng et al., 2009), a down-sampled version of ImageNet with images of size $64\times 64$ . In Table 1, we can see that the quantities $\Delta_{x}$ and $\Delta_{xy}$ do not vary much among the datasets and that the $\Delta$ associated with our our Assump. 1 is two orders of magnitude smaller than the $\Delta$ associated with Saxe et al. (2018)’s assumption indicating the relevance of our assumption.

2 Linear Autoencoder

We can see that the experimental results are very close to the theoretical behavior predicted with the continuous dynamics in Figure 1. Contrary to the dynamics induced by the linear model formulation ( $L=1$ ), the dynamics induced by the two-layer linear network ( $L=2$ ) is very close to a step function: each step corresponds to the learning to a new component: They are learned sequentially.

Discussion

There is a growing body of empirical and theoretical evidence that the implicit regularization induced by gradient methods is key in the training of deep neural networks. Yet, as noted by Zhang et al. (2017), even for linear models, our understanding of the origin of generalization is limited. In this work, we focus on a simple non-convex objective that is parametrized by a two-layer linear network. In the case of linear regression we show that the discrete gradient dynamics also visits points that are implicitly regularized solutions of the initial optimization problem. In that sense, in the context of machine learning, applying gradient descent on the overparametrized model of interest, provides a form of implicit regularization: it sequentially learns the hierarchical components of our problem which could help for generalization. Our setting does not pretend to solve generalization in deep neural networks; many majors components of the standard neural network training are omitted such as the non-linearities, large values of $L$ and the stochasticity in the learning procedure (SGD). Nevertheless, it provides useful insights about the source of generalization in deep learning.

This research was partially supported by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chair in “Data Science for Realtime Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936, by a graduate Borealis AI fellowship and by a Google Focused Research award.

References

Appendix A Deep Linear Autoencoder Recovers PCA.

Let us recall that the two-layer linear autoencoder can be formulated as,

Thus, the gradients of the objective are,

Thus, if ${\bm{W}}_{2}^{(0)}=({\bm{W}}_{1}^{(0)})^{\top}$ and if $[{\bm{W}}_{1}^{(0)}{\bm{W}}_{2}^{(0)},{\bm{\Sigma}}_{x}]=0$ , we have that,

Thus, for the discrete case, by a recurrence we have that, ${\bm{W}}_{1}^{(t)}=({\bm{W}}_{2}^{(t)})^{\top}\,,\;t\geq 0$ and for the continuous case, invoking the Cauchy-Lipschitz theorem, we have that ${\bm{W}}_{1}(t)={\bm{W}}_{2}(t)^{\top}\,,\;t\geq 0$ . Consequently, the limit solution is such that

Appendix B Proof of Theorems and Propositions

where ${\bm{\Sigma}}_{x}^{\dagger}$ is the pseudoinverse of ${\bm{\Sigma}}_{x}$ .

We can differentiate (34) and check if it verifies (8). In order to do that, we just need to notice that ${\bm{\Sigma}}_{x}{\bm{\Sigma}}_{x}^{\dagger}{\bm{\Sigma}}_{xy}={\bm{\Sigma}}_{xy}$ . To see that we compute the SVD of ${\bm{X}}^{\top}={\bm{U}}^{\top}{\bm{D}}{\bm{V}}$ where ${\bm{D}}$ is a rectangular matrix with only diagonal coefficients such that,

Thus, we have ${\bm{\Sigma}}_{x}={\bm{U}}^{\top}\operatorname{diag}(\lambda_{1},\ldots,\lambda_{r},0,\ldots,0){\bm{U}}$ and ${\bm{\Sigma}}_{x}^{\dagger}={\bm{U}}^{\top}\operatorname{diag}(1/\lambda_{1},\ldots,1/\lambda_{r},0,\ldots,0){\bm{U}}$ . Leading to,

Consequently, the matrix valued function ${\bm{W}}(t)$ defined in (34) verifies (8). Now we just need to use Cauchy-Lipschitz theorem [Coddington and Levinson, 1955] (a.k.a. Picard–Lindelöf theorem) to say that this solution is the unique solution of the ODE (8).

B.2 proof of Thm. 1

We use ideas from [Saxe et al., 2018] and combine it with Assum. 1 for $\epsilon=0$ . Note that $\epsilon=0$ if and only if ${\bm{\Sigma}}_{x}$ and ${\bm{\Sigma}}_{xy}$ commute. thus, we have that,

Let us consider a generalization of the linear transformation proposed by Saxe et al. [2018, Eq. S6,S7],

where ${\bm{Q}}_{l}\,,\;1\leq l\leq L-1$ are arbitrary invertible matrices. Then, noting ${\bm{Q}}_{0}:={\bm{U}}$ and ${\bm{Q}}_{L}:={\bm{V}}$ , we get the following dynamics,

Thus using (36), the fact that ${\bm{U}}^{\top}{\bm{U}}={\bm{I}}_{d}$ and that for any invertible matrix ${\bm{Q}}$ , we have $({\bm{Q}}^{-1})^{\top}=({\bm{Q}}^{\top})^{-1}$ , we get that,

Using the same argument as [Saxe et al., 2018], if $\bar{\bm{W}}_{l}(t)\,,\;1\leq l\leq L$ only have diagonal coefficients then their derivative also only have diagonal coefficients. Thus, if we initialize ${\bm{W}}_{l}^{(0)}\,,\;1\leq l\leq L$ , only with diagonal coefficients we have a decoupled solution for each diagonal coefficient. This argument can be formalized using Cauchy-Lipschitz theorem: (39) has a unique solution which is the one we will exhibit in the following.

where the notation $w_{-l,i}(t)$ stands for the product of the $w_{k,i}(t)\,,\,1\leq k\leq L$ omitting $w_{l,i}(t)$ , i.e.,

and $w_{i}(t)$ stands for the product of the $w_{k,i}(t)\,,\,1\leq k\leq L$ . The difference with [Saxe et al., 2018] is that, since they only consider the case ${\bm{\Sigma}}_{x}={\bm{I}}_{d}$ they have $\lambda_{i}=1$ , they also only consider the case $L=2$ . The use of Assumption 1 allowed us to work in a more general case.

We will assume that if $w_{l,i}(t)=w_{k,i}(t)\,,\,1\leq l,k\leq L$ , to find an analytic solution and then show that if $w_{l,i}(0)=w_{k,i}(0)\,,\,1\leq k,l\leq L$ then this analytic solution verifies (40) and thus, by Cauchy-Lipschitz theorem, is the unique solution of the non-linear differential equation.

Thus, considering $w_{i}(t):=w_{1,i}(t)\cdots w_{L,i}(t)$ , and assuming that $w_{l,i}(t)=w_{k,i}(t)\,,\,1\leq l,k\leq L$ , we get that, for $1\leq i\leq r,$

Case L=2𝐿2L=2:

in that case we have two situations, $\sigma_{i}>0$ and $\sigma_{i}=0\,,\;\lambda_{i}>0$ (the case $\sigma_{i}=\lambda_{i}=0$ give a constant functions).

In order to get a solution for $w_{2,i}(t)$ and $w_{1,i}(t)$ , we will use Cauchy-Lipschitz theorem [Coddington and Levinson, 1955]. The idea is that if we find a solution of (40), it is the only one. Let us set, $m_{i}(0)=n_{i}(0)=e^{-\delta_{i}}$ , then we can set,

Thus, this is the unique solution of (40).

For $\sigma_{i}=0\,,\;\lambda_{i}>0$ we have that,

Thus, if we initialize with $m_{i}(0)=n_{i}(0)=e^{-\delta_{i}}$ we get,

Non commutative case ϵ>0italic-ϵ0\epsilon>0.

Now, we will consider Assumption 1 with $\epsilon>0$ and $L=2$ .

First let us proove two lemmas usefull for later,

The matrix valued function ${\bm{W}}(t)$ converge to ${\bm{X}}^{\dagger}{\bm{Y}}$ and thus is bounded for $t>0$ .

Since (6) is a gradient dynamics, it only moves in the span of the gradient of $f$ (the explicit expressions of $\nabla f$ is derived in (7) for $L=1$ and (11) for a general $L$ ). We use this property to characterize the solution found by these dynamics. We can study each column of the predictors ${\bm{W}}:={\bm{W}}_{1}\cdots{\bm{W}}_{L}$ . If we look at the columns of $\nabla_{{\bm{W}}_{L}}f$ , they are included in ${\bm{X}}^{\top}$ , thus it means that if we initialize the columns of ${\bm{W}}_{L}^{(0)}$ in that span, then the columns of ${\bm{W}}$ will belong to that span during the whole learning process,

where ${\bm{W}}$ is ${\bm{W}}(t)$ . Thus, if the dynamics (6) converge, then they converge to a matrix with the $i^{th}$ column vector being in the intersection,

Finally, we have ${\bm{X}}{\bm{W}}(t)\to{\bm{Y}}$ by definition of the gradient dynamics,

We have that $\|{\bm{W}}_{1}(t)\|^{2}=O(t)$ and $\|{\bm{W}}_{2}(t)\|^{2}=O(t)$ .

Since ${\bm{W}}(t)$ is bounded then $\|{\bm{W}}_{1}(t)\|^{2}=O(t)$ . The same way we have $\|{\bm{W}}_{2}(t)\|^{2}=O(t)$ ∎

After the same change of basis as in the commutative case The matrices $\bar{\bm{W}}_{l}(t)$ follow the differencial equations

In order to perform pertrubation analysis we will use a consequence of Grönwall’s inequality [Gronwall, 1919].

Let $\beta$ be a non negative function and $\alpha$ a non decreasing function. Let $u$ be a function defined on an interval $I=[a,\infty)$ such that

The proof can be found for instance in [Berglund, 2001, Lemma 3.1.6] ∎

Thus, let us consider $\bar{\bm{W}}_{1}(t)$ and $\bar{\bm{W}}_{2}(t)$ the solutions of (58) as well as $\bar{\bm{W}}^{0}_{1}(t)$ and $\bar{\bm{W}}^{0}_{2}(t)$ the solutions of the very same differential equation but with ${\bm{B}}=0$ . For notational simplicity we will omit the bar on the matrices ${\bm{W}}$ in the following. We have that,

In order to upper bound the first integral we will consider the function $F({\bm{A}},{\bm{B}}):={\bm{A}}{\bm{B}}{\bm{B}}^{\top}$ . This function is Lipschitz on any compact because this function is infinitely differentiable. Thus we have that (omiting the $t$ in the notation),

Using the fact that $\frac{{\bm{W}}_{1}(t)}{\sqrt{t}},\frac{{\bm{W}}_{1}^{0}(t)}{\sqrt{t}},\frac{{\bm{W}}_{2}^{0}(t)}{\sqrt{t}}$ and $\frac{{\bm{W}}_{2}(t)}{\sqrt{t}},\,t\geq 0$ live in a compact set (Lemma 3) and that $F$ is Lipschitz on any compact. Thus, using that $\|S\|=O(\epsilon)$ , we have

And consequently we can sum these two inequalities and apply Grönwall’s inequality with the quantities $u(t)=\|{\bm{W}}_{1}(t)-{\bm{W}}^{0}_{1}(t)\|+\|{\bm{W}}_{2}(t)-{\bm{W}}^{0}_{2}(t)\|$ , $\alpha(t)=O(\epsilon)$ and $\beta(s)=O(s)$ to get,

B.3 Proof of Thm. 2

Let us denotes $w_{i}(t)$ , the values defined in (16). If $m_{i}(0)=e^{-\delta}\,,\;1\leq i\leq r,$ then we have,

Then we can conclude saying that for any $i$ and $t\geq 0$ ,

B.4 Proof of Eq. 21

B.5 Proof of Thm. 3

If we define ${\bm{W}}^{(t)}:={\bm{W}}_{1}^{(t)}{\bm{W}}_{2}^{(t)}$ , the discrete update scheme for the two-layer linear neural network (2) is,

Using the same transformation (37) as in §B.2 we get that,

let us note, $r=\min(p,d)$ and $m_{1}^{(t)},\ldots,m_{r}^{(t)}$ and $n_{1}^{(t)},\ldots,n_{r}^{(t)}$ the respective diagonal coefficients of $\bar{\bm{W}}_{1}^{(t)}$ and $\bar{\bm{W}}_{2}^{(t)}$ , they follow the equation,

In order to prove Thm. 3, we will prove several properties on the sequences $(m_{i}^{(t)})_{t\geq 0}$ and $(n_{i}^{(t)})_{t\geq 0}\,,\;1\leq i\leq d$ . First let us introduce the sequence $(a_{i}^{(t)})_{t\geq 0}$ defined as $a_{i}^{(t)}:=m_{i}^{(t)}n_{i}^{(t)}\,,t\geq 0$ .

By a straightforward recurrence we have that if at time $t$ , $m_{i}^{(t)}=n_{i}^{(t)}$ then by (70), we have $m_{i}^{(t+1)}=n_{i}^{(t+1)}$ . ∎

Thus, we will now focus on the sequence $(a_{i}^{(t)})_{t\geq 0}$ , by (70), we have that

Similarly as for the continuous case, there is two different behavior $\sigma_{i}>0$ ad $\sigma_{i}=0\,,\;\lambda_{i}>0$ . In the following we assume that $\eta>0$ .

For $\sigma_{i}>0$ we can derive the following results,

For any $1\leq i\leq r_{xy}$ , if $0<a_{i}^{(0)}<\frac{\sigma_{i}}{\lambda_{i}}$ and $2\eta\sigma_{i}<1$ , then, the sequence $(a_{i}^{(t)})$ is increasing and

Let us assume that (73) is true for a time-step $t$ and let us prove that it is still true at time-step $t+1$ .

Using the recursive definition (72) of $a_{i}^{(t)}$ we get for $t\geq 0$ ,

For the upper bound we need to notice that $a_{i}^{(t+1)}=f_{i}(a_{i}^{(t)})$ where $f_{i}:x\mapsto x+\eta x(\sigma_{i}-\lambda_{i}x)(2+\eta(\sigma_{i}-\lambda_{i}x))$ where $\eta>0$ . Since we assumed that $2\eta\sigma_{i}<1$ , we have that,

Then we just need to show that $g(x)<\tfrac{\lambda_{i}}{\sigma_{i}}\,,\;\forall x\in(0,\tfrac{\lambda_{i}}{\sigma_{i}})$ .

Thus $g$ is non-decreasing on $(0,1)$ and consequently, $g(x)<g(\tfrac{\lambda_{i}}{\sigma_{i}})=\tfrac{\lambda_{i}}{\sigma_{i}}\,,\;\forall x\in(0,1)$ .

With this lemma we can proof Thm. 3. Let us first recall this theorem.

For any $1\leq i\leq r_{xy}$ , if $\frac{\sigma_{i}}{\lambda_{i}}>a_{i}^{(0)}>0$ and $2\eta\sigma_{i}<1$ , then $\forall t\geq 0\,,\;1\leq i\leq r$ we have,

In this proof for notational compactness we will remove the index $i$ .

We first prove (81), we work with $1/a^{(t+1)}-\frac{\lambda}{\sigma}$ , Using (72) we get,

where we used that $\frac{1}{1+x}\geq 1-x\,,\;\forall x\geq 0$ . Thus we have,

To prove (81) we will once again work with $1/a^{(t+1)}-\frac{\lambda}{\sigma}$ . Using (72) we get

where we used that $\frac{1}{1+x}\leq 1-x+x^{2}\,,\;\forall x\geq 0$ . Thus we have,

Now for $\sigma=0$ and $\lambda>0$ we have that,

Thus, considering $(a_{i}^{(t)})^{-1}$ we get

Thus, if we assume that $1/a_{i}^{(0)}\geq\lambda\eta$ we have that $(1/a_{i}^{(t)})$ is a increasing sequence and that,

Case ϵ>0italic-ϵ0\epsilon>0.

If we are able to show that all the sequences ${\bm{W}}_{1}^{(t)}$ and ${\bm{W}}_{2}^{(t)}$ are bounded

From this theorem we can deduce the following corollary,

Proof of Corollary 3

If $\eta<\frac{1}{2\sigma_{1}}$ , $\eta<2\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}}$ and $\eta<\frac{\sigma_{i}-\sigma_{i+1}}{2\sigma_{i+1}^{2}}\,,\;\forall i\;1\leq i\leq r_{xy}-1$ then for $1\leq i<r_{x}$ ,

where $T_{j}:=\frac{1}{\sigma_{j}\eta}\,,\;1\leq j\leq r_{xy}$ and $T_{j}=+\infty\text{ if }j>r_{xy}$ and $T_{0}=0$ .

First let us notice that since $\sigma_{1}>\ldots>\sigma_{r_{xy}}>0$ , the assumption $\eta<1/(2\sigma_{1})$ implies $\eta<1/(2\sigma_{i})\,,\;1\leq i\leq d$ .

Let $i\leq r_{xy}$ . Let us first prove that if $j<i$ , then $a_{i}^{(T_{j})}\underset{\delta\to\infty}{\to}0$ .

Using (81) and recalling that in Thm. 3, we assume that $a_{i}^{(0)}=e^{-2\delta}$ , we have for $1\leq j<i$ ,

We have $(2+\eta\sigma_{i})\sigma_{i}/\sigma_{i-1}<2$ , because we assumed that $\eta<2\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}}\,,\;\forall i\quad 1\leq i\leq d$ . Note that for $i=r_{xy}$ we have $a_{i}^{(\delta T_{i+1})}=0\,,\;\forall\delta>0$ .

Let us now prove that if $j>i$ , then $a_{i}^{(\delta T_{j})}\underset{\delta\to\infty}{\to}\frac{\sigma_{i}}{\lambda_{i}}$ .

Using (82) and recalling that in Thm. 3, we assume that $a_{i}^{(0)}=e^{-2\delta}$ , we have for $j>i\geq 1$ ,

where we have that $(e^{2\delta}-1)e^{\delta(-2\sigma_{i}/\sigma_{j}+4\eta\sigma_{i}^{2}/\sigma_{j})}\to 0$ because

Now for $i\geq r_{xy}+1$ we just need to use, (Theorem’ 3) to get,