Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks

Gauthier Gidel, Francis Bach, Simon Lacoste-Julien

Introduction

When optimizing over-parameterized models, such as deep neural networks, a large set of parameters leads to a zero training error. However they lead to different values for the test error and thus have distinct generalization properties. More specifically, Neyshabur (2017, Part II) argues that the choice of the optimization algorithm (and its respective hyperparameters) provides an implicit regularization with respect to its geometry: it biases the training, finding a particular minimizer of the objective.

In this work, we use the same setting as Saxe et al. (2018): a regression problem with least-squares loss on a multi-dimensional output. Our prediction is made either by a linear model or by a two-layer linear neural network (Saxe et al., 2018). We extend their work which covered the continuous gradient dynamics, to weaker assumptions as well as analyze the behavior of the discrete gradient updates

We show that with a vanishing initialization and a small enough step-size, the gradient dynamics of a two-layer linear neural network sequentially learns components that can be ranked according to a hierarchical structure whereas the gradient dynamics induced by the same regression problem but with a linear prediction model instead learns these components simultaneously, missing this notion of hierarchy between components. The path followed by the two-layer formulation actually corresponds to successively solving the initial regression problem with a growing low rank constraint which is also know as reduced-rank regression (Izenman, 1975). Note that this notion of path followed by the dynamics of a whole network is different from the notion of path introduced by Neyshabur et al. (2015a) which corresponds to a path followed inside a fixed network, i.e., one corresponds to training dynamics whereas the other corresponds to the propagation of information inside a network.

To sum-up, in our framework, the path followed by the gradient dynamics of a two-layer linear network provides an implicit regularization that may lead to potentially better generalization properties. Our contributions are the following:

Under some assumptions (see Assumption 1), we prove that both the discrete and continuous gradient dynamics sequentially learn the solutions of a gradually less regularized version of reduced-rank regression (Corollary 2 and 3). Among the close related work, such result on implicit regularization regarding discrete dynamics is novel. For the continuous case, we weaken the standard commutativity assumption using perturbation analysis.

We experimentally verify the reasonableness of our assumption and observe improvements in terms of generalization (matrix reconstruction in our case) using the gradient dynamics of the two-layer linear network when compared against the linear model.

The implicit regularization provided by the choice of the optimization algorithm has recently become an active area of research in machine learning, putting lot of interest on the behavior of gradient descent on deep over-parametrized models (Neyshabur et al., 2015b, 2017; Zhang et al., 2017).

Several works show that gradient descent on unregularized problems actually finds a minimum norm solution with respect to a particular norm that drastically depends on the problem of interest. Soudry et al. (2018) look at a logistic regression problem and show that the predictor does converge to the max-margin solution. A similar idea has been developed in the context of matrix factorization (Gunasekar et al., 2017). Under the assumption that the observation matrices commute, they prove that gradient descent on this non-convex problem finds the minimum nuclear norm solution of the reconstruction problem, they also conjecture that this result would still hold without the commutativity assumption. This conjecture has been later partially solved by Li et al. (2018) under mild assumptions (namely the restricted isometry property). This work has some similarities with ours, since both focus on a least-squares regression problem over matrices with a form of matrix factorization that induces a non convex landscape. Their problem is more general than ours (see Uschmajew and Vandereycken (2018) for an even more general setting) but they are showing a result of a different kind from ours: they focus on the properties of the limit solution the continuous dynamics whereas we show some properties on the whole dynamics (continuous and discrete), proving that it actually visits points during the optimization that may provide good generalization. Interestingly, both results actually share common assumptions such as a commutativity assumption (which is less restrictive in our case since it is always true in some realistic settings such as linear autoencoders), vanishing initialization and a small enough step size.

Nar and Sastry (2018) also analyzed the gradient descent algorithm on a least-squares linear network model as a discrete time dynamical system, and derived certain necessary (but not sufficient) properties of the local optima that the algorithm can converge to with a non-vanishing step size. In this work, instead of looking at the properties of the limit solutions, we focus on the path followed by the gradient dynamics and precisely caracterize the weights learned along this path.

Combes et al. (2018) studied the continuous dynamics of some non-linear networks under relatively strong assumptions such as the linear separability of the data. Conversely, in this work, we do not make such separability assumption on the data but focus on linear networks.

Finally, Gunasekar et al. (2018) compared the implicit regularization provided by gradient descent in deep linear convolutional and fully connected networks. They show that the solution found by gradient descent is the minimum norm for both networks but according to a different norm. In this work, the fact that gradient descent finds the minimum norm solution is almost straightforward using standard results on least-squares. But the path followed by the gradient dynamics reveals interesting properties for generalization. As developed earlier, instead of focusing on the properties of the solution found by gradient descent, our goal is to study the path followed by the discrete gradient dynamics in the case of a two-layer linear network.

Prior work (Saxe et al., 2013, 2014; Advani and Saxe, 2017; Saxe et al., 2018; Lampinen and Ganguli, 2019) studied the gradient dynamics of two-layer linear networks and proved a result similar to our Thm. 2. We consider Saxe et al. (2018) as the closest related work, we re-use their notion of simple deep linear neural network, that we call two-layer neural networks, and use some elements of their proofs to extend their results. However, note that their work comes from a different perspective: through a mathematical analysis of a simple non-linear dynamics, they intend to highlight continuous dynamics of learning where one observes the sequential emergence of hierarchically structured notions to explain the regularities in representation of human semantic knowledge. In this work, we are also considering a two-layer neural network but with an optimization perspective. We are able to extend Saxe et al. (2018, Eq. 6 and 7) weakening the commutativity assumption considered in Saxe et al. (2018) using perturbation analysis. In §4.1, we test to what extent our weaker assumption holds. Our main contribution is to show a similar result on the discrete gradient dynamics, that is important in our perspective since we aim to study the dynamics of gradient descent. This result cannot be trivially extended from the result on the continuous dynamics. We provide details on the difficulties of the proof in §3.2.

A Simple Deep Linear Model

where W1,,WL{\bm{W}}_{1},\ldots,{\bm{W}}_{L} are learned through a MSE formulation with the least-squares loss ff,

are the design matrices of (xi)1in({\bm{x}}_{i})_{1\leq i\leq n} and (yi)1in({\bm{y}}_{i})_{1\leq i\leq n}. The deep linear model (1) is a LL-layer deep linear neural network where we see hl:=WlW1x{\bm{h}}_{l}:={\bm{W}}_{l}\cdots{\bm{W}}_{1}{\bm{x}} for 1lL11\leq l\leq L-1 as the lthl^{th} hidden layer. At first, since this deep linear network cannot represent more than a linear transformation, we could think that there is no reason to use a deeper representation L=1L=1. However, in terms of learning flow, we will see in §3 that for L=2L=2 this model has a completely different dynamics from L=1L=1.

Increasing LL may induce a low rank constraint when r:=min{rl:1lL1}<min(d,p)r:=\min\{r_{l}\,:\,1\leq l\leq L-1\}<\min(d,p). In that case, (2) is equivalent to a reduced-rank regression,

These problems have explicit solutions depending on X{\bm{X}} and Y{\bm{Y}} (Reinsel and Velu, 1998, Thm. 2.2).

Note that, in this work we are interested in the implicit regularization provided in the context of over-parametrized models, i.e., when r>min(p,d)r>\min(p,d). In that case,

Gradient Dynamics as a Regularizer

In this section we would like to study the discrete dynamics of the gradient flow of (2), i.e.,

where we use the notation W[L](t):=(W1(t),,WL(t)){\bm{W}}_{[L]}^{(t)}:=({\bm{W}}_{1}^{(t)},\ldots,{\bm{W}}_{L}^{(t)}). The quantity η\eta is usually called the step-size. In order to get intuitions on the discrete dynamics we also consider its respective continuous version,

where for 1lL1\leq l\leq L, W˙l(t)\dot{\bm{W}}_{l}(t) is the temporal derivative of Wl(t){\bm{W}}_{l}(t). Note that there is no step-size in the continuous time dynamics since it actually corresponds to the limit of (5) when η0\eta\to 0. The continuous dynamics may be more convenient to study because such differential equations may have closed form solutions. In §3.1, we will see that under reasonable assumptions it is the case for (6).

We start with the study of the continuous linear model, its gradient is,

where Σxy:=1nXY{\bm{\Sigma}}_{xy}:=\frac{1}{n}{\bm{X}}^{\top}{\bm{Y}} and Σx:=1nXX{\bm{\Sigma}}_{x}:=\frac{1}{n}{\bm{X}}^{\top}{\bm{X}}. Thus, W(t){\bm{W}}(t) is the solution of the differential equation,

where Σx{\bm{\Sigma}}_{x}^{\dagger} is the pseudoinverse of Σx{\bm{\Sigma}}_{x}.

This standard result on ODE is provided in §B.1. Note that when W00{\bm{W}}_{0}\to\bm{0} we have

Deep linear network: L≥2𝐿2L\geq 2.

The study of the deep linear model is more challenging since for L2L\geq 2, the landscape of the objective function ff is non-convex. The gradient flow of (2) is

where we used the convention that W1,0=Id{\bm{W}}_{1,0}={\bm{I}}_{d} and WL+1,L=Ip{\bm{W}}_{L+1,L}={\bm{I}}_{p}. Thus (6) becomes

We obtain a coupled differential equation (12) that is harder to solve than the previous linear differential equation (8) due, at the same time, to its non-linear components and to the coupling between Wl,  1lL{\bm{W}}_{l}\,,\;1\leq l\leq L. However, in the case L=2L=2, Saxe et al. (2018) managed to find an explicit solution to this coupled differential equation under the assumption that “perceptual correlation is minimal” (Σx=Id{\bm{\Sigma}}_{x}={\bm{I}}_{d}).By a rescaling of the data, their proof is valid for any matrix Σx{\bm{\Sigma}}_{x} proportional to the identity matrix. In this work we extend Saxe et al. (2018, Eq. 7) (for L=2L=2) under weaker assumptions. More precisely, we do not require the covariance matrix Σx{\bm{\Sigma}}_{x} to be the identity matrix. Let (U,V,D)({\bm{U}},{\bm{V}},{\bm{D}}) be the SVD of Σxy{\bm{\Sigma}}_{xy}, our assumption is the following:

There exist two orthogonal matrices U{\bm{U}}, V{\bm{V}} such that we have the joint decomposition,

where B{\bm{B}} is such that B2ϵ\|{\bm{B}}\|_{2}\leq\epsilon and Dx,Dxy{\bm{D}}_{x},\,{\bm{D}}_{xy} are matrices only with diagonal coefficients. We note σ1σrxy>0\sigma_{1}\geq\dots\geq\sigma_{r_{xy}}>0 the singular values of Σxy{\bm{\Sigma}}_{xy} and λ1,,λrx\lambda_{1},\ldots,\lambda_{r_{x}} the diagonal entries of Dx{\bm{D}}_{x}.

Since two matrices commute if and only if they are co-diagonalizable (Horn et al., 1985, Thm. ​1.3.21), the quantity ϵ\epsilon represent to what extend Σx{\bm{\Sigma}}_{x} and ΣxyΣxy{\bm{\Sigma}}_{xy}{\bm{\Sigma}}_{xy}^{\top} do not commute. Before solving (12) under Assump. 1, we describe some motivating examples where the quantity ϵ\epsilon is small or zero:

Linear autoencoder: If Y{\bm{Y}} is set to X{\bm{X}} and L=2L=2, we recover a linear autoencoder: x^(x)=W2W1x\hat{\bm{x}}({\bm{x}})={\bm{W}}_{2}^{\top}{\bm{W}}_{1}^{\top}{\bm{x}}, where h:=W1x{\bm{h}}:={\bm{W}}_{1}^{\top}{\bm{x}} is the encoded representation of x{\bm{x}},

Note that this linear autoencoder can also be interpreted as a form of principal component analysis. Actually, if we initialize with W1=W2{\bm{W}}_{1}={\bm{W}}_{2}^{\top}, the gradient dynamics exactly recovers the PCA of X{\bm{X}}, which is closely related to the matrix factorization problem of Gunasekar et al. (2017). See §A where this derivation is detailed.

Deep linear multimulti-class prediction: In that case, pp is the number of classes and yi{\bm{y}}_{i} is a one-hot encoding of the class with, in practice, pdp\ll d. The intuition on why we may expect B2\|{\bm{B}}\|_{2} to be small is because rank(Y)rank(X)\operatorname{rank}({\bm{Y}})\ll\operatorname{rank}({\bm{X}}) and thus the matrices of interest only have to almost commute on a small space in comparison to the whole space, thus B{\bm{B}} would be close to . We verify this intuition by computing B2\|{\bm{B}}\|_{2} for several classification datasets in Table 1.

Minimal influence of perceptual correlation: ΣxId{\bm{\Sigma}}_{x}\approx{\bm{I}}_{d}. It is the setting discussed by Saxe et al. (2018). We compare this assumption for some classification datasets with our Assump. 1 in §4.1.

An explicit solution for L=2𝐿2L=2.

Under Assump. 1 and specifying the initialization, one can solve the matrix differential equation for ϵ=0\epsilon=0 and then use perturbation analysis to assess how close the solution of (8) is to the closed form solution derived for ϵ=0\epsilon=0. This result is summarized in the following theorem proved in §B.2.

When L=2L=2, under Assump. 1, if we initialize with W1(0)=Udiag(eδ1,,eδp)Q{\bm{W}}_{1}(0)={\bm{U}}\operatorname{diag}(e^{-\delta_{1}},\ldots,e^{-\delta_{p}}){\bm{Q}} and W2(0)=Q1diag(eδ1,,eδd)V{\bm{W}}_{2}(0)={\bm{Q}}^{-1}\operatorname{diag}(e^{-\delta_{1}},\ldots,e^{-\delta_{d}}){\bm{V}}^{\top} where Q{\bm{Q}} is an arbitrary invertible matrix, then the solution of (12) can be decomposed as the sum of the solution for ϵ=0\epsilon=0 and a perturbation term,

where we have c>0c>0 such that Wiϵ(t)ϵect2\|{\bm{W}}_{i}^{\epsilon}(t)\|\leq\epsilon\cdot e^{ct^{2}} and,

where (σi)(\sigma_{i}) and (λi)(\lambda_{i}) are defined is Assump. 1. Note that rank(Σxy):=rxyrank(Σx):=rx\operatorname{rank}({\bm{\Sigma}}_{xy}):=r_{xy}\leq\operatorname{rank}({\bm{\Sigma}}_{x}):=r_{x}.

The main difficulty in this result is the perturbation analysis for which we use a consequence of Grönwall’s inequality (Gronwall, 1919) (Lemma 4). The proof can be sketched in three parts: first showing the result for ϵ=0\epsilon=0, then showing that in the case ϵ>0\epsilon>0, the matrices W1(t)/t{\bm{W}}_{1}(t)/t and W2(t)/t{\bm{W}}_{2}(t)/t are bounded and finally use Lemma 4 to get the perturbation bound.

This result is more general than the one provided by Saxe et al. (2018) because it requires a weaker assumption than Σx=Id{\bm{\Sigma}}_{x}={\bm{I}}_{d} and ϵ=0\epsilon=0. In doing so, we obtain a result that takes into account the influence of correlations of the input samples. Note that Thm. 1 is only valid if the initialization W1(0)W2(0){\bm{W}}_{1}(0){\bm{W}}_{2}(0) has the same singular vectors as Σxy{\bm{\Sigma}}_{xy}. However, making such assumptions on the initialization is standard in the literature and, in practice, we can set the initialization of the optimization algorithm in order to also ensure that property. For instance, in the case of the linear autoencoder, one can set W1(0)=W2(0)=eδId{\bm{W}}_{1}(0)={\bm{W}}_{2}(0)=e^{-\delta}{\bm{I}}_{d}.

In the following subsection we will use Thm. 1 to show that the components [U]i,  1irxy[{\bm{U}}]_{i}\,,\;1\leq i\leq r_{xy} in the order defined by the decreasing singular values of Σxy{\bm{\Sigma}}_{xy} are learned sequentially by the gradient dynamics.

Sequential learning of components.

The sequential learning of the left singular vectors of Σxy{\bm{\Sigma}}_{xy} (sorted by the magnitude of its singular values) by the continuous gradient dynamics of deep linear networks has been highlighted by Saxe et al. (2018). They note in their Eq. (10) that the ithi^{th} phase transition happens approximately after a time TiT_{i} defined as (using our notation),

They argue that as δi\delta_{i}\to\infty, the time TiT_{i} is roughly O(1/σi)O(1/\sigma_{i}). The intuition is that a vanishing initialization increases the gap between the phase transition times TiT_{i} and thus tends to separate the learning of each components. However, a vanishing initialization just formally leads to TiT_{i}\to\infty.

In this work, we introduce a notion of time rescaling in order to formalize this notion of phase transition and we show that, after this time rescaling, the point visited between two phase transitions is the solution of a low rank regularized version (4) of the initial problem (2) with the low rank constraint that loosens sequentially.

The intuition behind time rescaling is that it counterbalances the vanishing initialization in (17): Since TiT_{i} grows as fast as δi\delta_{i} we need to multiply the time by δi\delta_{i}, in order to grow at the same pace as TiT_{i}.

Using this rescaling we can present our theorem, proved in §B.3, which says that a vanishing initialization tends to force the sequential learning of the component of X{\bm{X}} associated with the largest singular value of Σxy{\bm{\Sigma}}_{xy}. Note that we need to rescale the time uniformly for each component. That is why in the following we set δi=δ,1imax(p,d)\delta_{i}=\delta\,,\,1\leq i\leq\max(p,d).

Let us denote wi(t)w_{i}(t), the values defined in (16). If wi(0)=eδ,  1ir,w_{i}(0)=e^{-\delta}\,,\;1\leq i\leq r, and ϵ=eδ2ln(δ)\epsilon=e^{-\delta^{2}\ln(\delta)} then we have that wi(δt)w_{i}(\delta t) converge to a step function as δ\delta\to\infty:

where Ti:=1/σiT_{i}:=1/\sigma_{i}, \mathds1{tA}=1\mathds{1}\{t\in A\}=1 if tA\,t\in A and otherwise.

Notice how the ithi^{th} components of W1{\bm{W}}_{1} and W2{\bm{W}}_{2} are inactive, i.e., wi(t)w_{i}(t) is zero, for small tt and is suddenly learned when tt reaches the phase transition time Ti:=1/σiT_{i}:=1/\sigma_{i}. As shown in Prop. 1 and illustrated in Fig. 1, this sequential learning behavior does not occur for the non-factorized formulation. Gunasekar et al. (2017) observed similar differences between their factorized and not factorized formulations of matrix regression. Note that, the time rescaling we introduced is tδtt\rightarrow\delta t, in order to compensate the vanishing initialization, rescaling the time and taking the limit this way for (8) would lead to a constant function.

Gunasekar et al. (2017) also had to consider a vanishing initialization in order to show that on a simple matrix factorization problem the continuous dynamics of gradient descent does converge to the minimum nuclear norm solution. This assumption is necessary in such proofs in order to avoid to initialize with wrong components. However one cannot consider an initialization with the null matrix since it is a stationary point of the dynamics, that is why this notion of double limit (vanishing initialization and tt\to\infty) is used.

From Thm. 2, two corollaries follow directly. The first one regards the nuclear norm of the product W1(δt)W2(δt){\bm{W}}_{1}(\delta t){\bm{W}}_{2}(\delta t). This corollary says that W1(δt)W2(δt)\|{\bm{W}}_{1}(\delta t){\bm{W}}_{2}(\delta t)\|_{*} is a step function and that each increment of this integer value corresponds to the learning of a new component of X{\bm{X}}. These components are leaned by order of relevance, i.e., by order of magnitude of their respective eigenvalues and the learning of a new component can be easily noticed by an incremental gap in the nuclear norm of the matrix product W1(δt)W2(δt){\bm{W}}_{1}(\delta t){\bm{W}}_{2}(\delta t),

Let W1(t){\bm{W}}_{1}(t) and W2(t){\bm{W}}_{2}(t) be the matrices solution of (12) defined in (15). The limit of the squared euclidean norm of W1(t)W2(t){\bm{W}}_{1}(t){\bm{W}}_{2}(t) when δ\delta\to\infty is a step function defined as,

where Ti:=1/σiT_{i}:=1/\sigma_{i} and σ1>>σrxy>0\sigma_{1}>\cdots>\sigma_{r_{xy}}>0 are the positive singular values of Σxy{\bm{\Sigma}}_{xy}.

From Thm. 2, we can notice that, between time TkT_{k} and Tk+1T_{k+1}, the rank of the limit matrix W1W2{\bm{W}}_{1}{\bm{W}}_{2} is actually equal to kk, meaning that at each phase transition, the rank of W1W2{\bm{W}}_{1}{\bm{W}}_{2} is increased by 1. Moreover, this matrix product contains the kk components of X{\bm{X}} corresponding to the kk largest singular values of Σxy{\bm{\Sigma}}_{xy}. Thus, we can show that this matrix product is the solution of the kk-low rank constrained version (4) of the initial problem (2),

Let W1(t){\bm{W}}_{1}(t) and W2(t){\bm{W}}_{2}(t) be the matrices solution of (12) defined in (15). We have that,

2 Discrete dynamics

We are interested in the behavior of optimization methods. Thus, the gradient dynamics of interest is the discrete one (5). A major contribution of our work is thus contained in this section. The continuous case studied in § 3.1 provided good intuitions and insights on the behavior of the potential discrete dynamics that we can use for our analysis.

Previous related work (Gunasekar et al., 2017; Saxe et al., 2018) only provide results on the continuous dynamics. Their proofs use the fact that their respective continuous dynamics of interest have a closed form solution (e.g., Thm.1). To our knowledge, no closed form solution is known for the discrete dynamics (5). Thus its analysis requires a new proof technique. Moreover, using Euler’s integration methods, one can show that both dynamics are close but only for a vanishing step size depending on a finite horizon. Such dependence on the horizon is problematic since the time rescaling used in Thm. 2 would make any finite horizon go to infinity. In this section, we consider realistic conditions on the step-size (namely, it has to be smaller than the Lipschitz constant and some notion of eigen-gap) without any dependence on the horizon. Such assumption is relevant since we want to study the dynamics of practical optimization algorithms (i.e., with a step size as large as possible and without knowing in advance the horizon).

Single layer linear model.

In this paragraph, we consider the discrete update for the linear model. Since L=1L=1, for notational compactness, we call Wt{\bm{W}}_{t} the matrix updated according to (5). Using the gradient derivation (7), the discrete update scheme for the linear model is,

Noticing that for 1/λmax(Σx)>η>0,  IdηΣx1/\lambda_{\max}({\bm{\Sigma}}_{x})>\eta>0\,,\;{\bm{I}}_{d}-\eta{\bm{\Sigma}}_{x} is invertible, this recursion (see §B.4) leads to,

We obtain a similar result as the solution of the differential equation given in Prop. 1. With a vanishing initialization we reach a function that does not sequentially learn some components.

Two-layer linear model.

The discrete update scheme for the two-layer linear network (2) is,

When ϵ=0\epsilon=0, by a change of basis and a proper initialization, we can reduce the study of this matrix equation to rr independant dynamics (see §B.5 for more details), for 1ir1\leq i\leq r,

Thus we can derive a bound on the iterate wi(t)w_{i}^{(t)} leading to the following theorem,

Under the same assumptions as Thm. 1 and ϵ=0\epsilon=0, we have

Moreover, for any 1irxy1\leq i\leq r_{xy}, if 1>wi(0)>01>w_{i}^{(0)}>0 and 2ησi<12\eta\sigma_{i}<1, then t0,  1irx\forall t\geq 0\,,\;1\leq i\leq r_{x} we have,

and wi(t)wi(0)1+wi(0)λiηtw_{i}^{(t)}\leq\tfrac{w_{i}^{(0)}}{1+w_{i}^{(0)}\lambda_{i}\eta t} for rxyirxr_{xy}\leq i\leq r_{x}. The differences with the continuous case (16) are in red.

The solution of the continuous dynamics lets us think directly studying the sequence wi(t)w_{i}^{(t)} might be quite challenging since the solution of the continuous dynamics wi(t)1w_{i}(t)^{-1} has a non-linear and non-convex behavior.

The main insight from this proof is that one can treat the discrete case using the right transformation, to show that some sequence doee converge linearly.

Thm. 2 indicates the quantity wi(t)1λiσiw_{i}(t)^{-1}-\tfrac{\lambda_{i}}{\sigma_{i}} is the good candidate to show linear convergence to 0,

What we can expect is thus to show that the sequence (wi(t))1σiλi(w_{i}^{(t)})^{-1}-\tfrac{\sigma_{i}}{\lambda_{i}} has similar properties. The first step of the proof is to show that (wi(t))(w_{i}^{(t)}) is an increasing sequence smaller than one. The second step is then to use (22) to get,

Then using that 1x11+x1x+x21-x\leq\frac{1}{1+x}\leq 1-x+x^{2} for any 1x01\leq x\leq 0 we can derive the upper and lower bounds on the linear convergence rate. See §B.5 for full proof. ∎

In order to get a similar interpretation of Thm. 3 in terms of implicit regularization, we use the intuitions from Thm. 2. The analogy between continuous and discrete time is that the discrete time dynamics is doing tt time-steps of size η\eta, meaning that we have W(ηt)Wt{\bm{W}}(\eta t)\approx{\bm{W}}_{t}, the time rescaling in continuous time consists in multiplying the time by δ\delta thus we get the analog phase transition time,

Recall that we assumed that mi(0)=ni(0)=eδm_{i}^{(0)}=n_{i}^{(0)}=e^{-\delta}. Thus, we show that the ithi^{th} component is learned around time TiT_{i}, and consequently that the components are learned sequentially,

If η<12σ1\eta<\frac{1}{2\sigma_{1}}, η<2σiσi+1σi2\eta<2\tfrac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}} and η<σiσi+12σi+12,\eta<\tfrac{\sigma_{i}-\sigma_{i+1}}{2\sigma_{i+1}^{2}}, for   1irxy1\;1\leq i\leq r_{xy}-1, then for 1i<rx1\leq i<r_{x},

where T0:=0,Tj:=1σjη,  1jrxyT_{0}:=0,\,T_{j}:=\frac{1}{\sigma_{j}\eta}\,,\;1\leq j\leq r_{xy} and Tj:=+ifj>rxyT_{j}:=+\infty\>\text{if}\>j>r_{xy}.

This result is proved in §B.5. The quantities σiσi+1σi2\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}} and σiσi+1σi+12\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i+1}^{2}} can be interpreted as relative eigen-gaps. Note that they are well defined since we assumed that the eigenspaces were unidimensional. The intuition behind this condition is that the step-size cannot be larger than the eigen-gaps because otherwise the discrete optimization algorithm would not be able to distinguish some components.

Experiments

In this section we verify to what extent Assump. 1 is true on standard classification datasets. For this, we compute the normalized quantities Δxy\Delta_{xy} and Δx\Delta_{x} representing how much Assump. 1 and the assumption that ΣxId{\bm{\Sigma}}_{x}\approx{\bm{I}}_{d} are respectively broken. We compute B{\bm{B}} by computing U{\bm{U}}, the left singular vector of Σxy{\bm{\Sigma}}_{xy} and looking at the non-diagonal coefficients of UΣxU{\bm{U}}^{\top}{\bm{\Sigma}}_{x}{\bm{U}},

where \|\cdot\| is the Frobenius norm, the Σ^\hat{\bm{\Sigma}} expressions correspond to X^:=X/X\hat{\bm{X}}:={\bm{X}}/\|{\bm{X}}\| and I^d:=Id/Id\hat{\bm{I}}_{d}:={\bm{I}}_{d}/\|{\bm{I}}_{d}\|. These normalized quantities are between and 11. The closer to 11, the less the assumption hold and conversely, the closer to , the more the assumption approximately holds. We present the results on three standard classification datasets, MNIST (LeCun et al., 2010), CIFAR10 (Krizhevsky et al., 2014) and ImageNet (Deng et al., 2009), a down-sampled version of ImageNet with images of size 64×6464\times 64. In Table 1, we can see that the quantities Δx\Delta_{x} and Δxy\Delta_{xy} do not vary much among the datasets and that the Δ\Delta associated with our our Assump. 1 is two orders of magnitude smaller than the Δ\Delta associated with Saxe et al. (2018)’s assumption indicating the relevance of our assumption.

2 Linear Autoencoder

We can see that the experimental results are very close to the theoretical behavior predicted with the continuous dynamics in Figure 1. Contrary to the dynamics induced by the linear model formulation (L=1L=1), the dynamics induced by the two-layer linear network (L=2L=2) is very close to a step function: each step corresponds to the learning to a new component: They are learned sequentially.

Discussion

There is a growing body of empirical and theoretical evidence that the implicit regularization induced by gradient methods is key in the training of deep neural networks. Yet, as noted by Zhang et al. (2017), even for linear models, our understanding of the origin of generalization is limited. In this work, we focus on a simple non-convex objective that is parametrized by a two-layer linear network. In the case of linear regression we show that the discrete gradient dynamics also visits points that are implicitly regularized solutions of the initial optimization problem. In that sense, in the context of machine learning, applying gradient descent on the overparametrized model of interest, provides a form of implicit regularization: it sequentially learns the hierarchical components of our problem which could help for generalization. Our setting does not pretend to solve generalization in deep neural networks; many majors components of the standard neural network training are omitted such as the non-linearities, large values of LL and the stochasticity in the learning procedure (SGD). Nevertheless, it provides useful insights about the source of generalization in deep learning.

This research was partially supported by the Canada CIFAR AI Chair Program, the Canada Excellence Research Chair in “Data Science for Realtime Decision-making”, by the NSERC Discovery Grant RGPIN-2017-06936, by a graduate Borealis AI fellowship and by a Google Focused Research award.

References

Appendix A Deep Linear Autoencoder Recovers PCA.

Let us recall that the two-layer linear autoencoder can be formulated as,

Thus, the gradients of the objective are,

Thus, if W2(0)=(W1(0)){\bm{W}}_{2}^{(0)}=({\bm{W}}_{1}^{(0)})^{\top} and if [W1(0)W2(0),Σx]=0[{\bm{W}}_{1}^{(0)}{\bm{W}}_{2}^{(0)},{\bm{\Sigma}}_{x}]=0, we have that,

Thus, for the discrete case, by a recurrence we have that, W1(t)=(W2(t)),  t0{\bm{W}}_{1}^{(t)}=({\bm{W}}_{2}^{(t)})^{\top}\,,\;t\geq 0 and for the continuous case, invoking the Cauchy-Lipschitz theorem, we have that W1(t)=W2(t),  t0{\bm{W}}_{1}(t)={\bm{W}}_{2}(t)^{\top}\,,\;t\geq 0. Consequently, the limit solution is such that

Appendix B Proof of Theorems and Propositions

where Σx{\bm{\Sigma}}_{x}^{\dagger} is the pseudoinverse of Σx{\bm{\Sigma}}_{x}.

We can differentiate (34) and check if it verifies (8). In order to do that, we just need to notice that ΣxΣxΣxy=Σxy{\bm{\Sigma}}_{x}{\bm{\Sigma}}_{x}^{\dagger}{\bm{\Sigma}}_{xy}={\bm{\Sigma}}_{xy}. To see that we compute the SVD of X=UDV{\bm{X}}^{\top}={\bm{U}}^{\top}{\bm{D}}{\bm{V}} where D{\bm{D}} is a rectangular matrix with only diagonal coefficients such that,

Thus, we have Σx=Udiag(λ1,,λr,0,,0)U{\bm{\Sigma}}_{x}={\bm{U}}^{\top}\operatorname{diag}(\lambda_{1},\ldots,\lambda_{r},0,\ldots,0){\bm{U}} and Σx=Udiag(1/λ1,,1/λr,0,,0)U{\bm{\Sigma}}_{x}^{\dagger}={\bm{U}}^{\top}\operatorname{diag}(1/\lambda_{1},\ldots,1/\lambda_{r},0,\ldots,0){\bm{U}}. Leading to,

Consequently, the matrix valued function W(t){\bm{W}}(t) defined in (34) verifies (8). Now we just need to use Cauchy-Lipschitz theorem [Coddington and Levinson, 1955] (a.k.a. Picard–Lindelöf theorem) to say that this solution is the unique solution of the ODE (8).

B.2 proof of Thm. 1

We use ideas from [Saxe et al., 2018] and combine it with Assum. 1 for ϵ=0\epsilon=0. Note that ϵ=0\epsilon=0 if and only if Σx{\bm{\Sigma}}_{x} and Σxy{\bm{\Sigma}}_{xy} commute. thus, we have that,

Let us consider a generalization of the linear transformation proposed by Saxe et al. [2018, Eq. S6,S7],

where Ql,  1lL1{\bm{Q}}_{l}\,,\;1\leq l\leq L-1 are arbitrary invertible matrices. Then, noting Q0:=U{\bm{Q}}_{0}:={\bm{U}} and QL:=V{\bm{Q}}_{L}:={\bm{V}}, we get the following dynamics,

Thus using (36), the fact that UU=Id{\bm{U}}^{\top}{\bm{U}}={\bm{I}}_{d} and that for any invertible matrix Q{\bm{Q}}, we have (Q1)=(Q)1({\bm{Q}}^{-1})^{\top}=({\bm{Q}}^{\top})^{-1}, we get that,

Using the same argument as [Saxe et al., 2018], if Wˉl(t),  1lL\bar{\bm{W}}_{l}(t)\,,\;1\leq l\leq L only have diagonal coefficients then their derivative also only have diagonal coefficients. Thus, if we initialize Wl(0),  1lL{\bm{W}}_{l}^{(0)}\,,\;1\leq l\leq L, only with diagonal coefficients we have a decoupled solution for each diagonal coefficient. This argument can be formalized using Cauchy-Lipschitz theorem: (39) has a unique solution which is the one we will exhibit in the following.

where the notation wl,i(t)w_{-l,i}(t) stands for the product of the wk,i(t),1kLw_{k,i}(t)\,,\,1\leq k\leq L omitting wl,i(t)w_{l,i}(t), i.e.,

and wi(t)w_{i}(t) stands for the product of the wk,i(t),1kLw_{k,i}(t)\,,\,1\leq k\leq L. The difference with [Saxe et al., 2018] is that, since they only consider the case Σx=Id{\bm{\Sigma}}_{x}={\bm{I}}_{d} they have λi=1\lambda_{i}=1, they also only consider the case L=2L=2. The use of Assumption 1 allowed us to work in a more general case.

We will assume that if wl,i(t)=wk,i(t),1l,kLw_{l,i}(t)=w_{k,i}(t)\,,\,1\leq l,k\leq L, to find an analytic solution and then show that if wl,i(0)=wk,i(0),1k,lLw_{l,i}(0)=w_{k,i}(0)\,,\,1\leq k,l\leq L then this analytic solution verifies (40) and thus, by Cauchy-Lipschitz theorem, is the unique solution of the non-linear differential equation.

Thus, considering wi(t):=w1,i(t)wL,i(t)w_{i}(t):=w_{1,i}(t)\cdots w_{L,i}(t), and assuming that wl,i(t)=wk,i(t),1l,kLw_{l,i}(t)=w_{k,i}(t)\,,\,1\leq l,k\leq L, we get that, for 1ir,1\leq i\leq r,

Case L=2𝐿2L=2:

in that case we have two situations, σi>0\sigma_{i}>0 and σi=0,  λi>0\sigma_{i}=0\,,\;\lambda_{i}>0 (the case σi=λi=0\sigma_{i}=\lambda_{i}=0 give a constant functions).

In order to get a solution for w2,i(t)w_{2,i}(t) and w1,i(t)w_{1,i}(t), we will use Cauchy-Lipschitz theorem [Coddington and Levinson, 1955]. The idea is that if we find a solution of (40), it is the only one. Let us set, mi(0)=ni(0)=eδim_{i}(0)=n_{i}(0)=e^{-\delta_{i}}, then we can set,

Thus, this is the unique solution of (40).

For σi=0,  λi>0\sigma_{i}=0\,,\;\lambda_{i}>0 we have that,

Thus, if we initialize with mi(0)=ni(0)=eδim_{i}(0)=n_{i}(0)=e^{-\delta_{i}} we get,

Non commutative case ϵ>0italic-ϵ0\epsilon>0.

Now, we will consider Assumption 1 with ϵ>0\epsilon>0 and L=2L=2.

First let us proove two lemmas usefull for later,

The matrix valued function W(t){\bm{W}}(t) converge to XY{\bm{X}}^{\dagger}{\bm{Y}} and thus is bounded for t>0t>0.

Since (6) is a gradient dynamics, it only moves in the span of the gradient of ff (the explicit expressions of f\nabla f is derived in (7) for L=1L=1 and (11) for a general LL). We use this property to characterize the solution found by these dynamics. We can study each column of the predictors W:=W1WL{\bm{W}}:={\bm{W}}_{1}\cdots{\bm{W}}_{L}. If we look at the columns of WLf\nabla_{{\bm{W}}_{L}}f, they are included in X{\bm{X}}^{\top}, thus it means that if we initialize the columns of WL(0){\bm{W}}_{L}^{(0)} in that span, then the columns of W{\bm{W}} will belong to that span during the whole learning process,

where W{\bm{W}} is W(t){\bm{W}}(t). Thus, if the dynamics (6) converge, then they converge to a matrix with the ithi^{th} column vector being in the intersection,

Finally, we have XW(t)Y{\bm{X}}{\bm{W}}(t)\to{\bm{Y}} by definition of the gradient dynamics,

We have that W1(t)2=O(t)\|{\bm{W}}_{1}(t)\|^{2}=O(t) and W2(t)2=O(t)\|{\bm{W}}_{2}(t)\|^{2}=O(t).

Since W(t){\bm{W}}(t) is bounded then W1(t)2=O(t)\|{\bm{W}}_{1}(t)\|^{2}=O(t). The same way we have W2(t)2=O(t)\|{\bm{W}}_{2}(t)\|^{2}=O(t)

After the same change of basis as in the commutative case The matrices Wˉl(t)\bar{\bm{W}}_{l}(t) follow the differencial equations

In order to perform pertrubation analysis we will use a consequence of Grönwall’s inequality [Gronwall, 1919].

Let β\beta be a non negative function and α\alpha a non decreasing function. Let uu be a function defined on an interval I=[a,)I=[a,\infty) such that

The proof can be found for instance in [Berglund, 2001, Lemma 3.1.6] ∎

Thus, let us consider Wˉ1(t)\bar{\bm{W}}_{1}(t) and Wˉ2(t)\bar{\bm{W}}_{2}(t) the solutions of (58) as well as Wˉ10(t)\bar{\bm{W}}^{0}_{1}(t) and Wˉ20(t)\bar{\bm{W}}^{0}_{2}(t) the solutions of the very same differential equation but with B=0{\bm{B}}=0. For notational simplicity we will omit the bar on the matrices W{\bm{W}} in the following. We have that,

In order to upper bound the first integral we will consider the function F(A,B):=ABBF({\bm{A}},{\bm{B}}):={\bm{A}}{\bm{B}}{\bm{B}}^{\top}. This function is Lipschitz on any compact because this function is infinitely differentiable. Thus we have that (omiting the tt in the notation),

Using the fact that W1(t)t,W10(t)t,W20(t)t\frac{{\bm{W}}_{1}(t)}{\sqrt{t}},\frac{{\bm{W}}_{1}^{0}(t)}{\sqrt{t}},\frac{{\bm{W}}_{2}^{0}(t)}{\sqrt{t}} and W2(t)t,t0\frac{{\bm{W}}_{2}(t)}{\sqrt{t}},\,t\geq 0 live in a compact set (Lemma 3) and that FF is Lipschitz on any compact. Thus, using that S=O(ϵ)\|S\|=O(\epsilon), we have

And consequently we can sum these two inequalities and apply Grönwall’s inequality with the quantities u(t)=W1(t)W10(t)+W2(t)W20(t)u(t)=\|{\bm{W}}_{1}(t)-{\bm{W}}^{0}_{1}(t)\|+\|{\bm{W}}_{2}(t)-{\bm{W}}^{0}_{2}(t)\|, α(t)=O(ϵ)\alpha(t)=O(\epsilon) and β(s)=O(s)\beta(s)=O(s) to get,

B.3 Proof of Thm. 2

Let us denotes wi(t)w_{i}(t), the values defined in (16). If mi(0)=eδ,  1ir,m_{i}(0)=e^{-\delta}\,,\;1\leq i\leq r, then we have,

Then we can conclude saying that for any ii and t0t\geq 0,

B.4 Proof of Eq. 21

B.5 Proof of Thm. 3

If we define W(t):=W1(t)W2(t){\bm{W}}^{(t)}:={\bm{W}}_{1}^{(t)}{\bm{W}}_{2}^{(t)}, the discrete update scheme for the two-layer linear neural network (2) is,

Using the same transformation (37) as in §B.2 we get that,

let us note, r=min(p,d)r=\min(p,d) and m1(t),,mr(t)m_{1}^{(t)},\ldots,m_{r}^{(t)} and n1(t),,nr(t)n_{1}^{(t)},\ldots,n_{r}^{(t)} the respective diagonal coefficients of Wˉ1(t)\bar{\bm{W}}_{1}^{(t)} and Wˉ2(t)\bar{\bm{W}}_{2}^{(t)}, they follow the equation,

In order to prove Thm. 3, we will prove several properties on the sequences (mi(t))t0(m_{i}^{(t)})_{t\geq 0} and (ni(t))t0,  1id(n_{i}^{(t)})_{t\geq 0}\,,\;1\leq i\leq d. First let us introduce the sequence (ai(t))t0(a_{i}^{(t)})_{t\geq 0} defined as ai(t):=mi(t)ni(t),t0a_{i}^{(t)}:=m_{i}^{(t)}n_{i}^{(t)}\,,t\geq 0.

By a straightforward recurrence we have that if at time tt, mi(t)=ni(t)m_{i}^{(t)}=n_{i}^{(t)} then by (70), we have mi(t+1)=ni(t+1)m_{i}^{(t+1)}=n_{i}^{(t+1)}. ∎

Thus, we will now focus on the sequence (ai(t))t0(a_{i}^{(t)})_{t\geq 0}, by (70), we have that

Similarly as for the continuous case, there is two different behavior σi>0\sigma_{i}>0 ad σi=0,  λi>0\sigma_{i}=0\,,\;\lambda_{i}>0. In the following we assume that η>0\eta>0.

For σi>0\sigma_{i}>0 we can derive the following results,

For any 1irxy1\leq i\leq r_{xy}, if 0<ai(0)<σiλi0<a_{i}^{(0)}<\frac{\sigma_{i}}{\lambda_{i}} and 2ησi<12\eta\sigma_{i}<1, then, the sequence (ai(t))(a_{i}^{(t)}) is increasing and

Let us assume that (73) is true for a time-step tt and let us prove that it is still true at time-step t+1t+1.

Using the recursive definition (72) of ai(t)a_{i}^{(t)} we get for t0t\geq 0,

For the upper bound we need to notice that ai(t+1)=fi(ai(t))a_{i}^{(t+1)}=f_{i}(a_{i}^{(t)}) where fi:xx+ηx(σiλix)(2+η(σiλix))f_{i}:x\mapsto x+\eta x(\sigma_{i}-\lambda_{i}x)(2+\eta(\sigma_{i}-\lambda_{i}x)) where η>0\eta>0. Since we assumed that 2ησi<12\eta\sigma_{i}<1, we have that,

Then we just need to show that g(x)<λiσi,  x(0,λiσi)g(x)<\tfrac{\lambda_{i}}{\sigma_{i}}\,,\;\forall x\in(0,\tfrac{\lambda_{i}}{\sigma_{i}}).

Thus gg is non-decreasing on (0,1)(0,1) and consequently, g(x)<g(λiσi)=λiσi,  x(0,1)g(x)<g(\tfrac{\lambda_{i}}{\sigma_{i}})=\tfrac{\lambda_{i}}{\sigma_{i}}\,,\;\forall x\in(0,1).

With this lemma we can proof Thm. 3. Let us first recall this theorem.

For any 1irxy1\leq i\leq r_{xy}, if σiλi>ai(0)>0\frac{\sigma_{i}}{\lambda_{i}}>a_{i}^{(0)}>0 and 2ησi<12\eta\sigma_{i}<1, then t0,  1ir\forall t\geq 0\,,\;1\leq i\leq r we have,

In this proof for notational compactness we will remove the index ii.

We first prove (81), we work with 1/a(t+1)λσ1/a^{(t+1)}-\frac{\lambda}{\sigma}, Using (72) we get,

where we used that 11+x1x,  x0\frac{1}{1+x}\geq 1-x\,,\;\forall x\geq 0. Thus we have,

To prove (81) we will once again work with 1/a(t+1)λσ1/a^{(t+1)}-\frac{\lambda}{\sigma}. Using (72) we get

where we used that 11+x1x+x2,  x0\frac{1}{1+x}\leq 1-x+x^{2}\,,\;\forall x\geq 0. Thus we have,

Now for σ=0\sigma=0 and λ>0\lambda>0 we have that,

Thus, considering (ai(t))1(a_{i}^{(t)})^{-1} we get

Thus, if we assume that 1/ai(0)λη1/a_{i}^{(0)}\geq\lambda\eta we have that (1/ai(t))(1/a_{i}^{(t)}) is a increasing sequence and that,

Case ϵ>0italic-ϵ0\epsilon>0.

If we are able to show that all the sequences W1(t){\bm{W}}_{1}^{(t)} and W2(t){\bm{W}}_{2}^{(t)} are bounded

From this theorem we can deduce the following corollary,

Proof of Corollary 3

If η<12σ1\eta<\frac{1}{2\sigma_{1}}, η<2σiσi+1σi2\eta<2\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}} and η<σiσi+12σi+12,  i  1irxy1\eta<\frac{\sigma_{i}-\sigma_{i+1}}{2\sigma_{i+1}^{2}}\,,\;\forall i\;1\leq i\leq r_{xy}-1 then for 1i<rx1\leq i<r_{x},

where Tj:=1σjη,  1jrxyT_{j}:=\frac{1}{\sigma_{j}\eta}\,,\;1\leq j\leq r_{xy} and Tj=+ if j>rxyT_{j}=+\infty\text{ if }j>r_{xy} and T0=0T_{0}=0.

First let us notice that since σ1>>σrxy>0\sigma_{1}>\ldots>\sigma_{r_{xy}}>0, the assumption η<1/(2σ1)\eta<1/(2\sigma_{1}) implies η<1/(2σi),  1id\eta<1/(2\sigma_{i})\,,\;1\leq i\leq d.

Let irxyi\leq r_{xy}. Let us first prove that if j<ij<i, then ai(Tj)δ0a_{i}^{(T_{j})}\underset{\delta\to\infty}{\to}0.

Using (81) and recalling that in Thm. 3, we assume that ai(0)=e2δa_{i}^{(0)}=e^{-2\delta}, we have for 1j<i1\leq j<i,

We have (2+ησi)σi/σi1<2(2+\eta\sigma_{i})\sigma_{i}/\sigma_{i-1}<2, because we assumed that η<2σiσi+1σi2,  i1id\eta<2\frac{\sigma_{i}-\sigma_{i+1}}{\sigma_{i}^{2}}\,,\;\forall i\quad 1\leq i\leq d. Note that for i=rxyi=r_{xy} we have ai(δTi+1)=0,  δ>0a_{i}^{(\delta T_{i+1})}=0\,,\;\forall\delta>0.

Let us now prove that if j>ij>i, then ai(δTj)δσiλia_{i}^{(\delta T_{j})}\underset{\delta\to\infty}{\to}\frac{\sigma_{i}}{\lambda_{i}}.

Using (82) and recalling that in Thm. 3, we assume that ai(0)=e2δa_{i}^{(0)}=e^{-2\delta}, we have for j>i1j>i\geq 1,

where we have that (e2δ1)eδ(2σi/σj+4ησi2/σj)0(e^{2\delta}-1)e^{\delta(-2\sigma_{i}/\sigma_{j}+4\eta\sigma_{i}^{2}/\sigma_{j})}\to 0 because

Now for irxy+1i\geq r_{xy}+1 we just need to use, (Theorem’ 3) to get,