On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Sanjeev Arora, Nadav Cohen, Elad Hazan

Introduction

How does depth help? This central question of deep learning still eludes full theoretical understanding. The general consensus is that there is a trade-off: increasing depth improves expressiveness, but complicates optimization. Superior expressiveness of deeper networks, long suspected, is now confirmed by theory, albeit for fairly limited learning problems (Eldan & Shamir, 2015; Raghu et al., 2016; Lee et al., 2017; Cohen et al., 2017; Daniely, 2017; Arora et al., 2018). Difficulties in optimizing deeper networks have also been long clear – the signal held by a gradient gets buried as it propagates through many layers. This is known as the “vanishing/exploding gradient problem”. Modern techniques such as batch normalization (Ioffe & Szegedy, 2015) and residual connections (He et al., 2015) have somewhat alleviated these difficulties in practice.

Given the longstanding consensus on expressiveness vs. optimization trade-offs, this paper conveys a rather counterintuitive message: increasing depth can accelerate optimization. The effect is shown, via first-cut theoretical and empirical analyses, to resemble a combination of two well-known tools in the field of optimization: momentum, which led to provable acceleration bounds (Nesterov, 1983); and adaptive regularization, a more recent technique proven to accelerate by Duchi et al. (2011) in their proposal of the AdaGrad algorithm. Explicit mergers of both techniques are quite popular in deep learning (Kingma & Ba, 2014; Tieleman & Hinton, 2012). It is thus intriguing that merely introducing depth, with no other modification, can have a similar effect, but implicitly.

There is an obvious hurdle in isolating the effect of depth on optimization: if increasing depth leads to faster training on a given dataset, how can one tell whether the improvement arose from a true acceleration phenomenon, or simply due to better representational power (the shallower network was unable to attain the same training loss)? We respond to this hurdle by focusing on linear neural networks (cf. Saxe et al. (2013); Goodfellow et al. (2016); Hardt & Ma (2016); Kawaguchi (2016)). With these models, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices – an overparameterization.

Both our theoretical analysis and our empirical evaluation indicate that acceleration via overparameterization need not be computationally expensive. From an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters.

Related Work

Theoretical study of optimization in deep learning is a highly active area of research. Works along this line typically analyze critical points (local minima, saddles) in the landscape of the training objective, either for linear networks (see for example Kawaguchi (2016); Hardt & Ma (2016) or Baldi & Hornik (1989) for a classic account), or for specific non-linear networks under different restrictive assumptions (cf. Choromanska et al. (2015); Haeffele & Vidal (2015); Soudry & Carmon (2016); Safran & Shamir (2017)). Other works characterize other aspects of objective landscapes, for example Safran & Shamir (2016) showed that under certain conditions a monotonically descending path from initialization to global optimum exists (in compliance with the empirical observations of Goodfellow et al. (2014)).

Turning to general optimization, accelerated gradient (momentum) methods were introduced in Nesterov (1983), and later studied in numerous works (see Wibisono et al. (2016) for a short review). Such methods effectively accumulate gradients throughout the entire optimization path, using the collected history to determine the step at a current point in time. Use of preconditioners to speed up optimization is also a well-known technique. Indeed, the classic Newton’s method can be seen as preconditioning based on second derivatives. Adaptive preconditioning with only first-order (gradient) information was popularized by the BFGS algorithm and its variants (cf. Nocedal (1980)). Relevant theoretical guarantees, in the context of regret minimization, were given in Hazan et al. (2007); Duchi et al. (2011). In terms of combining momentum and adaptive preconditioning, Adam (Kingma & Ba, 2014) is a popular approach, particularly for optimization of deep networks.

Algorithms with certain theoretical guarantees for non-convex optimization, and in particular for training deep neural networks, were recently suggested in various works, for example Ge et al. (2015); Agarwal et al. (2017); Carmon et al. (2016); Janzamin et al. (2015); Livni et al. (2014) and references therein. Since the focus of this paper lies on the analysis of algorithms already used by practitioners, such works lie outside our scope.

Obviously the overparameterization does not affect the expressiveness of the linear model. How does it affect optimization? What happens to gradient descent on this non-convex objective?

Gradient descent over L(w1,w2)L({\mathbf{w}}_{1},w_{2}), with fixed small learning rate and near-zero initialization, is equivalent to gradient descent over L(w)L({\mathbf{w}}) with particular adaptive learning rate and momentum terms.

To see this, consider the gradients of L(w)L({\mathbf{w}}) and L(w1,w2)L({\mathbf{w}}_{1},w_{2}):

Gradient descent over L(w1,w2)L({\mathbf{w}}_{1},w_{2}) with learning rate η>0\eta>0:

The dynamics of the underlying parameter w=w1w2{\mathbf{w}}={\mathbf{w}}_{1}w_{2} are:

We conclude that the dynamics governing the underlying parameter w{\mathbf{w}} correspond to gradient descent with a momentum term, where both the learning rate (ρ(t)\rho^{(t)}) and momentum coefficients (μ(t,τ)\mu^{(t,\tau)}) are time-varying and adaptive.

Linear Neural Networks

For completeness, we regard a depth-11 network as the family of directly parameterized linear predictors, i.e. we set Φ1:=Φlin\Phi^{1}:=\Phi^{lin} (see Equation 1).

and so the sole difference between the training loss of a depth-NN network and that of a depth-11 network (classic linear model) lies in the replacement of a matrix parameter by a product of NN matrices. This implies that if increasing NN can indeed accelerate convergence, it is not an outcome of any phenomenon other than favorable properties of depth-induced overparameterization for optimization.

Implicit Dynamics of Gradient Descent

In this section we present a new result for linear neural networks, tying the dynamics of gradient descent on LN()L^{N}(\cdot) – the training loss corresponding to a depth-NN network, to those on L1()L^{1}(\cdot) – training loss of a depth-11 network (classic linear model). Specifically, we show that gradient descent on LN()L^{N}(\cdot), a complicated and seemingly pointless overparameterization, can be directly rewritten as a particular preconditioning scheme over gradient descent on L1()L^{1}(\cdot).

When applied to LN()L^{N}(\cdot), gradient descent takes on the following form:

η>0\eta>0 here is a learning rate, and λ0\lambda\geq 0 is an optional weight decay coefficient. For simplicity, we regard both η\eta and λ\lambda as fixed (no dependence on tt). Define the underlying end-to-end weight matrix:

Given that LN(W1,,WN)=L1(We)L^{N}(W_{1},\ldots,W_{N})=L^{1}(W_{e}) (Equation 3), we view WeW_{e} as an optimized weight matrix for L1()L^{1}(\cdot), whose dynamics are governed by Equation 4. Our interest then boils down to the study of these dynamics for different choices of NN. For N=1N=1 they are (trivially) equivalent to standard gradient descent over L1()L^{1}(\cdot). We will characterize the dynamics for N2N\geq 2.

To be able to derive, in our general setting, an explicit update rule for the end-to-end weight matrix WeW_{e} (Equation 5), we introduce an assumption by which the learning rate is small, i.e. η20\eta^{2}\approx 0. Formally, this amounts to translating Equation 4 to the following set of differential equations:

where tt is now a continuous time index, and W˙j(t)\dot{W}_{j}(t) stands for the derivative of WjW_{j} with respect to time. The use of differential equations, for both theoretical analysis and algorithm design, has a long and rich history in optimization research (see Helmke & Moore (2012) for an overview). When step sizes (learning rates) are taken to be small, trajectories of discrete optimization algorithms converge to smooth curves modeled by continuous-time differential equations, paving way to the well-established theory of the latter (cf. Boyce et al. (1969)). This approach has led to numerous interesting findings, including recent results in the context of acceleration methods (e.g. Su et al. (2014); Wibisono et al. (2016)).

With the continuous formulation in place, we turn to express the dynamics of the end-to-end matrix WeW_{e}:

Assume the weight matrices W1WNW_{1}{\ldots}W_{N} follow the dynamics of continuous gradient descent (Equation 6). Assume also that their initial values (time t0t_{0}) satisfy, for j=1N1j=1{\ldots}N-1:

Then, the end-to-end weight matrix WeW_{e} (Equation 5) is governed by the following differential equation:

where []j1N[\cdot]^{\frac{j-1}{N}} and []NjN[\cdot]^{\frac{N-j}{N}}, j=1Nj=1\ldots{N}, are fractional power operators defined over positive semidefinite matrices.

(sketch – full details in Appendix A.1) If λ=0\lambda\,{=}\,0 (no weight decay) then one can easily show that Wj+1(t)W˙j+1(t)=W˙j(t)Wj(t)W_{j+1}^{\top}(t)\dot{W}_{j+1}(t)=\dot{W}_{j}(t)W_{j}^{\top}(t) throughout optimization. Taking the transpose of this equation and adding to itself, followed by integration over time, imply that the difference between Wj+1(t)Wj+1(t)W_{j+1}^{\top}(t)W_{j+1}(t) and Wj(t)Wj(t)W_{j}(t)W_{j}^{\top}(t) is constant. This difference is zero at initialization (Equation 7), thus will remain zero throughout, i.e.:

A slightly more delicate treatment shows that this is true even if λ>0\lambda>0, i.e. with weight decay included.

Equation 9 implies alignment of the (left and right) singular spaces of Wj(t)W_{j}(t) and Wj+1(t)W_{j+1}(t), simplifying the product Wj+1(t)Wj(t)W_{j+1}(t)W_{j}(t). Successive application of this simplification allows a clean computation for the product of all layers (that is, WeW_{e}), leading to the explicit form presented in theorem statement (Equation 8). ∎

Translating the continuous dynamics of Equation 8 back to discrete time, we obtain the sought-after update rule for the end-to-end weight matrix:

This update rule relies on two assumptions: first, that the learning rate η\eta is small enough for discrete updates to approximate continuous ones; and second, that weights are initialized on par with Equation 7, which will approximately be the case if initialization values are close enough to zero. It is customary in deep learning for both learning rate and weight initializations to be small, but nonetheless above assumptions are only met to a certain extent. We support their applicability by showing empirically (Section 8) that the end-to-end update rule (Equation 10) indeed provides an accurate description for the dynamics of WeW_{e}.

A close look at Equation 10 reveals that the dynamics of the end-to-end weight matrix WeW_{e} are similar to gradient descent over L1()L^{1}(\cdot) – training loss corresponding to a depth-11 network (classic linear model). The only difference (besides the scaling by NN of the weight decay coefficient λ\lambda) is that the gradient dL1dW(We)\frac{dL^{1}}{dW}(W_{e}) is subject to a transformation before being used. Namely, for j=1Nj=1{\ldots}N, it is multiplied from the left by [WeWe]j1N[W_{e}W_{e}^{\top}]^{\frac{j-1}{N}} and from the right by [WeWe]NjN[W_{e}^{\top}{W}_{e}]^{\frac{N-j}{N}}, followed by summation over jj. Clearly, when N=1N=1 (depth-11 network) this transformation reduces to identity, and as expected, WeW_{e} precisely adheres to gradient descent over L1()L^{1}(\cdot). When N2N\geq 2 the dynamics of WeW_{e} are less interpretable. We arrange it as a vector to gain more insight:

For an arbitrary matrix AA, denote by vec(A)vec(A) its arrangement as a vector in column-first order. Then, the end-to-end update rule in Equation 10 can be written as:

The result readily follows from the properties of the Kronecker product – see Appendix A.2 for details. ∎

Claim 1 implies that in the end-to-end update rule of Equation 10, the transformation applied to the gradient dL1dW(We)\frac{dL^{1}}{dW}(W_{e}) is essentially a preconditioning, whose eigendirections and eigenvalues depend on the singular value decomposition of WeW_{e}. The eigendirections are the rank-11 matrices urvr{\mathbf{u}}_{r}{\mathbf{v}}_{r^{\prime}}^{\top}, where ur{\mathbf{u}}_{r} and vr{\mathbf{v}}_{r^{\prime}} are left and right (respectively) singular vectors of WeW_{e}. The eigenvalue of urvr{\mathbf{u}}_{r}{\mathbf{v}}_{r^{\prime}}^{\top} is j=1Nσr2(Nj)/Nσr2(j1)/N\sum_{j=1}^{N}\sigma_{r}^{2(N-j)/N}\sigma_{r^{\prime}}^{2(j-1)/N}, where σr\sigma_{r} and σr\sigma_{r^{\prime}} are the singular values of WeW_{e} corresponding to ur{\mathbf{u}}_{r} and vr{\mathbf{v}}_{r^{\prime}} (respectively). When N2N\geq 2, an increase in σr\sigma_{r} or σr\sigma_{r^{\prime}} leads to an increase in the eigenvalue corresponding to the eigendirection urvr{\mathbf{u}}_{r}{\mathbf{v}}_{r^{\prime}}^{\top}. Qualitatively, this implies that the preconditioning favors directions that correspond to singular vectors whose presence in WeW_{e} is stronger. We conclude that the effect of overparameterization, i.e. of replacing a classic linear model (depth-11 network) by a depth-NN linear network, boils down to modifying gradient descent by promoting movement along directions that fall in line with the current location in parameter space. A-priori, such a preference may seem peculiar – why should an optimization algorithm be sensitive to its location in parameter space? Indeed, we generally expect sensible algorithms to be translation invariant, i.e. be oblivious to parameter value. However, if one takes into account the common practice in deep learning of initializing weights near zero, the location in parameter space can also be regarded as the overall movement made by the algorithm. We thus interpret our findings as indicating that overparameterization promotes movement along directions already taken by the optimization, and therefore can be seen as a form of acceleration. This intuitive interpretation will become more concrete in the subsection that follows.

A final point to make, is that the end-to-end update rule (Equation 10 or 11), which obviously depends on NN – number of layers in the deep linear network, does not depend on the hidden widths n1nN1n_{1}\ldots{n}_{N-1} (see Section 4). This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters. Consequently, the acceleration of overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

To facilitate a straightforward presentation of our findings, we hereinafter focus on the special case where the optimized models have a single output, i.e. where k=1k=1. This corresponds, for example, to a binary (two-class) classification problem, or to the prediction of a numeric scalar property (regression). It admits a particularly simple form for the end-to-end update rule of Equation 10:

The result follows from the definition of a fractional power operator over matrices – see Appendix A.3. ∎

Claim 2 implies that in the single output case, the effect of overparameterization (replacing classic linear model by depth-NN linear network) on gradient descent is twofold: first, it leads to an adaptive learning rate schedule, by introducing the multiplicative factor We222/N\left\|W_{e}\right\|_{2}^{2-2/N}; and second, it amplifies (by NN) the projection of the gradient on the direction of WeW_{e}. Recall that we view WeW_{e} not only as the optimized parameter, but also as the overall movement made in optimization (initialization is assumed to be near zero). Accordingly, the adaptive learning rate schedule can be seen as gaining confidence (increasing step sizes) when optimization moves farther away from initialization, and the gradient projection amplification can be thought of as a certain type of momentum that favors movement along the azimuth taken so far. These effects bear potential to accelerate convergence, as we illustrate qualitatively in Section 7, and demonstrate empirically in Section 8.

Overparametrization Effects Cannot Be Attained via Regularization

Adding a regularizer to the objective is a standard approach for improving optimization (though lately the term regularization is typically associated with generalization). For example, AdaGrad was originally invented to compete with the best regularizer from a particular family. The next theorem shows (for single output case) that the effects of overparameterization cannot be attained by adding a regularization term to the original training loss, or via any similar modification. This is not obvious a-priori, as unlike many acceleration methods that explicitly maintain memory of past gradients, updates under overparametrization are by definition the gradients of something. The assumptions in the theorem are minimal and also necessary, as one must rule-out the trivial counter-example of a constant training loss.

where PrW{}Pr_{W}\{\cdot\} is the projection given in Equation 13. Then, there exists no function (of WW) whose gradient field is F()F(\cdot).

(sketch – full details in Appendix A.4) The proof uses elementary differential geometry (Buck, 2003): curves, arc length and the fundamental theorem for line integrals, which states that the integral of g\nabla{g} for any differentiable function gg amounts to along every closed curve.

Overparametrization changes gradient descent’s behavior: instead of following the original gradient dL1dW\frac{dL^{1}}{dW}, it follows some other direction F()F(\cdot) (see Equations 12 and 14) that is a function of the original gradient as well as the current point WW. We think of this change as a transformation that maps one vector field ϕ()\phi(\cdot) to another – Fϕ()F_{\phi}(\cdot):

Notice that for ϕ=dL1dW\phi=\frac{dL^{1}}{dW}, we get exactly the vector field F()F(\cdot) defined in theorem statement.

We note simple properties of the mapping ϕFϕ\phi\mapsto{F}_{\phi}. First, it is linear, since for any vector fields ϕ1,ϕ2\phi_{1},\phi_{2} and scalar cc: Fϕ1+ϕ2=Fϕ1+Fϕ2F_{\phi_{1}{+}\phi_{2}}{=}F_{\phi_{1}}{+}F_{\phi_{2}} and Fcϕ1=cFϕ1F_{c{\cdot}\phi_{1}}{=}c{\cdot}{F}_{\phi_{1}}. Second, because of the linearity of line integrals, for any curve Γ\Gamma, the functional ϕΓFϕ\phi\mapsto\int_{\Gamma}F_{\phi}, a mapping of vector fields to scalars, is linear.

We show that F()F(\cdot) contradicts the fundamental theorem for line integrals. To do so, we construct a closed curve Γ=Γr,R\Gamma{=}\Gamma_{r,R} for which the linear functional ϕΓFϕ\phi\mapsto\oint_{\Gamma}{F}_{\phi} does not vanish at ϕ=dL1dW\phi{=}\frac{dL^{1}}{dW}. Let e:=dL1dW(W=0)/dL1dW(W=0){\mathbf{e}}:=\frac{dL^{1}}{dW}(W{=}0)/\|\frac{dL^{1}}{dW}(W{=}0)\|, which is well-defined since by assumption dL1dW(W=0)0\frac{dL^{1}}{dW}(W{=}0){\neq 0}. For r<Rr<R we define (see Figure 1):

Γr,R1\Gamma_{r,R}^{1} is the line segment from Re-R\cdot{\mathbf{e}} to re-r\cdot{\mathbf{e}}.

Γr,R2\Gamma_{r,R}^{2} is a spherical curve from re-r\cdot{\mathbf{e}} to rer\cdot{\mathbf{e}}.

Γr,R3\Gamma_{r,R}^{3} is the line segment from rer\cdot{\mathbf{e}} to ReR\cdot{\mathbf{e}}.

Γr,R4\Gamma_{r,R}^{4} is a spherical curve from ReR\cdot{\mathbf{e}} to Re-R\cdot{\mathbf{e}}.

With the definition of Γr,R\Gamma_{r,R} in place, we decompose dL1dW\frac{dL^{1}}{dW} into a constant vector field κdL1dW(W=0)\kappa\,{\equiv}\,\frac{dL^{1}}{dW}(W{=}0) plus a residual ξ\xi. We explicitly compute the line integrals along Γr,R1Γr,R4\Gamma_{r,R}^{1}\ldots\Gamma_{r,R}^{4} for FκF_{\kappa}, and derive bounds for FξF_{\xi}. This, along with the linearity of the functional ϕΓFϕ\phi\mapsto\int_{\Gamma}F_{\phi}, provides a lower bound on the line integral of F()F(\cdot) over Γr,R\Gamma_{r,R}. We show the lower bound is positive as r,R0r,R\to 0, thus F()F(\cdot) indeed contradicts the fundamental theorem for line integrals. ∎

Illustration of Acceleration

To this end, we showed that overparameterization (use of depth-NN linear network in place of classic linear model) induces on gradient descent a particular preconditioning scheme (Equation 10 in general and 12 in the single output case), which can be interpreted as introducing some forms of momentum and adaptive learning rate. We now illustrate qualitatively, on a very simple hypothetical learning problem, the potential of these to accelerate optimization.

With fixed learning rate η>0\eta>0 (weight decay omitted for simplicity), gradient descent over L()L(\cdot) gives:

Changing variables per Δi=wiyi\Delta_{i}=w_{i}-y_{i}, we have:

Assuming the original weights w1w_{1} and w2w_{2} are initialized near zero, Δ1\Delta_{1} and Δ2\Delta_{2} start off at y1-y_{1} and y2-y_{2} respectively, and will eventually reach the optimum Δ1=Δ2=0\Delta^{*}_{1}=\Delta^{*}_{2}=0 if the learning rate is small enough to prevent divergence:

Suppose now that the problem is ill-conditioned, in the sense that y1y2y_{1}{\gg}{y}_{2}. If p=2p=2 this has no effect on the bound for η\eta. Optimal learning rate for gradient descent on quadratic objective does not depend on current parameter value (cf. Goh (2017)). If p>2p>2 the learning rate is determined by y1y_{1}, leading Δ2\Delta_{2} to converge very slowly. In a sense, Δ2\Delta_{2} will suffer from the fact that there is no “communication” between the coordinates (this will actually be the case not just with gradient descent, but with most algorithms typically used in large-scale settings – AdaGrad, Adam, etc.).

Now consider the scenario where we optimize L()L(\cdot) via overparameterization, i.e. with the update rule in Equation 12 (single output). In this case the coordinates are coupled, and as Δ1\Delta_{1} gets small (w1w_{1} gets close to y1y_{1}), the learning rate is effectively scaled by y122Ny_{1}^{2-\frac{2}{N}} (in addition to a scaling by NN in coordinate 11 only), allowing (if y1>1y_{1}{>}1) faster convergence of Δ2\Delta_{2}. We thus have the luxury of temporarily slowing down Δ2\Delta_{2} to ensure that Δ1\Delta_{1} does not diverge, with the latter speeding up the former as it reaches safe grounds. In Appendix B we consider a special case and formalize this intuition, deriving a concrete bound for the acceleration of overparameterization.

Experiments

Our analysis (Section 5) suggests that overparameterization – replacement of a classic linear model by a deep linear network, induces on gradient descent a certain preconditioning scheme. We qualitatively argued (Section 7) that in some cases, this preconditioning may accelerate convergence. In this section we put these claims to the test, through a series of empirical evaluations based on TensorFlow toolbox (Abadi et al. (2016)). For conciseness, many of the details behind our implementation are deferred to Appendix C.

Alongside the validity of the end-to-end update rule, Figure 2 also demonstrates the negligible effect of network width on convergence, in accordance with our analysis (see Section 5). Specifically, it shows that in the evaluated setting, hidden layers of size 11 (scalars) suffice in order for the essence of overparameterization to fully emerge. Unless otherwise indicated, all results reported hereinafter are based on this configuration, i.e. on scalar hidden layers. The computational toll associated with overparameterization will thus be virtually non-existent.

An immediate question arises at this point. If depth indeed accelerates convergence, why not add as many layers as one can computationally afford? The reason, which is actually apparent in our analysis, is the so-called vanishing gradient problem. When training a very deep network (large NN), while initializing weights to be small, the end-to-end matrix WeW_{e} (Equation 5) is extremely close to zero, severely attenuating gradients in the preconditioning scheme (Equation 10). A possible approach for alleviating this issue is to initialize weights to be larger, yet small enough such that the end-to-end matrix does not “explode”. The choice of identity (or near identity) initialization leads to what is known as linear residual networks (Hardt & Ma, 2016), akin to the successful residual networks architecture (He et al., 2015) commonly employed in deep learning. Notice that identity initialization satisfies the condition in Equation 7, rendering the end-to-end update rule (Equation 10) applicable. Figure 5-left shows convergence, under gradient descent, of a single layer model against deeper networks than those evaluated before – depths 44 and 88. As can be seen, with standard, near-zero initialization, the depth-44 network starts making visible progress only after about 65K65K iterations, whereas the depth-88 network seems stuck even after 100K100K iterations. In contrast, under identity initialization, both networks immediately make progress, and again depth serves as an implicit accelerator.

As a final sanity test, we evaluate the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting. Specifically, we experiment with the convolutional network tutorial for MNIST built into TensorFlow, https://github.com/tensorflow/models/tree/master/tutorials/image/mnist which includes convolution, pooling and dense layers, ReLU non-linearities, stochastic gradient descent with momentum, and dropout (Srivastava et al., 2014). We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer. Here, as opposed to previous experiments, widths of the newly formed hidden layers were not set to 11, but rather to the minimal values that do not deteriorate expressiveness (see Appendix C). Overall, with an addition of roughly 15%15\% in number of parameters, optimization has accelerated considerably – see Figure 5-right. The displayed results were obtained with the hyperparameter settings hardcoded into the tutorial. We have tried alternative settings (varying learning rates and standard deviations of initializations – see Appendix C), and in all cases observed an outcome similar to that in Figure 5-right – overparameterization led to significant speedup. Nevertheless, as reported above for linear networks, it is likely that for non-linear networks the effect of depth on optimization is mixed – some settings accelerate by it, while others do not. Comprehensive characterization of the cases in which depth accelerates optimization warrants much further study. We hope our work will spur interest in this avenue of research.

Conclusion

Through theory and experiments, we demonstrated that overparameterizing a neural network by increasing its depth can accelerate optimization, even on very simple problems.

Our analysis of linear neural networks, the subject of various recent studies, yielded a new result: for these models, overparameterization by depth can be understood as a preconditioning scheme with a closed form description (Theorem 1 and the claims thereafter). The preconditioning may be interpreted as a combination between certain forms of adaptive learning rate and momentum. Given that it depends on network depth but not on width, acceleration by overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

Clearly, complete theoretical analysis for non-linear networks will be challenging. Empirically however, we showed that the trivial idea of replacing an internal weight matrix by a product of two can significantly accelerate optimization, with absolutely no effect on expressiveness (Figure 5-right).

Acknowledgments

Sanjeev Arora’s work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Elad Hazan’s work is supported by NSF grant 1523815 and Google Brain. Nadav Cohen is a member of the Zuckerman Israeli Postdoctoral Scholars Program, and is supported by Eric and Wendy Schmidt.

References

References

Appendix A Deferred Proofs

Before delving into the proof, we introduce notation that will admit a more compact presentation of formulae. For 1abN1\leq{a}\leq{b}\leq{N}, we denote:

where W1WNW_{1}\ldots{W}_{N} are the weight matrices of the depth-NN linear network (Equation 2). If a>ba>b, then by definition both aj=bWj\prod^{j=b}_{a}W_{j} and j=abWj\prod_{j=a}^{b}W_{j}^{\top} are identity matrices, with size depending on context, i.e. on the dimensions of matrices they are multiplied against. Given any square matrices (possibly scalars) A1,A2,,AmA_{1},A_{2},\ldots,A_{m}, we denote by diag(A1Am)diag(A_{1}\ldots{A}_{m}) a block-diagonal matrix holding them on its diagonal:

As illustrated above, diag(A1Am)diag(A_{1}\ldots{A}_{m}) may hold additional, zero-valued rows and columns beyond A1AmA_{1}\ldots{A}_{m}. Conversely, it may also trim (omit) rows and columns, from its bottom and right ends respectively, so long as only zeros are being removed. The exact shape of diag(A1Am)diag(A_{1}\ldots{A}_{m}) is again determined by context, and so if BB and CC are matrices, the expression Bdiag(A1Am)CB\cdot{diag}(A_{1}\ldots{A}_{m})\cdot{C} infers a number of rows equal to the number of columns in BB, and a number of columns equal to the number of rows in CC.

Turning to the actual proof, we disregard the trivial case N=1N=1, and begin by noticing that Equation 3, along with the definition of WeW_{e} (Equation 5), imply that for every j=1Nj=1\ldots{N}:

Plugging this into the differential equations of gradient descent (Equation 6), we get:

For j=1N1j=1\ldots{N}{-}1, multiply the jj’th equation by Wj(t)W_{j}^{\top}(t) from the right, and the j+1j{+}1’th equation by Wj+1(t)W_{j+1}^{\top}(t) from the left. This yields:

Taking the transpose of these equations and adding to themselves, we obtain, for every j=1N1j=1\ldots{N}{-}1:

Turning to Lemma 1 below, while recalling our assumption for time t0t_{0} (Equation 7):

we conclude that, throughout the entire time-line:

Recollecting the definitions of Cj(t),Cj(t)C_{j}(t),C^{\prime}_{j}(t), this means:

Regard tt now as fixed, and for every j=1Nj=1\ldots{N}, let:

be a singular value decomposition. That is to say, UjU_{j} and VjV_{j} are orthogonal matrices, and Σj\Sigma_{j} is a rectangular-diagonal matrix holding non-decreasing, non-negative singular values on its diagonal. Equation 18 implies that for j=1N1j=1\ldots{N}{-}1:

Oj,rO_{j,r} here is simply a matrix changing between orthogonal bases in the eigenspace of ρr\rho_{r} – it maps the basis comprising Vj+1V_{j+1}-columns to that comprising UjU_{j}-columns. Recalling that both Σj\Sigma_{j} and Σj+1\Sigma_{j+1} are rectangular-diagonal, holding only non-negative values, Equation 20 implies that each of these matrices is equal to diag(ρ1Id1,,ρmIdm)diag(\sqrt{\rho_{1}}{\cdot}I_{d_{1}},\ldots,\sqrt{\rho_{m}}{\cdot}I_{d_{m}}). Note that the matrices generally do not have the same shape and thus, formally, are not equal to one another. Nonetheless, in line with our diagdiag notation (see beginning of this subsection), Σj\Sigma_{j} and Σj+1\Sigma_{j+1} may differ from each other only in trailing, zero-valued rows and columns. By an inductive argument, all the singular value matrices Σ1,Σ2,,ΣN\Sigma_{1},\Sigma_{2},\ldots,\Sigma_{N} (see Equation 19) are equal up to trailing zero rows and columns. The fact that ρ1ρm\rho_{1}\ldots\rho_{m} do not include an index jj in their notation is thus in order, and we may write, for every j=1N1j=1\ldots{N}{-}1:

Concatenations of weight matrices thus simplify as follows:

where we used the orthogonality of Oj,rO_{j,r}, and the obvious fact that it commutes with IdrI_{d_{r}}. Consider Equation 21 with j=1j=1 and Equation 22 with j=Nj=N, while recalling that by definition We(t)=1i=NWj(t)W_{e}(t)=\prod^{i=N}_{1}W_{j}(t):

It follows that for every j=1Nj=1\ldots{N}:

where []Nj+1N[\cdot]^{\frac{N-j+1}{N}} and []jN[\cdot]^{\frac{j}{N}} stand for fractional power operators defined over positive semidefinite matrices.

With Equations 23 and 24 in place, we are finally in a position to complete the proof. Returning to Equation 16, we multiply W˙j(t)\dot{W}_{j}(t) from the left by j+1i=NWi(t)\prod^{i=N}_{j+1}W_{i}(t) and from the right by 1i=j1Wi(t)\prod^{i=j-1}_{1}W_{i}(t), followed by summation over j=1Nj=1\ldots{N}. This gives:

By definition We(t)=1i=NWj(t)W_{e}(t)=\prod^{i=N}_{1}W_{j}(t), so we can substitute the first two lines above:

Finally, plugging in the relations in Equations 23 and 24, the sought-after result is revealed:

Then, if ff and gg assume the same value at some t0It_{0}\in{I} (interior or boundary), they must coincide along the entire interval, i.e. it must hold that f(t)=g(t)f(t)=g(t) for all tIt\in{I}.

We know that h(t0)=0h(t_{0})=0 for some t0It_{0}\in{I}, and would like to show that h(t)=0 tIh(t)=0~{}\forall{t}\in{I}. Assume by contradiction that this is not the case, so there exists t2It_{2}\in{I} for which h(t2)0h(t_{2})\neq 0. Without loss of generality, suppose that h(t2)>0h(t_{2})>0, and that t2>t0t_{2}>t_{0}. Let SS be the zero set of hh, i.e. S:={tI:h(t)=0}S:=\{t\in{I}:h(t)=0\}. Since hh is continuous in II, SS is topologically closed, therefore its intersection with the interval [t0,t2][t_{0},t_{2}] is compact. Denote by t1t_{1} the maximal element in this intersection, and consider the interval J:=[t1,t2]IJ:=[t_{1},t_{2}]\subset{I}. By construction, hh is positive along JJ, besides on the endpoint t1t_{1} where it assumes the value of zero. For t1<tt2t_{1}<t\leq{t}_{2}, we may solve as follows the differential equation of hh (Equation 25):

where β\beta is the positive constant defined by h(t2)=βeαt2h(t_{2})=\beta{e}^{-\alpha{t_{2}}}. Since in particular hh is bounded away from zero on (t1,t2](t_{1},t_{2}], and assumes zero at t1t_{1}, we obtain a contradiction to its continuity. This completes the proof. ∎

A.2 Proof of Claim 1

Our proof relies on the Kronecker product operation for matrices. For arbitrary matrices AA and BB of sizes ma×nam_{a}\times{n}_{a} and mb×nbm_{b}\times{n}_{b} respectively, the Kronecker product ABA\odot{B} is defined to be the following block matrix:

where aija_{ij} stands for the element in row ii and column jj of AA. The Kronecker product admits numerous useful properties. We will employ the following:

If AA and BB are matrices such that the matrix product ABAB is defined, then:

where IrAI_{r_{A}} and IcBI_{c_{B}} are the identity matrices whose sizes correspond, respectively, to the number of rows in AA and the number of columns in BB. vec()vec(\cdot) here, as in claim statement, stands for matrix vectorization in column-first order.

If A1A_{1}, A2A_{2}, B1B_{1} and B2B_{2} are matrices such that the matrix products A1B1A_{1}B_{1} and A2B2A_{2}B_{2} are defined, then:

Equation 28 and 29 imply, that if AA and BB are some orthogonal matrices, so is ABA\odot{B}:

With the Kronecker product in place, we proceed to the actual proof. It suffices to show that vectorizing:

where PWe(t){P}_{W_{e}^{(t)}} is the preconditioning matrix defined in claim statement. For notational conciseness, we hereinafter omit the iteration index tt, and simply write WeW_{e} instead of We(t)W_{e}^{(t)}.

Let IdI_{d} and IkI_{k} be the identity matrices of sizes d×dd\times{d} and k×kk\times{k} respectively. Utilizing the properties of the Kronecker product, we have:

The first equality here makes use of Equation 27, and the second of Equation 28. We will show that the matrix:

meets the characterization of PWe{P}_{W_{e}}, thereby completing the proof. Let:

The third equality here is based on the relation in Equation 28, and the last equality is based on Equation 29. Denoting:

Now, since by definition UU and VV are orthogonal, OO is orthogonal as well (follows from the relation in Equation 30). Additionally, the fact that DD is rectangular-diagonal implies that the square matrix Λ\Lambda is also diagonal. Equation 34 is thus an orthogonal eigenvalue decomposition of QQ. Finally, denote the columns of UU (left singular vectors of WeW_{e}) by u1uk{\mathbf{u}}_{1}\ldots{\mathbf{u}}_{k}, those of VV (right singular vectors of WeW_{e}) by v1vd{\mathbf{v}}_{1}\ldots{\mathbf{v}}_{d}, and the diagonal elements of DD (singular values of WeW_{e}) by σ1σmax{k,d}\sigma_{1}\ldots\sigma_{\max\{k,d\}} (by definition σr=0\sigma_{r}=0 if r>min{k,d}r>\min\{k,d\}). The definitions in Equations 32 and 33 imply that the columns of OO are:

with corresponding diagonal elements in Λ\Lambda being:

We conclude that QQ indeed meets the characterization of PWeP_{W_{e}} in claim statement. This completes the proof.

A.3 Proof of Claim 2

We disregard the trivial case N=1N=1, as well as the scenario We(t)=0W_{e}^{(t)}=0 (both lead Equations 10 and 12 to equate). Omitting the iteration index tt from our notation, it suffices to show that:

The latter expression is precisely the second line of Equation 35, thus our proof is complete.

A.4 Proof of Theorem 2

Our proof relies on elementary differential geometry: curves, arc length and line integrals (see Chapters 8 and 9 in Buck (2003)).

In words, the line integral of the gradient of hh over Γ\Gamma, is equal to the difference between the value taken by hh at the end-point of Γ\Gamma, and that taken at the start-point. A direct implication of the theorem is that if Γ\Gamma is closed (γe=γs\gamma_{e}=\gamma_{s}), the line integral vanishes:

We conclude that if F()F(\cdot) is the gradient field of some function, its line integral over any closed (piecewise smooth) curve lying in U{\mathcal{U}} must vanish. We will show that this is not the case.

In light of the above, to show that F()F(\cdot) contradicts the gradient theorem, thereby completing the proof, it suffices to find a closed (piecewise smooth) curve Γ\Gamma for which the linear functional ϕΓFϕ\phi\mapsto\oint_{\Gamma}{F}_{\phi} does not vanish at ϕ=dL1dW\phi=\frac{dL^{1}}{dW}. By assumption dL1dW(W=0)0\frac{dL^{1}}{dW}(W{=}0)\neq 0, and so we may define the unit vector in the direction of dL1dW(W=0)\frac{dL^{1}}{dW}(W{=}0):

Let RR be a positive constant small enough such that the Euclidean ball of radius RR around the origin is contained in U{\mathcal{U}}. Let rr be a positive constant smaller than RR. Define Γr,R\Gamma_{r,R} to be a curve as follows (see illustration in Figure 1): The proof would have been slightly simplified had we used a curve that passes directly through the origin. We avoid this in order to emphasize that the result is not driven by some point-wise singularity (the origin received special treatment in the definition of F()F(\cdot) – see Equations 14 and 13).

Γr,R1\Gamma_{r,R}^{1} is the line segment from Re-R\cdot{\mathbf{e}} to re-r\cdot{\mathbf{e}}.

Γr,R2\Gamma_{r,R}^{2} is a geodesic on the sphere of radius rr, starting from re-r\cdot{\mathbf{e}} and ending at rer\cdot{\mathbf{e}}.

Γr,R3\Gamma_{r,R}^{3} is the line segment from rer\cdot{\mathbf{e}} to ReR\cdot{\mathbf{e}}.

Γr,R4\Gamma_{r,R}^{4} is a geodesic on the sphere of radius RR, starting from ReR\cdot{\mathbf{e}} and ending at Re-R\cdot{\mathbf{e}}.

Γr,R\Gamma_{r,R} is a piecewise smooth, closed curve that fully lies within U{\mathcal{U}}. Consider the linear functional it induces: ϕΓr,RFϕ\phi\mapsto\oint_{\Gamma_{r,R}}{F}_{\phi}. We will evaluate this functional on ϕ=dL1dW\phi=\frac{dL^{1}}{dW}. To do so, we decompose the latter as follows:

cc is a scalar equal to dL1dW(W=0)\|\tfrac{dL^{1}}{dW}(W{=}0)\|.

e(){\mathbf{e}}(\cdot) is a vector field returning the constant e{\mathbf{e}} (Equation 37).

ξ()\xi(\cdot) is a vector field returning the values of dL1dW()\frac{dL^{1}}{dW}(\cdot) shifted by the constant dL1dW(W=0)-\tfrac{dL^{1}}{dW}(W{=}0). It is continuous on U{\mathcal{U}} and vanishes at the origin.

Applying Lemma 2 to ξ\xi over Γr,R\Gamma_{r,R} gives:

The linearity of the functional ϕΓr,RFϕ\phi\mapsto\oint_{\Gamma_{r,R}}{F}_{\phi}, along with Equation 39, then imply:

We will show that for proper choices of RR and rr, the lower bound above is positive. Γr,R\Gamma_{r,R} will then be a piecewise smooth closed curve lying in U{\mathcal{U}}, for which the functional ϕΓr,RFϕ\phi\mapsto\oint_{\Gamma_{r,R}}{F}_{\phi} does not vanish at ϕ=dL1dW\phi=\frac{dL^{1}}{dW}. As stated, this will imply that F()F(\cdot) violates the gradient theorem, thereby concluding our proof.

All that is left is to affirm that the expression:

can indeed be made positive with proper choices of RR and rr. Recall that:

N>2N>2 by assumption; implies 2N3\nicefrac2N2>0\frac{2N}{3-\nicefrac{{2}}{{N}}}-2>0.

RR is any positive constant small enough such that the ball of radius RR around the origin is contained in U{\mathcal{U}}.

rr is any positive constant smaller than RR.

Γr,R\Gamma_{r,R} is a curve whose points are all within distance RR from the origin.

c=dL1dW(W=0)c=\|\tfrac{dL^{1}}{dW}(W{=}0)\| – positive by assumption.

ξ()\xi(\cdot) is a vector field that is continuous on U{\mathcal{U}} and vanishes at the origin.

The following procedure gives RR and rr as required:

Set rr to follow RR such that: r32N=0.5R32Nr^{3{-}\frac{2}{N}}=0.5\cdot{R}^{3{-}\frac{2}{N}}.

Choose ϵ>0\epsilon>0 for which 0.5c(2N32N2)2πNϵ>00.5c\left(\frac{2N}{3-\frac{2}{N}}{-}2\right)-2\pi{N}\epsilon>0.

Set RR to be small enough such that ξ(w)ϵ\left\|\xi({\mathbf{w}})\right\|\leq\epsilon for any point w{\mathbf{w}} within distance RR from the origin.

where len(Γ)len(\Gamma) is the arc length of Γ\Gamma, and γΓ\gamma\in\Gamma refers to a point lying on the curve.

We begin by noting that the use of max\max (as opposed to sup\sup) in stated upper bound is appropriate, since under our definition of a curve (adopted from Buck (2003)), points lying on it constitute a compact set. This subtlety is of little importance – one may as well replace max\max by sup\sup, and the lemma would still serve its purpose.

It is not difficult to see that for any wU{\mathbf{w}}\in{\mathcal{U}}, w0{\mathbf{w}}\neq 0:

Trivially, Fϕ(w)Nw22Nϕ(w)\left\|F_{\phi}({\mathbf{w}})\right\|\leq{N}\left\|{\mathbf{w}}\right\|^{2{-}\frac{2}{N}}\left\|\phi({\mathbf{w}})\right\| holds for w=0{\mathbf{w}}{=}0 as well. The sought-after result now follows from the properties of line integrals:

Let e{\mathbf{e}} be a unit vector, let Γr,R\Gamma_{r,R} be a piecewise smooth closed curve as specified in Equation 38 and the text thereafter, and let ϕFϕ\phi\mapsto{F}_{\phi} be the operator on continuous vector fields defined by Equation 36. Overloading notation by regarding e()e{\mathbf{e}}(\cdot)\equiv{\mathbf{e}} as a constant vector field, it holds that:

We compute the line integral by decomposing Γr,R\Gamma_{r,R} into its smooth components Γr,R1Γr,R4\Gamma_{r,R}^{1}\ldots\Gamma_{r,R}^{4}:

Starting from Γr,R1\Gamma_{r,R}^{1}, notice that for every point w{\mathbf{w}} lying on this curve: e,wwww=e\langle{\mathbf{e}},\frac{{\mathbf{w}}}{\left\|{\mathbf{w}}\right\|}\rangle\frac{{\mathbf{w}}}{\left\|{\mathbf{w}}\right\|}={\mathbf{e}}. Therefore:

The line integral on the right translates into a simple univariate integral:

Turning to Γr,R2\Gamma_{r,R}^{2}, note that for any point w{\mathbf{w}} along this curve w22N=r22N\left\|{\mathbf{w}}\right\|^{2-\frac{2}{N}}=r^{2-\frac{2}{N}}, and ww\frac{{\mathbf{w}}}{\left\|{\mathbf{w}}\right\|} is perpendicular to the direction of motion. This implies:

The line integral Γr,R2e\int_{\Gamma_{r,R}^{2}}{\mathbf{e}} is simply equal to the progress Γr,R2\Gamma_{r,R}^{2} makes in the direction of e{\mathbf{e}}, which is 2r2r. Accordingly:

As for Γr,R3\Gamma_{r,R}^{3} and Γr,R4\Gamma_{r,R}^{4}, their line integrals may be computed similarly to those of Γr,R1\Gamma_{r,R}^{1} and Γr,R2\Gamma_{r,R}^{2} respectively. Such computations yield:

Combining Equation 40 with Equations 41, 42, 43 and 44, we obtain the desired result. ∎

Appendix B A Concrete Acceleration Bound

In Section 7 we illustrated qualitatively, on a family of very simple hypothetical learning problems, the potential of overparameterization (use of depth-NN linear network in place of classic linear model) to accelerate optimization. In this appendix we demonstrate how the illustration can be made formal, by considering a special case and deriving a concrete bound on the acceleration.

As shown in Section 7, under gradient descent, w1w_{1} and w2w_{2} move independently, and to prevent divergence, the learning rate must satisfy η<min{2/y1p2,2/y2p2}\eta<\min\{2/y_{1}^{p-2},2/y_{2}^{p-2}\}. In our setting, this translates to (GD below stands for gradient descent):

For w2w_{2}, the optimal learning rate (convergence in a single step) is 1/y221/y_{2}^{2}, and the constraint above will lead to very slow convergence (see Equation 15 and its surrounding text).

Suppose now that we optimize via overparameterization, i.e. with the update rule in Equation 12 (single output). In our particular setting (recall, in addition to the above, that we omitted weight decay for simplicity – λ=0\lambda=0), this update rule translates to:

For the first iteration (t=0t=0), replacing ϵi:=wi(0)\epsilon_{i}:=|w_{i}^{(0)}|, while recalling that y1y2ϵ1ϵ2y_{1}\gg{y_{2}}\gg\epsilon_{1}\gg\epsilon_{2}, we obtain:

Set η=1/2ϵ1y12\eta=1/2\epsilon_{1}y_{1}^{2}. Then w1(1)y1w_{1}^{(1)}\approx{y}_{1} and w2(1)y23/2y12+ϵ2y1/2ϵ1w_{2}^{(1)}\approx{y}_{2}^{3}/2y_{1}^{2}+\epsilon_{2}y_{1}/2\epsilon_{1}. Our assumptions thus far (y1y2y_{1}\gg{y}_{2} and ϵ1ϵ2\epsilon_{1}\gg\epsilon_{2}) imply w1(1)w2(1)w_{1}^{(1)}\gg{w}_{2}^{(1)}. Moreover, since ϵ2/ϵ1y2/y1\epsilon_{2}/\epsilon_{1}\approx{y}_{2}/y_{1}, it holds that w2(1)O(y2)=O(1)w_{2}^{(1)}\in{\mathcal{O}}(y_{2})={\mathcal{O}}(1). Taking all of this into account, the second iteration (t=1t=1) of the overparameterized update rule (Equation 46) becomes:

In words, w1w_{1} will stay approximately equal to y1y_{1}, whereas w2w_{2} will take a step that corresponds to gradient descent with learning rate (OP below stands for overparameterization):

By assumption ϵ1y11\epsilon_{1}y_{1}{\gg}1 and y2O(1)y_{2}{\in}{\mathcal{O}}(1), thus ηOP<2/y22\eta^{OP}{<}2/y_{2}^{2}, meaning that w2w_{2} will remain on the order of y2y_{2} (or less). An inductive argument can therefore be applied, and our observation regarding the second iteration (t=1t=1) continues to hold throughout – w1w_{1} is (approximately) fixed at y1y_{1}, and w2w_{2} follows steps that correspond to gradient descent with learning rate ηOP\eta^{OP}.

To summarize our findings, we have shown that while standard gradient descent limits w2w_{2} with a learning rate ηGD\eta^{GD} that is at most 2/y122/y_{1}^{2} (Equation 45), overparameterization can be adjusted to induce on w2w_{2} an implicit gradient descent scheme with learning rate ηOP=1/2ϵ1y1\eta^{OP}=1/2\epsilon_{1}y_{1} (Equation 47), all while admitting immediate (single-step) convergence for w1w_{1}. Since both ηGD\eta^{GD} and ηOP\eta^{OP} are well below 1/y221/y_{2}^{2}, we obtain acceleration by at least ηOP/ηGD>y1/4ϵ1\eta^{OP}/\eta^{GD}>y_{1}/4\epsilon_{1} (we remind the reader that y11y_{1}\gg 1 is the target value of w1w_{1}, and ϵ11\epsilon_{1}\ll 1 is the magnitude of its initialization).

Appendix C Implementation Details

Below we provide implementation details omitted from our experimental report (Section 8).

The details hereafter apply to all of our experiments besides that on the convolutional network (Figure 5-right).

In accordance with our theoretical setup (Section 4), evaluated linear networks did not include bias terms, only weight matrices. The latter were initialized to small values, drawn i.i.d. from a Gaussian distribution with mean zero and standard deviation 0.010.01. The only exception to this was the setting of identity initialization (Figure 5-left), in which an offset of 11 was added to the diagonal elements of each weight matrix (including those that are not square).

When applying a grid search over learning rates, the values {105,5105,,101,5101}\{10^{-5},5\cdot 10^{-5},\ldots,10^{-1},5\cdot 10^{-1}\} were tried. We note that in the case of depth-88 network with standard near-zero initialization (Figure 5-left), all learning rates led either to divergence, or to a failure to converge (vanishing gradients).

C.2 Convolutional Network

For the experiment on TensorFlow’s MNIST convolutional network tutorial, we simply downloaded the code, https://github.com/tensorflow/models/tree/master/tutorials/image/mnist and introduced two minor changes:

Hidden dense layer: 3136×5123136{\times}512 weight matrix replaced by multiplication of 3136×5123136{\times}512 and 512×512512{\times}512 matrices.

Output layer: 512×10512{\times}10 weight matrix replaced by multiplication of 512×10512{\times}10 and 10×1010{\times}10 matrices.

The newly introduced weight matrices were initialized in the same way as their predecessors (random Gaussian distribution with mean zero and standard deviation 0.10.1). Besides the above, no change was made. An addition of roughly 250K250K parameters to a 1.6M1.6M-parameter model gave the speedup presented in Figure 5-right.

To rule out the possibility of the speedup resulting from suboptimal learning rates, we reran the experiment with grid search over the latter. The learning rate hardcoded into the tutorial follows an exponentially decaying schedule, with base value 10210^{-2}. For both the original and overparameterized models, training was run multiple times, with the base value varying in {105,5105,,101,5101}\{10^{-5},5\cdot 10^{-5},\ldots,10^{-1},5\cdot 10^{-1}\}. We chose, for each model separately, the configuration giving fastest convergence, and then compared the models one against the other. The observed gap in convergence rates was similar to that in Figure 5-right.

An additional point we set out to examine, is the sensitivity of the speedup to initialization of overparameterized layers. For this purpose, we retrained the overparameterized model multiple times, varying in {103,5103,,101,5101}\{10^{-3},5\cdot 10^{-3},\ldots,10^{-1},5\cdot 10^{-1}\} the standard deviation of the Gaussian distribution initializing overparameterized layers (as stated above, this standard deviation was originally set to 10110^{-1}). Convergence rates across the different runs were almost identical. In particular, they were all orders of magnitude faster than the convergence rate of the baseline, non-overparameterized model.