On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

Sanjeev Arora, Nadav Cohen, Elad Hazan

Introduction

How does depth help? This central question of deep learning still eludes full theoretical understanding. The general consensus is that there is a trade-off: increasing depth improves expressiveness, but complicates optimization. Superior expressiveness of deeper networks, long suspected, is now confirmed by theory, albeit for fairly limited learning problems (Eldan & Shamir, 2015; Raghu et al., 2016; Lee et al., 2017; Cohen et al., 2017; Daniely, 2017; Arora et al., 2018). Difficulties in optimizing deeper networks have also been long clear – the signal held by a gradient gets buried as it propagates through many layers. This is known as the “vanishing/exploding gradient problem”. Modern techniques such as batch normalization (Ioffe & Szegedy, 2015) and residual connections (He et al., 2015) have somewhat alleviated these difficulties in practice.

Given the longstanding consensus on expressiveness vs. optimization trade-offs, this paper conveys a rather counterintuitive message: increasing depth can accelerate optimization. The effect is shown, via first-cut theoretical and empirical analyses, to resemble a combination of two well-known tools in the field of optimization: momentum, which led to provable acceleration bounds (Nesterov, 1983); and adaptive regularization, a more recent technique proven to accelerate by Duchi et al. (2011) in their proposal of the AdaGrad algorithm. Explicit mergers of both techniques are quite popular in deep learning (Kingma & Ba, 2014; Tieleman & Hinton, 2012). It is thus intriguing that merely introducing depth, with no other modification, can have a similar effect, but implicitly.

There is an obvious hurdle in isolating the effect of depth on optimization: if increasing depth leads to faster training on a given dataset, how can one tell whether the improvement arose from a true acceleration phenomenon, or simply due to better representational power (the shallower network was unable to attain the same training loss)? We respond to this hurdle by focusing on linear neural networks (cf. Saxe et al. (2013); Goodfellow et al. (2016); Hardt & Ma (2016); Kawaguchi (2016)). With these models, adding layers does not alter expressiveness; it manifests itself only in the replacement of a matrix parameter by a product of matrices – an overparameterization.

Both our theoretical analysis and our empirical evaluation indicate that acceleration via overparameterization need not be computationally expensive. From an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters.

Related Work

Theoretical study of optimization in deep learning is a highly active area of research. Works along this line typically analyze critical points (local minima, saddles) in the landscape of the training objective, either for linear networks (see for example Kawaguchi (2016); Hardt & Ma (2016) or Baldi & Hornik (1989) for a classic account), or for specific non-linear networks under different restrictive assumptions (cf. Choromanska et al. (2015); Haeffele & Vidal (2015); Soudry & Carmon (2016); Safran & Shamir (2017)). Other works characterize other aspects of objective landscapes, for example Safran & Shamir (2016) showed that under certain conditions a monotonically descending path from initialization to global optimum exists (in compliance with the empirical observations of Goodfellow et al. (2014)).

Turning to general optimization, accelerated gradient (momentum) methods were introduced in Nesterov (1983), and later studied in numerous works (see Wibisono et al. (2016) for a short review). Such methods effectively accumulate gradients throughout the entire optimization path, using the collected history to determine the step at a current point in time. Use of preconditioners to speed up optimization is also a well-known technique. Indeed, the classic Newton’s method can be seen as preconditioning based on second derivatives. Adaptive preconditioning with only first-order (gradient) information was popularized by the BFGS algorithm and its variants (cf. Nocedal (1980)). Relevant theoretical guarantees, in the context of regret minimization, were given in Hazan et al. (2007); Duchi et al. (2011). In terms of combining momentum and adaptive preconditioning, Adam (Kingma & Ba, 2014) is a popular approach, particularly for optimization of deep networks.

Algorithms with certain theoretical guarantees for non-convex optimization, and in particular for training deep neural networks, were recently suggested in various works, for example Ge et al. (2015); Agarwal et al. (2017); Carmon et al. (2016); Janzamin et al. (2015); Livni et al. (2014) and references therein. Since the focus of this paper lies on the analysis of algorithms already used by practitioners, such works lie outside our scope.

Obviously the overparameterization does not affect the expressiveness of the linear model. How does it affect optimization? What happens to gradient descent on this non-convex objective?

Gradient descent over $L({\mathbf{w}}_{1},w_{2})$ , with fixed small learning rate and near-zero initialization, is equivalent to gradient descent over $L({\mathbf{w}})$ with particular adaptive learning rate and momentum terms.

To see this, consider the gradients of $L({\mathbf{w}})$ and $L({\mathbf{w}}_{1},w_{2})$ :

Gradient descent over $L({\mathbf{w}}_{1},w_{2})$ with learning rate $\eta>0$ :

The dynamics of the underlying parameter ${\mathbf{w}}={\mathbf{w}}_{1}w_{2}$ are:

We conclude that the dynamics governing the underlying parameter ${\mathbf{w}}$ correspond to gradient descent with a momentum term, where both the learning rate ( $\rho^{(t)}$ ) and momentum coefficients ( $\mu^{(t,\tau)}$ ) are time-varying and adaptive.

Linear Neural Networks

For completeness, we regard a depth- $1$ network as the family of directly parameterized linear predictors, i.e. we set $\Phi^{1}:=\Phi^{lin}$ (see Equation 1).

and so the sole difference between the training loss of a depth- $N$ network and that of a depth- $1$ network (classic linear model) lies in the replacement of a matrix parameter by a product of $N$ matrices. This implies that if increasing $N$ can indeed accelerate convergence, it is not an outcome of any phenomenon other than favorable properties of depth-induced overparameterization for optimization.

Implicit Dynamics of Gradient Descent

In this section we present a new result for linear neural networks, tying the dynamics of gradient descent on $L^{N}(\cdot)$ – the training loss corresponding to a depth- $N$ network, to those on $L^{1}(\cdot)$ – training loss of a depth- $1$ network (classic linear model). Specifically, we show that gradient descent on $L^{N}(\cdot)$ , a complicated and seemingly pointless overparameterization, can be directly rewritten as a particular preconditioning scheme over gradient descent on $L^{1}(\cdot)$ .

When applied to $L^{N}(\cdot)$ , gradient descent takes on the following form:

$\eta>0$ here is a learning rate, and $\lambda\geq 0$ is an optional weight decay coefficient. For simplicity, we regard both $\eta$ and $\lambda$ as fixed (no dependence on $t$ ). Define the underlying end-to-end weight matrix:

Given that $L^{N}(W_{1},\ldots,W_{N})=L^{1}(W_{e})$ (Equation 3), we view $W_{e}$ as an optimized weight matrix for $L^{1}(\cdot)$ , whose dynamics are governed by Equation 4. Our interest then boils down to the study of these dynamics for different choices of $N$ . For $N=1$ they are (trivially) equivalent to standard gradient descent over $L^{1}(\cdot)$ . We will characterize the dynamics for $N\geq 2$ .

To be able to derive, in our general setting, an explicit update rule for the end-to-end weight matrix $W_{e}$ (Equation 5), we introduce an assumption by which the learning rate is small, i.e. $\eta^{2}\approx 0$ . Formally, this amounts to translating Equation 4 to the following set of differential equations:

where $t$ is now a continuous time index, and $\dot{W}_{j}(t)$ stands for the derivative of $W_{j}$ with respect to time. The use of differential equations, for both theoretical analysis and algorithm design, has a long and rich history in optimization research (see Helmke & Moore (2012) for an overview). When step sizes (learning rates) are taken to be small, trajectories of discrete optimization algorithms converge to smooth curves modeled by continuous-time differential equations, paving way to the well-established theory of the latter (cf. Boyce et al. (1969)). This approach has led to numerous interesting findings, including recent results in the context of acceleration methods (e.g. Su et al. (2014); Wibisono et al. (2016)).

With the continuous formulation in place, we turn to express the dynamics of the end-to-end matrix $W_{e}$ :

Assume the weight matrices $W_{1}{\ldots}W_{N}$ follow the dynamics of continuous gradient descent (Equation 6). Assume also that their initial values (time $t_{0}$ ) satisfy, for $j=1{\ldots}N-1$ :

Then, the end-to-end weight matrix $W_{e}$ (Equation 5) is governed by the following differential equation:

where $[\cdot]^{\frac{j-1}{N}}$ and $[\cdot]^{\frac{N-j}{N}}$ , $j=1\ldots{N}$ , are fractional power operators defined over positive semidefinite matrices.

(sketch – full details in Appendix A.1) If $\lambda\,{=}\,0$ (no weight decay) then one can easily show that $W_{j+1}^{\top}(t)\dot{W}_{j+1}(t)=\dot{W}_{j}(t)W_{j}^{\top}(t)$ throughout optimization. Taking the transpose of this equation and adding to itself, followed by integration over time, imply that the difference between $W_{j+1}^{\top}(t)W_{j+1}(t)$ and $W_{j}(t)W_{j}^{\top}(t)$ is constant. This difference is zero at initialization (Equation 7), thus will remain zero throughout, i.e.:

A slightly more delicate treatment shows that this is true even if $\lambda>0$ , i.e. with weight decay included.

Equation 9 implies alignment of the (left and right) singular spaces of $W_{j}(t)$ and $W_{j+1}(t)$ , simplifying the product $W_{j+1}(t)W_{j}(t)$ . Successive application of this simplification allows a clean computation for the product of all layers (that is, $W_{e}$ ), leading to the explicit form presented in theorem statement (Equation 8). ∎

Translating the continuous dynamics of Equation 8 back to discrete time, we obtain the sought-after update rule for the end-to-end weight matrix:

This update rule relies on two assumptions: first, that the learning rate $\eta$ is small enough for discrete updates to approximate continuous ones; and second, that weights are initialized on par with Equation 7, which will approximately be the case if initialization values are close enough to zero. It is customary in deep learning for both learning rate and weight initializations to be small, but nonetheless above assumptions are only met to a certain extent. We support their applicability by showing empirically (Section 8) that the end-to-end update rule (Equation 10) indeed provides an accurate description for the dynamics of $W_{e}$ .

A close look at Equation 10 reveals that the dynamics of the end-to-end weight matrix $W_{e}$ are similar to gradient descent over $L^{1}(\cdot)$ – training loss corresponding to a depth- $1$ network (classic linear model). The only difference (besides the scaling by $N$ of the weight decay coefficient $\lambda$ ) is that the gradient $\frac{dL^{1}}{dW}(W_{e})$ is subject to a transformation before being used. Namely, for $j=1{\ldots}N$ , it is multiplied from the left by $[W_{e}W_{e}^{\top}]^{\frac{j-1}{N}}$ and from the right by $[W_{e}^{\top}{W}_{e}]^{\frac{N-j}{N}}$ , followed by summation over $j$ . Clearly, when $N=1$ (depth- $1$ network) this transformation reduces to identity, and as expected, $W_{e}$ precisely adheres to gradient descent over $L^{1}(\cdot)$ . When $N\geq 2$ the dynamics of $W_{e}$ are less interpretable. We arrange it as a vector to gain more insight:

For an arbitrary matrix $A$ , denote by $vec(A)$ its arrangement as a vector in column-first order. Then, the end-to-end update rule in Equation 10 can be written as:

The result readily follows from the properties of the Kronecker product – see Appendix A.2 for details. ∎

Claim 1 implies that in the end-to-end update rule of Equation 10, the transformation applied to the gradient $\frac{dL^{1}}{dW}(W_{e})$ is essentially a preconditioning, whose eigendirections and eigenvalues depend on the singular value decomposition of $W_{e}$ . The eigendirections are the rank- $1$ matrices ${\mathbf{u}}_{r}{\mathbf{v}}_{r^{\prime}}^{\top}$ , where ${\mathbf{u}}_{r}$ and ${\mathbf{v}}_{r^{\prime}}$ are left and right (respectively) singular vectors of $W_{e}$ . The eigenvalue of ${\mathbf{u}}_{r}{\mathbf{v}}_{r^{\prime}}^{\top}$ is $\sum_{j=1}^{N}\sigma_{r}^{2(N-j)/N}\sigma_{r^{\prime}}^{2(j-1)/N}$ , where $\sigma_{r}$ and $\sigma_{r^{\prime}}$ are the singular values of $W_{e}$ corresponding to ${\mathbf{u}}_{r}$ and ${\mathbf{v}}_{r^{\prime}}$ (respectively). When $N\geq 2$ , an increase in $\sigma_{r}$ or $\sigma_{r^{\prime}}$ leads to an increase in the eigenvalue corresponding to the eigendirection ${\mathbf{u}}_{r}{\mathbf{v}}_{r^{\prime}}^{\top}$ . Qualitatively, this implies that the preconditioning favors directions that correspond to singular vectors whose presence in $W_{e}$ is stronger. We conclude that the effect of overparameterization, i.e. of replacing a classic linear model (depth- $1$ network) by a depth- $N$ linear network, boils down to modifying gradient descent by promoting movement along directions that fall in line with the current location in parameter space. A-priori, such a preference may seem peculiar – why should an optimization algorithm be sensitive to its location in parameter space? Indeed, we generally expect sensible algorithms to be translation invariant, i.e. be oblivious to parameter value. However, if one takes into account the common practice in deep learning of initializing weights near zero, the location in parameter space can also be regarded as the overall movement made by the algorithm. We thus interpret our findings as indicating that overparameterization promotes movement along directions already taken by the optimization, and therefore can be seen as a form of acceleration. This intuitive interpretation will become more concrete in the subsection that follows.

A final point to make, is that the end-to-end update rule (Equation 10 or 11), which obviously depends on $N$ – number of layers in the deep linear network, does not depend on the hidden widths $n_{1}\ldots{n}_{N-1}$ (see Section 4). This implies that from an optimization perspective, overparameterizing using wide or narrow networks has the same effect – it is only the depth that matters. Consequently, the acceleration of overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

To facilitate a straightforward presentation of our findings, we hereinafter focus on the special case where the optimized models have a single output, i.e. where $k=1$ . This corresponds, for example, to a binary (two-class) classification problem, or to the prediction of a numeric scalar property (regression). It admits a particularly simple form for the end-to-end update rule of Equation 10:

The result follows from the definition of a fractional power operator over matrices – see Appendix A.3. ∎

Claim 2 implies that in the single output case, the effect of overparameterization (replacing classic linear model by depth- $N$ linear network) on gradient descent is twofold: first, it leads to an adaptive learning rate schedule, by introducing the multiplicative factor $\left\|W_{e}\right\|_{2}^{2-2/N}$ ; and second, it amplifies (by $N$ ) the projection of the gradient on the direction of $W_{e}$ . Recall that we view $W_{e}$ not only as the optimized parameter, but also as the overall movement made in optimization (initialization is assumed to be near zero). Accordingly, the adaptive learning rate schedule can be seen as gaining confidence (increasing step sizes) when optimization moves farther away from initialization, and the gradient projection amplification can be thought of as a certain type of momentum that favors movement along the azimuth taken so far. These effects bear potential to accelerate convergence, as we illustrate qualitatively in Section 7, and demonstrate empirically in Section 8.

Overparametrization Effects Cannot Be Attained via Regularization

Adding a regularizer to the objective is a standard approach for improving optimization (though lately the term regularization is typically associated with generalization). For example, AdaGrad was originally invented to compete with the best regularizer from a particular family. The next theorem shows (for single output case) that the effects of overparameterization cannot be attained by adding a regularization term to the original training loss, or via any similar modification. This is not obvious a-priori, as unlike many acceleration methods that explicitly maintain memory of past gradients, updates under overparametrization are by definition the gradients of something. The assumptions in the theorem are minimal and also necessary, as one must rule-out the trivial counter-example of a constant training loss.

where $Pr_{W}\{\cdot\}$ is the projection given in Equation 13. Then, there exists no function (of $W$ ) whose gradient field is $F(\cdot)$ .

(sketch – full details in Appendix A.4) The proof uses elementary differential geometry (Buck, 2003): curves, arc length and the fundamental theorem for line integrals, which states that the integral of $\nabla{g}$ for any differentiable function $g$ amounts to along every closed curve.

Overparametrization changes gradient descent’s behavior: instead of following the original gradient $\frac{dL^{1}}{dW}$ , it follows some other direction $F(\cdot)$ (see Equations 12 and 14) that is a function of the original gradient as well as the current point $W$ . We think of this change as a transformation that maps one vector field $\phi(\cdot)$ to another – $F_{\phi}(\cdot)$ :

Notice that for $\phi=\frac{dL^{1}}{dW}$ , we get exactly the vector field $F(\cdot)$ defined in theorem statement.

We note simple properties of the mapping $\phi\mapsto{F}_{\phi}$ . First, it is linear, since for any vector fields $\phi_{1},\phi_{2}$ and scalar $c$ : $F_{\phi_{1}{+}\phi_{2}}{=}F_{\phi_{1}}{+}F_{\phi_{2}}$ and $F_{c{\cdot}\phi_{1}}{=}c{\cdot}{F}_{\phi_{1}}$ . Second, because of the linearity of line integrals, for any curve $\Gamma$ , the functional $\phi\mapsto\int_{\Gamma}F_{\phi}$ , a mapping of vector fields to scalars, is linear.

We show that $F(\cdot)$ contradicts the fundamental theorem for line integrals. To do so, we construct a closed curve $\Gamma{=}\Gamma_{r,R}$ for which the linear functional $\phi\mapsto\oint_{\Gamma}{F}_{\phi}$ does not vanish at $\phi{=}\frac{dL^{1}}{dW}$ . Let ${\mathbf{e}}:=\frac{dL^{1}}{dW}(W{=}0)/\|\frac{dL^{1}}{dW}(W{=}0)\|$ , which is well-defined since by assumption $\frac{dL^{1}}{dW}(W{=}0){\neq 0}$ . For $r<R$ we define (see Figure 1):

$\Gamma_{r,R}^{1}$ is the line segment from $-R\cdot{\mathbf{e}}$ to $-r\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}^{2}$ is a spherical curve from $-r\cdot{\mathbf{e}}$ to $r\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}^{3}$ is the line segment from $r\cdot{\mathbf{e}}$ to $R\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}^{4}$ is a spherical curve from $R\cdot{\mathbf{e}}$ to $-R\cdot{\mathbf{e}}$ .

With the definition of $\Gamma_{r,R}$ in place, we decompose $\frac{dL^{1}}{dW}$ into a constant vector field $\kappa\,{\equiv}\,\frac{dL^{1}}{dW}(W{=}0)$ plus a residual $\xi$ . We explicitly compute the line integrals along $\Gamma_{r,R}^{1}\ldots\Gamma_{r,R}^{4}$ for $F_{\kappa}$ , and derive bounds for $F_{\xi}$ . This, along with the linearity of the functional $\phi\mapsto\int_{\Gamma}F_{\phi}$ , provides a lower bound on the line integral of $F(\cdot)$ over $\Gamma_{r,R}$ . We show the lower bound is positive as $r,R\to 0$ , thus $F(\cdot)$ indeed contradicts the fundamental theorem for line integrals. ∎

Illustration of Acceleration

To this end, we showed that overparameterization (use of depth- $N$ linear network in place of classic linear model) induces on gradient descent a particular preconditioning scheme (Equation 10 in general and 12 in the single output case), which can be interpreted as introducing some forms of momentum and adaptive learning rate. We now illustrate qualitatively, on a very simple hypothetical learning problem, the potential of these to accelerate optimization.

With fixed learning rate $\eta>0$ (weight decay omitted for simplicity), gradient descent over $L(\cdot)$ gives:

Changing variables per $\Delta_{i}=w_{i}-y_{i}$ , we have:

Assuming the original weights $w_{1}$ and $w_{2}$ are initialized near zero, $\Delta_{1}$ and $\Delta_{2}$ start off at $-y_{1}$ and $-y_{2}$ respectively, and will eventually reach the optimum $\Delta^{*}_{1}=\Delta^{*}_{2}=0$ if the learning rate is small enough to prevent divergence:

Suppose now that the problem is ill-conditioned, in the sense that $y_{1}{\gg}{y}_{2}$ . If $p=2$ this has no effect on the bound for $\eta$ . Optimal learning rate for gradient descent on quadratic objective does not depend on current parameter value (cf. Goh (2017)). If $p>2$ the learning rate is determined by $y_{1}$ , leading $\Delta_{2}$ to converge very slowly. In a sense, $\Delta_{2}$ will suffer from the fact that there is no “communication” between the coordinates (this will actually be the case not just with gradient descent, but with most algorithms typically used in large-scale settings – AdaGrad, Adam, etc.).

Now consider the scenario where we optimize $L(\cdot)$ via overparameterization, i.e. with the update rule in Equation 12 (single output). In this case the coordinates are coupled, and as $\Delta_{1}$ gets small ( $w_{1}$ gets close to $y_{1}$ ), the learning rate is effectively scaled by $y_{1}^{2-\frac{2}{N}}$ (in addition to a scaling by $N$ in coordinate $1$ only), allowing (if $y_{1}{>}1$ ) faster convergence of $\Delta_{2}$ . We thus have the luxury of temporarily slowing down $\Delta_{2}$ to ensure that $\Delta_{1}$ does not diverge, with the latter speeding up the former as it reaches safe grounds. In Appendix B we consider a special case and formalize this intuition, deriving a concrete bound for the acceleration of overparameterization.

Experiments

Our analysis (Section 5) suggests that overparameterization – replacement of a classic linear model by a deep linear network, induces on gradient descent a certain preconditioning scheme. We qualitatively argued (Section 7) that in some cases, this preconditioning may accelerate convergence. In this section we put these claims to the test, through a series of empirical evaluations based on TensorFlow toolbox (Abadi et al. (2016)). For conciseness, many of the details behind our implementation are deferred to Appendix C.

Alongside the validity of the end-to-end update rule, Figure 2 also demonstrates the negligible effect of network width on convergence, in accordance with our analysis (see Section 5). Specifically, it shows that in the evaluated setting, hidden layers of size $1$ (scalars) suffice in order for the essence of overparameterization to fully emerge. Unless otherwise indicated, all results reported hereinafter are based on this configuration, i.e. on scalar hidden layers. The computational toll associated with overparameterization will thus be virtually non-existent.

An immediate question arises at this point. If depth indeed accelerates convergence, why not add as many layers as one can computationally afford? The reason, which is actually apparent in our analysis, is the so-called vanishing gradient problem. When training a very deep network (large $N$ ), while initializing weights to be small, the end-to-end matrix $W_{e}$ (Equation 5) is extremely close to zero, severely attenuating gradients in the preconditioning scheme (Equation 10). A possible approach for alleviating this issue is to initialize weights to be larger, yet small enough such that the end-to-end matrix does not “explode”. The choice of identity (or near identity) initialization leads to what is known as linear residual networks (Hardt & Ma, 2016), akin to the successful residual networks architecture (He et al., 2015) commonly employed in deep learning. Notice that identity initialization satisfies the condition in Equation 7, rendering the end-to-end update rule (Equation 10) applicable. Figure 5-left shows convergence, under gradient descent, of a single layer model against deeper networks than those evaluated before – depths $4$ and $8$ . As can be seen, with standard, near-zero initialization, the depth- $4$ network starts making visible progress only after about $65K$ iterations, whereas the depth- $8$ network seems stuck even after $100K$ iterations. In contrast, under identity initialization, both networks immediately make progress, and again depth serves as an implicit accelerator.

As a final sanity test, we evaluate the effect of overparameterization on optimization in a non-idealized (yet simple) deep learning setting. Specifically, we experiment with the convolutional network tutorial for MNIST built into TensorFlow, https://github.com/tensorflow/models/tree/master/tutorials/image/mnist which includes convolution, pooling and dense layers, ReLU non-linearities, stochastic gradient descent with momentum, and dropout (Srivastava et al., 2014). We introduced overparameterization by simply placing two matrices in succession instead of the matrix in each dense layer. Here, as opposed to previous experiments, widths of the newly formed hidden layers were not set to $1$ , but rather to the minimal values that do not deteriorate expressiveness (see Appendix C). Overall, with an addition of roughly $15\%$ in number of parameters, optimization has accelerated considerably – see Figure 5-right. The displayed results were obtained with the hyperparameter settings hardcoded into the tutorial. We have tried alternative settings (varying learning rates and standard deviations of initializations – see Appendix C), and in all cases observed an outcome similar to that in Figure 5-right – overparameterization led to significant speedup. Nevertheless, as reported above for linear networks, it is likely that for non-linear networks the effect of depth on optimization is mixed – some settings accelerate by it, while others do not. Comprehensive characterization of the cases in which depth accelerates optimization warrants much further study. We hope our work will spur interest in this avenue of research.

Conclusion

Through theory and experiments, we demonstrated that overparameterizing a neural network by increasing its depth can accelerate optimization, even on very simple problems.

Our analysis of linear neural networks, the subject of various recent studies, yielded a new result: for these models, overparameterization by depth can be understood as a preconditioning scheme with a closed form description (Theorem 1 and the claims thereafter). The preconditioning may be interpreted as a combination between certain forms of adaptive learning rate and momentum. Given that it depends on network depth but not on width, acceleration by overparameterization can be attained at a minimal computational price, as we demonstrate empirically in Section 8.

Clearly, complete theoretical analysis for non-linear networks will be challenging. Empirically however, we showed that the trivial idea of replacing an internal weight matrix by a product of two can significantly accelerate optimization, with absolutely no effect on expressiveness (Figure 5-right).

Acknowledgments

Sanjeev Arora’s work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Elad Hazan’s work is supported by NSF grant 1523815 and Google Brain. Nadav Cohen is a member of the Zuckerman Israeli Postdoctoral Scholars Program, and is supported by Eric and Wendy Schmidt.

References

Appendix A Deferred Proofs

Before delving into the proof, we introduce notation that will admit a more compact presentation of formulae. For $1\leq{a}\leq{b}\leq{N}$ , we denote:

where $W_{1}\ldots{W}_{N}$ are the weight matrices of the depth- $N$ linear network (Equation 2). If $a>b$ , then by definition both $\prod^{j=b}_{a}W_{j}$ and $\prod_{j=a}^{b}W_{j}^{\top}$ are identity matrices, with size depending on context, i.e. on the dimensions of matrices they are multiplied against. Given any square matrices (possibly scalars) $A_{1},A_{2},\ldots,A_{m}$ , we denote by $diag(A_{1}\ldots{A}_{m})$ a block-diagonal matrix holding them on its diagonal:

As illustrated above, $diag(A_{1}\ldots{A}_{m})$ may hold additional, zero-valued rows and columns beyond $A_{1}\ldots{A}_{m}$ . Conversely, it may also trim (omit) rows and columns, from its bottom and right ends respectively, so long as only zeros are being removed. The exact shape of $diag(A_{1}\ldots{A}_{m})$ is again determined by context, and so if $B$ and $C$ are matrices, the expression $B\cdot{diag}(A_{1}\ldots{A}_{m})\cdot{C}$ infers a number of rows equal to the number of columns in $B$ , and a number of columns equal to the number of rows in $C$ .

Turning to the actual proof, we disregard the trivial case $N=1$ , and begin by noticing that Equation 3, along with the definition of $W_{e}$ (Equation 5), imply that for every $j=1\ldots{N}$ :

Plugging this into the differential equations of gradient descent (Equation 6), we get:

For $j=1\ldots{N}{-}1$ , multiply the $j$ ’th equation by $W_{j}^{\top}(t)$ from the right, and the $j{+}1$ ’th equation by $W_{j+1}^{\top}(t)$ from the left. This yields:

Taking the transpose of these equations and adding to themselves, we obtain, for every $j=1\ldots{N}{-}1$ :

Turning to Lemma 1 below, while recalling our assumption for time $t_{0}$ (Equation 7):

we conclude that, throughout the entire time-line:

Recollecting the definitions of $C_{j}(t),C^{\prime}_{j}(t)$ , this means:

Regard $t$ now as fixed, and for every $j=1\ldots{N}$ , let:

be a singular value decomposition. That is to say, $U_{j}$ and $V_{j}$ are orthogonal matrices, and $\Sigma_{j}$ is a rectangular-diagonal matrix holding non-decreasing, non-negative singular values on its diagonal. Equation 18 implies that for $j=1\ldots{N}{-}1$ :

$O_{j,r}$ here is simply a matrix changing between orthogonal bases in the eigenspace of $\rho_{r}$ – it maps the basis comprising $V_{j+1}$ -columns to that comprising $U_{j}$ -columns. Recalling that both $\Sigma_{j}$ and $\Sigma_{j+1}$ are rectangular-diagonal, holding only non-negative values, Equation 20 implies that each of these matrices is equal to $diag(\sqrt{\rho_{1}}{\cdot}I_{d_{1}},\ldots,\sqrt{\rho_{m}}{\cdot}I_{d_{m}})$ . Note that the matrices generally do not have the same shape and thus, formally, are not equal to one another. Nonetheless, in line with our $diag$ notation (see beginning of this subsection), $\Sigma_{j}$ and $\Sigma_{j+1}$ may differ from each other only in trailing, zero-valued rows and columns. By an inductive argument, all the singular value matrices $\Sigma_{1},\Sigma_{2},\ldots,\Sigma_{N}$ (see Equation 19) are equal up to trailing zero rows and columns. The fact that $\rho_{1}\ldots\rho_{m}$ do not include an index $j$ in their notation is thus in order, and we may write, for every $j=1\ldots{N}{-}1$ :

Concatenations of weight matrices thus simplify as follows:

where we used the orthogonality of $O_{j,r}$ , and the obvious fact that it commutes with $I_{d_{r}}$ . Consider Equation 21 with $j=1$ and Equation 22 with $j=N$ , while recalling that by definition $W_{e}(t)=\prod^{i=N}_{1}W_{j}(t)$ :

It follows that for every $j=1\ldots{N}$ :

where $[\cdot]^{\frac{N-j+1}{N}}$ and $[\cdot]^{\frac{j}{N}}$ stand for fractional power operators defined over positive semidefinite matrices.

With Equations 23 and 24 in place, we are finally in a position to complete the proof. Returning to Equation 16, we multiply $\dot{W}_{j}(t)$ from the left by $\prod^{i=N}_{j+1}W_{i}(t)$ and from the right by $\prod^{i=j-1}_{1}W_{i}(t)$ , followed by summation over $j=1\ldots{N}$ . This gives:

By definition $W_{e}(t)=\prod^{i=N}_{1}W_{j}(t)$ , so we can substitute the first two lines above:

Finally, plugging in the relations in Equations 23 and 24, the sought-after result is revealed:

Then, if $f$ and $g$ assume the same value at some $t_{0}\in{I}$ (interior or boundary), they must coincide along the entire interval, i.e. it must hold that $f(t)=g(t)$ for all $t\in{I}$ .

We know that $h(t_{0})=0$ for some $t_{0}\in{I}$ , and would like to show that $h(t)=0~{}\forall{t}\in{I}$ . Assume by contradiction that this is not the case, so there exists $t_{2}\in{I}$ for which $h(t_{2})\neq 0$ . Without loss of generality, suppose that $h(t_{2})>0$ , and that $t_{2}>t_{0}$ . Let $S$ be the zero set of $h$ , i.e. $S:=\{t\in{I}:h(t)=0\}$ . Since $h$ is continuous in $I$ , $S$ is topologically closed, therefore its intersection with the interval $[t_{0},t_{2}]$ is compact. Denote by $t_{1}$ the maximal element in this intersection, and consider the interval $J:=[t_{1},t_{2}]\subset{I}$ . By construction, $h$ is positive along $J$ , besides on the endpoint $t_{1}$ where it assumes the value of zero. For $t_{1}<t\leq{t}_{2}$ , we may solve as follows the differential equation of $h$ (Equation 25):

where $\beta$ is the positive constant defined by $h(t_{2})=\beta{e}^{-\alpha{t_{2}}}$ . Since in particular $h$ is bounded away from zero on $(t_{1},t_{2}]$ , and assumes zero at $t_{1}$ , we obtain a contradiction to its continuity. This completes the proof. ∎

A.2 Proof of Claim 1

Our proof relies on the Kronecker product operation for matrices. For arbitrary matrices $A$ and $B$ of sizes $m_{a}\times{n}_{a}$ and $m_{b}\times{n}_{b}$ respectively, the Kronecker product $A\odot{B}$ is defined to be the following block matrix:

where $a_{ij}$ stands for the element in row $i$ and column $j$ of $A$ . The Kronecker product admits numerous useful properties. We will employ the following:

If $A$ and $B$ are matrices such that the matrix product $AB$ is defined, then:

where $I_{r_{A}}$ and $I_{c_{B}}$ are the identity matrices whose sizes correspond, respectively, to the number of rows in $A$ and the number of columns in $B$ . $vec(\cdot)$ here, as in claim statement, stands for matrix vectorization in column-first order.

If $A_{1}$ , $A_{2}$ , $B_{1}$ and $B_{2}$ are matrices such that the matrix products $A_{1}B_{1}$ and $A_{2}B_{2}$ are defined, then:

Equation 28 and 29 imply, that if $A$ and $B$ are some orthogonal matrices, so is $A\odot{B}$ :

With the Kronecker product in place, we proceed to the actual proof. It suffices to show that vectorizing:

where ${P}_{W_{e}^{(t)}}$ is the preconditioning matrix defined in claim statement. For notational conciseness, we hereinafter omit the iteration index $t$ , and simply write $W_{e}$ instead of $W_{e}^{(t)}$ .

Let $I_{d}$ and $I_{k}$ be the identity matrices of sizes $d\times{d}$ and $k\times{k}$ respectively. Utilizing the properties of the Kronecker product, we have:

The first equality here makes use of Equation 27, and the second of Equation 28. We will show that the matrix:

meets the characterization of ${P}_{W_{e}}$ , thereby completing the proof. Let:

The third equality here is based on the relation in Equation 28, and the last equality is based on Equation 29. Denoting:

Now, since by definition $U$ and $V$ are orthogonal, $O$ is orthogonal as well (follows from the relation in Equation 30). Additionally, the fact that $D$ is rectangular-diagonal implies that the square matrix $\Lambda$ is also diagonal. Equation 34 is thus an orthogonal eigenvalue decomposition of $Q$ . Finally, denote the columns of $U$ (left singular vectors of $W_{e}$ ) by ${\mathbf{u}}_{1}\ldots{\mathbf{u}}_{k}$ , those of $V$ (right singular vectors of $W_{e}$ ) by ${\mathbf{v}}_{1}\ldots{\mathbf{v}}_{d}$ , and the diagonal elements of $D$ (singular values of $W_{e}$ ) by $\sigma_{1}\ldots\sigma_{\max\{k,d\}}$ (by definition $\sigma_{r}=0$ if $r>\min\{k,d\}$ ). The definitions in Equations 32 and 33 imply that the columns of $O$ are:

with corresponding diagonal elements in $\Lambda$ being:

We conclude that $Q$ indeed meets the characterization of $P_{W_{e}}$ in claim statement. This completes the proof.

A.3 Proof of Claim 2

We disregard the trivial case $N=1$ , as well as the scenario $W_{e}^{(t)}=0$ (both lead Equations 10 and 12 to equate). Omitting the iteration index $t$ from our notation, it suffices to show that:

The latter expression is precisely the second line of Equation 35, thus our proof is complete.

A.4 Proof of Theorem 2

Our proof relies on elementary differential geometry: curves, arc length and line integrals (see Chapters 8 and 9 in Buck (2003)).

In words, the line integral of the gradient of $h$ over $\Gamma$ , is equal to the difference between the value taken by $h$ at the end-point of $\Gamma$ , and that taken at the start-point. A direct implication of the theorem is that if $\Gamma$ is closed ( $\gamma_{e}=\gamma_{s}$ ), the line integral vanishes:

We conclude that if $F(\cdot)$ is the gradient field of some function, its line integral over any closed (piecewise smooth) curve lying in ${\mathcal{U}}$ must vanish. We will show that this is not the case.

In light of the above, to show that $F(\cdot)$ contradicts the gradient theorem, thereby completing the proof, it suffices to find a closed (piecewise smooth) curve $\Gamma$ for which the linear functional $\phi\mapsto\oint_{\Gamma}{F}_{\phi}$ does not vanish at $\phi=\frac{dL^{1}}{dW}$ . By assumption $\frac{dL^{1}}{dW}(W{=}0)\neq 0$ , and so we may define the unit vector in the direction of $\frac{dL^{1}}{dW}(W{=}0)$ :

Let $R$ be a positive constant small enough such that the Euclidean ball of radius $R$ around the origin is contained in ${\mathcal{U}}$ . Let $r$ be a positive constant smaller than $R$ . Define $\Gamma_{r,R}$ to be a curve as follows (see illustration in Figure 1): The proof would have been slightly simplified had we used a curve that passes directly through the origin. We avoid this in order to emphasize that the result is not driven by some point-wise singularity (the origin received special treatment in the definition of $F(\cdot)$ – see Equations 14 and 13).

$\Gamma_{r,R}^{1}$ is the line segment from $-R\cdot{\mathbf{e}}$ to $-r\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}^{2}$ is a geodesic on the sphere of radius $r$ , starting from $-r\cdot{\mathbf{e}}$ and ending at $r\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}^{3}$ is the line segment from $r\cdot{\mathbf{e}}$ to $R\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}^{4}$ is a geodesic on the sphere of radius $R$ , starting from $R\cdot{\mathbf{e}}$ and ending at $-R\cdot{\mathbf{e}}$ .

$\Gamma_{r,R}$ is a piecewise smooth, closed curve that fully lies within ${\mathcal{U}}$ . Consider the linear functional it induces: $\phi\mapsto\oint_{\Gamma_{r,R}}{F}_{\phi}$ . We will evaluate this functional on $\phi=\frac{dL^{1}}{dW}$ . To do so, we decompose the latter as follows:

$c$ is a scalar equal to $\|\tfrac{dL^{1}}{dW}(W{=}0)\|$ .

${\mathbf{e}}(\cdot)$ is a vector field returning the constant ${\mathbf{e}}$ (Equation 37).

$\xi(\cdot)$ is a vector field returning the values of $\frac{dL^{1}}{dW}(\cdot)$ shifted by the constant $-\tfrac{dL^{1}}{dW}(W{=}0)$ . It is continuous on ${\mathcal{U}}$ and vanishes at the origin.

Applying Lemma 2 to $\xi$ over $\Gamma_{r,R}$ gives:

The linearity of the functional $\phi\mapsto\oint_{\Gamma_{r,R}}{F}_{\phi}$ , along with Equation 39, then imply:

We will show that for proper choices of $R$ and $r$ , the lower bound above is positive. $\Gamma_{r,R}$ will then be a piecewise smooth closed curve lying in ${\mathcal{U}}$ , for which the functional $\phi\mapsto\oint_{\Gamma_{r,R}}{F}_{\phi}$ does not vanish at $\phi=\frac{dL^{1}}{dW}$ . As stated, this will imply that $F(\cdot)$ violates the gradient theorem, thereby concluding our proof.

All that is left is to affirm that the expression:

can indeed be made positive with proper choices of $R$ and $r$ . Recall that:

$N>2$ by assumption; implies $\frac{2N}{3-\nicefrac{{2}}{{N}}}-2>0$ .

$R$ is any positive constant small enough such that the ball of radius $R$ around the origin is contained in ${\mathcal{U}}$ .

$r$ is any positive constant smaller than $R$ .

$\Gamma_{r,R}$ is a curve whose points are all within distance $R$ from the origin.

$c=\|\tfrac{dL^{1}}{dW}(W{=}0)\|$ – positive by assumption.

$\xi(\cdot)$ is a vector field that is continuous on ${\mathcal{U}}$ and vanishes at the origin.

The following procedure gives $R$ and $r$ as required:

Set $r$ to follow $R$ such that: $r^{3{-}\frac{2}{N}}=0.5\cdot{R}^{3{-}\frac{2}{N}}$ .

Choose $\epsilon>0$ for which $0.5c\left(\frac{2N}{3-\frac{2}{N}}{-}2\right)-2\pi{N}\epsilon>0$ .

Set $R$ to be small enough such that $\left\|\xi({\mathbf{w}})\right\|\leq\epsilon$ for any point ${\mathbf{w}}$ within distance $R$ from the origin.

where $len(\Gamma)$ is the arc length of $\Gamma$ , and $\gamma\in\Gamma$ refers to a point lying on the curve.

We begin by noting that the use of $\max$ (as opposed to $\sup$ ) in stated upper bound is appropriate, since under our definition of a curve (adopted from Buck (2003)), points lying on it constitute a compact set. This subtlety is of little importance – one may as well replace $\max$ by $\sup$ , and the lemma would still serve its purpose.

It is not difficult to see that for any ${\mathbf{w}}\in{\mathcal{U}}$ , ${\mathbf{w}}\neq 0$ :

Trivially, $\left\|F_{\phi}({\mathbf{w}})\right\|\leq{N}\left\|{\mathbf{w}}\right\|^{2{-}\frac{2}{N}}\left\|\phi({\mathbf{w}})\right\|$ holds for ${\mathbf{w}}{=}0$ as well. The sought-after result now follows from the properties of line integrals:

Let ${\mathbf{e}}$ be a unit vector, let $\Gamma_{r,R}$ be a piecewise smooth closed curve as specified in Equation 38 and the text thereafter, and let $\phi\mapsto{F}_{\phi}$ be the operator on continuous vector fields defined by Equation 36. Overloading notation by regarding ${\mathbf{e}}(\cdot)\equiv{\mathbf{e}}$ as a constant vector field, it holds that:

We compute the line integral by decomposing $\Gamma_{r,R}$ into its smooth components $\Gamma_{r,R}^{1}\ldots\Gamma_{r,R}^{4}$ :

Starting from $\Gamma_{r,R}^{1}$ , notice that for every point ${\mathbf{w}}$ lying on this curve: $\langle{\mathbf{e}},\frac{{\mathbf{w}}}{\left\|{\mathbf{w}}\right\|}\rangle\frac{{\mathbf{w}}}{\left\|{\mathbf{w}}\right\|}={\mathbf{e}}$ . Therefore:

The line integral on the right translates into a simple univariate integral:

Turning to $\Gamma_{r,R}^{2}$ , note that for any point ${\mathbf{w}}$ along this curve $\left\|{\mathbf{w}}\right\|^{2-\frac{2}{N}}=r^{2-\frac{2}{N}}$ , and $\frac{{\mathbf{w}}}{\left\|{\mathbf{w}}\right\|}$ is perpendicular to the direction of motion. This implies:

The line integral $\int_{\Gamma_{r,R}^{2}}{\mathbf{e}}$ is simply equal to the progress $\Gamma_{r,R}^{2}$ makes in the direction of ${\mathbf{e}}$ , which is $2r$ . Accordingly:

As for $\Gamma_{r,R}^{3}$ and $\Gamma_{r,R}^{4}$ , their line integrals may be computed similarly to those of $\Gamma_{r,R}^{1}$ and $\Gamma_{r,R}^{2}$ respectively. Such computations yield:

Combining Equation 40 with Equations 41, 42, 43 and 44, we obtain the desired result. ∎

Appendix B A Concrete Acceleration Bound

In Section 7 we illustrated qualitatively, on a family of very simple hypothetical learning problems, the potential of overparameterization (use of depth- $N$ linear network in place of classic linear model) to accelerate optimization. In this appendix we demonstrate how the illustration can be made formal, by considering a special case and deriving a concrete bound on the acceleration.

As shown in Section 7, under gradient descent, $w_{1}$ and $w_{2}$ move independently, and to prevent divergence, the learning rate must satisfy $\eta<\min\{2/y_{1}^{p-2},2/y_{2}^{p-2}\}$ . In our setting, this translates to (GD below stands for gradient descent):

For $w_{2}$ , the optimal learning rate (convergence in a single step) is $1/y_{2}^{2}$ , and the constraint above will lead to very slow convergence (see Equation 15 and its surrounding text).

Suppose now that we optimize via overparameterization, i.e. with the update rule in Equation 12 (single output). In our particular setting (recall, in addition to the above, that we omitted weight decay for simplicity – $\lambda=0$ ), this update rule translates to:

For the first iteration ( $t=0$ ), replacing $\epsilon_{i}:=|w_{i}^{(0)}|$ , while recalling that $y_{1}\gg{y_{2}}\gg\epsilon_{1}\gg\epsilon_{2}$ , we obtain:

Set $\eta=1/2\epsilon_{1}y_{1}^{2}$ . Then $w_{1}^{(1)}\approx{y}_{1}$ and $w_{2}^{(1)}\approx{y}_{2}^{3}/2y_{1}^{2}+\epsilon_{2}y_{1}/2\epsilon_{1}$ . Our assumptions thus far ( $y_{1}\gg{y}_{2}$ and $\epsilon_{1}\gg\epsilon_{2}$ ) imply $w_{1}^{(1)}\gg{w}_{2}^{(1)}$ . Moreover, since $\epsilon_{2}/\epsilon_{1}\approx{y}_{2}/y_{1}$ , it holds that $w_{2}^{(1)}\in{\mathcal{O}}(y_{2})={\mathcal{O}}(1)$ . Taking all of this into account, the second iteration ( $t=1$ ) of the overparameterized update rule (Equation 46) becomes:

In words, $w_{1}$ will stay approximately equal to $y_{1}$ , whereas $w_{2}$ will take a step that corresponds to gradient descent with learning rate (OP below stands for overparameterization):

By assumption $\epsilon_{1}y_{1}{\gg}1$ and $y_{2}{\in}{\mathcal{O}}(1)$ , thus $\eta^{OP}{<}2/y_{2}^{2}$ , meaning that $w_{2}$ will remain on the order of $y_{2}$ (or less). An inductive argument can therefore be applied, and our observation regarding the second iteration ( $t=1$ ) continues to hold throughout – $w_{1}$ is (approximately) fixed at $y_{1}$ , and $w_{2}$ follows steps that correspond to gradient descent with learning rate $\eta^{OP}$ .

To summarize our findings, we have shown that while standard gradient descent limits $w_{2}$ with a learning rate $\eta^{GD}$ that is at most $2/y_{1}^{2}$ (Equation 45), overparameterization can be adjusted to induce on $w_{2}$ an implicit gradient descent scheme with learning rate $\eta^{OP}=1/2\epsilon_{1}y_{1}$ (Equation 47), all while admitting immediate (single-step) convergence for $w_{1}$ . Since both $\eta^{GD}$ and $\eta^{OP}$ are well below $1/y_{2}^{2}$ , we obtain acceleration by at least $\eta^{OP}/\eta^{GD}>y_{1}/4\epsilon_{1}$ (we remind the reader that $y_{1}\gg 1$ is the target value of $w_{1}$ , and $\epsilon_{1}\ll 1$ is the magnitude of its initialization).

Appendix C Implementation Details

Below we provide implementation details omitted from our experimental report (Section 8).

The details hereafter apply to all of our experiments besides that on the convolutional network (Figure 5-right).

In accordance with our theoretical setup (Section 4), evaluated linear networks did not include bias terms, only weight matrices. The latter were initialized to small values, drawn i.i.d. from a Gaussian distribution with mean zero and standard deviation $0.01$ . The only exception to this was the setting of identity initialization (Figure 5-left), in which an offset of $1$ was added to the diagonal elements of each weight matrix (including those that are not square).

When applying a grid search over learning rates, the values $\{10^{-5},5\cdot 10^{-5},\ldots,10^{-1},5\cdot 10^{-1}\}$ were tried. We note that in the case of depth- $8$ network with standard near-zero initialization (Figure 5-left), all learning rates led either to divergence, or to a failure to converge (vanishing gradients).

C.2 Convolutional Network

For the experiment on TensorFlow’s MNIST convolutional network tutorial, we simply downloaded the code, https://github.com/tensorflow/models/tree/master/tutorials/image/mnist and introduced two minor changes:

Hidden dense layer: $3136{\times}512$ weight matrix replaced by multiplication of $3136{\times}512$ and $512{\times}512$ matrices.

Output layer: $512{\times}10$ weight matrix replaced by multiplication of $512{\times}10$ and $10{\times}10$ matrices.

The newly introduced weight matrices were initialized in the same way as their predecessors (random Gaussian distribution with mean zero and standard deviation $0.1$ ). Besides the above, no change was made. An addition of roughly $250K$ parameters to a $1.6M$ -parameter model gave the speedup presented in Figure 5-right.

To rule out the possibility of the speedup resulting from suboptimal learning rates, we reran the experiment with grid search over the latter. The learning rate hardcoded into the tutorial follows an exponentially decaying schedule, with base value $10^{-2}$ . For both the original and overparameterized models, training was run multiple times, with the base value varying in $\{10^{-5},5\cdot 10^{-5},\ldots,10^{-1},5\cdot 10^{-1}\}$ . We chose, for each model separately, the configuration giving fastest convergence, and then compared the models one against the other. The observed gap in convergence rates was similar to that in Figure 5-right.

An additional point we set out to examine, is the sensitivity of the speedup to initialization of overparameterized layers. For this purpose, we retrained the overparameterized model multiple times, varying in $\{10^{-3},5\cdot 10^{-3},\ldots,10^{-1},5\cdot 10^{-1}\}$ the standard deviation of the Gaussian distribution initializing overparameterized layers (as stated above, this standard deviation was originally set to $10^{-1}$ ). Convergence rates across the different runs were almost identical. In particular, they were all orders of magnitude faster than the convergence rate of the baseline, non-overparameterized model.