A Spectral Condition for Feature Learning

Greg Yang, James B. Simon, Jeremy Bernstein

Introduction

Recent years have seen an unprecedented push to train deep learning systems with more and more parameters, leading to powerful models across domains and the unlocking of qualitatively new capabilities (Silver et al., 2016, Brown et al., 2020, Ramesh et al., 2022). This continuing trend, combined with the technical challenges of training large models, has motivated much recent study of the dynamics of neural networks at large width, and more generally the study of how their dynamics scale as network width grows. This program has yielded a cornucopia of theoretical insights (Lee et al., 2018, Jacot et al., 2018, Arora et al., 2019, Canatar et al., 2021) and practical scaling recommendations (Yang & Hu, 2021b, Yang et al., 2021, Dey et al., 2023).

A key challenge when training a network of large width is to ensure that feature learning occurs at hidden layers. By this, we mean that the hyperparameters of the network are scaled in a manner such that the hidden representations of the network (as obtained by partial evaluation of the network up to a certain layer) change substantially over the course of training. Naïve hyperparameter scaling rules, including the well-studied “neural tangent parametrization” (NTP), in fact lose feature learning at large width (Lee et al., 2019, Sohl-Dickstein et al., 2020). But ample evidence supports the conclusion that proper feature learning is necessary for achieving optimal performance on many tasks (Lee et al., 2020, Fort et al., 2020, Atanasov et al., 2022, Vyas et al., 2022). Furthermore, scaling training correctly can lead to new functionality such as hyperparameter transfer. For instance, the recently proposed maximal update parametrization (Yang & Hu, 2021b, Yang et al., 2021) allows for transferring hyperparameters from narrow models to wide models, avoiding the cost of tuning the wide model directly.

Maximal update parametrization ( $\mu$ P) is derived by fairly involved “tensor programs” arguments that track feature distributions analytically in the infinite width limit. Anecdotally, the principles underlying $\mu$ P are not well understood by the community. In this paper, we provide a new perspective on $\mu$ P, showing that its scaling relations can be obtained by elementary linear algebra arguments. In short, we show that $\mu$ P is equivalent to scaling the spectral norm of any weight matrix or update like $\smash{\sqrt{\texttt{fan-out}/\texttt{fan-in}}}$ . This simple condition has various favorable numerical properties that contrast sharply with heuristic optimization strategies based on controlling the Frobenius norm (You et al., 2017) or entry size (Kingma & Ba, 2015) of updates. In the authors’ experience, the spectral scaling condition both simplifies the implementation of $\mu$ P in code, and is significantly easier to work with theoretically, leading to further conceptual advances in our research (Bernstein et al., 2023).

On a more fundamental level, an important step to solving many problems in classical computer science is to write down a suitable distance function for the problem at hand (Dhillon & Tropp, 2008). This idea is of particular importance in the design of optimization algorithms, where a notion of parameter distance is needed (Nemirovsky & Yudin, 1983, Amari, 1998). While it can be tempting to use the Euclidean norm on parameter vectors to measure distance, this naïve choice risks discarding the structure of the problem. For example, neural networks involve compositions of linear operators, which we refer to as their operator structure. Past efforts to metrize the space of neural networks while accounting for their operator structure have included using the Frobenius norm to measure distance between matrices (Bernstein et al., 2020a), which motivates various optimization algorithms that make Frobenius-normalized updates (You et al., 2020; 2017, Shazeer & Stern, 2018, Liu et al., 2021). This paper shows that the spectral norm provides a better notion of distance between operators in the context of deep learning.

To begin, we state the precise conditions on hidden features that we wish to ensure. We will ask for two things: both the features and their updates upon a step of gradient descent must be the proper size.

Our main message is that feature learning in the sense of Desideratum 1 may be ensured by the following spectral scaling condition on the weight matrices of a deep network and their gradient updates:
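For reference, the condition can be written out from the prose description given in the introduction (the spectral norm of every weight matrix and of every update scales like $\sqrt{\texttt{fan-out}/\texttt{fan-in}}$). The layer index $\ell$ and the symbols $d_{\ell-1}$ (fan-in) and $d_{\ell}$ (fan-out) are notation assumed here for illustration:

```latex
\left|\!\left|{\bm{W}}_{\ell}\right|\!\right|_{*}
  =\Theta\!\left(\sqrt{\frac{d_{\ell}}{d_{\ell-1}}}\right)
\quad\text{and}\quad
\left|\!\left|\Delta{\bm{W}}_{\ell}\right|\!\right|_{*}
  =\Theta\!\left(\sqrt{\frac{d_{\ell}}{d_{\ell-1}}}\right)
\quad\text{at layers } \ell=1,\dots,L.
```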

The bulk of this paper is dedicated to thoroughly demonstrating that training in accordance with our spectral scaling condition satisfies Desideratum 1 in MLPs. As an accessible path to this conclusion, we begin in Section 3 with a simple model (a deep linear MLP trained for one step on one example) and then successively extend to multiple training steps, a nonlinear model, and multiple inputs. In the process, we give a scaling analysis of the dynamics of feature learning. We then explain how Condition 1 may be achieved in a standard deep learning setting and compare and contrast the resulting scaling prescription with others in the literature. In particular, we recover the recent “maximal-update parametrization” ($\mu$P) (Yang & Hu, 2021b).

2 Summary of contributions

Concretely, our contributions are as follows:

We propose the spectral scaling condition (Condition 1) and show that it suffices to achieve feature learning in neural networks even at large width.

We show that other popular scaling rules, including so-called standard parametrization and neural tangent parametrization, fail to satisfy Condition 1.

In the main text, we focus on MLPs trained via ordinary gradient descent for clarity. Our results may actually be extended to cover any architecture and any adaptive optimizer (for a suitable definition of any, cf. Appendix B). Therefore our spectral scaling condition provides a unifying hyperparameter scaling rule that remains the same whether the underlying optimizer is, say, SGD or Adam. We suggest that, when one wishes to determine how the hyperparameters of a new deep learning system should scale with width, one might turn to the satisfaction of Condition 1 as an overarching principle.

Preliminaries

Here we review standard notations which we use in our scaling analysis.

Scaling notation. We will use the usual big- $O$ notation and variants to make statements about how various quantities scale with network width. Intuitively speaking:

$f(d)=O(g(d))$ means that $f(d)$ “scales no faster than” $g(d)$,

$f(d)=\Theta(g(d))$ means that $f(d)$ “scales like” or “is order” $g(d)$,

$f(d)=\Omega(g(d))$ means that $f(d)$ “scales at least as fast as” $g(d)$.

Formally, $f(d)=\Theta(g(d))$ is equivalent to the statement that there exist constants $c,C>0$ such that $c\cdot g(d)\leq f(d)\leq C\cdot g(d)$ for all sufficiently large $d$. The weaker statements $f(d)=O(g(d))$ and $f(d)=\Omega(g(d))$ entail only the upper and lower bounds, respectively.

We will only be concerned with scaling with respect to layer widths in this paper. Big-$O$ notation will hide any dependence on other factors (such as depth, dataset size, learning rate schedule, and a global learning rate prefactor), and our statements purely concern how quantities will or should scale with model width.

That is, the spectral norm is the largest factor by which a matrix can increase the norm of a vector on which it acts. The spectral norm of a matrix is equal to its largest singular value. We will sometimes contrast the spectral norm with the Frobenius norm $\left|\!\left|\cdot\right|\!\right|_{F}$ given by $\smash{\left|\!\left|{\bm{A}}\right|\!\right|_{F}^{2}=\sum_{ij}A_{ij}^{2}}$ .
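The two norms and the submultiplicativity bound are easy to check numerically. The following is a minimal NumPy sketch; the matrix shape, random seed, and variable names are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
v = rng.standard_normal(256)

# Spectral norm: the largest singular value of A.
spectral = np.linalg.norm(A, ord=2)
top_singular_value = np.linalg.svd(A, compute_uv=False)[0]

# Frobenius norm: square root of the sum of squared entries.
frobenius = np.linalg.norm(A, ord="fro")

# Submultiplicativity: ||A v||_2 <= ||A||_* ||v||_2 for every vector v.
lhs = np.linalg.norm(A @ v)
rhs = spectral * np.linalg.norm(v)

print(f"spectral={spectral:.3f}  frobenius={frobenius:.3f}  bound holds: {lhs <= rhs}")
```

The spectral norm never exceeds the Frobenius norm, since the latter is the root sum of squares of all singular values.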

The spectral scaling condition induces feature learning

In this section, we show that our spectral scaling condition (Condition 1) achieves feature evolution of the correct scale in multilayer perceptrons (MLPs). We begin with a toy example which conveys key intuitions and then give a series of extensions which recover a much more general case.

We begin with a simple model: a deep linear MLP trained for one step on a single input. While elementary, this example will be sufficient to capture the intuition for a much more general case.

Hidden vector updates. By the subadditivity and submultiplicativity of the spectral norm, Equations 2 and 3 imply that:

where on the right hand sides of the inequality we have inserted 1 and 1. The spectral scaling condition thus gives features and feature updates obeying the correct upper bounds, and we need merely show comparable lower bounds.

Tightness of bounds via matrix-vector alignment. The upper bound in the submultiplicativity property $\left|\!\left|{\bm{A}}{\bm{v}}\right|\!\right|_{2}\leq\left|\!\left|{\bm{A}}\right|\!\right|_{*}\cdot\left|\!\left|{\bm{v}}\right|\!\right|_{2}$ can be very loose—in particular, this is the case when the vector only interacts with the small singular values in the matrix. We will now show that this is not the case in deep network training, and that these upper bounds provide a fairly accurate description of the way things scale. We make two observations regarding random weight matrices and gradient updates:

With the claims established, we can now get lower bounds on hidden vector size which serve to verify Desideratum 1. The features at initialization scale correctly as:

We now pause to discuss key intuitions from the above argument which will carry through to the general case.

Weight updates are low-rank and aligned. An important observation is that weight updates are highly structured: they have low rank and align to incoming vectors. This motivates the spectral norm (which is the degree by which a matrix scales a “perfectly aligned” vector) as the correct measure of size.
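As an illustration of this alignment, consider a single linear layer and a single example, for which the gradient step is an outer product of an output-side gradient and the incoming vector. The sketch below is a toy check under that assumption; `g`, `h`, `eta`, and the layer sizes are hypothetical stand-ins, not quantities from the experiments in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 512, 256
h = rng.standard_normal(d_in)    # incoming hidden vector (hypothetical)
g = rng.standard_normal(d_out)   # loss gradient w.r.t. the layer output (hypothetical)
eta = 0.1                        # illustrative learning rate

# For one example and a linear layer h_out = W h, the gradient step is rank one.
delta_W = -eta * np.outer(g, h)

spec = np.linalg.norm(delta_W, ord=2)
frob = np.linalg.norm(delta_W, ord="fro")

# Alignment ratio ||dW h|| / (||dW||_* ||h||): equals 1 for a perfectly aligned vector.
update_alignment = np.linalg.norm(delta_W @ h) / (spec * np.linalg.norm(h))

# The same ratio for a random Gaussian matrix acting on h: below 1, but order one.
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
init_alignment = np.linalg.norm(W @ h) / (np.linalg.norm(W, ord=2) * np.linalg.norm(h))

print(f"update alignment={update_alignment:.3f}  init alignment={init_alignment:.3f}")
```

Because the update is rank one, its spectral and Frobenius norms coincide, and the submultiplicativity bound is attained exactly on the incoming vector.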

2 Extensions: additional gradient steps, nonlinearities, and multiple examples

We now extend our warmup example to successively more complex settings, ultimately recovering the general case. As we add back complexity, our spectral scaling condition will remain sufficient to achieve feature evolution of the proper size, and key intuitions from our warmup will continue to hold up to minor modifications. Each extension requires making a natural assumption. We empirically verify these assumptions for a deep MLP in Appendix C.

Updates do not perfectly cancel initial quantities. That is:

2.2 Nonlinearities

We now add a nonlinearity $\phi$ to each layer of our MLP. The modified forward recursion relation is:

with base case ${\bm{h}}_{1}({\bm{x}})={\bm{W}}_{1}{\bm{x}}$ and output ${\bm{h}}_{L}({\bm{x}})={\bm{W}}_{L}{\bm{h}}^{\prime}_{L-1}({\bm{x}})$ . We assume that the hidden features before and after the application of the nonlinearity are of the same scale:

2.3 Batch size greater than one

When training on a minibatch $\mathcal{D}=\{({\bm{x}}_{i},{\bm{y}}_{i})\}_{i=1}^{B}$ with size $B>1$ , each gradient step is simply an average of the steps on each example:

We additionally make the assumption that the batch size is fixed and independent of width:

The batch size is width-independent: $B=\Theta(1)$ .

Empirical observation: low-rank structure remains at large batch size. Surprisingly, we observe numerically that MLP updates remain low (effective) rank and aligned with incoming vectors even at large batch size $B$ . This is demonstrated in Figure 1.

3 Adam and other adaptive optimizers

4 Uniqueness of spectral scaling condition

Efficient implementation of the spectral scaling condition

where $\eta=\Theta(1)$ is a width-independent prefactor.
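One direct, if not the cheapest, way to enforce the spectral scaling condition is to normalize spectral norms explicitly. The sketch below does this with a full spectral-norm computation; it is an illustrative assumption rather than the layerwise prescription of this section, and the helper names `spectral_init` and `spectral_normalize_update` and the constant `scale` are hypothetical:

```python
import numpy as np

def spectral_init(fan_out, fan_in, rng, scale=1.0):
    """Sample a Gaussian matrix, then rescale it so that its spectral norm is
    scale * sqrt(fan_out / fan_in). `scale` is a width-independent constant."""
    W = rng.standard_normal((fan_out, fan_in))
    target = scale * np.sqrt(fan_out / fan_in)
    return W * (target / np.linalg.norm(W, ord=2))

def spectral_normalize_update(grad, fan_out, fan_in, eta=0.1):
    """Turn a raw gradient into a descent step whose spectral norm is
    eta * sqrt(fan_out / fan_in), with eta a width-independent prefactor."""
    norm = np.linalg.norm(grad, ord=2)
    if norm == 0.0:
        return np.zeros_like(grad)
    return -eta * np.sqrt(fan_out / fan_in) * grad / norm

rng = np.random.default_rng(0)
W = spectral_init(256, 1024, rng)
step = spectral_normalize_update(rng.standard_normal((256, 1024)), 256, 1024)
print(np.linalg.norm(W, ord=2), np.linalg.norm(step, ord=2))
```

Computing an exact spectral norm at every step is expensive for large layers; in practice one would likely prefer closed-form scalings of the initialization scale and learning rate, or an approximation such as power iteration.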

Comparisons to existing parametrizations

2 Contrast to “standard parametrization”

At present, the vast majority of deep learning systems use either “Kaiming,” “Xavier,” or “LeCun” initialization (He et al., 2015, Glorot & Bengio, 2010, LeCun et al., 2002) with layer-independent learning rates. Generically, we refer to this as “standard parametrization” (SP), where layerwise initialization and learning rates scale as:

Notice that SP initialization exceeds the scale required by Condition 1 in any layer with fan-out smaller than fan-in. This includes the final layer in sufficiently wide networks. So, while Condition 1 implies that weight matrices have spectral norm $\Theta(\sqrt{\texttt{fan-out}/\texttt{fan-in}})$ at initialization, under SP the spectral norms of certain layers are initialized larger than this. This means that, under SP, network outputs can blow up if training aligns the layer inputs with the top singular subspaces (and in fact this alignment generally occurs).
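The mismatch is easy to see numerically for a final layer with a single output. The sketch below uses a LeCun-style Gaussian initialization with entry standard deviation $1/\sqrt{\texttt{fan-in}}$ as a representative of SP; the widths and variable names are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_out = 1
ratios = {}
for fan_in in (256, 1024, 4096):
    # LeCun-style initialization: entries with standard deviation 1/sqrt(fan_in).
    W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
    sp_norm = np.linalg.norm(W, ord=2)
    target = np.sqrt(fan_out / fan_in)   # scale asked for by the spectral condition
    ratios[fan_in] = sp_norm / target
    print(f"fan_in={fan_in:5d}  SP norm={sp_norm:.3f}  target={target:.4f}  "
          f"ratio={ratios[fan_in]:.1f}")
```

In this toy setting the initialized spectral norm stays near one while the target shrinks, so the ratio grows roughly like $\sqrt{\texttt{fan-in}}$.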

3 Contrast to “neural tangent parametrization”

It bears noting that an MLP parameterized with the NTP can be made to undergo feature evolution by simply rescaling the network output (and appropriately scaling down the global learning rate) (Chizat et al., 2019, Bordelon & Pehlevan, 2022). This operation transforms the NTP into $\mu$ P.

4 Contrast to “Frobenius-normalized updates”

Demonstration: $\mu$P versus NTP

Here we discuss a simple, illustrative experiment in which we directly verify that $\mu$P obeys our spectral scaling condition (Condition 1) and achieves leading-order feature evolution (Desideratum 1), whereas the NTP does not. We do so via direct measurement of spectral quantities in MLPs of varying width trained on the same task. Appendix A provides full experimental details, and results are plotted in Figure 2.

Model and data. We train MLPs with $L=3$ linear layers, no biases, and ReLU activations. For demonstration purposes, we train on a small subset of $B=200$ examples from a two-class subset of CIFAR-10. The model has input dimension $d_{0}=3072$, output dimension $d_{3}=1$, and uniform hidden dimension $d_{1}=d_{2}=d$, with $d$ varied between training runs. We initialize and train each MLP twice with hyperparameters obtained using $\mu$P and NTP scalings, respectively. We train networks at each width $d$ to near-zero training loss. After training, we compute various spectral quantities which we now discuss.

Norms of feature updates. We measure the average relative change in features over training:

where ${\bm{h}}_{2}^{0}({\bm{x}})$ and ${\bm{h}}_{2}({\bm{x}})$ are the second (preactivation) hidden vector at initialization and after training respectively, and the expectation is over samples ${\bm{x}}$ from the batch. As shown in Figure 2A, this feature evolution ratio remains roughly fixed as width $d$ grows when using $\mu$P, in satisfaction of Desideratum 1. By contrast, it decays as $1/\sqrt{d}$ for the NTP as predicted by Lee et al. (2019).

Spectral norms of weight updates. We measure the relative change in weights in spectral norm: $\left|\!\left|{\bm{W}}_{2}-{\bm{W}}_{2}^{0}\right|\!\right|_{*}/\left|\!\left|{\bm{W}}_{2}^{0}\right|\!\right|_{*}$, where ${\bm{W}}_{2}^{0}$ and ${\bm{W}}_{2}$ are the second weight matrix before and after training, respectively. As shown in Figure 2B, this ratio remains roughly fixed with width in the case of $\mu$P, in accordance with Condition 1. By contrast, it decays as $1/\sqrt{d}$ for the NTP.

Final-layer alignment. We measure the alignment of the final layer to incoming vectors as follows:

This quantity is $\Theta(1/\sqrt{d})$ at initialization. As shown in Figure 2C, it grows to $\Theta(1)$ when using $\mu$P but remains $\Theta(1/\sqrt{d})$ when using the NTP.

Frobenius norms of weight updates. One often hears the claim that “the weights don’t move” when training very wide neural networks. Here we show that the validity of this claim crucially depends on the choice of metric. In Figure 2D, we show that the Frobenius norm is deceptive: the net relative change $\left|\!\left|{\bm{W}}_{2}-{\bm{W}}_{2}^{0}\right|\!\right|_{F}/\left|\!\left|{\bm{W}}_{2}^{0}\right|\!\right|_{F}$ can decay with width even when the relative change in spectral norm is constant. This provides crucial context for interpreting existing results in the literature (Lee et al., 2019, Figure 1).
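A toy calculation makes the discrepancy between the two metrics concrete. The sketch below is not the experiment of Figure 2D; it assumes a square Gaussian matrix standing in for ${\bm{W}}_{2}^{0}$ and a synthetic rank-one update of unit spectral norm, and compares the relative change in each norm as the width grows:

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for d in (64, 256, 1024):
    # Stand-in for the initial weights: spectral norm about 2, Frobenius norm about sqrt(d).
    W0 = rng.standard_normal((d, d)) / np.sqrt(d)

    # Synthetic rank-one update with unit spectral norm (and hence unit Frobenius norm).
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    delta = np.outer(u, v)

    rel_spec = np.linalg.norm(delta, 2) / np.linalg.norm(W0, 2)
    rel_frob = np.linalg.norm(delta, "fro") / np.linalg.norm(W0, "fro")
    results[d] = (rel_spec, rel_frob)
    print(f"d={d:5d}  relative spectral change={rel_spec:.3f}  "
          f"relative Frobenius change={rel_frob:.4f}")
```

Under these assumptions the relative spectral change stays near a constant while the relative Frobenius change shrinks like $1/\sqrt{d}$.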

Related work

$\mu$ P was derived heuristically from spectral norm considerations in talks given by the first author in 2021 (Yang & Hu, 2021a). Earlier work (Bernstein et al., 2020a) derived a spectral analysis of feature learning based on perturbation bounds, but that work obtained the wrong scaling relation with network width due to a flawed conditioning assumption on gradients. Below, we review various strands of related work on feature learning and training strategies.

Parametrizations for wide neural networks. Much work has examined the scaling behavior of training dynamics of networks at large width. Together, works on the “neural tangent kernel” (NTK) limit (Jacot et al., 2018, Lee et al., 2019), the “mean-field” limit (Rotskoff & Vanden-Eijnden, 2022, Mei et al., 2019, Sirignano & Spiliopoulos, 2022), and the related “feature learning” limit (Geiger et al., 2020, Yaida, 2022, Yang & Hu, 2021b) (i.e. the $\mu$ P limit), paint a rich picture of a family of possible infinite-width scalings. After healthy debate regarding the relative empirical performance of the (more analytically tractable) NTK limit and the feature learning limit, a recent consensus holds that learning features is usually beneficial in practical large-scale deep learning settings (Chizat et al., 2019, Fort et al., 2020, Vyas et al., 2022). Our spectral scaling analysis recovers the feature learning limit in a simpler manner than previous analyses.

Operator theory of neural networks. Neural networks are constructed by composing linear operators with elementwise nonlinearities. One line of work studies this operator structure and how it behaves under perturbation to understand how step sizes should be set in gradient descent. For instance, Bernstein et al. (2020a) derive perturbation bounds on the maximum amount of feature change that can be induced by a gradient step in terms of the operator properties of the weight matrices. Meanwhile, Yang & Hu (2021b) study the operator structure of neural networks in the limit that width is taken to infinity, proposing a parametrization that obtains feature learning in this limit.

A body of literature studies optimization algorithms for deep networks that take steps whose size is set relative to the weights to which they are applied (You et al., 2017, Bernstein et al., 2020a; b, Liu et al., 2021, Carbonnelle & Vleeschouwer, 2019, Shazeer & Stern, 2018). A particular focus has been placed on setting the Frobenius norm of update steps to be small relative to the Frobenius norm of the weight matrices (You et al., 2017, Bernstein et al., 2020a; b, Liu et al., 2021). A main practical takeaway of this paper is that the Frobenius norm should be replaced by the spectral norm to get proper width scaling. The source of the difference between Frobenius and spectral norm is that gradient updates tend to have low stable rank, as shown in Figure 1, while the weights themselves tend to have high stable rank.
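Stable rank, defined as the squared ratio of Frobenius norm to spectral norm, can be computed directly. The sketch below contrasts a Gaussian matrix with an average of $B$ random outer products, which only illustrates the structural bound that such an update has stable rank at most $B$; it does not reproduce the measurements of Figure 1, and `G`, `H`, and the sizes are hypothetical:

```python
import numpy as np

def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_*^2, which never exceeds the rank of A."""
    return (np.linalg.norm(A, "fro") / np.linalg.norm(A, 2)) ** 2

rng = np.random.default_rng(0)
d, B = 512, 8

# Gaussian stand-in for a weight matrix at initialization.
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Average of B outer products, the form of a minibatch gradient for a linear layer.
G = rng.standard_normal((B, d))   # hypothetical output-side gradients, one row per example
H = rng.standard_normal((B, d))   # hypothetical layer inputs, one row per example
update = G.T @ H / B

sr_weights = stable_rank(W)
sr_update = stable_rank(update)
print(f"stable rank of weights={sr_weights:.1f}  stable rank of update={sr_update:.1f}")
```

For the Gaussian matrix the stable rank grows linearly with width, whereas for the update it is capped by the batch size regardless of width.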

Conclusion

We have presented an analysis of the dynamics of feature learning in deep neural networks, beginning with desired conditions on feature evolution (Desideratum 1) and culminating in the demonstration that these conditions may be achieved by simple scaling rules (Condition 1 and Parametrization 1) applied uniformly to each layer. Our analysis recovers and generalizes practically-important “feature-learning parametrizations” and provides a simple, unifying perspective on the question of parametrization in wide neural networks. For comparison, formal results derived under the tensor programs framework are given in Appendix B.

Our discussion has focused principally on MLPs for clarity, but our feature learning desideratum and spectral scaling condition can be directly applied to structured architectures. The spectral scaling condition may be applied to multi-index tensors as appear in convolutional architectures by applying the condition to appropriate “slices” of the full tensor. Simple application of our spectral scaling condition recovers $\mu$ P scalings reported for these model classes (see e.g. Yang et al. (2021), Table 8 and Section J.2). We also give the hyperparameter scalings for biases (which are easily derived but omitted in the main text for clarity) in Appendix D. This architectural universality is also proven rigorously in Appendix B.

Acknowledgements

The authors thank Josh Albrecht, Blake Bordelon, Alex Wei, Nikhil Ghosh, and Dhruva Karkada for useful discussions and comments on the manuscript. JS gratefully acknowledges support from the National Science Foundation Graduate Research Fellowship Program (NSF-GRFP) under grant DGE 1752814.

Author Contributions

GY developed our core insight regarding the utility of the spectral norm, produced our tensor programs theory (Appendix B), and aided in refinement of the paper. JS spearheaded the writing of the paper, led iteration towards simple analysis which communicates our spectral picture, and ran experiments. JB developed an early incarnation of the spectral picture (Bernstein et al., 2020a), contributed key insights simplifying our exposition including unifying all layers under single formulae, aided in writing the paper, and ran experiments.

References

Appendix A Experimental details

where the expectation is taken over ${\bm{x}}$ from the batch. Shaded envelopes in Figure 1 denote one standard deviation with respect to both random network initialization and random batch selection over $10$ trials.

Experimental details for Figure 2. We train MLPs with depth $L=3$, widths $d_{0}=3072$, $d_{1}=d_{2}=d$, $d_{3}=1$, and ReLU activation functions. The data consists of 200 samples from CIFAR-10 (Krizhevsky, 2009) from only the classes airplane and automobile and uses $\pm 1$ targets.

We use two different hyperparameter schemes as follows. To implement $\mu$ P, we take

with global learning rate $\eta=0.1$ . To implement NTP, we follow Jacot et al. (2018) and Lee et al. (2019):

at all layers, again with $\eta=0.1$ . These parameterizations are equivalent at $d=1$ , which lets us view each parameterization as a particular scaling prescription applied to a narrow base network.

We train full-batch for $10^{4}$ steps, which is sufficient for all widths to drop below $0.01$ training loss on average by the end of training. We do not expect that training for many more steps would saliently change the resulting plots. Shaded envelopes in Figure 2 denote one standard deviation with respect to both random network initialization and random batch selection over $10$ experiment trials.

It is perhaps worth emphasizing that these experiments worked much better than they had to. Our theory strictly applies only to the case of a small number of gradient steps relative to network width, but the net updates shown in each subplot of Figure 2 reflect the accumulation of thousands of gradient steps, a number which is larger than network width in all cases. We were thus surprised by the very clear agreement of this experiment with predicted power laws.

Appendix B Tensor programs theory

For simplicity, assume all hidden widths $d_{1}=\cdots=d_{L-1}$ are the same. Borrowing from Yang & Hu (2021b), an abc-parametrization is just a recipe for scaling (as powers of width) the multiplier, initializer scale, and learning rate of all parameter tensors of a neural network. SP, NTP, and $\mu$P are all examples of abc-parametrizations. A stable abc-parametrization is one whose (pre)activations and output do not blow up with width at any step of training. Then (under the generous Assumption 5) the following theorem is our main result:

In $\mu$P, for almost every learning rate (in the measure-theoretic sense), Condition 1 is satisfied at any time during training for sufficiently large width. $\mu$P is the unique stable abc-parametrization with this property.

These statements are universal: they hold for any architecture and any adaptive optimizers representable by tensor programs (including convolutional neural networks, residual networks, and transformers, etc., as well as RMSProp, Adam, etc.), not just MLP and SGD.

As we have explained the core intuitions of our spectral perspective of feature learning in the main text, here we focus on proving the most general result in the most concise way.

Our proofs will rely on the following notions defined in prior work:

representable architecture (Yang & Littwin, 2023, Defn 2.9.1)

matrix/vector/scalar parameters (Yang & Littwin, 2023, Defn 2.9.1)

abcd-parametrization for representable architectures (Yang & Littwin, 2023, Defn 2.9.7)

entrywise optimizer (Yang & Littwin, 2023, Sec 2.1)

ket and iid-copy notation (Yang & Littwin, 2023, Sec 1.2)

Everything here follows under the following:

Assume our architecture is representable by a $\textsc{Ne}{\otimes}\textsc{or}\top$ program with pseudo-Lipschitz nonlinearities, trained by an entrywise optimizer with pseudo-Lipschitz update functions.

The following is our main proposition for initialization:

In any abcd-parametrization, any matrix parameter ${\bm{W}}$ at random initialization has $\left|\!\left|{\bm{W}}\right|\!\right|_{F}/\left|\!\left|{\bm{W}}\right|\!\right|_{*}=\Theta(\sqrt{d})$, but any vector or scalar parameter ${\bm{W}}$ has $\left|\!\left|{\bm{W}}\right|\!\right|_{F}/\left|\!\left|{\bm{W}}\right|\!\right|_{*}=1$.

This is a claim about random matrices and follows from classical random matrix theory. Here’s a quick sketch: If $\sigma$ is the standard deviation of a matrix entry, then $\left|\!\left|{\bm{W}}\right|\!\right|_{F}\approx\sigma\cdot d$ from the central limit theorem, and $\left|\!\left|{\bm{W}}\right|\!\right|_{*}\approx 2\sigma\cdot\sqrt{d}$ as stated in the main text, from which the first part of the proposition follows. For vectors and scalars, the stated ratio is always $1$. ∎
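The $\Theta(\sqrt{d})$ scaling of this ratio can be checked empirically by fitting a power law across widths. The following is a small numerical sketch for square Gaussian matrices; the widths, seed, and value of `sigma` are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.02  # arbitrary entry standard deviation; the norm ratio does not depend on it
widths = np.array([64, 128, 256, 512, 1024])

ratios = []
for d in widths:
    W = sigma * rng.standard_normal((d, d))
    ratios.append(np.linalg.norm(W, "fro") / np.linalg.norm(W, 2))
ratios = np.array(ratios)

# Fit log(ratio) = slope * log(d) + intercept; Theta(sqrt(d)) scaling predicts slope 1/2.
slope, intercept = np.polyfit(np.log(widths), np.log(ratios), 1)
print(f"fitted exponent={slope:.3f}")
print("ratio / (sqrt(d) / 2):", ratios / (np.sqrt(widths) / 2))
```

The same log-log fit is a convenient way to estimate the width exponent of any other measured quantity.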

Matrix Updates

Consider any abcd-parametrization. For any matrix or vector parameter ${\bm{W}}$ , at any step of training, $\Delta\!{\bm{W}}/\left|\!\left|\Delta\!{\bm{W}}\right|\!\right|_{F}$ converges to a Hilbert-Schmidt integral operator.

This is trivial for vector parameters. For matrix parameters, observe $\widetilde{\Delta}\!{\bm{W}}=\Delta\!{\bm{W}}/\left|\!\left|\Delta\!{\bm{W}}\right|\!\right|_{F}$ is always a nonlinear outer product

for multi-vectors $\bm{x}$ and $\bm{y}$ and multi-scalars $\bm{c}$ and some nonlinearity $Q$. Then the limit $\talloblong\widetilde{\Delta}\!{\bm{W}}\talloblong$ acts on a ket $\talloblong z\rangle$ by

integrating over $\talloblong\bm{y}\rangle$ and $\talloblong z\rangle$. The Hilbert-Schmidt norm of $\talloblong\widetilde{\Delta}\!{\bm{W}}\talloblong$ is

which is finite by the Master Theorem. Therefore $\talloblong\widetilde{\Delta}\!{\bm{W}}\talloblong$ is Hilbert-Schmidt. ∎

In Proposition 2, for large enough width, unless $\Delta\!{\bm{W}}=0$ , both spectral norm and Frobenius norm of $\Delta\!{\bm{W}}/\left|\!\left|\Delta\!{\bm{W}}\right|\!\right|_{F}$ are $\Theta(1)$ .

This is obvious for Frobenius norm since it converges to the Hilbert-Schmidt norm of the operator limit.

However, the spectral norm cannot be expressed directly in such form. But, by definition of the spectral norm and of the ket space $\mathcal{Z}$, one can construct a nonzero vector $z$ in an extension of the program that defined $\Delta\!{\bm{W}}$, such that

for some $\theta\in(1/2,1]$. In fact, we can pick $\theta\in[1-\epsilon,1]$ for any $\epsilon>0$. This implies, by the Master Theorem, that

for sufficiently large width. The LHS is $\Theta(1)$ , so we are done.

In $\mu$P, at any fixed time during training, as width $\to\infty$, any matrix parameter ${\bm{W}}$ has spectral norm $\Theta(1)$ and Frobenius norm $\Theta(\sqrt{d})$.

In $\mu$ P, it’s trivial to see that $\left|\!\left|\Delta\!{\bm{W}}\right|\!\right|_{F}=\Theta(1)$ . So any nonzero $\Delta\!{\bm{W}}$ (at any point of training) has both spectral norm and Frobenius norm $\Theta(1)$ by the above.

Simple calculation then shows that ${\bm{W}}$ at any fixed time has Frobenius norm $\Theta(\sqrt{d})$, since this is the case at initialization by Proposition 1. This means the quadratic mean of the singular values of ${\bm{W}}$ is $\Theta(1)$, so its max singular value must be $\Omega(1)$. But it furthermore must be $\Theta(1)$ because ${\bm{W}}$ at initialization and all of its updates have $O(1)$ spectral norm. ∎

In $\mu$P, for all but a measure-zero set of learning rates, Condition 1 is satisfied at any time during training for sufficiently large width. $\mu$P is the unique stable and faithful abcd-parametrization with this property.

In $\mu$P, by Proposition 4, all matrix parameters satisfy Condition 1 no matter what the learning rate is. However, for vector parameters, it’s possible for some specific learning rate to cause the weights to vanish after an update, but at most a (Lebesgue) measure-zero set of learning rates will cause this to happen. Assuming this vanishing does not happen, ${\bm{W}}$ has $\Theta(1)$ Frobenius norm at initialization and at all times during training. By Proposition 3, ${\bm{W}}$ also has $\Theta(1)$ spectral norm, so Condition 1 is satisfied. A similar but easier argument applies to all scalar parameters. This shows $\mu$P satisfies Condition 1.

Since any other stable and faithful parametrization essentially just rescales the initialization and the update, we see that no other parametrization can satisfy Condition 1. ∎

For SGD, all abc-parametrizations are equivalent to a faithful abcd-parametrization because $Q$ is identity, so Theorem 2 recovers Theorem 1.

Appendix C Empirical checks of Assumptions 1, 2 and 3

In Section 3, we first illustrated how Condition 1 satisfies Desideratum 1 in a minimal model (a deep linear network trained for one step on one sample) and then iteratively extended our argument to multiple steps, nonlinearities, and multiple samples. Each extension came with a mild assumption. These assumptions are intended to be natural conditions one expects from generic network dynamics.

In this section, we restate each assumption, explain further why it is expected to hold, and then present a validating experiment. All experiments use the same setup as that of Figure 2. Plotted quantities depending on the sample ${\bm{x}}$ are averaged over all ${\bm{x}}$ in the batch, with shaded envelopes showing one standard deviation of this mean over five experiment trials.

Updates do not perfectly cancel initial quantities. That is:

Like Assumption 1, a violation of Assumption 3 would require a perfect cancellation of high-dimensional matrices, which is unlikely. We verify Assumption 3 in Figure 6. Assumption 3 is also verified implicitly by the right-hand subplot of Figure 1.

Appendix D Scalings for biases

Appendix E Nondimensionalization and natural norms

For example: in the input layer, image data typically takes the form of dense vectors, while the one-hot encoding in language models takes the form of sparse vectors. The output vector of a network is typically a dense vector. All hidden vectors (pre- or post-activation) are dense.

E.2 Defining natural norms

E.3 Spectral scaling in natural norms

By equipping vectors and matrices with their natural norms, we simplify Desideratum 1 and Condition 1 from the main text: