Training Deep Learning Models with Norm-Constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher

Introduction

Deep learning has greatly benefited from adaptive optimization methods such as RMSProp (Hinton et al., 2012), AdaGrad (Duchi et al., 2011; McMahan & Streeter, 2010), and Adam (Kingma, 2014), which dynamically change the geometry of the problem based on gradients encountered on-the-fly during training. While these methods have demonstrated remarkable success, they fundamentally treat neural networks (NNs) as optimization problems where we lack any prior knowledge about their particular setting.

However, NNs are far from being black boxes—their structure is not only known but they are deliberately designed. This simple observation raises directly the question:

Is it more beneficial to adapt the optimizer a priori, instead of exploring their respective geometries on-the-fly?

Adaptation on-the-fly has been the defacto standard in this setting, with adaptive algorithms, such as Adam (Kingma, 2014), dominating the deep learning model training.

One possible way for adaptation a priori, which we focus on in this work, is to modify the underlying norm used to measure distances in the parameter space. There is precedence to our proposal, as the early work by Carlson et al. (2015a, b, c) introduced the stochastic spectral descent method (SSD), which performs steepest descent in the spectral norm, and demonstrated that the method can substantially accelerate deep learning training.

The significance of the SSD approach has been very recently brought back to attention by Bernstein & Newhouse (2024b), who showed that the Shampoo optimizer (Gupta et al., 2017)—winner of the external tuning track at the 2024 AlgoPerf: Training Algorithms competition (Dahl et al., 2023)—can be viewed as SSD when a certain accumulation is disabled. Moreover, Bernstein & Newhouse (2024b) introduced an efficient Newton-Schultz iteration to replace the approximate SVD calculations previously required. Jordan et al. (2024b) incorporated the Newton-Schultz iteration with additional momentum into SSD under the name Muon to achieve impressive results on the nanoGPT architecture by applying it to the hidden layers.

This work focuses on developing an algorithmic framework that can exploit an appropriate choice of norm for the entire neural network with particular emphasis on hyperparameter transfer across model sizes (Yang & Hu, 2021), convergence and practical performance.

To adapt to the geometry a priori, we will build on a classical (but unexpected) family of algorithms in contrast to the steepest descent methods, namely the ones involving the linear minimization oracle ( $\operatorname{lmo}$ ) over a norm-ball constraint known as the Conditional Gradient (CG) methods.

While classically being used for constrained problems, we take the slightly unusual approach by showing that the $\operatorname{lmo}$ s can be used even for unconstrained problems. The algorithm, dubbed as the unconstrained Stochastic Conditional Gradient method (uSCG), shows improvements both theoretically and practically when the norm-ball constraint matches the natural geometry of the problem.

In particular, we build on the Stochastic Conditional Gradient (SCG) method of Mokhtari et al. (2020) from the constrained setting, which provides explicit control on the norm of NN weight matrices. This is particularly relevant for robust image classification (Cisse et al., 2017), generalization bounds (Bartlett et al., 2017), Lipschitz control of generative adversarial networks (Arjovsky et al., 2017; Miyato et al., 2018), diffusion models (Karras et al., 2024, Sec. 2.3), and for ensuring Lipschitz continuity of NNs (Large et al., 2024).

Concretely we make the following contributions:

Theoretical rates: We introduce a new, stochastic $\operatorname{lmo}$ based family of algorithms uSCG, which can exploit the specific geometry of the problem. In doing so we achieve the $O(n^{-1/4})$ order optimal convergence rate under general nonconvexity and stochasticity for uSCG (Arjevani et al., 2022). Moreover, we provide a new analogous guarantee for the constrained case for SCG.

Unification: Our $\operatorname{lmo}$ -based approach provides a unifying framework for various popular algorithms, based on the norm choice (see Table 1); as a byproduct we establish the first provable rate for the Muon optimizer. More importantly, this generality allows us to design a new method for deep learning based on operator norms called Scion, which enjoys zero-shot hyperparameter transferability (Yang et al., 2022), and can be implemented storing only one set of parameters and one gradient (stored in half-precision), economizing on memory in large-scale training.

Numerical validation: We carry out exhaustive numerical evaluation of Scion ranging from small scale experiments on MLPs and CNNs to NanoGPT models with up to 3B parameters. We consistently observe the transferability properties across all settings for Scion. The scheme is more tolerant to large batch sizes and exhibits superior performance due to the a priori adaptation.

An additional $\operatorname{lmo}$ -based algorithm ALMOND can be found in LABEL:subsec:almond, generalizing the Normalized SGD based method of Zhao et al. (2020), for training with large-batches. Key differences of ALMOND with uSCG and SCG are discussed to further motivate our algorithms.

Preliminaries

We are interested in solving the following general (possibly nonconvex) optimization problem

The central primitive in the algorithms considered in this work is the linear minimization oracle ( $\operatorname{lmo}$ ) defined as

where we are particularly interested in the special case where the constraint set is a norm constraint $\|x\|\leq\rho$ , for some $\rho>0$ and some norm $\|\cdot\|$ , which does not have to be the Euclidean norm. Examples of norm-constrained $\operatorname{lmo}$ s are provided in Table 1 and Table 2 regarding operator norms. An important property of the $\operatorname{lmo}$ is that the operator is scale invariant, i.e., $\operatorname{lmo}(a\cdot s)=\operatorname{lmo}(s)$ for $a>0$ , and in fact we have by construction under the norm constraints that $\|\operatorname{lmo}(s)\|\leq\rho$ . Thus, it is only the direction of the input $s$ that matters and not the magnitude.

A classical method for solving the constrained variant of problem 1, when the $\operatorname{lmo}$ is available, is the Conditional Gradient method (CG) (Frank et al., 1956; Clarkson, 2010; Jaggi, 2013), which proceeds as follows with $\gamma_{k}\in(0,1)$

ensuring the feasibility of $x^{k}$ via simplicial combination.

In the stochastic regime, the analyzing $\operatorname{lmo}$ -based algorithms is involved. Even when the stochastic oracle $\nabla f(x,\xi)$ is unbiased, the direction of the updates, as defined by $\operatorname{lmo}(\nabla f(x,\xi))$ , is not unbiased. To help overcome this difficulty, we will employ a commonly used trick of averaging past gradients with $\alpha_{k}\in(0,1]$ (aka momentum),

which will rigorously help with algorithmic convergence.

Our Methods

For the unconstrained case we introduce a new method, dubbed the unconstrained SCG method (uSCG):

with stepsizes $\gamma_{k}\in(0,1)$ . Instead of the convex combination in CG, the update rule simply sums the $\operatorname{lmo}$ s. In contrast with e.g., gradient descent, the update always has the same magnitude regardless of the size of the gradient average $d^{k}$ . The final algorithm is presented in Algorithm 1.

For the constrained case, we revisit the SCG method of Mokhtari et al. (2020) and adopt it for the non-convex objectives typically encountered in deep learning model training. This algorithm (Algorithm 2) proceeds as follows

For uSCG, weight decay has a very precise interpretation, since the method reduces to SCG. Consider the following variant of uSCG with weight decay $x^{k+1}=x^{k}+\gamma_{k}\operatorname{lmo}(d^{k})-\gamma_{k}\mu x^{k}.$

The weight decay parameter $\mu\in$ interpolates between uSCG and SCG. If the weight decay is in $(0,1)$ then the algorithm is still an instance of SCG and thus solve a constrained problem, but one with a larger radius of $\rho^{\prime}=\tfrac{\rho}{\mu}$ with a stepsize chosen as $\gamma_{k}^{\prime}=\gamma_{k}\mu$ .

1 Choice of Norm Constraint

To choose an appropriate norm for deep learning, we build on the operator norm perspective of Large et al. (2024). To simplify the presentation we will reason through a linear MLP as a running example, but in Section 3.2, we point to our theoretical guarantees with activation functions.

Let us a consider a linear MLP with the initial hidden layer defined as $h_{1}(z)=W_{1}z+b_{1}$ and the remaining layers

The operator norm is in turn defined as follows

Directly from the definition, we have that if the input $z$ is bounded through $\|z\|_{\alpha}\leq 1$ , then the output $\|Az\|_{\beta}$ will be bounded when $\|A\|_{\alpha\rightarrow\beta}$ is bounded.

A collection of operator norms and their resulting $\operatorname{lmo}$ s is provided in Table 2. It will be convenient to convert between bounds on these different operator norm which the following fact makes precise.

The operator norm satisfies for some $\rho>0$

Fact 3.1 tells us that we can bound operator norms using bounds on the vector norms, i.e.,

We start by focusing on controlling the RMS norm, but will later consider other norms. There are three example operator norms to consider for the MLP in consideration:

A choice needs to be made for the input norm $\alpha_{1}$ and output norm $\beta_{L}$ , which depends on the application:

Output layer

We summarize the different norm choices and their resulting $\operatorname{lmo}$ s in Tables 3 and 4. Table 3 provides an overview of norm choices of output layers, while Table 4 provides choices for input layers under 1-hot encoded input.

Exclusively Sign

The sign-only update is of particular interest, since it permits efficient communication in distributed settings. We demonstrate its hyperparameter transferability of the optimal stepsize across width in LABEL:fig:GPT:shakespeare:sign of LABEL:app:experiments.

2 Hyperparameter Transfer

We formalize this intuition in LABEL:lemma:mu_transfer of LABEL:app:transfer:proof following the proof technique of Yang et al. (2023), which holds for losses including logistic regression and MSE, and activation functions including ReLU, GELU and Tanh. Specifically, we show that the so-called maximal update learning rate $\gamma^{*}$ (i.e., the learning rate that enables the hidden layer preactivations to undergo the largest possible change in a single update step) is independent of width. Thus, a learning rate tuned on a smaller model can be directly applied to a wider model without compromising the maximal update property.

Related Works

Yang & Hu (2021); Yang et al. (2022) showed that there exists a parameterization (i.e., a choice of initialization and layerwise stepsize scaling) for which the features in every single layer evolve in a width-independent manner. The so-called Maximal Update Parametrization ( $\mu$ P) allows transferring optimal hyperparameter from a small proxy model to a large model.

A relationship with the spectral norm was established in Yang et al. (2023). An operator norm perspective was taken in the modular norm framework of Large et al. (2024); Bernstein & Newhouse (2024a), which was used to show Lipschitz continuity with constants independent of width. We build on this perspective and propose the $1\rightarrow\infty$ operator norm, which leads to a sign update rule.

Steepest descent in a normed space

where $[\cdot]^{\sharp}:=\operatorname*{arg\,max}_{x\in\mathcal{X}}\braket{\cdot,x}-\tfrac{1}{2}\|x\|^{2}$ is the sharp-operator (Nesterov, 2012; Kelner et al., 2014).

The deterministic case is analyzed in Nesterov (2012); Kelner et al. (2014), and was extended to the stochastic case in Carlson et al. (2015b), with a particular empirically focus on the spectral norm, named as (preconditioned) stochastic spectral descent (SSD) (Carlson et al., 2015a, b, c). Their SSD algorithm is an instance of the Majorization-Minimization (MM) algorithms (Lange, 2016), which iteratively minimizes a locally tight upper bound.

The dualization in Bernstein & Newhouse (2024a) is also performed by the sharp-operator. In contrast to (6), uSCG and SCG are invariant to the magnitude of the gradients and do not need to compute the dual norm $\|\cdot\|_{*}$ , which cannot be computed independently across layers.

Unlike the $\operatorname{lmo}$ -based schemes, extending sharp-operator-based algorithms to handle constrained problems is nontrivial even in the vector case (El Halabi, 2018). Additionally, a practical concern of using, say, spectral norm projections in deep learning is that the model weights themselves can be dense (so the required SVD would be expensive), while gradients used in the $\operatorname{lmo}$ are usually low-rank (allowing efficient SVD approximations).

Muon

where the $\operatorname{lmo}$ corresponds implicitly to the spectral norm.

The accumulation $G^{k}$ can be written in terms of the averaged gradient $d^{k}$ in Algorithm 2. We have that $d^{k}=\alpha G^{k}$ by picking $\alpha=(1-\beta)$ . Since the $\operatorname{lmo}$ is scale invariant, $d^{k}$ and $G^{k}$ can be used interchangeably without changing the update. Thus, we can alternatively write Muon (with non-Nesterov based momentum) exactly as uSCG.

In practice Muon is only applied to hidden layers, thus excluding the first layer and the last layer for which Adam(W) is used. In contrast, we apply uSCG and SCG to all layers and demonstrate transferability of the stepsize.

Sign

SignSGD and the momentum variant Signum were brought to prominence and further analyzed in Bernstein et al. (2018) motivated by efficient communication for distributed optimization, while they are originally introduced with the dual norm scaling and used only for weight bias updates in Carlson et al. (2015a, b, c). These schemes are typically studied under the framework of steepest descent, which results in the $\|g^{k}\|_{1}$ stepsize scaling in (6) usually not present in practice as remarked in Balles et al. (2020).

Normalization

The LARS optimizer (You et al., 2017) uses normalized gradient and was shown to be particularly useful for large batch settings. The method can be viewed as performing normalized SGD with momentum (Cutkosky & Mehta, 2020) layerwise with a particular adaptive parameter-dependent stepsize.

Continuous greedy

With zero initialization, $x_{1}=0$ , and stepsize $\gamma_{k}=\gamma=1/n$ , uSCG recovers the stochastic continuous greedy method (Mokhtari et al., 2020; Vondrák, 2008), which can be used to solve DR-submodular maximization problems under Matroid polytope constraints.

LMO for deep learning

SCG for training neural networks has been suggested in Pokutta et al. (2020) and Lu et al. (2022), where optimization was specifically constrained to the $K$ -sparse polytope with increasing batch-size for handling stochasticity. Beyond these works, we provide convergence guarantees for SCG with constant batch-sizes and, introduce a principled framework for selecting effective constraints based on the input and output space geometries of the layers of the network.

Trust-region

The SCG method can be seen as a trust-region method with a linear surrogate. Usually, the surrogate is taken to be quadratic (cf. Wright (2006, Ch. 4)). We refer to Conn et al. (2000) for an extensive overview of trust-region methods.

Analysis

We begin by presenting the two main assumptions we will make to analyze Algorithms 1 and 2. The first is an assumption on the Lipschitz-continuity of $\nabla f$ with respect to the norm $\|\cdot\|_{\ast}$ restricted to $\mathcal{X}$ . We do not assume this norm to be Euclidean which means our results apply to the geometries relevant to training neural networks.

The gradient $\nabla f$ is $L$ -Lipschitz with $L\in(0,\infty)$ , i.e.,

Furthermore, $f$ is bounded below by $f^{\star}$ .

Our second assumption is that the stochastic gradient oracle we have access to is unbiased and has a bounded variance, a typical assumption in stochastic optimization.

With these assumptions we can state our worst-case convergence rates, first for Algorithm 1 and then for Algorithm 2.

These results show that, in the worst-case, running Algorithm 1 with constant momentum $\alpha$ guarantees faster convergence but to a noise-dominated region with radius proportional to $\sigma$ . In contrast, running Algorithm 1 with vanishing momentum $\alpha_{k}$ is guaranteed to make the expected dual norm of the gradient small but at a slower rate. Algorithm 2 exhibits the analogous behavior, as we show next.

Before stating the results for Algorithm 2, we emphasize that they are with constant stepsize $\gamma$ , which is atypical for conditional gradient methods. However, like most conditional gradient methods, we provide a convergence rate on the so-called Frank-Wolfe gap which measures criticality for the constrained optimization problem over $\mathcal{D}$ .

Finally, we remind the reader that the iterates of Algorithm 2 are always feasible for the set $\mathcal{D}$ by the design of the update and convexity of the norm ball $\mathcal{D}$ .

Experiments

For computing the $\operatorname{lmo}$ of layers using a spectral norm constraint, we use the efficient implementation provided in Jordan et al. (2024b) of the Newton-Schultz iteration proposed in Bernstein & Newhouse (2024b). In this section, Muon (Jordan et al., 2024b) refers to the version used in practice, which uses AdamW for the first layer and last layer and Nesterov type momentum.

We build on the excellent modded-nanogpt codebase (Jordan et al., 2024a), which makes the following modernizations to Karpathy (2023): rotary embeddings is used instead of positional embeddings, RMS norm is used instead of LayerNorm, and linear decay schedule instead of a cosine stepsize. Scion and Unconstrained Scion use the (Sign $\rightarrow$ Spectral $\rightarrow$ Sign) configuration with scaling factors in accordance with Tables 3 and 4. We train for $5100$ iterations with a batchsize of $512$ on the FineWeb dataset (see LABEL:tbl:hyperparams:nanoGPT regarding hyperparameters). In comparison with Adam, both Muon and (Unconstrained) Scion do not require learning rate warmup. We sweep over stepsizes and model width in Figure 1.

From Figure 1, we observe that the optimal stepsize of Scion and Unconstrained Scion transfer across model width as oppose to Adam and Muon. Even when the Muon optimizer is tuned on the largest model size it achieves a validation loss of 2.988 in comparison with 2.984 of Unconstrained Scion. Our methods completely remove the need for using Adam otherwise present in the Muon implementation, which permits an implementation that only requires storing one set of weights and one set of gradient (stored in half-precision) across all layers (see LABEL:app:impl). The experiments additionally demonstrates that our method works for weight sharing.

B model

Using the optimal configuration of the 124M parameter proxy model, we perform a large model experiment on a 3B parameter model, which also increases the depth. Specifically, we take the embedding dimension to be 2560 and the depth to be 36. We observe in Table 5 that Unconstrained Scion outperforms all other methods. The loss curve is provided in LABEL:fig:NanoGPT:3B:loss-curve of LABEL:app:experiments.

Large batches

To test the effect of large batches we fix the total number of tokens for the 124M parameter model, and sweep over the batch sizes while rescaling the total number of steps accordingly. The stepsize is optimized for each combination of batch size and optimizer. We observe that (Unconstrained) Scion is better at maintaining a low validation loss with increasing batch size than the baselines (cf., Figure 2).

Image classification

We additionally test on a convolutional neural network (CNN) on the CIFAR10 dataset. We focus on Scion since norm control of the parameters is important for the setting. We use the configuration (Spectral $\rightarrow$ Spectral $\rightarrow$ Sign) and sweep over width and stepsize. The explicit control on the norm provided by Scion circumvents the need for the Frobenius norm normalization of the weights present in the base implementation (Jordan, 2024). The results are shown in Figure 3, which demonstrates the transferability of the optimal stepsize.

Acknowledgements

We thank Leena Chennuru Vankadara and Jeremy Bernstein for helpful discussion. This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a06 on Alps. This work was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_205011. This work was supported by Hasler Foundation Program: Hasler Responsible AI (project number 21043). Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-24-1-0048. ASF was supported by a public grant from the Fondation Mathématique Jacques Hadamard.