Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer

Introduction

The de facto approach for improving the performance of vision and language models today is scale: larger models are trained on more data for longer. Empirically, it has been observed that the benefit of scale often follows a predictable power law in which the performance $f(x)$ (e.g. error rate or log-perplexity) satisfies $f(x)\sim\beta x^{-c}+\varepsilon_{\infty}$ for some $\beta,c>0$ as one varies the scaling dimension $x$ (e.g. data or model size), if the remaining dimensions are not bottlenecks. Here, $\varepsilon_{\infty}$ is the irreducible loss.
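
To make the power law concrete, the following minimal Python sketch evaluates $f(x)=\beta x^{-c}+\varepsilon_{\infty}$ for hypothetical constants (chosen only for illustration, not taken from any fit). It shows that each tenfold increase in $x$ shrinks the reducible part of the loss by the constant factor $10^{-c}$, while the irreducible loss stays fixed.

```python
def power_law(x, beta, c, eps_inf):
    """Saturating power law f(x) = beta * x**(-c) + eps_inf."""
    return beta * x ** (-c) + eps_inf

# Hypothetical constants, for illustration only.
beta, c, eps_inf = 2.0, 0.5, 0.1

for x in (1e2, 1e3, 1e4):
    loss = power_law(x, beta, c, eps_inf)
    print(f"x={x:.0e}  f(x)={loss:.4f}  reducible={loss - eps_inf:.4f}")

# Ratio of reducible losses after a 10x increase in x; equals 10**(-c).
ratio = (power_law(1e3, beta, c, eps_inf) - eps_inf) / (
    power_law(1e2, beta, c, eps_inf) - eps_inf
)
print(f"reduction per 10x: {ratio:.4f}")
```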

However, the simple power-law relation becomes more complicated when compute is considered. In this case, power laws are observed only along the compute-optimal frontier. Otherwise, scaling up the model size for a fixed compute budget can deteriorate performance (see and Figure 4). Since one often has a fixed compute budget in mind (e.g. available hardware and time), one should pick the model size that maximizes performance subject to the compute budget constraint, which may imply not training until convergence. Indeed, this approach was used successfully in the recent Chinchilla that outperformed its predecessor Gopher despite being $4\times$ smaller in size.

Unfortunately, in both and among others, the “size” of a model is equated with its parameter count, with no special consideration for model “shape dimensions”, such as “depth” or “width”. The rationale behind this choice follows from the surprising observation that the transformer shape had little impact on its scaling behavior in language modeling (LM) when performance is measured upstream (e.g. using log-perplexity) . Nevertheless, follow-up analysis suggests that shape plays a pivotal role in other domains, such as in machine translation and also in language modeling for downstream performance , with recent works even advocating for extreme aspect ratios, such as a single wide attention layer .

In vision, in particular, much earlier works using convolutional neural networks (CNNs) pointed out that the parameter count is indeed a poor predictor of performance. For example, scaling all dimensions in ResNets is more effective than scaling a single dimension such as depth alone. In addition, scaling width is often more effective than depth, especially for small models . Hence, optimizing the “shape” of transformers seems worthwhile.

In this work, we present SoViT: a shape-optimized vision transformer that matches the performance of much larger models despite being pre-trained with equal compute. It is derived from a recipe we introduce for optimizing the shape of neural architectures, such as their depth and width. A principled approach for scaling multiple dimensions is advantageous because although one can scale dimensions via brute-force search, this requires extensive computation and often remains sub-optimal. Our recipe allows us to extrapolate without having to conduct an extensive set of experiments. For example, after only 115 experiments, we identify a scaling strategy in ViT for all three dimensions: width (internal representation), depth, and MLP size. For comparison, requires over 400 experiments to optimize a single dimension (the parameter count) alone.

One major finding is that small vision models can perform on par with larger ones with the same compute if we optimize their shape. In language, recent works have demonstrated the value of scaled-down architectures, such as the Chinchilla model discussed earlier — a 70B parameter model that outperforms the 280B-parameter Gopher and 175B-parameter GPT3 — as well as LLaMA with its 13B parameter variant outperforming GPT3 on most benchmarks . By introducing SoViT, we establish this phenomenon in vision as well.

Figure 1 summarizes how the various shape dimensions are scaled in SoViT (see Section 3 for derivation). The MLP dimension is scaled faster than depth, which in turn is scaled faster than width. When summarized by their parameter count (rightmost plot), compute-optimal ViTs are smaller than those previously used. With this scaling strategy, we find the shape of a ViT for the compute-equivalent of ViT-g/14 pretrained on 16B JFT images. We call this $2.5\times$ smaller model SoViT-400m/14. It achieves 90.3% fine-tuning accuracy on ILSVRC2012 and 82.2% zero-shot accuracy in the locked-image text tuning (LiT) setup. We further evaluate SoViT-400m/14 on captioning, VQA and panoptic segmentation and highlight some results in Figure 2.

Statement of Contribution. In summary, our contribution is to:

Introduce a new method for optimizing the shape of neural networks, such as their depth and width. Our technique expands and improves previous methods by optimizing multiple shape dimensions jointly while requiring significantly fewer experiments.

Demonstrate the effectiveness of scaled-down architectures in vision. We optimize ViT for the compute-equivalent of ViT-g/14, leading to a smaller, faster model of equal quality.

Present new qualitative insights for scaling vision transformers, such as on how to scale individual shape dimensions and how optimal ViT shapes vary across domains.

Conduct extensive evaluation across tasks like image classification, image captioning, VQA, zero-shot classification and panoptic segmentation, identifying both gains and limitations.

Related Work

Optimizing training for compute has received a significant amount of attention in recent years, partly due to the financial and environmental costs of training large models . However, conflicting results are sometimes reported. For example, in language modeling, argues that the model size should be scaled faster than the data size, implying it is compute optimal to “undertrain” large models. Similar conclusions are found in . On the other hand, argues that the model size should be scaled uniformly with the data size, and highlights that transformers were not trained long enough, leading to some recent efforts “overtraining” their models instead. Our analysis for ViT in Section 4 agrees partially with the latter result.

Scaling the size of vision transformers has led to remarkable results achieving, for instance, 90.4% top-1 accuracy on ImageNet (ILSVRC2012) with 2 billion parameters and 90.9% top-1 accuracy with 4 billion parameters. When scaled to 22 billion parameters, ViT exhibits state-of-the-art alignment to human visual perception in terms of shape/texture bias, among other findings.

Despite the clear benefit of scale, there has been little investigation into optimally scaling the shape of ViTs. suggest preferentially increasing depth before scaling other dimensions uniformly. For ViT, however, they only consider small ViT-S and ViT-B models and the reported accuracy improvement comes with an increase in FLOPs of up to $\times 4$ , making it difficult to draw conclusions about the suggested shape’s quality. In contrast recommend scaling width over depth, but the authors do not observe any improvement when applying their strategy to ViT.

Our analysis draws inspiration from “compound scaling” in MobileNet and EfficientNet , while differing in significant ways. EfficientNet uses an exhaustive grid search to determine the optimal architecture for a fixed increase in compute (e.g. $\times 2$ ). Afterwards, each dimension is scaled up by the same ratio with every subsequent increase in compute. In contrast, we expand scaling laws to simultaneously account for model size and compute beyond the efficient frontier and leverage them to derive the optimal scaling exponents for each dimension separately, as outlined in Section 3.

Throughout our analysis, we use downstream metrics, e.g. ImageNet 10-shot error, when measuring performance instead of upstream metrics. This follows recent reports arguing that upstream performance may not reflect downstream performance in language and vision .

We use GFLOPs as a proxy for compute since it is hardware-agnostic and correlates well with actual wall-clock core-hours (see Figure 4). However, GFLOPs can have limitations and may not be a perfect predictor for the metric of interest (e.g. core hours) in all model and hardware types. Note that we focus on scaling the shape of the architecture, not on improving its training protocol, which can be similarly beneficial .

Scaling Strategy

The goal of optimizing shape for fixed compute $\mathbf{t}$ is to identify $\mathbf{x}^{\star}$ (depending on $\mathbf{t}$) such that:

$$f(\mathbf{x}^{\star},\,\mathbf{t})-\inf_{\mathbf{x}}f(\mathbf{x},\,\mathbf{t})\leq\epsilon, \tag{1}$$

for some small tolerance $\epsilon>0$. Due to modeling assumptions, approximations, and the finite possible number of experiments conducted, we cannot hope for $\epsilon=0$ and have to tolerate a small excess loss.

Single Dimension. As demonstrated in Figure 3, the shape of a pretrained vision transformer has an impact on its downstream performance. To determine an optimal shape scaling strategy, we begin by considering both compute $\mathbf{t}$ and a single shape dimension $\mathbf{x}_{k}$ for $k\in[D]$ , such as depth. In prior works, optimizing a single dimension $\mathbf{x}_{k}$ for compute involves running a large number of experiments in order to identify the Pareto optimal frontier, from which power laws on $\mathbf{x}_{k}$ or $\mathbf{t}$ are derived . Since this is expensive, we propose the following joint functional form instead:

$$f_{k}(\mathbf{x}_{k},\,\mathbf{t})\sim\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+\left(\beta_{k}\mathbf{x}_{k}^{b_{k}}+\xi_{k}\right)\mathbf{t}^{-c}+\varepsilon_{k}, \tag{2}$$

where $\alpha_{k},a_{k},\beta_{k},b_{k},c,\xi_{k},\varepsilon_{k}>0$ . Here, $f_{k}$ focuses on the dimension $k$ alone and assumes that all other shape dimensions $j\neq k$ are sufficiently large such that they do not constitute a bottleneck. We also assume that data is unlimited so that there is no risk of overfitting. We estimate the parameters in (2) by minimizing the relative error. In (2), $a_{k}$ are scaling exponents when varying the corresponding shape dimension in the compute-unbounded regime, $c$ is the data scaling exponent, while $b_{k}$ relates to the impact of the model shape on compute.

Our argument for this particular functional form is six-fold:

If compute is unbounded, we recover the familiar power law relation on model size $f_{k}(\mathbf{x}_{k})\sim\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+\varepsilon_{k}$. In addition, increasing the model size $\mathbf{x}_{k}$ while keeping the data size fixed does not imply that $f_{k}(\mathbf{x}_{k},\,\mathbf{t})\to\varepsilon_{k}$ because $\mathbf{x}_{k}^{b_{k}}$ can increase faster than $\mathbf{t}^{c}$ in (2).

For any fixed model size, the relation above reduces to the power law $f_{k}(\mathbf{t})\sim A\mathbf{t}^{-c}+B$ , where $A=\beta_{k}\mathbf{x}_{k}^{b_{k}}+\xi_{k}$ and $B=\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+\varepsilon_{k}$ . Since the model size is fixed, $\mathbf{t}$ is proportional to the size of the data. Such data scaling laws have been demonstrated extensively in various domains .

For fixed compute, the relation w.r.t. $\mathbf{x}_{k}$ is non-monotone, quasiconvex (see Appendix A), in agreement with empirical measurements . See IsoFlop curves in Figure 4.

Arguments for power law behavior using space partitioning suggest that the exponent $c$ is independent of the shape dimension. In particular, $c=\Theta(1/d)$ , where $d$ is the intrinsic dimension of the data manifold . From this, we conclude that assuming the functional form in (2) for every shape dimension separately cannot lead to any contradictions since this assumption is satisfied by the decomposable loss:

$$f(\mathbf{x},\,\mathbf{t})=\sum_{k}\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+\sum_{k}\beta_{k}\mathbf{x}_{k}^{b_{k}}\mathbf{t}^{-c}+\xi\,\mathbf{t}^{-c}+\varepsilon_{\infty}, \tag{3}$$

for some constants $\xi,\varepsilon_{\infty}>0$ .

When optimizing the shape dimension $\mathbf{x}_{k}$ for fixed compute $\mathbf{t}$ , the optimal value $\mathbf{x}_{k}^{\star}$ is:

$$\mathbf{x}_{k}^{\star}=\left(\frac{\alpha_{k}a_{k}\,\mathbf{t}^{c}}{\beta_{k}b_{k}}\right)^{\frac{1}{a_{k}+b_{k}}}=O\left(\mathbf{t}^{s_{k}}\right),\qquad\text{where}\quad s_{k}=\frac{c}{a_{k}+b_{k}}. \tag{4}$$

Recall that the scaling exponent $s_{k}$ in (4) is positive because $a_{k},b_{k},c>0$ . Using the relation (4), we rearrange the terms in Eq. (2), and obtain the scaling law for model performance along the compute-optimal frontier (Appendix A):

$$f_{k}(\mathbf{x}_{k},\,\mathbf{t})=F\,\mathbf{x}_{k}^{-a_{k}}+G\,\mathbf{t}^{-c}+\varepsilon_{k}, \tag{5}$$

for some constants $F$ and $G$ , which is a sum of power law terms involving the model size and compute. Indeed, this decomposition has been demonstrated to hold within the compute-optimal frontier by and .

Eq. (2) fits empirical measurements and extrapolates accurately as well, see Figure 4.
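
The arguments above can be checked numerically. The following is a minimal Python sketch, assuming the joint form $f_{k}=\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+(\beta_{k}\mathbf{x}_{k}^{b_{k}}+\xi_{k})\mathbf{t}^{-c}+\varepsilon_{k}$ implied by the constants $A$ and $B$ given earlier, and using hypothetical parameter values (not fitted values from our experiments). It compares the minimizer obtained from the first-order condition with a brute-force search at fixed compute, and confirms that the optimum grows as $\mathbf{t}^{s_{k}}$ with $s_{k}=c/(a_{k}+b_{k})$.

```python
def f_k(x, t, alpha, a, beta, b, xi, eps, c):
    """Joint form: alpha*x^-a + (beta*x^b + xi)*t^-c + eps."""
    return alpha * x ** (-a) + (beta * x ** b + xi) * t ** (-c) + eps

def optimal_x(t, alpha, a, beta, b, c):
    """Minimizer over x for fixed t, obtained by setting df/dx = 0."""
    return (alpha * a * t ** c / (beta * b)) ** (1.0 / (a + b))

# Hypothetical parameters, for illustration only.
params = dict(alpha=5.0, a=0.6, beta=0.02, b=0.7, xi=1.0, eps=0.2, c=0.5)
t = 1e6

x_star = optimal_x(t, params["alpha"], params["a"],
                   params["beta"], params["b"], params["c"])

# Brute-force search on an exponentially spaced grid centered at x_star.
grid = [x_star * 1.05 ** i for i in range(-60, 61)]
losses = [f_k(x, t, **params) for x in grid]
best = grid[losses.index(min(losses))]

# Scaling exponent: a 10x increase in compute scales the optimum by 10**s_k.
s_k = params["c"] / (params["a"] + params["b"])
x_star_10 = optimal_x(10 * t, params["alpha"], params["a"],
                      params["beta"], params["b"], params["c"])
growth = x_star_10 / x_star

print(f"closed-form x* = {x_star:.1f}, brute-force best = {best:.1f}")
print(f"growth per 10x compute = {growth:.4f}, 10**s_k = {10 ** s_k:.4f}")
```

Because the loss at fixed compute first decreases and then increases in $\mathbf{x}_{k}$, the brute-force search and the closed form agree on the grid point at the center of the sweep.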

Multiple Dimensions. Next, we expand upon the previous approach by incorporating multiple dimensions. To reiterate, our method involves both a functional form (2) and a novel procedure. Our procedure significantly decreases the number of large-scale experiments required to identify compute-optimal architectures, by an order of magnitude compared to prior work .

Star Sweep – Conducting a brute-force grid search to estimate scaling parameters across all dimensions is expensive, since it requires $O(2^{D})$ experiments to cover the search space. Instead, we demonstrate that a “star sweep” is sufficient: (1) starting from a large model $\mathbf{x}^{(c)}$ (the star center), we vary a single dimension $k\in[D]$ at a time in an exponentially-spaced grid, such that all values are much smaller than $\mathbf{x}^{(c)}_{k}$ . In our experiments, for instance, we optimize three shape parameters: width, depth, and MLP dim (see Section 4 for a brief definition of each dimension). Our star center is $\mathbf{x}^{(c)}=(1968,\,40,\,6144)$ ; i.e. has width 1968, depth 40, and MLP dim 6144. When varying MLP dim in the star sweep, we use the grid $(1088,\,1360,\,1728,\,2160,\,2592,\,3072)$ , corresponding to about 20% increase in each step, while fixing width to 1968 and depth to 40. We do this to ensure that other dimensions do not form a bottleneck when estimating the parameters in (2). This gives us the scaling exponents $s_{k}$ in (4).
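
As an illustration of the star sweep, the sketch below enumerates the architectures obtained by varying one dimension at a time around the star center, using the grids reported in Section 4. The helper name `star_sweep` and the dictionary keys are hypothetical and only serve this example.

```python
# Star center: (width, depth, MLP dim) = (1968, 40, 6144).
center = {"width": 1968, "depth": 40, "mlp_dim": 6144}

# One exponentially spaced grid per shape dimension.
grids = {
    "width": (608, 768, 928, 1088, 1328, 1648),
    "depth": (8, 10, 12, 16, 20, 24),
    "mlp_dim": (1088, 1360, 1728, 2160, 2592, 3072),
}

def star_sweep(center, grids):
    """Vary a single dimension at a time, keeping the others at the center."""
    configs = []
    for dim, values in grids.items():
        for value in values:
            cfg = dict(center)
            cfg[dim] = value
            configs.append(cfg)
    return configs

configs = star_sweep(center, grids)
full_grid_size = 1
for values in grids.values():
    full_grid_size *= len(values)

print(f"star sweep: {len(configs)} architectures")
print(f"full cross-product of the same grids: {full_grid_size} architectures")
```

The star sweep grows linearly with the number of dimensions, whereas the cross-product of the same grids grows exponentially.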

Grid Sweep – The second stage is a grid sweep for small models trained for short compute. Depending on the number of shape dimensions involved, the cost of running this grid sweep can be negligible. Its goal is to identify a single architecture $\mathbf{x}^{(0)}$ that lies in the Pareto optimal frontier for small compute as illustrated in Figure 3. This is important since a suboptimal $\mathbf{x}^{(0)}$ can significantly skew results . Our grid sweep identifies $\mathbf{x}^{(0)}$ to be $(608,\,10,\,928)$ , the blue star in Figure 3. The advantage of this step is to absorb the leading coefficients in $\mathbf{x}_{k}^{\star}=O(\mathbf{t}^{s_{k}})$ in (4) so that the star sweep focuses on estimating the exponents $s_{k}$ alone. We demonstrate in Figure 5 that the scaling exponents $s_{k}$ are robust to the choice of the evaluation metric $f$ . In Appendix B.3, we discuss important considerations that were taken into account during this analysis.

Scaling. Finally, we scale all dimensions jointly. Starting from the small compute-optimal architecture $\mathbf{x}^{(0)}$ and the amount of compute $\mathbf{t}^{(0)}$ it is optimal for, suppose we increase compute by a factor $\tau>1$ (i.e. the new compute is $\tau\,\mathbf{t}^{(0)}$). By treating this increment $\tau$ as a sequence of $D$ smaller increments of size $\tau^{w_{k}}$ each with $\sum_{k}w_{k}=1$, an increase in compute by a factor of $\tau$ is accompanied by an increase in every shape dimension $k$ by a factor of $\tau^{s_{k}w_{k}}$, respectively, since $\mathbf{x}_{k}^{\star}=O(\mathbf{t}^{s_{k}})$. In this work, we adopt the simplest strategy of setting $w_{k}=1/D$, but acknowledge that more sophisticated approaches might lead to better results.
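
The joint scaling step can be sketched as follows. This is a minimal illustration, assuming that dimension $k$ grows by $\tau^{s_{k}w_{k}}$ (which follows from $\mathbf{x}_{k}^{\star}=O(\mathbf{t}^{s_{k}})$ applied to an increment of size $\tau^{w_{k}}$), and using the rounded exponents reported in Section 4. The function name `scale_shape` is hypothetical, and no rounding of the resulting dimensions is applied since the text does not specify one.

```python
def scale_shape(x0, exponents, tau, weights=None):
    """Scale every shape dimension for a compute increase by a factor tau.

    Assumes dimension k grows by tau ** (s_k * w_k), with sum(w_k) == 1.
    """
    dims = list(x0)
    if weights is None:
        # Simplest strategy: w_k = 1 / D for every dimension.
        weights = {k: 1.0 / len(dims) for k in dims}
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return {k: x0[k] * tau ** (exponents[k] * weights[k]) for k in dims}

# Small compute-optimal shape found by the grid sweep.
x0 = {"width": 608, "depth": 10, "mlp_dim": 928}
# Rounded scaling exponents s_k (see Section 4).
s = {"width": 0.22, "depth": 0.45, "mlp_dim": 0.60}

scaled = scale_shape(x0, s, tau=100.0)
growth = {k: scaled[k] / x0[k] for k in x0}
for k in x0:
    print(f"{k}: x{growth[k]:.2f}")
```

Under these assumptions the MLP dimension grows fastest, followed by depth and then width.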

Shape-optimized ViT

We implement the scaling strategy in Section 3 in vision transformers pretrained on JFT-3B, a proprietary dataset with about 30k classes and around 3 billion examples , using the Adam optimizer . As mentioned in Section 3, we focus on optimizing three shape dimensions: width (size of internal representation), depth (number of encoder blocks) and MLP dim (hidden dimension). Following , we remove near-duplicate examples between upstream JFT-3B data and all the downstream train and test sets. Appendix B contains the full set of hyper-parameters used in the experiments, including full details about the star and grid sweeps described in Section 3. We fix the patch size in our analysis to $14\times 14$ , but study “flexifying” to arbitrary sequence lengths following in Section 5.5.

As an evaluation metric $f$ , we consider two domains: (1) image classification, with ImageNet linear 10-shot error rate as the metric, and (2) image-to-text LiT-decoding following . In the latter case, the evaluation metric $f$ is an average of four perplexity scores: COCO captioning, optical character recognition (OCR), and question answering (VQAv2 and GQA). Refer to for details about the LiT-decoder setup. By considering such distinct domains, our goal is to identify similarities and differences (if any) in how to optimally scale the shape of vision transformers (ViT).

We use the aforementioned star center $\mathbf{x}^{(c)}=(1968,\,40,\,6144)$ as our starting point. To estimate the scaling exponents $s_{k}$ in (4) for each dimension separately, we vary width in the grid $(608,\,768,\,928,\,1088,\,1328,\,1648)$ , depth in the grid $(8,\,10,\,12,\,16,\,20,\,24)$ , and MLP dim in the grid $(1088,\,1360,\,1728,\,2160,\,2592,\,3072)$ . As discussed in Section 3, we use an exponential spacing with all values being much smaller than in the star center $\mathbf{x}^{(c)}$ . Following , we evaluate quality using few-shot linear transfer by using pre-trained models to extract features and fitting a linear regression head mapping them to the one-hot encoding of the target labels.

The individual scaling exponents we find are $s_{\text{depth}}\approx 0.45$ , $s_{\text{width}}\approx 0.22$ , and $s_{\text{MLP}}\approx 0.6$ . Importantly, these exponents are quite robust to the choice of the metric. As shown in Figure 5, changing the metric from ImageNet 10-shot to either 5-shot or 25-shot can change the best-fit estimate of the other exponents $a_{k},b_{k},c_{k}$ in (2) but the scaling exponent $s_{k}$ is relatively unchanged, since it is formed as a ratio over other exponents. In addition, the data scaling exponent $c$ appears to be independent of the choice of the shape dimension. As mentioned earlier, this is consistent with space partitioning arguments for power law scaling .

The estimated scaling exponents $s_{k}$ point to the following picture:

MLP dimension should be scaled faster than depth, and depth faster than width.

The size of ViT, as quantified by its parameter count, is scaled more slowly than the allocated compute. More precisely, for every increment in compute by a factor of $10$ , the parameter count of the optimized model shape increases by a factor of $\approx 2.5$ .

As demonstrated in Figure 1, small ViT models can match the performance of much larger ones when their shape and training duration are jointly optimized for the available compute.
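
The second point can be sanity-checked with a back-of-the-envelope calculation. The sketch below is an approximation under stated assumptions: the parameter count is estimated as `depth * (4 * width**2 + 2 * width * mlp_dim)`, i.e. only the attention and MLP weight matrices of the encoder blocks (biases, layer norms, embeddings and heads are ignored); each dimension is assumed to grow by $\tau^{s_{k}/D}$; and the rounded exponents from above are used. Under these assumptions the growth factor per tenfold increase in compute lies between about 2 and 2.7, in the same range as the $\approx 2.5$ quoted above, and it depends on the starting shape.

```python
def approx_params(width, depth, mlp_dim):
    """Approximate encoder weights: 4*w^2 (attention) + 2*w*mlp (MLP) per block."""
    return depth * (4 * width ** 2 + 2 * width * mlp_dim)

x0 = (608, 10, 928)      # (width, depth, MLP dim) from the grid sweep
s = (0.22, 0.45, 0.60)   # rounded exponents for (width, depth, MLP dim)
D = len(x0)

def shape_at(tau):
    """Shape after increasing compute by tau, assuming w_k = 1/D."""
    return tuple(x * tau ** (sk / D) for x, sk in zip(x0, s))

ratio = approx_params(*shape_at(10.0)) / approx_params(*x0)
print(f"approximate parameter growth per 10x compute: {ratio:.2f}x")
```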

We validate these predictions by optimizing the shape of ViT for the compute-equivalent of ViT-g/14 when the latter is pretrained on 16 billion JFT-3B examples as done in . The resulting model, SoViT-400m/14, is significantly smaller and faster, yet equally competitive. It has a width of 1152, depth 27, and MLP dim 4304. Fine-tuning it on ImageNet results in a 90.3% top-1 accuracy, see Figure 2. Section 5 presents various other evaluations.
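
As a rough check of the model size, the sketch below counts the attention and MLP weight matrices of the encoder blocks for the shape given above. The formula ignores biases, layer norms, embeddings and heads, so it is an approximation rather than an exact count. The ViT-g/14 shape used for comparison (width 1408, depth 40, MLP dim 6144) is not stated in this section and is an assumption taken from the ViT-g/14 literature.

```python
def approx_params(width, depth, mlp_dim):
    """Approximate encoder weights: 4*w^2 (attention) + 2*w*mlp (MLP) per block."""
    return depth * (4 * width ** 2 + 2 * width * mlp_dim)

sovit_400m = approx_params(width=1152, depth=27, mlp_dim=4304)

# Assumed ViT-g/14 shape (not given in the text above).
vit_g = approx_params(width=1408, depth=40, mlp_dim=6144)

print(f"SoViT-400m/14: ~{sovit_400m / 1e6:.0f}M encoder weights")
print(f"ViT-g/14:      ~{vit_g / 1e6:.0f}M encoder weights")
print(f"size ratio:    ~{vit_g / sovit_400m:.2f}x")
```

The resulting ratio is consistent with describing SoViT-400m/14 as roughly $2.5\times$ smaller than ViT-g/14.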

In Figure 6, we also optimize the shape of ViT for the compute-equivalent of ViT-B/14 pretrained on 4 billion examples of JFT-3B using ImageNet 10-shot error rate as an evaluation metric, resulting in SoViT-150m/14. It has a width of 880, depth 18, and MLP dim 2320. As shown in Figure 6, optimizing the shape of ViT leads to a significant improvement in performance, from 76.6% in ViT-B/14 to 78.5% in SoViT-150m/14 when both are trained for the same amount of compute. We also vary the optimized shape by decreasing/increasing one dimension at a time and retraining the corresponding model while keeping the total compute fixed. As shown in Figure 6, small deviations from the predicted optimal shape can lead to a notable drop in performance, especially for width since it has the smallest scaling exponent (see Figure 5). We also include in Figure 6 (left) a comparison with a model, denoted B-150m, which has the same shape as ViT-B/14 but the same size as SoViT-150m/14. This confirms that while optimizing the model size improves performance, optimizing the shape improves it even further.

Importantly, the model shapes in Figure 6 bear no resemblance to those observed during the star or grid sweeps. To recall, the star sweep is centered around an architecture $\mathbf{x}^{(c)}$ whose shape dimensions are significantly larger than in ViT-B/14, whereas the grid sweep pretrains models that are substantially smaller and for only 600M examples. The ability of our strategy to accurately identify a near-optimal model shape within this context underscores its robust extrapolation capability.

4.2 Multitask Decoder

Besides image classification, there has been a significant interest in multimodal applications, mostly fueled by the convergence across language and vision on the transformer architecture . In particular, an encoder-decoder transformer with an autoregressive decoder is a popular choice because it allows reusing pretrained image encoders. We repeat the analysis conducted in Section 4.1 to optimize the shape of the image encoder, while fixing the decoder architecture to two layers as was used in . Further details are provided in Appendix C. As an evaluation metric $f$ , we use the average of four perplexity scores: COCO captioning , OCR , VQAv2 and GQA , without normalization since they share a similar scale. For the learning rate and weight decay hyper-parameters, we conduct a sweep where we vary the learning rate in $\{10^{-3},\,3\times 10^{-4},\,10^{-4}\}$ and the weight decay in $\{3\times 10^{-4},\,10^{-4},\,3\times 10^{-5}\}$ . We pick the largest learning rate and the corresponding weight decay that result in a stable training run (i.e. smooth training loss curve and gradient norms) for both the largest and smallest image encoder architectures. From this, a learning rate of $3\times 10^{-4}$ and a weight decay of $10^{-4}$ are selected.

Using this analysis, the derived scaling exponents are approximately $0.25,0.49$ and $0.62$ for width, depth and MLP size, respectively. Hence, whereas the optimal shape dimensions in small architectures can be quite different between image classification and multitask decoding, as shown in Figure 3, the scaling exponents are nearly identical, so the same scaling recipe is used in both domains.

Evaluations

Overview. We now evaluate SoViT-400m/14 in various contexts to verify whether it broadly matches ViT-g/14’s performance, or only in the ILSVRC2012 10-shot metric it was optimized for. The settings we cover are few-shot, frozen linear probes on ImageNet, zero-shot transfer, image-language multitasking including captioning, OCR, and question answering, as well as panoptic segmentation. In each of these settings, we compare SoViT-400m/14 to ViT-L/16 and ViT-g/14, all trained on the

Compute. Experiments are executed on Tensor Processing Units (TPU). SoViT-400m/14 is pretrained on 40 billion examples, which amounts to 9T GFLOPs and 230K TPUv3 core-hours. ViT-g/14 was pretrained on 16 billion examples, corresponding to 9T GFLOPs and 210K TPUv3 core-hours.
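
The compute figures above imply very different costs per training example for the two models. The short calculation below is only arithmetic on the rounded numbers quoted in this paragraph (9T GFLOPs total for each model), so the per-example values are approximate.

```python
TOTAL_GFLOPS = 9e12  # 9T GFLOPs of pretraining compute, as quoted for both models

examples_seen = {"SoViT-400m/14": 40e9, "ViT-g/14": 16e9}

# Pretraining compute spent per example.
per_example = {name: TOTAL_GFLOPS / n for name, n in examples_seen.items()}

for name, gflops in per_example.items():
    print(f"{name}: ~{gflops:.1f} GFLOPs per pretraining example")

cost_ratio = per_example["ViT-g/14"] / per_example["SoViT-400m/14"]
print(f"ViT-g/14 spends ~{cost_ratio:.1f}x more compute per example")
```

At equal total compute, the smaller model sees 2.5 times as many examples, which is the trade-off the shape optimization exploits.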

We verify classification performance in three common and widely useful setups: full fine-tuning, linear probes on the frozen model, and few-shot linear classification.

Fine-tuning on ImageNet. Pre-trained image encoders are most commonly evaluated by fine-tuning them on the ILSVRC2012 classification task. The detailed fine-tuning settings are provided in Appendix E. One important aspect is to increase image resolution as a way of further increasing the capacity of the pre-trained model during fine-tuning. Table 1 shows the performance of SoViT-400m/14 in comparison with ViT-L/16 and ViT-g/14 fine-tuned at various resolutions, along with a few more representative models from the literature. The results confirm that SoViT-400m/14 achieves the goal of matching ViT-g/14 while being significantly smaller.

Linear probing on ImageNet. The quality of the pre-trained representation learned by the model is often more directly assessed by performing linear probes, meaning learning a linear classifier on top of unmodified, frozen output features from the model. We present results of this evaluation on the full ImageNet-1k dataset in Table 2, including robustness evaluations of the learned probe according to the ReaL, ImageNet-v2, ImageNet-Renditions, ImageNet-Adversarial, and ObjectNet test sets. SoViT-400m/14 is generally on par with ViT-g/14 despite its smaller output width.

Broad few-shot linear transfer. We follow and evaluate a closed-form linear regression probe for 10-shot classification across a wide range of classification tasks in Table 3. Again, SoViT-400m/14 performs on par with ViT-g/14 across the board.

5.2 Contrastive image-text tuning

Next, we follow the locked-image text tuning (LiT) recipe on the WebLI dataset to add zero-shot classification abilities to the pre-trained ViT-L/16, SoViT-400m/14 and ViT-g/14 image encoders. In this setup, a new text encoder is trained using the contrastive image-text matching objective . See Appendix D for details. Table 4 (second column) shows that SoViT-400m/14 is competitive with ViT-g/14, and substantially better than ViT-L/16.

5.3 Multitask Decoding

We also evaluate the three pretrained ViT models in multitask decoding as described in Section 4.2, where we follow the setup studied in . We fix the decoder architecture to two layers since it was found to perform well . For evaluation, we report COCO CIDEr , OCR , VQAv2 and GQA accuracy and log-perplexity. In brief, the CIDEr score measures the similarity between a generated caption and reference captions, considering $n$ -gram statistics, OCR evaluates optical character recognition, whereas both VQAv2 and GQA are question-answering evaluations. Results are summarized in Table 4. SoViT-400M performs on par with ViT-g/14.

5.4 Panoptic Segmentation

Additionally, we evaluate SoViT-400m/14 on panoptic segmentation, which is a challenging dense scene understanding task, by closely following the setup in UViM. At a high level, the UViM panoptic segmentation model consists of a visual image encoder and a decoder which maps the image representation to an intermediate code. The code is later decoded to the panoptic segmentation mask using a fixed VQVAE model, which was pretrained on panoptic masks. In our experiments we initialize UViM’s image encoder with ViT-L/16, SoViT-400m/14 and ViT-g/14.

Following , we train the UViM model using the COCO panoptic dataset (with $512\times 512$ input resolution) and report the PQ metric. We achieve 43.5, 43.7 and 44.8 PQ points for ViT-L/16, SoViT-400m/14 and ViT-g/14 respectively. Our results indicate that dense segmentation tasks can be a limitation of the proposed optimal model shape, and a different model shape might be derived in this domain. We leave this investigation for future work.

5.5 Flexifying SoViT-400M

Finally, since we do not include the patch size (sequence length) as part of the shape optimization, we verify that this is not a limitation by flexifying SoViT-400m/14 on ILSVRC2012 for 300 epochs. The performance of the resulting FlexiSoViT-400m is shown in Figure 7 as the green curve when varying the patch size at inference time. A few reference ViT models from Table 1 and are added, confirming that SoViT-400m maintains a clear advantage. It is worth noting that flexifying does not rule out that other patch sizes could be compute-optimal. It merely demonstrates that SoViT-400M continues to perform quite well for other patch sizes when it is flexified.

Conclusion

In conclusion, we introduce an efficient method for optimizing the shape of neural architectures and successfully apply it to vision transformers. Our analysis demonstrates that smaller models, trained at their optimal architecture shape for the right amount of compute, can match much larger models.

Acknowledgments and Disclosure of Funding

We thank Mostafa Dehghani, Andreas Steiner, Daniel Keysers, Neil Houlsby, Sam Smith, David Schneider-Joseph, Rodolphe Jenatton and the anonymous reviewers for their valuable feedback and discussions. We also thank the Google DeepMind unit at large for providing a supportive research environment. We use the big_vision codebase for conducting experiments in this project.

ArXiv Version History

Version 1: Original version. Version 2: Layout fixes. Add missing citations to ImageNet-R,-A and ObjectNet. Version 3: Provided the full shape of SoViT-150m/14. Added details to Appendix B.3 about the grid sweep, a missing citation to the CIDEr score, and further discussions to Figures 3 and 6. Included a brief explanation of the image-to-text evaluation metrics in Section 5.3, the scaling exponents in Section 3 and the shape dimensions in Section 4. Fixed typos. Version 4: Fixed wall-clock time pre-training duration (TPUv3 core-hours) of SoViT-400m. Version 5: Fixed typos. Added brief explanations to Section 3.


Appendix A Scaling Laws Analysis

In this appendix, we present proofs of two claims in the paper. First, we show that (2) is quasiconvex on its first argument $\mathbf{x}_{k}$ . Second, we derive (5).

We assume throughout the proof that $a_{k},b_{k}$ are strictly positive, otherwise $f_{k}(\mathbf{x}_{k},\mathbf{t})$ is a monotone function on its first argument and the statement holds trivially.

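To establish the quasiconvexity of $f_{k}(\mathbf{x}_{k},\,\mathbf{t})$ in (2), we observe that:

$$\frac{\partial f_{k}}{\partial\mathbf{x}_{k}}=-a_{k}\alpha_{k}\mathbf{x}_{k}^{-(1+a_{k})}+b_{k}\beta_{k}\mathbf{x}_{k}^{b_{k}-1}\mathbf{t}^{-c}.$$

Setting the derivative to zero gives a unique solution $\hat{x}\in\mathbb{R}^{+}$. At the limit $\mathbf{x}_{k}\to\infty$, the term involving $\mathbf{x}_{k}^{-a_{k}}$ vanishes and we have the asymptotic relation:

$$f_{k}(\mathbf{x}_{k},\,\mathbf{t})\sim\left(\beta_{k}\mathbf{x}_{k}^{b_{k}}+\xi_{k}\right)\mathbf{t}^{-c}+\varepsilon_{k},$$

which is monotone increasing. Since $\hat{x}$ is the only stationary point, $f^{\prime}(\mathbf{x}_{k},\mathbf{t})\geq 0$ for all $\mathbf{x}_{k}\geq\hat{x}$. Similarly, when $\mathbf{x}_{k}\to 0^{+}$, we have:

$$f_{k}(\mathbf{x}_{k},\,\mathbf{t})\sim\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+\xi_{k}\mathbf{t}^{-c}+\varepsilon_{k},$$

which is monotone decreasing. Therefore, $f^{\prime}(\mathbf{x}_{k},\mathbf{t})\leq 0$ for all $\mathbf{x}_{k}\leq\hat{x}$. Combining both results implies that $f_{k}(x,\mathbf{t})$ is monotone decreasing in the domain $x\in(0,\hat{x})$ and is monotone increasing in the domain $x\in(\hat{x},\infty)$.

A function $f(y)$ is said to be quasi-convex if for any $y_{1}$ and $y_{2}$ in its domain and any $\lambda\in[0,1]$, one has:

$$f(\lambda y_{1}+(1-\lambda)y_{2})\leq\max\{f(y_{1}),\,f(y_{2})\}.$$

Suppose, for the purpose of obtaining a contradiction, that $f_{k}(\mathbf{x}_{k},\mathbf{t})$ is not quasi-convex in its first argument. Then there exist two points $y_{1}<y_{2}$ and some $\lambda\in(0,1)$ such that $f_{k}(\hat{y},\mathbf{t})>\max\{f_{k}(y_{1},\mathbf{t}),\,f_{k}(y_{2},\mathbf{t})\}$, where $\hat{y}=\lambda y_{1}+(1-\lambda)y_{2}$. By the mean value theorem, there exist two points $c_{1}\in(y_{1},\hat{y})$ and $c_{2}\in(\hat{y},y_{2})$ such that:

$$f^{\prime}(c_{1},\mathbf{t})>0,\qquad f^{\prime}(c_{2},\mathbf{t})<0,$$

with $c_{2}>c_{1}$. This implies that $c_{1}\geq\hat{x}$ and $c_{2}\leq\hat{x}$, which is a contradiction. Therefore, $f_{k}(\mathbf{x}_{k},\mathbf{t})$ is quasi-convex in its first argument.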
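
The quasi-convexity claim can also be spot-checked numerically. The sketch below is a sanity check rather than a proof: it assumes the joint form $f_{k}=\alpha_{k}\mathbf{x}_{k}^{-a_{k}}+(\beta_{k}\mathbf{x}_{k}^{b_{k}}+\xi_{k})\mathbf{t}^{-c}+\varepsilon_{k}$ with hypothetical parameter values, samples random pairs of points and mixing weights, and counts how often a convex combination has a larger loss than both endpoints.

```python
import random

def f_k(x, t, alpha=5.0, a=0.6, beta=0.02, b=0.7, xi=1.0, eps=0.2, c=0.5):
    """Joint form with hypothetical default parameters (illustration only)."""
    return alpha * x ** (-a) + (beta * x ** b + xi) * t ** (-c) + eps

def count_violations(t, trials=2000, seed=0, tol=1e-12):
    """Count sampled triples that violate f(lam*y1 + (1-lam)*y2) <= max(f(y1), f(y2))."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        y1 = rng.uniform(1.0, 1e6)
        y2 = rng.uniform(1.0, 1e6)
        lam = rng.random()
        y = lam * y1 + (1.0 - lam) * y2
        if f_k(y, t) > max(f_k(y1, t), f_k(y2, t)) + tol:
            violations += 1
    return violations

violations = count_violations(t=1e6)
print(f"violations of the quasi-convexity inequality: {violations}")
```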

A.2 Derivation of (5)

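Rearranging the expression in (4), we have:

$$\beta_{k}\left(\mathbf{x}_{k}^{\star}\right)^{b_{k}}\mathbf{t}^{-c}=\frac{a_{k}}{b_{k}}\,\alpha_{k}\left(\mathbf{x}_{k}^{\star}\right)^{-a_{k}}.$$

Evaluating (2) at $\mathbf{x}_{k}=\mathbf{x}_{k}^{\star}$ then gives:

$$f_{k}(\mathbf{x}_{k}^{\star},\,\mathbf{t})=\alpha_{k}\left(\mathbf{x}_{k}^{\star}\right)^{-a_{k}}+\beta_{k}\left(\mathbf{x}_{k}^{\star}\right)^{b_{k}}\mathbf{t}^{-c}+\xi_{k}\mathbf{t}^{-c}+\varepsilon_{k}=\alpha_{k}\left(1+\frac{a_{k}}{b_{k}}\right)\left(\mathbf{x}_{k}^{\star}\right)^{-a_{k}}+\xi_{k}\mathbf{t}^{-c}+\varepsilon_{k},$$

where we plugged in the last expression. Simplifying yields (5) for some constants $F,G\geq 0$, namely $F=\alpha_{k}(1+a_{k}/b_{k})$ and $G=\xi_{k}$.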

Appendix B Shape Optimization

Table 5 provides the set of hyperparameters used in the star and grid sweeps. We use a small batch size of 128 here in order to train multiple models in parallel on small hardware topologies.

B.2 Star Sweep

In the star sweep, we use the center $\mathbf{x}^{(c)}=(1968,\,40,\,6144)$ as our starting point. To estimate the scaling exponents $s_{k}$ in (4) for each dimension separately, we vary width in the grid $(608,\,768,\,928,\,1088,\,1328,\,1648)$, depth in the grid $(8,\,10,\,12,\,16,\,20,\,24)$, and MLP dim in the grid $(1088,\,1360,\,1728,\,2160,\,2592,\,3072)$. We train each model for 500K, 1M, and 2M steps. We always fix the patch size to $14\times 14$ and the number of attention heads to 16.

B.3 Grid Sweep

In the grid sweep, we pretrain each architecture on 600M examples. We use the cross-product of:

Some important considerations to be taken into account include:

When designing the grid sweep, we made sure that the compute-optimal model selected lies strictly in the interior of the grid, not on its boundary. This is because if it lies at the boundary (e.g. its depth is the maximum depth used in the grid), one cannot determine if it is compute-optimal or if increasing that dimension will yield even better models. This can be an iterative process, in which additional grid points are added to the sweep if necessary.

When identifying the model, we ensured that it is compute-optimal for a good range of compute (not only at some isolated point). Since the model is now compute-optimal for a range of compute budgets, we select as a starting point in our recipe the least compute it is optimal for. For example, if a model is compute-optimal for compute budgets ranging from 1 TFLOPs to 2 TFLOPs, we use 1 TFLOPs in our recipe. In other words, we err on the side of caution, giving preference to larger models as we scale up the vision transformer (ViT).

Generally, the grid sweep should be tightly packed; e.g. with increments of 20% only in each dimension. By contrast, increments in the star sweep should be large in order to identify the scaling exponents reliably.
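
The first consideration above, that the selected model must lie strictly inside the grid, is easy to automate. The sketch below is illustrative only: the grid values are hypothetical placeholders (they are not the grid used in our sweep), and the helper `boundary_dims` is an assumed name. It reports the dimensions along which the selected architecture sits on the edge of the grid, which signals that more grid points should be added in that direction.

```python
def boundary_dims(selected, grid):
    """Return the dimensions where `selected` lies on the edge of `grid`."""
    return [
        dim
        for dim, values in grid.items()
        if selected[dim] <= min(values) or selected[dim] >= max(values)
    ]

# Architecture identified by the grid sweep: (width, depth, MLP dim).
x0 = {"width": 608, "depth": 10, "mlp_dim": 928}

# Hypothetical grid values, for illustration only.
grid = {
    "width": (416, 512, 608, 768),
    "depth": (6, 8, 10, 12),
    "mlp_dim": (768, 928, 1088, 1360),
}

print("boundary dimensions:", boundary_dims(x0, grid))

# If depth 10 were the largest depth tried, the sweep would need extending.
truncated = dict(grid, depth=(6, 8, 10))
print("boundary dimensions (truncated grid):", boundary_dims(x0, truncated))
```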

Appendix C Multitask Decoding Setup

Table 6 summarizes the hyperparameter settings for the multitask decoding setup in Section 4.2 and Section 5.3. We always fix the decoder to 2 layers since it generally performs well .

Appendix D LiT Training Setup

Table 7 summarizes the hyperparameter settings for the locked-image text tuning (LiT) setup, which is used to report zero-shot classification accuracy in Table 4. We use a large batch size of 32K in this setup because it improves the performance of contrastive training.

Appendix E Transfer to ImageNet-1k

Table 8 lists the settings for the ImageNet-1k fine-tuning results presented in Table 1 in the main paper. The only three settings which differ across resolutions are learning rate decay, random augment and mixup strengths. We did explore various learning rates, training durations (mostly shorter) as well as Polyak averaging, although the same setting shown in the table appears to be best across the board. Finally, we list various other settings which we did not explore. We simply used good default values from experience.

E.2 Linear probe on frozen encoder

We take the image representation at the pre-logits, i.e. the 1152-dimensional vector that comes out of the MAP-head and feeds right into the linear classification layer. For each of ViT-L/16, SoViT-400m/14 and ViT-g/14, we perform a grid-search over the following settings, and select the best-performing model on minival (2% of train) to be reported in Table 2: Augmentation: resize(256)|random_crop(224) vs. inception_crop(224), learning rate: 0.001, 0.0003, 0.0001, epochs: 1, 3, 10, weight decay: 0.0001, None. It should be noted that we keep various other settings to “known good defaults” based on prior explorations with similar models (i.e. plain ViTs). Table 9 summarizes key settings.
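
The grid search described above can be enumerated as follows. This is a minimal sketch: the configuration keys and the `select_best` helper are hypothetical names, and the accuracies in the usage example are made-up placeholders rather than measured results.

```python
from itertools import product

augmentations = ("resize(256)|random_crop(224)", "inception_crop(224)")
learning_rates = (0.001, 0.0003, 0.0001)
epochs = (1, 3, 10)
weight_decays = (0.0001, None)

# Every combination of the four settings.
grid = [
    {"augmentation": aug, "lr": lr, "epochs": ep, "weight_decay": wd}
    for aug, lr, ep, wd in product(augmentations, learning_rates, epochs, weight_decays)
]
print(f"{len(grid)} configurations per model")

def select_best(results):
    """Pick the configuration with the highest minival accuracy.

    `results` is a list of (config, minival_accuracy) pairs.
    """
    return max(results, key=lambda pair: pair[1])[0]

# Usage with placeholder accuracies (not real measurements).
fake_results = [(cfg, 0.5 + 0.001 * i) for i, cfg in enumerate(grid)]
best = select_best(fake_results)
```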