Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald Metzler

cs.LG cs.CL

Introduction

There has been a lot of recent interest in the scaling properties of Transformer models (Kaplan et al., 2020; Hernandez et al., 2021; Bahri et al., 2021; Henighan et al., 2020; Tay et al., 2021b; Abnar et al., 2021). However, not much is understood about the scaling properties of different inductive biases imposed by model architectures. Improvements at a specific scale (compute, size, etc.) are often assumed to transfer to different scales and compute regions (So et al., 2019; Choromanski et al., 2020; Lan et al., 2019; Dehghani et al., 2018) and new research is often presented in a point-wise fashion with respect to scale. In short, it is not uncommon for new methods to be presented with data points at very specific or limited compute regions (e.g., base size). We believe that understanding the interaction between architecture and scaling laws is crucial, as designing models that perform well at diverse scales will likely have significant impact.

This paper is an attempt to understand the effect of inductive bias (architecture) on scaling laws of language models. To this end, we pre-train and finetune over ten diverse model architectures across multiple compute regions and scales (e.g., from 15M to 40 billion parameters). In total, we pre-train and finetune over 100 different models of different architectures and sizes and present insights and challenges in scaling these ten diverse architectures.

We consider a broad spectrum of models in our extensive experiments. Concretely, we consider several well-established Transformer variants (Vaswani et al., 2017) such as Evolved Transformer (So et al., 2019), Universal Transformers (Dehghani et al., 2018) and Switch Transformers (Fedus et al., 2021). We also consider lightweight models such as ALBERT (Lan et al., 2019) and efficient Transformers (Tay et al., 2020) such as Performer (Choromanski et al., 2020) and Funnel Transformers (Dai et al., 2020). In our comparison, we are also interested in finding out if general improvements to the Transformer architecture such as Mixture-of-Softmax (Yang et al., 2017) and Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020) influence the scaling behaviour of models. Finally, we also evaluate models outside the family of Transformers including Lightweight convolutions (Wu et al., 2019), Dynamic convolutions (Wu et al., 2019) and the recently proposed MLP-Mixers (Tolstikhin et al., 2021). Figure 1 illustrates an overview of the experiments we run.

We also note that scaling these models is not as straightforward as it seems, i.e., there are intricate details of scale that are intertwined with architectural choices, which we study in detail in this paper. For example, a distinct feature of Universal Transformers (and ALBERT) is parameter sharing. Hence, compared with standard Transformers, this architectural choice significantly warps the scaling behaviour not only with respect to performance but also amongst compute metrics such as FLOPs, speed and number of parameters (Dehghani et al., 2021a). Conversely, models such as Switch Transformers are on the other end of the spectrum with an uncommon relationship between FLOPs and number of parameters, i.e., they have a high parameter-to-FLOPs ratio. These differences make navigating this landscape challenging.

The key contributions of this paper are as follows:

For the first time, we derive scaling laws for different inductive biases and model architectures. We find that this scaling coefficient differs greatly from model to model. We believe this is an important consideration in model development. It turns out that amongst all ten architectures that we consider, the vanilla Transformer has the best scaling behaviour, even if its absolute performance at each compute region is not the greatest.

We observe that models that operate well in one compute-scale region are not necessarily the best in another compute region. Moreover, we find that certain models have difficulty scaling despite performing decently (comparably) at lower-compute regions. This has implications, since it is difficult to get the full picture of a model’s scalability with pointwise comparisons at a certain compute region.

We find that when it comes to scaling different model architectures, upstream pre-training perplexity might not correlate well with downstream transfer. Hence, the underlying architecture and inductive bias are also crucial for downstream transfer.

We highlight the difficulties of scaling with certain architectures and show that some models do not scale (or scale with a negative trend). We also find concerning trends where linear-time attention models such as Performer struggle with scaling up.

Related Work

Kaplan et al. (2020) studied empirical scaling laws of decoder-only Transformer language models. They focused on the standard left-to-right language modeling objective with the cross-entropy loss as the performance metric. One of the main findings is that the loss scales as a power-law with three major characteristics of the model training: model size, dataset size and the training compute. Another somewhat surprising finding is that model shapes such as the width or depth of the Transformer network have minimal effects on the cross-entropy loss for a wide range of scales. Subsequent works (Henighan et al., 2020; Hernandez et al., 2021) reached similar conclusions for autoregressive generative modeling and for transfer learning, respectively. This finding is also generally supported by Tay et al. (2021b), but discrepancies were found for the gap between pretraining and finetuning - highlighting the fact that observing downstream performance of large language models is indeed important. In Tay et al. (2021b), the effect of depth was unusually pronounced for downstream performance.

Raffel et al. (2019) studied the effect of pre-training objectives, model structures (e.g., encoder-decoder, decoder-only), pre-training dataset size and training strategy on transfer learning. They showed that the downstream performance monotonically increases with the model scale (from 60M to 11B parameters). While they studied several model structures, the Transformer implementation is mostly the same as the original Transformer by Vaswani et al. (2017). Conneau et al. (2020) and Goyal et al. (2021) scaled multilingual encoder-only architectures up to 11B parameters while maintaining the original Transformer implementation. They found that scaling the model improves its cross-lingual ability. Fedus et al. (2021) scaled a sparse model based on Mixture of Experts (MoE) up to a trillion parameters.

While previous studies have repeatedly shown the benefits of scale for language understanding tasks for both dense and sparse Transformers and for cross-lingual abilities, each of these studies used a single Transformer implementation throughout. With a plethora of improved Transformer architectures proposed in the literature, it is timely to investigate which of these improved architectures has the best scaling properties. The main goal of this paper is to systematically study how inductive biases imposed by these Transformer variants affect the scaling behavior in a shared software and hardware setting. This is in a similar spirit to Narang et al. (2021), who study the impact of architectures on performance. Our analysis extends that of Narang et al. (2021) to the model scale axis.

Methods

This section outlines our experimental setup.

This section describes the models we evaluate in our experiments. Our models are largely implemented in a sequence to sequence framework (Sutskever et al., 2014) following the convention of T5 (Raffel et al., 2019). Encoder-decoder models are a natural choice for this experimentation because they can universally express both encoding and decoding tasks.

We consider several standard Transformer variants.

Transformers (Vaswani et al., 2017) - The basic vanilla Transformer architecture. Our basic setup considers the T5-style of Transformers (Raffel et al., 2019), which largely follows the vanilla Transformer except that it uses relative attention instead of sinusoidal position embeddings and applies pre-layer normalization, i.e., layer normalization is applied before each sublayer.

Evolved Transformers (So et al., 2019) - A Transformer architecture learned via AutoML. The architecture comprises convolutions and attention. We scale Evolved Transformers following the same pattern as vanilla Transformers.

Universal Transformers (UT) (Dehghani et al., 2018) - A Transformer architecture with shared parameters and recurrent-like computation for transform layers. Scaling UTs is challenging because of parameter sharing. While we are able to also increase $d_{FF}$ or $d_{model}$, the resulting increase in parameters is roughly a factor of $N_{layers}$ smaller than for standard Transformers, since the weights are shared across layers. Another axis of exploration is to scale $r$, the number of repeated computations at each UT layer - this increases computation (number of FLOPs) but does not increase the parameter size of the model.

Switch Transformer (Fedus et al., 2021) - A sparsely activated mixture-of-experts architecture. The Switch Transformer is another model with an unusual relationship between number of parameters and compute. When we scale this model uniformly, the number of parameters easily reaches the ballpark of 40B.
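To make the contrast between parameters and compute concrete, the following is a rough accounting sketch rather than the accounting used in our experiments. It counts only the feed-forward weights of a layer stack and ignores attention, embeddings, biases and routing cost. The function names and the example sizes are hypothetical and chosen purely for illustration.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feed-forward block (biases ignored)."""
    return 2 * d_model * d_ff


def ffn_flops_per_token(d_model, d_ff):
    """Approximate forward FLOPs per token, counting 2 FLOPs per multiply-add."""
    return 2 * ffn_params(d_model, d_ff)


def dense_stack(n_layers, d_model, d_ff):
    # Every layer owns its weights: parameters and FLOPs both grow with depth.
    return {"params": n_layers * ffn_params(d_model, d_ff),
            "flops": n_layers * ffn_flops_per_token(d_model, d_ff)}


def shared_stack(n_layers, d_model, d_ff):
    # UT/ALBERT-style sharing: one set of weights is applied n_layers times,
    # so FLOPs grow with depth while the parameter count does not.
    return {"params": ffn_params(d_model, d_ff),
            "flops": n_layers * ffn_flops_per_token(d_model, d_ff)}


def moe_stack(n_layers, d_model, d_ff, n_experts):
    # Switch-style routing: n_experts blocks are stored per layer, but each
    # token is processed by only one of them (router cost ignored here).
    return {"params": n_layers * n_experts * ffn_params(d_model, d_ff),
            "flops": n_layers * ffn_flops_per_token(d_model, d_ff)}


# Hypothetical sizes, for illustration only.
dense = dense_stack(n_layers=12, d_model=768, d_ff=3072)
shared = shared_stack(n_layers=12, d_model=768, d_ff=3072)
moe = moe_stack(n_layers=12, d_model=768, d_ff=3072, n_experts=8)
```

Under these simplifying assumptions all three stacks have the same per-token FLOPs, while the shared stack has $N_{layers}$ times fewer parameters than the dense one and the mixture-of-experts stack has `n_experts` times more.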

Efficient Transformer Variants

This class of models is mainly concerned with reducing the computational cost, memory usage, or parameter count of models.

Performer (Choromanski et al., 2020) - A linear-time attention model using generalizable kernel attention. For simplicity, we adopt the relu kernel variant for our experiments. We scale Performer in a similar fashion (i.e., uniform scaling) to vanilla Transformers.

Funnel Transformer (FT) (Dai et al., 2020) - A Transformer architecture that downsamples the input sequence across the layer stack. Our implementation uses FT only in the encoder and reverts to vanilla Transformer in the decoder following Narang et al. (2021).

ALBERT (Lan et al., 2019) - A lightweight transformer architecture that shares parameters across all layers and factorizes the embedding and output softmax layers. For our seq2seq ALBERT, we also share the weights of encoder and decoder.
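As an illustration of why kernel attention of the kind used by Performer is linear in sequence length, the sketch below implements non-causal attention with a relu feature map in NumPy. It is a minimal sketch and not our Mesh TensorFlow implementation: it assumes the feature map is a plain elementwise relu on queries and keys with no random projection, and the `eps` stabiliser is an assumption added to avoid division by zero.

```python
import numpy as np


def relu_linear_attention(q, k, v, eps=1e-6):
    """Non-causal kernel attention. q, k: (n, d); v: (n, d_v).

    Associativity lets us form K^T V first, so the cost is O(n * d * d_v)
    rather than O(n^2 * d).
    """
    q_f = np.maximum(q, 0.0)          # relu feature map on queries
    k_f = np.maximum(k, 0.0)          # relu feature map on keys
    kv = k_f.T @ v                    # (d, d_v) summary of keys and values
    z = k_f.sum(axis=0)               # (d,) normaliser summary
    numerator = q_f @ kv              # (n, d_v)
    denominator = q_f @ z + eps       # (n,)
    return numerator / denominator[:, None]


def relu_quadratic_attention(q, k, v, eps=1e-6):
    """Reference version that materialises the full (n, n) score matrix."""
    scores = np.maximum(q, 0.0) @ np.maximum(k, 0.0).T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
out_linear = relu_linear_attention(q, k, v)
out_quadratic = relu_quadratic_attention(q, k, v)
```

The two functions compute the same quantity up to floating-point error; only the order of the matrix products differs.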

General Improvements

We consider general improvements that are not necessarily tied to Transformers. We select candidates that have been shown to do well in Narang et al. (2021).

Mixture of Softmaxes (Yang et al., 2017) - A transformer architecture adopting the MoS method at the Softmax layer.

Gated Linear Units with GeLU (GLU-Transformer) - Replacing position-wise feed-forward-networks in Transformers with Gated Linear Units (Dauphin et al., 2017).
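For reference, the difference between a standard position-wise feed-forward block and a gated variant with a GeLU gate can be sketched in NumPy as follows. This is a hedged sketch: the weight names are hypothetical, biases and dropout are omitted, the standard block is assumed to use relu, and GeLU is computed with its common tanh approximation.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def ffn(x, w_in, w_out):
    """Standard block: relu(x W_in) W_out."""
    return np.maximum(x @ w_in, 0.0) @ w_out


def geglu_ffn(x, w_gate, w_value, w_out):
    """Gated block: (gelu(x W_gate) * (x W_value)) W_out."""
    return (gelu(x @ w_gate) * (x @ w_value)) @ w_out


rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 4            # hypothetical sizes
x = rng.normal(size=(n, d_model))
w_gate = rng.normal(size=(d_model, d_ff))
w_value = rng.normal(size=(d_model, d_ff))
w_out = rng.normal(size=(d_ff, d_model))
y = geglu_ffn(x, w_gate, w_value, w_out)
```

Note that the gated block carries three weight matrices instead of two, so at a fixed $d_{FF}$ it has more parameters and FLOPs than the standard block.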

Non-Transformer Architectures

We are interested in the scaling behaviour of non-Transformer based architectures such as convolutions and/or mixer architectures.

Lightweight Convolutions (Wu et al., 2019) - Lightweight depthwise convolutions that have shown promise over Transformer architectures.

Dynamic Convolutions (Wu et al., 2019) - An extension of the Lightweight Convolution to create time-dependent kernels.

MLP-Mixers (Tolstikhin et al., 2021) - Mixers are recently proposed architectures that learn a lightweight mixing of tokens. Since Mixers have not been used in autoregressive decoding, we only use token-mixers on the input encoder.
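The core operation of a lightweight convolution can be sketched as a depthwise convolution whose kernel is softmax-normalised over its width and shared by groups of channels. The NumPy sketch below assumes causal (left) padding and omits the surrounding projections and gating of the full layer, so it should be read as an illustration rather than as the implementation used in our experiments. Names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lightweight_conv(x, weights):
    """Causal lightweight convolution.

    x: (n, d) sequence of d-channel vectors.
    weights: (n_heads, k) raw kernels; channels in the same head share one.
    """
    n, d = x.shape
    n_heads, k = weights.shape
    assert d % n_heads == 0, "channels must split evenly into heads"
    kernels = softmax(weights, axis=-1)                      # normalise over width
    per_channel = np.repeat(kernels, d // n_heads, axis=0)   # (d, k)
    padded = np.concatenate([np.zeros((k - 1, d)), x], axis=0)  # left pad: causal
    out = np.zeros((n, d), dtype=float)
    for j in range(k):
        # Tap j reads the input at position i - (k - 1 - j).
        out += padded[j:j + n] * per_channel[:, j]
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 8))
w = rng.normal(size=(2, 3))           # 2 heads, kernel width 3
y = lightweight_conv(x, w)
```

In this sketch `weights` is a fixed parameter shared by all positions. A dynamic convolution would instead compute the kernel from the input at each time step.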

2 Experiment Setup

Our setup, along with all models, is implemented in Mesh TensorFlow (Shazeer et al., 2018), a library with a similar interface to TensorFlow that enables distributed model parallelism across multiple workers. For fair comparison, all models are pretrained for $2^{19}$ steps on the English C4 corpus, optimized using an inverse square root learning rate with Adafactor (Shazeer and Stern, 2018). All models use the same SentencePiece tokenizer (Kudo and Richardson, 2018) containing $32K$ subwords. This closely follows the setup in the T5 paper (Raffel et al., 2019). Finetuning is performed for $100K$ steps on a mixture of GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016). We evaluate on both upstream (pre-training) validation perplexity as well as downstream transfer for NLU tasks (GLUE + SuperGLUE + SQuAD) after fine-tuning. We pretrain and finetune our models with 16 TPU-v3 chips with data parallelism. All large models have a model parallelism of $2$ and XL models have a model parallelism of $8$.
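An inverse square root learning rate schedule can be sketched as below. The warmup constant is an assumption: the value of $10^4$ steps follows the convention described in the T5 paper and is not stated in the text above, so it should be treated as a placeholder.

```python
import math


def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Learning rate 1 / sqrt(max(step, warmup_steps)).

    The rate is constant during warmup and then decays as 1 / sqrt(step).
    warmup_steps=10_000 is an assumed default, not a value from this paper.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))


lr_start = inverse_sqrt_lr(0)          # constant plateau during warmup
lr_end = inverse_sqrt_lr(2 ** 19)      # rate at the final pre-training step
```

With this assumed warmup, the rate stays at 0.01 for the first 10,000 steps and decays to roughly 0.0014 by step $2^{19}$.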

We consider several different model sizes for each architecture. For models that are straightforward to scale, we simply follow the standard convention in Raffel et al. (2019), moving from small to base, to large and XL. We include a tiny version of each model to observe how different models behave at lower compute regions. For models where it was not straightforward to scale (e.g., Universal Transformers, ALBERT), we tried to scale them in a similar fashion but faced obvious limitations such as getting ALBERT to have the same number of parameters as T5 XL without incurring a huge cost in terms of FLOPs. For convolutional models, we consider $d_{\rm{model}}$ to be the hidden size (i.e., channel depth) for the one-dimensional convolution layers. Values such as $d_{\rm{kv}},N_{H}$ then become redundant. Scaling details of each architecture can be found in the supplementary material. (Footnote: The largest Switch Transformer was scaled in a fairly sub-optimal way, so we do not think it is representative of the full potential of the Switch family. Take the last data point of Switch with a pinch of salt.)

3 Main Results

We report the main results of this paper in Table 1. We report the number of trainable parameters, FLOPs (of a single forward pass) and speed (steps per second). We also report validation perplexity (on upstream pre-training) and results on 17 downstream tasks. The results are reported as aggregates of GLUE, SuperGLUE and SQuAD. Since we use the same Mesh TensorFlow-based codebase used by Raffel et al. (2019), we expect our experimental results to match theirs, and we verify that our T5 base does achieve similar results to what is reported in Raffel et al. (2019).

4 Do all models scale the same way?

This section investigates if all model architectures scale in the same way.

Figure 2 reports the scaling behaviour of all models as we increase the number of FLOPs. We observe that the scaling behaviour of each model is quite distinct, i.e., most of them are quite different from standard Transformers. Perhaps the biggest finding here is that most models (e.g., LConv, Evolved) seem to be on par with or better than standard Transformers but fail to scale with a higher compute budget. Another interesting trend is that “linear” Transformers such as Performer fail to scale as shown in Figure 2(i). The pre-training perplexity metric only decreases by 2.7% going from base to large scale, compared to 8.4% for the vanilla Transformer.

Downstream Transfer

Figure 3 reports the scaling curves of all models on downstream transfer. The overall finding that most models have distinct scaling curves compared to Transformers is also evident in downstream tasks. It is also noteworthy that most models have different upstream and downstream scaling curves. We find that some models such as Funnel Transformer and LConvs seem to hold up pretty well on upstream but suffer substantially on downstream. As for Performer, the performance disparity seems to be even greater in downstream as compared to upstream. Notably, the SuperGLUE downstream tasks generally require pseudo cross-attention on the encoder, which models such as convolutions are not equipped to handle (Tay et al., 2021a). To this end, we find that certain models may have difficulty learning the downstream tasks despite good upstream performance.

5 Are the best models at each scale different?

Figure 1 shows the Pareto-frontier when plotting compute against upstream and downstream performance. Since the colors of the plot represent different models, we can observe that the best model for every scale and compute region might be different. This can also be observed from Figure 3. For example, the Evolved Transformer seems to do well against the standard Transformer in the tiny to small region (downstream) but this quickly changes when scaling the model up. We also observe this with MoS-Transformer, where it clearly outperforms vanilla Transformers at some regions but not at others.
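A Pareto-frontier of this kind can be computed with a simple sweep over models sorted by compute. The sketch below is illustrative only: the model names and numbers are made up and do not correspond to any entry in Table 1.

```python
def pareto_frontier(points):
    """Return the points that no cheaper-or-equal point matches or beats.

    points: iterable of (name, cost, score), lower cost and higher score
    being better.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost; among equal costs, visit the highest score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier


# Made-up (name, relative compute, score) triples, for illustration only.
points = [
    ("A-small", 1.0, 60.0),
    ("B-small", 1.0, 62.0),
    ("C-base", 2.0, 61.0),
    ("A-large", 4.0, 70.0),
    ("B-large", 4.5, 66.0),
]
frontier = pareto_frontier(points)
frontier_names = [name for name, _, _ in frontier]
```

In this toy example architecture B is on the frontier at low compute while architecture A takes over at higher compute, which is the kind of crossover discussed above.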

6 Scaling Law for Each Model

Table 2 presents the slope $\alpha$ of the fitted line for each model across multiple scenarios. We derive $\alpha$ by plotting pairs of the following quantities against each other: $F$ (FLOPs), $U$ (upstream perplexity), $D$ (downstream accuracy) and $P$ (number of parameters). In general, most values of $\alpha$ depict how well a model scales. For example, $\alpha_{F,U}$ is obtained by plotting FLOPs against upstream performance. The only exception is $\alpha_{U,D}$, which is a measure of upstream vs downstream performance. A high $\alpha_{U,D}$ value means that the transfer to the downstream tasks is better as a model scales. Overall, the $\alpha$ value is a metric that represents how well a model performs relatively across all scales.
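The fitting procedure can be sketched as an ordinary least-squares line fit. The sketch below assumes that both axes are log-transformed before fitting, which the text above does not state explicitly, and it uses synthetic data generated from a known power law rather than any measured result.

```python
import numpy as np


def fit_alpha(x, y):
    """Slope of the least-squares line through (log x, log y).

    The log-log transform is an assumption made for this illustration.
    """
    slope, _intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return slope


# Synthetic data: quality = 3 * FLOPs^0.1, for illustration only.
flops = np.array([1e9, 1e10, 1e11, 1e12])
quality = 3.0 * flops ** 0.1
alpha_f_quality = fit_alpha(flops, quality)
```

On this synthetic power law the fit recovers the exponent 0.1. A larger slope means that the metric on the vertical axis improves faster as the quantity on the horizontal axis grows.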

In general, we find that the vanilla Transformer has the highest values of $\alpha$. Models such as Evolved Transformer, GLU-Transformer, MoS-Transformer and Funnel Transformer tend to have similar scaling properties to the vanilla Transformer. The GLU-Transformer has similar but slightly worse scaling properties than the vanilla Transformer, even if it was observed to do better in an absolute sense in some compute regions. On the other hand, we also observe that there are models which are difficult to scale such as LConv, UT, MLP-Mixer and Performer. This is even more evident on downstream tasks. We also note that ALBERT scales (trends) negatively, i.e., it gets worse as we scale the model up. (Footnote: This version of ALBERT shares parameters across encoder and decoder, which may partially explain why we had a hard time scaling up.) The metric $\alpha_{U,D}$, in contrast, measures how the downstream performance scales with upstream performance. Overall, the Switch Transformer does the best on this metric, where downstream performance scales well with upstream performance. Generally, models that make fewer changes to the main Transformer architecture (GLU-Transformer, MoS-Transformer) tend to retain similar scaling behaviours, whereas changing the inductive bias significantly alters the scaling property of the model.

7 Do Scaling Protocols influence model architectures in the same way?

We are interested in how different scaling protocols influence the model architectures. Figure 4 shows the effect of scaling depth of four model architectures (MoS-Transformer, Transformer, Evolved Transformer and LConv). Figure 5 shows the effect of scaling width on the same four architectures. Firstly, on upstream (negative log perplexity) curves, we note that while different architectures have a distinct difference in absolute performance, the scaling trend remains quite similar. On downstream, depth scaling (Figure 4) seems to act equally on most architectures with the exception of LConv. Meanwhile, for width scaling, it seems that Evolved Transformers scale slightly better when applying width-scaling. It is also interesting to note that depth-scaling has a much more substantial impact on downstream scaling as opposed to width-scaling.

8 Epilogue and Conclusion

In this paper, we conducted extensive experiments, pretraining and finetuning up to 100 models spanning 10 well-established Transformer and non-Transformer architectures. We showed that different model architectures can have different scaling behaviours and that models performing well in one compute region (or model size) may not do equally well in another compute region.

We also showed that model architectures may do well on upstream perplexity but fail to transfer to downstream tasks. Hence, practitioners should take care to develop architectures that scale well not only with respect to upstream perplexity but also with respect to downstream performance. While we certainly do not expect researchers to always report model performance across all scales (especially large-scale), we believe that it is good to keep in mind that architectures can perform quite differently at different compute regions. Hence, this might be a good dimension to consider when designing new inductive biases. As such, performing evaluation at a certain compute region may be insufficient to capture the full picture. It is also good to consider if different inductive biases will result in different extents of emergent capabilities (Wei et al., 2022; Abnar et al., 2020).

We also showed that different model architectures may react differently to different scaling protocols, which further expands on the narrative that comparing and benchmarking these models can be very challenging (Dehghani et al., 2021b). When it comes to scaling large models, this paper shows that novel inductive biases can indeed be quite risky, which might explain why most state-of-the-art large language models (Rae et al., 2021; Chowdhery et al., 2022; Tay et al., 2022) are based on relatively vanilla architectures. Our advice is to be cautious when staking an expensive run on a Transformer architecture that drastically modifies the attention mechanism (e.g., Mixers and Performers are generally high-risk options as seen in our experiment results). Finally, we acknowledge that not every practitioner or researcher would require models that are able to scale to billions of parameters. In that case, inductive biases that are tailored to small or low compute will be sufficient.

References

Appendix

For most models, it was reasonable to follow the uniform scaling method in the main T5 sizes. At each size, the hyperparameters are as follows:

For Switch Transformers, we use the following scaling:

Scaling for Universal Transformer

Scaling UTs is generally difficult, as described in the main text. There were two main considerations for scaling UTs. Initially we tried scaling the number of recurrent operations. However, we found that even with an increase in FLOPs, this does not lead to improved performance. Moreover, the UT model can be quite slow, and therefore a model with the same hparams as vanilla XL might be infeasible to run. Hence, we explored increasing the width of the MLPs to $32K$ to see if UTs would scale in this manner.