Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt
Introduction
In recent years, research has shown that models pre-trained on large and diverse datasets learn representations that transfer well to a variety of tasks. As a result, machine learning practitioners now commonly develop solutions for downstream tasks by fine-tuning large pre-trained models (Girshick et al., 2014; Yosinski et al., 2014; Kornblith et al., 2019; Kolesnikov et al., 2020). Typically, the fine-tuning process involves two steps: (1) fine-tune models with a variety of hyperparameter configurations, and (2) select the model which achieves the highest accuracy on the held-out validation set. The remaining models are then discarded.
Selecting a single model and discarding the rest has several downsides. For one, ensembling outputs of many models can outperform the best single model, albeit at a high computational cost during inference. For another, fine-tuning a model on downstream tasks can sometimes reduce out-of-distribution performance (Radford et al., 2021; Andreassen et al., 2021; Wortsman et al., 2021; Pham et al., 2021), and the best single model on the target distribution may not be the best model on out-of-distribution data.
In this work, we propose a more accurate and robust alternative to the second step of the conventional recipe in the context of fine-tuning a large pre-trained model. Instead of selecting the individual fine-tuned model which achieves the highest accuracy on the held-out validation set, we average the weights of models fine-tuned independently, and refer to the result as a model soup. Given the results of the first step—a hyperparameter sweep over fine-tuned models—averaging several of these models to form a model soup requires no additional training and adds no cost at inference time.
Since the loss landscape of neural network training is non-convex with many solutions in different loss basins, it is perhaps surprising that averaging the weights of independently fine-tuned models achieves high performance. However, recent work (Neyshabur et al., 2020) observes that fine-tuned models optimized independently from the same pre-trained initialization lie in the same basin of the error landscape, inspiring our method. Weight averaging along a single training trajectory has previously been shown to improve the performance of models in non-transfer settings (Szegedy et al., 2016; Izmailov et al., 2018). Our approach extends weight averaging to the context of fine-tuning, where we find that it also works across many independent runs with varied hyperparameter configurations. Our use of a diverse set of fine-tuned models is inspired by Gontijo-Lopes et al. (2022), who observe that ensembling independent runs trained with different hyperparameters improves performance.
We perform a comprehensive experimental study of fine-tuning to understand the behavior of model soups. For our main results we fine-tune CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), which are pre-trained with a contrastive loss on image-text pairs, and a ViT-G model pre-trained on JFT (Zhai et al., 2021). Our results show that model soups often outperform the best individual model on both the in-distribution and natural distribution shift test sets (Table 1, Figure 1, Figure 5). A model soup composed of ViT-G models achieves 90.94% on ImageNet (Deng et al., 2009), surpassing the previous state of the art of 90.88% attained by the CoAtNet model (Dai et al., 2021) while requiring 25% fewer FLOPs at inference time. (Since our initial submission, we have attained 90.98% with BASIC (Pham et al., 2021), which ties the newer CoCa model (Yu et al., 2022) to their reported precision; see Appendix C.) In general, model soups can approach the performance of ensembling, with no additional computational cost or memory relative to a single model during inference. Beyond ImageNet and associated distribution shifts, our results show that model soups are applicable when fine-tuning on tasks from the WILDS (Koh et al., 2021) benchmark, and when fine-tuning transformer models (Vaswani et al., 2017; Devlin et al., 2019a; Raffel et al., 2020b) for text classification.
While the most straightforward approach to making a model soup is to average all the weights uniformly, we find that greedy soups, where models are sequentially added to the soup if they improve accuracy on held-out data, outperform uniform averaging. Greedy soups avoid adding in models which may lie in a different basin of the error landscape, which could happen if, for example, models are fine-tuned with high learning rates.
In addition to empirical observations, we analytically relate the similarity in loss between weight-averaging and logit-ensembling to the flatness of the loss (i.e., its second derivative on a line between models) and confidence of the predictions (expressed via the variance of a logits difference drawn from the weight-average softmax). We empirically validate our approximation on a subset of the models we train and show that it is strongly correlated with the true averaging vs. ensembling performance difference, particularly in the learning rate regimes where soups are effective and models achieve higher accuracy.
Paper outline. Our method of model soups is presented and evaluated in Sections 2 and 3, respectively. Next, Section 4 includes our analysis relating model soups and ensembles, Section 5 details the scope and limitations of the proposed method, and Section 6 contextualizes model soups by reviewing related work.
Method
This section highlights three recipes for model souping: the uniform, greedy, and learned soups. The greedy soup is our central method. We summarize the methods described in this section in Table 2.
Let $\theta = \mathsf{FineTune}(\theta_0, h)$ denote the parameters obtained by fine-tuning with pre-trained initialization $\theta_0$ and hyperparameter configuration $h$. The hyperparameter configuration can include the choice of optimizer, data augmentation, training iterations, and a random seed which will determine data order.
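For hyperparameter configurations $h_1, \ldots, h_k$ let $\theta_i = \mathsf{FineTune}(\theta_0, h_i)$. Conventionally, the parameters $\theta_j$ which attain the highest accuracy on a held-out validation set are selected, and the remaining parameters are discarded. Instead, model soups use an average of the $\theta_i$, i.e., $\theta_S = \frac{1}{|S|} \sum_{i \in S} \theta_i$ where $S \subseteq \{1, \ldots, k\}$. The uniform soup is constructed by averaging all fine-tuned models $\theta_i$, and so $S = \{1, \ldots, k\}$.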
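As a minimal sketch of the averaging step, the snippet below forms a soup from a chosen subset of fine-tuned models, with the uniform soup as the special case that uses every model. The `Weights` representation (a dictionary mapping parameter names to flat lists of floats) and the function names are assumptions made for illustration, not part of the paper.

```python
from typing import Dict, List, Sequence

# Hypothetical flat representation: parameter name -> list of values.
Weights = Dict[str, List[float]]


def average_weights(models: Sequence[Weights]) -> Weights:
    """Return the element-wise mean of weight dictionaries with identical keys."""
    if not models:
        raise ValueError("need at least one model")
    averaged: Weights = {}
    for name in models[0]:
        # zip(*...) groups the i-th entry of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


def make_soup(models: Sequence[Weights], subset: Sequence[int]) -> Weights:
    """Average only the models whose indices appear in `subset`."""
    return average_weights([models[i] for i in subset])


# Three toy "fine-tuned models" that share one architecture.
thetas = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 0.0], "b": [2.0]},
]
uniform_soup = make_soup(thetas, range(len(thetas)))  # uses every model
pair_soup = make_soup(thetas, [0, 1])                 # uses a subset
```

The result has the same shape as a single model, which is why a soup adds no inference cost.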
There are settings in which a hyperparameter configuration can produce a model with low accuracy that results in a low accuracy uniform soup. This issue can be circumvented with a greedy soup (Recipe 1). The greedy soup is constructed by sequentially adding each model as a potential ingredient in the soup, and only keeping the model in the soup if performance on a held out validation set (disjoint from the training and test sets) improves. Before running this procedure we sort the models in decreasing order of validation set accuracy, and so the greedy soup can be no worse than the best individual model on the held-out validation set. We also explore a more advanced learned soup recipe that optimizes model interpolation weights by gradient-based minibatch optimization (see Appendix I for details). This procedure requires simultaneously loading all models in memory which currently hinders its use with large networks.
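The greedy procedure described above can be sketched as follows. This is an illustrative reading of the recipe, not the authors' implementation: the `Weights` representation, the function names, and the toy validation function are hypothetical, and accepting ties (`>=`) rather than requiring strict improvement is a choice made in this sketch.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical flat representation: parameter name -> list of values.
Weights = Dict[str, List[float]]


def average_weights(models: Sequence[Weights]) -> Weights:
    """Element-wise mean of several weight dictionaries with identical keys."""
    return {
        name: [sum(col) / len(models) for col in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def greedy_soup(
    models: Sequence[Weights], val_acc: Callable[[Weights], float]
) -> Tuple[Weights, List[int]]:
    """Return the greedy soup and the indices of the models kept in it."""
    if not models:
        raise ValueError("need at least one fine-tuned model")
    # Consider candidates in decreasing order of held-out validation accuracy.
    order = sorted(range(len(models)), key=lambda i: val_acc(models[i]), reverse=True)
    kept = [order[0]]
    best_acc = val_acc(models[order[0]])
    for i in order[1:]:
        candidate = average_weights([models[j] for j in kept + [i]])
        acc = val_acc(candidate)
        if acc >= best_acc:  # assumption of this sketch: ties are accepted
            kept.append(i)
            best_acc = acc
    return average_weights([models[j] for j in kept]), kept


def toy_val_acc(theta: Weights) -> float:
    """Toy stand-in for validation accuracy: best when w is close to 1.0."""
    return 1.0 - abs(theta["w"][0] - 1.0)


candidates = [{"w": [4.0]}, {"w": [0.75]}, {"w": [1.5]}]
soup_weights, kept_indices = greedy_soup(candidates, toy_val_acc)
```

In this toy run the outlier at `w = 4.0` is rejected because adding it lowers the validation score. Since the best single model is always the first ingredient, the soup's validation score cannot fall below it.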
Experiments
This section presents our key experimental findings. We begin with experimental setup (Section 3.1) then provide intuition for model soups by examining error landscape visualizations (Section 3.2). Next we present our main results (Section 3.3), using model soups as an alternative to selecting the best performing individual model. The appendix includes additional results on model soups in the context of robust fine-tuning (Appendix D) and model soups constructed by fine-tuning on different datasets (Appendix E).
Our experiments explore the application of model soups when fine-tuning various models. The primary models we fine-tune are the CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), and BASIC (Pham et al., 2021) models pre-trained with contrastive supervision from image-text pairs, a ViT-G/14 model pre-trained on JFT-3B (Zhai et al., 2021), and transformer models for text classification (Devlin et al., 2019a; Raffel et al., 2020a). Unless otherwise mentioned, experiments use the CLIP ViT-B/32 model. Fine-tuning is performed end-to-end (all parameters are modified) which typically results in better accuracy than training only the final linear layer (Kornblith et al., 2019; Agrawal et al., 2014; Chatfield et al., 2014; Azizpour et al., 2015).
We consider two different methods for initializing the final linear layer before fine-tuning. The first method initializes the model from a linear probe (LP), as described in Kumar et al. (2022), and we refer to this method as LP initialization. The second method uses the zero-shot initialization, e.g., using the classifier produced by the text tower of CLIP or ALIGN as the initialization. Both methods for initializing the model produce similar trends when applicable, and unless otherwise stated we use the LP initialization.
For the ensemble baselines (Dietterich, 2000; Lakshminarayanan et al., 2017) we ensemble the logits (unnormalized outputs) of models as in Gontijo-Lopes et al. (2022). Fine-tuning uses a supervised cross-entropy loss and, unless otherwise mentioned, is conducted on ImageNet (Deng et al., 2009). When fine-tuning on ImageNet we also evaluate on the five natural distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2021a), ImageNet-Sketch (Wang et al., 2019), ObjectNet (Barbu et al., 2019), and ImageNet-A (Hendrycks et al., 2021b). We often report results averaged over these five distribution shifts. Since the official ImageNet validation set is typically used as the test set, we use roughly 2% of the ImageNet training set as a held-out validation set for constructing greedy soups.
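To make the ensemble baseline concrete, here is a small sketch of logit ensembling for a single input. It assumes the ensemble takes the mean of the per-model logits (summing gives the same predicted class); the function names and the toy logits are hypothetical.

```python
from typing import List, Sequence


def ensemble_logits(per_model_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average the unnormalized outputs of several models for one input."""
    if not per_model_logits:
        raise ValueError("need logits from at least one model")
    n = len(per_model_logits)
    return [sum(col) / n for col in zip(*per_model_logits)]


def predict(logits: Sequence[float]) -> int:
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda k: logits[k])


# Each model needs its own forward pass to produce these logits.
logits_a = [2.0, 1.0, 0.0]
logits_b = [0.0, 3.0, 0.0]
combined = ensemble_logits([logits_a, logits_b])
ensemble_class = predict(combined)
```

Unlike a soup, this requires one forward pass per member, so inference cost grows with the number of models.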
3.2 Intuition and motivation
These results suggest that (1) interpolating the weights of two fine-tuned solutions can improve accuracy compared to individual models and (2) more uncorrelated solutions, i.e., models that form an angle closer to 90 degrees, may lead to higher accuracy on the linear interpolation path. (In particular, the angle $\phi$ is the angle between $\theta_1 - \theta_0$ and $\theta_2 - \theta_0$, i.e., the angle between the arrows shown in Figure 2.)
To investigate the correlation between accuracy improvement and angle, we consider a series of models trained with different seeds, learning rates, and data augmentation. For each pair $\theta_1, \theta_2$, we compare the accuracy of their average with the average of their accuracies, $\mathsf{Acc}\left(\tfrac{1}{2}\theta_1 + \tfrac{1}{2}\theta_2\right) - \tfrac{1}{2}\left(\mathsf{Acc}(\theta_1) + \mathsf{Acc}(\theta_2)\right)$, which we refer to as the interpolation advantage. Figure 3 illustrates the results, in which we observe that the interpolation advantage is correlated with the angle and that varying the learning rate, seed, or data augmentation can produce solutions which are more orthogonal. Experimental details and a discussion of high learning rates are provided in Appendix J.1.
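The two quantities compared here can be sketched for flattened weight vectors as follows. The angle is measured between the displacements of the two fine-tuned solutions from the shared initialization, and the interpolation advantage is the accuracy of the averaged weights minus the mean of the two individual accuracies. The function names and the toy accuracy function are hypothetical stand-ins.

```python
import math
from typing import Callable, List, Sequence


def angle_degrees(
    theta0: Sequence[float], theta1: Sequence[float], theta2: Sequence[float]
) -> float:
    """Angle between (theta1 - theta0) and (theta2 - theta0), in degrees."""
    d1 = [a - b for a, b in zip(theta1, theta0)]
    d2 = [a - b for a, b in zip(theta2, theta0)]
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(x * x for x in d2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("a fine-tuned solution coincides with the initialization")
    cosine = sum(x * y for x, y in zip(d1, d2)) / (norm1 * norm2)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.degrees(math.acos(cosine))


def interpolation_advantage(
    theta1: Sequence[float],
    theta2: Sequence[float],
    acc: Callable[[List[float]], float],
) -> float:
    """Accuracy of the averaged weights minus the average of the accuracies."""
    midpoint = [(a + b) / 2 for a, b in zip(theta1, theta2)]
    return acc(midpoint) - (acc(list(theta1)) + acc(list(theta2))) / 2


def toy_acc(theta: List[float]) -> float:
    """Toy stand-in for accuracy, peaked at (0.5, 0.5)."""
    return 1.0 - ((theta[0] - 0.5) ** 2 + (theta[1] - 0.5) ** 2)


init = [0.0, 0.0]
sol_a = [1.0, 0.0]
sol_b = [0.0, 1.0]
phi = angle_degrees(init, sol_a, sol_b)
advantage = interpolation_advantage(sol_a, sol_b, toy_acc)
```

In practice `acc` would evaluate a network on held-out data; the toy function only illustrates the bookkeeping.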
Ensemble comparison. Figure 4 shows that ensemble performance is correlated with soup performance for moderate and small learning rates. We consider pairs of models selected at random from the individual solutions in Figure 1, and find that the maximum learning rate of the models in the pair is indicative of the ensemble accuracy, soup accuracy, and their relation: when the learning rate is small, ensemble accuracy and soup accuracy are similar, but both are suboptimal. For moderate learning rate values, ensemble accuracy and soup accuracy are both high. For high learning rate values, ensemble performance exceeds soup performance, but ensembles/soups with moderate learning rates perform better. Overall, ensembles achieve higher accuracy on ImageNet while the reverse is true on the distribution shifts.
One dimensional hyperparameter grids. Finally, in Appendix F we ask the question: for a one dimensional grid of hyperparameters $\{h_a, \ldots, h_b\}$, how does averaging the models fine-tuned with hyperparameter configurations $h_a$ and $h_b$ corresponding to the endpoints compare with picking the best individual model fine-tuned with hyperparameter configuration $h \in \{h_a, \ldots, h_b\}$? The hyperparameters we vary are optimizer, augmentation, and learning rate. For the majority of grid searches, the average of the endpoints outperforms the best individual model in the grid.
3.3 Model soups
With the gains of averaging two fine-tuned models in mind, we turn our attention to averaging many models with different hyperparameters: this section presents our main results, which show that averaging fine-tuned models can be used as an alternative to the conventional procedure of selecting the single model which performs best on the held-out validation set. We explore CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) fine-tuned on ImageNet (Deng et al., 2009) (Section 3.3.1), ViT-G pre-trained on JFT-3B (Zhai et al., 2021) and fine-tuned on ImageNet (Section 3.3.2), and transformer models fine-tuned on text classification tasks (Section 3.3.3). Appendix G additionally explores (1) CLIP ViT-L fine-tuned on WILDS (Koh et al., 2021) and CIFAR-10 and (2) an ImageNet-22k-pretrained ViT-B fine-tuned on ImageNet. Moreover, Appendix C shows that model soups improve accuracy when fine-tuning BASIC (Pham et al., 2021).
We begin our study of model soups by considering two pre-trained models, CLIP ViT-B/32 and ALIGN EfficientNet-L2, and performing a hyperparameter sweep for fine-tuning each model on ImageNet. For CLIP we use a random hyperparameter search over learning rate, weight decay, training epochs, label smoothing, and data augmentation, obtaining 72 fine-tuned models (details in Appendix J.2.1). For ALIGN we use a grid search over learning rate, data augmentation, and mixup, obtaining 12 fine-tuned models (details in Appendix J.2.2). To form our greedy soups, we sort models in order of decreasing accuracy on the held-out validation set before applying Recipe 1. For both CLIP and ALIGN, the greedy soup selects 5 models. Figures 1 and 5 show the performance of the resulting models and their uniform and greedy soups for CLIP and ALIGN. The greedy soup improves over the best model in the hyperparameter sweep by 0.7 and 0.5 percentage points, respectively.
Furthermore, we show that, for essentially any number of models, the greedy soup outperforms the best single model on both the ImageNet and the out-of-distribution test sets. We consider an additional setting where we prepare a sequence of soups by sequentially adding CLIP models from the hyperparameter sweep in random order. Appendix Figure B.1 shows the performance of the uniform and greedy soup, as well as the best single model so far and a logit ensemble, as a function of the number of models considered. The greedy soup is better than the uniform soup on ImageNet and comparable to it out-of-distribution. The logit ensemble is better than the greedy soup on ImageNet, but worse out-of-distribution.
Table 3 lists the performance of the CLIP soups and baselines described above, as well as additional soup variants described in Appendix I.
To further establish the generality of the model soup, we replicate the CLIP hyperparameter sweep experiment on two image classification tasks from WILDS (Koh et al., 2021), namely FMoW (Christie et al., 2018) and iWildCam (Beery et al., 2021). Appendix Figure G.1 shows results qualitatively similar to our ImageNet experiment, and Appendix J.2.1 describes experimental details.
We report several additional variants and baselines for the experiment described above. In Appendix H we present results for different hyperparameter sweeps and fine-tuning initializations, when fine-tuning CLIP on ImageNet. For instance, we try a standard grid search which is similar to the grid search described for ALIGN above, and an extreme grid search which includes solutions fine-tuned with extreme hyperparameters that result in badly performing models (details in Appendix J.2.1). Moreover, Appendix L compares model soups with additional baselines, including distillation from an ensemble as in Hinton et al. (2014), exponential moving averaging (Szegedy et al., 2016), stochastic weight averaging (Izmailov et al., 2018), and sharpness aware minimization (Foret et al., 2021).
We highlight a few interesting takeaways from these experiments: (1) The greedy soup outperforms the best individual model—with no extra training and no extra compute during inference, we were able to produce a better model. (2) While the uniform soup can outperform the best individual model, we only observe this when all individual models achieve high accuracy (e.g., when fine-tuning ALIGN in Figure 1); unlike the examples in Figure 2, there can be an error barrier between fine-tuned models. We mainly observe this when fine-tuning with high learning rates (this is illustrated in Appendix J.1, Figure J.1). However, these high learning rate models also have a lower accuracy, and are therefore excluded by the greedy soup.
3.3.2 Fine-tuning a ViT-G model pre-trained on JFT-3B
To test whether the gains obtained by model soups are additive with other techniques used to obtain state-of-the-art models, we applied our greedy soup technique to 58 ViT-G/14 models fine-tuned on ImageNet. We vary the learning rate, decay schedule, loss function, and minimum crop size in the data augmentation, and optionally apply RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), or CutMix (Yun et al., 2019). We also train four models with sharpness-aware minimization (SAM) (Foret et al., 2021). For further details of our hyperparameter sweep, see Appendix J.2.3. For each model training run, we save exponential moving averages (EMA) of the weights (Szegedy et al., 2016) computed with decay factors of 0.999 (low EMA) and 0.9999999 (high EMA). Whereas high EMA generally provides the best single-model accuracy, both greedy soup and greedy ensembling attain higher validation accuracy when applied to parameters with low EMA. We report the highest single model accuracy numbers obtained with either EMA decay value, but perform greedy soup and ensembling with models trained with EMA decay of 0.999. For each combination of training run and EMA decay rate, we evaluate accuracy on our held out validation set every 1000 steps. We use these accuracy values to pick the best checkpoint for ensembling, souping, and subsequent evaluation.
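For readers unfamiliar with weight EMAs, the sketch below shows the standard update, in which the running average moves a fraction `1 - decay` toward the current weights at each step. The text does not spell out the exact update or initialization used, so both are assumptions here, and the toy decay values are far smaller than the 0.999 and 0.9999999 used in the experiments so that the effect is visible in a few steps.

```python
from typing import List, Sequence


def ema_update(ema: Sequence[float], weights: Sequence[float], decay: float) -> List[float]:
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must lie in [0, 1)")
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


def track_ema(trajectory: Sequence[Sequence[float]], decay: float) -> List[float]:
    """Run the EMA over a sequence of weight snapshots.

    Assumption of this sketch: the average is initialized to the first snapshot.
    """
    ema = list(trajectory[0])
    for weights in trajectory[1:]:
        ema = ema_update(ema, weights, decay)
    return ema


# Toy one-parameter trajectory that jumps from 0 to 1 and stays there.
trajectory = [[0.0], [1.0], [1.0], [1.0]]
low_decay_ema = track_ema(trajectory, decay=0.5)    # tracks recent weights closely
high_decay_ema = track_ema(trajectory, decay=0.75)  # smoother, lags behind
```

A larger decay factor averages over a longer window of the trajectory, which is the distinction between the low and high EMA settings.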
In Table 4, we report results on the ImageNet validation set and the five distribution shift datasets studied above as well as two relabeled ImageNet validation sets, ReaL (Beyer et al., 2020) and multilabel (Shankar et al., 2020). Our greedy soup procedure selects 14 of the 58 models fine-tuned as part of our hyperparameter sweep, and this soup performs statistically significantly better than the best individually fine-tuned model selected based on our held out validation set on all datasets except for ObjectNet. Even when we give an unfair advantage to individually fine-tuned models by selecting them based on their performance on each test set (denoted “oracle” in Table 4), the greedy soup, which was selected using only in-distribution data, remains superior on most datasets. Only on ReaL and ObjectNet does there exist an individual model that performs statistically significantly better than the soup, and the best model differs between those two datasets. Greedy ensembling performs similarly to the greedy soup in terms of ImageNet top-1 and multilabel accuracy, and slightly better on ReaL, but significantly worse on all distribution shift datasets except for ImageNet-V2. Thus, greedy soup can provide additional gains on top of standard hyperparameter tuning even in the extremely high accuracy regime.
3.3.3 Fine-tuning on text classification tasks
To test whether the gains obtained by model soups extend to domains beyond image classification, we conduct preliminary experiments with natural language processing (NLP). While more investigation is warranted to establish the applicability of model soups for NLP, we believe our experiments are a promising initial step. In particular, we fine-tune BERT (Devlin et al., 2019b) and T5 (Raffel et al., 2020b) models on four text classification tasks from the GLUE benchmark (Wang et al., 2018): MRPC (Dolan and Brockett, 2005), RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), CoLA (Warstadt et al., 2019) and SST-2 (Socher et al., 2013), as in Dodge et al. (2020). We use the standard metric for each dataset: average of accuracy and $F_1$ score for MRPC, accuracy for RTE, Matthews correlation for CoLA (Matthews, 1975) and accuracy for SST-2. Details are provided in Appendix J.4.
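The metrics named above can be computed from binary confusion counts, as in the following sketch. The formulas are the textbook definitions of accuracy, F1, and the Matthews correlation coefficient; the function names and toy labels are hypothetical, and returning 0.0 when a denominator vanishes is a convention chosen for this sketch.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_correlation(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mrpc_metric(tp: int, tn: int, fp: int, fn: int) -> float:
    """Average of accuracy and F1, the metric described for MRPC."""
    return (accuracy(tp, tn, fp, fn) + f1_score(tp, tn, fp, fn)) / 2


labels = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 0, 1, 1]
counts = confusion_counts(labels, predictions)
mcc = matthews_correlation(*counts)
mrpc = mrpc_metric(*counts)
```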
We fine-tune 32 models for each dataset with a random hyperparameter search over learning rate, batch size, number of epochs and random seed. Table 5 reports the corresponding metric on the validation set for BERT-base uncased (Devlin et al., 2019a) and T5-base (Raffel et al., 2020b). Additional experimental details and results for more models are provided in Appendix J.5. While the improvements are not as pronounced as in image classification, the greedy soup can improve performance over the best individual model in many cases.
Analytically comparing soups to ensembles
To test our approximation, we evaluate it over a set of fine-tuned models with different learning rates, augmentation strategies, random seeds and $\alpha$ values. We set $\beta$ to calibrate the soup model, and find that it improves the ability of our approximation to predict the soup/ensemble error difference; see Appendix K.4 for a detailed description of our setup.
Figure K.1 summarizes the results of our empirical evaluations. When excluding the highest learning rate (center and right panels), we see that the approximation is strongly correlated with both the true difference in loss as well as the difference in error, and the approximation and true loss difference generally agree in sign. (Fine-tuned models with this learning rate are far in weight space from the initial model and are often rejected when forming greedy soups. Therefore, we do not expect our approximation to be tight for these learning rates.) Additional details are provided in Appendix K.
Scope and limitations
While this work has so far demonstrated that averaging many fine-tuned models is a useful technique for improving accuracy, this section explores two limitations of the approach. The first is the applicability of model soups, and the second is the failure of model soups to substantially improve calibration.
Applicability. So far our experiments have mainly explored models pre-trained on large, heterogeneous datasets. In Appendix G we also explore model soups for an ImageNet-22k pre-trained model. While the greedy soup still provides improvements on ImageNet, these improvements are less substantial compared to those observed when fine-tuning CLIP and ALIGN.
Calibration. While ensembles improve model calibration (Guo et al., 2017; Roelofs et al., 2020), model soups do not have the same effect. As hyperparameters can also have an effect on calibration, we consider the ensemble and soup of 20 models which are identical other than random seed. Results are illustrated in Figure B.2 using the calibration metrics of Roelofs et al. (2020).
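As a rough illustration of what a calibration metric measures, the sketch below computes a standard expected calibration error (ECE) with equal-width confidence bins. This is a generic textbook variant chosen for illustration and is not necessarily one of the metrics of Roelofs et al. (2020) used in Figure B.2; the function name and toy data are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10
) -> float:
    """Weighted average over bins of |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    counts = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls in the last bin.
        b = min(int(conf * num_bins), num_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if ok else 0.0
        counts[b] += 1
    total = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        if counts[b]:
            gap = abs(hit_sum[b] / counts[b] - conf_sum[b] / counts[b])
            ece += counts[b] / total * gap
    return ece


# Toy predictions: top-class confidence and whether the prediction was right.
toy_confidences = [0.875, 0.875, 0.625, 0.625]
toy_correct = [True, True, True, False]
ece = expected_calibration_error(toy_confidences, toy_correct, num_bins=4)
```

A value of zero would mean that, within every bin, the average confidence matches the fraction of correct predictions.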
Related work
Averaging the weights of models is a popular approach in convex optimization and deep learning. Most applications study models along the same optimization trajectory, e.g. (Ruppert, 1988; Polyak, 1990; Szegedy et al., 2016; Izmailov et al., 2018; Zhang et al., 2019; Kaddour et al., 2022; Junczys-Dowmunt et al., 2016). By contrast, Nagarajan and Kolter (2019); Frankle et al. (2020); Neyshabur et al. (2020); Von Oswald et al. (2020) and Matena and Raffel (2021) weight-average models which share an initialization but are optimized independently. Nagarajan and Kolter (2019) observed that models trained on MNIST (LeCun, 1998) from the same random initialization are connected in weight space by a linear path of high accuracy. Frankle et al. (2020) find that, when training a pair of models from scratch on harder datasets such as ImageNet with the same hyperparameter configuration and initialization but different data order, interpolating weights achieves no better than random accuracy. However, Frankle et al. (2020) showed that when the two models share a portion of their optimization trajectory, accuracy does not drop when they are averaged. Analogously, Neyshabur et al. (2020) demonstrate that when two models are fine-tuned with the same pre-trained initialization, the interpolated model attains at least the accuracy of the endpoints. Unlike Nagarajan and Kolter (2019); Frankle et al. (2020); Neyshabur et al. (2020) we consider averaging many models with varied hyperparameter configurations.
In the late phases of training, Von Oswald et al. (2020) make copies of a subset of the neural network parameters (e.g., the batch norm weights, the classification layer, etc.). These parameters are then optimized independently and subsequently averaged. In contrast to Von Oswald et al. (2020), a) we average across independent runs with hyperparameter diversity, b) we modify all weights in the network, and c) we consider the transfer setting. Matena and Raffel (2021) merge models with the same pre-trained initialization that are fine-tuned on different text classification tasks. They also propose Fisher information as an alternative technique for model merging. We experiment with averaging models which are trained on different datasets in Appendix E; however, in contrast to Matena and Raffel (2021), we do not use data from the target distribution. Wortsman et al. (2021) average zero-shot and fine-tuned models, finding improvements in- and out-of-distribution. In contrast to Wortsman et al. (2021), we average models across many independent runs, which provides more substantial improvements.
Stochastic Weight Averaging (SWA) (Izmailov et al., 2018), which averages weights along a single optimization trajectory, is also motivated by the relation between ensembling model outputs and averaging model weights. In contrast, the averaging we propose is across independent runs. Moreover, while their analysis relates the averaged network outputs (i.e., the logit ensemble) to the output of a network with the averaged weights, our analysis (Section 4) goes a step further and relates the classification losses associated with these two vectors.
In computer vision and natural language processing, the best performing models are often pre-trained on a large dataset before being fine-tuned on data from the target task (Donahue et al., 2014; Yosinski et al., 2014; Sharif Razavian et al., 2014; Girshick et al., 2014; Mahajan et al., 2018; Kornblith et al., 2019; Yalniz et al., 2019; Kolesnikov et al., 2020; Bommasani et al., 2021). This paradigm is also referred to as transfer learning. Recently, image-text pre-training has become increasingly popular in computer vision as a pre-training task (Radford et al., 2021; Jia et al., 2021; Mu et al., 2021; Pham et al., 2021; Yu et al., 2022). Recent work has explored alternative strategies for adapting these models to specific target tasks (Zhou et al., 2021; Gao et al., 2021; Zhang et al., 2021), for instance via a lightweight residual feature adapter. In contrast, our work explores standard end-to-end fine-tuned models. Other work has attempted to improve transfer learning by regularizing models toward their initialization (Xuhong et al., 2018), choosing layers to tune on a per-example basis (Guo et al., 2019), reinitializing layers over the course of training (Li et al., 2020), or using multiple pretrained models with data-dependent gating (Shu et al., 2021).
Combining the outputs of many models is a foundational technique for improving the accuracy and robustness of machine learning models (Dietterich, 2000; Bauer and Kohavi, 1999; Breiman, 1996; Friedman et al., 2001; Lakshminarayanan et al., 2017; Freund and Schapire, 1997). Ovadia et al. (2019) show that ensembles exhibit high accuracy under distribution shift. Mustafa et al. (2020) propose a method for identifying subsets of pre-trained models for fine-tuning and later ensembling them, finding strong in-distribution accuracy and robustness to distribution shift. Gontijo-Lopes et al. (2022) conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to uncorrelated errors and better ensemble accuracy. Finally, previous work has explored building ensembles of models produced by hyperparameter searches (Snoek et al., 2015; Mendoza et al., 2016; Saikia et al., 2020), including greedy selection strategies (Caruana et al., 2004, 2006; Lévesque et al., 2016; Wenzel et al., 2020). Importantly, ensembles require a separate inference pass through each model, which increases computational costs. When the number of models is large, this can be prohibitively expensive. Unlike ensembles, model soups require no extra compute at inference time.
Conclusion
Our results challenge the conventional procedure of selecting the best model on the held-out validation set when fine-tuning. With no extra compute during inference, we are often able to produce a better model by averaging the weights of multiple fine-tuned solutions.
We thank Ting Chen, Jesse Dodge, Ben Eysenbach, David Fleet, Pieter-Jan Kindermans, Mohammad Norouzi, Sarah Pratt and Vivek Ramanujan for helpful discussions and draft feedback, Lucas Beyer and Xiaohua Zhai for assistance with ViT-G/14 fine-tuning, and Hyak at UW for computing support. YC was supported in part by the Israeli Science Foundation (ISF) grant no. 2486/21, the Len Blavatnik and the Blavatnik Family foundation, and The Yandex Initiative for Machine Learning. This work is in part supported by the NSF AI Institute for Foundations of Machine Learning (IFML), Open Philanthropy, NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, DARPA W911NF-15-1-0543 and gifts from Allen Institute for AI.
References
Appendix A Overview
The appendix is organized as follows:
Appendix B (Additional figures) supplements the main text with additional figures.
Appendix C (BASIC) presents additional experiments exploring model soups for BASIC (Pham et al., 2021).
Appendix D (Robust fine-tuning) compares model soups with WiSE-FT (Wortsman et al., 2021), a technique for fine-tuning while preserving robustness.
Appendix E (Cross-dataset soups) explores soups for models which are trained on different datasets to improve zero-shot transfer.
Appendix F (Analysis of 1D hyperparameter grids) compares the performance of averaging endpoints with intermediate solutions for hyperparameters on a one dimensional grid.
Appendix G (Additional fine-tuning and pre-training datasets) explores model soups for additional datasets.
Appendix H (Additional grid searches and initializations) supplements the results in the main text with other hyperparameter sweeps and model initializations (i.e., zero-shot instead of LP initialization).
Appendix I (Learned soup) describes the more advanced souping procedure where we learn the soup mixing coefficients with gradient based optimization on the held-out validation set.
Appendix J (Experimental details) provides additional details for the experiments.
Appendix K (Analytical comparison details) supplements Section 4 in analytically comparing soups and ensembles.
Appendix L (Additional baselines) compares soups with additional baselines including stochastic weight averaging (Izmailov et al., 2018) and sharpness-aware minimization (Foret et al., 2021).
Appendix B Additional figures
Appendix C BASIC
After our initial submission we tested model soups when fine-tuning BASIC-L (Pham et al., 2021). Due to memory constraints, we fine-tune with a batch size of 64 instead of 512. We initialize with the zero-shot classification head and train for 8 epochs using the Adafactor optimizer (Shazeer and Stern, 2018) at a resolution of . We sweep over a grid of learning rates ( or ) and 10 data augmentation settings, resulting in 20 different models. We use random crops and flips with a minimum crop size of 90% of the image together with mixup (Zhang et al., 2017) or CutMix (Yun et al., 2019) with , AutoAugment with . We additionally train models with random crops and flips with minimum crop sizes of 5% and 90% without additional augmentation.
As in our ViT-G/14 experiments (Section 3.3.2), we save exponential moving averages with low and high EMA decay factors, and find that low EMA weights provide better performance for greedy souping and greedy ensembling whereas high EMA weights provide better single-model performance. We adjust the EMA factors for the difference in batch size and thus use a decay factor of for our low EMA configuration and for our high EMA configuration. During each training run, for each set of EMA weights, we evaluate accuracy on our held out validation set every 5000 steps and use the best checkpoint for ensembling, souping, and subsequent evaluation. We resize the full image to for evaluation.
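The exponential moving average of the weights tracked during a training run can be illustrated as follows. This is a minimal sketch rather than the training code used for these experiments: weights are plain Python lists, and the decay values `0.9` and `0.999` are hypothetical placeholders, not the factors used above.

```python
def ema_update(ema_weights, weights, decay):
    """Return decay * ema + (1 - decay) * weights, elementwise."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy "training run": the raw weights drift upward by 1.0 per step.
weights = [0.0, 0.0]
ema_low = list(weights)   # low decay: follows the most recent weights closely
ema_high = list(weights)  # high decay: averages over a much longer history

for step in range(100):
    weights = [w + 1.0 for w in weights]
    ema_low = ema_update(ema_low, weights, decay=0.9)      # hypothetical value
    ema_high = ema_update(ema_high, weights, decay=0.999)  # hypothetical value
```

With a low decay factor the averaged weights stay close to the current iterate, while a high decay factor smooths over many more steps, which is the trade-off between the two configurations described above.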
Results are shown in Table C.1. The greedy soup consistently outperforms the individual model with highest accuracy on the held-out validation set. The best BASIC-L model on each individual test set sometimes outperforms the greedy soup, but selecting the model on the test set will generally overestimate its true accuracy.
Appendix D Robust fine-tuning
Wortsman et al. (2021) introduce WiSE-FT, a method for improving the robustness of a model which is fine-tuned from initialization by linearly interpolating and . An intriguing observation was that, once the data augmentation is fixed, interpolating between and often traces a similar curve regardless of hyperparameters. This is visible in Figure D.1 (right) where different data augmentations are shown with different colors. On the other hand, in Figure D.1 (left) there are many different methods of data augmentation as we conduct a random hyperparameter search. In other words, a reasonable hypothesis was that this curve is Pareto optimal—no hyperparameter configuration would surpass it. In Figure D.1, we trace the curves when interpolating between and for a random hyperparameter search (left) and the standard grid search described in Appendix J.2.1 (right) when fine-tuning CLIP ViT-B/32. We find that the uniform soup and greedy soup lie beyond these interpolation curves. Moreover, we find interpolating between these soups and the initialization also provides additional accuracy improvements on the distribution shifts.
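The interpolation curves discussed here come from linearly mixing two sets of weights. The sketch below is illustrative only: the names `theta_init`, `theta_ft`, and `alpha` are assumptions introduced for the example, and parameters are stored as dictionaries of plain lists.

```python
def interpolate(theta_init, theta_ft, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_ft for every parameter."""
    return {
        name: [(1.0 - alpha) * a + alpha * b
               for a, b in zip(theta_init[name], theta_ft[name])]
        for name in theta_init
    }


theta_init = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
theta_ft = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}

# Sweeping alpha from 0 to 1 traces a path from the initialization
# to the fine-tuned solution; each point would be evaluated separately.
curve = [interpolate(theta_init, theta_ft, k / 4) for k in range(5)]
```

The same function applies unchanged when `theta_ft` is replaced by a uniform or greedy soup, which is how a soup can be interpolated with the initialization.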
Appendix E Cross-dataset soups
So far, our experiments have studied soups of models fine-tuned on the same dataset with different hyperparameters. In this section, we prepare soups containing models fine-tuned on different datasets. We evaluate the resulting soups on a held-out dataset, from which no labeled training data is used (i.e., zero-shot evaluation).
Concretely, we consider soups based on the CLIP zero-shot initialization along with six models fine-tuned independently on CIFAR-10 (Krizhevsky et al., 2009), Describable Textures (Cimpoi et al., 2014), Food-101 (Bossard et al., 2014), SUN397 (Xiao et al., 2016), Stanford Cars (Krause et al., 2013) and ImageNet (Deng et al., 2009). We evaluate on CIFAR-100 (Krizhevsky et al., 2009), which does not share classes with CIFAR-10. Since each task has a different set of classes, the last layers cannot be part of the soup. Hence, during fine-tuning, we freeze the linear head produced by CLIP’s text tower so that task-specific learning is captured only in the backbone weights. At test time, we use the “backbone soup” with a zero-shot head constructed from CLIP’s text tower and the CIFAR-100 class names with the prompt-ensembling used for ImageNet by Radford et al. (2021). Figure E.1 (left) shows that a model soup containing models trained on each of these datasets and the zero-shot model improves zero-shot performance on CIFAR-100 by 6.4 percentage points over the CLIP baseline. Moreover, Figure E.1 (right) shows that the choice of which fine-tuned models to include can have a substantial impact on the accuracy of the resulting soup. See Appendix J.3 for additional details.
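A "backbone soup" of this kind can be sketched as a uniform average over every parameter except the classification head. The parameter naming scheme (`head.` prefix), the toy values, and the stand-in zero-shot head are assumptions made for illustration.

```python
def backbone_soup(state_dicts, head_prefix="head."):
    """Uniformly average all parameters whose name does not start with head_prefix."""
    names = [n for n in state_dicts[0] if not n.startswith(head_prefix)]
    soup = {}
    for name in names:
        columns = zip(*(sd[name] for sd in state_dicts))
        soup[name] = [sum(col) / len(state_dicts) for col in columns]
    return soup


models = [
    {"backbone.w": [1.0, 2.0], "head.w": [9.0]},
    {"backbone.w": [3.0, 4.0], "head.w": [7.0, 7.0]},  # different number of classes
]
soup = backbone_soup(models)

# Stand-in for a head built from text embeddings of the target class names.
zero_shot_head = [0.5, 0.5, 0.5]
model = dict(soup, **{"head.w": zero_shot_head})
```

Excluding the head from the average sidesteps the mismatch in output dimension between tasks; the head for the held-out dataset is attached only after souping.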
Appendix F Analysis of 1D hyperparameter grids
This section asks: for a one dimensional grid of hyperparameters , how does averaging the models fine-tuned with hyperparameter configurations and corresponding to the endpoints compare with picking the best individual model fine-tuned with hyperparameter configuration ?
The results are illustrated in Figure F.1, where each square represents a grid . The average of the endpoints often outperforms the best individual model in the grid. A notable exception is when the learning rate is the left endpoint of the grid. As this experiment uses AdamW, this learning rate is too high for fine-tuning and, unlike the examples in Figure 2, there is a high error barrier between the two fine-tuned solutions (see Figure J.1, lower right for example).
When varying optimizer we use minimal data augmentation and LR for RMSProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2014), and AdamW (Loshchilov and Hutter, 2019). SGD requires a larger learning rate, and so we use . When varying augmentation strength, we use minimal data augmentation and LR .
Appendix G Additional fine-tuning and pre-training datasets
In this section we explore fine-tuning or pre-training on additional datasets. First, Figure G.1 displays results when fine-tuning a CLIP ViT-L model on two datasets included in the WILDS (Koh et al., 2021) challenge, FMoW (Christie et al., 2018) and iWildCam (Beery et al., 2021).
Next, Figure G.2 displays results for fine-tuning a CLIP ViT-L model on CIFAR-10 (Krizhevsky et al., 2009). The -axis of Figure G.2 displays accuracy on CIFAR-10.1 (Recht et al., 2019), a reproduction of CIFAR-10 with a distribution shift. The individual models are fine-tuned with the random hyperparameter search described in Section J.2.1.
In addition, Figure G.3 shows results when fine-tuning a ViT-B/32 (Dosovitskiy et al., 2021) model pre-trained on ImageNet-22k (Deng et al., 2009) and fine-tuned on ImageNet. This differs from many of our other experiments as the dataset used for pre-training is smaller and less diverse. While the greedy soup offers an improvement, the improvement is less substantial than Figure 1 which uses the same model and hyperparameter search but a different pre-training dataset.
Finally, we fine-tune a ViT-B/32 model five times on ImageNet, using the best hyperparameters found by the hyperparameter sweep, varying only the random seed. This experiment is conducted both for a model pre-trained on ImageNet-22k (Deng et al., 2009) and a pre-trained CLIP model. The results are shown in Figure G.4, comparing, for an experimental budget of models: (i) the individual model with random seed , (ii) the model soup composed of models with random seeds 1 through , and (iii) the ensemble composed of models with random seeds 1 through . The performance of the model soup appears correlated with the performance of the ensemble. Moreover, we find that CLIP models are more amenable to both ensembling and souping than models pre-trained on ImageNet-22k.
Appendix H Additional grid searches and initializations
This section recreates Figure B.1 with different initializations (linear probe initialization or zero-shot) and different grid searches (standard and extreme grid) when fine-tuning CLIP ViT-B/32. The standard and extreme grid searches are described in Section J.2.1.
Figure H.1 considers the linear probe (LP) initialization and the standard grid. Figure H.2 considers the linear probe (LP) initialization and the extreme grid. Figure H.3 considers the zero-shot initialization and the standard grid. Figure H.4 considers the zero-shot initialization and the extreme grid.
Appendix I Learned soup
In practice we find better results when the mixing coefficients are parameterized as the output of a softmax, so that each coefficient is positive and the values sum to one. We optimize the aforementioned equation with gradient-based mini-batch optimization for three epochs over the held-out validation set with the AdamW optimizer and constant learning rate 0.1.
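The softmax parameterization can be illustrated in a few lines. This sketch shows only how unconstrained parameters are mapped to mixing coefficients and used to combine weights; the gradient-based optimization itself is omitted, and the names `raw`, `alpha`, and `mix` are assumptions introduced for the example.

```python
import math


def softmax(logits):
    """Map unconstrained values to positive coefficients that sum to one."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(models, alpha):
    """Weighted average of flat weight vectors with coefficients alpha."""
    return [sum(a * m[i] for a, m in zip(alpha, models))
            for i in range(len(models[0]))]


models = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
raw = [0.0, 1.0, -2.0]  # the unconstrained values an optimizer would update
alpha = softmax(raw)
soup = mix(models, alpha)

# Equal raw values recover the uniform soup.
uniform = mix(models, softmax([0.0, 0.0, 0.0]))
```

Because the optimizer updates `raw` rather than `alpha`, the coefficients remain a valid convex combination throughout training without any projection step.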
As presented in Table 3, we also try a “by layer” variant of the learned soup. For this we learn a separate set of mixing coefficients for each layer of the network. Finally, another way to get non-uniform mixing coefficients is to sample with replacement in the greedy soup procedure.
Appendix J Experimental details
To supplement Figure 2, we provide an identical experiment but with a 10x bigger learning rate instead of 10x smaller. Results are illustrated in Figure J.1 with linear instead of log scaling for the contour lines. Since the error difference is more substantial, linear scaling was more clear. When fine-tuning with a larger learning rate, error increases on the path between the two fine-tuned solutions. All error landscape visualizations use CLIP ViT-B/32 fine-tuned on ImageNet for 10 epochs with minimal data augmentation, as used by CLIP during pre-training. When computing angles between the two fine-tuned solutions, as in Figure 3, we use the repeated weights which constitute the majority of the network parameters. We ignore gain terms which tend to skew positive if occurring before ReLU activations.
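An angle between two fine-tuned solutions can be computed as sketched below. The sketch assumes the angle is measured between the two displacement vectors from the shared initialization, and that the caller has already flattened the weights and dropped any parameters that should be ignored (such as gain terms); both are assumptions of this example, as are the function and argument names.

```python
import math


def angle_degrees(theta_0, theta_1, theta_2):
    """Angle between (theta_1 - theta_0) and (theta_2 - theta_0), in degrees.

    Assumes both solutions differ from theta_0, so neither norm is zero.
    """
    d1 = [a - z for a, z in zip(theta_1, theta_0)]
    d2 = [b - z for b, z in zip(theta_2, theta_0)]
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos))


orthogonal = angle_degrees([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
aligned = angle_degrees([1.0, 1.0], [2.0, 1.0], [3.0, 1.0])
```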
In Figure 3 we consider solutions fine-tuned with learning rates less than . As in Figure J.1, if a large learning rate is used, accuracy will decrease on the path in weight space between the two models.
J.2 Model soups
This section describes the set of hyperparameters used for searches. For all ImageNet experiments, we withhold 2% of the training set and use these examples as the held-out validation set for model selection in greedy and learned soup.
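The role of the held-out validation set in the greedy soup can be sketched as follows: candidates are visited in decreasing order of held-out accuracy, and each is kept only if adding it does not lower the held-out accuracy of the running average. This is an illustrative sketch on flat weight vectors with a toy scoring function standing in for validation accuracy; the names are assumptions.

```python
def average(models):
    """Elementwise mean of a list of flat weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def greedy_soup(models, val_accuracy):
    """models: list of flat weight vectors; val_accuracy: callable weights -> float."""
    ranked = sorted(models, key=val_accuracy, reverse=True)
    ingredients = [ranked[0]]
    best = val_accuracy(ranked[0])
    for candidate in ranked[1:]:
        acc = val_accuracy(average(ingredients + [candidate]))
        if acc >= best:  # keep the candidate only if held-out accuracy does not drop
            ingredients.append(candidate)
            best = acc
    return average(ingredients), len(ingredients)


# Toy stand-in for held-out accuracy: higher when weights are closer to (1, 1).
def toy_accuracy(w):
    return -((w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2)


candidates = [[0.9, 1.2], [1.1, 0.8], [5.0, 5.0]]
soup, num_ingredients = greedy_soup(candidates, toy_accuracy)
```

In this toy run the two nearby solutions are averaged and the outlier is rejected, so the soup is never worse on the held-out set than the best individual model.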
Unless otherwise mentioned, all experiments used the AdamW optimizer (Loshchilov and Hutter, 2019) with a cosine annealing learning rate schedule (Loshchilov and Hutter, 2016) for 10 epochs at batch size 512 at a resolution of 224 × 224. When necessary we discretize augmentation strength into minimal, medium, and strong. Minimal augmentation uses only a random crop consisting of 90%-100% of the total image area. Medium is the default augmentation used by the timm library (Wightman, 2019). Strong refers to RandAugment (Cubuk et al., 2020) (, ).
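For reference, a cosine annealing schedule without restarts or warm-up can be written as below. This is a generic sketch of the schedule, not the configuration code used for these experiments; the function name and the example learning rate are assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr):
    """Decay base_lr to zero along half a cosine period over total_steps."""
    progress = step / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, 100, 1.0) for s in (0, 50, 100)]
```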
We now provide the low-level details for the hyperparameter searches, which are standard grid, extreme grid, and random search. The standard grid includes learning rates , where typically perform the best. Augmentation strengths are minimal, medium, or strong. Mixup is either off or on at . We consider all combinations of the above, running each hyperparameter configuration with two random seeds.
The extreme grid considers learning rates , where typically perform the best. Augmentation strengths are minimal, medium, or strong. Mixup is either off or on at . Moreover, we include the initialization in this search, which often outperforms some of the extreme learning rates but is far from the most accurate model.
The random search chooses learning rate where is selected uniformly at random from 4 to 6. Weight decay is chosen randomly as where is selected uniformly at random from 0.2 to 4. With probability 0.5, label smoothing is set to 0 and otherwise it is selected uniformly at random between 0 and 0.25. Fine-tuning epochs are chosen randomly between four and sixteen. Mixup is 0 with probability 0.5, and otherwise is chosen uniformly at random from 0 to 0.9. With probability we use minimal augmentation, otherwise we use randaug where and are chosen uniformly at random between 0 and 20 and 0 and 2 respectively.
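The sampling procedure for the random search can be sketched as follows. Several quantities are not reproduced in the text above, so they are left as explicit arguments here: `base` (the sketch assumes learning rate and weight decay take the form `base ** (-x)` for the sampled exponent `x`) and `p_minimal` (the probability of minimal augmentation). The values passed in the example call are hypothetical, as are the dictionary keys, the use of inclusive integer epochs, and continuous sampling of the two RandAugment parameters.

```python
import random


def sample_config(rng, p_minimal, base):
    """Draw one hyperparameter configuration; see the assumptions in the lead-in."""
    lr_exponent = rng.uniform(4.0, 6.0)
    wd_exponent = rng.uniform(0.2, 4.0)
    config = {
        "lr": base ** (-lr_exponent),            # ASSUMPTION: base ** (-exponent)
        "weight_decay": base ** (-wd_exponent),  # ASSUMPTION: base ** (-exponent)
        "label_smoothing": 0.0 if rng.random() < 0.5 else rng.uniform(0.0, 0.25),
        "epochs": rng.randint(4, 16),            # ASSUMPTION: integer, inclusive
        "mixup": 0.0 if rng.random() < 0.5 else rng.uniform(0.0, 0.9),
    }
    if rng.random() < p_minimal:
        config["augmentation"] = ("minimal",)
    else:
        # Two RandAugment parameters, drawn from [0, 20] and [0, 2] respectively.
        config["augmentation"] = ("randaug", rng.uniform(0.0, 20.0), rng.uniform(0.0, 2.0))
    return config


# Hypothetical argument values, for illustration only.
example = sample_config(random.Random(0), p_minimal=0.5, base=10.0)
```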
When fine-tuning on WILDS-FMoW and WILDS-iWildCam for Figure G.1, we use the same random search as when we fine-tune CLIP on ImageNet. The only difference is that we are able to use a larger ViT-L/14 model as the datasets are smaller. This also requires us to change the default batch size from 512 to 128.
J.2.2 ALIGN experiments
We fine-tuned ALIGN EfficientNet-L2 models using AdamW with weight decay of 0.1 at a resolution of for 25 epochs, with the final layer initialized from a linear probe without data augmentation. We fine-tuned 5 models with standard Inception-style random crops (consisting of 5% to 100% of the total image area with an aspect ratio between 0.75 and 1.33) and different learning rates (, , , , and ). We also fine-tuned 7 additional models at a learning rate of with different data augmentation strategies. Specifically, we varied the random cropping strategy (either Inception-style crops or less aggressive crops consisting of 90% to 100% of the total image area with an aspect ratio between 0.95 and 1.05), the use of RandAugment (Cubuk et al., 2020) (off or , ), and the use of mixup (Zhang et al., 2017) (off or ) and trained models with all combinations of these strategies. Our soups are obtained by considering these 12 models as well as the linear probe initialization. We perform evaluation at resolution using a square center crop from images. The accuracy we attain with greedy soup approaches that reported by Jia et al. (2021), which evaluated at resolution.
J.2.3 ViT-G/14 experiments
These models are initialized with a backbone that was pretrained on the JFT-3B dataset (Zhai et al., 2021) and linear probes obtained at either the resolution at which the ViT-G/14 was pretrained or at the resolution used for fine-tuning. Models are fine-tuned at a batch size of 512 for either 10,000 or 20,000 steps (approximately 4 or 8 epochs) using the Adafactor optimizer (Shazeer and Stern, 2018) with learning rates of or ; a constant or cosine decay learning rate schedule; and softmax or binary cross-entropy loss. When fine-tuning with binary cross-entropy loss, we use a linear probe that is also trained with binary cross-entropy loss. We vary data augmentation, applying RandAugment (Cubuk et al., 2020), mixup (Zhang et al., 2017), or CutMix (Yun et al., 2019) of varying strengths and random cropping with a minimum crop size of 5%, 70%, 90%, or 100% of the full image. When applying SAM, we consider models with perturbations either synchronized or unsynchronized across accelerators, including one model with synchronized perturbations and a combination of CutMix and SAM. All models are fine-tuned at resolution and evaluated by rescaling test images to (without preserving the aspect ratio) and taking a central crop.
We manually tuned hyperparameters with the goal of maximizing single-model accuracy. After settling on the use of Adafactor as the optimizer, we included all subsequently trained models in the pool of models to be used for greedy soup. The model that performs best on the holdout set is initialized with a linear probe and fine-tuned with a learning rate of 3e-5 and a constant learning rate decay schedule, with softmax cross-entropy loss, a minimum crop size of 90%, and CutMix with . The model that performs best on the official ImageNet validation set is initialized with a linear probe and fine-tuned at a learning rate of 3e-5 and a constant learning rate decay schedule, with softmax cross-entropy loss, a minimum crop size of 90%, CutMix with , and SAM. The greedy soup contains models trained with a wide range of different hyperparameter values including different learning rates, linear probes, loss functions, and every form of data augmentation and minimum crop size investigated. Notably, although models trained with SAM with synchronized perturbations are included in the greedy soup, the greedy soup process skips over the models trained with SAM with unsynchronized perturbations because adding them produces a large drop in holdout accuracy.
J.3 Cross-dataset soups details
When fine-tuning we initialize with CLIP ViT-B/32 and use learning rate for 10 epochs with mini-batch size of 512. We train with minimal augmentation.
J.4 Text classification datasets
We study four text classification datasets from the GLUE benchmark (Wang et al., 2018).
Microsoft Research Paraphrase Corpus (MRPC; Dolan and Brockett, 2005) contains pairs of sentences, labeled as either nearly semantically equivalent, or not. The dataset is evaluated using the average of F1 and accuracy. The training set consists of 3.7 thousand samples and the validation set of 409 samples.
Recognizing Textual Entailment (RTE; Wang et al., 2018) contains pairs of sentences, and the task is to predict whether the first sentence (the premise) entails or contradicts the second sentence (the hypothesis). The data is originally from a series of datasets (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009). The dataset is evaluated using classification accuracy. The training set consists of 2.5 thousand samples and the validation set of 277 samples.
Corpus of Linguistic Acceptability (CoLA; Warstadt et al., 2019) contains sentences labeled as either grammatical or ungrammatical. Models are evaluated on Matthews correlation (MCC; Matthews, 1975), which ranges between -1 and 1. The training set consists of 8.6 thousand samples and the validation set consists of 1043 samples.
Stanford Sentiment Treebank (SST-2; Socher et al., 2013) contains sentences labelled as expressing positive or negative sentiment, collected from movie reviews. The dataset is evaluated using classification accuracy. The training set consists of 67 thousand samples and the validation set consists of 873 samples.
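Since Matthews correlation is less familiar than accuracy, a minimal sketch of the binary case is given below. It uses the standard confusion-matrix formula; returning 0 when the denominator vanishes is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary Matthews correlation coefficient for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


perfect = matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0])
inverted = matthews_corrcoef([1, 1, 0, 0], [0, 0, 1, 1])
chance = matthews_corrcoef([1, 1, 0, 0], [1, 0, 1, 0])
```

Perfect agreement, perfect disagreement, and chance-level predictions give the two extremes and the midpoint of the metric's range, respectively.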
J.5 Fine-tuning details for text classification tasks
Each model is fine-tuned 32 times on each dataset, performing a random hyperparameter search. The learning rate is chosen uniformly in log space over [, ], the batch size is chosen uniformly from and the number of epochs from . Evaluation is conducted once at the end of training, without early stopping. We use a maximum sequence length of 128 tokens and train with Adam (Kingma and Ba, 2014) using , and , gradient clipping of , no weight decay, and with the learning rate being decayed linearly to zero at the end of training. We use pre-trained weights from the Huggingface Transformers library (Wolf et al., 2020). For BERT models, we use the uncased version.
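Sampling a learning rate uniformly in log space and decaying it linearly to zero can be sketched as below. The bounds in the example call are hypothetical placeholders, not the bounds of the search described above, and the function names are assumptions.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate that decays linearly from base_lr to zero at total_steps."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


rng = random.Random(0)
base_lr = sample_log_uniform(rng, 1e-6, 1e-3)  # hypothetical bounds
lrs = [linear_decay_lr(s, 1000, base_lr) for s in (0, 500, 1000)]
```

Sampling in log space gives each order of magnitude the same probability mass, which matters when the search range spans several decades.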
Fine-tuning occurs without any additional parameters to avoid distorting the features from the pre-trained models (Kumar et al., 2022). To this end, the classification tasks are adapted to suit the pre-training objectives of BERT and T5. For T5, the tasks are cast as a sequence-to-sequence problem. For instance, for sentiment analysis, an example is to predict “A) positive” from “sentence: The best movie I’ve ever seen! | options: A) positive B) negative | label:”. For BERT, the tasks are cast as a masked language modeling problem. For instance, for linguistic acceptability, an example is to predict “A) acceptable” for the inputs “sentence: model soups are grammatical. | options: A) acceptable B) unacceptable | label: [MASK] [MASK] [MASK]”. For evaluation, we select which of the options is given the highest probability according to the model.
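The input format described above can be reproduced with a small helper. The sketch below follows the two example strings given in the text; the function names, the `mask_tokens` argument, and the per-option scores passed to `predict` are assumptions for illustration, and no tokenizer or model is involved.

```python
def format_example(sentence, options, mask_tokens=0):
    """Build the prompt and the lettered option strings for one example."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    labelled = [f"{letters[i]}) {opt}" for i, opt in enumerate(options)]
    prompt = f"sentence: {sentence} | options: {' '.join(labelled)} | label:"
    if mask_tokens:
        # Masked-language-model variant: the label is predicted at masked positions.
        prompt += " " + " ".join(["[MASK]"] * mask_tokens)
    return prompt, labelled


def predict(option_scores, labelled):
    """Return the option string with the highest model score."""
    best = max(range(len(labelled)), key=lambda i: option_scores[i])
    return labelled[best]


prompt, labelled = format_example(
    "model soups are grammatical.", ["acceptable", "unacceptable"], mask_tokens=3
)
choice = predict([0.8, 0.2], labelled)  # hypothetical scores
```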
The full set of results is shown in Table J.1. On 10 out of the 20 combinations of models and datasets, the greedy soup shows better performance than the best individual model from the hyperparameter search. Uniform soups show worse performance than the best individual model on all experiments, which could be an artifact of the broad range of hyperparameters used in the search. While the experiments varied only basic hyperparameters such as learning rate and batch size, we hypothesize that a broader set of hyperparameter choices (e.g. data augmentation (Wei and Zou, 2019; Ma, 2019)) could lead to more diverse models and better soups.
Finally, as a word of caution for practitioners, we remind readers that many recent language models have tied weights on the output and embedding layers (Press and Wolf, 2017). For this reason, caution is needed when writing code to average models in-place.
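The pitfall can be reproduced with a toy example in which two parameter names refer to the same underlying buffer. The sketch uses plain Python lists in place of tensors, and the parameter names and both functions are hypothetical; it illustrates the failure mode and one possible guard rather than any particular library's behavior.

```python
def average_in_place_naive(target, others):
    """Buggy with tied weights: a shared buffer is averaged once per name."""
    n = len(others) + 1
    for name, buf in target.items():
        for i in range(len(buf)):
            buf[i] = (buf[i] + sum(o[name][i] for o in others)) / n


def average_in_place_safe(target, others):
    """Skip buffers that were already averaged under another name."""
    n = len(others) + 1
    seen = set()
    for name, buf in target.items():
        if id(buf) in seen:
            continue
        seen.add(id(buf))
        for i in range(len(buf)):
            buf[i] = (buf[i] + sum(o[name][i] for o in others)) / n


def make_model(values):
    shared = list(values)
    # Output and embedding layers point at the same buffer (tied weights).
    return {"embedding.weight": shared, "output.weight": shared}


naive = make_model([2.0])
average_in_place_naive(naive, [make_model([4.0])])

safe = make_model([2.0])
average_in_place_safe(safe, [make_model([4.0])])
```

In the naive version the shared buffer is first set to the correct mean of 3.0 and then averaged a second time with the other model, giving 3.5; tracking buffer identity avoids the double update. Averaging into freshly allocated buffers and re-tying afterwards is another option.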
Appendix K Analytical comparison details
to be the “soup” weight averaged model and its corresponding logits. We also write
for the logits of the ensemble model. We write
For some distribution over we write the expected -calibrated log losses of the soup and ensemble as
We have the following expression for the derivatives of cross entropy w.r.t. logits. The gradient is
K.2 An exact expression for logit difference
We use the fundamental theorem of calculus and elementary algebraic manipulation to obtain an exact integral form for the difference between the soup and ensemble logits. To streamline notation we drop the dependence of the logits on the input .
K.3 Derivation of approximation
We continue to suppress the dependence on in order to simplify notation. We begin with the following first order approximation of the pointwise log-loss difference between the ensemble and soup, which is also a lower bound due to convexity.
Now, we approximate the ensemble and soup logit difference using eq. 3 by assuming that for all ; this holds when the logits are approximately quadratic along the line between the checkpoints. The resulting approximation is
Combining the two approximations above, we obtain
To relate this expression to the Hessian of the loss with respect to the parameters, we note that for any (by the chain rule)
When setting , we note that the second term on the RHS is (up to a constant) our approximation for the loss difference. Recalling the expression for the cross-entropy Hessian, the first term is
this holds when logits are not too far from linear in .
Substituting back and making explicit, we obtain
Scaling all logits by , the approximation becomes
Averaging the result over , we arrive at the approximation (1), which we repeat here for ease of reference:
K.4 Detailed empirical evaluations
We evaluated our bounds on checkpoints from the ViT-B/32 fine-tuning experiments from the extreme grid search described in Section J.2.1. We selected three learning rate values (, and ), two levels of augmentation (none and RandAugment+MixUp), and considered two different random seeds ( and ). From these checkpoints (as well as the initialization) we constructed the following pairs:
All pairs with different learning rate, the same augmentation level and seed 0,
All pairs with the same learning rate, different augmentation level and seed 0,
All pairs with the same learning rate and augmentation level, but different seeds,
All checkpoints with seed 0 coupled with the initialization.
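The four pairing rules above can be enumerated programmatically. In this sketch the three learning rates are represented by placeholder labels because their values are not reproduced here, and the second seed is assumed to be 1; checkpoints are identified by hypothetical `(learning_rate, augmentation, seed)` tuples.

```python
from itertools import combinations, product

LRS = ["lr_a", "lr_b", "lr_c"]    # placeholders for the three learning rates
AUGS = ["none", "randaug+mixup"]
SEEDS = [0, 1]                    # ASSUMPTION: the second seed is 1
INIT = "init"


def build_pairs():
    pairs = []
    # Different learning rate, same augmentation level, seed 0.
    for aug in AUGS:
        for lr1, lr2 in combinations(LRS, 2):
            pairs.append(((lr1, aug, 0), (lr2, aug, 0)))
    # Same learning rate, different augmentation level, seed 0.
    for lr in LRS:
        for a1, a2 in combinations(AUGS, 2):
            pairs.append(((lr, a1, 0), (lr, a2, 0)))
    # Same learning rate and augmentation level, different seeds.
    for lr, aug in product(LRS, AUGS):
        for s1, s2 in combinations(SEEDS, 2):
            pairs.append(((lr, aug, s1), (lr, aug, s2)))
    # Every seed-0 checkpoint coupled with the initialization.
    for lr, aug in product(LRS, AUGS):
        pairs.append(((lr, aug, 0), INIT))
    return pairs


pairs = build_pairs()
```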
While choosing based on the soup rather than the ensemble might skew the loss in favor of the soup, it has no effect on the difference in prediction error. Moreover, in preliminary experiments calibrating the ensemble produced very similar results. In contrast, as shown in Figure K.1, fixing throughout results in far poorer prediction of the difference in error.
Appendix L Additional baselines
This section explores additional baselines for model soups, including distillation from an ensemble as in Hinton et al. (2014) (Table L.1), fix-augmentation as in Touvron et al. (2019) (Table L.2), weight-averaging along a trajectory as in Szegedy et al. (2016); Izmailov et al. (2018) (Figures L.1 and L.2), and Sharpness Aware Minimization as in Foret et al. (2021) (Table L.3).
Unless otherwise mentioned, we fine-tune CLIP ViT-B/32 models with AdamW (Loshchilov and Hutter, 2019) and cosine annealing learning rate (Loshchilov and Hutter, 2016) for 10 epochs on ImageNet with a learning rate of 2e-5 and medium augmentation (data augmentation policies are discussed in more detail in Section J.2.1).
We explore the baseline of distillation (Hinton et al., 2014, 2015) from the ensemble of three models trained with different data augmentation. As previously reported (Bagherinezhad et al., 2018; Beyer et al., 2021), we find that it improves accuracy to run distillation with data augmentation. Unfortunately, this substantially increases the computational resources necessary to distill from the ensemble. As we cannot cache the predictions of the models in the ensemble, it is necessary to perform a forward pass for each model in the ensemble at each step of fine-tuning. This makes distilling from an ensemble similarly expensive as training the models which constitute the ensemble. Nevertheless, as illustrated in Table L.1, model soups still perform favorably.
Table L.1 also introduces stochastic augmentation. For each data point, stochastic augmentation randomly applies minimal, medium, or strong data augmentation. Additionally, Table L.2 explores an alternative method for merging augmentations together. This augmentation policy, which we refer to as fix-aug, is introduced by Touvron et al. (2019). For fix-aug, strong augmentation is used for all but the final epoch, which uses minimal augmentation.
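The two augmentation policies can be expressed as simple selection rules. The sketch below assumes that stochastic augmentation picks among the three levels with equal probability, which is not stated above, and uses hypothetical function names; the augmentation pipelines themselves are not implemented.

```python
import random

LEVELS = ("minimal", "medium", "strong")


def stochastic_augmentation(rng):
    """Pick an augmentation level for one data point (ASSUMPTION: uniformly)."""
    return rng.choice(LEVELS)


def fix_aug(epoch, total_epochs):
    """Strong augmentation for all but the final epoch, which uses minimal."""
    return "minimal" if epoch == total_epochs - 1 else "strong"


rng = random.Random(0)
per_example_levels = [stochastic_augmentation(rng) for _ in range(8)]
per_epoch_levels = [fix_aug(epoch, 10) for epoch in range(10)]
```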
Figure L.1 and Figure L.2 apply model soups to solutions which already average along the fine-tuning trajectory. Methods for averaging along an individual optimization trajectory include exponential moving averages (EMA) (Szegedy et al., 2016) and stochastic weight averages (SWA) (Izmailov et al., 2018). We find that EMA and SWA can improve the accuracy of a single model but that model soups provide improvements even when applied to models which have weight-averaging along their trajectory. We try learning rates and and three learning rate schedulers: constant, cosine annealing with restarts, and cosine annealing (all schedules have a short warm up period). In Figure L.1 we fine-tune a CLIP pre-trained ViT-B/32, while Figure L.2 fine-tunes an ImageNet-21k pre-trained ViT-B/32.
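Averaging along a single trajectory and then souping across runs can be sketched as two nested averages. The running-mean update below is the usual equal-weight form of stochastic weight averaging; the checkpoint values and names are hypothetical and no training is performed.

```python
def swa_update(swa_weights, weights, num_averaged):
    """Fold one more checkpoint into a running mean of num_averaged checkpoints."""
    return [s + (w - s) / (num_averaged + 1) for s, w in zip(swa_weights, weights)]


def swa_of_run(checkpoints):
    """Equal-weight average of the checkpoints saved along one trajectory."""
    swa = list(checkpoints[0])
    for n, ckpt in enumerate(checkpoints[1:], start=1):
        swa = swa_update(swa, ckpt, n)
    return swa


run_a = [[1.0], [2.0], [3.0]]  # checkpoints from one fine-tuning run
run_b = [[5.0], [6.0], [7.0]]  # checkpoints from a second run

swa_a = swa_of_run(run_a)
swa_b = swa_of_run(run_b)

# Model soup over the two trajectory-averaged solutions.
soup = [(a + b) / 2 for a, b in zip(swa_a, swa_b)]
```

The two steps are complementary: the inner average smooths one optimization path, while the outer average combines solutions reached from different hyperparameter configurations.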
Table L.3 explores the relation between model soups and sharpness-aware minimization (SAM) (Foret et al., 2021). In line with previous results, we find that SAM improves accuracy over vanilla fine-tuning. Souping two models trained with SAM improves over either individual model, although the magnitude of the gain is smaller than for vanilla fine-tuning. Souping models trained with and without SAM yields higher accuracy than souping models trained only with vanilla fine-tuning or only with SAM.
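For readers unfamiliar with SAM, one update step in its usual two-gradient form is sketched below on a toy quadratic loss. The function names and the values of `lr` and `rho` are illustrative assumptions; this is not the training code behind Table L.3.

```python
import math


def sam_step(weights, grad_fn, lr, rho):
    """One SAM update: ascend to a nearby perturbed point, then descend
    using the gradient computed at that perturbed point."""
    g = grad_fn(weights)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # a zero gradient gives no perturbation
    perturbed = [w + rho * gi / norm for w, gi in zip(weights, g)]
    g_perturbed = grad_fn(perturbed)
    return [w - lr * gi for w, gi in zip(weights, g_perturbed)]


# Toy loss 0.5 * ||w||^2, whose gradient is simply w.
def quadratic_grad(w):
    return list(w)


updated = sam_step([1.0, 0.0], quadratic_grad, lr=0.5, rho=0.1)
```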
As a final, potentially useful comparison, we augment Figure 1 with additional comparisons from Table 3. Results are shown in Figure L.3.