Bilevel Programming for Hyperparameter Optimization and Meta-Learning

Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, Massimilano Pontil

Introduction

While in standard supervised learning problems we seek the best hypothesis in a given space and with a given learning algorithm, in hyperparameter optimization (HO) and meta-learning (ML) we seek a configuration so that the optimized learning algorithm will produce a model that generalizes well to new data. The search space in ML often incorporates choices associated with the hypothesis space and the features of the learning algorithm itself (e.g., how optimization of the training loss is performed). Under this common perspective, both HO and ML essentially boil down to nesting two search problems: at the inner level we seek a good hypothesis (as in standard supervised learning) while at the outer level we seek a good configuration (including a good hypothesis space) where the inner search takes place. Surprisingly, the literature on ML has little overlap with the literature on HO and in this paper we present a unified framework encompassing both of them.

Classic approaches to HO (see e.g. Hutter et al.,, 2015, for a survey) have been only able to manage a relatively small number of hyperparameters, from a few dozens using random search (Bergstra and Bengio,, 2012) to a few hundreds using Bayesian or model-based approaches (Bergstra et al.,, 2013; Snoek et al.,, 2012). Recent gradient-based techniques for HO, however, have significantly increased the number of hyperparameters that can be optimized (Domke,, 2012; Maclaurin et al.,, 2015; Pedregosa,, 2016; Franceschi et al.,, 2017) and it is now possible to tune as hyperparameters entire weight vectors associated with a neural network layer. In this way, it becomes feasible to design models that possibly have more hyperparameters than parameters. Such an approach is well suited for ML, since parameters are learned from a small dataset, whereas hyperparameters leverage multiple available datasets.

HO and ML only differ substantially in terms of the experimental settings in which they are evaluated. While in HO the available data is associated with a single task and split into a training set (used to tune the parameters) and a validation set (used to tune the hyperparameters), in ML we are often interested in the so-called few-shot learning setting where data comes in the form of short episodes (small datasets with few examples per class) sampled from a common probability distribution over supervised tasks.

Early work on ML dates back at least to the 1990’s (Schmidhuber,, 1992; Baxter,, 1995; Thrun and Pratt,, 1998) but this research area has received considerable attention in the last few years, mainly driven by the need in real-life and industrial scenarios for learning quickly a vast multitude of tasks. These tasks, or episodes, may appear and evolve continuously over time and may only contain few examples (Lake et al.,, 2017). Different strategies have emerged to tackle ML. Although they do overlap in some aspects, it is possible to identify at least four of them. The metric strategy attempts to use training episodes to construct embeddings such that examples of the same class are mapped into similar representations. It has been instantiated in several variants that involve non-parametric (or instance-based) predictors (Koch et al.,, 2015; Vinyals et al.,, 2016; Snell et al.,, 2017). In the related memorization strategy, the meta-learner learns to store and retrieve data points representations in memory. It can be implemented either using recurrent networks (Santoro et al.,, 2016) or temporal convolutions (Mishra et al.,, 2018). The use of an attention mechanism (Vaswani et al.,, 2017) is crucial both in (Vinyals et al.,, 2016) and in (Mishra et al.,, 2018). The initialization strategy (Ravi and Larochelle,, 2017; Finn et al.,, 2017) uses training episodes to infer a good initial value for the model’s parameters so that new tasks can be learned quickly by fine tuning. The optimization strategy (Andrychowicz et al.,, 2016; Ravi and Larochelle,, 2017; Wichrowska et al.,, 2017) forges an optimization algorithm that will find it easier to learn on novel related tasks.

A main contribution of this paper is a unified view of HO and ML within the natural mathematical framework of bilevel programming, where an outer optimization problem is solved subject to the optimality of an inner optimization problem. In HO the outer problem involves hyperparameters while the inner problem is usually the minimization of an empirical loss. In ML the outer problem could involve a shared representation among tasks while the inner problem could concern classifiers for individual tasks. Bilevel programming (Bard,, 2013) has been suggested before in machine learning in the context of kernel methods and support vector machines (Keerthi et al.,, 2007; Kunapuli et al.,, 2008), multitask learning (Flamary et al.,, 2014), and more recently HO (Pedregosa,, 2016), but never in the context of ML. The resulting framework outlined in Sec. 2 encompasses some existing approaches to ML, in particular those based on the initialization and the optimization strategies.

A technical difficulty arises when the solution to the inner problem cannot be written analytically (for example this happens when using the log-loss for training neural networks) and one needs to resort to iterative optimization approaches. As a second contribution, we provide in Sec. 3 sufficient conditions that guarantee good approximation properties. We observe that these conditions are reasonable and apply to concrete problems relevant to applications.

In Sec. 4, by taking inspiration on early work on representation learning in the context of multi-task and meta-learning (Baxter,, 1995; Caruana,, 1998), we instantiate the framework for ML in a simple way treating the weights of the last layer of a neural network as the inner variables and the remaining weights, which parametrize the representation mapping, as the outer variables. As shown in Sec. 5, the resulting ML algorithm performs well in practice, outperforming most of the existing strategies on MiniImagenet.

A bilevel optimization framework

In this paper, we consider bilevel optimization problems (see e.g. Colson et al.,, 2007) of the form

In the context of hyperparameter optimization, we are interested in minimizing the validation error of a model $g_{w}:\mathcal{X}\to\mathcal{Y}$ parameterized by a vector $w$ , with respect to a vector of hyperparameters $\lambda$ . For example, we may consider representation or regularization hyperparameters that control the hypothesis space or penalties, respectively. In this setting, a prototypical choice for the inner objective is the regularized empirical error

2 Meta-Learning

In meta-learning (ML) the inner and outer objectives are computed by averaging a training and a validation error over multiple tasks, respectively. The goal is to produce a learning algorithm that will work well on novel tasksThe ML problem is also related to multitask learning, however in ML the goal is to extrapolate from the given tasks.. For this purpose, we have available a meta-training set $\mathcal{D}=\{D^{j}=D^{j}_{\operatorname{tr}}\cup D^{j}_{\operatorname{val}}\}_{j=1}^{N}$ , which is a collection of datasets, sampled from a meta-distribution $\mathcal{P}$ . Each dataset $D^{j}=\{(x_{i}^{j},y_{i}^{j})\}_{i=1}^{n_{j}}$ with $(x_{i}^{j},y_{i}^{j})\in\mathcal{X}\times\mathcal{Y}^{j}$ is linked to a specific task. Note that the output space is task dependent (e.g. a multi-class classification problem with variable number of classes). The model for each task is a function $g_{w^{j},\lambda}:\mathcal{X}\to\mathcal{Y}^{j}$ , identified by a parameter vectors $w^{j}$ and hyperparameters $\lambda$ . A key point here is that $\lambda$ is shared between the tasks. With this notation the inner and outer objectives are

respectively. The loss $L_{j}(w_{j},\lambda,S)$ represents the empirical error of the pair $(w_{j},\lambda)$ on a set of examples $S$ . Note that the inner and outer losses for task $j$ use different train/validation splits of the corresponding dataset $D^{j}$ . Furthermore, unlike in HO, in ML the final goal is to find a good $\lambda$ and the $w^{j}$ are now instrumental.

The cartoon in Figure 1 illustrates ML as a bilevel problem. The parameter $\lambda$ indexes an hypothesis space within which the inner objective is minimized. A particular example, detailed in Sec. 4, is to choose the model $g_{w,\lambda}=\langle w,h_{\lambda}(x)\rangle$ , in which case $\lambda$ parameterizes a feature mapping. Yet another choice would be to consider $g_{w^{j},\lambda}(x)=\langle w+\lambda,x\rangle$ , in which case $\lambda$ represents a common model around which task specific models are to be found (see e.g. Evgeniou et al.,, 2005; Finn et al.,, 2017; Khosla et al.,, 2012; Kuzborskij et al.,, 2013, and reference therein).

3 Gradient-Based Approach

We now discuss a general approach to solve Problem (1)-(2) when the hyperparameter vector $\lambda$ is real-valued. To simplify our discussion let us assume that the inner objective has a unique minimizer $w_{\lambda}$ . Even in this simplified scenario, Problem (1)-(2) remains challenging to solve. Indeed, in general there is no closed form expression $w_{\lambda}$ , so it is not possible to directly optimize the outer objective function. While a possible strategy (implicit differentiation) is to apply the implicit function theorem to $\nabla L_{\lambda}=0$ (Pedregosa,, 2016; Koh and Liang,, 2017; Beirami et al.,, 2017), another compelling approach is to replace the inner problem with a dynamical system. This point, discussed in (Domke,, 2012; Maclaurin et al.,, 2015; Franceschi et al.,, 2017), is developed further in this paper.

Specifically, we let $[T]=\{1,\dots,T\}$ where $T$ is a prescribed positive integer and consider the following approximation of Problem (1)-(2)

where $E$ is a smooth scalar function, andIn general, the algorithm used to minimize the inner objective may involve auxiliary variables, e.g., velocities when using gradient descent with momentum, so $w$ should be intended as a larger vector containing both model parameters and auxiliary variables.

The approximation of the bilevel problem (1)-(2) by the procedure (5)-(6) raises the issue of the quality of this approximation and we return to this issue in the next section. However, it also suggests to consider the inner dynamics as a form of approximate empirical error minimization (e.g. early stopping) which is valid in its own right. From this perspective – conversely to the implicit differentiation strategy – it is possible to include among the components of $\lambda$ variables which are associated with the optimization algorithm itself. For example, $\lambda$ may include the step sizes or momentum factors if the dynamics $\Phi_{t}$ in Eq. (6) is gradient descent with momentum; in (Andrychowicz et al.,, 2016; Wichrowska et al.,, 2017) the mapping $\Phi_{t}$ is implemented as a recurrent neural network, while (Finn et al.,, 2017) focus on the initialization mapping by letting $\Phi_{0}(\lambda)=\lambda$ .

A major advantage of this reformulation is that it makes it possible to compute efficiently the gradient of $f_{T}$ , which we call hypergradient, either in time or in memory (Maclaurin et al.,, 2015; Franceschi et al.,, 2017), by making use of reverse or forward mode algorithmic differentiation (Griewank and Walther,, 2008; Baydin et al.,, 2017). This allows us to optimize a number of hyperparameters of the same order of that of parameters, a situation which arise in ML.

Exact and Approximate Bilevel Programming

In this section, we provide results about the existence of solutions of Problem (1)-(2) and the approximation properties of Procedure (5)-(6) with respect to the original bilevel problem. Proofs of these results are provided in the supplementary material.

Procedure (5)-(6), though related to the bilevel problem (1)-(2), may not be, in general, a good approximation of it. Indeed, making the assumptions (which sound perfectly reasonable) that, for every $\lambda\in\Lambda$ , $w_{T,\lambda}\to w_{\lambda}$ for some $w_{\lambda}\in\arg\max L_{\lambda}$ , and that $E(\cdot,\lambda)$ is continuous, one can only assert that $\lim_{T\to\infty}f_{T}(\lambda)=E(w_{\lambda},\lambda)\geq f(\lambda)$ . This is because the optimization dynamics converge to some minimizer of the inner objective $L_{\lambda}$ , but not necessarily to the one that also minimizes the function $E$ . This is illustrated in Figure 2. The situation is, however, different if the inner problem admits a unique minimizer for every $\lambda\in\Lambda$ . Indeed in this case, it is possible to show that the set of minimizers of the approximate problems converge, as $T\to+\infty$ and in an appropriate sense, to the set of minimizers of the bilevel problem. More precisely, we make the following assumptions:

the map $(w,\lambda)\mapsto L_{\lambda}(w)$ is jointly continuous and such that $\arg\min L_{\lambda}$ is a singleton for every $\lambda\in\Lambda$ ;

$w_{\lambda}=\arg\min L_{\lambda}$ remains bounded as $\lambda$ varies in $\Lambda$ .

Under the above assumptions, in the following we give results about the existence of solutions of problem (7) and the (variational) convergence of the approximate problems (5)-(6) towards problem (7) — relating the minima as well as the set of minimizers. In this respect we note that, since both $f$ and $f_{T}$ are nonconvex, $\operatorname{argmin}f_{T}$ and $\operatorname{argmin}f$ are in general nonsingleton, so an appropriate definition of set convergence is required.

Under Assumptions (i)-(iv) problem (7) admits solutions.

The result below follows from general facts on the stability of minimizers in optimization problems (Dontchev and Zolezzi,, 1993).

In addition to Assumptions (i)-(iv), suppose that:

$E(\cdot,\lambda)$ is uniformly Lipschitz continuous;

We stress that assumptions (i)-(vi) are very natural and satisfied by many problems of practical interests. Thus, the above results provide full theoretical justification to the proposed approximate procedure (5)-(6). The following remark discusses assumption (vi), while the subsequent example will be relevant to the experiments in Sec. 5.

If $L_{\lambda}$ is strongly convex, then many gradient-based algorithms (e.g., standard and accelerated gradient descent) yield linear convergence of the iterates $w_{T,\lambda}$ ’s. Moreover, in such cases, the rate of linear convergence is of type $(\nu_{\lambda}-\mu_{\lambda})/(\nu_{\lambda}+\mu_{\lambda})$ , where $\nu_{\lambda}$ and $\mu_{\lambda}$ are the Lipschitz constant of the gradient and the modulus of strong convexity of $L_{\lambda}$ respectively. So, this rate can be uniformly bounded from above by $\rho\in\left]0,1\right[$ , provided that $\sup_{\lambda\in\Lambda}\nu_{\lambda}<+\infty$ and $\inf_{\lambda\in\Lambda}\mu_{\lambda}>0$ . Thus, in these cases $w_{T,\lambda}$ converges uniformly to $w_{\lambda}$ on $\Lambda$ (at a linear rate).

Let us consider the following form of the inner objective:

Learning Hyper-Representations

In this section, we instantiate the bilevel programming approach for ML outlined in Sec. 2.2 in the case of deep learning where representation layers are shared across episodes. Finding good data representations is a centerpiece in machine learning. Classical approaches (Baxter,, 1995; Caruana,, 1998) learn both the weights of the representation mapping and those of the ground classifiers jointly on the same data. Here we follow the bilevel approach and split each dataset/episode in training and validation sets.

Starting from an initial value, the weights of the task-specific models are learned by $T$ iterations of gradient descent. The gradient of $f_{T}$ can be computed efficiently in time by making use of an extended reverse-hypergradient procedure (Franceschi et al.,, 2017) which we present in Algorithm 1. Since, in general, the number of episodes in a meta-training set is large, we compute a stochastic approximation of the gradient of $f_{T}$ by sampling a mini-batch of episodes. At test time, given a new episode $\bar{D}$ , the representation $h$ is kept fixed, and all the examples in $\bar{D}$ are used to tune the weights $\bar{w}$ of the episode-specific model $\bar{g}$ .

Like other initialization and optimization strategies for ML, our method does not require lookups in a support set as the memorization and metric strategies do (Santoro et al.,, 2016; Vinyals et al.,, 2016; Mishra et al.,, 2018). Unlike (Andrychowicz et al.,, 2016; Ravi and Larochelle,, 2017) we do not tune the optimization algorithm, which in our case is plain empirical loss minimization by gradient descent, and rather focus on the hypothesis space. Unlike (Finn et al.,, 2017), that aims at maximizing sensitivity of new task losses to the model parameters, we aim at maximizing the generalization to novel examples during training episodes, with respect to $\lambda$ . Our assumptions about the structure of the model are slightly stronger than in (Finn et al.,, 2017) but still mild, namely that some (hyper)parameters define the representation and the remaining parameters define the classification function. In (Munkhdalai and Yu,, 2017) the meta-knowledge is distributed among fast and slow weights and an external memory; our approach is more direct, since the meta-knowledge is solely distilled by $\lambda$ . A further advantage of our method is that, if the episode-specific models are linear (e.g. logistic regressors) and each loss $L^{j}$ is strongly convex in $w$ , the theoretical guarantees of Theorem 3.2 apply (see Remark 3.3). These assumptions are satisfied in the experiments reported in the next section.

Experiments

The aim of the following experiments is threefold. First, we investigate the impact of the number of iterations of the optimization dynamics on the quality of the solution on a simple multiclass classification problem. Second, we test our hyper-representation method in the context of few-shot learning on two benchmark datasets. Finally, we constrast the bilevel ML approach against classical approaches to learn shared representations The code for reproducing the experiments, based on the package Far-HO (https://bit.ly/far-ho), is available at https://bit.ly/hyper-repr.

Motivated by the theoretical findings of Sec. 3, we empirically investigate how solving the inner problem approximately (i.e. using small $T$ ) affects convergence, generalization performances, and running time. We focus in particular on the linear feature map described in Example 3.4, which allows us to compare the approximated solution against the closed-form analytical solution given by

In this setting, the bilevel problem reduces to a (non-convex) optimization problem in $H$ .

For the approximate problems we compute the hypergradient using Algorithm 1, where it is intended that $\mathcal{B}=\{(D_{\operatorname{tr}},D_{\operatorname{val}})\}$ . Figure 3 shows the values of functions $f$ and $f_{T}$ (see Eqs. (1) and (5), respectively) during the optimization of $H$ . As $T$ increases, the solution of the approximate problem approaches the true bilevel solution. However, performing a small number of gradient descent steps for solving the inner problem acts as implicit regularizer. As it is evident from Figure 4, the generalization error is better when $T$ is smaller than the value yielding the best approximation of the inner solution.

This is to be expected since, in this setting, the dimensions of parameters and hyperparameters are of the same order, leading to a concrete possibility of overfitting the outer objective (validation error). An appropriate, problem dependent, choice of $T$ may help avoiding this issue (see also Appendix C). As $T$ increases, the number of hyperiterations required to reach the maximum test accuracy decreases, further suggesting that there is an interplay between the number of iterations used to solve the inner and the outer objective. Finally, the running time of Algorithm 1, is linear in $T$ and the size of $w$ and independent of the size of $H$ (see also Table 2), making it even more appealing to reduce the number of iterations.

2 Few-shot Learning

We new turn our attention to learning-to-learn, precisely to few-shot supervised learning, implementing the ML strategy outlined in Sec. 4 on two different benchmark datasets:

$\bullet$ Omniglot (Lake et al.,, 2015), a dataset that contains examples of 1623 different handwritten characters from 50 alphabets. We downsample the images to $28\times 28$ .

$\bullet$ MiniImagenet (Vinyals et al.,, 2016), a subset of ImageNet (Deng et al.,, 2009), that contains 60000 downsampled images from 100 different classes.

Following the experimental protocol used in a number of recent works, we build a meta-training set $\mathcal{D}$ , from which we sample datasets to solve Problem (9)-(10), a meta-validation set $\mathcal{V}$ for tuning ML hyperparameters, and finally a meta-test set $\mathcal{T}$ which is used to estimate accuracy. Operationally, each meta-dataset consists of a pool of samples belonging to different (non-overlapping between separate meta-dataset) classes, which can be combined to form ground classification datasets $D^{j}=D^{j}_{\operatorname{tr}}\cup D^{j}_{\operatorname{val}}$ with 5 or 20 classes (for Omniglot). The $D^{j}_{\operatorname{tr}}$ ’s contain 1 or 5 examples per class which are used to fit $w^{j}$ (see Eq. 10). The $D^{j}_{\operatorname{val}}$ ’s, containing 15 examples per class, is used either to compute $f_{T}(\lambda)$ (see Eq. (9)) and its (stochastic) gradient if $D^{j}\in\mathcal{D}$ or to provide a generalization score if $D^{j}$ comes from either $\mathcal{V}$ or $\mathcal{T}$ . For MiniImagenet we use the same split and images proposed in (Ravi and Larochelle,, 2017), while for Omniglot we use the protocol defined by (Santoro et al.,, 2016).

Regarding the specific implementation of the representation mapping $h$ , we employ for Omniglot a four-layers convolutional neural network with strided convolutions and 64 filters per layer as in (Vinyals et al.,, 2016) and other successive works. For MiniImagenet we tried two different architectures:

$\bullet$ C4L, a four-layers convolutional neural network with max-pooling and 32 filters per layer;

$\bullet$ RN: a residual network (He et al.,, 2016) built of four residual blocks followed by two convolutional layers.

The first network architecture has been proposed in (Ravi and Larochelle,, 2017) and then used in (Finn et al.,, 2017), while a similar residual network architecture has been employed in a more recent work (Mishra et al.,, 2018). Further details on the architectures of $h$ , as well as other ML hyperparameters, are specified in the supplementary material. We report our results, using RN for MiniImagenet, in Table 3, alongside scores from various recently proposed methods for comparison.

The proposed method achieves competitive results highlighting the relative importance of learning a task independent representation, on the top of which logistic classifiers trained with very few samples generalize well. Moreover, utilizing more expressive models such as residual network as representation mappings, is beneficial for our proposed strategy and, unlike other methods, does not result in overfitting of the outer objective, as reported in (Mishra et al.,, 2018). Indeed, compared to C4L, RN achieves a relative improvement of 6.5% on one-shot and 4.2% on five-shot. Figure 5 provides a visual example of the goodness of the learned representation, showing that MiniImagenet examples (the first from meta-training, the second from the meta-testing sets) from similar classes (different dog breeds) are mapped near each other by $h$ and, conversely, samples from dissimilar classes are mapped afar.

3 On Variants of Representation Learning Methods

In this section, we show the benefits of learning a representation within the proposed bilevel framework compared to other possible approaches that involve an explicit factorization of a classifier as $g^{j}\circ h$ . The representation mapping $h$ is either pretrained or learned with different meta-learning algorithms. We focus on the problem of one-shot learning on MiniImagenet and we use C4L as architecture for the representation mapping. In all the experiments the ground models $g^{j}$ are multinomial logistic regressor as in Sec. 5.2, tuned with 5 steps of gradient descent. We ran the following experiments:

$\bullet$ Bilevel-train: we use a bilevel approach but, unlike in Sec. 4, we optimize the parameter vector $\lambda$ of the representation mapping by minimizing the loss on the training sets of each episode. The hypergradient is still computed with Algorithm 1, albeit we set $D^{j}_{\operatorname{val}}=D^{j}_{\operatorname{tr}}$ for each training episodes;

$\bullet$ Approx and Approx-train: we consider an approximation of the hypergradient $\nabla f_{T}(\lambda)$ by disregarding the optimization dynamics of the inner objectives (i.e. we set $\nabla_{\lambda}w^{j}_{T}=0$ ). In Approx-train we just use the training sets;

$\bullet$ Classic: as in (Baxter,, 1995), we learn $h$ by jointly optimize $\hat{f}(\lambda,w^{1},\dots,w^{N})=\sum_{j=1}^{N}L^{j}(w^{j},\lambda,D^{j}_{\operatorname{tr}})$ and treat the problem as standard multitask learning, with the exception that we evaluate $\hat{f}$ on mini-batches of 4 episodes, randomly sampled every 5 gradient descent iterations.

In settings where we do not use the validation sets, we let the training sets of each episode contain 16 examples per class. Using training episodes with just one example per class resulted in performances just above random chance. While the first experiment constitutes a standard baseline, the others have the specific aim of assessing ( $i$ ) the importance of splitting episodes of meta-training set into training and validation and ( $ii$ ) the importance of computing the hypergradient of the approximate bilevel problem with Algorithm 1. The results reported in Table 4 suggest that both the training/validation splitting and the full computation of the hypergradient constitute key factors for learning a good representation in a meta-learning context.

On the other side, using pretrained representations, especially in a low-dimensional space, turns out to be a rather effective baseline. One possible explanation is that, in this context, some classes in the training and testing meta-datasets are rather similar (e.g. various dog breeds) and thus ground classifiers can leverage on very specific representations.

Conclusions

We have shown that both HO and ML can be formulated in terms of bilevel programming and solved with an iterative approach. When the inner problem has a unique solution (e.g. is strongly convex), our theoretical results show that the iterative approach has convergence guarantees, a result that is interesting in its own right. In the case of ML, by adapting classical strategies (Baxter,, 1995) to the bilevel framework with training/validation splitting, we present a method for learning hyper-representations which is experimentally effective and supported by our theoretical guarantees.

Our framework encompasses recently proposed methods for meta-learning, such as learning to optimize, but also suggests different design patterns for the inner learning algorithm which could be interesting to explore in future work. The resulting inner problems may not satisfy the assumptions of our convergence analysis, raising the need for further theoretical investigations. An additional future direction of research is the study of the statistical properties of bilevel strategies where outer objectives are based on the generalization ability of the inner model to new (validation) data. Ideas from (Maurer et al.,, 2016; Denevi et al.,, 2018) may be useful in this direction.

References

Appendix A Proofs of the Results in Sec. 3

We recal a fundamental fact concerning the stability of minima and minimizers in optimization problems (Dontchev and Zolezzi,, 1993). We provide the proof for completeness.

Let $\varphi_{T}$ and $\varphi$ be lower semicontinuous functions defined on a compact set $\Lambda$ . Suppose that $\varphi_{T}$ converges uniformly to $\varphi$ on $\Lambda$ as $T\to+\infty$ . Then

Therefore, using also the continuity of $\varphi$ , we have

It follows from assumption (vi) that $f_{T}(\lambda)$ converges to $f(\lambda)$ uniformly on $\Lambda$ as $T\to+\infty$ . Then the statement follows from Theorem A.1 ∎

Appendix B Cross-validation and Bilevel Programming

We note that the (approximate) bilevel programming framework easily accommodates also estimations of the generalization error generated by a cross-validation procedures. We describe here the case of $K$ -fold cross-validation, which includes also leave-one-out cross validation.

Let $D=\{(x_{i},y_{i})\}_{i=1}^{n}$ be the set of available data; $K$ -fold cross validation, with $K\in\{1,2,\dots,N\}$ consists in partitioning $D$ in $K$ subsets $\{D^{j}\}_{j=1}^{K}$ and fit as many models $g_{w^{j}}$ on training data $D^{j}_{\operatorname{tr}}=\bigcup_{i\neq j}D^{i}$ . The models are then evaluated on $D^{j}_{\operatorname{val}}=D^{j}$ . Denoting by $w=(w^{j})_{j=1}^{K}$ the vector of stacked weights, the $K$ -fold cross validation error is given by

By following the procedure outlined in Sec. 2.3 we can approximate the minimization of $L_{\lambda}$ with $T$ steps of an optimization dynamics and compute the hypergradient of $f_{T}(\lambda)=\frac{1}{K}\sum_{j}E^{j}(w^{j}_{T},\lambda)$ by training the $K$ models and proceed with either forward or reverse differentiation. The models may be fitted sequentially, in parallel or stochastically. Specifically, in this last case, one can repeatedly sample one fold at a time (or possibly a mini-batch of folds) and compute a stochastic hypergradient that can be used in a SGD procedure in order to minimize $f_{T}$ . At last, we note that similar ideas for leave-one out cross-validation error are developed in (Beirami et al.,, 2017), where the hypergradient of an approximation of the outer objective is computed by means of the implicit function theorem.

Appendix C The Effect of T𝑇T: Ridge Regression

In Sec. 5.1 we showed that increasing the number of iterations $T$ leads to a better optimization of the outer objective $f$ through the approximations $f_{T}$ , which converge uniformly to $f$ by Proposition 3.2. This, however, does not necessary results in better test scores due to possible overfitting of the training and validation errors, as we noted for the linear hyper-representation multiclass classification experiment in Sec. 5.1. The optimal choice of $T$ appears to be, indeed, problem dependent: if, in the aforementioned experiment, a quite small $T$ led to the best test accuracy (after hyperparameter optimization), in this section we present a small scale linear regression experiment that benefits from an increasing number of inner iterations.

and optimized the $L^{2}$ vector of regularization coefficients $\lambda$ (equivalent to a diagonal Tikhonov regularization matrix). The results, reported in Table 5, show that in this scenario overfitting is not an issue and 250 inner iterations yield the best test result. Note that, as in Sec. 5.1, also this problem admits an analytical solution, from which it is possible to compute the hypergradient analytically and perform exact hyperparameter optimization, as reported in the last row of the Table 5.

Appendix D Further Details on Few-shot Experiments

This appendix contains implementation details of the representation mapping $h$ and the meta-learning hyperparameters used in the few-shot learning experiments of Sec. 5.2.

To optimize the representation mapping, in all the experiments, we use Adam with learning rate set to $10^{-3}$ and a decay-rate of $10^{-5}$ . We used the initialization strategy proposed in (Glorot and Bengio,, 2010) for all the weights $\lambda$ of the representation.

For Omniglot experiments we used a meta-batch size of 32 episodes for five-way and of 16 episodes for 20-way. To train the episode-specific classifiers we set the learning rate to 0.1.

For one set of experiments with Mini-imagenet we used an hyper-representation (C4L) consisting of 4 convolutional layers where each layer is composed by a convolution with 32 filters, a batch normalization followed by a ReLU activation and a 2x2 max-pooling. The classifiers were trained using mini-batches of 4 episodes for one-shot and 2 episodes for five-shot with learning rate set to 0.01.

The other set of experiments with Mini-imagenet employed a Residual Network (RN) as the representation mapping, built of 4 residual blocks (with 64, 96, 128, 256 filters) and then the block that follows $\{1\times 1$ conv (2048 filters), avg pooling, $1\times 1$ conv (512 filters) $\}$ . Each residual block repeats the following block 3 times $\{1\times 1$ conv, batch normalization, leaky ReLU (leak 0.1) $\}$ before the residual connection. In this case the classifiers were optimized using mini-batches of 2 episodes for both one and five-shot with learning rate set to 0.04.