Neural Architecture Search with Bayesian Optimisation and Optimal Transport

Kirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnabas Poczos, Eric Xing

Introduction

In many real world problems, we are required to sequentially evaluate a noisy black-box function $f$ with the goal of finding its optimum in some domain $\mathcal{X}$ . Typically, each evaluation is expensive in such applications, and we need to keep the number of evaluations to a minimum. Bayesian optimisation (BO) refers to an approach for global optimisation that is popularly used in such settings. It uses Bayesian models for $f$ to infer function values at unexplored regions and guide the selection of points for future evaluations. BO has been successfully applied for many optimisation problems in optimal policy search, industrial design, and scientific experimentation. That said, the quintessential use case for BO in machine learning is model selection . For instance, consider selecting the regularisation parameter $\lambda$ and kernel bandwidth $h$ for an SVM. We can set this up as a zeroth order optimisation problem where our domain is a two dimensional space of $(\lambda,h)$ values, and each function evaluation trains the SVM on a training set, and computes the accuracy on a validation set. The goal is to find the model, i.e. hyper-parameters, with the highest validation accuracy.

The majority of the BO literature has focused on settings where the domain $\mathcal{X}$ is either Euclidean or categorical. This suffices for many tasks, such as the SVM example above. However, with recent successes in deep learning, neural networks are increasingly becoming the method of choice for many machine learning applications. A number of recent work have designed novel neural network architectures to significantly outperform the previous state of the art . This motivates studying model selection over the space of neural architectures to optimise for generalisation performance. A critical challenge in this endeavour is that evaluating a network via train and validation procedures is very expensive. This paper proposes a BO framework for this problem.

These two steps are fairly straightforward in conventional domains. For example, in Euclidean spaces, we can use one of many popular kernels such as Gaussian, Laplacian, or Matérn; we can maximise $\varphi_{t}$ via off the shelf branch-and-bound or gradient based methods. However, when each $x\in\mathcal{X}$ is a neural network architecture, this is not the case. Hence, our challenges in this work are two-fold. First, we need to quantify (dis)similarity between two networks. Intuitively, in Fig. 1, network 1a is more similar to network 1b, than it is to 1c. Secondly, we need to be able to traverse the space of such networks to optimise the acquisition function. Our main contributions are as follows.

We develop a (pseudo-)distance for neural network architectures called OTMANN (Optimal Transport Metrics for Architectures of Neural Networks) that can be computed efficiently via an optimal transport program.

We develop a BO framework for optimising functions on neural network architectures called NASBOT (Neural Architecture Search with Bayesian Optimisation and Optimal Transport). This includes an evolutionary algorithm to optimise the acquisition function.

Empirically, we demonstrate that NASBOT outperforms other baselines on model selection tasks for multi-layer perceptrons (MLP) and convolutional neural networks (CNN). Our python implementations of OTMANN and NASBOT are available at github.com/kirthevasank/nasbot.

Related Work: Recently, there has been a surge of interest in methods for neural architecture search . We discuss them in detail in the Appendix due to space constraints. Broadly, they fall into two categories, based on either evolutionary algorithms (EA) or reinforcement learning (RL). EA provide a simple mechanism to explore the space of architectures by making a sequence of changes to networks that have already been evaluated. However, as we will discuss later, they are not ideally suited for optimising functions that are expensive to evaluate. While RL methods have seen recent success, architecture search is in essence an optimisation problem – find the network with the lowest validation error. There is no explicit need to maintain a notion of state and solve credit assignment . Since RL is a fundamentally more difficult problem than optimisation , these approaches need to try a very large number of architectures to find the optimum. This is not desirable, especially in computationally constrained settings.

None of the above methods have been designed with a focus on the expense of evaluating a neural network, with an emphasis on being judicious in selecting which architecture to try next. Bayesian optimisation (BO) uses introspective Bayesian models to carefully determine future evaluations and is well suited for expensive evaluations. BO usually consumes more computation to determine future points than other methods, but this pays dividends when the evaluations are very expensive. While there has been some work on BO for architecture search , they have only been applied to optimise feed forward structures, e.g. Fig. 1a, but not Figs. 1b, 1c. We compare NASBOT to one such method and demonstrate that feed forward structures are inadequate for many problems.

Set Up

2 A Mathematical Formalism for Neural Networks

The directed edges are to be interpreted as follows. The output of each layer is fed to each of its children; so both layers 2 and 3 in Fig. 1b take the output of layer $1$ as input. When a layer has multiple parents, the inputs are concatenated; so layer 5 sees an input of $16+16$ filtered channels coming in from layers $3$ and $4$ . Finally, we mention that neural networks are also characterised by the values of the weights/parameters between layers. In architecture search, we typically do not consider these weights. Instead, an algorithm will (somewhat ideally) assume access to an optimisation oracle that can minimise the loss function on the training set and find the optimal weights.

The OTMANN Distance

To motivate this distance, note that the performance of a neural network is determined by the amount of computation at each layer, the types of these operations, and how the layers are connected. A meaningful distance should account for these factors. To that end, OTMANN is defined as the minimum of a matching scheme which attempts to match the computation at the layers of one network to the layers of the other. We incur penalties for matching layers with different types of operations or those at structurally different positions. We will find a matching that minimises these penalties, and the total penalty at the minimum will give rise to a distance. We first describe two concepts, layer masses and path lengths, which we will use to define OTMANN.

Path lengths from/to $u_{\textrm{ip}}$ / $u_{\textrm{op}}$ : In a neural network $\mathcal{G}$ , a path from $u$ to $v$ is a sequence of layers $u_{1},\dots,u_{s}$ where $u_{1}=u$ , $u_{s}=v$ and $(u_{i},u_{i+1})\in\mathcal{E}$ for all $i\leq s-1$ . The length of this path is the number of hops from one node to another in order to get from $u$ to $v$ . For example, in Fig. 1c, $(2,5,8,13)$ is a path from layer $2$ to $13$ of length $3$ . Let the shortest (longest) path length from $u$ to $v$ be the smallest (largest) number of hops from one node to another among all paths from $u$ to $v$ . Additionally, define the random walk path length as the expected number of hops to get from $u$ to $v$ , if, from any layer we hop to one of its children chosen uniformly at random. For example, in Fig. 1c, the shortest, longest and random walk path lengths from layer $1$ to layer $14$ are 5, 7, and 5.67 respectively. For any $u\in\mathcal{L}$ , let $\delta^{\textrm{sp}}_{\textrm{op}}(u),\delta^{\textrm{lp}}_{\textrm{op}}(u),\delta^{\textrm{rw}}_{\textrm{op}}(u)$ denote the length of the shortest, longest and random walk paths from $u$ to the output $u_{\textrm{op}}$ . Similarly, let $\delta^{\textrm{sp}}_{\textrm{ip}}(u),\delta^{\textrm{lp}}_{\textrm{ip}}(u),\delta^{\textrm{rw}}_{\textrm{ip}}(u)$ denote the corresponding lengths for walks from the input $u_{\textrm{ip}}$ to $u$ . As the layers of a neural network can be topologically orderedA topological ordering is an ordering of the layers $u_{1},\dots,u_{{|\mathcal{L}|}}$ such that $u$ comes before $v$ if $(u,v)\in\mathcal{E}$ . , the above path lengths are well defined and finite. Further, for any $s\in\{\textrm{sp,lp,rw}\}$ and $t\in\{\textrm{ip,op}\}$ , $\delta_{t}^{s}(u)$ can be computed for all $u\in\mathcal{L}$ , in $\mathcal{O}(|\mathcal{E}|)$ time (see Appendix A.3 for details).

The label mismatch term $\phi_{\textrm{lmm}}$ , penalises matching masses that have different labels, while the structural term $\phi_{\textrm{str}}$ penalises matching masses at structurally different positions with respect to each other. If we choose not to match any mass in either network, we incur a non-assignment penalty $\phi_{\textrm{nas}}$ . $\nu_{\textrm{str}}>0$ determines the trade-off between the structural and other terms. The inequality constraints ensure that we do not over assign the masses in a layer. We now describe $\phi_{\textrm{lmm}},\phi_{\textrm{nas}},$ and $\phi_{\textrm{str}}$ .

This completes the description of our matching program. In Appendix A, we prove that (3) can be formulated as an Optimal Transport (OT) program . OT is a well studied problem with several efficient solvers . Our theorem below, shows that the solution of (3) is a distance.

Let $d(\mathcal{G}_{1},\mathcal{G}_{2})$ be the solution of (3) for networks $\mathcal{G}_{1},\mathcal{G}_{2}$ . Under mild regularity conditions on $M$ , $d(\cdot,\cdot)$ is a pseudo-distance. That is, for all networks $\mathcal{G}_{1},\mathcal{G}_{2},\mathcal{G}_{3}$ , it satisfies, $d(\mathcal{G}_{1},\mathcal{G}_{2})\geq 0$ , $d(\mathcal{G}_{1},\mathcal{G}_{2})=d(\mathcal{G}_{2},\mathcal{G}_{1})$ , $d(\mathcal{G}_{1},\mathcal{G}_{1})=0$ and $d(\mathcal{G}_{1},\mathcal{G}_{3})\leq d(\mathcal{G}_{1},\mathcal{G}_{2})+d(\mathcal{G}_{2},\mathcal{G}_{3})$ .

We conclude this section with a couple of remarks. First, OTMANN shares similarities with Wasserstein (earth mover’s) distances which also have an OT formulation. However, it is not a Wasserstein distance itself—in particular, the supports of the masses and the cost matrices change depending on the two networks being compared. Second, while there has been prior work for defining various distances and kernels on graphs, we cannot use them in BO because neural networks have additional complex properties in addition to graphical structure, such as the type of operations performed at each layer, the number of neurons, etc. The above work either define the distance/kernel between vertices or assume the same vertex (layer) set , none of which apply in our setting. While some methods do allow different vertex sets , they cannot handle layer masses and layer similarities. Moreover, the computation of the above distances are more expensive than OTMANN. Hence, these methods cannot be directly plugged into BO framework for architecture search.

In Appendix A, we provide additional material on OTMANN. This includes the proof of Theorem 1, a discussion on some design choices, and implementation details such as the computation of the path lengths. Moreover, we provide illustrations to demonstrate that OTMANN is a meaningful distance for architecture search. For example, a t-SNE embedding places similar architectures close to each other. Further, scatter plots showing the validation error vs distance on real datasets demonstrate that networks with small distance tend to perform similarly on the problem.

NASBOT

We now describe NASBOT, our BO algorithm for neural architecture search. Recall that in order to realise the BO scheme outlined in Section 2.1, we need to specify (a) a kernel $\kappa$ for neural architectures and (b) a method to optimise the acquisition $\varphi_{t}$ over these architectures. Due to space constraints, we will only describe the key ideas and defer all details to Appendix B.

As described previously, we will use a negative exponentiated distance for $\kappa$ . Precisely, $\kappa=\alpha e^{-\beta d}+\bar{\alpha}d^{-\bar{\beta}\bar{d}}$ , where $d,\bar{d}$ are the OTMANN distance and its normalised version. We mention that while this has the form of popular kernels, we do not know yet if it is in fact a kernel. In our experiments, we did not encounter an instance where the eigenvalues of the kernel matrix were negative. In any case, there are several methods to circumvent this issue in kernel methods .

We use an evolutionary algorithm (EA) approach to optimise the acquisition function (2). For this, we begin with an initial pool of networks and evaluate the acquisition $\varphi_{t}$ on those networks. Then we generate a set of $N_{\textrm{mut}}$ mutations of this pool as follows. First, we stochastically select $N_{\textrm{mut}}$ candidates from the set of networks already evaluated such that those with higher $\varphi_{t}$ values are more likely to be selected than those with lower values. Then we modify each candidate, to produce a new architecture. These modifications, described in Table 2, might change the architecture either by increasing or decreasing the number of computational units in a layer, by adding or deleting layers, or by changing the connectivity of existing layers. Finally, we evaluate the acquisition on this $N_{\textrm{mut}}$ mutations, add it to the initial pool, and repeat for the prescribed number of steps. While EA works fine for cheap functions, such as the acquisition $\varphi_{t}$ which is analytically available, it is not suitable when evaluations are expensive, such as training a neural network. This is because EA selects points for future evaluations that are already close to points that have been evaluated, and is hence inefficient at exploring the space. In our experiments, we compare NASBOT to the same EA scheme used to optimise the acquisition and demonstrate the former outperforms the latter.

We conclude this section by observing that this framework for NASBOT/OTMANN has additional flexibility to what has been described. If one wishes to tune over drop-out probabilities, regularisation penalties and batch normalisation at each layer, they can be treated as part of the layer label, via an augmented label penalty matrix $M$ which accounts for these considerations. If one wishes to jointly tune other scalar hyper-parameters (e.g. learning rate), they can use an existing kernel for euclidean spaces and define the GP over the joint architecture + hyper-parameter space via a product kernel. BO methods for early stopping in iterative training procedures can be easily incorporated by defining a fidelity space. Using a line of work in scalable GPs , one can apply our methods to challenging problems which might require trying a very large number ( $\sim$ 100K) of architectures. These extensions will enable deploying NASBOT in large scale settings, but are tangential to our goal of introducing a BO method for architecture search.

Experiments

Methods: We compare NASBOT to the following baselines. RAND: random search; EA (Evolutionary algorithm): the same EA procedure described above. TreeBO : a BO method which only searches over feed forward structures. Random search is a natural baseline to compare optimisation methods. However, unlike in Euclidean spaces, there is no natural way to randomly explore the space of architectures. Our RAND implementation, operates in exactly the same way as NASBOT, except that the EA procedure is fed a random sample from ${\bf{\rm Unif}}(0,1)$ instead of the GP acquisition each time it evaluates an architecture. Hence, RAND is effectively picking a random network from the same space explored by NASBOT; neither method has an unfair advantage because it considers a different space. While there are other methods for architecture search, their implementations are highly nontrivial and are not made available.

Datasets: We use the following datasets: blog feedback , indoor location , slice localisation , naval propulsion , protein tertiary structure , news popularity , Cifar10 . The first six are regression problems for which we use MLPs. The last is a classification task on images for which we use CNNs. Table 3 gives the size and dimensionality of each dataset. For the first $6$ datasets, we use a $0.6-0.2-0.2$ train-validation-test split and normalised the input and output to have zero mean and unit variance. Hence, a constant predictor will have a mean squared error of approximately $1$ . For Cifar10 we use $40K$ for training and $10K$ each for validation and testing.

Experimental Set up: Each method is executed in an asynchronously parallel set up of 2-4 GPUs, That is, it can evaluate multiple models in parallel, with each model on a single GPU. When the evaluation of one model finishes, the methods can incorporate the result and immediately re-deploy the next job without waiting for the others to finish. For the blog, indoor, slice, naval and protein datasets we use 2 GeForce GTX 970 (4GB) GPUs and a computational budget of 8 hours for each method. For the news popularity dataset we use $4$ GeForce GTX 980 (6GB) GPUs with a budget of 6 hours and for Cifar10 we use 4 K80 (12GB) GPUs with a budget of 10 hours. For the regression datasets, we train each model with stochastic gradient descent (SGD) with a fixed step size of $10^{-5}$ , a batch size of 256 for 20K batch iterations. For Cifar10, we start with a step size of $10^{-2}$ , and reduce it gradually. We train in batches of 32 images for 60K batch iterations. The methods evaluate between $70$ - $120$ networks depending on the size of the networks chosen and the number of GPUs.

Results: Fig. 2 plots the best validation score for each method against time. In Table 3, we present the results on the test set with the best model chosen on the basis of validation set performance. On the Cifar10 dataset, we also trained the best models for longer ( $150K$ iterations). These results are in the last column of Table 3. We see that NASBOT is the most consistent of all methods. The average time taken by NASBOT to determine the next architecture to evaluate was $46.13$ s. For RAND, EA, and TreeBO this was $26.43$ s, $0.19$ s, and $7.83$ s respectively. The time taken to train and validate models was on the order of 10-40 minutes depending on the model size. Fig. 2 includes this time taken to determine the next point. Like many BO algorithms, while NASBOT’s selection criterion is time consuming, it pays off when evaluations are expensive. In Appendices B and C, we provide additional details on the experiment set up and conduct synthetic ablation studies by holding out different components of the NASBOT framework. We also illustrate some of the best architectures found—on many datasets, common features were long skip connections and multiple decision layers.

Finally, we note that while our Cifar10 experiments fall short of the current state of the art , the amount of computation in these work is several orders of magnitude more than ours (both the computation invested to train a single model and the number of models trained). Further, they use constrained spaces specialised for CNNs, while NASBOT is deployed in a very general model space. We believe that our results can also be improved by employing enhanced training techniques such as image whitening, image flipping, drop out, etc. For example, using our training procedure on the VGG-19 architecture yielded a test set error of $0.1018$ after $150K$ iterations. However, VGG-19 is known to do significantly better on Cifar10. That said, we believe our results are encouraging and lay out the premise for BO for neural architectures.

Conclusion

We described NASBOT, a BO framework for neural architecture search. NASBOT finds better architectures for MLPs and CNNs more efficiently than other baselines on several datasets. A key contribution of this work is the efficiently computable OTMANN distance for neural network architectures, which may be of independent interest as it might find applications outside of BO. Our code for NASBOT and OTMANN will be made available.

We would like to thank Guru Guruganesh and Dougal Sutherland for the insightful discussions. This research is partly funded by DOE grant DESC0011114, NSF grant IIS1563887, and the Darpa D3M program. KK is supported by a Facebook fellowship and a Siebel scholarship.

References

Appendix A Additional Details on OTMANN

is called an optimal transport program. One interpretation of this set up is that $y_{1}$ denotes the supplies at $n_{1}$ warehouses, $y_{2}$ denotes the demands at $n_{2}$ retail stores, $C_{ij}$ denotes the cost of transporting a unit mass of supplies from warehouse $i$ to store $j$ and $Z_{ij}$ denotes the mass of material transported from $i$ to $j$ . The program attempts to find transportation plan which minimises the total cost of transportation $\langle Z,C\rangle$ .

One interpretation of (5) is that the last row/column appended to the cost matrices serve as a non-assignment layer and that the cost for transporting unit mass to this layer from all other layers is $1$ . The costs for mislabelling was defined relative to this non-assignment cost. The costs for similar layers is much smaller than $1$ ; therefore, the optimiser is incentivised to transport mass among similar layers rather than not assign it provided that the structural penalty is not too large. Correspondingly, the cost for very disparate layers is much larger so that we would never match, say, a convolutional layer with a pooling layer. In fact, the $\infty$ ’s in Table 1 can be replaced by any value larger than $2$ and the solution will be the same. The following theorem shows that (3) and (5) are equivalent.

Problems (3) and (5) are equivalent, in that they both have the same minimum and we can recover the solution of one from the other.

The converse also follows via a straightforward argument. For given $Z^{\prime}$ that is feasible for (5), we let $Z$ be the first $n_{1}\times n_{2}$ block. By the equality constraints and non-negativity of $Z^{\prime}$ , $Z$ is feasible for (3). By reversing the argument in (6) we see that the objectives are also equal. ∎

A.2 Distance Properties of OTMANN

The following theorem shows that the solution of (3) is a pseudo-distance. This is a formal version of Theorem 1 in the main text.

Assume that the mislabel cost matrix $M$ satisfies the triangle inequality; i.e. for all labels x, y, z we have $M(\textrm{{{x}}},\textrm{{{z}}})\leq M(\textrm{{{x}}},\textrm{{{y}}})+M(\textrm{{{y}}},\textrm{{{z}}})$ . Let $d(\mathcal{G}_{1},\mathcal{G}_{2})$ be the solution of (3) for networks $\mathcal{G}_{1},\mathcal{G}_{2}$ . Then $d(\cdot,\cdot)$ is a pseudo-distance. That is, for all networks $\mathcal{G}_{1},\mathcal{G}_{2},\mathcal{G}_{3}$ , it satisfies, $d(\mathcal{G}_{1},\mathcal{G}_{2})>0$ , $d(\mathcal{G}_{1},\mathcal{G}_{2})=d(\mathcal{G}_{2},\mathcal{G}_{1})$ , $d(\mathcal{G}_{1},\mathcal{G}_{1})=0$ and $d(\mathcal{G}_{1},\mathcal{G}_{3})\leq d(\mathcal{G}_{1},\mathcal{G}_{2})+d(\mathcal{G}_{2},\mathcal{G}_{3})$ .

Some remarks are in order. First, observe that while $d(\cdot,\cdot)$ is a pseudo-distance, it is not a distance; i.e. $d(\mathcal{G}_{1},\mathcal{G}_{2})=0\,\nRightarrow\,\mathcal{G}_{1}=\mathcal{G}_{2}$ . For example, while the networks in Figure 3 have different descriptors according to our formalism in Section 2.2, their distance is . However, it is not hard to see that their functionality is the same – in both cases, the output of layer $1$ is passed through $16$ conv3 filters and then fed to a layer with $32$ conv3 filters – and hence, this property is desirable in this example. It is not yet clear however, if the topology induced by our metric equates two functionally dissimilar networks. We leave it to future work to study equivalence classes induced by the OTMANN distance. Second, despite the OT formulation, this is not a Wasserstein distance. In particular, the supports of the masses and the cost matrices change depending on the two networks being compared.

We will use the OT formulation (5) in this proof. The first three properties are straightforward. Non-negativity follows from non-negativity of $Z^{\prime},C^{\prime}$ in (5). It is symmetric since the cost matrix for $d(\mathcal{G}_{2},\mathcal{G}_{1})$ is $C^{\prime\top}$ if the cost matrix for $d(\mathcal{G}_{1},\mathcal{G}_{2})$ is $C$ and $\langle Z^{\prime},C^{\prime}\rangle=\langle Z^{\prime\top},C^{\prime\top}\rangle$ for all $Z^{\prime}$ . We also have $d(\mathcal{G}_{1},\mathcal{G}_{1})=0$ since, then, $C^{\prime}$ has a zero diagonal.

Similarly, $R^{\top}{\bf 1}_{\bar{n}_{1}}=y_{3}$ . Now, let $U^{\prime},V^{\prime},W^{\prime}$ be the cost matrices $C^{\prime}$ in (5) when computing $d(\mathcal{G}_{1},\mathcal{G}_{2})$ , $d(\mathcal{G}_{2},\mathcal{G}_{3})$ , and $d(\mathcal{G}_{1},\mathcal{G}_{3})$ respectively. We will use the following technical lemma whose proof is given below.

For all $i\in\mathcal{L}_{1},\,j\in\mathcal{L}_{2},\,k\in\mathcal{L}_{3}$ , we have $W^{\prime}_{ik}\leq U^{\prime}_{ij}+V^{\prime}_{jk}$ .

Applying Lemma 4 yields the triangle inequality.

The first step uses the fact that $d(\mathcal{G}_{1},\mathcal{G}_{3})$ is the minimum of all feasible solutions and the third step uses Lemma 4. The fourth step rearranges terms and the fifth step uses $P^{\top}{\bf 1}_{\bar{n}_{1}}=Q{\bf 1}_{\bar{n}_{3}}=y_{2}$ . ∎

For the non-assignment terms, when $(W^{\prime}_{\textrm{nas}})_{ik}=0$ the claim is true trivially. $(W^{\prime}_{\textrm{nas}})_{ik}=1$ , either when $(i=\bar{n}_{1},k\leq n_{3})$ or $(i\leq n_{1},k=\bar{n}_{3})$ . In the former case, when $j\leq n_{2}$ , $(U^{\prime}_{\textrm{nas}})_{jk}=1$ and when $j=\bar{n}_{2}$ , $(V^{\prime}_{\textrm{nas}})_{\bar{n}_{2}}=1$ as $k\leq n_{3}$ . We therefore have, $(W^{\prime}_{\textrm{nas}})_{ik}=(U^{\prime}_{\textrm{nas}})_{ij}+(V^{\prime}_{\textrm{nas}})_{jk}=1$ . A similar argument shows equality for the $(i\leq n_{1},k=\bar{n}_{3})$ case as well.

Finally, for the structural terms we note that $W^{\prime}_{\textrm{str}}$ can be written as $W^{\prime}_{\textrm{str}}=\sum_{t}W^{\prime(t)}$ as can $U^{\prime(t)},T^{\prime(t)}$ . Here $t$ indexes over the choices for the types of distances considered, i.e. $t\in\{\textrm{sp, lp, rw}\}\times\{\textrm{ip, op}\}$ . It is sufficient to show $(W^{\prime(t)})_{ik}\leq(U^{\prime(t)})_{ij}+(T^{\prime(t)})_{jk}$ . This inequality takes the form,

A.3 Implementation & Design Choices

Masses on the decision & input/output layers: It is natural to ask why one needs to model the mass in the decision and input/output layers. For example, a seemingly natural choice is to use for these layers. Using mass, is a reasonable strategy if we were to allow only one decision layer. However, when there are multiple decision layers, consider comparing the following two networks: the first has a feed forward MLP with non-linear layers, the second is the same network but with an additional linear decision layer $u$ , with one edge from $u_{\textrm{ip}}$ to $u$ and an edge from $u$ to $u_{\textrm{op}}$ . This latter models the function as a linear + non-linear term which might be suitable for some problems unlike modeling it only as a non-linear term. If we do not add layer masses for the input/output/decision layers, then the distance between both networks would be - as there will be equal mass in the FF part for both networks and they can be matched with 0 cost.

Computing path lengths $\delta_{s}^{t}$ : Algorithm 1 computes all path lengths in $O(|\mathcal{E}|)$ time. Note that topological sort of a connected digraph also takes $O(|\mathcal{E}|)$ time. The topological sorting ensures that $\delta^{\textrm{rw}}_{\textrm{op}}$ is always computed for the children in step 4. For $\delta^{\textrm{sp}}_{\textrm{op}},\delta^{\textrm{lp}}_{\textrm{op}}$ we would replace the averaging of $\Delta$ in step 5 with the minimum and maximum of $\Delta$ respectively.

For $\delta^{\textrm{rw}}_{\textrm{ip}}$ we make the following changes to Algorithm 1. In step 1, we set $\delta^{\textrm{rw}}_{\textrm{ip}}(u_{\textrm{ip}})=0$ , in step 3, we ${\rm pop\_first}$ and $\Delta$ in step 4 is computed using the parents. $\delta^{\textrm{sp}}_{\textrm{ip}},\delta^{\textrm{lp}}_{\textrm{ip}}$ are computed with the same procedure but by replacing the averaging with minimum or maximum as above.

Label Penalty Matrices: The label penalty matrices used in our NASBOT implementation, described below, satisfy the triangle inequality condition in Theorem 3.

CNNs: Table 4 shows the label penalty matrix $M$ for used in our CNN experiments with labels conv3, conv5, conv7, max-pool, avg-pool, softmax, ip, op. conv $k$ denotes a $k\times k$ convolution while avg-pool and max-pool are pooling operations. In addition, we also use res3, res5, res7 layers which are inspired by ResNets. A res $k$ uses 2 concatenated conv $k$ layers but the input to the first layer is added to the output of the second layer before the relu activation – See Figure 2 in He et al. . The layer mass for res $k$ layers is twice that of a conv $k$ layer. The costs for the res in the label penalty matrix is the same as the conv block. The cost between a res $k$ and conv $j$ is $M(\textrm{{{res$ k $}}},\textrm{{{conv$ j $}}})=0.9\times M(\textrm{{{conv$ k $}}},\textrm{{{conv$ j $}}})+0.1\times 1$ ; i.e. we are using a convex combination of the conv costs and the non-assignment cost. The intuition is that a res $k$ is similar to conv $k$ block except for the residual addition.

MLPs: Table 5 shows the label penalty matrix $M$ for used in our MLP experiments with labels relu, crelu, leaky-relu, softplus, elu, logistic, tanh, linear, ip, op. Here the first seven are common non-linear activations; relu, crelu, leaky-relu, softplus, elu rectifiers while logistic and tanh are sigmoidal activations.

Secondly, we use a slightly different form for $C_{\textrm{str}}$ . First, for $i\in\mathcal{L}_{1},\;j\in\mathcal{L}_{2}$ , we let $C^{\textrm{all}}_{\textrm{str}}(i,j)=\frac{1}{6}\sum_{s\in\{\textrm{sp, lp, rw}\}}$ $\sum_{t\in\{\textrm{ip,op}\}}|\delta^{s}_{t}(i)-\delta^{s}_{t}(j)|$ be the average of all path length differences; i.e. $C^{\textrm{all}}_{\textrm{str}}$ captures the path length differences when considering all layers. For CNNs, we similarly construct matrices $C^{\textrm{conv}}_{\textrm{str}},C^{\textrm{pool}}_{\textrm{str}},C^{\textrm{fc}}_{\textrm{str}}$ , except they only consider the convolutional, pooling and fully connected layers respectively in the path lengths. For $C^{\textrm{conv}}_{\textrm{str}}$ , the distances to the output (from the input) can be computed by zeroing outgoing (incoming) edges to layers that are not convolutional. We can similarly construct $C^{\textrm{pool}}_{\textrm{str}}$ and $C^{\textrm{fc}}_{\textrm{str}}$ only counting the pooling and fully connected layers. Our final cost matrix for the structural penalty is the average of these four matrices, $C_{\textrm{str}}=(C^{\textrm{all}}_{\textrm{str}}+C^{\textrm{conv}}_{\textrm{str}}+C^{\textrm{pool}}_{\textrm{str}}+C^{\textrm{fc}}_{\textrm{str}})/4$ . For MLPs, we adopt a similar strategy by computing matrices $C^{\textrm{all}}_{\textrm{str}},C^{\textrm{rec}}_{\textrm{str}},C^{\textrm{sig}}_{\textrm{str}}$ with all layers, only rectifiers, and only sigmoidal layers and let $C_{\textrm{str}}=(C^{\textrm{all}}_{\textrm{str}}+C^{\textrm{rec}}_{\textrm{str}}+C^{\textrm{sig}}_{\textrm{str}})/3$ . The intuition is that by considering certain types of layers, we are accounting for different types of information flow due to different operations.

A.4 Some Illustrations of the OTMANN Distance

We illustrate that OTMANN computes reasonable distances on neural network architectures via a two-dimensional t-SNE visualisation of the network architectures based. Given a distance matrix between $m$ objects, t-SNE embeds them in a $d$ dimensional space so that objects with small distances are placed closer to those that have larger distances. Figure 4 shows the t-SNE embedding using the OTMANN distance and its noramlised version. We have indexed 13 networks in both figures in a-n and displayed their architectures in Figure 5. Similar networks are placed close to each other indicating that OTMANN induces a meaningful topology among neural network architectures.

Next, we show that the distances induced by OTMANN are correlated with validation error performance. In Figure 6 we provide the following scatter plot for networks trained in our experiments for the Indoor, Naval and Slice datasets. Each point in the figure is for pair of networks. The $x$ -axis is the OTMANN distance between the pair and the $y$ -axis is the difference in the validation error on the dataset. In each figure we used $300$ networks giving rise to $~{}45K$ pairwise points in each scatter plot. As the figure indicates, when the distance is small the difference in performance is close to . However, as the distance increases, the points are more scattered. Intuitively, one should expect that while networks that are far apart could perform similarly or differently, similar networks should perform similarly. Hence, OTMANN induces a useful topology in the space of architectures that is smooth for validaiton performance on real world datasets. This demonstrates that it can be incorporated in a BO framework to optimise a network based on its validation error.

Appendix B Implementation of NASBOT

Here, we describe our BO framework for NASBOT in full detail.

As described in the main text, we use a negative exponentiated distance as our kernel. Precisely, we use,

Here, $d_{i},\bar{d}_{i}$ , are the OTMANN distance and its normalised counterpart developed in Section 3, computed with different values for $\nu_{\textrm{str}}\in\{\nu_{\textrm{str},i}\}_{i}$ . $\beta_{i},\bar{\beta}_{i}$ manage the relative contributions of $d_{i},\bar{d}_{i}$ , while $(\alpha,\bar{\alpha})$ manage the contributions of each kernel in the sum. An ensemble approach of the above form, instead of trying to pick a single best value, ensures that NASBOT accounts for the different topologies induced by the different distances $d_{i},\bar{d}_{i}$ . In the experiments we report, we used $\{\nu_{\textrm{str},i}\}_{i}=\{0.1,0.2,0.4,0.8\}$ , $p=1$ and $\bar{p}=2$ . Our experience suggests that NASBOT was not particularly sensitive to these choices expect when we used only very large or only very small values in $\{\nu_{\textrm{str},i}\}_{i}$ .

NASBOT, as described above has $11$ hyper-parameters of its own; $\alpha,\bar{\alpha},\{(\beta_{i},\bar{\beta}_{i})\}_{i=1}^{4}$ and the GP noise variance $\eta^{2}$ . While maximising the GP marginal likelihood is a common approach to pick hyper-parameters, this might cause over-fitting when there are many of them. Further, as training large neural networks is typically expensive, we have to content with few observations for the GP in practical settings. Our solution is to start with a (uniform) prior over these hyper-parameters and sample hyper-parameter values from the posterior under the GP likelihood , which we found to be robust. While it is possible to treat $\nu_{\textrm{str}}$ itself as a hyper-parameter of the kernel, this will require us to re-compute all pairwise distances of networks that have already been evaluated each time we change the hyper-parameters. On the other hand, with the above approach, we can compute and store distances for different $\nu_{\textrm{str},i}$ values whenever a new network is evaluated, and then compute the kernel cheaply for different values of $\alpha,\bar{\alpha},\{(\beta_{i},\bar{\beta}_{i})\}_{i}$ .

B.2 Optimising the Acquisition

We use a evolutionary algorithm (EA) approach to optimise the acquisition function (2). We begin with an initial pool of networks and evaluate the acquisition $\varphi_{t}$ on those networks. Then we generate a set of $N_{\textrm{mut}}$ mutations of this pool as follows. First, we stochastically select $N_{\textrm{mut}}$ candidates from the set of networks already evaluated such that those with higher $\varphi_{t}$ values are more likely to be selected than those with lower values. Then we apply a mutation operator to each candidate, to produce a modified architecture. Finally, we evaluate the acquisition on this $N_{\textrm{mut}}$ mutations, add it to the initial pool, and repeat for the prescribed number of steps.

Mutation Operator: To describe the mutation operator, we will first define a library of modifications to a neural network. These modifications, described in Table 6, might change the architecture either by increasing or decreasing the number of computational units in a layer, by adding or deleting layers, or by changing the connectivity of existing layers. They provide a simple mechanism to explore the space of architectures that are close to a given architecture. The one-step mutation operator takes a given network and applies one of the modifications in Table 6 picked at random to produce a new network. The $k$ -step mutation operator takes a given network, and repeatedly applies the one-step operator $k$ times – the new network will have undergone $k$ changes from the original one. One can also define a compound operator, which picks the number of steps probabilistically. In our implementation of NASBOT, we used such a compound operator with probabilities $(0.5,0.25,0.125,0.075,0.05)$ ; i.e. it chooses a one-step operator with probability $0.5$ , a $4$ -step operator with probability $0.075$ , etc. Typical implementations of EA in Euclidean spaces define the mutation operator via a Gaussian (or other) perturbation of a chosen candidate. It is instructive to think of the probabilities for each step in our scheme above as being analogous to the width of the Gaussian chosen for perturbation.

Sampling strategy: The sampling strategy for EA is as follows. Let $\{z_{i}\}_{i}$ , where $z_{i}\in\mathcal{X}$ be the points evaluated so far. We sample $N_{\textrm{mut}}$ new points from a distribution $\pi$ where $\pi(z_{i})\propto\exp(g(z_{i})/\sigma$ ). Here $g$ is the function to be optimised (for NASBOT, $\varphi_{t}$ at time $t$ ). $\sigma$ is the standard deviation of all previous evaluations. As the probability for large $g$ values is higher, they are more likely to get selected. $\sigma$ provides normalisation to account for different ranges of function values.

Since our candidate selection scheme at each step favours networks that have high acquisition value, our EA scheme is more likely to search at regions that are known to have high acquisition. The stochasticity in this selection scheme and the fact that we could take multiple steps in the mutation operation ensures that we still sufficiently explore the space. Since an evaluation of $\varphi_{t}$ is cheap, we can use many EA steps to explore several architectures and optimise $\varphi_{t}$ .

Other details: The EA procedure is also initialised with the same initial pools in Figures 20, 21. In our NASBOT implementation, we increase the total number of EA evaluations $n_{\text{{{{EA}}} }}$ at rate $\mathcal{O}(\sqrt{t})$ where $t$ is the current time step in NASBOT. We set $N_{\textrm{mut}}$ to be $\mathcal{O}(\sqrt{n_{\text{{{{EA}}} }}})$ . Hence, initially we are only considering a small neighborhood around the initial pool, but as we proceed along BO, we expand to a larger region, and also spend more effort to optimise $\varphi_{t}$ .

Considerations when performing modifications: The modifications in Table 6 is straightforward in MLPs. But in CNNs one needs to ensure that the image sizes are the same when concatenating them as an input to a layer. This is because strides can shrink the size of the image. When we perform a modification we check if this condition is violated and if so, disallow that modification. When a skip modifier attempts to add a connection from a layer with a large image size to one with a smaller one, we add avg-pool layers at stride 2 so that the connection can be made (this can be seen, for e.g. in the second network in Fig. 8).

B.3 Other Implementation Details

Initialisation: We initialise NASBOT (and other methods) with an initial pool of $10$ networks. These networks are illustrated in Fig. 20 for CNNs and Fig. 21 for MLPs at the end of the document. These are the same networks used to initialise the EA procedure to optimise the acquisition. All initial networks have feed forward structure. For the CNNs, the first $3$ networks have structure similar to the VGG nets and the remaining have blocked feed forward structures as in He et al. . We also use blocked structures for the MLPs with the layer labels decided arbitrarily.

Domain: For NASBOT, and other methods, we impose the following constraints on the search space. If the EA modifier (explained below) generates a network that violates these constraints, we simply skip it.

Maximum number of units per layer: $1024$

Layer types: We use the layer types detailed in Appendix A.3 for both CNNs and MLPs. For CNNs, all pooling operations are done at stride 2. For convolutional layers, we use either stride 1 or 2 (specified in the illustrations). For all layers in a CNN we use relu activations.

Parallel BO: We use a parallelised experimental set up where multiple models can be evaluated in parallel. We handle parallel BO via the hallucination technique in Ginsbourger et al. .

Finally, we emphasise that many of the above choices were made arbitrarily, and we were able to get NASBOT working efficiently with our first choice for these parameters/specifications. Note that many end-to-end systems require specification of such choices.

Appendix C Addendum to Experiments

RAND: Our RAND implementation, operates in exactly the same way as NASBOT, except that the EA procedure (Sec. B.2) is fed a random sample from ${\bf{\rm Unif}}(0,1)$ instead of the GP acquisition each time it evaluates an architecture. That is, we follow the same schedule for $n_{\text{{{EA}}}}$ and $N_{\textrm{mut}}$ as we did for NASBOT . Hence RAND has the opportunity to explore the same space as NASBOT, but picks the next evaluation randomly from this space.

EA: This is as described in Appendix B except that we fix $N_{\textrm{mut}}=10$ all the time. In our experiments where we used a budget based on time, it was difficult to predict the total number of evaluations so as to set $N_{\textrm{mut}}$ in perhaps a more intelligent way.

TreeBO: As the implementation from Jenatton et al. was not made available, we wrote our own. It differs from the version described in the paper in a few ways. We do not tune for a regularisation penalty and step size as they do to keep it line with the rest of our experimental set up. We set the depth of the network to $60$ as we allowed $60$ layers for the other methods. We also check for the other constraints given in Appendix B before evaluating a network. The original paper uses a tree structured kernel which can allow for efficient inference with a large number of samples. For simplicity, we construct the entire kernel matrix and perform standard GP inference. The result of the inference is the same, and the number of GP samples was always below $120$ in our experiments so a sophisticated procedure was not necessary.

C.2 Details on Training

In all methods, for each proposed network architecture, we trained the network on the train data set, and periodically evaluated its performance on the validation data set. For MLP experiments, we optimised network parameters using stochastic gradient descent with a fixed step size of $10^{-5}$ and a batch size of 256 for 20,000 iterations. We computed the validation set MSE every 100 iterations; from this we returned the minimum MSE that was achieved. For CNN experiments, we optimised network parameters using stochastic gradient descent with a batch size of 32. We started with a learning rate of $0.01$ and reduced it gradually. We also used batch normalisation and trained the model for $60,000$ batch iterations. We computed the validation set classification error every $4000$ iterations; from this we returned the minimum classification error that was achieved.

After each method returned an optimal neural network architecture, we again trained each optimal network architecture on the train data set, periodically evaluated its performance on the validation data set, and finally computed the MSE or classification error on the test data set. For MLP experiments, we used the same optimisation procedure as above; we then computed the test set MSE at the iteration where the network achieved the minimum validation set MSE. For CNN experiments, we used the same optimisation procedure as above, except here the optimal network architecture was trained for 120,000 iterations; we then computed the test set classification error at the iteration where the network achieved the minimum validation set classification error.

C.3 Optimal Network Architectures and Initial Pool

Here we illustrate and compare the optimal neural network architectures found by different methods. In Figures 8-11, we show some optimal network architectures found on the Cifar10 data by NASBOT, EA, RAND, and TreeBO, respectively. We also show some optimal network architectures found for these four methods on the Indoor data, in Figures 12-15, and on the Slice data, in Figures 16-19. A common feature among all optimal architectures found by NASBOT was the presence of long skip connections and multiple decision layers.

In Figure 21, we show the initial pool of MLP network architectures, and in Figure 20, we show the initial pool of CNN network architectures. On the Cifar10 dataset, VGG-19 was one of the networks in the initial pool. While all methods beat VGG-19 when trained for 24K iterations (the number of iterations we used when picking the model), TreeBO and RAND lose to VGG-19 (see Section 5 for details). This could be because the performance after shorter training periods may not exactly correlate with performance after longer training periods.

C.4 Ablation Studies and Design Choices

We conduct experiments comparing the various design choices in NASBOT. Due to computational constraints, we carry them out on synthetic functions.

In Figure 7a, we compare NASBOT using only the normalised distance, only the unnormalised distance, and the combined kernel as in (7). While the individual distances performs well, the combined form outperforms both.

Next, we modify our EA procedure to optimise the acquisition. We execute NASBOT using only the EA modifiers which change the computational units (first four modifiers in Table 6), then using the modifiers which only change the structure of the networks (bottom 5 in Table 6), and finally using all 9 modifiers, as used in all our experiments. The combined version outperforms the first two.

Finally, we experiment with different choices for $p$ and $\bar{p}$ in (7). As the figures indicate, the performance was not particularly sensitive to these choices.

Below we describe the three synthetic functions $f_{1},f_{2},f_{3}$ used in our synthetic experiments. $f_{3}$ applies for CNNs while $f_{1},f_{2}$ apply for MLPs. Here ${\rm am}$ denotes the average mass per layer, $\rm deg_{i}$ is the average in degree the layers, $\rm deg_{o}$ is the average out degree, $\rm\delta$ is the shortest distance from $u_{\textrm{ip}}$ to $u_{\textrm{op}}$ , $\rm{str}$ is the average stride in CNNS, $\rm{frac\_conv3}$ is the fraction of layers that are conv3, $\rm{frac\_sigmoid}$ is the fraction of layers that are sigmoidal.

Appendix D Additional Discussion on Related Work

Historically, evolutionary (genetic) algorithms (EA) have been the most common method used for designing architectures . EA techniques are popular as they provide a simple mechanism to explore the space of architectures by making a sequence of changes to networks that have already been evaluated. However, as we will discuss later, EA algorithms, while conceptually and computationally simple, are typically not best suited for optimising functions that are expensive to evaluate. A related line of work first sets up a search space for architectures via incremental modifications, and then explores this space via random exploration, MCTS, or A* search . Some of the methods above can only optimise among feed forward structures, e.g. Fig. 1a, but cannot handle spaces with arbitrarily structured networks, e.g. Figs. 1b, 1c.

The most successful recent architecture search methods that can handle arbitrary structures have adopted reinforcement learning (RL) . However, architecture search is in essence an optimisation problem – find the network with the highest function value. There is no explicit need to maintain a notion of state and solve the credit assignment problem in RL . Since RL is fundamentally more difficult than optimisation , these methods typically need to try a very large number of architectures to find the optimum. This is not desirable, especially in computationally constrained settings.