Neural Architecture Optimization

Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, Tie-Yan Liu

Introduction

Automatic design of neural network architecture without human intervention has been the interests of the community from decades ago to very recent . The latest algorithms for automatic architecture design usually fall into two categories: reinforcement learning (RL) based methods and evolutionary algorithm (EA) based methods . In RL based methods, the choice of a component of the architecture is regarded as an action. A sequence of actions defines an architecture of a neural network, whose dev set accuracy is used as the reward. In EA based method, search is performed through mutations and re-combinations of architectural components, where those architectures with better performances will be picked to continue evolution.

It can be easily observed that both RL and EA based methods essentially perform search within the discrete architecture space. This is natural since the choices of neural network architectures are typically discrete, such as the filter size in CNN and connection topology in RNN cell. However, directly searching the best architecture within discrete space is inefficient given the exponentially growing search space with the number of choices increasing. In this work, we instead propose to optimize network architecture by mapping architectures into a continuous vector space (i.e., network embeddings) and conducting optimization in this continuous space via gradient based method. On one hand, similar to the distributed representation of natural language , the continuous representation of an architecture is more compact and efficient in representing its topological information; On the other hand, optimizing in a continuous space is much easier than directly searching within discrete space due to better smoothness.

We call this optimization based approach Neural Architecture Optimization (NAO), which is briefly shown in Fig. 1. The core of NAO is an encoder model responsible to map a neural network architecture into a continuous representation (the blue arrow in the left part of Fig. 1). On top of the continuous representation we build a regression model to approximate the final performance (e.g., classification accuracy on the dev set) of an architecture (the yellow surface in the middle part of Fig. 1). It is noteworthy here that the regression model is similar to the performance predictor in previous works . What distinguishes our method is how to leverage the performance predictor: different with previous work that uses the performance predictor as a heuristic to select the already generated architectures to speed up searching process, we directly optimize the module to obtain the continuous representation of a better network (the green arrow in the middle and bottom part of Fig. 1) by gradient descent. The optimized representation is then leveraged to produce a new neural network architecture that is predicted to perform better. To achieve that, another key module for NAO is designed to act as the decoder recovering the discrete architecture from the continuous representation (the red arrow in the right part of Fig. 1). The decoder is an LSTM model equipped with an attention mechanism that makes the exact recovery easy. The three components (i.e., encoder, performance predictor and decoder) are jointly trained in a multi task setting which is beneficial to continuous representation: the decoder objective of recovering the architecture further improves the quality of the architecture embedding, making it more effective in predicting the performance.

We conduct thorough experiments to verify the effectiveness of NAO, on both image classification and language modeling tasks. Using the same architecture space commonly used in previous works , the architecture found via NAO achieves $1.93\%$ test set error rate (with cutout ) on CIFAR-10. Furthermore, on PTB dataset we achieve $56.0$ perplexity, also surpassing best performance found via previous methods on neural architecture search. Furthermore, we show that equipped with the recent proposed weight sharing mechanism in ENAS to reduce the large complexity in the parameter space of child models, we can achieve improved efficiency in discovering powerful convolutional and recurrent architectures, e.g., both take less than $10$ hours on 1 GPU.

Our codes and model checkpoints are available at https://github.com/renqianluo/NAO and https://github.com/renqianluo/NAO_pytorch.

Related Work

Recently the design of neural network architectures has largely shifted from leveraging human knowledge to automatic methods, sometimes referred to as Neural Architecture Search (NAS) . As mentioned above, most of these methods are built upon one of the two basic algorithms: reinforcement learning (RL) and evolutionary algorithm (EA) . For example, use policy networks to guide the next-step architecture component. The evolution processes in guide the mutation and recombination process of candidate architectures. Some recent works try to improve the efficiency in architecture search by exploring the search space incrementally and sequentially, typically from shallow to hard. Among them, additionally utilizes a performance predictor to select the promising candidates. Similar performance predictor has been specifically studied in parallel works such as . Although different in terms of searching algorithms, all these works target at improving the quality of discrete decision in the process of searching architectures.

The most recent work parallel to ours is DARTS , which relaxes the discrete architecture space to continuous one by mixture model and utilizes gradient based optimization to derive the best architecture. One one hand, both NAO and DARTS conducts continuous optimization via gradient based method; on the other hand, the continuous space in the two works are different: in DARTS it is the mixture weights and in NAO it is the embedding of neural architectures. The difference in optimization space leads to the difference in how to derive the best architecture from continuous space: DARTS simply assumes the best decision (among different choices of architectures) is the $argmax$ of mixture weights while NAO uses a decoder to exactly recover the discrete architecture.

Another line of work with similar motivation to our research is using bayesian optimization (BO) to perform automatic architecture design . Using BO, an architecture’s performance is typically modeled as sample from a Gaussian process (GP). The induced posterior of GP, a.k.a. the acquisition function denoted as $a:\mathcal{X}\rightarrow R^{+}$ where $\mathcal{X}$ represents the architecture space, is tractable to minimize. However, the effectiveness of GP heavily relies on the choice of covariance functions $K(x,x^{\prime})$ which essentially models the similarity between two architectures $x$ and $x^{\prime}$ . One need to pay more efforts in setting good $K(x,x^{\prime})$ in the context of architecture design, bringing additional manual efforts whereas the performance might still be unsatisfactory . As a comparison, we do not build our method on the complicated GP setup and empirically find that our model which is simpler and more intuitive works much better in practice.

Approach

We introduce the details of neural architecture optimization (NAO) in this section.

Firstly we introduce the design space for neural network architectures, denoted as $\mathcal{X}$ . For fair comparison with previous NAS algorithms, we adopt the same architecture space commonly used in previous works .

For searching CNN architecture, we assume that the CNN architecture is hierarchical in that a cell is stacked for a certain number of times (denoted as $N$ ) to form the final CNN architecture. The goal is to design the topology of the cell. A cell is a convolutional neural network containing $B$ nodes. Each of the nodes contains two branches, with each branch taking the output of one of the former nodes as input and applying an operation to it. The operation set includes $11$ operators listed in Appendix. The node adds the outputs of its two branches as its output. The inputs of the cell are the outputs of two previous cells, respectively denoted as node $-2$ and node $-1$ . Finally, the outputs of all the nodes that are not used by any other nodes are concatenated to form the final output of the cell. Therefore, for each of the $B$ nodes we need to: 1) decide which two previous nodes are used as the inputs to its two branches; 2) decide the operation to apply to its two branches. We set $B=5$ in our experiments as in .

For searching RNN architecture, we use the same architecture space as in . The architecture space is imposed on the topology of an RNN cell, which computes the hidden state $h_{t}$ using input $i_{t}$ and previous hidden state $h_{t-1}$ . The cell contains $B$ nodes and we have to make two decisions for each node, similar to that in CNN cell: 1) a previous node as its input; 2) the activation function to apply. For example, if we sample node index $2$ and ReLU for node $3$ , the output of the node will be $o_{3}=ReLU(o_{2}\cdot W_{3}^{h})$ . An exception here is for the first node, where we only decide its activation function $a_{1}$ and its output is $o_{1}=a_{1}(i_{t}\cdot W^{i}+h_{t-1}\cdot W_{1}^{h})$ . Note that all $W$ matrices are the weights related with each node. The available activation functions are: tanh, ReLU, identity and sigmoid. Finally, the output of the cell is the average of the output of all the nodes. In our experiments we set $B=12$ as in .

We use a sequence consisting of discrete string tokens to describe a CNN or RNN architecture. Taking the description of CNN cell as an example, each branch of the node is represented via three tokens, including the node index it selected as input, the operation type and operation size. For example, the sequence “node-2 conv 3x3 node1 max-pooling 3x3 ” means the two branches of one node respectively takes the output of node $-2$ and node $1$ as inputs, and respectively apply $3\times 3$ convolution and $3\times 3$ max pooling. For the ease of introduction, we use the same notation $x=\{x_{1},\cdots,x_{T}\}$ to denote such string sequence of an architecture $x$ , where $x_{t}$ is the token at $t$ -th position and all architectures $x\in\mathcal{X}$ share the same sequence length denoted as $T$ . $T$ is determined via the number of nodes $B$ in each cell in our experiments.

2 Components of Neural Architecture Optimization

The overall framework of NAO is shown in Fig. 1. To be concrete, there are three major parts that constitute NAO: the encoder, the performance predictor and the decoder.

Encoder. The encoder of NAO takes the string sequence describing an architecture as input, and maps it into a continuous space $\mathcal{E}$ . Specifically the encoder is denoted as $E:\mathcal{X}\rightarrow\mathcal{E}$ . For an architecture $x$ , we have its continuous representation (a.k.a. embedding) $e_{x}=E(x)$ . We use a single layer LSTM as the basic model of encoder and the hidden states of the LSTM are used as the continuous representation of $x$ . Therefore we have $e_{x}=\{h_{1},h_{2},\cdots,h_{T}\}\in\mathcal{R}^{T\times d}$ where $h_{t}\in\mathcal{R}^{d}$ is LSTM hidden state at $t$ -th timestep with dimension $d$ For ease of introduction, even though some notations have been used before (e.g., $h_{t}$ in defining RNN search space), they are slightly abused here..

Performance predictor. The performance predictor $f:\mathcal{E}\rightarrow\mathcal{R}^{+}$ is another important module accompanied with the encoder. It maps the continuous representation $e_{x}$ of an architecture $x$ into its performance $s_{x}$ measured by dev set accuracy. Specifically, $f$ first conducts mean pooling on $e_{x}=\{h_{1},\cdots,h_{T}\}$ to obtain $\overline{e}_{x}=\frac{1}{T}\sum_{t}^{T}h_{t}$ , and then maps $\overline{e}_{x}$ to a scalar value using a feed-forward network as the predicted performance. For an architecture $x$ and its performance $s_{x}$ as training data, the optimization of $f$ aims at minimizing the least-square regression loss $(s_{x}-f(E(x)))^{2}$ .

Considering the objective of performance prediction, an important requirement for the encoder is to guarantee the permutation invariance of architecture embedding: for two architectures $x_{1}$ and $x_{2}$ , if they are symmetric (e.g., $x_{2}$ is formed via swapping two branches within a node in $x_{1}$ ), their embeddings should be close to produce the same performance prediction scores. To achieve that, we adopt a simple data augmentation approach inspired from the data augmentation method in computer vision (e.g., image rotation and flipping): for each $(x_{1},s_{x})$ , we add an additional pair $(x_{2},s_{x})$ where $x_{2}$ is symmetrical to $x_{1}$ , and use both pairs (i.e., $(x_{1},s_{x})$ and $(x_{2},s_{x})$ ) to train the encoder and performance predictor. Empirically we found that acting in this way brings non-trivial gain: on CIFAR-10 about $2\%$ improvement when we measure the quality of performance predictor via the pairwise accuracy among all the architectures (and their performances).

Decoder. Similar to the decoder in the neural sequence-to-sequence model , the decoder in NAO is responsible to decode out the string tokens in $x$ , taking $e_{x}$ as input and in an autoregressive manner. Mathematically the decoder is denoted as $D:\mathcal{E}\rightarrow x$ which decodes the string tokens $x$ from its continuous representation: $x=D(e_{x})$ . We set $D$ as an LSTM model with the initial hidden state $s_{0}=h_{T}(x)$ . Furthermore, attention mechanism is leveraged to make decoding easier, which will output a context vector $ctx_{r}$ combining all encoder outputs $\{h_{t}\}_{t=1}^{T}$ at each timestep $r$ . The decoder $D$ then induces a factorized distribution $P_{D}(x|e_{x})=\prod_{r=1}^{T}P_{D}(x_{r}|e_{x},x_{<r})$ on $x$ , where the distribution on each token $x_{r}$ is $P_{D}(x_{r}|e_{x},x_{<r})=\frac{\exp(W_{x_{r}}[s_{r},ctx_{r}])}{\sum_{x^{\prime}\in V_{r}}\exp(W_{x^{\prime}}[s_{r},ctx_{r}])}$ . Here $W$ is the output embedding matrix for all tokens, $x_{<r}$ represents all the previous tokens before position $r$ , $s_{r}$ is the LSTM hidden state at $r$ -th timestep and $ $means concatenation of two vectors.$ V_{r} $denotes the space of valid tokens at position$ r$ to avoid the possibility of generating invalid architectures.

The training of decoder aims at recovering the architecture $x$ from its continuous representation $e_{x}=E(x)$ . Specifically, we would like to maximize $\log P_{D}(x|E(x))=\sum_{r=1}^{T}\log P_{D}(x_{r}|E(x),x_{<r})$ . In this work we use the vanilla LSTM model as the decoder and find it works quite well in practice.

3 Training and Inference

We jointly train the encoder $E$ , performance predictor $f$ and decoder $D$ by minimizing the combination of performance prediction loss $L_{pp}$ and structure reconstruction loss $L_{rec}$ :

where $X$ denotes all candidate architectures $x$ (and their symmetrical counterparts) that are evaluated with the performance number $s_{x}$ . $\lambda\in$ is the trade-off parameter. Furthermore, the performance prediction loss acts as a regularizer that forces the encoder not optimized into a trivial state to simply copy tokens in the decoder side, which is typically eschewed by adding noise in encoding $x$ by previous works .

When both the encoder and decoder are optimized to convergence, the inference process for better architectures is performed in the continuous space $\mathcal{E}$ . Specifically, starting from an architecture $x$ with satisfactory performance, we obtain a better continuous representation $e_{x^{\prime}}$ by moving $e_{x}=\{h_{1},\cdots,h_{T}\}$ , i.e., the embedding of $x$ , along the gradient direction induced by $f$ :

where $\eta$ is the step size. Such optimization step is represented via the green arrow in Fig. 1. $e_{x^{\prime}}$ corresponds to a new architecture $x^{\prime}$ which is probably better than $x$ since we have $f(e_{x^{\prime}})\geq f(e_{x})$ , as long as $\eta$ is within a reasonable range (e.g., small enough). Afterwards, we feed $e_{x^{\prime}}$ into decoder to obtain a new architecture $x^{\prime}$ assumed to have better performance If we have $x^{\prime}=x$ , i.e., the new architecture is exactly the same with the previous architecture, we ignore it and keep increasing the step-size value by $\frac{\eta}{2}$ , until we found a different decoded architecture different with $x$ .. We call the original architecture $x$ as ‘seed’ architecture and iterate such process for several rounds, with each round containing several seed architectures with top performances. The detailed algorithm is shown in Alg. 1.

4 Combination with Weight Sharing

Recently the weight sharing trick proposed in significantly reduces the computational complexity of neural architecture search. Different with NAO that tries to reduce the huge computational cost brought by the search algorithm, weight sharing aims to ease the huge complexity brought by massive child models via the one-shot model setup . Therefore, the weight sharing method is complementary to NAO and it is possible to obtain better performance by combining NAO and weight sharing. To verify that, apart from the aforementioned algorithm 1, we try another different setting of NAO by adding the weight sharing method. Specifically we replace the RL controller in ENAS by NAO including encoder, performance predictor and decoder, with the other training pipeline of ENAS unchanged. The results are reported in the next section 4.

Experiments

In this section, we report the empirical performances of NAO in discovering competitive neural architectures on benchmark datasets of two tasks, the image recognition and the language modeling.

We move some details of model training for CIFAR-10 classification to the Appendix. The architecture encoder of NAO is an LSTM model with token embedding size and hidden state size respectively set as $32$ and $96$ . The hidden states of LSTM are normalized to have unit length, i.e., $h_{t}=\frac{h_{t}}{||h_{t}||_{2}^{2}}$ , to constitute the embedding of the architecture $x$ : $e_{x}=\{h_{1},\cdots,h_{T}\}$ . The performance predictor $f$ is a one layer feed-forward network taking $\frac{1}{T}\sum_{t=1}^{T}h_{t}$ as input. The decoder is an LSTM model with an attention mechanism and the hidden state size is $96$ . The normalized hidden states of the encoder LSTM are used to compute the attention. The encoder, performance predictor and decoder of NAO are trained using Adam for $1000$ epochs with a learning rate of $0.001$ . The trade-off parameters in Eqn. (1) is $\lambda=0.9$ . The step size to perform continuous optimization is $\eta=10$ . Similar to previous works , for all the architectures in NAO training phase (i.e., in Alg. 1), we set them to be small networks with $B=5,N=3,F=32$ and train them for $25$ epochs. After the best cell architecture is found, we build a larger network by stacking such cells $6$ times (set $N=6$ ), and enlarging the filter size (set $F=36$ , $F=64$ and $F=128$ ), and train it on the whole training dataset for $600$ epochs.

We tried two different settings of the search process for CIFAR-10: without weight sharing and with weight sharing. For the first setting, we run the evaluation-optimization step in Alg. 1 for three times (i.e., $L=3$ ), with initial $X$ set as $600$ randomly sampled architectures, $K=200$ , forming $600+200+200=1000$ model architectures evaluated in total. We use 200 V100 GPU cards to complete all the process within 1 day. For the second setting, we exactly follow the setup of ENAS and the search process includes running 150 epochs on 1 V100 GPU for $7$ hours.

The detailed results are shown in Table 1, where we demonstrate the performances of best experts designed architectures (in top block), the networks discovered by previous NAS algorithm (in middle block) and by NAO (refer to its detailed architecture in Appendix), which we name as NAONet (in bottom block). We have several observations. (1) NAONet achieves the best test set error rate ( $1.93$ ) among all architectures. (2) When accompanied with weight sharing (NAO-WS), NAO achieves $2.93\%$ error rate with much less parameters ( $2.5$ M) within only $7$ hours ( $0.3$ day), which is more efficient than ENAS. (3) Compared with the previous strongest architecture, the AmoebaNet, within smaller search space ( $\#op=11$ ), NAO not only reduces the classification error rate ( $3.34\rightarrow 3.18$ ), but also needs an order of magnitude less architectures that are evaluated ( $20000\rightarrow 1000$ ). (3) Compared with PNAS , even though the architecture space of NAO is slightly larger, NAO is more efficient ( $M=1000$ ) and significantly reduces the error rate of PNAS ( $3.41\rightarrow 3.18$ ).

Furthermore, we would like to inspect whether the gradient update in Eqn.(2) really helps to generate better architecture representations that are further decoded to architectures via $D$ . In Fig. 2 we show the average performances of architectures in $X_{eval}$ discovered via NAO at each optimization iteration. Red bar indicates the mean of real performance values $\frac{1}{|X_{eval}|}\sum_{x\in X_{eval}}s_{x}$ while blue bar indicates the mean of predicted value $\frac{1}{|X_{eval}|}\sum_{x\in X_{eval}}f(E(x))$ . We can observe that the performances of architectures in $X_{eval}$ generated via NAO gradually increase with each iteration. Furthermore, the performance predictor $f$ produces predictions well aligned with real performance, as is shown via the small gap between the paired red and blue bars.

2 Transferring the discovered architecture to CIFAR-100

To evaluate the transferability of discovered NAOnet, we apply it to CIFAR-100. We use the best architecture discovered on CIFAR-10 and exactly follow the same training setting. Meanwhile, we evaluate the performances of other automatically discovered neural networks on CIFAR-100 by strictly using the reported architectures in previous NAS papers . All results are listed in Table 2. NAONet gets test error rate of $14.75$ , better than the previous SOTA obtained with cutout ( $15.20$ ). The results show that our NAONet derived with CIFAR-10 is indeed transferable to more complicated task such as CIFAR-100.

3 Transferring the discovered architecture to ImageNet

Further, we transfer the discovered architecture to ImageNet. We use the best architecture discovered on CIFAR-10 and follow the same training setting as in related works . All results are listed in Table 3. NAONet gets top-1 test error rate of $25.7$ , better than most previous works. The results again show that our NAONet derived from CIFAR-10 is transferable to much more complicated task such as ImageNet.

4 Results of Language Modeling on PTB

We leave the model training details of PTB to the Appendix. The encoder in NAO is an LSTM with embedding size $64$ and hidden size $128$ . The hidden state of LSTM is further normalized to have unit length. The performance predictor is a two-layer MLP with each layer size as $200,1$ . The decoder is a single layer LSTM with attention mechanism and the hidden size is $128$ . The trade-off parameters in Eqn. (1) is $\lambda=0.8$ . The encoder, performance predictor, and decoder are trained using Adam with a learning rate of $0.001$ . We perform the optimization process in Alg 1 for two iterations (i.e., $L=2$ ). We train the sampled RNN models for shorter time ( $600$ epochs) during the training phase of NAO, and afterwards train the best architecture discovered yet for $2000$ epochs for the sake of better result. We use $200$ P100 GPU cards to complete all the process within 1.5 days. Similar to CIFAR-10, we furthermore explore the possibility of combining weight sharing with NAO and the resulting architecture is denoted as ‘NAO-WS’.

We report all the results in Table 4, separated into three blocks, respectively reporting the results of experts designed methods, architectures discovered via previous automatic neural architecture search methods, and our NAO. As can be observed, NAO successfully discovered an architecture that achieves quite competitive perplexity $56.0$ , surpassing previous NAS methods and is on par with the best performance from LSTM method with advanced manually designed techniques such as averaged weight drop . Furthermore, NAO combined with weight sharing (i.e., NAO-WS) again demonstrates efficiency to discover competitive architectures (e.g., achieving $56.6$ perplexity via searching in $10$ hours).

5 Transferring the discovered architecture to WikiText-2

We also apply the best discovered RNN architecture on PTB to another language modelling task based on a much larger dataset WikiText-2 (WT2 for short in the following). Table 5 shows the result that NAONet discovered by our method is on par with, or surpasses ENAS and DARTS.

Conclusion

We design a new automatic architecture design algorithm named as neural architecture optimization (NAO), which performs the optimization within continuous space (using gradient based method) rather than searching discrete decisions. The encoder, performance predictor and decoder together makes it more effective and efficient to discover better architectures and we achieve quite competitive results on both image classification task and language modeling task. For future work, first we would like to try other methods to further improve the performance of the discovered architecture, such as mixture of softmax for language modeling. Second, we would like to apply NAO to discovering better architectures for more applications such as Neural Machine Translation. Third, we plan to design better neural models from the view of teaching and learning to teach .

Acknowledgement

We thank Hieu Pham for the discussion on some details of ENAS implementation, and Hanxiao Liu for the code base of language modeling task in DARTS . We furthermore thank the anonymous reviewers for their constructive comments.

References

Appendix

CIFAR-10 contains $50k$ and $10k$ images for training and testing. We randomly choose $5000$ images within training set as the dev set for measuring the performance of each candidate network in the optimization process of NAO. Standard data pre-processing and augmentation, such as whitening, randomly cropping $32\times 32$ patches from unsampled images of size $40\times 40$ , and randomly flipping images horizontally are applied to original training set. The CNN models are trained using SGD with momentum set to $0.9$ , where the arrange of learning rate follows a single period cosine schedule with $l_{max}=0.024$ proposed in . For the purpose of regularization, We apply stochastic drop-connect on each path, and an $l_{2}$ weight decay of $5\times 10^{-4}$ . All the models are trained with batch size $128$ .

Penn Treebank (PTB) is one of the most widely adopted benchmark dataset for language modeling task. We use the open-source code of ENAS released by the authors and exactly follow their setups. Specifically, we apply variational dropout, $l_{2}$ regularization with weight decay of $5\times 10^{-7}$ , and tying word embeddings and softmax weights . We train the models using SGD with an initial learning rate of $10.0$ , decayed by a factor of $0.9991$ after every epoch staring at epoch $15$ . To avoid gradient explosion, we clip the norm of gradient with the threshold value $0.25$ . For WikiText-2, we use embdedding size 700, weight decay of $5\times 10^{-7}$ and variational dropout $0.15$ . Others unstated are the same as in PTB, such as weight tying.

2 Search Space

In searching convolutional cell architectures without weight sharing, following previous works of , we adopt $11$ possible ops as follow:

When using weight sharing, we use exactly the same 5 operators as :

In searching for recurrent cell architectures, we exactly follow the search space of ENAS , where possible activation functions are:

3 Best Architecture discovered

Here we plot the best architecture of CNN cells discovered by our NAO algorithm in Fig. 3.

Furthermore we plot the best architecture of recurrent cell discovered by our NAO algorithm in Fig. 4.