Genetic CNN

Lingxi Xie, Alan Yuille

Introduction

Visual recognition is a fundamental task in computer vision, implying a wide range of applications. Recently, the state-of-the-art algorithms on visual recognition are mostly based on the deep Convolutional Neural Network (CNN). Starting from the fundamental network model for large-scale image classification , researchers have been increasing the depth of the network , as well as designing new inner structures to improve recognition accuracy. Although these modern networks have been shown to be efficient, we note that their structures are manually designed, not learned, which limits the flexibility of the approach.

In this paper, we explore the possibility of automatically learning the structure of deep neural networks. We consider a constrained case, in which the network has a limited number of layers, and each layer is designed as a set of pre-defined building blocks such as convolution and pooling. Even under these limitations, the total number of possible network structures grows exponentially with the number of layers. Therefore, it is impractical to enumerate all the candidates and find the best one. Instead, we formulate this problem as optimization in a large search space, and apply the genetic algorithm to traversing the space efficiently.

The genetic algorithm involves constructing an initial generation of individuals (candidate solutions), and performing genetic operations to allow them to evolve in a genetic process. To this end, we propose an encoding method to represent each network structure by a fixed-length binary string. After that, we define several standard genetic operations, i.e., selection, mutation and crossover, which eliminate weak individuals of the previous generation and use them to generate competitive ones. The quality of each individual is determined by its recognition accuracy on a reference dataset. Throughout the genetic process, we evaluate each individual (i.e., network structure) by training it from scratch. The genetic process comes to an end after a fixed number of generations.

It is worth emphasizing that the genetic algorithm is computationally expensive, because we need to conduct a complete network training process for each generated individual. Therefore, we run the genetic process on two small datasets, i.e., MNIST and CIFAR10, and demonstrate its ability to find high-quality network structures. It is interesting to see that the generated structures, most of which have been less studied before, often perform better than the standard manually designed ones. Finally, we transfer the learned structures to large-scale experiments and verify their effectiveness.

The remainder of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 illustrates the way of using the genetic algorithm to design network structures. Experiments are shown in Section 4, and conclusions are drawn in Section 5.

Related Work

Image classification is a fundamental problem in computer vision. Recently, researcher have extended conventional classification tasks into large-scale environments such as ImageNet and Places . With the availability of powerful computational resources (e.g., GPU), the Convolutional Neural Networks (CNNs) have shown superior performance over the conventional Bag-of-Visual-Words models .

CNN is a hierarchical model for large-scale visual recognition. It is based on the observation that a network with enough neurons is able to fit any complicated data distribution. In past years, neural networks were shown effective for simple recognition tasks . More recently, the availability of large-scale training data (e.g., ImageNet ) and powerful GPUs make it possible to train deep CNNs which significantly outperform BoVW models. A CNN is composed of several stacked layers. In each of them, responses from the previous layer are convoluted with a filter bank and activated by a differentiable non-linearity. Hence, a CNN can be considered as a composite function, which is trained by back-propagating error signals defined by the difference between the supervision and prediction at the top layer. Recently, several efficient methods were proposed to help CNNs converge faster and prevent over-fitting, such as ReLU activation , batch normalization , Dropout and DisturbLabel . Features extracted from pre-trained neural networks can be generalized to other recognition tasks .

Designing powerful CNN structures is an intriguing problem. It is believed that deeper networks produce better recognition results . But also, adding highway information has been verified to be useful . Efforts are also made to add invariance into the network structure . We find some work which uses stochastic or dense structures, but all these network structures are deterministic (although stochastic operations are used in to accelerate training and prevent over-fitting), which limits the flexibility of the models and thus inspires us to automatically learn network structures.

2 Genetic Algorithm

The genetic algorithm is a metaheuristic inspired by the process of natural selection. Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection.

A typical genetic algorithm requires two prerequisites, i.e., a genetic representation of the solution domain, and a fitness function to evaluate each individual. A good example is the travelling-salesman problem (TSP) , a famous NP-complete problem which aims at finding the optimal Hamiltonian path in a graph of $N$ nodes. In this situation, each feasible solution is represented as a permutation of $\left\{1,2,\ldots,N\right\}$ , and the fitness function is the total cost (distance) of the path. We will show later that deep neural networks can be encoded into a binary string.

The core idea of the genetic algorithm is to allow individuals to evolve via some genetic operations. Popular operations include selection, mutation, crossover, etc. The selection process allows us to preserve strong individuals while eliminating weak ones. The ways of performing mutation and crossover vary from case to case, often based on the properties of the specific problem. For example, in the TSP problem with the permutation-based representation, a possible set of mutations is to change the order of two visited nodes. These operations are also used in our work.

There is a lot of research in how to improve the performance of genetic algorithms, including performing local search and generating random keys . In our work, we show that the vanilla genetic algorithm works well enough without these tricks. We also note that some previous work applied the genetic algorithm to exploring efficient neural network architectures , but our work aims at learning the architecture of modern CNNs, which is not studied in these prior works.

Our Approach

This section presents a genetic algorithm for designing competitive network structures. First, we describe a way of representing the network structure by a fixed-length binary string. Next, several genetic operations are defined, including selection, mutation and crossover, so that we can traverse the search space efficiently and find high-quality solutions.

Throughout this work, the genetic algorithm is only used to propose new network structures, the parameters and classification accuracy of each structure are obtained via standalone training-from-scratch.

We provide a binary string representation for a network structure in a constrained case. We first note that many state-of-the-art network structures can be partitioned into several stages. In each stage, the geometric dimensions (width, height and depth) of the layer cube remain unchanged. Neighboring stages are connected via a spatial pooling operation, which may change the spatial resolution. All the convolutional operations within one stage have the same number of filters, or channels.

We borrow this idea to define a family of networks which can be encoded into fixed-length binary strings. A network is composed of $S$ stages, and the $s$ -th stage, ${s}={1,2,\ldots,S}$ , contains $K_{s}$ nodes, denoted by $v_{s,k_{s}}$ , ${k_{s}}={1,2,\ldots,K_{s}}$ . The nodes within each stage are ordered, and we only allow connections from a lower-numbered node to a higher-numbered node. Each node corresponds to a convolutional operation, which takes place after element-wise summing up all its input nodes (lower-numbered nodes that are connected to it). After convolution, batch normalization and ReLU are followed, which are verified efficient in training very deep neural networks . We do not encode the fully-connected part of a network.

In each stage, we use ${1+2+\ldots+\left(K_{s}-1\right)}={\frac{1}{2}K_{s}\left(K_{s}-1\right)}$ bits to encode the inter-node connections. The first bit represents the connection between $\left(v_{s,1},v_{s,2}\right)$ , then the following two bits represent the connection between $\left(v_{s,1},v_{s,3}\right)$ and $\left(v_{s,2},v_{s,3}\right)$ , etc. This process continues until the last $K_{s}-1$ bits are used to represent the connection between $v_{s,1},v_{s,2},\ldots,v_{s,K_{s}-1}$ and $v_{s,K_{s}}$ . For ${1}\leqslant{i}<{j}\leqslant{K_{s}}$ , if the code corresponding to $\left(v_{s,i},v_{s,j}\right)$ is $1$ , there is an edge connecting $v_{s,i}$ and $v_{s,j}$ , i.e., $v_{s,j}$ takes the output of $v_{s,i}$ as a part of the element-wise summation, and vice versa.

Figure 1 illustrates two examples of network encoding. To summarize, a $S$ -stage network with $K_{s}$ nodes at the $s$ -th stage is encoded into a binary string with length ${L}={\frac{1}{2}{\sum_{s}}K_{s}\left(K_{s}-1\right)}$ . Equivalently, there are in total $2^{L}$ possible network structures. This number may be very large. In the CIFAR10 experiments (see Section 4.2), we have ${S}={3}$ and ${\left(K_{1},K_{2},K_{3}\right)}={\left(3,4,5\right)}$ , therefore ${L}={19}$ and ${2^{L}}={524\rm{,}288}$ . It is computationally intractable to enumerate all these structures and find the optimal one(s). In the following parts, we adopt the genetic algorithm to efficiently explore good candidates in this large space.

To make every binary string valid, we define two default nodes in each stage. The default input node, denoted as $v_{s,0}$ , receives data from the previous stage, performs convolution, and sends its output to every node without a predecessor, e.g., $v_{s,1}$ . The default output node, denoted as $v_{s,K_{s}+1}$ , receives data from all nodes without a successor, e.g., $v_{s,K_{s}}$ , sums up them, performs convolution, and sends its output to the pooling layer. Note that the connections between the ordinary nodes and the default nodes are not encoded.

There are two special cases. First, if an ordinary node $v_{s,i}$ is isolated (i.e., it is not connected to any other ordinary nodes $v_{s,j}$ , ${i}\neq{j}$ ), then it is simply ignored, i.e., it is not connected to the default input node nor the default output node (see the B2 node in Figure 1). This is to guarantee that a stage with more nodes can simulate all structures represented by a stage with fewer nodes. Second, if there are no connections at a stage, i.e., all bits in the binary string are , then the convolutional operation is performed only once, not twice (one for the default input node and one for the default output node).

1.2 Examples and Limitations

Many popular network structures can be represented using the proposed encoding scheme. Examples include VGGNet , ResNet , and a modified variant of DenseNet , which are illustrated in Figure 2.

Currently, only convolutional and pooling operations are considered, which makes it impossible to generate some tricky network modules such as Maxout . Also, the size of convolutional filters is fixed within each stage, which limits our network from incorporating multi-scale information as in the inception module . However, we note that all the encoding-based approaches have such limitations. Our approach can be easily modified to include more types of layers and more flexible inter-layer connections. As shown in experiments, we can achieve competitive recognition performance using merely these basic building blocks.

As shown in a recent published work using reinforcement learning to explore neural architecture , this type of methods often require heavy computation to traverse the huge solution space. Fortunately, our method can be easily generalized and scaled up, which is done via learning the architecture on a small dataset and transfer the learned information to large-scale datasets. Please refer to the experimental part for details.

2 Genetic Operations

The flowchart of the genetic process is shown in Algorithm 1. It starts with an initialized generation of $N$ randomized individuals. Then, we perform $T$ rounds, or $T$ generations, each of which consists of three operations, i.e., selection, mutation and crossover. The fitness function of each individual is evaluated via training-from-scratch on the reference dataset.

As we shall see in Section 4.1.3, different strategies of initialization do not impact the genetic performance too much. Even starting with a naive initialization (all individuals are all-zero strings), the genetic process can discover quite competitive structures with crossover and mutation.

2.2 Selection

2.3 Mutation and Crossover

2.4 Evaluation

Experiments

The proposed genetic algorithm requires a very large amount of computational resources, which makes it intractable to be directly evaluated on large-scale datasets such as ILSVRC2012 . Our solution is to explore promising network structures on small datasets such as MNIST and CIFAR10 , then transfer these structures to the large-scale recognition tasks.

The MNIST dataset defines a handwritten digit recognition task. There are $60\rm{,}000$ images for training, and $10\rm{,}000$ images for testing, all of them are $28\times 28$ grayscale images. Both training and testing data are uniformly distributed over $10$ categories, i.e., digits from to $9$ . To avoid using the testing data, we leave $10\rm{,}000$ images from the training set for validation.

We follow the basic LeNet for MNIST recognition. The original network is abbreviated as: {spverbatim} C5@20-MP2S2-C5@50-MP2S2-FC500-D0.5-FC10. Here, C5@20 is a convolutional layer with a kernel size $5$ , a default spatial stride $1$ and the number of kernels $20$ ; MP2S2 is a max-pooling layer with a kernel size $2$ and a spatial stride $2$ , FC500 is a fully-connected layer with $500$ outputs, and D0.5 is a Dropout layer with a drop ratio $0.5$ . We apply $20$ training epochs with learning rate $10^{-3}$ , followed by $4$ epochs with learning rate $10^{-4}$ , and another $1$ epoch with learning rate $10^{-5}$ .

We set ${S}={2}$ , ${\left(K_{1},K_{2}\right)}={\left(3,5\right)}$ , and keep the fully-connected part of LeNet unchanged. The first convolutional layer within each stage remains the same as in the original LeNet, and other convolutional layers take the kernel size $3\times 3$ and the same channel number. The length $L$ of each binary string is $13$ , which means that there are ${2^{13}={8\rm{,}192}}$ possible individuals.

Results are summarized in Table 1. With the genetic operations, we can find competitive network structures which achieve high recognition accuracy. Although over a short period the recognition rate of the best individual is not improved, the average and medium accuracies generally get higher from generation to generation. This is very important, because it guarantees the genetic algorithm improves the overall quality of the individuals. According to our diagnosis in Section 4.1.2, this is very important for the genetic process, since the quality of a new individual is positively correlated to the quality of its parent(s). After $50$ generations, the recognition error rate of the best individual drops from $0.41\%$ to $0.34\%$ .

1.2 Diagnosis

We perform diagnostic experiments to verify the hypothesis, that a better individual is more likely to generate a good individual via mutation or crossover. For this, we randomly select several occurrences of mutation and crossover in the CIFAR10 genetic process, and observe the relationship between an individual and its parent(s). Figure 3 supports our point. We argue that the genetic operations tend to preserve a fraction of the good local properties, so that the excellent “genes” from the parent(s) are more likely to be preserved.

1.3 Initialization Issues

Finally, we observe the impact of different initialized networks. For this, we start a naive population with ${N}={20}$ all-zero individuals, and use the same parameters for a complete genetic process. Results are shown in Figure 4. We find that, although the all-zero string corresponds to a very simple and less competitive network structure, the genetic algorithm is able to generate strong individuals after several generations. This naive initialization achieves the initial performance of randomized individuals with about $10$ generations. After about $30$ generations, there is almost no difference, by statistics, between these two populations.

2 CIFAR10 Experiments

The CIFAR10 dataset is a subset of the $80$ -million tiny image database . There are $50\rm{,}000$ images for training, and $10\rm{,}000$ images for testing, all of them are $32\times 32$ RGB images. CIFAR10 contains $10$ basic categories, and both training and testing data are uniformly distributed over these categories. To avoid using the testing data, we leave $10\rm{,}000$ images from the training set for validation.

We follow a revised LeNet for CIFAR10 recognition. The original network is abbreviated as: {spverbatim} C5(P2)@8-MP3(S2)-C5(P2)@16-MP3(S2)- C5(P2)@32-MP3(S2)-FC128-D0.5-FC10. Note that we significantly reduce the filter numbers at each stage to accelerate the training phase. We will show later that this does not prevent the genetic process from learning promising network structures. We apply $120$ training epochs with learning rate $10^{-2}$ , followed by $60$ epochs with learning rate $10^{-3}$ , $40$ epoch with learning rate $10^{-4}$ and another $20$ epoch with learning rate $10^{-5}$ .

We keep the fully-connected part of the above network unchanged, and set ${S}={3}$ and ${\left(K_{1},K_{2},K_{3}\right)}={\left(3,4,5\right)}$ . Similarly, the first convolutional layer within each stage remains the same as in the original LeNet, and other convolutional layers take the kernel size $3\times 3$ and the same channel number. The length $L$ of each binary string is $19$ , which means that there are ${2^{19}={524\rm{,}288}}$ possible individuals.

We perform two individual genetic processes. The results of one process are summarized in Table 2. As in the MNIST experiments, all the important statistics (e.g., average and median accuracies) grow from generation to generation. We also report the best network structures in the table, and visualize the best structures throughout these two processes in Figure 5.

2.2 Comparison to Densely-Connected Nets

Under our encoding scheme, a straightforward way of network construction is to set all bits to be $1$ , which leads to a network in which any two layers within the same stage are connected. This network produces a $76.84\%$ recognition rate, which is a little bit lower than those reported in Table 2. Considering that the densely-connected network requires heavier computational overheads, we conclude that the genetic algorithm helps to find more effective and efficient structures than the dense connections.

2.3 Observation

In Figure 5, we plot the the network structures learned from two individual genetic processes. The structures learned by the genetic algorithm are quite different from the manually designed ones, although some manually designed local structures are observed, like the chain-shaped networks, multi-path networks and highway networks. We emphasize that these two networks are obtained by independent genetic processes, which demonstrates that our genetic process generally converges to similar network structures.

3 CIFAR and SVHN Experiments

We apply the networks learned from the CIFAR10 experiments to more small-scale datasets. We test three datasets, i.e., CIFAR10, CIFAR100 and SVHN. CIFAR100 is an extension to CIFAR10 containing $100$ finer-grained categories. It has the same numbers of training and testing images as CIFAR10, and these images are also uniformly distributed over $100$ categories.

SVHN (Street View House Numbers) is a large collection of $32\times 32$ RGB images, i.e., $73\rm{,}257$ training samples, $26\rm{,}032$ testing samples, and $531\rm{,}131$ extra training samples. We preprocess the data as in the previous work , i.e., selecting $400$ samples per category from the training set as well as $200$ samples per category from the extra set, using these $6\rm{,}000$ images for validation, and the remaining $598\rm{,}388$ images as training samples. We also use Local Contrast Normalization (LCN) for preprocessing .

We evaluate the best network structure in each generation of the genetic process. We resume using a large number of filters at each stage, i.e., the three stages and the first fully-connected layer are equipped with $64$ , $128$ , $256$ and $1024$ filters, respectively. The training strategy, include the numbers of epochs and learning rates, remains the same as in the previous experiments.

We compare our results with some state-of-the-art methods in Table 3. Among these competitors, we note that the densely-connected network is closely related to our work. Although GeNet ( $17$ layers) produces lower recognition accuracy, we note that the structures used in are much deeper (e.g., $40$ – $100$ layers). Since dense connection is often not the best option (see Section 4.2.2), we believe that it is possible to use the genetic algorithm to optimize the connections used in .

4 ILSVRC2012 Experiments

Results are summarized in Table 4. We can see that, in general, structures learned from a small dataset (CIFAR10) can be transferred to large-scale visual recognition (ILSVRC2012). Our model achieves better performance than VGGNet, because the original three chain-shaped stages are replaced by the automatically learned structures.

Conclusions

This paper applies the genetic algorithm to designing the structure of deep neural networks. We first propose an encoding method to represent each network structure with a fixed-length binary string, then uses some popular genetic operations such as mutation and crossover to explore the search space efficiently. Different initialization strategies do not make much difference on the genetic process. We conduct the genetic algorithm with a relatively small reference dataset (CIFAR10), and find that the generated structures transfer well to other tasks, including the large-scale ILSVRC2012 dataset.

Despite the interesting results we have obtained, our algorithm suffers from several drawbacks. First, a large fraction of network structures are still unexplored, including those with non-convolutional modules like Maxout , and the multi-scale strategy used in the inception module . Second, in the current work, the genetic algorithm is only used to explore the network structure, whereas the network training process is performed separately. It would be very interesting to incorporate the genetic algorithm to training the network structure and weights simultaneously. These directions are left for future work.

Acknowledgements

We thank John Flynn, Wei Shen, Chenxi Liu and Siyuan Qiao for instructive discussions.