Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing

Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, Dharmendra S. Modha

Abstract

Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that i) approach state-of-the-art classification accuracy across 8 standard datasets, encompassing vision and speech, ii) perform inference while preserving the hardware’s underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1200 and 2600 frames per second and using between 25 and 275 mW (effectively > 6000 frames / sec / W) and iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. For the first time, the algorithmic power of deep learning can be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.

Approach

Here, we provide a description of the relevant elements of deep convolutional networks and the TrueNorth neuromorphic chip, and describe how the essence of the former can be realized on the latter.

A deep convolutional network is a multilayer feedforward neural network, whose input is typically image-like and whose layers are neurons that collectively perform a convolutional filtering of the input or a prior layer (Figure 1). Neurons within a layer are arranged in two spatial dimensions, corresponding to shifts in the convolution filter, and one feature dimension, corresponding to different filters. Each neuron computes a summed weighted input, $s$, as

$$s = \sum_{i,j} \sum_{f} x_{i,j,f}\, w_{i,j,f},$$

where $\textbf{x}=\{x_{i,j,f}\}$ are the neuron’s input pixels or neurons, $\textbf{w}=\{w_{i,j,f}\}$ are the filter weights, $i$, $j$ are over the topographic dimensions, and $f$ is over the feature dimension or input channels. Batch normalization can be used to zero center $s$ and normalize its standard deviation to 1, following

$$r = \frac{s-\mu}{\sigma+\epsilon} + b, \tag{1}$$

where $r$ is the filter response, $b$ is a bias term, $\epsilon=10^{-4}$ provides numerical stability, and $\mu$ and $\sigma$ are the mean and standard deviation of $s$ computed per filter using all topographic locations and examples in a data batch during training, or using the entire training set during inference. Final neuron output is computed by applying a non-linear activation function to the filter response, typically a rectified linear unit that sets negative values to 0. In a common scheme, features in the last layer are each assigned a label, such as prediction class, and vote to formulate network output.
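As an illustration of the per-neuron computation described above, the following is a minimal sketch in plain Python. It assumes the normalization takes the form $r = (s-\mu)/(\sigma+\epsilon) + b$, which is consistent with the leak expression used later in the text; the function names and the example patch are hypothetical and chosen only for this sketch.

```python
def summed_input(x, w):
    """Sum x[i][j][f] * w[i][j][f] over rows i, columns j and features f."""
    return sum(
        x[i][j][f] * w[i][j][f]
        for i in range(len(w))
        for j in range(len(w[0]))
        for f in range(len(w[0][0]))
    )


def filter_response(s, mu, sigma, b, eps=1e-4):
    """Batch-normalized filter response (assumed form, see lead-in)."""
    return (s - mu) / (sigma + eps) + b


def relu(r):
    """Rectified linear unit: negative values are set to 0."""
    return max(0.0, r)


# Hypothetical 2 x 2 patch with 2 input features.
x = [[[1, 0], [0, 1]], [[1, 1], [0, 0]]]
w = [[[1, -1], [0, 1]], [[-1, 0], [1, 1]]]

s = summed_input(x, w)
r = filter_response(s, mu=0.5, sigma=2.0, b=0.1)
y = relu(r)
```

Here $\mu$ and $\sigma$ are passed in as plain numbers; in training they would be batch statistics computed per filter.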

Deep networks are trained using the backpropagation learning rule. This procedure involves iteratively i) computing the network’s response to a batch of training examples in a forward pass, ii) computing the error between the network’s output and the desired output, iii) using the chain rule to compute the error gradient at each synapse in a backward pass, and iv) making a small change to each weight along this gradient so as to reduce error.

2 TrueNorth

A TrueNorth chip consists of a network of neurosynaptic cores with programmable connectivity, synapses, and neuron parameters (Figure 2). Connectivity between neurons follows a block-wise scheme: each neuron can connect to an input line of any one core in the system, and from there to any neuron on that core through its local synapses. All communication to, from, and within the chip is performed using spikes.

TrueNorth neurons use a variant of an integrate-and-fire model with 23 configurable parameters where a neuron’s state variable, $V(t)$, updates each tick, $t$ (typically at 1000 ticks per second, though higher rates are possible), according to

$$V(t+1) = V(t) + \sum_{i} \hat{x}_{i}(t)\, w_{i} + L, \tag{2}$$

where $\bm{\hat{x}}(t)=\{\hat{x}_{i}\}$ are the neuron’s spiking inputs, $\textbf{w}=\{w_{i}\}$ are its corresponding weights, $L$ is its leak chosen from $\{-255,-254,...,255\}$, and $i$ is over its inputs. If $V(t)$ is greater than or equal to a threshold $\theta$, the neuron emits a spike and resets using one of several reset modes, including resetting to 0. If $V(t)$ is below a lower bound, it can be configured to snap to that bound.
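The neuron dynamics described above can be sketched as a single-tick update in plain Python. This is a simplified illustration that assumes the reset-to-0 mode and an enabled lower bound; it covers only a handful of the configurable parameters, and the function name and example values are hypothetical.

```python
def tick(v, spikes, weights, leak, threshold, lower_bound=0):
    """Advance one neuron by one tick.

    Integrates weighted input spikes plus leak, then either spikes and
    resets to 0, or snaps to the lower bound if the potential fell below it.
    Returns (new_potential, spiked).
    """
    v = v + sum(x * w for x, w in zip(spikes, weights)) + leak
    if v >= threshold:
        return 0, True  # reset-to-0 mode (one of several reset modes)
    if v < lower_bound:
        v = lower_bound  # optional snapping behaviour, assumed enabled here
    return v, False


# Hypothetical neuron with three inputs.
weights = [1, -1, 1]
inputs_per_tick = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]

v = 0
spike_train = []
for spikes in inputs_per_tick:
    v, spiked = tick(v, spikes, weights, leak=-1, threshold=2)
    spike_train.append(spiked)
```

In this trace the potential reaches 1 on the first tick, crosses the threshold on the second tick and resets, and is clamped at the lower bound on the third.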

Synapses have individually configurable on/off states and have a strength assigned by look-up table. Specifically, each neuron has a 4-entry table parameterized with values in the range $\{-255,-254,...,255\}$, each input line to a core is assigned an input type of 1, 2, 3 or 4, and each synapse then determines its strength by using the input type on its source side to index into the table of the neuron on its target side. (It should be noted that our approach can easily be adapted to hardware with other synaptic representation schemes.) In this work, we only use 2 input types, corresponding to synapse strengths of $-1$ and $1$, described in the next section.
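The look-up scheme can be illustrated with a short Python sketch. The data layout (a list per neuron, a list of input types per core) is an assumption made for illustration and is not the hardware's configuration format.

```python
def validate_table(table):
    """A neuron's table has 4 entries, each an integer in [-255, 255]."""
    return len(table) == 4 and all(
        isinstance(v, int) and -255 <= v <= 255 for v in table
    )


def synapse_strength(connected, input_type, neuron_table):
    """Strength of one synapse: 0 if off, else the table entry for the input type."""
    if not connected:
        return 0
    return neuron_table[input_type - 1]  # input types are numbered 1..4


# Hypothetical core fragment: four input lines feeding one neuron.
neuron_table = [1, -1, 0, 0]             # only types 1 and 2 are used in this work
line_types = [1, 2, 1, 2]                # input type assigned to each input line
synapse_on = [True, False, False, True]  # on/off state of each synapse

strengths = [
    synapse_strength(on, t, neuron_table)
    for on, t in zip(synapse_on, line_types)
]
```

Because the input type belongs to the input line and the table belongs to the neuron, two neurons on the same core can assign different strengths to the same line.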

3 Mapping Deep Convolutional Networks to TrueNorth

By appropriately designing the structure, neurons, network input, and weights of convolutional networks during training, it is possible to efficiently map those networks to neuromorphic hardware.

Network structure is mapped by partitioning each layer into 1 or more equally sized groups along the feature dimension, where each group applies its filters to a different, non-overlapping, equally sized subset of layer input features. (Feature groups were originally used by AlexNet, which split the network to run on 2 parallel GPUs during training. The use of grouping is expanded upon considerably in this work.) Layers are designed such that the total filter size (rows $\times$ columns $\times$ features) of each group is less than or equal to the number of input lines available per core, and the number of output features is less than or equal to the number of neurons per core. This arrangement allows 1 group’s features, filters, and filter support region to be implemented using 1 core’s neurons, synapses, and input lines, respectively (Figure 3A). Total filter size was further limited to 128 here, to support trinary synapses, described below. For efficiency, multiple topographic locations for the same group can be implemented on the same core. For example, by delivering a $4\times 4\times 8$ region of the input space to a single core, that core can be used to implement overlapping filters of size $3\times 3\times 8$ for 4 topographic locations.
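The sizing rules above reduce to simple arithmetic, sketched below in Python. The per-core resource counts of 256 input lines and 256 neurons are not stated in this section and are assumed here from public descriptions of TrueNorth; the helper names are hypothetical, and neurons consumed by copies are ignored.

```python
# Assumed per-core resources (not stated in the text above).
INPUT_LINES_PER_CORE = 256
NEURONS_PER_CORE = 256
TRINARY_FILTER_LIMIT = 128  # total filter size limit quoted in the text


def filter_size(rows, cols, features):
    """Total filter size: rows x columns x features."""
    return rows * cols * features


def locations_in_region(region_rows, region_cols, filter_rows, filter_cols, stride=1):
    """Number of filter placements that fit inside a delivered input region."""
    fit_rows = (region_rows - filter_rows) // stride + 1
    fit_cols = (region_cols - filter_cols) // stride + 1
    return fit_rows * fit_cols


def group_fits_core(rows, cols, features_in, features_out, locations=1):
    """Check one group against the input-line and neuron budgets of a core."""
    limit = min(INPUT_LINES_PER_CORE, TRINARY_FILTER_LIMIT)
    size_ok = filter_size(rows, cols, features_in) <= limit
    neurons_ok = features_out * locations <= NEURONS_PER_CORE
    return size_ok and neurons_ok


# The example from the text: a 4 x 4 x 8 region serving 3 x 3 x 8 filters.
region_lines = filter_size(4, 4, 8)
n_locations = locations_in_region(4, 4, 3, 3)
fits = group_fits_core(3, 3, 8, features_out=64, locations=n_locations)
too_big = group_fits_core(3, 3, 16, features_out=64)
```

Under these assumptions the $4\times 4\times 8$ region occupies 128 input lines, which is the stated limit, and a $3\times 3\times 16$ filter (size 144) would exceed it.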

Where filters implemented on different cores are applied to overlapping regions of the input space, the corresponding input neurons must target multiple cores, which is not explicitly supported by TrueNorth. In such instances, multiple neurons on the same core are configured with identical synapses and parameters (and thus will have matching output), allowing distribution of the same data to multiple targets. If insufficient neurons are available on the same core, a feature can be “split” by connecting it to a core with multiple neurons configured to spike whenever they receive an input spike from that feature. Neurons used in either duplication scheme are referred to here as copies (Figure 3B).

3.2 Neurons

To match the use of spikes in hardware, we employ a binary representation scheme for data throughout the network. (Schemes that use higher precision are possible, such as using the number of spikes generated in a given time window to represent data (a rate code). However, we observed the best accuracy for a given energy budget by using the binary scheme described here.) Neurons in the convolutional network use the activation function

$$y = \begin{cases} 1 & \text{if } r \geq 0, \\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

where $y$ is neuron output and $r$ is the neuron filter response (Equation 1). By configuring TrueNorth neurons such that i) $L=\lceil b(\sigma+\epsilon)-\mu\rceil$, where $L$ is the leak from Equation 2 and the remaining variables are the normalization terms from Equation 1, which are computed from training data offline, ii) threshold ($\theta$ in Equation 2) is 1, iii) reset is to 0 after spiking, and iv) the lower bound on the membrane potential is 0, their behavior exactly matches that in Equation 3 (Figure 3C). Conditions iii and iv ensure that $V(t)$ is 0 at the beginning of each image presentation, allowing for 1 classification per tick using pipelining.
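The correspondence between the trained binary neuron and the hardware configuration can be checked numerically. The sketch below assumes the filter response $r = (s-\mu)/(\sigma+\epsilon) + b$ and a step activation that outputs 1 when $r \geq 0$; the normalization values are hypothetical. They are chosen so that $b(\sigma+\epsilon)-\mu$ is not an integer, so the boundary case $r = 0$ is not exercised here.

```python
import math


def binary_activation(s, mu, sigma, b, eps=1e-4):
    """Software neuron: step function applied to the normalized response."""
    r = (s - mu) / (sigma + eps) + b
    return 1 if r >= 0 else 0


def leak_from_batchnorm(mu, sigma, b, eps=1e-4):
    """Leak L = ceil(b * (sigma + eps) - mu), computed offline."""
    return math.ceil(b * (sigma + eps) - mu)


def hardware_neuron(s, leak, threshold=1):
    """Hardware neuron starting from V = 0: spike if s + L reaches the threshold."""
    v = 0 + s + leak
    return 1 if v >= threshold else 0


# Hypothetical normalization terms for one filter.
mu, sigma, b = 2.3, 1.5, 0.4
leak = leak_from_batchnorm(mu, sigma, b)

# With trinary weights and binary inputs, s is an integer; the range is
# bounded by the 128-input filter size limit.
mismatches = [
    s for s in range(-128, 129)
    if binary_activation(s, mu, sigma, b) != hardware_neuron(s, leak)
]
```

For these values the leak is $-1$, and both neurons fire exactly when $s \geq 2$.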

3.3 Network input

Network inputs are typically represented with multi-bit channels (for example, 8-bit RGB channels). Directly converting the state of each bit into a spike would result in an unnatural neural encoding since each bit represents a different value (for example, the most-significant-bit spike would carry a weight of 128 in an 8-bit scheme). Here, we avoid this awkward encoding altogether by converting the high precision input into a spiking representation using convolution filters with the binary output activation function described in Equation 3. This process is akin to the transduction that takes place in biological sensory organs, such as the conversion of brightness levels into single spikes representing spatial luminance gradients in the retina.

3.4 Weights

While TrueNorth does not directly support trinary weights, they can be simulated by using neuron copies such that a feature’s output is delivered in pairs to its target cores. One member of the pair is assigned input type 1, which corresponds to a $+1$ in every neuron’s lookup table, and the second input type 2, which corresponds to a $-1$. By turning on neither, one, or the other of the corresponding synaptic connections, a weight of 0, $+1$ or $-1$ can be created (Figure 3D). To allow us to map into this representation, we restrict synaptic weights in the convolutional network to these same trinary values.
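The paired-input encoding can be sketched as follows in Python. The representation of a synapse pair as two booleans is an illustrative assumption, and the function names are hypothetical.

```python
# Lookup-table entries shared by every neuron in this scheme.
TABLE = {1: +1, 2: -1}


def encode_trinary(weight):
    """Map a trinary weight to the on/off states of its (type 1, type 2) synapses."""
    return {+1: (True, False), -1: (False, True), 0: (False, False)}[weight]


def effective_weight(type1_on, type2_on):
    """Weight seen by the neuron when a spike arrives on both lines of the pair."""
    return (TABLE[1] if type1_on else 0) + (TABLE[2] if type2_on else 0)


def neuron_input(spikes, weights):
    """Summed input for binary feature spikes and trinary weights."""
    total = 0
    for spike, weight in zip(spikes, weights):
        type1_on, type2_on = encode_trinary(weight)
        total += spike * effective_weight(type1_on, type2_on)
    return total


recovered = [effective_weight(*encode_trinary(w)) for w in (-1, 0, 1)]
total = neuron_input(spikes=[1, 1, 0, 1], weights=[1, 1, -1, -1])
```

Each feature occupies two input lines in this scheme, so 128 trinary inputs would use 256 lines, which may explain the filter size limit of 128 mentioned earlier.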

3.5 Training

Constraints on receptive field size and features per group, the use of binary neurons and use of trinary weights are all employed during training. As the binary-valued neuron used here has a derivative of $\infty$ at 0, and 0 everywhere else, which is not amenable to backpropagation, we instead approximate its derivative as being 1 at 0 and linearly decaying to 0 in the positive and negative direction according to

$$\frac{\partial y}{\partial r} \approx \max(0,\, 1-|r|), \tag{4}$$

where $r$ is the filter response and $y$ is the neuron output. Weight updates are applied to a high precision hidden value, $w_{h}$, which is bounded in the range $-1$ to $1$ by clipping, and mapped to the trinary value used for the forward and backward pass by rounding with hysteresis according to

$$w(t) = \begin{cases} -1 & \text{if } w_{h}(t) \leq -0.5-h, \\ 0 & \text{if } -0.5+h \leq w_{h}(t) \leq 0.5-h, \\ 1 & \text{if } w_{h}(t) \geq 0.5+h, \\ w(t-1) & \text{otherwise,} \end{cases} \tag{5}$$

where $h$ is a hysteresis parameter set to 0.1 here. (This rule is similar to the recent results from BinaryNet, but was developed independently here in this work. Our specific neuron derivative and use of hysteresis are unique.) The hidden weights allow each synapse to flip between discrete states based on subtle differences in the relative amplitude of error gradients measured across multiple training batches.
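A minimal Python sketch of the two training devices described above follows. It assumes the derivative approximation decays with unit slope, reaching 0 at $|r| = 1$, and that the trinary rounding boundaries sit at $\pm 0.5$, widened by the hysteresis $h$; both are assumptions of this sketch, and the weight trajectory is hypothetical.

```python
def approx_derivative(r):
    """Surrogate derivative of the binary neuron: 1 at r = 0, decaying linearly to 0."""
    return max(0.0, 1.0 - abs(r))


def clip_hidden(w_h):
    """Hidden weights are bounded to [-1, 1] by clipping."""
    return max(-1.0, min(1.0, w_h))


def round_with_hysteresis(w_h, w_prev, h=0.1):
    """Map a hidden weight to -1, 0 or 1, keeping the previous value inside the dead bands."""
    if w_h <= -0.5 - h:
        return -1
    if -0.5 + h <= w_h <= 0.5 - h:
        return 0
    if w_h >= 0.5 + h:
        return 1
    return w_prev


# Hypothetical hidden-weight trajectory drifting up through 0.5 and back down.
trajectory = [0.45, 0.55, 0.65, 0.55, 0.45, 0.35]
w = 0
trinary = []
for w_h in trajectory:
    w = round_with_hysteresis(clip_hidden(w_h), w)
    trinary.append(w)
```

The trace shows the intended effect: the weight switches to 1 only after the hidden value passes 0.6 and returns to 0 only after it drops below 0.4, so small fluctuations around 0.5 do not cause repeated flips.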

We employ standard heuristics for training, including momentum (0.9), weight decay ($10^{-7}$), and decreasing learning rate (dropping by $10\times$ twice during training). We further employ a spike sparsity pressure by adding $\gamma\frac{1}{2}\sum\bar{y}^{2}$ to the cost function, where $\bar{y}$ is average feature activation, the summation is over all features in the network, and $\gamma$ is a parameter, set to $10^{-4}$ here. This serves as both a regularizer and to reduce spike traffic during deployment (and therefore reduce energy consumption). Training was performed offline on conventional GPUs, using a library of custom training layers built upon functions from the MatConvNet toolbox. Network specification and training complexity using these layers is on par with standard deep learning.
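The spike sparsity term is simple enough to write out directly. The sketch below computes the penalty and its gradient with respect to each average activation; how that gradient is propagated into individual activations is left out, and the activation values are hypothetical.

```python
def sparsity_penalty(mean_activations, gamma=1e-4):
    """Cost term gamma * (1/2) * sum of squared average feature activations."""
    return gamma * 0.5 * sum(a * a for a in mean_activations)


def sparsity_gradient(mean_activations, gamma=1e-4):
    """Derivative of the penalty with respect to each average activation."""
    return [gamma * a for a in mean_activations]


# Hypothetical average activations (fraction of presentations on which each feature spikes).
mean_activations = [0.5, 0.1, 0.0]

penalty = sparsity_penalty(mean_activations)
gradient = sparsity_gradient(mean_activations)
```

Since the gradient is proportional to the average activation, features that spike often are pushed down hardest, while silent features are unaffected.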

3.6 Deployment

The parameters learned through training are mapped to hardware using reusable, composable hardware description functions called corelets. The corelets created for this work automatically compile the learned network parameters, which are independent of any neuromorphic platform, into a platform-specific hardware configuration file that can directly program TrueNorth chips.

Results

We applied our approach to 8 image and audio benchmarks by creating 5 template networks using 0.5, 1, 2, 4 or 8 TrueNorth chips (Tables 1 and 2 and Figure 4). (Additional network sizes for the audio datasets (VAD, TIMIT classification, TIMIT frames) were created by adjusting features per layer or removing layers.) Testing was performed at 1 classification per hardware tick.

Three layer configurations were especially useful in this work, though our approach supports a variety of other parameterizations. First, spatial filter layers employ patch size $3\times 3\times 8$ and stride 1, allowing placement of 4 topographic locations per core. Second, network-in-network layers employ patch size $1\times 1\times 128$ and stride of 1, allowing each filter to span a large portion of the incoming feature space, thereby helping to maintain network integration. Finally, pooling layers employ standard convolution layers with patch size $2\times 2\times 32$ and stride 2, thereby resulting in non-overlapping patches that reduce the need for neuron copies.

We found that using up to 16 channels for the transduction layer (Figure 5) gave good performance at a low bandwidth. For multi-chip networks we used additional channels, presupposing additional bandwidth in larger systems. As smaller networks required less regularization, weight decay was not employed for networks smaller than 4 chips, and spike sparsity pressure was not used for networks half chip size or less.

2 Hardware

To characterize performance, all networks that fit on a single chip were run in TrueNorth hardware. Multi-chip networks were run in simulation, pending forthcoming infrastructure for interconnecting chips. Single-chip classification accuracy and throughput were measured on the NS1e development board (Figure 2B), but power was measured on a separate NS1t test and characterization board (not shown), using the same supply voltage of 1.0 V on both boards, since the current NS1e board is not instrumented to measure power and the NS1t board is not designed for high throughput. Total TrueNorth power is the sum of i) leakage power, computed by measuring idle power on NS1t and scaling by the fraction of the chip’s cores used by the network, and ii) active power, computed by measuring total power during classification on NS1t, subtracting idle power, and scaling by the classification throughput (FPS) measured on NS1e. (Active energy per classification does not change as the chip’s tick runs faster or slower as long as the voltage is the same (as in the experiments here) because the same number of transistors switch independent of the tick duration.) For hardware measurement, our focus was to characterize operation on the TrueNorth chip as a component in a future embedded system. Such a system will also need to consider capabilities and energy requirements of sensors, transduction, and off-chip communication, which requires hardware choices that are application specific and are not considered here.
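One way to read the power calculation above is sketched below in Python. It interprets "scaling by the classification throughput" as converting the active power measured on NS1t into energy per classification and multiplying by the NS1e frame rate, which is an assumption of this sketch. The core count of 4096 per chip is taken from public descriptions of TrueNorth rather than from this section, and all numeric inputs are made-up illustrative values, not measurements.

```python
def truenorth_power(idle_power_w, total_power_w, fps_ns1t, fps_ns1e,
                    cores_used, cores_total=4096):
    """Estimate total power in watts as leakage plus active power.

    leakage: idle power scaled by the fraction of cores the network uses.
    active:  energy per classification on NS1t times throughput on NS1e.
    """
    leakage = idle_power_w * cores_used / cores_total
    active_energy_per_frame = (total_power_w - idle_power_w) / fps_ns1t
    active = active_energy_per_frame * fps_ns1e
    return leakage + active


# Illustrative inputs only (hypothetical, not taken from the paper).
power_w = truenorth_power(
    idle_power_w=0.100,
    total_power_w=0.120,
    fps_ns1t=500,
    fps_ns1e=1500,
    cores_used=2048,
)
frames_per_sec_per_watt = 1500 / power_w
```

The scaling step relies on the observation that active energy per classification is independent of tick rate at a fixed voltage.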

3 Performance

Table 3 and Figure 6 show our results for all 8 datasets and a comparison with state-of-the-art approaches, with measured power and classifications per energy (Frames/Sec/Watt) reported for single-chip networks. It is known that augmenting training data through manipulations such as mirroring can improve scores on test data, but this adds complexity to the overall training process. To maintain focus on the algorithm presented here, we use neither training set augmentation nor dropout, and so compare our results to other works that also do not use data augmentation. Our experiments show that for almost all of the benchmarks, a single-chip network is sufficient to come within a few percent of state-of-the-art accuracy. Increasing to up to 8 chips improved accuracy by several percentage points, and in the case of the VAD dataset surpassed state-of-the-art performance.

Discussion

Our work demonstrates that the structural and operational differences between neuromorphic computing and deep learning are not fundamental, and points to the richness of neural network constructs and the adaptability of backpropagation. This marks an important step towards a new generation of applications based on embedded neural networks.

These results help to validate the neuromorphic approach, which is to provide an efficient yet flexible substrate for spiking neural networks, instead of targeting a single application or network structure. Indeed, the specification for TrueNorth and a prototype chip were developed in 2011, before the recent resurgence of convolutional networks in 2012. Not only is TrueNorth capable of implementing these convolutional networks, which it was not originally designed for, but it also supports a variety of connectivity patterns (feedback and lateral, as well as feedforward) and can simultaneously implement a wide range of other algorithms. We envision running multiple networks on the same TrueNorth chip, enabling composition of end-to-end systems encompassing saliency, classification, and working memory.

We see several avenues of potentially fruitful exploration for future work. Several recent innovations in unconstrained deep learning that may be of value for the neuromorphic domain include deeply supervised networks and modified gradient optimization rules. The approach used here applies hardware constraints from the beginning of training, that is, constrain-then-train, but innovation may also come from constrain-while-train approaches, where training initially begins in an unconstrained space, but constraints are gradually introduced during training. Finally, co-design between algorithms and future neuromorphic architectures promises even better accuracy and efficiency.

References