Neuromorphic Deep Learning Machines

Emre Neftci, Charles Augustine, Somnath Paul, Georgios Detorakis

Introduction

Biological neurons and synapses can provide the blueprint for inference and learning machines that are potentially a thousandfold more energy efficient than mainstream computers. However, the breadth of application and scale of present-day neuromorphic hardware remain limited, mainly due to a lack of general and efficient inference and learning algorithms compliant with the spatial and temporal constraints of the brain. Machine learning and deep learning are well poised for solving a broad set of applications using neuromorphic hardware, thanks to their general-purpose, modular, and fault-tolerant nature (Esser et al.; Neftci et al.; Lee et al.). One outstanding question is whether the learning phase in deep neural networks can be efficiently carried out in neuromorphic hardware. Performing learning on-the-fly in less controlled environments where no prior, representative dataset exists confers more fine-grained context awareness to behaving cognitive agents. However, deep learning usually relies on the immediate availability of network-wide information stored with high-precision memory. In digital computers, the access to this information funnels through the von Neumann bottleneck, which dictates the fundamental limits of the computing substrate. Distributing computations across multiple cores (such as in GPUs) is a popular solution to mitigate this problem, but even there the scalability of gradient backpropagation (BP) is often limited by its memory-intensive operations (Zhu et al.).

The implementation of gradient BP on a neural substrate is even more challenging (Baldi et al.; Lee et al.; Grossberg) because it requires 1) using synaptic weights in the backward pass that are identical to those of the forward pass (the symmetric weights requirement, also known as the weight transport problem), 2) carrying out the operations involved in BP, including multiplications with derivatives and activation functions, 3) propagating error signals with high precision, 4) alternating between forward and backward passes, 5) changing the sign of synaptic weights, and 6) availability of targets (labels). The essence of these challenges is that gradient BP requires precise linear and non-linear computations and, more importantly, information that is not local to the computational building blocks in a neural substrate, meaning that special communication channels must be provisioned. Whether a given operation is local or not depends on the physical substrate that carries out the computations. For example, while symmetric weights in neural networks are compatible with von Neumann architectures (and even desirable, since weights in both directions are shared), the same is not true in a distributed system such as the brain: elementary computing units do not have bidirectional connections with the same weight in each direction. Since neuromorphic implementations generally assume dynamics closely related to those in the brain, requirements (1-4) above also hinder efficient implementations of BP in neuromorphic hardware.

Although previous work O’Connor and Welling ; Lee et al. ; Lillicrap et al. overcomes some of the fundamental difficulties of gradient BP listed above in spiking networks, here we tackle all of the key difficulties using event-driven random BP (eRBP), a learning rule for deep spiking neural networks achieving classification accuracies that are similar to those obtained in artificial neural networks, potentially running on a fraction of the energy budget with dedicated neuromorphic hardware.

eRBP builds on the recent advances in approximate forms of the gradient BP rule (Lee et al.; Liao et al.; Lillicrap et al.; Baldi et al.) for training spiking neurons of the type used in neuromorphic hardware to perform supervised learning. These approximations solve the non-locality problem by replacing BP weights with random ones, leading to remarkably little loss in classification performance on benchmark tasks (Lillicrap et al.; Baldi et al.) (requirement 1 above). Although a general theoretical understanding of random BP (RBP) is still lacking, extended simulations and analyses of linear networks show that, during learning, the network adjusts its feed-forward weights to learn an approximation of the pseudo-inverse of the (random) feedback weights, which is equally good at communicating gradients. eRBP is an asynchronous (event-driven) adaptation of random BP that can be tightly embedded in the dynamics of dual-compartment integrate-and-fire (I&F) neurons and that costs one addition and two comparisons per synaptic weight update. Extended experiments show that the spiking nature of neuromorphic hardware and the lack of general linear and non-linear computations at the neuron do not prevent accurate learning on classification tasks (requirements 2 and 3), and that eRBP operates continuously and asynchronously without alternating forward and backward passes (requirement 4). Additional experimental evidence shows that eRBP is robust to fixed-width representations with limited neural and synaptic state precision, making it suitable for dedicated digital hardware. The success of eRBP lays out the foundations of neuromorphic deep learning machines, and paves the way for learning with streaming spike-event data in neuromorphic platforms at artificial neural network proficiencies. We demonstrate this in simulation of a custom digital neuromorphic processor using fixed-point, discrete-time dynamics.

Results

The central contribution of this article is event-driven RBP (eRBP), a presynaptic spike-driven plasticity rule modulated by top-down errors and gated by the state of the postsynaptic neuron. The idea behind this additional modulation factor is motivated by supervised gradient descent learning in artificial neural networks and by biologically plausible models of three-factor plasticity rules (Urbanczik and Senn), which were argued to subserve supervised, unsupervised and reinforcement learning, an idea that was also reported in Lillicrap et al. For a given neuron, the eRBP learning dynamics for synapse $k$ can be summarized as follows:

where $S_{k}^{pre}[t]$ represents the spike train of presynaptic neuron $k$ ( $S^{pre}_{k}[t]=1$ if presynaptic neuron $k$ spiked at time step $t$ ), and $\Theta$ is the derivative of the spiking neuron’s activation function evaluated at the total synaptic input $I[t]$ . The essence of eRBP is contained in the factor $T$ . For the final output (prediction) layer, $T$ is equal to the classification error $e$ of the considered neuron, similarly to the delta rule. For hidden layers, $T$ is equal to the error projected randomly to the hidden neurons, i.e. $T=\sum_{j}g_{j}e_{j}$ , where the $g_{j}$ are random weights that are fixed during learning. We found that a boxcar function in place of $\Theta$ provides very good results, while being more amenable to hardware implementation than computing the exact derivative of the activation function. This choice is motivated by the fact that the activation function of leaky I&F neurons with an absolute refractory period can be approximated by a linear threshold unit with saturation, whose derivative is exactly the boxcar function. Using a boxcar function with boundaries $b_{min}$ and $b_{max}$ , the eRBP synaptic weight update consists of additions and comparisons only, and can be captured using the following operations for each neuron:
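Read together, the definitions above suggest an update of the form $\Delta w_{k}\propto T[t]\,\Theta(I[t])\,S_{k}^{pre}[t]$ . The following Python sketch illustrates the corresponding per-neuron operations under stated assumptions: the function name `erbp_update` and its argument layout are hypothetical, and the learning rate and sign of the update are assumed to be already folded into `T`, so that each update is a single addition guarded by two comparisons.

```python
import numpy as np

def erbp_update(w, pre_spikes, I, T, b_min, b_max):
    """Apply one time step of an eRBP-style update to one neuron's weights.

    w          : 1-D float array of incoming synaptic weights (updated in place)
    pre_spikes : 1-D boolean array, True where presynaptic neuron k spiked
    I          : total synaptic input of the postsynaptic neuron (scalar)
    T          : modulation term, assumed pre-scaled and signed (assumption)
    """
    # Boxcar gate in place of the derivative: two comparisons on I.
    if b_min < I < b_max:
        # One addition for every synapse whose presynaptic neuron spiked.
        w[pre_spikes] += T
    return w

# Usage: only synapses 0 and 2 receive a presynaptic spike.
w = np.zeros(4)
pre = np.array([True, False, True, False])
erbp_update(w, pre, I=0.5, T=-0.1, b_min=-1.0, b_max=1.0)   # inside the boxcar
erbp_update(w, pre, I=2.0, T=-0.1, b_min=-1.0, b_max=1.0)   # outside: no change
```

Because the gate depends only on the postsynaptic input and the update is triggered only by presynaptic spikes, nothing in this sketch requires information from other neurons beyond the scalar `T`.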

The eRBP rule, combined with its ability to learn deep representations with nearly equal accuracies as described below, can enable neuromorphic deep learning machines on a wide variety of tasks. In the following, we focus on an implementation tailored for digital neuromorphic designs, which assumes that some non-plastic synaptic weights can be exactly matched. Its implementation in a mixed-signal design prone to fabrication mismatch and other non-idealities is the subject of ongoing work.

Spiking Networks Equipped with eRBP Learn with High Accuracy

We demonstrate eRBP in networks consisting of one and two hidden layers trained on permutation invariant MNIST (Tab. 1), although eRBP can in theory generalize to other datasets, tasks and network architectures as well. Rather than optimizing for absolute classification performance, we compare to equivalent artificial (non-spiking) neural networks trained with RBP and standard BP, with free parameters fine-tuned to achieve the highest accuracy on the considered classification tasks (Tab. 1). On most network configurations, eRBP achieved performances equivalent to those achieved with RBP in artificial networks. When equipped with probabilistic connections (peRBP) that randomly blank out presynaptic spikes, the network performed better overall. This is because, as learning progresses, a significant portion of the neurons tend to fire near their maximum rate and synchronize their spiking activity across layers as a result of large synaptic weights (and thus presynaptic inputs). Synchronized spike activity is not well captured by the rate model assumed by eRBP (see Methods). Additive noise has a relatively small effect when the magnitude of the presynaptic input is large. However, multiplicative blank-out noise improves learning by introducing irregularity in the presynaptic spike trains even when presynaptic neurons fire regularly. Our previous work Neftci et al. suggested that the probabilistic connections implement DropConnect regularization (Wan et al.). In contrast with Wan et al., the probabilistic connections remain enabled both during learning and inference because the network dynamics depend strongly on this stochasticity. Interestingly, this type of “always-on” stochasticity was also argued to approximate Bayesian inference with Gaussian processes (Gal and Ghahramani).
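The multiplicative blank-out noise described above can be sketched as an independent Bernoulli mask applied to every connection at each time step. This is a minimal illustration rather than the authors' simulator: the function name `blank_out`, the keep probability `p_keep`, and the population sizes are hypothetical.

```python
import numpy as np

def blank_out(pre_spikes, n_post, p_keep, rng):
    """Return the (n_post, n_pre) matrix of spikes that survive blank-out.

    Each connection transmits a presynaptic spike with probability p_keep,
    independently of all other connections (DropConnect-like masking).
    """
    mask = rng.random((n_post, pre_spikes.size)) < p_keep
    return mask & pre_spikes[np.newaxis, :]

rng = np.random.default_rng(0)
pre_spikes = np.ones(100, dtype=bool)      # a population firing in synchrony
delivered = blank_out(pre_spikes, n_post=10, p_keep=0.5, rng=rng)
# Each postsynaptic neuron now sees a different, irregular subset of spikes.
```

The mask is drawn again at every time step and stays active at inference time, which matches the "always-on" stochasticity mentioned in the text.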

Overall, the learned classification accuracy is comparable with that obtained with offline training of spiking neural networks (e.g. on GPUs) using standard BP.

The presence of these bursts of error activity suggests that eRBP could learn spatiotemporal sequences as well. However, learning useful latent representations of the sequences requires solving a temporal credit assignment problem at the hidden layer, a problem that is commonly solved with gradient BP-through-time in artificial neural networks (Rumelhart et al.) and that could be tackled using synaptic eligibility dynamics based on ideas of reinforcement learning (Sutton and Barto).

Classification with Single Spikes is Highly Accurate and Efficient

The low-latency response with high accuracy may seem at odds with the inherent firing rate code underlying the network computations (see Methods). However, a code based on the time of the first spike is consistent with a firing rate code, since a neuron with a high firing rate is expected to fire first (Gerstner and Kistler). In addition, the onset of the stimulus provokes a burst of synchronized activity, which further favors the rapid onset of the prediction response. These results suggest that despite the underlying firing rate code, eRBP can take advantage of the spiking dynamics, with classification accuracies comparable to spiking networks trained exclusively for single-spike classification (Mostafa).

Spiking Networks Equipped with eRBP Learn Rapidly and Efficiently

The spiking neural network requires fewer iterations over the dataset to reach the peak classification performance compared to the artificial neural network trained with batch gradient descent (Fig. 2). In minibatch learning, weight updates are averaged across the minibatch. Batch or minibatch learning improves learning speed in conventional hardware thanks to vectorization libraries or efficient parallelization with GPUs’ SIMD architecture, and leads to smoother convergence. However, this approach results in $n_{batch}$ times fewer weight updates per epoch compared to online gradient descent. In contrast, the spiking neural network is updated after each sample presentation, accounting in large part for the faster convergence of learning. Other spiking networks trained online using stochastic gradient descent achieved comparable speedup (O’Connor and Welling; Lee et al.). These results are not entirely surprising since seminal work in stochastic gradient descent established that, with suitable conditions on the learning rate, the solution to a learning problem obtained with stochastic gradient descent is asymptotically as good as the solution obtained with batch gradient descent for a given number of samples (Le Cun and Bottou). Furthermore, for equal computational resources, online gradient descent can process more data samples (Le Cun and Bottou), while requiring less memory for implementation. Thus, for an equal number of compute operations per unit time, online gradient descent converges faster than batch learning. Standard artificial neural networks can be trained using $n_{batch}=1$ , but learning becomes much slower on standard platforms because the operations cannot be vectorized across data samples. (The converse is not true, however: $n_{batch}>1$ in spiking networks is non-local because it requires storing synaptic weight gradients.) It is fortunate that synaptic plasticity is inherently “online” in the machine learning sense, given that potential applications of neuromorphic hardware often involve real-time streaming data.
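The factor of $n_{batch}$ in the number of weight updates per epoch can be made concrete with a short calculation. The helper name `updates_per_epoch` and the dataset size below are hypothetical values chosen for illustration only.

```python
def updates_per_epoch(n_samples, n_batch):
    """Number of weight updates in one pass over the dataset."""
    return -(-n_samples // n_batch)   # ceiling division

n_samples = 60000                      # hypothetical dataset size
online = updates_per_epoch(n_samples, n_batch=1)       # 60000 updates
minibatch = updates_per_epoch(n_samples, n_batch=100)  # 600 updates
ratio = online // minibatch                            # 100, i.e. n_batch
```

With one update per sample, the online learner has made as many updates after a single epoch as the minibatch learner makes in `ratio` epochs.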

eRBP Can Learn with Low-Precision, Fixed-Point Representations

The effectiveness of stochastic gradient descent degrades when the precision of the synaptic weights using a fixed-point representation is smaller than 16 bits (Courbariaux et al.). This is because quantization determines the smallest learning rate and bounds the range of the synaptic weights, thereby preventing averaging the variability across dataset iterations. The tight integration of memory with computing circuits as pursued in neuromorphic chip design is challenging due to space constraints and memory leakage. For this reason, full-precision (or even 16-bit) computer simulations of spiking networks may be unrepresentative of the performance that can be attained in dedicated neuromorphic designs, due to quantization of neural states, parameters, and synaptic weights. Extended simulations suggest that random BP performance at 10 bits of precision is indistinguishable from that with unquantized weights (Baldi et al.), but whether this is the case for online learning had not yet been tested. Here, we hypothesize that 8-bit synaptic weights are a good trade-off between the ability to learn with high accuracy and the cost of implementation in hardware. To demonstrate robustness to such constraints, we simulate quantized versions of the eRBP network using low-precision fixed-point representations (8 bits per synaptic weight and 16 bits for neural states). Consistent with existing findings, our simulations of eRBP in a quantized 784-100-10 network show that eRBP still performs reasonably well under these conditions (Fig. 4). While many weights aggregate at the boundaries, a majority of them remain away from the boundaries. Although the learned accuracies using quantized simulations fall slightly short of the full-precision ones, we emphasize that no specific rounding mechanisms (Muller and Indiveri) were used to obtain these results; such mechanisms are expected to tighten this gap.
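The two effects of quantization named above, a smallest possible update step and a bounded weight range, can be illustrated with a saturating integer update. The sketch below assumes a signed 8-bit two's-complement weight format and plain truncation to integers; the text does not specify the number format, so these choices and the name `quantized_update` are assumptions.

```python
import numpy as np

W_BITS = 8                                   # bits per synaptic weight
W_MIN = -(2 ** (W_BITS - 1))                 # -128 (assumed signed format)
W_MAX = 2 ** (W_BITS - 1) - 1                # 127

def quantized_update(w_int, delta_int):
    """Add an integer update to 8-bit weights, saturating at the bounds."""
    w_new = w_int.astype(np.int32) + delta_int    # widen to avoid wrap-around
    return np.clip(w_new, W_MIN, W_MAX).astype(np.int8)

w = np.array([-128, -1, 0, 126, 127], dtype=np.int8)
up = quantized_update(w, 2)      # [-126, 1, 2, 127, 127]
down = quantized_update(w, -2)   # [-128, -3, -2, 124, 125]
```

The smallest non-zero update is one least-significant bit, which sets the effective minimum learning rate, and weights that reach `W_MIN` or `W_MAX` stay there until an update of the opposite sign arrives, which is one way weights can aggregate at the boundaries.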

Discussion

The gradient descent BP rule is a powerful algorithm that is ubiquitous in deep learning, but when implemented in a von Neumann or neural architecture, it relies on the immediate availability of network-wide information stored with high-precision memory. More specifically, Baldi et al. and Lee et al. list several reasons why the requirements of gradient BP make it biologically implausible. The essence of these difficulties is that gradient BP is non-local in space and in time when implemented on a neural substrate, and requires precise linear and non-linear computations. The feedback alignment work demonstrated that symmetric weights are not necessary for communicating error signals across layers. Here we demonstrated a learning rule inspired by feedback alignment (Lillicrap et al.), membrane voltage-gated plasticity rules, and three-factor synaptic plasticity rules proposed in the computational neuroscience literature. With an adequate network architecture, we find that the spike-based computations, the lack of general linear and non-linear computations, and the absence of alternating forward and backward steps do not prevent accurate learning. Although previous work overcame some of the non-locality problems of gradient BP (O’Connor and Welling; Lee et al.; Lillicrap et al.), eRBP overcomes all of the key difficulties using a simple rule that incurs one addition and two comparisons per synaptic weight update.

Taken together, our results suggest that general-purpose deep learning using streaming spike-event data in neuromorphic platforms at artificial neural network proficiencies is realizable. To emphasize this, we have implemented eRBP in a software simulator with fixed-point, discrete-time dynamics, called the neural and synaptic array transceiver. This simulator is compatible with digital neuromorphic hardware currently in development, the full details of which will be published elsewhere.

Our experiments target digital implementations of spiking neural networks with embedded plasticity. However, membrane-voltage based learning rules implemented in mixed-signal neuromorphic hardware Qiao et al. ; Huayaney et al. are compatible with eRBP provided that synaptic weight updates can be modulated by an external signal on a neuron-to-neuron basis. Following this route, and combined with the recent advances in neuromorphic engineering and emerging nanotechnologies, eRBP can become key to ultra low-power processing in space and power constrained platforms.

Spiking neural networks, especially those based on I&F neuron types, severely restrict the possible computations during learning and inference. With the wide availability of graphical processing units and future dedicated machine learning accelerators, the neuromorphic spike-based approach to machine learning tasks is often heavily criticized as being misguided. While this is true for some cases and on metrics based on absolute accuracy at some standardized benchmark task, the premise of neuromorphic engineering, i.e. that electronic and biological systems share similar constraints on communication, power and reliability (Mead), extends to the algorithmic domain. That is, accommodating machine learning algorithms within the constraints of ultra-low power hardware for adaptive behavior (i.e. embedded learning) is likely to result in solutions for communication, computation and reliability that strongly resemble how the brain solves similar problems. In addition, the neuromorphic approach offers a few advantages over straight artificial neural networks: 1) the asynchronous, event-based communication most often used in neuromorphic hardware considerably reduces the communication between distributed processes, and 2) spiking networks naturally exploit “rate” codes and “spike” codes where single spikes are meaningful, leading to fast and thus power-efficient and gradual responses (Fig. 3, see also O’Connor and Welling).

One prominent example is the Binarized Neural Network (BNN) (Courbariaux and Bengio). The BNN is trained such that weights and activities are -1 or 1, which considerably reduces the energetic footprint of inference, because multiplications are not necessary and the memory requirements for inference are much smaller. The discrete, quantized dynamics used in this work, developed independently from the BNN, share many similarities with it, such as binary activations (spikes), low-precision variables, and straight-through gradient estimators. Our neurally inspired approach introduces several new features for binarized networks: network activity is sparse and data-driven (asynchronous), random variables for stochasticity are generated only when neurons spike, errors are backpropagated only for misclassified examples, and learning is ongoing, leading to accurate early single-spike classification. Many other examples that led to the unprecedented success in machine learning were discovered independently of equivalent neural mechanisms, such as normalization techniques for improving deep learning (Ioffe and Szegedy; Ren et al.), attention, short-term memory for learning complex tasks (Graves et al.), and memory consolidation through fast replays for reinforcement learning (Mnih et al.; Kumaran et al.). The convergence between the two approaches (neuromorphic vs. artificial) will not only improve the design of neuromorphic learning machines, but can also widen the breadth of knowledge transfer between computational neuroscience and deep learning.

Relation to Prior Work in Random Backpropagation

Our learning rule builds on the feedback alignment learning rule proposed in Lillicrap et al., which showed that random feedback can deliver useful teaching signals by aligning the feed-forward weights with the feedback weights. The authors also demonstrated a spiking neural network implementing feedback alignment, showing that feedback alignment is able to implicitly adapt to random feedback when the forward and backward pathways both operate continuously. However, their learning rule is not event-based as in eRBP, but operates in a continuous-time fashion that is not directly compatible with spike-driven plasticity, and a direct neuromorphic implementation thereof would be inadequate due to the high-bandwidth communication required between neurons. Furthermore, their model is a spike response model that does not emulate the physical dynamics of spiking neurons such as I&F neurons. Another difference between eRBP and the network presented in Lillicrap et al. is that eRBP contains only one error-coding layer, whereas feedback alignment contains one error-coding layer per hidden layer. Such direct feedback alignment was recently proposed in Nøkland and Baldi et al., and a theoretical analysis there showed that the gradient computed in this fashion points within 90 degrees of the backpropagated gradient. Baldi et al. studied feedback alignment in the framework of local learning and the learning channel, and derived several other flavors of random BP such as adaptive, sparse, skipped and indirect RBP, along with their combinations. In related work, Lee et al. showed how feedback weights can be learned to improve the classification accuracy by training the feedback weights to learn the inverse of the feedforward mapping.

Relation to Prior Work in Spiking Deep Neural Networks

Several approaches successfully realized the mapping of pre-trained artificial neural networks onto spiking neural networks using a firing rate code (O’Connor et al.; Cao et al.; Hunsberger and Eliasmith; O’Connor and Welling; Esser et al.; Neftci et al. [2014b]; Diehl et al.; Das et al.; Marti et al.). Such mapping techniques have the advantage that they can leverage the capabilities of existing machine learning frameworks such as Caffe (Jia et al.) or Theano (Goodfellow et al.) for brain-inspired computers. More recently, Mostafa used a temporal coding scheme where information is encoded in spike times instead of spike rates and the dynamics are cast in a differentiable form. As a result, the network can be trained using standard gradient descent to achieve very accurate, sparse and power-efficient classification. Although eRBP achieves comparable results, that approach naturally leads to sparse activity in the hidden layer, which can be more advantageous in large and deep networks.

An intermediate approach is to learn online with standard BP using spike-based quantization of network states (O’Connor and Welling) and the instantaneous firing rate of the neurons (Lee et al.). O’Connor and Welling eschew neural dynamics and instead operate directly on event-based (spiking) quantizations of vectors. Using this representation, common neural network operations including online gradient BP are mapped onto basic addition, comparison and indexing operations applied to streams of signed spikes. As in eRBP, their learning rule achieves better results when weight updates are made in an event-based fashion, as this allows the network to update its parameters many times during the processing of a single data sample. Lee et al. propose a method for training spiking neural networks via a formulation of the instantaneous firing rate of the neuron obtained by low-pass filtering the spikes. There, quantities that can be related to the postsynaptic potential (rather than mean rates) are used to compute the derivative of the activity of the neuron, which can provide a useful gradient for backpropagation. Esser et al. use multiple spiking convolutional networks trained offline to achieve near state-of-the-art classification in standard benchmark tasks. Their approach maps onto an all-digital spiking neural network architecture using trinary weights. For the above approaches, the eRBP learning rule presented here can be used as a drop-in replacement and can reduce the computational footprint of learning by simplifying the backpropagated chain path and by operating directly with locally available variables, i.e. membrane potentials and spikes.

Relation to Prior Work in Spike-Driven Plasticity Rules

STDP has been shown to be very powerful in a number of different models and tasks Thorpe et al. ; Nessler et al. ; Neftci et al. [2014a]. Although the implementation of acausal updates (triggered by presynaptic firing) is typically straightforward in cases where presynaptic lookup tables are used, the implementation of causal updates (triggered by postsynaptic firing) can be challenging due to the requirement of storing a reverse look-up table. Several approximations of STDP exist to solve this problem Pedroni et al. ; Galluppi et al. , but require dedicated circuits.

Thus, there is considerable benefit in hardware implementations of synaptic plasticity rules that forego the causal updates. Such rules, which we refer to as spike-driven plasticity, can be consistent with STDP (Brader et al.; Sheik et al. [2016a]; Qiao et al.; Clopath et al.), especially when using dynamical variables that are representative of the pre- and postsynaptic firing rates (such as calcium or average membrane voltage).

A common feature among spike-driven learning rules is a modulation or gating with a variable that reflects the average firing rate of the neuron, for example through calcium concentration (Graupner and Brunel; Huayaney et al.), the membrane potential (Clopath et al.; Sheik et al. [2016a]), or both (Brader et al.). Sheik et al. [2016a] recently proposed a membrane-gated rule inspired by calcium and voltage-based rules with homeostasis for learning unsupervised spike pattern detection. Their rule statistically emulates pairwise STDP using presynaptic spike timing only and using additions and multiplications. Except for homeostasis, eRBP follows similar dynamics, but potentiation and depression magnitudes are dynamic and determined by external modulation, and comparisons are made on total synaptic currents to avoid the effect of the voltage reset after firing.

The two-compartment neuron model used in this work is motivated by conductance-based dynamics in Urbanczik and Senn and by previous neuromorphic realizations of two-compartment mixed-signal spiking neurons (Park et al.). Although the spiking network used in this work is current-based rather than conductance-based, eRBP shares strong similarities with the three-factor learning rule employed in Urbanczik and Senn. The latter is composed of three factors: an approximation of the prediction error, the derivative of the membrane potential with respect to the synaptic weight, and a positive weighting function that stabilizes learning in certain scenarios. The first factor corresponds to the error modulation, while the second and third factors roughly correspond to the presynaptic activity and the derivative of the activation function. The differences between eRBP and Urbanczik and Senn (aside from the random BP, which was considered in Lillicrap et al.) stem mainly from two facts: 1) the firing rate description is used here for simplicity and for easier comparisons between artificial neural networks and spiking neural networks, and 2) eRBP is fully event-based in the sense that weights are updated only when the presynaptic neurons spike, in order to make memory and compute operations more efficient in hardware.

Conclusions and Future Directions

This article demonstrates a local learning rule for deep, feed-forward neural networks achieving classification accuracies on par with those obtained using equivalent machine learning algorithms. The learning rule combines two features: 1) algorithmic simplicity: one addition and two comparisons per synaptic update, provided one auxiliary state per neuron, and 2) locality: all the information for the weight update is available at each neuron and synapse. The combination of these two features enables learning dynamics for deep learning in neuromorphic hardware.

Existing literature suggests that random BP also works for unsupervised learning (Lee et al.; Baldi et al.) in deeper and convolutional networks. It can be reasonably expected that the deep learning community will uncover many variants of random BP, including in recurrent neural networks for sequence learning and memory-augmented neural networks. In tandem with these developments, we envision that such RBP techniques will enable the embedded learning of pattern recognition, attention, working memory and action selection mechanisms, which promise transformative hardware architectures for embedded computing.

This work has focused on unstructured, feed-forward neural networks and a single benchmark task across multiple implementations for ease of comparison. Limitations in deep learning algorithms are often invisible on “small” datasets like MNIST (Liao et al.). Random BP was demonstrated to be effective in a variety of tasks and network structures (Liao et al.; Baldi et al.), including convolutional neural networks. Although random BP was reported to work well in this case (Liao et al.), the parameter sharing in convnets is inherently non-local. Despite this non-locality, implementations of convnets are still possible in neuromorphic hardware (Qiao et al.) if presynaptic connectivity tables are stored rather than postsynaptic tables.

Methods

In artificial neural networks, the mean-squared cost function for one data sample in a single layer neural network is:

where $e_{i}$ is the error of prediction neuron $i$ , $y_{i}=\phi(\sum_{j}w_{ij}x_{j})$ is the activity of the prediction neuron $i$ with activation function $\phi$ , $\mathbf{x}$ is the data sample and $l_{i}$ is the label associated to the data sample. The task of learning is to minimize this cost over the entire dataset. The gradient descent rule in artificial neural networks is often used to this end by modifying the network parameters $\mathbf{w}$ in the direction opposite to the gradient:

where $\eta$ is a small learning rate. In deep networks, i.e. networks containing one or more hidden layers, the weights of the hidden layer neurons are modified by backpropagating the errors from the prediction layer using the chain rule:

where the $\delta$ for the topmost layer is $e_{i}$ , as in Eq. (3) and $y$ at the bottommost layer is the data $x$ . This update rule is the well-known gradient back propagation algorithm ubiquitously used in deep learning Rumelhart et al. . Learning is typically carried out in forward passes (evaluation of the neural network activities) and backward passes (evaluation of the $\delta$ s). The computation of the $\delta$ requires knowledge of the forward weights, thus gradient BP relies on the immediate availability of a symmetric transpose of the network for computing the backpropagated errors $\delta_{ij}^{l}$ . Often the access to this information funnels through the von Neumann bottleneck, which dictates the fundamental limits of the computing substrate.
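For reference, the three rules described in this section can be written compactly as follows. This is a reconstruction from the symbol definitions in the text rather than the original typeset equations: the sign convention $e_{i}=y_{i}-l_{i}$ , the pre-activation symbol $a_{i}^{l}=\sum_{m}w_{im}^{l}y_{m}^{l-1}$ , and the single-index form of $\delta$ are assumptions (the text itself writes $\delta_{ij}^{l}$ ).

$$
\begin{aligned}
L &= \frac{1}{2}\sum_{i}e_{i}^{2}, \qquad e_{i}=y_{i}-l_{i},\\
w_{ij} &\leftarrow w_{ij}-\eta\,\frac{\partial L}{\partial w_{ij}}, \qquad \frac{\partial L}{\partial w_{ij}}=\phi^{\prime}\Big(\sum_{m}w_{im}x_{m}\Big)\,e_{i}\,x_{j},\\
\Delta w_{ij}^{l} &\propto -\phi^{\prime}(a_{i}^{l})\,\delta_{i}^{l}\,y_{j}^{l-1}, \qquad \delta_{i}^{l}=\sum_{k}\phi^{\prime}(a_{k}^{l+1})\,\delta_{k}^{l+1}\,w_{ki}^{l+1}.
\end{aligned}
$$

With $\delta_{i}=e_{i}$ at the topmost layer and $y^{l-1}=x$ at the bottommost layer, the third line reduces to the second for a single-layer network. The appearance of $w_{ki}^{l+1}$ in the recursion is what makes the backward pass depend on the transpose of the forward weights.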

In the random BP rule considered here, the BP term $\delta$ is replaced with:

where $g_{ik}^{l}$ are fixed random numbers. This backpropagated term does not depend on the previous layer $l+1$ , and thus does not have a recursive structure as in standard BP (Eq. (4)) or feedback alignment Lillicrap et al. . This form was previously referred to as direct feedback alignment Nøkland or skipped RBP Baldi et al. and was shown to perform equally well on a broad spectrum of tasks. A detailed justification of random BP is out of the scope of this article, and interested readers are referred to Nøkland ; Baldi et al. ; Lillicrap et al. .
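
The structural difference between the two rules can be sketched in a few lines. The functions below are hypothetical helpers written for illustration: `bp_hidden_delta` needs the forward weights of the layer above, whereas `rbp_hidden_delta` only needs the output error and a fixed random matrix `g`. The numerical values are arbitrary.

```python
def bp_hidden_delta(phi_prime, delta_next, w_next):
    """Standard BP: uses the forward weights w_next[k][i] of the layer above."""
    return [phi_prime[i] * sum(delta_next[k] * w_next[k][i]
                               for k in range(len(delta_next)))
            for i in range(len(phi_prime))]

def rbp_hidden_delta(phi_prime, output_error, g):
    """Random BP: fixed random weights g[i][k] project the output error directly."""
    return [phi_prime[i] * sum(output_error[k] * g[i][k]
                               for k in range(len(output_error)))
            for i in range(len(phi_prime))]

phi_prime = [0.25, 0.2]          # derivative of the activation at each hidden unit
output_error = [0.5, -1.0]       # errors e_k at the prediction layer
g = [[1.0, 0.5], [-2.0, 0.0]]    # fixed feedback weights, never updated

delta_rbp = rbp_hidden_delta(phi_prime, output_error, g)

# Sanity check: the two rules coincide only when g equals the transposed forward weights.
w_next = [[g[i][k] for i in range(len(g))] for k in range(len(g[0]))]
delta_bp = bp_hidden_delta(phi_prime, output_error, w_next)
print(delta_rbp, delta_bp)
```

In the random BP case no weight of the forward path is read, so the computation stays local to the hidden unit and its feedback synapses.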

In the context of models of biological spiking neurons, RBP is appealing because it circumvents the problem of calculating the backpropagated errors and does not require bidirectional synapses or symmetric weights. RBP works remarkably well in a wide variety of classification and regression problems, using supervised and unsupervised learning in feed-forward networks, with a very small penalty in accuracy.

The above BP rules are commonly used in artificial neural networks, where neuron outputs are represented as single scalar variables. To derive an equivalent spike-based rule, we start by matching this scalar value to the neuron’s instantaneous firing rate. The cost function and its derivative for one data sample are then:

where $e_{i}(t)$ is the error of prediction unit $i$ and $\nu^{p}$ , $\nu^{l}$ are the firing rates of prediction and label neurons, respectively.

Random BP (Eq. (5)) is straightforward to implement in artificial neural network simulations. However, spiking neurons and synapses, especially with the dynamics that can be afforded in low-power neuromorphic implementations, typically do not have arbitrary mathematical operations at their disposal. For example, evaluating the derivative of $\phi$ can be difficult depending on the form of $\phi$ , and multiplications between the multiple factors involved in RBP can become very costly given that they must be performed at every synapse for every presynaptic event.

The dynamics of spiking neural circuits driven by Poisson spike trains are often studied in the diffusion approximation Wang ; Brunel and Hakim ; Brunel ; Fusi and Mattia ; Renart et al. ; Deco et al. ; Tuckwell . In this approximation, the firing rates of individual neurons are replaced by a common time-dependent population activity variable with the same mean and two-point correlation function as the original variables, corresponding here to a Gaussian process. The approximation is valid when the following assumptions hold: 1) the charge delivered by each spike to the postsynaptic neuron is small compared to the charge necessary to generate an action potential, 2) the number of afferent inputs to each neuron is large, 3) the spike times are uncorrelated. In the diffusion approximation, only the first two moments of the synaptic current are retained. The currents to the neuron, $I(t)$ , can then be decomposed as:

$$I(t)=\mu+\sigma\eta(t),$$

where $\mu=\langle I(t)\rangle=\sum_{j}w_{j}\nu_{j}$ and $\sigma^{2}=w_{bg}^{2}\nu_{bg}$ , where $\nu_{bg}$ is the firing rate of the background activity, and $\eta(t)$ is the white noise process. We restrict neuron dynamics to the case of synaptic time constants that are much larger than the membrane time constant, i.e. $\tau_{m}\ll\tau_{syn}$ , such that we can neglect the fluctuations caused by synaptic activity from other neurons in the network i.e. $\sigma$ is constant. Although the above dynamics are not true in general, in a neuromorphic approach, the parameters can be chosen accordingly during configuration or at design.
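
The two moments defined above are simple to evaluate numerically. The sketch below computes $\mu$ and $\sigma$ from the stated expressions and draws samples of the current as mean plus scaled Gaussian noise. The weights, rates and background parameters are made-up values, and the time-step scaling of the white noise is deliberately left out, so this only illustrates the decomposition rather than a calibrated simulation.

```python
import math
import random

def diffusion_moments(weights, rates, w_bg, nu_bg):
    """Mean and standard deviation of the input current in the diffusion approximation."""
    mu = sum(w * nu for w, nu in zip(weights, rates))   # mu = sum_j w_j nu_j
    sigma = math.sqrt(w_bg ** 2 * nu_bg)                # sigma^2 = w_bg^2 nu_bg
    return mu, sigma

def sample_current(mu, sigma, rng):
    """One sample of the current: mean plus Gaussian fluctuation (unit-variance noise)."""
    return mu + sigma * rng.gauss(0.0, 1.0)

# Illustrative values only.
mu, sigma = diffusion_moments([0.5, -0.25, 1.0], [10.0, 20.0, 5.0], w_bg=0.1, nu_bg=1000.0)

rng = random.Random(0)
samples = [sample_current(mu, sigma, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
print(mu, sigma, empirical_mean)
```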

In this case, the neuron’s membrane potential dynamics is an Ornstein-Uhlenbeck (OU) process Gardiner . The stationary distribution of the freely evolving membrane potential (no firing threshold) is a Gaussian distribution:

where $g_{L}$ is the leak conductance and $\tau_{m}$ is the membrane time constant. Although this distribution is generally not representative of the membrane potential of the I&F neuron due to the firing threshold Gerstner and Kistler , the considered case $\tau_{m}\ll\tau_{syn}$ yields approximately a truncated Gaussian distribution, where neurons with $V_{nt}>0$ fire at their maximum rate of $\frac{1}{\tau_{refr}}$ . This approximation is less exact for very large $\mu$ due to the resetting, but the resulting form highlights the essence of eRBP while maintaining mathematical tractability. Furthermore, using a first-passage time approach, Petrovici et al. computed corrections that account for small synaptic time constants and the effect of the firing threshold on this distribution.
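
The truncated-Gaussian picture can be illustrated numerically. The sketch below assumes that the mean and standard deviation of the free membrane potential are given as inputs, that the threshold sits at zero, and that a neuron above threshold fires at $\tau_{refr}^{-1}$. The function names and the 4 ms refractory period are illustrative assumptions.

```python
import math

def prob_above_zero(mean, std):
    """P(V >= 0) for V ~ N(mean, std^2), i.e. one minus the Gaussian CDF at zero."""
    return 0.5 * (1.0 + math.erf(mean / (std * math.sqrt(2.0))))

def approx_rate(mean, std, tau_refr):
    """Approximate firing rate: maximum rate 1/tau_refr weighted by P(V >= 0)."""
    return prob_above_zero(mean, std) / tau_refr

tau_refr = 0.004  # seconds; illustrative value
for mean in (-2.0, 0.0, 2.0):
    print(mean, approx_rate(mean, 1.0, tau_refr))
```

The resulting rate is a sigmoid-shaped function of the mean input that saturates at the inverse refractory period.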

The firing rate of neuron $i$ is approximately equal to the inverse of the refractory period, $\nu_{i}=\tau_{refr}^{-1}$ with probability $P(V_{nt,i}(t+1)\geq 0|\mathbf{s}(t))$ and zero otherwise. The probability is equal to one minus the cumulative distribution function of $V_{nt,i}$ :

For gradient descent, we require the derivative of the neuron’s activation function with respect to the weight $w$ . By definition of the cumulative distribution, this is the Gaussian function in Eq. (8) times the presynaptic activity:

As in previous work Neftci et al. [2014a], we replace $\nu_{j}(t)$ in the above equations with the presynaptic spike train $s_{j}(t)$ to obtain an asynchronous, event-driven update, where the derivative is evaluated only when the presynaptic neuron spikes. This approach is justified by the fact that the learning rate is typically small, such that the event-driven updates are averaged at the synaptic weight variable Gerstner and Kistler . Thus the derivative becomes:

In the considered spiking neuron dynamics, the Gaussian function is not directly available. Although a sampling scheme based on the membrane potential to approximate the derivative is possible, here we follow a simpler solution: backed by extensive simulations, and inspired by previously proposed membrane potential gated learning rules Sheik et al. [2016a]; Brader et al. ; Clopath et al. , we find that replacing the Gaussian function with a boxcar function $\Theta$ operating on the total synaptic input, $I(t)$ , with boundaries $b_{min}$ and $b_{max}$ yields results that are as good as using the exact derivative. With appropriate boundaries, $\Theta(I(t))$ can be interpreted as a piecewise constant approximation of the Gaussian function, since $I(t)$ is proportional to its argument $\sum_{j}w_{ij}\nu_{j}$ , and has the advantage that an explicit multiplication with the modulation is unnecessary in the random BP rule (explained below). Equivalently, for the purpose of the derivative evaluation, the activation function is approximated as a rectified linear function with hard saturation at $\tau_{refr}^{-1}$ , also called “hard tanh” in the machine learning community.

The resulting derivative function is similar in spirit to straight-through estimators used in machine learning Courbariaux and Bengio .
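
A minimal sketch of this approximation is given below. `boxcar` is the gate $\Theta$ and `hard_saturating_rate` is a piecewise-linear activation whose slope is nonzero only inside the window $[b_{min}, b_{max}]$, so that its derivative is a scaled boxcar. The open-interval convention at the boundaries and the parameter values are assumptions made for the example.

```python
def boxcar(current, b_min, b_max):
    """Theta(I): 1 inside the window (b_min, b_max), 0 outside."""
    return 1.0 if b_min < current < b_max else 0.0

def hard_saturating_rate(current, b_min, b_max, tau_refr):
    """Rectified-linear activation with hard saturation at 1/tau_refr."""
    max_rate = 1.0 / tau_refr
    if current <= b_min:
        return 0.0
    if current >= b_max:
        return max_rate
    return max_rate * (current - b_min) / (b_max - b_min)

b_min, b_max, tau_refr = -1.0, 1.0, 0.004   # illustrative values

def slope(i0, i1):
    """Finite-difference slope of the activation between two input currents."""
    return (hard_saturating_rate(i1, b_min, b_max, tau_refr)
            - hard_saturating_rate(i0, b_min, b_max, tau_refr)) / (i1 - i0)

print(boxcar(0.0, b_min, b_max), slope(0.0, 0.1))   # gate open, positive slope
print(boxcar(2.0, b_min, b_max), slope(2.0, 3.0))   # gate closed, zero slope
```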

Derivation of Event-Driven Random Backpropagation

For simplicity, the error $e_{i}(t)$ is computed using a pair of spiking neurons with a rectified linear activation function. One neuron computes the positive values of $e_{i}(t)$ , while the other neuron computes the negative values of $e_{i}(t)$ such that:

Each pair of error neurons synapses with a leaky dendritic compartment $U$ of the hidden and prediction neurons using equal synaptic weights with opposite sign, generating a dendritic potential proportional to $(\nu^{E+}_{i}(t)-\nu^{E-}_{i}(t))\cong e_{i}$ . Several other schemes for communicating the errors are possible. For example, an earlier version of eRBP used one positively biased error neuron per class (rather than a positive-negative pair as above) such that the neuron operated (mostly) in the linear regime. This solution led to similar results but was computationally more expensive due to error neurons being strongly active even when the classification was correct. Population codes of heterogeneous neurons as in Salinas and Abbott ; Eliasmith and Anderson may provide even more flexible dynamics for learning. The weight update for the last layer becomes:

The weight update for the hidden layers is similar, except that a random linear combination of the error is used instead of $e_{i}$ :

where $C=\{d,h\}$ . All weight initializations are scaled with the number of rows and the number of columns as $g_{ik}\sim U(\sqrt{\frac{6}{N_{E}+N_{H}}})$ , where $N_{E}$ is the number of error neurons and $N_{H}$ is the number of hidden neurons.
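
The ingredients of this derivation can be put together in a rate-based sketch. It makes several assumptions that should be read as such: the uniform initialization is taken to be symmetric on $[-\sqrt{6/(N_{E}+N_{H})}, +\sqrt{6/(N_{E}+N_{H})}]$, the update is written in the descent direction (the weight decreases when the modulation is positive), and all names and numerical values are hypothetical.

```python
import math
import random

def error_pair_rates(nu_pred, nu_label):
    """Rates of the positive and negative error neurons (rectified linear)."""
    e = nu_pred - nu_label
    return max(0.0, e), max(0.0, -e)

def init_feedback(n_error, n_hidden, rng):
    """Fixed random feedback weights, assumed uniform on [-bound, bound]."""
    bound = math.sqrt(6.0 / (n_error + n_hidden))
    g = [[rng.uniform(-bound, bound) for _ in range(n_error)] for _ in range(n_hidden)]
    return g, bound

def hidden_update(w, pre_spike, modulation, current_i, b_min, b_max, lr):
    """Event-driven update: applied only on a presynaptic spike and inside the boxcar window."""
    if pre_spike and b_min < current_i < b_max:
        return w - lr * modulation   # sign convention is an assumption
    return w

pos, neg = error_pair_rates(nu_pred=30.0, nu_label=100.0)
rng = random.Random(1)
g, bound = init_feedback(n_error=2, n_hidden=3, rng=rng)

errors = [pos - neg, 20.0]   # reconstructed signed errors for two classes
modulation = sum(gk * ek for gk, ek in zip(g[0], errors))   # random combination for hidden unit 0
w_new = hidden_update(0.2, True, modulation, current_i=0.0, b_min=-1.0, b_max=1.0, lr=1e-4)
print(pos, neg, modulation, w_new)
```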

In the following, we detail the spiking neuron dynamics that can efficiently implement eRBP.

2 Spiking Neural Network and Plasticity Dynamics

The network used for eRBP consists of one or two feed-forward layers (Fig. 1) with $N_{d}$ “data” neurons, $N_{h}$ hidden neurons and $N_{p}$ prediction neurons. The top layer, labeled $P$ , is the prediction. The feedback from the error population is fed back directly to the hidden layers’ neurons. The network is composed of three types of neurons: 1) Error-coding neurons are non-leaky I&F neurons following the linear dynamics:

where $s_{i}^{P}(t)$ and $s_{i}^{L}(t)$ are spike trains from prediction neurons and labels (teaching signal). To prevent negative runaway dynamics, a rigid boundary at zero is imposed. In addition, the membrane potential is lower bounded to $V_{T}^{E}$ . Each error neuron has one counterpart neuron with weights of opposite sign, i.e. $w^{L-}=-w^{L+}$ to encode the negative errors. The firing rate of the error-coding neurons is proportional to a linear rectification of the inputs. For simplicity, the label spike train is regular with firing rate equal to $\tau_{refr}^{-1}$ . When the prediction neurons classify correctly, $(s_{i}^{P}(t)-s_{i}^{L}(t))\cong 0$ , such that the error neurons remain silent. 2) Hidden neurons follow current-based leaky I&F dynamics:

where $s_{k}^{d}(t)$ and $s_{j}^{h}(t)$ are the spike trains of the data neurons and the hidden neurons, respectively, $I^{h}$ are current-based synapse dynamics, $\sigma_{w}s^{bg}_{i}(t)$ a Poisson process of rate $1$ kHz and amplitude $\sigma_{w}$ , and $\xi$ is a stochastic Bernoulli process with probability $p$ (indices $i,j$ are omitted for clarity). The Poisson process simulates background Poisson activity and contributes additively to the membrane potential, whereas the Bernoulli process contributes multiplicatively by randomly “blanking-out” the proportion $(1-p)$ of the input spikes. In this work, we consider feed-forward networks, i.e. the weight matrix $w^{h}$ is restricted to be upper diagonal. Each neuron is equipped with a separate “dendritic” compartment $U^{h}_{i}$ following similar subthreshold dynamics as the membrane potential and where $s^{E}(t)$ is the spike train of the error-coding neurons and $g^{E}_{ij}$ is a fixed random matrix. The dendritic compartment is not directly coupled to the “somatic” membrane potential $V^{h}_{i}$ , but indirectly through the learning dynamics. For every hidden neuron $i$ , $\sum_{j}w_{ij}^{E}=0$ , ensuring that the spontaneous firing rate of the error-coding neurons does not bias the learning. The synaptic weight dynamics follow a dendrite-modulated and gated rule:

where $\Theta$ is a boxcar function with boundaries $b_{min}$ and $b_{max}$ .

3) Prediction neurons, synapses and synaptic weight updates follow the same dynamics as the hidden neurons except for the dendritic compartment, and one-to-one connection with pairs of error-neurons associated to the same class:

The spike trains at the data layer were generated using a stochastic neuron with instantaneous firing rate (exponential hazard function Gerstner and Kistler with absolute refractory period):

where $d$ is the intensity of the pixel (scaled from 0 to 1), and $t^{\prime}$ is the time of the last spike. Although neurons with I&F neuron dynamics similar to the prediction and hidden neurons could be employed here, we assumed that data will be provided by external sensors in the form of spike trains that do not necessarily follow I&F dynamics. Fig. 5 illustrates the neural dynamics in a prediction neuron, in a network trained with 500 training samples (1/100 of an epoch).

In practice, we find that neurons tend to strongly synchronize in late stages of the training. The analysis provided above does not accurately describe synchronized dynamics, since one of the assumptions for the diffusion approximation is that spike times are uncorrelated. Multiplicative stochasticity was previously shown to be beneficial for regularization and decorrelation of spike trains, while being easy to implement in neuromorphic hardware Neftci et al. . Following the ideas of synaptic sampling Neftci et al. , we find that replacing the background Poisson noise with multiplicative, blank-out noise Vogelstein et al. at the plastic synapses slightly improves the results and mitigates the energetic footprint of the stochasticity Sheik et al. [2016b].
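
Multiplicative blank-out noise is straightforward to express in code. The sketch below keeps each incoming spike with probability $p$ and drops it otherwise, so that a proportion $(1-p)$ of the spikes is blanked out on average. The function name and the value $p=0.6$ are illustrative assumptions.

```python
import random

def blank_out(spikes, p, rng):
    """Keep each input spike with probability p; never creates new spikes."""
    return [s if rng.random() < p else 0 for s in spikes]

rng = random.Random(42)
incoming = [1] * 10000                 # one spike per time step, for illustration
kept = blank_out(incoming, 0.6, rng)
kept_fraction = sum(kept) / len(incoming)
silent = blank_out([0, 0, 0], 0.6, rng)
print(kept_fraction)
```

Because the noise acts only when a spike arrives, no random numbers need to be drawn for silent inputs in an event-driven implementation.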

3 Quantized, Discrete-time dynamics

In order to demonstrate the effectiveness of eRBP in dedicated digital hardware with realistic constraints on precision and limited computations, we created a simulation of a digital neuromorphic learning core. The hardware simulator is a bit-accurate, fixed point emulation of a spiking neural network implemented on a digital hardware platform for ultra-efficient and flexible learning dynamics that is currently under development. Full details of the model and its architecture will be discussed elsewhere. Here, we describe the subset of the model dynamics necessary for peRBP. The dynamics of the neuron $i$ implemented at the hidden layer is the following:

where the fourth and fifth lines account for thresholds, resets, and spiking outputs $s_{i}$ . $s^{d}_{j}[t]$ , $s^{h}_{j}[t]$ , and $s^{E}_{j}[t]\in\{0,1\}$ are the spiking outputs of the input (data), hidden, and error-coding neurons $j$ at time $t$ , respectively. Indices from $\xi[t]$ were dropped for clarity, and every instance of $\xi[t]$ in the equations above refers to an independent and identically distributed Bernoulli draw with probability $p$ . The terms $g_{U}$ and $g_{I}$ are weight gain factors used to adjust the range of the synaptic weights.

Parameters $a$ , including $a_{V}$ , $a_{U}$ , $a_{syn}$ , and $a_{IV}$ , are integers implementing the coupling between and within the states $V$ , $U$ and $I$ . The $\diamond$ operator is a custom bit shift that performs multiplication by powers of two and that can be implemented using only bitwise operations:

The reason for using $\diamond$ rather than left and right bit shifting is that integers stored using a two’s complement representation have the property that right shifting by $a$ of values such that $x>-2^{a^{\prime}},\forall a^{\prime}<a$ yields $-1$ , whereas $0$ is expected in the case of a multiplication by $2^{-a}$ . The $\diamond$ operator corrects this problem by modifying the bit shift operation such that $-2^{a^{\prime}}\diamond{a}=0,\,\forall a^{\prime}<a$ . Furthermore, such multiplications by powers of $2$ have the advantage that fewer bits are required to store parameters on a logarithmic scale, which is a natural parametrization for such linear difference equations. A similar operation was used in the BNN Courbariaux and Bengio for an approximate power-of-two operation, although in our simulations, the first argument $a$ is considered constant.
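
The rounding problem with plain arithmetic right shifts can be reproduced with Python integers, which behave like two’s complement numbers under `>>`. The helper `shift_toward_zero` below is a hypothetical stand-in that rounds toward zero; it is meant to show the intended behavior and is not the exact definition of the $\diamond$ operator.

```python
def shift_toward_zero(x, a):
    """Multiply integer x by 2**(-a) using shifts, rounding toward zero."""
    if x >= 0:
        return x >> a
    return -((-x) >> a)   # shift the magnitude, then restore the sign

# A plain arithmetic shift rounds toward minus infinity for negative values.
plain = (-4) >> 3
corrected = shift_toward_zero(-4, 3)
print(plain, corrected)

# Larger magnitudes and positive values are scaled as expected.
print(shift_toward_zero(-16, 3), shift_toward_zero(20, 2), shift_toward_zero(-17, 3))
```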

ensures that all states leak towards zero in the absence of external drive.

where Clip clips the weights higher than 128 and lower than -128. Dynamics for the prediction neurons were the same except that they reflected the connectivity of the output layer. Positive error neurons obeyed the following dynamics:

As for continuous dynamics, negative error neurons follow the exact same dynamics with $w^{L}$ of opposite sign and error neuron membrane voltages are lower bounded to zero.

For data neurons, input spike trains were generated as Poisson spike trains with rate $\gamma d$ , where $d$ is the pixel intensity. For label neurons, input spikes were regular, i.e. spikes were spaced regularly with interspike interval $\tau_{refr}$ , corresponding to a firing rate of $\tau_{refr}^{-1}$ .
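
The two kinds of input can be sketched in discrete time as follows. The Poisson train is approximated by one Bernoulli draw per time step with probability rate times step size, which is only accurate when that product is small. The values of $\gamma$, the time step and the interval in steps are illustrative assumptions.

```python
import random

def poisson_spike_train(rate_hz, n_steps, dt, rng):
    """Bernoulli approximation of a Poisson train: spike with probability rate*dt per step."""
    p = min(1.0, rate_hz * dt)
    return [1 if rng.random() < p else 0 for _ in range(n_steps)]

def regular_spike_train(interval_steps, n_steps):
    """Regular train: one spike every interval_steps time steps, starting at step 0."""
    return [1 if t % interval_steps == 0 else 0 for t in range(n_steps)]

gamma = 100.0        # maximum data rate in Hz (illustrative)
d = 0.5              # pixel intensity scaled to [0, 1]
dt = 0.001           # 1 ms time step (illustrative)

rng = random.Random(3)
data_train = poisson_spike_train(gamma * d, n_steps=20000, dt=dt, rng=rng)
label_train = regular_spike_train(interval_steps=4, n_steps=20)
print(sum(data_train), label_train)
```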

In the simulations used for eRBP, all states $V$ , $U$ and $I$ and parameters were stored in 16 bit fixed point precision (ranging from -32768 to 32767), except for synaptic weights, which were stored with 8 bit precision (ranging from -128 to 128), and coupling parameters, which were stored with 5 bit precision (from 0 to 32).
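
Saturating arithmetic of this kind can be sketched with a single clipping helper. The ranges below are copied from the text as stated; the helper names and the example updates are assumptions made for illustration.

```python
STATE_RANGE = (-32768, 32767)   # states V, U and I, as stated in the text
WEIGHT_RANGE = (-128, 128)      # synaptic weights, as stated in the text
COUPLING_RANGE = (0, 32)        # coupling parameters, as stated in the text

def clip(value, bounds):
    """Saturate an integer to the closed interval given by bounds."""
    lo, hi = bounds
    return max(lo, min(hi, value))

def update_weight(w, dw):
    """Apply an integer weight change, then saturate to the weight range."""
    return clip(w + dw, WEIGHT_RANGE)

print(update_weight(120, 20), update_weight(-120, -20), update_weight(10, -3))
print(clip(40000, STATE_RANGE), clip(5, COUPLING_RANGE))
```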

4 Experimental Setup and Software Simulations

Acknowledgments

This work was partly supported by the Intel Corporation and by the National Science Foundation under grant 1640081, and the Nanoelectronics Research Corporation (NERC), a wholly-owned subsidiary of the Semiconductor Research Corporation (SRC), through Extremely Energy Efficient Collective Electronics (EXCEL), an SRC-NRI Nanoelectronics Research Initiative under Research Task ID 2698.003. We thank Jun-Haeng Lee and Peter O’Connor for review and comments; and Gert Cauwenberghs, João Sacramento, Walter Senn for discussion.