Edited by: André van Schaik, Western Sydney University, Australia
Reviewed by: Mark D. McDonnell, University of South Australia, Australia; Michael Pfeiffer, Robert Bosch (Germany), Germany
*Correspondence: Emre O. Neftci
This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
An ongoing challenge in neuromorphic computing is to devise general and computationally efficient models of inference and learning which are compatible with the spatial and temporal constraints of the brain. One increasingly popular and successful approach is to take inspiration from inference and learning algorithms used in deep neural networks. However, the workhorse of deep learning, the gradient-descent Back Propagation (BP) rule, often relies on the immediate availability of network-wide information stored in high-precision memory during learning, and on precise operations that are difficult to realize in neuromorphic hardware. Remarkably, recent work showed that exact backpropagated gradients are not essential for learning deep representations. Building on these results, we demonstrate an event-driven random BP (eRBP) rule that uses error-modulated synaptic plasticity for learning deep representations. Using a two-compartment Leaky Integrate-and-Fire (I&F) neuron, the rule requires only one addition and two comparisons per synaptic weight, making it very suitable for implementation in digital or mixed-signal neuromorphic hardware. Our results show that using eRBP, deep representations are rapidly learned, achieving classification accuracies on permutation-invariant datasets comparable to those obtained in artificial neural network simulations on GPUs, while being robust to neural and synaptic state quantizations during learning.
Biological neurons and synapses can provide the blueprint for inference and learning machines that are potentially 1,000-fold more energy efficient than mainstream computers. However, the breadth of application and scale of present-day neuromorphic hardware remain limited, mainly by a lack of general and efficient inference and learning algorithms compliant with the spatial and temporal constraints of the brain.
Thanks to their general-purpose, modular, and fault-tolerant nature, deep neural networks and machine learning have become a popular and effective means for executing a broad set of practical vision, audition, and control tasks in neuromorphic hardware (Esser et al.,
The implementation of Gradient Back Propagation (hereafter BP for short) on a neural substrate is even more challenging (Grossberg,
Although previous work (Lee et al.,
eRBP builds on the recent advances in approximate forms of the gradient BP rule (Lee et al.,
The focus of eRBP is to achieve real-time, online learning at higher power efficiency compared to deep learning on standard hardware, rather than achieving the highest accuracy on a given task. The success of eRBP on these measures lays the foundations of neuromorphic deep learning machines, and paves the way for learning with streaming spike-event data in neuromorphic platforms at proficiencies close to those of artificial neural networks.
This article is organized as follows: key theoretical and simulation results are provided in the Results section, followed by a general discussion and conclusion. Technical details of eRBP and its implementation are provided in the final section.
The central contribution of this article is event-driven RBP (eRBP), a presynaptic spike-driven plasticity rule modulated by top-down errors and gated by the state of the postsynaptic neuron. The idea behind this additional modulation factor is motivated by supervised gradient-descent learning in artificial neural networks and biologically plausible models of three-factor plasticity rules (Urbanczik and Senn,
In gradient descent using a squared error cost function, weight updates for a neuron in layer
where
where
where
This choice is motivated by the fact that the activation function of I&F neurons with absolute refractory period can be approximated by a linear threshold unit (also known as rectified linear unit) with saturation whose derivative is exactly the boxcar function. In this case, the eRBP synaptic weight update consists of additions and comparisons only, and can be captured using the following operations for neuron
where
Provided the second-compartment dynamics, no multiplications are necessary for an eRBP update. This second compartment can be disabled after learning without affecting the inference dynamics. The rule is reminiscent of membrane voltage-based rules, where spike-driven plasticity is induced only when the membrane voltage is inside an eligibility window (Brader et al.,
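The per-spike update described above can be sketched in a few lines of code (a hedged illustration: the variable names, the vectorized form, and the parameter values are ours, not the paper's; the learning rate eta is shown explicitly here, but in hardware it would be folded into the dendritic variable or realized as a bit shift, so that no true multiplication is needed):

```python
import numpy as np

def erbp_update(W, pre_spikes, V, U, eta=0.1, b_min=-1.0, b_max=1.0):
    """Sketch of an eRBP weight update triggered by presynaptic spikes.

    W          : weight matrix, shape (n_pre, n_post)   (illustrative names)
    pre_spikes : 0/1 vector of presynaptic spikes, shape (n_pre,)
    V          : postsynaptic membrane potentials, shape (n_post,)
    U          : dendritic error accumulators, shape (n_post,)

    Per active synapse this amounts to two comparisons (the boxcar gate
    on V) and one addition of the error accumulator U.
    """
    gate = (V >= b_min) & (V <= b_max)         # boxcar gate: two comparisons
    dW = eta * np.outer(pre_spikes, gate * U)  # nonzero only for spiking rows
    return W + dW
```

The update touches only synapses whose presynaptic neuron spiked, which is what makes the rule event-driven.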
The realization of eRBP on neuromorphic hardware requires an auxiliary learning variable for integrating and storing top-down error signals during learning, which can be substantiated by a dendritic compartment. Provided this variable, each synaptic weight update incurs only two comparison operations and one addition. Additions and comparisons can be implemented very naturally in neuromorphic VLSI circuits (Liu et al.,
We demonstrate eRBP in networks consisting of one and two hidden layers trained on permutation invariant MNIST and EMNIST (Table
Classification error on the permutation invariant MNIST task (test set) obtained by averaging test errors of the last 5 epochs (for MNIST) and last epoch for EMNIST.
PI MNIST 784-100-10  3.77 (3.23)  2.89 (2.81)  2.74 (2.64)  3.19 (2.98)  2.25 (2.19)  2.44 (2.39)
PI MNIST 784-200-10  3.53 (2.98)  2.78 (2.53)  2.13 (2.04)  2.37 (2.33)  1.85 (1.78)  1.94 (1.88)
PI MNIST 784-500-10  2.86 (2.57)  2.34 (2.23)  2.00 (1.96)  2.09 (2.06)  1.63 (1.60)  1.88 (1.80)
PI MNIST 784-200-200-10  2.96 (2.85)  2.29 (2.22)  2.50 (2.45)  2.26 (2.25)  1.80 (1.78)  1.82 (1.74)
PI MNIST 784-500-500-10  2.36 (2.28)  2.02 (1.96)  2.24 (2.0)  2.34 (2.31)  1.90 (1.86)  1.69 (1.56)
PI EMNIST 784-200-200-10  26.76 (25.26)  21.83 (21.4)  22.3 (20.18)  32.37 (26.48)  18.42 (16.06)  18.23 (17.72)
Network Architecture for Event-driven Random Backpropagation (eRBP) and example spiking activity after training a 784-200-200-10 network for 60 epochs. The network consists of feedforward layers (
MNIST Classification error on fully connected artificial neural networks (BP and RBP) and on spiking neural networks (eRBP). Curves for eRBP were obtained by averaging across 5 simulations with different seeds.
When equipped with stochastic connections (multiplicative noise) that randomly blank out presynaptic spikes, the network performed better overall (labeled
The reasons why eRBP_{×} performs better than the eRBP_{+} configuration cannot be attributed solely to its regularizing effect: as learning progresses, a significant portion of the neurons tend to fire near their maximum rate and to synchronize their spiking activity across layers, as a result of large synaptic weights (and thus large presynaptic inputs). Synchronized spike activity is not well captured by firing rate models, on which eRBP is based (see Section 5). Additive noise has a relatively small effect when the magnitude of the presynaptic input is large. However, multiplicative blankout noise improves learning by introducing irregularity in the presynaptic spike trains even when presynaptic neurons fire regularly. This type of “always-on” stochasticity was also argued to approximate Bayesian inference with Gaussian processes (Gal and Ghahramani,
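Multiplicative blankout noise of this kind can be sketched very compactly (an illustration under assumptions: we treat the table's blankout probability of 0.45 as the drop probability, so each spike is transmitted with probability p_keep = 0.55; the function name and random generator are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def blankout(pre_spikes, p_keep=0.55):
    """Multiplicative blankout noise (sketch): each presynaptic spike is
    transmitted with probability p_keep and silently dropped otherwise,
    decorrelating otherwise-regular spike trains."""
    keep = rng.random(pre_spikes.shape) < p_keep
    return pre_spikes * keep
```

Because the noise multiplies the spike train, it has no effect in the absence of spikes, unlike additive background noise.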
Overall, the learned classification accuracy with eRBP_{×} is close to that obtained with offline training of neural networks (e.g., GPUs,
Transitions between two data samples of different class (digit) are marked by bursts of activity in the error neurons (Figure
Firing rate of data layer and error layer upon stimulus onset, averaged across 1,000 trials and all neurons in the layer. The large firing rate at the onset is caused by synchronized neural activity. The vertical line in the bottom figure depicts the 50
In future work involving practical applications on autonomous systems, it will be beneficial to interleave learning and inference stages without explicitly controlling the learning rate. One way to achieve this is to introduce a negative bias in the error neurons by means of a constant negative input, and an equal positive bias in the label neurons, such that the error neurons can only be active when an input label is provided
The presence of these bursts of error activity suggests that eRBP could learn spatiotemporal sequences as well. However, learning useful latent representations of the sequences requires solving a temporal credit assignment problem at the hidden layer, a problem that is commonly solved with gradient BP-through-time in artificial neural networks (Rumelhart et al.,
The response latency of the 784-200-10 network after stimulus onset is about one synaptic time constant. Using the first spike after 2τ_{s} = 8
Classification error in the 784-200-10 eRBP_{+} network as a function of the number of spikes in the prediction layer, and the total number of synaptic operations incurred up to each output spike. To obtain these data, the network was first stimulated with random patterns, and the spikes in the output layer were counted after τ_{syn} = 4
In this example, classification using the first spike incurred about 100
The low-latency response with high accuracy may seem at odds with the inherent firing rate code underlying the network computations (see Section 5). However, a code based on the time of the first spike is consistent with a firing rate code, since a neuron with a high firing rate is expected to fire first (Gerstner and Kistler,
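A first-spike readout of this kind can be sketched as follows (illustrative conventions: the function name is ours, and neurons that never fire are assigned a spike time of +inf):

```python
import numpy as np

def first_spike_prediction(spike_times):
    """First-spike readout (sketch): the predicted class is the output
    neuron that fires first. This is consistent with a rate code, since
    the neuron with the highest firing rate is expected to fire first."""
    return int(np.argmin(spike_times))

# Example: four output neurons; neuron 2 fires first, at 6 ms.
times = np.array([np.inf, 12.0, 6.0, 9.5])
```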
In the spiking simulations, weights are updated during the presentation of
These results are not entirely surprising since seminal work in stochastic gradient descent established that with suitable conditions on the learning rate, the solution to a learning problem obtained with stochastic gradient descent is asymptotically as good as the solution obtained with batch gradient descent (Le Cun and Bottou,
It is fortunate that synaptic plasticity is inherently “online” in the machine learning sense, given that potential applications of neuromorphic hardware often involve realtime streaming data.
The online, event-based learning in eRBP combined with the reduced number of required dataset iterations suggests that learning on neuromorphic hardware can be particularly efficient. Furthermore, in neuromorphic hardware, only active connections in the network incur a SynOp. To demonstrate the efficiency of the learning, we report the number of multiply-accumulate (MAC) operations required for reaching a given accuracy compared to the number of synaptic operations (SynOps) in the spiking network for the MNIST learning task (784-200-200-10 network, Figure
Spiking neural networks equipped with eRBP with stochastic synapses (multiplicative noise) achieve SynOp-MAC parity on the MNIST task. The number of multiply-accumulate (MAC) operations required for reaching a given accuracy is compared to the number of synaptic operations (SynOps) in the spiking network for the MNIST learning task (784-200-200-10 network). Both networks require roughly the same number of operations to reach the same accuracy during learning. Only MACs incurred in the matrix multiplications are taken into account (other necessary operations, e.g., additions, logistic function calls, and weight updates, were not counted here and would further favor the spiking network).
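The two operation counts compared above can be made concrete with a short sketch (the layer sizes follow the 784-200-200-10 network from the text; the spike counts and fan-outs in the usage example are made-up illustrative numbers):

```python
def mac_count(layers=(784, 200, 200, 10)):
    """MACs for one dense forward pass: only the matrix multiplications
    are counted, matching the comparison in the text."""
    return sum(a * b for a, b in zip(layers[:-1], layers[1:]))

def synop_count(spike_counts, fanouts):
    """SynOps in a spiking network: each spike triggers one synaptic
    operation per outgoing (active) connection."""
    return sum(n * f for n, f in zip(spike_counts, fanouts))

# One forward pass of the 784-200-200-10 network:
macs = mac_count()  # 198800 MACs per sample
# Hypothetical per-layer spike counts during one sample presentation:
synops = synop_count([100, 20, 20], [200, 200, 10])
```

The SynOp count scales with actual spiking activity rather than with network size, which is why sparse activity can put the spiking network at or below MAC parity.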
The spiking neural networks learn quickly initially (epoch 1 at 94%), but subsequent improvements become slower compared to the artificial neural network. The likely reasons for this slowdown are (1) the use of random backpropagation/direct feedback alignment, and (2) spikes emanating from error-coding neurons becoming very sparse toward the end of training, which prevents fine adjustments of the weights. We speculate that a scheduled or accuracy-based adjustment of the error neuron sensitivity is likely to mitigate the latter cause. Such modifications, along with more sophisticated learning rules involving momentum and learning rate decay, are left for future work.
The effectiveness of stochastic gradient descent degrades when the precision of the synaptic weights, using a fixed-point representation, is smaller than 16 bits (Courbariaux et al.,
Extended simulations suggest that the random BP performance at 10 bits precision is indistinguishable from unquantized weights (Baldi et al.,
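As an illustration of the kind of fixed-point weight quantization discussed here, consider the following hedged sketch (the rounding scheme, clipping range, and default 10-bit width are assumptions for illustration, not the exact scheme used in the simulations):

```python
import numpy as np

def quantize_fixed_point(w, n_bits=10, w_max=1.0):
    """Round weights onto an n-bit signed fixed-point grid spanning
    [-w_max, w_max] (sketch). With n_bits = 10 there are 511 positive
    levels, so the quantization step is w_max / 511."""
    levels = 2 ** (n_bits - 1) - 1
    step = w_max / levels
    return np.clip(np.round(w / step), -levels, levels) * step
```

Applying such a function after each weight update simulates learning with reduced-precision synaptic memory.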
The gradient-descent BP rule is a powerful algorithm that is ubiquitous in deep learning, but when implemented in a neuromorphic substrate, it relies on the immediate availability of network-wide information stored with high-precision memory. More specifically, (Baldi et al.,
Taken together, our results suggest that general-purpose deep learning using streaming spike-event data in neuromorphic platforms at artificial neural network proficiencies is realizable.
Our experiments target neuromorphic implementations of spiking neural networks with embedded plasticity. Membrane voltage-based learning rules implemented in mixed-signal neuromorphic hardware (Qiao et al.,
Spiking neural networks, especially those based on I&F neuron types, severely restrict the computations available during learning and inference. With the wide availability of graphical processing units and future dedicated machine learning accelerators, the neuromorphic spike-based approach to learning machines is often heavily criticized as being misguided. While this may be true for some hardware designs and on metrics based on absolute accuracy on most standardized benchmark tasks, neuromorphic hardware dedicated to embedded learning can have distinctive advantages thanks to: (1) asynchronous, event-based communication, which considerably reduces the communication between distributed processes, and (2) natural exploitation of “rate” codes and “spike” codes where single spikes are meaningful, leading to fast, power-efficient, and gradual responses (Figure
Many examples that led to the unprecedented success in machine learning have substantial overlap with equivalent neural mechanisms, such as normalization (Ioffe and Szegedy,
Our learning rule builds on the feedback alignment learning rule demonstrating that random feedback can deliver useful teaching signals by aligning the feedforward weights with the feedback weights (Lillicrap et al.,
Several approaches successfully realized the mapping of pretrained artificial neural networks onto spiking neural networks using a firing rate code (O'Connor et al.,
An intermediate approach is to learn online with standard BP using spikebased quantization of network states (O'Connor and Welling,
STDP has been shown to be very powerful in a number of different models and tasks related to machine learning (Thorpe et al.,
Thus, there is considerable benefit in hardware implementations of synaptic plasticity rules that forgo the causal updates. Such rules, which we refer to as spike-driven plasticity, can be consistent with STDP (Brader et al.,
A common feature among spike-driven learning rules is a modulation or gating with a variable that reflects the average firing rate of the neuron, for example through calcium concentration (Graupner and Brunel,
The two-compartment neuron model used in this work is motivated by conductance-based dynamics in Urbanczik and Senn (
This article demonstrates a local, event-based synaptic plasticity rule for deep, feedforward neural networks achieving classification accuracies on par with those obtained using equivalent machine learning algorithms. The learning rule combines two features: (1) algorithmic simplicity: one addition and two comparisons per synaptic update, provided one auxiliary state per neuron; and (2) locality: all the information for the weight update is available at the neuron and the synapse. The combination of these two features enables synaptic plasticity dynamics for neuromorphic deep learning machines.
Our results lay out a key component for the building blocks of spike-based deep learning using neural and synaptic operations largely demonstrated in existing neuromorphic technology (Chicca et al.,
One limitation of eRBP relates to the “loop duration,” i.e., the time necessary from the input onset to a stable response in the error neurons. This duration scales with the number of layers, raising the question of whether eRBP can generalize to very deep networks without impractical delays. Future work currently under investigation is to augment eRBP with recently proposed synthetic gradients (Jaderberg et al.,
It can be reasonably expected that the deep learning community will uncover many variants of random BP, including in recurrent neural networks for sequence learning and memory augmented neural networks. In tandem with these developments, we envision that such RBP techniques will enable the embedded learning of pattern recognition, attention, working memory, and action selection mechanisms which promise transformative hardware architectures for embedded computing.
This work has focused on unstructured, feedforward neural networks and a single benchmark task across multiple implementations for ease of comparison. Limitations in deep learning algorithms are often invisible on “toy” datasets like MNIST (Liao et al.,
In artificial neural networks, the mean-squared cost function for one data sample in a single-layer neural network is:
where
and where η is a small learning rate. In deep networks, i.e., networks containing one or more hidden layers, the weights of the hidden layer neurons are modified by backpropagating the errors from the prediction layer using the chain rule:
where the δ for the topmost layer is
In the random BP rule considered here, the BP term δ is replaced with:
where
In the context of models of biological spiking neurons, RBP is appealing because it circumvents the problem of calculating the backpropagated errors and does not require bidirectional synapses or symmetric weights. RBP works remarkably well in a wide variety of classification and regression problems, using supervised and unsupervised learning in feedforward networks, with a small penalty in accuracy.
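The substitution at the heart of RBP can be sketched in a few lines (a hedged illustration; the names and shapes are ours): in standard BP the hidden-layer error is computed with the transpose of the feedforward weights, whereas RBP uses a fixed random matrix G in its place.

```python
import numpy as np

def hidden_delta_bp(W_out, delta_out, dphi):
    """Standard BP: hidden-layer error via the transpose of the
    feedforward output weights W_out (shape n_out x n_hid)."""
    return (W_out.T @ delta_out) * dphi

def hidden_delta_rbp(G, delta_out, dphi):
    """Random BP: W_out.T is replaced by a fixed random matrix G
    (shape n_hid x n_out), so no symmetric (bidirectional) synapses or
    transport of the feedforward weights is required."""
    return (G @ delta_out) * dphi
```

Since G is fixed and random, the feedback pathway needs no knowledge of the feedforward weights, which is the property that makes the rule attractive for a neural substrate.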
The above BP rules are commonly used in artificial neural networks, where neuron outputs are represented as single scalar variables. To derive an equivalent spike-based rule, we start by identifying this scalar value with the neuron's instantaneous firing rate. The cost function and its derivative for one data sample are then:
where
Random BP (Equation 6) is straightforward to implement in artificial neural network simulations. However, spiking neurons and synapses, especially with the dynamics that can be afforded in low-power neuromorphic implementations, typically do not have arbitrary mathematical operations at their disposal. For example, evaluating the derivative of ϕ can be difficult depending on the form of ϕ, and multiplications between the multiple factors involved in RBP can become very costly, given that they must be performed at every synapse for every presynaptic event.
In the following, we derive an event-driven version of RBP that uses only two comparisons and one addition per presynaptic spike to perform the weight update. The derivation proceeds as follows: (1) derive the firing rate ν, i.e., the equivalent of ϕ in the spiking neural network; (2) compute its derivative
The dynamics of spiking neural circuits driven by Poisson spike trains is often studied in the diffusion approximation (Wang,
where
In this case, the neuron's membrane potential dynamics is an Ornstein-Uhlenbeck (OU) process (Gardiner,
where
The firing rate of neuron
where “erf” stands for the error function. The firing rate of neuron
For gradient descent, we require the derivative of the neuron's activation function with respect to the weight
As in previous work (Neftci et al.,
In the considered spiking neuron dynamics, the Gaussian function is not directly available. Although a sampling scheme based on the membrane potential could approximate the derivative, here we follow a simpler solution: backed by extensive simulations, and inspired by previously proposed membrane-potential-gated learning rules (Brader et al.,
The resulting derivative function is similar in spirit to straight-through estimators used in machine learning (Courbariaux and Bengio,
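The boxcar surrogate for the derivative can be sketched directly (the window boundaries below are illustrative, not the values used in the simulations):

```python
import numpy as np

def boxcar_derivative(V, b_min=-1.0, b_max=1.0):
    """Boxcar surrogate for the activation derivative (sketch): 1 when
    the membrane potential lies inside [b_min, b_max], else 0. This
    replaces the bell-shaped derivative of the diffusion-approximation
    rate function and acts like a straight-through estimator."""
    return ((V >= b_min) & (V <= b_max)).astype(float)
```

In hardware, this surrogate reduces the derivative evaluation to the two comparisons already counted in the eRBP update.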
For simplicity, the error
Each pair of error neurons synapse with a leaky dendritic compartment
The weight update for the hidden layers is similar, except that a random linear combination of the error is used instead of
All weight initializations are scaled with the number of rows and the number of columns as
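Since the exact scaling formula is cut off above, the following Glorot-style fan-in/fan-out scaled initialization is only an assumed illustration of such a scheme, not the paper's exact prescription:

```python
import numpy as np

rng = np.random.default_rng(2)

def init_weights(n_in, n_out):
    """Fan-in/fan-out scaled initialization (assumed Glorot-style sketch):
    uniform on [-b, b] with b = sqrt(6 / (n_in + n_out)), so the weight
    scale shrinks with the number of rows and columns."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```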
In the following, we detail the spiking neuron dynamics that can efficiently implement eRBP.
The network used for eRBP consists of one or two feedforward layers (Figure
(1)
where
(2)
where
where Θ is a boxcar function with boundaries
(3)
The spike trains at the data layer were generated using a stochastic neuron with instantaneous firing rate [exponential hazard function (Gerstner and Kistler,
where
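A hedged sketch of such an exponential-hazard input encoding is given below (the precise hazard formula is cut off above; β and γ follow the parameter table, while rate_scale, the argument form, and the time step are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def input_spikes(x, beta=0.5, gamma=-0.215, rate_scale=1.0, dt=1e-3,
                 steps=100):
    """Stochastic input neurons with an exponential hazard (sketch): the
    instantaneous firing rate grows exponentially with the scaled input
    beta * (x + gamma); spikes are drawn independently per time step."""
    rate = rate_scale * np.exp(beta * (x + gamma))  # illustrative rate, Hz
    p = np.clip(rate * dt, 0.0, 1.0)                # spike prob. per step
    return rng.random((steps,) + x.shape) < p
```

Brighter pixels (larger x) thus map to exponentially higher spike rates, which is the qualitative behavior the data layer needs.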
Neural states and synaptic weight of the prediction neuron after 500 training examples.
In practice, we find that neurons tend to strongly synchronize in late stages of the training. The analysis provided above does not accurately describe synchronized dynamics, since one of the assumptions for the diffusion approximation is that spike times are uncorrelated. Multiplicative stochasticity was previously shown to be beneficial for regularization and decorrelation of spike trains, while being easy to implement in neuromorphic hardware (Neftci E. et al.,
We trained fully connected feedforward networks on two datasets, the standard MNIST handwritten digits (LeCun et al.,
To keep the durations of the spiking simulations tractable, learning was run for 60 epochs (MNIST) or 30 epochs (EMNIST), compared to 1,000 epochs on the GPU. This is not a major limitation, since errors appear to converge earlier in the spiking neural network. During a training epoch, each of the training digits was presented during 250
All learning rates were kept fixed during the simulation. Other
We tested eRBP training on a spiking neural network based on the Auryn simulator (Zenke and Gerstner,
Parameters used for the continuous-time spiking neural network simulation implementing eRBP.
Number of data neurons  All networks  784
Number of hidden neurons  All networks  100, 200, 400, 1000
Number of label neurons  All networks  10
Number of positive error neurons  All networks  10
Number of negative error neurons  All networks  10
Number of prediction neurons  All networks  10
σ  Poisson noise weight  eRBP_{+}: 50·10^{−3}; eRBP_{×}: 0·10^{−3}
Blankout probability  eRBP_{+}: 1.0; eRBP_{×}: 0.45
τ_{refr}  Refractory period  Prediction and hidden neurons: 3.9; Data neurons: 4.0
τ_{syn}  Synaptic time constant  All synapses  4
Leak conductance (state)  Prediction and hidden neurons  1
Leak conductance (state)  Prediction and hidden neurons  5
Membrane capacitance  All neurons  1
Firing threshold  Prediction and hidden neurons: 100; Error neurons: 100
Number of training samples used  All figures  50000
Number of training samples used  Table  10000
Number of training samples used  Table  1000
Number of training samples used  Table  10000
Training sample duration  All models  100
Testing sample duration  Table  500
Testing sample duration  Table  250
Initial weight matrix  RBP, BP
Initial weight matrix  eRBP_{+}
Initial weight matrix  eRBP_{×}
eRBP_{+}, eRBP_{×}  90·10^{−3} nA
eRBP_{+}, eRBP_{×}  90·10^{−3} nA
eRBP_{+}, eRBP_{×}  −90·10^{−3} nA
eRBP_{+}, eRBP_{×}  −1.15, 1.15
2nd hidden layer  eRBP_{+}, eRBP_{×}  25, 25
Figure  eRBP_{+}, eRBP_{×}  −0.6, 0.6
β  Data neuron input scale  eRBP_{+}, eRBP_{×}  0.5
γ  Data neuron input threshold  eRBP_{+}, eRBP_{×}  −0.215
η  Learning rate  eRBP_{+}: 6·10^{−4} nS; eRBP_{×}: 10·10^{−4} nS; RBP, BP: 0.4/
Minibatch size  RBP(100), BP(100): 100; RBP(1), BP(1): 1
EN and GD: designed and conducted experiments, and wrote the paper. EN, GD, SP, and CA: contributed software and tools.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work was partly supported by the Intel Corporation, by the National Science Foundation under grant 1640081, and by the Nanoelectronics Research Corporation (NERC), a wholly-owned subsidiary of the Semiconductor Research Corporation (SRC), through Extremely Energy Efficient Collective Electronics (EXCEL), an SRC-NRI Nanoelectronics Research Initiative under Research Task ID 2698.003. We thank Friedemann Zenke for support with the Auryn simulator; Jun Haeng Lee and Peter O'Connor for review and comments; and Gert Cauwenberghs, João Sacramento, and Walter Senn for discussion.
^{1}Such logical “and” operation on top of a graded signal was previously used for conditional signal propagation in neuromorphic VLSI spiking neural networks (Neftci et al.,
^{2}or equivalently, for the purpose of the derivative evaluation, the activation function is approximated as a rectified linear with hard saturation at