
Edited by: Gert Cauwenberghs, University of California, San Diego, United States

Reviewed by: Sadique Sheik, University of California, San Diego, United States; John V. Arthur, IBM, United States. Bruno Umbria Pedroni contributed to the review of John V. Arthur.

*Correspondence: Bodo Rueckauer

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Deep Artificial Neural Network (ANN) architectures such as GoogLeNet (Szegedy et al.,

Recent work has shown that the event-based mode of operation in SNNs is particularly attractive for reducing the latency and computational load of deep neural networks (Farabet et al.,

Multi-layered spiking networks have been implemented on digital commodity platforms such as FPGAs (Neil and Liu,

In order to bridge the gap between Deep Learning continuous-valued networks and neuromorphic spiking networks, it is necessary to develop methods that yield deep

A more straightforward approach is to take the parameters of a pre-trained ANN and to map them to an equivalent-accurate SNN. Early studies on ANN-to-SNN conversion began with the work of Perez-Carrasco et al. (

These approaches achieve very good results on MNIST, but the SNN results are below state-of-the-art ANN results when scaling up to networks that can solve CIFAR-10 (Krizhevsky, ^{1}

In this work, we address some important shortcomings of existing ANN-to-SNN conversion methods. Through mathematical analysis of the approximation of the output firing rate of a spiking neuron to the equivalent analog activation value, we were able to derive a theoretical measure of the error introduced in the previous conversion process. On the basis of this novel theory, we propose modifications to the spiking neuron model that significantly improve the performance of deep SNNs. By developing spiking implementations of max-pooling layers, softmax activation, neuron biases, and batch normalization (Ioffe and Szegedy,

To automate the process of transforming a pre-trained ANN into an SNN, we developed an SNN-conversion toolbox that is able to transform models written in Keras (Chollet, ^{2}

The remainder of the paper is organized as follows: section 2.1 outlines the conversion theory and section 2.2 presents the methods for implementing the different features of a CNN. The work in these two sections is extended from earlier work in Rueckauer et al. (

The basic principle of converting ANNs into SNNs is that firing rates of spiking neurons should match the graded activations of analog neurons. Cao et al. (

We assume here a one-to-one correspondence between an ANN unit and a SNN neuron, even though it is also possible to represent each ANN unit by a population of spiking neurons. For a network with $L$ layers, let $W^l$, $l \in \{1, \dots, L\}$, denote the weight matrix connecting units in layer $l-1$ to layer $l$, with biases $b^l$. The number of units in each layer is $M^l$. The ReLU activation of the continuous-valued neuron $i$ in layer $l$ is computed as:

$$a_i^l := \max\Big(0, \; \sum_{j=1}^{M^{l-1}} W_{ij}^l \, a_j^{l-1} + b_i^l \Big), \qquad (1)$$

starting with $a^0 = x$, where $x$ is the input, normalized so that each $x_i \in [0, 1]$^{3}.

Each SNN neuron has a membrane potential $V_i^l(t)$ which integrates an input current at every time step,

$$z_i^l(t) := V_{\mathrm{thr}} \Big( \sum_{j=1}^{M^{l-1}} W_{ij}^l \, \Theta_{t,j}^{l-1} + b_i^l \Big), \qquad (2)$$

where $V_{\mathrm{thr}}$ is the threshold and $\Theta_{t,j}^{l-1} \in \{0, 1\}$ indicates whether neuron $j$ in layer $l-1$ emitted a spike at time step $t$.

Every input pattern is presented for $T$ time steps, with time resolution $\Delta t \in \mathbb{R}^{+}$. The highest firing rate supported by a time-stepped simulator is given by the inverse time resolution, $r_{\max} := 1/\Delta t$.

The principle of the ANN-to-SNN conversion method as introduced in Cao et al. (

The spiking neuron integrates its input $z_i^l(t)$ until the membrane potential $V_i^l(t)$ exceeds the threshold $V_{\mathrm{thr}}$, at which point a spike is generated. The membrane potential is then either reset to zero, or reset "by subtraction", i.e., $V_{\mathrm{thr}}$ is subtracted from the membrane potential at the time when it exceeds the threshold:

$$V_i^l(t) = \begin{cases} \big(V_i^l(t-1) + z_i^l(t)\big)\big(1 - \Theta_{t,i}^l\big) & \text{(reset to zero)} \\ V_i^l(t-1) + z_i^l(t) - V_{\mathrm{thr}}\,\Theta_{t,i}^l & \text{(reset by subtraction).} \end{cases}$$
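To make the two reset variants concrete, the following minimal Python sketch (not part of the conversion toolbox; the function name and parameters are illustrative) simulates a single non-leaky integrate-and-fire neuron driven by a constant input charge per time step:

```python
def if_neuron_rate(z, v_thr=1.0, t_steps=1000, reset="subtract"):
    """Simulate one non-leaky integrate-and-fire neuron driven by a constant
    input charge z per time step; return its average firing rate in
    spikes per time step.

    reset="subtract": V_thr is subtracted from the membrane potential on a spike.
    reset="zero":     the membrane potential is reset to 0 on a spike.
    """
    v, spikes = 0.0, 0
    for _ in range(t_steps):
        v += z                                   # integrate the input current
        if v >= v_thr:                           # threshold crossing -> spike
            spikes += 1
            v = v - v_thr if reset == "subtract" else 0.0
    return spikes / t_steps

# With z = 0.3 * V_thr the target rate is 0.3 * r_max. Reset-by-subtraction
# approaches it, while reset-to-zero discards the surplus charge above
# threshold and systematically under-estimates the rate:
print(if_neuron_rate(0.3, reset="subtract"))   # ~0.30
print(if_neuron_rate(0.3, reset="zero"))       # 0.25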

From these membrane equations, we can derive slightly different approximation properties for the two reset mechanisms. In this section we analyze the first hidden layer and expand the argument in section 2.1.2 to higher layers. We assume that the input currents $z_i^1$ to the first hidden layer remain constant over time.

As expected, the spike rates are proportional to the ANN activations $a_i^1$, up to an approximation error term. A higher threshold $V_{\mathrm{thr}}$ and smaller inputs improve the approximation at the expense of longer integration times. Using the definition of $z_i^1$ in Equation (2), and normalizing the parameters so that a neuron never receives more than $V_{\mathrm{thr}}$ of charge within a single time step, we can keep this error term small.

A simple switch to the reset-by-subtraction mechanism improves the approximation further: the surplus charge above threshold is not discarded at reset, but carried over to the next time step.

The previous results were based on the assumption that the neuron receives a constant input, which in general only holds for the first hidden layer, and only if the input image is presented as a constant current rather than as a spike train.

This equation states that the firing rate of a neuron in layer $l$ is given by the weighted sum of the firing rates of the previous layer, reduced by a time-decaying approximation error,

with

In this section we introduce new methods that improve the classification error rate of deep SNNs (Rueckauer et al.,

Biases are standard in ANNs, but were explicitly excluded by previous conversion methods for SNNs. In a spiking network, a bias can simply be implemented with a constant input current of the same sign as the bias. Alternatively, one could present the bias with an external spike input of constant rate proportional to the ANN bias, as proposed in Neftci et al. (

One source of approximation errors is that in time-stepped simulations of SNNs, the neurons are restricted to a firing rate range of $[0, r_{\max}]$, whereas ANNs typically do not have such constraints. Weight normalization was introduced by Diehl et al. (

The normalization factor is set to the maximum activation of a layer, $\lambda^l = \max[a^l]$; the weights $W^l$ and biases $b^l$ are then rescaled to $W^l \rightarrow W^l \, \lambda^{l-1} / \lambda^l$ and $b^l \rightarrow b^l / \lambda^l$.

Although weight normalization avoids firing rate saturation in SNNs, it might result in very low firing rates, thereby increasing the latency until information reaches the higher layers. We refer to the algorithm described in the previous paragraph as “max-norm,” because the normalization factor λ^{l} was set to the maximum ANN activation within a layer, where the activations are computed using a large subset of the training data. This is a very conservative approach, which ensures that the SNN firing rates will most likely not exceed the maximum firing rate. The drawback is that this procedure is prone to be influenced by singular outlier samples that lead to very high activations, while for the majority of the remaining samples, the firing rates will remain considerably below the maximum rate.

Such outliers are not uncommon, as shown in Figure

Distribution of all non-zero activations in the first convolution layer of a CNN, for 16,666 CIFAR-10 samples, plotted in log scale. The dashed line in both plots indicates the 99.9th percentile of all ReLU activations across the data set, corresponding to a normalization scale λ = 6.83. This is more than three times lower than the overall maximum of λ_max = 23.16.

We propose a more robust alternative where we set $\lambda^l$ to the p-th percentile of the total activity distribution of layer $l$^{4}; we refer to this scheme as "robust normalization". This discards extreme outliers and increases the SNN firing rates for a larger fraction of samples.
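The following sketch illustrates this data-based normalization, assuming the per-layer ANN activations have already been recorded on a subset of the training data; the function and variable names are ours for illustration, not the toolbox API:

```python
import numpy as np

def normalize_parameters(weights, biases, activations, percentile=99.9):
    """Rescale weights and biases layer by layer so that the chosen percentile
    of the ANN activation distribution maps to the maximum SNN firing rate.

    weights, biases: lists of per-layer parameter arrays (layers 1..L).
    activations:     list of per-layer ReLU activation arrays, recorded on a
                     representative subset of the training data.
    """
    prev_scale = 1.0  # lambda^0: the input is assumed normalized to [0, 1]
    for l in range(len(weights)):
        scale = np.percentile(activations[l], percentile)   # robust lambda^l
        weights[l] = weights[l] * prev_scale / scale        # W^l -> W^l * lambda^{l-1} / lambda^l
        biases[l] = biases[l] / scale                       # b^l -> b^l / lambda^l
        prev_scale = scale
    return weights, biases
```

Setting percentile=100 recovers the conservative max-norm scheme described in the previous paragraph.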

Batch-normalization (BN) reduces internal covariate shift in ANNs and thereby speeds up the training process. BN introduces additional layers where affine transformations of inputs are performed in order to achieve zero mean and unit variance. An input $x$ is transformed into $\mathrm{BN}[x] = \frac{\gamma}{\sigma}(x - \mu) + \beta$, where the mean $\mu$, the variance $\sigma^2$, and the scale and shift parameters $\gamma, \beta$ are learned during training. After training, these parameters are fixed, so the BN transformation can be folded into the weights and biases of the preceding layer; the converted SNN then needs no separate normalization step at inference time.
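A sketch of this standard folding, assuming BN is applied per output channel and the channel axis of the kernel is last (as in Keras); the helper name is illustrative:

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-3):
    """Absorb a batch-normalization layer into the weights and biases of the
    preceding layer, so that the combined layer computes the same function.

    w:     weight array with the output-channel axis last, shape (..., n_out)
    b:     bias vector, shape (n_out,)
    gamma, beta, mean, var: learned BN parameters, each of shape (n_out,)
    """
    sigma = np.sqrt(var + eps)
    scale = gamma / sigma                     # per-channel scale gamma / sigma
    w_folded = w * scale                      # W~_ij = (gamma_i / sigma_i) * W_ij
    b_folded = scale * (b - mean) + beta      # b~_i  = (gamma_i / sigma_i) * (b_i - mu_i) + beta_i
    return w_folded, b_folded
```

For a Keras convolution kernel of shape (kh, kw, c_in, c_out), broadcasting over the last axis applies the per-channel scale correctly.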

Because event-based benchmark datasets are rare (Hu et al.,

Here, we interpret the analog input activations as constant currents. Following Equation (2), the input to the neurons in the first hidden layer is obtained by multiplying the corresponding kernels with the analog input image $x$:

$$z_i^1 := V_{\mathrm{thr}} \Big( \sum_{j=1}^{M^{0}} W_{ij}^1 \, x_j + b_i^1 \Big).$$

This results in one constant charge value $z_i^1$ per neuron, which is added to its membrane potential at every time step.

Softmax is commonly used on the outputs of a deep ANN, because it results in normalized and strictly positive class likelihoods. Previous approaches for ANN-to-SNN conversion did not convert softmax layers, but simply predicted the output class corresponding to the neuron that spiked most during the presentation of the stimulus. However, this approach fails when all neurons in the final layer receive negative inputs, and thus never spike.

Here we implement two versions of a spiking softmax layer. The first is based on the mechanism proposed in Nessler et al. (
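One possible realization of such a softmax readout is sketched below; here, for simplicity, the external trigger fires at every time step rather than following a Poisson process, and all names are illustrative rather than the toolbox implementation:

```python
import numpy as np

def spiking_softmax_step(v_mem, rng):
    """One readout step: sample at most one output spike, with probabilities
    given by the softmax of the accumulated membrane potentials v_mem.
    Output-layer neurons never fire on their own; they only accumulate input."""
    p = np.exp(v_mem - v_mem.max())   # numerically stable softmax
    p /= p.sum()
    spikes = np.zeros_like(v_mem)
    spikes[rng.choice(len(v_mem), p=p)] = 1.0
    return spikes

rng = np.random.default_rng(0)
v_mem = np.array([0.2, 1.5, -0.3])        # accumulated inputs of 3 output neurons
spike_counts = sum(spiking_softmax_step(v_mem, rng) for _ in range(100))
predicted_class = int(np.argmax(spike_counts))   # class with most output spikes
```

A simpler variant dispenses with output spikes entirely and takes the argmax of the accumulated membrane potentials at the end of the stimulus presentation; this also works when all inputs to the final layer are negative.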

Most successful ANNs use max-pooling to spatially down-sample feature maps. However, this has not been used in SNNs because computing maxima with spiking neurons is non-trivial. Instead, simple average pooling used in Cao et al. (
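One way to approximate max-pooling with spikes (a sketch of a gating mechanism based on online firing-rate estimates, not necessarily the exact variant used in the toolbox; the decay constant is an illustrative choice) is shown below:

```python
import numpy as np

class SpikingMaxPool:
    """Gating-based spiking max-pooling over a pool of input neurons.

    An exponentially weighted online estimate of each input's firing rate is
    maintained; at every time step only the spike of the currently
    maximally-firing input is passed on to the output.
    """
    def __init__(self, pool_size, decay=0.9):
        self.rate_est = np.zeros(pool_size)
        self.decay = decay

    def step(self, spikes):
        # spikes: binary vector of length pool_size for the current time step
        self.rate_est = self.decay * self.rate_est + (1 - self.decay) * spikes
        winner = np.argmax(self.rate_est)
        return spikes[winner]          # forward only the winner's spike (0 or 1)
```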

To obtain the number of operations in the networks during classification, we define as fan-in $f_{\mathrm{in}}$ the number of incoming connections to a neuron, and similarly as fan-out $f_{\mathrm{out}}$ the number of outgoing projections to neurons in the subsequent layer. To give some examples: in a convolutional layer, the fan-in is given by the size of the 2-dimensional convolution kernel multiplied by the number of channels in the previous layer. In a fully-connected layer, the fan-in simply equals the number of neurons in the preceding layer. The fan-out of a neuron in a convolutional layer that is followed by another convolutional layer is determined by the kernel size and the number of channels of that subsequent layer (reduced near the feature-map borders and for strides larger than one).

In the case of the ANN, the total number of floating-point operations for the classification of one frame is given by:

$$\mathrm{Ops}_{\mathrm{ANN}} = \sum_{l=1}^{L} f_{\mathrm{in},l} \, n_l,$$

with $n_l$ the number of neurons in layer $l$, each operation being a multiply-accumulate (MAC).

In the case of an SNN, only additions are needed when the neuron states are updated. We adopt the notation from Merolla et al. (^{5}

$$\mathrm{Ops}_{\mathrm{SNN}} = \sum_{t=1}^{T} \sum_{l=1}^{L} f_{\mathrm{out},l} \, s_l(t),$$

where $s_l(t)$ denotes the number of spikes emitted by the neurons in layer $l$ at time step $t$.
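Under these definitions, the two operation counts can be computed as in the sketch below, where fan-in and fan-out are taken as per-neuron values for each layer:

```python
def ann_ops(fan_in, n_neurons):
    """Total operations (MACs) for one ANN forward pass:
    sum over layers of fan-in per neuron times number of neurons."""
    return sum(f * n for f, n in zip(fan_in, n_neurons))

def snn_ops(fan_out, spike_counts):
    """Total synaptic operations for one SNN classification: sum over time
    steps and layers of fan-out per neuron times spikes emitted by that layer.

    spike_counts[t][l] holds the number of spikes of layer l at time step t.
    """
    return sum(f * s for spikes_t in spike_counts
                     for f, s in zip(fan_out, spikes_t))
```

Because the SNN count grows with simulation time, recording it together with the running error rate at every time step yields the error-vs.-operations curves discussed in section 3.4.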

In the ANN, the number of operations needed to classify one image, consisting of the cost of a full forward-pass, is a constant. In the SNN, the image is presented to the network for a certain simulation duration, and the network outputs a classification guess at every time step. By measuring both the classification error rate and the operation count at each step during simulation, we are able to display how the classification error rate of the SNN gradually decreases with increasing number of operations (cf

The two different modes of operation (single forward pass in the ANN vs. continuous simulation in the SNN) have significant implications when aiming for an efficient hardware implementation. One well-known fact is that the additions required in SNNs are cheaper than the multiply-accumulate (MAC) operations needed in ANNs. For instance, our simulations in a GlobalFoundries 28 nm process show that the cost of performing a 32-bit floating-point addition is about 14× lower than that of a MAC operation, and that the corresponding chip area is reduced by 21×. It has also been shown that memory transfer outweighs the energy cost of computations by two orders of magnitude (Horowitz,

There are two ways of improving the classification error rate of an SNN obtained via conversion: (1) training a better ANN before conversion, and (2) improving the conversion by eliminating approximation errors of the SNN. We proposed several techniques for these two approaches in section 2; in sections 3.1 and 3.2 we evaluate their effect using the CIFAR-10 data set. Section 3.3 extends the SNN conversion methods to the ImageNet data set. In section 3.4 we show that SNNs feature an accuracy-vs.-operations trade-off that allows tuning the performance of a network to a given computational budget.

The networks were implemented in Keras (Chollet,

The methods introduced in section 2 allow conversion of CNNs that use biases, softmax, batch-normalization, and max-pooling layers, all of which improve the classification performance of the ANN. The performance of a converted network was quantified on the CIFAR-10 benchmark (Krizhevsky,

Classification error rate on MNIST, CIFAR-10 and ImageNet for our converted spiking models, compared to the original ANNs, and compared to spiking networks from other groups.

| Dataset [model] | ANN error (%) | SNN error (%) | # Neurons | # Connections |
| --- | --- | --- | --- | --- |
| MNIST [ours] | 0.56 |  | 8 k | 1.2 M |
| MNIST [Zambrano and Bohte] | 0.86 | 0.86 | 27 k | 6.6 M |
| CIFAR-10 [ours, BinaryNet sign] | 11.03 | 11.75 | 0.5 M | 164 M |
| CIFAR-10 [ours, BinaryNet Heaviside] | 11.58 | 12.55 | 0.5 M | 164 M |
| CIFAR-10 [ours, BinaryConnect, binarized at inference] | 16.81 | 16.65 | 0.5 M | 164 M |
| CIFAR-10 [ours, BinaryConnect, full precision at inference] | 8.09 |  | 0.5 M | 164 M |
| CIFAR-10 [ours] | 11.13 | 11.18 | 0.1 M | 23 M |
| CIFAR-10 [Esser et al.] | NA | 12.50 | 8 M | NA |
| CIFAR-10 [Esser et al.] | NA | 17.50 | 1 M | NA |
| CIFAR-10 [Hunsberger and Eliasmith]^{*} | 14.03 | 16.46 | 50 k | NA |
| CIFAR-10 [Cao et al.]^{**} | 20.88 | 22.57 | 35 k | 7.4 M |
| ImageNet [ours, VGG-16]^{†} | 36.11 (15.14) | 50.39 (18.37) | 15 M | 3.5 B |
| ImageNet [ours, Inception-V3]^{††} | 23.88 (7.01) |  | 11.7 M | 0.5 B |
| ImageNet [Hunsberger and Eliasmith]^{‡} | NA | 48.20 (23.80) | 0.5 M | NA |

ImageNet error rates are given as top-1 (top-5).

Figure

Influence of novel mechanisms for ANN-to-SNN conversion on the SNN error rate for CIFAR-10.

SNNs are known to exhibit a so-called accuracy-latency trade-off (Diehl et al.,

Accuracy-latency trade-off. Robust parameter normalization (red) enables our spiking network to correctly classify CIFAR-10 samples much faster than using our previous max-normalization (green). Not normalizing leads to classification at chance level (blue).

This accuracy-latency trade-off is very prominent in the case of the classic LeNet architecture on MNIST (

VGG Simonyan and Zisserman (

While the conversion pipeline outlined in section 2 can deliver converted SNNs that produced equivalent error rates as the original ANNs on the MNIST and CIFAR-10 data sets, the error rate of the converted Inception-V3 was initially far from the error rate of the ANN. One main reason is that neurons undergo a transient phase at the beginning of the simulation because a few neurons have large biases or large input weights. During the first few time steps, the membrane potential of each neuron needs to accumulate input spikes before it can produce any output. The firing rates of neurons in the first layer need several time steps to converge to a steady rate, and this convergence time is increased in higher layers that receive transiently varying input. The convergence time is decreased in neurons that integrate high-frequency input, but increased in neurons integrating spikes at low frequency^{6}^{7}

In order to overcome the negative effects of transients in neuron dynamics, we tried a number of possible solutions, including different initializations of the neuron states, different reset mechanisms, and bias relaxation schemes. The most successful approach we found was to clamp the membrane potential to zero for the first few time steps of the simulation, with a clamping duration that increases linearly with the depth of the layer.

This simple modification of the SNN state variables removes the transient response completely (see Figure ^{8}
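A minimal sketch of such a clamped membrane update is shown below; the per-layer clamp duration used here is an illustrative value, not the one used in our experiments:

```python
import numpy as np

def clamped_update(v_mem, z, t, layer_idx, v_thr=1.0, clamp_per_layer=10):
    """Membrane update with initial clamping: the potentials of layer l are
    held at zero for the first l * clamp_per_layer time steps, so that deeper
    layers only start integrating once their input rates have settled."""
    if t < layer_idx * clamp_per_layer:
        return np.zeros_like(v_mem), np.zeros_like(v_mem)   # clamped: no spikes
    v_mem = v_mem + z
    spikes = (v_mem >= v_thr).astype(v_mem.dtype)
    v_mem = v_mem - v_thr * spikes                           # reset by subtraction
    return v_mem, spikes
```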

We expect that the transient of the network could be reduced by training the network with constraints on the biases or the β parameter of the batch-normalization layers. Table

The neurons in our spiking network emit events at a rate proportional to the activation of the corresponding unit in the ANN. Target activations with reduced precision can be approximated more quickly and accurately with a small number of spike events. For instance, if the activations are quantized into values of {0, 0.1, 0.2, …, 0.9, 1.0}, the spiking neuron can perfectly represent each value within at most 10 time steps. On the other hand, to approximate an activation value at 16-bit precision, the neuron in the worst case would have to be active for 2^{16} = 65,536 time steps.

To demonstrate the potential benefit of using low-precision activations when transforming a given model into a spiking network, we apply the methods from section 2.2 to BinaryNet Courbariaux et al. (

By virtue of the quantized activations, these two SNNs are able to approximate the ANN activations with very few operations (see Figure

Classification error rate vs number of operations for the BinaryNet ANN and SNN implementation on the complete CIFAR-10 dataset.

Classification error rate vs number of operations for the LeNet ANN and SNN implementation on the MNIST dataset.

The lowest error rate for our converted spiking CIFAR-10 models is achieved using BinaryConnect (Courbariaux et al.,

This work presents two new developments. The first is a novel theory that describes the approximation of SNN firing rates to their equivalent ANN activations. The second is a set of techniques to convert almost arbitrary continuous-valued CNNs into spiking equivalents. By implementing SNN-compatible versions of common CNN features such as max-pooling, softmax, batch normalization, biases, and Inception modules, we allow a larger class of CNNs, including VGG-16 and GoogLeNet Inception-V3, to be converted into SNNs. Table

In addition to the improved SNN results on MNIST and CIFAR-10, this work presents, for the first time, a spiking network implementation of the VGG-16 and Inception-V3 models, using simple non-leaky integrate-and-fire neurons. The top-5 error rates of the SNNs during inference lie close to those of the original ANNs. Future investigations will aim to identify additional conversion methods that allow the VGG-16 SNN to reach the error rate of the ANN. For instance, we expect a reduction of the initial transients observed in the higher layers of large networks by training the networks with constraints on the biases.

With BinaryNet (an 8-layer CNN with binary weights and activations tested on CIFAR-10) (Courbariaux et al.,

The converted networks highlight a remarkable feature of spiking networks: while ANNs require a fixed amount of computation to achieve a classification result, the error rate of a spiking network drops off rapidly during inference as an increasing number of operations is used to classify a sample. The network classification error rate can thus be tailored to the number of operations that are available during inference, allowing for accurate classification at low latency and on hardware systems with limited computational resources. In some cases, the number of operations needed for correct classification can be reduced significantly compared to the original ANN. We found computational savings of about 2× both for smaller full-precision networks (e.g., LeNet, with 8 k neurons and 1.2 M connections) and for larger low-precision models (e.g., BinaryNet, with 0.5 M neurons and 164 M connections). These savings did not scale up to very large networks such as VGG-16 and Inception-V3, with more than 11 M neurons and over 500 M connections. One reason is that each additional layer in the SNN introduces another stage where high-precision activations need to be approximated by discrete spikes. We show in Equation (5b) that this error vanishes over time. But since higher layers are driven by inputs that contain approximation errors from lower layers (cf. Equation 6), networks of increasing depth need to be simulated longer for an accurate approximation. We are currently investigating spike encoding schemes that make more efficient use of temporal structure than the present rate-based encoding. Mostafa et al. (

Finally, this conversion framework allows the deployment of state-of-the-art pre-trained high-performing ANN models onto energy-efficient real-time neuromorphic spiking hardware such as TrueNorth (Benjamin et al.,

BR developed the theory, implemented the methods, conducted the experiments and drafted the manuscript. YH implemented and tested the spiking max-pool layer. I-AL contributed to some of the experiments. MP and S-CL contributed to the design of the experiments, the analysis of the data, and to the writing of the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The reviewer, SS, and handling Editor declared their shared affiliation.

We thank Jun Haeng Lee for helpful comments and discussions, and the reviewers for their valuable contributions.

The Supplementary Material for this article can be found online at:

^{1}

^{2}

^{3}This analysis focuses on applications with image data sets, which are generally transformed in this way. The argument could be extended to the case of zero-centered data by interpreting negative input to the first hidden layer of the SNN as coming from a class of inhibitory neurons, and inverting the sign of the charge deposited in the post-synaptic neuron.

^{4}This distribution is obtained by computing the ANN activations on a large fraction of the training set. From this, the scaling factor can be determined and applied to the layer parameters. This has to be done only once for a given network; during inference the parameters do not change.

^{5}This

^{6}An ANN neuron responds precisely the same whether (A) receiving input from a neuron with activation 0.1 and connecting weight 0.8, or (B) activation 0.8 and weight 0.1. In contrast, the rate of an SNN neuron will take longer to converge in case (A) than in (B). This phenomenon forms the basis of the accuracy-latency trade-off mentioned above: One would like to keep firing rates as low as possible to reduce the operational cost of the network, but has to sacrifice approximation accuracy for it.

^{7}Even though the parameters in each layer were normalized such that the input to each neuron is below threshold, this does not guarantee that all biases are sub-threshold: their effect could be reduced by inhibitory input spikes. While such inhibitory synaptic input is still missing at the onset of the simulation, the output dynamics of a neuron will be dominated by a large bias.

^{8}As our neuron model does not contain any time constant, this unit should be read as “spikes per simulation time step” and is not related to spikes per wall-clock time.