^{1}

^{*}

^{†}

^{2}

^{†}

^{1}

^{1}

^{2}

^{3}

^{1}

^{1}

^{1}

^{2}

^{4}

^{2}

^{1}

^{1}

^{2}

^{3}

^{4}

Edited by: André van Schaik, Western Sydney University, Australia

Reviewed by: Hesham Mostafa, University of California, San Diego, United States; Terrence C. Stewart, University of Waterloo, Canada; Mark D. McDonnell, University of South Australia, Australia

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

†These authors have contributed equally to this work

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The memory requirement of deep learning algorithms is considered incompatible with the memory restriction of energy-efficient hardware. A low memory footprint can be achieved by pruning obsolete connections or reducing the precision of connection strengths after the network has been trained. Yet, these techniques are not applicable to the case when neural networks have to be trained directly on hardware due to the hard memory constraints. Deep Rewiring (DEEP R) is a training algorithm which continuously rewires the network while preserving very sparse connectivity all along the training procedure. We apply DEEP R to a deep neural network implementation on a prototype chip of the 2nd generation SpiNNaker system. The local memory of a single core on this chip is limited to 64 KB and a deep network architecture is trained entirely within this constraint without the use of external memory. Throughout training, the proportion of active connections is limited to 1.3%. On the handwritten digits dataset MNIST, this extremely sparse network achieves 96.6% classification accuracy at convergence. Utilizing the multi-processor feature of the SpiNNaker system, we found very good scaling in terms of computation time, per-core memory consumption, and energy constraints. When compared to a X86 CPU implementation, neural network training on the SpiNNaker 2 prototype improves power and energy consumption by two orders of magnitude.

The number of connections is the main limiting factor in the up-scaling of neural network implementations. It dominates the required chip area (Benjamin et al.,

Here, we show an implementation of a deep neural network with learning capabilities on a neuromorphic hardware chip. Both the training phase and the inference phase are implemented on a 2nd generation SpiNNaker prototype. All the memory that is required for neuron and synapse parameters is kept in the fast, local 64 KB SRAM of the chip. To operate the network at such a low memory footprint, we use the recently introduced Deep Rewiring (DEEP R) model for training deep networks with sparse weight matrices (Bellec et al.,

Our implementation makes efficient use of the hardware accelerators that are available on the SpiNNaker prototype chip. The chip is equipped with high-throughput accelerators for fixed-point exponentiation and random number generation. Both operations are commonly needed in the neural network domain and are also used in the DEEP R algorithm. We show that using the hardware accelerators results in a speedup of around 1.62 × in our setup without any further software optimizations. Together with parallelization from one core to four cores, a total speedup of 7.7 × is achieved.

We evaluated our implementation of DEEP R on the SpiNNaker prototype chip using a benchmark network architecture for learning the MNIST benchmark dataset. The complete network can be simulated on a single core of the chip using < 40 KB of the SRAM (see section 3.1) and also on multiple cores for time speedup and higher energy efficiency purpose. In section 3.2 we study the computation time speedup that can be gained by using hardware accelerators and multiple cores. Finally, in section 3.3 we study the power and energy requirements of our system. In summary, our results demonstrate a full on-chip deep learning system, that achieves 96.6% accuracy on the MNIST benchmark dataset, using < 4% memory compared to a fully connected network, and only 2% of the power budget compared to running the same algorithm on an X86 processor.

SpiNNaker (Furber et al.,

In this article, we use the first SpiNNaker2 prototype chip which has been implemented in GLOBALFOUNDRIES 28 nm SLP CMOS technology (Höppner et al.,

System overview of the SpiNNaker 2 prototype system. Components used in this article are highlighted in green.

A photo of our setup is shown in Figure

Photo of the entire setup and the prototype chip (Höppner et al.,

We apply the DEEP R algorithm (see section 2.3) to train a network architecture that was previously chosen in Han et al. (

Details to the implementation of the neural network architecture and execution procedures. The network input and target data is held in the slow DRAM _{i}) are propagated forward through the network and errors (Δ_{i}) are propagated backward. The 10 neurons of the output layer (y) correspond to 10 classes of digits. The weight matrices stores the connection weights and related sparsity are marked with W plus percentage. The target (lab) is encoded with one-hot code. The circled numbers with arrows at the bottom present the order of execution procedures in one training iteration and their scopes.

To benchmark the performance characteristics of the implementation on the prototype chip, we used the MNIST dataset (LeCun et al.,

In DEEP R, every potential connection of the neural network can at any time during training either be

The DEEP R algorithm alternates stochastic gradient descent (SGD) steps and rewiring steps. SGD updates are applied to weights of all active connections using error backpropagation, while the weight of dormant connections are by definition 0 and do not need to be stored or updated. In addition _{1}-regularization and gradient noise is applied to active connections in each SGD step. For each active weight that crosses zero (changes sign) after the SGD update, DEEP R sets this connection dormant and activates another connection that was dormant. New connections are randomly drawn from all possible neuron pairs within the layer with equal probability. The new connection is initialized with a weight of 0 and then evolves according to SGD updates.

It was proven in Bellec et al. (_{1}-regularization—crosses zero in the very next time step, and disappears. As a result, only functional connections have a high chance to remain in the network.

DEEP R is implemented on the SpiNNaker 2 prototype chip in an online fashion, i.e., one input image is processed after another. A summary of the algorithm is given in Algorithm

Pseudocode of the online version of the DEEP R algorithm implemented on-chip

Each backpropagation step at line 3 is split into a forward and a backward pass. In the forward pass—also called inference—the activation generated by an image is propagated from the first layer to the last through the network to predict the image label. In the backward pass, the predicted label is compared to the target label and the error is backpropagated from the last layer to the first. The learning rate η for SGD is initialized to 0.05 and decays by a factor of 0.5 every 2 training epochs. The _{1}-regularization parameter of DEEP R is set to 10^{−5}. The temperature

The DEEP R algorithm ensures that the memory consumed by the network connectivity is constant and the memory requirement is a parameter of the algorithm that can be chosen to fit hardware limitations. We constrained the number of connections individually for each weight matrix, i.e., the number of connections between adjacent layers, such that each weight matrix consumes a fixed amount of memory on one particular core.

To perform the rewiring step at line 4 of Algorithm 1, we delete all the weights that changed signs during the SGD step, and reconnect the corresponding number of connections elsewhere. The weights in each layer constitute a sparse weight matrix, and there exist several possibilities how such sparse matrices can be stored efficiently in memory. The choice of the matrix format was indeed important to optimize both memory consumption and speed. To optimize speed one has to take into account that within the forward and the backward passes of backpropagation it is required to multiply the sparse weight matrices from two different directions. In the rewiring step it is required to delete and insert elements efficiently.

The popular matrix formats, compressed sparse row (CSR) and compressed sparse column (CSC) first described in Tinney and Walker (

Illustration of the sparse matrix format.

The rewiring step is split into a deletion and an insertion step. The deletion step erases the entries of all weights that changed the sign during the last gradient descent step. The insertion step replaces the deleted coefficients by other randomly selected matrix entries. The deletion is performed by passing once through the four arrays shown in Figure

The neural network we implemented in this article is relatively simple and can be fully stored in a 64 KB SRAM of one ARM core. But with the escalating complexity of the modern machine learning architectures, a network involving numerous parameters may not be accommodated in one machine. Also, a larger network results in more computation time and a larger memory footprint. It is thus necessary to discuss how to use multiple computation nodes to train a deep neural network. SpiNNaker 2 is a multi-core system that naturally features multi-core distributed implementations such that we employed the 4-core prototype chip to implement the above network.

Neural network training algorithms are memory-intensive because they involve massive data sets and complex network architectures with large parameter sets. In terms of these two factors, there are in principle two strategies for parallel training (Shrivastava et al.,

Data too big to fit into the memory of a single node can be partitioned across several nodes. The learning model is duplicated into each node and trained with different input data simultaneously. This approach maintains the model structure in each node and facilitates rapid deployment across multiple nodes.

If network (or model) parameters cannot be accommodated in a single node, they can be split into several sub-modules and stored in different nodes. In this case the same data is fed into different nodes, but on each node only a subset of the model is trained. In other words, the computation of the entire model is separately allocated on the distributed nodes.

The network structures considered in this article are feedforward neural networks. Although the connections between adjacent layers are sparse, the memory is dominated by the weight matrices and not by the neuron activities (see section 3.1). With data parallelism, all weight matrices are duplicated in each node. On the other hand, model parallelism is well-suited to distribute large models where model parameters dominate the memory consumption. Therefore, we did not adopt data parallelism but model parallelism as the implemented distribution methodology.

The performance improvement of model parallelism depends directly on how the model is split. A good model partition scheme should balance the computational load of each node while minimizing the data dependency between nodes.

Figures

Different partition schemes and their communication overhead analysis. The forward pass (fp) and backward pass (bp) directions are marked in blue and red, respectively. Given that the matrix has a size of _{1}, _{2}, _{3}. _{1}, _{2}, _{3}, _{4}.

Another partition scheme is channel-wise partitioning, under which each node stores partial parameters of one layer. These partial parameters throughout all layers are combined to form a channel. Whether in forward pass or backward pass, to compute the output of one layer multiple nodes need to communicate with each other and broadcast their intermediate results because each node has only partial parameters and cannot complete the computation alone. The channel-wise partition makes distributed computing independent of the network structure so it has a better adaptability to different network depths or layer sizes, and the computational load is evenly distributed on each node to maximize the use of resources. We adopted the channel-wise partition to implement the distributed training so the weight matrices, various activations, errors, and gradients are deployed on different nodes and are updated locally.

To be more specific, the channel-wise partition has also several variants. If we focus on the weight matrix, the most computation-intensive operation in backpropagation is essentially the vector-matrix multiplication of the form

We analyze the communication pattern in the forward pass under 1D vertical partition as an example. Firstly, the input vector is duplicated onto _{i}, _{1} sends the vector component _{1} with the length of _{j},

We adopted the checkerboarding partition scheme to partition the network model. _{h} and low-end fraction _{l} of the input vector are loaded into 2 cores respectively. In the second step, each vector fraction is multiplied by the local weight sub-matrix, obtaining 4 vector components _{l1}, _{l2} and _{h1}, _{h2}. In the third step, the non-diagonal cores vertically send the local vector components to diagonal cores and merge them with their local vector components to get the output vector's high-end fraction _{h} and low-end fraction _{l}, e.g., in core 1 we gathered _{l} = _{l1}+_{l2}. In the fourth step, the diagonal cores horizontally synchronize the vector fractions to non-diagonal cores, so that the low-end fraction _{l} locates in core 1, 2 and the high-end fraction _{h} in core 3, 4. At this time the vector fractions in each core are ready for the vector-matrix multiplication of the next layer.

Data flow in 4 cores for parallel vector-matrix multiplication. The diagonal cores (core 1, 4) are highlighted in gray. ①Importing _{l} into core 1,2 and _{h} into core 3,4. ②Vector-matrix multiplication in each core. ③Non-diagonal cores vertically send local results _{l2},_{h1} to diagonal cores and merge the local results there to _{l} and _{h}. ④Diagonal cores horizontally broadcast the output vector components to non-diagonal cores (core 1 to core 2, core 4 to core 3).

Unlike the vector-matrix multiplication, the rewiring step involves matrix element addressing, removing, appending and sorting in a sparse matrix space. The parallelization does not negatively affect the speed of these operations. Instead it improves element addressing and sorting because the splitting of one big weight matrix into 4 quarter matrices reduces the lengths of the arrays represented in Figure

We implemented a benchmark neural network architecture, previously used in Han et al. (

To train the network on the prototype chip with only 64 KB of SRAM memory per core, we used the DEEP R algorithm (Bellec et al.,

Network training with the fixed connectivity of 1.3% converged in 9 epochs with a classification accuracy of 96.6% on the test set (see Figure

Classification accuracy of different setups. The accuracy of a fully connected network (FC, dashed line) is 98.2%, while sparsely connected networks (1.3% connection probability) achieve 96.6% with DEEP R (red) and 81.3% without online rewiring (blue).

In the following, we analyze the memory profile, time consumption, power and energy consumption of the implementation based on 1 core and 4 cores of the prototype chip respectively.

Figure

Memory profile of a fully connected network (FC Overall) and a sparse rewired network with one-core (1PA Overall) and four-core implementations (4PA Overall). Here number plus P indicates the utilized processor numbers and A stands for using hardware accelerators. For FC, the weight matrices, activity vectors, and temporary storage occupy 1072.34, 6.28, and 2.14 KB, respectively. In 1PA, there is no change on vectors and temporary storage but the weight matrix portion shrinks to 28.21 KB. In the four-core distributed implementation, the overall memory requirement of temporary storage, activity vector and weight matrix rise to 8.56, 15.16, and 28.24 KB. Dividing these values by four is the used resources in each core, which is indicated with 4PA per core. The total memory footprint in each case is shown above each bar.

The DEEP R algorithm reduced the total memory footprint from 1080.78 KB down to 36.63 KB (with the small decrease in the performance described above). This enabled us to perform full training on a single core of the prototype chip with its limited SRAM of 64 KB. The weight matrices dominate memory consumption in all scenarios. Comparing the weight matrix portion between FC and 1PA, the sparse connectivity is 1.3% while the memory footprint percentage is 28.21/1072.34 = 2.6%, because storing the position of a sparse weight with COO format doubled the consumed memory.

The 2D channel-wise parallelization allocates the sub weight matrices into different cores, so there is no change on the overall volume of weight matrices, but it increased the activity vectors storage with 2.4 × and quadrupled the temporary storage. Two of this 2.4 are attributed to the duplicated storage of activity vectors by parallelization and the rest 0.4 come from some additional memory overhead of core-core communication. All these factors raised the overall memory consumption from 36.63 up to 51.96 KB. However, under 4-core implementation the available SRAM is also grown from 64 up to 256 KB, so the memory constraints are actually alleviated. For visualization, the memory footprint of one core under parallelization is depicted as the rightmost bar in Figure

In a vanilla implementation of DEEP R with rewiring at each update step, the time needed for image data import, forward pass, backward pass, and rewiring accounted for 0.2, 2.5, 7.2, and 90.1% of the total computation time, respectively. Rewiring is the most time intensive step, because randomly replenishing new connections involve random number generations and sorting of matrix entries in the chosen memory-efficient sparse matrix format (see section 2.4). To improve runtime, we refined the algorithm such that it performs rewiring operations in only a fraction of the iterations. We refer to the number of iterations (i.e., parameter updates) between two consecutive rewiring operations as the rewiring period. Figure

Accuracy and time consumption of applying rewiring with different periods. The iteration number between 2 rewiring operations is defined as the rewiring period. The time consumption is normalized by the time required with a rewiring period of 1 and depicted with percentage.

On the SpiNNaker 2 system, hardware accelerators are available for random number generation and exponential function. Therefore, to perform any of these two operations, the ARM core can directly invoke the corresponding hardware module, consuming considerably fewer clock cycles, and less energy. In the discussed application, initialization of the weight matrix, generation of new connections, and noisy gradient updates involve random processes, and the softmax function in the last layer includes exponential operations. We therefore compare below the training time of the system with hardware acceleration and a pure ARM implementation that does not make use of these hardware accelerators.

The SpiNNaker 2 system is designed as a massively parallel processing network with many parallel cores that communicate via lightweight messages. Parallel training using multiple cores does not only allow for more complex networks, but is also expected to reduce training time. We tested parallelization of network training in the SpiNNaker 2 system, where we channel-wise partitioned the model into 4 channels and deployed them to 4 processors of the prototype chip.

Training time was evaluated in four setups (see Figure

Time comparison on X86 platform and on SpiNNaker prototype chip with different setups. 1P indicates that 1 processor is utilized running pure ARM software, 1PA stands for 1 processor with hardware-based acceleration for random number generation and exponential computation, and 4PA stands for a 4-core parallelization with hardware acceleration.

The profiled power consumption on the two platforms is listed in the first row in Table

Power and energy consumption of neural network training with DEEP R on an X86 platform and the SpiNNaker2 prototype with one (1P, 1PA) and four (4PA) cores.

Power [mW] | 20,500 | 88 | 88 | 227 |

Time [s] | 209 | 1,805 | 1,082 | 228 |

Energy [J] | 4,285 | 159 | 95 | 52 |

The prototype chip can be configured in a fine-grained form so that we could selectively activate different modules of the chip in different execution phases. The whole training execution could be divided into two phases: import (step 1 in Figure

Considering the differences of computation time and power consumption on both platforms, Table

For comparison of the energy efficiency with other hardware systems in section 4, we estimate that it takes around 7,500 FLOPs (7 k for multiply-accumulation, 500 for ReLU) for each digit classification, which consumes 23 μ

In the classical von Neumann architecture, computing time and energy consumption is dominated by memory access. Memory access is particularly costly in this computing paradigm, due to the von Neumann bottleneck. By using a massively parallel architecture, the brain avoids the von Neumann bottleneck. This is believed to be one important factor contributing to the extraordinary energy efficiency of the brain and mimicked in neuromorphic hardware. However, in massively parallel architectures, communication is still a bottleneck (now, not between the central processor and the memory, but between the parallel processors or neurons). This is the case for both, the brain, and artificial neuromorphic systems. The brain volume is dominated by white matter, that is, by long-range connections between neurons. In the SpiNNaker system, there is a communication bandwidth limit between the parallel cores. But an even more severe bottleneck in the neural network implementation considered here is the local memory needed to store synaptic parameters. In other words, biological and artificial information processing systems face the same problems in one or the other way. The DEEP R algorithm was inspired by the way how this problem is tackled in the brain, that is, by an ongoing rewiring of connections. This uses the available resources in a flexible way such that network structure can be adjusted to the task demands. We have shown in this article that this biological principle is also applicable to artificial neuromorphic systems.

We showed that the SpiNNaker 2 prototype chip can be used for handwritten digit classification as exemplified by the MNIST dataset. The network used is a fully connected feedforward network whereas a convolutional network is necessary to achieve state-of-the-art performance on interesting computer vision problems, and recurrent networks are necessary to tackle speech recognition. In Bellec et al. (

We first evaluate the memory requirements of the convolutional network used in Bellec et al. (_{c}×_{c} is the number of parameters and

Scaling-up to real-world images and speech recognition. To study the importance of DEEP R on larger problems we evaluate the memory consumed by the storage of parameters in blue and the activity vectors required for backpropagation in green.

For the TIMIT dataset for speech recognition, a network of 200 recurrent LSTM units was used in Bellec et al. (

This scaling analysis shows that training of models for real-world image and speech recognition will soon be possible on neuromorphic hardware. With the use of DEEP R and a distributed memory storage, a few tens of cores similar to those of our SpiNNaker 2 prototype are enough to store and process the necessary information. With this software solution, such energy-efficient chips will not suffer from their memory restrictions to solve deep learning problems.

A number of papers show implementations of

Almost all of the above examples use pre-trained, fixed-weight deep networks. This is likely due to the fact that neuromorphic systems usually have very narrowly configurable plasticity functions unsuitable for ANNs (Indiveri et al.,

In the form of

Besides applying deep learning on neuromorphic hardware, in recent years, there have been many efforts in building dedicated hardware for efficient DNN processing and developing models with a low memory and compute footprint with negligible loss of performance. The common objective is the application of DNNs for inference in mobile systems with limited memory and computing power (smartphones, robotics, IoT devices, etc.).

Memory reduction is desirable for two reasons: First, on embedded systems the available memory for storing DNN parameters and activations is very limited, and second, the energy cost per memory access can dominate the overall power consumption, e.g., an external memory access can consume more than two orders of magnitude more energy than one multiply-accumulate operation (Horowitz,

At the same time, there is active research in the

Typically employs many multiply-accumulate (MAC) units as this is the main operation in DNNs. Often, low-bit integer operations are implemented as they require less power and silicon area than floating point units Horowitz (

The training of machine learning models with memory constraints is especially useful for low-power mobile and edge devices: For example, in the context of Internet of Things, smart sensors can employ deep learning to extract relevant features from raw data. While pre-trained models can be loaded to the sensors, an in-place training and fine-tuning allow for adapting the model to the environment or changes of sensor characteristics (e.g., from aging). Aside from that, simple robots interacting with the environment can learn autonomously to perform specific tasks. Especially reinforcement learning is highly applicable to the device as training examples are immediately available. Only a few algorithms for deep learning with a small memory footprint have been proposed. (Cheng et al.,

In this article, we presented a scalable implementation of a 4-layer DNN for digit recognition on a SpiNNaker 2 prototype chip using the DEEP R algorithm (Bellec et al.,

The X86 implementation of the DEEP R algorithm used in this paper will be made available at publication of the paper.

CL performed the experiment on the prototype chip and measured the metrics with the aid of BV. GB and RL conceived and designed the DEEP R algorithm with the help of DK. JP and FN developed the hardware accelerator of exponential function and random number generation. SH and SF contributed to the chip design. RL, WM, and CM supervised the findings of this work. All authors discussed the results and contributed to the final manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors thank ARM and Synopsis for IP and the Vodafone Chair at TU Dresden for contributions to the RTL design. We thank Yexin Yan from HPSN chair for his contributions to the code base. Discussions at the yearly Fürberg workshop of the Human Brain Project contributed to the DEEP R concept and implementation.