
Edited by: Nikola K. Kasabov, Auckland University of Technology, New Zealand

Reviewed by: Sadique Sheik, SynSense AG, Switzerland; Zhongrui Wang, The University of Hong Kong, Hong Kong SAR, China

This article was submitted to Neuromorphic Engineering, a section of the journal Frontiers in Neuroscience

†These authors have contributed equally to this work

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Recent years have witnessed an increasing demand for using spiking neural networks (SNNs) to implement artificial intelligence systems, and with it a demand for combining SNNs with reinforcement learning (RL) architectures and finding an effective training method. Recently, the temporal coding method has been proposed to train spiking neural networks while preserving the asynchronous nature of spiking neurons. We propose a training method that enables the temporal coding method in RL tasks. To tackle the problem of high sparsity of spikes, we introduce a self-incremental variable to push each spiking neuron to fire, which makes SNNs fully differentiable. In addition, an encoding method is proposed to solve the problem of information loss of temporal-coded inputs. The experimental results show that SNNs trained by our proposed method can achieve performance comparable to that of state-of-the-art artificial neural networks in benchmark reinforcement learning tasks.

Neuromorphic engineering aims to emulate the dynamics of biological neurons and synapses with silicon circuits and run spiking neural networks (SNNs) to achieve cognitive behaviors (Mead,

Among the recently proposed training methods of SNNs (Zhang and Li,

While there is an increasing demand for applying SNNs to reinforcement learning (RL) tasks (Tang et al.,

Compared with temporal coding SNNs, the rate coding SNNs that are currently most widely used pay a cost mainly in response delay and accuracy. An SNN based on temporal coding uses the activation times of the input layer to represent information, so an inference can be completed in one activation cycle. Rate coding SNNs must estimate information from the activation frequency over a period of time, which takes longer and loses accuracy. In addition, the transcoding process also loses accuracy, which is not the case for temporal coding SNNs.

However, training SNNs for reinforcement learning tasks with the TC method poses two critical challenges when the input and output of SNNs are continuous values. Firstly, the derivative of current temporal-coded SNNs does not exist everywhere in the network during training, which degrades the performance of back-propagation training for RL tasks. Unless the derivative of the SNN exists everywhere, the existing TC methods cannot converge for RL tasks. Secondly, due to the intrinsic computing paradigm of SNNs, if an input signal is encoded as a relatively large value, especially when it arrives after the first output neuron spikes, it cannot effectively participate in the training and inference of SNNs. A sophisticated signal encoding method is required to map input signals to spike times in a restricted range so that all the input signals are used effectively.

Inspired by the excellent performance of the TC method, we attempt to apply it to reinforcement learning tasks in this work. We propose a Continuous-Valued Temporal Coding (CVTC) method to tackle the above-mentioned challenges. To the best of our knowledge, this is the first work to train SNNs with temporal coding methods for RL tasks. The main contributions are as follows:

We design a fully differentiable temporal-coded SNN architecture (see Section 3). By introducing a self-incremental factor to each spiking neuron, the proposed SNN architecture ensures that each neuron is differentiable almost anytime and everywhere during training.

We propose a signal encoding method for continuous input signals (see Section 4). Based on a mixture of spatial and temporal coding techniques, the novel encoding method transforms input signals to spike times and solves the problem of losing the information carried by later-arriving spikes.

Experimental results show the effectiveness of the proposed CVTC method for RL tasks (see Section 5). The SNN trained by the CVTC method achieves performance comparable to that of the state-of-the-art ANN in the DDQN framework with the same number of network parameters.

Most of the studies on temporal coding methods focus on how to transform spiking neurons' input spike times to their output spike times and calculate derivatives (Neftci et al.,

Among these three methods, Mostafa (

where _{out}. By integrating Equation (1), the membrane potential for

The neuron spikes when its membrane potential crosses the firing threshold, which is set to 1. The spike time t_{out} is then given implicitly by:

where C = {i: t_{i} < t_{out}} is the subset of input spikes that actually affect the output neuron. The exponential form of t_{out} can then be written as:

If exp(t_{out}) is denoted as z_{out}, the input-output relation of a spiking neuron takes the same form as that of a typical artificial neuron. In this way, the back-propagation technique can be used for training SNNs.
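For reference, the relations described above can be written out as follows. This is a sketch that assumes the non-leaky integrate-and-fire neuron with a unit-time-constant exponential synaptic kernel used in Mostafa's temporal coding method; the symbols V, w_i, t_i, C, and z are chosen here for illustration and may differ from the article's own notation and equation numbering.

```latex
\begin{align}
\frac{dV(t)}{dt} &= \sum_i w_i\,\Theta(t - t_i)\,e^{-(t - t_i)} \\
V(t) &= \sum_{i:\,t_i < t} w_i\left(1 - e^{-(t - t_i)}\right) \\
1 &= \sum_{i \in C} w_i\left(1 - e^{-(t_{out} - t_i)}\right),
  \qquad C = \{\, i : t_i < t_{out} \,\} \\
e^{t_{out}} &= \frac{\sum_{i \in C} w_i\, e^{t_i}}{\sum_{i \in C} w_i - 1} \\
z_{out} &= \frac{\sum_{i \in C} w_i\, z_i}{\sum_{i \in C} w_i - 1},
  \qquad z_i = e^{t_i},\; z_{out} = e^{t_{out}}
\end{align}
```

In this form, a finite spike time requires the causal weights to sum to more than 1; when no causal set satisfies this, the neuron never reaches the threshold.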

In the temporal coding method, to ensure that back-propagation works properly and that the output neurons emit spikes, the following conditions must hold:

Otherwise, exp(t_{out}) is set to INF. We notice that, due to the sparsity of spikes in SNNs, most of the neuron outputs are set to INF. In the next section, we present the proposed training method based on the equations above.

The current TC method uses back-propagation technique to train SNNs for classification tasks. However, for general RL frameworks, such as Deep Q Network (DQN) (Fan et al.,

Taking

Back-propagation cases in RL tasks.

Legal | Legal | Exist | Normal |

INF | Legal | Equal to 0 | Stop |

INF | INF | Equal to 0 | Stop |

Legal | INF | Exist | Error |

To tackle the above problem of derivative discontinuity, we propose a fully differentiable temporal-coding training method in the following part. Section 3.1 introduces a self-incremental variable to make the TC method fully differentiable. Section 3.2 further discusses the impact of the self-incremental variable during the inference phase of trained SNNs.

To solve the problem above, here we modify the spiking neuron model and introduce a self-incremental variable β

By integrating Equation (7), the spike time t_{out} is given implicitly by:

where β is a hyperparameter. Hence, the exponential form of t_{out} can be calculated with:

where the following requirement has to be satisfied:

Otherwise, exp(t_{out}) is set to INF. Our proposed temporal coding method ensures that the derivative for each neuron is always continuous as long as Equation (10) is satisfied.

For the convenience of comparison, we illustrate our algorithm using the same style as Mostafa (_{k}) →

Pseudocode of the forward pass in a feed-forward network with L layers.

z^{0}: Vector of input spike times encoded with Algorithm 1 |

N^{1}, ..., N^{L}: Number of neurons in the L layers |

W^{1}, ..., W^{L}: Set of weight matrices. W^{l}[i, j] is the weight from neuron j in layer l |

z^{L}: Vector of first spike times of neurons in the output layer |

1: |

2: ^{r} |

3: |

4: |

5: |

6: |

7: ^{r} |

8: |

9: |

10: |

Pseudocode of the get_causal_set function.

1: |

2: z^{sorted} ← z[sort_indices] //sorted input vectors |

3: w^{sorted} ← w[sort_indices] //weight vector rearranged to match sorted input vector |

4: |

5: |

6: |

7: |

8: ^{sorted} |

9: |

10: |

11: |

12: |

13: |

14: |
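The causal-set search and the resulting neuron output can be sketched in Python as below. This is a minimal sketch of the baseline temporal coding relation without the self-incremental variable β, following the published form of Mostafa's algorithm; it is not the article's β-modified procedure, whose steps are not recoverable here. The function names `get_causal_set` and `neuron_forward` are illustrative, and inputs are assumed to be given in the z domain (z = exp(t)).

```python
import math
import numpy as np


def get_causal_set(z, w):
    """Return indices of the inputs that cause the output spike (empty if none).

    z: input spike times in the z domain (z = exp(t)).
    w: synaptic weights, same length as z.
    """
    order = np.argsort(z)                 # process inputs in arrival order
    z_sorted, w_sorted = z[order], w[order]
    w_sum = 0.0                           # running sum of weights
    wz_sum = 0.0                          # running sum of w_i * z_i
    for i in range(len(z_sorted)):
        w_sum += w_sorted[i]
        wz_sum += w_sorted[i] * z_sorted[i]
        next_z = z_sorted[i + 1] if i + 1 < len(z_sorted) else math.inf
        # The neuron can only fire if the causal weights sum to more than 1,
        # and the candidate output time must precede the next input spike.
        if w_sum > 1.0 and wz_sum / (w_sum - 1.0) < next_z:
            return order[: i + 1]
    return order[:0]                      # no causal set: the neuron stays silent


def neuron_forward(z, w):
    """Output spike time in the z domain, or INF if the neuron never fires."""
    causal = get_causal_set(z, w)
    if causal.size == 0:
        return math.inf
    return float(np.dot(w[causal], z[causal]) / (np.sum(w[causal]) - 1.0))
```

For example, with `z = np.array([1.0, 5.0])` and `w = np.array([3.0, 1.0])`, only the first input is causal and the neuron fires at z = 1.5, before the second input arrives. With weights that never sum above 1, the output is INF, which is the sparsity problem the self-incremental variable is meant to address.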

A reinforcement learning problem can converge only if the _{1}, the pendulum tilts left, the neuron for moving left keeps spiking faster than that of moving right. The network can keep selecting the left-moving action until the upright state is restored. Around step _{2}, the pendulum is upright, the neuron for moving left and right both spike quickly, which means the expectation is always high, and thus this state is close to the ideal one. At step _{3}, the pendulum has been shifted to the left of the field and tilts to the left. Since the network has not been well trained for this situation yet, the two output neurons' spike order flips between steps. The network cannot continue to select the expected action, left moving, to restore the upright state. It shows that our training method is effective and in line with expectations.

Spike time variation of the output layer of CVTC. _{1}. The car is at the center of the field, and the pendulum is turning left. The spike time for moving right is high. _{2}. The car is at the center of the field, and the pendulum is upright. The spike times are both low. _{3}. The car is on the left of the field, and the pendulum is turning left. The spike times are both high.

In the above section, we introduced a self-incremental variable. In this section, we discuss the impact of this variable when running inference with the trained SNN on real chips. Although the self-incremental variable is usually easy to implement with mixed-signal analog/digital circuits, we further explore how to deploy the trained network on neuromorphic hardware without dedicated modification. Therefore, in the inference stage of a trained SNN, the implementation of this variable is removed: the method given in this section directly uses Equation (4) to calculate the activation time during inference. Because β in Equation (9) is small enough, the difference between Equations (4) and (9) approaches zero once the DQN algorithm has converged.

_{j} is the jth neuron's spike time of output layer using Equation (9), and q_{j} is the jth neuron's spike time of output layer using Equation (4). For any ϵ > 0, there is a small enough β for all j such that

During the training phase of SNNs, we set β as a small enough value and use Equation (9) to calculate the spike time. During the inference phase on real chips, we ignore the self-incremental variable and directly use Equation (4) to implement the circuit. According to the experimental results in Section 5.1, when β is set as a value smaller than 1

The contribution of temporal-coded input information is inherently biased in asynchronous SNNs. In such networks, the input spikes that arrive earlier affect the processing of the subsequent spikes. Thus, the earlier spikes have a higher impact on the SNN's output, which is undesirable. In reinforcement learning tasks, the input signal represents the observation of the environment, such as how far the agent is from the center, how large the angle is, how large the speed is, etc. When we map such a value to a spike time, the timing should ideally not carry any predefined impact factor, because we do not know in advance whether a larger or a smaller observed value should matter more. This should be left for the reinforcement learning algorithm to discover.

Example of the unbalanced-input problem in SNNs. _{3} = 1.0. _{3} = 1.3. Since we take the first activated output neuron as the network's output, it cannot distinguish the two different input patterns in this example. When we delayed _{3} from 1.0 _{2}.

To solve this problem of the inherently biased contribution of input signals, we propose an encoding method based on a mixture of spatial and temporal coding techniques, which can solve the problem largely while keeping the input information intact during the coding procedure. In Section 4.1, we present our encoding method for input signals of SNNs. In Section 4.2, we prove that the encoded input signal can be easily recovered, and there is no information loss during encoding.

We discretize the range [_{i}, we have

where _{i,k} is the value of the k-th point of i-th input channel.

Thus, continuous temporal signals can be mapped into discrete spatial signals, and they can be treated equally by the network as different parts of an input image. However, this coding method is achieved at the cost of losing precision, making it unable to distinguish subtle differences between input signals and only representing 2^{K} different input values.

Then we extend the coding method presented in this section to take advantage of a normal distribution of μ =

where

In this way, the original input signal is encoded as continuous spike time in the range of [

It is worth noting that the recovery of the signal in Section 4.1 relies on the encoded values being lossless. When the output is mixed with noise, the original output activation time cannot be recovered from more than two output signals, so the method in Section 4.1 cannot be directly applied to the output. In addition, it is also difficult to implement the

The pseudocode of our encoding method presented is illustrated in

Pseudocode for encoding the input signals.

_{1}, .., _{L} |

_{1}, .., _{L*K} |

1: _{1}, .., _{L} |

2: _{1}, .., _{L} |

3: |

4: |

5: |

6: _{i*K+k} ← S_{i,k} |

7: |

8: |
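The idea of spreading each continuous input over a group of K coding neurons shaped by a normal distribution can be sketched as follows. This is only an illustration under stated assumptions: the K sample points are taken to be evenly spaced over the input range, the encoded value is taken to be the normal density itself, and the names `encode_channel` and `encode_observation` are hypothetical. The article's exact mapping from density to spike time is not recoverable from the text above, so the units of σ and the output range here are assumptions. Only the flat index layout i*K+k follows the pseudocode.

```python
import numpy as np


def encode_channel(x, low, high, K, sigma):
    """Encode one scalar x in [low, high] as K values (assumed Gaussian code).

    Each of the K evenly spaced sample points receives the normal density
    centred on x, so nearby points respond most strongly.
    """
    points = np.linspace(low, high, K)    # assumed: evenly spaced sample points
    return np.exp(-0.5 * ((points - x) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def encode_observation(obs, lows, highs, K, sigma):
    """Encode an observation with len(obs) channels into len(obs) * K values."""
    encoded = np.empty(len(obs) * K)
    for i, (x, low, high) in enumerate(zip(obs, lows, highs)):
        # Channel i occupies the slice [i*K, (i+1)*K), i.e. index i*K + k.
        encoded[i * K:(i + 1) * K] = encode_channel(x, low, high, K, sigma)
    return encoded
```

With four observation channels and K = 20 (an assumed value), the encoded input has 80 entries, which matches the input size reported for the encoded CartPole networks.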

Here we show that the input encoding method in Section 4.1 is non-destructive and can be recovered to the original input. The probability density function of the normal distribution in Equation (14) is defined as:

Substituting Equation (15) for the normal distribution term in Equation (14), the encoded input signal becomes:

Hence, the original input signal is given by:

Based on Equation (17), _{i} can be easily recovered from the encoded input.
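To illustrate why such a code is invertible, the following worked derivation recovers an input from two encoded values. It is a sketch under an assumption not confirmed by the text: that the encoded value at sample point c_k is the normal density with mean x_i and standard deviation σ. The symbols S_{i,k}, c_k, and c_m are illustrative, and the article's own Equations (14) to (17) may differ in form.

```latex
\begin{align}
S_{i,k} &= \frac{1}{\sigma\sqrt{2\pi}}
           \exp\!\left(-\frac{(c_k - x_i)^2}{2\sigma^2}\right) \\
\ln S_{i,k} - \ln S_{i,m}
        &= \frac{(c_m - x_i)^2 - (c_k - x_i)^2}{2\sigma^2}
         = \frac{(c_m - c_k)\,(c_m + c_k - 2x_i)}{2\sigma^2} \\
x_i     &= \frac{c_k + c_m}{2}
           - \frac{\sigma^2\left(\ln S_{i,k} - \ln S_{i,m}\right)}{c_m - c_k},
           \qquad c_k \neq c_m
\end{align}
```

Under this assumption, any two distinct coding neurons of the same channel determine x_i exactly, so no information is lost as long as the encoded values are noise-free.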

The membrane potential of a spiking neuron when it is about to fire is described as:

where

We can transform Equation (18) to:

where α and γ denote two constants. Then we have:

where

Use an iterative algorithm such as Newton's method to obtain a numerical solution. However, this would greatly reduce the efficiency of the solution.

Use a low-order Taylor expansion as an approximation. However, this would reduce accuracy.

Therefore, we choose to use β

In this paper, we use fully connected layers as the temporal layers. For the MNIST task, we use the same network structure with one hidden layer; the hidden layers of both the CVTC and Temporal Coding (TC) networks have 800 neurons. For the CartPole task, we also use one hidden layer of 800 neurons for all SNN networks, but the input sizes of DDQN-SNN-CVTC and DDQN-SNN-TC-encoded are 80 instead of 4. For the MountainCar task, all networks, both SNN and ANN, have two hidden layers of 12 and 48 neurons. The input size of DDQN-SNN-CVTC is expanded to 40 by input encoding.

In

Hyperparameters for algorithms in experiments.

Optimization algorithm | SGD (Amari, | Adam (Kingma and Ba, | Adam |

Learning rate | 0.01–0.0001 | 0.001251 | 0.001 |

Training batch size | 10 | 32 | 32 |

Target network update frequency | 100 step | 1 episode | |

Replay memory capacity | 1,000 | 200,000 | |

Training batch size | 23.37 | −200 | −106.4 |

γ | 0.99 | 0.99 | |

ϵ | 1–0.1 | 1–0.00001 | |

20 | 15 | ||

20 | 15 | ||

σ | 1.4 | 1.2 | |

β | 0.1, 0.01, 0.001 | 0.001 | 0.001 |

Here we show that the added incremental term can be removed after training, so the trained network can be run in typical SNNs constructed with I&F neurons.

_{j} _{j}

When the DQN algorithm converges, the predicted value of the output layer is upper bounded. Let _{max} denote the maximum activation time of the output layer in the Q network. As discussed at the beginning of Section 4, in each layer of the network, only those inputs that are less than the maximum spike time of the output layer can affect the output layer. Let _{l,j} denote the _{l,j} < _{max}.

It can be deduced that all output times are positive as follows:

So we always have:

where _{l − 1, j} < _{l,j}}, _{l,i,j} denote the weight between neuron

For the first layer of the Q network, the inputs for _{0,j} and _{0,j} are the same. From Equation (8), it can be obtained that:

where _{i} denote the

Simplify the equation:

The lower bound of _{0,j}) can be obtained:

Subtract _{0,j}) on both sides:

where _{0,j}) − β, A, B, and W are all positive. Let:

β should satisfy:

So when we choose

Thus we proved the limit of loss for one layer. Then we generalize it for all layers. For layer

If we choose β = min(β_{l,j}), it always holds that:

Theorem 1 is proved. By choosing a relatively small value of β, the effect of removing the incremental term

In this section, we evaluate whether the proposed method addresses the two problems of the TC method in RL, and we compare it with general RL baselines.

Section 5.1 compares the training results of the TC and CVTC methods based on the MNIST data set and the CartPole environment. The purpose is to prove that the additional increment item β

In Section 5.2, we compare the results obtained with and without the coding method proposed in Section 4 and show that the coding method is effective for RL training.

In Section 5.3, we compare our method with other baseline methods and show that our approach achieves performance comparable to the baselines on benchmark tasks.

Our experiments are carried out in the CartPole-v0 and MountainCar-v0 control environment of OpenAI Gym (Barto et al.,

In Section 3, we analyze the influence of increment item β_{high} = 6 and _{low} = 1 instead.

We choose different β values and train the CVTC method for 20 epochs.

Results in MNIST task.

We show CVTC and TC methods' spike time distribution in MNIST task in _{out} to show the real activation time of neurons instead of _{out}. The TC method uses 1_{out}, and its corresponding spike time is 19.8. As shown in

Then we choose different β and analyzed the effect of the inference phase (^{−2}, the difference between the two errors approaches 0. It proves that when β is small enough, the parameters obtained by the CVTC method can be deployed to the neuromorphic chip without transformation.

Comparison of errors on MNIST task.

Evaluate error (%) | 2.65 | 2.72 | 2.85 |

Test error (%) | 2.67 | 2.72 | 2.85 |

In this section, we evaluate the effectiveness of the input coding method on the CartPole-v0 environment. We replace the deep network with an SNN in the DDQN framework, which is more stable and has been proved to converge in finite time (Xiong et al., _{out} as _{out} has wider range than _{out}. The _{out}. Then the fastest-responding neuron in the output layer corresponds to the best action. We test the following combinations of methods: the DDQN-SNN-CVTC network uses the methods proposed in Sections 4 and 5.1; the DDQN-SNN-CVTC-uncoded network removes the input coding step described in Section 4; the DDQN-SNN-TC network represents the original temporal coding method, which has also been introduced in Section 5.1; and the DDQN-SNN-TC-encoded network adds the method proposed in Section 4 to TC but does not use the network proposed in Section 3.
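Reading the action from the first output spike can be sketched in a few lines. This is a minimal illustration assuming the output layer is summarized as a vector of spike times, one per action, and that exploration is ε-greedy as suggested by the ϵ schedule in the hyperparameter table; the function name `select_action` is hypothetical.

```python
import numpy as np


def select_action(spike_times, epsilon, rng):
    """Pick the action whose output neuron spikes first, with epsilon-greedy exploration.

    spike_times: one output spike time per action (INF for a silent neuron).
    epsilon: probability of choosing a uniformly random action.
    rng: a numpy random Generator.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(spike_times)))   # explore
    return int(np.argmin(spike_times))               # earliest spike wins


rng = np.random.default_rng(0)
action = select_action(np.array([0.8, 0.3]), epsilon=0.0, rng=rng)
```

With ε set to 0, the call above selects action 1, because its neuron spikes at 0.3, earlier than the 0.8 of action 0.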

Training curves on CartPole task.

We evaluate the performance of our approach on Gym basic tasks. The MountainCar environment always returns −1 as the reward, so we need to make the reward positive to ensure the

As shown in

Comparison of performance on Gym basic tasks.

DDQN | 195.95 ± 0.59 | −106.4 ± 1.05 |

PPO | 198.57 ± 0.42 | −96.20 ± 0.46 |

DDQN-SNN-CVTC (ours) | 180.19 ± 2.73 | −108.15 ± 2.1 |

DDQN-SNN-TC | 17.89 ± 0.3 | −199 ± 0 |

Currently, there are still some special challenges in training SNNs for policy-based reinforcement learning tasks with the TC method, due to the existence of two regression networks in the policy gradient algorithm:

This paper presents the CVTC method to train asynchronous SNNs. We introduce a constantly increasing variable for each spiking neuron to ensure that it is differentiable at all times during training. This variable can be removed after training without performance degradation. We then propose a novel temporal coding method that encodes input signals with a normal distribution using a group of input coding neurons, which solves the problem of losing the information carried by later-arriving spikes. Moreover, we theoretically prove that the input information can be easily restored from the encoded spike times. We show that, using our CVTC method, SNNs can be trained for RL tasks and achieve performance comparable to that of the state-of-the-art ANN in the DDQN framework. Code can be found at:

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

GW designed the reinforcement learning architecture. JW analyzed the experimental data. SL conducted the experiments. DL designed the temporal coding training algorithm. All authors contributed to the article and approved the submitted version.

This work was supported in part by the National Natural Science Foundation of China under Grant 62002369.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.