
Edited by: Paul Miller, Brandeis University, United States

Reviewed by: Pieter R. Roelfsema, Netherlands Institute for Neuroscience (KNAW), Netherlands; Mattia Rigotti, IBM Research, United States

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The interplay of reinforcement learning and memory is at the core of several recent neural network models, such as the Attention-Gated MEmory Tagging (AuGMEnT) model.

Memory spans various timescales and plays a crucial role in human and animal learning (Tetzlaff et al.,

Memory mechanisms can be implemented by enriching a subset of artificial neurons with slow time constants and gating mechanisms (Hochreiter and Schmidhuber,

Here, we study and extend the Attention-Gated MEmory Tagging model, or AuGMEnT.

Notably, the

However, in the case of more complex tasks with long trials and multiple stimuli, like 12AX (O'Reilly and Frank,

Overview of the AuGMEnT model. The reward obtained for the previous action is used to compute the TD error δ (green), which modifies the connection weights that contributed to the selection of the previous action, in proportion to their eligibility traces (green lines and text). After this, the temporal eligibility traces, synaptic traces, and tags (in green) on the connections are updated to reflect the correlations between the current pre- and post-synaptic activities. Then, in the feedback pass, spatial eligibility traces (in red) are updated, attention-gated by the current action, via feedback weights.

The paper is structured as follows. Section 2 presents the architectural and mathematical details of hybrid AuGMEnT.

The network controls an agent which, in each time step

In

The regular branch is a standard feedforward network with one hidden layer. The current stimulus

where σ is the sigmoidal function σ(x) = (1 + e^{−x})^{−1}. Input units are one-hot binary with values x_{i} ∈ {0, 1} (equal to 1 if stimulus i is currently presented, 0 otherwise).
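As an illustration, the feedforward pass of the regular branch can be sketched in a few lines of NumPy; the layer sizes and weight scale below are placeholders, not the values used in the paper:

```python
import numpy as np

def sigmoid(x):
    # Logistic function sigma(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def regular_branch(s, W_R):
    """Hidden activity of the regular branch for a one-hot stimulus s.

    s   : binary input vector, s[i] = 1 if stimulus i is shown
    W_R : input-to-hidden weight matrix of shape (n_hidden, n_input)
    """
    return sigmoid(W_R @ s)

# Example with 8 one-hot inputs and 3 regular hidden units (hypothetical sizes)
rng = np.random.default_rng(0)
W_R = rng.normal(scale=0.1, size=(3, 8))
s = np.zeros(8)
s[2] = 1.0  # stimulus 2 is currently shown
y_R = regular_branch(s, W_R)
```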

The memory branch is driven by

where the brackets signify rectification. In the following, we denote the input into the memory branch with a variable
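A minimal sketch of such rectified transient ("on/off") inputs, assuming the memory branch receives the positive parts of the stimulus changes (the variable names here are ours):

```python
import numpy as np

def transient_units(s_now, s_prev):
    """On/off transient input for the memory branch.

    Following the rectification in the text, an 'on' unit responds when a
    stimulus appears and an 'off' unit responds when it disappears.
    """
    on = np.maximum(s_now - s_prev, 0.0)   # [s(t) - s(t-1)]^+
    off = np.maximum(s_prev - s_now, 0.0)  # [s(t-1) - s(t)]^+
    return np.concatenate([on, off])

# stimulus 0 disappears and stimulus 1 appears between two time steps
s_prev = np.array([1.0, 0.0])
s_now = np.array([0.0, 1.0])
x = transient_units(s_now, s_prev)
```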

The memory units in the next layer have to maintain task-relevant information through time. The transient input is transmitted via the synaptic connections

We introduce the factor φ_{j} ∈ [0, 1] here as an extension of the standard AuGMEnT model, which corresponds to the special case φ_{j} ≡ 1 for all j (Figure

Architectures of standard and hybrid AuGMEnT. Standard AuGMEnT uses only conservative memory units with φ_{j} ≡ 1, while hybrid AuGMEnT contains both leaky φ_{j} < 1 and non-leaky φ_{j} = 1 units.

The memory state

The states of the memory units are reset to 0 at the end of each trial.
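The leaky integration and end-of-trial reset described above can be sketched as follows; the class name, array shapes, and the sigmoidal read-out of the memory state are our assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryBranch:
    """Memory units that integrate transient inputs with per-unit decay phi.

    phi[j] = 1 reproduces the conservative units of standard AuGMEnT;
    phi[j] < 1 gives the leaky units of the hybrid model.
    """
    def __init__(self, V, phi):
        self.V = V                      # transient-input-to-memory weights
        self.phi = np.asarray(phi)      # per-unit decay factors in [0, 1]
        self.m = np.zeros(V.shape[0])   # memory states

    def step(self, x):
        # m_j(t) = phi_j * m_j(t-1) + sum_i V_ji * x_i(t)
        self.m = self.phi * self.m + self.V @ x
        return sigmoid(self.m)          # memory unit activity

    def reset(self):
        # memory states are reset to 0 at the end of each trial
        self.m[:] = 0.0

# one conservative and one leaky unit, identity input weights
mem = MemoryBranch(np.eye(2), phi=[1.0, 0.7])
mem.step(np.array([1.0, 1.0]))   # both units store the input
mem.step(np.array([0.0, 0.0]))   # unit 0 keeps it, unit 1 decays
```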

Both branches converge onto the output layer. The activity of an output unit with index k, denoted Q_{k}, estimates the Q-value Q(s, a_{k}), i.e., the expected cumulative discounted reward for taking action a_{k} in the current state:

where γ ∈ [0, 1] is a discount factor. Numerically, the vector

Finally, the

With probability ϵ, the action is instead drawn from a stochastic softmax policy, in which the probability of taking action a_{k} grows with its Q-value,
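A sketch of this ε-softmax ("max-Boltzmann") action selection, with T standing in for the softmax time scale T^{*} of the parameter table; the exact functional form is our assumption:

```python
import numpy as np

def select_action(q, epsilon, T, rng):
    """epsilon-soft action selection over Q-values.

    With probability 1 - epsilon the greedy action is taken; with
    probability epsilon an action is drawn from a softmax with
    temperature T instead.
    """
    if rng.random() < epsilon:
        z = (q - q.max()) / T            # shift for numerical stability
        p = np.exp(z) / np.exp(z).sum()  # softmax probabilities
        return int(rng.choice(len(q), p=p))
    return int(np.argmax(q))

rng = np.random.default_rng(0)
a_greedy = select_action(np.array([1.0, 3.0, 2.0]), 0.0, 1.0, rng)
a_soft = select_action(np.array([1.0, 3.0, 2.0]), 1.0, 0.5, rng)
```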

where ^{*} and a scaling factor ^{*} and the

where

where β is a learning rate; superscripts R and M distinguish eligibility traces at the input-to-hidden connections of the regular and memory branches, respectively.
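The SARSA-style TD error and the tag-proportional weight update can be summarized in a short sketch; the symbol names are ours:

```python
import numpy as np

def td_error(r, q_next, q_prev, gamma):
    # SARSA-style TD error: delta = r + gamma * Q(s', a') - Q(s, a)
    return r + gamma * q_next - q_prev

def apply_update(W, Tag, delta, beta):
    # every synapse moves in proportion to its own eligibility tag
    W += beta * delta * Tag
    return W

delta = td_error(r=1.0, q_next=2.0, q_prev=1.5, gamma=0.9)
W = apply_update(np.zeros((2, 2)), np.ones((2, 2)), delta, beta=0.1)
```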

After the update of weights, a synapse from neuron

where α ∈ [0, 1] is a decay parameter and z_{k} is a binary one-hot variable that indicates the winning action (equal to 1 if action a_{k} was selected, 0 otherwise).
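A sketch of this tag dynamics for the hidden-to-output synapses; the array layout (one row per action) is our choice:

```python
import numpy as np

def update_output_tags(Tag, y_hidden, z, alpha):
    """Tags on hidden-to-output synapses.

    Tag_jk <- (1 - alpha) * Tag_jk + z_k * y_j, where z is the one-hot
    indicator of the winning action, so only synapses onto the selected
    action unit acquire new eligibility.
    """
    return (1.0 - alpha) * Tag + np.outer(z, y_hidden)

y = np.array([0.5, 0.2, 0.8])   # hidden activities
z = np.array([0.0, 1.0])        # action 1 was selected
Tag = update_output_tags(np.zeros((2, 3)), y, z, alpha=0.4)
```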

Similarly, a synapse from neuron

where

Note that the tag _{k},

In the original

After action selection and the updates of weights, tags, and temporal eligibility traces in the feedforward pass, the synapses that contributed to the currently selected action update their spatial eligibility traces in an attentional feedback step. For the synapses from the input to the hidden layer, the tag

where feedback weights from the output layer to the hidden layer have been denoted as w′, and z_{k} ∈ {0, 1} is the value of output unit k in the one-hot action coding.
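The attention-gated feedback step for input-to-hidden tags might then look as follows; for memory units, the presynaptic term x would be replaced by the φ-decayed synaptic trace. Shapes and names are our assumptions:

```python
import numpy as np

def feedback_tags(Tag, x, y, W_fb, z, alpha):
    """Attention-gated tag update for input-to-hidden synapses.

    Only feedback from the currently selected action (one-hot z) reaches
    the hidden layer; sigma'(a) = y * (1 - y) for sigmoidal units.
    """
    attention = W_fb.T @ z          # feedback weight of the winning output unit
    gain = y * (1.0 - y) * attention
    return (1.0 - alpha) * Tag + np.outer(gain, x)

W_fb = np.ones((2, 3))              # 2 actions feeding back to 3 hidden units
z = np.array([1.0, 0.0])            # action 0 was selected
y = np.array([0.5, 0.5, 0.5])       # hidden activities
x = np.array([1.0, 0.0, 0.0, 2.0])  # presynaptic input
Tag = feedback_tags(np.zeros((3, 4)), x, y, W_fb, z, alpha=0.3)
```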

It must be noted that the feedback synapses

For networks with one hidden layer and one-hot coding in the output, attentional feedback is equivalent to backpropagation (Roelfsema and van Ooyen,

even in the presence of a decay factor φ < 1. Here, we specifically discuss the case of the tagging Equations (13) and (15) and the update rule (11) associated with the weight _{j}. Analogous update rules for weights

Proof. We want to show that

For simplicity, here we prove (17) for full temporal decay of the eligibility trace

where _{j}.

We first observe that the right-hand side of Equation (17) can be rewritten as:

Thus, it remains to show that

Similarly to the approach used in backpropagation, we now apply the chain rule and we focus on each term separately:

From Equations (5) and (7), we immediately have that:

We note that, in the feedback step the weight

Finally, for the term

where t_{0} indicates the starting time of the trial and the last approximation derives from the assumption of slow learning dynamics, i.e., the weights remain approximately constant for t_{0} ≤ τ < t.

In conclusion, we combine the different terms and we obtain the desired result:
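In our notation (v_{ij} for the memory weights, m_{j} for the memory state, x̃_{i} for the transient input, and w′ for the feedback weights; this is a reconstruction and should be checked against the original equations), the combined chain-rule terms give, approximately:

```latex
\frac{\partial Q_{a}(t)}{\partial v_{ij}}
  = \frac{\partial Q_{a}}{\partial y_{j}}\,
    \frac{\partial y_{j}}{\partial m_{j}}\,
    \frac{\partial m_{j}}{\partial v_{ij}}
  \approx w'_{ja}\,\sigma'\!\bigl(m_{j}(t)\bigr)
    \sum_{\tau = t_{0}}^{t} \varphi_{j}^{\,t-\tau}\,\tilde{x}_{i}(\tau)
```

which is exactly the quantity accumulated by the tag through the φ-decayed synaptic trace, so the attentional update follows the gradient even for φ_{j} < 1.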

Thus, if the decay factor φ_{j} of the synaptic trace

All simulation scripts were written in Python.

We used the parameters listed in Table 1, with φ_{j} = 1 for the first half of the memory cells and φ_{j} = 0.7 for the second half. To compare with the standard model, we also ran simulations with φ_{j} ≡ 1 for all units, and a purely leaky control with φ_{j} ≡ 0.7 for all units.

Parameters for the AuGMEnT simulations.

| Parameter | Value |
| --- | --- |
| β : Learning parameter | 0.15 |
| λ : Eligibility persistence | 0.15 |
| γ : Discount factor | 0.9 |
| α : Eligibility decay rate | 1 − γλ |
| ϵ : Exploration rate | 0.025 |
| T^{*} : Softmax time scale | |
| | 2000 trials |
| | 10 |

As a first step, we validated our implementations of standard and hybrid AuGMEnT.

Network architecture parameters for the simulations.

8 | ||

3 | 10 | |

8 | 20 | |

2 | 2 |

In the sequence prediction task (Cui et al.,

Scheme of the sequence prediction task. Scheme of sequence prediction trials with sequence length equal to 4 (i.e., 2 distractors): the two possible sequences are:

The network has to learn the task for a given sequence length, kept fixed throughout training. The agent must learn to maintain the initial cue of the sequence in memory until the end of the trial, to solve the task. At the same time, the agent has to learn to neglect the information coming from the intermediate cues (called distractors). Thus the difficulty of the task is correlated with the length of the sequence.
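To make the trial structure concrete, here is a hypothetical trial generator; the actual cue alphabet of Cui et al. is not reproduced in this text, so the letters below are placeholders:

```python
import random

def make_trial(n_distractors, rng=random):
    """One sequence-prediction trial (illustrative alphabet).

    The first cue ('A' or 'X' here, hypothetical names) determines the
    correct final prediction; the intermediate cues are shared
    distractors that must be ignored.
    """
    start = rng.choice(['A', 'X'])
    distractors = ['B', 'C', 'D', 'E'][:n_distractors]
    target = 'Y' if start == 'A' else 'Z'
    return [start] + distractors + [target]

trial = make_trial(2, random.Random(0))  # sequence length 4, as in the figure
```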

We studied the performance of the

Convergence in the sequence prediction task.

We also analyzed the effect of the temporal length of the sequences on the network performance, by varying the number of distractors (i.e., the intermediate letters) per sequence (Figure

The leaky dynamics are not helpful for the sequence prediction task, because the intermediate cues are not relevant for the final model performance. Therefore, we expect the learning rule to suppress the weight values in the memory branch weight matrix for distractors, and to increase those associated with the initial cues.

Memory weights of

To confirm the better performance of the network using conservative units over leaky ones, we tested the networks on a modified task never seen during training. Specifically, the test sequences were one letter longer than training sequences and the distractors were not anymore in alphabetical order but were sampled uniformly. For instance, if the network was trained with distractors

Different versions of

Statistics of different versions of AuGMEnT.

| Standard AuGMEnT | 98.1 | 100 |
| Purely Leaky Control | 85.8 | 99.4 |
| Hybrid AuGMEnT | 98.3 | 100 |

The 12AX task is a standard cognitive task used to test working memory and diagnose behavioral and cognitive deficits related to memory dysfunctions (O'Reilly and Frank,

The general procedure of the task is schematized in Figure

The 12AX task: table of key information.

| Input | 8 possible stimuli: 1, 2, A, B, C, X, Y, Z |
| Action | Non-Target (L) or Target (R) |
| Target sequences | A followed by X when the last digit was 1; B followed by Y when the last digit was 2. Probability of target sequence is 25%. |
| Training dataset | Maximum number of training trials is 1,000,000. |
| Pairs | Each trial starts with a digit (1 or 2), followed by a random number (between 1 and 4) of letter pairs. The inserted pairs are determined randomly, with a probability of 50% to have pairs … |
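The trial structure just described can be sketched with a hypothetical generator; the inner-pair set and the response labels (L/R) below are our assumptions:

```python
import random

def make_12ax_trial(rng=random):
    """Generate one 12AX trial: a digit context followed by 1-4 letter pairs.

    The target response ('R') is required for an X that follows an A when
    the last digit was 1, and for a Y that follows a B when the last digit
    was 2; every other stimulus requires the non-target response ('L').
    """
    digit = rng.choice(['1', '2'])
    stimuli = [digit]
    for _ in range(rng.randint(1, 4)):
        stimuli.extend(rng.choice(['AX', 'AY', 'BX', 'BY']))
    labels, prev = [], None
    for s in stimuli:
        target = (digit == '1' and prev == 'A' and s == 'X') or \
                 (digit == '2' and prev == 'B' and s == 'Y')
        labels.append('R' if target else 'L')
        prev = s
    return stimuli, labels
```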

We simulated the hybrid AuGMEnT network on the 12AX task.

Learning convergence of the

With the convergence criterion of 1,000 consecutive correct predictions (corresponding to ~167 trials) (Alexander and Brown,

Comparative statistics of the

Success of learning refers to the fulfillment of the convergence criterion (Alexander and Brown,

In order to understand how the hybrid memory works on the 12AX task, we analyzed the weight structure of the connectivity matrices belonging to the memory branch of the hybrid AuGMEnT network.

Memory weights of Hybrid AuGMEnT.

The memory units show an opposing behavior on activation versus on deactivation of Target cues: for instance, if

The conservative dynamics of the memory in standard

A key goal of the computational neuroscience community is to develop neural networks that are at the same time biologically plausible and able to learn complex tasks similar to humans. The embedding of memory is certainly an important step in this direction, because memory plays a central role in human learning and decision making. Our interest in the

We have no convergence guarantees for our algorithm and network. While on-policy TD learning methods have convergence guarantees for fully observable Markov Decision Processes (MDPs) (Singh et al.,

Even apart from the issue of POMDPs, there is the issue of convergence of TD-learning for MDPs when using a neural network to approximate the Q-values.

We now compare hybrid AuGMEnT with related memory network models.

The Hierarchical Temporal Memory (

Although the hybrid memory in the

In addition, the recent delta-

The lack of a memory gating system is a great limitation for AuGMEnT.

The Hybrid

Alternatively, inspired by the hierarchical architecture of

In the past years, the reinforcement learning community has proposed several deep RL networks, like deep Q-networks (Mnih et al.,

AG, MM, and WG contributed to the conception and design of the study. MM developed and performed the simulations, and wrote the first draft of the manuscript. WG, MM, and AG revised the manuscript, and read and approved the submitted version. MM, AG and WG further revised the manuscript in accordance with the reviewers' comments and read and approved the final version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We thank Vineet Jain and Johanni Brea for helpful discussions, and Vasiliki Liakoni for comments on the manuscript. Financial support was provided by the European Research Council (Multirules, grant agreement no. 268689), the Swiss National Science Foundation (Sinergia, grant agreement no. CRSII2_147636), and the European Commission Horizon 2020 Framework Program (H2020) (Human Brain Project, grant agreement no. 720270).

The Supplementary Material for this article can be found online at: