
Edited by: Peter Dayan, University College London, UK

Reviewed by: Nathaniel D. Daw, New York University, USA; Alex Pouget, University of Rochester, USA

*Correspondence: Rajesh P. N. Rao, Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195-2350, USA. e-mail:

This is an open-access article subject to an exclusive license agreement between the authors and the Frontiers Research Foundation, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are credited.

A fundamental problem faced by animals is learning to select actions based on noisy sensory information and incomplete knowledge of the world. It has been suggested that the brain engages in Bayesian inference during perception but how such probabilistic representations are used to select actions has remained unclear. Here we propose a neural model of action selection and decision making based on the theory of partially observable Markov decision processes (POMDPs). Actions are selected based not on a single “optimal” estimate of state but on the posterior distribution over states (the “belief” state). We show how such a model provides a unified framework for explaining experimental results in decision making that involve both information gathering and overt actions. The model utilizes temporal difference (TD) learning for maximizing expected reward. The resulting neural architecture posits an active role for the neocortex in belief computation while ascribing a role to the basal ganglia in belief representation, value computation, and action selection. When applied to the random dots motion discrimination task, model neurons representing belief exhibit responses similar to those of LIP neurons in primate neocortex. The appropriate threshold for switching from information gathering to overt actions emerges naturally during reward maximization. Additionally, the time course of reward prediction error in the model shares similarities with dopaminergic responses in the basal ganglia during the random dots task. For tasks with a deadline, the model learns a decision making strategy that changes with elapsed time, predicting a collapsing decision threshold consistent with some experimental studies. The model provides a new framework for understanding neural decision making and suggests an important role for interactions between the neocortex and the basal ganglia in learning the mapping between probabilistic sensory representations and actions that maximize rewards.

To survive in a constantly changing and uncertain environment, animals must solve the problem of learning to choose actions based on noisy sensory information and incomplete knowledge of the world. Neurophysiological and psychophysical experiments suggest that the brain relies on probabilistic representations of the world and performs Bayesian inference using these representations to estimate task-relevant quantities, sometimes called "hidden" or "latent" states (Knill and Richards, 1996).

In this article, we propose a neural model for action selection and decision making that combines probabilistic representations of the environment with a reinforcement-based learning mechanism to select actions that maximize total expected future reward. The model leverages recent advances in three different fields: (1) neural models of Bayesian inference, (2) the theory of optimal decision making under uncertainty based on partially observable Markov decision processes (POMDPs), and (3) algorithms for temporal difference (TD) learning in reinforcement learning theory.

The new model postulates that decisions are made not based on a unitary estimate of "state" but rather on the entire posterior probability distribution over states (the "belief state") (see also Dayan and Daw, 2008).

We illustrate the proposed model by applying it to the well-known random dots motion discrimination task. We show that after learning, model neurons representing belief state exhibit responses similar to those of LIP neurons in primate cerebral cortex. The appropriate threshold for switching from gathering information to making a decision is learned as part of the reward maximization process through TD learning. After learning, the temporal evolution of reward prediction error (TD error) in the model shares similarities with the responses of midbrain dopaminergic neurons in monkeys performing the random dots task. We also show that the model can learn time-dependent decision making strategies, predicting a collapsing decision threshold for tasks with deadlines.

The model ascribes concrete computational roles to the neocortex and the basal ganglia. Cortical circuits are hypothesized to compute belief states (posterior distributions over states). These belief states are received as inputs by neurons in the striatum in the basal ganglia. Striatal neurons are assumed to represent behaviorally relevant points in belief space which are learned from experience. The model suggests that the striatal/STN-GPe-GPi/SNr network selects the appropriate action for a particular belief state while the striatal-SNc/VTA network computes the value (total expected future reward) for a belief state. The dopaminergic outputs from SNc/VTA are assumed to convey the TD reward prediction error that modulates learning in the striatum-GP/SN networks. Our model thus resembles previous "actor-critic" models of the basal ganglia (Barto, 1995).

We first introduce the theory of partially observable Markov decision processes. We then describe the three main components of the model: (1) neural computation of belief states, (2) learning the value of a belief state, and (3) learning the appropriate action for a belief state.

Partially observable Markov decision processes (POMDPs) provide a formal probabilistic framework for solving tasks involving action selection and decision making under uncertainty (see Kaelbling et al., 1998, for a review).

The goal of the agent is to maximize the expected sum of future rewards:

E[∑_{t=0}^{∞} γ^t r_t]

where γ ∈ [0, 1] is a discount factor and r_t is the reward received at time t.

Since the animal does not know the true state of the world, it must choose actions based on the history of observations and actions. This information is succinctly captured by the "belief state," which is the posterior probability distribution over states at time t given the history: b_t(s) = P(s_t = s | o_t, a_{t−1}, o_{t−1}, …, a_0, o_0).

The belief state can be computed recursively over time from the previous belief state using Bayes rule:

b_t(s) ∝ P(o_t | s) ∑_{s′} P(s | s′, a_{t−1}) b_{t−1}(s′)

where the constant of proportionality is determined by the requirement that the components of b_t sum to 1.

The goal then becomes one of maximizing the expected sum of future rewards as above, with the action a_t at each time step chosen as a function of the belief state b_t rather than the unknown hidden state s_t.
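As a concrete illustration, the recursive belief update can be written in a few lines of Python. This is a sketch, not code from the article; the two-state transition and observation matrices below are hypothetical:

```python
import numpy as np

def belief_update(b_prev, a, o, T, O):
    """One step of the recursive Bayesian belief update:
    b_t(s) is proportional to P(o | s) * sum_{s'} P(s | s', a) * b_{t-1}(s').

    b_prev : (S,) previous belief vector
    a      : index of the action just taken
    o      : index of the current observation
    T      : (A, S, S) transitions, T[a, s_prev, s] = P(s | s_prev, a)
    O      : (S, Z) observation likelihoods, O[s, o] = P(o | s)
    """
    predicted = T[a].T @ b_prev           # prediction from the previous belief
    unnormalized = O[:, o] * predicted    # reweight by the observation likelihood
    return unnormalized / unnormalized.sum()

# Hypothetical two-state example: the state never changes under the single
# (sampling) action, and observations indicate the true state 80% of the time.
T = np.array([[[1.0, 0.0],
               [0.0, 1.0]]])             # one action, identity dynamics
O = np.array([[0.8, 0.2],
              [0.2, 0.8]])               # P(o | s)
b = np.array([0.5, 0.5])                 # uniform initial belief
for o in [0, 0, 1, 0]:                   # a stream of noisy observations
    b = belief_update(b, 0, o, T, O)
print(b)                                 # belief concentrates on state 0
```

Because the observations mostly favor state 0, the belief vector ends up strongly weighted toward that state while always remaining a valid probability distribution.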

Note that in traditional reinforcement learning, states are mapped to actions, whereas a POMDP policy maps a belief state (a probability distribution over states) to an action.

Methods for solving POMDPs typically rely on estimating the value of a belief state, which, for a fixed policy π, is defined as the expected sum of rewards obtained by starting from the current belief state and executing actions according to π:

V^π(b_t) = E[∑_{k=0}^{∞} γ^k r_{t+k} | b_t, π]

This can be rewritten in a recursive form known as Bellman's equation (Bellman, 1957):

V^π(b_t) = E[r_t + γ V^π(b_{t+1})]

where the expectation is taken over rewards and belief transitions under the policy π.

The recursive form is useful because it enables one to derive an online learning rule for value estimation as described below.

Figure 1 illustrates this POMDP setting, with the agent maintaining a belief state b_t over hidden states based on observations and actions.

We propose here a model for learning POMDP policies that could be implemented in neural circuitry. The model leverages recent advances in POMDP solvers in the field of artificial intelligence as well as ideas from reinforcement learning theory.

Before proceeding to the model, we note that the space of beliefs is continuous (each component of the belief state vector is a probability between 0 and 1) and typically high-dimensional (the number of dimensions is one less than the number of states). This makes the problem of finding optimal policies very difficult. In fact, finding exact solutions to general POMDP problems has been proved to be computationally hard (e.g., the finite-horizon case is "PSPACE-hard"; Papadimitriou and Tsitsiklis, 1987).

A prerequisite for a neural POMDP model is the ability to compute the belief state b_t using neural circuitry. The recursive belief update above can be written in vector-matrix form as:

b_t ∝ O(o_t) ⊙ (T^T b_{t−1})

where b_t is the belief state vector at time t, T is the matrix of transition probabilities for the current action, O(o_t) is the vector of observation likelihoods P(o_t | s), and ⊙ denotes element-wise multiplication.

This equation expresses the new belief as a function of the current input o_t and the previous belief b_{t−1}, suggesting a neural implementation based on a recurrent network, for example, a leaky integrator network:

τ dv/dt = −v + f_1(W o_t) + h_1(M v)

where f_1 is a potentially non-linear function describing the feedforward transformation of the input (with feedforward weights W), M is the matrix of recurrent synaptic weights, and h_1 is a dendritic filtering function.

The above differential equation can be rewritten in discrete form as:

v_t = f(W o_t) + h(M v_{t−1})

where v_t is the vector of neural activities at time t, f and h are discrete-time versions of f_1 and h_1, and W is the matrix of feedforward weights.

To make the connection between the discrete-time network equation and the belief update equation, note that both compute their output by combining a feedforward transformation of the current input with a recurrent transformation of the previous output.

This suggests that the network can implement the belief update if the activity v_t represents the belief state b_t, the feedforward term f(W o_t) conveys the observation likelihood P(o_t | s_t), and the recurrent term h(M v_{t−1}) computes the prediction ∑_j T_{ij} b_{t−1}(j), i.e., the recurrent weights M encode the transition probabilities.

A neural model as sketched above for approximate Bayesian inference, but using a linear recurrent network, was first explored in Rao (2004), where network activities were used to approximate the (log) belief state b_t.

In general, the hidden state s_t is not directly observable, and for large state spaces the belief over it may only be computable approximately.

Many other neural models for Bayesian inference have been proposed (Yu and Dayan, 2005); any such model that computes a posterior distribution over hidden states could in principle supply the belief state b_t required by our model.

Recall that the value of a belief state, for a fixed policy π, can be expressed in recursive form using Bellman's equation:

V^π(b_t) = E[r_t + γ V^π(b_{t+1})]

The above recursive form suggests a strategy for learning the values of belief states in an online (input-by-input) fashion by minimizing the error function:

E = [r_t + γ v(b_{t+1}) − v(b_t)]²

This is the squared temporal difference (TD) error: the discrepancy between the current value estimate v(b_t) and the one-step "bootstrapped" estimate r_t + γ v(b_{t+1}).

The model estimates value using a three-layer network as shown in Figure

The input layer receives the belief state b_t as input and conveys it to the hidden layer.

The activity of hidden unit i for input belief state b_t is given by a Gaussian radial basis function:

g_i(b_t) = exp(−‖b_t − b_i‖² / 2σ²)

where b_i is the preferred belief vector ("belief point") of hidden unit i and σ² is a variance parameter.

The belief points b_i are themselves learned from experience (see below), allowing the hidden layer to represent the behaviorally relevant regions of belief space visited by b_t during the task.

The output of the network is given by:

v(b_t) = ∑_i w_i g_i(b_t)

where w_i is the synaptic weight from hidden unit i to the output unit.

The synaptic weights w_i are adapted so that the network's output approximates the true value function V^π:

v(b_t) ≈ V^π(b_t)

The synaptic weights and belief points at time t + 1 are updated using the TD error:

w_i ← w_i + α_1 δ_{t+1} g_i(b_t)       (6)

b_i ← b_i + α_2 δ_{t+1} w_i g_i(b_t) (b_t − b_i)/σ²       (7)

where α_1 and α_2 are constants governing the rate of learning, and δ_{t+1} = r_{t+1} + γ v(b_{t+1}) − v(b_t) is the TD error.

A more interesting observation is that the learning rule (7) for the belief basis vectors b_i is also modulated by the TD error δ_{t+1}. The learned basis vectors therefore do not simply capture the statistics of the inputs but do so in a manner that minimizes the error in the prediction of value.
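The value network and TD learning rules described above can be sketched in Python. This is a simplified, hypothetical implementation (the class name, constants, the normalization of the radial basis activities, and the exact form of the belief-point update are illustrative, not taken from the article):

```python
import numpy as np

class BeliefValueNet:
    """Belief input -> Gaussian radial basis hidden layer (centers b_i are the
    'belief points') -> linear readout v(b) = sum_i w_i g_i(b)."""

    def __init__(self, centers, sigma2=0.05, a1=0.01, a2=0.001, gamma=1.0):
        self.B = np.asarray(centers, dtype=float)   # (H, S) belief points b_i
        self.w = np.zeros(len(self.B))              # hidden-to-output weights w_i
        self.sigma2, self.a1, self.a2, self.gamma = sigma2, a1, a2, gamma

    def features(self, b):
        """Normalized Gaussian radial basis activities g_i(b)."""
        d2 = ((self.B - b) ** 2).sum(axis=1)
        g = np.exp(-d2 / (2.0 * self.sigma2))
        return g / g.sum()

    def value(self, b):
        return float(self.w @ self.features(b))

    def td_update(self, b, r, b_next, terminal=False):
        """delta = r + gamma * v(b') - v(b); both the readout weights and the
        belief points are nudged in proportion to this TD error."""
        v_next = 0.0 if terminal else self.value(b_next)
        delta = r + self.gamma * v_next - self.value(b)
        g = self.features(b)
        self.w += self.a1 * delta * g
        # Belief-point update: gradient of the unnormalized radial basis output,
        # gated by the TD error (a simplification of rule (7) in the text).
        self.B += self.a2 * delta * (self.w * g)[:, None] * (b - self.B) / self.sigma2
        return delta

# Hypothetical demo: a terminal reward delivered at a confident belief makes
# that region of belief space more valuable than the uncertain region.
grid = np.linspace(0.0, 1.0, 11)
net = BeliefValueNet(np.stack([grid, 1.0 - grid], axis=1))
b_sure = np.array([0.95, 0.05])
for _ in range(500):
    net.td_update(b_sure, r=1.0, b_next=b_sure, terminal=True)
```

After training, `net.value` is high near the rewarded confident belief and low near the uniform belief [0.5, 0.5], mirroring the learned value functions discussed later in the text.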

The network for action selection (Figure

In the model, the probability of choosing action a_j for belief state b_t is given by a softmax function of the action network outputs:

P(a_j | b_t) = exp(q_j(b_t)) / ∑_k exp(q_k(b_t))

where q_j(b_t) = ∑_i u_{ji} g_i(b_t) and u_{ji} is the synaptic weight from hidden unit i to action unit j.

We now derive a simple learning rule for the action weights u_{ji}. If executing action a_j resulted in a reward higher than expected (positive TD error δ_{t+1}), we would like to maximize the probability P(a_j | b_t); if it resulted in a reward lower than expected (negative δ_{t+1}), we would like to minimize P(a_j | b_t). This is equivalent to maximizing P(a_j | b_t) when δ_{t+1} is positive and maximizing 1/P(a_j | b_t) when δ_{t+1} is negative. The desired result can therefore be achieved by maximizing the function δ_{t+1} log P(a_j | b_t).

Substituting the softmax expression above for P(a_j | b_t), we obtain the objective function:

δ_{t+1} [q_j(b_t) − log ∑_k exp(q_k(b_t))]       (9)

An approximate solution to the optimization problem in (9) can be obtained by performing gradient ascent with respect to the weights u_{ji} of the executed action a_j (α_3 here is the learning rate):

u_{ji} ← u_{ji} + α_3 δ_{t+1} g_i(b_t)

In other words, after an action a_j is executed, its incoming weights are adjusted in proportion to δ_{t+1} g_i(b_t), the product of the TD error and the presynaptic (hidden unit) activity.
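The softmax action selection and the three-factor learning rule just described can be sketched as follows. This is again a hypothetical implementation: the class and helper names are illustrative, and the stand-in reward signal would normally be the TD error from the value network:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(q):
    e = np.exp(q - q.max())
    return e / e.sum()

class ActionNet:
    """Maps hidden-layer activities g(b) to a stochastic action choice:
    P(a_j | b) = softmax_j( sum_i u_ji g_i(b) )."""

    def __init__(self, n_hidden, n_actions, a3=0.05):
        self.U = np.zeros((n_actions, n_hidden))   # action weights u_ji
        self.a3 = a3

    def act(self, g):
        p = softmax(self.U @ g)
        return rng.choice(len(p), p=p), p

    def update(self, j, g, delta):
        """After executing action j, adjust only its incoming weights by the
        product of the TD error and the presynaptic activity."""
        self.U[j] += self.a3 * delta * g

# Toy demo (hypothetical): each hidden unit should come to favor one action.
net = ActionNet(n_hidden=2, n_actions=2)
g0, g1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(500):
    for g, good in [(g0, 0), (g1, 1)]:
        j, _ = net.act(g)
        delta = 1.0 if j == good else -1.0   # stand-in for the TD error signal
        net.update(j, g, delta)
```

Because correct choices are reinforced and incorrect ones penalized, the softmax probabilities converge toward the rewarded action in each context while remaining stochastic early in learning.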

We postulate that the probabilistic computation of beliefs (the recursive Bayesian update described above) is implemented by recurrent networks in the neocortex.

We further postulate that the outputs of cortical circuits (i.e., belief states) are conveyed as inputs to the basal ganglia, which implements the value and action selection networks in the model. In particular, we suggest that the striatum/STN-GPe-GPi/SNr network computes actions while the striatum-SNc/VTA network computes value (Figure

In this mapping, the hidden units with learned belief points b_i correspond to striatal neurons whose preferred belief vectors are acquired through experience.

The interpretation of dopaminergic outputs in the basal ganglia as representing prediction error is consistent with previous TD-based models of dopaminergic responses (Schultz et al., 1997).

We tested the neural POMDP model derived above in the well-known random dots motion discrimination task used to study decision making in primates (Shadlen and Newsome, 2001).

The animal's task is to decide the direction of motion of the coherently moving dots for a given input sequence. The animal learns the task by being rewarded if it makes an eye movement to a target on the left side of its fixation point if the motion is to the left, and to a target on the right if the motion is to the right. A wealth of data exists on the psychophysical performance of humans and monkeys on this task, as well as the neural responses observed in brain areas such as MT and LIP in monkeys performing this task (see Roitman and Shadlen, 2002).

In the first set of experiments, we illustrate the model using a simplified version of the random dots task where the coherence value chosen at the beginning of the trial is known. This reduces the problem to that of deciding from noisy observations the underlying direction of coherent motion, given a fixed known coherence. We tackle the case of unknown coherence in a later section.

We model the task using a POMDP as follows: there are two underlying hidden states representing the two possible directions of coherent motion (leftward or rightward). In each trial, the experimenter chooses one of these hidden states (either leftward or rightward) and provides the animal with observations of this hidden state in the form of an image sequence of random dots at the chosen coherence. Note that the hidden state remains the same until the end of the trial. Using only the sequence of observed images seen so far, the animal must choose one of the following actions: sample one more time step (to reduce uncertainty), make a leftward eye movement (indicating choice of leftward motion), or make a rightward eye movement (indicating choice of rightward motion).

We use the notation S_L and S_R for the two hidden states (leftward and rightward motion), so that the state at time t is s_t ∈ {S_L, S_R}. The observation at time t is the pair (o_t, c_t), where o_t is the noisy motion input and c_t is the coherence, which takes one of Q discrete values c_1, c_2, …, c_Q. The available actions are A_S (sample one more observation), A_L (choose leftward motion), and A_R (choose rightward motion).

The animal receives a reward for choosing the correct action, i.e., action A_L when the hidden state is S_L and action A_R when the hidden state is S_R.

The transition probabilities P(s_t | s_{t−1}, a_{t−1}) for the task are as follows: the state remains unchanged (self-transitions have probability 1) as long as the sample action A_S is chosen, i.e., P(s_t = s_{t−1} | s_{t−1}, A_S) = 1. When the animal chooses A_L or A_R, the trial ends and a new trial begins, with a new hidden state (S_L or S_R) and a new coherence value c_k (from the set {c_1, c_2, …, c_Q}) chosen uniformly at random.

In the first set of experiments, we trained the model on 6000 trials of leftward or rightward motion. Inputs o_t were generated by sampling from the observation distribution P(o_t | s_t, c_t) based on the current coherence value and state (direction): for state S_L with coherence c_k, observations favored leftward motion, and for state S_R with coherence c_k, observations favored rightward motion, with the separability of the two distributions increasing with c_k.

The belief state b_t was computed recursively from the current observation o_t, the known coherence c_k, the observation likelihood P(o_t | s_t, c_t), and the previous belief state:

b_t(s) ∝ P(o_t | s, c_k) b_{t−1}(s)

(the hidden state does not change within a trial, so the transition term reduces to the identity).

The belief over the two states can thus be summarized by a single number: Belief(S_L) = b_t(S_L), with Belief(S_R) = b_t(S_R) = 1 − b_t(S_L).
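For this two-state task with identity transitions, the belief update reduces to multiplying in each observation's likelihood, so the log-odds of the two directions performs a biased random walk. A hypothetical Python sketch, assuming Gaussian observations whose mean scales with direction and coherence (an illustrative choice, not the article's exact observation model):

```python
import numpy as np

rng = np.random.default_rng(2)

def dots_belief_trace(coherence, direction=+1, k=1.0, sigma=1.0, steps=60):
    """Belief(S_R) over time for a single trial with known coherence.

    Observations are assumed Gaussian, x_t ~ N(direction * k * coherence, sigma^2);
    with identity transitions, each update just multiplies in the likelihood,
    so the log-odds of the two directions performs a biased random walk.
    """
    mu = k * coherence
    log_odds = 0.0                 # log P(S_R)/P(S_L); 0 means uniform belief
    trace = []
    for _ in range(steps):
        x = rng.normal(direction * mu, sigma)
        log_odds += 2.0 * mu * x / sigma**2   # Gaussian log-likelihood ratio
        trace.append(1.0 / (1.0 + np.exp(-log_odds)))
    return np.array(trace)

high = dots_belief_trace(0.40)   # easy trial: belief ramps quickly toward 1
low = dots_belief_trace(0.04)    # hard trial: slow, noisy drift near 0.5
```

The resulting traces exhibit the coherence-dependent ramping discussed later in the comparison with LIP responses: fast, nearly deterministic rises at high coherence and slow, noisy drifts at low coherence.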

The resulting belief state vector b_t was fed as input to the value and action networks (parameters: α_1 = α_3 = 0.0005, α_2 = 2.5 × 10^{−7}, γ = 1, λ = 1, σ² = 0.05). The number of output units was three for the action network and one for the value network. A more realistic implementation could utilize populations of neurons to represent the two directions of motion and estimate posterior probabilities from population activity; for simplicity, we assume here that the two posterior probabilities are represented directly by two units.

The number of hidden units used in the first set of simulations was 11. We found that qualitatively similar results are obtained for other values. The number of hidden units determines the precision with which the belief space can be partitioned and mapped to appropriate actions. A complicated task could require a larger number of hidden neurons to partition the belief space in an intricate manner for mapping portions of the belief space to the appropriate value and actions.

The input-to-hidden weights were initialized to evenly span the range between [0 1] and [1 0]. Similar results were obtained for other choices of initial parameters (e.g., uniformly random initialization).

The process of learning is captured in Figure

Figure

Values are shown as a function of the belief in the S_R (rightward motion) state; the full belief state is simply [Belief(S_R), 1 − Belief(S_R)]. Before learning begins, all values for belief states are initialized to 0 (left panel). After learning, highly uncertain belief states (Belief(S_R) near 0.5) have low value while belief states near 0 or 1 (high certainty about states S_L or S_R, respectively) have high values.

Before learning, all values are 0 because the weights w_i are initialized to 0. After learning, the value function is lowest near Belief(S_R) = 0.5, where uncertainty about S_L versus S_R is maximal, and highest near Belief(S_R) = 0 or 1, where the state (S_L or S_R) is nearly certain.

Probabilities of the three actions (A_S, A_L, A_R) as a function of belief in state S_R (rightward motion). The "Sample" action is chosen with high probability when the current state is uncertain (belief between 0.2 and 0.8, top right plot). The "Choose Left" action has a high probability when the belief for S_R is near 0 (i.e., the belief for S_L is high) and the "Choose Right" action when the belief for S_R is near 1.

The learned policy reflects this structure: the sampling action A_S is selected for intermediate (uncertain) beliefs, while A_L and A_R are selected when the belief for S_R is sufficiently close to 0 or 1, respectively.

The performance of the model on the task depends on the coherence of the stimulus and is quantified by the psychometric function in Figure

Accuracies above 90% are already achieved for coherences of 8% and above, similar to the monkey data; 100% accuracy in the model is consistently achieved only for the 100% coherence case, due to the probabilistic (softmax) method used for action selection.

The vertical dotted line in each plot in Figure

We did not attempt to quantitatively fit a particular monkey's data, preferring to focus instead on qualitative matches. It should be noted that the model learns to solve the random dots task from scratch over the course of several hundred trials, with the only guidance provided being the reward/penalty at the end of a trial. This makes fitting curves, such as the psychometric function, to a particular monkey difficult, compared to previous models of the random dots task that are not based on learning and which therefore allow easier parameter fitting.

Figure

The learned policy implements a threshold rule: the model continues to sample as long as the beliefs for S_L and S_R remain intermediate, and selects action A_L (or A_R) once the belief for S_L (or S_R) crosses a learned threshold.

To compare the model to neural data, we computed the time evolution of Belief(S_L) for leftward and rightward motion stimuli, with observations drawn from P(o_t | s_t, c_t) at several coherence levels.

Time evolution of Belief(S_L) for stimuli moving leftward (solid) and rightward (dashed) with motion coherences of 4, 8, 20, and 40%, respectively. The model chose the correct action in each case. The panel on the right shows average responses of 54 neurons in cortical area LIP in a monkey (figure adapted from Roitman and Shadlen, 2002).

The random walk-like ramping behavior of the belief-computing neurons in the model is comparable to the responses of cortical neurons in area LIP in the monkey.^{1}

Unlike previous models of LIP responses, the POMDP model suggests an interpretation of the LIP data in terms of maximizing total expected future reward within a general framework for probabilistic reasoning under uncertainty. Thus, parameters such as the threshold for making a decision emerge naturally within the POMDP framework as a result of maximizing reward. As the model responses show, the belief for the eventually chosen direction (S_L or S_R) rises to approximately the same level before an overt action is selected, regardless of coherence, consistent with a learned decision threshold.

The hidden layer neurons in Figure

Since the belief vector is continuous-valued and typically high-dimensional, the transformation from the input layer to the hidden layer can be regarded as an adaptive discretization of the belief space into behaviorally relevant regions.

Figure

When initialized to random values (Figure

The anatomical mapping of elements of the model to basal ganglia anatomy in Figure

We first present a comparison of model TD responses to DA responses seen in the simple conditioning task of Mirenowicz and Schultz (1994).

To illustrate TD responses in the model for simple conditioning, we reduced the uncertainty in the random dots task to 0 and tracked the evolution of the TD error. Figure

This behavior of the TD error in the model (Figure

In the previous section, we considered the case where only motion direction was unknown and the coherence value was given in each trial. This situation is described by a graphical model (Koller and Friedman, 2009) in which the hidden direction s_t, together with the known coherence c_t, determines the distribution of the observation o_t.

Graphical model for the task with known coherence: the coherence c_t and the previous action a_{t−1} are assumed to be known, while the hidden direction s_t must be inferred from the observations o_t.

We now examine the case where both the direction of motion and coherence are unknown. The graphical model is shown in Figure

Suppose d_t denotes the hidden direction of motion and c_t the hidden coherence at time t, both of which remain fixed within a trial.

Then, the belief state at time t is the joint posterior probability distribution over direction and coherence:

b_t(d, c) = P(d_t = d, c_t = c | o_t, o_{t−1}, …, o_0)

This belief state can be computed recursively as before. The quantities fed to the value and action networks are the marginal posteriors over direction and over coherence, obtained by summing the joint belief over the other variable.

Alternatively, one can estimate these marginals directly by performing Bayesian inference over the graphical model, updating the posterior probabilities over direction and over coherence after each observation o_t.
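Maintaining the joint posterior over (direction, coherence) and reading out its marginals can be sketched on a small grid. The discretization, Gaussian observation model, and constants below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical discretization: 2 directions x 4 coherence levels.
directions = np.array([-1.0, +1.0])             # S_L, S_R
coherences = np.array([0.05, 0.10, 0.20, 0.40])
k, sigma = 1.0, 1.0

def joint_update(B, x):
    """B[d, c] = P(direction d, coherence c | observations so far).
    Both variables are fixed within a trial, so each update is just
    likelihood times prior, followed by normalization."""
    mu = k * directions[:, None] * coherences[None, :]   # (2, 4) observation means
    like = np.exp(-((x - mu) ** 2) / (2.0 * sigma**2))
    B = like * B
    return B / B.sum()

B = np.full((2, 4), 1.0 / 8.0)       # uniform prior over the grid
for _ in range(80):                  # simulate a rightward, high-coherence trial
    x = rng.normal(k * 0.40, sigma)
    B = joint_update(B, x)

belief_dir = B.sum(axis=1)   # marginal over direction: [P(S_L), P(S_R)]
belief_coh = B.sum(axis=0)   # marginal over coherence
```

The two marginals computed at the end are exactly the kind of direction and coherence beliefs that the value and action networks receive as inputs in the simulations described next.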

Figure

To illustrate this model, we simulated the case where there are two directions of motion (Left and Right) and two possible coherence levels (an easy, high coherence and a hard, low coherence).

The model was exposed to 4000 trials, with the motion direction and coherence selected uniformly at random for each trial. The rewards and penalties were the same as in the previous section (+20 reward for correct decisions, −400 for errors, and −1 for each sampling action). The number of hidden units, shared by the value and action networks, was 25 each for belief over direction and coherence. The other parameters were set as follows: α_{1} = 3 × 10^{−4}, α_{2} = 2.5 × 10^{−8}, α_{3} = 4 × 10^{−6}, γ = 1, λ = 0.5, σ^{2} = 0.05.

Figure

The corresponding learned policy is shown in Figure ^{2}

Figure

The belief trajectory over coherence in Figure

In this section, we compare model predictions regarding reward prediction error (TD error) with recently reported results on dopamine responses from SNc neurons in monkeys performing the random dots task (Nomoto et al., 2010).

We first describe the model's predictions. Consider an “Easy” coherence trial where the direction of motion is leftward (L). The model starts with a belief state of [0.5 0.5] over direction (and coherence); subsequent updates push Belief(L) higher, which corresponds to climbing the ramp in the value function in Figure

Figure

For comparison, Figure ^{3}

For “Hard” motion coherence trials (coherence = 8%), the average TD error in the model is shown in Figure

The model also predicts that upon reward delivery at the end of a correct trial, TD error should be larger for the “Hard” (8% coherence) case due to its smaller expected value (see Figure

Finally, in the case of an error trial, the model predicts that the absence of reward (or presence of a negative reward/penalty as in the simulations) should cause a negative reward prediction error and this error should be slightly larger for the higher coherence case due to its higher expected value (see Figure

Our final set of results illustrates how the model can be extended to learn time-varying policies for tasks with a deadline. Suppose a task has to be solved by a deadline: if no overt action has been chosen by then, the trial ends and a penalty is incurred.

Figure

where the hidden unit activities now depend on both the belief state and the elapsed time. The parameters were α_1 = 2.5 × 10^{−5}, α_2 = 4 × 10^{−8}, α_3 = 1 × 10^{−5}, γ = 1, λ = 1.5, σ² = 0.08. The model was trained on 6000 trials, with motion direction (Left/Right) and coherence (Easy/Hard) selected uniformly at random for each trial.
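One simple way to realize time-dependent hidden units is to give each unit both a belief center and a time center, so that the value and action readouts can differ at different elapsed times within a trial. A hypothetical sketch (function names, grid sizes, and constants are illustrative, not the article's exact formulation):

```python
import numpy as np

def timed_features(b, t, centers_b, centers_t, T_max, sigma2_b=0.08, sigma2_t=0.02):
    """Hidden-unit activities over the augmented (belief, elapsed time) space.
    Each unit has a belief center and a time center, so value and action
    readouts from these features can change as the deadline approaches."""
    tau = t / T_max                                  # normalized elapsed time
    d2_b = ((centers_b - b) ** 2).sum(axis=1)
    d2_t = (centers_t - tau) ** 2
    f = np.exp(-d2_b / (2.0 * sigma2_b)) * np.exp(-d2_t / (2.0 * sigma2_t))
    return f / f.sum()

# Hypothetical grid of (belief, time) centers: 11 belief points x 5 time points.
bs = np.linspace(0.0, 1.0, 11)
centers_b = np.repeat(np.stack([bs, 1.0 - bs], axis=1), 5, axis=0)   # (55, 2)
centers_t = np.tile(np.linspace(0.0, 1.0, 5), 11)                    # (55,)

# The same belief yields different hidden activities early vs. late in a trial,
# which is what allows a learned decision threshold to vary with elapsed time.
early = timed_features(np.array([0.6, 0.4]), t=5, centers_b=centers_b,
                       centers_t=centers_t, T_max=50)
late = timed_features(np.array([0.6, 0.4]), t=45, centers_b=centers_b,
                      centers_t=centers_t, T_max=50)
```

Because the feature vector for a given belief changes with elapsed time, TD learning can assign different action probabilities to the same belief early and late in the trial, which is the mechanism behind the collapsing threshold described below.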

Figures

The learned policy, which is a function of elapsed time, is shown in Figure

As seen in Figure

More interestingly, as we approach the deadline, the threshold for the “Choose Left” action collapses to a value close to 0.5 (and likewise for “Choose Right”), suggesting that the model has learned it is better to pick one of these two actions (at the risk of committing an error) than to reach the deadline and incur a larger penalty. Such a “collapsing” bound or decision threshold has also been predicted by previous theoretical studies (e.g., Latham et al.,

The mechanisms by which animals learn to choose actions in the face of uncertainty remains an important open problem in neuroscience. The model presented in this paper proposes that actions are chosen based on the entire posterior distribution over task-relevant states (the “belief state”) rather than a single “optimal” estimate of the state. This allows an animal to take into account the current uncertainty in its state estimates when selecting actions, permitting the animal to perform information gathering actions for reducing uncertainty and choosing overt actions only when (and if) uncertainty is sufficiently reduced.

We formalized the proposed approach using the framework of partially observable Markov decision processes (POMDPs) and presented a neural model for solving POMDPs. The model relies on TD learning for mapping beliefs to values and actions. We illustrated the model using the well-known random dots task and presented results showing that (a) the temporal evolution of beliefs in the model shares similarities with the responses of cortical neurons in area LIP in the monkey, (b) the threshold for selecting overt actions emerges naturally as a consequence of learning to maximize rewards, (c) the model exhibits psychometric and chronometric functions that are qualitatively similar to those in monkeys, (d) the time course of reward prediction error (TD error) in the model when stimulus uncertainty is varied resembles the responses of dopaminergic neurons in SNc in monkeys performing the random dots task, and (e) the model predicts a time-dependent strategy for decision making under a deadline, with a collapsing decision threshold consistent with some previous theoretical and experimental studies.

The model proposed here builds on the seminal work of Daw, Dayan, and others who have explored the use of POMDP and related models for explaining various aspects of decision making and suggested systems-level architectures (Daw et al.,

We suggest that networks in the cortex implement Bayesian inference and convey the resulting beliefs (posterior distributions) to value estimation and action selection networks. The massive convergence of cortical outputs onto the striatum (the “input” structure of the basal ganglia) and the well-known role of the basal ganglia in reward-mediated action make the basal ganglia an attractive candidate for implementing the value estimation and action selection networks in the model. Such an implementation is consistent with previous “actor-critic” models of the basal ganglia (Barto,

The hypothesis that striatal neurons learn a compact representation of cortical belief states (the belief points b_i in the model) remains to be tested experimentally.

The general idea of optimizing policies for decision making by maximizing reward has previously been suggested in the context of drift–diffusion and sequential probability ratio test (SPRT) models (Gold and Shadlen, 2002).

Our formulation of the problem within a reinforcement learning framework is also closely related to the work of Latham et al. (

The model we have proposed extends naturally to decision making with arbitrary numbers of choices (e.g., random dots tasks with more than two directions of motion; Churchland et al., 2008).

The interpretation of LIP responses as beliefs predicts that increasing the number of possible directions of motion to N should lower the initial responses of LIP neurons, reflecting the lower initial belief (prior probability 1/N) assigned to each direction.

It has been shown that LIP neurons can be modulated by other variables such as value (Platt and Glimcher, 1999).

The belief computation network required by the current model is similar to previously proposed networks for implementing Bayesian inference in hidden Markov models (HMMs) (e.g., Rao, 2004).

We illustrated the ability of the model to learn a time-dependent policy using a network with an input node that represents elapsed time (Figure

For a task with a deadline, the model learned a time-dependent policy with a “collapsing” decision threshold (Latham et al.,

On the computational front, several questions await further study: how does the proposed model scale to large-scale POMDP problems such as those faced by an animal in non-laboratory settings? How does the performance of the model compare with approximation algorithms for POMDPs suggested in the artificial intelligence literature? What types of convergence properties can be proved for the model? Empirical results from varying model parameters for the random dots problem suggest that the model converges to an appropriate value function and policy under a variety of conditions, but rigorous theoretical guarantees could potentially be derived by leveraging past results on the convergence of TD learning (Sutton, 1988).

Another open issue is how the transition and observation models (or more generally, the parameters and structure of a graphical model) for a given POMDP problem could be learned from experience. Algorithms in machine learning, such as the expectation-maximization (EM) algorithm (Dempster et al., 1977), could potentially be adapted for this purpose.

Finally, the mapping of model components to the anatomy of the basal ganglia in Figure

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

I am grateful to the two reviewers for their detailed comments and suggestions. I also thank Erick Chastain, Geoff Gordon, Yanping Huang, Michael Shadlen, Pradeep Shenoy, Deepak Verma, and Angela Yu for useful discussions. Part of this manuscript was written at the scenic Whiteley Center at Friday Harbor Laboratories – I thank the Center for my stay there. This work was supported by NSF grant 0622252, NIH NINDS grant NS-65186, the ONR Cognitive Science Program, and the Packard Foundation.

^{1}The simulations here assume known coherence; for the unknown coherence case, similar responses are obtained when considering the marginal posterior probability over direction.

^{2}The middle range of values for Belief(E) usually co-occurs with the middle range of values for Belief(L) (and not very high or very low Belief(L) values). This accounts for the near 0 probabilities for the Left/Right actions in the figure even for very high and very low Belief(L) values, when Belief(E) is in the middle range.

^{3}The dopamine response for monkey L in Nomoto et al. (2010) exhibits a qualitatively similar time course.