Edited by: Angela J. Yu, University of California San Diego, USA
Reviewed by: Tim Behrens, University of Oxford, UK; Pradeep Shenoy, University of California San Diego, USA; Gregory Corrado, Google, USA
*Correspondence: Robert C. Wilson, Department of Psychology, Neuroscience Institute, Princeton University, Green Hall, Princeton, NJ 08540, USA. e-mail:
This is an open-access article distributed under the terms of the
Reinforcement learning models of human and animal learning usually concentrate on how we learn the relationship between different stimuli or actions and rewards. However, in real-world situations “stimuli” are ill-defined: our immediate environment is extremely multidimensional, yet in any decision-making scenario only a few aspects of the environment are relevant for obtaining reward, while most are irrelevant. A key question, then, is how we learn these relevant dimensions, that is, how we learn what to learn about. We investigated this process of “representation learning” experimentally, using a task in which one stimulus dimension was relevant for determining reward at each point in time. As in real-life situations, in our task the relevant dimension could change without warning, adding the ever-present uncertainty engendered by a constantly changing environment. We show that human performance on this task is better described by a suboptimal strategy based on selective attention and serial hypothesis testing than by a normative strategy based on probabilistic inference. From this we conjecture that the problem of inferring relevance in general scenarios is too computationally demanding for the brain to solve optimally. As a result, the brain resorts to approximations, employing them even in simplified scenarios in which optimal representation learning is tractable, such as the one in our experiment.
In the last two decades, the computational field of reinforcement learning (RL) has revolutionized our understanding of the neural basis of decision making by providing a precise, formal computational framework within which learning and action selection can be understood (Sutton and Barto,
In the machine learning literature, this so-called “curse of dimensionality” is often overcome by the use of specialist, hand crafted representations that concentrate on a small subset of relevant stimulus features to make the RL problem tractable. Yet humans and animals, who are presumably born without such task-specific representations, still learn to solve new tasks efficiently in a world that is both uncertain and extremely multidimensional. In this work we investigate how this is possible.
We hypothesize that humans make the assumption that in any specific task only a small number of features of the environment are relevant for determining reward. For example, when eating at a restaurant, the identity of the chef and the quality of the ingredients are important determinants of reward. Of much less importance (in most circumstances) are the table one is sitting at, the clothes the waiter is wearing, and the weather outside. Such a sparsity assumption (Kemp and Tenenbaum,
Here we analyze human behavior on a task that involves concurrent representation learning and RL in a non-stationary environment characterized by abrupt and unsignaled change-points. In this task, subjects must track a periodically changing relevant stimulus feature using only noisy reward feedback. Unlike previous work involving change-point detection (Behrens et al.,
We compare two possible computational solutions to the representation learning problem: (1) an exact Bayesian inference strategy that makes use of all information in the task, and (2) a selective attention, serial-hypothesis-testing strategy that uses just a fraction of the information at a time. This second learning strategy trades statistical efficiency for computational efficiency and acknowledges that, even with the simplifying assumption of sparsity, the Bayesian solution may be too computationally demanding for the brain (Daw and Courville,
Using both qualitative and quantitative analyses of behavioral data, we find that human behavior is significantly better accounted for by the selective attention strategy than by the exact Bayesian strategy. This suggests that humans favor computational over statistical efficiency for representation learning, even for a simple task such as ours in which exact, or approximate Bayesian inference might be tractable.
To investigate the process of representation learning, we examined learning in a simplified scenario in which stimuli have three features, only one of which is relevant to predicting (and obtaining) point rewards. A schematic of our representation learning task, based on the Wisconsin Card Sorting Task (Milner,
Thirty-seven subjects (22 females, ages 18–28, mean 21.3 years) recruited from the Princeton community performed 500–1000 trials of the task. Subjects first performed a similar number of trials in a task that was identical but for the fact that changes in the relevant dimension and most rewarding feature were explicitly signaled. For both parts of the experiment, subjects received on-screen instructions informing them that only one of the three dimensions (color, shape, or texture) was relevant to determining the probability of winning a point, that one feature in the relevant dimension will result in rewards more often than the others (the exact probability was not mentioned), and that all rewards were probabilistic (specifically, that “even the best stimulus will sometimes not reward with a point” and vice-versa). They were also instructed to respond quickly (imposed using a 1-s response timeout) and to try to get as many points as possible.
Subjects practiced the task, and were then instructed that throughout the first part they would be informed when the relevant dimension and best feature were changed. Before the start of the second (unsignaled changes) part, they were instructed that changes in the relevant dimension and best feature would no longer be signaled. In this part, subjects were given a break and allowed to rest after completing each quarter of the trials. In each break they were also informed of their cumulative point earnings since the last break, and how that compared to the maximum possible earnings. In neither part were subjects instructed about the rate of change-points or the hazard function.
After performing the second part, some subjects continued to perform another 300 trials of the signaled task inside a magnetic resonance imaging scanner. Subjects were compensated for their time with either $12 (behavior only) or $40 (scanning experiment). All subjects gave informed consent and the study was approved by the Princeton University Institutional Review Board. As we are interested specifically in the interaction between change-point detection and representation learning, here we focus exclusively on data from the unsignaled portion of the experiment (see Gershman et al.,
We compared two alternative families of models that could potentially explain human performance on this task. The first, based on Bayesian probability theory, makes use of all available information to infer the action that will maximize the probability of obtaining a reward on the next trial. The second focuses on one feature at a time, testing whether it is the correct feature that maximizes reward. This latter set of models is suboptimal in its use of information, and consequently performs less well on the task. However, it is computationally much simpler and thus more tractable in an online, highly multidimensional setting. By comparing the fits of the two classes of models to subjects’ trial-by-trial behavior, our aim was to determine which model better describes the computations performed by the human brain in such scenarios.
We begin by defining some notation common to both models. We let the identity of the relevant dimension (shape, color, or texture) on trial
The Bayesian model uses probabilistic inference to compute the probability distribution over the identity of the rewarding dimension and feature given all past trials,
Specifically, we are interested in computing the probability of reward for each possible stimulus,
When
The key, then, is to evaluate
By definition, the run-length either increases by one after each trial, in between change-points, or becomes zero at a change-point. Thus we can define the change-point prior,
where
Substituting the change-point prior, equation 4, into equation 3 gives
where
We now need to specify two distributions: (1)
The first of these,
The run-length distribution
Taken together, these equations define a Bayesian update process that makes optimal use of past information to infer the current identity of the most rewarding feature in the presence of unsignaled change-points.
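The run-length recursion at the heart of this update can be sketched in code. The fragment below is a minimal, illustrative implementation of one filtering step under a constant hazard rate h; the function name and the per-run-length likelihood vector are our own notation, not the paper's:

```python
import numpy as np

def update_run_length(p_r, like, h):
    """One step of the run-length filter behind the change-point prior.

    p_r  : posterior over run-lengths r = 0..R-1 after the previous trial
    like : likelihood of the current observation under each run-length
    h    : hazard rate, i.e., the per-trial probability of a change-point
    """
    R = len(p_r)
    new = np.zeros(R + 1)
    new[1:] = p_r * like * (1.0 - h)   # no change-point: r -> r + 1
    new[0] = np.sum(p_r * like * h)    # change-point: r resets to 0
    return new / new.sum()             # renormalize to a distribution
```

For example, starting from a single run-length with observation likelihood 0.7 and h = 0.1, the filter places probability 0.1 on a reset and 0.9 on continuation.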
In order to generate choice probabilities the model looks one time step into the future to compute stimulus values, that is, the predicted reward outcome for each choice
where β is an inverse temperature parameter which will be fit to data. Thus for a given Bayesian model (Bayes const h or Bayes var h) and a set of parameters (β,
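A softmax (logistic) choice rule of this kind can be written compactly. The sketch below is only illustrative; it assumes a vector of predicted reward values, one per option:

```python
import numpy as np

def softmax_choice(values, beta):
    """Choice probabilities from predicted reward values; beta is the
    inverse temperature (beta = 0 gives random choice, large beta is greedy)."""
    z = beta * np.asarray(values, dtype=float)
    z -= z.max()                  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```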
Note that, unlike the inference process, the decision process is suboptimal. This reflects the fact that although “ideal observer” inference of reward probabilities in this task is tractable, determining the optimal action selection policy while taking account of the multiple levels of uncertainty about these inferences is far from trivial.
Figures
The second family of models assumes that subjects use a simplified “serial-hypothesis-testing” strategy to solve the task. In particular, this model postulates that at each point in time the subject focuses attention on one feature
We focus on three formal instantiations of the selective attention model. These are by no means the only possibilities within this family of models, but they capture the essential features of selective attention while remaining amenable to analysis.
This model performs a Bayesian hypothesis test to determine whether to switch from, or stick with, the currently attended dimension and feature, based on the reward history since switching attention. It shares some similarities with the model of Yu and Dayan (
The SA full model tracks the probability that the currently attended dimension-feature pair is the most rewarding one based on the history of rewards experienced since the last shift of attention. We write
Here
where
Similarly, we can compute the probability that the attended dimension-feature pair is
Given these two probabilities, the log likelihood ratio between the hypothesis that the current focus of attention is on the highly rewarding feature, and the alternative hypothesis is
which in turn determines the probability of switching the focus of attention according to
with inverse temperature parameter β and threshold θ as free parameters of the model. Note that this model essentially reduces to leaky counting of the number of wins and losses. An outline of the model in action is shown in Figure
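A simplified, non-leaky sketch of this hypothesis test: assuming reward rates ρ_high and ρ_low for the best and other features, the log likelihood ratio is a weighted count of wins and losses since the last attention shift, and the switch probability is a sigmoid of its distance from the threshold θ. The exact parameterization below is our guess at the form described in the text:

```python
import numpy as np

def switch_probability(wins, losses, rho_high, rho_low, beta, theta):
    """Probability of shifting attention given the reward history since
    the last shift. Non-leaky sketch; the full model uses leaky counts."""
    # log likelihood ratio: attended feature is the best one vs. is not
    llr = (wins * np.log(rho_high / rho_low)
           + losses * np.log((1.0 - rho_high) / (1.0 - rho_low)))
    # sigmoid: switching becomes likely when llr falls below theta (< 0)
    return 1.0 / (1.0 + np.exp(beta * (llr - theta)))
```

With mostly wins since the last shift the switch probability stays near zero; with mostly losses it approaches one.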
We also considered two simplified versions of the selective attention model. In the first, we assume that the subject shifts attention with probability
In this second, simplest possible, selective attention model, we assume that subjects switch the focus of their attention randomly with some fixed probability
We use an ε-greedy policy for all selective attention models, such that the agent picks the option with the attended feature with probability 1 − ε, and chooses randomly with probability ε. Such a choice rule allows for mistakes, that is, trials in which an option that does not have the attended feature is chosen by accident. In these trials, the current value of
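As a sketch, the ε-greedy rule over the three options looks as follows; note that because the random branch chooses uniformly over all options, the attended option also receives its share of ε:

```python
import numpy as np

def choice_probs(attended, n_options=3, eps=0.05):
    """epsilon-greedy over options: the option carrying the attended
    feature is chosen with probability 1 - eps plus its share of eps."""
    p = np.full(n_options, eps / n_options)   # 'mistake' branch: uniform
    p[attended] += 1.0 - eps                  # greedy branch
    return p
```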
While these selective attention models provide straightforward accounts of the generation of subjects’ choices, it is complicated to fit them to subjects’ behavior. This is due to an additional source of uncertainty: the experimenter’s uncertainty about the subject’s focus of attention on each trial. For example, it is impossible to conclude, only on the basis of the observation that the subject has chosen a red-plaid-circle stimulus, which of the three features (red, plaids, or circles) was the focus of attention. We can only infer a
where
To compute the probability that a feature is attended to,
Here, we introduce
and
The exact form of the change-point prior,
Finally, we assume a uniform distribution over
As was the case for the Bayesian model, computing
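The core of this inference over the focus of attention can be sketched with Bayes' rule over the nine features, combining a prior with the ε-greedy likelihood of the observed choice. The feature names and the uniform-mistake assumption here are illustrative, not taken from the paper:

```python
import numpy as np

def attention_posterior(prior, chosen_features, all_features, eps):
    """P(attended feature | observed choice), up to the model's dynamics.

    Features contained in the chosen stimulus are consistent with being
    attended (choice prob. 1 - eps + eps/3 under epsilon-greedy); other
    features can only produce this choice through the eps 'mistake' branch.
    """
    like = np.array([(1.0 - eps) + eps / 3.0 if f in chosen_features
                     else eps / 3.0 for f in all_features])
    post = np.asarray(prior) * like
    return post / post.sum()
```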
To adjudicate between the models, we used human trial-by-trial choice behavior to fit the free parameters ω
The free parameters, ω
Model | Parameters | Priors | Constraints | Fit value ± SEM |
---|---|---|---|---|
Bayes const h | β | Gamma(2,2) | 0 ≤ β ≤ ∞ | 6.49 ± 0.36 |
 | h | Beta(1,1) | 0 ≤ h ≤ 1 | 0.153 ± 0.009 |
 | ρ_high | Beta(12,4) | 0.5 ≤ ρ_high ≤ 1 | 0.500 ± 0.0004 |
 | ρ_low | Beta(4,12) | 0.2 ≤ ρ_low ≤ 0.5 | 0.2 ± 8 × 10^{−8} |
Bayes var h | β | Gamma(2,2) | 0 ≤ β ≤ ∞ | 3.59 ± 0.13 |
 | ρ_high | Beta(12,4) | 0.5 ≤ ρ_high ≤ 1 | 0.526 ± 0.009 |
 | ρ_low | Beta(4,12) | 0.2 ≤ ρ_low ≤ 0.5 | 0.2 ± 3 × 10^{−8} |
SA full | β | Gamma(2,2) | 0 ≤ β ≤ ∞ | 1.34 ± 0.07 |
 | h | Beta(1,1) | 0 ≤ h ≤ 1 | 0.38 ± 0.03 |
 | ε | Beta(1,1) | 0 ≤ ε ≤ 1 | 0.052 ± 0.007 |
 | θ | Uniform(−20,0) | −20 ≤ θ ≤ 0 | −3.29 ± 0.03 |
SA WSLS | ε | Beta(1,1) | 0 ≤ ε ≤ 1 | 0.143 ± 0.014 |
 | switch prob. after win | Beta(1,1) | 0 ≤ · ≤ 1 | 0.010 ± 0.002 |
 | switch prob. after loss | Beta(1,1) | 0 ≤ · ≤ 1 | 0.27 ± 0.02 |
SA rand | ε | Beta(1,1) | 0 ≤ ε ≤ 1 | 0.064 ± 0.009 |
 | switch prob. | Beta(1,1) | 0 ≤ · ≤ 1 | 0.169 ± 0.007 |
We fit each model’s parameters to each subject’s data separately. To facilitate this, we used regularizing priors that favored realistic values and maximum
We optimized model parameters by minimizing the negative log posterior of the data given different settings of the model parameters using the Matlab function
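In Python terms, the objective being minimized is a negative log posterior of roughly the following form. The two-parameter signature and the `model_nll` hook are placeholders for the real per-model likelihood routines; the Gamma(2,2) prior on β matches the table above, and the flat Beta(1,1) priors contribute nothing to the objective:

```python
import numpy as np
from math import lgamma, log

def log_gamma_pdf(x, a, scale):
    """Log density of a Gamma(a, scale) distribution."""
    return (a - 1) * log(x) - x / scale - lgamma(a) - a * log(scale)

def neg_log_posterior(params, data, model_nll):
    """MAP objective: choice negative log likelihood plus negative log
    priors. Beta(1,1) priors are flat, so only the Gamma prior appears."""
    b, h = params
    if b <= 0 or not (0.0 < h < 1.0):
        return np.inf                  # enforce parameter constraints
    return model_nll(b, h, data) - log_gamma_pdf(b, a=2, scale=2)
```

This objective can then be handed to any derivative-free optimizer, such as a Nelder-Mead simplex (the algorithm behind Matlab's fminsearch).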
in which
The resulting measure, ℘
We first describe the qualitative features of subjects’ behavior on our task, before presenting more detailed analyses to determine which model better explains subjects’ choices. We use two types of analyses: in the first, qualitative analysis, we test the data for distinctive patterns that are predicted by each of the models; in the second, quantitative analysis, we use the whole sequence of trial-by-trial choices and rewards to fit each of the models, and compare the likelihoods of the data given each model while accounting for their different numbers of free parameters. The results of both analyses indicate that the selective attention class of models better explains subjects’ behavior in our task.
Figure
To assess each subject’s degree of learning we compared the average fraction correct on the first five trials after a change to the average fraction correct on trials 16–20 after the change (provided that another change had not yet occurred). This is depicted in Figure
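This early-versus-late comparison can be sketched as follows; the window boundaries and names are our own, and epochs in which another change occurred before trial 20 are skipped:

```python
import numpy as np

def learning_effect(correct, change_points, early=(0, 5), late=(15, 20)):
    """Mean fraction correct on trials 1-5 vs. 16-20 after each change,
    using only epochs long enough that no further change intervenes."""
    early_acc, late_acc = [], []
    bounds = list(change_points) + [len(correct)]
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end - start < late[1]:
            continue                   # epoch too short, skip it
        seg = np.asarray(correct[start:end], dtype=float)
        early_acc.append(seg[slice(*early)].mean())
        late_acc.append(seg[slice(*late)].mean())
    return float(np.mean(early_acc)), float(np.mean(late_acc))
```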
The main qualitative difference between the two classes of models we propose is that a Bayesian learner learns about all features of the chosen stimulus, whereas a selective attention learner only learns about one feature in each trial. Thus the selective attention model predicts that if a subject switches the focus of attention after a zero-reward trial, she is equally likely to switch to
Unfortunately, as discussed above, we cannot directly infer the focus of attention from a single choice in our task. However, we can glean some information about the focus of attention in a model-free way from pairs of consecutive trials in which only one feature is common to both choices. On this subset of trials, for the selective attention model, there is a high probability that the currently attended feature is the feature common to both choices. We can then ask what happens on the next trial (i.e., the third trial in the sequence), based on the history of rewards for the first two trials, i.e., whether the two previous trials were “loss” trials (denoted as 00) or “win” trials (11). Specifically, we asked whether the third choice shared the putative attended feature (“stick”) or not (“switch”), and whether it included 0, 1, or 2 of the previous putatively unattended features. For instance, based on two consecutive trials in which a red-plaid-square and a red-dots-triangle were chosen, we guess that “red” was the attended feature, and ask: if both trials resulted in 0 reward, will the subject “stick” with red on the third trial? And if not, will the subject also tend to avoid dots and triangles (as predicted by the Bayesian class of models, but not the selective attention models)?
Figure
Figure
We assessed the sensitivity of our analysis by asking to what extent our model fitting procedure was able to correctly recover the identity of a model from simulated data. Specifically, we used each model to simulate 200 subjects (500 trials each), using parameter values consistent with the values fit to subjects’ behavior (that is, parameter values drawn from empirical priors generated from the above fits to subjects’ choice data). We then fit each of the five models to each simulated subject, and determined the model that best fit the subject’s data.
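Schematically, this model-recovery procedure looks like the following, where `simulate` and `fit_score` stand in for the real simulation and fitting routines:

```python
import numpy as np

def recovery_confusion(models, simulate, fit_score, n_sims=200):
    """Rows: generating model; columns: fraction of simulated subjects
    for which each candidate model gave the best (lowest) fit score."""
    K = len(models)
    conf = np.zeros((K, K))
    for i, gen in enumerate(models):
        for _ in range(n_sims):
            data = simulate(gen)
            scores = [fit_score(m, data) for m in models]
            conf[i, int(np.argmin(scores))] += 1
    return conf / n_sims
```

Good recoverability shows up as large values on the diagonal of the resulting matrix.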
The results of this analysis are summarized in the “confusion matrix” (Steyvers et al.,
Although the above analysis of choice behavior strongly favors the selective attention model, it is important to ask whether an independent agent playing that strategy can perform the task at a level comparable with that of humans. To test this, we simulated two agents – one playing according to the Bayesian model with fixed hazard rate and the other using the full selective attention strategy, with optimal parameter settings – and measured their learning curves.
In the Bayesian case, the optimal parameters correspond to a deterministic choice function (β = ∞) and a hazard rate set as the average hazard rate for the task (
One possible reason for the inferior performance of the selective attention model compared to human subjects is that, for tractability of the model-based choice analysis, we used a memoryless selective attention model. Thus, our selective attention agent has a probability of 1/9 of switching to the
For comparison, we also simulated the Bayesian, full, and random selective attention agents using the parameter values found by the fitting procedure. As expected for a model that pays no attention to rewards, the SA rand model shows no learning and performs at chance levels throughout. With the fit parameters, both the Bayesian and SA full models produce learning curves that are shallower than those of the human subjects.
Much progress has been made in recent years in understanding how humans and animals learn to maximize rewards by trial and error. However, this work has often eschewed the question of how the representation on which this learning process relies is itself learned. We investigated this “representation learning” process in a task in which humans had to learn concurrently which of three dimensions is relevant to rewards and, within this dimension, which feature is associated with the highest probability of reward. To exacerbate uncertainty in the task and further mimic learning in real-world scenarios, the underlying reward-generating process changed abruptly at different time points, unannounced to the subjects. Thus our task involved uncertainty at multiple levels: uncertain representations, uncertain (probabilistic) rewards, and uncertainty regarding change-point locations.
Surprisingly, in contrast to perceptual decision making tasks in which humans are often shown to be Bayes-optimal evidence integrators (Ernst and Banks,
There are at least two possible explanations for this suboptimality. The first is that subjects in our task were less trained than many of the human and animal subjects used in psychophysical and perceptual tasks, and thus may not yet have learned the correct Bayesian strategy. While this may account for differences from experiments such as (Munuera et al.,
Instead, we conjecture that there is a more general reason for subjects’ suboptimal representation learning: optimal performance in our task requires significantly more computational capacity than is needed to optimally solve any of the referenced perceptual tasks. Specifically, implementing the full Bayesian model with a variable hazard rate involves keeping track of up to 300 probability values (30 weights for each possible run-length plus 30 nine-valued probability distributions
In contrast, the selective attention model demands far fewer computational resources, requiring the maintenance of only two variables, the identity of the attended feature and the likelihood ratio, at any time step. This process is not wholly computationally suboptimal, as it at least utilizes an optimal (Bayesian) computation of the likelihood that the currently attended feature is the most rewarding. However, since this strategy does not take all available information into account, learning is slower in terms of the amount of experience that is needed in order to learn the correct representation.
Such an account may explain why in a related, but simpler, task with only two dimensions and features (Wunderlich et al.,
In real-world representation learning in which the number of possible relevant dimensions and features is potentially huge, we expect the difference in computational efficiency between selective attention and fully Bayesian strategies to be even more pronounced. Thus we conjecture that it is impossible for humans to implement a full Bayesian solution in the real-world and thus, even in our relatively simple task, they are using an alternative approximate mechanism which is computationally efficient but statistically suboptimal (see also Steyvers et al.,
The main limitation of the current experimental design was that we could not directly measure the subjects’ focus of attention. To nevertheless estimate choice probabilities for the selective attention models, we used the subjects’ sequence of behavioral choices and outcomes to infer the dynamically changing focus of attention. On some trials this method has significant uncertainty about the focus of attention, which reduces the power of our statistics in differentiating between models. Still, our results supported the selective attention models unambiguously. In future work we hope to overcome this limitation using modified versions of the task.
Another limitation is that our current analysis does not include a selective attention model with memory for options that have already been tested. This is because exact inference in change-point problems is intractable when data are correlated across change-points (as would be the case across attentional switches in such a model). In future work we will examine the possibility of performing approximate inference in these cases.
The two families of models that we explored, Bayesian and selective attention models, can be thought of as formal instantiations of previous models of discrimination learning: the Bayesian model expounding the “continuity theory” of early behaviorists (reviewed in Mackintosh,
Our results favor the latter theory. Because we have a precise formulation of the models in both cases, we can, in line with early theories of selective attention (Broadbent,
However, prior work also suggests models that we have yet to explore. Specifically, many past studies have found evidence for two-stage models of selective attention (Sutherland and Mackintosh,
In this light, Dehaene and Changeux’s (
Unfortunately, an exact model-based analysis is not possible in these cases due to violation of the product partition assumption. More work employing approximate inference schemes and different experimental manipulations will be needed to distinguish between these more subtle instantiations of selective attention.
The full selective attention model shares many similarities with that of Yu and Dayan (
Despite these differences, it is interesting to speculate whether the attention switching mechanism, proposed by Yu and Dayan (
We have presented a novel experimental paradigm in which humans infer the relevance of different stimulus features for determining rewards in a changing environment. Analysis of choice behavior in this task suggests that humans use a suboptimal inference process based on a selective attention, serial-hypothesis-testing strategy in which they focus on just one feature of the stimuli at a time. This glaring suboptimality is perhaps justified by the complexity and intractability of the problem at hand, and humans’ extraordinary success at learning new tasks in highly multidimensional and changing environments attests to its utility.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We thank S. J. Gershman and M. T. Todd for helpful comments on this work. This research was funded by a Sloan Fellowship (Yael Niv), the Binational United States-Israel Science Foundation and Award Number R03DA029073 from the National Institute on Drug Abuse. The content is solely the responsibility of the authors and does not represent the official views of any of the funders.
^{1}Note that the “product partition” simplification does not strictly hold for our task because here a change was always to a different dimension (e.g., if the relevant feature was red, it could change to waves, but never to green), which means that the data are weakly correlated across a change-point. Nevertheless, we use this approximation as it simplifies inference considerably while making the model only slightly suboptimal in terms of performance on the task.
^{2}It is also possible to use the so-called ε-greedy choice function with this model. In this case, the model chooses the highest value option with probability 1 − ε for some small ε (0 ≤ ε ≤ 1) otherwise choosing randomly between the three options. Empirically, we found this model to fit the data particularly poorly and hence we do not consider it further in this paper.
^{3}Note that the shaded significance region was computed conservatively, according to the subject with the fewest number of trials. For subjects with larger number of trials the shaded region should be smaller.