
Edited by: Sven Bestmann, University College London, UK

Reviewed by: Quentin Huys, University College London, UK; Robert C. Wilson, Princeton University, USA

*Correspondence: Christoph Mathys, Laboratory for Social and Neural Systems Research, Department of Economics, University of Zurich, Bluemlisalpstrasse 10, 8006 Zurich, Switzerland. e-mail:

This is an open-access article subject to a non-exclusive license between the authors and Frontiers Media SA, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and other Frontiers conditions are complied with.

Computational learning models are critical for understanding mechanisms of adaptive behavior. However, the two major current frameworks, reinforcement learning (RL) and Bayesian learning, both have certain limitations. For example, many Bayesian models are agnostic of inter-individual variability and involve complicated integrals, making online learning difficult. Here, we introduce a generic hierarchical Bayesian framework for individual learning under multiple forms of uncertainty (e.g., environmental volatility and perceptual uncertainty). The model assumes Gaussian random walks of states at all but the first level, with the step size determined by the next highest level. The coupling between levels is controlled by parameters that shape the influence of uncertainty on learning in a subject-specific fashion. Using variational Bayes under a mean-field approximation and a novel approximation to the posterior energy function, we derive trial-by-trial update equations which (i) are analytical and extremely efficient, enabling real-time learning, (ii) have a natural interpretation in terms of RL, and (iii) contain parameters representing processes which play a key role in current theories of learning, e.g., precision-weighting of prediction error. These parameters allow for the expression of individual differences in learning and may relate to specific neuromodulatory mechanisms in the brain. Our model is very general: it can deal with both discrete and continuous states and equally accounts for deterministic and probabilistic relations between environmental events and perceptual states (i.e., situations with and without perceptual uncertainty). These properties are illustrated by simulations and analyses of empirical time series. Overall, our framework provides a novel foundation for understanding normal and pathological learning that contextualizes RL within a generic Bayesian scheme and thus connects it to principles of optimality from probability theory.

Learning can be understood as the process of updating an agent's beliefs about the world by integrating new and old information. This enables the agent to exploit past experience and improve predictions about the future; e.g., the consequences of chosen actions. Understanding how biological agents, such as humans or animals, learn requires a specification of both the computational principles and their neurophysiological implementation in the brain. This can be approached in a bottom-up fashion, building a neuronal circuit from neurons and synapses and studying what forms of learning are supported by the ensuing neuronal architecture. Alternatively, one can choose a top-down approach, using generic computational principles to construct generative models of learning and use these to infer on underlying mechanisms (e.g., Daunizeau et al.,

The laws of inductive inference, prescribing an optimal way to learn from new information, have long been known (Laplace,

These difficulties have been avoided by descriptive approaches to learning, which are not grounded in probability theory, notably some forms of reinforcement learning (RL), where agents learn the “value” of different stimuli and actions (Sutton and Barto,

Despite these advantages, RL also suffers from major limitations. On the theoretical side, it is a heuristic approach that does not follow from the principles of probability theory. In practical terms, it often performs badly in real-world situations where environmental states and the outcomes of actions are not known to the agent, but must also be inferred or learned. These practical limitations have led some authors to argue that Bayesian principles and “structure learning” are essential in improving RL approaches (Gershman and Niv,

Any Bayesian learning scheme relies upon the definition of a so-called “generative model,” i.e., a set of probabilistic assumptions about how sensory signals are generated. The generative model we propose is inspired by the seminal work of Behrens et al. (

To prevent any false expectations, we would like to point out that this paper is of a purely theoretical nature. Its purpose is to introduce the theoretical foundations and derivation of our model in detail and convey an understanding of the phenomena it can capture. Owing to length restrictions and to maintain clarity and focus, several important aspects cannot be addressed in this paper. For example, it is beyond the scope of this initial theory paper to investigate the numerical exactness of our variational inversion scheme, present applications of model selection, or present evidence that our model provides for more powerful inference on learning mechanisms from empirical data than other approaches. These topics will be addressed in forthcoming papers by our group.

This paper is structured as follows: First, we derive the general structure of the model, level by level, and consider both its exact and variational inversion. We then present analytical update equations for each level of the model that derive from our variational approach and a quadratic approximation to the variational energies. Following a structural interpretation of the model's update equations in terms of RL, we present simulations in which we demonstrate the model's behavior under different parameter values. Finally, we illustrate the generality of our model by demonstrating that it can equally deal with (i) discrete and continuous environmental states, and (ii) deterministic and probabilistic relations between environmental and perceptual states (i.e., situations with and without perceptual uncertainty).

An overview of our generative model is given by the figures below. In the following, we assume an agent that receives a sequence of sensory inputs u^(1), u^(2),…,u^(n). Given a generative model of how the agent's environment generates these inputs, probability theory tells us how the agent can make optimal use of the inputs u^(1),…,u^(k−1) and any further "prior" information to predict the next input u^(k). The generative model we introduce here is an extension of the model proposed by Daunizeau et al. (), where the agent receives a binary input u^(k) ∈ {0, 1} on trial k.

Figure caption: x₁ determines the probability of the input u.

Figure caption: u^(k), the input at time point k.

For simplicity, imagine a situation where the agent is only interested in a single (binary) state of its environment; e.g., whether it is light or dark. In our model, the environmental state x₁ at time k determines the sensory input u^(k). Here, we assume for now that this relation is deterministic, i.e., u^(k) = x₁^(k).

In other words: u = x₁ for both x₁ = 1 and x₁ = 0, so knowing x₁ allows for an accurate prediction of input u.

Since x₁ is binary, its probability distribution can be described by a single real number, the state x₂ at the next level of the hierarchy. We then map x₂ to the probability of x₁ such that x₂ = 0 means that x₁ = 0 and x₁ = 1 are equally probable. For x₂ → ∞ the probabilities for x₁ = 1 and x₁ = 0 should approach 1 and 0, respectively. Conversely, for x₂ → −∞ the probabilities for x₁ = 1 and x₁ = 0 should approach 0 and 1, respectively. This can be achieved with the following empirical (conditional) prior density:

p(x₁ | x₂) = s(x₂)^{x₁} (1 − s(x₂))^{1−x₁},

where s(·) denotes the logistic sigmoid function, s(x) = 1/(1 + exp(−x)).

Put simply, x₂ is an unbounded, real-valued representation of the probability that x₁ = 1. In our light/dark example, one might interpret x₂ as the tendency of the light to be on.
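The limiting behavior just described is satisfied by the logistic sigmoid s(·). A minimal sketch (purely illustrative):

```python
import math

def s(x2):
    """Logistic sigmoid: maps the unbounded tendency x2 onto p(x1 = 1)."""
    return 1.0 / (1.0 + math.exp(-x2))

# x2 = 0: both outcomes equally probable; large |x2|: near-certainty
```

Note that s(x₂) + s(−x₂) = 1, so the two outcome probabilities behave symmetrically, as required.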

For the sake of generality, we make no assumptions about the probability of x₂ except that it may change with time as a Gaussian random walk. This means that the value of x₂ at time k is normally distributed around its value at the previous time point, x₂^(k−1):

Importantly, the dispersion of the random walk (i.e., the variance exp(κx₃ + ω) of the conditional probability) is determined by the parameters κ and ω (which may differ across agents) as well as by the state x₃. Here, this state determines the log-volatility of the environment (cf. Behrens et al.,): the tendency x₂ of the light switch to be on performs a Gaussian random walk with volatility exp(κx₃ + ω). Introducing ω allows for a component of the volatility that scales independently of the state x₃. Everything applying to x₂ now equally applies to x₃, such that we could add as many levels as we please. Here, we stop at the fourth level, and set the volatility of x₃ to a constant ϑ.
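The hierarchy described so far can be simulated by ancestral sampling, drawing x₃, then x₂, then x₁ on each trial. The following sketch uses illustrative values for κ, ω, and ϑ (assumptions for demonstration only, not values advocated in the text):

```python
import math
import random

def simulate(n_trials, kappa=1.0, omega=-2.0, theta=0.05, seed=0):
    """Sample trial-wise trajectories: x3 performs a Gaussian random walk with
    variance theta; x2 performs one with variance exp(kappa * x3 + omega);
    x1 is Bernoulli with parameter s(x2) = 1 / (1 + exp(-x2))."""
    rng = random.Random(seed)
    x2, x3 = 0.0, 0.0
    trajectory = []
    for _ in range(n_trials):
        x3 += rng.gauss(0.0, math.sqrt(theta))
        x2 += rng.gauss(0.0, math.sqrt(math.exp(kappa * x3 + omega)))
        p1 = 1.0 / (1.0 + math.exp(-x2))
        x1 = 1 if rng.random() < p1 else 0
        trajectory.append((x1, x2, x3))
    return trajectory
```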

Given full priors on the parameters, i.e.,

Given priors on the initial states, inverting the model means inferring the states x = {x₁, x₂, x₃} and the parameters χ = {κ, ω, ϑ}; this corresponds to perceptual inference and learning, respectively. In the next section, we consider the nature of this inversion or optimization.

It is instructive to consider the factorization of the generative density

In this form, the Markovian structure of the model becomes apparent: the joint probability of the input and the states at time k depends only on the states at time k−1 (and the parameters).

By integrating out

Once u^(k) is observed, we can plug it into this expression and obtain

This is the quantity of interest to us, because it describes the posterior probability of the time-dependent states x^(k) (and time-independent parameters χ) in the agent's environment. This is what the agent infers (and learns), given the history of previous inputs. Computing this probability is called model inversion: unlike the likelihood, which predicts data (i.e., sensory input u^(k)) from hidden states and parameters, the inverted model predicts states and parameters from data.

In the framework we introduce here, we model the individual variability between agents by putting delta-function priors on the parameters:

where χ_a denotes the parameter values specific to a given agent a.

In principle, model inversion can proceed in an online fashion: by (numerical) marginalization, we can obtain the (marginal) posteriors p(x^(k), χ | u^(1…k)); from these, we can compute the prediction p(x^(k+1), χ | u^(1…k)), which is updated to the posterior p(x^(k+1), χ | u^(1…k+1)) once u^(k+1) becomes known, and so on. Unfortunately, this (exact) inversion involves many complicated (non-analytical) integrals for every new input, rendering exact Bayesian inversion unsuitable for real-time learning in a biological setting. If the brain uses a Bayesian scheme, it is likely that it relies on some sufficiently accurate, but fast, approximation.
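To see why exact inversion by numerical marginalization is feasible in principle but costly, consider a toy, non-hierarchical analog: sequentially updating a grid-based posterior over a fixed Bernoulli parameter (a sketch for illustration only):

```python
def grid_posterior(inputs, n_grid=1000):
    """Exact sequential Bayesian updating on a grid, for the simplest case:
    u ~ Bernoulli(theta) with a fixed theta and a uniform prior."""
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    post = [1.0 / n_grid] * n_grid                    # uniform prior over theta
    for u in inputs:
        like = [t if u == 1 else 1.0 - t for t in grid]
        post = [p * l for p, l in zip(post, like)]    # Bayes' rule, unnormalized
        z = sum(post)
        post = [p / z for p in post]                  # renormalize on every trial
    mean = sum(t * p for t, p in zip(grid, post))
    return grid, post, mean
```

Even this one-dimensional toy requires on the order of n_grid operations per trial; the full hierarchical model would require such sums (integrals) over several coupled dimensions on every trial, which motivates the analytical approximation derived below.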

Variational Bayesian (VB) inversion determines the posterior distributions

Based on this assumption, the variational maximization of the negative free energy is implemented in a series of variational updates for each level of the hierarchy. At the first level, we describe the approximate posterior of the binary state x₁ by a mean μ₁ = q(x₁ = 1). Under this constraint, the maximum entropy distribution is the Bernoulli distribution with parameter μ₁ (where the variance μ₁(1 − μ₁) is a function of the mean):

At the second and third level, the maximum entropy distribution of the unbounded real variables x₂ and x₃, given their means and variances, is Gaussian. Note that the choice of a Gaussian distribution for the approximate posteriors is not due simply to computational expediency (or the law of large numbers) but follows from the fact that, given the assumption that the posterior is encoded by its first two moments, the maximum entropy principle prescribes a Gaussian distribution. Labeling the means μ₂, μ₃ and the variances σ₂, σ₃, we obtain

Now that we have defined the form of the approximate posteriors q(x_i), each is determined by its variational energy I(x_i):

where x_{\i} denotes all states except x_i, and q(x_{\i}) = ∏_{j≠i} q(x_j) is the corresponding mean-field factor.

To compute I(x_i), we need the sufficient statistics λ_{\i} = {μ_{\i}, σ_{\i}} of the posteriors at all but the i-th level. We begin the updates at the lowest level, using the previous trial's values for the higher levels, since the information contained in the new input u^(k) cannot yet have reached those levels. From there we proceed upward through the hierarchy of levels, always using the updated parameters λ_i of the levels below.

Substituting these variational energies into Eq.

According to Eqs

Update equations of this form allow the agent to update its approximate posteriors over the states after each new input.

At the first level of the model, it is simple to determine q(x₁), since

and therefore already has the form required of q(x₁) by Eq.

At the second level, the approximate posterior q(x₂) is required by Eq.  to be Gaussian. The exact variational posterior, however, is proportional to exp(I(x₂)) and would only be Gaussian if the variational energy I(x₂) were quadratic. The problem is thus one of finding a Gaussian approximation to exp(I(x₂)). The obvious way to achieve this is to expand I(x₂) in powers of x₂ up to second order. The choice of expansion point, however, is not trivial. One possible choice is the mode or maximum of I(x₂), resulting in the frequently used Laplace approximation; but the mode of I(x₂) is unknown and has to be found by numerical optimization methods, precluding a single-step analytical update rule of the form of Eq. . Instead, we expand around an analytically available point determined by the previous trial's posterior, yielding an update driven by the input u^(k) and the expectation of x₁.

where, for clarity, we have used the definitions

formulated in terms of precisions (i.e., inverse variances, π ≡ 1/σ)

In the context of the update equations, we use the hat notation to indicate "referring to prediction": while hatted quantities describe the agent's prediction before seeing the input u^(k), the corresponding unhatted quantities describe its posterior belief after the update.

The approach that produces these update equations is conceptually similar to a Gauss–Newton ascent on the variational energy that would, by iteration, produce the Laplace approximation (cf. Figure

with

The derivation of these update equations, based on this novel quadratic approximation to I(x₂) and I(x₃), is described in detail in the next section, and Figure

Figure caption: The variational energy is expanded quadratically around the previous posterior mean μ^(k−1). The argmax of the expansion gives the updated mean μ^(k). This generally leads to a different result from the Laplace approximation (dashed).

While knowing _{i}_{\i}_{i}

We denote by Ĩ the quadratic function obtained by expanding I up to second order around the chosen expansion point.

This equation lets us find

where ∂² denotes the second derivative, which is constant for a quadratic function. Because ∂²I and ∂²Ĩ agree at the expansion point

A somewhat different line of reasoning leads to

If we choose

Plugging I(x₂) from Eq.  into this quadratic approximation yields the update equations for the sufficient statistics λ_i given above.
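Numerically, the single-step construction amounts to one Newton-like step from the expansion point, with the curvature supplying the variance. The following one-dimensional sketch (with finite-difference derivatives, for illustration only; `I` is an arbitrary smooth energy) makes this concrete:

```python
def quadratic_step(I, x0, h=1e-5):
    """Expand the energy I to second order around x0 and return the argmax of
    the resulting quadratic (new mean) and -1 / I''(x0) (new variance).
    Derivatives are approximated by central finite differences."""
    d1 = (I(x0 + h) - I(x0 - h)) / (2.0 * h)
    d2 = (I(x0 + h) - 2.0 * I(x0) + I(x0 - h)) / h ** 2
    if d2 >= 0.0:
        raise ValueError("non-negative curvature: no Gaussian approximation")
    mu = x0 - d1 / d2       # argmax of the quadratic expansion
    sigma = -1.0 / d2       # implied posterior variance
    return mu, sigma
```

For an exactly quadratic energy, this recovers the mode and variance exactly in a single step, whatever the expansion point; for the approximately quadratic energies considered here, it yields the approximate posterior moments.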

As we have seen, variational inversion of our model, using a new quadratic approximation to the variational energies, provides a set of simple trial-by-trial update rules for the sufficient statistics λ_i = {μ_i, σ_i} of the approximate posteriors.

Crucially, the update equations for μ₂ and μ₃ have a form that is familiar from RL models such as Rescorla–Wagner learning (Figure

As we explain in detail below, this same structure appears in Eqs

The term δ₁ is the prediction error about x₁: the difference between the posterior expectation of x₁ having observed input u^(k) and the prediction μ̂₁^(k); i.e., the softmax transformation of the expectation of x₂ before seeing u^(k). Furthermore, σ₂ represents the width of x₂'s posterior and thus the degree of our uncertainty about x₂; given this, it makes sense that updates in μ₂ are proportional to this estimate of posterior uncertainty: the less confident the agent is about what it knows, the greater the influence of new information should be.

According to its update Eq. , the variance σ₂ always remains positive since its expression contains only positive terms. Crucially, σ₂ receives volatility-dependent input, through exp(κμ₃ + ω), from the third level of the model. For vanishing volatility, i.e., exp(κμ₃ + ω) → 0, σ₂ can only decrease from trial to trial. This corresponds to the case in which the agent believes that x₂ is fixed; the information from every trial then has the same weight and new information can only shrink σ₂. On the other hand, even with vanishing volatility, σ₂ remains bounded below by zero: when σ₂ approaches zero, the denominator of Eq.  approaches unity from above, leading to ever smaller decreases in σ₂. This means that after a long train of inputs, the agent still learns from new input, even when it infers that the environment is stable.

The precision formulation (cf. Eq.

illustrates that three forms of uncertainty influence the posterior variance

The update rule (Eq. ) for μ₃ has a similar structure to that of μ₂ (Eq. ). The volatility prediction error δ₂ is positive if the updates of μ₂ and σ₂ in response to input u^(k) indicate that the agent was underestimating x₃. Conversely, it is negative if the agent was overestimating x₃. This can be seen by noting that the uncertainty about x₂ has two sources: the informational uncertainty about x₂ (represented by σ₂), and the environmental uncertainty exp(κμ₃ + ω). Before seeing u^(k), the total uncertainty is σ₂^(k−1) + exp(κμ₃^(k−1) + ω); after σ₂ has been updated according to Eq. , it is σ₂^(k) plus the squared update in μ₂. If the total uncertainty is greater after seeing u^(k), the fraction in δ₂ exceeds unity, δ₂ is positive, and μ₃ increases. Conversely, if seeing u^(k) reduces total uncertainty, μ₃ decreases. (Since x₃ is on a logarithmic scale with respect to x₂, the ratio and not the difference of quantities referring to x₂ is relevant for the prediction error in x₃.) It is important to note that we did not construct the update equations with any of these properties in mind. It is simply a reflection of Bayes optimality that emerges on applying our variational update method.

The term corresponding to the learning rate of μ₃ is

As at the second level, this is proportional to the variance σ₃ of the posterior. But here, the learning rate is also proportional to the parameter κ and a weighting factor w₂. This factor represents the environmental uncertainty exp(κμ₃ + ω) about x₂ relative to its total uncertainty, including the (informational) conditional uncertainty σ₂. It is bounded between 0 and 1 and approaches 0 as σ₂ becomes large relative to exp(κμ₃ + ω); conversely, it approaches 1 as σ₂ becomes negligibly small relative to exp(κμ₃ + ω). High uncertainty about x₂ (i.e., conditional uncertainty σ₂) therefore suppresses updates of μ₃ by reducing the learning rate, reflecting the fact that prediction errors in x₂ are only informative if the agent is confident about its predictions of x₂. As with the prediction error term, this weighting factor emerged from our variational approximation.

The precision update (Eq.

As at the second level, the precision update is the sum of the precision of the prediction, π̂₃, and an input-dependent term.

Proportionality to κ² reflects the fact that stronger coupling between the second and third levels leads to higher posterior precision (i.e., less posterior uncertainty) at the third level, while proportionality to w₂ depresses precision at the third level when informational uncertainty at the second level is high relative to environmental uncertainty; the latter also applies to the first summand in the brackets. The second summand r₂δ₂ means that, when the agent regards environmental uncertainty at the second level as relatively high (r₂ > 0), volatility is held back from rising further if δ₂ > 0 by way of a decrease in the learning rate (which is proportional to the inverse precision), but conversely pushed to fall if δ₂ < 0. If, however, environmental uncertainty is relatively low (r₂ < 0), the opposite applies: positive volatility prediction errors increase the learning rate, allowing the environmental uncertainty to rise more easily, while negative prediction errors decrease the learning rate. The term r₂δ₂ thus exerts a stabilizing influence on the estimate of x₃.
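Putting the second- and third-level updates together, a single trial of the scheme can be sketched as follows. This is an illustrative implementation consistent with the relations described in this section (precision-weighted prediction error δ₁, volatility prediction error δ₂, weighting factor w₂); the exact constants, the form of the third-level precision update, and all parameter values should be treated as assumptions for demonstration rather than as the derivation itself:

```python
import math

def hgf_update(u, mu2, sigma2, mu3, sigma3, kappa=1.4, omega=-2.2, theta=0.5):
    """One trial of the hierarchical update for binary input u in {0, 1}."""
    # Predictions before seeing u
    muhat1 = 1.0 / (1.0 + math.exp(-mu2))      # s(mu2): predicted p(x1 = 1)
    v = math.exp(kappa * mu3 + omega)          # environmental uncertainty at level 2
    sigmahat2 = sigma2 + v                     # total prediction variance at level 2

    # Level 2: prediction error delta1, learning rate = posterior variance
    delta1 = u - muhat1
    new_sigma2 = 1.0 / (1.0 / sigmahat2 + muhat1 * (1.0 - muhat1))
    new_mu2 = mu2 + new_sigma2 * delta1

    # Level 3: volatility prediction error delta2 and weighting factor w2
    w2 = v / sigmahat2                         # in (0, 1): large sigma2 suppresses updates
    delta2 = (new_sigma2 + (new_mu2 - mu2) ** 2) / sigmahat2 - 1.0
    pihat3 = 1.0 / (sigma3 + theta)
    r2 = (v - sigma2) / sigmahat2              # sign of relative environmental uncertainty
    pi3 = pihat3 + (kappa ** 2 / 2.0) * w2 * (w2 + r2 * delta2)
    new_sigma3 = 1.0 / pi3
    new_mu3 = mu3 + new_sigma3 * (kappa / 2.0) * w2 * delta2

    return new_mu2, new_sigma2, new_mu3, new_sigma3
```

Iterating this function over an input sequence reproduces the qualitative behavior discussed below: beliefs track outcome probabilities, and the learning rate at the second level rises with the estimated volatility.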

This automatic integration of all the information relevant to a situation is typical of Bayesian methods and brings to mind a remark made by Jaynes (

Here, we present several simulations to illustrate the behavior of the update equations under different values of the parameters κ, ω, and ϑ.

Figure caption: Bottom: the first level, with input u and state x₁. The fine black line is the true probability (unknown to the agent) that x₁ = 1. The red line shows s(μ₂); i.e., the agent's posterior expectation that x₁ = 1. Given the input and update rules, the simulation is uniquely determined by the value of the parameters. Middle: the second level, with the posterior expectation μ₂ of x₂. Top: the third level, with the posterior expectation μ₃ of x₃. In all three panels, the initial values of the various quantities are indicated.

Figure caption: μ₃ is more stable. The learning rate in x₂ is initially unaffected, but owing to the more stable μ₃ it no longer increases after the period of increased volatility.

Figure caption: Little learning in x₂ leads to an extremely stable expected x₃. Despite prediction errors, the agent makes only small updates to its beliefs about its environment.

Figure caption: x₂ and x₃ are only weakly coupled. Despite uncertainty about x₃, only small updates to μ₃ take place. Sensitivity to changes in volatility is reduced. Learning about x₂ is not affected directly, but its learning rate does not increase with volatility.

The reference scenario in Figure  begins with a period in which the probability that x₁ = 1 is 0.5. The posterior expectation of x₁ accordingly fluctuates around 0.5 and that of x₂ around 0; the expected volatility remains relatively stable. There then follows a second period of 120 trials with higher volatility, where the probability that x₁ = 1 alternates between 0.9 and 0.1 every 20 trials. After each change, the estimate of x₁ reliably approaches the true value within about 20 trials. In accordance with the changes in probability, the expected outcome tendency μ₂ now fluctuates more widely around zero. At the third level, the expected log-volatility μ₃ shows a tendency to rise throughout this period, displaying upward jumps whenever the probability of an outcome changes (and thus μ₂ experiences sudden updates). As would be anticipated, the expected log-volatility declines during periods of stable outcome probability. In a third and final period, the first 100 trials are repeated in exactly the same order. Note how, owing to the higher estimate of volatility (i.e., greater μ₃), the updates are now larger in x₂ than during the first stage of the simulation. As expected, a more volatile environment leads to a higher learning rate.

Figure caption: Left: μ₂ (bold red line); fine red lines indicate the range of posterior uncertainty around μ₂ (determined by σ₂). Right: μ₃ (bold blue line); fine blue lines indicate the range of posterior uncertainty around μ₃ (determined by σ₃). Circles indicate initial values.

One may wonder why, in the third stage of the simulation, the expected log-volatility μ₃ continues to rise even after the true x₂ has returned to a stable value of 0 (corresponding to p(x₁ = 1) = 0.5; see the fine black line in Figure ). This is because the input sequence is ambiguous: e.g., three x₁ = 1 outcomes, followed by three outcomes of x₁ = 0, could just as well reflect a stable p(x₁ = 1) = 0.5 or a jump from p(x₁ = 1) = 1 to p(x₁ = 1) = 0 after the first three trials. Depending on the particular choice of parameters, the agent will weight these interpretations differently.

The nature of the simulation in Figure  is also determined by the initial values of μ₂, σ₂, μ₃, and σ₃ (the initial value of μ₁ is s(μ₂)). Any change in the initial value of μ₃ can be neutralized by corresponding changes in κ and ω, while changes in the initial σ₂, μ₂, and σ₃ are, in principle, not neutral. They are nevertheless of little consequence in practice since, when chosen reasonably, they let the time series of inferred hidden states quickly converge to values that do not depend on the choice of initial value to any appreciable extent. In the simulations of Figures

If we reduce ϑ (the volatility of x₃) from 0.5 in the reference scenario to 0.05, we find an agent which is overly confident about its prior estimate of environmental volatility and expects to see little change (Figure ). This slows learning about x₃, while learning in x₂ is not directly affected. There is, however, an indirect effect on x₂ in that the learning rate at the second level during the third period is no longer noticeably increased by the preceding period of higher volatility. In other words, this agent shows superficial (low-level) adaptability but has higher-level beliefs that remain impervious to new information.

Figure  shows the effect of reducing ω, the constant (i.e., independent of x₃) component of log-volatility, to −4. The multiplicative scaling exp(ω) of the step size is now very small, leading to little learning in x₂, which in turn leads to little learning in x₃, since the agent can only infer changes in x₃ from changes in x₂ (cf. Eq.

The coupling between x₂ and x₃ can be diminished by reducing the value of κ, which controls the (x₃-dependent) scaling of volatility in x₂ (Figure ). With a low κ, x₃ is effectively insulated from the effects of prediction error in x₂ (cf. Eq. ). This leads to very little updating of μ₃ and to a much larger posterior variance σ₃ than in any of the above scenarios (see Figure ). Learning about x₂ itself is not directly affected, except that higher volatility in the second stage of the simulation remains without effect on the learning rate of x₂ in the third stage. This time, however, this effect is not caused by overconfidence about x₃ (due to small ϑ) but occurs despite great uncertainty about x₃ (large σ₃), which would normally be expected to lead to greater learning because of the dependency of the learning rate on σ₃. This paradoxical effect can be understood by examining Eqs : a small κ directly attenuates updates of μ₃; indirectly, the learning rate is increased, in that smaller updates leave a larger σ₃. But this is dominated by the dependency of the learning rate on κ.

The simulations described above switch between two probability regimes: p(x₁ = 1) = 0.5, and p(x₁ = 1) = 0.9 or 0.1. The stimulus distributions under these two regimes have different variances (or risk). Figure  therefore shows a further simulation in which the risk is held constant, with p(x₁ = 1) = 0.85 or 0.15, throughout the entire simulation. One recovers the same effects as in the reference scenario. We now consider generalizations of the generative model that relax some of the simplifying assumptions about sensory mappings and outcomes we made during its exposition above.

The model can readily accommodate perceptual uncertainty at the first level. This pertains to the mapping from stimulus category x₁ to sensory input u:

Here, the input u is now continuous rather than binary. If x₁ = 1, the probability of u is normally distributed around η_a; the agent therefore expects to receive input of, on average, η_a when x₁ = 1. If, however, x₁ = 0, the most likely sensation is η_b. The greater the overlap of the two distributions (i.e., the larger the variance α relative to (η_a − η_b)²), the greater the perceptual uncertainty. The main point here is that with this modification, the model can account for situations where x₁ can no longer be inferred with certainty from u.

With Eq. , the update at the first level becomes a posterior expectation μ₁ of x₁:

If u^(k) ≈ η_a or u^(k) ≈ η_b, the sensory input is unambiguous, and the prediction based on μ₂ has no influence on the agent's belief about x₁. If, however, the sensory input is ambiguous in that u^(k) is far from both η_a and η_b, the agent relies on its prediction from μ₂ to predict stimulus category. Importantly, the update equations for the higher levels of the model are not affected by this introduction of perceptual uncertainty at the first level.
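The combination of an ambiguous sensation with the second-level prediction is simply Bayes' rule with two Gaussian likelihoods. A minimal sketch (η_a, η_b, and the variance α here are illustrative values, not values from the text):

```python
import math

def posterior_x1(u, mu2, eta_a=1.0, eta_b=0.0, alpha=0.05):
    """Posterior p(x1 = 1 | u): Gaussian likelihoods centered on eta_a (x1 = 1)
    and eta_b (x1 = 0) with common variance alpha, combined with the
    prior s(mu2) supplied by the second level."""
    prior = 1.0 / (1.0 + math.exp(-mu2))                # s(mu2)
    like_a = math.exp(-(u - eta_a) ** 2 / (2.0 * alpha))  # p(u | x1 = 1), up to a constant
    like_b = math.exp(-(u - eta_b) ** 2 / (2.0 * alpha))  # p(u | x1 = 0), up to a constant
    return prior * like_a / (prior * like_a + (1.0 - prior) * like_b)
```

With an unambiguous input (u near η_a), the posterior is dominated by the likelihood even against a skeptical prior; with an ambiguous input (u midway between η_a and η_b), the posterior reduces to the prior s(μ₂), as described above.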

A generative model that comprises a hierarchy of Gaussian random walks can be applied to many problems of learning and inference. Here, we provide a final example, where the bottom-level state being inferred is not binary but continuous (i.e., a real number). Our example concerns the exchange rate between the U.S. Dollar (USD) and the Swiss Franc (CHF) during the first 180 trading days of the year 2010 (source: ). The value of the USD in CHF is here the state x₂, even though it occupies the lowest level of the hierarchy. In other words, the input u is generated directly by x₂ (without passing through a binary state x₁). The data u are the daily exchange rates:

where α is the constant variance with which the input u fluctuates around x₂. This can be regarded as a measure of perceptual uncertainty (i.e., how uncertain the trader is about his "perception" of USD value relative to CHF). On top of this input level, we can now add as many coupled random walks as we please. For our example, we can describe the hidden states in higher levels with Eqs

For this continuous-input model, the variational energy I(x₂) is already quadratic, so no further approximation beyond the mean-field approximation is needed, and μ₂ and σ₂ are the exact moments of the posterior under the mean-field approximation. Because the higher levels remain the same, the update equations for μ₃ and σ₃ are as given in Eqs
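The resulting second-level update has a Kalman-filter-like form: the new mean is a precision-weighted average of the prediction and the input. The following sketch uses illustrative parameter values and a simplified third-level precision update (assumptions for demonstration, not the exact form derived in the text):

```python
import math

def continuous_update(u, mu2, sigma2, mu3, sigma3,
                      alpha=1e-5, kappa=1.0, omega=-12.0, theta=0.5):
    """One update for a continuous input u observed around x2 with variance alpha."""
    v = math.exp(kappa * mu3 + omega)      # environmental uncertainty (volatility)
    sigmahat2 = sigma2 + v                 # prediction variance at level 2
    # Precision-weighted average of prediction and input (Kalman-like gain)
    pi2 = 1.0 / sigmahat2 + 1.0 / alpha
    new_sigma2 = 1.0 / pi2
    gain = (1.0 / alpha) / pi2             # approaches 1 as alpha -> 0
    new_mu2 = mu2 + gain * (u - mu2)
    # Volatility update, driven by the ratio of posterior to predicted uncertainty
    w2 = v / sigmahat2
    delta2 = (new_sigma2 + (new_mu2 - mu2) ** 2) / sigmahat2 - 1.0
    new_sigma3 = 1.0 / (1.0 / (sigma3 + theta) + (kappa ** 2 / 2.0) * w2 ** 2)
    new_mu3 = mu3 + new_sigma3 * (kappa / 2.0) * w2 * delta2
    return new_mu2, new_sigma2, new_mu3, new_sigma3
```

Note how the gain implements the scenarios discussed below: as α shrinks, μ₂ follows u almost exactly, while larger α makes the updates conservative, weighing prior information against new input.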

Scenarios with different parameter values for the USD–CHF example are presented in Figures . In the first (reference) scenario, α = 10⁻⁵. This parameterization conveys a small amount of perceptual uncertainty that leads to minor but visible deviations of μ₂ from the input: the updates of μ₂ are conservative in the sense that they consider prior information along with new input. Note also that μ₃ rises whenever the prediction error about x₂ is large, that is, when the green dots denoting the input lie far from the prediction; conversely, μ₃ falls when predictions of x₂ are more accurate. In the next scenario (Figure ), α = 10⁻⁶. This scenario thus shows an agent who is effectively without perceptual uncertainty. As prescribed by the update equations above, μ₂ now follows the input almost exactly, while μ₃ tracks the amount of change in x₂. In Figure , perceptual uncertainty is increased (α = 10⁻⁴). Here, the agent adapts more slowly to changes in the exchange rate since it cannot be sure whether prediction error is due to a change in the true value of x₂ or to misperception. The final scenario in Figure  shows that reducing κ smoothes the trajectory of μ₃ in a similar way that perceptual uncertainty smoothes the trajectory of μ₂.

Figure caption (α = 10⁻⁵): Value x₂ of the U.S. Dollar against the Swiss Franc during the first 180 trading days of the year 2010. Bottom panel: input u and posterior expectation μ₂. Top panel: posterior expectation μ₃ of the log-volatility x₃ of x₂.

Figure caption: As the reference scenario, but with α = 10⁻⁶ (effectively no perceptual uncertainty).

Figure caption: As the reference scenario, but with α = 10⁻⁴ (increased perceptual uncertainty).

Figure caption: As the reference scenario (α = 10⁻⁵), but with reduced κ.

This example again emphasizes the fact that Bayes-optimal behavior can manifest in many diverse forms. The different behaviors emitted by the agents above are all optimal under their implicit prior beliefs encoded by the parameters that control the evolution of high-level hidden states. Clearly, it would be possible to optimize these parameters using the same variational techniques we have considered for the hidden states. This would involve optimizing the free energy bound on the evidence for each agent's model (prior beliefs) integrated over time (i.e., learning). Alternatively, one could optimize performance by selecting those agents with prior beliefs (about the parameters) that had the best free energy (made the most accurate inferences over time). Colloquially, this would be the difference between training an expert to predict financial markets and simply hiring experts whose priors were the closest to the true values. We will deal with these issues of model inversion and selection in forthcoming work (Mathys et al., in preparation). We close this article with a discussion of the neurobiology behind variations in priors and the neurochemical basis of differences in the underlying parameters of the generative model.

In this article, we have introduced a generic hierarchical Bayesian framework that describes inference under uncertainty; for example, due to environmental volatility or perceptual uncertainty. The model assumes that the states evolve as Gaussian random walks at all but the first level, where their volatility (i.e., conditional variance of the state given the previous state) is determined by the next highest level. This coupling across levels is controlled by parameters, whose values may differ across subjects. In contrast to “ideal” Bayesian learning models, which prescribe a fixed process for any agent, this allows for the representation of inter-individual differences in behavior and how it is influenced by uncertainty. This variation is cast in terms of prior beliefs about the parameters coupling hierarchical levels in the generative model.

A major goal of our work was to eschew the complicated integrals in exact Bayesian inference and instead derive analytical update equations with algorithmic efficiency and biological plausibility. For this purpose, we used an approximate (variational) Bayesian approach, under a mean-field assumption and a novel approximation to the posterior energy function. The resulting single-step, trial-by-trial update equations have several important properties:

They have an analytical form and are extremely efficient, allowing for real-time inference.

They are biologically plausible in that the mathematical operations required for calculating the updates are fairly basic and could be performed by single neurons (London and Hausser,

Their structure is remarkably similar to update equations from standard RL models; this enables an interpretation that places RL heuristics, such as learning rate or prediction error, in a principled (Bayesian) framework.

The model parameters determine processes, such as precision-weighting of prediction errors, which play a key role in current theories of normal and pathological learning and may relate to specific neuromodulatory mechanisms in the brain (see below).

They can accommodate states of either discrete or continuous nature and can deal with deterministic and probabilistic mappings between environmental causes and perceptual consequences (i.e., situations with and without perceptual uncertainty).

Crucially, the closed-form update equations do not depend on the details of the model but only on its hierarchical structure and the assumptions on which the mean-field and quadratic approximation to the posteriors rest. Our method of deriving the update equations may thus be adopted for the inversion of a large class of models. Above, we demonstrated this anecdotally by providing update equations for two extensions of the original model, which accounted for sensory states of a continuous (rather than discrete) nature and perceptual uncertainty, respectively.

As alternatives to our variational scheme, one could deal with the complicated integrals of Bayesian inference by sampling methods or avoid them altogether and use simpler RL schemes. We did not pursue these options because we wanted to take a principled approach to individual learning under uncertainty; i.e., one that rests on the inversion of a full Bayesian generative model. Furthermore, we wanted to avoid sampling approximations because of the computational burden they impose. Although it is conceivable that neuronal populations could implement sampling methods, it is not clear how exactly they would do that and at what temporal and energetic cost (Yang and Shadlen,

We would like to emphasize that the examples of update equations derived here can serve as the building blocks for those of more complicated models. For example, if we have more than two categories at the first level, this can be accommodated by additional random walks at the second and subsequent levels; the update equations derived above then apply at each of those levels.
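As an illustration of how these building blocks compose, the following sketch simulates the generative process described above: binary outcomes at the first level and Gaussian random walks above, with the step size at each level determined by the level above. The specific coupling exp(κx₃ + ω), the function and parameter names, and the parameter values are illustrative assumptions, not the exact parameterization of the model:

```python
import numpy as np

def simulate_hierarchy(n_trials=320, kappa=1.4, omega=-2.2, theta=0.5, seed=0):
    """Sketch of the hierarchical generative model (illustrative only)."""
    rng = np.random.default_rng(seed)
    x3 = np.zeros(n_trials)  # log-volatility: Gaussian random walk
    x2 = np.zeros(n_trials)  # tendency: random walk, step size set by x3
    u = np.zeros(n_trials)   # binary observations at the first level
    for k in range(1, n_trials):
        # top level: random walk with constant step-size parameter theta
        x3[k] = x3[k - 1] + np.sqrt(theta) * rng.standard_normal()
        # second level: step size determined by the level above
        step = np.sqrt(np.exp(kappa * x3[k] + omega))
        x2[k] = x2[k - 1] + step * rng.standard_normal()
        # first level: binary outcome via a sigmoid (softmax for >2 categories)
        p = 1.0 / (1.0 + np.exp(-x2[k]))
        u[k] = rng.random() < p
    return x2, x3, u

x2, x3, u = simulate_hierarchy()
```

Additional categories at the first level would simply add further coupled random walks of the same form at the second and higher levels.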

One specific problem that has been addressed with Bayesian methods in the recent past concerns online inference of “changepoints,” i.e., sudden changes in the statistical structure of the environment (Corrado et al.). In our hierarchical model, such changes are naturally expressed as an increase in the volatility represented at the higher levels.

Clearly, our approach is not the first that has tried to derive tractable update equations from the full Bayesian formulation of learning. Although not described in this way in the original work, even the famous Kalman filter can be interpreted as a Bayesian scheme with RL-like update properties but is restricted to relatively simple (non-hierarchical) learning processes in stable environments (Kalman, 1960).
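The RL-like character of the Kalman filter is most transparent in its scalar form: the Kalman gain plays the role of a learning rate that adapts to the relative uncertainty of prediction and observation. A minimal sketch (our notation, not that of the original work):

```python
def kalman_step(mu, sigma, y, process_var, obs_var):
    """One scalar Kalman filter update, written to expose its RL-like form."""
    sigma_pred = sigma + process_var            # predict: uncertainty grows
    gain = sigma_pred / (sigma_pred + obs_var)  # Kalman gain = adaptive learning rate
    mu_new = mu + gain * (y - mu)               # RL-style prediction-error update
    sigma_new = (1.0 - gain) * sigma_pred       # posterior uncertainty shrinks
    return mu_new, sigma_new, gain

# The gain (learning rate) decreases as the estimate becomes more certain:
mu, sigma = 0.0, 10.0
gains = []
for y in [1.0, 1.2, 0.9, 1.1]:
    mu, sigma, g = kalman_step(mu, sigma, y, process_var=0.01, obs_var=1.0)
    gains.append(g)
```

Because the process noise is constant, this scheme cannot represent a volatile environment; our hierarchical model, by contrast, lets the effective step size itself be inferred online.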

Our update scheme also has its limitations. The most important of these is that it depends on the variational energies being approximately quadratic. If they are not, the approximate posterior implied by our update equations might bear little resemblance (e.g., in terms of Kullback–Leibler divergence) to the true posterior. Specifically, the update fails if the curvature of the variational energy at the expansion point is negative (which implies that the conditional variance is negative). σ_{2} can never become negative; σ_{3} = 1/π_{3}, however, could become negative according to its update equation.
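This failure mode is easy to demonstrate numerically: for a non-quadratic energy, the curvature, and hence the validity of the variance implied by a quadratic expansion, depends on the expansion point. In the following sketch, the cosine is merely a stand-in for a non-quadratic variational energy:

```python
import numpy as np

def curvature(energy, x, h=1e-4):
    """Central-difference estimate of the second derivative of an energy at x."""
    return (energy(x + h) - 2.0 * energy(x) + energy(x - h)) / h ** 2

# cos is a stand-in for a non-quadratic variational energy: its curvature
# changes sign with the expansion point, so the implied Gaussian variance
# is valid at some expansion points and invalid at others.
c_good = curvature(np.cos, np.pi)  # positive curvature here
c_bad = curvature(np.cos, 0.0)     # negative curvature here: update would fail
```

An implementation should therefore check the sign of the implied variance on every trial and treat a violation as a diagnostic that the quadratic approximation has broken down.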

In addition to the derivation of these update equations, we have provided simulations that demonstrate model behavior under different parameterizations (priors). These simulations confirmed the computational efficiency of our approach: because the updates are analytical, the simulated time series shown in the figures above were computed near-instantaneously, making the scheme suitable for real-time learning.

Maladaptive behavior, owing to inappropriate learning and decision-making, is at the heart of most psychiatric diseases, and our framework may be particularly useful for modeling the underlying mechanisms. There are two complementary approaches one might consider: phenomenological and neurophysiological. To illustrate a phenomenological approach, we will consider extreme settings of the parameters in terms of psychopathology. In the simulations above, for example, one could regard a large uncertainty about environmental volatility (i.e., a high σ_{3}) as a possible cause of anxiety. In other words, knowing that the world is changing quickly is frightening enough, but being uncertain about the extent of this change may be even more upsetting. Anxiety of this sort is often observed prior to (or in the early phase of) psychotic episodes (Häfner et al.) and might correspond to an abnormally high σ_{3}, i.e., an abnormally low precision π_{3} of the belief about volatility.

From a neurophysiological perspective, it has been proposed that dopamine might not encode the prediction error on value itself (cf. Schultz et al., 1997) but rather a quantity related to its precision; the precision-weighting of prediction errors in our update equations provides a formal handle on such proposals.

The distinction between hidden states, which vary in time and are the dynamic components of the agent's model of the world, and parameters, which are time-invariant and encode stable subject-specific learning styles, is a key component of our model. One might compare this to classical RL models where value estimates (states) are updated dynamically while the learning rate is an invariant parameter. In our case, however, the (implicit) learning rate is dynamic and results from an interaction between states and parameters: the latter determine how higher-level states influence lower-level ones. This effect of the static parameters on dynamic cross-level coupling can be seen directly from the update equations above.
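Schematically, this interaction can be written as an effective, trial-varying learning rate at the second level (symbols schematic; the exact expressions are those of the update equations above):

```latex
\mu_2^{(k)} = \hat{\mu}_2^{(k)} + \alpha_{\mathrm{eff}}^{(k)}\,\delta_1^{(k)},
\qquad
\alpha_{\mathrm{eff}}^{(k)} = g\big(\underbrace{\mu_3^{(k-1)}}_{\text{state}};\;
\underbrace{\kappa,\ \omega}_{\text{parameters}}\big),
```

where the time-invariant parameters κ and ω determine how strongly the current volatility estimate μ₃ (a state) modulates the update of μ₂.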

One could, of course, consider alternative formulations of our model in which individual learning mechanisms, determined by the coupling across levels, are encoded entirely by states. There is a simple reason why we did not pursue this alternative. Clearly, modeling dynamic aspects of learning, such as rapid updating of learning rates, does require a representation involving states (see above). On the other hand, within an individual agent's brain, the physiological mechanism underlying this coupling must obey some general principles that have been shaped both by (life-long) experience and by genetic background (cf. evidence that RL depends on individual genotype; Frank et al., 2007). Time-invariant parameters are the natural representation of such stable, subject-specific principles.

It should be emphasized that the idea of fitting learning models to subject-specific data and using the ensuing individual parameter estimates for assessing inter-individual variability is not new and has been pursued by many previous studies (e.g., Steyvers and Brown).

The main goal of this paper was to introduce the mathematical basis of our approach and illustrate its functionality. Clearly, the simulations shown in this paper are anecdotal and cannot fully demonstrate or establish the practical utility of our approach. In particular, we have not yet demonstrated how our model can be inverted (fitted), given empirical measurements. This requires extending the present approach with a response model that connects states from the Bayesian learning model to measurable responses of a neurophysiological or behavioral sort (cf. Daunizeau et al., 2010).

In summary, we have introduced a novel and generic framework for approximate Bayesian inference with computationally efficient and interpretable closed-form update equations. Simulations show that our approach is applicable to a range of situations beyond classical RL, including inductive inference on discrete and continuous states and situations with perceptual ambiguity. Crucially, our approach accommodates inter-individual differences, in terms of prior beliefs about key model parameters, and quantifies their computational effects: some of these parameters may map onto neurophysiological (neuromodulatory) mechanisms that have been implicated in the neurobiology of learning and psychopathology. As such, it may be a useful framework for modeling individual differences in behavior and for formally characterizing behavioral stereotypes and pathophysiologically distinct subgroups in psychiatric spectrum diseases (Stephan et al.).

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

This work was supported by the NCCR “Neural Plasticity and Repair” (CM, KES), the University Research Priority Program “Foundations of Human Social Behavior” at the University of Zurich (KES), the Neurochoice project of the Swiss Systems Biology initiative SystemsX.ch (JD, KES), and the Wellcome Trust (KJF).