
Edited by: Nicole C. Kleinstreuer, National Institute of Environmental Health Sciences (NIEHS), United States

Reviewed by: Guohua Huang, Shaoyang University, China; Dimitri Ognibene, University of Essex, United Kingdom

This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Artificial Intelligence

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Adaptive agents must act in intrinsically uncertain environments with complex latent structure. Here, we elaborate a model of visual foraging—in a hierarchical context—wherein agents infer a higher-order visual pattern (a “scene”) by sequentially sampling ambiguous cues. Inspired by previous models of scene construction—that cast perception and action as consequences of approximate Bayesian inference—we use active inference to simulate decisions of agents categorizing a scene in a hierarchically-structured setting. Under active inference, agents develop probabilistic beliefs about their environment, while actively sampling it to maximize the evidence for their internal generative model. This approximate evidence maximization (i.e., self-evidencing) comprises drives to both maximize rewards and resolve uncertainty about hidden states. This is realized via minimization of a free energy functional of posterior beliefs about both the world as well as the actions used to sample or perturb it, corresponding to perception and action, respectively. We show that active inference, in the context of hierarchical scene construction, gives rise to many empirical evidence accumulation phenomena, such as noise-sensitive reaction times and epistemic saccades. We explain these behaviors in terms of the principled drives that constitute the

Our daily life is full of complex sensory scenarios that can be described as examples of “scene construction” (Hassabis and Maguire,

We investigate hierarchical belief-updating by modeling visual foraging as a form of scene construction, where individual images are actively sampled with saccadic eye movements in order to accumulate information and categorize the scene accurately (Yarbus,

Building on a previous Bayesian formulation of scene construction, in this work we use

The rest of this paper is structured as follows: first, we summarize active inference and the free energy principle, highlighting the

The goal of Bayesian inference is to infer possible explanations for data—this means obtaining a distribution over a set of parameters

Importantly, computing this quantity requires calculating the marginal probability

Solving this summation^{1}

where

This decomposition allows us to see that the free energy becomes a tighter upper-bound on surprise the closer the variational distribution ^{2}
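To make this bound concrete, consider a minimal discrete example (ours, for illustration; the helper `free_energy` and the toy numbers are assumptions, not quantities from the paper): a two-state hidden variable, one fixed observation, and a variational distribution q over states.

```python
import math

def free_energy(q, joint):
    """F = E_q[ln q(s) - ln p(o, s)]: an upper bound on surprise -ln p(o)."""
    return sum(q_s * (math.log(q_s) - math.log(p_os))
               for q_s, p_os in zip(q, joint) if q_s > 0)

# joint p(o, s) for one fixed observation o: prior [0.5, 0.5], likelihood [0.9, 0.3]
joint = [0.5 * 0.9, 0.5 * 0.3]
evidence = sum(joint)                      # p(o) = 0.6
posterior = [p / evidence for p in joint]  # exact posterior [0.75, 0.25]

surprise = -math.log(evidence)
assert free_energy([0.5, 0.5], joint) > surprise             # loose bound for a poor q
assert abs(free_energy(posterior, joint) - surprise) < 1e-9  # tight at the exact posterior
```

The second assertion illustrates the point in the text: the bound becomes exact precisely when the variational distribution equals the true posterior.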

Having discussed the variational approximation to Bayesian inference via free energy minimization, we now turn our attention to active inference. Active inference is a framework for modeling and understanding adaptive agents, premised on the idea that agents engage in approximate Bayesian inference with respect to an internal generative model of sensory data. Crucially, under active inference both action and perception are realizations of the single drive to minimize surprise. By using variational Bayesian inference to achieve this, an active inference agent generates Bayes-optimal beliefs about sources of variation in its environment by free-energy-driven optimization of an approximate posterior ^{3}

Here, we equip the agent with the prior belief that its policies minimize the free energy expected (under their pursuit) in the future. Under Markovian assumptions on the dependence between subsequent time points in the generative model

We will not derive the self-consistency of the prior belief that agents (believe they) will choose free-energy-minimizing policies, nor the full derivation of the expected free energy here. Interested readers can find the full derivations in Friston et al. (

From this decomposition of the quantity bounded by the expected free energy
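The roles of the two terms of this decomposition can be illustrated with a small numerical sketch (our own; the function and matrices are toy assumptions). It computes a one-step expected free energy as risk plus ambiguity, and shows that an unambiguous observation mapping is preferred, all else being equal.

```python
import math

def efe(qs, A, log_C):
    """One-step expected free energy G = risk + ambiguity.
    qs: predicted hidden state distribution under a policy;
    A[o][s] = p(o|s); log_C: log prior preferences over observations."""
    n_obs, n_states = len(A), len(qs)
    qo = [sum(A[o][s] * qs[s] for s in range(n_states)) for o in range(n_obs)]
    # risk: KL divergence between predicted and preferred observations
    risk = sum(p * (math.log(p) - log_C[o]) for o, p in enumerate(qo) if p > 0)
    # ambiguity: expected entropy of the observation likelihood
    ambiguity = sum(qs[s] * -sum(A[o][s] * math.log(A[o][s])
                                 for o in range(n_obs) if A[o][s] > 0)
                    for s in range(n_states))
    return risk + ambiguity

qs = [0.5, 0.5]
log_C = [math.log(0.5)] * 2                 # flat preferences
A_sharp = [[1.0, 0.0], [0.0, 1.0]]          # unambiguous state-to-outcome mapping
A_vague = [[0.5, 0.5], [0.5, 0.5]]          # uninformative mapping
assert efe(qs, A_sharp, log_C) < efe(qs, A_vague, log_C)
```

With flat preferences the risk term vanishes for both mappings, so the preference for `A_sharp` is driven entirely by ambiguity, i.e., by the epistemic drive to sample unambiguous observations.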

See the

In order to understand how minimizing expected free energy

We now describe an abstract scene construction task that will serve as the experimental context within which to frame our hierarchical account of active evidence accumulation. Inspired by a previous active inference model of scene construction introduced by Mirza et al. (

The scene configurations of the original formulation. The three scenes characterizing each trial in the original scene construction study, adapted with permission from Mirza et al. (

In the current work, scene construction is also framed as a categorization task, requiring the gaze-contingent disclosure of quadrants whose contents furnish evidence for beliefs about the scene identity. However, in the new task, the visual stimuli occupying the quadrants are animated

Random Dot Motion Stimuli (RDMs). Schematic of random dot motion stimuli, with increasing coherence levels (i.e., percentage of dots moving upwards) from left to right.

We also design the visual stimulus → scene mapping such that scenes are degenerate with respect to individual visual stimuli, as in the previous task (see

The mapping between scenes and RDMs. The mapping between the four abstract scene categories and their respective dot motion pattern manifestations in the context of the hierarchical scene construction task. As an example of the spatial invariance of each scene, the bottom right panels show two possible (out of 12 total) RDM configurations for the scene “

We have seen how both perception and action emerge as consequences of free energy minimization under active inference. Perception is analogized to state estimation and corresponds to optimizing variational beliefs about the hidden causes of sensory data

We now introduce the hierarchical active inference model of visual foraging and scene construction. The generative model (the agent) and the generative process of the environment both take the form of a Markov Decision Process or MDP. MDPs are a simple class of probabilistic generative models where space and time are treated discretely (Puterman, ), and where dynamics are described by transition probabilities P(s_{t}|s_{t−1}). This specification imbues the environment with Markovian, or “memoryless” dynamics. An extension of the standard MDP formulation is the partially-observed MDP (POMDP), which adds a probabilistic mapping (the likelihood P(o_{t}|s_{t})) from states to observations at a given time.
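As a minimal sketch of such a generative process (toy matrices of our own, not the task's), the following samples one step of a two-state POMDP: the hidden state evolves via an action-conditioned transition matrix B, and emits an observation via the likelihood A.

```python
import random

A = [[0.9, 0.1],   # A[o][s] = p(o | s)
     [0.1, 0.9]]
B = {0: [[1.0, 1.0],   # B[u][s'][s]: action 0 always moves to state 0
         [0.0, 0.0]],
     1: [[0.0, 0.0],   # action 1 always moves to state 1
         [1.0, 1.0]]}

def step(s, u, rng):
    """Sample s' ~ B[u](.|s), then o ~ A(.|s'): one tick of the generative process."""
    s_next = rng.choices([0, 1], weights=[B[u][sn][s] for sn in (0, 1)])[0]
    o = rng.choices([0, 1], weights=[A[o_][s_next] for o_ in (0, 1)])[0]
    return s_next, o

rng = random.Random(0)
s_next, o = step(0, 1, rng)
assert s_next == 1          # the deterministic transition was applied
assert o in (0, 1)
```

The Markov property is visible in the signature of `step`: the next state and observation depend only on the current state and action.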

A generative model is simply a joint probability distribution over sensory observations and their latent causes. In a hierarchical generative model, posterior beliefs over hidden states at one level are passed up as observations o^{(i+1)} for the level above, with associated priors and likelihoods operating at all levels. This marks a departure from previous work in the hierarchical POMDP literature (Pineau et al.,

The relationships between hidden states and observations at each level are expressed as multidimensional arrays in the likelihood matrices A^{(i),m}, where A^{(i),m} prescribes the probability of observing the outcome o^{(i),m} given the hidden states. The rules by which hidden states s^{(i)} evolve over time are given by Markov transition matrices B^{(i),n}(s_{t}|s_{t−1}, u_{t}). Here

A partially-observed Markov Decision Process with two hierarchical layers. Schematic overview of the generative model for a hierarchical partially-observed Markov Decision Process. The generic forms of the likelihoods, priors, and posteriors at the hierarchical levels are provided in the left panels, adapted with permission from Friston et al. ( ). Observations generated by hidden states at the higher level (s^{(2)}) may belong to one of two types: (1) observations that directly parameterize hidden states at the lower level via the composition of the observation likelihood one level above P(o^{(i + 1)}|s^{(i + 1)}) with the empirical prior or “link function” P(s^{(i)}|o^{(i + 1)}) at the level below, and (2) observations that are directly sampled from the likelihood at that level (P(o^{(i + 1)}|s^{(i + 1)})). For conciseness, we represent the first type of mapping, from states at the level above to states at the level below, as P(s^{(i)}|s^{(i + 1)}) = D^{(i)} in the left panel. In contrast, all observations at the lowest level (õ^{(1)}) feed directly from the generative process to the agent.

To compute the approximate posterior Q^{*}, we use a marginal message passing routine to perform a gradient descent on the variational free energy at each time step, where posterior beliefs about hidden states and policies are incremented using prediction errors ε (see below). Across levels, descending messages pass down predicted outcomes o^{(i+1)} from the level above, and “inferred observations” at higher levels are inherited as the final posterior beliefs

Belief-updating under active inference. Overview of the update equations for posterior beliefs under active inference. Inference entails finding the posterior Q^{*} that minimizes the variational free energy of observations. In practice the variational posterior over states is computed via a marginal message passing routine (Parr et al., ), whose fixed points coincide with Q^{*}. Solving via error-minimization lends the scheme a degree of biological plausibility and is consistent with process theories of neural function like predictive coding (Bastos et al., ). Prediction errors vanish and beliefs converge to Q^{*} when free energy is at its minimum (for a particular marginal), i.e.,

We also find it worthwhile to clarify the distinction between the

Where ln

The Iverson brackets [τ ≤

We now introduce the deep, temporal model of scene construction using the task discussed in Section 3 as our example (

Level 1 MDP. The lowest level comprises two hidden state factors: (1) the motion direction s^{(1),1} underlying visual observations at the currently-fixated region of the visual array and (2) the sampling state s^{(1),2}, an aspect of the environment that can be changed via actions, i.e., selections of the appropriate state transition, as encoded in the B^{(1),2} matrices. The motion direction s^{(1),1} can either correspond to a state with no motion signal (“Null,” in the case when there is no RDM or a categorization decision is being made) or assume one of the four discrete values corresponding to the four cardinal motion directions. At each time step of the generative process, the current state of the RDM stimulus s^{(1),1} is probabilistically mapped to a motion observation via the first-factor likelihood A^{(1),1} (shown in the top panel as A_{RDM}). The entropy of the columns of this mapping can be used to parameterize the coherence of the RDM stimulus, such that the true motion states s^{(1),1} cause motion observations o^{(1),1} with varying degrees of fidelity. This is demonstrated by two exemplary A_{RDM} matrices in the top panel (these correspond to different coherence levels of the motion state s^{(1),1}): the left-most matrix shows a noiseless, “coherent” mapping, analogized to the situation of when an RDM consists of all dots moving in the same direction as described by the true hidden state; the matrix to the right of the noiseless mapping corresponds to an incoherent RDM, where instantaneous motion observations may assume directions different than the true motion direction state, with the frequency of this deviation encoded by probabilities stored in the corresponding column of A_{RDM}. The motion direction state doesn't change in the course of a trial (see the identity matrix shown in the top panel as B_{RDM}, which simply maps the hidden state to itself at each subsequent time step)—this is true of both the generative model and the generative process. The second hidden state factor s^{(1),2} encodes the current “sampling state” of the agent; there are two levels under this factor: “Keep-sampling” and “Break-sampling” (with controllable transitions encoded in B_{Sampling state}).
Entering the “Break-sampling” state terminates the accumulation of evidence about the motion direction s^{(1),1}. A second likelihood A^{(1),2} (the “proprioceptive” likelihood, not shown for clarity) deterministically maps the current sampling state s^{(1),2} to an observation o^{(1),2} thereof (bottom row of lower right panel), so that the agent always observes which sampling state it is in unambiguously.

Lowest level (Level 1) beliefs are updated as the agent encounters a stream of ongoing, potentially ambiguous visual observations—the instantaneous contents of an individual fixation. The hidden states at this level describe a distribution over motion directions, which parameterize the true state of the random motion stimulus within the currently-fixated quadrant. Observations manifest as a sequence of stochastic motion signals that are samples from the true hidden state distribution.

The generative model has an identical form to the generative process (see above) used to generate the stream of Level 1 outcomes. Namely, it comprises the same set of likelihoods and transitions as the dynamics describing the “real” environment. For example, if the current true hidden state at the lower level is a given motion direction, observations are sampled from the corresponding column of A^{(1),1}. The precision of this column-encoded distribution over motion observations determines how often the sampled motions will be

Inference about the motion direction (Level 1 state estimation) roughly proceeds as follows: (1) at time ^{(1),1}; (2) posterior beliefs about the motion direction at the current timestep ^{(2),2}, where each transition matrix ^{(2),2}(^{4}
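The recursive flavor of this update can be sketched in a few lines (an illustration under our own coherence parameterization, not the paper's marginal message passing scheme): because the transition model over motion directions is the identity, the posterior after each motion sample simply becomes the prior for the next.

```python
DIRS = ["UP", "DOWN", "LEFT", "RIGHT"]

def likelihood_column(true_dir, coherence):
    """p(motion observation | direction state): `coherence` mass on the true
    direction, the remainder spread over the other three."""
    return [coherence if d == true_dir else (1 - coherence) / 3 for d in DIRS]

def update(prior, obs_idx, coherence):
    """One Bayes update; the identity transition makes the old posterior the new prior."""
    joint = [prior[s] * likelihood_column(DIRS[s], coherence)[obs_idx]
             for s in range(4)]
    Z = sum(joint)
    return [j / Z for j in joint]

belief = [0.25] * 4                          # flat prior over motion directions
obs_stream = [0, 0, 2, 0, 0, 0, 1, 0, 0, 0]  # mostly "UP" samples with some noise
for obs in obs_stream:
    belief = update(belief, obs, coherence=0.8)

assert belief.index(max(belief)) == 0        # beliefs settle on the true "UP" state
assert belief[0] > 0.99
```

Lowering `coherence` toward 0.25 makes each sample less informative, so more samples are needed before the posterior concentrates, which is the intuition behind the noise-sensitive dwell times reported later.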

We fixed the maximum temporal horizon of Level 1 (hereafter T_{1}) to be 20 time steps, such that if the “Break-sampling” state has not been entered earlier, sampling is terminated at the 20^{th} time step and the final posterior beliefs are passed up as outcomes for Level 2.

After beliefs about the state of the currently-foveated visual region are updated via active inference at Level 1, the resulting posterior belief about motion directions is passed up to Level 2 as a belief about observations. These observations (which can be thought of as the inferred state of the visual stimulus at the foveated area) are used to update the statistics of posterior beliefs over the hidden states operating at Level 2 (specifically, the hidden state factor that encodes the identity of the scene, e.g.,

The first hidden state factor corresponds to the scene identity. As described in Section 3, there are four possible scenes characterizing a given trial:

Level 2 MDP. The first hidden state factor s^{(2),1} encodes the scene identity of the trial, in terms of both the two unique RDM directions occupying two of the quadrants (four possible scenes, as described in the top right panel) and the spatial configuration (one of 12 unique ways to place two RDMs in four quadrants). This yields a dimensionality of 48 for this hidden state factor (4 scenes × 12 spatial configurations). The second hidden state factor s^{(2),2} encodes the eye position, which is initialized to be in the center of the quadrants (Location 1). The next four values of this factor index the four quadrants (2–5), and the last four are indices for the choice locations (the agent fixates one of these four options to guess the scene identity). As with the sampling state factor at Level 1, the eye position factor s^{(2),2} is controllable by the agent through the action-dependent transition matrices B^{(2),2}. Outcomes at Level 2 are characterized by three modalities: the first modality o^{(2),1} indicates the visual stimulus (or lack thereof) at the currently-fixated location. Note that during belief updating, the observations of this modality o^{(2),1} are inferred hidden states over motion directions that are passed up after solving the Level 1 MDP (see above). The likelihoods A^{(2),2} and A^{(2),3} map to respective observation modalities o^{(2),2} and o^{(2),3}, and are not shown for clarity; the A^{(2),2} likelihood encodes the joint probability of particular types of trial feedback (Null, Correct, Incorrect—encoded by o^{(2),2}) as a function of the current hidden scene and the location of the agent's eyes, while A^{(2),3} is an unambiguous proprioceptive mapping that signals to the agent the location of its own eyes via o^{(2),3}. Note that these two last observation modalities o^{(2),2} and o^{(2),3} are directly sampled from the environment, and are not passed up as “inferred observations” from Level 1.

The second hidden state factor corresponds to the current spatial position that's being visually fixated—this can be thought of as a hidden state encoding the current configuration of the agent's eyes. This hidden state factor has nine possible states: the first state corresponds to an initial position for the eyes (i.e., a fixation region in the center of the array); the next four states (indices 2–5) correspond to the fixation positions of the four quadrants in the array, and the final four states (6–9) correspond to categorization choices (i.e., a saccade which reports the agent's guess about the scene identity). The states of the first and second hidden state factors jointly determine which observation is sampled at each timestep on Level 2.

Observations at this level comprise three modalities. The first modality encodes the identity of the visual stimulus at the fixated location and is identical in form to the first hidden state factor at Level 1: namely, it can be either the “Null” outcome (when there is no visual stimulus at the fixated location) or one of the four motion directions. The likelihood matrix for the first modality on Level 2, namely A^{(2),1}, consists of probabilistic mappings from the scene identity/spatial configuration (encoded by the first hidden state factor) and the current fixation location (the second hidden state factor) to the stimulus identity at the fixated location. The second-modality likelihood A^{(2),2} is structured to return a “No Feedback” outcome in this modality when the agent fixates any area besides the response options, and returns “Correct” or “Incorrect” once the agent makes a saccade to one of the response options (locations 6–9)—the particular value it takes depends jointly on the true hidden scene and the scene identity that the agent has guessed. We will further discuss how a drive to respond accurately emerges when we describe the prior beliefs over outcomes. The third modality, o^{(2),3}, is a proprioceptive signal that unambiguously indicates the current eye position.

The transition matrices at Level 2, namely B^{(2),1} and B^{(2),2}, describe the dynamics of the scene identity and of the agent's oculomotor system, respectively. We assume the dynamics that describe the scene identity are both uncontrolled and unchanging, and thus fix B^{(2),1} to be an identity matrix that ensures the scene identity/spatial configuration is stable over time. As in earlier formulations (Friston et al., ), saccades are realized as actions that deterministically set the eye position via B^{(2),2} (e.g., if the action taken is 3 then the saccade destination is described by a transition matrix that contains a row of 1s on the third row, mapping from any previous location to location 3).
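This one-hot construction can be written down directly (a sketch with our own 0-based index convention):

```python
def saccade_B(destination, n_locations=9):
    """Action-conditioned transition matrix for a saccade: every column maps to
    `destination`, i.e., a row of 1s at the destination index (0-based)."""
    return [[1.0 if row == destination else 0.0 for _ in range(n_locations)]
            for row in range(n_locations)]

B = saccade_B(2)                        # the action "saccade to location 3"
eyes = [1.0] + [0.0] * 8                # eyes start at the central fixation
new_eyes = [sum(B[i][j] * eyes[j] for j in range(9)) for i in range(9)]
assert new_eyes[2] == 1.0 and sum(new_eyes) == 1.0
```

Because each column is a delta distribution, applying the matrix to any belief over eye positions collapses it onto the chosen destination, which is what makes saccades fully controllable.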

Inference and action selection at Level 2 proceeds as follows: based on the current hidden state distribution and Level 1's likelihood mapping A^{(1),1} (the generative process), observations are sampled from the three modalities. The observation under the first modality at this level (either “Null” or a motion direction parameterizing an RDM stimulus) is passed down to Level 1 as the initial prior over motion direction states, where A^{(1),1} is the generative model's likelihood and Q(s_{t}) is the latest posterior density over hidden states (factorized into scene identity and fixation location). This predictive density over (first-modality) outcomes serves as an empirical prior for Level 1. Once the lower level terminates, posterior beliefs about the 1^{st} factor hidden states are passed to Level 2 as “inferred” observations of the first modality. The belief updating at Level 2 proceeds as usual, where observations (both those “inferred” from Level 1 and the true observations from the Level 2 generative process: the oculomotor state and reward modality) are integrated using Level 2's generative model to form posterior beliefs about hidden states and policies. The policies at this level, like at the lower level, only consider one step ahead in the future—so each policy consists of one action (a saccade to one of the quadrants or a categorization action), to be taken at the next timestep. An action is sampled from the posterior over policies u_{t} ~ Q(π). In this way, an entire sequence of lower-level inference (lasting up to T_{1} = 20 time steps in our case) can be nested within a single time step of a higher-level process, endowing such generative models with a flexible, modular form of temporal depth. Also note the asymmetry in informational scheduling across layers, with posterior beliefs about those hidden states linked with the higher level being passed
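The descending message amounts to pushing the current state posterior through the outcome likelihood (toy dimensions here, not the task's 48 scene states):

```python
# Q(o) = sum_s P(o|s) Q(s): predictive density over first-modality outcomes,
# passed down to the level below as an empirical prior.
A = [[0.7, 0.1],
     [0.2, 0.3],
     [0.1, 0.6]]        # A[o][s] = p(o | s)
qs = [0.5, 0.5]          # current higher-level posterior over (toy) hidden states

empirical_prior = [sum(A[o][s] * qs[s] for s in range(2)) for o in range(3)]
assert abs(sum(empirical_prior) - 1.0) < 1e-9   # a proper distribution over outcomes
assert abs(empirical_prior[0] - 0.4) < 1e-9
```

The same matrix-vector product, run in the ascending direction with inferred outcomes, is what lets lower-level posteriors update beliefs about the scene.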

In addition to the likelihood

The prior preferences over outcomes at Level 2 (in particular, C^{(2),2}) encode the agent's beliefs about receiving correct and avoiding incorrect feedback. Prior beliefs over the other outcome modalities (C^{(2),1} and C^{(2),3}) are all trivially zero. These beliefs are stationary over time and affect saccade selection at Level 2 via the expected free energy of policies. The priors over initial hidden states D^{(2)} at this level encode the agent's initial beliefs about the scene identity and the location of their eyes. This prior over hidden states can be manipulated to put the agent's beliefs about the world at odds with the actual hidden state of the world. At Level 1, the agent's preferences about being in the “Break-sampling” vs. “Keep-sampling” states are encoded in the preferences over the second outcome modality (C^{(1),2}), which corresponds to the agent's unambiguous perception of its own sampling state. Finally, the prior beliefs about initial states at Level 1 (D^{(1)}) correspond to the motion direction hidden state (the RDM identity) as well as which sampling-state the agent is in. Crucially, the first factor of these prior beliefs D^{(1),1} is initialized as the “expected observations” from Level 2: the expected motion direction (first modality). These expected observations are generated by passing the variational beliefs about the scene at Level 2 through the modality-specific likelihood mapping: Q(o^{(2),1}) = Σ_{s^{(2),1}} P(o^{(2),1}|s^{(2),1})Q(s^{(2),1}). The prior over hidden states at Level 1 is thus called an empirical prior.

The risk term of the expected free energy compares the predictive density over outcomes Q(o_{τ}|π) with the log probability density over outcomes log P(o_{τ}). This reinterpretation of preferences as prior beliefs about observations allows us to discard the classical notion of a “utility function” as postulated in fields like reward neuroscience and economics, instead explaining both epistemic and instrumental behavior using the common currency of log-probabilities and surprise. In order to motivate agents to categorize the scene, we embed a self-expectation of accuracy into the prior preferences, such that the prior probability of the “Break-sampling” state (C^{(1),2}) increases over time. This necessitates that the complementary probability of remaining in the “Keep-sampling” state decreases over time.
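One way to sketch such a time-increasing preference (the linear ramp below is a hypothetical parameterization of ours, not the paper's) is to let the log-preference for breaking grow with elapsed time, so that the posterior over the two sampling policies tilts toward breaking:

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    return [v / sum(e) for v in e]

def sampling_preferences(t, horizon=20):
    """Log-preferences over the two sampling outcomes [Keep, Break]; the Break
    preference ramps up linearly with elapsed time (hypothetical schedule)."""
    p_break = 0.05 + 0.9 * t / horizon
    return [math.log(1.0 - p_break), math.log(p_break)]

# Treating preferences as the only contribution to (negative) expected free energy:
early = softmax(sampling_preferences(1))
late = softmax(sampling_preferences(19))
assert late[1] > early[1]     # the urge to break sampling grows over the trial
```

In the full model this urgency signal competes with the epistemic value of continued sampling, so breaking happens earlier when observations are uninformative.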

Finally, the

In the following sections, we present hierarchical active inference simulations of scene construction, in which we manipulate the uncertainty associated with beliefs at different levels of the generative model to see how it differentially affects inference across levels.

Having introduced the hierarchical generative model for our RDM-based scene construction task, we will now explore behavior and belief-formation in the context of hierarchical active inference. In the following sections we study different aspects of the generative model through quantitative simulations. We relate parameters of the generative model to both “behavioral” read-outs (such as sampling time, categorization latency and accuracy) as well as the agents' internal dynamics (such as the evolution of posterior beliefs, the contribution of different kinds of value to policies, etc.). We then discuss the implications of our model for studies of hierarchical inference in noisy, compositionally-structured environments.

We manipulate sensory precision via the likelihood A^{(1),1}. Each column of A^{(1),1} is initialized as a “one-hot” vector that contains a probability of 1 at the motion observation index corresponding to the true motion direction, and 0s elsewhere. As precision decreases, probability mass is spread away from the true direction. This manipulation spares the “Null” entry of A^{(1),1}, as the first row/column of the likelihood (A^{(1),1}(1, 1)) corresponds to observations about the “Null” hidden state, which is always observed unambiguously when it is present. In other words, locations that do not contain RDM stimuli are always perceived as “Null” in the first modality with certainty.
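The precision manipulation on the RDM columns can be sketched as follows (our parameterization by a single coherence number; the paper's exact parameterization may differ, and the noiseless “Null” row/column is omitted for brevity):

```python
import math

def rdm_likelihood(coherence, n_dirs=4):
    """Columns of an A^{(1),1}-style mapping over the four motion directions:
    `coherence` mass on the true direction, the rest spread uniformly."""
    return [[coherence if o == s else (1.0 - coherence) / (n_dirs - 1)
             for s in range(n_dirs)] for o in range(n_dirs)]

def column_entropy(A, s):
    """Entropy of p(o | s), the quantity that parameterizes stimulus coherence."""
    return -sum(A[o][s] * math.log(A[o][s]) for o in range(len(A)) if A[o][s] > 0)

A_coherent = rdm_likelihood(1.0)   # one-hot columns: fully coherent stimulus
A_noisy = rdm_likelihood(0.4)      # incoherent stimulus

assert A_coherent[0][0] == 1.0 and A_coherent[1][0] == 0.0
assert column_entropy(A_noisy, 0) > column_entropy(A_coherent, 0)
```

The column entropy is the knob: zero entropy reproduces the one-hot mapping, and higher entropy yields the noisier mappings used in the low-precision simulations.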

Simulated trial of scene construction under high sensory precision. ^{(1),1}. The agent observes the true RDM at Level 1 and updates its posterior beliefs about this hidden state. As uncertainty about the RDM direction is resolved, the “

Simulated trial of scene construction with low sensory precision. Same as in

2^{nd} to 5^{th} rows of Panel

Effect of sensory precision on scene construction performance. Average categorization latency

We quantified the relationship between sensory precision and scene construction performance by simulating scene construction trials under different sensory precisions

For the simulations discussed in the previous section, agents always start scene construction trials with “flat” prior beliefs about the scene identity. This means that the first factor of the prior beliefs about hidden states at Level 2, D^{(2),1}, was initialized as a uniform distribution. We can manipulate the agent's initial expectations about the scenes and their spatial arrangements by arbitrarily sculpting D^{(2),1} to have high or low probabilities over any state or set of states. Although many manipulations of the Level 2 prior over hidden states are possible, here we introduce a simple prior belief manipulation by uniformly elevating the prior probability of all spatial configurations (12 total) of a single type of scene. For example, to furnish an agent with the belief that there is a 50% chance of any given trial being a

Effect of sensory precision on scene construction performance for different prior belief strengths. Same as in the previous figure, but for agents initialized with prior beliefs (D^{(2),1}) concentrated upon one of the four possible scenes. This elevated probability is uniformly spread among the 12 hidden states corresponding to the different quadrant-configurations of that scene, such that the agent has no prior expectation about a particular arrangement of the scene, but rather about that scene type in general. Here, we only show the results for agents with “incorrect” prior beliefs—namely, when the scene that the agent believes to be at play is different from the scene actually characterizing the trial.

The interaction between sensory and prior precision is not as straightforward when it comes to categorization latency.

Now we explore the effects of sensory and prior precision on belief-updating and policy selection at the lower level, during a single quadrant fixation. Low sensory precision renders the sampling of motion observations relatively useless for agents, and it “pays” to just break sampling early. This results in the pattern of break-times that we observe in

Effect of sensory precision on quadrant dwell time. Shown are differences in posterior policy probabilities (Q_{Keep-sampling} − Q_{Break-sampling}). We only show these posterior policy differentials for the first 10 time steps of sampling at Level 1 due to insufficient numbers of saccades that lasted more than 10 time steps at the highest/lowest sensory precisions (see

It is worth mentioning the barely noticeable effect of prior beliefs (^{(1),2}).

The curves plot the posterior policy differential Q_{Keep-sampling} − Q_{Break-sampling}. At the lowest sensory precisions, there is barely any epistemic value to pursuing the “Keep-sampling” policy.

In the current work, we presented a hierarchical partially-observed Markov Decision Process model of scene construction, where scenes are defined as arbitrary constellations of random dot motion (RDM) stimuli. Inspired by an earlier model of scene construction (Mirza et al.,

These results contrast with the predictions of classic evidence accumulation models like the drift-diffusion model or DDM (Ratcliff,

A discussion of the relationship between the current model and previous hierarchical POMDP schemes is also warranted. The model most closely related to the current work is the “deep temporal model” of active reading, proposed by Friston et al. (

Insight from the robotics and probabilistic planning literature could also be integrated with the current work to extend deep active inference in its scope and flexibility. For instance, the framework of “planning to see” proposed in Sridharan et al. (

The hierarchical active inference scheme could also be extended to dynamic environments, where the scene itself changes, either due to intrinsic stochasticity or as a function of the agent's (or other agents') actions. This could simply be changed by encoding appropriate self-initiated state-changes into the transition model (the “B” matrices) or by introducing intrinsic, non-agent-controlled dynamics into the generative process. Ongoing work in the robotics and planning literature has highlighted the challenges of dynamic, structured environments and proposed various schemes to both plan actions and form probabilistic beliefs in such tasks (Ognibene and Demiris,

In future investigations, we plan to estimate the parameters of hierarchical active inference models from experimental data of human participants performing a scene construction task, where the identities of visual stimuli are uncertain (the equivalent of manipulating the sensory likelihood at Level 1 of the hierarchy). Data-driven inversion of a deep scene construction model can then be used to explain inter-subject variability in aspects of hierarchical inference behavior as different parameterizations of subject-specific generative models.

The data used in this study are the results of numerical simulations, and as such, we do not provide datasets. The software used to simulate the data and generate associated figures are based on visual foraging and scene construction demos included in SPM v12.0, and can be freely downloaded from

RH and AP conceived the original idea for the project. RH and MM conceived the hierarchical active inference model. RH, AP, and IK designed the scene construction task using random dot motion. MM, TP, and KF gave the critical insight into formulation of the model. RH conducted the simulations and analyzed the results. All authors contributed to the writing of the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors would like to thank Brennan Klein for extensive feedback on the paper and the figures, and Brennan Klein and Alec Tschantz for discussion and initial conceptualization of the project. The authors also thank Kai Ueltzhöffer for feedback on the project and discussions relating active inference to drift-diffusion models. The authors also thank the Monash University Network of Excellence for supporting the workshop Causation and Complexity in the Conscious Brain (Aegina, Greece 2018) at which many of the ideas related to this project were developed.

We provide the derivation of Equation (8), the expected free energy as an upper bound on the negative information gain and negative extrinsic value:

We also offer a derivation of Equation (9), the formulation of the expected free energy as the sum of “risk” and “ambiguity,” starting from its definition as an upper bound on the (negative) epistemic and instrumental values. We can write

The above derivation assumes that the mapping from predicted states Q(s_{τ}|π) to predicted observations Q(o_{τ}|s_{τ}, π) is given as the likelihood of the generative model, i.e., Q(o_{τ}, s_{τ}|π) = P(o_{τ}|s_{τ})Q(s_{τ}|π).
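For the reader's convenience, the end point of this derivation, written in the notation standard in the active inference literature (our transcription as a reader aid, under the factorization assumption just stated), is the familiar risk plus ambiguity form:

```latex
G(\pi) \;=\;
\underbrace{D_{\mathrm{KL}}\!\left[ Q(o_\tau \mid \pi) \,\middle\|\, \tilde{P}(o_\tau) \right]}_{\text{risk}}
\;+\;
\underbrace{\mathbb{E}_{Q(s_\tau \mid \pi)}\!\left[ \mathrm{H}\!\left[ P(o_\tau \mid s_\tau) \right] \right]}_{\text{ambiguity}}
```

Here risk penalizes predicted outcomes that deviate from preferred outcomes P̃(o_τ), while ambiguity penalizes visiting states whose observation likelihoods are high-entropy.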

We provide a derivation of Equation (10), the full variational free energy of the posterior over observations, hidden states and policies:

^{1}From now on we assume the use of discrete probability distributions for convenience and compatibility with the sort of generative models relevant to the current work.

^{2}The Kullback-Leibler divergence or

^{3}Hereafter we refer to observations and

^{4}This threshold is referred to as “residual uncertainty,” and by default is set to as