
Edited by: Tom Verguts, Ghent University, Belgium

Reviewed by: Elise Lesage, Ghent University, Belgium; Carolina Feher Da Silva, University of Zurich, Switzerland

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Decision-making is assumed to be supported by model-free and model-based systems: the model-free system is based purely on experience, while the model-based system uses a cognitive map of the environment and is more accurate. The recently developed multistep decision-making task and its computational model can dissociate the contributions of the two systems and have been used widely. This study used this task and model to understand our value-based learning process and tested alternative algorithms for the model-free and model-based learning systems. The task used in this study had a deterministic transition structure, and the degree of use of this structure in learning is estimated as the relative contribution of the model-based system to choices. We obtained data from 29 participants and fitted them with various computational models that differ in the model-free and model-based assumptions. The results of model comparison and parameter estimation showed that the participants update the value of action sequences and not each action. Additionally, the model fit was improved substantially by assuming that the learning mechanism includes a forgetting process, where the values of unselected options change to a certain default value over time. We also examined the relationships between the estimated parameters and psychopathology and other traits measured by self-reported questionnaires, and the results suggested that the difference in model assumptions can change the conclusion. In particular, inclusion of the forgetting process in the computational models had a strong impact on estimation of the weighting parameter of the model-free and model-based systems.

Computational models are tools used to understand decision-making processes. One successful model designed for this purpose was developed by Daw et al. (

The widely used computational model for the two-step decision task (Daw et al.,

In this study, we used the two-step decision task developed by Kool et al. (^{1}^{2}

In the Kool two-step task, participants are required to choose an action (i.e., choose a rocket) in the first stage, which is followed by a second-stage state (a screen with an alien) and a reward outcome (

The two-step task used in the experiment

Using this task, we examined several reinforcement learning (RL) models to express the integrated algorithm of the model-free and model-based learning systems. The assumptions that we examined in the new models are inspired by psychological considerations. First, we considered cognitive savings regarding the values to be updated during learning. A deterministic transition is a special case of a stochastic one; however, if the transition is deterministic, we do not need to discriminate successive actions. The algorithm of the model-free system is typically the SARSA (state-action-reward-state-action) temporal-difference (TD) learning model (Rummery and Niranjan,

A conceptual framework for the examined assumptions.

When a computational model uses the above parsimonious computational algorithm, a typical model-based system cannot be applied because the typically used model-based system is a

In addition, regarding the model-free system, we applied the concept of memory decay in the model-free part of RL. In the standard TD learning algorithm, the values of the unselected options are assumed to remain unchanged (

Overall, the examined computational models have some or all of these assumptions. These models were compared using data from the Kool two-step task. In addition, to test the effect of model construction on the parameter estimates, we compared the computational models in terms of the relationship between the estimated parameter values and subjects' scores on questionnaires regarding obsessive tendencies, impulsivity, and other psychological features.

Thirty-four undergraduate students at Nagoya University participated in the experiment. The data from two participants were excluded because these participants were unable to complete the training session by themselves due to a misunderstanding of the instructions, and three participants were excluded because they met the exclusion criteria (see section Exclusion criteria). Thus, the data from the remaining 29 participants were analyzed (13 males, 16 females; age

Participants were seated ~50 cm in front of a 21.5-inch iiyama ProLite monitor with a screen resolution of 1920 × 1080 pixels and a refresh rate of 60 Hz. Instructions and stimuli were presented using the computer program Inquisit 5 Lab (2016) by Millisecond Software in Seattle, Washington.

The task procedure was almost the same as the two-step task originally proposed by Kool et al. (

Each participant completed 253 trials, which were divided into two blocks separated by 30-s breaks. Each trial consisted of two stages. In the first stage, the participants were required to select one of two rockets (downloaded from Freepik.com) by pressing the F key for the left rocket or the J key for the right rocket within 2.5 s. This stage was characterized by one of two states: state A always included rockets 1 and 2, and state B included rockets 3 and 4. The subsequent second-stage state was based on the first-stage choice. Rockets 1 and 3 were always followed by state C in the second stage, and rockets 2 and 4 were always followed by state D in the second stage. In the second stage, each state included one unique alien (downloaded from pngtree.com). The participants were required to press the space bar within 1.5 s to obtain a reward from the alien. Each alien produced a reward feedback value ranging from 1 to 9. These feedback values changed slowly over the course of the task according to a Gaussian random walk (mean = 0, σ = 0.025) with bounds of 0.25 and 0.75 and were displayed as integers on the screen. Auditory stimuli were played when participants made a choice (bell sound) and when they obtained a reward (money sound).
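As an illustration of this reward schedule, the drift of one alien's latent reward value can be simulated as a bounded Gaussian random walk. This is a sketch: whether the original implementation clips or reflects at the bounds, and how the latent value maps onto the displayed 1–9 points, are not specified here, so both choices below are assumptions.

```python
import random

def reward_walk(n_trials=253, sigma=0.025, lo=0.25, hi=0.75, seed=1):
    """Bounded Gaussian random walk for one alien's latent reward value."""
    rng = random.Random(seed)
    p = rng.uniform(lo, hi)              # arbitrary starting value within the bounds
    values = []
    for _ in range(n_trials):
        p += rng.gauss(0.0, sigma)       # Gaussian step (mean 0, sigma 0.025)
        p = min(hi, max(lo, p))          # clip at the stated bounds (reflection is another option)
        values.append(p)
    return values

walk = reward_walk()                     # one alien's 253-trial trajectory
```

Running four independent walks of this kind, one per alien, yields slowly drifting, mutually independent reward values as described above.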

At each stage, if no response was made within the time limits, a message reading, “Too late!!” was presented, and the participants proceeded to the next trial.

Before the task, the participants were informed that the positions of the rockets and the response speed within time limits would have no relationship with subsequent feedback or the total experimental time and that the choice of rockets is only related to the transition to the second-stage states. The participants were also repeatedly told that each rocket in each first stage was connected decisively with one of the two aliens in the second stage and that the reward from each alien would change slowly and independently over time depending on these aliens' moods within the range from 1 to 9. Thus, the participants were informed that they would obtain greater rewards by focusing on the moods of each alien. The participants were also informed in advance that they could receive additional monetary rewards along with their total earned points in this task. Specifically, the participants were paid ¥1,000, with an additional monetary reward of either ¥300 (if they earned more than 1,300 points) or ¥200 (if they earned fewer than 1,300 points).

The participants also completed a training session to learn the structure of the task in advance. In this session, they were first required to repeatedly choose the rockets connected with one of the two aliens, without time limits or feedback; once they succeeded on more than five consecutive trials for each alien, they completed 18 further training trials with time limits and feedback. The stimuli used in the training session were completely different from those used in the real task.

The reward probabilities were the same for all participants, but the order of the first-stage state during the task was deliberately controlled in advance, and each participant was allocated to one of four sequences (see

In the analyses, we excluded the data from uncompleted trials (i.e., those in which the choice was not made within 2.5 s) and the data from trials in which the response time was <120 ms, which were considered anticipated responses that did not reflect the stimulus types. Two participants who had more than 20% of their trials omitted based on these criteria were excluded. In addition, we excluded one participant who chose the same rocket in each first-stage state in more than 90% of the trials. Thus, the data of 29 participants were used for the subsequent analyses (rate of excluded trials: max 8%, mean 1%).

After the two-step task, the participants completed the Japanese versions of several questionnaires. OCD tendencies were assessed using the Obsessive-Compulsive Inventory (OCI) [Foa et al.,

We first describe two basic models (the parallel model and the parsimonious learning-rate adjustment model,

A schematic of value updating in the parallel model

For data from the Kool two-step task, a computational model developed by Daw et al. (

The model-free learning system uses a SARSA(λ) TD learning rule (Rummery and Niranjan, 1994) to update the state-action values Q_{MF}(s_{i,t}, a_{i,t}) at each stage i of trial t (states s_{A} and s_{B} for the first-stage state s_{1,t}, and s_{C} and s_{D} for the second-stage state s_{2,t}). In each of the first-stage states, two actions are available, and a_{i,t} ∈ {a_{1}, a_{2}} denotes the selected action. In the second-stage state, only one action is available. In both stages, the selected state-action value is updated as follows:
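In standard SARSA form, with δ_{i,t} denoting the reward prediction error, this update can be written as follows; the conventions that r_{1,t} = 0 (reward arrives only at the second stage) and that the value of the state following the second stage is 0 are assumptions consistent with the surrounding text.

```latex
Q_{MF}(s_{i,t}, a_{i,t}) \leftarrow Q_{MF}(s_{i,t}, a_{i,t}) + \alpha_{L}\,\delta_{i,t},
\qquad
\delta_{i,t} = r_{i,t} + Q_{MF}(s_{i+1,t}, a_{i+1,t}) - Q_{MF}(s_{i,t}, a_{i,t})
```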

where 0 ≤ α_{L} ≤ 1 is the learning rate parameter and 0 ≤ r_{i,t} ≤ 1 denotes the reward in trial t.

The second-stage reward prediction error (RPE), which reflects the difference between the expected and actual reward, also updates the first-stage value but is downweighted by the eligibility trace decay parameter λ as follows:
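Written in the same notation, this eligibility-trace update is presumably of the standard form:

```latex
Q_{MF}(s_{1,t}, a_{1,t}) \leftarrow Q_{MF}(s_{1,t}, a_{1,t}) + \alpha_{L}\,\lambda\,\delta_{2,t}
```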

where λ denotes the trace decay parameter that modulates the magnitude of the effect of the second-stage RPE on the first-stage value. This type of updating is called the eligibility trace rule and enables efficient value updating (Sutton and Barto,
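The model-free updates above can be combined into a single per-trial routine. This is a sketch rather than the authors' code; the dictionary layout, the single second-stage action index 0, and the illustrative parameter values are assumptions.

```python
def sarsa_lambda_trial(q_mf, s1, a1, s2, r, alpha_l=0.5, lam=0.6):
    """One trial of model-free SARSA(lambda) updating for the two-stage task.

    q_mf maps (state, action) -> value; the second stage has a single action 0.
    alpha_l and lam are illustrative values (fitted per participant in the study).
    """
    # Stage-1 prediction error: backup from the second-stage value (no stage-1 reward).
    delta1 = q_mf[(s2, 0)] - q_mf[(s1, a1)]
    q_mf[(s1, a1)] += alpha_l * delta1
    # Stage-2 reward prediction error against the obtained reward.
    delta2 = r - q_mf[(s2, 0)]
    q_mf[(s2, 0)] += alpha_l * delta2
    # Eligibility trace: the stage-2 RPE also updates the stage-1 value, scaled by lambda.
    q_mf[(s1, a1)] += alpha_l * lam * delta2
    return q_mf
```

For example, starting from all-zero values and observing reward 1.0 after choosing action 0 in state "A" (leading to "C") moves both the second-stage and, via the trace, the first-stage value toward the outcome.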

The model-based values, Q_{MB}, for each action are defined by the Bellman optimality equation. In short, an option value is computed anew each time as a sum of the maximum values of the possible subsequent state-action values weighted by the transition probabilities for the respective states. The transition probability determines this weight. Thus, model-based values Q_{MB}(s_{j}, a_{k}), where s_{j} ∈ {s_{A}, s_{B}, s_{C}, s_{D}}, and a_{k} ∈ {a_{1}, a_{2}} in the first stage and a_{k} = a_{1} in the second stage, are calculated as follows:
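A sketch of this computation in standard Bellman form, assuming T(s′ | s_j, a_k) denotes the transition probability and the subsequent state-action values are the model-free estimates:

```latex
Q_{MB}(s_{j}, a_{k}) = \sum_{s'} T(s' \mid s_{j}, a_{k}) \, \max_{a'} Q_{MF}(s', a')
```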

Here, T(s_{j}, a_{k}) is a transition-probability function representing the probability of moving to the corresponding second-stage state when action a_{k} is taken in state s_{j}. T(s_{j}, a_{k}) = 1 when

Finally, Q_{MF} and Q_{MB} are integrated to generate a net value for choice with a model-based weighting parameter 0 ≤ w ≤ 1, as follows:
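Writing the weighting parameter as w (the symbol commonly used for this mixture in two-step-task models; the original symbol did not survive in this copy), the combination takes the standard form:

```latex
Q_{NET}(s_{1,t}, a_{k}) = w \, Q_{MB}(s_{1,t}, a_{k}) + (1 - w) \, Q_{MF}(s_{1,t}, a_{k}),
\qquad 0 \le w \le 1
```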

The second-stage Q_{NET} values are equal to both Q_{MB} and Q_{MF}.

These net values determine the first-stage probability of choosing action a_{1,t}, P(a_{1,t} = a | s_{1,t}), as follows:
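A standard softmax form consistent with the parameter descriptions below, where rep(a) indicates whether a repeats the previous first-stage choice and resp(a) indicates whether a repeats the previous key press; whether π and ρ are scaled by β varies across formulations and is an assumption here:

```latex
P(a_{1,t} = a \mid s_{1,t}) =
\frac{\exp\big(\beta\, Q_{NET}(s_{1,t}, a) + \pi\, \mathrm{rep}(a) + \rho\, \mathrm{resp}(a)\big)}
     {\sum_{a'} \exp\big(\beta\, Q_{NET}(s_{1,t}, a') + \pi\, \mathrm{rep}(a') + \rho\, \mathrm{resp}(a')\big)}
```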

Here, three free parameters represent particular propensities in the choice process: β, often called inverse temperature, adjusts how sharply the value difference between options is reflected in the choice probability; π determines the degree of perseveration in the same option; and ρ expresses the degree of key-response stickiness.

Among these parameters, β is usually included in any RL model. In the two-step tasks, Daw et al. (

As another framework, we propose a parsimonious computational model applying cognitive savings of the values to be updated (

In this model, a deterministic action sequence followed by a choice is the unit for valuation, and only the action values in the choice stage are updated. In the Kool two-step task, these values correspond to the values of the first-stage rockets: Q(s_{A}, a_{1}), Q(s_{A}, a_{2}), Q(s_{B}, a_{1}), and Q(s_{B}, a_{2}). Here, choosing a_{1} deterministically leads to s_{C}, and choosing a_{2} leads to s_{D}. We use

The pure model-free value calculation ends here. If the backward-looking model-based system works, then the state-action pair in the other first-stage state that leads to the same second-stage state as a_{1,t} is also updated as follows:

Here, the weight of model-based updating is adjusted by a model-based weighting parameter 0 ≤ w ≤ 1.
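The LA updates described above can be sketched as follows. The pairing of action indices across the two first-stage states follows the task mapping (rockets 1 and 3 lead to one alien, rockets 2 and 4 to the other); whether the paired value is updated with its own prediction error, as here, or with the chosen option's error is an assumption.

```python
def la_update(q, state, action, r, alpha_l=1.0, w=0.7):
    """Sketch of the learning-rate adjustment (LA) model update (assumed form).

    q holds first-stage action-sequence values keyed by (state, action).
    Action k in state "A" and action k in state "B" lead to the same
    second-stage alien, so the backward-looking model-based step also
    writes the outcome to the paired sequence, scaled by the weight w.
    """
    # Model-free update of the chosen action sequence toward the reward.
    q[(state, action)] += alpha_l * (r - q[(state, action)])
    # Backward model-based update of the paired sequence in the other state.
    other = "B" if state == "A" else "A"
    q[(other, action)] += w * alpha_l * (r - q[(other, action)])
    return q
```

With w = 0, the update is purely model-free; with w = 1, the paired sequence is updated as strongly as the chosen one.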

The LA model obviously has simpler calculations than the P model.

This process is identical to that introduced in the P model (Equation 7).

The values of unselected actions (including the actions of the unvisited state) are not updated in typical RL. However, these values can naturally be considered to decay through a forgetting process. The following equation is one algorithm for this process: the values of unselected actions are updated as follows in each step when a selected action is updated by Equations 2, 3 in the parallel model and by Equation 8 in the LA model:
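Given the parameter description that follows, this decay step presumably takes the form:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha_{F} \,\big(\mu - Q(s, a)\big)
\quad \text{for every unselected pair } (s, a)
```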

where 0 ≤ α_{F} ≤ 1 is the forgetting rate parameter and 0 ≤ μ ≤ 1 is the default-value parameter toward which the values of unselected options are regressed. We compared a model without a forgetting process (α_{F} = 0) and three types of models with a forgetting process: the first model assumes that the values of unselected options gradually approach zero (where α_{F} is a free parameter and μ = 0), the second model assumes that they approach 0.5, which corresponds to the least biased value (where α_{F} is a free parameter and μ = 0.5), and the third model assumes that people have their own default value to which the values approach (where both α_{F} and μ are free parameters).
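A minimal sketch of this forgetting step, assuming q holds all state-action values and selected is the key of the pair that was just updated by the learning rule:

```python
def apply_forgetting(q, selected, alpha_f=0.2, mu=0.5):
    """Regress every unselected option value toward the default value mu."""
    for key in q:
        if key != selected:
            q[key] += alpha_f * (mu - q[key])   # exponential decay toward mu
    return q
```

Setting alpha_f = 0 recovers the standard no-forgetting model, and mu = 0 or mu = 0.5 gives the two fixed-default variants described above.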

We used the R function “solnp” in the Rsolnp package (Ghalanos and Theuss,

where

The negative LL and AIC of each participant were calculated for each model and were summed over all participants (
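Per participant, the AIC follows from the negative log-likelihood and the number of free parameters, and summing per-participant AICs is equivalent to doubling the summed negative LL and adding 2k per participant. The totals reproduce the table's SARSA(λ) TD row exactly; the other rows match to within 1, presumably because the summed negative LLs are rounded.

```python
def aic(neg_ll, k):
    """Akaike information criterion: twice the negative log-likelihood plus 2k."""
    return 2 * neg_ll + 2 * k

# Summed over 29 participants, a 5-parameter model carries a 2 * 5 * 29 penalty in total.
total_aic = aic(4661, 5 * 29)   # summed negative LL of 4,661 -> 9,612
```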

Information concerning the models compared on the basis of their fit to the choices of 29 participants.

| Model | Learning system | Forgetting process | Default value (μ) | Free parameters | No. of free parameters | Negative LL | AIC |
|---|---|---|---|---|---|---|---|
| SARSA (λ) TD | Model-free | – | – | α_{L}, β, π, ρ, λ | 5 | 4,661 | 9,612 |
| P | Parallel | – | – | α_{L}, β, π, ρ, λ, w | 6 | 3,435 | 7,219 |
| P-F0 | Parallel | o | Fixed (μ = 0) | α_{L}, β, π, ρ, λ, w, α_{F} | 7 | 3,284 | 6,974 |
| P-F05 | Parallel | o | Fixed (μ = 0.5) | α_{L}, β, π, ρ, λ, w, α_{F} | 7 | 3,048 | 6,503 |
| P-FD | Parallel | o | o | α_{L}, β, π, ρ, λ, w, α_{F}, μ | 8 | 3,024 | 6,511 |
| LA | Learning-rate adjustment | – | – | α_{L}, β, π, ρ, w | 5 | 3,447 | 7,184 |
| LA-F0 | Learning-rate adjustment | o | Fixed (μ = 0) | α_{L}, β, π, ρ, w, α_{F} | 6 | 3,292 | 6,931 |
| LA-F05 | Learning-rate adjustment | o | Fixed (μ = 0.5) | α_{L}, β, π, ρ, w, α_{F} | 6 | 3,055 | 6,457 |
| LA-FD | Learning-rate adjustment | o | o | α_{L}, β, π, ρ, w, α_{F}, μ | 7 | 3,032 | 6,469 |

We also examined reduced parallel models in which α_{L} and λ were fixed at 1 and reduced LA models in which α_{L} was fixed at 1. These reduced models showed lower AIC values. If α_{L} and λ are set to one, then the parallel models have a similar structure to the LA models in which α_{L} is set to one: the second-stage state values are equal to the last piece of feedback if α_{L} = 1, and the last piece of feedback is directly reflected in the first-stage value because λ = 1. Thus, such specific parallel models behave similarly to the LA models, which do not distinguish the first-stage state-action value and the following second-stage state-action value. Note that these results support the parsimonious updating assumed in the LA models but provide no information on the comparison between the forward-looking and the backward-looking model-based systems.

Estimated parameter values (25th and 75th percentiles across participants).

| Model | Percentile | α_{L} | β | w | π | ρ | λ | α_{F} | μ |
|---|---|---|---|---|---|---|---|---|---|
| P | 25th | 0.98 | 3.34 | 0.70 | −0.45 | −0.33 | 0.17 | – | – |
| P | 75th | 1.00 | 5.39 | 1.00 | 0.24 | −0.04 | 1.00 | – | – |
| LA | 25th | 1.00 | 3.19 | 0.71 | −0.42 | −0.33 | – | – | – |
| LA | 75th | 1.00 | 5.37 | 1.00 | 0.26 | −0.05 | – | – | – |
| P-F05 | 25th | 0.86 | 5.37 | 0.49 | 0.16 | −0.31 | 0.88 | 0.21 | Fixed (0.5) |
| P-F05 | 75th | 1.00 | 10.53 | 0.74 | 0.61 | −0.06 | 1.00 | 0.49 | Fixed (0.5) |
| LA-F05 | 25th | 0.88 | 5.69 | 0.43 | 0.21 | −0.32 | – | 0.23 | Fixed (0.5) |
| LA-F05 | 75th | 1.00 | 10.43 | 0.76 | 0.62 | −0.04 | – | 0.52 | Fixed (0.5) |
| P-FD | 25th | 0.84 | 5.24 | 0.51 | 0.16 | −0.32 | 0.89 | 0.21 | 0.46 |
| P-FD | 75th | 1.00 | 10.09 | 0.75 | 0.95 | −0.06 | 1.00 | 0.45 | 0.62 |
| LA-FD | 25th | 0.84 | 5.30 | 0.44 | 0.16 | −0.32 | – | 0.23 | 0.47 |
| LA-FD | 75th | 1.00 | 11.02 | 0.80 | 1.00 | −0.06 | – | 0.47 | 0.62 |

In the full models, the AIC values were lower in the LA models than in the parallel models: the LA model was favored over the P model, the LA-F05 model was favored over the P-F05 model, and the LA-FD model was favored over the P-FD model (favored by more than 20 of 29 participants in each comparison;

Based on this result, the higher AIC values of the parallel models relative to the LA models among the full models are attributable to the redundant free parameters in the parallel models, and the difference in the model-based system (parallel vs. LA) is not critical for improving the fit.

In both the comparisons among the full models and those among the reduced models, the models with forgetting processes were favored. Here, we show only the results for the full models, but similar results were obtained for the reduced models (

Most participants showed reduced AIC values in the LA-FD model vs. the LA model [t_{(28)} = −6.30, t_{(28)} = −6.24,

Model comparison by differences in the Akaike information criterion (AIC) scores in the parsimonious learning-rate adjustment models (LA models). The AIC scores of the LA models were compared. One of the models has no forgetting process (LA), and the other three have a forgetting rate parameter for the forgetting process and either a free default-value parameter (LA-FD), a fixed default value of 0 (LA-F0), or a default value of 0.5 (LA-F05).

Model comparison by differences in the Akaike information criterion (AIC) scores in the parallel models (P models). The AIC scores of the P models were compared. One of the models has no forgetting process (P), and the other three have a forgetting rate parameter for the forgetting process and either a free default-value parameter (P-FD), a fixed default value of 0 (P-F0), or a default value of 0.5 (P-F05).

Among the models with forgetting processes, treating the default value as a free parameter was preferred over fixing it at 0, with lower AIC values in the LA-FD model than in the LA-F0 model for 28 of 29 participants [t_{(28)} = −6.60, t_{(28)} = −6.37,

In the previous section, we reported that the model fits were improved by using the reduced models: the LA models in which α_{L} was fixed and the P models in which α_{L} and λ were fixed. To assess the influence of fixing these parameters on the estimation of the weighting parameter

First, we examined differences in the basic models with respect to the estimations of the weighting parameter w (R^{2} = 0.76 and R^{2} = 0.84). Some estimation differences emerged between the P models and the LA models, but the estimated regression slopes were close to 1. We also examined the influence of the forgetting processes on the estimations of the weighting parameter (R^{2} = 0.41 and R^{2} = 0.27). The regression analyses revealed that the models with forgetting processes had lower estimated weighting parameter values.

The correspondence of the estimated weighting parameter w between models, with the coefficient of determination (R^{2}), regression line intercept, and regression line slope. Red lines indicate linear regression lines. The data on the black lines indicate complete correspondence between the estimations by the two models.

The analyses in this section were conducted to understand the characteristics of the model parameters. As observed previously, the P-F05 and LA-F05 models showed lower AIC values than the P-FD and LA-FD models, respectively, although no significant differences were noted. However, in this section, we mainly used the parameters estimated by the P-FD and LA-FD models to avoid possible estimation biases of

The computational models were developed supposing that the weighting parameter

To confirm this prediction, we focused on sensitivity to the outcome experienced in the previous trial. Generally, people revisit the state that recently produced high rewards. This pattern was also evident in our data [t_{(28)} = −5.32,

Those who can use the “model” should show a similar sensitivity to the previous outcome (SPO) in both the MF and MB trials, whereas those who cannot use the “model” should exhibit a higher SPO in the MF trials than in the MB trials. Therefore, if the parameter
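One simple way to quantify this sensitivity is to compare the probability of revisiting the previous second-stage state after high vs. low rewards; splitting trials by MF/MB type would then give the comparison described above. This is a simplified, assumed operationalization, not necessarily the paper's exact definition.

```python
def stay_by_previous_outcome(second_states, rewards, threshold=5):
    """Proportion of trials revisiting the previous second-stage state,
    split by whether the previous reward was high (>= threshold) or low.
    """
    high, low = [], []
    for t in range(1, len(rewards)):
        stay = int(second_states[t] == second_states[t - 1])   # revisited the same alien?
        (high if rewards[t - 1] >= threshold else low).append(stay)
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(high), mean(low)
```

A positive difference between the two returned proportions indicates sensitivity to the previous outcome; computing it separately for MF and MB trials yields the SPO contrast discussed here.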

As a reference, we conducted the same analyses for the

We next examined which parameter correlates with the total value of rewards obtained. Kool et al. (

In addition, many other factors may be related to total rewards other than α_{F} when using the models with forgetting processes (P-FD:

Previous studies have reported a negative association between the weighting parameter and psychopathology, especially obsessive-compulsivity (Voon et al.,

Associations of estimated parameter values with psychopathology and other traits.

[Table: correlation coefficients between the estimated parameter values (α_{L}, β, w, π, ρ, α_{F}, μ) and the questionnaire scores, shown separately for the P, LA, P-FD, and LA-FD models.]

This change in the correlations can be explained as follows: (1) the reduction in the AIC values by using the P-FD or LA-FD model instead of the P or LA model showed a marginally significant negative correlation with OCI scores (

When the models with a forgetting process were used for model fitting, the weighting parameter _{L}

The current correlation results may not be generalizable because of the small number of participants and the restricted population. However, the results showed how the differences between the computational models greatly affect the parameter estimates and their relationships with other indices.

We compared several computational models for data from a two-step task with a deterministic transition structure (Kool et al.,

Model comparisons and estimated parameters supported a learning process including parsimonious value computation, which assumes that values are updated for deterministic state-action sequences but not for every state-action pair. Such computational savings are useful in a real environment because computing every action value would require too many resources. Consider buying a canned coffee from a vending machine. You first decide which coffee you will buy, insert a coin, push the appropriate button, take the coffee out of the bottom box, open it, and taste it. These processes can be divided infinitely, but if the taste of the obtained coffee is not good, you will reevaluate only the first choice. An action sequence becomes automatic or habitual with repetition, and until the sequence is interrupted, individual actions do not need to be evaluated. Therefore, in the task with the deterministic transition structure, the deterministic action sequence can be regarded as a unit for value computation, and this view is supported by our results. Previous studies have also shown that an action sequence is used in the learning process (Dezfouli and Balleine,

Interestingly, parameter estimation revealed that the choices in the current task were based on a more heuristic learning process; that is, the participants seemed to have recorded only the last outcome of each choice, as presumed from our finding that the estimated learning rate (α_{L}) was almost 1 for most of the participants (α_{L} = 0.82). Multiple potential reasons may explain the high learning rates in our data. First, the participants had sufficient time to be affected by the last outcome because the time limit for making a choice was longer than the periods that are ordinarily used [e.g., Kool et al. (

Histogram of the estimated parameter values of 29 participants.

In our previous study (Toyama et al.,

In the current study, the models with the default value fixed at 0.5 showed the lowest AIC. Considering that the expected outcome was 0.46 under random choice in our task, fixing the default value at 0.5 for all participants was reasonable, although the models including the default value as a free parameter also showed good fits to the data, and variance in μ was observed among the participants (

Situations in which the forgetting process can affect the learning process are easy to conceptualize. For example, cognition regarding the task condition can affect the forgetting process. In a situation where the reward outcomes change frequently, the expectation for unselected options also becomes uncertain quickly, and the agent may change options often (expressed as a high forgetting rate). On the other hand, in a situation where the reward outcomes are stable, the expectation for unselected options is also stable (expressed as a low forgetting rate). Individual trait differences can also affect the forgetting process. For example, the difference between optimistic and pessimistic outlooks may be expressed as an individual difference in default values. Thus, the computational model with a forgetting process is expected to provide new insights in research related to value-based decision-making.

In this study, we could not determine which type of model-based system was used: the forward-looking or the backward-looking model-based system. This ambiguity emerged because a specific situation occurred (i.e., the estimated α_{L} and λ were almost one), implying that the current task was not appropriate to clarify which type of model-based system was used, and future studies using a proper task design that reflects the advantages and disadvantages of the two model-based systems are required. However, based on our previous work (Toyama et al.,

The interpretation of the weighting parameter

Previous studies have repeatedly found reduced use of the model-based system, which is defined by a low weighting parameter value, in OCD patients (Voon et al.,

On the other hand, this correlation almost disappeared when the parameter

When using the models with forgetting processes, we found positive relationships of the weighting parameter and the forgetting rate α_{F} with depression or stress. Of course, considering the small sample size of this study, future studies are required to assess whether these models can provide useful predictive parameters. For example, the models must be applied to a large dataset for correlation analyses with clinical metrics and cognitive function. The model parameters must also be confirmed to be sufficiently recovered using various simulated behaviors (Palminteri et al.,

Although some future challenges remain, model fitting was notably improved for most of the participants by assuming a forgetting process, and as a result, the relationships between the parameters and self-reported psychopathology changed. If a data characteristic cannot be captured by a model, the model must still express that characteristic through its parameters, which sometimes leads to misinterpretation of the parameters (Katahira,

The current study showed that the participants' choice data favored the models with parsimonious computation, which assumes that values are updated for action sequences, and with a forgetting process, which assumes memory decay for unselected option values. Additionally, we confirmed that the estimated model-based weighting parameter could capture individual differences in “model use.” To date, however, most learning models do not incorporate psychological aspects such as cognitive savings and memory decay. Thus, research using the proposed model will prompt re-evaluation of how the features of the learning process correlate with psychopathology or abnormal decision-making and will enrich the study of the theory and neural basis of learning processes.

This study was carried out in accordance with the recommendations of the ethical committee of Nagoya University with written informed consent from all participants. All participants gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the ethical committee of Nagoya University. All participants were healthy Nagoya University students.

AT collected and analyzed the data and prepared the draft. KK and HO reviewed the manuscript critically and provided important intellectual input. All authors contributed to the design of the work as well as the interpretation of the results.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at:

^{1}For example, participants make two successive choices in the Daw two-step task but only one choice in the Kool two-step task. There are four options with different reward outcomes in the second stage of the Daw two-step task and two options with different reward outcomes in the second stage of the Kool two-step task. In addition, the current condition of each option is easy to assess in the Kool task, which uses gradually changing integer point feedback, whereas the Daw two-step task uses binary feedback based on hidden probabilities.

^{2}For example, considering that the final stage in the Daw two-step task has multiple options, the participants may intend to visit the same final state after they are not rewarded in that state because they can try the option that they did not choose in the previous trial. However, this strategy is not included in the existing computational models; thus, this choice behavior is sometimes regarded as a model-free strategy.