
Edited by: Hyeng Keun Koo, Ajou University, South Korea

Reviewed by: Simon Grima, University of Malta, Malta; Fabrizio Maturo, Università degli Studi G. d'Annunzio Chieti e Pescara (UNICH), Italy

This article was submitted to Mathematical Finance, a section of the journal Frontiers in Applied Mathematics and Statistics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The Ellsberg paradox in decision theory posits that people will inevitably choose a known probability of winning over an unknown probability of winning, even if the known probability is low.

Recently, neuroeconomics has been developing into an increasingly important academic discipline that helps to explain human behavior. The Ellsberg paradox is a crucial topic in neuroeconomics, and researchers have employed various theories to approach and resolve it. The basic concept behind the Ellsberg paradox is that people will always choose a known probability of winning over an unknown probability of winning, even if the known probability is low and the unknown probability could be a near guarantee of winning.

Let us start with an example. Suppose we have an urn that contains 30 red balls and 60 other balls that are either black or yellow. We do not know how many black or yellow balls there are, but we know that the total number of black balls plus the total number of yellow balls equals 60. The balls are well mixed so that each individual ball is as likely to be drawn as any other.

You are now given a choice between two gambles:

[Gamble A] You receive $100 if you draw a red ball.

[Gamble B] You receive $100 if you draw a black ball.

In addition, you are given the choice between these two gambles (about a different draw from the same urn):

[Gamble C] You receive $100 if you draw a red or yellow ball.

[Gamble D] You receive $100 if you draw a black or yellow ball.

Participants are tempted to choose [Gamble A] and [Gamble D]. However, these choices violate the postulates of subjective expected utility theory: preferring [Gamble A] to [Gamble B] implies that the subjective probability of drawing a red ball exceeds that of drawing a black ball, whereas preferring [Gamble D] to [Gamble C] implies the reverse, so no single assignment of subjective probabilities is consistent with both choices.
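The inconsistency can be verified by brute force. The sketch below (ours, not from the paper, assuming the standard Ellsberg payoff of $100 if your event occurs and $0 otherwise) enumerates every possible urn composition and checks whether any belief about the number of black balls supports both popular choices.

```python
# Brute-force check (illustrative, standard $100/$0 Ellsberg payoffs assumed):
# is there any urn composition whose subjective probabilities make
# Gamble A preferable to B AND Gamble D preferable to C?

p_red = 30 / 90  # the proportion of red balls is known

consistent = []
for n_black in range(61):            # black balls can number 0..60
    p_black = n_black / 90
    p_yellow = (60 - n_black) / 90
    prefers_A = 100 * p_red > 100 * p_black                            # A over B
    prefers_D = 100 * (p_black + p_yellow) > 100 * (p_red + p_yellow)  # D over C
    if prefers_A and prefers_D:
        consistent.append(n_black)

print(consistent)  # -> [] : no belief supports both choices
```

Preferring A requires fewer than 30 black balls, while preferring D requires more than 30, so the list is always empty.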

Ambiguity aversion is one of the prevailing theories advanced to explain this paradox: decision-makers systematically prefer options whose outcome probabilities are known. On the other hand, reinforcement learning algorithms such as UCB1-tuned actively explore options whose reward estimates are more uncertain; that is, they exhibit an ambiguity-preference property.

In this study, we took the multi-armed bandit problem (MAB) as a decision-making problem. We considered two slot machines, A and B, whose rewards have means μ_{A} (μ_{B}) and standard deviations σ_{A} (σ_{B}), respectively. The player decides which machine to play at each trial, trying to maximize the total reward obtained after repeating several trials. The MAB is used to determine the optimal strategy for finding the machine with the highest rewards as accurately and quickly as possible by referring to past experiences. The MAB is related to many application problems in diverse fields, such as communications (cognitive networks).

In this study, we focused on limited MAB cases. Machine A (B) generates rewards according to a probability density function (PDF) with mean μ_{A} (μ_{B}) and variance σ_{A}^{2} (σ_{B}^{2}). Here, we hypothesize that the total reward from probabilities generated by a PDF is the same as the total reward drawn directly from the same PDF, as long as we focus only on the average reward over 1,000 samples. On the basis of this hypothesis, we consider MABs where the PDFs are N(μ_{A}, σ^{2}) for machine A and δ(x − μ_{B}) for machine B. Here, δ(x) denotes the Dirac delta function, so machine B always pays the fixed reward μ_{B}; the "ambiguity" of machine A is expressed by σ.
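A minimal sketch of this reduced two-machine setting (names and parameter values are ours): machine A pays normally distributed rewards, while machine B's delta-function PDF is modeled as a fixed payout.

```python
import random

# Sketch of the reduced two-armed bandit (illustrative parameter values):
# machine "A" pays rewards drawn from N(mu_A, sigma^2)  -> ambiguous
# machine "B" always pays exactly mu_B (delta-function PDF) -> no ambiguity

def play(machine, mu_A=0.55, mu_B=0.50, sigma=0.20, rng=random):
    if machine == "A":
        return rng.gauss(mu_A, sigma)  # ambiguous machine
    return mu_B                        # ambiguity-free machine

random.seed(0)
rewards = [play("A") for _ in range(1000)]
avg = sum(rewards) / len(rewards)
print(round(avg, 2))  # close to mu_A over 1,000 samples
```

Averaging over 1,000 samples, as in the hypothesis above, recovers the mean of the generating PDF.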

The SOFTMAX algorithm is a well-known algorithm for solving MABs. At each time t, machine k is selected with probability

P_{k}(t) = exp(β Q_{k}(t)) / Σ_{j} exp(β Q_{j}(t)),

where Q_{k}(t) is the estimate of the mean reward of machine k at time t, computed from the rewards observed so far, and β is a parameter controlling the balance between exploration and exploitation.

β = 0 corresponds to a random selection and β → ∞ corresponds to a greedy action. The SOFTMAX algorithm is “ambiguity-neutral” because “ambiguity” σ is not used in the algorithm.

In the tug-of-war (TOW) dynamics, the machine with the larger displacement X_{k}(t) is played at each time t. The displacement X_{A} (= −X_{B}) is determined by the following equations:

X_{A}(t) = Q_{A}(t) − Q_{B}(t),
Q_{k}(t) = R_{k}(t) − ω N_{k}(t).

Here, R_{k}(t) is the cumulative reward obtained from machine k up to time t, N_{k}(t) is the number of times machine k has been played up to time t, and ω is a weight parameter that can be adjusted to the problem. The TOW dynamics is likewise "ambiguity-neutral" because the ambiguity σ does not appear in the algorithm.
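An illustrative sketch of TOW selection, under the assumption (ours, a simplification rather than the authors' exact update) that each machine's learning term is its accumulated reward minus a weight ω times its play count, and the machine with the larger term is played:

```python
import random

# Illustrative TOW sketch (assumed form): Q_k = (total reward from k)
# minus omega * (number of plays of k); the larger Q_k wins the "tug".

def tow_run(mu_A, mu_B, sigma, omega, steps=1000, rng=random):
    R = [0.0, 0.0]   # cumulative rewards per machine
    N = [0, 0]       # play counts per machine
    total = 0.0
    for _ in range(steps):
        Q = [R[k] - omega * N[k] for k in range(2)]
        k = 0 if Q[0] >= Q[1] else 1            # tug-of-war: larger Q plays
        reward = rng.gauss(mu_A, sigma) if k == 0 else mu_B
        R[k] += reward
        N[k] += 1
        total += reward
    return total / steps

random.seed(2)
avg_reward = tow_run(mu_A=0.6, mu_B=0.5, sigma=0.2, omega=0.55)
print(round(avg_reward, 2))
```

With ω set between the two means, the dynamics self-corrects: a machine whose average reward falls below ω loses the tug and the other machine is explored.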

In the UCB1-tuned algorithm, the machine with the larger "index" is played at each time. The algorithm proceeds as follows:

Initialization: Play each machine once.

Loop: Play machine j that maximizes the index

x̄_{j} + sqrt((ln n / n_{j}) · min(1/4, V_{j}(n_{j}))), with
V_{j}(n_{j}) = (1/n_{j}) Σ_{τ=1}^{n_{j}} x_{j,τ}^{2} − x̄_{j}^{2} + sqrt(2 ln n / n_{j}),

where n_{j} is the number of times machine j has been played so far, n is the overall number of plays done so far, x̄_{j} is the average reward obtained from machine j, and x_{j,τ} is the reward obtained in the τ-th play of machine j.
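The index can be sketched directly from Auer et al.'s formulation (names are ours). The second term grows with the estimated variance of a machine's rewards, which is why the index prefers ambiguous machines.

```python
import math

# UCB1-tuned index: mean reward plus a bonus that grows with the
# machine's estimated reward variance (capped at 1/4).

def ucb1_tuned_index(rewards_j, n):
    """rewards_j: rewards observed from machine j; n: total plays so far."""
    n_j = len(rewards_j)
    mean = sum(rewards_j) / n_j
    var = sum(x * x for x in rewards_j) / n_j - mean ** 2
    V = var + math.sqrt(2 * math.log(n) / n_j)
    return mean + math.sqrt((math.log(n) / n_j) * min(0.25, V))

# Two machines with the same empirical mean but different spread:
low_spread = [0.5] * 400
high_spread = [0.3, 0.7] * 200
print(ucb1_tuned_index(high_spread, 1000) > ucb1_tuned_index(low_spread, 1000))  # True
```

With equal means, the higher-variance (more ambiguous) machine gets the larger index, illustrating the ambiguity-preference property.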

In the modified UCB1-tuned algorithm, the machine with the larger "index" is likewise played at each time. Compared with the UCB1-tuned algorithm, the sign of the second term in the index becomes minus.

Initialization: Play each machine once.

Loop: Play machine j that maximizes the index

x̄_{j} − sqrt((ln n / n_{j}) · min(1/4, V_{j}(n_{j}))),

where V_{j}(n_{j}), n_{j}, n, and x̄_{j} are defined as in the UCB1-tuned algorithm.
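The sign flip is the only change. A self-contained sketch (names are ours, same assumed formulation as before):

```python
import math

# Modified UCB1-tuned index (ambiguity-aversion): identical to UCB1-tuned
# except the variance-dependent term is SUBTRACTED, penalizing machines
# whose rewards fluctuate more.

def modified_ucb1_tuned_index(rewards_j, n):
    n_j = len(rewards_j)
    mean = sum(rewards_j) / n_j
    var = sum(x * x for x in rewards_j) / n_j - mean ** 2
    V = var + math.sqrt(2 * math.log(n) / n_j)
    return mean - math.sqrt((math.log(n) / n_j) * min(0.25, V))  # minus sign

low_spread = [0.5] * 400
high_spread = [0.3, 0.7] * 200
# Now the ambiguous machine scores LOWER:
print(modified_ucb1_tuned_index(high_spread, 1000)
      < modified_ucb1_tuned_index(low_spread, 1000))  # True
```

With equal means, the higher-variance machine now gets the smaller index, the mirror image of the original rule.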

In this study, we focused on the following limited MAB cases. On the basis of the hypothesis stated above, we considered MABs where the PDF of machine A is N(μ_{A}, σ^{2}) and that of machine B is δ(x − μ_{B}). "Ambiguity" is expressed by σ.

For positive Δμ (the difference between the two mean rewards), we investigated 30 cases where Δμ = 0.00, 0.05, 0.10, 0.15, and 0.20 and σ = 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30. The figure below shows the results.

Figure: Performance comparison between the four learning algorithms for MABs whose PDFs are N(μ_{A}, σ^{2}) and δ(x − μ_{B}); Δμ is positive (cases where machine A is the better machine).

For positive Δμ cases, machine A, the ambiguous machine, is the one with the higher mean reward.

The performances of TOW and SOFTMAX are higher than those of the UCB1-tuned and modified UCB1-tuned algorithms because each of the former two algorithms has a parameter that was optimized for the problems; that is, the two algorithms have an advantage over the latter two, which have no such parameter. The performances of the former two (ambiguity-neutral) algorithms decrease slightly as the ambiguity σ of the problems increases. This is because incorrect decisions become slightly more frequent as the estimate of the mean reward fluctuates more strongly.

For negative Δμ, we also investigated 30 cases where |Δμ| = 0.00, 0.05, 0.10, 0.15, and 0.20 and σ = 0.05, 0.10, 0.15, 0.20, 0.25, and 0.30. The figure below shows the results.

Figure: Performance comparison between the four learning algorithms for MABs whose PDFs are N(μ_{A}, σ^{2}) and δ(x − μ_{B}); Δμ is negative (cases where machine B is the better machine).

For negative Δμ cases, machine B, the ambiguity-free machine, is the one with the higher mean reward.

The performances of TOW and SOFTMAX are again higher than those of the UCB1-tuned and modified UCB1-tuned algorithms because, as in the positive Δμ cases, each of the former two algorithms has a parameter that was optimized for the problems. The performances of the former two (ambiguity-neutral) algorithms also decrease slightly as the ambiguity σ of the problems increases, for the same reason as in the positive Δμ cases.

In both cases (positive and negative Δμ), the performance of the UCB1-tuned algorithm (ambiguity-preference) slightly increases as the ambiguity σ of the problems increases, whereas the performance of the modified UCB1-tuned algorithm (ambiguity-aversion) largely decreases as σ increases. This means that the ambiguity-aversion property of a learning algorithm contributes negatively to its performance on MABs, whereas the ambiguity-preference property contributes positively.

From these limited computer simulation results, we conclude that the ambiguity-aversion property does not support efficient decision-making from the learning point of view (repeated decision-making situations). Another point of view will be necessary to justify the ambiguity-aversion property. We suggest that the differences among learning algorithms warrant further study of the Ellsberg paradox and decision theory.

S-JK and TT designed research. S-JK performed computer simulations. S-JK and TT analyzed the data. S-JK wrote the manuscript. All authors reviewed the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to thank Prof. Masashi Aono and Dr. Makoto Naruse for fruitful discussions in an early stage of this work.

The Supplementary Material for this article can be found online at: