
This article was submitted to Cognition, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In a series of four experiments,


Recently, however,

In replication studies, it is essential to be able to quantify evidence in favor of the null hypothesis. In addition, it is desirable to collect data until a point has been proven or disproven. Neither desideratum can be accomplished within the framework of frequentist statistics, which is why our analysis of both experiments will focus on hypothesis testing using the Bayes factor (e.g.,

We sought to replicate ^{1} (cf.

A traditional frequentist analysis would start with an assessment of effect size followed by a power calculation that seeks to determine the number of participants that yields a specific probability for rejecting the null hypothesis when it is false. This frequentist analysis plan is needlessly constraining and potentially wasteful: the experiment cannot continue after the planned number of participants has been tested, and it cannot stop even when the data yield a compelling result earlier than expected (e.g.,

Based on the above considerations, our sampling plan was as follows: we planned to collect data from a minimum of 20 participants in each between-subject condition (i.e., the clockwise and counterclockwise condition, for a minimum of 40 participants in total). We were then planning to monitor the Bayes factor and stop the experiment whenever the critical hypothesis test (detailed below) reached a Bayes factor that can be considered “strong” evidence (
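As a sketch of how such a stopping rule can be monitored, the code below implements the logic of the sampling plan (a minimum of 20 and a maximum of 50 participants per group; stop once BF_{01} ≥ 10 or BF_{01} ≤ 1/10). To keep the example self-contained it uses the BIC approximation to the Bayes factor (Wagenmakers, 2007) rather than the default Bayes factor used in the actual analysis, and all function names are ours:

```python
import math

def bic_bf01(x, y):
    """Approximate BF01 for a two-group mean comparison via the BIC
    approximation BF01 ~ exp((BIC_H1 - BIC_H0) / 2) (Wagenmakers, 2007)."""
    n = len(x) + len(y)
    grand = (sum(x) + sum(y)) / n
    rss0 = sum((v - grand) ** 2 for v in x + y)  # H0: one common mean
    mx, my = sum(x) / len(x), sum(y) / len(y)
    rss1 = (sum((v - mx) ** 2 for v in x)
            + sum((v - my) ** 2 for v in y))     # H1: two separate means
    # BIC = n*ln(RSS/n) + k*ln(n); H1 has one extra mean parameter
    delta_bic = n * math.log(rss1 / rss0) + math.log(n)
    return math.exp(delta_bic / 2)

def run_until_strong_evidence(stream_cw, stream_ccw,
                              min_per_group=20, max_per_group=50,
                              threshold=10.0):
    """Monitor BF01 after each added pair of participants; stop once the
    evidence is 'strong' in either direction, or the maximum is reached."""
    x = list(stream_cw[:min_per_group])
    y = list(stream_ccw[:min_per_group])
    k = min_per_group
    while True:
        bf01 = bic_bf01(x, y)
        if bf01 >= threshold or bf01 <= 1 / threshold or k >= max_per_group:
            return bf01, k
        x.append(stream_cw[k])
        y.append(stream_ccw[k])
        k += 1
```

The monitoring itself requires no multiple-comparison correction: the Bayes factor is simply recomputed as each new observation arrives.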

We planned to exclude from analysis those participants who discerned the goal of the experiment (e.g., “the experiment is about how personality changes due to turning kitchen rolls clockwise or counterclockwise”). The intended analysis proceeds as in

Specifically, we planned to assess this hypothesis by means of a default Bayes factor for an unpaired, one-sided t-test.
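For concreteness, the default Bayes factor for an unpaired t-test can be sketched as follows. This is a minimal implementation of the two-sided JZS Bayes factor (Rouder et al., 2009) with a Cauchy(0, r) prior on effect size; the function name and the default r = √2/2 (the common "medium" prior width) are our choices, and the preregistered test was the one-sided variant:

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, nx, ny, r=np.sqrt(2) / 2):
    """Default JZS Bayes factor BF10 for an unpaired t-test
    (Rouder et al., 2009), Cauchy(0, r) prior on the effect size."""
    N = nx * ny / (nx + ny)   # effective sample size
    nu = nx + ny - 2          # degrees of freedom
    def integrand(g):
        # marginal likelihood under H1, averaging over the prior on g
        return ((1 + N * g) ** -0.5
                * (1 + t ** 2 / ((1 + N * g) * nu)) ** (-(nu + 1) / 2)
                * (r ** 2 / 2) ** 0.5 / np.sqrt(np.pi)
                * g ** -1.5 * np.exp(-r ** 2 / (2 * g)))
    numerator, _ = integrate.quad(integrand, 0, np.inf)
    denominator = (1 + t ** 2 / nu) ** (-(nu + 1) / 2)  # likelihood under H0
    return numerator / denominator
```

BF_{01} is simply the reciprocal, 1 / BF_{10}; rerunning the function with smaller or larger values of r amounts to the robustness check over prior widths described below.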

We deviated from the preregistration document in three aspects. The first aspect is that the preregistration document specified that we would recruit “Psychology students from the University of Amsterdam.”^{2} It turned out that this website is open to all UvA students (and not only psychology students, as we had initially assumed). Therefore, students from other academic fields also participated in the study. In addition, seven participants made an appointment on site and were not students. As a result, our sample is more diverse than our initial sampling plan, which included only UvA psychology students. There does not appear to be a compelling reason why the slightly more heterogeneous sample should materially change the outcome of the experiment, and hence we analyzed the data of all participants.

The second aspect in which we deviated from our protocol concerns the stopping rule; we planned to stop collecting data after obtaining a Bayes factor of 10 in favor of the null hypothesis, or 10 in favor of the alternative hypothesis. As can be seen in

The final aspect in which we deviated from our protocol is that we tested 102 participants, which is more than the 100 that were planned initially. This deviation occurred because participants were randomly assigned to conditions (i.e., by picking an envelope that contained the number of their booth, see below). Hence, the main criterion of a maximum number of 50 participants per condition is not necessarily consistent with the secondary criterion of a maximum number of 100 participants total, as was assumed in the preregistration document. At the point of stopping, there were 48 participants in the clockwise condition and 54 in the counterclockwise condition.

As mentioned above, we recruited students from the University of Amsterdam as well as non-students (people who walked in). Participants were rewarded with course credits or a small monetary reward.

We closely followed the materials section in

Both rods were fixed on a wooden board, 50 cm apart, so that the two paper towels could easily be manipulated using both arms. The rotating direction was instructed non-verbally by a schematic description (

After signing an informed consent form, participants first completed a set of unrelated tasks lasting approximately 30 min (e.g., completing an assessment form, doing a lexical decision task). This setup was used on purpose, as it mimicked more closely the design of

Next, we closely followed the procedures outlined by

As in


We excluded five participants who did not follow the experimental procedure as intended: two of these participants rotated the two rolls in opposite directions (e.g., the left hand clockwise and the right hand counterclockwise), one participant stopped rotating after the first NEO item, one participant misunderstood the instructions and tried to rotate the wooden sticks instead of the rolls, and one participant expressed strong dissatisfaction with the task (consequently, the experimenter decided to stop the task halfway).

We included a total of 102 participants (77 females) in the analysis, 48 in the clockwise condition and 54 in the counterclockwise condition. The mean age was 22.1 years (range 17–51) and 93% (

We recoded the reverse items (Cronbach’s α = 0.65, similar to the value of α = 0.58 reported in the original study). The critical test yielded BF_{01} = 10.76, indicating that the observed data are 10.76 times more likely under the null hypothesis that postulates the absence of the effect than under the alternative hypothesis that postulates the presence of the effect. According to the classification scheme proposed by
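The recoding and reliability computation can be sketched as follows; this is a minimal example that assumes the items are scored on a 1–5 Likert scale (an assumption on our part), with function names of our own choosing:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_participants x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)      # variance of sum scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def reverse_code(x, low=1, high=5):
    """Recode a reverse-keyed item on a low..high Likert scale."""
    return (low + high) - np.asarray(x)
```

Applying `reverse_code` to the reverse-keyed columns before computing alpha reproduces the standard recode-then-score workflow.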

Values of BF_{0+} greater than 1 indicate evidence in favor of the null hypothesis. As the number of participants grows, the Bayes factor increasingly supports the null hypothesis. It is of note that this Bayesian sequential analysis requires no corrections – the Bayes factor can simply be monitored as the data accumulate (e.g.,

To examine the robustness of our conclusions, we varied the shape of the prior for the effect size under the alternative hypothesis.

The preregistration document reads: “As the original results from

The control factors were pleasantness (“How pleasant did you find this task?”), effort (“How much effort did you invest in this task?”), mood (“At this moment, how do you feel?”), and arousal (“At this moment, how agitated are you?”). These factors were assessed on Likert scales ranging from 0 to 10.

Number of participants (N), mean score, SD, and the two-sided default Bayes factor for each of the four control questions.

| | Condition | N | Mean score | SD | BF_{01} |
|---|---|---|---|---|---|
| Pleasantness | Clockwise | 48 | 3.88 | 2.58 | 6.50 |
| | Counterclockwise | 54 | 3.81 | 2.06 | |
| Effort | Clockwise | 48 | 3.60 | 2.37 | 0.98 |
| | Counterclockwise | 54 | 4.56 | 2.37 | |
| Mood | Clockwise | 48 | 6.33 | 1.52 | 3.52 |
| | Counterclockwise | 54 | 5.94 | 1.85 | |
| Arousal | Clockwise | 48 | 3.06 | 1.73 | 2.42 |
| | Counterclockwise | 54 | 3.69 | 2.46 | |

We were unable to replicate the finding reported by

We hope that future empirical efforts in psychology and other disciplines will increasingly use preregistered Bayes factor hypothesis tests. By preregistering the analysis plan, researchers prevent themselves from falling prey to their own preconceptions and biases – mental distortions that can easily translate into a series of data-inspired hypothesis tests, only a subset of which is presented to the reader. By conducting a Bayesian hypothesis test – something that can be easily accomplished using JASP ^{3} (

In closing, we should stress that a single experiment cannot overturn a large body of work. However, the strength of evidence in our data is sufficient to change one’s prior beliefs by an order of magnitude. An empirical debate is best organized around a series of preregistered replications, and perhaps the authors whose work we did not replicate will feel inspired to conduct their own preregistered studies. In our opinion, science is best served by ruthless theoretical and empirical critique, such that the surviving ideas can be relied upon as the basis for future endeavors. A strong anvil need not fear the hammer, and accordingly we hope that preregistered replications will soon become accepted as a vital component of a psychological science that is both thought-provoking and reproducible.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We would like to acknowledge the helpful support of Dr. Topolinski, who kindly provided us with detailed descriptions of the