
Edited by: Laura Badenes-Ribera, Universitat de València, Spain

Reviewed by: Thomas J. Faulkenberry, Tarleton State University, United States; Rink Hoekstra, University of Groningen, Netherlands

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We argue that making accept/reject decisions on scientific hypotheses, including a recent call for changing the canonical alpha level from 0.05 to 0.005, is deleterious for the finding of new discoveries and the progress of cumulative science.

Many researchers have criticized null hypothesis significance testing, though many have defended it too (see Balluerka et al., 2005).

We commence with some claims on the part of Benjamin et al. (2017), who called for lowering the conventional significance threshold from 0.05 to 0.005 for claims of new discoveries.

With the foregoing out of the way, consider that a basic problem with tests of significance is that the goal is to reject a null hypothesis. This goal seems to demand—if one is a Bayesian—that the posterior probability of the null hypothesis should be low given the obtained finding. But the p-value is the probability of the observed data, or more extreme data, given the null hypothesis; it is not the posterior probability of the null hypothesis given the data, and the two can differ dramatically.
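The gap between a small p-value and a low posterior probability of the null can be illustrated with a small simulation. This sketch is ours, not the article's, and the prior share of true nulls, the effect size, and the sample size are all assumed for illustration:

```python
# Illustrative simulation (assumed parameters, not from the article):
# among results with p < 0.05, how often was the null actually true?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_sims = 50, 10000
null_prob = 0.8      # assumed: 80% of tested hypotheses are truly null
effect = 0.2         # assumed small true effect when the null is false

null_true = rng.random(n_sims) < null_prob
mus = np.where(null_true, 0.0, effect)
samples = rng.normal(mus[:, None], 1.0, (n_sims, n))
t = samples.mean(1) / (samples.std(1, ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(np.abs(t), n - 1)

sig = p < 0.05
share_null = null_true[sig].mean()
print(f"Share of 'significant' results where the null was true: {share_null:.2f}")
```

Under these assumptions, a large share of the "significant" results come from true nulls, so p < 0.05 by itself does not license a low posterior probability for the null.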

Trafimow and Earp (2017) elaborated on this point.

Even if one did not use a cutoff, the phenomenon of regression to the mean suggests that the p-value in a replication experiment will tend to be larger, and less impressive, than the unusually small p-value that drew attention to the original study.

Furthermore, the variability of p-values across replications is substantial, so the p-value obtained in one study is a poor guide to the p-value a replication will produce.
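How unstable p-values are across exact replications of the same design can be seen in a quick simulation (ours, not the article's; the sample size and true effect are assumed):

```python
# Simulate many exact replications of one study and look at the spread of p.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, effect = 30, 2000, 0.5   # assumed design: n = 30, true effect d = 0.5

x = rng.normal(effect, 1.0, (reps, n))
t = x.mean(1) / (x.std(1, ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(np.abs(t), n - 1)

print(f"p-value range (5th to 95th percentile): "
      f"{np.percentile(p, 5):.4f} to {np.percentile(p, 95):.2f}")
print(f"share of replications with p < 0.05: {(p < 0.05).mean():.2f}")
```

Even with a genuine medium-sized effect, the p-values range from far below 0.001 to well above 0.05 across replications of the identical experiment.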

There are several possible reasons for the low correlation, including that most of the studied associations may have in fact been nearly null, so that the p-values mostly reflected random variation rather than real effects.

Thus, the obtained p-value of an original study says little about whether a replication attempt will succeed.

All this means that binary decisions, based on p-value cutoffs, are unreliable regardless of where the cutoff is set.

Another disadvantage of using any set alpha level for publication is that the relative importance of Type I and Type II errors might differ across studies within or between areas and researchers (Trafimow and Earp, 2017).

However, we do not argue that every researcher should get to set her own alpha level for each study, as recommended by Neyman and Pearson (1933).

Given that blanket and variable alpha levels both are problematic, it is sensible not to redefine statistical significance, but to dispense with significance testing altogether, as suggested by McShane et al. (2017).

Yet another disadvantage pertains to what Benjamin et al. (2017) claim about replicability.

In addition, we do not see a single replication success or failure as definitive. If one wishes to make a strong case for replication success or failure, multiple replication attempts are desirable. Recent successful replication studies in cognitive psychology attest to this (Zwaan et al., 2017).

The discussion thus far is under the pretense that the assumptions underlying the interpretation of p-values are true; in reality, model assumptions are rarely, if ever, exactly correct.

Let us continue with the significance and replication issues, reverting to the pretense that model assumptions are correct, while keeping in mind that this is unlikely. Consider that as matters now stand, using tests of significance with the 0.05 criterion, the population effect size plays an important role both in obtaining statistical significance (all else being equal, the sample effect size will be larger if the population effect size is larger) and in obtaining statistical significance twice for a successful replication. Switching to the 0.005 cutoff would not lessen the importance of the population effect size, and would increase its importance unless sample sizes increased substantially from those currently used. And there is good reason to reject that replicability should depend on the population effect size. To see this quickly, consider one of the most important science experiments of all time, by Michelson and Morley (1887), whose famous result was an effect near zero: the experiment was nonetheless highly replicable and profoundly important for physics.
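The arithmetic behind "unless sample sizes increased substantially" can be sketched with a standard normal-approximation power calculation. This is our sketch, not the article's: the two-sample z-approximation and the 80% power target are assumptions for illustration.

```python
# Approximate per-group sample size for a two-sample comparison
# (two-sided z-test approximation), at alpha = 0.05 vs alpha = 0.005.
import math
from scipy.stats import norm

def n_required(d, alpha, power):
    """n per group so a two-sided test of effect size d has the given power."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) / d) ** 2)

for d in (0.2, 0.5):
    n05 = n_required(d, 0.05, 0.8)
    n005 = n_required(d, 0.005, 0.8)
    print(f"d={d}: n/group = {n05} at alpha=0.05 vs {n005} at alpha=0.005 "
          f"({n005 / n05:.2f}x)")
```

Under this approximation, holding power fixed at the stricter threshold requires samples on the order of 70% larger, which is in line with what proponents of the 0.005 proposal themselves report.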

In addition, with an alpha level of 0.005, large effect sizes would be more important for publication, and researchers might lean much more toward “obvious” research than toward testing creative ideas where there is more of a risk of small effects and of p-values that fail to reach the stricter criterion.

It is desirable that published facts in scientific literatures accurately reflect reality. Consider again the regression issue. The more stringent the criterion for publishing, the farther a finding that passes it lies from the mean, and so the larger the regression effect. Even at the 0.05 alpha level, researchers have long recognized that published effect sizes likely do not reflect reality, or at least not the reality that would be seen if there were many replications of each experiment and all were published (see Briggs, 2016).
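This selection effect on published effect sizes can be demonstrated with a simulation (ours, with an assumed small true effect and modest sample size), and it worsens as the publication threshold tightens:

```python
# Publish only if p < alpha (and the effect points the expected way):
# the published average overstates the true effect, more so for smaller alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps, effect = 20, 20000, 0.2   # assumed: small true effect, n = 20

x = rng.normal(effect, 1.0, (reps, n))
m = x.mean(1)
t = m / (x.std(1, ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(np.abs(t), n - 1)

for alpha in (0.05, 0.005):
    published = m[(p < alpha) & (m > 0)]
    print(f"alpha={alpha}: mean published effect = {published.mean():.2f} "
          f"(true effect = {effect})")
```

Under these assumptions, the published average is several times the true effect at alpha = 0.05, and larger still at alpha = 0.005: the stricter the gate, the more the surviving findings overshoot.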

We stress that replication depends largely on sample size, but there are factors that interfere with researchers using the large sample sizes necessary for good sampling precision and replicability. In addition to the obvious costs of obtaining large sample sizes, there may be an underappreciation of how much sample size matters (Vankov et al., 2014).

This closeness procedure stresses (a) deciding what it takes to believe that the sample statistics are good estimates of the population parameters before data collection rather than afterwards, and (b) obtaining a large enough sample size to be confident that the obtained sample statistics really are within specified distances of corresponding population parameters. The procedure also does not promote publication bias because there is no cutoff for publication decisions. And the closeness procedure is not the same as traditional power analysis: First, the goal of traditional power analysis is to find the sample size needed to have a good chance of obtaining a statistically significant p-value. Second, the goal of the closeness procedure is to find the sample size needed to be confident that the sample statistics are within specified distances of the corresponding population parameters, with no significance test involved.
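For the simple case of estimating a single normal mean, the closeness idea reduces to a one-line formula: choose n in advance so that, with probability c, the sample mean lands within f standard deviations of the population mean. The sketch below is our illustration of that special case; the symbols f and c are our notation, not necessarily the article's.

```python
# Sample size so that P(|sample mean - mu| <= f * sigma) >= c
# for a normally distributed variable with known sigma.
import math
from scipy.stats import norm

def closeness_n(f, c):
    """Smallest n putting the sample mean within f*sigma of mu with prob. c."""
    z = norm.ppf((1 + c) / 2)
    return math.ceil((z / f) ** 2)

print(closeness_n(0.2, 0.95))  # within 0.2 sigma, 95% confidence -> 97
```

Note that no significance test appears anywhere: the sample size is driven entirely by the desired estimation precision, which is exactly why there is no cutoff to bias publication decisions.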

The larger point is that there are creative alternatives to significance testing that confront the sample size issue much more directly than significance testing does. The “statistical toolbox” (Gigerenzer and Marewski, 2015) contains many such options.

But for scientific exploration, none of those tools should become the new magic method giving clear-cut mechanical answers (Cohen, 1994).

Finally, inference should not be based on single studies at all (Neyman and Pearson, 1933), but on cumulative evidence from multiple studies.

It seems appropriate to conclude with the basic issue that has been with us from the beginning. Should p-values, at any threshold, be used to make binary accept/reject decisions about hypotheses? We have argued that they should not.

All authors listed have made a direct contribution to the paper or endorse its content, and approved it for publication.

FK-N was employed by Oikostat GmbH. GM has been acting as consultant for Janssen Research and Development, LLC. The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

We thank Sander Greenland and Rink Hoekstra for comments and discussions. MG acknowledges support from VEGA 2/0047/15 grant. RvdS was supported by a grant from the Netherlands organization for scientific research: NWO-VIDI-45-14-006. Publication was financially supported by grant 156294 from the Swiss National Science Foundation to VA.