
Edited by: Pietro Cipresso, Italian Auxological Institute (IRCCS), Italy

Reviewed by: Dirk Van Rooy, Australian National University, Australia; Martha Michalkiewicz, Heinrich Heine University of Düsseldorf, Germany

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In the past two decades, psychological science has experienced an unprecedented replicability crisis, which has uncovered several issues. Among others, the use and misuse of statistical inference plays a key role in this crisis. Indeed, statistical inference is too often viewed as an isolated procedure limited to the analysis of data that have already been collected. Instead, statistical reasoning is necessary both at the planning stage and when interpreting the results of a research project. Based on these considerations, we build on and further develop an idea proposed by Gelman and Carlin (2014), namely the prospective and retrospective design analysis. Rather than focusing only on statistical significance, a design analysis evaluates, before and after data collection, the inferential risks associated with a study: power, the Type M (magnitude) error, and the Type S (sign) error.

“Accept uncertainty. Be thoughtful, open, and modest. Remember ATOM.”

(Wasserstein et al., 2019)

In the past two decades, psychological science has experienced an unprecedented replicability crisis (Ioannidis, 2005).

Whereas some important reasons for the crisis are intrinsically related to psychology as a science (Chambers, 2017), others concern the use, and misuse, of statistical inference.

In the current paper, we focus on an upstream—but still neglected—issue that is unrelated to the approach chosen by the researcher, namely the need for statistical reasoning, i.e., “to reason about data, variation and chance” (Moore, 1998).

To achieve this goal, we build on and further develop an idea proposed by Gelman and Carlin (2014), namely the prospective and retrospective design analysis.

In brief, the term “design analysis,” introduced by Gelman and Carlin (2014), emphasizes that the evaluation of a study should go beyond the classical concept of power and include two further inferential risks: the Type M (magnitude) error and the Type S (sign) error.

Although the idea of a design analysis could be developed within different inferential statistical approaches (e.g., Frequentist and Bayesian), in this paper we will rely on the Neyman-Pearson (N-P) approach (Pearson and Neyman, 1928).

In the next paragraphs, we will briefly review the main consequences of underpowered studies, discuss two relevant misconceptions concerning the interpretation of statistically significant results, and present a theoretical framework for design analysis, including some clarifications regarding the concept of “plausible effect size.” In section 2, through familiar examples within psychological research, the benefits of prospective and retrospective design analysis will be highlighted. In section 3, we will propose a specific method that, by explicitly taking uncertainty issues into account, could further assist researchers in evaluating scientific findings. Subsequently, in section 4, a real case study will be presented and analyzed. Finally, in section 5, we will summarize the potentials, further developments, and limitations of our proposal.

To increase readability and ensure transparency of our work, we also include two easy-to-use R functions, which are available as Supplementary Material.

In 1962, Cohen called attention to a problem affecting psychological research that is still very much alive today (Cohen, 1962): the low statistical power of most studies, where power is the probability of correctly rejecting the Null Hypothesis (H_{0}) when the Alternative Hypothesis (H_{1}) is true. One of the problems with underpowered studies is that the probability of finding an effect, if it actually exists, is low. More importantly, if a statistically significant result (i.e., “in general,” when the observed p-value falls below the significance level and H_{0} is rejected; see Wasserstein et al., 2019) is obtained in an underpowered study, the estimated effect size is likely to be a considerable overestimate of the true one.

This inflation of the effect sizes can be seen when examining results of replication projects, which are usually planned to have higher power than the original studies. For example, the Open Science Collaboration (2015) found that the mean effect size of the replication studies was approximately half that of the original studies.

Given that underpowered studies are widespread in psychology (Cohen, 1962; Button et al., 2013), the published literature is likely to contain many inflated estimates of the underlying true effects.

When a statistically significant result is obtained in an underpowered study (e.g., power = 40%), in spite of the low probability of this event happening, the result might be seen as even more remarkable. In fact, the researcher might think, “If obtaining a statistically significant result is such a rare event, and in my experiment I obtained a statistically significant result, it must be a strong one.” This is called the “what does not kill statistical significance makes it stronger” fallacy (Loken and Gelman, 2017).

In these situations, the apparent win in terms of obtaining a statistically significant result is actually a loss; “the lucky” scientist who makes a discovery is cursed by finding an inflated estimate of that effect (Button et al., 2013). Paradoxically, the best that could happen to the researcher is to fail to reject H_{0} and find an effect close to what is plausible in that research field (i.e., 0.20). In fact, in this underpowered study (i.e., one based on a sample that is small relative to the plausible effect size), a non-significant result only means that you cannot reject H_{0} under the NHST approach, and that you can accept H_{0} under the N-P approach.

The Winner's Curse. Hypothetical study where the plausible true effect size is small (Cohen's d = 0.20). To reject H_{0}, the researcher has to overestimate the underlying true effect, which is indicated by the dashed vertical line. Note: the rejection regions of H_{0}, given a significance level of 0.05, lie outside the vertical black lines.
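The winner's curse can be made concrete with a small simulation (an illustrative Python sketch, not the paper's R code; the per-group sample size of 146 is our assumption, chosen so that a true effect of d = 0.20 is detected with roughly 40% power):

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(2020)
d_true, n, n_sims = 0.20, 146, 5000   # n = 146 per group gives roughly 40% power
crit = NormalDist().inv_cdf(0.975)    # two-sided 5% criterion (normal approximation)

significant = []                      # |observed d| of the "lucky" significant studies
for _ in range(n_sims):
    g1 = [rng.gauss(d_true, 1) for _ in range(n)]
    g2 = [rng.gauss(0.0, 1) for _ in range(n)]
    sp = ((stdev(g1) ** 2 + stdev(g2) ** 2) / 2) ** 0.5   # pooled SD
    d_obs = (mean(g1) - mean(g2)) / sp
    if abs(d_obs) * (n / 2) ** 0.5 > crit:                # approximate two-sample test
        significant.append(abs(d_obs))

print(len(significant) / n_sims)      # estimated power, about 0.40
print(mean(significant) / d_true)     # average exaggeration among significant results
```

Even at 40% power, the effects that survive the significance filter overestimate d = 0.20 by roughly half on average: conditioning on significance is exactly what curses the "lucky" researcher.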

As we saw in the previous example, relying solely on the statistical significance of a result can lead to completely misleading conclusions. Indeed, researchers should take into account other relevant information, such as the hypothesized “plausible effect size” and the consequent power of the study. Furthermore, to assist researchers with evaluating the results of a study in a more comprehensive way, Gelman and Carlin (2014) proposed two additional indices: the Type M error, i.e., the exaggeration ratio by which a statistically significant result overestimates, on average, the plausible true effect, and the Type S error, i.e., the probability that a statistically significant result has the opposite sign to the plausible true effect.

Based on this consideration, Gelman and Carlin (2014) introduced the term “design analysis” to denote the comprehensive evaluation of these inferential risks, which can be performed both before data collection (prospective design analysis) and after the results have been obtained (retrospective design analysis).

Given these premises, the steps to perform a design analysis using Cohen's d as the effect size measure are the following:

A plausible effect size for the study of interest needs to be identified. Rather than focusing on data at hand or on noisy estimates of a single pilot study, the formalization of a plausible effect size should be based on an extensive theoretical literature review and/or on meta-analyses. Moreover, specific tools (see, for example, Zondervan-Zwijnenburg et al., 2017) can be used to elicit and formalize expert knowledge.^{1}

Based on the experimental design of the study of interest (in our case, a comparison between two independent groups), a large number of simulations (i.e., 100,000) will be performed according to the identified plausible effect size. This procedure serves to provide information about what to expect if the experiment is replicated an infinite number of times and assuming that the pre-identified plausible effect is true.

Given a fixed level of Type I error (e.g., 0.05), power as well as Type M and Type S errors will be estimated from the simulated experiments.
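The three steps above can be sketched in code. This is an illustrative Python translation, not the paper's R functions; the function name and the normal approximation to the t criterion are our choices:

```python
import random
from statistics import NormalDist, mean, stdev

def design_analysis(d, n, alpha=0.05, n_sims=5000, seed=1):
    """Simulate n_sims two-group experiments with n participants per group
    and true standardized effect d; estimate power, the Type S error
    (probability that a significant result has the wrong sign), and the
    Type M error (average overestimation factor of significant results)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # normal approx. of the t criterion
    sig = []
    for _ in range(n_sims):
        g1 = [rng.gauss(d, 1) for _ in range(n)]
        g2 = [rng.gauss(0.0, 1) for _ in range(n)]
        sp = ((stdev(g1) ** 2 + stdev(g2) ** 2) / 2) ** 0.5   # pooled SD
        d_obs = (mean(g1) - mean(g2)) / sp
        if abs(d_obs) * (n / 2) ** 0.5 > crit:   # statistically significant
            sig.append(d_obs)
    power = len(sig) / n_sims
    type_s = sum(e < 0 for e in sig) / len(sig)
    type_m = mean(abs(e) for e in sig) / d
    return power, type_s, type_m

power, type_s, type_m = design_analysis(d=0.25, n=31)
```

With a plausible d of 0.25 and 31 participants per group, this sketch returns a power near 0.16–0.17, a Type S error near 0.01, and a Type M error around 2.5.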

Although the procedure may seem complex to implement, we have here provided easy-to-use R functions that perform it automatically (see Supplementary Material).

To get a first idea of the benefits of design analysis, let us re-analyze the hypothetical study presented above: a retrospective design analysis makes the low power, and the consequent inflation of statistically significant estimates, explicit before any interpretation is attempted.

Another advantage of design analysis, which will be better explored in the following sections, is that it can be effectively used in the planning phase of a study, i.e., before data collection (prospective design analysis).

In conclusion, it is important to note that whatever the type of design analysis chosen (prospective or retrospective), the relationships between power, Type M error, and Type S error remain the same.

Relationship between sample size and power, Type M error, and Type S error.

As expected, power increases as sample size increases. Moreover, Type M and Type S errors decrease as sample size (and hence power) increases.

From an applied perspective, issues with Type M error deserve special attention: whereas the Type S error becomes negligible even at moderate power, the exaggeration of statistically significant effects remains substantial unless power is high.


The main and most difficult point rests on deciding what could be considered a “plausible effect size.” Although this might seem complex, studies are usually not developed in a void. Hypotheses are derived from theories that, if appropriately formalized in statistical terms, will increase the validity of the inferential process. Furthermore, researchers are commonly interested in knowing the size and direction of effects; as shown above, this corresponds to controlling for Type M and Type S errors.

From an epistemological perspective, Kruschke (2015) made a related point: committing to quantitative expectations about the effects under study is part of what makes a theory empirically testable.

A challenging point is to establish the dimension of this effect. It might seem paradoxical that the researcher must provide an estimate of the effect size before running the experiment, given that the study is conducted precisely to find out what that estimate is. However, strong theories should allow researchers to make such predictions, and the cumulative nature of science should lend increasing precision to them.

In practice, it might be undesirable to simply take the estimate found in a pilot study or from a single previous study published in the literature as the “plausible effect size.” In fact, the plausible effect size refers to what could be approximately the true value of the parameter in the population, whereas the results of pilots or single studies (especially if underpowered) are noisy estimates of that parameter.

In line with Gelman and Carlin (2014), we suggest that the plausible effect size should instead be formalized on the basis of an extensive literature review and/or meta-analyses, possibly integrated with elicited expert knowledge.

As we have seen, the identification of a plausible effect size (or a series of plausible effect sizes to explore different scenarios) requires significant effort from the researcher. Nevertheless, we believe that this kind of reasoning can make a substantial contribution to the planning of robust and replicable studies as well as to the efficient evaluation of obtained research findings.

To conclude, we leave the reader with a question: “All other conditions being equal, if you had to evaluate two studies of the same phenomenon, the first based on a formalization of the expected plausible effect sizes of interest that is as accurate as possible, and the second one in which the size of the effects of interest was not taken into account, the findings of which study would you believe the most?” (R. van de Schoot, personal communication).

To highlight the benefits of design analysis and to familiarize the reader with the concepts of Type M and Type S errors,^{2} in this section we illustrate a hypothetical case study concerning the comparison between two treatments.

In particular, the goal of our hypothetical case study was to evaluate the differences between two treatments that aim to improve a specific cognitive ability.

Before collecting data, the researchers planned the appropriate sample size to test their hypotheses, namely that there was a difference between the means of the two treatment groups (innovative vs. traditional).

After an extensive literature review concerning studies theoretically comparable to their own, the researchers decided that a first reasonable effect size for the difference between the innovative and the traditional treatment could be considered equal to a Cohen's d of 0.25.

Based on the above considerations, the researchers started to plan the sample size for their study. First, they fixed the Type I error at 0.05 and—based on commonly accepted suggestions from the psychological literature—fixed the power at 0.80. Furthermore, to explicitly evaluate the inferential risks connected to their choices, they calculated the associated Type M and Type S errors.

Using our R function for prospective design analysis (see Supplementary Material), they obtained the required sample size and the associated inferential risks.

Based on the results, to achieve a power of 0.80, a sample size of 252 for each group was needed (i.e., total sample size = 504). With this sample size, the risk of obtaining a statistically significant result in the wrong direction (Type S error) was practically 0, and the exaggeration ratio (Type M error) was around 1.13.
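The planning numbers can be checked with a closed-form normal approximation (a Python sketch; the paper's own functions use R and resampling, which is why the simulated sample sizes reported in the text, e.g., 158 per group at power 0.60, can exceed this approximation by a participant or two):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for a two-sided comparison of two independent means:
    n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2 (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

print(n_per_group(0.25))              # 252 per group, i.e., 504 in total
print(n_per_group(0.25, power=0.60))  # 157 per group under the approximation
```

For d = 0.25 at power 0.80, the approximation reproduces exactly the 252 participants per group reported above.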

Although satisfied in terms of expected Type M and Type S errors, the researchers were worried about the feasibility of recruiting 504 participants, so they decided to also explore the inferential risks associated with a lower power of 0.60.^{3}

Using the same function, this time with power set to 0.60,

they discovered that: (1) the overall required sample size was considerably smaller (from 504 to 316 = 158 × 2), thus increasing the economic feasibility of the study; (2) the Type S error remained practically 0; and (3) the Type M error increased to around 1.30, meaning that a statistically significant result would, on average, overestimate the plausible effect size by 30%.

The researchers had to make a decision. From a merely statistical point of view, the optimal choice would be to consider a power of 80%, which is associated with a Type M error of 1.13 and a Type S error of practically 0. However, practical constraints on recruitment might reasonably lead them to accept a power of 60%, with its higher Type M error of around 1.30.

Whatever the decision, the researchers must be aware of the inferential risks related to their choice. Moreover, when presenting the results, they must be transparent and clear in communicating such risks, thus highlighting the uncertainty associated with their conclusions.

To illustrate the usefulness of retrospective design analysis, we refer to the example presented in the previous paragraph. However, we introduce three new scenarios that can be considered as representative of what commonly occurs during the research process:

S1:^{4} Imagine that, in the phase of formalizing a plausible effect size, the researchers decide to plan their sample size based on the effect size observed in a single published study, either because the published study presents relevant similarities with their own study or because there are no other published studies available.

S2: Imagine that, due to unforeseen difficulties (e.g., insufficient funding), the researchers are not able to recruit the pre-planned number of participants as defined based on the prospective design analysis.

S3: Imagine that the participants involved in the study have specific characteristics that make it impossible to obtain a large sample, or that the treatment is particularly expensive and therefore cannot be tested on a large sample. In this case, the only possibility is to recruit the largest possible number of participants.

As we will see below, retrospective design analysis can be a useful tool to deal with the questions and the issues raised across all three scenarios.

For the sake of simplicity and without loss of generality, suppose that in each of the three scenarios the researchers obtained the same results (see the table below).

Comparison of the cognitive skill between the two treatment groups.

Group | n | Mean | SD | t (df) | p | Cohen's d (95% CI)
Innovative treatment | 31 | 114 | 16 | 3.496 (60) | 0.001 | 0.90 (0.38–1.43)
Traditional treatment | 31 | 100 | 15 | | |

At first glance, the results indicated a statistically significant difference in favor of the innovative treatment (see the table above).

A closer look indicated that the estimated effect size seemed too large when compared with the initial guess of the researchers (i.e., d = 0.25).

To obtain a clearer picture of the inferential risks associated with the observed results, we performed a retrospective design analysis using our R function, assuming a plausible effect size of d = 0.25 and 31 participants per group.

As can be seen, the power was markedly low (i.e., only 16%) and the Type M error was high: on average, a statistically significant result would overestimate the plausible effect size by a factor of about 2.5, whereas the Type S error remained negligible.
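These retrospective quantities can be approximated in closed form, in the spirit of Gelman and Carlin's retrodesign function (a Python sketch under a normal approximation; 2/31 is the squared standard error of Cohen's d for two groups of 31, so the power below comes out slightly above the t-based 16% reported in the text):

```python
import random
from statistics import NormalDist

def retro_design(d, se, alpha=0.05, n_sims=50000, seed=7):
    """Power and Type S in closed form, Type M by simulation:
    d is the plausible true effect, se the standard error of its estimate."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    lam = d / se                                 # signal-to-noise ratio
    power = (1 - nd.cdf(z - lam)) + nd.cdf(-z - lam)
    type_s = nd.cdf(-z - lam) / power            # significant, but wrong sign
    rng = random.Random(seed)
    estimates = (d + se * rng.gauss(0, 1) for _ in range(n_sims))
    sig = [abs(e) for e in estimates if abs(e) > se * z]
    type_m = (sum(sig) / len(sig)) / d           # exaggeration among significant
    return power, type_s, type_m

# plausible d = 0.25, 31 participants per group: se = sqrt(2/31)
power, type_s, type_m = retro_design(0.25, (2 / 31) ** 0.5)
```

This returns a power of about 0.17, a Type S error of about 0.01, and a Type M error of about 2.5: a "significant" result would overestimate the plausible effect by a factor of roughly two and a half.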

In S1, the researchers took a single noisy estimate as the plausible effect size from a study that found a “big” effect size (e.g., 0.90). The retrospective design analysis showed what happens if the plausible effect size is, in reality, much smaller (i.e., 0.25). Specifically, given the low power and the high level of Type M error, the researchers should recognize that their observed estimate (d = 0.90) is likely to be a gross overestimate of the plausible effect.

In S2, to check the robustness of their results, researchers might initially be tempted to conduct a power analysis based on their observed effect size (d = 0.90). However, observed effect sizes from underpowered studies are noisy, typically inflated estimates, so a power analysis based on them would grossly understate the actual inferential risks.

In S3, given the low power and the high level of Type M error, the researchers should transparently acknowledge that, although they recruited the largest sample available, their study remains inadequate to provide a reliable estimate of the effect of interest.

Despite its advantages, we need to emphasize that design analysis should not be used as an automatic tool to rescue or certify a study; it supports, but does not replace, careful statistical reasoning about the phenomenon under investigation.

As shown in the previous examples, a key point both in planning (i.e., prospective design analysis) and in evaluating (i.e., retrospective design analysis) a study is the formalization of a plausible effect size. Using a single value to summarize all external information and previous knowledge with respect to the study of interest can be considered an excessive simplification. Indeed, all uncertainty concerning the magnitude of the plausible effect size is not explicitly taken into consideration. In particular, the level of heterogeneity emerging from the examination of published results and/or from the different opinions of the consulted experts can only be poorly formalized by a single value. The aim of this paragraph is to propose a method that can help researchers deal with these relevant issues. Specifically, we will focus on the evaluation of the results of a study (i.e., retrospective design analysis).

Our method can be summarized in three steps: (1) defining a lower and an upper bound within which the plausible effect size can reasonably vary; (2) formalizing an appropriate probability distribution that reflects how the effect size is expected to vary within those bounds; and (3) conducting the associated analysis of power, Type M, and Type S errors.

To illustrate the procedure, we use the study presented in the previous section, assuming that the plausible effect size can reasonably vary within an interval centered on d = 0.40.

At this point, a first option could be to assume that, within the specified interval, all effect size values have the same probability of being true. This assumption can be easily formalized using a Uniform distribution, such as the one shown in the figure below.

Different ways to formalize a plausible interval for the effect size

However, from an applied point of view it is rare for the researcher to expect that all values within the specified interval have the same plausibility. Indeed, in general conditions, it is more reasonable to believe that values around the center of the interval (i.e., 0.40 in our case) are more plausible, and that their plausibility gradually decreases as they move away from the center. This expectation can be directly formalized in statistical terms using the so-called “doubly truncated Normal distribution.” On an intuitive level (for a more complete description, see Burkardt, 2014), this is simply a Normal distribution whose values are constrained to lie between a lower and an upper bound.

Coming back to our example, suppose that the researchers want to evaluate the study of interest assuming a doubly truncated Normal distribution centered on the most plausible value of d = 0.40.^{5}
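A doubly truncated Normal can be sampled with simple rejection sampling (a Python sketch; the center of 0.40 is taken from the example, while the 0.20–0.60 bounds and the SD of 0.10 are illustrative assumptions of ours):

```python
import random
from statistics import mean

def truncated_normal(center, sd, lower, upper, size, seed=11):
    """Draw from Normal(center, sd) and keep only values inside
    [lower, upper]: the kept draws follow the doubly truncated Normal."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < size:
        x = rng.gauss(center, sd)
        if lower <= x <= upper:
            draws.append(x)
    return draws

# plausible effect sizes centered on d = 0.40, restricted to [0.20, 0.60]
plausible_d = truncated_normal(0.40, 0.10, 0.20, 0.60, 10000)
print(min(plausible_d) >= 0.20 and max(plausible_d) <= 0.60)  # True
```

Feeding each sampled value into the power/Type M/Type S computation and averaging the results yields the distribution-based design analysis described in the text.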

To summarize, this information suggests that the results of the study of interest (see the table above), with an observed effect size of 0.90, fall well outside the plausible interval and should therefore be interpreted with great caution.

In general, when the observed effect size falls outside the pre-specified plausible interval, we can conclude that the observed study is not coherent with our theoretical expectations. On the other hand, we could also consider that our plausible interval may be unrealistic and/or poorly formalized. In these situations, researchers should be transparent and propose possible explanations that could be very helpful to the understanding of the phenomenon under study. Although this way of reasoning requires a notable effort, the information provided will lead to a more comprehensive inference than the one deriving from a simplistic dichotomous decision (i.e., “reject / do not reject”) typical of the NHST approach. Indeed, in this approach the hypotheses are poorly formalized, and power, Type M, and Type S errors are rarely taken into account.

To illustrate how design analysis could enhance inference in psychological research, we have considered a real case study. Specifically, we focused on Study 2 of the published paper “A functional basis for structure-seeking: Exposure to structure promotes willingness to engage in motivated action” (Kay et al., 2014).

The paper presented five studies arising from findings showing that human beings have a natural tendency to perceive structure in the surrounding world. Various social psychology theories propose plausible explanations that share a similar assumption that had never been tested before: that perceiving a structured world could increase people's willingness to make efforts and sacrifices toward their own goals. In Study 2, the authors decided to test this hypothesis by randomly assigning participants to two different conditions differing in the type of text they had to read. In the “random” condition, the text conveyed the idea that natural phenomena are unpredictable and random, whereas in the “structure” condition the phenomena were described as predictable and systematic. The outcome measure was the willingness to work toward a goal that each participant chose as their “most important.” The expected result was that participants in the “structure” condition would report a higher score in the measure of goal-directed behavior than those in the “random” condition.

As we saw in the previous paragraphs, before collecting data it is fundamental to plan an appropriate sample size via prospective design analysis. In this case, given the relative novelty of Study 2, it was hard to identify a single plausible value for the size of the effect of interest. Rather, it seemed more reasonable to explore different scenarios according to different plausible effect sizes and power levels. We started with a minimum plausible effect size of d = 0.20.

As the most plausible effect size, we considered d = 0.35,^{6} and we also included a more optimistic scenario with d = 0.50.

Overall, our “sensitivity” prospective design analysis (see the table below) summarizes the required sample sizes and the associated inferential risks under the different scenarios.

Sample size, Type M error, and Type S error according to different plausible effect sizes and power levels.

Power | Cohen's d | n per group | n total | Type M | Type S
0.80 | 0.20 | 392 | 784 | 1.13 | 0.00
0.80 | 0.35 | 130 | 260 | 1.13 | 0.00
0.80 | 0.50 | 64 | 128 | 1.13 | 0.00
0.60 | 0.20 | 244 | 488 | 1.30 | 0.00
0.60 | 0.35 | 82 | 164 | 1.30 | 0.00
0.60 | 0.50 | 40 | 80 | 1.30 | 0.00

A good compromise could be to consider the second scenario (i.e., a plausible effect size of d = 0.35), evaluating the trade-off between a power of 0.80 (260 participants in total, Type M = 1.13) and a power of 0.60 (164 participants in total, Type M = 1.30).

Let us now evaluate Study 2 from a retrospective point of view. Based on their results [M_{structure} = 5.26, SD_{structure} = 0.88, M_{random} = 4.72, SD_{random} = 1.32, n_{total} = 67; t_{(65)} = 2.00, p = 0.05],^{7} the authors concluded that exposure to structure increased participants' willingness to engage in goal-directed behavior.

To evaluate the inferential risks associated with this conclusion, we ran a sensitivity retrospective design analysis on the pre-identified plausible effect sizes (i.e., d = 0.20, 0.35, and 0.50).

In line with the results that emerged from the prospective analysis, the retrospective design analysis indicated that the sample size used in Study 2 exhibited high inferential risks. In fact, both for a plausible effect of d = 0.20 and for d = 0.35, the power was low (around 0.13 and 0.30, respectively) and the Type M error was high (around 3.0 and 1.8, respectively).
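Under a normal approximation, the power side of this sensitivity analysis can be reproduced in a few lines (a Python sketch; the 33/34 split follows footnote 7, and the approximation lands marginally above the resampling-based values in the text, e.g., 0.30 rather than 0.29 at d = 0.35):

```python
from statistics import NormalDist

def power_two_sample(d, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample test for Cohen's d
    (normal approximation to the t-test)."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    lam = d / (1 / n1 + 1 / n2) ** 0.5   # noncentrality (signal-to-noise)
    return (1 - nd.cdf(z - lam)) + nd.cdf(-z - lam)

# total n = 67, split 33/34, across the three plausible effect sizes
for d in (0.20, 0.35, 0.50):
    print(d, round(power_two_sample(d, 33, 34), 2))
```

Even in the optimistic d = 0.50 scenario, power barely exceeds one half, which is what drives the high Type M errors noted above.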

We also evaluated the results of Study 2 by performing a retrospective design analysis using the method presented in section 3. Specifically, we used a doubly truncated normal distribution centered at 0.35 (i.e., the most plausible effect size) with a plausible interval of 0.20–0.50. As could be expected, the results (i.e., power = 0.29, with a high Type M error and a negligible Type S error) closely mirrored those obtained with the single most plausible value of d = 0.35.

In summary, our retrospective design analysis indicated that, although statistically significant, the results of Study 2 were inadequate to support the authors' conclusions.

As mentioned at the beginning of this paragraph, Study 2 by Kay et al. (2014) is only one of the five studies reported in that paper, and it was selected for illustrative purposes; our evaluation concerns this single study rather than the overall contribution of the authors.

In psychological research, statistical inference is often viewed as an isolated procedure that limits itself to the analysis of data that have already been collected. In this paper, we argue that statistical reasoning is necessary both at the planning stage and when interpreting the results of a research project. To illustrate this concept, we built on and further developed Gelman and Carlin's (2014) idea of prospective and retrospective design analysis.

In line with recent recommendations (Cumming, 2014; Wasserstein et al., 2019), design analysis encourages researchers to focus on estimation and uncertainty rather than on statistical significance alone.

Moving beyond the simplistic and often misleading distinction between significant and non-significant results, a design analysis allows researchers to quantify, consider, and explicitly communicate two relevant risks associated with their inference, namely the exaggeration ratio (Type M error) and the probability of obtaining a statistically significant result in the wrong direction (Type S error).

Another important aspect of design analysis is that it can be usefully carried out both in the planning phase of a study (i.e., prospective design analysis) and to evaluate studies that have already been conducted (i.e., retrospective design analysis), reminding researchers that the process of statistical inference should start before data collection and does not end when the results are obtained. In addition, design analysis contributes to a more comprehensive and informative picture of the research findings through the exploration of different scenarios, according to different plausible formalizations of the effect of interest.

To familiarize the reader with the concept of design analysis, we included several examples as well as an application to a real case study. Furthermore, in addition to the classic formalization of the effect size with a single value, we proposed an innovative method to formalize uncertainty and previous knowledge concerning the magnitude of the effect via probability distributions within a Frequentist framework. Although not directly presented in the paper, it is important to note that this method could also be efficiently used to explore different scenarios according to different plausible probability distributions.

Finally, to allow researchers to use all the illustrated methods with their own data, we also provided two easy-to-use R functions, presented with worked examples in the Supplementary Material.

For the sake of simplicity, in this paper we limited our consideration to Cohen's d, but the same reasoning can be extended to other effect size measures, such as η^{2} and R^{2}. Moreover, concerning the proposed method to formalize uncertainty and prior knowledge, other probability distributions beyond those proposed in this paper (i.e., the uniform and the doubly truncated normal) could be easily added. This was one of the main reasons behind the choice to use resampling methods to estimate power as well as Type M and Type S errors.

Also, it is important to note that our considerations regarding design analysis could be fruitfully extended to the increasingly used Bayesian methods. Indeed, our proposed method to formalize uncertainty via probability distributions finds its natural extension in the concept of Bayesian prior. Specifically, design analysis could be useful to evaluate the properties and highlight the inferential risks (such as Type M and Type S errors) associated with different prior choices.

In summary, even though a design analysis requires significant effort, we believe that it has the potential to contribute to planning more robust studies and promoting better interpretation of research findings. More generally, design analysis and its associated way of reasoning helps researchers to keep in mind the inspiring quote presented at the beginning of this paper regarding the use of statistical inference: “Remember ATOM.”

All R scripts used to reproduce the examples presented in the paper are reported in the article/Supplementary Material.

GA conceived the original idea and drafted the paper. GB, CZ, and ET contributed to the development of the original idea and drafted sections of the manuscript. MP and GA wrote the R functions. GA, MP, and CZ took care of the statistical analysis and of the graphical representations. LF and AC provided critical and useful feedback. All authors contributed to manuscript revision, and read and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at:

^{1}To obtain a more comprehensive picture of the inferential risks associated with their study, we suggest that researchers inspect different scenarios according to different plausible effect sizes and perform more than one design analysis (see for example our application to a real case study in section 4).

^{2}We remind the reader that the Type M error indicates the factor by which a statistically significant result, on average, overestimates the plausible true effect, and that the Type S error is the probability that a statistically significant result has the wrong sign.

^{3}Specifically, we agree with Gelman (2019) that conventional power levels should not be applied mechanically; the appropriate level depends on the aims, constraints, and inferential risks of the specific study.

^{4}Even though, in this paper, we strongly recommend that one does not plan the sample size based on a single study, we propose this example to further emphasize the inferential risks associated with the information provided by a single underpowered study.

^{5}The idea behind this function is simple. First, we sample a large number (e.g., 100,000) of effect sizes from the specified distribution; then, for each sampled effect size, we compute power, Type M, and Type S errors; finally, we average the results to obtain overall estimates.

^{6}In the Open Science Collaboration (2015), the average effect size of the replication studies was considerably smaller than that of the original studies; this observation informed our choice of the most plausible effect size.

^{7}The authors reported only the total sample size (n = 67). In our analyses, we therefore assumed two groups of nearly equal size (i.e., 33 and 34 participants).