
Edited by: Peng Liu, Institute of Remote Sensing and Digital Earth (CAS), China

Reviewed by: Venkata Krishna Jandhyala, University of Western Ontario, Canada; Haruhiko Ogasawara, Otaru University of Commerce, Japan

This article was submitted to Environmental Informatics, a section of the journal Frontiers in Ecology and Evolution

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The methods for making statistical inferences in scientific analysis have diversified even within the frequentist branch of statistics, but comparison has been elusive. We approximate analytically and numerically the performance of Neyman-Pearson hypothesis testing, Fisher significance testing, information criteria, and evidential statistics (Royall,

In the twentieth century, the bulk of scientific statistical inference was conducted with Neyman-Pearson hypothesis tests, a term which we broadly take to encompass significance testing,

A substantial advance in late 20th century statistical practice was the development of information-theoretic indexes for model selection, namely the Akaike information criterion (AIC) and its variants (Akaike,

In an apparently separate statistical development, the concept of statistical evidence was refined in light of the shortcomings of using as evidence quantities such as

Despite widespread current usage of AIC-type indexes in ecology, the inferential basis and implications of the use of information criteria are not fully developed, and what is developed is commonly misunderstood (see the forum edited by Ellison et al.,

This paper contrasts the concept of evidence with classical statistical hypothesis testing and demonstrates that many information-based indexes for model selection can be recast and interpreted as evidence functions. We show that the evidence function concept fulfills many seeming objectives of model selection in ecology, both in a statistical as well as scientific sense, and that evidence functions are intuitive and easily grasped. Specifically, the difference of two values of an information-theoretic index for a pair of models possesses in whole or in part the properties of an evidence function and thereby grants to the resulting inference a scientific warrant of considerable novelty in ecological practice.

Of particular importance is the desirable behavior of evidence functions under model misspecification, behavior which, as we shall show, departs sharply from that of statistical hypothesis testing. As ecologists grapple increasingly with issues related to multiple quantitative hypotheses for how data arose, the evidence function concept can serve as a scientifically satisfying basis for model comparison in observational and experimental studies.

For convenience we label as Neyman-Pearson (NP) hypothesis tests a broad collection of interrelated statistical inference techniques, including

NP hypothesis tests and evidential comparisons are conducted in very different fashions and operate under different warrants. Thus, comparison is difficult. However, they both make inferences. One fundamental metric by which they can be compared is the frequency that inferences are made in error. In this paper we seek to illuminate how the frequency of errors made by these methods is influenced by sample size, the differences among models being compared, and also the differences between candidate models and the true data generating process. Both of these inferential approaches can be, and generally are, constructed around a base of the likelihood ratio (LR). By studying the statistical behavior of the LR, we can answer our questions regarding frequency of error in all approaches considered.

Throughout this discussion, one observation (datum) is represented using the random variable X, and the joint probability density function (pdf) of the data X_{1}, X_{2}, …, X_{n} under the generating process is written as

whereas under the approximating model it is

In cases where there are two candidate models m_{1}(x) and m_{2}(x), the models will sometimes be denoted M_{1} and M_{2} to avoid double subscript levels.

We make much use of the Kullback-Leibler (KL) divergence, one of the most commonly used measures of the difference between two distributions. The KL divergence of a model m(x) from the generating process g(x) is

K(g, m) = E_{g}[log(g(X)/m(X))].

Here E_{g} denotes expectation with respect to the distribution represented by g(x).

The KL divergence is interpreted as the amount of information lost when using model m(x) to approximate the generating process g(x).

The relevant KL divergences under correct model specification are, for data generated by m_{1}(x) and by m_{2}(x) respectively,

K_{12} = E_{1}[log(m_{1}(X)/m_{2}(X))],   K_{21} = E_{2}[log(m_{2}(X)/m_{1}(X))].

By reversing numerator and denominator in the log function in Equation (5), one finds that E_{2}[log(m_{1}(X)/m_{2}(X))] = −K_{21}.

The convention for which subscript is placed first varies among references; we put the subscript of the reference distribution first as it is easy to remember.
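The divergences just described are easy to compute directly. The sketch below is our own illustration, not from the original text: it uses a pair of hypothetical Bernoulli models (success probabilities 0.75 and 0.50) and shows that the KL divergence is not symmetric, which is why the subscript convention matters.

```python
from math import log

def kl_bernoulli(p_ref, p_alt):
    """KL divergence K(ref, alt) of a Bernoulli(p_alt) model from a
    Bernoulli(p_ref) reference distribution (reference subscript first)."""
    return (p_ref * log(p_ref / p_alt)
            + (1 - p_ref) * log((1 - p_ref) / (1 - p_alt)))

K12 = kl_bernoulli(0.75, 0.50)  # information lost using model 2 when model 1 is true
K21 = kl_bernoulli(0.50, 0.75)  # information lost using model 1 when model 2 is true

print(K12, K21)   # the two divergences differ: KL is not symmetric
print(K12 + K21)  # the symmetrized sum, used later as a model distance
```

The symmetrized sum printed last is the quantity that reappears below as a distance measure between the two candidate models.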

The likelihood ratio (LR) and its logarithm figure prominently in statistical hypothesis testing as well as in evidential statistics. The LR is

L_{1}/L_{2} = [m_{1}(X_{1}) m_{1}(X_{2}) ⋯ m_{1}(X_{n})] / [m_{2}(X_{1}) m_{2}(X_{2}) ⋯ m_{2}(X_{n})],

and the log-LR is

log(L_{1}/L_{2}) = Σ_{i} log[m_{1}(X_{i})/m_{2}(X_{i})].

In particular, the log-LR considered as a random variable is a sum of iid random variables, and its essential statistical properties can be approximated using the central limit theorem (CLT). Each summand log[m_{1}(X_{i})/m_{2}(X_{i})] has mean K_{12} under m_{1}(x) and mean −K_{21} under m_{2}(x), so the mean of the log-LR is nK_{12} or −nK_{21}, depending on which model generated the data. Let σ_{1}^{2} (σ_{2}^{2}) denote the variance of log[m_{1}(X)/m_{2}(X)] when the data arise from m_{1}(x) (m_{2}(x)).

One can envision cases in which these variances might not exist, but we do not consider such cases here. The CLT, which requires that the variances be finite, provides the following approximations. If the data arise from m_{1}(x):

log(L_{1}/L_{2}) ~ normal(nK_{12}, nσ_{1}^{2}).

Here, "~" denotes "is approximately distributed as." If the data arise from m_{2}(x):

log(L_{1}/L_{2}) ~ normal(−nK_{21}, nσ_{2}^{2}).
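The normal approximation for the log-LR is easy to check by simulation. The sketch below is our own construction (hypothetical Bernoulli models with p = 0.75 vs. p = 0.50, data drawn under the first model); it verifies that the simulated mean of the log-LR matches the CLT mean n·K_{12}.

```python
import random
from math import log

random.seed(1)

p1, p2, n = 0.75, 0.50, 200

# per-observation log-likelihood-ratio term log[m1(x)/m2(x)] for x in {0, 1}
term = {1: log(p1 / p2), 0: log((1 - p1) / (1 - p2))}

# K12 = E_1[log(m1(X)/m2(X))]: mean of the summand when the data come from m1
K12 = p1 * term[1] + (1 - p1) * term[0]

# simulate the log-LR repeatedly with data generated under model m1
reps = 5000
loglrs = [sum(term[1] if random.random() < p1 else term[0] for _ in range(n))
          for _ in range(reps)]

mean_loglr = sum(loglrs) / reps
print(mean_loglr, n * K12)  # simulated mean is close to the CLT mean n*K12
```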

The device of using the CLT to study properties of the likelihood ratio is old and venerable and figures prominently in the theory of sequential statistical analysis (Wald,

Suppose that X_{1}, X_{2}, …, X_{n} are independent and identically distributed random variables with common finite mean denoted μ = E(X_{i}) and finite variance denoted σ^{2}. Let

S_{n} = X_{1} + X_{2} + ⋯ + X_{n}

be the sum of the X_{i}s. Let Z_{n} denote S_{n} standardized with its mean nμ and variance nσ^{2}, equivalently written as Z_{n} = (S_{n} − nμ)/(σ√n). The CLT states that the distribution of Z_{n} converges to that of a standard normal random variable as n → ∞.

From the CLT one can obtain normally distributed approximations for various quantities of interest:

Here,

A model,

The approximate behavior of the LR under misspecification can also be represented with the CLT. To our two model candidates m_{1}(x) and m_{2}(x) we add the true generating process g(x), which in general differs from both m_{1}(x) and m_{2}(x).

We note that Δ > 0 if model m_{1} is "closer" to truth, Δ < 0 if model m_{2} is closer to truth, and Δ = 0 if the two models are equidistant from truth.

The rightmost equality is established by adding and subtracting E_{g}[log(g(X))] inside the expectation.

And now by the CLT, if the data did not arise from m_{1}(x) or m_{2}(x) but rather from g(x):

log(L_{1}/L_{2}) ~ normal(nΔ, nσ_{g}^{2}),

where σ_{g}^{2} is the variance of log[m_{1}(X)/m_{2}(X)] when the data arise from g(x).

Critical to the understanding, both mathematical and intuitive, of inference on models is an understanding of the topology of models. Once one has a concept of distances between models, a topology is implied. A model with one or more unknown parameters represents a whole family or set of models, with each parameter value giving a completely specified model. At times we might refer to a model set as a model if there is no risk of confusion. Two model sets can only be arranged as nested, overlapping, or non-overlapping. A set of models can be correctly specified or misspecified depending on whether or not the generating process can be exactly represented by a model in the model set. Thus, there are only six topologies relating two model sets to the generating process (

Model topologies when models are correctly specified. Regions represent parameter spaces. Star represents the true parameter value corresponding to the model that generated the data. In a nested arrangement, the first regression model might have predictor variables x_{1} and x_{2} while the second has predictor variables x_{1}, x_{2}, and x_{3}. In an overlapping arrangement, the first model might have predictor variables x_{1} and x_{2} while the second has predictor variables x_{2} and x_{3}. Three locations of truth are possible: truth in model 1, truth in model 2, and truth in both models 1 and 2. In a non-overlapping arrangement, the first model has predictor variables x_{1} and x_{2} while the second model has predictor variables x_{3} and x_{4}.

Model topologies when models are misspecified. Regions represent parameter spaces. Star represents the true model that generated the data. Exes represent the point in the parameter space covered by the model set closest to the true generating process.

In the canon of traditional statistical practices for comparing two candidate models, m_{1}(x) and m_{2}(x), the data are hypothesized to have arisen from one of the two models, and the inferential task is to decide whether it was m_{1}(x) or m_{2}(x) that produced the data.

Neyman and Pearson considered the problem of deciding between two hypotheses: H_{1}, the data arise from model m_{1}(x); and H_{2}, the data arise from model m_{2}(x).

Here the cutoff quantity (or critical value) c is chosen by the investigator. The probability of wrongly deciding on H_{2} given that H_{1} is true is the "Type 1 error probability" and is denoted as α.

Often for notational convenience in lieu of the statement “H_{i} is true” we will simply write “H_{i}.” Now, such a data-driven decision with fixed Type 1 error probability is the traditional form of a statistical hypothesis test. A test with a Type 1 error probability of α is said to be a size α test. The other error probability (“Type 2”), the conditional probability of wrongly deciding on H_{1} given H_{2}, is usually denoted β:

The power of the test is defined as the quantity 1 − β. Neyman and Pearson's theorem, stating that no other test of size α or less has power that can exceed the power of the likelihood ratio test, is a cornerstone of most contemporary introductions to mathematical statistics (Rice,

With the central limit theorem results (Equations 11–16), the error properties of the NP test can be approximated. To find the critical value c, we note that under H_{1}:

and so the CLT tells us that

where Φ(·) denotes the standard normal cumulative distribution function (cdf).

Here

The error probability β is approximated in similar fashion. We have, under H_{2},

so that, after substituting for

It is seen that β → 0 as sample size n becomes large. The quantity K_{12} + K_{21} is an actual distance measure between m_{1}(x) and m_{2}(x), in that it is symmetric in the two models and is zero only when the models are identical; β also decreases as this distance between the models increases.
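The CLT approximations of the NP error rates can be made concrete with a small sketch of our own (not from the original text). For the hypothetical pair of normal-mean models m_{1} = normal(0, 1) and m_{2} = normal(δ, 1), the per-observation log-LR term is δ²/2 − δx, so K_{12} = K_{21} = δ²/2 and σ_{1} = σ_{2} = δ exactly, and the Type 2 error of the size-α test follows in closed form.

```python
from statistics import NormalDist
from math import sqrt

Z = NormalDist()  # standard normal

def np_test_beta(delta, n, alpha):
    """CLT approximation of the Type 2 error of the NP likelihood-ratio test
    comparing m1 = normal(0,1) against m2 = normal(delta,1) with n iid data.
    For this pair, K12 = K21 = delta**2/2 and sigma1 = sigma2 = delta."""
    K = delta ** 2 / 2
    sigma = delta
    z_alpha = Z.inv_cdf(alpha)
    log_c = n * K + sigma * sqrt(n) * z_alpha  # critical value on the log scale
    # beta = P(log-LR > log_c | data from m2), log-LR ~ normal(-n*K, n*sigma^2)
    return 1 - Z.cdf((log_c + n * K) / (sigma * sqrt(n)))

betas = [np_test_beta(0.5, n, 0.05) for n in (5, 20, 80)]
print(betas)  # Type 2 error shrinks toward 0 as n grows
```

The printed sequence decreases with n, illustrating the convergence of β to zero noted above, while α stays fixed at its prescribed value by construction.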

Five important points about the Neyman-Pearson Lemma are pertinent here. First, the theorem itself is just a mathematical result and leaves unclear how it is to be used in scientific applications. The prevailing interpretation that emerged in the course of 20^{th} century science was that one of the hypotheses, H_{1}, would be accorded a special status (“the null hypothesis”), having its error probability α fixed at a known (usually small) value by the investigator. The other hypothesis, H_{2}, would be set up by experiment or survey design to be the only reasonable alternative to H_{1}. The other error probability, β, would be managed by study design characteristics (especially sample size), but would remain unknown and could at best only be estimated when the model contained parameters with unknown values. The hypothesis H_{1} would typically play the role of the skeptic's hypothesis, as in the absence of an effect (absence of a difference in means, absence of influence of a predictor variable, absence of dependence of two categorical variables, etc.) under study. The other hypothesis, H_{2}, contains the effect under study and serves as the hypothesis of the researcher, who has the scientific charge of convincing a reasoned skeptic to abandon H_{1} in favor of H_{2}.

Second, the theorem in its original form does not apply to models with unknown parameters. Various extensions were made during the ensuing decades, among them Wilks' generalized likelihood ratio test. Take for instance a normal(μ, σ^{2}) distribution with both mean μ and variance σ^{2} unknown as the model for the alternative hypothesis H_{2}, within which the null hypothesis model m_{1} constrains the mean to be a fixed known constant: μ = μ_{1}. In such scenarios, the null model is "nested" within the alternative model, that is, the null is a special version of the alternative in which the parameters are restricted to a subset of the parameter space (set of all possible parameter values). Wilks' theorem concerns the generalized likelihood ratio statistic, G^{2}, given by

where L_{1} and L_{2} are the maximized likelihoods under models m_{1} and m_{2}, with each likelihood maximized over all the unrestricted parameters in that model. The resulting parameter estimates, known as the maximum likelihood (ML) estimates, form a prominent part of frequentist statistics theory (Pawitan, …). Let θ under the alternative model m_{2} be formed by stacking subvectors θ_{21} and θ_{22}. Likewise, let θ under the restricted model m_{1} be formed by stacking the subvectors θ_{11} and θ_{12}, where θ_{11} is a vector of fixed, known constants (i.e., all values in θ_{21} are fixed) and θ_{12} is a vector of unknown parameters. Wald's (1943) theorem (after some mathematical housekeeping: Stroud, …) gives the large-sample distribution of G^{2} as a non-central chisquare(ν, λ) distribution, with degrees of freedom ν equal to the difference between the number of estimated parameters in m_{2} and the number of estimated parameters in m_{1}, and non-centrality parameter λ being a statistical (Mahalanobis) distance between the true parameter values under H_{2} and their restricted versions under H_{1}:

Here Σ is a matrix of expected log-likelihood derivatives (details in Severini, …). The theorem requires that the true values θ_{21} be local to the restricted values θ_{11}; the important aspects for the present are that λ increases with sample size and with the departure of θ_{21} from θ_{11}. Under the restricted model m_{1} governing the data production, the non-centrality parameter becomes zero, and Wald's theorem collapses to Wilks' theorem, which gives the asymptotic distribution of G^{2} under H_{1} to be an ordinary chisquare(ν) distribution. For linear statistical models in the normal distribution family (regression, analysis of variance, etc.), G^{2} boils down algebraically into monotone functions of statistics with exact (non-central and central) t- or F-distributions, and so the various statistical hypothesis tests can take advantage of exact distributions instead of asymptotic approximations.
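A minimal sketch of the nested-model machinery, our own construction rather than the paper's example: for a normal(μ, 1) model with known variance, the generalized LR statistic for H_{1}: μ = μ_{1} reduces algebraically to G² = n(x̄ − μ_{1})², which is chisquare with one degree of freedom under H_{1}; the 1-df chisquare cdf is available through the standard normal.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(2)
Z = NormalDist()

def g_squared(xs, mu1):
    """Generalized LR statistic -2*log(LR) for H1: mu = mu1 in a
    normal(mu, 1) model; the algebra reduces it to n*(xbar - mu1)**2."""
    n = len(xs)
    xbar = sum(xs) / n
    return n * (xbar - mu1) ** 2

def chi2_cdf_1df(x):
    # chisquare(1) cdf via the standard normal: P(Z^2 <= x) = 2*Phi(sqrt(x)) - 1
    return 2 * Z.cdf(sqrt(x)) - 1

# under H1 (mu = 0) the test of nominal size 0.05 should reject about 5% of the time
reps, n = 4000, 30
rejections = 0
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    p = 1 - chi2_cdf_1df(g_squared(xs, 0.0))
    rejections += p <= 0.05

rate = rejections / reps
print(rate)  # close to the nominal size 0.05
```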

The concept of a confidence interval or region for one or more unknown parameters follows from Neyman-Pearson hypothesis testing in the form of a region of parameter values for which hypothesis H_{1} would not be rejected at fixed error rate α. We remark further that although a vast amount of everyday science relies on the Wilks-Wald extension of Neyman-Pearson testing (and confidence intervals), frequentist statistics theory prior to the 1970s had not provided much advice on what to do when the two models are not nested.

Certainly nowadays one could set up a model m_{1}(x) with parameter vector θ_{1} in a hypothesis test against a non-overlapping model m_{2}(x) with parameter vector θ_{2} and obtain the distributions of the generalized likelihood ratio under both models with simulation/bootstrapping.

Third, the Neyman-Pearson Lemma provides no guidance in the event of model misspecification. The theorem assumes that the data were generated under either H_{1} or H_{2}. However, the "Type 3" error of basing inferences on an inadequate model family is widely acknowledged to be a serious (if not fatal) scientific drawback of the Neyman-Pearson framework (and parametric modeling in general, see Chatfield,

Fourth, the asymmetry of the error structure has led to difficulties in scientific interpretation of Neyman-Pearson hypothesis testing results. The difficulties stem from α being a fixed constant. A decision to prefer hypothesis H_{2} over H_{1} because the LR (Equation 23) is smaller than the critical value c is a decision with some intuitively desirable statistical properties. For example, the error rate β asymptotically approaches 0 as the sample size n increases, or as model m_{2} becomes "farther" from m_{1} (in the sense of the symmetric KL distance K_{12} + K_{21} as seen in Equation 30). Mired in controversy and confusion, however, is the decision to prefer H_{1} over H_{2} when the LR is larger than c. If a larger sample size is used, the LR has more terms and the value of c changes, but no corresponding improvement accrues to the decision in favor of H_{1}: no matter how far apart the models are or how large a sample size is collected, the probability of wrongly choosing H_{2} when H_{1} is true remains stuck at α.

Fifth, scientific practice rarely stops with just two models. In an analysis of variance, after an overall test of whether the means are different, one usually needs to sort out just who is bigger than whom. In a multiple regression, one is typically interested in which subset of predictor variables provides the best model for predicting the response variable. In a categorical data analysis of a multiway contingency table, one is often seeking to identify which combination of categorical variables and lower and higher order interactions best account for the survey counts. For many years (through the 1980s at least), standard statistical practice called for multiple models to be sieved through some (often long) sequence of Neyman-Pearson tests, through processes such as multiple pairwise comparisons, stepwise regression, and so on. It has long been recognized, however, that selecting among multiple models with Neyman-Pearson tests plays havoc with error rates, and that a pairwise decision tree of "yes-no's" might not lead to the best model among multiple models (Whittingham et al.,

R. A. Fisher never fully bought into the Neyman-Pearson framework, although generations of readers have debated about what exactly Fisher was arguing for, due to the difficulty of his writing style and opacity of his mathematics. Fisher rejected the scientific usefulness of the alternative hypothesis (likely in part because of the lurking problem of misspecification) and chose to focus on single-model decisions (resulting in lifelong battles with Neyman; see the biography by Box, …). Fisher's question was: is the model m_{1} an adequate representation of the data? As in the Neyman-Pearson framework, Fisher typically cast the null hypothesis H_{1} in the role of a skeptic's hypothesis (the lady cannot tell whether the milk or the tea was poured first). It was scientifically sufficient in this approach for the researcher to develop evidence to dissuade the skeptic of the adequacy of the null model. The inferential ambitions here are necessarily more limited, in that no alternative model is enlisted to contribute more insights for understanding the phenomenon under study, such as an estimate of effect size. As well, Fisher's null hypothesis approach preserves the Neyman-Pearson incapacitation when the null model is not contradicted by data, in that at best, one will only be able to say that the data are a plausible realization of observations that could be generated under H_{1}.

Fisher's principal tool for the inference was the P-value: the probability that hypothetical data generated under H_{1} yield a sufficient statistic as extreme or more extreme than the sufficient statistic calculated from the real data.

In the absence of an alternative model, Fisher's strict prescription of a sufficient statistic loses its force, for sufficiency is defined with respect to the model m_{1}. Accordingly, just about any statistic (besides a sufficient statistic) can be used to obtain a P-value for assessing the adequacy of m_{1}. Goodness of fit tests therefore tend to multiply, as witnessed by the plethora of tests available for the normal distribution. To sort out the qualities of different goodness of fit tests, one usually has to revert to a Neyman-Pearson two-model framework to establish for what types of alternative models a particular test is powerful.

The P-value connects to the generalized likelihood ratio statistic under H_{1}. Hinkley (1987) interprets the P-value as the area to the right of the observed value of G^{2} under the chisquare pdf applicable for H_{1}-generated data. For Fisher's preferred statistical distributions (those with sufficient statistics, nowadays called exponential family distributions), the generalized LR statistic G^{2} algebraically reduces to a monotone function of one or more sufficient statistics for the parameter or parameters under constraint in the model m_{1}. In the generalized likelihood ratio framework, the hypothesis test decision between H_{1} and H_{2} can be made by comparing the P-value with α, rejecting H_{1} as a plausible origin of the data if the P-value is less than or equal to α.

In both Neyman-Pearson hypothesis testing and Fisher significance analysis, the P-value is calculated under H_{1}. The P-value nonetheless carries information about H_{2}, as the distribution of the P-value under H_{2} becomes more and more concentrated near zero as sample size becomes large or as model m_{2} becomes "farther" from m_{1}. In the Fisher one-model framework an alternative model is unspecified. Consequently, a low P-value suggests only that the data are implausible under H_{1}. However, the P-value under H_{1} has a uniform distribution (because a continuous random variable transformed by its own cumulative distribution function has a uniform distribution) no matter what the sample size is or how far away the true data generating process is. Hence, as with NP tests, Fisher's P-value cannot accumulate support for H_{1}, as any P-value is equally probable under H_{1}.
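The uniformity of the P-value under H_{1} can be seen directly by simulation. The sketch below is our own illustration (a one-sided z-test for a hypothetical normal(μ, 1) model with H_{1}: μ = 0): under H_{1} the test statistic is exactly standard normal, so the P-value is Uniform(0, 1) at every sample size, never drifting toward 1 as evidence "for" the null.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(3)
Z = NormalDist()

def p_value(xs):
    """One-sided z-test p-value for H1: mu = 0 in a normal(mu, 1) model."""
    n = len(xs)
    xbar = sum(xs) / n
    return 1 - Z.cdf(sqrt(n) * xbar)

reps, n = 5000, 25
pvals = [p_value([random.gauss(0, 1) for _ in range(n)]) for _ in range(reps)]

mean_p = sum(pvals) / reps
frac_below_10pct = sum(p < 0.10 for p in pvals) / reps
print(mean_p, frac_below_10pct)  # near 0.5 and 0.10: uniform, regardless of n
```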

Ecologists use and discuss hypothesis testing in both the Fisher sense and the Neyman-Pearson sense, sometimes referring to both enterprises as “null hypothesis testing.” The use of

Attempts have been made to modify the Neyman-Pearson framework to accommodate the concept of evidence for H_{1}. In some applied scientific fields, for example in pharmacokinetics and environmental science, the regulatory practice has created a burden of proof around models normally regarded as null hypothesis models: the new drug has an effect equal to the standard drug, the density of a native plant has been restored to equal its previous level (Anderson and Hauck,

Another proposed solution for the evidence-for-the-null-hypothesis problem is the concept of severity (Mayo, …), built from the probability, calculated under the alternative H_{2} (with the particular effect size specified), of obtaining a result more extreme than that observed. In the generalized likelihood ratio framework, this quantity would be calculated as the area to the right of the observed value of G^{2} under the non-central chisquare pdf applicable for data generated under model m_{2}, with the non-centrality parameter set at a specified value. Thus, severity is a kind of attained power for a particular effect size. Also, severity is mostly discussed in connection with one-sided hypotheses, so that its calculation under the two-sided generalized likelihood ratio statistic is at best an approximation. However, if the effect size is substantial, the probability contribution from the "other side" is low, and the approximation is likely to be fine. In general, the severity of the test is related to the size of the effect, so care needs to be taken in the interpretation of the test.

For a given value of the LR, if the effect size is high, the probability of obtaining stronger evidence against H_{1} is high, and the severity of the test against H_{1} is high. “A claim is severely tested to the extent that it has been subjected to and passes a test that probably would have found flaws, were they present” (Mayo,

For both equivalence testing and severity, we are given procedures in which consideration of evidence requires two statistics and analyses. In the case of equivalence testing, we have a statistical test for each side of the statistical model specified by H_{1}, and for severity we have a statistic for H_{2} and a statistic for H_{1}. Indeed, Thompson (…) suggests that evidence accrues to H_{1} relative to H_{2} if the first P-value, calculated under H_{1}, is large and the second P-value, calculated under H_{2}, is small. The requirement for two analyses and two interpretations seems a disadvantageous burden for applications. More importantly, the equivalence testing and severity concepts do not yet accommodate the problems of multiple models or non-nested models.

The LR statistic (Equation 7), as discussed by Hacking (…), can itself be interpreted as evidence: a large value of the LR is evidence for H_{1} and against H_{2} (Edwards 1972 termed it "support"), while a small value is evidence for H_{2} and against H_{1}. The evidence concept here is post-data in that the realized value of the LR itself, and not a probability calculated over hypothetical experiment repetitions, conveys the magnitude of the empirical scientific case for H_{1} or H_{2}. However, restricting attention to just the LR itself leaves the prospect of committing an error unanalyzed; while scientists want to search for truth, they strongly want (for reasons partly sociological) to avoid being wrong.

Royall (…) proposed interpreting the LR with a cutoff value k (such as k = 8 or k = 32) fixed in advance: the data constitute strong evidence in favor of H_{1} when L_{1} is at least k times L_{2}, and strong evidence in favor of H_{2} when L_{2} is at least k times L_{1}. Royall's conclusion structure in terms of the LR then has a trichotomy of outcomes:

For

Royall (…) considered the frequency properties of these evidential conclusions when the data are generated under model m_{1}. It is possible that the LR could take a wayward value, leading to one of two possible errors in conclusion that could occur: (1) the LR could take a value corresponding to weak or inconclusive evidence (the error of weak evidence), or (2) the LR could take a value corresponding to strong evidence for H_{2} (the error of misleading evidence). Given the data are generated by model m_{1}, the probabilities of the two possible errors are defined as follows:

Similarly, given the data are generated under H_{2},

The error probabilities W_{1}, W_{2}, M_{1}, and M_{2} can be approximated with the CLT results for log(L_{1}/L_{2}) (Equations 11–16). Proceeding as before with the Neyman-Pearson error rates, we find that

The error probabilities W_{1}, W_{2}, M_{1}, and M_{2} depend on the models being compared, but it is easy to show that all four probabilities, as approximated by Equations (38–41), converge to zero as sample size n becomes large. The total error probability W_{i} + M_{i} is additionally a monotone decreasing function of n,
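Under the CLT approximation of the log-LR, the probabilities of misleading and weak evidence under H_{1} have closed forms: strong evidence for m_{2} corresponds to log-LR ≤ −log k and strong evidence for m_{1} to log-LR ≥ log k. The sketch below is our own rendering of that approximation with hypothetical inputs (K_{12} = 0.13, σ_{1} = 0.48, k = 8, roughly the Bernoulli 0.75 vs. 0.50 comparison).

```python
from statistics import NormalDist
from math import sqrt, log

Z = NormalDist()

def evidence_errors(K12, sigma1, n, k):
    """Sketch of the CLT approximations for the probabilities of misleading
    (M1) and weak (W1) evidence when data come from m1, where
    log-LR ~ normal(n*K12, n*sigma1**2); strong evidence for m2 is LR <= 1/k
    and strong evidence for m1 is LR >= k."""
    lo = (-log(k) - n * K12) / (sigma1 * sqrt(n))  # z-score of LR = 1/k
    hi = (log(k) - n * K12) / (sigma1 * sqrt(n))   # z-score of LR = k
    M1 = Z.cdf(lo)               # misleading: strong evidence for the wrong model
    W1 = Z.cdf(hi) - Z.cdf(lo)   # weak: LR lands between 1/k and k
    return M1, W1

for n in (10, 50, 250):
    M1, W1 = evidence_errors(K12=0.13, sigma1=0.48, n=n, k=8)
    print(n, M1, W1, 1 - M1 - W1)  # total error shrinks as n grows
```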

in which the argument of the cdf Φ(·) is seen (by ordinary differentiation, treating n as continuous) to be decreasing in n. The expression for W_{2} + M_{2} would have σ_{2} and K_{21} in place of σ_{1} and K_{12}.

The probability S_{1} of strong evidence for model m_{1}(x) when the data are generated by m_{1}(x) is S_{1} = 1 − W_{1} − M_{1},

with S_{2} = 1 − W_{2} − M_{2} defined in kind. Here it follows from the monotone decrease of W_{i} + M_{i} that S_{i} is a monotone increasing function of n, with S_{i} > M_{i},

Note that W_{1}, M_{1}, and S_{1} are not in general equal to their counterparts W_{2}, M_{2}, and S_{2}, nor should we expect them to be; frequencies of errors will depend on the details of the model generating the data. One model distribution with, say, a heavy tail could produce errors at a greater rate than a light-tailed model. The asymmetry of errors suggests possibilities of pre-data design to control errors. For instance, instead of LR cutoff points 1/k and k, one could pick cutoff points 1/k_{1} and k_{2} that render M_{1} and M_{2} nearly equal for a particular sample size and particular values of σ_{1}, K_{12}, σ_{2}, and K_{21}. Such design, however, will induce an asymmetry in the error rates (defined below) for misspecified models.

Interestingly, as a function of sample size, the probability of misleading evidence M_{i} (i = 1, 2) does not decline monotonically; it rises to a maximum before declining toward zero. The sample size ñ_{1} at which M_{1} is maximized is found by maximizing the argument of the normal cdf in Equation (38):

with the corresponding maximum value of M_{1} being

Expressions for ñ_{2} and the maximum of M_{2} are similar, with K_{21} and σ_{2} in place of the H_{1} quantities. That the M_{i} functions can increase with n over some initial range is constrained by the total error probability, W_{i} + M_{i}, that decreases monotonically with sample size.

We illustrate the error properties of evidence under correct model specification with an example. Suppose the values x_{1}, x_{2}, …, x_{n} are zeros and ones that arose as iid observations from a Bernoulli(p) distribution with pdf p^{x}(1 − p)^{1−x}, where x = 0 or 1. We compare H_{1}: p = p_{1} with H_{2}: p = p_{2}, where p_{1} and p_{2} are specified values. The log-likelihood ratio is

log(L_{1}/L_{2}) = y log(p_{1}/p_{2}) + (n − y) log[(1 − p_{1})/(1 − p_{2})],

where y = x_{1} + x_{2} + ⋯ + x_{n} is the number of ones in the data.

From Equations (4) and (9) we find that

K_{12} = p_{1} log(p_{1}/p_{2}) + (1 − p_{1}) log[(1 − p_{1})/(1 − p_{2})],

σ_{1}^{2} = p_{1}(1 − p_{1}) {log[p_{1}(1 − p_{2})/(p_{2}(1 − p_{1}))]}^{2}.

In the top panel of the figure, simulated values of the probability of strong evidence for H_{1}, given by S_{1} = 1 − W_{1} − M_{1}, are compared with the values as approximated with the CLT (Equations 38, 40). The simulated values create a jagged curve due to the discrete nature of the Bernoulli distribution but are well-characterized by the CLT approximation. The lower panel shows the probability of misleading evidence M_{1} as a function of n. The jaggedness is more pronounced for M_{1}, and the CLT approximation (Equation 38) follows only the lower edges; the approximation could likely be improved (i.e., set toward the middle of the serrated highs and lows) with a continuity correction. The CLT nonetheless picks up the qualitative behavior of the functional form of M_{1}.
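The Bernoulli comparison is straightforward to simulate. The sketch below estimates the evidential outcome probabilities by Monte Carlo with data generated under H_{1}, using p_{1} = 0.75 and p_{2} = 0.50 as in the example; the cutoff k = 8 and sample size n = 60 are our own illustrative choices.

```python
import random
from math import log

random.seed(4)

p1, p2, k, n, reps = 0.75, 0.50, 8, 60, 4000
t1, t0 = log(p1 / p2), log((1 - p1) / (1 - p2))  # per-observation log-LR terms

strong1 = strong2 = weak = 0
for _ in range(reps):
    y = sum(random.random() < p1 for _ in range(n))  # data generated under H1
    loglr = y * t1 + (n - y) * t0
    if loglr >= log(k):
        strong1 += 1   # strong evidence for H1 (correct conclusion)
    elif loglr <= -log(k):
        strong2 += 1   # strong evidence for H2 (misleading evidence, M1)
    else:
        weak += 1      # weak/inconclusive evidence (W1)

S1, M1, W1 = strong1 / reps, strong2 / reps, weak / reps
print(S1, M1, W1)
```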

Evidence error probabilities for comparing two Bernoulli(p) models, with p_{1} = 0.75 and p_{2} = 0.50. Top panel: probability of strong evidence for H_{1}, S_{1} = 1 − W_{1} − M_{1}. Bottom panel: probability of misleading evidence M_{1}. Note that the scale of the bottom graph is one fifth of that of the top graph.

The concept of evidence allows re-interpretation of the P-value. Denote by l_{1}/l_{2} the realized (i.e., post-data) value of the LR, the lower case signaling the actual outcome rather than the random variable (pre-data) version of the LR denoted by L_{1}/L_{2}. The classical P-value is the probability, calculated under H_{1}, that a repeat of the experiment would yield a LR value more extreme than the value l_{1}/l_{2} that was observed. In our CLT setup, we can write

Comparing the P-value with M_{1} (Equation 38), we find that the P-value is the probability of misleading evidence for H_{1} if the experiment were repeated and the evidence cutoff k were set at the observed value of l_{2}/l_{1}.

If the value of l_{1}/l_{2} is considered to be the evidence provided by the experiment, the P-value is a monotone function of l_{1}/l_{2} and thereby might be considered to be an evidence measure on another scale. The P-value, however, depends not only on l_{1}/l_{2} but also on K_{12} and σ_{1}. Furthermore, K_{21} and σ_{2} are left out of the P-value in the determination of amount of evidence, a finger on the scale so to speak. The evidential framework therefore argues for the following distinction in interpretation: the evidence itself is the observed LR value l_{1}/l_{2}, while the P-value, like M_{1}, is a probability of misleading evidence, except that

In fairness to both models, we can define two P-values, one calculated under model m_{1} and one under model m_{2}:

These are interpreted as the probabilities of misleading evidence under models 1 and 2, respectively, if the evidence cutoff were set at the observed value of l_{2}/l_{1}. The quantity 1 − P_{2} in this context is the severity as defined by Mayo (…). One can regard P_{1} or P_{2} as a local probability of misleading evidence (M_{L} in their notation), as opposed to a global, pre-data probability of misleading evidence (M_{G} in their notation; M_{1} and M_{2} here) characterizing the long-range reliability of the design of the data-generating process.

George Box's (Box, …) famous aphorism, that all models are wrong but some are useful, reminds us that in practice both candidate models m_{1}(x) and m_{2}(x) may well differ from the generating process g(x).

Statisticians have long cautioned about the prospect that both models _{1} and _{2} in the Neyman-Pearson framework, broadly interpreted to include testing composite models with generalized likelihood ratio and other approaches, could be misspecified, and as a result that the advertised error rates (or by extension the coverage rates for confidence intervals) would become distorted in unknown ways (for instance, Chatfield,

The critical value c for the Neyman-Pearson test is calculated under the assumption that the data arise from model m_{1}. We ask the following question: "Suppose the real Type 1 error is defined as picking model m_{2} when the model m_{1} is actually closest to the true pdf g(x). What is the probability α′ of this error when m_{1} is the better model?" We now have

after substituting for

In words, the Type 1 error realized under model misspecification is generally not equal to the specified test size. Note that Equation (53) collapses to Equation (28), as desired, if m_{1}(x) = g(x).

Whether the actual Type 1 error probability α′ is greater than, equal to, or less than the advertised level α depends on the various quantities arising from the configuration of m_{1}(x), m_{2}(x), and g(x). Because Φ(·) is a monotone increasing function, we have

The inequality reduces to three cases, depending on whether σ_{1} − σ_{g} is positive, zero, or negative:

The ratio (K_{12} − Δ)/(σ_{1} − σ_{g}) compares the difference between what we assumed about the LR mean (K_{12}) and what is the actual mean (Δ) with the difference between the assumed variability (σ_{1}) and the actual variability (σ_{g}). The left-hand inequalities for each case are reversed if α′ < α.

The persuasive strength of Neyman-Pearson testing always revolved around the error rate α being known and small, and the P-value being calculated under H_{1}. When L_{1}/L_{2} ≤ c, one rejects H_{1} as untenable. However, in the presence of misspecification, the real error rate α′ is unknown, as is a real P-value. When K_{12} > Δ, α′ can become large; for instance, if model m_{2} is very different from model m_{1} (K_{12} large) but is almost as close to truth as m_{1} (Δ near zero), the test will reject m_{1} with probability approaching 1 as sample size grows.

That greater sample size would make error more likely seems counterintuitive, but it can be understood from the CLT results for the average log-LR given by (1/n) log(L_{1}/L_{2}) (Equations 12, 21). If the observations arise from m_{1}(x), the average log-LR has mean K_{12}, and its distribution becomes more and more concentrated around K_{12} as n increases. Under misspecification, however, the average log-LR concentrates around Δ, and a test calibrated to K_{12} will become more and more certain to reject the null hypothesis when the true mean is Δ rather than K_{12}.
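A closed-form toy configuration, entirely our own construction, makes the drift of α′ visible. Take candidate models m_{1} = normal(0, 1) and m_{2} = normal(1, 1), with the true generating process g = normal(0.2, 1.5²). The per-observation log-LR term is then 0.5 − x, giving K_{12} = 0.5 and σ_{1} = 1 under m_{1}, and mean Δ = 0.3 with σ_{g} = 1.5 under g, so m_{1} is the model closer to truth; yet the nominal size-0.05 test rejects m_{1} ever more often as n grows.

```python
from statistics import NormalDist
from math import sqrt

Z = NormalDist()

# Toy misspecification example (our construction): m1 = N(0,1), m2 = N(1,1),
# truth g = N(0.2, 1.5^2). The per-observation log-LR term is 0.5 - x, so:
K12, sigma1 = 0.5, 1.0     # mean/sd of the term under m1 (assumed by the test)
Delta, sigma_g = 0.3, 1.5  # mean/sd of the term under g (m1 is closer to truth)
alpha = 0.05

def alpha_prime(n):
    """Realized Type 1 error: P_g(reject m1) for the nominal size-alpha test."""
    log_c = n * K12 + sigma1 * sqrt(n) * Z.inv_cdf(alpha)  # critical value
    return Z.cdf((log_c - n * Delta) / (sigma_g * sqrt(n)))

for n in (10, 100, 1000):
    print(n, alpha_prime(n))  # realized error grows with n, toward 1
```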

The error probability β′ is defined and approximated in similar fashion. If model m_{2} is closer to truth, we have Δ < 0.

The CLT then gives

As a function of

The inequality reduces to three cases depending on whether σ_{2} − σ_{g} is positive, zero, or negative:

The left inequalities for the three cases are reversed for β′ < β. The degree to which β′ departs from β is seen to depend on a tangled bank of quantities arising from the configuration of m_{1}(x), m_{2}(x), and g(x).

The problems with α and β under misspecification extend to the P-value as well, which is calculated under H_{1} and thereby could promote misleading conclusions (the real P-value would have σ_{g} in place of σ_{1} and Δ in place of K_{12}). Equivalence testing, being retargeted hypothesis testing, will take on all the problems of hypothesis testing under misspecification. Severity is 1 − P_{2} as defined by Equation (51), but with misspecification the true value of P_{2} is Equation (51) with σ_{g} substituted for σ_{2} and Δ substituted for −K_{21}. With misspecification, the true severity could differ greatly from the severity calculated under H_{2}. One might reject H_{1} falsely, or one might fail to reject H_{1} falsely, or one might fail to reject H_{1} and falsely deem it to be severely tested. Certainly, in equivalence testing and severity analysis, the problem of model misspecification is acknowledged as important (for instance, Mayo and Spanos,

To study the properties of evidence statistics under model misspecification, we redefine the probabilities of weak evidence and misleading evidence in a manner similar to how the error probabilities were handled above in the Neyman-Pearson formulation. We take M′_{1} to be the probability of strong evidence for model f_{2} given that model f_{1} is closer to truth, that is, given that Δ_{12} > 0.

Similarly, given that model f_{2} is closer to truth,

The error probabilities M′ and W′ are approximated with the CLT results for log(L_{1}/L_{2}) (Equations 20–22) under misspecification. For example, to approximate M′_{1},

We thus obtain

The other error probability under misspecification, with Δ_{12} < 0, is

The expression is identical to Equation (69) with Δ_{12} replaced by Δ_{21}.

In words, for models with no unknown parameters under misspecification, the error probabilities M′_{1} and M′_{2} remain identical in form. Using different thresholds k_{1} and k_{2} to control the error probabilities M_{1} and M_{2} under correct specification would break the symmetry of errors under misspecification. The consideration of evidential error probabilities in study design forces the investigator to focus on what types of errors and possible model misspecifications are most important to the study.

The symmetry of error rates is preserved for weak evidence, for which we obtain

The formulae for α′ (Equation 53), β′ (Equation 59), and the evidential error probabilities M′ and W′ depend on the configuration of g, f_{1}, and f_{2} through Δ_{12} and Δ_{21}; multiple configurations should be explored in model space.
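As one concrete exploration, a small simulation of a single hypothetical configuration of our own (f_{1} = N(0, 1), f_{2} = N(1, 1), data from g = N(0.4, 1.2), so f_{1} is the closer model; the evidence threshold k = 8 is arbitrary) shows both the misleading and weak evidence probabilities shrinking with sample size:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical misspecified setup: f1 = N(0,1), f2 = N(1,1), data from
# g = N(0.4, 1.2). Model f1 is closer to g (Delta_12 = 0.1 > 0).
def evidence_errors(n, k=8.0, reps=20000):
    x = rng.normal(0.4, 1.2, size=(reps, n))
    t = (x - 0.5).sum(axis=1)                 # log L2 - log L1 for this model pair
    misleading = (t >= np.log(k)).mean()      # strong evidence for the farther model
    weak = (np.abs(t) < np.log(k)).mean()     # inconclusive evidence
    return misleading, weak

for n in (50, 200, 1000):
    m, w = evidence_errors(n)
    print(n, round(m, 3), round(w, 3))        # both error probabilities shrink as n grows
```

Unlike α′ in the hypothesis-testing treatment above, both evidential error probabilities decline toward zero with sample size in this configuration.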

Four model configurations involving a bivariate generating process g(x_{1}, x_{2}) (in black) and two approximating models f_{1}(x_{1}, x_{2}) (in blue) and f_{2}(x_{1}, x_{2}) (in red). In all cases the approximating models are bivariate normal distributions whereas the generating process is a bivariate Laplace distribution. These model configurations are useful to explore changes in α′ (Equation 53), β′ (Equation 59), and the evidential error probabilities. In the first configuration, g(x_{1}, x_{2}) is a bivariate Laplace distribution centered at 0 with high variance; all three models have means aligned along the 1:1 line, marked with a black, blue, and red filled circle, respectively, and model f_{1}(x_{1}, x_{2}) is closest to the generating process. In the second configuration, f_{1}(x_{1}, x_{2}) is still the model closest to the generating process, at exactly the same distance as before, but g(x_{1}, x_{2}) is an asymmetric bivariate Laplace with a large mode at (0, 0) and a smaller mode around its mean, marked with a black dot. In the third configuration, the generating model is closer to model f_{2}(x_{1}, x_{2}) (in red). In the fourth configuration, model f_{2}(x_{1}, x_{2}) is misaligned but still the closest model to the generating process.

Changes in α′ (Equation 53), β′ (Equation 59), and the evidential error probabilities across four model configurations: (first) model f_{1} is closest to the generating process; (second) f_{1} is closest but misaligned; (third) model f_{2} is closer to the generating process and all models are aligned; (fourth) model f_{2} is closer to the generating process but is misaligned.

Four properties of the error probabilities under misspecification are noteworthy. First, they depend on the relative abilities of H_{1} and H_{2} to represent truth. Second, they decrease as sample size increases. Third, M′_{1} and W′_{1} decrease as model f_{1} becomes better at representing truth (i.e., as K(g, f_{1}) → 0), and likewise M′_{2} and W′_{2} decrease as f_{2} becomes better. Fourth, if Δ_{12} = 0, the two models are equidistant from truth and neither evidential choice constitutes an error.

The total error probability under misspecification is given by

The probability of strong evidence for model f_{i} if f_{i} is closer to truth is

with the corresponding maximum value of

The expressions for _{i} and

An extension of the Bernoulli example from above compares models f_{1} and f_{2}. Suppose, however, that the data actually arise from a Bernoulli distribution with success probability p_{g}. From Equation (17), the value of Δ_{12} becomes

Note that Δ_{12} is a function of p_{g}. In the example, p_{1} = 0.75 and p_{2} = 0.50. If we take p_{g} = 0.65, we have a situation in which model 1 is slightly closer to the true model than model 2. As well, we readily calculate that K_{12} = 0.130812 and Δ_{12} = 0.020951, so that K_{12} > Δ_{12}.
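These divergences are quick to verify numerically; a short sketch of our own using the standard Bernoulli KL formula:

```python
import math

def kl_bern(pa, pb):
    """KL divergence of Bernoulli(pb) from Bernoulli(pa)."""
    return (pa * math.log(pa / pb)
            + (1 - pa) * math.log((1 - pa) / (1 - pb)))

p1, p2, pg = 0.75, 0.50, 0.65
K12 = kl_bern(p1, p2)                        # divergence between the two models
delta12 = kl_bern(pg, p2) - kl_bern(pg, p1)  # how much closer model 1 is to truth

print(round(K12, 6))      # 0.130812
print(round(delta12, 6))  # 0.020951
```

The models are far apart (K_{12} large) even though their distances from the generating process differ only slightly (Δ_{12} small), which is the configuration that makes hypothesis-testing error rates misleading.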

The top panel of

Evidence error probabilities for comparing two Bernoulli(p) models with p_{1} = 0.75 and p_{2} = 0.50, when the true data-generating model is Bernoulli with p_{g} = 0.65. (Top) Probability of misleading evidence against H_{1} when it is closer than H_{2} to the true model. (Bottom) Probability of strong evidence for H_{2} when model H_{1} is closer to the true data-generating process.

In the bottom panel of

The example illustrates directly the potential effect of misspecification on the results of the Neyman-Pearson Lemma. The lemma is of course limited in scope, and we should in all fairness note that a classical extension of the lemma to one-sided hypotheses seemingly ameliorates the problem in this particular example. Suppose the two models are expanded: model 1 is the Bernoulli distribution with success probability p_{1}, and model 2 is the Bernoulli distribution with any success probability p_{2} such that p_{2} < p_{1}. First, for any such p_{2}, the Neyman-Pearson Lemma gives the LR test as most powerful. Second, the cutoff point of the test does not depend on p_{2}. Third, the LR is a monotone function of a sufficient test statistic given by the sample mean (x_{1} + x_{2} + … + x_{n})/n.

However, the one-sided extension of our Bernoulli example expands the model space to eliminate the model misspecification problem. We regard H_{1} and H_{2} as one-sided composite hypotheses about the success probability that together cover the parameter space, so that the generating value p_{g} is contained in one of them.

Lele (

The basic insight is that the log-LR emerges as the function to use for model comparison when the discrepancy between models is measured by the KL divergence (Equation 3). The reason is that (1/n)log(L_{1}/L_{2}) is a natural estimate of Δ_{12}, the difference between the KL divergences of f_{1}(x) and f_{2}(x) from the generating process g(x).
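A quick Monte Carlo check of this consistency claim, using the values from the Bernoulli example above (p_{1} = 0.75, p_{2} = 0.50, p_{g} = 0.65):

```python
import math
import random

random.seed(4)

p1, p2, pg = 0.75, 0.50, 0.65       # values from the Bernoulli example
w1 = math.log(p1 / p2)              # per-observation log-LR contribution when x = 1
w0 = math.log((1 - p1) / (1 - p2))  # per-observation contribution when x = 0

for n in (100, 10000, 1000000):
    s = sum(1 for _ in range(n) if random.random() < pg)
    est = (s * w1 + (n - s) * w0) / n   # (1/n) log(L1/L2)
    print(n, round(est, 4))             # settles near Delta_12 = 0.0210
```

As n grows, the scaled log-LR concentrates around Δ_{12}, which is exactly why it serves as an estimate of the difference in KL distances.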

Lele (

The latter part of the 20th century saw statistical developments that made inroads into the problems of models with unknown parameters (composite models), multiple models, model misspecification, and non-nested models; among the more widely adopted were the model selection indexes based on information criteria. The work of Akaike (Akaike,

Moment of discovery: page from Professor H. Akaike's research notebook, written while he was commuting on the train in March 1971. Photocopy kindly provided by the Institute for Statistical Mathematics, Tachikawa, Japan.

The information criteria are model selection indexes, the most widely used of which is the AIC (originally, "an information criterion," Akaike, 1973), given by AIC_{i} = −2log(L_{i}) + 2k_{i}, where L_{i} is the maximized likelihood for model H_{i} and k_{i} is the number of unknown parameters in model H_{i} that were estimated through the maximization of L_{i}. We are now explicitly considering the prospect of more than two candidate models, although each evidential comparison will be for a pair of models.

Akaike's fundamental intuition was that it would be desirable to select models with the smallest “distance” to the generating process. The distance measure he adopted is the KL divergence. The log-likelihood is an estimate of this distance (up to a constant that is identical for all candidate models). Unfortunately, when parameters are estimated, the maximized log-likelihood as an estimate of the KL divergence is biased low. The AIC is an approximate bias-corrected estimate of an expected value related to the distance to the generating process. The AIC is an index where goodness of fit as represented by maximized log-likelihood is penalized by the number of parameters estimated. Penalizing likelihood for parameters is a natural idea for attempting to balance goodness of fit with usefulness of a model for statistical prediction (which starts to break down when estimating superfluous parameters). To practitioners, AIC is attractive in that one calculates the index for every model under consideration and selects the model with the lowest AIC value, putting all models on a level playing field so to speak.
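The calculation and the "level playing field" comparison can be sketched in a few lines. This is our own toy example (intercept-only regression vs. a straight line, Gaussian errors with σ profiled out), not a calculation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def aic(loglik, k):
    """AIC = -2 * maximized log-likelihood + 2 * (number of estimated parameters)."""
    return -2.0 * loglik + 2 * k

def gauss_loglik(resid):
    """Maximized Gaussian log-likelihood given ML residuals (sigma^2 profiled out)."""
    n = resid.size
    s2 = np.mean(resid**2)
    return -0.5 * n * (np.log(2 * np.pi * s2) + 1)

# Toy data with a genuine slope
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)

r1 = y - y.mean()                  # model 1: intercept only (k = 2: mean, sigma)
b = np.polyfit(x, y, 1)
r2 = y - np.polyval(b, x)          # model 2: straight line (k = 3: slope, intercept, sigma)
print(aic(gauss_loglik(r1), 2) > aic(gauss_loglik(r2), 3))  # True: the line is preferred
```

Each candidate is scored the same way, and the model with the smaller AIC is selected regardless of whether the pair is nested.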

Akaike's inferential concept underlying the AIC represented a breakthrough in statistical thinking. The idea is that in comparing model H_{i} with model H_{j} using an information criterion, both models are assumed to be misspecified to some degree. The actual data generating mechanism cannot be represented exactly by any statistical model or even family of statistical models. Rather, the modeling process seeks to build approximations useful for the purpose at hand, with the left-out details deemed negligible by scientific argument and empirical testing.

Although AIC is used widely, the exact statistical inference presently embodied by AIC is not widely understood by practitioners. What Akaike showed is that under certain conditions −AIC_{i}/(2n) is an approximately unbiased estimator of a quantity he termed the mean expected log-likelihood. In Akaike's framework, the generating process is assumed to take the form f(x; θ_{▪}) with some high-dimensional unknown parameter vector, while each candidate model takes the same form f(x; θ_{▪}) with the parameter vector constrained to a lower-dimensional subset of parameter space. Truth in Akaike's approach is as unattainable as g(x) is in ours. The difference AIC_{i} − AIC_{j} then is an estimate of the difference of the two models' expected distances from truth.

In practice, the AIC-type inference represents a relative comparison of two models, not necessarily nested or even in the same model family, requiring only the same data and the same response variable to implement. The inference is post-data, in that there are (as yet) no appeals to hypothetical repeated sampling and error rates. All candidate models, or rather, all pairs of models, can be inspected simultaneously simply by obtaining the AIC value for each model. But, as is the case with all point estimates, without some knowledge of sampling variability and error rates we lack assurance that the comparisons are informative.

We propose that information-based model selection indexes can be considered as generalizations of LR evidence to models with unknown parameters, for model families obeying the usual regularity conditions for ML estimation. The evidence function concept clarifies and makes accessible the nature of the statistical inference involved in model selection. Like LR evidence, one would use information indexes to select from a pair of models, say f_{1}(x; θ_{1}) and f_{2}(x; θ_{2}), where θ_{1} and θ_{2} are vectors of unknown parameters. Like LR evidence, the selection is a post-data inference. Like LR evidence, the prospect of model misspecification is an important component of the inference. And critically, like LR evidence, the error probabilities M_{i} and W_{i} (

As noted earlier, the generalized LR framework of two nested models under correct model specification is a workhorse of scientific practice and a prominent part of applied statistics texts. It is worthwhile then in studying evidence functions to start with the generalized LR framework, in that the model selection indexes are intended in part to replace the hierarchical sequences of generalized LR hypothesis testing (stepwise regression, multiple comparisons, etc.) for finding the best submodel within a large model family.

The model relationships diagrammed in the top portion of the figure represent two cases. Case 1 (top left) portrays the situation in which the true parameter vector is in the constrained parameter space of model f_{1}; model f_{1} identifies the true model giving rise to the data. Technically the parameter vector is contained in model f_{2} as well, but the scientific interest focuses on whether the additional parameters in the unconstrained parameter space of f_{2} can be usefully ignored. Case 2 (top right) portrays the situation in which the true parameter vector is in the unconstrained parameter space of model f_{2}; model f_{1} is too simple to be useful.

Suppose we decide to use ΔAIC_{12} = AIC_{1} − AIC_{2} as an evidence function. For convenience, we have defined this AIC-based evidence function to vary in the same direction as G^{2} (Equation 31) in NP hypothesis testing, so that large values of ΔAIC correspond to large evidence for f_{2} (opposite to the direction of the ordinary LR-evidence function given by Equation 33). For instance, the early rule of thumb in the AIC literature was to favor model f_{1} when ΔAIC_{12} ≤ −2 and to favor model f_{2} when ΔAIC_{12} ≥ 2. Note that

where ν = k_{2} − k_{1} is the difference of the numbers of unknown parameters in the two models. The behavior of our candidate evidence function ΔAIC_{12} can be studied using the Wilks/Wald results for the asymptotic distribution of G^{2}. Under case 1, ΔAIC_{12} has (approximately) a chisquare(ν) distribution that has been location-shifted to begin at −2ν instead of at 0 (top of the figure). Under case 2, ΔAIC_{12} has (approximately) a non-central chisquare(ν, λ) distribution with the same −2ν location shift (bottom of the figure). The areas under these curves give the error probabilities M_{1} and W_{1} (under case 1) and M_{2} and W_{2} (under case 2).

The error probabilities M_{1} and W_{1} are invariant to sample size, whereas M_{2} and W_{2} decrease as the sample size increases.

As sample size increases, the error probabilities M_{1} and W_{1} for the AIC-based evidence function do not go to zero but rather remain positive. Under case 1, sample size does not alter the asymptotic distribution of ΔAIC_{12}, and so the error probabilities M_{1} and W_{1} remain static. Thus, for the AIC, the probabilities of weak and misleading evidence given that model f_{1} generates the data both behave like the Type 1 error probability α in Neyman-Pearson testing. The simulation results of Aho et al. (
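The static behavior under case 1 can be sketched with the simplest nested pair. In this toy example of our own (f_{1} = N(0, 1) nested in f_{2} = N(μ, 1) with σ known), G^{2} = n·x̄^{2} is exactly chisquare(1) under f_{1}, so ΔAIC_{12} = G^{2} − 2:

```python
import numpy as np

rng = np.random.default_rng(5)

# Nested pair, data truly from the simpler model f1 = N(0,1); f2 = N(mu,1), mu free.
# Then G^2 = 2(logL2 - logL1) = n * xbar^2 ~ chisquare(1), and dAIC12 = G^2 - 2.
def p_misleading(n, reps=100000):
    xbar = rng.normal(0, 1 / np.sqrt(n), reps)  # sampling distribution of the mean under f1
    g2 = n * xbar**2
    daic = g2 - 2                               # location shift -2*nu with nu = 1
    return (daic >= 2).mean()                   # "strong evidence" for f2 at threshold 2

for n in (20, 200, 2000):
    print(n, round(p_misleading(n), 3))         # stays near 0.046 at every n
```

At the ΔAIC ≥ 2 threshold, the misleading-evidence rate sits near P(chisquare(1) ≥ 4) ≈ 0.046 regardless of sample size, exactly the α-like behavior described above.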

As sample size increases, the error probabilities M_{2} and W_{2} for the AIC-based evidence function do go to zero. Under case 2, the noncentrality parameter λ of the distribution of ΔAIC_{12} is proportional to the value of n, so the distribution drifts rightward faster than it spreads (its standard deviation grows only in proportion to n^{1/2}), driving the error probabilities M_{2} and W_{2} to zero. Thus, for the AIC, the probabilities of weak and misleading evidence given that model f_{2} generates the data both behave like the Type 2 error probability β in Neyman-Pearson testing.

Thus, within the generalized likelihood ratio framework, the AIC appears to bring no particular improvement in the sense of evidence over ordinary Neyman-Pearson testing using G^{2}. Indeed, at least in the Neyman-Pearson approach, the value of α is fixed by the investigator and is therefore known.

Other information-theoretic indexes used for model selection, however, do have performance characteristics of evidence functions. Consider the Schwarz information criterion (SIC; also known as Bayesian information criterion or BIC) given by

The index originally had a Bayesian-based derivation (Schwarz,

As with the AIC, the asymptotic distributions of the SIC evidence function under model f_{1} and model f_{2} are, respectively, a location-shifted chisquare and a location-shifted non-central chisquare distribution. For the SIC, though, the lower bound of the two distributions, located at −νlog(n), decreases with sample size. If the data arise from model f_{1}, the chisquare distribution is pulled to the left, and the areas under the pdf corresponding to M_{1} and W_{1} eventually decrease asymptotically to zero. If the data arise from model f_{2}, although the non-central chisquare distribution is also pulled to the left at a rate proportional to log(n), the rightward drift of the noncentrality parameter, proportional to n, dominates, and M_{2} and W_{2} eventually decrease asymptotically to zero (
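Repeating the same kind of toy nested-normal calculation with the SIC penalty makes the contrast visible. This is again our own sketch (f_{1} = N(0, 1) nested in f_{2} = N(μ, 1); the evidence threshold of 2 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# Nested pair: data from f1 = N(0,1), f2 = N(mu,1) with mu free.
# dSIC12 = G^2 - nu*log(n); the penalty grows with n instead of staying at 2*nu.
def p_misleading_sic(n, reps=100000, threshold=2.0):
    g2 = n * rng.normal(0, 1 / np.sqrt(n), reps)**2  # G^2 ~ chisquare(1) under f1
    return (g2 - np.log(n) >= threshold).mean()

for n in (20, 200, 2000):
    print(n, round(p_misleading_sic(n), 4))          # shrinks toward 0 as n grows
```

Because the location shift −νlog(n) keeps moving left, the probability of misleading evidence for f_{2} declines toward zero rather than plateauing as it does for ΔAIC.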

To be fair, the AIC, like evidence functions, was forged in the fiery world of misspecified models. Does the AIC difference gain the properties of an evidence function when neither f_{1} nor f_{2} gives rise to the data?

If the models are nested or overlapping, the answer is no. To understand this, we must appeal to modern statistical advances in the theory of maximum likelihood estimation and generalized likelihood ratio testing when models are misspecified. The relevant and general theory can be found in White (

Suppose a model with pdf f(x; θ) is misspecified. Let θ^{*} be the value of θ that minimizes the KL divergence between g and f(·; θ). White showed that the ML estimate of θ converges to θ^{*} as sample size increases, so that the fitted model converges to f(x; θ^{*}).

Now, any two models f_{1}(x; θ_{1}) and f_{2}(x; θ_{2}) being compared will be in one of nested, overlapping, or non-overlapping configurations (see

The question needs modification in the nested and overlapping cases. If f_{1} is nested within f_{2}, the model closest to truth could lie in the nested region, in which case both models can approach truth equally well; if f_{1} overlaps f_{2}, the model closest to truth could be in the overlapping region, with the same consequence.

Vuong (1989) derived the asymptotic distribution of G^{2} under the nested, overlapping, and non-overlapping cases in the presence of misspecification. His main results relevant here are the following, presented in our notation:

When f_{1} is nested within f_{2}, or f_{1} overlaps f_{2} (and the best model is in the nested or overlapping region), the asymptotic distribution of G^{2} is a "weighted sum of chisquares" of the form a_{1}Z_{1}^{2} + a_{2}Z_{2}^{2} + …, where the Z_{j} are independent, standard normal random variables (each Z_{j}^{2} is a chisquare with 1 degree of freedom) and the a_{j} values are eigenvalues of a square matrix of dimension (k_{1} + k_{2}) × (k_{1} + k_{2}) of expected values of various derivatives of the two log-pdfs with respect to the parameters (a generalization of the Fisher information matrix). The point is, the asymptotic distribution of G^{2} does not depend on n. The differences ΔAIC_{12} and ΔSIC_{12}, along with evidence functions formed from other information indexes, then have location-shifted versions of the weighted sum of chisquares distribution. The error probabilities then behave as in the correctly specified nested case: they do not decrease to zero for ΔAIC_{12}, but they do for ΔSIC_{12}.

Suppose the models are nested, overlapping, or non-overlapping, but a non-overlapping part of f_{1} or f_{2} is closer to truth, that is, when Δ^{*} ≠ 0. Then G^{2} has an asymptotic normal distribution with mean 2nΔ^{*} and variance

and

The result parallels the CLT results (Equations 20–22) for completely specified models, with the added condition that each candidate model is evaluated at its "best" set of parameters. In this situation, the mean of G^{2} increases or decreases in proportion to n, and the error probabilities for ΔAIC_{12} as well as for ΔSIC_{12} do decrease to zero as sample size increases.

We must point out that a generalized Neyman-Pearson test (via simulation/bootstrap) of two non-overlapping models with misspecification can suffer the same fate as the completely specified models in the Neyman-Pearson Lemma. The large sample distribution of G^{2}, assuming model 1 generates the data, would have a mean involving K_{12} (evaluated at the true parameter value in model 1 and the best parameter value in model 2); the cutoff point would be calibrated accordingly. Under misspecification, however, G^{2} has a mean involving Δ^{*} (Equation 78). As was the case for the two models in the Neyman-Pearson Lemma, a mismatch of K_{12} and Δ^{*} can cause the generalized Neyman-Pearson test to pick the wrong model with Type 1 error probability approaching 1. The Karlin-Rubin Theorem and the forceful language of uniformly most powerful tests do not rescue Neyman-Pearson testing from derailment when inadequate models are deployed.

Error probabilities going to zero can alternatively be derived as a consequence of the (weak or strong) "consistency" of the model selection index. Consistency here means that the index asymptotically picks the model closest to truth as sample size becomes large. Nishii showed that an index with penalty term c_{n}k_{i}, where c_{n} is a possible function of n, gives consistent model selection provided c_{n} → ∞ while c_{n} grows at a rate slower than n.

in which the correction term is designed to improve the behavior of the index under small sample sizes. However, the correction term asymptotically approaches zero as

Thus, for either correctly specified or misspecified models in which the best model is in a region of model space that does not overlap any other model under consideration, ΔAIC_{12} indeed behaves like an evidence function. However, many model selection problems, such as in multiple regression, involve collections of models in which model pairs can be nested or overlapping as well as non-overlapping. ΔAIC_{12} will behave more like Neyman-Pearson hypothesis testing for models within overlapping regions and therefore will not possess evidence function properties. Differences of information indexes that adjust G^{2} with a constant or asymptotically constant location shift, such as the TIC and AICc, will share the Neyman-Pearson properties of ΔAIC_{12} and cannot be regarded as evidence functions. Differences of those information indexes, such as SIC, that produce a location shift decreasing to −∞ as sample size increases do retain the properties of evidence functions.

Simulation of Vuong's results. When f_{1} is nested within f_{2}, or f_{1} overlaps f_{2} (and the best model is in the nested or overlapping region), the asymptotic distribution of G^{2} is a "weighted sum of chisquares" that does not depend on n; the error probabilities M_{1} and W_{1} do not decrease to 0 for ΔAIC_{12} but do decrease for ΔSIC_{12}. When a non-overlapping part of f_{1} or f_{2} is closer to truth, G^{2} has an asymptotic normal distribution with mean and variance that depend on the sample size, and the error probabilities M_{1} and W_{1} decrease to 0 for both ΔAIC_{12} and ΔSIC_{12}. Details of these two settings are given in

We have shown that key inferential characteristics for Fisher significance analysis, Neyman-Pearson hypothesis testing, and evidential comparison differ substantially. Evidence has inferential qualities that match or surpass Fisher significance and Neyman-Pearson tests (see

A comparison of inferential characteristics between Fisherian significance testing, Neyman-Pearson hypothesis testing, and evidential statistics.

Characteristic | Fisher significance | Neyman-Pearson | Evidential |
Equal status for null and alternatives | NA | No | Yes |
Allows evidence for null | No | No | Yes |
Accommodates multiple models | No | Awkward | Yes |
All error rates go to zero as sample size increases | No | No | Yes |
Total error rate always decreases with increasing sample size | No | No | Yes |
Can be used with non-nested models | NA | Not standard | Yes |
Evidence and error rates distinguished | No | No | Yes |
Robust to model misspecification | Yes | No | Yes |
Promotes exploration of new models | Yes | No | Yes |

AIC and its asymptotic relatives like AICc are built around statistical prediction. The difference of mean expected log-likelihoods is different from what we have defined above as Δ^{*}. The mean expected log-likelihood has a second, predictive layer of expectation in its definition, the idea being to identify the model that could best predict a new observation drawn from the generating process.

The tendency of AIC-related criteria to overfit is a natural consequence of their design goal of prediction mean squared error (MSE) minimization. When parameters are estimated, the increase in prediction MSE due to adding a spurious covariate is generally less than the reduction in prediction MSE gained by including a relevant covariate.
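The magnitude of the effect is easy to see in a toy simulation of our own: comparing an intercept-only regression against one with a single pure-noise covariate, AIC admits the spurious covariate at a roughly constant rate near P(chisquare(1) > 2) ≈ 0.157 at every sample size:

```python
import numpy as np

rng = np.random.default_rng(11)

# How often does AIC admit a single spurious covariate?
# Model A: y ~ intercept; Model B: y ~ intercept + pure-noise covariate.
def p_spurious_included(n, reps=4000):
    count = 0
    for _ in range(reps):
        y = rng.normal(0, 1, n)
        z = rng.normal(0, 1, n)                 # covariate unrelated to y
        rssA = np.sum((y - y.mean())**2)
        X = np.column_stack([np.ones(n), z])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rssB = np.sum((y - X @ beta)**2)
        # AIC (up to a shared constant): n*log(rss/n) + 2k; B wins if its AIC is lower
        if n * np.log(rssB / n) + 2 * 3 < n * np.log(rssA / n) + 2 * 2:
            count += 1
    return count / reps

for n in (30, 100, 1000):
    print(n, round(p_spurious_included(n), 3))  # hovers near 0.16 regardless of n
```

The inclusion rate for the noise covariate does not diminish with more data, which is the overfitting behavior described above.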

The tendency of stepwise regression to overfit using Neyman-Pearson testing has long been noted (Wilkinson and Dallal,

Model selection with AIC or AICc improves somewhat on the Neyman-Pearson overfitting problem in that the misleading error probabilities both go to zero as sample size increases when two non-overlapping models are being compared. However, overlapping models, in which AIC and AICc are prone to overfit, are typically a substantial subset of the models in contention in multiple regression. The AIC and AICc indexes will tend to include spurious variables too often and thus represent only a partial improvement over stepwise regression.

Scientific prediction, however, can be broader than pure statistical prediction. The scientist often desires to predict the outcome of a system manipulation: what will happen if harvest rate is increased, or if habitat extent is halved? Modeling such manipulation might translate as a structural change in a statistical model of the system. The predictive quality of the model then lies more in getting mechanisms in the model as right as possible.

The consistent criteria will asymptotically select the generating process if it is in the model set. If the generating process is not in the model set, the consistent criteria will asymptotically select the model in the set that, under its best possible parameterization, is closest (in the KL sense) to the generating process. The estimation of Δ^{*} by the difference of SIC values represents a quest for a different kind of prediction, one that might come from a structural understanding of the major forces influencing the system under study. The tendency of the prediction-efficient criteria to include spurious covariates promotes a misunderstanding of the generating mechanism (Taper,

Certainly, the finite-sample properties of SIC and other consistent indexes require substantial further study, but the property that more data should be able to distinguish among candidate models with fewer errors seems an important property to preserve.

The scientific allure of information-theoretic indexes resided in the idea that all models were evaluated on a level playing field. One would calculate the index for each model and select the model with the best index, a procedure which promised considerably more clarity over hierarchical sequences of Neyman-Pearson tests, such as stepwise regression.

AIC and its descendants were originally built around concepts of statistical point estimation. The statistical inference represented by AIC is that of an approximately unbiased point estimate of the mean expected log-likelihood. The statistical concepts of errors and variability in information indexes have by contrast not often been emphasized. Partly as a result, model selection with information indexes has been somewhat of a black box for investigators, as achieving a good understanding of the inferences represented by model selection analyses is a mathematical challenge (see Taper and Ponciano,

We have illustrated that, unlike the error rates in Neyman-Pearson hypothesis testing, all of the error rates of evidence analysis converge to zero as sample size increases. However, the errors we have discussed deal only with the determination of which of two models is closer to truth; the error rates do not shed light on whether either model is close enough to truth to be scientifically or managerially valuable. That question is the realm of model adequacy analysis.

Whether the statistical inference is a hypothesis test, equivalence analysis, severity analysis, or evidence analysis, whether for a pair of models or multiple pairs of models, a follow-up evaluation of model adequacy looms ever more important as a crucial step (Mayo and Spanos,

Considering the likely prevalence of model misspecification in ecological statistics, analysts will need to consider how a candidate model could be misspecified as well as the effects of such misspecification on the intended uses of the model. Practically, the analyst can introduce models formulated in diverse fashions and let the model identification process itself reduce model misspecification. Further experimental or observational tests of model predictions (e.g., Costantino et al.,

The error properties of evidence analysis are more difficult to calculate than classical NP tests because model misspecification is involved. But once calculated, the rates are likely to be more accurate than classical tests that pretend misspecification does not exist.

Error rates are different pre- and post-data.

Non-parametric bootstrapping shows great promise for calculating evidential error rates, for data structures that allow bootstrapping. In work in preparation, we (Taper, Lele, Ponciano, and Dennis) show that bootstrapping greatly aids in the interpretation of evidential results.
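For data structures as simple as i.i.d. 0/1 outcomes, the idea can be sketched in a few lines. This is our own illustration, not the authors' forthcoming implementation; the two Bernoulli models and the third generating value are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Nonparametric bootstrap of a log-likelihood-ratio evidence statistic.
# Hypothetical models: f1 = Bernoulli(0.75) vs f2 = Bernoulli(0.50).
def log_lr(data, p1=0.75, p2=0.50):
    s, n = data.sum(), data.size
    return s * np.log(p1 / p2) + (n - s) * np.log((1 - p1) / (1 - p2))

data = rng.binomial(1, 0.65, size=100)   # "observed" sample from a third process
boot = np.array([log_lr(rng.choice(data, data.size, replace=True))
                 for _ in range(2000)])
# Fraction of resamples in which the evidence points the other way:
print(round((boot < 0).mean(), 2))
```

The spread of the bootstrap distribution, and the fraction of resamples on which the sign of the evidence flips, give a data-driven sense of how stable the evidential conclusion is.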

A basic recommendation is to stop using NP tests for inference and to be cautious about using the AIC family of information criteria for model selection. These are known as the "efficient" or "MSE minimizing" criteria and include the AIC, the AICc, the TIC, many forms of ICOMP, and the EIC. These criteria are recognizable by a complexity penalty whose expectation is asymptotically constant. Asymptotically equivalent to the AIC is the use of leave-one-out cross-validation (Stone,

There is no reason that the multiple comparisons inference from traditional ANOVAs cannot be made using information criteria (e.g., Kemp et al.,

Classical methods will work well for state description and less well for process identification. Unbiased scientific inferences of process are better made using consistent information criteria (see Jerde et al.,

Besides being influenced by inferential goals, the choice of evidence function should depend on the modeling framework. Information criteria had their beginnings as a tool for variable selection in linear regression with independent observations. In such situations, as derived by Akaike, the number of parameters is a good first-order bias correction to the observed likelihood. But statistics is a world of special cases. The dizzying diversity of information criteria in the literature arises from the desire to optimize the bias correction under different modeling frameworks. For instance, in mixed models, even the meaning of the number of parameters or the number of observations becomes ambiguous due to the dependence structure of mixed models. Information criteria have been developed using estimates of the effective number of parameters (e.g., Vaida and Blanchard,

If the generating process is in the model set, or in flat model spaces, such as those in linear regression, the ΔAIC is an unbiased estimate of 2

Evidence is not so much a new statistical method for model selection as it is a new way of thinking about the inference involved with existing model selection methods. The evidential way of thinking has two main components: (1) A post-data trichotomy of outcomes (strong evidence for model _{i}, weak or inconclusive evidence, strong evidence for model _{j}). (2) A framework of pre-data error probabilities, which are assured to go to zero as sample size increases. The evidential approach invites exploration of the error probabilities, usually via simulation, to aid in study design, the selection of evidence thresholds, the effects of different types of misspecification, and the interpretation of study results.

We have proposed here a different way of thinking about statistical analyses and model selection, based on the concept of evidence functions. Evidence is an intuitive way to decide between two models that avoids the famously upside-down logic that accompanies Neyman-Pearson testing. Evidential thinking has helped us reveal the shortcomings of Fisher significance analysis and Neyman-Pearson testing. The errors that can arise in evidence analysis are straightforward to explain, and the frequentist properties of such errors as functions of sample size and effect size are easy to understand and highly compelling in a scientific chain of argument. The information indexes, when differenced, represent a collection of potential evidence functions that extend the evidence ideas to models with unknown parameters. The desirable error properties are preserved in the presence of model misspecification, when the model choice is generalized to be an inference about which model is closer to the stochastic process that generated the data. The error properties of AIC and AICc are similar to those of Neyman-Pearson testing when the candidate models are nested or overlapping and so the AIC-type indexes are not satisfactory evidence functions in those common circumstances. The indexes like SIC in which the parameter penalty is an increasing function of sample size retain the frequentist error properties of evidence functions for all model pairs.

Evidence works well for science in part because its explicit conditioning on the model set invites thinking about new models. Evidence has inferential qualities that match or surpass Fisher significance analysis and Neyman-Pearson tests. Evidence represents a compelling scientific warrant for formulating statistical analyses as model selection problems.

This paper's code is available at

BD wrote the initial draft of the manuscript. BD, JP, and MT jointly derived the mathematical statistics results in Idaho, the summers of 2015, 2016, and 2017, and JP wrote an initial draft of these results. Figures were drawn by JP, BD, and MT. SL and MT contributed with critical insights, discussion, re-organization of the manuscript as well as editing. The ideas reflected in this work started to coalesce when all authors organized courses and lectures at the Center for Research in Mathematics, CIMAT A.C. in 2008, 2009, and 2010, and was reprised when MT and JP organized the International Symposium What is a good model?: Evidential statistics, information criterion and model evaluation held January 2016 at the Institute of Statistical Mathematics in Tokyo, Japan, in which all authors participated.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The authors appreciate the feedback and insights from the audience at the International Symposium What is a good model?: Evidential statistics, information criterion and model evaluation. We also appreciate the insightful comments from the audiences of Virginia Tech's Statistics Department Seminar and the University of Idaho's Applied Statistics Seminar, especially Drs. Leah Thompson and Bill Price.