^{1}

^{*}

^{2}

^{1}

^{2}

Edited by: Martin Lages, University of Glasgow, United Kingdom

Reviewed by: Richard S. John, University of Southern California, United States; Holmes Finch, Ball State University, United States

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In the social sciences it is common practice to test specific theoretically motivated research hypotheses using formal statistical procedures. Typically, students in these disciplines are trained in such methods starting at an early stage in their academic tenure. On the other hand, in psychophysical research, where parameter estimates are generally obtained using a maximum-likelihood (ML) criterion and data do not lend themselves well to the least-squares methods taught in introductory courses, it is relatively uncommon to see formal model comparisons performed. Rather, it is common practice to estimate the parameters of interest (e.g., detection thresholds) and their standard errors individually across the different experimental conditions and to ‘eyeball’ whether the observed pattern of parameter estimates supports or contradicts some proposed hypothesis. We believe that this is at least in part due to a lack of training in the proper methodology as well as a lack of available software to perform such model comparisons when ML estimators are used. We introduce here a relatively new toolbox of Matlab routines called Palamedes which allows users to perform sophisticated model comparisons. In Palamedes, we implement the model-comparison approach to hypothesis testing. This approach allows researchers considerable flexibility in targeting specific research hypotheses. We discuss in a non-technical manner how this method can be used to perform statistical model comparisons when ML estimators are used. With Palamedes we hope to make sophisticated statistical model comparisons available to researchers who may not have the statistical background or the programming skills to perform such model comparisons from scratch. Note that while Palamedes is specifically geared toward psychophysical data, the core ideas behind the model-comparison approach that our paper discusses generalize to any field in which statistical hypotheses are tested.

In the social sciences, and perhaps especially in the field of psychology, it is common practice to test specific theoretically motivated research hypotheses using formal statistical model comparisons when data allow a least-squares criterion to be used. Indeed, students in these disciplines are trained in such methods starting at an early stage in their academic tenure. On the other hand, in psychophysical research, where parameter estimates are generally obtained using a maximum-likelihood (ML) criterion and data do not lend themselves well to least-squares methods, it is relatively uncommon to see formal model comparisons performed. Rather, it is common practice to estimate the parameters of interest (e.g., detection thresholds) as well as their standard errors individually across the different experimental conditions and to ‘eyeball’ whether the observed pattern of parameter estimates supports or contradicts some proposed hypothesis. Another common strategy is to perform a least-squares method (such as a

We believe that the relative lack of formal, appropriate, and optimal statistical tests in the area of psychophysical research is, at least in part, due to a lack of training and familiarity with performing such tests in the context of ML estimators as well as a relative lack in available software to perform such tests. Here, we explain, in non-technical terms and using example analyses, a general purpose approach to test specific research hypotheses involving psychometric functions (PFs). This ‘model-comparison approach’ is extremely flexible and has been advanced previously by

The PF relates some quantitative stimulus characteristic (e.g., contrast) to psychophysical performance (e.g., proportion correct on a detection task). A common formulation of the PF is given by:

in which

We will introduce the logic behind the model-comparison approach first by way of a simple one-condition hypothetical experiment. We will then extend the example to include a second experimental condition.

Imagine an experimental condition in which an observer is to detect a Vernier offset. The task is a 2AFC task in which the observer is to indicate whether the lower of two vertical lines is offset to the left or to the right relative to the upper line. Five different offsets are used and 50 trials are presented at each offset.

All four models in

Even though the models in

Thus, moving to the right in the model grid of

Different research questions require statistical comparisons between different pairs of models. For example, in the hypothetical experiment described here, we might wish to test whether the data suggest the presence of a response bias. In terms of the model’s parameters a bias would be indicated by the location parameter deviating from a value of 0. Thus, we would compare a model in which the location parameter is assumed to equal 0 to a model that does not make that assumption. The models to be compared should differ only in their assumptions regarding the location parameter. If the models in the comparison differ with regard to any other assumptions and we find that the models differ significantly, we would not be able to determine whether the significance arose because the assumption that the location parameter equals 0 was false or because one of the other assumptions that differed between the models was false. What then should the models in the comparison assume about the slope parameter? By the principle of parsimony one should, generally speaking, select the most restrictive assumptions that we can reasonably expect to be valid. Another factor to consider is whether the data contain sufficient information to estimate the slope parameter. In the present context, it seems unreasonable to assume any specific value for the slope parameter and the data are such that they support estimation of a slope parameter. Thus, we will make the slope parameter a free parameter in the two to-be-compared models.

Given the considerations above, the appropriate model comparison here is that between the models labeled ‘fuller’ and ‘lesser’ in _{e}(likelihood ratio)] is distributed asymptotically as the χ^{2} distribution with degrees of freedom equal to the difference in the number of free parameters between the models^{1}. Thus, the likelihood ratio test can be used to perform a classical (‘Fisherian’) NHST to derive a statistical

Schematic depiction of the model-comparison approach as applied to the research question described in the Section “A Simple One-Condition Example Demonstrating the Model-Comparison Approach.” Each circle represents a model of the data shown in

When the model comparison is performed using the likelihood ratio test, the resulting TLR equals 0.158. With 1 degree of freedom (the fuller model has one more free parameter [the location parameter] compared to the lesser model) the

The model comparison to be performed for a Goodness-of-Fit test is that between our lesser model from above and a model that makes only the assumptions of independence and stability. The latter model is called the saturated model. It is the fact that the fuller model in the comparison is the saturated model that makes this test a Goodness-of-Fit test^{2}. Note that the saturated model makes no assumptions at all regarding how the probability of the response ‘left’ varies as a function of stimulus intensity or experimental condition. As such, it allows the probabilities of all five stimulus intensities that were used to take on any value independent of each other. Thus, the saturated model simply corresponds to the observed proportions of ‘left’ responses for the five stimulus intensities. Note that the assumptions of independence and stability are needed in order to assign a single value for p(‘left’) to all trials of a particular stimulus intensity. Note also that all models in

Imagine now that the researchers added a second condition to the experiment in which the observer first adapts to a vertical grating before performing the Vernier alignment trials. Of interest to the researchers is whether Vernier acuity is affected by the adaptation. The results of both conditions are shown in

Moving to the right in the model grid of

Similar to

To recap, the essence of the model comparison approach to statistical testing is that it conceives of statistical tests of experimental effects as a comparison between two alternative models of the data that differ in the assumptions that they make. The nature of the assumptions of the two models determines which research question is targeted. Contrast this to a cook book approach involving a multitude of distinct tests each targeting a specific experimental effect. Perhaps there would be a ‘location parameter test’ that determines whether the location parameters in different conditions differ significantly. There would then presumably also be a ‘slope test,’ and perhaps even a ‘location and slope test.’ For each of these there might be different versions depending on the assumptions that the test makes. For example, there might be a ‘location test’ that assumes slopes are equal, another ‘location test’ that does not assume that slopes are equal, and a third ‘location test’ that assumes a fixed value for the slope parameters. Note that the difference between the approaches is one of conception only, the presumed ‘location test’ would be formally identical to a model comparison between a model that does not restrict the location parameters to a model that restricts them to be identical. The model comparison approach is of course much more flexible. Even in the simple two-condition experiment, and only considering tests involving the location and slope parameters, we have defined the nine different models in

Note that many more model comparisons may be conceived of even in the simple two-condition experiment of our example. For example, maybe we wish to test the effect on slope again but we do not feel comfortable making the assumption that the lapse rate equals 0.02. We then have the option to loosen the assumption regarding the lapse rate that the fuller and lesser model make. We could either estimate a single, common lapse rate for the two conditions if we can assume that the lapse rates are equal between the conditions or we could estimate a lapse rate for each of the conditions individually if we do not want to assume that the lapse rates in the two conditions are equal. We may even be interested in whether the lapse rate is affected by some experimental manipulation (e.g.,

The model-comparison approach generalizes to more complex research designs and research questions. As an example,

Note that research questions rarely are concerned with the absolute value of any parameter

Our software Palamedes was specifically designed to allow users to test research hypotheses using the model-comparison approach. One necessary ingredient to the model comparison approach is the ability to define models in terms of the assumptions that the model makes. These assumptions are formalized by specifying the relationships among parameter values across different experimental conditions. A second necessary ingredient is the ability to compare statistically any two (nested) models that a user might wish to compare. Together these characteristics allow users to perform statistical tests of a virtually unlimited range of research questions using a single general routine.

Following the spirit of the model comparison approach, Palamedes has just a single routine (PAL_PFLR_Model Comparison.m) that performs model comparisons involving fits of PFs to multiple conditions. Users can tailor the model comparison to be performed by defining the fuller and lesser models. Models are specified using the arguments of the routine. Defining the models occurs by specifying the constraints on each of the four parameters (location, slope, guess rate, and lapse rate) for both the fuller and the lesser model. Palamedes offers three methods in which to specify the constraints. These methods differ with regard to ease of use, but also with regard to flexibility offered. We will briefly discuss the three methods in decreasing order of ease of use (but increasing flexibility). We will use the model-comparison between the models labeled ‘fuller’ and ‘lesser’ in

The easiest method by which to specify constraints on parameters is using the verbal labels of ‘unconstrained,’ ‘constrained,’ and ‘fixed.’ Each of the four parameters of a PF (location, slope, guess rate, and lapse rate) can be independently specified. In order to perform the test described in the Section “A Two-Condition Example” we would set the location parameters, guess rates, and lapse rates of both the fuller and the lesser model to ‘fixed’ (and specify the values at which these parameters should be fixed), we would set the slopes of the lesser method to ‘constrained,’ and the slopes of the fuller model to ‘unconstrained.’

The second method by which to specify models is by specifying constraints on parameters using ‘model matrices.’ Model matrices serve to reparameterize parameters into new parameters that correspond to ‘effects.’ For example, a model matrix can be used to reparameterize the two slopes of the PFs into two new parameters, one corresponding to the sum of the slope values, the other to the difference between the slope values. Note that any combination of slope values [e.g., a slope of 3 in condition 1 and a slope of 1 in condition 2) can be recoded without loss of information in terms of their sum (4) and their difference (2)]. Instead of estimating the slopes directly, Palamedes estimates the values of the parameters defined by the model matrix. Each row of the matrix defines a parameter as some linear combination of the original parameters by specifying the coefficients with which the parameters should be weighted to create the linear sum. For example, in order to create a parameter corresponding to the sum of the slopes we include a row in the matrix that consists of two 1’s. The new parameter would then be defined as _{1} = 1 × _{1} + 1 × _{2}, the sum of the slope values. If we include a second row [1 −1], this defines a new parameter _{2} = 1 × _{1} + (−1) × _{2}, the difference between the slope values. In order to allow different slopes in each condition, we instruct Palamedes to estimate both the sum and the difference of the slopes by passing it the matrix [1 1; 1 −1]. If we wish to constrain the slopes to be equal in the conditions, we instruct Palamedes to estimate only one parameter which corresponds to the sum of the slopes by passing the array [1 1]. Note that the new _{1} and _{2}.

The advantage of using model matrices rather than the verbal labels above in order to specify models is that it allows model specifications that are not possible using the verbal label. For example, it allows one to specify that only the PSE in condition 2 should be estimated while the PSE in condition 1 should be fixed (by passing the model matrix [0 1]). Especially when there are more than two conditions does the use of model matrices afford much greater flexibility compared to the use of the verbal labels. A model comparison that compares a model in which the location parameters from, say, six conditions are ‘unconstrained’ to a model in which the location parameters are ‘constrained’ merely tests whether there are significant differences among the six location parameters, not where these differences may lie (i.e., it would test an ‘omnibus’ hypothesis). Model matrices allow researchers to target more specific research questions. For example, if the six location parameters arose from a 2 × 3 factorial design, model matrices can be used to test hypotheses associated with the main effects and their interaction (comparable to the hypotheses tested by a Factorial ANOVA; for an example of this see

A detailed exposition on how to create model matrices in order to target specific research questions is well beyond the scope of this article. A reader familiar with the use of contrasts, for example in the context of analysis of variance (ANOVA) or the General Linear Model (GLM), may have recognized _{2} as a ‘contrast.’ Indeed, much about creating model matrices using contrasts can be learned from texts that discuss ANOVA or GLM (e.g.,

The third and most flexible method by which to specify model constraints on parameters is by supplying a custom-written function that defines a reparameterization of the parameter in question. For example, reparameterizing the slopes into parameters that correspond to the sum and difference between them (as performed above using the model matrix [1 1; 1 −1]) can also be accomplished using this function:

function beta = reparameterizeSlopes(theta)

beta(1) = theta(1)+theta(2);

beta(2) = theta(1)-theta(2);

Note that, somewhat counter-intuitively perhaps, the function takes the ‘new parameters’ (the thetas) as input and returns the parameters that directly correspond to one of the PF’s parameters (here: the betas, the slopes of the PF). We would also specify our guesses for values of theta that will be used as the starting point of the iterative fitting procedure and we specify which, if any, of the thetas should be fixed and which should be estimated. Specifying both thetas to be free parameters is equivalent to using the verbal label ‘unconstrained’ or using the model matrix [1 1; 1 −1]. Specifying that theta(1) should be estimated while fixing theta(2) (at 0), would be equivalent to using the verbal label ‘constrained’ or using the model matrix [1 1].

Note that the size of the input and output arguments need not be equal. To give a simple example, we might fix the location parameters at a value of 0 (as in the example model comparison given above) using the function:

function alpha = reparameterizeFixed(theta)

alpha(1) = theta;

alpha(2) = theta;

in which theta is a scalar. We would also specify that theta should be fixed and that the value to be used is 0. To give another example, one can use this method to implement the assumption that location parameters in a series of, say, 10 consecutive experimental sessions follow a learning curve defined as an exponential decay function by using the reparameterization:

function alphas = reparameterizeLocations(thetas)

session = 1:10;

alphas = thetas(1) + thetas(2)^{∗}exp(-thetas(3)^{∗}(session-1));

If we use this function to specify location parameters, Palamedes will constrain these parameters to follow an exponential decay function with three parameters: thetas(1) will correspond to the lower asymptote of the function, thetas(2) will correspond to the difference between the (modeled) value of the first location parameter and the lower asymptote and thetas(3) will determine the learning rate.

Using this third method of specifying models in Palamedes is clearly the most flexible of the three. Any model that can be specified using model matrices can also be specified using this third method (while the reverse is not the case). This third method also has the advantage that Palamedes will return the fitted model not only in terms of the PF parameters (locations, slopes, guess, and lapse rates) but also in terms of the new parameters (i.e., the

Palamedes allows one to use any combination of the three methods to specify constraints on parameters in a single call of a routine. For example, in a single call to PAL_PFLR_ModelComparison.m one can specify the location parameters of the lesser model to be fixed parameters using the verbal label ‘fixed’ while using the empty matrix ([]) to specify that the location parameters of the fuller model are also fixed. An example call to PAL_PFLR_ModelComparison.m that mixes all three of the above methods is given in ModelComparisonTwoConditions.m.

The function PAL_PFLR_ModelComparison returns the TLR. If the lesser model is correct, the TLR is asymptotically distributed as χ^{2} with degrees of freedom equal to the difference in the number of free parameters between the two models to be compared. PAL_PFLR_ModelComparison also returns the appropriate number of degrees of freedom for the test. Thus, the user can utilize standard χ^{2} tables or, for example, Matlab’s chi2cdf function to determine a ^{2} distribution may not be appropriate for low N experiments, PAL_PFLR_ModelComparison can also be used to generate an empirical sampling distribution from which a ^{2} distribution to that derived from Monte Carlo simulations. As an example, we performed our model comparison between the fuller and lesser model discussed in the Section “A Two-Condition Example” again using an empirical sampling distribution based on 10,000 Monte Carlo simulations. ^{2} distribution with 1 degree of freedom. It is clear that the two sampling distributions correspond quite closely for this example. Similarly, the ^{2} distribution (

The histogram displays an empirical sampling distribution for the transformed likelihood ratio (TLR) for our example model comparison between the fuller and lesser model described in the Section “A Two-Condition Example.” The distribution is based on 10,000 Monte Carlo simulations. The curve corresponds to the theoretical χ^{2} distribution with 1 degree of freedom^{1}.

The usefulness of

An alternative strategy to compare two or more alternative models is to use one of the information criteria (e.g., Akaike’s Information Criterion;

A common error researchers make when creating models is to attempt to estimate too many parameters. A very common example of this in psychophysics is allowing the lapse rate to vary in the model when the data contain virtually no information regarding its value. This practice has negative consequences for the interpretation of the fit (

In this regard, another significant advantage of the flexibility of the model comparison approach is that it allows one to reduce the number of parameters of a model greatly. Imagine for example an experiment with several conditions. The researchers are interested in the effect of the experimental variable(s) on the location of the PF. They also feel it is safe to assume that the experimental variable does not affect the slope of the PF. The flexibility of the model-comparison approach allows them to test for specific effects on the location parameters while implementing the assumption that slopes are not affected by constraining the estimated slopes to be equal in all conditions. Trials from all conditions then contribute to the estimation of the single, shared slope value. Note that this is similar in many ways to the assumption of homoscedasticity in many least-squares procedures (e.g.,

One should interpret results of model comparisons in which the models differ with respect to a large number of assumptions cautiously. For example, the model comparison between the model labeled ‘lesser’ in

The model-comparison approach allows researchers to target very specific research questions thereby preventing diluting of effects. To give just one example, imagine an experiment consisting of four conditions that differ with respect to the duration for which the stimulus is presented. Let’s say that we expect that the location parameter increases with the duration of the stimulus in a linear fashion. We can, of course, compare a model in which the location parameters are constrained to be equal in the four conditions to a model in which they are allowed to differ. However, this would not be the optimal manner in which to test our hypothesis. Since we expect our location parameter to vary in a specific (i.e., linear) manner with duration we should compare a fuller model in which we allow the thresholds to vary according to our expectation (rather than allow any variation) to a lesser model in which they are not allowed to vary. A so-called trend analysis allows us to specify a model in which the location parameters are allowed to vary, but only following a linear function of duration. To do this, we would use polynomial contrasts to reparameterize the four location parameters into four new parameters: one corresponds to the intercept term (the average of the four parameters, the second allows a linear trend, the third allows a quadratic trend, and the fourth requires a cubic trend). In effect, this reparametrization allows us to constrain the location parameters to adhere to polynomial functions of differing degree. In this example, the fuller model would use a model matrix that includes an intercept term and a linear trend allowing location parameters to follow a first-degree polynomial (i.e., straight line): [1 1 1 1; −3 −1 1 3]. The lesser model would specify the constraint on the location parameters using a model matrix containing only the intercept term: [1 1 1 1] (or, equivalently, using the verbal label ‘constrained’). The distinct advantage of this approach is that the fuller and lesser model differ with respect to only one parameter (that defining the linear trend) and we say that the comparison is a one degree of freedom comparison. A fuller model that simply would allow any variation among location parameters would have four parameters and a comparison with our one parameter lesser model would be a three degree of freedom comparison. In case the location parameters indeed do vary as a function of duration in a more or less linear fashion, the one degree of freedom test would be much more likely to result in a significant model comparison. An example trend analysis is performed by the demo program PAL_PFLR_FourGroup_Demo.m in the PalamedesDemo folder.

We have seen that the model comparison approach allows many different model comparisons even in the case of simple experiments. In our discussion of our example two-condition experiment in the Section “A Two-Condition Example” we showed that there are 27 nested model comparisons possible that address the location and slope parameters only. The number of possible model comparisons will grow exponentially when additional parameters (lower asymptote, upper asymptote) are compared. Researchers are cautioned to exercise restraint when performing model comparisons. Firstly, as our statistics instructors have impressed upon us, one should decide on the hypotheses to be tested before we inspect our data or even conduct our experiment. As we have argued above, the selection models to be compared should be guided primarily by our research question and by whether our data contain sufficient information to allow estimation of parameters that are not considered in the research question. If, on the other hand, we allow our model comparison to be guided by the results of our experiment (e.g., we select model comparisons because the data suggest there may be an effect), the resulting

While this paper describes the model comparison approach as it can be applied using Palamedes to define and test assumptions regarding the effects of independent variables on the parameters of PFs, Palamedes can do much more. A full description of all of Palamedes’ capabilities is well beyond the scope of this article. Instead, we will list here in general terms what other functionality Palamedes provides. A much more elaborate description may be found on

Adaptive methods serve to increase the efficiency of psychophysical testing. Palamedes implements the up/down methods (e.g.,

Palamedes provides routines for computing the Signal Detection Theory (SDT) measure

Maximum-Likelihood Difference Scaling (MLDS) (

With regard to the fitting of PFs, Palamedes allows one to fit individual functions with the freedom to fix or estimate any of the four parameters of the PF. Individual functions can be fit using either a maximum-likelihood or Bayesian criterion. Palamedes allows one to estimate the standard errors of the parameter estimates, using bootstrap analysis (e.g., ^{2}, Palamedes also offers the possibility of creating an empirical sampling distribution using Monte Carlo simulations.

Another R package that can fit PFs using a maximum-likelihood criterion is quickpsy by

Another tool to fit PFs is psignifit (

We have provided an introduction to the model-comparison approach. While we have geared our discussion to testing research hypotheses in psychophysical studies, the model-comparison approach we describe generalizes to any field in which models are compared. We have also discussed in general terms how such tests may be implemented using our free Palamedes Toolbox. As described above, Palamedes allows model specification using three methods that vary in their ease of use and their flexibility. This approach makes it possible for researchers without much modeling experience to begin creating relatively simple models of their data. More seasoned researchers can use Palamedes’ flexibility to implement more complex models.

All user-end routines in the Palamedes toolbox provide extensive help on their use (type ‘help’ followed by the name of the function in the Matlab command window). Included in the Palamedes Toolbox are also many demonstration programs that perform a complete analysis and will report the results in the form of text or a figure. Additional specific information on the use of Palamedes may also be found on the Palamedes website^{3} or in

NP and FK created the Palamedes software and authored the paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Note that we are referring here to the theoretical χ^{2} probability density function. Readers should not confuse the likelihood ratio test with Pearson’s χ^{2} test (unfortunately often referred to as ‘the χ^{2} test’) with which they may be more familiar. The latter is based on Pearson’s χ^{2} statistic (unfortunately often referred to as ‘χ^{2}’) which happens to be distributed asymptotically as the χ^{2} probability density function also. Other than sharing the χ^{2} probability density function as its asymptotic distribution, Pearson’s χ^{2} statistic and Pearson’s χ^{2} test are not related to the TLR, the likelihood ratio test, or our discussion.

Similar to the confusion described in footnote 1, the term Goodness-of-Fit test is often taken to be equivalent to a specific Goodness-of-Fit test: (Pearson’s) χ^{2} test for Goodness-of-Fit. In fact, many Goodness-of-Fit tests exist, Pearson’s χ^{2} test is simply one of them. Since we use a maximum-likelihood criterion during fitting the appropriate test statistic here is the TLR (as it is in all our model comparisons) and not Pearson’s χ^{2}.