Edited by: Hong Jiao, University of Maryland, College Park, United States

Reviewed by: Matthew D. Finkelman, Tufts University School of Medicine, United States; Jean-Paul Fox, University of Twente, Netherlands

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Response time is the process variable most commonly available for analysis when tests are presented in a computerized form. Psychometric models have been developed for the joint modeling of response accuracy and response time, in which response time is an additional source of information about ability and about the underlying response processes. While traditional models assume conditional independence between response time and accuracy given ability and speed latent variables (van der Linden,

When psychological and educational tests are presented in a computerized form, it is feasible to not only record the product of the response process (i.e., response accuracy), but also the characteristics of the process itself. The most commonly used process variable is response time. Various psychometric models have been developed to jointly model response accuracy and response time (van der Linden,

Several methods have been proposed for testing the assumption of conditional independence (van der Linden and Glas,

When it comes to conditional dependence between time and accuracy, authors typically focus on positive conditional dependence (i.e., relatively slow responses are more often correct) and negative conditional dependence (i.e., relatively fast responses are more often correct). This implies that the conditional dependence between time and accuracy is assumed to be monotone. Moreover, most existing models specify the relationship to be linear. However, this assumption of monotone and linear conditional dependence does not necessarily hold in all situations. It could be that responses which are faster than expected are less often correct than responses with response times close to what is expected, while responses slower than expected are not more often correct than those with response times close to what is expected. Therefore, researchers should be able to test whether linearity of conditional dependence between time and accuracy is plausible and to investigate potential nonlinear conditional dependence.

Nonlinear conditional dependence is interesting from a substantive point of view because abandoning the assumption of monotonicity and linearity of the conditional relationship between time and accuracy gives a more complete picture of the response process. Since a linear model can only reveal positive or negative dependence, it may ignore important parts of the response phenomena. Imagine a situation in which an item is solved using either a fast optimal strategy or a slow error-prone strategy (i.e., slow responses are less often correct than relatively fast responses) and, in addition, some of the respondents respond to the item by guessing (i.e., very fast responses are rarely correct). If one of these phenomena is much stronger than the other, then a linear effect in one of the directions would be detected (i.e., positive conditional dependence if guessing is the stronger factor, or negative conditional dependence if the difference in strategies is the stronger factor). The linear model might also find no evidence of conditional dependence if the two opposing factors balance each other out. In none of these scenarios would a valid conclusion about the relationship between time and accuracy be drawn. In contrast, nonlinear methods would allow one to detect a violation of conditional independence and to get a better understanding of the response processes.

In this paper we develop methods for exploring nonlinear conditional dependence between response time and accuracy. Three different approaches are proposed. (1) The joint models for conditional dependence between time and accuracy (see e.g., Bolsinova et al.,

The remainder of the paper is organized as follows. In section 2 the hierarchical model for response time and accuracy is presented and the assumption of conditional independence is formally defined. In section 3 existing models for conditional dependence are discussed. In section 4 we propose three methods for exploring nonlinear conditional dependence. Section 5 presents an empirical example in which nonlinear conditional dependence is investigated, and the paper concludes with a discussion.

In the hierarchical model (van der Linden, _{pi} (with realizations _{pi} = 0/1 for incorrect/correct) and _{pi} (with realizations _{pi}), respectively, are assumed to be independent, conditional on the latent variable ability, denoted by θ_{p}, and speed, denoted by τ_{p}:

Furthermore, it is assumed that response accuracy is independent of speed given ability, and that response time is independent of ability given speed. The full specification of the hierarchical model for response times and accuracy requires four model ingredients: (1) a measurement model for response accuracy, typically an item response theory (IRT) model; (2) a measurement model for response times; (3) a model for the relationship between the latent variables; and (4) a model for the relationship between the item parameters. In this section, we will present a simple specification of the model, which we will use as a basis for describing the existing extensions of the hierarchical model allowing for conditional dependence.

For the response accuracy measurement model, we use a two-parameter normal-ogive model (Lord and Novick,

where _{i} and _{i} are the slope and the intercept of the ICC, and Φ(·) denotes the cumulative standard normal distribution function. Alternatively, the three-parameter normal-ogive model (Klein Entink et al.,

For the response time measurement model, we use a log-normal model (van der Linden, _{i}, and the speed latent variable:

where

For the relationship between the latent variables and for the relationship between the item parameters we use multivariate normal distributions. For identification, the mean vector of the latent variables is constrained to zero, and the variance of θ is constrained to one^{1}_{i}, _{i}, ξ_{i}) we also use a multivariate normal distribution. Unlike the distribution of the person parameters, here the mean vector and the covariance matrix can be estimated freely.

The conditional independence assumption in Equation (1) means that accuracy and time can be correlated only if ability and speed, which determine their expected values, are correlated. The residual response accuracy and residual log-transformed response time are taken to be noise and the fluctuations on the response accuracy and response time sides of the model are taken to be uncorrelated.
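As an illustration of the conditional independence assumption described above, the following is a minimal simulation sketch, not the authors' code. It assumes the accuracy model has the form Φ(αθ + β) and the log-normal time model has the form log T ~ N(ξ − τ, σ²); the displayed equations did not survive in this text, so this parameterization, the function names, and all numeric values are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; .cdf plays the role of Phi


def simulate_person(rho, tau_sd, rng):
    """Draw (theta, tau) from a bivariate normal with zero means.

    var(theta) is fixed to one for identification; tau_sd and rho are
    hypothetical values for the SD of speed and the ability-speed correlation.
    """
    theta = rng.gauss(0.0, 1.0)
    noise = rng.gauss(0.0, 1.0)
    tau = tau_sd * (rho * theta + math.sqrt(1.0 - rho ** 2) * noise)
    return theta, tau


def simulate_response(theta, tau, alpha, beta, xi, sigma, rng):
    """Draw (accuracy, response time) for one person-item pair.

    Accuracy depends only on ability and time only on speed, so the two
    outcomes are independent once theta and tau are given.
    """
    p_correct = STD_NORMAL.cdf(alpha * theta + beta)  # assumed normal-ogive form
    x = 1 if rng.random() < p_correct else 0
    log_t = rng.gauss(xi - tau, sigma)  # assumed log-normal form
    return x, math.exp(log_t)


rng = random.Random(0)
theta, tau = simulate_person(rho=0.4, tau_sd=0.3, rng=rng)
x, t = simulate_response(theta, tau, alpha=1.0, beta=0.0, xi=3.0, sigma=0.5, rng=rng)
```

In data simulated this way, accuracy and time are marginally correlated only through the correlation between ability and speed, which is the situation the conditional dependence models relax.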

The conditional independence assumption can be relaxed and the relationship between residual response time and residual response accuracy can be incorporated into the model. One way to do that is to model the joint distribution of time and accuracy to the same item as a bivariate distribution with a non-zero correlation parameter. Ranger and Ortner (_{i}:

Here, the marginal distribution of response accuracy and response time are the two-parameter normal-ogive model and log-normal model, the same as in the hierarchical model presented in the previous section. Meng et al. (

Bolsinova et al. (

where _{i0} is the baseline intercept and _{1i} is the linear effect of standardized residual log-transformed response time on the intercept of the ICC. In addition to the linear effect on the intercept, the model can be extended with a linear effect on the slope of the ICC (Bolsinova et al., ^{2}

where _{pi} denotes the standardized difference between the observed and expected log-transformed response time _{i0} and _{i1} are the baseline slope and the linear effect of _{pi} on the slope of the ICC, respectively. The parameters _{i1} and _{i1} can be interpreted as the main effect of residual log-transformed response time on response accuracy, and the interaction effect between ability and _{pi} on accuracy, respectively. Throughout the paper we refer to this model as the linear conditional dependence model.
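The linear conditional dependence model can be sketched as follows. This is a hedged illustration that assumes the standardized residual is (log t − (ξ − τ))/σ and that the slope and intercept of the ICC each change linearly with it; the argument names (slope0, slope1, intercept0, intercept1) are placeholders for the baseline and linear-effect parameters described above.

```python
from statistics import NormalDist


def standardized_residual(log_t, xi, tau, sigma):
    """Standardized difference between observed and expected log response time.

    Assumes the expected log time is xi - tau with residual SD sigma.
    """
    return (log_t - (xi - tau)) / sigma


def p_correct_linear(theta, z, slope0, slope1, intercept0, intercept1):
    """Probability of a correct response when both ICC parameters are linear in z."""
    slope = slope0 + slope1 * z              # interaction of ability and z
    intercept = intercept0 + intercept1 * z  # main effect of z
    return NormalDist().cdf(slope * theta + intercept)


# Hypothetical item: slower-than-expected responses (z > 0) are more often correct.
z = standardized_residual(log_t=3.5, xi=3.0, tau=0.0, sigma=0.5)
p_slow = p_correct_linear(theta=0.0, z=z, slope0=1.0, slope1=0.0,
                          intercept0=0.0, intercept1=0.3)
```

Setting both linear effects to zero recovers the conditional independence model for that item.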

The approaches discussed above treat the response time as a continuous variable and relate the parameters of the IRT model for accuracy to deviations of the observed log-response time from its expected value. An alternative proposal has been to categorize response time into two classes—fast and slow—and jointly model the dichotomized response time and response accuracy using an IRTree model (De Boeck and Partchev,

where _{iS} > _{iF}), or responses in the slow class being less informative about ability than responses in the fast class (_{iS} < _{iF}).

It is important to note that separation of the response times into two classes is typically done using an item-level median split. Therefore, this approach is different from the linear models discussed above, since the ICC parameters are related to the categorized
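The item-level median split mentioned above could look like the sketch below. The function name and the treatment of ties (times equal to the median are labeled fast) are assumptions for illustration only.

```python
from statistics import median


def median_split(response_times):
    """Classify the response times of one item as 'fast' or 'slow'.

    response_times holds one time per person for a single item. Times at or
    below the item median are labeled 'fast' (tie handling is an assumption).
    """
    cut = median(response_times)
    return ["slow" if t > cut else "fast" for t in response_times]


labels = median_split([12.0, 30.5, 8.2, 19.9])
```

Because the split uses the raw response time rather than its deviation from the expected time, a generally slow respondent can be classified as slow on every item, which is one way this approach differs from the residual-based models.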

The linear conditional dependence models and the fast-slow model provide quite a simplistic picture of the relationship between response time and accuracy. The residual dependence between time and accuracy is not necessarily monotone and the change of the ICC parameters is not necessarily linear in _{pi}. To further investigate the relationship between response time and accuracy, we propose two new joint models for conditional dependence between response time and accuracy, and also use a nonparametric moderation method to explore the relationship between the residual log-transformed response time and the parameters of the response accuracy model.

To allow for a nonlinear relationship between residual log-transformed response time and the ICC parameters, we extend the conditional model of response accuracy in Equation (6) with quadratic effects. To simplify the notation, we introduce a function Ψ(·, ^{x}(1−Φ(·))^{1−x}. The resulting joint model for time and accuracy is then the following:

where _{i2} and _{i2} are the quadratic effects of the residual log-transformed response time on response accuracy. If _{2i} < 0, then the strength of the relationship between ability and the probability of a correct response first increases with residual log-transformed response time and then decreases, and vice versa if _{i2}>0. Similar interpretations can be given to the sign of _{i2}. When the quadratic effect is negative, the corresponding parameter of the ICC (i.e., slope or intercept) is the highest when

Our joint model is an extension of the hierarchical model, therefore in addition to the specification of the joint distribution of the outcome variables, we also need to specify the distribution of the latent variables and the distribution of the item parameters. On the person side we use _{i0}, _{i1}, _{i2}, _{i0}, _{i1}, _{i2}, ξ_{i}}, where _{I} and _{I} are the mean vector and the covariance matrix of the item parameters, respectively. Note, that while we are including nonlinear effects in modeling the conditional dependence between time and accuracy given ability and speed, we do not extend the standard hierarchical model with nonlinear effects on the higher level, since it goes beyond the scope of the current paper. However, one may consider more complex models for the joint distribution of the person parameters and for the joint distribution of the item parameters that would allow for a nonlinear relationship on the higher level as well as on the lower level.
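To make the interpretation of the quadratic effects concrete, the sketch below evaluates an ICC parameter that is quadratic in the standardized residual and locates the residual value at which it peaks. It relies only on the standard vertex formula for a parabola; the function and argument names are illustrative and not taken from the paper.

```python
def quadratic_parameter(z, baseline, linear, quadratic):
    """Value of an ICC parameter (slope or intercept) at residual z."""
    return baseline + linear * z + quadratic * z * z


def maximizing_z(linear, quadratic):
    """Residual value at which the parameter is highest.

    A maximum exists only when the quadratic effect is negative; otherwise
    None is returned.
    """
    if quadratic >= 0:
        return None
    return -linear / (2.0 * quadratic)


# Hypothetical effects: the parameter peaks one SD above the expected log time.
z_star = maximizing_z(linear=0.2, quadratic=-0.1)
peak = quadratic_parameter(z_star, baseline=0.5, linear=0.2, quadratic=-0.1)
```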

This extended joint model for conditional dependence between response time and accuracy can be estimated in a similar way as the linear conditional dependence models (Bolsinova et al.,

An alternative to the quadratic conditional dependence model for exploration of nonmonotone dependence is an extension of the slow-fast model. Allowing the ICC parameters to differ not just across two classes of responses, but across multiple classes, makes it possible to uncover nonmonotone relationships between residual response time and the ICC parameters (e.g., an item being most informative for the middle categories and least informative for the extreme categories).

Considering multiple categories is not the only way in which our joint model differs from the existing fast-slow models. Instead of categorizing the response time itself, we are going to use the residual log-transformed response time, since we are interested in the

The joint distribution of response time and accuracy in this model is:

where _{1}, …, _{M+1} are the a priori defined thresholds between the categories (_{1} = −∞, _{M+1} = +∞). Note that in this joint model response time is modeled as a continuous variable, so that no information about speed is lost due to categorization.

Given that residual log-transformed response time belongs to the baseline category, the item parameters are equal to {_{im}, _{im}}. When _{pi} belongs to one of the remaining categories _{im} + _{ik}, _{im} + _{ik}}. When _{ik} < 0, ∀

Analogous to the quadratic model, this joint model for time and accuracy can also be estimated using a Gibbs Sampler (see _{i1}, …, _{iM}, _{i1}, …, _{iM}, ξ_{i}}.
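The way the multiple-category model assigns item parameters can be sketched as below. The threshold values, the choice of baseline category, and the convention for which category owns a boundary value are all hypothetical; only the general scheme (baseline parameters plus a category-specific effect) follows the description above.

```python
import bisect
import math


def category_of(z, thresholds):
    """Index (0-based) of the category that contains residual z.

    thresholds is an increasing list of M + 1 values that starts with -inf
    and ends with +inf, so a finite z always falls into one of M categories.
    """
    return bisect.bisect_right(thresholds, z) - 1


def item_parameters(z, thresholds, baseline, base_slope, base_intercept,
                    slope_effects, intercept_effects):
    """Slope and intercept of the ICC for a response with residual z.

    slope_effects and intercept_effects map non-baseline category indices to
    their deviations from the baseline parameters.
    """
    k = category_of(z, thresholds)
    if k == baseline:
        return base_slope, base_intercept
    return base_slope + slope_effects[k], base_intercept + intercept_effects[k]


# Hypothetical five-category setup with the middle category as baseline.
thresholds = [-math.inf, -1.5, -0.5, 0.5, 1.5, math.inf]
slope, intercept = item_parameters(
    -2.0, thresholds, baseline=2, base_slope=1.0, base_intercept=0.5,
    slope_effects={0: -0.3, 1: -0.1, 3: -0.1, 4: -0.2},
    intercept_effects={0: -0.8, 1: -0.2, 3: 0.1, 4: 0.0},
)
```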

The third approach to exploring nonlinear conditional dependence is in line with the nonparametric indicator-level moderation approach developed by Bolsinova and Molenaar (

Unlike the first two approaches in which the joint distribution of response time and accuracy is modeled, in nonparametric moderation it is not possible to model the two outcome variables jointly since in this approach residual log-transformed response time is treated as an observed covariate. Therefore, we propose using a two-step procedure. First, the measurement model for response times is fitted and the estimates of the standardized residual log-transformed response time are computed:

Second, the estimates ẑ_{pi} are included in the analysis of response accuracy as indicator-level moderators.

For each item, a set of focal points _{1}, …, _{J} for the value of the standardized residual log-transformed response time are defined for which the slope and intercept of the ICC are estimated. Since for all items the moderator has a mean of zero and a standard deviation of one, it makes sense to have the same focal points for different items. For each focal point _{ji} and _{ji} are obtained by weighting the responses to the item from each person _{pi} and the focal point. For each combination of an item _{ji} is defined with each element corresponding to a particular person

there _{pi} has to be to have a relatively large impact on the estimates of the parameters _{ji} and _{ji}. We will use the value of 1.1 for
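The weighting idea can be illustrated with the sketch below. The weight formula itself did not survive in this text, so the Gaussian kernel used here is an assumption, as are the function name and the way the bandwidth is supplied. In particular, it is not recoverable from the text whether 1.1 is the bandwidth itself or a factor in it, so the example simply passes it as the bandwidth.

```python
import math


def kernel_weights(z_hat, focal_point, bandwidth):
    """Weight of each person's response to one item for one focal point.

    Uses an (assumed) Gaussian kernel: persons whose estimated residual is
    close to the focal point get weights near one, distant persons get
    weights near zero. A larger bandwidth makes the weights decay more slowly.
    """
    return [math.exp(-0.5 * ((z - focal_point) / bandwidth) ** 2) for z in z_hat]


# Hypothetical estimated residuals for five persons on one item.
weights = kernel_weights([-1.8, -0.4, 0.0, 0.7, 2.1], focal_point=0.0, bandwidth=1.1)
```

These weights would then enter a weighted estimation of the slope and intercept at that focal point, so that each focal point is informed mostly by responses with nearby residuals.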

The item slopes and intercepts of the ^{*} and ^{*} respectively, are defined. The estimates of the item slopes and intercepts from the conditional independence hierarchical model can be used as starting values. After initialization, repeatedly for each item the estimates of _{ji} and _{ji} are obtained for each focal point

where the responses to item _{ji}, while for the rest of the items

After _{ji} and _{ji} are obtained, we update the values of

with a similar specification for _{pi} is outside of the range of the focal points, then the parameters are set equal to the parameters at the nearest focal point, and otherwise
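The update of person-specific item parameters from the focal-point estimates can be sketched as follows. Values outside the range of the focal points are set to the estimate at the nearest focal point, as stated above; the rule for values inside the range is not legible in this text, so linear interpolation between the two neighbouring focal points is used here as an assumption.

```python
import bisect


def interpolate_parameter(z, focal_points, values):
    """Item parameter for a response with residual z.

    focal_points must be sorted in increasing order and values holds the
    parameter estimate at each focal point. Linear interpolation inside the
    range is an assumption of this sketch.
    """
    if z <= focal_points[0]:
        return values[0]
    if z >= focal_points[-1]:
        return values[-1]
    j = bisect.bisect_right(focal_points, z)
    left, right = focal_points[j - 1], focal_points[j]
    fraction = (z - left) / (right - left)
    return values[j - 1] + fraction * (values[j] - values[j - 1])


# Hypothetical intercept estimates at five focal points.
focal = [-2.0, -1.0, 0.0, 1.0, 2.0]
estimates = [0.0, 1.0, 2.0, 1.0, 0.0]
value = interpolate_parameter(0.5, focal, estimates)
```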

Under this nonparametric approach the significance of conditional dependence can be tested using permutation tests. To perform these tests, one needs to repeatedly estimate the nonparametric relationship between the residual log-transformed response time and the parameters of the ICCs in permuted data sets, that is, data sets in which the response accuracy data points are kept intact but the residual log-transformed response times are randomly assigned to different persons in the sample. As a first tool to draw inferences about the significance of the relationship between the residual log-transformed response time and the ICC parameter, one can use graphical checks of deviations of the observed relationship and the relationship in the permuted data sets. However, a more rigorous test is to use the variance of the parameters across focal points as a statistic and compare the observed value to its distribution in the permuted data sets. The proportion of permuted data sets in which the variance is larger than in the observed data can be used to approximate the
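A minimal sketch of the permutation test for a single item is given below. The function estimate_curve stands for whatever procedure returns the parameter estimates at the focal points; the toy_curve used in the example (a kernel-weighted proportion correct) is only a stand-in for the actual nonparametric moderation estimator, and all names and default values are hypothetical.

```python
import math
import random
from statistics import pvariance


def toy_curve(accuracy, z_hat, focal_points=(-1.0, 0.0, 1.0), bandwidth=1.0):
    """Stand-in estimator: kernel-weighted proportion correct at each focal point."""
    curve = []
    for f in focal_points:
        w = [math.exp(-0.5 * ((z - f) / bandwidth) ** 2) for z in z_hat]
        curve.append(sum(wi * x for wi, x in zip(w, accuracy)) / sum(w))
    return curve


def permutation_p_value(accuracy, z_hat, estimate_curve, n_perm=200, seed=1):
    """Approximate p-value for dependence between residual time and a parameter.

    The statistic is the variance of the estimates across focal points. The
    accuracy data are kept intact while the residuals are shuffled across
    persons; the p-value is the share of permuted data sets whose statistic
    exceeds the observed one.
    """
    rng = random.Random(seed)
    observed = pvariance(estimate_curve(accuracy, z_hat))
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(z_hat)
        rng.shuffle(shuffled)
        if pvariance(estimate_curve(accuracy, shuffled)) > observed:
            exceed += 1
    return exceed / n_perm


# Artificial data in which slower-than-expected responses are always correct.
z_hat = [i / 10 - 2 for i in range(41)]
accuracy = [1 if z > 0 else 0 for z in z_hat]
p = permutation_p_value(accuracy, z_hat, toy_curve)
```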

Furthermore, nonparametric moderation can be used to evaluate the viability of the assumption of linearity of conditional dependence. This can be done by performing posterior predictive checks (Meng,

To illustrate how the nonlinear conditional dependence between response time and accuracy can be investigated, the proposed methods were applied to a data set of a high-stakes arithmetic test^{3}

The four models were fitted using Gibbs Samplers with 10,000 iterations including 5,000 iterations of burn-in. For the details of the estimation algorithm for the conditional independence model and the linear conditional dependence model see Bolsinova et al. (^{4}

In addition to fitting the joint models for response time and accuracy, the nonparametric moderation method was applied to the data. To do so, the standardized residuals of log-transformed response time in the one-factor model with equal factor loadings (which is equivalent to the log-normal model in Equation 3) were computed using the “lavPredict” function from the R package “lavaan” (Rosseel,

Finally, to test the linearity of conditional dependence, posterior predictive checks were performed for the linear conditional dependence model. Given each 10th sample of the model parameters after the burn-in a replicated data set was generated under the linear conditional dependence model (i.e., 500 replicated data sets were generated). The nonparametric moderation method was applied for each of the replicated data sets in the same way as for the observed data. The relationship between standardized residual log-transformed response time and the ICC parameters in the replicated data sets and the observed data were compared graphically. Furthermore, in each data set for each effect, the maximum of the absolute value of the cumulative sum of the residuals in the simple linear regression model with the focal points as a predictor and the ICC parameter as an outcome variable was computed. For each effect, the proportion of replicated data sets in which the deviation from linearity (quantified by the maximum of the absolute value of the cumulative sum of the residuals) was larger than in the observed data was computed to approximate the posterior predictive
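The deviation-from-linearity statistic described above can be sketched as follows: fit a simple linear regression of the parameter estimates on the focal points, accumulate the residuals in order of the focal points, and take the largest absolute cumulative sum. The function names are illustrative, and ordering the residuals by increasing focal point is an assumption.

```python
from itertools import accumulate
from statistics import fmean


def linearity_deviation(focal_points, estimates):
    """Maximum absolute cumulative sum of residuals from a simple linear fit.

    focal_points are assumed to be sorted in increasing order; estimates are
    the ICC parameter estimates at those focal points.
    """
    x_bar = fmean(focal_points)
    y_bar = fmean(estimates)
    sxx = sum((x - x_bar) ** 2 for x in focal_points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(focal_points, estimates))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(focal_points, estimates)]
    return max(abs(s) for s in accumulate(residuals))


def posterior_predictive_p(observed_stat, replicated_stats):
    """Share of replicated data sets with a larger deviation than the observed one."""
    return sum(r > observed_stat for r in replicated_stats) / len(replicated_stats)


# Hypothetical inverted-U relationship across five focal points.
stat = linearity_deviation([-2.0, -1.0, 0.0, 1.0, 2.0], [-4.0, -1.0, 0.0, -1.0, -4.0])
```

A small proportion indicates that the observed relationship deviates from linearity more than would be expected if the linear conditional dependence model held.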

Table

Information criteria for the four joint models for time and accuracy.

Conditional independence | 2046864 | – | 2126317 |
Linear conditional dependence | 2042274 | 76 | 2122368 |
Quadratic conditional dependence | 2040146 | 152 | 2120882 |
Multiple-category model | 2039572 | 304 | 2121591 |

It is important to investigate whether the main inferences that are made based on the linear conditional dependence model would also hold for the nonlinear conditional dependence models and for the nonparametric moderation method. The first question is about the presence of the effects on the intercept and the slope of the ICCs of the separate items. In the linear conditional dependence model for 24 and 30 items, the 95% credible intervals of _{1i} and _{1i} respectively, did not include zero, which can be seen as evidence of the presence of the effects. In the quadratic model for 33 and 37 items the 97.5% credible intervals^{5}_{i1} or _{i2}, and of either _{i1} or _{i2} did not contain zero, which can be seen as evidence of the presence of conditional dependence for these items. In the multiple-category conditional dependence model for 29 and 37 items the 98.75% credible intervals^{6}_{ik}, _{ik},
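The credible interval levels used for the three models follow from dividing the area outside the 95% interval by the number of parameters evaluated per ICC parameter, as explained in the footnotes. A small sketch of that arithmetic, with a hypothetical function name:

```python
def adjusted_level(n_parameters, base_tail=0.05):
    """Credible interval level after dividing the tail area by n_parameters."""
    return 1.0 - base_tail / n_parameters


# One parameter (linear), two (quadratic), and four (multiple-category).
levels = {k: adjusted_level(k) for k in (1, 2, 4)}
```

With one, two, and four parameters this gives the 95%, 97.5%, and 98.75% levels reported for the linear, quadratic, and multiple-category models.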

We note that the nonlinear methods are more flexible and complex and therefore provide noisier results and have less power for detecting the effects, so it would not be surprising if a linear effect is detected by the simpler linear method, but not by more complex nonlinear methods. On the contrary, having items for which the linear conditional dependence model does not detect the effect, while it is detected by the nonlinear models should be worrying, since it would mean that the effect is not detected due to its nonlinear nature. This is the case, for example, for the effect on the intercept of item 7: Figure

Intercept of the item characteristic curve of item 7 (_{7}, on the _{7}, on the

The second kind of conclusion that is typically made based on the linear conditional dependence model is about the correlation between the baseline intercept of the items and the effect of residual log-transformed response time on the intercept. In multiple data sets previously this correlation was found to be negative (Bolsinova et al., _{i0} and _{i1} in the linear conditional dependence model. For easier items the effects are more often negative, and for more difficult items the effects are more often positive. To check whether a similar conclusion would be made using the nonlinear methods we performed the following analyses: (1) For the quadratic model for the items with negative _{i2} (i.e., items for which there exists a value of _{pi} which maximizes the intercept of the ICC) we plotted the points at which the intercept is maximized (_{ji} is the highest against the overall proportion of correct responses to the item (see Figure

Differences in the effect of residual log-transformed response time (z) on item easiness depending on the baseline easiness.

The comparison of the information criteria shows that linearity of conditional dependence does not hold for the test as a whole. Additionally, we examined the estimates of the item hyper-parameters specifying the mean and the variance of the quadratic effects. The means of the quadratic effects across items were estimated to be -0.02 [-0.07, 0.04] for _{i2}s, and -0.09 [-0.15, -0.03] for _{i2}s. The variances of the quadratic effects were 0.03 [0.02, 0.05] for _{i2}s and 0.03 [0.02, 0.05] for _{i2}s. For the effects of the item intercepts there is a clear pattern of the intercept first increasing and then decreasing with residual log-transformed response time since the mean of _{i2} is negative, but for the effects on the item slopes the pattern is not so clear.

In addition to the overall conclusions about the presence of nonlinear effects, at least for some of the items, it is also important to look at each item separately and evaluate the results of the posterior predictive checks for linearity. For 27 and 30 items the posterior predictive

Intercept of the item characteristic curve of item 1 (_{1}, on the _{1}, on the

_{2}, on the _{2}, on the _{2}) and the slope of the ICC of item 2 (_{2}) estimated in the replicated data generated under the linear model, and the black line represents the relationship in the observed data.

_{28}, on the _{28}, on the _{28}) and the intercept of the ICC of item 28 (_{28}) estimated in the replicated data generated under the linear model, and the black line represents the relationship in the observed data.

_{30}, on the _{30}, on the _{30}) and the intercept of the ICC of item 30 (_{30}) estimated in the replicated data generated under the linear model, and the black line represents the relationship in the observed data.

Additionally, we compared the estimates of ability under the conditional independence model, the linear conditional dependence model, and the two nonlinear conditional dependence models (quadratic and multiple-category models) to check how the inclusion of conditional dependence in a model (and the exact way in which it is modeled) influences the inferences about the respondents. The correlations between the estimates of θ under each pair of models were very high: the lowest correlation was above 0.988 and the highest was above 0.999. Therefore, in this example modeling conditional dependence does not change the measured construct, while it does allow learning more about the relationship between time and accuracy than the standard conditional independence model does.

Our empirical example shows that conditional dependence between response time and accuracy can be nonlinear: in this example models allowing for nonlinear dependence are preferred over the linear dependence model, and for the majority of the items the posterior predictive checks indicate violations of linearity of the relationship between residual log-transformed response time and the ICC parameters. Using a linear conditional dependence model may in some situations lead to incorrect conclusions about the relationship between response time and accuracy: (1) One may conclude that conditional independence holds, when conditional independence is violated in a nonmonotone way such that the positive dependence in one range of the

The approaches proposed in this paper make use of the difference between the observed and expected log-transformed response times, _{pi}, as a predictor variable to account for unobserved heterogeneity in the responses. In the model, we do not explicitly separate the unobserved heterogeneity by means of additional latent variables. As a result, _{pi}, which contains noise, is fully incorporated in the response model which decreases the power to detect an effect as the parameter estimates will have increased sampling fluctuations due to the noise in the residual log-transformed response time. However, we did not want to further complicate the model by introducing additional latent variables. In addition, introducing more latent variables may also decrease the power to detect an effect due to increased estimation error. Another aspect of the conditional dependence models is that false positives may arise if the response time model is misspecified. That is, such misspecifications will be absorbed in _{pi} which in turn may be detected as a linear or non-linear conditional dependence effect if the misspecification is large enough. As a result, ideally one should carefully consider model fit of the response time measurement model before interpreting the results of the present parametric approach.

The conclusion about the negative relationship between the baseline intercept of the items and the effects of residual log-transformed response time on the intercept, previously found in other datasets (see e.g., Bolsinova et al., _{pi} for which the intercept (and therefore response accuracy) is the highest. For easier items, the optimal values of _{pi} tend to be more negative (responses faster than expected), while for difficult items, the optimal _{pi} is positive (responses slower than expected).

In this paper we used three different approaches to modeling nonlinear conditional dependence: (1) the quadratic conditional dependence model, (2) the multiple-category conditional dependence model, and (3) the nonparametric modeling approach. Each of these approaches has its own advantages and disadvantages. An important difference between the first two methods and the third one is that the first two methods allow modeling response time and accuracy jointly, while the third method requires a two-step procedure in which the estimates ẑ_{pi} are treated as observed covariates for the distribution of response accuracy. This can be seen as a disadvantage of the nonparametric approach. At the same time, the nonparametric approach allows for more flexibility in the relationship between residual log-transformed response time and the ICC parameters. A limitation of the quadratic approach is that it restricts the possible relationship between the residual log-transformed response time and the ICC parameters to a particular parametric shape and does not allow exploration of the shape of the conditional dependence. One way in which the quadratic shape of the relationship between _{pi} and the ICC parameters is restrictive is that the function is symmetric, whereas it could be that the decrease of the parameter when moving away from the maximum point (given that the quadratic effect is negative and there is a maximum) is stronger when _{pi} becomes smaller than its optimal value than when it becomes larger. The nonparametric method allows us to follow the shape of the relationship more closely; however, due to its flexibility the method requires larger sample sizes. A limitation of the multiple-category approach is that it assumes that within each category of residual log-transformed response time the item parameters are constant, which might not necessarily be the case in practice.

While the empirical example considered an application from educational measurement, the developed methodology can be expected to be relevant for applications relating to ability measurement in general, in cases where both response time and accuracy are recorded. Like the traditional hierarchical model, the models proposed in this paper make it possible to obtain additional information about ability based on the observed response times, but the methods also allow one to further study and model the complex relationship that may exist between response time and accuracy. This can, for example, be considered relevant in the context of developing and applying intelligence tests or other complex cognitive tests, where one might expect that items display relevant patterns of conditional dependence. For example, it may be that response time is indicative of the particular problem solving strategy that a respondent employs, which may also affect how likely one is to provide a correct response. Additionally, it may be that long response times are indicative of aberrant test taking behavior, such as inattention or distraction, which makes it plausible that such responses should be seen as less informative of ability than responses for which the response times do not indicate aberrant behavior. Our methods allow one to take this into account, by allowing the discrimination parameter of the item to be influenced by residual response time. In this way, the proposed methods allow researchers to work with models for ability measurement that take both response time and accuracy into account and that are highly flexible with regard to the relationship between these two outcome variables that can be dealt with, and can accommodate a variety of deviations from conditional independence that can be expected in both high- and low-stakes psychological testing.

MB and DM designed the study; MB wrote the software, performed the analysis, and wrote the paper; DM provided feedback on the manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at:

^{1}Note that if the factor model with item-specific factor loadings is used, then the variance of speed also has to be constrained.

^{2}Note that, alternatively, it has been proposed to include a linear effect on the log-transformed slope of the ICC (Bolsinova et al.,

^{3}We would like to thank the Dutch National Institute for Measurement in Education (CITO) for making this data set available to us. For confidentiality reasons we cannot disclose the content of the test items analyzed in this paper, but example items can be found at

^{4}We use only the modified BIC and not the modified Akaike information criterion (AIC), which has also been evaluated by the authors, because they have shown that the AIC tends to be too liberal.

^{5}We decided to use a wider credible interval for the quadratic model because here two parameters, instead of one, are evaluated for each ICC parameter to draw a conclusion about the presence of the effect; that is, the area outside of the credible interval was divided by the number of parameters evaluated.

^{6}We decided to use a wider credible interval for the multiple-category model because here four parameters, instead of one, are evaluated for each ICC parameter to draw a conclusion about the presence of the effect; that is, the area outside of the credible interval was divided by the number of parameters evaluated.