
Edited by: Holmes Finch, Ball State University, United States

Reviewed by: Chun Wang, University of Minnesota Twin Cities, United States; Jung Yeon Park, KU Leuven, Belgium

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Asymmetric IRT models have been shown to be useful for capturing heterogeneity in the number of latent subprocesses underlying educational test items (Lee and Bolt).

Most item response theory (IRT) models used in educational and psychological measurement contexts assume item characteristic curves (ICCs) that are symmetric in form, implying that change below the inflection point is a mirror image of change above the inflection point (Embretson and Reise).

One context in which asymmetric models may find particular value is in the scoring of discrete option multiple-choice (DOMC) items (Foster and Miller).

It has been previously demonstrated that administering the same items in a DOMC format as opposed to a traditional MC format often increases item difficulty (Eckerly et al.).

One way of rectifying this problem is to score DOMC item responses in a way that accounts for the psychometric effects of key location. A natural approach would be to use a suitable IRT model as a basis for such scoring. One contribution of this paper is to show how an asymmetric IRT model, namely Samejima's logistic positive exponent (LPE) model, can provide a basis for such scoring.

A second contribution of the paper relates less to DOMC, and more to the opportunity the DOMC application provides in demonstrating the meaningfulness of studying ICC asymmetry from an item validation perspective. Lee and Bolt previously illustrated how ICC asymmetry can reflect the number of latent subprocesses underlying an item; the DOMC format, in which the number of administered response options is directly manipulated, provides a unique opportunity to validate this interpretation.

The remainder of this paper is organized as follows. First, we review the LPE model of Samejima and its interpretation in terms of conjunctively interacting subprocesses. We then describe the DOMC dataset, present a set of competing IRT models, and compare those models empirically. We conclude by discussing implications for DOMC scoring and for the study of ICC asymmetry more generally.

Although there exist alternative IRT models that introduce asymmetry (Bazán et al.), we focus here on Samejima's logistic positive exponent (LPE) model, owing to its natural interpretation in terms of conjunctively interacting subprocesses.

Under the LPE, each latent subprocess is modeled by a two-parameter logistic function,

Ψ_{i}(θ_{j}) = 1 / (1 + exp(−a_{i}(θ_{j} − b_{i}))),   (1)

where the slope a_{i} reflects the discrimination of item i and b_{i} reflects its difficulty. Then the resulting probability of a correct response to the item under an LPE is written as

P_{i}(θ_{j}) = [Ψ_{i}(θ_{j})]^{ξ},  ξ > 0.   (2)

In more general measurement contexts, a subprocess, Ψ_{i}(θ_{j}), might be viewed as a single step taken in solving an item, such that an item is answered correctly only if all steps are successfully executed. An example might be a long division problem, in which the repeated sequential steps of division, multiplication, subtraction and bringing down the next digit are each thought of as a separate subprocess. As an exponent parameter, ξ could be viewed as defining the “number” of conjunctively interacting subprocesses that underlie an item, and is referred to as an item complexity parameter. Thus, the asymmetry of the ICC depends on how many conjunctive or disjunctive subprocesses are involved in solving the item.
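To make the conjunctive interpretation concrete, the following minimal Python sketch (illustrative only; the parameter values are hypothetical) computes the LPE response probability by raising the logistic subprocess probability to the exponent ξ:

```python
import math

def subprocess_prob(theta, a, b):
    """Probability of successfully executing one subprocess (2PL form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lpe_prob(theta, a, b, xi):
    """LPE probability of a correct response: all xi conjunctively
    interacting subprocesses must succeed, so the subprocess
    probability is raised to the exponent xi."""
    return subprocess_prob(theta, a, b) ** xi

# At theta = b, a single subprocess succeeds with probability 0.5;
# requiring xi = 3 conjunctive subprocesses makes the item much harder.
print(lpe_prob(0.0, 1.0, 0.0, 1.0))  # 0.5
print(lpe_prob(0.0, 1.0, 0.0, 3.0))  # 0.125
```

With ξ = 1 the model reduces to the ordinary 2PL; larger ξ lowers the probability of success at every proficiency level, as each additional conjunctive subprocess must also be executed correctly.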

Figure 1 displays ICCs for three hypothetical LPE items that differ only in ξ, illustrating how the exponent governs both the difficulty and the asymmetry of the curve.

Item characteristic curves (ICCs) for three hypothetical LPE items (a = 1; items differ in ξ).

As noted earlier, the emphasis in the LPE on conjunctively interacting item subprocesses seemingly provides a close fit to DOMC items, and in particular, the effect related to varying the number of scheduled response options. In particular, the LPE exponent parameter provides a convenient way of accounting for the simultaneous increase in both difficulty and discrimination associated with an increase in the number of response options. Such a phenomenon can be easily captured by the LPE model in (1) and (2) by allowing the ξ parameter to vary in relation to the number of scheduled response options. In particular, for a fixed item we expect ξ to become larger as the number of scheduled response options increases.
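This expectation can be illustrated numerically (with hypothetical parameter values): in the sketch below, increasing ξ lowers the response probability at a fixed θ, mimicking the increased difficulty of items scheduled with more response options. One can also verify from Equations (1) and (2) that the ICC's inflection point sits at θ = b + ln(ξ)/a, so larger ξ shifts the steepest part of the curve toward higher proficiency:

```python
import math

def lpe_prob(theta, a=1.0, b=0.0, xi=1.0):
    """LPE response probability (logistic subprocess raised to xi)."""
    return (1.0 / (1.0 + math.exp(-a * (theta - b)))) ** xi

# Hypothetical mapping from number of scheduled options k to xi.
for k, xi in [(1, 0.6), (2, 1.0), (3, 1.5), (4, 1.8), (5, 2.5)]:
    p_mid = lpe_prob(0.0, xi=xi)   # difficulty: P(correct) at theta = 0
    inflect = math.log(xi)         # inflection b + ln(xi)/a, with a=1, b=0
    print(f"k={k}: P(theta=0)={p_mid:.3f}, inflection at theta={inflect:.3f}")
```

The mapping from k to ξ here is invented for illustration; in the analyses below, the ξ values are estimated from the data.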

The item response data in the current study come from an information technology (IT) certification test used for employment decisions. The test is administered internationally and primarily to respondents with a college education; the current sample consisted primarily of respondents in Asian countries. We consider data from two forms of the test administered in 2016–2017, each containing 59 items, but with an overlap of 35 items across forms, implying a total of 83 distinct items in the dataset. This data structure permits a concurrent calibration of both forms against a common latent metric. Most (54) of the items had a single keyed response option; however, 24 had two keyed responses, and 5 had three keyed responses. The items with three keyed responses had a maximum of five response options, while all other items had a maximum of four response options. A total of 648 examinees provided item response data. All DOMC items were adapted from items originally administered using a traditional multiple-choice format. As examinees must respond to each computer-administered response option to proceed to the next option/item, there were no missing data.

As is typically the case with DOMC items, each administered item is scored correct/incorrect on this IT test regardless of the number of response options administered. Although this form of item scoring is straightforward, it might be viewed as suboptimal to the extent that an item will naturally be more difficult when scheduled with more response options than with fewer. Table 1 illustrates this phenomenon: mean classical item difficulty declines steadily as the number of scheduled response options increases, while mean item discrimination tends to increase.

Table 1. Mean classical item difficulties (proportion correct) and item discriminations (item-total correlations), by number of scheduled response options and number of keyed options.

| # Scheduled response options | Difficulty (1 key) | Discrimination (1 key) | Difficulty (2 keys) | Discrimination (2 keys) | Difficulty (3 keys) | Discrimination (3 keys) |
|---|---|---|---|---|---|---|
| 1 | 0.64 | 0.22 | — | — | — | — |
| 2 | 0.49 | 0.31 | 0.47 | 0.21 | — | — |
| 3 | 0.37 | 0.38 | 0.32 | 0.31 | 0.34 | 0.35 |
| 4 | 0.31 | 0.40 | 0.22 | 0.34 | 0.26 | 0.38 |
| 5 | — | — | — | — | 0.19 | 0.35 |

Various IRT models could be considered to account for the psychometric effects of the scheduled number of response options on item functioning. One possibility is to fit a 2PL model in which the item parameters are indexed by the number of scheduled response options, k:

P_{i(k)}(θ_{j}) = 1 / (1 + exp(−a_{i(k)}(θ_{j} − b_{i(k)}))),

where a_{i(k)} and b_{i(k)} denote the discrimination and difficulty of item i when administered with k scheduled response options. We refer to the model in which both a_{i(k)} and b_{i(k)} vary freely across k as Model 2, while a constrained version with a_{i(k)} = a_{i} and b_{i(k)} = b_{i} for all k, effectively assuming no effect of the number of scheduled response options, is Model 1. We refer to the intermediate model that allows b_{i(k)} to vary but constrains a_{i(k)} = a_{i} as Model 3.

Our expectation is that the LPE will emerge as statistically superior to each of Models 1–3. Specifically, the repeated administration of response options seemingly maps closely to the notion of distinct item steps that motivate the model. Moreover, due to the random administration of response options, we can attach a psychometric equivalence to each of the steps, as each option has equal likelihood of emerging at each step. We consider two versions of the LPE. The first, based on Equations (1) and (2), assumes for each item a unique a_{i} and b_{i} that remain constant across the scheduled number of response options, together with an exponent ξ_{ik} that varies freely as a function of item i and number of scheduled response options k; we refer to this as Model 4. The second constrains the exponent to be proportional to the number of scheduled response options, ξ_{ik} = δ_{i} * k, such that only a single within-item slope parameter δ_{i} is estimated in relation to the exponent of each item. We consider this constrained version of Model 4 as Model 5.
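To keep the five specifications straight, the sketch below (Python; the parameter containers and the δ notation are illustrative, not the authors' code) assembles the response probability for a person with proficiency theta on item i administered with k scheduled response options under each model:

```python
import math

def p2pl(theta, a, b):
    """Two-parameter logistic response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def prob(model, theta, i, k, pars):
    """Response probability under Models 1-5; `pars` holds dicts of item
    parameters keyed by item i (and by (i, k) where the parameter is
    allowed to vary with the number of scheduled response options)."""
    if model == 1:   # 2PL; no effect of k
        return p2pl(theta, pars["a"][i], pars["b"][i])
    if model == 2:   # 2PL; a and b both vary with k
        return p2pl(theta, pars["a"][i, k], pars["b"][i, k])
    if model == 3:   # 2PL; only b varies with k
        return p2pl(theta, pars["a"][i], pars["b"][i, k])
    if model == 4:   # LPE; exponent xi_{ik} free across items and k
        return p2pl(theta, pars["a"][i], pars["b"][i]) ** pars["xi"][i, k]
    if model == 5:   # LPE; xi_{ik} = delta_i * k
        return p2pl(theta, pars["a"][i], pars["b"][i]) ** (pars["delta"][i] * k)
    raise ValueError(model)
```

The key contrast is in parameter count: Models 2 and 3 add a new difficulty (and, for Model 2, discrimination) parameter for each option count, whereas Models 4 and 5 absorb the effect of k into the exponent alone.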

Various approaches to the estimation of the LPE have been presented, although the most promising appear to be those based on Markov chain Monte Carlo (MCMC; see Bolfarine and Bazán). We likewise adopted an MCMC approach, specifying prior distributions for the item parameters of each model,

while for person parameters, a standard normal prior, θ_{j} ~ N(0, 1), was assumed.

For the special case of Model 5, we specified the same prior distributions for the item discrimination and difficulty parameters,

while the constant within-item slope that reflects the change in the exponent in relation to the scheduled number of response options (i.e., ξ_{ik} = δ_{i} * k, with δ_{i} denoting the slope for item i) was assigned its own prior distribution.

Each of the 2PL models considered (Models 1, 2, and 3 above) was specified using the same corresponding prior distributions so as to facilitate comparison of the models.

Markov chains were simulated up to 10,000 iterations, and parameter estimates were based on the posterior means. Convergence of the chains was evaluated using the Gelman-Rubin criterion following a simultaneous simulation of four additional chains for each model.
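For reference, the Gelman-Rubin diagnostic for a single parameter can be sketched as follows (a simplified illustration of the standard formula, not the exact implementation used in the analyses):

```python
import math

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter,
    given m chains of n post-burn-in draws each."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n   # pooled variance estimate
    return math.sqrt(var_plus / W)

# Chains with very different means yield R-hat well above 1,
# signaling non-convergence; well-mixed chains give values near 1.
print(gelman_rubin([[0.1, 0.2, 0.1, 0.2], [5.1, 5.2, 5.1, 5.2]]))
```

Values of R-hat close to 1 for all parameters are taken as evidence that the chains have converged to the same target distribution.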

Table 2 presents an empirical comparison of the five models. Model 4 attains the lowest DIC, and notably does so with a smaller effective number of parameters (pD) than Models 2 and 3.

Table 2. Empirical comparison of IRT models applied to IT certification data.

| Model | Specification | D̄ (posterior mean deviance) | D̂ (deviance at posterior means) | pD | DIC |
|---|---|---|---|---|---|
| 1 | 2PL × Item | 43874.8 | 43142.7 | 732.2 | 44,607 |
| 2 | 2PL × Item × #RespOpt (a and b) | 40681.8 | 39573.7 | 1108.1 | 41,790 |
| 3 | 2PL × Item × #RespOpt (b only) | 40938.9 | 39985.2 | 953.6 | 41,893 |
| 4 | LPE × Item × #RespOpt (ξ_{i} only) | 40798.8 | 40119.8 | 678.9 | 41,478 |
| 5 | LPE × Item (ξ as a linear function of #RespOpt) | 40868.0 | 40185.9 | 682.1 | 41,550 |
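The columns of the fit comparison are related by pD = D̄ − D̂ and DIC = D̄ + pD; for instance, Model 1's entries can be reproduced as follows (values taken from the table; small discrepancies reflect rounding):

```python
def dic_summary(d_bar, d_hat):
    """Deviance information criterion from the posterior mean deviance
    (d_bar) and the deviance at the posterior parameter means (d_hat)."""
    p_d = d_bar - d_hat        # effective number of parameters
    return p_d, d_bar + p_d    # (pD, DIC)

p_d, dic = dic_summary(43874.8, 43142.7)   # Model 1
print(round(p_d, 1), round(dic))           # pD ~ 732.1, DIC ~ 44607
```

Smaller DIC indicates better fit after penalizing model complexity, which is why Model 4's advantage in both D̄ and pD translates into the lowest DIC overall.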

Table 3 summarizes the ξ estimates from Model 4, showing the mean and standard deviation of the estimates in relation to the number of scheduled response options. As anticipated, the means increase steadily, in nearly equal increments, as the number of scheduled response options increases.

Table 3. Mean (Standard Deviation) of ξ estimates in relation to the number of scheduled response options.

| # Scheduled response options | 1 keyed option | 2 keyed options | 3 keyed options |
|---|---|---|---|
| 1 | 0.619 (0.417) | — | — |
| 2 | 1.000 (0.485) | 1.013 (0.460) | — |
| 3 | 1.487 (0.603) | 1.629 (0.610) | 1.468 (0.379) |
| 4 | 1.819 (0.703) | 2.246 (0.717) | 1.832 (0.285) |
| 5 | — | — | 2.534 (0.801) |

It is conceivable that despite these nearly equal intervals, the number of scheduled response options could have a non-linear effect within individual items. Factors that could explain such results include effects related to the differential attractiveness of keyed vs. distractor options, or alternatively, examinee expectations regarding the number of keyed responses per item.

Figure 2 displays estimated ICCs under Model 4 for three example DOMC items, illustrating how the curves become both more difficult and more discriminating as the number of scheduled response options increases.

Estimated item characteristic curves (ICCs) for three DOMC items, IT Certification Test. Item 1: ξ_{1} = 0.441, ξ_{2} = 0.742, ξ_{3} = 1.107, ξ_{4} = 2.303. Item 2: ξ_{1} = 0.526, ξ_{2} = 0.850, ξ_{3} = 1.198. Item 3: ξ_{1} = 1.176, ξ_{2} = 1.450, ξ_{3} = 1.833.

As noted earlier, one practical consequence of the use of asymmetric IRT models concerns their effect on the latent IRT metric. Bolt et al. demonstrated that fitting a symmetric IRT model when the underlying ICCs are asymmetric yields a distorted latent metric, with proficiency estimates artificially compressed at one end of the scale.

We can illustrate this occurrence in relation to the DOMC data by contrasting the latent metric that emerges under Model 4 vs. that seen under Model 2. Figure 3 displays histograms of the proficiency estimates under the two models.

Histograms of proficiency estimates under Model 2 (symmetric 2PL model) and Model 4 (asymmetric LPE model).

The cause of the shrinkage in this context can be recognized by appealing to the estimated ICCs in Figure 2. Because ICCs with large ξ approach their upper asymptote only gradually, items remain informative at high proficiency levels under the LPE; a symmetric model cannot represent this gradual approach, and consequently compresses proficiency estimates at the upper end of the scale.

Our results support the potential benefit of an LPE model in the scoring of test performances using DOMC. For DOMC applications, the LPE has appeal for both psychological and statistical reasons. As a psychological model, the LPE provides a natural mechanism by which multiple conjunctively interacting subprocesses (in this case, the independent responses to different response options) are captured through an exponent parameter. Our findings aligned with expectations in that the increased difficulty and discrimination seen as the last key is located later also corresponded to systematic increases in the ξ parameter estimates. This approach also provides statistical advantages in the sense that we are able to simultaneously account for the difficulty and discrimination effects of key location through one (as opposed to two) added parameter for each potential key location. Further, by accounting for the asymmetry of ICCs, we are able to remove shrinkage in the higher end of the latent proficiency scale, yielding a metric that will be more sensitive to student differences (and thus more suitable for demonstrating growth) at higher proficiency levels.

In a broader sense, our results also extend the work of Lee and Bolt on the interpretation of ICC asymmetry. Because the DOMC design experimentally manipulates the number of conjunctively scored response opportunities within an item, the systematic increase in ξ with the number of scheduled response options provides direct support for interpreting the exponent as an index of item complexity.

We believe that additional psychometric study of asymmetry in IRT models is warranted. Such research can focus not only on the models themselves, but also on the consequences of ignoring asymmetry when present. As assessments deliberately seek to incorporate items of greater complexity, and as computer-based assessments open the door to unique item types such as DOMC that have the potential for greater complexity, models that attend more closely to response processes will likely become increasingly important. Many additional directions of research with DOMC items also remain, of course, including closer comparisons against traditional multiple-choice formats.

DB, SL, JW, and CE contributed to conception and design of the study. JS organized the database. DB and SL performed the statistical analysis. DB wrote the first draft of the manuscript. SL, JW, and CE wrote sections of the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.