
Edited by: Ronny Scherer, University of Oslo, Norway

Reviewed by: Alvaro J. Arce-Ferrer, Pearson, United States; Alexander Robitzsch, Christian-Albrechts-Universität zu Kiel, Germany

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The increasing digitalization in the field of psychological and educational testing opens up new opportunities to innovate assessments in many respects (e.g., new item formats, flexible test assembly, efficient data handling). In particular, computerized adaptive testing provides the opportunity to make tests more individualized and more efficient. The newly developed continuous calibration strategy (CCS) from

The shift to using digital technology (e.g., laptops, tablets, and smartphones) for psychological and educational assessments provides the opportunity to implement computer-based state-of-the-art methods from psychometrics and educational measurement in day-to-day testing practice. In particular, computerized adaptive testing (CAT) has the potential to make tests more individualized and to enhance efficiency.

To overcome this problem,

In their study,

Against this background, the aim of the present study was to investigate the performance of the equating procedure for different setups conducted under more realistic conditions (i.e., examinees’ average abilities and variance differ between test cycles). The remainder of the article is organized as follows: First, we provide the theoretical background for the present study by introducing the underlying IRT model and by describing the CCS. Next, we discuss both the previously implemented equating procedure and alternative specifications. Then, we examine the performance of different setups of the different equating procedures in a simulation. Finally, we discuss the results and make recommendations for the implementation of the CCS.

The IRT model used in this study was the two-parameter logistic (2PL) model in its slope-intercept parametrization. It models the probability of a correct response X_{ij} = 1 of examinee j = 1…N with a latent ability level 𝜃_{j} to an item i by the following model, whereby a_{i} is the discrimination parameter and b_{i} is the easiness parameter of item i:

P(X_{ij} = 1 | 𝜃_{j}) = exp(a_{i}𝜃_{j} + b_{i}) / (1 + exp(a_{i}𝜃_{j} + b_{i}))

In the traditional IRT metric, where a_{i}𝜃_{j} + b_{i} = a_{i}(𝜃_{j} − d_{i}), the discrimination parameters a_{i} are identical for both parametrizations, while the item difficulty parameter d_{i} is calculated as d_{i} = −b_{i}/a_{i}.
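The equivalence of the two parametrizations can be illustrated with a short sketch (Python here for illustration; the study itself used R). The function and variable names are our own, not from the original study.

```python
import math

def p_correct(theta, a, b):
    """2PL model in slope-intercept form:
    P(X = 1 | theta) = exp(a*theta + b) / (1 + exp(a*theta + b))."""
    z = a * theta + b
    return math.exp(z) / (1.0 + math.exp(z))

def easiness_to_difficulty(a, b):
    """Traditional metric a * (theta - d) implies d = -b / a."""
    return -b / a

# Both parametrizations yield the same response probability:
a, b = 1.2, 1.0
d = easiness_to_difficulty(a, b)
p_slope_intercept = p_correct(0.7, a, b)
p_traditional = 1.0 / (1.0 + math.exp(-a * (0.7 - d)))
```

An examinee of average ability (𝜃 = 0) answers an item with positive easiness (b > 0) correctly with probability greater than 0.5, consistent with the interpretation of b as an easiness parameter.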

In the following paragraphs, we briefly outline the CCS as introduced by

In the first step of the equating procedure, the common items are selected from the previously calibrated items based on their easiness parameters b_{i}.

After test assembly and test administration, the parameters for the common items are estimated based on the responses of the current test cycle. In the second step of the equating procedure, a test for item parameter drift (IPD) of the common items is conducted.

As item parameter drift may have a serious impact on equating results such as scaled scores and passing rates, the common items are tested for IPD with Lord's χ^{2}-test.

The last step of the equating procedure, the scale transformation, places the item parameters estimated in the current test cycle onto the scale of the existing item pool.

The common item selection and the scale transformation of the common items are crucial parts of the CCS because they ensure that the procedure functions well. In terms of the common item selection, different distributional assumptions are conceivable, such as the approximated normal distribution used in the original implementation of the CCS.

When item parameters are estimated using different groups of examinees, the obtained parameters are often not comparable due to arbitrary decisions that have been made to fix the scale of the item and person parameter space (

where 𝜃_{Kj} and 𝜃_{Lj} denote the person parameter values for an examinee j, and a_{Ki}, b_{Ki} and a_{Li}, b_{Li} represent the item parameters on scales K and L, respectively.

To obtain the transformation constants A and B, four scale transformation methods are commonly used: the moment methods Mean/Mean and Mean/Sigma and the characteristic curve methods by Haebara and Stocking and Lord.
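The two moment methods can be sketched compactly on the traditional difficulty metric (d = −b/a). This is a Python illustration under standard conventions for the linear transformation 𝜃_{K} = A𝜃_{L} + B; the function names and the toy parameter values are our own.

```python
from statistics import mean, stdev

def mean_mean(a_K, a_L, d_K, d_L):
    """Mean/Mean: A from the mean discriminations, B from the mean difficulties,
    for the linear transformation theta_K = A * theta_L + B."""
    A = mean(a_L) / mean(a_K)
    B = mean(d_K) - A * mean(d_L)
    return A, B

def mean_sigma(d_K, d_L):
    """Mean/Sigma: A from the standard deviations of the difficulties."""
    A = stdev(d_K) / stdev(d_L)
    B = mean(d_K) - A * mean(d_L)
    return A, B

# If scale L is an exact linear re-expression of scale K
# (here d_K = 2 * d_L + 0.5 and a_K = a_L / 2),
# both moment methods recover A = 2 and B = 0.5.
d_K = [-1.0, 0.0, 1.0, 2.0]
d_L = [(d - 0.5) / 2.0 for d in d_K]
a_L = [1.6, 2.4, 2.0, 1.2]
a_K = [x / 2.0 for x in a_L]
A_ms, B_ms = mean_sigma(d_K, d_L)
A_mm, B_mm = mean_mean(a_K, a_L, d_K, d_L)
```

In practice the two scales differ by estimation error, so the methods generally disagree; the simulation below compares how well each recovers the true constants.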

As the purpose of equating procedures in the CCS is to enable an interchangeable score interpretation across test cycles, the selection of the common items is a crucial factor for feasible equating. Up to now, only recommendations for the number of common items that should be used when conducting IRT equating have been made.

What effect does the difficulty distribution of the common items in the CCS have on the precision of the item parameter estimates?

What effect does the difficulty distribution of the common items in the CCS have on the quality of the equating?

What effect does the scale transformation method used in the CCS have on the quality of the equating?

As the CCS was developed for a context in which separate calibration studies are often not feasible and sample sizes are too low to allow for stable item parameter estimation, it is important to evaluate whether the results for these three research questions were affected by the sample size. Consequently, each of the three research questions was investigated with a special focus on additional variations of the sample size.

Many factors can affect the quality of the equating within the CCS. These include, among others, the number of common items, the test length, the characteristics of the common items, the scale transformation method applied, the number of examinees per test cycle, the presence of IPD and the test applied for IPD. In the present study, some of these factors were kept constant (e.g., number of common items, test length, the presence of IPD, test applied for IPD) to ensure the comprehensibility of the study results.

To answer the research questions stated above, a Monte Carlo simulation based on a full factorial design with three independent variables (IVs) was conducted. With the first IV, the distribution of the easiness parameters b_{i} of the common items (normal, uniform, and bimodal with very low and very high difficulties only) was varied. The second IV was the scale transformation method (Mean/Mean, Mean/Sigma, Haebara, and Stocking-Lord). The third IV was the sample size per test cycle, which was varied at three levels. Fully crossed, this resulted in 3 × 4 × 3 = 36 conditions.
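A full factorial crossing of the three IVs can be enumerated in a few lines (Python sketch; the concrete sample-size labels below are placeholders, not the values used in the study):

```python
from itertools import product

# Levels of the three independent variables; the sample-size labels
# are hypothetical stand-ins for the study's actual values.
difficulty_distributions = ["normal", "uniform", "bimodal"]
transformation_methods = ["Mean/Mean", "Mean/Sigma", "Haebara", "Stocking-Lord"]
sample_sizes = ["small", "medium", "large"]

# Full factorial design: every combination of levels is one condition.
conditions = list(product(difficulty_distributions,
                          transformation_methods,
                          sample_sizes))
```

The length of `conditions` is 3 × 4 × 3 = 36, matching the 36 conditions referred to in the evaluation below.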

The simulations were carried out in R.

In each replication, the discrimination parameters a_{i} were drawn from a lognormal distribution, and the easiness parameters b_{i} were drawn from a truncated normal distribution, b_{i} ∼ N(0, 1.5) with b_{i} ∈ (−2.5, 2.5). Since this study was not designed to investigate IPD detection rates, the parameters a_{i} and b_{i} remained unchanged over the test cycles.

The ability parameters of the examinees in the first test cycle in each replication were randomly drawn from a standard normal distribution, 𝜃 ∼ N(0, 1). In each subsequent test cycle t, the ability parameters were drawn from 𝜃 ∼ N(μ_{t}, σ_{t}), whereby the mean μ_{t} ∈ (−0.5, 0.0, 0.5) and the standard deviation σ_{t} ∈ (0.7, 1.0, 1.3) were randomly drawn. This was done to mimic the fact that examinees of different test cycles usually differ with respect to the mean and variance of their ability distribution. The examinees’ responses to the items were generated in line with the 2PL model.
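The data-generating step can be sketched as follows (Python for illustration; the study used R, and the lognormal parameters for a_{i} below are assumptions since the exact values are not reproduced here):

```python
import math
import random

random.seed(1)

def draw_item_parameters(n_items):
    """a_i from a lognormal distribution (parameters assumed here),
    b_i from N(0, 1.5) truncated to (-2.5, 2.5)."""
    a = [random.lognormvariate(0.0, 0.3) for _ in range(n_items)]
    b = []
    while len(b) < n_items:
        candidate = random.gauss(0.0, 1.5)
        if -2.5 < candidate < 2.5:   # rejection sampling implements the truncation
            b.append(candidate)
    return a, b

def simulate_response(theta, a_i, b_i):
    """Bernoulli draw according to the 2PL model in slope-intercept form."""
    p = 1.0 / (1.0 + math.exp(-(a_i * theta + b_i)))
    return 1 if random.random() < p else 0

a, b = draw_item_parameters(60)
theta = random.gauss(0.0, 1.0)     # first test cycle: theta ~ N(0, 1)
responses = [simulate_response(theta, ai, bi) for ai, bi in zip(a, b)]
```

Rejection sampling is the simplest way to realize the truncation of b_{i}; for the narrow bounds used here, the acceptance rate is high enough that efficiency is not a concern.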

The CCS in the current study was applied with all seven of its originally proposed steps, with 10 test cycles, an item pool size of I_{t} = 60 + (t − 1) ⋅ 20 after test cycle t, and a total item pool size of 240 items after the 10th test cycle. Following the recommendation regarding the number of common items, 15 common items were used for each equating.
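The pool-growth rule is easy to verify with a small helper (Python for illustration):

```python
def pool_size(t, initial_items=60, new_items_per_cycle=20):
    """Item pool size after test cycle t: I_t = 60 + (t - 1) * 20."""
    return initial_items + (t - 1) * new_items_per_cycle

# Pool sizes over the 10 test cycles: 60, 80, ..., 240.
sizes = [pool_size(t) for t in range(1, 11)]
```

Starting with 60 items and adding 20 new items per cycle yields the stated total of 240 items after the 10th test cycle.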

For the common item selection within the equating procedure, only items that had already been calibrated in the previous test cycles and that did not serve as common items in the preceding test cycle were eligible. The selection procedure for the common items differed depending on the intended distribution. For the normal distribution, the eligible items were assigned to five categories (very low, low, medium, high, very high) based on their easiness parameters b_{i}. Then, five items from the “medium” category, three items each from the “low” and “high” categories, and two items from each of the extreme categories were chosen to mimic a normal distribution. For the uniform distribution, the eligible items were assigned to 15 categories based on their easiness parameters b_{i} and one item from each category was drawn. The interval limits of the categories were determined as quantiles of the item difficulty distribution. For the bimodal distribution, the eligible items were ordered according to their easiness parameters b_{i} and two subsamples were formed containing the 11 easiest and the 11 hardest items, respectively. Then, 15 items in total were randomly drawn from the two subsamples (seven easy and eight difficult items, or vice versa). As already mentioned, the selected common items in periodical assessments should also be comparable with regard to content characteristics. Content balancing approaches like the maximum priority index could be used for this purpose.
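The normal-mimicking selection can be sketched as follows (Python; the equal-size split into five categories is a simplification of this sketch, and all names are our own):

```python
import random

random.seed(7)

def select_common_items_normal(items, counts=(2, 3, 5, 3, 2)):
    """Sort eligible items by easiness b_i, split them into five categories
    (very low ... very high), and draw 2/3/5/3/2 items from the respective
    categories to mimic a normal difficulty distribution."""
    ordered = sorted(items, key=lambda item: item["b"])
    k = len(ordered) // 5
    categories = [ordered[i * k:(i + 1) * k] for i in range(4)] + [ordered[4 * k:]]
    chosen = []
    for category, n in zip(categories, counts):
        chosen.extend(random.sample(category, n))
    return chosen

# Toy pool of 60 eligible, previously calibrated items.
eligible = [{"id": i, "b": random.uniform(-2.5, 2.5)} for i in range(60)]
common_items = select_common_items_normal(eligible)   # 15 common items in total
```

The 2/3/5/3/2 counts sum to the 15 common items used per equating, with most items drawn from the medium-difficulty range, as a normal distribution implies.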

For the scale transformation, one of the four transformation methods (Mean/Mean, Mean/Sigma, Haebara, and Stocking-Lord) was applied. A modified version of Lord’s chi-squared method was used to test the common items for IPD. The lower and upper bounds for the item discrimination parameters a_{i} were set to –1 and 5, respectively. For the item easiness parameters b_{i}, the bounds were set to –5 and 5.
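The characteristic curve methods minimize a loss over the transformation constants rather than matching moments. A simplified sketch of the Haebara criterion on the traditional difficulty metric is shown below (Python; the crude grid search merely stands in for the numerical optimizer a real implementation would use, and the toy parameters are our own):

```python
import math

def p2pl(theta, a, d):
    """2PL in the traditional metric a * (theta - d)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - d)))

def haebara_loss(A, B, a_K, d_K, a_L, d_L, grid):
    """Sum of squared differences between the target item characteristic curves
    and the curves of the transformed scale-L parameters (a/A, A*d + B)."""
    return sum((p2pl(t, aK, dK) - p2pl(t, aL / A, A * dL + B)) ** 2
               for t in grid
               for aK, dK, aL, dL in zip(a_K, d_K, a_L, d_L))

# Scale L is an exact linear re-expression of scale K with A = 2, B = 0.5.
a_K = [1.0, 1.5, 0.8]; d_K = [-0.5, 0.3, 1.1]
a_L = [2.0, 3.0, 1.6]; d_L = [-0.5, -0.1, 0.3]
grid = [x / 2.0 for x in range(-8, 9)]   # quadrature points on the theta scale

# Crude grid search over A in [1.0, 3.0] and B in [-1.0, 1.0], step 0.1.
loss, A_hat, B_hat = min(
    (haebara_loss(i / 10, j / 10, a_K, d_K, a_L, d_L, grid), i / 10, j / 10)
    for i in range(10, 31) for j in range(-10, 11))
```

Because the loss is accumulated over entire item characteristic curves, the characteristic curve methods use more information than the moment methods, which is one explanation for their better performance reported below.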

The mean squared error (MSE) of the estimates of the item parameters a_{i} and b_{i}, respectively, was calculated after each test cycle t across all R = 200 replications. Thus, a high degree of precision is denoted by low values for the MSE.

Because our aim was to evaluate whether the modified common item selection could prevent a dysfunction of the CCS in terms of more precise item parameter estimates for items with very low and very high values for b_{i}, the conditional MSE was calculated for seven intervals of the easiness parameters: b_{i} ∈ (−∞, −2], b_{i} ∈ (−2, −1], b_{i} ∈ (−1, −0.25], b_{i} ∈ (−0.25, 0.25], b_{i} ∈ (0.25, 1], b_{i} ∈ (1, 2], and b_{i} ∈ (2, ∞).
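Binning squared errors by the true easiness value yields the conditional MSE; a minimal sketch (Python, with interval edges matching the seven intervals and closed on the right):

```python
def conditional_mse(true_b, est_b, edges=(-2.0, -1.0, -0.25, 0.25, 1.0, 2.0)):
    """MSE of the easiness estimates, conditional on the interval that the
    true b_i falls into: (-inf,-2], (-2,-1], (-1,-0.25], (-0.25,0.25],
    (0.25,1], (1,2], (2,inf)."""
    squared_errors = {}
    for bt, be in zip(true_b, est_b):
        idx = sum(bt > e for e in edges)   # index of the interval containing bt
        squared_errors.setdefault(idx, []).append((be - bt) ** 2)
    return {idx: sum(v) / len(v) for idx, v in squared_errors.items()}

# Toy example: five items with known true and estimated easiness values.
true_b = [-2.4, -1.5, 0.0, 1.5, 2.3]
est_b = [-2.2, -1.4, 0.1, 1.4, 2.6]
mse_by_interval = conditional_mse(true_b, est_b)
```

In the study this statistic is additionally averaged over the R = 200 replications per condition, which the sketch omits.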

Three criteria were used to evaluate the equating quality. As a first criterion, we used the proportion of test cycles in which no breakdown of the common items occurred. Second, we calculated the proportion of drifted items for each of the 36 conditions. And third, we computed the accuracy of the estimated transformation constants A and B.

The true transformation constants A and B were determined from the data-generating parameters.

The estimated transformation constants A and B were obtained by applying the respective scale transformation method in each replication.

Note that the conditions with the Mean/Mean method as the scale transformation method and normally distributed common items mimic the setup of the equating procedure from the original CCS.

To answer the first research question regarding the precision of the item parameter estimates, we analyzed the conditional MSE of the estimates of the item discrimination parameters a_{i} and the item easiness parameters b_{i} depending on the scale transformation method, the common item difficulty distribution, and the sample size per test cycle. For the sake of clarity, the results are only presented for the 2nd, 6th, and 10th test cycles of the CCS. As can be expected based on the findings from

Conditional mean squared error (MSE) of the item parameter estimates for specific item easiness intervals after the 2nd, 6th, and 10th test cycles in the continuous calibration strategy with a sample size per test cycle of

Conditional mean squared error (MSE) of the item parameter estimates for specific item easiness intervals after the 2nd, 6th, and 10th test cycles in the continuous calibration strategy with a sample size per test cycle of

Conditional mean squared error (MSE) of the item parameter estimates for specific item easiness intervals after the 2nd, 6th, and 10th test cycles in the continuous calibration strategy with a sample size per test cycle of

Conditional mean squared error (MSE) of the item parameter estimates for specific item easiness intervals after the 2nd, 6th, and 10th test cycles in the continuous calibration strategy with a sample size per test cycle of

Conditional mean squared error (MSE) of the item parameter estimates for specific item easiness intervals after the 2nd, 6th, and 10th test cycles in the continuous calibration strategy with a sample size per test cycle of

Conditional mean squared error (MSE) of the item parameter estimates for specific item easiness intervals after the 2nd, 6th, and 10th test cycles in the continuous calibration strategy with a sample size per test cycle of

The second and third research questions focused on the equating procedure. The first evaluation criterion was the proportion of feasible equatings (i.e., at least two items remained after the IPD detection). Most strikingly, a breakdown of the common items did not occur in any test cycle across all replications. Furthermore, for all 36 conditions, the median number of eligible common items over all test cycles and replications ranged from 14 to 15.

The second evaluation criterion was the proportion of drifted items. As IPD was not simulated in the study and because the type I error level of the test for IPD was set to 0.05, it was expected that approximately five percent of the common items would show significant IPD.

Proportion of drifted items in the continuous calibration strategy for different sample sizes per test cycle, different common item difficulty distributions, and different scale transformation methods (MM = Mean/Mean, MS = Mean/Sigma, HB = Haebara, SL = Stocking-Lord). The dashed line represents the type I error level of 0.05.

The third evaluation criterion was the accuracy of the transformation constants A and B.

Error of the transformation constant

Error of the transformation constant

In summary and in terms of the three research questions, the study provided the following results:

The difficulty distribution of the common items in the CCS did not have a substantial impact on the precision of the item parameter estimates, although small differences existed between the common item distributions; when the sample size was very small, these differences ran in opposite directions for extreme versus medium-ranged item easiness parameters b_{i}.

With regard to the proportion of feasible equatings (at least two common items remained after the test for IPD), no differences were found, regardless of the common item difficulty distribution, the scale transformation method, or the sample size.

The characteristic curve methods outperformed the moment methods in terms of the error of the transformation constants. Especially for small sample sizes, the Mean/Sigma method cannot be recommended.

The objective of the present study was to evaluate different setups of the equating procedure implemented in the CCS and to provide recommendations on how to apply these setups. For this purpose, the quality of the item parameter estimates and of the equating was examined in a Monte Carlo simulation for different common item difficulty distributions, different scale transformation methods, and different sample sizes per test cycle.

The following recommendations can be made based on the results obtained: First, no clear advantage of using any of the three common item difficulty distributions was identified. Regarding the precision of the item parameter estimates, the results show a slight increase in precision for items with extreme difficulties when using a bimodal common item difficulty distribution compared to a normal or uniform distribution. However, the precision for items with medium difficulty decreased. These effects were only found for very small sample sizes per test cycle.

Note that exposure control methods (e.g.,

Second, with respect to the quality of the equating, no difference was found between the scale transformation methods with regard to the proportion of feasible equatings, independent of the common item difficulty distribution used and the sample size available per test cycle. The rule for evaluating an equating as feasible (at least two common items remained after the test for IPD) is worthy of discussion for two reasons: first, with a small number of remaining common items, the equating procedure is more prone to sampling error.

SB conceived the study, conducted the statistical analyses, drafted the manuscript, and approved the submitted version. AFi made substantial contributions to the conception of the study, contributed to the programming needed for the simulation study (R), reviewed the manuscript critically for important intellectual content, and approved the submitted version. CS made substantial contributions to the interpretation of the study results, reviewed the manuscript critically for important intellectual content, and approved the submitted version. AFr provided advice in the planning phase of the study, reviewed the manuscript critically for important intellectual content, and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The handling Editor declared a shared affiliation, though no other collaboration, with one of the authors AFr at the time of review.