
Edited by: Holmes Finch, Ball State University, USA

Reviewed by: Yoon Soo Park, University of Illinois at Chicago, USA; Yi Zheng, Arizona State University, USA

*Correspondence: Andreas Frey

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Multidimensional adaptive testing (MAT) is a highly efficient method for the simultaneous measurement of several latent traits. Currently, no psychometrically sound approach is available for the use of MAT in testlet-based tests. Testlets are sets of items sharing a common stimulus such as a graph or a text. They are frequently used in large operational testing programs like TOEFL, PISA, PIRLS, or NAEP. To make MAT accessible for such testing programs, we present a novel combination of MAT with a multidimensional generalization of the random effects testlet model (MAT-MTIRT). MAT-MTIRT is compared to non-adaptive testing for several combinations of testlet effect variances (0.0, 0.5, 1.0, and 1.5) and testlet sizes (3, 6, and 9 items) in a simulation study considering three ability dimensions with a simple loading structure. MAT-MTIRT outperformed non-adaptive testing regarding the measurement precision of the ability estimates. Further, the measurement precision decreased when testlet effect variances and testlet sizes increased. The suggested combination of the MTIRT model with MAT therefore provides a solution to the substantial problems of testlet-based tests while keeping the length of the test within an acceptable range.

Multidimensional adaptive testing (MAT) is a highly efficient method for the simultaneous measurement of several latent traits (e.g., Frey and Seitz,

The current lack of operational MAT applications might be due to the relative inflexibility of the “pure” MAT algorithms which were formulated at the onset of MAT-related research in the 1990s (Luecht,

At present, the application of MAT may also be hindered by the fact that it is formulated at the level of items while procedures to adequately process item pools consisting of testlets are missing. Testlets are sets of items sharing the same stimulus such as a graph, a picture, a reading passage, or other context elements. They are used, for example, in nearly all major large-scale assessments of student achievement such as PISA, PIRLS, or NAEP.

Optimally, a new approach for using testlets in MAT should provide a solution to the frequently reported problem of local item dependence (LID) between the items of the same testlet, which is typically not yet addressed in operational large-scale assessments. LID is present if non-zero inter-correlations between items remain after the level of the latent trait or latent traits measured by the items has been controlled for. In several studies, LID between the items of a testlet has been observed (e.g., Monseur et al.,

Even though addressing LID with appropriate psychometric models would be necessary in order to avoid the above-mentioned problems, the high complexity of such models would either necessitate prolonging testing sessions, if comparable standard errors of the statistics of interest are to be obtained, or reducing the amount or grade of differentiation of the measured content. Both would be problematic for most large-scale assessment programs. In any case, LID is a serious problem that has the potential to spuriously boost observed measurement precision and reliability estimates and to jeopardize the inferences derived from large-scale assessment results. Thus, LID is an issue psychometrics has to face and solve in order to provide unbiased and efficient estimates as prerequisites of valid test score interpretations. As mentioned above, MAT has been shown to have very high measurement efficiency and, furthermore, provides a possibility to use an appropriate psychometric model for testlet-based tests. With this flexibility, MAT avoids unwanted effects on the parameter estimates of interest without increasing the length of testing sessions or limiting the measured content.

This study presents and evaluates a new method that combines the merits of a complex psychometric model with testlet effects with the above mentioned very high measurement efficiency of MAT (compared to non-adaptive sequential testing and UCAT). Thereby, (a) MAT's applicability is expanded to the large group of testlet-based tests, and (b) an adequate solution to the problem of how to handle LID between the items of a testlet is demonstrated.

The rest of the text is organized as follows: First, a MIRT model parameterizing LID between the items of the same testlet is introduced. Next, a possibility for how to combine this model with MAT is described. Based on this, the research questions for the simulation study carried out to evaluate the suggested combination are stated. Then, the methods and results of the simulation study are presented. Finally, the results and their implications for practical applications of testlet-based MAT are discussed.

Several psychometric models have been proposed to account for LID between the items of a testlet. One of the first was presented by Rosenbaum (

An alternative model which retains the information of single items and parameterizes effects caused by LID is the testlet IRT (TIRT) model with random effects. It was introduced by Bradlow et al. The probability of a correct response of person j to an item i is given in this model by
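The equation itself did not survive the conversion of this text. A standard rendering of the random effects testlet model in its 3PL form is the following reconstruction (since γ_{jd(i)} has mean 0, the sign of the testlet effect term is a matter of convention):

```latex
P(Y_{ij}=1 \mid \theta_j) \;=\; c_i + (1 - c_i)\,
  \frac{\exp\!\bigl(a_i(\theta_j - b_i + \gamma_{jd(i)})\bigr)}
       {1 + \exp\!\bigl(a_i(\theta_j - b_i + \gamma_{jd(i)})\bigr)}
\qquad (1)
```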

The difficulty, discrimination, and pseudo-guessing parameters of item i are denoted by b_{i}, a_{i}, and c_{i}, respectively. Additionally, the so-called testlet effects, γ_{jd(i)}, are introduced to model LID between the items of testlet d(i). To achieve this, and in order for the model to be identified, θ_{j} and γ_{jd(i)} are assumed to be uncorrelated and normally distributed, with the mean and the variance of the θ_{j} distribution and the mean of the γ_{jd(i)} distribution being fixed (e.g., means set to 0 and variance set to 1). Imposing these restrictions is sufficient for the model to be identified and makes unequivocal interpretations of θ_{j} and γ_{jd(i)} possible. In practice, the estimated variances of the γ_{jd(i)} distributions quantify the amount of LID within the respective testlets.

For the estimation of the model parameters, Wainer et al. suggest an MCMC method in a Bayesian framework with normal priors such as θ_{j}~N(0, 1) and normal hyperpriors for the item parameter means μ_{a}, μ_{b}, and μ_{c}; the corresponding variance parameters are given scaled inverse-χ² hyperpriors with ν_{z} degrees of freedom, with ν_{z} set to 0.5. These distributional assumptions are typical for Bayesian analyses of high dimensional IRT models. Further information can be found in Wainer et al.

A closely related approach for considering LID between items of a testlet is the bi-factor model, in which each item loads on a general ability dimension and on one testlet-specific dimension d(i). In contrast to the random effects testlet model in Equation (1), the item discriminations are allowed to have different values for the general ability dimension and the testlet dimension in the bi-factor model. Formally, it is given by
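A common 2PL form of the bi-factor testlet model, written with a separate discrimination for the testlet dimension, is the following sketch of the notation (not necessarily the article's exact parameterization):

```latex
P(Y_{ij}=1 \mid \theta_j, \gamma_{jd(i)}) \;=\;
  \frac{\exp\!\bigl(a_{i1}\theta_j + a_{i2}\gamma_{jd(i)} - b_i\bigr)}
       {1 + \exp\!\bigl(a_{i1}\theta_j + a_{i2}\gamma_{jd(i)} - b_i\bigr)}
\qquad (2)
```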

Li et al. (

To provide a testlet model for MAT, we suggest generalizing the model in Equation (1) to the multidimensional case. For this purpose, we replaced the ability parameter θ by the ability vector θ_{j} = (θ_{1}, …, θ_{P})′ entailing the abilities for the P measured dimensions, and the scalar discrimination parameter a_{i} for item i by the discrimination vector a_{i}′. Furthermore, b_{i} and γ_{jd(i)} are both multiplied with the discrimination of item i.
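Under a simple loading structure, a plausible rendering of this multidimensional generalization is the following reconstruction, where a_{i} denotes the single nonzero element of a_{i}′, so that b_{i} and γ_{jd(i)} are both multiplied with the discrimination:

```latex
P(Y_{ij}=1 \mid \boldsymbol{\theta}_j) \;=\; c_i + (1 - c_i)\,
  \frac{\exp\!\bigl(\mathbf{a}_i'\boldsymbol{\theta}_j - a_i b_i + a_i\gamma_{jd(i)}\bigr)}
       {1 + \exp\!\bigl(\mathbf{a}_i'\boldsymbol{\theta}_j - a_i b_i + a_i\gamma_{jd(i)}\bigr)}
\qquad (3)
```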

As for the unidimensional random effects testlet model, the γ_{jd(i)} parameters are assumed to be uncorrelated with each other and with θ_{j}, while mutual correlations between the ability dimensions are allowed. The γ_{jd(i)} parameters for the model in Equation (3) can also be regarded as testlet-specific random nuisance factors, modeling systematic variance caused by LID without affecting the means of θ_{j}. Further, the ability dimensions θ_{1}, …, θ_{P} and the testlet dimensions γ_{1(i)}, …, γ_{D(i)} are assumed to be normally distributed with means fixed (e.g., to 0) and the variances of the distributions for the ability dimensions also fixed (e.g., to values known from a previous study or to 1). As an alternative to fixing the ability variances, at least one item discrimination per dimension can be fixed. For the estimation of abilities θ_{j}, item parameters a_{i}, b_{i}, and c_{i}, and testlet parameters γ_{jd(i)}, MCMC-estimation in a Bayesian framework is recommended, because the superiority of this kind of estimation compared to MML estimation reported for the unidimensional TIRT model (Wainer et al.) can be expected to generalize to the multidimensional case. Suitable priors are θ_{j}~MVN(μ_{θ}, Σ_{θ}) and a_{i}~MVN(μ_{a}, Σ_{a}), where μ_{θ} is the mean vector and Σ_{θ} the variance-covariance matrix of the ability distribution; the priors for b_{i}, c_{i}, and γ_{jd(i)} are specified as described above for the unidimensional testlet model of Wainer et al., with the hyperprior μ_{a}~MVN(0, V_{a}) and V_{a} as defined before. Note that the number of additional parameters that need to be estimated for the multidimensional model in Equation (3) (item discriminations and covariances between ability dimensions) is not that much larger than under its unidimensional predecessor. Thus, it is a promising candidate to correspond to the simulation results of the unidimensional version and thus makes it possible to accurately estimate all included item parameters and abilities. This expectation is even stronger if reliable information about some of the model parameters is available. This, for example, is the case in CAT, where item parameters are estimated beforehand in a calibration study.
In such cases, where item parameters are fixed to the values from the calibration study, standard estimation techniques such as MML estimation with Newton Raphson integration should also provide provisional ability and testlet effect estimations with reasonable accuracy.

Just as with the conventional M3PL without testlet effects, the 2PL version of the MTIRT model results if the pseudo-guessing parameters c_{i} are fixed to 0 and the components of a_{i}′ are allowed to have the values 1 and 0 only, indicating whether an item loads on a dimension or not.

Figure

An MTIRT model with 12 items (Y_{1} to Y_{12}), each loading on one of two ability variables θ_{1} and θ_{2}, and with four testlet effect variables γ_{1} to γ_{4}. For identifiability, the means of all latent variables are set to 0 and the variances of θ_{1} and θ_{2} to 1.

Several methods have been proposed for item selection in MAT such as maximizing the determinant of the Fisher information matrix (Segall,

One important aspect of the method proposed by Segall (

The first summand, Φ^{−1}, is the inverse of the variance-covariance matrix as defined in Equation (4). The second,

The diagonal elements of this matrix take the form
where Q_{i} = 1 − P_{i}.

The off-diagonal elements are given by

The third summand represents the information of the candidate item i^{*}. It has the same form as specified by Equations (6)–(8), with the difference that it represents only one item. The item i^{*} which results in the largest determinant of the matrix in Equation (5) is selected from the item pool. If no prior information is to be used, Φ^{−1} can be dropped from Equation (5).

Since it is not feasible for a testlet-based test to present single items out of a testlet, Equation (5) cannot be directly combined with the model in Equation (3) to build a testlet-based MAT. Two modifications are needed in order to apply the rationale behind Equations (4) and (5) for tests which are composed of testlets.

Firstly, an information measure for complete testlets instead of single items needs to be at hand, so that complete testlets can be selected for presentation. Several procedures have been used to calculate such a measure. A commonly applied procedure (e.g., Keng) is to sum the information of the items of the candidate testlet t^{*}:

Note that the sum of the item information is only a good choice if all testlets of the test are of the same size. If the testlets differ in the number of items, testlets entailing more items will obviously be favored over smaller testlets. In this case, the mean item information of the items of the same testlet is an alternative to using the sum.

The estimation of the testlet information within a MAT process is straightforward: as the testlet effects are defined at the level of individuals, they are not known for the candidate testlets, because these have not yet been answered by the participant. Therefore, the testlet effects of candidate testlets are set to their expected value of 0 when the testlet information is calculated.

Secondly, the variance-covariance matrix has to be expanded: each testlet effect γ_{jd(i)}, which is technically an additional dimension, has to be included in the matrix. With P ability dimensions and D testlets, the expanded matrix Φ_{v} is of size (P + D) × (P + D).

The information matrices are expanded in the same way as the test moves forward, to ensure that all matrices conform with Φ_{v}. The expanded information matrix consists of four submatrices.

The first submatrix

The second submatrix

The explicit form for

The third submatrix

The explicit form is given by

The fourth submatrix is the transpose of the matrix in Equation (15).

In the course of testlet-based MAT, after the completion of each testlet, the testlet information of each candidate testlet t^{*} is calculated conditional upon the provisional ability vector, with Φ_{v} as given in Equation (10). The last summand represents the candidate testlet t^{*}. It has the same form as specified by Equations (11) to (16), with the difference that it entails only the items of the candidate testlet t^{*} and is not summed across the testlets already administered.

The testlet with the largest determinant of the resulting information matrix is selected for presentation next.
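The determinant-based selection rule can be illustrated with a small, self-contained sketch. The 2PL information matrices and the toy numbers below are our own illustration of a Segall-style Bayesian selection, not code from the study:

```python
import math

def det(m):
    """Determinant via Laplace expansion; fine for the small matrices used here."""
    if len(m) == 1:
        return m[0][0]
    return sum(((-1) ** j) * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def mat_add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def item_info_matrix(theta, a_vec, b):
    """2PL contribution p*q*a*a' of one item to the information matrix."""
    z = sum(ai * ti for ai, ti in zip(a_vec, theta)) - b
    p = 1.0 / (1.0 + math.exp(-z))
    w = p * (1.0 - p)
    return [[w * ai * aj for aj in a_vec] for ai in a_vec]

def select_testlet(theta, w_current, candidates):
    """Return the candidate testlet maximizing det(accumulated + candidate info)."""
    best_name, best_det = None, -math.inf
    for name, items in candidates.items():
        w = [row[:] for row in w_current]
        for a_vec, b in items:
            w = mat_add(w, item_info_matrix(theta, a_vec, b))
        d = det(w)
        if d > best_det:
            best_name, best_det = name, d
    return best_name

# Two dimensions; much information on dimension 1 has been collected already
# (the prior precision is assumed to be folded into w_current).
w_current = [[2.0, 0.0], [0.0, 0.1]]
candidates = {
    "T1": [([1.0, 0.0], 0.0)] * 3,  # three items loading on dimension 1
    "T2": [([0.0, 1.0], 0.0)] * 3,  # three items loading on dimension 2
}
# The determinant rule picks the testlet informing the weakly measured dimension.
assert select_testlet([0.0, 0.0], w_current, candidates) == "T2"
```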

The multidimensional TIRT model we defined in Equation (3) is a relatively complex model because the testlet effects γ_{jd(i)} are person- and testlet-specific. In practice, besides the ability dimensions, an additional dimension needs to be estimated for every testlet. The combination of the fact that parameters have to be estimated from a very limited number of data points with the high dimensionality makes the estimation of testlet effects a challenge in itself (see Jiao et al.,

The results regarding research question 1 will provide an insight into the breadth of the applicability of the MTIRT model. Additionally, an unbiased estimation will provide a sound justification for the detailed analyses carried out to answer the following research questions, which focus on specific aspects of the proposed method. In order to keep the study design manageable, more general aspects are not covered by our research questions if they are not specific to MAT-MTIRT or if their impact on the performance of the new method can be derived from previous research. Nonetheless, some of these aspects, such as the effects of the relationship between the measured dimensions and the model complexity, are relevant from a practical point of view and are thus picked up in the discussion.

From the perspective of possible future applications of the proposed method, improvements in the precision of ability estimates that can be obtained with the new method compared to a conventional MAT using a MIRT model (the current standard version of MAT) are of utmost interest. To cover a broad range of assessments, such effects should be analyzed with respect to the size of the testlet effect variances and the number of items included in the testlets. Therefore, the second research question focuses on the comparison between MAT with the MTIRT model and MAT with the MIRT model, conditional upon the size of the testlet effect variance:

With the third research question, the number of items included in the testlets is addressed:

A realistic variation of (a) the size of the testlet effect variances and (b) the number of items in a testlet allows the results to be generalized to a broad range of operational item pools.

Lastly, it may be possible that a randomly and thus non-adaptively selected set of testlets (RAN) used in combination with the MTIRT model will already lead to a satisfying level of measurement precision of the ability estimates and that moving to MAT will not add a significant increment in precision. If this is the case, moving from MIRT to MTIRT would be sufficient and the effort of implementing a MAT system might not pay off. Thus, the additional potential of MAT compared to more traditional non-adaptive testing is focused on in the fourth research question:

The stated research questions were examined with a simulation study. The simulation is based on a full factorial design with the four factors model (MTIRT, MIRT), testing algorithm (MAT, RAN), testlet effect variance (0.0, 0.5, 1.0, and 1.5), and testlet size (3, 6, and 9 items).

In every condition,

The latent correlations of 0.80 between the three ability dimensions are a careful representation of the magnitude of latent correlations between ability dimensions typically found in large-scale assessments of student achievement (e.g., the latent correlations of 0.85–0.89 between the dimensions of student literacy in mathematics, reading, and science in PISA 2012 are even a bit higher; see OECD,

Additionally, an item pool was generated that was used in all research conditions. For each of the three dimensions, 108 items were generated, each one loading on exactly one dimension. This loading was indicated by setting the corresponding component of a_{i}′ to 1 and the other two components to 0. An item loading on the first dimension, for example, has a_{i}′ = (1, 0, 0). Hence, between-item multidimensionality was used. It was chosen for the simulation because it is the predominant version of multidimensionality used in operational tests. The item difficulties of the 3 · 108 = 324 items were drawn from a uniform distribution for each of the three ability dimensions.
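The item pool construction and MTIRT response generation can be sketched as follows. This is an illustrative reimplementation in Python (the study used SAS); the U(−2, 2) difficulty range and the fixed person and testlet values in the example are our own assumptions:

```python
import math
import random

random.seed(42)
P, ITEMS_PER_DIM, TESTLET_SIZE = 3, 108, 3

# Between-item multidimensionality: each item loads on exactly one dimension,
# expressed by a loading vector such as a_i' = (1, 0, 0).
pool = []
for dim in range(P):
    for _ in range(ITEMS_PER_DIM):
        a = [0.0] * P
        a[dim] = 1.0
        b = random.uniform(-2.0, 2.0)  # assumed difficulty range
        pool.append({"a": a, "b": b, "testlet": len(pool) // TESTLET_SIZE})

def mtirt_response(theta, gamma, item):
    """Draw one 2PL MTIRT response; gamma maps testlet id -> person-specific effect."""
    z = sum(ai * ti for ai, ti in zip(item["a"], theta)) - item["b"]
    z += gamma.get(item["testlet"], 0.0)  # LID term shared by items of a testlet
    p = 1.0 / (1.0 + math.exp(-z))
    return 1 if random.random() < p else 0

responses = [mtirt_response([0.0, 0.0, 0.0], {0: 0.5}, item) for item in pool]
```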

The generated ability, item, and testlet parameters were used to produce responses based on the MTIRT model from Equation (3). With these responses, the testing procedure was simulated for the different research conditions. In the simulation of the testing procedure, the item difficulties and the item discriminations were fixed at their generated values, and γ_{jd(i)} = 0 was assumed for the MIRT conditions.^{1}

Testlet size | Testlet effect variance: 0.0 | 0.5 | 1.0 | 1.5
3 | 1.000 | 0.925 | 0.863 | 0.811
6 | 1.000 | 0.927 | 0.867 | 0.815
9 | 1.000 | 0.930 | 0.871 | 0.821

Note that these are the item parameters one would use when adopting the current common practice for applying the MIRT model to a dataset including LID between the items of the same testlet. The item discriminations used in the simulation are not affected by shrinkage or other problems and can be directly used in the MIRT condition. By using the rescaled item difficulties and the original item discriminations in the MIRT condition, direct comparability with the results obtained in the MTIRT conditions was established. Furthermore, by using fixed values for the item parameters and

The testing procedure was simulated using SAS 9.3. For the MAT conditions, for both the MIRT and MTIRT model, complete testlets were selected. The first testlet was chosen randomly. Next, the testlet with the maximum summed item information given the provisional ability vector

At the end of the simulated testing procedure, the MTIRT model included a large number of testlet effects and thus dimensions. The estimation with Newton-Raphson integration used within the course of the test is an appropriate and sufficiently fast method to provide provisional ability and testlet-effect estimates but is not the best method to estimate the final results for this high-dimensional problem. In order to achieve the highest possible accuracy for the parameter estimates which were finally used to answer the research questions, the responses gathered in the simulated testing procedure were therefore scaled using the MCMC method with WinBUGS 1.4.3 (Lunn et al.). In this final scaling, the discrimination parameters a_{i} were fixed at either 1 or 0, indicating the item loadings on the dimensions, the item difficulties b_{i} were fixed at the generated values, and the pseudo-guessing parameters c_{i} were set to 0. For the non-fixed parameters, priors with slightly informative hyperpriors were given by θ_{j}~MVN(μ_{θ}, Σ_{θ}), with μ_{θ} fixed at 0 for each dimension, and corresponding priors for the testlet effects γ_{jd(i)}. The MCMC method was applied only in this final scaling, because it would have been too slow for the estimation of the provisional ability estimates within the course of the test. The number of burn-in iterations ranged from 14,500 to 80,000 for the final scaling. The burn-in length was determined using the convergence criterion proposed by Geweke.

In this section, the four research questions of the study are answered. First, results regarding the recovery of the testlet effects with the proposed new multidimensional generalization of the TIRT model are presented. Then, MAT with the MIRT model (MAT-MIRT) is compared to MAT with the MTIRT model (MAT-MTIRT) for different testlet effect variances and different testlet sizes with respect to the precision of the ability estimates. Finally, MAT and RAN are compared.

Table

Algorithm | Testlet size | M (0.0) | SE | M (0.5) | SE | M (1.0) | SE | M (1.5) | SE
MAT | 3 | 0.022 | 0.012 | 0.506 | 0.012 | 1.012 | 0.017 | 1.509 | 0.024
MAT | 6 | 0.007 | 0.004 | 0.505 | 0.014 | 0.993 | 0.027 | 1.514 | 0.027
MAT | 9 | 0.005 | | 0.510 | 0.006 | 1.015 | 0.033 | 1.553 | 0.045
RAN | 3 | 0.025 | 0.015 | 0.495 | 0.026 | 1.001 | 0.020 | 1.506 | 0.026
RAN | 6 | 0.008 | | 0.496 | 0.019 | 1.012 | 0.027 | 1.510 | 0.032
RAN | 9 | 0.019 | 0.012 | 0.511 | 0.018 | 1.062 | 0.043 | | 0.049

Differences between the testlet effect variances used for data generation and the estimated testlet effect variances are mostly on the second or third decimal. Nevertheless, in some cells of the design, the 95%-credibility interval (±1.96 · SE) around the estimated testlet effect variance did not include the value used for data generation.

The second research question considers the differences in the precision of the ability estimates between MAT-MTIRT and MAT-MIRT with respect to the size of the testlet effect variance. To answer it, the average mean square error of the ability estimates was computed for each condition.
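As a criterion of measurement precision, the average mean square error can be computed as sketched below; the function and variable names are illustrative, not taken from the study's code:

```python
def mse(true_vals, estimates):
    """Mean square error of the ability estimates for one dimension."""
    return sum((e - t) ** 2 for t, e in zip(true_vals, estimates)) / len(true_vals)

def average_mse(true_matrix, est_matrix):
    """MSE averaged across dimensions; rows are persons, columns are dimensions.

    Lower values indicate more precise ability estimation."""
    dims_true, dims_est = zip(*true_matrix), zip(*est_matrix)
    per_dim = [mse(t, e) for t, e in zip(dims_true, dims_est)]
    return sum(per_dim) / len(per_dim)
```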

Testlet size | Testlet effect variance | MAT-MTIRT: MSE | SE | MAT-MIRT: MSE | SE | RAN-MTIRT: MSE | SE | RAN-MIRT: MSE | SE
3 | 0.0 | 0.150 | 0.002 | 0.150 | 0.002 | 0.233 | 0.002 | 0.234 | 0.002
3 | 0.5 | 0.194 | 0.001 | 0.196 | 0.002 | 0.267 | 0.005 | 0.271 | 0.005
3 | 1.0 | 0.230 | 0.002 | 0.241 | 0.002 | 0.296 | 0.004 | 0.309 | 0.003
3 | 1.5 | 0.261 | 0.002 | 0.281 | 0.003 | 0.321 | 0.004 | 0.345 | 0.005
6 | 0.0 | 0.155 | 0.002 | 0.155 | 0.002 | 0.238 | 0.004 | 0.237 | 0.004
6 | 0.5 | 0.229 | 0.003 | 0.243 | 0.003 | 0.297 | 0.005 | 0.317 | 0.006
6 | 1.0 | 0.286 | 0.005 | 0.329 | 0.006 | 0.343 | 0.004 | 0.400 | 0.008
6 | 1.5 | 0.336 | 0.005 | 0.405 | 0.005 | 0.386 | 0.007 | 0.482 | 0.007
9 | 0.0 | 0.161 | 0.002 | 0.161 | 0.002 | 0.242 | 0.003 | 0.242 | 0.003
9 | 0.5 | 0.257 | 0.003 | 0.300 | 0.004 | 0.326 | 0.005 | 0.374 | 0.007
9 | 1.0 | 0.329 | 0.004 | 0.446 | 0.005 | 0.395 | 0.011 | 0.516 | 0.009
9 | 1.5 | 0.393 | 0.008 | 0.581 | 0.009 | 0.461 | 0.016 | 0.658 | 0.010

As a general trend, the measurement precision decreases when the testlet effect variance increases. However, the decrease in measurement precision is smaller if the MTIRT model is used compared to the MIRT model. This can be seen well in Figure

To sum up, two results have to be noted. First, testlet effects lead to a decrease in measurement precision even if the MTIRT model is used. Second, when testlet effects are present (i.e., testlet effect variances greater than 0), this decrease is smaller if the MTIRT model is used instead of the MIRT model.

The third research question focuses on the differences in the precision of the ability estimates between MAT-MTIRT and MAT-MIRT with respect to testlet size. As can be seen in Table

To summarize, increasing the size of testlets leads to a loss in measurement precision in MAT. This loss is smaller for the MTIRT model than for the MIRT model if testlet effects are present.

Research question four asks which differences in the precision of the ability estimates can be observed between MAT and RAN. The results in Table

However, with increasing testlet size, the flexibility of MAT is more and more restricted. Correspondingly, the relative importance of the measurement model (MTIRT vs. MIRT) compared to the testing algorithm (MAT vs. RAN) gets larger with increasing testlet size. For example, for very large testlet effect variances of 1.5 combined with testlets of nine items, non-adaptive testing with the MTIRT model (RAN-MTIRT) yielded a lower average mean square error than MAT with the MIRT model (MAT-MIRT).

However, the main purpose of the present study was to examine the performance of MAT with the “correct model” for cases where LID caused by testlets exists. Here, the proposed method performed well. If non-zero testlet effect variances were present, the combination of MAT with the MTIRT model achieved the lowest average mean square error of all examined combinations.

The present study presents and evaluates a new method, expanding the applicability of MAT to the large group of testlet-based tests. The proposed combination of a multidimensional IRT model incorporating testlet effects with MAT results in an applicable solution capable of overcoming the problem of LID in testlet-based tests. Finding a solution for the issue of LID in testlet-based tests is important because recent research provides strong evidence that LID is present between the items of the same testlet in operational large-scale testing programs. Neglecting this fact leads to overestimated test information, underestimated standard errors, and, subsequently, to significance tests that produce too many significant results. Since many educational assessments are used to make important and sometimes far-ranging decisions, this issue needs to be resolved.

A decisive advantage of the method is that, by utilizing the measurement efficiency of MAT, it makes it possible to apply an appropriate model including testlet effects but without the need to prolong testing sessions. Thus, the proposed method of testlet-based MAT can be used without altering the time frames of large-scale assessments or reducing the amount or grade of differentiation of the measured content.

The suggested combination of the MTIRT model and MAT performed well. First of all, testlet effect variances were recovered satisfactorily. This result is not trivial since the proposed MTIRT model is complex and estimation problems could have occurred. The results further showed that the measurement precision of the ability estimates decreased with increasing amounts of LID and increasing numbers of items within testlets. MAT in combination with the MTIRT model was able to compensate to a certain degree for these decreases but did not fully eliminate them. Hence, losses in measurement precision due to LID within testlets will still have to be assumed even if MAT-MTIRT is used^{2}

Some more predictions about the performance of the suggested method can be derived from existing research. First, based on simulation results (e.g., Wang and Chen,

Note that the results of the present study regarding measurement precision will only be altered in the sense of a main effect of the correlation between the measured dimensions or the complexity of the IRT model; the relative differences between the varied factors, model (MTIRT, MIRT), testing algorithm (MAT, RAN), testlet effect variance (0.0, 0.5, 1.0, and 1.5), and testlet size (3, 6, and 9), will not change. Both the correlation between the measured dimensions and the model complexity will have no impact on the bias of the ability estimates, since asymptotic unbiasedness is a property of the estimator used.

Due to the complexity of the MTIRT model, the MCMC method was used for the final estimation. Thus, the proposed combination of MAT with the MTIRT model is limited to assessments where a final scaling of the complete set of responses of a relatively large sample is feasible. In its present form, MAT-MTIRT is hence not suitable for providing instant feedback to individual participants. However, for most large-scale assessments of student achievement like PISA, PIRLS, or TIMSS, it can be applied if testing is carried out using computers. As a possible next step, the item pool and response data from one of these studies may be used to examine the feasibility of MAT-MTIRT within a real data simulation study. Thus, in conclusion, we would like to encourage measurement specialists to consider implementing MAT-MTIRT in operational testing programs since it has the capacity to substantially decrease the problems caused by LID between items within testlets.

AF: Conception of the study, directing the statistical analyses, drafting of the manuscript, approval of the final version to be published, agreeing to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. NS: Substantial contribution to the conception of the study, programming needed for the simulation study (SAS and WinBUGS), conducting the data analyses, reviewing the manuscript critically for important intellectual content, approval of the final version to be published, agreeing to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. SB: Substantial contributions to the interpretation of the study results, technical advice in the planning phase, inspection and correction of technical (mathematical) parts, reviewing the manuscript critically for important intellectual content, approval of the final version to be published, agreeing to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

The preparation of this article was supported by grant FR 2552/2-3 from the German Research Foundation (DFG) in the Priority Programme “Models of Competencies for Assessment of Individual Learning Outcomes and the Evaluation of Educational Processes” (SPP 1293).

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

^{1}The re-scaled item difficulties could have also been used in the MIRT conditions. Nevertheless, the item parameters derived from the regression can be expected to be a bit more precise because the regression uses the responses to all items to predict the item difficulties, reducing imprecision for individual items with response patterns randomly deviating from the expected response patterns.

^{2}The reported results for the average