
Edited by: Jason C. Immekus, University of Louisville, United States

Reviewed by: Luciana Pagliosa Carvalho Guedes, Universidade Estadual do Oeste do Paraná, Brazil; Peida Zhan, Zhejiang Normal University, China; Yong Luo, Educational Testing Service, United States; Chung-Ying Lin, The Hong Kong Polytechnic University, Hong Kong

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The standard item response theory (IRT) model assumption of a single homogeneous population may be violated in real data. Mixture extensions of IRT models have been proposed to account for latent heterogeneous populations, but these models are not designed to handle multilevel data structures. Ignoring the multilevel structure is problematic: lower-level units are aggregated with higher-level units, and the resulting dependencies in the data yield less accurate results. Multilevel data structures cause such dependencies between levels, but they can be modeled in a straightforward way in multilevel mixture IRT models. An important step in the use of multilevel mixture IRT models is assessing the fit of the model to the data, which is often determined with relative fit indices. Previous research on mixture IRT models has shown that the performance of these indices and the classification accuracy of these models can be affected by several factors, including the percentage of class-variant items, the number of items, the number of clusters and cluster size, and the mixing proportions of the latent classes. As yet, no studies appear to have examined these issues for multilevel extensions of mixture IRT models. The current study investigates the effects of several features of the data on the accuracy of model selection and parameter recovery. Results are reported from a simulation study designed to examine the following features of the data: percentage of class-variant items (30, 60, and 90%), number of latent classes in the data (1 to 3 latent classes at level 1 and 1 or 2 latent classes at level 2), number of items (10, 30, and 50), number of clusters (50 and 100), cluster size (10 and 50), and mixing proportions [equal (0.5 and 0.5) vs. non-equal (0.25 and 0.75)]. Simulation results indicated that multilevel mixture IRT models yielded less accurate estimates when the number of clusters and the cluster size were small.
In addition, mean root mean square error (RMSE) values increased as the percentage of class-variant items increased, and parameters were recovered most accurately under the 30% class-variant item conditions. Mixing proportion type (i.e., equal vs. unequal latent class sizes) and number of items (10, 30, and 50), however, did not show any clear pattern. The sample-size-dependent fit indices BIC, CAIC, and SABIC performed poorly for the smaller level-1 sample size. For the remaining conditions, the SABIC index performed better than the other fit indices.

Item response theory (IRT;

Single-level mixture IRT models are similar to multigroup item response models (

As described in
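Assuming the standard two-parameter logistic form, the class-specific probability of a correct response can be written as follows; this expression is constructed here to be consistent with the parameter definitions given below:

```latex
P\!\left(Y_{jki}=1 \mid C_{jk}=g,\ \theta_{jkg},\ \theta_{k}\right)
  = \frac{\exp\!\left(\alpha_{ig.W}\,\theta_{jkg}+\alpha_{i.B}\,\theta_{k}-\beta_{ig}\right)}
         {1+\exp\!\left(\alpha_{ig.W}\,\theta_{jkg}+\alpha_{i.B}\,\theta_{k}-\beta_{ig}\right)}
```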

where Y_{jki} represents the response of person j in cluster k to item i, C_{jk} is a within-level latent classification variable taking values g = 1, …, G, α_{ig.W} represents a within-level item discrimination parameter, α_{i.B} represents a between-level item discrimination parameter, β_{ig} is a class-specific item location parameter, θ_{jkg} is a class-specific within-level continuous latent variable, and θ_{k} represents a between-level continuous latent variable. Both θ_{jkg} and θ_{k} are assumed to follow normal distributions with a mean of zero and variance

The multilevel mixture IRT models have interested researchers due to their utility for correctly accounting for dependencies among the data in multilevel data structures (

The exploratory use of multilevel mixture IRT modeling is based on the comparison of alternative models using relative fit indices such as the Akaike Information Criterion (AIC;

How do the different test characteristics affect the quality of parameter estimates in multilevel mixture IRT models?

How do these different characteristics affect classification accuracy in multilevel mixture IRT models?

How do the model selection indices perform in the presence of these different characteristics?

A Monte Carlo simulation study was conducted to answer the three research questions. Details of the simulation study are given below.

Data were simulated based on the dichotomous multilevel mixture IRT model (

A 10-item test was used to represent a short test, a 30-item test a medium-length test, and a 50-item test a long test. Two different mixing proportions, π, were included to investigate the effect of mixing proportions: equal mixing proportions (π_{1} = π_{2} = 0.5) and unequal mixing proportions (π_{1} = 0.75, π_{2} = 0.25). Items with the same item threshold parameters across latent classes are considered class-invariant items, and items with unequal threshold parameters are considered class-variant items. Given that the number of class-variant items has been shown to affect the number of detected latent classes (
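As a concrete illustration, the generating process for one condition (equal mixing, 50 clusters of 10 examinees, a 10-item test with 30% class-variant items) can be sketched as follows. This is a minimal sketch, not the study's generating code: it assumes a Rasch-type simplification with unit discriminations, and the threshold shift of 1.0 for class-variant items and the between-level standard deviation of 0.5 are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# E5010-style condition: equal mixing, 50 clusters, 10 examinees per cluster
n_clusters, cluster_size, n_items = 50, 10, 10
mixing = [0.5, 0.5]                     # equal mixing proportions

# Class-specific thresholds: the first 30% of items are class-variant
# (the shift of 1.0 is an illustrative value, not the study's)
beta = np.zeros((2, n_items))
beta[1, : int(0.3 * n_items)] = 1.0

theta_b = rng.normal(0.0, 0.5, size=n_clusters)   # between-level latent variable
n_persons = n_clusters * cluster_size
responses = np.empty((n_persons, n_items), dtype=int)
classes = np.empty(n_persons, dtype=int)

for k in range(n_clusters):
    for j in range(cluster_size):
        g = rng.choice(2, p=mixing)               # draw within-level latent class
        theta_w = rng.normal(0.0, 1.0)            # within-level latent variable
        p = 1.0 / (1.0 + np.exp(-(theta_w + theta_b[k] - beta[g])))
        row = k * cluster_size + j
        responses[row] = rng.binomial(1, p)       # dichotomous item responses
        classes[row] = g
```

The generated `classes` vector is retained so that estimated class memberships can later be compared against the generating memberships.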

Four different models were estimated: CB1C2, CB2C2, CB2C3, and CB3C3, where CB denotes the number of level-2 classes and C denotes the number of level-1 classes. Thus, CB1C2 represents a model with one level-2 class and two level-1 classes, CB2C2 a model with two level-2 classes and two level-1 classes, CB2C3 a model with two level-2 classes and three level-1 classes, etc. The true (i.e., generating) model in this simulation study was the CB2C2 model, i.e., a multilevel mixture item response model with two within-level and two between-level latent classes. The misspecified models were thus the CB1C2, CB2C3, and CB3C3 models. The total number of runs was 28,800 (= 100 replications × 4 models × 72 conditions). Marginal maximum-likelihood estimation with the MLR estimator option, as implemented in Mplus, was used to estimate the multilevel mixture IRT models. The following Mplus options were used: TYPE = TWOLEVEL MIXTURE; ALGORITHM = INTEGRATION; PROCESSORS = 2;. The Mplus syntax for model estimation is provided in the

Root mean square error (RMSE) statistics were calculated, after item parameter estimates were placed onto the scale of the generating parameters, to examine the recovery of the generating parameters. RMSE was calculated between item threshold parameters of the true model and the estimated model using
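Assuming the scale-linking step has already been carried out, the RMSE computation itself is straightforward; the numeric values below are illustrative, not taken from the study.

```python
import numpy as np

def rmse(true_thresholds, est_thresholds):
    """Root mean square error between generating and estimated item thresholds."""
    true_thresholds = np.asarray(true_thresholds, dtype=float)
    est_thresholds = np.asarray(est_thresholds, dtype=float)
    return float(np.sqrt(np.mean((est_thresholds - true_thresholds) ** 2)))

# illustrative values
print(rmse([0.0, 0.5, 1.0], [0.1, 0.4, 1.2]))   # sqrt(0.02) ≈ 0.1414
```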

Label switching can be a concern in mixture IRT estimation: estimated latent classes can switch across replications. For example, between-level latent class 2 in one data set can correspond to between-level class 1 in another data set. Therefore, results for each data set were monitored to detect and, if necessary, correct label switching. Threshold values from the correctly matched classes were then used to calculate RMSE values.
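One simple way to detect and correct label switching, given that the generating thresholds are available in a simulation, is to compare every permutation of the estimated classes against the generating classes and keep the permutation with the smallest total squared threshold difference. The following is a sketch of that idea, not the study's actual monitoring procedure:

```python
import numpy as np
from itertools import permutations

def align_classes(true_beta, est_beta):
    """Permute the rows (classes) of est_beta to best match true_beta.

    true_beta, est_beta: (n_classes, n_items) threshold arrays.
    Brute force over permutations is adequate for the small class
    counts (2-3) considered here.
    """
    true_beta = np.asarray(true_beta, dtype=float)
    est_beta = np.asarray(est_beta, dtype=float)
    n_classes = true_beta.shape[0]
    best = min(permutations(range(n_classes)),
               key=lambda p: ((est_beta[list(p)] - true_beta) ** 2).sum())
    return est_beta[list(best)], best
```

When the estimated class labels are swapped relative to the generating labels, the returned permutation reorders them before RMSE is computed.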

In the mixture IRT framework, each respondent has an estimated posterior probability of membership in each latent class. Each respondent is assigned to a single class based on their highest estimated posterior probability. As described in , the posterior probability, P_{jkg}, can be calculated as follows:
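In standard mixture-model form (written here with π_g as the class mixing proportions and f as the class-conditional likelihood of person jk's response vector Y_{jk}, an assumed notation), this posterior probability is:

```latex
P_{jkg} \;=\; \frac{\pi_g \, f\!\left(\mathbf{Y}_{jk} \mid C_{jk}=g\right)}
                   {\sum_{h=1}^{G} \pi_h \, f\!\left(\mathbf{Y}_{jk} \mid C_{jk}=h\right)}
```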

where Y_{jki} represents the response of person j in cluster k to item i, C_{jk} is a categorical latent variable at the within level, and θ_{k} represents a between-level predicted score. The P_{jkg} values sum to 1 for each person (

Simulated examinees were assigned to specified latent classes during data generation, so it is possible to determine whether these examinees were classified into the same latent classes after model estimation. Posterior probabilities of membership for each examinee were calculated using the CPROBABILITIES option of the SAVEDATA command in Mplus, and a classification accuracy rate was calculated for each condition. Correct detection was defined as the correct classification of the latent class membership of each examinee: generated and estimated class memberships were compared, and a percentage agreement was computed across the 100 replications for each condition. Thus, agreement was recorded when an examinee assigned to the first class (Class 1) during data generation was also classified into Class 1 after estimation.
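The modal-assignment rule and the resulting accuracy rate can be sketched as follows; the posterior values below are illustrative, not output from Mplus:

```python
import numpy as np

def classification_accuracy(posteriors, true_classes):
    """Assign each examinee to the class with the highest posterior
    probability and return the agreement rate with the generating classes."""
    assigned = np.argmax(posteriors, axis=1)
    return float(np.mean(assigned == np.asarray(true_classes)))

# illustrative posteriors for four examinees and two classes
post = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
print(classification_accuracy(post, [0, 1, 1, 1]))   # 3 of 4 correct: 0.75
```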

Unlike multigroup IRT models, the latent classes in mixture IRT models are not known a priori

Information criterion indices are based on some form of penalization of the loglikelihood. The penalization is used to adjust for the selection of over-parameterized models. Let

The performances of AIC, BIC, consistent AIC (CAIC;

Where,
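For reference, all four indices can be computed from the maximized log-likelihood, the number of free parameters p, and the sample size n. The definitions below are the standard ones; treating n as the number of examinees is an assumption made here for illustration:

```python
import math

def info_criteria(loglik, n_params, n_obs):
    """Standard penalized-likelihood indices; smaller values indicate better fit."""
    p, n = n_params, n_obs
    return {
        "AIC":   -2 * loglik + 2 * p,
        "BIC":   -2 * loglik + p * math.log(n),
        "CAIC":  -2 * loglik + p * (math.log(n) + 1),
        "SABIC": -2 * loglik + p * math.log((n + 2) / 24),
    }

# illustrative values, not from the study
ic = info_criteria(loglik=-500.0, n_params=10, n_obs=500)
```

In model selection, each index is computed for every candidate model (here CB1C2, CB2C2, CB2C3, and CB3C3), and the model with the smallest value of a given index is selected.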

Mean RMSE values of item threshold estimates for the CB2C2 Model.

E5010 | 1.335 | 1.812 | 1.949 | 0.454 | 0.993 | 1.333 | 0.562 | 1.231 | 2.007 |

E5050 | 0.256 | 0.325 | 0.829 | 0.118 | 0.732 | 0.977 | 0.107 | 0.985 | 1.268 |

E10010 | 0.752 | 0.830 | 1.099 | 0.213 | 0.766 | 1.007 | 0.199 | 1.006 | 1.458 |

E10050 | 0.164 | 0.191 | 0.767 | 0.083 | 0.724 | 0.965 | 0.075 | 0.977 | 1.260 |

NE5010 | 1.087 | 1.213 | 1.401 | 1.873 | 2.653 | 2.710 | 1.435 | 1.860 | 2.927 |

NE5050 | 0.400 | 0.596 | 1.010 | 0.328 | 0.751 | 1.087 | 0.134 | 0.988 | 1.321 |

NE10010 | 0.803 | 1.377 | 1.565 | 1.289 | 1.621 | 1.928 | 0.548 | 1.120 | 1.766 |

NE10050 | 0.335 | 0.376 | 0.859 | 0.328 | 0.734 | 1.070 | 0.092 | 0.979 | 1.262 |

As shown in

As with latent class models, mixture IRT models assign each examinee to one of the latent classes based on class probability values. The class memberships created during data generation were compared with the estimated class memberships, and a classification accuracy rate was calculated for each condition under the same (generating) model. Classification accuracy rates are shown in

Classification accuracy rates for CB2C2 Model.

E5010 | 37.35 | 38.20 | 31.43 | 43.19 | 24.14 | 44.66 | 69.11 | 80.04 | 69.38 |

E5050 | 45.13 | 58.58 | 38.69 | 57.86 | 45.05 | 38.85 | 82.29 | 89.02 | 86.92 |

E10010 | 30.82 | 42.54 | 27.87 | 44.18 | 27.58 | 58.00 | 70.04 | 83.34 | 78.93 |

E10050 | 35.39 | 61.53 | 30.18 | 61.15 | 47.43 | 37.79 | 82.09 | 89.02 | 87.12 |

NE5010 | 37.00 | 37.42 | 30.93 | 28.50 | 27.27 | 26.69 | 65.86 | 74.70 | 45.70 |

NE5050 | 52.94 | 57.05 | 45.03 | 38.71 | 26.58 | 29.01 | 85.04 | 90.50 | 88.61 |

NE10010 | 34.79 | 47.14 | 36.97 | 26.61 | 32.31 | 32.50 | 72.86 | 85.12 | 66.97 |

NE10050 | 60.27 | 57.42 | 32.45 | 31.13 | 12.31 | 15.87 | 85.85 | 90.64 | 86.52 |

AIC, BIC, CAIC, and SABIC values were calculated for each condition. The number of correct selections was calculated as the number of detections of the CB2C2 (i.e., the generating) model over 100 replications. The frequencies of correct model selections are shown in

Number of correct detections over 100 replications for 10-Item conditions.

E5010 | 82 | 52 | 65 | 3 | 0 | 2 | 59 | 31 | 48 | 2 | 0 | 0 |

E5050 | 82 | 76 | 97 | 100 | 100 | 100 | 100 | 100 | 100 | 97 | 99 | 98 |

E10010 | 86 | 67 | 67 | 21 | 0 | 3 | 84 | 58 | 65 | 7 | 0 | 1 |

E10050 | 57 | 70 | 89 | 80 | 100 | 100 | 77 | 100 | 100 | 77 | 97 | 97 |

NE5010 | 70 | 57 | 69 | 1 | 0 | 2 | 51 | 26 | 41 | 0 | 0 | 0 |

NE5050 | 91 | 79 | 90 | 100 | 80 | 100 | 100 | 87 | 100 | 97 | 77 | 95 |

NE10010 | 86 | 74 | 73 | 11 | 1 | 2 | 78 | 42 | 70 | 5 | 0 | 2 |

NE10050 | 75 | 38 | 92 | 100 | 70 | 100 | 100 | 59 | 100 | 100 | 73 | 97 |

Number of correct detections over 100 replications for 30-Item conditions.

E5010 | 53 | 55 | 47 | 28 | 0 | 0 | 99 | 97 | 66 | 11 | 0 | 0 |

E5050 | 56 | 72 | 37 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

E10010 | 48 | 34 | 48 | 99 | 53 | 0 | 100 | 99 | 66 | 97 | 20 | 0 |

E10050 | 59 | 77 | 41 | 99 | 100 | 100 | 99 | 100 | 100 | 98 | 100 | 100 |

NE5010 | 28 | 38 | 25 | 0 | 2 | 0 | 11 | 6 | 100 | 0 | 0 | 0 |

NE5050 | 18 | 65 | 53 | 81 | 66 | 8 | 97 | 99 | 83 | 66 | 33 | 1 |

NE10010 | 16 | 47 | 31 | 0 | 0 | 0 | 13 | 6 | 1 | 0 | 0 | 0 |

NE10050 | 5 | 63 | 39 | 100 | 99 | 92 | 85 | 99 | 99 | 100 | 98 | 85 |

Number of correct detections over 100 replications for 50-Item conditions.

E5010 | 58 | 79 | 78 | 0 | 0 | 1 | 54 | 30 | 2 | 0 | 0 | 0 |

E5050 | 67 | 66 | 77 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 90 |

E10010 | 67 | 76 | 92 | 1 | 0 | 0 | 93 | 89 | 21 | 0 | 0 | 0 |

E10050 | 69 | 65 | 65 | 100 | 100 | 97 | 100 | 100 | 94 | 100 | 100 | 98 |

NE5010 | 57 | 49 | 31 | 0 | 0 | 0 | 23 | 3 | 0 | 0 | 0 | 0 |

NE5050 | 77 | 74 | 76 | 100 | 89 | 36 | 100 | 99 | 100 | 99 | 78 | 12 |

NE10010 | 60 | 73 | 97 | 0 | 0 | 0 | 53 | 26 | 0 | 0 | 0 | 0 |

NE10050 | 92 | 91 | 68 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 |

The numbers of correct detections for 10-item conditions are presented in

The number of correct detections for the 30-item conditions ranged between 0 and 100 (see

Correct detection frequencies (see

This simulation study examined the accuracy of parameter estimates and classifications under different multilevel and mixture conditions. The simulation factors in this research were chosen to represent different class-distinction features in multilevel mixture IRT modeling, in which the percentage of class-variant items, the number of clusters and cluster size, and the number of items were varied for the structure with two level-1 and two level-2 classes (i.e., the CB2C2 model). In addition, this study investigated the differential performance of four information criteria (AIC, BIC, CAIC, and SABIC) for model selection across different multilevel mixture IRT model applications.

Findings from the simulation study indicated that greater accuracy was observed with the higher number of clusters (i.e., 100 clusters) and the larger cluster size (i.e., 50 simulated examinees), as well as with the lower (30%) percentage of class-variant items. When the number of clusters and the cluster size were small, applications of multilevel mixture IRT models were problematic with respect to the accuracy of item parameter estimates. These findings were consistent with previous research by

Findings regarding classification accuracy showed that accuracy rates increased as the number of items increased. Equal mixing proportion conditions yielded smaller accuracy rates than unequal mixing proportion conditions for most percentages of class-variant items and test lengths. The number of clusters and the cluster size also influenced classification accuracy: the smaller cluster size (i.e., 10 examinees) and the smaller number of clusters (i.e., 50 clusters) yielded lower accuracy rates. As expected, increases in the number of items, the number of clusters, and the cluster size had a positive effect on classification accuracy.

Differential performance of the AIC, BIC, CAIC, and SABIC was observed under the different study conditions. Overall, SABIC performed better than BIC or CAIC both for the small level-1 sample size (i.e., 10) conditions and for the conditions with the larger level-1 sample size (i.e., 50). BIC and CAIC failed to select the true model for conditions with the smaller level-1 sample size, and the two indices showed similar performance across the different data conditions. SABIC thus appears to be preferable to BIC and CAIC for the smaller level-1 sample size. These findings were consistent with

Multilevel mixture IRT models and the relative fit indices used for model selection perform better with a higher number of clusters and larger cluster sizes. The percentage of class-variant items also affected the accuracy of model estimates and the performance of the model selection indices. Given these findings, it is important that model selection attend to substantive theory and to multiple fit indices rather than relying on a single fit index. The present study shares limitations with other simulation studies using similar conditions in the study design (e.g.,

Datasets generated for E5010 conditions of this study are included in the article/

Both authors contributed equally to the data analysis and the reporting of this study.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: