
Edited by: Holmes Finch, Ball State University, United States

Reviewed by: Anthony D. Albano, University of Nebraska System, United States; Maria Anna Donati, Università Degli Studi di Firenze, Italy

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Item context effects refer to the impact of features of a test on an examinee's item responses that cannot be explained by the abilities measured by the test. Investigations typically focus on only a single type of item context effect, such as item position effects or mode effects, thereby ignoring that different item context effects might operate simultaneously. In this study, two different types of context effects were modeled simultaneously, drawing on data from an item calibration study of a multidimensional computerized test (

Psychological tests, including achievement tests, aim at inferring persons' unobservable characteristics from their observed response behavior to a set of stimuli, e.g., test items. If different forms of a particular test exist, it is commonly assumed that the persons' response behavior to the items is independent of the choice of the test form. Violations of this assumption are referred to as item context effects; i.e., the test forms are an unintended source of variability in item and test scores. Ignoring this construct-irrelevant variance can lead to biased inferences about the persons' characteristics measured by the test, as well as about item characteristics (e.g., item difficulty and item discrimination) and test characteristics (e.g., reliability and validity). Defining item context effects as systematic effects of the test form on the persons' response behavior suggests that potentially many different item context effects exist, depending on the properties that differ between the test forms. Well-known item context effects are

Empirical findings suggest that item context effects are quite common. For example,

Most empirical studies focus on just one type of item context effect. However, different types of item context effects are likely to operate simultaneously and may interact with each other. The (position) effect of placing an item toward the end of a test might depend on the kind of items presented beforehand. Hence, such effects appear to be exaggerated in booklet designs (Shoemaker,

Item context effects violate the assumptions of standard item response theory (IRT) models commonly employed for scoring items and persons. In particular, it is assumed that item and person parameters are invariant across test forms and are stable across the course of the testing session. Erroneously assuming the absence of position and domain order effects is likely to result in biased item and person parameter estimates and can, therefore, threaten the validity of test score interpretations and uses. For example, if mathematics items become more difficult when presented after science items compared to reading items in a multidimensional achievement test, the variation in the item difficulties needs to be taken into account when estimating the persons' mathematics competencies. Otherwise, spurious mean differences in the mathematics competencies result between the test takers depending on the test forms. Existing IRT models can be adapted to account for position and domain order effects as well as their interactions. Doing so enables researchers to assess and statistically control for such effects and also allows for fair comparisons of test scores across test forms. The choice of models depends on the specific item context effects that are considered. If the number of test forms or test modes is small to moderate, multiple-group IRT models or multi-facet IRT models can be used. Explanatory IRT models based on generalized linear mixed models (GLMM) are very flexible in modeling multiple item context effects.

The aims of this article are twofold. First, we show how different types of item context effects can be analyzed simultaneously using generalized linear mixed models (GLMM; McCulloch et al.,

This paper is organized as follows. We start with a brief review of the theoretical aspects of context effects and clarify the terminology used in the later parts of the paper. Subsequently, we present a short overview of existing model-based methods for surveying item context effects, including the GLMM used in this study. In the method section, the study design, the sample, and the data will be introduced. Based on the booklet design presented, we then formulate a series of models with increasing complexity accounting for context effects. After presenting the results of the different models, we close with a discussion of our findings and the implications of our study.

Following Yousfi and Böhme (_{j} of an item

Note that, in tests where items of the same domain are presented in succession, a block structure results, meaning that items belonging to the same domain are typically grouped together to form one block within the test. Hence, most mixed domain assessment designs, such as those used in PISA (OECD,

Context effects can be defined in a more formal way by considering the idea of conditional independence of item responses. The position of an item is one such context factor C_{h}. Let C_{1},…,C_{Z} be the vector of all potential item context factors. Furthermore, let Y_{1},…,Y_{J} be the items constituting the measurement model M_{0} of a potentially multidimensional latent variable θ. Let ξ_{0} be the vector of model parameters of M_{0}, including the item parameters. The subscript zero indicates that context effects are not considered in M_{0}. Hence, item and person parameters are assumed to be invariant across test forms. The absence of item context effects can then be defined as the conditional stochastic independence
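Written out with generic symbols (Y_{1}, …, Y_{J} for the item responses, θ for the latent variable, and C_{1}, …, C_{Z} for the context factors), the conditional stochastic independence condition can be sketched as:

```latex
P(Y_1, \ldots, Y_J \mid \theta, C_1, \ldots, C_Z)
  = P(Y_1, \ldots, Y_J \mid \theta)
```

In words: once the latent variable is conditioned on, the context factors carry no additional information about the item responses; any violation of this equality constitutes an item context effect.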

Item context effects exist if conditional independence—as expressed by Equation 1—does not hold. Whenever the assumption of conditional independence is violated, context effects should be explicitly incorporated in the IRT model (Yousfi and Böhme, ). Such models extend M_{0}, as item and/or person parameters are allowed to be different across test forms depending on

Different approaches have been proposed for dealing with item context effects. Specific IRT models have been derived to account for a particular context effect. Especially item position effects received attention in educational LSA (Hohensinn et al.,

In random effects models, item context effects can be represented as random variables. These models are of major importance if context effects are assumed to vary across items and/or persons. Wang and Wilson (

Because of their flexibility, GLMMs (McCulloch et al.,

Our study focused on item position and domain order effects on subjects' item responses in a mixed domain design consisting of test material typically employed in recent LSAs of student achievement. In this study, we used data from a study whose test design allows assessing both kinds of context effects and their interactions. Our main question was whether item position effects were moderated by the order of the domains measured in the different test booklets. From a substantive perspective, our results are informative for researchers planning assessments in which multiple domains are measured. The existence of domain order effects operating in addition to position effects might be an important issue to be considered in large scale studies of student achievement. Typical assessments with different test forms seek to control potential position effects by design (i.e., balancing; Frey et al.,

A second goal of this article is to exemplify the use of GLMMs for the simultaneous analysis of item position and domain order effects. We will present a sequence of multidimensional IRT models specified in the GLMM framework. A step-by-step derivation of the models will be provided, taking the assessment design into account. As we will show in the remainder of the article, the GLMM framework provides great flexibility for modeling the impact of features of the assessment design on individuals' item responses that is not easily achieved in the classical MIRT framework.

The data set consisted of 49,128 responses gathered in a calibration study for three tests measuring student achievement in the domains of mathematics, science, and reading within a research project (for more information see Ziegler et al.,

Achievement was assessed by an item pool consisting of 339 items (133 mathematics, 133 science, and 73 reading items). The distribution of the items to the test takers was accomplished with a two-level booklet design. At the first level, domain-specific blocks of items were balanced (

The six domain orders used in the test booklets (Youden squares).

| Block position | Order 1 | Order 2 | Order 3 | Order 4 | Order 5 | Order 6 |
| 1 | Reading | Reading | Science | Science | Mathematics | Mathematics |
| 2 | Mathematics | Science | Reading | Mathematics | Science | Reading |
| 3 | Science | Mathematics | Mathematics | Reading | Reading | Science |

The booklet design implied that each mathematics and science item was presented in the positions 1 to 33. As only 9 reading items were included in a booklet, reading items were presented in the positions 1–9, 13–21, and 25–33. Between 123 and 199 responses were observed for each item (mathematics:
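The balance property of the Youden-square design can be checked with a short script. The six domain orders below are read off the design table above, and the check counts how often each domain occupies each block position (an illustrative sketch; Python is used here although the original analyses were run in R):

```python
from collections import Counter

# The six domain orders of the booklet design (M = mathematics,
# S = science, R = reading), read off the Youden-square table.
orders = [
    ("R", "M", "S"), ("R", "S", "M"), ("S", "R", "M"),
    ("S", "M", "R"), ("M", "S", "R"), ("M", "R", "S"),
]

# Count how often each domain appears in each of the three block positions.
counts = {pos: Counter(order[pos] for order in orders) for pos in range(3)}

# In a balanced design, every domain occurs equally often in every
# block position (here: twice).
for c in counts.values():
    assert c["M"] == c["S"] == c["R"] == 2
```

Note that in this design all six possible domain orders occur, which is what allows position effects and domain order effects to be separated in the first place.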

The tests were administered as an online computer-based test. Each testing session started with a 10-min-long standardized instruction. The test takers were informed about the domains being assessed, but participants did not know the order of the domains in their individually assigned test form. The test forms were randomly assigned to the students, who had, in total, 60 min to complete the test.

The data were analyzed by a series of GLMMs, assuming that both persons and items were random (De Boeck, ). The measurement model comprised the three latent abilities mathematics (θ_{M}), science (θ_{S}), and reading (θ_{R}). The ability vector was assumed to follow a multivariate normal distribution. The random item effects for the three domains are denoted ζ_{M}, ζ_{S}, and ζ_{R}; they were assumed to be normally distributed with zero mean and a domain-specific standard deviation (σ_{d}) each.

In developing the full model, we began with M0, the three-dimensional random Rasch model (De Boeck, ). The model includes domain indicator variables D_{M}, D_{S}, and D_{R}, which indicate whether the response Y_{ijd} refers to a mathematics, science, or reading item. The level-1 model equation of the logit of person

The level-2 model equation of the random slope of domain

where γ_{d} is the fixed effect for domain d; γ_{d} can be interpreted as the mean domain-specific item easiness.

Item position effects are the most frequently investigated type of item context effect. Such effects are incorporated into model M1. We entered the position of an item as a covariate X_{ijpd}. To simplify the notation, we simply write X_{p} in the remainder. X_{p} is the level-1 covariate, so that Equation 1 extends to

To facilitate the interpretation of model parameters, X_{p} might be standardized. In our case, we set X_{p} = (2p − P − 1)/(2(P − 1)), so that κ_{d} was the expected change in the logit of a randomly drawn person solving a randomly selected item of domain d across the whole test. We allowed κ_{d} to vary across the domains, so that

In Equation 5, κ_{d} is the mean logit change in a randomly selected item out of domain

The model M1 is an explanatory IRT model with the item position as a level-1 predictor. The main restrictions of this model are that the item position effect has a linear form, and that it is a fixed effect that does not take different values for different items and/or persons. Both restrictions can be relaxed in the GLMM framework. Nonlinear forms of item position effects can be examined by adding higher-order polynomials of the position variable X_{p} (e.g., X_{p}^{2}). Moreover, position effects can be specified as random effects that vary across persons and items. In the present study, we also employed these extended parameterizations of M1 by checking for nonlinear trends and for random effects on the person and item side. However, as we found no evidence for random effects, we do not investigate this issue any further.

As M1 only accounts for item position effects, the model was extended to include domain order effects, leading to a new model (M2). M2 not only assumed position and domain order effects, but also allowed for interactions between the two effects. For example, a position effect in science items may be stronger or weaker depending on whether mathematics or reading items were assigned previously. Following this idea, we took the domain order (as an additional predictor) into account in M2.

In the booklet design employed in this article, item position and domain order are not independent from one another. If, for example, a mathematics item was presented in a test booklet of the domain order mathematics (

We included indicator variables B_{ijdb} which indicate that an item was presented in a given block position; we simply write B_{b} in the remainder of the article. The first item block position served as the reference block, so that two additional indicator variables B_{2} and B_{3} were included that jointly indicate a response to an item of the first block (B_{2} = 0, B_{3} = 0), the second block (B_{2} = 1, B_{3} = 0), or the third block (B_{2} = 0, B_{3} = 1).

In order to yield a model with parameters that can be interpreted unequivocally, the item positions were within-block standardized in M2 as X_{pb} = (2p_{b} − P_{b} − 1)/(2(P_{b} − 1)), where P_{b} stands for the number of items in block b and p_{b} refers to the within-block item position of an item. X_{pb} always has a value of −0.5 when an item is presented at the first position of its block, and +0.5 at the last position. The parameters λ_{pb} stand for the logit change if a randomly chosen item in domain
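The block-position dummy coding and the within-block standardization described above can be made concrete in a small sketch (Python for illustration; the function names are ours, not the article's):

```python
def block_dummies(b):
    """Map block position b in {1, 2, 3} to the indicator pair (B2, B3),
    with the first block as the reference category."""
    return {1: (0, 0), 2: (1, 0), 3: (0, 1)}[b]

def x_pb(p_b, P_b):
    """Within-block standardized item position:
    -0.5 for the first item of a block, +0.5 for the last."""
    return (2 * p_b - P_b - 1) / (2 * (P_b - 1))

# First vs. last item of a 9-item reading block:
assert x_pb(1, 9) == -0.5
assert x_pb(9, 9) == 0.5
# The second block is coded (B2, B3) = (1, 0):
assert block_dummies(2) == (1, 0)
```

With this coding, the block dummies capture the level shift of a whole block, while x_pb captures the linear trend inside it, so the two kinds of effects can be interpreted separately.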

So far, the model equation of model M2 can be written as:

where α_{d2t} and α_{d3t} stand for the effects of the block positions indicated by B_{2} and B_{3}, and λ_{dbt} stands for the within-block position effect. Note that these parameters are indexed by a newly introduced index t, which refers to the domain order of the test booklet.

Formally, α_{dbt} is defined as a function of the domain, the block position, and the domain order; likewise, λ_{dbt} is defined as a function of the domain, the block position, and the domain order. So, the item position effects may vary depending on the domain

In the present investigation, the domain order variable took one of six values, indicated by O_{MSR}, O_{MRS}, O_{SMR}, O_{SRM}, O_{RMS}, and O_{RSM}. The order of the subscripts represents the domain order in a test booklet. It is important to note that, in the present case, each combination of a domain

Given the aforementioned restrictions on the impact of domain orders, and assuming the existence of linear item position effects within item blocks, the full reduced-form model equation of M2, expressed in terms of the domain order indicators, is given as:

The stepwise derivation of this model is presented in _{db}, where the first subscript ^{×}”. For example, only two possible domain orders—

The domain-specific within-block item position effects in the item blocks one, two and three refer to the parameters κ_{d1}, ^{×}). For example,

Decomposition of block position effects α_{dbt} and within-block item position effects λ_{dbt} as a function of domain order specific effects (Equation 6). For the first item block, the within-block position effects coincide for both possible domain orders:

λ_{M1(MSR)} = λ_{M1(MRS)} = κ_{M1}

λ_{S1(SMR)} = λ_{S1(SRM)} = κ_{S1}

λ_{R1(RMS)} = λ_{R1(RSM)} = κ_{R1}

The most complex model (M2) presented here was developed based on the booklet design employed in the present study (

All models presented here can be fitted in R, which was used for all analyses reported in this article.
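To illustrate the data-generating logic behind model M1, the following sketch simulates Rasch-type responses with a linear position effect. All values (sample sizes, standard deviations, and the position effect κ = −1.0) are arbitrary choices for illustration, not estimates from the study, and Python is used here although the reported analyses were run in R:

```python
import numpy as np

rng = np.random.default_rng(12345)
n_persons, n_items = 2000, 33

theta = rng.normal(0.0, 1.0, n_persons)   # random person abilities
zeta = rng.normal(0.0, 0.3, n_items)      # random item easiness
kappa = -1.0                              # assumed linear position effect (logits)

# Standardized position running from -0.5 (first item) to +0.5 (last item).
positions = np.arange(1, n_items + 1)
x_p = (2 * positions - n_items - 1) / (2 * (n_items - 1))

# Rasch-type logit with a fixed linear position effect, as in M1.
logit = theta[:, None] + zeta[None, :] + kappa * x_p[None, :]
prob = 1.0 / (1.0 + np.exp(-logit))
y = rng.random((n_persons, n_items)) < prob

# A negative kappa shows up as a lower solution rate late in the test.
first_half = y[:, : n_items // 2].mean()
second_half = y[:, n_items // 2 :].mean()
assert first_half > second_half
```

Note that in this toy setup every item sits at a fixed position, so item and position effects are confounded; the balanced booklet design of the study exists precisely to disentangle the two.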

In this section, we present the results of the models M0, M1, and M2 that were used to test increasingly complex hypotheses about item position and domain order effects. The results are presented for each model separately, starting with the three-dimensional Rasch model with random item and person effects (M0). No context effects were taken into account in M0, which mainly serves as a baseline for comparison. The standard deviations of the three latent person variables ranged from 0.710 in science to 0.867 in mathematics. The estimated γ_{d} and the item difficulties of items belonging to domain

Estimated standard deviations and correlations of random effects of the different models.

| Model | Domain | SD (item effects) | SD (person effects) | r with mathematics | r with science |
| M0 | Mathematics | 1.316 | 0.866 | | |
| | Science | 1.268 | 0.710 | 0.907 | |
| | Reading | 1.075 | 0.799 | 0.832 | 0.839 |
| M1 | Mathematics | 1.316 | 0.859 | | |
| | Science | 1.269 | 0.697 | 0.909 | |
| | Reading | 1.077 | 0.799 | 0.847 | 0.854 |
| M2r | Mathematics | 1.315 | 0.836 | | |
| | Science | 1.268 | 0.680 | 0.913 | |
| | Reading | 1.077 | 0.770 | 0.848 | 0.854 |

Model M1 allows for estimating and testing the interaction between item position and the domain to study differences in position effects across the three domains mathematics, science, and reading. Before we fitted the multidimensional model M1 to the data, we first applied unidimensional models separately to each domain in order to find the functional form best suited for describing the item position effects. Possible nonlinear position effects were explored by including quadratic and cubic terms into the models. Based on LR tests, models with linear position effects were preferred for science [χ^{2}(2) = 2.806] and reading [χ^{2}(2) = 2.142], whereas for mathematics a quadratic position effect was supported [χ^{2}(2) = 12.560,
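For LR tests with two degrees of freedom, the p-value has a simple closed form, p = exp(−χ²/2) (the χ² distribution with df = 2 is an exponential distribution with mean 2), so the reported statistics are easy to verify by hand:

```python
import math

def p_chi2_df2(stat):
    """Survival function of a chi-square variable with df = 2."""
    return math.exp(-stat / 2.0)

# Science contrast: chi2(2) = 2.806 is clearly nonsignificant ...
p_science = p_chi2_df2(2.806)
assert abs(p_science - 0.246) < 0.001

# ... whereas chi2(2) = 12.560 (mathematics) is significant at the 1% level.
assert p_chi2_df2(12.560) < 0.01
```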

The multidimensional model M1 thus included linear item position effects for reading and science that were allowed to be different in magnitude, and a quadratic item position effect for mathematics. An LR test was applied to test the domain-by-item position interaction effect, providing a statistically significant interaction effect [χ^{2}(2) = 12.650, ^{1}

Proportions of correct responses averaged over all items of a domain depending on the item position within the test.

In M2, the item position was broken down into the block position in which items of the same domain were administered in the test and the within-block item position. ^{2}(15) = 18.097, ^{2}(9) = 94.186, ^{2}(6) = 88.174,

Estimated fixed effects of Models M2 and M2r.

| Effect | Parameter | M2: Est. | SE | p | M2r: Est. | SE | p |
| Mathematics | γ_{M} | −0.018 | 0.125 | 0.882 | −0.019 | 0.125 | 0.877 |
| Science | γ_{S} | 0.752^{***} | 0.120 | < 0.001 | 0.751^{***} | 0.120 | < 0.001 |
| Reading | γ_{R} | 0.402^{**} | 0.133 | 0.003 | 0.401^{**} | 0.133 | 0.003 |
| Mathematics × block 2 | | 0.415^{***} | 0.085 | < 0.001 | 0.415^{***} | 0.085 | < 0.001 |
| Science × block 2 | | −0.197^{**} | 0.076 | 0.010 | −0.196^{*} | 0.076 | 0.003 |
| Reading × block 2 | | −0.429^{***} | 0.078 | < 0.001 | −0.428^{***} | 0.078 | < 0.001 |
| Mathematics × block 2 × order RMS | | −0.268^{**} | 0.089 | 0.002 | −0.265^{**} | 0.088 | < 0.001 |
| Science × block 2 × order RSM | | −0.166^{*} | 0.082 | 0.042 | −0.165^{*} | 0.082 | 0.010 |
| Reading × block 2 × order SRM | | 0.723^{***} | 0.094 | < 0.001 | 0.723^{***} | 0.094 | 0.044 |
| Mathematics × block 3 | | 0.338^{***} | 0.084 | < 0.001 | 0.344^{***} | 0.084 | < 0.001 |
| Science × block 3 | | −0.564^{***} | 0.074 | < 0.001 | −0.561^{***} | 0.074 | 0.002 |
| Reading × block 3 | | −0.249^{**} | 0.082 | 0.002 | −0.246^{**} | 0.082 | < 0.001 |
| Mathematics × block 3 × order RSM | | −0.518^{***} | 0.092 | < 0.001 | −0.521^{***} | 0.092 | < 0.001 |
| Science × block 3 × order RMS | | 0.246^{**} | 0.079 | 0.002 | 0.246^{**} | 0.079 | 0.003 |
| Reading × block 3 × order SMR | | 0.132 | 0.102 | 0.197 | 0.131 | 0.102 | 0.201 |
| Mathematics × block 1 × position | κ_{M1} | 0.097 | 0.099 | 0.329 | | | |
| Science × block 1 × position | κ_{S1} | 0.083 | 0.104 | 0.426 | | | |
| Reading × block 1 × position | κ_{R1} | 0.075 | 0.093 | 0.421 | | | |
| Mathematics × block 2 × position | | 0.232 | 0.15 | 0.121 | | | |
| Science × block 2 × position | | 0.023 | 0.149 | 0.876 | | | |
| Reading × block 2 × position | | −0.009 | 0.15 | 0.952 | | | |
| Mathematics × block 2 × order RMS × position | | −0.391^{*} | 0.191 | 0.041 | | | |
| Science × block 2 × order RSM × position | | −0.242 | 0.198 | 0.220 | | | |
| Reading × block 2 × order SRM × position | | −0.225 | 0.217 | 0.298 | | | |
| Mathematics × block 3 × position | | −0.209 | 0.152 | 0.168 | | | |
| Science × block 3 × position | | −0.149 | 0.144 | 0.303 | | | |
| Reading × block 3 × position | | −0.294 | 0.165 | 0.075 | | | |
| Mathematics × block 3 × order RSM × position | | 0.107 | 0.205 | 0.602 | | | |
| Science × block 3 × order RMS × position | | 0.076 | 0.195 | 0.697 | | | |
| Reading × block 3 × order SMR × position | | 0.199 | 0.242 | 0.410 | | | |

AIC and BIC of the different models.

| | M0 | M1 | M2 | M2r |
| AIC | 55585.62 | 55539.15 | 55474.87 | 55462.97 |
| BIC | 55691.21 | 55671.14 | 55818.04 | 55674.15 |

As parameters of logistic regressions with different fixed parts cannot be compared across models (Mood, ^{2}

The results of our analyses are best illustrated by considering the item means, depending on the test form, the block position, and the within-block item position (

Proportions of correct responses across all items of a domain depending on the item position within the test and the domain order.

Block position effects in reading items were most strongly moderated by the domain order. In line with the results of Model M1, the results of M2r confirmed an average decrease in logits in reading items when presented in the second item block following mathematics items (

Taken together, the empirical results illustrate that multiple item context effects can interact in a complex way. Such interaction effects may remain undetected if analyses focus on just one of several item context effects. This can result in biased item and person parameter estimates and may lead to invalid explanations and interpretations of such effects.

Items are always presented in a context. Differences in item score distributions which depend on the context in which a particular item is presented are denoted as context effects. In recent years, such effects have received more attention, suggesting that they may be the rule rather than the exception (Leary and Dorans,

The aims of the present article were twofold. First, we investigated potential interactions between item context effects by considering the effects of the domain order and the item positions simultaneously. Second, we presented the GLMM framework (McCulloch et al.,

The main result of the empirical analyses in this study is that two context effects, namely the item position and the domain order effects, may interact substantially. In many achievement tests, items have been found to become more difficult when presented in later positions of the test. At first glance, this finding was also confirmed in our analyses when we exclusively focused on item position effects. However, including the domain order as an additional item context factor revealed a much more complex pattern. Items can become easier as well as more difficult depending on the domains presented at the beginning of the test. These results are also theoretically challenging, as they are hardly consistent with widely accepted explanations of item position effects such as fatigue or practice effects. In fact, in some cases, domain order effects appeared to be much stronger than position effects. For example, the difference in the mean logits of reading items presented in the second item block between tests of different domain orders was substantial. With SD(θ_{R}) = 0.77, this effect corresponds to a standardized effect of Cohen's
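The standardization hinted at here divides a logit difference by the estimated person standard deviation. Taking the reading block-2 domain order contrast of 0.723 logits from the fixed-effects table together with SD(θ_{R}) = 0.77 gives (our arithmetic, shown as a sketch of the computation):

```python
# Values reported in the article (reading items, second item block):
logit_diff = 0.723   # mean logit difference between two domain orders
sd_theta_r = 0.77    # estimated SD of the reading proficiency

# Cohen's d: standardize the logit difference by the person SD.
d = logit_diff / sd_theta_r
assert round(d, 2) == 0.94
```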

Note that these logit differences between groups of test takers with different versions of test are only interpretable as domain order effects because of the randomized assignment of the test forms. In nonrandomized test designs (e.g., with self-selected test versions) the same logit differences could also reflect true mean differences in the distributions of the latent variables (i.e., the person parameters) between groups with different domain order preferences. In practical applications of complex test and item designs, the analyses of item context effects should be part of the quality assurance, just like analyses of differential item functioning (DIF), item DRIFT or other approaches to check model assumptions. Our findings suggest that reliable analyses of item context effects require (a) strong test and items designs, including randomized assignment of test forms, and (b) to take potential dependencies and interactions between multiple item context effects into account by analyzing them simultaneously.

Despite the substantial item context effects, the distributions of random item and random person effects are very similar across the models M0, M1, M2, and M2r. The estimated variances of item and person parameters as well as the estimated correlation structures of the three domain-specific latent proficiencies hardly differ across models (see

Our study clearly indicates that the domain order can substantially affect the response behavior in mixed domain booklet designs in achievement tests. This result is worrisome, as designs of this kind are quite common in many LSAs of student achievement. In these assessments, a lot of effort is made to account for position effects by using booklet designs with balanced block positions. However, in most designs, domain orders are not balanced, and are sometimes even perfectly confounded with block positions. As a consequence, the impact of position effects and domain order effects on test results cannot be separated (Brennan,

Our results strongly indicate that domain order effects are an issue of concern when assessing student achievement. Careful development of booklet designs would not only enable researchers to quantify the impact of domain orders on individuals' item responses, but also to derive more purified ability estimates. In most cases, item context effects are expected to be a nuisance rather than a benefit. The models used here may not only be used to statistically control for item context effects but to obtain person parameter estimates adjusted for item context effects. GLMMs allow for the computation of the empirical Bayes estimates of individual proficiency levels. Due to the specification of the fixed part of the more complex models M1 and M2 with the first item block as the reference block, all person parameters were estimated as though

Note that taking the alternative route of employing only one fixed order of domains to all subjects in a study is not a solution to the problem. Domain order effects, as well as position effects, are as likely to occur but, in contrast to systematically rotated booklet designs, it is not possible to quantify and control them (Brennan,

As in any other empirical study, this study is affected by limitations which call for further research. The present design does not enable an estimation of a “pure” position effect on the basis of test takers working on only one domain. Although not strictly necessary for examining the moderating role of domain order effects, estimates of “purified” position effects might serve as a useful benchmark for evaluating the size of domain order effects.

Although the modeling approach proposed turned out to be complex, the resulting models might still appear to be overly simplified. For example, the GLMM framework is restricted to one-parameter IRT models. It would be interesting to implement the proposed models in different frameworks allowing for more complex measurement models, such as the two- or three-parameter IRT model. A further point that might be criticized is that our modeling approach did not include random item context effects. Such effects can, in principle, be estimated in the GLMM framework when simpler models are envisaged. We did estimate item position models (M1) including random effects on the person and item side; however, as the results indicated that the models were overparameterized, we did not pursue these models any further in this article (results are available from the first author).

The results of our study cannot be generalized automatically to other multidimensional tests that are used for assessing different theoretical constructs. Multiple context effects, and the interactions between them, have rarely been examined systematically. This is an area for further research.

This paper was not intended to provide a theoretical explanation of the various item context effects we found empirically: In the existing literature, fatigue effects, practice effects, and backfire effects (Leary and Dorans,

NR was responsible for the theoretical treatise on the conceptualization of item context effects and drafted the manuscript. He performed the theory-based derivation of the presented models and conducted all data analyses in R, including the programming of graphs. GN contributed to the preparation of the manuscript and the presentation of the results. He was involved in writing of all parts of the manuscript and was also responsible for funding the research project. AF planned and initiated the original study as a part of the MakAdapt project, including the funding thereof. He created the item and test design and coordinated the assessment and the data collection. He was also involved in the preparation of the manuscript for publication. BN and MB contributed to the preparation of the manuscript and the presentation of the results. Both were involved in writing of all parts of the manuscript and were also responsible for funding the research project. All coauthors provided critical revisions. All authors were involved in interpreting the results and approved the final version of the manuscript for submission.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at:

^{1}The existence of random item position effects across items and persons was tested in all three domains by means of LR tests. Random item position effects were either not significant or perfectly correlated with other random effects in the model. The latter indicates overparameterization of the GLMM (Baayen et al.,

^{2}Estimated standard deviations and correlations of the random effects obtained by Models M2 and M2r are nearly identical. Therefore, results of Model M2 are not included in