
Edited by: Jason C. Immekus, University of Louisville, United States

Reviewed by: Oscar Lorenzo Olvera Astivia, University of South Florida, United States; Stefano Noventa, University of Tübingen, Germany

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

An extension to a rating system for tracking the evolution of parameters over time using continuous variables is introduced. The proposed rating system assumes a distribution for the continuous responses that is agnostic to the origin of the continuous scores, and can thus be used for applications ranging from continuous scores obtained in language testing to scores derived from accuracy and response time in elementary arithmetic learning systems. Large-scale, high-stakes, online, anywhere-anytime learning and testing inherently comes with a number of unique problems that require new psychometric solutions. These include (1) the cold start problem, (2) the problem of change, and (3) the problem of personalization and adaptation. We outline how our proposed method addresses each of these problems. Three simulations are carried out to demonstrate the utility of the proposed rating system.

Large-scale, high-stakes, online, anywhere-anytime learning and testing inherently comes with a number of unique problems that require new psychometric solutions. First, there is the cold start problem: when a new person or item enters the system, nothing is yet known about its parameters.

The urnings rating system was introduced by Bolsinova et al. (2022) as a method for tracking ability and difficulty parameters in large-scale adaptive learning systems.

Continuous responses can be obtained from a wide variety of data and functions of data. In the DET, item responses are continuous numbers between zero and one. In Math Garden, continuous responses come from a combination of accuracy and time. Other learning and assessment systems may ask users to provide their perceived certainty that the chosen response is correct (de Finetti, 1965).

The model we consider is a direct extension of the Rasch model to continuous responses on the interval from zero to one, and we will refer to it as the continuous Rasch (CR) model:

f(x_{i}|θ) = η exp(ηx_{i})/(exp(η) − 1), 0 < x_{i} < 1,

where η = θ − δ_{i}, with θ representing learner ability and δ_{i} item difficulty. This is an exponential family IRT model in which the sum of the item scores is a sufficient statistic for ability^{1}.
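To make the CR model concrete, the following sketch evaluates its density and draws responses by inverting the CDF, F(x) = (exp(ηx) − 1)/(exp(η) − 1). The function names are ours, not part of the original implementation.

```python
import math
import random

def cr_density(x, theta, delta):
    """Density of the continuous Rasch (CR) model on (0, 1): f(x) = eta*exp(eta*x)/(exp(eta)-1)."""
    eta = theta - delta
    if abs(eta) < 1e-12:          # eta -> 0 reduces to the uniform density on (0, 1)
        return 1.0
    return eta * math.exp(eta * x) / math.expm1(eta)

def cr_sample(theta, delta, rng=random):
    """Inverse-CDF sampling: solve F(x) = u for x, i.e. x = log(1 + u*(exp(eta)-1))/eta."""
    eta = theta - delta
    u = rng.random()
    if abs(eta) < 1e-12:
        return u
    return math.log1p(u * math.expm1(eta)) / eta
```

For this exponential family, the expectation is E[X] = 1/(1 − exp(−η)) − 1/η, which can be used to check the sampler.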

(Left) The probability density function, (middle) the cumulative distribution function, and (right) the expectation of the continuous Rasch model where η = θ − δ_{i}.

For our present purpose, we will not analyze the continuous responses directly but a limited number of binary responses derived from them. We now explain how this works. If we define two new variables as follows:

X_{i1} = 1 if X_{i} ≥ 1/2 and X_{i1} = 0 otherwise, and R_{i1} = X_{i} − X_{i1}/2,

we obtain conditionally independent sources of information on ability from which the original observation can be reconstructed; that is, X_{i1} ⊥⊥ R_{i1} | θ. Moreover, it is readily found that the implied measurement model for X_{i1} is the Rasch model:

P(X_{i1} = 1 | θ) = exp(η/2)/(1 + exp(η/2)),

where the discrimination is equal to one half. The other variable, R_{i1}, is continuous with the following distribution over the interval 0 to 1/2:

f(r_{i1}|θ) = η exp(ηr_{i1})/(exp(η/2) − 1), 0 < r_{i1} < 1/2.

The distributions of R_{i1} and X_{i} thus belong to the same family, but with a different range for the values of the random variable. We can now continue to split R_{i1} into two new variables and recursively transform the continuous response into a set of conditionally independent Rasch response variables with discriminations that halve in every step of the recursion.

If we denote the binary response variable obtained in the j-th step of the recursion by X_{ij}, we obtain the (non-terminating) dyadic expansion (see e.g., Billingsley, 1995):

X_{i} = Σ_{j≥1} 2^{−j} X_{ij}.

Most of the information on ability in the continuous response X_{i} is contained in X_{i1} alone^{2}.
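The recursive split can be sketched as follows; `dyadic_expand` and `rasch_prob` are our own helper names. Step j yields a binary variable that, under the CR model, follows a Rasch model with discrimination 2^{−j}.

```python
import math

def dyadic_expand(x, k=3):
    """Split a continuous response x in (0, 1) into its first k binary digits.

    Step j sets the bit to 1 iff the current remainder exceeds half of its
    current range; the remainder after k steps is returned as well, so that
    x == sum(bits[j] * 2**-(j+1)) + remainder.
    """
    bits, rem, width = [], x, 1.0
    for _ in range(k):
        width /= 2.0
        bit = 1 if rem >= width else 0
        bits.append(bit)
        rem -= bit * width
    return bits, rem

def rasch_prob(eta, disc):
    """Rasch success probability with discrimination disc: P(X=1) = logistic(disc * eta)."""
    return 1.0 / (1.0 + math.exp(-disc * eta))
```

For example, x = 0.7 expands to bits [1, 0, 1] with remainder 0.075, and the first bit has success probability `rasch_prob(eta, 0.5)`.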

The first three steps of a dyadic expansion of continuous responses into conditionally independent binary response variables. Each follows a Rasch model with a discrimination that halves at each subsequent step.

Other models have been developed for continuous responses, notably the extension by Samejima of the graded response model to continuous responses (Samejima, 1973).

Adaptive online tests produce data sets with both a large number of test takers and a large number of items. Even when we analyze binary response variables, direct likelihood-based inference will not scale up to these large amounts of data. We will therefore use a rating system. A rating system is a method to assess a player's strength in games of skill and to track its evolution over time. Here, learners solving items are considered players competing against each other, and the ratings represent the skill of the learner and the difficulty of the item.

Rating systems, such as the Elo rating system (Elo, 1978), have a long history in competitive games such as chess.

Urnings is a rating system in which discrete parameters U_{p} and U_{i}, the "urnings," track the ability of a person and the difficulty of an item. Urnings assumes that the observed binary responses result from a game of chance played between persons and items that are matched up with a probability that may depend on their current ratings. The game proceeds with each player drawing a ball from an infinite urn containing red and green balls, the proportion of green balls being π_{p} in the person urn and π_{i} in the item urn. The game ends when the balls drawn are of different color and the player with the green ball wins. If the person wins, the item is solved and so the binary response corresponds to X_{pi} = 1, with

P(X_{pi} = 1) = π_{p}(1 − π_{i})/(π_{p}(1 − π_{i}) + (1 − π_{p})π_{i}) = exp(θ_{p})/(exp(θ_{p}) + exp(θ_{i})),

where θ_{p} = ln(π_{p}/(1−π_{p})) and similarly for θ_{i}.

The urnings rating system mimics this game using finite-sized urns. For each "real" game that is played, a corresponding simulated game is played with finite urns containing, respectively, U_{p} and U_{i} green balls out of n balls^{3}. Let Y_{p} and Y_{i} denote the outcome of the simulated game. If the result of the simulated game does not match that of the real game, the balls drawn are replaced with balls colored according to the outcome of the real game: if the person wins the real game but loses the simulated one, a green ball is added to the person urn and removed from the item urn, and vice versa when the item wins. The urn proportions U_{p}/n and U_{i}/n then fluctuate around π_{p} and π_{i} when neither persons nor items change.

Urnings rating system.

As the urnings rating system is designed to work with dichotomous response variables, it is not directly applicable to the CR model. However, through the use of the dyadic expansion, the continuous responses are transformed into a series of dichotomous responses, and the urnings rating system can be applied directly to the dichotomous response variables that result. For a dyadic expansion of order k this requires k urns per person, where the urn for step j tracks the inverse logit of the scaled ability θ_{p}/2^{j}. This will be similar for the item urns and item difficulty. In the simulation section below, we show how this multi-urn solution can be used to identify model misspecification.

In the next section we derive an extension to the classical urnings rating system which tracks θ_{p} using a single urn.

Recall that the binary response variable obtained in step j of a dyadic expansion of order k follows a Rasch model with discrimination 2^{−j}, so that, relative to the last step, the responses carry integer weights, or stakes, of 2^{k−j}.

How does this impact the urnings update? The binary responses X_{pi} are now assumed to be generated by the following game of chance. The game is the same as above for classic urnings, except that now the game has stakes.

Extended Urnings rating system.

Similarly, a simulated game is played where balls are drawn (Y_{p} and Y_{i}) from finite urns until the balls drawn are of different color.
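One plausible way to incorporate stakes is sketched below, under the assumption that a game with stakes s recolors s balls at once when the simulated game disagrees with the real one; the clipping rule that keeps the urn counts valid is our illustrative choice, not necessarily the exact update derived in this section.

```python
import math
import random

def play_game(p_green_person, p_green_item, rng):
    """Draw one ball from each urn until the colors differ; 1 = person wins."""
    while True:
        g_p = rng.random() < p_green_person
        g_i = rng.random() < p_green_item
        if g_p != g_i:
            return 1 if g_p else 0

def staked_urnings_step(u_p, u_i, n, theta_p, theta_i, stakes, rng):
    """Urnings-style update with stakes: on disagreement, move `stakes` balls
    instead of one, clipped so both urn counts stay within [0, n]."""
    pi_p = 1.0 / (1.0 + math.exp(-theta_p))
    pi_i = 1.0 / (1.0 + math.exp(-theta_i))
    x = play_game(pi_p, pi_i, rng)            # real game
    y = play_game(u_p / n, u_i / n, rng)      # simulated game
    step = stakes * (x - y)
    # clip so that u_p + step and u_i - step both remain in [0, n]
    step = max(-min(u_p, n - u_i), min(n - u_p, u_i, step))
    return u_p + step, u_i - step
```

Larger stakes produce larger (faster but noisier) updates, which is exactly the behavior exploited for the cold start problem discussed later.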

We provide three simulation studies to illustrate the benefits of the proposed method. Simulation 1 shows how the urnings algorithm recovers the true ability of the persons and is robust to misspecification of the model generating the continuous responses. Simulation 2 considers a more realistic setting and shows how our proposed approach handles the problems inherent in learning and assessment specified in the introduction. Simulation 3 highlights a problem inherent in any model that tracks ability and difficulty: these quantities are not separately identified, and it is easy to be misled when this is not taken into account (Bechger and Maris, 2015).

We simulate 1,000 persons with ability uniformly distributed between −4 and 4, θ_{p} ~ U(−4, 4), and items with difficulty uniformly distributed, δ_{i} ~ U(−4, 4).

The results of tracking the responses using the three-urn system are shown in the figures below.

Contours for the predicted and observed proportion of correct responses for every combination of Urnings from simulation 1. Plots from left to right correspond to the urn associated with the respective step in the dyadic expansion.

Urn proportions of the three urns plotted against the expit of the scaled ability θ_{p}/2^{j}.

How robust is this approach to deviations from the assumptions? We investigate this by simulating from a different underlying model. The learning and assessment system Math Garden also has continuous responses and assumes the same distribution for the scores as we do. The scores in Math Garden are derived from a combination of response accuracy, i.e., whether the response was correct or incorrect, and response time, in such a way that fast incorrect responses are penalized. Specifically, S_{i} = (2A_{i} − 1)(d − T_{i}), where A_{i} indicates whether the response was correct and T_{i} is the response time when the time limit for responding is set to d. Consider the alternative scoring rule S_{i} = A_{i}d − T_{i}, in which a slow incorrect response has a large negative score. The question is: can we detect that learners follow the alternative scoring rule rather than the intended one? The answer is yes. We will show this by means of a simulation.

We augment the first simulation. Rather than simulating from the CR model, we simulate from the distribution implied by the alternative scoring rule S_{i} = A_{i}d − T_{i}. One can show that to simulate from this distribution we can do the following: we first simulate the response X_{i} from the CR model, but if the response is below one half, X_{i} < 0.5, we set the score to 0.5 − X_{i}. One of the benefits of using three separate urns to track the ability is that model misfit can be detected by comparing the urns to each other. The relationship between the true urn proportions is a known function. Specifically, if θ_{p} are the true simulated abilities, we can plot the inverse logit of θ_{p}/2 against the inverse logit of θ_{p}/4. If the observed urn proportions do not follow this relationship, there is model misfit.

Urn proportions in urn 1 plotted against urn proportions in urn 2 using the true generating model and the alternative model.
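The misfit check that compares the first two urns can be made explicit. Since urn j tracks the inverse logit of θ_{p}/2^{j}, eliminating θ_{p} gives the implied relation u_{2} = invlogit(logit(u_{1})/2); the helper names below are ours.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def invlogit(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_second_urn(u1):
    """Implied second-urn proportion: urn 1 tracks invlogit(theta/2) and urn 2
    tracks invlogit(theta/4), so u2 = invlogit(logit(u1) / 2)."""
    return invlogit(0.5 * logit(u1))

def misfit_statistic(pairs):
    """Mean absolute deviation between observed and implied second-urn
    proportions; systematic deviation from ~0 signals model misfit."""
    return sum(abs(u2 - expected_second_urn(u1)) for u1, u2 in pairs) / len(pairs)
```

Under the true generating model the observed pairs fall on this curve; under the alternative scoring rule they systematically depart from it.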

For Simulation 2 we consider a more realistic setting. Specifically, we deal with two problems inherent in learning and assessment systems: the problem of change (person abilities change as learning progresses) and the problem of personalization and adaptation (items are selected adaptively to match the person's ability).

The person abilities change over time according to a logistic growth curve,

θ_{p}(t) = θ_{p1} + (θ_{p2} − θ_{p1})/(1 + exp(−α_{p}t)),

where t indexes the simulated responses (1 to 10^{8}) mapped to the interval (−4, 4), θ_{p1} ~ U(−4, 4), θ_{p2} ~ U(−4, 4), and α_{p} ~ Gamma(1, 1). The item difficulty is simulated from the uniform again, δ_{i} ~ U(−4, 4), and held constant. Once again, we simulate 10^{8} responses from the continuous Rasch model where a person is (uniformly) randomly selected, but now a random item is selected by choosing one with the following weights

where U_{p}/n_{p} corresponds to the selected person's urn proportion, U_{i}/n_{i} corresponds to item i's urn proportion, and n_{p} and n_{i} are the person and item urn sizes, respectively. This results in items whose difficulty is closer to the selected person's ability being more likely to be selected. For this simulation we track the ability using a single urn, with urn sizes of 420 for both the person and item urns.
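A weighted item selection in this spirit can be sketched as follows. The Gaussian kernel and its bandwidth are our illustrative assumptions, not necessarily the exact weights used in the simulation; they merely implement "items close to the person's urn proportion are more likely to be selected."

```python
import math
import random

def select_item(person_prop, item_props, sigma=0.2, rng=random):
    """Pick an item index with probability proportional to a Gaussian kernel
    on the distance between the person's and each item's urn proportion."""
    weights = [math.exp(-0.5 * ((person_prop - q) / sigma) ** 2) for q in item_props]
    total = sum(weights)
    r = rng.random() * total
    for idx, w in enumerate(weights):   # inverse-CDF draw over the weights
        r -= w
        if r <= 0:
            return idx
    return len(weights) - 1
```

With a person proportion of 0.9, an item at proportion 0.88 is selected far more often than items at 0.5 or 0.1.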

The true (solid red line) and estimated (blue line) change in ability (left) for one specific person and item difficulty (right) for one specific item in simulation 2.

The probability that a specific person answers a specific item correctly over time in simulation 2.

For the final simulation we explore the trouble with every measurement model that relates ability to difficulty as the Rasch model does: the unidentifiability of these parameters. In most assessment frameworks this issue is circumvented by assumptions, such as the assumption that the abilities of the persons and the difficulties of the items are static and not changing. Additionally, some arbitrary zero point must be chosen, typically by setting the average difficulty of the population of items to zero. In this final simulation, we relax some of these assumptions, as typically happens in real data, especially in learning systems.

As before, we allow the ability to change over time in the same way as we did in simulation 2. However, we restrict the change in ability to be positive by sampling θ_{p1} ~ U(−4, 0) and θ_{p2} ~ U(0, 4), so that each person's ability increases. Furthermore, we allow the difficulty of the items to change over time. The item difficulties change in the same way as the person abilities, but they all decrease over time. Specifically, the difficulty is

where δ_{i1} ~ U(0, 4) and δ_{i2} ~ U(−4, 0). Additionally, we split the items into four groups such that the point t_{0} (at which the difficulty is half way between its starting value, δ_{i1}, and its ending value, δ_{i2}) varies between groups. In the first group of items the mid-point is at the first quarter of the number of simulated interactions, for the second group it is half way through the simulated interactions (just like the person ability), for the third group it is three quarters of the way through the simulated interactions, and the last group does not change in difficulty.
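The group-specific mid-point t_{0} fits naturally into the logistic change curve used for the abilities; a small helper (assuming the same logistic form as in simulation 2) makes this explicit:

```python
import math

def logistic_trajectory(v1, v2, rate, t, t0=0.0):
    """Smooth change from v1 toward v2 as t grows; t is 'time' mapped to
    (-4, 4) and t0 is the mid-point at which the value is exactly half way
    between v1 and v2."""
    return v1 + (v2 - v1) / (1.0 + math.exp(-rate * (t - t0)))
```

Shifting t0 toward the start or end of the simulated interactions reproduces the four item groups described above; setting v1 == v2 gives the static last group.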

True item difficulties in simulation 3.

The true (solid red line) and estimated (blue line) change in ability (left) for one specific person and item difficulty (right) for one specific item in simulation 3.

The probability that a specific person answers a specific item correctly over time in simulation 3.

In this article, we have proposed a new method to analyze data generated by massive online learning systems, such as DET or Math Garden, based on the CR model and the Urnings ratings system. We have demonstrated its feasibility using simulation.

The approach described here is new and based on three ingredients. First, we found that the signed residual time (SRT) model is a special case of a Rasch model for continuous item responses. Second, we established that, if the CR model holds, continuous responses can be transformed into conditionally independent binary responses that follow the Rasch model and contain most of the information in the original responses. Of course, the Rasch model is known to not always fit the data, as it assumes each item discriminates equally well (Verhelst and Glas, 1995). Third, we extended the urnings rating system to handle binary responses with varying discriminations.

In the introduction, three unique problems with large-scale, high-stakes, online, anywhere anytime learning and testing were identified. Having dealt with the problem of change and of personalization and adaptation we now briefly comment on the cold start problem. Having introduced the notion of stakes, as a way of dealing with differences in item discrimination, we can reuse the same idea for addressing the cold start problem. When a new person or item is added, we initially multiply their stakes by some number. This has the effect, similar to decreasing the urn size, of taking large(r) steps, and hence more rapidly converging to the “correct” value, but with a larger standard error. After some initial responses have been processed, the multiplier can decrease to one. Note that, in principle, the same approach can be used continuously to adjust the stakes depending on how fast or slow a person or item parameter is changing.
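The cold-start idea above can be sketched as a stake multiplier that starts large and decays to one. The linear warm-up schedule and its parameters are illustrative assumptions; any decreasing schedule would serve.

```python
def coldstart_multiplier(n_responses, start=8.0, warmup=50):
    """Stake multiplier for a newly added person or item: large at first, so
    the rating takes big steps and converges quickly (at the cost of a larger
    standard error), decaying linearly to 1 after `warmup` responses."""
    if n_responses >= warmup:
        return 1.0
    return start + (1.0 - start) * (n_responses / warmup)
```

In use, the stakes of each game involving the new person or item would simply be multiplied by `coldstart_multiplier(n_responses)` before the urn update.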

An extension of the urnings system was introduced in order to make use of the dichotomous responses with varying discriminations. It will be clear that we have only begun to explore the possibilities offered by the new method.

The datasets generated for this study are available on request to the corresponding author.

GM developed the initial idea. BD, MB, TB, and GM were involved in further developments, writing, and critical revisions. BD and GM developed code and simulations. All authors contributed to the article and approved the submitted version.

BD, TB, and GM work at ACT, Inc. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at:

^{1}After re-scaling, if

^{2}The infinite sum

^{3}Note that in practice the number of balls in the person urns and item urns does not have to be equal, but for notation's sake we will keep them the same.