
Edited by: Robbert Smit, University of Teacher Education St. Gallen, Switzerland

Reviewed by: Alexander Naumann, Leibniz Institute for Research and Information in Education (DIPF), Germany; Michiel Veldhuis, Hogeschool iPabo, Netherlands

This article was submitted to Assessment, Testing and Applied Measurement, a section of the journal Frontiers in Education

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Subtraction errors can inform teachers about students’ mathematical reasoning. Not every subtraction error is informative, however; its implications for students’ mathematical reasoning depend on the item characteristics. Diagnostic items are specifically designed to elicit specific subtraction errors. This study evaluated how the diagnostic capacity of subtraction items is related to their characteristics. The item characteristics under study are the answering format (open-ended and multiple-choice, MC), the item format (bare number and word problems), and various number features, such as the number of digits in the subtrahend and minuend. Diagnostic capacity is defined as the extent to which multi-digit subtraction items that require borrowing (e.g., 1000

Diagnostic items can be designed to collect specific and fine-grained information about students’ cognitive strengths and weaknesses (

In the theoretical framework, we explain the conceptual and procedural misunderstandings that underpin BE and discuss how these misunderstandings are related to students’ procedural development in subtraction and to their conceptual development of multi-digit subtraction, place value, and borrowing. Although the analysis of systematic subtraction errors is not a novel research area, research into the design of diagnostic items to elicit specific errors in subtraction is relatively new. Understanding the item characteristics leading to bridging errors will inform the design of diagnostic subtraction items. In this study, we designed diagnostic items that could elicit three types of BE:

e.g., 43−17 =

Internationally, there are differences in the grade in which multi-digit subtraction procedures, such as column-wise and ciphering, are taught (

Strategies being taught in Dutch subtraction.

Furthermore, parallel to teaching jumping strategies, third-grade students’ conceptual understanding of the base-ten place-value system is promoted through the use of materials that can be grouped into tens and ones, such as money and Multibase Arithmetic Blocks (MAB) (

During third grade, most Dutch students make the transition from sequential jumping strategies to place-value-based decomposition strategies (

The procedural transition to decomposition requires the simultaneous transition to an integrated concept of multi-digit numbers. According to

Furthermore, previous research has shown that items that require borrowing elicit many different systematic errors that might all somehow be related to students’ conceptual and procedural understanding of multi-digit subtraction, borrowing, and place value (e.g.,

In this section, we elaborate on the number features, item formats, and answering formats that were compared in the present study. The methodological details of the design process of the items are described in the section “Materials and Methods.”

As explained above, decomposition strategies are based on students’ understanding of place-value principles. Students may make an incorrect transition from conceptual understanding of single digits to multiple digits in which they view multi-digit numbers as concatenated single digits: 83 is “eight” “three” instead of “eighty-three” (
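The concatenated-single-digits view makes the classic smaller-from-larger bug easy to state procedurally: subtract column by column, always taking the smaller digit from the larger. A minimal simulation of this bug (the function name and Python rendering are ours, for illustration only, not part of the study’s materials):

```python
def smaller_from_larger(minuend: int, subtrahend: int) -> int:
    """Simulate the smaller-from-larger bug: subtract column-wise and
    take the absolute digit difference instead of borrowing."""
    m = str(minuend)
    s = str(subtrahend).rjust(len(m), "0")  # pad subtrahend with leading zeros
    return int("".join(str(abs(int(a) - int(b))) for a, b in zip(m, s)))

# A student with this bug answers 83 - 26 as 63 (8-2 = 6, |3-6| = 3),
# whereas the correct answer is 57.
```

For 43−17 the bug yields 34 instead of 26, matching the bridging-error pattern discussed above.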

Although it could be expected that the more difficult unequal digit problems have a higher diagnostic capacity, it is not yet known how the difficulty of the items relates to their diagnostic capacity. Moreover, due to students’ gradual transition from a linear understanding of multi-digit numbers to a place-value-based understanding (

Moreover, subtraction items can differ in their borrow type: for example, 83−26 = requires borrowing from the tens, while 634−251 = requires borrowing from the hundreds, and 400−27 = requires borrowing from both tens and hundreds. Borrowing from both tens and hundreds requires multiple steps, which makes these items more complex. On the one hand, this complexity could make it more likely that bridging errors are elicited; on the other hand, more complex items could also increase the number of other errors being elicited. This makes it interesting to compare the diagnostic capacity of items that differ in the type and number of borrows.
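The borrow type of an item can be derived mechanically from its operands. A sketch (the function is ours and assumes minuend > subtrahend, as in all items considered here):

```python
def borrow_places(minuend: int, subtrahend: int) -> list[int]:
    """Return the place values borrowed from when subtracting column-wise,
    right to left: a borrow for the units column takes from the tens (10),
    a borrow for the tens column takes from the hundreds (100), and so on.
    Assumes minuend > subtrahend."""
    places, carry, place = [], 0, 10
    while subtrahend > 0 or carry:
        if minuend % 10 < subtrahend % 10 + carry:
            places.append(place)  # this column needs a borrow
            carry = 1
        else:
            carry = 0
        minuend //= 10
        subtrahend //= 10
        place *= 10
    return places

# 83-26 borrows from the tens only; 400-27 borrows from tens and hundreds.
```

Applied to the examples above, 634−251 yields a single borrow from the hundreds, while 400−27 yields borrows from both tens and hundreds.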

Item format refers to the way an item is presented: in context as a word problem, or as a bare number problem without words or images. Both word and bare number problems are part of frequently used textbooks in Dutch education. Word problems can convey different meanings of subtraction (taking away and determining the difference), resulting in different solution processes (

Regarding the answer format of the items, we were interested in comparing open-ended (OE) and multiple-choice (MC) items. OE items can provide easy-to-code numeric answers and rich data in which different types of subtraction errors can be observed. OE items might also help to discover new systematic errors and thereby contribute to research about misconceptions. The biggest advantage of MC items, on the other hand, is the possibility to efficiently distinguish between a subset of misconceptions through the use of specific errors as distractors, as is done with ordered MC items and second-tier items (e.g.,

The design of the items is elaborated in the section “Materials and Methods.” By answering the following research questions, we intend to inform the design of diagnostic subtraction items and generate new ideas for further research in this relatively new field of assessment research. Additionally, the results of this study can inform the use of diagnostic subtraction items in classroom assessment.

To what extent is the diagnostic capacity related to the item difficulty and how does this relation differ for the item characteristics?

To what extent can the differences in the diagnostic capacity of the subtraction items be explained by their characteristics (i.e., item format, answering format, and number features)?

Response data was gathered from 264 third-grade students (132 boys, 130 girls,

Historical data from the LOVS M3 test was used to identify potential appropriate number features for the diagnostic items. We selected the subtraction items for which bridging errors were among the four most frequent errors. We identified four LOVS M3 subtraction items that often elicit bridging errors: Two bare number items 76−48 = (item 1) and 700−32 = (item 2) and two context problems 300−2 = (item 3) and 1000−680 = (item 4, see

Contexts of the LOVS M5 subtraction items that require bridging translated from Dutch to English:

Bridging errors are associated with the use of decomposition strategies; therefore, we aimed to minimize the elicitation of strategies associated with jumping, like compensation and subtraction by addition. We used three number constraints to create the number features we are interested in. The first constraint was that the units may not be 8 or 9, because these digits elicit compensation strategies such as 76−48 = via 76−50 = 26, 26 + 2 = 28 (

Secondly, subtraction by addition could make the elicitation of bridging errors less likely. To avoid elicitation of the strategy subtraction by addition (e.g., solving 73−67 = , via 67 + ? = 73;

The last design constraint concerning the number features of the items focused on accidentally getting the right answer while applying an erroneous strategy: items that cannot distinguish between the correct answer and a bridging error will not result in a valid diagnosis. For example, when solving the item 82−27 = , borrowing a ten would result in (1)2−7 = 5, whereas reversing the units (7−2) would also give 5 as the result. Students who make BE type 2 would accidentally arrive at the right answer. Therefore, correct answers to items like 82−27 = do not always provide valid diagnostic information about bridging errors. Similarly, items like 81−26 = , with (1)1−6 = 5 and 6−1 = 5, should be avoided.
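This third constraint can be checked automatically. A sketch of such a screen for the unit-reversal case described above (the function name is ours; the study’s actual item-selection procedure may have differed):

```python
def unit_reversal_detectable(minuend: int, subtrahend: int) -> bool:
    """Screen an item for the constraint above: if reversing the unit
    digits (taking the smaller from the larger without borrowing) produces
    the same unit digit as correct borrowing, a correct-looking answer is
    diagnostically ambiguous and the item should be avoided."""
    correct_units = (minuend - subtrahend) % 10
    reversed_units = abs(minuend % 10 - subtrahend % 10)
    return correct_units != reversed_units

# 82-27 and 81-26 fail the screen ((1)2-7 = 5 vs 7-2 = 5); 83-26 passes.
```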

Item types for bridging errors in third grade subtraction.

Type | Borrow from | Digits | Example
1^{a} | 10 | 2n^{b}−2n | 83−26 =
2 | 10 | 3n−3n | 453−127 =
3 | 100 | 3n−2n | 347−62 =
4 | 100 | 3n−3n | 634−251 =
5 | 10 | 2n−2n | 70−43 =
6 | 100 | 3n−2n | 406−22 =
7^{a} | 10, 100 | 3n−2n | 400−27 =
8 | 100, 1000 | 1000−2n | 1000−70 =
9^{a} | 100, 1000 | 1000−3n | 1000−340 =

^{a}Types that are cloned from the LOVS M3 test.

^{b}Digits within each number.

Moreover, for the comparison between word and bare number problems, three word-problems were created (of which one was MC) for each item type. The translation of these originally Dutch items is included in English in

As is shown in

Because a test with 54 subtraction items is too long for third-grade students, an incomplete research design with linked items was used. Item types 1, 7, and 9 were used as anchor items because these items were cloned from LOVS M3 items, which means that those items match third-grade students’ subtraction skills. Two additional item types were selected based on students’ responses to the four subtraction items from the LOVS M3 test mentioned above. This selection process is shown in the flowchart included in

Booklet design that resulted from the adaptive selection of items shown in

A research assistant or researcher administered the DI in each classroom. A standardized instruction was read aloud by the test administrator: “This test consists of 30 subtraction problems. You may write down your calculations in the box next to the problem. We are now going to practice two problems together.” Next, the test administrator practiced two example subtraction problems with the students. The correct answers to the practice problems were given, but no strategies were discussed. Although there was no official time limit, after 60 min the test administrator would collect all the booklets. Most students finished within 60 min. The few students who could not finish the test within 60 min were offered the opportunity to finish it later; their tests were returned to the researchers by the teacher via mail. When the test administrator and teacher observed that a student was struggling too much, they gave the student the choice to stop the test at any moment.

Item response theory (IRT) was used to obtain parameter estimates for the 54 diagnostic items. In IRT, the difficulty of items is estimated conditional on students’ proficiency (

Two IRT analyses were done. In the first analysis, the prevalence of a bridging error was modeled instead of the prevalence of a correct response. The purpose of this analysis was to obtain estimates of the relative diagnostic capacity of the items, defined as an item’s capacity to elicit bridging errors. In this analysis, items were coded as 1 = bridging error and 0 = correct or other error. Modeling using an IRT procedure allowed us to compare the capacity to elicit bridging errors across items administered in an incomplete design with groups of test-takers that differ in their tendency to make bridging errors. As described above, the DI was administered through an incomplete design with 11 booklets, which were linked through 18 common items. The purpose of the second analysis was to obtain estimates of the relative item difficulty, which is a more standard application of IRT. In this analysis, items were coded as 1 = correct and 0 = incorrect response. The item parameters resulting from the IRT analyses could be transformed into the estimated proportion of bridging errors or the proportion correct for the total population. OPLM software (

To enhance the interpretation of results, the item parameters were transformed into the expected proportion BE and the expected proportion correct in the population. This transformation weights the item parameters by their practical impact, translating them into observable properties of the items. For example, the difference between item parameters of 4 and 5 does not lead to a substantial change in the probability correct for a student with proficiency 0, whereas the difference between item parameters of 0 and 1 does. These proportions were used in the descriptive and correlational analyses that were done to answer the first research question. This was not done for research question 2, since the item parameters are better in line with the equal-variance assumption of the ANOVA. For these analyses, the item parameters for the prevalence of bridging errors were used as the dependent variable. Note that higher item parameter values correspond to fewer bridging errors and consequently a lower diagnostic capacity. In the first ANOVA, item format and answering format were used as independent factors, resulting in a 2 × 2 design. In the second ANOVA, the number of digits and the borrow type were used as independent factors, resulting in a 3 × 3 design. Grouping item types by common features increases the power of the analysis due to more observations per cell of the design. The 3-category factors digits and borrow type were created by recoding the nine item types. Item types 1 and 5 were recoded as category 1 (2n−2n), item types 3, 6, 7, and 8 were recoded as category 2 (3n/4n−2n), and item types 2, 4, and 9 were recoded as category 3 (3n/4n−3n). Furthermore, for the variable borrow type, item types 1, 2, and 5 were recoded as category 1 (borrow from 10), item types 3, 4, and 6 were recoded as category 2 (borrow from 100), and item types 7, 8, and 9 were recoded as category 3 (borrow from multiple).
The above-described analyses were done with SPSS 23 (

The first research question concerned the relationship between the proportion of bridging errors and the difficulty of the subtraction items. For this research question, the expected proportion correct was used as an indicator of item difficulty; hence, the higher the proportion, the easier the item. The proportion BE (P_{BE}) and proportion correct (P_{C}) were calculated as the expected proportions in the population under the IRT model. The Rasch model showed a reasonable fit. In the model with proportion correct, 6 out of 44 items had significant S-statistics (P_{BE} of zero in the analyses. Pearson’s bivariate correlation was calculated, P_{BE} and P_{C} for the MC items, P_{BE} and P_{C}. However, for the number features a negative relationship between P_{BE} and P_{C} was found for item type 9 (1000−340 = ),

Furthermore, P_{BE} and P_{C} differed for the item characteristics evaluated in the present study. For the answer format, it was found that the P_{BE} as well as the P_{C} for MC items is higher than for OE items. The differences in the P_{BE} and P_{C} for the two item formats were relatively small, with the P_{BE} of bare number problems being slightly higher than the P_{BE} of word problems. With regard to the number features, it was found that items with more digits in the subtrahend and minuend (i.e., 3n/4n−3n, types 2, 4, and 9) had the highest P_{BE} and a relatively low P_{C} compared to items with fewer digits (i.e., 3n/4n−2n and 2n−2n, types 1, 3, 5, 6, 7, and 8). Whether these differences are significant was explored in the ANOVA analyses.

Proportion bridging errors and proportion correct for answer format and item format.

 | | P_{BE} (M) | P_{BE} (SD) | P_{C} (M) | P_{C} (SD)
Answer format | MC | 0.277 | 0.131 | 0.635 | 0.109
 | OE | 0.115 | 0.059 | 0.584 | 0.130
Item format | Word problem | 0.155 | 0.116 | 0.598 | 0.126
 | Bare number problem | 0.183 | 0.119 | 0.605 | 0.126
Number features: Digits | 2n−2n | 0.308 | 0.171 | 0.697 | 0.049
 | 3n/4n−2n | 0.320 | 0.117 | 0.635 | 0.095
 | 3n/4n−3n | 0.470 | 0.237 | 0.491 | 0.117
Number features: Borrow | 10 | 0.191 | 0.104 | 0.619 | 0.125
 | 100 | 0.163 | 0.129 | 0.522 | 0.134
 | Multiple | 0.154 | 0.121 | 0.662 | 0.063

To evaluate the diagnostic capacity of the item and answering format (research question 2), a 2 × 2 between-subject (BS) factor ANOVA was done with answer format and item format as BS factors and the diagnostic capacity of the items as the dependent variable. As explained in the section “Materials and Methods,” the parameter estimate of the diagnostic capacity was used for this analysis because of the assumptions underlying ANOVA. As shown in the ANOVA results, the model explained R^{2} = 0.441. Hence, the diagnostic capacity of MC items is found to be significantly higher than that of OE items, M_{MC} = −0.997, SD_{MC} = 0.818, M_{OE} = 0.498, SD_{OE} = 0.877. Evidently, this result was to be expected, given that the MC items were constructed to have distractors that indicate BE.

ANOVA results with diagnostic capacity as dependent variable.

Effect | Type | df | F | p | η^{2}
Model 1 (R^{2} = 0.441) | | | | |
*Answer format (AF) | BS-factor | 1,50 | 36.871 | <0.001* | 0.424
Item format (IF) | BS-factor | 1,50 | 1.755 | 0.191 | 0.034
AF × IF | Interaction | 1,50 | 0.197 | 0.659 | 0.004
Model 2 (R^{2} = 0.225) | | | | |
*Digits (D) | BS-factor | 2,48 | 5.790 | 0.007* | 0.187
Borrow from (BF) | BS-factor | 2,48 | 0.051 | 0.950 | 0.002
D × BF | Interaction | 1,48 | 0.597 | 0.444 | 0.012

To explore whether the distractors that were chosen in the MC items represent the most frequent BE found in the open-ended items, the frequencies of the three BE and their possible combinations were analyzed. More specifically,

Average frequency of BE types for OE and MC items.

1 through 9 | 1 | 36 | 27 | 6.19 | 8.20 | 18 | 43 | 9.50 | 12.61
 | 2 | 36 | 10 | 2.64 | 3.50 | 18 | 22 | 9.28 | 8.44
 | 3 | 36 | 13 | 4.53 | 7.19 | 18 | 16 | 12.56 | 18.82
7 and 9 | CE^{a} | 8 | 55 | 19.75 | 20.52 | 4 | 85 | 48.75 | 30.97

^{a}CE, combination of BE for items with multiple borrows. Note that item type 8 also had possible combination errors, but none of those were observed in our data.

Subsequently, the differences between MC and OE items for the three error types were tested using a t-test, M_{Difference} = −0.00919,

Error | F^{a} | p | t^{b} | df | p | Lower^{c} | Upper^{c}
BE1 | 4.527 | 0.038 | −1.010 | 24.422 | 0.322 | −3.306 | 3.272
BE2 | 30.983 | <0.001 | −3.204 | 19.978 | 0.004 | −6.639 | 2.072
BE3 | 6.850 | 0.012 | −1.747 | 19.523 | 0.096 | −8.028 | 4.595

^{a}Levene’s test for equality of variances.

^{b}Equal variances not assumed.

^{c}Difference.

The second ANOVA was a 3 × 3 BS-factor design with digits and borrow type as the BS factors. It was found that the average diagnostic capacity differed for the BS-factor digits (see _{difference} = 1.145,

Furthermore, item type 8 was found to be the least suitable for diagnosing students’ BE. Looking at the error frequencies for item type 8, the most frequently observed error type consisted of errors such as 1000−20 = 800; this error was observed 8, 5, 11, 8, 5, and 7 times in items 0801 through 0806, respectively. Similarly, the error 1000−20 = 80 was observed 7, 1, 2, 9, 1, and 1 times in items 0801 through 0806, respectively. Note that the frequencies of 7 and 9 were observed with an MC item, which might be why they were observed more frequently. It is noteworthy that the subsample of students who responded to item type 8 had a mathematical ability of 59.133

Over the past decades, there has been ample research into systematic errors in subtraction. However, none of those studies systematically evaluated which item characteristics make an item suitable for a specific error diagnosis. Based on previous research, we focused on diagnostic items that elicit bridging errors in multi-digit subtraction, which are errors derived from the frequently observed smaller-from-larger error (

We found no significant correlation between the estimated proportion of bridging errors and the proportion correct of the items (research question 1). This implies that the difficulty of an item is not indicative of its diagnostic capacity. Therefore, diagnostic capacity should be considered a construct distinct from item difficulty, one that may be influenced differently by item characteristics. However, for item type 9 (e.g., 1000−340 = ), we did find a negative relationship between the proportion of bridging errors and the proportion correct: a higher proportion of bridging errors was associated with a lower proportion correct. This result indicates that, for this item type, most of the errors made were bridging errors, and almost no other errors were made on this item (see

Looking at the number features of the items, it was found in the ANOVA that the diagnostic capacity of 3n/4n−3n items is significantly higher than the diagnostic capacity of 3n/4n−2n items (i.e., item types 3, 6, 7, and 8). This result does not, however, indicate that the diagnostic capacity of the items is related to the number of digits in the subtrahend and minuend being unequal. One of the most important findings of this study is that item type 8 (e.g., 1000−70 = ) was not only the easiest item; it also had the lowest diagnostic capacity. A subsequent error analysis showed that students made relatively few bridging errors; instead, these items seem to elicit errors such as 1000−20 = 800 and 1000−20 = 80. Because item type 8 elicits other systematic errors more frequently, it is questionable whether item type 8 is a valid item type for diagnosing bridging errors. Unfortunately, we do not have data about students’ mathematical conceptual and procedural reasoning to explain this error. A plausible explanation is that item type 8 elicits jumping instead of decomposition, because there is no reason for a place-value-based partitioning of the subtrahend when the subtrahend is a multiple of ten.

Moreover, the students who responded to item type 8 had a significantly lower mathematical ability, which makes it more likely that they use a jumping strategy instead of a decomposition strategy (

The lower mathematical ability of students who responded to item type 8 was the result of our adaptive design of item types (see

Finally, the present study focused on identifying students’ procedural and conceptual strengths and weaknesses in multi-digit subtraction and borrowing. The value of diagnosing bridging errors should be further evaluated by studying teachers’ instructional decisions based on students’ error profiles. Such research can result in empirical information about effective interventions to remediate bridging errors and facilitate students’ transition from jumping to decomposition strategies and to column-wise and ciphering in higher grades. Our theoretical framework suggests that the use of models like money and MAB material can support the transition from linear to place-value-based understanding of multi-digit numbers (

In conclusion, the present study showed that items like 453−127 = (Type 2), 634−251 = (Type 4), and 1000−340 = (Type 9) were the most suitable for diagnosing bridging errors mid third grade. As expected, we found that MC items have a higher diagnostic capacity than open-ended items. Nevertheless, we would argue that MC and open-ended items serve different purposes. MC items could be a more accessible approach for teachers when using a diagnostic instrument for bridging errors as part of a formative teaching process, and they might be useful for diagnosing the three specific types of BE. Open-ended questions, on the other hand, are more useful when exploring error profiles that are not solely focused on diagnosing bridging errors; this could be applied in both classroom and research settings. Because item types 2 and 4 only have three possible bridging errors, they can easily be administered using MC items with the three BE listed in

The datasets generated for this study are available on request from the corresponding author.

Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was provided by the participants’ legal guardian/next of kin.

JV contributed to the design of the experiment, construction of the materials and instruments, data-collection, data preparation, data-analyses, and writing the manuscript. AB contributed to the design of the experiment, data-analyses, and writing the manuscript. FS contributed to the design of the experiment, construction of the materials and instruments, data-collection, and data preparation. TE advised on the data-analyses and contributed to the writing of the manuscript. All authors contributed to the article and approved the submitted version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: