
Edited by: Sung-il Kim, Korea University, South Korea

Reviewed by: Frans Prins, Utrecht University, Netherlands; Lu Wang, Ball State University, United States; Roger C. Ho, National University of Singapore, Singapore

This article was submitted to Educational Psychology, a section of the journal Frontiers in Psychology

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

A meta-analysis (435 studies,

Feedback is information provided by an agent regarding aspects of one’s performance or understanding (

Given the impact of the

As noted, the major research method in the

Firstly, the use of a fixed-effect model may not be appropriate. A meaningful interpretation of the mean of integrated effects under this model is only possible if these effects are homogeneous. When effects are heterogeneous, a random-effects model is required, which additionally estimates the between-study variance τ². This also results in larger confidence intervals.

Secondly, a source of distortion when using a synthesis approach results from overlapping samples of studies. By integrating a number of meta-analyses dealing with effects of feedback interventions without checking every single primary study, there is a high probability that the samples of primary studies integrated in these meta-analyses are not independent of each other; at least some primary studies were integrated in more than one meta-analysis. These would therefore have to be considered duplets (primary studies that are included in the result of the synthesis more than once) and consequently cause a distortion. In contrast to meta-synthesis, a meta-analytical approach allows duplets to be removed and therefore prevents a distortion of results.

The question arises whether synthesizing research on feedback at different levels, from different perspectives, and in different directions, and compressing this research into a single effect size value, leads to interpretable results. In contrast to a synthesis approach, a meta-analysis of primary studies allows study effects to be weighted, systematic variation of effect sizes to be considered, duplets to be removed, and moderator variables to be searched for based on study characteristics. Therefore, a meta-analysis is likely to produce more precise results.

One of the most consistent findings about the power of feedback is the remarkable variability of effects. The existing research has identified several relevant moderators like timing and specificity of the goals and task complexity (

The purpose of the present study was to integrate the primary studies that provide information on feedback effects on student learning (achievement, motivation, behavior), with a meta-analytic approach that takes into account the methodological problems described in the previous part and to compare the results to the results of the

In particular, our study addressed the following research questions:

RQ1: What is the overall effect of feedback on student learning, based on an integration of the primary studies from all meta-analyses used in the

RQ2: To what extent is the effect of feedback moderated by specific feedback characteristics?

This meta-analysis is a quantitative integration of empirical research comparing the effects of feedback on student learning outcomes. The typical strategy is (1) to compute a summary effect for all primary studies, (2) to calculate the heterogeneity of the summary effect, and (3) in the case of heterogeneity between studies, to identify study characteristics (i.e., moderators) that may account for part or all of that heterogeneity. In detail, and as suggested by

specified the study and reported characteristics making the criteria for eligibility transparent,

described all information sources and the process for selecting studies,

described methods of data extraction from the studies,

described methods used for assessing risk of bias of individual studies,

stated the principal summary measures,

described the methods of handling data and combining results of studies, and

described methods of additional analyses (sensitivity and moderator analyses).

The following procedure was employed in this review (see

Flow diagram of study identification and selection.

We used the random-effects model to integrate the effect sizes that met our inclusion criteria, calculating an average effect size for all studies and, in a subsequent step, for the subgroups defined by our coding scheme. We checked for heterogeneity across the studies and conducted outlier and moderator analyses to assist in reducing the heterogeneity of effect sizes.

To identify the primary studies for our meta-analysis, we searched 32 existing meta-analyses that were used in the context of the

contain an empirical comparison of a form of feedback intervention between an experimental and a control group, or a pre-post comparison;

report the constitutive elements needed to calculate an effect size (e.g., means, standard deviations, and sample sizes);

report at least one dependent variable related to student learning (achievement, motivation, or behavioral change) and

have an identifiable educational context (data obtained from samples of students or teachers in a kindergarten, primary school, secondary school, college, or university).

The inclusion criteria are comparable to the criteria that were used to include meta-analyses in the

We included studies with controlled designs as well as pre-post-test designs, and this became a moderator to investigate any differences related to design (

Existing meta-analyses investigating the factor feedback.

Included | From meta-analysis | ||

Included | From meta-analysis | ||

Included | From meta-analysis | ||

Excluded | Data not available | ||

Excluded | No effect sizes indicated; no individual references provided | ||

Partially included | From meta-analysis | ||

Included | From meta-analysis | ||

Partially included | Original data received from authors; studies that do not deal with an educational context excluded | ||

Partially included | 33 sample size values missing; reconstruction from the primary studies not possible | ||

Included | From meta-analysis | ||

Included | From meta-analysis | ||

Excluded | No effect sizes indicated, effect sizes and sample sizes not reconstructable from original studies | ||

Partially included | 44 of 54 primary studies excluded because they do not deal with feedback on the relevant outcomes; effect sizes reconstructed from primary studies | ||

Partially included | Effect sizes reconstructed from original studies | ||

Partially included | Missing values reconstructed from primary studies | ||

Included | Updated set of studies was used ( | ||

Partially included | Effect sizes/sample sizes reconstructed from primary studies | ||

Partially included | 38 of 45 studies excluded because they do not deal with the relevant outcomes | ||

Included | From meta-analysis | ||

Excluded | Data not available | ||

Excluded | Data no longer available (even directly from authors) | ||

Partially included | 82 of 98 studies excluded because they do not deal with a school context | ||

Partially included | Effect sizes/sample sizes reconstructed from primary studies | ||

Included | Statistical data from meta-analysis, but no references of integrated studies provided | ||

Partially included | From meta-analysis | ||

Included | From meta-analysis | ||

Included | From meta-analysis | ||

Excluded | No effect sizes and sample sizes indicated; reconstruction of data no longer possible | ||

Partially included | 10 of 20 studies excluded because they do not deal with an educational context | ||

Excluded | Data not available | ||

Excluded | No data on feedback effects | ||

Partially included | 45 of 49 studies excluded because data was not reconstructable |

To be able to identify characteristics that influence the impact of feedback, a coding scheme was developed. It includes the following categories of study features: publication type (i.e., journal article, dissertation), outcome measure (i.e., cognitive, motivational, physical, behavioral), type of feedback (i.e., reinforcement/punishment, corrective, high-information), feedback channel (i.e., written, oral, video-, audio- or computer-assisted), and direction (i.e., teacher > learner, learner > teacher). Some of the study features of interest had to be dropped (i.e., perspective of feedback, way of measuring the outcome) because there were insufficient data, or the feature could not be defined based on the article abstracts. Generally, the study features for our coding scheme are oriented toward Hattie and Timperley’s (2007) coding features.

We analyzed inter-coder consistency to ensure reliability among coders by randomly selecting 10% of the studies and having them coded independently by two coders. Based on this, we assessed intercoder reliability for each coding variable. For the six moderator variables, Krippendorff’s alpha ranged from 0.81 to 0.99, which is above the acceptable level (

For the computation of effect sizes, tests for heterogeneity, and the analysis of moderator variables, we used the meta and metafor packages for R (

Effect sizes d were computed as the difference between the experimental and control group means divided by the pooled standard deviation of the two groups,

d = (M_E − M_C) / SD_pooled, with SD_pooled = √[((n_E − 1)·s_E² + (n_C − 1)·s_C²) / (n_E + n_C − 2)].

The average weighted effect size was estimated under the random-effects model, which incorporates τ², the variance of the effect size distribution.
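As a minimal sketch of this computation (in Python rather than the R packages used in the study; function names are illustrative, not from any library), the standardized mean difference and its approximate large-sample variance can be computed as follows:

```python
import math

def pooled_sd(s_e, n_e, s_c, n_c):
    """Pooled standard deviation of the experimental and control groups."""
    return math.sqrt(((n_e - 1) * s_e**2 + (n_c - 1) * s_c**2)
                     / (n_e + n_c - 2))

def cohens_d(m_e, s_e, n_e, m_c, s_c, n_c):
    """Standardized mean difference between experimental and control group."""
    return (m_e - m_c) / pooled_sd(s_e, n_e, s_c, n_c)

def d_variance(d, n_e, n_c):
    """Approximate sampling variance of d (standard large-sample formula)."""
    return (n_e + n_c) / (n_e * n_c) + d**2 / (2 * (n_e + n_c))
```

For example, two groups of 50 with means 10 and 9 and a common standard deviation of 2 yield d = 0.5.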

The model of random effects (

The random-effects model takes two variance components into account: the sum of the individual standard errors of the study effects resulting from the sample basis of the individual studies, and the variation due to the random selection of the effect sizes for the meta-analysis. A meaningful interpretation of average effect sizes from several primary studies does not necessarily require homogeneity (i.e., that the variation of the study effects is solely random; Rustenbach, 2003). The basic assumption here is that differences in effect sizes within the sample are due to sampling error as well as systematic variation.
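The two variance components can be illustrated with a hedged sketch of random-effects pooling using the DerSimonian-Laird estimator of the between-study variance τ² (one common estimator; the R packages used here support several):

```python
def dersimonian_laird(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimator.
    `effects` are study effect sizes, `variances` their sampling variances."""
    k = len(effects)
    w = [1 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    d_fe = sum(wi * di for wi, di in zip(w, effects)) / sw
    q = sum(wi * (di - d_fe) ** 2 for wi, di in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    d_re = sum(wi * di for wi, di in zip(w_re, effects)) / sum(w_re)
    se = (1 / sum(w_re)) ** 0.5
    return d_re, tau2, se
```

With perfectly homogeneous effects τ² is estimated as zero and the result coincides with the fixed-effect mean; heterogeneous effects produce τ² > 0 and hence the larger confidence intervals noted above.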

The integration of multiple effect sizes not only requires independence of the primary studies included in the meta-analysis, but also independence of the observed effects reported in the primary studies. The second assumption is violated when sampling errors and/or true effects are correlated. This can be the case when studies report more than one effect and these effects stem from comparisons with a common control group (multiple treatment studies,
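The full robust variance estimation used in this study is beyond a short sketch. As a minimal illustration of the dependence problem, the hypothetical helper below collapses multiple effects from the same study into a single entry before pooling, a simpler and deliberately conservative alternative (averaging the variances corresponds to assuming the dependent effects are perfectly correlated):

```python
from collections import defaultdict

def collapse_dependent_effects(rows):
    """Average multiple effect sizes reported by the same study so that each
    study contributes a single (effect, variance) pair to the pooling step.
    `rows` are (study_id, effect, variance) triples. Taking the simple mean
    of the variances is conservative: it equals the variance of the averaged
    effect under perfect positive correlation."""
    groups = defaultdict(list)
    for study_id, d, v in rows:
        groups[study_id].append((d, v))
    collapsed = []
    for study_id, pairs in groups.items():
        ds = [d for d, _ in pairs]
        vs = [v for _, v in pairs]
        collapsed.append((study_id, sum(ds) / len(ds), sum(vs) / len(vs)))
    return collapsed
```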

Possible selection bias was tested by means of a funnel plot, a scatter diagram that plots the treatment effect on the

A

The test statistic Q is χ²-distributed with k − 1 degrees of freedom, where k is the number of studies. Q can be used to check whether the effect sizes of a group are homogeneous or whether at least one of the effect sizes differs significantly from the others. In order to quantify the degree of heterogeneity, I² was computed. I² is a measure of the degree of heterogeneity among a set of studies on a 0%–100% scale and can be interpreted as moderate for values between 30 and 60%, substantial for values over 50%, and considerable for values over 75% (
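As a sketch (again in Python, with illustrative function names), Cochran's Q and I² can be computed from the effects and their sampling variances; the truncation of I² at zero is standard:

```python
def q_statistic(effects, variances):
    """Cochran's Q: weighted squared deviations from the fixed-effect mean."""
    w = [1 / v for v in variances]
    d_fe = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    return sum(wi * (di - d_fe) ** 2 for wi, di in zip(w, effects))

def i_squared(q, k):
    """I^2 = (Q - df) / Q on a 0-100% scale, truncated at 0, with df = k - 1."""
    df = k - 1
    return max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
```

For two precisely estimated but conflicting effects (0.2 and 0.8, each with variance 0.01), Q = 18 on 1 degree of freedom and I² ≈ 94%, i.e., considerable heterogeneity.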

By definition, no outliers exist in the random-effects model because the individual study effects are not based on a constant population mean. Extreme values are attributed to natural variation. An outlier analysis, however, can serve to identify unusual primary studies. We used the method of adjusted standardized residuals to determine whether effect sizes have inflated variance. An adjusted residual is the deviation of an individual study effect from the adjusted mean effect, i.e., the mean effect of all other study effects. Adjusted standardized residuals follow the normal distribution and are therefore significantly different from 0 when they are >1.96. They are conventionally classified as extreme values when > 2 (
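The leave-one-out logic of adjusted standardized residuals can be sketched as follows (a simplified fixed-weight version, not the exact metafor implementation): each study effect is compared with the weighted mean of all other study effects, and residuals above 2 flag potential extreme values.

```python
def standardized_deleted_residuals(effects, variances):
    """For each study, compare its effect with the inverse-variance-weighted
    mean of all *other* studies. Residuals > 2 are conventionally treated
    as extreme values."""
    residuals = []
    for i, (d_i, v_i) in enumerate(zip(effects, variances)):
        w = [1 / v for j, v in enumerate(variances) if j != i]
        d_rest = [d for j, d in enumerate(effects) if j != i]
        mean_rest = sum(wi * di for wi, di in zip(w, d_rest)) / sum(w)
        var_mean_rest = 1 / sum(w)
        z = (d_i - mean_rest) / (v_i + var_mean_rest) ** 0.5
        residuals.append(z)
    return residuals
```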

For heterogeneous data sets, suitable moderator variables must be used for a more meaningful interpretation. In extreme cases, this can lead to a division into k factor levels if none of the primary studies can be integrated into a homogeneous group. Q_{B} reflects the amount of heterogeneity that can be attributed to the moderator variable, whereas Q_{W} provides information on the amount of heterogeneity that remains within the moderator category. The actual suitability of a moderator variable within a fixed-effect model is demonstrated by the fact that homogeneous effect sizes are present within the primary study group defined by it (Q_{Wempirical} < Q_{Wcritical}) and at the same time the average effect sizes of the individual groups differ significantly from each other (Q_{Bempirical} > Q_{Bcritical}). If both conditions are fulfilled, homogeneous factor levels are present, which are defined by moderator variables, leading to a meaningful separation of the primary studies. However, by this definition, homogeneity of effect sizes within hypothesized moderator groups will occur rarely in real data, which means that fixed-effect models are rarely appropriate for real data in meta-analyses and random-effects models should be preferred (
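The decomposition Q_total = Q_B + Q_W under a fixed-effect model can be illustrated with a short sketch (hypothetical helper, assuming inverse-variance weights):

```python
def heterogeneity_partition(groups):
    """Partition heterogeneity into between-group (Q_B) and within-group (Q_W)
    components under a fixed-effect model. `groups` maps a moderator level to
    a list of (effect, variance) pairs."""
    def q_and_mean(pairs):
        w = [1 / v for _, v in pairs]
        mean = sum(wi * d for wi, (d, _) in zip(w, pairs)) / sum(w)
        q = sum(wi * (d - mean) ** 2 for wi, (d, _) in zip(w, pairs))
        return q, mean, sum(w)

    q_w = 0.0
    sub = []  # (subgroup mean, subgroup weight)
    for pairs in groups.values():
        q, mean, sw = q_and_mean(pairs)
        q_w += q
        sub.append((mean, sw))
    grand = sum(m * s for m, s in sub) / sum(s for _, s in sub)
    q_b = sum(s * (m - grand) ** 2 for m, s in sub)
    return q_b, q_w
```

A moderator that perfectly separates two internally homogeneous groups yields Q_W = 0 with all heterogeneity attributed to Q_B, the idealized case described above.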

Studies with control groups were separated from studies with a pre-post-test design. Effect sizes from pre-post designs are generally less reliable and less informative about the effects of the intervention because they are likely to be influenced by confounding variables (

The type of publication (journal article or dissertation) was used as a moderator. Published studies may be prone to having larger effect sizes than unpublished studies because they are less likely to be rejected when they present significant results (

The

A further distinction was made between different types of feedback, namely reinforcement/punishment, corrective feedback, and high-information feedback. Forms of reinforcement and punishment apply pleasant or aversive consequences to increase or decrease the frequency of a desired response or behavior. These forms of feedback provide a minimal amount of information on the task level and no information on the levels of process or self-regulation. Corrective forms of feedback typically contain information on the task level in the form of “right or wrong” and the provision of the correct answer to the task. Feedback not only refers to how successfully a skill was performed (knowledge of results), but also to how a skill is performed (knowledge of performance). For some forms of feedback, e.g., modeling, additional information is provided on how the skill could be performed more successfully. Feedback was classified as high-information feedback when it contained the information described for corrective feedback and, in addition, information on self-regulation derived from monitoring attention, emotions, or motivation during the learning process.

Some studies investigated the effects of feedback according to the channel by which it is provided. Hence, a distinction was made between three forms: oral, written, and video-, audio-, or computer-assisted feedback.

This moderator refers to who gives and who receives feedback. We differentiated between feedback that is given by teachers to students, feedback that is given by students to teachers, and feedback that is given by students to students.

Our search strategy yielded 732 primary studies (see

Number of study effects per year.

The integration of all study effects with the random-effects model leads to a weighted average effect size of (I² = 86.47%).

In the funnel plot (

Funnel plot of all study effects.

Normal-quantile-plot of all study effects.

Thirty-five (3.5%) of all effect sizes were identified as extreme values (standardized residuals > 2) and excluded. The exclusion of these extreme values reduces the average weighted effect size to 0.48 (CI ; I² = 83.40%).

The most extreme values were found in the meta-analysis by

The average effect sizes of the subsets of primary studies as used in the existing meta-analyses are shown in

Random-effects model calculation for the subsets of previous meta-analyses.

Tests of heterogeneity between and within the moderator subgroups.

Moderator | Q_B (df) | p | Q_W (df) | p | I² |

Research design | 29.06 (1) | <0.0001 | 5639.74 (955) | <0.0001 | 83.4% |

Publication type | 6.15 (1) | <0.05 | 5699.39 (957) | <0.0001 | 83.4% |

Outcome measure | 14.12 (3) | <0.001 | 4380.69 (750) | <0.0001 | 83.0% |

Feedback type | 41.52 (2) | <0.0001 | 1541.06 (316) | <0.0001 | 80.9% |

Feedback channel | 5.12 (2) | >0.05 | 2218.20 (337) | <0.0001 | 85.2% |

Feedback direction | 9.35 (2) | <0.001 | 4695.60 (852) | <0.0001 | 81.9% |

Q_B, heterogeneity between groups; Q_W, heterogeneity within groups; I², total amount of heterogeneity.

Effect sizes and heterogeneity for different moderator subgroups.

Subgroup | k | d | 95% CI | Q | I² |

Controlled study | 713 | 0.42 | [0.37 – 0.46] | 3321.86 | 78.6% |

Pre-post study | 244 | 0.63 | [0.56 – 0.69] | 2317.88 | 89.5% |

Journal article | 843 | 0.49 | [0.45 – 0.53] | 5176.67 | 83.7% |

Dissertation | 116 | 0.36 | [0.25 – 0.46] | 522.72 | 78.0% |

Cognitive | 597 | 0.51 | [0.46 – 0.55] | 3689.88 | 83.8% |

Motivational | 109 | 0.33 | [0.23 – 0.42] | 600.96 | 82.0% |

Physical | 19 | 0.63 | [0.34 – 0.92] | 36.65 | 50.9% |

Behavioral | 30 | 0.48 | [−0.09 – 1.06] | 0.28 | 50.0% |

Reinforcement or punishment | 39 | 0.24 | [0.06 – 0.43] | 123.54 | 69.2% |

Corrective feedback | 238 | 0.46 | [0.39 – 0.55] | 1260.41 | 81.2% |

High-information feedback | 42 | 0.99 | [0.82 – 1.15] | 157.12 | 73.9% |

Teacher > student | 812 | 0.47 | [0.43 – 0.51] | 4510.40 | 82.0% |

Student > teacher | 27 | 0.35 | [0.13 – 0.56] | 52.92 | 50.9% |

Student > student | 16 | 0.85 | [0.59 – 1.11] | 132.28 | 88.7% |

I², total amount of heterogeneity within subgroup.

The aim of the present study was to investigate the effectiveness of feedback in the educational context with a meta-analytic approach. With

The average weighted effect size differs considerably from the results of meta-synthesis (

We assume that the different results mainly stem from the fact that a number of studies used in the synthesis were excluded from our meta-analysis, either due to a lack of detailed information on the statistical data or due to content-related considerations (studies that did not explicitly deal with an educational context or did not report information on learning outcomes). The average effect size from this meta-analysis is based on a smaller sample of studies than the synthesis, but at the same time, the selection of studies produces more accurate results, because each single study could be checked as to whether it actually fulfilled the inclusion criteria.

Caution is needed, however, in focusing too much on the average effect size, as the integrated studies show high heterogeneity, suggesting that, in line with expectations, not all feedback is the same and one (average) effect size does not fit all. The funnel and normal-quantile plots illustrate that the observed data do not capture the construct of feedback in an unbiased way and that the distribution of effect sizes does not conform to the symmetric inverted funnel theoretically expected for data in which bias and systematic heterogeneity are unlikely. These issues and the results of the tests for homogeneity speak largely to the variability in effects and the need to search for meaningful moderators.

Heterogeneity likely results from different forms of feedback, ranging from the simplest forms of operant conditioning to elaborate forms of error modeling, from feedback to kindergarten children to feedback to university professors, from feedback that people get while learning a handstand to feedback that people get while learning a foreign language.

This study investigated six moderators (RQ2): research design, publication type, outcome measure, type of feedback, feedback channel, and feedback direction. Generally, and in line with expectations (

Feedback is more effective for cognitive and physical outcome measures than for motivational and behavioral criteria. These claims must be interpreted with some caution, because few studies are available for physical and behavioral outcomes, which substantially reduces the precision of the average effect size. From a cognitive perspective, feedback is often considered a source of information that is necessary to improve on a task. Previous meta-analyses have produced inconsistent results regarding the effects of feedback on cognitive variables (

Feedback is more effective the more information it contains. Simple forms of reinforcement and punishment have low effects, while high-information feedback is most effective.

Findings by

Only a very small percentage of the primary studies investigated feedback from students to teachers, and of these, 26 effect sizes could be used to compute an average effect size. These effects were located mainly in studies dealing with higher education, i.e., with feedback from university or college students to their professors. Consequently, the data do not allow conclusions on the effectiveness of student > teacher feedback in the K-12 context. In general, feedback from teachers to students is more effective than from students to teachers, but the average effect of student > teacher feedback has a high variance and there is a rich literature related to this variance (

We tried to shed more light on the role and variability of feedback in the educational context with the help of meta-analysis in comparison to meta-synthesis. Both approaches are often confronted with the accusation of comparing apples and oranges. Still, it is legitimate to aggregate heterogeneous data in order to make general statements, but it has to be kept in mind that these statements are often only the first step toward later understanding the critical moderators. The

In this study, we used a random-effects/mixed-effects model to deal with heterogeneity of effect sizes and accounted for non-independence of study effects by RVE.

Notwithstanding, there has been a long search for the optimal measures of central tendency – and we have added another approach to better understand the power of feedback.

Feedback must be recognized as a complex and differentiated construct that includes many different forms with, at times, quite different effects on student learning. Most importantly, feedback is more effective the more information it contains, and research on estimating this information would be a valuable addition to the area. Developing models, such as the

Estimates of the effects of feedback range between 0.48 (this meta-analysis), 0.70 (

BW developed the idea for this manuscript and the methodological framework and performed the analytic calculations. KZ and JH verified the analytical methods and supervised the findings of this work. All authors discussed the results and contributed to the final manuscript.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: