Edited by: John L. Sievenpiper, St. Michael’s Hospital, Canada
Reviewed by: Claudio Esteban Perez, Universidad Andres Bello, Chile; Paulo Lopez-Meyer, Intel Labs, Mexico
This article was submitted to Nutrition Methodology, a section of the journal Frontiers in Nutrition.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Meta-research, or “research on research,” investigates research itself, including describing how research is conducted and reported, and aggregating and rating studies, as in meta-analyses. However, manually retrieving and analyzing the vast archives of written material to evaluate meta-research questions can be very time-consuming and costly. The resource intensity of this process leads many researchers to narrow the scope of inquiry in some way, even for questions that might, in principle, be explored more broadly. For example, Kaiser et al. (
One potential approach employs double sampling, where two samples are taken: a large sample and a subsample (from the large sample). The appropriate sample size of the subsample “depends on the relative costs of observing the two variables and on the strength of the ratio relationship between them [Ref. (
The two methods differ with respect to accuracy and precision depending on whether they are used on the subsample or the large sample (Figure
Because the R_{HI}T_{LO} values in the large sample are missing completely at random (MCAR, because the subsample is a random sample of the large sample), we can estimate the large sample R_{HI}T_{LO} values via multiple imputation (
In this paper, we focus on a method employing double sampling with multiple imputation (DS + MI). The aim of the paper is to describe the method and illustrate its application through evaluating whether titles and abstracts adhere to two simple CONSORT guideline criteria. As shown schematically in Figure
To investigate the utility of this technique, we explored two of the criteria outlined in the CONSORT guidelines. The CONSORT guidelines were created in 1996 in an effort to improve the quality of reporting for RCTs (
Our large sample was the entire PubMed database available as of July 28, 2014, subject to the following filters: RCTs, humans, English language, and abstract available (
The subsample entries were rated on the following:
Did the title denote that the study was an RCT?
CONSORT Item 1a. “Identification as a randomized trial in the title … Authors should use the word ‘randomised’ in the title to indicate that the participants were randomly assigned to their comparison groups.” (
Was the abstract structured?
CONSORT Item 1b. “Structured summary of trial design, methods, results, and conclusions … We strongly recommend the use of structured abstracts for reporting randomised trials. They provide readers with information about the trial under a series of headings pertaining to the design, conduct, analysis, and interpretation.” [Ref. (
Was the study actually an RCT?
For the purpose of this study, we were interested in randomized controlled trials in humans that were written in English and had an available abstract. Classification of RCT status was based on the R_{HI}T_{LO} raters (Patrice L. Capers and Andrew W. Brown).
For articles where Patrice L. Capers and Andrew W. Brown were uncertain whether a title and abstract were from an RCT, we assumed that PubMed was correct in identifying the entry as an RCT. Any disagreements were resolved by consensus. If no consensus could be reached from the title and abstract alone, we retrieved the full article for review.
For title compliance, the R_{LO}T_{HI} method looked for the word “random,” including variants with any prefixes or suffixes, anywhere in the title. For abstract compliance, the R_{LO}T_{HI} method required that an abstract contain words representative of at least three of four headings: Introduction, Methods, Results, and Conclusion. Because a variety of descriptors are used for these subheadings (e.g., Problem or Background may be used in place of Introduction), we grouped the descriptors into the appropriate heading categories (Table S1 in Supplementary Material). The R_{LO}T_{HI} method was subsequently applied to the entire large sample (including entries from the subsample).
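The two rules described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the heading word lists are hypothetical stand-ins for Table S1 (only “Problem” and “Background” as alternatives to “Introduction” come from the text), and treating a word followed by a colon as a heading label is an assumption about how structured abstracts are formatted.

```python
import re

# Title rule from the text: the string "random" with any prefix or suffix.
RANDOM_PATTERN = re.compile(r"random", re.IGNORECASE)

# HYPOTHETICAL heading groups. The real word lists are in Table S1 of the
# Supplementary Material; only "problem" and "background" are from the text.
HEADING_GROUPS = {
    "introduction": {"introduction", "background", "problem"},
    "methods": {"methods"},
    "results": {"results"},
    "conclusion": {"conclusion", "conclusions"},
}


def title_compliant(title: str) -> bool:
    """Return True if the title contains 'random' in any variant."""
    return RANDOM_PATTERN.search(title) is not None


def abstract_compliant(abstract: str, minimum_groups: int = 3) -> bool:
    """Return True if at least `minimum_groups` heading groups are present.

    ASSUMPTION: a heading label is a single word followed by a colon.
    """
    labels = {word.lower() for word in re.findall(r"\b([A-Za-z]+)\s*:", abstract)}
    matched = sum(1 for words in HEADING_GROUPS.values() if labels & words)
    return matched >= minimum_groups
```

A rule of this kind is cheap to run on every entry, which is what makes it a higher-throughput method. It is also literal: a title such as “Non-randomized comparison” would be flagged as compliant, which is one reason a higher-rigor rating of a subsample is still needed.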
We also identified the country from which abstracts were published using the PubMed place of publication (PL) tag. We used this tag to categorize papers in the following manner, based on data from the CIA World Factbook [Ref. (
From the subsample we gathered information on the variables of interest that had not yet been determined in the large sample making the data MCAR (Figure
Higher rigor, lower throughput results were compared against the R_{LO}T_{HI} results from the subsample using phi coefficients, calculated using the phi command of the R “psych” package (version 1.4.8.11). The phi coefficient is equivalent to the Pearson product moment correlation coefficient of two dichotomous variables, and therefore provides a single number expressing the similarity between the two methods. Descriptive analyses of the R_{HI}T_{LO} and R_{LO}T_{HI} data are tabulated as counts of publications unless otherwise specified.
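As a worked illustration of the phi coefficient, the following Python sketch computes it directly from the four cells of a 2 × 2 table. It is not the `psych` implementation; the function name is an assumption, and the inputs are the title-compliance counts reported in the results for the full subsample (380, 7, 0, 113) and for RCTs only (289, 0, 0, 110).

```python
from math import sqrt


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for the 2x2 table [[a, b], [c, d]] (rows: one method, columns: the other)."""
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denominator


title_full = phi_coefficient(380, 7, 0, 113)   # full subsample
title_rcts = phi_coefficient(289, 0, 0, 110)   # RCTs in subsample
print(round(title_full, 2), round(title_rcts, 2))  # 0.96 1.0
```

Both values agree with the phi coefficients reported for title compliance.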
Logistic regression was used to model title compliance, structured abstract compliance, and both together as a function of country, year, and the interaction of country and year. Because only 79.8% of abstracts were rated as RCTs in the subsample, we imputed a study’s RCT status, and analyses were limited to abstracts that were imputed as RCTs. Logistic regressions for imputed data were calculated using the glm.mids extension (mice package, version 2.22), while the regressions for the subsample were calculated using glm from the stats package distributed with base R (R version 3.0.1). We hypothesized that US and ESC would have similar levels of compliance, while NESC would have lower compliance but would be rapidly improving (i.e., a larger NESC-by-time interaction term). Comparisons among the R_{LO}T_{HI} and R_{HI}T_{LO} results from the subsample and the R_{LO}T_{HI} and the DS + MI (as estimates for R_{HI}T_{LO}) results from the large sample were used to evaluate precision and accuracy. Logistic regression coefficients and confidence intervals are reported in exponentiated form (i.e., odds ratios and their 95% confidence intervals).
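To make the DS + MI idea concrete, the sketch below imputes a missing higher-rigor rating from a lower-rigor proxy and pools the results with Rubin's rules. It is a deliberately simplified stand-in, not the `mice` algorithm used in the paper: the data are synthetic, the imputation model is a Beta posterior for P(higher-rigor rating = 1 | proxy) with a uniform prior, and the pooled quantity is a simple proportion rather than logistic regression coefficients. All names and parameter values are assumptions chosen for illustration.

```python
import random
from statistics import mean, variance


def impute_and_pool(proxy, gold, n_imputations=20, seed=0):
    """Multiply impute missing gold ratings and pool a proportion.

    proxy: 0/1 rating for every entry (lower rigor, higher throughput).
    gold:  0/1 rating, or None where the higher-rigor rating is missing.
    Returns (pooled estimate, total variance) using Rubin's rules.
    """
    rng = random.Random(seed)
    # Count gold outcomes within each proxy level, using rated entries only.
    counts = {0: [0, 0], 1: [0, 0]}  # proxy level -> [gold == 0, gold == 1]
    for p, g in zip(proxy, gold):
        if g is not None:
            counts[p][g] += 1

    n = len(proxy)
    estimates, within = [], []
    for _ in range(n_imputations):
        # Draw P(gold = 1 | proxy) from its Beta posterior so that each
        # imputation reflects uncertainty from the finite subsample.
        prob = {k: rng.betavariate(1 + c[1], 1 + c[0]) for k, c in counts.items()}
        completed = [g if g is not None else int(rng.random() < prob[p])
                     for p, g in zip(proxy, gold)]
        p_hat = sum(completed) / n
        estimates.append(p_hat)
        within.append(p_hat * (1 - p_hat) / n)

    pooled = mean(estimates)
    within_var = mean(within)
    between_var = variance(estimates)
    total_var = within_var + (1 + 1 / n_imputations) * between_var
    return pooled, total_var


# SYNTHETIC data: 5000 entries, a proxy that agrees with the truth 95% of
# the time, and higher-rigor ratings for a random subsample of 500.
data_rng = random.Random(42)
truth = [int(data_rng.random() < 0.3) for _ in range(5000)]
proxy = [t if data_rng.random() < 0.95 else 1 - t for t in truth]
rated = set(data_rng.sample(range(5000), 500))
gold = [t if i in rated else None for i, t in enumerate(truth)]

estimate, total_variance = impute_and_pool(proxy, gold)
```

Because the rated entries are a random subsample, the unrated entries are missing completely at random, which is the condition the imputation relies on. The between-imputation variance carries the uncertainty that comes from rating only the subsample.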
Despite the filters (humans, RCTs, and English language with abstract), our PubMed search retrieved some entries that lacked an abstract or a country of publication. However, their number was small relative to the number of entries in the data set. In our subsample, 20% of the entries were not RCTs according to our guidelines. Our estimate for the percent of actual RCTs in the large sample based on our subsample was 79.8% [95% CI: (76.28, 83.32)]. While there was unequal representation of countries in the entries available, the distribution was similar between the large sample and the subsample (US 51 vs. 51%, ESC 29 vs. 30%, NESC 20 vs. 18%, respectively). If not otherwise stated, discussion of results is about DS + MI results.
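The reported interval for the proportion of actual RCTs is consistent with a normal-approximation (Wald) interval computed from 399 RCTs among the 500 subsample entries. The source does not state which interval was used, so the choice of formula in this sketch is an inference, and the function name is an assumption.

```python
from math import sqrt


def wald_interval(successes: int, n: int, z: float = 1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = wald_interval(399, 500)
print(f"{100 * low:.2f}, {100 * high:.2f}")  # 76.28, 83.32
```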
Out of the 500 entries that PubMed tagged as RCTs in humans, only 399 were found to be actual RCTs according to the R_{HI}T_{LO} method. Excluding non-RCTs, 28% of the titles were compliant with CONSORT guidelines (Table
R_{HI}T_{LO} method | Full subsample: R_{LO}T_{HI} non-compliant | Full subsample: R_{LO}T_{HI} compliant | RCTs in subsample: R_{LO}T_{HI} non-compliant | RCTs in subsample: R_{LO}T_{HI} compliant
---|---|---|---|---
Non-compliant | 380 | 7 | 289 | 0
Compliant | 0 | 113 | 0 | 110
Phi coefficient | 0.96 | | 1.00 |
For every year beyond August 28, 1996, publications had 6.7% greater odds of being title compliant. ESC entries improved more rapidly over time than US entries, as evidenced by the significant ESC-by-year term. The per-year change in odds for ESC and NESC is obtained by multiplying the year odds ratio by the respective interaction odds ratio, resulting in 9.9% greater odds per year for ESC, 7.0% for NESC, and 6.7% for US. In the subsample, entries had 4.5% greater odds of being title compliant for every year since CONSORT. Because of the low precision, no other predictors were significant in the subsample. Since the subsample R_{LO}T_{HI} and R_{HI}T_{LO} data are identical, the regressions on the subsample are identical (Table
Predictor | R_{HI}T_{LO} imputed (DS + MI) | R_{LO}T_{HI} with imputed RCT | R_{HI}T_{LO} subsample | R_{LO}T_{HI} subsample
---|---|---|---|---
Intercept | 0.212 (0.057, 0.787) | 0.183 (0.029, 1.164) | 0.287 (0.191, 0.431) | 0.287 (0.191, 0.431)
ESC | 0.726 (0.294, 1.794) | 0.813 (0.383, 1.726) | 0.793 (0.348, 1.806) | 0.793 (0.348, 1.806)
NESC | 0.733 (0.381, 1.407) | 0.771 (0.500, 1.190) | 0.460 (0.161, 1.311) | 0.460 (0.161, 1.311)
Year | 1.067 (1.021, 1.115) | 1.065 (1.028, 1.105) | 1.045 (1.005, 1.087) | 1.045 (1.005, 1.087)
ESC-by-year | 1.030 (1.003, 1.058) | 1.033 (1.009, 1.058) | 1.066 (0.988, 1.149) | 1.066 (0.988, 1.149)
NESC-by-year | 1.002 (0.983, 1.022) | 1.003 (0.984, 1.023) | 1.002 (0.914, 1.097) | 1.002 (0.914, 1.097)
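The per-year changes in odds quoted for title compliance follow from multiplying the year odds ratio by the relevant interaction odds ratio. The short Python sketch below reproduces that arithmetic from the rounded DS + MI odds ratios in the table above; the function name is an assumption.

```python
def percent_change_per_year(year_or: float, interaction_or: float = 1.0) -> float:
    """Percent change in odds per year for a group.

    The reference group (US) has no interaction term, so its interaction
    odds ratio defaults to 1.
    """
    return (year_or * interaction_or - 1.0) * 100.0


us = percent_change_per_year(1.067)
esc = percent_change_per_year(1.067, 1.030)
nesc = percent_change_per_year(1.067, 1.002)
print(f"US {us:.1f}%, ESC {esc:.1f}%, NESC {nesc:.1f}%")  # US 6.7%, ESC 9.9%, NESC 6.9%
```

The NESC value computed from the rounded table entries (6.9%) differs in the last digit from the 7.0% reported in the text, presumably because the reported figure was calculated from unrounded coefficients.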
The consistency in ratings between the R_{HI}T_{LO} and R_{LO}T_{HI} methods was high (Table
R_{HI}T_{LO} method | Full subsample: R_{LO}T_{HI} non-compliant | Full subsample: R_{LO}T_{HI} compliant | RCTs in subsample: R_{LO}T_{HI} non-compliant | RCTs in subsample: R_{LO}T_{HI} compliant
---|---|---|---|---
Non-compliant | 226 | 18 | 172 | 14
Compliant | 2 | 254 | 0 | 211
Phi coefficient | 0.92 | | 0.92 |
DS + MI estimated that for each year beyond CONSORT the odds of the abstract being structured were 13.5% greater using the R_{HI}T_{LO} method. At the time of CONSORT publication, NESC abstracts were significantly less compliant than both US and ESC abstracts. The significant interaction terms for ESC-by-year and NESC-by-year indicate that ESC and NESC were increasing the odds of abstract compliance faster than US (17.1% per year for ESC, 16.3% for NESC vs. 13.5% for US, calculated by multiplying the year and respective interaction terms). In the subsample, the larger error from the small sample size again resulted in only year being significant, with abstracts having 17.5 and 17.6% greater odds of being structured using the R_{HI}T_{LO} and R_{LO}T_{HI} methods, respectively, every year beyond August 1996 (Table
Predictor | R_{HI}T_{LO} imputed (DS + MI) | R_{LO}T_{HI} with imputed RCT | R_{HI}T_{LO} subsample | R_{LO}T_{HI} subsample
---|---|---|---|---
Intercept | 0.618 (0.473, 0.806) | 0.733 (0.660, 0.815) | 0.480 (0.310, 0.745) | 0.518 (0.336, 0.798)
ESC | 0.803 (0.635, 1.015) | 0.718 (0.668, 0.771) | 1.086 (0.509, 2.317) | 1.091 (0.516, 2.305)
NESC | 0.411 (0.277, 0.610) | 0.545 (0.500, 0.595) | 0.559 (0.211, 1.480) | 0.993 (0.429, 2.297)
Year | 1.135 (1.115, 1.154) | 1.130 (1.121, 1.139) | 1.175 (1.119, 1.233) | 1.176 (1.121, 1.234)
ESC-by-year | 1.032 (1.025, 1.039) | 1.034 (1.029, 1.040) | 0.980 (0.905, 1.060) | 0.980 (0.906, 1.061)
NESC-by-year | 1.026 (1.016, 1.035) | 1.027 (1.020, 1.035) | 0.985 (0.898, 1.080) | 0.969 (0.890, 1.055)
When examining entries in the subsample where both the title and abstract were compliant, we found 17–21% to be in compliance. Ratings were similar between the R_{HI}T_{LO} and R_{LO}T_{HI} methods (phi coefficient = 0.96, 0.99; Table
R_{HI}T_{LO} method | Full subsample: R_{LO}T_{HI} non-compliant | Full subsample: R_{LO}T_{HI} compliant | RCTs in subsample: R_{LO}T_{HI} non-compliant | RCTs in subsample: R_{LO}T_{HI} compliant
---|---|---|---|---
Non-compliant | 408 | 6 | 313 | 2
Compliant | 0 | 86 | 0 | 84
Phi coefficient | 0.96 | | 0.99 |
When looking at abstracts using the DS + MI estimates where both the title and abstract were compliant with CONSORT guidelines, entries published every year after CONSORT had 11.6% greater odds of being compliant. However, we see that at the time of CONSORT publication the NESC abstracts were significantly less compliant than US abstracts, but not ESC. Both ESC and NESC entries increased odds of compliance more rapidly than US (14.4% for ESC, 14.0% for NESC, and 11.6% for US, calculated by multiplying the year and respective interaction terms). Again, the small sample size resulted in only year being a significant predictor in the subsample (Table
Predictor | R_{HI}T_{LO} imputed (DS + MI) | R_{LO}T_{HI} with imputed RCT | R_{HI}T_{LO} subsample | R_{LO}T_{HI} subsample
---|---|---|---|---
Intercept | 0.093 (0.028, 0.314) | 0.089 (0.017, 0.470) | 0.116 (0.062, 0.216) | 0.122 (0.067, 0.225)
ESC | 0.770 (0.349, 1.701) | 0.804 (0.417, 1.552) | 1.304 (0.462, 3.682) | 1.237 (0.442, 3.465)
NESC | 0.460 (0.257, 0.823) | 0.536 (0.367, 0.783) | 0.115 (0.007, 1.968) | 0.523 (0.116, 2.359)
Year | 1.116 (1.082, 1.151) | 1.111 (1.086, 1.136) | 1.110 (1.050, 1.174) | 1.108 (1.049, 1.170)
ESC-by-year | 1.026 (1.000, 1.052) | 1.030 (1.010, 1.051) | 1.014 (0.926, 1.112) | 1.017 (0.928, 1.113)
NESC-by-year | 1.021 (1.003, 1.040) | 1.018 (1.002, 1.035) | 1.098 (0.894, 1.350) | 0.980 (0.866, 1.109)
Our results support our hypothesis that DS + MI would result in improved precision and accuracy of a large sample estimate. Using a double sampling approach with multiple imputation improved the precision of the point estimates compared to the subsample alone, as evidenced by the tightened confidence intervals around the logistic regression coefficients. We interpret the difference in point estimates between the R_{LO}T_{HI} results and the DS + MI results, which are more similar to the subsample R_{HI}T_{LO} results, to be evidence of improved accuracy. Specifically, when comparing the R_{HI}T_{LO} to R_{LO}T_{HI} methods, similar patterns were observed between the subsample and large sample: R_{HI}T_{LO} estimates that were higher than the R_{LO}T_{HI} estimates in the subsample tended to be higher in the large sample, and vice versa. As expected, employing multiple imputation narrowed the confidence intervals of those estimates compared to the subsample alone, improving precision. The reduction in variance between the R_{HI}T_{LO} and R_{LO}T_{HI} models was not as dramatic as we had predicted because the R_{LO}T_{HI} method we employed was generally a highly correlated proxy for our R_{HI}T_{LO} data, as evidenced by the corresponding phi coefficients. One would expect that with poorer correlation between R_{LO}T_{HI} and R_{HI}T_{LO}, the information gained from DS + MI would improve precision considerably.
In our illustration, we were able to demonstrate that US had significantly higher reporting compliance before the implementation of the CONSORT guidelines relative to NESC but not ESC. However, over time, both ESC and NESC have been improving reporting compliance more rapidly than US abstracts, though it appears that compliance has increased over the years for all countries. Of the two criteria we investigated, structured abstract compliance was higher, which may be the result of journal requirements that submitted manuscripts contain structured abstracts (
In conclusion, using double sampling with multiple imputation allows for tractable large sample estimation for meta-research questions in situations where performing a comprehensive higher rigor, lower throughput evaluation on the entire corpus is impractical. This method is flexible and can be applied to many questions conditional on assumptions being met, namely that data are missing completely at random, that expert ratings are sufficiently valid to be of intrinsic interest, that ratings comprise an exhaustive set of options, and that sufficient data are collected to inform the imputation. In the presented example, the data were missing completely at random because we retrieved all available entries from PubMed based on the previously mentioned filters; expert ratings were conducted by trained, PhD-level scientists; the possible ratings for title and abstract compliance were exhaustive; and the sample size chosen for this study was appropriate to illustrate the feasibility of this method. One limitation of the present investigation is that R_{HI}T_{LO} results were not obtained for an entire dataset, and thus we were only able to compare the results of the subsample and rely on established statistical theory for our inferences. A key limitation is that the method will only be useful when there is some imperfect correlation between the poorer and better measurement methods that is modeled effectively in the imputation process. To the extent that the relation approaches 0 or 1, the two-stage strategy will be of limited value. Likewise, if the functional form of the relation cannot be modeled properly in the imputation process, the two-stage strategy may yield biased results. It is possible that use of other imputation algorithms may yield different results. It is our hope that more researchers will continue to evaluate, validate, and extend the use of this method when conducting meta-research of the scientific literature.
PLC, AWB, JAD, and DBA made substantial contributions to the conception and design of this work, the analysis, and interpretation of the data; drafted this work and revised it critically for important intellectual content; provided final approval of the version to be published; and agree to be accountable for all aspects of this work in ensuring that questions related to the accuracy or integrity of any part of this work are appropriately investigated and resolved.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at
This project was supported by NIH grants K12GM088010, P30DK056336, T32HL072757, and R25HL124208. The opinions expressed are those of the authors and do not necessarily reflect those of the NIH or any other organization.
CONSORT, consolidated standards of reporting trials; DP, date of publication; DS + MI, double sampling with multiple imputation; ESC, non-US but primarily English speaking countries; MCAR, missing completely at random; NESC, all other countries; PL, place of publication; RCT, randomized controlled trial; R_{HI}T_{LO}, higher rigor, lower throughput; R_{LO}T_{HI}, lower rigor, higher throughput.