This article has Open Peer Review reports available.
Are potentially clinically meaningful benefits misinterpreted in cardiovascular randomized trials? A systematic examination of statistical significance, clinical significance, and authors’ conclusions
© The Author(s). 2017
Received: 13 November 2016
Accepted: 16 February 2017
Published: 20 March 2017
While journals and reporting guidelines recommend the presentation of confidence intervals, many authors adhere strictly to statistical significance testing. Our objective was to determine what proportion of not statistically significant (NSS) cardiovascular trials include potentially clinically meaningful effects in primary outcomes and whether these are associated with authors’ conclusions.
Cardiovascular studies published in six high-impact journals between 1 January 2010 and 31 December 2014 were identified via PubMed. Two independent reviewers selected trials with major adverse cardiovascular events (stroke, myocardial infarction, or cardiovascular death) as primary outcomes and extracted data on trial characteristics, quality, and primary outcome. Potentially clinically meaningful effects were defined broadly as a relative risk point estimate ≤0.94 (based on the effects of ezetimibe) and/or a lower confidence interval ≤0.75 (based on the effects of statins).
We identified 127 randomized trial comparisons from 3200 articles. The primary outcomes were statistically significant (SS) favoring treatment in 21% (27/127), NSS in 72% (92/127), and SS favoring control in 6% (8/127). In 61% of NSS trials (56/92), the point estimate and/or lower confidence interval included potentially meaningful effects. Both point estimate and confidence interval included potentially meaningful effects in 67% of trials (12/18) in which authors concluded that treatment was superior, in 28% (16/58) with a neutral conclusion, and in 6% (1/16) in which authors concluded that control was superior. In a sensitivity analysis, 26% of NSS trials would include potentially meaningful effects with relative risk thresholds of point estimate ≤0.85 and/or a lower confidence interval ≤0.65.
Point estimates and/or confidence intervals included potentially clinically meaningful effects in up to 61% of NSS cardiovascular trials. Authors’ conclusions often reflect potentially meaningful results of NSS cardiovascular trials. Given the frequency of potentially clinically meaningful effects in NSS trials, authors should be encouraged to continue to look beyond significance testing to a broader interpretation of trial results.
The preferred reporting of clinical outcomes in randomized controlled trials (RCTs) is described in the Consolidated Standards of Reporting Trials (CONSORT) statement. Within CONSORT, the use of confidence intervals is emphasized in preference to p-values. Confidence intervals describe the precision of the estimate and “are especially valuable in relation to differences that do not meet conventional statistical significance, for which they often indicate that the result does not rule out an important clinical difference”. Editorials dating back almost 40 years have encouraged authors to use confidence intervals to describe the results of their studies rather than simply reporting the findings as statistically significant or not [2–4]. Despite this, the use of p-values in published articles remains approximately seven times more common than confidence intervals. Furthermore, confidence intervals are often used in a manner similar to p-values, to dichotomize outcomes as statistically significant (SS) or not. We have previously written about three important clinical controversies resulting from this dichotomous activity.
Interpretation of trial results when primary outcomes are not statistically significant (NSS) is challenging. In particular, it can be difficult to put the potential clinical relevance of the NSS effect and its confidence interval in the context of the entire study results. Boutron and colleagues demonstrated that authors often place a favorable “spin” (positive portrayal) on trial results when the primary outcome is NSS. Such spin occurred in 58% of abstract conclusions, 50% of main text conclusions, and 18% of titles. Others have similarly reported spin in RCTs evaluating wound care and surgical modalities [9, 10]. Although promotion of results may be common in NSS trial reporting, such evaluations assume that NSS results demonstrate no potentially clinically meaningful effect.
For these reasons we examined the primary outcomes and conclusions of RCTs in six major medical journals. We had two primary questions: (1) How often do the point estimates and confidence intervals of the primary outcome of NSS and SS trials include potentially clinically meaningful effects? and (2) Are the authors’ conclusions in the abstract of NSS trials influenced by potentially clinically meaningful point estimates and confidence intervals? We focused specifically on cardiovascular trials with major adverse cardiovascular events (MACE) because these are established, objective, patient-oriented outcomes that overlap between trials. Additionally, in large cardiovascular trials with hard clinical endpoints, statistical significance can be difficult to attain but the results have high clinical relevance. We hypothesized that authors of cardiovascular trials may discount potentially clinically meaningful effects identified in the confidence intervals and/or point estimates when the results are NSS.
We followed the basic approach described in PRISMA because there is no agreed-on methodology for this type of study.
Eligibility and information sources
We included all cardiovascular RCTs of superiority design that evaluated preventive or interventional therapies regardless of the nature of the interventions – including medication, surgery, models of care, and lifestyle change. All comparators were valid, including placebo, active control, and no intervention. The primary outcome had to include at least one MACE: myocardial infarction, stroke, or cardiovascular death. We used PubMed to identify relevant trials from five high-impact general medical journals and one high-impact specialty journal: New England Journal of Medicine (N Engl J Med), Lancet, Journal of the American Medical Association (JAMA), British Medical Journal (BMJ), Annals of Internal Medicine (Ann Intern Med), and Circulation.
Study search and selection
Between 17 March and 14 April 2015, we searched PubMed for papers using the full journal title (and abbreviation, if present) with PubMed limits for RCTs and date (1 January 2010 to 31 December 2014). In the case of Circulation, the term circulation could relate to medical/physiologic issues in addition to the journal, so we restricted the search field to “Journal”. For the other five journals we did not apply any search restrictions in order to minimize the unlikely chance of missing relevant articles. For each journal, two authors (from VK, SK, EB, and GMA) independently evaluated and selected studies for inclusion. We excluded studies of subgroups, re-analyses, and studies that were either extensions or follow-ups from previously published trials to avoid including the same data more than once. We also excluded non-inferiority designed studies because authors’ interpretations and conclusions of non-inferiority results are broader, and this would add complexity to our interpretation of abstract conclusions. Disagreements for inclusion were resolved by consensus.
Data extraction and management
Two authors (CF with VK or SK) independently extracted data from the trials. Disagreement was resolved with consensus or third-party review (GMA).
Data extraction on study characteristics included citation, type of intervention and control, primary versus secondary prevention population, mean age in study, and percentage of males studied. Data on traditional risk of bias included allocation concealment, blinding, analysis (intention to treat or per protocol), and withdrawals. We also collected data on funding, and whether the trial was stopped early (if so, why) or extended. Data related to the primary outcome included the clinical endpoint, number of subjects in each study arm, number with the outcomes in each group, point estimate, confidence intervals, and p-values.
To evaluate the authors’ conclusions, the abstract conclusion was rated using a method derived from Als-Nielsen and colleagues’ technique. We condensed the score from six to three possible conclusions: treatment superior, neutral, or control superior.
Assessing potentially meaningful effects
To assess if the primary outcome of an NSS trial included potentially meaningful effects, we focused on the point estimate and lower confidence interval. The margins of potentially clinically meaningful effect are undoubtedly debatable. Over 20 years ago, authors suggested that potentially clinically meaningful effects could be 25% or 50% relative risk reductions. More recently, trials showing a relative risk reduction of 6% for ezetimibe and 14% for empagliflozin have been greeted with enthusiasm [16, 17]. We selected our margins of potentially meaningful effect liberally to be broad and inclusive, thereby ruling out what is likely not a clinically meaningful effect. We decided that the smallest potentially clinically meaningful effect was a 6% relative risk reduction or a 0.94 relative risk, as reported by the IMPROVE-IT trial for ezetimibe. For lower confidence intervals to include potentially meaningful effects, we selected a 25% relative risk reduction or 0.75 relative risk described in meta-analyses of statin trials, an established clinical therapy.
Analysis of results
Study characteristics and potential biases are presented descriptively. Relative effect estimates including relative risks, hazard ratios, rate ratios, and odds ratios were used for primary analysis. If not provided, relative risks and 95% confidence intervals were calculated.
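As an illustration of the calculation mentioned above, the following minimal sketch computes a relative risk and a 95% confidence interval from the event counts in each arm. It assumes the common log-scale (Katz) method; the article does not state which method was used, and the function name and the example counts are hypothetical.

```python
import math


def relative_risk_ci(events_trt, n_trt, events_ctl, n_ctl, z=1.96):
    """Relative risk with a Wald-type interval on the log scale (Katz method).

    Assumption: this is one standard approach, not necessarily the authors'.
    z = 1.96 gives an approximate 95% confidence interval.
    """
    if min(events_trt, events_ctl) <= 0:
        raise ValueError("the log-scale interval needs at least one event per arm")
    rr = (events_trt / n_trt) / (events_ctl / n_ctl)
    # Standard error of log(RR) from the four cell counts.
    se_log_rr = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical trial: 90/1000 events on treatment, 100/1000 on control.
rr, lower, upper = relative_risk_ci(90, 1000, 100, 1000)
```

In this made-up example the interval spans 1, so the trial would be NSS, yet its lower limit falls below 0.75.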
Trials were initially categorized into three groups based on the statistical testing of the primary outcome: SS trials favoring control, SS trials favoring treatment, and NSS trials. Statistical significance was determined by hypothesis testing via the p-value first and, if not available, we determined if the confidence interval excluded 1 (the line of no-effect).
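The categorization rule above can be sketched as follows. The 0.05 cut-off, the convention that a relative effect below 1 favors treatment, and the function and argument names are assumptions made for illustration; they are not taken from the authors' analysis.

```python
def significance_group(point_estimate, p_value=None, ci=None, alpha=0.05):
    """Assign a trial to one of the three groups described in the text.

    The p-value is used first; the confidence interval (lower, upper) is
    only consulted when no p-value is available. alpha = 0.05 is assumed.
    """
    if p_value is not None:
        significant = p_value < alpha
    elif ci is not None:
        lower, upper = ci
        # Significant when the interval excludes 1, the line of no effect.
        significant = lower > 1.0 or upper < 1.0
    else:
        raise ValueError("need a p-value or a confidence interval")
    if not significant:
        return "NSS"
    # Assumption: outcomes are adverse events, so an estimate < 1 favors treatment.
    if point_estimate < 1.0:
        return "SS favoring treatment"
    return "SS favoring control"
```

For example, a hypothetical trial with a relative risk of 0.90 and a 95% confidence interval of 0.70 to 1.15, reported without a p-value, would be grouped as NSS.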
To analyze and describe the results, the primary outcomes for all RCTs were presented on a forest plot with the potentially clinically meaningful thresholds for point estimate (≤0.94) and confidence interval (≤0.75) indicated. We categorized NSS trials as having (1) both the lower confidence interval and point estimate include potentially meaningful effects; (2) either the lower confidence interval or point estimate include a potentially meaningful effect; or (3) neither the lower confidence interval nor point estimate include a potentially meaningful effect. Among NSS trials, results were further stratified according to authors’ conclusions.
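The three-way categorization of NSS trials can be expressed as a short sketch. The default thresholds are the ones stated in the text (point estimate ≤0.94, lower confidence interval ≤0.75); the function name and category labels are illustrative assumptions.

```python
def meaningful_effect_category(point_estimate, ci_lower,
                               pe_threshold=0.94, ci_threshold=0.75):
    """Return 'both', 'either', or 'neither' for one NSS trial.

    A criterion is met when the value is at or below its threshold,
    since a smaller relative risk means a larger benefit.
    """
    pe_meets = point_estimate <= pe_threshold
    ci_meets = ci_lower <= ci_threshold
    if pe_meets and ci_meets:
        return "both"
    if pe_meets or ci_meets:
        return "either"
    return "neither"
```

Because the thresholds are parameters, the same function can be reused with other cut-offs, for example `meaningful_effect_category(0.90, 0.70, pe_threshold=0.85, ci_threshold=0.65)`.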
We used chi-square and independent-samples median tests to examine whether selected factors were associated with authors’ conclusions in NSS trials. Factors compared included type of control used in the trials, funding (industry, public, or mixed), point estimates, and lower confidence intervals.
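To show the mechanics of a chi-square test of association, the sketch below computes the Pearson statistic for a contingency table in plain Python. The example table uses counts reported in the abstract (NSS trials in which both the point estimate and the lower confidence interval included potentially meaningful effects, by authors' conclusion: 12/18, 16/58, and 1/16). It is an illustration only and is not necessarily one of the comparisons the authors ran; the function name is hypothetical.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: treatment superior, neutral, control superior.
# Columns: both criteria met, not both met (counts from the abstract).
conclusions_by_effect = [[12, 6], [16, 42], [1, 15]]
statistic, df = pearson_chi_square(conclusions_by_effect)
```

A p-value would then be read from the chi-square distribution with the returned degrees of freedom, for example using a statistics package.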
We performed sensitivity analyses to examine the effect of some key variables on the proportion of NSS trials with potentially clinically meaningful effect. Because smaller trials may be expected to have broader confidence intervals, we performed an analysis of trials with <2000 patient-years and those with ≥2000 patient-years. Because primary prevention trials will have smaller absolute benefits for a given relative benefit, we performed an analysis of primary versus secondary prevention trials.
To determine how sensitive the results were to the threshold of potentially clinically meaningful effects, we increased the potentially meaningful relative risk reduction threshold for point estimates to ≥15% (or ≤0.85 relative risk) and for lower confidence intervals to ≥35% (or ≤0.65 relative risk).
Study inclusion and characteristics
Table. Study characteristics and risk of bias of the 127 included randomized controlled trials (row labels only; values not recovered)
Journal, n (%): New England Journal of Medicine; Journal of the American Medical Association; British Medical Journal; Annals of Internal Medicine
Primary or secondary prevention, n
Experimental interventional, n: models of care
Median age (interquartile range), years
Percent males (interquartile range)
Study size and duration: median study size (interquartile range); median study duration (interquartile range), months
Primary outcome included (median 3, range 1–10), n (%)
Risk of bias, n (%)
Planned trial duration: completed as planned; stopped for benefit; stopped for harm; stopped for futility; stopped for financial reasons
Intention to treat; modified intention to treat
Sample size estimation: no estimation given; median (interquartile range)
Statistical significance of primary outcome and conclusions
Potentially clinically meaningful effects by statistical significance
Potentially clinically meaningful effects and authors’ conclusions in not statistically significant trials
Factors associated with authors’ conclusions
Table. Abstract conclusions of included not statistically significant trials with a superiority design categorized by study characteristics (group sizes only; cell values not recovered)
Columns: authors’ conclusion in the abstract; number of studies
Type of control: placebo/nothing, 49 studies; standard/active comparator, 43 studies
Funding: industry, 37 studies; mixed, 35 studies; public, 20 studies
Point estimate: median (interquartile range); >0.94, 48 studies; ≤0.94, 44 studies
Lower confidence interval: median (interquartile range); >0.75, 51 studies; ≤0.75, 41 studies
Table. Sensitivity analysis of not statistically significant randomized controlled trials (group sizes only; cell values not recovered)
Study size (in patient years): subgroups of n = 35 and n = 57
Primary versus secondary prevention: subgroups of n = 13 and n = 79
Increase in potentially clinically meaningful thresholds (n = 92)
Lastly, NSS trials were re-examined using stricter thresholds for potentially clinically meaningful effects: a relative risk reduction of ≥15% for point estimates and ≥35% for lower confidence intervals. In 15% of NSS trials (14/92), both the point estimate and the lower confidence interval met the stricter thresholds; in 11% (10/92), only one of the two did; and in 74% (68/92), neither did.
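The percentages in this paragraph follow directly from the reported counts. The short check below reproduces them, along with the combined 26% figure quoted elsewhere in the article for trials meeting at least one of the stricter thresholds. The variable and function names are illustrative.

```python
total_nss = 92
# Counts reported for the stricter thresholds (<=0.85 and <=0.65).
stricter = {"both": 14, "only_one": 10, "neither": 68}

# The three groups should account for every NSS trial.
assert sum(stricter.values()) == total_nss


def percent(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


shares = {group: percent(count, total_nss) for group, count in stricter.items()}
at_least_one = percent(stricter["both"] + stricter["only_one"], total_nss)
```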
In 61% of NSS cardiovascular trials, the primary outcome had a confidence interval that included an effect similar to or better than statin therapy (relative risk reduction ≥25%) and/or a point estimate similar to or better than ezetimibe (≥6%). These results suggest that if we were to strictly focus on a dichotomous finding of whether results are SS or NSS, we run the risk of dismissing a treatment in almost two thirds of NSS trials that could potentially have meaningful effects. Furthermore, about one third of NSS trials had an even higher probability of potentially clinically meaningful effects because both confidence intervals and point estimates included potentially meaningful effects. In fact, visual inspection of Fig. 2 shows the distribution of the effects is very similar between SS trials favoring treatment and NSS trials when both confidence interval and point estimates include potentially meaningful effects. This further suggests that strict adherence to an arbitrary threshold for statistical significance may serve poorly as a judgment of treatment benefit.
Within NSS trials, authors’ conclusions were associated with the potentially meaningful effects in the confidence intervals and point estimates. For example, both the point estimate and confidence intervals included potentially meaningful effects in 67% of NSS trials in which the authors concluded treatment was superior. In contrast, both the point estimate and confidence intervals included potentially meaningful effects in only 6% of NSS trials in which the authors concluded control was superior. Past research suggested that just over half of NSS studies have conclusions that are unjustifiably positive and inconsistent with the results. However, our study suggests that some of these favorable interpretations may relate to potentially meaningful benefits suggested in the confidence intervals and/or point estimates. Given this and the recommendations of CONSORT regarding the presentation of results, future research evaluating authors’ interpretations or conclusions of NSS trials should assess trial outcomes beyond statistical significance testing.
Potentially meaningful effects in the point estimates and confidence intervals are not the only factors influencing authors’ conclusions. For example, 28% of NSS trials with a neutral conclusion had both a lower confidence interval and point estimate suggestive of potentially meaningful effects. Perhaps these authors are basing their conclusions solely on statistical significance but it is also possible that other elements of the trial results or intervention play a role: adverse events, costs, and secondary outcomes are all potentially relevant.
Our results were sensitive to two possibly predictable factors. First, trials of smaller size frequently have less precision in the estimate and thus broader confidence intervals. Within our study, this could result in more of the smaller trials having lower confidence intervals crossing a potentially meaningful threshold. This did occur but most of the trials included in this review were large. Therefore, the proportion of NSS trials in which the point estimate and/or the confidence interval included potentially meaningful effects was only slightly lower in larger trials (having ≥2000 patient-years) than overall (53% versus 61%, respectively). Second, modification of the thresholds of potentially clinically meaningful effects foreseeably reduced the proportion of trials with potentially meaningful effects. The proportion of NSS trials in which the point estimate and/or the confidence interval included potentially meaningful effects was 61% in our primary analysis but fell to 26% when the relative risk reduction thresholds were increased to ≥15% for point estimates and ≥35% for confidence intervals. However, even with these stricter criteria, a quarter of all NSS cardiovascular trials included potentially meaningful effects.
Despite our findings, it is important not to over-interpret our results and assume that we are suggesting that a 6% relative risk reduction is a meaningful effect in all populations. Nor would we suggest all researchers use these thresholds for sample size estimation and/or extended or repeated studies until these small benefits are entirely ruled out. All interventions, and the trials assessing their clinical value, need to be considered in the broader context of many relevant factors, including overall risk of the primary outcome, adverse events, costs, inconvenience, and alternative interventions. We hope this paper can draw attention to the need to use confidence intervals and describe potentially meaningful effects. Fortunately, it appears that a number of authors are already doing this. Moreover, we support the advice that authors and evidence-users move away from the dogmatic adherence to hypothesis testing that leads some to believe that a p-value of 0.049 means a positive trial and treatment works while a p-value of 0.051 means a negative trial and treatment does not work.
There are some notable limitations to our study. First, there are many factors involved in how authors interpret their research but our study focused only on point estimates and confidence intervals of primary outcomes. Second, we focused on cardiovascular trials with hard clinical (MACE) endpoints and so confirmation is required to determine if results would be similar for research in other conditions like chronic obstructive pulmonary disease or infectious disease. Third, our definitions of potentially clinically meaningful effects may be seen as arbitrary or too generous. There is no agreed-on minimal clinically important effect for MACE outcomes, so we derived our definition from established therapies, although some will certainly feel the resulting thresholds are too generous. We used somewhat liberal thresholds because our goal was to determine if results included any “potentially” clinically meaningful effects but we also performed a sensitivity analysis with stricter criteria. While some will see these cut-offs as arbitrary, a goal of this paper is to reflect on the rigid adherence to the 0.05 statistical significance threshold, which itself can be considered arbitrary. Fourth, we used relative margins. The use of relative margins allows for easier comparison across trials because any assessment of absolute effects must also account for time. Fifth, we assessed authors’ conclusions by focusing on abstract conclusions; however, this is a previously used method of rating conclusions, and the abstract conclusion is the most likely location for promotion of results. It should also be noted that the abstract conclusions, like any part of the articles, may have been modified through the peer-review process and editorial recommendations. It is not possible to clarify to what, if any, degree this occurred but we suspect it is small.
In up to 61% of NSS cardiovascular trials, the primary outcome has a point estimate and/or confidence interval that includes potentially clinically meaningful effects. Furthermore, among the NSS cardiovascular trials, authors’ conclusions were positively associated with point estimates and lower confidence intervals that suggest greater potential effects. In fact, both the point estimates and confidence intervals included potentially meaningful effects in 67% of trials (12/18) in which the authors concluded that treatment was superior, compared to only 6% (1/16) in which authors concluded that control was superior. Given the frequency of NSS cardiovascular trials, it is reassuring that many authors look beyond statistical significance testing and consider the potentially meaningful clinical effects of their results. Additionally, journals and evidence-users should be encouraged, as directed by CONSORT, to consider point estimates and confidence intervals in the context of potentially clinically meaningful effects and not strictly for hypothesis and statistical significance testing.
No external funding.
Availability of data and materials
The list of included studies and whether the results were statistically significant is available in the online supplement.
GMA conceived of the study and GMA, SG, MRK, JM, CK, and AJL refined and designed the study. VK, SK, EB and GMA performed study selection while CRF, VK, SK, and GMA performed data extraction. GMA, SG, JM, MRK, CK, and AJL performed analysis. OB provided statistical advice and assistance with analysis. All authors helped refine the study concept, provided input in analysis, critically contributed to the manuscript, had full access to the data, and read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c869. doi:10.1136/bmj.c869.
2. Rothman KJ. A show of confidence. N Engl J Med. 1978;299:1362–3.
3. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed). 1986;292:746–50.
4. Cummings P, Koepsell TD. P values vs estimates of association with confidence intervals. Arch Pediatr Adolesc Med. 2010;164:193–6.
5. Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting P values in the biomedical literature, 1990-2015. JAMA. 2016;315:1141–8.
6. McCormack J, Vandermeer B, Allan GM. How confidence intervals become confusion intervals. BMC Med Res Methodol. 2013;13(1):134.
7. Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. JAMA. 2010;303(20):2058–64.
8. Lockyer S, Hodgson R, Dumville JC, Cullum N. “Spin” in wound care research: the reporting and interpretation of randomized controlled trials with statistically non-significant primary outcome results or unspecified primary outcomes. Trials. 2013;14:371.
9. Patel SV, Van Koughnett JA, Howe B, Wexner SD. Spin is common in studies assessing robotic colorectal surgery: an assessment of reporting and interpretation of study results. Dis Colon Rectum. 2015;58(9):878–84.
10. Patel SV, Chadi SA, Choi J, Colquhoun PH. The use of “spin” in laparoscopic lower GI surgical trials with nonsignificant results: an assessment of reporting and interpretation of the primary outcomes. Dis Colon Rectum. 2013;56(12):1388–94.
11. Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535. doi:10.1136/bmj.b2535.
12. Als-Nielsen B, Chen W, Gluud C, Kjaergard LL. Association of funding and conclusions in randomized drug trials: a reflection of treatment effect or adverse events? JAMA. 2003;290(7):921–8.
13. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA. 1994;272(2):122–4.
14. Cannon CP, Blazing MA, Giugliano RP, McCagg A, White JA, Theroux P, Darius H, Lewis BS, Ophuis TO, Jukema JW, De Ferrari GM, Ruzyllo W, De Lucca P, Im K, Bohula EA, Reist C, Wiviott SD, Tershakovec AM, Musliner TA, Braunwald E, Califf RM, IMPROVE-IT Investigators. Ezetimibe added to statin therapy after acute coronary syndromes. N Engl J Med. 2015;372(25):2387–97.
15. Zinman B, Wanner C, Lachin JM, Fitchett D, Bluhmki E, Hantel S, Mattheus M, Devins T, Johansen OE, Woerle HJ, Broedl UC, Inzucchi SE, EMPA-REG OUTCOME Investigators. Empagliflozin, cardiovascular outcomes, and mortality in type 2 diabetes. N Engl J Med. 2015;373(22):2117–28.
16. Jarcho JA, Keaney Jr JF. Proof that lower is better--LDL cholesterol and IMPROVE-IT. N Engl J Med. 2015;372:2448–50.
17. Grant PJ. Empagliflozin in diabetes: a therapeutic light at the end of the cardiovascular tunnel? Diab Vasc Dis Res. 2015;12(6):394–5.
18. Taylor F, Huffman MD, Macedo AF, Moore TH, Burke M, Davey Smith G, Ward K, Ebrahim S. Statins for the primary prevention of cardiovascular disease. Cochrane Database Syst Rev. 2013;1:CD004816.
19. Hackshaw A, Kirkwood A. Interpreting and reporting clinical trials with results of borderline significance. BMJ. 2011;343:d3340. doi:10.1136/bmj.d3340.