Are potentially clinically meaningful benefits misinterpreted in cardiovascular randomized trials? A systematic examination of statistical significance, clinical significance, and authors’ conclusions
BMC Medicine volume 15, Article number: 58 (2017)
While journals and reporting guidelines recommend the presentation of confidence intervals, many authors adhere strictly to statistically significant testing. Our objective was to determine what proportions of not statistically significant (NSS) cardiovascular trials include potentially clinically meaningful effects in primary outcomes and if these are associated with authors’ conclusions.
Cardiovascular studies published in six high-impact journals between 1 January 2010 and 31 December 2014 were identified via PubMed. Two independent reviewers selected trials with major adverse cardiovascular events (stroke, myocardial infarction, or cardiovascular death) as primary outcomes and extracted data on trial characteristics, quality, and primary outcome. Potentially clinically meaningful effects were defined broadly as a relative risk point estimate ≤0.94 (based on the effects of ezetimibe) and/or a lower confidence interval ≤0.75 (based on the effects of statins).
We identified 127 randomized trial comparisons from 3200 articles. The primary outcomes were statistically significant (SS) favoring treatment in 21% (27/127), NSS in 72% (92/127), and SS favoring control in 6% (8/127). In 61% of NSS trials (56/92), the point estimate and/or lower confidence interval included potentially meaningful effects. Both point estimate and confidence interval included potentially meaningful effects in 67% of trials (12/18) in which authors’ concluded that treatment was superior, in 28% (16/58) with a neutral conclusion, and in 6% (1/16) in which authors’ concluded that control was superior. In a sensitivity analysis, 26% of NSS trials would include potential meaningful effects with relative risk thresholds of point estimate ≤0.85 and/or a lower confidence interval ≤0.65.
Point estimates and/or confidence intervals included potentially clinically meaningful effects in up to 61% of NSS cardiovascular trials. Authors’ conclusions often reflect potentially meaningful results of NSS cardiovascular trials. Given the frequency of potentially clinical meaningful effects in NSS trials, authors should be encouraged to continue to look beyond significance testing to a broader interpretation of trial results.
The preferred reporting of clinical outcomes in randomized controlled trials (RCTs) is described in the Consolidated Standards of Reporting Trials (CONSORT) statement . Within CONSORT the use of confidence intervals is emphasized in preference to p-values. Confidence intervals describe the precision of the estimate and “are especially valuable in relation to differences that do not meet conventional statistical significance, for which they often indicate that the result does not rule out an important clinical difference” . Editorials dating back almost 40 years have encouraged authors to use confidence intervals to describe the results of their studies rather than simply reporting the findings as statistically significant or not [2–4]. Despite this, the use of p-values in published articles remains approximately seven times more common than confidence intervals . Furthermore, confidence intervals are often used in a manner similar to p-values, to dichotomize outcomes as statistically significant (SS) or not. We have previously written about three important clinical controversies resulting from this dichotomous activity .
Interpretation of trial results when primary outcomes are not statistically significant (NSS) is challenging. In particular, it can be difficult putting the potential clinical relevance of the NSS effect and confidence intervals in context of the entire study results. Boutron and colleagues demonstrated that authors often place a favorable “spin” (positive portrayal) on trial results when the primary outcome is NSS . Such spin occurred in 58% of abstract conclusions, 50% of main text conclusions, and 18% of titles. Others have similarly reported spin in RCTs evaluating wound care  and surgical modalities [9, 10]. Although promotion of results may be common in NSS trial reporting, the evaluation assumes that NSS results demonstrate no potentially clinically meaningful effect.
For these reasons we examined the primary outcomes and conclusions of RCTs in six major medical journals. We had two primary questions: (1) How often do the point estimates and confidence intervals of the primary outcome of NSS and SS trials include potentially clinically meaningful effects? and (2) Are the authors’ conclusions in the abstract of NSS trials influenced by potentially clinically meaningful point estimates and confidence intervals? We focused specifically on cardiovascular trials with major adverse cardiovascular events (MACE) because these are established, objective, patient-oriented outcomes that overlap between trials. Additionally, in large cardiovascular trials with hard clinical endpoints, statistical significance can be difficult to attain but the results have high clinical relevance. We hypothesized that authors of cardiovascular trials may discount potentially clinically meaningful effects identified in the confidence intervals and/or point estimates when the results are NSS.
We followed the basic approach described in PRISMA  because there is no agreed on methodology for this type of study.
Eligibility and information sources
We included all cardiovascular RCTs of superiority design that evaluated preventive or interventional therapies regardless of the nature of the interventions – including medication, surgery, models of care, and lifestyle change. All comparators were valid, including placebo, active control, and no intervention. The primary outcome had to include at least one MACE: myocardial infarction, stroke, or cardiovascular death. We used PubMed to identify relevant trials from five high-impact general medical journals and one high-impact specialty journal: New England Journal of Medicine (N Engl J Med), Lancet, Journal of the American Medical Association (JAMA), British Medical Journal (BMJ), Annals of Internal Medicine (Ann Intern Med), and Circulation.
Study search and selection
Between 17 March and 14 April 2015, we searched PubMed for papers using the full journal title (and abbreviation, if present) with PubMed limits for RCTs and date (1 January 2010 to 31 December 2014). In the case of Circulation, the term circulation could relate to medical/physiologic issues in addition to the journal, so we restricted the search field to “Journal”. For the other five journals we did not apply any search restrictions in order to minimize the unlikely chance of missing relevant articles. For each journal, two authors (from VK, SK, EB, and GMA) independently evaluated and selected studies for inclusion. We excluded studies of subgroups, re-analyses, and studies that were either extensions or follow-ups from previously published trials to avoid including the same data more than once. We also excluded non-inferiority designed studies because authors’ interpretations and conclusions of non-inferiority results are broader, and this would add complexity to our interpretation of abstract conclusions. Disagreements for inclusion were resolved by consensus.
Data extraction and management
Two authors (CF with VK or SK) independently extracted data from the trials. Disagreement was resolved with consensus or third-party review (GMA).
Data extraction on study characteristics included citation, type of intervention and control, primary versus secondary prevention population, mean age in study, and percentage of males studied. Data on traditional risk of bias included allocation concealment, blinding, analysis (intention to treat or per protocol), and withdrawals. We also collected data on funding, and whether the trial was stopped early (if so, why) or extended. Data related to the primary outcome included the clinical endpoint, number of subjects in each study arm, number with the outcomes in each group, point estimate, confidence intervals, and p-values.
To evaluate the authors’ conclusions, the abstract conclusion was rated using a method derived from Als-Nielsen and colleague’s technique . We condensed the score from six to three possible conclusions: treatment superior, neutral, or control superior.
Assessing potentially meaningful effects
To assess if the primary outcome of an NSS trial included potentially meaningful effects, we focused on the point estimate and lower confidence interval. The margins of potentially clinically meaningful effect are undoubtedly debatable. Over 20 years ago, authors suggested that potentially clinically meaningful effects could be 25% or 50% relative risk reductions . More recently, trials showing a relative risk reduction of 6% for ezetimibe  and 14% for empagliflozin  have been greeted with enthusiasm [16, 17]. We selected our margins of potentially meaningful effect liberally to be broad and inclusive, thereby ruling out what is likely not a clinically meaningful effect. We decided that the smallest potentially clinically meaningful effect was a 6% relative risk reduction or a 0.94 relative risk, as reported by the IMPROVE-IT trial for ezetimibe . For lower confidence intervals to include potentially meaningful effects, we selected a 25% relative risk reduction or 0.75 relative risk described in meta-analyses of statin trials , an established clinical therapy.
Analysis of results
Study characteristics and potential biases are presented descriptively. Relative effect estimates including relative risks, hazard ratios, rate ratios, and odds ratios were used for primary analysis. If not provided, relative risks and 95% confidence intervals were calculated.
Trials were initially categorized into three groups based on the statistical testing of the primary outcome: SS trials favoring control, SS trials favoring treatment, and NSS trials. Statistical significance was determined by hypothesis testing via the p-value first and, if not available, we determined if the confidence interval excluded 1 (the line of no-effect).
To analyze and describe the results, the primary outcomes for all RCTs were presented on a forest plot with the potentially clinically meaningful thresholds for point estimate (≤0.94) and confidence interval (≤0.75) indicated. We categorized NSS trials as having (1) both the lower confidence interval and point estimate include potentially meaningful effects; (2) either the lower confidence interval or point estimate include a potentially meaningful effect; or (3) neither the lower confidence interval nor point estimate include a potentially meaningful effect. Among NSS trials, results were further stratified according to authors’ conclusions.
We used chi-square and independent samples median test to examine if selected factors were associated with authors’ conclusions in NSS trials. Factors compared included type of control used in the trials, funding (industry, public, or mixed), point estimates, and lower confidence intervals.
We performed sensitivity analyses to examine the effect of some key variables on the proportion of NSS trials with potentially clinically meaningful effect. Because smaller trials may be expected to have broader confidence intervals, we performed an analysis of trials with <2000 patient-years and those with ≥2000 patient-years. Because primary prevention trials will have smaller absolute benefits for a given relative benefit, we performed an analysis of primary versus secondary prevention trials.
To determine how sensitive the results were to the threshold of potential clinically meaningful effects, we increased the potentially meaningful relative risk reduction threshold for point estimates to ≥15% (or ≤0.85 relative risk) and for lower confidence intervals to ≥35% (or ≤0.65 relative risk).
Study inclusion and characteristics
The flow of study exclusion and inclusion is detailed in Fig. 1. Of the original 3200 studies identified in our search, 127 RCTs met inclusion criteria. Agreement for study selection was 97% and for data extraction was 93%. General characteristics and risk of bias of included studies are outlined in Table 1 and the full list of the studies is included in Additional file 1. Secondary prevention trials (85%, 108/127) and community-based trials (58%, 74/127) were most common. The primary outcome included a range of one to ten combined outcomes (median, three) with myocardial infarction (80%), stroke (65%), and cardiovascular death (50%) being the most common. Overall, study quality was good: for example, 77% (98/127) described allocation concealment and 94% (119/127) performed intention-to-treat analysis. Most trials (75%, 95/127) were completed as planned but 22% (28/127) were stopped early for varying reasons (usually harm or futility) and 3% (4/127) were extended.
Statistical significance of primary outcome and conclusions
Figure 2 outlines the flow of study outcomes with categorization by statistical significance for the primary outcome and authors’ conclusions. The primary outcome was SS favoring treatment in 21% of trials (27/127) and all concluded treatment was superior. The primary outcome was NSS in 72% of trials (92/127), of which 63% (58/92) had a neutral conclusion, 20% (18/92) concluded treatment was superior, and 17% (16/92) concluded control was superior. The primary outcome was SS favoring control in 6% (8/127) and all concluded control was superior.
Potentially clinically meaningful effects by statistical significance
Figure 3 provides the forest plot of all primary outcomes organized by statistical significance and if the confidence intervals and/or point estimates included potentially meaningful effects. Careful inspection of the forest plot reveals that a number of the NSS trials had lower confidence intervals and point estimates that appear similar to the effects in many of the SS trials. Among NSS trials, in 32% (29/92) both the lower confidence interval and point estimate included potentially meaningful effects, while in 29% (27/92) only one of the two included a potentially meaningful effect. Neither the lower confidence interval nor point estimate included potentially meaningful effects in 39% of NSS trials (36/92).
Potentially clinically meaningful effects and authors’ conclusions in not statistically significant trials
Figure 4 shows the findings based on authors’ conclusions and whether the lower confidence intervals and/or point estimates of the primary outcomes included potentially meaningful effects. The lower confidence interval and point estimate included potentially meaningful effects in 67% of trials (12/18) with conclusions that treatment was superior compared to 6% of trials (1/16) with conclusions that control was superior. Neither the lower confidence interval nor point estimate included potentially meaningful effects in 11% of trials (2/18) with conclusions that treatment was superior compared to 63% of trials (10/16) with conclusions that control was superior. The point estimates and lower confidence intervals of neutral authors’ conclusions were distributed relatively evenly.
Factors associated with authors’ conclusions
Table 2 shows NSS trial abstract conclusions compared to selected study characteristics, including the type of comparator used (placebo or active), funding (industry or public), point estimate (median and threshold), and lower confidence interval (median and threshold). There was no association between conclusions and type of comparator or funding. Both median point estimates and median lower confidence intervals decline as authors’ conclusions change from control superior to neutral to treatment superior (both p ≤ 0.006). Additionally, the clinically meaningful thresholds for point estimates and lower confidence intervals were statistically significantly associated with authors’ conclusion (p ≤ 0.002). These findings consistently show a similar association for NSS trials: lower point estimates and/or confidence intervals that suggest potentially clinical effects are associated with authors’ concluding that treatment is superior.
Table 3 shows the sensitivity analyses. Subgroups of trial size or primary and secondary prevention generally had similar proportions of trials with potentially meaningful effects. The only exception was the trial size subgroup examining the proportion of NSS trials with confidence intervals that suggested potentially meaningful effects: 71% (25/35) for smaller NSS trials with <2000 patient-years versus 28% (16/57) for larger NSS trials with ≥2000 patient-years. The proportion of larger NSS trials with a point estimate and/or confidence interval including potentially meaningful effects was 53% (30/57).
Lastly, NSS trials were re-examined using increased potentially clinically meaningful thresholds. The increased thresholds were a relative risk reduction of ≥15% for point estimates and ≥35% for lower confidence intervals. In 15% of NSS trials (14/92) both the increased point estimate and confidence interval included potentially meaningful effects, in 11% (10/92) only one of the two included a potentially meaningful effect, and in 74% (68/92) neither threshold was met.
In 61% of NSS cardiovascular trials, the primary outcome had a confidence interval that included an effect similar to or better than statin therapy (relative risk reduction ≥25%) and/or a point estimate similar to or better than ezetimibe (≥6%). These results suggest that if we were to strictly focus on a dichotomous finding of whether results are SS or NSS, we run the risk of dismissing a treatment in almost two thirds of NSS trials that could potentially have meaningful effects. Furthermore, about one third of NSS trials had even higher probability of potentially clinically meaningful effects because both confidence intervals and point estimates included potentially meaningful effects. In fact, visual inspection of Fig. 2 shows the distribution of the effects is very similar between SS trials favoring treatment and NSS trials when both confidence interval and point estimates include potential meaningful effects. This further suggests that strict adherence to an arbitrary threshold for statistical significance may serve poorly as a judgment of treatment benefit.
Within NSS trials, authors’ conclusions were associated with the potentially meaningful effects in the confidence intervals and point estimates. For example, both the point estimate and confidence intervals included potentially meaningful effects in 67% of NSS trials in which the authors concluded treatment was superior. In contrast, both the point estimate and confidence intervals included potentially meaningful effects in only 6% of NSS in which the authors’ concluded control was superior. Past research suggested that just over half of NSS studies have conclusions that are unjustifiably positive and inconsistent with the results . However, our study suggests that some of these favorable interpretations may relate to potentially meaningful benefits suggested in the confidence intervals and/or point estimates. Given this and the recommendations of CONSORT regarding the presentation of results , future research evaluating authors’ interpretations or conclusions of NSS trials should assess trial outcomes beyond statistical significance testing.
Potentially meaningful effects in the point estimates and confidence intervals are not the only factors influencing authors’ conclusions. For example, 28% of NSS trials with a neutral conclusion had both a lower confidence interval and point estimate suggestive of potentially meaningful effects. Perhaps these authors are basing their conclusions solely on statistical significance but it is also possible that other elements of the trial results or intervention play a role: adverse events, costs, and secondary outcomes are all potentially relevant.
Our results were sensitive to two possibly predictable factors. First, trials of smaller size frequently have less precision in the estimate and thus broader confidence intervals. Within our study, this could result in more of the smaller trials having lower confidence intervals crossing a potentially meaningful threshold. This did occur but most of the trials included in this review were large. Therefore, the proportion of NSS trials in which either the point estimate and/or the confidence interval included potentially meaningful effects was only slightly lower in larger trials (having ≥2000 patient-years) than overall (53% versus 61%, respectively). Second, modification of the thresholds of potentially clinically meaningful effects foreseeably reduced the proportion of trials with potentially meaningful effects. The proportion of NSS trials in which either the point estimate and/or the confidence interval included potentially meaningful effects was 61% in our primary analysis but fell to 26% when the relative risk reduction thresholds were increased to ≥15% for point estimates and ≥35% for confidence intervals. However, even with these stricter criteria, a quarter of all NSS cardiovascular trials found potentially meaningful effects.
Despite our findings, it is important not to over-interpret our results and assume that we are suggesting that a 6% relative risk reduction is a meaningful effect in all populations. Nor would we suggest all researchers use these thresholds for sample size estimation and/or extended or repeated studies until these small benefits are entirely ruled out. All interventions, and the trials assessing their clinical value, need to be considered in the boarder context of many relevant factors, including overall risk of the primary outcome, adverse events, costs, inconvenience, and alternative interventions. We hope this paper can draw attention to the need to use confidence intervals and describe potentially meaningful effects. Fortunately, it appears that a number of authors are already doing this. Moreover, we support the advice  that authors and evidence-users move away from the dogmatic adherence to hypothesis testing that leads some to believe that a p-value of 0.049 means a positive trial and treatment works while a p-value of 0.051 means a negative trial and treatment does not work.
There are some notable limitations to our study. First, there are many factors involved in how authors interpret their research but our study focused only on point estimates and confidence intervals of primary outcomes. Second, we focused on cardiovascular trials with hard clinical (MACE) endpoints and so confirmation is required to determine if results would be similar for research in other conditions like chronic obstructive pulmonary disease or infectious disease. Third, our definitions of potentially clinically meaningful effects may be seen as arbitrary or too generous. There is no agreed-on minimal clinically important effect for MACE outcomes so we derived our definition from established therapies although some will certainly feel they are too generous. We used somewhat liberal thresholds because our goal was to determine if results included any “potentially” clinically meaningful effects but we also performed a sensitivity analysis with stricter criteria. While some will see these cut-offs as arbitrary, a goal of this paper is to reflect on the rigid adherence to the 0.05 statistic significance threshold, which itself can be considered arbitrary. Fourth, we used relative margins. The use of relative margins allows for more easy comparison across trials because any assessment of absolute effects must also account for time. Fifth, although we assessed authors’ conclusions by focusing on abstract conclusions, this is a previous method of rating conclusions  and abstract conclusion is the most likely location for promotion of results . It should also be noted that the abstract conclusions, like any part of the articles, may have been modified through the peer-review process and editorial recommendations. It is not possible to clarify to what, if any, degree this occurred but we suspect it is small.
In up to 61% of NSS cardiovascular trials, the primary outcome has a point estimate and/or confidence interval that includes potentially clinically meaningful effects. Furthermore, among the NSS cardiovascular trials, authors’ conclusions were positively associated with point estimates and lower confidence intervals that suggest greater potential effects. In fact, both the point estimates and confidence intervals included potentially meaningful effects in 67% of trials (12/18) in which the authors concluded that treatment was superior, compared to only 6% (1/16) in which authors concluded that control was superior. Given the frequency of NSS cardiovascular trials, it is reassuring that many authors look beyond statistical significance testing and consider the potentially meaningful clinical effects of their results. Additionally, journals and evidence-users should be encouraged, as directed by CONSORT, to consider point estimates and confidence intervals in the context of potentially clinically meaningful effects and not strictly for hypothesis and statistical significance testing.
Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c869. doi:10.1136/bmj.c869.
Rothman KJ. A show of confidence. N Engl J Med. 1978;299:1362–3.
Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed). 1986;292:746–50.
Cummings P, Koepsell TD. P values vs estimates of association with confidence intervals. Arch Pediatr Adolesc Med. 2010;164:193–6.
Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA. Evolution of reporting P values in the biomedical literature, 1990-2015. JAMA. 2016;315:1141–8.
McCormack J, Vandermeer B, Allan GM. How confidence intervals become confusion intervals. BMC Med Res Methodol. 2013;13(1):134.
Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. JAMA. 2010;303(20):2058–64.
Lockyer S, Hodgson R, Dumville JC, Cullum N. “Spin” in wound care research: the reporting and interpretation of randomized controlled trials with statistically non-significant primary outcome results or unspecified primary outcomes. Trials. 2013;14:371.
Patel SV, Van Koughnett JA, Howe B, Wexner SD. spin is common in studies assessing robotic colorectal surgery: an assessment of reporting and interpretation of study results. Dis Colon Rectum. 2015;58(9):878–84.
Patel SV, Chadi SA, Choi J, Colquhoun PH. The use of “spin” in laparoscopic lower GI surgical trials with nonsignificant results: an assessment of reporting and interpretation of the primary outcomes. Dis Colon Rectum. 2013;56(12):1388–94.
Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535. doi:10.1136/bmj.b2535.
Als-Nielsen B, Chen W, Gluud C, Kjaergard LL. Association of funding and conclusions in randomized drug trials: a reflection of treatment effect or adverse events? JAMA. 2003;290(7):921–8.
Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA. 1994;272(2):122–4.
Cannon CP, Blazing MA, Giugliano RP, McCagg A, White JA, Theroux P, Darius H, Lewis BS, Ophuis TO, Jukema JW, De Ferrari GM, Ruzyllo W, De Lucca P, Im K, Bohula EA, Reist C, Wiviott SD, Tershakovec AM, Musliner TA, Braunwald E, Califf RM, IMPROVE-IT Investigators. ezetimibe added to statin therapy after acute coronary syndromes. N Engl J Med. 2015;372(25):2387–97.
Zinman B, Wanner C, Lachin JM, Fitchett D, Bluhmki E, Hantel S, Mattheus M, Devins T, Johansen OE, Woerle HJ, Broedl UC, Inzucchi SE, EMPA-REG OUTCOME Investigators. Empagliflozin, cardiovascular outcomes, and mortality in type 2 diabetes. N Engl J Med. 2015;373(22):2117–28.
Jarcho JA, Keaney Jr JF. Proof that lower is better--LDL cholesterol and IMPROVE-IT. N Engl J Med. 2015;372:2448–50.
Grant PJ. Empagliflozin in diabetes: a therapeutic light at the end of the cardiovascular tunnel? Diab Vasc Dis Res. 2015;12(6):394–5.
Taylor F, Huffman MD, Macedo AF, Moore TH, Burke M, Davey Smith G, Ward K, Ebrahim S. Statins for the primary prevention of cardiovascular disease. Cochrane Database Syst Rev. 2013;1, CD004816.
Hackshaw A, Kirkwood A. Interpreting and reporting clinical trials with results of borderline significance. BMJ. 2011;343:d3340. doi:10.1136/bmj.d3340.
No external funding.
Availability of data and materials
The list of included studies and whether the results were statistically significant is available in the online supplement.
GMA conceived of the study and GMA, SG, MRK, JM, CK, and AJL refined and designed the study. VK, SK, EB and GMA performed study selection while CRF, VK, SK, and GMA performed data extraction. GMA, SG, JM, MRK, CK, and AJL performed analysis. OB provided statistical advice and assistance with analysis. All authors helped refine the study concept, provided input in analysis, critically contributed to the manuscript, had full access to the data, and read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
About this article
Cite this article
Allan, G.M., Finley, C.R., McCormack, J. et al. Are potentially clinically meaningful benefits misinterpreted in cardiovascular randomized trials? A systematic examination of statistical significance, clinical significance, and authors’ conclusions. BMC Med 15, 58 (2017). https://doi.org/10.1186/s12916-017-0821-9