### Study selection

We decided to focus on diseases associated with high morbidity and mortality for patients and with significant healthcare expenditure [13–15]. We hypothesized that methodological deficiencies of trials providing evidence on treatments for such diseases could bias effect estimates and, therefore, clinical practice. We focused on widely prescribed drug and non-drug therapies for which we wanted comprehensive sets of trials, as identified by systematic reviews each addressing a specific research question. We therefore based the selection of RCTs on 11 Cochrane Reviews that systematically identified and summarized RCTs on the effectiveness of diuretics, metformin, anticoagulants, long-acting β agonists (alone or in combination with inhaled corticosteroids), lipid-lowering agents, and the non-drug interventions of exercise and diet for each of the four diseases [16–26]. We included all available reviews in these fields, except in the stroke literature, where several other reviews of other drug and non-drug treatments exist. The search strategies and eligibility criteria are described in these Cochrane reviews. We retrieved the main reports of the included RCTs, as well as additional papers that described the trials' methods. We assessed only the reporting of trials, not their conduct, because protocols and internal reports were not available to us. We did not consider abstracts and unpublished data used in the Cochrane reviews because they could not provide the level of detail we needed; we excluded 22 of 183 trials for this reason. The bibliography of excluded trials is available on request.

### Data extraction

Before systematically extracting data from each trial, we developed a codebook providing a detailed description of the information to be extracted and how to score it. We pilot-tested the data-extraction forms and the codebook on a random sample of 10 articles. All data were then extracted by one reviewer into an online database and checked by at least one other reviewer. Disagreements were discussed and resolved.

#### Definition of primary outcome

We recorded whether there were one or more clearly defined primary outcomes. Besides guiding interpretation [27] and sample size calculations, defining a primary outcome is important for designing a trial that minimizes confounding indirectly through restriction (exclusion criteria that eliminate some levels of characteristics that may act as confounders), pre-stratification for prognostically important variables, and the collection of data for potential statistical adjustment. Confounders can be profoundly different depending on the primary outcome (for example, mortality versus quality of life) [9]. Possible answer categories for data extraction included 'Yes, clearly defined' if the authors made an explicit distinction between primary and secondary outcomes, and 'No' if they made no such distinction. For those papers that did describe a primary outcome, we also recorded whether the primary outcome was a single measure (for example, percentage of predicted forced expiratory volume in one second at 1 year of follow-up) or whether multiple measures were considered primary outcomes (for example, different individual measures, or one measure at different time points at which the treatment effect may differ).

#### Between- and within-group comparisons

We recorded whether only within-group comparisons were reported, with no between-group comparisons ('Yes', 'No' or 'Unclear'). A within-group comparison compares baseline and follow-up measurements within each treatment group, rather than providing an effect estimate for the randomized comparison between groups. Based on a within-group comparison alone, it is not possible to tell whether a change was caused by the intervention or by some other factor.
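A toy numeric sketch (hypothetical values, not drawn from the reviewed trials) of why within-group change can mislead: both arms improve from baseline, for example through regression to the mean or a placebo response, yet the randomized between-group contrast is null.

```python
from statistics import mean

# Hypothetical outcome scores (illustrative only): both arms improve from
# baseline, so a within-group comparison looks "effective" in each arm even
# though the randomized between-group contrast shows no treatment effect.
baseline_treat = [52, 55, 60, 58, 54]
followup_treat = [48, 50, 55, 53, 49]
baseline_ctrl = [53, 56, 59, 57, 55]
followup_ctrl = [49, 51, 54, 52, 50]

within_treat = mean(f - b for b, f in zip(baseline_treat, followup_treat))
within_ctrl = mean(f - b for b, f in zip(baseline_ctrl, followup_ctrl))
between = within_treat - within_ctrl  # the randomized comparison

print(within_treat, within_ctrl, between)  # -4.8 -4.8 0.0
```

Both within-group changes are clearly non-zero, but the between-group difference, which is the only comparison protected by randomization, is exactly zero.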

#### Comparison of baseline characteristics and decisions to adjust for baseline imbalances

We recorded whether a between-group comparison of baseline characteristics was made to assess significant differences (for example, *P* values in Table 1). This is not a particularly useful comparison, because it tests the null hypothesis that the treatment groups are not different, even though we know that, through randomization, the null hypothesis is true [28–30]. Even more importantly, such testing may misguide the statistical analysis, as investigators may inappropriately use these differences to over- or under-adjust for potential confounders, even though significant differences at baseline occur by chance in 5% of the variables tested. Indeed, confounding of observed treatment effects may result if certain characteristics are not well balanced and are thus associated with treatment exposure, or if they influence the outcome but are not a result of treatment exposure (intermediates). However, the decision for or against considering a variable a confounder should be based not on testing for statistical significance but on prior evidence and/or biological rationale. We therefore distinguished between 'Yes, reporting of *P* values and/or the term statistical significance', 'Yes, significant' (which may refer to a statistical analysis but also to a clinically relevant difference), 'No' and 'Does not apply' (in the case of crossover studies). For those papers in which the investigators tested for baseline differences, we assessed what actions were taken as a consequence of the testing. We recorded whether the authors adjusted for a significant difference in a baseline characteristic or mentioned it in the Discussion section. If they did not find any significant differences, we assessed whether they adjusted for large-magnitude differences in baseline characteristics irrespective of significance tests, or adjusted for a potential confounder unrelated to baseline characteristics.
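The 5%-by-chance argument can be illustrated with a small simulation (a sketch under the assumption of perfect randomization; the function and all numbers are ours, for illustration): when both arms are drawn from the same distribution, roughly 5% of baseline tests come out "significant" at the 0.05 level.

```python
import math
import random

random.seed(1)

def two_sample_p(x, y):
    """Two-sided z-test for a difference in means (large-sample normal approximation)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    z = (mx - my) / math.sqrt(vx / len(x) + vy / len(y))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 1,000 baseline covariates measured on two randomized arms of 100 patients
# each; every covariate has the same distribution in both arms, so any
# "significant" imbalance is pure chance.
n_sig = 0
for _ in range(1000):
    arm_a = [random.gauss(0, 1) for _ in range(100)]
    arm_b = [random.gauss(0, 1) for _ in range(100)]
    if two_sample_p(arm_a, arm_b) < 0.05:
        n_sig += 1

print(n_sig / 1000)  # close to the nominal 5% false-positive rate
```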

#### Missing data

Missing data occur in almost any study for some outcomes or covariates [10]. We were interested in how chronic disease trials handled missing data because of the potential for bias in treatment effects. If data are missing at random and non-differentially between treatment groups, effect estimates are, at best, still valid although less precise. However, if patients drop out of a trial (for example, because of adverse effects) and do not provide outcome data, selection bias may occur if these patients are dropped from the analyses or censored [8]. In addition, if data on confounders are missing, statistical adjustment for potential confounding is compromised. We assessed whether or not the approach to handling missing data was reported ('Yes', 'No' or 'Unclear') and recorded the corresponding methods (for example, last value carried forward, imputation of fixed values such as the mean, or multiple imputation). If the authors reported the missing data but dropped those participants from the analysis, we recorded it as a complete case analysis.
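A minimal sketch, on hypothetical longitudinal data, contrasting a complete case analysis with last value carried forward (the `locf` helper and the records are ours, for illustration):

```python
# Hypothetical per-patient outcome data with missing follow-up values (None).
records = [
    {"id": 1, "visits": [60, 58, 55]},
    {"id": 2, "visits": [62, None, None]},  # dropped out after baseline
    {"id": 3, "visits": [59, 57, None]},    # missed the final visit
]

def locf(visits):
    """Last value carried forward: fill each gap with the last observed value."""
    filled, last = [], None
    for v in visits:
        last = v if v is not None else last
        filled.append(last)
    return filled

complete_cases = [r for r in records if None not in r["visits"]]
locf_final = [locf(r["visits"])[-1] for r in records]

print(len(complete_cases))  # 1 -- complete case analysis discards two patients
print(locf_final)           # [55, 62, 57] -- imputed final values for all three
```

If dropout is related to the outcome (for example, adverse effects), the complete case analysis here would keep only the patient who fared well, illustrating the selection bias discussed above.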

#### Intention-to-treat analysis

We also assessed whether or not an intention-to-treat (ITT) analysis was reported. An ITT analysis is based on the initial treatment assignment, not on the treatment eventually administered. Moving a patient from the assigned treatment arm to another arm during the trial and/or dropping patients from the analysis can also lead to selection bias. We considered an analysis to be ITT if the authors explicitly described it as such, or if the numbers of patients included in the analysis corresponded exactly to the numbers randomized to the respective treatment groups [31].
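The second, operational criterion — numbers analysed equal to numbers randomized in every arm — can be sketched as follows (the function name and counts are hypothetical, not from the reviewed trials):

```python
def looks_like_itt(randomized, analysed):
    """Heuristic ITT check: every arm's analysed count equals its randomized count."""
    return all(analysed.get(arm) == n for arm, n in randomized.items())

# All randomized patients appear in the analysis -> consistent with ITT.
print(looks_like_itt({"drug": 150, "placebo": 148},
                     {"drug": 150, "placebo": 148}))  # True

# Nine patients missing from the drug arm -> not consistent with ITT.
print(looks_like_itt({"drug": 150, "placebo": 148},
                     {"drug": 141, "placebo": 148}))  # False
```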

#### Reporting of point estimates and measures of precision

To interpret treatment effects, trial reports should include point estimates, confidence intervals (CIs) and *P* values. *P* values alone do not suffice to interpret trial results because they are influenced by both sample size and effect size. We recorded whether trial reports included a *P* value only, a 95% CI only, both, or neither.
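For instance, a point estimate with a Wald 95% CI can be computed from event counts as follows (the 2×2 counts are invented for illustration):

```python
import math

# Hypothetical binary-outcome result: events / total in each arm.
e_t, n_t = 30, 200   # treatment
e_c, n_c = 45, 200   # control

p_t, p_c = e_t / n_t, e_c / n_c
rd = p_t - p_c  # point estimate (risk difference)
se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
ci = (rd - 1.96 * se, rd + 1.96 * se)  # Wald 95% CI

print(round(rd, 3), tuple(round(x, 3) for x in ci))
```

Here the point estimate favours treatment, but the CI crosses zero, information that a bare *P* value would not convey.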

#### Subgroup analyses to investigate heterogeneity of treatment effects

Finally, we assessed how often and in what way subgroup analyses were performed, using recently described criteria [11, 12]. We defined a subgroup analysis as an analysis of whether the treatment effect varied (or not) across levels of a variable measured before randomization, and recorded whether one was performed ('Yes' or 'No'). For trials that reported one or more subgroup effects, we recorded whether just one, a small number (two to five) or a large number (more than five) of subgroup effects were reported. We recorded the proportion of trials using an interaction term to compare treatment effects across levels of other factors, and whether the interaction terms were statistically independent of other subgroup effects. In addition, we assessed whether or not subgroup effects were consistent across closely related outcomes. We also assessed whether the authors had specified a hypothesis for why a subgroup effect could be present, based on evidence from the literature or biological plausibility, and whether they pre-specified the direction of the subgroup effect.
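A test of interaction of the kind referred to above can be sketched on stratum-level counts (a simple z-test on the difference of two risk differences; the function, strata and all numbers are hypothetical):

```python
import math

def risk_diff_se(e1, n1, e0, n0):
    """Risk difference (treatment minus control) and its standard error."""
    p1, p0 = e1 / n1, e0 / n0
    rd = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return rd, se

# Hypothetical events/total per arm within each stratum of a baseline variable.
rd_m, se_m = risk_diff_se(20, 100, 30, 100)   # e.g. men
rd_f, se_f = risk_diff_se(22, 100, 28, 100)   # e.g. women

# Interaction: is the difference between the two stratum-specific effects
# larger than chance would explain?
z = (rd_m - rd_f) / math.sqrt(se_m ** 2 + se_f ** 2)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(p, 2))  # a large P: no evidence the effect differs between strata
```

Comparing the two strata this way, rather than testing each stratum separately, is what distinguishes a test of interaction from the misleading practice of declaring a subgroup effect whenever one stratum's result is significant and the other's is not.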

### Statistical analysis

We used descriptive statistics to summarize our findings across the entire set of trials, and stratified by disease, type of treatment (drug versus non-drug), time of publication (before and after 2001, when many journals adopted the CONSORT statement) and the impact factor of the journal in which the reports were published (per unit increase of the 2009 impact factor), which we used as an indirect measure of the overall quality of a trial. We used simple and multiple logistic regression analyses to assess the association of the sources of bias we assessed with type of treatment, time of publication and journal impact factor. For analyses with time of publication as a covariate, we restricted the analysis to trials published in or after 1990, because initiatives to improve trial reporting, such as CONSORT, did not start before 1990. All analyses were conducted with Stata for Windows (version 10.1; Stata Corp., College Station, TX, USA).
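The stratified descriptive summaries can be sketched on toy extraction records (all values hypothetical; the actual analyses were run in Stata, not Python):

```python
from collections import defaultdict

# Toy extraction records: one row per trial, with publication period and
# whether an ITT analysis was reported (invented for illustration).
trials = [
    {"period": "pre-2001", "itt": True},
    {"period": "pre-2001", "itt": False},
    {"period": "pre-2001", "itt": False},
    {"period": "2001+", "itt": True},
    {"period": "2001+", "itt": True},
    {"period": "2001+", "itt": False},
]

counts = defaultdict(lambda: [0, 0])  # period -> [reported, total]
for t in trials:
    counts[t["period"]][0] += t["itt"]
    counts[t["period"]][1] += 1

summary = {k: round(r / n, 2) for k, (r, n) in counts.items()}
print(summary)  # {'pre-2001': 0.33, '2001+': 0.67}
```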