Evidence of unexplained discrepancies between planned and conducted statistical analyses: a review of randomised trials

Background Choosing or altering the planned statistical analysis approach after examination of trial data (often referred to as ‘p-hacking’) can bias the results of randomised trials. However, the extent of this issue in practice is currently unclear. We conducted a review of published randomised trials to evaluate how often a pre-specified analysis approach is publicly available, and how often the planned analysis is changed. Methods A review of randomised trials published between January and April 2018 in six leading general medical journals. For each trial, we established whether a pre-specified analysis approach was publicly available in a protocol or statistical analysis plan and compared this to the trial publication. Results Overall, 89 of 101 eligible trials (88%) had a publicly available pre-specified analysis approach. Only 22/89 trials (25%) had no unexplained discrepancies between the pre-specified and conducted analysis. Fifty-four trials (61%) had one or more unexplained discrepancies, and in 13 trials (15%), it was impossible to ascertain whether any unexplained discrepancies occurred due to incomplete reporting of the statistical methods. Unexplained discrepancies were most common for the analysis model (n = 31, 35%) and analysis population (n = 28, 31%), followed by the use of covariates (n = 23, 26%) and the approach for handling missing data (n = 16, 18%). Many protocols or statistical analysis plans were dated after the trial had begun, so earlier discrepancies may have been missed. Conclusions Unexplained discrepancies in the statistical methods of randomised trials are common. Increased transparency is required for proper evaluation of results.


Background
The results of a clinical trial depend upon the statistical methods used for analysis. For example, changing the analysis population or method of handling missing data can change the size of the estimated treatment effect or its standard error. In some instances, these differences can be large and may affect the interpretation of the trial [1][2][3][4][5][6]. If investigators choose the method of analysis based on trial data in order to obtain more favourable results (often referred to as 'p-hacking'), this can cause bias [7,8]. Selective reporting has been identified previously, where outcomes with more favourable results are more likely to be reported than other outcomes [9][10][11][12][13][14][15][16][17][18][19][20]. There is some evidence to suggest this may also be a concern for statistical analyses; pre-specification of proposed methods in protocols is often poor, discrepancies between protocols and publications are common, and in some instances, changes may have been made to obtain specific results [5,9,11,14,[21][22][23][24].
Guidelines such as ICH-E9 [25] (International Conference for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use), SPIRIT [26,27] (Standard Protocol Items: Recommendations for Interventional Trials), and CONSORT [28] (Consolidated Standards of Reporting Trials) require investigators to pre-specify the principle features of their statistical analysis approach in the trial protocol and report any changes in the trial report. This strategy can reduce bias from analysis being chosen based on trial data and allows readers to assess whether inappropriate changes were made. More recently, guidelines for the content of statistical analysis plans have also been published [6].
There is an increasing demand for transparency and reproducibility in clinical research [29,30]. Although evidence suggests discrepancies in statistical methods are common, there is little data available on how often the statistical methods for a trial's primary outcome are pre-specified within a publicly available document, how often discrepancies occur when accounting for updates to the protocol or statistical analysis plan (SAP), or the blinding status of statisticians around such discrepancies. We therefore conducted a review of trials published in general medical journals to evaluate how often a prespecified analysis approach was publicly available, how often the planned analysis approach was changed, whether these changes were explained, and the reporting around the timing and blinding status of changes.

Methods
The full protocol for this review can be found in Additional file 1.

Search strategy
In this review, we examined randomised controlled trials published between January and April 2018 in six general high-impact medical journals: Annals of Internal Medicine, The BMJ, Journal of the American Medical Association (JAMA), The Lancet, New England Journal of Medicine (NEJM), and PLOS Medicine. We searched for articles in PubMed with a publication type of 'randomized controlled trial' or categorised with the MeSH term 'random allocation', or including the keyword 'random*' in the title or abstract, restricted to the aforementioned included journals and publication period. The full search strategy is shown in Appendix 1 in Additional file 2 and was conducted in July 2018.

Eligibility
Articles were eligible for inclusion if they reported results from a phase 2-4 randomised trial in humans. Exclusion criteria were pilot or feasibility study, phase 1 trial, non-randomised study, secondary analysis of previously published trial, cost-effectiveness as the primary outcome, more than one trial reported in the article, results of an interim analysis, or if the protocol or SAP was not in English.
One author screened the title and abstract of each paper for eligibility. The full texts of these articles were then assessed independently by two reviewers to confirm eligibility. For all eligible studies, one author searched the main text, supplementary material, and references to identify whether a protocol and/or SAP was available.

Data extraction
Data was extracted onto a pre-piloted standardised data extraction form by two reviewers independently (see Additional file 1). Disagreements were resolved by discussion, or by a third reviewer where disagreement could not be resolved. Where the trial publication referred to supplementary material, a SAP or protocol, the extractor referred to these documents.
We extracted data related to the primary analysis of the primary outcome from the trial publication. A single primary outcome was identified as follows: (a) if one outcome was listed as the primary, we used this; (b) if no outcomes or multiple outcomes were listed as being primary, we used the outcome that the sample size calculation was based on; and (c) if no sample size calculation was performed or sample size was calculated for multiple primary outcomes, we used the first clinical outcome listed in the objectives/outcomes section. We identified the primary analysis as follows: (a) if a single analysis strategy was used, or multiple strategies were used with one being identified as primary, we used this; (b) if multiple strategies were used without one being identified as primary, we used the first one presented in the results section.
For each article, we extracted general trial characteristics, whether protocols or SAPs were available, including the given dates of these documents and, if available, the blinding status of trial statisticians. For published protocols or SAPs, we used the date of publication. For protocols or SAPs available as supplementary material with the results article or elsewhere, we used the given date on the documents. For articles with a protocol or SAP, we compared the method of analysis in the trial publication against the method specified in the earliest available protocol or SAP which included some information on the analysis of the primary outcome (referred to as the original analysis plan). We assessed the following four analysis elements: (i) analysis population (the set of participants included in the analysis, and which treatment group they were analysed in), (ii) the statistical analysis model, (iii) use of baseline covariates in the analysis, and (iv) the method for handling missing data. We chose these elements as they are specified in the SPIRIT guidelines and have been used in previous reviews [5,26].
We evaluated two types of discrepancies for each analysis element. The first, termed a 'change', occurred when the analysis element in the trial publication was different to that specified in the original analysis plan. The following examples would constitute changes: (a) if an intention-to-treat analysis population was originally specified, but a perprotocol analysis was used; (b) if the functional form of the statistical analysis model was changed, such as from a mixed-effects regression model to generalised estimating equations (GEE); (c) if the original analysis plan specified the analysis would not adjust for baseline covariates but the trial publication adjusted for one or more patient characteristic; or (d) if a complete case analysis was originally specified, but multiple imputation was used.
The second discrepancy, termed an 'addition', occurred when the original analysis plan gave the investigators flexibility to subjectively choose the final analysis method after seeing trial data. This could occur if the original analysis plan (i) contained insufficient information about the proposed analysis or (ii) allowed the investigators to subjectively choose between multiple different potential analyses.
The following examples would constitute additions: if the original analysis plan stated that (a) both a per-protocol and intention-to-treat analysis population would be used, without specifying which was the primary analysis (as investigators could then decide during final analysis which was the primary, based on which gave the most favourable result); (b) either parametric or non-parametric methods would be used depending on distributional assumptions, but did not define an objective criteria for assessing distributional assumptions (as the investigators could then present whichever method gave the most favourable result); (c) the analysis would adjust for important baseline covariates, but did not define how these covariates would be chosen (as investigators could choose during final analysis the set of covariates which gave the most favourable result); or (d) multiple imputation would be used, but did not define what the method of imputation would be, or what variables would be included in the imputation model (as this would allow the investigators to run several different imputation models during final analysis and present only the most favourable).
We classified each discrepancy as being 'explained' or 'unexplained'. Discrepancies were classified as explained if they had been specified in a subsequent version of the protocol or SAP (with or without a justification or rationale for the discrepancy) or if the trial publication explained that an alteration to the pre-specified analysis approach had been made. Otherwise, discrepancies were classified as unexplained.

Outcomes
The main outcome measures were (i) the number of trials with a publicly available pre-specified analysis approach for the primary outcome (i.e. whether an original analysis plan was available in a protocol or a SAP), (ii) the number of trials with no unexplained discrepancies from the publicly available pre-specified analysis approach, and (iii) the total number of analysis elements for each trial with an unexplained discrepancy.
Secondary outcomes were, for each analysis element described earlier, (i) the number of trials with at least one unexplained discrepancy (either change or addition), (ii) the number of trials with at least one unexplained change, and (iii) the number of trials with at least one unexplained addition.

Statistical methods
Outcomes were summarised descriptively using frequencies and percentages. We performed two pre-specified subgroup analyses, where we summarised outcomes separately according to trial funding status (any for-profit funding source, such as pharmaceutical or medical device companies vs. no for-profit funding sources) and type of intervention (pharmacological vs. surgical vs. psychological/behavioural/educational vs. other vs. multiple types). Two post hoc subgroup analyses were performed. The first summarised outcomes according to whether a pre-specified analysis approach was made available prior to the publication of the results article (made available prior to the trial publication vs. made available at the same time or after the trial publication). The second summarised outcomes between trials with a standalone SAP available versus those without (regardless of the availability of a protocol).
All statistical analyses were performed using Stata version 15 [31].

Search results and characteristics of included studies
Our search identified 197 articles, of which 101 were eligible (see Fig. 1 and for a list of eligible trials Appendix 2 in Additional file 2). General trial characteristics are shown in Table 1.
Protocols were available for 90 trials (89%) (48 published, 70 as supplementary material with publication, 5 on a website). SAPs were available for 46 trials (46%) (3 published, 43 as supplementary material with publication, 2 on a website). Of the 90 trials with an available protocol, the earliest version available was dated before recruitment began for 45 (50%) trials, 19 (21%) were dated during recruitment, 8 (9%) were dated after recruitment ended, and 18 (20%) did not have a date. Of the 46 trials with an available SAP, the earliest version of the SAP was dated before recruitment began for 9 (20%) trials, 13 (28%) were dated during recruitment, 13 (28%) were dated after recruitment ended, and 11 (24%) did not have a date.
Overall, only 11 trials (11%) stated in the trial publication, protocol, or SAP that the statistician was blinded until the SAP was signed off, and 10 (10%) stated the statistician was blinded until the database was locked.

Availability of pre-specified analysis approach
Overall, 89 of 101 trials (88%) had a publicly available pre-specified analysis approach for the primary outcome. Eleven trials did not have an available protocol or SAP, and one trial had a protocol with no information on the analysis and no SAP. The document containing the original analysis plan (83 in a protocol, 6 in a SAP) was dated before the start of recruitment for 41 of 89 (46%) trials, during recruitment in 19 (21%) trials (median 19 months post-recruitment beginning, IQR 9 to 46), and after the end of recruitment in 8 (9%) trials (median 7 months post-recruitment completion, IQR 4 to 13). In 21 trials (24%), no date was available.
For 46 of 89 trials (52%), the document containing the original analysis plan was made publically available prior to the trial publication (44 published, 2 on a website), whereas the document for the remaining 43 trials (48%) was only made publically available at the point the trial was published (39 supplementary material to the trial publication, 3 published, and 1 on a website).

Comparison of pre-specified and conducted statistical analysis approach
Of the 89 trials with an available pre-specified analysis approach, only 22 (25%) did not have any unexplained discrepancies (no discrepancies n = 5, explained discrepancies only n = 17). A further 54 trials (61%) had one or more unexplained discrepancies (see Fig. 2). In 13 trials (15%), it was unclear whether an unexplained discrepancy occurred due to poor reporting of statistical methods (unclear whether discrepancy occurred n = 11, unclear whether discrepancy explained n = 2).
Overall, 29 trials (33%) had at least one explained discrepancy. Most discrepancies were explained in a later version of the protocol or SAP; only 2 trials explained a discrepancy in the trial publication. Of the 29 trials with an explained discrepancy, only 6 (21%) stated that the statistician was blinded until the SAP was signed off, and 4 (14%) until the database was locked.

Subgroup analyses
A total of 43/61 (66%) trials that had no for-profit funding had at least one unexplained discrepancy, compared to 11/28 (45%) trials that had some for-profit funding. One or more unexplained discrepancy occurred for 25 of 47 (53%) pharmaceutical intervention trials, 10 of 12 surgical trials (83%), 9 of 9 (100%) psychological/behavioural/educational trials, 9 of 19 (47%) other trials, and 1 of 2 trials (50%) with multiple intervention types. Fewer trials with a SAP available had unexplained discrepancies than trials without an available SAP, though this figure was still high (SAP available 22/46 [48%] with ≥ 1 unexplained discrepancy vs. no SAP 32/43 [74%]). Trials with a SAP still had a relatively high number of additions to the analysis method, indicating that methods were not being adequately pre-specified within these SAPs (range 7-15% across analysis elements).
A total of 34/46 (74%) trials in which the original analysis plan was made available prior to the trial publication had at least one unexplained discrepancy, compared to 20/43 (47%) trials where the analysis approach was only made available at the time of trial publication. See Additional file 2, Appendix 3 and 4 for additional results, including subgroup results by analysis element.

Discussion
In our review of 101 trials published in high-impact general medical journals, we found that most had a prespecified analysis approach for the primary outcome available in either a protocol or SAP. This is essential to allow transparent assessment of whether inappropriate changes were made to the statistical methods. However, most pre-specified statistical analysis approaches were available in a document that was dated after the trial had begun, or had no date available. It is therefore possible that the analysis approach in these documents may have already been changed from the pre-trial version. Moreover, there was poor reporting around the blinding status of statisticians rendering it impossible to verify that the pre-specified analysis approach had not been written after seeing unblinded trial data. In this case, even if no alterations were made to the pre-specified approach, results still may be subject to bias.
Only 25% of trials did not have any unexplained discrepancies between the trial publication and the prespecified analysis approach, and only 6% had no discrepancies at all. Most trials had at least one unexplained discrepancy (61%), with 32% of trials having two or more. In 15% of trials, it was impossible to assess whether there were unexplained discrepancies due to poor reporting of the statistical methods used. Of note, 33% of trials had one or more explained discrepancies; however, less than a quarter of these trials reported that the statistician was blinded to treatment allocation until the analysis plan was finalised or the database was locked. These alterations may therefore have been made based on unblinded trial data, despite being explained. It was also surprising that only two trials explained a discrepancy in the trial publication, despite requirements by the CONSORT [32] statement to do so.
Our results are broadly consistent with previous reviews. Spence et al. [33] evaluated the availability of protocols and SAPs for trials published in high-impact medical journals and found similar rates of availability. However, the rates of discrepancies we found were generally lower than those previously reported [9,11,21,22]. For example, Chan et al. compared publications to protocols for 70 trials that received ethical approval by the scientific ethics committees for Copenhagen and Frederiksberg, Denmark, in 1994-1995 [22]. Overall, 44% of trials had unexplained discrepancies in the analysis population, 60% in the analysis model, 82% in the use of covariates, and 80% for handling of missing data. There are several potential explanations for these differences. The introduction of the SPIRIT guidelines in 2013 [26,27] may have led to better reporting of statistical methods in trial protocols. We also accessed statistical analysis plans in almost half of trials, which increased the number of explained discrepancies. Conversely, previous reviews have been restricted to comparisons of protocol and final results articles. Finally, we evaluated a different population of trials; most of the high-impact general medical journals in our review required submission of the trial protocol alongside the article and may have been less likely to accept trials with extreme discrepancies.
The key issues we identified in this study were (i) low availability of pre-trial protocols and analysis plans, (ii) poor pre-specification of statistical methods within protocols and analysis plans, (iii) frequent unexplained discrepancies in the final trial publication, (iv) poor reporting of the blinding status of statisticians in relation to modifications of analysis methods or access to trial data, and (v) poor descriptions of the actual analysis methods used in the final publication. Increased adherence to guidelines such as SPIRIT, CONSORT, and the guidelines for Statistical Analysis Plans [6,27,28] would help, though alternative approaches to increase transparency around the statistical methods are also required. Two simple proposals that would greatly improve the situation are (a) journals could require authors to submit the first and last version of their protocol and SAP alongside the results article, and publish these as supplementary material; this would allow transparent evaluation of modifications to the analysis approach and be more effective than relying on authors to publish these documents; and (b) journals could require that authors include the statistical code used to perform their analysis alongside the article as supplementary content to allow a complete and transparent comparison of the planned methods versus the final methods [30]. Increasing the availability of protocols and SAPs will additionally facilitate improved assessments of risk of bias regarding selective reporting of results by systematic reviewers [34].
Our study had some limitations. We only included articles from six high-impact medical journals; it is likely Fig. 2 Number of trials with unexplained discrepancies (total N = 89). *Of the n = 22 trials with none; no discrepancies (n = 5), explained discrepancies only (n = 17). **Unclear if discrepancy occurred (n = 11), unclear if discrepancy explained (n = 2). One trial had both a change and an addition for the analysis model that trials published in other journals may have lower availability of protocols and SAPs and higher rates of unexplained discrepancies. Protocols and SAPs were identified via searching the main text, supplementary material, and references of articles; it is possible that we might have missed a few articles, but we expect this number to be very small, as in our experience when protocols/SAPs are available they are referenced to within the paper. Comparisons were based on the first available protocol or SAP; however, many were dated after the trial had begun, so there may have been discrepancies before this that we missed. We also assumed the dates available on protocols and SAPs included as part of the supplementary material alongside trial publications were accurate. However, we had no way to verify this, and it is possible that some of these dates are incorrect, which could imply a higher proportion of analysis approaches were written later on in the life cycle of the trial than what we have reported here.

Conclusions
In conclusion, unexplained discrepancies in the statistical methods of randomised trials are common. Increased transparency around the statistical methods used in randomised trials is required for proper evaluation of trial results.
Additional file 1. Protocol and data extraction formcontains protocol and data extraction form used for this study.