Evidence of unexplained discrepancies between planned and conducted statistical analyses: a review of randomized trials

Abstract

Background: Choosing or altering the planned statistical analysis approach after examination of trial data (often referred to as ‘p-hacking’) can bias results of randomized trials. However, the extent of this issue in practice is currently unclear. We conducted a review of published randomized trials to evaluate how often a pre-specified analysis approach is publicly available, and how often the planned analysis is changed.

Methods: A review of randomized trials published between January and April 2018 in six leading general medical journals. For each trial we established whether a pre-specified analysis approach was publicly available in a protocol or statistical analysis plan, and compared this to the trial publication.

Results: Overall, 89 of 101 eligible trials (88%) had a publicly available pre-specified analysis approach. Only 22/89 trials (25%) had no unexplained discrepancies between the pre-specified and conducted analysis. Fifty-four trials (61%) had one or more unexplained discrepancies, and in 13 trials (15%) incomplete reporting of the statistical methods made it impossible to ascertain whether any unexplained discrepancies occurred. Unexplained discrepancies were most common for the analysis model (n=31, 35%) and analysis population (n=28, 31%), followed by the use of covariates (n=23, 26%) and the approach for handling missing data (n=16, 18%). Many protocols or statistical analysis plans were dated after the trial had begun, so earlier discrepancies may have been missed.

Conclusions: Unexplained discrepancies in the statistical methods of randomized trials are common. Increased transparency is required for proper evaluation of results.


The results of a clinical trial depend upon the statistical methods used for analysis. For [...]

Data were extracted onto a pre-piloted standardised data extraction form by two reviewers independently. Disagreements were resolved by discussion, or by a third reviewer where disagreement could not be resolved. Where the trial publication referred to supplementary material, a statistical analysis plan (SAP) or protocol, the extractor referred to these documents.

We extracted data related to the primary analysis of the primary outcome from the trial publication. A single primary outcome was identified as follows: (a) if one outcome was listed as the primary we used this; (b) if no outcomes or multiple outcomes were listed as being primary we used the outcome that the sample size calculation was based on; and (c) if no sample size calculation was performed or sample size was calculated for multiple primary outcomes, we used the first clinical outcome listed in the objectives/outcomes section. We identified the primary analysis as follows: (a) if a single analysis strategy was used, or multiple strategies were used with one being identified as primary, we used this; (b) if multiple strategies were used without one being identified as primary, we used the first one presented in the results section.
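The selection rules above amount to a small decision procedure. The Python sketch below is illustrative only: the `Trial` record, its field names and both function names are hypothetical and are not taken from the review's data extraction form.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Trial:
    # All field names are hypothetical, chosen for illustration only.
    primary_outcomes: List[str] = field(default_factory=list)          # outcomes labelled primary
    sample_size_outcomes: List[str] = field(default_factory=list)      # outcomes the sample size was based on
    listed_clinical_outcomes: List[str] = field(default_factory=list)  # in the order listed
    analyses: List[str] = field(default_factory=list)                  # in the order presented in the results
    primary_analysis: Optional[str] = None                             # strategy labelled primary, if any


def select_primary_outcome(trial: Trial) -> Optional[str]:
    # (a) exactly one outcome is listed as primary
    if len(trial.primary_outcomes) == 1:
        return trial.primary_outcomes[0]
    # (b) none or several primaries: use the outcome behind the sample size calculation
    if len(trial.sample_size_outcomes) == 1:
        return trial.sample_size_outcomes[0]
    # (c) no sample size calculation, or one covering several outcomes:
    #     fall back to the first clinical outcome listed
    if trial.listed_clinical_outcomes:
        return trial.listed_clinical_outcomes[0]
    return None


def select_primary_analysis(trial: Trial) -> Optional[str]:
    # (a) one strategy is explicitly identified as primary, or only one was used
    if trial.primary_analysis is not None:
        return trial.primary_analysis
    if len(trial.analyses) == 1:
        return trial.analyses[0]
    # (b) several strategies, none labelled primary: first one presented in the results
    return trial.analyses[0] if trial.analyses else None
```

For a trial listing two primary outcomes with a sample size based on only one of them, rule (b) selects the outcome used in the sample size calculation.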

For each article, we extracted general trial characteristics, whether protocols or SAPs were available, including the dates of these documents and, if available, the blinding status of trial statisticians. For articles with a protocol or SAP, we compared the method of analysis in the trial publication against the method specified in the earliest available protocol or SAP which included some information on the analysis of the primary outcome (referred to as the original analysis plan). We assessed the following four analysis elements: (i) analysis population (the set of participants included in the analysis, and which treatment group they were analysed in); (ii) the statistical analysis model; (iii) use of baseline covariates in the analysis; and (iv) the method for handling missing data. We chose these elements as they are specified in the SPIRIT guidelines, and have been used in previous reviews (5,25).

medRxiv preprint, this version posted February 23, 2020. https://doi.org/10.1101/2020 The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We evaluated two types of discrepancies for each analysis element. The first, termed a 'change', occurred when the analysis element in the trial publication was different to that specified in the original analysis plan.
The following examples would constitute changes: (a) if an intention-to-treat analysis population was originally specified, but a per-protocol analysis was used; (b) if the functional form of the statistical analysis model was changed, such as from a mixed-effects regression model to generalized estimating equations (GEE); (c) if the original analysis plan specified the analysis would not adjust for baseline covariates but the trial publication adjusted for one or more patient characteristics; or (d) if a complete case analysis was originally specified, but multiple imputation was used.
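As an illustration of what a 'change' means across the four analysis elements, the sketch below compares a planned and a published analysis element by element. It is a simplified Python sketch under stated assumptions: in the review this comparison was a reviewer judgement, not a string match, and the dictionary keys and function names here are hypothetical.

```python
# Hypothetical keys for the four analysis elements assessed in the review.
ELEMENTS = ("analysis_population", "analysis_model", "covariates", "missing_data")


def _normalise(value):
    """Make values comparable: case-insensitive strings, order-free collections."""
    if isinstance(value, str):
        return value.strip().lower()
    return frozenset(item.strip().lower() for item in value)


def find_changes(plan, publication):
    """Return the analysis elements whose published method differs from the plan."""
    changes = []
    for element in ELEMENTS:
        if element not in plan or element not in publication:
            continue  # element not reported; it cannot be assessed here
        if _normalise(plan[element]) != _normalise(publication[element]):
            changes.append(element)
    return changes


# The four example changes (a) to (d) described above, encoded as toy data.
plan = {
    "analysis_population": "intention-to-treat",
    "analysis_model": "mixed-effects regression",
    "covariates": [],
    "missing_data": "complete case",
}
publication = {
    "analysis_population": "per-protocol",
    "analysis_model": "generalized estimating equations",
    "covariates": ["age"],
    "missing_data": "multiple imputation",
}
changed = find_changes(plan, publication)
```

With this toy data, all four elements are flagged as changed. Comparing a plan against itself returns an empty list.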

The second discrepancy, termed an 'addition', occurred when the original analysis plan gave the investigators flexibility to subjectively choose the final analysis method after seeing trial data. This could occur if the original analysis plan (i) contained insufficient information about the proposed analysis; or (ii) allowed the investigators to subjectively choose between multiple different potential analyses. The following examples would constitute additions: if the original analysis plan stated that (a) both a per-protocol and intention-to-treat analysis population would be used, without specifying which was the primary analysis (as investigators could then decide during final analysis which was the primary, based on which gave the most favourable result); (b) either parametric or non-parametric methods would be used depending on distributional assumptions, but did not define objective criteria for assessing distributional assumptions (as the investigators could then present whichever method gave the most favourable result); (c) the analysis would adjust for important baseline covariates, but did not define how these covariates would be chosen (as investigators could [...]

We classified each discrepancy as being 'explained' or 'unexplained'. Discrepancies were classified as explained if they had been specified in a subsequent version of the protocol or SAP (with or without a justification or rationale for the discrepancy), or if the trial publication explained that an alteration to the pre-specified analysis approach had been made. Otherwise discrepancies were classified as unexplained.
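The explained/unexplained rule above can be written down compactly. The following Python sketch is one possible encoding, not the authors' procedure: the `Discrepancy` record, its field names and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Discrepancy:
    # Hypothetical record of one discrepancy found for one trial.
    element: str                 # e.g. "analysis_model"
    kind: str                    # "change" or "addition"
    in_later_plan: bool          # specified in a subsequent protocol or SAP version
    stated_in_publication: bool  # trial publication explains the alteration


def is_explained(d: Discrepancy) -> bool:
    # Explained if documented in a later protocol/SAP version (with or without
    # a rationale) or acknowledged in the trial publication.
    return d.in_later_plan or d.stated_in_publication


def unexplained_elements(discrepancies: Iterable[Discrepancy]) -> List[str]:
    # Analysis elements with at least one unexplained discrepancy,
    # counted once per element regardless of how many discrepancies it has.
    return sorted({d.element for d in discrepancies if not is_explained(d)})
```

Counting each element once means that a trial with both a change and an addition in the analysis model contributes a single unexplained element, so the per-trial count cannot exceed four.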
The main outcome measures were (i) the number of trials with a publicly available pre-specified analysis approach for the primary outcome (i.e. whether an original analysis plan was available in a protocol or a SAP); (ii) the number of trials with no unexplained discrepancies from the publicly available pre-specified analysis approach; and (iii) the total number of analysis elements for each trial with an unexplained discrepancy.

Outcomes were summarised descriptively using frequencies and percentages. We performed two pre-specified subgroup analyses, where we summarised outcomes separately according to trial funding status, and type of intervention. One post-hoc subgroup analysis was performed according to availability of a SAP.

All statistical analyses were performed using Stata version 15 (28).

[...] Table 1.

Protocols were available for 90 trials (89%) (48 published, 70 as supplementary material with publication, 5 on a website). SAPs were available for 46 trials (46%) (3 published, 43 as supplementary material with publication, 2 on a website). Of 90 trials with an available protocol, the earliest version available was dated before recruitment began for 45 (50%) trials, 19 (21%) were dated during recruitment, 8 (9%) were dated after recruitment ended, and 18 (20%) did not have a date.
Of 46 trials with an available SAP, the earliest version of the SAP was dated before recruitment began for 9 (20%) trials, 13 (28%) were dated during recruitment, 13 (28%) were dated after recruitment ended, and 11 (24%) did not have a date.
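The percentages above follow directly from the reported counts. As a minimal Python sketch of the descriptive summary (the analyses themselves were run in Stata; the function name and category labels below are assumptions made for illustration), the dating of the earliest available documents can be tabulated as:

```python
def summarise(counts):
    """Return {category: (n, percentage)} with percentages rounded to whole numbers."""
    total = sum(counts.values())
    return {label: (n, round(100 * n / total)) for label, n in counts.items()}


# Counts as reported in the text for the earliest available document.
protocol_dates = {
    "before recruitment": 45,
    "during recruitment": 19,
    "after recruitment ended": 8,
    "no date": 18,
}
sap_dates = {
    "before recruitment": 9,
    "during recruitment": 13,
    "after recruitment ended": 13,
    "no date": 11,
}

for name, counts in (("Protocol", protocol_dates), ("SAP", sap_dates)):
    for label, (n, pct) in summarise(counts).items():
        print(f"{name}: {label}: {n} ({pct}%)")
```

The category counts sum to 90 protocols and 46 SAPs, matching the denominators given in the text.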

Overall, only 11 trials (11%) stated in the trial publication, protocol, or SAP that the statistician was blinded until the SAP was signed off and 10 (10%) stated the statistician was blinded until the database was locked.

Of the 89 trials with an available pre-specified analysis approach, only 22 (25%) did not have any unexplained discrepancies (no discrepancies n=5, explained discrepancies only n=17). A further 54 trials (61%) had one or more unexplained discrepancies (see Fig 2). In 13 trials (15%), poor reporting of statistical methods made it unclear whether an unexplained discrepancy occurred (unclear whether discrepancy occurred n=11, unclear whether discrepancy explained n=2).

Most trials had one (n=25, 28%) or two (n=16, 18%) unexplained discrepancies. Only 11 (12%) had three and 2 (2%) had four unexplained discrepancies. Unexplained discrepancies were most common for the statistical analysis model (n=31, 35%) and analysis population (n=28, 31%), followed by the use of covariates (n=23, 26%) and handling of missing data (n=16, 18%). Table 2 provides a description of the unexplained discrepancies.

[...] the statistician was blinded until the SAP was signed off, and 4 (14%) until the database was locked.

Subgroup analyses
Among trials funded by not-for-profit sources only, 43/61 (66%) had at least one unexplained discrepancy, compared to 11/28 (45%) of trials funded by for-profit sources only. Fewer trials with a SAP available had unexplained discrepancies than trials without an available SAP, though this [...]

In our review of 101 trials published in high impact general medical journals, we found that most had a pre-specified analysis approach for the primary outcome available in either a protocol or SAP. This is essential to allow transparent assessment of whether inappropriate changes were made to the statistical methods. However, most pre-specified statistical analysis approaches were available in a document that was dated after the trial had begun, or had no date available. It is therefore possible that the analysis approach in these documents had already been changed from the pre-trial version.
Only 25% of trials did not have any unexplained discrepancies between the trial publication and the pre-specified analysis approach, and only 6% had no discrepancies at all. Most trials had at least one unexplained discrepancy (61%), with 32% of trials having two or more. In 15% of trials it was impossible to assess whether there were unexplained discrepancies due to poor reporting of the statistical methods used. Of note, 33% of trials had one or more explained discrepancies; however, less than a quarter of these trials reported that the statistician was blinded to treatment allocation until the analysis plan was finalised or the database was locked. These alterations may therefore have been made based on unblinded trial data, despite being explained. It was also surprising that only two trials explained a discrepancy in the trial publication, despite requirements by the CONSORT (29) [...] found similar rates of availability. However, the rates of discrepancies we found were generally lower than those previously reported (8,10,20,21). For example, Chan et al compared publications to protocols for 70 trials that received ethical approval by the scientific-ethics committees for Copenhagen and Frederiksberg, Denmark in 1994-5 (21). Overall, 44% of trials had unexplained discrepancies in the analysis population, 60% in the analysis model, 82% in the use of covariates, and 80% for handling of missing data. There are several potential explanations for these differences. The introduction of the SPIRIT guidelines in 2013 (25, 26) may have led to better reporting of statistical methods in trial protocols. We also accessed statistical analysis plans in almost half of trials, which increased the number of explained discrepancies. Finally, we evaluated a different population of trials; most of the high impact general medical journals in our review required submission of the trial protocol alongside the article, and may have been less likely to accept trials with extreme discrepancies.

The key issues we identified in this study were: (i) low availability of pre-trial protocols and [...]

Our study had some limitations. We only included articles from six high impact medical journals; it is likely that trials published in other journals have lower availability of protocols and SAPs, and higher rates of unexplained discrepancies. Comparisons were based on the first available protocol or SAP; however, many were dated after the trial had begun, so there may have been earlier discrepancies that we missed.

Conclusions

In conclusion, unexplained discrepancies in the statistical methods of randomized trials are [...]

Critical revision of the manuscript for important intellectual content: All authors