This is a registered report: the research protocol was peer-reviewed by the journal before the research took place. It received in-principle acceptance on December 20, 2019, and was registered on the Open Science Framework on January 14, 2020 [15].
Once a protocol is accepted, the editors undertake to publish the completed study, provided the protocol was followed, even if the findings are statistically negative (i.e. the study hypothesis is not confirmed). This approach is expected to reduce issues such as publication bias [16].
Eligibility criteria
EPARs
We collected all EPARs on newly authorised human medicines, biosimilars and orphan medicines given a positive opinion by the Committee for Medicinal Products for Human Use (CHMP) between 1 January 2017 and 31 December 2019 and approved by the European Commission. EPARs concerning generics and hybrid medicines were excluded. Definitions of the different types of drugs can be found in the web appendix (Additional file 1: Table S1) [15]. The distinction between new biosimilars, new generics, new hybrid medicines, orphan medicines and new medicines followed the CHMP Meeting Highlights [17].
Main studies
Pivotal trials are referred to as “main studies” in the different EPARs. Any main study was included, with no distinction in terms of study phase, study type, study design, or intervention.
If an indication for a drug had been refused and another indication authorised, the main study for the non-authorised indication was not considered.
Furthermore, studies in which no primary outcome could be identified were excluded and listed as non-evaluable studies.
Search strategy
Eligible main trials
Two reviewers (MS, JG) independently extracted all names of the new medicines, biosimilars and orphan medicines approved by the CHMP and entered the information on a standard data extraction form. Afterwards, a check was performed to verify that the CHMP opinion was adopted by the European Commission [18]. Next, the reviewers identified the corresponding eligible EPARs on the EMA website [19] and independently extracted all main studies reported in these EPARs. Disagreements were resolved by discussion between the two reviewers or after referral to a third reviewer (CL or FN) until a consensus was reached.
Sample size calculation
A random sample of 62 of these main studies was selected using R (rnorm function) [20]. This sample size ensured a precision of ± 12% for estimating our primary outcome (i.e. the percentage of reproducible studies; see below for a definition) in the worst-case scenario for precision (i.e. a reproducibility percentage of 50%).
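As an illustration of this calculation, the half-width of a 95% Wald confidence interval for a proportion p estimated from n studies is 1.96 × √(p(1 − p)/n); the following minimal R sketch (our own illustration, not the study's published code) verifies the ± 12% figure:

```r
# Half-width of a 95% Wald confidence interval for a proportion
# (our own sketch; the study's actual sampling code is not published here).
precision <- function(p, n) 1.96 * sqrt(p * (1 - p) / n)

precision(p = 0.5, n = 62)  # worst case (p = 50%): ~0.124, i.e. about +/- 12%
```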
Main study document accessibility
For all randomly sampled studies, one reviewer (JG) searched each EPAR for the EudraCT number, the Sponsor Protocol Number and/or any other identifying information, and identified the official sponsor of the study. If this information was lacking, the same reviewer performed a wildcard search using keywords (disease, drug) from the study in the EU Clinical Trials Register [21]. If this was not successful, the reviewer searched ClinicalTrials.gov [22], the World Health Organization International Clinical Trials Registry Platform (ICTRP) [23] and the International Standard Randomised Controlled Trial Number (ISRCTN) registry hosted by BioMed Central [24]. If information on the sponsor and study number was still lacking, the reviewer contacted the EMA.
Once the sponsor and the study number were identified, the reviewer contacted the sponsor to collect all of the following main study documents: (i) individual participant data (IPD); (ii) the data analysis plan; (iii) unpublished and/or published study protocols with any date-stamped amendments; (iv) all the following dates: date of the last visit of the last patient, date of database lock (if available) and date of study unblinding; and (v) unpublished and/or published (scientific article) study reports.
To this end, the reviewer sent a standardised email (Additional file 2: Letter 1) presenting the research project, with a link to the registered protocol on the Open Science Framework [15]. To improve the return rate, up to four emails were sent: the original and three reminders, at two-week intervals.
When asked, we indicated that raw data were welcome in the Study Data Tabulation Model (SDTM) format developed by the Clinical Data Interchange Standards Consortium (CDISC) [25].
In some cases, it was sufficient to contact the sponsor by email; in other cases, the sponsor asked us to retrieve the data from a data-sharing platform.
In parallel, the same reviewer searched for these documents on the EMA portal [26] and by inspecting any published reports identified using OpenTrials [27, 28]. This process is summarised in the web appendix (Additional file 3: Figure S1).
Data extraction
Main studies were identified, and the following trial characteristics were extracted from the EPARs on a standard data extraction form by two independent researchers (JG and FN). For each study, the following information was collected: patient characteristics (e.g. percentage of women, mean age of participants, paediatric indication), study methods (e.g. type of endpoint, description of each primary endpoint) and intervention characteristics (e.g. drug). An exhaustive list of the trial characteristics extracted can be found in the web appendix (Additional file 4: Table S2).
For the re-analyses, a first reviewer (JG) collected the information and collated the data. More specifically, the reviewer prepared a dossier with the following information for each study: (i) the protocol; (ii) all amendments to the protocol (with their dates); (iii) all the following dates: date of the last visit of the last patient, date of database lock (if available) and date of study unblinding; and (iv) the IPD. If information was still lacking, the study authors were contacted.
Strategy for re-analyses
If the IPD was not available 1 year after our initial request, we initially planned to consider the study as non-reproducible (primary outcome of our study). However, we allowed some deviations from this rule (in terms of delay) during the conduct of the study, since delays were in general longer than initially planned, including delays arising from the legal review on our side. We only considered studies as not reproducible when the data shared were insufficient to reproduce the primary endpoint.
Based on the dossier prepared by the first reviewer, re-analyses of the primary outcome(s) of each study were performed by a second reviewer (MS) who had no access to study reports, journal publications, the statistical analysis plan or the analytical code, to ensure that the re-analysis was as blind as possible to the original analysis. In addition, this reviewer was instructed not to seek out these documents or the published report.
For single-blind or open-label studies, analyses were performed according to the first version of the protocol, because outcome switching has been documented in such trials. For double-blind studies, all re-analyses were based on the latest version of the protocol issued before database lock and unblinding. If this information was not available, the date of the last visit of the last patient was used as a proxy.
Although statistical analysis in therapeutic research can be “routine”, in some cases the re-analyses involved difficult methodological choices. An independent senior statistician (AR) was available to discuss any difficult aspect or choice in the analysis plan before the re-analysis, so as to choose the most widely accepted analysis (e.g. an intention-to-treat population for a superiority trial).
If insufficient information concerning the main analysis was provided in the protocol, the best practices for clinical research were used, following the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH Guidelines) [29].
An analysis plan was developed for each study included and was recorded on the Open Science Framework. In the supplementary material, a table is provided with details of what was taken from the ICH guidelines in case of missing information (Additional file 5: Table S3).
Re-analyses entailed the following different steps: (i) identification of the primary outcome (and detection of outcome switching), (ii) definition of the study population, (iii) re-analysis of the primary outcome. Any change identified between the first version of the protocol and the version used for the re-analysis of the primary outcome was tracked and described.
Procedure to assess reproducibility
All results of these analyses were reported in terms of each study’s (i) conclusion (positive or negative), (ii) p-value, (iii) effect size (and details about the outcome) and (iv) changes from the initial protocol regarding the primary outcome. Regarding point (i), a non-inferiority trial was considered positive when it showed non-inferiority.
These results were first compared with the results of the analyses reported in the EPARs and, if these were not available, with the study reports and, failing that, with the publications. All results from all available documents (EPARs, study reports and publications) were gathered and presented in the results section.
Because interpreting an RCT involves clinical expertise and cannot be reduced to quantitative factors alone, an in-depth discussion between two researchers not involved in the re-analysis (JG and FN), based on both quantitative and qualitative (clinical judgement) factors, enabled a decision on whether the changes in results described quantitatively could materialise into a change in conclusions.
If these two reviewers judged that the conclusions were the same, the study results were considered as reproduced. If these two researchers judged that the conclusions were not the same, then the researcher in charge of the analysis (MS) was given the statistical analysis plan of the study and was asked to list the differences in terms of analysis. If he found a discrepancy between the study data analysis plan and his own analysis plan, then he corrected this discrepancy in his analysis (e.g. analysis population, use of covariates). Again, an in-depth discussion between two researchers not involved in the re-analysis (JG and FN) enabled a decision on whether the changes in results described quantitatively could materialise into a change in conclusions, and whether the differences in terms of analytical plan were understandable and acceptable. If these two researchers judged that the conclusions were the same, the study was considered as reproduced with verification.
If these two researchers judged that the conclusions were not the same or that the change in the analytical plan was neither justified nor desirable, a senior statistician performed his own re-analysis. Details on this step can be found in the protocol of the registered report [15]. This process is described in the web appendix (Additional file 6: Figure S2).
Outcomes
The primary outcome was the proportion of studies in which the conclusions were reproduced (yes/no; i.e. reproduced or reproduced with verification, as defined above). In case of divergence between two or more co-primary outcomes in the same study (i.e. one analysis reproduced but not the other(s)), the co-primary outcomes were described independently but the whole study was considered as not reproduced. All reasons for classifying studies as non-reproducible or not reproduced were described qualitatively using a taxonomy we developed during the research process.
In addition, we described the cases in which data-sharing required clarification, i.e. additional queries had to be sent to the authors to obtain the relevant information, to clarify labels or their use, or both, in order to reproduce the original analysis of the primary outcomes.
A catalogue of these queries was created, and similar clarifications were grouped for descriptive purposes to generate a list of common challenges and to help tackle them pre-emptively in future trials.
Concerning secondary outcomes, we described and compared the main outcomes, p-values and effect sizes from our re-analyses with those reported in the EPARs, the study reports and the publications, and we described any discrepancies. In addition, for each paper, we assessed the presence of the following key reporting biases: selective reporting of the primary outcome and “spin” [30].
In case of outcome switching, meaning that a secondary outcome was considered as a primary outcome in the final analysis, both endpoints were to be re-analysed.
To analyse “spin” in the results observed for the primary outcome, we took the definition provided by Yavchitz et al. who described it as being “a specific way of reporting, intentional or not, to highlight that the beneficial effect of the experimental treatment in terms of efficacy or safety is greater than that shown by the results” [31].
The modalities of data-sharing were described using the following categories: the type of data-sharing, the time needed to collect the data, the reason for non-availability of data, the deidentification of data (i.e. removal of the 18 identifiers specified by the Health Insurance Portability and Accountability Act) [32] and the type of data shared (here we distinguished “computerized data”, which are neither formal nor ordered; “cleaned data”, which are categorised and ordered; and “analyzable data”, which are ready for analysis) [33].
Data analysis
We performed a descriptive analysis of the characteristics of the main studies extracted from the selected EPARs. This included counts, percentages and their associated 95% confidence intervals (CIs).
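For instance, an exact (Clopper–Pearson) 95% CI for a percentage can be computed in R with binom.test; the snippet below is a minimal sketch with made-up numbers, and the choice of an exact method is our assumption, not confirmed by the text:

```r
# Illustrative only: exact (Clopper-Pearson) 95% CI for a percentage,
# e.g. 31 studies with a given characteristic out of 62 (made-up numbers).
res <- binom.test(x = 31, n = 62)
res$estimate  # proportion: 0.50
res$conf.int  # 95% CI: roughly 0.37 to 0.63
```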
Effect estimates in the different studies were expressed as standardised mean differences (SMDs) and their associated 95% CIs. For binary outcomes, odds ratios and their 95% CIs were calculated and converted into the standardised mean difference [34].
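One widely used conversion is Chinn's logit method, in which d = ln(OR) × √3/π; whether this is the exact method of reference [34] is an assumption on our part. A minimal R sketch:

```r
# Sketch of a common odds-ratio-to-SMD conversion (Chinn's logit method:
# d = ln(OR) * sqrt(3) / pi). Whether this matches the method cited in [34]
# is an assumption; values below are invented for illustration.
or_to_smd <- function(or, ci_low, ci_high) {
  k <- sqrt(3) / pi
  c(smd = log(or) * k, low = log(ci_low) * k, high = log(ci_high) * k)
}

or_to_smd(or = 1.8, ci_low = 1.2, ci_high = 2.7)
```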
To compare the results of our re-analyses with the original results, the following steps were implemented: (i) we compared statistical significance based on the p-value; if the significance status differed, the results were considered as not reproduced. If it did not differ, (ii) we qualitatively compared effect sizes and their respective 95% CIs. In case of a difference of 0.10 points or more in point estimates (expressed as standardised mean differences), the difference was discussed with a clinician in order to assess its clinical significance.
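The decision logic of these two steps can be sketched as follows (assuming a two-sided alpha of 0.05, which the text does not state explicitly; function and variable names are ours):

```r
# Minimal sketch of the two-step comparison (illustrative only).
# Step (i): compare significance status at an assumed alpha of 0.05.
# Step (ii): flag SMD differences of 0.10 points or more for clinical review.
compare_results <- function(p_orig, p_repl, smd_orig, smd_repl, alpha = 0.05) {
  if ((p_orig < alpha) != (p_repl < alpha)) {
    return("not reproduced (significance status differs)")
  }
  if (abs(smd_orig - smd_repl) >= 0.10) {
    return("discuss with a clinician (SMDs differ by >= 0.10 points)")
  }
  "consistent"
}

compare_results(p_orig = 0.03, p_repl = 0.04, smd_orig = 0.32, smd_repl = 0.35)
```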
All analyses were performed using the open-source statistical software R (R Development Core Team) [20] and SAS software.