We now consider some less standard approaches to bringing the sample size requirements closer to the numbers it is feasible to recruit in a reasonable time frame.
Step 3: Relaxing α by a small amount, beyond traditional values
The much-criticised 5 % significance level is widely used in applied scientific research, but it is an arbitrary figure. It is extremely rare for clinical trials to use any other level. It may be argued that this convention has been adopted as a compromise between the risk of erroneously concluding that a new treatment is more efficacious and the practicality of undertaking a trial of an achievable size and length. Settings where traditionally sized trials are not possible may be just the area where researchers start to break this convention, for good reason.
In considering the type I error, it is critical to consider the question: ‘What are the consequences of erroneously deciding to use a new treatment routinely if it is truly not better?’
Taking the societal perspective as before, we might consider the probability of making a type I error, thus erroneously burdening patients with treatments that do not improve outcomes, or even worsen them, while potentially imposing unnecessary toxicity.
First, for conditions where there are only enough patients available to run one modestly sized randomised trial in a reasonable time frame, research progress will be relatively slow, and making a type I error may be less of a concern than a type II error. In contrast, making several type I errors in a common disease could lead in practice to patients taking several ineffective treatments; for a disease area where only one trial can run at any given time, the overall burden on patients is potentially taking a single ineffective treatment.
Thus, if we take the societal perspective with the trials in Table 1 then, if each trial were analysed with α=0.05 and we saw (hypothetically) 40 % positive results [8], the expected number of false positive trials is as given in the final column. We also assumed 10 % and 70 % positive results, with qualitatively similar conclusions.
It is worth viewing this as a consideration of the joint type I error rate of these trials. If there are t trials published claiming a positive result, each specifying α=0.05, then the chance that at least one type I error has been made equals 1 − (1 − 0.05)^t. If t=1, as we are considering, the type I error rate equals 5 %. This increases to 14.3 % if three trials return a positive result. From a societal perspective, such an error rate may be more important than the 5 % levels specified in individual trials, a rarely acknowledged consideration.
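As a minimal illustration of this calculation (assuming the t trials are independent and all null hypotheses are true), the joint error rate 1 − (1 − α)^t can be tabulated for a few values of t; the figures for t = 1 and t = 3 match those quoted above.

```python
# Minimal sketch: joint (family-wise) type I error rate across t independent
# trials, each analysed at a per-trial significance level alpha, assuming all
# null hypotheses are true. Reproduces the 5 % (t = 1) and 14.3 % (t = 3)
# figures quoted in the text.

def joint_type_i_error(t: int, alpha: float = 0.05) -> float:
    """Chance that at least one of t independent truly-null trials is falsely positive."""
    return 1 - (1 - alpha) ** t

for t in (1, 2, 3, 5):
    print(f"t = {t}: joint type I error = {joint_type_i_error(t):.3f}")
# t = 1: 0.050, t = 2: 0.098, t = 3: 0.143, t = 5: 0.226
```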
When a new intervention has some toxicity, this argument requires even greater consideration. Assume the intervention is in truth not better (possibly worse) than the control arm and returns a high but tolerable level of toxicity. If it is falsely judged to be superior in a trial (i.e. a type I error is made), there are implications for future research. Patients will already be experiencing some toxicity, narrowing the path for future treatment options, particularly if a future RCT is one of adding a treatment rather than substituting: any further toxicity may make the total toxicity unacceptable. In this case, significance levels (and target differences) should be chosen with a clear consideration of the likely toxicity.
We note recent work that highlights the importance of considering the long-term aims of research in context [9]. Rather than simply setting error rates for a single trial, one might consider a long-term horizon and the aims to be met by that point. For example, running several smaller trials with relaxed α levels may lead to improved expected survival in the long term compared with fewer, larger trials with more stringent α, though more type I errors will be made (see [9] for details).
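The following toy simulation sketches this trade-off under purely illustrative assumptions (a fixed recruitment horizon, a continuous outcome, and a fixed proportion of truly effective candidate treatments); it is not the model developed in [9], but it shows how a programme of smaller trials at a relaxed one-sided α can accumulate more expected benefit over the horizon than fewer, larger trials at a conventional α.

```python
# Toy simulation (not the model of [9]): over a fixed recruitment horizon,
# compare a programme of many small trials at a relaxed one-sided alpha with
# a programme of fewer, larger trials at a conventional one-sided alpha.
# All ingredients below (effect-size distribution, outcome SD, horizon) are
# illustrative assumptions, not figures from the trials discussed in the text.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def programme_gain(n_per_arm, alpha, horizon=2000, sigma=1.0,
                   p_effective=0.4, true_benefit=0.3, n_sim=5000):
    """Mean cumulative improvement in the standard of care by the end of the horizon."""
    n_trials = horizon // (2 * n_per_arm)          # trials that fit in the horizon
    z_crit = norm.ppf(1 - alpha)
    gains = np.zeros(n_sim)
    for s in range(n_sim):
        cumulative = 0.0
        for _ in range(n_trials):
            # candidate is truly effective with probability p_effective
            delta = true_benefit if rng.random() < p_effective else 0.0
            se = sigma * np.sqrt(2.0 / n_per_arm)   # SE of a difference in means
            z = rng.normal(delta, se) / se          # observed test statistic
            if z > z_crit:                          # "positive" trial: adopt new treatment
                cumulative += delta                 # false positives add nothing here
        gains[s] = cumulative
    return gains.mean()

print("Few large trials, alpha=0.025:", programme_gain(250, 0.025))
print("Many small trials, alpha=0.10:", programme_gain(100, 0.10))
```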
Moving from two- to one-sided significance tests
Two-sided tests look for a difference between groups, but are technically agnostic about the direction of this difference. So-called superiority randomised trials aim to show that one treatment is superior to another on major disease outcomes. Two-sided testing is a ritual that involves careful neglect of the substantive hypothesis and it could be argued that it should be abandoned in most superiority RCTs [10].
In a trial looking to detect superiority of one treatment over another, a two-sided hypothesis test says we will reject H0: difference = zero if the new treatment is better or worse. However, if a two-sided test returns a p value of 0.0001 and the new treatment is worse, the decision would be not to use that treatment. The same decision would be made if the p value were 1 and the difference between treatments were 0. There is thus a disconnect between the statistical hypothesis test and its operational interpretation. Note that a trial that finishes with a highly statistically significant result against the research arm is wasteful and harmful; pre-planned interim analyses should have been used to stop early and to focus the limited resources on a randomised trial testing something that might make a difference. Operationally, for superiority trials, both the hypothesis we are primarily interested in and its interpretation are very one-sided (‘harm’ and ‘no effect’ lead to one decision, while ‘benefit’ leads to another). Researchers using a nominally two-sided, 5 % significance level are effectively using a one-sided, 2.5 % significance level. To improve efficiency, the statistical design could better reflect this behaviour and employ one-sided testing procedures and intervals.
In many sequential or adaptive designs, it is already common to design and analyse with one-sided significance levels because decisions may otherwise be nonsensical, for example in designs that aim to stop for futility [11].
Note that this argument is not related to the societal perspective adopted in arguing for high power and higher type I error rates than conventionally used.
Figure 2 shows how the target number of patients for EURAMOS-1 depends on the sidedness of tests and on the significance level chosen, with all other design parameters held constant.
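To make the relationship concrete, the sketch below applies Schoenfeld's approximation for the number of events required in a 1:1 two-arm time-to-event trial across a few choices of α and sidedness; the target hazard ratio and power are illustrative assumptions rather than the EURAMOS-1 design parameters.

```python
# Sketch of how required numbers scale with alpha and test sidedness, using
# Schoenfeld's approximation for the number of events in a 1:1 two-arm
# time-to-event trial. The target hazard ratio and power are illustrative
# assumptions, not the EURAMOS-1 design parameters.
import numpy as np
from scipy.stats import norm

def required_events(hr, power=0.80, alpha=0.05, two_sided=True):
    """Schoenfeld approximation: events needed to detect hazard ratio `hr`."""
    a = alpha / 2 if two_sided else alpha
    return 4 * (norm.ppf(1 - a) + norm.ppf(power)) ** 2 / np.log(hr) ** 2

for alpha in (0.05, 0.10, 0.20):
    for two_sided in (True, False):
        d = required_events(hr=0.75, alpha=alpha, two_sided=two_sided)
        print(f"alpha={alpha:.2f}, {'two' if two_sided else 'one'}-sided: "
              f"{d:.0f} events")
```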
Including covariate information in design
Covariates are patient characteristics measured at baseline. In observational studies, adjusting for covariates can reduce bias due to confounding. In randomised trials, accounting for covariates has a different aim: to increase precision, and thus gain power [12, 13].
Adjusting the treatment effect estimate for covariates that affect the outcome measure has been shown to lead to substantial increases in power [14]; adjusting for covariates that do not affect the outcome measure leads to a loss of power, but this loss is very small [14]. There are several alternative methods of accounting for covariates [13, 15, 16].
When computing sample size requirements, it is possible in principle to allow for covariates. For continuous outcome measures, this may be done by reducing the standard deviation in the sample size calculation (because covariate effects will explain away some of the variation). However, it is not clear how best to approach this for categorical or time-to-event outcome measures, and this is an area worthy of methodological research.
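For the continuous case, a minimal sketch of this idea is given below: if baseline covariates are expected to explain a proportion R² of the outcome variance, the residual standard deviation entering the usual two-sample formula shrinks by √(1 − R²). The target difference, standard deviation and R² used here are illustrative assumptions.

```python
# Minimal sketch for a continuous outcome: if baseline covariates are expected
# to explain a proportion R^2 of the outcome variance, the residual SD used in
# the standard two-sample sample-size formula shrinks by sqrt(1 - R^2),
# deflating the required sample size by roughly (1 - R^2). The target
# difference, SD and R^2 below are illustrative assumptions.
from scipy.stats import norm

def n_per_arm(delta, sd, alpha=0.05, power=0.80, two_sided=True, r2=0.0):
    """Normal-approximation sample size per arm, optionally allowing for covariates."""
    a = alpha / 2 if two_sided else alpha
    sd_adj = sd * (1 - r2) ** 0.5                  # residual SD after adjustment
    return 2 * ((norm.ppf(1 - a) + norm.ppf(power)) * sd_adj / delta) ** 2

print("Unadjusted:        ", round(n_per_arm(delta=5, sd=15)))            # ~141 per arm
print("Adjusted, R^2=0.3: ", round(n_per_arm(delta=5, sd=15, r2=0.3)))    # ~99 per arm
```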
Previous work based on 12 different outcomes taken from eight studies demonstrated that the covariate effects seen in real trials could increase power from 80 % without covariate adjustment to between 81 and 99 % with planned covariate adjustment [14]. Without formally incorporating covariate effects in the sample size calculations, planning to adjust the analysis may be viewed as a method of reclaiming some power. Relative to the power assumed in the sample size calculations, we might expect the gain to be in the region of 5 % (and hope it is more). This may be a way of strengthening the design if power has been relaxed further than we would wish.
Re-randomising patients
Historically, the only context in which patients are permitted to participate in the same trial more than once is in crossover designs, which involve patients being randomly assigned to a sequence of treatments and having outcomes measured after each period, with some or all patients receiving different treatments in different periods (crossing over) [17]. A predefined number of treatment periods is set out for each patient.
However, patients who have completed their predefined follow-up from a previous randomisation in the trial and who continue to meet the trial’s eligibility criteria can be re-randomised an arbitrary number of times, and this can still result in valid statistical inference about treatment effectiveness [18]. Unlike crossover trials, patients do not have a predefined number of treatment periods and the treatment assignments in the sequence do not depend on previous or subsequent assignments.
The design will be suitable for diseases for which treatments are given repeatedly and follow-up is not long term. For example, it has been used for the treatment of febrile neutropenia and sickle-cell crises. It will be unsuitable in some settings: where long-term follow-up is required (particularly for economic evaluations), where the effectiveness of treatment depends on whether and how much it has previously been received (such as where the intervention is educational) or where a period of treatment and follow-up would mean patients are no longer eligible (for example where the primary outcome measure represents a move into a different disease state). Re-randomisation would be unsuitable for treatment of cancers when prolonged follow-up is required or when a procedure can only occur once, such as appendicectomy.
When patients require regular repeated treatments and outcomes are relatively short term, re-randomisation may inject extra numbers without having to compromise on other aspects of the design. If the majority of patients are randomised on multiple occasions then the analysis can be based on within-patient comparisons, potentially gaining much efficiency [18].
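The toy simulation below illustrates the potential efficiency gain under purely illustrative assumptions (a continuous short-term outcome, three episodes per patient, equal between- and within-patient variability); it is not the analysis proposed in [18], but it shows how a within-patient comparison that uses every episode estimates the treatment effect much more precisely than an analysis of first randomisations alone.

```python
# Toy simulation (illustrative assumptions throughout, not the analysis of [18]):
# when most patients are randomised on several occasions, a within-patient
# (fixed-effects) comparison removes between-patient variability and uses every
# episode, so the treatment effect is estimated far more precisely than from
# first randomisations alone.
import numpy as np

rng = np.random.default_rng(7)
n_pat, episodes, tau, sd_between, sd_within = 100, 3, 0.5, 1.0, 1.0

first_only, within = [], []
for _ in range(2000):
    b = rng.normal(0, sd_between, n_pat)                       # patient effects
    treat = rng.integers(0, 2, (n_pat, episodes))              # fresh randomisation each episode
    y = b[:, None] + tau * treat + rng.normal(0, sd_within, (n_pat, episodes))

    # (a) difference in means using each patient's first episode only
    t0, y0 = treat[:, 0], y[:, 0]
    first_only.append(y0[t0 == 1].mean() - y0[t0 == 0].mean())

    # (b) within-patient estimator: demean treatment and outcome per patient
    t_dm = treat - treat.mean(axis=1, keepdims=True)
    y_dm = y - y.mean(axis=1, keepdims=True)
    within.append((t_dm * y_dm).sum() / (t_dm ** 2).sum())

print("Empirical SE, first episodes only:", np.std(first_only).round(3))
print("Empirical SE, within-patient:     ", np.std(within).round(3))
```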
Using external information
A treatment that works in one category of a broadly defined disease may work in related categories. That is, it is plausible that a treatment that works well in one specific disease category would have similar effects in another, even if the effect is unlikely to be exactly the same. Chemoradiotherapy is effective in three squamous cell carcinomas: head and neck, cervical and anal. It is therefore plausible that it would be effective in penile and vulval cancer, which are also squamous cell cancers but have far smaller patient populations. This may bolster the choice to relax α: there is a precedent for the treatment working in closely related conditions, and it is unlikely that the positive results in head and neck, cervical and anal cancer were all false positives.
The notion of borrowing external information is particularly relevant when considering adverse events that are rare or only appear in the long term. Trials are rarely sized to specifically assess adverse events, but it is critical that they are considered. If a treatment is indicated for other conditions then its adverse effects may already be reasonably well characterised, unless there are expected interactions with this specific patient group or with another treatment with which the treatment under scrutiny has not previously been combined. In such a setting, a trial may be regarded as verifying the adverse-effect profile of the treatment rather than demonstrating it for the first time.
External information on covariate effects can be particularly useful if the sample size will be calculated allowing for covariate effects.
Using external information does not alter the research question; rather, it changes the information that will be brought to bear on that question. We do not aim here to prescribe how external information should be borrowed. However, Bayesian approaches lend themselves naturally to this problem and have been well explored [19]; formal frequentist approaches have been less well explored.
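As one hedged example of what borrowing might look like (not a recommendation), the sketch below combines a new trial's treatment-effect estimate with a down-weighted external estimate in a simple normal-normal model, in the spirit of a power prior; all numbers are illustrative.

```python
# Hedged sketch of one way external information could be borrowed: a
# normal-normal model in which an external treatment-effect estimate is
# down-weighted (a power-prior-style discount) before being combined with the
# new trial's estimate. All numbers are illustrative assumptions; this is not
# a prescription for how borrowing should be done.
import numpy as np

def posterior(effect_new, se_new, effect_ext, se_ext, discount):
    """Precision-weighted combination; discount in [0, 1] scales the external weight."""
    prec_new = 1.0 / se_new ** 2
    prec_ext = discount / se_ext ** 2          # discount = 0: ignore external data
    mean = (prec_new * effect_new + prec_ext * effect_ext) / (prec_new + prec_ext)
    sd = np.sqrt(1.0 / (prec_new + prec_ext))
    return mean, sd

# Small new trial (log hazard ratio scale) plus a more precise external estimate
for w in (0.0, 0.25, 0.5, 1.0):
    m, s = posterior(effect_new=-0.20, se_new=0.25,
                     effect_ext=-0.35, se_ext=0.10, discount=w)
    print(f"discount={w:.2f}: posterior mean {m:.3f}, sd {s:.3f}")
```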