This study was intended to provide the most comprehensive answer feasible as to whether UKCAT scores predict future performance in undergraduate medical training, and whether, and to what extent, the test adds value within the selection process. Our findings suggest that the test scores are significantly predictive of most aspects of undergraduate performance. Whilst these effects are not always independent of other, potentially confounding, factors, many of the associations with performance remained statistically significant after controlling for the influence of prior educational attainment. Thus, the test appears to add incremental value above and beyond that provided by actual or predicted A-level (or equivalent) grades. When predicting an overall pass at first sitting for a year at medical school, only the ‘quantitative reasoning’ score became statistically non-significant as a predictor once prior educational achievement was controlled for. This may be because such quantitative reasoning skills are already well tested in science-based advanced school qualifications, and thus this particular subscale adds little incremental predictive value in this respect.
Whilst the absolute ability of the UKCAT to predict medical school performance appears modest, the challenges of establishing the true ‘construct-level’ validity of such a selection test should not be underestimated. McManus et al. [19] elegantly outline these issues, which include the attenuating effects of restriction of range, imperfect test reliability and the homogeneity amongst candidates in terms of both predictors and outcomes. To address these problems, we first used a formula to ‘disattenuate’ the regression coefficients from our univariable analyses. Second, we took a novel approach, using data imputation, to simulate observing the missing outcomes of unselected UKCAT candidates. The findings suggest that, even when used as the sole selection tool, use of the UKCAT as a threshold for application decisions may reduce academic failure rates in medical school. Moreover, this is the first study, to our knowledge, to introduce a more pragmatic approach to understanding the potential practical implications of aptitude testing, via our NNR estimate.
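For reference, the disattenuation we applied (the ‘Thorndike II’ correction discussed further under the limitations below [34]) takes the standard Case II form for direct range restriction; it is usually stated for correlation coefficients, and is sketched here on the assumption that selection acted directly on the UKCAT score:

$$R = \frac{ru}{\sqrt{1 + r^{2}\left(u^{2} - 1\right)}}, \qquad u = \frac{S_{X}}{s_{x}},$$

where $r$ is the coefficient observed in the selected (range-restricted) sample, $S_{X}$ and $s_{x}$ are the standard deviations of the test score in the full applicant pool and in the selected sample respectively, and $R$ is the resulting estimate for the unrestricted applicant population.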
Our results are largely in line with previously published findings. Unsurprisingly (given that some participants are shared between the studies), our observations in relation to performance in year 1 of medical school are almost identical to those reported by the UKCAT12 study [13]. Our univariable findings are also broadly consistent with previous, local studies that observed some ability of the UKCAT scores to predict aspects of undergraduate performance into the later, clinical years of training [14, 15]. In contrast to the MCAT, however, the UKCAT scales at times seem to show increasing independent predictive validity as medical school education progresses into the final clinical years [17]. This is possibly because the MCAT has a substantial knowledge-testing component, which becomes less relevant to predicting academic outcomes as undergraduate medical education progresses, whereas the UKCAT does not evaluate semantic knowledge as such. The trend for predictive ability to persist or increase is particularly observed for the ability of the ‘quantitative reasoning’ and total scores to predict theory assessment performance once confounders are adjusted for. It may reflect the relative importance of cognitive ability over traditional educational attainment as the effects of previous schooling decay. In line with previous findings, we observed that better performance in medical school assessments was generally associated with female sex, older age at entry, attendance at a non-selective state school, White ethnicity, and better A-level (or equivalent) grades.
Strengths and potential limitations of the study
This is the first national study to assess the predictive validity of the UKCAT throughout the entirety of undergraduate medical education. The large number of participating universities and participants provides statistical power, as well as increasing the likelihood that the findings are generalisable across UK medical schools. Indeed, this is the first national study to investigate the predictive ability of the UKCAT into the clinical years of training whilst controlling for the effects of a number of potentially confounding factors.
Nevertheless, a number of potential limitations must be highlighted. First, in terms of the outcome measures, skills- and theory-based assessments were not operationally defined, and we therefore relied on the participating medical schools to categorise their evaluations accordingly. It is reasonable to assume that assessments categorised as ‘theory’ evaluated knowledge required by the undergraduate curricula. However, the nature of skills assessments may have varied to a relatively greater degree across medical schools, although these are likely to have included Objective Structured Clinical Examinations or similar. Overall, the relationship between skills performance and UKCAT scores was weaker than that for theory exams. It could be speculated that this relationship would have been even less marked had medical schools categorised as ‘skills’ only those summative evaluations with a strong focus on procedural knowledge and interpersonal functioning (e.g. observed role plays). Thus, it is possible that the association between skills performance and UKCAT scores was inflated by the inclusion, by some medical schools, of assessments that relied on traditional cognitive ability and semantic knowledge. Nevertheless, to some extent, the variability between medical schools (and across time) in the nature and standards of both theory and skills assessments would have been dealt with by the standardisation of the scores within both institutions and cohorts, as sketched below.
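The within-institution, within-cohort standardisation can be illustrated with a minimal sketch in Python (the software and variable names used in the study are not specified here, so both are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; the study's actual variable names are unknown.
df = pd.DataFrame({
    "school":   ["A", "A", "A", "B", "B", "B"],
    "cohort":   [2007, 2007, 2007, 2008, 2008, 2008],
    "raw_mark": [55.0, 62.0, 71.0, 48.0, 60.0, 66.0],
})

# Convert raw marks to z-scores (mean 0, SD 1) within each
# school-cohort cell, so that theory and skills scores remain
# comparable despite differing assessment formats and standards.
df["z_mark"] = (
    df.groupby(["school", "cohort"])["raw_mark"]
      .transform(lambda s: (s - s.mean()) / s.std())
)
print(df)
```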
It should also be noted that, in this study, the most recent UKCAT scores were used as the primary predictor. These may not have been the best metric of ability, but using them eliminated ‘practice runs’ and ensured that the analysis rested on the scores on which admission decisions are actually made.
The number of participating universities varied from year to year, and the missingness related to this was probably due to chance (hence missing completely at random), as it appears to have been mainly due to medical schools failing to return outcome results. Further, it should be noted that this was not a classical cohort study, as subsequent years were not a subset of the original entry cohorts (although we provide the values for this ‘conventional’ attrition rate in Table 1); whether participants joined or left the study depended mainly on whether their medical school participated in that specific year. Sensitivity analyses were conducted to evaluate the potential effects of missing data on the results. We observed that the results from imputed and non-imputed datasets differed for later years and, therefore, some caution must be exercised when making inferences. However, methodological research supports the use of multiple imputation through chained equations where the pattern of missingness is arbitrary. Therefore, unless a substantial portion of the missing data was non-ignorable, the results from the imputed datasets should be relatively trustworthy [25], and where the results differ it may be those from the imputed datasets that are the more reliable. Some uncertainty must also be accepted about the ‘construct-level’ validity of the UKCAT due to the attenuation effects apparent in selection tests [19]. As mentioned above, we were able to correct crudely for this using the ‘Thorndike II’ method [34]. However, this approach assumes direct range restriction only (i.e. that selection was based solely on the UKCAT scores), which is not the case in reality. Moreover, our attempt to estimate the NNR value for the UKCAT as a screener using single imputation rests on the (strong) assumption that unobserved outcomes relate to the observed predictors in the same way as observed outcomes. Nevertheless, we consider this exploratory analysis important in beginning to understand the practical implications of using the UKCAT within the context of medical selection, where candidate variance is low, poor academic outcomes are uncommon, but the competition ratio is high.
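To illustrate the chained-equations approach referred to above, a minimal sketch using the MICE implementation in statsmodels is given below (the choice of library, the linear model form and all variable names are ours, for illustration only; the study’s actual software and dataset are not described here):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

# Simulated toy data standing in for the (non-public) study dataset.
rng = np.random.default_rng(0)
n = 200
ukcat = rng.normal(2500.0, 250.0, n)     # hypothetical total scores
alevel = rng.normal(0.0, 1.0, n)         # hypothetical attainment score
outcome = 0.002 * ukcat + 0.5 * alevel + rng.normal(0.0, 1.0, n)
df = pd.DataFrame({"outcome": outcome, "ukcat": ukcat, "alevel": alevel})
df.loc[rng.random(n) < 0.3, "outcome"] = np.nan  # arbitrary missingness

# Impute via chained equations, fit the analysis model on each
# completed dataset, and pool the estimates across imputations.
imp = mice.MICEData(df)
model = mice.MICE("outcome ~ ukcat + alevel", sm.OLS, imp)
results = model.fit(n_burnin=10, n_imputations=20)
print(results.summary())
```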
Attempts have been made to equate the UKCAT scores in order to ensure that results are comparable across time [35]. However, significant shifts in scores over time suggest that test equating has not been entirely achieved, possibly due to differences in the actual performance of subsequent cohorts (e.g. later cohorts had access to more practice opportunities and materials). Thus, the properties of the UKCAT may have changed to some extent over time, and it is not clear to what extent our findings apply to subsequent cohorts.
Implications for practice and policy
Previous research suggests that universities that use the UKCAT scores as a threshold for interview or place offer may reduce the level of disadvantage faced by certain under-represented groups of applicants, compared with those using the test in a different mode [7]. Moreover, the UKCAT may be less sensitive to the type of school attended (e.g. selective versus non-selective) than school-leaving qualifications or predicted grades [8]. This is especially important given emerging evidence that the overall performance of a candidate’s secondary school may be inversely related to an individual’s later achievement in higher education, including in medical school [13]. In our study, the total score appeared to be the element of the UKCAT that best predicted a future entrant’s performance, as it reflects performance on all the constituent scales. Our findings thus support universities wishing to widen participation in using the UKCAT as a relatively strong component of the selection process, as it has some ability to predict academic performance without further disadvantaging candidates from certain under-represented groups. Moreover, in contrast to the UKCAT, the Biomedical Admissions Test, used in medical selection by some universities, has been unable to demonstrate any incremental predictive validity over and above conventional measures of knowledge or educational attainment [36]. Institutions should bear this in mind when designing selection processes.
However, despite having four subscales, the UKCAT may best be conceptualised as testing two main dimensions of cognitive functioning, namely verbal and non-verbal reasoning [10]. Thus, the total score (comprising three non-verbal scale scores and only one verbal scale score) may place too much emphasis on non-verbal performance. Rescoring so that an average of the non-verbal scales is combined with the verbal reasoning score may be a fairer way to obtain a more balanced metric of ability.
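One plausible form of such a rescoring (the specific weighting here is ours, for illustration) would be

$$T_{\text{balanced}} = \frac{1}{2}\left(VR + \frac{QR + AR + DA}{3}\right),$$

where $VR$, $QR$, $AR$ and $DA$ denote the verbal reasoning, quantitative reasoning, abstract reasoning and decision analysis scale scores. Verbal and non-verbal performance would then contribute equally, rather than in the 1:3 ratio implicit in the simple total.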
Although the magnitude of the effect of UKCAT scores on performance was relatively small, our estimates of NNR suggest considerable practical utility of the UKCAT as a tool for screening out candidates more likely to require at least one resit at medical school. The NNR value of 1.18 (when a cut-off representing the average UKCAT total score for applicants was used), though derived relatively crudely, suggests that, in a highly competitive selection process, the use of the UKCAT as a ‘screening tool’ for subsequent academic performance may be acceptable. We also reported the likely impact of adopting relatively high or low UKCAT thresholds when using the test in this manner. Indeed, across medical schools and time, a variety of cut-points have been applied by universities in the UKCAT consortium that use the test scores in this way, mainly to guide the decision about whether to invite a candidate for interview. The median threshold scores used by institutions have tended to rise over time, and tend to sit slightly above the average score obtained by applicants sitting the test [20]. Our findings suggest that higher thresholds may further reduce the risk of future adverse academic outcomes in students, but at the cost of rejecting a higher proportion of candidates who would probably have done well. Conversely, lower thresholds reduce the risk of rejecting this latter group of candidates, but increase the risk of admitting applicants at higher risk of later academic problems. Thus, the choice of threshold is a subjective one, to be decided by medical school admissions teams. No doubt the competition ratio that a specific university encounters will play a role in such decision making; those with fiercer competition for places may be tempted to set a higher threshold, the opportunity of identifying candidates less likely to do well academically appearing to offset the attendant risk of rejecting acceptable candidates. In the context of medical school applications, where the competition ratio at individual medical schools is approximately 11:1, an NNR of roughly one may therefore be acceptable to admissions teams (though possibly not to candidates), especially given the direct and indirect costs of resits and failures to progress. However, it should be highlighted that selection tests such as the UKCAT are not intended to be used in isolation but in conjunction with other selection criteria. Thus, the estimated NNR in this case reflects only the effectiveness of the UKCAT when used alone, rather than in combination with other selection approaches. The use of other selection criteria (such as performance in multiple mini-interviews) might further reduce the NNR to a value reflecting greater utility in selection. Further, there may be genuine uncertainty over the eventual predictive validity of certain selection tests, and an approach based on Bayesian principles may therefore be useful. Bayes’ theorem allows us to increase the accuracy of our probabilistic predictions by conditioning new observations on previous data or knowledge, while also adjusting for uncertainty about how applicable that previous knowledge is to the current issue. Thus, a Bayesian framework may eventually allow us to estimate the impact of combining a variety of selection tests in the admissions process, even allowing for our uncertainty regarding predictive validity.
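As a worked illustration of the NNR reported above (under the assumption, which is our reading rather than a formal definition taken from the methods, that the NNR behaves analogously to the number needed to treat): if $p_{1}$ is the risk of requiring at least one resit among candidates below the threshold, and $p_{0}$ the corresponding risk among those above it, then

$$\text{NNR} = \frac{1}{p_{1} - p_{0}},$$

so an NNR of 1.18 would imply an absolute risk difference of approximately $1/1.18 \approx 0.85$; that is, on this reading, the large majority of candidates rejected at that threshold would otherwise have gone on to require at least one resit.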
Given the high stakes involved in deciding how to allocate medical school places, an alternative to ‘front-loading’ the selection process would be to admit a larger number of students but apply a lower threshold for failing them after evaluating their performance during the first year. This, however, could also prove costly, given the investment made in educating each student during the initial year of undergraduate study. There would also be costs associated with employing poorly motivated doctors at risk of low morale and burnout [37].