Any measure, be it physical, biological or behavioral, has errors due to unreliability. The measures used in medical student selection also suffer from range restriction, and in addition, as Figures 3 and 4 show, many of the educational measures show right-censorship, typically due to grade inflation, with many candidates being at the ceiling. In consequence, selection measures such as A-level grades often seem to show very small correlations with outcome measures, which typically assess medical school examination performance. A typical predictor-outcome correlation in the present study is .171, with the implication that only studies with nearly 300 students would have a 90% chance of finding a significant correlation between a typical predictor and a typical outcome. Such small correlations, particularly if non-significant, are often erroneously treated as meaning that selection variables are ineffective or of no consequence.
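
The sample-size implication can be checked with the standard Fisher z approximation for the power of a test of a correlation. The sketch below is illustrative only: the text does not state the significance level or whether the test is one- or two-tailed, so α = .05 is an assumption and both variants are computed; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_for_correlation(r, power=0.90, alpha=0.05, two_tailed=False):
    """Approximate n needed to detect a population correlation r.

    Fisher z approximation: n = ((z_alpha + z_power) / atanh(r))**2 + 3.
    alpha and the choice of tails are assumptions, not values from the text.
    """
    norm = NormalDist()
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_alpha = norm.inv_cdf(1 - tail_alpha)
    z_power = norm.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


n_one_tailed = n_for_correlation(0.171)
n_two_tailed = n_for_correlation(0.171, two_tailed=True)
print(n_one_tailed, n_two_tailed)
```

Under these assumptions the one-tailed requirement comes out a little under 300, in line with the figure quoted above, while a two-tailed test would need roughly 355 students.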

Actual predictor-outcome correlations are often far smaller than construct-level predictive validities (true-score correlations). That difference matters because, as Hunter and Schmidt [52] have emphasized, “what we are interested in scientifically is the construct-level correlation” (p.16). Rubin [53] has emphasized that “we really care about the underlying scientific process that is generating [the] outcomes that we happen to see - that we, as fallible researchers, are trying to glimpse through the opaque window of imperfect empirical studies” [53] (p.157).

In a perfect world there would be perfect measures of academic performance at medical school, perfect measures of educational attainment and intellectual aptitude in applicants to medical school, and entrants to medical school would be a random sample of those applying. Given that, it would be straightforward to determine how well selection measures work, and whether the measures in use are sufficient or whether others, assessing other characteristics or traits, are also needed.

Construct-level predictive validities estimate the correlations that would pertain in a world permitting perfectly accurate and complete measurement, and in so doing make several things possible. First, predictors can be compared with one another without reliabilities and range restriction confounding the differences. Second, construct-level predictive validities also provide a perspective on the limits of what current measures could, in principle, do if they were not subject to measurement error or other problems. That is central to the difficult question of whether current measures should be refined, replaced or supplemented by other measures. Finally, because they attempt to consider perfect measures, construct-level predictive validities also throw into sharp relief the theoretical imperfection of even the best measures that we might have, showing their flaws and their conceptual failings. The end result is an assessment of what the measures can in principle do.
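
As a minimal illustration of one ingredient of such estimates, the classical correction for attenuation divides an observed correlation by the square root of the product of the two reliabilities. The sketch below shows only that step. It is not the full procedure used in this study, which also has to deal with range restriction and right-censorship, and the function name is hypothetical. The example values are the reliabilities and the attenuated A-levels-finals correlation quoted later in this discussion.

```python
import math


def disattenuate(r_observed, rel_predictor, rel_outcome):
    """Classical correction for attenuation (unreliability only)."""
    return r_observed / math.sqrt(rel_predictor * rel_outcome)


# Reliabilities quoted in the text: A-levels .867, finals .905.
rho = disattenuate(0.553, rel_predictor=0.867, rel_outcome=0.905)
print(round(rho, 3))
```

The result is close to the construct-level value of .625 reported for finals, the small difference reflecting rounding of the inputs.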

### Comparing predictors

Comparing the main predictors, particularly for undergraduate examinations, it is clear that A-levels are the best predictor (.723; CI: .616 to .803), followed by GCSEs/O-levels (.359; CI: .255 to .455), with intellectual aptitude tests predicting much less well, albeit significantly different from zero (.181; CI: .055 to .302). Other predictors are mostly present only in the UKCAT-12 study and, hence, it is more difficult to generalize about them. However, it does appear that SQA qualifications have a lower construct-level predictive validity than GCE qualifications, with Highers having a very low validity. The lower construct-level predictive validity of SQA qualifications is important because a simple comparison of predictor-outcome correlations suggests that SQA examinations perform better than GCE examinations [21, 27]. That the construct-level predictive validities are the other way around is a result of SQAs having higher reliabilities and higher selection ratios (see Table 1), which results in relatively lower construct-level predictive validities^{a}. The two composite measures of EducationalAttainmentGCE and EducationalAttainmentSQA, despite having higher correlations with medical school outcome than their component scores, had similar construct-level predictive validities to A-levels and Advanced Highers and are, therefore, probably not providing additional information over the simpler measures concerning construct-level predictive validity, although they may be better for those wishing to predict performance within medical school rather than for selection purposes.

### Predicting first year BMS examinations

In many ways the most important outcome in terms of medical student selection is performance in basic medical sciences examinations in the first year, as the end of the first year is mostly when failing medical students either have to leave the course or are required to repeat a year. Predicting first-year performance is, therefore, particularly important. The meta-regression contained three relevant construct-level predictive validities, and the meta-analytic estimate for A-levels of .809 (CI: .501 to .935) is high, and is higher than for GCSEs/O-levels (.332; CI: .024 to .583) and for the sole aptitude test, UKCAT (.245; CI: .207 to .276).

### The Academic Backbone

Educational qualifications predict performance better in assessments earlier in training rather than later. That is hardly surprising, and to some extent reflects what we have elsewhere called the Academic Backbone [4], performance at each stage being built upon performance at previous stages. If educational qualifications predict, say, MRCP(UK) less well than they predict finals, that is in part because finals themselves are part of the prediction of performance at MRCP(UK). Likewise, GCSEs may not predict outcomes well, but they are good at predicting A-levels, which is perhaps their main role [54].

### How much can A-levels predict?

Using the meta-analytic first year BMS construct-level predictive validity estimate of .809, then 65% of the total, true variance in first year examination performance is accounted for by A-level performance, which clearly makes A-levels an important part of medical student selection. The estimate of .809 may itself be an under-estimate, in part because, as shown elsewhere [27], the measure we have called “EducationalAttainmentGCE” predicts outcome better than A-levels alone. That may be because A-levels are not always of equivalent difficulty [55], and better students may choose to take harder A-levels. The measure also includes General Studies which, contrary to popular belief, seems to be a separate and independent predictor of medical school performance [21]. Considering just A-levels, for which 65% of first year exam variance seems to be explained, the important corollary is that 35% of first year performance must be explained by something other than A-levels. Most of that 35% is unlikely to be assessed directly or indirectly by GCSEs or aptitude tests since both of those measures have little incremental validity over A-levels [21]. The most likely origin is in personality, motivation or other individual difference factors, although part of the explanation may also lie in the random, unpredictable events that occur in everyday life, including problems with peers, money, relationships, family or whatever, that are inherently unpredictable but can impact substantially on medical school performance, particularly in students who may recently have left home for the first time. Many such events cannot be predicted when selection takes place and, hence, any variance due to them cannot be taken into account by educational attainment or its correlates. 
Similar events which have happened before A-levels and selection could also be involved, lowering attained A-level grades, and when the impact of those events subsequently diminishes then students over-perform relative to what their A-levels might seem to have predicted. Whatever the nature of the missing variance, a major challenge has to be identifying the causes or the correlates of that additional variance, as it might account for a quarter or a third of the variance in first year medical school performance. In addition, because impacts on first year performance can subsequently be multiplied through the Academic Backbone with the accumulation of ‘medical capital’ [4], so small over- or under-achievements early in a career can potentially multiply as the medical course continues.

### The stability of construct-level predictive validity of educational achievement measures in the cohorts

The present studies took place in six cohorts of students who entered medical school from 1972 through to 2009. A remarkable finding is that all of the qualifications, be they A-levels, GCSEs/O-levels or aptitude tests, seem to predict at the same level across the entire temporal range of the cohorts. It might have been thought that changes in the nature of examinations such as A-levels, which have become less heavy on facts in recent years, might have altered their construct-level predictive validity. Medical school courses and assessments have also become less fact heavy, with assessments now including OSCEs and other assessments of practical skills, communicative ability and so on, but despite that the predictive validity of the various qualifications seems to have remained equivalent.

### The role of GCSEs/O-levels

A recurrent theme in student selection is that GCSEs or O-levels may be better predictors of outcome than A-levels. As long ago as a GMC conference in 1973 it was reported that, “performance in the Second MB examination correlated better with GCE O level than with A level results” (p.7), with speculation that, “the O level correlation with future performance might be more accurate than the A level results, because at the latter stage the ‘heat was turned on’ for University entrance. [As a result] the A level results were based on factual knowledge and did not necessarily depend on greater intellectual capacity” [10] (pp. 7–8). The current meta-analysis provides no support for that argument in the undergraduate course, but it is striking that A-levels, like GCSEs/O-levels and aptitude tests, have similar construct-level predictive validities in both undergraduate and postgraduate assessments. Elsewhere we have noticed hints that GCSEs/O-levels may have additional incremental value for predicting finals after taking A-levels and BMS performance into account [4], with the possibility that they are assessing something separate from the academic skills assessed in A-levels.

### Aptitude tests as predictors

The two tests of intellectual aptitude, UKCAT and AH5, predict undergraduate and postgraduate performance to similar extents with an overall construct-level predictive validity for undergraduate performance of .181, which is relatively low and is appreciably lower than for A-levels (.723) and GCSEs/O-levels (.359). In addition the incremental validities for AH5 [3] and UKCAT [21] are small once A-levels have been taken into account. UKCAT and similar tests may have some role to play in selection when there is strong range restriction on A-levels and other attainment tests, although the Sutton Trust reported that the SAT Reasoning test did not differentiate outcome in high-achieving university entrants with AAA grades [56] (pp.37-38). The UKCAT consortium is also currently piloting non-cognitive tests which may have additional predictive ability.

### What is the medical school applicant pool?

Our analyses have taken the pool of medical school applicants as being those who chose to apply, many of whom eventually attain quite low A-levels and other grades. Applying to medical school though is a choice, and there is no reason why candidates with substantially lower grades might not also choose to apply, particularly if medical schools were to suggest that there was a realistic chance that they might be admitted. The estimate of construct-level predictive validity for, say, A-levels is, therefore, an estimate given the applicants who actually applied. Were medical schools to suggest that applicants might be accepted with, say, the minimum matriculation grades of EE, then the variance in A-level grades of candidates would increase, resulting in the construct-level predictive validities being yet higher. Taking the concept to its extreme, were entrants of any intellectual ability to be allowed to enter, including those with minimal grades at GCSE (see the population distribution elsewhere [54]), then the construct-level predictive validity of educational attainment would probably rise close to one, as it also would were applicants to be admitted across the entire population range of intellectual ability.
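
The dependence of a correlation on the spread of the predictor can be illustrated with the standard univariate formula for direct range restriction (often called Thorndike's Case II), applied here in the widening direction. This is a simplified stand-in that assumes linearity and homoscedasticity, not the method used in this study, and the SD ratios are hypothetical values chosen only for illustration.

```python
import math


def rescale_for_sd_ratio(r, sd_ratio):
    """Correlation expected if the predictor SD were sd_ratio times as large.

    Direct range-restriction formula (Thorndike Case II):
    r' = r*U / sqrt(1 + r**2 * (U**2 - 1)), with U = sd_ratio.
    """
    return r * sd_ratio / math.sqrt(1 + r**2 * (sd_ratio**2 - 1))


r_applicants = 0.723  # construct-level validity for A-levels quoted in the text
# Hypothetical SD ratios: 1.0 is the current applicant pool.
widened = {u: rescale_for_sd_ratio(r_applicants, u) for u in (1.0, 1.5, 2.0, 5.0)}
for u, r_new in widened.items():
    print(f"SD ratio {u}: r = {r_new:.3f}")
```

As the hypothetical SD ratio grows, the correlation moves towards one, which is the behaviour described above for an ever wider applicant pool.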

### What happens to students who enter medical schools with substantially lower A-level grades?

One of the most interesting educational initiatives in UK medical education is the Extended Medical Degree Programme (EMDP) at King’s College, London [57–60], which admits students from low-achieving secondary schools who have A-level grades substantially below those normally required for medical school admission. Average grades initially were CCC (more recently rising to BBC), with BCC currently being the standard offer [61]. The study claimed that, “medical students can succeed without AAB at A level if these results were obtained from a low achieving [secondary] school” [57] (p.1113). The claim would be supported by the finding in the UKCAT-12 study that students attaining A-levels from under-achieving secondary schools subsequently do better at medical school [21], although the effect is relatively small (and the much larger HEFCE study found it to be of the order of one A-level grade, so that ABB from a lower achieving secondary school was equivalent to AAB from a higher achieving secondary school [62]). The effect of a low achieving secondary school is probably therefore too small to account for the claims made for the EMDP program, and potentially, therefore, is a challenge to the predictions made from construct-level predictive validity.

Formal statistical analyses have however suggested that EMDP students have a performance in finals which is about −.73 (CI: −.38 to −1.09) standard deviations below that of students on the five-year program [63]. In the present study, the meta-analytic estimate of construct-level predictive validity for finals in relation to A-levels is .625 (n = 5; CI: .449 to .754). Using a reliability of .905 for finals and .867 for A-levels (from the UKCAT-12 study), then the attenuated A-levels-Final correlation can be estimated at .553. A-levels in the UKCAT-12 applicants have a decensored mean of 29.01 (SD = 5.89), so that students with grades BBB, BBC, BCC and CCC are −.85, −1.19, −1.53 and −1.87 SDs below the mean without taking attenuation into account. Given the estimated A-levels-finals correlation of .553 they would be expected to score −.47, −.66, −.85 and −1.04 SDs below the mean in the finals assessment. The expected average for students with grades CCC to BBB is therefore about −.75, which is very close to the actual value of −.73. Were they admitted, entrants with grades of DDD or EEE would be expected to have mean scores −1.60 and −2.16 SDs below the mean.
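
The arithmetic in this paragraph can be retraced in a few lines. The sketch below is a reconstruction rather than the authors' own code. The grade tariff (A = 10, B = 8, C = 6, D = 4, E = 2) is an assumption inferred from the quoted z-scores, and the variable and function names are hypothetical.

```python
import math

MEAN_A_LEVEL, SD_A_LEVEL = 29.01, 5.89  # decensored applicant mean and SD (from the text)
POINTS = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 2}  # assumed tariff, inferred from the z-scores

# Attenuate the construct-level validity (.625) by the two reliabilities.
r_attenuated = 0.625 * math.sqrt(0.905 * 0.867)


def expected_finals_z(grades):
    """Expected finals score, in SD units, for a three-letter grade string."""
    score = sum(POINTS[g] for g in grades)
    z_a_level = (score - MEAN_A_LEVEL) / SD_A_LEVEL
    return r_attenuated * z_a_level


predicted = {g: expected_finals_z(g) for g in ("BBB", "BBC", "BCC", "CCC", "DDD", "EEE")}
emdp_range_mean = sum(predicted[g] for g in ("BBB", "BBC", "BCC", "CCC")) / 4

for grades, z in predicted.items():
    print(f"{grades}: {z:+.2f} SD")
print(f"Mean for CCC to BBB: {emdp_range_mean:+.2f} SD")
```

Under the assumed tariff the values agree with those quoted above to within rounding.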

In BMS examinations where conventional students show a retention rate of 97% (3% failing), EMDP students showed retention rates of 90% (10% failing) [57]. Retake rates for BMS exams are 15% in conventional students but 32% in EMDP students, with “A level chemistry and biology grades … of the EMDP students showing significant correlation with marks in the first year examinations” [57]. A variant on the calculation for finals can be used to predict these rates. Using a reliability for A-levels of .867, a reliability for a continuous overall BMS result of .904 (based on the UCLMS cohorts), and a meta-analytic construct-level predictive validity of .744 (n = 4; CI: .518 to .872), the attenuated predictor-outcome correlation is calculated as .659. A failure rate of 3% for conventional students implies that the cut-off is −1.88 SDs below the mean, and a retake rate of 15% implies a cut-off of −1.03 SDs. Failure rates for students with entry grades of BBB, BBC, BCC and CCC are then expected to be 9.3%, 13.6%, 19.1% and 25.8%, the average of 17.0% being a little higher than the EMDP average of 10%. Likewise retake rates with grades of BBB, BBC, BCC and CCC are expected to be 31.7%, 40.0%, 48.8% and 57.7%, with an average of 44.6%, which again is a little higher than the EMDP’s rate of 32%. Were students to be admitted with grades of DDD or EEE then their failure rates would be expected to be 51% and 76%, with retake rates of 81% and 94%.
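
The failure and retake rates can be approximated in the same way. The sketch below is one reading of the calculation that appears to reproduce the quoted percentages, not a confirmed account of the authors' procedure. It assumes the same inferred grade tariff as for finals (A = 10 down to E = 2) and treats each group's BMS scores as a unit-variance normal distribution shifted to the regression-predicted mean; all names are hypothetical.

```python
import math
from statistics import NormalDist

NORM = NormalDist()
MEAN_A_LEVEL, SD_A_LEVEL = 29.01, 5.89  # decensored applicant mean and SD (from the text)
POINTS = {"A": 10, "B": 8, "C": 6, "D": 4, "E": 2}  # assumed tariff

# Attenuate the construct-level validity (.744) by the two reliabilities.
r_attenuated = 0.744 * math.sqrt(0.867 * 0.904)
fail_cutoff = NORM.inv_cdf(0.03)    # 3% of conventional students fail
retake_cutoff = NORM.inv_cdf(0.15)  # 15% of conventional students retake


def rate_below(cutoff, grades):
    """Proportion expected to fall below `cutoff` for the given entry grades.

    Assumption: scores in the group are normal with unit SD, shifted to the
    predicted mean (the spread is not shrunk to the residual SD).
    """
    z_a_level = (sum(POINTS[g] for g in grades) - MEAN_A_LEVEL) / SD_A_LEVEL
    return NORM.cdf(cutoff - r_attenuated * z_a_level)


grade_sets = ("BBB", "BBC", "BCC", "CCC", "DDD", "EEE")
fail_rates = {g: rate_below(fail_cutoff, g) for g in grade_sets}
retake_rates = {g: rate_below(retake_cutoff, g) for g in grade_sets}

for g in grade_sets:
    print(f"{g}: fail {fail_rates[g]:.1%}, retake {retake_rates[g]:.1%}")
```

Using the residual SD instead of a unit SD would give noticeably lower failure rates for the BBB to CCC groups, so the unit-SD assumption matters for matching the figures in the text.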

The calculation of construct-level predictive validity explicitly makes predictions outside of the normal range of the data for which the correlations were calculated. Although prediction outside of the range is often regarded as bad practice, it is precisely what construct-level predictive validity sets out to do, with a strong theoretical rationale and model behind it; and as the Statistical Appendix (Additional file 1) shows, the HSL method succeeds well at extrapolating correctly to the true figures in a simulation. The King’s EMDP data provide an independent validation of the predicted marks and failure rates. Failure rates and retake rates at BMS exams, and average marks at finals are predicted well from the estimates of construct-level predictive validity, being what would be expected given the A-level grades of the students. That provides confidence in the principle of calculating construct-level predictive validity as a basis for making selection decisions.

### A* grades at A-level

None of the studies described here had information on A* grades at A-level, which were first taken by students sitting A-levels in 2010. Few data have been published on A* grades in medical students, although in February 2013 data were published from Oxford, which is one of the most selective of UK medical schools. Of 2,054 applicants with A-levels, there were 16.7% with grades of less than AAA, 19.% with AAA, 22.4% with at least one A*, 16.9% with at least two A*s, and 24.8% with at least three A*s, with the proportions in those holding offers being 0.7%, 5.7%, 14.3%, 19.4% and 60.0% for grades AAA to A*A*A*. Scoring AAA = 30, AAA* = 32, AA*A* = 34 and A*A*A* = 36 [64], and using the estimates of reliability and construct-level predictive validity used for the King’s study (above), then compared with students scoring AAA, students with AAA*, AA*A* and A*A*A* grades are predicted to score .22, .45 and .67 SDs higher at BMS, and .19, .38 and .56 SDs higher at finals. Those predictions will soon be testable, in all medical schools and not just Oxford, and if correct then the utility of construct-level predictive validity will also be supported.
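
The predicted advantages for A* grades follow from the same attenuated correlations. The sketch below reconstructs them under the scoring given in the text (AAA = 30 up to A*A*A* = 36), with the applicant SD of 5.89 carried over from the King's calculation; the names are hypothetical and this is an illustration rather than the authors' code.

```python
import math

SD_A_LEVEL = 5.89  # applicant SD of A-level scores, as used for the King's calculation
SCORES = {"AAA": 30, "AAA*": 32, "AA*A*": 34, "A*A*A*": 36}  # scoring given in the text

# Attenuated predictor-outcome correlations, from the values quoted in the text.
r_bms = 0.744 * math.sqrt(0.867 * 0.904)
r_finals = 0.625 * math.sqrt(0.905 * 0.867)


def gain_over_aaa(grades, r):
    """Predicted advantage over AAA entrants, in outcome SD units."""
    return r * (SCORES[grades] - SCORES["AAA"]) / SD_A_LEVEL


bms_gain = {g: gain_over_aaa(g, r_bms) for g in SCORES}
finals_gain = {g: gain_over_aaa(g, r_finals) for g in SCORES}

for g in SCORES:
    print(f"{g}: BMS {bms_gain[g]:+.2f} SD, finals {finals_gain[g]:+.2f} SD")
```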

### Comparison with other studies of selection

This discussion is not the place for a full review of other studies which have assessed educational attainment measures and measures of intellectual aptitude as possible predictors of university and medical school performance. In US medical schools, there seems little doubt that MCAT [65] predicts medical school performance, with the Biological Sciences knowledge test having a higher prediction than the verbal reasoning (aptitude) test. For university admission in general, in the UK both ISPIUA [66, 67] (in the 1960s) and the Sutton Trust SAT test [56, 68] (in the 2000s) showed similar results, with A-levels being strong predictors of university performance and intellectual aptitude tests having little predictive value. The findings reported here are therefore compatible with other large-scale studies, albeit mostly not in medicine.

### Limitations of the present analysis

The present analysis is limited to a relatively small number of studies, although most include entrants to many UK medical schools; longitudinal cohort studies are rare. The outcome variables are not always detailed, and postgraduate outcomes are restricted to the criteria of MRCP(UK) marks and Specialist Register entry. The statistical analyses also have to use estimates of some parameters such as reliabilities and selection ratios, and the unreliability of these may not have been taken fully into account. Future studies should examine a wider range of measures of clinical knowledge and performance. The outcomes considered here are almost entirely academic measures of success, and other, non-academic measures of clinical and professional performance in medical practice would be desirable.

### What is the missing ‘dark variance’ of medical education?

Ultimately 100% of the true variance in medical school performance has to be accounted for, once unreliability, regression to the mean and right-censorship have been taken into account, even if some of that variance is sporadic (what one might call ‘deep chance’, to distinguish it from mere noise due to measurement error, and containing things such as the random, unpredictable events of every life, referred to earlier). The situation is akin to that currently being experienced in astrophysics, where the existence of ‘dark matter’ and ‘dark energy’ is inferred from the necessity, in what is effectively an accounting exercise, of accounting for the total mass of the universe and the expansion of the universe, all of which needs to be explained. Medical education also cannot account for all of the variation that needs accounting for, and selection of medical students can never be on a firm foundation without it being able to do so. Nevertheless, the present results provide robust support for the use of measures of educational attainment in student selection.