Examinations in medicine, be they postgraduate or undergraduate, play a key role in ensuring that the technical competence of those passing is at a sufficiently high level to ensure the safe treatment of patients. Implicit in that description is the assumption that the examinations are valid. Validity for postgraduate examinations is currently couched almost entirely in terms of construct validity in its broad sense. Until recently, however, much of the validity of medical examinations has depended on construct validity in the older, narrower sense, in which the items asked about in an examination have a logical and theoretical relationship to medical practice (essentially, it seems self-evident that, for example, knowledge of the causes and treatment of medical problems such as myocardial infarction, or diabetes, or Fabry's Disease, is more likely to make a better physician than ignorance of such matters). If the knowledge asked about concerns the obscure, recondite 'fascinomas' once beloved of some examiners, then construct validity in the narrow sense may not hold. Excluding that type of question, it is hard to make an argument, beyond mere hand-waving and a few splutters about 'only exam knowledge', that those who have a greater knowledge of medical conditions are no more likely to be better doctors than those who do not have such knowledge. With well-constructed, properly blue-printed examinations (part of the broad sense of construct validity), it seems more likely to be true for physicians that knowledge is better than ignorance. In the case of the MRCP(UK), educators and, particularly, future patients might reflect on whether they would genuinely be indifferent as to whether their physicians knew about, for example, aseptic meningitis in infectious mononucleosis, bone marrow changes in chronic anaemia of infection, or the electrophysiology of Wolff-Parkinson-White syndrome.
When it is asked whether examinations are 'valid', the question is often referring only to predictive validity, which would require a demonstration that those who do better on postgraduate examinations subsequently perform better as doctors on concrete outcomes in daily medical care (or, more particularly, that those who do less well provide poorer care). There are almost no studies which have looked at predictive validity (and matters have not changed much since the review of Hutchinson et al.), although we are currently carrying out a number of studies on the predictive validity of MRCP(UK) in relation to future professional behaviour and clinical practice, and hope to publish them in the future. The present study is not, however, looking at predictive validity for future medical care, but is concerned instead with the examination itself and its correlates. There is, however, an implicit assumption that the examination is valid, particularly in the sense of construct validity.
If examinations are high-stakes and difficult, and a doctor cannot continue in their chosen specialty without having passed them, then natural justice requires that the examinations be fair, valid and reliable (see Mehrens and Popham [30] for a good overview of the legal issues involved). On the particular issue of resits in high-stakes assessments, the review cites a court case on teacher assessments in the US State of Georgia, in which the judgment stated that,
'[an] irrebuttable lifetime presumption of unfitness after failure to pass six [assessments] was arbitrary and capricious because no further education, training, experience, maturity or higher degree would enable such persons to become certified ...' [p.270].
Within medical education, and particularly in the context of setting standards or pass marks, it is a commonplace to find phrases such as that of Case and Swanson, who say, 'Setting standards will always be arbitrary but need not be capricious' (p.111). Certainly at first sight there does seem to be some arbitrariness whenever a continuum of marks is divided at some cut point to distinguish those who pass and those who fail. However, if 'arbitrary' is taken in the sense of being 'not supported by logic or the necessary facts', there is surely a strong argument that well designed pass marks, perhaps based on clear criterion referencing, or on the Angoff, Ebel or Hofstee methods, or on statistical equating, are not arbitrary, since they are grounded in principle, method, evidence and logic, with a carefully articulated measurement model. There might be those who would argue that a pass mark is too strict or too lax, but that is a separate issue from the rational basis by which the pass mark itself has been set.
Part of the process of fairness and natural justice is that if a candidate fails an examination at one attempt, particularly if they feel they were unlucky in an earlier attempt, perhaps because of a particular choice of questions they had been asked (that is, content specificity/case specificity [32–34]), then they should be allowed to resit the examination. At that point the difficult question arises of how many times a candidate should be allowed to resit. In the late 1990s, the MRCP(UK) decided, given the then available evidence, that it could see no reasonable academic argument to prevent candidates from taking an examination as many times as they wished, particularly given that the standards of its examinations were high and the examinations were reliable, particularly for Part 1. As an extreme example, one candidate in our database subsequently had a total of 35 attempts across the three examinations before eventually gaining the MRCP(UK). Since the candidate had eventually met our standards at each examination there is an argument that it would not have been justified to prevent their progress arbitrarily at an earlier stage.
Although some of the MRCP candidates taking assessments ten or even twenty times may seem extreme in their numbers of attempts, occasional accounts exist of candidates who pass examinations after a very much greater number of attempts, particularly with computer-based assessments. A report on the BBC website (http://news.bbc.co.uk/1/hi/8347164.stm) described the case of Mrs Cha Sa-Soon, a 68-year-old woman who had passed the theory part of the driving test of South Korea at her 950th attempt. The multiple choice examination has a pass mark of 60% and consists of 40 questions, according to the New York Times (http://www.nytimes.com/2010/09/04/world/asia/04driver.html). When an examination can be taken every day, as can the South Korean driving test, it might seem dubious that a genuine increase in ability has continued to occur until the 950th attempt and it may be thought that chance had begun to play a substantial role. That being said, if the examination were best-of-four, giving a 25% chance of success on any question, and if there were 40 questions, the probability of attaining 60% correct by responding at random would only be about 1 in 1.7 million. The likelihood of success by chance alone by the 950th attempt is quite low, implying that Mrs Cha had not passed entirely due to luck (and the New York Times did say that, 'her scores steadily crept up'). (It should be noted that for examinations such as driving tests there is typically a finite pool of questions, which are themselves sometimes published in their entirety, so that rote learning of the answers is in principle possible.)
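The chance-alone figure for the driving test can be checked with a short binomial calculation. The following is a minimal sketch, assuming independent questions, each with a 25% probability of a correct guess, as described in the text; the function names are illustrative only. Under these assumptions the figure of about 1 in 1.7 million corresponds to needing at least 25 of the 40 questions correct, whereas counting exactly 24 of 40 (60%) as a pass gives roughly 1 in 350,000 per attempt. On either reading, the chance of passing by guessing alone at least once in 950 attempts stays well below 1%.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): k or more correct by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_pass_within(attempts: int, p_single: float) -> float:
    """Chance of at least one pass in `attempts` independent tries."""
    return 1 - (1 - p_single) ** attempts


N_QUESTIONS = 40   # from the text
P_GUESS = 0.25     # assumption from the text: four options per question

# Compare two readings of a 60% pass mark: at least 24, or at least 25, of 40.
for min_correct in (24, 25):
    p = prob_at_least(min_correct, N_QUESTIONS, P_GUESS)
    print(
        f"need >= {min_correct}: about 1 in {1 / p:,.0f} per attempt; "
        f"{prob_pass_within(950, p):.4%} within 950 attempts"
    )
```

The same two functions can be applied to other question counts and guessing probabilities, provided a pass mark is supplied.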
Calculations for the probability of correctly answering sufficient questions to pass in the 200 best-of-five questions at MRCP(UK) Part 1 suggest it would be extremely unlikely that a candidate could pass through luck alone. At this point it is perhaps worth quoting from the paper by Pell et al. [p.249], who say:
'The question has often been put to the authors, 'Are not OSCEs [and other assessments] rather like the driving test, candidates are required to reach a certain level of competence, and their route is of little consequence?' In other words, this argument implies that students should be allowed as many resits as necessary until they reach the appropriate level of competence'.
However, Pell et al. resist the obvious conclusion and say they, 'are strongly of the opinion that resits should be constructed to take at least some account of the additional time and support that resit students have been afforded'. How to do that is not straightforward and will be considered in detail elsewhere.
The present study provides a substantial empirical contribution to the evidence base on repeated testing. By means of multilevel modelling of the extensive records of the MRCP(UK), it manages to provide numerical estimates of the extent to which the true ability of candidates improves at repeated attempts at an examination and, hence, the extent to which luck rather than ability begins to play a role. In relation to the central statistical question of the role of luck and genuine improvement, it is clear that on average there is a genuine improvement over many attempts at examinations. It should also be remembered that luck might help an individual candidate pass on a particular attempt but on average it should not increase the overall mark of candidates; that requires a genuine increase in knowledge.
For the Part 1 examination, for which the range of abilities is necessarily much wider, candidates are, on average, still improving at their tenth attempt at the examination. More sophisticated modelling suggests that there is a maximum level of achievement for each candidate, that the maximum level differs between candidates and is sometimes below the pass mark, making eventual success highly unlikely, and that the maximum level correlates strongly with the mark attained at a first attempt at the examination (see Figure 10 for an illustration). Furthermore, the mark attained at a first attempt at the Part 2 and PACES examinations, the taking of which is contingent upon success in the Part 1 examination, depends strongly upon the mark at the first attempt at Part 1, but not on the improvement that subsequently occurs until Part 1 is eventually passed.
In the UK the question of whether candidates in postgraduate examinations should be limited in their number of attempts at an examination has historically been at the discretion of individual examining bodies. The same is also true of undergraduate examinations, where it is generally the case at present that only one or perhaps two attempts at finals or other examinations are allowed (although historically it has not always been so). The rationale for whatever regulations apply is often far from clear and the impression is that whatever limit there is has little formal basis in theory. The primary theoretical concern has to be with the role of 'luck', a difficult term to use. Luck is partly random variation due to the candidate (perhaps feeling ill on the day), partly random variation due to the examiners (who also may feel jaundiced on the day), partly the content of the questions (content/case specificity), and partly a deeper process that can simply be regarded as 'chance', 'random variation' or 'measurement error'. The concept of 'luck' is subtle, but consider two candidates, one of whom, A, knows about condition P but not Q, and the other, B, knows about condition Q but not P, so that each knows about half of the expected knowledge. Condition P is asked about, and so A passes but B fails, but on the next occasion the examination asks about Q, and so at the resit B passes. A finite examination cannot ask about all conditions, and so A was indeed lucky (and A's future patients with condition Q could also be regarded as unlucky). B was also lucky that Q eventually came up. Good examinations try to reduce all such factors by blue-printing, ensuring that the examination contains a large, representative number of questions across the entire syllabus, but they can never be entirely eliminated.
The role of purely 'chance' factors is most easily seen in an outcome which depends entirely on chance, as in dice games, where one has to throw a single die to get a six. There is a one in six chance of throwing a six on the first attempt, but with every additional throw the probability of eventually throwing a six increases. That probability rises with each and every additional throw, not from some particular throw onwards. Likewise, the probability of passing an examination due to chance components (and that includes having 'got lucky' due to not feeling ill, examiners feeling beneficent, and cases/questions with which one happens to be experienced) increases with every additional attempt. There is no discrete change in the probability at the seventh (or indeed any other) specific attempt. More problematic is that the probability of passing due to luck begins to rise even at the second attempt (when many candidates do indeed pass examinations which they have failed at their first attempt). Any proper solution to the problem of resits has, therefore, to consider the difficult problem of whether there is a need to set a gradually increasing pass mark for each attempt at an examination, so that a mark which would pass a candidate at their first attempt may result in a failure at a later attempt, even if it is their second attempt (when luck has already begun to benefit the candidate).
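The dice example can be made concrete with a few lines of arithmetic. The sketch below assumes a fair six-sided die and independent throws; the function name is illustrative. It tabulates the probability of having thrown at least one six within n throws, together with the gain contributed by each extra throw. Every throw adds something, each gain is five-sixths of the previous one, and no throw marks a step change.

```python
from fractions import Fraction


def prob_six_within(throws: int) -> Fraction:
    """Probability of at least one six in `throws` independent throws of a fair die."""
    return 1 - Fraction(5, 6) ** throws


previous = Fraction(0)
for n in range(1, 11):
    current = prob_six_within(n)
    gain = current - previous  # extra probability contributed by throw n
    print(f"throw {n:2d}: cumulative {float(current):.3f}, gain {float(gain):.3f}")
    previous = current
```

Exact fractions are used so that the shrinking gains are not obscured by rounding.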
The central question underpinning any policy on numbers of resits has to be whether a limit is capricious, that is, 'if it is ... irrational', and that is where the difficult problem lies for medical examiners. The fundamental problem in understanding resit examinations is that at any attempt the mark of a candidate is a combination of their true ability and a random, chance process. With each and every repeated attempt at an examination, a candidate capitalizes on those random, chance processes, so that the probability of having benefitted from chance rises with each and every attempt. It is not, therefore, rational or logical to implement a process which implicitly assumes that chance plays no increasing role on attempts one to N, but does play a role from attempt N+1 onwards, so that N is the limit on attempts allowed. The laws of probability are not compatible with such an approach and, therefore, the process cannot be rational. In socio-political terms, the proposed limit of N appears to find its origins partly as an administrative convenience but mainly as an attempt to provide reassurance. However, that reassurance is surely false and without substance, not only because it does not correctly take chance into account, but because empirically it is the case that most candidates who pass at resits do so at the second or third attempt, when chance will almost certainly have benefitted a proportion of them, and the limit of N does nothing to impede those individuals. Candidates currently passing at, for instance, the seventh or higher attempt are a small minority of those passing at resits.
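The accumulation of chance across attempts can be illustrated with a deliberately simple model. The following sketch is not the multilevel model used in the study: it assumes that an observed mark is the candidate's true ability plus normally distributed error, that errors are independent between attempts, and that true ability does not change. Marks are expressed in units of the standard error of measurement, and the one-unit shortfall is a hypothetical value chosen only for illustration.

```python
from statistics import NormalDist


def prob_pass_once(true_minus_pass: float, sem: float = 1.0) -> float:
    """P(observed mark >= pass mark) at one attempt.

    `true_minus_pass` is true ability minus the pass mark (negative if the
    candidate's true ability is below the standard); `sem` is the assumed
    standard error of measurement.
    """
    return 1 - NormalDist(mu=true_minus_pass, sigma=sem).cdf(0.0)


def prob_pass_by(attempt: int, true_minus_pass: float, sem: float = 1.0) -> float:
    """P(at least one pass within `attempt` independent attempts)."""
    q = prob_pass_once(true_minus_pass, sem)
    return 1 - (1 - q) ** attempt


# Hypothetical candidate whose true ability is one SEM below the pass mark.
for n in range(1, 9):
    print(f"by attempt {n}: {prob_pass_by(n, -1.0):.3f}")
```

Under these assumptions the probability of a pass by chance rises smoothly from the second attempt onwards, with no attempt number at which chance suddenly begins to matter.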
While there is no rational basis for having a fixed limit to the number of attempts, neither is the converse rational, of allowing an unlimited number of attempts, since chance continues to benefit resit candidates and that will not reassure the public. There is, though, a third way, which is perhaps the only possible rational solution, which is to set a pass mark that itself is dependent on the number of attempts an individual candidate has made. Indeed, an argument could be made, from a Bayesian perspective, that the pass mark for an individual candidate should be dependent on the marks they have obtained at all previous attempts at an examination, a candidate who has previously failed badly having to do better at the Nth attempt than one who only had bare fails on previous attempts. Although far from straightforward to implement, it is, given that any other process could be argued to be capricious, the only solution which can claim to be rational, can avoid the charge of capriciousness, and can be seen to protect and reassure patients.