Bias associated with delayed verification in test accuracy studies: accuracy of tests for endometrial hyperplasia may be much higher than we think!

Background To empirically evaluate bias in estimation of accuracy associated with delay in verification of diagnosis among studies evaluating tests for predicting endometrial hyperplasia. Methods Systematic reviews of all published research on accuracy of miniature endometrial biopsy and endometrial ultrasonography for diagnosing endometrial hyperplasia identified 27 test accuracy studies (2,982 subjects). Of these, 16 had immediate histological verification of diagnosis while 11 had verification delayed > 24 hrs after testing. The effect of delay in verification of diagnosis on estimates of accuracy was evaluated using meta-regression with diagnostic odds ratio (dOR) as the accuracy measure. This analysis was adjusted for study quality and type of test (miniature endometrial biopsy or endometrial ultrasound). Results Compared to studies with immediate verification of diagnosis (dOR 67.2, 95% CI 21.7–208.8), those with delayed verification (dOR 16.2, 95% CI 8.6–30.5) underestimated the diagnostic accuracy by 74% (95% CI 7%–99%; P value = 0.048). Conclusion Among studies of miniature endometrial biopsy and endometrial ultrasound, diagnostic accuracy is considerably underestimated if there is a delay in histological verification of diagnosis.


Background
The natural history of endometrial hyperplasia is not fully understood [1]. What is known is that a proportion of simple and complex hyperplastic processes will regress without treatment [2] although the time scale over which such regression may occur is unclear. Similarly the time scale over which benign endometrium progresses to hyperplasia is also unknown. Among studies evaluating accuracy of tests for diagnosis of hyperplasia (miniature biopsy or ultrasonography), it has previously been hypothesised that if histological verification of diagnosis after performing the test is delayed, the estimation of test accuracy may be influenced by the phenomena of disease regression or progression [3]. For instance, false positive diagnoses of endometrial hyperplasia may occur due to natural disease regression during the time interval between testing and verification of diagnosis. Similarly, false negative diagnoses may also result from progression of benign functional or atrophic endometrium.
To obtain accurate estimates of test accuracy in studies of hyperplasia, an immediate comparison of the test under scrutiny with a reference standard that verifies the diagnosis will be essential [4][5][6]. When accuracy studies suffer from a delay in performance of the reference standard, the resultant false positives and false negatives will be expected to lead to an underestimation of test accuracy. In systematic reviews, when studies of various designs are collated, the extent of underestimation that arises from delay is important in obtaining an unbiased pooled accuracy estimate. To our knowledge, the extent of underestimation of accuracy due to a delay in verification of diagnosis has not been evaluated empirically in studies of endometrial hyperplasia. We undertook this analysis to examine formally how inaccurate the estimation of accuracy can be in studies evaluating miniature endometrial biopsy devices and endometrial thickness measurement by pelvic ultrasonography for predicting endometrial hyperplasia when there are delays in histological verification of diagnosis.

Methods
To test our hypothesis, a data set of all the published studies reporting the accuracy of miniature endometrial biopsy devices and endometrial ultrasonography for predicting endometrial hyperplasia was obtained from systematic reviews [7,8]. The reviews focused on test accuracy studies in which the results of the test were compared with the results of a reference standard. The targeted population was women with abnormal pre- or postmenopausal uterine bleeding. The diagnostic tests of interest were miniature endometrial biopsy devices (for example, Pipelle® endometrial suction curette, Unimar, Wilton, CT, USA) and endometrial thickness measurement by pelvic ultrasonography. The reference standard was endometrial histology obtained by an independent endometrial sampling technique, for example, inpatient curettage (with hysteroscopy) or hysterectomy.

Identification of studies
Two independent electronic searches of MEDLINE and EMBASE were conducted to identify relevant citations on endometrial biopsy (1980–1999) and ultrasonography. The search term combination for endometrial biopsy [8] was diagnosis (MeSH) AND endometrial biopsy (textword), while that for studies on ultrasonography [7] was ultrasound AND endometrial thickness AND sonography (textwords). The searches were limited to human studies, but there were no language restrictions. Relevant studies were identified by examining all the retrieved citations, reference lists of all known reviews and primary studies, and direct contact with manufacturers. Details of the search and selection processes can be found in the published reports of the reviews [7,8].

Study quality assessment
All selected studies were assessed for their methodological quality, defined as the confidence that study design, conduct and analysis minimize bias in the estimation of diagnostic accuracy [9][10][11]. We considered the following features in quality assessment: method of recruitment of sample, appropriateness of patient spectrum, and blinding of comparison between test and reference standard. Recruitment was considered to be adequate if patient selection was consecutive or a random sample was obtained. Patient spectrum was considered to be appropriate if both pre- and postmenopausal women were included. Blinding was considered to be present if it was clearly reported that the pathologists providing histological reports were kept unaware of the results of miniature endometrial biopsy or endometrial ultrasonography. If the results of the diagnostic tests were divulged to the pathologists, or in the absence of any such reporting, blinding was categorised as absent. For the purpose of our analysis, studies were classified into two quality categories: category I studies had at least one of the following features: adequate recruitment, appropriate spectrum, or blinding; category II studies had none of these quality features.

Data extraction
In addition to assessment of methodological quality, data were extracted to allow classification of studies into one of two groups: i) immediate verification (reference standard performed within 24 hours of testing), and ii) delayed verification (reference standard performed more than 24 hours after testing). Any studies that could not be categorised in this way due to lack of reporting were excluded. Data were then abstracted as 2 × 2 tables and estimates of diagnostic accuracy were derived for each individual study. A correction factor of 0.5 was used when cells of the 2 × 2 tables included zero values [12]. True positive rates (sensitivity), false positive rates (1 − specificity) and diagnostic odds ratios (dORs) were calculated for each primary evaluation. The dOR represents a ratio of the positive and negative likelihood ratios and it can be mathematically summarised as:
$$\mathrm{dOR} = \frac{\mathrm{LR}^{+}}{\mathrm{LR}^{-}} = \frac{\text{sensitivity}/(1-\text{specificity})}{(1-\text{sensitivity})/\text{specificity}} = \frac{TP \times TN}{FP \times FN}$$
where TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives in the 2 × 2 table.
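The per-study calculations described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name and the example counts are hypothetical, and it assumes that when any cell is zero the 0.5 correction is added to every cell of the table (a common convention; the text does not say whether all cells or only the zero cells were corrected).

```python
def accuracy_from_2x2(tp, fp, fn, tn):
    """Return (sensitivity, false positive rate, dOR) for one 2 x 2 table."""
    cells = (tp, fp, fn, tn)
    # Assumption: if any cell is zero, add 0.5 to all four cells.
    if any(c == 0 for c in cells):
        tp, fp, fn, tn = (c + 0.5 for c in cells)

    sensitivity = tp / (tp + fn)              # true positive rate
    false_positive_rate = fp / (fp + tn)      # 1 - specificity

    lr_pos = sensitivity / false_positive_rate              # positive likelihood ratio
    lr_neg = (1 - sensitivity) / (1 - false_positive_rate)  # negative likelihood ratio
    dor = lr_pos / lr_neg                     # equals (tp * tn) / (fp * fn)
    return sensitivity, false_positive_rate, dor


# Hypothetical counts, for illustration only (not taken from any included study).
sens, fpr, dor = accuracy_from_2x2(tp=18, fp=5, fn=2, tn=75)
print(f"sensitivity={sens:.3f}, false positive rate={fpr:.4f}, dOR={dor:.1f}")
```

Computing the dOR through the two likelihood ratios, rather than directly from the cross-product, mirrors the definition given in the text; both routes give the same value.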

Statistical analysis
Pooled dORs were generated as the principal measures of diagnostic accuracy. Meta-analyses to produce summary estimates of accuracy were performed separately for subgroups of studies reporting immediate and delayed verification. To delineate the impact of delay in verification of diagnosis, we used meta-regression analysis [13,14] with the log of dOR as the accuracy measure. This technique fitted a multivariable linear regression model (random effects model) to examine the influence of delay, quality and test type on the estimates of accuracy observed among studies included in the analysis. In this way the analysis was adjusted for the confounding effects of study quality (two quality categories) and type of test (miniature endometrial biopsy or endometrial ultrasound).
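As an illustration of how a pooled dOR for one subgroup might be obtained, the sketch below pools log dORs with the DerSimonian-Laird random-effects estimator. This is an assumption: the text does not name the pooling algorithm, and the covariate adjustment performed by the meta-regression is not shown here. The function name and the example tables are hypothetical, and the tables are assumed to have already had any zero-cell correction applied.

```python
import math


def pool_log_dor(tables):
    """DerSimonian-Laird random-effects pooling of log diagnostic odds ratios.

    tables: list of (tp, fp, fn, tn) tuples with no zero cells.
    Returns (pooled dOR, lower 95% limit, upper 95% limit, tau squared).
    """
    # Per-study log dOR and its approximate variance (sum of reciprocal cells).
    y = [math.log((tp * tn) / (fp * fn)) for tp, fp, fn, tn in tables]
    v = [1 / tp + 1 / fp + 1 / fn + 1 / tn for tp, fp, fn, tn in tables]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))

    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate on the log scale.
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se),
            tau2)


# Hypothetical 2 x 2 tables, for illustration only.
example = [(18, 5, 2, 75), (30, 10, 10, 60), (12, 8, 6, 40)]
dor, lower, upper, tau2 = pool_log_dor(example)
print(f"pooled dOR={dor:.1f} (95% CI {lower:.1f} to {upper:.1f}), tau^2={tau2:.2f}")
```

Pooling on the log scale and exponentiating afterwards keeps the confidence interval asymmetric around the dOR, as in the intervals reported for the two subgroups.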

Results
Selection of studies
The study selection process is shown in Figure 1. In total there were 2,982 subjects in 27 diagnostic evaluations reported in the 24 eligible primary studies. Eleven evaluations delayed verification of the diagnosis by more than 24 hours; the delay was up to six months in one study, up to four weeks in four studies, up to three weeks in one study and up to one week in the remaining three studies. Three of these studies were rated as category I for methodological quality, and eight as category II. Sixteen evaluations verified the diagnosis within 24 hours of the test. Among these, seven studies were rated as category I for quality, and nine as category II (Table 1). Table 2 shows the diagnostic accuracy results for individual studies according to test type and verification status in terms of delay. The summary statistics for the various subgroups showed that the dOR for studies with immediate verification was 67.2 (21.7-208.8) while that for studies with delayed verification was 16.2 (8.6-30.5) as shown in Figure 2. Meta-regression analysis for bias due to delay in verification of diagnosis, adjusted for study quality and test type, showed that the underestimation of test accuracy among studies with delayed verification was 74% (95% CI 7%-99%; P = 0.048) on average compared to studies with immediate verification (Table 3).
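As a rough consistency check on these figures, the unadjusted ratio of the two pooled dORs can be compared with the adjusted 74% estimate. The short Python sketch below is illustrative only; it assumes that "underestimation" means one minus the relative dOR (delayed versus immediate), an interpretation the text does not state explicitly, and the variable names are hypothetical.

```python
import math

# Pooled dORs reported in the text.
dor_immediate = 67.2
dor_delayed = 16.2

# Crude (unadjusted) relative dOR, delayed versus immediate verification.
relative_dor = dor_delayed / dor_immediate      # about 0.24
crude_underestimation = 1 - relative_dor        # about 76%

# Assumption: the adjusted 74% underestimation corresponds to a relative
# dOR of 0.26, i.e. a coefficient of ln(0.26) on the log dOR scale.
adjusted_relative_dor = 1 - 0.74
adjusted_log_coefficient = math.log(adjusted_relative_dor)

print(f"crude relative dOR: {relative_dor:.2f}")
print(f"crude underestimation: {crude_underestimation:.0%}")
print(f"adjusted log-scale coefficient: {adjusted_log_coefficient:.2f}")
```

Under this reading, the crude figure (about 76%) is close to the adjusted one (74%), which suggests that adjusting for study quality and test type changed the estimate only slightly.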

Discussion
Our study shows empirically the magnitude of bias associated with delay in verification of diagnosis in test accuracy studies. Delay in verification of more than 24 hours was associated with a considerable underestimation of accuracy of miniature biopsy and endometrial ultrasonography in diagnosing endometrial hyperplasia. This supports the premise that the reported limited accuracy of miniature endometrial biopsy devices and endometrial ultrasonography in diagnosing hyperplasia is due, in part, to natural history of disease rather than resulting entirely from intrinsic problems with performance of the diagnostic tools [3].
We posed our hypothesis a priori and tested it in as rigorous a manner as possible. Our literature search was without language restriction, facilitating retrieval of many relevant test accuracy studies. However, owing to poor reporting, many critical pieces of information were missing from the available literature, restricting the number of studies that could be included in our analysis (for example, 31 studies were ineligible for inclusion because explicit information about time before verification was omitted). Our examination of delays in verification was also restricted; just two time categories were discernible (verification within 24 hours or after more than 24 hours). Immediate verification (reference standard performed straight after the index test) was not achievable in some studies because the reference test (inpatient endometrial sampling) necessitated use of general anaesthesia. A practical cut-off of 24 hours was taken to allow time for reference testing to be undertaken when the preceding index tests (miniature endometrial biopsy and ultrasound) were performed in the conscious outpatient. Although the natural history of endometrial hyperplasia is unclear, it is unlikely that biological alteration would have occurred within 24 hours. To study the rate of disease progression or regression would require repeated testing over time, but such a study is unlikely to be ethically justifiable, given that most clinicians will institute treatment following initial diagnosis. Such a study would then become one of prognosis under treatment rather than a natural history study.
We also evaluated other features of methodological quality and, in general, found the quality of studies to be poor. For example, only three studies reported that the reference test was interpreted without knowledge of the results of the index test. A lack of blinding can introduce bias and overestimation of diagnostic accuracy [4]. Pathological interpretation of endometrial hyperplasia is open to a varying degree of subjectivity, especially at the extreme ends of the spectrum, where overlap with benign functional endometrium (simple hyperplasia) and cancer (complex hyperplasia with cytological atypia) is more likely. Absence of blinding (or of its explicit reporting) is thus associated with poorer methodological quality, and this feature was incorporated in our quality assessment. Our analysis adjusted for the confounding effects of quality, but our inferences should be interpreted with caution owing to the relative scarcity of good quality studies.

Conclusions
Our findings have implications for research into new diagnostic interventions. Our results demonstrate that test evaluation with robust study design (immediate verification) showed good test performance but evaluation in poor designs (delayed verification) showed poor performance. Poor designs may reflect the situation prevalent in routine clinical practice where test results may not be immediately confirmed due to resource and other implications. Thus diagnostic evaluations carried out in routine practice may mask the accuracy of tests.

Figure 1 Flow diagram showing study selection process.
Endometrial biopsy: number of potentially eligible studies from search (see Methods) n=52; number of studies included in systematic review n=9; excluded studies n=0; number of studies (and evaluations) eligible for inclusion in present study n=9 (12).
Ultrasound: number of potentially eligible studies from search n=145; number of studies included in systematic review n=57; excluded studies n=42; number of studies (and evaluations) eligible for inclusion in present study n=15 (15).
Reasons for exclusion from present study: reference standard not obtained by an independent endometrial sampling technique n=0; absence of explicit information on time between test performance and verification n=31; both of the above n=11.
Total studies (and evaluations) included in present study n=24 (27).

Figure 2 Effect of delayed verification on the diagnostic accuracy of miniature endometrial biopsy and transvaginal ultrasound in detecting endometrial hyperplasia. Pooled diagnostic odds ratios (dOR) for studies with immediate and delayed verification.