Collectives of diagnostic biomarkers identify high-risk subpopulations of hematuria patients: exploiting heterogeneity in large-scale biomarker data

Background Ineffective risk stratification can delay diagnosis of serious disease in patients with hematuria. We applied a systems biology approach to analyze clinical, demographic and biomarker measurements (n = 29) collected from 157 hematuric patients: 80 urothelial cancer (UC) and 77 controls with confounding pathologies.
Methods On the basis of biomarkers, we conducted agglomerative hierarchical clustering to identify patient and biomarker clusters. We then explored the relationship between the patient clusters and clinical characteristics using Chi-square analyses. We determined classification errors and areas under the receiver operating curve of Random Forest Classifiers (RFC) for patient subpopulations using the biomarker clusters to reduce the dimensionality of the data.
Results Agglomerative clustering identified five patient clusters and seven biomarker clusters. Final diagnoses categories were non-randomly distributed across the five patient clusters. In addition, two of the patient clusters were enriched with patients with 'low cancer-risk' characteristics. The biomarkers which contributed to the diagnostic classifiers for these two patient clusters were similar. In contrast, three of the patient clusters were significantly enriched with patients harboring 'high cancer-risk' characteristics including proteinuria, aggressive pathological stage and grade, and malignant cytology. Patients in these three clusters included controls, that is, patients with other serious disease and patients with cancers other than UC. Biomarkers which contributed to the diagnostic classifiers for the largest 'high cancer-risk' cluster were different than those contributing to the classifiers for the 'low cancer-risk' clusters. Biomarkers which contributed to subpopulations that were split according to smoking status, gender and medication were different.
Conclusions The systems biology approach applied in this study allowed the hematuric patients to cluster naturally on the basis of the heterogeneity within their biomarker data, into five distinct risk subpopulations. Our findings highlight an approach with the promise to unlock the potential of biomarkers. This will be especially valuable in the field of diagnostic bladder cancer where biomarkers are urgently required. Clinicians could interpret risk classification scores in the context of clinical parameters at the time of triage. This could reduce cystoscopies and enable priority diagnosis of aggressive diseases, leading to improved patient outcomes at reduced costs.


Background
The number of patients presenting with hematuria is progressively increasing in our aging population and the diagnosis of serious diseases in some of these patients can be delayed when triage is ineffective [1]. Therefore, novel alternative risk stratification approaches are needed [2].
Hematuria, that is, the presence of blood in urine, is a presenting symptom for a variety of diseases. The final diagnosis for hematuric patients ranges from no diagnosis, through benign conditions including urinary infection, stone disease, benign prostate enlargement (BPE) to renal diseases and malignant causes. Urothelial cancer (UC), the most common malignancy in hematuric patients, is the fourth most common cancer in men and was the estimated cause of death in 150,200 people worldwide in 2008 [3]. Bladder cancer is associated with many risk factors [2].
Smoking increases the risk of UC fourfold and cessation of smoking is associated with a decreased risk [2].
The risk parameters that are currently used to tailor follow-up for patients diagnosed with UC, include pathological parameters, that is, grade, stage and associated carcinoma in situ (CIS), together with resistance to Bacille Calmette-Guerin treatment. At the time of diagnosis, approximately 70% of patients diagnosed with UC have tumors that are pathologically staged as pTa, pT1 or CIS, that is, non-muscle invasive (NMI) disease. The remaining patients present with muscle invasive UC (MI UC) which has a high-risk of progression to a more life threatening disease [2,4]. Unfortunately, it is not always possible to predict correctly the outcome for patients. This is largely attributable to the molecular heterogeneity within tumors which means that a spectrum of outcomes, spanning from negligible risk to life threatening prognosis, exist within the same pathological classification. For this reason, all patients with NMI disease have frequent surveillance cystoscopies and those with MI UC have radiological surveillance for lymph node recurrence or distant metastasis [2].
Cystoscopy is the gold standard for the detection and surveillance of NMI UC [2]. However, this procedure is costly and invasive for the patient. Further, it requires a significant clinical input and has its own shortcomings [2,5]. Cytology, another diagnostic test for bladder cancer, detects the presence of malignant cells in urine. Although cytology has high specificity, it has insufficient sensitivity to stand alone as a diagnostic test for UC in patients presenting with hematuria [2]. Three diagnostic bladder cancer biomarkers, Nuclear Matrix Protein 22 [6], Bladder Tumor Antigen (BTA) [7] and Fibrinogen Degradation Product [8] have Food and Drug Administration (FDA) approval. However, these biomarkers are not in use in routine practice as diagnostic biomarkers for UC because of their limited specificity. There is, therefore, a strong clinical need for urine-based tests which can at least risk-stratify and, if possible, be diagnostic in hematuric patients [2].
Researchers often combine multiple tests, genes or biomarkers [9][10][11]. However, it is not possible to predict intuitively how multiple measurements will collectively reflect the underlying biological heterogeneity in complex diseases, such as UC. Complex diseases consist of multiple components which interact to produce emergent properties that the individual components do not possess. The difficulties to date with large amounts of patient biomarker data are that they do not manage or group all patients in a clinically meaningful way. Systems biology is based on the assumption that interactions among molecular components need to be integrated in order to obtain a functional understanding of physiological properties [12,13]. In this paper we used a systems approach, that is, clustering and Random Forests Classification (RFC), to analyze a comprehensive dataset collected from 157 hematuric patients: 80 patients with UC and 77 controls with a range of confounding pathologies.
When we allowed the patients to cluster naturally on the basis of their individual biomarker profiles this resulted in five patient clusters with a non-random distribution of risk characteristics. Three of these patient clusters were enriched with patients with cancer-risk characteristics. The remaining two patient clusters were enriched with patients with non-cancer characteristics.

Patient information and samples
We analyzed data collected during a case-control study approved by the Office for Research Ethics Committees Northern Ireland (ORECNI 80/04) and reviewed by hospital review boards. The study was conducted according to the Standards for Reporting of Diagnostic Accuracy (STARD) guidelines [14,15]. Written consent was obtained from patients with hematuria who had recently undergone cystoscopy or for whom cystoscopy was planned. Patients (n = 181) were recruited between November 2006 and October 2008 [9]. A single consultant pathologist undertook a pathological review of the diagnostic slides for all bladder cancer patients. The following patients were excluded from our analyses: 19 patients with a history of bladder cancer who were disease-free when sampled; one patient who had adenocarcinoma; one patient who had squamous cell carcinoma; and three patients ≥ 85 years old. We, therefore, analyzed data from 157 patients. A single consultant cytopathologist reviewed the cytology from 74 bladder cancer and 65 control patients. There were insufficient cells for diagnosis in 18/157 patients.
The final diagnosis for each of the 157 patients was based on history, physical examination, urinary tract radiological and endoscopy findings and the pathological reports relating to biopsy or resection specimens. For 36/157 (23%) patients, it was not possible to identify the underlying cause for hematuria, even after detailed investigations, including cystoscopy and radiological imaging of the upper urinary tract. These patients were assigned to the 'no diagnosis' category. The remaining patients were assigned into one of the following six categories: 'benign pathologies', 'stones/inflammation', 'BPE', 'other cancers', 'NMI UC' or 'MI UC'. For analyses purposes, we grouped 'no diagnosis', 'benign pathologies', 'stones/inflammation' and 'BPE' together as non-life threatening diagnoses, and grouped 'other cancers', 'NMI UC' and 'MI UC' as life threatening diagnoses (Table 1).

Biomarker measurement
Biomarker measurements were undertaken on anonymized samples at Randox Laboratories Ltd. For each patient, we measured 29 biomarkers; 26 were measured in triplicate (Table 2). Samples were stored at -80°C for a maximum of 12 months prior to analysis. Creatinine levels (µmol/L) were measured using a Daytona RX Series Clinical Analyzer (Randox) and osmolarity (mOsm) was measured using a Löser Micro-Osmometer (Type 15) (Löser Messtechnik, Germany). Total protein levels (mg/ml) in urine were determined by the Bradford assay at A595 nm (Hitachi U2800 spectrophotometer) using bovine serum albumin as the standard. We classified proteinuria as total urinary protein >0.25 mg/ml [16]. Eighteen biomarkers in urine, and carcino-embryonic antigen (CEA) and free prostate specific antigen (FPSA) in serum were measured using Randox Biochip Array Technology (Randox Evidence© and Investigator©), which are multiplex systems for protein analysis [17]. An additional four biomarkers were measured using commercially available ELISAs. Epidermal growth factor (EGF) and the matrix metalloproteinase 9 neutrophil-associated gelatinase lipocalin (MMP9-NGAL) complex were measured using in-house ELISAs (Table 2).

Data representation
Data were represented by a matrix X with 157 rows and 29 columns, for example, X(3,5) contained the measurement for patient number 3 and biomarker number 5. In order to simplify the notation, we denoted by X(j,) the 29 dimensional feature vector for patient j and by X(,k) the 157 dimensional feature vector for biomarker k.
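The notation above can be illustrated with a minimal sketch in plain Python. The matrix values are invented and much smaller than the 157 x 29 matrix used in the study, and the helper names `patient_profile` and `biomarker_profile` are hypothetical; indices are 1-based to match X(j,) and X(,k).

```python
# Toy stand-in for the data matrix X (rows = patients, columns = biomarkers).
# The numbers are invented for illustration only.
X = [
    [1.0, 2.0, 3.0],  # patient 1
    [4.0, 5.0, 6.0],  # patient 2
]


def patient_profile(X, j):
    """Return X(j,): all biomarker values for patient j (1-based)."""
    return list(X[j - 1])


def biomarker_profile(X, k):
    """Return X(,k): the values of biomarker k (1-based) across all patients."""
    return [row[k - 1] for row in X]


row = patient_profile(X, 2)    # feature vector for patient 2
col = biomarker_profile(X, 3)  # feature vector for biomarker 3
```

Patient clustering operates on the row vectors and biomarker clustering on the column vectors of the same matrix.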

Identification of patient clusters
Patients were separated into clusters according to the similarities of their 29 biomarkers using hierarchical clustering with the Canberra distance and McQuitty clustering [18]. Each patient's profile vector was thus derived from the levels of the 29 biomarkers in their samples, for example, X(i,) as a profile vector for patient i.
To demonstrate the robustness of the observed clusters, we repeated the same analysis 100 times using only a bootstrap subset of the patients to conduct the clustering.
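As an illustration of the clustering step, the following is a minimal sketch in plain Python rather than the software used in the study. It assumes the usual definitions: the Canberra distance as the sum of |a - b| / (|a| + |b|) over features (terms with a zero denominator are skipped here, which is one common convention), and McQuitty linkage (also known as WPGMA), in which the distance from a newly merged cluster to any other cluster is the simple average of the two pre-merge distances. The function names and the toy profiles are hypothetical.

```python
def canberra(a, b):
    """Canberra distance between two profile vectors of equal length."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:  # skip 0/0 terms (assumed convention)
            total += abs(x - y) / denom
    return total


def mcquitty_clusters(profiles, n_clusters):
    """Agglomerate profiles with McQuitty (WPGMA) linkage until n_clusters remain."""
    clusters = {i: [i] for i in range(len(profiles))}
    # Pairwise distances, keyed by (smaller id, larger id).
    dist = {
        (i, j): canberra(profiles[i], profiles[j])
        for i in clusters for j in clusters if i < j
    }
    next_id = len(profiles)
    while len(clusters) > n_clusters:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        merged = clusters.pop(a) + clusters.pop(b)
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # McQuitty update: unweighted mean of the two old distances.
            dist[(c, next_id)] = (d_ac + d_bc) / 2.0
        del dist[(a, b)]
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Invented toy profiles: two patients with low values, two with high values.
toy_profiles = [[1.0, 1.0], [1.1, 1.0], [10.0, 10.0], [12.0, 10.0]]
toy_clusters = mcquitty_clusters(toy_profiles, n_clusters=2)
```

A robustness check in the spirit of the bootstrap analysis described above could rerun `mcquitty_clusters` on resampled subsets of the patients and compare the resulting memberships.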

Chi-square tests
We explored the distribution of final diagnoses and known cancer risk characteristics across the patient clusters. We then constructed five cross-tables in which the patient clusters were listed in rows; and the final diagnosis category, absence/presence of proteinuria, pathological stage, pathological grade, or absence/presence of malignant cytology, was listed in columns. When the number of observed counts was <5 in >80% of cells in any of these tables, we merged groups as previously described (Table 1), prior to undertaking Chi-square analysis.
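To make the cross-table analysis concrete, the sketch below computes the Pearson Chi-square statistic and its degrees of freedom for a contingency table, together with the fraction of cells holding observed counts below 5, which is the quantity the merging rule above refers to. It is a plain-Python illustration; the counts are invented and the function names are hypothetical. A p-value would additionally require the Chi-square distribution, for example from a statistics package.

```python
def chi_square(table):
    """Return (statistic, degrees of freedom) for a rows x columns count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


def fraction_sparse_cells(table, threshold=5):
    """Fraction of cells whose observed count is below the threshold."""
    cells = [value for row in table for value in row]
    return sum(value < threshold for value in cells) / len(cells)


# Invented 2 x 2 example: rows = two patient clusters, columns = absent/present.
toy_table = [[30, 10], [10, 30]]
toy_stat, toy_dof = chi_square(toy_table)
toy_sparse = fraction_sparse_cells([[1, 2], [3, 40]])
```

In this toy table every expected count is 20, so each of the four cells contributes (10^2)/20 = 5 to the statistic.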

Identification of biomarker clusters
To allow us to exploit the full complement of biomarker data for subsequent classifications, we conducted hierarchical clustering to identify substructures within the 29 biomarkers themselves. That means for each biomarker k we used X(,k) as a profile vector to conduct an agglomerative clustering for the 29 biomarkers. Thus each biomarker's profile vector was based on the levels of the biomarker measured in each of the 157 patients. On the assumption that biomarkers within individual biomarker clusters would be similar to each other and, hence, contain redundant biological information about patients, we subsequently used one biomarker from each cluster for the classification of individual patient clusters and patient subpopulations, as described in the next section.

Random forest classification (RFC)
As our classification method, we used RFC, which is an ensemble method consisting of multiple decision trees which, taken together, can be used to assign each patient into either of two categories. The overall classification of the RFC is obtained by combining the individual votes (classifications) of all individual trees, that is, by a majority vote [19,20]. We used the biomarker clusters to estimate the effective dimension of a feature set for the classification of the patient subpopulations. Each RFC was, therefore, constructed using one biomarker from each of the seven biomarker clusters. We estimated the area under the receiver operating characteristic curve (AUROC) by using out-of-bag samples, which means that the trees of a RFC were trained with bootstrap data which omit approximately one-third of the cases each time a tree is trained. These samples, called out-of-bag samples, are used as test data sets to estimate the classification errors [19].
Biomarkers were measured in triplicate except for those marked ‡ for which only a single analysis was undertaken. Twenty of the 29 biomarkers were measured using Biochip Array Technology (BAT) which facilitates the simultaneous analyses of multiple proteins [17]. ELISA, enzyme-linked immunosorbent assay; UC, urothelial cancer.
As a benchmark, we first determined the classification error and the AUROC of RFCs with 1,000 trees for all possible collectives of biomarkers for the total population, that is, 157 patients. Second, we determined classification errors and AUROCs for RFCs for each of the three largest natural patient clusters. Third, we determined classification errors and AUROCs of RFCs for 14 clinically defined subpopulations of patients.
We assumed that clusters/subpopulations with similar contributory biomarkers to their classifiers were more homogeneous than subpopulations with different contributory biomarkers. On this basis, we compared contributory biomarkers to the RFCs for the three largest patient clusters and also compared contributory biomarkers across the split patient populations. For example, we compared the biomarkers that contributed to the RFC for the 101 smokers to the biomarkers that contributed to the RFC for the 56 non-smokers. Similarly, we compared biomarkers that contributed to RFCs across gender, history of stone disease, history of BPE, anti-hypertensive medication, anti-platelet medication, and anti-ulcer medication.
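Three ingredients of the RFC evaluation described in this section (bootstrap sampling with out-of-bag cases, the majority vote over trees, and the AUROC) can be sketched in plain Python as follows. This is not a Random Forest implementation and not the software used in the study; the function names, the seed and the toy scores are hypothetical. The AUROC is computed here as the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half.

```python
import random


def bootstrap_split(n_cases, rng):
    """Draw n_cases indices with replacement; unsampled indices are out-of-bag."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_votes):
    """Combine 0/1 votes from individual trees (a tie is resolved as 0 here)."""
    return int(2 * sum(tree_votes) > len(tree_votes))


def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Probability that a given patient is left out of one bootstrap sample of 157.
expected_oob_fraction = (1 - 1 / 157) ** 157

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
in_bag, out_of_bag = bootstrap_split(157, rng)

# Invented scores for two UC patients (label 1) and two controls (label 0).
example_auroc = auroc([0.9, 0.4, 0.6, 0.3], [1, 1, 0, 0])
```

The value of `expected_oob_fraction` is close to 1/e, roughly 0.37, which is consistent with the statement that approximately one-third of the cases are omitted each time a tree is trained.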

Non-random distribution of final diagnoses across patient clusters
When we clustered the 157 patients on the basis of their individual patient biomarker profiles, this resulted in five patient clusters (Figure 1). We observed that the final diagnosis categories were non-randomly distributed across the patient clusters (Figure 2A).

Non-random distribution of cancer-risk characteristics across patient clusters
Further, we observed that the red, purple and gold patient clusters illustrated in Figure 1, were enriched with patients with 'high cancer-risk' characteristics [2,4,21]. Conversely, the blue and green patient clusters were enriched with patients with 'low cancer-risk' characteristics ( Figure 2). On the basis of these observations we designated the red, purple and gold natural patient clusters as 'high-risk' and the blue and green patient clusters as 'low-risk'.
Further, it is important to emphasize that the division of UC tumors into NMI and MI is arbitrary and perhaps too simplistic. For example, there will be a significant difference in risk between a pT1 tumor with minimal submucosal invasion and a pT1 tumor with extensive submucosal invasion with the concomitant risk of lymphovascular invasion. Grade reflects the degree of differentiation within a tumor. When we explored the pathological grades of the UC tumors, 21/33 (64%) UC patients in the 'high-risk' patient clusters had grade 3 disease (dark brown bars) compared to 14/45 (31%) in the 'low-risk' clusters (Figure 2D). In addition, we found that there were significant differences in malignant cytology (14.1% versus 48.9%, P = 0.001) between 'low-risk' and 'high-risk' patient clusters.

Reduction of the complexity of the biomarker data
We used hierarchical clustering to identify the most informative set of biomarkers for use as feature vectors for UC diagnostic classifiers. Hierarchical clustering identified seven biomarker clusters consisting of $N_b = (2, 2, 6, 5, 4, 3, 7)$ biomarkers (Figure 3). We assumed that biomarkers within individual clusters would contain redundant biological information about the patients and that it was sufficient to select one biomarker to represent each cluster. Overall, this provided us with a systematic way to estimate the number of representative biomarkers, which could be considered as the effective dimension of the biomarker-space. From this it follows that the total number of combinations is only 10,080, as given by $N_C = 2 \cdot 2 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 7 = 10{,}080$, each corresponding to a 7-tuple of biomarkers. Hence, the grouping of biomarkers into seven groups broke down the combinatorial complexity of the overall problem, allowing us to conduct an exhaustive search in this constrained set of biomarkers. In contrast, an unconstrained, exhaustive search would not have been feasible because the number of unconstrained feature combinations for up to 7-dimensional feature vectors is larger than 2.1 million, as given by $\sum_{k=1}^{7} \binom{29}{k} = 2{,}182{,}395$. This is more than two orders of magnitude larger than $N_C$, making an exhaustive search computationally infeasible.
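The two counts quoted in this paragraph can be checked with a few lines of Python using only the standard library. The cluster sizes are those reported above; the biomarker labels such as `c0_b1` are placeholders, not the real biomarker names.

```python
from itertools import product
from math import comb, prod

# Sizes of the seven biomarker clusters reported in the text.
cluster_sizes = (2, 2, 6, 5, 4, 3, 7)

# Constrained search: pick exactly one biomarker from each cluster.
n_constrained = prod(cluster_sizes)

# Unconstrained search: any subset of 1 to 7 biomarkers out of all 29.
n_unconstrained = sum(comb(29, k) for k in range(1, 8))

# Enumerate the constrained 7-tuples using placeholder biomarker labels.
clusters = [
    [f"c{ci}_b{bi}" for bi in range(size)]
    for ci, size in enumerate(cluster_sizes)
]
combos = list(product(*clusters))

ratio = n_unconstrained / n_constrained
```

Each element of `combos` is one candidate feature set for a classifier, which is what makes the exhaustive search over the constrained space tractable.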
For all possible $N_C = 10{,}080$ biomarker combinations, we determined the classification error and the AUROC of RFCs for each of the following: (1) all the 157 patients, (2) the three largest patient clusters from Figure 1, and (3) 14 subpopulations which were split on the basis of clinical or demographic parameters.

Contributory biomarkers to UC diagnostic classifiers for the low-risk patient clusters were similar
Only two of the patient clusters, those shown in blue and green in Figure 1, contained sufficient numbers, that is, 57 and 48, to train a RFC. However, for reasons of comparison, we also trained a RFC for the gold cluster, which contained 23 patients, 15 of whom were diagnosed with UC (Figure 2). We found that 4/7 biomarkers were the same in the diagnostic classifiers for the blue and green patient clusters suggesting that these patient clusters had biological similarities. This is interesting because we had designated patients within both of these clusters as 'low-risk'. Further, only 2/7 and 1/7 of the biomarkers, which contributed to the blue and green low-risk clusters, respectively, also contributed to the classifier for the gold cluster. This would suggest that the gold patient cluster had significantly different underlying biological properties in comparison to the blue and green clusters. These observations would concur with our risk stratification hypothesis. The standard deviation of the classification error and of the AUROC for this smaller gold cluster, in comparison to the blue and green patient clusters, increased by approximately 30% (Table 3).

Contributory biomarkers to UC diagnostic classifiers across clinically split patient subpopulations were different
When we determined classification errors and AUROCs of UC diagnostic RFCs for 14 clinically defined subpopulations we observed the highest AUROC = 0.843 (averaged over 100 repetitions) in the classifier for patients not taking anti-platelet medication (n = 118). For the clinically split subpopulations, we found that when specific biomarkers contributed to the UC diagnostic RFC for one clinically relevant subpopulation, they were less likely to contribute to the RFC for the complementary subpopulation. For example, compare the biomarkers across patient subpopulations taking anti-platelet medication to those not on the medication (Table 3).

Biomarkers associated with inflammatory conditions predominated two of the biomarker clusters
Biomarkers associated with inflammatory conditions predominated the black and brown biomarker clusters ( Figure 3). The black cluster contained C-reactive protein (CRP) and TNFα. The brown cluster comprised D-dimer, interleukin-1α, interleukin-1β, neutrophil-associated gelatinase lipocalin (NGAL) and total urinary protein. The latter five biomarkers were significantly elevated in urine from patients in the 'high-risk' patient clusters (Mann Whitney U, P <0.001) ( Table 4). NGAL is expressed by neutrophils and its main biological function is inhibition of bacterial growth [24]. NGAL, being resistant to degradation, is readily excreted in urine, both in its free form and in complex with MMP-9, which may protect it from degradation [24]. NGAL is also a useful biomarker of acute kidney disease [23]. Since the prevalence of kidney disease is one in six adults [25], NGAL should perhaps be an important consideration in urinary biomarker studies on patient populations which include high proportions of patients >50 years old. In our analyses, significantly higher NGAL levels were recorded in the purple patient subpopulation (1,379 ng/ml), 14/15 of whom had cancer, compared to levels measured in the patients in the gold group (464 ng/ml) ( Table 4) who had a greater diversity of final diagnoses (Figure 2A) (Mann Whitney U; P = 0.012).
Median EGF levels were significantly higher in the gold patient cluster (14 µg/ml) in comparison to the purple patient cluster (4 µg/ml) (Mann Whitney U; P <0.001) ( Table 4). Interestingly, 9/23 patients in the gold patient cluster had ≥ pT1G3 UC and the purple patient cluster included cancers other than UC (Figure 2). Bladder cancer risk and survival have been associated with genetic variation in the Epidermal growth factor receptor (EGFR) pathway [26].

Translation of risk and diagnostic classifiers from systems biology to the clinic
We have described how hierarchical clustering, conducted on the basis of individual patient biomarker profiles, identified patient clusters and how cancer-associated risk characteristics were non-randomly distributed across these clusters (Figures 1 and 2 and Tables 5, 6, 7, 8, 9, 10). These findings suggest that it should be possible to define risk classifiers which could be informative at the point of triage of hematuric patients. This approach could have the potential to significantly improve healthcare outcomes for patients with hematuria.
Biochip array technology [17] allows rapid and simultaneous measurement of the levels of multiple biomarkers. This technology will facilitate the translation of protein-based classifiers, as described in this manuscript, from the laboratory to the clinic [27]. Antibodies, raised against biomarkers contributing to an individual classifier, can be formatted onto a single biochip. We predict that risk stratification biochips and UC diagnostic biochips could be created and validated in the near future [28]. In clinical practice, scores between 0 and 1, from the risk and diagnostic UC biochips would make it possible to designate each patient with hematuria as a 'low-risk control', a 'high-risk control', a 'low-risk UC' or a 'high-risk UC' (Figure 4). Scores <0.4 obtained using the risk biochip would suggest that the likelihood of serious disease was low. Similarly, a score <0.4 obtained using the UC diagnostic biochip would suggest that it was unlikely that the patient had UC. In contrast, scores >0.6 from the risk or diagnostic biochip would be suggestive of serious disease or UC, respectively. Scores between 0.4 and 0.6 could be interpreted as indicative of potential risk and the possibility of UC.
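The proposed interpretation of the two biochip scores can be written down as a small decision rule. The following Python sketch uses the 0.4 and 0.6 cut-offs stated above; the function names and the label 'indeterminate' for scores between the cut-offs are assumptions made for illustration, since the text only describes such scores as indicative of potential risk and the possibility of UC.

```python
def band(score, low=0.4, high=0.6):
    """Map a biochip score in [0, 1] to 'low', 'intermediate' or 'high'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score < low:
        return "low"
    if score > high:
        return "high"
    return "intermediate"


def designate(risk_score, uc_score):
    """Combine the risk biochip and UC diagnostic biochip scores into one label."""
    risk = band(risk_score)
    uc = band(uc_score)
    if "intermediate" in (risk, uc):
        # Hypothetical label: the text does not name this category.
        return "indeterminate"
    diagnosis = "UC" if uc == "high" else "control"
    return f"{risk}-risk {diagnosis}"


example_labels = [
    designate(0.8, 0.9),
    designate(0.8, 0.1),
    designate(0.1, 0.9),
    designate(0.1, 0.2),
]
```

Such a label would be one input to triage, to be read alongside the clinical parameters rather than in place of them.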
The median level and the inter-quartile range (IQR) of each biomarker in each patient cluster are shown. The biomarkers are grouped vertically to reflect how they appear in the biomarker cluster dendrogram (Figure 3). BTA, bladder tumor antigen; CEA, carcino-embryonic antigen; CK18, cytokeratin 18; CRP, C-reactive protein; EGF, epidermal growth factor; FPSA, free prostate specific antigen; HA, hyaluronidase; IL, interleukin; LOD, limit of detection; MCP-1, monocyte chemoattractant protein-1; MMP-9, matrix metalloproteinase 9; NGAL, neutrophil-associated gelatinase lipocalin; NSE, neuron specific enolase; sTNFR1, soluble tumor necrosis factor receptor 1; sTNFR2, soluble tumor necrosis factor receptor 2; TM, thrombomodulin; TNFα, tumor necrosis factor α; VEGF, vascular endothelial growth factor; vWF, von Willebrand factor.
Figure 4 Translation of classifiers into biochip format for risk stratification of hematuria patients. In the future when a patient with hematuria presents in primary care, their urine and serum samples could be sent for evaluation using biochips (grey oblongs). One biochip could be created for risk stratification and one biochip for the diagnosis of UC. Each biochip would be formatted with approximately six antibody spots, referred to as test regions. The underlying concept of these biochips is based on procedures similar to an ELISA, that is, light readings are generated from each test region which are proportional to the bound protein that is present in each patient's sample. Computer software would generate a score between 0 and 1 for each patient's sample. For the risk biochip, scores <0.4 would suggest a low risk of serious disease, while scores >0.6 would suggest a high risk of serious disease. The patient could then be designated low-risk (green) or high-risk (red). Patients would then be screened using a second biochip, this time a UC diagnostic biochip. Similarly, scores <0.4 from the UC diagnostic biochip would suggest that it was unlikely that the patient would have bladder cancer while scores >0.6 would suggest that the patient requires further investigations to check for the presence of UC. The scores from both biochips would be interpreted alongside clinical parameters. The patient's clinician would then make a triage decision for that patient which would be informed by the biochip scores. For example, a high-risk UC patient (all red) could obtain a score >0.6 on the scale ranging from 0 to 1 for both biochips and likewise a low-risk control could receive a score <0.4 for both biochips. ELISA, enzyme-linked immunosorbent assay; UC, urothelial cancer.
If specificities and sensitivities for both biochips were >90%, this would mean a high-risk cancer patient would have a 1:10 chance of being wrongly classified as low-risk and subsequently a 1:10 chance of being wrongly classified as a control. In this scenario, out of 1,000 high-risk cancer patients approximately 810 would be correctly classified as high-risk cancers, approximately 90 as high-risk controls, approximately 90 as low-risk cancers and approximately 10 as low-risk controls (Figure 4). Following biochip analyses, patients with scores ≤0.2 from both biochips and no clinical risk factors, that is, low-risk controls, could be monitored in primary care. This would lead to a reduction in the number of cystoscopies in these patients. In another scenario, a proportion of patients might be assigned as high-risk control patients following analyses of their samples using the biochips. These patients should be investigated further because they could have other diseases, for example, kidney disease which could then be managed appropriately [21]. In this way, improved triage would result in expeditious diagnosis for a greater proportion of patients with hematuria who would then receive earlier and more effective therapeutic interventions. This would represent a significant healthcare improvement [29].
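The figures of 810, 90, 90 and 10 in the scenario above follow from multiplying two probabilities. The sketch below reproduces that arithmetic in Python under two stated assumptions: both biochips have a sensitivity of exactly 90% (the boundary of the >90% scenario), and the two classifications are independent of each other. The function name is hypothetical.

```python
def expected_counts(n_patients, risk_sensitivity, uc_sensitivity):
    """Expected classification of true high-risk cancer patients by two biochips.

    Assumes the risk call and the UC call are statistically independent.
    """
    miss_risk = 1.0 - risk_sensitivity
    miss_uc = 1.0 - uc_sensitivity
    return {
        "high-risk cancers": n_patients * risk_sensitivity * uc_sensitivity,
        "high-risk controls": n_patients * risk_sensitivity * miss_uc,
        "low-risk cancers": n_patients * miss_risk * uc_sensitivity,
        "low-risk controls": n_patients * miss_risk * miss_uc,
    }


# 1,000 high-risk cancer patients, 90% sensitivity on each biochip.
counts = expected_counts(1000, 0.9, 0.9)
rounded = {label: round(value) for label, value in counts.items()}
```

Under these assumptions only about 1% of high-risk cancer patients would be misclassified by both biochips, which is the group that would otherwise be directed to monitoring in primary care.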
Single biomarkers have failed to be diagnostic for hematuria and many other complex diseases. Panels of biomarkers, in addition to clinical information, provide a large array of patient data that can be highly informative and have potential for diagnostic and prognostic decision making. However, the difficulties to date with large amounts of patient biomarker data are that they do not manage or group all patients in a clinically meaningful way. Systems biology is a developing technology [30] that has evolved new and different ways to analyze very large and complex datasets, such as those relating to sequencing of the genome and those collected from complex diseases. We have described how patients with hematuria naturally cluster into risk groupings on the basis of their individual biomarker profiles. This challenges the current practice in hematuria clinics which prioritizes diagnosis of patients with bladder cancer. Patients in the 'high-risk' clusters included controls, that is, patients without bladder cancer. However these 'controls' may have other cancers or may have neoplasms at very early stages of carcinogenesis, that is, below the size threshold for detection. Because cystoscopy is not a perfect diagnostic tool and because there is an urgent need to identify all patients with serious disease at the hematuria clinic, the findings in this paper represent a significant advance in the approach to triage and diagnosis of hematuria patients.

Conclusions
When we clustered patients with hematuria on the basis of their individual patient biomarker profiles, we identified five patient clusters. We observed that the final diagnoses for the 157 patients with hematuria were non-randomly distributed across these patient clusters. Other 'high cancer-risk' characteristics, that is, proteinuria, pathological stage, pathological grade and malignant cytology were also non-randomly distributed across the patient clusters. Indeed, we identified three patient clusters that were enriched with patients who harbored 'high cancer-risk' characteristics and two patient clusters that were enriched with patients with 'low cancer-risk' characteristics. These findings indicate the feasibility of creating risk classifiers that could inform the triage of patients with hematuria. Risk classifiers could improve decision-making at the point of triage. This would result in a more accurate and timely diagnosis for patients with serious disease thus improving outcomes for a greater proportion of patients [1,2,29].

Authors' contributions
FES performed statistical analyses, developed concepts, interpreted the statistical analyses and drafted the manuscript. FA was involved in the conception and design of the case control study and read drafts of the manuscript. RdeMS performed components of the statistical analyses and interpreted the same. BD interpreted the findings, wrote components of the manuscript, read drafts of the manuscript and contributed to discussions that shaped the manuscript. MR performed the protein analyses, read drafts of the manuscript and contributed to discussions that shaped the manuscript. CR performed the protein analyses, read drafts of the manuscript and contributed to discussions that shaped the manuscript. OR participated in the statistical analyses of the data and wrote sections of the manuscript. LW participated in the statistical analyses of the data, wrote sections of the manuscript and read drafts of the manuscript. HFOK was involved in the conception and design of the case control study and read drafts of the manuscript. DOR assessed the tumor pathology and read drafts of the manuscript. NHA assessed the cytology and was involved in the conception and design of the study. TN read drafts of the manuscript and contributed to discussions about the clinical significance. KW was involved in the conception and design of the case control study, undertook and interpreted the statistical analyses, developed and tested novel concepts, contributed to discussions that shaped the manuscript and wrote final drafts of the manuscript. All authors read and approved the final manuscript.

Competing interests
We declare that MWR and CNR are employees of Randox Laboratories Ltd who undertook the biomarker analyses using Biochip Array Technology. Randox Laboratories funded the salary of FA who recruited the patients to the case control study over two years. MWR, CNR, and KEW are named inventors on British Patent No 0916193.6, which protects the biomarkers in algorithms previously published in Cancer 2012 DOI: 10.1002/cncr.26544. FES, RdeMS, BD, OR, LW, HFOK, DOR, NHA and TN declare that they have no competing interests.
OR, CW and KEW were funded by Queen's University Belfast. RdeMS was funded by a grant from BBSRC. Funding for the manuscript will be provided from Queen's University Belfast.