Predicting COPD 1-year mortality using prognostic predictors routinely measured in primary care

Background Chronic obstructive pulmonary disease (COPD) is a major cause of mortality. Patients with advanced disease often have a poor quality of life, such that guidelines recommend providing palliative care in their last year of life. Uptake and use of palliative care in advanced COPD is low; difficulty in predicting 1-year mortality is thought to be a major contributing factor. Methods We identified two primary care COPD cohorts using UK electronic healthcare records (Clinical Practice Research Datalink). The first cohort was randomised equally into training and test sets. An external dataset was drawn from a second cohort. A risk model to predict mortality within 12 months was derived from the training set using backwards elimination Cox regression. The model was given the acronym BARC based on putative prognostic factors including body mass index and blood results (B), age (A), respiratory variables (airflow obstruction, exacerbations, smoking) (R) and comorbidities (C). The BARC index predictive performance was validated in the test set and external dataset by assessing calibration and discrimination. The observed and expected probabilities of death were assessed for increasing quartiles of mortality risk (very low risk, low risk, moderate risk, high risk). The BARC index was compared to the established index scores body mass index, obstructive, dyspnoea and exacerbations (BODEx), dyspnoea, obstruction, smoking and exacerbations (DOSE) and age, dyspnoea and obstruction (ADO). Results Fifty-four thousand nine hundred ninety patients were eligible from the first cohort and 4931 from the second cohort. Eighteen variables were included in the BARC, including age, airflow obstruction, body mass index, smoking, exacerbations and comorbidities. The risk model had acceptable predictive performance (test set: C-index = 0.79, 95% CI 0.78–0.81, D-statistic = 1.87, 95% CI 1.77–1.96, calibration slope = 0.95, 95% CI 0.9–0.99; external dataset: C-index = 0.67, 95% CI 0.65–0.7, D-statistic = 0.98, 95% CI 0.8–1.2, calibration slope = 0.54, 95% CI 0.45–0.64) and acceptable accuracy predicting the probability of death (probability of death in 1 year, n high-risk group, test set: expected = 0.31, observed = 0.30; external dataset: expected = 0.22, observed = 0.27). The BARC compared favourably to existing index scores that can also be applied without specialist respiratory variables (area under the curve: BARC = 0.78, 95% CI 0.76–0.79; BODEx = 0.48, 95% CI 0.45–0.51; DOSE = 0.60, 95% CI 0.57–0.61; ADO = 0.68, 95% CI 0.66–0.69, external dataset: BARC = 0.70, 95% CI 0.67–0.72; BODEx = 0.41, 95% CI 0.38–0.45; DOSE = 0.52, 95% CI 0.49–0.55; ADO = 0.57, 95% CI 0.54–0.60). Conclusion The BARC index performed better than existing tools in predicting 1-year mortality. Critically, the risk score only requires routinely collected non-specialist information which, therefore, could help identify patients seen in primary care that may benefit from palliative care. Electronic supplementary material The online version of this article (10.1186/s12916-019-1310-0) contains supplementary material, which is available to authorized users.


Introduction
Chronic obstructive pulmonary disease (COPD) is associated with significant mortality and morbidity and is one of the most prevalent chronic diseases globally; in the UK, it is the fifth highest cause of death [1,2]. As COPD progresses, patients experience significant decreases in functional capacity, quality of life, social ability and psychological well-being, impairments that are analogous to those from lung cancer. There is growing evidence and increasing expert opinion that palliative care should have a prominent role in patients with end-stage COPD [3,4]. UK clinical guidelines (National Health Service, National Institute for Health and Care Excellence, National Council for Palliative Care) all recommend starting palliative care in the year before people die, with the goal of both improving their quality of life and addressing end-of-life planning [3,5]. The healthcare workers best placed to enable this are often those in primary care. However, we have previously shown in the UK that only 1 in 5 COPD patients within the last year of life are provided palliative care, and a recent Canadian study of COPD patients with advanced disease found a similarly low proportion [6,7]. One major barrier to provision is the challenge of predicting patient survival, due to the irregular disease trajectory of COPD, which is usually one of slow decline, punctuated by sudden unpredictable exacerbations that often end in death [4,[8][9][10]. This is in contrast to lung cancer, where there is often a reasonable level of physical function until a short period of relatively predictable decline. This may partly explain why COPD patients are much less likely to receive palliative care than patients with lung cancer [7].
Many derived prognostic indices can help long-term mortality prediction in COPD, but the ability to predict death at 12 months is currently limited, thought in part to be because the original derivation of some of the scores was to predict mortality over several years, as well as the lack of inclusion of important prognostic factors, such as comorbidities [4,8,11]. Furthermore, these risk scores have been derived using subgroups of patients, in particular patients from secondary care, where more specialised test results are available. Hence, these indices often cannot be applied to the general COPD population, for example, the BODE index is the most commonly used yet requires knowledge of a patient's exercise capacity, measured by their 6-min walk test, which is not routinely carried out in a primary care setting. This limitation prevents those that most commonly attend to COPD patients, healthcare workers within primary care, from identifying COPD patients that would benefit from palliative care. Lastly, the simplicity of the most commonly used predictive indexes may impede their predictive ability, such that addition of clinical variables increased their performance [11]. This seems especially relevant when adding comorbidities as putative prognostic predictors; comorbidities such as cardiovascular disease, cerebrovascular disease and lung cancer are both associated with an increased mortality and are highly prevalent in COPD patients. Moreover, there is evidence to suggest COPD patients are more likely to die from their comorbidities than the disease itself [12].
The aim of this study was to devise a prognostic tool, based on routinely collected variables within primary care, which could provide a 12-month mortality prognosis for general COPD patients. To carry this out, we used the UK's largest longitudinal database of electronic healthcare records and incorporated in our analysis all recorded putative predictive risk factors; these risk factors were based on previous published indices and risk scores.

Data sources
Data from the Clinical Practice Research Datalink (CPRD) was used to derive the prognostic risk model. CPRD currently covers more than 11 million patients, who represent the population, including with respect to gender and age, containing primary care clinical, prescription and test data [13]. To obtain data on exacerbations, socioeconomic status and mortality, linkage respectively to Hospital Episode Statistics (HES), Index of Multiple Deprivation (IMD) and Office of National Statistics (ONS) data was obtained; just over 60% of CPRD practices have patient-level linkage to HES-IMD-ONS.

Study populations
All patients had a COPD diagnosis as determined using a previously validated algorithm [14]. Patients' data were eligible for inclusion after the latest of their COPD diagnosis date, the date the GP practice began recording research quality data, their continuous CPRD registration date, or cohort start date. Patients' data were censored at the earliest of their date of death, end of study (26 June 2015), the GP practice last collection date or the date of transfer out of a CPRD-linked practice. Two study populations were drawn. The first had a cohort start date of 1 January 2010, and an arbitrary index date (time from which the 1-year mortality prognosis model could be applied) set as the first annual COPD review that occurred 12 months after eligibility. This cohort was used to derive the model and internally validate the model.
A second population was drawn that did not have a recorded annual review date and had data drawn from an earlier time period. The second cohort start date was 1 January 2004, and index date was arbitrarily set as the first day after 12 months of eligible data had occurred. Patients were excluded if they had a recorded annual review date between 1 January 2004 and 26 June 2015, and if they had missing values required for the model.

Outcome and prognostic predictors
Death was defined as mortality from any cause. The following prognostic predictors were chosen, based on published indices and risk scores, using appropriate Read codes (codes are available upon request): history of smoking (current or ex-smoker), MRC dyspnoea score, bereavement, myocardial infarction, asthma, osteoporosis, diabetes, hypertension, dementia, lung cancer, heart failure, stroke, anxiety, depression, atrial fibrillation, pulmonary embolism, coronary artery disease, gastric/duodenal ulcer disease, breast cancer, pancreatic cancer, pulmonary fibrosis, stroke, long-term oxygen therapy, influenza and pneumococcal vaccinations (this can be given every 5 years; if records did not extend beyond 5 years and did not show vaccination, this was recorded as missing) [8]. The COTE score (based on the presence of multiple comorbidities, including lung fibrosis, pancreatic cancer and diabetes with neuropathy) was also calculated [15]. Lung fibrosis was defined as any interstitial lung disease (ILD), e.g. sarcoidosis, idiopathic pulmonary fibrosis, rheumatoid arthritis-associated ILD. Prescription data was used to identify patients that had ever used an inhaled corticosteroid (ICS), long-acting beta agonist (LABA), or long-acting muscarinic antagonist (LAMA). Test results were used to identify the following variables, FEV 1 , GOLD staging (FEV 1 and FVC), C-reactive protein (CRP), albumin (low = < 35 g/L), haemoglobin, fibrinogen, platelets (low = < 150 × 10 9 /L, high = > 400 × 10 9 /L) and creatinine; creatinine above 120 μmol/L for males, or 110 μmol/L for females, was used to define chronic kidney disease (CKD). BMI was measured as kg/m 2 (underweight < 19, normal = 19-25, overweight = 25-30, obese ≥ 30). Exacerbations, treated within primary (labelled as moderate) or secondary care (labelled as severe), were identified using a validated algorithm [16,17]. Severe exacerbations were categorised as none, 1-2 hospitalisations annually and ≥ 3 hospitalisations annually. The rules for variable inclusion are defined in Additional file 1: Table S1.

Multivariable prognostic scores
Only three of the nine multivariable scores, which have previously been used to address mortality in unselected COPD patients at 1 year, were able to be derived from routinely collected primary care data. These were ADO (age, dyspnoea and airflow obstruction), BODEx (BMI, airflow obstruction, dyspnoea and exacerbations) and DOSE (dyspnoea, airflow obstruction, smoking status and exacerbations). The scores were derived as per original publication, using variables as defined above (MRC dyspnoea score, FEV 1 , smoking status, exacerbations, BMI) [18][19][20].

Modelling the putative prognostic predictors
The dataset was randomly divided equally into two datasets: a training set, used to derive the model, and a test set, used to internally validate the risk model.
Variables exceeding 50% missing were excluded from the model. An imputation model was defined for each variable with ≤ 50% missing data. Data were assumed to be missing at random, and values for the missing predictors were imputed using multiple imputation techniques based on chained equations [21]. A total of 10 imputed datasets were generated.
To derive the risk model, Cox regression models were fitted using the data from the training set with all predictors (with the exclusion of the COTE scores). Backwards elimination with a stack approach [21] was used, using a 5% significance level for variable selection and weights equal to 1/10 for each one of the imputed training datasets. The coefficient estimates for the final model were combined from the imputed datasets using Rubin's rules [22]. Proportional hazard assumptions were tested for the final model.
The probability of mortality at 1 year for a patient can be calculated using the following equation, derived from the Cox proportional hazards model: where S 0 (t) is the baseline survival probability at time t (i.e. at 1 year in this study). The prognostic index, i.e. the linear predictor of the Cox model, is the quantity we used as our proposed index. The index was given the acronym BARC based on putative prognostic factors including body mass index and blood results (B), age (A), respiratory variables (airflow obstruction, exacerbations, smoking) (R) and comorbidities (C).

Validation of the risk model
To validate the predictive ability of the risk model at 12 months, we relied on the calculation of the BARC index in the test set using the coefficients obtained in the development phase. The model was validated internally in the test dataset and in the external dataset (drawn from the second COPD cohort). Measures assessing calibration (calibration slope) and discrimination (Harrel's C-index and D-statistic) were calculated [23][24][25]. Calibration slope assesses the agreement between predicted and observed risks. A calibration slope of 1 suggests perfect calibration, while a value diverging from 1 is indicative of poorer agreement. A value of 0.5 for C-index indicates no discrimination, and 1 indicates perfect discrimination. A model with no discriminatory ability will produce D value equal to 0, and better separation is achieved with higher values. The performance measures were estimated in each imputed validation test dataset, overall measures were calculated by combining the estimates using Rubin's rules, and in the external dataset.
Graphical illustration of calibration is given by comparing observed (Kaplan-Meier) and predicted survival probabilities in several prognostic groups. Groups were derived by placing cut points on the BARC based on meaningful quantiles [26,27]. We categorised BARC index's at the 1st quartile, median and 3rd quartile of the time of death, i.e. not counting censored observations, to create four risk groups.

Comparing observed and predicted mortality probability
The observed mortality probability was calculated by the proportion of deceased patients in the sample within a year. The same four groups used to graphically calibrate the model were used to classify subjects in very low, low, moderate and high risk [28]. Mortality could then be compared for patients in each risk group between that observed and the predicted mortality using the BARC index.

Comparing the risk model with established multivariable prognostic scores
To compare the predictive capability of the BARC index with that of ADO, BODEx and DOSE scores, we plotted the receiver operating characteristic (ROC) curves and calculated their associated area under the curves (AUC) for the survival threshold of interest, i.e. 1 year. As a sensitivity analysis, the scores were compared on the first cohort (training and test set) without lung cancer.
All statistical analyses were carried out using STATA (version 15) and R (version 3.5.0).

Characteristics of the COPD populations
There were 54,990 eligible COPD patients in the first cohort, from which the training and test datasets were drawn, of whom 21% died during study follow-up; median follow-up was 2.7 years (Additional file 1). The cohort had a median age of 70 years, around half were male, median BMI corresponding to overweight and a median FEV 1 of 1.48 L ( Table 1). All of the cohort had a history of at least one documented comorbidity. Only 1.2% of the cohort had a high COTE index. As might be expected, the cohort that died were slightly older, had a lower FEV 1 , had experienced more moderate and severe exacerbations, were on more inhaled medication and had in general more comorbidities. There were 4931 eligible COPD patients in the external validation dataset (Additional file 2: Figure S1), drawn from the second COPD cohort of whom 29% died during study follow-up; median follow-up was 2.1 years. The dataset had a median age of 71 years, 55% were males, and a median FEV 1 was 1.52 L ( Table 2). The patients that died were older, had a lower FEV 1 and had more exacerbations and comorbidities.

Prevalence of prognostic predictors
In the first primary care COPD cohort, there was < 5% missing data for the most commonly applied prognostic predictors, MRC dyspnoea score, BMI, smoking status, exacerbation history and age, except for FEV 1 where there was 20% missing. Other predictors that had missing values were blood tests; CRP had 79% missing, albumin, haemoglobin, and platelets had around 30% missing and creatinine only had 23% missing. Only 16% of patients did not have a blood test within 12 months of their annual review, and only 3% were taken within 7 days either side of an exacerbation. There was 70% of patients with missing data for the pneumococcal vaccine. All other variables, unless derived from the abovementioned variables, were < 5% missing.

Identification of the risk model
After imputation for the missing values and stepwise elimination, 18 different variables remained in the model, including age, BMI, FEV 1 , severe exacerbations, smoking status, multiple comorbidities, haemoglobin and platelets ( Table 3).
The marginal predictions for the risk of death at 1 year were obtained by the following equation where the baseline survival is estimated by means of a fractional polynomial and the prognostic index is the linear combination of the coefficients given in Table 3 with the values of the corresponding variables.

Validation of the risk model
The predictive performance and calibration of the BARC index was high in the test dataset and satisfactory in the external dataset (test set: C-index 0.79, 95% CI 0.78-0.81; D-statistic 1.9, 95% CI 1.8-2.0; calibration slope 0.95, 95% CI 0.90-0.99, and external dataset: C-index 0.67, 95% CI 0.65-0.70; D-statistic 0.98, 95% CI 0.83-1.14; calibration slope 0.54, 95% CI 0.45-0.64) ( Table 4). We depict the observed and fitted survival probabilities, with pointwise 95% confidence intervals for the latter, at 3, 6 and 9 months, other than at 1 year, to give a visual trend of the survival probabilities. The graphical analysis confirms the satisfactory calibration of the BARC index (Additional file 3: Figure S2), even if the predictions in some of the groups were slightly higher than the observed.

Comparing mortality between that observed and predicted
There was an increasing probability of death with each increasing risk group (Fig. 1). The BARC index estimated the probability of dying to within 1% of the observed probability in the high-risk group, in the training and test sets, and within 5% in the external dataset.

Comparing the BARC to ADO, BODEx and DOSE
The ROC curve of the BARC index was consistently above any of the curves associated with both ADO, BODEx and DOSE scores, showing that our model performed better in the test dataset than the three scores ( Fig. 2 and Additional file 4: Figure S3). This result is confirmed by the associated AUCs and their corresponding 95% confidence intervals (Table 5). BARC index still performed better in the sensitivity analysis, removing lung cancer patients (Additional file 1: Tables S2-S4).

Discussion
From a large cohort of primary care COPD patients, we have derived a 12-month mortality predictive model, the BARC index, with acceptable discrimination and calibration when externally validated. The predictive performance of the model also compared favourably to the commonly used ADO, DOSE and BODEx indexes. The BARC index is comprised of variables commonly included in established predictive indexes, such as airway obstruction, age, smoking status and dyspnoea assessment, as well as several comorbidities and blood biomarkers linked to general health (including serum albumin and haemoglobin). A significant difference between our more favourable model and established scores is the addition of comorbidities. The presence of comorbid disease is common, with at least 80% of COPD patients estimated to have one or more additional chronic disorders; indeed, those within 1 year of death have an even larger proportion with comorbid disease [7,29]. It is also associated with significantly increased mortality; up to two thirds of deaths are thought to be from comorbid disease not COPD [12,15,30]. Perhaps unexpectedly, most cardiovascular comorbidities were not included in the model at the 5% significance level; however, this may be because this model addressed shorter-term 12-month mortality whereas cardiovascular disease has relatively longer-term effects, than some other comorbidities, such as cirrhosis, lung cancer and cerebrovascular disease that were included. Furthermore, cardiovascular mortality continues to decrease [31]. The specific comorbidities Hb haemoglobin. Comorbidities included in COTE are indicated index (COTE) uses 12 comorbidities, but no respiratory parameters, and provides a good 5-year mortality prediction [15]. However, COTE has not been assessed for predicting mortality at 1 year, and as it was derived in secondary care, it requires specialised knowledge on disease status that is not always available. In this respect, the CODEX index (based on the Charlson index and BODEx), derived from a selective cohort of hospitalised COPD patients, also requires in-depth knowledge on comorbidities [18]. In comparison, many variables that are associated with COPD severity, including medication use, moderate exacerbations and GOLD staging, were not included in the model at the 5% significance level. This information in itself points to the complexity of understanding COPD mortality and highlights again the influence of comorbid conditions on mortality. One advantage of the BARC index is that it is practical, and user-friendly, as it incorporates routinely collected data easily available within primary care, which could also allow the risk score to be embedded in the electronic healthcare records system. In addition,   because it was derived and validated in two large nationally representative COPD populations, and nearly 90% of UK population is registered in primary care, this aids the generalisability of the risk score to all COPD populations. The cohorts used had similar mortality rates to other COPD cohorts (data not shown) [11,32]. However, the generalisability could have been reduced as we used an annual review as the arbitrary time point from which to start the study; 20% of the cohort did not have one during their study period, and this was largely due to their short length of research quality data available (i.e. only had just over 1 year of CPRD data therefore not long enough to have 1 year of data and an annual review) rather than lack of attendance to their annual  review. This generalisability issue was overcome as the external dataset contained patients without an annual review during that time period. Another possible limitation of the derivation of the risk score is that five variables (FEV1, albumin, haemoglobin, platelets and creatinine) had to be imputed due to missing data, which potentially could have led to misclassification, though the percentage missing was only around 10 to 30%. The low percentage of missing data in the first cohort was likely due to some selection bias as these patients all had an annual review; there was higher percentage missing in the second cohort, with 15% missing FEV 1 and 50% missing MRC dyspnoea score. In the first cohort, many of the missing variables appeared to be missing due to a relatively short follow-up period before death (in the UK FEV 1 is routinely measured every 18 months); nevertheless, FEV 1 can readily be measured by spirometry if required for the index. Although blood tests were missing from some patients, these provided significant predictive value to the model and were mostly performed less than a year before the annual review date. Moreover, we feel it is likely in patients where a GP is considering this index, they will have had a blood test in the recent past; if not, this information can easily be obtained from a simple single blood test. A strength of this study is the use of such a large cohort of patients to derive the model from; this also provided the power to assess less-common comorbidities (including cirrhosis and dementia) as statistically significant prognostic markers that may not have been found in a smaller sample size. Information on the end of life has been identified as of intrinsic interest to patients, carers and healthcare professionals, but the lack of the ability to approximately predict mortality is thought to be one of the key barriers to providing this information. Therefore, the identification of this accurate, user-friendly, predictive model that is applicable in primary care, could aid communication, shared decision-making and ultimately a palliative care approach directed from primary care. Our findings suggest the currently used predictive scores may be too simple and that incorporating more clinical variables, in particular comorbidities, significantly improves predictive performance. Of course, a risk score only aids decision-making, and physicians should use their clinical acumen and discuss with patients and their families to decide when palliative care is appropriate (it may be appropriate long before the last year of life); a risk score should not be used in isolation as a screening tool for palliative care [28]. is supported by the Marie Curie Chair's grant (MCCC-509537).

Availability of data and materials
The data that support the findings of this study are available from the UK CPRD, but restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available.