A heart failure phenotype stratified model for predicting 1-year mortality in patients admitted with acute heart failure: results from an individual participant data meta-analysis of four prospective European cohorts

Background Prognostic models developed in general cohorts with a mixture of heart failure (HF) phenotypes, though more widely applicable, are also likely to yield larger prediction errors in settings where the HF phenotypes have substantially different baseline mortality rates or different predictor-outcome associations. This study sought to use individual participant data meta-analysis to develop an HF phenotype stratified model for predicting 1-year mortality in patients admitted with acute HF. Methods Four prospective European cohorts were used to develop an HF phenotype stratified model. A Cox model with two rounds of backward elimination was used to derive the prognostic index. A Weibull model was used to obtain the baseline hazard functions. The internal-external cross-validation (IECV) approach was used to evaluate the generalizability of the developed model in terms of discrimination and calibration. Results A total of 3577 acute HF patients were included, of whom 2368 were classified as having HF with reduced ejection fraction (EF) (HFrEF; EF < 40%), 588 as having HF with midrange EF (HFmrEF; EF 40–49%), and 621 as having HF with preserved EF (HFpEF; EF ≥ 50%). A total of 11 readily available variables made up the prognostic index. For four of these predictor variables, namely systolic blood pressure, serum creatinine, myocardial infarction, and diabetes, the effect differed across the three HF phenotypes. With a weighted IECV-adjusted AUC of 0.79 (0.74–0.83) for HFrEF, 0.74 (0.70–0.79) for HFmrEF, and 0.74 (0.71–0.77) for HFpEF, the model showed excellent discrimination. Moreover, there was good agreement between the average observed and predicted 1-year mortality risks, especially after recalibration of the baseline mortality risks. Conclusions Our HF phenotype stratified model showed excellent generalizability across four European cohorts and may provide a useful tool in HF phenotype-specific clinical decision-making.


Background
Heart failure (HF) is a rapidly growing public health concern with high prevalence, poor prognosis, and high cost. It is estimated that 0.4-2.2% of the population in industrialized countries suffer from HF, with 500,000-600,000 incident cases diagnosed each year [1]. Data from the 2016/2017 UK National Heart Failure Audit [2] showed that mortality remains high, with in-hospital and 1-year post-discharge mortality rates of 9.4% and 23.3%, respectively. The total medical expenditure on HF is predicted to rise from US$20.9 billion to US$53.1 billion, of which 80% is attributed to increased hospitalization [3]. These figures are expected to worsen further as the global population ages. Accurately predicting prognosis in HF can help in tailoring treatments to subgroups of patients, as was recently shown for the selective adenosine A1 receptor antagonist rolofylline [4] as well as for the disease management programs evaluated in the COACH study [5].
Many clinical prediction models have been developed with the goal of helping physicians stratify patients with HF [6]. Some of these models were developed in patient populations with a particular HF phenotype, such as the Seattle Heart Failure Model (SHFM) [7], which was developed in the setting of HF with reduced ejection fraction (HFrEF), while others were developed in more general cohorts with a mixture of HF phenotypes, such as the MAGGIC risk score [8]. While models developed in such heterogeneous populations are more widely applicable, they are also likely to yield larger prediction errors for two reasons. The first is that baseline mortality rates may differ across the three HF subtypes: several large studies [9,10] indicated that mortality in HF with preserved ejection fraction (HFpEF) is lower than in HFrEF, even after adjusting for age, sex, and clinical covariates, although a recent meta-analysis [11] showed no significant difference in mortality rates between HFrEF and HFpEF. The second is that predictor-outcome associations may differ across HF subtypes. Age, systolic blood pressure (SBP), and diabetes, in particular, were shown in large cohort studies [8,12] to have different associations with mortality in patients with HFrEF and HFpEF. Reducing uncertainty in risk prediction models by addressing these two factors is essential to improve prediction accuracy, which could in turn lead to improvements in advance care planning, treatment adherence, and integration with wider healthcare teams such as palliative care. The purpose of this study was to use individual participant data (IPD) meta-analysis to develop an HF phenotype stratified model for predicting 1-year mortality in patients admitted with acute HF.

Study cohorts
Four cohorts were included in the IPD meta-analysis: BIOSTAT-index, BIOSTAT-validation, TRIUMPH, and COACH (Table 1). Detailed inclusion and exclusion criteria of the four cohorts are provided in Table S1 in Additional file 1. In short, BIOSTAT-CHF [13] was a large European project that aimed to characterize biological pathways related to response or non-response to the recommended therapy for HF. To characterize these pathways, two independent HF cohorts were assembled: an index cohort (BIOSTAT-index) consisting of 2516 patients from 69 centers in 11 European countries and a validation cohort (BIOSTAT-validation) consisting of 1738 patients from 6 centers in Scotland, UK. TRIUMPH [14] was a translational bench-to-bedside study program encompassing the entire spectrum from biomarker discovery to clinical validation. The clinical validation study was an observational prospective study that enrolled 475 patients admitted with acute HF from 14 centers in the Netherlands. This study was designed to establish the clinical value of biomarkers that successfully passed the bio-informatics and early-validation stages of TRIUMPH as well as to further evaluate more established biomarkers of HF. COACH [15] was a multicenter randomized controlled trial (RCT) that enrolled 1023 patients admitted with acute HF. This study was designed to evaluate the long-term effects of moderate or intensive disease management on outcome in patients with HF. All patients provided written informed consent. This study was conducted in compliance with the Declaration of Helsinki and was approved by all relevant local ethics committees.
Patients who were enrolled from outpatient clinics (N = 1625), had missing outcome data (N = 29), or had missing ejection fraction values (N = 459) were successively excluded for the present analysis. This resulted in a total sample of 3577 patients, of which 2368 were HFrEF patients, 588 were HF with midrange ejection fraction (HFmrEF) patients, and 621 were HFpEF patients. The HF subtypes were defined according to the European Society of Cardiology guidelines: HFrEF as < 40%, HFmrEF as 40-49%, and HFpEF as ≥ 50% [16].
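The ejection fraction thresholds above can be written as a small classification rule. The following is a minimal Python sketch, not code from the study; the function name is hypothetical, and treating non-integer values between 49% and 50% as HFmrEF is an assumption, since the guideline bands are stated in whole percentages.

```python
def classify_hf_phenotype(ef_percent: float) -> str:
    """Map a left ventricular ejection fraction (in %) to an HF phenotype.

    Thresholds follow the bands stated in the text:
    HFrEF < 40%, HFmrEF 40-49%, HFpEF >= 50%.
    """
    if not 0 <= ef_percent <= 100:
        raise ValueError("ejection fraction must be a percentage between 0 and 100")
    if ef_percent < 40:
        return "HFrEF"
    if ef_percent < 50:  # assumption: values such as 49.5 fall in the midrange band
        return "HFmrEF"
    return "HFpEF"
```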

Outcome and predictor variables
The outcome of interest was 1-year mortality, defined as the time from hospital admission to death from any cause within 1 year after hospital admission. The candidate predictor variables consisted of a set of demographic, clinical, and laboratory variables that were selected according to clinical knowledge, literature [6], and data availability. This included age, sex, myocardial infarction (MI), atrial fibrillation, COPD, peripheral arterial disease, stroke, diabetes, previous HF hospitalization, NYHA class, SBP, diastolic blood pressure, heart rate, BMI, hemoglobin, N-terminal pro-B-type natriuretic peptide (NT-proBNP), serum potassium, serum sodium, serum creatinine, blood urea nitrogen (BUN), coronary artery bypass grafting (CABG), and implantable cardioverter defibrillator (ICD) or pacemaker. Medication use was excluded from the candidate variables because it might be confounded by disease severity influencing tolerability [17]. For the clinical and laboratory variables, the measurements closest to the day of hospital admission were taken. Since patients who died during the index admission were excluded from the COACH study [15], the survival times for patients in COACH were left-truncated at the time of hospital discharge.

Model derivation
Our prognostic model consists of two parts: (i) a prognostic index (PI) that captures the effects of the predictor variables, and (ii) HF subtype (HFrEF, HFmrEF, and HFpEF) specific baseline hazard functions that determine the baseline mortality rates in these three subpopulations.
Following Royston et al. [18], the PI was estimated from a Cox model stratified by cohort and HF subtype. First, a full model with all the predictors and their interactions with HF subtype was built. Backward elimination was then applied to the interaction terms. Another round of backward elimination was subsequently applied to the main effect terms, with the main effects of variables with significant interaction terms retained in the model. The significance level to stay in the model was set to 0.05. The counting process method was used to account for the left-truncated time-to-event data in COACH [19]. Missing values for the predictor variables were handled using multiple imputation, with Rubin's rules applied to obtain pooled estimates and P values at each step of the two backward elimination procedures [20]. Fractional polynomials were used to check the linearity of continuous predictors and to find suitable transformations in case the linearity assumption did not hold [21].
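To illustrate the pooling step, the sketch below applies Rubin's rules to one regression coefficient estimated in each imputed dataset. It is a minimal Python illustration, not the study's code: the function name and the input numbers are hypothetical, and the degrees-of-freedom calculation needed for pooled P values is omitted.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: the squared standard error of that coefficient in each dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching inputs")
    pooled_estimate = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Hypothetical log hazard ratios and variances from three imputed datasets.
estimate, std_error = pool_rubin([0.5, 0.7, 0.6], [0.04, 0.04, 0.04])
```

The pooled standard error exceeds the within-imputation standard error whenever the estimates differ between imputations, which is how the uncertainty due to missing data enters each elimination step.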
The baseline hazard functions were obtained by fitting an HF subtype stratified Weibull model to the pooled data with the PI obtained from the Cox model included as an offset. The full parameterization of our HF subtype stratified prognostic model can be found in Additional file 2.

Model validation
Model performance was assessed in terms of discrimination and calibration [22]. Discrimination was assessed using the area under the cumulative/dynamic time-dependent ROC (AUC) computed at the evaluation time of 1 year [23]. Calibration was assessed by calibration plots comparing predicted vs. observed 1-year mortality rates in total and in subgroups with different predicted risks.
To evaluate the generalizability of our prognostic model, both raw AUCs and internal-external cross-validated AUCs were computed. The internal-external cross-validation (IECV) approach was also used for generating the calibration plots. IECV is a sequential approach in which every study is excluded once to serve as an external validation cohort for a prognostic model developed in the remaining three cohorts [24]. In this way, it can be evaluated whether the derived model has good prognostic separation in independent cohorts and whether the baseline mortality is comparable across study populations.
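The leave-one-cohort-out structure of IECV can be sketched as a simple loop. The Python below is an illustrative skeleton under stated assumptions, not the study's implementation: `fit` and `evaluate` are hypothetical placeholders for model development and for a performance measure such as the AUC, and the toy cohorts contain made-up numbers.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def internal_external_cv(
    cohorts: Dict[str, Sequence[float]],
    fit: Callable[[List[Sequence[float]]], float],
    evaluate: Callable[[float, Sequence[float]], float],
) -> Dict[str, float]:
    """Hold out each cohort once, fit on the others, score on the held-out one."""
    scores = {}
    for held_out, validation_data in cohorts.items():
        training_data = [data for name, data in cohorts.items() if name != held_out]
        model = fit(training_data)
        scores[held_out] = evaluate(model, validation_data)
    return scores


# Toy stand-ins (hypothetical): the "model" is the pooled mean outcome and the
# "performance measure" is its absolute error in the held-out cohort.
def fit_mean(training_data):
    return mean(value for cohort in training_data for value in cohort)


def abs_error(model, validation_data):
    return abs(model - mean(validation_data))


toy_cohorts = {"A": [1, 1], "B": [2, 2], "C": [3, 3], "D": [4, 4]}
toy_scores = internal_external_cv(toy_cohorts, fit_mean, abs_error)
```

With four cohorts the loop yields four held-out performance estimates, one per cohort, which can then be inspected individually or combined into a weighted summary.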

Comparison with other risk scores
To compare the predictive performance of our model with the predictive performance of three existing risk scores, namely the MAGGIC risk score [8], the GWTG-HF score [25], and the BCN Bio-HF Calculator [26], the AUCs of these three models were compared with the internal-external cross-validated AUCs of our model. Calculations were performed separately for each of the four study cohorts and stratified by HF phenotype.

Patient population
Baseline characteristics stratified by study cohort are provided in Table 2. The extent of missing data for baseline characteristics is provided in Table S2 in Additional file 1. The proportion of missing data for most of the candidate predictors was very small (< 2%). BUN and NT-proBNP had relatively larger proportions of missing data (6.7% and 33.2%, respectively).

Clinical prediction model
The final model included 11 predictors: age, COPD, NYHA class, hemoglobin, serum sodium, BUN, NT-proBNP, SBP, serum creatinine, MI, and diabetes. Four of these predictors, namely SBP, serum creatinine, MI, and diabetes, interacted with HF subtype. SBP, BUN, serum creatinine, and NT-proBNP were transformed because of non-linear relationships with mortality. The relative effects of the predictors after transformation are presented in Table 3.
The PI for a specific patient is calculated as the linear combination of the regression coefficients (Table 3) and the values of the corresponding (transformed) predictors for that patient. The distribution of the PI in the pooled dataset is presented in Fig. 1, which also shows the predicted 1-year mortality risk associated with the different values of the PI stratified by HF subtype. Specifically, the median and interquartile range of the PI were −2.0 (−2.7 to −1.3) for HFrEF, −2.8 (−3.4 to −2.3) for HFmrEF, and −1.4 (−2.0 to −0.9) for HFpEF, which corresponded to predicted 1-year mortality risks of 14.8% (7.6 to 28.3%), 18.5% (10.8 to 28.4%), and 18.5% (11.3 to 28.9%) for HFrEF, HFmrEF, and HFpEF, respectively. The mathematical formulas underlying the predicted 1-year mortality risk curves shown in Fig. 1 are provided in Additional file 3 together with an illustration of how these calculations can be conducted for an example patient.
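As a rough illustration of how a PI translates into a predicted risk, the sketch below combines a linear predictor with a Weibull baseline hazard under a proportional hazards form. This is a hedged Python sketch rather than the published calculation: the parameterization H0(t) = scale × t^shape, the function names, and all coefficient and parameter values are assumptions for illustration; the actual formulas and values are those in Additional files 2 and 3.

```python
from math import exp


def prognostic_index(coefficients: dict, predictors: dict) -> float:
    """Linear combination of coefficients and (transformed) predictor values."""
    return sum(beta * predictors[name] for name, beta in coefficients.items())


def weibull_mortality_risk(pi: float, scale: float, shape: float,
                           t_years: float = 1.0) -> float:
    """Predicted mortality risk by time t under proportional hazards.

    Assumed cumulative baseline hazard: H0(t) = scale * t ** shape.
    Risk(t) = 1 - exp(-H0(t) * exp(PI)).
    """
    cumulative_baseline_hazard = scale * t_years ** shape
    return 1.0 - exp(-cumulative_baseline_hazard * exp(pi))


# Hypothetical coefficients and patient values, not taken from Table 3.
example_coefficients = {"age": 0.03, "copd": 0.4}
example_patient = {"age": 70, "copd": 1}
example_pi = prognostic_index(example_coefficients, example_patient)

# Hypothetical baseline parameters; in a phenotype stratified model each
# HF subtype would have its own (scale, shape) pair.
example_risk = weibull_mortality_risk(-2.0, scale=1.0, shape=1.0)
```

Because the baseline parameters are phenotype-specific, the same PI value can map to different predicted risks in HFrEF, HFmrEF, and HFpEF.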

Model validation
The raw AUCs (95% CIs) for HFrEF, HFmrEF, and HFpEF. Moreover, the relatively small differences between the estimated and predicted AUCs for the individual cohorts in the IECV showed that the discrimination reproduced well across four cohorts (Table 4).
In the pooled dataset, the average observed vs. predicted 1-year mortality rates were 20.3% vs. 20.6% for HFrEF, 21.2% vs. 21.5% for HFmrEF, and 21.3% vs. 21.6% for HFpEF. The results of the IECV showed that the discrepancies between the observed vs. predicted 1-year mortality rates were larger for the four individual cohorts (Fig. 2), especially for BIOSTAT-index and BIOSTAT-validation. These discrepancies disappeared after recalibration of the baseline mortality risks in each of the omitted cohorts [27] (Fig. 2). Calibration plots comparing the average observed vs. predicted 1-year mortality in different risk groups (deciles of predicted 1-year mortality in HFrEF and quintiles of predicted 1-year mortality in HFmrEF and HFpEF) yielded similar findings (Additional file 1: Figures S1-S6).

Comparison with other risk scores
In HFrEF, our model outperformed the three existing risk scores. In HFmrEF and HFpEF, the BCN Bio-HF score showed a similar performance to our model, while the predictive performance of the MAGGIC score and the GWTG-HF score was lower (Table 5).

Discussion
Using data collected from 3577 patients across four European cohorts, we developed an HF phenotype stratified model for predicting 1-year mortality in patients admitted with acute HF. The results of the IECV showed that the model discriminated well in HFrEF, HFmrEF, and HFpEF. These results also showed good agreement between the average observed and predicted 1-year mortality risks, especially after recalibration to the cohort-specific baseline risks. The vast majority of the existing prediction models were derived using data from a single HF cohort and then either internally validated or externally validated using data from a second HF cohort. An alternative approach that makes better use of the available data is to perform IPD meta-analysis [24]. While the use of IPD meta-analysis can result in more generalizable prediction models [28], this approach has so far only been applied for the MAGGIC risk score [8], which was predominantly developed in ambulatory HF patients. To our knowledge, our study was the first IPD meta-analysis to develop an HF phenotype stratified model in the setting of acute HF. By including patients from three prospective cohorts and one RCT across Europe, the patient population used to develop our model was relatively broad and heterogeneous, and closer to routine clinical practice, especially compared to previous models that were derived from a single HF cohort. Our model nevertheless still showed very good discriminative ability, with IECV-adjusted AUCs of 0.79 for HFrEF, 0.74 for HFmrEF, and 0.74 for HFpEF. The discriminative ability of our model is promising compared to the mean c-statistic of 0.71 across 117 models for predicting mortality in a meta-analysis [6].
Our model outperformed the MAGGIC risk score, especially in HFrEF, indicating that the MAGGIC risk score might not be applicable to patients with decompensated HF and may be more suitable for patients in a stable state. It is not unexpected that our model was also better than the GWTG-HF risk score, since the latter was initially developed to predict in-hospital mortality. The BCN Bio-HF risk score is a more comprehensive tool in that it incorporates a combination of three biomarkers (NT-proBNP, hs-cTnT, and ST2) into the model. Nevertheless, our model, by only incorporating NT-proBNP, performed equally well in HFmrEF and HFpEF, and even better in HFrEF. Lastly, our comparisons to the MAGGIC, GWTG-HF, and BCN Bio-HF risk scores are pragmatic but potentially unfair, since the predictors in our model were derived from the data we used for comparison. However, this bias should be largely mitigated because the AUCs of our model were adjusted using the IECV technique. Many of the prognostic factors identified in this study were already well established in previous studies. BUN and serum sodium were previously shown to have the highest predictive values among the most frequently used predictors and were also strongly associated with mortality in our study [6]. Most of the predictors in MAGGIC, such as age, SBP, COPD, diabetes, and serum creatinine, were further confirmed in our study. As in BIOSTAT-CHF [17], lower hemoglobin was associated with an increased risk of mortality. Consistent with several studies [29,30], NT-proBNP was confirmed to be strongly associated with mortality. The inclusion of NT-proBNP is a particular advantage of our model over the MAGGIC risk score. While it is still under debate whether the prognostic impact of NT-proBNP differs among HF subtypes [31], our study did not find an interaction between NT-proBNP and HF subtype.
A novel aspect of this study is that we used a stratified Cox model to account for the cross-phenotype heterogeneity; this phenotype-specific model allowed both the baseline mortality risk and the effects of the prognostic variables to differ for each phenotype. In particular, having a history of MI indicated increased mortality risk in HFrEF, while the effect of this variable was neutral in HFmrEF and HFpEF. It has been reported that ischemic etiology is associated with an increased risk of mortality in HFrEF but is neutral in HFpEF [7, 32-35], and thus, it is not surprising that a history of MI is more strongly associated with mortality in HFrEF.
Consistent with Go et al. [12], history of diabetes was associated with higher mortality in HFrEF, but neutral in HFmrEF and HFpEF in our study. However, this was discordant with two previous studies [34,36], in which diabetes was also associated with poor outcome in HFpEF. Consistent with MAGGIC, increased baseline SBP was associated with lower mortality in HFrEF, and this association disappeared in HFmrEF and HFpEF. Elevated serum creatinine was associated with lower mortality in HFmrEF, but neutral in HFrEF and HFpEF. This finding may suggest HFmrEF patients had a good diuretic response, which commonly showed an increase in serum creatinine, but still had good clinical outcomes [37]. Overall, we found the effect of the predictor variables to be similar for HFmrEF and HFpEF and more likely to be different for HFrEF, suggesting that HFmrEF is closer to HFpEF than to HFrEF.
The results of the IECV showed that our model discriminated well across the four different cohorts. Particularly in HFrEF, our model discriminated well not only in three cohorts close to routine clinical practice (BIOSTAT-index, BIOSTAT-validation, and TRIUMPH; AUC 0.76, 0.84, and 0.80), but equally well in the population from an RCT (COACH; AUC 0.76). In HFmrEF, the results suggested that the Scottish patients in BIOSTAT-validation might have a different predictor-outcome association from other patients. In HFpEF, our model discriminated very well in BIOSTAT-validation and COACH, though not so well in BIOSTAT-index and TRIUMPH. For BIOSTAT-index, this finding may be explained by the fact that HFpEF patients in this cohort were confined to NT-proBNP levels > 2000 pg/mL, resulting in a different population of HFpEF patients compared to the other three cohorts.
While differences in the baseline mortality risks among the four cohorts did not have a profound impact on model discrimination, model calibration was more heavily affected by this. For example, the predicted 1-year mortality risk was higher than the observed 1-year mortality risk for patients with HFrEF in BIOSTAT-index (Fig. 2), which is consistent with the lower observed mortality rate in this cohort relative to the other three cohorts (Table 2). Such discrepancies between the observed and predicted 1-year mortality risks attenuated after recalibration to the cohort-specific baseline risks, suggesting that more accurate predictions can be obtained by tailoring the parameter values of the baseline hazard functions to the baseline risk in the patient population to which the model is applied [28] (e.g., by taking the baseline hazard functions from the study cohort which has the closest observed outcome incidences).
The implication of our model relates to its ability to support bedside decision-making by complementing physician's clinical judgment. Currently, treatment decisions in HF are based on population-averaged effects observed in RCTs. However, patients enrolled in RCTs can differ substantially in their risks of outcome and treatment effects are not necessarily homogeneous across the risk spectrum [38]. For example, in the PROTECT trial, the experimental treatment rolofylline was found to have a neutral effect in the treatment of acute HF with renal dysfunction [39]. However, in a subsequent post hoc analysis, Demissei et al. [4] found this treatment effect to be moderated by the predicted 180-day all-cause mortality risk, with rolofylline being beneficial in higher-risk patients but harmful in low-risk patients. These results suggest that there may still be a window of opportunity for rolofylline and other novel acute HF therapies that showed disappointing population-averaged effects, such as serelaxin [40], provided that a more targeted approach is implemented for the administering of these treatments. Risk prediction models, such as the one developed in this paper, are fundamental in moving forward such a more personalized approach in the treatment of acute HF.
Our study has several limitations. Firstly, the IPD meta-analysis included relatively small numbers of HFmrEF and HFpEF patients, which may hinder the generalizability of the results to other HFmrEF and HFpEF populations. Secondly, only variables that were measured in all four cohorts were considered as candidate predictors. Some of the more recently established prognostic markers such as ST2 [41] and Galectin-3 [42] could therefore not be included in our prognostic model. Finally, all the predictors including the ejection fraction were treated as time-fixed covariates, meaning that their values were assumed to remain constant across the prediction period. This is a limitation when large fluctuations in the values of the predictor variables are expected. However, given the relatively short prediction window and good model performance, it seems reasonable to treat the predictors as time-fixed for the present study.

Conclusion
To conclude, using IPD meta-analysis, we were able to develop an HF phenotype stratified model for predicting 1-year mortality in patients hospitalized with acute HF that was generalizable across a range of European HF populations. Our model can therefore become a helpful tool in quantifying and classifying the prognosis of patients hospitalized with acute HF, allowing targeted treatment and management of those patients.