Evaluation and improvement of the National Early Warning Score (NEWS2) for COVID-19: a multi-hospital study

Background The National Early Warning Score (NEWS2) is currently recommended in the UK for the risk stratification of COVID-19 patients, but little is known about its ability to detect severe cases. We aimed to evaluate NEWS2 for the prediction of severe COVID-19 outcome and identify and validate a set of blood and physiological parameters routinely collected at hospital admission to improve upon the use of NEWS2 alone for medium-term risk stratification. Methods Training cohorts comprised 1276 patients admitted to King’s College Hospital National Health Service (NHS) Foundation Trust with COVID-19 disease from 1 March to 30 April 2020. External validation cohorts included 6237 patients from five UK NHS Trusts (Guy’s and St Thomas’ Hospitals, University Hospitals Southampton, University Hospitals Bristol and Weston NHS Foundation Trust, University College London Hospitals, University Hospitals Birmingham), one hospital in Norway (Oslo University Hospital), and two hospitals in Wuhan, China (Wuhan Sixth Hospital and Taikang Tongji Hospital). The outcome was severe COVID-19 disease (transfer to intensive care unit (ICU) or death) at 14 days after hospital admission. Age, physiological measures, blood biomarkers, sex, ethnicity, and comorbidities (hypertension, diabetes, cardiovascular, respiratory and kidney diseases) measured at hospital admission were considered in the models. Results A baseline model of ‘NEWS2 + age’ had poor-to-moderate discrimination for severe COVID-19 infection at 14 days (area under receiver operating characteristic curve (AUC) in training cohort = 0.700, 95% confidence interval (CI) 0.680, 0.722; Brier score = 0.192, 95% CI 0.186, 0.197). 
A supplemented model adding eight routinely collected blood and physiological parameters (supplemental oxygen flow rate, urea, age, oxygen saturation, C-reactive protein, estimated glomerular filtration rate, neutrophil count, neutrophil/lymphocyte ratio) improved discrimination (AUC = 0.735; 95% CI 0.715, 0.757), and these improvements were replicated across seven UK and non-UK sites. However, there was evidence of miscalibration with the model tending to underestimate risks in most sites. Conclusions NEWS2 score had poor-to-moderate discrimination for medium-term COVID-19 outcome which raises questions about its use as a screening tool at hospital admission. Risk stratification was improved by including readily available blood and physiological parameters measured at hospital admission, but there was evidence of miscalibration in external sites. This highlights the need for a better understanding of the use of early warning scores for COVID. Supplementary Information The online version contains supplementary material available at 10.1186/s12916-020-01893-3.

The National Early Warning Score (NEWS2), currently recommended for stratification of severe COVID-19 disease in the UK, showed poor-to-moderate discrimination for medium-term outcomes (14-day transfer to intensive care unit (ICU) or death) amongst COVID-19 patients. Risk stratification was improved by the addition of blood and physiological parameters routinely measured at hospital admission (supplemental oxygen, urea, oxygen saturation, C-reactive protein, estimated glomerular filtration rate, neutrophil count, neutrophil/lymphocyte ratio), which provided moderate improvements in a risk stratification model for 14-day ICU/death. This improvement over NEWS2 alone was maintained across multiple hospital trusts, but the model tended to be miscalibrated with risks of severe outcomes underestimated in most sites. We benefited from existing pipelines for informatics at King's College Hospital such as CogStack that allowed rapid extraction and processing of electronic health records. This methodological approach provided rapid insights and allowed us to overcome the complications associated with slow data centralisation approaches.

Background
As of 9 December 2020, there have been > 67 million confirmed cases of COVID-19 disease worldwide [1]. While approximately 80% of infected individuals have mild or no symptoms [2], some develop severe COVID-19 disease requiring hospital admission. Within the subset of those requiring hospitalisation, early identification of those who deteriorate and require transfer to an intensive care unit (ICU) for organ support or may die is vital.
Currently available risk scores for deterioration of acutely ill patients include (i) widely used generic ward-based risk indices such as the National Early Warning Score (NEWS2, [3]), (ii) the Modified Sequential Organ Failure Assessment (mSOFA) [4] and Quick Sequential Organ Failure Assessment [5] scoring systems, and (iii) the pneumonia-specific risk index, CURB-65 [6], which combines physiological observations with limited blood markers and comorbidities. NEWS2 is a summary score of six physiological parameters or 'vital signs' (respiratory rate, oxygen saturation, systolic blood pressure, heart rate, level of consciousness, and temperature) together with supplemental oxygen dependency, used to identify patients at risk of early clinical deterioration in the United Kingdom (UK) National Health Service (NHS) hospitals [7,8] and primary care. Some components (in particular, patient temperature, oxygen saturation, and supplemental oxygen dependency) have been associated with COVID-19 outcomes [2], but little is known about their predictive value for COVID-19 disease severity in hospitalised patients [9]. Additionally, a number of COVID-19-specific risk indices are being developed [10,11] as well as unvalidated online calculators [12], but their generalisability is unknown [13]. A Chinese study has suggested a modified version of NEWS2 with the addition of age only [14] but without any data on performance. With near-universal usage of NEWS2 in UK NHS Trusts since March 2019 [15], a minor adaptation to NEWS2 would be relatively easy to implement.
Our aim was to evaluate the NEWS2 score and identify which clinical and blood biomarkers routinely measured at hospital admission can improve medium-term risk stratification of severe COVID-19 outcome at 14 days from hospital admission. Our specific objectives were as follows:
1. To explore independent associations of routinely measured physiological and blood parameters (including NEWS2 parameters) at hospital admission with disease severity (ICU admission or death at 14 days from hospital admission), adjusting for demographics and comorbidities
2. To develop a prediction model for severe COVID-19 outcomes at 14 days combining multiple blood and physiological parameters
3. To compare the discrimination, calibration, and clinical utility of the resulting model with NEWS2 score and age alone using (i) internal validation and (ii) external validation at seven UK and international sites
A recent systematic review found that most existing prediction models for COVID-19 had a high risk of bias due to non-representative samples, model overfitting, or poor reporting [13]. The analyses presented here build upon our earlier work [24] which suggested that adding age and common blood biomarkers to the NEWS2 score could improve risk stratification in patients hospitalised with COVID-19. While incorporating external validation, this preliminary work was limited in that the training sample comprised 439 patients (the cohort available at the time of model development). In the present study, we (i) expand the cohort used for model development to all 1276 patients at King's College Hospital (KCH), (ii) use hospital admission (rather than symptom onset) as the index date, (iii) consider shorter-term outcomes (3-day ICU/death), (iv) improve the reporting of model calibration and clinical utility, and (v) increase the number of external sites from three to seven.

Study cohorts
The KCH training cohort (n = 1276) was defined as all adult inpatients testing positive for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) by reverse transcription polymerase chain reaction (RT-PCR). Data were extracted from structured and/or unstructured components of electronic health records (EHR) in each site as detailed below.

Measures

Outcome
For all sites, the outcome was severe COVID-19 disease at 14 days following hospital admission, categorised as transfer to the ICU/death (WHO-COVID-19 Outcomes Scales 6-8) vs. not transferred to the ICU/death (scales 3-5) [25]. For nosocomial patients (patients with symptom onset after hospital admission), the endpoint was defined as 14 days after symptom onset. Dates of hospital admission, symptom onset, ICU transfer, and death were extracted from electronic health records or ascertained manually by a clinician.

Blood and physiological parameters
We included blood and physiological parameters that were routinely obtained at hospital admission and that are routinely available in a wide range of national and international hospital and community settings. Measures available for fewer than 30% of patients were not considered (including troponin-T, ferritin, D-dimers, glycated haemoglobin (HbA1c), and Glasgow Coma Scale score). We excluded creatinine since this parameter correlates highly (r > 0.8) with, and is used in the derivation of, estimated glomerular filtration rate. We also excluded white blood cell count (WBC), which is highly correlated with neutrophil and lymphocyte counts.
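The two screening rules described above (a minimum availability of 30% and a check for highly correlated pairs) can be sketched as follows. This is an illustrative sketch only: the toy values, the variable names, and the use of `None` to mark missing values are assumptions, and the absolute value of the correlation is used here because creatinine and estimated GFR move in opposite directions.

```python
from math import sqrt

def availability(values):
    """Fraction of patients with a non-missing value (None marks missing)."""
    return sum(v is not None for v in values) / len(values)

def pearson_r(x, y):
    """Pearson correlation over patients with both values observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data for six patients (not study data).
data = {
    "creatinine": [60, 80, 100, 150, 200, 300],
    "egfr": [110, 90, 70, 45, 30, 15],
    "ferritin": [None, None, None, None, None, 400],
}

# Rule 1: keep only measures available for at least 30% of patients.
kept = [name for name, values in data.items() if availability(values) >= 0.30]

# Rule 2: flag a pair as redundant when |r| exceeds 0.8.
r = pearson_r(data["creatinine"], data["egfr"])
redundant = abs(r) > 0.8
print(kept, round(r, 2), redundant)
```

In this toy example, ferritin is dropped for low availability and the creatinine/eGFR pair is flagged; which member of a flagged pair to keep remains a clinical judgement.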

Demographics and comorbidities
Age, sex, ethnicity and comorbidities were considered. Self-defined ethnicity was categorised as White vs. non-White (Black, Asian, or other minority ethnic) and patients with ethnicity recorded as 'unknown/mixed/other' were excluded (n = 316; 25%). Binary variables were derived for comorbidities: hypertension, diabetes, heart disease (heart failure and ischemic heart disease), respiratory disease (asthma and chronic obstructive pulmonary disease (COPD)), and chronic kidney disease.

Data processing

King's College Hospital
Data were extracted from the structured and unstructured components of the electronic health record (EHR) using natural language processing (NLP) tools belonging to the CogStack ecosystem [26], namely MedCAT [27] and MedCATTrainer [28]. The CogStack NLP pipeline captures negation, synonyms, and acronyms for medical Systematised Nomenclature of Medicine Clinical Terms (SNOMED-CT) concepts as well as surrounding linguistic context using deep learning and long short-term memory networks. MedCAT produces unsupervised annotations for all SNOMED-CT concepts (Additional file 1: Table S1) under parent terms Clinical Finding, Disorder, Organism, and Event with disambiguation, pre-trained on MIMIC-III [29]. Starting from our previous model [30], further supervised training with MedCATTrainer improved detection of annotations and meta-annotations such as experiencer (is the annotated concept experienced by the patient or other), negation (is the concept annotated negated or not), and temporality (is the concept annotated in the past or present). Meta-annotations for hypothetical, historical, and experiencer were merged into "Irrelevant", allowing us to exclude any mentions of a concept that did not directly relate to the patient currently. Performance of the NLP pipeline for comorbidities mentioned in the text was evaluated on 4343 annotations in 146 clinical documents by a clinician (JT). F1 scores, precision, and recall are presented in Additional file 2: Table S2.

Guy's and St Thomas' NHS Foundation Trust
Electronic health records from all patients admitted to Guy's and St Thomas' NHS Foundation Trust who had a positive COVID-19 test result between 3 March and 21 May 2020, inclusive, were identified. Data were extracted using structured queries from six complementary platforms and linked using unique patient identifiers. Data processing was performed using Python 3.7 [31]. The process and outputs were reviewed by a study clinician.

University Hospitals Southampton
Data were extracted from the structured components of the UHS CHARTS EHR system and data warehouse. Data were transformed into the required format for validation purposes using Python 3.7 [31]. Diagnosis and comorbidity data of interest were gathered from the International Statistical Classification of Diseases (ICD-10) coded data. No unstructured data extraction was required for validation purposes. The process and outputs were reviewed by an experienced clinician prior to analysis.

University Hospitals Bristol and Weston NHS Foundation Trust
Data were extracted from the UHBW electronic health records system (Medway). ICD-10 codes were used for diagnosis and comorbidity data. Data were transformed in line with project specifications and exported for analysis in Python 3.7 [31].

University College Hospital London
Dates of hospital admission, symptom onset, ICU transfer, and death were extracted from electronic health records. The outcome (14-day ICU/death) was defined in UCLH as 'initiation of ventilatory support (continuous positive airway pressure, non-invasive ventilation, high-flow nasal cannula oxygen, invasive mechanical ventilation, or extracorporeal membrane oxygenation) or death', which is consistent with WHO-COVID-19 Outcomes Scales 6-8.

Wuhan cohort
Demographic, premorbid conditions, clinical symptoms or signs at presentation, laboratory data, and treatment and outcome data were extracted from electronic medical records using a standardised data collection form by a team of experienced respiratory clinicians, with double data checking and involvement of a third reviewer where there was disagreement. Anonymised data was entered into a password-protected computerised database.

University Hospitals Birmingham
Dates of hospital admission, symptom onset, ICU transfer, and death were extracted from electronic health records using the Prescribing Information and Communications System (PICS). The extracted data was transformed into the required format for validation purposes using Python 3.8 [31]. Diagnosis and comorbidity data of interest were gathered from ICD-10 coded data. The outcomes (3- and 14-day ICU/death) were defined consistent with WHO-COVID-19 Outcomes Scales 6-8.

Oslo University Hospital
All admitted patients with confirmed COVID-19 by positive SARS-CoV2 PCR were included in a quality registry. Data input into the register was manual. Register data was supplemented with test results from the laboratory information system (LIS) by matching exported Excel files from the register with exported Excel files from LIS. The fidelity of the match was checked against the original data source manually for a small number of patients. Only patients with symptoms consistent with COVID-19 were included in the study.

Statistical analyses
All continuous parameters were winsorized (at 1% and 99%) and scaled (mean = 0; standard deviation = 1) to facilitate interpretability and comparability [32]. Logarithmic or square root transformations were applied to skewed parameters. To explore independent associations of blood and physiological parameters with 14-day ICU/death (objective 1), we used logistic regression with Firth's bias reduction method [33]. Each parameter was tested independently, adjusted for age and sex (model 1), and then additionally adjusted for comorbidities (model 2). P values were adjusted using the Benjamini-Hochberg procedure to keep the false discovery rate (FDR) at 5% [34].
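The pre-processing and multiple-testing steps described above can be illustrated with a short standard-library sketch. It is not the study code: the quantile rule (linear interpolation), the toy values, and the function names are assumptions made for illustration, and Firth's bias-reduced regression is not shown.

```python
from statistics import mean, pstdev

def quantile(sorted_vals, q):
    """Empirical quantile with linear interpolation (one common convention)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def winsorize_and_scale(values, lower=0.01, upper=0.99):
    """Clip to the 1st/99th percentiles, then standardise to mean 0, SD 1."""
    ordered = sorted(values)
    lo, hi = quantile(ordered, lower), quantile(ordered, upper)
    clipped = [min(max(v, lo), hi) for v in values]
    mu, sd = mean(clipped), pstdev(clipped)
    return [(v - mu) / sd for v in clipped]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical values: 1..100 plus one extreme outlier.
values = list(range(1, 101)) + [10000]
z = winsorize_and_scale(values)

# Hypothetical unadjusted p values for four parameters.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

After winsorizing, the outlier of 10000 is pulled back to the 99th percentile, so it receives the same standardised value as the observation of 100.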
To evaluate NEWS2 and identify parameters that could improve prediction of severe COVID-19 outcomes (objectives 2 and 3), we used regularised logistic regression with a least absolute shrinkage and selection operator (LASSO) estimator that shrinks parameters according to their variance, reduces overfitting, and enables automatic variable selection [35]. The optimal degree of regularisation was determined by identifying a tuning parameter λ using cross-validation. To avoid overfitting and to reduce the number of false-positive predictors, λ was selected to give a model with an area under the receiver operating characteristic curve (AUC) one standard error below the 'best' model. To evaluate the predictive performance of our model on new cases of the same underlying population (internal validation), we performed nested cross-validation (10 folds for the inner loop; 10 folds/1000 repeats for the outer loop). Discrimination was assessed using AUC and Brier score. Missing feature information was imputed using k-nearest neighbour (kNN) imputation (k = 5). All steps (feature selection, winsorizing, scaling, and kNN imputation) were incorporated within the model development and selection process to avoid data leakage that would otherwise result in optimistic performance measures [36]. All analyses were conducted with Python 3.8 [31] using the statsmodels [37] and Scikit-Learn [38] packages.
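A minimal sketch of how pre-processing can be kept inside nested cross-validation with Scikit-Learn is given below. It is an illustration under stated assumptions rather than the study pipeline: the data are synthetic, the grid of `C` values (the inverse of λ) is arbitrary, five folds without repeats are used to keep it fast, the tuning parameter is chosen by best AUC rather than by the one-standard-error rule, and winsorizing is omitted.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 300 patients, 6 features, 2 truly predictive.
rng = np.random.default_rng(0)
n, p = 300, 6
X = rng.normal(size=(n, p))
logit = 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X[rng.random((n, p)) < 0.1] = np.nan  # roughly 10% of values set to missing

# Scaling, kNN imputation (k = 5), and the L1-penalised model are fitted
# together, so each training fold is pre-processed without seeing test data.
pipeline = Pipeline([
    ("scale", StandardScaler()),           # ignores NaN when fitting
    ("impute", KNNImputer(n_neighbors=5)),
    ("lasso", LogisticRegression(penalty="l1", solver="liblinear")),
])

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: choose the regularisation strength by cross-validated AUC.
search = GridSearchCV(pipeline, {"lasso__C": [0.01, 0.1, 1.0]},
                      scoring="roc_auc", cv=inner)

# Outer loop: estimate performance of the whole tuning procedure.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
print(round(outer_auc.mean(), 3))
```

Because the scaler and imputer sit inside the pipeline, they are re-estimated in every training fold, which is what prevents the data leakage described above.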
We evaluated the transportability of the derived regularised logistic regression model in external validation samples from GSTT (n = 988), UHS (n = 633), UHBW (n = 190), UCH (n = 411), UHB (n = 1037), OUH (n = 163), and Wuhan (n = 2815). Validation used LASSO logistic regression models trained on the KCH training sample, with code and pre-trained models shared via GitHub. Models were assessed in terms of discrimination (AUC, sensitivity, specificity, Brier score), calibration, and clinical utility (decision curve analysis, number needed to evaluate) [32,39]. Moderate calibration was assessed by plotting model-predicted probabilities (x-axis) against observed proportions (y-axis) with locally estimated scatterplot smoothing (LOESS) and logistic curves [40]. Clinical utility was assessed using decision curve analysis where 'net benefit' was plotted against a range of threshold probabilities. Unlike diagnostic performance measures, decision curves incorporate preferences of the clinician and patient. The threshold probability ($p_t$) is where the expected benefit of treatment is equal to the expected benefit of avoiding treatment [41]. Net benefit was calculated by counting the number of true positives (predicted risk > $p_t$ and experienced severe COVID-19 outcome) and false positives (predicted risk > $p_t$ but did not experience severe COVID-19 outcome) and using the formula below:

$$\text{Net benefit} = \frac{\text{true positives}}{N} - \frac{\text{false positives}}{N} \times \frac{p_t}{1 - p_t}$$

where $N$ is the total number of patients. Our model was developed as a screening tool, to identify at hospital admission patients at risk of more severe outcomes. The intended treatment for patients with a positive result from this model would be further examination by a clinician, who would make recommendations regarding appropriate treatment (e.g. earlier transfer to the ICU, intensive monitoring, treatment). We compared the decision curve from our model to two extreme cases of 'treat none' and 'treat all'. The 'treat none' (i.e. routine management) strategy implies that no patients would be selected for further examination by a clinician; the 'treat all' strategy (i.e. intensive management) implies that all patients would undergo further assessment. A model is clinically beneficial if the model-implied net benefit is greater than either the 'treat none' or 'treat all' strategies.
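The net benefit calculation can be made concrete with a small sketch, assuming the standard decision curve definition: net benefit = TP/N − (FP/N) × p_t/(1 − p_t), where TP and FP are the counts of true and false positives at threshold p_t and N is the number of patients. The outcomes and predicted risks below are hypothetical, and the function names are illustrative.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of flagging patients whose predicted risk exceeds threshold."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r > threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r > threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is flagged for further assessment."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = ICU transfer/death) and predicted risks.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
risk = [0.9, 0.7, 0.2, 0.6, 0.1, 0.1, 0.25, 0.05, 0.15, 0.1]

threshold = 0.30
nb_model = net_benefit(y, risk, threshold)
nb_all = net_benefit_treat_all(y, threshold)
nb_none = 0.0  # 'treat none' flags nobody, so net benefit is zero by definition

print(round(nb_model, 3), round(nb_all, 3), nb_none)
```

Repeating the calculation over a grid of thresholds and plotting the three quantities gives the decision curve; the model is useful at a given threshold only where its curve lies above both default strategies.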
Since the intended strategy involves a further examination by a clinician, and is therefore low risk, our emphasis throughout is on avoiding false negatives (i.e. failing to detect a severe case) at the expense of false positives. We therefore used thresholds of 30% and 20% (for 14-day and 3-day outcomes, respectively) to calculate sensitivity and specificity. This gave a better balance of sensitivity vs. specificity and reflected the clinical preference to avoid false negatives for the proposed screening tool.
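To show how a fixed risk threshold translates into screening metrics, the sketch below computes sensitivity and specificity at the 30% threshold for a set of hypothetical predictions. The number needed to evaluate is taken here as the reciprocal of the positive predictive value, which is an assumption about the definition used; the data and function name are illustrative.

```python
def screening_metrics(y_true, risk, threshold):
    """Sensitivity, specificity, and number needed to evaluate at a threshold."""
    tp = fp = tn = fn = 0
    for y, r in zip(y_true, risk):
        flagged = r > threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Flagged patients per true severe case (1 / PPV); infinite if none found.
        "nne": (tp + fp) / tp if tp else float("inf"),
    }

# Hypothetical outcomes (1 = ICU transfer/death) and predicted risks.
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
risk = [0.9, 0.7, 0.2, 0.6, 0.1, 0.1, 0.25, 0.05, 0.15, 0.1]

at_30 = screening_metrics(y, risk, threshold=0.30)
at_20 = screening_metrics(y, risk, threshold=0.20)
print(at_30)
print(at_20)
```

Lowering the threshold can only raise or preserve sensitivity while lowering or preserving specificity, which is the trade-off accepted when false negatives are considered more costly than false positives.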

Sensitivity analyses
We conducted five sensitivity analyses. First, to explore the ability of NEWS2 to predict shorter-term severe COVID-19 outcome, we developed models for ICU transfer/death at 3 days following hospital admission. All steps described above were repeated, including training (feature selection) and external validation. Second, following recent studies suggesting sex differences in COVID-19 outcome [18], we tested interactions between each physiological and blood parameter and sex using likelihood-ratio tests. Third, we repeated all models with adjustment for ethnicity in the subset of individuals with available data for ethnicity (n = 960 in the KCH training sample). Fourth, to explore the differences between community-acquired vs. nosocomial infection, we repeated all models after excluding 153 nosocomial patients (n = 1123). Finally, we considered an alternative baseline model of 'NEWS2 only'. Our primary analyses used a baseline model of 'NEWS2 + age' because NEWS2 is rarely used in isolation for prognostication and treatment decisions will incorporate other patient characteristics such as age.
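A likelihood-ratio test for a single interaction term compares the log-likelihoods of two nested models, one with and one without the parameter-by-sex interaction. The sketch below assumes one added coefficient (one degree of freedom), for which the chi-square tail probability has a closed form; the log-likelihood values are hypothetical and the function name is illustrative.

```python
from math import erfc, sqrt

def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """LR statistic and p value for nested models differing by one parameter."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods: model without vs. with the interaction term.
stat, p = likelihood_ratio_test_1df(loglik_reduced=-612.4, loglik_full=-611.9)
print(round(stat, 3), round(p, 3))
```

A small statistic, as in this example, indicates that adding the interaction does little to improve fit, which is the pattern that would be reported as no evidence of a difference by sex.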

Descriptive analyses
The KCH training cohort comprised 1276 patients admitted with a confirmed diagnosis of COVID-19 (from 1 March to 30 April 2020) of whom 389 (31%) were transferred to the ICU or died within 14 days of hospital admission. The validation cohorts comprised 6237 patients across seven sites. At UK NHS trusts, 30 to 42% of patients were transferred to the ICU or died within 14 days of admission. Disease severity was lower in the Wuhan sample, where 4% were transferred to the ICU or died. Table 1 presents the demographic and clinical characteristics of the training and validation cohorts. The UK sites were similar in terms of age and sex, with patients tending to be older (median age 59-74) and male (58 to 63%) but varied in the proportion of patients of non-White ethnicity (from 10% at UHS to 40% at KCH and UCH). Blood and physiological parameters were broadly consistent across UK sites.
Logistic regression models were used to assess independent associations between each variable and severe COVID-19 outcome (ICU transfer/death) in the KCH cohort. Additional file 3: Table S3 presents odds ratios adjusted for age and sex (model 1) and comorbidities (model 2), sorted by effect size. Increased odds of transfer to the ICU or death by 14 days were associated with NEWS2 score, oxygen flow rate, respiratory rate, CRP, neutrophil count, urea, neutrophil/lymphocyte ratio, heart rate, and temperature. Reduced odds of severe outcomes were associated with lymphocyte/CRP ratio, oxygen saturation, estimated GFR, and albumin.

Supplementing NEWS2 with routinely collected blood and physiological parameters
We considered whether routine blood and physiological parameters could improve risk stratification for medium-term COVID-19 outcome (ICU transfer/death at 14 days). When adding demographic, blood, and physiological parameters to NEWS2, nine features were retained following LASSO regularisation, in order of effect size: NEWS2 score, supplemental oxygen flow rate, urea, age, oxygen saturation, CRP, estimated GFR, neutrophil count, and neutrophil/lymphocyte ratio. Notably, comorbid conditions were not retained when added in subsequent models, suggesting most of the variance explained was already captured by the included parameters. Internally validated discrimination in the KCH training sample was moderate (AUC = 0.735; 95% CI 0.715, 0.757) but improved compared to 'NEWS2 + age' (Table 2). This improvement over NEWS2 alone was replicated in validation samples (Fig. 1). The supplemented model continued to show evidence of substantial miscalibration.

Sensitivity analyses
For the 3-day endpoint, 13% of patients at KCH (n = 163) and between 16 and 29% of patients in the UK and Norway were transferred to the ICU or died (Table 1). The 3-day model retained just two parameters following regularisation: NEWS2 score and supplemental oxygen flow rate. For the baseline model ('NEWS2 + age'), discrimination was moderate at internal validation (AUC = 0.764; 95% CI 0.737, 0.794; Additional file 4: Table S4) and external validation (AUC = 0.673 to 0.755), but calibration remained poor (Additional file 5: Figure S1). Moreover, the supplemented model ('NEWS2 + oxygen flow rate') showed smaller improvements in discrimination compared to those seen at 14 days. For the KCH training cohort, internally validated AUC increased by 0.025: from 0.764 (95% CI 0.737, 0.794) for 'NEWS2 + age' to 0.789 (0.763, 0.819) for the supplemented model ('NEWS2 + oxygen flow rate'). At external validation, improvements were modest (UHBW, OUH) or negative (GSTT) in some sites, but more substantial in others (UHS, UCH). Moreover, model calibration was considerably worse for the supplemented 3-day model (Additional file 5: Figure S1).
We found no evidence of difference by sex (results not shown) and the findings were consistent when additionally adjusting for ethnicity in the subset of individuals with ethnicity data and when excluding nosocomial patients (Additional file 6: Table S5). Discrimination for the alternative baseline model of 'NEWS2 only' (Additional file 7: Table S6) showed a similar pattern of results as those for 'NEWS2 + age', except that improvements in discrimination for the supplemented model ('All features') were larger in most sites.

Decision curve analysis
Decision curve analysis for the 14-day endpoint is presented in Fig. 3. At KCH, the baseline model ('NEWS2 + age') offered small increments in net benefit compared to the 'treat all' and 'treat none' strategies for risk thresholds in the range 25 to 60%. This was replicated in all validation cohorts except for UHBW and OUH where the net benefit for 'NEWS2 + age' was lower than the 'treat none' strategy beyond the 40% risk threshold. The supplemented model ('All features') improved upon 'NEWS2 + age' and the two default strategies in most sites across the range 20 to 80%, except for (i) UHBW, where 'treat none' was superior beyond thresholds of 55%, and (ii) GSTT, where 'treat all' was superior up to a threshold of 30% and no improvement was seen for the supplemented model. For the 3-day endpoint, the improvement in net benefit for the supplemented model over the two default strategies was smaller, compared to the improvements seen at 14 days (Additional file 8: Figure S2). At three sites (UHBW, GSTT, and Wuhan), neither the baseline ('NEWS2 + age') nor the supplemented ('All features') models offered any improvement over the 'treat all' or 'treat none' strategies. At KCH and UHS, net benefit for 'NEWS2 + age' was higher than the default strategies for a range of risk thresholds but was not increased further by the supplemented ('NEWS2 + oxygen flow rate') model.

Principal findings
This study is amongst the first to systematically evaluate NEWS2 for severe COVID-19 outcome and carry out external validation at multiple international sites (five UK NHS Trusts, one hospital in Norway, and two hospitals in Wuhan, China). We found that while 'NEWS2 + age' had moderate discrimination for short-term COVID-19 outcome (3-day ICU transfer/death), it showed poor-to-moderate discrimination for the medium-term outcome (14-day ICU transfer/death). Thus, while NEWS2 may be effective for short-term (e.g. 24 h) prognostication, our results question its suitability as a screening tool for medium-term COVID-19 outcome. Risk stratification was improved by adding routinely collected blood and physiological parameters, and discrimination in supplemented models was moderate-to-good. However, the model showed evidence of miscalibration, with a tendency to underestimate risks in external sites. The derived model for 14-day ICU transfer/death included nine parameters: NEWS2 score, supplemental oxygen flow rate, urea, age, oxygen saturation, CRP, estimated GFR, neutrophil count, and neutrophil/lymphocyte ratio. Notably, pre-existing comorbidities did not improve risk prediction and were not retained in the final model. This was unexpected but may indicate that the effect of pre-existing health conditions could be manifest through some of the included blood or physiological markers.
Overall, this study overcomes many of the factors associated with a high risk of bias in the development of prognostic models for COVID-19 [13] and provides some evidence to support the supplementation of NEWS2 for clinical decisions with these patients.

Comparison with other studies
A systematic review of 10 prediction models for mortality in COVID-19 infection [10] found broad similarities with the features retained in our models, particularly regarding CRP and neutrophil levels. However, existing prediction models suffer several methodological weaknesses including overfitting, selection bias, and reliance on cross-sectional data without accounting for censoring. Additionally, many existing studies have relied on single-centre or ethnically homogenous Chinese cohorts, whereas the present study shows validation across multiple and diverse populations. A key strength of our study is the robust and repeated external validation across national and international sites; however, evidence of miscalibration suggests we should be cautious when attempting to generalise these findings. Future research should include larger collaborations and aim to develop 'from onset' population predictions.
Fig. 3 Net benefit of supplemented NEWS2 model for 14-day ICU/death compared to default strategies ('treat all' and 'treat none') at training and validation sites
NEWS2 is a summary score derived from six physiological parameters and oxygen supplementation. The lack of evidence for NEWS2 use in COVID-19, especially in primary care, has been highlighted [9]. The oxygen saturation component of physiological measurements added value beyond NEWS2 total score and was retained following regularisation for 14-day endpoints. This suggests some residual association over and above what is captured by the NEWS2 score and reinforces the Royal College of Physicians guidance that the NEWS2 score has a ceiling with respect to respiratory function [42].
Cardiac disease and myocardial injury have been described in severe COVID-19 cases in China [2,23]. In our model, blood Troponin-T, a marker of myocardial injury, had additional salient signal but was only measured in a subset of our cohort at admission, so it was excluded from our final model. This could be explored further in larger datasets.

Strengths and limitations
Our study provides a risk stratification model for which we obtained generalisable and robust results across seven national and international sites with differing geographical catchment and population characteristics. It is amongst the first to evaluate NEWS2 at hospital admission for severe COVID-19 outcome and amongst a handful to externally validate a supplemented model across multiple sites.
However, some limitations must be acknowledged. First, there are likely to be other parameters not measured in this study that could substantially improve the risk stratification model (e.g. radiological features, obesity, or comorbidity load). These parameters could be explored in future work but were not considered in the present study to avoid limiting the real-world implementation of the risk stratification model. Second, our models showed better performance in UK secondary care settings amongst populations with higher rates of severe COVID-19 disease. Further research is therefore needed to investigate the suitability of our model for primary care and community settings, which have a high prevalence of mild disease. This would allow us to capture variability at earlier stages of the disease and trends in patients not requiring hospital admission. Third, while external validation across multiple national and international sites represents a key strength, we did not have access to individual participant data and model development was limited to a single site (KCH). Although we benefited from existing infrastructure to support rapid data analysis, infrastructure to support data sharing between sites is urgently needed to address some of the limitations of the present study (e.g. miscalibration) and improve the transferability of these models. Not only would this facilitate external validation, but more importantly, it would allow multi-site prediction models to be developed using pooled, individual participant data [43]. Fourth, our analyses would have excluded patients who experienced a severe COVID-19 outcome at home or at another hospital after being discharged from a participating hospital. Fifth, our model was restricted to blood and physiological parameters measured at hospital admission. This was by design and reflected the aim of developing a screening tool for risk stratification at hospital admission. However, future studies should explore the extent to which risk stratification could be improved by incorporating repeated measures of NEWS2 and relevant biomarkers.

Conclusions
The NEWS2 early warning score has been in near-universal use in UK NHS Trusts since March 2019 [15], but little is known about its use for COVID-19 patients.
Here, we showed that NEWS2 and age at hospital admission had poor-to-moderate discrimination for medium-term (14-day) severe COVID-19 outcome, which calls into question the use of NEWS2 as a tool to guide hospital admission. Moreover, we showed that NEWS2 discrimination could be improved by adding eight blood and physiological parameters (supplemental oxygen flow rate, urea, age, oxygen saturation, CRP, estimated GFR, neutrophil count, neutrophil/lymphocyte ratio) that are routinely collected and readily available in healthcare services. This type of model could therefore be easily implemented in clinical practice, and the predicted risk probabilities of individual patients are easy to communicate. At the same time, although we provided some evidence of improved discrimination compared with NEWS2 and age alone, the miscalibration observed in external sites means that our proposed model should be used as a complement to, and not as a replacement for, clinical judgement.
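The distinction drawn above between discrimination and calibration can be illustrated with a minimal sketch. It is not the study's analysis code: the function names, the toy outcomes, and the two sets of predicted risks are hypothetical and chosen only to show that a model can rank patients well (AUC) while still underestimating absolute risk (Brier score, observed minus expected event rate).

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Discrimination: probability that a randomly chosen patient with the
    outcome receives a higher predicted risk than one without (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def observed_minus_expected(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Crude calibration check: observed event rate minus mean predicted risk.
    A positive value means risks are underestimated on average."""
    n = len(y_true)
    return sum(y_true) / n - sum(y_prob) / n


# Hypothetical toy data (not from the study): 1 = severe outcome within 14 days.
y = [0, 0, 0, 1, 0, 1, 1, 1]
risk_a = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
# Same ranking of patients, but every predicted risk is halved.
risk_b = [p / 2 for p in risk_a]

for name, risk in (("A", risk_a), ("B", risk_b)):
    print(
        f"model {name}: AUC={auc(y, risk):.3f} "
        f"Brier={brier_score(y, risk):.3f} "
        f"O-E={observed_minus_expected(y, risk):+.3f}"
    )
```

Because the halved risks preserve the ordering of patients, the AUC is identical for both sets of predictions, whereas the Brier score and the observed-minus-expected difference both worsen. This is the pattern that motivates reporting calibration alongside discrimination when a model is moved to a new site.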

Availability of data and materials
Code and pre-trained models are available at https://github.com/ewancarr/NEWS2-COVID-19 and openly shared for testing in other COVID-19 datasets. Source text from patient records used at all sites in the study will not be available because it cannot be safely and fully anonymised to Information Commissioner's Office (ICO) standards and would be likely to contain strong identifiers (e.g. names, postcodes) and highly sensitive data (e.g. diagnoses). A subset of the KCH dataset limited to anonymisable information (e.g. only SNOMED codes and aggregated demographics) is available on request to researchers with suitable training in information governance and human confidentiality protocols, subject to approval by the King's College Hospital Information Governance committee; applications for research access should be sent to kch-tr.cogstackrequests@nhs.net. This dataset cannot be released publicly due to the risk of re-identification of such granular individual-level data, as determined by the King's College Hospital Caldicott Guardian. The GSTT dataset cannot be released publicly due to the risk of re-identification of such granular individual-level data, as determined by the Guy's and St Thomas' Trust Caldicott Guardian. The UHS dataset cannot be released publicly due to the risk of re-identification of such granular individual-level data, as determined by the University Hospital Southampton Caldicott Guardian.
The UCH data cannot be released publicly due to conditions of regulatory approvals that preclude open access data sharing to minimise the risk of patient identification through granular individual health record data. The authors will consider specific requests for data sharing as part of academic collaborations subject to ethical approval and data transfer agreements in accordance with the GDPR. The Wuhan dataset used in the study will not be available due to the inability to fully anonymise it in line with ethical requirements. Applications for research access should be sent to TS and details will be made available via https://covid.datahelps.life/prediction/. The OUH dataset cannot be released publicly due to the risk of re-identification of such granular individual-level data.
For UHBW, the project was considered a service evaluation by the organisational review board. Informed consent was deemed unnecessary due to the retrospective observational nature of the data. Ethical approval for GSTT was granted by the London Bromley Research Ethics Committee (reference 20/HRA/1871) to the King's Health Partners Data Analytics and Modelling COVID-19 Group to collect clinically relevant data points from patients' electronic health records. The Wuhan validation was approved by the Research Ethics Committee of Shanghai Dongfang Hospital and Taikang Tongji Hospital. For the OUH validation, a project protocol was approved by the Regional Ethical Committee of South-East Norway (Reference number 137045) and the OUH data protection officer (Reference number 20/08822). Informed consent in the OUH cohort was waived because of the strictly observational nature of the project.

Consent for publication
Not applicable.