Table 2 Measures for model’s performance assessment (definitions adapted from Riley et al. [31] )

From: External validation of inpatient neonatal mortality prediction models in high-mortality settings

1. Calibration

 This is how close the predicted risk of the mortality event is to the observed mortality event. This measure has two key components:

  (a) Calibration slope

   The calibration slope measures the agreement between the observed and predicted risks of the event (outcome) across the whole range of predicted values. For a perfectly calibrated model, we expect to see that, in 100 individuals with a predicted risk of r% from our model, r of the 100 truly have the outcome of interest (i.e. death in this case). The slope should ideally be 1. A slope < 1 indicates that some predictions are too extreme (e.g. predictions close to 1 are too high, and predictions close to 0 are too low), and a slope > 1 indicates predictions are too narrow. A calibration slope < 1 is often observed in validation studies, consistent with over-fitting in the original model development
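The source does not give an estimation procedure, but the calibration slope is conventionally obtained by regressing the observed binary outcomes on the logit of the predicted risks and reading off the coefficient. A minimal pure-Python sketch (the function name and the Newton-Raphson fitting loop are illustrative choices, not part of the source):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def calibration_slope(y, p, iters=25):
    """Fit logit(P(y=1)) = a + b * logit(p) by Newton-Raphson.
    Returns (a, b); b is the calibration slope (ideal value: 1)."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iters):
        # Gradient and (negative) Hessian of the log-likelihood
        ga = gb = haa = hab = hbb = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            ga += yi - mu
            gb += (yi - mu) * xi
            haa += w
            hab += w * xi
            hbb += w * xi * xi
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b
```

For example, if the model predicts 10% and 90% risks in two groups whose observed event rates are actually 20% and 80%, the predictions are too extreme and the fitted slope comes out below 1, as the definition above describes.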

  (b) Calibration-in-the-large (calibration intercept)

   The calibration intercept compares the mean of all predicted risks with the mean observed risk, i.e. on average how close predicted is to observed in the whole dataset. This parameter hence indicates the extent to which predictions are systematically too low or too high. It can be well assessed graphically, in a plot with predictions on the x-axis and the observed endpoint on the y-axis. The observed values on the y-axis are 0 or 1 (e.g. dead/alive), while the predictions on the x-axis range between 0 and 100%, with the intercept representing calibration-in-the-large
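One common way to estimate this quantity (not spelled out in the source) is to fit a logistic model with the predicted logits as a fixed offset, so that the only free parameter is the intercept. A rough pure-Python sketch under that assumption (the function name is mine):

```python
import math

def calibration_in_the_large(y, p, iters=50):
    """Calibration intercept: fit logit(P(y=1)) = a + logit(p), a
    one-parameter logistic model with the predicted logits as offset.
    a = 0 means predictions are right on average; a > 0 means they are
    systematically too low, a < 0 systematically too high."""
    x = [math.log(pi / (1.0 - pi)) for pi in p]
    a = 0.0
    for _ in range(iters):  # one-dimensional Newton-Raphson
        g = h = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(a + xi)))
            g += yi - mu          # gradient of the log-likelihood
            h += mu * (1.0 - mu)  # negative second derivative
        a += g / h
    return a
```

For instance, a model predicting 20% risk for everyone in a cohort whose observed mortality is 50% yields a positive intercept, flagging systematic under-prediction.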

2. Discrimination

 This is a measure of a prediction model’s separation between those with or without the outcome, usually represented by the c-statistic, which is also known as the concordance index or, for binary outcomes, the area under the receiver operating characteristic (AUROC) curve. It gives the probability that, for any randomly selected pair of individuals, one with and one without the disease (outcome), the model assigns a higher probability to the individual with the disease (outcome). A value of 1 indicates the model has perfect discrimination, while a value of 0.5 indicates the model discriminates no better than chance
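The pairwise definition above translates directly into code: enumerate every (event, non-event) pair and count how often the event received the higher predicted risk, scoring ties as half. A minimal sketch (function name is illustrative; an O(n log n) rank-based formula would be used in practice):

```python
from itertools import product

def c_statistic(y_true, y_pred):
    """Concordance index (AUROC for binary outcomes): for every pair of
    one event and one non-event, score 1 if the event got the higher
    predicted risk, 0.5 for a tie, 0 otherwise, then average."""
    events = [p for y, p in zip(y_true, y_pred) if y == 1]
    nonevents = [p for y, p in zip(y_true, y_pred) if y == 0]
    concordant = 0.0
    for pe, pn in product(events, nonevents):
        if pe > pn:
            concordant += 1.0
        elif pe == pn:
            concordant += 0.5
    return concordant / (len(events) * len(nonevents))
```

With outcomes [1, 1, 0, 0] and predictions [0.9, 0.4, 0.6, 0.2], three of the four event/non-event pairs are concordant, giving a c-statistic of 0.75.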

3. Brier score

 The Brier score captures both discrimination and calibration simultaneously, with smaller values indicating better model performance. Consider a set of events with binary outcomes (e.g. ‘death will or will not happen’). If an event comes to pass (‘death did happen’), it is assigned a value of 1; otherwise it is assigned a value of 0. Given probabilistic predictions for those events (‘0.77 probability of death’), the Brier score is the mean of squared differences between those predictions and their corresponding event scores (1s and 0s), and lies on the probability scale between 0 and 1. Larger differences between expected and observed event outcomes reflect more error in predictions, so a lower Brier score indicates greater accuracy
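The mean-of-squared-differences definition is a one-liner; a minimal sketch (function name is mine):

```python
def brier_score(y_true, y_pred):
    """Mean squared difference between predicted probabilities and the
    observed 0/1 outcomes; ranges from 0 (perfect) to 1 (worst)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
```

For example, predicting 0.77 for an infant who died and 0.23 for one who survived gives ((0.77 − 1)² + (0.23 − 0)²) / 2 ≈ 0.053, while perfect predictions of 1 and 0 give a score of exactly 0.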