Table 25 Methods for assessing performance of prediction models: MSE, MAE, ROC curves, AUC, misclassification rate, Brier score, calibration plots, deviance

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Mean squared error (MSE) and mean absolute error (MAE)

Mean squared error (MSE) and mean absolute error (MAE), sometimes denoted mean squared prediction error (MSPE) and mean absolute prediction error (MAPE) to emphasize that they are computed on a test set (see discussion below), are commonly used measures for evaluating the prediction performance of a model with a continuous target variable. They are computed by averaging the squared differences or the absolute differences, respectively, between the values predicted by the model and the true values of the target variable. Note that the MSE, being a quadratic measure, is sometimes reported after a square-root transformation, yielding the so-called root mean squared error (RMSE).
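In formulas, a minimal sketch of these definitions (the notation, with predicted values ŷ_i and observed values y_i for the i = 1, …, n test-set observations, is assumed here and not part of the original table):

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\,\right|, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\]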

ROC curves and AUC

A receiver operating characteristic (ROC) curve is a graphical plot that facilitates visualization of the discrimination ability of a binary classification method. Many statistical methods classify observations into two classes based on estimated probabilities of class membership. If the probability is larger than a threshold, the response is classified as positive (e.g., sick), otherwise as negative (e.g., healthy). This threshold is typically set to 0.5 or to the prevalence of positive cases in the dataset. Choosing a lower threshold leads to more positive predictions, which increases the proportion of actually positive observations that are correctly classified as positive (sensitivity), at the potential cost of decreasing the proportion of actually negative observations that are correctly classified as negative (specificity). Conversely, a larger threshold generally leads to lower sensitivity and higher specificity.
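As an illustration of how a chosen threshold translates into sensitivity and specificity, a minimal sketch (the labels, probabilities, and threshold below are simulated placeholders, not taken from the article):

import numpy as np

# Simulated test-set data: true 0/1 labels and predicted probabilities of the positive class.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=200), 0, 1)

threshold = 0.5                                # classify as positive if probability > threshold
y_pred = (y_prob > threshold).astype(int)

# Sensitivity: proportion of actually positive observations classified as positive.
sensitivity = np.mean(y_pred[y_true == 1] == 1)
# Specificity: proportion of actually negative observations classified as negative.
specificity = np.mean(y_pred[y_true == 0] == 0)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")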

The ROC curve is typically constructed by plotting sensitivity (y-axis) against 1 − specificity (x-axis) for all possible values of the threshold. The result is a curve that indicates how well the method discriminates between the two classes. Models with the best discrimination ability correspond to ROC curves occupying the top left corner of the plot, i.e., simultaneously high sensitivity and high specificity. A ROC curve close to the diagonal line from the lower left to the upper right represents discrimination ability that is no better than random guessing, e.g., flipping a coin. The information provided by the ROC curve is often summarized in a single number by calculating the area under the curve (AUC). The best classifiers obtain an AUC value close to 1, while methods no better than random guessing exhibit values close to 0.5. Figure 17 [185] shows an exemplary ROC curve corresponding to high discrimination ability, with AUC = 0.90 (confidence interval [0.86, 0.95]).
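A minimal sketch of how such a curve and its AUC might be computed with scikit-learn (the data below are simulated purely for illustration and are not from the article):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Simulated test-set labels and predicted probabilities of the positive class.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)
y_prob = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=500), 0, 1)

# fpr is 1 - specificity and tpr is sensitivity; one point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.2f}")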

Caution is advised regarding the risk of overestimating the performance of a classifier based solely on the AUC value, as the binary decision depends on an optimized threshold, which can be quite different from 0.5. This problem is especially important for HDD, since there is a lot of flexibility to tune and optimize the classifier, including the decision threshold, based on the large number of predictor variables. Calibration plots (see below) are also important to assess whether the classifier is well calibrated, i.e., whether the estimated probabilities correspond to the observed proportions in the data.

Misclassification rate

A simpler measure of prediction ability in the case of a categorical response is the misclassification rate, which quantifies the proportion of observations that have been erroneously classified by the model. Here, in contrast to the AUC, smaller values are better. This measure is simple and can be used even if the classifier does not assign probabilities to observations but only predicts classes; however, it does not differentiate between false positives and false negatives. Therefore, the overall misclassification rate can depend heavily on the proportions of actually positive and actually negative cases in the test set.
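In formulas, a minimal sketch (with observed class y_i, predicted class ŷ_i, and the indicator function 1{·}; this notation is assumed here rather than taken from the table):

\[
\text{misclassification rate} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\hat{y}_i \neq y_i\}
\]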

Brier score

While the misclassification rate only measures accuracy, the Brier score also takes into account the precision of a predictor [180, 186]. The Brier score can be applied to binary, categorical, or time-to-event predictions. It is computed from the squared differences between predicted probabilities and observed outcomes and can thus be considered, for these prediction targets, the counterpart of the MSE used for regression models. The Brier score is particularly useful because it captures both aspects of a good prediction, namely calibration (similarity between the actual and predicted survival time) and discrimination (ability to predict the survival times of the observations in the right order). For survival data, the Brier score is generally plotted as a function of time, where higher curves indicate worse models. Alternatively, the area under the Brier score curve is computed, leading to the integrated Brier score, which summarizes the prediction error in a single number (lower being better).
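For the binary case, a minimal sketch of the definition (notation assumed here: observed outcome y_i ∈ {0, 1} and predicted probability p̂_i of the positive class):

\[
\mathrm{BS} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{p}_i - y_i\right)^2
\]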

Calibration plots

Calibration plots for statistical prediction models can be used to visually check whether the predicted probabilities of the response variable agree with the empirical probabilities. For example, for logistic regression models, the predicted probabilities of the target outcome are grouped into intervals, and for all observations within each interval the proportion of observations positive for the target outcome is calculated. The means of the predicted values are then plotted against the observed proportions of positive outcomes across the intervals. For survival models, the Kaplan–Meier curve (the observed survival function) can be compared with the average of the predicted survival curves of all observations. Poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making [187]. Figure 18 [187] visualizes different types of miscalibration using calibration plots.
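A minimal sketch of such a plot for a binary outcome, using scikit-learn's calibration_curve (the labels and probabilities below are simulated purely for illustration and deliberately miscalibrated):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Simulated predicted probabilities and corresponding 0/1 outcomes.
rng = np.random.default_rng(3)
y_prob = rng.uniform(0, 1, size=1000)
y_true = rng.binomial(1, y_prob ** 1.3)   # outcomes drawn from distorted probabilities

# Group predictions into 10 intervals; per interval, compute the mean predicted
# probability and the observed proportion of positive outcomes.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed proportion of positives")
plt.legend()
plt.show()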

Deviance

The deviance measures a distance between two probabilistic models and is based on likelihood functions. It can be used for model comparison for any kind of response variable for which a likelihood function can be specified. For a Gaussian response, it corresponds (up to a constant) to the MSE and thus provides a measure of goodness of fit of the model compared to a null model without predictors. When computed on the training set (see discussion below) to choose the “best” model among several alternatives, the deviance is often regularized by adding a term that penalizes larger models (large p, where p is the number of predictor variables), yielding measures such as the information criteria AIC (penalty equal to 2p) and BIC (penalty equal to p · log n). The specific choice of the information criterion is difficult and depends, e.g., for classification tasks, also on the relative importance of sensitivity and specificity [188].
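Under the common convention based on the maximized log-likelihood ℓ̂, and using the penalties stated above (a sketch; the symbol ℓ̂ is assumed here, with p the number of predictor variables and n the number of observations):

\[
\mathrm{AIC} = -2\hat{\ell} + 2p, \qquad
\mathrm{BIC} = -2\hat{\ell} + p\,\log n
\]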