Reporting performance of prognostic models in cancer: a review

Mallett, Susan; Royston, Patrick; Waters, Rachel; Dutton, Susan; Altman, Douglas G

doi:10.1186/1741-7015-8-21

Problems with perhaps excess model evaluations.

Mitchell Wachtel, TTUHSC

11 May 2010

This pair of articles raises many issues that need discussion.

The first is the notion that continuous variables are best analyzed by either polynomials or splines.

When one evaluates a variable, one most often wants the relationship to be monotonic and continuous. One can then say, if the relationship is linear, that an increment of so many X units is associated, on average, with an increment or decrement of so many Y units or, if the relationship is curved, that an increment of so many X units is associated, on average, with a change in Y by a certain factor.

If this is not so, as is often true, then one might wish to divide the variable, after examining tertiles, quartiles, etc., into groups delimited by accepted criteria. For example, it is very reasonable to compare patients 65 and older with younger patients or to look at the size of tumors by AJCC stage.
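As a minimal sketch of the contrast drawn here, the following compares data-driven tertile cut points with a fixed, accepted cutoff at age 65. The ages and the helper `age_group` are hypothetical and serve only as illustration.

```python
from statistics import quantiles

# Hypothetical patient ages, for illustration only.
ages = [42, 51, 58, 63, 65, 67, 70, 74, 79, 83]

# Data-driven grouping: the two tertile cut points depend on this sample.
tertile_cuts = quantiles(ages, n=3)


def age_group(age, cutoff=65):
    """Assign a patient to a group using a fixed, pre-specified cutoff."""
    return "65+" if age >= cutoff else "<65"


# Criterion-based grouping: the cutoff is the same in every study.
groups = [age_group(a) for a in ages]
print("tertile cut points:", tertile_cuts)
print("65+ count:", groups.count("65+"), "| <65 count:", groups.count("<65"))
```

The tertile cut points would shift with a different sample, whereas the cutoff at 65 would not, which is the appeal of accepted criteria.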

A difficulty with splines is that the choice of knot placement is data driven. A difficulty with both splines and polynomials is that the Y change associated with a change in X units depends on the initial X value. With splines and polynomials, one chooses breakpoints after completing the analysis and carefully examining the results; the chance of an investigator's being unduly influenced by the data under such circumstances is not small.
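The dependence on the initial X value can be illustrated with a short sketch. The coefficients below are invented for the example and come from no fitted model; the point is only that a one-unit step in X gives a constant change in Y under a linear term but a varying change under a quadratic one.

```python
def linear(x, b0=1.0, b1=0.5):
    """Linear predictor with hypothetical coefficients."""
    return b0 + b1 * x


def quadratic(x, b0=1.0, b1=0.5, b2=0.1):
    """Quadratic predictor with hypothetical coefficients."""
    return b0 + b1 * x + b2 * x * x


def unit_change(f, x):
    """Change in the predicted Y when X moves from x to x + 1."""
    return f(x + 1) - f(x)


starts = (0, 5, 10)
linear_steps = [round(unit_change(linear, x), 6) for x in starts]
quadratic_steps = [round(unit_change(quadratic, x), 6) for x in starts]

print("linear:   ", linear_steps)     # same step at every starting X
print("quadratic:", quadratic_steps)  # step grows with the starting X
```

Under the quadratic form, no single sentence of the kind "one more unit of X is associated with so many units of Y" summarizes the relationship.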

Model fit is a complex, difficult process even without separating matters into discrimination, calibration, and validation. In practice there is little guidance as to what constitutes a model with good discrimination. A study on pancreas cancer (Ann Surg 2004;240:293-298) found an improvement in Harrell's c of 0.08; the validation study (Journal of Clinical Oncology 2005;23:7529-7535) found an increase in Harrell's c of 0.03. Whether such increments constitute an advance is unsettled. One might wonder about the author's model analysis (Gastrointest Endosc 2005;62:333-340), given that it does not compare its prognostic classification with any prior classification and yields a Harrell's c of 0.64 or 0.65, similar to the values in the pancreas cancer articles. It is unclear, moreover, whether Harrell's c is appropriate (Stat Med 2008;27:157-172), especially for models that lack a constant hazard ratio or proportional odds ratio with respect to time. Perhaps it is better simply to see whether adding a variable set with the ten deciles of risk produces a model with a lower BIC.
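For readers unfamiliar with the statistic, a minimal sketch of how Harrell's c can be computed for right-censored data follows. It assumes that a higher risk score should mean a shorter survival time, skips pairs with tied times for simplicity, and uses invented data; it is not the implementation used in any of the cited studies.

```python
from itertools import combinations


def harrell_c(times, events, risks):
    """Concordance index for right-censored data (simplified sketch).

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if censored
    risks  : model risk score (higher = expected earlier event)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # shorter time is censored, so the order is unknown
        usable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5  # tied risk scores count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Hypothetical data: four subjects, the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
print(harrell_c(times, events, risks))
```

A value of 0.5 corresponds to chance ordering and 1.0 to perfect ordering, which is why differences of a few hundredths are hard to interpret without an agreed benchmark.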

Most often, one would like to see a study replicated in multiple settings before adopting it in medical practice. This is the best validation possible, a standard that should not be abandoned irrespective of the validation performed within any single study. The difficulty with bootstrap procedures is twofold: 1) with a study population of more than, say, 10,000, it might take weeks for a computer to perform 200 bootstraps; 2) no one is sure about what constitutes a good or bad degree of optimism. Until such questions are resolved, one is better off simply awaiting a second, and preferably third or fourth, study that shows the predictive model to be of value.
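The bootstrap procedure in question can be sketched as follows. To keep the example self-contained, a one-predictor least-squares line and R-squared stand in for a prognostic model and its performance measure; both are assumptions made for illustration, as are the function names and the simulated data.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def r_squared(model, xs, ys):
    """Proportion of variance in ys explained by the fitted line."""
    intercept, slope = model
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0


def optimism_corrected(xs, ys, n_boot=200, seed=0):
    """Return (apparent, optimism, corrected) performance."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    total = 0.0
    for _ in range(n_boot):
        # Resample subjects with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism: performance on the bootstrap sample minus
        # performance of the same model on the original data.
        total += r_squared(model, bx, by) - r_squared(model, xs, ys)
    optimism = total / n_boot
    return apparent, optimism, apparent - optimism


# Simulated data, for illustration only.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(50)]
ys = [0.3 * x + data_rng.gauss(0, 1) for x in xs]
apparent, optimism, corrected = optimism_corrected(xs, ys)
print(apparent, optimism, corrected)
```

The model is refitted once per bootstrap sample, so the cost scales with both the number of resamples and the cost of a single fit. The sketch also returns the optimism as a bare number, with nothing to say whether it is large or small.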

Competing interests

No competing interests.

Archived Comments for: Reporting performance of prognostic models in cancer: a review

BMC Medicine
