#### Item 10. Specify all statistical methods, including details of any variable selection procedures and other model-building issues, how model assumptions were verified, and how missing data were handled

After some broad introductory observations about statistical analyses, we consider this key item under eight subheadings.

All the statistical methods used in the analysis should be reported. A sound general principle is to 'describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results' [91]. It is additionally valuable if the reader can also understand the reasons for the approaches taken.

Moreover, for prognostic marker studies there are many possible analysis strategies and choices are made at each step of the analysis. If many different analyses are performed and only those with the best results are reported, this can lead to very misleading inferences. Therefore, it is essential also to give a broad, comprehensive view of the range of analyses that have been undertaken in the study (see also the REMARK profile in Item 12). Details can be given in supplementary material if necessary due to publication length limitations.

Analysis of a marker's prognostic value is usually more complex than the analysis of a randomized trial, for which statistical principles and methods are well developed and primary analysis plans are generally pre-specified. Many of the marker analysis decisions can sensibly be made only after some preliminary examination of the data and therefore generally only some key features of the analysis plan can be pre-specified. Many decisions will be required, including coding of variables, handling of missing data and specification of models. It would be useful to clarify which of these decisions were pre-specified and which were made *post hoc* or even in deviation from the original analysis plan.

Reporting of key features of an analysis is important to allow readers to understand the reasons for the specific approach chosen and to assess the results. No study seems yet to have investigated in detail the large variety of statistical methods used and the quality of their reporting, but the common weaknesses in applying methods and the generally insufficient reporting of statistical aspects of a multivariable analysis have been well known for many years. Empirical investigations of published research articles seem to concentrate more on randomized trials and epidemiological studies, but the methods and problems of multivariable models in the latter are similar to those in prognostic studies. Concato *et al*. identified 44 articles which considered risk factors in the framework of a logistic regression model or a proportional hazards model [92]. All had at least one severe weakness, and they concluded 'the findings suggest a need for improvement in the reporting and perhaps conducting of multivariable analyses in medical research'. Recently, Mallett and colleagues assessed 50 articles reporting tumor marker prognostic studies for their adherence to some items from the REMARK checklist [20]. In 49 out of 50 studies (98%), the Cox model was used. Proportional hazards is one of the key assumptions of this model but only four articles (8%) reported testing this assumption (see Item 18). Sigounas *et al*. assessed 184 studies on prognostic markers for acute pancreatitis. Multivariable analyses were performed in only 15 of them, of which only one provided all details requested in Item 10 [21]. Although bad reporting does not mean that bad methods were used, the many studies identifying specific issues of bad reporting clearly show that a substantial improvement in the reporting of statistical methods is needed [18, 21, 33, 64, 93–98].

In the following sections we consider specific aspects of analyses under eight headings. Not all aspects will be relevant for some studies. More extensive discussions of statistical analysis methods for binary outcome and for survival data can be found elsewhere [73, 99–111].

##### a. Preliminary data preparation

##### Example

'Ki67 was measured as a continuous score which is typically positively skewed. Analysis was undertaken by log transforming Ki67 and using log(Ki67) as a covariate to investigate whether there is a linear increase in the probability of relapse with increasing Ki67 value.' [112]

##### Explanation

Some assessment of the data quality usually takes place prior to the main statistical analyses of the data, and some data values may be changed or removed if they are deemed unreliable. These manipulations and pre-modeling decisions could have a substantial impact on the results and should be reported, but rarely are [113–117].

There are many examples of steps typically taken in initial data analyses. The distribution of the marker values and distributions of any other variables that will be considered in models should be examined for evidence of extreme values or severe skewness. It may be appropriate to truncate or omit extreme outliers. Preliminary transformations of specific variables (for example, logarithm or square root) may be applied to remove severe skewness. For categorical variables, re-categorization is often performed to eliminate sparse categories (for example, histological types of tumors). Graphical representations or summary statistics calculated to assess the distribution of the marker or other variables (for example, boxplot; mean, median, SD, range and frequencies) should be described because different methods will depict features of the data with varying degrees of sensitivity (such as outliers and skewness). If some marker measurements were judged to be unreliable and consequently omitted or adjusted to lessen their influence in the analysis, it is recommended these details be reported as they can be informative about the robustness of the assay and stability of the analysis results. It is helpful to report these early steps of the analysis along with the number of data values that were excluded or somehow modified (see also Items 12 and 13).
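The kind of preliminary check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker values are hypothetical, the exclusion rule (dropping missing and non-positive values before taking logarithms) is one possible choice rather than a recommendation, and skewness is computed with the simple moment-based formula.

```python
import math
import statistics

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (moments divided by n)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def prepare_marker(raw_values):
    """Drop missing or non-positive values, log-transform the rest,
    and return the transformed values with a small report of what was done."""
    usable = [v for v in raw_values if v is not None and v > 0]
    logged = [math.log(v) for v in usable]
    report = {
        "n_total": len(raw_values),
        "n_excluded": len(raw_values) - len(usable),
        "skewness_raw": sample_skewness(usable),
        "skewness_log": sample_skewness(logged),
    }
    return logged, report

# Hypothetical, positively skewed marker scores (illustration only).
ki67 = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 8.0, 15.0, 40.0, None, 0.0]
logged, report = prepare_marker(ki67)
print(report)
```

The returned report holds the quantities this section asks authors to state: how many values were excluded or modified, and how the transformation changed the distribution.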

##### b. Association of marker values with other variables

##### Example

'The associations of cathepsin-D with other variables were tested with non-parametric tests: with Spearman rank correlation ($r_{s}$) for continuous variables (age, ER, PgR), and the Wilcoxon rank-sum test or Kruskal-Wallis test, including a Wilcoxon-type test for trend across ordered groups where appropriate, for categorical variables.' [29]

##### Explanation

Early steps in an analysis may include an examination of the relationship of the marker to other variables being considered in the study. These variables might include established clinical, pathologic, and demographic covariates (see Items 13 and 14). If more than one marker is being evaluated in a study, the relationships between the multiple markers should be examined.

Methods for summarizing associations with other variables (for example, correlation coefficients, chi-square tests and t-tests) should be described. Extreme or unusual associations may be relevant to the validity of analyses and stability of results and may suggest further data modifications are advisable (see section a above) or that certain variables are redundant.
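As one concrete instance of such a summary, the following Python sketch computes the Spearman rank correlation as the Pearson correlation of tie-averaged ranks. The function names and the data are hypothetical; in practice an established statistical package would normally be used and named in the report.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical marker values and ages for eight patients.
marker = [0.8, 1.1, 1.1, 2.3, 3.0, 4.2, 6.5, 9.9]
age = [41, 39, 52, 48, 60, 57, 66, 71]
print(round(spearman(marker, age), 3))
```

Because it works on ranks, the coefficient is unchanged by monotone transformations such as the logarithm, which is one reason rank-based summaries are often chosen for skewed marker values.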

##### c. Methods to evaluate a marker's univariable association with clinical outcome

##### Example

'Median survival time and median DFI [disease free interval] for the whole test set were estimated using the Kaplan-Meier product limit method. Univariate associations between survival time, DFI, and glucose were examined using Cox proportional hazards regression models. These analyses examined glucose as a continuous variable, using an increment of 70 mg/dL to derive hazard ratios, and adjusted for time of blood draw to control for circadian effects on glucose levels ... Wald Chi-square P values were used to calculate univariate statistical significance, and 95% confidence intervals were estimated.' [118]

##### Explanation

A marker's association with clinical outcome is of key importance. The first evaluation will usually be conducted without adjustment for additional variables, that is, a univariable analysis. The method of analysis (for example, logrank test or estimated effect with confidence interval in a Cox regression or a parametric model for survival data), including options such as choice of test statistic (for example, Wald test, likelihood ratio test or score test), should be reported.

Any variable codings or groupings, or transformations of continuous values applied to the marker variable or any other variables, should be stated to allow for proper interpretation of the estimated associations (see Box 4 and Item 11).
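How the chosen coding and test statistic feed into the reported numbers can be illustrated with a short Python sketch. It assumes a regression coefficient (log hazard ratio per one unit of the marker) and its standard error have already been estimated elsewhere; the numerical values below are hypothetical, and the function name `wald_summary` is introduced only for this illustration.

```python
import math

def wald_summary(beta, se, increment=1.0, z_crit=1.959964):
    """Hazard ratio per `increment` units with a 95% Wald confidence
    interval and a two-sided Wald P value.

    beta: estimated log hazard ratio per 1 unit of the marker
    se:   standard error of beta
    """
    log_hr = beta * increment
    log_se = se * abs(increment)
    z = beta / se  # Wald statistic; z**2 is the Wald chi-square on 1 df
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return {
        "hr": math.exp(log_hr),
        "ci_low": math.exp(log_hr - z_crit * log_se),
        "ci_high": math.exp(log_hr + z_crit * log_se),
        "z": z,
        "p": p_value,
    }

# Hypothetical estimate: log hazard ratio of 0.004 per mg/dL (SE 0.0015),
# reported per 70 mg/dL increment.
result = wald_summary(beta=0.004, se=0.0015, increment=70)
print(result)
```

Changing the increment rescales the hazard ratio and its confidence limits but leaves the Wald statistic and P value unchanged, which is why the unit of the marker needs to be stated alongside the estimate.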

In addition, similar analyses may be conducted to examine the association of other variables with clinical outcome.

##### d. Multivariable analyses

##### Examples

'A Cox regression model was used with individual marker as the exposure variables and OS [overall survival] (from time of surgery to time of death or end of current follow-up) as the outcome. The analyses were adjusted simultaneously for sex, age, tumour size, grade (World Health Organization), stage and sites as well as use of post-operative adjuvant therapies.' [76]

'Univariable and multivariable Cox regression models addressed CSM after NU or SU. Covariates consisted of pathologically determined T stage (pT1 versus pT2 versus pT3 versus pT4), N stage (N0 versus N1-3), tumour grade (I versus II versus III versus IV), primary tumour location (ureter versus renal pelvis), type of surgery (NU with bladder cuff versus NU without bladder cuff versus SU), year of surgery, gender (male versus female) and age. Since pT and pN stages, as well as tumour grade, may contribute to a multiplicative increase in CSM rate, we tested three first-degree interactions between these variables. Specifically, multivariable interaction tests were performed between pT and pN stages, between T stage and tumour grade and between N stage and tumour grade.' [119]

'For both models 1 and 2 a competing risk analysis was performed using cause-specific hazards. This analysis follows separate Cox models for each event assuming proportional hazards. In such competing risks analyses with two endpoints, it is possible to interpret both cause-specific hazard ratios simultaneously for each risk factor. Cumulative incidence functions have been displayed for each endpoint. The proportional hazard assumptions were assessed by study of the graphs of the Schoenfeld's residuals; this technique is especially suitable for time-dependent covariates.' [120]

##### Explanation

Univariable analyses are useful but, except in early studies, are generally insufficient because of the possible relationship of the marker with other variables. Thus the prognostic value of the marker after adjustment for established prognostic factors, as estimated from a multivariable model (see Item 17), will be of major interest. To facilitate comparison of the unadjusted and adjusted measures of association, it is helpful to report results from univariable analyses that used the same general approach as the approach used for the multivariable analysis. For example, if multivariable analyses adjusting for standard prognostic factors are based on a Cox regression model with the log-transformed marker value as one of the independent variables, then it is helpful also to report the corresponding results of a univariable Cox regression analysis. This allows for direct assessment of how the marker's regression coefficient is altered by inclusion of standard covariates in the model.

Whereas the Cox proportional hazards model allows a flexible form of baseline hazard, parametric models assume specific functional forms [109, 121, 122]. Parametric models [123] will be statistically more efficient if the model is correct and may be more easily adaptable to situations involving complex censoring patterns, but if the assumed functional form of the baseline hazard is incorrect, they can be misleading. It is important that authors report which model was used.
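The distinction can be written out explicitly. In the notation assumed here (it is not taken from the cited references), $h(t \mid x)$ is the hazard at time $t$ for a patient with covariate vector $x$, $\beta$ is the vector of regression coefficients, and the Weibull model in its proportional hazards parameterization, with scale $\lambda > 0$ and shape $\gamma > 0$, serves as one illustrative parametric choice:

$$
\begin{aligned}
\text{Cox:} \quad & h(t \mid x) = h_0(t)\,\exp(\beta^{\top} x), \qquad h_0(t)\ \text{left unspecified},\\
\text{Weibull:} \quad & h(t \mid x) = \lambda \gamma\, t^{\gamma - 1} \exp(\beta^{\top} x).
\end{aligned}
$$

In both forms $\exp(\beta_j)$ is the hazard ratio per unit increase in $x_j$. The Weibull model additionally fixes the shape of the baseline hazard, which is the source of both its potential efficiency gain and its risk of misspecification.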

Multivariable methods can also be used to build prognostic models involving combinations of several candidate markers or even many hundreds of markers (for example, gene expression microarray data). Although the same basic analysis principles apply to these situations, even greater care must be taken to ensure proper fit of such models and avoid overfitting, and to rigorously evaluate the model's prognostic performance. These topics are covered in many articles and books [99, 101, 108, 110, 124–126] and are not a focus of this paper.

Investigators may use statistical approaches other than classic multivariable regression to take into account multiple variables. Such techniques include classification and regression trees and artificial neural networks. Their detailed discussion is beyond the scope of the current guidelines; for details the reader is referred elsewhere [107].

##### e. Missing data

##### Example

'Thirteen patients (all either ductal carcinoma, lobular carcinoma or mixed histology) had no grade information recorded in the data and one patient had no tumour size recorded. These patients were included in the analysis using multiple imputation methods to estimate the missing values. The hazard ratios were derived from the average effect across 10 augmented datasets, with the confidence intervals and significance tests taking into account the uncertainty of the imputations. The multiple imputation was performed by the MICE library within the S-Plus 2000 Guide to Statistics Volumes 1 and 2 (MathSoft, Seattle, WA, USA) ... ' [127]

##### Explanation

Almost all prognostic studies have missing marker or covariate data for some patients because clinical databases are often incomplete. Also, some marker assays may not yield interpretable results for all specimens. However, not all papers report in detail the amount of missing data and very few attempt to address the problem statistically [33].

Authors should report the number of missing values for each variable of interest. They should give reasons for missing values if possible, and indicate how many individuals were excluded because of missing data when describing the flow of participants through the study (see Item 12). Many authors omit cases with incomplete information from all analyses, or vary who is included according to which variables enter a given analysis. Including only cases with complete data may greatly reduce the sample size and potentially lead to biased results if the likelihood of being missing is related to the true value [33, 128–131]. Modern statistical methods exist to allow estimation (imputation) of missing observations; these issues are discussed further in Box 2. Authors should describe the nature of any such analysis (for example, multiple imputation) and specify assumptions that were made (for example, 'missing at random').
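When multiple imputation is used, the estimates from the imputed data sets are usually combined with Rubin's rules, so that the reported standard error reflects both sampling variability and imputation uncertainty. The Python sketch below shows that combination step only. It assumes the per-imputation log hazard ratios and their variances have already been obtained from a fitted model; the numbers are hypothetical, and the normal-approximation confidence interval is a simplification (a t reference distribution with Rubin's degrees of freedom is commonly used instead).

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Combine per-imputation estimates (for example, log hazard ratios)
    and their variances (squared standard errors) using Rubin's rules.
    Returns the pooled estimate and its standard error."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 imputations, one variance each")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical log hazard ratios and variances from 5 imputed data sets.
log_hrs = [0.42, 0.47, 0.39, 0.45, 0.44]
variances = [0.020, 0.022, 0.019, 0.021, 0.020]
pooled, se = pool_rubin(log_hrs, variances)
hazard_ratio = math.exp(pooled)
ci = (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se))  # normal approximation
print(hazard_ratio, ci)
```

The pooled standard error is never smaller than the average within-imputation standard error, which is how the uncertainty of the imputations widens the confidence interval.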

In a review of 100 prognostic articles, the percentage of eligible cases with complete data was obtainable in only 39; in 17 of these articles more than 10% of patients had some missing data. The methods used to handle incomplete covariates were reported in only 32 out of 81 articles with known missing data [33].

##### f. Variable selection

##### Example

'When using a stepwise variable selection procedure to identify independent factors prognostic for survival, variables were added using forward selection according to a selection entry criterion of 0.05 and removed using backward elimination according to a selection stay criterion of 0.05. The importance of a prognostic factor was assessed via Wald-type test statistics, the hazard ratio and its 95% confidence interval for survival.' [132]

##### Explanation

Sometimes several multivariable models containing different subsets of variables are considered. The rationale for these choices and details of any model selection strategies used should be described. The REMARK profile can provide a concise summary of all analyses performed (Item 12).

If patients in the study received different treatments, one or more variables indicating treatments received can be considered in models, treatment can be used as a stratification factor or separate models may be built for each treatment. For many cancer types, there are a few generally accepted staging variables or other clinical or pathologic variables that would be available in most cases, and these variables would usually be considered in multivariable models (see also Item 17).

The main multivariable model may sometimes be pre-specified, which helps to avoid biases caused by data-dependent model selection. More often, however, many candidate variables are available and some type of variable selection procedure is sensible in order to derive simpler models which are easier to interpret and may be more generally useful [108, 133]. It is particularly important to state if the variables included in a reported model were determined using variable selection procedures. Any selection procedures used should be described (for example, stepwise regression or backward elimination) along with specific criteria used to determine inclusion or exclusion of variables from the model (for example, *P* values) or to select a best fitting model (for example, Akaike information criterion) [101]. It is well known that, unless sample sizes are large, use of variable selection procedures will lead to biased parameter estimates and exaggerated measures of statistical significance [66, 121, 134]. For this reason, Item 17 requests that results from a particular multivariable model which includes the marker along with 'standard' prognostic variables, regardless of statistical significance, be reported.
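As a small illustration of reporting a criterion-based choice, the Python sketch below ranks candidate models by the Akaike information criterion. The log-likelihood values are invented for the example, and the parameter count assumes one coefficient per variable, which would not hold for categorical variables with more than two levels. For a Cox model the maximized partial log-likelihood would take the place of the log-likelihood.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood

def rank_models(fits):
    """`fits` maps a tuple of variable names to the maximized log-likelihood
    of the model containing those variables. Returns (variables, AIC) pairs
    sorted from best (smallest AIC) to worst."""
    scored = [(variables, aic(loglik, len(variables)))
              for variables, loglik in fits.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical fitted models (log-likelihood values invented for illustration).
fits = {
    ("marker",): -412.7,
    ("marker", "stage"): -405.1,
    ("marker", "stage", "grade"): -404.6,
}
ranked = rank_models(fits)
for variables, value in ranked:
    print(variables, round(value, 1))
```

Listing every candidate model with its criterion value, rather than only the winner, gives readers the view of the whole selection process that this item requests.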

##### g. Checking model assumptions

##### Examples

'In the basic form of the Cox regression model, the coefficients corresponded to the logarithm of the HR and were constant in time. This assumption was graphically evaluated by means of smoothed Schoenfeld residuals and tested as suggested by Grambsch and Therneau.' [135]

'The proportional hazards assumptions were checked by plots of log(- log survival time) versus log time.' [136]

'We evaluated the proportional hazards assumption by adding interaction terms between the time-dependent logarithm of follow-up time plus 1 and tamoxifen treatment, ERαS118-P status, or both and found no evidence for nonproportional hazards (P = .816, .490, and .403, respectively).' [24]

##### Explanation

Any statistical model, univariable or multivariable, makes certain assumptions about the distributions of variables or the functional relationships between variables. For example, the Cox proportional hazards regression model commonly used for survival data requires several important assumptions, including proportional hazards and linear relationships between continuous covariates and the log hazard function. Proportional hazards assumptions are often violated when there is long follow-up, for example, for certain types of cancers in which a portion of patients can be considered cured. How the variables are coded or transformed will also affect the appropriateness of linear versus non-linear relationships (see Item 11 and Box 4).

Methods used to empirically check model assumptions should be reported. For example, residual plots and models containing time-by-covariate interactions are often used to diagnose departures from linearity and proportional hazards [122, 137–139]. Influential points and outliers can often be detected by diagnostic plots such as added variable plots [140]. Parametric survival models, such as lognormal or Weibull models, make additional assumptions about the distribution of the survival times [123]. The suitability of parametric models can be checked using methods such as residual plots and goodness of fit tests [109, 121]. Many extensions of the Cox model have been proposed to handle departures from the basic assumptions [138, 139] but they will not be discussed here. More complex models require larger sample sizes than are often available in tumor marker prognostic studies to avoid overfitting to noise in the data [107, 141].
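One simple graphical check of proportional hazards between two groups plots log(-log S(t)) against log t for each group, where S(t) is the Kaplan-Meier estimate; roughly parallel curves are consistent with proportional hazards. The Python sketch below only computes the points to be plotted. It uses hypothetical follow-up data and is not a substitute for the residual-based methods cited above.

```python
import math

def kaplan_meier(times, events):
    """Product-limit estimate. Returns [(t, S(t))] at each distinct event time.
    events[i] is 1 for an observed event and 0 for a censored observation."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def log_minus_log(curve):
    """Points (log t, log(-log S(t))) for times where 0 < S(t) < 1."""
    return [(math.log(t), math.log(-math.log(s)))
            for t, s in curve if t > 0 and 0 < s < 1]

# Hypothetical follow-up times (months) and event indicators for two groups.
group_a = log_minus_log(kaplan_meier([2, 3, 5, 7, 8, 11, 12], [1, 1, 0, 1, 1, 0, 1]))
group_b = log_minus_log(kaplan_meier([1, 2, 2, 4, 5, 6, 9], [1, 1, 1, 1, 0, 1, 1]))
print(group_a)
print(group_b)
```

Whatever diagnostic is used, the report should name it and state what was concluded, as in the examples quoted above.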

Alternative models evaluated for purposes of sensitivity analyses should also be described (see Item 18).

##### h. Model validation

##### Examples

'For internal validation of the multivariate models, 1000 bootstrap samples were created and stepwise Cox regression analysis was applied to each sample. The relative frequencies of inclusions of the respective factors were calculated.' [142]

'For this study, and future studies using this TMA, the primary investigator is given access to all clinical, outcome, and TMA data from the training set only. The training set is used to generate and refine hypotheses regarding the biomarker under study. Significant findings are then formally presented ... Those findings considered to be of clinical and scientific interest are then re-tested on the validation set. A separate researcher who did not participate in the training set analysis performs the re-testing on the validation set. Our statistical approach is intended to minimize false positive results, particularly with subgroup analysis.' [143]

##### Explanation

Invariably, the strongest evidence for the validity of results is confirmation of the findings on data not involved in the original analysis [144, 145]. The ideal approach is to confirm findings from the main (final) model on completely independent data, preferably collected by different investigators but under pre-defined appropriate conditions. If successful, this approach would indicate that the results are transportable to other settings. This would be a type of 'external validation'. A prospectively designed and conducted clinical trial is the strongest form of validation, but trials designed with the primary objective to validate a prognostic marker or model are rare. More often, evaluations of markers occurring within trials are secondary aims in trials primarily designed to evaluate a treatment or other intervention. The marker evaluation could occur during the trial, or the evaluation might take place even years after completion of the trial using specimens banked during the course of the trial. This latter option has been referred to as a 'prospective-retrospective' design, and it can provide a high level of evidence for the utility of a marker if conducted under appropriate conditions [146]. Complete specification of the marker assay method and model (if relevant), a pre-specified analysis plan, and enforcement and documentation of lock-down of marker analytical results prior to unblinding of clinical outcome data (see also Item 5) are among the conditions that should be satisfied for a rigorous prospective-retrospective validation.

A completely independent data set (a 'similar' study) often will not be available, but 'internal' validation procedures, such as cross-validation, bootstrapping or other data resampling methods [133, 147], are useful to give insights into critical issues such as bias of regression parameter estimates, overoptimism of prognostic model discriminatory ability or stability of the model derived (see also Item 18). Internal validation involves holding out some portion of the data ('test set') while a model is built on the remaining portion ('training set'); when the model is completely specified on the training set, it is then evaluated (tested) on the held-out data. A limitation of internal validation is that there may be biases affecting the entire data set that will not be detected by internal validation because the biases will affect the training and test sets equally [46]; however, if a model has been seriously overfitted to random noise in the training set, properly performed internal validation should reveal failure of the model on the test data. The study report should include a description of any validations that were performed, internal or external.
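The hold-out logic described above can be sketched as a k-fold split in Python. The function name and the seed are assumptions made for illustration. The essential point is that every model-building step, including any variable selection or choice of cutpoints, is repeated using the training indices only, and the resulting model is then evaluated on the corresponding test indices.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.
    Each of the n observations is held out exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)

# Example: 10 observations split into 5 folds.
for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Setting `k` equal to `n` gives leave-one-out cross-validation. Recording the seed and the number of folds allows the split to be reproduced.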

For internal validation, the specific validation algorithm used should be described (for example, bootstrapping, 10-fold or leave-one-out cross-validation) [147–149]. If a study performs any external validation, basic details of the study population, design and analysis approach should be provided. It should be clarified whether the external validation sample came from the same or different centers or periods as the samples used to develop the model. In cases where the whole study represents a validation of a previously developed model this should be stated, along with proper reference to the previous study that developed that model.
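Bootstrap inclusion frequencies, as in the first example of this subsection, can be sketched as follows in Python. The selection rule here is a deliberately simple stand-in (keep a variable if its absolute Pearson correlation with the outcome exceeds a threshold) and is not stepwise Cox regression; the data, variable names and threshold are all hypothetical. In a real analysis the `select` argument would wrap the study's actual selection procedure.

```python
import random
from collections import Counter

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5

def toy_select(sample, candidates=("marker", "noise"), threshold=0.5):
    """Stand-in selection rule (NOT stepwise Cox regression)."""
    y = [row["outcome"] for row in sample]
    return [name for name in candidates
            if abs(pearson([row[name] for row in sample], y)) > threshold]

def bootstrap_inclusion_frequencies(rows, select, n_boot=1000, seed=0):
    """Apply `select` to n_boot bootstrap resamples of `rows` and return the
    fraction of resamples in which each variable was selected."""
    rng = random.Random(seed)
    n = len(rows)
    counts = Counter()
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]  # draw with replacement
        counts.update(set(select(sample)))
    return {name: count / n_boot for name, count in counts.items()}

# Hypothetical data: the outcome depends on `marker` but not on `noise`.
rows = [{"marker": i, "noise": (7 * i) % 5, "outcome": 2.0 * i + 1.0}
        for i in range(20)]
freqs = bootstrap_inclusion_frequencies(rows, toy_select, n_boot=200)
print(freqs)
```

Reporting the number of resamples, the selection procedure applied within each resample, and the resulting inclusion frequencies gives readers a direct view of how stable the selected model is.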