
Table 9 Methods for filtering and exclusion of variables: Variable filtering

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Variable filtering

 Variable filtering is typically accomplished by calculating a score for each variable and then excluding variables whose score falls below a threshold from further analyses. Modelling or multiple testing procedures can then be applied to the resulting variable set only. However, to preserve correct error control in multiple testing, it is crucial that the filtering is independent of the test statistics that will be used to analyze the filtered data [52]. This is generally accomplished using “nonspecific” filters, where the filtering does not depend on the outcome data. For example, when comparing groups using two-sample t-tests, first removing the variables that exhibit a small difference in the mean values of the classes and then applying multiple testing corrections to the remaining variables leads to greatly inflated type I errors and overoptimistic multiplicity-adjusted p-values. In contrast, type I error is correctly controlled if the filter is based on the overall variance or mean of the variables (combined across both groups), filtering out the variables with small overall variability or low overall expression [52,53,54]. Although computationally helpful, filtering that does not inflate errors does not necessarily increase statistical power either; for example, Bourgon et al. [52] presented an example using Affymetrix gene expression data in which filtering out a large proportion of the genes with low expression actually decreased the number of true discoveries.
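A nonspecific filter of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `overall_variance_filter`, the toy values, and the threshold of 0.5 are all hypothetical and are not taken from the cited references. The point it shows is that the score is computed from all samples pooled, so the group labels never enter the filtering step.

```python
from statistics import pvariance


def overall_variance_filter(data, threshold):
    """Keep variables whose overall variance is at least `threshold`.

    `data` maps a variable name to its values across ALL samples
    (both groups pooled). Group labels are never consulted, so the
    filter is "nonspecific" in the sense used in the text.
    """
    return [name for name, values in data.items()
            if pvariance(values) >= threshold]


# Hypothetical toy data: six samples per variable (three per group),
# with the grouping deliberately not passed to the filter.
expression = {
    "gene_a": [5.1, 5.0, 5.2, 5.1, 5.0, 5.1],  # almost constant
    "gene_b": [2.0, 2.4, 1.8, 6.1, 5.9, 6.3],  # highly variable
    "gene_c": [7.0, 7.1, 6.9, 7.0, 7.2, 7.0],  # almost constant
    "gene_d": [3.0, 4.5, 2.2, 3.9, 4.8, 2.5],  # moderately variable
}

# Threshold chosen arbitrarily for illustration.
kept = overall_variance_filter(expression, threshold=0.5)
print(kept)  # ['gene_b', 'gene_d']
```

Group-wise tests and the multiplicity correction would then be applied to the variables in `kept` only. Replacing the score with the absolute difference in group means would make the filter depend on the outcome, which is the situation the text warns against.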

 Variable filtering is also performed implicitly by some methods that can be used in regression modelling. These methods include the Lasso, which will be discussed in the context of prediction modelling in section “PRED: Prediction.”
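To illustrate why the Lasso acts as an implicit filter, the following minimal sketch fits it by coordinate descent on a toy data set. It assumes centred data with no intercept, and the helper names (`soft_threshold`, `lasso_coordinate_descent`), the data, and the penalty value are hypothetical choices made for illustration. It is not a substitute for an established implementation. The soft-thresholding step sets small coefficients exactly to zero, which removes the corresponding variables from the model.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of variable j with the partial residual
            norm = 0.0  # squared norm of variable j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k]
                                    for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Hypothetical toy data: y depends (almost) only on the first column.
X = [[-2, 2], [-1, -1], [0, -2], [1, -1], [2, 2]]
y = [-6.1, -2.9, 0.1, 3.0, 5.9]

beta = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(selected)  # [0]
```

Here the second variable receives a coefficient of exactly zero, so it is excluded without a separate filtering step. Unlike a nonspecific filter, this selection uses the outcome `y`.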