Table 24 Methods for statistical modelling with machine learning algorithms: Support vector machine, trees, random forests, neural networks and deep learning

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Support vector machine (SVM)

 A support vector machine (SVM) is a typical example of an algorithmic method developed in the machine learning context [150]. It is mostly used for classification, i.e., to predict the response class of the observations (e.g., healthy vs. sick patients), but can also be applied for regression. An SVM divides a set of observations into classes in such a way that the widest possible area around the class boundaries remains free of observations; it is a so-called large margin classifier. The main idea is to construct a (p−1)-dimensional hyperplane (imagine a two-dimensional plane in a three-dimensional space, or a straight line in a plane) which separates the observations based on their response class. Often it is unrealistic to find such a perfectly separating hyperplane, and some misclassified observations must be accepted. Therefore, in the standard extended version of an SVM, observations on the wrong side of the boundaries are allowed, but their number and their combined distance to the boundary are restricted: a tuning parameter, usually denoted by C, defines how much “misclassification” is allowed. In addition, kernel-based extensions allow non-linear separating boundaries.
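
 As an illustration, the following is a minimal sketch using scikit-learn's SVC on simulated data (the data and parameter values are invented for illustration only); C controls how much misclassification is tolerated, and the kernel argument enables non-linear boundaries.

```python
# Minimal SVM sketch with scikit-learn (illustrative; simulated data).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # 200 observations, 10 predictors
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)   # non-linear class boundary

# C controls how much "misclassification" is tolerated (soft margin);
# an RBF kernel allows non-linear separating boundaries.
for C in (0.1, 1.0, 10.0):
    svm = SVC(C=C, kernel="rbf")
    scores = cross_val_score(svm, X, y, cv=5)
    print(f"C={C}: mean CV accuracy {scores.mean():.2f}")
```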

Trees and random forests

 One of the simplest algorithmic tools for prediction is a tree, in which the prediction is based on binary splits of the variable space. For example, a simple tree could have two nodes (splits): a root (the first split), which divides the space into two regions based on the presence of a genetic mutation, and a second node that divides the observations with this mutation again into two parts, based on another mutation. A tree can be grown further until a predetermined number of regions in the variable space is reached; this number is usually chosen via cross-validation [151]. In many studies, variables are measured on different scales (binary, ordinal, categorical, continuous), and for some of them several binary splits are possible, raising the issue of multiple testing. Split-selection algorithms that do not correct for this multiplicity are biased in favor of variables offering many possible cut points over binary variables [152].
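
 To make this concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on simulated binary "mutation" indicators (invented for illustration), in which the number of terminal regions is chosen by cross-validation:

```python
# Minimal classification-tree sketch (illustrative; simulated data).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 5))               # 5 binary "mutation" indicators
y = ((X[:, 0] == 1) & (X[:, 1] == 1)).astype(int)   # class depends on two mutations

# Choose the number of terminal regions (leaves) by cross-validation.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_leaf_nodes": [2, 3, 4, 8]}, cv=5)
grid.fit(X, y)
print(export_text(grid.best_estimator_))            # root split, then further nodes
```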

 Simple trees are often unstable, i.e., fitting a tree to subsets of the data leads to very different estimated trees. One idea to solve this problem is to aggregate the results of trees computed on several bootstrap samples (bagging = Bootstrap AGGregatING [153]). For a continuous response, the predictions of the different trees are typically averaged; for a categorical response, the proportion of trees predicting a given category is used as an estimate of the probability of that category.
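
 The aggregation step can be sketched by hand, as below (a minimal illustration on simulated data; in practice one would use a ready-made implementation such as scikit-learn's BaggingClassifier):

```python
# Hand-rolled bagging sketch: aggregate trees fit on bootstrap samples.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_trees, n = 50, len(y)
votes = np.zeros((n_trees, n), dtype=int)
for b in range(n_trees):
    idx = rng.integers(0, n, size=n)   # bootstrap sample (drawn with replacement)
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    votes[b] = tree.predict(X)

# For a categorical response, the proportion of trees predicting class 1
# estimates the probability of that class; for a continuous response one
# would instead average the trees' predictions.
prob_class1 = votes.mean(axis=0)
print(prob_class1[:5])
```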

 While bagging partially mitigates the instability problem, it is often not very effective because the trees are strongly correlated. Random forests [154] improve on this approach by limiting the correlation among the trees: at each split, only a random subset of the variables is considered as splitting candidates. As in bagging, the results of the different trees are then aggregated to obtain a final prediction rule. Tuning parameters such as the size of the variable subset and the number of bootstrap samples must be chosen, but default values are often used successfully. While using the default values is often a good strategy in the LDD case, this is not necessarily true for HDD problems; for example, the best size of the variable subset depends on the total number of variables available [155]. An overview of random forests, from early development to recent advances, was provided by Fawagreh et al. [156].
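
 A minimal random-forest sketch on simulated high-dimensional data (the candidate subset sizes are invented for illustration), which tunes the per-split variable subset size rather than relying on the default:

```python
# Random-forest sketch: tune the per-split variable subset size (max_features),
# since the scikit-learn default (square root of p) is not always best for HDD.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 500))            # HDD-like setting: p >> n
y = (X[:, 0] - X[:, 1] > 0).astype(int)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    {"max_features": ["sqrt", 0.1, 0.3]},  # fraction of variables per split
    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```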

Neural networks and deep learning

 In recent years, machine learning techniques like neural networks and deep learning have gained much interest due to their excellent performance in image recognition, speech recognition, and natural language processing [157, 158]. They are based on variable transformations: in neural networks, the predictor variables are transformed, generally in a non-linear fashion, through a so-called activation function. One popular choice for the activation function is the sigmoid (logistic) function, which is applied to a linear combination of the predictor variables (the coefficients of this linear combination, which quantify the individual contribution of each predictor variable, are called weights). These new transformed variables (neurons in machine learning terminology, latent variables in statistical terms) form the so-called hidden layers, which are used to build the predictor. Theoretical results suggest that deeper networks, i.e., more hidden layers with fewer neurons per layer, can represent complex relationships more efficiently and may thereby improve prediction performance.
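
 The basic computation can be sketched in a few lines (a toy forward pass with invented dimensions and random weights, not a trained model): each hidden neuron applies a sigmoid activation to a weighted linear combination of the predictors, and the output layer combines the resulting latent variables.

```python
# Forward pass of a one-hidden-layer network: each neuron applies a sigmoid
# activation to a weighted linear combination of the predictor variables.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 10))        # 5 observations, 10 predictors
W1 = rng.normal(size=(10, 4))       # weights: predictors -> 4 hidden neurons
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))        # weights: hidden layer -> output
b2 = np.zeros(1)

hidden = sigmoid(X @ W1 + b1)       # neurons / latent variables (hidden layer)
y_hat = sigmoid(hidden @ W2 + b2)   # predicted class probability
print(y_hat.ravel())
```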

 Neural networks with many hidden layers are referred to as deep learning. The choice of the tuning parameters (activation function, number of hidden layers, and number of neurons per layer) characterizes the different kinds of neural networks (and deep learning algorithms). In high-dimensional contexts, special approaches (e.g., selecting variables or setting weights to zero) are used to avoid overfitting.
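
 As one example of such an approach, the sketch below (settings invented for illustration) uses scikit-learn's MLPClassifier with a weight penalty (the L2 penalty set via alpha, which shrinks weights toward zero) and early stopping, two common ways to limit overfitting when p exceeds n:

```python
# Penalized two-hidden-layer network on simulated high-dimensional data.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 300))    # p > n, an HDD-like setting
y = (X[:, 0] > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(32, 16),  # two hidden layers
                    activation="logistic",        # sigmoid activation
                    alpha=1.0,                    # L2 weight penalty
                    early_stopping=True,
                    max_iter=500, random_state=0)
net.fit(X, y)
print(f"training accuracy: {net.score(X, y):.2f}")
```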

 Deep learning methods are extremely successful when a very large number of observations is available (as in image classification and speech recognition based on huge databases). However, they tend to produce overfitted models in typical biomedical applications, where the number of observations (e.g., the number of patients or subjects) does not exceed a few hundred or a few thousand (see Miotto et al. [159] for a discussion of opportunities and challenges).