
Table 1 Overview of the structure of the paper, as a list of the sections with corresponding analytical goals and common approaches

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

| Section | Analytical goals | Common approaches | Examples |
| --- | --- | --- | --- |
| IDA | Initial data analysis and preprocessing | | |
| IDA1 | Identify inconsistent, suspicious or unexpected values | Visual inspection of univariate and multivariate distributions | Histograms, boxplots, scatterplots, correlograms, heatmaps |
| IDA2 | Describe distributions of variables, and identify missing values and systematic effects due to data acquisition | Descriptive statistics, tabulation, analysis of control values, graphical displays | Measures for location and scale, bivariate measures, RLE plots, MA plots, calibration curve, PCA, Biplot |
| IDA3 | Preprocess the data | Normalization, batch correction | Background correction, baseline correction, centering and scaling, quantile normalization, ComBat, SVA |
| IDA4 | Simplify data and refine/update analysis plan if required | Recoding, variable filtering and exclusion of uninformative variables, construction of new variables, removal of variables or observations due to missing values, imputation | Collapsing categories, variable filtering, discretizing continuous variables, multiple imputation |
| EDA | Exploratory data analysis | | |
| EDA1 | Identify interesting data characteristics | Graphical displays, descriptive univariate and multivariate statistics | PCA, Biplot, multidimensional scaling, t-SNE, UMAP, neural networks |
| EDA2 | Gain insight into the data structure | Cluster analysis, prototypical samples | Hierarchical clustering, k-means, PAM, scree plot, silhouette values |
| TEST | Identification of informative variables and multiple testing | | |
| TEST1 | Identify variables informative for an outcome | Test statistics, modelling approaches | t-test, permutation test, limma, edgeR, DESeq2 |
| TEST2 | Perform multiple testing | Multiple tests, control for false discoveries | Bonferroni correction, Holm’s procedure, multivariate permutation tests, Benjamini-Hochberg (BH), q-values |
| TEST3 | Identify informative groups of variables | Tests for groups of variables | Gene set enrichment analysis, over-representation analysis, global test, topGO |
| PRED | Prediction | | |
| PRED1 | Construct prediction models | Variable transformations, variable selection, dimension reduction, statistical modelling, algorithms, integrating multiple sources of information | Log-transform, standardization, superPC, ridge regression, lasso regression, elastic net, boosting, SVM, trees, random forest, neural networks, deep learning |
| PRED2 | Assess performance and validate prediction models | Choice of performance measures, internal and external validation, identification of influential points | MSE, MAE, ROC curves, AUC, misclassification rate, Brier score, calibration plots, deviance, subsampling, cross-validation, bootstrap, use of external datasets |
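
As a rough illustration of the TEST2 entry, the following Python sketch contrasts two of the listed corrections: Bonferroni adjustment, which controls the family-wise error rate, and the Benjamini-Hochberg (BH) step-up adjustment, which controls the false discovery rate. The function names and the example p-values are assumptions made for illustration only and do not come from the paper; for real analyses an established, tested implementation would normally be preferable.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the original order of `pvals`."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so that the
    # adjusted values are non-decreasing in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for four variables (illustrative only).
    example = [0.01, 0.04, 0.03, 0.005]
    print("Bonferroni:", bonferroni(example))
    print("BH:        ", benjamini_hochberg(example))
```

With many variables, the BH-adjusted values are never larger than the Bonferroni-adjusted ones, which reflects the less stringent error criterion that BH controls.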