Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Rahnenführer, Jörg; De Bin, Riccardo; Benner, Axel; Ambrogi, Federico; Lusa, Lara; Boulesteix, Anne-Laure; Migliavacca, Eugenia; Binder, Harald; Michiels, Stefan; Sauerbrei, Willi; McShane, Lisa

doi:10.1186/s12916-023-02858-y

Table 20 Methods for multiple testing for groups of variables: Gene set enrichment analysis (GSEA), Over-representation analysis, global test, topGO

Gene set enrichment analysis (GSEA)
The popular gene set enrichment analysis (GSEA; [118]) and its extensions are considered mixed approaches, as they test both whether any of the variable groups is associated with the outcome variable and whether any of the variable groups is enriched in variables associated with the outcome variable. A summary statistic is computed for each variable, a relative enrichment score based on a signed Kolmogorov–Smirnov statistic is calculated for each group, and its significance is evaluated using permutations. Groups with scores above or below a threshold are called enriched, and the false positive rate is evaluated using a permutation procedure that permutes the specimens rather than the variables. Efron and Tibshirani [119] proposed basing the score on a standardized “maxmean” statistic (for each group, the larger of the mean positive part and the mean negative part of the summary statistics, standardized), thereby improving the power of the method.
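The steps described above can be illustrated with a minimal Python sketch. It is not the reference GSEA implementation: it assumes an unweighted running-sum (Kolmogorov–Smirnov-type) score, a simple difference in group means as the per-variable summary statistic, and simulated data. All function names and parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def enrichment_score(stats, in_set):
    """Signed, unweighted KS-type running-sum score for one variable group."""
    order = np.argsort(-stats)                 # rank variables from highest to lowest statistic
    hits = in_set[order]
    n_hit = hits.sum()
    n_miss = hits.size - n_hit
    # step up at group members, step down at all other variables
    running = np.cumsum(np.where(hits, 1.0 / n_hit, -1.0 / n_miss))
    return running[np.argmax(np.abs(running))]  # largest deviation from zero, sign kept

def mean_difference(X, y):
    """Per-variable summary statistic (illustrative choice): difference in class means."""
    return X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

def permutation_p_value(X, y, in_set, n_perm=500, seed=0):
    """Permute the specimen labels (not the variables) to obtain a null distribution."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(mean_difference(X, y), in_set)
    null = np.array([
        enrichment_score(mean_difference(X, rng.permutation(y)), in_set)
        for _ in range(n_perm)
    ])
    # two-sided p-value with the usual add-one correction
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p

# Simulated data (assumption): 40 specimens, 100 variables, the first 10 form the group
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 100))
X[y == 1, :10] += 2.0                          # group members are shifted in class 1
in_set = np.zeros(100, dtype=bool)
in_set[:10] = True

es, p = permutation_p_value(X, y, in_set)
print(f"enrichment score = {es:.2f}, permutation p-value = {p:.3f}")
```

Permuting specimen labels keeps the correlation structure among variables intact, which is why the null distribution is built this way rather than by shuffling variables.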
Over-representation analysis
Over-representation analysis (ORA; [120]) uses a concept similar to GSEA. It determines which variable groups are more frequent (overrepresented) in a given subset of “interesting” variables than would be expected by chance. ORA can be applied in the same situations as GSEA, but the significance of the over-representation is determined from the hypergeometric distribution instead of the Kolmogorov–Smirnov statistic, and therefore a subjective cutoff for the summary statistic, which defines the “interesting” variables, must be chosen a priori.
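As a hedged illustration of the hypergeometric calculation, the following Python sketch computes the probability of observing at least the given overlap between a variable group and the list of “interesting” variables. The variable names, summary statistics and cutoff are invented for the example, and the function name is hypothetical.

```python
from math import comb

def ora_p_value(n_universe, n_in_group, n_interesting, n_overlap):
    """P(overlap >= n_overlap) when n_interesting variables are drawn at random."""
    upper = min(n_in_group, n_interesting)
    total = comb(n_universe, n_interesting)
    tail = sum(
        comb(n_in_group, k) * comb(n_universe - n_in_group, n_interesting - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy data (assumption): 20 variables, the first five have large summary statistics
names = [f"v{i}" for i in range(20)]
summary = {name: (3.0 if i < 5 else 0.5) for i, name in enumerate(names)}

cutoff = 2.0                                   # subjective cutoff, chosen a priori
interesting = {n for n, s in summary.items() if abs(s) >= cutoff}
group = {"v0", "v1", "v2", "v3", "v4"}         # prespecified variable group

p_ora = ora_p_value(len(names), len(group), len(interesting), len(group & interesting))
print(p_ora)
```

Changing `cutoff` changes the list of interesting variables and hence the p-value, which is the subjectivity noted above.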
Global test
The global test [121] is based on a regression model in which all the variables belonging to the group are included as covariates; it tests the global null hypothesis that none of these variables is associated with the outcome variable. The method is particularly good at identifying groups containing many variables, each of which might have a relatively small effect.
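The idea can be sketched in Python under simplifying assumptions: a continuous outcome, an intercept-only null model, a quadratic score-type statistic that sums squared covariances between each variable and the outcome, and a permutation null distribution instead of the analytical one. This is an illustration of the principle, not the implementation of [121]; all names and the simulated data are hypothetical.

```python
import numpy as np

def global_statistic(X, y):
    """Average squared score contribution of the variables in the group."""
    Xc = X - X.mean(axis=0)        # centre each variable
    r = y - y.mean()               # residuals under the intercept-only null model
    scores = Xc.T @ r              # one score per variable
    return float(scores @ scores) / X.shape[1]

def global_test_permutation(X, y, n_perm=500, seed=0):
    """Permutation p-value for the null hypothesis that no variable is associated with y."""
    rng = np.random.default_rng(seed)
    q_obs = global_statistic(X, y)
    null = np.array([global_statistic(X, rng.permutation(y)) for _ in range(n_perm)])
    p = (1 + np.sum(null >= q_obs)) / (n_perm + 1)
    return q_obs, p

# Simulated group (assumption): 30 variables, each with a small effect on the outcome
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 30))
y_signal = X @ np.full(30, 0.3) + rng.normal(size=100)

q, p_global = global_test_permutation(X, y_signal)
print(f"Q = {q:.1f}, permutation p-value = {p_global:.3f}")
```

Because the statistic pools evidence across all variables in the group, many small effects can add up to a clear signal even when no single variable would stand out on its own.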
topGO
The topGO algorithm [122] provides methods for testing specific gene groups defined via the Gene Ontology (GO). The Gene Ontology is a widely recognized comprehensive reference for gene annotations. It assigns genes to GO terms belonging to the three main domains: biological processes, molecular functions, and cellular components. The corresponding gene groups (defined according to GO terms) are widely used prespecified groups of variables, often referred to as gene sets. However, many GO terms are highly redundant and define very similar groups of variables, so when the relevance of GO terms is scored with the methods mentioned above, the list of the most significant groups is also highly redundant. topGO provides algorithms for testing GO terms while accounting for the relationships between the corresponding gene groups. As a result, the final list of the most significant groups better represents the diversity of all significant groups. See Figure 16 [123] for the result of the topGO algorithm.
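One simple way to account for the hierarchy can be sketched in Python, loosely modelled on the elimination idea: terms are tested from the most specific to the most general, and the genes of a significant term are removed from all its ancestors before those are tested. This is a simplified sketch and not the topGO package itself. The term names, gene names, threshold and the choice to keep the gene universe fixed are assumptions made for illustration.

```python
from math import comb

def hypergeom_tail(n_universe, n_in_term, n_interesting, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null."""
    upper = min(n_in_term, n_interesting)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_interesting - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_interesting)

def ancestors_of(term, parents):
    """All terms reachable by following parent links upwards."""
    seen, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen

def elim_like_p_values(term_genes, parents, bottom_up_order, universe, interesting, alpha=0.01):
    """Test terms bottom-up; genes of significant terms are removed from their ancestors."""
    removed = {t: set() for t in term_genes}
    p_values = {}
    for term in bottom_up_order:               # assumed: children listed before parents
        genes = term_genes[term] - removed[term]
        p = hypergeom_tail(len(universe), len(genes), len(interesting),
                           len(genes & interesting))
        p_values[term] = p
        if p < alpha:
            for anc in ancestors_of(term, parents):
                removed[anc] |= genes
    return p_values

# Toy hierarchy (assumption): term_child is contained in term_parent, which is contained in term_root
universe = {f"g{i}" for i in range(20)}
term_genes = {
    "term_child": {f"g{i}" for i in range(5)},
    "term_parent": {f"g{i}" for i in range(10)},
    "term_root": set(universe),
}
parents = {"term_child": ["term_parent"], "term_parent": ["term_root"]}
order = ["term_child", "term_parent", "term_root"]
interesting = {f"g{i}" for i in range(5)}

classic = elim_like_p_values(term_genes, {}, order, universe, interesting)   # hierarchy ignored
hierarchical = elim_like_p_values(term_genes, parents, order, universe, interesting)
print(classic["term_parent"], hierarchical["term_parent"])
```

In this toy example the parent term looks enriched only because it inherits the genes of its child; once those genes are removed, the parent no longer appears in the list of significant groups, which illustrates how redundancy is reduced.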

ISSN: 1741-7015

BMC Medicine
