
Table 16 Methods for hypothesis testing for multiple variables in HDD: Limma, edgeR, DESeq2

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Limma

 Linear Models for Microarray Data (limma), developed by Smyth and colleagues [95, 96] and implemented in the R package of the same name, was designed to address several challenges of multiple testing for HDD. Limma offers a unifying, statistically principled framework for multiple testing that uses empirical Bayes shrinkage methods in the context of linear models. Initially popularized for traditional gene expression analysis with microarrays, limma is based on normal distribution theory. It evolved from a procedure that modifies t-statistics by “borrowing information” across variables to improve variance estimation and increase statistical power. Limma provides a way to balance the need for small type I error rates when testing individual variables in HDD settings against statistical power to identify true discoveries. Designs more complex than simple two-group comparisons are easily accommodated by limma’s linear model framework. Although it was developed originally to identify differentially expressed genes from normalized microarray measurements, it has also been used successfully for analysis of data generated by other omics technologies, e.g., proteomics [97].

 For simplicity of explanation, the focus here is on how limma works in the context of simple two-group comparisons as an extension of the familiar t-test. Limma relies on the concept of borrowing information across a collection of similar variables (e.g., expression levels for the thousands of genes measured on a microarray). Many omics studies have relatively small sample sizes compared to the number of variables, so the idea of borrowing information across a very large number of variables is very attractive. If one can assume that the true variances across the many variables follow some overarching distribution, then variance estimates for individual variables that are imprecise due to small sample size can be made more precise by shrinking them toward a variance estimate pooled from all variables. The amount of shrinkage depends on the distribution estimated (empirical Bayes) or assumed (Bayes) for the true variances. Limma is based on an empirical Bayes approach that assumes normally distributed variables and shrinks the individual variances toward the mean of the estimated distribution of true variances.
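
This borrowing of information can be sketched with limma's posterior variance formula, which averages each gene-wise sample variance with a prior variance, weighted by their degrees of freedom. The prior parameters (d0, s0²) are set to illustrative values here; limma estimates them empirically from all genes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated gene-wise sample variances: 1000 genes, residual df = 6
# (e.g., a two-group comparison with 4 samples per group).
n_genes, df = 1000, 6
true_var = rng.gamma(shape=2.0, scale=0.5, size=n_genes)  # gene-level true variances
s2 = true_var * rng.chisquare(df, size=n_genes) / df      # noisy sample variances

# Empirical Bayes shrinkage in the style of limma's posterior variance:
#   s2_tilde = (d0 * s0^2 + df * s2) / (d0 + df)
# d0 (prior df) and s0_sq (prior variance) are illustrative values;
# limma estimates them from the data.
d0, s0_sq = 4.0, float(np.mean(s2))
s2_tilde = (d0 * s0_sq + df * s2) / (d0 + df)

# Shrinkage pulls extreme variance estimates toward the pooled value,
# reducing the spread of the estimates.
print(np.var(s2), np.var(s2_tilde))
```

The heavier the prior weight d0 relative to the residual degrees of freedom, the stronger the pull toward the pooled variance.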

 Out of this empirical Bayes framework comes the moderated t-statistic, which is similar in form to the usual t-statistic, but with the usual sample standard deviation in the denominator replaced by an adjusted estimate for each variable that has been shrunk toward the mean of the distribution of variances. These shrunken estimates are more precise, as reflected in the larger degrees of freedom achieved by “gathering strength” across the many variables, resulting in higher statistical power to identify true discoveries.
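
A minimal sketch for a single gene, with the prior parameters (d0, s0²) fixed at illustrative values rather than estimated across genes as limma would do:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# One gene: expression in two groups of n = 4 samples each.
x = rng.normal(0.0, 1.0, size=4)
y = rng.normal(1.0, 1.0, size=4)
n1, n2 = len(x), len(y)
df = n1 + n2 - 2

# Pooled sample variance and the ordinary two-sample t-statistic.
s2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
se = np.sqrt(s2 * (1 / n1 + 1 / n2))
t_ordinary = (y.mean() - x.mean()) / se

# Moderated t: the same statistic with the sample variance replaced by the
# empirical Bayes posterior variance. d0 and s0_sq are illustrative here.
d0, s0_sq = 4.0, 1.0
s2_tilde = (d0 * s0_sq + df * s2) / (d0 + df)
se_tilde = np.sqrt(s2_tilde * (1 / n1 + 1 / n2))
t_moderated = (y.mean() - x.mean()) / se_tilde

# The moderated t is referred to a t-distribution with df + d0 degrees
# of freedom, reflecting the information gained by borrowing strength.
p_moderated = 2 * stats.t.sf(abs(t_moderated), df + d0)
print(t_ordinary, t_moderated, p_moderated)
```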

 An additional advantage of limma is the complexity of experimental designs it can handle. Many extensions beyond two-class comparison problems can be accommodated by the linear model framework. Comparisons can be made between more than two classes, including linear contrasts, for example to assess linear trends in means across classes. In addition, limma offers a powerful set of tools to address a broad range of experimental settings in which the data can reasonably be represented by a Gaussian linear model. Included in the limma framework are factorial designs, which consist of two or more factors with levels (discrete possible values), for which all combinations across the factors are investigated. This allows the analysis of main effects and interactions between factors.
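
The design-matrix-plus-contrast idea can be illustrated with ordinary least squares; the group labels, response values, and contrast below are made up for illustration:

```python
import numpy as np

# Three ordered groups (e.g., dose levels 0, 1, 2) with 3 samples each.
# The linear model has one indicator column per group, so the fitted
# coefficients are simply the group means.
group = np.repeat([0, 1, 2], 3)
X = np.eye(3)[group]  # design matrix: group-membership indicators
y = np.array([1.0, 1.2, 0.9, 2.1, 1.9, 2.0, 3.0, 3.2, 2.8])

# Least-squares estimates of the three group means.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# A linear-trend contrast across the ordered groups: (-1, 0, 1).
# In limma this contrast would be tested with a moderated statistic.
contrast = np.array([-1.0, 0.0, 1.0])
trend = contrast @ beta
print(beta, trend)
```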

 The evolution of technologies for gene expression analysis from microarrays to sequencing-based approaches such as RNA-Seq presented new statistical challenges for HDD analysis. Gene expression measurements generated by these newer technologies are typically count data rather than the continuous intensity values produced by microarray technologies. For example, RNA-Seq measures the number of reads (DNA fragments) that map to specific genomic locations or features on a reference genome. Count data are generally not compatible with the assumption of normally distributed data on which limma relies. Two extensions of limma were developed to handle gene expression measurements expressed as counts. Limma-trend shrinks the gene-wise variances of the log-transformed count values toward a global mean–variance trend curve. Limma-voom extends this idea by also taking into account global differences in counts between samples, for example due to different sequencing depths.
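
Both extensions start from a log-counts-per-million (log-CPM) transformation that adjusts for library size; a small offset avoids taking the log of zero. The count values below are made up for illustration:

```python
import numpy as np

# Raw RNA-Seq counts for 5 genes in 3 samples whose sequencing depths
# (column sums) differ considerably.
counts = np.array([
    [ 10,  100,  50],
    [200, 2100, 980],
    [  0,    4,   1],
    [ 55,  480, 260],
    [ 30,  310, 150],
])
lib_size = counts.sum(axis=0)

# log2 counts per million with a small offset to avoid log of zero --
# the kind of transformation used as the starting point for
# limma-trend and limma-voom.
log_cpm = np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)
print(log_cpm.round(2))
```

After the transformation, a gene whose counts are proportional to library size shows nearly constant log-CPM values across samples even though its raw counts vary tenfold.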

 Several other methods for analyzing count data were developed independently of the limma extensions, founded on negative binomial models to characterize the distribution of count data. The negative binomial includes the Poisson as a special case and is generally preferred in the setting of modern gene expression analysis because it offers greater flexibility for modelling the variances of counts, particularly when the counts are not large or when the number of replicates for each biological group or condition is small.

edgeR

 The edgeR procedure [98] assumes that the read count for a particular genomic feature follows a negative binomial (NB) distribution. Although a genomic variable of interest need not correspond exactly to a gene, the term gene is used in the following for simplicity. Much of the discussion is framed in terms of gene expression count data arising from RNA-Seq measurements, but the developers note that the methods implemented in edgeR apply more generally to count data generated by other omics technologies, including ChIP-Seq for epigenetic marks and DNA methylation analyses.

 The measured count for gene g in sample i is assumed to follow an NB distribution with mean equal to the library size for that sample (total number of DNA fragments generated and mapped) multiplied by a parameter representing the relative abundance of gene g in the experimental group j to which sample i belongs. The variance of the count for a specific gene under the NB distribution is a function of the mean and a dispersion parameter; specifically, the variance is modeled as the sum of technical and biological variation. Technical variation for gene expression and other types of omics count data can usually be adequately modeled as Poisson, but biological variability between samples introduces additional variability, which is captured by an “overdispersion” term in the variance. Writing μ for the mean and φ for the dispersion parameter, the variance of a count is modeled as μ(1 + φμ), where √φ represents the coefficient of variation of biological variation between samples. This expression reflects a partition of the variance into contributions from technical and biological variation: when there is no biological variation between samples, e.g., when samples are true technical replicate sequencing runs from a single library, the variance reduces to the Poisson variance, which equals the mean. This model provides a flexible and intuitive expression for the variance and incorporates the dependence of the variance on the mean expected for count data.
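
The mean–variance relationship can be verified numerically. The mean and dispersion values below are illustrative; the conversion maps the (mean, dispersion) parameterization onto scipy's (n, p) parameterization of the negative binomial:

```python
import numpy as np
from scipy import stats

# NB variance model for a count with mean mu and dispersion phi:
#   Var(y) = mu + phi * mu^2 = mu * (1 + phi * mu)
# sqrt(phi) is the biological coefficient of variation (BCV).
# With phi = 0 the model reduces to the Poisson (variance = mean).
mu, phi = 100.0, 0.04        # illustrative values; BCV = 0.2
var_nb = mu + phi * mu**2    # 100 + 0.04 * 10000 = 500

# Convert (mu, phi) to scipy's (n, p) parameterization:
#   n = 1/phi, p = n / (n + mu)
n = 1.0 / phi
p = n / (n + mu)
print(stats.nbinom.mean(n, p), stats.nbinom.var(n, p))
```

With these values the biological component (φμ² = 400) dominates the technical Poisson component (μ = 100), as is typical for highly expressed genes.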

 Using an empirical Bayes approach similar in spirit to that described for limma, the edgeR procedure borrows information across genes to shrink the gene-specific dispersion parameters toward a model describing the distribution of dispersion parameters. The simplest model is one in which all genes share a common dispersion parameter, which can be estimated from the data. Allowing greater flexibility, dispersion parameters can be modeled as a smooth trend as a function of the average read count for each gene. To allow for further gene-specific sources of variation in the count of a gene, empirical Bayes methods are employed to estimate weighted averages that combine gene-specific dispersion estimates with those arising from the dispersion models, in this way “shrinking” the gene-specific dispersion estimates toward the overall model.
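
The weighted-average idea can be sketched as follows. The weight w is a fixed illustrative constant here; edgeR derives the analogous weight from prior and residual degrees of freedom, and the "trend" is a constant rather than a fitted curve:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gene-wise dispersion estimates scattered around a trend value.
trend = 0.1
phi_gene = trend * rng.lognormal(0.0, 0.7, size=500)  # noisy gene-wise estimates

# Shrunken estimate: weighted average of the gene-wise estimate and
# the trend value (a simplified stand-in for edgeR's weighted-likelihood
# moderation of dispersions).
w = 0.6  # illustrative weight on the gene-wise estimate
phi_shrunk = w * phi_gene + (1 - w) * trend

# Shrinkage pulls estimates toward the trend, reducing their spread.
print(np.std(phi_gene), np.std(phi_shrunk))
```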

 The edgeR software allows the user to compare gene expression between groups when there are replicate measurements in at least one of the groups and, more generally, when the group mean structure can be expressed as a linear model. Scientific questions of interest can be framed as inferences about the relative abundance parameters in the linear model. For example, one might wish to compare the relative abundance of a particular gene transcript in samples from cell cultures not exposed to a new drug with that in samples from cultures after exposure to the drug. There could also be interest in examining the pattern of change in relative abundance of the gene by sampling from a series of cultures exposed to the drug for differing lengths of time. From the specified linear model and shrunken variance estimates, the edgeR software can perform gene-wise tests of significance, based on likelihood ratio statistics, for any parameters or contrasts of parameters in the mean model.
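
A stripped-down version of such a gene-wise likelihood ratio test is sketched below for a two-group comparison, treating the NB dispersion as known (in edgeR it would be a shrunken estimate) so that each group mean is estimated by its sample mean. The counts and dispersion are made-up values:

```python
import numpy as np
from scipy import stats

# Replicate counts for one gene in two groups (illustrative values).
y1 = np.array([12, 18, 15, 20])   # unexposed cultures
y2 = np.array([35, 42, 30, 38])   # drug-exposed cultures
phi = 0.05                        # dispersion, treated as known here

def nb_loglik(y, mu, phi):
    # NB log-likelihood via scipy's (n, p) parameterization.
    n = 1.0 / phi
    p = n / (n + mu)
    return stats.nbinom.logpmf(y, n, p).sum()

# Alternative model: separate group means. Null model: one common mean.
ll_alt = nb_loglik(y1, y1.mean(), phi) + nb_loglik(y2, y2.mean(), phi)
y_all = np.concatenate([y1, y2])
ll_null = nb_loglik(y_all, y_all.mean(), phi)

# Likelihood ratio statistic, referred to chi-squared with 1 df.
lrt = 2.0 * (ll_alt - ll_null)
p_value = stats.chi2.sf(lrt, df=1)
print(lrt, p_value)
```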

DESeq2

 DESeq2 [99] is another widely used method for differential analysis of count data. Its performance is comparable to that of edgeR in terms of false discovery control and statistical power to detect differentially expressed genes. It also uses a negative binomial model for the counts, with a variance expression that incorporates a dispersion parameter, as described for edgeR. Dispersion parameters are modeled across genes as a smooth curve depending on average gene expression strength. Using empirical Bayes methods, gene-specific dispersion parameters are shrunk toward the curve by an amount that depends on how close the individual dispersion estimates tend to be to the fitted smooth curve and on the sample size (through the degrees of freedom).

 A feature of DESeq2 that distinguishes it from other methods is the incorporation of shrinkage into the estimation of mean parameters. Shrinkage of mean parameters, e.g., fold-changes, has appeal because researchers tend to find larger effects more convincing. Genes that attain statistical significance but exhibit small effect sizes are frequently filtered out manually due to concern that the significance could be due to random experimental noise. The shrinkage of fold-changes implemented by DESeq2 provides a more statistically principled approach to handling these less reliable findings, which are observed particularly often for genes with small counts. Additional useful features of DESeq2 include options for outlier detection.
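
The effect of fold-change shrinkage can be sketched with a zero-centered normal prior on the log fold-change, which yields a simple ridge-style posterior mean. The prior width tau is an illustrative constant here; DESeq2 estimates the prior from the data:

```python
import numpy as np

# Normal-prior shrinkage of a log2 fold-change (simplified sketch of the
# idea behind DESeq2's shrunken fold-changes). For a normal likelihood
# with standard error se and a N(0, tau^2) prior, the posterior mean is:
#   lfc_shrunk = lfc_mle * tau^2 / (tau^2 + se^2)
tau = 1.0  # illustrative prior width

def shrink_lfc(lfc_mle, se):
    return lfc_mle * tau**2 / (tau**2 + se**2)

# A well-measured gene (small standard error) is barely shrunk,
# while a noisy low-count gene with the same estimated fold-change
# is pulled strongly toward zero.
print(shrink_lfc(2.0, 0.1))   # ~1.98
print(shrink_lfc(2.0, 2.0))   # 0.4
```

This behavior matches the motivation in the text: large apparent fold-changes that rest on noisy, low-count measurements are automatically discounted rather than filtered out by an ad hoc threshold.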