Skip to main content

Table 3 Methods for descriptive statistics: Measures for location and scale, bivariate measures, RLE plots, MA plots

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Measures for location and scale

 As measure of location, the mean is standard for continuous data, the median is robust regarding extreme values, and the mode is often used for categorical data. Such measures can be extended to higher dimensions by calculating them component-wise, i.e., for every variable separately, and then collecting the values into a vector

 As measure of scale, the standard deviation for continuous data and the median absolute deviation to the median (MAD) as a robust counterpart are often used. The coefficient of variation scales the standard deviation by dividing by the mean and is helpful for comparing variables that are measured on different scales

Bivariate measures

 Bivariate descriptive statistics are based on pairs of variables, often the correlation coefficient is used to quantify the relationship between two variables. The classical Pearson correlation coefficient captures only linear relationships, whereas Spearman’s rank-based correlation coefficient may more effectively capture strong non-linear, but monotonic, relationships

RLE plots

 Relative log expression (RLE) plots [32] can be used for visualizing and detecting unwanted variation in HDD. They were developed for gene expression microarray data, but are now very popular especially for the analysis of single-cell expression data. For each variable (e.g., expression of a particular gene), first, its median value across all observations is calculated. Then, the median is subtracted from all values of the corresponding variable. Finally, for each observation, a boxplot is generated of all deviations across the variables. Comparing the boxplots, if one of them looks different with respect to location or spread, it may indicate a problem with the data from that observation. RLE plots are particularly useful for assessing the effects of normalization methods that are applied for removing unwanted variation, which might be due to, e.g., batch effects, see also section “IDA3.2: Batch correction.” An example RLE plot is presented in Fig. 3 [32]

MA plots (Bland–Altman plots)

 A natural way to assess concordance between measurements that are supposed to be replicates is to construct a simple scatterplot and look for distance from the 45-degree line. However, a preferred approach is to construct a Bland–Altman plot [33] instead of a scatterplot. In the omics literature, this plot is often referred to as an MA plot [34]. The horizontal (“x”) axis of a Bland–Altman plot is the mean of the paired measurements, and the vertical (“y”) axis is the difference, often after measurements have been log transformed. The advantage of this plot compared to a traditional scatterplot is that it allows better visualization of differences against a reference horizontal line at height zero and improved ability to detect changes in variability (spread) of those differences moving along the x-axis (see section “IDA4: Simplify data and refine/update analysis plan if required”). An example of a Bland–Altman Plot is presented in Fig. 4 [35]