Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Rahnenführer, Jörg; De Bin, Riccardo; Benner, Axel; Ambrogi, Federico; Lusa, Lara; Boulesteix, Anne-Laure; Migliavacca, Eugenia; Binder, Harald; Michiels, Stefan; Sauerbrei, Willi; McShane, Lisa

doi:10.1186/s12916-023-02858-y

Table 2 Methods for visual inspection of univariate and multivariate distributions: Histograms, boxplots, scatterplots, correlograms, heatmaps

Histograms
Histograms divide the range of values into intervals and count how many values fall into each interval. They are useful for visualizing the shape of the data distribution and for identifying outlying points. Applying a transformation (e.g., a logarithm) before plotting can improve the visualization by providing better resolution of densely packed values and drawing more extreme values closer to the main body of the distribution
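The binning step and the effect of a transformation can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `histogram_counts`, the choice of five equal-width intervals, the log2 transformation, and the data values are all assumptions made up for the example, not taken from the article.

```python
import math

def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width intervals and count values per interval.

    Assumes the values are not all identical (otherwise the interval width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is assigned to the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical right-skewed measurements, invented for illustration.
raw = [1, 2, 2, 3, 4, 5, 8, 10, 16, 30, 64, 120, 500, 1000]

raw_counts = histogram_counts(raw, 5)
log_counts = histogram_counts([math.log2(v) for v in raw], 5)

print("raw scale: ", raw_counts)   # most values pile up in the first interval
print("log2 scale:", log_counts)   # values are spread more evenly across intervals
```

On the raw scale almost all values share one interval and the two largest values sit far away; after the transformation the densely packed values are resolved and the extremes move closer to the rest.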
Boxplots
A boxplot (also called a box-and-whisker plot) is a graphical display that gives a quick impression of the location and spread of data values and thus simplifies comparison between variables. A central box spans the interquartile range, which contains the central 50% of the data; the median is indicated by a line within the box; and the lines extending vertically from the box (whiskers) cover all values that lie no further than 1.5 times the interquartile range from the edges of the box. A commonly used option is to plot individually the points that fall outside the range covered by the whiskers. When boxplots are used to display variables with many values (like the expression values of all genes within an experiment), many values are expected to fall in this category, and plotting them individually can create the impression of many extreme values
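The quantities a boxplot displays can be computed directly. The sketch below uses the standard library's `statistics.quantiles`; the function name `boxplot_summary`, the "inclusive" quartile method, and the simulated standard normal values standing in for expression data are assumptions for illustration. Plotting software may use a slightly different quartile definition.

```python
import random
import statistics

def boxplot_summary(values):
    """Return quartiles, whisker ends, and the points outside the whiskers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outside = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers end at the most extreme values still within 1.5 * IQR of the box.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outside": outside,
    }

small = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(small["outside"])  # [100]

# Hypothetical "many values" case: 20,000 draws from a standard normal distribution.
random.seed(1)
many = boxplot_summary([random.gauss(0, 1) for _ in range(20000)])
print(len(many["outside"]), "points would be drawn individually")
```

For normally distributed data roughly 0.7% of values lie beyond the whiskers, so with 20,000 values on the order of a hundred points are plotted individually even though none of them is unusual for that distribution.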
Scatterplots
Scatterplots display one variable plotted against another, with each axis corresponding to one of the two variables. Both variables may be observed (e.g., expression of one gene against expression of a different gene), or one of the two variables could be a factor such as time, order of entry into study, or order in which a measurement such as an assay was conducted. Plotted points may represent the values of two variables for each of the study subjects, or each point could represent one of many different variables measured on an individual subject. For HDD, plots in which each point represents a different variable may contain an extremely large number of points, making them hard to interpret because many plotting symbols overlap. Strategies such as semi-transparent colors for the plotted points, or density plots in which regions with more observations appear darker, may be necessary. Another strategy is to randomly sample points to create a subset that provides a less dense plot
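Two of these strategies, random subsampling and density summaries, can be sketched without any plotting library. The simulated data, the subset size of 2,000, the grid cell width of 0.25, and the helper name `density_grid` are assumptions chosen for illustration; a plotting library would then draw either the subset or the per-cell counts as shades.

```python
import math
import random
from collections import Counter

random.seed(0)

# Hypothetical data: 50,000 points, e.g., one point per measured variable.
x = [random.gauss(0, 1) for _ in range(50000)]
y = [xi + random.gauss(0, 0.5) for xi in x]
points = list(zip(x, y))

# Strategy 1: draw a random subset (without replacement) to get a less dense plot.
subset = random.sample(points, k=2000)

# Strategy 2: summarize density by counting points per square grid cell;
# cells with higher counts would be drawn darker.
def density_grid(pts, cell=0.25):
    return Counter((math.floor(px / cell), math.floor(py / cell)) for px, py in pts)

grid = density_grid(points)
densest_cell, densest_count = grid.most_common(1)[0]

print(len(subset), "points in the subset")
print(len(grid), "occupied cells; densest cell holds", densest_count, "points")
```

Semi-transparent plotting symbols are a rendering option of the plotting library in use and are therefore not shown here.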
Correlograms
A correlogram (or corrgram) is a graphical representation of the correlation matrix [26]. It is a visual display for depicting patterns of relations among variables directly from the correlation matrix. In a correlogram, the values are rendered to depict sign and magnitude. Further, variables can be reordered such that similar variables are positioned adjacently, in order to facilitate the perception of the relations. Since correlograms visualize correlation matrices, they are only useful for LDD, i.e., if the number of variables is not too large. Of course, the correlations themselves can be computed from high-dimensional vectors. Figure 1 [27] shows an example of a correlogram
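The two ingredients described here, a correlation matrix and a reordering that places similar variables next to each other, can be sketched as follows. The greedy ordering rule is a simple heuristic chosen for brevity and is not claimed to be the ordering used in [26]; the variable names, the simulated data, and the helper names `pearson` and `greedy_order` are assumptions for illustration.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def greedy_order(names, corr):
    """Start with the first variable; repeatedly append the remaining variable
    most strongly (positively) correlated with the one placed last."""
    order = [names[0]]
    remaining = list(names[1:])
    while remaining:
        nxt = max(remaining, key=lambda name: corr[order[-1]][name])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical data: two groups of related variables, listed in interleaved order.
random.seed(42)
n = 200
base1 = [random.gauss(0, 1) for _ in range(n)]
base2 = [random.gauss(0, 1) for _ in range(n)]
data = {
    "g1_a": [b + random.gauss(0, 0.3) for b in base1],
    "g2_a": [b + random.gauss(0, 0.3) for b in base2],
    "g1_b": [b + random.gauss(0, 0.3) for b in base1],
    "g2_b": [b + random.gauss(0, 0.3) for b in base2],
}
names = list(data)
corr = {r: {c: pearson(data[r], data[c]) for c in names} for r in names}
order = greedy_order(names, corr)

# Text rendering of the reordered matrix: sign and magnitude of each correlation.
for r in order:
    print(r.ljust(5), " ".join(f"{corr[r][c]:+.2f}" for c in order))
```

After reordering, the strongly correlated pairs form blocks along the diagonal, which is the pattern a correlogram makes visible through color or shading.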
Heatmaps
A common two-dimensional visualization method is a heatmap [28] where the individual values contained in a data matrix are represented as colors in boxes of a rectangular grid. Sometimes raw data values are centered or scaled within rows or columns prior to display, which can be particularly helpful when rows or columns represent variables having different ranges or measurement scales. Clear description of any such centering and/or scaling is essential for proper interpretation of the figure. Choice of color-palette and ordering of rows and columns can both heavily influence the information conveyed by the graphical display. Complementary colors (e.g., red and green, blue and orange) can be used to emphasize two sides of a centered scale. Examples include many published heatmaps for gene expression microarray data in which shades of red might represent degrees of overexpression (relative to median or mean) of a gene and shades of green could represent underexpression. Another consideration for a heatmap display is the ordering of the rows and columns. Sometimes there is an ordering of the observations based on experimental design, for example, samples collected in a time course experiment are represented as ordered columns in the heatmap. As a quality check, it can be useful to order columns by sequence in which samples were assayed. Unexpected trends may indicate assay drift or batch effects. If rows correspond to factors such as gene transcript or protein levels, it can be illuminating to order them according to similarity of pattern across observations. Various clustering methods can be applied to construct orderings of observations or variables. These orderings might be illustrated by dendrograms, which can be displayed along axes of the heatmap to depict the distance structure (see section “EDA2.1: Cluster analysis” for discussion of clustering methods). Figure 2 [29] shows an example of a heatmap
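The preprocessing steps that precede a heatmap display can be sketched as below: centering within rows, mapping the centered values to the two sides of a color scale, and ordering columns by assay sequence as a quality check. The gene and sample names, the values, the run order, and the helper names are all hypothetical; clustering-based orderings and dendrograms are not shown.

```python
import statistics

# Hypothetical expression matrix: rows are genes, columns are samples.
samples = ["s1", "s2", "s3", "s4"]
run_order = {"s1": 3, "s2": 1, "s3": 4, "s4": 2}  # assumed sequence in which samples were assayed
data = {
    "geneA": [10.0, 12.0, 11.0, 15.0],
    "geneB": [200.0, 180.0, 220.0, 160.0],
}

def center_rows(matrix):
    """Subtract each row's median so rows with different ranges share a zero point."""
    centered = {}
    for gene, values in matrix.items():
        med = statistics.median(values)
        centered[gene] = [v - med for v in values]
    return centered

def reorder_columns(matrix, column_names, rank):
    """Reorder columns by a rank (here: assay sequence) to look for drift or batch effects."""
    idx = sorted(range(len(column_names)), key=lambda i: rank[column_names[i]])
    new_names = [column_names[i] for i in idx]
    new_matrix = {gene: [values[i] for i in idx] for gene, values in matrix.items()}
    return new_names, new_matrix

def side_of_scale(value):
    """Map a centered value to one side of a red/green scale."""
    if value > 0:
        return "red"    # above the row median (relative overexpression)
    if value < 0:
        return "green"  # below the row median (relative underexpression)
    return "neutral"

centered = center_rows(data)
ordered_samples, ordered = reorder_columns(centered, samples, run_order)

print(ordered_samples)
for gene, values in ordered.items():
    print(gene, values, [side_of_scale(v) for v in values])
```

In this made-up example geneB falls and then rises along the assay sequence; in real data a monotone trend in such a display is the kind of pattern that may point to assay drift.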

ISSN: 1741-7015

BMC Medicine
