Table 6 Methods for background subtraction and normalization: Background correction, baseline correction, centering and scaling, quantile normalization

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Background correction

A classic example of such a step is the background correction applied to data generated from some of the earliest microarrays [41]. In this approach, the signal of interest is obtained by summarizing the pixel intensity values within a designated region or "spot" (e.g., corresponding to the location of the probe for a particular gene) on a scanned image of a hybridized array. Ideally, pixels in areas outside the spots should have zero intensity, but this is rarely the case because of the fluorescence of the array surface itself. This fluorescence is termed the background. Because the background may contaminate the measurement of spot fluorescence, the signal in the spot should be corrected for it by subtracting the fluorescence measured in the background.
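As a minimal sketch of this idea, assuming a hypothetical 2D image array and boolean masks marking the spot and its surrounding background region (the masks and the median summary are illustrative choices, not the specific method of [41]):

```python
import numpy as np

def background_correct(image, spot_mask, background_mask):
    """Summarize the spot pixels and subtract a local background estimate.

    image: 2D array of pixel intensities from the scanned array
    spot_mask, background_mask: boolean arrays of the same shape as image
    """
    spot_signal = np.median(image[spot_mask])       # summary of pixels within the spot
    background = np.median(image[background_mask])  # fluorescence of the array surface
    # Corrected signal; clipped at zero because negative intensity is not meaningful
    return max(spot_signal - background, 0.0)
```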

Baseline correction

In proteomic mass spectrometry [42], the counterpart of background correction is "baseline correction." In mass spectrometry, the mass-to-charge ratio (m/z) of molecules present in a sample is measured. The resulting mass spectrum is a plot of intensity vs. m/z representing the distribution of proteins in the sample. In this technology, chemical noise is usually present in the spectrum, typically caused by compounds, such as solvents or sample contaminants, that did not originate from the analyzed biological sample. Chemical noise can cause a systematic upward shift of the measured intensity values from the true baseline across a spectrum. This baseline noise poses a problem because intensity is used to infer the relative abundance of molecules in the analyzed sample, and a baseline shift will distort those relative measures; hence, baseline subtraction is typically applied when preprocessing mass spectrometry data.
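One simple way to sketch baseline subtraction, assuming the baseline can be approximated by a smoothed lower envelope of the spectrum (the rolling-minimum approach and window size here are illustrative, not a method prescribed by [42]):

```python
import numpy as np

def subtract_baseline(intensity, window=101):
    """Subtract a smoothed lower-envelope baseline from one spectrum (1D array)."""
    n = len(intensity)
    half = window // 2
    padded = np.pad(intensity, half, mode="edge")
    # Rolling minimum: trace the lower envelope of the spectrum
    envelope = np.array([padded[i:i + window].min() for i in range(n)])
    # Smooth the envelope with a moving average so the baseline varies slowly
    kernel = np.ones(window) / window
    baseline = np.convolve(np.pad(envelope, half, mode="edge"), kernel, mode="valid")
    # Corrected spectrum; clip at zero because negative intensity is not meaningful
    return np.clip(intensity - baseline, 0.0, None)
```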

Centering and scaling

Normalization aimed at addressing between-run differences typically involves re-centering or re-scaling the data obtained for a particular run by applying a correction factor that captures the difference between the measurements from that run and measurements from some type of average over multiple runs, or from a reference run. The correction factor may be obtained by using internal controls or standards. These can be either analytes known to be present in the sample or analytes added to the sample that should, theoretically, yield the same measurements whenever the same amount of sample material is measured. If the measured values of the internal standards differ across runs, those values can be used for re-centering or re-scaling.
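A minimal sketch of re-centering with an internal standard, assuming a hypothetical data layout (runs as rows, analytes as columns, on a log scale) and a known column index for the spiked-in standard; the names and the across-run mean as reference are assumptions for illustration:

```python
import numpy as np

def recenter_by_standard(data, standard_idx):
    """Shift each run so its internal-standard value matches the across-run mean.

    data: 2D array (runs x analytes), assumed to be on a log scale
    standard_idx: column index of the internal standard
    """
    standard_values = data[:, standard_idx]              # one value per run
    correction = standard_values - standard_values.mean()  # per-run offset
    return data - correction[:, None]                    # subtract offset from every analyte
```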

An alternative approach is to use a run-based estimate of the correction constant, calculated across the many measured variables for an individual sample. Examples include re-centering or re-scaling the measurements by their mean value (as in the total ion current normalization of mass spectrometry data), or by an estimate reflecting the amount of processed biological material (as in the library size normalization of next-generation sequencing data).
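For the run-based variant, a sketch of total ion current normalization, assuming spectra are stored as rows of a matrix (the layout and the choice of the mean total as the target are assumptions for illustration; library size normalization of sequencing counts follows the same pattern):

```python
import numpy as np

def tic_normalize(spectra):
    """Re-scale each spectrum (row) so its summed intensity equals the mean total.

    spectra: 2D array (runs x m/z features) of non-negative intensities
    """
    totals = spectra.sum(axis=1)                       # one "size factor" per run
    return spectra * (totals.mean() / totals)[:, None]  # scale every feature in the run
```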

Data preprocessing terminology can be confusing for high-dimensional omics data. Although centering and scaling are often referred to generically as standardization, here centering and scaling will refer to the adjustment of all values of one observation (across variables). Standardization, in the sense of centering and scaling all values of one variable (across observations), is described in the section "PRED1.1: Variable transformations."
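The distinction amounts to which axis of the data matrix the operation runs over; assuming the convention of rows as observations (samples) and columns as variables:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))  # rows = observations (samples), columns = variables

# Centering and scaling in this table's sense: adjust all values of one observation
per_sample = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Standardization in the PRED1.1 sense: adjust all values of one variable
per_variable = (X - X.mean(axis=0, keepdims=True)) / X.std(axis=0, keepdims=True)
```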

Quantile normalization

Quantile normalization [43] is a widely used normalization procedure that addresses between-run differences and has been popular for omics data. The method assumes that the distribution of measured values across the many analytes is roughly similar from sample to sample, with only relatively few analytes accounting for the differences in phenotypes (biological or clinical) across samples. The quantiles of the distribution of raw measured values (e.g., across genes) for each sample are adjusted to match a reference distribution, which is obtained either from a reference sample or constructed as some sort of average over a set of samples. Although the numerical quantiles are forced to match, the particular analyte (e.g., pertaining to a certain gene) to which a quantile corresponds can vary from sample to sample, thus preserving the biological differences across samples. Figure 8 [44] shows the effect of quantile normalization.
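A minimal sketch of the procedure, assuming samples are stored as columns of a matrix and using the across-sample mean of the sorted values as the reference distribution (one common construction; ties are broken arbitrarily here):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize samples stored as columns of X (rows = analytes)."""
    order = np.argsort(X, axis=0)                 # sorting permutation per sample
    ranks = np.argsort(order, axis=0)             # rank of each value within its sample
    reference = np.sort(X, axis=0).mean(axis=1)   # mean of sorted columns = reference
    return reference[ranks]                       # replace each value by its reference quantile
```

After this step, every sample has exactly the same set of sorted values; what still differs across samples, and what carries the biology, is which analyte occupies each rank.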