
Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges



In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.


Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 “High-dimensional data” of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.


The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.


This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.



The goal of the topic group TG9 “High-dimensional data” (HDD) of the STRATOS (STRengthening Analytical Thinking for Observational Studies) [1] initiative is to provide guidance for planning, conducting, analyzing, and reporting studies involving high-dimensional biomedical data. The increasing availability and use of “big” data in biomedical research, characterized by “large n” (independent observations) and/or “large p” (number of dimensions of a measurement or number of variables associated with each independent observation), has created a need for the development and novel application of statistical methods and computational algorithms. Either large n or p may present difficulties for data storage or computations, but large p presents several major statistical challenges and opportunities [2]. The dimension p can range from several dozen to millions. The situation of very large p is the focus of TG9 and this paper. Throughout the paper, “p” will refer to the number of variables and the term “subject” will be used broadly to refer to independent observations, including human or animal subjects, or biospecimens derived from them; or other independent experimental or observational units. Researchers who design and analyze such studies need a basic understanding of the commonly used analysis methods and should be aware of pitfalls when statistical methods that are established in the low-dimensional setting cannot, or should not, be used in the HDD setting.

This overview, a product of STRATOS topic group TG9, provides a gentle introduction to fundamental concepts in the analysis of HDD, in the setting of observational studies in biomedical research. The focus is on analytical methods; however, issues related to study design, interpretation, transportability of findings, and clinical usefulness of results should also be considered as briefly discussed throughout this paper.

The STRATOS initiative and the STRATOS topic group TG9 “High-dimensional data”

The STRATOS initiative is a large collaboration involving experts in many different areas of biostatistical research. The objective of STRATOS is to provide accessible and sound guidance for the design and analysis of observational studies [1]. This guidance is intended for applied statisticians and other data analysts with varying levels of statistical training, experience and interests. TG9 is one of nine topic groups of STRATOS and deals with aspects of HDD analysis.

Main issues addressed by TG9 often overlap with those of other TGs, but in the work of TG9 there is always a focus on the HDD aspect. Sometimes TG9 guidance will build upon that of other TGs to adapt it for relevance to HDD (see the “Discussion” section), but also completely new issues arise and require novel statistical approaches.

High-dimensional data are now ubiquitous in biomedical research, very frequently in the context of observational studies. In particular, omics data, i.e., high-throughput molecular data (e.g., genomics, transcriptomics, proteomics, and metabolomics), have provided new insights into biological processes and disease pathogenesis and have furthered the development of precision medicine approaches [3]. Rapidly expanding stores of electronic health records contain not only standard demographic, clinical, and laboratory data collected through a patient history, but also information from potentially many different providers involved in a patient’s care [4]. Data may be derived from multiple sources and can be represented in many different forms. Collectively, these data can be leveraged to support programs in comparative effectiveness and health outcomes research, and to monitor public health. Many statistical methods that are discussed here may be applied to health records data as well as to omics data, but our primary focus here is on the analysis of omics data.

Simultaneously, advances in statistical methodology and machine learning methods have contributed to improved approaches for data mining, statistical inference, and prediction in the HDD setting. Strong collaborations between data and computational scientists (e.g., statisticians, computational biologists, bioinformaticians, and computer scientists) and other biomedical scientists (e.g., clinicians and biologists) are essential for optimal generation, management, processing, analysis, and interpretation of these high-dimensional biomedical data [5].

Credibility and importance of research findings from biomedical studies involving HDD can be better judged when there is understanding of various approaches for statistical design and analysis along with their strengths and weaknesses. While this overview directly aims to improve understanding, simultaneously this guidance implies what information is necessary to report to fully appreciate how a study was designed, conducted, and analyzed. Whether study results prompt further pre-clinical or early clinical work, or translation to clinical use, ability to judge quality, credibility, and relevance of those results is critical. It is important to avoid sending research programs down unproductive paths or allowing flawed research results such as poorly performing prognostic models or therapy selection algorithms generated from HDD to be implemented clinically [6]. Historically, research involving biomarkers and prognostic modelling has been criticized for lack of rigor, reproducibility, and clinical relevance [7,8,9,10], and for poor reporting [11, 12]. At least as many deficiencies are also common in biomedical research involving HDD. The goal of STRATOS TG9 is to reduce these deficiencies, and improve rigor and reproducibility, by providing widely accessible didactic materials pertinent to studies involving HDD.

Study design

In any observational study, including in the HDD setting, study design plays a crucial role in relation to the research question. A first important point is the precise definition of the target population and the sampling procedure. The subjects included in a study (or biospecimens derived from them) may be selected from the population by a random or other statistically designed sampling procedure (e.g., case–control, case-cohort), or may simply represent a “convenience” sample. It is therefore important to understand whether the subjects are representative of the target population, how the variables associated with subjects were measured or ascertained, and whether there are potential confounding factors. Failure to account for confounding factors or minimize bias in subject or variable ascertainment can lead to useless or misleading results.

Outcome-dependent sampling is rather common in observational studies, particularly for those investigating risk factors for relatively uncommon diseases or outcomes. Examples include classical matched or unmatched case–control designs along with two-phase sampling from a cohort (case-cohort or nested case–control). Another often-used strategy oversamples long survivors, or, for continuous outcomes, subjects with high and low values of the outcome variable. When any such sampling strategies are employed, it is important to use inferential procedures [13, 14] that properly account for the sampling design.
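As a toy illustration of why design-aware inference matters, the sketch below (hypothetical numbers, plain NumPy; `naive` and `weighted` are names introduced here) contrasts an unweighted sample estimate of exposure prevalence with a Horvitz–Thompson-type estimate that weights each sampled subject by the inverse of its selection probability. Real analyses should use the dedicated case–control or two-phase inference methods cited above [13, 14].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 subjects, 2% with the outcome;
# exposure is more common among cases (40%) than controls (10%).
N = 100_000
outcome = rng.random(N) < 0.02
exposure = np.where(outcome, rng.random(N) < 0.40, rng.random(N) < 0.10)

# Outcome-dependent sampling: keep every case but only 5% of controls.
p_select = np.where(outcome, 1.0, 0.05)
selected = rng.random(N) < p_select

# Unweighted sample estimate of exposure prevalence: biased upward,
# because exposure-rich cases are oversampled.
naive = exposure[selected].mean()

# Horvitz-Thompson-type estimate: weight each sampled subject by the
# inverse of its selection probability to recover the population value.
weights = 1.0 / p_select[selected]
weighted = np.average(exposure[selected], weights=weights)
```

Here `weighted` should be close to the true population prevalence (about 0.02 × 0.40 + 0.98 × 0.10 ≈ 0.106), while `naive` is substantially larger.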

Laboratory experiments generating high-dimensional assay data should adhere to the same best practices as traditional controlled experiments measuring only one or a few analytes, including randomization, replication, blocking, and quality monitoring. Arguably, careful design might be even more important in the setting of HDD generation because HDD assays may be especially sensitive to technical artifacts. Even when a study is primarily observational yet involves analysis of stored biospecimens using omics assays, good design principles should be followed when performing the assays. Best practices include randomizing biospecimens to assay batches to avoid confounding assay batch effects with other factors of interest. For unmatched case–control studies, balancing (randomizing) cases and controls into batches may provide important advantages for reducing the influence of batch effects [15]. For matched case–control studies or studies involving analysis of serial specimens from each subject, grouping matched or longitudinal sets within the same assay batch can be a convenient way to control for batch effects.
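The balanced-batch idea can be sketched as a small helper (hypothetical function, standard library only) that shuffles samples within each case/control group and then deals them round-robin across assay batches, a simple form of stratified allocation:

```python
import random

def assign_batches(sample_ids, labels, batch_size, seed=0):
    """Stratified randomization of biospecimens to assay batches:
    shuffle samples within each label group (e.g., case/control), then
    deal them round-robin across batches so each batch receives a
    similar case/control mix."""
    rng = random.Random(seed)
    n_batches = -(-len(sample_ids) // batch_size)  # ceiling division
    batches = [[] for _ in range(n_batches)]
    by_label = {}
    for sid, lab in zip(sample_ids, labels):
        by_label.setdefault(lab, []).append(sid)
    i = 0
    for group in by_label.values():
        rng.shuffle(group)  # random order within each stratum
        for sid in group:
            batches[i % n_batches].append(sid)
            i += 1
    return batches
```

With 10 cases and 10 controls in batches of 5, every batch receives 2 or 3 cases, so case status is not confounded with batch.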

Another fundamental aspect of design is sample size, i.e., the number of different subjects measured; measurements on different subjects are referred to as biological replicates. Whenever there is interest in making inference beyond an individual subject, e.g., assessing differential gene expression between groups of subjects with different phenotypes or exposed to different conditions such as treatments, biological replicates are required. In the HDD setting, standard sample size calculations generally do not apply. If statistical tests are performed one variable at a time (e.g., differential expression of each gene comparing two groups), then the number of tests performed for HDD is typically so large that a sample size calculation applying stringent multiplicity adjustment would lead to an enormous sample size. Alternative approaches to controlling false positive findings in HDD studies are discussed in section “TEST: Identification of informative variables and multiple testing.” If the goal is to develop a risk or prognostic model using HDD, typical recommendations about the number of events required per variable break down [16]. Other sample size methods that require assumptions about the model are challenging to implement given the complexity of models that might be used in HDD settings [17, 18], as discussed in section “PRED2.4: Sample size considerations.” In reality, HDD studies are often conducted with inadequate sample size, which is an important reason why many results are not reproducible and never advance to use in practice [19].

It is important to distinguish technical from biological replicates. Technical replication refers to repeating the measurement process on the same subject. It should not be confused with sample size. Technical replicates are useful for evaluating the variability in the measurement process, which may consist of multiple steps, each potentially contributing to the total error in the measurement. For example, [20] described the many steps in gene expression microarray analysis of mouse brains (their Fig. 1); technical replication could theoretically be carried out at any of those steps. Sometimes measurements are repeated using an alternative non-high-throughput measurement technique (e.g., RT-PCR assay to measure expression or Sanger sequencing of a specific gene) as a form of measurement validation, but this must not be confused with other forms of validation such as clinical validation of a prediction model (see section “PRED2: Assess performance and validate prediction models”). In the presence of budget constraints, if the goal is to compare different biological conditions, it is advisable to invest in biological replicates. When biological samples are inexpensive compared to the cost of the measurement process, pooling is sometimes recommended as a way to reduce costs by making fewer total measurements [21]. However, caution is advised, as assumptions may be required about assay limits of detection or the correspondence between physical pooling and additivity of measurements [22]. The context of any technical replication must be carefully described, along with any methods of summarizing over replicates, in order to interpret results appropriately.

Fig. 1

Correlogram of 12 male-specific genes expressed as log-counts-per-million from 69 lymphoblastoid cells derived from male (29) and female (40) Yoruba individuals. Variables (genes) are reordered to emphasize the similarity of their relations. Lower triangle: correlations shown by color and intensity of shading; upper triangle: correlations shown by circles filled proportionally to the correlation strength. Given the symmetrical nature of a correlogram, different representations are often used for the lower and the upper triangles. Source for the data: [27]

Design of a study should ideally be placed in the context of an overarching analysis plan. Each individual study should be designed to produce results of sufficient reliability to inform the next steps in the research project.


Structure of the paper

This paper is organized with respect to subtopics that are most relevant for the analysis of HDD, particularly motivated by typical aims of biomedical studies but also applicable more generally. These subtopics are initial data analysis (IDA and Preprocessing, section “IDA: Initial data analysis and preprocessing”), exploratory data analysis (EDA, section “EDA: Exploratory data analysis”), multiple testing (section “TEST: Identification of informative variables and multiple testing”), and prediction (section “PRED: Prediction”). For each subtopic, we discuss a list of main analytical goals. For each goal, basic explanations, at a minimally technical level, are provided for some commonly used analysis methods. Situations are identified where performance of some traditional, possibly more familiar, statistical methods might break down in the HDD setting or might not be possible to apply at all when p is larger than n. Strengths and limitations of competing approaches are discussed, and some of the gaps in the availability of adequate analytic tools are noted when relevant. Many key references are provided. It should be noted that throughout this paper we are concerned almost exclusively with cross-sectional or independent observations rather than longitudinal observations.

Topics in the paper are organized into sections according to the structure summarized in Table 1, followed by a discussion of the importance of good reporting to improve transparency and reproducible research in the “Discussion” section and a summarizing discussion in the “Conclusions” section.

Table 1 Overview of the structure of the paper, as a list of the sections with corresponding analytical goals and common approaches


IDA: Initial data analysis and preprocessing

Initial data analysis (IDA) is an important first step in every data analysis and can be particularly challenging in HDD settings. IDA is a term for all steps of data inspection and screening after the analysis plan and data collection have been finished but before the statistical analyses are performed [23, 24]. It focuses on understanding the context in which the data were collected, on data cleaning (see section “IDA1: Identify inconsistent, suspicious or unexpected values”), and on data screening (see section “IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition”). Data cleaning refers to identifying and possibly correcting errors. Data screening includes reviewing characteristics of the data that could affect the subsequent analysis plan, for example describing distributions of variables, checking assumptions required for model fitting and hypothesis testing, describing missing values, and identifying the need for adjustments of systematic effects due to data collection. Systematic effects may include batch effects that are caused, e.g., by different technologies used for collecting the data or even by different technicians performing laboratory experiments; see section “IDA3.2: Batch correction” for details. Further initial steps may include simplification of data, e.g., by excluding or collapsing variables, if deemed appropriate. Insights about the data gained from these screening steps might lead to refinement or updating of an analysis plan to ensure that the data are consistent with any assumptions or requirements of the proposed analysis strategies (see section “IDA4: Simplify data and refine/update analysis plan if required”). However, IDA should always be conducted independently of the analysis needed to address the research questions, in order to avoid biasing conclusions.

The term “data preprocessing” is often used in biomedical research involving analysis of HDD, especially in the omics field, to denote certain initial data cleaning and screening steps falling within the more general category of “initial data analysis.” Data preprocessing refers to the process of transforming “raw” data, obtained directly from the measurement instrument, into quantifications that are suitable for the subsequent statistical analysis. This includes detection and handling of incomplete, incorrect, or inaccurate values; application of normalization methods that aim to remove systematic biases (e.g., assay batch effects); and transformations of variables [25].

A first step of the data cleaning and screening process is often to standardize the names or terms of variables and observations, especially for omics data compiled using different technologies. This type of standardization helps facilitate other, more complex downstream analyses and interpretation of results, as well as better online dissemination and archiving of data.

The IDA material is organized for ease of discussion, but the IDA process is typically iterative. Preprocessing is discussed in section “IDA3: Preprocessing the data,” but after preprocessing one may need to go back to the data cleaning and screening steps described in sections “IDA1: Identify inconsistent, suspicious or unexpected values” and “IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition.” Note also that some model-based methods used for the identification of informative variables incorporate normalization into the data analysis model (see section “TEST: Identification of informative variables and multiple testing”).

IDA1: Identify inconsistent, suspicious or unexpected values

Identification and handling of incomplete, incorrect, or inaccurate values is logically a first step in IDA. Attention is directed toward distinguishing aberrant values that clearly originate from the data collection or generation process from those that might reflect true biological variability. Both visual and analytical inspections of the data are used for the detection of such values.

IDA1.1: Visual inspection of univariate and multivariate distributions

Graphical displays are helpful to both understand the structure of the data and detect potential anomalies. For HDD, it is rarely feasible to conduct a detailed examination of the distribution of every variable individually. Visual displays might be constructed only after variables of interest have been identified, for example because a gene is differentially expressed between two experimental conditions or because a particular variable is identified to have an unusual distribution by calculation of summary statistics or has an outlier. A practical alternative is to first calculate scores (summary statistics) for each variable or pair of variables, and then select both typical and interesting atypical variables, with respect to distributions of the scores, for more detailed inspection of their univariate or bivariate distributions. Types of scores to be used in these analyses should include those that capture specific features of the distributions, including measures of location, dispersion, skewness, kurtosis for univariate distributions, linear relationships for bivariate distributions, and metrics to detect outliers or influential values (Table 2).
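This score-then-inspect strategy can be sketched in a few lines (the function names and the MAD-based cutoff `k` are choices made here for illustration, not prescriptions from the paper): compute per-variable summary scores, then flag variables whose scores are far from the bulk for detailed inspection.

```python
import numpy as np

def variable_scores(X):
    """Summary scores for each variable of a samples x variables matrix:
    location, dispersion, skewness, and excess kurtosis."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    z = (X - mean) / sd
    return {"mean": mean, "sd": sd,
            "skew": (z ** 3).mean(axis=0),
            "kurtosis": (z ** 4).mean(axis=0) - 3.0}

def flag_atypical(scores, k=4.0):
    """Indices of variables whose score lies more than k median absolute
    deviations from the median score -- candidates for detailed
    univariate inspection rather than automatic exclusion."""
    flags = {}
    for name, s in scores.items():
        med = np.median(s)
        mad = np.median(np.abs(s - med))
        mad = mad if mad > 0 else 1.0  # guard against zero spread
        flags[name] = np.where(np.abs(s - med) > k * mad)[0]
    return flags
```

For example, appending one strongly skewed variable to a matrix of roughly symmetric ones makes it stand out in the `"skew"` flags.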

Table 2 Methods for visual inspection of univariate and multivariate distributions: Histograms, boxplots, scatterplots, correlograms, heatmaps
Fig. 2

Example of a heatmap, a data representation in the form of a map in which data values are color-coded. Here, sequencing data from the 1000 Genomes Project [30] are visualized. Rows correspond to samples and are ordered by processing date, and columns represent the genome location of the corresponding sequence. For the dates 243–254, orange color indicating high values is overrepresented compared to blue color indicating low values. This demonstrates that so-called batch effects are present, i.e., systematic biases in the data, which are discussed in detail in section “IDA3.2: Batch correction.” Source for the data: [29]

IDA2: Describe distributions of variables, and identify missing values and systematic effects due to data acquisition

IDA2.1: Descriptive statistics

For understanding the structure of data, often univariate measures for location and scale of the variables are informative. In the HDD setting, graphical display is often helpful to scan these measures across the large number of variables, both for detecting anomalies in the data and for a general exploration of variable distributions and their consistency with assumptions required for certain analysis methods. An example of the use of boxplots and of smooth histograms for exploratory purposes can be found in [31].

Standardization of data values is often performed prior to data analyses. Typically, this refers to normalization with respect to scale and location (e.g., subtract mean or median and divide by standard deviation). This can be helpful to give variables similar weight, especially if they are measured on different scales. However, standardization removes information about absolute magnitude of effects, so it should not be used when the actual magnitude of differences is of interest (e.g., differences in mean expression values between two groups). Another caution is that HDD will typically contain a certain number of variables that are uninformative because they do not vary much across observations, with variability essentially reflecting noise in the data. Standardization of such variables can exaggerate the noise to give these variables undue influence in analyses that is on par with that of truly informative variables. It is often preferred to drop such uninformative variables at the start of analyses (Table 3).
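These cautions can be encoded in a small sketch (hypothetical helper; the variance threshold `min_sd` must be chosen per application): near-constant variables are dropped before centering and scaling, rather than having their noise inflated to unit variance.

```python
import numpy as np

def drop_and_standardize(X, min_sd=1e-8):
    """Drop near-constant columns of X (samples x variables), then
    center and scale the remaining columns to mean 0 and sd 1.
    Returns the standardized matrix and a boolean mask of kept columns."""
    sd = X.std(axis=0, ddof=1)
    keep = sd > min_sd
    Xk = X[:, keep]
    Z = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0, ddof=1)
    return Z, keep
```

The returned mask records which variables were excluded, which should itself be reported as part of the IDA documentation.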

Table 3 Methods for descriptive statistics: Measures for location and scale, bivariate measures, RLE plots, MA plots
Fig. 3

Visualization of the insights obtained from an RLE plot, representing (a) log gene expression distributions for 27 samples (without quantile normalization) and (b) relative log gene expression distributions for the same 27 samples. The RLE plot highlights the unwanted variation due to between-batch variation (cyan versus magenta boxplots) as well as within-batch variation, as suggested by both the difference in location (median further from 0) and the spread (higher IQR) of the boxplots. This interpretation rests on the often-plausible assumption that expression levels of most genes are unaffected by the biological factors of interest. Source: [32]
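The RLE values underlying such a plot are simple to compute; a sketch (genes in rows, samples in columns; hypothetical helper name), relying on the same assumption that most genes are unaffected by the biology of interest:

```python
import numpy as np

def rle_summaries(log_expr):
    """Relative log expression: subtract each gene's median across
    samples (rows are genes, columns are samples). In the absence of
    technical variation, each sample's RLE distribution should be
    centered near 0 with small spread. Returns the RLE values plus
    per-sample median and IQR, the quantities drawn as boxplots."""
    rle = log_expr - np.median(log_expr, axis=1, keepdims=True)
    med = np.median(rle, axis=0)
    q1, q3 = np.percentile(rle, [25, 75], axis=0)
    return rle, med, q3 - q1
```

A sample carrying a systematic shift shows up as an RLE median far from 0, as in panel (b) of the figure.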

Fig. 4

Comparison of a scatterplot (left) and a Bland–Altman plot (right, also MA plot for omics data) of the same data. In this example, the predicted values of two regression models (including and excluding a variable called FLGROSS) are compared. The scatterplot shows similar values for most observations, with points close to the diagonal. The Bland–Altman plot, with differences on the y-axis (on log-scale for MA plots on omics data typically log-ratios), better visualizes the dependence on the average value of the predictions (typically average log intensity for MA plots). The smoothing line in the example Bland–Altman plot indicates the shape of dependence of the differences on the average values. Source: [35]
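The transformation behind an MA (Bland–Altman) plot is just a rotation of the log-scale scatterplot; a sketch (hypothetical helper; the pseudo-count is an assumption made here to keep zeros finite for count-like data):

```python
import numpy as np

def ma_values(x, y, pseudo=1.0):
    """MA / Bland-Altman coordinates for two intensity vectors:
    A = average log2 intensity, M = log2 ratio (difference of logs).
    Plotting M against A, with a smoother, reveals any
    intensity-dependent bias between the two measurement sets."""
    lx, ly = np.log2(x + pseudo), np.log2(y + pseudo)
    return (lx + ly) / 2.0, lx - ly  # A, M
```

For two vectors differing by a constant factor of 2, M is constant at 1 while A tracks the average intensity.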

IDA2.2: Tabulation of missing data

Missing values are ubiquitous in real-world data and may have major implications for choice of analysis methods and interpretation of results [36]. In fact, most multivariable and multivariate analysis methods require by default that values of all variables are available for all subjects, i.e., that all observations are “complete.” An important early step in any analysis is tabulation of the missing values, i.e., the identification of the number of missing values per subject and per variable, respectively, to provide an overview of the missing data structure. In multi-omics integrative studies, high-dimensional data from different data types are collected for the same subjects. In such studies, experimental and financial constraints can limit sample sizes, which may also differ between data types; the resulting missing data have to be taken into account in the subsequent statistical analysis.
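The tabulation step can be sketched as a one-pass summary over a subjects-by-variables matrix with NaN encoding missingness (hypothetical helper name):

```python
import numpy as np

def tabulate_missing(X):
    """Tabulate missing (NaN) entries of a subjects x variables matrix:
    counts per subject, counts per variable, and the number of complete
    cases (subjects with no missing values)."""
    miss = np.isnan(X)
    per_subject = miss.sum(axis=1)
    per_variable = miss.sum(axis=0)
    return per_subject, per_variable, int((per_subject == 0).sum())
```

The per-variable counts often suggest dropping a few badly measured variables, while the complete-case count shows how much data a naive complete-case analysis would discard.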

IDA2.3: Analysis of control values

Laboratory assay measurements can be affected by technical artifacts related to factors such as reagent lots, equipment drift, or environmental conditions. Sometimes these artifacts can be detected, and potentially adjusted for, through use of control and reference standard samples, which have expected distributions of measurements. For single-analyte assays, a calibration process is typically performed to adjust raw measurements and produce final reported values (Table 4).

Table 4 Method for analysis of control values: Calibration curve
Fig. 5

Visualization of calibration curves, representing the relationship between values of an analyte measured on a set of samples by some experimental assay (y-axis) and values obtained for those samples from some reference assay that is considered to represent truth and to be measured with negligible error (x-axis). The curve may be inverted to correct values obtained from the experimental assay to bring them closer to values of the analyte that would have been expected from the reference assay. Source: [37]
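The fit-then-invert step described in the caption can be sketched as follows (hypothetical helper; a straight-line fit and grid-based inversion are simplifications made here, real assays use the platform's validated calibration procedure):

```python
import numpy as np

def fit_calibration(reference, measured, deg=1):
    """Fit a calibration curve measured = f(reference) with a
    polynomial, and return a function inverting the curve on a grid,
    mapping new raw measurements back to the reference scale."""
    coeffs = np.polyfit(reference, measured, deg)

    def correct(raw):
        grid = np.linspace(reference.min(), reference.max(), 10_001)
        fitted = np.polyval(coeffs, grid)
        # Nearest-fitted-value inversion; adequate for monotone curves.
        idx = np.argmin(np.abs(fitted - np.asarray(raw)[..., None]), axis=-1)
        return grid[idx]

    return correct
```

For a perfectly linear assay with measured = 2 × reference + 1, a raw value of 5.0 is corrected back to 2.0 on the reference scale.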

IDA2.4: Graphical displays

Systematic artifact effects arising from data acquisition processes can often be detected with graphical displays that visualize the data in a comprehensive manner. A widely used graphical representation for multivariate data is a principal components plot, which is also useful in exploratory data analysis, as described in section “EDA: Exploratory data analysis” (Table 5).

Table 5 Methods for graphical displays: Principal component analysis (PCA), Biplot
Fig. 6

Principal component analysis plot depicting 62 lymphoma samples represented by their first and second principal component calculated from gene expression profiles comprising expression levels of 4026 genes on each lymphoma sample. The samples have been annotated in the plot according to pathologic subtype: 11 B-cell chronic lymphocytic leukemia (B-CLL; blue squares), 9 follicular lymphoma (FL; black triangles), and 42 diffuse large B-cell lymphoma (DLCL; red dots). Source: [39]
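PCA scores like those in the figure can be obtained directly from the singular value decomposition of the centered data matrix, which is convenient when p is much larger than n (sketch, hypothetical helper name):

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Principal component scores of X (samples x variables) via SVD of
    the column-centered matrix; avoids forming the p x p covariance
    matrix, so it remains feasible when p >> n."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]
```

When two groups of samples differ systematically across many variables, the first principal component typically separates them, as in the lymphoma example.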

Fig. 7

Biplot constructed by superimposing a PCA plot of 62 lymphoma samples (see Fig. 6) onto a PCA plot of genes where first and second principal component for the genes are calculated from gene expression profiles comprising expression levels of the 62 samples for each gene. Genes are represented in the plot as small green dots. Genes representing the three classes well are indicated by numbers. Source: [39]

IDA3: Preprocessing the data

Data generated by omics assay technologies typically require preprocessing by specially tailored methods that are based on understanding of the sophisticated instrumentation and scientific underpinnings of the technologies. Omics data are some of the most frequently encountered HDD in biomedical settings and are the focus in this paper. However, similar challenges exist with other types of HDD in biomedical research. Notably, high-dimensional imaging data are becoming commonplace, with examples including those generated by digital radiography, PET scans, and magnetic resonance imaging. In the following, we explain the main principles of data preprocessing using omics data examples.

Omics technologies are highly sensitive to experimental conditions and can exhibit systematic technical effects due to time, place, equipment, environmental conditions, reagent lots, operators, etc. In general, the first step of preprocessing aims to obtain an “analyzable” signal from the “raw” measurements. Subsequently, the signal is separated from possible systematic technical effects. The corrected signal may then be transformed to fulfill certain distributional properties, e.g., approximating a normal distribution. Note that sometimes the transformation may be applied before correcting the signal.

Preprocessing aimed at removal of systematic effects is often conducted as a separate step, as part of the IDA process, before the statistical analysis for answering the research question is undertaken. If the data have already been corrected for systematic effects and retain only the signals of interest (e.g., treatment effects), then the preprocessed (“normalized”) measurements for the biological samples can be analyzed using statistical methods that are easily accessible to researchers. However, conducting normalization as a separate step has important disadvantages. For instance, the normalized values are estimates and carry some uncertainty, which should be taken into account in the analysis of the normalized data; this, however, complicates the statistical analysis.

If inferential analysis is of interest, e.g., when comparing groups of samples to assess for biological differences, then a preferred approach is to consider normalization as part of a comprehensive statistical analysis model. The model is then used both to remove systematic technical differences and to quantify biological effects of interest (e.g., treatment effects). In that case, the uncertainty related to the normalization part of the analysis is naturally included in the estimates of uncertainty (standard errors) of the quantities of biological interest.

IDA3.1: Background subtraction and normalization

Omics data are prone to perturbations due to systematic effects induced by the measurement technology, also referred to as the assay platform. Many of these effects are unique to the assay platform, but there are some commonalities. A biological sample may have its gene expression profile measured using a single microarray or gene chip or its protein profile measured using a mass spectrometry system. The set of experimental conditions that gives rise to profiles such as these will be referred to here as an experimental run. However, even for the same sample, measurements obtained in different runs may differ due to factors such as different amounts of biological material input to the measurement system, settings on the instrumentation, environmental conditions in the laboratory, and so forth. These “between-run” differences may confound the “between-sample” comparisons of scientific interest. Thus, these nuisance run effects should be removed to allow valid comparisons among data obtained in different runs. A generic preprocessing step aimed at removing between-run differences is often termed normalization. Even before normalization methods are applied, data generated by omics technologies generally require correction to subtract background noise from measurements to reveal their signal components. In Table 6 we introduce some basic methods for background subtraction and normalization.

Table 6 Methods for background subtraction and normalization: Background correction, baseline correction, centering and scaling, quantile normalization
Fig. 8
figure 8

Boxplots representing artificial distributions of values for 30 samples (subjects), before quantile normalization (top) and after quantile normalization (bottom), showing that all distributions are fully aligned with each other after the transformation. Source: [44]
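The idea behind quantile normalization can be sketched in a few lines of Python. This is illustrative only: production analyses would use established implementations (e.g., from Bioconductor), and ties are broken arbitrarily here.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a (variables x samples) matrix.

    Each sample (column) is mapped onto a common reference
    distribution: the mean across samples of the sorted values.
    Ties are broken arbitrarily in this sketch.
    """
    ranks = X.argsort(axis=0).argsort(axis=0)    # rank of each value within its sample
    reference = np.sort(X, axis=0).mean(axis=1)  # mean of k-th smallest values across samples
    return reference[ranks]                      # map ranks back to reference values

# Toy example: three samples with different scales
X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
Xn = quantile_normalize(X)
# After normalization, every column contains an identical set of values,
# so all sample distributions are fully aligned, as in the boxplots above.
```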

IDA3.2: Batch correction

Another example of a systematic effect that is common to many technologies is a “batch effect.” The effect may arise when groups of biological samples (“batches”) have something in common in the way they are processed, e.g., they are processed on the same day or at the same time of day, on the same instrument, or by the same operators, while these aspects differ for other groups of samples. Besides these measurement conditions, factors at play prior to measurement can cause batch effects. For example, clinical centers might differ in their standard operating procedures for processing, handling, and storing biospecimens, giving rise to pre-analytic factors that could influence downstream measurements. Patient characteristics, co-morbidities, or concomitant medications could additionally vary by batch, and may give rise to different distributions of measured values that have biological basis. Batch effects are widespread [29]. The challenge for batch correction is removal of nuisance effects such as those due to pre-analytic or technical factors while not inadvertently removing true biological differences. To facilitate appropriate correction, batch information such as dates, instrument, operator, and specimen collection sites should be recorded and patient factors might need to be taken into account in analyses. Above all, it is critical to avoid poor study designs in which important patient characteristics (including outcomes) are confounded with nuisance batch effects, as this could make it impossible to remove nuisance batch effects adequately.

Preprocessing of omics data aimed at removal of the aforementioned artifact effects poses several challenges. For instance, normalization is often data-driven and uses methods based on assumptions about the nature of the biological mechanisms. If those assumptions do not hold, then the methods might not work as intended. An example of a commonly made assumption in experiments involving genome-wide expression data is that most genes are not differentially expressed under the compared conditions. It may be challenging to verify whether such assumptions are correct.

The dependence of systematic effects on the platform raises an important issue for novel technologies, for which sources of measurement variation may not be fully established or understood. Out of convenience, preprocessing approaches developed for one platform have often been applied to other platforms. For example, normalization methods developed for microarrays are also used for proteomic [45] and metabolomic [46] mass spectrometry experiments. This might be reasonable in some settings, but the assumptions required for adequate performance of a normalization method should always be reviewed carefully for appropriateness prior to its application to another technology.

In addition, it is worth noting that preprocessing may need to be tailored to the analysis goals. For instance, it is problematic to remove batch effects when constructing a classification rule. This is because the measurements for a new sample presented for classification will most likely include effects from batches not represented in the data used to construct the classification rule. Consequently, a classification rule should be constructed using data that have not been batch corrected so that robustness to batch effects can be built in (Table 7).

Table 7 Methods for batch correction: ComBat, SVA (surrogate variable analysis)
Fig. 9
figure 9

Visualization of the effect of batch correction. Heatmaps of hierarchical clustering of sponge metagenomics data studying two tissues types (C and E) with 2 batches, before and after Combat batch correction. Without batch correction (top figure), the clustering is mainly driven by the batch effect. After correction, the clustering is driven by the tissue type (bottom figure). Source: [49]
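The location adjustment underlying such corrections can be sketched as below. This is a deliberately simplified illustration, not ComBat itself (ComBat additionally models batch effects on the variance scale and shrinks batch parameters via empirical Bayes), and it blindly removes per-batch means, so any biology confounded with batch would be removed too.

```python
import numpy as np

def center_batches(X, batch):
    """Location-only batch adjustment (simplified sketch).

    X     : (samples x variables) matrix
    batch : length-n array of batch labels

    Each batch is shifted so that its per-variable mean equals
    the grand mean over all samples.
    """
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    X_adj = X.copy()
    for b in np.unique(batch):
        idx = (batch == b)
        X_adj[idx] -= X[idx].mean(axis=0) - grand_mean
    return X_adj

# Simulated example: batch "B" carries an additive technical shift
rng = np.random.default_rng(0)
batch = np.repeat(["A", "B"], 10)
X = rng.normal(size=(20, 5))
X[batch == "B"] += 3.0
X_adj = center_batches(X, batch)
```

After adjustment the per-variable means of the two batches coincide; note that if, say, all treated samples had been run in batch "B" (a confounded design, as warned against above), this step would also have removed the treatment effect.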

IDA4: Simplify data and refine/update analysis plan if required

The findings from the IDA steps can have substantial impact on the choice of appropriate analytical methods for subsequent statistical analyses. Therefore, the analysis plan should be refined or updated as necessary and according to the relevant findings from the IDA analysis [23].

IDA4.1: Recoding

Recoding primarily refers to transformations of the (original, raw) data, which allow for easier handling for a specific purpose. This is particularly useful in HDD settings, in which simple representation of the information can be challenging and sometimes even impossible due to the large number of variables (Table 8).

Table 8 Method for recoding: Collapsing categories

IDA4.2: Variable filtering and exclusion of uninformative variables

Variable filtering refers to the exclusion of variables that are considered uninteresting, before the statistical analysis to address the main research question is even started. This practice is widespread in HDD analysis where any steps to reduce the dimensionality and complexity of models at the outset are appreciated. If many irrelevant variables are filtered out, the multiple testing problem (see section “TEST: Identification of informative variables and multiple testing”) is diminished, and the statistical power of subsequent analysis steps can substantially increase. However, as discussed below, caution is required when applying certain filtering strategies that may introduce bias (Table 9).

Table 9 Methods for filtering and exclusion of variables: Variable filtering

IDA4.3: Construction of new variables

Sometimes it is useful to construct new variables as an initial step of data analysis by combining the variables that are available in the dataset in a meaningful way, using expert knowledge. For example, in medical studies investigating factors affecting health, overweight status is often an important variable to consider in the analysis. Because weight and height must be considered together in assessing whether an individual is overweight, constructed variables like body mass index (BMI) have been used. The importance of fat distribution has also been recognized, and it has motivated the combined measure waist-hip ratio (WHR). Instead of relying on the ability of statistical methods and algorithms to construct such variables implicitly, e.g., during a statistical modelling process, it is useful to be informed by expert knowledge and to include these constructed variables directly into analyses.
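As a minimal illustration (all values and column names hypothetical), such expert-defined variables can be computed directly from the raw measurements before any modelling:

```python
import pandas as pd

# Hypothetical patient table; column names are illustrative only.
df = pd.DataFrame({
    "weight_kg": [70.0, 95.0],
    "height_m":  [1.75, 1.80],
    "waist_cm":  [80.0, 104.0],
    "hip_cm":    [95.0, 100.0],
})

# Expert-knowledge derived variables added before modelling:
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2   # weight (kg) / height (m) squared
df["whr"] = df["waist_cm"] / df["hip_cm"]           # waist-hip ratio
```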

Not all constructed variables are derived using expert knowledge. Some, like principal component scores (see section “IDA2.4: Graphical displays”), are constructed in an unsupervised manner meaning that they are constructed to capture features of the data based only on the explanatory variables without using dependent variables such as outcomes. These constructed variables are sometimes used as explanatory variables when building prediction models (see section “PRED: Prediction”), and they can also be used for exploratory data analysis (see section “EDA: Exploratory data analysis”). As discussed in section “IDA2.4: Graphical displays,” plots of (typically the first two) principal components are often helpful for detecting peculiarities in the data or problems such as batch effects. Some constructed variables are derived using outcomes or other dependent variables. Examples of outcome-informed constructed variables include supervised principal component [55], or partial least squares (PLS) scores (see section “PRED1.3: Dimension reduction” for further discussion). Sometimes new variables are constructed by discretization of continuous variables, but this practice is problematic and should generally be discouraged (Table 10).

Table 10 Method for construction of new variables: Discretizing continuous variables

IDA4.4: Removal of variables or observations due to missing values

The simplest approach to dealing with missing data is a “complete case analysis.” That is, if a single variable is missing for an observation, the observation is fully excluded from the dataset. Basing analyses only on complete cases leads, at best, to a loss of statistical power, and at worst to substantially biased analyses. The impact of missing data depends on how many cases have missing data, how many variables have missing values, how many values are missing, and whether the likelihood of missing values in a variable is related to the value of that variable or of other variables. When few observations have missing values for few variables, the impact on the results of subsequent analyses may be limited, but when the number is large, the impact can be substantial.

A typical strategy for dealing with missing data is to exclude variables from the analysis that have a large number of missing values. Obviously, the possible relevance of such variables is neglected. Only when the missingness (the events that lead to a value being missing) is independent of both unobserved and observed values, i.e., the data are missing “completely at random” (MCAR), are the results of the complete case analysis (using likelihood-based methods) unbiased. When missing values depend on the unobserved values themselves (e.g., it is more likely that the measurement of a variable is missing when the value of the biomarker is very high or very low), then the missing values are said to be “missing not at random” (MNAR), and the resulting complete case analysis is biased.

Between the two extreme situations of MCAR and MNAR, there is a third possibility: missing values are called “missing at random” (MAR), when the missingness is independent of the unobserved values after controlling for the other available variables. One way to diagnose whether data are MCAR or MAR is to tabulate a missing value indicator against the values of other variables. As an example, if the value of a biomarker (e.g., gene expression level) is missing with higher frequency in males than in females, but within these strata, the missing values are missing completely at random, then it is likely a situation of MAR and not MCAR.
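The tabulation described above can be sketched as follows, using simulated data in which missingness depends on sex but not on the biomarker value itself (variable names and missingness rates are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
sex = rng.choice(["female", "male"], size=n)
biomarker = rng.normal(size=n)
# Simulated MAR mechanism: missingness depends on sex,
# but not on the (unobserved) biomarker value itself.
miss_prob = np.where(sex == "male", 0.30, 0.05)
biomarker[rng.random(n) < miss_prob] = np.nan

df = pd.DataFrame({"sex": sex, "biomarker": biomarker})
# Tabulate a missingness indicator against another variable:
table = pd.crosstab(df["sex"], df["biomarker"].isna(), normalize="index")
print(table)
```

The proportion of missing biomarker values is markedly higher among males, the kind of pattern that argues against MCAR and suggests MAR (or worse).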

In HDD settings, when a large number of variables must be considered, complete case analysis may require exclusion of too many observations. To avoid this, common approaches involve first removing variables for which more than a specified percentage (e.g., 5 or 10%) of observations are missing and then removing observations for which more than a specified percentage (e.g., 2%) of variables have missing values. For studies with more complex designs, additional considerations may apply. For example, it is common in case–control studies to remove variables for which there is larger imbalance (e.g., more than 5 or 10% difference) in the percentage of missing values between cases and controls.
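The two-step removal strategy just described can be sketched as follows (the thresholds are the illustrative values mentioned above):

```python
import numpy as np
import pandas as pd

def filter_missing(df, max_var_missing=0.10, max_obs_missing=0.02):
    """Two-step missingness filter: first drop variables (columns)
    missing in more than `max_var_missing` of observations, then
    drop observations (rows) missing in more than `max_obs_missing`
    of the remaining variables. Thresholds are illustrative."""
    keep_cols = df.columns[df.isna().mean(axis=0) <= max_var_missing]
    out = df[keep_cols]
    keep_rows = out.isna().mean(axis=1) <= max_obs_missing
    return out.loc[keep_rows]

# Toy example: v1 and v3 are each 20% missing and are dropped;
# no missing values remain, so all rows are kept.
df = pd.DataFrame({
    "v1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "v2": [1.0, 2.0, 3.0, 4.0, 5.0],
    "v3": [1.0, np.nan, 3.0, 4.0, 5.0],
})
filtered = filter_missing(df)
```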

IDA4.5: Imputation

For MAR situations, methods more sophisticated than complete case analysis or dropping variables are recommended, as they use the information from all observations in the study and can yield less biased results. An example is multiple imputation, which is described below. Although imputation is a useful strategy, it should be understood that no single approach for dealing with missing data is fully satisfactory. Thus, the best approach is to carefully select variables that are both informative and feasible to collect when designing studies and then work diligently to collect those data as completely as possible in order to minimize the amount of missing information. In the context of LDD, a framework for the treatment and reporting of missing data was proposed [58].

For HDD, performing a simple multivariable regression in high dimensions is typically not feasible. Therefore, most procedures for handling missing data in the HDD setting either involve a phase in which only those variables deemed important are selected for imputation, or use some form of regularized regression [59] instead of standard multivariable regression. The handling of missing data in HDD settings is an active topic of research. Many tailor-made imputation algorithms have already been developed; for an early overview in the context of gene expression measurements, see [60] (Table 11).

Table 11 Method for imputation of missing data: Multiple imputation
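As a minimal illustration of a tailor-made approach for omics-like data, the sketch below applies k-nearest-neighbour imputation, in the spirit of early methods developed for gene expression measurements. Note that this is single, not multiple, imputation; multiple imputation would repeat such a step several times while reflecting the uncertainty of the imputed values.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))        # complete (simulated) data
mask = rng.random(X.shape) < 0.05    # knock out ~5% of entries
X_missing = X.copy()
X_missing[mask] = np.nan

# Each missing entry is replaced using the k rows most similar
# on the jointly observed variables; observed values are kept as-is.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_missing)
```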

EDA: Exploratory data analysis

When performing statistical analyses, it is important to distinguish between exploratory data analysis (EDA) and confirmatory data analysis, as this has important consequences both for the selection of appropriate analytical methods and for the correct interpretation of the results. The starting point for confirmatory analysis is a hypothesis to be evaluated, whereas, in EDA the goal is to provide an unbiased view of the data. Insights from EDA may then lead to development of new hypotheses that can be evaluated in subsequent confirmatory analyses on independent data.

Caution is necessary when performing statistical inference (e.g., feature selection as described in section “TEST: Identification of informative variables and multiple testing”) or model performance assessment following EDA when decisions to remove or modify observations from the analysis might depend on the observed relationships one is trying to confirm. For example, if outlier observations are removed from a dataset, the performance of a prediction model built only on the remaining observations is most probably an overly optimistic estimate of what the model performance would be on an independent dataset, which might contain different outliers.

Two major analytical goals for EDA are (1) to identify interesting data characteristics such as variables with extreme values, associations between variables, or representative subjects with usual values of variables, and (2) to gain insight into the structure of the data. Note that many of the methods used in EDA are also applied in IDA (like PCA; see section “IDA2.4: Graphical displays”). In this section, we focus on methods that are more specific to EDA. Note that many of the methods described in this section are generally designed and suitable for continuous data; only some can also be applied for discrete data.

EDA1: Identify interesting data characteristics

EDA can assist a researcher to identify interesting data characteristics that may lead to generation of specific scientific hypotheses that can be more fully evaluated in subsequent studies. Through EDA, a researcher might identify variables exhibiting extreme values or study subjects (observations) having extreme values of one or more variables or unusual combinations of values of two or more variables. EDA might also reveal intriguing associations between variables (e.g., levels of a certain protein tend to differ between two phenotypic classes). The two main classes of exploratory methods for identifying such interesting data characteristics are graphical displays and inspection of descriptive univariate and multivariate summary statistics. Graphical displays are discussed in sections “IDA2.1: Descriptive statistics,” “IDA2.4: Graphical displays,” and “EDA1.1: Graphical displays,” whereas descriptive statistics were already described in section “IDA2.1: Descriptive statistics” as tools for the initial data analysis (IDA). It should be noted that due to the potential for identification of many false positive signals in the HDD setting, findings from large-scale comparisons of descriptive summary statistics are often tempered by application of multiple testing methods as described later in section “TEST: Identification of informative variables and multiple testing,” even though the original intent was exploratory analysis.

To identify interesting data characteristics in low-dimensional data via visual or graphical methods, it is usually possible to inspect simple summary statistics and graphical displays of distributions of variables one, two, or three at a time, but for HDD this approach quickly becomes infeasible. For instance, the number of scatterplots for all pairs of p variables is p(p − 1)/2, which already exceeds 1000 when p exceeds 45. Visual identification of interesting characteristics of HDD typically requires specialized graphical displays or reduction of data dimensionality.

EDA1.1: Graphical displays

As mentioned in section “IDA2.4: Graphical displays,” one can use principal components (PCs) for exploratory analysis by first summarizing the information included in all variables through calculation of PC scores (which are linear combinations of the original variables) and then plotting in two or three dimensions the first several PC scores that capture the majority of variability in the data. This may allow identification of clusters of observations or individual observations with unusual configurations of variables warranting further inspection.
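This use of PC scores can be sketched on simulated data (the group structure and dimensions are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Simulated HDD: 60 samples, 500 variables, with a shift in the
# first 50 variables for the first 30 samples (two hidden groups)
X = rng.normal(size=(60, 500))
X[:30, :50] += 2.0

pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # PC scores: linear combinations of the variables
# scores[:, 0] vs scores[:, 1] can now be scatter-plotted, e.g.:
# plt.scatter(scores[:, 0], scores[:, 1])
```

In this example the hidden group structure separates clearly along the first PC, the kind of pattern that would prompt further inspection in a real analysis.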

Another goal for HDD visualization is to produce a display in lower dimensions that preserves the distances (more generally degrees of “dissimilarity”) between observations such that the closest points remain the closest and the furthest remain the furthest. Alternative data reduction techniques have been developed to achieve this goal. These methods aim to translate the data in such a way that dissimilarities among points in the lower-dimensional space are as proportional as possible to those quantified in the original (high-dimensional) space. One such technique, multidimensional scaling, is described below. A variation of multidimensional scaling not discussed here is correspondence analysis, which is suitable for categorical variables and shows the relationships between variables based on data specified in a contingency table. Cox and Cox [62] provide descriptions of both multidimensional scaling and correspondence analysis (Table 12).
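A minimal sketch of metric multidimensional scaling with scikit-learn (dissimilarities are Euclidean distances computed from the data; all settings are illustrative):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 100))   # 40 observations in 100 dimensions

# Metric MDS: find 2-D coordinates whose pairwise distances
# approximate the high-dimensional Euclidean distances.
mds = MDS(n_components=2, random_state=0)
coords = mds.fit_transform(X)    # (40, 2) coordinates for plotting
```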

Table 12 Methods for graphical displays: Multidimensional scaling, t-SNE, UMAP, neural networks
Fig. 10
figure 10

Two-dimensional visualization of a high-dimensional dataset using t-SNE. The dataset consists of 2700 single cells (peripheral blood mononuclear cells) that were sequenced on an Illumina NextSeq 500. The dataset is freely available from 10X Genomics. Points are colored by cell type. The plot shows that the cell types are locally well separated. Source: [71]

EDA2: Gain insight into the data structure

A global depiction of data to identify structure, including patterns or motifs, is another major goal of exploratory data analysis for HDD. Here, data structure is understood in a general sense: it refers to the many aspects of the data that concern the arrangement or interrelation of the observations or variables of a dataset. Although a natural first step is to look at marginal distributions (e.g., univariate and bivariate) of all variables across observations, this approach is generally not feasible for HDD for the reasons discussed above. Further, some structure may involve many different variables and not be discernible by examination of univariate, bivariate, or even trivariate distributions.

The data visualization techniques described in section “EDA1.1: Graphical displays” are often supplemented with additional approaches geared toward detection of certain kinds of structure, for example clusters. The goal of cluster analysis is to identify subgroups of observations or variables that are similar to each other, but different from others. Identification of prototypical observations to characterize each cluster might be of interest. The structure might also be multi-level. In this section, we focus on techniques that are useful to uncover structure that might be missed by examining only marginal distributions or low-dimensional representations of HDD.

EDA2.1: Cluster analysis

The goal of a cluster analysis is to assemble objects (observations or variables) into subgroups, termed clusters, such that similarities between members within the clusters are high (or, equivalently, distances are small), compared to similarities between members from different clusters. Sometimes, the goal is only to find dense, i.e., heavily populated, regions in the data space that correspond to modes of the data distribution. Alternatively, there may be interest in fully characterizing the structure. Cluster analyses typically require choice of a similarity metric (or, alternatively, distance metric) for pairs of objects (sometimes also for pairs of clusters), a clustering algorithm, and a criterion to determine the number of clusters. Some clustering approaches that have been successfully used for low-dimensional data, e.g., mixtures of low-dimensional parametric probability distributions such as multivariate normal mixtures, either cannot be applied at all or perform very poorly in the HDD setting. Approaches not suitable for HDD are not further discussed here.

For comparing similarity of objects (either variables or observations), the Pearson correlation coefficient or Euclidean distance are the most popular metrics. The Pearson correlation does not depend on the scale of the variables, but the Euclidean distance does. If each of the variables characterizing an object is first standardized across the set of objects (subtract mean and divide by standard deviation), then use of Pearson correlation and Euclidean distance metrics will produce equivalent results. The measure should be chosen deliberately. If only relative levels of the values are important, then Pearson correlation is suitable, but if absolute values matter, then Euclidean distance is appropriate. It is important to note that both metrics tend to be more heavily influenced by a few large differences or deviations than by a series of small ones, because deviations enter the calculations quadratically. An important modification of the Pearson correlation is the Spearman (rank) correlation, where values of observations are first replaced by their corresponding ranks before calculating the Pearson correlation. With this adjustment, the results are less heavily influenced by extreme data values.
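One way to see the connection between the two metrics: if each object's profile is z-scored (mean 0, unit variance across its p entries), then the squared Euclidean distance is an exact decreasing linear function of the Pearson correlation, d² = 2p(1 − r), so both metrics order pairs of objects identically. A numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 100
x = rng.normal(size=p)
y = rng.normal(size=p)

def zscore(v):
    # standardize to mean 0, (population) standard deviation 1
    return (v - v.mean()) / v.std()

xs, ys = zscore(x), zscore(y)
r = np.corrcoef(xs, ys)[0, 1]     # Pearson correlation
d2 = np.sum((xs - ys) ** 2)       # squared Euclidean distance

# For z-scored profiles: d2 == 2 * p * (1 - r), so a larger
# correlation corresponds exactly to a smaller distance.
```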

In high-dimensional spaces, data are typically quite sparse. This means that distances between objects become large, a phenomenon often referred to as the curse of dimensionality. Therefore, the distance metrics may be prone to exaggeration by a few distant objects. Strategies to help avoid this problem include use of data reduction or variable selection before clustering (see section “IDA2.4: Graphical displays” for graphical displays for dimension reduction and section “PRED1.2: Variable selection” for variable selection and dimension reduction in the context of improving prediction models).

Clustering algorithms can be divided into hierarchical and partitioning methods. In hierarchical clustering, observations are iteratively grouped together into larger clusters (agglomerative hierarchical clustering) or clusters are subdivided into smaller clusters (divisive hierarchical clustering). So-called partitioning algorithms, which are centroid-based, aggregate the observations around specific points (the centroids) such that observations assigned to the same centroid are as similar as possible, and observations assigned to different centroids are as different as possible. Hierarchical clustering algorithms provide a clustering for any number of clusters, whereas partitioning methods require an initial choice of the number of clusters present in the data. The most popular clustering algorithms are described in Table 13.

Table 13 Methods for cluster analysis: Hierarchical clustering, k-means, PAM
Fig. 11
figure 11

Hierarchical clustering result displayed in a dendrogram, where heights in the tree at which the clusters are merged correspond to the between-cluster distances. Source: [73]

Fig. 12
figure 12

Visualization of the k-means algorithm with an example. Iteratively, observations are assigned to the cluster for which the squared Euclidean distance from the observation to the cluster centroid is minimized, and cluster centroids are computed based on the current cluster memberships. The iterative process continues until no observations are reassigned (as in the case of the last iteration in the figure). Source: [75]
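The two alternating steps in the figure can be sketched directly in NumPy. This is a bare-bones illustration, without the empty-cluster handling or multiple random restarts that production implementations use.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch: assign each observation to the
    nearest centroid (squared Euclidean distance), then recompute
    centroids, until assignments no longer change."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # no change: converged
        centroids = new_centroids
    return labels, centroids

# Two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(25, 2)),
               rng.normal(3, 0.3, size=(25, 2))])
labels, centroids = kmeans(X, k=2)
```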

Other methods for cluster analysis applied to biomedical data include fuzzy clustering and SOMs (self-organizing maps). In fuzzy clustering, objects can belong to multiple clusters. In SOMs (a type of neural network first introduced by Kohonen [77]), a meaningful topology (spatial relationships) between the cluster prototypes is assumed. This means that the clusters can be visualized as a two-dimensional “map,” so that observations in proximate clusters have more similar values than observations in clusters that are more distant. Since the assumptions for SOMs are not guaranteed to hold, the interpretation can easily be misleading, so SOMs should only be used by experts in this field. In addition, SOMs can be very sensitive to starting node configurations.

For HDD, the computer runtime of such partitioning algorithms can present a challenge. For example, PAM cannot be applied if the number of objects to be clustered is very large, e.g., for clustering variables in omics data or for clustering observations in large health records data. This challenge motivated development of the algorithm CLARA (Clustering Large Applications) [78], which works on subsamples of the data. Distribution-based clustering methods provide another alternative, where probabilistic distributions for the observations within the clusters are assumed (e.g., multivariate Gaussian in each cluster, but with different means and potentially different variances). Parameters of the mixture distribution are typically estimated with EM-type (expectation–maximization) iterative algorithms [79]. However, particularly (though not exclusively) for HDD, the distributional assumptions are often difficult to verify and the algorithms may not converge to a suitable solution. Therefore, clusters might not be identified at all, or the results could be misleading due to incorrect assumptions about the data distributions.

Results produced by clustering algorithms are difficult to evaluate and often require subjective judgement. The validity of the results depends on the notion of a cluster, which varies between clustering algorithms, and this ambiguity carries through to estimation of the number of clusters (Table 14).

Table 14 Methods for estimation of the number of clusters: Scree plots, silhouette values
Fig. 13
figure 13

Example of a scree plot, which involves plotting some measure of within-cluster variation (here the total within sum of squares) on the y-axis and the number of clusters assumed in applying the algorithm on the x-axis. Source: [80]
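The y-axis of such a scree plot can be computed, for example, with scikit-learn's k-means (data and settings are illustrative); one then looks for the "elbow" where the curve flattens:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs in two dimensions
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0.0, 4.0, 8.0)])

# Total within-cluster sum of squares (inertia) for k = 1..6,
# i.e., the quantity plotted on the y-axis of a scree plot
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 7)]
```

For these data the curve drops sharply up to k = 3 (the true number of blobs) and flattens afterwards.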

Fig. 14
figure 14

Silhouette values for observations that are grouped into four clusters. Observations are sorted along the x-axis by decreasing silhouette value, grouped by the four clusters. The silhouette values for the observations of the first two clusters have very low values, indicating two not well-separated clusters. Source: [82]
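Silhouette values can be computed, for example, with scikit-learn; the simulated data below are well separated, so values come out close to 1 (in contrast to the poorly separated clusters in the figure):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Two clearly separated blobs
X = np.vstack([rng.normal(0, 0.4, size=(30, 2)),
               rng.normal(5, 0.4, size=(30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
s = silhouette_samples(X, labels)   # one value in [-1, 1] per observation
print(silhouette_score(X, labels))  # mean silhouette over all observations
```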

Some clustering methods have been specifically developed to handle the large storage requirements and long run times typical of HDD settings. For example, CAST (Cluster Affinity Search Technique) [83] is especially useful for large numbers of observations or variables. Clusters are constructed iteratively as follows. Select at random an observation not already assigned to a cluster and assign it to a newly defined cluster. Then repeat the following two steps until the set of observations assigned to this new cluster no longer changes: add unassigned observations whose average similarity to the current cluster members exceeds a predefined threshold, and remove observations whose average similarity falls below this threshold.
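The CAST heuristic just described can be sketched as follows; this is a simplified illustration operating on a precomputed similarity matrix, and the published algorithm includes further refinements:

```python
import numpy as np

def cast(similarity, threshold, seed=0):
    """Simplified CAST sketch: grow one cluster at a time by adding
    unassigned objects whose average similarity to the current
    cluster exceeds `threshold`, and removing members whose average
    similarity falls below it. `similarity` is symmetric (n x n)."""
    rng = np.random.default_rng(seed)
    n = len(similarity)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        start = rng.choice(sorted(unassigned))   # random unassigned seed object
        cluster = {int(start)}
        unassigned.discard(int(start))
        changed = True
        while changed:
            changed = False
            members = sorted(cluster)
            for i in sorted(unassigned):         # add step
                if similarity[i, members].mean() >= threshold:
                    cluster.add(i)
                    unassigned.discard(i)
                    members = sorted(cluster)
                    changed = True
            for i in list(cluster):              # remove step
                others = [j for j in cluster if j != i]
                if others and similarity[i, others].mean() < threshold:
                    cluster.discard(i)
                    unassigned.add(i)
                    changed = True
        clusters.append(sorted(cluster))
    return clusters

# Block similarity matrix with two obvious groups of five objects
n = 10
S = np.full((n, n), 0.1)
S[:5, :5] = 0.9
S[5:, 5:] = 0.9
np.fill_diagonal(S, 1.0)
clusters = cast(S, threshold=0.5)
```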

Another method is subspace clustering [84], where first subsets of variables (called subspaces) are identified and clusters are determined by defining regions of values based only on these variables. Then, iteratively, lower-dimensional subspaces are combined into higher-dimensional ones. In biclustering (or two-way clustering), first introduced by Hartigan [85], variables and observations are selected simultaneously to generate clusters that do not depend on all variables at the same time. Again, heuristic and stable algorithms are required to find approximate solutions in acceptable time (see, e.g., [86]).

Many traditional clustering methods are best suited for continuous variables, but there are several examples of HDD that are not continuous. One example is count data such as that generated by RNA-Seq. Some examples of clustering methods that have been specifically developed for count data include those of Witten [87] and Si et al. [88], which are based on Poisson or negative binomial distributions. Cluster analysis based on deep learning has also been proposed [89]. That approach trains a deep neural network, extracts the resulting hidden variables, and uses them as the basis for clustering using standard methods like k-means.

EDA2.2: Prototypical samples

Often it is useful to construct prototypical observations that represent subgroups of observations. Prototypical observations are, for example, identified by some clustering algorithms. The motivation is to allow visualization or provide a summary of relevant characteristics of subgroups of observations. These summaries can be interpreted in the biomedical context, for example as a description of the characteristics of a typical patient who responds well to a particular therapy. Prototypical samples can be selected as central observations in their respective subgroups, or they can be newly constructed. When applying a k-means algorithm to separate observations into K clusters, centroids of each cluster are natural choices for prototypes. Similar to the principles of many cluster analysis approaches (see section “EDA2.1: Cluster analysis”), the construction of prototypical observations is done such that they are simultaneously as similar as possible to the observations of the same subgroup (cluster) and as different as possible from the observations of the other subgroups. Bien and Tibshirani [90] provide a nice overview of available methods, although their review is limited to classification problems. Prototypical observations can also be used to represent classes and then to predict the class of a new observation based on the similarities with these prototypical samples (see also section “PRED: Prediction”).

TEST: Identification of informative variables and multiple testing

In HDD analysis, one is often interested in identifying, among a large number of candidate variables, “informative variables.” These are associated with an outcome or with a set of other phenotype variables that characterize the study subjects. For example, one might wish to characterize which single-nucleotide polymorphisms are more often present in patients who experience severe side effects from a particular drug compared to patients without severe side effects. In drug sensitivity screens performed on bacterial cultures, one might aim to identify bacterial genes with expression significantly associated with degree of sensitivity to a new antibiotic. When comparing individuals with a particular disease to healthy volunteers, one might wish to identify circulating proteins that are present in different abundance. In all these cases, evaluation of the associations might be accomplished by conducting many statistical hypothesis tests, one per candidate variable. This represents a multiple testing situation.

Multiple testing scenarios commonly encountered in biomedical studies with HDD are divided here into three categories, each of which considers every candidate variable individually and performs a similar evaluation or statistical test for each:

(i) Identification of variables among a set of candidates that are associated with a single outcome or phenotype variable, i.e., related to outcome or phenotype classes (categorical) or correlated with a continuous phenotype variable or time-to-event outcome.

(ii) Identification of candidate variables with a trajectory over time that is affected by experimental factors or exhibits a prescribed pattern.

(iii) Identification of candidate variables that are associated with a prespecified set of other variables, i.e., where the candidate variables are considered as dependent variables and the set of prespecified variables as independent “predictor” variables.

To illustrate the concepts, much of the discussion here will focus on a simple example of scenario (i) in which two classes are compared with respect to a very large number of variables. Methods discussed for scenario (i) that extend straightforwardly to scenarios (ii) and (iii) are noted.

Scientific goals may go beyond simply providing a list of individual variables exhibiting associations with an outcome, a phenotype, a collection of prespecified variables, or patterns over time. Frequently, there is interest in more globally characterizing the variables that were included in the identified list. For example, genes are organized into interconnected biological pathways. Expression of two different genes might exhibit similar associations because they are both regulated by certain other genes, because one lies downstream of the other in the same biological pathway, or because their products serve similar biological functions. Established organizational structures might be described by gene taxonomies such as Gene Ontology [91], KEGG [92], or BioCarta [93]. Gene set enrichment analysis (see section “TEST3: Identify informative groups of variables”) refers to approaches that exploit these expected associations. They were first proposed in the omics field for use with HDD gene expression data. Although these enrichment analysis strategies could be applied in a variety of HDD settings, subsequent discussion of these methods will be based on examples with high-dimensional gene expression data for which the concept of enrichment is intuitively clear.

TEST1: Identify variables informative for an outcome

TEST1.1: Test statistics: Hypothesis testing for a single variable

Before discussing multiple testing procedures, it is helpful to briefly review basic concepts in statistical hypothesis testing for a single variable. A hypothesis test aims to decide whether the data support or refute a stated “null hypothesis.” Typical examples of simple null hypotheses are that the distribution of a variable does not differ between two or more groups, or that a variable is not associated with another variable. A hypothesis test is based on a statistic that reflects the strength of evidence for or against the null hypothesis. Knowing the distribution of the test statistic (e.g., normal distribution or binomial distribution) allows one to construct a hypothesis test based on that statistic for which the probability of drawing an incorrect conclusion is controlled. A type I error is erroneously rejecting the null hypothesis when it is actually true. A type II error is failing to reject the null hypothesis when it is actually false. Statistical power is defined as one minus the probability of a type II error. In general, one wants to control the probability of a type I error, denoted α, at a small value while maintaining acceptably high power (i.e., a low probability of a type II error). A conventional choice of α for the single-variable setting is 0.05, which means that the probability of a false positive decision, i.e., falsely rejecting the null hypothesis when it is true, is 0.05.

Hypothesis testing is often operationalized by calculating a p-value from the observed data: the probability, computed under the assumption that the null hypothesis is true, of observing a value of the test statistic at least as extreme as the one actually observed. (Note the correct definition of a p-value stated here, in contrast to the common misinterpretation of a p-value as the probability that H0 is true.) A significance test is performed by comparing the computed p-value to the prespecified α level. When the p-value is less than or equal to α (e.g., 0.05 in the conventional setting), the null hypothesis is rejected; otherwise, it cannot be rejected.
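To make these definitions concrete, the following sketch computes a Welch two-sample t statistic and a two-sided p-value. It uses the large-sample normal approximation to the null distribution (via `math.erfc`) rather than exact t-distribution tail areas, and the data are hypothetical; an actual analysis would use an established statistical library.

```python
import math

def welch_t(x, y):
    """Welch two-sample t statistic (does not assume equal variances)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

def two_sided_p(t):
    """Two-sided p-value from the large-sample normal approximation to the
    null distribution of t; exact t-distribution tail areas need scipy."""
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical measurements for two groups
t = welch_t([5.1, 4.8, 5.3, 5.0], [4.2, 4.0, 4.4, 4.1])
p = two_sided_p(t)  # small p: evidence against equal group means
```

With such tiny samples the normal approximation is optimistic, which illustrates the caution about small samples discussed below.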

It should be mentioned that sometimes the goal of a scientific study is to estimate certain parameters of interest, for example means or correlations, rather than to test hypotheses. In estimation settings, it is generally desired to provide intervals of uncertainty, such as confidence intervals, to accompany parameter estimates. Although errors in hypothesis testing have some relation to confidence interval coverage probabilities, most of the multiple testing procedures discussed in this section are not readily applicable to multiple estimation. Multiple estimation procedures are beyond the scope of the present discussion.

The t-test is an example of a widely used statistical test for a single variable. It is the basis for the modelling approaches described below that extend hypothesis testing to multiple variables. Extensions developed specifically for HDD include limma, edgeR, and DESeq2, as discussed in section “TEST1.2: Modelling approaches: Hypothesis testing for multiple variables.”

Calculation of a p-value usually requires assumptions about the distribution of the test statistic. Sometimes that distribution can be derived from assumptions about the distributions of the variables. For example, the statistic of the t-test can be shown to have a t-distribution when the variables are normally distributed and the within-group variances are the same for the classes being compared. Similar requirements hold for F-tests in analysis of variance and statistics associated with standard linear regression analysis. Although one can never be certain that these assumptions hold for real data, many test statistics can be shown by theoretical arguments to have an approximately normal distribution when the sample size is sufficiently large (referred to as an “asymptotic” approximation). An example of an asymptotic property is that a t-statistic has an approximately normal distribution for large sample sizes, even if the data are not normally distributed. Nonetheless, extra caution is necessary in the setting of HDD, where the requirements for a sample size to qualify as “large” are far greater. The extremes of a test statistic’s distribution are particularly prone to departures from distributional assumptions, and this is exactly where accuracy is needed most when calculating the very small p-values on which many multiple testing procedures for HDD rely.

When the validity of assumptions required for familiar statistical tests is uncertain, for example that the data follow a normal distribution for the t-test or F-test, alternative tests broadly referred to as nonparametric tests may be preferable. The Wilcoxon rank sum test (equivalent to the Mann–Whitney U test) and the signed rank test are nonparametric alternatives to the two-sample t-test and paired t-test, respectively; the Kruskal–Wallis test is an alternative to the F-test in one-way ANOVA. These nonparametric tests are robust to outliers and do not require the data to be normally distributed, nor that their distribution be fully characterized by two parameters in the way that a mean and variance characterize a normal distribution. Many nonparametric tests are based on the ranks of the observed data rather than their actual values. Permutation tests, as described in Table 15 and below, comprise another class of nonparametric tests and are more generally applicable than rank-based tests.
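As a minimal illustration of a rank-based test, the sketch below implements the Wilcoxon rank-sum test with its large-sample normal approximation, assuming no ties in the pooled data; the values are hypothetical, and a production analysis would use an established implementation with tie and continuity corrections.

```python
import math

def rank_sum_p(x, y):
    """Wilcoxon rank-sum test, two-sided, via the large-sample normal
    approximation to the rank-sum statistic; assumes no ties in the data."""
    pooled = sorted(x + y)
    w = sum(pooled.index(v) + 1 for v in x)  # sum of ranks of group x
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical expression values: the two groups are completely separated
p = rank_sum_p([1.2, 1.5, 1.9], [3.1, 3.4, 3.8])
```

Because only the ranks enter the statistic, replacing 3.8 with 380 would leave the p-value unchanged, illustrating the robustness to outliers noted above.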

Table 15 Methods for hypothesis testing for a single variable: t-test, permutation test

A word of caution is in order: correct permutation of the data is critical to the validity of a permutation test. The permutations must preserve any structure in the data that is unrelated to the null hypothesis. For instance, if the goal is to test whether the mean of a variable differs between groups, but the variances are thought to differ, then the simple permutation test described for the two-group comparison is not appropriate, because the permutations will change the variances as well as the means. If the groups are paired, e.g., variables are measured both before and after each subject receives an experimental drug, then the permutations must preserve that pairing by randomly “flipping” the before and after measurements within patients. Correct permutation might not be easy, or even feasible, for regression models with multiple predictors. For example, naively permuting the outcomes in a logistic or Cox regression model with many predictors to provide test statistics for individual predictor variables (adjusted for the other variables) would not yield valid permutation p-values, because the correlation structure of the data, e.g., correlations of the outcome with other variables that are not the focus of the test, would not be preserved. Anderson and Legendre [94] discuss the appropriateness and performance of various permutation testing strategies in the context of testing partial regression coefficients in multivariable regression models.
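The paired example above can be sketched as a sign-flipping permutation test: under the null hypothesis, the before and after labels are exchangeable within each subject, so the permutation flips the sign of each within-subject difference rather than reshuffling values across subjects. The data below are hypothetical.

```python
import random

def paired_permutation_p(before, after, n_perm=10_000, seed=0):
    """Permutation test for paired data. Under H0, 'before' and 'after' are
    exchangeable within each subject, so the permutations flip the sign of
    each within-subject difference rather than mixing values across subjects."""
    diffs = [a - b for b, a in zip(before, after)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = [d * rng.choice((-1, 1)) for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Hypothetical before/after measurements for 8 patients
before = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9, 4.2, 4.6]
after = [b + 0.9 for b in before]  # a clear, uniform treatment effect
p = paired_permutation_p(before, after)
```

Naively shuffling all 16 values across subjects would destroy the pairing and inflate the null variability, exactly the pitfall described above.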

Nonparametric methods have advantages and disadvantages. In the context of statistical tests, their main advantages are their applicability in situations where little is known about the likely distribution of the data and their robustness to oddities in the data such as outliers. Their main disadvantage is reduced statistical power, particularly for small sample sizes, compared to a parametric test whose distributional assumptions are actually met. In HDD settings, parametric tests have additional appeal, when reasonably justified, because they allow “borrowing information” across variables by modelling relationships of parameters (e.g., means or variances) across variable-specific distributions; modelling approaches such as those discussed in section “TEST1.2: Modelling approaches: Hypothesis testing for multiple variables” can greatly increase statistical power for testing multiple hypotheses.

TEST1.2: Modelling approaches: Hypothesis testing for multiple variables

In the scenarios (i)-(iii) described in the introduction of section “TEST: Identification of informative variables and multiple testing”, the number of statistical analyses performed is equal to the number of variables. For omics data, the number of variables is often in the range of tens of thousands or even millions. Direct application of standard hypothesis testing approaches to each variable in the setting of HDD is problematic. As an illustration, consider conducting several thousand statistical tests (one per candidate variable), each using the classical α level of 0.05 to test for significance of an association between a single variable and an outcome or phenotype of interest. If the truth were that none of the candidate variables had an association with the outcome or phenotype of interest, then, on average, testing 20,000 variables would lead to 1000 false positive test results (0.05 times the 20,000 variables tested), clearly an unacceptably large number that would limit interpretability of the results. Control of the number of false positives, often termed “false discoveries” in the setting of HDD, is critical.
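The arithmetic above can be checked with a small simulation, using the fact that under the global null hypothesis p-values are uniformly distributed on [0, 1], so they can be drawn directly instead of simulating raw data and tests. Numbers and seed are arbitrary.

```python
import random

# Simulate the global-null scenario above: m = 20,000 candidate variables,
# none truly associated, each tested at alpha = 0.05.
rng = random.Random(42)
m, alpha = 20_000, 0.05
p_values = [rng.random() for _ in range(m)]
false_positives = sum(p <= alpha for p in p_values)
# Expectation: alpha * m = 1000 false positives, with a binomial standard
# deviation of about sqrt(m * alpha * (1 - alpha)) ~ 31 around that value.
```

Any single run will land close to 1000 false discoveries, none of which correspond to real associations.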

Several challenges are encountered in multiple testing for HDD omics data. One is that in order to control false positives when a very large number of statistical tests are performed, small α levels must be used, which limits statistical power. Another challenge is the mathematical difficulty of dealing with joint distributions of certain variable types such as counts, which are commonly generated by newer omics technologies such as RNA-Seq. Furthermore, sample sizes are often insufficient to rely on classical statistical asymptotic (large sample size) theory to provide tractable approximate distributions of test statistics required to appropriately control type I and II errors. Finally, the classical approach of limiting false positives by controlling the overall probability of any false positive findings is overly stringent when extremely large numbers of tests are performed. These challenges have spawned a wealth of innovative statistical approaches for multiple testing with HDD, which are described in the sections that follow.

The earliest technologies for high-dimensional gene expression analysis based on microarray platforms quantified gene expression by fluorescence intensities. After logarithmic transformation, these continuous intensity values are typically well approximated by a normal distribution. Many of the early methods developed for statistical analysis of microarray data relied on normally distributed data, the simplest example being use of t-tests to identify lists of differentially expressed genes with varying degrees of type I error control. Sample size in these early studies was usually relatively small, making it difficult to adequately control false discoveries and still maintain sufficient statistical power. Some of these methods were ad hoc or limited to simple experimental settings such as two-group comparisons, but advances in statistical methodology led to improved approaches for the analysis of HDD gene expression data (Table 16).

Table 16 Methods for hypothesis testing for multiple variables in HDD: Limma, edgeR, DESeq2

Sometimes a researcher is interested in identifying genes whose expression is not different between conditions, the opposite of the more typical goal of identifying differentially expressed genes. This requires reversing the usual roles of the null and alternative hypotheses. However, since it is impossible to statistically rule out very tiny effects, the null hypothesis tested for each gene is that its effect is larger than some user-specified minimum size. When implementing this procedure to identify genes with negligible effect, mean parameter shrinkage functions must be turned off.

TEST2: Multiple testing

Methods described in the previous section provide useful approaches to improve statistical power for testing individual variables (genes) and to appropriately model commonly encountered omics data. However, a final step is required to control false positives in HDD settings. Several multiple testing correction methods and their utility for HDD are discussed in this section.

TEST2.1: Control for false discoveries: Classical multiple testing corrections

A simple table illustrates the types of errors that can be encountered in multiple testing [100]. When testing m hypotheses, each is either true or false, and either rejected or not rejected, yielding four possibilities, which are displayed in Table 17 along with the numbers of hypotheses falling into each category.

Table 17 Contingency table describing outcomes of testing multiple null hypotheses

In Table 17, m represents the number of tests conducted; R represents the number of rejected hypotheses; V represents the number of tests for which type I errors were committed, i.e., the number of false positives; and U represents the number of tests that correctly rejected the null hypothesis, i.e., the number of true positives. Further, m0 represents the total number of true null hypotheses; m1 the total number of false null hypotheses; and m1 − U the number of tests for which type II errors were committed. The goal of a multiple testing procedure is to control V while not too severely limiting U. If R = 0, then no type I error can be committed. If m0 = m, then any rejection constitutes a type I error and represents a false positive result.

Classical multiple testing corrections that aim to control false discoveries by using more stringent (smaller) “critical” levels for significance testing may work well in situations with a few dozen tests or less. However, they can be problematic for HDD because they may be too stringent and severely limit statistical power for detecting associations that truly exist, particularly when sample sizes are not large.

The simplest approach to controlling false discoveries is the classical Bonferroni correction, in which the critical level is divided by the number of tests performed (see Table 18). The Bonferroni correction is very stringent for two main reasons. First, it is designed to control the familywise error rate (FWER), i.e., to globally control the probability that any of the tests results in a false discovery. In terms of the notation in Table 17, controlling the FWER at level α means requiring P(V > 0) ≤ α. Despite its conservativeness, Bonferroni adjustment has become the standard approach in genome-wide association studies for controlling the genome-wide significance level, enforcing stringent control on the probability that any of the hundreds of thousands of genomic variants typically studied is falsely identified as associated with the phenotype of interest. Second, a simple Bonferroni correction is conservative in that it does not leverage information about potential correlations between the test statistics, nor does it account for the ordering of the p-values when applying the significance threshold. When evaluating p-values in order from smallest to largest, it is natural to require smaller critical levels for declaring significance earlier in the list. These limitations of the Bonferroni correction have motivated the development of modified approaches that are less stringent, as discussed next.

Table 18 Methods for multiple testing corrections: Bonferroni correction, Holm’s procedure, Westfall-Young permutation procedure

Adjusted versions of the Bonferroni correction that take the ordering of the p-values into account have been proposed. Some, such as those of Hochberg [101] and Hommel [102], require assumptions about the joint distribution of the p-values, such as the nature of their correlations, and are not discussed here. The approach proposed by Holm [103], however, provides a simple improvement on the Bonferroni method that allows critical values for significance testing to depend on the ordering of the p-values while, like Bonferroni, requiring no assumptions about the joint distribution of the p-values. Holm’s approach is described in Table 18.
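Bonferroni’s fixed threshold and Holm’s ordering-dependent thresholds can be sketched as follows; the p-values are hypothetical and chosen so that Holm rejects more hypotheses at the same α.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: reject H0_i whenever p_i <= alpha / m (fixed threshold)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) to alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] > alpha / (m - k):
            break
        reject[i] = True
    return reject

# Four hypothetical p-values: Bonferroni (threshold 0.05/4 = 0.0125) rejects
# only the first; Holm also rejects the second (0.014 <= 0.05/3) and the
# third (0.020 <= 0.05/2), then stops at 0.100 > 0.05.
p = [0.001, 0.014, 0.020, 0.100]
```

Both procedures control the FWER without assumptions on the joint distribution of the p-values, but Holm is uniformly at least as powerful.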

Several other methods of controlling the FWER have been proposed that require additional assumptions about the nature of correlations between test statistics or might only control false positives under a global null in which all hypotheses are null. Such tests are not guaranteed to always control the FWER when these assumptions do not hold and will not be discussed further here.

An appealing aspect of multiple testing procedures that control FWER is that one can make statements about the probability that an individual test falsely rejects the null hypothesis. Because the probability that any test among a collection of tests falsely rejects must be at least as large as the probability that a single randomly chosen test falsely rejects, control of FWER at level α automatically guarantees control of the type I error at level α for each individual test.

An important caveat about any multiple testing correction method that is based on p-values is that it relies on the validity of the p-values or the validity of the corresponding test procedures. As noted in the discussion of test statistics above in section “TEST1: Identify variables informative for an outcome,” ensuring sufficient accuracy of p-values based on specific (parametric) distributions can be challenging in HDD settings. Permutation tests can provide distribution-free options for multiple testing in some situations. They also offer the flexibility to handle HDD with variables of different types, e.g., variables could be a mix of categorical, count, or continuous data. However, permutation tests can be problematic for multiple testing in HDD settings as well, as it can be very computationally intensive to accurately compute p-values that might be very small.

Multivariate permutation tests are permutation tests applied to many hypotheses simultaneously, for example to compare the distributions of many omics variables between two phenotype classes. For each hypothesis, a test statistic is calculated. As in the univariate case, class labels are randomly reassigned to the observations (keeping the full profile of measurements intact for each observation), and a p-value for each variable is then computed as the proportion of permutations for which the corresponding test statistic is at least as extreme as the test statistic calculated from the original data. The popular Westfall-Young permutation procedure, as an example, is described in Table 18. Multiple testing procedures can be applied to the collection of permutation p-values to control false discoveries just as if the p-values had been computed assuming parametric distributions for the variables.
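A simplified, single-step version of the maxT idea behind the Westfall-Young procedure can be sketched as follows, using absolute differences in class means as the per-variable statistic (the actual procedure is step-down and typically uses t-type statistics); the data, group sizes, and seed are hypothetical.

```python
import random

def mean_diffs(data, labels):
    """Per-variable absolute difference in class means; each row of data is
    one observation's full profile, labels are the 0/1 class assignments."""
    g0 = [row for row, lab in zip(data, labels) if lab == 0]
    g1 = [row for row, lab in zip(data, labels) if lab == 1]
    n_var = len(data[0])
    return [abs(sum(r[j] for r in g0) / len(g0)
                - sum(r[j] for r in g1) / len(g1)) for j in range(n_var)]

def max_t_adjusted_p(data, labels, n_perm=2000, seed=1):
    """Single-step maxT adjusted p-values in the spirit of Westfall-Young:
    permute class labels (keeping each profile intact), record the maximum
    statistic over all variables, and compare each observed statistic to
    that null distribution of maxima."""
    observed = mean_diffs(data, labels)
    rng = random.Random(seed)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        max_stat = max(mean_diffs(data, perm))
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]

# Hypothetical 10 x 2 data set: variable 0 separates the classes perfectly,
# variable 1 carries no signal at all.
data = [[0.0, 0.0]] * 5 + [[10.0, 0.0]] * 5
labels = [0] * 5 + [1] * 5
adj_p = max_t_adjusted_p(data, labels)
```

Because the maximum is taken over all variables in each permutation, the resulting adjusted p-values account for the correlation structure of the data automatically.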

TEST2.2: Control for false discoveries: Methods motivated by HDD

Various multiple testing correction methods have been developed that are more appropriate for HDD than the classical Bonferroni-type methods. Usually, these approaches aim for false discovery control that is less stringent than familywise error control, such as limiting the percentage of false discoveries (rather than aiming to avoid any false discoveries) in exchange for greater power to detect true discoveries. Many multiple testing methods for HDD are combined with methods such as those discussed in section “TEST1.2: Modelling approaches: Hypothesis testing for multiple variables” that borrow information across variables (or tests) or that exploit correlations between candidate variables to increase statistical power. The growing volume of HDD has stimulated the development of a variety of innovative multiple testing procedures better suited to these data than traditional approaches.

To describe the various multiple testing approaches for HDD and the false discovery criteria that they control, it is helpful to focus again on one of the most frequent goals in omics data analysis, which is the identification of differentially expressed genes between two or more classes or conditions. The notation used in this section follows that defined in Table 17.

Aiming to control type I error in terms of the FWER through application of classical Bonferroni-type methods becomes extremely challenging with increasing dimension of HDD due to low statistical power, as already discussed. These challenges motivated consideration of alternatives to classical control of type I error, most commonly control of the false discovery rate (FDR). The FDR is, in essence, the expected proportion of false discoveries among the rejected tests and is described in more detail below. The methods differ in the type of error they aim to control but share some operational aspects. Once the acceptable magnitude of error (e.g., FDR) has been specified, the (raw, uncorrected) p-values are calculated, and the variables are usually ranked by their associated p-values. Those with p-values below a certain threshold are included in the list of positive findings (rejecting their associated null hypotheses). This threshold can be fixed for all p-values, or it may depend on the rank of the p-value. Equivalently, the p-values can be adjusted and then compared to the desired level of error control. Several methods for FDR control exist, differing in how they define the adjustment applied to the p-values and the threshold to which those p-values are compared.

As is common in statistics, some methods require additional assumptions, and the claimed properties are valid only when those assumptions are met. In multiple testing, an important distinction is between methods that achieve weak control and those that achieve strong control. Weak control means that the method achieves the stated error control only under the global null hypothesis, i.e., when all null hypotheses are true. In contrast, strong control means that the method achieves the stated control no matter how many of the null hypotheses are true or false. Only methods that provide strong (general) control are discussed here. In multiple testing, it is also common to encounter assumptions about the dependence among variables or p-values; the assumption of independence among variables is unrealistic for omics data, where variables are often positively correlated.

In the following, we first define metrics to quantify false positives and then briefly present some of the methods that have been proposed to control them, focusing only on the essential concepts. We point the more technical reader to comprehensive reviews of multiple testing methods by Dudoit et al. [105] and, more recently, by Goeman and Solari [106]. A practical introduction providing illustrative examples with implementation in the R language is available in the book by Bretz et al. [107].

The FDR is a popular extension of the concept of type I error for HDD. Using the notation of Table 17, the FDR is the expected (average) value of Q, i.e., FDR = E(Q), where Q = V/R if R > 0 and Q = 0 if R = 0 [107]. Q is sometimes also called the FDP (false discovery proportion). Since the case R = 0 is very uncommon in practical HDD applications, the FDR can roughly be thought of as the proportion of false positives among declared positives (i.e., among rejected tests). Controlling the FDR is less stringent than controlling the FWER, as FDR control inherently allows some false positives. The goal of FDR control is to identify as many positive test results as possible while accepting a relatively low proportion of false discoveries. In practice, common choices for the FDR control level are 5% or 10%.

The Benjamini–Hochberg procedure [108] is the most widely used method for controlling the FDR. It is described in Table 19. Notably, the adjusted threshold value used by the Benjamini–Hochberg method is identical to that used by the Bonferroni and Holm methods for the variable with the smallest p-value, but it is much larger for the others. Lists of discoveries generated by procedures that control the FDR are generally much longer than those generated by methods that control the FWER at the same level. Yet, like the Bonferroni method, the original FDR procedure is conservative, effectively controlling the FDR at level α·m0/m ≤ α when the variables are independent. Many methods have been proposed to improve power by estimating the unknown proportion of true null hypotheses (m0/m) from the data and using it to adapt the threshold value (see [100]). The original FDR procedure [108], which was proposed for independent variables but proven valid under a form of positive dependence among the p-values, was extended by Benjamini and Yekutieli [109] to handle more general dependencies; this more general procedure has lower thresholds and is more conservative. Several other methods have been proposed to control the FDR, and some closely related error rates have been defined [110]. Figure 15 illustrates how the Bonferroni and Benjamini–Hochberg corrections work.
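The step-up rule of the Benjamini–Hochberg procedure can be sketched as follows; the p-values are hypothetical and chosen so that the procedure rejects all four hypotheses even though only one passes the Bonferroni threshold.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest k with
    p_(k) <= k * q / m and reject the k hypotheses with smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if p_values[i] <= k * q / m:
            k_max = k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Four hypothetical p-values: the Bonferroni threshold 0.05/4 = 0.0125
# rejects only the first, but because even the largest p-value satisfies
# 0.020 <= 4 * 0.05 / 4, the step-up procedure rejects all four.
p = [0.010, 0.012, 0.014, 0.020]
```

Note the step-up character: once the largest qualifying k is found, all smaller p-values are rejected too, even those that fail their own rank-specific threshold.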

Table 19 Methods for multiple testing corrections controlling the FDR: Benjamini-Hochberg, q-values
Fig. 15
figure 15

Graphical illustration of how the Bonferroni and Benjamini–Hochberg corrections work, for an example with 7129 tests and 0.05 as the desired significance level in each case. Applying Bonferroni, only tests with p-values smaller than 0.05 / 7129 (represented by a dotted line) provide evidence against the null hypothesis. For Benjamini–Hochberg, the significant genes are those whose tests yield p-values smaller than the largest p-value under the threshold, circled in green in the figure. The threshold is represented by the dashed line, which has intercept 0 and slope 0.05 / 7129, where now 0.05 is the desired level of FDR control

Many extensions and modifications of the FDR have been proposed. The most common criticism of the FDR is that it controls only the average proportion of false positives, which may be quite variable: in practice, the actual proportion Q of false positives in an analysis might differ substantially from the targeted FDR threshold, and FDR methods do not provide an estimate of this variability. Readers are referred to Goeman and Solari [106] for discussion of methods that aim to control the false discovery proportion with a specified confidence. Other methods have been proposed to control the local FDR, a concept that allows a more powerful interpretation of q-values at the level of a single hypothesis rather than as a property of a list of variables [111]. In practice, FDR-controlling approaches are among the most widely used methods for multiple testing with omics data, despite some recognized limitations.

TEST2.3: Sample size considerations

Determining an appropriate sample size for a study that will involve an extremely large number of statistical tests is very challenging. Sample size methods must be tailored to the chosen approach and criteria for error control. Both false positive (type I) and false negative (type II) errors need to be considered. Early in the emergence of omics data, sample size methods focused on FDR control [112], particularly for microarray technology [113]. A recent review focusing mainly on sequencing experiments also provides useful guidance [114].

TEST3: Identify informative groups of variables

The multiple testing problem is less severe when interest shifts to groups of variables instead of single variables, as described in the introduction of section “TEST: Identification of informative variables and multiple testing” as example (iii) of the main scenarios. In most cases, the groups are prespecified (e.g., genes belonging to the same biological pathway, genes with the same molecular function, or mutations on the same arm of a chromosome). A variable can belong to more than one group, and the variables belonging to the same group are often positively correlated. This type of analysis has the potential for greater statistical power and greater between-study reproducibility than a variable-by-variable analysis.

The methods tailored to the analysis of groups of variables can be divided into two broad classes [115, 116]. The first class comprises competitive methods, which attempt to identify variable groups that have a stronger association with the outcome (or phenotype) than the other groups. The second class comprises self-contained methods, which try to identify variable groups containing at least one variable that is associated with the outcome. Example approaches are described below. The popular gene set enrichment analysis (GSEA) and over-representation analysis (ORA) are mixed approaches, while topGO is a competitive method and the global test a self-contained method. In all cases, FWER or FDR can be controlled using any of the methods already described in section “TEST2: Multiple testing”. When applying multiple tests to groups of variables, multiplicity refers to the multiplicity of these groups, not of individual variables. To allow examination of data from a single patient or a small number of samples, methods have also been developed that score individual samples based on gene sets. Singscore [117] is one such approach: a rank-based single-sample method that generates scores that are stable across a range of sample sizes (Table 20).
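As one concrete group-level test, over-representation analysis in its simplest form reduces to a hypergeometric (Fisher-type) tail probability, sketched below with hypothetical toy counts; real analyses operate on much larger gene universes and then correct for the multiplicity of the gene sets tested, as described above.

```python
from math import comb

def ora_p(n_universe, n_in_set, n_selected, n_overlap):
    """Over-representation analysis p-value: the hypergeometric upper-tail
    probability P(X >= n_overlap) of seeing at least n_overlap set members
    among n_selected genes drawn without replacement from the universe."""
    total = comb(n_universe, n_selected)
    upper = min(n_in_set, n_selected)
    return sum(comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
               for k in range(n_overlap, upper + 1)) / total

# Hypothetical toy numbers: a 20-gene universe, a 5-gene pathway, 8 genes
# flagged as significant, 4 of which fall in the pathway.
p = ora_p(n_universe=20, n_in_set=5, n_selected=8, n_overlap=4)
```

A caveat noted in the literature is that ORA conditions on a hard significance cutoff for the gene list, whereas GSEA-type methods use the full ranking of all genes.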

Table 20 Methods for multiple testing for groups of variables: Gene set enrichment analysis (GSEA), Over-representation analysis, global test, topGO
Fig. 16
figure 16

Subgraph of the Gene Ontology (GO) induced by the top 5 GO terms identified by topGO (elim algorithm) for scoring GO terms for enrichment. Rectangles indicate the 5 most significant terms. Rectangle color represents the relative significance, ranging from dark red (most significant) to bright yellow (least significant). The top GO terms are spread across different areas of the GO graph, representing rather different biological processes. Source: [123]

PRED: Prediction

It is often of interest to build a prediction model that takes so-called “predictor variables” (sometimes also referred to as “independent variables”) as input and returns a prediction for a target variable of interest (sometimes also referred to as “dependent variable”) as output. This target variable, which refers either to the present state of the patient or to the future, may be a (binary or multi-categorical) class membership (e.g., treatment responder versus non-responder), a continuous variable (e.g., blood pressure or tumor size after therapy), an ordinal variable (e.g., WHO tumor grade), or a time-to-event (e.g., the overall survival time). Statistically more challenging cases of target variables that are not discussed in this paper are zero-inflated variables (typically continuous with additional frequent 0 values), continuous bounded variables (e.g., with values in [0,1]), or time-to-event variables in the presence of competing risks.

In the HDD setting, the number of candidate variables available to build the prediction model may be very large. This property has implications for construction of prediction models (section “PRED1: Construct prediction models”) and assessment and validation of their performance (section “PRED2: Assess performance and validate prediction models”). Detailed guidance for training, testing, and validation of HDD prediction models is provided by the IOM (Institute of Medicine of the National Academy of Sciences, USA) within a report that identifies best practices for the development, evaluation, and translation of omics-based tests into clinical practice [124]. However, that report does not contain detailed guidance on statistical approaches for construction of prediction models and assessment of their performance. In addition, statistical methodology has seen substantial developments during the last decade.

Many methods to assess model performance and validate prediction models have been developed for low-dimensional data and then adapted to HDD, so a good starting reference is the explanation and elaboration paper of the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) reporting guideline [125]. This section explains, expands, and elaborates on existing guidance to more comprehensively cover issues in prediction modelling with HDD.

PRED1: Construct prediction models

Researchers developing a prediction model primarily focus on how well the model predicts the outcome of interest, especially for new observations, e.g., for patients whose data were not used to build the prediction model. While this is the main concern, often the researchers are also interested in the interpretation of the model, for example identifying which variables contribute most to the prediction and in what way. From this perspective, models involving only a limited number of predictor variables (denoted as “sparse models”), which clearly distinguish informative variables from non-informative variables, may be preferred to models making use of all variables measured for all observations. This is a particularly big challenge in the HDD setting, where many candidate variables are available. Beyond the issue of interpretability, sparse models may be easier to apply in clinical practice, because fewer variables have to be measured or determined to use them than for non-sparse models. In the case of gene expression, for example, the measurement of, say, 10 genes can be easily performed in any lab using PCR techniques, while the measurement of genome-wide expression requires the use of high-throughput methods (see, e.g., [126]).

A model is said to be “complex” if it reflects many patterns present in the available data, for example, by considering many predictor variables or capturing non-linear effects. Overly complex models risk overfitting the data, i.e., they adhere too closely to the data at hand and capture spurious patterns randomly present in the data used for model development that will not be present in independent data (see, e.g., [127]). An overfitted model usually exhibits suboptimal prediction performance when subjected to appropriate unbiased evaluation methods, and interpreting such models can be misleading. In contrast, a model that is not complex enough underfits the data: it misses important patterns that might have been useful for the purpose of prediction. When fitting prediction models, in particular (but not only) in the HDD setting, the challenge is thus to identify the optimal level of model complexity that will yield interpretable models with good prediction performance on independent data (see, e.g., [128, 129]).

The most straightforward statistical approach to construct a prediction model using several predictor variables simultaneously while taking into account their correlation is fitting a multivariable (generalized) regression model, for example a simple linear regression model in the case of an approximately normally distributed target variable. In linear regression, the regression coefficients are fitted such that the sum (for the n observations) of squared errors (i.e., of squared differences between the true value of the target variable and the predicted value) is minimal. Mathematically, this basic linear regression amounts to solving a system of n equations with p + 1 unknowns, where p stands for the number of predictor variables. Such a regression model, however, cannot be fitted uniquely if the number p + 1 of coefficients to fit (the intercept and one coefficient for each variable) exceeds the number of observations n. This dimension problem is complicated by the frequently occurring situation in which some of the p variables are highly correlated, i.e., they provide similar information. These correlations can cause instability with regard to which variables are deemed important contributors to the model and, thus, can influence model interpretability and performance.
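
The dimension problem can be seen directly in code: when p + 1 > n, the least-squares system is rank deficient, so infinitely many coefficient vectors reproduce the training data perfectly. A small NumPy sketch with simulated data (our own illustration, not part of the original analyses):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # more coefficients (p + 1) than observations (n)
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = rng.standard_normal(n)

# lstsq returns *a* (minimum-norm) solution; the system is rank deficient,
# so infinitely many coefficient vectors fit the training data perfectly
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

`np.linalg.lstsq` silently returns the minimum-norm solution among them; the perfect in-sample fit is an artifact of the dimension problem, not evidence of predictive ability.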

Because the number of predictor variables p is usually larger than the number of patients n in HDD settings, basic regression models cannot be fitted directly. In this section, we briefly review some key strategies to deal with the dimension problem: variable selection, dimension reduction, statistical modelling (mainly through regularization methods), and algorithmic approaches (at the interface between statistics and machine learning). First, however, we discuss a preliminary step, variable transformation, that can be particularly helpful in the context of HDD analyses.

PRED1.1: Variable transformations

As mentioned in section “IDA3: Preprocessing the data,” data may be transformed to obtain certain distributional properties required for the methods that might be used in preprocessing or in downstream analyses of the preprocessed data. For example, (approximate) normal distributions for errors are a prerequisite for the application of tests such as the t-test or methods based on linear models such as ANOVA and linear regression [130]. Transformations may also be helpful to dampen the influence of peculiar or extreme observations and may put variables on scales that are more amenable to analysis. For example, one could transform a bounded variable to an unbounded range or convert multiplicative effects to additive effects. It is often preferable to apply suitable transformations first and then work with transformed variables (Table 21).

Table 21 Methods for variable transformations: Log-transform, standardization

Note that centering and scaling were discussed in section “IDA3.1: Background subtraction and normalization” (referring to normalization), but there the transformation was applied to all values of an observation (a subject, e.g., a patient) to adjust for potential systematic effects and make different observations more comparable, whereas here the transformation is related to all values of a variable.
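
The distinction between per-variable and per-observation transformations can be sketched as follows (simulated data; the choice of a log2 transform is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# toy expression matrix: rows = observations (e.g., patients), columns = variables (genes)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(8, 5))

X_log = np.log2(X + 1)                       # log-transform: multiplicative -> additive effects

# per-variable standardization (column-wise): each variable gets mean 0, sd 1
X_var = (X_log - X_log.mean(axis=0)) / X_log.std(axis=0)

# per-observation centering (row-wise), as in normalization during preprocessing:
# adjusts each observation for a systematic shift
X_obs = X_log - X_log.mean(axis=1, keepdims=True)
```

Column-wise standardization puts variables on a common scale for downstream modelling, whereas row-wise centering adjusts each observation for a systematic effect, as in normalization.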

PRED1.2: Variable selection

Variable selection refers to the identification of a subset of predictor variables from all available variables for the purpose of building a prediction model. Note that terms such as variable selection, selection strategy, or stepwise procedures are often used in the statistical literature, whereas the terms feature selection, wrapper, and filter are more common in the machine learning community. Multiple strategies have been proposed in the statistical and machine learning areas; for recent reviews, see, e.g., Heinze et al. [132] and Singh et al. [133]. If available, subject matter knowledge should be included in the variable selection process. In many cases, however, variable selection is performed in a data-driven way, either with filter methods or with wrapper methods. In filter methods, the candidate predictor variables are considered one at a time, independently of each other. Those satisfying a given criterion (for example, those associated with the target variable or those showing sufficient variability across all patients) are selected, while the others are ignored in the remaining analyses (for a comparison of filter methods in classification tasks with HDD, see [134]). In contrast, wrapper methods select a subset of variables that, taken in combination, yields good prediction performance when used for prediction modelling with a considered method. The performance is assessed, e.g., through cross-validation (see section “PRED2.2: Internal and external validation”). Note that an embedded variable selection is also performed intrinsically by model building methods such as lasso and boosting (see section “PRED1.4: Statistical modelling”).
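
A minimal sketch of a filter approach on simulated data (a variance filter followed by univariate t-tests; the thresholds are arbitrary illustrations, not recommendations):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 40, 500
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)                 # binary target variable
X[:, 0] += 2.0 * y                        # make variable 0 truly informative

# filter step 1: keep variables with sufficient variability across patients
keep = X.var(axis=0) > 0.1

# filter step 2: univariate association with the target (two-sample t-test),
# each variable considered one at a time, independently of the others
pvals = np.array([stats.ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                  for j in range(p)])
selected = np.where(keep & (pvals < 0.01))[0]
```

A wrapper method would instead evaluate candidate subsets jointly, via the prediction performance of a model fitted on each subset.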

When the variable selection process uses outcome data, care must be taken to avoid optimistic bias in apparent model performance estimates due to multiple testing issues such as those described in section “TEST: Identification of informative variables and multiple testing.” It is critical that any data-driven variable selection steps are included as part of the model building process when model performance is assessed using any internal validation method; see Sachs and McShane [135] for a discussion of the use of “incomplete” cross-validation approaches and the bias inherent in such flawed approaches. Section “PRED2.2: Internal and external validation” provides further discussion. With an emphasis on low-dimensional data (LDD), the topic group TG2 “Selection of variables and functional forms in multivariable analysis” of the STRATOS initiative raised several issues about the properties of variable selection procedures that need further research. The authors stressed that it is not straightforward to decide which variable selection approach to use under which circumstances [136]. These problems are only exacerbated in the HDD setting.
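
The optimistic bias of “incomplete” cross-validation can be demonstrated on pure noise data: selecting variables on the full dataset before cross-validating yields a deceptively high AUC, whereas embedding the selection in every training fold (here via a scikit-learn pipeline; the simulation design is our own) gives an honest estimate near 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
n, p, k = 50, 2000, 10
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)                  # pure noise: no real signal at all

# WRONG ("incomplete" CV): select the top k variables on the FULL data,
# then cross-validate only the model fitting step
top = np.argsort(f_classif(X, y)[0])[-k:]
biased = cross_val_score(LogisticRegression(max_iter=1000), X[:, top], y,
                         cv=5, scoring="roc_auc").mean()

# RIGHT: the selection step is repeated inside every training fold
pipe = make_pipeline(SelectKBest(f_classif, k=k),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
```

Despite the complete absence of signal, the incomplete procedure reports a strong discrimination, because the held-out samples were already used to pick the variables.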

PRED1.3: Dimension reduction

Data reduction has many purposes, including easier data handling (see also sections “IDA2.4: Graphical displays” and “EDA2.1: Cluster analysis” for aspects regarding data reduction). Concerning prediction, data reduction can help to reduce redundant information that may lead to instability of prediction models, as noted at the beginning of section “PRED1: Construct prediction models.” Data reduction may also facilitate explanation and interpretation by reducing the number of variables to consider. Note, however, that it may yield variables without a meaningful interpretation from a medical point of view [137].

In contrast to variable selection, the idea of dimension reduction is not to select variables but to build (a small number of) new variables, often called components, that summarize the information contained in the original variables. They can then be used as predictor variables for model building—possibly with a low-dimensional method. However, portability and feasibility of models generated using dimension reduction versus variable selection can be substantially different. To predict outcome using a model containing only a few selected variables, it is sufficient to measure these selected variables, while a model including derived components may require the measurement of all original variables. Consider, for example, deriving a prediction model from gene expression data generated using a microarray that measures 20,000 genes. There is a huge practical difference between using a model requiring input of expression levels of only 10 selected individual genes compared to using a model requiring input of 10 combination scores (components), each of which potentially requires knowledge of expression levels for all 20,000 genes.

The most well-known and widely used dimension reduction approaches are principal component analysis (PCA, see also section “IDA3: Preprocessing the data” for a description) and partial least squares (PLS), where the components are defined as linear combinations of the original variables [138]. While PCA constructs components that have maximal variance and thus capture the signals of all types contained in the data, PLS constructs new variables that have maximal covariance with the target variable of interest. PLS is said to be a supervised method, where the term “supervised” refers to the fact that the target variable determines the construction of the components. Note that dimension reduction can be combined with variable selection.

For HDD analysis, a knowledge-based data reduction may also be useful. There is often external knowledge available about the entities to be investigated, such as knowledge of signaling pathways when analyzing gene expression data, or knowledge of conserved regions when analyzing DNA sequencing data (see also section “TEST3: Identify informative groups of variables” for incorporating information about functional relationships between genes in multiple testing). Attempts to re-discover such knowledge from the data at hand when performing data reduction will typically be less reliable than a data reduction strategy that explicitly incorporates the external information, even if that information is itself somewhat unreliable (Table 22).

Table 22 Method for dimension reduction: Supervised principal components

PRED1.4: Statistical modelling

Several modifications of traditional regression methods are available to address common challenges encountered in HDD settings with p > n, where there is no unique mathematical solution for the standard regression parameter estimates. Traditional regression aims to find the parameters that minimize a sum of squared errors, which can be viewed as minimizing a type of “loss function.” Various modifications of this loss function can be made to permit a unique solution for the regression parameters in the HDD setting. The modifications described in this section impose mathematical constraints on the regression coefficients. These constraints effectively limit the number of predictor variables included in the model, or the magnitudes of their effects, or both. Estimates obtained with such constraints are often referred to as “shrunken.” Some of these constraints can be shown to be equivalent to adjusting the covariance matrix (e.g., ridge regression; see [139]), but a variety of other constraints can be applied through specification of different loss functions; lasso [140] and elastic net [141] are two examples. Other methods, such as boosting [142], iteratively fit regression models that minimize a specified loss function at each stage. These various approaches usually lead to different models, each of which is optimal according to its corresponding criteria.
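
The effect of the different constraints can be sketched on simulated data with p > n (the penalty strengths are illustrative; in practice they are tuned, e.g., by cross-validation):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(5)
n, p = 50, 300                               # p > n: plain least squares has no unique solution
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]             # only the first three variables carry signal
y = X @ beta_true + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)           # shrinks all coefficients, none exactly zero
lasso = Lasso(alpha=0.2).fit(X, y)           # shrinks and sets most coefficients exactly to zero
enet = ElasticNet(alpha=0.2, l1_ratio=0.5).fit(X, y)  # compromise between the two

n_nonzero_lasso = int(np.sum(lasso.coef_ != 0))
```

Ridge keeps all p coefficients in the model, whereas the lasso’s constraint performs an embedded variable selection; the elastic net interpolates between the two behaviors.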

Numerous modifications of these basic approaches have been developed in the literature (especially for the lasso, due to its variable selection property). Goals can be to recover desirable mathematical properties (e.g., the adaptive lasso [143] uses adaptive weights for penalizing different coefficients and estimates the correct model under some conditions) or to adapt the lasso to specific problems (e.g., the group lasso [144] allows predefined groups of variables to be selected or excluded jointly) (Table 23).

Table 23 Methods for statistical modelling with constraints on regression coefficients: Ridge regression, lasso regression, elastic net, boosting

PRED1.5: Algorithms

Boosting can be seen both as a statistical method, when a statistical model is fitted, and as an algorithmic approach, when it is implemented as a black box. In the latter case, the prediction updates are unrelated to an underlying statistical model, and only aim at minimizing a loss function [147]. Several machine learning algorithms have been developed to provide prediction rules [148]. The prediction model is constructed without variable selection or dimension reduction as a preliminary step, in a fully data-driven way, i.e., (in contrast to statistical methods) without assuming a particular model for the dependence between target and predictor variables. These algorithmic approaches may allow more flexibility to handle aspects such as non-linear or interaction effects, but often they are also less interpretable.

Machine learning algorithms comprise a diverse collection of methods. They include, among others, methods based on consideration of nearest neighbors in the predictor space (such as kNN), decision trees for classification and for regression (tree-based methods based on recursive partitioning of the predictor space), random forests (ensembles of decision trees, i.e., sets of decision trees whose predictions are averaged), and more complex approaches such as deep learning (neural networks with different structures and typically a huge number of parameters). In the HDD setting, many of these machine learning methods have been used successfully, but one must be particularly careful if the methods require the estimation of a large number of parameters, which applies especially to deep learning; here, the overfitting problem discussed above becomes even more severe. Unbeknownst to users, some software developed to implement complex algorithms could have faulty designs that produce incorrect or overfitted results; hence, algorithms must be carefully tested [149] (Table 24).
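
A short sketch of an algorithmic approach in a p ≫ n setting (simulated data; the hyperparameters are defaults, not recommendations), with performance estimated out-of-fold to guard against the overfitting problem discussed above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n, p = 60, 500
X = rng.standard_normal((n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # signal carried by only two of 500 variables

# an ensemble of decision trees; averaging over many trees dampens overfitting,
# and out-of-fold evaluation keeps the performance estimate honest
rf = RandomForestClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
```

The forest is fitted fully data-driven, without a preliminary variable selection step and without assuming a model for the dependence between target and predictors.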

Table 24 Methods for statistical modelling with machine learning algorithms: Support vector machine, trees, random forests, neural networks and deep learning

PRED1.6: Integrating multiple sources of information

A major challenge for HDD, both for omics data and for electronic health records, is the integrative analysis of different data types. For instance, multiple types of omics data including proteomic, transcriptomic, and genomic, may be measured on the same subject. For health records data, various variable types are combined, such as blood values, urine values, cardiography measurements (ECG or EKG), categorical diagnostic measurements, or a variety of demographic variables. This has implications for visualization and use of clustering methods, which are often designed for a single data type. Conducting and interpreting joint analyses of disparate variable types can be challenging. Richardson and coauthors [160] distinguish between “horizontal integration” applied to the same type of data across multiple studies and “vertical integration” applied to different types of data on the same sample of subjects. The distinction between horizontal and vertical refers to the fact that, usually, data from high-throughput experiments are organized with samples represented by columns and variables by rows.

Regarding horizontal integration, the most commonly used approach is the meta-analytic pooling of summary measures of association. For other applications, such as clustering, centering and standardization [161] or specific methods should be considered in order to deal with the different normalizations and platforms of the different datasets; for clustering, see for example Huo et al. [162]. Vertical data integration is typically model-based, and the model used considers the specific characteristics of the data to be integrated and of the research question (whether exploratory or predictive).

In biomedicine, integration of multiple omics data types can provide deeper biological insights compared to individual omics in terms of disease subtyping, biomarker identification, and understanding of molecular mechanisms in diseases. For example, two different tissues from the same or different organism may carry an identical DNA sequence for a particular gene, but the gene may be inactivated by methylation in one of the tissues and not in the other; or the aberrant expression of one gene regulating the function of another downstream in the same biological pathway might be evident by observing the altered expression of the downstream gene at the RNA or protein level.

Richardson and coauthors [160] reviewed some vertical integrative analysis approaches, including integrative clustering and regression. The integrative clustering approach of Shen and coauthors [163], called iCluster, involves projection, via regression modelling, of the data onto scores representing a set of latent biological subtypes assumed common across data types. Resulting predicted biological subtype scores are clustered to identify latent subtype membership, and estimated coefficients from the fitted regression models can provide insights into data features that associate with certain subtypes. Mo and coauthors subsequently developed iCluster+ to allow for other non-continuous, non-Gaussian data types [164]. More complex Bayesian mixture modelling approaches have also been developed to offer greater flexibility to accommodate mixed data types (e.g., discrete mutation indicators in combination with continuous RNA or protein expression measures), provide metrics reflecting uncertainty about estimated underlying structure, and allow for elucidation of potentially different structure from different data types [165,166,167,168]. Integrative regression techniques are useful for supervised analyses of integrated data types, such as building a regression model for prediction of an outcome or phenotype. These methods utilize structure inherent in different data types (e.g., DNA sequence location, functional categories of proteins, metabolic or signaling pathways) to effectively reduce the high dimensionality of the predictor variable space and thus facilitate development of more parsimonious and interpretable models relating the multi-omics data to outcomes or phenotypes of interest. Multi-omics integration methods using autoencoders in a deep learning setting are reviewed by Benkirane and coauthors [169]. For more details, readers are referred to Richardson and coauthors [160] and references therein.

Although many prediction models for clinical outcomes have been developed based either on clinical data or (more recently) on high-throughput molecular data (e.g., omics), far fewer models have been developed to incorporate both data types through vertical integration. The paucity of such models in the literature and in clinical use persists despite suggestions that a suitable combination of clinical and molecular information might lead to models with better predictive abilities (e.g., [170, 171]).

In many medical specialties, there are some widely available and accepted clinical predictors with predictive value already validated in several independent populations. Strategies to combine such established clinical predictors with different data types, including high-dimensional omics data, have been proposed [172]; some examples have been published [173, 174], but applications are still rare. Volkmann and coauthors [174] investigated whether better use of the predictive value of clinical data has an influence on the added predictive value of molecular data. This concept can also be extended to multi-omics data [175].

Conceptually, it is obvious that incorporation of important clinical variables can potentially lead to better prediction models; thus, those variables should be considered in combination with molecular data. De Bin et al. [176] present strategies to combine low- and high-dimensional data in a regression prediction model, analyzing the influence of the complex correlation structure within and between the two data sources. In some situations, predictive value of molecular data might be fully captured through the clinical variables, thereby eliminating the need for the molecular data in the prediction model [172].

PRED2: Assess performance and validate prediction models

Perhaps even more than constructing predictive models and algorithms, evaluating their performance and validating them are key challenges. For HDD, not only the choice of suitable measures to assess and compare model performance (see section below), but also the way of computing these measures is generally not straightforward.

PRED2.1: Choice of performance measures

Prediction performance is typically assessed by comparing the true and the predicted values of the target variable. The comparison is based on specific metrics, mainly depending on the nature of the target variable. Typical metrics include mean squared error or mean absolute error for continuous target variables, area under the curve (AUC) or Brier score for binary target variables, and calibration plot and time-dependent Brier score for time-to-event variables. Such measures can be used to quantify the performance of a model (or algorithm) or to compare different models constructed using the same dataset. In most biomedical applications, the goal of a comparative assessment is to select a final model [177, 178]. Models of absolute risk that depend on covariates have been used to design intervention studies, to counsel patients regarding their risks of disease or future disease-related events, and to inform clinical decisions. Several criteria related to “calibration” and “discriminatory power” have been proposed [179, 180]. Often the main interest will be in the added value of biomarkers or gene signatures relative to an existing clinical prediction model. Several performance measures are available to quantify the added value [181].
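
Typical metrics can be computed directly; the toy numbers below are invented solely to show the calculations, and scikit-learn is just one of many implementations:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss, mean_squared_error

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # observed binary outcomes
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.7, 0.6, 0.3])     # predicted risks

auc = roc_auc_score(y_true, y_prob)       # discrimination: ranking of cases vs. controls
brier = brier_score_loss(y_true, y_prob)  # overall accuracy of the predicted risks

# for a continuous target, e.g., blood pressure after therapy:
y_cont = np.array([120.0, 135.0, 128.0])
y_pred = np.array([118.0, 140.0, 130.0])
mse = mean_squared_error(y_cont, y_pred)
```

Here the AUC is 14.5/16 = 0.90625 (one tied case-control pair counts 0.5), the Brier score is the mean squared difference between predicted risk and observed outcome (0.13125), and the MSE is (4 + 25 + 4)/3 = 11.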

For a clinical task, several very different models with equivalent prediction performance may be available. In this situation especially, though not only then, other aspects of the models can play an important role. Particularly noteworthy aspects of a model are sparsity, stability, interpretability, and practical usefulness [7, 182]. Regarding sparsity, when selecting a final model from among several with comparable prediction performance, the most parsimonious one (e.g., the model with the smallest number of predictor variables) is preferred. Stability refers to the degree to which small changes in the data leave the predictor output essentially unchanged. A majority of predictors derived from HDD suffer from poor stability, irrespective of the method used to fit them, although some methods are more affected than others (see [183] for an overview of stability measures, and Sauerbrei et al. [184] for stability investigations of regression models for LDD and HDD). For HDD, the stability problem is due to the myriad ways to combine a set of predictor variables to derive similarly performing predictors. If the stability is found to be low, then interpretation of specific model components (the list of selected predictor variables, relationships between predictor variables, etc.) should be avoided. In terms of interpretability of the model, strong prior biological knowledge may also be taken into account, similar to the aim of data reduction described above (Table 25).
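
A simple stability check counts how often each variable is selected across subsamples of the data (simulated data; the lasso and the half-sample subsampling scheme are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
n, p, n_rep = 50, 200, 50
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)   # only the first two variables carry signal

# selection frequency over subsamples as a basic stability measure:
# variables selected in most resamples are more trustworthy
counts = np.zeros(p)
for _ in range(n_rep):
    idx = rng.choice(n, size=n // 2, replace=False)
    coef = Lasso(alpha=0.2, max_iter=10_000).fit(X[idx], y[idx]).coef_
    counts += coef != 0
freq = counts / n_rep
```

Truly informative variables should be selected in a large fraction of the subsamples, while variables selected only sporadically should not be over-interpreted.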

Table 25 Methods for assessing performance of prediction models: MSE, MAE, ROC curves, AUC, misclassification rate, Brier score, calibration plots, deviance
Fig. 17
figure 17

Receiver operating characteristic (ROC) curve that illustrates the predictive performance of a gene signature including 227 genes for the prediction of chemotherapy response in serous ovarian cancer, obtained using the TCGA (The Cancer Genome Atlas) data set. The arrow indicates the sensitivity and specificity values obtained for a selected cutoff value that can serve as a threshold for patient stratification. In this example, the AUC is evaluated on the same data used to train the classifier, so it is likely to be overoptimistic. Source: [185]

Fig. 18
figure 18

Illustrations of different types of miscalibration, visualized by calibration plots. Illustrations are based on an outcome with a 25% event rate and a model with an area under the ROC curve (AUC or c-statistic) of 0.71. Calibration intercept and slope are indicated for each illustrative curve. a General over- or underestimation of predicted risks. b Predicted risks that are too extreme or not extreme enough. Source: [187]

PRED2.2: Internal and external validation

Whatever measure of model performance has been chosen, computing it on the same dataset that was used for constructing the model may lead to a dramatic over-estimation of the performance. Instead, one should assess prediction performance using independent data, i.e., data not used to construct the model [189,190,191]. One classical procedure is to split the given dataset into a training set and a test set, and then to construct the model using only the training set and evaluate the model using only the test set. This is one type of “internal validation,” in contrast to “external validation,” where data from independent patient cohorts are used [192].

Due to the typical instability of predictors developed using HDD, this sample splitting procedure is very risky, as in most cases the specific split heavily influences the result. Resampling techniques, such as cross-validation, subsampling and bootstrapping, can be less risky, although even those methods cannot avoid impact of biases in the data introduced by faulty designs such as those that would confound batch effects with outcome variables. The common idea behind these procedures is to repeatedly use a part of the dataset as training dataset, i.e., to construct a prediction model, and the other (non-overlapping) part as test dataset to evaluate the constructed model. This process is repeated several times for different splits into training and test data to produce a more stable and representative estimate of model performance.
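
The resampling idea can be sketched with a manual K-fold loop (simulated data; ridge regression is a placeholder for any model building procedure). The essential point is that the model is refitted from scratch on each training part and scored only on the corresponding held-out part:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n, p = 60, 200
X = rng.standard_normal((n, p))
y = X[:, 0] - X[:, 1] + rng.standard_normal(n)

fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # the model is constructed on the training part only ...
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    # ... and evaluated on the non-overlapping held-out test part
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

cv_mse = float(np.mean(fold_mse))
```

Averaging over the five folds yields a more stable performance estimate than any single train/test split.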

For such approaches, a bias in the performance estimates must be considered (see also [135]). This bias occurs because the training data sample size is smaller than for the full dataset, and therefore prediction models built on the training dataset tend to have somewhat worse performance than a final model built on the full data. The latter is typically used for further evaluation. This bias becomes larger the smaller the training dataset is compared to the full dataset. This aspect is less relevant if the sample size of the full dataset and thus of the training dataset is very large.

One misleading practice is use of resampling procedures for multiple different prediction modelling methods or for different parameter values, and then reporting results for only the model with best performance. This practice leads to over-optimism in model performance because it neglects to acknowledge and account for the fact that the reported model was the result of another optimization process [201]. Such studies aiming to find “best” models occur quite frequently in the context of HDD. While it would be naive to expect that investigators will not try multiple approaches to develop a prediction model, the key is transparency in reporting how many models were actually produced and evaluated, and appropriately accounting for the additional selection step. One should either validate the final selected model using an independent dataset (see Table 26), or when such a dataset is not available, embed the selection process in the cross-validation procedure, i.e., perform a so-called nested cross-validation procedure [190, 202]. Figure 19 [203] shows a schematic representation of a suitable process for developing a predictor, here specified for omics data, in which the discussed aspects are adequately taken into account.
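
A nested cross-validation can be assembled from standard building blocks; in the sketch below (simulated data, an illustrative tuning grid), the inner loop selects the lasso penalty and the outer loop estimates performance, so that the whole tuning step is repeated inside every outer training fold:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(8)
n, p = 60, 300
X = rng.standard_normal((n, p))
y = X[:, 0] + rng.standard_normal(n)

# inner loop: tuning-parameter selection; outer loop: honest performance estimate
inner = GridSearchCV(Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0]}, cv=3)
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(5, shuffle=True, random_state=0),
                               scoring="neg_mean_squared_error")
nested_mse = float(-outer_scores.mean())
```

Reporting instead the best inner-loop score would repeat the misleading practice described above, because the selection step would then not be accounted for.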

Table 26 Methods for validation of prediction models: Subsampling, cross-validation, bootstrapping, use of external datasets
Fig. 19

Schematic representation of an appropriate omics predictor development process, with internal validation for improving prediction performance (left box) and external validation for assessing prediction performance on external data. Source: [203]

PRED2.3: Identification of influential points

Identification of possible influential observations, defined as those whose inclusion or exclusion in model development might substantially alter characteristics of the final model [204], is an important aspect of prediction modelling that is often neglected in HDD settings, and frequently even in low-dimensional settings. Such alterations can concern variable selection (see, e.g., [205]), functional forms (e.g., [206]), or parameter estimation (e.g., [207]). Influential points can be outliers in some of the variables (observations suspiciously different from the rest, such that they were probably generated by a different mechanism [208]), but they need not be.

For HDD, identification of influential points is particularly difficult, due to data sparsity (the so-called “curse of dimensionality”) and, more generally, the increased difficulty of identifying data patterns, especially by graphical methods. Many available methods for influential point detection are extensions of traditional low-dimensional tools such as Cook’s distance [204] and the DFBETA/DFFITS measures [209]. Examples of adaptations of the former to the high-dimensional framework are Zhao et al. [210] and Wang and Li [211], and of the latter Walker and Birch [212] and Rajaratnam et al. [213]. Focusing more on statistical models (see section “PRED1.4: Statistical modelling”), methods like those of Shi and Wang [214] and Hellton et al. [215] investigate the effect of influential points on the choice of the tuning parameters, again adapting existing low-dimensional approaches (the aforementioned DFFITS measure and the resampling approaches in De Bin et al. [205], respectively). The latter is an example of how cross-validation and subsampling can be used to detect influential points by tracking large changes in the estimates when one (or a few) observations are omitted. Although influential points and outliers can strongly affect the results of analyses in HDD settings, systematic checks for them seem to be often ignored in the literature, despite the availability of various techniques [216, 217].
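The leave-one-out idea behind such resampling-based checks can be shown with a deliberately simple, hypothetical sketch (not one of the cited methods): refit the model with each observation omitted in turn and flag the observation whose omission changes the estimate most.

```python
import random

random.seed(3)

# Toy data (hypothetical) with one deliberately injected influential point.
n = 30
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 0.5) for xi in x]
x.append(4.0)
y.append(-6.0)  # far from the trend of the other points

def slope(xs, ys):
    """Least-squares slope through the origin."""
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)

full = slope(x, y)

# Leave-one-out influence: absolute change in the estimate when point i is dropped.
influence = []
for i in range(len(x)):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    influence.append(abs(slope(xs, ys) - full))

most = max(range(len(x)), key=lambda i: influence[i])
print(most)  # index of the most influential observation
```

In this toy setting the injected point dominates the slope estimate, so its leave-one-out change is by far the largest; real HDD applications would track analogous changes in selected variables or tuning parameters.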

In a study on the classification of breast cancer subtypes, Segaert et al. [218] stress that classical statistical methods may fail to identify outliers and argue for robust classification methods that flag outliers; they propose the DetectDeviatingCells outlier detection technique. Specifically for HDD, Boulesteix et al. [216] propose a rank discrepancy measure that considers the difference between gene rankings for the original data and for a pretransformation that tries to eliminate the effect of extreme values. For survival data, Carrasquinha et al. [219] propose a rank product test to identify influential observations, and further techniques continue to be proposed. Fan [220] released the R package HighDimOut, which contains three high-dimensional outlier detection algorithms. However, none of these approaches seems to have gained popularity in practice; more guidance and comparisons of available approaches are needed.

PRED2.4: Sample size considerations

Recent guidelines for calculating the sample size required to develop a risk prediction model [16] are not specifically tailored to applications involving variable selection or shrinkage methods (such as the lasso or ridge regression). This is exactly the situation in high-dimensional settings, where variable selection or dimension reduction is needed to identify prognostic variables or components. The available methods for sample size planning ([17, 18] and references therein) are based either on simulations that require assumptions of feature independence or on the availability of a pilot dataset or preliminary data, but these methods are rarely used in practice. Moreover, penalized estimation has often been proposed precisely for situations where overfitting is a major concern, yet recent evidence suggests that it yields unstable results, especially for small sample sizes [221].

A practical method for planning the sample size of a preliminary study of a prognostic biomarker has been suggested for microarray technology [113] and can be used in more general settings. When a gene signature is already available from previous exploratory studies, a formal sample size calculation for a predictive model, including the gene signature and standard prognostic covariates, can be performed according to available guidelines, also taking into account the need for external validation [16, 200].

Good reporting to improve transparency and reproducible research

Reporting of studies involving HDD can be particularly challenging and at the same time especially important due to the many potential pitfalls in the collection and analysis of complex HDD as described herein. Complete and transparent reporting of these studies is critical to allow independent researchers to evaluate how a study was designed, conducted, and analyzed so that quality and relevance of the findings can be judged and interpreted in appropriate context. Provision of data and computer code may be required to achieve full transparency.

Guidelines for reporting many types of health research have already been developed and are largely applicable in HDD settings. Simera et al. [222] introduced the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) network as an umbrella organization for the reporting of studies in the health sciences. Most relevant for HDD are the REporting recommendations for tumor MARKer prognostic studies (REMARK) [11] and TRIPOD for the reporting of multivariable prediction models for individual prognosis or diagnosis [12]. For both reporting guidelines, more detailed “explanation and elaboration” papers have been published [125, 223], which also include several sections on statistical analyses. Furthermore, the two-part REMARK profile, a structured display that summarizes key aspects of a study, has been proposed to improve completeness and transparency of reporting, specifically of statistical analyses. The TRIPOD checklist distinguishes between model development and validation. Both guidelines were developed for markers and for models based on clinical data, with no more than a few dozen potential predictors in mind.

In an article stressing the importance of registering diagnostic and prognostic research, Altman [224] clearly expresses that non-reporting and misleading reporting do not just mislead researchers in the field, they also diminish the evidence base underpinning clinical practice and harm patients. To improve on such an unacceptable situation of non-transparency in studies, several initiatives including data pooling, registers, and journal requirements for protocols were started, see Peat et al. [225] for a detailed discussion with an emphasis on prognosis research.

Obviously, reporting of artificial intelligence and machine learning methods comes with a large number of additional challenges. Concerns have been raised that such methods are overhyped in clinical medicine (see, e.g., [226]) and that, if not used with proper expertise, they suffer from methodological shortcomings, poor transparency, and poor reproducibility [227]. There is a strong need for applications of machine learning techniques to adhere to the methodological standards already established in prediction model research [228].


Discussion

In this section, we first summarize the content and the key messages of this overview paper. We also briefly present the relationships of the other topic groups of the STRATOS initiative to the HDD-focused TG9 group and discuss the importance of further collaboration.

Biomedical research has always relied on a combination of observational studies, carefully controlled laboratory experiments, and clinical trials, but the types of data generated and analyzed in these studies continue to evolve and now more often include HDD. The high dimensionality may result from new technologies such as omics assays, which are capable of comprehensive interrogation of biological specimens, or from increased ability to merge data from multiple information systems such as electronic health records or registries. HDD present many new challenges for statistical design and analysis of biomedical research studies. This overview provides a gentle introduction to basic concepts and useful strategies for design and analysis of studies involving HDD. Key points are summarized in the discussion that follows.

Study design for prospectively planned investigations and vigilance to detect (and, when possible, avoid) confounding in observational studies remain as important for studies involving HDD as for other studies. The consequences of inattention to these aspects can be particularly damaging when HDD are involved. While HDD may provide greater opportunity for discovery of new biological and clinical concepts and associations, they may also be more susceptible to the influence of ancillary confounding variables and technical artifacts. Therefore, initial examination of data for technical artifacts such as batch effects and for inconsistent, extreme, or suspicious values is critically important, but becomes more challenging as the data dimension increases. New data visualization, detection, and correction or normalization methods have been adopted for HDD, as described in section “IDA: Initial data analysis and preprocessing” of this overview. Techniques for data visualization and exploration such as those described in section “EDA: Exploratory data analysis” are also important to provide biological insights and support the development of new scientific hypotheses from HDD. The initial steps and exploratory analyses described in sections “IDA: Initial data analysis and preprocessing” and “EDA: Exploratory data analysis” are optimally performed with a good understanding of the data sources and data generation methods, for example the assay technologies that produce omics data, and interpreted in collaboration with other scientists knowledgeable in the technology, biology, and clinical aspects.

Statistical analysis methods developed for traditional settings, where the number of independent observations or subjects is substantially larger than the number of variables acquired, are widely used by classically trained statisticians and others in a variety of applications, and their widespread use is supported by the ready availability of software. The emergence of many new types of HDD has exposed the limitations of many traditional methods. Often, methods rely heavily on distributional assumptions such as normality, which may be unrealistic for data types generated by novel technologies such as omics assays. Many methods owe their robustness against violations of such assumptions to large sample sizes, yet the notion of what qualifies as “large n” is dramatically different for HDD, where even the most basic requirement n > p is not satisfied. Much traditional multivariate methodology has focused heavily on mathematically tractable joint distributions such as the multivariate Gaussian, or has assumed that sample sizes were large enough for this to serve as a good approximation. As these are not reasonable assumptions for many types of HDD, many researchers opt for the alternative strategy of examining each of many variables one at a time. Yet naively taking such an approach is fraught with the danger of generating many false discoveries, due to the extremely large number of variables examined. Traditional strategies for controlling false positive findings, such as controlling the FWER, are often impractical or overly stringent in view of the goals of many studies involving HDD, and this recognition has stimulated the development of novel approaches for false discovery control. Section “TEST: Identification of informative variables and multiple testing” highlighted some of these many challenges and summarized useful strategies to address them.
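For readers unfamiliar with the two error-control notions mentioned here, the following hypothetical sketch contrasts a Bonferroni (FWER) cutoff with the Benjamini–Hochberg (FDR) step-up rule on an invented set of p-values; it is a textbook illustration, not a procedure from this overview.

```python
def bonferroni_rejections(pvals, alpha=0.05):
    # FWER control: reject hypothesis i if p_i <= alpha / m.
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]

def bh_rejections(pvals, alpha=0.05):
    # Benjamini-Hochberg FDR control: find the largest rank k with
    # p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

# Invented p-values for ten hypotheses.
pvals = [0.001, 0.004, 0.012, 0.015, 0.022, 0.06, 0.3, 0.5, 0.7, 0.9]
print(bonferroni_rejections(pvals))  # [0, 1] - only p <= 0.005 survive
print(bh_rejections(pvals))          # [0, 1, 2, 3, 4] - BH rejects more
```

With thousands of variables, the Bonferroni threshold alpha/m becomes extremely stringent, which is exactly why FDR-type procedures are favored in many HDD studies.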

The last few decades have seen substantial progress in the development of prediction modelling methodology, especially as applicable to HDD, and the increased availability of free software to implement these methods has fueled their use. Available methods include a variety of statistically based approaches as well as a number of purely algorithmic approaches, such as many machine learning methods. Prediction models developed from HDD have intrigued many researchers under the impression that, with sufficiently large volumes of data, one should be capable of predicting virtually anything. Numerous dramatic claims of performance have been made; unfortunately, these claims do not always withstand careful scrutiny. Section “PRED: Prediction” reviews several popular prediction modelling methods for HDD and stresses the importance of following proper procedures to assess and avoid model overfitting, which leads to prediction models that do not perform well outside of the data from which they were developed. Poor study design and faulty prediction modelling approaches that lead to spurious and overfitted models, along with wildly inaccurate claims of their performance, persist in the biomedical literature. The guidance provided in section “PRED: Prediction” aims to reduce this problem and promote the successful development of useful prediction models.

Within the STRATOS initiative, there are currently nine topic groups (TGs), mainly concerned with LDD. Table 27 presents the relationship of the other STRATOS topic groups to TG9 and how TG9 guidance will build upon that of the other TGs to adapt it for relevance to HDD settings.

Table 27 Relationship and collaboration between TG9 (topic group 9) and the other topic groups of the STRATOS initiative


This overview aimed to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are newly embarking on research involving HDD or who simply want to better evaluate and understand the results of HDD analyses. Common approaches for the statistical analysis of high-dimensional biomedical data are described in 24 method tables; see Table 28 for a list of these tables. New methods to generate HDD or to combine existing data resources into HDD will continue to evolve, and there will be a continued need for new and improved computational and statistical analysis strategies to address new types of data and the novel questions to be answered from them. The basic concepts and strategies presented in this overview will remain relevant, and their wider grasp by the biomedical research community will hopefully lead to continued improvement in the quality, reliability, and value of studies involving HDD. Most importantly, strong collaborations between statisticians, computational scientists, and other biomedical researchers relevant to each project, such as clinicians, public health experts, laboratorians, technology experts, and bioinformaticians, are essential to produce the most reliable and meaningful data and results.

Table 28 Overview of method tables with descriptions of statistical methods

Availability of data and materials

Not applicable.



Abbreviations

AIC: Akaike information criterion
ANOVA: Analysis of variance
ASW: Average silhouette width
AUC: Area under the curve
Bagging: Bootstrap AGGregatING
BIC: Bayesian information criterion
BMI: Body mass index
CAST: Cluster Affinity Search Technique
CLARA: Clustering Large Applications
ComBat: Methods for adjustment of batch effects
DESeq: Differential gene expression analysis of RNA-seq data
EDA: Exploratory data analysis
EQUATOR: Enhancing the QUAlity and Transparency Of health Research
FDP: False discovery proportion
FDR: False discovery rate
FWER: Familywise error rate
GO: Gene ontology
GSEA: Gene set enrichment analysis
HDD: High-dimensional data
IDA: Initial data analysis
IOM: Institute of Medicine of the National Academy of Sciences, USA
LDD: Low-dimensional data
limma: Linear Models for Microarray Data
m/z: Mass-to-charge ratio
MA plot: Bland-Altman plot
MAD: Median absolute deviation to the median
MAE: Mean absolute (prediction) error
MAR: Missing at random
MCAR: Missing completely at random
MDS: Multidimensional scaling
MNAR: Missing not at random
MSE: Mean squared (prediction) error
NB: Negative binomial
NMDS: Non-metric multidimensional scaling
PAM: Partitioning around medoids
PC: Principal component
PCA: Principal component analysis
PET: Positron emission tomography
PLS: Partial least squares
REMARK: REporting recommendations for tumor MARKer prognostic studies
RLE: Relative log expression
RMSE: Root mean squared error
ROC: Receiver operating characteristic
SNE: Stochastic neighbor embedding
SOM: Self-organizing maps
STRATOS: STRengthening Analytical Thinking for Observational Studies
SuperPC: Supervised principal components
SVA: Surrogate variable analysis
SVM: Support vector machine
TG: Topic group
topGO: Topology-based Gene Ontology enrichment analysis
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
t-SNE: T-distributed stochastic neighbor embedding
UMAP: Uniform Manifold Approximation and Projection
WHR: Waist-hip ratio


  1. Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J, on behalf of STRATOS initiative. STRengthening Analytical Thinking for Observational Studies: The STRATOS initiative. Stat Med. 2014;33:5413–32.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. Philos Trans A Math Phys Eng Sci. 1906;2009(367):4237–53.

    Article  Google Scholar 

  3. McGrath S. The Influence of ‘Omics’ in Shaping Precision Medicine. EMJ Innov. 2018;2(1):50–5.

    Article  Google Scholar 

  4. Evans RS. Electronic Health Records: then, now, and in the future. Yearb Med Inform Suppl. 2016;1:48–61.

    Article  Google Scholar 

  5. Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F, Goldman S, Janmohamed S, Kreuzer J, Leenay M, Michel A, Ong S, Pell JP, Southworth MR, Stough WG, Thoenes M, Zannad F, Zalewski A. Electronic health records to facilitate clinical research. Clin Res Cardiol. 2017;106(1):1–9.

    Article  PubMed  Google Scholar 

  6. McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MY, Kim KY, Tricoli JV, Taylor JM, Shuman DJ, Simon RM, Doroshow JH, Conley BA. Criteria for the use of omics-based predictors in clinical trials. Nature. 2013;502(7471):317–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly forgotten? BMJ. 1995;311:1539.

    Article  PubMed Central  Google Scholar 

  8. Hand DJ. Classifier technology and the illusion of progress. Stat Sci. 2006;21(1):1–14.

    Article  Google Scholar 

  9. Hernández B, Parnell A, Pennington SR. Why have so few proteomic biomarkers “survived” validation? (Sample size and independent validation considerations). Proteomics. 2014;14:1587–92.

    Article  CAS  PubMed  Google Scholar 

  10. Kleinrouweler CE, Cheong-See FM, Collins GS, Kwee A, Thangaratinam S, Khan KS, Mol BW, Pajkrt E, Moons KG, Schuit E. Prognostic models in obstetrics: available, but far from applicable. Am J Obstet Gynecol. 2016;214(1):79-90.e36.

    Article  PubMed  Google Scholar 

  11. McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM. for the Statistics Subcommittee of the NCI-EORTC Working on Cancer Diagnostics. REporting recommendations for tumor MARKer prognostic studies (REMARK). J Natl Cancer Inst. 2005;97:1180–4.

    Article  CAS  PubMed  Google Scholar 

  12. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMC Med. 2015;13:1.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Zhou H, Chen J, Rissanen TH, Korrick SA, Hu H, Salonen JT, Longnecker MP. Outcome-dependent sampling: an efficient sampling and inference procedure for studies with a continuous outcome. Epidemiology. 2007;18(4):461–8.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Yu J, Liu Y, Cai J, Sandler DP, Zhou H. Outcome-dependent sampling design and inference for Cox’s proportional hazards model. J Stat Plan Inference. 2016;178:24–36.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Cairns DA. Statistical issues in quality control of proteomic analyses: good experimental design and planning. Proteomics. 2011;11(6):1037–48.

    Article  CAS  PubMed  Google Scholar 

  16. Riley RD, Ensor J, Snell KIE, Harrell FE, Martin GP, Reitsma JB, Moons KGM, Collins G, van Smeden M. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441.

    Article  PubMed  Google Scholar 

  17. Götte H, Zwiener I. Sample size planning for survival prediction with focus on high-dimensional data. Stat Med. 2013;32(5):787–807.

    Article  PubMed  Google Scholar 

  18. Dobbin KK, Song X. Sample size requirements for training high-dimensional risk predictors. Biostatistics. 2013;14(4):639–52.

    Article  PubMed  PubMed Central  Google Scholar 

  19. Maleki F, Ovens K, McQuillan I, Kusalik AJ. Size matters: how sample size affects the reproducibility and specificity of gene set analysis. Hum Genomics. 2019;13(Suppl 1):42.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Geschwind DH. Sharing gene expression data: an array of options. Nat Rev Neurosci. 2001;2(6):435–8.

    Article  CAS  PubMed  Google Scholar 

  21. Kennedy RE, Cui X. Experimental Designs and ANOVA for Microarray Data. In: Handbook of Statistical Bioinformatics. Berlin: Springer, Berlin Heidelberg; 2011. p. 151–69.

    Chapter  Google Scholar 

  22. Lusa L, Cappelletti V, Gariboldi M, Ferrario C, De Cecco L, Reid JF, Toffanin S, Gallus G, McShane LM, Daidone MG, Pierotti MA. Questioning the utility of pooling samples in microarray experiments with cell lines. Int J Biol Markers. 2006;21(2):67–73.

    Article  CAS  PubMed  Google Scholar 

  23. Huebner M, Vach W, le Cessie S. A systematic approach to initial data analysis is good research practice. J Thorac Cardiovasc Surg. 2016;151(1):25–7.

    Article  PubMed  Google Scholar 

  24. Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies. 2018;4:171–92.

    Article  Google Scholar 

  25. Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S, editors. Bioinformatics and computational biology solutions using R and Bioconductor. New York: Springer Science & Business Media; 2005. 

  26. Friendly M. Corrgrams: Exploratory displays for correlation matrices. Am Stat. 2002;56(4):316–24.

    Article  Google Scholar 

  27. Chen Y, Mccarthy D, Ritchie M, Robinson M, Smyth G. edgeR: differential analysis of sequence read count data User’s Guide. 2008. cited 2022 Nov 29

  28. Wilkinson L, Friendly M. The History of the Cluster Heat Map. Am Stat. 2009;63(2):179–84.

    Article  Google Scholar 

  29. Leek JT, Scharpf R, Bravo H, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733–9.

    Article  CAS  PubMed  Google Scholar 

  30. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.

    Article  CAS  Google Scholar 

  31. Irizarry R, Love M. Data Analysis for the Life Sciences with R. CRC Press. 2016.

    Article  Google Scholar 

  32. Gandolfo LC, Speed TP. RLE plots: visualizing unwanted variation in high dimensional data. PLoS ONE. 2018;13(2):e0191629.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;327(8476):307–10.

    Article  Google Scholar 

  34. Smyth GK, Speed T. Normalization of cDNA microarray data. Methods. 2003;31(4):265–73.

    Article  CAS  PubMed  Google Scholar 

  35. Sauerbrei W, Buchholz A, Boulesteix AL, Binder H. On stability issues in deriving multivariable regression models. Biom J. 2015;57(4):531–55.

    Article  PubMed  Google Scholar 

  36. Altman DG, Bland JM. Missing data. BMJ. 2007;334(7590):424.

    Article  PubMed  PubMed Central  Google Scholar 

  37. Findlay JWA, Dillard RF. Appropriate calibration curve fitting in ligand binding assays. AAPS J. 2007;9(2):E260–7.

    Article  PubMed  PubMed Central  Google Scholar 

  38. Pearson KFRS. LIII. On lines and planes of closest fit to systems of points in space. London Edinburgh Dublin Philos Mag J Sci. 1901;2(11):559–72.

    Article  Google Scholar 

  39. Park M, Lee JW, Bok Lee J, Heun SS. Several biplot methods applied to gene expression data. J Stat Plan Inference. 2008;138(2):500–15.

    Article  Google Scholar 

  40. Gabriel KR. The biplot graphic display of matrices with application to principal component analysis. Biometrika. 1971;58(3):453–67.

    Article  Google Scholar 

  41. Silver JD, Ritchie ME, Smyth GK. Microarray background correction: maximum likelihood estimation for the normal-exponential convolution. Biostatistics. 2009;10(2):352–63.

    Article  PubMed  Google Scholar 

  42. Coombes KR, Baggerly KA, Morris JS. Pre-processing mass spectrometry data. In: Dubitzky W, Granzow M, Berrar DP, editors. Fundamentals of data mining in genomics and proteomics. Boston: Springer; 2007.

    Chapter  Google Scholar 

  43. Bolstad B, Irizarry R, Astrand M, Speed T. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–93.

    Article  CAS  PubMed  Google Scholar 

  44. Monti S. Quantile normalization. cited 2022 Nov 29

  45. Oberg AL, Mahoney DW. Statistical methods for quantitative mass spectrometry proteomic experiments with labeling. BMC Bioinformatics. 2012;13(16):S7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Ejigu BA, Valkenborg D, Baggerman G, Vanaerschot M, Witters E, Dujardin JC, Burzykowski T, Berg M. Evaluation of normalization methods to pave the way towards large-scale LC-MS-based metabolomics profiling experiments. Omics J Integr Biol. 2013;17(9):473–85.

    Article  CAS  Google Scholar 

  47. Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8(1):118–27.

    Article  PubMed  Google Scholar 

  48. Zhang Y, Parmigiani G, Johnson WE. ComBat-Seq: batch effect adjustment for RNA-Seq count data. NAR Genom Bioinformatics. 2020;2(3):lqaa078.

    Article  CAS  Google Scholar 

  49. Wang Y, LêCao K-A. Managing batch effects in microbiome data. Brief Bioinform. 2020;21(6):1954–70.

    Article  PubMed  Google Scholar 

  50. Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by ‘Surrogate Variable Analysis.’ PLoS Genetics. 2007;3(9):e161.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014;42(21):e161.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. PNAS. 2010;107(21):9546–51.

    Article  PubMed  PubMed Central  Google Scholar 

  53. Lusa L, Korn EL, McShane LM. A class comparison method with filtering-enhanced variable selection for high-dimensional data sets. Statist Med. 2008;27(28):5834–49.

    Article  Google Scholar 

  54. Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL. A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization. BMC Med Res Methodol. 2015;15:95.

    Article  PubMed  PubMed Central  Google Scholar 

  55. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004;2(4):e108.

    Article  PubMed  PubMed Central  Google Scholar 

  56. Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology. 1995;6(4):450–4.

    Article  CAS  PubMed  Google Scholar 

  57. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Statist Med. 2006;25(1):127–41.

    Article  Google Scholar 

  58. Lee K, Tilling K, Cornish R, Carpenter J. Framework for the treatment and reporting of missing data in observational studies: the TARMOS framework. Int J Epidemiol. 2021;50(Supplement_1).

  59. Zhao Y, Long Q. Multiple imputation in the presence of high-dimensional data. Stat Methods Med Res. 2016;25(5):2021–35.

    Article  PubMed  Google Scholar 

  60. Aittokallio T. Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform. 2010;11(2):253–64.

    Article  CAS  PubMed  Google Scholar 

  61. White IR, Royston P, Wood AM. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30(4):377–99.

    Article  PubMed  Google Scholar 

  62. Cox TF, Cox M. Multidimensional Scaling. Boca Raton: Chapman & Hall/CRC; 2001.

    Book  Google Scholar 

  63. Torgerson WS. Multidimensional Scaling I: Theory and Method. Psychometrika. 1952;17:401–19.

    Article  Google Scholar 

  64. Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–38.

    Article  Google Scholar 




Acknowledgements

We thank Milena Schwotzer for administrative assistance.


The views expressed in this paper do not necessarily represent the views or policies of the National Cancer Institute, the National Institutes of Health, or the U.S. Department of Health & Human Services.


Funding

Open Access funding provided by the National Institutes of Health (NIH). WS was partially supported by grant SA580/10-1 from the German Research Foundation (DFG). FA was partially supported by the Italian Ministry of Education, University and Research project PRIN 2017, prot. 20178S4EK9_004. RDB was partially supported by the Norwegian Research Council research-based innovation center BigInsight, project no. 237718. ALB's group was partially supported by individual grants from the German Research Foundation (DFG, BO3139) and by the German Federal Ministry of Education and Research (BMBF) under grant no. 01IS18036A.

Author information

Authors and Affiliations




Contributions

JR, RDB, AB, FA, LL, ALB, EM, HB, SM, WS, and LS contributed to the conceptualization of the research, discussed the structure of the manuscript, conducted literature research, drafted the manuscript, and wrote text sections of the final manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Lisa McShane.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have given their consent for the publication of this manuscript.

Competing interests

The authors declare that they have no competing interests. E.M. is an employee of Société des Produits Nestlé SA.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
