Skip to main content

Table 5 Methods for graphical displays: Principal component analysis (PCA), Biplot

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Principal component analysis (PCA)

 The basic idea of PCA [38] is to transform the (possibly correlated) variables into a set of linearly uncorrelated variables as follows. The first variable (first principal component) is constructed to capture as much of the total multidimensional variability in the data as possible, the second variable is uncorrelated with the first and maximizes capture of the residual variability (i.e., variability not already captured by the first principal component), and so on. Each principal component is a linear combination of the original variables. The result is a set of uncorrelated variables of decreasing importance, in the sense that the variables are ranked from the most informative (the first principal component, i.e., the one with the highest variance) to the least informative (i.e., the one with the lowest variance). The positions of each observation in the new coordinate system of principal components are called scores, and the loadings indicate how strongly the variables contribute to each PC. A major portion of the total variation in the data is often captured by the first few principal components alone, which are the only ones retained for the further analysis. Use of principal components greatly reduces the dimension of the data typically without losing much information (with respect to variability in the data). In the context of IDA, often the first two principal components are plotted to inspect for peculiarities in the data. Figure 6 [39] shows a PCA plot constructed from high-dimensional gene expression profiles generated from analysis of lymphoma specimens

Biplot

 Biplots, introduced by Gabriel [40], are designed to show PCs’ contributions with regard to both observations and variables. In a biplot, both the principal component scores and loadings are plotted together. The most common biplot is a two- or three-dimensional representation, where any two (or three) PCs of interest are used as the axes. Since often most of the variation in the data is explained by the first few PCs, it usually suffices to concentrate on plotting those. The biplot allows identifying samples that are “different” from the majority of samples, and at the same time, it illustrates nicely where these differences occur, i.e., for which variables the samples show different values. Figure 7 [39] shows a biplot for the same data used for the PCA plot