Table 13 Methods for cluster analysis: Hierarchical clustering, k-means, PAM

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Hierarchical clustering

Hierarchical clustering is a popular class of clustering algorithms, mostly used in an agglomerative version, where initially all objects are assigned to their own cluster, and then, iteratively, the two most similar clusters are joined, representing a new node of the clustering tree [72]. The similarities between the clusters are recalculated, and the process is repeated until all observations are in the same cluster. The distance metric used for comparing two individual objects is specified by the researcher. For defining distances between two clusters of objects, there are also several options. In hierarchical clustering, the approach for measuring between-cluster distance is referred to as the linkage method. Single linkage specifies the distance between two clusters as the closest distance between objects from the two clusters; average linkage calculates the mean of those distances, and complete linkage specifies the largest distance. Single linkage has the disadvantage that it tends to generate long, thin clusters, whereas complete linkage tends to yield clusters that are more compact, and average linkage typically produces clusters with compactness somewhere in between. Hierarchical clustering results are often displayed in a tree-like structure called a dendrogram. A dendrogram is viewed from the bottom up: each object begins in its own cluster as the terminal end of a branch and is eventually merged with other objects as clusters are formed climbing up the branches of the tree toward the root, where all objects are combined into one cluster. The heights in the tree at which clusters are merged correspond to the between-cluster distances, and cutting the tree at a particular height defines a number of clusters. Although the hierarchical structure displayed in the dendrogram may seem appealing, it should be interpreted with caution, as enforcing a flattened tree structure can incur substantial information loss. Figure 11 [73] shows an example of a dendrogram resulting from hierarchical clustering.
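As an illustrative sketch (not part of the original article), the following Python snippet performs agglomerative hierarchical clustering with SciPy on a small toy data set, using average linkage and Euclidean distance, and then cuts the tree into three clusters; the toy data and the choice of three clusters are arbitrary and purely for demonstration.

```python
# Minimal sketch of agglomerative hierarchical clustering with SciPy
# (toy data and number of clusters chosen arbitrarily for illustration).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # 20 objects described by 5 features

# Build the clustering tree; "average" linkage uses the mean of all pairwise
# Euclidean distances between the objects of two clusters.
Z = linkage(X, method="average", metric="euclidean")

# "Cut the tree" to obtain a flat partition into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram(Z) would draw the tree; the merge heights in the plot
# correspond to the between-cluster distances.
```

Replacing method="average" with "single" or "complete" switches between the linkage methods discussed above.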

k-means

A popular partitioning clustering algorithm is k-means [74]. In its traditional implementation, the researcher must specify the number of clusters. First, random objects are chosen as initial centroids for the clusters. The algorithm then proceeds by iterating between two steps: (i) comparing each observation to the mean of each cluster (centroid) and assigning it to the cluster for which the squared Euclidean distance from the observation to the cluster centroid is minimized, and (ii) recalculating the cluster centroids based on the current cluster memberships. The iterative process continues until no observations are reassigned. k-means is not guaranteed to converge to the optimal cluster assignment that minimizes the sum of within-cluster variances, and it can be strongly influenced by the selected number of clusters and the initial cluster centroids. Nonetheless, it is a relatively simple algorithm to understand and implement, and it is widely used. Figure 12 [75] visualizes the k-means algorithm with an example.
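To make the two alternating steps concrete, here is a minimal from-scratch sketch in Python; the function name kmeans, the toy data, and all parameter choices are introduced here purely for illustration, and in practice one would typically use an established implementation such as the one in scikit-learn.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate (i) assignment and (ii) centroid update."""
    rng = np.random.default_rng(seed)
    # Initialization: k randomly chosen observations serve as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # (i) assign each observation to the centroid with the smallest
        #     squared Euclidean distance
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                     # no observation was reassigned: stop
        labels = new_labels
        # (ii) recompute each centroid as the mean of its current members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in (0.0, 3.0, 6.0)])
labels, centroids = kmeans(X, k=3)
print(centroids)
```

Because the result depends on the initial centroids, implementations typically run the algorithm from several random starts and keep the solution with the smallest sum of within-cluster variances.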

PAM

Several important extensions and generalizations of k-means have been developed. PAM (partitioning around medoids, [76]) allows arbitrary distances to be used instead of the Euclidean distance, and instead of mathematically calculated centroids, actual observations are selected as prototypes of the clusters. The algorithm iteratively improves a starting solution with respect to the sum of distances of all observations to their corresponding prototypes, until no improvement can be obtained by replacing a current prototype with another observation.
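The following sketch illustrates the PAM idea in Python on a precomputed dissimilarity matrix (here Manhattan distances, which plain k-means does not support); the function pam and all choices below are a hypothetical simplification of the original BUILD/SWAP algorithm of [76], not a faithful implementation of it.

```python
import numpy as np

def pam(D, k, seed=0):
    """Simplified PAM-style search on a precomputed dissimilarity matrix D.

    D may contain any dissimilarity, and the cluster prototypes (medoids)
    are actual observations; this is an illustrative sketch rather than an
    optimized implementation.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Each observation contributes its distance to its closest medoid.
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        # Try to replace one current medoid with a non-medoid observation.
        for i in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = candidate
                if total_cost(trial) < total_cost(medoids):
                    medoids = trial
                    improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Toy example with Manhattan distances between observations.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(20, 2)) for m in (0.0, 4.0)])
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
medoids, labels = pam(D, k=2)
print(medoids, labels)
```

Ready-made implementations are also available, for example the KMedoids estimator in the scikit-learn-extra package.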