
Table 14 Methods for estimation of the number of clusters: Scree plots, silhouette values

From: Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Scree plots

 One traditional approach to estimating the number of clusters is to construct a scree plot, which shows some measure of within-cluster variation on the y-axis against the number of clusters assumed when applying the algorithm on the x-axis. For hierarchical clustering, which does not require a priori specification of the number of clusters, a similar plot can be constructed by “cutting” the dendrogram at different levels corresponding to a range of numbers of clusters. The number of clusters is chosen by visual inspection, at the point where the line connecting the points shows a kink and beyond which adding further clusters yields only a diminished reduction in within-cluster variation. In HDD, noise accumulating over the variables, coupled with the lack of any guarantee that the algorithms identify the optimal clusterings, may lead to scree plots that give no strong indication of the number of clusters. Figure 13 [80] shows such a typical scree plot.
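
 As an illustration of how the values behind a scree plot can be obtained, the following minimal Python sketch runs a plain k-means procedure for several numbers of clusters and records the within-cluster sum of squares. Everything in it is an assumption made for illustration and is not taken from the article: the simulated low-dimensional data with three well-separated groups, the function names `squared_distance` and `kmeans_wcss`, the choice of k-means with random restarts, and the within-cluster sum of squares as the measure of within-cluster variation.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wcss(points, k, n_starts=30, max_iter=100, seed=0):
    """Smallest within-cluster sum of squares found by k-means over several random starts.

    Hypothetical helper: k-means only finds local optima, so several
    random starts are tried and the best result is kept.
    """
    rng = random.Random(seed)
    best = float("inf")
    for _start in range(n_starts):
        centers = rng.sample(points, k)  # k distinct observations as initial centers
        for _step in range(max_iter):
            # Assignment step: each point joins the cluster of its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
                clusters[nearest].append(p)
            # Update step: each center moves to the mean of its cluster
            # (an empty cluster keeps its previous center).
            new_centers = [
                tuple(sum(coord) / len(members) for coord in zip(*members))
                if members else centers[j]
                for j, members in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wcss = sum(min(squared_distance(p, c) for c in centers) for p in points)
        best = min(best, wcss)
    return best


# Simulated example data (an assumption): three groups of 30 points in two dimensions.
data_rng = random.Random(1)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
    for cx, cy in blob_centers
    for _ in range(30)
]

# Values for the y-axis of the scree plot, one per assumed number of clusters.
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 7)}

# Crude text version of the scree plot; the kink is where the bars stop shrinking quickly.
largest = wcss_by_k[1]
for k, wcss in wcss_by_k.items():
    print(f"k={k}  wcss={wcss:9.1f}  " + "#" * round(40 * wcss / largest))
```

 With such clearly separated groups the kink should be easy to see at three clusters. In high-dimensional data with accumulated noise, the same computation may produce a curve without any clear kink.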

Silhouette values

 Silhouette values are numerical tools for estimating the number of clusters [81]. The silhouette value of a single observation measures how well the observation fits its assigned cluster by comparing its average similarity to the members of its own cluster with its average similarity to the members of the next best cluster. It is scaled such that the value 1 corresponds to an optimal fit (similarities to members of its own cluster extremely large compared to the next best cluster) and −1 to the worst case (similarities to members of its own cluster extremely small compared to the best other cluster). The average silhouette width (asw) is then defined as the average of all single silhouette values and quantifies the quality of the clustering result. Comparing the asw of clusterings with different numbers of clusters and choosing the number with the largest asw yields an estimate of the number of clusters. The asw requires no distributional assumptions for the data. In contrast, distribution-based clustering typically requires so-called information criteria for selecting the number of clusters. These balance the coherence of the clusters (as large as possible) and the number of clusters (as small as possible). Figure 14 [82] shows a silhouette plot that visualizes the silhouette values of observations that were grouped into four clusters.
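
 The computation of silhouette values can be sketched in a few lines of Python. The sketch below assumes the common distance-based formulation, in which a small average distance plays the role of a large average similarity: for each observation, a is the average distance to the other members of its own cluster, b is the smallest average distance to the members of any other cluster, and the silhouette value is (b − a) / max(a, b). The Euclidean distance, the function name `silhouette_values`, the value 0 for single-member clusters and the toy data are illustrative assumptions and are not taken from the article.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_values(points, labels):
    """Silhouette value of every observation (requires at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    values = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for single-member clusters
            continue
        # a: average distance to the other members of the own cluster
        # (the distance of p to itself is 0, hence the divisor len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: average distance to the members of the next best cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def average_silhouette_width(points, labels):
    """Average of all single silhouette values (asw)."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Toy data (an assumption): two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

asw_good = average_silhouette_width(points, good_labels)
asw_bad = average_silhouette_width(points, bad_labels)
print(f"asw for the matching grouping: {asw_good:.2f}")
print(f"asw for the mixed grouping:    {asw_bad:.2f}")
```

 To estimate the number of clusters in this way, the asw would be computed for the clusterings obtained with each candidate number of clusters and the candidates compared.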