Chapter 5 Hierarchical Clustering Methods

At this point, the focus shifts from reducing the dimensionality of the variables to reducing the dimensionality of observations, through the clustering of similar observations into a smaller number of groups. The chapters in this Part deal with classic clustering methods, which are inherently non-spatial. However, as before, a strong spatial perspective is maintained by linking the solutions to a geographical representation. Part III covers clustering methods where spatial contiguity of cluster members is enforced as a constraint.

In general terms, clustering methods group n observations into k clusters, such that both the intra-cluster similarity and the between-cluster dissimilarity are maximized. Equivalently, the objective can be viewed as minimizing intra-cluster dissimilarity and between-cluster similarity, the complement of the previous formulation.
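To make this objective concrete, the following minimal sketch (in Python, with made-up toy data, not taken from the text) computes the total within-cluster sum of squares as one common measure of intra-cluster dissimilarity. Since the total sum of squares around the grand mean is fixed for a given data set, minimizing the within-cluster component is equivalent to maximizing the between-cluster component.

```python
import numpy as np

# Hypothetical toy data: two tight groups of two observations each.
X = np.array([[1.0, 1.0], [1.2, 0.9],
              [5.0, 5.1], [4.8, 5.0]])
labels = np.array([0, 0, 1, 1])  # a candidate 2-cluster assignment

def within_cluster_ss(X, labels):
    """Total within-cluster sum of squares: a measure of intra-cluster dissimilarity."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # overall dispersion (fixed)
wss = within_cluster_ss(X, labels)            # intra-cluster dissimilarity
bss = total_ss - wss                          # between-cluster separation
print(f"WSS = {wss:.2f}, BSS = {bss:.2f}")
```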

The goal of clustering methods is thus to achieve compact groups of similar observations in multi-attribute space that are separated as much as possible from the other groups.¹

There are a very large number of clustering techniques and algorithms. These methods are standard tools of so-called unsupervised learning. They constitute a core element in any machine learning toolbox. Classic texts include Hartigan (1975), Jain and Dubes (1988), Kaufman and Rousseeuw (2005), and Everitt et al. (2011). A fairly recent historical overview of method development can be found in Jain (2010). In addition, excellent treatments of some of the more technical aspects are contained in Chapter 14 of Hastie, Tibshirani, and Friedman (2009), Chapters 10 and 11 of Han, Kamber, and Pei (2012), and Chapter 10 of James et al. (2013), among others.

Clustering methods can be organized along a number of different dimensions. A common distinction is between hierarchical methods, partitioning methods, density-based methods and grid-based methods (see, e.g., Han, Kamber, and Pei 2012, 448–50). In addition, there are model-based approaches developed in the statistical literature, such as Gaussian mixture models (GMM) and Bayesian clusters.

In this and the next three chapters, the essence behind the two most common approaches is covered, namely hierarchical methods and partitioning methods. In addition, a more recent approach based on spectral decomposition is treated as well. Model-based techniques are not considered, since they require a formal probabilistic model, which is less compatible with the exploratory mindset taken in these Volumes.

In addition, the discussion is limited to exact clustering methods. Fuzzy clustering techniques, i.e., methods that yield solutions in which an observation may belong to more than one cluster, are therefore not considered. In exact clustering, the clusters are both exhaustive and exclusive: every observation must belong to one cluster, and to only one cluster.

The topic of the current chapter is hierarchical clustering. Partitioning clustering is covered in the next chapter, more advanced techniques in a third, and a fourth chapter is devoted to spectral clustering.

Hierarchical clustering methods build up the clusters step by step. The number of clusters is not set a priori (in contrast to what is the case for partitioning methods), but is determined from the results at each step, which are visually represented in a tree structure, the so-called dendrogram.
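As a brief illustration of how the step-by-step merge history translates into a dendrogram, the sketch below uses scipy's hierarchical clustering routines on hypothetical random data; the data and the choice of Ward's method are assumptions made for the example, not part of the text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))     # hypothetical standardized attributes

Z = linkage(X, method="ward")    # agglomerative merge history, one row per merge
dendrogram(Z)                    # tree representation of the successive merges
plt.show()

# The number of clusters is decided after inspecting the tree,
# e.g., by cutting the dendrogram to obtain a four-cluster solution:
labels = fcluster(Z, t=4, criterion="maxclust")
```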

The successive steps can be approached in a top-down fashion or in a bottom-up fashion. The latter is referred to as agglomerative hierarchical clustering, the former as divisive clustering. In this chapter, only agglomerative methods are treated. A form of divisive clustering is implemented in some of the spatially constrained clustering techniques discussed in Chapter 10.

The chapter begins with a discussion of the concept of dissimilarity, which is essential to all clustering procedures. Next, agglomerative clustering is introduced and the importance of the so-called linkage function is outlined. The visual expression of the clusters is obtained by means of a dendrogram. Four important linkage functions are discussed: single linkage, complete linkage, average linkage and Ward's method.
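A minimal sketch of how the four linkage functions enter the computation, again assuming scipy and made-up data: each method starts from the same pairwise dissimilarities, but defines the dissimilarity between two clusters differently (nearest pair for single linkage, farthest pair for complete linkage, average over all pairs for average linkage, and the increase in within-cluster variance for Ward's method).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 2))   # hypothetical observations
d = pdist(X)                   # condensed matrix of pairwise Euclidean dissimilarities

# Same dissimilarities, four different cluster-to-cluster linkage rules:
for method in ("single", "complete", "average", "ward"):
    Z = linkage(d, method=method)
    print(f"{method:>8}: height of final merge = {Z[-1, 2]:.3f}")
```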

The methods are illustrated by means of the Chicago Community Areas sample data set, which contains several socio-economic indicators for 2020. This particular example was chosen because the 77 observations allow for an easy visual analysis of the dendrogram. With larger data sets, such as the Chicago SDOH sample data set (n = 791) considered in the other cluster chapters, it becomes difficult to visualize the details. This limitation of the dendrogram is one of the drawbacks of applying hierarchical clustering to large(r) data sets.


  1. While dimension reduction and clustering are treated here separately, so-called biclustering techniques group both variables and observations simultaneously. Although the idea is an old one, going back to an article by Hartigan (1972), these techniques have more recently gained a lot of interest in the field of gene expression analysis. For overviews, see, e.g., Tanay, Sharan, and Shamir (2004) and Padilha and Campello (2017).