Chapter 6 Partioning Clustering Methods

The principle behind partitioning clustering methods is to assign each observation to one out of \(k\) clusters. The objective is the same as that of hierarchical clustering, except that the number of clusters (traditionally referred to as \(k\)) is pre-specified. For a given \(k\), the optimal assignment of observations to clusters is obtained such that, as before, the similarity within each cluster and the dissimilarity between clusters are maximized.

In the current chapter, K-means clustering is covered, the most familiar example of a partitioning method. More advanced partitioning methods are considered in Chapter 7.

The earliest solution of the K-means problem is commonly attributed to Lloyd (1982) and referred to as Lloyd’s algorithm (the algorithm was first contained in a Bell Labs technical report by Lloyd from 1957). It implements a so-called iterative relocation, i.e., the gradual improvement from an initial solution obtained by moving (relocating) observations to a different cluster. The K-means algorithm can be considered to be a special case of the EM (expectation-maximization) algorithm of Dempster, Laird, and Rubin (1977). The expectation step then consists of allocating each observation to its nearest cluster center, and the maximization step is the recalculation of those cluster centers for each new layout (Han, Kamber, and Pei 2012). While the progression of the iterative relocation is fairly straightforward, the choice of the initial starting point is not. It requires a careful sensitivity analysis. An extensive evaluation of K-means on a number of benchmark data sets is contained in Fränti and Sieranoja (2018).

The chapter begins with a step-by-step illustration of the K-means algorithm. A discussion of initial starting points leads to coverage of the K-means++ approach. The implementation in GeoDa is outlined, with a discussion of the interpretation, visualization and spatialization of cluster results, options and sensitivity analysis, and the use of cluster categories as variables.

The K-means method is illustrated with a subset of data on socio-economic determinants of health for 791 census tracts in Chicago contained in the Chicago SDOH sample data set.