12.4 Beyond Clustering

The combination of dimension reduction in the variable space and the grouping of observations into clusters can provide very powerful insights into the structure of multivariate data. However, similar to the outcomes of a data exploration discussed in Volume 1, the ultimate results do not provide explanations, nor do they suggest particular spatial processes. In essence, a clustering exercise results in a simplification of the original complex data structure in order to facilitate further analysis. As pointed out in several places in this volume, any such clustering exercise needs to be carried out with caution. A mechanical approach that relies on the default values in the software must be avoided.

Of course, this leaves the question of how to select the proper combination of algorithms, parameters and other tuning factors. Unfortunately, there is no hard and fast rule to guide these choices. To some extent, it depends on the goal of the analysis. For example, sometimes it is required to have spatially contiguous observations in each cluster, e.g., as a legal requirement for electoral redistricting purposes or for solutions to a location-allocation problem. Obviously, in such instances, the classic non-spatial clustering techniques fall short. On the other hand, when there is no such strict requirement, the loss in fit due to the constrained spatial solution is something that needs to be carefully evaluated in each particular instance. It should be kept in mind that, for the same objective function, the best unconstrained solution fits at least as well as any constrained one, so a sole focus on fit will favor non-spatial approaches.
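The loss in fit can be made concrete with a small numerical illustration. The sketch below is not tied to any particular software: the function name `between_to_total_ratio` and the six data values are hypothetical, chosen only to contrast an unconstrained grouping with a contiguous one for areas arranged along a line. Fit is measured as the ratio of the between-cluster to the total sum of squares.

```python
from collections import defaultdict


def between_to_total_ratio(rows, labels):
    """Between-cluster over total sum of squares (higher means better fit)."""
    n = len(rows)
    k = len(rows[0])
    # Total sum of squares around the overall mean of each variable.
    mean = [sum(r[j] for r in rows) / n for j in range(k)]
    tss = sum((r[j] - mean[j]) ** 2 for r in rows for j in range(k))
    # Collect the observations that belong to each cluster.
    groups = defaultdict(list)
    for r, lab in zip(rows, labels):
        groups[lab].append(r)
    # Within-cluster sum of squares around each cluster center.
    wss = 0.0
    for members in groups.values():
        center = [sum(r[j] for r in members) / len(members) for j in range(k)]
        wss += sum((r[j] - center[j]) ** 2 for r in members for j in range(k))
    return 1.0 - wss / tss


# Hypothetical data: six areas in a row (area i borders area i + 1),
# with a single variable that alternates between low and high values.
rows = [(1.0,), (9.0,), (2.0,), (8.0,), (1.0,), (9.0,)]

unconstrained = [0, 1, 0, 1, 0, 1]  # groups similar values, ignores location
contiguous = [0, 0, 0, 1, 1, 1]     # each cluster is an unbroken run of areas

fit_unconstrained = between_to_total_ratio(rows, unconstrained)
fit_contiguous = between_to_total_ratio(rows, contiguous)
print(round(fit_unconstrained, 3), round(fit_contiguous, 3))
```

In this deliberately extreme case, the contiguous grouping retains only a small fraction of the fit of the unconstrained one. With real data the gap is usually less dramatic, but it is this difference that needs to be weighed against the value of having spatially compact clusters.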

There is no general guidance for the choice between spatial and a-spatial solutions. A case in point is the delineation of so-called housing submarkets in urban studies, i.e., clusters of similar housing units used to explain variations in their value. Part of the literature argues that the main objective should be to maximize the similarity of housing units in a given submarket, whereas another part of the literature insists on imposing a spatial contiguity constraint (for an extensive discussion, see Anselin and Amaral 2023).

Sometimes, the construction of clusters is an objective in itself, such as in the redistricting example mentioned earlier. However, many times the goal is to simplify the complex multivariate structure of the data to provide a starting point for further analysis. For example, this is the case when statistical or privacy concerns require a minimum size for the denominator in rates, such as for rates of rare diseases. A common solution is to carry out a constrained spatial clustering exercise, which guarantees a minimum population at risk. As several of the examples covered illustrate, such an approach typically does not yield a unique solution. Again, careful sensitivity analysis is needed to assess how the various design decisions affect the ultimate conclusions (e.g., about the location of hot spots or cold spots).
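The lack of a unique solution under a minimum population constraint can be illustrated with a toy example. The sketch below is a simple greedy sweep over areas arranged along a line, not one of the spatially constrained clustering algorithms covered in this volume. The function name `sweep_regions`, the population counts, the case counts and the threshold of 200 are all hypothetical.

```python
def sweep_regions(population, min_pop):
    """Group consecutive areas until each region reaches min_pop.

    Areas are assumed to form a chain (area i borders area i + 1), so any
    run of consecutive indices is contiguous. Leftover areas at the end
    are merged into the last region. If the total population is below
    min_pop, a single region that violates the threshold is returned.
    """
    regions, current, running = [], [], 0
    for i, pop in enumerate(population):
        current.append(i)
        running += pop
        if running >= min_pop:
            regions.append(current)
            current, running = [], 0
    if current:
        if regions:
            regions[-1].extend(current)
        else:
            regions.append(current)
    return regions


# Hypothetical population at risk and case counts for eight areas.
population = [120, 80, 300, 50, 60, 40, 500, 30]
cases = [1, 0, 2, 0, 1, 0, 3, 0]
last = len(population) - 1

# Sweep from the first area to the last.
regions = sweep_regions(population, 200)
region_pop = [sum(population[i] for i in r) for r in regions]
region_rate = [sum(cases[i] for i in r) / p for r, p in zip(regions, region_pop)]

# Sweep in the opposite direction, then map back to the original indices.
regions_rev = [
    sorted(last - i for i in r)
    for r in sweep_regions(population[::-1], 200)
]

print(regions, region_pop)
print(regions_rev)
```

Both sweeps satisfy the same minimum population requirement, yet they assign the areas to different regions, and hence yield different regional rates. Comparing such alternative solutions is one simple form of the sensitivity analysis called for above.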

In the spatial analysis literature, clustering is often seen as a way to address the modifiable areal unit problem or MAUP (see Chapter 21 in Volume 1). When elemental units are available (e.g., individual housing units in the housing market example), a clustering approach allows for the grouping of observations according to certain objective functions. To some extent, this avoids the arbitrariness of the scale and spatial arrangement of administrative aggregate units, but it also introduces a degree of uncertainty related to the choice of algorithm, tuning parameters, etc., mentioned before.

In a similar vein, clustering methods can be used to address the problem of spatial heterogeneity in spatial regression analysis (Anselin and Rey 2014). Such heterogeneity pertains to structural breaks in the data that result in separate models and/or separate model coefficients for spatial subsets of the data, referred to as spatial regimes. The grouping of observations into regimes is a particular application of spatially constrained clustering. Tight integration between the estimation of model parameters and the delineation of spatial regimes is achieved in models of endogenous spatial regimes, a subject of active ongoing research (Anselin and Amaral 2023).