Chapter 1 Introduction

This second volume in the Introduction to Spatial Data Science is devoted to the topic of spatial clustering. More specifically, it deals with the grouping of observations into a smaller number of clusters, which are designed to be representative of their members. The techniques considered constitute an important part of so-called unsupervised learning in modern machine learning. Purely statistical methods to discover spatial clusters in data are beyond the scope of this volume.

In contrast to Volume 1, which assumed very little prior (spatial) knowledge, the current volume is somewhat more advanced. At a minimum, it requires familiarity with the scope of the exploratory toolbox included in the GeoDa software. In that sense, it clearly builds upon the material covered in Volume 1. Important principles that are a main part of the discussion in Volume 1 are assumed to be known. These include linking and brushing, the various types of maps and graphs, spatial weights, and spatial autocorrelation statistics.

Much of the material covered in this volume pertains to methods that have been incorporated into the GeoDa software only in the past few years, so as to support the second part of an Introduction to Spatial Data Science course sequence. The particular perspective offered is the tight integration of the clustering results with a spatial representation, through customized cluster maps and by exploiting linking and brushing.

The treatment is slightly more technical than in the previous volume, but the mathematical details can readily be skipped if the main interest is in application and interpretation. Necessarily, the discussion relies on somewhat more formal concepts. Some examples are the treatment of matrix eigenvalues and matrix decomposition, the concept of the graph Laplacian, essentials of information theory, elements of graph theory, advanced spatial data structures such as the quadtree and the vantage point tree, and optimization algorithms like gradient search, iterative greedy descent, simulated annealing and tabu search. These concepts are not assumed known, but will be explained in the text.

While many of the methods covered constitute part of mainstream data science, the perspective offered here is distinctive in its consistent attempt at spatializing the respective methods. In addition, the treatment of spatially constrained clustering introduces contiguity as an additional element into clustering algorithms.

Most methods discussed are familiar from the literature, but some are new. Examples include the common coverage percentage, a local measure of goodness of fit between distance-preserving dimension reduction methods; two new spatial measures to assess cluster quality, i.e., the join count ratio and the cluster match map; a heuristic to obtain contiguous results from classic clustering results; and a hybrid approach towards spatially constrained clustering, whereby the outcome of a given method is used as the initial feasible region in a second method. These techniques emerged from refinements of the software and of the presentation of cluster results, and have not been published previously. In addition, the various methods to spatialize cluster results are mostly also unique to the treatment in this volume.

As in Volume 1, the coverage here also constitutes the definitive user’s guide to the GeoDa software, complementing the previous discussion.

In the remainder of this introduction, I provide a broad overview of the organization of Volume 2, followed by a listing of the sample data sets used. As was the case for Volume 1, these data sets are included as part of the GeoDa software and do not need to be downloaded separately. For a quick tour of the GeoDa software, I refer the reader to the Introduction of Volume 1.