Chapter 21 Postscript - The Limits of Exploration

In this first volume, devoted to exploring spatial data, I have outlined a progression of techniques to aid in finding interesting patterns of dependence and heterogeneity in geospatial data, supported by the GeoDa software. The highly interactive process of discovery moves from simple maps and graphs to a more structured visualization of clusters, outliers and indications of potential structural breaks.

The main topic in this volume is the investigation of spatial autocorrelation, i.e., the match or tension between attribute similarity and locational similarity. In that quest, the primary focus is on identifying the locations of univariate clusters and spatial outliers by means of a range of local indicators of spatial association (LISA).

For a single variable, Tobler’s first law of geography suggests the omnipresence of positive spatial autocorrelation, with a distance decay effect for the dependence. This is indeed observed for many variables in empirical applications. However, Tobler’s law does not necessarily generalize to a bivariate or multivariate setting, where the situation is considerably more complex (Anselin and Li 2020). As discussed in more detail in Chapter 18, in a multivariate setting, not only is there the tension between attribute and locational similarity, but there is an additional dimension of inter-attribute similarity, which is typically not uniform across space.

The GeoDa software is designed to make this process of discovery easy and intuitive. However, this may also unintentionally facilitate aimless clicking through the range of interfaces until one finds an outcome one likes, without a full understanding of the limitations of the techniques involved. The extensive discussion of the methods in this book is aimed at remedying the latter problem. Nevertheless, some caution is needed.

First and foremost, exploration is not the same as explanation. Interesting patterns may be discovered, but that does not necessarily identify the process(es) that yielded the patterns. As stated strongly by Heckman and Singer (2017): “data never speak for themselves.” As argued in the discussion of global spatial autocorrelation in Section 13.5.3, spatial data are characterized by the inverse problem, in the sense that the same pattern may be generated by very different spatial processes. More specifically, there is an important distinction between true contagion and apparent contagion. In cross-sectional data, both types of processes yield a clustered pattern. However, the pattern that follows from apparent contagion is generated by a spatially heterogeneous process and not a dependent process, as is typically assumed. Without further information, such as a time dimension, it is impossible to distinguish between the two generating processes.
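To make the inverse problem concrete, the following sketch (illustrative only, not taken from the book; it assumes the Python packages numpy, libpysal and esda are available) simulates two very different processes on a regular grid: a spatially heterogeneous process with independent errors around a regional mean shift, and a genuinely dependent spatial autoregressive process. Both tend to yield a significant positive Moran's I in a single cross-section.

```python
# Illustrative simulation (not from the book): two very different data
# generating processes on a 20 x 20 grid, both of which tend to produce
# a significant positive Moran's I in a single cross-section.
# Assumes the PySAL packages libpysal and esda are installed.
import numpy as np
from libpysal.weights import lat2W
from esda.moran import Moran

rng = np.random.default_rng(12345)
side = 20
n = side * side
w = lat2W(side, side, rook=True)   # rook contiguity on the grid
w.transform = "r"                  # row-standardized weights

# Apparent contagion: independent errors around a regional mean shift
# (the eastern half of the grid has a higher mean)
means = np.where(np.arange(n) % side < side // 2, 0.0, 2.0)
y_heterogeneous = means + rng.standard_normal(n)

# True contagion: spatial autoregressive process y = (I - rho W)^(-1) e
rho = 0.7
W_dense = w.full()[0]
y_dependent = np.linalg.solve(np.eye(n) - rho * W_dense,
                              rng.standard_normal(n))

for label, y in [("heterogeneous (apparent)", y_heterogeneous),
                 ("dependent (true)", y_dependent)]:
    mi = Moran(y, w, permutations=999)
    print(f"{label}: I = {mi.I:.3f}, pseudo p = {mi.p_sim:.3f}")
```

Faced with either pattern, the global statistic flags clustering, but it is silent about which of the two generating processes produced it.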

The interpretation and validation of outcomes of an exploratory (spatial) data analysis has been the subject of growing discussion in the literature, as reviewed in Section 4.2.1. The exploratory process is neither inductive nor deductive, but rather abductive, involving a move back and forth between data analysis, hypothesis generation and reformulation, as well as the addition of new information, an approach sometimes referred to as the “Sherlock Holmes method” (Gahegan 2009; Heckman and Singer 2017).

How such a discovery process is carried out has become increasingly important with the advent of big data, and the associated argument for data-driven discovery as the fourth paradigm in scientific reasoning (see, e.g., Hey, Tansley, and Tolle 2009; Gahegan 2020).

In this final chapter, I want to briefly consider three broader issues that run the risk of getting lost in the excitement of the discovery process driven by interactive software such as GeoDa: (1) the potential pitfalls of data science and their implications for scientific reasoning; (2) the limitations intrinsic to spatial analysis; and (3) the challenge of reproducible research.

A critical aspect of the abductive approach is how to deal with surprising results, or with results that run counter to pre-conceived ideas. As outlined in detail in Nuzzo (2015), among others, it is easy to fool oneself by finding patterns where there are none, by focusing on explanations that fit one’s prior convictions, or by failing to consider potential alternative hypotheses. This can result in confirmation bias (i.e., finding what one sets out to find), as well as in disconfirmation bias (the tendency to reject results that run counter to one’s priors). In GeoDa, the extensive use of the permutation approach to represent spatial randomness in the data is one way to partially address this concern, but by itself it is insufficient.
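To illustrate what the permutation approach entails, the sketch below (a simplified stand-alone version in Python, not GeoDa’s actual implementation) computes a pseudo p-value for the Local Moran at a single location by means of conditional permutation: the value at the location itself is held fixed, while the values at its neighbors are repeatedly replaced by random draws from the remaining observations.

```python
# Simplified stand-alone sketch of conditional permutation for the Local
# Moran at a single location i (not GeoDa's actual implementation): the
# value at i is held fixed, while the values observed at i's neighbors
# are repeatedly replaced by random draws from the remaining locations.
import numpy as np

def local_moran_pseudo_p(z, neighbors, i, permutations=999, seed=12345):
    """Pseudo p-value for the Local Moran at location i.

    z         : standardized variable (1-d numpy array)
    neighbors : dict mapping each location id to a list of neighbor ids
    """
    rng = np.random.default_rng(seed)
    k = len(neighbors[i])
    observed = z[i] * np.mean(z[neighbors[i]])   # row-standardized spatial lag
    others = np.delete(z, i)                     # all values except the one at i
    reference = np.empty(permutations)
    for p in range(permutations):
        draw = rng.choice(others, size=k, replace=False)
        reference[p] = z[i] * draw.mean()
    # share of permuted statistics at least as extreme as the observed one;
    # one common convention, details differ across implementations
    extreme = np.sum(np.abs(reference) >= abs(observed))
    return (extreme + 1) / (permutations + 1)

# example: five locations on a line, each adjacent to its neighbors
z = np.array([1.2, 0.8, 1.5, -0.3, -1.1])
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(local_moran_pseudo_p(z, nbrs, i=2))
```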

Even without nefarious motivations (e.g., driven by the pressure to publish), problems such as p-hacking (searching until significant results are found), HARKing (hypothesizing after the results are known), JARKing (justifying after the results are known), and the like can unwittingly infiltrate an exploratory spatial data analysis. This has led to extensive discussions in the literature as to how these issues affect the process of scientific discovery (among others, Kerr 1998; Simmons, Nelson, and Simonsohn 2011; Gelman and Loken 2014; Gelman and Hennig 2017; Rubin 2017). Sound scientific reasoning entails a systematic and persistent search for what might be wrong, an iterative process that moves back and forth between potential explanations and evidence.¹ Ideally, the process of data exploration should be guided by these principles.

A critical notion in this regard is that of researcher degrees of freedom, a reference to the many decisions one makes with respect to the data that are included, the hypotheses considered (and not considered), and the methods selected, in the so-called garden of forking paths (Gelman and Loken 2014). In the context of the methods considered here and implemented in GeoDa, aspects such as the selection of spatial scale, how to deal with outliers, whether to apply imputation for missing values, the choice of spatial weights, the treatment of unconnected observations (islands or isolates), and various tuning parameters are prime examples of decisions that may affect the outcome of the data exploration. Careful attention to these decisions, ideally accompanied by a sensitivity analysis, can partially remedy the problem.
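As a concrete illustration of such a sensitivity analysis, the sketch below (assuming a polygon GeoDataFrame read from a hypothetical "tracts.shp" with a hypothetical variable "rate"; it relies on the geopandas, libpysal and esda packages, whose exact signatures may vary across versions) re-runs a Local Moran under several spatial weights specifications and retains only the locations flagged as significant under all of them.

```python
# Sketch of a simple sensitivity check over the choice of spatial weights
# (file name and variable are hypothetical): re-run the Local Moran under
# several weights specifications and keep only the locations flagged as
# significant under all of them.
import geopandas as gpd
import numpy as np
from libpysal.weights import Queen, Rook, KNN
from esda.moran import Moran_Local

gdf = gpd.read_file("tracts.shp")
y = gdf["rate"].to_numpy()

specs = {
    "queen": Queen.from_dataframe(gdf),
    "rook": Rook.from_dataframe(gdf),
    "knn6": KNN.from_dataframe(gdf.set_geometry(gdf.centroid), k=6),
}
flags = {}
for name, w in specs.items():
    w.transform = "r"
    lm = Moran_Local(y, w, permutations=999, seed=12345)
    flags[name] = lm.p_sim < 0.05      # significant at pseudo p < 0.05

# locations flagged under every specification are the most robust findings
stable = np.logical_and.reduce(list(flags.values()))
print(f"stable cluster cores: {stable.sum()} of {len(y)} observations")
```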

Finally, like any form of data science, spatial data science may suffer from well-known generic pitfalls, for example as outlined in the book by Smith and Cordes (2019). Their book specifically lists “using bad data, putting data before theory, worshipping math, worshipping computers, torturing data, fooling yourself, confusing correlation with causation, regression towards the mean, and doing harm” as examples of such pitfalls. Any serious student of exploratory spatial data analysis should become familiar with these potential traps and learn to recognize and avoid them.

In addition to these aspects associated with data science in general, spatial data science also faces its own special challenges. These include the ecological fallacy (Robinson 1950; Goodman 1953, 1959), the modifiable areal unit problem, or MAUP (Openshaw and Taylor 1979), the change of support problem (Gotway and Young 2002), as well as the more general issue of the importance of spatial scale (Goodchild 2011; Oshan et al. 2022). Many of these are special cases of well-known challenges associated with any type of statistical analysis. For example, the ecological fallacy was first raised in sociology and it cautions against interpreting the results of aggregate (spatial) analysis to infer individual behavior. The change of support problem concerns the combination of data at various spatial scales, such as point data (e.g., measurements by environmental sensors) and areal data (e.g., health outcomes at the census tract level), and the associated issues of data aggregation and imputation, concerns shared with the broader attention to scale in geographical analysis.

The MAUP stands out as particular to spatial analysis. It involves both data aggregation (scale) and spatial arrangement, or zonation. In essence, the MAUP implies that different spatial scales and different areal boundaries will yield different, and sometimes conflicting, statistical results. In many instances in the social sciences, the boundaries are pre-set (e.g., administrative districts) and there is little one can do about them, other than being careful in phrasing the findings of an analysis. This is particularly important when areal boundaries do not align with the spatial scale of the processes investigated. For example, in the U.S., census tracts are typically assumed to correspond to neighborhoods, even though this is seldom the case. Similarly, counties are identified with labor markets, but this is clearly invalid in metropolitan areas (consisting of multiple counties) or in the sparsely populated large counties of the West.
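The statistical core of the MAUP can be illustrated without any map at all. In the sketch below (simulated data, for illustration only; real zonations are spatial, but the aggregation effect is the same), the same individual-level observations, grouped under two different zonations, produce very different correlation estimates.

```python
# Illustrative simulation (aspatial, for simplicity) of the statistical
# core of the MAUP: the same individual-level observations, aggregated
# under two different zonations, yield very different correlations.
import numpy as np

rng = np.random.default_rng(12345)
n = 10_000
x = rng.standard_normal(n)
y = 0.3 * x + rng.standard_normal(n)   # weak individual-level association

def aggregated_corr(x, y, zone_ids):
    """Correlation between the zone means of x and y."""
    zones = np.unique(zone_ids)
    mx = np.array([x[zone_ids == z].mean() for z in zones])
    my = np.array([y[zone_ids == z].mean() for z in zones])
    return np.corrcoef(mx, my)[0, 1]

# two different zonations of the same 10,000 observations into 50 zones
zonation_a = rng.integers(0, 50, size=n)   # random assignment
ranks = np.argsort(np.argsort(x))          # rank of each value of x
zonation_b = ranks // (n // 50)            # zones grouped on x

print(f"individual level: r = {np.corrcoef(x, y)[0, 1]:.2f}")
print(f"zonation A:       r = {aggregated_corr(x, y, zonation_a):.2f}")
print(f"zonation B:       r = {aggregated_corr(x, y, zonation_b):.2f}")
```

Grouping the observations on the value of x drives the aggregate correlation toward one, even though the individual-level association is weak; actual spatial zonations fall between these extremes, depending on how strongly the boundaries sort the underlying values.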

In terms of spatial exploration, this means that the discovery of clusters needs to be interpreted with caution. A cluster may be nothing more than an artifact of a poorly matched spatial scale (e.g., spatial units of observation much smaller than the process of interest), especially when the size of the areal units is heterogeneous across the data set. On the other hand, when individual spatial observations are available, the regionalization methods from Volume 2 may be applied to group them into meaningful spatial aggregates.

The MAUP and its associated challenges do not invalidate spatial analysis, but they may limit the relevance of some of the findings. Where possible, a sensitivity analysis should be implemented.

A second type of limit to spatial analysis follows from the enormous advances made in machine learning and artificial intelligence. For example, deep learning techniques (e.g., Goodfellow, Bengio, and Courville 2016) are able to identify patterns without much prior (spatial) structure, as long as they are trained on very large data sets. GeoAI pertains to the application of these new methods to the analysis of geospatial data (Janowicz et al. 2020). Most current applications have been in the physical domain, such as land cover classification by means of remote sensing, landscape feature extraction and wayfinding. The focus in GeoAI has been on so-called feature engineering, i.e., identifying those spatial characteristics that should be included in the training data sets to yield good predictive performance. To date, much of GeoAI can be viewed as applying data science to spatial data, rather than as spatial data science as conceived of in this book. Nevertheless, this raises an important question about the relevance of spatial constructs and spatial thinking in AI. Does this imply that spatial analysis as considered here will become irrelevant?

While it may be tempting to reach this conclusion, the powerful new deep learning methods are not a panacea for all empirical situations. Specifically, in order to perform well, these techniques require very large training data sets, consisting of millions (and even billions) of data points. Such data sets are still rather rare in the contexts encountered in empirical practice, especially in the social sciences. In addition, the objective in exploratory spatial data analysis is to discover patterns and generate hypotheses, not prediction, the typical focus in AI. So, the insights gained from the methods covered in this book will remain relevant for some time. Nevertheless, the extent to which spatial structure and explicit spatial features contribute to the performance of deep learning methods, or are instead discovered by these methods themselves (and thus rendered redundant), remains an open question.

A final concern pertaining to the exploration of spatial data is the extent to which it can be reproducible and replicable. Reproducibility refers to obtaining the same results when the analysis is repeated on the same data (by the researchers themselves or by others); replicability refers to obtaining the same type of results using different data.

The growing demands for transparency and openness in the larger scientific community (so-called TOP guidelines, for transparency and openness promotion) have resulted in requirements of open data, open science and open software code. These concerns have also been echoed in the context of spatial data science (among others, by Rey 2009, 2023; Singleton, Spielman, and Brunsdon 2016; Brunsdon and Comber 2021; Kedron et al. 2021). A common approach to accomplish this in spatial data analysis is the codification of workflows by means of visual modeling languages, e.g., as implemented in various GIS software (see also Kruiger et al. 2021). More formally, data, text and code are combined in notebooks, such as the well-known Jupyter (originally for Python) and RStudio (originally for R) implementations (see Rowe et al. 2020).

To what extent does the highly dynamic and interactive visualization in GeoDa meet these requirements? At one level, it does not and cannot. The thrill of discovery can easily result in a rapid succession of linked graphs and maps, in combination with various cluster analyses, in a fairly unstructured manner that does not lend itself to codification in workflows. However, this intrinsic lack of reproducibility can be remedied to some extent.

For example, it may be possible to embed elementary analyses (e.g., a Local Moran) into a specific workflow (e.g., select variable, specify weights, identify clusters), although this will necessarily result in a loss of spontaneity. Alternatively, as outlined in Appendix C, the geodalib architecture allows specific analyses to be embedded in notebook workflows through the new RGeoDa and PyGeoda wrappers.
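A minimal sketch of such a notebook workflow using the PyGeoda wrapper might look as follows (the file name and variable are placeholders, and the exact function names may differ across pygeoda versions):

```python
# Hypothetical notebook workflow using the PyGeoda wrapper (file name
# and variable are placeholders; exact function names may differ across
# pygeoda versions).
import pygeoda

gda = pygeoda.open("natregimes.shp")         # load a hypothetical data set
w = pygeoda.queen_weights(gda)               # first-order queen contiguity
lisa = pygeoda.local_moran(w, gda["hr60"])   # Local Moran for one variable

values = lisa.lisa_values()       # local Moran statistics
pvals = lisa.lisa_pvalues()       # permutation-based pseudo p-values
clusters = lisa.lisa_clusters()   # cluster codes (0 = not significant, ...)
```

Because every step is recorded as code, the same weights, permutations and cluster designations can be regenerated exactly, which is precisely what the dynamic GeoDa interface cannot guarantee on its own.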

In addition, there are some simple ways to obtain a degree of reproducibility for major aspects of the analysis, such as the specification of spatial weights, custom classifications, and various variable transformations through the project file (see, e.g., Sections 4.6.3, 9.2.1.1 and 10.3.5). Nevertheless, obtaining full reproducibility for dynamic and interactive visualization remains a challenge.

In spite of these challenges, I believe that the exploratory perspective developed in this book and implemented in the GeoDa software can be very effective at generating useful insights. As long as the approach is applied with caution, such as by including ample sensitivity analyses, pitfalls can be avoided. Time will tell.

The second volume of this introduction to spatial data science is devoted to what is referred to in the machine learning literature as unsupervised learning. The objective is to reduce the complexity in multivariate data by means of dimension reduction, both in the attribute dimension (e.g., principal components and multidimensional scaling), as well as in the observational dimension (clustering and regionalization). The distinctive perspective offered is to emphasize the relevance of spatial aspects, by spatializing the results of classic methods, either through mapping in combination with linking and brushing, or by imposing spatial constraints in the clustering algorithms themselves. This builds upon the foundations developed in the current volume.


1. For examples of how this applies to data science and spatial data science, see, e.g., https://puttingscienceintodatascience.org.