Chapter 1 Introduction

Spatial data are special, in that the location of the observations, the where, plays a critical role in the methodology required for their analysis. Two aspects in particular distinguish spatial data from the standard independent and identically distributed (i.i.d.) paradigm in statistics and data analysis, namely spatial dependence and spatial heterogeneity (Anselin 1988, 1990). Spatial dependence refers to the similarity of values observed at neighboring locations, or “everything is related to everything else, but near things are more related than distant things,” known as Tobler’s first law of geography (Tobler 1970). Spatial heterogeneity is a particular form of structural change, associated with spatial subregions of the data, i.e., showing a clear break in the spatial distribution of a phenomenon. Both spatial dependence and spatial heterogeneity require a specialized methodology for data analysis, generically referred to as spatial analysis.
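To make the notion of spatial dependence concrete, the following minimal sketch computes Moran's I, one commonly used summary statistic for spatial autocorrelation, on a small regular grid. It is an illustration only, not a method introduced at this point in the text, and it is unrelated to how GeoDa works internally. The function names `rook_neighbors` and `morans_i`, the 4 by 4 grid, and the choice of rook contiguity (cells sharing an edge are neighbors) are all assumptions made for this example.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each cell index to the indices of the cells that share an edge with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = adjacent
    return neighbors


def morans_i(values, neighbors):
    """Moran's I with binary (0/1) weights taken from a neighbor dictionary."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                        # deviations from the mean
    s0 = sum(len(adj) for adj in neighbors.values())      # sum of all weights
    cross = sum(z[i] * z[j] for i, adj in neighbors.items() for j in adj)
    return (n / s0) * cross / sum(zi * zi for zi in z)


w = rook_neighbors(4, 4)

# Similar values side by side: left half of the grid is 1, right half is 0.
clustered = [1, 1, 0, 0] * 4
# Dissimilar values side by side: alternating 0 and 1 in both directions.
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

i_clustered = morans_i(clustered, w)
i_checker = morans_i(checkerboard, w)
print(round(i_clustered, 3), round(i_checker, 3))  # 0.667 -1.0
```

The clustered layout yields a positive value, reflecting that neighbors tend to be alike, whereas the checkerboard yields the most negative value possible under these weights, since every neighbor is unlike.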

Spatial data science is an emerging paradigm that extends spatial analysis, situated at the interface between spatial statistics and geocomputation. What the term actually encompasses is not settled, and the collection of methods and software tools it represents is also sometimes referred to as geographic data science, or geospatial data science (Anselin 2020; Singleton and Arribas-Bel 2021; Comber and Brunsdon 2021; Rey, Arribas-Bel, and Wolf 2023). The concept is closely related to, and shares many methods and approaches with, fields such as geocomputation (Brunsdon and Comber 2015; Lovelace, Nowosad, and Muenchow 2019), cyberGIScience (Wang 2010; Wang et al. 2013), and, more recently, GeoAI (Janowicz et al. 2020; Gao 2021).

This two-volume collection is intended as an introduction to the field of spatial data science, emphasizing data exploration and visualization and focusing on the importance of a spatial perspective. It represents an attempt to promote spatial thinking in the practice of data science. It is admittedly a selection of methods that reflects my own biases, but it has proven to be an effective collection over many years of teaching and research. The first volume deals with the exploration of spatial data, whereas the second volume focuses on spatial clustering methods.

The methods covered in both volumes work well for so-called small to medium data settings, but not all of them scale well to big data settings. However, some important principles do scale well, like local indicators of spatial association. Even though data sets of very large size have become commonplace and arguably have been the drivers behind a lot of methodological development in modern data science, such scale is not always relevant for spatial data analysis. The point of departure is often big data (e.g., geo-located social media messages), but eventually the analysis is carried out at a more spatially aggregate level, where the techniques covered here remain fully relevant.
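As a rough illustration of moving from individual geo-located records to a spatially aggregate level, the sketch below counts points per cell of a regular square grid. This is only one possible form of aggregation, chosen here for simplicity; in practice, areal units such as administrative districts are often used instead. The function name `aggregate_to_grid`, the sample coordinates, and the cell size are hypothetical.

```python
import math
from collections import Counter


def aggregate_to_grid(points, cell_size):
    """Count points per square grid cell; cells are keyed by (column, row)."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


# Hypothetical point locations in arbitrary planar units.
points = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.4), (1.1, 1.9), (0.4, 0.9)]
cell_counts = aggregate_to_grid(points, cell_size=1.0)

for cell, count in sorted(cell_counts.items()):
    print(cell, count)
```

The resulting cell counts form a much smaller data set than the original points, and it is at this aggregate level that exploratory methods for areal data would typically be applied.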

The methodological approach outlined in this first volume supports an abductive process of exploration, a dynamic interaction between the analyst and the data with the goal of obtaining new insights. The focus is on insights that pertain to spatial patterns in the data, such as the location of interesting observations (hot spots and cold spots), the presence of structural breaks in the spatial distribution of the data, and the comparison of such patterns between different variables and over time.

The identification of the patterns is intended to provide cues about the types of processes that may have generated them. It is important to appreciate that exploration is not the same as explanation. In my opinion, exploration nevertheless constitutes an important and necessary step to obtain effective and falsifiable hypotheses, to be used in the next stages of the analysis. However, in practice, the line between pure exploration and confirmation (hypothesis testing) is not always that clear and the process of scientific discovery may move back and forth between the two. I return to this question in more detail in the closing chapter.

The two volumes are both an introduction to the methodology of spatial data science and the definitive guide to the GeoDa software. This software represents the implementation of my vision for a gradual progression in the exploration of spatial data, from simple description and mapping, to more structured identification of patterns and clusters, culminating with the estimation of spatial regression models. It came at the end of a series of software developments that started in the late 1980s (for a historical overview, see Anselin 2012).

GeoDa is designed to be user friendly and intuitive. It works through a graphical user interface and therefore does not require any programming expertise. Similarly, the emphasis in the two volumes is on spatial concepts and how they can be implemented through the software; the volumes do not deal with geocomputation as such.

A distinctive characteristic of GeoDa is the efficient implementation of dynamically linked graphs, in the sense that one or more selected observations in a “view” of the data (a graph or map) are immediately also selected in all the other views, allowing interactive linking and brushing of the data (Anselin, Syabri, and Kho 2006). Since its initial release in 2003 (through the NSF-funded Center for Spatially Integrated Social Science), the software has been adopted widely for both teaching and research, with close to 600,000 unique downloads at the time of this writing.

In the remainder of this introduction, I first provide a broad overview of the organization of this first volume. This is followed by a quick tour of the GeoDa software and a listing of the sample data sets used to illustrate the methods.