4.2 From EDA to ESDA

The mapping functionality outlined in this and the next chapters is part of an overall framework to guide and facilitate learning from spatial data, which also includes a range of traditional statistical graphs (see Chapters 7 and 8). GeoDa implements general ideas from exploratory data analysis (EDA) and its spatial extension, exploratory spatial data analysis (ESDA). To put these concepts into context, I briefly discuss the evolution of EDA and visual analytics, as well as how geovisualization and ESDA fit into this framework. I close with a brief introduction to the important concepts of linking and brushing.

4.2.1 Exploratory Data Analysis (EDA)

EDA is generally viewed as having originated in the 1960s as a reaction by computationally oriented statisticians to the primary focus on mathematics and modeling in their discipline. It was first comprehensively outlined in the classic book by John Tukey (Tukey 1977). His presentation stressed the value of investigating the raw data by means of a range of graphic devices, several of which were developed by Tukey himself. Well-known examples include the box plot and the stem-and-leaf plot.

The objective of EDA is to create effective tools to guide the analyst in the discovery of information in the data, specifically, “indications of unexpected phenomena” or to “display the unanticipated” (Tukey 1962; Tukey and Wilk 1966), or even to “discover potentially explicable patterns” (Good 1983). In this context, EDA is often contrasted with confirmatory data analysis, or CDA. This reflects the traditional distinction in knowledge discovery between an inductive approach (data first, hypothesis later) and a deductive approach (hypothesis/model first, data later). However, with the emphasis on visual exploration (see also Tufte 1997), EDA is really about an abductive approach, where the interaction between data exploration and human perception leads to the detection of patterns jointly with the formulation of hypotheses (e.g., Gahegan 2009).

Early approaches to visualizing data go back to Greek times, although major innovations did not occur until the work of William Playfair during the late 18th and early 19th century (Friendly 2008). However, practical visual exploration of large data sets had to wait for the development of powerful computer graphics hardware. This allowed direct interaction with the data shown on a computer screen through so-called dynamic graphics (Becker, Cleveland, and Wilks 1987; Cleveland and McGill 1988; Cleveland 1993). Dynamic graphics allowed the data to be represented simultaneously by means of different views (Buja, Cook, and Swayne 1996), i.e., graphs, tables, charts, and even maps that focus on various aspects of the data distribution. Especially when dealing with high-dimensional data, insight is gained through careful manipulation of the views, such as linking, focusing and arranging (for a recent review of the range of techniques, see Chen, Härdle, and Unwin 2008).

In the computer science literature, a similar and largely parallel development occurred in the form of visual analytics, an approach to knowledge discovery based on the use of statistical graphics and other visual tools to facilitate pattern recognition and data mining (Thomas and Cook 2005; Kielman, Thomas, and May 2009). As in EDA, visual analytics pays much attention to human-computer interaction as a means to support analytical reasoning, in order to “detect the expected and discover the unexpected” (Kielman, Thomas, and May 2009, 245).

Whereas early on EDA and CDA were mostly seen as opposing strategies to gain knowledge, more recently attempts have been made to bring the two closer together. Specifically, a concern that unstructured exploration may lead to spurious results has yielded methods that augment EDA with some measure of how unusual the findings may be. For example, Buja et al. (2009) and Wickham et al. (2010) suggest a Rorschach test and a line up, in which the graphs and charts obtained in exploration are compared to outcomes generated under a specific null hypothesis in order to assess how unusual they actually are.²³ A Bayesian perspective is taken by Hullman and Gelman (2021), who suggest the use of a graph as a model check. This topic is the subject of ongoing discussion and debate. I revisit some of these ideas in the postscript (Chapter 21).
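The logic of a line up can be illustrated with a minimal sketch. The code below is not taken from the cited papers or from GeoDa. It assumes a null hypothesis of no association between two variables and generates the null panels by randomly permuting one of them. The function name `make_lineup`, the number of panels, and the toy data are all hypothetical choices made for illustration.

```python
import random


def make_lineup(x, y, n_panels=20, seed=0):
    """Hide the observed (x, y) pairing among n_panels - 1 null panels.

    Each null panel pairs x with a random permutation of y, which breaks
    any association between the two (the assumed null hypothesis).
    Returns (panels, true_position); each panel is a list of (x, y) pairs.
    """
    rng = random.Random(seed)
    panels = []
    for _ in range(n_panels - 1):
        shuffled = list(y)
        rng.shuffle(shuffled)
        panels.append(list(zip(x, shuffled)))
    # Insert the observed data at a random position among the decoys.
    true_position = rng.randrange(n_panels)
    panels.insert(true_position, list(zip(x, y)))
    return panels, true_position


# Hypothetical toy data with a strong positive association.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.2]
panels, true_position = make_lineup(x, y)
```

In use, each panel would be drawn as a small scatter plot and shown to a viewer who does not know `true_position`. If the observed panel is picked out from the decoys, this suggests that the pattern in the data is unusual relative to what the null hypothesis generates.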

4.2.2 Mapping as exploration

Historically, cartography, the science (and art) of map making, has focused on the map as a presentation or an expository device (Monmonier 1993). In this context, the map is an end product, created to represent findings, but it is not part of the analytical process itself. In recent years, while this aspect is still important, attention has shifted to making the map an integral part of knowledge discovery from data. This is variously referred to as geovisualization, geovisual analytics, or even geospatial visual analytics. This effort consists of a combination of new analytical tools, visual representations and their software implementation. It represents a shift away from the map as an end product to the integration of the map and a spatial focus in an interactive process of data exploration. In other words, the map becomes one of the views manipulated in a dynamic graphics environment (for a historical perspective, see Anselin 2005b, 2012).²⁴

4.2.3 Exploratory Spatial Data Analysis (ESDA)

Exploratory spatial data analysis (ESDA) similarly incorporates the map and spatial information as an integral part of the data exploration process, but the focus is on spatial patterns. As such, ESDA can be viewed as “a collection of techniques to describe and visualize spatial distributions, identify atypical locations or spatial outliers, discover patterns of spatial association, clusters or hot spots and suggest spatial regimes or other forms of spatial heterogeneity” (Anselin 1999).

ESDA techniques, as implemented in GeoDa, thus augment the map as a view of the data with targeted searches for spatial patterns, while leveraging global and local spatial autocorrelation statistics (spatial dependence). In addition, the discovery of the location of structural breaks in the spatial distribution (spatial heterogeneity) is facilitated (early overviews of this literature are contained in Anselin 1994, 1998, 1999). In Volume 2, spatial constraints are introduced into multivariate clustering techniques.

4.2.4 Linking and Brushing

Two concepts central to the way ESDA is implemented in the architecture of GeoDa are so-called linking and brushing (Anselin, Syabri, and Kho 2006). In Chapter 2, it was shown how observations can be selected, either by creating a query in the selection tool (Section 2.5.1), or by means of a spatial selection in a map view (Section 2.5.4).

When a selection is made in a map, such as in Figure 2.28, the corresponding observations are also highlighted in the table. The same happens in the other direction as well. When observations are selected in the table, the matching locations (points or polygons) are highlighted in any open map view. The connection between the selections in all open windows is referred to as linking. This works not just for a map and associated table (as is the case in many GIS), but also simultaneously for any map view and all the statistical graphs (see Chapters 7 and 8).
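The principle of linking can be sketched as a single selection that is shared by all open views. The code below is only an illustration of that idea and does not reflect how GeoDa is actually implemented. The class names `SharedSelection` and `View`, as well as the observation identifiers, are hypothetical.

```python
class SharedSelection:
    """Holds the currently selected observation ids and notifies all views."""

    def __init__(self):
        self._selected = set()
        self._views = []

    def register(self, view):
        self._views.append(view)

    def select(self, ids):
        # Replace the current selection and propagate it to every open view.
        self._selected = set(ids)
        for view in self._views:
            view.highlight(self._selected)


class View:
    """A map, table or statistical graph showing the same observations."""

    def __init__(self, name, selection):
        self.name = name
        self.highlighted = set()
        self._selection = selection
        selection.register(self)

    def user_select(self, ids):
        # A selection made in this view is passed on to the shared selection.
        self._selection.select(ids)

    def highlight(self, ids):
        self.highlighted = set(ids)


selection = SharedSelection()
map_view = View("map", selection)
table_view = View("table", selection)
scatter_view = View("scatter plot", selection)

# Selecting three observations in the map highlights them in all views.
map_view.user_select({3, 7, 12})
```

Because every view reads from and writes to the same selection, it does not matter in which window the selection originates. A later call such as `table_view.user_select({1})` would update the map and the scatter plot in the same way.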

A dynamic version of this process is brushing, first proposed for scatter plots in the statistical literature by Stuetzle (1987) and Becker and Cleveland (1987). It was further extended to choropleth maps by Monmonier (1989).

The idea behind brushing is that the selection tool (e.g., the selection rectangle on a map) becomes a moving object. As the rectangle moves over the map, the collection of selected objects is immediately updated. In this way, one can move the selection brush over the map (or any statistical graph) and assess the effect of the changing selection.
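A minimal sketch of the brushing idea follows, assuming point locations and a rectangular brush of fixed size. It is an illustration only, not GeoDa's implementation. The function `brush_select`, the point identifiers and the coordinates are hypothetical.

```python
def brush_select(points, x_min, y_min, width, height):
    """Return the ids of the points inside the brush rectangle (edges included)."""
    return {
        pid
        for pid, (x, y) in points.items()
        if x_min <= x <= x_min + width and y_min <= y <= y_min + height
    }


# Hypothetical point locations, keyed by observation id.
points = {"a": (1.0, 1.0), "b": (2.0, 1.5), "c": (4.0, 1.0), "d": (5.0, 2.0)}

# Move a 2 x 2 brush from left to right; the selection is recomputed
# at each position of its lower-left corner.
selections = [
    brush_select(points, x0, 0.0, 2.0, 2.0) for x0 in (0.0, 1.5, 3.0)
]
# selections == [{"a", "b"}, {"b"}, {"c", "d"}]
```

Each step of the brush yields a new selection, so that observations enter and leave the selected set as the rectangle moves. In an interactive setting, this recomputation would be triggered by each movement of the pointer rather than by a fixed list of positions.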

In GeoDa, the concept of brushing is combined with linking in the sense that the updated selection is instantaneously transmitted to all the open windows through the linking process. This provides a very powerful visual tool to assess the effect of the changing selection on various aspects of the spatial and statistical distributions, in both univariate and multivariate settings. The linking-brushing combination is critical to support ESDA. The map plays a central role in this process as an interactive visualization tool, discussed in more detail in the remainder of the chapter.

²³ The permutation approach towards inference in spatial autocorrelation analysis can be viewed as an implementation of the line up approach. See Chapters 13 and 16.
²⁴ Early discussions of these concepts are included in, among others, Haslett et al. (1991), Dykes (1997), MacEachren and Kraak (1997), and MacEachren et al. (1999). Overviews of various techniques can be found in Dykes, MacEachren, and Kraak (2005), Kraak and MacEachren (2005), Rhyne, MacEachren, and Dykes (2006), G. Andrienko et al. (2011) and N. Andrienko et al. (2018), among others.