## 7.5 Spatial Heterogeneity

Together with spatial dependence, *spatial heterogeneity* is the other critical aspect in any spatial data analysis.
In essence, its presence suggests that more than one underlying distribution may be responsible for generating the observed sample.
This is evidenced in the form of structural breaks, such as different mean or median values
in different subsets of the data, or different slopes in a bivariate regression over observations in those subsets (see Anselin 1988, for a more formal treatment).

The *spatial* aspect in the heterogeneity comes from the nature of the subsets in the data, which are spatially defined.
Examples are the difference between center and periphery, east and west, and north and south. In an exploratory data analysis, spatial heterogeneity can be assessed by
selection in a map *linked* to a statistical graph. An
application of this idea was illustrated in the context of a linked map and histogram (Section 7.2.1.4). Next, two further implementations of this approach are considered.
One is the analysis of differences in means through the **Averages Chart**, the other
the investigation of structural stability in a scatter plot by means of brushing a
map linked to the scatter plot.

### 7.5.1 Averages chart

The **Averages Chart** is an implementation of a simple test on the difference in means between selected
and unselected observations. Its most meaningful use is in the context of observations at different points
in time (see Chapter 9), but it is equally applicable in a cross-sectional setting, illustrated here.

The core functionality of this chart is to illustrate and quantify the difference between the
mean of a variable for *selected* observations and *unselected* observations (the complement).
In `GeoDa`

, this is not implemented as a traditional t-test, but rather as an F-statistic for
a regression that includes an indicator variable for the selection (i.e., value = 1 for selected,
and zero otherwise). The F-statistic on the significance of the joint slopes in that regression is
equivalent to a t-test on the coefficient of the indicator variable, since there is only one slope.

This F-statistic is basically a test on whether there is a significant gain in explanation in the regression beyond the overall mean (i.e., the constant term). Formally, the statistic uses the sum of squared residuals in the regression RSS and the sum of squared deviations from the mean for the dependent variable RSY. The statistic follows as: \[F = \frac{RSS - RSY}{k - 1} / \frac{RSY}{n - k},\] with \(k\) as the number of explanatory variables. In our simple dummy variable regression, \(k = 2\), so that the degrees of freedom for the F-statistic are \(1, n-2\) (see also Anselin and Rey 2014, 98–99.)

#### 7.5.1.1 Selected and unselected observations

The averages chart is invoked from the menu as
**Explore > Averages Chart**. However, its toolbar icon is not part of the EDA
group, but instead is included as
the right-most icon in the time management toolbar in Figure 9.1.
Its most effective application is in a space-time context, but here its use is
illustrated in a cross-sectional setup.

The example explores the difference in mean food insecurity in 2020 between the Oaxaca Central Valleys and the rest of the country. The so-called Valles Centrales are the location of the original Zapotec civilization and are still characterized by a large indigenous population. They also contain the state capital of Oaxaca.

The overall spatial distribution of **pfood_20** is shown in the box map in
Figure 7.19, with the central valley municipalities highlighted.
The selection is obtained by setting **region = 8** in the selection tool
(see Section 2.5.1).

The selection is used in the averages chart to define **selected** and **unselected**
as the **Groups**. In
Figure 7.20, the **Variable** is specified as **pfood_20**. In the current context, the time settings
can be ignored. The statistics for the two groups are listed in a
small table. **Selected** has 121 observations with a **Mean** of 17.65 and **S.D.** of
7.74. In contrast, the **Unselected** group contains 449 observations, with **Mean** of
29.09 and **S.D.** of 15.07.

The formal test on equality of means results in an F-statistic of 64.99, which yields a very small p-value (essentially zero) for 568 degrees of freedom.

In the right-hand panel of the window, the overall mean (black), selected mean (red) and unselected mean (blue) are represented graphically.

In this case, there is strong evidence that food insecurity is less severe in the Central Valleys compared to the rest of the state.

#### 7.5.1.2 Map brushing and the averages chart

A more comprehensive exploration of spatial heterogeneity is achieved by *brushing* a
map linked to the averages chart. As the selection in the map changes, the statistics
in the averages chart are updated. As a result, one can assess the extent to which
subregions of the data have a different mean for the variable under consideration.

For example, Figure 7.21 has a six category quantile map for
elevation on the left-hand side. As the *brush* is moved from west to east, the
effect on the test of means can be assessed in the averages chart on the right. Note that the brushing is not based on the altitude variable, but is purely a move over space.
However, by including a different variable, one may be able to discover what
lies behind the spatial structural differences.

The variable under consideration in the averages chart is **ppov_20**. The first selection contains
213 observations with a mean of 75.42, compared to a mean of 78.46 for the rest.
The test on the difference between the means is not significant at p = 0.052. Thus
this initial selection of higher elevation locations is not substantially different
from the rest of the state.

In Figure 7.22, the selection is moved to the center of the state, with 215 selected observations, yielding a mean of 73.69. The unselected mean is 79.53. For this selection, the test rejects the null hypothesis with a p-value of 0.000, suggesting strong spatial heterogeneity.

Finally, in Figure 7.23, the selection is moved even further to the east. The 87 selected observations have a mean of 79.01, whereas the complement has a mean of 77.02. The test on difference between the means is not significant at p = 0.348.

By moving the brush over different regions of the map, the extent of spatial structural instability can be assessed. In addition, by using a map for a different variable (such as altitude here), one can possibly gain insight into factors that may be behind the spatial structural instability. However, for this to be meaningful, one has to make sure that sufficient observations are contained in each selection.

#### 7.5.1.3 Averages chart options

The default setting for the averages chart is to have a **Fixed scale over time**.
This means that the minimum and maximum tick marks on the left side axis remain
the same as the selection changes. In some instances, the respective means may
no longer fall within this range. As a result, they will not be shown in the graph.
To remedy this, the **Axis Option** can be set to
**Enable User Defined Value Range of Y-Axis**, through which the range can be customized.

A final option is to **Set Display Precision of Y-Axis**.

### 7.5.2 Brushing the scatter plot

#### 7.5.2.1 Scatter plot brushing

The concept of scatter plot *brushing* was initially suggested by Stuetzle (1987), and extended to the map context by Monmonier (1989). The idea is to dynamically adjust the selection
of observations in the scatter plot within a selection *brush* (typically a rectangle). As the brush moves over the plot, observations are added to and dropped from the selection and the slope of the linear fit is adjusted. As discussed in Section 7.3.1.1, this is
implemented in `GeoDa`

through the **Regimes Regression** option (see also Figure 7.11).

Since all windows are simultaneously linked in the basic architecture
of the `GeoDa`

software (Anselin, Syabri, and Kho 2006), the dynamic selection through brushing can be initiated in
any open window. This results in an immediate adjustment of the slope of the linear fit
in the scatter plot and an updated computation of the Chow test. As in the averages chart, *spatial heterogeneity* can be assessed by initiating a *spatial* selection through the brushing operation in a map.

#### 7.5.2.2 Map brushing and scatter plot

To illustrate the dynamic map brushing operation, a scatter plot of **c_ptot12** on
**ppov_20** is considered jointly with the elevation map used before. The global linear fit yields
a slope coefficient estimate of -0.272, with an \(R^2\) of 0.121. The brush moves from
west to east.

In Figure 7.24, the brush is initiated by selecting 93 observations in the higher altitude region in the western part of the state. The regression slope of the selected observations is -0.217, compared to -0.303 for the unselected, yielding a significant Chow test result at p = 0.0012. Note that the coefficient for the selected observations is not significant (p=0.128).

As the brush is moved to the right, the selection is adjusted. In Figure 7.25, the slope of the 77 selected observations has changed to -0.390, compared to -0.266 for the unselected. Even though the estimates are both significant, they cannot be deemed to be different, with the Chow test yielding a p-value of 0.387.

In Figure 7.26, the brush is moved even further to the east, yielding a selection of 33 observations. Note that since the brush size is fixed, but the spatial extent of the municipalities varies, the number of observations contained in each spatial selection will not be constant. The municipalities in this part of the state tend to be larger, resulting in fewer observations in the selection window. This will affect the precision of the estimates in the subset, and, indirectly, also the Chow test statistic.

At this point, the selected observations show no significant relationship, which a slightly positive coefficient of 0.056 (p-value of 0.335). In contrast, the coefficient in the unselected observations is negative at -0.307, and highly significant.

A careful assessment of the effect of different spatial selections can provide insight into spatially defined structural instabilities in the relationship between any two variables. The inclusion of a Chow test provides some guidance, even though its results need to be interpreted with caution due to the problem of multiple comparisons (many tests carried out with the same data). However, it allows for a more quantitative measure of the spatial heterogeneity, instead of relying on a purely visual assessment, which can be misleading. This is in the spirit of the more recently developed perspectives on EDA, e.g., as discussed in Section 4.2.1.