## 8.3 Clustering as a Graph Partitioning Problem

The clustering methods considered so far were based on the use of a *dissimilarity* matrix, with the
objective of minimizing within-cluster dissimilarity. In spectral clustering, the focus is on the
complement, i.e., a *similarity* matrix, and the goal is to maximize the internal similarity within a
cluster. Of course, any dissimilarity matrix can be turned into a similarity matrix using a number
of different methods. For example, a distance decay function can be employed (inverse distance, negative exponential), or a new metric can be computed as
the difference from the maximum (e.g., \(d_{max} - d_{ij}\)).
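As a small illustration, both transformations can be applied to a hypothetical distance matrix (the values below are made up for the example):

```python
import numpy as np

# Hypothetical symmetric distance matrix with a zero diagonal.
d = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 2.0],
              [4.0, 2.0, 0.0]])

# Negative exponential distance decay: closer pairs get higher similarity.
s_exp = np.exp(-d)

# Difference from the maximum distance: d_max - d_ij.
s_max = d.max() - d
```

Note that the second transformation assigns the diagonal the value \(d_{max}\), whereas the first assigns it 1; in practice, the diagonal (self-similarity) is typically ignored.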

A similarity matrix \(S\) consisting of elements \(s_{ij}\) can be viewed as the basis for the *adjacency matrix*
of a weighted undirected graph \(G = (V,E)\). In this graph, the vertices \(V\) are the observations and each edge in \(E\)
carries a weight equal to the similarity \(s_{ij}\) between \(i\) and \(j\). This is identical to the interpretation of a
spatial weights matrix (see Volume 1). As it turns out, the standard notation for the adjacency matrix is
to use \(W\), as was the case for spatial weights.

In practice, the adjacency matrix that is used in the algorithm is typically not the same as the full similarity matrix. Instead, it
follows from a transformation of the latter to a *sparse* form (see Section 8.4.1).

The goal of graph partitioning is to delineate subgraphs that are internally strongly connected, but only
weakly connected with the other subgraphs. In the ideal case, the subgraphs are so-called *connected components*. These
are subgraphs in which every pair of vertices is linked by a path, but without any connections to the other subgraphs. A geographical example would be a collection of counties located on a number of islands. The counties on the same island are connected, but there are no connections between counties on different islands. As a result, the corresponding weights structure would be block-diagonal. In
practice, this will rarely be the case.

The objective of graph partitioning then becomes one of finding a set of *cuts* in the graph that
create \(k\) subsets, such that the internal connectivity is maximized and the in-between connectivity is minimized. A naive application
of this principle would lead the least connected vertices to become singletons (similar to what tends to happen in single and average linkage hierarchical clustering), with all the rest forming one large cluster. Better
suited partitioning methods include a weighting by cluster *size*, so as to end up with well-balanced
subsets.^{41}

A very important concept in this regard is the *graph Laplacian* associated with the adjacency matrix \(W\).

### 8.3.1 Graph Laplacian

In network analysis or graph theory, an important property of the adjacency matrix \(W\) associated with a graph is the *degree* of a vertex \(i\):
\[d_i = \sum_j w_{ij},\]
i.e., the row-sum of the similarity weights. This is the same concept as the cardinality of spatial weights, reflected in the
connectivity histogram.
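For a small hypothetical similarity matrix, the degree is simply the vector of row sums:

```python
import numpy as np

# Hypothetical symmetric adjacency (similarity) matrix W for 4 vertices.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Degree of each vertex: the row sum of the similarity weights.
degrees = W.sum(axis=1)
```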

The *graph Laplacian* is defined as the following matrix:
\[L = D - W,\]
where \(D\) is a diagonal matrix containing the degree of each vertex. The graph Laplacian has the
property that all its eigenvalues are real and non-negative, and, most importantly, that its smallest
eigenvalue is zero.

In the (ideal) case where the adjacency matrix can be organized into separate partitions (unconnected to each other), the matrix takes on a block-diagonal form. Each block contains the partitioning sub-matrix for the matching group. The corresponding graph Laplacian will similarly have a block-diagonal structure. Since each of these sub-blocks is itself a graph Laplacian (for the sub-network corresponding to the partition), its smallest eigenvalue is zero as well. An important result is then that if the graph is partitioned into \(k\) disconnected blocks, the graph Laplacian will have \(k\) zero eigenvalues. This forms the basis for the logic of using the \(k\) smallest eigenvalues of \(L\) to find the corresponding clusters.
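This property is easy to verify numerically. The sketch below constructs a hypothetical adjacency matrix with two disconnected components (a triangle on vertices 0-2 and an edge between vertices 3 and 4) and counts the zero eigenvalues of the associated graph Laplacian:

```python
import numpy as np

# Hypothetical adjacency matrix with two disconnected blocks:
# a triangle {0, 1, 2} and a pair {3, 4}.
W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 1.0
W[1, 2] = W[2, 1] = 1.0
W[0, 2] = W[2, 0] = 1.0
W[3, 4] = W[4, 3] = 1.0

# Graph Laplacian: L = D - W, with D the diagonal degree matrix.
D = np.diag(W.sum(axis=1))
L = D - W

# Eigenvalues of a symmetric matrix, in ascending order.
eigvals = np.linalg.eigvalsh(L)

# Two disconnected components -> two (numerically) zero eigenvalues.
n_zero = int(np.sum(np.abs(eigvals) < 1e-10))
```

With the two components merged into a single connected graph, only one eigenvalue would be zero.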

While it can be used to proceed with spectral clustering, the unnormalized Laplacian \(L\) has some undesirable properties.
Instead, the preferred approach is to use a so-called *normalized* Laplacian.

There are two ways to normalize the adjacency
matrix and thus the associated Laplacian. One is to *row-standardize* the adjacency matrix, or \(D^{-1}W\). This is exactly
the same idea as row-standardizing a spatial weights matrix. When applied to the Laplacian, this yields:
\[L_{rw} = D^{-1}D - D^{-1}W = I - D^{-1}W.\]

This is referred to as a *random walk* normalized graph Laplacian, since the row elements can be viewed
as transition probabilities from state \(i\) to each of the other states \(j\). As with spatial weights, the
resulting normalized matrix is no longer symmetric, although its eigenvalues remain real, with the smallest eigenvalue
being zero. The associated eigenvector is \(\iota\), a vector of ones.

A second transformation pre- and post-multiplies the Laplacian by \(D^{-1/2}\), the inverse of the square root of the degree.
This yields a *symmetric* normalized Laplacian as:
\[L_{sym} = D^{-1/2}DD^{-1/2} - D^{-1/2}WD^{-1/2} = I - D^{-1/2}WD^{-1/2}.\]
Again, the smallest eigenvalue is zero, but the associated eigenvector now becomes \(D^{1/2}\iota\).
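This can again be verified on a small hypothetical example: applying \(L_{sym}\) to the vector \(D^{1/2}\iota\) (the element-wise square roots of the degrees) yields the zero vector:

```python
import numpy as np

# Hypothetical symmetric adjacency matrix (a simple path graph).
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

d = W.sum(axis=1)                      # degrees: [1, 2, 1]
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

# Symmetric normalized Laplacian: L_sym = I - D^{-1/2} W D^{-1/2}.
L_sym = np.eye(3) - D_inv_sqrt @ W @ D_inv_sqrt

# D^{1/2} @ ones, i.e., the square roots of the degrees,
# is an eigenvector with eigenvalue zero.
v = np.sqrt(d)
residual = L_sym @ v
```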

Spectral clustering algorithms differ by whether the unnormalized or normalized Laplacian is used to compute the \(k\) smallest eigenvalues and associated eigenvectors, and whether the Laplacian or the adjacency matrix is the basis for the calculation.

Specifically, as an alternative to using the smallest eigenvalues and associated eigenvectors of the normalized Laplacian,
the *largest* eigenvalues/eigenvectors of the normalized adjacency (or affinity) matrix can be computed.

The standard eigenvalue
expression is the following equality:
\[Lu = \lambda u,\]
where \(u\) is the eigenvector associated with eigenvalue \(\lambda\). Since \(u = Iu\), subtracting both sides from \(Iu\) yields:
\[(I - L)u = (1 - \lambda)u.\]
In other words, if the smallest eigenvalue of \(L\) is 0, then the largest eigenvalue of \(I - L\) is 1. Moreover, with \(L\) as the symmetric normalized Laplacian:
\[I - L_{sym} = I - [I - D^{-1/2}WD^{-1/2}] = D^{-1/2}WD^{-1/2}.\]
This result implies that the search for the smallest eigenvalue of \(L\) is equivalent to the search for the *largest* eigenvalue
of \(D^{-1/2}WD^{-1/2}\), the normalized adjacency matrix.
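The equivalence can be checked numerically on a small hypothetical adjacency matrix: the eigenvalues of \(L_{sym}\) equal one minus the eigenvalues of the normalized adjacency matrix, in reverse order, so the smallest eigenvalue of the former corresponds to the largest of the latter:

```python
import numpy as np

# Hypothetical symmetric adjacency matrix.
W = np.array([[0, 2, 1],
              [2, 0, 1],
              [1, 1, 0]], dtype=float)

d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))

W_norm = D_inv_sqrt @ W @ D_inv_sqrt   # normalized adjacency matrix
L_sym = np.eye(3) - W_norm             # symmetric normalized Laplacian

lam_L = np.linalg.eigvalsh(L_sym)      # ascending order
lam_W = np.linalg.eigvalsh(W_norm)     # ascending order

# lambda_L = 1 - lambda_W, so the two orderings are reversed:
# the smallest eigenvalue of L_sym pairs with the largest of W_norm.
```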