## 8.4 The Spectral Clustering Algorithm

In general terms, a spectral clustering algorithm consists of four phases:

- turning the similarity matrix into an adjacency matrix
- computing the \(k\)
*smallest*eigenvalues and matching eigenvectors of the graph Laplacian, or, alternatively, the \(k\)*largest*eigenvalues and eigenvectors of the affinity matrix - using the (rescaled) resulting eigenvectors to carry out K-Means clustering
- associating the resulting clusters back to the original observations

The most important step is the construction of the adjacency matrix.

### 8.4.1 Adjacency matrix

The first step consists of selecting a criterion to turn the *dense* similarity matrix into a
*sparse* adjacency matrix, sometimes also referred to as the *affinity matrix*. The logic is very similar to that of creating spatial weights by means of
a distance criterion.

For example, *neighbors* can be defined as being within a critical
distance \(\delta\). In the spectral clustering literature, this is referred to as an epsilon (\(\epsilon\)) criterion.
This approach shares the same issues as in the spatial case when the observations are distributed with
very different densities. In order to avoid isolates, a max-min nearest neighbor distance needs to be
selected, which can result in a very unbalanced adjacency matrix. An adjacency matrix derived from the
\(\epsilon\) criterion is typically used in unweighted form.

A preferred approach is to use k nearest neighbors, although this is not a symmetric property. Since the eigenvalue computations require a symmetric matrix, there are two approaches to remedy the asymmetry (see also the discussion for spatial weights in Volume 1). In one, the affinity matrix is made symmetric as \((1/2) (W + W')\). In other words, if \(w_{ij} = 1\), but \(w_{ji} = 0\) (or the reverse), a new set of weights is created with \(w_{ij} = w_{ji} = 1/2\).

Instead of a pure k-nearest neighbor criterion, so-called *mutual* k nearest
neighbors can be defined, which consists of those neighbors among the k-nearest neighbor set that \(i\) and \(j\)
have in common. More precisely, only those connectivities are kept for which \(w_{ij} = w_{ji} = 1\).

The knn adjacency matrix can join points in disconnected parts of the graph, whereas the mutual k nearest neighbors will be sparser and tends to connect observations in regions with constant density. Each has pros and cons, depending on the underlying structure of the data.

An alternative approach is to compute a similarity matrix that has a built-in distance decay. The most common
method is to use a Gaussian density (or *kernel*) applied to the Euclidean distance between observations:
\[s_{ij} = \exp [- (x_i - x_j)^2 / 2\sigma^2],\]
where the standard deviation \(\sigma\) plays the role of a bandwidth. With the right choice of \(\sigma\), the corresponding adjacency matrix can be made more or less sparse. More precisely, the Gaussian transformation translates a
dissimilarity matrix (Euclidean distances) into a similarity matrix. The new similarity matrix plays the role
of the adjacency matrix in the remainder of the algorithm.

### 8.4.2 Clustering on the Eigenvectors of the Graph Laplacian

With the adjacency matrix in place, the \(k\) smallest eigenvalues and associated eigenvectors of the normalized graph Laplacian can be computed. Alternatively, the \(k\) largest eigenvalues/eigenvectors of the adjacency matrix can be calculated.

Since only a few eigenvalues are required, specialized algorithms can be exploited that only extract the smallest or largest eigenvalues/eigenvectors, such as the power iteration method outlined in Section 3.2.1.2.

When a symmetric normalized graph Laplacian is used, the \(n \times k\) matrix of eigenvectors, say \(U\),
is row-standardized such that the norm of each row equals 1. The new matrix \(T\) has elements:^{42}
\[t_{ij} = \frac{u_{ij}}{(\sum_{h=1}^k u_{ih}^2)^{1/2}}.\]
The new observations consist of the values of \(t_{ij}\) for each \(i\). These values are used in a
standard K-Means clustering algorithm to yield \(k\) clusters. Finally, the labels for the clusters
are associated with the original observations and several cluster characteristics can be computed.

### 8.4.3 Spectral Clustering Parameters

The results of spectral clustering tend to be highly sensitive to the choice of the
parameters that are used to define the adjacency matrix. For example, when using k-nearest neighbors,
the choice of the number of neighbors is an important decision. In the literature,
a value for the number of nearest neighbors of the order of \(\log(n)\) is suggested for large \(n\) (von Luxburg 2007). In practice,
both \(\ln(n)\) as well as \(\log_{10}(n)\) are used.^{43}

Similarly, the bandwidth of the Gaussian transformation is determined by the value for the standard deviation, \(\sigma\). One suggestion for the value of \(\sigma\) is to take the mean distance to the k nearest neighbor, or \(\sigma \sim \log(n) + 1\) (von Luxburg 2007). Again, either \(\ln(n) + 1\) or \(\log_{10}(n) + 1\) can be implemented. In addition, sometimes \(\sigma = \sqrt{1/p}\) is suggested, where \(p\) corresponds to the number of variables (features).

In practice, these parameters are best set by trial and error, and a careful sensitivity analysis is in order.