8.3 Clustering as a Graph Partitioning Problem

The clustering methods considered so far were based on a dissimilarity matrix, with the objective of minimizing within-cluster dissimilarity. In spectral clustering, the focus is on the complement, i.e., a similarity matrix, and the goal is to maximize the similarity within each cluster. Of course, any dissimilarity matrix can be turned into a similarity matrix in a number of ways. For example, a distance decay function can be employed (e.g., inverse distance or a negative exponential), or a new metric can be computed as the difference from the maximum dissimilarity (e.g., \(d_{max} - d_{ij}\)).
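To make these transformations concrete, the following is a minimal sketch in Python of the three conversions just mentioned; the toy dissimilarity matrix and the bandwidth parameter are assumptions chosen purely for illustration.

```python
import numpy as np

# hypothetical symmetric dissimilarity matrix for four observations
d = np.array([[0.0, 1.0, 4.0, 3.0],
              [1.0, 0.0, 2.0, 5.0],
              [4.0, 2.0, 0.0, 1.5],
              [3.0, 5.0, 1.5, 0.0]])

# inverse distance (the identity is added only to avoid division by zero on the diagonal)
s_inv = 1.0 / (d + np.eye(len(d)))
np.fill_diagonal(s_inv, 0.0)

# negative exponential decay with an assumed bandwidth
bandwidth = 2.0
s_exp = np.exp(-d / bandwidth)
np.fill_diagonal(s_exp, 0.0)

# difference from the maximum dissimilarity, d_max - d_ij
s_max = d.max() - d
np.fill_diagonal(s_max, 0.0)
```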

A similarity matrix \(S\) consisting of elements \(s_{ij}\) can be viewed as the basis for the adjacency matrix of a weighted undirected graph \(G = (V,E)\). In this graph, the vertices \(V\) are the observations and the edge between \(i\) and \(j\) is weighted by the strength of their similarity, \(s_{ij}\). This is identical to the interpretation of a spatial weights matrix (see Volume 1). As it turns out, the standard notation for the adjacency matrix is to use \(W\), as was the case for spatial weights.

In practice, the adjacency matrix that is used in the algorithm is typically not the same as the full similarity matrix. Instead, it follows from a transformation of the latter to a sparse form (see Section 8.4.1).
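One common way to obtain such a sparse form is to retain only each observation's k most similar neighbors. The sketch below illustrates this idea under assumed parameters; it should not be read as the specific construction of Section 8.4.1.

```python
import numpy as np

def knn_sparsify(s, k=3):
    """Keep the k largest off-diagonal similarities in each row,
    then symmetrize so the result is a valid undirected adjacency matrix."""
    n = s.shape[0]
    w = np.zeros_like(s)
    for i in range(n):
        row = s[i].copy()
        row[i] = -np.inf                  # exclude self-similarity
        neighbors = np.argsort(row)[-k:]  # indices of the k most similar observations
        w[i, neighbors] = s[i, neighbors]
    return np.maximum(w, w.T)             # symmetrize by taking the elementwise maximum
```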

The goal of graph partitioning is to delineate subgraphs that are internally strongly connected, but only weakly connected with the other subgraphs. In the ideal case, the subgraphs are so-called connected components. These are subgraphs whose elements are all connected to each other, directly or indirectly, but without any connections to the other subgraphs. A geographical example would be a collection of counties located on a number of islands. The counties on the same island are connected, but there are no connections between counties on different islands. As a result, the corresponding weights structure would be block-diagonal. In practice, this will rarely be the case.
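The island example can be illustrated with a small assumed adjacency matrix: the block-diagonal structure corresponds to two connected components, which can be verified with scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# adjacency for five counties on two islands: a 3x3 block and a 2x2 block
w = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

n_comp, labels = connected_components(csr_matrix(w), directed=False)
print(n_comp)   # 2 -- one component per island
print(labels)   # [0 0 0 1 1]
```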

The objective of graph partitioning then becomes one of finding a set of cuts in the graph that create \(k\) subsets, such that the internal connectivity is maximized and the in-between connectivity is minimized. A naive application of this principle would lead the least connected vertices to become singletons (similar to what tends to happen in single and average linkage hierarchical clustering) and all the rest to form one large cluster. Better suited partitioning methods include a weighting of the cluster size so as to end up with well-balanced subsets.1

A very important concept in this regard is the graph Laplacian associated with the adjacency matrix \(W\).

8.3.1 Graph Laplacian

In network analysis or graph theory, an important property of the adjacency matrix \(W\) associated with a graph is the degree of a vertex \(i\): \[d_i = \sum_j w_{ij},\] i.e., the row-sum of the similarity weights. This is the same concept as the cardinality of spatial weights, reflected in the connectivity histogram.

The graph Laplacian is defined as the following matrix: \[L = D - W,\] where \(D\) is a diagonal matrix containing the degree of each vertex. The graph Laplacian has the property that all its eigenvalues are real and non-negative, and, most importantly, that its smallest eigenvalue is zero.
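As a small numerical check, the sketch below (using an assumed toy adjacency matrix) computes the degrees, forms \(L = D - W\), and confirms that the smallest eigenvalue is numerically zero.

```python
import numpy as np

# assumed toy adjacency (similarity) matrix for four observations
w = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

d = w.sum(axis=1)                      # vertex degrees (row sums)
D = np.diag(d)
L = D - w                              # unnormalized graph Laplacian

eigenvalues = np.linalg.eigvalsh(L)    # real eigenvalues in ascending order
print(eigenvalues)                     # the first (smallest) is numerically zero
```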

In the (ideal) case where the adjacency matrix can be organized into separate partitions (unconnected to each other), the matrix takes on a block-diagonal form. Each block contains the adjacency sub-matrix for the matching group. The corresponding graph Laplacian will similarly have a block-diagonal structure. Since each of these sub-blocks is itself a graph Laplacian (for the sub-network corresponding to the partition), its smallest eigenvalue is zero as well. An important result is then that if the graph is partitioned into \(k\) disconnected blocks, the graph Laplacian will have \(k\) zero eigenvalues. This forms the basis for the logic of using the \(k\) smallest eigenvalues of \(L\) to find the corresponding clusters.
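To illustrate this result, the following sketch builds a block-diagonal adjacency matrix from three unconnected (assumed) components and counts the zero eigenvalues of the resulting Laplacian.

```python
import numpy as np
from scipy.linalg import block_diag

# one fully connected block of three vertices
block = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]], dtype=float)

# three copies of the block, unconnected to each other
w = block_diag(block, block, block)

L = np.diag(w.sum(axis=1)) - w
eigenvalues = np.linalg.eigvalsh(L)
print(np.sum(np.isclose(eigenvalues, 0.0)))   # 3 zero eigenvalues, one per block
```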

While it can be used to proceed with spectral clustering, the unnormalized Laplacian \(L\) has some undesirable properties. Instead, the preferred approach is to use a so-called normalized Laplacian.

There are two ways to normalize the adjacency matrix and thus the associated Laplacian. One is to row-standardize the adjacency matrix, or \(D^{-1}W\). This is exactly the same idea as row-standardizing a spatial weights matrix. When applied to the Laplacian, this yields: \[L_{rw} = D^{-1}D - D^{-1}W = I - D^{-1}W.\]

This is referred to as a random walk normalized graph Laplacian, since the row elements of \(D^{-1}W\) can be viewed as transition probabilities from vertex \(i\) to each of the other vertices \(j\). As with row-standardized spatial weights, the resulting normalized matrix is no longer symmetric, although its eigenvalues remain real, with the smallest eigenvalue being zero. The eigenvector associated with this zero eigenvalue is \(\iota\), a vector of ones.

A second transformation pre- and post-multiplies the Laplacian by \(D^{-1/2}\), the inverse of the square root of the degree matrix. This yields a symmetric normalized Laplacian as: \[L_{sym} = D^{-1/2}DD^{-1/2} - D^{-1/2}WD^{-1/2} = I - D^{-1/2}WD^{-1/2}.\] Again, the smallest eigenvalue is zero, but the associated eigenvector now becomes \(D^{1/2}\iota\).
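The sketch below (reusing the assumed toy adjacency matrix from the earlier example) constructs both normalized Laplacians and verifies these two eigenvector properties numerically.

```python
import numpy as np

# assumed toy adjacency matrix (same structure as in the earlier sketch)
w = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d = w.sum(axis=1)

L_rw  = np.eye(4) - np.diag(1.0 / d) @ w                          # I - D^{-1} W
L_sym = np.eye(4) - np.diag(d ** -0.5) @ w @ np.diag(d ** -0.5)   # I - D^{-1/2} W D^{-1/2}

ones = np.ones(4)
print(np.allclose(L_rw @ ones, 0.0))         # True: iota is the null eigenvector of L_rw
print(np.allclose(L_sym @ np.sqrt(d), 0.0))  # True: D^{1/2} iota is the null eigenvector of L_sym
```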

Spectral clustering algorithms differ by whether the unnormalized or normalized Laplacian is used to compute the \(k\) smallest eigenvalues and associated eigenvectors, and whether the Laplacian or the adjacency matrix is the basis for the calculation.

Specifically, as an alternative to using the smallest eigenvalues and associated eigenvectors of the normalized Laplacian, the largest eigenvalues/eigenvectors of the normalized adjacency (or affinity) matrix can be computed.

The standard eigenvalue expression is the following equality: \[Lu = \lambda u,\] where \(u\) is the eigenvector associated with eigenvalue \(\lambda\). Subtracting both sides of this expression from \(Iu = u\) yields: \[(I - L)u = (1 - \lambda)u.\] In other words, if the smallest eigenvalue of \(L\) is 0, then the largest eigenvalue of \(I - L\) is 1, with the same associated eigenvector. Moreover, for the symmetric normalized Laplacian: \[I - L_{sym} = I - [I - D^{-1/2}WD^{-1/2}] = D^{-1/2}WD^{-1/2}.\] This result implies that the search for the smallest eigenvalues of \(L_{sym}\) is equivalent to the search for the largest eigenvalues of \(D^{-1/2}WD^{-1/2}\), the normalized adjacency matrix.
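A quick numerical verification of this equivalence, again on an assumed toy matrix:

```python
import numpy as np

# assumed toy adjacency matrix
w = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
d_inv_sqrt = np.diag(w.sum(axis=1) ** -0.5)

A_norm = d_inv_sqrt @ w @ d_inv_sqrt     # normalized adjacency D^{-1/2} W D^{-1/2}
L_sym  = np.eye(4) - A_norm              # symmetric normalized Laplacian

vals_L, vecs_L = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
vals_A, vecs_A = np.linalg.eigh(A_norm)

print(np.isclose(vals_L[0], 1.0 - vals_A[-1]))                   # True: lambda_min(L_sym) = 1 - lambda_max(A_norm)
print(np.allclose(np.abs(vecs_L[:, 0]), np.abs(vecs_A[:, -1])))  # True: same eigenvector (up to sign)
```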


  1. For details, see Shi and Malik (2000), as well as the discussion of a “graph cut point of view” in von Luxburg (2007).