5.2 Dissimilarity

The overall objective in any clustering exercise is to obtain a grouping that minimizes the dissimilarity within each cluster.

Mathematically, the point of departure is an overall loss function that consists of summing the distances between all the pairs of observations: \[T = (1/2) \sum_{i = 1}^n \sum_{j = 1}^n d_{ij},\] where \(d_{ij}\) is some measure of dissimilarity, such as the Euclidean distance between the values at observations \(i\) and \(j\).21
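To make this concrete, a minimal sketch in Python (assuming NumPy and SciPy are available; the small data matrix is purely illustrative) computes \(T\) from the full matrix of pairwise Euclidean distances:

```python
# Total dissimilarity T = (1/2) * sum_i sum_j d_ij for a toy data set.
# The observations in X are hypothetical, chosen only for illustration.
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [5.0, 4.0],
              [6.0, 5.0]])   # n = 4 observations on 2 variables

D = cdist(X, X)              # n x n matrix of Euclidean distances d_ij
T = 0.5 * D.sum()            # each pair is counted twice, hence the 1/2
print(T)
```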

A cluster assignment can be symbolized by an encoder \(C\), such that each observation \(i\) is assigned to a cluster, \(C(i) = h\), where the cluster indicator \(h\) is an element of the set \(\{1, \dots, k\}\). The cluster labels themselves (\(h\)) are meaningless, and could just as well be letters or other distinct symbols. In agglomerative clustering, the assignment is bottom-up, with \(k\) starting at \(n\) (each observation is its own cluster) and eventually ending up at \(1\) (all observations are in a single cluster).
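In code, such an encoder can be represented simply as a vector of labels; a minimal sketch, continuing the hypothetical four-observation example:

```python
# The encoder C as a vector: entry i holds the cluster label h = C(i).
# The labels 1 and 2 are arbitrary symbols; any distinct values would do.
import numpy as np

C = np.array([1, 1, 2, 2])          # hypothetical assignment to k = 2 clusters
members_of_2 = np.where(C == 2)[0]  # indices i with C(i) = 2
print(members_of_2)                 # -> [2 3]
```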

At each stage, the loss function can be evaluated. In any given cluster \(h\), the distances from one of its members (\(i \in h\)) to all other observations can be separated into those involving observations that belong to the same cluster (\(j \in h\)) and those that do not (\(j \notin h\)): \[T_{i} = (1/2) [ \sum_{j \in h} d_{ij} + \sum_{j \notin h} d_{ij} ].\] This can be generalized to all the elements of cluster \(h\) by summing over \(i\): \[T_{i \in h} = (1/2) [ \sum_{i \in h} \sum_{j \in h} d_{ij} + \sum_{i \in h} \sum_{j \notin h} d_{ij} ].\] Finally, summing over all clusters \(h = 1, \dots, k\) accounts for every possible pair: \[T = (1/2) \sum_{h=1}^k [ \sum_{i \in h} \sum_{j \in h} d_{ij} + \sum_{i \in h} \sum_{j \notin h} d_{ij} ] = W + B,\] where the first term (\(W\)) is referred to as the within dissimilarity and the second (\(B\)) as the between dissimilarity.
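A minimal sketch of this decomposition, again using hypothetical data and labels, verifies that \(W + B\) reproduces the total \(T\):

```python
# Decompose the total dissimilarity into within (W) and between (B) parts.
# X, D, and C are the same hypothetical quantities as in the sketches above.
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 2.0], [2.0, 0.5], [5.0, 4.0], [6.0, 5.0]])
D = cdist(X, X)                # pairwise Euclidean distances
C = np.array([1, 1, 2, 2])     # hypothetical cluster labels

W = B = 0.0
for h in np.unique(C):
    in_h = (C == h)                              # members of cluster h
    W += 0.5 * D[np.ix_(in_h, in_h)].sum()       # within: i in h, j in h
    B += 0.5 * D[np.ix_(in_h, ~in_h)].sum()      # between: i in h, j not in h

print(W, B, W + B, 0.5 * D.sum())  # W + B equals the total dissimilarity T
```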

In other words, the total dissimilarity decomposes into one part due to what happens within each cluster and another part that pertains to the between-cluster dissimilarities. \(W\) and \(B\) are complementary, since \(T\) is fixed: the lower \(W\), the higher \(B\), and vice versa.

In hierarchical clustering, the loss function is not optimized per se, but rather it is evaluated for each \(k\) that results from a cut in the dendrogram (see Section 5.3.2).
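As an illustration, a minimal sketch (using SciPy's agglomerative clustering with Ward's linkage purely as an example; any linkage method could be substituted) evaluates the within dissimilarity \(W\) for a range of cuts of the dendrogram:

```python
# Evaluate W for successive cuts of a dendrogram (k = 1, ..., 5).
# The data are randomly generated and purely illustrative.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.cluster.hierarchy import linkage, cut_tree

rng = np.random.default_rng(12345)
X = rng.normal(size=(20, 2))              # n = 20 observations on 2 variables
D = cdist(X, X)                           # Euclidean distances d_ij

Z = linkage(X, method="ward")             # agglomerative hierarchy
for k in range(1, 6):
    labels = cut_tree(Z, n_clusters=k).ravel()
    W = sum(0.5 * D[np.ix_(labels == h, labels == h)].sum()
            for h in np.unique(labels))
    print(k, W)                           # W decreases as k grows (nested cuts)
```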


  21. Since each pair is counted twice, the total sum is divided by 2. While this seems arbitrary at this point, it helps in some calculations. For hierarchical clustering methods, whether or not the sum is divided by 2 is immaterial.↩︎