8.5 Implementation
Spectral clustering is invoked from the Clusters toolbar, as the next to last item in the classic clustering subset, shown in Figure 6.1. Alternatively, from the menu, it is selected as Clusters > Spectral.
The variable settings panel has the same general layout as for K-Means. The coordinates for the points are selected from the Select Variables dialog as x and y. In the Parameters panel, one set pertains to the construction of the Affinity (or adjacency) matrix, the others are the usual parameters for the K-Means algorithm that is applied to the transformed eigenvectors.
The Affinity option provides three alternatives, K-NN, Mutual K-NN and Gaussian, with specific parameters for each. In the example, the number of clusters is set to 2 and all options are kept to the default setting. This includes K-NN with 3 neighbors for the affinity matrix, and all the default settings for K-Means. The value of 3 for knn corresponds to \(\log_{10}(300) = 2.48\), rounded up to the next integer.
The Run button generates the cluster characteristics in the Summary panel. In the usual manner, this also brings up a new window with the cluster map, and saves the cluster classification as an integer variable to the data table.
8.5.1 Cluster results
The cluster map for the chosen parameter settings is shown in Figure 8.3. It yields a perfect separation of the two spirals, although that is by no means the case for all parameter settings, as will be illustrated below.
The cluster characteristics in Figure 8.4 list the parameter settings first, followed by the values for the cluster centers (the mean) for the two (standardized) coordinates and the decomposition of the sum of squares. The ratio of between sum of squares to total sum of squares is a dismal 0.04. This is not surprising, since this criterion provides a measure of the degree of compactness for the cluster, which a non-convex cluster like the spirals example does not meet.
In this example, it is easy to visually assess the extent to which the nonlinearity is captured. However, in the typical high-dimensional application, this will be much more of a challenge, since the usual measures of compactness may not be informative. A careful inspection of the distribution of the different variables across the observations in each cluster is therefore in order.
8.5.2 Options and Sensitivity Analysis
The results of spectral clustering are extremely sensitive to the parameters chosen to create the affinity matrix. Suggestions for default values are only suggestions, and the particular values many sometimes be totally unsuitable. Experimentation is therefor a necessity. There are two classes of parameters. One set pertains to the number of nearest Neighbors for K-NN or Mutual K-NN. The other set relates to the bandwidth of the Gaussian kernel, determined by the standard deviation Sigma.
8.5.2.1 K-nearest neighbor affinity matrix
The two default values for the number of nearest neighbors are contained in a drop-down list. In the spirals example, with n=300, \(\log_{10}(n) = 2.48\), which rounds up to 3, and \(\ln(n) = 5.70\), which rounds up to 6. These are the two default values provided. Any other value can be entered manually in the dialog. The options for a mutual knn affinity matrix have the same entries.
The results for the Mutual option with 3 nearest neighbors are shown in Figures 8.5 and 8.6. The separation is far from perfect, with cluster members in both spirals. The measures of fit are even worse than for the default case.
8.5.2.2 Gaussian kernel affinity matrix
The built-in options for sigma, the standard deviation of the Gaussian kernel are 0.707107, 3.477121, and 6.703782. The smallest value corresponds to \(\sqrt{1/p}\), where \(p\), the number of variables, equals 2 in the example. The other two values are \(\log_{10}(n) + 1\) and \(\ln(n) + 1\), yielding respectively 3.477121 and 6.703782 for n=300. In addition, any other value can be entered in the dialog.
The results for the first option (0.707107) are shown in Figures 8.7 and 8.8. The clusters totally fail to extract the shape of the separate spirals, although they are perfectly balanced. The layout looks similar to the results for K-Means and K-Medians in Figure 8.2. Interestingly, this layout scores much better on the BSS/TSS ratio (0.39). Rather than being an indication of a good separation, it suggests that the nonlinearity of the true clusters is not reflected in the grouping.
In order to find a solution that provides the same separation as in Figure 8.3, some experimentation with different values for \(\sigma\) is needed. As it turns out, the same result as for knn with 3 neighbors is obtained with \(\sigma = 0.08\) or \(0.07\), neither of which are even close to the default values. This illustrates how in an actual example, where the results cannot be readily visualized in two dimensions, it may be very difficult to find the parameter values that discover the true underlying patterns.