4.3 t-SNE

The t-SNE method, due to van der Maaten and Hinton (2008), refines the original SNE in two main respects: the conditional distribution is replaced by a joint distribution on \((i,j)\), and the Gaussian kernel is replaced by a heavy-tailed normalized Student-t kernel with a single degree of freedom.

The point of departure is a joint probability \(P\) for the points in high-dimensional multi-attribute space obtained from the original conditional probabilities as: \[p_{ij} = \frac{p(j|i) + p(i | j)}{2n},\] where \(n\) is the number of observations.
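As a simple illustration, this symmetrization can be written in a few lines of NumPy. The function below is only a sketch; it assumes a matrix `cond_p` holding the SNE conditional probabilities \(p(j|i)\) (zero diagonal, each row summing to one) has already been computed as described for the original SNE.

```python
import numpy as np

def joint_probabilities(cond_p):
    """Symmetrize SNE conditionals p(j|i) into joint probabilities p_ij.

    cond_p : (n, n) array with cond_p[i, j] = p(j|i), zero diagonal,
             each row summing to one.
    """
    n = cond_p.shape[0]
    # p_ij = (p(j|i) + p(i|j)) / (2n), so that the p_ij sum to one over all i != j
    return (cond_p + cond_p.T) / (2.0 * n)
```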

The counterpart of the distribution \(P\) for the high-dimensional points is a distribution \(Q\) for the points in a lower dimension. It is defined as: \[q_{ij} = \frac{(1 + ||z_i - z_j||^2)^{-1}}{\sum_{h \neq l} (1 + ||z_h - z_l||^2)^{-1}},\] where, as before, the \(z_i\) are the coordinates in the embedded (low dimensional) space. The denominator is simply a scaling factor to ensure that the individual joint probabilities sum to one. The numerator is inversely dependent on the Euclidean distance between the points \(z_i\) and \(z_j\). In other words, the closer \(i\) and \(j\) are in embedded space, the higher the associated probability \(q_{ij}\), and vice versa.
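The corresponding computation of \(Q\) from a candidate embedding is equally compact. The NumPy sketch below computes the pairwise Student-t kernel and normalizes it over all pairs; the unnormalized kernel is returned as well, since it reappears in the gradient expressions further on (the function name and return convention are choices made here for illustration).

```python
import numpy as np

def embedding_probabilities(Z):
    """Joint probabilities q_ij for an (n, d) embedding Z (Student-t, 1 df)."""
    sq_d = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # ||z_i - z_j||^2
    num = 1.0 / (1.0 + sq_d)        # (1 + ||z_i - z_j||^2)^{-1}
    np.fill_diagonal(num, 0.0)      # exclude the i == j terms
    Q = num / num.sum()             # divide by U = sum_{h != l} (1 + ||z_h - z_l||^2)^{-1}
    return Q, num
```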

The objective is to minimize the divergence between the distributions \(P\) and \(Q\). This yields a specific cost function, considered next.

4.3.1 Cost Function and Optimization

The objective of t-SNE is to find the set of coordinates \(z\) in embedded space that minimize the Kullback-Leibler divergence between the \(p_{ij}\) and corresponding \(q_{ij}\). This is formalized in the following expression: \[\min_z(C) = \min_z [KLIC(P || Q)] = \min_z \sum_{i \neq j} [ p_{ij} \ln p_{ij} - p_{ij} \ln q_{ij} ] .\] In essence, this boils down to trying to match pairs with a high \(p_{ij}\) (nearby points in high dimension) to pairs with a high \(q_{ij}\) (nearby points in the embedded space).
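In code, the cost is a direct transcription of this expression. The sketch below continues the NumPy examples above; terms with \(p_{ij} = 0\) are skipped, since they contribute nothing to the sum.

```python
import numpy as np

def kl_cost(P, Q, eps=1e-12):
    """KL(P || Q) = sum_{i != j} [ p_ij ln p_ij - p_ij ln q_ij ]."""
    mask = P > 0                    # p_ij = 0 terms drop out of the sum
    return float(np.sum(P[mask] * (np.log(P[mask]) - np.log(Q[mask] + eps))))
```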

Following standard calculus, in order to minimize the cost function, its gradient is needed. This is the first partial derivative of the function \(C\) with respect to the coordinates \(z_i\). It has the complex form: \[\frac{\partial C}{\partial z_i} = 4 \sum_{j \neq i}(p_{ij} - q_{ij})(1 + ||z_i - z_j ||^2)^{-1} (z_i - z_j).\]

The expression can be simplified somewhat by setting \(U = \sum_{h \neq l} (1 + ||z_h - z_l||^2)^{-1}\), i.e., the denominator in the \(q_{ij}\) probability. As a result, since \(q_{ij} = (1 + ||z_i - z_j ||^2)^{-1} / U\), the term \((1 + ||z_i - z_j ||^2)^{-1}\) can be replaced by \(q_{ij}U\), so that the gradient becomes: \[\frac{\partial C}{\partial z_i} = 4 \sum_{j \neq i}(p_{ij} - q_{ij})q_{ij}U (z_i - z_j).\]
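Written against the helper above, which already returns the unnormalized kernel \(q_{ij}U\), the gradient becomes a single matrix expression. This is a sketch of the simplified form, not an optimized implementation.

```python
import numpy as np

def tsne_gradient(P, Q, num, Z):
    """dC/dz_i = 4 * sum_{j != i} (p_ij - q_ij) * q_ij * U * (z_i - z_j).

    num holds (1 + ||z_i - z_j||^2)^{-1} = q_ij * U, as returned by
    embedding_probabilities().
    """
    W = (P - Q) * num                                # (p_ij - q_ij) q_ij U
    # sum_j W[i, j] * (z_i - z_j) = (row sum of W) * z_i - (W @ Z)_i
    return 4.0 * (np.diag(W.sum(axis=1)) - W) @ Z
```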

The optimization involves a gradient search that is adjusted with a learning rate and a momentum term to speed up the process. At each iteration \(t\), the value of \(z_i^t\) is updated as a function of its value at the previous iteration, \(z_i^{t-1}\), the gradient of the cost function, and an adjustment based on the previous change (to avoid over- and under-shooting). The contribution of the gradient is not used in full, but is multiplied by a fraction, the so-called learning rate, \(\eta\). Since taking the full gradient step tends to produce erratic updates, its effect is dampened. In addition, there is an adjustment in the form of the so-called momentum, \(\alpha(t)\), which adds a fraction of the change in the previous iteration, \((z_i^{t-1} - z_i^{t-2})\). The momentum is typically adjusted during the optimization process, hence the inclusion of the index \(t\).

In full then, each iteration proceeds as: \[z_i^t = z_i^{t-1} + \eta \frac{\partial C}{\partial z_i} + \alpha(t)(z_i^{t-1} - z_i^{t-2}).\]
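Putting the pieces together, one descent pass might look like the sketch below, building on the functions introduced earlier. The learning rate and momentum schedule shown are illustrative defaults only, and the step is taken in the direction that decreases the cost; practical implementations add refinements (such as early exaggeration and adaptive gains) that are omitted here.

```python
import numpy as np

def tsne_descent(P, Z0, eta=200.0, n_iter=1000):
    """Gradient descent with momentum on the embedding coordinates Z."""
    Z, Z_prev = Z0.copy(), Z0.copy()
    for t in range(n_iter):
        Q, num = embedding_probabilities(Z)          # defined above
        grad = tsne_gradient(P, Q, num, Z)           # defined above
        alpha = 0.5 if t < 250 else 0.8              # momentum schedule alpha(t)
        Z_new = Z - eta * grad + alpha * (Z - Z_prev)  # step against the gradient
        Z_prev, Z = Z, Z_new
    return Z
```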

The learning rate \(\eta\) and momentum \(\alpha(t)\) are parameters that need to be specified. The results can vary considerably depending on how these parameters are tuned. Furthermore, the algorithm is rather finicky and may require some trial and error to find a stable solution.16

4.3.2 Large Data Applications (Barnes-Hut)

The most recent implementation of the t-SNE optimization process uses advanced algorithms that speed up the computations and allow the method to scale to very large data sets, as outlined in van der Maaten (2014).

The approach is based on two simplifications, one pertaining to \(P\), the other to \(Q\). The first strategy is to make \(P\) sparse by setting to zero all the probabilities \(p_{ij}\) that are negligibly small. The second strategy applies a similar idea to the probabilities \(q_{ij}\): several individual points that are close together in the embedding are represented by a single central point, which reduces the number of individual \(q_{ij}\) that need to be evaluated. This is referred to as the Barnes-Hut optimization (Barnes and Hut 1986).

The t-SNE gradient is separated into two parts (in the same notation as before): \[\frac{\partial C}{\partial z_i} = 4 [\sum_{j \neq i} p_{ij}q_{ij}U(z_i - z_j) - \sum_{j \neq i} q_{ij}^2 U (z_i - z_j)],\] where the first part (the cross-products) represents an attractive force and the second part (the squared probabilities) represents a repulsive force. The simplification of the \(p_{ij}\) targets the first part (converting many products to a value of zero), whereas the simplification of \(q_{ij}\) addresses the second part (reducing the number of expressions that need to be evaluated).
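Using the dense matrices from the earlier sketches, the two forces can be computed separately. This is only meant to make the decomposition concrete; the whole point of the Barnes-Hut machinery is to avoid forming these full matrices for large \(n\).

```python
import numpy as np

def tsne_forces(P, Q, num, Z):
    """Attractive and repulsive parts of the gradient; dC/dz = f_attr - f_rep."""
    A = P * num                                      # p_ij q_ij U   (attraction)
    R = Q * num                                      # q_ij^2 U      (repulsion)
    f_attr = 4.0 * (np.diag(A.sum(axis=1)) - A) @ Z
    f_rep = 4.0 * (np.diag(R.sum(axis=1)) - R) @ Z
    return f_attr, f_rep
```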

4.3.2.1 Simplification of \(P\)

In order to simplify the \(p_{ij}\), point pairs that are not sufficiently close have their probability set to zero. The rationale for this is that the value of \(p_{ij}\) for those pairs that do not meet the closeness criterion is likely to be very small and hence can be ignored, similar to the logic behind Tobler’s law. The closeness criterion is related to the preferred perplexity, a rough measure of the target number of neighbors. In practice, only the \(\lfloor 3u \rfloor\) nearest neighbors of each point \(i\) are retained, with \(u\) as the perplexity; for all other pairs, \(p_{ij} = 0\). As a result of this choice, the value for perplexity set as an option can be at most \(n/3\). However, in van der Maaten and Hinton (2008), the value for perplexity is suggested to be between 5 and 50.

The parsing of \(p_{ij}\) is implemented by means of a so-called vantage-point tree (VP tree). In essence, this tree divides the space into those points that are closer than a given distance to a reference point (the vantage-point) and those that are farther away. By applying this recursively, the data get partitioned into smaller and smaller entities such that neighbors in the tree are likely to be neighbors in space (see Yianilos 1993 for technical details). The VP tree is used to carry out a nearest neighbor search so that all \(j\) that are not in the nearest neighbor set for a given \(i\) result in \(p_{ij} = 0\). Importantly, the \(p_{ij}\) do not change during the optimization, since they pertain to the high-dimensional space, so this calculation only needs to be done once.
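The actual implementation performs the neighbor search with a VP tree; purely for illustration, the sketch below uses scikit-learn's generic `NearestNeighbors` in its place and masks an already computed dense matrix of conditionals (a real implementation computes the conditionals directly on the sparse neighbor sets rather than forming the full matrix first).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sparsify_p(cond_p, X, perplexity):
    """Zero out p(j|i) for all j outside the floor(3u) nearest neighbors of i.

    cond_p : dense (n, n) conditional probabilities p(j|i)
    X      : (n, k) observations in high-dimensional space, used for the search
    """
    n = X.shape[0]
    k = int(np.floor(3 * perplexity))
    # generic k-nearest-neighbor search; the reference implementation uses a VP tree
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    _, idx = nn.kneighbors(X)                        # idx[i] = k nearest neighbors of i
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    sparse_p = np.where(mask, cond_p, 0.0)
    # renormalize each row so the retained p(j|i) still sum to one
    return sparse_p / sparse_p.sum(axis=1, keepdims=True)
```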

4.3.2.2 Simplification of \(Q\)

The simplification of \(Q\) relies on an efficient organization of the solution points as a quadtree, together with a replacement of points that are close together (in the same box in the quadtree) by a single center point.

At each iteration, a quadtree is created for the current layout, based on the current solution for the coordinates \(z\).17 The tree is traversed (depth-first) and at each node a decision is made whether the points in that cell can be effectively represented by their center. The logic behind this is that the distance from \(i\) to all the points \(j\) in the cell will be very similar, so rather than computing each individual \(q_{ij}^2U(z_i - z_j)\) for all the \(j\) in the cell, they are replaced by the single term \(n_c q^2_{ic}U (z_i - z_c)\), with \(q_{ic}U = (1 + ||z_i - z_c ||^2)^{-1}\), where \(n_c\) is the number of points in the cell and \(z_c\) are the coordinates of a central (representative) point. The denser the points are in a given cell, the greater the computational gain.18

A critical decision is the selection of those cells for which the simplification is appropriate. This is set by the \(\theta\) parameter to the optimization program, such that: \[\frac{r_c}{||z_i - z_c||} < \theta,\] where \(r_c\) is the length of the diagonal of the cell square. In other words, the cell is summarized when the diagonal of the cell is less than \(\theta\) times the distance between a point \(i\) and the center of the cell. With \(\theta\) set to zero, there is no simplification going on, and all the \(q_{ij}\) are computed at each iteration.
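A bare-bones version of this acceptance test, and of the summarized contribution it triggers, is sketched below; the quadtree construction and traversal themselves are omitted. Here `z_i` and `z_c` are the coordinates of point \(i\) and the cell center, `r_c` the cell diagonal, `n_c` the number of points in the cell, and `U` the current normalization term.

```python
import numpy as np

def cell_is_summary(z_i, z_c, r_c, theta):
    """Barnes-Hut acceptance test: summarize the cell when r_c / ||z_i - z_c|| < theta."""
    return r_c < theta * np.linalg.norm(z_i - z_c)

def summarized_repulsion(z_i, z_c, n_c, U):
    """Single-term stand-in for the n_c individual q_ij^2 U (z_i - z_j) contributions."""
    kernel = 1.0 / (1.0 + np.sum((z_i - z_c) ** 2))  # (1 + ||z_i - z_c||^2)^{-1} = q_ic U
    q_ic = kernel / U
    return n_c * (q_ic ** 2) * U * (z_i - z_c)
```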

This second simplification only makes a difference for truly large data sets.
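For practical work, scikit-learn's TSNE is one widely available implementation (not necessarily the one discussed in the text) that exposes these ingredients directly: `perplexity` corresponds to \(u\), `learning_rate` to \(\eta\), and, for the Barnes-Hut method, `angle` plays the role of the \(\theta\) parameter above. A minimal usage sketch, with placeholder data standing in for the multi-attribute observations:

```python
import numpy as np
from sklearn.manifold import TSNE

# placeholder data for the sketch: 500 observations on 10 attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

tsne = TSNE(
    n_components=2,
    perplexity=30,        # rough target number of neighbors (suggested range 5-50)
    learning_rate=200.0,  # eta in the update formula
    method="barnes_hut",  # use the Barnes-Hut approximation
    angle=0.5,            # theta: speed/accuracy trade-off for the approximation
    init="pca",
    random_state=0,
)
Z = tsne.fit_transform(X)  # (n, 2) embedded coordinates
print(Z.shape)
```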


  16. An excellent illustration of the sensitivity of the algorithm to various tuning parameters can be found in Wattenberg, Viégas, and Johnson (2016) (https://distill.pub/2016/misread-tsne/).

  17. A quadtree is a tree data structure that recursively partitions the space into squares as long as there are points in the resulting space. For example, the bounding box of the points is first divided into four squares. If one of those is empty, that becomes an endpoint in the tree (a leaf). For those squares with points in them, the process is repeated, again stopping whenever a square is empty. This structure allows for very fast searching of the tree. In three dimensions, the counterpart (using cubes) is called an octree.

  18. For further technical details, see van der Maaten (2014).