## 4.2 Basics of Information Theory

Two important concepts from information theory are part of the SNE logic. The first is the notion of
*Shannon entropy*, a measure of the information content of a distribution, the second is a measure of *fit*,
the *Kullback-Leibler Information Criterion*, or *KLIC*.

Consider a finite number of events \(i\), each with an associated (discrete) probability \(p_i\). The *entropy* of
this distribution is defined as:
\[H = - \sum p_i \ln(p_i).\]
The *maximum* entropy is obtained for the most *diverse* distribution, i.e., a distribution where each event has
equal probability. For example, with four equally likely events, \(p_i = 0.25\) for each, and the entropy follows as
\(H = -4 \times (0.25)(-1.386) = 1.386\). This is the highest possible value for four events. One way to view this
concept, is to think of it as a measure of difficulty to predict a result, e.g., if the four events were cards to
be drawn from a deck. With equal probability, it is the hardest to guess which card will be drawn. The least entropy
occurs when all the probability is concentrated in one event, which yields \(H = 0\). This is also the least diverse distribution.

To make this concept more concrete, consider two more distributions for the four events: one with probabilities 0.125, 0.125, 0.25 and 0.5; and one with probabilities 0.125, 0.125, 0.125 and 0.625. Straightforward calculations yield the entropy for the former as 1.213 and for the latter as 1.074. Because more probability is concentrated in a single event in the latter case, it has a smaller entropy.

In the context of SNE, entropy is rescaled into a measure of *perplexity*, as \(2^H\). For the three examples considered, the
corresponding perplexity is 2.614, 2.318, and 2.105. Perplexity is used as a measure of *how far* one needs to move from a given point to
find a sufficient number of neighbors, since the probability becomes a proxy for the distance between two points (i.e., what is the probability that they are neighbors). The lower the entropy, the more high(er) probability is concentrated in fewer events, which means one needs to go less far (smaller bandwidth) to find a desired number of neighbors. Indirectly, therefore, perplexity becomes a proxy for the bandwidth required.

The *Kullback-Leibler* divergence between two distributions is a measure of *fit*, with smaller values indicating a closer correspondence.
With two sets of probabilities \(p_i\) and \(q_i\), it is defined as:
\[KLIC(P || Q) = \sum_i p_i \ln \frac{p_i}{q_i}
= \sum_i [ p_i \ln p_i - p_i \ln q_i ].\]

In the example, the KLIC between the equal probability distribution and the second case is 0.173, and between the first and third is 0.291. The measure is not necessarily symmetric. For example, when the role of the first and third distributions are interchanged, the associated KLIC is 0.313.

The KLIC can be interpreted as the difference between the entropy of a distribution and a mixture, where the role of \(\ln p_i\) is taken by \(\ln q_i\) from the second distribution.

In SNE, the KLIC will be used in the objective function that minimizes the difference in fit between the probabilities in the original high dimensional space (\(p\)) and the associated probabilities in the low-dimensional, embedded space (\(q\)).

### 4.2.1 Stochastic Neighbors

In SNE, the concept of multi-attribute distance is replaced by a set of conditional probabilities. More precisely, a conditional probability is introduced for the original high-dimensional data of \(p (j | i)\) to express the probability that \(j\) is picked as a neighbor of \(i\). In other words, the distance between \(i\) and \(j\) is converted into a probability measure. Operationally, this is implemented by means of a
*kernel* function centered on \(i\). The kernel is a symmetric distribution such that the probability decreases with distance.

For example, using a normal (Gaussian) distribution centered at \(i\) with a variance of \(\sigma_i^2\) would yield the conditional distribution as:
\[ p_{j|i} = \frac{exp(-d(x_i,x_j)^2/\sigma_i^2)}{\sum_{h \neq i} exp(-d(x_i,x_h)^2/\sigma_i^2) },\]
with \(d(x_i, x_j) = ||x_i - x_j||\), i.e., the Euclidean distance between \(i\) and \(j\) in high-dimensional space, and, by convention, \(p_{i|i} = 0\). The variance \(\sigma_i^2\) determines the *bandwidth* of the kernel. It is chosen optimally for each point, such that a target *perplexity* is reached. While a complex concept, the perplexity intuitively relates to a smooth measure of the effective number of neighbors (fewer neighbors relates to a smaller perplexity). In regions with dense points, the variance \(\sigma_i^2\) should be smaller, whereas in areas with sparse points, it should be larger to reach the same number of neighbors. This is similar in spirit to selecting an adaptive bandwidth based on the \(k\) nearest neighbors.