Chapter 2 Principal Component Analysis (PCA)

The familiar curse of dimensionality affects analysis along two dimensions: the number of observations (big data) and the number of variables considered. The methods included in Volume 2 address this problem by reducing dimensionality, either in the number of observations (clustering) or in the number of variables (dimension reduction). The three chapters in Part I address the latter problem. The current chapter covers principal component analysis (PCA), a core method of both multivariate statistics and machine learning. Dimension reduction is particularly relevant in situations where many highly intercorrelated variables are available. In essence, the original variables are replaced by a smaller number of proxies that represent them well in terms of their statistical properties.
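To make the idea of replacing correlated variables by a few proxies concrete, the following is a minimal sketch (not taken from the text), assuming Python with numpy and scikit-learn. The simulated data and all variable names are purely illustrative: five variables are noisy copies of one latent factor, so a single component summarizes them well.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Simulate 200 observations of 5 highly intercorrelated variables:
# each is a noisy copy of the same latent factor (illustrative only).
latent = rng.normal(size=(200, 1))
X = latent + 0.1 * rng.normal(size=(200, 5))

pca = PCA().fit(X)
# With strong intercorrelation, the first component captures most of
# the variance, so one proxy can stand in for all five variables.
print(pca.explained_variance_ratio_)
scores = pca.transform(X)[:, 0]  # first principal component as the proxy
```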

Before delving into the formal derivation of principal components, the chapter briefly reviews some basic concepts from matrix algebra, focusing in particular on matrix decomposition. This is followed by a discussion of the mathematical properties of principal components and their implementation and interpretation.
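As a rough preview of the matrix-algebra machinery (a sketch under the standard textbook setup, not the book's own code), PCA can be obtained from the spectral decomposition of the correlation matrix: the eigenvalues give the component variances and the eigenvectors give the weights that turn the standardized variables into component scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # illustrative data
n = X.shape[0]
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize (z-scores)
R = (Xs.T @ Xs) / (n - 1)                          # correlation matrix

# Spectral decomposition R = V diag(lambda) V', with eigenvalues as
# component variances and eigenvectors as the component weights.
eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
order = np.argsort(eigvals)[::-1]                  # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xs @ eigvecs                              # principal component scores
```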

A distinctive characteristic of this chapter is the attention paid to spatializing the inherently non-spatial concept of principal components. This is achieved by exploiting geovisualization, linking, and brushing to represent the dimension reduction in geographic space. Of particular interest are principal component maps, and the connection between univariate local cluster maps for principal components and their multivariate counterparts.

The methods are illustrated using the Italy Community Banks sample data set.