Dimensionality Reduction with PCA
High-dimensional data is hard to analyze, difficult to interpret and visualize, and expensive to store and compute with. It is also often overcomplete: many dimensions are redundant and can be explained by a combination of other dimensions. High-dimensional data is frequently correlated and therefore has an intrinsic lower-dimensional structure. Dimensionality reduction exploits this structure and correlation to provide a compact representation of the data while losing as little information as possible, yet one that remains coherent enough to be used for analysis.
Principal Component Analysis (PCA) is an algorithm for linear dimensionality reduction. PCA finds projections $\hat{x}_n$ of the data points $x_n$ that are as similar to the original data as possible.
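One standard way to make this precise (a common formulation, assumed here rather than stated explicitly above) is to minimize the average squared reconstruction error between the data and its projections:

$$ \frac{1}{N} \sum_{n=1}^{N} \left\lVert x_n - \hat{x}_n \right\rVert^2 $$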
1. Compute the mean $\mu$ of the data matrix $X = [x_1 | \ldots | x_N]^{T} \in \mathbb{R}^{N \times D}$
2. Mean subtraction: Replace all data points $x_i$ with $\hat{x}_{i} = x_{i} - \mu $
3. Divide the data by its standard deviation in each dimension: $\hat{X}^{(d)} \leftarrow \hat{X}^{(d)} / \sigma(X^{(d)})$ for $d = 1, \ldots, D$
4. Compute the (orthonormal) eigenvectors and eigenvalues of the data covariance matrix $S = \frac{1}{N}\hat{X}^{T}\hat{X}$
5. Choose the eigenvectors associated with the $M$ largest eigenvalues to be the basis of the principal subspace.
6. Collect these eigenvectors in a matrix $B = [b_1, \ldots, b_M]$
7. Orthogonally project the data onto the principal subspace using the projection matrix $BB^{T}$ (a code sketch of the full procedure follows this list).
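The steps above translate almost line for line into NumPy. The following is a minimal sketch under the assumption that the data arrives as an $N \times D$ array; the function name `pca` and its argument names are illustrative, not taken from the text.

```python
import numpy as np

def pca(X, M):
    """Project the rows of X (an N x D array) onto the M-dimensional
    principal subspace, returned in the original D coordinates."""
    # Steps 1-2: compute the mean and subtract it from every data point.
    mu = X.mean(axis=0)
    X_hat = X - mu

    # Step 3: divide each dimension by its standard deviation.
    X_hat = X_hat / X_hat.std(axis=0)

    # Step 4: covariance matrix S = (1/N) X_hat^T X_hat and its
    # eigendecomposition; eigh suits the symmetric S and returns
    # eigenvalues in ascending order.
    S = X_hat.T @ X_hat / X_hat.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)

    # Steps 5-6: collect the eigenvectors of the M largest eigenvalues
    # as the columns of B.
    B = eigvecs[:, np.argsort(eigvals)[::-1][:M]]

    # Step 7: orthogonal projection via the projection matrix B B^T.
    return X_hat @ B @ B.T

# Usage: project 5-dimensional points onto a 2-dimensional subspace.
X = np.random.default_rng(0).normal(size=(100, 5))
X_proj = pca(X, M=2)
print(X_proj.shape)  # (100, 5): projections expressed in the original coordinates
```

Note that $BB^{T}\hat{x}_n$ lives in the original $D$-dimensional space; if the lower-dimensional codes themselves are wanted, they are given by $B^{T}\hat{x}_n \in \mathbb{R}^{M}$.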