PCA optimises by finding component axes that maximise the variance of the projected data:

  • Each new component axis is orthogonal to the previous ones; otherwise they would capture overlapping information.
  • Projection onto the component axes produces a feature vector whose features are ordered in descending order of variance.

PCA works by finding the directions (or, more generally, subspaces) onto which projections of the input data produce the largest variance.

Principal components are linear combinations of the input attributes, i.e. PCA is a linear projection of the data points.

Data

Given $N$ observations $x_i$ of dimension $D$, we collect the original data into an $N \times D$ matrix $X$:

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix}$$

The centered (mean-subtracted) matrix is called $B$, which is formed by subtracting the sample mean of each attribute from each observation:

$$B = X - \mathbf{1}_N\,\bar{x}^\top, \qquad b_i = x_i - \bar{x}$$

With $\bar{x}$ being the sample mean:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
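As a concrete running sketch (the toy $100 \times 3$ data matrix and all variable names below are purely illustrative, not part of the notes), centering looks like:

```python
import numpy as np

# Toy data: N = 100 observations of dimension D = 3 (rows are observations).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

x_bar = X.mean(axis=0)   # sample mean of each attribute
B = X - x_bar            # centered (mean-subtracted) data matrix
```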

Variance

Since PCA maximises variance, let's recap what variance means (for scalar observations):

$$\operatorname{var}(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

Since we’re working with vector observations, we make use of the covariance matrix instead:

$$\operatorname{cov}(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top$$

  • diagonals: the variance of each attribute
  • off-diagonals:
    • symmetric
    • the covariance between a pair of input attributes

We call the covariance matrix of the data $S$, and it is a $D \times D$ matrix:

$$S = \frac{1}{N-1} B^\top B$$
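Continuing the sketch above (same assumed variables):

```python
# Sample covariance matrix S (D x D) from the centered matrix B.
N = B.shape[0]
S = (B.T @ B) / (N - 1)

# Sanity check against numpy's estimator (rowvar=False: rows are observations).
assert np.allclose(S, np.cov(X, rowvar=False))
```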

Finding the first principal component (PC1)

Consider a potential feature vector $u_1$ (constrained to be a unit vector, $u_1^\top u_1 = 1$); we know the score of an observation $x_i$ for this feature to be

$$z_i = u_1^\top x_i$$

This means that the sample mean of the projected data is now

$$\bar{z} = \frac{1}{N}\sum_{i=1}^{N} u_1^\top x_i = u_1^\top \bar{x}$$

Also, we calculate the new variance of the projected data (remember the property $(u_1^\top x)^2 = u_1^\top x\, x^\top u_1$):

$$\operatorname{var}(z) = \frac{1}{N-1}\sum_{i=1}^{N} \left(u_1^\top x_i - u_1^\top \bar{x}\right)^2 = u_1^\top S u_1$$
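Continuing the numeric sketch, this identity can be checked for an arbitrary unit vector (the vector below is just an example):

```python
# Variance of the data projected onto an arbitrary unit vector u.
u = np.array([1.0, 0.0, 0.0])   # any unit vector (D = 3 in the toy example)
z = X @ u                       # score of each observation for this direction

# var(z) computed directly matches u^T S u (ddof=1 to match S's normalisation).
assert np.isclose(z.var(ddof=1), u @ S @ u)
```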

Now, since PCA is concerned with maximising variance, we can state the problem as:

$$\max_{u_1}\; u_1^\top S u_1 \quad \text{subject to} \quad u_1^\top u_1 = 1$$

This problem lends itself to being solved through Lagrange multipliers.

Setting up the Lagrangian and solving yields (see Matrix Calculus to help with the differentiation; note that $S$, the covariance matrix, is symmetric):

$$L(u_1, \lambda_1) = u_1^\top S u_1 + \lambda_1\left(1 - u_1^\top u_1\right), \qquad \frac{\partial L}{\partial u_1} = 2 S u_1 - 2\lambda_1 u_1 = 0 \;\implies\; S u_1 = \lambda_1 u_1$$

The solution is clearly an eigenvalue problem, with $u_1$ being an eigenvector of $S$, meaning there are $D$ possible solutions for $u_1$. We find the optimal one by plugging the solution back into the original objective $u_1^\top S u_1$. However:

$$u_1^\top S u_1 = u_1^\top \lambda_1 u_1 = \lambda_1$$

This shows that you must pick the eigenvector corresponding to the largest eigenvalue to maximise the variance.

The eigenvector corresponding to the largest eigenvalue $\lambda_1$ is our first principal component $u_1$.
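Continuing the sketch, one way to get PC1 (using `np.linalg.eigh`, which assumes a symmetric matrix and returns eigenvalues in ascending order):

```python
# First principal component: eigenvector of S with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # eigenvector for the largest eigenvalue
var_pc1 = eigvals[-1]                  # variance captured along u1 (= lambda_1)
```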

Finding the rest of the principal components

Since PCA requires principal components to be orthogonal, we simply add an additional orthogonality constraint (with its own Lagrange multiplier) when maximising the variance: for the second component, maximise $u_2^\top S u_2$ subject to $u_2^\top u_2 = 1$ and $u_2^\top u_1 = 0$.

This creates the Lagrangian:

$$L(u_2, \lambda_2, \phi) = u_2^\top S u_2 + \lambda_2\left(1 - u_2^\top u_2\right) + \phi\, u_2^\top u_1$$

Setting the derivative of this Lagrangian with respect to $u_2$ to zero gives $2 S u_2 - 2\lambda_2 u_2 + \phi u_1 = 0$. We can prove that $\phi$ is zero by multiplying this with $u_1^\top$:

$$2 u_1^\top S u_2 - 2\lambda_2 u_1^\top u_2 + \phi\, u_1^\top u_1 = 0 \;\implies\; \phi = 0,$$

since $u_1^\top S u_2 = (S u_1)^\top u_2 = \lambda_1 u_1^\top u_2 = 0$, $u_1^\top u_2 = 0$ and $u_1^\top u_1 = 1$.

With this:

$$S u_2 = \lambda_2 u_2$$

The solutions are once again eigenvectors of $S$; however, since $u_2$ must be orthogonal to $u_1$, the maximiser is the eigenvector corresponding to the next-largest eigenvalue $\lambda_2$.

Therefore, the $n$-th principal component is the eigenvector corresponding to the $n$-th largest eigenvalue.

Finding Principal Components In Practice

We now know what makes up the principal component matrix, henceforth referred to as $U$ (the eigenvector matrix, i.e. the matrix containing the principal component vectors $u_i$ as its columns).

We want a way to find the eigenvectors and eigenvalues of the covariance matrix $S$.

There are two approaches: an eigendecomposition of the covariance matrix $S$, or a singular value decomposition (SVD) of the centered data matrix $B$, whose right singular vectors are the eigenvectors of $S$ (with $\lambda_i = \sigma_i^2 / (N-1)$).
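A minimal sketch of both routes on the running example (assuming, as above, that the two approaches are the eigendecomposition and the SVD; variable names are illustrative):

```python
# Approach 1: eigendecomposition of S, sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, U = eigvals[order], eigvecs[:, order]

# Approach 2: SVD of the centered data, B = W @ diag(sigma) @ Vt.
W, sigma, Vt = np.linalg.svd(B, full_matrices=False)
lam_svd = sigma**2 / (N - 1)     # eigenvalues of S recovered from singular values

# Both give the same spectrum; eigenvectors agree up to sign.
assert np.allclose(lam, lam_svd)
assert np.allclose(np.abs(U), np.abs(Vt.T))
```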

Projecting the data

Now consider the PCA transformation:

$$z_i = U^\top (x_i - \bar{x})$$

Or in matrix form:

$$Z = B U$$

Since $U$ is orthogonal, we’ve simply shifted the origin of the coordinate axes to the mean of the data and rotated and/or reflected the coordinate axes around the mean to coincide with the principal components. I.e. we have only changed the basis, and no dimensionality reduction has taken place.

The sample covariance of the projected data is $\frac{1}{N-1} Z^\top Z = U^\top S U = \Lambda$ (the diagonal matrix of eigenvalues), therefore the transformed covariance matrix is diagonal with the eigenvalues of $S$ along its diagonal. This also means the transformed data’s components are uncorrelated, since the off-diagonal entries represent correlation (covariance between two features).

Forming a new matrix $U_M$ from only the first $M$ eigenvectors ($M < D$ indicating the new number of columns), we are performing dimensionality reduction!

$$Z_M = B\, U_M$$

With $U_M$ (a $D \times M$ matrix) containing the eigenvectors corresponding to the $M$ largest eigenvalues $\lambda_1, \dots, \lambda_M$.

We can also consider the transformed data in terms of the SVD of the centered data, $B = W \Sigma V^\top$ (whose right singular vectors coincide with the eigenvectors, $V = U$):

$$Z = B U = W \Sigma V^\top U = W \Sigma, \qquad Z_M = W_M \Sigma_M$$
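Continuing the sketch (with `M = 2` chosen purely for illustration):

```python
# Dimensionality reduction: keep only the first M principal components.
M = 2
U_M = U[:, :M]                  # (D, M) matrix of the top-M eigenvectors
Z_M = B @ U_M                   # (N, M) projected data

# Equivalent via the SVD factors (up to the same per-column sign ambiguity).
Z_M_svd = W[:, :M] * sigma[:M]
assert np.allclose(np.abs(Z_M), np.abs(Z_M_svd))
```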

Recover data from projection

With dimensionality reduction, a perfect reconstruction is impossible. However, we can get an approximate reconstruction, and it is exact if there was no loss during projection (i.e. if all $D$ components are kept):

$$\tilde{x}_i = U_M z_i + \bar{x}$$

Or in matrix form:

$$\tilde{X} = Z_M U_M^\top + \mathbf{1}_N\,\bar{x}^\top$$

Or in terms of the SVD:

$$\tilde{X} = W_M \Sigma_M U_M^\top + \mathbf{1}_N\,\bar{x}^\top$$
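A short continuation of the sketch:

```python
# Approximate reconstruction from the M-dimensional projection.
X_recon = Z_M @ U_M.T + x_bar

# With all D components kept, the reconstruction is exact (up to float error).
assert np.allclose((B @ U) @ U.T + x_bar, X)
```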

Data Loss

Since the reconstructed data is of the form

$$\tilde{x}_i = \bar{x} + \sum_{j=1}^{M} z_{ij}\, u_j$$

The data loss (the average squared reconstruction error, with the same normalisation as $S$) is therefore the sum of the discarded eigenvalues:

$$J = \frac{1}{N-1}\sum_{i=1}^{N} \lVert x_i - \tilde{x}_i \rVert^2 = \sum_{j=M+1}^{D} \lambda_j$$

The amount of variance retained is equal to

$$\frac{\sum_{j=1}^{M} \lambda_j}{\sum_{j=1}^{R} \lambda_j}$$

With $R$ being the rank of the covariance matrix, usually $R = D$.
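And a final check on the running example (same illustrative variables as above):

```python
# Fraction of total variance retained by the first M components.
retained = lam[:M].sum() / lam.sum()

# The data loss equals the sum of the discarded eigenvalues
# (same 1/(N-1) normalisation as S).
loss = ((X - X_recon) ** 2).sum() / (N - 1)
assert np.isclose(loss, lam[M:].sum())
```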