PCA optimises by finding component axes that maximise the variation:
- Each new component axis is orthogonal to the previous ones; if it weren’t, it would capture overlapping information.
- Projection onto the component axes produces a feature vector with features ordered in descending order of variation.
PCA works by finding Euclidean planes onto which projections of the input data produce the largest variation.
Principal Components are linear combinations of the input attributes, i.e. PCA is a linear projection of the data points.
Data
Given $N$ observations $x_n$ of dimension $D$, we collect the original data into an $N \times D$ matrix $X$ (one observation per row).
The centered (mean-subtracted) matrix is called $B$, which is obtained by subtracting the sample mean of each attribute from each observation:
$$B = X - \mathbf{1}_N \bar{x}^\top, \qquad b_n = x_n - \bar{x}$$
With $\bar{x}$ being the sample mean:
$$\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
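As a minimal NumPy sketch of the centering step (the toy data and the names `X`, `B`, `x_bar`, `N`, `D` are illustrative assumptions, not from these notes; the same arrays are reused in the later sketches):

```python
import numpy as np

# Toy data: N observations (rows), each of dimension D (columns).
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))

x_bar = X.mean(axis=0)   # sample mean of each attribute, shape (D,)
B = X - x_bar            # centered (mean-subtracted) data matrix
```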
Variance
Since PCA maximises variance, let’s recap what variance means:
$$\operatorname{var}(x) = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})^2$$
Since we’re working with vector observations, we make use of covariance instead:
$$S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})^\top = \frac{1}{N} B^\top B$$
- diagonals: the variance of each attribute
- off-diagonals: the covariance between each pair of attributes
- symmetric, since $\operatorname{cov}(x_i, x_j) = \operatorname{cov}(x_j, x_i)$
We call $S$ the covariance matrix of the data, and it is a $D \times D$ matrix.
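Continuing the sketch above, the covariance matrix can be formed directly from the centered data (assuming the $\frac{1}{N}$ normalisation used here; `np.cov` defaults to $\frac{1}{N-1}$):

```python
# Sample covariance matrix S = (1/N) B^T B, a D x D symmetric matrix.
S = (B.T @ B) / N

# Cross-check against NumPy's built-in (bias=True selects the 1/N normalisation).
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
```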
Finding the first principal component (PC1)
Consider a potential feature vector $w_1$; we know the score of an observation $x_n$ for this feature to be
$$z_n = w_1^\top x_n$$
This means that the sample mean of the projected data is now
$$\bar{z} = \frac{1}{N}\sum_{n=1}^{N} w_1^\top x_n = w_1^\top \bar{x}$$
Also, we calculate the new covariance (here a scalar variance) of the projected data (remember the property $(AB)^\top = B^\top A^\top$):
$$\frac{1}{N}\sum_{n=1}^{N}\left(w_1^\top x_n - w_1^\top \bar{x}\right)^2 = w_1^\top S\, w_1$$
Now, since PCA is concerned with maximising variance, we can calculate $w_1$ as:
$$w_1 = \arg\max_{w_1}\; w_1^\top S\, w_1 \quad \text{subject to} \quad w_1^\top w_1 = 1$$
where the unit-length constraint stops the variance growing without bound simply by scaling $w_1$.
This problem lends itself to being solved through Lagrange multipliers. Setting up the Lagrangian
$$\mathcal{L}(w_1, \lambda_1) = w_1^\top S\, w_1 + \lambda_1\left(1 - w_1^\top w_1\right)$$
and solving $\frac{\partial \mathcal{L}}{\partial w_1} = 0$ yields (see Matrix Calculus to help with the differentiation; note $S$, the covariance matrix, is symmetric):
$$2 S w_1 - 2\lambda_1 w_1 = 0 \;\implies\; S w_1 = \lambda_1 w_1$$
The solution is clearly an eigenvalue problem, with $w_1$ being an eigenvector of $S$, meaning there are up to $D$ possible solutions for $w_1$; we find the optimal one by plugging it back into the original objective. However:
$$w_1^\top S\, w_1 = \lambda_1 w_1^\top w_1 = \lambda_1$$
shows that you must pick the eigenvector corresponding to the largest eigenvalue to maximise the variance.
The eigenvector corresponding to the largest eigenvalue $\lambda_1$ is our first principal component $w_1$.
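A short sketch of extracting PC1 from the eigendecomposition (continuing the toy example; `np.linalg.eigh` is used since $S$ is symmetric):

```python
# eigh returns eigenvalues in ascending order for a symmetric matrix.
eigvals, eigvecs = np.linalg.eigh(S)

w1 = eigvecs[:, -1]    # eigenvector of the largest eigenvalue -> first PC
var_pc1 = eigvals[-1]  # variance captured along w1
assert np.isclose(w1 @ S @ w1, var_pc1)  # w1^T S w1 = lambda_1
```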
Finding the rest of the principal components
Since PCA requires principal components to be orthogonal, we simply add an additional constraint (for the second component, $w_2^\top w_1 = 0$) and Lagrange multiplier to the variance maximisation.
This creates the Lagrangian:
$$\mathcal{L}(w_2, \lambda_2, \mu) = w_2^\top S\, w_2 + \lambda_2\left(1 - w_2^\top w_2\right) + \mu\, w_2^\top w_1$$
We can prove that $\mu$ is zero by multiplying the stationarity condition $2 S w_2 - 2\lambda_2 w_2 + \mu w_1 = 0$ with $w_1^\top$:
$$2 w_1^\top S\, w_2 - 2\lambda_2 w_1^\top w_2 + \mu\, w_1^\top w_1 = 0 \;\implies\; \mu = 0$$
since $w_1^\top S w_2 = \lambda_1 w_1^\top w_2 = 0$, $w_1^\top w_2 = 0$ and $w_1^\top w_1 = 1$.
With this:
$$S w_2 = \lambda_2 w_2$$
The solutions are once again the eigenvectors of $S$; however, since $w_2$ must be orthogonal to $w_1$, it is the eigenvector of the next biggest eigenvalue, $\lambda_2$.
Therefore, the $n$-th principal component is the eigenvector corresponding to the $n$-th biggest eigenvalue.
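Sorting the full eigendecomposition by descending eigenvalue gives all principal components at once (a sketch continuing the example above):

```python
# All principal components, ordered by descending eigenvalue.
order = np.argsort(eigvals)[::-1]
Lam = eigvals[order]    # eigenvalues lambda_1 >= ... >= lambda_D
W = eigvecs[:, order]   # columns w_1, ..., w_D: the principal components

# The components are orthonormal: W^T W = I.
assert np.allclose(W.T @ W, np.eye(D))
```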
Finding Principal Components In Practice
Now that we know what makes up the principal component matrix, henceforth referred to as $W$ (the eigenvector matrix, containing the principal component vectors $w_i$ as its columns), we want a way to find the eigenvectors and eigenvalues of the covariance matrix $S$.
There are two approaches (see the sketch after this list):
- Eigendecomposition of the covariance matrix, $S = W \Lambda W^\top$
- The SVD of the centered data, $B = U \Sigma V^\top$, where the right singular vectors $V$ are the eigenvectors of $S$ and the eigenvalues are $\lambda_i = \sigma_i^2 / N$ (since $S = \frac{1}{N} B^\top B = V \frac{\Sigma^2}{N} V^\top$)
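A sketch comparing the two approaches on the toy data (the relationship between eigenvalues and singular values is as stated in the list above; eigenvector columns may differ by a sign):

```python
# Approach 1: eigendecomposition of the covariance matrix S.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
W_eig, lam_eig = eigvecs[:, order], eigvals[order]

# Approach 2: SVD of the centered data, B = U Sigma V^T.
U, sing, Vt = np.linalg.svd(B, full_matrices=False)
W_svd = Vt.T            # right singular vectors = eigenvectors of S
lam_svd = sing**2 / N   # eigenvalues of S from the singular values

assert np.allclose(lam_eig, lam_svd)
assert np.allclose(np.abs(W_eig), np.abs(W_svd))  # equal up to a sign per column
```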
Projecting the data
Now consider the PCA transformation:
$$z_n = W^\top (x_n - \bar{x})$$
Or in matrix form:
$$Z = B W$$
Since $W$ is orthogonal, we’ve simply shifted the origin of the coordinate axes to the mean of the data and rotated and/or reflected the coordinate axes around the mean to coincide with the principal components. I.e. we have only changed the basis and no dimensionality reduction has taken place.
The sample covariance of the transformed data is $\frac{1}{N} Z^\top Z = W^\top S\, W = \Lambda$, therefore the transformed covariance matrix is diagonal with the eigenvalues of $S$ along its diagonal. This also means the transformed data’s components are uncorrelated, since the off-diagonals represent correlation (covariance between two features).
Forming a new matrix $W_L$ from only the first $L < D$ principal components ($L$ indicating the new number of columns), we are performing dimensionality reduction!
$$Z_L = B W_L$$
With $W_L$ (a $D \times L$ matrix) containing the eigenvectors corresponding to the $L$ largest eigenvalues.
We can also consider the transformed data in terms of the SVD $B = U \Sigma V^\top$:
$$Z = B W = U \Sigma V^\top V = U \Sigma, \qquad Z_L = U_L \Sigma_L$$
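Projecting the toy data onto the first $L$ components (here `L = 2` is an arbitrary illustrative choice) and checking that the full projection has a diagonal covariance:

```python
# Dimensionality reduction: keep only the first L principal components.
L = 2
W_L = W[:, :L]   # D x L matrix of the top-L eigenvectors
Z_L = B @ W_L    # N x L matrix of scores

# Full projection: covariance is diagonal, i.e. the components are uncorrelated.
Z = B @ W
assert np.allclose((Z.T @ Z) / N, np.diag(Lam))
```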
Recover data from projection
With dimensionality reduction, a perfect reconstruction is generally impossible. However, we can get an approximate reconstruction, which is exact if there was no loss during projection:
$$\hat{x}_n = W_L z_n + \bar{x}$$
Or in matrix form:
$$\hat{X} = Z_L W_L^\top + \mathbf{1}_N \bar{x}^\top$$
Or in terms of the SVD:
$$\hat{X} = U_L \Sigma_L V_L^\top + \mathbf{1}_N \bar{x}^\top$$
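A sketch of the reconstruction, continuing the example (with `L = D` the reconstruction is exact, since only the basis was changed):

```python
# Approximate reconstruction from the L-dimensional scores.
X_hat = Z_L @ W_L.T + x_bar   # broadcasting adds the mean back to every row

# Keeping all D components recovers the original data exactly.
X_exact = (B @ W) @ W.T + x_bar
assert np.allclose(X_exact, X)
```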
Data Loss
Since the reconstructed data is of the form
$$\hat{X} = B W_L W_L^\top + \mathbf{1}_N \bar{x}^\top$$
The data loss is therefore
$$J = \frac{1}{N}\sum_{n=1}^{N}\left\lVert x_n - \hat{x}_n \right\rVert^2 = \sum_{i=L+1}^{R} \lambda_i$$
The amount of variance retained is equal to
$$\frac{\sum_{i=1}^{L} \lambda_i}{\sum_{i=1}^{R} \lambda_i}$$
With $R$ being the rank of the data matrix, usually $R = D$ when there are more observations than dimensions.
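Checking the loss and retained-variance formulas numerically on the toy example (where the rank equals $D$):

```python
# Per-observation reconstruction error equals the sum of the discarded eigenvalues.
loss = np.sum((X - X_hat) ** 2) / N
assert np.isclose(loss, Lam[L:].sum())

# Fraction of variance retained by the first L components.
retained = Lam[:L].sum() / Lam.sum()
print(f"Variance retained with L={L}: {retained:.3f}")
```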