PCA optimises by finding component axes that maximise the variance of the projected data:

  • Each new component axis is orthogonal to the previous ones; otherwise they would capture overlapping information.
  • Projection onto the component axes produces a feature vector whose features are ordered in descending order of variance.

PCA works by finding the directions (or, more generally, subspaces) onto which projections of the input data produce the largest variance.

Principal components are linear combinations of the input attributes, i.e. PCA is a linear projection of the data points.

Data

Given $N$ observations $x_i$ of dimension $D$, we collect the original data into an $N \times D$ matrix $X$:

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_N^\top \end{bmatrix}$$

The centered (mean-subtracted) matrix is called $B$, which is formed by subtracting the sample mean of each attribute from each observation:

$$B = X - \mathbf{1}_N\,\bar{x}^\top, \qquad b_i = x_i - \bar{x}$$

With $\bar{x}$ being the sample mean:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
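As a concrete running sketch (the toy $100 \times 3$ data matrix and all variable names below are purely illustrative, not part of the notes), centering looks like:

```python
import numpy as np

# Toy data: N = 100 observations of dimension D = 3 (rows are observations).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

x_bar = X.mean(axis=0)   # sample mean of each attribute
B = X - x_bar            # centered (mean-subtracted) data matrix
```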

Variance

Since PCA maximises variance, let's recap what variance means (for scalar observations):

$$\operatorname{var}(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

Since we’re working with vector observations, we make use of the covariance matrix instead:

$$\operatorname{cov}(x) = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^\top$$

  • diagonals: the variance of each attribute
  • off-diagonals:
    • symmetric
    • the covariance between a pair of input attributes

We call the covariance matrix of the data $S$, and it is a $D \times D$ matrix:

$$S = \frac{1}{N-1} B^\top B$$
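Continuing the sketch above (same assumed variables):

```python
# Sample covariance matrix S (D x D) from the centered matrix B.
N = B.shape[0]
S = (B.T @ B) / (N - 1)

# Sanity check against numpy's estimator (rowvar=False: rows are observations).
assert np.allclose(S, np.cov(X, rowvar=False))
```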

Finding the first principal component (PC1)

Consider a potential feature vector $u_1$ (constrained to be a unit vector, $u_1^\top u_1 = 1$); we know the score of an observation $x_i$ for this feature to be

$$z_i = u_1^\top x_i$$

This means that the sample mean of the projected data is now

$$\bar{z} = \frac{1}{N}\sum_{i=1}^{N} u_1^\top x_i = u_1^\top \bar{x}$$

Also, we calculate the new variance of the projected data (remember the property $(u_1^\top x)^2 = u_1^\top x\, x^\top u_1$):

$$\operatorname{var}(z) = \frac{1}{N-1}\sum_{i=1}^{N} \left(u_1^\top x_i - u_1^\top \bar{x}\right)^2 = u_1^\top S u_1$$
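Continuing the numeric sketch, this identity can be checked for an arbitrary unit vector (the vector below is just an example):

```python
# Variance of the data projected onto an arbitrary unit vector u.
u = np.array([1.0, 0.0, 0.0])   # any unit vector (D = 3 in the toy example)
z = X @ u                       # score of each observation for this direction

# var(z) computed directly matches u^T S u (ddof=1 to match S's normalisation).
assert np.isclose(z.var(ddof=1), u @ S @ u)
```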

Now, since PCA is concerned with maximising variance, we can state the problem as:

$$\max_{u_1}\; u_1^\top S u_1 \quad \text{subject to} \quad u_1^\top u_1 = 1$$

This problem lends itself to being solved through Lagrange multipliers.

Setting up the Lagrangian and solving yields (see Matrix Calculus to help with the differentiation; note that $S$, the covariance matrix, is symmetric):

$$L(u_1, \lambda_1) = u_1^\top S u_1 + \lambda_1\left(1 - u_1^\top u_1\right), \qquad \frac{\partial L}{\partial u_1} = 2 S u_1 - 2\lambda_1 u_1 = 0 \;\implies\; S u_1 = \lambda_1 u_1$$

The solution is clearly an eigenvalue problem, with $u_1$ being an eigenvector of $S$, meaning there are $D$ possible solutions for $u_1$. We find the optimal one by plugging the solution back into the original objective $u_1^\top S u_1$. However:

$$u_1^\top S u_1 = u_1^\top \lambda_1 u_1 = \lambda_1$$

This shows that you must pick the eigenvector corresponding to the largest eigenvalue to maximise the variance.

The eigenvector corresponding to the largest eigenvalue $\lambda_1$ is our first principal component $u_1$.
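Continuing the sketch, one way to get PC1 (using `np.linalg.eigh`, which assumes a symmetric matrix and returns eigenvalues in ascending order):

```python
# First principal component: eigenvector of S with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
u1 = eigvecs[:, -1]                    # eigenvector for the largest eigenvalue
var_pc1 = eigvals[-1]                  # variance captured along u1 (= lambda_1)
```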

Finding the rest of the principal components

Since PCA requires principal components to be orthogonal, we simply add an additional orthogonality constraint (with its own Lagrange multiplier) when maximising the variance: for the second component, maximise $u_2^\top S u_2$ subject to $u_2^\top u_2 = 1$ and $u_2^\top u_1 = 0$.

This creates the Lagrangian:

$$L(u_2, \lambda_2, \phi) = u_2^\top S u_2 + \lambda_2\left(1 - u_2^\top u_2\right) + \phi\, u_2^\top u_1$$

Setting the derivative of this Lagrangian with respect to $u_2$ to zero gives $2 S u_2 - 2\lambda_2 u_2 + \phi u_1 = 0$. We can prove that $\phi$ is zero by multiplying this with $u_1^\top$:

$$2 u_1^\top S u_2 - 2\lambda_2 u_1^\top u_2 + \phi\, u_1^\top u_1 = 0 \;\implies\; \phi = 0,$$

since $u_1^\top S u_2 = (S u_1)^\top u_2 = \lambda_1 u_1^\top u_2 = 0$, $u_1^\top u_2 = 0$ and $u_1^\top u_1 = 1$.

With this:

$$S u_2 = \lambda_2 u_2$$

The solutions are once again eigenvectors of $S$; however, since $u_2$ must be orthogonal to $u_1$, the maximiser is the eigenvector corresponding to the next-largest eigenvalue $\lambda_2$.

Therefore, the $n$-th principal component is the eigenvector corresponding to the $n$-th largest eigenvalue.

Finding Principal Components In Practice

We now know what makes up the principal component matrix, henceforth referred to as $U$ (the eigenvector matrix, i.e. the matrix containing the principal component vectors $u_i$ as its columns).

We want a way to find the eigenvectors and eigenvalues of the covariance matrix $S$.

There are two approaches: an eigendecomposition of the covariance matrix $S$, or a singular value decomposition (SVD) of the centered data matrix $B$, whose right singular vectors are the eigenvectors of $S$ (with $\lambda_i = \sigma_i^2 / (N-1)$).
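A minimal sketch of both routes on the running example (assuming, as above, that the two approaches are the eigendecomposition and the SVD; variable names are illustrative):

```python
# Approach 1: eigendecomposition of S, sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, U = eigvals[order], eigvecs[:, order]

# Approach 2: SVD of the centered data, B = W @ diag(sigma) @ Vt.
W, sigma, Vt = np.linalg.svd(B, full_matrices=False)
lam_svd = sigma**2 / (N - 1)     # eigenvalues of S recovered from singular values

# Both give the same spectrum; eigenvectors agree up to sign.
assert np.allclose(lam, lam_svd)
assert np.allclose(np.abs(U), np.abs(Vt.T))
```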

Projecting the data

Now consider the PCA transformation:

$$z_i = U^\top (x_i - \bar{x})$$

Or in matrix form:

$$Z = B U$$

Since $U$ is orthogonal, we’ve simply shifted the origin of the coordinate axes to the mean of the data and rotated and/or reflected the coordinate axes around the mean to coincide with the principal components. I.e. we have only changed the basis, and no dimensionality reduction has taken place.

The sample covariance of the projected data is $\frac{1}{N-1} Z^\top Z = U^\top S U = \Lambda$ (the diagonal matrix of eigenvalues), therefore the transformed covariance matrix is diagonal with the eigenvalues of $S$ along its diagonal. This also means the transformed data’s components are uncorrelated, since the off-diagonal entries represent correlation (covariance between two features).

Forming a new matrix $U_M$ from only the first $M$ eigenvectors ($M < D$ indicating the new number of columns), we are performing dimensionality reduction!

$$Z_M = B\, U_M$$

With $U_M$ (a $D \times M$ matrix) containing the eigenvectors corresponding to the $M$ largest eigenvalues $\lambda_1, \dots, \lambda_M$.

We can also consider the transformed data in terms of the SVD of the centered data, $B = W \Sigma V^\top$ (whose right singular vectors coincide with the eigenvectors, $V = U$):

$$Z = B U = W \Sigma V^\top U = W \Sigma, \qquad Z_M = W_M \Sigma_M$$
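Continuing the sketch (with `M = 2` chosen purely for illustration):

```python
# Dimensionality reduction: keep only the first M principal components.
M = 2
U_M = U[:, :M]                  # (D, M) matrix of the top-M eigenvectors
Z_M = B @ U_M                   # (N, M) projected data

# Equivalent via the SVD factors (up to the same per-column sign ambiguity).
Z_M_svd = W[:, :M] * sigma[:M]
assert np.allclose(np.abs(Z_M), np.abs(Z_M_svd))
```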

Recover data from projection

With dimensionality reduction, a perfect reconstruction is impossible. However, we can get an approximate reconstruction, and it is exact if there was no loss during projection (i.e. if all $D$ components are kept):

$$\tilde{x}_i = U_M z_i + \bar{x}$$

Or in matrix form:

$$\tilde{X} = Z_M U_M^\top + \mathbf{1}_N\,\bar{x}^\top$$

Or in terms of the SVD:

$$\tilde{X} = W_M \Sigma_M U_M^\top + \mathbf{1}_N\,\bar{x}^\top$$
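A short continuation of the sketch:

```python
# Approximate reconstruction from the M-dimensional projection.
X_recon = Z_M @ U_M.T + x_bar

# With all D components kept, the reconstruction is exact (up to float error).
assert np.allclose((B @ U) @ U.T + x_bar, X)
```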

Data Loss

Since the reconstructed data is of the form

$$\tilde{x}_i = \bar{x} + \sum_{j=1}^{M} z_{ij}\, u_j$$

The data loss (the average squared reconstruction error, with the same normalisation as $S$) is therefore the sum of the discarded eigenvalues:

$$J = \frac{1}{N-1}\sum_{i=1}^{N} \lVert x_i - \tilde{x}_i \rVert^2 = \sum_{j=M+1}^{D} \lambda_j$$

The amount of variance retained is equal to

$$\frac{\sum_{j=1}^{M} \lambda_j}{\sum_{j=1}^{R} \lambda_j}$$

With $R$ being the rank of the covariance matrix, usually $R = D$.
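And a final check on the running example (same illustrative variables as above):

```python
# Fraction of total variance retained by the first M components.
retained = lam[:M].sum() / lam.sum()

# The data loss equals the sum of the discarded eigenvalues
# (same 1/(N-1) normalisation as S).
loss = ((X - X_recon) ** 2).sum() / (N - 1)
assert np.isclose(loss, lam[M:].sum())
```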