A linear Dimensionality Reduction technique for classification
It's like PCA, but focuses on separability among known categories.
Data
- Data is a set of $N$ input-output pairs $(\mathbf{x}_n, y_n)$
- Inputs $\mathbf{x}_n$ are real-valued $D$-dimensional vectors
- Outputs $y_n$ indicate the class membership of $\mathbf{x}_n$, and are positive integers from 1 to $K$ (the number of classes)
- Inputs are collected into the $N \times D$ matrix $X$ and outputs into the $N$-dimensional vector $\mathbf{y}$
Binary (Two Class) Classification Example
While projection (a) has greater variance, projection (b) has better separation, with less overlap between the classes, making it preferable for classification problems.
Projection
LDA attempts to maximise the distance between the class means, while minimising variation (referred to as scatter in LDA) within each category.
Given a $D$-dimensional observation $\mathbf{x}$, we can project it down to a single dimension as before with:
$$z = \mathbf{w}^\top \mathbf{x}$$
Therefore, our aim is to find $\mathbf{w}$ so as to maximise between-class separation in the projected 1-d space. However, we need to decide how to quantify this, which we do in the following sections.
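As a tiny sketch of the mechanics (toy data and an arbitrary, not-yet-optimised direction, both my own choices), projecting every observation onto a direction $\mathbf{w}$ is a single matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # N = 100 toy observations, D = 3
w = np.array([0.5, -1.0, 2.0])   # an arbitrary candidate direction

z = X @ w                        # 1-d projections z_n = w^T x_n, shape (100,)
```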
Means
Let $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$ be the mean of the input data, as before in PCA.
We define the class mean as
$$\mathbf{m}_k = \frac{1}{N_k}\sum_{n \in \mathcal{C}_k} \mathbf{x}_n$$
With:
- $\mathcal{C}_k$ meaning the set of sample indices belonging to class $k$
- $N_k$ being the total number of samples in class $k$
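A minimal sketch of these definitions (the helper name `class_means` is mine, not from the notes), assuming `X` is an $N \times D$ array and `y` holds integer labels $1, \dots, K$:

```python
import numpy as np

def class_means(X, y, K):
    """Global mean x-bar and per-class means m_1..m_K."""
    xbar = X.mean(axis=0)                                   # overall mean of the data
    m = np.stack([X[y == k].mean(axis=0) for k in range(1, K + 1)])
    return xbar, m                                          # m[k-1] is the mean of class k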
Scatter
- $S_T = \sum_{n=1}^{N} (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^\top$ is the total scatter about the mean
- $S_k = \sum_{n \in \mathcal{C}_k} (\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^\top$ is the within-class scatter of class $k$
- $S_W = \sum_{k=1}^{K} S_k$ is the within-class scatter for the whole data set
- $S_B = \sum_{n=1}^{N} (\mathbf{m}_{y_n} - \bar{\mathbf{x}})(\mathbf{m}_{y_n} - \bar{\mathbf{x}})^\top = \sum_{k=1}^{K} N_k (\mathbf{m}_k - \bar{\mathbf{x}})(\mathbf{m}_k - \bar{\mathbf{x}})^\top$ (where $\mathbf{m}_{y_n}$ is the class mean of the class to which sample $n$ belongs) is the between-class scatter.
Notice that $S_T = S_W + S_B$.
Note that scatter matrices are unnormalised covariance matrices.
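The three scatter matrices can be computed directly from these definitions; the sketch below (my own helper, continuing the assumptions above) also makes it easy to check the identity $S_T = S_W + S_B$ numerically:

```python
import numpy as np

def scatter_matrices(X, y, K):
    """Total, within-class and between-class scatter (unnormalised covariances)."""
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S_T = Xc.T @ Xc                                # total scatter about the mean
    S_W = np.zeros_like(S_T)
    S_B = np.zeros_like(S_T)
    for k in range(1, K + 1):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)             # scatter of class k about its own mean
        d = (mk - xbar)[:, None]
        S_B += len(Xk) * (d @ d.T)                 # N_k-weighted scatter of the class means
    return S_T, S_W, S_B

# e.g. S_T, S_W, S_B = scatter_matrices(X, y, K); np.allclose(S_T, S_W + S_B) -> True
```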
Specifying the LDA axes
We want to find $M$ suitable axes $\mathbf{w}_1, \dots, \mathbf{w}_M$ collected into a matrix $W = [\mathbf{w}_1, \dots, \mathbf{w}_M]$ so that we can transform our $D$-dimensional input $\mathbf{x}$:
$$\mathbf{z} = W^\top \mathbf{x}$$
The scatters of data transformed this way can be calculated:
$\bar{\mathbf{z}} = W^\top \bar{\mathbf{x}}$ is the new data mean
$W^\top S_T W$ is the new total scatter.
Likewise, we can show that the new between-class scatter is $W^\top S_B W$ and the new within-class scatter is $W^\top S_W W$.
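A quick numerical check of this claim (my own sketch, for any data matrix `X` and projection matrix `W`): the total scatter computed directly in the projected space matches $W^\top S_T W$:

```python
import numpy as np

def check_projected_scatter(X, W):
    """Verify that the total scatter of Z = X W equals W^T S_T W."""
    Z = X @ W
    Zc = Z - Z.mean(axis=0)
    direct = Zc.T @ Zc                 # total scatter measured in the projected space
    Xc = X - X.mean(axis=0)
    via_ST = W.T @ (Xc.T @ Xc) @ W     # W^T S_T W
    return np.allclose(direct, via_ST)
```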
LDA aims to have points within the same class tightly clustered, while having clusters clearly separated
One objective for finding $W$ is to optimise
$$\frac{W^\top S_B W}{W^\top S_W W}$$
However, this is not well defined, as the quotient of two matrices does not exist. Also, the maximisation of between-class scatter (separation of the clusters) and the minimisation of within-class scatter (tightly clustered classes) are in conflict, so a trade-off has to be made.
As with PCA, we find the first axis vector $\mathbf{w}_1$ first, choosing it to maximise the projected between-class scatter $\mathbf{w}_1^\top S_B \mathbf{w}_1$.
Notice that as $\mathbf{w}_1$ grows without bound, this objective also grows boundlessly, therefore we add the constraint that $\mathbf{w}_1$ has unit length in the within-class-scatter metric: $\mathbf{w}_1^\top S_W \mathbf{w}_1 = 1$.
This yields the constrained optimisation problem
$$\max_{\mathbf{w}_1} \; \mathbf{w}_1^\top S_B \mathbf{w}_1 \quad \text{subject to} \quad \mathbf{w}_1^\top S_W \mathbf{w}_1 = 1$$
Hey! It's a Lagrange multipliers problem! Let's solve it:
The Lagrangian is:
$$L(\mathbf{w}_1, \lambda) = \mathbf{w}_1^\top S_B \mathbf{w}_1 - \lambda (\mathbf{w}_1^\top S_W \mathbf{w}_1 - 1)$$
Which can be solved by setting its gradient with respect to $\mathbf{w}_1$ to zero:
$$S_B \mathbf{w}_1 = \lambda S_W \mathbf{w}_1$$
This is not quite an eigenvalue problem since $S_W$ is in the way, but if we assume $S_W$ is invertible:
$$S_W^{-1} S_B \mathbf{w}_1 = \lambda \mathbf{w}_1$$
it becomes an eigenvalue problem for $S_W^{-1} S_B$, with the solutions for $\mathbf{w}$ being the eigenvectors corresponding to the largest eigenvalues!
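Putting the pieces together, a hedged sketch of this step (function name mine; reuses `scatter_matrices` from above and assumes $S_W$ is invertible): the LDA axes are the leading eigenvectors of $S_W^{-1} S_B$.

```python
import numpy as np

def lda_axes(S_W, S_B, M):
    """Top-M LDA directions as eigenvectors of S_W^{-1} S_B (S_W assumed invertible)."""
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))   # S_W^{-1} S_B without an explicit inverse
    order = np.argsort(evals.real)[::-1]                      # largest eigenvalues first
    return evecs.real[:, order[:M]]                           # columns are w_1, ..., w_M
```

With the returned axes collected as columns of $W$, data is projected as `Z = X @ W`.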
How about if $S_W$ is not invertible?!
$S_W$ is not invertible if it isn't full rank (it has some zero eigenvalues).
We can find a solution in two steps:
- use PCA on the within-class scatter $S_W$ to remove the empty (zero-eigenvalue) dimensions, making the within-class scatter of the transformed data invertible
- Scale the axes so that orthogonal transformations do not change the within-class scatter (i.e. whiten the class-centred data)
- Whitening: transform the data with $\Lambda^{-1/2} U^\top$, where $S_W = U \Lambda U^\top$ is the eigendecomposition of the (reduced) within-class scatter, so that the within-class scatter of the whitened data becomes the identity
After whitening we have a new objective: maximise the between-class scatter of the whitened data, $\mathbf{w}^\top \tilde{S}_B \mathbf{w}$, subject to $\mathbf{w}^\top \mathbf{w} = 1$, which is an ordinary eigenvalue problem for $\tilde{S}_B$.
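A sketch of this two-step route (helper name and `tol` threshold are mine; `np.linalg.eigh` is used since both scatter matrices are symmetric), under the assumption that dropping near-zero eigenvalues of $S_W$ is acceptable:

```python
import numpy as np

def lda_axes_via_whitening(S_W, S_B, M, tol=1e-10):
    """LDA axes when S_W may be rank-deficient: PCA-drop null directions, whiten, then eigendecompose."""
    lam, U = np.linalg.eigh(S_W)                # S_W = U diag(lam) U^T
    keep = lam > tol                            # discard the empty (zero-eigenvalue) dimensions
    A = U[:, keep] / np.sqrt(lam[keep])         # whitening map: A^T S_W A = I
    S_B_white = A.T @ S_B @ A                   # between-class scatter in the whitened space
    evals, V = np.linalg.eigh(S_B_white)        # ordinary symmetric eigenvalue problem
    V = V[:, np.argsort(evals)[::-1][:M]]       # keep the M largest-eigenvalue directions
    return A @ V                                # axes expressed back in the original input space
```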