Unlike the generative method, the PDM (Probabilistic Discriminative Model) does not calculate the class-conditional probabilities $p(\mathbf{x} \mid C_k)$ and instead directly computes the posterior $p(C_k \mid \mathbf{x})$.
Posterior of Two Classes with Shared Covariance Matrix
As we derived in the PGM notes, the posterior is given as follows:

$$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x} + w_0), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}$$
We've also shown these weights for the case where the classes follow a Gaussian distribution with a shared covariance matrix $\Sigma$:

$$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$
We can absorb the bias term into the weight vector by adding a constant 1 to the input data:

$$\tilde{\mathbf{x}} = \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix} \quad \text{and} \quad \tilde{\mathbf{w}} = \begin{pmatrix} w_0 \\ \mathbf{w} \end{pmatrix}$$
This also simplifies the posterior for a class to:

$$p(C_1 \mid \mathbf{x}) = \sigma(\tilde{\mathbf{w}}^{\mathsf{T}}\tilde{\mathbf{x}})$$

From here on we drop the tildes and simply write $\sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x})$.
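As a sanity check, here is a minimal NumPy sketch (the class parameters and names are made up for illustration) showing that the generative posterior with a shared covariance matrix really does equal this sigmoid of a linear function of the input:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2

# Illustrative Gaussian class parameters with a shared covariance matrix
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
prior1, prior2 = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_pdf(x, mu):
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ Sigma_inv @ diff) / norm

x = rng.normal(size=D)

# Posterior via Bayes' rule on the class-conditional densities...
num = gaussian_pdf(x, mu1) * prior1
posterior_bayes = num / (num + gaussian_pdf(x, mu2) * prior2)

# ...and via the sigmoid of a linear function of x -- the two agree.
posterior_sigmoid = sigmoid(w @ x + w0)
print(posterior_bayes, posterior_sigmoid)
```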
Why PDM?
- PGM - Computes the weights indirectly by:
    - Estimating the Gaussian class-conditional PDFs $p(\mathbf{x} \mid C_k)$
    - Assuming a shared or diagonal covariance matrix $\Sigma$
    - Computing the weights from the decision boundary (where $p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x})$)
- PDM - Computes the weights directly
    - The model $\sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x})$ maps the inputs straight to the posterior $p(C_1 \mid \mathbf{x})$
    - Why don't we just directly compute the weights? (A rough parameter count is sketched below.)
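A rough parameter count for $D$-dimensional inputs in the shared-covariance case hints at the answer (this comparison is my addition, not part of the original notes):

$$\underbrace{2D}_{\boldsymbol{\mu}_1,\,\boldsymbol{\mu}_2} + \underbrace{\tfrac{D(D+1)}{2}}_{\Sigma} + \underbrace{1}_{p(C_1)} \;\;\text{(PGM)} \qquad \text{vs.} \qquad \underbrace{D + 1}_{\tilde{\mathbf{w}}} \;\;\text{(PDM)}$$

The discriminative route fits far fewer parameters and never has to model the class-conditional densities at all.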
Determining the Weight Parameters
Two Class Case
The decision boundary between two classes is where

$$p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x}) = 0.5, \qquad \text{i.e. where } \mathbf{w}^{\mathsf{T}}\mathbf{x} = 0$$
Assume we're given training data with observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ and their respective classes $t_1, \dots, t_N$, where $t_n$ can be either 1 or 0 for class $C_1$ or $C_2$ respectively.
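For concreteness, a toy setup in NumPy matching this notation (the numbers are arbitrary); the constant-1 column is the bias trick from earlier:

```python
import numpy as np

# Three toy observations x_n (D = 2) and their 0/1 targets t_n
raw_X = np.array([[ 0.5,  1.2],
                  [-1.0,  0.3],
                  [ 2.1, -0.7]])
t = np.array([1.0, 0.0, 1.0])         # t_n = 1 for class C_1, 0 for C_2

# Prepend the constant 1 so the bias w_0 is part of the weight vector
X = np.column_stack([np.ones(len(raw_X)), raw_X])
print(X.shape)                        # (N, D + 1) = (3, 3)
```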
We assume conditional independence of the sampled data, i.e. for known observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ and weights $\mathbf{w}$, knowing one of the output classes $t_n$ does not change the probability of any other output class $t_m$ for $m \neq n$.
We also assume the data is sampled independently from the same distribution, so that individual observations are independent of each other: knowing prior observations does not change the probability of the current observation being in a specific class.
We can then find the joint probability of this training set $\mathbf{t} = (t_1, \dots, t_N)^{\mathsf{T}}$ given the weights:

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w})$$
Since the outputs follow a Bernoulli distribution, we can write

$$p(t_n \mid \mathbf{x}_n, \mathbf{w}) = y_n^{t_n}(1 - y_n)^{1 - t_n}, \qquad \text{where } y_n = p(C_1 \mid \mathbf{x}_n)$$
What?! Let’s explore how this is derived:
- If $t_n = 1$ then $p(t_n \mid \mathbf{x}_n, \mathbf{w}) = y_n^1 (1 - y_n)^0 = y_n$ (since if $t_n = 1$ then the point is in class $C_1$ and therefore these are equivalent)
- Likewise, if $t_n = 0$ then $p(t_n \mid \mathbf{x}_n, \mathbf{w}) = y_n^0 (1 - y_n)^1 = 1 - y_n$
We just use some clever tricks with powers to combine these two cases into a single equation! Neat.
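A two-line numeric check of that trick (any $y$ between 0 and 1 works):

```python
y = 0.8                      # say y_n = p(C_1 | x_n) = 0.8
print(y**1 * (1 - y)**0)     # t_n = 1  ->  0.8  (= y_n)
print(y**0 * (1 - y)**1)     # t_n = 0  ->  0.2  (= 1 - y_n)
```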
We can now use our existing formula for the posteriors, $y_n = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)$, giving the likelihood

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)^{t_n}\bigl(1 - \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)\bigr)^{1 - t_n}$$
We now want to maximise the likelihood (get the probability as close to 1 as possible). However, since we're working with a product of many numbers between 0 and 1, and floating-point arithmetic underflows on such products, it is more numerically stable (and more convenient) to minimise the negative log-likelihood instead.
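To see the numerical issue, here is a small illustration: multiplying a few thousand made-up per-point probabilities underflows float64 to exactly zero, while summing their logs stays well-behaved.

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.uniform(0.1, 0.9, size=5000)   # stand-ins for per-point likelihoods

print(np.prod(probs))                      # 0.0 -- the product underflows
print(np.sum(np.log(probs)))               # a perfectly ordinary negative number
```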
Negative Log-Likelihood

$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\bigl[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\bigr]$$

To minimise this, we consider the gradient:

$$\nabla E(\mathbf{w}) = -\sum_{n=1}^{N}\left[\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\right]\nabla y_n$$
We know the derivative of the sigmoid:

$$\frac{d\sigma}{da} = \sigma(a)\bigl(1 - \sigma(a)\bigr)$$
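For completeness, that identity follows directly from the definition of $\sigma$:

$$\frac{d\sigma}{da} = \frac{d}{da}\bigl(1 + e^{-a}\bigr)^{-1} = \frac{e^{-a}}{\bigl(1 + e^{-a}\bigr)^{2}} = \sigma(a)\cdot\frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\bigl(1 - \sigma(a)\bigr)$$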
We can use this to calculate the gradient above with a bit of chain rule. Since $y_n = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)$, we have $\nabla y_n = y_n(1 - y_n)\mathbf{x}_n$, and the terms collapse to:

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\mathbf{x}_n$$
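A short NumPy sketch of this result (the function name and the finite-difference check are mine); note how the whole sum collapses into a single matrix-vector product:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_and_grad(w, X, t):
    """Negative log-likelihood and its gradient for 0/1 targets t.

    X already contains the constant-1 column for the bias.
    """
    y = sigmoid(X @ w)                                   # y_n = sigma(w^T x_n)
    nll = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    grad = X.T @ (y - t)                                 # sum_n (y_n - t_n) x_n
    return nll, grad

# Tiny finite-difference check that the analytic gradient is right
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
t = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)

_, grad = nll_and_grad(w, X, t)
eps = 1e-6
numeric = [(nll_and_grad(w + eps * e, X, t)[0]
            - nll_and_grad(w - eps * e, X, t)[0]) / (2 * eps)
           for e in np.eye(3)]
print(grad, np.array(numeric))   # the two should match closely
```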
To minimise $E(\mathbf{w})$, we set its gradient to zero, $\nabla E(\mathbf{w}) = \mathbf{0}$.
This is a non-linear system of equations in $\mathbf{w}$ (the sigmoid makes $y_n$ depend non-linearly on the weights), so it needs to be solved numerically. We'll use Newton-Raphson:

$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - \mathbf{H}^{-1}\nabla E(\mathbf{w}), \qquad \mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1 - y_n)\,\mathbf{x}_n\mathbf{x}_n^{\mathsf{T}}$$
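Here is a minimal Newton-Raphson sketch under the notation above; the data is synthetic and purely illustrative, and a real implementation would add a convergence check and some regularisation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_raphson(X, t, n_iter=20):
    """Fit logistic-regression weights by Newton-Raphson updates."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y = sigmoid(X @ w)
        grad = X.T @ (y - t)                  # gradient of the NLL
        R = y * (1 - y)                       # per-point curvature weights
        H = X.T @ (X * R[:, None])            # Hessian: sum_n R_n x_n x_n^T
        w = w - np.linalg.solve(H, grad)      # w_new = w_old - H^{-1} grad
    return w

# Synthetic data generated from known weights, just to see the fit recover them
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_w = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=500) < sigmoid(X @ true_w)).astype(float)

print(newton_raphson(X, t))   # should land reasonably close to true_w
```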
Problems During Parameter Estimation
When a correctly classified point is far from the decision boundary, its posterior is almost 1 ($y_n \approx 1$ with $t_n = 1$, so $y_n - t_n \approx 0$), and such points have very little influence on the gradient.
Points close to the decision boundary have a disproportionately large influence.
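A quick numeric illustration of both statements, assuming a point with $t_n = 1$: its contribution to the gradient is $(y_n - t_n)\mathbf{x}_n$, which shrinks rapidly as the point moves away from the boundary (larger $\mathbf{w}^{\mathsf{T}}\mathbf{x}_n$).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# w^T x_n for a point with t_n = 1 at three distances from the boundary
for a in [8.0, 2.0, 0.1]:
    print(a, sigmoid(a) - 1.0)   # (y_n - t_n): ~ -3e-4, ~ -0.12, ~ -0.48
```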