Unlike the generative method, the PDM does not model the class-conditional probabilities $p(\mathbf{x} \mid C_k)$; instead, it computes the posterior $p(C_k \mid \mathbf{x})$ directly.

Posterior of Two Classes with Shared Covariance Matrix

As we derived in the PGM notes, the posterior is given as follows:

$$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + w_0)}}$$

We also showed these weights for the case where the classes follow a Gaussian distribution with a shared covariance matrix $\boldsymbol{\Sigma}$:

$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

$$w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$
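As a quick sanity check of these formulas, here is a minimal numpy sketch; the class means `mu1`, `mu2`, shared covariance `Sigma`, and equal priors are made-up values, not from the notes:

```python
import numpy as np

# Hypothetical class statistics (illustrative only), with equal priors.
mu1 = np.array([2.0, 1.0])
mu2 = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.5]])
prior1, prior2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)

# w = Sigma^{-1} (mu1 - mu2)
w = Sigma_inv @ (mu1 - mu2)

# w0 = -1/2 mu1^T Sigma^{-1} mu1 + 1/2 mu2^T Sigma^{-1} mu2 + ln(p(C1)/p(C2))
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

print(w, w0)
```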

We can absorb the bias term into the weight vector by prepending a constant 1 to the input data: $\tilde{\mathbf{x}} = (1, \mathbf{x})^T$ and $\tilde{\mathbf{w}} = (w_0, \mathbf{w})^T$.

This also simplifies the posterior for a class to:

$$p(C_1 \mid \mathbf{x}) = \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}})$$
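A minimal sketch of evaluating this simplified posterior, assuming an illustrative augmented weight vector `w_tilde` whose first component is the bias:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative weights: first entry is the bias w0, the rest is w.
w_tilde = np.array([-0.5, 1.2, -0.7])

x = np.array([0.3, 2.0])               # original input
x_tilde = np.concatenate(([1.0], x))   # prepend constant 1 to absorb the bias

posterior_c1 = sigmoid(w_tilde @ x_tilde)  # p(C1 | x)
print(posterior_c1)
```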

Why PDM?

  • PGM - Computes the weights indirectly by:
    • Estimating Gaussian class-conditional PDFs
    • Assuming a shared (or diagonal) covariance matrix $\boldsymbol{\Sigma}$
    • Computing the weights from the decision boundary (via the formulas above)
  • PDM - Computes the weights directly
    • The sigmoid $\sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}})$ maps the inputs directly to the posterior
    • Why don’t we just directly compute the weights?

Determining the Weight Parameters

Two Class Case

The decision boundary between two classes is where the posteriors are equal:

$$p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x}) = 0.5, \quad \text{i.e. where } \tilde{\mathbf{w}}^T\tilde{\mathbf{x}} = 0$$

Assume we're given training data with observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ and their respective classes $t_1, \dots, t_N$, where $t_n$ can be either 1 or 0 for class $C_1$ or $C_2$ respectively.

We assume conditional independence of the sampled data, i.e. for known observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ and weights $\tilde{\mathbf{w}}$, knowing one of the sample output classes $t_n$ does not change the probability of any other output class $t_m$ for $m \neq n$.

We also assume the data is sampled independently from the same distribution (i.i.d.), so that individual observations are independent of each other: knowing prior observations does not change the probability of the current observation being in a specific class.

We can then find the joint probability of this training set given the weights:

$$p(\mathbf{t} \mid \tilde{\mathbf{w}}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \tilde{\mathbf{w}})$$

Since the outputs follow a Bernoulli distribution, we can write

$$p(t_n \mid \mathbf{x}_n, \tilde{\mathbf{w}}) = y_n^{t_n}(1 - y_n)^{1 - t_n}, \quad \text{where } y_n = p(C_1 \mid \mathbf{x}_n)$$

What?! Let’s explore how this is derived:

  • If $t_n = 1$ then $y_n^{t_n}(1 - y_n)^{1 - t_n} = y_n = p(C_1 \mid \mathbf{x}_n)$ (since if $t_n = 1$ then the point is in class $C_1$ and therefore these are equivalent)
  • Likewise, if $t_n = 0$ then $y_n^{t_n}(1 - y_n)^{1 - t_n} = 1 - y_n = p(C_2 \mid \mathbf{x}_n)$

We just use some clever tricks with powers to combine these two cases into a single equation! Neat.

We can now use our existing formula for the posteriors, $y_n = \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_n)$:

$$p(\mathbf{t} \mid \tilde{\mathbf{w}}) = \prod_{n=1}^{N} \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_n)^{t_n}\left(1 - \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_n)\right)^{1 - t_n}$$
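To make the product form concrete, here is a hedged numpy sketch with made-up toy data `X`, targets `t`, and weights `w_tilde` (each row of `X` already has the leading 1 for the bias):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: rows of X already include the leading 1 for the bias.
X = np.array([[1.0,  0.5,  1.5],
              [1.0, -1.0,  0.2],
              [1.0,  2.0, -0.3]])
t = np.array([1.0, 0.0, 1.0])            # target classes
w_tilde = np.array([0.1, 0.8, -0.4])     # illustrative weights

y = sigmoid(X @ w_tilde)                 # y_n = p(C1 | x_n)

# Bernoulli likelihood: prod_n y_n^{t_n} (1 - y_n)^{1 - t_n}
likelihood = np.prod(y**t * (1 - y)**(1 - t))
print(likelihood)
```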

We now want to maximise the likelihood (get the probability as close to 1 as possible). However, since we're working with a product of many values less than 1, and computers don't handle such tiny floating-point numbers well, it is more numerically stable to take logs and minimise the negative log-likelihood instead.
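To see the numerical-stability point in action, here is a small sketch (with arbitrary per-point Bernoulli probabilities) where the raw product underflows to zero in double precision while the sum of logs stays well behaved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Many moderately confident per-point probabilities y_n^{t_n}(1-y_n)^{1-t_n}.
probs = rng.uniform(0.6, 0.9, size=5000)

# The raw product underflows to exactly 0.0 in double precision...
print(np.prod(probs))

# ...while the negative sum of logs is a perfectly ordinary finite number.
print(-np.sum(np.log(probs)))
```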

Negative Log-Likelihood

$$E(\tilde{\mathbf{w}}) = -\ln p(\mathbf{t} \mid \tilde{\mathbf{w}}) = -\sum_{n=1}^{N} \left[ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \right]$$

To minimise this, we consider the gradient $\nabla E(\tilde{\mathbf{w}})$.

We know the derivative of the logistic sigmoid:

$$\frac{d\sigma(a)}{da} = \sigma(a)\left(1 - \sigma(a)\right)$$

We can use this to calculate the partial derivative above with a bit of chain rule:

$$\nabla E(\tilde{\mathbf{w}}) = \sum_{n=1}^{N} (y_n - t_n)\,\tilde{\mathbf{x}}_n$$

To maximise the likelihood, we minimise $E(\tilde{\mathbf{w}})$ by setting its gradient to zero:

$$\sum_{n=1}^{N} (y_n - t_n)\,\tilde{\mathbf{x}}_n = \mathbf{0}$$
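A minimal numpy sketch of evaluating this gradient (and the error itself), reusing the same made-up toy data as in the likelihood example above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Same made-up toy data as before (leading 1 absorbs the bias).
X = np.array([[1.0,  0.5,  1.5],
              [1.0, -1.0,  0.2],
              [1.0,  2.0, -0.3]])
t = np.array([1.0, 0.0, 1.0])
w_tilde = np.array([0.1, 0.8, -0.4])

y = sigmoid(X @ w_tilde)

# Negative log-likelihood: E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]
nll = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Gradient: grad E(w) = sum_n (y_n - t_n) x_n  =  X^T (y - t)
grad = X.T @ (y - t)
print(nll, grad)
```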

This is a non-linear system of equations in $\tilde{\mathbf{w}}$ that needs to be solved numerically. We'll use Newton-Raphson.
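Below is a sketch of the Newton-Raphson update in its iteratively reweighted least squares form, using the standard Hessian $H = \tilde{X}^T R \tilde{X}$ with $R = \mathrm{diag}\big(y_n(1 - y_n)\big)$; the training data here is randomly generated purely for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Randomly generated toy data for illustration.
N = 100
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # prepend 1s for the bias
true_w = np.array([0.5, 1.5, -1.0])                        # arbitrary "true" weights
t = (rng.uniform(size=N) < sigmoid(X @ true_w)).astype(float)

w = np.zeros(X.shape[1])
for _ in range(10):
    y = sigmoid(X @ w)
    grad = X.T @ (y - t)                 # gradient of the NLL
    R = np.diag(y * (1 - y))             # weighting matrix
    H = X.T @ R @ X                      # Hessian of the NLL
    w = w - np.linalg.solve(H, grad)     # Newton-Raphson step

print(w)
```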

Problems During Parameter Estimation

When a correctly classified point is far from the decision boundary, the posterior is almost 1 ($y_n \approx t_n$, so $y_n - t_n \approx 0$), so these points have very little influence on the gradient.

Points close to the decision boundary have a disproportionately large influence.
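For instance, with purely illustrative values: a confidently correct point with $y_n = 0.99$ and $t_n = 1$ contributes $(y_n - t_n)\,\tilde{\mathbf{x}}_n = -0.01\,\tilde{\mathbf{x}}_n$ to the gradient, whereas a point near the boundary with $y_n = 0.55$ and $t_n = 1$ contributes $-0.45\,\tilde{\mathbf{x}}_n$, roughly 45 times as much.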