Unlike the generative method, the PDM (Probabilistic Discriminative Model) does not calculate the class-conditional probabilities $p(\mathbf{x} \mid C_k)$ and instead directly computes the posterior $p(C_k \mid \mathbf{x})$.
Posterior of Two Classes with Shared Covariance Matrix
As we derived in the PGM notes, the posterior is given as follows:

$$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x} + w_0), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}$$
We've also shown these weights for the case where the classes follow a Gaussian distribution with a shared covariance matrix $\Sigma$:

$$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_2 + \ln\frac{p(C_1)}{p(C_2)}$$
We can absorb the bias term into the weight vector by adding a constant 1 to the input data:

$$\tilde{\mathbf{x}} = \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix} \quad \text{and} \quad \tilde{\mathbf{w}} = \begin{pmatrix} w_0 \\ \mathbf{w} \end{pmatrix}$$
This also simplifies the posterior for a class to:

$$p(C_1 \mid \mathbf{x}) = \sigma(\tilde{\mathbf{w}}^{\mathsf{T}}\tilde{\mathbf{x}})$$

From here on we drop the tildes and simply write $\sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x})$.
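As a sanity check, here is a minimal NumPy sketch (the class parameters and names are made up for illustration) showing that the generative posterior with a shared covariance matrix really does equal this sigmoid of a linear function of the input:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2

# Illustrative Gaussian class parameters with a shared covariance matrix
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
prior1, prior2 = 0.6, 0.4

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_pdf(x, mu):
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ Sigma_inv @ diff) / norm

x = rng.normal(size=D)

# Posterior via Bayes' rule on the class-conditional densities...
num = gaussian_pdf(x, mu1) * prior1
posterior_bayes = num / (num + gaussian_pdf(x, mu2) * prior2)

# ...and via the sigmoid of a linear function of x -- the two agree.
posterior_sigmoid = sigmoid(w @ x + w0)
print(posterior_bayes, posterior_sigmoid)
```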
Why PDM?
- PGM - Computes the weights indirectly by:
    - Estimating the Gaussian class-conditional PDFs $p(\mathbf{x} \mid C_k)$
    - Assuming a shared or diagonal covariance matrix $\Sigma$
    - Computing the weights from the decision boundary (where $p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x})$)
- PDM - Computes the weights directly
    - The model $\sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x})$ maps the inputs straight to the posterior $p(C_1 \mid \mathbf{x})$
    - Why don't we just directly compute the weights? (A rough parameter count is sketched below.)
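A rough parameter count for $D$-dimensional inputs in the shared-covariance case hints at the answer (this comparison is my addition, not part of the original notes):

$$\underbrace{2D}_{\boldsymbol{\mu}_1,\,\boldsymbol{\mu}_2} + \underbrace{\tfrac{D(D+1)}{2}}_{\Sigma} + \underbrace{1}_{p(C_1)} \;\;\text{(PGM)} \qquad \text{vs.} \qquad \underbrace{D + 1}_{\tilde{\mathbf{w}}} \;\;\text{(PDM)}$$

The discriminative route fits far fewer parameters and never has to model the class-conditional densities at all.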
Determining the Weight Parameters
Two Class Case
The decision boundary between two classes is where

$$p(C_1 \mid \mathbf{x}) = p(C_2 \mid \mathbf{x}) = 0.5, \qquad \text{i.e. where } \mathbf{w}^{\mathsf{T}}\mathbf{x} = 0$$
Assume we're given training data with observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ and their respective classes $t_1, \dots, t_N$, where $t_n$ can be either 1 or 0 for class $C_1$ or $C_2$ respectively.
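For concreteness, a toy setup in NumPy matching this notation (the numbers are arbitrary); the constant-1 column is the bias trick from earlier:

```python
import numpy as np

# Three toy observations x_n (D = 2) and their 0/1 targets t_n
raw_X = np.array([[ 0.5,  1.2],
                  [-1.0,  0.3],
                  [ 2.1, -0.7]])
t = np.array([1.0, 0.0, 1.0])         # t_n = 1 for class C_1, 0 for C_2

# Prepend the constant 1 so the bias w_0 is part of the weight vector
X = np.column_stack([np.ones(len(raw_X)), raw_X])
print(X.shape)                        # (N, D + 1) = (3, 3)
```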
We assume conditional independence of the sampled data, i.e. for known observations $\mathbf{x}_1, \dots, \mathbf{x}_N$ and weights $\mathbf{w}$, knowing one of the output classes $t_n$ does not change the probability of any other output class $t_m$ for $m \neq n$.
We also assume the data is sampled independently from the same distribution, so that individual observations are independent of each other: knowing prior observations does not change the probability of the current observation being in a specific class.
We can then find the joint probability of this training set $\mathbf{t} = (t_1, \dots, t_N)^{\mathsf{T}}$ given the weights:

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n, \mathbf{w})$$
Since the outputs follow a Bernoulli distribution, we can write

$$p(t_n \mid \mathbf{x}_n, \mathbf{w}) = y_n^{t_n}(1 - y_n)^{1 - t_n}, \qquad \text{where } y_n = p(C_1 \mid \mathbf{x}_n)$$
What?! Let’s explore how this is derived:
- If $t_n = 1$ then $p(t_n \mid \mathbf{x}_n, \mathbf{w}) = y_n^1 (1 - y_n)^0 = y_n$ (since if $t_n = 1$ then the point is in class $C_1$ and therefore these are equivalent)
- Likewise, if $t_n = 0$ then $p(t_n \mid \mathbf{x}_n, \mathbf{w}) = y_n^0 (1 - y_n)^1 = 1 - y_n$
We just use some clever tricks with powers to combine these two cases into a single equation! Neat.
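A two-line numeric check of that trick (any $y$ between 0 and 1 works):

```python
y = 0.8                      # say y_n = p(C_1 | x_n) = 0.8
print(y**1 * (1 - y)**0)     # t_n = 1  ->  0.8  (= y_n)
print(y**0 * (1 - y)**1)     # t_n = 0  ->  0.2  (= 1 - y_n)
```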
We can now use our existing formula for the posteriors, $y_n = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)$, giving the likelihood

$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)^{t_n}\bigl(1 - \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)\bigr)^{1 - t_n}$$
We now want to maximise the likelihood (get the probability as close to 1 as possible). However, since we're working with a product of many numbers between 0 and 1, and floating-point arithmetic underflows on such products, it is more numerically stable (and more convenient) to minimise the negative log-likelihood instead.
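To see the numerical issue, here is a small illustration: multiplying a few thousand made-up per-point probabilities underflows float64 to exactly zero, while summing their logs stays well-behaved.

```python
import numpy as np

rng = np.random.default_rng(1)
probs = rng.uniform(0.1, 0.9, size=5000)   # stand-ins for per-point likelihoods

print(np.prod(probs))                      # 0.0 -- the product underflows
print(np.sum(np.log(probs)))               # a perfectly ordinary negative number
```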
Negative Log-Likelihood

$$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\bigl[t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\bigr]$$

To minimise this, we consider the gradient:

$$\nabla E(\mathbf{w}) = -\sum_{n=1}^{N}\left[\frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n}\right]\nabla y_n$$
We know the derivative of the sigmoid:

$$\frac{d\sigma}{da} = \sigma(a)\bigl(1 - \sigma(a)\bigr)$$
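For completeness, that identity follows directly from the definition of $\sigma$:

$$\frac{d\sigma}{da} = \frac{d}{da}\bigl(1 + e^{-a}\bigr)^{-1} = \frac{e^{-a}}{\bigl(1 + e^{-a}\bigr)^{2}} = \sigma(a)\cdot\frac{e^{-a}}{1 + e^{-a}} = \sigma(a)\bigl(1 - \sigma(a)\bigr)$$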
We can use this to calculate the gradient above with a bit of chain rule. Since $y_n = \sigma(\mathbf{w}^{\mathsf{T}}\mathbf{x}_n)$, we have $\nabla y_n = y_n(1 - y_n)\mathbf{x}_n$, and the terms collapse to:

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\mathbf{x}_n$$
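A short NumPy sketch of this result (the function name and the finite-difference check are mine); note how the whole sum collapses into a single matrix-vector product:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_and_grad(w, X, t):
    """Negative log-likelihood and its gradient for 0/1 targets t.

    X already contains the constant-1 column for the bias.
    """
    y = sigmoid(X @ w)                                   # y_n = sigma(w^T x_n)
    nll = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    grad = X.T @ (y - t)                                 # sum_n (y_n - t_n) x_n
    return nll, grad

# Tiny finite-difference check that the analytic gradient is right
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
t = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)

_, grad = nll_and_grad(w, X, t)
eps = 1e-6
numeric = [(nll_and_grad(w + eps * e, X, t)[0]
            - nll_and_grad(w - eps * e, X, t)[0]) / (2 * eps)
           for e in np.eye(3)]
print(grad, np.array(numeric))   # the two should match closely
```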
To minimise $E(\mathbf{w})$, we set its gradient to zero, $\nabla E(\mathbf{w}) = \mathbf{0}$.
This is a non-linear system of equations in $\mathbf{w}$ (the sigmoid makes $y_n$ depend non-linearly on the weights), so it needs to be solved numerically. We'll use Newton-Raphson:

$$\mathbf{w}^{\text{new}} = \mathbf{w}^{\text{old}} - \mathbf{H}^{-1}\nabla E(\mathbf{w}), \qquad \mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1 - y_n)\,\mathbf{x}_n\mathbf{x}_n^{\mathsf{T}}$$
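Here is a minimal Newton-Raphson sketch under the notation above; the data is synthetic and purely illustrative, and a real implementation would add a convergence check and some regularisation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_raphson(X, t, n_iter=20):
    """Fit logistic-regression weights by Newton-Raphson updates."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y = sigmoid(X @ w)
        grad = X.T @ (y - t)                  # gradient of the NLL
        R = y * (1 - y)                       # per-point curvature weights
        H = X.T @ (X * R[:, None])            # Hessian: sum_n R_n x_n x_n^T
        w = w - np.linalg.solve(H, grad)      # w_new = w_old - H^{-1} grad
    return w

# Synthetic data generated from known weights, just to see the fit recover them
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_w = np.array([-0.5, 2.0, -1.0])
t = (rng.uniform(size=500) < sigmoid(X @ true_w)).astype(float)

print(newton_raphson(X, t))   # should land reasonably close to true_w
```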
Problems During Parameter Estimation
When a correctly classified point is far from the decision boundary, its posterior is almost 1 ($y_n \approx 1$ with $t_n = 1$, so $y_n - t_n \approx 0$), and such points have very little influence on the gradient.
Points close to the decision boundary have a disproportionately large influence.
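A quick numeric illustration of both statements, assuming a point with $t_n = 1$: its contribution to the gradient is $(y_n - t_n)\mathbf{x}_n$, which shrinks rapidly as the point moves away from the boundary (larger $\mathbf{w}^{\mathsf{T}}\mathbf{x}_n$).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# w^T x_n for a point with t_n = 1 at three distances from the boundary
for a in [8.0, 2.0, 0.1]:
    print(a, sigmoid(a) - 1.0)   # (y_n - t_n): ~ -3e-4, ~ -0.12, ~ -0.48
```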