The generative approach to classification (the probabilistic generative model, PGM) works by:

  1. Estimate the class-conditional probabilities $p(\mathbf{x} \mid C_k)$, which can be used to generate new data points, hence the name ‘generative model’
  2. Calculate the posterior probability using Bayes’ Theorem:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$$

Key Steps

To solve for the posterior probabilities via PGM, we follow these key steps:

  1. Expand the posterior using the logistic/softmax function
  2. Determine the form of the arguments to the logistic/softmax function (using model assumptions)
  3. Estimate the parameters of the resulting form

Expanding the Posterior

Two Class Case

Let us consider the key steps for the simple case of two classes.

We know the joint probability of an observation $\mathbf{x}$ and a class $C_k$ is written $p(\mathbf{x}, C_k)$.

We also know the sum rule (marginalisation):

$$p(\mathbf{x}) = \sum_{k} p(\mathbf{x}, C_k)$$

And the product rule:

$$p(\mathbf{x}, C_k) = p(\mathbf{x} \mid C_k)\, p(C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$$

With this, we can expand the posterior:

$$p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a), \qquad a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}$$

NOTE: the lecturer mentioned this is likely to appear in A1!

Let’s have a closer look at the implications of the definition of $a$:

Some background: the odds of an event $A$ are defined as:

$$\text{odds}(A) = \frac{p(A)}{1 - p(A)}$$

We can use this to say the prior odds (odds without an observation) for class $C_1$ are:

$$\frac{p(C_1)}{1 - p(C_1)} = \frac{p(C_1)}{p(C_2)}$$

Similarly, the posterior odds (odds given an observation) are:

$$\frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}$$

Look familiar? This means that $a$ is the log-posterior odds of $C_1$:

$$a = \ln \frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})}$$

Therefore, the posterior probability is the logistic function (sigmoid) evaluated at the log-posterior odds of $C_1$:

$$p(C_1 \mid \mathbf{x}) = \sigma(a) = \frac{1}{1 + \exp(-a)}$$

What does this mean?!

The logistic sigmoid function (as above) takes the log-posterior odds and maps it to a value between 0 and 1, thereby assigning a posterior probability to class $C_1$.

$\mathbf{x}$ belongs to class $C_1$ if $p(C_1 \mid \mathbf{x}) > 0.5$ (equivalently, $a > 0$), otherwise it belongs to class $C_2$.
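
To make this concrete, here is a minimal NumPy sketch of the two-class posterior computed via the log-posterior odds and the sigmoid; the likelihood and prior values below are made up for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps log-posterior odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def posterior_two_class(lik_c1, lik_c2, prior_c1, prior_c2):
    """p(C1 | x) from class-conditional likelihoods and priors via the log-posterior odds a."""
    a = np.log(lik_c1 * prior_c1) - np.log(lik_c2 * prior_c2)  # log-posterior odds
    return sigmoid(a)

# Hypothetical values: p(x|C1) = 0.3, p(x|C2) = 0.1, equal priors
p_c1 = posterior_two_class(0.3, 0.1, 0.5, 0.5)
print(p_c1)                               # 0.75
print("C1" if p_c1 > 0.5 else "C2")       # decision rule at 0.5
```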

k-Class Case

We can expand the two-class case above to a general $k$-class case:

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{\sum_{j} p(\mathbf{x} \mid C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_{j} \exp(a_j)}, \qquad a_k = \ln\big(p(\mathbf{x} \mid C_k)\, p(C_k)\big)$$

This is the basis of the Softmax Function.

Notice that $a_k$ is not the log-posterior odds as before; it is the log of the joint probability $p(\mathbf{x}, C_k)$.

Also note, the total posterior probability must add to 1: $\sum_k p(C_k \mid \mathbf{x}) = 1$.
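
A minimal sketch of the softmax, assuming we already have the activations $a_k$ (the values below are made up); subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(a):
    """Softmax over activations a_k = ln(p(x|C_k) p(C_k))."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                      # stability: shift so the largest activation is 0
    exp_a = np.exp(a)
    return exp_a / exp_a.sum()

# Hypothetical activations for 3 classes
posteriors = softmax([2.0, 1.0, 0.1])
print(posteriors, posteriors.sum())      # posterior probabilities, summing to 1
```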

What’s next?

PGM requires two more things:

  • Prior class probabilities $p(C_k)$
  • Class-conditional densities $p(\mathbf{x} \mid C_k)$, which we’ll assume are Gaussian in this course (GaussianNB in scikit-learn)

Gaussian Class-Conditional PDFs

The multivariate Gaussian density for class $C_k$ is:

$$p(\mathbf{x} \mid C_k) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right)$$

Let’s break this down:

  • $|\boldsymbol{\Sigma}_k|$ is the determinant of the covariance matrix
  • $\boldsymbol{\mu}_k$ is the mean vector for each class
  • $\boldsymbol{\Sigma}_k$ is the class covariance matrix

We’re going to assume that the class covariance matrices are the same across all classes, i.e. $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ for all $k$.
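
As a sketch, the class-conditional densities can be evaluated with `scipy.stats.multivariate_normal`; the means, shared covariance, and test point below are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2D parameters for two classes sharing one covariance matrix
mu = {1: np.array([0.0, 0.0]), 2: np.array([2.0, 1.0])}
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])          # shared covariance

x = np.array([1.0, 0.5])

# Class-conditional densities p(x | C_k)
lik = {k: multivariate_normal.pdf(x, mean=mu[k], cov=Sigma) for k in mu}
print(lik)
```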

What’s our Log-Posterior Odds?

Two Class Case

Since we now know the class-conditional PDFs, we can calculate the log-posterior odds for the two-class case as we did before, assuming a shared $\boldsymbol{\Sigma}$:

$$a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)} = -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1) + \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2) + \ln \frac{p(C_1)}{p(C_2)}$$

Wow! If we make $\boldsymbol{\Sigma}$ shared, the quadratic terms ($\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$) cancel out, resulting in a linear classifier. Let’s define some terms to make it more obvious:

    • $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$: we can combine the cross terms $\boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$ and $\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k$ since the covariance matrix (and hence its inverse) is symmetrical
    • $w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(C_1)}{p(C_2)}$, which is the bias term (constant) and contains the prior probabilities

This simplifies the log-posterior odds to:

$$a = \mathbf{w}^T \mathbf{x} + w_0$$

With the shared $\boldsymbol{\Sigma}$ assumed diagonal, it contains $D$ parameters. Each $\boldsymbol{\mu}_k$ also contains $D$ parameters. Therefore, the total number of parameters required for the two-class generative classifier is:

$$D + 2D = 3D$$

Where’s the decision boundary?

Where the probabilities of both classes are the same, i.e. they both equal 0.5. Equivalently, where the log-posterior odds $a = \mathbf{w}^T \mathbf{x} + w_0 = 0$.
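
Putting the two-class case together, here is a sketch that builds $\mathbf{w}$ and $w_0$ from hypothetical shared-covariance parameters and classifies a point by the sign of the log-posterior odds.

```python
import numpy as np

# Hypothetical shared-covariance parameters for two classes
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
prior1, prior2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                      # weight vector
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))                 # bias term, contains the priors

x = np.array([1.0, 0.5])
a = w @ x + w0                                   # log-posterior odds
print("p(C1|x) =", 1.0 / (1.0 + np.exp(-a)))
print("decision:", "C1" if a > 0 else "C2")      # boundary at a = 0
```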

k-Class Case

Similar to above, we can find:

$$a_k = \mathbf{w}_k^T \mathbf{x} + w_{k0}, \qquad \mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k, \qquad w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln p(C_k)$$
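
A sketch of the $k$-class version under the same shared-$\boldsymbol{\Sigma}$ assumption: compute each $a_k$ as a linear function of $\mathbf{x}$, then push the activations through the softmax (the parameters below are made up).

```python
import numpy as np

# Hypothetical shared-covariance parameters for K = 3 classes in D = 2 dimensions
mus = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 2.0]])
priors = np.array([0.5, 0.3, 0.2])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3],
                                    [0.3, 1.0]]))

W = mus @ Sigma_inv                                           # rows are w_k = Sigma^{-1} mu_k
w0 = -0.5 * np.einsum('kd,de,ke->k', mus, Sigma_inv, mus) + np.log(priors)

x = np.array([1.0, 0.5])
a = W @ x + w0                                                # activations a_k
posterior = np.exp(a - a.max()) / np.exp(a - a.max()).sum()   # softmax
print(posterior, posterior.argmax())
```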

Parameter Estimation

To find the parameters of the class-conditional densities and the prior probabilities, we need a fully observed dataset; this means observations along with their class labels.

Maximum Likelihood Estimation (MLE)

We can determine the means and covariance matrices for each class using the usual methods:

Means:

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n \in C_k} \mathbf{x}_n$$

With $N_k$ being the number of observations in class $C_k$.

Covariances:

$$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n \in C_k} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$$

From maximum likelihood, we can find the prior probabilities as:

$$p(C_k) = \frac{N_k}{N}$$
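
Here is a sketch of these MLE formulas, assuming a labelled dataset `X, y` (the helper function and data are hypothetical, not from any library):

```python
import numpy as np

def fit_generative_gaussian(X, y):
    """MLE for per-class means, per-class covariances, and prior probabilities.

    X: (N, D) array of observations, y: (N,) array of class labels.
    """
    classes = np.unique(y)
    N = len(y)
    params = {}
    for c in classes:
        Xc = X[y == c]
        Nk = len(Xc)
        mu = Xc.mean(axis=0)                      # class mean
        Sigma = (Xc - mu).T @ (Xc - mu) / Nk      # MLE covariance (divide by N_k)
        prior = Nk / N                            # prior p(C_k) = N_k / N
        params[c] = {"mu": mu, "Sigma": Sigma, "prior": prior}
    return params

# Example with a small synthetic dataset
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 50 + [1] * 30)
print(fit_generative_gaussian(X, y)[0]["prior"])  # 0.625
```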

The Flaws of MLE

The MLE approach is expensive to compute. With $D$ input dimensions and $K$ classes:

  • each class has a mean $\boldsymbol{\mu}_k$ with $D$ parameters
  • each class has a symmetrical covariance matrix $\boldsymbol{\Sigma}_k$ containing $\frac{D(D+1)}{2}$ parameters

This results in a parameter count of

$$K\left(D + \frac{D(D+1)}{2}\right)$$

If we share the covariance matrix, this reduces to:

$$KD + \frac{D(D+1)}{2}$$

However, sharing a covariance matrix may cause issues: classes may not have the same covariances, so a shared matrix might not give a good description of each class.

This leads us to the Naive Bayes approach, which assumes a diagonal covariance matrix for each class, resulting in a parameter count of

$$KD + KD = 2KD$$

This strikes a good middle ground between a per-class full covariance matrix and a shared covariance matrix. For example, with $D = 10$ and $K = 3$, the counts are $195$ (full per-class), $85$ (shared), and $60$ (Naive Bayes).
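
For reference, a minimal usage sketch of the GaussianNB estimator mentioned above, fit to a small synthetic dataset (the data here is made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic dataset: two Gaussian-ish clusters with labels
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 50 + [1] * 30)

clf = GaussianNB()                       # Gaussian class-conditionals, diagonal covariance
clf.fit(X, y)
print(clf.class_prior_)                  # estimated priors p(C_k)
print(clf.predict_proba([[1.0, 0.5]]))   # posterior p(C_k | x)
print(clf.predict([[1.0, 0.5]]))         # predicted class
```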

Read more in the Naive Bayes notes.