Classification

The basic problem of classification is: given an observation $x$, assign it to one of $k$ classes $C_{j}$, $j = 1, \dots, k$.

Introduction

Supervised Classification uses labeled data, i.e. data consisting of pairs of inputs and outputs $(x_{i}, y_{i})$.

Unsupervised Classification, conversely, uses only the input data $x_{i}$.

Data Sets and Preprocessing

Data is preprocessed using:

LDA is a more reasonable choice for Supervised Classification.

Data is also split up into three sets while setting up the model:

Training set: the data set aside to train the classifier
Validation set: used during training to check whether we’re using the correct model (model selection), and to tune the model hyperparameters
Test set: used to evaluate the final model performance
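
As a rough illustration of the three-way split described above, here is a minimal sketch using only the Python standard library. The function name `split_dataset`, the 70/15/15 proportions, and the toy data are assumptions made for this example; they are not prescribed by these notes.

```python
import random


def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle samples and split them into train, validation and test lists.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility

    n = len(shuffled)
    n_test = round(n * test_fraction)
    n_val = round(n * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything left over is training data
    return train, val, test


# Toy labeled data: pairs (x_i, y_i) where the label is the parity of x_i.
data = [(x, x % 2) for x in range(100)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids any ordering in the original data leaking into one of the sets. The test set should only be touched once, after model selection on the validation set is finished.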

Performance

For a single-value summary of model performance, a metric such as accuracy is used.

For a more detailed view of classification performance, consider using the confusion matrix.
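
To make the two performance summaries concrete, the sketch below builds a confusion matrix from true and predicted labels and derives accuracy from it. It assumes classes are encoded as integers $0, \dots, k-1$, with rows indexing the true class and columns the predicted class. The helper names and the toy labels are made up for this example.

```python
def confusion_matrix(y_true, y_pred, k):
    """Return a k x k matrix where entry [t][p] counts samples of
    true class t that were predicted as class p."""
    matrix = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[j][j] for j in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for a 3-class problem (illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, k=3)
for row in cm:
    print(row)
print("accuracy:", accuracy(cm))
```

The off-diagonal entries show which classes get confused with which, information that a single accuracy number hides.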

Core Principles of Classification

To classify a given input data sample $x$ into one of $k$ classes:

Define the classes $C_{j}$ for all $j = 1, ..., k$
Determine the posterior probability $P(C_{j} \mid x)$: the probability that $C_{j}$ is the correct class given the input data

While we can then compute the predicted class as the class with the highest posterior probability, $C^{*} = \arg\max_{C_{j}} P(C_{j} \mid x)$, the class probabilities themselves are often more useful than just knowing the class with the highest probability.
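
A small sketch of the arg max rule above, assuming the posteriors $P(C_{j} \mid x)$ have already been estimated by some model. The class names and probability values are invented for illustration.

```python
def predict_class(posteriors):
    """Return the class with the highest posterior probability.

    `posteriors` maps each class label to P(C_j | x) for one input x.
    """
    return max(posteriors, key=posteriors.get)


# Hypothetical posteriors for one input x over k = 3 classes.
posteriors = {"C1": 0.48, "C2": 0.45, "C3": 0.07}

best = predict_class(posteriors)
ranked = sorted(posteriors.values(), reverse=True)
margin = ranked[0] - ranked[1]  # gap between the top two classes

print(best, round(margin, 2))
```

Here the predicted class is `C1`, but the small gap to `C2` shows the model is far from certain. That is the kind of information lost when only the arg max is kept.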

So how do we classify data like this?

There are two methods:

Supervised Classification - for when your data set is labelled
Unsupervised Classification - for when your data set is unlabelled
