The basic problem of classification is: given an observation, assign it to one of a fixed set of classes.

Introduction

Supervised Classification involves the use of labeled data, i.e. data consisting of pairs of inputs and outputs.

Unsupervised Classification, conversely, utilises only input data, with no output labels.

Data Sets and Preprocessing

Data is preprocessed using:

LDA is a more reasonable choice for Supervised Classification.

Data is also split up into three sets while setting up the model:

  • Training set: the data set aside to train the classifier.
  • Validation set: used during training to check whether we’re using the correct model (model selection) and to tune the model hyperparameters.
  • Test set: used only to evaluate the final model performance.
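As a rough illustration of the three-way split described above, here is a minimal sketch in plain Python. The function name `train_val_test_split`, the 60/20/20 proportions, and the toy data are assumptions made for this example, not something prescribed by the text.

```python
import random


def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and return (train, validation, test) lists.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Toy labeled data: (input, label) pairs.
data = [(i, i % 2) for i in range(10)]
train, val, test = train_val_test_split(data)
```

Shuffling before slicing matters when the data is stored in some order (for example, sorted by class); otherwise one of the sets could end up containing only a single class.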

Performance

For a single-value summary of model performance, a metric such as accuracy is used.

For a more detailed view of classification performance, consider using the confusion matrix, which counts how often each true class is predicted as each class.
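To make the two measures concrete, the sketch below computes accuracy and a confusion matrix in plain Python. The helper names, the row/column convention (rows are true classes, columns are predicted classes), and the toy labels are assumptions chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Count matrix where entry [i][j] is the number of samples
    with true class labels[i] that were predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Toy example with two classes (hypothetical labels).
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["cat", "dog"])
```

The diagonal of the matrix holds the correct predictions, so accuracy is the sum of the diagonal divided by the total count; the off-diagonal entries show which classes are being confused with which.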

Core Principles of Classification

To classify a given input data sample into one of the classes:

  • Define the set of possible classes
  • Determine the posterior probability of each class: the probability that it is the correct class given the input data

While we can then take the predicted class to be the class with the highest probability, the class probabilities themselves are often more useful than just knowing which class scored highest.
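The following sketch shows both uses of the posterior probabilities: picking the most probable class, and using the probability itself to decide whether the prediction is trustworthy. The class names, the probability values, and the 0.9 threshold are made up for illustration.

```python
def predict_class(posteriors):
    """Return the class with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)


# Hypothetical posterior probabilities for one input sample.
posteriors = {"cat": 0.55, "dog": 0.40, "bird": 0.05}

predicted = predict_class(posteriors)

# The probabilities carry more information than the label alone:
# here we only accept predictions above a chosen (assumed) threshold.
confident = posteriors[predicted] >= 0.9
```

In this example the predicted class is "cat", but at 0.55 it is far from certain, which is exactly the information lost if only the predicted label is kept.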

So how do we classify data like this?

There are two methods: