The basic problem of classification is: given an observation, assign it to one of several classes.
Introduction
Supervised Classification involves the use of labelled data, i.e. data consisting of pairs of inputs and outputs.
Unsupervised Classification, conversely, utilises only input data.
Data Sets and Preprocessing
Data is preprocessed using:
LDA is a more reasonable choice for Supervised Classification.
Data is also split up into three sets while setting up the model:
- Training set: This is the set of data set aside to train the classifier
- Validation set: We use this set during training to check if we're using the correct model (model selection), and to tune the model hyperparameters
- Test set: used to evaluate the final model performance
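The three-way split described above can be sketched in a few lines. This is a minimal illustration rather than a prescribed procedure: the function name `split_dataset`, the 60/20/20 proportions, the fixed seed, and the toy data are all assumptions made for the example.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and split them into train/validation/test lists.

    The fractions are illustrative defaults; whatever remains after the
    training and validation portions becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty test set")

    shuffled = list(samples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical labelled data: (input, output) pairs.
data = [(x, x % 2) for x in range(100)]
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before splitting avoids any ordering in the original data leaking into one particular set. The test set should only be touched once the model and its hyperparameters are final.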
Performance
For a single-value summary of model performance, a metric such as accuracy is used.
For a more detailed view of classification performance, consider using the confusion matrix.
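As a rough sketch of both measures, the snippet below computes accuracy and a confusion matrix by hand. The label names and predictions are made up for illustration, and the layout (rows are true classes, columns are predicted classes) is one common convention, assumed here.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Count matrix: rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical true labels and classifier predictions.
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
labels = ["bird", "cat", "dog"]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels)

print(f"accuracy = {acc:.3f}")
for label, row in zip(labels, cm):
    print(label, row)
```

The diagonal of the matrix holds the correct predictions. The off-diagonal entries show which classes are confused with which, something the single accuracy figure hides.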
Core Principles of Classification
To classify a given input data sample into one of the available classes:
- Define the set of possible classes
- Determine the posterior probability of each class: the probability that it is the correct class given the input data
While we can then compute the predicted class as the class with the highest probability, the class probabilities themselves are often more useful than the predicted class alone.
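One way to make this concrete is Bayes' rule, where the posterior of each class is proportional to its prior times the likelihood of the input under that class. The sketch below assumes these two quantities are already known. The class names and numbers are hypothetical, and in practice a model would have to estimate them from data.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule: p(class | x) = p(x | class) * p(class) / p(x)."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # p(x), the normalising constant
    return {c: j / evidence for c, j in joint.items()}


# Hypothetical values for a single input sample x.
priors = {"spam": 0.3, "ham": 0.7}        # p(class)
likelihoods = {"spam": 0.8, "ham": 0.1}   # p(x | class)

post = posteriors(priors, likelihoods)

# The predicted class is the one with the highest posterior probability.
predicted = max(post, key=post.get)

print(post)
print("predicted:", predicted)
```

Keeping the full `post` dictionary rather than only `predicted` lets downstream code see how confident the decision was, for example to defer borderline cases to a human.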
So how do we classify data like this?
There are two methods:
- Supervised Classification - for when your data set is labelled
- Unsupervised Classification - for when your data set is unlabelled
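To illustrate the difference between the two methods, here is a small sketch on one-dimensional toy data. The specific algorithms are choices made for this example, not ones prescribed above: a nearest-centroid classifier stands in for the supervised case and a two-cluster k-means for the unsupervised case. All function names and data values are hypothetical.

```python
def centroid(points):
    """Mean of a list of numbers."""
    return sum(points) / len(points)


def fit_supervised(xs, ys):
    """Nearest-centroid classifier: one centroid per known label."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: centroid(pts) for label, pts in groups.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def fit_unsupervised(xs, n_iter=10):
    """Two-cluster k-means in 1-D: groups are discovered, not named."""
    centres = [min(xs), max(xs)]  # simple initialisation at the extremes
    for _ in range(n_iter):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            clusters[nearest].append(x)
        # Keep the old centre if a cluster ends up empty.
        centres = [centroid(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres


xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
ys = ["small", "small", "small", "large", "large", "large"]

# Supervised: labels are available, so predictions come back as labels.
model = fit_supervised(xs, ys)
label = predict(model, 4.0)

# Unsupervised: only the inputs are used, so we get anonymous cluster centres.
centres = fit_unsupervised(xs)

print("supervised prediction for 4.0:", label)
print("unsupervised cluster centres:", centres)
```

Both approaches find the same two groups here, but only the supervised model can attach a meaningful name to them, because only it saw the labels.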