What is the Confusion Matrix?

Imagine you have a computer program that's really good at telling you whether an email is spam (the kind you don't want) or not. The confusion matrix is like a scorecard that helps you see how well this program is doing. 

Now, let's say you have a bunch of emails, and some of them are spam, and some are not. The confusion matrix helps you understand how many times your program got it right and how many times it got confused.

How does it work?

Here's how the scorecard (a.k.a. the confusion matrix) works:

  • True Positives (TP): These are the times the program correctly said: "Yes, this email is spam."

  • True Negatives (TN): These are the times the program correctly said: "No, this email is not spam."

  • False Positives (FP): These are the times the program made a mistake and said "Yes, this email is spam," when it's actually an email from your boss telling you that you got a promotion and a BIG raise.

  • False Negatives (FN): These are the times the program goofed and said "No, this email is not spam," when it actually was spam, like your local grocery store announcing its big price change of 10 cents.

If we put these values into a matrix, it would look like this:


                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
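
To make this concrete, here is a minimal Python sketch (not from the original text) of how the four counts could be tallied from a list of actual labels and a list of predictions; the "spam"/"ham" labels and the example data are made up for illustration.

```python
def confusion_counts(actual, predicted, positive_label="spam"):
    """Tally TP, TN, FP, FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive_label and p == positive_label:
            tp += 1   # correctly flagged as spam
        elif a != positive_label and p != positive_label:
            tn += 1   # correctly left in the inbox
        elif a != positive_label and p == positive_label:
            fp += 1   # e.g., the promotion email sent to the spam folder
        else:
            fn += 1   # spam that slipped through
    return tp, tn, fp, fn

# Made-up example: "spam" is the positive class, "ham" means not spam.
actual    = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "ham", "ham"]
print(confusion_counts(actual, predicted))   # (1, 2, 1, 1)
```

Libraries such as scikit-learn compute this for you (sklearn.metrics.confusion_matrix), but note that its default binary layout lists the negative class first, i.e. [[TN, FP], [FN, TP]], so the cells sit in different positions than in the table above.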


In the context of machine learning, positive and negative instances (also referred to as samples) are defined by the specific classification problem: they represent the outcomes, or classes, that the model is trained to predict.


In binary classification problems, where there are only two classes (e.g., spam or not spam), one of the classes is designated as the "positive" class, and the other is the "negative" class. The choice of which class is considered positive depends on the problem's context and the specific goal of the analysis.

Here are two examples:

  • Medical Test

    In a medical test for a disease, the positive class might be the presence of the disease, and the negative class would be the absence of the disease.

  • Credit Fraud

    In a credit card fraud detection system, the positive class might be fraudulent transactions, and the negative class would be legitimate transactions.

Why is the Confusion Matrix important?

The confusion matrix is a powerful tool for gaining a deeper understanding of how well a machine learning model is doing and for making informed decisions to improve its performance. More specifically, a confusion matrix helps you evaluate:

  • How often the model is making correct predictions (True Positives and True Negatives) and where it might be making mistakes (False Positives and False Negatives).

  • Various performance metrics, such as accuracy, precision, recall, and F1 score. These metrics offer more nuanced insights into the model's strengths and weaknesses; the next section covers several of them.

  • The differences between various models. The confusion matrix allows for a detailed comparison of their performance, helping you select the most suitable model for a given task. Keep in mind that the relative importance of False Positives and False Negatives can vary with the context of the problem. For example, in medical diagnosis, a False Negative (missing a real positive) is often more critical than a False Positive.

Performance Metrics

Acronym   Meaning
TP        True Positive
TN        True Negative
FP        False Positive
FN        False Negative
P         Total number of positives (P = TP + FN)
N         Total number of negatives (N = TN + FP)


Accuracy: how often the classifier makes correct predictions
Error Rate: how often the classifier makes wrong predictions


\(\textrm{accuracy}=\frac{TP + TN}{P+N}\)

\(\textrm{error rate}=\frac{FP + FN}{P+N}\)
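
As a quick sketch in Python, assuming the four counts have already been tallied (so that \(P+N = TP+TN+FP+FN\)); the counts below are made up:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # (TP + TN) / (P + N)

def error_rate(tp, tn, fp, fn):
    return (fp + fn) / (tp + tn + fp + fn)   # (FP + FN) / (P + N)

# Illustrative counts (made up): 90 TP, 50 TN, 10 FP, 50 FN
print(accuracy(90, 50, 10, 50))     # 0.7
print(error_rate(90, 50, 10, 50))   # 0.3, which is always 1 - accuracy
```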




Sensitivity: how well the classifier can recognize the positive instances 

Specificity: how well the classifier can recognize the negative instances


\( \textrm{sensitivity}=\frac{TP}{P}\)

\(\textrm{specificity}=\frac{TN}{N}\)
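
Continuing the sketch with the same made-up counts (\(P = TP + FN = 140\), \(N = TN + FP = 60\)):

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)   # TP / P, the true positive rate

def specificity(tn, fp):
    return tn / (tn + fp)   # TN / N, the true negative rate

print(sensitivity(tp=90, fn=50))   # 0.642857... ≈ 0.64
print(specificity(tn=50, fp=10))   # 0.833333... ≈ 0.83
```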


It can be shown that accuracy is a function of sensitivity and specificity:
 

\(\textrm{accuracy}=\textrm{sensitivity}\times \frac{P}{P+N} + \textrm{specificity} \times \frac{N}{P+N}\)
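
To see why, split the numerator of the accuracy formula into its two terms and multiply the first by \(\frac{P}{P}\) and the second by \(\frac{N}{N}\); the resulting fractions \(\frac{TP}{P}\) and \(\frac{TN}{N}\) are exactly sensitivity and specificity:

\(\frac{TP + TN}{P+N}=\frac{TP}{P}\cdot \frac{P}{P+N}+\frac{TN}{N}\cdot \frac{N}{P+N}\)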




Precision can be thought of as a measure of exactness (i.e., what percentage of instances labeled as positive are actually such), whereas recall is a measure of completeness (what percentage of positive instances are labeled as such). 


Note: If recall seems familiar, that’s because it is the same as sensitivity (or the true positive rate). 


\(\textrm{precision}=\frac{TP}{TP+FP}\)

\(\textrm{recall}=\frac{TP}{TP+FN}=\frac{TP}{P}\)
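
In code, again with the made-up counts used earlier (90 TP, 10 FP, 50 FN):

```python
def precision(tp, fp):
    return tp / (tp + fp)   # exactness: share of predicted positives that are correct

def recall(tp, fn):
    return tp / (tp + fn)   # completeness: share of actual positives that were found

print(precision(tp=90, fp=10))   # 0.9
print(recall(tp=90, fn=50))      # 0.642857..., the same value as sensitivity above
```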



F1 score is a combination of precision and recall and provides a balance between these two metrics. It is particularly useful when you want to strike a balance between precision and recall, especially when there is an uneven class distribution (i.e., the number of positive and negative instances is significantly different).


\(F = \frac{2 \times \textrm{precision} \times \textrm{recall}}{\textrm{precision} + \textrm{recall}}\)
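
A small sketch, plugging in the precision (0.9) and recall (9/14) obtained from the made-up counts above:

```python
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0               # guard against division by zero
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 9 / 14))     # ≈ 0.75
```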



Here is a table that you can use for reference. Note that some measures are known by more than one name.

Measure                                                  Formula
accuracy, recognition rate                               \(\textrm{accuracy}=\frac{TP + TN}{P+N}\)
error rate, misclassification rate                       \(\textrm{error rate}=\frac{FP + FN}{P+N}\)
sensitivity, true positive rate, recall                  \(\textrm{sensitivity}=\frac{TP}{P}\)
specificity, true negative rate                          \(\textrm{specificity}=\frac{TN}{N}\)
precision                                                \(\textrm{precision}=\frac{TP}{TP+FP}\)
F, F1, F-score, harmonic mean of precision and recall    \(F = \frac{2 \times \textrm{precision} \times \textrm{recall}}{\textrm{precision} + \textrm{recall}}\)