In this lecture, we're going to take a closer look at a couple of metrics we can use to evaluate the predictions of a classification model.
Let's consider a machine learning system used in a hospital. The system analyzes chest x-ray images of patients and predicts if the patient has pneumonia or not.
This is actually a very popular machine learning exercise. There are public data sets available with labelled chest x-rays, and several academic teams are racing to build a diagnostic system that can beat medical doctors.
But let's return to the model. The classifier predicts if a patient has pneumonia or not. We're going to define our output label as either positive (the patient has pneumonia) or negative (the patient is healthy).
A patient can either be sick or healthy, and our model can predict either a positive or negative outcome. That gives us four quadrants:
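The quadrants are defined as follows: a true positive (TP) is a sick patient that the model flags as having pneumonia; a false positive (FP) is a healthy patient that the model flags as having pneumonia; a false negative (FN) is a sick patient that the model labels as healthy; and a true negative (TN) is a healthy patient that the model labels as healthy.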
A useful metric for gauging the reliability of our model is called accuracy. It's defined as the fraction of predictions that our model got right:
$$A=\frac{\text{correct predictions}}{\text{all predictions}}=\frac{TP+TN}{TP+TN+FP+FN}$$
If we feed in the number of occurrences listed in each quadrant, we get an accuracy of 0.91. In other words, out of every 100 predictions, our model gets 91 right.
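To make the arithmetic concrete, here is a minimal Python sketch of the accuracy formula. The quadrant counts are an assumption: they are inferred from the figures quoted in this lecture (9 pneumonia patients, 1 caught, 2 positive predictions in total, accuracy 0.91), which is consistent with a set of 100 patients. The function name `accuracy` is just illustrative.

```python
# Quadrant counts inferred from the numbers quoted in the text (an assumption):
# 1 true positive, 1 false positive, 8 false negatives, 90 true negatives.
TP, FP, FN, TN = 1, 1, 8, 90


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


print(accuracy(TP, FP, FN, TN))  # 0.91
```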
That's a pretty great result, right?
Well, not really. Consider the patients with pneumonia. We have 9 patients with pneumonia, but our model only predicted one case correctly. If we test our model on a population of patients that are all sick, then our model only identifies 11% of them correctly.
That's not good.
And it gets worse. Let's say I build a second model but instead of using machine learning, I just write some code that ignores the chest x-ray and prints "not pneumonia" every time.
When I run this second model, I also get an accuracy of 0.91. In other words, our machine learning model is no better than one that has zero predictive ability to distinguish pneumonia patients from healthy patients.
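We can check this claim with a few lines of Python. This is only a sketch: the label list assumes 100 patients, 9 of them with pneumonia, and the machine learning model's predictions are reconstructed from the counts quoted in this lecture rather than taken from a real data set.

```python
# 1 = pneumonia, 0 = healthy. Assumed population: 9 sick and 91 healthy patients.
labels = [1] * 9 + [0] * 91

# Reconstructed ML model: catches 1 of the 9 sick patients
# and raises 1 false alarm on a healthy patient.
ml_model = [1] + [0] * 8 + [1] + [0] * 90

# The "second model": ignores the x-ray and always says "not pneumonia".
always_negative = [0] * len(labels)


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


print(accuracy(ml_model, labels))         # 0.91
print(accuracy(always_negative, labels))  # 0.91
```

Both models score the same, even though the second one never finds a single pneumonia patient.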
So it's pretty obvious that accuracy does not paint a full picture. When we're working with a class-imbalanced data set like this one, where there is a significant disparity between the number of positive and negative labels, we're going to need a different metric.
A possible metric we can use is precision. It is defined as the fraction of positive model predictions that are correct. We can define precision as follows:
$$P=\frac{\text{correct positive predictions}}{\text{all positive predictions}}=\frac{TP}{TP+FP}$$
Our model made 1 correct positive prediction out of a total of 2 positive predictions, so the precision is 0.5. For every 100 pneumonia predictions the model makes, it will get 50 correct.
Another useful metric is called recall. It is defined as the fraction of positive cases that were predicted correctly by the model. We can define recall as follows:
$$R=\frac{\text{correct positive predictions}}{\text{all positive cases}}=\frac{TP}{TP+FN}$$
Our model correctly identified 1 of the 9 pneumonia cases, so the recall is 0.11. For every 100 pneumonia patients, the model will correctly identify only 11.
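Here is how the two formulas look in Python, as a minimal sketch. The counts come straight from the numbers above (1 true positive, 2 positive predictions, 9 pneumonia cases); the function names `precision` and `recall` are illustrative, not taken from any particular library.

```python
# Counts taken from the text: 1 true positive, 1 false positive, 8 false negatives.
TP, FP, FN = 1, 1, 8


def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positive cases that the model found."""
    return tp / (tp + fn)


print(round(precision(TP, FP), 2))  # 0.5
print(round(recall(TP, FN), 2))     # 0.11
```

Note that neither formula uses the true negatives, which is why these metrics are not inflated by the large number of healthy patients.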
To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, there's often an inverse relationship between the two. That is, improving precision typically reduces recall, and vice versa.
Here's how that works. Imagine we have the following set of predictions, sorted by probability from 0 to 1:
We have set the classification threshold to 0.6. At this setting, we get 7 true positives, 2 false positives, 3 false negatives, and 13 true negatives.
That gives a precision of 7/9 ≈ 0.78 and a recall of 7/10 = 0.7.
Now let's increase the threshold to 0.7. That gives us the following set of predictions:
The two predictions between 0.6 and 0.7 flip from FP, TP to TN, FN. We now have 6 true positives, 1 false positive, 4 false negatives, and 14 true negatives.
That gives a precision of 6/7 ≈ 0.86 and a recall of 6/10 = 0.6. The precision has gone up and the recall has gone down.
Now let's go back to the original threshold of 0.6 and lower it to 0.5:
The two predictions between 0.5 and 0.6 flip from TN, FN to FP, TP. We now have 8 true positives, 3 false positives, 2 false negatives, and 12 true negatives.
The precision is 8/11 ≈ 0.73 and the recall is 0.8. The precision has gone down and the recall has gone up.
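The threshold experiment can be sketched in Python as follows. Keep in mind that the scores below are hypothetical: they are made up so that the true and false positive and negative counts match the ones quoted in this lecture at thresholds 0.5, 0.6, and 0.7, and they are not the actual model outputs. The helper `confusion_counts` and the rule "predict positive when the score is at or above the threshold" are also assumptions for the sake of illustration.

```python
# Hypothetical scores, chosen only to reproduce the counts quoted in the text.
positive_scores = [0.95, 0.90, 0.88, 0.85, 0.80, 0.75, 0.65, 0.55, 0.40, 0.20]
negative_scores = [0.78, 0.62, 0.52, 0.45, 0.42, 0.38, 0.35, 0.30,
                   0.28, 0.25, 0.22, 0.18, 0.15, 0.10, 0.05]

scores = positive_scores + negative_scores
labels = [1] * len(positive_scores) + [0] * len(negative_scores)


def confusion_counts(scores, labels, threshold):
    """Return (TP, FP, FN, TN), predicting positive when score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


for threshold in (0.5, 0.6, 0.7):
    tp, fp, fn, tn = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

Reading the output from the lowest threshold to the highest, precision rises while recall falls, which is the trade-off described above.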