
Which performance metrics to use for evaluating a classification model on imbalanced datasets?


There are various metrics for evaluating a classification model: Accuracy, Precision, Recall, F1-score, and the AUC-ROC score. However, newcomers to Machine Learning often find it confusing to decide which performance metrics to use when evaluating a model on an imbalanced dataset in a classification setting.

In this post, I will explain how to answer the above question in different cases.

What is a confusion matrix?

It is a table (rows and columns) that describes the performance of a classification model in terms of four counts: TP, TN, FP, and FN.

Let’s suppose we have a cancer dataset in which, based on some medical reports, we are supposed to predict who is going to suffer from cancer in the near future. Then TP, TN, FP, and FN can be defined as follows (see the short sketch after this list):

  • True Positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • True Negatives (TN): We predicted no, and they don’t have the disease.
  • False Positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • False Negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)
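
As a quick illustration, here is a minimal sketch of how these four counts can be read off scikit-learn's confusion_matrix (the tiny label/prediction arrays are made up purely for demonstration):

import numpy as np
from sklearn.metrics import confusion_matrix

# Toy example: 1 = has the disease, 0 = does not
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# For binary labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=4, TN=2, FP=1, FN=1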

The confusion matrix alone often becomes hard to interpret, especially in a multi-class classification problem. So let's get to know the performance metrics that are derived from it.

For a better understanding of these metrics, I take a Binary Classification problem (1 => positive class and 0 => negative class) with the following two imbalanced scenarios:

  1. Case 1: A dataset in which the number of positive points is much larger than the number of negative points (number of positive points >> number of negative points)
  2. Case 2: A dataset in which the number of negative points is much larger than the number of positive points (number of negative points >> number of positive points)

Case 1: number of positive samples >> number of negative samples

I deliberately created imbalanced datasets to make the behaviour of each metric easy to see.

Assume that our trained classifier assigns every sample a score above the threshold, so it labels every sample as positive and every actual negative becomes a False Positive (FP).

import numpy as np
import pandas as pd
# 10,000 positive samples and 100 negative samples
Y = np.hstack((np.ones((10000,)), np.zeros((100,))))
# Predicted probabilities: every score lies above the 0.5 threshold
Y_score = np.random.uniform(0.5, 0.9, 10100)
df_imb = pd.DataFrame(data=np.array((Y, Y_score)).T, columns=['y', 'proba'])
df_imb = df_imb.sample(10100)

# Apply the 0.5 threshold to get hard class predictions
df_imb['y_pred'] = [0 if y_score < 0.5 else 1 for y_score in df_imb.proba]

and then

from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(df_imb.y, df_imb.y_pred)
conf_mat

Key observation: we get a lot of FP (every actual negative is misclassified as positive).

Accuracy: The proportion of correct predictions out of all predictions.

Accuracy score: (TP+TN)/(TP+TN+FP+FN) = 0.9900990099009901

Precision: Out of all samples predicted as positive by the model, how many actually belong to the positive class.

Precision: TP/(TP+FP) = 0.9900990099009901

Recall: Out of all actual positive samples, how many are predicted as positive by the model. Recall is also called the True Positive Rate (TPR), Sensitivity, or probability of detection.

Recall: (TP)/(TP+FN) = 1.0

F1-score: The Harmonic Mean of Precision and Recall.

F1-score = 2 * (precision*recall)/(precision+recall)= 0.9950248756218906

True Positive Rate(TPR) = Recall

False Positive Rate (FPR): Out of all actual negative samples, how many are predicted as positive by the model. Its range is 0 to 1 (lower is better).

FPR = (FP)/(FP+TN)= 1.0
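
These numbers can be reproduced from the df_imb frame built above; a minimal sketch with scikit-learn (FPR is derived from the confusion matrix, since scikit-learn has no dedicated FPR function):

from sklearn import metrics

y_true, y_hat = df_imb.y, df_imb.y_pred
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

print('Accuracy :', metrics.accuracy_score(y_true, y_hat))   # ~0.9901
print('Precision:', metrics.precision_score(y_true, y_hat))  # ~0.9901
print('Recall   :', metrics.recall_score(y_true, y_hat))     # 1.0
print('F1-score :', metrics.f1_score(y_true, y_hat))         # ~0.9950
print('FPR      :', fp / (fp + tn))                          # 1.0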

ROC curve

A receiver operating characteristic curve, or ROC curve, plots the true positive rate (TPR / Recall) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings. The ROC curve measures how well the classifier can separate the positive and negative classes.

How is the ROC curve drawn?

We take each predicted probability score (e.g., the output of LogisticRegression.predict_proba) as a threshold, compute the confusion matrix at that threshold, and measure the resulting TPR and FPR. Plotting the (FPR, TPR) pair for every threshold gives the ROC curve.
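
A minimal sketch of this threshold sweep on the Case 1 data built above (sklearn's roc_curve performs the same sweep over every distinct score internally):

from sklearn.metrics import confusion_matrix, roc_curve

# Manual sweep over a few illustrative thresholds
for t in [0.3, 0.5, 0.7, 0.9]:
    pred_t = (df_imb.proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(df_imb.y, pred_t).ravel()
    print(f'threshold={t}: TPR={tp / (tp + fn):.3f}, FPR={fp / (fp + tn):.3f}')

# sklearn sweeps every score as a threshold and returns the (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(df_imb.y, df_imb.proba)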

The ROC curve can be extended to a multiclass classification problem using the one-vs-all approach.

Example ROC curves: https://en.wikipedia.org/wiki/Receiver_operating_characteristic#/media/File:ROC_curves.svg

The diagonal line represents a random model that predicts 1 or 0 at random; the area under the diagonal line is 0.5.
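
This is easy to verify empirically: scoring the Case 1 labels Y with purely random scores gives an AUC close to 0.5 (a small sketch):

import numpy as np
from sklearn.metrics import roc_auc_score

random_scores = np.random.uniform(0, 1, len(Y))
print(roc_auc_score(Y, random_scores))  # ~0.5: no better than random guessing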

How to interpret the AUC score in the ROC curve?

For example, imagine the test examples arranged from left to right in ascending order of the model's predicted scores.

AUC represents the probability that a randomly chosen positive example is positioned to the right of (i.e., scored higher than) a randomly chosen negative example. AUC provides an aggregate measure of performance across all possible classification thresholds. AUC ranges in value from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, and one whose predictions are 100% correct has an AUC of 1.0.
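
This ranking interpretation can be checked directly on the Case 1 data by sampling random positive/negative pairs and comparing the result with roc_auc_score (a rough sketch; pair sampling is only an approximation, so the two numbers will agree approximately):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos_scores = df_imb.proba[df_imb.y == 1].to_numpy()
neg_scores = df_imb.proba[df_imb.y == 0].to_numpy()

# Fraction of random (positive, negative) pairs where the positive is scored higher
pairs = 100_000
estimate = np.mean(rng.choice(pos_scores, pairs) > rng.choice(neg_scores, pairs))

print('Pairwise estimate:', estimate)
print('roc_auc_score    :', roc_auc_score(df_imb.y, df_imb.proba))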

ROC curve for our synthetic dataset (figure omitted). AUC score: 0.4580425

Key Observations

Accuracy score: 0.9900990099009901
FPR: 1.0
Precision: 0.9900990099009901
Recall: 1.0
F1-score: 0.9950248756218906
AUC score: 0.4580425

A. Metrics that don’t help to measure your model:

  • Accuracy: is very high even though TN = 0. Since the data is imbalanced (a very large number of positive samples), the numerator TP + TN is dominated by TP.
  • Precision: is very high. Since the data has a disproportionately high number of positive cases, the ratio TP/(TP+FP) stays high even though every negative is misclassified.
  • Recall: is 1.0. Since the classifier predicts everything as positive, FN = 0 and the ratio TP/(TP+FN) is trivially perfect.
  • F1-score: is very high. The inflated values of Precision and Recall make the F1-score misleading as well.

Precision and Recall focus on the positive class, so when the dataset inherently contains mostly positive cases, they are not good metrics for measuring model performance.

B. Metrics that help to measure your model:

  • FPR: is 1.0. Since our model predicts everything as 1, every actual negative becomes an FP, which signals that this is not a good classifier/model.
  • AUC score: is very low and represents the true picture of evaluation here.

Case 2: number of negative samples >> number of positive samples

Here we do the opposite of the previous situation and make the negative class the overwhelming majority:

import numpy as np
import pandas as pd
# 100 positive samples and 10,000 negative samples
Y = np.hstack((np.ones((100,)), np.zeros((10000,))))
# Negative-class scores mostly fall below the 0.5 threshold; the positive-class score
# distribution is assumed here (uniform around the threshold, so roughly half cross it),
# which is consistent with the metrics reported below
Y_score = np.hstack((np.random.uniform(0.4, 0.6, 100),
                     np.random.uniform(0.1, 0.51, 10000)))
df_imb = pd.DataFrame(data=np.array((Y, Y_score)).T, columns=['y', 'proba'])
df_imb = df_imb.sample(10100)


def pred(X):
    # Apply a 0.5 threshold to an array of predicted probabilities
    # (in a real model, X[i] would be a sigmoid output 1/(1+exp(-(dot(x,w)+b))))
    N = len(X)
    predict = []
    for i in range(N):
        if X[i] >= 0.5:
            predict.append(1)
        else:
            predict.append(0)
    return np.array(predict)

from sklearn import metrics
print(f'Accuracy score :{metrics.accuracy_score(Y, pred(Y_score)):>{20}}')
print(f'F1-score:{metrics.f1_score(Y, pred(Y_score)):>{26}}')
print(f'ROC AUC score:{metrics.roc_auc_score(Y, Y_score):>{25}}')
print(f'Precision:{metrics.precision_score(Y, pred(Y_score)):>{25}}')
print(f'Recall:{metrics.recall_score(Y, pred(Y_score)):>{15}}')
metrics.confusion_matrix(Y, pred(Y_score))
Confusion matrix and ROC curve for the second case (figures omitted).

Accuracy score:  0.9722772277227723
FPR:             0.0232
Precision:       0.18309859154929578
Recall (TPR):    0.52
F1-score:        0.27083333333333337
ROC AUC score:   0.9276659999999999

Key Observations

A. Metrics that don’t help to measure your model:

  • Accuracy: is very high since the proportion of TN is high (a large number of negative samples), so the numerator TP + TN is dominated by TN.
  • AUC score: is high, even though the model misses almost half of the actual positives (recall is only 0.52).
  • FPR: is low. It gets skewed by the large number of TN in the imbalanced data, even though the classifier produces a lot of FP.

The AUC score doesn't capture the true picture when a dataset has a negative majority class and our focus is the minority positive class.

B. Metrics that help to measure your model:

  • Precision: is very low. Because of the high number of FP relative to TP, the ratio TP/(TP+FP) becomes low.
  • Recall: is low. The classifier misses almost half of the actual positives (high FN), so the ratio TP/(TP+FN) drops to 0.52.
  • F1-score: is low. The low values of Precision and Recall make the F1-score a good indicator of the poor performance here.

Summary:

  1. Use the AUC score when the positive class is the majority and your focus class is the negative one.
  2. Use Precision, Recall, and F1-score when the negative class is the majority and your focus class is the positive one.
  3. The Accuracy score doesn't help much in imbalanced situations.
  4. A high FPR tells you that the classifier/model predicts a high number of False Positives.

Note: which class is "positive" and which is "negative" is purely a matter of convention in your situation. You can simply flip the labels, decide your focus class based on the given business problem, and then opt for the correct performance metrics as discussed in this post.
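
In scikit-learn this flip doesn't even require relabelling the data: the pos_label argument lets you evaluate whichever class the business problem cares about. A small sketch, assuming the Case 2 arrays Y and Y_score and the pred function from above:

from sklearn import metrics

y_hat = pred(Y_score)

# Treat class 0 (currently the "negative" class) as the focus class
print('Precision (class 0):', metrics.precision_score(Y, y_hat, pos_label=0))
print('Recall    (class 0):', metrics.recall_score(Y, y_hat, pos_label=0))
print('F1-score  (class 0):', metrics.f1_score(Y, y_hat, pos_label=0))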

Source:

https://medium.com/datasciencestory/performance-metrics-for-evaluating-a-model-on-an-imbalanced-data-set-1feeab6c36fe

https://towardsdatascience.com/demystifying-roc-and-precision-recall-curves-d30f3fad2cbf

Amir Masoud Sefidian
Machine Learning Engineer