Intuitively Understanding Approaches to Handling Imbalanced Data in Machine Learning

Eric Cai
4 min read · Sep 18, 2023

When building classification models in machine learning, one of the most common issues we encounter is data imbalance. For instance, when predicting credit card fraud or rare diseases, the positive-to-negative ratio can be 1 to 10 or even 1 to 100. This causes a significant problem: the model may simply predict every record as negative and still achieve a seemingly high accuracy, even though that accuracy is deceptive.

How do we address this? There are three commonly used approaches: changing the classification threshold, changing the evaluation metric, and changing the sampling method.

Changing Threshold

Let’s take logistic regression as an example. By default, the threshold between positive (1) and negative (0) is set at 0.5. This means that when the model’s predicted probability is greater than 0.5, the data point is labeled as 1; otherwise, it’s labeled as 0.

When the data is imbalanced, the model sees far more negatives than positives during training, so the predicted probabilities, even for many true positives, tend to fall below 0.5. To address this, we can lower the threshold, for example from 0.5 to 0.1, so that more points are labeled positive. This trades some false positives for a better chance of catching the rare positive cases.
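The idea can be sketched in scikit-learn. Note that `predict()` hard-codes the 0.5 threshold, so we apply a custom threshold to the output of `predict_proba` instead. The dataset here is a synthetic stand-in, not data from the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 10% positives (minority class)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class for each test point
proba = model.predict_proba(X_test)[:, 1]

# Default behavior: equivalent to model.predict(X_test)
default_preds = (proba >= 0.5).astype(int)

# Lowered threshold: more points cross the bar and get labeled positive
lowered_preds = (proba >= 0.1).astype(int)

print("positives predicted at threshold 0.5:", default_preds.sum())
print("positives predicted at threshold 0.1:", lowered_preds.sum())
```

Lowering the threshold can only increase (or keep equal) the number of predicted positives, which is exactly the lever this section describes. In practice you would pick the threshold by examining a precision-recall curve on a validation set rather than hard-coding 0.1.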

Changing Evaluation Metric
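As a sketch of the motivation behind this approach: accuracy rewards the degenerate "always predict negative" model described above, while metrics such as precision, recall, and F1 expose it. The toy labels below are illustrative numbers, not data from the article:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy imbalanced ground truth: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate model that predicts everything as negative
y_pred = np.zeros(100, dtype=int)

# Accuracy looks impressive: 0.95
print("accuracy:", accuracy_score(y_true, y_pred))

# Recall and F1 reveal the model never finds a single positive: both 0.0
print("recall:", recall_score(y_true, y_pred, zero_division=0))
print("f1:", f1_score(y_true, y_pred, zero_division=0))
```

This is why, on imbalanced problems, minority-class-aware metrics are usually a better model-selection criterion than raw accuracy.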

Written by Eric Cai

Things about data science that can’t be learned from textbooks. Experienced data scientist in the healthcare and fintech industries.