Intuitively Understanding Approaches to Handling Imbalanced Data in Machine Learning
When building classification models in machine learning, one of the common issues we encounter is data imbalance. For instance, when predicting credit card fraud or rare diseases, the ratio of positives to negatives can be 1 to 10 or even 1 to 100. This creates a significant problem: the model can achieve a seemingly high accuracy simply by predicting every record as negative, even though that accuracy is deceptive.
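To see how deceptive accuracy can be, here is a minimal sketch. The 1:100 ratio and the labels are made up for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical 1:100 imbalanced labels: 10 positives, 990 negatives
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that blindly predicts negative for every record
y_pred = np.zeros_like(y_true)

# Accuracy looks impressive even though not a single positive is caught
print(accuracy_score(y_true, y_pred))  # 0.99
```

Ninety-nine percent accuracy, and yet the model is useless for the task it was built for.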
How do we address this? There are three commonly used approaches: changing the threshold, changing the evaluation metric, and changing the sampling method.
Changing Threshold
Let’s take logistic regression as an example. By default, the threshold between positive (1) and negative (0) is set at 0.5: when the model’s predicted probability is greater than 0.5, the data point is labeled 1; otherwise, it’s labeled 0.
When the training data is imbalanced, the predicted probabilities on new data tend to be low, so most of them fall below 0.5 and nearly everything gets labeled negative. To address this, we can lower the threshold, say from 0.5 to 0.1, so that more borderline cases are labeled positive. This trades some precision for higher recall on the minority class.
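Here is a minimal sketch of threshold tuning with scikit-learn. The synthetic dataset, the roughly 5% positive rate, and the 0.1 cutoff are all illustrative assumptions, not fixed recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)

# predict() hard-codes the 0.5 cutoff; predict_proba() lets us pick our own
proba = model.predict_proba(X_test)[:, 1]

for threshold in (0.5, 0.1):
    y_pred = (proba >= threshold).astype(int)
    print(
        f"threshold={threshold}: "
        f"recall={recall_score(y_test, y_pred):.2f}, "
        f"precision={precision_score(y_test, y_pred):.2f}"
    )
```

Running this, you should see recall rise as the threshold drops while precision falls, which is exactly the trade-off to weigh against the cost of a missed fraud case versus a false alarm. In practice, the cutoff is usually chosen by examining a precision-recall curve rather than picking 0.1 up front.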