
In machine learning, many of us come across problems such as fraud detection and anomaly detection in which the classes are highly imbalanced: in a binary classification problem (classes 0 and 1), more than 85% of the data points may belong to a single class. This blog covers techniques for handling such highly imbalanced data.
Let’s dive into the details of these techniques.
List of techniques:
- Random Undersampling
- Random Oversampling
- Python imblearn Undersampling
- Python imblearn Oversampling
- Oversampling : SMOTE(Synthetic Minority Oversampling Technique)
- Undersampling: Tomek Links
Looking at imbalanced data
Here, I have collected raw data on the classification of glass. It has two imbalanced classes; the imbalance is not extreme, but it is there. Let’s take a look with the help of code.
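As a minimal sketch (assuming the data sits in a local CSV file named `glass.csv`, a hypothetical name, with the columns shown below), loading and peeking at the first rows looks like this:

```python
import pandas as pd

# Load the glass classification data (file name is an assumption)
data = pd.read_csv('glass.csv')

# Peek at the first nine rows
data.head(9)
```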
|   | RI_real  | Na_real  | Mg_real | Al_real | Si_real | K_real  | Ca_real  | Ba_real | Fe_real | Class    |
|---|----------|----------|---------|---------|---------|---------|----------|---------|---------|----------|
| 0 | 1.515888 | 12.87795 | 3.43036 | 1.40066 | 73.2820 | 0.68931 | 8.04468  | 0.00000 | 0.1224  | negative |
| 1 | 1.517642 | 12.97770 | 3.53812 | 1.21127 | 73.0020 | 0.65205 | 8.52888  | 0.00000 | 0.0000  | negative |
| 2 | 1.522130 | 14.20795 | 3.82099 | 0.46976 | 71.7700 | 0.11178 | 9.57260  | 0.00000 | 0.0000  | negative |
| 3 | 1.522221 | 13.21045 | 3.77160 | 0.79076 | 71.9884 | 0.13041 | 10.24520 | 0.00000 | 0.0000  | negative |
| 4 | 1.517551 | 13.39000 | 3.65935 | 1.18880 | 72.7892 | 0.57132 | 8.27064  | 0.00000 | 0.0561  | negative |
| 5 | 1.520991 | 13.68925 | 3.59200 | 1.12139 | 71.9604 | 0.08694 | 9.40044  | 0.00000 | 0.0000  | negative |
| 6 | 1.517551 | 13.15060 | 3.60996 | 1.05077 | 73.2372 | 0.57132 | 8.23836  | 0.00000 | 0.0000  | negative |
| 7 | 1.519100 | 13.90205 | 3.73119 | 1.17917 | 72.1228 | 0.06210 | 8.89472  | 0.00000 | 0.0000  | negative |
| 8 | 1.517938 | 13.21045 | 3.47975 | 1.41029 | 72.6380 | 0.58995 | 8.43204  | 0.00000 | 0.0000  | negative |

So, here the positive class covers only about 15% of the data, which makes the dataset imbalanced.
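A quick way to verify this (a small sketch, continuing with the `data` DataFrame loaded above):

```python
# Absolute counts and relative proportions of each class
print(data.Class.value_counts())
print(data.Class.value_counts(normalize=True))  # class fractions
```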
Now, let’s start looking at techniques to handle this problem.
Random Undersampling
In random undersampling, the majority class is downsampled, i.e., some of the examples that belong to the majority class are removed at random. This has its advantages, but it may cause a lot of information loss in some cases. Let’s look at the code and see how to perform undersampling in Python.
```python
# Class counts (value_counts returns counts in descending order)
count_class_n, count_class_p = data.Class.value_counts()

# Divide by class
data_class_n = data[data['Class'] == 'negative']
data_class_p = data[data['Class'] == 'positive']

# Undersample the negative (majority) class down to the positive count
data_class_n_under = data_class_n.sample(count_class_p)

# Concatenate the undersampled negative class with the minority class for the final data set
data_under = pd.concat([data_class_n_under, data_class_p], axis=0)

# Check the counts of each class after undersampling
print('Class count after under-sampling:')
print(data_under.Class.value_counts())

# Plot the class counts
data_under.Class.value_counts().plot(kind='bar', title='Count of Classes')
```

```
Class count after under-sampling:
negative    33
positive    33
Name: Class, dtype: int64
```

After undersampling, we have 33 data points in each class.
Here, undersampling is not necessarily the better option: we start with only around 200 points, and undersampling reduces that to just 33 per class, which is very little data. So, it depends upon the use case as well.
Random Oversampling
Random oversampling will create duplicates of randomly selected examples in the minority class.
```python
# Oversample the positive (minority) class up to the negative count
data_class_p_over = data_class_p.sample(count_class_n, replace=True)

# Concatenate the oversampled positive class with the majority class for the final data set
data_over = pd.concat([data_class_p_over, data_class_n], axis=0)

# Check the counts of each class after oversampling
print('Class count after over-sampling:')
print(data_over.Class.value_counts())

# Plot the class counts
data_over.Class.value_counts().plot(kind='bar', title='Count of Classes', color=['b', 'r'])
```

```
Class count after over-sampling:
positive    181
negative    181
Name: Class, dtype: int64
```
Undersampling using Imblearn
Let’s look at undersampling using the imblearn package in Python. Similarly, we can perform oversampling using imblearn (a short sketch follows the undersampling code below). It might confuse you why we should use a separate library for undersampling and oversampling: sometimes undersampling (or oversampling) with imblearn and simple resampling produce different results, so we can also choose between them based on performance.
```python
from imblearn.under_sampling import RandomUnderSampler

# Prepare the data: features and target
X = data[['RI_real', 'Na_real', 'Mg_real', 'Al_real', 'Si_real',
          'K_real', 'Ca_real', 'Ba_real', 'Fe_real']]
y = data['Class']

# Fit to perform random undersampling
# (older imblearn releases used fit_sample with return_indices=True)
undersample = RandomUnderSampler()
X_rus, y_rus = undersample.fit_resample(X, y)
id_rus = undersample.sample_indices_  # indices of the rows that were kept

print('Count of class after undersampling')
print(pd.Series(y_rus).value_counts())

# Plot the class counts after undersampling with imblearn
pd.Series(y_rus).value_counts().plot(kind='bar', title='Count of Classes', color=['r', 'b'])
```

```
Count of class after undersampling
negative    33
positive    33
dtype: int64
```
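As mentioned above, oversampling with imblearn works the same way; here is a minimal sketch using imblearn’s RandomOverSampler (the oversampling counterpart of RandomUnderSampler), reusing the `X` and `y` prepared above:

```python
from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority-class examples until the classes balance
oversample = RandomOverSampler()
X_ros, y_ros = oversample.fit_resample(X, y)

print('Count of class after imblearn oversampling')
print(pd.Series(y_ros).value_counts())
```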
Tomek links
Tomek links is an undersampling algorithm based on a distance criterion, instead of removing data points at random. A Tomek link is a pair of points from opposite classes, one from the minority class and one from the majority class, that are each other’s nearest neighbors; for every such pair, the algorithm removes the majority-class point, cleaning up the boundary between the classes.
```python
from imblearn.under_sampling import TomekLinks

# Create the TomekLinks object, removing majority-class points in Tomek links
# (older imblearn releases used TomekLinks(return_indices=True, ratio='majority'))
tl = TomekLinks(sampling_strategy='majority')

# Fit on the data
X_tl, y_tl = tl.fit_resample(X, y)
id_tl = tl.sample_indices_  # indices of the rows that were kept

print('Count of class after Tomek-links undersampling')
print(pd.Series(y_tl).value_counts())

# Plot the class counts after Tomek-links undersampling
pd.Series(y_tl).value_counts().plot(kind='bar', title='Count of Classes', color=['r', 'b'])
```

```
Count of class after Tomek-links undersampling
negative    172
positive     33
dtype: int64
```
SMOTE
It stands for Synthetic Minority Oversampling Technique. Like random oversampling, it grows the minority class, but instead of duplicating existing points it creates new synthetic minority examples by interpolating between a minority point and one of its nearest minority-class neighbors.
```python
from imblearn.over_sampling import SMOTE

# Perform SMOTE oversampling on the minority class
# (older imblearn releases used SMOTE(ratio='minority'))
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X, y)

# Check and plot the class counts after SMOTE oversampling
print('Count of class after SMOTE oversampling')
print(pd.Series(y_sm).value_counts())
pd.Series(y_sm).value_counts().plot(kind='bar', title='Count of Classes', color=['b', 'r'])
```

```
Count of class after SMOTE oversampling
positive    181
negative    181
dtype: int64
```
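To make the interpolation behind SMOTE concrete, here is a toy sketch of the core step (illustrative only, not imblearn’s implementation; the two points are made-up values):

```python
import numpy as np

# A minority-class point and one of its nearest minority-class neighbors
x_i = np.array([1.5159, 12.88, 3.43])
x_nn = np.array([1.5176, 12.98, 3.54])

# A synthetic example lies at a random spot on the segment between them
lam = np.random.rand()            # random factor in [0, 1)
x_new = x_i + lam * (x_nn - x_i)  # new synthetic minority point
print(x_new)
```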
Conclusion
Hopefully you are now clear on the different techniques for handling imbalanced datasets in machine learning. There are many other methods for dealing with class imbalance, and whether to oversample or undersample depends on your dataset: if your dataset is huge, go for undersampling; otherwise, perform oversampling or use Tomek links.
Hope you enjoyed learning. Keep exploring new things in machine learning.