Pandas Optimization for Largest Datasets

If you have worked on core machine learning applications, such as churn prediction or classifying people for any industry, you will have come across loading and looping over large amounts of data. Python provides an excellent library for handling panel data (multidimensional data sets): Pandas. However, it performs poorly if you don't use an appropriate method to loop over large datasets. In this blog, a Python development services provider shares information on optimizing your Pandas code so that it doesn't take as much time.

This blog explores the following approaches in Pandas:

  • Crude Looping with indices
  • Looping with iterrows() function available in pandas
  • Looping with apply() method for dataframe
  • Vectorization in Pandas

This blog uses a churn modeling dataset that contains many variables, such as SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, etc.

You can check out the dataset here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Loading the dataset using pandas

    import pandas as pd
    data = pd.read_csv('./Churn_Data.csv')
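
If the CSV file is not at hand, a small synthetic frame can stand in for it while following along. The sketch below is an assumption for illustration only: the helper name make_sample_data, the row count, the seed, and the random values are hypothetical. Only the column names PAY_AMT5 and PAY_AMT6 come from the dataset used in this blog.

```python
import numpy as np
import pandas as pd


def make_sample_data(n_rows=30_000, seed=0):
    """Build a hypothetical stand-in frame with the two payment columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        # Random non-negative amounts; not real payment data.
        "PAY_AMT5": rng.integers(0, 50_000, size=n_rows),
        "PAY_AMT6": rng.integers(0, 50_000, size=n_rows),
    })


# Use this in place of the read_csv call when the file is unavailable.
data = make_sample_data()
print(data.shape)  # (30000, 2)
```

Timings measured on such a stand-in will differ from the ones reported in this blog, since the real file has more columns.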

We will try to optimize one step of the calculation needed for building a machine learning model. We have to derive a new variable that holds the sum of the payments in the last two months.

So, the payment function is defined as follows:

    def sum_payment(pay1, pay2):
        return pay1 + pay2

Crude Looping with indices

This section looks at crude looping, i.e., iterating over the whole dataframe with indices. Our dataset has 30k data points. Let’s create a new variable in the dataframe for the payment of the last two months.

    def total_pay_last_two():
        pay_last_two = []
        # looping over length of data
        for i in range(0, len(data)):
            # adding last 2 months payment, i.e., PAY_AMT5 and PAY_AMT6
            s = sum_payment(data.iloc[i]['PAY_AMT5'], data.iloc[i]['PAY_AMT6'])
            pay_last_two.append(s)
        return pay_last_two

In the above code, we use indices to get values from the data. Let’s see how long it takes. We will use the magic command %%timeit in the Jupyter notebook; otherwise, the timeit or time module can be used as well.

    %%timeit
    data['pay_last_two'] = total_pay_last_two()

Output:

14.6 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The output shows that a single run takes about 14.6 s on average, which is very slow for 30k rows.
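
Outside a notebook, a similar measurement can be sketched with the standard timeit module mentioned above. This is a minimal sketch under assumptions: the three-row frame is a hypothetical stand-in so the block runs on its own, and repeat=7 with number=1 is chosen to mirror the "7 runs, 1 loop each" setting shown in the output.

```python
import timeit

import pandas as pd

# Tiny hypothetical frame so the sketch is self-contained.
data = pd.DataFrame({"PAY_AMT5": [100, 200, 300], "PAY_AMT6": [10, 20, 30]})


def sum_payment(pay1, pay2):
    return pay1 + pay2


def total_pay_last_two():
    pay_last_two = []
    for i in range(len(data)):
        row = data.iloc[i]
        pay_last_two.append(sum_payment(row["PAY_AMT5"], row["PAY_AMT6"]))
    return pay_last_two


# Call the function once per run and repeat the run seven times.
timings = timeit.repeat(total_pay_last_two, repeat=7, number=1)
print(f"best of {len(timings)} runs: {min(timings):.6f} s")
```

Unlike %%timeit, timeit.repeat returns the raw timings, so the mean and standard deviation have to be computed separately if they are needed.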

iterrows() method

Now, we will explore the iterrows() method of Pandas, which executes faster than index-based looping. iterrows() is a generator that yields the index of each row along with the row itself, so we can loop over the dataframe directly. It is part of Pandas and designed for row-wise iteration over a dataframe. Let’s check it out with the help of an example.

    def total_pay_last_two():
        pay_last_two = []
        for i, row in data.iterrows():
            # Now we can access the data row without index
            s = sum_payment(row['PAY_AMT5'], row['PAY_AMT6'])
            pay_last_two.append(s)
        return pay_last_two

We will see how much time this iterrows() function takes:

    %%timeit
    data['pay_last_two'] = total_pay_last_two()

Output:

3.99 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The iterrows() function is faster than crude looping through indices: about 3.7 times faster here (14.6 s versus 3.99 s).
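
To make concrete what iterrows() hands back on each iteration, the short sketch below inspects the first yielded pair. The two-row frame is a hypothetical stand-in; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
data = pd.DataFrame({"PAY_AMT5": [100, 200], "PAY_AMT6": [10, 20]})

# Each item yielded by iterrows() is an (index, row) pair.
pairs = list(data.iterrows())
first_index, first_row = pairs[0]

# The row is a pandas Series keyed by column name.
print(type(first_row).__name__)  # Series
print(first_index, first_row["PAY_AMT5"], first_row["PAY_AMT6"])
```

Because a new Series is built for every row, iterrows() still carries per-row overhead, which is why the remaining approaches can be faster.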

apply() method

Finally, we will explore the apply() method of the dataframe. With axis=1, apply() passes each row of the dataframe to the given function, here an anonymous lambda function, and collects the results.

    %%timeit
    data['pay_last_two'] = data.apply(lambda row: sum_payment(row['PAY_AMT5'], row['PAY_AMT6']), axis=1)

Output:

1.51 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The apply() method is faster than the iterrows() method: about 2.6 times faster here (3.99 s versus 1.51 s). Hence, in many cases, it is preferable to use the apply() method over explicit loops, which makes the program more efficient.
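
The role of the axis argument in apply() can be sketched on a small example. The two-row frame below is a hypothetical stand-in; the point is only to show what the function receives in each case.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
data = pd.DataFrame({"PAY_AMT5": [100, 200], "PAY_AMT6": [10, 20]})

# axis=1: the function receives one row at a time, as a Series keyed by column name.
row_sums = data.apply(lambda row: row["PAY_AMT5"] + row["PAY_AMT6"], axis=1)

# axis=0 (the default): the function receives one column at a time.
col_sums = data.apply(lambda col: col.sum(), axis=0)

print(row_sums.tolist())  # [110, 220]
print(col_sums)
```

For the per-row payment sum used in this blog, axis=1 is the setting that matters.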

There are several other methods, such as Pandas Series vectorization, or we can use NumPy for a vectorized implementation.
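
A minimal sketch of both vectorized variants follows, assuming the same two payment columns. The three-row frame and the column name pay_last_two_np are hypothetical, and no timings for these variants are reported in this blog. Vectorized operations work on whole columns at once and so avoid the per-row Python function call.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame for illustration.
data = pd.DataFrame({"PAY_AMT5": [100, 200, 300], "PAY_AMT6": [10, 20, 30]})

# Pandas Series vectorization: add the two columns directly.
data["pay_last_two"] = data["PAY_AMT5"] + data["PAY_AMT6"]

# NumPy vectorization: operate on the underlying arrays.
data["pay_last_two_np"] = np.add(
    data["PAY_AMT5"].to_numpy(), data["PAY_AMT6"].to_numpy()
)

print(data)
```

Both columns hold the same values as the looped versions; measuring them with %%timeit on the full dataset would show how they compare on a given machine.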

Conclusion

Comparing the methods above, we can say that choosing an optimized way of working with Pandas matters for larger datasets. The apply() method or a vectorized implementation is usually used for quick calculations. Now you have a new item in your toolbox.
