Pandas Optimization for Large Datasets

Introduction:

If you have worked on core machine learning applications such as churn prediction or classifying people into categories, you will definitely come across loading and looping over tons of data. Python provides an amazing library for handling panel data, i.e., multidimensional datasets: Pandas. Even so, it can be slow if you don't use the correct method to loop over larger datasets.

Therefore, in this blog we will look at how to optimize your Python code so that these loops do not take a long time.

We will explore the following Pandas methods in this blog:

  • Crude looping with indices
  • Looping with the iterrows() function available in Pandas
  • Looping with the apply() method of the DataFrame
  • Vectorization in Pandas

For this blog, we will use a churn modelling dataset which contains many variables such as SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, etc.

You can check out the dataset here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Loading the dataset using pandas

  import pandas as pd
  data = pd.read_csv('./Churn_Data.csv')
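
Before deriving anything, it can help to confirm what was loaded. Here is a minimal sanity check, assuming the CSV exposes the PAY_AMT5 and PAY_AMT6 columns used later in this post:

  ## quick look at the size of the dataset and the two payment columns we will use
  print(data.shape)
  print(data[['PAY_AMT5', 'PAY_AMT6']].head())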

Here we will try to optimize one part of a calculation, say for building a machine learning model, in which we have to derive a new variable that is the sum of the payments made in the last two months.

So, the payment function is defined as follows:

  def sum_payment(pay1, pay2):
      return pay1 + pay2
  • Crude Looping with indices

    In this section we will see how crude looping, i.e., iterating over the whole DataFrame with indices, performs. Our dataset has 30k data points. Let's create a new variable in the DataFrame for the payments of the last two months.

    def total_pay_last_two():
        pay_last_two = []
        ## looping over the length of the data
        for i in range(0, len(data)):
            ## adding the last 2 months' payments, i.e., PAY_AMT5 and PAY_AMT6
            s = sum_payment(data.iloc[i]['PAY_AMT5'], data.iloc[i]['PAY_AMT6'])
            pay_last_two.append(s)
        return pay_last_two

    In the above code, we are using indices to get values from the DataFrame. To see how long it takes, we will use the %%timeit magic command in a Jupyter notebook; outside a notebook, the timeit or time modules can be used as well (a minimal sketch with the time module follows the output below).

    %%timeit
    data['pay_last_two'] = total_pay_last_two()

    Output:

    14.6 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    This output says that one loop takes around 14 s, which is really quite high.
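
    As mentioned above, the same measurement can be taken outside a notebook with the standard time module. Here is a minimal sketch, assuming the data DataFrame and the total_pay_last_two() function defined earlier in this post:

    import time

    start = time.perf_counter()
    data['pay_last_two'] = total_pay_last_two()
    ## report the elapsed wall-clock time for a single run
    print(f"Crude loop took {time.perf_counter() - start:.2f} s")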

  • iterrows() method

    Now we will explore the iterrows() method of Pandas, which enhances the speed of execution. iterrows() is a generator that yields the index along with the current row of the DataFrame, and it is better optimized than indexing into the DataFrame on every iteration. Let's check it out with the help of an example.

    def total_pay_last_two():
        pay_last_two = []
        for i, row in data.iterrows():
            ## now we can access the row directly, without indexing into the DataFrame
            s = sum_payment(row['PAY_AMT5'], row['PAY_AMT6'])
            pay_last_two.append(s)
        return pay_last_two

    Let's see how much time this iterrows() version takes:

    %%timeit
    data['pay_last_two'] = total_pay_last_two()

    Output:

    3.99 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    The iterrows() version is clearly faster than crude looping through indices: about 4 times faster here.

  • apply() method

    Finally, we will explore the apply() method of the DataFrame. apply() applies a function along an axis of the DataFrame; with axis=1 it passes each row to the function, here an anonymous lambda function that fetches the two payment columns.

    %%timeit
    data['pay_last_two'] = data.apply(lambda row: sum_payment(row['PAY_AMT5'], row['PAY_AMT6']), axis=1)

    Output:

    1.51 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    The apply() method is faster than the iterrows() method, here by a factor of roughly 2.5 to 3. Hence, in many cases we should prefer the apply() method, which makes the program more efficient.

    There are faster options still, such as vectorization over Pandas Series, or using NumPy arrays for a fully vectorized implementation, as sketched below.
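
    As a rough sketch of what such a vectorized version could look like (no timings are shown, since they were not measured in this post), adding the two columns directly replaces the explicit loop entirely:

    ## vectorized over Pandas Series: the addition operates on whole columns at once
    data['pay_last_two'] = data['PAY_AMT5'] + data['PAY_AMT6']

    ## or, equivalently, on the underlying NumPy arrays
    data['pay_last_two'] = data['PAY_AMT5'].to_numpy() + data['PAY_AMT6'].to_numpy()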

Conclusion

Consolidating all of the above, we can say that for larger datasets we should use the optimized facilities Pandas provides. In general, the apply() method or a vectorized implementation is used for faster computation. Now you have added one new thing to your toolbox. Hope you enjoyed it.
