Pandas Optimization for Large Datasets


If you have worked on core machine learning applications such as churn prediction or classifying people into categories, you will have come across loading and looping over tons of data. Python provides an amazing library for handling panel data, i.e., multidimensional datasets: Pandas. Still, it can be slow if you don't use the correct method to loop over larger datasets. Therefore, in this blog we will look at how to optimize your Python code so it does not take a long time.

In this blog we will explore the following Pandas looping methods:

  • Crude Looping with indices
  • Looping with iterrows() function available in pandas
  • Looping with the apply() method for dataframes
  • Vectorization in Pandas

This blog uses a churn-modeling dataset (default of credit card clients) that contains many variables such as SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, etc.

You can check out the dataset here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Loading the dataset using pandas

  • import pandas as pd
  • data = pd.read_csv('./Churn_Data.csv')

We will try to optimize one part of the calculation needed for building a machine learning model: deriving a new variable that is the sum of the payments over the last two months.

So, the payment function is defined as follows:

    def sum_payment(pay1, pay2):
        return pay1 + pay2
  • Crude Looping with indices

    In this section we will see how crude looping, i.e., iterating with indices over the whole dataframe, performs. Our dataset has 30k data points. Let's create a new variable in the dataframe for the payments of the last two months.

    def total_pay_last_two():
        pay_last_two = []
        ## looping over the length of the data
        for i in range(0, len(data)):
            ## adding the last 2 months' payments, i.e., PAY_AMT5 and PAY_AMT6
            s = sum_payment(data.iloc[i]['PAY_AMT5'], data.iloc[i]['PAY_AMT6'])
            pay_last_two.append(s)
        return pay_last_two

    In the above code, we use indices to get values from the data. Let's see how long it takes; we will use the magic command %%timeit in a Jupyter notebook, though the timeit or time modules can be used as well.

    %%timeit
    data['pay_last_two'] = total_pay_last_two()

    Output:

    14.6 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    This output says that one loop takes about 14 seconds, which is very slow.
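For readers running outside Jupyter, the standard-library timeit module mentioned above can take the same measurement. Here is a minimal, self-contained sketch; the toy dataframe stands in for the real Churn_Data.csv (its values are made up for illustration):

```python
import timeit

import pandas as pd

# Toy stand-in for the real Churn_Data.csv frame (hypothetical values)
data = pd.DataFrame({'PAY_AMT5': [100, 200, 300],
                     'PAY_AMT6': [10, 20, 30]})

def sum_payment(pay1, pay2):
    return pay1 + pay2

def total_pay_last_two():
    pay_last_two = []
    for i in range(0, len(data)):
        pay_last_two.append(sum_payment(data.iloc[i]['PAY_AMT5'],
                                        data.iloc[i]['PAY_AMT6']))
    return pay_last_two

# timeit.timeit calls the function `number` times and returns the total seconds
elapsed = timeit.timeit(total_pay_last_two, number=10)
print(f"10 runs took {elapsed:.4f} s")
```

Unlike %%timeit, timeit.timeit returns the elapsed time as a float, so it can be logged or compared programmatically.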

  • iterrows() method

    Now we will explore the iterrows() method of Pandas, which improves the speed of execution. iterrows() is a generator that yields the index along with the current row of the dataframe, and it is optimized for working with pandas. Let's check it out with the help of an example.

    def total_pay_last_two():
        pay_last_two = []
        for i, row in data.iterrows():
            ## now we can access the row's values directly, without indexing into the dataframe
            s = sum_payment(row['PAY_AMT5'], row['PAY_AMT6'])
            pay_last_two.append(s)
        return pay_last_two

    We will see how much time this iterrows() function takes:

    %%timeit
    data['pay_last_two'] = total_pay_last_two()

    Output:

    3.99 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    The iterrows() function is faster than crude looping through indices: here, around 3.5 times faster.

  • apply() method

    Finally, we will explore the apply() method of the dataframe. apply() traverses the dataframe row by row (with axis=1), and here an anonymous lambda function passes each row's values to sum_payment().

    %%timeit
    data['pay_last_two'] = data.apply(lambda row: sum_payment(row['PAY_AMT5'], row['PAY_AMT6']), axis=1)

    Output:

    1.51 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    The apply() method is faster than the iterrows() method: here, about 2.5 times faster. Hence, in many cases, using the apply() method makes the program more efficient.

    There are several other methods, such as Pandas Series vectorization, or we can use NumPy for a vectorized implementation.
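As a sketch of what that vectorization looks like (again with a toy dataframe standing in for the real dataset), adding the two Series directly applies the + element-wise in compiled code with no Python-level loop, and operating on the underlying NumPy arrays via .values skips index alignment as well:

```python
import pandas as pd

# Toy stand-in for the real Churn_Data.csv frame (hypothetical values)
data = pd.DataFrame({'PAY_AMT5': [100, 200, 300],
                     'PAY_AMT6': [10, 20, 30]})

# Pandas Series vectorization: element-wise addition, no explicit loop
data['pay_last_two'] = data['PAY_AMT5'] + data['PAY_AMT6']

# NumPy vectorization: add the raw arrays, bypassing index alignment
data['pay_last_two'] = data['PAY_AMT5'].values + data['PAY_AMT6'].values

print(data['pay_last_two'].tolist())  # → [110, 220, 330]
```

On a dataframe of this blog's size (30k rows), vectorized versions of such a calculation typically run in milliseconds rather than seconds.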

Conclusion

Comparing all the methods above, we can see that Pandas handles larger datasets efficiently when we use its optimized methods. Most commonly, the apply() method or a vectorized implementation is used for quick calculations. Now you have added one new tool to your toolbox.
