If you have worked on core machine learning applications, such as churn prediction or classifying customers, you will have come across loading and looping over large amounts of data. Python provides an excellent library for handling panel data, i.e., multidimensional data sets: pandas. It can be slow, however, if you don't use an appropriate method to loop over large datasets. In this blog, a Python development services provider shares information on optimizing your Python code so that it doesn't take much time.
This blog explores the following methods in pandas:
- Crude Looping with indices
- Looping with iterrows() function available in pandas
- Looping with the apply() method for dataframes
- Vectorization in Pandas
This blog uses a churn modeling dataset that contains many variables, such as SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, etc.
You can check out the dataset here: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
Loading the dataset using pandas
- import pandas as pd
- data = pd.read_csv('./Churn_Data.csv')
We will try to optimize one part of the calculation needed for building a machine learning model: deriving a new variable that holds the sum of the payments in the last two months.
So, the payment function is defined as follows:
- def sum_payment(pay1, pay2):
-     return pay1 + pay2
Crude Looping with indices
This section shows crude looping, i.e., iterating over the whole dataframe with indices. Our dataset has 30k data points. Let's create a new variable in the dataframe for the payment of the last two months.
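The original listing for this step is not reproduced here, so the following is only a minimal sketch of what index-based looping could look like. The three-row stand-in dataframe and the column names PAY_AMT1, PAY_AMT2, and PAY_SUM are assumptions made for illustration, not names taken from the blog's code.

```python
import pandas as pd


def sum_payment(pay1, pay2):
    return pay1 + pay2


# Small stand-in for the real dataset; the column names are assumed.
data = pd.DataFrame({"PAY_AMT1": [100, 0, 250], "PAY_AMT2": [50, 75, 0]})

# Crude looping: visit every row by its integer position.
totals = []
for i in range(len(data)):
    pay1 = data["PAY_AMT1"].iloc[i]
    pay2 = data["PAY_AMT2"].iloc[i]
    totals.append(sum_payment(pay1, pay2))

# Store the result as a new column (hypothetical name).
data["PAY_SUM"] = totals
print(data)
```

Every iteration performs two separate lookups into the dataframe, which is where most of the time goes on a large dataset.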
This approach reads each value from the data by its index. To see how long it takes, we use the %%timeit magic command in a Jupyter notebook; outside a notebook, the timeit or time module can be used as well.
14.6 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This output says that one pass over the whole dataframe takes about 14.6 s, which is very slow for 30k rows.
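Outside a notebook, a similar measurement can be sketched with the standard timeit module. The 1,000-row stand-in dataframe, the column names, and the helper crude_loop are assumptions for illustration, so the timings it prints will not match the figures quoted in this blog.

```python
import timeit

import pandas as pd


def sum_payment(pay1, pay2):
    return pay1 + pay2


# Stand-in data; the real dataset has about 30k rows.
data = pd.DataFrame({"PAY_AMT1": range(1000), "PAY_AMT2": range(1000)})


def crude_loop():
    # Hypothetical helper wrapping the index-based loop being timed.
    return [
        sum_payment(data["PAY_AMT1"].iloc[i], data["PAY_AMT2"].iloc[i])
        for i in range(len(data))
    ]


# Run the whole loop once per measurement and repeat three times.
runs = timeit.repeat(crude_loop, number=1, repeat=3)
print(f"best of {len(runs)} runs: {min(runs):.4f} s")
```

Taking the best of several runs reduces the influence of other processes on the measurement.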
Looping with iterrows()
Now, we will explore the iterrows() method of pandas, which speeds up execution. iterrows() is a generator that yields the index of each row along with the row itself, so there is no need to look up values by position inside the loop. Let's check it out with the help of an example.
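As the original example is not shown, here is a minimal sketch of the same calculation with iterrows(). The stand-in dataframe and the column names PAY_AMT1, PAY_AMT2, and PAY_SUM are again assumptions.

```python
import pandas as pd


def sum_payment(pay1, pay2):
    return pay1 + pay2


# Small stand-in for the real dataset; the column names are assumed.
data = pd.DataFrame({"PAY_AMT1": [100, 0, 250], "PAY_AMT2": [50, 75, 0]})

# iterrows() yields (index, row) pairs, where row is a pandas Series.
totals = []
for index, row in data.iterrows():
    totals.append(sum_payment(row["PAY_AMT1"], row["PAY_AMT2"]))

data["PAY_SUM"] = totals
print(data)
```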
We will see how much time this iterrows() function takes:
3.99 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The iterrows() method is faster than crude looping through indices: in this run it is around 3.7 times faster (3.99 s versus 14.6 s).
apply() method
Finally, we will explore the apply() method of the dataframe. With axis=1, apply() passes each row of the dataframe to a function, here an anonymous lambda function that reads the values it needs from the row.
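A minimal sketch of the apply() version, assuming the same stand-in dataframe and the hypothetical column names PAY_AMT1, PAY_AMT2, and PAY_SUM, could look like this:

```python
import pandas as pd


def sum_payment(pay1, pay2):
    return pay1 + pay2


# Small stand-in for the real dataset; the column names are assumed.
data = pd.DataFrame({"PAY_AMT1": [100, 0, 250], "PAY_AMT2": [50, 75, 0]})

# axis=1 tells apply() to call the lambda once per row.
data["PAY_SUM"] = data.apply(
    lambda row: sum_payment(row["PAY_AMT1"], row["PAY_AMT2"]), axis=1
)
print(data)
```

The explicit Python loop disappears, although the lambda is still called once for every row.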
1.51 s ± 57.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The apply() method is faster than the iterrows() method: here it is around 2.6 times faster (1.51 s versus 3.99 s). Hence, in many cases, apply() is the better choice and makes the program more efficient.
There are several other methods, such as pandas Series vectorization, or we can use NumPy for a vectorized implementation.
Putting all of the above together, we can say that choosing an optimized looping method matters when using pandas on larger datasets. The apply() method or a vectorized implementation is usually used for quick calculations. Now you have a new item to add to your toolbox.
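This blog gives no listing or timing for vectorization, so the following is only a sketch of the idea: the function is applied to whole columns at once instead of row by row. The stand-in dataframe and the column names PAY_AMT1, PAY_AMT2, PAY_SUM, and PAY_SUM_NP are assumptions.

```python
import numpy as np
import pandas as pd


def sum_payment(pay1, pay2):
    return pay1 + pay2


# Small stand-in for the real dataset; the column names are assumed.
data = pd.DataFrame({"PAY_AMT1": [100, 0, 250], "PAY_AMT2": [50, 75, 0]})

# pandas Series vectorization: pass whole columns to the function.
data["PAY_SUM"] = sum_payment(data["PAY_AMT1"], data["PAY_AMT2"])

# NumPy vectorization: work on the underlying arrays directly.
data["PAY_SUM_NP"] = np.add(
    data["PAY_AMT1"].to_numpy(), data["PAY_AMT2"].to_numpy()
)
print(data)
```

Because the addition runs once over entire columns rather than once per row, this style is generally expected to be faster than any of the row-wise loops.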