Reduce time-complexity using Joblib in Python

Joblib in Python

Are you a machine learning enthusiast? Or a nerd who always worries about time and space optimization in your code? Either way, this blog is here to make developers wiser while coding.

INTRODUCTION

With the advancement of the AI era, many new machine learning algorithms and optimization techniques are being invented in cut-throat competition for the best time and space complexity. Keeping this in mind, let me introduce a not very old yet very powerful Python library, Joblib. Joblib has brought a breakthrough in various Python activities, such as loading large NumPy arrays, serializing and persisting Python objects, and speeding up the Python functions you build. It does so with the help of parallel computing, memoization (not a typo; there is no ‘r’) and a caching mechanism, along with multiprocessing, loky (the default) and threading backends.

This blog will give you shivers and shrieks as you are awed by the performance of Joblib. Let’s cut to the chase. We will cross the following milestones.

  • Dive into Joblib

    Features which made Joblib an Avenger

    How to get started?

    Main Conceptual Features of Joblib

    Joblib functional areas of optimization in Python

  • Implementations of Joblib

  • Conclusion

DIVE INTO Joblib

Joblib is a library built purely in Python by scikit-learn developers. It focuses entirely on optimizing Python-based persistence and functions. It is a fantastic library that became popular for its time-saving optimizations and is especially good at handling large data. It provides lightweight pipelining in Python.

Problem: We face many challenges when dealing with large data, such as the huge time and space consumed by computationally intensive functions, or by persisting and then loading huge data as a pickle.

Solution: Joblib

Features which made Joblib an Avenger in reducing time-complexity:

  • 1. Fast disk caching and lazy evaluation using a hashing technique
  • 2. Capable of distributing jobs (parallelization) using the Parallel helper
  • 3. Compression during persistence of objects containing large data
  • 4. Best known for handling large data
  • 5. Specific optimizations for handling large NumPy arrays
  • 6. Memoization: a function called with the same arguments is not recomputed; instead, its output is loaded back from the cache, with memory mapping available for large arrays
  • 7. Cherry on the cake: no dependencies (except Python itself)

Later, we will look at practical examples of the above features one by one. Stay tuned!

How to get started?

You can install Joblib using pip as follows:

pip install joblib

Main Conceptual Features of Joblib

Parallel Computing:

  • 1. Parallel class

    Concurrency is controlled by the n_jobs argument, which sets the number of concurrent workers that the OS may run at the same time. It usually corresponds to the number of CPU (processor) cores, but the right value depends on the task. For a task that is I/O-intensive rather than processor-intensive, the number of workers can be higher than the number of cores.

    class joblib.Parallel(n_jobs=None, backend=None, verbose=0, timeout=None, pre_dispatch='2 * n_jobs', batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r', prefer=None, require=None)

    You can also explore the “backend” argument, which gives you options such as multiprocessing and multithreading. For more info, check out the documentation.

  • 2. delayed decorator

    delayed is a decorator that captures a function call without running it: using ordinary function-call syntax, it returns a tuple of the function, its positional arguments, and its keyword arguments.

    joblib.delayed(function, check_pickle=None)
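
As a rough sketch of how these two pieces fit together, the snippet below first shows that delayed only packs a call into a tuple, then hands a generator of such tuples to Parallel. The use of math.sqrt, n_jobs=2, and the prefer="threads" variant are illustrative choices, not something prescribed by this article.

```python
from math import sqrt

from joblib import Parallel, delayed

# delayed(f)(args) does not call f; it packs (function, args, kwargs) into a tuple.
task = delayed(sqrt)(16)
print(task)

# Parallel consumes an iterable of such tuples and runs them on n_jobs workers.
# With no backend given, the default process-based backend (loky) is used.
roots = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(5))
print(roots)  # [0.0, 1.0, 2.0, 3.0, 4.0]

# prefer="threads" is a soft hint to use the threading backend instead,
# which tends to suit I/O-bound work or code that releases the GIL.
roots_threads = Parallel(n_jobs=2, prefer="threads")(
    delayed(sqrt)(i ** 2) for i in range(5)
)
print(roots_threads == roots)
```

Results come back in the same order as the inputs, so the parallel call can usually stand in for a list comprehension.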

Caching (Memoization)

  • 1. Memory class

    Lazy evaluation of a Python function, in simple terms, means that code assigned to a variable executes only when its result is needed by other computations. Caching the result of a function to avoid recomputing it is termed memoization.

    class joblib.memory.Memory(location=None, backend='local', cachedir=None, mmap_mode=None, compress=False, verbose=1, bytes_limit=None, backend_options={})

    It also avoids rerunning a function with the same arguments. The Memory class stores results on disk and, when the function is called again with the same arguments, loads the cached output back. Hashing of the inputs is used to check whether the output has already been computed: if not, it is computed and stored; otherwise the cached value is loaded. It is designed in particular for large NumPy arrays.

    The output is saved as a pickle file in the cache directory.

  • 2. Memory.cache()

    Memory.cache() wraps a function in a callable object that stashes the function’s return value each time it is called with new arguments.
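
The caching behaviour described above can be sketched as follows. This is a minimal illustration under assumptions: the temporary cache directory, the slow_square function, and the calls list used to record real executions are all hypothetical names introduced for the example.

```python
import shutil
import tempfile

from joblib import Memory

# Hypothetical cache location: a throwaway temporary directory.
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, verbose=0)

calls = []  # records every time the function body really runs


def slow_square(x):
    calls.append(x)
    return x * x


# Wrap the function; results are stored on disk, keyed by a hash of the arguments.
cached_square = memory.cache(slow_square)

first = cached_square(4)   # computed, then written to the cache
second = cached_square(4)  # same argument: loaded from the cache, body not run
third = cached_square(5)   # new argument: computed again

print(first, second, third)
print(calls)  # [4, 5]

# Clean up the cached results and the temporary directory.
memory.clear(warn=False)
shutil.rmtree(cache_dir, ignore_errors=True)
```

The calls list holds only two entries because the second call never reached the function body.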

Data Persistence

Joblib helps in persisting any data structure or your machine learning model. It has proved to be a better replacement for pickle, from Python’s standard library, particularly for objects containing large data. Unlike pickle, which works only with file objects, Joblib’s dump and load also accept filenames.

The breakthrough is in optimizing space complexity during pickling, which is achieved through Joblib’s compression techniques that save the persisted object in compressed form. Joblib compresses the data before saving it to disk. Compression extensions such as .gz and .z each have their own compression method. For more info, visit the following link: http://gael-varoquaux.info/programming/new_low-overhead_persistence_in_joblib_for_big_data.html

Joblib AREAS OF OPTIMIZATION IN Python

[Figure: Joblib areas of optimization in Python]

IMPLEMENTATIONS of Joblib

  • 1. Run over loops (Embarrassingly parallel for loops)
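
    A minimal sketch of turning a plain loop into a parallel one is given here. The slow_double function and its 0.1 s sleep are hypothetical stand-ins for real work, and the threading hint is chosen only because sleeping behaves like I/O-bound work; CPU-bound loops would normally use the default process-based backend.

```python
import time

from joblib import Parallel, delayed


def slow_double(x):
    time.sleep(0.1)  # stand-in for an independent, slow unit of work
    return 2 * x


inputs = list(range(8))

# Plain sequential loop: each iteration waits for the previous one.
start = time.perf_counter()
sequential = [slow_double(x) for x in inputs]
sequential_time = time.perf_counter() - start

# Same loop with joblib: iterations are independent, so they can overlap.
start = time.perf_counter()
parallel = Parallel(n_jobs=4, prefer="threads")(
    delayed(slow_double)(x) for x in inputs
)
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f} s")
print(f"parallel:   {parallel_time:.2f} s")
print(parallel == sequential)
```

    The loop qualifies as embarrassingly parallel because no iteration depends on the result of another.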

  • 2. Reload large NumPy arrays (Memoize pattern)

    As mentioned previously, memoizing means simply loading the output from the cache when a function is called again with the same arguments.

    The Memory class is also used with NumPy, for long NumPy calculations or for loading large NumPy arrays. This can be achieved using mmap_mode (memory mapping) or just the decorator.

    Memory-mapping mode helps when reloading large NumPy arrays by speeding up loading from the cache. The memory.cache decorator can also be used.
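
    The memory-mapping variant might look like the sketch below. It assumes a temporary cache directory and a small np.vander input chosen purely for illustration; real gains would show up with much larger arrays.

```python
import shutil
import tempfile

import numpy as np
from joblib import Memory

# Hypothetical cache location; mmap_mode="r" asks for cached arrays to be
# opened as read-only memory maps instead of being read fully into memory.
cache_dir = tempfile.mkdtemp()
memory = Memory(location=cache_dir, mmap_mode="r", verbose=0)

square = memory.cache(np.square)
a = np.vander(np.arange(3)).astype(float)

first = square(a)   # computed and written to the cache directory
second = square(a)  # same input: result comes back from the cache

print(type(second))
matches = bool(np.array_equal(second, a ** 2))
print(matches)  # True

# Best-effort cleanup of the temporary cache directory.
shutil.rmtree(cache_dir, ignore_errors=True)
```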

    The square function is called again with the same argument; this time it follows the memoize technique with mmap_mode (memory mapping), which relies on hashing to look up the cached result quickly.

    With the decorator, when fun1 is called again with the same arguments, it follows the memoize pattern.
  • 3. Python function (Memoize (caching) + Parallelism)

    Caching: While working with a custom Python function, as demonstrated below, the first call took 5.01 s.

    When the same function is called again, it takes about 0 s, since the output is loaded from the cache.
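
    The timing difference can be reproduced with a sketch along these lines. The slow_sum function, its one-second sleep, and the temporary cache directory are assumptions made for the example and are not the function measured above.

```python
import shutil
import tempfile
import time

from joblib import Memory

cache_dir = tempfile.mkdtemp()  # hypothetical cache location
memory = Memory(location=cache_dir, verbose=0)


@memory.cache
def slow_sum(a, b):
    time.sleep(1)  # stand-in for an expensive computation
    return a + b


# First call: the body runs, so the sleep is paid in full.
start = time.perf_counter()
result_1 = slow_sum(2, 3)
first_call = time.perf_counter() - start

# Second call with the same arguments: the result is read from the cache.
start = time.perf_counter()
result_2 = slow_sum(2, 3)
second_call = time.perf_counter() - start

print(f"first call:  {first_call:.2f} s")
print(f"second call: {second_call:.2f} s")

shutil.rmtree(cache_dir, ignore_errors=True)
```

    The second call still has to hash the arguments and read a small file from disk, so it is fast rather than literally free.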

  • 4. Serialization and Persistence

    With Joblib you can persist to filenames or even file objects. The Python object can be any data structure, or even your machine learning or deep learning model. Let’s look at the ways to dump and load objects.

    - Normal persisting and loading of a list object
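
    A basic round trip might look like the sketch below; the list contents and the file name data.pkl are made up for illustration.

```python
import os
import tempfile

import joblib

data = [("a", [1, 2, 3]), ("b", {"x": 1.5})]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.pkl")  # hypothetical file name
    written = joblib.dump(data, path)  # returns the list of file names written
    restored = joblib.load(path)

print(written)
print(restored == data)  # True
```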

    - The persisted file is compressed using the compress argument, which saves space and indirectly affects the time taken to load the object.

    - In the example below, a ‘.z’ compressed file is dumped.

    - In the example below, a .gz compressed file is dumped, using the gzip compression method with a compression level of 3.

    - As you can see below, the storage size differs between the various forms of pickle files.
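
    The compression variants listed here can be compared in one short sketch. The sample data (a zero-filled array plus a repeated label list) and the file names are assumptions picked so that the data compresses well; real savings depend on the object being stored.

```python
import os
import tempfile

import joblib
import numpy as np

# Highly compressible sample data, chosen only for illustration.
data = {"zeros": np.zeros(200_000), "labels": ["a", "b", "c"] * 1000}

# File name -> extra keyword arguments passed to joblib.dump.
variants = {
    "raw.pkl": {},                            # no compression
    "level3.pkl": {"compress": 3},            # default method, level 3
    "auto.pkl.z": {},                         # method picked from the .z extension
    "gzip.pkl.gz": {"compress": ("gzip", 3)}, # explicit gzip, level 3
}

sizes = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, kwargs in variants.items():
        path = os.path.join(tmp_dir, name)
        joblib.dump(data, path, **kwargs)
        sizes[name] = os.path.getsize(path)
    # Loading works the same way whichever compression was used.
    restored = joblib.load(os.path.join(tmp_dir, "gzip.pkl.gz"))

for name, size in sizes.items():
    print(f"{name:12s} {size:>10,d} bytes")
```

    Higher compression levels usually trade smaller files for slower dumps and loads, so a moderate level such as 3 is a common compromise.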

CONCLUSION:

Every beginning has an end. Well, that’s the cycle of nature! Likewise, our blog has come to its end. We have seen how Joblib is a lifesaver when handling huge data, which could otherwise take a lot of space and time. The blog has described this lightweight pipelining library, which is capable of optimizing both time and space. Features like parallelism, memoization and caching, and file compression make it stand out among ML/AI libraries. In machine learning, a huge model pickle file no longer needs to consume a lot of space, and the same file can load more quickly. But life isn’t fair every time, right? Jokes apart!

Joblib may not be any quicker when only a small amount of data is involved. But, above all, it is recommended over the pickle library for persisting objects that hold large data, and it is worth considering when you need to run tasks in parallel.

 