Python Performance improvements

Python is a widely used programming language with a diverse range of libraries and frameworks, making it attractive to both startups and established enterprises. It offers strong solutions for mobile and web application development, and it is the language of choice for AI/ML development, where most new code is written in Python. The language was designed to be interpreted: Python code is compiled to bytecode and executed by the CPython runtime.

Python’s designers originally intended for the most critical components of an application to be written in a compiled language such as C, making the app fast and memory-efficient, with Python acting as glue to combine those binary modules while remaining an interpreted language suited to interactive development. Many popular Python libraries, such as SciPy, NumPy, and TensorFlow, are built using this approach. Writing code in C is, however, not an option for most modern business applications, and this “slow glue” model made Python less desirable in many situations.

Limitations of Python

One of the best-known limitations of the CPython implementation is the Global Interpreter Lock (GIL), which constrains the performance of Python programs. Although the standard library provides a threading module, the GIL ensures that only one thread executes Python bytecode at a time. As a result, Python is one of the few programming languages in which multithreaded code can make an application run slower than its single-threaded equivalent. The GIL exists to keep CPython’s memory management (reference counting) safe without fine-grained locking, but it has a devastating impact on the performance of CPU-bound multithreaded code.

Consider the following code snippet from Stack Overflow: its single-threaded version has been observed to run faster than its multithreaded equivalent. The single-threaded code generates an array of 100,000 integers between zero and one hundred and then calculates the distance of each integer from the middle value of 50.
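The original Stack Overflow snippet is not reproduced here; a minimal single-threaded sketch of the computation described above (the function name `simple_abs_range` is borrowed from the later discussion) might look like:

```python
import random

def simple_abs_range(values):
    """Distance of each value from the midpoint 50."""
    return [abs(v - 50) for v in values]

# 100,000 random integers between 0 and 100
data = [random.randint(0, 100) for _ in range(100_000)]
distances = simple_abs_range(data)
```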

Its multithreaded equivalent divides the array into 100-element chunks and sends each to a worker thread; in a compiled language such as C, this approach would make the code significantly faster.

Unfortunately, Python’s threading module and its higher-level wrapper, concurrent.futures, make it run slower, taking roughly twice as long to finish. David Beazley described this at PyCon 2010: Python allows multiple cooks into the kitchen, but each must wait for the same utensils to be handed over one by one, negating all the benefits of multithreading.
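A sketch of the multithreaded variant, using `concurrent.futures` and 100-element chunks as described, shows the pattern that ends up slower under the GIL (the `simple_abs_range` helper is the hypothetical single-threaded function from above):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simple_abs_range(values):
    return [abs(v - 50) for v in values]

def threaded_abs_range(values, chunk_size=100):
    """Split into 100-element chunks and map each chunk to a thread.
    CPU-bound work like this is serialized by the GIL, so this version
    is typically slower than the plain single-threaded loop."""
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(simple_abs_range, chunks)  # order preserved
    return [d for chunk in results for d in chunk]

data = [random.randint(0, 100) for _ in range(100_000)]
```

Both versions produce identical results; only the (worse) wall-clock time differs.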

Python Performance Solutions

The Global Interpreter Lock (GIL) doesn’t always slow things down. Input/output (I/O) operations can be performed in parallel, because Python releases the GIL while waiting on I/O. For example, if instead of calculating the mean distance the simple_abs_range() function executed an ‘HTTP GET’ or ‘POST’ to a URL, multithreading would genuinely speed the operation up.
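The I/O case can be sketched as follows; `time.sleep` stands in for real network latency (sleeping releases the GIL exactly as socket I/O would), and `fake_http_get` and the example URLs are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_http_get(url):
    """Stand-in for an HTTP GET: sleeping releases the GIL,
    just as waiting on a real socket would."""
    time.sleep(0.1)
    return f"response from {url}"

urls = [f"https://example.com/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(fake_http_get, urls))
elapsed = time.perf_counter() - start
# The eight 0.1 s "requests" overlap, so the batch finishes in
# roughly 0.1 s rather than 0.8 s.
```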

Over the years, Python’s designers have developed various solutions to address these performance issues. One such solution is the asyncio library, which enables cooperative multitasking so that I/O-bound work can overlap on a single thread, compensating for the lack of true parallel threading. Other approaches include replacing CPython with an implementation that supports JIT compiling (PyPy) or compiling specific functions into machine code for quicker execution (Numba). However, none of these approaches optimizes every line of Python code, and significant development work is required to utilize these technologies effectively.
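A minimal sketch of the cooperative-multitasking style (assuming asyncio is the library meant here; `fetch` is a hypothetical coroutine, with `asyncio.sleep` standing in for awaiting a network response):

```python
import asyncio

async def fetch(i):
    # await yields control to the event loop, so other coroutines
    # run while this one "waits on I/O"
    await asyncio.sleep(0.1)
    return i * 2

async def main():
    # Cooperative multitasking: all three awaits overlap on one thread
    return await asyncio.gather(fetch(1), fetch(2), fetch(3))

results = asyncio.run(main())
```

The three 0.1-second waits overlap, so the whole batch completes in roughly 0.1 seconds instead of 0.3.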

PyPy, for example, does not use CPython at all. C extensions built against the CPython API sometimes do not work, so for performance PyPy provides its own C foreign-function support (cffi), and library authors have to use it to take full advantage of PyPy’s JIT compilation. The PyPy team has tested Python code up to version 3.9 and found it highly compatible with PyPy.

Similar constraints apply to Numba. It can speed up certain Python functions and execute them in parallel via decorators. For example, a logistic regression function decorated this way is computed in parallel to generate the output vector.
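The article’s original snippet is not reproduced here; as a simplified stand-in (an element-wise logistic function rather than full regression), a Numba-accelerated function might look like the following. The decorator is shown commented out so the sketch also runs without Numba installed:

```python
import math

# With Numba installed, uncommenting the decorator compiles this
# function to machine code; parallel=True lets Numba parallelize
# supported loops across CPU cores.
# from numba import njit
# @njit(parallel=True)
def logistic(xs):
    """Element-wise logistic (sigmoid) of a sequence of floats."""
    out = [0.0] * len(xs)
    for i in range(len(xs)):
        out[i] = 1.0 / (1.0 + math.exp(-xs[i]))
    return out
```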

However, to parallelize most real-world operations, Numba’s special prange operator has to be used in place of the Python range object, and the code has to be structured to take advantage of Numba acceleration.
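A sketch of the loop structure discussed below, with the Numba-specific pieces indicated in comments (the function name `pairwise_distance` and the example computation are illustrative, not from the original article):

```python
import math

# Under Numba: from numba import njit, prange; decorate with
# @njit(parallel=True) and change the outer `range` to `prange`.
def pairwise_distance(X):
    """Accumulated squared distance of each element to all others."""
    n = len(X)
    acc = [0.0] * n
    for i in range(n):          # outer loop: prange under Numba
        s = 0.0
        for j in range(n):      # inner loop stays sequential per i
            s += (X[i] - X[j]) ** 2
        acc[i] = math.sqrt(s)
    return acc
```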

The code above can run in parallel across all elements of the vector by using outer and inner loops (i and j) to compute every value of the return vector acc. Not every part of the code executes in parallel: Numba goes through an interesting optimization process to create C kernels that can be executed concurrently. The key point is that Numba does not make every piece of Python code fast, but it does let developers write high-performance, GPU-accelerated code without directly writing CUDA kernels in C/C++.

The Faster Python

Guido van Rossum, Python’s creator and former benevolent dictator for life, announced the Python Performance Improvement Project in 2021. The project aims to make the CPython implementation five times faster over the next four years, benefiting most Python code with performance optimizations. Specific areas for additional performance improvements were also identified.

Python 3.9 included performance improvements such as:

  • faster module initialization
  • more efficient handling of object attributes

Python 3.10, released in October 2021, included several performance enhancements, including:

  • faster method calls
  • more efficient memory management

The latest release, Python 3.12, includes several performance improvements, the major one being the introduction of the per-interpreter GIL (Global Interpreter Lock). This new feature allows multiple interpreters to run within a single process, each with its own GIL, so they can execute in parallel on separate CPU cores, and each interpreter can be passed its own context information. This is particularly useful for writing multi-processing-style data loaders for LLM (Large Language Model) training, especially on server machines that host 4-8 expensive GPU cards and 1-2 CPUs, each with 20-32 cores for handling CPU load. The per-interpreter GIL is designed to leverage all CPU cores.

The following code block sketches how multiple interpreters on a 12-core CPU can parallelize a Python operation.
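The original code block is not reproduced here. In Python 3.12 the per-interpreter GIL is exposed only through the C API and the private `_xxsubinterpreters` module (a public `interpreters` API is planned for later releases), so the following is a rough sketch that may need adjustment; the function name and the per-chunk workload are illustrative:

```python
import textwrap
import threading

try:
    # Private, provisional module exposing sub-interpreters in CPython 3.12
    import _xxsubinterpreters as interpreters
except ImportError:
    interpreters = None  # not available on this Python build

def run_in_subinterpreters(n=12):
    """Run one chunk of work per sub-interpreter, one thread each.
    Each interpreter has its own GIL, so threads can use separate cores."""
    if interpreters is None:
        return None
    script = textwrap.dedent("""
        total = sum(abs(x - 50) for x in range(100))
    """)
    interps = [interpreters.create() for _ in range(n)]
    threads = [
        threading.Thread(target=interpreters.run_string, args=(i, script))
        for i in interps
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for i in interps:
        interpreters.destroy(i)
    return n
```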

Simply splitting the requests across twelve queues lets us spawn 12 interpreters, each assigned one core and running a single thread. Even though the GIL is not disabled, it becomes irrelevant, as each interpreter runs its code on its own CPU core under its own GIL.

Other planned improvements

The Python core team has accepted a proposal to make the Global Interpreter Lock (GIL) optional in the CPython runtime, as the underlying runtime has become robust enough to no longer strictly require GIL locking. The proposal, PEP 703, “Making the Global Interpreter Lock Optional in CPython,” is expected to ship in experimental form in the 3.13 release, planned for October 2024.

Another area of performance improvement involves just-in-time (JIT) compilation of Python code to achieve performance comparable to Java. This proposal, PEP 744, “JIT Compilation,” is still in the draft stages and may take 1-2 years to complete.

These proposals aim to make Python one of the fastest interpreted languages. Importantly, future implementations of PEP 703 and PEP 744 will not require code changes to take advantage of these improvements.

What actions can we take in the meantime?

Python 3.12 is significantly faster than Python 3.7 or 3.8. For web apps and APIs, Python offers the FastAPI framework, which promises performance comparable to Node.js- and Go-based web app frameworks.

Numba and PyPy can still be used to optimize slow sections of code. Numba lets developers write fast Python kernels that take advantage of CUDA high-performance computing without delving into the C/C++ route, for which many development teams lack the resources.
