Multithreading and Multiprocessing in Python
Introduction
Multithreading and multiprocessing are two techniques for achieving concurrent execution in Python. Both aim to improve performance by running multiple tasks at the same time, but they differ fundamentally in how they manage and use system resources.
Global Interpreter Lock (GIL): Python's standard implementation, CPython, has a Global Interpreter Lock (GIL) that prevents multiple native threads from executing Python bytecode at once. Even when a program has multiple threads, only one of them can execute Python code at any given moment. This significantly limits the performance gains of multithreading for CPU-bound tasks, which spend most of their time computing.
I/O-Bound Tasks: Tasks that spend most of their time waiting, such as reading or writing files, making network requests, or interacting with databases. A thread that is blocked on I/O releases the GIL, so other threads can run in the meantime.
What Are Threads?
A thread is a smaller unit of a process that can be scheduled to run by the operating system. When a program starts, it runs as a single process that contains at least one thread, the main thread. This main thread is where the program begins execution. However, a process can create additional threads, and each of these threads can run code independently and concurrently within the same process. The key characteristics of threads are:
- Shared Memory Space: All threads within a process share the same memory space. This means they can access and modify the same variables and data structures. While this can be beneficial for sharing information between threads, it also requires careful management to avoid issues like race conditions, where two or more threads attempt to modify the same data simultaneously.
- Independent Execution: Each thread runs independently of the others. This means threads can perform different tasks simultaneously, which can improve the efficiency and responsiveness of a program.
- Lightweight: Compared to processes, threads are relatively lightweight. Creating a thread consumes fewer resources than creating a new process because threads within the same process share many resources like memory and file handles.
- Concurrency: Threads allow a program to perform multiple operations concurrently. For example, in a web server, one thread might handle client requests, while another handles logging, and yet another thread manages database queries, all simultaneously.
- Context Switching: The operating system can switch between different threads, a process known as context switching. This allows for the concurrent execution of threads, even on a single-core processor, by quickly switching between threads.
- Multithreading:
  - Multithreading involves running multiple threads in a single process. Threads share the same memory space and resources within the process.
  - Python's threading module is used for creating and managing threads.
  - It's suitable for I/O-bound tasks where the threads spend most of their time waiting for I/O operations to complete (e.g., network requests, file I/O, etc.).
  - Due to Python's Global Interpreter Lock (GIL), multithreading might not be as effective for CPU-bound tasks that require intensive computation because only one thread can execute Python bytecode at a time.
  - Example: concurrent downloading of files from the internet.
- Multiprocessing:
  - Multiprocessing involves running multiple processes, each with its own memory space and resources. Processes do not share memory by default and communicate via inter-process communication (IPC) mechanisms.
  - Python's multiprocessing module is used for creating and managing processes.
  - It's suitable for CPU-bound tasks where parallelism can be achieved by distributing the workload across multiple processes.
  - Since each process has its own GIL, multiprocessing can effectively utilize multiple CPU cores.
  - Example: parallelizing a CPU-intensive task such as image processing.
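To make the shared-memory point concrete, here is a minimal sketch in which several threads write into one dictionary that the main thread reads afterwards. The names `results` and `record` are illustrative only and are not part of any library.

```python
import threading

# One dictionary, visible to every thread in this process.
results = {}

def record(task_id):
    # Each thread writes under its own key, so no two threads touch the same entry.
    results[task_id] = threading.current_thread().name

threads = [
    threading.Thread(target=record, args=(i,), name=f"worker-{i}")
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread sees what the worker threads stored.
print(sorted(results.items()))
```

If `record` ran in separate processes instead, the parent's dictionary would stay empty, because each process works on its own copy of the data.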
Comparison Between Multithreading and Multiprocessing
Aspect | Multithreading | Multiprocessing |
---|---|---|
Memory Usage | Less memory usage/Shared memory space | More memory usage/Separate memory space for each process |
Concurrency Type | Concurrent threads within a single process | Parallel processes with separate memory spaces |
GIL Impact | Affected by GIL (only one thread executes Python code at a time) | Not limited by the GIL; each process has its own interpreter and its own GIL, so processes can run Python code in parallel |
Best For | I/O-bound tasks (e.g., web scraping, file I/O) | CPU-bound tasks (e.g., heavy computations) |
Synchronization | More complex due to shared state | Less complex but requires IPC for communication |
Overhead | Lower overhead, but limited by GIL | Higher overhead due to process creation and management, but no GIL limitation |
Fault Isolation | A thread crash can affect the entire process | A process crash is isolated to that process |
What is Multithreading?
Multithreading allows a program to run multiple threads concurrently, which is particularly useful in scenarios where the program needs to perform multiple tasks simultaneously without requiring significant CPU resources.
In multithreading, each thread operates independently but shares the same memory space with other threads within the same process. This is especially beneficial for I/O-bound tasks (e.g., file I/O, network operations), where the program often spends time waiting for operations to complete. While one thread is waiting for I/O, other threads can continue executing, leading to better overall performance and reduced idle time.
- When to Use Multithreading? Multithreading is most beneficial in the following scenarios:
  - Multiple Tasks Simultaneously: When the program needs to handle several tasks at once without significant CPU load.
  - I/O-Bound Operations: In tasks where the program spends a lot of time waiting for I/O operations to complete, such as reading/writing files, network communication, or database interactions.
- Why Use Multithreading?
  - Efficiency: Multithreading improves efficiency by allowing other threads to execute while one thread is waiting, minimizing idle time.
  - Improved Responsiveness: In applications like GUIs, multithreading can keep the interface responsive while performing background tasks.
- Examples:
  - Web Scraping: Fetching data from multiple web pages simultaneously.
  - Network Operations: Handling multiple client connections on a server, downloading files, or sending requests to APIs.
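The benefit for I/O-bound work can be sketched without touching the network. In the example below, `time.sleep` stands in for waiting on a server, and the same set of simulated downloads is run first sequentially and then with one thread per download. The function names and the `example.com` URLs are placeholders chosen for illustration; no request is actually sent.

```python
import threading
import time

def fake_download(url, results, delay=0.1):
    # time.sleep stands in for waiting on the network; it releases the GIL.
    time.sleep(delay)
    results[url] = f"payload from {url}"

def download_sequential(urls):
    results = {}
    for url in urls:
        fake_download(url, results)
    return results

def download_threaded(urls):
    results = {}
    threads = [
        threading.Thread(target=fake_download, args=(url, results))
        for url in urls
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def timed(func, urls):
    # Return the function's result together with the elapsed wall-clock time.
    start = time.perf_counter()
    results = func(urls)
    return results, time.perf_counter() - start

# Placeholder URLs; they are only used as dictionary keys.
urls = [f"https://example.com/page{i}" for i in range(5)]
seq_results, seq_time = timed(download_sequential, urls)
thr_results, thr_time = timed(download_threaded, urls)
print(f"sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s")
```

The sequential run waits for each delay in turn, while the threaded run overlaps the waits, so its total time should be close to a single delay.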
Creating and Using Threads in Python
Python provides the threading module to work with threads. Below is a simple example of how to create and start a thread:

import threading

def print_numbers():
    for i in range(10):
        print(i)

# Create a thread object
thread = threading.Thread(target=print_numbers)
# Start the thread
thread.start()
# Wait for the thread to finish
thread.join()

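A thread's target function can also take arguments, but its return value is discarded. One common way to get results back, sketched here under the assumption that a simple word count is the task, is to have each thread put its result on a `queue.Queue`. The names `word_count` and `count_all` are made up for this example.

```python
import queue
import threading

def word_count(name, text, out):
    # Thread targets cannot return values, so the result goes onto a queue.
    out.put((name, len(text.split())))

def count_all(documents):
    out = queue.Queue()
    threads = [
        threading.Thread(target=word_count, args=(name, text, out))
        for name, text in documents.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All producers have finished, so the queue can be drained safely.
    results = {}
    while not out.empty():
        name, count = out.get()
        results[name] = count
    return results

docs = {"a.txt": "one two three", "b.txt": "four five"}
print(count_all(docs))
```

`queue.Queue` is thread-safe, so the worker threads need no extra locking around `put`.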
Thread Synchronization
Due to the shared memory space, threads may encounter issues like race conditions when they try to access shared resources simultaneously. To avoid these issues, Python provides several synchronization primitives, such as Locks, RLocks, Semaphores, Events, and Conditions.
- Lock: A Lock object is a basic synchronization primitive. It ensures that only one thread can access a particular section of code at a time.

import threading

lock = threading.Lock()

def safe_increment(counter):
    # counter is assumed to be any object with a numeric `value` attribute
    with lock:
        counter.value += 1

- RLock: A reentrant lock (RLock) allows a thread to acquire the same lock multiple times without blocking itself.
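The difference between the two lock types shows up when a method that holds the lock calls another method that takes the same lock. The sketch below uses a hypothetical `Counter` class to illustrate this: `increment_twice` acquires the lock and then calls `increment`, which acquires it again. With `threading.RLock` this works; with a plain `threading.Lock` the inner acquire would block forever.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.RLock()

    def increment(self):
        with self._lock:
            self.value += 1

    def increment_twice(self):
        with self._lock:      # first acquire
            self.increment()  # second acquire by the same thread: fine with RLock
            self.increment()

def run_workers(counter, n_threads=4, n_calls=1000):
    def work():
        for _ in range(n_calls):
            counter.increment_twice()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

counter = Counter()
run_workers(counter)
print(counter.value)  # 4 threads * 1000 calls * 2 increments = 8000
```

Because every update happens while the lock is held, the final count is the same on every run.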
Thread Pools
For managing a pool of threads, Python provides concurrent.futures.ThreadPoolExecutor, which makes it easier to work with multiple threads.

from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(square, [1, 2, 3, 4])
    print(list(results))

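When tasks may fail or finish at different times, `submit` together with `as_completed` gives finer control than `map`. The following is a small sketch of that pattern; `reciprocal` and `run_all` are hypothetical names, and a division by zero is used as the deliberate failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def reciprocal(n):
    return 1 / n

def run_all(numbers):
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        # Map each future back to the input that produced it.
        future_to_n = {executor.submit(reciprocal, n): n for n in numbers}
        for future in as_completed(future_to_n):
            n = future_to_n[future]
            try:
                # result() re-raises any exception raised inside the worker.
                results[n] = future.result()
            except ZeroDivisionError as exc:
                errors[n] = exc
    return results, errors

results, errors = run_all([1, 2, 0, 4])
print("results:", sorted(results.items()))
print("failed inputs:", sorted(errors))
```

An exception in one task does not stop the others; it surfaces only when `result()` is called on that task's future.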
Multiprocessing
Multiprocessing involves running multiple processes simultaneously. Each process has its own memory space and its own Python interpreter, and therefore its own GIL, so one process never blocks another from executing Python code. This makes multiprocessing more suitable for CPU-bound tasks, where multiple processes can run in parallel on different CPU cores.
Creating and Using Processes in Python
Python provides the multiprocessing module to create and manage processes. Each process runs independently, and processes do not share memory space, which avoids issues like race conditions but requires inter-process communication (IPC) to share data.
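
import multiprocessing

def print_numbers():
    for i in range(10):
        print(i)

# The guard keeps child processes from re-running this block when the
# module is re-imported (needed with the "spawn" start method, e.g. on Windows)
if __name__ == "__main__":
    # Create a process object
    process = multiprocessing.Process(target=print_numbers)
    # Start the process
    process.start()
    # Wait for the process to finish
    process.join()
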
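The separate memory spaces can be observed directly. In this sketch, a module-level list named `items` (an illustrative name) is modified once inside a child process and once by a direct call in the parent. Only the direct call changes the list the parent sees.

```python
import multiprocessing

# Module-level list; every process works on its own copy.
items = []

def append_item():
    items.append("added")
    print("inside function:", items)

if __name__ == "__main__":
    child = multiprocessing.Process(target=append_item)
    child.start()
    child.join()
    print("child exit code:", child.exitcode)
    # The child changed its own copy, so the parent's list is still empty.
    print("parent after child:", items)

    # Calling the function directly runs it in the parent process.
    append_item()
    print("parent after direct call:", items)
```

This isolation is what rules out race conditions on ordinary variables, and it is also why data has to be passed between processes explicitly.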
Inter-Process Communication (IPC)
Since processes do not share memory, Python provides several IPC mechanisms:
- Queues: Used to pass messages or data between processes.
- Pipes: A Pipe provides a two-way communication channel between two processes.
- Shared Memory: Allows sharing of variables between processes using multiprocessing.Value or multiprocessing.Array.
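As a rough illustration of two of these mechanisms, the sketch below has worker processes send squared numbers back through a `multiprocessing.Queue` and add them to a running total held in a `multiprocessing.Value`. The `worker` and `run_demo` functions and the input chunks are assumptions made for the example.

```python
import multiprocessing as mp

def worker(numbers, queue, total):
    for n in numbers:
        square = n * n
        queue.put(square)          # send a message to the parent
        with total.get_lock():     # guard the read-modify-write on shared memory
            total.value += square

def run_demo():
    queue = mp.Queue()
    total = mp.Value("i", 0)       # shared C int, initial value 0
    chunks = [[1, 2, 3], [4, 5, 6]]
    processes = [
        mp.Process(target=worker, args=(chunk, queue, total))
        for chunk in chunks
    ]
    for p in processes:
        p.start()
    # Read the expected number of items before joining the workers.
    n_items = sum(len(chunk) for chunk in chunks)
    squares = sorted(queue.get() for _ in range(n_items))
    for p in processes:
        p.join()
    return squares, total.value

if __name__ == "__main__":
    squares, total = run_demo()
    print(squares, total)  # [1, 4, 9, 16, 25, 36] 91
```

The queue is drained before `join()` because a process that has put items on a queue may not exit until those items have been consumed.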
Process Pools
Similar to thread pools, Python provides concurrent.futures.ProcessPoolExecutor for managing a pool of worker processes.
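
from concurrent.futures import ProcessPoolExecutor

def square(n):
    return n * n

# The guard is needed because worker processes may re-import this module
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = executor.map(square, [1, 2, 3, 4])
        print(list(results))
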
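For a task that is actually CPU-bound, a typical approach is to split the input into chunks and hand one chunk to each worker. The sketch below counts primes by trial division, chosen only because it is slow enough to be worth parallelizing; `count_primes`, `split_range`, and `count_primes_parallel` are hypothetical helpers, and `limit` is assumed to be positive.

```python
from concurrent.futures import ProcessPoolExecutor

def count_primes(bounds):
    # Count primes in the half-open range [lo, hi) by trial division.
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

def split_range(limit, parts):
    # Split [0, limit) into at most `parts` consecutive chunks.
    step = -(-limit // parts)  # ceiling division
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]

def count_primes_parallel(limit, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # Each chunk is counted in a separate worker process.
        return sum(executor.map(count_primes, split_range(limit, workers)))

if __name__ == "__main__":
    print(count_primes_parallel(200_000))
```

Equal-width chunks are the simplest split, though higher ranges take longer per number, so a finer split may balance the load better.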
Synchronization in Multiprocessing
Even though processes do not share memory, they might need to coordinate actions. Python provides synchronization primitives similar to those in threading, such as Locks, Events, Semaphores, and Conditions, but adapted for inter-process use.
Choosing Between Multithreading and Multiprocessing
- Use Multithreading when your application is I/O-bound, meaning that the task spends most of its time waiting for I/O operations like file handling, network communication, etc.
- Use Multiprocessing when your application is CPU-bound, meaning that the task spends most of its time performing computations, and you want to leverage multiple CPU cores for parallel execution.
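Because `ThreadPoolExecutor` and `ProcessPoolExecutor` share the same interface, one way to check which fits a workload is to run the same function through both and compare timings. The sketch below does this for a pure-Python loop; `cpu_task` and `run_with` are illustrative names, and the job sizes are arbitrary.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_task(n):
    # Pure-Python arithmetic: the running thread holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_with(executor_cls, jobs, workers=4):
    # Run every job through the given executor type and time the batch.
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        results = list(executor.map(cpu_task, jobs))
    return results, time.perf_counter() - start

if __name__ == "__main__":
    jobs = [2_000_000] * 4
    for executor_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, elapsed = run_with(executor_cls, jobs)
        print(f"{executor_cls.__name__}: {elapsed:.2f}s")
```

On a multi-core machine with a standard CPython build, the process pool would typically finish sooner for this kind of work, but the exact numbers depend on the hardware and on process start-up cost.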
References
- Udemy playlist on advanced Python by Krish Naik.
- For more details, please check out the official documentation.