A (not so) short story about how I came to understand the subtle difference between multiprocessing and multithreading.
Foreword
What do you remember from 1991? It was a year full of interesting events: the USSR collapsed, and the Chicago Bulls with Michael Jordan won their first NBA championship, for example. But developers remember that year as a different milestone: Python made its debut.
Although Python has been around for so long, there is still a lot to explore. One of the things I learned about recently was multithreading, an area where Python offers two general approaches.
In this article, I will try to describe them and point out the differences, which I hope will allow you to choose the right solutions for the processes you program in your remarkable, wonderful, and amazing Python applications.
Part 1 - Explaining the concept
Traditional Approach (Or How Not To Program)
Imagine you have a yeast dough to bake. Generally, it is a simple cake, the preparation of which we can divide into 3 main phases:
- Gathering of ingredients
- Dough preparation
- Baking (Here you do not actually do anything, the oven works for you so we will skip this step)
Phase 1 - Gathering The Ingredients
As I mentioned, yeast dough is a simple dough. To prepare it you need: flour, sugar, yeast, milk, 5 eggs, margarine, and salt. (By the way, I will tell you my grandma's secret. It's better to use butter instead of margarine, it comes out much better).
So our list of ingredients looks like this:
ingredients = ["egg", "egg", "egg", "egg", "egg", "sugar 250g", "flour 1kg", "yeast 75g", "milk 1.5 glass", "butter 200g", "pinch of salt"]
In the traditional approach of gathering ingredients for a cake as a cook, you are in the kitchen alone. Moreover, you are not very smart and you’re using only one hand to do the work.
The cook's hand is our Thread.
So the code for gathering ingredients written in Python would look like this:
for ingredient in ingredients:
    cook.go_to_fridge_or_cabinet()
    cook.take_ingredient(ingredient)
    cook.bring_ingredient()
    cook.put_it_on_the_table()
Quite a few steps that we have to do. We do them separately for each ingredient, so the whole routine runs 11 times.
With this (traditional) programming approach we run around the kitchen a bit.
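This one-handed loop can be sketched in runnable Python. The `time.sleep(0.1)` below stands in for a single trip to the fridge; since the `cook` object above is only an analogy, the sketch uses a plain function:

```python
import time

# The full shopping list from the recipe above
ingredients = ["egg"] * 5 + ["sugar 250g", "flour 1kg", "yeast 75g",
               "milk 1.5 glass", "butter 200g", "pinch of salt"]

def fetch(ingredient: str) -> str:
    time.sleep(0.1)  # one round trip fridge <-> table
    return ingredient

start = time.perf_counter()
table = [fetch(i) for i in ingredients]  # one hand, one ingredient at a time
elapsed = time.perf_counter() - start
print(f"{len(table)} ingredients gathered in {elapsed:.2f}s")  # ~1.1s
```

All 11 trips happen one after another, so the total time is roughly 11 × 0.1 s.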
Phase 2 - Dough Preparation
Preparing a yeast dough is not as simple as gathering the ingredients. It is a more laborious process and more prone to errors. The order of operations is crucial; otherwise, we will end up with a scone. Our "program" for preparing the dough must follow the recipe exactly. No step can be skipped.
So our cake baking “code” looks like this:
cook.warm_up_milk("37 °C")
cook.put_ingredients_to_bowl(["(warm) milk", "yeast 75g", "pinch of sugar", "spoon of flour"])
cook.mix_ingredients_in_bowl()
cook.wait("10 min")
cook.melt_butter()
cook.put_ingredients_to_bowl(["egg", "egg", "egg", "egg", "egg", "rest of sugar", "melted butter", "pinch of salt", "rest of flour"])
cook.mix_ingredients_in_bowl("20 min")
cook.wait("30 min")
cook.put_cake_on_baking_tray()
cook.wait("30 min")
cook.bake_cake("50 min", "170 °C")
Again, you do everything with one hand. In the end, you have baked a tasty cake, but its preparation takes a lot of time. The order of the steps also matters here. For example, you can't combine all the `cook.wait()` commands into a single `cook.wait(70 min.)`, and you can't change the order in which the lines of the program are executed. If you do, the cake won't be good.
While baking a cake is hard to optimize (it takes about the same amount of time no matter how many cooks make it), Phase 1 (gathering ingredients) seems easy to optimize. It doesn't matter in which order you bring the ingredients from the fridge and put them on the table for further processing. What's more, you can safely bring all the ingredients at once. I assure you that no egg will protest that you brought it to the table together with another egg, the flour, or the sugar.
Just how do you do it with one hand?
Multithreading - Let's Add More Hands To Our Cook
To understand how multithreading works in Python, the key point is the "To Our Cook" part of the chapter title. By using multithreading, you add hands to your cook, but there is still only one cook in the kitchen.
The official Python documentation refers to threading as "thread-based parallelism". Tasks are executed in parallel... or rather quasi-parallel. It is this fine distinction between multithreading and multiprocessing that had eluded me all along.
Multithreading gives us the ability to start all the pending tasks at (almost) the same moment and let them run concurrently, regardless of their duration. The tasks are executed by a single processor core and share the operating memory in which the program runs.
Referring to our pie example, our cook in the kitchen is a mutant octopus on steroids that has grown an extra 4 arms. On a signal we specify, the octopus performs a job we specify, which we will call a worker for readability:
def worker(ingredient):
    cook.go_to_fridge_or_cabinet()
    cook.take_ingredient(ingredient)
    cook.bring_ingredient()
    cook.put_it_on_the_table()

for ingredient in ingredients:
    octopus.submit(worker, ingredient)
What is going on here? At the beginning, we define the work to be done, our "worker": this is the function that will be executed. The worker needs to know which ingredient to bring; without that, it would get lost. This is the same routine we described above; we do not change anything here.
The second part of the program is more interesting. Having our list (array) of ingredients, we tell our octopus to run a worker for each ingredient on the list:
`octopus.submit(worker, ingredient)`
Since we issue these commands almost instantly, shouting out successive orders (fetch an egg, fetch the sugar, fetch the yeast...), the octopus, before it even moves, already has all its workers defined and starts executing them. Each worker is carried out by a separate tentacled arm.
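In real Python, the octopus can be played by `concurrent.futures.ThreadPoolExecutor`. A minimal sketch, again simulating each trip to the fridge with a short sleep:

```python
import time
from concurrent.futures import ThreadPoolExecutor

ingredients = ["egg"] * 5 + ["sugar 250g", "flour 1kg", "yeast 75g",
               "milk 1.5 glass", "butter 200g", "pinch of salt"]

def worker(ingredient: str) -> str:
    time.sleep(0.1)  # the walk to the fridge is mostly waiting (I/O-like)
    return ingredient

start = time.perf_counter()
# The executor is our octopus; map() hands one worker to each arm (thread)
with ThreadPoolExecutor(max_workers=11) as octopus:
    on_table = list(octopus.map(worker, ingredients))
elapsed = time.perf_counter() - start
print(f"{len(on_table)} ingredients in {elapsed:.2f}s")  # ~0.1s, not ~1.1s
```

Because the threads spend their time waiting rather than computing, all 11 trips overlap and the total time is close to that of a single trip.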
Phase 1 - Gathering The Ingredients
And so, in the first phase, the octopus:
- Walks over to the refrigerator,
- Opens it (all arms at once),
- Reaches its arms (all at once) into the refrigerator, each grabbing the ingredient specified in its worker,
- Takes them out,
- Closes the fridge and goes back to the table,
- Puts the ingredients on the table.
This is where we may run into minor inconveniences. The yeast, milk, or butter will most likely be on different shelves of the refrigerator, in different places. The eggs, however, are most likely on one shelf, in one carton. What does the octopus do? Five arms reach for the same shelf at the same time, wedging together and blocking each other.
No worries. Python's multithreading handles this problem for us automatically: a thread simply waits a while until one of the tentacles (another thread) frees up the space (the computer resources) it needs, and then it can be executed.
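That "wedged tentacles" situation is, in code, threads contending for a shared resource. A toy sketch using `threading.Lock` (the shelf and eggs are invented for the analogy; in real code, locks are how we make threads take turns on a shared resource):

```python
import threading

egg_shelf = ["egg"] * 5        # all the eggs sit on one shelf
shelf_lock = threading.Lock()  # only one tentacle fits at a time
table = []

def grab_egg() -> None:
    with shelf_lock:           # a blocked thread simply waits here
        egg = egg_shelf.pop()
    table.append(egg)

threads = [threading.Thread(target=grab_egg) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(table), len(egg_shelf))  # 5 0
```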
Once all the workers are done, we have the ingredients on the table, ready for the next step, and we can move on to Phase 2. And here we hit an obstacle that multithreading is not able to handle, especially if we have several cakes to bake.
Phase 2 - Dough Preparation
The preparation of the cake, as I have already mentioned, is the phase that requires more attention from the cook. The "program" must also be executed in the right order, with baking in the oven at the end.
Imagine a situation where we have to bake 10 cakes. In the first step, we bring in all the ingredients needed to make them. The table gets cramped, but we still fit somehow. We start 10 cake workers and here we hit an obstacle we cannot overcome:
- We have one bowl where we mix the ingredients
- We have one oven where we bake cakes.
Running 10 parallel threads won't do us any good. Preparing and baking one cake blocks our resources (the bowl and the oven) for 70-80 minutes. The other threads "wait" for the resources to be released before they can proceed. And so baking 10 cakes with multithreading is a job of about 800 minutes (13+ hours).
How do we increase our resources and add more bowls and ovens to the kitchen?
Multiprocessing - Let’s Clone Kitchens
The idea of multiprocessing, allowing all of a computer's resources to be used in parallel, is nothing new. However, unlike some other programming languages, Python itself is not ready for it. It is hindered by the Global Interpreter Lock (GIL), which prevents more than one thread from executing Python bytecode at any given moment. For those interested in the issue, I recommend the post "What is the Python Global Interpreter Lock (GIL)?", where you will find a detailed description of this "infamous" Python feature.
In order to bypass the GIL-related limitations, Python's standard library ships the multiprocessing package (available since Python 2.6), with a higher-level interface added in `concurrent.futures` in Python 3.2. Multiprocessing frees us from this limitation, giving us the possibility of full and unrestricted use of all the computer's resources... however, it has limitations of its own, which you have to keep in mind and which we will discuss in more detail later in this article.
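A minimal sketch of CPU-bound work spread over processes with `concurrent.futures.ProcessPoolExecutor`. The "baking" here is simulated with pure-Python arithmetic; the explicit "fork" start method keeps the example self-contained on Linux, while on Windows you would use the default "spawn" method together with an `if __name__ == "__main__":` guard:

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def bake_cake(cake_id: int) -> int:
    # CPU-bound "baking": keep one core busy with pure-Python arithmetic
    sum(i * i for i in range(100_000))
    return cake_id

def bake_all(n: int, workers: int = 4) -> list[int]:
    # Each task runs in its own process with its own interpreter and GIL,
    # so the cakes really bake in parallel on separate cores
    ctx = mp.get_context("fork")  # POSIX-only shortcut for this sketch
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        return list(pool.map(bake_cake, range(n)))

print(bake_all(10))  # [0, 1, ..., 9]
```

`pool.map()` preserves the input order of its results, even though the processes finish in whatever order the scheduler allows.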
Let's refer one last time to our cake-baking example. Our kitchen, in which we can bake one cake at a time, cannot handle the case of 10 cakes, even if our cook has 10 hands and moves the ingredients to the workbench quite quickly, because the kitchen has one oven into which we can put one cake at a time. Multiprocessing comes to the rescue by replicating our kitchen.
You can imagine it as a block of apartments in which there are many apartments and each of them has a kitchen.
Thanks to multiprocessing, we can use each of them, keeping in mind some important things:
- Each block, even the biggest one, has a limited number of kitchens that we can use (RAM, processor cores, and all the technical stuff that goes into processing our program).
- The kitchens are in separate apartments. As such, they do not know about each other's existence. More specifically, Kitchen A does not know if Kitchen B is making a pie and (if it is) at what stage it is.
- Using all the kitchens at once to bake a cake can completely block our ability to make other meals around the block. And what if there's a sudden need to heat up some porridge for the baby? A hungry child very quickly turns into an emergency situation that we certainly don't want to have.
As you can guess, the block is our computer/server. We can expand it by adding more processors, RAM, hard disks, etc. However, every computer, even the biggest one in the world, will eventually reach its limit; if we ran 100 processes on it, each using 100 threads, we would hit the maximum capacity of most publicly available servers.
A separate problem associated with multiprocessing is the issue of information exchange. As I mentioned our "kitchens" do not know anything about each other. However, in most cases, when we run parallel tasks, we would like to be able to do something with the results of their actions at the end, when they are all done. In our case, in the end, we'd like to pack up our cakes and take them to the cafe for guests. This is quite an obvious problem, so the library creators have added appropriate solutions that we can use.
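One such solution is returning values through a process pool: arguments are pickled into the worker processes and their results are pickled back to the parent. A hedged sketch (again using the POSIX-only "fork" start method for brevity; the file names are invented for the analogy):

```python
import multiprocessing as mp

def bake(cake_id: int) -> str:
    # Each "kitchen" returns its finished cake to the parent process
    return f"cake-{cake_id}.pdf"

def pack_for_cafe(n: int) -> list[str]:
    # Pool.map sends each task to a worker process and collects the
    # return values back in the parent: the kitchens "exchange" data
    with mp.get_context("fork").Pool(processes=3) as pool:
        return pool.map(bake, range(n))

print(pack_for_cafe(3))  # ['cake-0.pdf', 'cake-1.pdf', 'cake-2.pdf']
```

For richer communication patterns, the multiprocessing module also offers `Queue` and `Pipe`, but collecting return values from a pool covers the common "pack up the cakes at the end" case.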
The obvious problem we will encounter when using the power of multiprocessing is the question of the maximum resources we will use. Let's imagine a situation where our server gives us 30 processes to use. If we occupy all of them, and in the meantime some other user types in the address of our web page, the server won't even be able to display it, because all 30 processes will be currently occupied with the work we gave the server. Like a hungry child, a client who doesn't see the website will very quickly lose patience.
Multiprocessing, like multithreading, also runs parts of the program at the same time, but it uses separate computer resources, so the work of one process can be performed independently of the work of another.
The execution of our 10-cake multiprocessing baking program would look like this.
START
  Process 1:  Phase 1 -> Phase 2
  Process 2:  Phase 1 -> Phase 2
  ...
  Process 10: Phase 1 -> Phase 2
END. 10 cakes baked
The baking time for all 10 cakes is simply the baking time of the slowest cake. Since no one is blocking the oven, none of the running processes has to wait for it. The whole program finishes not in 800 minutes (multithreading) but in about 80 minutes. 10 cakes in 80 minutes is already a micro-bakery that we can conquer the market with, especially if we bake a cake as good as the yeast cake described above.
I hope that this explanation of the differences between traditional (in the loop) programming, multithreading, and multiprocessing will help you understand the difference between these issues in more detail and... more importantly, will allow you to better choose a programming solution strategy for your applications.
Now it's time for a short break with coffee and yeast cake and after that, in the next part of the article, I will show you how you can use the knowledge gained in practice.
Part 2 - Multithreading and Multiprocessing in Action
Side note
For the rest of this article, I assume you have some basic knowledge of programming in Python :) and of configuring Docker, which we will use to create our development environment.
You can find the examples below in the repository at https://github.com/michal-stachura/blog-mvm.
Given the speed at which computer programs are executed, the differences between multithreading and multiprocessing that we will discuss next are quite difficult to see. To compensate, we will add a few "test points" to our code and load the CPU heavily, which will give us a better picture of the differences between these approaches.
—
Ok, in the first part of this article we had two main "Phases" of baking a cake. Phase 1 was easier to execute. Phase 2 required more resources and could "block" us from executing the program due to insufficient computer/server resources.
In publicly available examples on the Internet, this issue is usually simulated with `time.sleep(1)` for easy processes and `time.sleep(10)` for hard processes taking 10 seconds. In reality, both of these are equally trivial for the CPU: they consume no CPU resources at all, they just make the program wait 1 or 10 seconds.
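The difference is easy to demonstrate: both functions below take about the same wall time, but only the second one actually keeps the CPU busy:

```python
import time

def easy_job() -> float:
    # time.sleep just blocks; the CPU is free the whole time
    start = time.perf_counter()
    time.sleep(0.2)
    return time.perf_counter() - start

def hard_job() -> float:
    # busy loop: roughly the same wall time, but the CPU is actually working
    start = time.perf_counter()
    while time.perf_counter() - start < 0.2:
        sum(i * i for i in range(1_000))
    return time.perf_counter() - start

print(f"easy: {easy_job():.2f}s, hard: {hard_job():.2f}s")  # both ~0.2s
```

A `time.sleep()` benchmark therefore says nothing about how threads and processes behave under real CPU load.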
Personally, I prefer a more empirical approach. We will write a program that:
- Downloads 10 images of generated faces of people who do not exist in real life from https://www.thispersondoesnotexist.com/
- Resizes the images appropriately
- Places each of them on a page of a PDF document and adds fictitious details of the person, such as a name and home address
- Packs the prepared PDF files into a ZIP archive and saves it to disk
As you can guess, point 1 (downloading the images) is the easier Phase 1. The service responds at varying speeds, so we will see small differences in the time it takes to download and process the images.
Points 2, 3 and 4 of our program make up Phase 2, engaging the computer much more and requiring more resources. OK, enough talking. Let's write some code.
The main.py - is our main application file where in the first part I add the logger configuration and define the parameters that we can use in the tests:
- --cvs - number of generated CV files
- --details - flag whether the report should contain details or not.
- --p1_type, --p2_type - the type of processing for phase 1 and phase 2; the choices are "common", "multithreading" and "multiprocessing"
- --p1_max_workers, --p2_max_workers - the number of parallel threads/processes we want to use in the tests for phase 1 and phase 2
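The actual main.py lives in the repository; as an illustration, a parser for the flags above might look roughly like this (the defaults and description here are my assumptions, not necessarily what the repo uses):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    types = ["common", "multithreading", "multiprocessing"]
    parser = argparse.ArgumentParser(description="CV generation benchmark")
    parser.add_argument("--cvs", type=int, default=10)
    parser.add_argument("--details", default="N")
    parser.add_argument("--p1_type", choices=types, default="common")
    parser.add_argument("--p2_type", choices=types, default="common")
    parser.add_argument("--p1_max_workers", type=int, default=10)
    parser.add_argument("--p2_max_workers", type=int, default=10)
    return parser

args = build_parser().parse_args(
    ["--cvs=10", "--details=Y", "--p1_type=multithreading"]
)
print(args.cvs, args.p1_type, args.p2_max_workers)  # 10 multithreading 10
```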
Next, we have a simple call to the classes PhaseOne and PhaseTwo, which are defined in `app/phase1.py` and `app/phase2.py` respectively.
Both files, with the `PhaseOne` and `PhaseTwo` classes, have a similar structure: in each, I first define the "job" to be done, `def job()`, and then the workers that, in addition to logging times, execute the defined `def job()`.
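The repository has the real classes; the shared job/worker shape described above might be sketched like this (the class name, the sleep, and the log format are illustrative assumptions):

```python
import time
from concurrent.futures import ThreadPoolExecutor

class PhaseSketch:
    """Rough sketch of the job/worker shape used by PhaseOne and PhaseTwo."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def job(self, task_id: int) -> int:
        time.sleep(0.01)  # the real job downloads an image or builds a PDF
        return task_id

    def worker(self, task_id: int) -> int:
        # the worker wraps job() with the start/end logging seen in the tests
        self.log.append(f"Task: {task_id} (start)")
        result = self.job(task_id)
        self.log.append(f"Task: {task_id} (end)")
        return result

    def run(self, n: int, max_workers: int = 4) -> list[int]:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(self.worker, range(n)))

phase = PhaseSketch()
print(phase.run(5))  # [0, 1, 2, 3, 4]
```

Keeping timing and logging in the worker, not the job, is what lets the test output show when each task actually started and finished.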
OK, I think this is understandable. Time to get your hands a little dirty :)
Test 1 - Traditional approach
The traditional approach of doing the work in a simple loop without running the code in parallel and doing several things at once.
In the ssh console, type:
docker build --tag monte_py .
This will build the image of the environment that we will use in further testing.
Successfully built 267b3b24efc6
Successfully tagged monte_py:latest
Sidenotes
- If you run into problems, check whether you need to run the above command with sudo
- If you are not a fan of Docker, you can run the whole thing using virtualenv, keeping in mind:
- You must have Python version 3.10+
- You have to create the directories that Docker creates, i.e. 'downloads' and 'results'
- After activating the virtual environment, be sure to install the necessary libraries: `pip install -r requirements.txt`
With the image ready, we run the first test:
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="common" --p2_type="common"
The resulting output should look roughly like this:
######################
Number of CV's: 10
Test type:
- Phase 1: common
- Phase 2: common
Detailed report: Y
Max workers:
- Phase 1: Not considered
- Phase 2: Not considered
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.398223
Phase 1 took: 0:00:03.985781
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.375734
Phase 2 took: 0:00:23.978809
--- Summary ---
Whole process took: 0:00:27.964590
--- Details Phase 1 ---
[
"Task: 0 (start) - PID: 1 CPU: 9.5%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 0 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (end) - PID: 1 CPU: 10.2%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
"Task: 3 (end) - PID: 1 CPU: 5.2%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 4 (end) - PID: 1 CPU: 1.7%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 5 (end) - PID: 1 CPU: 2.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 6 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 6 (end) - PID: 1 CPU: 4.0%, RAM (GB): avl: 23.93, used: 6.47, 23.7%)",
"Task: 7 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.93, used: 6.47, 23.7%)",
"Task: 7 (end) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 8 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.59, used: 6.81, 24.8%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.59, used: 6.81, 24.8%)",
"Task: 9 (end) - PID: 1 CPU: 12.3%, RAM (GB): avl: 23.56, used: 6.84, 24.9%)"
]
--- Details Phase 2 ---
[
"Task: 0 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.56, used: 6.84, 24.9%)",
"Task: 0 (end) - PID: 1 CPU: 15.2%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 1 (end) - PID: 1 CPU: 16.2%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 2 (end) - PID: 1 CPU: 15.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.55, used: 6.85, 24.9%)",
"Task: 3 (end) - PID: 1 CPU: 14.7%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 100.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 4 (end) - PID: 1 CPU: 16.3%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.53, used: 6.87, 25.0%)",
"Task: 5 (end) - PID: 1 CPU: 14.5%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 6 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 6 (end) - PID: 1 CPU: 14.4%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 7 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 7 (end) - PID: 1 CPU: 15.5%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.49, used: 6.91, 25.1%)",
"Task: 8 (end) - PID: 1 CPU: 15.3%, RAM (GB): avl: 23.77, used: 6.63, 24.2%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.77, used: 6.63, 24.2%)",
"Task: 9 (end) - PID: 1 CPU: 13.6%, RAM (GB): avl: 23.76, used: 6.65, 24.3%)"
]
At the beginning, we have the configuration of the test that was run. As you can see, `max_workers` for phase 1 and phase 2 is reported as not considered, even though the default value is 10. Well, in a traditional loop we do not run the code in parallel; the whole thing is processed in a single thread/process, and we have no influence on it.
Then we have the time summaries for phase 1 and phase 2. In my case, it came out at:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.398223
Phase 1 took: 0:00:03.985781
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.375734
Phase 2 took: 0:00:23.978809
It took an average of ~0.39 seconds to download one image; all 10 images took ~3.98 seconds.
It takes my computer about ~2.37 seconds to generate one PDF file, so 10 PDF files took ~23.97 seconds... quite long.
We closed the entire process in ~27.96 seconds, which is a very poor time. I don't think many customers would wait almost half a minute after clicking the "Generate my 10 resume files" button :)
In the test details, you can see how each task in the loop is executed. In both cases, we have the same scheme.
"Task: 0 (start) - PID: 1 CPU: 9.5%, RAM (GB): avl: 23.54, used: 6.87, 25.0%)",
"Task: 0 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.75, used: 6.65, 24.3%)",
"Task: 1 (end) - PID: 1 CPU: 10.2%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.64, used: 6.76, 24.6%)",
"Task: 2 (end) - PID: 1 CPU: 9.2%, RAM (GB): avl: 23.58, used: 6.83, 24.9%)",
We start a task -> we do it -> we finish the task. Boredom. We use one process (`PID`) the whole time, our `CPU` idles at around 10% most of the time, and the `RAM` remains mostly unused.
It's time to speed things up a bit.
Test 2 - Multithreading
We are working again to generate 10 resume files
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="multithreading" --p2_type="multithreading" --p1_max_workers=8 --p2_max_workers=8
######################
Number of CV's: 10
Test type:
- Phase 1: multithreading
- Phase 2: multithreading
Detailed report: Y
Max workers:
- Phase 1: 8
- Phase 2: 8
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.447264
Phase 1 took: 0:00:00.706548
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:20.365066
Phase 2 took: 0:00:30.022417
--- Summary ---
Whole process took: 0:00:30.728965
--- Details Phase 1 ---
[
"Task: 0 (start) - PID: 1 CPU: 17.6%, RAM (GB): avl: 23.64, used: 6.82, 24.6%)",
"Task: 4 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 2 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 1 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 6 (start) - PID: 1 CPU: 25.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 7 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 5 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.63, used: 6.83, 24.7%)",
"Task: 6 (end) - PID: 1 CPU: 10.9%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 8 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 0 (end) - PID: 1 CPU: 20.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 9 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 4 (end) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 1 (end) - PID: 1 CPU: 11.1%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 2 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 5 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 7 (end) - PID: 1 CPU: 6.7%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 3 (end) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.58, used: 6.88, 24.8%)",
"Task: 8 (end) - PID: 1 CPU: 9.1%, RAM (GB): avl: 23.53, used: 6.93, 25.0%)",
"Task: 9 (end) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)"
]
--- Details Phase 2 ---
[
"Task: 1 (start) - PID: 1 CPU: 66.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 0 (start) - PID: 1 CPU: 66.7%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 3 (start) - PID: 1 CPU: 0.0%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 2 (start) - PID: 1 CPU: 33.3%, RAM (GB): avl: 23.52, used: 6.94, 25.0%)",
"Task: 4 (start) - PID: 1 CPU: 22.2%, RAM (GB): avl: 23.47, used: 6.99, 25.2%)",
"Task: 6 (start) - PID: 1 CPU: 18.7%, RAM (GB): avl: 23.33, used: 7.14, 25.6%)",
"Task: 5 (start) - PID: 1 CPU: 19.2%, RAM (GB): avl: 23.33, used: 7.14, 25.6%)",
"Task: 7 (start) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.32, used: 7.14, 25.7%)",
"Task: 6 (end) - PID: 1 CPU: 14.5%, RAM (GB): avl: 23.16, used: 7.32, 26.2%)",
"Task: 0 (end) - PID: 1 CPU: 9.3%, RAM (GB): avl: 23.15, used: 7.32, 26.2%)",
"Task: 8 (start) - PID: 1 CPU: 9.7%, RAM (GB): avl: 23.15, used: 7.33, 26.2%)",
"Task: 7 (end) - PID: 1 CPU: 10.6%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 9 (start) - PID: 1 CPU: 8.3%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 5 (end) - PID: 1 CPU: 8.1%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 4 (end) - PID: 1 CPU: 8.1%, RAM (GB): avl: 23.14, used: 7.34, 26.2%)",
"Task: 3 (end) - PID: 1 CPU: 13.5%, RAM (GB): avl: 23.25, used: 7.23, 25.9%)",
"Task: 1 (end) - PID: 1 CPU: 18.2%, RAM (GB): avl: 23.25, used: 7.23, 25.9%)",
"Task: 2 (end) - PID: 1 CPU: 18.4%, RAM (GB): avl: 23.26, used: 7.21, 25.8%)",
"Task: 8 (end) - PID: 1 CPU: 13.1%, RAM (GB): avl: 23.21, used: 7.27, 26.0%)",
"Task: 9 (end) - PID: 1 CPU: 16.7%, RAM (GB): avl: 23.21, used: 7.27, 26.0%)"
]
Let's see what happened here:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.447264
Phase 1 took: 0:00:00.706548
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:20.365066
Phase 2 took: 0:00:30.022417
--- Summary ---
Whole process took: 0:00:30.728965
This time, the average image download time was ~0.45 seconds. In the traditional approach, 10 images would have downloaded in about ~4.5 seconds. Meanwhile, we have now downloaded all of them in just ~0.71 seconds. Multithreading sped the process up by about 84%.
Phase 2, however, did not go so well. Generating one PDF file now takes about ~20.36 seconds. Remember, in the traditional approach it was nearly ten times less, at ~2.37 seconds. Why such a slowdown?
Let's go back to our cake-baking analogy for a moment. Phase 1 is gathering ingredients, where our eight-handed cook (--p1_max_workers=8) brings the ingredients for the cake and puts them on the table. Since we have 10 ingredients and 8 hands, the first "run" brings 8 ingredients from the fridge and the next run brings the remaining 2. The whole process is asynchronous, so the total time will not be exactly that of two fridge <-> table runs, but it will be close. If we used 10 workers, the total time would be barely longer than the time to fetch a single image.
By executing 10 image requests in parallel, we are in fact waiting only for the last image response. The request that returns last closes the thread queue, and our code continues executing.
The matter gets a bit more complicated in Phase 2, where we deal with more serious activities requiring more computer resources. Here the time to generate one PDF file averaged ~20.36 seconds. 20 seconds vs 2.4 in the classic approach is more than 8 times slower. We definitely don't want that :)
Why has the file generation time increased so much?
The answer is simple. In this test, we ran 8 PDF generation jobs in parallel, and each of them is quite processor-intensive. Generating our files takes a lot of resources, and the resources allocated to the asynchronously started threads do not change; starting 10 or more of them won't help. The individual threads simply limp along, waiting for the previously started CV-generation work to finish and free up some resources so their own code can run.
Back to our cake analogy: it doesn't matter how many cooks you put at the pie table. The table is a certain size. Even if you tell everyone at the same time, "start making pies", they will get stuck and have to wait for the table to clear before they can work.
For tasks requiring large amounts of computer resources, Python multithreading does not work well.
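We can verify this claim with a small benchmark: eight CPU-bound jobs run in eight threads take about as long as running them in a plain loop, because the GIL lets only one thread execute Python bytecode at a time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_job(_: int) -> int:
    # pure-Python number crunching; only the thread holding the GIL runs
    return sum(i * i for i in range(200_000))

def sequential() -> float:
    start = time.perf_counter()
    for i in range(8):
        cpu_job(i)
    return time.perf_counter() - start

def threaded() -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(cpu_job, range(8)))
    return time.perf_counter() - start

# The threaded version is no faster (often slightly slower) than the loop
print(f"sequential: {sequential():.2f}s, 8 threads: {threaded():.2f}s")
```

Swap `cpu_job` for a sleep and the picture reverses: for I/O-bound waiting, threads shine; for CPU-bound work, they only add overhead.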
Let's see how it looks with multiprocessing. But before that, take another look at the details of the individual tasks in phase 1 and phase 2.
For Phase 1 we have 8 tasks executing in parallel (--p1_max_workers=8); then task 6 ends and task 8 (waiting in the queue) starts immediately. The same goes for task 9, which starts after task 0 ends, as soon as a worker is free to run it.
The details look similar for Phase 2. Notice the CPU usage: at the very beginning it reaches a fairly high 66.7%, reflecting the moment when 8 threads are opened in parallel to generate the PDF files. Then the processor, with a relatively constant load of around 20%, closes the individual PDF-generating threads one by one.
Test 3 - Multiprocessing
docker run --rm --name mvm_blog monte_py --cvs=10 --details="Y" --p1_type="multiprocessing" --p2_type="multiprocessing" --p1_max_workers=8 --p2_max_workers=8
After running such a test, you will get a result that looks roughly like this:
######################
Number of CV's: 10
Test type:
- Phase 1: multiprocessing
- Phase 2: multiprocessing
Detailed report: Y
Max workers:
- Phase 1: 8
- Phase 2: 8
######################
--- Phase 1 - gathering data ---
Average request time: 0:00:00.451822
Phase 1 took: 0:00:00.815908
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.507320
Phase 2 took: 0:00:05.045943
--- Summary ---
Whole process took: 0:00:05.861851
--- Details Phase 1 ---
[
"Task: 1 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.88, used: 7.52, 27.1%)",
"Task: 0 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 6 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 3 (start) - PID: 10 CPU: 18.5%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 2 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 5 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 4 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 7 (start) - PID: 10 CPU: 18.4%, RAM (GB): avl: 22.87, used: 7.53, 27.1%)",
"Task: 6 (end) - PID: 10 CPU: 11.7%, RAM (GB): avl: 22.75, used: 7.65, 27.5%)",
"Task: 8 (start) - PID: 10 CPU: 33.3%, RAM (GB): avl: 22.75, used: 7.65, 27.5%)",
"Task: 0 (end) - PID: 10 CPU: 10.7%, RAM (GB): avl: 22.74, used: 7.66, 27.5%)",
"Task: 9 (start) - PID: 10 CPU: 25.0%, RAM (GB): avl: 22.74, used: 7.66, 27.5%)",
"Task: 1 (end) - PID: 10 CPU: 10.9%, RAM (GB): avl: 22.72, used: 7.68, 27.6%)",
"Task: 2 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 7 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 3 (end) - PID: 10 CPU: 11.2%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 5 (end) - PID: 10 CPU: 11.4%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 4 (end) - PID: 10 CPU: 11.3%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 9 (end) - PID: 10 CPU: 11.5%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)",
"Task: 8 (end) - PID: 10 CPU: 7.7%, RAM (GB): avl: 22.71, used: 7.69, 27.6%)"
]
--- Details Phase 2 ---
[
"Task: 0 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.78, used: 7.63, 27.4%)",
"Task: 1 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.78, used: 7.63, 27.4%)",
"Task: 2 (start) - PID: 10 CPU: 12.9%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 4 (start) - PID: 10 CPU: 12.8%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 6 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 3 (start) - PID: 10 CPU: 12.9%, RAM (GB): avl: 22.77, used: 7.63, 27.4%)",
"Task: 5 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 7 (start) - PID: 10 CPU: 13.0%, RAM (GB): avl: 22.76, used: 7.64, 27.4%)",
"Task: 2 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 8 (start) - PID: 10 CPU: 75.0%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 5 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.29, used: 8.11, 28.9%)",
"Task: 9 (start) - PID: 10 CPU: 100.0%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 3 (end) - PID: 10 CPU: 72.6%, RAM (GB): avl: 22.29, used: 8.11, 28.9%)",
"Task: 4 (end) - PID: 10 CPU: 72.5%, RAM (GB): avl: 22.25, used: 8.16, 29.1%)",
"Task: 1 (end) - PID: 10 CPU: 71.5%, RAM (GB): avl: 22.25, used: 8.16, 29.1%)",
"Task: 0 (end) - PID: 10 CPU: 71.4%, RAM (GB): avl: 22.26, used: 8.14, 29.0%)",
"Task: 7 (end) - PID: 10 CPU: 71.4%, RAM (GB): avl: 22.28, used: 8.12, 29.0%)",
"Task: 6 (end) - PID: 10 CPU: 71.3%, RAM (GB): avl: 22.3, used: 8.11, 28.9%)",
"Task: 9 (end) - PID: 10 CPU: 25.6%, RAM (GB): avl: 22.33, used: 8.07, 28.8%)",
"Task: 8 (end) - PID: 10 CPU: 25.7%, RAM (GB): avl: 22.32, used: 8.08, 28.8%)"
]
We received interesting results here:
--- Phase 1 - gathering data ---
Average request time: 0:00:00.451822
Phase 1 took: 0:00:00.815908
--- Phase 2 - generate PDF ---
Average pdf generation time: 0:00:02.507320
Phase 2 took: 0:00:05.045943
--- Summary ---
Whole process took: 0:00:05.861851
Phase 1 finished with more or less the same results as in multithreading. But we got a very big speedup for Phase 2, which took more than 30 seconds with multithreading. This time we generated 10 PDF files in about ~5.04 seconds. That's over 80% less time!
The details of how each phase runs are also interesting. As in multithreading, the image downloads and PDF generation run asynchronously. However, unlike multithreading, multiprocessing uses not 1 but 8 parallel processes. CPU utilization at the beginning is 12-13%, similar to the multithreading test, but then it rapidly rises to 70-100% and stays at that level almost until the very end, when the last two tasks use about 25% of the CPU.
You could see the pattern even better with more CV files, but I don't want to generate overly long logs here. Finally, let's run one more test.
Test 4 - Mix
Finally, let's combine the two approaches: multithreading for phase 1 and multiprocessing for phase 2. This time we will use more CVs and compare how separating the two phases of the code works out. We will also try to run as many processes as possible.