[Core][Parallelization] Making explicitly schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils
#12923
base: master
Conversation
… in `ParallelUtils`
schedule(dynamic) by default in OMP loops in ParallelUtils
…ith dynamic schedule without conflicting the GIL
Are you sure this is needed? Because this is C++ code, I don't think the GIL presents a problem here.
And now it is failing when running tests: https://github.com/KratosMultiphysics/Kratos/actions/runs/12275201329/job/34250231555?pr=12923. I will define it in CMake.
@loumalouomega dynamic scheduling is used today, for example in the builder and solver... without the need of releasing the GIL. Why is that different?
No idea, look at the outcome from the CI. We tested for some functions and the improvement is significant. This was added in a recent version of pybind11. pybind/pybind11#4246
Okay, looks like the last change fixed the issue.
@RiccardoRossi we can set it at runtime with this: https://www.openmp.org/spec-html/5.0/openmpse49.html and keep the current code and set the
Modified to be set at runtime, defaulting to dynamic.
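For context, a minimal sketch (not the PR's actual ParallelUtils code; the function and variable names are made up) of what schedule(runtime) buys: the loop itself stays fixed, while the effective schedule comes from OMP_SCHEDULE or omp_set_schedule(), with dynamic installed as the fallback default.

```cpp
// Minimal sketch, not the PR's actual code: schedule(runtime) defers the
// scheduling decision to the runtime, which reads OMP_SCHEDULE or the value
// installed via omp_set_schedule().
#include <omp.h>
#include <vector>

void ScaleEntries(std::vector<double>& rData)
{
    // Install dynamic scheduling as the default; a chunk size of 0 asks the
    // runtime to pick its own default chunk.
    omp_set_schedule(omp_sched_dynamic, 0);

    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < static_cast<int>(rData.size()); ++i) {
        rData[i] *= 2.0;
    }
}
```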
Changed the title from "schedule(dynamic) by default in OMP loops in ParallelUtils" to "schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils".
Okay, looks like the runtime works.
Right now if you have 4 tasks and 1000 items, you will do 250 on each... definitely suboptimal for dynamic scheduling...
The default is |
Okay, fixed that issue. BTW, now the banner includes the parallelism information:

[Kratos ASCII-art banner]
Multi-Physics 10.1."0"-core/explicit-schedule-parallel-utili-d7754dadfa-Release-x86_64
Compiled for GNU/Linux and Python3.10 with Clang-14.0
Compiled with threading and MPI support. Threading support with OpenMP, scheduling dynamic.
Maximum number of threads: 20.
Running without MPI.
…t variable for scheduling type
# Check if the environment variable OMP_SCHEDULE is defined
if(DEFINED ENV{OMP_SCHEDULE})
    # Set the already defined one
    set(KRATOS_OMP_SCHEDULE $ENV{OMP_SCHEDULE})
OMP_SCHEDULE is a runtime env variable; it is an extremely bad idea to use it as a compilation switch (IMO).
I understand, but the idea is the following: during compilation, OMP_SCHEDULE will set KRATOS_OMP_SCHEDULE, which will be used as the default if OMP_SCHEDULE is not actually defined at run time; but if OMP_SCHEDULE is defined, OMP_SCHEDULE will be taken into account. Do you understand me?
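A hedged sketch of that precedence (illustrative only; the helper name is invented and the parsing is deliberately simplified): the compile-time KRATOS_OMP_SCHEDULE acts as a fallback, while a runtime OMP_SCHEDULE, read directly by schedule(runtime), always wins.

```cpp
// Illustrative sketch, not the PR's actual code: use the compile-time default
// only when the user did not export OMP_SCHEDULE at run time.
#include <cstdlib>
#include <string>
#include <omp.h>

#ifndef KRATOS_OMP_SCHEDULE
#define KRATOS_OMP_SCHEDULE "dynamic" // default injected by CMake in this sketch
#endif

void ApplyDefaultScheduleIfUnset()
{
    // If OMP_SCHEDULE is set, schedule(runtime) already honours it; do nothing.
    if (std::getenv("OMP_SCHEDULE") != nullptr) return;

    // Otherwise install the compile-time default (very simplified parsing).
    const std::string kind = KRATOS_OMP_SCHEDULE;
    omp_set_schedule(kind == "static" ? omp_sched_static : omp_sched_dynamic, 0);
}
```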
I agree with the chunk size argument by @RiccardoRossi. My point (in #12924) was to first give a way to define dynamic scheduling in our for-each loop. This would let us fine-tune our parallelization in the many cases where dynamic would be better, or at least not worse. As for having dynamic as the default, I now understand that it would not work and that chunk size would be an important blocker...
To clarify, it is NOT difficult to change the chunking algorithm (I guess it will be about 20 lines of code); I am simply saying that it needs to be done separately from the other changes.
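To illustrate the chunking point (a hypothetical helper, not Kratos code): if 1000 items are split into exactly one block per thread, dynamic scheduling has nothing left to balance; producing several smaller chunks per thread is what lets schedule(dynamic) actually help.

```cpp
// Hypothetical sketch: pick a chunk size that yields several chunks per thread
// instead of one big block per thread.
#include <algorithm>
#include <cstddef>
#include <omp.h>

std::size_t ChooseChunkSize(std::size_t NumItems, std::size_t ChunksPerThread = 8)
{
    const std::size_t num_threads = static_cast<std::size_t>(omp_get_max_threads());
    // With 1000 items, 4 threads and 8 chunks per thread this gives ~32-item
    // chunks instead of four 250-item blocks.
    const std::size_t num_chunks = num_threads * ChunksPerThread;
    return std::max<std::size_t>(1, (NumItems + num_chunks - 1) / num_chunks);
}
```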
…ock-based operations for improved performance and clarity
…uling for improved performance
Can't this be changed with an environment variable?
I think I added the option for that.
I have been thinking about this PR for a while, and in my PERSONAL opinion (not yet brought up in @KratosMultiphysics/technical-committee) it is finally time to switch to using std::for_each (and the other facilities described in C++17). My point is that with the recent upgrade to CentOS devtoolset 11 and the changes in ManyLinux2014 (which we use to package our releases) we are effectively setting our compiler baseline to gcc11. According to cppreference:
My point is that we could finally rely on high-quality implementations instead of the ones we can make ourselves...
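For illustration, a minimal sketch (assuming a libstdc++ toolchain with TBB available; not an agreed design, and the function name is made up) of what such a loop would look like with the C++17 parallel algorithms:

```cpp
// Minimal C++17 sketch: std::for_each with an execution policy instead of a
// hand-written OpenMP loop. With libstdc++ this needs TBB available (-ltbb).
#include <algorithm>
#include <execution>
#include <vector>

void ScaleEntriesStd(std::vector<double>& rData)
{
    std::for_each(std::execution::par, rData.begin(), rData.end(),
                  [](double& rValue) { rValue *= 2.0; });
}
```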
I agree with you. This is the current status of OpenMP when limited because of MSVC: Jokes aside, yes, I think we should move to C++ parallelism. I started a branch long ago, but it didn't work: https://github.com/KratosMultiphysics/Kratos/tree/core/cxx17-parutils
I did lots of work on pushing for C++-based parallelization; IMO only a few things are left. There are a few important parallel loops, like in the B&S, that probably need a manual reimplementation, and "minor" things like... In a nutshell, I believe it is possible to switch if several devs push for it.
To clarify, my point is that I would not further expand the parallel utilities but rather use std parallelism for what goes beyond that (for example the assembly phase). Now this also means, for example, forgetting things like "dynamic scheduling" and trusting that the lib will handle those things for us (I understand this is a harder call). Having said this, we did a very preliminary test with @roigcarlo and gcc and clang both compile std parallel examples.
I agree with you. IMHO the parallel utilities could be moved to an independent library, as they are quite straightforward to use and could be useful to many projects outside this one. In addition, improving the library as a standalone would be faster and easier IMHO. Also easier to benchmark and test. I also did some experiments with std parallel in the utilities; we can try to see if we can reuse some of my work.
I forgot an important issue that we previously considered a blocker: the parallel STL does not allow controlling the number of threads. I discussed this with ChatGPT, and I got the following (to be verified):
"Unable to load conversation 68369519-1f8c-800b-bef7-d34ccc7faad8" |
https://g.co/gemini/share/718135e95327 Gemini also says that it is not possible with the STL but it is possible with others like TBB, which is very similar to the STL. Maybe we should think about this.
This:

You're absolutely right to seek control over thread usage when using std::execution::par (not par_seq, which doesn't exist; maybe you meant std::execution::par or std::execution::seq). However, C++17's Parallel STL (PSTL) provides no standard mechanism to control the number of threads; it is implementation-defined and not controlled like OpenMP via OMP_NUM_THREADS. But you're not out of luck. If you're using Intel's PSTL or oneTBB (which many implementations use under the hood), you can set the number of threads used by TBB by including <tbb/global_control.h> and setting the thread count before calling the algorithm.

Important notes and how to check your implementation: LLVM libc++. Alternative: roll your own thread pool. If none of the above satisfies you, consider using a real task scheduler (e.g. TBB directly, std::async, std::jthread, or a custom pool); std::execution::par is too limited for fine-grained control. Would you like me to help you determine which backend your system uses, or show how to switch to Intel's PSTL with TBB for full control?

Excellent follow-up. Under MSVC (Microsoft Visual C++ STL), things are slightly different, and still a bit limited, but you can control the thread pool to a degree. Here's the situation: to set the number of threads in MSVC STL parallel algorithms there is a working example using the Concurrency Runtime. Make sure you compile with /std:c++17 and link to concrt.lib if needed (though it's usually automatic). That is the bottom line for MSVC. Would you like me to provide a CMake configuration or Visual Studio project snippet to go along with this?
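A hedged reconstruction of the TBB workaround described above (to be verified per toolchain, as the comment itself notes; the function name is invented): with a TBB-backed PSTL, a tbb::global_control object caps the worker threads that std::execution::par may use while it is alive.

```cpp
// Sketch only: limit the TBB scheduler (and therefore a TBB-backed parallel
// STL) to NumThreads for the duration of this scope. Not a standard C++
// guarantee; it depends on the PSTL backend actually using TBB.
#include <algorithm>
#include <cstddef>
#include <execution>
#include <vector>
#include <tbb/global_control.h>

void ScaleWithLimitedThreads(std::vector<double>& rData, std::size_t NumThreads)
{
    tbb::global_control limit(tbb::global_control::max_allowed_parallelism, NumThreads);

    std::for_each(std::execution::par, rData.begin(), rData.end(),
                  [](double& rValue) { rValue *= 2.0; });
}
```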
I don't like the idea of having different things for different OSs. I would rather use TBB instead (especially if it is the de facto standard under the hood).
I don't dislike the idea of using TBB... but what about ARM machines? Will that work fine there?
Intel TBB (now part of Intel oneAPI Threading Building Blocks, or oneTBB) is compatible with ARM processors. While Intel TBB was originally developed by Intel for Intel processors, it has evolved to be highly portable. Modern versions, especially oneTBB, support a variety of architectures, including ARM.
Did you check that yourself? Sounds like the LLM might be hallucinating 😅 I would double-check; only last week I had issues with Intel (MPI) on a non-Intel CPU (AMD).
https://archlinuxarm.org/packages/aarch64/onetbb -> aarch64 is ARM (I checked it before sending it, but Gemini's text was better structured than anything I could write, so I copy-pasted).
Just to report here that apparently in P1000 everyone asked for control over resources. The partial answer to this is P2300, which made it in as one of the major changes in C++26. Unfortunately, for dynamic schedulers we will have to wait for C++29...
📝 Description
Making explicitly `schedule(runtime)`, with `dynamic` by default, in OMP loops in `ParallelUtils`. I still need to add a benchmark and actually compare that it is faster. Also updates the banner with the parallelism information. Fixes #12924
🆕 Changelog
`schedule(dynamic)` by default in OMP loops in `ParallelUtils`: parallel_utilities#12942