-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
methods: add a lock around method table changes #52997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f5e0c2e to
b060560
Compare
b060560 to
7cc03c5
Compare
7cc03c5 to
bbe0c93
Compare
Ensures that only one update can occur simultaneously, and that the acquire-release on the world-age will be sequenced after all of the invalidations to the caches. It should now be safe to add methods in parallel, concurrently with running code. However, there are still no locks here to ensure that only one module is deserialized at a time, which means that is still unsafe. As future work, all of the method inserting and (separately) the invalidation work could run in parallel after loading an image, since there is a fine-grained lock on each individual part of that, as well as the big lock so that only one module at a time can load methods. The min/max are always written behind a lock, but they are not usually read from behind a lock, so add relaxed atomic markers to make any analyzers happy.
bbe0c93 to
98a1269
Compare
|
Daily PkgEval started failing on GeoParquet.jl and SubpixelRegistration.jl; I haven't bisected or reduced but this PR seems related: https://s3.amazonaws.com/julialang-reports/nanosoldier/pkgeval/by_date/2024-01/27/GeoParquet.primary.log rr traces included |
|
adding this assert catches the mistake even earlier: Looks like there are 2, very close to being almost identical, CodeInstance objects for this MethodInstance, which is broken in this part of the code. And if we look this up in |
|
Looks like what happened is during serialization of Parquet2 there were indeed 2 distinct MethodInstance objects But during deserialization of it, these got merged into a single object. So basically, it is version of this bug: Though also possibly a case that deserialization should just accept and ignore this bug. |
|
This causes a test failure in Pluto.jl, I wonder why this was not caught by PkgEval? Running |
|
Does someone know why the failing Pluto tests were not caught by PkgEval? It looks like other packages were checked (#52997 (comment)). |
|
It can be a somewhat stochastic failure, since it depends on the specific order in which hashes are inserted into the dictionary, which can depend on the memory layout |
|
Our case was not stochastic, we did |
|
Mutating that field is undefined behavior because it changes internal state that corrupts the internal invariants regarding the program representation and inference, so it is not expected to be safe |
|
After this PR was merged, the Pluto tests failed with: but this was not caught by PkgEval and I'm wondering why. We fixed it afterwards in fonsp/Pluto.jl#2807 |
Ensures that only one update can occur simultaneously by adding a lock around it, and that the acquire-release on the world-age will be sequenced after all of the invalidations to the caches, by updating it last.
It should now be safe to add methods in parallel, concurrently with running code. However, there are still no locks here to ensure that only one module is deserialized at a time, which means that parallel require calls are still unsafe. (and also because the loading.jl code is currently a thread-safety disaster zone)
As future work, all of the method inserting and (separately) the validation/insertion work could run in parallel after loading an image, since there is a fine-grained lock on each individual part of that, as well as the big lock so that only one module at a time can load methods.
The min/max fields are always written behind a lock, but they are not usually read from behind a lock, so add relaxed atomic markers to all uses of that field to make any data-race analysis tools happy. These are never used in a comparison with a value greater than
jl_atomic_load_acquire(&jl_world_counter), and therefore the results achieved are consistent.