-
Notifications
You must be signed in to change notification settings - Fork 5.3k
A few fixes in the threadpool semaphore. Unify Windows/Unix implementaiton of LIFO policy. #123921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Tagging subscribers to this area: @agocke, @VSadov |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses performance regressions in the threadpool semaphore (issue #123159) and unifies the Windows/Unix implementation of the LIFO (Last-In-First-Out) policy for threadpool worker thread management.
Changes:
- Introduces a unified
LowLevelThreadBlockerclass that uses OS-provided compare-and-wait APIs (futex on Linux, WaitOnAddress on Windows) for efficient thread blocking, with a fallback to monitor-based implementation for other platforms - Refactors
LowLevelLifoSemaphoreto use the new blocker infrastructure, removes platform-specific Windows/Unix implementations, and improves spinning heuristics based on CPU availability - Adds native futex support for Linux through syscalls and Windows WaitOnAddress API interop
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/native/libs/System.Native/pal_threading.h | Adds declarations for Linux futex operations |
| src/native/libs/System.Native/pal_threading.c | Implements futex wait/wake operations for Linux using syscalls |
| src/native/libs/System.Native/entrypoints.c | Registers new futex entrypoints for Linux |
| src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs | Fixes spelling, removes spin count configuration, passes active thread count to semaphore |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelThreadBlocker.cs | New class providing portable thread blocking using futex/WaitOnAddress or monitor fallback |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs | Major refactoring to use LowLevelThreadBlocker, implements LIFO queue with pending signals, improves spin heuristics |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Windows.cs | Deleted - functionality moved to unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Unix.cs | Deleted - functionality moved to unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Windows.cs | New file providing Windows WaitOnAddress API wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Unix.cs | New file providing Linux futex wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/Backoff.cs | Modified to return spin count and skip spinning on first attempt |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Updates project to include new files and remove deleted platform-specific files |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.WaitOnAddress.cs | New interop declarations for Windows WaitOnAddress and WakeByAddressSingle APIs |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.CriticalSection.cs | Adds SuppressGCTransition attribute to LeaveCriticalSection |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.ConditionVariable.cs | Adds SuppressGCTransition attribute to WakeConditionVariable |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.LowLevelMonitor.cs | Adds SuppressGCTransition attributes to Release and Signal_Release |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.Futex.cs | New interop declarations for Linux futex operations |
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelThreadBlocker.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs
Outdated
Show resolved
Hide resolved
| <Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelLifoSemaphore.Unix.cs" Condition="'$(TargetsUnix)' == 'true' or '$(TargetsBrowser)' == 'true' or '$(TargetsWasi)' == 'true'" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelThreadBlocker.cs" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelFutex.Windows.cs" Condition="'$(TargetsWindows)' == 'true'" /> | ||
| <Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelFutex.Unix.cs" Condition="'$(TargetsUnix)' == 'true'" /> |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LowLevelFutex.Unix.cs file is conditionally compiled only for TARGET_LINUX, but the project file includes it for all Unix targets. This means non-Linux Unix platforms (like macOS, FreeBSD, etc.) will include an effectively empty file. While this doesn't cause a build error (since LowLevelThreadBlocker falls back to USE_MONITOR on non-Linux Unix), it's cleaner to only include this file when TARGET_LINUX is true. Consider changing the condition to match the file's actual content: Condition="'$(TargetLinux)' == 'true'"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expected that most other Unix-like platforms will eventually provide an implementation and it will go into LowLevelFutex.Unix.cs.
I.E. FreeBSD certainly has futexes, but adding and testing that should probably be a separate change.
I think it is easier to use #if in the code than in the .projitems, but I could be easily convinced otherwise.
We have precedents for either pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.
| _maxSpinCount = AppContextConfigHelper.GetInt32ComPlusOrDotNetConfig( | ||
| "System.Threading.ThreadPool.UnfairSemaphoreSpinLimit", | ||
| "ThreadPool_UnfairSemaphoreSpinLimit", | ||
| DefaultSemaphoreSpinCountLimit, | ||
| false); |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing validation for the configured spin count: The _maxSpinCount is read from AppContext configuration without bounds checking beyond the default value. A user could potentially configure a very large value (e.g., int.MaxValue) which could cause integer overflow in line 116's calculation _maxSpinCount * 2 / _procCount or result in excessive spinning that degrades performance. Consider adding validation to clamp the configured value to a reasonable maximum (e.g., 1,000,000) or adding overflow checks in the calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a public API, but validating the input seems like a good idea.
We do validate that the number is not negative.
| } | ||
|
|
||
| // We timed out, but our waiter is already popped. Someone is waking us. | ||
| // We can't leave or the wake could be lost, let's wait again. |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential race condition in timeout handling: In WaitCore, when a timeout occurs at line 338-349, the code checks if the blocker can be removed from the stack. If TryRemove returns false (line 340-343), it means someone is about to wake us, so we wait again with a 10ms timeout. However, there's a subtle race: after TryRemove fails and before we call TimedWait again (line 338), the wake call (line 381 in WakeOne) might have already completed. In this case, we would block for another 10ms even though we've already been woken. While this results in a spurious 10ms delay rather than correctness issue, consider whether this could be optimized by checking the wake state before waiting again.
| // We can't leave or the wake could be lost, let's wait again. | |
| // We can't leave or the wake could be lost, let's wait again. | |
| // Before blocking for longer, check if the wake has already completed. | |
| if (blocker.TimedWait(0)) | |
| { | |
| break; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A completed WakeOne will leave blocker in a signaled state and TimedWait will take the fast path. Either it will not wait on the futex at all or the futex wait will immediately return, since the state no longer matches the "undesirable" value.
The typical timeout here is 10 seconds, so either way an extra 10 msec would make no difference.
| public LowLevelLifoSemaphore(int maximumSignalCount, Action onWait) | ||
| { | ||
| Debug.Assert(maximumSignalCount > 0); | ||
| Debug.Assert(maximumSignalCount <= short.MaxValue); | ||
| Debug.Assert(spinCount >= 0); | ||
|
|
||
| _separated = default; | ||
| _maximumSignalCount = maximumSignalCount; | ||
| _spinCount = spinCount; | ||
| _onWait = onWait; | ||
| _procCount = Environment.ProcessorCount; | ||
|
|
||
| Create(maximumSignalCount); | ||
| _maxSpinCount = AppContextConfigHelper.GetInt32ComPlusOrDotNetConfig( | ||
| "System.Threading.ThreadPool.UnfairSemaphoreSpinLimit", | ||
| "ThreadPool_UnfairSemaphoreSpinLimit", | ||
| DefaultSemaphoreSpinCountLimit, | ||
| false); | ||
| } |
Copilot
AI
Feb 3, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removal of IDisposable interface: The LowLevelLifoSemaphore class previously implemented IDisposable (visible in the deleted Windows and Unix implementations), but the new unified implementation in LowLevelLifoSemaphore.cs does not implement IDisposable and has no Dispose method. The old Windows implementation had a Dispose method that closed the IO Completion Port handle, and the Unix implementation had an empty Dispose. The new implementation uses LowLevelThreadBlocker instances that are IDisposable and pooled via ThreadStatic storage (t_blocker). However, these blockers are never explicitly disposed. While they have finalizers, consider whether the LowLevelLifoSemaphore itself should implement IDisposable to properly clean up the blocker stack when the semaphore is no longer needed. Currently, s_semaphore in PortableThreadPool.WorkerThread.cs is a static readonly field that lives for the entire process lifetime, so disposal may not be an issue in practice, but the API design could be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I foresee other uses where the blocker can be deterministically disposed (i.e. if used for thread parking in low level lock-like things).
Here blockers belong to worker threads and finalizing them upon thread destruction seems both safe and rare enough that designing eager disposal seems unnecessary.
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs
Show resolved
Hide resolved
|
One test that was affected by #123159 is The test involved one thread renting an array, mutating it, passing to another thread via one-element long channel and so on. Since the scenario needs to wait on a task only occasionally, depending on environment (CPU speed vs. memor speed, it guess..) it varies how frequently it needs to run a task, but generally the test is sensitive to threadpool spinning long enough to pick up the task without waking a thread. The results after this PR, vs baseline: === Linux x64 (azure VM, so it is what it is, but the test has little of guest/host interactions)
|
…aiton of LIFO policy.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
|
Same tests on Windows: BenchmarkDotNet v0.14.1-nightly.20250107.205, Windows 11 (10.0.26200.7623) === baseline:
=== this PR:
|
|
TE benchmarks seem to favor the change. Using command: === Baseline: === This PR: |
Re: #123159
Changes:
Backoff.Exponential(0).Embarrassing bug.
To get spin count for an iteration we generate pseudorandom
uintand do>> (32 - attempt), but C# masks the shift operand to 31, thus whenattempt==0we do not shift at all, and it results in a large random spin count.That caused many noisy results and interestingly some improvements (in scenarios that benefit from very long spins).
Once we are done spinning, we block threads and when workers are needed again wake them in LIFO order.
Unix WaitSubsystem is pretty heavy for these needs. It supports Interruptible waits, waiting on multiple objects, etc... None of that is interesting here. Most calls into the subsystem take a global process-wide lock which can contend under load with other uses, or a worker-waking threads may contend with the workers going to sleep, etc...
Windows used an opaque
GetQueuedCompletionStatusfor the side effect of releasing threads in LIFO order when completion is posted, with unknown overheads and interactions, even though typically it is more efficient than Unix WaitSubsystem.The portable implementation seems to be faster than either of the platform-specific ones.
(measured by disabling spinning and running a few latency-sensitive benchmarks).
The portable implementation is also easier to reason about and to debug anomalies.
Spinning in threadpool is very tricky and spinning benefits differ greatly between scenarios. For some scenarios the longer the spin the better. There are scenarios that like when the threadpool releases cores quickly once it sees no work. No preset fixed spin count is going to be good for everything.
Adaptive approach appears to be necessary to improve some scenarios without regressing many others.
We can further improve the heuristic, if there are more ideas.