Conversation


@VSadov VSadov commented Feb 2, 2026

Re: #123159

Changes:

  • Correctly handle Backoff.Exponential(0).

Embarrassing bug.
To get the spin count for an iteration we generate a pseudorandom uint and shift it right by (32 - attempt), but C# masks a 32-bit shift count to its low 5 bits, so when attempt == 0 the shift by 32 silently becomes a shift by 0: we do not shift at all, and the result is a large random spin count.
That caused many noisy results and interestingly some improvements (in scenarios that benefit from very long spins).
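
For illustration, the masking behavior and the fix's "skip spinning on the first attempt" behavior can be sketched in a few standalone lines (not the PR's code):

uint random = 0xDEADBEEF;                   // stand-in for the pseudorandom value
int attempt = 0;

// C# masks a 32-bit shift count with 0x1F, so a shift by (32 - 0) == 32 becomes a shift by 0.
uint buggySpins = random >> (32 - attempt);                       // 3735928559: the full random value, a huge spin count
uint fixedSpins = attempt == 0 ? 0u : random >> (32 - attempt);   // 0: no spinning on the first attempt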

  • Unified implementation of LIFO policy with lightweight minimal implementation of LIFO waiting.

Once we are done spinning, we block threads, and when workers are needed again we wake them in LIFO order.

The Unix WaitSubsystem is pretty heavy for these needs. It supports interruptible waits, waiting on multiple objects, etc. None of that is interesting here. Most calls into the subsystem take a global process-wide lock, which can contend under load with other users of the subsystem, a thread waking workers can contend with workers going to sleep, and so on.

Windows used the opaque GetQueuedCompletionStatus for its side effect of releasing threads in LIFO order when a completion is posted, with unknown overheads and interactions, although it is typically more efficient than the Unix WaitSubsystem.

The portable implementation seems to be faster than either of the platform-specific ones (measured by disabling spinning and running a few latency-sensitive benchmarks).

The portable implementation is also easier to reason about, and anomalies are easier to debug.
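
To make the shape of the portable approach concrete, here is a deliberately simplified sketch of LIFO waiting. It uses ConcurrentStack and SemaphoreSlim in place of the PR's lock-free stack and futex/WaitOnAddress-backed blockers, and it ignores timeouts and the wake-vs-timeout races the real code has to handle:

using System.Collections.Concurrent;
using System.Threading;

internal sealed class SimplifiedLifoWaiterStack
{
    // The most recently parked waiter sits on top, so WakeOne releases the "hottest" thread first.
    private readonly ConcurrentStack<SemaphoreSlim> _waiters = new ConcurrentStack<SemaphoreSlim>();

    // Each worker thread reuses one blocker object for its lifetime.
    [ThreadStatic]
    private static SemaphoreSlim? t_blocker;

    public void Wait()
    {
        SemaphoreSlim blocker = t_blocker ??= new SemaphoreSlim(0, 1);
        _waiters.Push(blocker);   // announce ourselves as the newest waiter
        blocker.Wait();           // park until a waker signals this particular blocker
    }

    public bool WakeOne()
    {
        if (_waiters.TryPop(out SemaphoreSlim? blocker))
        {
            blocker.Release();    // LIFO: the last thread to go to sleep wakes first
            return true;
        }
        return false;             // no one is parked
    }
}

The LIFO order means the most recently parked worker, which is likely still warm in cache, is released first, while long-idle workers stay asleep.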

  • Adaptive spinning in the threadpool based on estimates of CPU core availability.

Spinning in the threadpool is very tricky, and its benefits differ greatly between scenarios. For some scenarios, the longer the spin the better; others do best when the threadpool releases cores quickly once it sees no work. No fixed preset spin count is going to be good for everything.

An adaptive approach appears to be necessary to improve some scenarios without regressing many others.
We can further improve the heuristic if there are more ideas.
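
As an illustration only, one plausible shape of a core-availability-based heuristic is sketched below; the actual formula, inputs, and thresholds in this PR differ:

using System;

internal static class SpinHeuristicSketch
{
    // Illustrative only: spin longer when cores appear idle, and give up quickly when the
    // active worker threads already cover the machine, since spinning then just steals CPU
    // from threads that have real work to do.
    internal static int ChooseSpinCount(int maxSpinCount, int activeThreadCount, int processorCount)
    {
        int estimatedFreeCores = Math.Max(0, processorCount - activeThreadCount);
        if (estimatedFreeCores == 0)
            return 0;  // machine looks fully busy: block immediately instead of spinning

        // scale toward the configured maximum as more cores look available
        return (int)((long)maxSpinCount * estimatedFreeCores / processorCount);
    }
}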

@dotnet-policy-service

Tagging subscribers to this area: @agocke, @VSadov
See info in area-owners.md if you want to be subscribed.

Copilot AI left a comment

Pull request overview

This PR addresses performance regressions in the threadpool semaphore (issue #123159) and unifies the Windows/Unix implementation of the LIFO (Last-In-First-Out) policy for threadpool worker thread management.

Changes:

  • Introduces a unified LowLevelThreadBlocker class that uses OS-provided compare-and-wait APIs (futex on Linux, WaitOnAddress on Windows) for efficient thread blocking, with a fallback to a monitor-based implementation on other platforms
  • Refactors LowLevelLifoSemaphore to use the new blocker infrastructure, removes platform-specific Windows/Unix implementations, and improves spinning heuristics based on CPU availability
  • Adds native futex support for Linux through syscalls and Windows WaitOnAddress API interop

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Summary per file:

| File | Description |
| src/native/libs/System.Native/pal_threading.h | Adds declarations for Linux futex operations |
| src/native/libs/System.Native/pal_threading.c | Implements futex wait/wake operations for Linux using syscalls |
| src/native/libs/System.Native/entrypoints.c | Registers new futex entrypoints for Linux |
| src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs | Fixes spelling, removes spin count configuration, passes active thread count to semaphore |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelThreadBlocker.cs | New class providing portable thread blocking using futex/WaitOnAddress or a monitor fallback |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs | Major refactoring to use LowLevelThreadBlocker, implements LIFO queue with pending signals, improves spin heuristics |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Windows.cs | Deleted - functionality moved to unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Unix.cs | Deleted - functionality moved to unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Windows.cs | New file providing Windows WaitOnAddress API wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Unix.cs | New file providing Linux futex wrapper |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Updates project to include new files and remove deleted platform-specific files |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.WaitOnAddress.cs | New interop declarations for the Windows WaitOnAddress and WakeByAddressSingle APIs (see the sketch after this table) |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.CriticalSection.cs | Adds SuppressGCTransition attribute to LeaveCriticalSection |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.ConditionVariable.cs | Adds SuppressGCTransition attribute to WakeConditionVariable |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.LowLevelMonitor.cs | Adds SuppressGCTransition attributes to Release and Signal_Release |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.Futex.cs | New interop declarations for Linux futex operations |
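
For reference, a hedged sketch of what WaitOnAddress/WakeByAddressSingle interop declarations can look like; the PR's actual file may differ in partial-class layout, library name, and attributes:

using System.Runtime.InteropServices;

internal static partial class Interop
{
    internal static partial class Kernel32
    {
        // Compare-and-wait: blocks while the value at 'address' still equals the value at
        // 'compareAddress'; returns false on timeout. This is the futex-like primitive on Windows.
        [DllImport("api-ms-win-core-synch-l1-2-0.dll", SetLastError = true)]
        [return: MarshalAs(UnmanagedType.Bool)]
        internal static extern unsafe bool WaitOnAddress(
            void* address, void* compareAddress, nuint addressSize, uint millisecondsTimeout);

        // Wakes a single thread blocked in WaitOnAddress on the same address.
        [DllImport("api-ms-win-core-synch-l1-2-0.dll")]
        internal static extern unsafe void WakeByAddressSingle(void* address);
    }
}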

Copilot AI review requested due to automatic review settings February 3, 2026 00:16
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

<Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelLifoSemaphore.Unix.cs" Condition="'$(TargetsUnix)' == 'true' or '$(TargetsBrowser)' == 'true' or '$(TargetsWasi)' == 'true'" />
<Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelThreadBlocker.cs" />
<Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelFutex.Windows.cs" Condition="'$(TargetsWindows)' == 'true'" />
<Compile Include="$(MSBuildThisFileDirectory)System\Threading\LowLevelFutex.Unix.cs" Condition="'$(TargetsUnix)' == 'true'" />
Copilot AI commented Feb 3, 2026

The LowLevelFutex.Unix.cs file is conditionally compiled only for TARGET_LINUX, but the project file includes it for all Unix targets. This means non-Linux Unix platforms (macOS, FreeBSD, etc.) will include an effectively empty file. While this doesn't cause a build error (since LowLevelThreadBlocker falls back to USE_MONITOR on non-Linux Unix), it's cleaner to only include this file when targeting Linux. Consider changing the condition to match the file's actual content: Condition="'$(TargetsLinux)' == 'true'"

@VSadov VSadov (Member, Author) commented Feb 3, 2026

It is expected that most other Unix-like platforms will eventually provide an implementation and it will go into LowLevelFutex.Unix.cs.
For example, FreeBSD certainly has futexes, but adding and testing that should probably be a separate change.

I think it is easier to use #if in the code than in the .projitems, but I could be easily convinced otherwise.
We have precedents for either pattern.

Copilot AI review requested due to automatic review settings February 3, 2026 01:35
Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Comment on lines +78 to +82
_maxSpinCount = AppContextConfigHelper.GetInt32ComPlusOrDotNetConfig(
"System.Threading.ThreadPool.UnfairSemaphoreSpinLimit",
"ThreadPool_UnfairSemaphoreSpinLimit",
DefaultSemaphoreSpinCountLimit,
false);
Copilot AI commented Feb 3, 2026

Missing validation for the configured spin count: The _maxSpinCount is read from AppContext configuration without bounds checking beyond the default value. A user could potentially configure a very large value (e.g., int.MaxValue) which could cause integer overflow in line 116's calculation _maxSpinCount * 2 / _procCount or result in excessive spinning that degrades performance. Consider adding validation to clamp the configured value to a reasonable maximum (e.g., 1,000,000) or adding overflow checks in the calculation.

@VSadov VSadov (Member, Author) commented Feb 3, 2026

Not a public API, but validating the input seems like a good idea.
We do validate that the number is not negative.
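
A minimal sketch of the kind of clamping being discussed; the bound and the helper itself are hypothetical, not the PR's code:

using System;

internal static class SpinConfigSketch
{
    // Hypothetical clamping helper: reject negative values and cap very large ones so that
    // later arithmetic on the spin limit cannot overflow or produce excessive spinning.
    internal static int ClampSpinLimit(int configured, int defaultValue)
    {
        const int MaxReasonableSpinLimit = 1_000_000; // illustrative upper bound

        if (configured < 0)
            return defaultValue;   // negative values fall back to the default

        return Math.Min(configured, MaxReasonableSpinLimit);
    }
}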

}

// We timed out, but our waiter is already popped. Someone is waking us.
// We can't leave or the wake could be lost, let's wait again.
Copilot AI commented Feb 3, 2026

Potential race condition in timeout handling: In WaitCore, when a timeout occurs at line 338-349, the code checks if the blocker can be removed from the stack. If TryRemove returns false (line 340-343), it means someone is about to wake us, so we wait again with a 10ms timeout. However, there's a subtle race: after TryRemove fails and before we call TimedWait again (line 338), the wake call (line 381 in WakeOne) might have already completed. In this case, we would block for another 10ms even though we've already been woken. While this results in a spurious 10ms delay rather than correctness issue, consider whether this could be optimized by checking the wake state before waiting again.

Suggested change:

// We can't leave or the wake could be lost, let's wait again.
// Before blocking for longer, check if the wake has already completed.
if (blocker.TimedWait(0))
{
    break;
}

@VSadov VSadov (Member, Author) commented Feb 3, 2026

A completed WakeOne will leave blocker in a signaled state and TimedWait will take the fast path. Either it will not wait on the futex at all or the futex wait will immediately return, since the state no longer matches the "undesirable" value.

The typical timeout here is 10 seconds, so either way an extra 10 msec would make no difference.
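
A stripped-down illustration of that behavior; the names and the Monitor-based implementation below are assumptions for illustration, while the real blocker parks on futex/WaitOnAddress:

using System.Threading;

// If Wake already ran, the blocker is left signaled and TimedWait returns without blocking;
// otherwise TimedWait blocks for at most timeoutMs and then re-checks the state.
internal sealed class TinyBlocker
{
    private readonly object _lock = new object();
    private bool _signaled;

    public bool TimedWait(int timeoutMs)
    {
        lock (_lock)
        {
            if (!_signaled)
                Monitor.Wait(_lock, timeoutMs);   // woken early by Wake, or times out

            bool wasSignaled = _signaled;
            _signaled = false;                    // consume the signal
            return wasSignaled;
        }
    }

    public void Wake()
    {
        lock (_lock)
        {
            _signaled = true;                     // leaves the blocker in a signaled state
            Monitor.Pulse(_lock);
        }
    }
}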

Comment on lines +68 to 83
public LowLevelLifoSemaphore(int maximumSignalCount, Action onWait)
{
Debug.Assert(maximumSignalCount > 0);
Debug.Assert(maximumSignalCount <= short.MaxValue);
Debug.Assert(spinCount >= 0);

_separated = default;
_maximumSignalCount = maximumSignalCount;
_spinCount = spinCount;
_onWait = onWait;
_procCount = Environment.ProcessorCount;

Create(maximumSignalCount);
_maxSpinCount = AppContextConfigHelper.GetInt32ComPlusOrDotNetConfig(
"System.Threading.ThreadPool.UnfairSemaphoreSpinLimit",
"ThreadPool_UnfairSemaphoreSpinLimit",
DefaultSemaphoreSpinCountLimit,
false);
}
Copilot AI commented Feb 3, 2026

Removal of IDisposable interface: The LowLevelLifoSemaphore class previously implemented IDisposable (visible in the deleted Windows and Unix implementations), but the new unified implementation in LowLevelLifoSemaphore.cs does not implement IDisposable and has no Dispose method. The old Windows implementation had a Dispose method that closed the IO Completion Port handle, and the Unix implementation had an empty Dispose. The new implementation uses LowLevelThreadBlocker instances that are IDisposable and pooled via ThreadStatic storage (t_blocker). However, these blockers are never explicitly disposed. While they have finalizers, consider whether the LowLevelLifoSemaphore itself should implement IDisposable to properly clean up the blocker stack when the semaphore is no longer needed. Currently, s_semaphore in PortableThreadPool.WorkerThread.cs is a static readonly field that lives for the entire process lifetime, so disposal may not be an issue in practice, but the API design could be clearer.

@VSadov VSadov (Member, Author) commented Feb 3, 2026

I foresee other uses where the blocker can be deterministically disposed (e.g. if used for thread parking in low-level lock-like constructs).

Here blockers belong to worker threads and finalizing them upon thread destruction seems both safe and rare enough that designing eager disposal seems unnecessary.


VSadov commented Feb 3, 2026

One test that was affected by #123159 is
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer

The test involves one thread renting an array, mutating it, passing it to another thread via a one-element channel, and so on.
In particular, there is a scenario where both sides wait synchronously on an async result. There is an occasional race condition where IsCompleted on the async result returns false, but the subsequent OnCompleted sees a completed async operation. Since it can't attach a continuation to an already completed result, it posts a work item to the threadpool and attaches the continuation to that. Then one of the threads playing buffer ping-pong effectively waits on the completion of that work item.

Since the scenario needs to wait on a task only occasionally, how often it has to do so varies with the environment (CPU speed vs. memory speed, I guess), but generally the test is sensitive to the threadpool spinning long enough to pick up the task without waking a thread.

The results after this PR, vs baseline:

=== Linux x64 (Azure VM, so it is what it is, but the test has little guest/host interaction)

  • baseline:
| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| ProducerConsumer | 4096 | False | False | False | 1.787 us | 0.0995 us | 0.1146 us | 1.800 us | 1.5178 us | 1.997 us | 0.0050 | 84 B |
| ProducerConsumer | 4096 | False | False | True | 1.836 us | 0.1398 us | 0.1610 us | 1.859 us | 1.3978 us | 2.057 us | - | 82 B |
| ProducerConsumer | 4096 | False | True | False | 1.035 us | 0.0642 us | 0.0739 us | 1.052 us | 0.7293 us | 1.074 us | - | - |
| ProducerConsumer | 4096 | False | True | True | 1.095 us | 0.0601 us | 0.0692 us | 1.122 us | 0.8213 us | 1.135 us | - | - |
| ProducerConsumer | 4096 | True | False | False | 1.927 us | 0.0647 us | 0.0692 us | 1.939 us | 1.7744 us | 2.025 us | - | 6 B |
| ProducerConsumer | 4096 | True | False | True | 1.952 us | 0.0584 us | 0.0672 us | 1.952 us | 1.7911 us | 2.045 us | - | 2 B |
| ProducerConsumer | 4096 | True | True | False | 1.494 us | 0.0366 us | 0.0422 us | 1.491 us | 1.4217 us | 1.575 us | - | - |
| ProducerConsumer | 4096 | True | True | True | 1.875 us | 0.0916 us | 0.1055 us | 1.879 us | 1.6830 us | 2.075 us | - | - |
  • after the PR:
    (mostly improvements, though some scenarios really like long spins...)
| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| ProducerConsumer | 4096 | False | False | False | 1,706.1 ns | 122.59 ns | 141.18 ns | 1,709.4 ns | 1,486.7 ns | 1,921.7 ns | 0.0050 | 83 B |
| ProducerConsumer | 4096 | False | False | True | 1,737.3 ns | 134.77 ns | 155.20 ns | 1,723.8 ns | 1,558.4 ns | 2,107.2 ns | - | 83 B |
| ProducerConsumer | 4096 | False | True | False | 855.6 ns | 25.55 ns | 28.40 ns | 860.4 ns | 740.1 ns | 869.3 ns | - | - |
| ProducerConsumer | 4096 | False | True | True | 1,035.9 ns | 21.40 ns | 23.79 ns | 1,038.6 ns | 965.3 ns | 1,065.8 ns | - | - |
| ProducerConsumer | 4096 | True | False | False | 1,450.8 ns | 53.79 ns | 57.55 ns | 1,446.4 ns | 1,326.9 ns | 1,572.2 ns | - | 1 B |
| ProducerConsumer | 4096 | True | False | True | 2,383.5 ns | 261.28 ns | 300.89 ns | 2,273.3 ns | 2,037.4 ns | 3,040.7 ns | - | 9 B |
| ProducerConsumer | 4096 | True | True | False | 1,503.2 ns | 57.55 ns | 66.28 ns | 1,512.2 ns | 1,375.6 ns | 1,584.2 ns | - | - |
| ProducerConsumer | 4096 | True | True | True | 1,842.5 ns | 68.79 ns | 79.22 ns | 1,837.4 ns | 1,712.4 ns | 2,035.9 ns | - | - |


VSadov commented Feb 3, 2026

Same tests on Windows:
(clearly an improvement)

BenchmarkDotNet v0.14.1-nightly.20250107.205, Windows 11 (10.0.26200.7623)
AMD Ryzen 9 7950X 4.50GHz, 1 CPU, 32 logical and 16 physical cores

=== baseline:

| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| ProducerConsumer | 4096 | False | False | False | 552.4 ns | 23.25 ns | 25.84 ns | 549.7 ns | 514.5 ns | 601.0 ns | 0.0050 | 84 B |
| ProducerConsumer | 4096 | False | False | True | 734.0 ns | 26.71 ns | 29.69 ns | 730.6 ns | 671.1 ns | 795.4 ns | 0.0025 | 83 B |
| ProducerConsumer | 4096 | False | True | False | 325.9 ns | 13.82 ns | 14.19 ns | 325.1 ns | 304.5 ns | 358.8 ns | - | - |
| ProducerConsumer | 4096 | False | True | True | 367.9 ns | 7.89 ns | 9.09 ns | 369.1 ns | 332.3 ns | 376.9 ns | - | - |
| ProducerConsumer | 4096 | True | False | False | 1,390.5 ns | 447.74 ns | 515.62 ns | 1,050.2 ns | 959.7 ns | 2,085.7 ns | - | 58 B |
| ProducerConsumer | 4096 | True | False | True | 1,380.9 ns | 32.70 ns | 37.65 ns | 1,386.5 ns | 1,286.9 ns | 1,437.0 ns | - | 32 B |
| ProducerConsumer | 4096 | True | True | False | 886.0 ns | 17.08 ns | 18.28 ns | 889.2 ns | 852.0 ns | 922.1 ns | - | - |
| ProducerConsumer | 4096 | True | True | True | 1,012.1 ns | 19.45 ns | 18.20 ns | 1,007.6 ns | 979.7 ns | 1,043.3 ns | - | - |

=== this PR:

| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| ProducerConsumer | 4096 | False | False | False | 395.7 ns | 26.11 ns | 25.64 ns | 384.3 ns | 373.3 ns | 462.0 ns | 0.0050 | 84 B |
| ProducerConsumer | 4096 | False | False | True | 498.8 ns | 73.72 ns | 81.94 ns | 480.5 ns | 399.9 ns | 667.2 ns | 0.0050 | 83 B |
| ProducerConsumer | 4096 | False | True | False | 249.4 ns | 4.31 ns | 3.37 ns | 249.8 ns | 243.8 ns | 257.0 ns | - | - |
| ProducerConsumer | 4096 | False | True | True | 321.5 ns | 5.30 ns | 4.42 ns | 319.3 ns | 317.4 ns | 330.1 ns | - | - |
| ProducerConsumer | 4096 | True | False | False | 960.5 ns | 30.51 ns | 35.14 ns | 955.2 ns | 914.2 ns | 1,035.8 ns | - | 5 B |
| ProducerConsumer | 4096 | True | False | True | 1,255.1 ns | 25.08 ns | 24.63 ns | 1,256.5 ns | 1,209.8 ns | 1,304.5 ns | - | 30 B |
| ProducerConsumer | 4096 | True | True | False | 883.0 ns | 15.86 ns | 14.83 ns | 882.2 ns | 863.3 ns | 917.1 ns | - | - |
| ProducerConsumer | 4096 | True | True | True | 1,056.4 ns | 19.67 ns | 18.40 ns | 1,059.9 ns | 1,014.1 ns | 1,087.3 ns | - | - |


VSadov commented Feb 3, 2026

TE benchmarks seem to favor the change.

Using command:

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/json.benchmarks.yml --scenario json    --profile aspnet-gold-lin  --application.framework net11.0 --application.options.outputFiles <. . .>

=== Baseline:

| First Request (ms)        | 172                 |
| Requests/sec              | 1,828,617           |
| Requests                  | 27,611,979          |
| Mean latency (ms)         | 0.14                |
| Max latency (ms)          | 12.27               |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 291.23              |
| Latency 50th (ms)         | 0.12                |
| Latency 75th (ms)         | 0.16                |
| Latency 90th (ms)         | 0.22                |
| Latency 99th (ms)         | 0.37                |

=== This PR:

| First Request (ms)        | 171                 |
| Requests/sec              | 1,846,521           |
| Requests                  | 27,882,744          |
| Mean latency (ms)         | 0.14                |
| Max latency (ms)          | 7.00                |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 294.08              |
| Latency 50th (ms)         | 0.12                |
| Latency 75th (ms)         | 0.16                |
| Latency 90th (ms)         | 0.22                |
| Latency 99th (ms)         | 0.37                |
