Introduce async scheduler implementation with mixin pattern #941
GOavi101 wants to merge 2 commits into torch-spyre:main from
Conversation
👋 Hi! Thank you for contributing. We also recommend installing prek and configuring it to check your code before every local commit.
```python
# The mixin's pre-filter pattern is not safe under that run-ahead scenario.
# For TP=1 (UniProcExecutor), futures are immediately done so it's safe.
if parallel_config.world_size > 1:
    scheduler_config.async_scheduling = False
```
Interesting - if we wanted to support this feature then it would likely need to work with TP=4, which is how we run most models. I thought this was only incompatible with pipeline parallel upstream - does it also not work with tensor parallel?
The fix is `SpyreMultiprocExecutor`, a thin `MultiprocExecutor` subclass that overrides `max_concurrent_batches` to return 1 instead of 2. This forces the engine to use the simpler `step()` path (strictly schedule → execute → update) rather than `step_with_batch_queue`, which was the only thing that broke under TP>1.
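Roughly this shape (a sketch only; I'm assuming the upstream import path and that `max_concurrent_batches` is a property, as it is in vLLM v1):

```python
from vllm.v1.executor.multiproc_executor import MultiprocExecutor


class SpyreMultiprocExecutor(MultiprocExecutor):
    """MultiprocExecutor that opts out of the engine's batch-queue run-ahead."""

    @property
    def max_concurrent_batches(self) -> int:
        # With only 1 concurrent batch, the engine core takes the plain
        # step() path (schedule -> execute -> update) instead of
        # step_with_batch_queue, so no second schedule() call happens
        # while a forward pass is still in flight.
        return 1
```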
Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose. The `AsyncScheduler` base class and its `_update_after_schedule` TTFT benefit are still fully active; we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
What do you think?
That doesn't quite line up with my understanding. IIUC, the `step_with_batch_queue` method is what enables the speculative scheduling: the engine runs the scheduler again while the model is running, assuming that the requests in the batch will continue.
> Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose.
I don't quite understand this either: the multiproc executor is definitely async. It broadcasts an RPC to the workers to run the model, and the engine gets back a future that it waits on. `step_with_batch_queue` queues up that future so that it can speculatively schedule the next pass.
This TP=1 profile shows the scheduler running in between the model forward passes; the goal with async scheduling is to get the scheduler running for the next step during the model forward pass instead.
> The `AsyncScheduler` base class and its `_update_after_schedule` TTFT benefit are still fully active; we just removed the run-ahead that its state tracking couldn't handle.
> So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
Based on the above, my understanding is that the run-ahead state is the whole point, and we won't gain any performance benefit from this unless we support it, so this is a blocker. Is there something else I'm missing?
You're right, thanks for the correction. I'll fix this: snapshot the mixin's mutable state (`ongoing_prefills`, `tkv`, `previous_step_was_prefill`) before delegating to `super().schedule()` so the run-ahead second `schedule()` call sees consistent state, and remove `SpyreMultiprocExecutor`. That way TP≥2 gets the full async scheduling benefit.
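A rough sketch of the snapshot idea (illustrative only: it builds on this PR's `AsyncChunkedPrefillSpyreScheduler`, guesses that `ongoing_prefills` is list-like, and the restore helper is one possible way to consume the snapshot):

```python
class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreScheduler,
                                        AsyncScheduler):

    def schedule(self):
        # Snapshot the Spyre-specific mutable state before the base
        # class mutates it, so a run-ahead second schedule() call
        # (issued while the previous step is still executing) starts
        # from a consistent view and can be rolled back if needed.
        self._spyre_snapshot = (
            list(self.ongoing_prefills),  # shallow copy
            self.tkv,
            self.previous_step_was_prefill,
        )
        return super().schedule()

    def _restore_spyre_snapshot(self):
        # Roll back when a speculated schedule has to be discarded.
        ongoing, tkv, was_prefill = self._spyre_snapshot
        self.ongoing_prefills = list(ongoing)
        self.tkv = tkv
        self.previous_step_was_prefill = was_prefill
```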
Yeah, `step_with_batch_queue` is actually what makes async scheduling work at all. In the PR where Woosuk added this, he reused the existing batch queue that was originally intended for pipeline parallel (PP). With PP you have to wait N steps for each of the N PP stages. But with async scheduling the intention is for the scheduler to be ahead by 1 step, so `num_output_placeholders` was introduced to prevent the scheduler from waiting for the current step.
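For anyone following along, the run-ahead pattern that `step_with_batch_queue` implements looks roughly like this (heavily simplified pseudocode, not the actual vLLM loop; `execute_model_async` and `has_unfinished_requests` are stand-in names):

```python
from collections import deque


def run_ahead_loop(scheduler, executor, max_concurrent_batches=2):
    """Simplified sketch of the batch-queue run-ahead described above."""
    batch_queue = deque()
    while scheduler.has_unfinished_requests() or batch_queue:
        if (scheduler.has_unfinished_requests()
                and len(batch_queue) < max_concurrent_batches):
            # Schedule step i+1 while step i is still running on the
            # workers; num_output_placeholders stands in for tokens the
            # scheduler hasn't seen yet, so it doesn't wait for step i.
            scheduler_output = scheduler.schedule()
            future = executor.execute_model_async(scheduler_output)
            batch_queue.append((future, scheduler_output))
        else:
            # Queue full (or draining): block on the oldest in-flight
            # step, then detect EOS and retire finished requests.
            future, scheduler_output = batch_queue.popleft()
            model_output = future.result()
            scheduler.update_from_output(scheduler_output, model_output)
```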
Thanks @GOavi101! A few notes:
> After vLLM 0.14, the async scheduler is enabled by default. All the tests below are run using the async scheduler. To run with the synchronous scheduler:
maxdebayser left a comment
Thanks for taking on this issue, @GOavi101. I think perhaps it would be better to take a step back and re-think this PR a bit.
There are a few unnecessary changes in this PR. I would suggest the following:

- Remove the `AsyncPoolingSpyreScheduler` class and the code that selects it in `platform.py`, as async scheduling is not advantageous for pooling.
- Undo the mixin class structure. Currently the only purpose of this is for `_is_async_scheduler()` to run `isinstance(self, AsyncScheduler)`, but the same can be achieved by returning `vllm_config.scheduler_config.async_scheduling` (see the sketch after this list).
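Concretely, something like this (class names from this PR; I'm assuming the upstream v1 module paths and that the scheduler keeps a `vllm_config` reference, as the upstream `Scheduler` does):

```python
from vllm.v1.core.sched.async_scheduler import AsyncScheduler
from vllm.v1.core.sched.scheduler import Scheduler


class ChunkedPrefillSpyreScheduler(Scheduler):
    """Spyre scheduling logic as a direct Scheduler subclass, no mixin."""

    def _is_async_scheduler(self) -> bool:
        # No isinstance(self, AsyncScheduler) check needed; the config
        # already records whether async scheduling is enabled.
        return self.vllm_config.scheduler_config.async_scheduling


class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreScheduler,
                                        AsyncScheduler):
    """Async variant: inherits the Spyre logic plus upstream async behavior."""
```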
In PR 19970, Woosuk added async scheduling based on this idea from the NanoFlow paper:
> Asynchronous scheduling: Batch formation, including estimating memory usage, scheduling new requests, retiring finished requests, and adjusting the page table for PagedAttention [17], consumes a non-negligible amount of time on the CPU side [42]. In most serving frameworks [17, 58], only after the GPU executes one iteration, the scheduler on the CPU is able to detect EOS tokens, remove the finished request from the batch, and refill the batch with new requests. However, GPU is under-utilized during this time. To avoid this waste, NanoFlow asynchronously schedules batch formation in parallel to the GPU executions. At any iteration i, NanoFlow forms the batch for the next iteration before the end of the current iteration. This means that NanoFlow cannot detect the EOS tokens generated at iteration i. After launching iteration i+1, NanoFlow forms the batch for cycle i+2, detects the EOS token from iteration i, and removes finished requests. Fortunately, since the average decode length surpasses 100 for typical workloads (see Table 4), the overhead of one extra decode token is negligible (< 1%), given the benefit of hiding the batch formation overhead.
In this first PR, no model runner changes were needed. Later additions were made to overlap the model runner execution with other steps such as sampling, and the `WorkerBase` `sample_tokens` method was added to separate sampling from the execution of the model.
I think that the main problem to solve is that on Spyre we can't run prefills in the same batch as the decode requests. If it weren't for this fact, probably the only required code change would be implementing the `sample_tokens()` method in the `SpyreWorker`.
Since we can currently interleave prefill chunk batches and decode batches, in principle there is no problem in doing so asynchronously.
But once prefill is done, we must either wait and add the request to the current decode batch, or add it to the one after that.
```python
        AsyncPoolingSpyreScheduler,
    )

    return AsyncPoolingSpyreScheduler
```
I would prefer removing the `AsyncPoolingSpyreScheduler` for now. It's really an optimization for generative use cases, and it's disabled upstream because it negatively affects the pooling performance: https://github.com/vllm-project/vllm/blob/13338644082bdab10c4ff017de1a97c03ba62524/vllm/config/vllm.py#L813
```python
        return EMPTY_MODEL_RUNNER_OUTPUT
    cached = self._last_execute_model_output
    self._last_execute_model_output = None
    return cached if cached is not None else EMPTY_MODEL_RUNNER_OUTPUT
```
Isn't it better to just always return `None` in `SpyreModelRunner.execute_model()`? The engine core knows that it has to call `sample_tokens` in that case. Then you just need to split out the sampling code in `execute_model` into a `sample_tokens` method.
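i.e., along these lines (a sketch only; `forward` and `sampler` stand in for the existing Spyre forward/sampling code, and the signatures are simplified):

```python
class SpyreModelRunner:
    # ... existing runner code ...

    def execute_model(self, scheduler_output) -> None:
        # Forward pass only; always return None so the engine core
        # knows it must call sample_tokens() to get this step's output.
        self._hidden_states = self.forward(scheduler_output)
        return None

    def sample_tokens(self, grammar_output=None):
        # The sampling code that used to live at the end of
        # execute_model, split out as suggested above.
        assert self._hidden_states is not None
        output = self.sampler(self._hidden_states)
        self._hidden_states = None
        return output
```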
```python
else:
    # Reconcile input_batch with the current decode set.
    # With async run-ahead scheduling a request can appear in both
    # finished_req_ids and scheduled_cached_reqs simultaneously (the
```
I don't think this should happen. The scheduler is the component that decides when requests are done, so it should never schedule a finished request. With async scheduling, a request can be technically finished, but the scheduler only discovers that in the next step, so it's never "knowingly" scheduling a finished request.
… collapse mixins

- Remove `AsyncPoolingSpyreScheduler`; pooling models always use the sync `PoolingSpyreScheduler` (async scheduling is not advantageous for pooling).
- Collapse `PoolingSpyreMixin` / `ChunkedPrefillSpyreMixin` into direct `Scheduler` subclasses; `AsyncChunkedPrefillSpyreScheduler` subclasses `(ChunkedPrefillSpyreScheduler, AsyncScheduler)`.
- Replace isinstance-based `_is_async_scheduler()` with a check on `vllm_config.scheduler_config.async_scheduling`.
- Revert speculative model_runner / worker changes that should land in a follow-up that implements `SpyreWorker.sample_tokens()`.
- Update tests accordingly.

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
Description
Introduce async scheduler implementation with mixin pattern for cleaner architecture.
New Implementation (mixins)
- `PoolingSpyreMixin` and `ChunkedPrefillSpyreMixin` classes
- `_is_async_scheduler()` (isinstance check)
- `class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler):`
- `class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler):`
- `class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler):`
- `class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler):`

Related Issues
Test Plan
`tests/v1/core/test_async_scheduler.py` (16 tests):
- `TestIsAsyncScheduler`: verifies `_is_async_scheduler()` detection (4 tests)
- `TestPoolingSpyreMixinSchedule`: tests warmup-shape constraints in sync/async modes (4 tests)
- `TestChunkedPrefillSpyreMixinSchedule`: verifies constraint bypass in async mode (3 tests)
- `TestChunkedPrefillSpyreMixinUpdateFromOutput`: tests scheduler output filtering in async mode (5 tests)

Checklist
- Code formatted (`bash format.sh`)
- Commits include a `Signed-off-by:` line (DCO compliance)