Introduce async scheduler implementation with mixin pattern #941
GOavi101 wants to merge 2 commits into torch-spyre:main from
Conversation
👋 Hi! Thank you for contributing. We also recommend installing prek and configuring it to check your code before every local commit.
```python
# The mixin's pre-filter pattern is not safe under that run-ahead scenario.
# For TP=1 (UniProcExecutor), futures are immediately done so it's safe.
if parallel_config.world_size > 1:
    scheduler_config.async_scheduling = False
```
Interesting - if we wanted to support this feature then it would likely need to work with TP=4, which is how we run most models. I thought this was only incompatible with pipeline parallel upstream - does it also not work with tensor parallel?
The fix is `SpyreMultiprocExecutor`, a thin `MultiprocExecutor` subclass that overrides `max_concurrent_batches` to return 1 instead of 2. This forces the engine to use the simpler `step()` path (strictly schedule → execute → update) rather than `step_with_batch_queue`, which was the only thing that broke under TP>1.
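Roughly this shape (a sketch only; I'm assuming the upstream import path and that `max_concurrent_batches` is a property, as it is in vLLM v1):

```python
from vllm.v1.executor.multiproc_executor import MultiprocExecutor


class SpyreMultiprocExecutor(MultiprocExecutor):
    """MultiprocExecutor that opts out of the engine's batch-queue run-ahead."""

    @property
    def max_concurrent_batches(self) -> int:
        # With only 1 concurrent batch, the engine core takes the plain
        # step() path (schedule -> execute -> update) instead of
        # step_with_batch_queue, so no second schedule() call happens
        # while a forward pass is still in flight.
        return 1
```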
Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose. The `AsyncScheduler` base class and its `_update_after_schedule` TTFT benefit are still fully active; we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
What do you think?
That doesn't quite line up with my understanding. IIUC, the `step_with_batch_queue` method is what enables the speculative scheduling: the engine runs the scheduler again while the model is running, assuming that the requests in the batch will continue.
> Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose.
I don't quite understand this either: the multiproc executor is definitely async. It broadcasts an RPC to the workers to run the model, and the engine gets back a future that it waits on. `step_with_batch_queue` queues up that future so that it can speculatively schedule the next pass.
This TP=1 profile shows the scheduler running in between the model forward passes; the goal with async scheduling is to get the scheduler running for the next step during the model forward pass instead.
> The `AsyncScheduler` base class and its `_update_after_schedule` TTFT benefit are still fully active; we just removed the run-ahead that its state tracking couldn't handle.
> So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
Based on the above, my understanding is that the run-ahead state is the whole point, and we won't gain any performance benefit from this unless we support it, so this is a blocker. Is there something else I'm missing?
You're right, thanks for the correction. I'll fix this: snapshot the mixin's mutable state (`ongoing_prefills`, `tkv`, `previous_step_was_prefill`) before delegating to `super().schedule()` so the run-ahead second `schedule()` call sees consistent state, and remove `SpyreMultiprocExecutor`. That way TP≥2 gets the full async scheduling benefit.
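A rough sketch of the snapshot idea (illustrative only: it builds on this PR's `AsyncChunkedPrefillSpyreScheduler`, guesses that `ongoing_prefills` is list-like, and the restore helper is one possible way to consume the snapshot):

```python
class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreScheduler,
                                        AsyncScheduler):

    def schedule(self):
        # Snapshot the Spyre-specific mutable state before the base
        # class mutates it, so a run-ahead second schedule() call
        # (issued while the previous step is still executing) starts
        # from a consistent view and can be rolled back if needed.
        self._spyre_snapshot = (
            list(self.ongoing_prefills),  # shallow copy
            self.tkv,
            self.previous_step_was_prefill,
        )
        return super().schedule()

    def _restore_spyre_snapshot(self):
        # Roll back when a speculated schedule has to be discarded.
        ongoing, tkv, was_prefill = self._spyre_snapshot
        self.ongoing_prefills = list(ongoing)
        self.tkv = tkv
        self.previous_step_was_prefill = was_prefill
```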
Yeah, `step_with_batch_queue` is actually what makes async scheduling work at all. In the PR where Woosuk added this, he reused the existing batch queue that was originally intended for pipeline parallel (PP). With PP you have to wait N steps for each of the N PP stages. But with async scheduling the intention is for the scheduler to be ahead by 1 step, so `num_output_placeholders` was introduced to prevent the scheduler from waiting for the current step.
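For anyone following along, the run-ahead pattern that `step_with_batch_queue` implements looks roughly like this (heavily simplified pseudocode, not the actual vLLM loop; `execute_model_async` and `has_unfinished_requests` are stand-in names):

```python
from collections import deque


def run_ahead_loop(scheduler, executor, max_concurrent_batches=2):
    """Simplified sketch of the batch-queue run-ahead described above."""
    batch_queue = deque()
    while scheduler.has_unfinished_requests() or batch_queue:
        if (scheduler.has_unfinished_requests()
                and len(batch_queue) < max_concurrent_batches):
            # Schedule step i+1 while step i is still running on the
            # workers; num_output_placeholders stands in for tokens the
            # scheduler hasn't seen yet, so it doesn't wait for step i.
            scheduler_output = scheduler.schedule()
            future = executor.execute_model_async(scheduler_output)
            batch_queue.append((future, scheduler_output))
        else:
            # Queue full (or draining): block on the oldest in-flight
            # step, then detect EOS and retire finished requests.
            future, scheduler_output = batch_queue.popleft()
            model_output = future.result()
            scheduler.update_from_output(scheduler_output, model_output)
```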
Thanks @GOavi101! A few notes:
> After vLLM 0.14, the async scheduler is enabled by default. All the tests below are run using the async scheduler. To run with the synchronous scheduler:
maxdebayser left a comment
Thanks for taking on this issue, @GOavi101. I think perhaps it would be better to take a step back and re-think this PR a bit.
There are a few unnecessary changes in this PR. I would suggest the following:

- Remove the `AsyncPoolingSpyreScheduler` class and the code that selects it in `platform.py`, as async scheduling is not advantageous for pooling.
- Undo the mixin class structure. Currently the only purpose of this is for `_is_async_scheduler()` to run `isinstance(self, AsyncScheduler)`, but the same can be achieved by returning `vllm_config.scheduler_config.async_scheduling` (see the sketch after this list).
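Concretely, something like this (class names from this PR; I'm assuming the upstream v1 module paths and that the scheduler keeps a `vllm_config` reference, as the upstream `Scheduler` does):

```python
from vllm.v1.core.sched.async_scheduler import AsyncScheduler
from vllm.v1.core.sched.scheduler import Scheduler


class ChunkedPrefillSpyreScheduler(Scheduler):
    """Spyre scheduling logic as a direct Scheduler subclass, no mixin."""

    def _is_async_scheduler(self) -> bool:
        # No isinstance(self, AsyncScheduler) check needed; the config
        # already records whether async scheduling is enabled.
        return self.vllm_config.scheduler_config.async_scheduling


class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreScheduler,
                                        AsyncScheduler):
    """Async variant: inherits the Spyre logic plus upstream async behavior."""
```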
In PR 19970, Woosuk added async scheduling based on this idea from the NanoFlow paper:
> Asynchronous scheduling: Batch formation, including estimating memory usage, scheduling new requests, retiring finished requests, and adjusting the page table for PagedAttention [17], consumes a non-negligible amount of time on the CPU side [42]. In most serving frameworks [17, 58], only after the GPU executes one iteration, the scheduler on the CPU is able to detect EOS tokens, remove the finished request from the batch, and refill the batch with new requests. However, GPU is under-utilized during this time. To avoid this waste, NanoFlow asynchronously schedules batch formation in parallel to the GPU executions. At any iteration i, NanoFlow forms the batch for the next iteration before the end of the current iteration. This means that NanoFlow cannot detect the EOS tokens generated at iteration i. After launching iteration i+1, NanoFlow forms the batch for cycle i+2, detects the EOS token from iteration i, and removes finished requests. Fortunately, since the average decode length surpasses 100 for typical workloads (see Table 4), the overhead of one extra decode token is negligible (< 1%), given the benefit of hiding the batch formation overhead.
In this first PR, no model runner changes were needed. Later additions were made to overlap the model runner execution with other steps such as sampling, and the `WorkerBase` `sample_tokens` method was added to separate sampling from the execution of the model.
I think that the main problem to solve is that on Spyre we can't run prefills in the same batch as the decode requests. If it weren't for this fact, probably the only required code change would be implementing the `sample_tokens()` method in the `SpyreWorker`.
Since we can currently interleave prefill chunk batches and decode batches, in principle there is no problem in doing so asynchronously.
But once prefill is done, we must either wait and add the request to the current decode batch, or add it to the one after that.
```python
        AsyncPoolingSpyreScheduler,
    )

    return AsyncPoolingSpyreScheduler
```
I would prefer removing the `AsyncPoolingSpyreScheduler` for now. It's really an optimization for generative use cases, and it's disabled upstream because it negatively affects the pooling performance: https://github.com/vllm-project/vllm/blob/13338644082bdab10c4ff017de1a97c03ba62524/vllm/config/vllm.py#L813
```python
        return EMPTY_MODEL_RUNNER_OUTPUT
    cached = self._last_execute_model_output
    self._last_execute_model_output = None
    return cached if cached is not None else EMPTY_MODEL_RUNNER_OUTPUT
```
Isn't it better to just always return `None` in `SpyreModelRunner.execute_model()`? The engine core knows that it has to call `sample_tokens` in that case. Then you just need to split out the sampling code in `execute_model` into a `sample_tokens` method.
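i.e., along these lines (a sketch only; `forward` and `sampler` stand in for the existing Spyre forward/sampling code, and the signatures are simplified):

```python
class SpyreModelRunner:
    # ... existing runner code ...

    def execute_model(self, scheduler_output) -> None:
        # Forward pass only; always return None so the engine core
        # knows it must call sample_tokens() to get this step's output.
        self._hidden_states = self.forward(scheduler_output)
        return None

    def sample_tokens(self, grammar_output=None):
        # The sampling code that used to live at the end of
        # execute_model, split out as suggested above.
        assert self._hidden_states is not None
        output = self.sampler(self._hidden_states)
        self._hidden_states = None
        return output
```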
```python
else:
    # Reconcile input_batch with the current decode set.
    # With async run-ahead scheduling a request can appear in both
    # finished_req_ids and scheduled_cached_reqs simultaneously (the
```
I don't think this should happen. The scheduler is the component that decides when requests are done, so it should never schedule a finished request. With async scheduling, a request can be technically finished, but the scheduler only discovers that in the next step, so it's never "knowingly" scheduling a finished request.
… collapse mixins

- Remove `AsyncPoolingSpyreScheduler`; pooling models always use the sync `PoolingSpyreScheduler` (async scheduling is not advantageous for pooling).
- Collapse `PoolingSpyreMixin` / `ChunkedPrefillSpyreMixin` into direct `Scheduler` subclasses; `AsyncChunkedPrefillSpyreScheduler` subclasses `(ChunkedPrefillSpyreScheduler, AsyncScheduler)`.
- Replace isinstance-based `_is_async_scheduler()` with a check on `vllm_config.scheduler_config.async_scheduling`.
- Revert speculative model_runner / worker changes that should land in a follow-up that implements `SpyreWorker.sample_tokens()`.
- Update tests accordingly.

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
Description
Introduce async scheduler implementation with mixin pattern for cleaner architecture.
New Implementation (mixins)
- `PoolingSpyreMixin` and `ChunkedPrefillSpyreMixin` classes
- `_is_async_scheduler()` (isinstance check)
- `class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler):`
- `class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler):`
- `class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler):`
- `class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler):`

Related Issues
Test Plan
`tests/v1/core/test_async_scheduler.py` (16 tests):
- `TestIsAsyncScheduler`: verifies `_is_async_scheduler()` detection (4 tests)
- `TestPoolingSpyreMixinSchedule`: tests warmup-shape constraints in sync/async modes (4 tests)
- `TestChunkedPrefillSpyreMixinSchedule`: verifies constraint bypass in async mode (3 tests)
- `TestChunkedPrefillSpyreMixinUpdateFromOutput`: tests scheduler output filtering in async mode (5 tests)

Checklist
- Code formatted (`bash format.sh`)
- Commits include a `Signed-off-by:` line (DCO compliance)