@taoluo taoluo commented Aug 8, 2025

Dynamic Load Balancing via Request Interruption for vLLM Engines

Summary

This PR implements a dynamic load balancing system for distributed vLLM engines by introducing request interruption. The system automatically redistributes in-flight workload across engines to improve throughput and reduce latency.

Key Features

Request Interruption

  • Selection Algorithm: the scheduler sends each worker a target_leftover_cnt; the worker then selects which requests to interrupt based on migration overhead
  • Two-tier Interruption Strategy:
    • First interrupts all unscheduled (waiting) requests
    • Then selects running/swapped requests with shortest total sequence length to minimize wasted computation
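The two-tier selection above can be sketched as follows. This is a minimal illustration of the policy described in this PR, not the actual implementation; the `Request` dataclass and `select_requests_to_interrupt` helper are hypothetical names.

```python
# Hypothetical sketch of the two-tier interruption policy; `Request` and
# `select_requests_to_interrupt` are illustrative names, not this PR's API.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    status: str          # "waiting", "running", or "swapped"
    total_seq_len: int   # prompt + generated tokens so far


def select_requests_to_interrupt(requests, target_leftover_cnt):
    """Pick requests to interrupt so that `target_leftover_cnt` remain."""
    n_to_interrupt = max(0, len(requests) - target_leftover_cnt)
    # Tier 1: waiting requests carry no wasted computation, interrupt them first.
    waiting = [r for r in requests if r.status == "waiting"]
    # Tier 2: among running/swapped, shorter sequences lose the least work.
    active = sorted(
        (r for r in requests if r.status != "waiting"),
        key=lambda r: r.total_seq_len,
    )
    return [r.request_id for r in (waiting + active)[:n_to_interrupt]]
```

The key design point is that ordering by total sequence length makes "wasted computation" the explicit cost function: interrupting a request that has generated few tokens discards far less work than interrupting a long-running one.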

Dynamic Load Balancing

  • Automatic Imbalance Detection: Monitors request distribution across engines and triggers rebalancing when imbalance exceeds threshold
  • Conservative Interruption: Calculates interruption count as half of load imbalance to avoid over-correction and oscillation
  • Freeze Period: Implements 3-second freeze after interruption to allow proper load redistribution before next rebalancing cycle
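Taken together, the three rules above (threshold trigger, half-imbalance correction, 3-second freeze) amount to a small decision function. The sketch below is an assumed reading of that policy; the function name, the load representation, and the threshold parameter are all hypothetical.

```python
# Hypothetical sketch of the rebalancing trigger; names and signature are
# illustrative, not taken from generate_scheduler.py.
import time

FREEZE_SECONDS = 3.0  # freeze window after an interruption, per the PR description


def compute_interrupt_count(engine_loads, threshold, last_interrupt_ts, now=None):
    """Return how many requests to interrupt on the busiest engine, or 0.

    `engine_loads` maps engine id -> in-flight request count. The policy is
    conservative: interrupt only half of the imbalance to avoid oscillation.
    """
    now = time.monotonic() if now is None else now
    if now - last_interrupt_ts < FREEZE_SECONDS:
        return 0  # still inside the freeze period, let migration settle
    imbalance = max(engine_loads.values()) - min(engine_loads.values())
    if imbalance <= threshold:
        return 0  # within tolerance, no rebalancing needed
    return imbalance // 2  # conservative half-imbalance correction
```

Halving the correction means consecutive rebalancing cycles converge on a balanced state instead of ping-ponging requests between engines, and the freeze period gives migrated requests time to actually land before load is measured again.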

Implementation Details

  • Modified generate_scheduler.py to track load metrics and trigger rebalancing
  • Enhanced vllm_strategy.py with request interruption capabilities
  • Updated the worker communication protocol to carry a target interruption count
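As a rough illustration of the scheduler/worker exchange described above, the message could look like the following. The field and class names here are guesses for exposition only; the PR's actual wire format is not shown in this description.

```python
# Hypothetical scheduler -> worker message shapes; field names are assumptions,
# not the PR's actual protocol schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class InterruptCommand:
    """Sent by the scheduler: asks the worker to shed load down to a target."""
    target_leftover_cnt: int  # how many requests the worker should keep running


@dataclass
class InterruptReply:
    """Returned by the worker so the scheduler can track and migrate requests."""
    interrupted_request_ids: List[str] = field(default_factory=list)
```

Sending a target count rather than explicit request IDs keeps the selection decision on the worker, which is the only place with accurate per-request progress information.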

taoluo added 16 commits July 15, 2025 22:06
…ancing

  - Scheduler now sends target_leftover_cnt instead of specific request IDs
  - Worker selects requests to interrupt based on overhead:
    * First interrupts all unscheduled (waiting) requests
    * Then selects running requests with shortest total sequence length
  - Add assertions to validate request existence before interruption
  - Calculate interruption count as half of load imbalance to avoid over-correction
  - Add 3-second freeze period after interruption to allow load redistribution

  This minimizes wasted computation by prioritizing requests that haven't started
  and selecting running requests based on actual progress rather than arbitrary order.
  - Implement abort_to_target_requests_cnt for v1 engine
  - Support both v0/v1 engines in VllmStrategy
  - Prioritize interruption: waiting → (swapped+running by length)
  - Fix swapped_count missing in total request calculation
  - Return interrupted request IDs for proper tracking

  Enables efficient request migration by minimizing computational loss.
# Conflicts:
#	roll/distributed/scheduler/generate_scheduler.py
#	roll/distributed/strategy/vllm_strategy.py
#	roll/pipeline/rlvr/rlvr_pipeline.py
#	roll/third_party/vllm/vllm_0_8_4/llm.py
@CLAassistant

CLAassistant commented Aug 8, 2025

CLA assistant check
All committers have signed the CLA.

@taoluo taoluo changed the title Dynamic Load Balancing with Request Interruption for vLLM Engines Dynamic Load Balancing via Request Interruption and Migration for vLLM Engines Aug 8, 2025
@taoluo taoluo closed this Aug 8, 2025
@taoluo taoluo deleted the dynamic-load-balancing branch August 8, 2025 19:25
@taoluo taoluo restored the dynamic-load-balancing branch August 8, 2025 19:25
@taoluo taoluo reopened this Aug 8, 2025
@taoluo taoluo changed the title Dynamic Load Balancing via Request Interruption and Migration for vLLM Engines [Draft] Dynamic Load Balancing via Request Interruption and Migration for vLLM Engines Aug 12, 2025
