feat: emit per-iteration forward pass metrics via ZMQ PUB #20569

ishandhanani wants to merge 2 commits into main
Conversation
Force-pushed from df1e02d to 61362cc
Add `ForwardPassMetrics` emission from the scheduler on every forward pass iteration via a ZMQ PUB socket. This enables external consumers (planners, routers) to observe real-time scheduling behavior without polling Prometheus. Controlled via the `--forward-pass-metrics-port` server arg; zero overhead when not set. A background publisher thread keeps serialization off the scheduler hot path. Idle heartbeats are emitted every 1s.

Data emitted per iteration:
- Scheduled requests: prefill/decode counts, token sums, KV lengths, variance
- Queued requests: waiting queue depth and token distribution
- `wall_time`: `time.monotonic()` at emit time for inter-iteration cadence
Force-pushed from 61362cc to 44b59a5
```python
) -> Union[GenerationBatchResult, EmbeddingBatchResult]:
    """Run a batch."""
    self.forward_ct += 1
    batch.fpm_start_time = time.monotonic()
```
maybe it's better to include scheduling time and log fpm_start_time before get_next_batch_to_run?
Seems there are repeated files (FPM struct definitions, var calculation, etc.). Is it better to import them from Dynamo, or do we want to make this feature generic to all sglang users?
We don't want a dependency on Dynamo here (even a lazy one). I'm ok with it being duplicated for now... ideally we could publish the FPM spec properly somewhere. This and events are currently duplicated across frameworks.
Summary
- `ForwardPassMetrics` emission from the scheduler on every forward pass iteration via ZMQ PUB socket
- `--forward-pass-metrics-port` server arg -- zero overhead when not set

Motivation
External orchestration systems need per-iteration scheduling telemetry to make informed routing decisions. The existing KV metrics (from #6721) provide block-level cache occupancy, but planners also need request-level scheduling data: how many prefill/decode requests ran, token counts, KV context lengths, queue depth, and iteration wall time.
This uses sglang's existing `SchedulerMetricsMixin` -- no scheduler subclass needed.

Data emitted per iteration
- `wall_time`
- `num_prefill_requests`
- `sum_prefill_tokens`
- `sum_prefill_kv_tokens`
- `var_prefill_length`
- `num_decode_requests`
- `sum_decode_kv_tokens`
- `var_decode_kv_tokens`

Architecture
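The per-iteration fields listed above could be sketched as a dataclass with a JSON roundtrip; this is only an illustration of the payload shape (the actual schema lives in `forward_pass_metrics.py`, and the wire format is an assumption).

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class ForwardPassMetrics:
    # time.monotonic() at emit time, for inter-iteration cadence.
    wall_time: float
    # Scheduled prefill requests in this iteration.
    num_prefill_requests: int
    sum_prefill_tokens: int
    sum_prefill_kv_tokens: int
    var_prefill_length: float
    # Scheduled decode requests in this iteration.
    num_decode_requests: int
    sum_decode_kv_tokens: int
    var_decode_kv_tokens: float

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode()

    @classmethod
    def from_bytes(cls, raw: bytes) -> "ForwardPassMetrics":
        return cls(**json.loads(raw))
```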
How to enable
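This section's content appears to have been lost in extraction. Based on the flag described in the summary and the port used in the test plan, enabling the feature presumably looks like the following (the exact launch invocation and `<model>` placeholder are assumptions):

```shell
# Bind the FPM publisher to a ZMQ PUB socket on port 20380.
# Omitting the flag disables the feature entirely (zero overhead).
python -m sglang.launch_server \
    --model-path <model> \
    --forward-pass-metrics-port 20380
```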
Files changed
- `python/sglang/srt/observability/forward_pass_metrics.py` -- ForwardPassMetrics schema, WelfordAccumulator, `_FpmPublisherThread`
- `python/sglang/srt/observability/scheduler_metrics_mixin.py` -- init FPM publisher, `_emit_forward_pass_metrics()`, `_shutdown_fpm()`
- `python/sglang/srt/managers/scheduler.py` -- record batch start time, emit FPM from `process_batch_result()`
- `python/sglang/srt/server_args.py` -- add `forward_pass_metrics_port` field and `--forward-pass-metrics-port` CLI flag
- `test/manual/test_forward_pass_metrics.py` -- schema roundtrip, ZMQ PUB/SUB e2e, heartbeat

Test plan
- Ran the server with `--forward-pass-metrics-port 20380`, sent requests, verified prefill/decode metrics arrive with correct values
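The `WelfordAccumulator` named in the files-changed list suggests the `var_*` fields are computed in a single pass over the batch. A standard Welford sketch of that technique (the PR's actual implementation may differ):

```python
class WelfordAccumulator:
    """One-pass running mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Population variance; 0.0 for fewer than two samples.
        return self.m2 / self.count if self.count > 1 else 0.0
```

A single pass over, say, the prefill lengths of the scheduled batch then yields both the sum-style metrics and the variance without storing the lengths.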