
Multiprocess Parallel Random Data Generation for Benchmark Serving. #1038

Merged
Oseltamivir merged 3 commits into SemiAnalysisAI:main from Duyi-Wang:mp_benchmark
Apr 20, 2026

Conversation

@Duyi-Wang

Contributor

Summary

Accelerate random prompt generation in benchmark_serving.py by parallelizing the sample_random_requests() function using Python multiprocessing.Pool. This addresses the bottleneck where generating large numbers of long prompts (e.g., 20K+ prompts at 8K+ input tokens) takes tens of minutes due to sequential tokenizer encode/decode operations.

Problem

When running benchmarks with high concurrency and long input sequences, the data preparation phase dominates total wall time. For example:

  • 2048 concurrency × 10 = 20,480 prompts @ 8,192 input tokens: the original serial path would take ~25 minutes just to generate prompt data before any actual benchmarking begins.

The root cause is that each prompt requires multiple tokenizer.decode() / tokenizer.encode() round-trips (up to 10 retries) to calibrate its token length, and this entire loop runs sequentially in a single process.
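The calibration loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from benchmark_serving.py: ToyTokenizer, VOCAB, and calibrate_prompt are hypothetical names, and the toy vocabulary exists only to make decode followed by encode change the token count, which is the effect that forces retries with a real tokenizer.

```python
import random

VOCAB = ["a", "b", "ab"]  # toy vocabulary (assumption); id 2 merges "a" + "b"


class ToyTokenizer:
    """Hypothetical tokenizer whose decode -> encode round-trip can shrink length."""

    def decode(self, ids):
        return "".join(VOCAB[i] for i in ids)

    def encode(self, text):
        ids, i = [], 0
        while i < len(text):
            if text.startswith("ab", i):  # greedy longest match
                ids.append(2)
                i += 2
            else:
                ids.append(VOCAB.index(text[i]))
                i += 1
        return ids


def calibrate_prompt(tokenizer, target_len, rng, max_retries=10):
    """Return (prompt, token_count), retrying until the count hits target_len."""
    ids = [rng.randrange(len(VOCAB)) for _ in range(target_len)]
    for _ in range(max_retries):
        # One round-trip: this is the expensive step with a real tokenizer.
        ids = tokenizer.encode(tokenizer.decode(ids))
        if len(ids) == target_len:
            break
        if len(ids) > target_len:
            ids = ids[:target_len]  # too long: truncate
        else:
            # too short: pad with random tokens and try again
            ids += [rng.randrange(len(VOCAB)) for _ in range(target_len - len(ids))]
    prompt = tokenizer.decode(ids)
    return prompt, len(tokenizer.encode(prompt))


tok = ToyTokenizer()
prompt, n_tokens = calibrate_prompt(tok, 64, random.Random(0))
print(n_tokens)
```

Every retry costs one decode and one encode over the whole prompt, so with thousands of 8K-token prompts the serial cost adds up quickly.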

Solution

  • Added multiprocessing support to sample_random_requests() via multiprocessing.Pool
  • Each worker process initializes its own tokenizer instance once (via Pool(initializer=...))
  • The prompt generation workload is split into chunks and distributed across workers
  • Added --random-num-workers CLI argument (grouped with other --random-* options):
    • 0 (default): auto-select min(cpu_count, 8) workers
    • 1: force serial execution (original behavior, full backward compatibility)
    • N: use exactly N worker processes
  • New parameters (tokenizer_id, tokenizer_mode, trust_remote_code, num_workers) added to sample_random_requests() function signature; all are optional with backward-compatible defaults
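The bullets above can be sketched as follows. This is an illustration under stated assumptions rather than the merged implementation: load_tokenizer, init_worker, resolve_num_workers, split_chunks, generate_chunk, and generate are hypothetical names, and the "tokenizer" is a placeholder dict so the sketch has no third-party dependencies.

```python
import multiprocessing as mp
import os

_TOKENIZER = None  # per-process global, set once by the pool initializer


def load_tokenizer(tokenizer_id):
    """Hypothetical stand-in for an expensive tokenizer load."""
    return {"id": tokenizer_id}


def init_worker(tokenizer_id):
    # Runs once in each worker process, so the load cost is paid per worker,
    # not per prompt.
    global _TOKENIZER
    _TOKENIZER = load_tokenizer(tokenizer_id)


def resolve_num_workers(requested):
    # 0 means auto-select min(cpu_count, 8); any positive N is used as given.
    if requested == 0:
        return min(os.cpu_count() or 1, 8)
    return requested


def split_chunks(items, n):
    # Contiguous chunks whose sizes differ by at most one.
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def generate_chunk(lengths):
    # Placeholder "prompt"; real code would run the tokenizer calibration here.
    return [f"{_TOKENIZER['id']}:{n}" for n in lengths]


def generate(lengths, tokenizer_id, num_workers=0):
    workers = resolve_num_workers(num_workers)
    if workers <= 1:
        # Serial path: same work, done in the current process.
        init_worker(tokenizer_id)
        return generate_chunk(lengths)
    chunks = split_chunks(lengths, workers)
    with mp.Pool(workers, initializer=init_worker, initargs=(tokenizer_id,)) as pool:
        parts = pool.map(generate_chunk, chunks)  # results come back in chunk order
    return [prompt for part in parts for prompt in part]
```

When the parallel path is used from a script, the call to generate(...) should sit under an `if __name__ == "__main__":` guard, since the spawn and forkserver start methods re-import the main module in each worker.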

Test Results

Tested with DeepSeek-R1 tokenizer (vocab_size=128,000), input_len=8192, output_len=1024, range_ratio=0.8:

Correctness Verification (2,048 prompts, serial vs parallel)

| Metric | Result |
| --- | --- |
| Prompt length exact match | 2048/2048 (100.0%) |
| Output length exact match | 2048/2048 (100.0%) |
| Prompt text exact match | 1949/2048 (95.2%) |
| Prompt length mean diff | 0.00 |
| Output lengths identical | True |
| Overall | PASS |

Note: ~5% prompt text difference is expected — the retry loop uses random token padding, and multiprocessing workers use independent RNG states. However, all prompt/output lengths match exactly, which is what matters for benchmark accuracy.
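A comparison of this kind might be computed along the following lines. The function name compare_runs, the tuple layout, and the rule that only the length checks gate the verdict are assumptions for illustration; the actual verification script is not shown in this PR.

```python
def compare_runs(serial, parallel):
    """Compare two runs; each is a list of (prompt_text, prompt_len, output_len)."""
    if not serial or len(serial) != len(parallel):
        raise ValueError("runs must be non-empty and the same size")
    n = len(serial)
    pairs = list(zip(serial, parallel))
    text_match = sum(s[0] == p[0] for s, p in pairs)
    prompt_len_match = sum(s[1] == p[1] for s, p in pairs)
    output_len_match = sum(s[2] == p[2] for s, p in pairs)
    return {
        "prompt_text_match": text_match / n,
        "prompt_len_match": prompt_len_match / n,
        "output_len_match": output_len_match / n,
        # Assumption: text may differ, so only the lengths decide pass/fail.
        "passed": prompt_len_match == n and output_len_match == n,
    }


# Tiny made-up example: one prompt differs in text but not in length.
serial = [("aaa", 3, 10), ("bbb", 3, 12)]
parallel = [("aaa", 3, 10), ("bcb", 3, 12)]
report = compare_runs(serial, parallel)
print(report)
```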

Performance (8 worker processes)

| Scenario | Serial | Parallel (8 workers) | Speedup |
| --- | --- | --- | --- |
| 2,048 prompts × 8K input | 150.88s | 24.74s | 6.10x |
| 20,480 prompts × 8K input | ~1,508s (est.) | 228.37s | ~6.6x |
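The arithmetic behind these figures can be reproduced as below. The estimated serial time appears consistent with scaling the measured 2,048-prompt run linearly by 10; that reading, and the per-worker efficiency figure, are assumptions added here rather than numbers reported in the PR.

```python
def speedup(serial_s, parallel_s):
    return serial_s / parallel_s


serial_2k, parallel_2k = 150.88, 24.74  # measured, 2,048 prompts
parallel_20k = 228.37                   # measured, 20,480 prompts
workers = 8

# Assumed linear extrapolation of the serial time to 10x the prompts.
serial_20k_est = serial_2k * (20480 / 2048)

s2k = speedup(serial_2k, parallel_2k)
s20k = speedup(serial_20k_est, parallel_20k)

print(f"2,048 prompts:  {s2k:.2f}x, efficiency {s2k / workers:.0%}")
print(f"20,480 prompts: {s20k:.1f}x, efficiency {s20k / workers:.0%}")
```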

Statistical Consistency

Serial   prompt_len: mean=7379.3  std=478.8  min=6553  max=8192
Parallel prompt_len: mean=7379.3  std=478.8  min=6553  max=8192

Serial   output_len: mean=920.6  std=60.2  min=819  max=1024
Parallel output_len: mean=920.6  std=60.2  min=819  max=1024

Files Changed

  • utils/bench_serving/benchmark_serving.py — Added multiprocessing support for prompt generation (+152/-28 lines)

Usage

# Default: auto-parallel with up to 8 workers (no change needed to existing scripts)
python benchmark_serving.py --dataset-name random --random-input-len 8192 --num-prompts 20480 ...

# Explicit worker count
python benchmark_serving.py --dataset-name random --random-num-workers 16 ...

# Force serial (original behavior)
python benchmark_serving.py --dataset-name random --random-num-workers 1 ...

Reproducibility Verification

The parallel path is fully deterministic: given the same --seed and --random-num-workers, multiple runs produce byte-identical results.

Verified by running 3 consecutive executions with seed=0, num_workers=4, num_prompts=200, input_len=1024 and computing MD5 over all prompt texts and lengths:

Run 1: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
Run 2: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
Run 3: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
All identical: True
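A fingerprint like the one above could be computed as in this sketch. The serialization (field order and separators) and the name fingerprint are assumptions; the PR only states that MD5 was taken over all prompt texts and lengths.

```python
import hashlib


def fingerprint(requests):
    """MD5 over prompt texts and lengths; requests is a list of
    (prompt_text, prompt_len, output_len) tuples."""
    h = hashlib.md5()
    for prompt, prompt_len, output_len in requests:
        h.update(prompt.encode("utf-8"))
        # Delimiters keep adjacent fields from running together.
        h.update(f"|{prompt_len}|{output_len}\n".encode("utf-8"))
    return h.hexdigest()


run_a = [("hello world", 2, 16), ("foo bar baz", 3, 32)]
run_b = [("hello world", 2, 16), ("foo bar baz", 3, 32)]
run_c = [("hello world", 2, 16), ("foo bar qux", 3, 32)]

print(fingerprint(run_a) == fingerprint(run_b))  # identical runs hash the same
print(fingerprint(run_a) == fingerprint(run_c))  # any text change alters the hash
```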

This is guaranteed because:

  1. Calling np.random.seed() in the main process makes input_lens, output_lens, offsets, and per-worker seeds all deterministic
  2. Each worker creates an independent np.random.RandomState(seed) with its assigned fixed seed
  3. pool.map() returns results in chunk order (not completion order)
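The three points above can be illustrated with a small sketch. To stay dependency-free and safe to run anywhere, it swaps in the standard library's random.Random for np.random.RandomState and ThreadPoolExecutor.map for pool.map (both return results in input order); plan, run_chunk, and generate are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def plan(seed, num_prompts, num_workers):
    # Point 1: a seeded main-process generator fixes the lengths and the
    # per-worker seeds.
    main_rng = random.Random(seed)
    lengths = [main_rng.randint(8, 16) for _ in range(num_prompts)]
    worker_seeds = [main_rng.randrange(2**32) for _ in range(num_workers)]
    return lengths, worker_seeds


def run_chunk(task):
    # Point 2: each chunk uses its own generator built from its assigned seed.
    worker_seed, lengths = task
    rng = random.Random(worker_seed)
    return [[rng.randrange(1000) for _ in range(n)] for n in lengths]


def generate(seed, num_prompts, num_workers):
    lengths, worker_seeds = plan(seed, num_prompts, num_workers)
    size = -(-num_prompts // num_workers)  # ceiling division
    chunks = [lengths[i * size:(i + 1) * size] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Point 3: map() yields results in input order, not completion order.
        parts = list(pool.map(run_chunk, zip(worker_seeds, chunks)))
    return [prompt for part in parts for prompt in part]


a = generate(seed=0, num_prompts=20, num_workers=4)
b = generate(seed=0, num_prompts=20, num_workers=4)
c = generate(seed=0, num_prompts=20, num_workers=2)
print(a == b)                                        # same seed and worker count
print([len(p) for p in a] == [len(p) for p in c])    # lengths do not depend on workers
```

In this sketch, changing the worker count changes which seed generates which prompt, so the token contents differ while the lengths stay the same.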

Note: changing --random-num-workers will change per-worker seed assignments, so results will differ from serial mode or a different worker count. However, the prompt/output length distributions remain statistically identical across any worker configuration.

Backward Compatibility

  • Default behavior changes from serial to parallel, but results are statistically equivalent
  • --random-num-workers 1 preserves exact original behavior
  • No changes to benchmark output format or metrics calculation
  • No new package dependencies (uses stdlib multiprocessing)

@claude claude Bot left a comment

Contributor


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@functionstackx

Collaborator

@Oseltamivir can you take a detailed look at this and merge it if it passes your review?

@Oseltamivir

Oseltamivir commented Apr 17, 2026

Collaborator

Hi @Duyi-Wang, thanks for the PR. Some issues:

  1. Warmup NameError: contextlib.nullcontext() is called but contextlib is never imported, so the warmup path fails with "name 'contextlib' is not defined".

  2. Minor RNG contamination: serial leaves the global RNG at position 182, parallel at position 184 (2 extra draws for worker seeds). The subsequent gamma draws are shifted, so the same seed produces different inter-arrival times (1.249126 vs 1.430393 as first value).
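The second issue can be reproduced in miniature, along with one possible way to avoid it. This sketch uses the standard library's random module rather than NumPy, the function names are hypothetical, and drawing worker seeds from a separate generator is only one option; it is not necessarily the fix applied in this PR.

```python
import random


def arrivals_shared_rng(seed, num_worker_seeds, n=3):
    """Worker seeds come from the same generator as everything else."""
    rng = random.Random(seed)
    _lengths = [rng.randint(1, 100) for _ in range(5)]
    # Extra draws here advance the shared stream...
    _worker_seeds = [rng.randrange(2**32) for _ in range(num_worker_seeds)]
    # ...so these inter-arrival times depend on the worker count.
    return [rng.gammavariate(1.0, 1.0) for _ in range(n)]


def arrivals_isolated_rng(seed, num_worker_seeds, n=3):
    """Worker seeds come from a separate generator derived from the seed."""
    rng = random.Random(seed)
    _lengths = [rng.randint(1, 100) for _ in range(5)]
    seed_rng = random.Random(seed + 1_000_003)  # arbitrary offset (assumption)
    _worker_seeds = [seed_rng.randrange(2**32) for _ in range(num_worker_seeds)]
    return [rng.gammavariate(1.0, 1.0) for _ in range(n)]


serial = arrivals_shared_rng(0, 0)      # serial path: no worker seeds drawn
contaminated = arrivals_shared_rng(0, 2)
isolated = arrivals_isolated_rng(0, 2)

print(serial == contaminated)  # shared stream: arrivals shift
print(serial == isolated)      # isolated stream: arrivals match serial
```

Saving and restoring the generator state around the seed draws would be another way to get the same effect.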

@Duyi-Wang

Contributor Author

> Hi @Duyi-Wang, thanks for the PR. Some issues:
>
>   1. Warmup NameError: contextlib.nullcontext() is called but contextlib is never imported. name 'contextlib' is not defined
>   2. Minor RNG contamination: Serial leaves the global RNG at position 182, parallel at position 184 (2 extra draws for worker seeds). The subsequent gamma draws are shifted: same seed produces different inter-arrival times (1.249126 vs 1.430393 as first value).

Update

@Oseltamivir

Collaborator

/sweep full-sweep --config-files .github/configs/amd-master.yaml --model-prefix gptoss --runner-type mi355x --seq-lens 8k1k --no-evals --framework vllm --max-tp 1

@github-actions

Contributor

@Oseltamivir Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24682736183
Command: full-sweep --config-files .github/configs/amd-master.yaml --model-prefix gptoss --runner-type mi355x --seq-lens 8k1k --no-evals --framework vllm --max-tp 1
Pinned ref: 0c3855d
Approval: not required (trusted collaborator).

@Oseltamivir

Collaborator

Checks with main vs this PR

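| Concurrency | Prompts | Serial (s) | Parallel (s) | Speedup |
| --- | --- | --- | --- | --- |
| 4 | 40 | 1.3 | 1.2 | 1.1x |
| 8 | 80 | 2.5 | 1.5 | 1.7x |
| 16 | 160 | 5.0 | 1.9 | 2.6x |
| 32 | 320 | 10.1 | 2.7 | 3.7x |
| 64 | 640 | 20.6 | 4.2 | 4.9x |
| 128 | 1280 | 47.3 | 7.9 | 6.0x |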

@Oseltamivir Oseltamivir left a comment

Collaborator


lgtm

@Oseltamivir Oseltamivir merged commit 1780296 into SemiAnalysisAI:main Apr 20, 2026
billishyahao added a commit to billishyahao/sglang_disagg that referenced this pull request Apr 25, 2026