Multiprocess Parallel Random Data Generation for Benchmark Serving. by Duyi-Wang · Pull Request #1038 · SemiAnalysisAI/InferenceX

Duyi-Wang · 2026-04-16T07:34:09Z

Summary

Accelerate random prompt generation in benchmark_serving.py by parallelizing the sample_random_requests() function using Python multiprocessing.Pool. This addresses the bottleneck where generating large numbers of long prompts (e.g., 20K+ prompts at 8K+ input tokens) takes tens of minutes due to sequential tokenizer encode/decode operations.

Problem

When running benchmarks with high concurrency and long input sequences, the data preparation phase dominates total wall time. For example:

2048 concurrency × 10 = 20,480 prompts @ 8,192 input tokens: the original serial path would take ~25 minutes just to generate prompt data before any actual benchmarking begins.

The root cause is that each prompt requires multiple tokenizer.decode() → tokenizer.encode() round-trips (up to 10 retries) to calibrate token length, and this entire loop runs sequentially in a single process.
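To make the cost of that loop concrete, here is a minimal sketch of a decode/encode calibration loop. It is not the code from benchmark_serving.py: `ToyTokenizer` and `calibrate_prompt` are hypothetical names, and the toy greedy tokenizer exists only so that a decode followed by an encode can change the token count, which is the effect the retries compensate for.

```python
import random


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer (greedy longest match)."""

    vocab = ["a", "b", "c", "ab", "ba"]

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            # Two-character tokens win, so decode -> encode can merge tokens.
            for width in (2, 1):
                piece = text[pos:pos + width]
                if piece in self.vocab:
                    ids.append(self.vocab.index(piece))
                    pos += len(piece)
                    break
        return ids


def calibrate_prompt(tokenizer, target_len, rng, max_retries=10):
    """Round-trip random tokens until the re-encoded length hits target_len."""
    vocab_size = len(tokenizer.vocab)
    ids = [rng.randrange(vocab_size) for _ in range(target_len)]
    for _ in range(max_retries):
        ids = tokenizer.encode(tokenizer.decode(ids))
        if len(ids) == target_len:
            break
        if len(ids) > target_len:
            ids = ids[:target_len]  # too long: truncate
        else:
            # too short: pad with fresh random tokens and try again
            ids += [rng.randrange(vocab_size) for _ in range(target_len - len(ids))]
    text = tokenizer.decode(ids)
    return text, len(tokenizer.encode(text))


text, actual_len = calibrate_prompt(ToyTokenizer(), 32, random.Random(0))
print(actual_len, text)
```

Every retry is a full decode plus a full encode of the prompt, so with 8K-token prompts the per-prompt cost adds up quickly when the loop runs in a single process.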

Solution

- Added multiprocessing support to sample_random_requests() via multiprocessing.Pool
- Each worker process initializes its own tokenizer instance once (via Pool(initializer=...))
- The prompt generation workload is split into chunks and distributed across workers
- Added --random-num-workers CLI argument (grouped with other --random-* options):
  - 0 (default): auto-select min(cpu_count, 8) workers
  - 1: force serial execution (original behavior, full backward compatibility)
  - N: use exactly N worker processes
- New parameters (tokenizer_id, tokenizer_mode, trust_remote_code, num_workers) added to the sample_random_requests() function signature; all are optional with backward-compatible defaults

Test Results

Tested with DeepSeek-R1 tokenizer (vocab_size=128,000), input_len=8192, output_len=1024, range_ratio=0.8:

Correctness Verification (2,048 prompts, serial vs parallel)

Metric	Result
Prompt length exact match	2048/2048 (100.0%)
Output length exact match	2048/2048 (100.0%)
Prompt text exact match	1949/2048 (95.2%)
Prompt length mean diff	0.00
Output lengths identical	True
Overall	PASS

Note: ~5% prompt text difference is expected — the retry loop uses random token padding, and multiprocessing workers use independent RNG states. However, all prompt/output lengths match exactly, which is what matters for benchmark accuracy.

Performance (8 worker processes)

Scenario	Serial	Parallel (8 workers)	Speedup
2,048 prompts × 8K input	150.88s	24.74s	6.10x
20,480 prompts × 8K input	~1,508s (est.)	228.37s	~6.6x

Statistical Consistency

Serial   prompt_len: mean=7379.3  std=478.8  min=6553  max=8192
Parallel prompt_len: mean=7379.3  std=478.8  min=6553  max=8192

Serial   output_len: mean=920.6  std=60.2  min=819  max=1024
Parallel output_len: mean=920.6  std=60.2  min=819  max=1024
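The min/max values above are consistent with lengths drawn uniformly from [int(len × range_ratio), len]: int(8192 × 0.8) = 6553 and int(1024 × 0.8) = 819. The following sketch assumes that sampling rule (it is inferred from the numbers, not quoted from the script), and `sample_lengths` is a hypothetical helper name.

```python
import numpy as np


def sample_lengths(target_len, range_ratio, num_prompts, rng):
    """Draw integer lengths uniformly from [int(target_len * range_ratio), target_len]."""
    low = int(target_len * range_ratio)
    # randint's upper bound is exclusive, hence the +1.
    return rng.randint(low, target_len + 1, size=num_prompts)


rng = np.random.RandomState(0)
input_lens = sample_lengths(8192, 0.8, 2048, rng)
output_lens = sample_lengths(1024, 0.8, 2048, rng)

for name, lens in (("prompt_len", input_lens), ("output_len", output_lens)):
    print(f"{name}: mean={lens.mean():.1f} std={lens.std():.1f} "
          f"min={lens.min()} max={lens.max()}")
```

Because the lengths are sampled in the main process before any work is handed to workers, the length statistics do not depend on how the prompts are later split up.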

Files Changed

utils/bench_serving/benchmark_serving.py — Added multiprocessing support for prompt generation (+152/-28 lines)

Usage

# Default: auto-parallel with up to 8 workers (no change needed to existing scripts)
python benchmark_serving.py --dataset-name random --random-input-len 8192 --num-prompts 20480 ...

# Explicit worker count
python benchmark_serving.py --dataset-name random --random-num-workers 16 ...

# Force serial (original behavior)
python benchmark_serving.py --dataset-name random --random-num-workers 1 ...
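The three modes of --random-num-workers could be resolved roughly as in the sketch below. Only the flag name and the min(cpu_count, 8) rule come from the description above; `resolve_num_workers` and `MAX_AUTO_WORKERS` are hypothetical names, and rejecting negative values is an assumption.

```python
import argparse
import os

MAX_AUTO_WORKERS = 8  # cap for auto-selection, per the PR description


def resolve_num_workers(requested):
    """Map the CLI value to an actual worker count."""
    if requested < 0:
        raise ValueError("--random-num-workers must be >= 0")
    if requested == 0:
        # Auto mode: use the CPU count, capped at 8.
        return min(os.cpu_count() or 1, MAX_AUTO_WORKERS)
    return requested  # 1 means serial, N means exactly N workers


parser = argparse.ArgumentParser()
parser.add_argument(
    "--random-num-workers",
    type=int,
    default=0,
    help="0 = auto (min(cpu_count, 8)), 1 = serial, N = exactly N workers",
)

args = parser.parse_args(["--random-num-workers", "16"])
print(resolve_num_workers(args.random_num_workers))
```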

Reproducibility Verification

The parallel path is fully deterministic: given the same --seed and --random-num-workers, multiple runs produce byte-identical results.

Verified by running 3 consecutive executions with seed=0, num_workers=4, num_prompts=200, input_len=1024 and computing MD5 over all prompt texts and lengths:

Run 1: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
Run 2: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
Run 3: hash=28004a3db05c9f6b98cc8169405f83c0, prompt_lens_sum=184378
All identical: True

This is guaranteed because:

- Calling np.random.seed() in the main process makes input_lens, output_lens, offsets, and per-worker seeds all deterministic
- Each worker creates an independent np.random.RandomState(seed) with its assigned fixed seed
- pool.map() returns results in chunk order (not completion order)
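The three points above can be illustrated with a small sketch that runs the "workers" sequentially in one process. It is a simplified model, not the PR's verification script: `generate` and `digest` are hypothetical names, random token IDs stand in for prompt text, and iterating over chunks in order mimics the ordering guarantee of pool.map().

```python
import hashlib

import numpy as np


def generate(seed, num_workers, num_prompts, input_len=16, vocab_size=1000):
    # 1. Seeding the global RNG fixes the lengths and the per-worker seeds.
    np.random.seed(seed)
    input_lens = np.random.randint(int(input_len * 0.8), input_len + 1,
                                   size=num_prompts)
    worker_seeds = np.random.randint(0, 2**31 - 1, size=num_workers)

    prompts = []
    chunks = np.array_split(input_lens, num_workers)
    # 3. Chunks are consumed in chunk order, as pool.map() would return them.
    for worker_seed, chunk in zip(worker_seeds, chunks):
        # 2. Each worker owns an independent RandomState with a fixed seed.
        rng = np.random.RandomState(int(worker_seed))
        prompts.extend(rng.randint(0, vocab_size, size=int(n)).tolist()
                       for n in chunk)
    return prompts


def digest(prompts):
    """MD5 over all prompt contents and lengths."""
    h = hashlib.md5()
    for p in prompts:
        h.update(repr((len(p), p)).encode())
    return h.hexdigest()


runs = [digest(generate(seed=0, num_workers=4, num_prompts=200)) for _ in range(3)]
print("All identical:", len(set(runs)) == 1)
```

In this model, changing `num_workers` changes which seeded generator produces each prompt, so the prompt contents differ while the lengths stay the same.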

Note: changing --random-num-workers will change per-worker seed assignments, so results will differ from serial mode or a different worker count. However, the prompt/output length distributions remain statistically identical across any worker configuration.

Backward Compatibility

- Default behavior changes from serial to parallel, but results are statistically equivalent
- --random-num-workers 1 preserves exact original behavior
- No changes to benchmark output format or metrics calculation
- No new package dependencies (uses stdlib multiprocessing)

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

functionstackx · 2026-04-17T10:09:12Z

@Oseltamivir can u take a detailed look at this and merge it if it passes ur review

Oseltamivir · 2026-04-17T17:30:28Z

Hi @Duyi-Wang, thanks for the PR. Some issues:

- Warmup NameError: contextlib.nullcontext() is called but contextlib is never imported (name 'contextlib' is not defined).
- Minor RNG contamination: serial leaves the global RNG at position 182, parallel at position 184 (2 extra draws for worker seeds). The subsequent gamma draws are shifted, so the same seed produces different inter-arrival times (1.249126 vs 1.430393 as the first value).
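The RNG issue can be reproduced in isolation with the sketch below. It shows the effect and one possible way to avoid it (drawing worker seeds from a dedicated generator instead of the global one). The function names and the constant 12345 are hypothetical, and this is not necessarily the fix that was applied in the follow-up commit.

```python
import numpy as np


def first_gamma_after(extra_global_draws):
    """First inter-arrival sample after optional extra draws on the global RNG."""
    np.random.seed(0)
    np.random.randint(0, 100, size=10)  # stand-in for the shared sampling work
    if extra_global_draws:
        # Worker seeds taken from the global RNG advance its state.
        np.random.randint(0, 2**31 - 1, size=extra_global_draws)
    return np.random.gamma(shape=1.0, scale=1.0)


def first_gamma_isolated(num_workers):
    """Same as above, but worker seeds come from a dedicated generator."""
    np.random.seed(0)
    np.random.randint(0, 100, size=10)
    seed_rng = np.random.RandomState(12345)  # hypothetical fixed seed
    seed_rng.randint(0, 2**31 - 1, size=num_workers)  # global RNG untouched
    return np.random.gamma(shape=1.0, scale=1.0)


serial = first_gamma_after(0)
contaminated = first_gamma_after(2)
isolated = first_gamma_isolated(2)
print(serial == contaminated, serial == isolated)
```

Saving and restoring the global state with np.random.get_state() and np.random.set_state() around the seed draws would be another way to keep the downstream gamma draws aligned with the serial path.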

Duyi-Wang · 2026-04-20T06:34:30Z

> Hi @Duyi-Wang , thanks for PR. Some issues:
>
> Warmup NameError: contextlib.nullcontext() is called but contextlib is never imported. name 'contextlib' is not defined
>
> Minor RNG contamination: Serial leaves the global RNG at position 182, parallel at position 184 (2 extra draws for worker seeds). The subsequent gamma draws are shifted: same seed produces different inter-arrival times (1.249126 vs 1.430393 as first value).

Update

Oseltamivir · 2026-04-20T18:13:27Z

/sweep full-sweep --config-files .github/configs/amd-master.yaml --model-prefix gptoss --runner-type mi355x --seq-lens 8k1k --no-evals --framework vllm --max-tp 1

github-actions · 2026-04-20T18:13:38Z

@Oseltamivir Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24682736183
Command: full-sweep --config-files .github/configs/amd-master.yaml --model-prefix gptoss --runner-type mi355x --seq-lens 8k1k --no-evals --framework vllm --max-tp 1
Pinned ref: 0c3855d
Approval: not required (trusted collaborator).

Oseltamivir · 2026-04-20T19:36:28Z

Checks with main vs this PR

Concurrency	Prompts	Serial (s)	Parallel (s)	Speedup
4	40	1.3	1.2	1.1x
8	80	2.5	1.5	1.7x
16	160	5.0	1.9	2.6x
32	320	10.1	2.7	3.7x
64	640	20.6	4.2	4.9x
128	1280	47.3	7.9	6.0x

Oseltamivir

lgtm

Multiprocess Parallel Random Data Generation for Benchmark Serving.

7aaed12

Duyi-Wang requested a review from a team April 16, 2026 07:34

github-project-automation Bot added this to InferenceMAX Board Apr 16, 2026

claude Bot reviewed Apr 16, 2026

update RNG and import issue

496c217

Oseltamivir mentioned this pull request Apr 20, 2026

[Experimental] Add timing instrumentation to serial prompt generation #1101

Closed

SemiAnalysisAI deleted a comment from github-actions Bot Apr 20, 2026

Merge branch 'main' into mp_benchmark

0c3855d

Oseltamivir approved these changes Apr 20, 2026

Oseltamivir merged commit 1780296 into SemiAnalysisAI:main Apr 20, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Apr 20, 2026

billishyahao added a commit to billishyahao/sglang_disagg that referenced this pull request Apr 25, 2026

pick SemiAnalysisAI/InferenceX/pull/1038

d5a7eb2

Oseltamivir merged 3 commits into SemiAnalysisAI:main from Duyi-Wang:mp_benchmark

3 participants
