Multiprocess Parallel Random Data Generation for Benchmark Serving. #1038
@Oseltamivir can u take a detailed look at this and merge it if it passes ur review
Hi @Duyi-Wang, thanks for the PR. Some issues:
Update
/sweep full-sweep --config-files .github/configs/amd-master.yaml --model-prefix gptoss --runner-type mi355x --seq-lens 8k1k --no-evals --framework vllm --max-tp 1
@Oseltamivir Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24682736183 |
Summary
Accelerate random prompt generation in benchmark_serving.py by parallelizing the sample_random_requests() function using Python multiprocessing.Pool. This addresses the bottleneck where generating large numbers of long prompts (e.g., 20K+ prompts at 8K+ input tokens) takes tens of minutes due to sequential tokenizer encode/decode operations.
Problem
When running benchmarks with high concurrency and long input sequences, the data preparation phase dominates total wall time. For example:
The root cause is that each prompt requires multiple tokenizer.decode() → tokenizer.encode() round-trips (up to 10 retries) to calibrate token length, and this entire loop runs sequentially in a single process.
Solution
- Parallelize sample_random_requests() via multiprocessing.Pool
- Per-worker setup through the pool initializer (Pool(initializer=...))
- New --random-num-workers CLI argument (grouped with other --random-* options):
  - 0 (default): auto-select min(cpu_count, 8) workers
  - 1: force serial execution (original behavior, full backward compatibility)
  - N: use exactly N worker processes
- New parameters (tokenizer_id, tokenizer_mode, trust_remote_code, num_workers) added to the sample_random_requests() function signature; all are optional with backward-compatible defaults
Test Results
Tested with DeepSeek-R1 tokenizer (vocab_size=128,000), input_len=8192, output_len=1024, range_ratio=0.8:
Correctness Verification (2,048 prompts, serial vs parallel)
Performance (8 worker processes)
Statistical Consistency
Files Changed
- utils/bench_serving/benchmark_serving.py: added multiprocessing support for prompt generation (+152/-28 lines)
Usage
Reproducibility Verification
The parallel path is fully deterministic: given the same --seed and --random-num-workers, multiple runs produce byte-identical results.
Verified by running 3 consecutive executions with seed=0, num_workers=4, num_prompts=200, input_len=1024 and computing MD5 over all prompt texts and lengths:
This is guaranteed because:
- np.random.seed() makes input_lens, output_lens, offsets, and per-worker seeds all deterministic
- Each worker uses np.random.RandomState(seed) with its assigned fixed seed
- pool.map() returns results in chunk order (not completion order)
Backward Compatibility
- --random-num-workers 1 preserves exact original behavior
- Parallelism relies on the Python standard library (multiprocessing)
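The worker-count rules described for --random-num-workers could be resolved along these lines. This is a sketch under stated assumptions, not the code in the PR: resolve_num_workers and MAX_AUTO_WORKERS are hypothetical names, the cpu_count parameter exists only to make the function easy to test, and rejecting negative values is an assumption the PR description does not cover.

```python
import os

MAX_AUTO_WORKERS = 8  # cap for auto-selection, per min(cpu_count, 8)


def resolve_num_workers(requested, cpu_count=None):
    """Map the CLI value to an actual worker count.

    0 -> auto-select min(cpu_count, 8)
    1 -> serial execution (original behavior)
    N -> exactly N worker processes
    """
    if requested < 0:
        # Assumption: negative values are treated as invalid.
        raise ValueError("number of workers must be >= 0")
    if requested == 0:
        # os.cpu_count() can return None, so fall back to a single worker.
        cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
        return max(1, min(cpus, MAX_AUTO_WORKERS))
    return requested
```

With this shape, a caller can branch once on whether the resolved count equals 1 and take the untouched serial path in that case, which is what keeps the original behavior reachable.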