FineServe is an in-the-wild, multi-model LLM serving workload dataset collected from a global commercial marketplace. It enables fine-grained characterization of real-world serving dynamics across heterogeneous models and tasks, and provides a workload generator for benchmarking multi-model serving platforms.
- Realistic workloads: derived from desensitized, real-world LLM serving traces covering 57 models across 10 task categories.
- Fine-grained control: configurable request arrival patterns and token-length behaviors for benchmarking.
- Two generation modes:
  - Replay: directly replay desensitized traces collected from real systems.
  - Parametric: synthesize workloads at arbitrary scale from statistical parameters.
```
FineServe/
├─ Generator.py        # Workload generator & benchmark runner
├─ datasets/
│  ├─ replay/          # Real trace CSVs
│  └─ Parametric/      # Parametric workload configurations
│     ├─ gamma/        # Per-window Gamma parameters
│     └─ nb/           # Example parameter files (e.g., NB)
├─ README.md
```
FineServe captures real-world workloads from a diverse set of production LLMs, covering both Dense and Mixture-of-Experts (MoE) architectures across multiple parameter scales.
The dataset includes multiple representative models deployed in real-world systems. To keep the documentation modular, a detailed breakdown of model categories and corresponding model instances is provided in a separate document.
See `metadata/model.md` for the full model list and taxonomy.
FineServe also captures diverse user intents by categorizing requests into 10 representative task types, reflecting real-world usage patterns.
Task labels are obtained through an automated classification pipeline and are used to characterize workload heterogeneity across different application scenarios.
See `metadata/task.md` for the full task taxonomy and definitions.
In replay mode, FineServe consumes a CSV file that records desensitized request traces. Each row corresponds to a single request and captures both token-level characteristics and metadata.
- `input_length` (int): number of input tokens.
- `output_length` (int): number of generated tokens.
- `latency` (float): observed request latency in seconds.
- `timestamp` (float): arrival time of the request, in milliseconds. This is a relative timestamp; the first request typically starts at `t = 0.0`.
- `architecture` (string): model architecture type, e.g., `Dense` or `MoE`.
- `scale` (string): model size category, e.g., `<10B`, `10to30B`, `30to100B`, `>100B`.
| input_length | output_length | latency | timestamp | architecture | scale |
|---|---|---|---|---|---|
| 631 | 64 | 0.0021 | 0.0 | MoE | >100B |
| 763 | 160 | 0.0022 | 5799279.0 | MoE | >100B |
| 2794 | 80 | 0.0012 | 5827389.0 | MoE | >100B |
| 4766 | 403 | 0.0013 | 5828345.0 | MoE | >100B |
| 5484 | 125 | 0.0013 | 5851573.0 | MoE | >100B |
- Requests are replayed based on inter-arrival gaps derived from `timestamp`.
- Only `input_length`, `output_length`, and `timestamp` are required for replay.
- Additional fields (e.g., `latency`, `architecture`, `scale`) are optional and mainly used for analysis, filtering, or stratified workload generation.
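To make the replay semantics concrete, the sketch below (illustrative only, not FineServe's actual implementation) derives inter-arrival gaps in seconds from the `timestamp` column; the `time_scale` parameter mirrors the effect of the `--time-scale` option:

```python
import csv
import io

# Illustrative sketch (not FineServe code): derive inter-arrival gaps in
# seconds from a replay-format trace. Timestamps are relative milliseconds;
# a larger time_scale compresses the gaps (higher effective request rate).
def interarrival_gaps_seconds(csv_text, time_scale=1.0):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ts = [float(r["timestamp"]) for r in rows]
    return [(b - a) / 1000.0 / time_scale for a, b in zip(ts, ts[1:])]

trace = """input_length,output_length,timestamp
631,64,0.0
763,160,2000.0
2794,80,3500.0
"""
print(interarrival_gaps_seconds(trace))       # [2.0, 1.5]
print(interarrival_gaps_seconds(trace, 2.0))  # [1.0, 0.75]
```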
- Python 3.9+
- A serving backend compatible with the OpenAI Completions API (e.g., vLLM)
- `--backend`: backend type (`vllm`)
- `--host`, `--port`: serving endpoint
- `--model`: model path
- `--tokenizer`: tokenizer path
- `--dataset-path`: prompt source dataset
- `--mode`: request generation mode (`poisson`, `replay`, or `parametric`)
Poisson mode generates requests according to a Poisson arrival process with request rate `--request-rate` (req/s). Prompts are sampled from the dataset specified by `--dataset-path`.
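A Poisson arrival process at rate λ has i.i.d. exponential inter-arrival gaps with mean 1/λ. A minimal sketch of how such arrival times can be generated (illustrative only, not the generator's internals):

```python
import random

# Sketch (not FineServe code): Poisson arrivals at `rate` req/s are
# equivalent to i.i.d. exponential inter-arrival gaps with mean 1/rate.
def poisson_arrival_times(rate, n, seed=0):
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)  # gap ~ Exp(rate)
        times.append(t)
    return times

times = poisson_arrival_times(rate=50, n=1000)
print(times[-1])  # total duration; around n / rate = 20 s in expectation
```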
```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode poisson \
  --request-rate 50 \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer" \
  --num-prompts 1000
```

Replay mode replays request arrivals from a CSV trace. The generator reproduces arrival times from `timestamp`, while input and output lengths are controlled by the trace.
```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode replay \
  --request-trace-csv "datasets/replay/Dense_10to30B.csv" \
  --time-scale 1.0 \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer"
```

Optional replay arguments:
- `--timestamp-column` (default: `timestamp`)
- `--input-length-column` (default: `input_length`)
- `--output-length-column` (default: `output_length`)
- `--time-scale` (default: `1.0`): controls time compression of the replay trace. A larger value compresses inter-arrival intervals, resulting in a higher effective request rate and increased concurrency.
Parametric mode synthesizes request arrivals using parameterized statistical models. By default, it uses Gamma parameters for inter-arrival timing and can optionally enable NB-based micro-burst modeling for burst-dominant workloads. See Parametric Modeling Details for the full formulation.
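The Gamma-based timing can be sketched as follows; the `(shape, scale)` values below are made up for illustration and do not come from FineServe's parameter files:

```python
import random

# Illustrative sketch only: draw inter-arrival gaps within one window
# from a Gamma(shape, scale) distribution, as the parametric mode's
# timing model does. Parameter values are invented for demonstration.
def gamma_arrivals(shape, scale, window_seconds, seed=0):
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.gammavariate(shape, scale)  # gap in seconds
        if t >= window_seconds:
            break
        times.append(t)
    return times

times = gamma_arrivals(shape=0.5, scale=0.1, window_seconds=300)
# shape < 1 yields a bursty process; mean gap = shape * scale = 0.05 s,
# so roughly 6000 arrivals per 300 s window.
print(len(times))
```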
You can start parametric mode in two ways:
Specifying `--model-class` automatically resolves the corresponding Gamma parameter file under `datasets/Parametric/gamma/`.
```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode parametric \
  --model-class Dense_lt10B \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer" \
  --num-prompts 1000
```

Alternatively, specify a Gamma parameter file directly:

```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode parametric \
  --gamma-params-csv "datasets/Parametric/gamma/Dense_lt10B_gamma_5min.csv" \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer" \
  --num-prompts 1000
```
- `--parametric-extension-mode`: how to extend finite Gamma rows to longer runs. Choices:
  - `repeat_jitter` (default): repeat windows cyclically and add small multiplicative perturbations
  - `model_sample`: fit a global lognormal model and sample synthetic Gamma parameters per window
- `--parametric-jitter-ratio`: relative noise level for `repeat_jitter` (default: `0.05`)
- `--parametric-window-seconds`: duration of each parametric window in seconds (default: `300`)
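The `repeat_jitter` extension can be sketched as follows (assumed, simplified behavior based on the description above; the window values are invented for illustration):

```python
import random

# Sketch of the repeat_jitter idea (assumed behavior, simplified):
# cycle through the finite list of per-window Gamma (shape, scale) rows
# and apply small multiplicative noise so repeated windows differ slightly.
def extend_repeat_jitter(windows, n_windows, jitter_ratio=0.05, seed=0):
    rng = random.Random(seed)
    out = []
    for i in range(n_windows):
        shape, scale = windows[i % len(windows)]  # cyclic repeat
        noise = 1.0 + rng.uniform(-jitter_ratio, jitter_ratio)
        out.append((shape * noise, scale * noise))
    return out

base = [(0.8, 0.05), (1.2, 0.02)]  # two example windows
extended = extend_repeat_jitter(base, n_windows=6, jitter_ratio=0.10)
print(len(extended))  # 6
```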
Example using `repeat_jitter` with a custom jitter ratio:
```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode parametric \
  --model-class Dense_lt10B \
  --parametric-extension-mode repeat_jitter \
  --parametric-jitter-ratio 0.10 \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer" \
  --num-prompts 1000
```

Example using `model_sample`:
```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode parametric \
  --model-class Dense_10to30B \
  --parametric-extension-mode model_sample \
  --parametric-window-seconds 300 \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer" \
  --num-prompts 1000
```

For burst-dominant workloads such as `Dense_lt10B`, NB-based 1 ms micro-burst modeling is automatically enabled unless explicitly disabled.
- `--parametric-nb-json`: manually specify an NB JSON file
- `--disable-parametric-nb-lt10b`: disable automatic NB enablement for `lt10B`
Example with manual NB override:
```bash
python Generator.py \
  --backend vllm \
  --host 127.0.0.1 \
  --port 8000 \
  --mode parametric \
  --model-class Dense_lt10B \
  --parametric-nb-json "datasets/Parametric/nb/Dense_lt10B_nb.json" \
  --dataset-path "/path/to/dataset.json" \
  --model "/path/to/model" \
  --tokenizer "/path/to/tokenizer" \
  --num-prompts 1000
```

- In `poisson` mode, request arrivals are controlled by `--request-rate`.
- In `replay` mode, request arrivals are driven by the timestamps in `--request-trace-csv`.
- In `parametric` mode, arrivals are synthesized from Gamma parameters loaded from `--gamma-params-csv`.
- Prompts are always sourced from `--dataset-path`, while request lengths and arrival patterns depend on the selected mode.
FineServe is constructed from real-world LLM serving traces collected from a commercial platform. We take data privacy and ethical considerations seriously and adopt strict measures to ensure that no sensitive or personally identifiable information is exposed.
- The released dataset does not contain any raw user prompts or model responses.
- No user-identifiable information is included, such as:
- user names
- user IDs
- IP addresses
- geographic locations
- device or account information
- All released fields (e.g., `input_length`, `output_length`) are derived from system-level logs rather than raw content.
- These features are content-agnostic and only reflect token-level statistics.
- No semantic information from user queries is retained.
- Request timestamps are fully de-identified:
  - Converted to relative time (starting from `t = 0`)
  - No absolute wall-clock time is preserved
- This prevents reconstruction of real-world user activity patterns.
- All reported statistics, figures, and distributions are derived from sampled subsets of the original data, rather than the full production dataset.
- This sampling process further reduces the risk of reconstructing user behavior patterns or rare events.
- The released data is therefore statistically representative but not individually traceable.
- Task-related labels and distributions are obtained through an automated system-level classification pipeline.
- The classification is performed on transient data during processing, and only the aggregated or labeled results are retained.
- No raw inputs used for classification are stored or released.
- As a result, the dataset does not support reverse tracing from labels to original user requests.
- This dataset is released solely for research purposes, including:
- LLM serving system evaluation
- workload modeling and simulation
- scheduling and systems optimization
