Defer InferenceAPIData gen to worker procs by jjk-g · Pull Request #157 · kubernetes-sigs/inference-perf

jjk-g · 2025-07-25T23:01:40Z

Summary

Loadgen architecture performance improvements (and qps accuracy improvements).
Achieves > 10k qps in some cases (large machine, random/synthetic/shared_prefix datasets, low request latency).
Improved schedule accuracy in low qps scenarios.

Fixes #188

Changes

Updates the default num_workers to num_cpus
Updates default worker_max_concurrency to 100
Sharegpt still uses the old path; a later PR can add support for preprocessing the dataset (~100s latency vs 6s for the current streaming method). Preprocessing is only really useful above 1000 qps.
Uses uvloop in worker processes as a drop-in event loop improvement.
Adds sleep(0) in the worker loop to yield the event loop.
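
A minimal sketch of the uvloop and sleep(0) changes, assuming a worker that polls a queue. The names `worker_loop` and `run_worker` and the queue layout are illustrative and not taken from the inference-perf code. The sketch falls back to the stock asyncio loop when uvloop is not installed.

```python
import asyncio
import queue

try:
    import uvloop  # optional; assumption: may not be installed everywhere
except ImportError:
    uvloop = None


async def worker_loop(requests: queue.Queue, handled: list) -> None:
    """Poll the queue until a None sentinel arrives (hypothetical protocol)."""
    while True:
        try:
            item = requests.get_nowait()
        except queue.Empty:
            # Nothing to do yet: yield so other tasks on this loop can run.
            await asyncio.sleep(0)
            continue
        if item is None:
            return
        handled.append(item)
        # Yield after each item so in-flight request tasks make progress.
        await asyncio.sleep(0)


def run_worker(requests: queue.Queue, handled: list) -> None:
    """Run the worker loop on a uvloop event loop when available."""
    loop = uvloop.new_event_loop() if uvloop is not None else asyncio.new_event_loop()
    try:
        loop.run_until_complete(worker_loop(requests, handled))
    finally:
        loop.close()
```

`asyncio.sleep(0)` does not delay the worker; it only hands control back to the event loop for one iteration.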

Future Recommendations

Automated load testing: [Testing] Add test cases for validating inference perf load generation #40
Configurable HF datagen preprocessing to support deferring get_request to workers: HF datagen preprocessing to support get_request #191
Enforce worker scheduling strategy like round robin to disperse requests across processes.
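
The round-robin recommendation could look like the sketch below. `assign_worker` is a hypothetical helper; this PR does not dispatch requests this way.

```python
from collections import Counter


def assign_worker(request_number: int, num_workers: int) -> int:
    """Map a request number to a worker index in round-robin order."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    return request_number % num_workers


# Ten requests over four workers: no worker gets more than one extra request.
counts = Counter(assign_worker(n, 4) for n in range(10))
```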

Testing Done

Later comments show acceptable scheduling accuracy conformance for:

High qps against low latency server
low qps against real vllm llama3-8b deployment
sharegpt low qps against llama3-8b deployment

See also the profiling and vllm metric comparison in #188.

Currently worker 0 queues all request data upfront via yield. Deferring the generation of data to the worker procs gives a significant perf improvement. This refactors each worker to have its own datagen instance and requires generating each request from its request number.
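
One way to picture generating a request from its request number: each worker builds its own generator and derives the request data from the index alone, so only small integers need to cross the process boundary. `IndexedDataGen`, `RequestData`, the `get_request` signature and the seeding scheme below are hypothetical, not the project's actual classes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestData:
    index: int
    prompt_len: int


class IndexedDataGen:
    """Hypothetical datagen that can build request n without building 0..n-1."""

    def __init__(self, seed: int, min_len: int = 1, max_len: int = 1024) -> None:
        self.seed = seed
        self.min_len = min_len
        self.max_len = max_len

    def get_request(self, n: int) -> RequestData:
        # Derive a per-request RNG from (seed, n) so the result depends only
        # on the request number, not on which worker asks or in what order.
        rng = random.Random(self.seed * 1_000_003 + n)
        return RequestData(index=n, prompt_len=rng.randint(self.min_len, self.max_len))


# Two workers with separate instances agree on request 7 without shared state.
worker_a = IndexedDataGen(seed=42)
worker_b = IndexedDataGen(seed=42)
same = worker_a.get_request(7) == worker_b.get_request(7)
```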

jjk-g · 2025-08-19T16:47:00Z

With the latest changes, hit nearly 20k qps (w/ some client-side errors) against an instantaneous echo server.

2025-08-19 16:43:43,968 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-08-19 16:44:36,534 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-08-19 16:44:42,157 - inference_perf.reportgen.base - WARNING - Prometheus Metrics Client is not configured or not of type PrometheusMetricsClient
2025-08-19 16:44:42,163 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-164306/summary_lifecycle_metrics.json
2025-08-19 16:44:42,163 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-164306/stage_0_lifecycle_metrics.json
root@test:/ip# cat  reports-20250819-164306/stage_0_lifecycle_metrics.json
{
  "load_summary": {
    "count": 600000,
    "schedule_accuracy": {
      "mean": 0.022521974029993532,
      "min": -0.0013766014017164707,
      "p10": -2.8226463473401947e-05,
      "p50": 0.004687890119384974,
      "p90": 0.05163855425780639,
      "max": 0.4181435744976625
    },
    "send_duration": 30.301377784984652,
    "requested_rate": 20000.0,
    "achieved_rate": 19801.07981417664
  },
  "successes": {
    "count": 599944,
    "latency": {
      "request_latency": {
        "mean": 0.12777751959306832,
        "min": 0.0015589899849146605,
        "p10": 0.01384555449767504,
        "p50": 0.059299415501300246,
        "p90": 0.23142807749682107,
        "max": 9.379575761995511
      },
      "normalized_time_per_output_token": {
        "mean": 0.0,
        "min": 0.0,
        "p10": 0.0,
        "p50": 0.0,
        "p90": 0.0,
        "max": 0.0
      },
      "time_per_output_token": null,
      "time_to_first_token": null,
      "inter_token_latency": null
    },
    "throughput": {
      "input_tokens_per_sec": 2976762.5758456974,
      "output_tokens_per_sec": 0.0,
      "total_tokens_per_sec": 2976762.5758456974,
      "requests_per_sec": 17390.151034252096
    },
    "prompt_len": {
      "mean": 171.17519968530397,
      "min": 1.0,
      "p10": 1.0,
      "p50": 139.0,
      "p90": 408.0,
      "max": 1084.0
    },
    "output_len": {
      "mean": 0.0,
      "min": 0.0,
      "p10": 0.0,
      "p50": 0.0,
      "p90": 0.0,
      "max": 0.0
    }
  },
  "failures": {
    "count": 56,
    "request_latency": {
      "mean": 1.9021719288014407,
      "min": 1.0866388019931037,
      "p10": 1.1052991239994299,
      "p50": 2.401181801498751,
      "p90": 2.4872083624941297,
      "max": 2.5177627649973147
    },
    "prompt_len": {
      "mean": 0.0,
      "min": 0.0,
      "p10": 0.0,
      "p50": 0.0,
      "p90": 0.0,
      "max": 0.0
    }
  }
}
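
The `achieved_rate` in this report is consistent with `count / send_duration`. A quick check using only values copied from the report above (the helper name is illustrative, and the formula is inferred from the numbers rather than read from the code):

```python
def achieved_rate(count: int, send_duration: float) -> float:
    """Requests sent divided by the time spent sending them, in requests/s."""
    return count / send_duration


rate = achieved_rate(600000, 30.301377784984652)
# Fraction by which the achieved rate falls short of the requested 20000 qps.
shortfall = 1 - rate / 20000.0
```

Under that reading, the run sent requests at roughly 1% below the requested rate.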

jjk-g · 2025-08-19T17:57:50Z

Testing against llama3-8b model server

2025-08-19 17:52:49,038 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
data:
  type: random
  path: null
  input_distribution:
    min: 1
    max: 1024
    mean: 132.64
    std: 165.23
  output_distribution:
    min: 1
    max: 1
    mean: 1
    std: 1
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 3
    duration: 30
  - rate: 4
    duration: 30
  - rate: 5
    duration: 30
  num_workers: 88
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics: null
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20250819-175248
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  base_url: http://llama3-8b-vllm-service:8000
  ignore_eos: true


2025-08-19 17:52:49,038 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20250819-175248
2025-08-19 17:52:49,044 - inference_perf.client.modelserver.vllm_client - INFO - Inferred model meta-llama/Meta-Llama-3-8B
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.6k/50.6k [00:00<00:00, 7.29MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 17.0MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 945kB/s]
2025-08-19 17:52:51,513 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
2025-08-19 17:53:22,645 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-08-19 17:53:23,648 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run started
2025-08-19 17:53:55,656 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run completed
2025-08-19 17:53:56,657 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run started
2025-08-19 17:54:28,661 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run completed
2025-08-19 17:54:29,664 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-08-19 17:54:29,675 - inference_perf.reportgen.base - WARNING - Prometheus Metrics Client is not configured or not of type PrometheusMetricsClient
2025-08-19 17:54:29,675 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/summary_lifecycle_metrics.json
2025-08-19 17:54:29,676 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/stage_0_lifecycle_metrics.json
2025-08-19 17:54:29,676 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/stage_1_lifecycle_metrics.json
2025-08-19 17:54:29,676 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/stage_2_lifecycle_metrics.json

root@test:/ip# cat reports-20250819-175248/stage_0_lifecycle_metrics.json
{
  "load_summary": {
    "count": 90,
    "schedule_accuracy": {
      "mean": 7.183565240767266e-05,
      "min": -0.0012097999278921634,
      "p10": -0.0004931023111566901,
      "p50": 0.00013036270684096962,
      "p90": 0.0006820293085183951,
      "max": 0.0011117425456177443
    },
    "send_duration": 29.997558503004257,
    "requested_rate": 3.0,
    "achieved_rate": 3.0002441695708835
  },
  "successes": {
    "count": 90,
    "latency": {
      "request_latency": {
        "mean": 0.13590163123444654,
        "min": 0.07111576999886893,
        "p10": 0.08528758568572811,
        "p50": 0.12025816801178735,
        "p90": 0.2022134727158118,
        "max": 0.3147395479900297
      },
      "normalized_time_per_output_token": {
        "mean": 0.06229598418948525,
        "min": 0.0,
        "p10": 0.0380347798607545,
        "p50": 0.05716220824979246,
        "p90": 0.09890849794610404,
        "max": 0.15736977399501484
      },
      "time_per_output_token": {
        "mean": 8.759600750636309e-05,
        "min": 2.0774663425982e-05,
        "p10": 6.540676501269142e-05,
        "p50": 8.843749916801849e-05,
        "p90": 0.00010638393384094041,
        "max": 0.00011664600848841171
      },
      "time_to_first_token": {
        "mean": 0.12451681615752427,
        "min": 0.07048727202345617,
        "p10": 0.07660025471413974,
        "p50": 0.10620090400334448,
        "p90": 0.18276557610952296,
        "max": 0.30085346099804156
      },
      "inter_token_latency": {
        "mean": 8.759600750636309e-05,
        "min": 4.645989974960685e-06,
        "p10": 7.4397918069735174e-06,
        "p50": 7.540651131421328e-05,
        "p90": 0.00019451440311968324,
        "max": 0.00029400098719634116
      }
    },
    "throughput": {
      "input_tokens_per_sec": 539.1644956094423,
      "output_tokens_per_sec": 5.576190301796743,
      "total_tokens_per_sec": 544.740685911239,
      "requests_per_sec": 2.9872448045339692
    },
    "prompt_len": {
      "mean": 180.48888888888888,
      "min": 2.0,
      "p10": 2.0,
      "p50": 137.0,
      "p90": 449.80000000000007,
      "max": 640.0
    },
    "output_len": {
      "mean": 1.8666666666666667,
      "min": 0.0,
      "p10": 2.0,
      "p50": 2.0,
      "p90": 2.0,
      "max": 2.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-175248/stage_1_lifecycle_metrics.json
{
  "load_summary": {
    "count": 120,
    "schedule_accuracy": {
      "mean": -0.00019664103383547625,
      "min": -0.0012719880614895374,
      "p10": -0.0007337450195336714,
      "p50": -0.00022769310453440994,
      "p90": 0.0004266469011781738,
      "max": 0.0007701559516135603
    },
    "send_duration": 29.887932301993715,
    "requested_rate": 4.0,
    "achieved_rate": 4.014998387559759
  },
  "successes": {
    "count": 120,
    "latency": {
      "request_latency": {
        "mean": 0.1475990853907812,
        "min": 0.06992501299828291,
        "p10": 0.0813140000012936,
        "p50": 0.13131745699502062,
        "p90": 0.23128018591378352,
        "max": 0.39448169301613234
      },
      "normalized_time_per_output_token": {
        "mean": 0.07135801408282229,
        "min": 0.0,
        "p10": 0.03813977015233831,
        "p50": 0.06420277099823579,
        "p90": 0.11564009295689176,
        "max": 0.19724084650806617
      },
      "time_per_output_token": {
        "mean": 8.305333022791376e-05,
        "min": 1.985333316649e-05,
        "p10": 5.68036960127453e-05,
        "p50": 8.27751694790398e-05,
        "p90": 0.0001044135259386773,
        "max": 0.00036030767175058526
      },
      "time_to_first_token": {
        "mean": 0.1425993932328614,
        "min": 0.06934425799408928,
        "p10": 0.07935923019831534,
        "p50": 0.12722604199370835,
        "p90": 0.22713207799824886,
        "max": 0.3771019450214226
      },
      "inter_token_latency": {
        "mean": 8.305333022791375e-05,
        "min": 4.257017280906439e-06,
        "p10": 6.35309552308172e-06,
        "p50": 4.9858499551191926e-05,
        "p90": 0.0002092103153700009,
        "max": 0.0010103499807883054
      }
    },
    "throughput": {
      "input_tokens_per_sec": 714.8689233788198,
      "output_tokens_per_sec": 7.70497515558124,
      "total_tokens_per_sec": 722.573898534401,
      "requests_per_sec": 3.985331977024779
    },
    "prompt_len": {
      "mean": 179.375,
      "min": 2.0,
      "p10": 2.0,
      "p50": 138.5,
      "p90": 444.1,
      "max": 641.0
    },
    "output_len": {
      "mean": 1.9333333333333333,
      "min": 0.0,
      "p10": 2.0,
      "p50": 2.0,
      "p90": 2.0,
      "max": 2.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-175248/stage_2_lifecycle_metrics.json
{
  "load_summary": {
    "count": 150,
    "schedule_accuracy": {
      "mean": -0.0003411002755941202,
      "min": -0.0011716120643541217,
      "p10": -0.0008971005561761558,
      "p50": -0.00035967130679637194,
      "p90": 0.0001696345338132232,
      "max": 0.0015637431642971933
    },
    "send_duration": 29.939004405983724,
    "requested_rate": 5.0,
    "achieved_rate": 5.010186643682127
  },
  "successes": {
    "count": 150,
    "latency": {
      "request_latency": {
        "mean": 0.15204639654645385,
        "min": 0.07000356499338523,
        "p10": 0.08212132930348162,
        "p50": 0.1427730229916051,
        "p90": 0.24229057717893737,
        "max": 0.3179700030013919
      },
      "normalized_time_per_output_token": {
        "mean": 0.07312806974659906,
        "min": 0.0,
        "p10": 0.04020805734762689,
        "p50": 0.06625330800306983,
        "p90": 0.11783517569565445,
        "max": 0.15898500150069594
      },
      "time_per_output_token": {
        "mean": 8.135460685783375e-05,
        "min": 1.7595671427746613e-05,
        "p10": 5.4437031697792314e-05,
        "p50": 7.70806655054912e-05,
        "p90": 0.00010656010126695036,
        "max": 0.00039195767021737993
      },
      "time_to_first_token": {
        "mean": 0.1495171119470615,
        "min": 0.06942152098054066,
        "p10": 0.08072178370493929,
        "p50": 0.14068766750278883,
        "p90": 0.24136214567988645,
        "max": 0.31703847702010535
      },
      "inter_token_latency": {
        "mean": 8.135460685783377e-05,
        "min": 5.253998097032309e-06,
        "p10": 6.276596104726196e-06,
        "p50": 5.0237998948432505e-05,
        "p90": 0.00020748540991917256,
        "max": 0.0011021530081052333
      }
    },
    "throughput": {
      "input_tokens_per_sec": 904.360704698067,
      "output_tokens_per_sec": 9.710363146601786,
      "total_tokens_per_sec": 914.0710678446688,
      "requests_per_sec": 4.988200246542013
    },
    "prompt_len": {
      "mean": 181.3,
      "min": 2.0,
      "p10": 2.0,
      "p50": 143.0,
      "p90": 436.4,
      "max": 641.0
    },
    "output_len": {
      "mean": 1.9466666666666668,
      "min": 0.0,
      "p10": 2.0,
      "p50": 2.0,
      "p90": 2.0,
      "max": 2.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

jjk-g · 2025-08-19T18:55:26Z

sharegpt:

2025-08-19 18:47:30,817 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
data:
  type: shareGPT
  path: null
  input_distribution:
    max: 1024
  output_distribution:
    max: 1024
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 3
    duration: 30
  - rate: 4
    duration: 30
  - rate: 5
    duration: 30
  num_workers: 88
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics: null
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20250819-184730
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  base_url: http://llama3-8b-vllm-service:8000
  ignore_eos: true


2025-08-19 18:47:30,817 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20250819-184730
2025-08-19 18:47:30,823 - inference_perf.client.modelserver.vllm_client - INFO - Inferred model meta-llama/Meta-Llama-3-8B
2025-08-19 18:47:39,188 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
2025-08-19 18:48:51,324 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-08-19 18:48:52,325 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run started
2025-08-19 18:50:20,329 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run completed
2025-08-19 18:50:21,334 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run started
2025-08-19 18:51:54,157 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run completed
2025-08-19 18:51:58,165 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-08-19 18:51:58,253 - inference_perf.reportgen.base - WARNING - Prometheus Metrics Client is not configured or not of type PrometheusMetricsClient
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/summary_lifecycle_metrics.json
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/stage_0_lifecycle_metrics.json
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/stage_1_lifecycle_metrics.json
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/stage_2_lifecycle_metrics.json
root@test:/ip# cat reports-20250819-184730/stage_0_lifecycle_metrics.json
{
  "load_summary": {
    "count": 90,
    "schedule_accuracy": {
      "mean": 5.221641460795783e-05,
      "min": -0.0009789488685782999,
      "p10": -0.0005377435736590996,
      "p50": 5.94295997871086e-05,
      "p90": 0.0006354953395202757,
      "max": 0.001245536986971274
    },
    "send_duration": 29.465435588004766,
    "requested_rate": 3.0,
    "achieved_rate": 3.0544262524541996
  },
  "successes": {
    "count": 90,
    "latency": {
      "request_latency": {
        "mean": 16.855602986932112,
        "min": 1.3469735749822576,
        "p10": 2.2618570482882205,
        "p50": 14.023260419504368,
        "p90": 34.246013917305426,
        "max": 56.9563091089949
      },
      "normalized_time_per_output_token": {
        "mean": 0.10431093692514658,
        "min": 0.07290401113192446,
        "p10": 0.08277983479058802,
        "p50": 0.10053955099979986,
        "p90": 0.13586304840555744,
        "max": 0.1928966081535551
      },
      "time_per_output_token": {
        "mean": 0.04859078957569721,
        "min": 0.03592505416358479,
        "p10": 0.040810231570446345,
        "p50": 0.04796033795054988,
        "p90": 0.05649722807534755,
        "max": 0.08334063840605244
      },
      "time_to_first_token": {
        "mean": 0.19460895551019347,
        "min": 0.07818793502519839,
        "p10": 0.09808347138750832,
        "p50": 0.1621460514870705,
        "p90": 0.3419241371011595,
        "max": 0.6196016960020643
      },
      "inter_token_latency": {
        "mean": 0.04502083609991919,
        "min": 4.388013621792197e-06,
        "p10": 1.766601635608822e-05,
        "p50": 7.605599239468575e-05,
        "p90": 0.0813809909945121,
        "max": 0.9274438979919069
      }
    },
    "throughput": {
      "input_tokens_per_sec": 239.83594952000615,
      "output_tokens_per_sec": 233.730261732462,
      "total_tokens_per_sec": 473.5662112524682,
      "requests_per_sec": 1.2749696076078296
    },
    "prompt_len": {
      "mean": 188.11111111111111,
      "min": 10.0,
      "p10": 14.9,
      "p50": 75.0,
      "p90": 442.5000000000003,
      "max": 1010.0
    },
    "output_len": {
      "mean": 183.32222222222222,
      "min": 10.0,
      "p10": 15.9,
      "p50": 134.0,
      "p90": 400.0,
      "max": 692.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-184730/stage_1_lifecycle_metrics.json
{
  "load_summary": {
    "count": 120,
    "schedule_accuracy": {
      "mean": 0.0006318472092971206,
      "min": -0.0020282877958379686,
      "p10": -0.0011430173093685881,
      "p50": -0.00034038534795399755,
      "p90": 0.00032524631242267807,
      "max": 0.121013759780908
    },
    "send_duration": 29.675006324978312,
    "requested_rate": 4.0,
    "achieved_rate": 4.043807057220154
  },
  "successes": {
    "count": 120,
    "latency": {
      "request_latency": {
        "mean": 25.566489407231952,
        "min": 1.2205224850040395,
        "p10": 3.4664570443856064,
        "p50": 21.11329986547935,
        "p90": 50.897888788004664,
        "max": 82.11318211900652
      },
      "normalized_time_per_output_token": {
        "mean": 0.1394324919548036,
        "min": 0.08078520659632298,
        "p10": 0.09461560029375231,
        "p50": 0.12584894262082813,
        "p90": 0.18335988017058566,
        "max": 0.49897901211321977
      },
      "time_per_output_token": {
        "mean": 0.0614751124171674,
        "min": 0.03984124532503313,
        "p10": 0.04535869591081229,
        "p50": 0.057812230116445706,
        "p90": 0.07751729355019848,
        "max": 0.12403310752225756
      },
      "time_to_first_token": {
        "mean": 0.4935139758073395,
        "min": 0.07457844298915006,
        "p10": 0.11656759349571075,
        "p50": 0.25556120299734175,
        "p90": 1.0773074870056014,
        "max": 3.208106656005839
      },
      "inter_token_latency": {
        "mean": 0.05448512702157385,
        "min": 2.9660004656761885e-06,
        "p10": 1.632000203244388e-05,
        "p50": 7.719801214989275e-05,
        "p90": 0.09491028750198893,
        "max": 1.9733509530196898
      }
    },
    "throughput": {
      "input_tokens_per_sec": 394.5423113100554,
      "output_tokens_per_sec": 315.74230001527286,
      "total_tokens_per_sec": 710.2846113253283,
      "requests_per_sec": 1.3844804327048177
    },
    "prompt_len": {
      "mean": 284.975,
      "min": 10.0,
      "p10": 14.9,
      "p50": 167.5,
      "p90": 729.0000000000002,
      "max": 1022.0
    },
    "output_len": {
      "mean": 228.05833333333334,
      "min": 9.0,
      "p10": 18.9,
      "p50": 170.0,
      "p90": 491.40000000000055,
      "max": 901.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-184730/stage_2_lifecycle_metrics.json
{
  "load_summary": {
    "count": 150,
    "schedule_accuracy": {
      "mean": -0.00029434598186829435,
      "min": -0.0012856500106863678,
      "p10": -0.0008751995075726882,
      "p50": -0.00030727076227776706,
      "p90": 0.00021163554338272657,
      "max": 0.0010848964448086917
    },
    "send_duration": 29.14394703900325,
    "requested_rate": 5.0,
    "achieved_rate": 5.146866338977884
  },
  "successes": {
    "count": 150,
    "latency": {
      "request_latency": {
        "mean": 21.323471702333965,
        "min": 1.3652286779833958,
        "p10": 2.405568425502861,
        "p50": 17.431325447003474,
        "p90": 45.278694986019396,
        "max": 82.03417063000961
      },
      "normalized_time_per_output_token": {
        "mean": 0.1444949032553297,
        "min": 0.0803363197732064,
        "p10": 0.09066429713280562,
        "p50": 0.1333252786694974,
        "p90": 0.20589139041799956,
        "max": 0.3611066017797889
      },
      "time_per_output_token": {
        "mean": 0.06638236991747913,
        "min": 0.03937716875109712,
        "p10": 0.04463379044102219,
        "p50": 0.06303210961626521,
        "p90": 0.09392862158313589,
        "max": 0.13958217461948239
      },
      "time_to_first_token": {
        "mean": 0.2435184809875985,
        "min": 0.08514626399846748,
        "p10": 0.11298617671418469,
        "p50": 0.21930343950225506,
        "p90": 0.4126431950018741,
        "max": 0.5618777420022525
      },
      "inter_token_latency": {
        "mean": 0.05498562843920867,
        "min": 2.8579961508512497e-06,
        "p10": 1.589400926604867e-05,
        "p50": 7.072500011418015e-05,
        "p90": 0.09152998790377752,
        "max": 1.5335636739910115
      }
    },
    "throughput": {
      "input_tokens_per_sec": 392.5527475368729,
      "output_tokens_per_sec": 313.4243855897332,
      "total_tokens_per_sec": 705.9771331266061,
      "requests_per_sec": 1.6489655865616772
    },
    "prompt_len": {
      "mean": 238.06,
      "min": 10.0,
      "p10": 16.0,
      "p50": 117.0,
      "p90": 634.6999999999999,
      "max": 1012.0
    },
    "output_len": {
      "mean": 190.07333333333332,
      "min": 7.0,
      "p10": 12.9,
      "p50": 135.0,
      "p90": 470.9,
      "max": 946.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

jjk-g · 2025-08-20T16:23:44Z

Added future thoughts on testing in #40

Add uvloop as drop in replacement for asyncio event loop
Fix timer to more reliably hit specified duration
Add sleep to reduce CPU contention of queue polling

achandrasekar · 2025-08-20T21:31:19Z

/lgtm
/approve

k8s-ci-robot · 2025-08-20T21:31:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achandrasekar, jjk-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [achandrasekar]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2025

k8s-ci-robot requested a review from ArangoGutierrez July 25, 2025 23:01

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 25, 2025

k8s-ci-robot requested a review from SergeyKanzhelev July 25, 2025 23:01

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 25, 2025

jjk-g force-pushed the 10kqps branch from f4f12c2 to 03b5ac7 Compare July 31, 2025 01:09

jjk-g changed the title ~~[WIP] Defer InferenceAPIData gen to worker procs~~ Defer InferenceAPIData gen to worker procs Jul 31, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2025

jjk-g mentioned this pull request Aug 18, 2025

Loadgen asyncio task execution discrepancies can alter lifecycle results #188

Closed

k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 19, 2025

jjk-g added 3 commits August 19, 2025 16:04

Support sharegpt

93600b1

Update default worker concurrency

15775bb

jjk-g force-pushed the 10kqps branch from db58f88 to 78fc188 Compare August 19, 2025 16:07

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 19, 2025

jjk-g force-pushed the 10kqps branch 2 times, most recently from a7a05c5 to 232e560 Compare August 19, 2025 16:27

jjk-g changed the title ~~Defer InferenceAPIData gen to worker procs~~ [WIP] Defer InferenceAPIData gen to worker procs Aug 19, 2025

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 19, 2025

jjk-g force-pushed the 10kqps branch from 232e560 to af12f81 Compare August 19, 2025 16:45

jjk-g force-pushed the 10kqps branch from af12f81 to e5099ea Compare August 19, 2025 16:57

jjk-g force-pushed the 10kqps branch from e5099ea to 7153660 Compare August 19, 2025 18:53

jjk-g mentioned this pull request Aug 20, 2025

[Testing] Add test cases for validating inference perf load generation #40

Open

jjk-g changed the title ~~[WIP] Defer InferenceAPIData gen to worker procs~~ Defer InferenceAPIData gen to worker procs Aug 20, 2025

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2025

jjk-g mentioned this pull request Aug 20, 2025

HF datagen preprocessing to support get_request #191

Closed

achandrasekar reviewed Aug 20, 2025

Comment thread inference_perf/datagen/shared_prefix_datagen.py

Perf improvements

75616e5

Add uvloop as drop in replacement for asyncio event loop
Fix timer to more reliably hit specified duration
Add sleep to reduce CPU contention of queue polling

jjk-g force-pushed the 10kqps branch from 7153660 to 75616e5 Compare August 20, 2025 21:22

k8s-ci-robot assigned achandrasekar Aug 20, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 20, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2025

achandrasekar merged commit 5e1649d into kubernetes-sigs:main Aug 20, 2025
3 of 4 checks passed

jjk-g mentioned this pull request Sep 4, 2025

Reduce CPU utilization of waiting workers #204

Merged

achandrasekar merged 4 commits into kubernetes-sigs:main from jjk-g:10kqps
