Defer InferenceAPIData gen to worker procs#157

Merged
achandrasekar merged 4 commits into
kubernetes-sigs:mainfrom
jjk-g:10kqps
Aug 20, 2025

Conversation

@jjk-g
Collaborator

@jjk-g jjk-g commented Jul 25, 2025

Summary

Loadgen architecture performance improvements (and qps accuracy improvements).
Achieves > 10k qps in some cases (large machine, random/synthetic/shared_prefix datasets, low request latency).
Improved schedule accuracy in low qps scenarios.

Fixes #188

Changes

  • Updates default num_workers to num_cpus
  • Updates default worker_max_concurrency to 100
  • ShareGPT keeps using the old path; a later PR can add support for preprocessing the dataset (~100s latency vs. 6s for the current streaming method). Preprocessing is only really useful above 1000 qps.
  • Uses uvloop in worker processes as a drop-in event loop improvement.
  • Adds sleep(0) in worker loop to yield event loop
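
The last two bullets can be illustrated with a minimal sketch. This is not the PR's code: `run_worker`, `worker_loop`, and `fake_request` are hypothetical names, the in-process `queue.Queue` stands in for whatever queue the workers actually read from, and uvloop is used only if it happens to be installed.

```python
import asyncio
import queue


async def fake_request(number, results):
    # Hypothetical stand-in for sending one request; a real worker would do
    # network I/O here.
    await asyncio.sleep(0)
    results.append(number)


async def worker_loop(q, results):
    tasks = []
    while True:
        try:
            item = q.get_nowait()  # non-blocking poll of the work queue
        except queue.Empty:
            # Nothing to do yet: yield so in-flight request tasks can progress.
            await asyncio.sleep(0)
            continue
        if item is None:  # sentinel (assumed convention): no more work
            break
        tasks.append(asyncio.create_task(fake_request(item, results)))
        # Yield after scheduling so the new task gets a chance to start.
        await asyncio.sleep(0)
    await asyncio.gather(*tasks)


def run_worker(q, results):
    # Prefer a uvloop event loop when available, else fall back to asyncio's.
    try:
        import uvloop
        loop = uvloop.new_event_loop()
    except ImportError:
        loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(worker_loop(q, results))
    finally:
        loop.close()
```

Without the `sleep(0)` calls, a loop that only polls the queue never returns control to the event loop, so tasks it has already created would not run until the loop exits.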

Future Recommendations

Testing Done

Later comments show acceptable scheduling accuracy conformance for:

  • High qps against low latency server
  • low qps against real vllm llama3-8b deployment
  • sharegpt low qps against llama3-8b deployment

Profiling and a vLLM metric comparison are also available in #188.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 25, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 25, 2025
@jjk-g jjk-g changed the title [WIP] Defer InferenceAPIData gen to worker procs Defer InferenceAPIData gen to worker procs Jul 31, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 31, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 19, 2025
jjk-g added 3 commits August 19, 2025 16:04
Currently worker 0 queues all request data upfront via yield.
Significant perf improvement deferring the generation of data
to the worker procs.

This refactors each worker to have their own datagen instance
and requires generating the request per request number.
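
A rough sketch of what "generating the request per request number" could look like, under stated assumptions: `RequestData`, `WorkerDataGen`, and `worker_handle` are hypothetical names, and seeding from the request number is just one way to make the output independent of which worker handles it. The PR's actual datagen classes may work differently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestData:
    request_number: int
    prompt_len: int


class WorkerDataGen:
    """Hypothetical per-worker generator; each worker builds its own instance."""

    def __init__(self, seed: int, min_len: int = 1, max_len: int = 1024) -> None:
        self.seed = seed
        self.min_len = min_len
        self.max_len = max_len

    def generate(self, request_number: int) -> RequestData:
        # Derive the RNG state from (seed, request_number) so the result does
        # not depend on which worker runs this or on the order of calls.
        rng = random.Random(self.seed * 1_000_003 + request_number)
        return RequestData(request_number, rng.randint(self.min_len, self.max_len))


def worker_handle(datagen: WorkerDataGen, request_numbers):
    # The queue only needs to carry small integers; payloads are built here,
    # inside the worker, rather than upfront in a single process.
    return [datagen.generate(n) for n in request_numbers]
```

With this shape, the scheduling process only enqueues request numbers, and the cost of building each payload is spread across the worker processes.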
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 19, 2025
@jjk-g jjk-g force-pushed the 10kqps branch 2 times, most recently from a7a05c5 to 232e560 Compare August 19, 2025 16:27
@jjk-g jjk-g changed the title Defer InferenceAPIData gen to worker procs [WIP] Defer InferenceAPIData gen to worker procs Aug 19, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 19, 2025
@jjk-g
Collaborator Author

jjk-g commented Aug 19, 2025

With the latest changes, the load generator hit nearly 20k qps (with some client-side errors) against an instantaneous echo server.

2025-08-19 16:43:43,968 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-08-19 16:44:36,534 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-08-19 16:44:42,157 - inference_perf.reportgen.base - WARNING - Prometheus Metrics Client is not configured or not of type PrometheusMetricsClient
2025-08-19 16:44:42,163 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-164306/summary_lifecycle_metrics.json
2025-08-19 16:44:42,163 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-164306/stage_0_lifecycle_metrics.json
root@test:/ip# cat  reports-20250819-164306/stage_0_lifecycle_metrics.json
{
  "load_summary": {
    "count": 600000,
    "schedule_accuracy": {
      "mean": 0.022521974029993532,
      "min": -0.0013766014017164707,
      "p10": -2.8226463473401947e-05,
      "p50": 0.004687890119384974,
      "p90": 0.05163855425780639,
      "max": 0.4181435744976625
    },
    "send_duration": 30.301377784984652,
    "requested_rate": 20000.0,
    "achieved_rate": 19801.07981417664
  },
  "successes": {
    "count": 599944,
    "latency": {
      "request_latency": {
        "mean": 0.12777751959306832,
        "min": 0.0015589899849146605,
        "p10": 0.01384555449767504,
        "p50": 0.059299415501300246,
        "p90": 0.23142807749682107,
        "max": 9.379575761995511
      },
      "normalized_time_per_output_token": {
        "mean": 0.0,
        "min": 0.0,
        "p10": 0.0,
        "p50": 0.0,
        "p90": 0.0,
        "max": 0.0
      },
      "time_per_output_token": null,
      "time_to_first_token": null,
      "inter_token_latency": null
    },
    "throughput": {
      "input_tokens_per_sec": 2976762.5758456974,
      "output_tokens_per_sec": 0.0,
      "total_tokens_per_sec": 2976762.5758456974,
      "requests_per_sec": 17390.151034252096
    },
    "prompt_len": {
      "mean": 171.17519968530397,
      "min": 1.0,
      "p10": 1.0,
      "p50": 139.0,
      "p90": 408.0,
      "max": 1084.0
    },
    "output_len": {
      "mean": 0.0,
      "min": 0.0,
      "p10": 0.0,
      "p50": 0.0,
      "p90": 0.0,
      "max": 0.0
    }
  },
  "failures": {
    "count": 56,
    "request_latency": {
      "mean": 1.9021719288014407,
      "min": 1.0866388019931037,
      "p10": 1.1052991239994299,
      "p50": 2.401181801498751,
      "p90": 2.4872083624941297,
      "max": 2.5177627649973147
    },
    "prompt_len": {
      "mean": 0.0,
      "min": 0.0,
      "p10": 0.0,
      "p50": 0.0,
      "p90": 0.0,
      "max": 0.0
    }
  }
}
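
As a reading aid for the report above: the numbers are consistent with `achieved_rate` being `count / send_duration`. That relationship is inferred from the values, not taken from the tool's documentation, and the helper names below are hypothetical.

```python
def achieved_rate(count: int, send_duration: float) -> float:
    # Assumed relationship: requests sent divided by the time spent sending.
    return count / send_duration


def failure_fraction(failures: int, count: int) -> float:
    return failures / count


# Values copied from the stage_0 report above.
rate = achieved_rate(600000, 30.301377784984652)
print(round(rate, 2))  # 19801.08

# Successes and failures add up to the total request count.
assert 599944 + 56 == 600000
print(failure_fraction(56, 600000))
```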

@jjk-g
Collaborator Author

jjk-g commented Aug 19, 2025

Testing against llama3-8b model server

2025-08-19 17:52:49,038 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
data:
  type: random
  path: null
  input_distribution:
    min: 1
    max: 1024
    mean: 132.64
    std: 165.23
  output_distribution:
    min: 1
    max: 1
    mean: 1
    std: 1
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 3
    duration: 30
  - rate: 4
    duration: 30
  - rate: 5
    duration: 30
  num_workers: 88
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics: null
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20250819-175248
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  base_url: http://llama3-8b-vllm-service:8000
  ignore_eos: true


2025-08-19 17:52:49,038 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20250819-175248
2025-08-19 17:52:49,044 - inference_perf.client.modelserver.vllm_client - INFO - Inferred model meta-llama/Meta-Llama-3-8B
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.6k/50.6k [00:00<00:00, 7.29MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 17.0MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 945kB/s]
2025-08-19 17:52:51,513 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
2025-08-19 17:53:22,645 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-08-19 17:53:23,648 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run started
2025-08-19 17:53:55,656 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run completed
2025-08-19 17:53:56,657 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run started
2025-08-19 17:54:28,661 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run completed
2025-08-19 17:54:29,664 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-08-19 17:54:29,675 - inference_perf.reportgen.base - WARNING - Prometheus Metrics Client is not configured or not of type PrometheusMetricsClient
2025-08-19 17:54:29,675 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/summary_lifecycle_metrics.json
2025-08-19 17:54:29,676 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/stage_0_lifecycle_metrics.json
2025-08-19 17:54:29,676 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/stage_1_lifecycle_metrics.json
2025-08-19 17:54:29,676 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-175248/stage_2_lifecycle_metrics.json

root@test:/ip# cat reports-20250819-175248/stage_0_lifecycle_metrics.json
{
  "load_summary": {
    "count": 90,
    "schedule_accuracy": {
      "mean": 7.183565240767266e-05,
      "min": -0.0012097999278921634,
      "p10": -0.0004931023111566901,
      "p50": 0.00013036270684096962,
      "p90": 0.0006820293085183951,
      "max": 0.0011117425456177443
    },
    "send_duration": 29.997558503004257,
    "requested_rate": 3.0,
    "achieved_rate": 3.0002441695708835
  },
  "successes": {
    "count": 90,
    "latency": {
      "request_latency": {
        "mean": 0.13590163123444654,
        "min": 0.07111576999886893,
        "p10": 0.08528758568572811,
        "p50": 0.12025816801178735,
        "p90": 0.2022134727158118,
        "max": 0.3147395479900297
      },
      "normalized_time_per_output_token": {
        "mean": 0.06229598418948525,
        "min": 0.0,
        "p10": 0.0380347798607545,
        "p50": 0.05716220824979246,
        "p90": 0.09890849794610404,
        "max": 0.15736977399501484
      },
      "time_per_output_token": {
        "mean": 8.759600750636309e-05,
        "min": 2.0774663425982e-05,
        "p10": 6.540676501269142e-05,
        "p50": 8.843749916801849e-05,
        "p90": 0.00010638393384094041,
        "max": 0.00011664600848841171
      },
      "time_to_first_token": {
        "mean": 0.12451681615752427,
        "min": 0.07048727202345617,
        "p10": 0.07660025471413974,
        "p50": 0.10620090400334448,
        "p90": 0.18276557610952296,
        "max": 0.30085346099804156
      },
      "inter_token_latency": {
        "mean": 8.759600750636309e-05,
        "min": 4.645989974960685e-06,
        "p10": 7.4397918069735174e-06,
        "p50": 7.540651131421328e-05,
        "p90": 0.00019451440311968324,
        "max": 0.00029400098719634116
      }
    },
    "throughput": {
      "input_tokens_per_sec": 539.1644956094423,
      "output_tokens_per_sec": 5.576190301796743,
      "total_tokens_per_sec": 544.740685911239,
      "requests_per_sec": 2.9872448045339692
    },
    "prompt_len": {
      "mean": 180.48888888888888,
      "min": 2.0,
      "p10": 2.0,
      "p50": 137.0,
      "p90": 449.80000000000007,
      "max": 640.0
    },
    "output_len": {
      "mean": 1.8666666666666667,
      "min": 0.0,
      "p10": 2.0,
      "p50": 2.0,
      "p90": 2.0,
      "max": 2.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-175248/stage_1_lifecycle_metrics.json
{
  "load_summary": {
    "count": 120,
    "schedule_accuracy": {
      "mean": -0.00019664103383547625,
      "min": -0.0012719880614895374,
      "p10": -0.0007337450195336714,
      "p50": -0.00022769310453440994,
      "p90": 0.0004266469011781738,
      "max": 0.0007701559516135603
    },
    "send_duration": 29.887932301993715,
    "requested_rate": 4.0,
    "achieved_rate": 4.014998387559759
  },
  "successes": {
    "count": 120,
    "latency": {
      "request_latency": {
        "mean": 0.1475990853907812,
        "min": 0.06992501299828291,
        "p10": 0.0813140000012936,
        "p50": 0.13131745699502062,
        "p90": 0.23128018591378352,
        "max": 0.39448169301613234
      },
      "normalized_time_per_output_token": {
        "mean": 0.07135801408282229,
        "min": 0.0,
        "p10": 0.03813977015233831,
        "p50": 0.06420277099823579,
        "p90": 0.11564009295689176,
        "max": 0.19724084650806617
      },
      "time_per_output_token": {
        "mean": 8.305333022791376e-05,
        "min": 1.985333316649e-05,
        "p10": 5.68036960127453e-05,
        "p50": 8.27751694790398e-05,
        "p90": 0.0001044135259386773,
        "max": 0.00036030767175058526
      },
      "time_to_first_token": {
        "mean": 0.1425993932328614,
        "min": 0.06934425799408928,
        "p10": 0.07935923019831534,
        "p50": 0.12722604199370835,
        "p90": 0.22713207799824886,
        "max": 0.3771019450214226
      },
      "inter_token_latency": {
        "mean": 8.305333022791375e-05,
        "min": 4.257017280906439e-06,
        "p10": 6.35309552308172e-06,
        "p50": 4.9858499551191926e-05,
        "p90": 0.0002092103153700009,
        "max": 0.0010103499807883054
      }
    },
    "throughput": {
      "input_tokens_per_sec": 714.8689233788198,
      "output_tokens_per_sec": 7.70497515558124,
      "total_tokens_per_sec": 722.573898534401,
      "requests_per_sec": 3.985331977024779
    },
    "prompt_len": {
      "mean": 179.375,
      "min": 2.0,
      "p10": 2.0,
      "p50": 138.5,
      "p90": 444.1,
      "max": 641.0
    },
    "output_len": {
      "mean": 1.9333333333333333,
      "min": 0.0,
      "p10": 2.0,
      "p50": 2.0,
      "p90": 2.0,
      "max": 2.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-175248/stage_2_lifecycle_metrics.json
{
  "load_summary": {
    "count": 150,
    "schedule_accuracy": {
      "mean": -0.0003411002755941202,
      "min": -0.0011716120643541217,
      "p10": -0.0008971005561761558,
      "p50": -0.00035967130679637194,
      "p90": 0.0001696345338132232,
      "max": 0.0015637431642971933
    },
    "send_duration": 29.939004405983724,
    "requested_rate": 5.0,
    "achieved_rate": 5.010186643682127
  },
  "successes": {
    "count": 150,
    "latency": {
      "request_latency": {
        "mean": 0.15204639654645385,
        "min": 0.07000356499338523,
        "p10": 0.08212132930348162,
        "p50": 0.1427730229916051,
        "p90": 0.24229057717893737,
        "max": 0.3179700030013919
      },
      "normalized_time_per_output_token": {
        "mean": 0.07312806974659906,
        "min": 0.0,
        "p10": 0.04020805734762689,
        "p50": 0.06625330800306983,
        "p90": 0.11783517569565445,
        "max": 0.15898500150069594
      },
      "time_per_output_token": {
        "mean": 8.135460685783375e-05,
        "min": 1.7595671427746613e-05,
        "p10": 5.4437031697792314e-05,
        "p50": 7.70806655054912e-05,
        "p90": 0.00010656010126695036,
        "max": 0.00039195767021737993
      },
      "time_to_first_token": {
        "mean": 0.1495171119470615,
        "min": 0.06942152098054066,
        "p10": 0.08072178370493929,
        "p50": 0.14068766750278883,
        "p90": 0.24136214567988645,
        "max": 0.31703847702010535
      },
      "inter_token_latency": {
        "mean": 8.135460685783377e-05,
        "min": 5.253998097032309e-06,
        "p10": 6.276596104726196e-06,
        "p50": 5.0237998948432505e-05,
        "p90": 0.00020748540991917256,
        "max": 0.0011021530081052333
      }
    },
    "throughput": {
      "input_tokens_per_sec": 904.360704698067,
      "output_tokens_per_sec": 9.710363146601786,
      "total_tokens_per_sec": 914.0710678446688,
      "requests_per_sec": 4.988200246542013
    },
    "prompt_len": {
      "mean": 181.3,
      "min": 2.0,
      "p10": 2.0,
      "p50": 143.0,
      "p90": 436.4,
      "max": 641.0
    },
    "output_len": {
      "mean": 1.9466666666666668,
      "min": 0.0,
      "p10": 2.0,
      "p50": 2.0,
      "p90": 2.0,
      "max": 2.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
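
The per-stage counts in these reports (90, 120, 150) match rate × duration for the three configured stages. A small sketch of that arithmetic follows; `stage_offsets` is a hypothetical helper, and the evenly spaced send offsets are an assumption about `type: constant` that the config does not confirm.

```python
def stage_offsets(rate: float, duration: float) -> list[float]:
    # Number of requests in the stage, then evenly spaced offsets in seconds
    # from the start of the stage (assumed spacing).
    count = round(rate * duration)
    return [i / rate for i in range(count)]


# (rate, duration) pairs from the config above.
stages = [(3, 30), (4, 30), (5, 30)]
counts = [len(stage_offsets(rate, duration)) for rate, duration in stages]
print(counts)  # [90, 120, 150]
```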

@jjk-g
Collaborator Author

jjk-g commented Aug 19, 2025

sharegpt:

2025-08-19 18:47:30,817 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
data:
  type: shareGPT
  path: null
  input_distribution:
    max: 1024
  output_distribution:
    max: 1024
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 3
    duration: 30
  - rate: 4
    duration: 30
  - rate: 5
    duration: 30
  num_workers: 88
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics: null
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20250819-184730
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  base_url: http://llama3-8b-vllm-service:8000
  ignore_eos: true


2025-08-19 18:47:30,817 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20250819-184730
2025-08-19 18:47:30,823 - inference_perf.client.modelserver.vllm_client - INFO - Inferred model meta-llama/Meta-Llama-3-8B
2025-08-19 18:47:39,188 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
2025-08-19 18:48:51,324 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-08-19 18:48:52,325 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run started
2025-08-19 18:50:20,329 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run completed
2025-08-19 18:50:21,334 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run started
2025-08-19 18:51:54,157 - inference_perf.loadgen.load_generator - INFO - Stage 2 - run completed
2025-08-19 18:51:58,165 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-08-19 18:51:58,253 - inference_perf.reportgen.base - WARNING - Prometheus Metrics Client is not configured or not of type PrometheusMetricsClient
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/summary_lifecycle_metrics.json
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/stage_0_lifecycle_metrics.json
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/stage_1_lifecycle_metrics.json
2025-08-19 18:51:58,254 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20250819-184730/stage_2_lifecycle_metrics.json
root@test:/ip# cat reports-20250819-184730/stage_0_lifecycle_metrics.json
{
  "load_summary": {
    "count": 90,
    "schedule_accuracy": {
      "mean": 5.221641460795783e-05,
      "min": -0.0009789488685782999,
      "p10": -0.0005377435736590996,
      "p50": 5.94295997871086e-05,
      "p90": 0.0006354953395202757,
      "max": 0.001245536986971274
    },
    "send_duration": 29.465435588004766,
    "requested_rate": 3.0,
    "achieved_rate": 3.0544262524541996
  },
  "successes": {
    "count": 90,
    "latency": {
      "request_latency": {
        "mean": 16.855602986932112,
        "min": 1.3469735749822576,
        "p10": 2.2618570482882205,
        "p50": 14.023260419504368,
        "p90": 34.246013917305426,
        "max": 56.9563091089949
      },
      "normalized_time_per_output_token": {
        "mean": 0.10431093692514658,
        "min": 0.07290401113192446,
        "p10": 0.08277983479058802,
        "p50": 0.10053955099979986,
        "p90": 0.13586304840555744,
        "max": 0.1928966081535551
      },
      "time_per_output_token": {
        "mean": 0.04859078957569721,
        "min": 0.03592505416358479,
        "p10": 0.040810231570446345,
        "p50": 0.04796033795054988,
        "p90": 0.05649722807534755,
        "max": 0.08334063840605244
      },
      "time_to_first_token": {
        "mean": 0.19460895551019347,
        "min": 0.07818793502519839,
        "p10": 0.09808347138750832,
        "p50": 0.1621460514870705,
        "p90": 0.3419241371011595,
        "max": 0.6196016960020643
      },
      "inter_token_latency": {
        "mean": 0.04502083609991919,
        "min": 4.388013621792197e-06,
        "p10": 1.766601635608822e-05,
        "p50": 7.605599239468575e-05,
        "p90": 0.0813809909945121,
        "max": 0.9274438979919069
      }
    },
    "throughput": {
      "input_tokens_per_sec": 239.83594952000615,
      "output_tokens_per_sec": 233.730261732462,
      "total_tokens_per_sec": 473.5662112524682,
      "requests_per_sec": 1.2749696076078296
    },
    "prompt_len": {
      "mean": 188.11111111111111,
      "min": 10.0,
      "p10": 14.9,
      "p50": 75.0,
      "p90": 442.5000000000003,
      "max": 1010.0
    },
    "output_len": {
      "mean": 183.32222222222222,
      "min": 10.0,
      "p10": 15.9,
      "p50": 134.0,
      "p90": 400.0,
      "max": 692.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-184730/stage_1_lifecycle_metrics.json
{
  "load_summary": {
    "count": 120,
    "schedule_accuracy": {
      "mean": 0.0006318472092971206,
      "min": -0.0020282877958379686,
      "p10": -0.0011430173093685881,
      "p50": -0.00034038534795399755,
      "p90": 0.00032524631242267807,
      "max": 0.121013759780908
    },
    "send_duration": 29.675006324978312,
    "requested_rate": 4.0,
    "achieved_rate": 4.043807057220154
  },
  "successes": {
    "count": 120,
    "latency": {
      "request_latency": {
        "mean": 25.566489407231952,
        "min": 1.2205224850040395,
        "p10": 3.4664570443856064,
        "p50": 21.11329986547935,
        "p90": 50.897888788004664,
        "max": 82.11318211900652
      },
      "normalized_time_per_output_token": {
        "mean": 0.1394324919548036,
        "min": 0.08078520659632298,
        "p10": 0.09461560029375231,
        "p50": 0.12584894262082813,
        "p90": 0.18335988017058566,
        "max": 0.49897901211321977
      },
      "time_per_output_token": {
        "mean": 0.0614751124171674,
        "min": 0.03984124532503313,
        "p10": 0.04535869591081229,
        "p50": 0.057812230116445706,
        "p90": 0.07751729355019848,
        "max": 0.12403310752225756
      },
      "time_to_first_token": {
        "mean": 0.4935139758073395,
        "min": 0.07457844298915006,
        "p10": 0.11656759349571075,
        "p50": 0.25556120299734175,
        "p90": 1.0773074870056014,
        "max": 3.208106656005839
      },
      "inter_token_latency": {
        "mean": 0.05448512702157385,
        "min": 2.9660004656761885e-06,
        "p10": 1.632000203244388e-05,
        "p50": 7.719801214989275e-05,
        "p90": 0.09491028750198893,
        "max": 1.9733509530196898
      }
    },
    "throughput": {
      "input_tokens_per_sec": 394.5423113100554,
      "output_tokens_per_sec": 315.74230001527286,
      "total_tokens_per_sec": 710.2846113253283,
      "requests_per_sec": 1.3844804327048177
    },
    "prompt_len": {
      "mean": 284.975,
      "min": 10.0,
      "p10": 14.9,
      "p50": 167.5,
      "p90": 729.0000000000002,
      "max": 1022.0
    },
    "output_len": {
      "mean": 228.05833333333334,
      "min": 9.0,
      "p10": 18.9,
      "p50": 170.0,
      "p90": 491.40000000000055,
      "max": 901.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}
root@test:/ip# cat reports-20250819-184730/stage_2_lifecycle_metrics.json
{
  "load_summary": {
    "count": 150,
    "schedule_accuracy": {
      "mean": -0.00029434598186829435,
      "min": -0.0012856500106863678,
      "p10": -0.0008751995075726882,
      "p50": -0.00030727076227776706,
      "p90": 0.00021163554338272657,
      "max": 0.0010848964448086917
    },
    "send_duration": 29.14394703900325,
    "requested_rate": 5.0,
    "achieved_rate": 5.146866338977884
  },
  "successes": {
    "count": 150,
    "latency": {
      "request_latency": {
        "mean": 21.323471702333965,
        "min": 1.3652286779833958,
        "p10": 2.405568425502861,
        "p50": 17.431325447003474,
        "p90": 45.278694986019396,
        "max": 82.03417063000961
      },
      "normalized_time_per_output_token": {
        "mean": 0.1444949032553297,
        "min": 0.0803363197732064,
        "p10": 0.09066429713280562,
        "p50": 0.1333252786694974,
        "p90": 0.20589139041799956,
        "max": 0.3611066017797889
      },
      "time_per_output_token": {
        "mean": 0.06638236991747913,
        "min": 0.03937716875109712,
        "p10": 0.04463379044102219,
        "p50": 0.06303210961626521,
        "p90": 0.09392862158313589,
        "max": 0.13958217461948239
      },
      "time_to_first_token": {
        "mean": 0.2435184809875985,
        "min": 0.08514626399846748,
        "p10": 0.11298617671418469,
        "p50": 0.21930343950225506,
        "p90": 0.4126431950018741,
        "max": 0.5618777420022525
      },
      "inter_token_latency": {
        "mean": 0.05498562843920867,
        "min": 2.8579961508512497e-06,
        "p10": 1.589400926604867e-05,
        "p50": 7.072500011418015e-05,
        "p90": 0.09152998790377752,
        "max": 1.5335636739910115
      }
    },
    "throughput": {
      "input_tokens_per_sec": 392.5527475368729,
      "output_tokens_per_sec": 313.4243855897332,
      "total_tokens_per_sec": 705.9771331266061,
      "requests_per_sec": 1.6489655865616772
    },
    "prompt_len": {
      "mean": 238.06,
      "min": 10.0,
      "p10": 16.0,
      "p50": 117.0,
      "p90": 634.6999999999999,
      "max": 1012.0
    },
    "output_len": {
      "mean": 190.07333333333332,
      "min": 7.0,
      "p10": 12.9,
      "p50": 135.0,
      "p90": 470.9,
      "max": 946.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

@jjk-g
Collaborator Author

jjk-g commented Aug 20, 2025

Added future thoughts on testing in #40

@jjk-g jjk-g changed the title [WIP] Defer InferenceAPIData gen to worker procs Defer InferenceAPIData gen to worker procs Aug 20, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2025
Comment thread inference_perf/datagen/shared_prefix_datagen.py
Add uvloop as drop-in replacement for asyncio event loop
Fix timer to more reliably hit specified duration
Add sleep to reduce CPU contention of queue polling
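
The commit titles above do not say how the timer was fixed. One common way to keep a sender on schedule, shown here purely as an illustration, is to sleep until absolute deadlines computed from a single start time, so that per-request delays do not accumulate. `send_on_schedule` is a hypothetical name, and the recorded deviation is only assumed to resemble the `schedule_accuracy` metric in the reports.

```python
import asyncio


async def send_on_schedule(offsets, send):
    # offsets: seconds from the start at which each request should be sent.
    # send: a plain callable invoked once per request (hypothetical hook).
    loop = asyncio.get_running_loop()
    start = loop.time()
    deviations = []
    for offset in offsets:
        deadline = start + offset
        delay = deadline - loop.time()
        if delay > 0:
            # Sleep until the absolute deadline rather than for a fixed
            # interval, so lateness on one request does not shift the rest.
            await asyncio.sleep(delay)
        # Positive means late, negative means early.
        deviations.append(loop.time() - deadline)
        send()
    return deviations
```

A call such as `asyncio.run(send_on_schedule([0.0, 0.1, 0.2], lambda: None))` returns one deviation per request.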
@achandrasekar
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achandrasekar, jjk-g


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 20, 2025
@achandrasekar achandrasekar merged commit 5e1649d into kubernetes-sigs:main Aug 20, 2025
3 of 4 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Loadgen asyncio task execution discrepancies can alter lifecycle results

3 participants