feat: loadgen SIGINT handler #244

Merged
achandrasekar merged 5 commits into
kubernetes-sigs:mainfrom
changminbark:loadgen-sigint-debug
Oct 15, 2025

@changminbark (Member) commented Oct 5, 2025

PR Template

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR introduces a SIGINT handler for the loadgen phase so that, when a SIGINT is trapped, the program moves on to the reportgen phase instead of hanging.

Which issue(s) this PR fixes:

Fixes #133

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

The load generator now has a SIGINT handler that gracefully handles SIGINT by stopping all workers, flushing the request queue, and then moving on to the report generation phase with the results gathered so far.
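The sequence described above (trap SIGINT, stop producing load, flush the queue, continue to reporting) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `run_stage`, `drain`, `main`, `stop`, and `queue` are hypothetical names, there are no real workers consuming the queue, and `loop.add_signal_handler` is only available on Unix event loops.

```python
import asyncio
import signal


async def run_stage(queue: asyncio.Queue, stop: asyncio.Event, total: int) -> int:
    """Enqueue up to `total` requests, stopping early once `stop` is set."""
    sent = 0
    for i in range(total):
        if stop.is_set():
            break
        await queue.put(i)
        sent += 1
        await asyncio.sleep(0.01)  # stand-in for the pacing between requests
    return sent


def drain(queue: asyncio.Queue) -> int:
    """Discard requests that were queued but never picked up by a worker."""
    dropped = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1


async def main() -> tuple[int, int]:
    loop = asyncio.get_running_loop()
    stop = asyncio.Event()
    queue: asyncio.Queue = asyncio.Queue()
    # Assumption: SIGINT only sets a flag instead of raising KeyboardInterrupt.
    loop.add_signal_handler(signal.SIGINT, stop.set)
    try:
        sent = await run_stage(queue, stop, total=1000)
    finally:
        loop.remove_signal_handler(signal.SIGINT)
    dropped = drain(queue)
    # Report generation would run here with whatever results were gathered.
    return sent, dropped


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Setting a flag rather than raising inside the loop lets the stage finish its current iteration and return normally, which is what allows the later phases to run.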

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


Testing

Testing used the default config.yml file in the inference_perf directory, together with the required services (such as vLLM serving HuggingFaceTB/SmolLM2-135M-Instruct and a local Prometheus instance).

Functional test output

Before change

$ python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-10-03 15:24:54,533 - inference_perf.config - INFO - Using configuration from: config.yml
2025-10-03 15:24:54,536 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 1
    duration: 30
  sweep: null
  num_workers: 16
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251003-152453
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct


2025-10-03 15:24:54,536 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20251003-152453
2025-10-03 15:25:25,217 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
Stage 0 progress:  27%|███████████████████████████████▏                                                                                     | 0.26666666666666666/1.0 [00:10<00:26, 35.75s/it]Stage 0 progress:  27%|███████████████████████████████▏                                                                                     | 0.26666666666666666/1.0 [00:10<00:28, 38.95s/it]
Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 78, in _run
    await self.loadgen.run(self.client)
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 394, in run
    return await self.mp_run(client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 376, in mp_run
    await self.run_stage(
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 235, in run_stage
    await sleep(1)
  File "/usr/lib/python3.12/asyncio/tasks.py", line 665, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 301, in <module>
    main_cli()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 278, in main_cli
    perfrunner.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/inference_perf/main.py", line 82, in run
    asyncio.run(_run())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt
^CException ignored in atexit callback: <function _exit_function at 0x720907d42f20>
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/util.py", line 360, in _exit_function
    p.join()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt: 
Process Worker-15:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
Process Worker-16:
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt
Process Worker-9:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt
Process Worker-1:
Process Worker-4:
Traceback (most recent call last):
Process Worker-5:
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
Traceback (most recent call last):
KeyboardInterrupt
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt
Process Worker-6:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/chang-min/Desktop/OpenSource/k8s/inference-perf/venv/lib/python3.12/site-packages/inference_perf/loadgen/load_generator.py", line 153, in run
    run(self.loop())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1512, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1505, in uvloop.loop.Loop.run_until_complete
  File "uvloop/loop.pyx", line 1379, in uvloop.loop.Loop.run_forever
  File "uvloop/loop.pyx", line 557, in uvloop.loop.Loop._run
  File "uvloop/handles/poll.pyx", line 216, in uvloop.loop.__on_uvpoll_event
  File "uvloop/cbhandles.pyx", line 83, in uvloop.loop.Handle._run
  File "uvloop/cbhandles.pyx", line 66, in uvloop.loop.Handle._run
  File "uvloop/loop.pyx", line 399, in uvloop.loop.Loop._read_from_self
  File "uvloop/loop.pyx", line 404, in uvloop.loop.Loop._invoke_signals
  File "uvloop/loop.pyx", line 379, in uvloop.loop.Loop._ceval_process_signals
  File "/usr/lib/python3.12/asyncio/runners.py", line 157, in _on_sigint
    raise KeyboardInterrupt()
KeyboardInterrupt
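
In the output above, every worker process raises its own KeyboardInterrupt, since Ctrl+C delivers SIGINT to the whole foreground process group. One common way to avoid this, shown here only as a hedged sketch and not necessarily the approach taken in this PR, is to have workers ignore SIGINT and let the parent coordinate shutdown through a shared event. The names `worker`, `run_workers`, `stop`, and `results` are hypothetical, and the `fork` start method is assumed (Unix only).

```python
import multiprocessing as mp
import signal
import time


def worker(stop, results) -> None:
    # Assumption: the worker ignores SIGINT and relies on the parent's stop flag.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    handled = 0
    while not stop.is_set():
        handled += 1       # stand-in for sending one request
        time.sleep(0.01)
    results.put(handled)


def run_workers(num_workers: int, run_for: float) -> list[int]:
    ctx = mp.get_context("fork")  # assumption: Unix with fork available
    stop = ctx.Event()
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(stop, results)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    try:
        time.sleep(run_for)  # the stage runs here; Ctrl+C raises KeyboardInterrupt
    except KeyboardInterrupt:
        pass                 # fall through to the orderly shutdown below
    stop.set()
    # Read results before joining so a worker never blocks on a full pipe.
    counts = [results.get() for _ in procs]
    for p in procs:
        p.join()
    return counts


if __name__ == "__main__":
    print(run_workers(num_workers=4, run_for=0.5))
```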

After change

$ python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-10-05 15:35:00,115 - inference_perf.config - INFO - Using configuration from: config.yml
2025-10-05 15:35:00,119 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 1
    duration: 30
  sweep: null
  num_workers: 16
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251005-153458
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct


2025-10-05 15:35:00,119 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20251005-153458
LOADGENERATOR __init__ CALLED
2025-10-05 15:35:44,482 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
Stage 0 progress:  37%|████████████████████████████████▋                                                        | 0.36666666666666664/1.0 [00:16<00:26, 41.73s/it]Stage 0 progress:  43%|██████████████████████████████████████▌                                                  | 0.43333333333333335/1.0 [00:17<00:22, 39.28s/it]
2025-10-05 15:36:01,557 - inference_perf.loadgen.load_generator - INFO - Loadgen encountered SIGINT
2025-10-05 15:36:02,559 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-10-05 15:36:03,564 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-10-05 15:36:20,631 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: avg_inter_token_latency. Skipping this metric.
2025-10-05 15:36:20,631 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: median_inter_token_latency. Skipping this metric.
2025-10-05 15:36:20,631 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: p90_inter_token_latency. Skipping this metric.
2025-10-05 15:36:20,631 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: p99_inter_token_latency. Skipping this metric.
2025-10-05 15:36:20,634 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251005-153458/summary_lifecycle_metrics.json
2025-10-05 15:36:20,634 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251005-153458/stage_0_lifecycle_metrics.json
2025-10-05 15:36:20,634 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251005-153458/summary_prometheus_metrics.json
2025-10-05 15:36:20,636 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251005-153458/config.yaml

Reports generated after the change:

{
  "load_summary": {
    "count": 13,
    "schedule_delay": {
      "mean": 0.0006833168680924035,
      "min": -0.0007300731404029648,
      "p0.1": -0.0007276184231595835,
      "p1": -0.0007055259679691517,
      "p5": -0.0006073372782338992,
      "p10": -0.0004284748230929835,
      "p25": 0.00026718769913713913,
      "median": 0.0006134590676083462,
      "p75": 0.001087595815079112,
      "p90": 0.0018773561738271388,
      "p95": 0.0020964901998013373,
      "p99": 0.0021771716267176087,
      "p99.9": 0.0021953249477737703,
      "max": 0.0021973419834466767
    },
    "send_duration": 13.933553711999593,
    "requested_rate": 1.0,
    "achieved_rate": 0.9329995971382652
  },
  "successes": {
    "count": 13,
    "latency": {
      "request_latency": {
        "mean": 1.7430407067691145,
        "min": 0.06072670499997912,
        "p0.1": 0.06099953636397913,
        "p1": 0.06345501863997924,
        "p5": 0.07436827319997974,
        "p10": 0.08805789179987188,
        "p25": 0.3781903320004858,
        "median": 1.2660093819995382,
        "p75": 2.4516203359999054,
        "p90": 3.8283391862001745,
        "p95": 5.109201258400022,
        "p99": 6.2777515292798105,
        "p99.9": 6.540675340227768,
        "max": 6.569889096999759
      },
      "normalized_time_per_output_token": {
        "mean": 0.012470926502922166,
        "min": 0.00917398102898216,
        "p0.1": 0.009175159876347384,
        "p1": 0.009185769502634401,
        "p5": 0.009232923397243362,
        "p10": 0.009364713885629417,
        "p25": 0.009774655890977644,
        "median": 0.009984633886017872,
        "p75": 0.015041869878779946,
        "p90": 0.017443585992874606,
        "p95": 0.01932054613711183,
        "p99": 0.02089432538733256,
        "p99.9": 0.02124842571863223,
        "max": 0.021287770199887746
      },
      "time_per_output_token": {
        "mean": 0.004691286935402641,
        "min": 0.003218188090861738,
        "p0.1": 0.0032228111164381104,
        "p1": 0.003264418346625462,
        "p5": 0.003449339369680357,
        "p10": 0.0036334146248348177,
        "p25": 0.004493048480143827,
        "median": 0.004630491446476313,
        "p75": 0.004954509693413679,
        "p90": 0.005680080064766961,
        "p95": 0.005993028690565533,
        "p99": 0.0063096841381211355,
        "p99.9": 0.006380931613821148,
        "max": 0.006388848000010037
      },
      "time_to_first_token": {
        "mean": 0.03603282338463032,
        "min": 0.017281655999795476,
        "p0.1": 0.017282495099800146,
        "p1": 0.01729004699984216,
        "p5": 0.017323611000028903,
        "p10": 0.017407786000148917,
        "p25": 0.018456763999893155,
        "median": 0.022838959000182513,
        "p75": 0.0634562069999447,
        "p90": 0.06582986580015131,
        "p95": 0.06801724380020459,
        "p99": 0.07008519036018697,
        "p99.9": 0.07055047833618301,
        "max": 0.07060217700018256
      },
      "inter_token_latency": {
        "mean": 0.004846513091485928,
        "min": 2.050999682978727e-06,
        "p0.1": 2.1702484373236075e-06,
        "p1": 1.4293440290202853e-05,
        "p5": 1.4872400242893491e-05,
        "p10": 1.516680003987858e-05,
        "p25": 1.5984000128810294e-05,
        "median": 6.18069998381543e-05,
        "p75": 0.009163918000012927,
        "p90": 0.010292426600426553,
        "p95": 0.011176135199639248,
        "p99": 0.013324490480117674,
        "p99.9": 0.030814539975600103,
        "max": 0.057281617000626284
      }
    },
    "throughput": {
      "input_tokens_per_sec": 177.8494350469882,
      "output_tokens_per_sec": 143.53764693502904,
      "total_tokens_per_sec": 321.38708198201726,
      "requests_per_sec": 0.8260245286212384
    },
    "prompt_len": {
      "mean": 215.30769230769232,
      "min": 7.0,
      "p0.1": 7.036,
      "p1": 7.36,
      "p5": 8.8,
      "p10": 11.200000000000001,
      "p25": 17.0,
      "median": 62.0,
      "p75": 420.0,
      "p90": 476.8,
      "p95": 610.3999999999995,
      "p99": 764.4799999999997,
      "p99.9": 799.1480000000005,
      "max": 803.0
    },
    "output_len": {
      "mean": 173.76923076923077,
      "min": 4.0,
      "p0.1": 4.012,
      "p1": 4.12,
      "p5": 4.6,
      "p10": 5.2,
      "p25": 21.0,
      "median": 138.0,
      "p75": 221.0,
      "p90": 410.0000000000001,
      "p95": 530.7999999999997,
      "p99": 632.5599999999998,
      "p99.9": 655.4560000000002,
      "max": 658.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

{
  "load_summary": {
    "count": 13,
    "schedule_delay": {
      "mean": 0.0006833168680924035,
      "min": -0.0007300731404029648,
      "p0.1": -0.0007276184231595835,
      "p1": -0.0007055259679691517,
      "p5": -0.0006073372782338992,
      "p10": -0.0004284748230929835,
      "p25": 0.00026718769913713913,
      "median": 0.0006134590676083462,
      "p75": 0.001087595815079112,
      "p90": 0.0018773561738271388,
      "p95": 0.0020964901998013373,
      "p99": 0.0021771716267176087,
      "p99.9": 0.0021953249477737703,
      "max": 0.0021973419834466767
    }
  },
  "successes": {
    "count": 13,
    "latency": {
      "request_latency": {
        "mean": 1.7430407067691145,
        "min": 0.06072670499997912,
        "p0.1": 0.06099953636397913,
        "p1": 0.06345501863997924,
        "p5": 0.07436827319997974,
        "p10": 0.08805789179987188,
        "p25": 0.3781903320004858,
        "median": 1.2660093819995382,
        "p75": 2.4516203359999054,
        "p90": 3.8283391862001745,
        "p95": 5.109201258400022,
        "p99": 6.2777515292798105,
        "p99.9": 6.540675340227768,
        "max": 6.569889096999759
      },
      "normalized_time_per_output_token": {
        "mean": 0.012470926502922166,
        "min": 0.00917398102898216,
        "p0.1": 0.009175159876347384,
        "p1": 0.009185769502634401,
        "p5": 0.009232923397243362,
        "p10": 0.009364713885629417,
        "p25": 0.009774655890977644,
        "median": 0.009984633886017872,
        "p75": 0.015041869878779946,
        "p90": 0.017443585992874606,
        "p95": 0.01932054613711183,
        "p99": 0.02089432538733256,
        "p99.9": 0.02124842571863223,
        "max": 0.021287770199887746
      },
      "time_per_output_token": {
        "mean": 0.004691286935402641,
        "min": 0.003218188090861738,
        "p0.1": 0.0032228111164381104,
        "p1": 0.003264418346625462,
        "p5": 0.003449339369680357,
        "p10": 0.0036334146248348177,
        "p25": 0.004493048480143827,
        "median": 0.004630491446476313,
        "p75": 0.004954509693413679,
        "p90": 0.005680080064766961,
        "p95": 0.005993028690565533,
        "p99": 0.0063096841381211355,
        "p99.9": 0.006380931613821148,
        "max": 0.006388848000010037
      },
      "time_to_first_token": {
        "mean": 0.03603282338463032,
        "min": 0.017281655999795476,
        "p0.1": 0.017282495099800146,
        "p1": 0.01729004699984216,
        "p5": 0.017323611000028903,
        "p10": 0.017407786000148917,
        "p25": 0.018456763999893155,
        "median": 0.022838959000182513,
        "p75": 0.0634562069999447,
        "p90": 0.06582986580015131,
        "p95": 0.06801724380020459,
        "p99": 0.07008519036018697,
        "p99.9": 0.07055047833618301,
        "max": 0.07060217700018256
      },
      "inter_token_latency": {
        "mean": 0.004846513091485928,
        "min": 2.050999682978727e-06,
        "p0.1": 2.1702484373236075e-06,
        "p1": 1.4293440290202853e-05,
        "p5": 1.4872400242893491e-05,
        "p10": 1.516680003987858e-05,
        "p25": 1.5984000128810294e-05,
        "median": 6.18069998381543e-05,
        "p75": 0.009163918000012927,
        "p90": 0.010292426600426553,
        "p95": 0.011176135199639248,
        "p99": 0.013324490480117674,
        "p99.9": 0.030814539975600103,
        "max": 0.057281617000626284
      }
    },
    "throughput": {
      "input_tokens_per_sec": 177.8494350469882,
      "output_tokens_per_sec": 143.53764693502904,
      "total_tokens_per_sec": 321.38708198201726,
      "requests_per_sec": 0.8260245286212384
    },
    "prompt_len": {
      "mean": 215.30769230769232,
      "min": 7.0,
      "p0.1": 7.036,
      "p1": 7.36,
      "p5": 8.8,
      "p10": 11.200000000000001,
      "p25": 17.0,
      "median": 62.0,
      "p75": 420.0,
      "p90": 476.8,
      "p95": 610.3999999999995,
      "p99": 764.4799999999997,
      "p99.9": 799.1480000000005,
      "max": 803.0
    },
    "output_len": {
      "mean": 173.76923076923077,
      "min": 4.0,
      "p0.1": 4.012,
      "p1": 4.12,
      "p5": 4.6,
      "p10": 5.2,
      "p25": 21.0,
      "median": 138.0,
      "p75": 221.0,
      "p90": 410.0000000000001,
      "p95": 530.7999999999997,
      "p99": 632.5599999999998,
      "p99.9": 655.4560000000002,
      "max": 658.0
    }
  },
  "failures": {
    "count": 0,
    "request_latency": null,
    "prompt_len": null
  }
}

{
  "load_summary": {},
  "successes": {
    "count": 0,
    "rate": 0.0,
    "prompt_len": {
      "mean": 0,
      "rate": 0.0
    },
    "output_len": {
      "mean": 0,
      "rate": 0.0
    },
    "queue_len": {
      "mean": 0
    },
    "request_latency": {
      "mean": 0.0,
      "median": 0.0,
      "p90": 0.0,
      "p99": 0.0
    },
    "time_to_first_token": {
      "mean": 0.0,
      "median": 0.0,
      "p90": 0.0,
      "p99": 0.0
    },
    "time_per_output_token": {
      "mean": 0.0,
      "median": 0.0,
      "p90": 0.0,
      "p99": 0.0
    },
    "kv_cache_usage_percentage": {
      "mean": 0.0,
      "median": 0.0,
      "p90": 0.0,
      "p99": 0.0
    },
    "num_requests_swapped": {
      "mean": 0
    },
    "num_preemptions_total": {
      "mean": 0
    },
    "prefix_cache_hit_percent": {
      "mean": 0.0
    }
  },
  "failures": {}
}

api:
  type: completion
  streaming: true
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 1.0
    duration: 30
  sweep: null
  num_workers: 16
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics:
  type: prometheus
  prometheus:
    scrape_interval: 15
    url: http://localhost:9090/
    filters: []
    google_managed: false
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251005-153458
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
  api_key: null
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct
  trust_remote_code: null
  token: null
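
As a quick sanity check on the interrupted run, the `achieved_rate` in the first lifecycle report's `load_summary` appears to equal `count / send_duration`. This is an observation from the numbers pasted above, not a statement about how inference-perf computes the field. The values below are copied from that report.

```python
# Values copied from the first lifecycle report's "load_summary" above.
count = 13
send_duration = 13.933553711999593   # seconds
requested_rate = 1.0                 # requests per second
reported_achieved_rate = 0.9329995971382652

# Hypothesis: achieved rate is the number of requests divided by send time.
achieved_rate = count / send_duration
print(f"achieved rate: {achieved_rate:.4f} req/s")
print(f"fraction of requested rate: {achieved_rate / requested_rate:.1%}")
```

The 13 requests recorded before the interrupt therefore correspond to roughly 0.933 requests per second against a requested rate of 1.0.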

Multi-stage Run Report:

Canceled during stage 0:

$ python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-10-06 15:50:20,188 - inference_perf.config - INFO - Using configuration from: config.yml
2025-10-06 15:50:20,193 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 1
    duration: 30
  - rate: 1
    duration: 30
  sweep: null
  num_workers: 16
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251006-155018
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct


2025-10-06 15:50:20,194 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20251006-155018
2025-10-06 15:51:05,366 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
Stage 0 progress:  50%|████████████████████████████████████████████████████████████                                                            | 0.5/1.0 [00:39<00:16, 33.05s/it]
Stage 0 progress:  50%|████████████████████████████████████████████████████████████                                                            | 0.5/1.0 [00:40<00:40, 80.09s/it]
2025-10-06 15:51:45,472 - inference_perf.loadgen.load_generator - INFO - Loadgen encountered SIGINT
2025-10-06 15:51:46,473 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-10-06 15:51:46,475 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-10-06 15:52:03,580 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: avg_inter_token_latency. Skipping this metric.
2025-10-06 15:52:03,580 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: median_inter_token_latency. Skipping this metric.
2025-10-06 15:52:03,580 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: p90_inter_token_latency. Skipping this metric.
2025-10-06 15:52:03,580 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: p99_inter_token_latency. Skipping this metric.
2025-10-06 15:52:03,584 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155018/summary_lifecycle_metrics.json
2025-10-06 15:52:03,585 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155018/stage_0_lifecycle_metrics.json
2025-10-06 15:52:03,585 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155018/summary_prometheus_metrics.json
2025-10-06 15:52:03,591 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155018/config.yaml

config.yaml
summary_prometheus_metrics.json
stage_0_lifecycle_metrics.json
summary_lifecycle_metrics.json

Canceled during stage 1:

$ python3 inference_perf/main.py -c config.yml
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-10-06 15:52:10,033 - inference_perf.config - INFO - Using configuration from: config.yml
2025-10-06 15:52:10,038 - inference_perf.config - INFO - Benchmarking with the following config:

api:
  type: completion
  streaming: true
  headers: null
data:
  type: shareGPT
  path: null
  input_distribution: null
  output_distribution: null
  shared_prefix: null
load:
  type: constant
  interval: 1.0
  stages:
  - rate: 1
    duration: 30
  - rate: 1
    duration: 30
  sweep: null
  num_workers: 16
  worker_max_concurrency: 100
  worker_max_tcp_connections: 2500
metrics:
  type: prometheus
  prometheus:
    url: http://localhost:9090
    scrape_interval: 15
report:
  request_lifecycle:
    summary: true
    per_stage: true
    per_request: false
  prometheus:
    summary: true
    per_stage: false
storage:
  local_storage:
    path: reports-20251006-155207
    report_file_prefix: null
  google_cloud_storage: null
  simple_storage_service: null
server:
  type: vllm
  model_name: HuggingFaceTB/SmolLM2-135M-Instruct
  base_url: http://0.0.0.0:8000
  ignore_eos: true
tokenizer:
  pretrained_model_name_or_path: HuggingFaceTB/SmolLM2-135M-Instruct


2025-10-06 15:52:10,039 - inference_perf.client.filestorage.local - INFO - Report files will be stored at: reports-20251006-155207
2025-10-06 15:52:55,544 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run started
Stage 0 progress: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [01:30<00:00, 90.09s/it]
2025-10-06 15:54:25,709 - inference_perf.loadgen.load_generator - INFO - Stage 0 - run completed
2025-10-06 15:54:26,710 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run started
Stage 1 progress:  10%|████████████                                                                                                            | 0.1/1.0 [00:15<00:51, 57.46s/it]
Stage 1 progress:  10%|███████████▉                                                                                                           | 0.1/1.0 [00:20<03:00, 200.20s/it]
2025-10-06 15:54:46,804 - inference_perf.loadgen.load_generator - INFO - Loadgen encountered SIGINT
2025-10-06 15:54:47,805 - inference_perf.loadgen.load_generator - INFO - Stage 1 - run completed
2025-10-06 15:54:47,806 - inference_perf.reportgen.base - INFO - Generating Reports...
2025-10-06 15:55:04,907 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: avg_inter_token_latency. Skipping this metric.
2025-10-06 15:55:04,907 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: median_inter_token_latency. Skipping this metric.
2025-10-06 15:55:04,907 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: p90_inter_token_latency. Skipping this metric.
2025-10-06 15:55:04,907 - inference_perf.client.metricsclient.prometheus_client.base - WARNING - Metric metadata is not present for metric: p99_inter_token_latency. Skipping this metric.
2025-10-06 15:55:04,910 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155207/summary_lifecycle_metrics.json
2025-10-06 15:55:04,911 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155207/stage_0_lifecycle_metrics.json
2025-10-06 15:55:04,911 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155207/stage_1_lifecycle_metrics.json
2025-10-06 15:55:04,911 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155207/summary_prometheus_metrics.json
2025-10-06 15:55:04,913 - inference_perf.client.filestorage.local - INFO - Report saved to: reports-20251006-155207/config.yaml

config.yaml
summary_prometheus_metrics.json
stage_1_lifecycle_metrics.json
stage_0_lifecycle_metrics.json
summary_lifecycle_metrics.json
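
The file listings in the two runs above follow a simple pattern: one lifecycle report per stage that was started, plus the summaries and the saved config. As a rough sketch for checking an interrupted run, assuming the report settings shown in the config dump (lifecycle `summary` and `per_stage` enabled, Prometheus `summary` only), the expected names can be derived as follows. `expected_report_files` and `missing_report_files` are hypothetical helpers, not part of inference-perf.

```python
from typing import Iterable, List


def expected_report_files(last_stage_started: int) -> List[str]:
    """List the report files expected when the run reached the given stage index.

    Assumption: file names follow the pattern seen in the logs above.
    """
    files = ["summary_lifecycle_metrics.json"]
    # One lifecycle report per stage that was started, including the interrupted one.
    files += [f"stage_{i}_lifecycle_metrics.json" for i in range(last_stage_started + 1)]
    files += ["summary_prometheus_metrics.json", "config.yaml"]
    return files


def missing_report_files(present: Iterable[str], last_stage_started: int) -> List[str]:
    """Return the expected files that are absent from the report directory listing."""
    present_set = set(present)
    return [name for name in expected_report_files(last_stage_started) if name not in present_set]
```

For the run canceled during stage 1, `expected_report_files(1)` yields the same five names as the listing above.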

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from Bslabe123 October 5, 2025 19:54
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 5, 2025
@k8s-ci-robot k8s-ci-robot requested a review from jjk-g October 5, 2025 19:54
@k8s-ci-robot
Contributor

Welcome @changminbark!

It looks like this is your first PR to kubernetes-sigs/inference-perf 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/inference-perf has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 5, 2025
Collaborator

@jjk-g jjk-g left a comment

Thanks for adding!

Can you add an example of a run handling SIGINT?

This also only handles SIGINT for a given stage. If a user cancels during stage 2 of 5, would they need to send SIGINT again for each of the remaining stages?
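
For illustration, one way a single SIGINT could cancel the current stage and also skip the remaining ones is a shared stop flag that the stage loop checks before starting each stage. This is a minimal sketch, not the code in this PR: `stop_event`, `handle_sigint`, `install_handler`, and `run_stages` are hypothetical names, and the sleep loop stands in for real load generation.

```python
import signal
import threading
import time
from typing import List, Sequence

# Hypothetical shared flag; set once when SIGINT arrives.
stop_event = threading.Event()


def handle_sigint(signum, frame) -> None:
    """Record the interrupt instead of letting KeyboardInterrupt propagate."""
    stop_event.set()


def install_handler() -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGINT, handle_sigint)


def run_stages(durations: Sequence[float], tick: float = 0.01) -> List[int]:
    """Run stages in order and return the indices of the stages that were started."""
    started: List[int] = []
    for index, duration in enumerate(durations):
        if stop_event.is_set():
            break  # an earlier SIGINT cancels every remaining stage
        started.append(index)
        deadline = time.monotonic() + duration
        # Placeholder for load generation: poll the flag until the stage ends.
        while time.monotonic() < deadline and not stop_event.is_set():
            time.sleep(tick)
    return started
```

Under these assumptions, a caller would run `install_handler()` once, call `run_stages(...)`, and then generate reports only for the returned stage indices, so an interrupted run still reaches report generation with partial results.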

Comment thread inference_perf/loadgen/load_generator.py
@changminbark changminbark requested a review from jjk-g October 6, 2025 19:58
@jjk-g
Collaborator

jjk-g commented Oct 9, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2025
@changminbark
Member Author

/assign @terrytangyuan

@achandrasekar
Contributor

Can you fix the type check issue - https://github.com/kubernetes-sigs/inference-perf/actions/runs/18292806647/job/52485168010?pr=244?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 10, 2025
@changminbark
Member Author

> Can you fix the type check issue - https://github.com/kubernetes-sigs/inference-perf/actions/runs/18292806647/job/52485168010?pr=244?

I have just fixed it. Can I also ask how I can run linting workflows on my local machine? Thank you!

@achandrasekar
Contributor

> > Can you fix the type check issue - https://github.com/kubernetes-sigs/inference-perf/actions/runs/18292806647/job/52485168010?pr=244?
>
> I have just fixed it. Can I also ask how I can run linting workflows on my local machine? Thank you!

Yes, `pdm run validate` should run linting and type checks locally.

@achandrasekar
Contributor

Looks like lint check passed, but typecheck failed. Please address that as well.

@linux-foundation-easycla

linux-foundation-easycla Bot commented Oct 10, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 10, 2025
@changminbark
Member Author

> Looks like lint check passed, but typecheck failed. Please address that as well.

I just fixed it and learned about PDM (Python Development Master). It's a new tool for me, but it seems very powerful. Thank you!

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 10, 2025
@achandrasekar
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achandrasekar, changminbark

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 15, 2025
@achandrasekar achandrasekar merged commit 1090d50 into kubernetes-sigs:main Oct 15, 2025
3 of 4 checks passed
@changminbark changminbark deleted the loadgen-sigint-debug branch February 12, 2026 21:48

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handle sigint gracefully

5 participants