Releases · kubernetes-sigs/inference-perf

01 May 19:07

jjk-g

v0.5.0

e250731

v0.5.0 Latest

Summary

Features

New UX components including tui and cli args
OTel trace replay support
Multi-report analysis
Distribution Sampling
Goodput metrics
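
As a rough illustration of the goodput idea, the sketch below counts the fraction of requests that meet a set of latency SLOs. It is a minimal sketch, not the tool's implementation: the `RequestResult` record, its field names, and the `goodput_fraction` helper are all hypothetical, and the choice of TTFT and TPOT as the SLO dimensions is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestResult:
    """Hypothetical per-request record; field names are illustrative only."""
    ttft_s: float           # time to first token, in seconds
    tpot_s: float           # mean time per output token, in seconds
    succeeded: bool = True  # False for errored or timed-out requests


def goodput_fraction(results: Iterable[RequestResult],
                     ttft_slo_s: float,
                     tpot_slo_s: float) -> float:
    """Return the share of requests that succeeded and met both SLOs."""
    results = list(results)
    if not results:
        return 0.0
    good = sum(
        1 for r in results
        if r.succeeded and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / len(results)


sample = [
    RequestResult(ttft_s=0.20, tpot_s=0.03),                   # meets both SLOs
    RequestResult(ttft_s=0.90, tpot_s=0.03),                   # TTFT too slow
    RequestResult(ttft_s=0.25, tpot_s=0.04),                   # meets both SLOs
    RequestResult(ttft_s=0.10, tpot_s=0.02, succeeded=False),  # failed request
]
print(goodput_fraction(sample, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

Multiplying this fraction by the achieved request rate would give a goodput rate in requests per second, if that form is more useful for a given report.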

Bug Fixes

Fixes for concurrency and multi-turn
vLLM metric parity
Multi-token chunk handling
Iterative convergence for prompt length accuracy

Improvements

Code coverage reporting
Added e2e & unit tests
Docs and guide updates

What's Changed

allow shared prefix question and system prompt variance and calculate… by @kaushikmitr in #301
test(loadgen): add unit tests for worker concurrency distribution by @sats-23 in #337
fix: use worker_id in request queue for concurrent load generation by @changminbark in #347
Update title and URL for wg-serving reference by @terrytangyuan in #350
feat: vLLM latest (0.15.0) production metrics by @changminbark in #348
feat: added configurable base seed for loadgen workers by @changminbark in #349
Assert monotonically increasing code coverage by @Bslabe123 in #357
Refactor OpenAI client to fix connection leaks and improve error telemetry by @LukeAVanDrie in #247
Refactor RequestQueueData to NamedTuple for better readability by @sats-23 in #333
Update version in Helm chart by @adinilfeld in #366
fix: preserve configured tokenizer when using MockModelServerClient by @alonh in #353
Multi report comparative analysis feature by @SachinVarghese in #355
feat: add structured output support for vLLM backend by @dhxshop in #339
Add rich progress bar and a table with results by @achandrasekar in #377
fix(config): substitute timestamp in storage paths by @yangligt2 in #330
Update coverage check script by @jjk-g in #362
Add Licence Headers Check Presubmit by @Bslabe123 in #380
Fix No module named 'matplotlib' CI Error by @Bslabe123 in #409
fix: correct various typos in code and documentation by @jjk-g in #375
Update JOSS paper with reviewer feedback by @achandrasekar in #412
Fix sigint handling and shared prefix request count by @achandrasekar in #388
Add ShareGPT end-to-end tests by @diamondburned in #414
feat: OTel Trace Replay - Agentic Workload Benchmarking by @alonh in #372
Update contributor guide by @achandrasekar in #415
Pin GitHub Actions to specific commit hashes by @diamondburned in #418
Fix streaming metrics calculation by avoiding stream consumption before parsing by @achandrasekar in #420
Add CLI flags as a way to specify benchmark config by @achandrasekar in #389
fix test:e2e:docker ModuleNotFoundError by @diamondburned in #417
Add distribution sampling for shared_prefix by @Navjot10 in #387
Update goodput reporting based on latency SLOs by @achandrasekar in #427
Add workload catalog by @achandrasekar in #432
Update README with detailed workload descriptions by @seanhorgan in #439
Update chat to completion for workload catalog by @achandrasekar in #440
Improve loadgen coverage by @jjk-g in #367
improve the otel readme and link the top level readme to it by @alonh in #441
feat: add benchmark_time_seconds to metrics report by @changminbark in #430
Add Pre-Commit and Pre-Push hooks by @Bslabe123 in #424
Fix Multi Turn Chat Hang by @Bslabe123 in #443
feat: conversation_replay data generator for agentic workloads by @LoganVegnaSHOP in #426
Derive TPOT, ITL from Response Tokens, not Chunks by @Bslabe123 in #410
[URGENT] fix: correct InferenceInfo.output_tokens access via nested response_info by @Bslabe123 in #454
Fix TPOT/TTFT/ITL skew from non-content SSE events by @Bslabe123 in #452
Update catalog to capture multi-turn tool calling by @achandrasekar in #457
Extract common graph-backed session replay runtime into ReplayGraphSessionGeneratorBase by @alonh in #436
chore: allow virtual address style for s3 storage by @walterbm in #458
Add E2e testing for Prometheus Querying and Report Contents by @Bslabe123 in #413
Add regression test for issue #364 by @Bslabe123 in #460
fix: clear user sessions between load stages by @alonh in #459
fix several zero vLLM Prometheus counter metrics by @diamondburned in #455
Improve build_graph runtime in otel_trace_to_replay_graph.py by @lenadankin in #434
workaround unexpected sharegpt format change by @diamondburned in #433
Fix RuntimeError on conversation_replay request failure by @kaushikmitr in #462
Fix SharedPrefix Datagen Prompt Length by @Bslabe123 in #383
Prepare version 0.5.0 by @jjk-g in #463
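
Several of the fixes above (#410, #452) concern deriving TPOT and ITL from response tokens rather than stream chunks. The sketch below illustrates why the distinction matters when a single chunk carries several tokens. It uses the commonly cited definition TPOT = (end-to-end latency - TTFT) / (output tokens - 1); the `tpot_seconds` helper and the sample numbers are hypothetical and not taken from the project's code.

```python
from typing import Optional


def tpot_seconds(ttft_s: float, e2e_latency_s: float, units: int) -> Optional[float]:
    """Time per output unit after the first one, or None if undefined.

    `units` is whatever is being counted: output tokens (preferred) or
    stream chunks (misleading when a chunk holds more than one token).
    """
    if units < 2:
        return None  # no decode interval exists with fewer than two units
    return (e2e_latency_s - ttft_s) / (units - 1)


# Hypothetical streamed response: 10 output tokens delivered in 3 chunks.
ttft_s, e2e_latency_s = 0.2, 2.0
by_tokens = tpot_seconds(ttft_s, e2e_latency_s, units=10)
by_chunks = tpot_seconds(ttft_s, e2e_latency_s, units=3)
print(f"per token: {by_tokens:.3f}s, per chunk: {by_chunks:.3f}s")
```

Counting chunks here would overstate the per-token latency by a factor of 4.5, which is the kind of skew a token-based calculation avoids.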

New Contributors

@kaushikmitr made their first contribution in #301
@sats-23 made their first contribution in #337
@adinilfeld made their first contribution in #366
@alonh made their first contribution in #353
@dhxshop made their first contribution in #339
@Navjot10 made their first contribution in #387
@seanhorgan made their first contribution in #439
@LoganVegnaSHOP made their first contribution in #426
@walterbm made their first contribution in #458
@lenadankin made their first contribution in #434

Full Changelog: v0.4.0...v0.5.0

Docker Image

quay.io/inference-perf/inference-perf:v0.5.0

Python Package

pip install inference-perf==v0.5.0

Contributors

seanhorgan, Bslabe123, and 17 other contributors

06 Feb 22:48

jjk-g

v0.4.0

e3e690b

v0.4.0

This release contains several feature improvements and various bug fixes:

mTLS support in vllm client
Multilora support
E2e testing against inference sim
Multiple report analysis
New aliases for shared_prefix config fields
Dependency updates

What's Changed

Update attribute error when parsing billsum conversations by @rlakhtakia in #297
feat: add percentiles configuration for request lifecycle metrics reporting by @hhk7734 in #295
Add mTLS support in vllm client by @unicell in #302
Add cloudbuild.yaml by @jjk-g in #306
Fix vllm prefix metrics by @jjk-g in #309
feat: Multilora support by @changminbark in #315
Add end-to-end testing using llm-d-inference-sim by @diamondburned in #294
chore: update README.md with concurrent load generation info by @changminbark in #313
Enabling multiple report analysis using CLI tool by @SachinVarghese in #307
fix concurrency higher than set issue. by @zetxqx in #320
Add Journal of Open Source Software paper on inference-perf by @achandrasekar in #326
Fix docs to reference std, actual field is std_dev by @jjk-g in #335
Add Sachin to authors in JOSS paper by @achandrasekar in #334
Add shuffle to multi-round prompt generation by @elevran in #331
Add aliases for shared_prefix config fields by @jjk-g in #311
Update paper with statement of need and references by @achandrasekar in #342
Add unique random seed to worker by @yangligt2 in #340
Update transformers by @jjk-g in #336

New Contributors

@unicell made their first contribution in #302
@elevran made their first contribution in #331
@yangligt2 made their first contribution in #340

Full Changelog: v0.3.0...v0.4.0

Docker Image

quay.io/inference-perf/inference-perf:v0.4.0

Python Package

pip install inference-perf==v0.4.0

Contributors

unicell, achandrasekar, and 9 other contributors

26 Nov 21:06

SachinVarghese

v0.3.0

a85b31b

v0.3.0

This release comes with some major improvements:

Trace file based load generation and testing
Support for benchmarking multi-turn chat scenarios
Shared client session for load generation performance
Improved helm chart configurations for kubernetes deployment
End to end tests on CI/CD pipeline

What's Changed

Improve efficiency and readability of data generators by @pancak3 in #210
Make selected request rates accurate to two decimal places (formerly zero) when using linear sweep type by @Bslabe123 in #237
Add debug log for saturation sampling by @jjk-g in #236
ci: push helm chart to OCI registry when release by @ExplorerRay in #240
chore: add inter_token_latency in ModelServerMetrics for sglang metrics by @jlcoo in #242
use achieved_rate in the report graph. by @zetxqx in #232
Improve docker image building by @pancak3 in #228
feat: Enhance Helm chart flexibility for job by @LukeAVanDrie in #248
Catch saturation detection failure by @jjk-g in #251
Adding time per output tokens prometheus metrics for sglang server by @SachinVarghese in #254
feat: loadgen SIGINT handler by @changminbark in #244
Feat: Add request timeouts and circuit breakers (#148) by @huaxig in #227
Added PrometheusMetric Implementations by @Bslabe123 in #221
Workflow that currently pushes Docker image now also pushes Helm chart by @Bslabe123 in #259
Fix for Invalid Chart Version by @Bslabe123 in #261
Add jjk-g to maintainers by @achandrasekar in #267
Update helm chart to pass in gcs bucket to download datasets. by @rlakhtakia in #260
Fixing test and validate workflows by @SachinVarghese in #272
publish-on-change workflow should use helm client login instead of docker login by @Bslabe123 in #264
Add Kubecon Demo results by @Bslabe123 in #224
docs: clarify authentication needed for querying metrics from GMP by @Bslabe123 in #276
Update vLLM kv cache metric from vllm:gpu_cache_usage_perc to vllm:kv_cache_usage_perc by @Bslabe123 in #277
Update helm to add service account name by @rlakhtakia in #270
Update helm chart to pull datasets from s3 bucket. by @rlakhtakia in #278
Trace load gen by @aish1331 in #198
fix: stabilize streaming responses for large chunk using iter_any() by @zetxqx in #284
fix: custom tokenizer truncates inputs to model max input length by @changminbark in #266
[Testing / CI/CD] Ability to automate scale testing with a mock server and test different datasets, loadgen, etc. and run it as a part of CI/CD (#274) by @huaxig in #274
Update helm to pass in existing kubernetes secret. by @rlakhtakia in #281
Loadgen concurrent load type by @changminbark in #263
Improve MultiprocessRequestDataCollector async by @diamondburned in #280
update gcs bucket to pass in bucket name only for consistency by @rlakhtakia in #285
Feat: Add user session to support Multi-turn chat (#179) by @huaxig in #257
fix pyproject dependency groups and TOML parsing issue by @diamondburned in #291
Fix overflow on tokenizer truncation by @jjk-g in #290
chore: improve openai client error handling by including status code and reason by @hhk7734 in #289
Fix: requests get duplicated using shared_prefix datagen when multi-turn chat disabled by @huaxig in #293
Share aiohttp.ClientSessions per worker by @diamondburned in #282

New Contributors

@jlcoo made their first contribution in #242
@zetxqx made their first contribution in #232
@LukeAVanDrie made their first contribution in #248
@changminbark made their first contribution in #244
@diamondburned made their first contribution in #280
@hhk7734 made their first contribution in #289

Full Changelog: v0.2.0...v0.3.0

Docker Image

quay.io/inference-perf/inference-perf:v0.3.0

Python Package

pip install inference-perf==v0.3.0

Contributors

Bslabe123, achandrasekar, and 13 other contributors

24 Sep 16:59

github-actions

v0.2.0

1ccc48b

v0.2.0

This release comes with some major improvements:

Improved default concurrency and multi-process CPU utilization, with extensive scale testing to confirm that the latency values reported at high concurrency (up to 10k QPS) are accurate
Enhanced support for SGLang and TGI with model server metrics
New dataset support for summarization (CNN Dailymail), prefill-heavy (Billsum Conversations) and decode-heavy (Infinity Instruct) use cases
Automatic sweep of request rates until saturation
Observability improvements around load generation and the ability to monitor scheduling delay and achieved rate

What's Changed

Add Bslabe123 and jjk-g as reviewers by @achandrasekar in #162
update streaming part in the documentation too to make it clear by @liyuerich in #168
Update instructions on running inference-perf without building from source by @achandrasekar in #170
Fix Malformed GMP Query URL by @Bslabe123 in #172
feat: revise Dockerfile to reduce image size by @ExplorerRay in #164
Support for running inference-perf offline by @aish1331 in #152
Update manifests.yaml by @Bslabe123 in #175
Change default api in config.yml from chat to completion by @Bslabe123 in #178
Remove outdated instructions in deploy/README.md by @Bslabe123 in #180
Add Streaming Support for Chat API Requests by @Bslabe123 in #173
Include KV-Cache Usage Percentage Metrics in Prometheus Reports by @Bslabe123 in #184
Fix for zeroed metrics added in #184 by @Bslabe123 in #186
Fix per stage PromQL metric queries by using time.time() instead of time.perf_counter() by @Bslabe123 in #177
Defer InferenceAPIData gen to worker procs by @jjk-g in #157
Rename schedule_accuracy to schedule_delay by @jjk-g in #195
Support custom http headers in inference requests by @achandrasekar in #192
Add vllm metrics [preemptions and swapped requests] to prometheus metrics by @Shuwen-Fang in #197
Remove alpha channel of diagram png to keep presentation consistent in different browser modes (#199) by @pancak3 in #200
Added SGLang server support by @SachinVarghese in #193
Add cnn_dailymail datagen by @jjk-g in #196
Use 'median' instead of 'p50' in output reports by @Bslabe123 in #201
Fix typecheck issue by @achandrasekar in #209
Reduce CPU utilization of waiting workers by @jjk-g in #204
refactor: simplify response branches by @dublc in #208
iter -> itertools.cycle in Datagen by @Bslabe123 in #214
Support for populating hf_token values from k8s secrets (#183) by @huaxig in #212
Add vllm prefix metrics by @jjk-g in #216
Added support for TGI Model Server by @aish1331 in #203
Break metricsclient dependency on loadgen and modelserverclient modules by @Bslabe123 in #206
bugfix: self.additional_metrics_filters -> self.metrics_filters by @Bslabe123 in #220
Add Infinity Instruct datagen by @rlakhtakia in #217
Saturation detection and auto sweep by @jjk-g in #215
Write config.yaml to report directory by @jjk-g in #219
Add more percentiles to metrics by @namasl in #226
Update inf-perf to use hf-billsum dataset by @rlakhtakia in #207
Update documentation to cover metrics, loadgen and new datasets by @achandrasekar in #234
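
The per-stage PromQL fix in #177 rests on a general property of Python's clocks: `time.perf_counter()` has an undefined reference point and is only meaningful for measuring durations, while `time.time()` returns seconds since the Unix epoch, which is what a Prometheus range query expects for its window. The sketch below shows one way to record both. The `StageWindow` and helper names are hypothetical; the `query`, `start`, `end`, and `step` parameter names follow the Prometheus HTTP API's `query_range` endpoint, and no request is actually sent.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class StageWindow:
    start_unix: float  # wall-clock start, seconds since the Unix epoch
    end_unix: float    # wall-clock end, derived from start plus elapsed time
    elapsed_s: float   # duration measured with the monotonic perf counter


def run_stage(work: Callable[[], None]) -> StageWindow:
    """Run one load stage and record a window usable in metrics queries."""
    start_unix = time.time()   # epoch timestamp for the query window
    t0 = time.perf_counter()   # high-resolution clock for the duration only
    work()
    elapsed_s = time.perf_counter() - t0
    return StageWindow(start_unix, start_unix + elapsed_s, elapsed_s)


def query_range_params(query: str, window: StageWindow, step_s: int = 15) -> Dict[str, str]:
    """Build query-string parameters for a Prometheus range query."""
    return {
        "query": query,
        "start": f"{window.start_unix:.3f}",
        "end": f"{window.end_unix:.3f}",
        "step": str(step_s),
    }


window = run_stage(lambda: time.sleep(0.01))
params = query_range_params("vllm:kv_cache_usage_perc", window)
print(params)
```

Passing a `perf_counter()` value as `start` or `end` would point the query at an arbitrary moment, typically near the epoch, so the stage would appear to have no samples.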

New Contributors

@Shuwen-Fang made their first contribution in #197
@pancak3 made their first contribution in #200
@dublc made their first contribution in #208
@huaxig made their first contribution in #212
@rlakhtakia made their first contribution in #217
@namasl made their first contribution in #226

Full Changelog: v0.1.1...v0.2.0

Docker Image

quay.io/inference-perf/inference-perf:v0.2.0

Python Package

pip install inference-perf==v0.2.0

Contributors

Bslabe123, achandrasekar, and 11 other contributors

01 Aug 22:43

github-actions

v0.1.1

fd21242

v0.1.1

What's Changed

Simplified local report storage by @SachinVarghese in #118
Add basic Helm Chart by @jjk-g in #114
Add support for api_key #116 by @andresC98 in #120
feat: migrate scripts from Makefile to PDM scripts by @rudeigerc in #122
Add Support for Querying Metrics from Google Managed Prometheus and Additional PromQL Filters now Configurable by @Bslabe123 in #121
fix: improve Docker build workflow with better secret handling and debugging by @wangchen615 in #128
fix: improve release workflow with proper changelog and Docker image handling by @wangchen615 in #129
Add qps observability, fractional rates by @jjk-g in #125
add config.md file to provide detail description for config.yml parameters by @liyuerich in #131
Prometheus query fixes and examples update by @SachinVarghese in #130
Add ability to specify datagen bounds for ShareGPT by @jjk-g in #137
Add the ability to analyze reports and produce charts by @achandrasekar in #135
update datasets type by @liyuerich in #136
Fix a parsing issue with streaming requests by @achandrasekar in #140
feat: add support for s3 storage by @omerap12 in #147
Add detection for model_name and tokenizer by @jjk-g in #145
update example config yml files by @liyuerich in #149
Fix QPS accuracy at lower rates by @jjk-g in #143
Update Helm chart by @jjk-g in #154
Autocalc total_count by @jjk-g in #155
fix: only assign tokenizer to model_name when not configured by @ExplorerRay in #160
feat: Add Python package publishing to release workflow by @wangchen615 in #153
Point config.yml to mounted configmap file by @Bslabe123 in #158

New Contributors

@andresC98 made their first contribution in #120
@rudeigerc made their first contribution in #122
@liyuerich made their first contribution in #131
@omerap12 made their first contribution in #147
@ExplorerRay made their first contribution in #160

Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.1

Docker Image

quay.io/inference-perf/inference-perf:v0.1.1

Python Package

pip install inference-perf==v0.1.1

Contributors

Bslabe123, achandrasekar, and 8 other contributors

26 Jun 18:59

github-actions

v0.1.0

a097500

v0.1.0

We are excited to announce the initial release of Inference Perf v0.1.0! This release comes with the following key features:

Highly scalable and can support benchmarking large inference production deployments by sending up to 10k QPS.
Reports the key metrics needed to measure LLM performance.
Supports different real world and synthetic datasets.
Supports different APIs and can support multiple model servers.
Supports specifying an exact input and output distribution to simulate different scenarios - Gaussian distribution, fixed length, min-max cases are all supported.
Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
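
To make the distribution feature above concrete, here is a minimal sketch of sampling token lengths from a Gaussian with min-max bounds, where a standard deviation of zero degenerates to the fixed-length case. The `sample_lengths` function and its parameters are hypothetical and are not the tool's actual configuration or code; clamping to the bounds is one possible design and is an assumption here.

```python
import random
from typing import List


def sample_lengths(n: int, mean: float, std_dev: float,
                   min_len: int, max_len: int, seed: int = 0) -> List[int]:
    """Draw n token lengths from N(mean, std_dev), clamped to [min_len, max_len]."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    rng = random.Random(seed)  # seeded so a benchmark run can be reproduced
    lengths = []
    for _ in range(n):
        value = round(rng.gauss(mean, std_dev))
        lengths.append(max(min_len, min(max_len, value)))
    return lengths


# Gaussian input lengths bounded to a min-max range.
gaussian = sample_lengths(n=1000, mean=512, std_dev=128, min_len=64, max_len=1024)
# Fixed-length case: zero spread collapses every sample onto the mean.
fixed = sample_lengths(n=5, mean=128, std_dev=0, min_len=1, max_len=1024)
print(min(gaussian), max(gaussian), fixed)
```

Clamping piles probability mass onto the bounds when they sit close to the mean; rejecting and redrawing out-of-range samples is an alternative that preserves the shape inside the range at the cost of extra draws.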

What's Changed

Add directory structure for the tool by @achandrasekar in #1
Add Makefile and typecheck presubmit by @Bslabe123 in #3
Fix Makefile Typo by @Bslabe123 in #6
Add design document to the repo by @achandrasekar in #4
Add default python gitignore by @sjmonson in #11
Make Inference-Perf Package-able / Use Modern Python Tooling by @sjmonson in #13
Added Abstract Type for Metrics Client by @Bslabe123 in #7
Add Chen Wang to OWNERS by @terrytangyuan in #20
Inference perf basic load run implementation by @SachinVarghese in #21
Adding vLLM Client to inference perf runner by @SachinVarghese in #27
Add HF ShareGPT Data Generator by @vivekk16 in #33
Add Unit Testing Maketargets and Unit Testing Github Workflow by @Bslabe123 in #19
Mock metrics client implementation by @SachinVarghese in #32
Parameterization of CLI tool using config file by @SachinVarghese in #34
Add SachinVarghese as approver, add owner aliases by @achandrasekar in #36
Containerize the benchmark by @achandrasekar in #38
Added demo example for vLLM Server and shareGPT datagen component by @SachinVarghese in #37
Fix: Raising error for api type mismatch by @SachinVarghese in #44
Add Custom Tokenizer by @vivekk16 in #43
Multi-stage performance run by @SachinVarghese in #49
Update README.md with meeting time / recording links by @achandrasekar in #54
Add support for cluster-local benchmarking by @Bslabe123 in #60
Update DataGenerator to Handle Both Chat and Completion APIs by @Bslabe123 in #58
Lint and type check fixes by @SachinVarghese in #62
Add StorageClient abstract type and GCS Client Implementation by @Bslabe123 in #61
Added Prometheus client to get model server metrics by @aish1331 in #64
Add support for different input distributions with a synthetic dataset by @achandrasekar in #66
Automatically Populate Missing Fields in Config by @Bslabe123 in #71
Generic model server client config by @SachinVarghese in #72
Request Lifecycle Report Generation by @Bslabe123 in #77
Add output distribution to synthetic data generator by @achandrasekar in #79
Improved Logging for Writing Report Files by @Bslabe123 in #80
Add the option to ignore end of sequence by @achandrasekar in #83
Add GitHub Release Workflow and Changelog Configuration by @wangchen615 in #41
Improved abstractions for perf project by @SachinVarghese in #84
Add issue templates for the repo by @achandrasekar in #90
docs: Update link to Slack channel in README.md by @terrytangyuan in #91
Add random data generator by @achandrasekar in #94
Multi-stage report generation for Prometheus Metrics by @aish1331 in #95
Add Docker build and push workflows for PRs and releases by @wangchen615 in #97
Add shared prefix generator to benchmark prefix caching by @achandrasekar in #98
Added throughput metrics to output report by @Bslabe123 in #101
Basic code test setup by @SachinVarghese in #96
Enable Docker Build Workflow on Push to Main Branch by @wangchen615 in #102
Fix Docker Tag Generation by Using env.QUAY_USERNAME in Workflow by @wangchen615 in #105
Update Quay.io Organization Name in Docker Build Workflow by @wangchen615 in #106
Add Support for Streaming Requests to Completions API by @Bslabe123 in #103
Add multiprocess, multithreaded loadgen by @jjk-g in #99
Update documentation to cover newer capabilities by @achandrasekar in #104
Use logging methods with levels instead of print by @shotarok in #110
Merge the latest fixes to the release branch by @achandrasekar in #115

New Contributors

@achandrasekar made their first contribution in #1
@Bslabe123 made their first contribution in #3
@sjmonson made their first contribution in #11
@terrytangyuan made their first contribution in #20
@SachinVarghese made their first contribution in #21
@vivekk16 made their first contribution in #33
@aish1331 made their first contribution in #64
@wangchen615 made their first contribution in #41
@jjk-g made their first contribution in #99
@shotarok made their first contribution in #110

Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.0

Contributors

Bslabe123, shotarok, and 8 other contributors
