Releases: kubernetes-sigs/inference-perf
v0.5.0
Summary
Features
- New UX components including tui and cli args
- OTel trace replay support
- Multi-report analysis
- Distribution Sampling
- Goodput metrics
Bug Fixes
- Fixes for concurrency and multi-turn
- vLLM metric parity
- Multi-token chunk handling
- Iterative convergence for prompt length accuracy
Improvements
- Code coverage reporting
- Added e2e & unit tests
- Docs and guide updates
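As a rough illustration of the goodput metrics listed under Features: goodput is commonly understood as the rate of requests that both succeed and meet their latency SLOs. The sketch below is a minimal example under that assumption. `RequestResult`, `goodput`, and the SLO parameter names are hypothetical and are not taken from inference-perf's code or report format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestResult:
    """Hypothetical per-request record (not inference-perf's data model)."""
    ttft_s: float         # time to first token, in seconds
    tpot_s: float         # time per output token, in seconds
    success: bool = True  # False if the request errored or timed out


def goodput(results: Sequence[RequestResult], duration_s: float,
            ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Return requests per second that succeeded and met both latency SLOs."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1 for r in results
        if r.success and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / duration_s


results = [
    RequestResult(ttft_s=0.20, tpot_s=0.03),
    RequestResult(ttft_s=0.90, tpot_s=0.03),  # misses the TTFT SLO
    RequestResult(ttft_s=0.25, tpot_s=0.08),  # misses the TPOT SLO
    RequestResult(ttft_s=0.30, tpot_s=0.04),
]
print(goodput(results, duration_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05))  # 1.0
```

Here four requests complete in two seconds, so raw throughput is 2 requests/s, but only two of them meet both SLOs, so goodput is 1 request/s.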
What's Changed
- allow shared prefix question and system prompt variance and calculate… by @kaushikmitr in #301
- test(loadgen): add unit tests for worker concurrency distribution by @sats-23 in #337
- fix: use worker_id in request queue for concurrent load generation by @changminbark in #347
- Update title and URL for wg-serving reference by @terrytangyuan in #350
- feat: vLLM latest (0.15.0) production metrics by @changminbark in #348
- feat: added configurable base seed for loadgen workers by @changminbark in #349
- Assert monotonically increasing code coverage by @Bslabe123 in #357
- Refactor OpenAI client to fix connection leaks and improve error telemetry by @LukeAVanDrie in #247
- Refactor RequestQueueData to NamedTuple for better readability by @sats-23 in #333
- Update version in Helm chart by @adinilfeld in #366
- fix: preserve configured tokenizer when using MockModelServerClient by @alonh in #353
- Multi report comparative analysis feature by @SachinVarghese in #355
- feat: add structured output support for vLLM backend by @dhxshop in #339
- Add rich progress bar and a table with results by @achandrasekar in #377
- fix(config): substitute timestamp in storage paths by @yangligt2 in #330
- Update coverage check script by @jjk-g in #362
- Add Licence Headers Check Presubmit by @Bslabe123 in #380
- Fix No module named 'matplotlib' CI Error by @Bslabe123 in #409
- fix: correct various typos in code and documentation by @jjk-g in #375
- Update JOSS paper with reviewer feedback by @achandrasekar in #412
- Fix sigint handling and shared prefix request count by @achandrasekar in #388
- Add ShareGPT end-to-end tests by @diamondburned in #414
- feat: OTel Trace Replay- Agentic Workload Benchmarking by @alonh in #372
- Update contributor guide by @achandrasekar in #415
- Pin GitHub Actions to specific commit hashes by @diamondburned in #418
- Fix streaming metrics calculation by avoiding stream consumption before parsing by @achandrasekar in #420
- Add CLI flags as a way to specify benchmark config by @achandrasekar in #389
- fix test:e2e:docker ModuleNotFoundError by @diamondburned in #417
- Add distribution sampling for shared_prefix by @Navjot10 in #387
- Update goodput reporting based on latency SLOs by @achandrasekar in #427
- Add workload catalog by @achandrasekar in #432
- Update README with detailed workload descriptions by @seanhorgan in #439
- Update chat to completion for workload catalog by @achandrasekar in #440
- Improve loadgen coverage by @jjk-g in #367
- improve the otel readme and link the top level readme to it by @alonh in #441
- feat: add benchmark_time_seconds to metrics report by @changminbark in #430
- Add Pre-Commit and Pre-Push hooks by @Bslabe123 in #424
- Fix Multi Turn Chat Hang by @Bslabe123 in #443
- feat: conversation_replay data generator for agentic workloads by @LoganVegnaSHOP in #426
- Derive TPOT, ITL from Response Tokens, not Chunks by @Bslabe123 in #410
- [URGENT] fix: correct InferenceInfo.output_tokens access via nested response_info by @Bslabe123 in #454
- Fix TPOT/TTFT/ITL skew from non-content SSE events by @Bslabe123 in #452
- Update catalog to capture multi-turn tool calling by @achandrasekar in #457
- Extract common graph-backed session replay runtime into ReplayGraphSessionGeneratorBase by @alonh in #436
- chore: allow virtual address style for s3 storage by @walterbm in #458
- Add E2e testing for Prometheus Querying and Report Contents by @Bslabe123 in #413
- Add regression test for issue #364 by @Bslabe123 in #460
- fix: clear user sessions between load stages by @alonh in #459
- fix several zero vLLM Prometheus counter metrics by @diamondburned in #455
- Improve build_graph runtime in otel_trace_to_replay_graph.py by @lenadankin in #434
- workaround unexpected sharegpt format change by @diamondburned in #433
- Fix RuntimeError on conversation_replay request failure by @kaushikmitr in #462
- Fix SharedPrefixDatagen Prompt Length by @Bslabe123 in #383
- Prepare version 0.5.0 by @jjk-g in #463
New Contributors
- @kaushikmitr made their first contribution in #301
- @sats-23 made their first contribution in #337
- @adinilfeld made their first contribution in #366
- @alonh made their first contribution in #353
- @dhxshop made their first contribution in #339
- @Navjot10 made their first contribution in #387
- @seanhorgan made their first contribution in #439
- @LoganVegnaSHOP made their first contribution in #426
- @walterbm made their first contribution in #458
- @lenadankin made their first contribution in #434
Full Changelog: v0.4.0...v0.5.0
Docker Image
quay.io/inference-perf/inference-perf:v0.5.0
Python Package
pip install inference-perf==v0.5.0
v0.4.0
This release contains several feature improvements and various bug fixes:
- mTLS support in vllm client
- Multilora support
- E2e testing against inference sim
- Multiple report analysis
- New aliases for shared_prefix config fields
- Dependency updates
What's Changed
- Update attribute error when parsing billsum conversations by @rlakhtakia in #297
- feat: add percentiles configuration for request lifecycle metrics reporting by @hhk7734 in #295
- Add mTLS support in vllm client by @unicell in #302
- Add cloudbuild.yaml by @jjk-g in #306
- Fix vllm prefix metrics by @jjk-g in #309
- feat: Multilora support by @changminbark in #315
- Add end-to-end testing using llm-d-inference-sim by @diamondburned in #294
- chore: update README.md with concurrent load generation info by @changminbark in #313
- Enabling multiple report analysis using CLI tool by @SachinVarghese in #307
- fix concurrency higher than set issue. by @zetxqx in #320
- Add Journal of Open Source Software paper on inference-perf by @achandrasekar in #326
- Fix docs to reference std, actual field is std_dev by @jjk-g in #335
- Add Sachin to authors in JOSS paper by @achandrasekar in #334
- Add shuffle to multi-round prompt generation by @elevran in #331
- Add aliases for shared_prefix config fields by @jjk-g in #311
- Update paper with statement of need and references by @achandrasekar in #342
- Add unique random seed to worker by @yangligt2 in #340
- Update transformers by @jjk-g in #336
New Contributors
- @unicell made their first contribution in #302
- @elevran made their first contribution in #331
- @yangligt2 made their first contribution in #340
Full Changelog: v0.3.0...v0.4.0
Docker Image
quay.io/inference-perf/inference-perf:v0.4.0
Python Package
pip install inference-perf==v0.4.0
v0.3.0
This release comes with some major improvements:
- Trace file based load generation and testing
- Support for benchmarking multi-turn chat scenarios
- Shared client session for load generation performance
- Improved helm chart configurations for kubernetes deployment
- End to end tests on CI/CD pipeline
What's Changed
- Improve efficiency and readability of data generators by @pancak3 in #210
- Make selected request rates accurate to two decimal places (formerly zero) when using linear sweep type by @Bslabe123 in #237
- Add debug log for saturation sampling by @jjk-g in #236
- ci: push helm chart to OCI registry when release by @ExplorerRay in #240
- chore: add inter_token_latency in ModelServerMetrics for sglang metrics by @jlcoo in #242
- use achieved_rate in the report graph. by @zetxqx in #232
- Improve docker image building by @pancak3 in #228
- feat: Enhance Helm chart flexibility for job by @LukeAVanDrie in #248
- Catch saturation detection failure by @jjk-g in #251
- Adding time per output tokens prometheus metrics for sglang server by @SachinVarghese in #254
- feat: loadgen SIGINT handler by @changminbark in #244
- Feat: Add request timeouts and circuit breakers (#148) by @huaxig in #227
- Added PrometheusMetric Implementations by @Bslabe123 in #221
- Workflow that currently pushes Docker image now also pushes Helm chart by @Bslabe123 in #259
- Fix for Invalid Chart Version by @Bslabe123 in #261
- Add jjk-g to maintainers by @achandrasekar in #267
- Update helm chart to pass in gcs bucket to download datasets. by @rlakhtakia in #260
- Fixing test and validate workflows by @SachinVarghese in #272
- publish-on-change workflow should use helm client login instead of docker login by @Bslabe123 in #264
- Add Kubecon Demo results by @Bslabe123 in #224
- docs: clarify authentication needed for querying metrics from GMP by @Bslabe123 in #276
- Update vLLM kv cache metric from vllm:gpu_cache_usage_perc to vllm:kv_cache_usage_perc by @Bslabe123 in #277
- Update helm to add service account name by @rlakhtakia in #270
- Update helm chart to pull datasets from s3 bucket. by @rlakhtakia in #278
- Trace load gen by @aish1331 in #198
- fix: stabilize streaming responses for large chunk using iter_any() by @zetxqx in #284
- fix: custom tokenizer truncates inputs to model max input length by @changminbark in #266
- [Testing / CI/CD] Ability to automate scale testing with a mock server and test different datasets, loadgen, etc. and run it as a part of CI/CD (#274) by @huaxig in #274
- Update helm to pass in existing kubernetes secret. by @rlakhtakia in #281
- Loadgen concurrent load type by @changminbark in #263
- Improve MultiprocessRequestDataCollector async by @diamondburned in #280
- update gcs bucket to pass in bucket name only for consistency by @rlakhtakia in #285
- Feat: Add user session to support Multi-turn chat (#179) by @huaxig in #257
- fix pyproject dependency groups and TOML parsing issue by @diamondburned in #291
- Fix overflow on tokenizer truncation by @jjk-g in #290
- chore: improve openai client error handling by including status code and reason by @hhk7734 in #289
- Fix: requests get duplicated using shared_prefix datagen when multi-turn chat disabled by @huaxig in #293
- Share aiohttp.ClientSessions per worker by @diamondburned in #282
New Contributors
- @jlcoo made their first contribution in #242
- @zetxqx made their first contribution in #232
- @LukeAVanDrie made their first contribution in #248
- @changminbark made their first contribution in #244
- @diamondburned made their first contribution in #280
- @hhk7734 made their first contribution in #289
Full Changelog: v0.2.0...v0.3.0
Docker Image
quay.io/inference-perf/inference-perf:v0.3.0
Python Package
pip install inference-perf==v0.3.0
v0.2.0
This release comes with some major improvements:
- Improved default concurrency and multi-process CPU utilization, with extensive scale testing to ensure that latency values reported at high concurrency (up to 10k QPS) are accurate
- Enhanced support for SGLang and TGI with model server metrics
- New dataset support for summarization (CNN Dailymail), prefill heavy (Billsum Conversations) and decode heavy (Infinity Instruct) use cases
- Automatic sweep of request rates until saturation
- Observability improvements around load generation and the ability to monitor scheduling delay and achieved rate
What's Changed
- Add Bslabe123 and jjk-g as reviewers by @achandrasekar in #162
- update streaming part in the documentation too to make it clear by @liyuerich in #168
- Update instructions on running inference-perf without building from source by @achandrasekar in #170
- Fix Malformed GMP Query URL by @Bslabe123 in #172
- feat: revise Dockerfile to reduce image size by @ExplorerRay in #164
- Support for running inference-perf offline by @aish1331 in #152
- Update manifests.yaml by @Bslabe123 in #175
- Change default api in config.yml from chat to completion by @Bslabe123 in #178
- Remove outdated instructions in deploy/README.md by @Bslabe123 in #180
- Add Streaming Support for Chat API Requests by @Bslabe123 in #173
- Include KV-Cache Usage Percentage Metrics in Prometheus Reports by @Bslabe123 in #184
- Fix for zeroed metrics added in #184 by @Bslabe123 in #186
- Fix per stage PromQL metric queries by using time.time() instead of time.perf_counter() by @Bslabe123 in #177
- Defer InferenceAPIData gen to worker procs by @jjk-g in #157
- Rename schedule_accuracy to schedule_delay by @jjk-g in #195
- Support custom http headers in inference requests by @achandrasekar in #192
- Add vllm metrics [preemptions and swapped requests] to prometheus metrics by @Shuwen-Fang in #197
- Remove alpha channel of diagram png to keep presentation consistent in different browser modes (#199) by @pancak3 in #200
- Added SGlang server support by @SachinVarghese in #193
- Add cnn_dailymail datagen by @jjk-g in #196
- Use 'median' instead of 'p50' in output reports by @Bslabe123 in #201
- Fix typecheck issue by @achandrasekar in #209
- Reduce CPU utilization of waiting workers by @jjk-g in #204
- refactor: simplify response branches by @dublc in #208
- iter -> itertools.cycle in Datagen by @Bslabe123 in #214
- Support for populating hf_token values from k8s secrets (#183) by @huaxig in #212
- Add vllm prefix metrics by @jjk-g in #216
- Added support for TGI Model Server by @aish1331 in #203
- Break metricsclient dependency on loadgen and modelserverclient modules by @Bslabe123 in #206
- bugfix: self.additional_metrics_filters -> self.metrics_filters by @Bslabe123 in #220
- Add Infinity Instruct datagen by @rlakhtakia in #217
- Saturation detection and auto sweep by @jjk-g in #215
- Write config.yaml to report directory by @jjk-g in #219
- Add more percentiles to metrics by @namasl in #226
- Update inf-perf to use hf-billsum dataset by @rlakhtakia in #207
- Update documentation to cover metrics, loadgen and new datasets by @achandrasekar in #234
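The per-stage PromQL fix in the list above (#177) swaps time.perf_counter() for time.time(). A likely reason, sketched here rather than taken from the project's code: Prometheus range queries take Unix timestamps, while perf_counter() counts from an undefined reference point and is only meaningful as a difference between two readings. The helper names `run_stage` and `query_range_params` are hypothetical; the parameter names `query`, `start`, `end`, and `step` are those of Prometheus' `/api/v1/query_range` endpoint.

```python
import time


def run_stage(work) -> tuple[float, float]:
    """Run one load stage and return its (start, end) as Unix timestamps."""
    start = time.time()  # wall clock: seconds since the Unix epoch
    work()
    end = time.time()
    return start, end


def query_range_params(promql: str, start: float, end: float,
                       step_s: int = 15) -> dict[str, str]:
    """Build query parameters for a Prometheus /api/v1/query_range call."""
    return {
        "query": promql,
        "start": f"{start:.3f}",
        "end": f"{end:.3f}",
        "step": f"{step_s}s",
    }


start, end = run_stage(lambda: time.sleep(0.05))
params = query_range_params("vllm:kv_cache_usage_perc", start, end)
print(params)

# perf_counter() remains the better tool for measuring elapsed time,
# but its raw readings are not timestamps and cannot be sent to Prometheus.
t0 = time.perf_counter()
time.sleep(0.05)
print(f"elapsed: {time.perf_counter() - t0:.3f}s")
```

Keeping the two clocks separate (wall clock for query windows, monotonic clock for durations) avoids querying Prometheus for a time range that does not correspond to the stage.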
New Contributors
- @Shuwen-Fang made their first contribution in #197
- @pancak3 made their first contribution in #200
- @dublc made their first contribution in #208
- @huaxig made their first contribution in #212
- @rlakhtakia made their first contribution in #217
- @namasl made their first contribution in #226
Full Changelog: v0.1.1...v0.2.0
Docker Image
quay.io/inference-perf/inference-perf:v0.2.0
Python Package
pip install inference-perf==v0.2.0
v0.1.1
What's Changed
- Simplified local report storage by @SachinVarghese in #118
- Add basic Helm Chart by @jjk-g in #114
- Add support for api_key #116 by @andresC98 in #120
- feat: migrate scripts from Makefile to PDM scripts by @rudeigerc in #122
- Add Support for Querying Metrics from Google Managed Prometheus and Additional PromQL Filters now Configurable by @Bslabe123 in #121
- fix: improve Docker build workflow with better secret handling and debugging by @wangchen615 in #128
- fix: improve release workflow with proper changelog and Docker image handling by @wangchen615 in #129
- Add qps observability, fractional rates by @jjk-g in #125
- add config.md file to provide detail description for config.yml parameters by @liyuerich in #131
- Prometheus query fixes and examples update by @SachinVarghese in #130
- Add ability to specify datagen bounds for ShareGPT by @jjk-g in #137
- Add the ability to analyze reports and produce charts by @achandrasekar in #135
- update datasets type by @liyuerich in #136
- Fix a parsing issue with streaming requests by @achandrasekar in #140
- feat: add support for s3 storage by @omerap12 in #147
- Add detection for model_name and tokenizer by @jjk-g in #145
- update example config yml files by @liyuerich in #149
- Fix QPS accuracy at lower rates by @jjk-g in #143
- Update Helm chart by @jjk-g in #154
- Autocalc total_count by @jjk-g in #155
- fix: only assign tokenizer to model_name when not configured by @ExplorerRay in #160
- feat: Add Python package publishing to release workflow by @wangchen615 in #153
- Point config.yml to mounted configmap file by @Bslabe123 in #158
New Contributors
- @andresC98 made their first contribution in #120
- @rudeigerc made their first contribution in #122
- @liyuerich made their first contribution in #131
- @omerap12 made their first contribution in #147
- @ExplorerRay made their first contribution in #160
Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.1
Docker Image
quay.io/inference-perf/inference-perf:v0.1.1
Python Package
pip install inference-perf==v0.1.1
v0.1.0
We are excited to announce the initial release of Inference Perf v0.1.0! This release comes with the following key features:
- Highly scalable and can support benchmarking large inference production deployments by sending up to 10k QPS.
- Reports the key metrics needed to measure LLM performance.
- Supports different real world and synthetic datasets.
- Supports different APIs and can support multiple model servers.
- Supports specifying an exact input and output distribution to simulate different scenarios - Gaussian distribution, fixed length, min-max cases are all supported.
- Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
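To make the input and output distribution options above concrete, here is a minimal sketch of how a token length could be drawn from a Gaussian with min-max bounds, with a fixed length as the zero-variance special case. The function `sample_length` and its parameter names are illustrative assumptions, not inference-perf's configuration schema or implementation, and clamping to the bounds is just one possible way to enforce them.

```python
import random


def sample_length(rng: random.Random, mean: float, std_dev: float,
                  min_len: int, max_len: int) -> int:
    """Draw a token length from N(mean, std_dev) and clamp it to [min_len, max_len]."""
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    value = rng.gauss(mean, std_dev)  # std_dev == 0 always yields exactly `mean`
    return max(min_len, min(max_len, round(value)))


rng = random.Random(42)  # seeded so that runs are reproducible

# Gaussian input lengths bounded to [64, 1024] tokens.
input_lengths = [sample_length(rng, mean=512, std_dev=128, min_len=64, max_len=1024)
                 for _ in range(5)]
print(input_lengths)

# Fixed-length case: zero standard deviation.
print(sample_length(rng, mean=256, std_dev=0, min_len=1, max_len=2048))  # 256
```

Clamping piles probability mass onto the bounds when they sit close to the mean; resampling until the value falls in range is an alternative that avoids this at the cost of extra draws.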
What's Changed
- Add directory structure for the tool by @achandrasekar in #1
- Add Makefile and typecheck presubmit by @Bslabe123 in #3
- Fix Makefile Typo by @Bslabe123 in #6
- Add design document to the repo by @achandrasekar in #4
- Add default python gitignore by @sjmonson in #11
- Make Inference-Perf Package-able / Use Modern Python Tooling by @sjmonson in #13
- Added Abstract Type for Metrics Client by @Bslabe123 in #7
- Add Chen Wang to OWNERS by @terrytangyuan in #20
- Inference perf basic load run implementation by @SachinVarghese in #21
- Adding vLLM Client to inference perf runner by @SachinVarghese in #27
- Add HF ShareGPT Data Generator by @vivekk16 in #33
- Add Unit Testing Maketargets and Unit Testing Github Workflow by @Bslabe123 in #19
- Mock metrics client implementation by @SachinVarghese in #32
- Parameterization of CLI tool using config file by @SachinVarghese in #34
- Add SachinVarghese as approver, add owner aliases by @achandrasekar in #36
- Containerize the benchmark by @achandrasekar in #38
- Added demo example for vLLM Server and shareGPT datagen component by @SachinVarghese in #37
- Fix: Raising error for api type mismatch by @SachinVarghese in #44
- Add Custom Tokenizer by @vivekk16 in #43
- Multi-stage performance run by @SachinVarghese in #49
- Update README.md with meeting time / recording links by @achandrasekar in #54
- Add support for cluster-local benchmarking by @Bslabe123 in #60
- Update DataGenerator to Handle Both Chat and Completion APIs by @Bslabe123 in #58
- Lint and type check fixes by @SachinVarghese in #62
- Add StorageClient abstract type and GCS Client Implementation by @Bslabe123 in #61
- Added Prometheus client to get model server metrics by @aish1331 in #64
- Add support for different input distributions with a synthetic dataset by @achandrasekar in #66
- Automatically Populate Missing Fields in Config by @Bslabe123 in #71
- Generic model server client config by @SachinVarghese in #72
- Request Lifecycle Report Generation by @Bslabe123 in #77
- Add output distribution to synthetic data generator by @achandrasekar in #79
- Improved Logging for Writing Report Files by @Bslabe123 in #80
- Add the option to ignore end of sequence by @achandrasekar in #83
- Add GitHub Release Workflow and Changelog Configuration by @wangchen615 in #41
- Improved abstractions for perf project by @SachinVarghese in #84
- Add issue templates for the repo by @achandrasekar in #90
- docs: Update link to Slack channel in README.md by @terrytangyuan in #91
- Add random data generator by @achandrasekar in #94
- Multi-stage report generation for Prometheus Metrics by @aish1331 in #95
- Add Docker build and push workflows for PRs and releases by @wangchen615 in #97
- Add shared prefix generator to benchmark prefix caching by @achandrasekar in #98
- Added throughput metrics to output report by @Bslabe123 in #101
- Basic code test setup by @SachinVarghese in #96
- Enable Docker Build Workflow on Push to Main Branch by @wangchen615 in #102
- Fix Docker Tag Generation by Using env.QUAY_USERNAME in Workflow by @wangchen615 in #105
- Update Quay.io Organization Name in Docker Build Workflow by @wangchen615 in #106
- Add Support for Streaming Requests to Completions API by @Bslabe123 in #103
- Add multiprocess, multithreaded loadgen by @jjk-g in #99
- Update documentation to cover newer capabilities by @achandrasekar in #104
- Use logging methods with levels instead of print by @shotarok in #110
- Merge the latest fixes to the release branch by @achandrasekar in #115
New Contributors
- @achandrasekar made their first contribution in #1
- @Bslabe123 made their first contribution in #3
- @sjmonson made their first contribution in #11
- @terrytangyuan made their first contribution in #20
- @SachinVarghese made their first contribution in #21
- @vivekk16 made their first contribution in #33
- @aish1331 made their first contribution in #64
- @wangchen615 made their first contribution in #41
- @jjk-g made their first contribution in #99
- @shotarok made their first contribution in #110
Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.0