Releases: kubernetes-sigs/inference-perf

v0.5.0

01 May 19:07
e250731

Summary

Features

  • New UX components, including a TUI and CLI arguments
  • OTel trace replay support
  • Multi-report analysis
  • Distribution Sampling
  • Goodput metrics
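
The notes do not spell out how goodput is computed. As a rough illustration only, the sketch below assumes the common definition: the rate of successful requests that also meet a latency SLO. The `RequestResult` fields, the `goodput` helper, and the thresholds are hypothetical and are not taken from inference-perf.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Hypothetical per-request record; not the inference-perf data model."""
    ttft_s: float           # time to first token, in seconds
    total_latency_s: float  # end-to-end request latency, in seconds
    ok: bool                # whether the request completed without error


def goodput(results, duration_s, max_ttft_s, max_latency_s):
    """Return requests per second that succeeded AND met both SLO thresholds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    good = sum(
        1
        for r in results
        if r.ok and r.ttft_s <= max_ttft_s and r.total_latency_s <= max_latency_s
    )
    return good / duration_s


# Four requests over a 2-second window; only the first two meet the SLO.
results = [
    RequestResult(ttft_s=0.10, total_latency_s=1.0, ok=True),
    RequestResult(ttft_s=0.20, total_latency_s=1.5, ok=True),
    RequestResult(ttft_s=0.90, total_latency_s=1.2, ok=True),   # TTFT too slow
    RequestResult(ttft_s=0.10, total_latency_s=0.8, ok=False),  # failed request
]
rate = goodput(results, duration_s=2.0, max_ttft_s=0.5, max_latency_s=2.0)
print(rate)
```

Under these assumptions, plain throughput would count all four requests, while goodput counts only the two that a user would consider acceptable.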

Bug Fixes

  • Fixes for concurrency and multi-turn
  • vLLM metric parity
  • Multi-token chunk handling
  • Iterative convergence for prompt length accuracy

Improvements

  • Code coverage reporting
  • Added E2E and unit tests
  • Docs and guide updates

Full Changelog: v0.4.0...v0.5.0

Docker Image

quay.io/inference-perf/inference-perf:v0.5.0

Python Package

pip install inference-perf==v0.5.0

v0.4.0

06 Feb 22:48
e3e690b

This release contains several feature improvements and various bug fixes:

  • mTLS support in the vLLM client
  • Multi-LoRA support
  • E2E testing against inference sim
  • Multiple report analysis
  • New aliases for shared_prefix config fields
  • Dependency updates

Full Changelog: v0.3.0...v0.4.0

Docker Image

quay.io/inference-perf/inference-perf:v0.4.0

Python Package

pip install inference-perf==v0.4.0

v0.3.0

26 Nov 21:06
a85b31b

This release comes with some major improvements:

  • Trace-file-based load generation and testing
  • Support for benchmarking multi-turn chat scenarios
  • Shared client session to improve load generation performance
  • Improved Helm chart configurations for Kubernetes deployment
  • End-to-end tests in the CI/CD pipeline

What's Changed

  • Improve efficiency and readability of data generators by @pancak3 in #210
  • Make selected request rates accurate to two decimal places (formerly zero) when using linear sweep type by @Bslabe123 in #237
  • Add debug log for saturation sampling by @jjk-g in #236
  • ci: push helm chart to OCI registry when release by @ExplorerRay in #240
  • chore: add inter_token_latency in ModelServerMetrics for sglang metrics by @jlcoo in #242
  • use achieved_rate in the report graph. by @zetxqx in #232
  • Improve docker image building by @pancak3 in #228
  • feat: Enhance Helm chart flexibility for job by @LukeAVanDrie in #248
  • Catch saturation detection failure by @jjk-g in #251
  • Adding time per output tokens prometheus metrics for sglang server by @SachinVarghese in #254
  • feat: loadgen SIGINT handler by @changminbark in #244
  • Feat: Add request timeouts and circuit breakers (#148) by @huaxig in #227
  • Added PrometheusMetric Implementations by @Bslabe123 in #221
  • Workflow that currently pushes Docker image now also pushes Helm chart by @Bslabe123 in #259
  • Fix for Invalid Chart Version by @Bslabe123 in #261
  • Add jjk-g to maintainers by @achandrasekar in #267
  • Update helm chart to pass in gcs bucket to download datasets. by @rlakhtakia in #260
  • Fixing test and validate workflows by @SachinVarghese in #272
  • publish-on-change workflow should use helm client login instead of docker login by @Bslabe123 in #264
  • Add Kubecon Demo results by @Bslabe123 in #224
  • docs: clarify authentication needed for querying metrics from GMP by @Bslabe123 in #276
  • Update vLLM kv cache metric from vllm:gpu_cache_usage_perc to vllm:kv_cache_usage_perc by @Bslabe123 in #277
  • Update helm to add service account name by @rlakhtakia in #270
  • Update helm chart to pull datasets from s3 bucket. by @rlakhtakia in #278
  • Trace load gen by @aish1331 in #198
  • fix: stabilize streaming responses for large chunk using iter_any() by @zetxqx in #284
  • fix: custom tokenizer truncates inputs to model max input length by @changminbark in #266
  • [Testing / CI/CD] Ability to automate scale testing with a mock server and test different datasets, loadgen, etc. and run it as a part of CI/CD (#274) by @huaxig in #274
  • Update helm to pass in existing kubernetes secret. by @rlakhtakia in #281
  • Loadgen concurrent load type by @changminbark in #263
  • Improve MultiprocessRequestDataCollector async by @diamondburned in #280
  • update gcs bucket to pass in bucket name only for consistency by @rlakhtakia in #285
  • Feat: Add user session to support Multi-turn chat (#179) by @huaxig in #257
  • fix pyproject dependency groups and TOML parsing issue by @diamondburned in #291
  • Fix overflow on tokenizer truncation by @jjk-g in #290
  • chore: improve openai client error handling by including status code and reason by @hhk7734 in #289
  • Fix: requests get duplicated using shared_prefix datagen when multi-turn chat disabled by @huaxig in #293
  • Share aiohttp.ClientSessions per worker by @diamondburned in #282
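
The linear sweep change listed above (#237) concerns how many decimal places the selected request rates keep. The following is a minimal sketch of why that matters, not the project's implementation; `linear_sweep` and its parameters are hypothetical names chosen for illustration.

```python
def linear_sweep(start, end, num_stages, decimals=2):
    """Return evenly spaced request rates from start to end, rounded to `decimals` places.

    Hypothetical helper for illustration; not the inference-perf API.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if num_stages == 1:
        return [round(end, decimals)]
    step = (end - start) / (num_stages - 1)
    return [round(start + i * step, decimals) for i in range(num_stages)]


# Two decimal places keep the four stages distinct.
rates = linear_sweep(1.0, 2.0, 4)
print(rates)

# Zero decimal places collapse neighbouring stages onto the same rate.
coarse = linear_sweep(1.0, 2.0, 4, decimals=0)
print(coarse)
```

With a narrow sweep range, rounding to whole numbers makes several stages run at the same rate, so the extra precision is what keeps each stage meaningful.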

Full Changelog: v0.2.0...v0.3.0

Docker Image

quay.io/inference-perf/inference-perf:v0.3.0

Python Package

pip install inference-perf==v0.3.0

v0.2.0

24 Sep 16:59
1ccc48b

This release comes with some major improvements:

  • Improved default concurrency and multi-process CPU utilization, with extensive scale testing to verify that the latency values reported at high concurrency (up to 10k QPS) are accurate
  • Enhanced support for SGLang and TGI with model server metrics
  • New dataset support for summarization (CNN Dailymail), prefill-heavy (Billsum Conversations) and decode-heavy (Instruct Infinity) use cases
  • Automatic sweep of request rates until saturation
  • Observability improvements around load generation and the ability to monitor scheduling delay and achieved rate
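
The notes do not define scheduling delay or achieved rate. As a sketch under assumed definitions, scheduling delay is taken here as the gap between when a request was meant to be sent and when it was actually sent, and achieved rate as the number of send intervals divided by the elapsed send time. The function names and timestamps below are hypothetical.

```python
def scheduling_delays(scheduled_s, actual_s):
    """Per-request delay (seconds) between planned and actual send times."""
    if len(scheduled_s) != len(actual_s):
        raise ValueError("timestamp lists must have the same length")
    return [actual - planned for planned, actual in zip(scheduled_s, actual_s)]


def achieved_rate(actual_s):
    """Requests per second actually sent, measured over the send window."""
    if len(actual_s) < 2:
        return 0.0
    elapsed = max(actual_s) - min(actual_s)
    return (len(actual_s) - 1) / elapsed if elapsed > 0 else 0.0


# Target: one request every 0.5 s (2 QPS). The load generator falls behind.
scheduled = [0.0, 0.5, 1.0, 1.5, 2.0]
actual = [0.0, 0.5, 1.25, 1.75, 2.5]

delays = scheduling_delays(scheduled, actual)
rate = achieved_rate(actual)
print(delays, rate)
```

A growing scheduling delay together with an achieved rate below the target suggests that the load generator, rather than the model server, is the bottleneck.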

Full Changelog: v0.1.1...v0.2.0

Docker Image

quay.io/inference-perf/inference-perf:v0.2.0

Python Package

pip install inference-perf==v0.2.0

v0.1.1

01 Aug 22:43
fd21242

Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.1

Docker Image

quay.io/inference-perf/inference-perf:v0.1.1

Python Package

pip install inference-perf==v0.1.1

v0.1.0

26 Jun 18:59

We are excited to announce the initial release of Inference Perf v0.1.0! This release comes with the following key features:

  • Highly scalable: supports benchmarking large production inference deployments by sending up to 10k QPS.
  • Reports the key metrics needed to measure LLM performance.
  • Supports different real world and synthetic datasets.
  • Supports different APIs and can support multiple model servers.
  • Supports specifying an exact input and output distribution to simulate different scenarios: Gaussian distribution, fixed length, and min-max cases are all supported.
  • Generates different load patterns and can benchmark specific cases like burst traffic, scaling to saturation and other autoscaling / routing scenarios.
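
The input and output distribution options mentioned above (Gaussian, fixed length, min-max) can be pictured with one small sampler. This is an illustrative sketch using only the Python standard library, not the tool's implementation; `sample_token_length` and its parameters are hypothetical names.

```python
import random


def sample_token_length(mean, std_dev, min_len, max_len, rng):
    """Draw a token length from a Gaussian and clamp it to [min_len, max_len].

    Hypothetical helper for illustration; not the inference-perf API.
    """
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    value = round(rng.gauss(mean, std_dev))
    return max(min_len, min(max_len, value))


rng = random.Random(0)  # seeded so that runs are repeatable

# Gaussian case: lengths centred on 512 tokens, bounded to [16, 1024].
input_lengths = [sample_token_length(512, 128, 16, 1024, rng) for _ in range(1000)]

# Fixed-length case: zero spread with min == max always yields the same value.
fixed_lengths = [sample_token_length(256, 0, 256, 256, rng) for _ in range(5)]

print(min(input_lengths), max(input_lengths), fixed_lengths)
```

Clamping is one simple way to honour the min-max bounds; it piles a little extra probability mass onto the bounds, so resampling out-of-range draws is a reasonable alternative when that matters.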

Full Changelog: https://github.com/kubernetes-sigs/inference-perf/commits/v0.1.0