Commit d446c56

refactor benchmark/index.md changes

1 parent 05b8375

File tree

2 files changed: +199 -37 lines changed


site-src/performance/benchmark/index.md

Lines changed: 37 additions & 22 deletions
@@ -1,45 +1,49 @@
# Benchmark

-This user guide shows how to run benchmarks against a vLLM deployment, by using both the Gateway API
-inference extension, and a Kubernetes service as the load balancing strategy. The
-benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
-tool to generate load and collect results.
+This user guide shows how to run benchmarks against a vLLM model server deployment by using both Gateway API
+Inference Extension, and a Kubernetes service as the load balancing strategy. The benchmark uses the
+[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
+load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
+to deploy the vLLM model server, CRDs, etc.
+
+__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.

### [Optional] Scale the sample vLLM deployment

-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.

```bash
-kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
+kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
```

### Expose the model server via a k8s service

-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To establish a baseline, expose the vLLM deployment as a k8s service:

```bash
-kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
+kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
```

## Run benchmark

-The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
+The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting the results.
+Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
+parallel against different targets.

1. Check out the repo.
-
+
```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension
```

-1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service.
+1. Get the target IP. The examples below shows how to get the IP of a gateway or a k8s service.

```bash
# Get gateway IP
@@ -51,32 +55,43 @@ The LPG benchmark tool works by sending traffic to the specified target IP and p
echo $SVC_IP
```

-1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as request_rates as well. For a complete list of LPG configurations, pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
+1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
+Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
+[LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).

-1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`
+1. Start the benchmark tool.

-1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable
-to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`,
-the script below will watch for that log line and then start downloading results.
+```bash
+kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
+```
+
+1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
+benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print
+a log line `LPG_FINISHED`. The script below will watch for that log line and then start downloading results.

```bash
-benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
```
-1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).
+
+After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
+Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.

### Tips

+* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
+updated accordingly to analyze the results.
* You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
This is useful when you run benchmarks multiple times to get a more statistically meaningful results and group the results accordingly.
* Update the `request_rates` that best suit your benchmark environment.

### Advanced Benchmark Configurations

-Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
+Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
+detailed list of configuration knobs.

## Analyze the results

-This guide shows how to run the jupyter notebook using vscode.
+This guide shows how to run the jupyter notebook using vscode after completing k8s service and inference extension benchmarks.

1. Create a python virtual environment.

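Putting the updated index.md steps together end to end, here is a minimal sketch of one benchmark pass against the k8s service baseline. The `sed` substitution is only an illustrative shortcut for the "update the `<target-ip>`" step (editing the file by hand works just as well), and it assumes `$SVC_IP` and `$GW_IP` were captured in the earlier "Get the target IP" step:

```bash
# Point the LPG manifest at the k8s service baseline.
# (Illustrative shortcut; you can also edit benchmark.yaml by hand.)
sed -i "s/<target-ip>/${SVC_IP}/" ./config/manifests/benchmark/benchmark.yaml

# Start the LPG benchmark job.
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml

# Wait for the LPG_FINISHED log line and pull the results down under this id.
benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash

# Repeat the same three steps with ${GW_IP} and benchmark_id='inference-extension'
# to produce the second data set that the notebook compares against.
```
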
tools/benchmark/benchmark.ipynb

Lines changed: 162 additions & 15 deletions
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
-"execution_count": 26,
+"execution_count": null,
"metadata": {
"executionInfo": {
"elapsed": 391,
@@ -21,16 +21,17 @@
"#@title Configuration. Edit this before running the rest.\n",
"\n",
"OUTPUT_DIR='output'\n",
-"RUN_ID='default-run'\n",
+"RUN_ID='example-run'\n",
"# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n",
"BENCHMARK_DIR =\"./\"\n",
"# A regex to match the model name, which matches the output file name.\n",
-"MODEL_MATCHER='.*llama.*'"
+"MODEL_MATCHER='.*llama.*'\n",
+"INTERACTIVE_PLOT='False'"
]
},
{
"cell_type": "code",
-"execution_count": 27,
+"execution_count": null,
"metadata": {
"executionInfo": {
"elapsed": 33,
@@ -55,6 +56,7 @@
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import math\n",
+"from sklearn.metrics import r2_score\n",
"import logging\n",
"level = logging.INFO\n",
"logger = logging.getLogger(__name__)\n",
@@ -82,11 +84,11 @@
" XY(x = 'request_rate', x_label = 'QPS', y = 'output_tokens_per_min'),\n",
" XY(x = \"request_rate\", x_label = 'QPS', y = \"p90_per_output_token_latency\"),\n",
" XY(x = \"request_rate\", x_label = 'QPS', y = \"p90_latency\"),\n",
+" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_attempted\"),\n",
+" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_succeeded\"),\n",
"]\n",
"SANITY_CHECK_METRICS = [\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'benchmark_time'),\n",
-" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_attempted\"),\n",
-" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_succeeded\"),\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'throughput_rps'),\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'total_input_tokens'),\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'total_output_token'),\n",
@@ -110,6 +112,8 @@
" self.interactive = interactive\n",
" self.annotate = annotate\n",
" self.output_dir = output_dir\n",
+" self.data = load_data(self.labels, self.run_id, self.output_dir)\n",
+" self.groups = group_data(self.data, self.metrics)\n",
"\n",
" def withRunId(self, run_id):\n",
" return Plotter(run_id, self.labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, self.output_dir)\n",
@@ -124,10 +128,16 @@
" return Plotter(self.run_id, self.labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, output_dir)\n",
"\n",
" def plot_bar(self):\n",
-" data = load_data(self.labels, self.run_id, self.output_dir)\n",
-" groups = group_data(data, self.metrics)\n",
+" \n",
" logger.debug(\"Plotting run id...\")\n",
-" plot_bar(self.labels, groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n",
+" plot_bar(self.labels, self.groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n",
+"\n",
+" def plot_delta(self):\n",
+" \"\"\"\n",
+" Plot the delta between two labels.\n",
+" \"\"\"\n",
+" logger.debug(\"Plotting delta for run id...\")\n",
+" plot_delta(self.labels, self.groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n",
"\n",
"def filepaths(root_dir):\n",
" \"\"\"\n",
@@ -201,6 +211,27 @@
" groups = data.groupby(by=['label'],sort=True)\n",
" return groups\n",
"\n",
+"def compute_r2_for_metrics(groups, metrics, label_before, label_after):\n",
+" print(\"\\nCoefficient of Determination (R^2) between before and after runs:\")\n",
+" for m in metrics:\n",
+" try:\n",
+" df_b = groups.get_group(label_before).set_index('request_rate')\n",
+" df_a = groups.get_group(label_after).set_index('request_rate')\n",
+" except KeyError:\n",
+" print(f\" Skipping {m.y}: missing group data for '{label_before}' or '{label_after}'\")\n",
+" continue\n",
+" common = sorted(set(df_b.index).intersection(df_a.index))\n",
+" yb = df_b.loc[common, m.y].values\n",
+" ya = df_a.loc[common, m.y].values\n",
+" mask = ~np.isnan(yb) & ~np.isnan(ya)\n",
+" yb, ya = yb[mask], ya[mask]\n",
+" if len(yb) > 1 and np.any(yb != 0):\n",
+" r2 = r2_score(yb, ya)\n",
+" print(f\" {m.y:<30} R^2 = {r2:.4f}\")\n",
+" else:\n",
+" print(f\" {m.y:<30} insufficient data for R^2 calculation\")\n",
+"\n",
+"\n",
"def init_plot(metrics, num_plots_per_row=NUM_PLOTS_PER_ROW):\n",
" num_plots_per_row = min(num_plots_per_row, len(metrics))\n",
" row_count = math.ceil(len(metrics) / num_plots_per_row)\n",
@@ -229,7 +260,7 @@
" plot_func(curAx, m)\n",
" return fig, axes\n",
"\n",
-"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
+"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=INTERACTIVE_PLOT, annotate=False):\n",
" labels = [label.alias for label in labels]\n",
" logger.debug(f'Prnting bar chart for {labels}')\n",
" logger.debug(f'groups: {groups}')\n",
@@ -294,7 +325,106 @@
" fig, axes = plot_metrics(metrics, plot_func, num_plots_per_row)\n",
" fig.tight_layout(rect=[0, 0.03, 1, 0.95])\n",
" plt.show()\n",
-"\n"
+"\n",
+"def plot_delta(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=True, annotate=False):\n",
+" \"\"\"\n",
+" Plot the delta between base_label and compare_label for each metric.\n",
+" A positive delta means compare_label has a higher value than base_label.\n",
+" \"\"\"\n",
+" base_label = labels[0].name\n",
+" compare_label = labels[1].name\n",
+" logger.debug(f'Printing delta chart for {base_label} vs {compare_label}')\n",
+"\n",
+" try:\n",
+" base_df = groups.get_group((base_label,))\n",
+" compare_df = groups.get_group((compare_label,))\n",
+" except Exception as e:\n",
+" logger.error(f\"Error getting data for labels {base_label} and {compare_label}: {e}\")\n",
+" return\n",
+"\n",
+" y_columns = [m.y for m in metrics]\n",
+"\n",
+" # 1. Find common request rates\n",
+" base_rates = set(base_df['request_rate'].astype(int))\n",
+" compare_rates = set(compare_df['request_rate'].astype(int))\n",
+" common_rates = sorted(list(base_rates.intersection(compare_rates)))[:6]\n",
+"\n",
+" if not common_rates:\n",
+" logger.error(f\"No common request rates found between {base_label} and {compare_label}\")\n",
+" return\n",
+"\n",
+" # 2. Prepare data for delta calculation\n",
+" base_data = base_df.set_index('request_rate').to_dict()\n",
+" compare_data = compare_df.set_index('request_rate').to_dict()\n",
+"\n",
+" # Calculate deltas (compare_label - base_label)\n",
+" delta_data = {y_col: {} for y_col in y_columns}\n",
+" for y_col in y_columns:\n",
+" for rate in common_rates:\n",
+" base_val = base_data.get(y_col, {}).get(rate, np.nan)\n",
+" compare_val = compare_data.get(y_col, {}).get(rate, np.nan)\n",
+"\n",
+" if not np.isnan(base_val) and not np.isnan(compare_val):\n",
+" delta_data[y_col][rate] = (compare_val - base_val)/base_val*100\n",
+" else:\n",
+" delta_data[y_col][rate] = np.nan\n",
+"\n",
+" # 3. Plotting\n",
+" def plot_func(curAx, m):\n",
+" x = np.arange(len(common_rates))\n",
+" y_values = [delta_data[m.y].get(rr, np.nan) for rr in common_rates]\n",
+"\n",
+" # Determine colors based on positive/negative values\n",
+" colors = ['green' if val > 0 else 'blue' for val in y_values]\n",
+"\n",
+" rects = curAx.bar(x, y_values, 0.6, color=colors)\n",
+"\n",
+" # Add a horizontal line at y=0\n",
+" curAx.axhline(y=0, color='black', linestyle='-', linewidth=1)\n",
+"\n",
+" if annotate:\n",
+" for rect, val in zip(rects, y_values):\n",
+" if not np.isnan(val):\n",
+" height = rect.get_height()\n",
+" # For negative bars, put text above the bar\n",
+" vert_align = 'bottom' if val >= 0 else 'top'\n",
+" y_offset = 3 if val >= 0 else -3\n",
+"\n",
+" curAx.annotate(f'{val:.2f}',\n",
+"                xy=(rect.get_x() + rect.get_width() / 2, val),\n",
+"                xytext=(0, y_offset), # vertical offset\n",
+"                textcoords=\"offset points\",\n",
+"                ha='center', va=vert_align)\n",
+"\n",
+" # Create a title that shows what this delta represents\n",
+" title = f\"Delta: {compare_label} - {base_label} ({m.y})\"\n",
+" curAx.set_title(title, fontsize=12)\n",
+"\n",
+" # Add labels\n",
+" curAx.set_xlabel(m.x_label, fontsize=axis_label_fontsize)\n",
+" #curAx.set_ylabel(f\"% Delta in {m.y_label}\", fontsize=axis_label_fontsize)\n",
+" curAx.set_xticks(x)\n",
+" curAx.set_xticklabels(common_rates)\n",
+" curAx.tick_params(axis='both', labelsize=tick_label_fontsize)\n",
+"\n",
+" # Create a dummy handle for the legend\n",
+" legend_handle = [plt.Rectangle((0,0),1,1,color='green'),\n",
+"                  plt.Rectangle((0,0),1,1,color='blue')]\n",
+" legend_label = [f'{compare_label} > {base_label}',\n",
+"                 f'{compare_label} < {base_label}']\n",
+"\n",
+" return legend_handle, legend_label\n",
+"\n",
+" # Create plot with metrics\n",
+" fig, axes = plot_metrics(metrics, plot_func, num_plots_per_row)\n",
+"\n",
+" # Add an overall title for the figure\n",
+" fig.suptitle(f\"% Delta Metrics: {compare_label} - {base_label}\",\n",
+"              fontsize=title_fontsize, y=0.98)\n",
+"\n",
+" plt.subplots_adjust(bottom=0.15, top=0.9) # Make room for legends\n",
+" fig.tight_layout(rect=[0, 0.1, 1, 0.95]) # Adjust the rectangle in which the subplots fit\n",
+" plt.show()"
]
},
{
@@ -320,9 +450,26 @@
"outputs": [],
"source": [
"#@title Plot Result\n",
-"\n",
-"pl = Plotter(run_id=RUN_ID, labels=[Label('inference-extension'),Label('k8s-svc')], output_dir=OUTPUT_DIR)\n",
-"pl.plot_bar()"
+"# initialize the plotter with the run id and labels. \n",
+"# Example labels are 'inference-extension' and 'k8s-svc' if comparing Inference Extension and K8s Service \n",
+"# 'regression-before' and 'regression-after' if comparing two different runs of inference extension to see the regression\n",
+"\n",
+"benchmark_id1 = <ID1> # eg 'regression-before' or 'inference-extension'\n",
+"benchmark_id2 = <ID2> # eg 'regression-after' or 'k8s-svc'\n",
+"labels = [Label(benchmark_id1), Label(benchmark_id2,)]\n",
+"\n",
+"# Plot bar chart of metrics\n",
+"pl = Plotter(run_id=RUN_ID, labels=labels, output_dir=OUTPUT_DIR)\n",
+"pl.plot_bar()\n",
+"pl.plot_delta()\n",
+"\n",
+"# Load & group data to compute R^2\n",
+"all_data = load_data(labels, RUN_ID, OUTPUT_DIR)\n",
+"groups = group_data(all_data)\n",
+"compute_r2_for_metrics(groups, CORE_METRICS,\n",
+"              label_before=benchmark_id1,\n",
+"              label_after=benchmark_id2)\n",
+"\n"
]
}
],
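The two analysis helpers added above boil down to simple formulas. As a sketch of the math the new cells implement (using the code's own naming), `plot_delta` plots a percentage change per request rate and `compute_r2_for_metrics` scores how closely the "after" series tracks the "before" series:

```latex
% Percentage delta plotted per request rate r in plot_delta:
\Delta_{\%}(r) = \frac{y_{\mathrm{compare}}(r) - y_{\mathrm{base}}(r)}{y_{\mathrm{base}}(r)} \times 100

% r2_score(yb, ya) in compute_r2_for_metrics treats the "before" series as the
% reference and the "after" series as the prediction:
R^2 = 1 - \frac{\sum_r \bigl(y_{\mathrm{before}}(r) - y_{\mathrm{after}}(r)\bigr)^2}{\sum_r \bigl(y_{\mathrm{before}}(r) - \bar{y}_{\mathrm{before}}\bigr)^2}
```

An R^2 close to 1 means the two runs produced nearly identical metric curves across the common request rates.
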
@@ -355,4 +502,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
-}
+}
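
Because the notebook now imports `r2_score` from scikit-learn, the analysis environment described in "Analyze the results" needs that package in addition to the usual plotting stack. A minimal setup sketch, where the exact package list is an assumption inferred from the notebook's imports:

```bash
# Create and activate a virtual environment for the notebook.
python3 -m venv .venv
source .venv/bin/activate

# Assumed dependency list based on the notebook's imports
# (pandas for groupby, matplotlib/numpy for plotting, scikit-learn for r2_score).
pip install jupyter pandas numpy matplotlib scikit-learn

# Open tools/benchmark/benchmark.ipynb in vscode (or `jupyter notebook`)
# and select the .venv interpreter as the kernel.
```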
