Commit d446c56

refactor benchmark/index.md changes

1 parent 05b8375

File tree

2 files changed: +199 -37 lines changed


site-src/performance/benchmark/index.md

Lines changed: 37 additions & 22 deletions
@@ -1,45 +1,49 @@
# Benchmark

-This user guide shows how to run benchmarks against a vLLM deployment, by using both the Gateway API
-inference extension, and a Kubernetes service as the load balancing strategy. The
-benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG)
-tool to generate load and collect results.
+This user guide shows how to run benchmarks against a vLLM model server deployment by using both Gateway API
+Inference Extension, and a Kubernetes service as the load balancing strategy. The benchmark uses the
+[Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) tool to generate
+load and collect results.

## Prerequisites

### Deploy the inference extension and sample model server

-Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the
-sample vLLM application, and the inference extension.
+Follow the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/#getting-started-with-gateway-api-inference-extension)
+to deploy the vLLM model server, CRDs, etc.
+
+__Note:__ Only the GPU-based model server deployment option is supported for benchmark testing.

### [Optional] Scale the sample vLLM deployment

-You will more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.
+You are more likely to see the benefits of the inference extension when there are a decent number of replicas to make the optimal routing decision.

```bash
-kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
+kubectl scale deployment vllm-llama3-8b-instruct --replicas=8
```

### Expose the model server via a k8s service

-As the baseline, let's also expose the vLLM deployment as a k8s service:
+To establish a baseline, expose the vLLM deployment as a k8s service:

```bash
-kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer
+kubectl expose deployment vllm-llama3-8b-instruct --port=80 --target-port=8000 --type=LoadBalancer
```

## Run benchmark

-The LPG benchmark tool works by sending traffic to the specified target IP and port, and collect results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets.
+The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting the results.
+Follow the steps below to run a single benchmark. Multiple LPG instances can be deployed to run benchmarks in
+parallel against different targets.

1. Check out the repo.
-
+
```bash
git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension
cd gateway-api-inference-extension
```

-1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service.
+1. Get the target IP. The examples below shows how to get the IP of a gateway or a k8s service.

```bash
# Get gateway IP
@@ -51,32 +55,43 @@ The LPG benchmark tool works by sending traffic to the specified target IP and p
echo $SVC_IP
```

-1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to your target IP. Feel free to adjust other parameters such as request_rates as well. For a complete list of LPG configurations, pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).
+1. Then update the `<target-ip>` in `./config/manifests/benchmark/benchmark.yaml` to the value of `$SVC_IP` or `$GW_IP`.
+Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, refer to the
+[LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark).

-1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml`
+1. Start the benchmark tool.

-1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable
-to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`,
-the script below will watch for that log line and then start downloading results.
+```bash
+kubectl apply -f ./config/manifests/benchmark/benchmark.yaml
+```
+
+1. Wait for benchmark to finish and download the results. Use the `benchmark_id` environment variable to specify what this
+benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print
+a log line `LPG_FINISHED`. The script below will watch for that log line and then start downloading results.

```bash
-benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash
+benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash
```
-1. After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json).
+
+After the script finishes, you should see benchmark results under `./tools/benchmark/output/default-run/k8s-svc/results/json` folder.
+Here is a [sample json file](./sample.json). Replace `k8s-svc` with `inference-extension` when running an inference extension benchmark.

### Tips

+* When using a `benchmark_id` other than `k8s-svc` or `inference-extension`, the labels in `./tools/benchmark/benchmark.ipynb` must be
+updated accordingly to analyze the results.
* You can specify `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script.
This is useful when you run benchmarks multiple times to get a more statistically meaningful results and group the results accordingly.
* Update the `request_rates` that best suit your benchmark environment.

### Advanced Benchmark Configurations

-Pls refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs.
+Refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a
+detailed list of configuration knobs.

## Analyze the results

-This guide shows how to run the jupyter notebook using vscode.
+This guide shows how to run the jupyter notebook using vscode after completing k8s service and inference extension benchmarks.

1. Create a python virtual environment.

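Putting the updated index.md steps together end to end, here is a minimal sketch of one benchmark pass against the k8s service baseline. The `sed` substitution is only an illustrative shortcut for the "update the `<target-ip>`" step (editing the file by hand works just as well), and it assumes `$SVC_IP` and `$GW_IP` were captured in the earlier "Get the target IP" step:

```bash
# Point the LPG manifest at the k8s service baseline.
# (Illustrative shortcut; you can also edit benchmark.yaml by hand.)
sed -i "s/<target-ip>/${SVC_IP}/" ./config/manifests/benchmark/benchmark.yaml

# Start the LPG benchmark job.
kubectl apply -f ./config/manifests/benchmark/benchmark.yaml

# Wait for the LPG_FINISHED log line and pull the results down under this id.
benchmark_id='k8s-svc' ./tools/benchmark/download-benchmark-results.bash

# Repeat the same three steps with ${GW_IP} and benchmark_id='inference-extension'
# to produce the second data set that the notebook compares against.
```
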
tools/benchmark/benchmark.ipynb

Lines changed: 162 additions & 15 deletions
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
-"execution_count": 26,
+"execution_count": null,
"metadata": {
"executionInfo": {
"elapsed": 391,
@@ -21,16 +21,17 @@
"#@title Configuration. Edit this before running the rest.\n",
"\n",
"OUTPUT_DIR='output'\n",
-"RUN_ID='default-run'\n",
+"RUN_ID='example-run'\n",
"# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n",
"BENCHMARK_DIR =\"./\"\n",
"# A regex to match the model name, which matches the output file name.\n",
-"MODEL_MATCHER='.*llama.*'"
+"MODEL_MATCHER='.*llama.*'\n",
+"INTERACTIVE_PLOT='False'"
]
},
{
"cell_type": "code",
-"execution_count": 27,
+"execution_count": null,
"metadata": {
"executionInfo": {
"elapsed": 33,
@@ -55,6 +56,7 @@
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import math\n",
+"from sklearn.metrics import r2_score\n",
"import logging\n",
"level = logging.INFO\n",
"logger = logging.getLogger(__name__)\n",
@@ -82,11 +84,11 @@
" XY(x = 'request_rate', x_label = 'QPS', y = 'output_tokens_per_min'),\n",
" XY(x = \"request_rate\", x_label = 'QPS', y = \"p90_per_output_token_latency\"),\n",
" XY(x = \"request_rate\", x_label = 'QPS', y = \"p90_latency\"),\n",
+" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_attempted\"),\n",
+" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_succeeded\"),\n",
"]\n",
"SANITY_CHECK_METRICS = [\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'benchmark_time'),\n",
-" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_attempted\"),\n",
-" XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_succeeded\"),\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'throughput_rps'),\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'total_input_tokens'),\n",
" XY(x = 'request_rate', x_label = 'QPS', y = 'total_output_token'),\n",
@@ -110,6 +112,8 @@
" self.interactive = interactive\n",
" self.annotate = annotate\n",
" self.output_dir = output_dir\n",
+" self.data = load_data(self.labels, self.run_id, self.output_dir)\n",
+" self.groups = group_data(self.data, self.metrics)\n",
"\n",
" def withRunId(self, run_id):\n",
" return Plotter(run_id, self.labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, self.output_dir)\n",
@@ -124,10 +128,16 @@
" return Plotter(self.run_id, self.labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, output_dir)\n",
"\n",
" def plot_bar(self):\n",
-" data = load_data(self.labels, self.run_id, self.output_dir)\n",
-" groups = group_data(data, self.metrics)\n",
+" \n",
" logger.debug(\"Plotting run id...\")\n",
-" plot_bar(self.labels, groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n",
+" plot_bar(self.labels, self.groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n",
+"\n",
+" def plot_delta(self):\n",
+" \"\"\"\n",
+" Plot the delta between two labels.\n",
+" \"\"\"\n",
+" logger.debug(\"Plotting delta for run id...\")\n",
+" plot_delta(self.labels, self.groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n",
"\n",
"def filepaths(root_dir):\n",
" \"\"\"\n",
@@ -201,6 +211,27 @@
" groups = data.groupby(by=['label'],sort=True)\n",
" return groups\n",
"\n",
+"def compute_r2_for_metrics(groups, metrics, label_before, label_after):\n",
+" print(\"\\nCoefficient of Determination (R^2) between before and after runs:\")\n",
+" for m in metrics:\n",
+" try:\n",
+" df_b = groups.get_group(label_before).set_index('request_rate')\n",
+" df_a = groups.get_group(label_after).set_index('request_rate')\n",
+" except KeyError:\n",
+" print(f\" Skipping {m.y}: missing group data for '{label_before}' or '{label_after}'\")\n",
+" continue\n",
+" common = sorted(set(df_b.index).intersection(df_a.index))\n",
+" yb = df_b.loc[common, m.y].values\n",
+" ya = df_a.loc[common, m.y].values\n",
+" mask = ~np.isnan(yb) & ~np.isnan(ya)\n",
+" yb, ya = yb[mask], ya[mask]\n",
+" if len(yb) > 1 and np.any(yb != 0):\n",
+" r2 = r2_score(yb, ya)\n",
+" print(f\" {m.y:<30} R^2 = {r2:.4f}\")\n",
+" else:\n",
+" print(f\" {m.y:<30} insufficient data for R^2 calculation\")\n",
+"\n",
+"\n",
"def init_plot(metrics, num_plots_per_row=NUM_PLOTS_PER_ROW):\n",
" num_plots_per_row = min(num_plots_per_row, len(metrics))\n",
" row_count = math.ceil(len(metrics) / num_plots_per_row)\n",
@@ -229,7 +260,7 @@
" plot_func(curAx, m)\n",
" return fig, axes\n",
"\n",
-"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
+"def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=INTERACTIVE_PLOT, annotate=False):\n",
" labels = [label.alias for label in labels]\n",
" logger.debug(f'Prnting bar chart for {labels}')\n",
" logger.debug(f'groups: {groups}')\n",
@@ -294,7 +325,106 @@
" fig, axes = plot_metrics(metrics, plot_func, num_plots_per_row)\n",
" fig.tight_layout(rect=[0, 0.03, 1, 0.95])\n",
" plt.show()\n",
-"\n"
+"\n",
+"def plot_delta(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=True, annotate=False):\n",
+" \"\"\"\n",
+" Plot the delta between base_label and compare_label for each metric.\n",
+" A positive delta means compare_label has a higher value than base_label.\n",
+" \"\"\"\n",
+" base_label = labels[0].name\n",
+" compare_label = labels[1].name\n",
+" logger.debug(f'Printing delta chart for {base_label} vs {compare_label}')\n",
+"\n",
+" try:\n",
+" base_df = groups.get_group((base_label,))\n",
+" compare_df = groups.get_group((compare_label,))\n",
+" except Exception as e:\n",
+" logger.error(f\"Error getting data for labels {base_label} and {compare_label}: {e}\")\n",
+" return\n",
+"\n",
+" y_columns = [m.y for m in metrics]\n",
+"\n",
+" # 1. Find common request rates\n",
+" base_rates = set(base_df['request_rate'].astype(int))\n",
+" compare_rates = set(compare_df['request_rate'].astype(int))\n",
+" common_rates = sorted(list(base_rates.intersection(compare_rates)))[:6]\n",
+"\n",
+" if not common_rates:\n",
+" logger.error(f\"No common request rates found between {base_label} and {compare_label}\")\n",
+" return\n",
+"\n",
+" # 2. Prepare data for delta calculation\n",
+" base_data = base_df.set_index('request_rate').to_dict()\n",
+" compare_data = compare_df.set_index('request_rate').to_dict()\n",
+"\n",
+" # Calculate deltas (compare_label - base_label)\n",
+" delta_data = {y_col: {} for y_col in y_columns}\n",
+" for y_col in y_columns:\n",
+" for rate in common_rates:\n",
+" base_val = base_data.get(y_col, {}).get(rate, np.nan)\n",
+" compare_val = compare_data.get(y_col, {}).get(rate, np.nan)\n",
+"\n",
+" if not np.isnan(base_val) and not np.isnan(compare_val):\n",
+" delta_data[y_col][rate] = (compare_val - base_val)/base_val*100\n",
+" else:\n",
+" delta_data[y_col][rate] = np.nan\n",
+"\n",
+" # 3. Plotting\n",
+" def plot_func(curAx, m):\n",
+" x = np.arange(len(common_rates))\n",
+" y_values = [delta_data[m.y].get(rr, np.nan) for rr in common_rates]\n",
+"\n",
+" # Determine colors based on positive/negative values\n",
+" colors = ['green' if val > 0 else 'blue' for val in y_values]\n",
+"\n",
+" rects = curAx.bar(x, y_values, 0.6, color=colors)\n",
+"\n",
+" # Add a horizontal line at y=0\n",
+" curAx.axhline(y=0, color='black', linestyle='-', linewidth=1)\n",
+"\n",
+" if annotate:\n",
+" for rect, val in zip(rects, y_values):\n",
+" if not np.isnan(val):\n",
+" height = rect.get_height()\n",
+" # For negative bars, put text above the bar\n",
+" vert_align = 'bottom' if val >= 0 else 'top'\n",
+" y_offset = 3 if val >= 0 else -3\n",
+"\n",
+" curAx.annotate(f'{val:.2f}',\n",
+"                xy=(rect.get_x() + rect.get_width() / 2, val),\n",
+"                xytext=(0, y_offset), # vertical offset\n",
+"                textcoords=\"offset points\",\n",
+"                ha='center', va=vert_align)\n",
+"\n",
+" # Create a title that shows what this delta represents\n",
+" title = f\"Delta: {compare_label} - {base_label} ({m.y})\"\n",
+" curAx.set_title(title, fontsize=12)\n",
+"\n",
+" # Add labels\n",
+" curAx.set_xlabel(m.x_label, fontsize=axis_label_fontsize)\n",
+" #curAx.set_ylabel(f\"% Delta in {m.y_label}\", fontsize=axis_label_fontsize)\n",
+" curAx.set_xticks(x)\n",
+" curAx.set_xticklabels(common_rates)\n",
+" curAx.tick_params(axis='both', labelsize=tick_label_fontsize)\n",
+"\n",
+" # Create a dummy handle for the legend\n",
+" legend_handle = [plt.Rectangle((0,0),1,1,color='green'),\n",
+"                  plt.Rectangle((0,0),1,1,color='blue')]\n",
+" legend_label = [f'{compare_label} > {base_label}',\n",
+"                 f'{compare_label} < {base_label}']\n",
+"\n",
+" return legend_handle, legend_label\n",
+"\n",
+" # Create plot with metrics\n",
+" fig, axes = plot_metrics(metrics, plot_func, num_plots_per_row)\n",
+"\n",
+" # Add an overall title for the figure\n",
+" fig.suptitle(f\"% Delta Metrics: {compare_label} - {base_label}\",\n",
+"              fontsize=title_fontsize, y=0.98)\n",
+"\n",
+" plt.subplots_adjust(bottom=0.15, top=0.9) # Make room for legends\n",
+" fig.tight_layout(rect=[0, 0.1, 1, 0.95]) # Adjust the rectangle in which the subplots fit\n",
+" plt.show()"
]
},
{
@@ -320,9 +450,26 @@
"outputs": [],
"source": [
"#@title Plot Result\n",
-"\n",
-"pl = Plotter(run_id=RUN_ID, labels=[Label('inference-extension'),Label('k8s-svc')], output_dir=OUTPUT_DIR)\n",
-"pl.plot_bar()"
+"# initialize the plotter with the run id and labels. \n",
+"# Example labels are 'inference-extension' and 'k8s-svc' if comparing Inference Extension and K8s Service \n",
+"# 'regression-before' and 'regression-after' if comparing two different runs of inference extension to see the regression\n",
+"\n",
+"benchmark_id1 = <ID1> # eg 'regression-before' or 'inference-extension'\n",
+"benchmark_id2 = <ID2> # eg 'regression-after' or 'k8s-svc'\n",
+"labels = [Label(benchmark_id1), Label(benchmark_id2,)]\n",
+"\n",
+"# Plot bar chart of metrics\n",
+"pl = Plotter(run_id=RUN_ID, labels=labels, output_dir=OUTPUT_DIR)\n",
+"pl.plot_bar()\n",
+"pl.plot_delta()\n",
+"\n",
+"# Load & group data to compute R^2\n",
+"all_data = load_data(labels, RUN_ID, OUTPUT_DIR)\n",
+"groups = group_data(all_data)\n",
+"compute_r2_for_metrics(groups, CORE_METRICS,\n",
+"              label_before=benchmark_id1,\n",
+"              label_after=benchmark_id2)\n",
+"\n"
]
}
],
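The two analysis helpers added above boil down to simple formulas. As a sketch of the math the new cells implement (using the code's own naming), `plot_delta` plots a percentage change per request rate and `compute_r2_for_metrics` scores how closely the "after" series tracks the "before" series:

```latex
% Percentage delta plotted per request rate r in plot_delta:
\Delta_{\%}(r) = \frac{y_{\mathrm{compare}}(r) - y_{\mathrm{base}}(r)}{y_{\mathrm{base}}(r)} \times 100

% r2_score(yb, ya) in compute_r2_for_metrics treats the "before" series as the
% reference and the "after" series as the prediction:
R^2 = 1 - \frac{\sum_r \bigl(y_{\mathrm{before}}(r) - y_{\mathrm{after}}(r)\bigr)^2}{\sum_r \bigl(y_{\mathrm{before}}(r) - \bar{y}_{\mathrm{before}}\bigr)^2}
```

An R^2 close to 1 means the two runs produced nearly identical metric curves across the common request rates.
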
@@ -355,4 +502,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
-}
+}
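
Because the notebook now imports `r2_score` from scikit-learn, the analysis environment described in "Analyze the results" needs that package in addition to the usual plotting stack. A minimal setup sketch, where the exact package list is an assumption inferred from the notebook's imports:

```bash
# Create and activate a virtual environment for the notebook.
python3 -m venv .venv
source .venv/bin/activate

# Assumed dependency list based on the notebook's imports
# (pandas for groupby, matplotlib/numpy for plotting, scikit-learn for r2_score).
pip install jupyter pandas numpy matplotlib scikit-learn

# Open tools/benchmark/benchmark.ipynb in vscode (or `jupyter notebook`)
# and select the .venv interpreter as the kernel.
```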
