sgl-project
diff --git a/.github/workflows/pr-test-rust.yml b/.github/workflows/pr-test-rust.yml
Lines changed: 5 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 5 additions & 5 deletions
diff --git a/benchmark/hicache/bench_long_context.py b/benchmark/hicache/bench_long_context.py
Lines changed: 96 additions & 0 deletions
diff --git a/benchmark/hicache/bench_multiturn.py b/benchmark/hicache/bench_multiturn.py
Lines changed: 9 additions & 5 deletions
diff --git a/docker/Dockerfile b/docker/Dockerfile
Lines changed: 2 additions & 2 deletions
diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md
Lines changed: 20 additions & 8 deletions
@@ -56,7 +56,7 @@ jobs:
 
   e2e-python:
     if: github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request'
-    runs-on: 2-gpu-runner
+    runs-on: BM.A10.4
     timeout-minutes: 30
     steps:
       - name: Checkout code
@@ -66,6 +66,10 @@ jobs:
         run: |
           bash scripts/ci/ci_install_rust.sh
 
+      - name: Install SGLang dependencies
+        run: |
+          sudo bash scripts/ci/ci_install_dependency.sh
+
       - name: Build python binding
         run: |
           source "$HOME/.cargo/env"
 
@@ -53,11 +53,11 @@ The core features include:
 - **Active Community**: SGLang is open-source and backed by an active community with wide industry adoption.
 
 ## Getting Started
-- [Install SGLang](https://docs.sglang.ai/start/install.html)
-- [Quick Start](https://docs.sglang.ai/backend/send_request.html)
-- [Backend Tutorial](https://docs.sglang.ai/backend/openai_api_completions.html)
-- [Frontend Tutorial](https://docs.sglang.ai/frontend/frontend.html)
-- [Contribution Guide](https://docs.sglang.ai/references/contribution_guide.html)
+- [Install SGLang](https://docs.sglang.ai/get_started/install.html)
+- [Quick Start](https://docs.sglang.ai/basic_usage/send_request.html)
+- [Backend Tutorial](https://docs.sglang.ai/basic_usage/openai_api_completions.html)
+- [Frontend Tutorial](https://docs.sglang.ai/references/frontend/frontend_tutorial.html)
+- [Contribution Guide](https://docs.sglang.ai/developer_guide/contribution_guide.html)
 
 ## Benchmark and Performance
 Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/), [Large-scale expert parallelism](https://lmsys.org/blog/2025-05-05-large-scale-ep/).
 
@@ -0,0 +1,96 @@
+import json
+import queue
+import time
+
+import requests
+from bench_multiturn import (
+    ReadyQueue,
+    WorkloadGenerator,
+    gen_payload,
+    log_to_jsonl_file,
+    parse_args,
+)
+from tqdm.asyncio import tqdm
+
+from sglang.bench_serving import get_tokenizer
+
+
+class ContextWorkloadGenerator(WorkloadGenerator):
+    def __init__(self, args):
+        # Construct the base URL for requests
+        self.baseurl = f"http://{args.host}:{args.port}/"
+        self.url = self.baseurl + "generate"
+
+        self.tokenizer = get_tokenizer(args.model_path)
+        self.distribution = args.distribution
+        self.request_rate = args.request_rate
+        self.start_time = None
+        self.finished_time = None
+
+        self.sent_requests = 0
+        self.completed_requests = 0
+
+        self.dataset = json.load(open(args.dataset_path))
+
+        init_requests = []
+        for i in range(min(args.num_clients, len(self.dataset["queries"]))):
+            context_id = self.dataset["queries"][i]["context"]
+            init_requests.append(
+                (
+                    i,
+                    gen_payload(
+                        self.dataset["contexts"][context_id]
+                        + self.dataset["queries"][i]["question"],
+                        len(
+                            self.tokenizer(
+                                self.dataset["queries"][i]["reference_answer"]
+                            )["input_ids"]
+                        ),
+                    ),
+                )
+            )
+        self.ready_queue = ReadyQueue(init_requests=init_requests)
+
+        self.response_queue = queue.Queue()
+        self.pbar = tqdm(total=args.num_clients * args.num_rounds)
+        self.performance_metrics = {
+            "ttft": [],
+            "latency": [],
+            "itl": [],
+            "prompt_len": [],
+            "cached_tokens": [],
+        }
+
+        self.max_parallel = args.max_parallel
+        self.logfile = args.log_file
+
+    def response_handler(self):
+        while True:
+            try:
+                client_id, response = self.response_queue.get(
+                    timeout=10
+                )  # Block until response is available
+                if not response.success:
+                    raise ValueError(f"Request failed with error: {response.error}")
+                self.performance_metrics["ttft"].append(response.ttft)
+                self.performance_metrics["itl"].extend(response.itl)
+                self.performance_metrics["latency"].append(response.latency)
+                self.completed_requests += 1
+
+            except queue.Empty:
+                if self.pbar.n == self.pbar.total:
+                    break
+
+
+if __name__ == "__main__":
+    args = parse_args()
+    args.num_rounds = 1
+    args.max_parallel = 128
+    flush_cache_url = f"http://{args.host}:{args.port}/flush_cache"
+
+    for request_rate in [24, 16, 12, 8, 4, 2, 1]:
+        args.request_rate = request_rate
+        requests.post(flush_cache_url)
+        time.sleep(1)
+        performance_data = ContextWorkloadGenerator(args).run()
+        log_to_jsonl_file(performance_data, args.log_file, args.tag)
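Note on the dataset layout used by `bench_long_context.py`: the constructor above looks up `dataset["contexts"]` by each query's `context` field, appends the `question`, and uses the token count of `reference_answer` as the output length. The sketch below mirrors that prompt assembly under stated assumptions. `build_requests` and `count_tokens` are hypothetical names that do not appear in the PR, whitespace splitting stands in for the real tokenizer, and the JSON shape is inferred only from the field accesses in the diff (whether `contexts` is a dict or a list is not stated there).

```python
import json


def count_tokens(text):
    # Stand-in for the real tokenizer (assumption): counts whitespace-separated words.
    return len(text.split())


def build_requests(dataset, num_clients):
    """Pair each query with its shared context, mirroring the loop in the diff."""
    built = []
    for i in range(min(num_clients, len(dataset["queries"]))):
        query = dataset["queries"][i]
        # Prompt = shared context followed by the query-specific question.
        prompt = dataset["contexts"][query["context"]] + query["question"]
        # Output budget = length of the reference answer in tokens.
        max_new_tokens = count_tokens(query["reference_answer"])
        built.append((i, prompt, max_new_tokens))
    return built


# Assumed dataset shape, inferred from the field names used in the diff.
example = json.loads("""
{
  "contexts": {"doc0": "Long shared document text. "},
  "queries": [
    {"context": "doc0", "question": "What is it about?",
     "reference_answer": "A shared document."},
    {"context": "doc0", "question": "Who wrote it?",
     "reference_answer": "Unknown."}
  ]
}
""")

reqs = build_requests(example, num_clients=8)
```

Because both queries point at the same context, their prompts share a long prefix, which is the case a hierarchical cache benchmark is meant to exercise. Note also that the number of requests built is capped by the number of queries, so it can be smaller than `num_clients`.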
@@ -322,6 +322,9 @@ def __init__(self, args):
             "prompt_len": [],
             "cached_tokens": [],
         }
+        self.num_rounds = args.num_rounds
+        self.max_parallel = args.max_parallel
+        self.output_length = args.output_length
 
     async def handle_request(self, item):
         try:
@@ -336,7 +339,7 @@ async def handle_request(self, item):
     def request_sender(self):
         async def request_loop():
             while True:
-                if self.sent_requests - self.completed_requests < args.max_parallel:
+                if self.sent_requests - self.completed_requests < self.max_parallel:
                     new_request = self.ready_queue.pop()
                     if new_request:
                         asyncio.create_task(self.handle_request(new_request))
@@ -382,7 +385,7 @@ def response_handler(self):
                 self.performance_metrics["cached_tokens"].append(response.cached_tokens)
                 self.completed_requests += 1
 
-                if self.client_records[client_id]["round"] < args.num_rounds:
+                if self.client_records[client_id]["round"] < self.num_rounds:
                     # append new request to client's history
                     self.client_records[client_id][
                         "history"
@@ -392,7 +395,7 @@ def response_handler(self):
                             client_id,
                             gen_payload(
                                 self.client_records[client_id]["history"],
-                                args.output_length,
+                                self.output_length,
                             ),
                         )
                     )
@@ -461,7 +464,7 @@ def run(self):
             f"  Throughput: {performance_data['summary']['throughput']:.2f} requests per second"
         )
         print(f"  Cache Hit Rate: {performance_data['summary']['cache_hit_rate']:.6f}")
-        log_to_jsonl_file(performance_data, args.log_file, tag=args.tag)
+        return performance_data
 
 
 if __name__ == "__main__":
@@ -482,4 +485,5 @@ def run(self):
         args.request_rate = rate
         requests.post(flush_cache_url)
         time.sleep(1)
-        WorkloadGenerator(args).run()
+        performance_data = WorkloadGenerator(args).run()
+        log_to_jsonl_file(performance_data, args.log_file, tag=args.tag)
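Note on the refactor in `bench_multiturn.py`: `run()` now returns its results and the caller does the logging, and the class reads `num_rounds`, `max_parallel`, and `output_length` from instance attributes rather than from a module-level `args`. Since `args` is only defined under `if __name__ == "__main__":`, the old references would presumably fail once the class is imported from another script such as `bench_long_context.py`. The sketch below illustrates the return-then-log pattern only. `FakeGenerator` and `append_jsonl` are hypothetical stand-ins, and the real `log_to_jsonl_file` may write a different record layout.

```python
import json
import os
import tempfile


def append_jsonl(record, path, tag=None):
    # Hypothetical stand-in for log_to_jsonl_file: one JSON object per line.
    with open(path, "a") as f:
        f.write(json.dumps({"tag": tag, **record}) + "\n")


class FakeGenerator:
    """Placeholder workload generator: run() returns results and does no I/O."""

    def __init__(self, request_rate):
        # Configuration is stored on the instance, not read from a global.
        self.request_rate = request_rate

    def run(self):
        return {"summary": {"request_rate": self.request_rate}}


log_path = os.path.join(tempfile.mkdtemp(), "bench.jsonl")
for rate in [4, 2, 1]:
    # The caller decides where and how the results are logged.
    performance_data = FakeGenerator(rate).run()
    append_jsonl(performance_data, log_path, tag="demo")

with open(log_path) as f:
    rows = [json.loads(line) for line in f]
```

Keeping I/O out of `run()` means a subclass can reuse it unchanged while each entry-point script chooses its own log file and tag.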
@@ -73,10 +73,10 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools wheel html5li
  && python3 -m pip install --no-cache-dir nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps \
  && python3 -m flashinfer --download-cubin \
  && if [ "$CUDA_VERSION" = "12.8.1" ]; then \
-      python3 -m pip install --no-cache-dir https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3+cu128-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall --no-deps ; \
+      python3 -m pip install --no-cache-dir https://github.com/sgl-project/whl/releases/download/v0.3.4/sgl_kernel-0.3.4+cu128-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall --no-deps ; \
     fi \
  && if [ "$CUDA_VERSION" = "12.9.1" ]; then \
-      python3 -m pip install --no-cache-dir https://github.com/sgl-project/whl/releases/download/v0.3.3/sgl_kernel-0.3.3-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall --no-deps ; \
+      python3 -m pip install --no-cache-dir https://github.com/sgl-project/whl/releases/download/v0.3.4/sgl_kernel-0.3.4-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall --no-deps ; \
     fi
 
 # Download source files
 
@@ -21,8 +21,8 @@ standard LLM support:
    in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561)
    to return `True` for your model.
 
-2. **Register a new chat-template**
-   See [conversation.py](https://github.com/sgl-project/sglang/blob/86a779dbe9e815c02f71ea82574608f6eae016b5/python/sglang/srt/conversation.py)
+2. **Register a new chat-template**:
+   This step is needed only when the model's default chat template cannot accept images as input. In that case, register a new chat template in [conversation.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/conversation.py) together with its corresponding matching function.
 
 3. **Multimodal Data Processor**:
    Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
@@ -35,16 +35,18 @@ standard LLM support:
    expanded (if necessary) and padded with multimodal-data-hashes so that SGLang can recognize different multimodal data
    with `RadixAttention`.
 
-5. **Adapt to Vision Attention**:
+5. **Handle Image Feature Extraction**:
+   Implement a `get_image_feature` function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model.
+
+6. **Adapt to Vision Attention**:
    Adapt the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
 
 You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or
 other mllm implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
 
-You should test the new MLLM locally against Hugging Face models. See the [
-`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
+## Testing and Debugging
 
-## Test the Correctness
+Report all testing and benchmarking results in the PR description.
 
 ### Interactive Debugging
 
@@ -65,14 +67,21 @@ should give the same text output and very similar prefill logits:
 To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in
 the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py)
 file, test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU,
-MMMU-Pro, etc.) in your PR.
+MMMU-Pro, etc.) in your PR.
+For VLMs, also include a test in `test_vision_openai_server_{x}.py` (e.g. [test_vision_openai_server_a.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_a.py), [test_vision_openai_server_b.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server_b.py)).
+
 
-This is the command to test a new model on your local machine:
+The following example command tests a new model on your local machine:
 
 ```bash
 ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
 ```
 
+### Benchmark
+
+- **(Required) MMMU**: Follow the MMMU benchmark [README.md](https://github.com/sgl-project/sglang/blob/main/benchmark/mmmu/README.md) to compare accuracy between SGLang and HF Transformers. The accuracy score from the SGLang run should not be much lower than the score from the HF Transformers run. Similarly, follow https://docs.sglang.ai/developer_guide/benchmark_and_profiling.html to compare performance: TTFT and throughput must meet or exceed the baselines (e.g., HF Transformers).
+- **(Optional) Other evals**: If you ran other evals, note the results in the PR description.
+
 ## Port a Model from vLLM to SGLang
 
 The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable
@@ -126,6 +135,9 @@ ModelRegistry.models.update(import_new_model_classes())
 launch_server(server_args)
 ```
 
+## Documentation
+Add the new model to the table of supported models in [generative_models.md](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/generative_models.md) or [multimodal_language_models.md](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/multimodal_language_models.md).
+
 ---
 
 By following these guidelines, you can add support for new language models and multimodal large language models in