95 commits
1e53e23
support w4a8 low latency deepep
ayrnb Jul 23, 2025
93cb396
clean code
ayrnb Jul 24, 2025
31d01f9
clean code
ayrnb Jul 24, 2025
157b979
clean code
ayrnb Jul 24, 2025
5dd0f87
[bug] fix pd completion protocol for batching support (#8317)
slin1237 Jul 24, 2025
f6e07f2
[router] fix pd model completion request (#8303)
slin1237 Jul 24, 2025
bfb118c
fix bug when eos_ids==0 (#8315)
bzantium Jul 24, 2025
2f86f3a
[router] add endpoint unit test (#8298)
slin1237 Jul 24, 2025
a167fd0
[code style] Clean dead triton kernel code in fused_moe and useless v…
BBuf Jul 24, 2025
96c5d85
fix
ayrnb Jul 24, 2025
0090240
fix
ayrnb Jul 24, 2025
8d1c5b9
chore: upgrade flashinfer v0.2.9rc1 (#8301)
Swipe4057 Jul 24, 2025
33c4b4d
[router] add streaming unit test (#8299)
slin1237 Jul 24, 2025
39fe1e8
[router] add request format unit test (#8300)
slin1237 Jul 24, 2025
145482f
HiCache Storage TP Refinement (#8307)
xiezhq-hermann Jul 25, 2025
d40846d
breakdown kernel update (#8334)
xiezhq-hermann Jul 25, 2025
f4674df
support idle batch for TBO (#8233)
sherry-1001 Jul 25, 2025
28d4d47
[Feature] Integrate quick allreduce and select the best allreduce imp…
lihaoyang-amd Jul 25, 2025
c0fb25e
DP Enhancement (#8280)
ch-wan Jul 25, 2025
7ad6b76
fix: Fix failed functional tests https://github.com/meta-llama/llama-…
ynwang007 Jul 25, 2025
af4b9ba
[AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_qui…
hubertlu-tw Jul 25, 2025
15d2759
[CPU] Add tutorial docs for SGL on CPU (#8000)
ZailiWang Jul 25, 2025
70e37b9
chore: upgrade mooncake 0.3.5 (#8341)
ShangmingCai Jul 25, 2025
9045cc1
[torch.compile bug] avoid biased_grouped_topk_impl func repeatedly tr…
BBuf Jul 25, 2025
1b9cea5
[P/D] Support ipv6 in P/D scenario (#7858)
thefacetakt Jul 25, 2025
12cb760
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-…
Xu-Wenqing Jul 25, 2025
f8260f2
[Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs …
CatherineSue Jul 25, 2025
ed2e313
Clean up server_args, triton cache manager (#8332)
merrymercy Jul 25, 2025
7181ec8
fix: upgrade nccl version (#8359)
zhyncs Jul 25, 2025
d8ee156
[Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (#…
CatherineSue Jul 25, 2025
f8ca236
fix: kimi k2 xgrammar crash (#8367)
zhyncs Jul 25, 2025
58c468f
Fix FP4 MoE accuracy from missing routed_scaling_factor (#8333)
trevor-m Jul 25, 2025
3ec0b21
[CI] Fix flaky threshold (#8370)
merrymercy Jul 25, 2025
2272c2a
chore: bump v0.4.9.post4 (#8305)
zhyncs Jul 26, 2025
8af145b
Fix test_moe_fused_gate_combined sgl-kernel ci test (#8374)
ispobock Jul 26, 2025
e6312d2
Uodate Dockerfile.gb200 to latest sglang (#8356)
kyleliang-nv Jul 26, 2025
4fa44d6
chore: improve mmmu benchmark (#7000)
mickqian Jul 26, 2025
e236d8f
Save peak memory in logits processor (#8343)
ch-wan Jul 26, 2025
ce32bc2
Extract update_weights from RL Engine to SGLang to keep simplicity an…
hebiao064 Jul 26, 2025
5347567
chore: improvements on mm_utils (#7737)
mickqian Jul 26, 2025
3212c2a
vlm: optimize tensor transport (#6003)
mickqian Jul 26, 2025
da0c026
Tiny assert EPLB is used together with expert parallel (#8381)
fzyzcjy Jul 26, 2025
b7094a5
model: support intern-s1 (#8350)
RunningLeon Jul 26, 2025
5c705b1
Add perf tests for LoRA (#8314)
lifuhuang Jul 26, 2025
7615463
Remove slot usage in code to be backward-compatible with python 3.9 (…
lifuhuang Jul 27, 2025
62a6b7c
Add docker release flow for gb200 (#8394)
kyleliang-nv Jul 27, 2025
528bd1e
HiCache, check before terminate prefetching (#8372)
xiezhq-hermann Jul 27, 2025
426b749
Add nvfp4 scaled mm benchmark. (#8401)
HydraQYH Jul 27, 2025
b602f42
Urgent Fix: intern-s1 chat-template matching (#8403)
JustinTong0323 Jul 27, 2025
ed0fdbf
Tool to dump and compare internal activation tensors (#7976)
fzyzcjy Jul 27, 2025
62222bd
Minor tool for comparison of benchmark results (#7974)
fzyzcjy Jul 27, 2025
e34cf6a
Fix bench script making input data on L2 cache (#7739)
fzyzcjy Jul 27, 2025
85486b6
[NVIDIA] Add Flashinfer MoE blockscale fp8 backend (#8036)
kaixih Jul 27, 2025
91e3d15
Update Cutlass in sgl-kernel to v4.1 (#8392)
Fridge003 Jul 27, 2025
0bcc195
fix: minor fix TransportProxyTensor under tp (#8382)
mickqian Jul 27, 2025
2ab9702
[router] add different policies for p node and d node (#8395)
slin1237 Jul 27, 2025
2a1936d
Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-In…
lambert0312 Jul 27, 2025
36d6f0b
fix: fix the missing metrics on non-rank0 nodes (#7720)
acelyc111 Jul 27, 2025
bf0f448
[2/N] MoE Refactor: Unify weight loader and quant methods (#8397)
ch-wan Jul 27, 2025
5c9c275
Use FlashInfer FP4 gemm. (#8241)
elfiegg Jul 27, 2025
44d600c
Support precomputed_embeddings for Llama 4 (#8156)
AlienKevin Jul 27, 2025
4d921f2
[hotfix] fix merge conflicts in FlashInferEPMoE (#8405)
ch-wan Jul 27, 2025
bf3352c
chore: update CODEOWNERS (#8407)
zhyncs Jul 27, 2025
10ee895
chore: upgrade flashinfer v0.2.9rc2 (#8406)
zhyncs Jul 27, 2025
b3eac16
Support triton kernels v3.4.0 for fused_moe (#8258)
yuan-luo Jul 27, 2025
22e00ee
[Bugfix] Prevent PD server crash from invalid grammar (#8062)
ShangmingCai Jul 27, 2025
95217a9
Change to use native arm runner (#8414)
kyleliang-nv Jul 27, 2025
df90645
Support overlapped lora updates (#8213)
lifuhuang Jul 27, 2025
b58c3c2
Support ue8m0 for triton quant kernel (#7603)
fzyzcjy Jul 27, 2025
e983d66
Fix: Improve test_openai_function_calling unit test and fix reasoning…
byjiang1996 Jul 27, 2025
b47eda3
bugfix: Fix multiple finish_reason chunks and tool_calls finish reaso…
CatherineSue Jul 27, 2025
58dd95f
Fix test_openai_server (#8419)
CatherineSue Jul 27, 2025
bb81dae
Fix docker buildx push error (#8425)
kyleliang-nv Jul 28, 2025
dd487e5
bugfix: Fix XGrammar backend to use model's EOS tokens for constraine…
CatherineSue Jul 28, 2025
fe6a445
[router] improve router logs and request id header (#8415)
slin1237 Jul 28, 2025
2810338
[feat] Support different attention backends for prefill and decode (…
Qiaolin-Yu Jul 28, 2025
4ad9737
chore: bump transformer to 4.54.0 (#8416)
hebiao064 Jul 28, 2025
2fd5c70
[PD] Fix abort_request for PD disaggregation (#8352)
ShangmingCai Jul 28, 2025
6d6a8bc
GLM-4.5 Model Support (#8224)
zRzRzRzRzRzRzR Jul 28, 2025
5922c0c
Remove zstd compression for building Dockerfile.gb200 (#8442)
kyleliang-nv Jul 28, 2025
484d0e0
doc: add bench_one_batch_server in the benchmark doc (#8441)
Qiaolin-Yu Jul 28, 2025
581e7dc
GLM-4.5 Model Support Follow-up (#8445)
byjiang1996 Jul 28, 2025
25f73c6
fix GLM4_MOE launch with compressed_tensor quant model (#8456)
zminglei Jul 28, 2025
fb4ce17
Fix per_token_group_quant_8bit when hidden_dim // group_size is not d…
strgrb Jul 28, 2025
2262369
Revert "[kernel] opt moe align block kernel by block/warp scan algori…
BBuf Jul 28, 2025
45bc170
chore: bump v0.4.9.post5 (#8458)
zhyncs Jul 28, 2025
a9dd3ec
fix:reorder topk experts to ensure shared expert replaces minimal sco…
erictanjn Jul 28, 2025
712877a
support w4a8 low latency deepep
ayrnb Jul 23, 2025
77351b7
clean code
ayrnb Jul 24, 2025
c15e34a
clean code
ayrnb Jul 24, 2025
f770ea6
clean code
ayrnb Jul 24, 2025
cfe7d62
fix
ayrnb Jul 24, 2025
d2afdb4
fix
ayrnb Jul 24, 2025
eb39568
Merge branch 'feat/w4a8_support_ll_deepep' of github.com:bytedance-ia…
ayrnb Jul 28, 2025
1e721d4
support cudagraph
ayrnb Jul 28, 2025
6 changes: 3 additions & 3 deletions .github/CODEOWNERS
@@ -6,19 +6,19 @@
 /python/sglang/srt/constrained @hnyls2002
 /python/sglang/srt/disaggregation @ByronHsu @hnyls2002
 /python/sglang/srt/distributed @yizhang2077
-/python/sglang/srt/entrypoints @zhaochenyang20 @CatherineSue
+/python/sglang/srt/entrypoints @ispobock @CatherineSue @slin1237
 /python/sglang/srt/eplb @fzyzcjy
 /python/sglang/srt/function_call @CatherineSue
 /python/sglang/srt/layers @merrymercy @Ying1123 @zhyncs @ispobock @HaiShaw @ch-wan @BBuf
 /python/sglang/srt/lora @Ying1123 @Fridge003
 /python/sglang/srt/managers @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
 /python/sglang/srt/mem_cache @merrymercy @Ying1123 @hnyls2002 @xiezhq-hermann
 /python/sglang/srt/model_executor @merrymercy @Ying1123 @hnyls2002 @zhyncs @ispobock
-/python/sglang/srt/models @zhyncs @ispobock @ByronHsu @zhaochenyang20
+/python/sglang/srt/models @zhyncs @ispobock @ByronHsu @JustinTong0323
 /python/sglang/srt/multimodal @mickqian @JustinTong0323
 /python/sglang/srt/sampling @hnyls2002
 /python/sglang/srt/speculative @Ying1123 @merrymercy @rkooo567 @kssteven418
 /test/lang @merrymercy @Ying1123
 /test/srt @merrymercy @Ying1123 @zhyncs
 /sgl-router @ByronHsu @slin1237
-/sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy @yinfan98 @HaiShaw
+/sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy @FlamingoPg @HaiShaw
2 changes: 1 addition & 1 deletion .github/workflows/pr-test-pd-router.yml
@@ -114,7 +114,7 @@ jobs:
       run: |
         echo "Installing SGLang with all extras..."
         python3 -m pip --no-cache-dir install -e "python[all]" --break-system-packages
-        python3 -m pip --no-cache-dir install mooncake-transfer-engine==0.3.4.post2
+        python3 -m pip --no-cache-dir install mooncake-transfer-engine==0.3.5

     - name: Build and install sgl-router
       run: |
7 changes: 7 additions & 0 deletions .github/workflows/pr-test.yml
@@ -174,6 +174,13 @@ jobs:
         cd test/srt
         python3 -m unittest test_bench_serving.TestBenchServing.test_online_latency_eagle

+      - name: Benchmark online latency (LoRA)
+        timeout-minutes: 10
+        run: |
+          cd test/srt
+          python3 -m unittest test_bench_serving.TestBenchServing.test_lora_online_latency
+          python3 -m unittest test_bench_serving.TestBenchServing.test_lora_online_latency_with_concurrent_adapter_updates
+
   performance-test-1-gpu-part-2:
     if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
       github.event.pull_request.draft == false
36 changes: 36 additions & 0 deletions .github/workflows/release-docker-gb200.yml
@@ -0,0 +1,36 @@
+name: Release Docker Images (GB200)
+on:
+  push:
+    branches:
+      - main
+    paths:
+      - "python/sglang/version.py"
+  workflow_dispatch:
+
+jobs:
+  publish:
+    if: github.repository == 'sgl-project/sglang'
+    runs-on: ubuntu-22.04-arm
+    environment: 'prod'
+    steps:
+      - name: Delete huge unnecessary tools folder
+        run: rm -rf /opt/hostedtoolcache
+
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Login to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Build and Push
+        run: |
+          version=$(cat python/sglang/version.py | cut -d'"' -f2)
+          tag=v${version}-cu128-gb200
+
+          docker buildx build --platform linux/arm64 --push --output type=image -t lmsysorg/sglang:${tag} -f docker/Dockerfile.gb200 --build-arg CUDA_VERSION=12.8.1 --build-arg BUILD_TYPE=blackwell --no-cache .
2 changes: 1 addition & 1 deletion .github/workflows/vllm-dependency-test.yml
@@ -30,7 +30,7 @@ jobs:
     - name: Install dependencies
       run: |
        bash scripts/ci_install_dependency.sh
-        pip install "vllm==0.9.0.1"
+        pip install "vllm==0.10.0"
        pip install "bitsandbytes>=0.44.0"

     - name: Run VLLM dependency tests
6 changes: 5 additions & 1 deletion .pre-commit-config.yaml
@@ -39,7 +39,11 @@ repos:
       - id: codespell
        additional_dependencies: ['tomli']
        args: ['--toml', 'python/pyproject.toml', '-L', 'cann']
-       exclude: test/srt/test_reasoning_parser.py # Exclude the test file that is expected to fail
+       exclude: |
+         (?x)^(
+           test/srt/test_reasoning_parser\.py|
+           docs/backend/vlm_query\.ipynb
+         )$
   - repo: https://github.com/pre-commit/mirrors-clang-format
     rev: v18.1.8
     hooks:
2 changes: 1 addition & 1 deletion benchmark/deepseek_v3/README.md
@@ -33,7 +33,7 @@ Add [performance optimization options](#performance-optimization-options) as needed.

 ```bash
 # Installation
-pip install "sglang[all]>=0.4.9.post3"
+pip install "sglang[all]>=0.4.9.post5"

 # Launch
 python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
7 changes: 7 additions & 0 deletions benchmark/gsm8k/bench_sglang.py
@@ -10,6 +10,7 @@
 from sglang.api import set_default_backend
 from sglang.test.test_utils import (
     add_common_sglang_args_and_parse,
+    dump_bench_raw_result,
     select_sglang_backend,
 )
 from sglang.utils import download_and_cache_file, dump_state_text, read_jsonl
@@ -115,6 +116,12 @@ def few_shot_gsm8k(s, question):

     # Dump results
     dump_state_text(f"tmp_output_{args.backend}.txt", states)
+    dump_bench_raw_result(
+        path=args.raw_result_file,
+        states=states,
+        preds=preds,
+        labels=labels,
+    )

     with open(args.result_file, "a") as fout:
         value = {
@@ -33,7 +33,11 @@ def get_model_config(model_name: str, tp_size: int):
         topk = config.num_experts_per_tok
         intermediate_size = config.moe_intermediate_size
         shard_intermediate_size = 2 * intermediate_size // tp_size
-    elif config.architectures[0] in ["DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"]:
+    elif config.architectures[0] in [
+        "DeepseekV2ForCausalLM",
+        "DeepseekV3ForCausalLM",
+        "Glm4MoeForCausalLM",
+    ]:
         E = (
             config.n_routed_experts + 1
             if config.architectures[0] in ["DeepseekV3ForCausalLM"]
@@ -42,7 +42,11 @@ def get_model_config(model_name: str, tp_size: int):
         topk = config.num_experts_per_tok
         intermediate_size = config.moe_intermediate_size
         shard_intermediate_size = 2 * intermediate_size // tp_size
-    elif config.architectures[0] in ["DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"]:
+    elif config.architectures[0] in [
+        "DeepseekV2ForCausalLM",
+        "DeepseekV3ForCausalLM",
+        "Glm4MoeForCausalLM",
+    ]:
         E = (
             config.n_routed_experts + 1
             if config.architectures[0] in ["DeepseekV3ForCausalLM"]
8 changes: 8 additions & 0 deletions benchmark/mmlu/bench_sglang.py
@@ -9,6 +9,7 @@

 from sglang.test.test_utils import (
     add_common_sglang_args_and_parse,
+    dump_bench_raw_result,
     select_sglang_backend,
 )

@@ -142,6 +143,13 @@ def few_shot_mmlu(s, examples, question):
     assert pt == len(cors)
     weighted_acc = np.mean(cors)

+    dump_bench_raw_result(
+        path=args.raw_result_file,
+        states=states,
+        preds=preds,
+        labels=labels,
+    )
+
     # Print results
     print("Total latency: {:.3f}".format(latency))
     print("Average accuracy: {:.3f}".format(weighted_acc))
31 changes: 20 additions & 11 deletions benchmark/mmmu/bench_sglang.py
@@ -125,7 +125,6 @@ async def eval_mmmu(args) -> None:
     client = openai.AsyncOpenAI(
         api_key="sk", base_url=f"http://127.0.0.1:{args.port}/v1"
     )
-    semaphore = asyncio.Semaphore(args.concurrency)
     start = time.perf_counter()
     base_url = f"http://127.0.0.1:{args.port}"

@@ -139,16 +138,26 @@ async def eval_mmmu(args) -> None:

     samples = samples[: args.profile_number]

-    tasks = [
-        process_sample_with_semaphore(
-            semaphore, client, sample, sampling_params, lora_path
-        )
-        for sample in samples
-    ]
-
-    for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
-        sample, response = await coro
-        process_result(response, sample, answer_dict, out_samples)
+    if args.concurrency == 1:
+        # For concurrency == 1, run in sequential mode to ensure consistent order
+        # this is mainly for profiling
+        for sample in tqdm(samples):
+            _, response = await process_sample(
+                client, sample, sampling_params, lora_path
+            )
+            process_result(response, sample, answer_dict, out_samples)
+    else:
+        semaphore = asyncio.Semaphore(args.concurrency)
+        tasks = [
+            process_sample_with_semaphore(
+                semaphore, client, sample, sampling_params, lora_path
+            )
+            for sample in samples
+        ]
+
+        for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
+            sample, response = await coro
+            process_result(response, sample, answer_dict, out_samples)

     if args.profile:
         print("Stopping profiler...")
7 changes: 4 additions & 3 deletions benchmark/mmmu/eval_utils.py
@@ -27,8 +27,7 @@
 class EvalArgs:
     seed: int = 42
     split: str = "validation"
-    # Default setting to make the benchmark available on A100 for most 7B models
-    image_pixels_limit: int = 4300000
+    image_pixels_limit: int = -1
     result_filename: str = ""
     prompt_format_file: str = "prompt_format.yaml"
     dataset_path: str = "MMMU/MMMU"
@@ -190,7 +189,7 @@ def process_sample(i, sample):
     sample = construct_prompt(sample, eval_args.config)
     image = sample["image"]
     width, height = image.size
-    if width * height >= eval_args.image_pixels_limit:
+    if 0 < eval_args.image_pixels_limit <= width * height:
         return None, True
     # Use a unique identifier for the image path to avoid potential collisions if indices reset
     image_path = f"{images_path}/image_{sample['id']}.png"
@@ -217,6 +216,8 @@ def process_sample(i, sample):
     elif sample:
         samples.append(sample)

+    samples.sort(key=lambda x: x["final_input_prompt"])
+
     print(
         f"Skipping {skip_count} samples with large images, {round((float(skip_count) / len(dataset)) * 100, 2)}% of dataset"
     )
4 changes: 2 additions & 2 deletions docker/Dockerfile
@@ -58,8 +58,8 @@ RUN python3 -m pip install --no-cache-dir --upgrade pip setuptools wheel html5li
     *) echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1 ;; \
     esac \
     && python3 -m pip install --no-cache-dir -e "python[${BUILD_TYPE}]" --extra-index-url https://download.pytorch.org/whl/cu${CUINDEX} \
-    && python3 -m pip install --no-cache-dir nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps \
     && if [ "$CUDA_VERSION" = "12.8.1" ]; then \
+         python3 -m pip install --no-cache-dir nvidia-nccl-cu12==2.27.6 --force-reinstall --no-deps ; \
          python3 -m pip install --no-cache-dir https://github.com/sgl-project/whl/releases/download/v0.2.7/sgl_kernel-0.2.7+cu128-cp39-abi3-manylinux2014_x86_64.whl --force-reinstall --no-deps ; \
        fi

@@ -86,7 +86,7 @@ RUN wget https://developer.download.nvidia.com/compute/redist/nvshmem/3.3.9/sour
 # Python tools
 RUN python3 -m pip install --no-cache-dir \
     datamodel_code_generator \
-    mooncake_transfer_engine==0.3.4.post2 \
+    mooncake-transfer-engine==0.3.5 \
     pre-commit \
     pytest \
     black \