
Conversation

Collaborator

@ST-XX ST-XX commented Nov 5, 2025

Motivation

Optimize the xgrammar integration by introducing asynchronous compilation and native caching. Improve efficiency on CUDA platforms with in-place operations and DLPack interconversion, and remove redundant backend caching logic.

The PD disaggregation scenario has not been verified yet; this PR will continue to be updated.

Modifications

  • Refactored the xgrammar integration to compile asynchronously and use xgrammar's native caching (see the sketch after this list).

  • Removed caching from the backend to avoid duplication.

  • Triggered xgrammar compilation during the Prefill stage, and joined the compile result before sampling the first token in decode.

  • For CUDA platforms:

    • Used DLPack as an intermediate format for conversion between paddle.Tensor and torch.Tensor.
    • Leveraged CUDA hardware for inplace acceleration of xgr.apply_token_bitmask_inplace.
    • Removed previous GPU to CPU numpy conversion.
  • Other platforms retain existing logic.

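A minimal sketch of the asynchronous-compile-and-join flow described above, assuming xgrammar's GrammarCompiler (whose built-in cache is the native caching that replaces the backend cache removed here) plus a plain thread pool; tokenizer, vocab_size, schedule_compile, and join_compile are illustrative names, not the PR's actual identifiers.

from concurrent.futures import ThreadPoolExecutor

import xgrammar as xgr

tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=vocab_size)
# cache_enabled=True is xgrammar's native cache, so no extra backend-side cache is needed.
compiler = xgr.GrammarCompiler(tokenizer_info, cache_enabled=True)
executor = ThreadPoolExecutor(max_workers=1)

def schedule_compile(json_schema: str):
    # Kick off compilation during the Prefill stage; returns a Future.
    return executor.submit(compiler.compile_json_schema, json_schema)

def join_compile(future) -> xgr.GrammarMatcher:
    # Join the compile result just before sampling the first decode token.
    compiled = future.result()
    return xgr.GrammarMatcher(compiled)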

Usage or Command

import openai

port = "8170"
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="null")

completion = client.chat.completions.create(
    model="null",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON object containing: names of China's Four Great Inventions, their dynasties of origin, and brief descriptions (each under 50 characters)",
        }
    ],
    response_format={"type": "json_object"}
)
print(completion.choices[0].message.content)

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Nov 5, 2025

Thanks for your contribution!

def accept_token(self, token: int) -> None:
    """
    Validate and accept a generated token against the grammar constraints.
    when accept eos_token, is_terminated = True
Collaborator

Where is eos_token checked here? And how is the case of over-length output handled?

Collaborator Author

After eos is accepted, the matcher's state becomes is_terminated and it is reset right below. Subsequent tokens are no longer format-constrained, so generation can also continue with ignore_eos enabled.
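A hedged sketch of what this reply describes, assuming self.matcher is an xgr.GrammarMatcher and self.is_terminated mirrors its state; the field names are assumptions, not the PR's actual code.

def accept_token(self, token: int) -> None:
    """Validate and accept a generated token against the grammar constraints."""
    self.matcher.accept_token(token)
    if self.matcher.is_terminated():
        # eos (or a grammar-terminating token) was accepted: flag termination and
        # reset the matcher, so later tokens (e.g. with ignore_eos) are unconstrained.
        self.is_terminated = True
        self.matcher.reset()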

logits = torch.from_numpy(logits.numpy())

logits = logits.float() # cpu
apply_token_bitmask_inplace(
Collaborator

This operator doesn't seem to have been verified on the other hardware backends? Not sure it can be used there.

Collaborator Author

This is a pure CPU operation. The line bitmask=token_bitmask.to(logits.device, non_blocking=True) is a bit misleading; the .to actually still targets the CPU.
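In other words, the non-CUDA fallback amounts to something like the following sketch (not the PR's exact code): the paddle logits go through numpy into a CPU torch tensor, and the .to(logits.device) call keeps the bitmask on the CPU.

import torch
import xgrammar as xgr

logits_t = torch.from_numpy(logits.numpy()).float()  # CPU torch tensor
xgr.apply_token_bitmask_inplace(
    logits_t,
    token_bitmask.to(logits_t.device, non_blocking=True),  # still the CPU
    indices=indices,
)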

if current_platform.is_cuda():
    dlpack = paddle.utils.dlpack.to_dlpack(logits)
    t_logits = torch.from_dlpack(dlpack)
    apply_token_bitmask_inplace(
Collaborator

This operator supports paddle.Tensor, doesn't it? Why convert to torch.Tensor?

Collaborator Author

This still uses the native xgr.apply_token_bitmask_inplace interface, which only supports torch.Tensor.
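So the CUDA branch bridges paddle to torch through DLPack instead; a sketch under that assumption, with illustrative variable names.

import paddle
import torch
import xgrammar as xgr

if current_platform.is_cuda():
    dlpack = paddle.utils.dlpack.to_dlpack(logits)  # export the paddle CUDA tensor
    t_logits = torch.from_dlpack(dlpack)            # zero-copy torch view of the same memory
    xgr.apply_token_bitmask_inplace(
        t_logits,
        token_bitmask.to(t_logits.device, non_blocking=True),
        indices=indices,
    )
    # No copy back is needed: logits and t_logits share storage, so the mask is
    # applied in place on the GPU without any numpy round trip.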

"""update vocab mask. (cpu-heavy operation)"""
if len(self.logits_processor) == 0:
"""add logits processor to SamplerProcessor"""
assert len(prefill_tokens) == 0
Collaborator

In the PD disaggregation scenario, isn't prefill_tokens non-empty?

Collaborator Author

The PD disaggregation scenario has not been verified yet. The assert here will fail in that case.

processor.fill_token_bitmask(self.token_bitmask, idx)

def apply_token_mask(self, logits: paddle.Tensor, skip_idx_list: List[int] = []):
def apply_token_mask(self, logits: paddle.Tensor, prefill_done_idxs: List[int] = []):
Collaborator

Has the asynchrony between decode steps not been added yet?

Collaborator

The multi-hardware scenarios haven't been verified yet. If support is to be added, prioritize XPU.

Collaborator Author

The interface changed, so this had to be updated in sync. The XPU CI passed.
