
Conversation

Collaborator

@ST-XX ST-XX commented Nov 5, 2025

Motivation

Optimize the xgrammar integration by introducing asynchronous compilation and native caching. Improve efficiency on CUDA platforms with in-place operations and DLPack interconversion, and remove redundant backend caching logic.

The PD disaggregation scenario has not been verified yet; this PR will continue to be updated.

Modifications

  • Refactored the xgrammar integration to compile asynchronously and use xgrammar's native caching (see the sketch after this list).

  • Removed caching from the backend to avoid duplication.

  • Triggered xgrammar compilation during the Prefill stage, and joined the compile result before sampling the first token in decode.

  • For CUDA platforms:

    • Used DLPack as an intermediate format for conversion between paddle.Tensor and torch.Tensor.
    • Leveraged CUDA hardware for inplace acceleration of xgr.apply_token_bitmask_inplace.
    • Removed previous GPU to CPU numpy conversion.
  • Other platforms retain existing logic.

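A minimal sketch of the asynchronous-compile-and-join flow described above, assuming xgrammar's GrammarCompiler (whose built-in cache is the native caching that replaces the backend cache removed here) plus a plain thread pool; tokenizer, vocab_size, schedule_compile, and join_compile are illustrative names, not the PR's actual identifiers.

from concurrent.futures import ThreadPoolExecutor

import xgrammar as xgr

tokenizer_info = xgr.TokenizerInfo.from_huggingface(tokenizer, vocab_size=vocab_size)
# cache_enabled=True is xgrammar's native cache, so no extra backend-side cache is needed.
compiler = xgr.GrammarCompiler(tokenizer_info, cache_enabled=True)
executor = ThreadPoolExecutor(max_workers=1)

def schedule_compile(json_schema: str):
    # Kick off compilation during the Prefill stage; returns a Future.
    return executor.submit(compiler.compile_json_schema, json_schema)

def join_compile(future) -> xgr.GrammarMatcher:
    # Join the compile result just before sampling the first decode token.
    compiled = future.result()
    return xgr.GrammarMatcher(compiled)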

Usage or Command

import openai

port = "8170"
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="null")

completion = client.chat.completions.create(
    model="null",
    messages=[
        {
            "role": "user",
            "content": "Generate a JSON object containing: names of China's Four Great Inventions, their dynasties of origin, and brief descriptions (each under 50 characters)",
        }
    ],
    response_format={"type": "json_object"}
)
print(completion.choices[0].message.content)

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.


paddle-bot bot commented Nov 5, 2025

Thanks for your contribution!

def accept_token(self, token: int) -> None:
    """
    Validate and accept a generated token against the grammar constraints.
    when accept eos_token, is_terminated = True
Collaborator

Where is eos_token checked here? And how is the case of over-length output handled?

Collaborator Author

After eos is accepted, the matcher's state becomes is_terminated and it is reset right below. Subsequent tokens are no longer format-constrained, so generation can also continue with ignore_eos enabled.
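A hedged sketch of what this reply describes, assuming self.matcher is an xgr.GrammarMatcher and self.is_terminated mirrors its state; the field names are assumptions, not the PR's actual code.

def accept_token(self, token: int) -> None:
    """Validate and accept a generated token against the grammar constraints."""
    self.matcher.accept_token(token)
    if self.matcher.is_terminated():
        # eos (or a grammar-terminating token) was accepted: flag termination and
        # reset the matcher, so later tokens (e.g. with ignore_eos) are unconstrained.
        self.is_terminated = True
        self.matcher.reset()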

logits = torch.from_numpy(logits.numpy())

logits = logits.float() # cpu
apply_token_bitmask_inplace(
Collaborator

This operator doesn't seem to have been verified on the other hardware backends? Not sure it can be used there.

Collaborator Author

This is a pure CPU operation. The line bitmask=token_bitmask.to(logits.device, non_blocking=True) is a bit misleading; the .to actually still targets the CPU.
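In other words, the non-CUDA fallback amounts to something like the following sketch (not the PR's exact code): the paddle logits go through numpy into a CPU torch tensor, and the .to(logits.device) call keeps the bitmask on the CPU.

import torch
import xgrammar as xgr

logits_t = torch.from_numpy(logits.numpy()).float()  # CPU torch tensor
xgr.apply_token_bitmask_inplace(
    logits_t,
    token_bitmask.to(logits_t.device, non_blocking=True),  # still the CPU
    indices=indices,
)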

if current_platform.is_cuda():
    dlpack = paddle.utils.dlpack.to_dlpack(logits)
    t_logits = torch.from_dlpack(dlpack)
    apply_token_bitmask_inplace(
Collaborator

This operator supports paddle.Tensor, doesn't it? Why convert to torch.Tensor?

Collaborator Author

This still uses the native xgr.apply_token_bitmask_inplace interface, which only supports torch.Tensor.
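So the CUDA branch bridges paddle to torch through DLPack instead; a sketch under that assumption, with illustrative variable names.

import paddle
import torch
import xgrammar as xgr

if current_platform.is_cuda():
    dlpack = paddle.utils.dlpack.to_dlpack(logits)  # export the paddle CUDA tensor
    t_logits = torch.from_dlpack(dlpack)            # zero-copy torch view of the same memory
    xgr.apply_token_bitmask_inplace(
        t_logits,
        token_bitmask.to(t_logits.device, non_blocking=True),
        indices=indices,
    )
    # No copy back is needed: logits and t_logits share storage, so the mask is
    # applied in place on the GPU without any numpy round trip.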

"""update vocab mask. (cpu-heavy operation)"""
if len(self.logits_processor) == 0:
"""add logits processor to SamplerProcessor"""
assert len(prefill_tokens) == 0
Collaborator

In the PD disaggregation scenario, isn't prefill_tokens non-empty?

Collaborator Author

The PD disaggregation scenario has not been verified yet. The assert here will fail in that case.

processor.fill_token_bitmask(self.token_bitmask, idx)

def apply_token_mask(self, logits: paddle.Tensor, skip_idx_list: List[int] = []):
def apply_token_mask(self, logits: paddle.Tensor, prefill_done_idxs: List[int] = []):
Collaborator

Has the asynchrony between decode steps not been added yet?

Collaborator

The multi-hardware scenarios haven't been verified yet. If support is to be added, prioritize XPU.

Collaborator Author

The interface changed, so this had to be updated in sync. The XPU CI passed.
