[Optimization] xgrammar async compile, multi thread, speed up #4835
Conversation
Thanks for your contribution!
```python
def accept_token(self, token: int) -> None:
    """
    Validate and accept a generated token against the grammar constraints.
    When the eos_token is accepted, is_terminated becomes True.
    """
```
Where is the eos_token checked here? And how is the over-length output case handled?
After eos is accepted, the matcher's state is is_terminated, and it gets reset right below. Subsequent output tokens are no longer format-constrained. With ignore_eos enabled, generation can also continue.
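A minimal sketch of that flow, assuming xgrammar's GrammarMatcher API (`accept_token`, `is_terminated`, `reset`); the wrapper function and its name are illustrative, not the PR's exact code:

```python
import xgrammar as xgr

def accept_and_maybe_reset(matcher: xgr.GrammarMatcher, token: int) -> None:
    # Validate the token against the grammar; accepting the eos token
    # drives the matcher into its terminated state.
    matcher.accept_token(token)
    if matcher.is_terminated():
        # Once terminated, reset the matcher so later tokens (e.g. with
        # ignore_eos enabled) are no longer format-constrained.
        matcher.reset()
```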
```python
logits = torch.from_numpy(logits.numpy())
logits = logits.float()  # cpu
apply_token_bitmask_inplace(
```
This operator doesn't seem to have been verified on multiple hardware backends? Not sure it can be used.
This is a pure CPU operation. `bitmask=token_bitmask.to(logits.device, non_blocking=True)` is a bit misleading: the `to` actually still targets the CPU.
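A sketch of that CPU path under the same reading (the function name and parameters here are illustrative): the paddle logits are copied to a CPU torch tensor via numpy, so the `.to(logits.device)` move is effectively a no-op.

```python
import torch
import xgrammar as xgr

def mask_logits_cpu(paddle_logits, token_bitmask: torch.Tensor) -> torch.Tensor:
    # The numpy round-trip lands the logits on CPU as a torch tensor.
    logits = torch.from_numpy(paddle_logits.numpy()).float()
    xgr.apply_token_bitmask_inplace(
        logits,
        # logits.device is CPU here, so this "to" stays on CPU.
        token_bitmask.to(logits.device, non_blocking=True),
    )
    return logits
```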
```python
if current_platform.is_cuda():
    dlpack = paddle.utils.dlpack.to_dlpack(logits)
    t_logits = torch.from_dlpack(dlpack)
    apply_token_bitmask_inplace(
```
This operator supports paddle.Tensor, doesn't it? Why still convert to torch.Tensor?
This still uses the native xgr.apply_token_bitmask_inplace interface, which only supports torch.Tensor.
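For reference, a self-contained sketch of the CUDA path (the wrapper function is illustrative): DLPack wraps the paddle buffer without copying, so the in-place masking on the torch view also updates the paddle tensor.

```python
import paddle
import torch
import xgrammar as xgr

def mask_logits_cuda(logits: paddle.Tensor, token_bitmask: torch.Tensor) -> None:
    # Wrap the CUDA buffer as a torch tensor; no data copy is made.
    dlpack = paddle.utils.dlpack.to_dlpack(logits)
    t_logits = torch.from_dlpack(dlpack)
    xgr.apply_token_bitmask_inplace(
        t_logits,
        token_bitmask.to(t_logits.device, non_blocking=True),
    )
    # Masking happened in place through the shared buffer.
```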
| """update vocab mask. (cpu-heavy operation)""" | ||
| if len(self.logits_processor) == 0: | ||
| """add logits processor to SamplerProcessor""" | ||
| assert len(prefill_tokens) == 0 |
In the PD disaggregation scenario, is prefill_tokens non-empty?
The PD disaggregation scenario hasn't been verified yet; this assert would fail there.
```diff
  processor.fill_token_bitmask(self.token_bitmask, idx)

- def apply_token_mask(self, logits: paddle.Tensor, skip_idx_list: List[int] = []):
+ def apply_token_mask(self, logits: paddle.Tensor, prefill_done_idxs: List[int] = []):
```
Is the asynchrony between decode steps still missing?
The multi-hardware scenario hasn't been verified yet. If support is added, prioritize XPU.
Since the interface changed, this had to be updated in sync. The XPU CI passed.
Motivation
Optimize the integration of xgrammar by introducing asynchronous compilation and native caching. Improve efficiency on CUDA platforms with in-place operations and DLPack interconversion, and remove redundant backend caching logic.
The PD disaggregation scenario has not been verified yet; this PR will continue to be updated.
Modifications
- Refactored xgrammar to use asynchronous `compile` and implemented native caching. Removed caching from the backend to avoid duplication.
- Triggered xgrammar compilation during the prefill stage, and joined the compile result before sampling the first token in decode (see the sketch after this list).
- For CUDA platforms: convert between `paddle.Tensor` and `torch.Tensor` via DLPack (zero-copy) and apply the mask with `xgr.apply_token_bitmask_inplace`. Other platforms retain the existing logic.
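A minimal sketch of the async-compile flow described above, assuming xgrammar's GrammarCompiler/GrammarMatcher APIs; the executor wrapper and function names are illustrative, not the PR's actual code:

```python
from concurrent.futures import Future, ThreadPoolExecutor

import xgrammar as xgr

_executor = ThreadPoolExecutor(max_workers=4)

def submit_compile(compiler: xgr.GrammarCompiler, schema: str) -> Future:
    # Kicked off during the prefill stage so compilation runs off the
    # critical path; GrammarCompiler also caches compiled grammars natively.
    return _executor.submit(compiler.compile_json_schema, schema)

def join_matcher(pending: Future) -> xgr.GrammarMatcher:
    # Joined right before sampling the first decode token.
    compiled = pending.result()
    return xgr.GrammarMatcher(compiled)
```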
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- [ ] Format your code; run `pre-commit` before commit.
- [ ] If submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.