[Kernel][Misc] Add meta functions for ops to prevent graph breaks #6917

bnellnm · 2024-07-29T21:19:24Z

Miscellaneous changes to support torch.compile in vLLM.

Add meta functions for various ops to prevent torch.compile graph breaks.
Use string schemas for all dispatched op registrations. The function pointer API is now used only for ops that are registered for all keys or that do not take Tensor arguments.
Fix some type mismatches in the quantization code.
In the aqlm kernel/code, change codebook_partition_sizes into a list instead of a Tensor since it is allocated on the CPU.
Bump up dynamo cache limits due to the amount of recompilation in cuda graph warmup.
Add torch.library.opcheck tests for ops that had "unit" tests and are opcheck-able.

Note: opcheck does not seem to work with torch.float8_e4m3fn. It complains that mul_cuda is not supported for that type.

github-actions · 2024-07-29T21:19:39Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

bnellnm · 2024-08-06T03:09:21Z

/ready
/torch.compile

youkaichao · 2024-08-06T22:43:56Z

vllm/worker/model_runner.py

can you explain the rationale here?

There's a different set of failures when fullgraph=True. Not just graph breaks but actual dynamo errors. Locally, I am switching between full/not-full constantly to test things.

can you send me the specific case where fullgraph fails?

There's probably more but these all fail with fullgraph=True and pass with fullgraph=False

FAILED tests/models/test_gguf.py::test_models[1-5-32-half-model0] - torch._dynamo.exc.Unsupported: hasattr TensorVariable shard_size
FAILED tests/models/test_gguf.py::test_models[1-5-32-half-model1] - torch._dynamo.exc.InternalTorchDynamoError: 'SymNodeVariable' object has no attribute 'value'
FAILED tests/models/test_gguf.py::test_models[1-5-32-half-model2] - torch._dynamo.exc.Unsupported: hasattr TensorVariable shard_size
FAILED tests/models/test_gguf.py::test_models[1-5-32-half-model3] - torch._dynamo.exc.Unsupported: hasattr TensorVariable shard_size
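
The difference between the two modes can be illustrated with a small sketch. It does not reproduce the GGUF failures listed above; it uses `torch._dynamo.graph_break()` as a stand-in for any code dynamo cannot trace, and assumes a PyTorch 2.x install with the eager backend.

```python
import torch
import torch._dynamo

def step(x: torch.Tensor) -> torch.Tensor:
    y = x + 1
    torch._dynamo.graph_break()  # stand-in for code dynamo cannot trace
    return y * 2

x = torch.ones(3)

# fullgraph=True: the graph break is reported as an error instead of
# being handled silently.
strict = torch.compile(step, backend="eager", fullgraph=True)
try:
    strict(x)
    strict_error = None
except Exception as exc:  # typically torch._dynamo.exc.Unsupported
    strict_error = type(exc).__name__

# Clear dynamo's caches so the two configurations do not share state.
torch._dynamo.reset()

# fullgraph=False: dynamo splits the function around the break and still runs.
lenient = torch.compile(step, backend="eager", fullgraph=False)
result = lenient(x)
```

This is why switching between the two settings during development is useful: the lenient mode shows whether the model still produces correct results, while the strict mode surfaces every remaining break.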

youkaichao · 2024-08-07T00:16:36Z

vllm/attention/backends/flash_attn.py

I believe torch.library.define is generally not necessary. We should use the high-level APIs listed in https://pytorch.org/docs/main/library.html.

I tried using custom_op here, but it couldn't deal with the window_size tuple, so I ended up sticking with define.

I think the longer term answer is to move the registration into the flash attn module itself once it is pulled into the main vllm build process.

youkaichao · 2024-09-04T22:54:12Z

vllm/worker/model_runner.py

I'm ok with this for now, but after I port the technique to skip dynamo guard evaluation overhead, this should not be necessary anymore.

youkaichao · 2024-09-04T22:54:37Z

vllm/model_executor/models/jamba.py

why do you need this?

PyTorch had a problem with the set() here. It was one of the things fixed in the nightly version but not in 2.4. Using set didn't seem particularly useful anyway, so I left it as a range.

vllm/distributed/parallel_state.py

youkaichao

thanks for the pr and sorry for keeping you waiting for so long!

the registration for the quantization kernels makes sense to me. but I don't understand the rest of the changes.

youkaichao

thanks for the great pr!

we can further discuss if we need to fix the graph breaks ourselves or pytorch team will fix them.

Co-authored-by: Sage Moore <[email protected]>

gshtras · 2024-10-04T20:10:59Z

vllm/_custom_ops.py

+try:
+    torch.ops._C.gptq_marlin_24_gemm  # noqa B018
+
+    @torch.library.register_fake("_C::gptq_marlin_24_gemm")


This again breaks compatibility with torch < 2.4

bnellnm changed the title ~~Add meta functions for ops to prevent graph breaks~~ [Kernel][Misc] Add meta functions for ops to prevent graph breaks Jul 29, 2024

bnellnm force-pushed the fix-graph-breaks branch 2 times, most recently from 3713ec3 to 63c42c7 Compare August 5, 2024 22:03

bnellnm marked this pull request as ready for review August 6, 2024 03:09

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 6, 2024

bnellnm force-pushed the fix-graph-breaks branch 2 times, most recently from 97a11f1 to f015312 Compare August 6, 2024 20:15

youkaichao self-assigned this Aug 6, 2024

youkaichao reviewed Aug 6, 2024

youkaichao reviewed Aug 7, 2024

mgoin added the torch.compile label Aug 7, 2024

bnellnm force-pushed the fix-graph-breaks branch 3 times, most recently from f4f903f to 4307e87 Compare August 13, 2024 20:51

bnellnm force-pushed the fix-graph-breaks branch from f326f49 to 2961f59 Compare August 16, 2024 13:01

This was referenced Aug 16, 2024

[Kernel] register punica functions as torch ops #7591

Merged

[Kernel][Misc] dynamo support for ScalarType #7594

Merged

[Kernel] fix types used in aqlm and ggml kernels to support dynamo #7596

Merged

bnellnm force-pushed the fix-graph-breaks branch 3 times, most recently from 7ab9b00 to 0168f9e Compare August 20, 2024 20:15

bnellnm mentioned this pull request Aug 23, 2024

[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM #7651

Merged

bnellnm force-pushed the fix-graph-breaks branch from 0168f9e to c93c029 Compare August 29, 2024 16:47

youkaichao reviewed Sep 4, 2024

vllm/distributed/parallel_state.py (outdated)

youkaichao reviewed Sep 4, 2024

bnellnm force-pushed the fix-graph-breaks branch from 01622d5 to c04d24e Compare September 5, 2024 15:55

SageMoore and others added 15 commits September 9, 2024 22:04

add clones to all_reduce

167652d

fix format

1cb8184

fix aqlm custom op type annotations

771daa4

fix gptq custom op registration

bff7d64

add dynamo support for ScalarType

d805265

add some pointers to PT2 custom class docs

c1184fd

tweaks

a5a8489

fix merge

49953f2

fix cpu schemas

950be6a

fix merge

ec0f252

rebase + add meta functions for machete kernels

4699534

tweak tests + custom ar bindings

fd6b7c9

rebase + fix schema for selective_scan_fwd, add meta functions for new mambda ops

6cecf82

modify copy_blocks opcheck test

3791beb

remove some custom ar changes

12c845a

bnellnm force-pushed the fix-graph-breaks branch from c04d24e to 12c845a Compare September 9, 2024 22:07

youkaichao approved these changes Sep 11, 2024

youkaichao merged commit 73202db into vllm-project:main Sep 11, 2024

youkaichao deleted the fix-graph-breaks branch September 11, 2024 19:52

dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 12, 2024

[Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917)

fcfa02e

Co-authored-by: Sage Moore <[email protected]>

akondrat-amd mentioned this pull request Sep 13, 2024

[CI/Build] Making xformers import conditional, cannot use them on ROCM #8433

Closed

gshtras reviewed Oct 4, 2024

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917)

30b5d43

Co-authored-by: Sage Moore <[email protected]> Signed-off-by: Alvant <[email protected]>

garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024

[Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917)

e93c53d

Co-authored-by: Sage Moore <[email protected]> Signed-off-by: Amit Garg <[email protected]>

LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

[Kernel][Misc] register ops to prevent graph breaks (vllm-project#6917)

0573093

Co-authored-by: Sage Moore <[email protected]> Signed-off-by: LeiWang1999 <[email protected]>
