[torch.compile] reorganize the cache directory to support compiling multiple models #19064
Conversation
houseroad left a comment:
Overall, the idea is pretty neat. Left two more comments.
Also, please ensure the cache can still be loaded appropriately. :-)
```python
def initialize_cache(self,
                     cache_dir: str,
                     disable_cache: bool = False,
                     prefix: str = ""):
```
Since `prefix` is only used to calculate the `base_cache_dir`, why not pass in the `base_cache_dir` instead of passing in `prefix`?
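A minimal sketch of the two options being discussed (the method bodies and the derivation of `base_cache_dir` are assumptions for illustration; only the first signature comes from the diff above):

```python
class CompilerInterface:
    # Approach in the PR: the caller passes a prefix, and the method
    # derives the shared base directory from it (derivation assumed).
    def initialize_cache(self,
                         cache_dir: str,
                         disable_cache: bool = False,
                         prefix: str = ""):
        base_cache_dir = cache_dir[:-len(prefix)] if prefix else cache_dir
        ...

    # Suggested alternative (hypothetical name): the caller computes
    # base_cache_dir and passes it in directly, making the
    # relationship explicit instead of re-deriving it here.
    def initialize_cache_alt(self,
                             cache_dir: str,
                             base_cache_dir: str,
                             disable_cache: bool = False):
        ...
```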
```python
# before
def set_current_vllm_config(vllm_config: VllmConfig, check_compile=False):

# after
def set_current_vllm_config(vllm_config: VllmConfig,
                            check_compile=False,
                            prefix: Optional[str] = None):
```
Could you add a brief comment explaining the meaning of `prefix`?
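One possible wording for such a comment (a suggestion, not text from the PR):

```python
from typing import Optional

from vllm.config import VllmConfig


def set_current_vllm_config(vllm_config: VllmConfig,
                            check_compile=False,
                            prefix: Optional[str] = None):
    """Temporarily set the global vLLM config.

    Args:
        prefix: name of the model component being compiled (e.g.
            "backbone" or "eagle_head"); it selects the per-component
            subdirectory of the compilation cache. None means no
            component-specific subdirectory is used.
    """
    ...
```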
```python
from vllm.compilation.backends import set_model_tag

with set_model_tag("eagle_head"):
    self.model = get_model(vllm_config=self.vllm_config,
                           model_config=draft_model_config)
```
nit: Could we name this something like `set_compile_region` or `set_model_component` (see the other comment)? That would make it clearer that this is 1:1 with a fullgraph torch.compile region.
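For reference, a helper like `set_model_tag` is naturally a context manager; a minimal sketch of the idea (implementation details assumed, not taken from the PR):

```python
import contextlib

_current_model_tag = "backbone"  # assumed default tag


@contextlib.contextmanager
def set_model_tag(tag: str):
    """Tag the model component being compiled so that its compilation
    artifacts are stored under a dedicated cache subdirectory."""
    global _current_model_tag
    old_tag, _current_model_tag = _current_model_tag, tag
    try:
        yield
    finally:
        _current_model_tag = old_tag
```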
```python
def initialize_cache(self,
                     cache_dir: str,
                     disable_cache: bool = False,
                     prefix: str = ""):
```
nit: This is technically a subdirectory (or a suffix of the path), not a prefix. I was prototyping something like this locally and I called it the "model_component", but up to you.
zou3519 left a comment:
thank you!
Hi @youkaichao, I wonder if we can simplify the `configure_post_pass(self)` method here? I had to make some edits to make things work here, but maybe they are not necessary anymore? Thanks!
Why did you decide to keep `inductor_cache` and `triton_cache` at the upper level rather than inside `backbone` and `eagle_head`? The current organization introduces additional …
May we pass …
#17211 explores the possibility of compiling multiple models in the same process, i.e. both the main model and the eagle head model. However, it does this by extending the compilation cache directory in a tricky way. In addition, that integration can be problematic when we compile for specific shapes: the env vars are set multiple times, with the later value overriding the previous one:
vllm/vllm/compilation/compiler_interface.py, line 254 (at d32aa2e)
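To illustrate the failure mode described above (the paths are made up for the example):

```python
import os

# First compilation (main model) points the cache env var at its own
# directory ...
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/cache/main_model/inductor"

# ... then the second compilation (eagle head) sets the same env var
# again, silently overriding the first value, so both models end up
# reading and writing the same cache directory.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/cache/eagle_head/inductor"
```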
This PR re-organizes the cache directory structure, so that the same vLLM instance will use the same `TORCHINDUCTOR_CACHE_DIR` and `TRITON_CACHE_DIR`, but different storage for `vllm_compile_cache.py` etc. It also reads the `prefix` automatically, which I think will be helpful for future vision encoder compilation.

The current structure after running `examples/offline_inference/eagle.py`:
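The original listing is not reproduced here; below is an illustrative layout consistent with the description above (the hash and rank directory names are hypothetical):

```text
~/.cache/vllm/torch_compile_cache/<hash>/rank_0_0/
├── backbone/        # per-model artifacts (vllm_compile_cache.py, ...)
├── eagle_head/      # per-model artifacts for the draft model
├── inductor_cache/  # shared TORCHINDUCTOR_CACHE_DIR
└── triton_cache/    # shared TRITON_CACHE_DIR
```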