
Conversation


@anko-intel commented Jan 24, 2025

Adapt the dynamo cache size setting to the current behavior, where the number of graphs compiled in one forward pass equals:
number of LlamaDecoderLayer instances + 2 (RMSNorm, VocabParallelEmbedding)

This is an alternative approach to a hot fix for the performance regression introduced by vllm-project#11967.
The hot fix #709 restores the previous performance results for torch.compile mode.
This one partially recovers throughput, but at a large cost in warmup time, since many more graphs are compiled during warmup.
Without increasing the cache size, torch.compile hits its recompile limit and falls back to eager mode, which gives low throughput.
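
For reference, a minimal sketch of the idea described above (the function name is hypothetical and the patch's actual wiring may differ; `torch._dynamo.config.cache_size_limit` is the real Dynamo knob):

```python
import torch
import torch._dynamo


def raise_dynamo_cache_limit(num_decoder_layers: int) -> None:
    # One graph per LlamaDecoderLayer, plus two more for RMSNorm and
    # VocabParallelEmbedding, all compiled within a single forward pass.
    graphs_per_forward = num_decoder_layers + 2
    # Raise Dynamo's cache limit so torch.compile does not silently
    # fall back to eager mode once the default cache size is exhausted.
    torch._dynamo.config.cache_size_limit = max(
        torch._dynamo.config.cache_size_limit,
        graphs_per_forward,
    )
```

The trade-off noted above follows directly: a larger cache limit lets every graph stay compiled (recovering throughput), but compiling that many graphs up front is what inflates the warmup time.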

@anko-intel changed the title from "Set dynamo cache size for torch compile" to "Adopt dynamo cache size to current layer definition" on Jan 24, 2025
@anko-intel (Author) commented

After #753, this change is not needed in this version.

@anko-intel closed this on Feb 4, 2025