Memcpy kernel for flash attention #29
The implementation is done but still needs testing (I will do it on Thursday). The memory-saving strategy is orthogonal to this kernel, so I will not include it in this PR.
Hey @suquark, thanks for the PR! I have a quick question: have you also measured the performance difference between the two kernels before and after the optimization?
See the PR comment for the performance comparison of the optimized kernel.
WoosukKwon
left a comment
LGTM.
* optimize
* add benchmark
* add assert
* add test
Memcpy kernel for flash attention
The performance is pretty good considering that the memory layout is not ideal (for reference, the theoretical peak memory bandwidth of an A100-40GB is about 1.6 TB/s).
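For readers who want to relate a measured copy time to that 1.6 TB/s figure, here is a minimal sketch of the arithmetic. It is not the benchmark from this PR: the function names are hypothetical, and the buffer size and elapsed time are made-up placeholders. It assumes a memcpy-style kernel reads and writes every byte once, so memory traffic is twice the buffer size.

```python
# Hypothetical helper names; the numbers below are placeholders, not measurements from this PR.

A100_40GB_PEAK_GBPS = 1600.0  # ~1.6 TB/s theoretical peak, as quoted in the description


def effective_bandwidth_gbps(bytes_moved: int, elapsed_s: float) -> float:
    """Return achieved memory throughput in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / elapsed_s / 1e9


def fraction_of_peak(bandwidth_gbps: float, peak_gbps: float = A100_40GB_PEAK_GBPS) -> float:
    """Return achieved throughput as a fraction of the theoretical peak."""
    return bandwidth_gbps / peak_gbps


# Assumed example: copying a 1 GiB buffer in 2 ms.
buffer_bytes = 1 << 30
traffic_bytes = 2 * buffer_bytes  # each byte is read once and written once
bandwidth = effective_bandwidth_gbps(traffic_bytes, elapsed_s=2e-3)
utilization = fraction_of_peak(bandwidth)

print(f"{bandwidth:.1f} GB/s, {utilization:.1%} of peak")
```

Substituting the real buffer size and a kernel time measured with CUDA events gives the utilization figure that the comparison against 1.6 TB/s refers to.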
Result for the unoptimized kernel:
The optimized kernel works much better for smaller numbers of tokens (about a 20% speedup).