Add PagedAttention support (experimental, CUDA only) #17579
ericcurtin wants to merge 1 commit into ggml-org:master from
Conversation
Force-pushed 2a33486 to 14ad291 (compare)
Force-pushed 14ad291 to 06254d1 (compare)
Force-pushed 06254d1 to 1745418 (compare)
    const int token_idx = block_idx * BLOCK_SIZE + i;
    if (token_idx >= seq_len) break;

    // TODO: Vectorized K loading and Q·K computation
Some TODOs look quite sus; I'm wondering if the code is AI-generated and/or whether this function actually works.
Besides, you should probably give some credit to the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh
I mark it experimental for good reason 🙂
I think it's important to explicitly state whether you used AI to generate this PR. The numerous TODOs throughout the PR do make it look sus. There will be a human who spends real time and effort reviewing this PR, after all.
I mark it experimental for good reason 🙂
I think this PR should be marked as a draft until it is no longer experimental.
IMO this shouldn't be turned into a PR until it's reasonably complete. I subscribed because I'm interested in trying it if it works, but I've gotten 10 notifications today and it doesn't even pass CI, so why is it being pushed at all?
Besides, I don't think adding paged attention makes any difference in llama.cpp beyond having an additional feature with a cool-looking name.
Since your AI is not capable enough to explain this to you, @ericcurtin, I will:
This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging)
No, it does not. The order of blocks can also be fragmented. This notion is explained by this sentence in the documentation: "the KV cache does not need to be stored in contiguous memory". That crucial detail is left out, which makes the description technically wrong.
And we can definitely implement the notion of a "block" with the existing llama-kv-cache infrastructure. Just align the placement of KV vectors to a fixed size and voilà, you get "fixed-size blocks".
enables efficient memory sharing between sequences through copy-on-write semantics.
llama.cpp does have copy-on-write, just not automatic: llama_memory_seq_cp is there to allow sharing memory among multiple sequences. Of course, we can implement the automatic mechanism, just not right now.
Force-pushed 31d8188 to 08abefa (compare)
Force-pushed 08abefa to b6edf80 (compare)
Force-pushed 19466fb to efeba44 (compare)
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.
The implementation is experimental and disabled by default. Enable with the --pagedattention flag.
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Force-pushed efeba44 to de93b99 (compare)