Modify sharktank KVCache and local Wave kernel copy for extend attention #2534
base: main
Conversation
Signed-off-by: aviator19941 <[email protected]>
Groverkss
left a comment
Putting a block on this for now. I see some things that need to be resolved before landing, but I need to do a more careful review first.
v_extend_shape,
k_cache_shape,
v_cache_shape,
kv_cache_1_shape,
Don't use kv_cache_#. This is ambiguous. It should be clear what each buffer / shape represents (e.g. the k or v component). It's especially important when invoking the Wave kernel.
o_layout = tkl.MemoryLayout(shape=set_dynamic_dim(o_shape))
k_cache_layout = tkl.MemoryLayout(shape=set_dynamic_dim(k_cache_shape))
v_cache_layout = tkl.MemoryLayout(shape=set_dynamic_dim(v_cache_shape))
kv_cache_1_layout = tkl.MemoryLayout(shape=set_dynamic_dim(kv_cache_1_shape))
Same with layout.
head_dim,
attn_dtype,
device,
):
When you are writing this much boilerplate for a test, it's a sign that helpers / cleanup should be introduced. The transposition, data setup, etc. inherently mean the test is too complex to describe clearly.
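One way to act on this review comment is a small setup helper that owns the randomized tensor creation. The sketch below is illustrative only: `make_qkv` and its parameter names are hypothetical, not part of sharktank.

```python
import torch

def make_qkv(batch, seq_len, heads, head_dim, dtype=torch.float32, seed=0):
    # Hypothetical helper (name assumed): centralizes the randomized tensor
    # setup so each test body only states what it actually exercises.
    gen = torch.Generator().manual_seed(seed)
    shape = (batch, seq_len, heads, head_dim)
    q = torch.randn(shape, generator=gen).to(dtype)
    k = torch.randn(shape, generator=gen).to(dtype)
    v = torch.randn(shape, generator=gen).to(dtype)
    return q, k, v

# Usage: the test asks for exactly the shapes it needs, reproducibly.
q, k, v = make_qkv(batch=1, seq_len=16, heads=8, head_dim=64)
```

Seeding through an explicit `torch.Generator` keeps the fixture deterministic without touching global RNG state.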
extend_attention=True,
)

cache = PagedGQAttention(
This is inverted: the subcomponent will be invoking the kernel, so the test should be using it externally.
# Loop through chunks simulating progressive prefill
all_outputs = []
for chunk_id in range(num_chunks):
This section again screams for a utility. We should try to clearly describe the setup / components and not rely on a large monolithic setup for a test.
write_page_ids = page_ids[
    :, start // block_seq_stride : end // block_seq_stride
]
cache_partitions = [k.cpu(), v.cpu()]
Is .cpu() required?
cache_partitions = [k.cpu(), v.cpu()]

# Write chunk to KV cache
cache.write(
This write updates the cache allocation with cache_partitions. Unless we are planning to verify the cache updates, this step is unnecessary.
)

# Combine outputs for completeness check
combined_output = torch.cat(all_outputs, dim=2)
Rename as extend_attention_output
target_len=seq_len,
attention_dtype=attn_dtype,
).to(device)
sdpa_ref = ops.scaled_dot_product_attention(q=q_sdpa, k=k_sdpa, v=v_sdpa, a=a)
The sdpa_ref and extend_attention_output calculations can be extracted into separate functions under sharktank/tests/utils/.
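As a sketch of the suggested extraction, the reference computation could become a shared helper. The name `sdpa_reference` and its placement are assumptions for illustration, not sharktank API; the body simply wraps PyTorch's built-in SDPA.

```python
import math
import torch

def sdpa_reference(q, k, v, mask=None):
    # Hypothetical helper (name and location assumed) that could live under
    # sharktank/tests/utils/ so every test builds its reference the same way
    # instead of inlining the call.
    return torch.nn.functional.scaled_dot_product_attention(
        q, k, v, attn_mask=mask
    )

# Usage sketch with tiny tensors laid out as [batch, heads, seq, head_dim].
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)
ref = sdpa_reference(q, k, v)
```

A shared helper also gives one place to pin down dtype/tolerance conventions for comparisons against kernel output.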
This PR modifies the sharktank KVCache by reordering block_seq_stride to before num_attn_heads, which allows flattening the first 4 dimensions of the sharktank KVCache to match the Wave KVCache layout. The Wave kernel also expects separate k_cache and v_cache buffers, which sharktank does not currently support, so the local sharktank copy of the Wave extend attention kernel is updated to use a single KVCache block with separate k_indices and v_indices, instead of separate k_cache and v_cache blocks with a single kv_indices. In the future, sharktank will need to split the KVCache into k_cache and v_cache. The PR also adds an extend attention test for a single request with 2 chunks that exercises the updated KVCache and updated kernel.
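The dimension reordering described above can be illustrated with a toy tensor. All dimension names and sizes below are assumptions for the sketch (the real sharktank cache shape may differ); the point is only that once block_seq_stride precedes num_attn_heads, the leading dimensions collapse into the single flat slot index a Wave-style layout expects.

```python
import torch

# Assumed sizes, for illustration only.
num_pages, num_layers, num_partitions = 4, 2, 2
num_attn_heads, block_seq_stride, head_dim = 8, 16, 64

# Hypothetical "old" ordering: heads before block_seq_stride.
cache = torch.zeros(
    num_pages, num_layers, num_partitions,
    num_attn_heads, block_seq_stride, head_dim,
)

# Reorder so block_seq_stride precedes num_attn_heads ...
reordered = cache.permute(0, 1, 2, 4, 3, 5)

# ... then the first 4 dimensions flatten into one slot index,
# leaving [num_slots, num_attn_heads, head_dim].
flat = reordered.reshape(-1, num_attn_heads, head_dim)
```

With heads trailing the slot index, per-token head data stays contiguous per slot, which is what makes the flattened view line up with a [slots, heads, head_dim] cache layout.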