Summary
The standard FP32 FlashAttention kernel in trueno-gpu has an algorithmic bug that causes an illegal memory access (reported as CUDA_ERROR_UNKNOWN, code 700) at runtime.
Root Cause
In trueno-gpu/src/kernels/attention.rs:316-322, the dot product loop incorrectly uses local_col as the K row index:
```rust
// BUG: local_col ranges 0 to head_dim-1 (e.g., 0-63)
// But the K tile only has tile_kv rows (e.g., 4 rows)
let k_col_offset = ctx.mul_u32_reg(local_col, head_dim_u32);
let k_elem_smem = ctx.add_u32_reg(k_col_offset, d_idx);
```

For a configuration with tile_kv=4, head_dim=64 (see the index-math sketch after this list):
- Thread 255 has local_col=63
- K row offset = 63 × 64 = 4032 elements
- K tile only has 4×64=256 elements
- Result: Out-of-bounds shared memory access → CUDA_ERROR_UNKNOWN
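A minimal sketch of that index math in plain Rust (not the ctx-based kernel-builder API); the variable names simply mirror the kernel's and the values match the tile_kv=4, head_dim=64 example above:

```rust
fn main() {
    let (tile_kv, head_dim) = (4usize, 64usize);
    let k_tile_len = tile_kv * head_dim; // K tile holds 4 * 64 = 256 elements

    let local_col = 63usize; // thread 255's column index
    let d_idx = 0usize;

    // What the buggy kernel computes: local_col used as the K row index.
    let k_elem_smem = local_col * head_dim + d_idx; // 63 * 64 = 4032

    assert!(k_elem_smem >= k_tile_len); // 4032 >= 256 -> out-of-bounds shared memory access
    println!("index {k_elem_smem} exceeds the {k_tile_len}-element K tile");
}
```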
Correct Algorithm
For FlashAttention, each thread computing O[row, col] should (see the sketch after this list):
- Iterate over ALL K rows in the KV tile (using a loop variable, not local_col)
- Compute attention score S[row, k_row] = Q[row, :] · K[k_row, :]
- Apply online softmax
- Accumulate V[k_row, col] weighted by softmax probability
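A minimal CPU-side sketch of that per-thread computation, written in plain Rust rather than the ctx-based kernel builder used in attention.rs; the function name and parameters (q_row, k_tile, v_tile, scale) are illustrative assumptions, not the crate's API:

```rust
/// One output element O[row, col], computed with online softmax over a KV tile.
fn attention_output_elem(
    q_row: &[f32],   // Q[row, :], length head_dim
    k_tile: &[f32],  // K tile, tile_kv x head_dim, row-major
    v_tile: &[f32],  // V tile, tile_kv x head_dim, row-major
    tile_kv: usize,
    head_dim: usize,
    col: usize,      // output column within head_dim
    scale: f32,      // typically 1 / sqrt(head_dim)
) -> f32 {
    let mut m = f32::NEG_INFINITY; // running max of scores
    let mut l = 0.0f32;            // running sum of exp(score - m)
    let mut acc = 0.0f32;          // running weighted sum of V[:, col]

    // Iterate over ALL K rows in the KV tile -- the loop the buggy kernel
    // replaces with local_col.
    for k_row in 0..tile_kv {
        // S[row, k_row] = Q[row, :] . K[k_row, :]
        let mut s = 0.0f32;
        for d in 0..head_dim {
            s += q_row[d] * k_tile[k_row * head_dim + d];
        }
        s *= scale;

        // Online softmax update: rescale previous accumulators to the new max.
        let m_new = m.max(s);
        let correction = (m - m_new).exp();
        let p = (s - m_new).exp();
        l = l * correction + p;
        acc = acc * correction + p * v_tile[k_row * head_dim + col];
        m = m_new;
    }

    acc / l
}
```

The key difference from the buggy code is that k_row is a loop variable bounded by tile_kv, so the shared-memory index k_row * head_dim + d never exceeds the tile.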
Workaround
Use tensor_core_attention() instead of flash_attention_multi_head(). The WMMA-based Tensor Core kernel handles matrix multiplication correctly and is faster on RTX 20xx+ GPUs.
Affected Code
- trueno-gpu/src/kernels/attention.rs, lines 305-365 (standard FP32 attention)
- realizar/src/cuda.rs, flash_attention_multi_head() method
Validation
The generated PTX validates with ptxas --gpu-name sm_89, but the kernel fails at runtime due to the illegal memory access pattern.
🤖 Generated with Claude Code