
FP32 FlashAttention kernel has out-of-bounds K access (CUDA_ERROR_UNKNOWN 700) #32

@noahgift

Description


Summary

The standard FP32 FlashAttention kernel in trueno-gpu has an algorithmic bug that causes an illegal memory access (CUDA_ERROR_UNKNOWN, code 700) at runtime.

Root Cause

In trueno-gpu/src/kernels/attention.rs:316-322, the dot product loop incorrectly uses local_col as the K row index:

```rust
// BUG: local_col ranges 0 to head_dim-1 (e.g., 0-63),
// but the K tile only has tile_kv rows (e.g., 4 rows).
let k_col_offset = ctx.mul_u32_reg(local_col, head_dim_u32);
let k_elem_smem = ctx.add_u32_reg(k_col_offset, d_idx);
```

For a configuration with tile_kv = 4 and head_dim = 64, the index overflows the tile (sketched below):

  • Thread 255 has local_col = 63
  • K row offset = 63 × 64 = 4032 elements
  • The K tile only holds 4 × 64 = 256 elements
  • Result: out-of-bounds shared-memory access → CUDA_ERROR_UNKNOWN
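The same arithmetic can be checked on the host; a minimal sketch with the values above (variable names are illustrative, not the kernel's registers):

```rust
// Reproduces the offending index arithmetic with tile_kv = 4, head_dim = 64.
// Purely illustrative; runs on the CPU, not inside the kernel.
fn main() {
    let (tile_kv, head_dim) = (4u32, 64u32);
    let local_col = 63u32;               // column handled by thread 255
    let k_offset = local_col * head_dim; // 4032: misused as a K row offset
    let k_tile_len = tile_kv * head_dim; // 256 elements actually in shared memory
    assert!(k_offset >= k_tile_len);     // out of bounds even before adding d_idx
    println!("offset {k_offset} vs K tile size {k_tile_len}");
}
```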

Correct Algorithm

For FlashAttention, each thread computing O[row, col] should (see the sketch after this list):

  1. Iterate over ALL K rows in the KV tile (using a loop variable, not local_col)
  2. Compute attention score S[row, k_row] = Q[row, :] · K[k_row, :]
  3. Apply online softmax
  4. Accumulate V[k_row, col] weighted by softmax probability
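A minimal CPU-side Rust sketch of those four steps for one output element, assuming row-major Q/K/V tiles and a softmax scale factor; the names here are illustrative assumptions, not the kernel's actual buffers or registers:

```rust
// Reference computation of O[row, col] over a single KV tile using online softmax.
// Layouts, names, and `scale` are assumptions for illustration only.
fn attend_elem(
    q_row: &[f32],   // Q[row, :], length head_dim
    k_tile: &[f32],  // K tile, tile_kv x head_dim, row-major
    v_tile: &[f32],  // V tile, tile_kv x head_dim, row-major
    tile_kv: usize,
    head_dim: usize,
    col: usize,
    scale: f32,
) -> f32 {
    let mut m = f32::NEG_INFINITY; // running max of the scores
    let mut l = 0.0f32;            // running softmax denominator
    let mut acc = 0.0f32;          // running weighted-V accumulator

    // 1. Iterate over ALL K rows in the KV tile (a loop variable, not local_col).
    for k_row in 0..tile_kv {
        // 2. S[row, k_row] = Q[row, :] · K[k_row, :]
        let mut s = 0.0f32;
        for d in 0..head_dim {
            s += q_row[d] * k_tile[k_row * head_dim + d];
        }
        s *= scale;

        // 3. Online softmax: rescale the previous state when the running max changes.
        let m_new = m.max(s);
        let corr = (m - m_new).exp();
        let p = (s - m_new).exp();
        l = l * corr + p;

        // 4. Accumulate V[k_row, col] weighted by the (unnormalized) probability.
        acc = acc * corr + p * v_tile[k_row * head_dim + col];
        m = m_new;
    }

    acc / l // final normalization by the softmax denominator
}
```

Across multiple KV tiles, the running (m, l, acc) state carries over from tile to tile; that is what lets FlashAttention avoid materializing the full score matrix.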

Workaround

Use tensor_core_attention() instead of flash_attention_multi_head(). The WMMA-based Tensor Core kernel handles matrix multiplication correctly and is faster on RTX 20xx+ GPUs.

Affected Code

  • trueno-gpu/src/kernels/attention.rs lines 305-365 (standard FP32 attention)
  • realizar/src/cuda.rs flash_attention_multi_head() method

Validation

The generated PTX assembles cleanly with ptxas --gpu-name sm_89 but fails at runtime due to the illegal shared-memory access pattern.
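For reference, an invocation of the kind described above; the .ptx filename is a placeholder:

```sh
# Succeeds at assembly time even though the kernel later faults at runtime.
ptxas --gpu-name sm_89 -o /dev/null attention.ptx
```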


