
FP32 FlashAttention kernel has out-of-bounds K access (CUDA_ERROR_UNKNOWN 700) #32

@noahgift

Description


Summary

The standard FP32 FlashAttention kernel in trueno-gpu has an algorithmic bug that causes an illegal memory access (CUDA_ERROR_UNKNOWN, code 700) at runtime.

Root Cause

In trueno-gpu/src/kernels/attention.rs:316-322, the dot product loop incorrectly uses local_col as the K row index:

```rust
// BUG: local_col ranges 0 to head_dim-1 (e.g., 0-63),
// but the K tile only has tile_kv rows (e.g., 4 rows).
let k_col_offset = ctx.mul_u32_reg(local_col, head_dim_u32);
let k_elem_smem = ctx.add_u32_reg(k_col_offset, d_idx);
```

For a configuration with tile_kv = 4 and head_dim = 64, the index overflows the tile (sketched below):

  • Thread 255 has local_col = 63
  • K row offset = 63 × 64 = 4032 elements
  • The K tile only holds 4 × 64 = 256 elements
  • Result: out-of-bounds shared-memory access → CUDA_ERROR_UNKNOWN
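The same arithmetic can be checked on the host; a minimal sketch with the values above (variable names are illustrative, not the kernel's registers):

```rust
// Reproduces the offending index arithmetic with tile_kv = 4, head_dim = 64.
// Purely illustrative; runs on the CPU, not inside the kernel.
fn main() {
    let (tile_kv, head_dim) = (4u32, 64u32);
    let local_col = 63u32;               // column handled by thread 255
    let k_offset = local_col * head_dim; // 4032: misused as a K row offset
    let k_tile_len = tile_kv * head_dim; // 256 elements actually in shared memory
    assert!(k_offset >= k_tile_len);     // out of bounds even before adding d_idx
    println!("offset {k_offset} vs K tile size {k_tile_len}");
}
```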

Correct Algorithm

For FlashAttention, each thread computing O[row, col] should (see the sketch after this list):

  1. Iterate over ALL K rows in the KV tile (using a loop variable, not local_col)
  2. Compute attention score S[row, k_row] = Q[row, :] · K[k_row, :]
  3. Apply online softmax
  4. Accumulate V[k_row, col] weighted by softmax probability
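A minimal CPU-side Rust sketch of those four steps for one output element, assuming row-major Q/K/V tiles and a softmax scale factor; the names here are illustrative assumptions, not the kernel's actual buffers or registers:

```rust
// Reference computation of O[row, col] over a single KV tile using online softmax.
// Layouts, names, and `scale` are assumptions for illustration only.
fn attend_elem(
    q_row: &[f32],   // Q[row, :], length head_dim
    k_tile: &[f32],  // K tile, tile_kv x head_dim, row-major
    v_tile: &[f32],  // V tile, tile_kv x head_dim, row-major
    tile_kv: usize,
    head_dim: usize,
    col: usize,
    scale: f32,
) -> f32 {
    let mut m = f32::NEG_INFINITY; // running max of the scores
    let mut l = 0.0f32;            // running softmax denominator
    let mut acc = 0.0f32;          // running weighted-V accumulator

    // 1. Iterate over ALL K rows in the KV tile (a loop variable, not local_col).
    for k_row in 0..tile_kv {
        // 2. S[row, k_row] = Q[row, :] · K[k_row, :]
        let mut s = 0.0f32;
        for d in 0..head_dim {
            s += q_row[d] * k_tile[k_row * head_dim + d];
        }
        s *= scale;

        // 3. Online softmax: rescale the previous state when the running max changes.
        let m_new = m.max(s);
        let corr = (m - m_new).exp();
        let p = (s - m_new).exp();
        l = l * corr + p;

        // 4. Accumulate V[k_row, col] weighted by the (unnormalized) probability.
        acc = acc * corr + p * v_tile[k_row * head_dim + col];
        m = m_new;
    }

    acc / l // final normalization by the softmax denominator
}
```

Across multiple KV tiles, the running (m, l, acc) state carries over from tile to tile; that is what lets FlashAttention avoid materializing the full score matrix.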

Workaround

Use tensor_core_attention() instead of flash_attention_multi_head(). The WMMA-based Tensor Core kernel handles matrix multiplication correctly and is faster on RTX 20xx+ GPUs.

Affected Code

  • trueno-gpu/src/kernels/attention.rs lines 305-365 (standard FP32 attention)
  • realizar/src/cuda.rs flash_attention_multi_head() method

Validation

The generated PTX assembles cleanly with ptxas --gpu-name sm_89 but fails at runtime due to the illegal shared-memory access pattern.
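For reference, an invocation of the kind described above; the .ptx filename is a placeholder:

```sh
# Succeeds at assembly time even though the kernel later faults at runtime.
ptxas --gpu-name sm_89 -o /dev/null attention.ptx
```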


