This repository bundles the Flash-Attention-3 forward-only kernel and the tooling required to build a lightweight Python wheel. It is intended for inference scenarios where backward operators and optional features are unnecessary.
- Ships only the Flash-Attention-3 forward path while disabling backward kernels, local attention, paged KV cache, FP16 kernels, and other extras to minimize the wheel size.
- Applies a patch that renames the public interface to `fa3_fwd_interface`, making the forward kernel easy to import from Python.
- Python: 3.9 or later
- PyTorch: 2.10
- Build dependencies: `ninja`, `packaging`, `wheel`
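
A quick way to confirm the environment before building is a short check like the one below (a sketch, not part of the build tooling; it only restates the requirements above):

```python
import sys

import torch

# Confirm the interpreter and PyTorch build match the requirements listed above.
assert sys.version_info >= (3, 9), f"Python 3.9+ required, found {sys.version_info}"
print("PyTorch:", torch.__version__)

# The forward-only kernel targets CUDA GPUs, so a CUDA-enabled PyTorch build is expected.
print("CUDA available:", torch.cuda.is_available())
print("torch.version.cuda:", torch.version.cuda)
```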
- Clone the repository and initialize submodules:

  ```bash
  git clone --recursive <repo-url>
  cd fa3-fwd
  # If --recursive was omitted during clone, run:
  git submodule update --init --recursive
  ```
- Create a Python virtual environment and install dependencies:

  ```bash
  uv venv --python 3.12 --seed
  source .venv/bin/activate
  uv pip install -r requirements.txt
  ```
- Build the forward-only wheel:

  ```bash
  bash build_fa3.sh
  ```
  The script:

  - Sources `set_compile_env.sh` to compute `MAX_JOBS` and `NVCC_THREADS`
  - Applies the custom patch and interface rename inside the Flash-Attention submodule
  - Runs `python setup.py bdist_wheel` under `flash-attention/hopper`
- Install the generated wheel (example):

  ```bash
  pip install build/*.whl
  ```
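
Once the wheel is installed, a minimal import check (a sketch; it assumes only the `fa3_fwd_interface` module name introduced by the patch) can confirm that the package is usable:

```python
# Post-install sanity check: the patched wheel should expose the forward entry point.
import fa3_fwd_interface

print("Loaded from:", fa3_fwd_interface.__file__)
print("Exposes flash_attn_func:", hasattr(fa3_fwd_interface, "flash_attn_func"))
```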
```python
import torch
from fa3_fwd_interface import flash_attn_func

# Inputs must already live on CUDA and satisfy Flash-Attention-3 constraints
out = flash_attn_func(q, k, v, causal=True)
```

This package exposes only the forward kernel. For backward support or additional features, depend on the upstream Flash-Attention project instead.
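
For a self-contained smoke test, a sketch along these lines can be used. The (batch, seqlen, nheads, headdim) layout follows the upstream Flash-Attention convention, and bfloat16 is chosen here because FP16 kernels are disabled in this build; adjust shapes and dtype to whatever constraints your kernel configuration enforces.

```python
import torch

from fa3_fwd_interface import flash_attn_func

# Assumed layout: (batch, seqlen, nheads, headdim), following upstream conventions.
# bfloat16 is used because this build disables the FP16 kernels.
batch, seqlen, nheads, headdim = 2, 1024, 8, 128
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Forward-only kernel: wrap the call in inference_mode since no backward is available.
with torch.inference_mode():
    out = flash_attn_func(q, k, v, causal=True)

print("forward pass completed")
```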
- Out-of-memory during compilation: The build script already throttles concurrency, but you can enforce `MAX_JOBS=1 NVCC_THREADS=1` before running `bash build_fa3.sh`.
- CUDA mismatch errors: Confirm that `nvcc --version` aligns with `torch.version.cuda`.
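
A quick way to compare the two, sketched below, is to print PyTorch's CUDA version next to the local `nvcc` release (this assumes `nvcc` is on `PATH`):

```python
import subprocess

import torch

# CUDA version that PyTorch was built against.
print("torch.version.cuda:", torch.version.cuda)

# CUDA version of the local nvcc that compiles the extension.
result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
release_lines = [line for line in result.stdout.splitlines() if "release" in line]
print("nvcc:", release_lines[0].strip() if release_lines else result.stdout.strip())
```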
- `build_fa3.sh`: Main build entry point
- `set_compile_env.sh`: Resource-based compiler configuration helper
- `hopper_setup_py.patch`: Patch applied to the upstream `setup.py`
- `flash-attention`: Upstream Flash-Attention submodule
Customize further by editing the environment variables in the build script or by modifying the submodule before the patch is applied (for example, to re-enable additional data types or kernels).