188 changes: 129 additions & 59 deletions README.md
# KernelBench: Can LLMs Write Efficient GPU Kernels? [ICML '25]
A benchmark for evaluating LLMs' ability to generate efficient GPU kernels

[arXiv](https://arxiv.org/html/2502.10517v1) | [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbench/) | [HuggingFace Dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench)

<img src="./assets/figures/KernelBenchMascot.png" width="200">
## Versions
The latest stable version lives on the `main` branch; we continue to update and improve the repo.
- [v0.1](https://github.com/ScalingIntelligence/KernelBench/tree/v0.1) - Latest version (also on `main`); see the [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/)
- [v0](https://github.com/ScalingIntelligence/KernelBench/tree/v0) - Original Release


The Huggingface [dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench) is updated to v0.1.

This repo provides core functionality for KernelBench and an easy-to-use set of scripts for evaluation. It is not intended to provide complex agentic scaffolds that solve this task; we recommend cloning and modifying this repo for your experiment, or using it as a git submodule.

## 👋 Task Description
We task the LLM with transpiling operators described in PyTorch into CUDA kernels, at whatever level of granularity it chooses.
We construct KernelBench with 4 levels of categories:
- **Level 4 🤗**: Level Hugging Face
Optimize whole model architectures from HuggingFace

We are actively extending KernelBench to other DSLs beyond `cuda` as well.

## ⚖️ Evaluation
#### Methodology
To evaluate model-generated kernels, we check that each kernel:
- **is correct ✅**: checked against reference torch operators `n_correctness` times on randomized inputs.
- **is performant ⏱️**: timed against reference torch operators over `n_trial` runs to measure speedup.
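The correctness check above can be sketched in plain NumPy, standing in for the torch operators (all names here are illustrative, not the repo's actual API):

```python
import numpy as np

def check_correct(ref_fn, cand_fn, make_input, n_correctness=5, atol=1e-4, rtol=1e-3):
    """Candidate passes only if it matches the reference on every randomized trial."""
    for seed in range(n_correctness):
        rng = np.random.default_rng(seed)
        x = make_input(rng)
        if not np.allclose(cand_fn(x), ref_fn(x), atol=atol, rtol=rtol):
            return False
    return True

# Toy example: one "kernel" matches ReLU exactly, one does not
relu = lambda x: np.maximum(x, 0.0)
good = lambda x: np.where(x > 0, x, 0.0)
bad = lambda x: 0.5 * x
make_input = lambda rng: rng.standard_normal((4, 4))

print(check_correct(relu, good, make_input))  # True
print(check_correct(relu, bad, make_input))   # False
```

The real harness in `src/eval.py` performs the analogous comparison with torch operators on GPU tensors.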

Check out `src/eval.py` for details on how we implement the correctness check and timing, and `EVAL.md` for evaluation and benchmarking guidelines.

We provide a convenient script, `scripts/run_and_check.py`, to evaluate a single generated kernel against a reference source: it checks correctness and computes speedup. You can run the evaluation locally or remotely by setting `eval_mode=local` or `eval_mode=modal`.

#### Overall Benchmark Metric
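As defined in the KernelBench paper, the headline metric `fast_p` is the fraction of tasks whose generated kernel is both correct and achieves a speedup greater than a threshold `p` over the baseline; a minimal sketch (the function signature and data layout are illustrative):

```python
def fast_p(results, p=1.0):
    """Fraction of tasks that are correct AND faster than p x baseline.

    `results` holds one (correct, speedup) pair per task, where
    speedup = baseline_time / kernel_time.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# 4 tasks: 3 correct, 2 of those beat the baseline at p = 1.0
print(fast_p([(True, 1.5), (True, 2.0), (True, 0.8), (False, 3.0)]))  # 0.5
```

Note that `p=0` reduces to the plain correctness rate, and raising `p` demands progressively larger speedups.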

We organize the repo into the following structure:
```
KernelBench/
├── assets/
├── KernelBench/ # Benchmark dataset files
├── src/ # KernelBench logic code
│ ├── unit_tests/
│ ├── prompts/
│ ├── ....
├── scripts/ # helpful scripts to run the benchmark
├── results/ # baseline times across hardware
├── runs/ # where your runs will be stored
├── notebooks/ # example notebooks for analysis
├── pyproject.toml # Project configuration and dependencies
```

## 🔧 Set up
```
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .
```

### GPU Setup
Running and profiling kernels require a GPU.
If you don't have a GPU available locally, you can set up [Modal](https://modal.com/) for cloud serverless GPU evaluation. After creating an account, set up your Modal token by running `modal token new`, then use the `generate_and_eval_single_sample_modal.py` script.

#### NVIDIA (CUDA)
- Use default backend `cuda` (recommended).
- Ensure a CUDA-enabled PyTorch install.

#### AMD ROCm (Radeon / MI-Series)
KernelBench can run on AMD GPUs via ROCm (HIP) using the same PyTorch `torch.cuda` API.

1) Install ROCm-enabled PyTorch (pick the correct ROCm version for your system):
```
# Example (adjust ROCm version as needed)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/rocm6.4
```

2) Verify GPU visibility:
```
python - <<'PY'
import torch
print("HIP:", torch.version.hip)
print("GPU:", torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
PY
```

3) Optional: select specific GPU(s)
```
export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0
```

We use `litellm` for API calls. Please set your keys by creating a `.env` file following our `.env.example`.
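For example, a minimal `.env` might look like this (the key names shown are illustrative; `.env.example` is the canonical reference):

```
# .env (do not commit this file)
OPENAI_API_KEY=your_api_key_here
ANTHROPIC_API_KEY=your_api_key_here
```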

> Note: For AMD, use `backend=triton` or `backend=helion` where applicable. CUDA backend is NVIDIA-only.

You can also try out our [tutorial notebook](https://bit.ly/kernelbench-neurips-colab) (also in notebooks/tutorial.ipynb) with Google Colab.
##### AMD ROCm Tips
- **What works**: AMD hardware-aware prompts, Triton backend generation, and ROCm-friendly timing.
- **What does not (by default)**: CUDA backend evaluation on ROCm is blocked to avoid CUDA-only compile paths.
- **Troubleshooting**: Ensure Triton is ROCm-enabled and PyTorch is a ROCm build.

To call LLM API providers, set the provider API key in your environment:
```
export OPENAI_API_KEY="your_api_key_here"
```

## 🚀 Usage
### Run on a single problem
It is easiest to get started with a single problem. This will fetch the problem, generate a sample, and evaluate it.

```
# Example: run level 2 problem 40 from Hugging Face
python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=2 problem_id=40

# dataset_src could be "local" or "huggingface"
# add .verbose_logging for more visibility
```

**What you might need to modify**
* **`gpu_arch`** - Depending on your GPU, you may need to adjust the `gpu_arch` argument to reflect your hardware.
* **`precision`** - You can specify tensor precision with, e.g., `precision=fp32`. All reported results are currently `fp32`, but we also support `fp16` and `bf16`.
* **`backend`** - We also support GPU DSLs beyond `cuda`: set `backend=triton`, `backend=cute`, or `backend=helion` as needed (ThunderKittens requires extra setup; see below).

#### AMD ROCm Example Commands
Use `backend=triton` (recommended) or `backend=helion` on AMD GPUs:
```
# Triton on AMD ROCm (single problem)
python3 scripts/generate_and_eval_single_sample.py \
dataset_src="huggingface" level=2 problem_id=40 \
backend=triton

# Helion on AMD ROCm (single problem; still in progress)
python3 scripts/generate_and_eval_single_sample.py \
dataset_src="huggingface" level=2 problem_id=40 \
backend=helion
```

Note on setting up ThunderKittens (TK) locally: to use `backend=thunderkittens`, clone the ThunderKittens repo and point the following environment variable at your local ThunderKittens directory: `export THUNDERKITTENS_ROOT=<PATH to ThunderKittens folder>`. As shown in the [example](src/kernelbench/prompts/model_new_ex_add_thunderkittens.py), every ThunderKittens program should contain `tk_root = os.environ.get("THUNDERKITTENS_ROOT", "/root/ThunderKittens")`, which enables the kernel to include the right TK primitives. In addition, we currently support only BF16 for TK.
If you want to target a specific AMD GPU:
```
HIP_VISIBLE_DEVICES=0 ROCR_VISIBLE_DEVICES=0 \
python3 scripts/generate_and_eval_single_sample.py \
dataset_src="huggingface" level=2 problem_id=40 \
backend=triton
```

Check the config fields for a comprehensive set of options. Note that by default we provide the model with a one-shot example along with the minimum set of info; you can check out other prompt settings or construct your own in `src/prompt_constructor_toml.py`.
##### Optional: Force AMD Prompt Inputs
Some scripts auto-detect GPU vendor/name. You can override:
```
python3 scripts/generate_and_eval_single_sample.py \
dataset_src=huggingface \
level=1 \
problem_id=1 \
backend=triton \
gpu_vendor=amd \
gpu_name=MI355X
```

### Run on all problems

```
# 1. Generate responses and store kernels locally to runs/{run_name} directory
python3 scripts/generate_samples.py \
run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 \
server_type=deepseek model_name=deepseek-chat temperature=0

# If you use LLM_GATEWAY_KEY (AMD gateway), set server_type=openai and temperature=1

# 2. Evaluate all generated kernels in runs/{run_name}
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300

# To speed up evaluation, parallelize compilation on CPUs before GPU evaluation.
# Add build_cache=True and num_cpu_workers=<num_cpu_workers> to the command.
```
##### AMD Triton Quick Start (batch)
```
python3 scripts/generate_samples.py \
run_name=amd_test \
dataset_src=huggingface \
level=1 \
backend=triton

python3 scripts/eval_from_generations.py \
run_name=amd_test \
dataset_src=huggingface \
level=1 \
backend=triton \
eval_mode=local
```

##### AMD Baseline Timing
```
python3 scripts/get_baseline_time_single_problem.py
```
### Analyze the eval results to compute Benchmark Performance
Use `scripts/benchmark_eval_analysis.py` to compute success rate, timing metrics, and overall benchmark performance `fast_p`.

```
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```
If you use different hardware, generate a baseline with `scripts/generate_baseline_time.py`.
We provide reference baselines for various NVIDIA GPUs in `results/timing`, but we recommend generating your own for accuracy (cluster power and software versions affect timing). See `results/timing/README.md` for details.
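Baseline and kernel timings follow the usual warmup-then-trials pattern; a minimal CPU-side sketch of the idea (illustrative only — the repo's actual GPU harness lives in `src/eval.py` and is the source of truth):

```python
import time
import statistics

def bench(fn, *args, n_warmup=3, n_trial=10):
    """Median wall-clock seconds for fn(*args) over n_trial runs, after warmup."""
    for _ in range(n_warmup):
        fn(*args)
    times = []
    for _ in range(n_trial):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Speedup is baseline_time / candidate_time, with a median taken per side
data = list(range(100_000, 0, -1))
baseline_t = bench(sorted, data)
print(f"baseline median: {baseline_t:.6f}s")
```

A median is used here rather than a mean because timing distributions tend to be long-tailed.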

### Multi-Turn Framework
We have also released the test-time framework [Caesar](https://github.com/simonguozirui/caesar), which we used for the multi-turn / iterative refinement experiments in our paper. You can use or modify this framework for high-throughput test-time scaling (both sequential and parallel) targeting KernelBench problems.

## 🛣️ Upcoming Roadmap
Check out our [roadmap](https://github.com/ScalingIntelligence/KernelBench/issues/74) for features we plan to add. We welcome community contributions in these directions.

## 🔍 Known Usage
Since release, we have received a lot of interest from researchers, research labs, and companies using KernelBench to explore this direction. We have documented [known usage](https://docs.google.com/document/d/e/2PACX-1vTjS-UMH1HB5n_PENq2k-3YRfXIXkqKIKeNC2zcWMyLPdl4Jrwvdk4dNDVSsM8ybKrCxZB7GJq1slZF/pub) of KernelBench and related efforts toward automated kernel generation. If you are using KernelBench, we would love to hear more about it!

Disclaimer: KernelBench is designed as an open-source evaluation framework and toolkit. The KernelBench team does not review, validate, or endorse individual kernels or reported results. Users are responsible for independently verifying any results obtained using the framework. Please check out `EVAL.md` for more guidance on benchmarking and evaluating kernels.

## 🪪 License
MIT. Check `LICENSE.md` for more details.

58 changes: 0 additions & 58 deletions pyproject.toml

This file was deleted.

53 changes: 28 additions & 25 deletions requirements.txt
# Frameworks
# AMD ROCm note: install ROCm-enabled torch from the PyTorch ROCm index.
# Current ROCm env:
#   torch==2.8.0+rocm7.1.1.gitcba8b9d2
#   HIP==7.1.52802-26aae437f6
#   Device IDs: 0x7551 x4
transformers
datasets
modal

# DSLs
# nvidia-cutlass-dsl
# triton (required for AMD ROCm kernels)
# helion (optional, Helion DSL; install separately if needed)

# helper
tqdm
packaging
pydra-config
dill>=0.3.7,<0.4
pytest
ninja

# Numerics
einops
python-dotenv
numpy

# to deprecate with litellm
google-generativeai
together
openai
anthropic