188 changes: 129 additions & 59 deletions README.md
# KernelBench: Can LLMs Write Efficient GPU Kernels? [ICML '25]
A benchmark for evaluating LLMs' ability to generate efficient GPU kernels

[arXiv](https://arxiv.org/html/2502.10517v1) | [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbench/) | [HuggingFace Dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench)

<img src="./assets/figures/KernelBenchMascot.png" width="200">
## Versions
The latest stable version lives on the `main` branch; we continue to update and improve the repo.
- [v0.1](https://github.com/ScalingIntelligence/KernelBench/tree/v0.1) - Latest version (also on `main`); see the [blog post](https://scalingintelligence.stanford.edu/blogs/kernelbenchv01/)
- [v0](https://github.com/ScalingIntelligence/KernelBench/tree/v0) - Original Release


The Huggingface [dataset](https://huggingface.co/datasets/ScalingIntelligence/KernelBench) is updated to v0.1.

This repo provides core functionality for KernelBench and an easy-to-use set of scripts for evaluation. It is not intended to provide complex agentic scaffolds that solve this task; we recommend cloning and modifying this repo for your experiment, or using it as a git submodule.

## 👋 Task Description
We task the LLM with transpiling operators described in PyTorch into CUDA kernels, at whatever level of granularity it chooses.
We construct KernelBench with 4 levels of categories:
- **Level 4 🤗**: Level Hugging Face
Optimize whole model architectures from HuggingFace

We are actively extending KernelBench to other DSLs beyond `cuda` as well.

## ⚖️ Evaluation
#### Methodology
To evaluate model-generated kernels, we check that each kernel:
- **is correct ✅**: checked against reference torch operators `n_correctness` times on randomized inputs.
- **is performant ⏱️**: timed against reference torch operators over `n_trial` runs to measure speedup.
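The correctness check above can be sketched in plain NumPy, standing in for the torch operators (all names here are illustrative, not the repo's actual API):

```python
import numpy as np

def check_correct(ref_fn, cand_fn, make_input, n_correctness=5, atol=1e-4, rtol=1e-3):
    """Candidate passes only if it matches the reference on every randomized trial."""
    for seed in range(n_correctness):
        rng = np.random.default_rng(seed)
        x = make_input(rng)
        if not np.allclose(cand_fn(x), ref_fn(x), atol=atol, rtol=rtol):
            return False
    return True

# Toy example: one "kernel" matches ReLU exactly, one does not
relu = lambda x: np.maximum(x, 0.0)
good = lambda x: np.where(x > 0, x, 0.0)
bad = lambda x: 0.5 * x
make_input = lambda rng: rng.standard_normal((4, 4))

print(check_correct(relu, good, make_input))  # True
print(check_correct(relu, bad, make_input))   # False
```

The real harness in `src/eval.py` performs the analogous comparison with torch operators on GPU tensors.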

Check out `src/eval.py` for details on how we implement the correctness check and timing, and `EVAL.md` for evaluation and benchmarking guidelines.

We provide a convenient script, `scripts/run_and_check.py`, to evaluate a single generated kernel against a reference source: it checks correctness and computes speedup. You can run the evaluation locally or remotely by setting `eval_mode=local` or `eval_mode=modal`.

#### Overall Benchmark Metric
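As defined in the KernelBench paper, the headline metric `fast_p` is the fraction of tasks whose generated kernel is both correct and achieves a speedup greater than a threshold `p` over the baseline; a minimal sketch (the function signature and data layout are illustrative):

```python
def fast_p(results, p=1.0):
    """Fraction of tasks that are correct AND faster than p x baseline.

    `results` holds one (correct, speedup) pair per task, where
    speedup = baseline_time / kernel_time.
    """
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)

# 4 tasks: 3 correct, 2 of those beat the baseline at p = 1.0
print(fast_p([(True, 1.5), (True, 2.0), (True, 0.8), (False, 3.0)]))  # 0.5
```

Note that `p=0` reduces to the plain correctness rate, and raising `p` demands progressively larger speedups.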

We organize the repo into the following structure:
```
KernelBench/
├── assets/
├── KernelBench/ # Benchmark dataset files
├── src/ # KernelBench logic code
│ ├── unit_tests/
│ ├── prompts/
│ ├── ....
├── scripts/ # helpful scripts to run the benchmark
├── results/ # baseline times across hardware
├── runs/ # where your runs will be stored
├── notebooks/ # example notebooks for analysis
├── pyproject.toml # Project configuration and dependencies
```

## 🔧 Set up
```
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .
```

### GPU Setup
Running and profiling kernels require a GPU.
If you don't have a GPU available locally, you can set up [Modal](https://modal.com/) for cloud serverless GPU evaluation. After creating an account, set up your Modal token by running `modal token new`, then use the `generate_and_eval_single_sample_modal.py` script.

#### NVIDIA (CUDA)
- Use default backend `cuda` (recommended).
- Ensure a CUDA-enabled PyTorch install.

#### AMD ROCm (Radeon / MI-Series)
KernelBench can run on AMD GPUs via ROCm (HIP) using the same PyTorch `torch.cuda` API.

1) Install ROCm-enabled PyTorch (pick the correct ROCm version for your system):
```
# Example (adjust ROCm version as needed)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/rocm6.4
```

2) Verify GPU visibility:
```
python - <<'PY'
import torch
print("HIP:", torch.version.hip)
print("GPU:", torch.cuda.get_device_name(0))
print(torch.cuda.get_device_properties(0))
PY
```

3) Optional: select specific GPU(s)
```
export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0
```

We use `litellm` for API calls. Please set your keys by creating a `.env` file following our `.env.example`.
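For example, a minimal `.env` might look like this (the key names shown are illustrative; `.env.example` is the canonical reference):

```
# .env (do not commit this file)
OPENAI_API_KEY=your_api_key_here
ANTHROPIC_API_KEY=your_api_key_here
```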

> Note: For AMD, use `backend=triton` or `backend=helion` where applicable. CUDA backend is NVIDIA-only.

You can also try out our [tutorial notebook](https://bit.ly/kernelbench-neurips-colab) (also in notebooks/tutorial.ipynb) with Google Colab.
##### AMD ROCm Tips
- **What works**: AMD hardware-aware prompts, Triton backend generation, and ROCm-friendly timing.
- **What does not (by default)**: CUDA backend evaluation on ROCm is blocked to avoid CUDA-only compile paths.
- **Troubleshooting**: Ensure Triton is ROCm-enabled and PyTorch is a ROCm build.

To call LLM API providers, set the provider API key in your environment:
```
export OPENAI_API_KEY="your_api_key_here"
```

## 🚀 Usage
### Run on a single problem
It is easiest to get started with a single problem. This will fetch the problem, generate a sample, and evaluate it.

```
# Example: run level 2 problem 40 from Hugging Face
python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=2 problem_id=40

# dataset_src could be "local" or "huggingface"
# add .verbose_logging for more visibility
```

**What you might need to modify**
* **`gpu_arch`** - Depending on your GPU, you may need to adjust the `gpu_arch` argument to reflect your hardware.
* **`precision`** - You can specify tensor precision with, e.g., `precision=fp32`. All reported results are currently `fp32`, but we also support `fp16` and `bf16`.
* **`backend`** - We also support GPU DSLs beyond `cuda`: set `backend=triton`, `backend=cute`, or `backend=helion` as needed (ThunderKittens requires extra setup; see below).

#### AMD ROCm Example Commands
Use `backend=triton` (recommended) or `backend=helion` on AMD GPUs:
```
# Triton on AMD ROCm (single problem)
python3 scripts/generate_and_eval_single_sample.py \
dataset_src="huggingface" level=2 problem_id=40 \
backend=triton

# Helion on AMD ROCm (single problem; still in progress)
python3 scripts/generate_and_eval_single_sample.py \
dataset_src="huggingface" level=2 problem_id=40 \
backend=helion
```

Note on setting up ThunderKittens (TK) locally: to use `backend=thunderkittens`, clone the ThunderKittens repo and point the following environment variable at your local ThunderKittens directory: `export THUNDERKITTENS_ROOT=<PATH to ThunderKittens folder>`. As shown in the [example](src/kernelbench/prompts/model_new_ex_add_thunderkittens.py), every ThunderKittens program should contain `tk_root = os.environ.get("THUNDERKITTENS_ROOT", "/root/ThunderKittens")`, which enables the kernel to include the right TK primitives. In addition, we currently support only BF16 for TK.
If you want to target a specific AMD GPU:
```
HIP_VISIBLE_DEVICES=0 ROCR_VISIBLE_DEVICES=0 \
python3 scripts/generate_and_eval_single_sample.py \
dataset_src="huggingface" level=2 problem_id=40 \
backend=triton
```

Check the config fields for a comprehensive set of options. Note that by default we provide the model with a one-shot example along with the minimum set of info; you can check out other prompt settings or construct your own in `src/prompt_constructor_toml.py`.
##### Optional: Force AMD Prompt Inputs
Some scripts auto-detect GPU vendor/name. You can override:
```
python3 scripts/generate_and_eval_single_sample.py \
dataset_src=huggingface \
level=1 \
problem_id=1 \
backend=triton \
gpu_vendor=amd \
gpu_name=MI355X
```

### Run on all problems

```
# 1. Generate responses and store kernels locally to runs/{run_name} directory
python3 scripts/generate_samples.py \
run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 \
server_type=deepseek model_name=deepseek-chat temperature=0

# If you use LLM_GATEWAY_KEY (AMD gateway), set server_type=openai and temperature=1

# 2. Evaluate all generated kernels in runs/{run_name}
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300

# To speed up evaluation, parallelize compilation on CPUs before GPU evaluation.
# Add build_cache=True and num_cpu_workers=<num_cpu_workers> to the command.
```
##### AMD Triton Quick Start (batch)
```
python3 scripts/generate_samples.py \
run_name=amd_test \
dataset_src=huggingface \
level=1 \
backend=triton

python3 scripts/eval_from_generations.py \
run_name=amd_test \
dataset_src=huggingface \
level=1 \
backend=triton \
eval_mode=local
```

##### AMD Baseline Timing
```
python3 scripts/get_baseline_time_single_problem.py
```
### Analyze the eval results to compute Benchmark Performance
Use `scripts/benchmark_eval_analysis.py` to compute success rate, timing metrics, and overall benchmark performance `fast_p`.

```
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
```
If you use different hardware, generate a baseline with `scripts/generate_baseline_time.py`.
We provide reference baselines for various NVIDIA GPUs in `results/timing`, but we recommend generating your own for accuracy (cluster power and software versions affect timing). See `results/timing/README.md` for details.
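Baseline and kernel timings follow the usual warmup-then-trials pattern; a minimal CPU-side sketch of the idea (illustrative only — the repo's actual GPU harness lives in `src/eval.py` and is the source of truth):

```python
import time
import statistics

def bench(fn, *args, n_warmup=3, n_trial=10):
    """Median wall-clock seconds for fn(*args) over n_trial runs, after warmup."""
    for _ in range(n_warmup):
        fn(*args)
    times = []
    for _ in range(n_trial):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

# Speedup is baseline_time / candidate_time, with a median taken per side
data = list(range(100_000, 0, -1))
baseline_t = bench(sorted, data)
print(f"baseline median: {baseline_t:.6f}s")
```

A median is used here rather than a mean because timing distributions tend to be long-tailed.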

### Multi-Turn Framework
We have also released the test-time framework [Caesar](https://github.com/simonguozirui/caesar), which we used for the multi-turn / iterative refinement experiments in our paper. You can use or modify this framework for high-throughput test-time scaling (both sequential and parallel) targeting KernelBench problems.

## 🛣️ Upcoming Roadmap
Check out our [roadmap](https://github.com/ScalingIntelligence/KernelBench/issues/74) for features we plan to add. We welcome community contributions in these directions.

## 🔍 Known Usage
Since release, we have received a lot of interest from researchers, research labs, and companies using KernelBench to explore this direction. We have documented [known usage](https://docs.google.com/document/d/e/2PACX-1vTjS-UMH1HB5n_PENq2k-3YRfXIXkqKIKeNC2zcWMyLPdl4Jrwvdk4dNDVSsM8ybKrCxZB7GJq1slZF/pub) of KernelBench and related efforts toward automated kernel generation. If you are using KernelBench, we would love to hear more about it!

Disclaimer: KernelBench is designed as an open-source evaluation framework and toolkit. The KernelBench team does not review, validate, or endorse individual kernels or reported results. Users are responsible for independently verifying any results obtained using the framework. Please check out `EVAL.md` for more guidance on benchmarking and evaluating kernels.

## 🪪 License
MIT. Check `LICENSE.md` for more details.

58 changes: 0 additions & 58 deletions pyproject.toml

This file was deleted.

53 changes: 28 additions & 25 deletions requirements.txt
# Frameworks
# AMD ROCm note: install ROCm-enabled torch from the PyTorch ROCm index.
# Current ROCm env:
#   torch==2.8.0+rocm7.1.1.gitcba8b9d2
#   HIP==7.1.52802-26aae437f6
#   Device IDs: 0x7551 x4
transformers
datasets
modal

# DSLs
# nvidia-cutlass-dsl
# triton (required for AMD ROCm kernels)
# helion (optional, Helion DSL; install separately if needed)

# helper
tqdm
packaging
pydra-config
dill>=0.3.7,<0.4
pytest
ninja

# Numerics
einops
python-dotenv
numpy

# to deprecate with litellm
google-generativeai
together
openai
anthropic