CUTracer is an NVBit-based CUDA binary instrumentation tool. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.
- NVBit-powered, runtime attach via `CUDA_INJECTION64_PATH` (no app rebuild needed)
- Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
- Built-in analyses:
  - Instruction Histogram (for Proton/Triton workflows)
  - Deadlock/Hang Detection
  - Data Race Detection
- CUDA Graph and stream-capture aware flows
- Deterministic kernel log file naming and CSV outputs
Base requirements are the same as NVBit's.
Additional requirements:
- libzstd: required for trace compression
- Clone the repository:

```shell
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer
```

- Install system dependencies (libzstd static library for self-contained builds):
```shell
# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev

# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static

# If the static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.
```

- Download third-party dependencies:

```shell
./install_third_party.sh
```

This will download:
- NVBit (NVIDIA Binary Instrumentation Tool)
- nlohmann/json (JSON library for C++)
- Build the tool:

```shell
make -j$(nproc)
```

Run your CUDA app with CUTracer (example: no instrumentation):

```shell
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
./your_app
```

Behavior is configured through environment variables:

- `CUTRACER_INSTRUMENT`: comma-separated modes: `opcode_only`, `reg_trace`, `mem_trace`, `random_delay`
- `CUTRACER_ANALYSIS`: comma-separated analyses: `proton_instr_histogram`, `deadlock_detection`, `random_delay`
  - Enabling `proton_instr_histogram` auto-enables `opcode_only`
  - Enabling `deadlock_detection` auto-enables `reg_trace`
  - Enabling `random_delay` requires `CUTRACER_DELAY_NS` to be set
- `KERNEL_FILTERS`: comma-separated substrings matching unmangled or mangled kernel names
- `INSTR_BEGIN`, `INSTR_END`: static instruction index gate during instrumentation
- `TOOL_VERBOSE`: 0/1/2
- `TRACE_FORMAT_NDJSON`: trace output format
  - 1 (default): NDJSON+Zstd compressed (`.ndjson.zst`, ~12x compression, 92% space savings)
  - 0: plain text (`.log`, legacy format, verbose)
  - 2: NDJSON uncompressed (`.ndjson`, for debugging)
- `CUTRACER_ZSTD_LEVEL`: Zstd compression level (1-22, default 22)
  - Lower values (1-3): faster compression, slightly larger output
  - Higher values (19-22): maximum compression, slower but smallest output; the default of 22 yields the smallest files
- `CUTRACER_DELAY_NS`: fixed delay in nanoseconds for race detection (required for the `random_delay` analysis)
- `CUTRACER_DELAY_DUMP_PATH`: output path for the delay config JSON file (records the instrumentation pattern)
- `CUTRACER_DELAY_LOAD_PATH`: input path for the delay config JSON file (replay mode, deterministic reproduction)
- `CUTRACER_OUTPUT_DIR`: output directory for all CUTracer files (traces and logs). Defaults to the current directory; the directory must exist and be writable
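Uncompressed NDJSON traces (`TRACE_FORMAT_NDJSON=2`) hold one JSON object per line, so they can be read with a few lines of Python. This is a sketch; the per-record fields depend on the enabled instrumentation modes and are not documented here:

```python
import json

def read_ndjson(path):
    """Yield one record per non-empty line of an uncompressed .ndjson trace."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For the default `.ndjson.zst` output, decompress first with the zstd CLI (`zstd -d trace.ndjson.zst`) and then read the resulting `.ndjson` the same way.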
Note: The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
- Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions are not supported)
- Output: one CSV per kernel launch with columns `warp_id,region_id,instruction,count`
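A histogram CSV with the columns above can be aggregated with a short script. This sketch assumes the file includes a header row with those column names; the file name in the test is a made-up placeholder:

```python
import csv
from collections import Counter

def region_totals(csv_path):
    """Sum instruction counts per (region_id, instruction) across all warps.

    Assumes the per-kernel histogram layout described above:
    warp_id,region_id,instruction,count with a header row.
    """
    totals = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[(row["region_id"], row["instruction"])] += int(row["count"])
    return totals
```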
Example (Triton/Proton + IPC):

```shell
cd ~/CUTracer/tests/proton_tests

# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py

# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
  --chrome-trace ./vector.chrome_trace \
  --cutracer-trace ./kernel_*_add_kernel_hist.csv \
  --cutracer-log ./cutracer_main_*.log \
  --output vectoradd_ipc.csv
```

- Detects sustained hangs by identifying warps stuck in stable PC loops; logs the stuck warps and, if the hang persists, terminates the process with SIGTERM and then SIGKILL
- Requires `reg_trace` (auto-enabled)
Example (intentional loop):

```shell
cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
```

- Data races depend on timing and often pass by luck. This analysis injects random delays before synchronization instructions, disrupting the timing and forcing hidden races to surface
- Each instrumentation point is independently enabled (50% probability) with a fixed delay value
- Requires `CUTRACER_DELAY_NS` to be set
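The injection policy described above (independent 50% coin flip per instrumentation point, one shared fixed delay) can be sketched in a few lines. This is illustrative only, not CUTracer's actual implementation:

```python
import random

def plan_delays(num_sites, delay_ns, seed=None):
    """Flip a fair coin per instrumentation site: delay or no delay.

    Each site is independently enabled with 50% probability; enabled
    sites all use the same fixed delay (CUTRACER_DELAY_NS). Illustrative
    sketch, not CUTracer's code.
    """
    rng = random.Random(seed)
    return [delay_ns if rng.random() < 0.5 else 0 for _ in range(num_sites)]
```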
Example:

```shell
CUTRACER_DELAY_NS=10000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py
```

CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:

- Dump mode: set `CUTRACER_DELAY_DUMP_PATH` to save the random instrumentation pattern to a JSON file
- Replay mode: set `CUTRACER_DELAY_LOAD_PATH` to load a saved config and reproduce the exact same delay pattern
Note: You cannot use both at the same time.
Workflow:

- Run with `CUTRACER_DELAY_DUMP_PATH=/tmp/config.json` to record the delay pattern
- When a failure occurs, save the config file
- Replay with `CUTRACER_DELAY_LOAD_PATH=/tmp/config.json` to reproduce it deterministically
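The dump/replay idea amounts to round-tripping the per-site decisions through JSON: a dump records the random pattern, and a replay reuses it verbatim, which is what makes reproduction deterministic. The `sites` key and file layout below are made-up placeholders, not CUTracer's actual schema:

```python
import json
import random

def dump_plan(path, num_sites, seed=None):
    """Record a random per-site enable pattern to JSON (dump-mode analogue)."""
    rng = random.Random(seed)
    plan = {"sites": [rng.random() < 0.5 for _ in range(num_sites)]}
    with open(path, "w") as f:
        json.dump(plan, f)
    return plan

def load_plan(path):
    """Reload a saved pattern so a later run applies identical delays."""
    with open(path) as f:
        return json.load(f)
```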
The examples/ directory contains reference trace outputs for common workflows:
- Proton Trace -- sample instruction histogram CSV, CUTracer log, and a README explaining the end-to-end proton instrumentation workflow for a Triton vector-add kernel
- No CSV/log: check `CUDA_INJECTION64_PATH`, `KERNEL_FILTERS`, and write permissions
- Empty histogram: ensure kernels emit clock instructions (e.g., Triton `pl.scope`)
- High overhead: prefer opcode-only; narrow filters; use `INSTR_BEGIN`/`INSTR_END`
- CUDA Graph/stream capture: data is flushed at `cuGraphLaunch` exit; ensure stream sync
- IPC merge issues: resolve warp mismatches and kernel hash ambiguity with parser flags
This repository contains code under the MIT license (Meta) and the BSD-3-Clause license (NVIDIA). See LICENSE and LICENSE-BSD for details.
The full documentation lives in the Wiki. Key topics include Quickstart, Analyses, Post-processing, Configuration, Outputs, API & Data Structures, Developer Guide, and Troubleshooting.