CUTracer is an NVBit-based CUDA binary instrumentation tool. It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.
- NVBit-powered, runtime attach via `CUDA_INJECTION64_PATH` (no app rebuild needed)
- Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
- Built-in analyses:
  - Instruction Histogram (for Proton/Triton workflows)
  - Deadlock/Hang Detection
  - Data Race Detection
- CUDA Graph and stream-capture aware flows
- Deterministic kernel log file naming and CSV outputs
Base requirements are the same as NVBit's.
Additional requirements:
- libzstd: required for trace compression
- Clone the repository:

```shell
git clone git@github.com:facebookresearch/CUTracer.git
cd CUTracer
```

- Install system dependencies (libzstd static library for self-contained builds):
```shell
# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev

# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static

# If the static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.
```

- Download third-party dependencies:

```shell
./install_third_party.sh
```

This will download:
- NVBit (NVIDIA Binary Instrumentation Tool)
- nlohmann/json (JSON library for C++)
- Build the tool:

```shell
make -j$(nproc)
```

Run your CUDA app with CUTracer (example: no instrumentation):

```shell
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
./your_app
```

Behavior is configured through environment variables:

- `CUTRACER_INSTRUMENT`: comma-separated modes: `opcode_only`, `reg_trace`, `mem_trace`, `random_delay`
- `CUTRACER_ANALYSIS`: comma-separated analyses: `proton_instr_histogram`, `deadlock_detection`, `random_delay`
  - Enabling `proton_instr_histogram` auto-enables `opcode_only`
  - Enabling `deadlock_detection` auto-enables `reg_trace`
  - Enabling `random_delay` requires `CUTRACER_DELAY_NS` to be set
- `KERNEL_FILTERS`: comma-separated substrings matching unmangled or mangled kernel names
- `INSTR_BEGIN`, `INSTR_END`: static instruction index gate during instrumentation
- `TOOL_VERBOSE`: 0/1/2
- `TRACE_FORMAT_NDJSON`: trace output format
  - 1 (default): NDJSON+Zstd compressed (`.ndjson.zst`, ~12x compression, 92% space savings)
  - 0: plain text (`.log`, legacy format, verbose)
  - 2: NDJSON uncompressed (`.ndjson`, for debugging)
- `CUTRACER_ZSTD_LEVEL`: Zstd compression level (1-22, default 22)
  - Lower values (1-3): faster compression, slightly larger output
  - Higher values (19-22): maximum compression, slower but smallest output; the default of 22 yields the smallest files
- `CUTRACER_DELAY_NS`: fixed delay in nanoseconds for race detection (required for the `random_delay` analysis)
- `CUTRACER_DELAY_DUMP_PATH`: output path for the delay config JSON file (records the instrumentation pattern)
- `CUTRACER_DELAY_LOAD_PATH`: input path for the delay config JSON file (replay mode, deterministic reproduction)
- `CUTRACER_OUTPUT_DIR`: output directory for all CUTracer files (traces and logs). Defaults to the current directory; the directory must exist and be writable
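Uncompressed NDJSON traces (`TRACE_FORMAT_NDJSON=2`) hold one JSON object per line, so they can be read with a few lines of Python. This is a sketch; the per-record fields depend on the enabled instrumentation modes and are not documented here:

```python
import json

def read_ndjson(path):
    """Yield one record per non-empty line of an uncompressed .ndjson trace."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For the default `.ndjson.zst` output, decompress first with the zstd CLI (`zstd -d trace.ndjson.zst`) and then read the resulting `.ndjson` the same way.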
Note: The tool sets CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 to simplify channel memory handling.
- Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions are not supported)
- Output: one CSV per kernel launch with columns `warp_id,region_id,instruction,count`
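A histogram CSV with the columns above can be aggregated with a short script. This sketch assumes the file includes a header row with those column names; the file name in the test is a made-up placeholder:

```python
import csv
from collections import Counter

def region_totals(csv_path):
    """Sum instruction counts per (region_id, instruction) across all warps.

    Assumes the per-kernel histogram layout described above:
    warp_id,region_id,instruction,count with a header row.
    """
    totals = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[(row["region_id"], row["instruction"])] += int(row["count"])
    return totals
```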
Example (Triton/Proton + IPC):

```shell
cd ~/CUTracer/tests/proton_tests

# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py

# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
  --chrome-trace ./vector.chrome_trace \
  --cutracer-trace ./kernel_*_add_kernel_hist.csv \
  --cutracer-log ./cutracer_main_*.log \
  --output vectoradd_ipc.csv
```

- Detects sustained hangs by identifying warps stuck in stable PC loops; logs the stuck warps and, if the hang persists, terminates the process with SIGTERM and then SIGKILL
- Requires `reg_trace` (auto-enabled)
Example (intentional loop):

```shell
cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
```

- Data races depend on timing and often pass by luck. This analysis injects random delays before synchronization instructions, disrupting the timing and forcing hidden races to surface
- Each instrumentation point is independently enabled (50% probability) with a fixed delay value
- Requires `CUTRACER_DELAY_NS` to be set
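The injection policy described above (independent 50% coin flip per instrumentation point, one shared fixed delay) can be sketched in a few lines. This is illustrative only, not CUTracer's actual implementation:

```python
import random

def plan_delays(num_sites, delay_ns, seed=None):
    """Flip a fair coin per instrumentation site: delay or no delay.

    Each site is independently enabled with 50% probability; enabled
    sites all use the same fixed delay (CUTRACER_DELAY_NS). Illustrative
    sketch, not CUTracer's code.
    """
    rng = random.Random(seed)
    return [delay_ns if rng.random() < 0.5 else 0 for _ in range(num_sites)]
```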
Example:

```shell
CUTRACER_DELAY_NS=10000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py
```

CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:

- Dump mode: set `CUTRACER_DELAY_DUMP_PATH` to save the random instrumentation pattern to a JSON file
- Replay mode: set `CUTRACER_DELAY_LOAD_PATH` to load a saved config and reproduce the exact same delay pattern
Note: You cannot use both at the same time.
Workflow:

- Run with `CUTRACER_DELAY_DUMP_PATH=/tmp/config.json` to record the delay pattern
- When a failure occurs, save the config file
- Replay with `CUTRACER_DELAY_LOAD_PATH=/tmp/config.json` to reproduce it deterministically
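The dump/replay idea amounts to round-tripping the per-site decisions through JSON: a dump records the random pattern, and a replay reuses it verbatim, which is what makes reproduction deterministic. The `sites` key and file layout below are made-up placeholders, not CUTracer's actual schema:

```python
import json
import random

def dump_plan(path, num_sites, seed=None):
    """Record a random per-site enable pattern to JSON (dump-mode analogue)."""
    rng = random.Random(seed)
    plan = {"sites": [rng.random() < 0.5 for _ in range(num_sites)]}
    with open(path, "w") as f:
        json.dump(plan, f)
    return plan

def load_plan(path):
    """Reload a saved pattern so a later run applies identical delays."""
    with open(path) as f:
        return json.load(f)
```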
The examples/ directory contains reference trace outputs for common workflows:
- Proton Trace -- sample instruction histogram CSV, CUTracer log, and a README explaining the end-to-end proton instrumentation workflow for a Triton vector-add kernel
- No CSV/log: check `CUDA_INJECTION64_PATH`, `KERNEL_FILTERS`, and write permissions
- Empty histogram: ensure kernels emit clock instructions (e.g., Triton `pl.scope`)
- High overhead: prefer opcode-only; narrow filters; use `INSTR_BEGIN`/`INSTR_END`
- CUDA Graph/stream capture: data is flushed at `cuGraphLaunch` exit; ensure stream sync
- IPC merge issues: resolve warp mismatches and kernel hash ambiguity with parser flags
This repository contains code under the MIT license (Meta) and the BSD-3-Clause license (NVIDIA). See LICENSE and LICENSE-BSD for details.
The full documentation lives in the Wiki. Key topics include Quickstart, Analyses, Post-processing, Configuration, Outputs, API & Data Structures, Developer Guide, and Troubleshooting.