Releases: ingero-io/ingero
v0.9.1
Note: The multi-node features in this release (fan-out queries, offline merge, Perfetto export) are interim solutions for cross-node GPU
investigation. A dedicated cluster-level observability and diagnostics tool with native multi-node support is coming soon.
One command. Every node. Full causal chain.
Ingero can now investigate distributed GPU workloads across multiple nodes from a single CLI command, MCP tool call, or offline database merge. Diagnose which rank stalled and why, then see it all in a Perfetto timeline.
What's New
Node Identity & Rank Awareness
Every traced event is now tagged with its node identity and distributed training rank.
- `sudo ingero trace --node gpu-node-07` tags all events with the node name
- Rank auto-detection from `torchrun`/`torch.distributed.launch` environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`)
- Event IDs are node-namespaced (`gpu-node-07:1`, `gpu-node-07:2`) — merge-safe by design
- Schema v0.9 with backward-compatible migration for existing databases
Fleet Fan-Out Queries
Query your entire GPU cluster from one command:
```shell
# SQL fan-out across 3 nodes
ingero query --nodes node-1:8080,node-2:8080,node-3:8080 \
  "SELECT node, source, count(*) FROM events GROUP BY node, source"

# Cross-node causal chains sorted by severity
ingero explain --nodes node-1:8080,node-2:8080,node-3:8080
```
- Results are concatenated with a `node` column prepended
- Partial failure handling — unreachable nodes produce warnings, not errors
- Configure default nodes in `ingero.yaml` under `fleet.nodes`
- `--no-tls` mode for the dashboard on trusted networks (VPC, VPN)
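The notes reference `fleet.nodes` in `ingero.yaml` without showing the file. A plausible fragment, assuming the key takes a list of `host:port` strings (the layout is a guess, not documented syntax):

```yaml
# Hypothetical ingero.yaml fragment; only the fleet.nodes key is
# mentioned in these notes, the surrounding structure is assumed.
fleet:
  nodes:
    - node-1:8080
    - node-2:8080
    - node-3:8080
```

With defaults configured, `ingero query` and `ingero explain` would presumably not need an explicit `--nodes` flag on every invocation.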
MCP Fleet Tool
AI agents can now investigate entire clusters in one tool call:
```
query_fleet(action="chains") → merged causal chains from all nodes
query_fleet(action="sql", query="SELECT node, count(*) FROM events GROUP BY node")
```

Actions: `chains`, `ops`, `overview`, `sql`. Includes clock skew warnings.
Offline Database Merge
For air-gapped environments or offline analysis:
```shell
ingero merge node-1.db node-2.db node-3.db -o cluster.db
ingero query -d cluster.db --since 1h
ingero explain -d cluster.db --chains
```

- Node-namespaced IDs ensure zero collisions
- Stack traces are deduplicated by hash
- `--force-node` assigns identity to pre-v0.9 databases
Perfetto Timeline Export
Export multi-node traces for visual timeline analysis:
```shell
ingero export --format perfetto -d cluster.db -o trace.json
```

Open the result in https://ui.perfetto.dev — one process track per node/rank, CUDA events as duration spans, causal chains as severity-colored markers. Immediately spot which rank stalled while others waited.
Clock Skew Detection
Automatic NTP-style clock offset estimation across nodes:
```
WARNING: node-2 is ~47ms ahead of node-1 (RTT: 2ms)
```

- Live estimation during fan-out queries (3 samples, median, ~1 ms precision on LAN)
- Offline heuristic during `ingero merge` (session timestamp comparison)
- Configurable threshold: `--clock-skew-threshold 10ms`
New Commands
| Command | Description |
|---|---|
| `ingero merge` | Merge SQLite databases from multiple nodes |
| `ingero export --format perfetto` | Export to Chrome Trace Event Format |
New Flags
| Flag | Commands | Description |
|---|---|---|
| `--node` | `trace` | Tag events with node identity |
| `--nodes` | `query`, `explain`, `export` | Fan-out to multiple nodes |
| `--json` | `check` | Output system readiness results as JSON |
| `--no-tls` | `dashboard` | Plain HTTP for fleet queries on trusted networks |
| `--force-node` | `merge` | Assign node identity to legacy databases |
| `--clock-skew-threshold` | `query`, `explain`, `merge` | Clock skew warning threshold |
| `--timeout` | `query`, `explain`, `export` | Per-node timeout for fleet queries |
| `--ca-cert`, `--client-cert`, `--client-key` | `query`, `explain`, `export` | Optional mTLS for fleet queries |
New API Endpoints
| Endpoint | Description |
|---|---|
| `POST /api/v1/query` | Execute read-only SQL (used by fleet fan-out) |
| `GET /api/v1/time` | Server timestamp for clock skew estimation |
New MCP Tool
| Tool | Description |
|---|---|
| `query_fleet` | Fan-out query across multiple nodes (chains, ops, overview, sql) |
Sample Data
Multi-node sample databases are included in `investigations/` — 3 node databases (180-252 KB each), a merged cluster database, and a Perfetto timeline. Try them:

```shell
ingero explain --db investigations/sample-cluster.db --chains
ingero export --format perfetto --db investigations/sample-cluster.db -o trace.json
```

Validated On
- 3 x AWS g4dn.xlarge (Tesla T4, 15 GB VRAM), Kernel 6.17.0-1007-aws, NVIDIA 580.126.09
- Fan-out query, explain, merge, export, clock skew, partial failure (1 node down), single-node backward compatibility
- Mixed binary + Docker deployment across fleet nodes
- All existing tests pass + 80 new tests across 6 packages
- Full Ingero-EE orchestrator validation: OOM deflection, straggler remediation, watchdog, NCCL suspend/resume, fault injection, recovery persistence
Upgrade Notes
- Schema migration: Existing databases are automatically migrated to v0.9 on first open (adds `node`, `rank`, `local_rank`, `world_size` columns). Migration is non-destructive — existing data is preserved.
- Backward compatible: All single-node workflows are unchanged. The `--node` flag defaults to `os.Hostname()` when not specified.
- No new dependencies: Pure Go, no CGO, no new external libraries.
Bug Fixes
- Fixed `demo --no-gpu` nil pointer panic when a GPU is present on the machine (nil RankCache in synthetic mode)
- Fixed `--nodes "[host:port,...]"` bracket format including brackets in hostnames — now strips surrounding brackets before parsing
- Fixed MCP `query_fleet` `sql` action rejecting the `query` parameter — now accepts both `query` and `sql` fields
- Added `--ca-cert`, `--client-cert`, `--client-key` mTLS flags to the `export` command (`query` and `explain` had them; `export` did not)
- Added `--json` flag to the `check` command for structured JSON output
- Fixed `gpu-test.sh` using the wrong Python path — now auto-detects `/opt/pytorch/bin/python3` when the system Python lacks PyTorch
- Fixed staticcheck S1001 lint: removed an unnecessary copy loop in the `/api/v1/query` handler
- Fixed `eventSeq` not being seeded from the DB on restart — prevents ID collisions across trace sessions
- Fixed nil pointer panic in the merge batch commit loop on disk-full conditions
- Fixed race condition in MCP fleet client initialization (now uses `sync.Once`)
- Fixed silent I/O error swallowing in the Perfetto export writer
- Fixed duplicate IDs when merging multiple legacy DBs with the same `--force-node`
- Fixed path alias bypass in the merge output-source collision check
- Fixed URL injection via an unsanitized `since` parameter in the fleet client
v0.9.0
Ingero can now trace the full CUDA Graph lifecycle — capture, instantiate, launch — via eBPF uprobes on libcudart.so.
Zero application modification, zero CUPTI dependency, production-safe overhead.
CUDA Graph Observability
- eBPF probes for `cudaStreamBeginCapture`, `cudaStreamEndCapture`, `cudaGraphInstantiate`, and `cudaGraphLaunch` — covers the stream capture path used by PyTorch `torch.compile`, vLLM, and TensorRT-LLM
- Causal correlation connects graph events to system state: OOM during graph capture, CPU scheduling interference delaying graph dispatch, graph launch frequency anomalies (pool exhaustion), and captured-but-never-launched graphs wasting VRAM
- MCP tools: `graph_lifecycle` (timeline of all graph events for a PID) and `graph_frequency` (per-executable launch rates, hot/cold graph classification, pool saturation detection)
- `ingero explain` now includes graph context in causal chains when graph events are relevant
- Graceful degradation — if graph API symbols are absent (older CUDA), Ingero skips graph probes silently and continues normally
- Validated at 5,000+ GraphLaunch/sec on an EC2 g4dn.xlarge with `torch.compile(mode="reduce-overhead")`; overhead stayed within the <2% budget
Remediation API
Ingero now exposes an optional remediation API over a Unix domain socket (`/tmp/ingero-remediate.sock`) using type-discriminated NDJSON. External tools can consume real-time `{"type":"memory"}` and `{"type":"straggle"}` signals to build custom remediation workflows. Enable with `--remediate` on `ingero trace`. See `docs/remediation-protocol.md` for integration details.
Straggler Detection
- New `internal/straggler` package: per-PID EMA throughput baseline tracking with `sched_switch` contention counting
- Correlated detection — both a throughput drop and scheduling contention must fire to avoid false positives
- Sustained signal re-emission for downstream consumers that need periodic updates
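The correlated rule (require both signals before flagging) can be sketched as an EMA baseline plus a contention check. Thresholds and structure here are illustrative, not the `internal/straggler` implementation:

```go
package main

import "fmt"

// Detector keeps an EMA throughput baseline for one PID and only flags
// a straggler when BOTH a throughput drop and scheduling contention
// fire, mirroring the correlated-detection rule above.
type Detector struct {
	ema   float64 // events/sec baseline
	alpha float64 // EMA smoothing factor
}

// Observe updates the baseline and reports whether this interval looks
// like a straggler: throughput well below baseline AND contention high.
// The 50% and 100-switch cutoffs are illustrative, not real defaults.
func (d *Detector) Observe(throughput float64, contentionSwitches int) bool {
	if d.ema == 0 {
		d.ema = throughput // seed the baseline on the first sample
		return false
	}
	dropped := throughput < 0.5*d.ema     // well below the EMA baseline
	contended := contentionSwitches > 100 // heavy sched_switch churn
	d.ema = d.alpha*throughput + (1-d.alpha)*d.ema
	return dropped && contended
}

func main() {
	d := &Detector{alpha: 0.2}
	d.Observe(1000, 5)               // establishes the baseline
	fmt.Println(d.Observe(1000, 5))  // healthy: false
	fmt.Println(d.Observe(300, 10))  // drop but no contention: false
	fmt.Println(d.Observe(300, 500)) // drop + contention: true
}
```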
v0.8.2
What's New
Docker containerization, a real-time GPU dashboard preview, human-friendly CLI duration parsing, and improved platform support
for WSL and AWS Deep Learning AMIs. This release also includes Ingero's first community contributions.
Highlights
- Docker support — Multi-arch (amd64/arm64) Alpine-based container image (~10 MB), auto-published to GHCR on tag push via GoReleaser. Includes GPU passthrough detection and a healthcheck.
- Real-time GPU dashboard preview — New `ingero dashboard` command launches a web UI for live GPU observability
- Human-friendly `--since` durations — `explain` and `query` now accept `--since 2hours`, `--since 1day`, `--since 1w` in addition to Go-native formats like `5m`. Powered by go-str2duration. (#8) — thanks @patrickluzdev
- Demo scenario descriptions — `ingero demo --help` now explains what each of the 6 scenarios demonstrates (incident, cold-start, memcpy-bottleneck, periodic-spike, cpu-contention, gpu-steal). (#7) — thanks @zamadye
- WSL hardening — bpftool shim bypass, vmlinux.h validation, and corrected WSL GPU device paths
- AWS Deep Learning AMI support — Auto-discovery of versioned `libcudart.so` on Deep Learning AMIs
- `install-deps.sh` — One-command dependency installation for quick start on fresh machines
Docker Quick Start
```shell
docker pull ghcr.io/ingero-io/ingero:latest
docker run --privileged --pid=host ghcr.io/ingero-io/ingero trace --duration 30s
```
Build from Source
```shell
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident # No GPU needed
```

For real GPU tracing:

```shell
sudo ./bin/ingero trace --duration 30s
./bin/ingero explain --since 30s
```
Contributors
- @patrickluzdev — human-friendly duration parsing (https://github.com/ingero-io/ingero/pull/8)
- @zamadye — demo scenario descriptions (https://github.com/ingero-io/ingero/pull/7)
## Changelog
### Features
* 21f48b2508d9b70c5e95911c40840c30e72e9a00 feat(docker): containerize Ingero with Alpine runtime and GoReleaser auto-publish
* 2c189157afb4acbd7d23ea93d22bb67005ab7175 feat: add GPU dashboard command with real-time web UI
### Bug Fixes
* 54b54c3814a35398723d2c149637d5159e4f6e1b fix(docker): add libc6-compat for glibc-linked nvidia-smi on Alpine
* dd65d2b1bd1f8a88dd8fd8204dffb5d5ba882616 fix(docker): address review feedback on container GPU detection
* 92bd37b71292e4e3c8e09551c87843767fc7584b fix(docker): address review feedback on containerization
* d371e3166edaf57a5633d0cbce9825b4f747e81c fix(docker): container GPU detection and Buildx setup
* 9bcac9dbc8864ad33346836c192c0ccf0571b200 fix: auto-discover libcudart.so on Deep Learning AMIs
* 2d948fca2e65928ea132fd7dc06e664c4d55daae fix: find versioned libcudart.so on Deep Learning AMIs
* c8c67a66ace70040c95c39650162f901ed6bd63d fix: remove unused printScenarioHelp function
* f47284dd4f39bdf43f7bfdbccdd53ceaf0b21ef6 fix: setup-wsl.sh bpftool shim bypass, vmlinux.h validation, WSL GPU paths
### Documentation
* da765d5e42e73d0d191fd117943ef76fd0fe75a5 docs: add license links for GPL-2.0 and BSD-3-Clause in README
* 98eaeed9a93e9ebe1c43027917594c67780d7225 docs: clarify make install is optional in README
* 52bfe9c0e1a3c73ccc8b4f3917e1355051b09a98 docs: comment out binary install, add build-from-source note
* 571bd0d136fc06e3bb6f708f79811ce0c944f56b docs: mention 10 GB rolling storage cap and --max-db in README
* e5ad809beb9ad7d581ec3f9425122f5af9c2172a docs: move dashboard command after mcp in README
* 60d6c16f19be6c3c08fcc840d138d1b3e9362c15 docs: reformat investigation sessions with Engineer/AI Assistant dialog
* 2754d02aac3b604f21b8669775c6d8a390f23f9e docs: renumber investigation sessions 1-2-3, separate metadata lines
* a5af4c3bdd2a6a26a4fa6e0c126a111c5b02990c docs: simplify GPU problems intro in README
* 4a089cfc026c5e7c97843098a6aa833bbc678aec docs: update session metadata
### Other
* 446a8510f2e5cd6330770b92b5d55e4e4c829cac Fix arrow direction in README diagram
* da7f85e323222a2b5ee105a96643ec841d9cf825 Fix formatting in README.md table
* 005bcabb7c16d796ed5641f6b0b0dc009723ae2c Fix formatting issues in README.md
* 896a8e14506a2bafd32c5ed509e2291304092ab3 add scenario descriptions to demo help
* 3ee065438d435abae3c3999eba658b2dd551ec59 chore: fix go-str2duration as direct dep and update test matrix
* 41096647d58e0648908a59ccb79d0eb93c6754d1 chore: force GitHub UI cache rebuild
* f3473e69a45e923e2c1e60f3202d4c35087ee2ea ci: add manual workflow_dispatch triggers to CI and Release
* 093683e08b04afdc02ced269a8474c9ee5ac3f7b ci: fix manual release trigger to publish Docker images to GHCR
* a575a91dd2f472047447d8de642b7d28d5cffc57 fixes & polishing
* 9edf95af2e125039c2d80d2086acdb4376454ced harden Go download in install-deps.sh, clean up demo title
* 65c7194f10b1f49f8fe6352ca586a5405ac1a745 install-deps shell script for quick start
* 0f603ff231f83785a2fdb7153cacd6bbbaba6d87 refactor(cli): add parseSince helper using go-str2duration and reuse in explain and query
v0.8.1
What's New
Seven fixes from analysis of a 5-phase RTX 4090 GPT-2 stress test (237K+ events/min).
Highlights
- DB compaction at shutdown — WAL checkpoint + VACUUM when >20% of pages are free. Integration test DB shrank from 57 MB to 2.7 MB (95% reduction)
- Throughput-drop causal chains — new detection for when CUDA op rate drops >40% from peak but per-call latency stays flat. Catches GPU starvation that tail-ratio chains miss
- Aggregate flush starvation fix — high-throughput periods (400K+ events/min) no longer starve the stats flusher. Event-count-based inline flush every 10K events
- Process name persistence — dynamically discovered PIDs now have names in SQLite. `explain --per-process` shows process names instead of raw PIDs
- Phase 5 block I/O visibility — checkpoint saves trigger fsync for block device tracepoint capture
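The shutdown compaction above uses standard SQLite machinery. A rough shell-level equivalent (Ingero does this internally; the database name and the `freelist_count` check are illustrative assumptions about its exact mechanism):

```shell
# Checkpoint the WAL into the main file, then VACUUM to rebuild the
# database without free pages; free-page count drives the >20% trigger.
DB=demo.db
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS events(id INTEGER);"
sqlite3 "$DB" "PRAGMA freelist_count;"                    # free pages before compaction
sqlite3 "$DB" "PRAGMA wal_checkpoint(TRUNCATE); VACUUM;"  # reclaim the space
```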
Test Results
- 224 unit tests
- RTX 4090 GPU integration: 73 PASS, 0 FAIL, 1 SKIP, 6 WARN
- 28/28 ML Eng AI-assisted investigations PASS
- 944 causal chains including new throughput-drop chains
Quick Start
```shell
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident # No GPU needed
```

For real GPU tracing:

```shell
sudo ./bin/ingero trace --duration 30s
./bin/ingero explain --since 30s
```

Full changelog: see Release Notes