
Releases: ingero-io/ingero

v0.9.1

03 Apr 16:10
1dc273d


Note: The multi-node features in this release (fan-out queries, offline merge, Perfetto export) are interim solutions for cross-node GPU
investigation. A dedicated cluster-level observability and diagnostics tool with native multi-node support is coming soon.

One command. Every node. Full causal chain.

Ingero can now investigate distributed GPU workloads across multiple nodes from a single CLI command, MCP tool call, or offline database merge. Diagnose which rank stalled and why, and see it all in a Perfetto timeline.

What's New

Node Identity & Rank Awareness

Every traced event is now tagged with its node identity and distributed training rank.

  • sudo ingero trace --node gpu-node-07 tags all events with the node name
  • Rank auto-detection from torchrun / torch.distributed.launch environment variables (RANK, LOCAL_RANK, WORLD_SIZE)
  • Event IDs are node-namespaced (gpu-node-07:1, gpu-node-07:2) — merge-safe by design
  • Schema v0.9 with backward-compatible migration for existing databases
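The merge-safe ID scheme can be pictured as a simple composite key; this is an illustrative sketch (the function name is hypothetical, not Ingero's internal API):

```go
package main

import "fmt"

// namespacedID builds a merge-safe event ID by prefixing the per-node
// sequence number with the node name. Two nodes can never collide because
// each node only ever emits IDs under its own prefix.
func namespacedID(node string, seq uint64) string {
	return fmt.Sprintf("%s:%d", node, seq)
}

func main() {
	fmt.Println(namespacedID("gpu-node-07", 1)) // gpu-node-07:1
	fmt.Println(namespacedID("gpu-node-07", 2)) // gpu-node-07:2
}
```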

Fleet Fan-Out Queries

Query your entire GPU cluster from one command:

# SQL fan-out across 3 nodes
ingero query --nodes node-1:8080,node-2:8080,node-3:8080 \
  "SELECT node, source, count(*) FROM events GROUP BY node, source"

# Cross-node causal chains sorted by severity
ingero explain --nodes node-1:8080,node-2:8080,node-3:8080

  • Results concatenated with a node column prepended
  • Partial failure handling — unreachable nodes produce warnings, not errors
  • Configure default nodes in ingero.yaml under fleet.nodes
  • --no-tls mode for dashboard on trusted networks (VPC, VPN)

MCP Fleet Tool

AI agents can now investigate entire clusters in one tool call:

query_fleet(action="chains")  →  merged causal chains from all nodes
query_fleet(action="sql", query="SELECT node, count(*) FROM events GROUP BY node")

Actions: chains, ops, overview, sql. Includes clock skew warnings.

Offline Database Merge

For air-gapped environments or offline analysis:

ingero merge node-1.db node-2.db node-3.db -o cluster.db
ingero query -d cluster.db --since 1h
ingero explain -d cluster.db --chains

  • Node-namespaced IDs ensure zero collisions
  • Stack traces deduplicated by hash
  • --force-node assigns identity to pre-v0.9 databases
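Dedup-by-hash during merge can be sketched as content-addressed storage: each stack trace body is stored once under its hash. A minimal illustration (not Ingero's actual merge code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupTraces keeps one copy of each distinct stack trace, keyed by its
// SHA-256 hash, and returns the hash each input trace maps to. Events in
// the merged database would reference traces by hash instead of by body.
func dedupTraces(traces []string) (unique map[string]string, keys []string) {
	unique = make(map[string]string)
	for _, t := range traces {
		sum := sha256.Sum256([]byte(t))
		key := hex.EncodeToString(sum[:])
		if _, seen := unique[key]; !seen {
			unique[key] = t // store the trace body only once
		}
		keys = append(keys, key)
	}
	return unique, keys
}

func main() {
	traces := []string{"main;cudaLaunchKernel", "main;cudaMemcpy", "main;cudaLaunchKernel"}
	unique, keys := dedupTraces(traces)
	fmt.Printf("%d traces -> %d unique bodies\n", len(keys), len(unique)) // 3 traces -> 2 unique bodies
}
```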

Perfetto Timeline Export

Export multi-node traces for visual timeline analysis:

ingero export --format perfetto -d cluster.db -o trace.json

Open in https://ui.perfetto.dev — one process track per node/rank, CUDA events as duration spans, causal chains as severity-colored markers. Immediately
spot which rank stalled while others waited.

Clock Skew Detection

Automatic NTP-style clock offset estimation across nodes:

WARNING: node-2 is ~47ms ahead of node-1 (RTT: 2ms)

  • Live estimation during fan-out queries (3 samples, median, ~1ms precision on LAN)
  • Offline heuristic during ingero merge (session timestamp comparison)
  • Configurable threshold: --clock-skew-threshold 10ms

New Commands

| Command | Description |
| --- | --- |
| `ingero merge` | Merge SQLite databases from multiple nodes |
| `ingero export --format perfetto` | Export to Chrome Trace Event Format |

New Flags

| Flag | Commands | Description |
| --- | --- | --- |
| `--node` | trace | Tag events with node identity |
| `--nodes` | query, explain, export | Fan-out to multiple nodes |
| `--json` | check | Output system readiness results as JSON |
| `--no-tls` | dashboard | Plain HTTP for fleet queries on trusted networks |
| `--force-node` | merge | Assign node identity to legacy databases |
| `--clock-skew-threshold` | query, explain, merge | Clock skew warning threshold |
| `--timeout` | query, explain, export | Per-node timeout for fleet queries |
| `--ca-cert`, `--client-cert`, `--client-key` | query, explain, export | Optional mTLS for fleet queries |

New API Endpoints

| Endpoint | Description |
| --- | --- |
| `POST /api/v1/query` | Execute read-only SQL (used by fleet fan-out) |
| `GET /api/v1/time` | Server timestamp for clock skew estimation |

New MCP Tool

| Tool | Description |
| --- | --- |
| `query_fleet` | Fan-out query across multiple nodes (chains, ops, overview, sql) |

Sample Data

Multi-node sample databases are included in investigations/ — 3 node databases (180-252 KB each), a merged cluster database, and a Perfetto timeline. Try them:

ingero explain --db investigations/sample-cluster.db --chains
ingero export --format perfetto --db investigations/sample-cluster.db -o trace.json

Validated On

  • 3 x AWS g4dn.xlarge (Tesla T4, 15 GB VRAM), Kernel 6.17.0-1007-aws, NVIDIA 580.126.09
  • Fan-out query, explain, merge, export, clock skew, partial failure (1 node down), single-node backward compatibility
  • Mixed binary + Docker deployment across fleet nodes
  • All existing tests pass + 80 new tests across 6 packages
  • Full Ingero-EE orchestrator validation: OOM deflection, straggler remediation, watchdog, NCCL suspend/resume, fault injection, recovery persistence

Upgrade Notes

  • Schema migration: Existing databases are automatically migrated to v0.9 on first open (adds node, rank, local_rank, world_size columns). Migration is
    non-destructive — existing data is preserved.
  • Backward compatible: All single-node workflows are unchanged. The --node flag defaults to os.Hostname() when not specified.
  • No new dependencies: Pure Go, no CGO, no new external libraries.

Bug Fixes

  • Fixed demo --no-gpu nil pointer panic when GPU is present on the machine (nil RankCache in synthetic mode)
  • Fixed --nodes "[host:port,...]" bracket format including brackets in hostnames — now strips surrounding brackets before parsing
  • Fixed MCP query_fleet sql action rejecting the query parameter — now accepts both query and sql fields
  • Added --ca-cert, --client-cert, --client-key mTLS flags to export command (query and explain had them, export did not)
  • Added --json flag to check command for structured JSON output
  • Fixed gpu-test.sh using wrong Python path — now auto-detects /opt/pytorch/bin/python3 when system Python lacks PyTorch
  • Fixed staticcheck S1001 lint: removed unnecessary copy loop in /api/v1/query handler
  • Fixed eventSeq not seeded from DB on restart — prevented ID collisions across trace sessions
  • Fixed nil pointer panic in merge batch commit loop on disk-full conditions
  • Fixed race condition in MCP fleet client initialization (now uses sync.Once)
  • Fixed silent I/O error swallowing in Perfetto export writer
  • Fixed duplicate IDs when merging multiple legacy DBs with same --force-node
  • Fixed path alias bypass in merge output-source collision check
  • Fixed URL injection via unsanitized since parameter in fleet client

v0.9.0

01 Apr 13:46
2a20d72


Ingero can now trace the full CUDA Graph lifecycle — capture, instantiate, launch — via eBPF uprobes on libcudart.so.
Zero application modification, zero CUPTI dependency, production-safe overhead.

CUDA Graph Observability

  • eBPF probes for cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate, and cudaGraphLaunch — covers the stream capture path used by PyTorch torch.compile, vLLM, and TensorRT-LLM
  • Causal correlation connects graph events to system state: OOM during graph capture, CPU scheduling interference delaying graph dispatch, graph launch frequency anomalies (pool exhaustion), and captured-but-never-launched graphs wasting VRAM
  • MCP tools: graph_lifecycle (timeline of all graph events for a PID) and graph_frequency (per-executable launch rates, hot/cold graph classification, pool saturation detection)
  • ingero explain now includes graph context in causal chains when graph events are relevant
  • Graceful degradation — if graph API symbols are absent (older CUDA), Ingero skips graph probes silently and continues normally
  • Validated at 5,000+ GraphLaunch/sec on EC2 g4dn.xlarge with torch.compile(mode="reduce-overhead"), overhead within <2% budget

Remediation API

Ingero now exposes an optional remediation API over a Unix domain socket (/tmp/ingero-remediate.sock) using type-discriminated NDJSON. External tools can consume real-time {"type":"memory"} and {"type":"straggle"} signals to build custom remediation workflows. Enable with --remediate on ingero trace. See docs/remediation-protocol.md for integration details.

Straggler Detection

  • New internal/straggler package: per-PID EMA throughput baseline tracking with sched_switch contention counting
  • Correlated detection — both throughput drop and scheduling contention must fire to avoid false positives
  • Sustained signal re-emission for downstream consumers that need periodic updates
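The correlated-detection idea above can be sketched as a small state machine: an EMA baseline per PID, and a flag that fires only when both signals agree. Thresholds and the smoothing factor here are illustrative, not Ingero's tuned values:

```go
package main

import "fmt"

// stragglerDetector keeps an EMA baseline of per-PID throughput and flags a
// straggler only when BOTH a throughput drop and scheduling contention are
// present, so a noisy dip in either signal alone is not a false positive.
type stragglerDetector struct {
	alpha    float64 // EMA smoothing factor
	baseline float64 // EMA of observed throughput (events/sec)
	primed   bool
}

// observe feeds one throughput sample plus a contention count (e.g.
// sched_switch preemptions in the window) and reports whether the
// correlated straggler condition fired.
func (d *stragglerDetector) observe(throughput float64, contention int) bool {
	if !d.primed {
		d.baseline = throughput
		d.primed = true
		return false
	}
	dropped := throughput < 0.6*d.baseline // more than 40% below baseline
	contended := contention >= 50          // illustrative threshold
	d.baseline = d.alpha*throughput + (1-d.alpha)*d.baseline
	return dropped && contended
}

func main() {
	d := &stragglerDetector{alpha: 0.2}
	d.observe(1000, 5)              // prime the baseline
	fmt.Println(d.observe(950, 80)) // contention alone: false
	fmt.Println(d.observe(400, 90)) // drop + contention: true
}
```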

v0.8.2

18 Mar 07:30


What's New

Docker containerization, a real-time GPU dashboard preview, human-friendly CLI duration parsing, and improved platform support
for WSL and AWS Deep Learning AMIs. This release also includes Ingero's first community contributions.

Highlights

  • Docker support — Multi-arch (amd64/arm64) Alpine-based container image (~10 MB), auto-published to GHCR on tag
    push via GoReleaser. Includes GPU passthrough detection and healthcheck.

  • Real-time GPU dashboard preview — New ingero dashboard command launches a web UI for live GPU observability

  • Human-friendly --since durations — explain and query now accept --since 2hours, --since 1day, --since 1w in addition to Go-native formats like 5m. Powered by go-str2duration.
    (#8) — thanks @patrickluzdev

  • Demo scenario descriptions — ingero demo --help now explains what each of the 6 scenarios demonstrates
    (incident, cold-start, memcpy-bottleneck, periodic-spike, cpu-contention, gpu-steal).
    (#7) — thanks @zamadye

  • WSL hardening — bpftool shim bypass, vmlinux.h validation, and corrected WSL GPU device paths

  • AWS Deep Learning AMI support — Auto-discovery of versioned libcudart.so on Deep Learning AMIs

  • install-deps.sh — One-command dependency installation for quick start on fresh machines

Docker Quick Start

docker pull ghcr.io/ingero-io/ingero:latest
docker run --privileged --pid=host ghcr.io/ingero-io/ingero trace --duration 30s

Build from Source

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident    # No GPU needed

For real GPU tracing:

sudo ./bin/ingero trace --duration 30s
./bin/ingero explain --since 30s

Contributors

- @patrickluzdev — human-friendly duration parsing (https://github.com/ingero-io/ingero/pull/8)
- @zamadye — demo scenario descriptions (https://github.com/ingero-io/ingero/pull/7)

## Changelog
### Features
* 21f48b2508d9b70c5e95911c40840c30e72e9a00 feat(docker): containerize Ingero with Alpine runtime and GoReleaser auto-publish
* 2c189157afb4acbd7d23ea93d22bb67005ab7175 feat: add GPU dashboard command with real-time web UI
### Bug Fixes
* 54b54c3814a35398723d2c149637d5159e4f6e1b fix(docker): add libc6-compat for glibc-linked nvidia-smi on Alpine
* dd65d2b1bd1f8a88dd8fd8204dffb5d5ba882616 fix(docker): address review feedback on container GPU detection
* 92bd37b71292e4e3c8e09551c87843767fc7584b fix(docker): address review feedback on containerization
* d371e3166edaf57a5633d0cbce9825b4f747e81c fix(docker): container GPU detection and Buildx setup
* 9bcac9dbc8864ad33346836c192c0ccf0571b200 fix: auto-discover libcudart.so on Deep Learning AMIs
* 2d948fca2e65928ea132fd7dc06e664c4d55daae fix: find versioned libcudart.so on Deep Learning AMIs
* c8c67a66ace70040c95c39650162f901ed6bd63d fix: remove unused printScenarioHelp function
* f47284dd4f39bdf43f7bfdbccdd53ceaf0b21ef6 fix: setup-wsl.sh bpftool shim bypass, vmlinux.h validation, WSL GPU paths
### Documentation
* da765d5e42e73d0d191fd117943ef76fd0fe75a5 docs: add license links for GPL-2.0 and BSD-3-Clause in README
* 98eaeed9a93e9ebe1c43027917594c67780d7225 docs: clarify make install is optional in README
* 52bfe9c0e1a3c73ccc8b4f3917e1355051b09a98 docs: comment out binary install, add build-from-source note
* 571bd0d136fc06e3bb6f708f79811ce0c944f56b docs: mention 10 GB rolling storage cap and --max-db in README
* e5ad809beb9ad7d581ec3f9425122f5af9c2172a docs: move dashboard command after mcp in README
* 60d6c16f19be6c3c08fcc840d138d1b3e9362c15 docs: reformat investigation sessions with Engineer/AI Assistant dialog
* 2754d02aac3b604f21b8669775c6d8a390f23f9e docs: renumber investigation sessions 1-2-3, separate metadata lines
* a5af4c3bdd2a6a26a4fa6e0c126a111c5b02990c docs: simplify GPU problems intro in README
* 4a089cfc026c5e7c97843098a6aa833bbc678aec docs: update session metadata
### Other
* 446a8510f2e5cd6330770b92b5d55e4e4c829cac Fix arrow direction in README diagram
* da7f85e323222a2b5ee105a96643ec841d9cf825 Fix formatting in README.md table
* 005bcabb7c16d796ed5641f6b0b0dc009723ae2c Fix formatting issues in README.md
* 896a8e14506a2bafd32c5ed509e2291304092ab3 add scenario descriptions to demo help
* 3ee065438d435abae3c3999eba658b2dd551ec59 chore: fix go-str2duration as direct dep and update test matrix
* 41096647d58e0648908a59ccb79d0eb93c6754d1 chore: force GitHub UI cache rebuild
* f3473e69a45e923e2c1e60f3202d4c35087ee2ea ci: add manual workflow_dispatch triggers to CI and Release
* 093683e08b04afdc02ced269a8474c9ee5ac3f7b ci: fix manual release trigger to publish Docker images to GHCR
* a575a91dd2f472047447d8de642b7d28d5cffc57 fixes & polishing
* 9edf95af2e125039c2d80d2086acdb4376454ced harden Go download in install-deps.sh, clean up demo title
* 65c7194f10b1f49f8fe6352ca586a5405ac1a745 install-deps shell script for quick start
* 0f603ff231f83785a2fdb7153cacd6bbbaba6d87 refactor(cli): add parseSince helper using go-str2duration and reuse in explain and query

v0.8.1

15 Mar 17:40


What's New

Seven fixes from RTX 4090 GPT-2 stress test analysis (5-phase, 237K+ events/min).

Highlights

  • DB compaction at shutdown — WAL checkpoint + VACUUM when >20% of pages are free. Integration test DB shrank from 57 MB to 2.7 MB (95% reduction)
  • Throughput-drop causal chains — new detection for when CUDA op rate drops >40% from peak but per-call latency stays flat. Catches GPU starvation that tail-ratio chains miss
  • Aggregate flush starvation fix — high-throughput periods (400K+ events/min) no longer starve the stats flusher. Event-count-based inline flush every 10K events
  • Process name persistence — dynamically discovered PIDs now have names in SQLite. explain --per-process shows process names instead of raw PIDs
  • Phase 5 block I/O visibility — checkpoint saves trigger fsync for block device tracepoint capture
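The throughput-drop condition described above reduces to a two-part predicate: rate well below peak while latency stays flat. A sketch with illustrative thresholds (not Ingero's detection code):

```go
package main

import "fmt"

// throughputDropChain reports whether the CUDA op rate has fallen more than
// 40% from its observed peak while per-call latency stayed roughly flat:
// the starvation signature that tail-latency-ratio chains miss, because no
// individual call actually got slower.
func throughputDropChain(peakRate, curRate, baseLatencyUs, curLatencyUs float64) bool {
	rateDropped := curRate < 0.6*peakRate
	latencyFlat := curLatencyUs < 1.2*baseLatencyUs // within ~20% of baseline
	return rateDropped && latencyFlat
}

func main() {
	// Rate halved, latency unchanged: the GPU is starved, not slow.
	fmt.Println(throughputDropChain(4000, 2000, 120, 118)) // true
	// Rate halved because every call got slower: a latency chain's job.
	fmt.Println(throughputDropChain(4000, 2000, 120, 300)) // false
}
```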

Test Results

  • 224 unit tests
  • RTX 4090 GPU integration: 73 PASS, 0 FAIL, 1 SKIP, 6 WARN
  • 28/28 ML Eng AI-assisted investigations PASS
  • 944 causal chains including new throughput-drop chains

Quick Start

git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident    # No GPU needed

For real GPU tracing:

sudo ./bin/ingero trace --duration 30s
./bin/ingero explain --since 30s

Full changelog: see Release Notes