Releases: ingero-io/ingero
v0.9.1
Note: The multi-node features in this release (fan-out queries, offline merge, Perfetto export) are interim solutions for cross-node GPU
investigation. A dedicated cluster-level observability and diagnostics tool with native multi-node support is coming soon.
One command. Every node. Full causal chain.
Ingero can now investigate distributed GPU workloads across multiple nodes from a single CLI command, MCP tool call, or offline database merge. Diagnose which rank stalled and why, then see it all in a Perfetto timeline.
What's New
Node Identity & Rank Awareness
Every traced event is now tagged with its node identity and distributed training rank.
- `sudo ingero trace --node gpu-node-07` tags all events with the node name
- Rank auto-detection from `torchrun`/`torch.distributed.launch` environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`)
- Event IDs are node-namespaced (`gpu-node-07:1`, `gpu-node-07:2`) — merge-safe by design
- Schema v0.9 with backward-compatible migration for existing databases
Fleet Fan-Out Queries
Query your entire GPU cluster from one command:
```shell
# SQL fan-out across 3 nodes
ingero query --nodes node-1:8080,node-2:8080,node-3:8080 \
  "SELECT node, source, count(*) FROM events GROUP BY node, source"

# Cross-node causal chains sorted by severity
ingero explain --nodes node-1:8080,node-2:8080,node-3:8080
```
- Results are concatenated with a `node` column prepended
- Partial failure handling — unreachable nodes produce warnings, not errors
- Configure default nodes in `ingero.yaml` under `fleet.nodes`
- `--no-tls` mode for the dashboard on trusted networks (VPC, VPN)
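The notes reference `fleet.nodes` in `ingero.yaml` without showing the file. A plausible fragment, assuming the key takes a list of `host:port` strings (the layout is a guess, not documented syntax):

```yaml
# Hypothetical ingero.yaml fragment; only the fleet.nodes key is
# mentioned in these notes, the surrounding structure is assumed.
fleet:
  nodes:
    - node-1:8080
    - node-2:8080
    - node-3:8080
```

With defaults configured, `ingero query` and `ingero explain` would presumably not need an explicit `--nodes` flag on every invocation.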
MCP Fleet Tool
AI agents can now investigate entire clusters in one tool call:
```
query_fleet(action="chains") → merged causal chains from all nodes
query_fleet(action="sql", query="SELECT node, count(*) FROM events GROUP BY node")
```

Actions: `chains`, `ops`, `overview`, `sql`. Includes clock skew warnings.
Offline Database Merge
For air-gapped environments or offline analysis:
```shell
ingero merge node-1.db node-2.db node-3.db -o cluster.db
ingero query -d cluster.db --since 1h
ingero explain -d cluster.db --chains
```

- Node-namespaced IDs ensure zero collisions
- Stack traces are deduplicated by hash
- `--force-node` assigns identity to pre-v0.9 databases
Perfetto Timeline Export
Export multi-node traces for visual timeline analysis:
```shell
ingero export --format perfetto -d cluster.db -o trace.json
```

Open the result in https://ui.perfetto.dev — one process track per node/rank, CUDA events as duration spans, causal chains as severity-colored markers. Immediately spot which rank stalled while others waited.
Clock Skew Detection
Automatic NTP-style clock offset estimation across nodes:
```
WARNING: node-2 is ~47ms ahead of node-1 (RTT: 2ms)
```

- Live estimation during fan-out queries (3 samples, median, ~1 ms precision on LAN)
- Offline heuristic during `ingero merge` (session timestamp comparison)
- Configurable threshold: `--clock-skew-threshold 10ms`
New Commands
| Command | Description |
|---|---|
| `ingero merge` | Merge SQLite databases from multiple nodes |
| `ingero export --format perfetto` | Export to Chrome Trace Event Format |
New Flags
| Flag | Commands | Description |
|---|---|---|
| `--node` | `trace` | Tag events with node identity |
| `--nodes` | `query`, `explain`, `export` | Fan-out to multiple nodes |
| `--json` | `check` | Output system readiness results as JSON |
| `--no-tls` | `dashboard` | Plain HTTP for fleet queries on trusted networks |
| `--force-node` | `merge` | Assign node identity to legacy databases |
| `--clock-skew-threshold` | `query`, `explain`, `merge` | Clock skew warning threshold |
| `--timeout` | `query`, `explain`, `export` | Per-node timeout for fleet queries |
| `--ca-cert`, `--client-cert`, `--client-key` | `query`, `explain`, `export` | Optional mTLS for fleet queries |
New API Endpoints
| Endpoint | Description |
|---|---|
| `POST /api/v1/query` | Execute read-only SQL (used by fleet fan-out) |
| `GET /api/v1/time` | Server timestamp for clock skew estimation |
New MCP Tool
| Tool | Description |
|---|---|
| `query_fleet` | Fan-out query across multiple nodes (chains, ops, overview, sql) |
Sample Data
Multi-node sample databases are included in `investigations/` — 3 node databases (180-252 KB each), a merged cluster database, and a Perfetto timeline. Try them:

```shell
ingero explain --db investigations/sample-cluster.db --chains
ingero export --format perfetto --db investigations/sample-cluster.db -o trace.json
```

Validated On
- 3 x AWS g4dn.xlarge (Tesla T4, 15 GB VRAM), Kernel 6.17.0-1007-aws, NVIDIA 580.126.09
- Fan-out query, explain, merge, export, clock skew, partial failure (1 node down), single-node backward compatibility
- Mixed binary + Docker deployment across fleet nodes
- All existing tests pass + 80 new tests across 6 packages
- Full Ingero-EE orchestrator validation: OOM deflection, straggler remediation, watchdog, NCCL suspend/resume, fault injection, recovery persistence
Upgrade Notes
- Schema migration: Existing databases are automatically migrated to v0.9 on first open (adds `node`, `rank`, `local_rank`, `world_size` columns). Migration is non-destructive — existing data is preserved.
- Backward compatible: All single-node workflows are unchanged. The `--node` flag defaults to `os.Hostname()` when not specified.
- No new dependencies: Pure Go, no CGO, no new external libraries.
Bug Fixes
- Fixed `demo --no-gpu` nil pointer panic when a GPU is present on the machine (nil RankCache in synthetic mode)
- Fixed `--nodes "[host:port,...]"` bracket format including brackets in hostnames — now strips surrounding brackets before parsing
- Fixed MCP `query_fleet` `sql` action rejecting the `query` parameter — now accepts both `query` and `sql` fields
- Added `--ca-cert`, `--client-cert`, `--client-key` mTLS flags to the `export` command (`query` and `explain` had them; `export` did not)
- Added `--json` flag to the `check` command for structured JSON output
- Fixed `gpu-test.sh` using the wrong Python path — now auto-detects `/opt/pytorch/bin/python3` when the system Python lacks PyTorch
- Fixed staticcheck S1001 lint: removed an unnecessary copy loop in the `/api/v1/query` handler
- Fixed `eventSeq` not being seeded from the DB on restart — prevents ID collisions across trace sessions
- Fixed nil pointer panic in the merge batch commit loop on disk-full conditions
- Fixed race condition in MCP fleet client initialization (now uses `sync.Once`)
- Fixed silent I/O error swallowing in the Perfetto export writer
- Fixed duplicate IDs when merging multiple legacy DBs with the same `--force-node`
- Fixed path alias bypass in the merge output-source collision check
- Fixed URL injection via an unsanitized `since` parameter in the fleet client
v0.9.0
Ingero can now trace the full CUDA Graph lifecycle — capture, instantiate, launch — via eBPF uprobes on libcudart.so.
Zero application modification, zero CUPTI dependency, production-safe overhead.
CUDA Graph Observability
- eBPF probes for `cudaStreamBeginCapture`, `cudaStreamEndCapture`, `cudaGraphInstantiate`, and `cudaGraphLaunch` — covers the stream capture path used by PyTorch `torch.compile`, vLLM, and TensorRT-LLM
- Causal correlation connects graph events to system state: OOM during graph capture, CPU scheduling interference delaying graph dispatch, graph launch frequency anomalies (pool exhaustion), and captured-but-never-launched graphs wasting VRAM
- MCP tools: `graph_lifecycle` (timeline of all graph events for a PID) and `graph_frequency` (per-executable launch rates, hot/cold graph classification, pool saturation detection)
- `ingero explain` now includes graph context in causal chains when graph events are relevant
- Graceful degradation — if graph API symbols are absent (older CUDA), Ingero skips graph probes silently and continues normally
- Validated at 5,000+ GraphLaunch/sec on an EC2 g4dn.xlarge with `torch.compile(mode="reduce-overhead")`; overhead stayed within the <2% budget
Remediation API
Ingero now exposes an optional remediation API over a Unix domain socket (`/tmp/ingero-remediate.sock`) using type-discriminated NDJSON. External tools can consume real-time `{"type":"memory"}` and `{"type":"straggle"}` signals to build custom remediation workflows. Enable with `--remediate` on `ingero trace`. See `docs/remediation-protocol.md` for integration details.
Straggler Detection
- New `internal/straggler` package: per-PID EMA throughput baseline tracking with `sched_switch` contention counting
- Correlated detection — both a throughput drop and scheduling contention must fire to avoid false positives
- Sustained signal re-emission for downstream consumers that need periodic updates
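The correlated rule (require both signals before flagging) can be sketched as an EMA baseline plus a contention check. Thresholds and structure here are illustrative, not the `internal/straggler` implementation:

```go
package main

import "fmt"

// Detector keeps an EMA throughput baseline for one PID and only flags
// a straggler when BOTH a throughput drop and scheduling contention
// fire, mirroring the correlated-detection rule above.
type Detector struct {
	ema   float64 // events/sec baseline
	alpha float64 // EMA smoothing factor
}

// Observe updates the baseline and reports whether this interval looks
// like a straggler: throughput well below baseline AND contention high.
// The 50% and 100-switch cutoffs are illustrative, not real defaults.
func (d *Detector) Observe(throughput float64, contentionSwitches int) bool {
	if d.ema == 0 {
		d.ema = throughput // seed the baseline on the first sample
		return false
	}
	dropped := throughput < 0.5*d.ema     // well below the EMA baseline
	contended := contentionSwitches > 100 // heavy sched_switch churn
	d.ema = d.alpha*throughput + (1-d.alpha)*d.ema
	return dropped && contended
}

func main() {
	d := &Detector{alpha: 0.2}
	d.Observe(1000, 5)               // establishes the baseline
	fmt.Println(d.Observe(1000, 5))  // healthy: false
	fmt.Println(d.Observe(300, 10))  // drop but no contention: false
	fmt.Println(d.Observe(300, 500)) // drop + contention: true
}
```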
v0.8.2
What's New
Docker containerization, a real-time GPU dashboard preview, human-friendly CLI duration parsing, and improved platform support
for WSL and AWS Deep Learning AMIs. This release also includes Ingero's first community contributions.
Highlights
- Docker support — Multi-arch (amd64/arm64) Alpine-based container image (~10 MB), auto-published to GHCR on tag push via GoReleaser. Includes GPU passthrough detection and a healthcheck.
- Real-time GPU dashboard preview — New `ingero dashboard` command launches a web UI for live GPU observability
- Human-friendly `--since` durations — `explain` and `query` now accept `--since 2hours`, `--since 1day`, `--since 1w` in addition to Go-native formats like `5m`. Powered by go-str2duration. (#8) — thanks @patrickluzdev
- Demo scenario descriptions — `ingero demo --help` now explains what each of the 6 scenarios demonstrates (incident, cold-start, memcpy-bottleneck, periodic-spike, cpu-contention, gpu-steal). (#7) — thanks @zamadye
- WSL hardening — bpftool shim bypass, vmlinux.h validation, and corrected WSL GPU device paths
- AWS Deep Learning AMI support — Auto-discovery of versioned `libcudart.so` on Deep Learning AMIs
- `install-deps.sh` — One-command dependency installation for quick start on fresh machines
Docker Quick Start
```shell
docker pull ghcr.io/ingero-io/ingero:latest
docker run --privileged --pid=host ghcr.io/ingero-io/ingero trace --duration 30s
```
Build from Source
```shell
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident # No GPU needed
```

For real GPU tracing:

```shell
sudo ./bin/ingero trace --duration 30s
./bin/ingero explain --since 30s
```
Contributors
- @patrickluzdev — human-friendly duration parsing (https://github.com/ingero-io/ingero/pull/8)
- @zamadye — demo scenario descriptions (https://github.com/ingero-io/ingero/pull/7)
## Changelog
### Features
* 21f48b2508d9b70c5e95911c40840c30e72e9a00 feat(docker): containerize Ingero with Alpine runtime and GoReleaser auto-publish
* 2c189157afb4acbd7d23ea93d22bb67005ab7175 feat: add GPU dashboard command with real-time web UI
### Bug Fixes
* 54b54c3814a35398723d2c149637d5159e4f6e1b fix(docker): add libc6-compat for glibc-linked nvidia-smi on Alpine
* dd65d2b1bd1f8a88dd8fd8204dffb5d5ba882616 fix(docker): address review feedback on container GPU detection
* 92bd37b71292e4e3c8e09551c87843767fc7584b fix(docker): address review feedback on containerization
* d371e3166edaf57a5633d0cbce9825b4f747e81c fix(docker): container GPU detection and Buildx setup
* 9bcac9dbc8864ad33346836c192c0ccf0571b200 fix: auto-discover libcudart.so on Deep Learning AMIs
* 2d948fca2e65928ea132fd7dc06e664c4d55daae fix: find versioned libcudart.so on Deep Learning AMIs
* c8c67a66ace70040c95c39650162f901ed6bd63d fix: remove unused printScenarioHelp function
* f47284dd4f39bdf43f7bfdbccdd53ceaf0b21ef6 fix: setup-wsl.sh bpftool shim bypass, vmlinux.h validation, WSL GPU paths
### Documentation
* da765d5e42e73d0d191fd117943ef76fd0fe75a5 docs: add license links for GPL-2.0 and BSD-3-Clause in README
* 98eaeed9a93e9ebe1c43027917594c67780d7225 docs: clarify make install is optional in README
* 52bfe9c0e1a3c73ccc8b4f3917e1355051b09a98 docs: comment out binary install, add build-from-source note
* 571bd0d136fc06e3bb6f708f79811ce0c944f56b docs: mention 10 GB rolling storage cap and --max-db in README
* e5ad809beb9ad7d581ec3f9425122f5af9c2172a docs: move dashboard command after mcp in README
* 60d6c16f19be6c3c08fcc840d138d1b3e9362c15 docs: reformat investigation sessions with Engineer/AI Assistant dialog
* 2754d02aac3b604f21b8669775c6d8a390f23f9e docs: renumber investigation sessions 1-2-3, separate metadata lines
* a5af4c3bdd2a6a26a4fa6e0c126a111c5b02990c docs: simplify GPU problems intro in README
* 4a089cfc026c5e7c97843098a6aa833bbc678aec docs: update session metadata
### Other
* 446a8510f2e5cd6330770b92b5d55e4e4c829cac Fix arrow direction in README diagram
* da7f85e323222a2b5ee105a96643ec841d9cf825 Fix formatting in README.md table
* 005bcabb7c16d796ed5641f6b0b0dc009723ae2c Fix formatting issues in README.md
* 896a8e14506a2bafd32c5ed509e2291304092ab3 add scenario descriptions to demo help
* 3ee065438d435abae3c3999eba658b2dd551ec59 chore: fix go-str2duration as direct dep and update test matrix
* 41096647d58e0648908a59ccb79d0eb93c6754d1 chore: force GitHub UI cache rebuild
* f3473e69a45e923e2c1e60f3202d4c35087ee2ea ci: add manual workflow_dispatch triggers to CI and Release
* 093683e08b04afdc02ced269a8474c9ee5ac3f7b ci: fix manual release trigger to publish Docker images to GHCR
* a575a91dd2f472047447d8de642b7d28d5cffc57 fixes & polishing
* 9edf95af2e125039c2d80d2086acdb4376454ced harden Go download in install-deps.sh, clean up demo title
* 65c7194f10b1f49f8fe6352ca586a5405ac1a745 install-deps shell script for quick start
* 0f603ff231f83785a2fdb7153cacd6bbbaba6d87 refactor(cli): add parseSince helper using go-str2duration and reuse in explain and query
v0.8.1
What's New
Seven fixes from analysis of a 5-phase RTX 4090 GPT-2 stress test (237K+ events/min).
Highlights
- DB compaction at shutdown — WAL checkpoint + VACUUM when >20% of pages are free. Integration test DB shrank from 57 MB to 2.7 MB (95% reduction)
- Throughput-drop causal chains — new detection for when CUDA op rate drops >40% from peak but per-call latency stays flat. Catches GPU starvation that tail-ratio chains miss
- Aggregate flush starvation fix — high-throughput periods (400K+ events/min) no longer starve the stats flusher. Event-count-based inline flush every 10K events
- Process name persistence — dynamically discovered PIDs now have names in SQLite. `explain --per-process` shows process names instead of raw PIDs
- Phase 5 block I/O visibility — checkpoint saves trigger fsync for block device tracepoint capture
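The shutdown compaction above uses standard SQLite machinery. A rough shell-level equivalent (Ingero does this internally; the database name and the `freelist_count` check are illustrative assumptions about its exact mechanism):

```shell
# Checkpoint the WAL into the main file, then VACUUM to rebuild the
# database without free pages; free-page count drives the >20% trigger.
DB=demo.db
sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS events(id INTEGER);"
sqlite3 "$DB" "PRAGMA freelist_count;"                    # free pages before compaction
sqlite3 "$DB" "PRAGMA wal_checkpoint(TRUNCATE); VACUUM;"  # reclaim the space
```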
Test Results
- 224 unit tests
- RTX 4090 GPU integration: 73 PASS, 0 FAIL, 1 SKIP, 6 WARN
- 28/28 ML Eng AI-assisted investigations PASS
- 944 causal chains including new throughput-drop chains
Quick Start
```shell
git clone https://github.com/ingero-io/ingero.git
cd ingero && make build
./bin/ingero demo incident # No GPU needed
```

For real GPU tracing:

```shell
sudo ./bin/ingero trace --duration 30s
./bin/ingero explain --since 30s
```

Full changelog: see Release Notes