All examples are in [`examples/torch-integration/`](../../examples/torch-integration/).
---
## Performance Tuning
By default, the integration uses a fixed heuristic to select an algorithm based on message size. For production workloads, you can achieve significantly better performance by **auto-tuning** — benchmarking every candidate algorithm, block count, and thread count for each message size at startup, then using the fastest configuration at runtime.
1. **Candidate selection** — For each power-of-two message size from 1 KB to 128 MB, the tuner picks the applicable algorithms:

   - Small messages (≤ 4 MB): `default_allreduce_nvls_packet`, `default_allreduce_packet`
   - Large messages (≥ 512 KB): `default_allreduce_rsag_zero_copy`
   - Overlapping sizes get all three candidates.
2. **Grid search** — Each candidate is run with every combination of block counts (`4, 8, 16, … 128`) and thread counts (`512, 768, 1024`); each run is captured in a CUDA graph and timed.
3. **Cross-rank consensus** — Elapsed times are averaged across all ranks with an allreduce so that every GPU selects the same configuration.
4. **Runtime dispatch** — `get_tuned_config()` rounds the actual message size up to the next power of two and returns the winning `(algorithm, nblocks, nthreads)` triple.
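The four steps above can be sketched in plain Python. This is a minimal, illustrative mock, not the real tuner: `candidates_for`, `tune`, and the injected `benchmark` callable are assumptions (the actual implementation times CUDA-graph replays and averages the timings across ranks with an allreduce), and the intermediate block counts (32, 64) are assumed to be the powers of two elided by the `…` in the text. Only the size thresholds, the grid axes, and the `get_tuned_config` rounding behavior come from the description above.

```python
KB, MB = 1 << 10, 1 << 20

def candidates_for(msg_size: int) -> list[str]:
    """Step 1: pick the algorithms applicable to a message size."""
    algos = []
    if msg_size <= 4 * MB:  # small-message candidates
        algos += ["default_allreduce_nvls_packet", "default_allreduce_packet"]
    if msg_size >= 512 * MB // 1024:  # i.e. 512 KB: large-message candidate
        algos.append("default_allreduce_rsag_zero_copy")
    return algos

def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

# Grid axes from the text; 32 and 64 are assumed (elided in the original).
NBLOCKS = [4, 8, 16, 32, 64, 128]
NTHREADS = [512, 768, 1024]

def tune(benchmark, sizes=None):
    """Steps 2-3: pick the fastest (algo, nblocks, nthreads) per size.

    `benchmark(size, algo, nb, nt)` must return a cross-rank-averaged
    elapsed time; it is injected here so the sketch stays runnable
    without a GPU.
    """
    sizes = sizes or [KB << i for i in range(18)]  # 1 KB .. 128 MB
    table = {}
    for size in sizes:
        table[size] = min(
            ((benchmark(size, a, nb, nt), (a, nb, nt))
             for a in candidates_for(size)
             for nb in NBLOCKS
             for nt in NTHREADS),
            key=lambda t: t[0],
        )[1]
    return table

def get_tuned_config(table, msg_size):
    """Step 4: round up to the next power of two and look up the winner."""
    return table[max(KB, next_pow2(msg_size))]
```

With a real benchmark injected, `table` holds the per-size winners measured once at startup; `get_tuned_config` is the only piece consulted on the hot path.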
### Loading Candidate Algorithms
The same `load_algorithms` helper from Approach 1 is reused. The tuner extracts multiple algorithm objects: