Commit 84642b4

Merge branch 'main' into caiorocha/algorithm_api

2 parents d560b74 + ab49386

File tree

docs/guide/mscclpp-torch-integration.md (1 file changed: 129 additions, 0 deletions)
All examples are in [`examples/torch-integration/`](../../examples/torch-integration/).
---

## Performance Tuning

The default dispatch uses a fixed heuristic to choose an algorithm based on message size. For production workloads, you can achieve significantly better performance by **auto-tuning**: benchmark every candidate algorithm, block count, and thread count for each message size at startup, then use the fastest configuration at runtime.

**Full example:** [customized_comm_with_tuning.py](../../examples/torch-integration/customized_comm_with_tuning.py)

### How It Works

1. **Candidate selection** — For each power-of-two message size from 1 KB to 128 MB, the tuner picks the applicable algorithms:
   - Small messages (≤ 4 MB): `default_allreduce_nvls_packet`, `default_allreduce_packet`
   - Large messages (≥ 512 KB): `default_allreduce_rsag_zero_copy`
   - Sizes in the overlap (512 KB to 4 MB) get all three candidates.

2. **Grid search** — Each candidate is run with every combination of block counts (`4, 8, 16, … 128`) and thread counts (`512, 768, 1024`). Results are captured in a CUDA graph and timed.

3. **Cross-rank consensus** — Elapsed times are averaged across all ranks with an allreduce so that every GPU selects the same configuration.

4. **Runtime dispatch** — `get_tuned_config()` rounds the actual message size up to the next power of two and returns the winning `(algorithm, nblocks, nthreads)` triple.

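The size thresholds used for candidate selection can be expressed in a few lines of pure Python. This is an illustrative sketch only; the `candidate_algorithms` helper is not part of the example or the MSCCL++ API:

```python
def candidate_algorithms(size: int) -> list:
    """Candidate algorithm names for a message of `size` bytes,
    per the thresholds described above."""
    candidates = []
    if size <= 4 * 1024 * 1024:  # small messages: packet-based algorithms
        candidates += ["default_allreduce_nvls_packet", "default_allreduce_packet"]
    if size >= 512 * 1024:       # large messages: reduce-scatter/all-gather
        candidates.append("default_allreduce_rsag_zero_copy")
    return candidates

# The overlap region (512 KB to 4 MB) benchmarks all three candidates.
```
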
### Loading Candidate Algorithms

The same `load_algorithms` helper from Approach 1 is reused. The tuner extracts the three candidate algorithm objects by name:

```python
algorithms = load_algorithms(scratch_buffer=self.scratch_buffer, rank=self.rank)

self._algorithm_nvls_packet = [
    algo for algo in algorithms
    if algo.collective == "allreduce" and algo.name == "default_allreduce_nvls_packet"
][0]

self._algorithm_rsag_zero_copy = [
    algo for algo in algorithms
    if algo.collective == "allreduce" and algo.name == "default_allreduce_rsag_zero_copy"
][0]

self._algorithm_packet = [
    algo for algo in algorithms
    if algo.collective == "allreduce" and algo.name == "default_allreduce_packet"
][0]
```

### The Tuning Loop

The tuning loop iterates over message sizes, candidate algorithms, and kernel launch parameters. CUDA graphs are used for accurate timing:

```python
def _tune(self, n_warmup, n_graph_launches, n_ops_per_graph):
    sizes = [1 << i for i in range(10, 28)]  # 1 KB .. 128 MB
    self.best_configs = {1024: (self._algorithm_nvls_packet, 0, 0)}

    tune_tensor = torch.rand(1 << 27, dtype=torch.float16, device="cuda")
    candidates_nblocks = [4, 8, 16, 24, 32, 48, 64, 128]
    candidates_nthreads = [512, 768, 1024]

    for size in sizes:
        algos = []
        if size <= 4 * 1024 * 1024:
            algos.append(self._algorithm_nvls_packet)
            algos.append(self._algorithm_packet)
        if size >= 512 * 1024:
            algos.append(self._algorithm_rsag_zero_copy)

        best_time = float("inf")
        best_config = None

        for algo in algos:
            for nb in candidates_nblocks:
                for nt in candidates_nthreads:
                    if self._run_algo(algo, tune_tensor, size, nb, nt) != 0:
                        continue  # skip unsupported configs

                    # Warmup, then time with CUDA graphs
                    # ... (see full example for graph capture logic)

                    # Average timing across ranks
                    time_tensor = torch.full(
                        (self.world_size,), elapsed, dtype=torch.float64, device="cuda"
                    ).to(dtype=torch.float32)
                    self.all_reduce(time_tensor, op=torch.distributed.ReduceOp.SUM)
                    avg_time = time_tensor[self.rank].item() / self.world_size

                    if avg_time < best_time:
                        best_time = avg_time
                        best_config = (algo, nb, nt)

        if best_config:
            self.best_configs[size] = best_config
```
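Stripped of the CUDA specifics, the selection inside the loop is just an argmin over measured times. A CPU-only sketch, where the `measure` callable stands in for the CUDA-graph timing and cross-rank averaging in the full example:

```python
def grid_search(algos, candidates_nblocks, candidates_nthreads, measure):
    """Return the (algo, nblocks, nthreads) triple with the lowest measured time.

    `measure(algo, nb, nt)` returns an averaged elapsed time in seconds,
    or None for unsupported configurations.
    """
    best_time, best_config = float("inf"), None
    for algo in algos:
        for nb in candidates_nblocks:
            for nt in candidates_nthreads:
                elapsed = measure(algo, nb, nt)
                if elapsed is None:  # unsupported config, skip
                    continue
                if elapsed < best_time:
                    best_time, best_config = elapsed, (algo, nb, nt)
    return best_config
```

Because every rank sees the same averaged times, every rank's argmin lands on the same triple, which is the point of the cross-rank consensus step.
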

### Dispatching with the Tuned Configuration

At runtime, round the message size up to the next power of two and look up the best configuration:

```python
def get_tuned_config(self, size):
    if size < 1024:
        target_size = 1024
    elif size > 256 * 1024 * 1024:
        target_size = 256 * 1024 * 1024
    else:
        target_size = 1 << (size - 1).bit_length()
    return self.best_configs.get(target_size)

def all_reduce(self, tensor, op=torch.distributed.ReduceOp.SUM, stream=None):
    config = self.get_tuned_config(tensor.nbytes)
    algo, nblocks, nthreads = config if config else (self._algorithm_nvls_packet, 0, 0)
    algo.execute(
        comm=self.comm.communicator,
        input_buffer=tensor.data_ptr(),
        output_buffer=tensor.data_ptr(),
        input_size=tensor.nbytes,
        output_size=tensor.nbytes,
        dtype=mscclpp_utils.torch_dtype_to_mscclpp_dtype(tensor.dtype),
        op=mscclpp.ReduceOp.SUM,
        stream=stream.cuda_stream if stream else torch.cuda.current_stream().cuda_stream,
        nblocks=nblocks,
        nthreads_per_block=nthreads,
    )
```
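A note on the rounding arithmetic: for `size >= 1`, `(size - 1).bit_length()` is ⌈log2(size)⌉, so `1 << (size - 1).bit_length()` is the smallest power of two ≥ `size`. The same arithmetic in isolation (the `round_to_tuned_size` name is illustrative):

```python
def round_to_tuned_size(size: int) -> int:
    """Clamp to [1 KB, 256 MB] and round up to the next power of two."""
    lo, hi = 1024, 256 * 1024 * 1024
    if size < lo:
        return lo
    if size > hi:
        return hi
    return 1 << (size - 1).bit_length()

# 1500 bytes rounds to 2048; an exact power of two (e.g. 4096) maps to itself;
# sizes outside the range are clamped.
```

Note that the clamp ceiling (256 MB) is above the largest tuned size (128 MB), so messages larger than 128 MB miss `best_configs` and fall through to the default fallback in `all_reduce`.
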

### Running the Tuning Example

```bash
MSCCLPP_MASTER_ADDR=<ip> MSCCLPP_MASTER_PORT=<port> \
torchrun --nnodes=1 --nproc_per_node=8 customized_comm_with_tuning.py
```
