Fix OOM in Dask KMeans by collecting only one model after fit (#7908)
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough: fit() now fetches the full model only from the first worker and adds a module helper.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 181-184: Replace the hard-coded cp.concatenate call with the
imported concatenate helper to preserve input type and avoid unnecessary copies:
where the code builds all_labels and sets first.labels_ (currently using
cp.concatenate(all_labels)), call concatenate(all_labels) instead (the helper is
imported from cuml.dask.common.input_utils) so cuDF/CuPy/NumPy inputs keep their
original types and the single-shard path avoids an extra allocation.
```python
all_labels = [first.labels_] + self.client.gather(remote_labels)
all_inertias = [first.inertia_] + self.client.gather(remote_inertias)

first.labels_ = cp.concatenate(all_labels)
```
🧩 Analysis chain
🏁 Script executed:
```bash
# First, let's look at the target file and the specific lines mentioned
head -n 20 python/cuml/cuml/dask/cluster/kmeans.py | tail -n 10
```
🏁 Script executed:
```bash
# Check the full context around lines 181-184
sed -n '1,20p' python/cuml/cuml/dask/cluster/kmeans.py
```
🏁 Script executed:
```bash
# Check if the file exists and get line count
wc -l python/cuml/cuml/dask/cluster/kmeans.py
```
🏁 Script executed:
```bash
# Read the relevant section around lines 181-184
sed -n '170,190p' python/cuml/cuml/dask/cluster/kmeans.py
```
🏁 Script executed:
```bash
# Check imports at the top of the file
sed -n '1,30p' python/cuml/cuml/dask/cluster/kmeans.py
```
🏁 Script executed:
```bash
# Now let's examine the helper function in input_utils.py
cat python/cuml/cuml/dask/common/input_utils.py | wc -l
```
🏁 Script executed:
```bash
# Check the helper function definition
sed -n '170,200p' python/cuml/cuml/dask/common/input_utils.py
```
🏁 Script executed:
```bash
# Get more context on the concatenate function
sed -n '175,195p' python/cuml/cuml/dask/common/input_utils.py
```
Use the imported concatenate() helper to preserve input type and avoid unnecessary copies.
Line 184 hard-codes cp.concatenate(), which causes type coercion when labels_ is a cuDF object and allocates a new buffer even in the single-worker case. The already-imported concatenate() from cuml.dask.common.input_utils preserves the input type (cuDF, CuPy, or NumPy) and skips the copy when there is only one shard.
Suggested fix
```diff
- all_labels = [first.labels_] + self.client.gather(remote_labels)
- all_inertias = [first.inertia_] + self.client.gather(remote_inertias)
+ all_labels = [first.labels_, *self.client.gather(remote_labels)]
+ all_inertias = [first.inertia_, *self.client.gather(remote_inertias)]
- first.labels_ = cp.concatenate(all_labels)
+ first.labels_ = concatenate(all_labels, axis=0)
  first.inertia_ = sum(all_inertias)
```
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 181-181: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
[warning] 182-182: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 181 - 184, Replace the
hard-coded cp.concatenate call with the imported concatenate helper to preserve
input type and avoid unnecessary copies: where the code builds all_labels and
sets first.labels_ (currently using cp.concatenate(all_labels)), call
concatenate(all_labels) instead (the helper is imported from
cuml.dask.common.input_utils) so cuDF/CuPy/NumPy inputs keep their original
types and the single-shard path avoids an extra allocation.
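As an aside, for readers unfamiliar with the helper being recommended: a minimal sketch of what a type-preserving concatenate with a single-shard fast path can look like. This is illustrative only, not the actual cuml.dask.common.input_utils implementation:
```python
import cupy as cp
import numpy as np

import cudf


def concatenate(objs, axis=0):
    # Single-shard fast path: hand the object back unchanged, no copy.
    if len(objs) == 1:
        return objs[0]
    first = objs[0]
    # Preserve the input container type instead of coercing everything
    # to a CuPy array.
    if isinstance(first, (cudf.Series, cudf.DataFrame)):
        return cudf.concat(objs)
    if isinstance(first, np.ndarray):
        return np.concatenate(objs, axis=axis)
    return cp.concatenate(objs, axis=axis)
```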
Force-pushed from 06894d4 to e416630.
Actionable comments posted: 1
♻️ Duplicate comments (1)
python/cuml/cuml/dask/cluster/kmeans.py (1)
181-185: ⚠️ Potential issue | 🟠 Major
Use the concatenate() helper for labels_ aggregation instead of cp.concatenate().
Line 184 hard-codes CuPy concatenation, which can break output-type preservation and misses the single-shard no-copy path already implemented in cuml.dask.common.input_utils.concatenate().
Suggested fix
```diff
- all_labels = [first.labels_] + self.client.gather(remote_labels)
- all_inertias = [first.inertia_] + self.client.gather(remote_inertias)
+ all_labels = [first.labels_, *self.client.gather(remote_labels)]
+ all_inertias = [first.inertia_, *self.client.gather(remote_inertias)]
- first.labels_ = cp.concatenate(all_labels)
+ first.labels_ = concatenate(all_labels, axis=0)
  first.inertia_ = sum(all_inertias)
```
Based on learnings: Correctly handle cuDF, pandas, and NumPy inputs using input_to_cuml_array() for consistent conversion; preserve input type in output where sensible; handle both row-major (C) and column-major (F) memory order.
```bash
#!/bin/bash
# Verify current aggregation API usage in the changed block
rg -n "all_labels|all_inertias|cp\\.concatenate|concatenate\\(" python/cuml/cuml/dask/cluster/kmeans.py -C 2
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 181 - 185, Replace the hard-coded CuPy concatenation of labels with the shared concatenate helper so output types and single-shard no-copy behavior are preserved: instead of using cp.concatenate on all_labels, call the concatenate function from cuml.dask.common.input_utils (the same helper used elsewhere) to aggregate the result of self.client.gather(remote_labels) together with first.labels_; keep inertia aggregation as sum(all_inertias) but ensure labels flow remains through input_to_cuml_array/concatenate path to correctly handle cuDF, pandas and NumPy inputs and both C/F memory orders.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 162-185: Add regression tests that exercise the new fit
aggregation path used in kmeans (the block that builds `first =
kmeans_fit[0].result()`, then remotely fetches `labels_`/`inertia_` via
`self.client.submit(getattr, f, "labels_", ...)` / `getattr(..., "inertia_",
...)`, gathers them with `self.client.gather`, and concatenates/sums into
`first.labels_`/`first.inertia_`). Create two tests: one that runs fit on a
single-worker Dask cluster and verifies `labels_` and `inertia_` match the
non-distributed scikit-learn/cuml reference, and one that runs on a multi-worker
cluster and asserts the aggregated `labels_` (after `cp.concatenate`) and
aggregated `inertia_` (after sum) equal the expected global values and that only
the first worker materializes the full model (no redundant cluster_centers_
copies). Ensure tests use the same api symbols (`kmeans_fit`, `first`,
`labels_`, `inertia_`, `client.gather`) so future changes to aggregation logic
are covered.
---
Duplicate comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 181-185: Replace the hard-coded CuPy concatenation of labels with
the shared concatenate helper so output types and single-shard no-copy behavior
are preserved: instead of using cp.concatenate on all_labels, call the
concatenate function from cuml.dask.common.input_utils (the same helper used
elsewhere) to aggregate the result of self.client.gather(remote_labels) together
with first.labels_; keep inertia aggregation as sum(all_inertias) but ensure
labels flow remains through input_to_cuml_array/concatenate path to correctly
handle cuDF, pandas and NumPy inputs and both C/F memory orders.
```python
# Collect the full model from only the first worker (for
# cluster_centers_ etc). Extract labels_ and inertia_ from the
# remaining workers remotely to avoid pulling N redundant copies
# of cluster_centers_ back to the client.
first = kmeans_fit[0].result()

remote_labels = [
    self.client.submit(getattr, f, "labels_", workers=[w])
    for f, (w, _) in zip(
        kmeans_fit[1:], list(data.worker_to_parts.items())[1:]
    )
]
remote_inertias = [
    self.client.submit(getattr, f, "inertia_", workers=[w])
    for f, (w, _) in zip(
        kmeans_fit[1:], list(data.worker_to_parts.items())[1:]
    )
]

all_labels = [first.labels_] + self.client.gather(remote_labels)
all_inertias = [first.inertia_] + self.client.gather(remote_inertias)

first.labels_ = cp.concatenate(all_labels)
first.inertia_ = sum(all_inertias)
```
Please add regression tests for the new fit aggregation path.
This is a behavior-critical bug fix; add coverage for both single-worker and multi-worker fit to assert correct labels_/inertia_ aggregation and prevent reintroducing redundant model materialization.
As per coding guidelines **/*.py: Update unit tests when making code changes.
🧰 Tools
🪛 Ruff (0.15.6)
[warning] 170-172: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
[warning] 176-178: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
[warning] 181-181: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
[warning] 182-182: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 162 - 185, Add
regression tests that exercise the new fit aggregation path used in kmeans (the
block that builds `first = kmeans_fit[0].result()`, then remotely fetches
`labels_`/`inertia_` via `self.client.submit(getattr, f, "labels_", ...)` /
`getattr(..., "inertia_", ...)`, gathers them with `self.client.gather`, and
concatenates/sums into `first.labels_`/`first.inertia_`). Create two tests: one
that runs fit on a single-worker Dask cluster and verifies `labels_` and
`inertia_` match the non-distributed scikit-learn/cuml reference, and one that
runs on a multi-worker cluster and asserts the aggregated `labels_` (after
`cp.concatenate`) and aggregated `inertia_` (after sum) equal the expected
global values and that only the first worker materializes the full model (no
redundant cluster_centers_ copies). Ensure tests use the same api symbols
(`kmeans_fit`, `first`, `labels_`, `inertia_`, `client.gather`) so future
changes to aggregation logic are covered.
```python
remote_labels = [
    self.client.submit(getattr, f, "labels_", workers=[w])
    for f, (w, _) in zip(
        kmeans_fit[1:], list(data.worker_to_parts.items())[1:]
    )
]
remote_inertias = [
    self.client.submit(getattr, f, "inertia_", workers=[w])
    for f, (w, _) in zip(
        kmeans_fit[1:], list(data.worker_to_parts.items())[1:]
    )
]

all_labels = [first.labels_] + self.client.gather(remote_labels)
all_inertias = [first.inertia_] + self.client.gather(remote_inertias)
```
There's an overhead to tasks - it'd be better to submit one task that returns a tuple of (labels, inertia) than 2 simple tasks.
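A rough sketch of that suggestion, reusing the client and kmeans_fit names from the diff above, with workers standing in for the worker list; the helper _labels_and_inertia is hypothetical, not part of the cuml codebase:
```python
def _labels_and_inertia(model):
    # Runs on the worker that already holds the fitted model; only the
    # lightweight per-shard results cross the wire, not cluster_centers_.
    return model.labels_, model.inertia_


futures = [
    client.submit(_labels_and_inertia, f, workers=[w])
    for f, w in zip(kmeans_fit[1:], workers[1:])
]
# One task and one gathered result per worker; unpack the tuples client-side.
pairs = client.gather(futures)
remote_labels = [labels for labels, _ in pairs]
remote_inertias = [inertia for _, inertia in pairs]
```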
```python
# of cluster_centers_ back to the client.
first = kmeans_fit[0].result()

remote_labels = [
```
Thank you for putting up a PR fix so quickly!
May I know how large this remote_labels variable is if the dataset has 1 billion rows? Will that blow up scheduler memory?
labels_ is a 1D array of length n_samples (in the dask case, split across N workers). The dtype is typically int32, which brings you to 4 GiB for the array total.
In most deployments of dask the data doesn't go through the scheduler, it goes directly worker->client (also note that in cases where the scheduler runs on the same node as the client this distinction is meaningless). So you care more about the memory capacity client-side than on the scheduler itself.
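For concreteness, the back-of-the-envelope arithmetic behind that figure:
```python
n_samples = 1_000_000_000        # 1 billion rows
label_bytes = 4                  # one int32 label per sample
total_bytes = n_samples * label_bytes
print(total_bytes / 2**30)       # ~3.73 GiB for labels_ in total
```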
Thanks for clarifying! In my deployment, the scheduler does run on the same node as the client, so they share the same GPU memory. That said, since it's a 1D int32 array, 4 GB seems manageable compared to the previous issue of collecting all workers' copies of the centroid matrix.
If we were to redesign this estimator (and probably the other clustering estimators) today, I might reconsider making labels_ a fitted attribute at all. It's not needed for prediction; it's basically an artifact of the scikit-learn API. A few options:

I'd vote for the third option if possible. It's the cleanest, most idiomatic dask interface, and would avoid this inefficiency.
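For illustration, a dask-array-backed labels_ might look roughly like the sketch below, assuming per-shard label futures plus known shard lengths and dtype (label_futures and shard_lengths are illustrative names, not existing cuml symbols):
```python
import dask
import dask.array as da

# Wrap each worker-side labels shard lazily, without moving it to the
# client; downstream consumers only materialize what they actually use.
shards = [
    da.from_delayed(dask.delayed(fut), shape=(n,), dtype="int32")
    for fut, n in zip(label_futures, shard_lengths)
]
labels = da.concatenate(shards, axis=0)
```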
🧹 Nitpick comments (1)
python/cuml/cuml/dask/cluster/kmeans.py (1)
175-182: Add strict=True to zip() calls.
Both zip() calls iterate over slices that should always have matching lengths. Adding strict=True guards against subtle bugs if the data structures ever diverge.
Suggested fix
```diff
 remote_labels = [
     self.client.submit(getattr, f, "labels_", workers=[w])
-    for f, w in zip(kmeans_fit[1:], workers[1:])
+    for f, w in zip(kmeans_fit[1:], workers[1:], strict=True)
 ]
 remote_inertias = [
     self.client.submit(getattr, f, "inertia_", workers=[w])
-    for f, w in zip(kmeans_fit[1:], workers[1:])
+    for f, w in zip(kmeans_fit[1:], workers[1:], strict=True)
 ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 175 - 182, The two list comprehensions building remote_labels and remote_inertias use zip(kmeans_fit[1:], workers[1:]) without strict checking; modify both zip calls in the remote_labels and remote_inertias comprehensions to use zip(..., strict=True) so mismatched lengths between kmeans_fit[1:] and workers[1:] raise immediately; update the zip invocations used when calling self.client.submit(getattr, f, "labels_", workers=[w]) and self.client.submit(getattr, f, "inertia_", workers=[w]) to include strict=True.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 175-182: The two list comprehensions building remote_labels and
remote_inertias use zip(kmeans_fit[1:], workers[1:]) without strict checking;
modify both zip calls in the remote_labels and remote_inertias comprehensions to
use zip(..., strict=True) so mismatched lengths between kmeans_fit[1:] and
workers[1:] raise immediately; update the zip invocations used when calling
self.client.submit(getattr, f, "labels_", workers=[w]) and
self.client.submit(getattr, f, "inertia_", workers=[w]) to include strict=True.
@jcrist, the first commit extracts labels and inertia via a single unified Dask task. The second commit takes this further by keeping labels_ distributed as a Dask array instead of gathering it to the client.
🧹 Nitpick comments (1)
python/cuml/cuml/dask/cluster/kmeans.py (1)
175-182: Add strict=True to zip() calls for defensive validation.
Both kmeans_fit[1:] and workers[1:] should always have the same length, but adding strict=True catches mismatches early if future refactoring breaks that invariant. Also, per the earlier reviewer suggestion: submitting a single task returning (labels_, inertia_) instead of two separate getattr tasks per worker would reduce scheduler overhead.
Proposed fix for strict= parameter
```diff
 remote_labels = [
     self.client.submit(getattr, f, "labels_", workers=[w])
-    for f, w in zip(kmeans_fit[1:], workers[1:])
+    for f, w in zip(kmeans_fit[1:], workers[1:], strict=True)
 ]
 remote_inertias = [
     self.client.submit(getattr, f, "inertia_", workers=[w])
-    for f, w in zip(kmeans_fit[1:], workers[1:])
+    for f, w in zip(kmeans_fit[1:], workers[1:], strict=True)
 ]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 175 - 182, The zip of kmeans_fit[1:] and workers[1:] in the remote_labels and remote_inertias creation is unsafe against length mismatches; update the zip calls to zip(kmeans_fit[1:], workers[1:], strict=True) to validate lengths and fail fast. While here, reduce scheduler overhead by replacing the two separate self.client.submit(getattr, f, "labels_", ...) and self.client.submit(getattr, f, "inertia_", ...) calls with a single self.client.submit that returns (f.labels_, f.inertia_) (i.e., submit a single task per worker that retrieves both "labels_" and "inertia_"), then unpack to build remote_labels and remote_inertias from that single tuple result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 175-182: The zip of kmeans_fit[1:] and workers[1:] in the
remote_labels and remote_inertias creation is unsafe against length mismatches;
update the zip calls to zip(kmeans_fit[1:], workers[1:], strict=True) to
validate lengths and fail fast. While here, reduce scheduler overhead by
replacing the two separate self.client.submit(getattr, f, "labels_", ...) and
self.client.submit(getattr, f, "inertia_", ...) calls with a single
self.client.submit that returns (f.labels_, f.inertia_) (i.e., submit a single
task per worker that retrieves both "labels_" and "inertia_"), then unpack to
build remote_labels and remote_inertias from that single tuple result.
I pushed a small fixup commit with some improvements for dask idioms. The biggest one is that we don't want to store the [...]. I also tried to fix what I viewed as a performance bug in [...]. Anyway, I believe the actual memory issue here is resolved for now. Happy to merge.
♻️ Duplicate comments (2)
python/cuml/cuml/dask/cluster/kmeans.py (2)
172-197: ⚠️ Potential issue | 🟠 Major
internal_model now exposes shard-local labels_ and inertia_.
After Line 172, the global self.inertia_ and distributed self.labels_ are written only on the Dask estimator; the stored combined model still keeps the first worker's local values. Any get_combined_model() or other internal-model consumer can now observe fitted state that disagrees with self. Please sync those attributes when materializing the combined model, or override the accessor to source them from the Dask estimator.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 172 - 197, The internal_model stored via _set_internal_model(first) still contains shard-local labels_ and inertia_, causing get_combined_model() and other consumers to see stale local values; update the combined/internal model after you compute the global self.inertia_ and distributed self.labels_ so they reflect the Dask estimator: after computing inertia_and_lengths and constructing self.labels_, assign those consolidated values into the internal model (e.g., internal_model.labels_ = self.labels_ and internal_model.inertia_ = self.inertia_) or alter the internal_model accessor to delegate to the Dask estimator's self.labels_ and self.inertia_; ensure you update the same object returned by _set_internal_model(first) so consumers of get_combined_model() observe the synced state.
167-172: ⚠️ Potential issue | 🟠 Major
Keep internal_model as a future instead of calling .result() here.
Line 171 still brings one whole fitted estimator back to the client, including that worker's labels_ shard. For large or skewed partitions this can still be the dominant allocation and reintroduce the OOM the PR is trying to remove. BaseEstimator._set_internal_model() already accepts futures, so store kmeans_fit[0] directly and fetch only lightweight metadata (for example the labels dtype) with small helper tasks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 167 - 172, The code currently calls .result() on kmeans_fit[0] and passes the full estimator to _set_internal_model, pulling the entire fitted estimator (including labels shard) to the client; instead keep internal_model as the future by passing kmeans_fit[0] directly to BaseEstimator._set_internal_model, and create a small helper task that runs on the worker to extract only lightweight metadata needed on the client (e.g., labels dtype, n_clusters) from the future without materializing the full model locally; update the logic around kmeans_fit and any subsequent uses of internal_model to expect a future and use the helper task results for metadata consumption.
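A sketch of what that could look like, assuming _set_internal_model accepts futures as the comment states; the metadata fetched here is just an example:
```python
# Keep the fitted model on its worker; store only the future client-side.
self._set_internal_model(kmeans_fit[0])

# Pull lightweight metadata with a tiny helper task instead of
# materializing the whole estimator (and its labels_ shard) locally.
labels_dtype = self.client.submit(
    lambda m: m.labels_.dtype, kmeans_fit[0]
).result()
```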
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 172-197: The internal_model stored via _set_internal_model(first)
still contains shard-local labels_ and inertia_, causing get_combined_model()
and other consumers to see stale local values; update the combined/internal
model after you compute the global self.inertia_ and distributed self.labels_ so
they reflect the Dask estimator: after computing inertia_and_lengths and
constructing self.labels_, assign those consolidated values into the internal
model (e.g., internal_model.labels_ = self.labels_ and internal_model.inertia_ =
self.inertia_) or alter the internal_model accessor to delegate to the Dask
estimator's self.labels_ and self.inertia_; ensure you update the same object
returned by _set_internal_model(first) so consumers of get_combined_model()
observe the synced state.
- Around line 167-172: The code currently calls .result() on kmeans_fit[0] and
passes the full estimator to _set_internal_model, pulling the entire fitted
estimator (including labels shard) to the client; instead keep internal_model as
the future by passing kmeans_fit[0] directly to
BaseEstimator._set_internal_model, and create a small helper task that runs on
the worker to extract only lightweight metadata needed on the client (e.g.,
labels dtype, n_clusters) from the future without materializing the full model
locally; update the logic around kmeans_fit and any subsequent uses of
internal_model to expect a future and use the helper task results for metadata
consumption.
viclafargue left a comment
The change looks good to me. We should consider having the attributes that scale with n_samples as Dask arrays in other Dask estimators too.
Never mind, this was an actual bug in the implementation for cudf inputs. I've fixed this, and added a test.
/merge |
Merged 2e65466 into rapidsai:release/26.04
After MNMG KMeans fit, the client was calling .result() on every worker's future, pulling the full fitted estimator (including cluster_centers_) from all workers back to the client. Since cluster centers are synchronized across workers via NCCL, every copy is identical, making all but one transfer redundant.

This PR changes the post-fit aggregation to:
- Collect the full model from only the first worker (for cluster_centers_ etc.)
- Extract labels_ and inertia_ from the remaining workers remotely via client.submit(getattr, ...)