
Fix ValueError in _divide_sub_cluster_jobs with numpy >= 1.24#314

Open
yashparekh261 wants to merge 1 commit into amzn:mainline from yashparekh261:pareyas-amz-patch


Fix ValueError in _divide_sub_cluster_jobs with numpy >= 1.24

Issue

DistClustering._divide_sub_cluster_jobs in pecos/distributed/xmc/base.py crashes during distributed training with numpy >= 1.24:

File "pecos/distributed/xmc/base.py", line 474, in _divide_sub_cluster_jobs
    grp_list = np.array_split(sub_tree_assign_arr_list, num_machine)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (8,) + inhomogeneous part.

sub_tree_assign_arr_list is a list of arrays with different lengths: there is one array per meta-tree leaf cluster, and clusters generally hold different numbers of labels. np.array_split internally calls np.asarray() on this list. In numpy < 1.24 that call produced an object array (accompanied by a VisibleDeprecationWarning from numpy 1.19 onward); in numpy >= 1.24 it raises ValueError for ragged/inhomogeneous sequences.

This is triggered in distributed XMC training when nr_splits produces leaf clusters of unequal size, which is the common case for real-world data.
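The failure mode can be reproduced outside PECOS with a small ragged list. This is a minimal sketch, not code from the PR: the `ragged` data and the `try_array_split` helper are hypothetical names chosen for illustration, and the outcome depends on the installed numpy version.

```python
import numpy as np

# Three arrays of different lengths, standing in for sub_tree_assign_arr_list
# (hypothetical data, not taken from PECOS).
ragged = [np.arange(3), np.arange(5), np.arange(2)]


def try_array_split(items, n_groups):
    """Return the split result, or the ValueError raised by newer numpy."""
    try:
        return np.array_split(items, n_groups)
    except ValueError as exc:
        # numpy >= 1.24 refuses to build an array from a ragged sequence.
        return exc


result = try_array_split(ragged, 2)
print(type(result).__name__, np.__version__)
```

On numpy >= 1.24 `result` is expected to be the `ValueError` described above; on older versions it is a list of object arrays.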

Description of changes

Replaced np.array_split(sub_tree_assign_arr_list, num_machine) with a pure Python divmod-based list partitioning that:

  • Handles variable-length (ragged) sub-tree assignment arrays naturally
  • Preserves the same split semantics: divides n items into num_machine groups as evenly as possible, with the first m groups getting one extra item (where n = k * num_machine + m)
  • Still produces empty lists for trailing groups when len(sub_tree_assign_arr_list) < num_machine
  • Returns list[list] directly instead of converting through numpy arrays and back via .tolist()
  • Works with all numpy versions
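The split semantics listed above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the exact code in the PR: the function name `partition_evenly` and the sample data are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_evenly(items: Sequence[T], num_groups: int) -> List[List[T]]:
    """Split items into num_groups contiguous groups, as evenly as possible.

    With n = k * num_groups + m, the first m groups get k + 1 items and the
    rest get k. Trailing groups are empty when n < num_groups.
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    k, m = divmod(len(items), num_groups)
    groups: List[List[T]] = []
    start = 0
    for i in range(num_groups):
        size = k + 1 if i < m else k
        groups.append(list(items[start:start + size]))
        start += size
    return groups


# Ragged inner lists are never inspected, so their lengths do not matter.
ragged = [[0, 1, 2], [3], [4, 5], [6, 7, 8, 9], [10]]
groups = partition_evenly(ragged, 3)
padded = partition_evenly(ragged[:2], 4)
```

Here five items split across three groups give group sizes 2, 2, 1, and two items across four groups give two singleton groups followed by two empty lists. Because the elements are only sliced, never converted to an ndarray, the ragged shapes cannot trigger the numpy error.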

Testing

No test changes required — existing test_dist_clustering passes with the fix.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…ays in _divide_sub_cluster_jobs

np.array_split fails on numpy>=1.24 when given a list of arrays with
different lengths (ragged/inhomogeneous sequences). The internal call to
np.asarray raises ValueError because it cannot create a regular ndarray
from arrays of different sizes.

Replace with plain Python list partitioning (divmod-based splitting)
which naturally handles variable-length sub-tree assignment arrays and
preserves the same split semantics including empty-group padding when
num_machines > num_sub_trees.