Skip to content

Conversation

@MikeDean2367
Copy link
Contributor

@MikeDean2367 MikeDean2367 commented Jul 26, 2025

What does this PR do?

This PR fixes a bug in verl.protocol.union_numpy_dict where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, NaN values, and circular references.

This resolves issue #2766.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

A comprehensive unit test suite has been added to tests/test_protocol_on_cpu.py. The new tests cover the following scenarios, all of which now pass:

  • Merging dictionaries with identical 3D (and higher) dimensional arrays.
  • Correctly failing when N-D arrays with the same shape but different values are merged.
  • Handling nested object-dtype arrays containing other arrays, strings, and None.
  • Correctly treating NaN values at the same position as equal, mimicking pandas' behavior.
  • Safely handling circular references without causing a RecursionError.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

The union_numpy_dict function previously used pd.DataFrame for equality checks, causing a ValueError on arrays with more than two dimensions.

This commit replaces the brittle pandas-based logic with a robust, recursive deep equality check that supports N-dimensional arrays, object-dtype arrays, and handles circular references correctly.

Expanded unit tests are added to verify the fix for 3D arrays, object-dtype arrays, NaN values, and circular references, ensuring high robustness and preventing regressions.

Fixes volcengine#2766
@CLAassistant
Copy link

CLAassistant commented Jul 26, 2025

CLA assistant check
All committers have signed the CLA.

@MikeDean2367 MikeDean2367 changed the title fix(protocol): Handle N-D and object-dtype arrays in union_numpy_dict [misc] fix: Handle N-D arrays and complex objects in union_numpy_dict Jul 26, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a great fix! The new recursive _deep_equal and _array_equal functions are a robust replacement for the previous pandas-based comparison, correctly handling N-D arrays, NaNs, and circular references as intended. The accompanying test suite is comprehensive and covers all the important edge cases, which gives high confidence in the fix.

I have one suggestion in verl/protocol.py to further improve the robustness of the cycle detection logic by ensuring resource cleanup even in the case of an exception.

Overall, excellent work on resolving this bug.

Comment on lines +168 to +181
# Perform the specific comparison based on type
result = False
if isinstance(a, float) and math.isnan(a) and math.isnan(b):
result = True
elif isinstance(a, np.ndarray):
# We know b is also an ndarray due to the initial type check
result = _array_equal(a, b, visited)
else:
# Standard equality for all other types
result = a == b

# Clean up the visited set on the way out of the recursion
visited.remove(obj_id)
return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This is a great implementation of deep equality checking, especially with the handling for cycles. However, there's a potential robustness issue.

The visited.remove(obj_id) call at the end of the function is not guaranteed to execute if an exception is raised during the comparison logic (e.g., if a == b raises an error for a specific object type).

If an exception occurs, the object's ID will remain in the visited set. While the current usage in union_numpy_dict creates a new set for each top-level array comparison, this could still lead to incorrect results for later elements within the same array comparison if the object graph is complex (e.g., contains the same object instance in multiple places, not just cycles).

To ensure the visited set is always cleaned up correctly, it's best practice to wrap the main logic in a try...finally block.

Suggested change
# Perform the specific comparison based on type
result = False
if isinstance(a, float) and math.isnan(a) and math.isnan(b):
result = True
elif isinstance(a, np.ndarray):
# We know b is also an ndarray due to the initial type check
result = _array_equal(a, b, visited)
else:
# Standard equality for all other types
result = a == b
# Clean up the visited set on the way out of the recursion
visited.remove(obj_id)
return result
try:
# Perform the specific comparison based on type
result = False
if isinstance(a, float) and math.isnan(a) and math.isnan(b):
result = True
elif isinstance(a, np.ndarray):
# We know b is also an ndarray due to the initial type check
result = _array_equal(a, b, visited)
else:
# Standard equality for all other types
result = a == b
finally:
# Clean up the visited set on the way out of the recursion
visited.remove(obj_id)
return result

@vermouth1992 vermouth1992 merged commit 00ac37f into volcengine:main Jul 27, 2025
51 of 56 checks passed
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
…volcengine#2768)

### What does this PR do?

This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would
crash on NumPy arrays with more than 2 dimensions. It replaces the
underlying comparison logic with a robust, recursive function that can
handle N-D arrays, nested objects, `NaN` values, and circular
references.

This resolves issue volcengine#2766.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

A comprehensive unit test suite has been added to
`tests/test_protocol_on_cpu.py`. The new tests cover the following
scenarios, all of which now pass:
* Merging dictionaries with identical 3D (and higher) dimensional
arrays.
* Correctly failing when N-D arrays with the same shape but different
values are merged.
* Handling nested `object`-dtype arrays containing other arrays,
strings, and `None`.
* Correctly treating `NaN` values at the same position as equal,
mimicking pandas' behavior.
* Safely handling circular references without causing a
`RecursionError`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
SumanthRH pushed a commit to SumanthRH/verl that referenced this pull request Jul 29, 2025
…volcengine#2768)

### What does this PR do?

This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would
crash on NumPy arrays with more than 2 dimensions. It replaces the
underlying comparison logic with a robust, recursive function that can
handle N-D arrays, nested objects, `NaN` values, and circular
references.

This resolves issue volcengine#2766.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

A comprehensive unit test suite has been added to
`tests/test_protocol_on_cpu.py`. The new tests cover the following
scenarios, all of which now pass:
* Merging dictionaries with identical 3D (and higher) dimensional
arrays.
* Correctly failing when N-D arrays with the same shape but different
values are merged.
* Handling nested `object`-dtype arrays containing other arrays,
strings, and `None`.
* Correctly treating `NaN` values at the same position as equal,
mimicking pandas' behavior.
* Safely handling circular references without causing a
`RecursionError`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 31, 2025
…volcengine#2768)

### What does this PR do?

This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would
crash on NumPy arrays with more than 2 dimensions. It replaces the
underlying comparison logic with a robust, recursive function that can
handle N-D arrays, nested objects, `NaN` values, and circular
references.

This resolves issue volcengine#2766.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

A comprehensive unit test suite has been added to
`tests/test_protocol_on_cpu.py`. The new tests cover the following
scenarios, all of which now pass:
* Merging dictionaries with identical 3D (and higher) dimensional
arrays.
* Correctly failing when N-D arrays with the same shape but different
values are merged.
* Handling nested `object`-dtype arrays containing other arrays,
strings, and `None`.
* Correctly treating `NaN` values at the same position as equal,
mimicking pandas' behavior.
* Safely handling circular references without causing a
`RecursionError`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
…volcengine#2768)

### What does this PR do?

This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would
crash on NumPy arrays with more than 2 dimensions. It replaces the
underlying comparison logic with a robust, recursive function that can
handle N-D arrays, nested objects, `NaN` values, and circular
references.

This resolves issue volcengine#2766.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

A comprehensive unit test suite has been added to
`tests/test_protocol_on_cpu.py`. The new tests cover the following
scenarios, all of which now pass:
* Merging dictionaries with identical 3D (and higher) dimensional
arrays.
* Correctly failing when N-D arrays with the same shape but different
values are merged.
* Handling nested `object`-dtype arrays containing other arrays,
strings, and `None`.
* Correctly treating `NaN` values at the same position as equal,
mimicking pandas' behavior.
* Safely handling circular references without causing a
`RecursionError`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
HyperdriveHustle pushed a commit to HyperdriveHustle/verl that referenced this pull request Aug 20, 2025
…volcengine#2768)

### What does this PR do?

This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would
crash on NumPy arrays with more than 2 dimensions. It replaces the
underlying comparison logic with a robust, recursive function that can
handle N-D arrays, nested objects, `NaN` values, and circular
references.

This resolves issue volcengine#2766.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

A comprehensive unit test suite has been added to
`tests/test_protocol_on_cpu.py`. The new tests cover the following
scenarios, all of which now pass:
* Merging dictionaries with identical 3D (and higher) dimensional
arrays.
* Correctly failing when N-D arrays with the same shape but different
values are merged.
* Handling nested `object`-dtype arrays containing other arrays,
strings, and `None`.
* Correctly treating `NaN` values at the same position as equal,
mimicking pandas' behavior.
* Safely handling circular references without causing a
`RecursionError`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
…volcengine#2768)

### What does this PR do?

This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would
crash on NumPy arrays with more than 2 dimensions. It replaces the
underlying comparison logic with a robust, recursive function that can
handle N-D arrays, nested objects, `NaN` values, and circular
references.

This resolves issue volcengine#2766.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

A comprehensive unit test suite has been added to
`tests/test_protocol_on_cpu.py`. The new tests cover the following
scenarios, all of which now pass:
* Merging dictionaries with identical 3D (and higher) dimensional
arrays.
* Correctly failing when N-D arrays with the same shape but different
values are merged.
* Handling nested `object`-dtype arrays containing other arrays,
strings, and `None`.
* Correctly treating `NaN` values at the same position as equal,
mimicking pandas' behavior.
* Safely handling circular references without causing a
`RecursionError`.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants