-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[misc] fix: Handle N-D arrays and complex objects in union_numpy_dict #2768
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[misc] fix: Handle N-D arrays and complex objects in union_numpy_dict #2768
Conversation
The union_numpy_dict function previously used pd.DataFrame for equality checks, causing a ValueError on arrays with more than two dimensions. This commit replaces the brittle pandas-based logic with a robust, recursive deep equality check that supports N-dimensional arrays, object-dtype arrays, and handles circular references correctly. Expanded unit tests are added to verify the fix for 3D arrays, object-dtype arrays, NaN values, and circular references, ensuring high robustness and preventing regressions. Fixes volcengine#2766
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is a great fix! The new recursive _deep_equal and _array_equal functions are a robust replacement for the previous pandas-based comparison, correctly handling N-D arrays, NaNs, and circular references as intended. The accompanying test suite is comprehensive and covers all the important edge cases, which gives high confidence in the fix.
I have one suggestion in verl/protocol.py to further improve the robustness of the cycle detection logic by ensuring resource cleanup even in the case of an exception.
Overall, excellent work on resolving this bug.
| # Perform the specific comparison based on type | ||
| result = False | ||
| if isinstance(a, float) and math.isnan(a) and math.isnan(b): | ||
| result = True | ||
| elif isinstance(a, np.ndarray): | ||
| # We know b is also an ndarray due to the initial type check | ||
| result = _array_equal(a, b, visited) | ||
| else: | ||
| # Standard equality for all other types | ||
| result = a == b | ||
|
|
||
| # Clean up the visited set on the way out of the recursion | ||
| visited.remove(obj_id) | ||
| return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great implementation of deep equality checking, especially with the handling for cycles. However, there's a potential robustness issue.
The visited.remove(obj_id) call at the end of the function is not guaranteed to execute if an exception is raised during the comparison logic (e.g., if a == b raises an error for a specific object type).
If an exception occurs, the object's ID will remain in the visited set. While the current usage in union_numpy_dict creates a new set for each top-level array comparison, this could still lead to incorrect results for later elements within the same array comparison if the object graph is complex (e.g., contains the same object instance in multiple places, not just cycles).
To ensure the visited set is always cleaned up correctly, it's best practice to wrap the main logic in a try...finally block.
| # Perform the specific comparison based on type | |
| result = False | |
| if isinstance(a, float) and math.isnan(a) and math.isnan(b): | |
| result = True | |
| elif isinstance(a, np.ndarray): | |
| # We know b is also an ndarray due to the initial type check | |
| result = _array_equal(a, b, visited) | |
| else: | |
| # Standard equality for all other types | |
| result = a == b | |
| # Clean up the visited set on the way out of the recursion | |
| visited.remove(obj_id) | |
| return result | |
| try: | |
| # Perform the specific comparison based on type | |
| result = False | |
| if isinstance(a, float) and math.isnan(a) and math.isnan(b): | |
| result = True | |
| elif isinstance(a, np.ndarray): | |
| # We know b is also an ndarray due to the initial type check | |
| result = _array_equal(a, b, visited) | |
| else: | |
| # Standard equality for all other types | |
| result = a == b | |
| finally: | |
| # Clean up the visited set on the way out of the recursion | |
| visited.remove(obj_id) | |
| return result |
…volcengine#2768) ### What does this PR do? This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, `NaN` values, and circular references. This resolves issue volcengine#2766. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test A comprehensive unit test suite has been added to `tests/test_protocol_on_cpu.py`. The new tests cover the following scenarios, all of which now pass: * Merging dictionaries with identical 3D (and higher) dimensional arrays. * Correctly failing when N-D arrays with the same shape but different values are merged. * Handling nested `object`-dtype arrays containing other arrays, strings, and `None`. * Correctly treating `NaN` values at the same position as equal, mimicking pandas' behavior. * Safely handling circular references without causing a `RecursionError`. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
…volcengine#2768) ### What does this PR do? This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, `NaN` values, and circular references. This resolves issue volcengine#2766. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test A comprehensive unit test suite has been added to `tests/test_protocol_on_cpu.py`. The new tests cover the following scenarios, all of which now pass: * Merging dictionaries with identical 3D (and higher) dimensional arrays. * Correctly failing when N-D arrays with the same shape but different values are merged. * Handling nested `object`-dtype arrays containing other arrays, strings, and `None`. * Correctly treating `NaN` values at the same position as equal, mimicking pandas' behavior. * Safely handling circular references without causing a `RecursionError`. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
…volcengine#2768) ### What does this PR do? This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, `NaN` values, and circular references. This resolves issue volcengine#2766. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test A comprehensive unit test suite has been added to `tests/test_protocol_on_cpu.py`. The new tests cover the following scenarios, all of which now pass: * Merging dictionaries with identical 3D (and higher) dimensional arrays. * Correctly failing when N-D arrays with the same shape but different values are merged. * Handling nested `object`-dtype arrays containing other arrays, strings, and `None`. * Correctly treating `NaN` values at the same position as equal, mimicking pandas' behavior. * Safely handling circular references without causing a `RecursionError`. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
…volcengine#2768) ### What does this PR do? This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, `NaN` values, and circular references. This resolves issue volcengine#2766. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test A comprehensive unit test suite has been added to `tests/test_protocol_on_cpu.py`. The new tests cover the following scenarios, all of which now pass: * Merging dictionaries with identical 3D (and higher) dimensional arrays. * Correctly failing when N-D arrays with the same shape but different values are merged. * Handling nested `object`-dtype arrays containing other arrays, strings, and `None`. * Correctly treating `NaN` values at the same position as equal, mimicking pandas' behavior. * Safely handling circular references without causing a `RecursionError`. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
…volcengine#2768) ### What does this PR do? This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, `NaN` values, and circular references. This resolves issue volcengine#2766. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test A comprehensive unit test suite has been added to `tests/test_protocol_on_cpu.py`. The new tests cover the following scenarios, all of which now pass: * Merging dictionaries with identical 3D (and higher) dimensional arrays. * Correctly failing when N-D arrays with the same shape but different values are merged. * Handling nested `object`-dtype arrays containing other arrays, strings, and `None`. * Correctly treating `NaN` values at the same position as equal, mimicking pandas' behavior. * Safely handling circular references without causing a `RecursionError`. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
…volcengine#2768) ### What does this PR do? This PR fixes a bug in `verl.protocol.union_numpy_dict` where it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects, `NaN` values, and circular references. This resolves issue volcengine#2766. ### Checklist Before Starting - [x] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test A comprehensive unit test suite has been added to `tests/test_protocol_on_cpu.py`. The new tests cover the following scenarios, all of which now pass: * Merging dictionaries with identical 3D (and higher) dimensional arrays. * Correctly failing when N-D arrays with the same shape but different values are merged. * Handling nested `object`-dtype arrays containing other arrays, strings, and `None`. * Correctly treating `NaN` values at the same position as equal, mimicking pandas' behavior. * Safely handling circular references without causing a `RecursionError`. ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ... - [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
What does this PR do?
This PR fixes a bug in
verl.protocol.union_numpy_dictwhere it would crash on NumPy arrays with more than 2 dimensions. It replaces the underlying comparison logic with a robust, recursive function that can handle N-D arrays, nested objects,NaNvalues, and circular references.This resolves issue #2766.
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
A comprehensive unit test suite has been added to
tests/test_protocol_on_cpu.py. The new tests cover the following scenarios, all of which now pass:object-dtype arrays containing other arrays, strings, andNone.NaNvalues at the same position as equal, mimicking pandas' behavior.RecursionError.Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)