Conversation

@vermouth1992 (Collaborator)

What does this PR do?

  • Fix a severe bug in DataProto that caused modifications to be applied incorrectly
  • We should not use a consolidated tensordict for any operation other than saving and data transfer, as the following example shows:
import torch
from tensordict import TensorDict

a = TensorDict({"a": torch.zeros(1, 1)}, device='cuda', batch_size=[1])
a_consolidate = a.consolidate()

# Modifying the regular tensordict and moving it across devices works as expected.
a['b'] = torch.ones(1, 1)
a = a.to('cpu')

# The same modification on the consolidated tensordict is silently lost
# after the device transfer.
a_consolidate['b'] = torch.ones(1, 1)
a_consolidate = a_consolidate.to('cpu')

print(a['a'], a['b'])  # 0, 1
print(a_consolidate['a'], a_consolidate['b'])  # 0, 0
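
For reference, a minimal sketch of the safe pattern this PR converges on (getstate_sketch is a hypothetical standalone rendering of DataProto.__getstate__; surrounding context is omitted): consolidation is confined to a local temporary used only for torch.save, so the batch attribute is never mutated.

import io

import tensordict
import torch
from packaging import version

def getstate_sketch(data):
    # Consolidate into a local temporary for serialization only;
    # data.batch itself is left untouched.
    buffer = io.BytesIO()
    if version.parse(tensordict.__version__) >= version.parse("0.5.0") and data.batch is not None:
        batch_to_save = data.batch.contiguous().consolidate()
    else:
        batch_to_save = data.batch
    torch.save(batch_to_save, buffer)
    return buffer.getvalue(), data.non_tensor_batch, data.meta_info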

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review; otherwise, the reviewer might deprioritize this PR.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request aims to fix a bug in DataProto.__getstate__ where in-place consolidation of self.batch caused issues with subsequent modifications. The approach of creating a temporary consolidated variable is correct. However, the current implementation introduces a critical bug: an UnboundLocalError will occur if the consolidation logic is skipped, because the variable passed to torch.save would not be defined. I've provided a comment with a suggested fix to handle this case correctly.

verl/protocol.py (Outdated)
Comment on lines 334 to 337
         if version.parse(tensordict.__version__) >= version.parse("0.5.0") and self.batch is not None:
             self.batch = self.batch.contiguous()
-            self.batch = self.batch.consolidate()
-        torch.save(self.batch, buffer)
+            batch_consolidated = self.batch.consolidate()
+        torch.save(batch_consolidated, buffer)

Severity: critical

This change introduces a potential UnboundLocalError. If the condition on line 334 is false (e.g., self.batch is None or the tensordict version is older), the if block is skipped. In that case, batch_consolidated is never defined, but it's used in torch.save on line 337, which will cause a crash.

To fix this, you should handle the case where the condition is false. A good approach is to move the torch.save call into an if/else block to ensure a value is always saved, preserving the original behavior for the case where consolidation doesn't happen.
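
A minimal standalone illustration of the failure mode (hypothetical code, not from the PR):

def save_sketch(batch):
    if batch is not None:
        batch_consolidated = batch  # stands in for batch.consolidate()
    # When batch is None, the if block is skipped and the next line
    # raises UnboundLocalError.
    return batch_consolidated

save_sketch(None)  # UnboundLocalError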

Suggested change
        if version.parse(tensordict.__version__) >= version.parse("0.5.0") and self.batch is not None:
            self.batch = self.batch.contiguous()
            batch_consolidated = self.batch.consolidate()
            torch.save(batch_consolidated, buffer)
        else:
            torch.save(self.batch, buffer)

@vermouth1992 (Collaborator, Author)

/gemini review

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request addresses a significant bug in DataProto.__getstate__ by preventing the in-place modification of self.batch with a consolidated version, which previously caused issues with subsequent operations. The fix correctly uses a temporary variable for the consolidated batch during serialization. However, this change introduces a critical UnboundLocalError because the temporary variable is not defined if the consolidation logic is skipped. I have provided a code suggestion to resolve this by ensuring the variable passed to torch.save is always initialized.

Comment on lines 333 to 339
         buffer = io.BytesIO()
         if version.parse(tensordict.__version__) >= version.parse("0.5.0") and self.batch is not None:
             self.batch = self.batch.contiguous()
-            self.batch = self.batch.consolidate()
-        torch.save(self.batch, buffer)
+            batch_consolidated = self.batch.consolidate()
+        torch.save(batch_consolidated, buffer)
         buffer_bytes = buffer.getvalue()
         return buffer_bytes, self.non_tensor_batch, self.meta_info

Severity: critical

This change introduces a potential UnboundLocalError. The variable batch_consolidated is only defined within the if block on line 336. If the condition on line 334 is false (e.g., self.batch is None or an older tensordict version is used), torch.save on line 337 will be called with an undefined variable, causing a crash.

To fix this, you should ensure the variable passed to torch.save is always defined, regardless of the if condition's outcome.

Suggested change
        buffer = io.BytesIO()
        batch_to_save = self.batch
        if version.parse(tensordict.__version__) >= version.parse("0.5.0") and self.batch is not None:
            self.batch = self.batch.contiguous()
            batch_to_save = self.batch.consolidate()
        torch.save(batch_to_save, buffer)
        buffer_bytes = buffer.getvalue()
        return buffer_bytes, self.non_tensor_batch, self.meta_info

@vermouth1992 (Collaborator, Author)

/gemini review

@gemini-code-assist (bot, Contributor) left a comment

Code Review

The code changes fix a severe bug in DataProto that caused modifications to be applied incorrectly. The changes isolate the tensordict consolidation to a local variable, ensuring the DataProto instance's state remains unchanged.

verl/protocol.py (Outdated)
Comment on lines 335 to 339
            batch = self.batch.contiguous()
            batch_consolidated = batch.consolidate()
        else:
            batch_consolidated = self.batch
        torch.save(batch_consolidated, buffer)

Severity: high

The variable name batch_consolidated is slightly misleading, as the else branch assigns self.batch to it, which is not consolidated. This could be confusing for future maintainers. For improved clarity, use a more neutral name like batch_to_save. Additionally, the intermediate batch variable can be removed by chaining the .contiguous().consolidate() calls.

Suggested change
            batch_to_save = self.batch.contiguous().consolidate()
        else:
            batch_to_save = self.batch
        torch.save(batch_to_save, buffer)

@vermouth1992 (Collaborator, Author)

/gemini review

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request addresses a critical bug in DataProto.__getstate__ where the object's state was mutated during serialization. The fix correctly avoids this side effect by using a temporary variable for the batch to be saved, ensuring self.batch is not modified. This prevents potential hard-to-debug issues with object state. I've also suggested a performance improvement to cache the tensordict version check, as it's currently performed on every call to __getstate__.

-            self.batch = self.batch.contiguous()
-            self.batch = self.batch.consolidate()
-        torch.save(self.batch, buffer)
+            batch_to_save = self.batch.contiguous().consolidate()

Severity: high

This line is executed based on a condition on line 334 that repeatedly parses the tensordict version on every __getstate__ call. This is inefficient as __getstate__ can be a hot path during serialization. For better performance, the version check should be performed once at module startup and the result cached in a constant. For example:

_TENSORDICT_V0_5_0_OR_GREATER = version.parse(tensordict.__version__) >= version.parse("0.5.0")

This constant can then be used in the if condition, avoiding repeated parsing.
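
A sketch of what that could look like (needs_consolidation is a hypothetical helper for illustration; version is assumed to come from the packaging package, as version.parse suggests):

import tensordict
from packaging import version

# Evaluated once at import time; hot paths then read a plain bool.
_TENSORDICT_V0_5_0_OR_GREATER = version.parse(tensordict.__version__) >= version.parse("0.5.0")

def needs_consolidation(batch):
    # Replaces the per-call version.parse(...) comparison in __getstate__.
    return _TENSORDICT_V0_5_0_OR_GREATER and batch is not None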

@vermouth1992 vermouth1992 merged commit 083da9a into main Aug 8, 2025
54 of 57 checks passed
@vermouth1992 vermouth1992 deleted the vermouth1992-patch-1 branch August 8, 2025 00:24
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Aug 11, 2025
ChangyiYang pushed a commit to SwordFaith/verl that referenced this pull request Aug 16, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
