Skip to content

Conversation

@wantbook-book
Copy link
Contributor

What does this PR do?

This PR adds functionality to save one training and one testing example as JSON files for reference, making it easier to inspect dataset formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Manually verified that two files train_example.json and test_example.json are saved correctly in the specified local_dir.

API and Usage Example

This change does not alter the public API.

Design & Code Changes

  • Added code to save train_dataset[0] and test_dataset[0] as JSON files in local_dir

  • Helps with quick inspection and reproducibility of dataset inputs

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR adds a useful feature for inspecting dataset examples. I've identified a potential IndexError if the train or test datasets are empty and suggested a fix to make the script more robust.

Comment on lines +79 to +84
example = train_dataset[0]
with open(os.path.join(local_dir, "train_example.json"), "w") as f:
json.dump(example, f, indent=2)
example = test_dataset[0]
with open(os.path.join(local_dir, "test_example.json"), "w") as f:
json.dump(example, f, indent=2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation accesses train_dataset[0] and test_dataset[0] directly. This will raise an IndexError and crash the script if either dataset is empty. An empty dataset is a valid edge case, for example, if the source dataset has empty splits or if the mapping/filtering results in an empty dataset.

To make the script more robust, you should add a check to ensure the datasets are not empty before attempting to access their elements.

Suggested change
example = train_dataset[0]
with open(os.path.join(local_dir, "train_example.json"), "w") as f:
json.dump(example, f, indent=2)
example = test_dataset[0]
with open(os.path.join(local_dir, "test_example.json"), "w") as f:
json.dump(example, f, indent=2)
if len(train_dataset) > 0:
example = train_dataset[0]
with open(os.path.join(local_dir, "train_example.json"), "w") as f:
json.dump(example, f, indent=2)
if len(test_dataset) > 0:
example = test_dataset[0]
with open(os.path.join(local_dir, "test_example.json"), "w") as f:
json.dump(example, f, indent=2)

@eric-haibin-lin eric-haibin-lin merged commit ae3506d into volcengine:main Aug 2, 2025
8 checks passed
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
### What does this PR do?

This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.

### API and Usage Example

This change does not alter the public API.

### Design & Code Changes

- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir

- Helps with quick inspection and reproducibility of dataset inputs

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
HaeChan0305 pushed a commit to HaeChan0305/MLILAB-GRPO that referenced this pull request Aug 8, 2025
### What does this PR do?

This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.

### API and Usage Example

This change does not alter the public API.

### Design & Code Changes

- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir

- Helps with quick inspection and reproducibility of dataset inputs

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Aug 11, 2025
### What does this PR do?

This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.

### API and Usage Example

This change does not alter the public API.

### Design & Code Changes

- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir

- Helps with quick inspection and reproducibility of dataset inputs

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
### What does this PR do?

This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.

### API and Usage Example

This change does not alter the public API.

### Design & Code Changes

- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir

- Helps with quick inspection and reproducibility of dataset inputs

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants