[data] feat: dump train/test example as JSON #2666

wantbook-book · 2025-07-21T14:14:06Z

What does this PR do?

This PR adds functionality to save one training and one testing example as JSON files for reference, making it easier to inspect dataset formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Manually verified that two files train_example.json and test_example.json are saved correctly in the specified local_dir.

API and Usage Example

This change does not alter the public API.

Design & Code Changes

Added code to save train_dataset[0] and test_dataset[0] as JSON files in local_dir
Helps with quick inspection and reproducibility of dataset inputs

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: easy code
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace.

gemini-code-assist

Code Review

This PR adds a useful feature for inspecting dataset examples. I've identified a potential IndexError if the train or test datasets are empty and suggested a fix to make the script more robust.

gemini-code-assist · 2025-07-21T14:15:16Z

examples/data_preprocess/math_dataset.py

+    example = train_dataset[0]
+    with open(os.path.join(local_dir, "train_example.json"), "w") as f:
+        json.dump(example, f, indent=2)
+    example = test_dataset[0]
+    with open(os.path.join(local_dir, "test_example.json"), "w") as f:
+        json.dump(example, f, indent=2)


The current implementation accesses train_dataset[0] and test_dataset[0] directly. This will raise an IndexError and crash the script if either dataset is empty. An empty dataset is a valid edge case, for example, if the source dataset has empty splits or if the mapping/filtering results in an empty dataset.

To make the script more robust, you should add a check to ensure the datasets are not empty before attempting to access their elements.

Suggested change

example = train_dataset[0]

with open(os.path.join(local_dir, "train_example.json"), "w") as f:

json.dump(example, f, indent=2)

example = test_dataset[0]

with open(os.path.join(local_dir, "test_example.json"), "w") as f:

json.dump(example, f, indent=2)

if len(train_dataset) > 0:

example = train_dataset[0]

with open(os.path.join(local_dir, "train_example.json"), "w") as f:

json.dump(example, f, indent=2)

if len(test_dataset) > 0:

example = test_dataset[0]

with open(os.path.join(local_dir, "test_example.json"), "w") as f:

json.dump(example, f, indent=2)

### What does this PR do? This PR adds functionality to save one training and one testing example as JSON files for reference, making it easier to inspect dataset formatting and preprocessing. Related to potential future debugging and reproducibility improvements. ### Checklist Before Starting - [ ] Search for similar PRs. Paste at least one query link here: ... - [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI) - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data` - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]` - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test` - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title. - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching` ### Test Manually verified that two files train_example.json and test_example.json are saved correctly in the specified local_dir. ### API and Usage Example This change does not alter the public API. ### Design & Code Changes - Added code to save train_dataset[0] and test_dataset[0] as JSON files in local_dir - Helps with quick inspection and reproducibility of dataset inputs ### Checklist Before Submitting > [!IMPORTANT] > Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review. - [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md). - [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always` - [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs). - [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: easy code - [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).

wantbook-book added 3 commits July 21, 2025 13:57

Save one example as JSON for reference

8f429be

Merge branch 'main' of github.com:volcengine/verl into main

10ecfe7

pre-commit

0423c61

gemini-code-assist bot reviewed Jul 21, 2025

View reviewed changes

eric-haibin-lin approved these changes Jul 24, 2025

View reviewed changes

eric-haibin-lin merged commit ae3506d into volcengine:main Aug 2, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[data] feat: dump train/test example as JSON #2666

[data] feat: dump train/test example as JSON #2666

Uh oh!

wantbook-book commented Jul 21, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jul 21, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[data] feat: dump train/test example as JSON #2666

[data] feat: dump train/test example as JSON #2666

Uh oh!

Conversation

wantbook-book commented Jul 21, 2025

What does this PR do?

Checklist Before Starting

Test

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jul 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants