-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[data] feat: dump train/test example as JSON #2666
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR adds a useful feature for inspecting dataset examples. I've identified a potential IndexError if the train or test datasets are empty and suggested a fix to make the script more robust.
| example = train_dataset[0] | ||
| with open(os.path.join(local_dir, "train_example.json"), "w") as f: | ||
| json.dump(example, f, indent=2) | ||
| example = test_dataset[0] | ||
| with open(os.path.join(local_dir, "test_example.json"), "w") as f: | ||
| json.dump(example, f, indent=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation accesses train_dataset[0] and test_dataset[0] directly. This will raise an IndexError and crash the script if either dataset is empty. An empty dataset is a valid edge case, for example, if the source dataset has empty splits or if the mapping/filtering results in an empty dataset.
To make the script more robust, you should add a check to ensure the datasets are not empty before attempting to access their elements.
| example = train_dataset[0] | |
| with open(os.path.join(local_dir, "train_example.json"), "w") as f: | |
| json.dump(example, f, indent=2) | |
| example = test_dataset[0] | |
| with open(os.path.join(local_dir, "test_example.json"), "w") as f: | |
| json.dump(example, f, indent=2) | |
| if len(train_dataset) > 0: | |
| example = train_dataset[0] | |
| with open(os.path.join(local_dir, "train_example.json"), "w") as f: | |
| json.dump(example, f, indent=2) | |
| if len(test_dataset) > 0: | |
| example = test_dataset[0] | |
| with open(os.path.join(local_dir, "test_example.json"), "w") as f: | |
| json.dump(example, f, indent=2) |
### What does this PR do?
This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.
### API and Usage Example
This change does not alter the public API.
### Design & Code Changes
- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir
- Helps with quick inspection and reproducibility of dataset inputs
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
### What does this PR do?
This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.
### API and Usage Example
This change does not alter the public API.
### Design & Code Changes
- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir
- Helps with quick inspection and reproducibility of dataset inputs
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
### What does this PR do?
This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.
### API and Usage Example
This change does not alter the public API.
### Design & Code Changes
- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir
- Helps with quick inspection and reproducibility of dataset inputs
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
### What does this PR do?
This PR adds functionality to save one training and one testing example
as JSON files for reference, making it easier to inspect dataset
formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
Manually verified that two files train_example.json and
test_example.json are saved correctly in the specified local_dir.
### API and Usage Example
This change does not alter the public API.
### Design & Code Changes
- Added code to save train_dataset[0] and test_dataset[0] as JSON files
in local_dir
- Helps with quick inspection and reproducibility of dataset inputs
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: easy code
- [x] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
What does this PR do?
This PR adds functionality to save one training and one testing example as JSON files for reference, making it easier to inspect dataset formatting and preprocessing.
Related to potential future debugging and reproducibility improvements.
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
Manually verified that two files train_example.json and test_example.json are saved correctly in the specified local_dir.
API and Usage Example
This change does not alter the public API.
Design & Code Changes
Added code to save train_dataset[0] and test_dataset[0] as JSON files in local_dir
Helps with quick inspection and reproducibility of dataset inputs
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace.