Add batch method to Dataset class
#7064
Looks good to me ! :) you might want to add the
Thanks for the feedback @lhoestq! The last commits include:
WDYT?
You can put the documentation in `process.mdx` :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed from af3d739 to 7b02d5f
I reset the head to the commit before I added the
lhoestq
left a comment
LGTM thanks ! the CI failures are unrelated to your PR
* feat: add `batch` method to `Dataset` class
* feat: add `num_proc` arg from `map` to `batch`
* test: add test for `Dataset.batch()`
* style: formatting...
* docs: move `Dataset.batch()` documentation to `process.mdx`
* docs: add `num_proc` to docs
* Apply suggestions from code review

---------

Co-authored-by: Quentin Lhoest <[email protected]>

This PR introduces a new `batch` method to the `Dataset` class, aligning its functionality with the `IterableDataset.batch()` method (implemented in #7054). The implementation also uses the existing `map` method for efficient batching of examples.

Key changes:
* Add `batch` method to `Dataset` class in `arrow_dataset.py`
* Use the existing `map` method for batching

Closes #7063
Once the approach is approved, I will create the tests and update the documentation.
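
To make the idea of batching through a batched `map` easier to follow, here is a minimal pure-Python sketch. It is not the code in `arrow_dataset.py`: `map_batched`, `wrap_as_single_row`, and the standalone `batch` function are hypothetical stand-ins that operate on a plain dict of columns, and `drop_last_batch` is assumed to behave like the option of the same name on `map`. The point is only that wrapping each chunk of rows into one list per column turns every chunk into a single output row.

```python
from typing import Any, Callable, Dict, List

Columns = Dict[str, List[Any]]


def map_batched(
    columns: Columns,
    fn: Callable[[Columns], Columns],
    batch_size: int,
    drop_last_batch: bool = False,
) -> Columns:
    """Hypothetical stand-in for a batched map over a dict of columns."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    num_rows = len(next(iter(columns.values()))) if columns else 0
    out: Columns = {}
    for start in range(0, num_rows, batch_size):
        # Slice the same row range out of every column.
        chunk = {name: values[start:start + batch_size] for name, values in columns.items()}
        chunk_rows = len(next(iter(chunk.values())))
        if drop_last_batch and chunk_rows < batch_size:
            break  # skip the incomplete final chunk
        # Concatenate whatever rows fn returns for this chunk.
        for name, values in fn(chunk).items():
            out.setdefault(name, []).extend(values)
    return out


def wrap_as_single_row(chunk: Columns) -> Columns:
    """Turn a chunk of rows into one row whose cells are lists."""
    return {name: [values] for name, values in chunk.items()}


def batch(columns: Columns, batch_size: int, drop_last_batch: bool = False) -> Columns:
    """Group rows into batches by reusing the batched map above."""
    return map_batched(columns, wrap_as_single_row, batch_size, drop_last_batch)


data = {"id": [0, 1, 2, 3, 4], "text": ["a", "b", "c", "d", "e"]}
batched = batch(data, batch_size=2)
# batched["id"] is [[0, 1], [2, 3], [4]]
```

With `drop_last_batch=True` the trailing one-element batch is dropped, so every remaining row holds exactly `batch_size` examples.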