Add concatenate_datasets for iterable datasets
#4500
mariosasko
left a comment
Looks good already!
There is a slight difference in concatenate_datasets between the version for map-style datasets and the one for iterable datasets:
- if axis=0, the map-style version checks the feature types (features can be None in iterable datasets, so it's ok not to have this check) of the shared columns, but doesn't require the equal set of column names among the input datasets, i.e., the following works:
    >>> from datasets import *
    >>> a = Dataset.from_dict({"a": [1, 2]})
    >>> b = Dataset.from_dict({"b": ["aa", "bb"]})
    >>> concatenate_datasets([a, b])
  We need to address this.
- if axis=1, the map-style version checks the length of input datasets (besides the column names check for duplicates) and throws an error if the lengths are not equal. Somewhat expected that this is ignored for iterable datasets, so it can stay as-is IMO.
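To make the two cases above concrete, here is a minimal pure-Python sketch of the checks as described in this comment. It is not the library's implementation: `check_axis0`, `check_axis1`, and the `Features` alias (column name mapped to a type name) are hypothetical names introduced only for illustration.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical stand-in for a dataset's features: column name -> type name.
Features = Dict[str, str]


def check_axis0(features_list: Sequence[Optional[Features]]) -> None:
    """Check that columns shared by several datasets have the same type.

    Entries that are None (features can be unknown for iterable datasets)
    are skipped, and the sets of column names are allowed to differ.
    """
    expected: Features = {}
    for features in features_list:
        if features is None:
            continue
        for name, dtype in features.items():
            if name in expected and expected[name] != dtype:
                raise ValueError(
                    f"Column {name!r} has type {dtype!r}, expected {expected[name]!r}"
                )
            expected.setdefault(name, dtype)


def check_axis1(
    columns_list: Sequence[List[str]],
    lengths: Optional[Sequence[int]] = None,
) -> None:
    """Reject duplicated column names and, when lengths are known, unequal lengths."""
    seen = set()
    for columns in columns_list:
        duplicated = seen.intersection(columns)
        if duplicated:
            raise ValueError(f"Duplicated column names: {sorted(duplicated)}")
        seen.update(columns)
    # Assumption: lengths=None models iterable datasets, whose length is unknown.
    if lengths is not None and len(set(lengths)) > 1:
        raise ValueError(f"Datasets have different lengths: {list(lengths)}")


# Disjoint column names pass the axis=0 check, as in the example above.
check_axis0([{"a": "int64"}, {"b": "string"}])
# Unknown lengths: only the duplicate-name check runs.
check_axis1([["a"], ["b"]], lengths=None)
```

In this sketch the axis=0 check accepts inputs with disjoint columns, which is the behavior the comment points out, and the axis=1 length check is skipped whenever lengths are not provided.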
Two nits in code:
Thanks! I addressed your comments :)
Indeed, here is what I did to fix this:
mariosasko
left a comment
Cool!
We might have to align the non-streaming concatenation with this behavior though, for consistency. What do you think?
Yes, we can address that in a subsequent PR
Added more comments as suggested, and some typing.
While factorizing _apply_features_types for both IterableDataset and TypedExamplesIterable, I fixed a missing [...]
Let me know what you think now @mariosasko
mariosasko
left a comment
Looks good! Thanks!
This was resolved by the feature being implemented for iterable datasets: huggingface/datasets#4500
concatenate_datasets currently only supports lists of datasets.Dataset, not lists of datasets.IterableDataset like interleave_datasets
Fix #2564
I also moved _interleave_map_style_datasets from combine.py to arrow_dataset.py, since the logic depends a lot on the Dataset object internals
And I moved concatenate_datasets from arrow_dataset.py to combine.py to have it with interleave_datasets (though it's also copied in arrow_dataset module for backward compatibility for now)
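As a rough picture of what concatenating iterable datasets means, the sketch below lazily combines plain Python streams of example dicts. It is an illustration under stated assumptions, not the code added in this PR: `concat_streams`, `gen_a`, and `gen_b` are hypothetical names, and stopping at the shortest input for axis=1 is a choice made for this sketch that the library may not share.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, Sequence

Example = Dict[str, object]


def concat_streams(
    sources: Sequence[Iterable[Example]], axis: int = 0
) -> Iterator[Example]:
    """Lazily concatenate streams of examples (hypothetical helper, not the datasets API)."""
    if axis == 0:
        # Vertical: yield every example of the first source, then the second, and so on.
        yield from chain.from_iterable(sources)
    elif axis == 1:
        # Horizontal: merge the i-th examples of all sources into one example.
        # zip() stops at the shortest source; that is an assumption of this sketch.
        for rows in zip(*sources):
            merged: Example = {}
            for row in rows:
                merged.update(row)
            yield merged
    else:
        raise ValueError(f"axis must be 0 or 1, got {axis}")


def gen_a() -> Iterator[Example]:
    yield {"a": 1}
    yield {"a": 2}


def gen_b() -> Iterator[Example]:
    yield {"b": "aa"}
    yield {"b": "bb"}


vertical = list(concat_streams([gen_a(), gen_b()], axis=0))
horizontal = list(concat_streams([gen_a(), gen_b()], axis=1))
```

Nothing is materialized until the result is iterated, which is the property that distinguishes the iterable case from the map-style one.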