Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Jun 15, 2022

concatenate_datasets currently only supports lists of datasets.Dataset, not lists of datasets.IterableDataset like interleave_datasets

Fix #2564

I also moved _interleave_map_style_datasets from combine.py to arrow_dataset.py, since the logic depends a lot on the Dataset object internals

And I moved concatenate_datasets from arrow_dataset.py to combine.py to have it with interleave_datasets (though it's also copied in arrow_dataset module for backward compatibility for now)

@lhoestq lhoestq requested a review from mariosasko June 15, 2022 13:58
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 15, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq lhoestq marked this pull request as ready for review June 15, 2022 14:19
Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good already!

There is a slight difference in concatenate_datasets between the version for map-style datasets and the one for iterable datasets:

  • if axis=0, the map-style version checks the feature types (features can be None in iterable datasets, so it's ok not to have this check) of the shared columns, but doesn't require the equal set of column names among the input datasets, i.e., the following works:
     >>> from datasets import *
     >>> a = Dataset.from_dict({"a": [1, 2]})
     >>> b = Dataset.from_dict({"b": ["aa", "bb"]})
     >>> concatenate_datasets([a, b])
    We need to address this.
  • if axis=1, the map-style version checks the length of input datasets (besides the column names check for duplicates) and throws an error if the lengths are not equal. Somewhat expected that this is ignored for iterable datasets, so it can stay as-is IMO.

Two nits in code:

@lhoestq
Copy link
Member Author

lhoestq commented Jun 21, 2022

Thanks ! I addressed your comments :)

There is a slight difference in concatenate_datasets between the version for map-style datasets and the one for iterable datasets

Indeed, here is what I did to fix this:

  • axis 0: fill missing columns with None.
    (I first iterate over the input datasets to infer their columns from the first examples, then I set the features of the resulting dataset to be the merged features)
    This is consistent with non-streaming concatenation

  • axis 1: fill the missing rows with None, for consistency with axis 0
    (but let me know what you think, I can still revert this behavior and raise an error when one of the dataset runs out of examples)
    We might have to align the non-streaming concatenation with this behavior though, for consistency. What do you think ?

@lhoestq lhoestq requested a review from mariosasko June 21, 2022 13:32
Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

We might have to align the non-streaming concatenation with this behavior though, for consistency. What do you think ?

Yes, we can address that in a subsequent PR

@lhoestq
Copy link
Member Author

lhoestq commented Jun 28, 2022

Added more comments as suggested, and some typing

While factorizing _apply_features_types for both IterableDataset and TypedExamplesIterable, I fixed a missing token_per_repo_id that was not passed to TypedExamplesIteable

Let me know what you think now @mariosasko

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thanks!

@lhoestq lhoestq merged commit f5826ef into master Jun 28, 2022
@lhoestq lhoestq deleted the concatenate-iterable-dataset branch June 28, 2022 21:15
KennethEnevoldsen added a commit to danish-foundation-models/site that referenced this pull request Oct 10, 2022
This was resolved by the feature being implemented for iterable datasets:
huggingface/datasets#4500
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

concatenate_datasets for iterable datasets

4 participants