Add `concatenate_datasets` for iterable datasets #4500

lhoestq · 2022-06-15T13:58:50Z

`concatenate_datasets` currently only supports lists of `datasets.Dataset`, not lists of `datasets.IterableDataset`, whereas `interleave_datasets` supports both.

Fix #2564

I also moved _interleave_map_style_datasets from combine.py to arrow_dataset.py, since the logic depends a lot on the Dataset object internals

And I moved concatenate_datasets from arrow_dataset.py to combine.py so that it sits next to interleave_datasets (it is also kept in the arrow_dataset module for backward compatibility for now)

HuggingFaceDocBuilderDev · 2022-06-15T14:05:02Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko

Looks good already!

There is a slight difference in concatenate_datasets between the version for map-style datasets and the one for iterable datasets:

If `axis=0`, the map-style version checks the feature types of the shared columns (features can be `None` in iterable datasets, so it's OK to skip this check there), but it doesn't require the input datasets to have the same set of column names, i.e., the following works:
```
 >>> from datasets import Dataset, concatenate_datasets
 >>> a = Dataset.from_dict({"a": [1, 2]})
 >>> b = Dataset.from_dict({"b": ["aa", "bb"]})
 >>> concatenate_datasets([a, b])
```
We need to address this.
If `axis=1`, the map-style version checks the lengths of the input datasets (in addition to checking the column names for duplicates) and throws an error if the lengths are not equal. It is somewhat expected that this check is skipped for iterable datasets, so it can stay as-is IMO.
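The two checks described above can be sketched in plain Python. This is only an illustration of the described behavior, not the code in `datasets`: `Schema`, `check_axis0`, and `check_axis1` are hypothetical names, and a schema is modeled as a simple mapping from column name to type name.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical stand-in for dataset features: column name -> type name.
Schema = Dict[str, str]


def check_axis0(schemas: List[Schema]) -> None:
    """Shared columns must agree on their type; differing column sets are allowed."""
    seen: Schema = {}
    for schema in schemas:
        for name, dtype in schema.items():
            # setdefault records the first type seen for a column and returns it afterwards.
            if seen.setdefault(name, dtype) != dtype:
                raise ValueError(
                    f"Column {name!r} has conflicting types: {seen[name]} vs {dtype}"
                )


def check_axis1(schemas: List[Schema], lengths: List[int]) -> None:
    """Column names must be unique across inputs and all inputs must have equal length."""
    counts = Counter(name for schema in schemas for name in schema)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate column names: {duplicates}")
    if len(set(lengths)) > 1:
        raise ValueError(f"Datasets have different lengths: {lengths}")
```

With these helpers, two inputs with disjoint columns pass the `axis=0` check, which mirrors the `a`/`b` example above.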

Two nits in code:

src/datasets/combine.py

src/datasets/iterable_dataset.py

lhoestq · 2022-06-21T10:11:01Z

Thanks ! I addressed your comments :)

> There is a slight difference in concatenate_datasets between the version for map-style datasets and the one for iterable datasets

Indeed, here is what I did to fix this:

axis 0: fill missing columns with None.
(I first iterate over the input datasets to infer their columns from the first examples, then I set the features of the resulting dataset to the merged features.)
This is consistent with non-streaming concatenation.
axis 1: fill the missing rows with None, for consistency with axis 0
(but let me know what you think; I can still revert this behavior and raise an error when one of the datasets runs out of examples).
We might have to align the non-streaming concatenation with this behavior though, for consistency. What do you think?
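The fill behavior described above can be sketched in plain Python. This is a minimal illustration, not the implementation in this PR: `infer_columns`, `concat_axis0`, and `concat_axis1` are hypothetical names, examples are modeled as plain dicts, and each source is assumed to be re-iterable (for instance a list), because the first example is peeked at before the full pass.

```python
from itertools import chain, zip_longest
from typing import Any, Dict, Iterable, Iterator, List, Sequence

Example = Dict[str, Any]


def infer_columns(sources: Sequence[Iterable[Example]]) -> List[str]:
    """Merge column names, looking only at the first example of each source."""
    columns: List[str] = []
    for source in sources:
        # Assumes the source can be iterated again later (e.g. a list).
        first = next(iter(source), {})
        for name in first:
            if name not in columns:
                columns.append(name)
    return columns


def concat_axis0(sources: Sequence[Iterable[Example]]) -> Iterator[Example]:
    """Yield the examples of each source in turn; missing columns become None."""
    columns = infer_columns(sources)
    for example in chain.from_iterable(sources):
        yield {name: example.get(name) for name in columns}


def concat_axis1(sources: Sequence[Iterable[Example]]) -> Iterator[Example]:
    """Merge sources row by row; a source that has run out contributes None values."""
    columns_per_source = [list(next(iter(source), {})) for source in sources]
    for rows in zip_longest(*sources):
        merged: Example = {}
        for names, row in zip(columns_per_source, rows):
            for name in names:
                merged[name] = None if row is None else row.get(name)
        yield merged
```

For example, with `a = [{"a": 1}, {"a": 2}]` and `b = [{"b": "aa"}]`, `concat_axis0([a, b])` yields three rows with both columns, and `concat_axis1([a, b])` yields two rows where the second has `"b": None`.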

mariosasko

Cool!

> We might have to align the non-streaming concatenation with this behavior though, for consistency. What do you think?

Yes, we can address that in a subsequent PR

src/datasets/iterable_dataset.py

lhoestq · 2022-06-28T16:30:37Z

Added more comments as suggested, and some typing

While factorizing _apply_features_types for both IterableDataset and TypedExamplesIterable, I fixed a missing token_per_repo_id that was not passed to TypedExamplesIterable

Let me know what you think now @mariosasko

mariosasko

Looks good! Thanks!

lhoestq added 2 commits June 15, 2022 15:52

add concatenate_datasets for iterable datasets

37bb701

fix

590354e

lhoestq requested a review from mariosasko June 15, 2022 13:58

lhoestq marked this pull request as ready for review June 15, 2022 14:19

mariosasko reviewed Jun 17, 2022

src/datasets/combine.py (outdated, resolved)

src/datasets/iterable_dataset.py (outdated, resolved)

lhoestq added 8 commits June 17, 2022 19:20

infer features

15c286e

fill missing rows and columns

5dbae25

comments

d4492ab

only check for duplicate keys once

1a7ed8b

comments

8916160

Merge branch 'master' into concatenate-iterable-dataset

a57e5e6

keep concatenate_datasets in arrow_dataset (to be deprecated)

6339e71

style

94b2293

lhoestq requested a review from mariosasko June 21, 2022 13:32

mariosasko reviewed Jun 22, 2022

src/datasets/iterable_dataset.py (resolved)

src/datasets/iterable_dataset.py (resolved)

src/datasets/iterable_dataset.py (resolved)

src/datasets/iterable_dataset.py (outdated, resolved)

lhoestq added 4 commits June 28, 2022 12:37

Merge branch 'master' into concatenate-iterable-dataset

e82e847

Merge branch 'master' into concatenate-iterable-dataset

453089f

comments, typing, fix missing token_per_repo_id

65fafbe

style

ba0635a

mariosasko approved these changes Jun 28, 2022

lhoestq merged commit f5826ef into master Jun 28, 2022

lhoestq deleted the concatenate-iterable-dataset branch June 28, 2022 21:15

severo mentioned this pull request Sep 8, 2022

Infer the features if missing #3144

Closed

KennethEnevoldsen added a commit to danish-foundation-models/site that referenced this pull request Oct 10, 2022

fix: Updated load dataset to remove concatenation hotfix

71179ad

This was resolved by the feature being implemented for iterable datasets: huggingface/datasets#4500
