Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Nov 8, 2023

Allow load_dataset to support the Webdataset format.

It allows users to download/stream data from local files or from the Hugging Face Hub.

Moreover it will enable the Dataset Viewer for Webdataset datasets on HF.

Implementation details

  • I added a new Webdataset builder
  • dataset with TAR files are now read using the Webdataset builder
  • Basic decoding from webdataset is used by default, except unsafe ones like pickle
  • HF authentication support is done with xopen

TODOS

  • tests
  • docs

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

A high-level review :)

".parquet": ("parquet", {}),
".arrow": ("arrow", {}),
".txt": ("text", {}),
".tar": ("webdataset", {}),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make the module inference more robust by inspecting the contents of TAR archives (e.g., consecutive files with the same name (stem) but different extensions)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some webdatasets may contain only one field (e.g. only images) so I'm not sure it would make sense.

Also I like the idea of keeping the TAR loading simple and only support webdataset for TAR

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this would break some existing repos on the Hub, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TAR has never been supported on the Hub, what would it break ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you mean that some TAR datasets not in webdataset format won't work properly.

Maybe I can add an error message if the first webdataset examples don't have the same type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I thought this could fail on

elif ext == ".zip":
return infer_module_for_data_files_list_in_archives(data_files_list, download_config=download_config)

But this logic only inspects ZIP (and ignores TAR) archives :)

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 14, 2023

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Copy link
Member Author

lhoestq commented Nov 14, 2023

I added an error message if the first examples don't appear to be in webdataset format

"The TAR archives of the dataset should be in Webdataset format, "
"but the files in the archive don't share the same prefix or the same types."

@lhoestq lhoestq marked this pull request as ready for review November 15, 2023 11:51
@lhoestq lhoestq requested a review from mariosasko November 15, 2023 11:52
@lhoestq
Copy link
Member Author

lhoestq commented Nov 24, 2023

@mariosasko could you review this ? I think it's fine to have webdataset as an optional dependency for now, then depending on usage and user feedbacks see if it makes sense to have our own implementation or not

inferred_arrow_schema = pa.Table.from_pylist(first_examples[:1]).schema
features = datasets.Features.from_arrow_schema(inferred_arrow_schema)

# Set Image types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also do the same for the Audio feature, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes correct, that's for another PR :p

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, feel free to add a TO-DO.

@lhoestq
Copy link
Member Author

lhoestq commented Nov 27, 2023

I just removed the dependency on webdataset @mariosasko :)

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nits, but looks good other than that

@lhoestq
Copy link
Member Author

lhoestq commented Nov 28, 2023

took your comments into account, lmk if you see anything else

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, LGTM!

@lhoestq lhoestq merged commit 98d96a4 into main Nov 28, 2023
@lhoestq lhoestq deleted the webdataset branch November 28, 2023 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants