
Conversation

@lhoestq (Member) commented Jul 20, 2022

Downloading a dataset as Parquet to a cloud storage can be useful for streaming mode and for use with Spark/Dask/Ray.

This PR adds support for fsspec URIs like s3://..., gcs://..., etc., and adds a file_format argument to save as Parquet instead of Arrow:

from datasets import *

cache_dir = "s3://..."
builder = load_dataset_builder("crime_and_punish", cache_dir=cache_dir)
builder.download_and_prepare(file_format="parquet")

EDIT: actually changed the API to

from datasets import *

builder = load_dataset_builder("crime_and_punish")
builder.download_and_prepare("s3://...", file_format="parquet")

Credentials for cloud storage can be passed using the storage_options argument.
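For example, credentials for an S3 URI could look like this (a sketch with placeholder values; "key", "secret" and "token" are the option names accepted by s3fs):

```python
# Placeholder credentials passed through to the fsspec filesystem (s3fs here);
# never hardcode real secrets like this.
storage_options = {
    "key": "<aws_access_key_id>",        # placeholder
    "secret": "<aws_secret_access_key>", # placeholder
}
```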

For consistency with the BeamBasedBuilder, I name the parquet files {builder.name}-{split}-xxxxx-of-xxxxx.parquet. I think this is fine since we'll need to implement parquet sharding after this PR, so that a dataset can be used efficiently with dask for example.
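For illustration, the naming scheme can be sketched like this (the builder name and shard count below are hypothetical; sharding itself lands in a follow-up PR):

```python
# Sketch of the {builder.name}-{split}-xxxxx-of-xxxxx.parquet naming scheme.
builder_name = "crime_and_punish"  # hypothetical
split = "train"
num_shards = 3  # hypothetical
filenames = [
    f"{builder_name}-{split}-{i:05d}-of-{num_shards:05d}.parquet"
    for i in range(num_shards)
]
print(filenames[0])  # crime_and_punish-train-00000-of-00003.parquet
```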

Note that images/audio files are not embedded yet in the parquet files; this will be added in a subsequent PR.

TODO:

  • docs
  • tests

@HuggingFaceDocBuilderDev commented Jul 20, 2022

The documentation is not available anymore as the PR was closed or merged.

from fsspec import AbstractFileSystem
from fsspec.implementations.local import LocalFileSystem
from fsspec.utils import stringify_path


class MockFileSystem(AbstractFileSystem):
@lhoestq (Member Author):

I created this mockfs fixture that supports not only reading but also writing, so we can base all our fsspec tests on this one :) @albertvillanova
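For readers unfamiliar with the idea, a minimal sketch (using fsspec's built-in in-memory filesystem, not the PR's actual fixture) shows how an fsspec filesystem can back both read and write tests without touching a real bucket:

```python
import fsspec

# fsspec ships a "memory" filesystem that supports both writing and reading,
# which is the same idea as a read/write mock filesystem fixture.
fs = fsspec.filesystem("memory")
with fs.open("/bucket/data.txt", "wb") as f:
    f.write(b"hello")
assert fs.cat("/bucket/data.txt") == b"hello"
assert fs.exists("/bucket/data.txt")
```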

@lhoestq lhoestq marked this pull request as ready for review July 25, 2022 17:10
@mariosasko (Collaborator) left a comment

Good job! Some nits:

@philschmid (Contributor) left a comment

I haven't reviewed the code yet but left some comments regarding the DX. I think that to increase the usage of datasets within DS/research (who currently use other libraries because of better cloud storage integrations), we have to provide an API similar to what they are used to, e.g. loading a dataset from cloud storage with pandas looks the same as loading it from a local disk.

Comment on lines 132 to 133
>>> builder = load_dataset_builder("csv", data_files=data_files, cache_dir=cache_dir, storage_options=storage_options)
>>> builder.download_and_prepare(file_format="parquet")
Contributor:

From a UX perspective it would be cool if I could do

# single file
ds = load_dataset("s3://my-bucket/datasets-cache/train.csv", storage_options=storage_options)
# multiple files
ds = load_dataset("s3://my-bucket/datasets-cache", data_files={"train": ["path/to/train.csv"]}, storage_options=storage_options)

But I'm not sure if this is possible to implement. That's, for example, also how pandas does it:
data = pd.read_csv('s3://bucket....csv')

Member Author:

Yup this is definitely doable :)
It sounds super intuitive and practical to use, thanks !

>>> dataset = load_from_disk('gcs://my-private-datasets/imdb/train', fs=gcs)

# saves encoded_dataset to Amazon S3
>>> encoded_dataset.save_to_disk("s3://my-private-datasets/imdb/train", fs=fs)
Contributor:

Similar to my comment above, it would be nice to use the more "native" save methods, like to_csv, to_parquet, etc., e.g.

encoded_dataset.to_csv("s3://my-bucket/dataset.csv", storage_options=storage_options)

similar to pandas as well.
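For reference, this is the pandas pattern being mirrored (sketched against a local path so it runs without credentials; for an s3:// path you would add storage_options):

```python
import os
import tempfile

import pandas as pd

# Round-trip through to_csv/read_csv; with an fsspec URL like s3://...,
# pandas would forward storage_options to the underlying filesystem.
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
df = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
df.to_csv(path, index=False)  # e.g. df.to_csv("s3://bucket/dataset.csv", storage_options=...)
roundtrip = pd.read_csv(path)
assert roundtrip.shape == (2, 2)
```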

@lhoestq (Member Author) commented Jul 30, 2022

Just noticed that it would be more convenient to pass the output dir to download_and_prepare directly, to bypass the caching logic which prepares the dataset at <cache_dir>/<name>/<version>/<hash>/. This way the cache is only used for the downloaded files. What do you think?

builder = load_dataset_builder("squad")
# or with a custom cache
builder = load_dataset_builder("squad", cache_dir="path/to/local/cache/for/downloaded/files")

# download and prepare to s3
builder.download_and_prepare("s3://my_bucket/squad")

@philschmid (Contributor):
Might be of interest:
PyTorch and AWS introduced better support for S3 streaming in torchtext.

@mariosasko (Collaborator) commented Aug 10, 2022

Having thought about it a bit more, I also agree with @philschmid in that it's important to follow the existing APIs (pandas/dask), which means we should support the following at some point:

  • remote data files resolution for the packaged modules to support load_dataset("<format>", data_files="<fs_url>")
  • to_<format>("<fs_url>")
  • load_from_disk and save_to_disk already expose the fs param, but it would be cool to support specifying fsspec URLs directly as the source/destination path (perhaps we can then deprecate fs to be fully aligned with pandas/dask)

IMO these are the two main issues with the current approach:

  • relying on the builder API to generate the formatted files results in a non-friendly format due to how our caching works (a lot of nested subdirectories)
  • this approach still downloads the files needed to generate a dataset locally. Considering one of our goals is to align the streaming API with the non-streaming one, this could be avoided by running to_<format> on streamed/iterable datasets

@lhoestq (Member Author) commented Aug 26, 2022

Alright I did the last change I wanted to do, here is the final API:

builder = load_dataset_builder(...)
builder.download_and_prepare("s3://...", storage_options={"token": ...})

and it creates the arrow files directly in the specified directory, not in a nested subdirectory structure as we do in the cache!

> this approach still downloads the files needed to generate a dataset locally. Considering one of our goals is to align the streaming API with the non-streaming one, this could be avoided by running to_<format> on streamed/iterable datasets

Yup this can be explored in some future work I think. Though to keep things simple and clear, I would keep the streaming behaviors only when you load a dataset in streaming mode, and not include it in download_and_prepare (because it wouldn't be aligned with the name of the function, which implies 1. download and 2. prepare ^^). Maybe an API like that could make sense for those who need full streaming:

ds = load_dataset(..., streaming=True)
ds.to_parquet("s3://...")

@lhoestq lhoestq requested a review from mariosasko August 26, 2022 12:34
@albertvillanova (Member) left a comment

Thanks @lhoestq, awesome work!!! It is really important that we support multiple cloud storage providers.

Just a general comment about the documentation. Feel free to tell me if you don't agree.

I think the use of the word "loading" can be misleading: we are using it with different meanings:

  • until now, "loading" meant "get" (load into memory or memory-map), as opposed to "save" or "share"
  • now we are also using "loading" to mean "saving": "loading into a cloud storage", differently from "loading a dataset from the HF Hub"

I think we should be clear and make the difference between:

  • loading a dataset FROM a cloud storage, with an API as suggested by @philschmid
ds = load_dataset("s3://my-bucket/datasets-cache/train.csv", storage_options=storage_options)
  • and saving it TO a cloud storage, with your proposal

@albertvillanova (Member) left a comment

Just some additional comments below.

Comment on lines 317 to 346
-        # cache_dir can be a remote bucket on GCS or S3 (when using BeamBasedBuilder for distributed data processing)
-        self._cache_dir_root = str(cache_dir or config.HF_DATASETS_CACHE)
-        self._cache_dir_root = (
-            self._cache_dir_root if is_remote_url(self._cache_dir_root) else os.path.expanduser(self._cache_dir_root)
-        )
-        path_join = posixpath.join if is_remote_url(self._cache_dir_root) else os.path.join
+        self._cache_dir_root = str(cache_dir) or os.path.expanduser(config.HF_DATASETS_CACHE)
+        self._cache_dir = self._build_cache_dir()
         self._cache_downloaded_dir = (
-            path_join(self._cache_dir_root, config.DOWNLOADED_DATASETS_DIR)
+            os.path.join(self._cache_dir_root, config.DOWNLOADED_DATASETS_DIR)
             if cache_dir
-            else str(config.DOWNLOADED_DATASETS_PATH)
+            else os.path.expanduser(config.DOWNLOADED_DATASETS_PATH)
         )
-        self._cache_downloaded_dir = (
-            self._cache_downloaded_dir
-            if is_remote_url(self._cache_downloaded_dir)
-            else os.path.expanduser(self._cache_downloaded_dir)
-        )
-        self._cache_dir = self._build_cache_dir()
-        if not is_remote_url(self._cache_dir_root):
         os.makedirs(self._cache_dir_root, exist_ok=True)
         lock_path = os.path.join(self._cache_dir_root, self._cache_dir.replace(os.sep, "_") + ".lock")
         with FileLock(lock_path):
             if os.path.exists(self._cache_dir):  # check if data exist
                 if len(os.listdir(self._cache_dir)) > 0:
                     logger.info("Overwrite dataset info from restored data version.")
                     self.info = DatasetInfo.from_directory(self._cache_dir)
                 else:  # dir exists but no data, remove the empty dir as data aren't available anymore
                     logger.warning(
                         f"Old caching folder {self._cache_dir} for dataset {self.name} exists but not data were found. Removing it. "
                     )
                     os.rmdir(self._cache_dir)
@albertvillanova (Member) commented Aug 26, 2022:

I see you are reverting the changes introduced by:

Some concerns:

  • What if the user sets a remote config.HF_DATASETS_CACHE or config.DOWNLOADED_DATASETS_DIR?
  • The FileLock was raising an error when working with a remote cache dir (BeamBasedBuilder)

Member Author:

Good catch, thanks!

The FileLock should indeed only be applied when the output directory is a local directory.
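One way to sketch that logic (a hypothetical helper, not the PR's exact code; is_remote_url here is a simplified stand-in for the one in datasets):

```python
import os
from contextlib import nullcontext

def is_remote_url(path: str) -> bool:
    # Simplified stand-in for datasets' is_remote_url
    return "://" in path

def output_dir_lock(output_dir: str):
    """Return a FileLock for local directories and a no-op context manager
    for remote URLs, since file locks can't be taken on object stores
    (hypothetical helper, not the PR's exact code)."""
    if is_remote_url(output_dir):
        return nullcontext()
    from filelock import FileLock  # third-party, only needed for local dirs
    return FileLock(os.path.join(output_dir, "prepare.lock"))

# Remote outputs skip the lock entirely
assert isinstance(output_dir_lock("s3://my-bucket/squad"), nullcontext)
```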

@lhoestq (Member Author) commented Aug 26, 2022

Totally agree with your comment on the meaning of "loading", I'll update the docs.

@lhoestq (Member Author) commented Aug 26, 2022

I took your comments into account and reverted all the changes related to cache_dir to keep the support for remote cache_dir for beam datasets. I also updated the wording in the docs to not use "load" when it's not appropriate :)

@mariosasko (Collaborator) left a comment

Looks all good now! Thanks!

@albertvillanova (Member) left a comment

Thanks @lhoestq, awesome job!

Just a few cosmetic comments on the docs...

@lhoestq lhoestq merged commit 139d210 into main Sep 5, 2022
@lhoestq lhoestq deleted the dl-and-pp-as-parquet branch September 5, 2022 17:25