Download and prepare as Parquet for cloud storage #4724

Merged

Commits (40)
- `6d89cfb` use fsspec for caching (lhoestq)
- `606a48f` add parquet writer (lhoestq)
- `cdf8dcd` add file_format argument (lhoestq)
- `ad91270` style (lhoestq)
- `742d2a9` use "gs" instead of "gcs" for apache beam + use is_remote_filesystem (lhoestq)
- `aed8ce6` typo (lhoestq)
- `93d5660` fix test (lhoestq)
- `65c2037` test ArrowWriter with filesystem (lhoestq)
- `84d8397` test parquet writer (lhoestq)
- `4c46349` more tests (lhoestq)
- `ee7e3f5` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `033a3b8` more tests (lhoestq)
- `ce8d7f9` fix nullcontext on 3.6 (lhoestq)
- `15dccf9` parquet_writer.write_batch is not available in pyarrow 6 (lhoestq)
- `3a3d784` remove reference to open file (lhoestq)
- `3eef46d` fix test (lhoestq)
- `b480549` docs (lhoestq)
- `713f83c` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `1db12b9` docs: dask from parquet files (lhoestq)
- `874b2a0` Apply suggestions from code review (lhoestq)
- `f6ecb64` use contextlib.nullcontext (lhoestq)
- `b0e4222` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `e7f3ac4` fix missing import (lhoestq)
- `df0343a` Use unstrip_protocol to merge protocol and path (mariosasko)
- `a0f84f4` remove bad "raise" and add TODOs (lhoestq)
- `509ff3f` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `1b02b66` add output_dir arg to download_and_prepare (lhoestq)
- `2e85216` update tests (lhoestq)
- `ba167db` update docs (lhoestq)
- `a9379f8` fix tests (lhoestq)
- `ec94a4b` fix tests (lhoestq)
- `f47871a` fix output parent dir creattion (lhoestq)
- `460e1a6` Apply suggestions from code review (lhoestq)
- `88daa8a` revert changes for remote cache_dir (lhoestq)
- `fdf7252` fix wording in the docs: load -> download and prepare (lhoestq)
- `22aaf7b` style (lhoestq)
- `c051b31` fix (lhoestq)
- `e0a7742` simplify incomplete_dir (lhoestq)
- `53d46cc` fix tests (lhoestq)
- `606951f` albert's comments (lhoestq)
# Cloud storage

🤗 Datasets supports access to cloud storage providers through `fsspec` FileSystem implementations.
You can save and load datasets from any cloud storage in a Pythonic way.
Take a look at the following table for some examples of supported cloud storage providers:

| Storage provider | Filesystem implementation                                       |
|------------------|-----------------------------------------------------------------|
| …                | …                                                               |
| Dropbox          | [dropboxdrivefs](https://github.com/MarineChap/dropboxdrivefs) |
| Google Drive     | [gdrivefs](https://github.com/intake/gdrivefs)                 |

This guide will show you how to save and load datasets with any cloud storage.
Here are examples for S3, Google Cloud Storage and Azure Blob Storage.

## Set up your cloud storage FileSystem

### Amazon S3
1. Install the S3 dependency with 🤗 Datasets:

```
pip install datasets[s3]
```

2. Define your credentials

To use an anonymous connection, use `anon=True`.
Otherwise, include your `aws_access_key_id` and `aws_secret_access_key` whenever you are interacting with a private S3 bucket.

```py
>>> storage_options = {"anon": True}  # for anonymous connection
# or use your credentials
>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}  # for private buckets
# or use a botocore session
>>> import botocore
>>> s3_session = botocore.session.Session(profile="my_profile_name")
>>> storage_options = {"session": s3_session}
```

3. Load your FileSystem instance

```py
>>> import s3fs
>>> fs = s3fs.S3FileSystem(**storage_options)
```
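The `storage_options` pattern is the same for any fsspec-compatible filesystem: build a plain dict, then unpack it into the filesystem constructor. As a minimal sketch that runs without any cloud credentials, fsspec's built-in `memory` filesystem is used below purely as a stand-in for `s3fs` (the bucket path is made up for illustration):

```python
import fsspec

# An empty dict here; for a real private S3 bucket this would hold
# e.g. {"key": ..., "secret": ...} as shown above.
storage_options = {}
fs = fsspec.filesystem("memory", **storage_options)

# Write a small file, then list the directory, just as you would with fs.ls on S3.
with fs.open("my-bucket/imdb/train/state.json", "w") as f:
    f.write("{}")

print(fs.ls("my-bucket/imdb/train", detail=False))
```

The same dict can later be passed as `storage_options=` to APIs that accept it, so credentials are defined once and reused.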

### Google Cloud Storage

1. Install the Google Cloud Storage implementation:

```
conda install -c conda-forge gcsfs
# or install with pip
pip install gcsfs
```

2. Define your credentials

```py
>>> storage_options = {"token": "anon"}  # for anonymous connection
# or use your default gcloud credentials or credentials from the google metadata service
>>> storage_options = {"project": "my-google-project"}
# or use your credentials from elsewhere, see the documentation at https://gcsfs.readthedocs.io/
>>> storage_options = {"project": "my-google-project", "token": TOKEN}
```

3. Load your FileSystem instance

```py
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(**storage_options)
```

### Azure Blob Storage

1. Install the Azure Blob Storage implementation:

```
conda install -c conda-forge adlfs
# or install with pip
pip install adlfs
```

2. Define your credentials

```py
>>> storage_options = {"anon": True}  # for anonymous connection
# or use your credentials
>>> storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}  # gen 2 filesystem
# or use your credentials with the gen 1 filesystem
>>> storage_options = {"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}
```

3. Load your FileSystem instance

```py
>>> import adlfs
>>> fs = adlfs.AzureBlobFileSystem(**storage_options)
```
## Load and Save your datasets using your cloud storage FileSystem

### Download and prepare a dataset into a cloud storage

You can download and prepare a dataset into your cloud storage by specifying a remote `output_dir` in `download_and_prepare`.
Don't forget to use the previously defined `storage_options` containing your credentials to write into a private cloud storage.

The `download_and_prepare` method works in two steps:
1. it first downloads the raw data files (if any) into your local cache. You can set your cache directory by passing `cache_dir` to [`load_dataset_builder`]
2. then it generates the dataset in Arrow or Parquet format in your cloud storage by iterating over the raw data files.

Download and prepare a dataset from the Hugging Face Hub (see [how to load from the Hugging Face Hub](./loading#hugging-face-hub)):

```py
>>> from datasets import load_dataset_builder
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("imdb")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```
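The two-step flow can be sketched with a stdlib-only mock (no real builder, no cloud storage; `fake_download_and_prepare`, the file names, and the shard contents are all made up for illustration): raw files land in a local cache directory first, and the prepared shards are written to a separate output directory, which with a real builder could be a remote fsspec path.

```python
import pathlib
import tempfile

def fake_download_and_prepare(cache_dir: pathlib.Path, output_dir: pathlib.Path):
    """Toy stand-in for builder.download_and_prepare."""
    # Step 1: download the raw data files into the local cache.
    cache_dir.mkdir(parents=True, exist_ok=True)
    raw = cache_dir / "raw.csv"
    raw.write_text("text,label\ngreat movie,1\nbad movie,0\n")

    # Step 2: iterate over the raw files and generate dataset shards in
    # output_dir (a real builder writes actual Arrow or Parquet data here,
    # possibly to a remote path like s3://my-bucket/imdb).
    output_dir.mkdir(parents=True, exist_ok=True)
    shard = output_dir / "csv-train-00000-of-00001.parquet"
    shard.write_bytes(raw.read_bytes())  # placeholder bytes, not real Parquet
    return sorted(p.name for p in output_dir.iterdir())

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    shards = fake_download_and_prepare(root / "cache", root / "out")
    print(shards)  # ['csv-train-00000-of-00001.parquet']
```

The point of the separation is that the cache stays local and reusable, while only the prepared dataset is pushed to (possibly remote) storage.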

Download and prepare a dataset using a loading script (see [how to load a local loading script](./loading#local-loading-script)):

```py
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

```py
>>> data_files = {"train": ["path/to/train.csv"]}
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("csv", data_files=data_files)
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

It is highly recommended to save the files as compressed Parquet files to optimize I/O by specifying `file_format="parquet"`.
Otherwise the dataset is saved as an uncompressed Arrow file.
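A rough intuition for why compressed files optimize I/O: dataset rows are highly repetitive, so compressed formats move far fewer bytes over the network. This stdlib sketch uses gzip as a loose stand-in for Parquet's columnar compression, on made-up rows shaped like a text-classification dataset:

```python
import gzip
import json

# Made-up, repetitive rows, loosely shaped like a labeled text dataset.
rows = [{"text": "this movie was great", "label": i % 2} for i in range(1000)]

raw = json.dumps(rows).encode("utf-8")  # uncompressed serialization
compressed = gzip.compress(raw)         # stand-in for Parquet compression

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

Real Parquet does better still, since it compresses column by column; this only illustrates the direction of the effect.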

#### Dask

Dask is a parallel computing library with a pandas-like API for working with larger-than-memory Parquet datasets in parallel.
Dask can use multiple threads or processes on a single machine, or a cluster of machines, to process data in parallel.
Dask supports local data as well as data from a cloud storage.

Therefore you can load a dataset saved as sharded Parquet files in Dask with:

```py
import dask.dataframe as dd

df = dd.read_parquet(output_dir, storage_options=storage_options)

# or if your dataset is split into train/valid/test
df_train = dd.read_parquet(output_dir + f"/{builder.name}-train-*.parquet", storage_options=storage_options)
df_valid = dd.read_parquet(output_dir + f"/{builder.name}-validation-*.parquet", storage_options=storage_options)
df_test = dd.read_parquet(output_dir + f"/{builder.name}-test-*.parquet", storage_options=storage_options)
```

You can find more about Dask dataframes in their [documentation](https://docs.dask.org/en/stable/dataframe.html).
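The `{builder.name}-{split}-*.parquet` expressions above are ordinary glob patterns. A stdlib sketch (with hypothetical shard names following that layout) shows which files each split pattern selects:

```python
import fnmatch

# Hypothetical shard names, following the `{builder.name}-{split}-*` layout
# used for sharded dataset output.
files = [
    "imdb-train-00000-of-00002.parquet",
    "imdb-train-00001-of-00002.parquet",
    "imdb-test-00000-of-00001.parquet",
]

builder_name = "imdb"
train_shards = fnmatch.filter(files, f"{builder_name}-train-*.parquet")
test_shards = fnmatch.filter(files, f"{builder_name}-test-*.parquet")
print(train_shards)  # ['imdb-train-00000-of-00002.parquet', 'imdb-train-00001-of-00002.parquet']
print(test_shards)   # ['imdb-test-00000-of-00001.parquet']
```

Dask expands the same patterns against the storage backend, so each `read_parquet` call picks up exactly the shards of its split.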

## Saving serialized datasets

After you have processed your dataset, you can save it to your cloud storage with [`Dataset.save_to_disk`]:

```py
# saves encoded_dataset to amazon s3
>>> encoded_dataset.save_to_disk("s3://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to google cloud storage
>>> encoded_dataset.save_to_disk("gcs://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to microsoft azure blob/datalake
>>> encoded_dataset.save_to_disk("adl://my-private-datasets/imdb/train", fs=fs)
```

> **Contributor review comment:** Similar to my comment above, it would be nice to use the more "native" save methods, like `encoded_dataset.to_csv("s3://my-bucket/dataset.csv", storage_options=storage_options)`, similar to pandas as well.

<Tip>

Remember to define your credentials in your [FileSystem instance](#set-up-your-cloud-storage-filesystem) `fs` whenever you are interacting with a private cloud storage.

</Tip>

## Listing serialized datasets

List files from a cloud storage with your FileSystem instance `fs`, using `fs.ls`:

```py
>>> fs.ls("my-private-datasets/imdb/train")
["dataset_info.json.json", "dataset.arrow", "state.json"]
```

### Load serialized datasets

When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]:

```py
>>> from datasets import load_from_disk
# load encoded_dataset from cloud storage
>>> dataset = load_from_disk("s3://a-public-datasets/imdb/train", fs=fs)
>>> print(len(dataset))
25000
```