Download and prepare as Parquet for cloud storage #4724
# Cloud storage

🤗 Datasets supports access to cloud storage providers through `fsspec` FileSystem implementations.
You can save and load datasets from any cloud storage in a Pythonic way.
Take a look at the following table for some examples of supported cloud storage providers:

| Storage provider     | Filesystem implementation                                       |
|----------------------|-----------------------------------------------------------------|
| Amazon S3            | [s3fs](https://github.com/fsspec/s3fs)                         |
| Google Cloud Storage | [gcsfs](https://github.com/fsspec/gcsfs)                       |
| Azure Blob Storage   | [adlfs](https://github.com/fsspec/adlfs)                       |
| Dropbox              | [dropboxdrivefs](https://github.com/MarineChap/dropboxdrivefs) |
| Google Drive         | [gdrivefs](https://github.com/intake/gdrivefs)                |

This guide will show you how to save and load datasets with any cloud storage.
Here are examples for S3, Google Cloud Storage and Azure Blob Storage.

## Set up your cloud storage FileSystem

### Amazon S3

1. Install the S3 dependency with 🤗 Datasets:

```
>>> pip install datasets[s3]
```

2. Define your credentials

To use an anonymous connection, use `anon=True`.
Otherwise, include your `aws_access_key_id` and `aws_secret_access_key` whenever you are interacting with a private S3 bucket.

```py
>>> storage_options = {"anon": True}  # for anonymous connection
# or use your credentials
>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}  # for private buckets
# or use a botocore session
>>> import botocore
>>> s3_session = botocore.session.Session(profile="my_profile_name")
>>> storage_options = {"session": s3_session}
```

3. Load your FileSystem instance

```py
>>> import s3fs
>>> fs = s3fs.S3FileSystem(**storage_options)
```

### Google Cloud Storage

1. Install the Google Cloud Storage implementation:

```
>>> conda install -c conda-forge gcsfs
# or install with pip
>>> pip install gcsfs
```

2. Define your credentials

```py
>>> storage_options = {"token": "anon"}  # for anonymous connection
# or use your default gcloud credentials or credentials from the google metadata service
>>> storage_options = {"project": "my-google-project"}
# or use your credentials from elsewhere, see the documentation at https://gcsfs.readthedocs.io/
>>> storage_options = {"project": "my-google-project", "token": TOKEN}
```

3. Load your FileSystem instance

```py
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(**storage_options)
```

### Azure Blob Storage

1. Install the Azure Blob Storage implementation:

```
>>> conda install -c conda-forge adlfs
# or install with pip
>>> pip install adlfs
```

2. Define your credentials

```py
>>> storage_options = {"anon": True}  # for anonymous connection
# or use your credentials
>>> storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}  # gen 2 filesystem
# or use your credentials with the gen 1 filesystem
>>> storage_options = {"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}
```

3. Load your FileSystem instance

```py
>>> import adlfs
>>> fs = adlfs.AzureBlobFileSystem(**storage_options)
```

## Load and Save your datasets using your cloud storage FileSystem

### Load datasets into a cloud storage

You can load and cache a dataset into your cloud storage by specifying a remote `cache_dir` in `load_dataset_builder`.
Don't forget to use your previously defined `storage_options` containing your credentials to write into a private cloud storage.

Load a dataset from the Hugging Face Hub (see [how to load from the Hugging Face Hub](./loading#hugging-face-hub)):

```py
>>> from datasets import load_dataset_builder

>>> cache_dir = "s3://my-bucket/datasets-cache"
>>> builder = load_dataset_builder("imdb", cache_dir=cache_dir, storage_options=storage_options)
>>> builder.download_and_prepare(file_format="parquet")
```

Load a dataset using a loading script (see [how to load a local loading script](./loading#local-loading-script)):

```py
>>> cache_dir = "s3://my-bucket/datasets-cache"
>>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py", cache_dir=cache_dir, storage_options=storage_options)
>>> builder.download_and_prepare(file_format="parquet")
```

Load your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

```py
>>> data_files = {"train": ["path/to/train.csv"]}
>>> cache_dir = "s3://my-bucket/datasets-cache"
>>> builder = load_dataset_builder("csv", data_files=data_files, cache_dir=cache_dir, storage_options=storage_options)
>>> builder.download_and_prepare(file_format="parquet")
```

It is highly recommended to save the files as compressed Parquet files to optimize I/O by specifying `file_format="parquet"`.
Otherwise the dataset is saved as an uncompressed Arrow file.

#### Dask

Dask is a parallel computing library with a pandas-like API for working with larger-than-memory Parquet datasets in parallel.
Dask can use multiple threads or processes on a single machine, or a cluster of machines to process data in parallel.
Dask supports local data as well as data from a cloud storage.

Therefore you can load a dataset saved as sharded Parquet files in Dask with:

```py
import dask.dataframe as dd

df = dd.read_parquet(builder.cache_dir, storage_options=storage_options)

# or if your dataset is split into train/valid/test
df_train = dd.read_parquet(builder.cache_dir + f"/{builder.name}-train-*.parquet", storage_options=storage_options)
df_valid = dd.read_parquet(builder.cache_dir + f"/{builder.name}-validation-*.parquet", storage_options=storage_options)
df_test = dd.read_parquet(builder.cache_dir + f"/{builder.name}-test-*.parquet", storage_options=storage_options)
```

You can find more about Dask dataframes in their [documentation](https://docs.dask.org/en/stable/dataframe.html).

## Saving serialized datasets

After you have processed your dataset, you can save it to your cloud storage with [`Dataset.save_to_disk`]:

```py
# saves encoded_dataset to amazon s3
>>> encoded_dataset.save_to_disk("s3://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to google cloud storage
>>> encoded_dataset.save_to_disk("gcs://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to microsoft azure blob/datalake
>>> encoded_dataset.save_to_disk("adl://my-private-datasets/imdb/train", fs=fs)
```

<Tip>

Remember to define your credentials in your [FileSystem instance](#set-up-your-cloud-storage-filesystem) `fs` whenever you are interacting with a private cloud storage.

</Tip>

## Listing serialized datasets

List files from a cloud storage with your FileSystem instance `fs`, using `fs.ls`:

```py
>>> fs.ls("my-private-datasets/imdb/train")
["dataset_info.json", "dataset.arrow", "state.json"]
```

## Load serialized datasets

When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]:

```py
>>> from datasets import load_from_disk

# load encoded_dataset from cloud storage
>>> dataset = load_from_disk("s3://a-public-datasets/imdb/train", fs=fs)
>>> print(len(dataset))
25000
```