Merged
Changes from 32 commits
Commits
40 commits
6d89cfb
use fsspec for caching
lhoestq Jul 20, 2022
606a48f
add parquet writer
lhoestq Jul 20, 2022
cdf8dcd
add file_format argument
lhoestq Jul 20, 2022
ad91270
style
lhoestq Jul 20, 2022
742d2a9
use "gs" instead of "gcs" for apache beam + use is_remote_filesystem
lhoestq Jul 20, 2022
aed8ce6
typo
lhoestq Jul 20, 2022
93d5660
fix test
lhoestq Jul 20, 2022
65c2037
test ArrowWriter with filesystem
lhoestq Jul 20, 2022
84d8397
test parquet writer
lhoestq Jul 21, 2022
4c46349
more tests
lhoestq Jul 21, 2022
ee7e3f5
Merge branch 'main' into dl-and-pp-as-parquet
lhoestq Jul 21, 2022
033a3b8
more tests
lhoestq Jul 21, 2022
ce8d7f9
fix nullcontext on 3.6
lhoestq Jul 21, 2022
15dccf9
parquet_writer.write_batch is not available in pyarrow 6
lhoestq Jul 21, 2022
3a3d784
remove reference to open file
lhoestq Jul 21, 2022
3eef46d
fix test
lhoestq Jul 22, 2022
b480549
docs
lhoestq Jul 22, 2022
713f83c
Merge branch 'main' into dl-and-pp-as-parquet
lhoestq Jul 25, 2022
1db12b9
docs: dask from parquet files
lhoestq Jul 27, 2022
874b2a0
Apply suggestions from code review
lhoestq Jul 27, 2022
f6ecb64
use contextlib.nullcontext
lhoestq Jul 27, 2022
b0e4222
Merge branch 'main' into dl-and-pp-as-parquet
lhoestq Jul 27, 2022
e7f3ac4
fix missing import
lhoestq Jul 27, 2022
df0343a
Use unstrip_protocol to merge protocol and path
mariosasko Jul 29, 2022
a0f84f4
remove bad "raise" and add TODOs
lhoestq Jul 29, 2022
509ff3f
Merge branch 'main' into dl-and-pp-as-parquet
lhoestq Aug 25, 2022
1b02b66
add output_dir arg to download_and_prepare
lhoestq Aug 25, 2022
2e85216
update tests
lhoestq Aug 25, 2022
ba167db
update docs
lhoestq Aug 25, 2022
a9379f8
fix tests
lhoestq Aug 25, 2022
ec94a4b
fix tests
lhoestq Aug 25, 2022
f47871a
fix output parent dir creattion
lhoestq Aug 26, 2022
460e1a6
Apply suggestions from code review
lhoestq Aug 26, 2022
88daa8a
revert changes for remote cache_dir
lhoestq Aug 26, 2022
fdf7252
fix wording in the docs: load -> download and prepare
lhoestq Aug 26, 2022
22aaf7b
style
lhoestq Aug 26, 2022
c051b31
fix
lhoestq Aug 26, 2022
e0a7742
simplify incomplete_dir
lhoestq Aug 26, 2022
53d46cc
fix tests
lhoestq Aug 29, 2022
606951f
albert's comments
lhoestq Sep 5, 2022
222 changes: 120 additions & 102 deletions docs/source/filesystems.mdx
@@ -1,6 +1,8 @@
# Cloud storage

🤗 Datasets supports access to cloud storage providers through a S3 filesystem implementation: [`filesystems.S3FileSystem`]. You can save and load datasets from your Amazon S3 bucket in a Pythonic way. Take a look at the following table for other supported cloud storage providers:
🤗 Datasets supports access to cloud storage providers through `fsspec` FileSystem implementations.
You can save and load datasets from any cloud storage in a Pythonic way.
Take a look at the following table for some examples of supported cloud storage providers:

| Storage provider | Filesystem implementation |
|----------------------|---------------------------------------------------------------|
@@ -10,175 +12,191 @@
| Dropbox | [dropboxdrivefs](https://github.com/MarineChap/dropboxdrivefs)|
| Google Drive | [gdrivefs](https://github.com/intake/gdrivefs) |

This guide will show you how to save and load datasets with **s3fs** to a S3 bucket, but other filesystem implementations can be used similarly. An example is shown also for Google Cloud Storage and Azure Blob Storage.
This guide will show you how to save and load datasets with any cloud storage, with examples for S3, Google Cloud Storage, and Azure Blob Storage.

## Amazon S3
## Set up your cloud storage FileSystem

### Listing datasets
### Amazon S3

1. Install the S3 dependency with 🤗 Datasets:

```
>>> pip install datasets[s3]
```

2. List files from a public S3 bucket with `s3.ls`:
2. Define your credentials

To use an anonymous connection, use `anon=True`.
Otherwise, include your `aws_access_key_id` and `aws_secret_access_key` whenever you are interacting with a private S3 bucket.

```py
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json.json','dataset.arrow','state.json']
>>> storage_options = {"anon": True} # for anonymous connection
# or use your credentials
>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key} # for private buckets
# or use a botocore session
>>> import botocore
>>> s3_session = botocore.session.Session(profile="my_profile_name")
>>> storage_options = {"session": s3_session}
```
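If you prefer not to hard-code credentials, a common pattern is to derive `storage_options` from environment variables. The helper below is a hypothetical sketch, not part of 🤗 Datasets:

```python
import os

def s3_storage_options_from_env() -> dict:
    """Build s3fs storage_options from AWS_* environment variables,
    falling back to an anonymous connection. Hypothetical helper."""
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"key": key, "secret": secret}
    return {"anon": True}
```

Any of the three forms shown above (anonymous, key/secret, or a botocore session) can be produced this way; pick whichever matches how your environment stores credentials.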

Access a private S3 bucket by entering your `aws_access_key_id` and `aws_secret_access_key`:
3. Load your FileSystem instance

```py
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> s3.ls('my-private-datasets/imdb/train')
['dataset_info.json.json','dataset.arrow','state.json']
>>> import s3fs
>>> fs = s3fs.S3FileSystem(**storage_options)
```

### Saving datasets
### Google Cloud Storage

After you have processed your dataset, you can save it to S3 with [`Dataset.save_to_disk`]:
1. Install the Google Cloud Storage implementation:

```
>>> conda install -c conda-forge gcsfs
# or install with pip
>>> pip install gcsfs
```

2. Define your credentials

```py
>>> from datasets.filesystems import S3FileSystem
>>> storage_options = {"token": "anon"} # for anonymous connection
# or use your default gcloud credentials or credentials from the google metadata service
>>> storage_options = {"project": "my-google-project"}
# or use credentials from elsewhere, see the documentation at https://gcsfs.readthedocs.io/
>>> storage_options = {"project": "my-google-project", "token": TOKEN}
```

# create S3FileSystem instance
>>> s3 = S3FileSystem(anon=True)
3. Load your FileSystem instance

# saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train', fs=s3)
```py
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(**storage_options)
```

<Tip>
### Azure Blob Storage

Remember to include your `aws_access_key_id` and `aws_secret_access_key` whenever you are interacting with a private S3 bucket.
1. Install the Azure Blob Storage implementation:

</Tip>
```
>>> conda install -c conda-forge adlfs
# or install with pip
>>> pip install adlfs
```

Save your dataset with `botocore.session.Session` and a custom AWS profile:
2. Define your credentials

```py
>>> import botocore
>>> from datasets.filesystems import S3FileSystem

# creates a botocore session with the provided AWS profile
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>> storage_options = {"anon": True} # for anonymous connection
# or use your credentials
>>> storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY} # gen 2 filesystem
# or use your credentials with the gen 1 filesystem
>>> storage_options = {"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}
```

# create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)
3. Load your FileSystem instance

# saves encoded_dataset to your s3 bucket
>>> encoded_dataset.save_to_disk('s3://my-private-datasets/imdb/train',fs=s3)
```py
>>> import adlfs
>>> fs = adlfs.AzureBlobFileSystem(**storage_options)
```

### Loading datasets
## Load and Save your datasets using your cloud storage FileSystem

When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]:
### Load datasets into cloud storage

```py
>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
You can load a dataset into your cloud storage by specifying a remote `output_dir` in `download_and_prepare`.
Don't forget to use the previously defined `storage_options` containing your credentials to write to private cloud storage.

# create S3FileSystem without credentials
>>> s3 = S3FileSystem(anon=True)
The `download_and_prepare` method works in two steps:
1. it first downloads the raw data files (if any) to your local cache. You can set your cache directory by passing `cache_dir` to [`load_dataset_builder`]
2. then it generates the dataset in Arrow or Parquet format in your cloud storage by iterating over the raw data files.
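These two steps can be pictured with a toy, stdlib-only sketch, where local directories stand in for the Hub, the cache, and the cloud storage (this is an illustration, not the actual builder logic):

```python
import shutil
import tempfile
from pathlib import Path

# Toy stand-ins: a "remote" raw data file, a local cache, and an output store.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "hub" / "data.csv"
raw.parent.mkdir(parents=True)
raw.write_text("text,label\ngreat movie,1\n")

# Step 1: download the raw file into the local cache.
cache_dir = workdir / "cache"
cache_dir.mkdir()
cached = cache_dir / raw.name
shutil.copy(raw, cached)

# Step 2: "prepare" the dataset into the output directory by iterating
# over the cached raw data (the real builder writes Arrow or Parquet shards).
output_dir = workdir / "output"
output_dir.mkdir()
rows = cached.read_text().splitlines()[1:]  # drop the CSV header
(output_dir / "train.txt").write_text("\n".join(rows))
```

With a remote `output_dir`, only step 2's destination changes; the raw files still land in the local cache first.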

# load encoded_dataset to from s3 bucket
>>> dataset = load_from_disk('s3://a-public-datasets/imdb/train',fs=s3)
Load a dataset from the Hugging Face Hub (see [how to load from the Hugging Face Hub](./loading#hugging-face-hub)):

>>> print(len(dataset))
>>> # 25000
```py
>>> from datasets import load_dataset_builder
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("imdb")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

Load with `botocore.session.Session` and custom AWS profile:
Load a dataset using a loading script (see [how to load a local loading script](./loading#local-loading-script)):

```py
>>> import botocore
>>> from datasets.filesystems import S3FileSystem

# create S3FileSystem instance with aws_access_key_id and aws_secret_access_key
>>> s3_session = botocore.session.Session(profile='my_profile_name')

# create S3FileSystem instance with s3_session
>>> s3 = S3FileSystem(session=s3_session)
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

# load encoded_dataset to from s3 bucket
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train',fs=s3)
Load your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

>>> print(len(dataset))
>>> # 25000
```py
>>> data_files = {"train": ["path/to/train.csv"]}
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("csv", data_files=data_files)
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

## Google Cloud Storage
Specifying `file_format="parquet"` is highly recommended: it saves the dataset as compressed Parquet files, which optimizes I/O.
Otherwise the dataset is saved as an uncompressed Arrow file.
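To get an intuition for why compression reduces I/O, here is a stdlib sketch with gzip standing in for Parquet's built-in columnar compression:

```python
import gzip

# Repetitive, text-like data compresses well, much like
# same-typed columnar data in a Parquet file.
uncompressed = b"some repeated row of data\n" * 10_000
compressed = gzip.compress(uncompressed)
print(len(compressed) / len(uncompressed))  # a small fraction of the original size
```

Fewer bytes on the wire means faster uploads to and downloads from cloud storage.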

1. Install the Google Cloud Storage implementation:
#### Dask

```
>>> conda install -c conda-forge gcsfs
# or install with pip
>>> pip install gcsfs
```
Dask is a parallel computing library with a pandas-like API for working with larger-than-memory Parquet datasets.
Dask can use multiple threads or processes on a single machine, or a cluster of machines, to process data in parallel.
Dask supports local data as well as data from cloud storage.

2. Save your dataset:
Therefore you can load a dataset saved as sharded Parquet files in Dask with:

```py
>>> import gcsfs
import dask.dataframe as dd

# create GCSFileSystem instance using default gcloud credentials with project
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project')
df = dd.read_parquet(output_dir, storage_options=storage_options)

# saves encoded_dataset to your gcs bucket
>>> encoded_dataset.save_to_disk('gcs://my-private-datasets/imdb/train', fs=gcs)
# or if your dataset is split into train/valid/test
df_train = dd.read_parquet(output_dir + f"/{builder.name}-train-*.parquet", storage_options=storage_options)
df_valid = dd.read_parquet(output_dir + f"/{builder.name}-validation-*.parquet", storage_options=storage_options)
df_test = dd.read_parquet(output_dir + f"/{builder.name}-test-*.parquet", storage_options=storage_options)
```

3. Load your dataset:
You can find out more about Dask dataframes in their [documentation](https://docs.dask.org/en/stable/dataframe.html).
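The per-split patterns above rely on the shard naming scheme; here is a small sketch of how such a glob pattern selects only the train shards (the shard file names are illustrative assumptions):

```python
from fnmatch import fnmatch

builder_name = "imdb"  # hypothetical builder name
files = [
    "imdb-train-00000-of-00002.parquet",
    "imdb-train-00001-of-00002.parquet",
    "imdb-test-00000-of-00001.parquet",
]
# Same shape as the f"{builder.name}-train-*.parquet" pattern passed to Dask.
train_shards = [f for f in files if fnmatch(f, f"{builder_name}-train-*.parquet")]
print(train_shards)
```

Dask expands the same kind of pattern against the remote filesystem, so each split's dataframe only reads that split's shards.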

```py
>>> import gcsfs
>>> from datasets import load_from_disk
## Saving serialized datasets

# create GCSFileSystem instance using default gcloud credentials with project
>>> gcs = gcsfs.GCSFileSystem(project='my-google-project')
After you have processed your dataset, you can save it to your cloud storage with [`Dataset.save_to_disk`]:

# loads encoded_dataset from your gcs bucket
>>> dataset = load_from_disk('gcs://my-private-datasets/imdb/train', fs=gcs)
```py
# saves encoded_dataset to amazon s3
>>> encoded_dataset.save_to_disk("s3://my-private-datasets/imdb/train", fs=fs)
# Reviewer comment (Contributor): similar to my comment above, it would be
# nice to use the more "native" save methods, like `to_csv`, `to_parquet`, etc.,
# similar to pandas, e.g.:
#     encoded_dataset.to_csv("s3://my-bucket/dataset.csv", storage_options=storage_options)

# saves encoded_dataset to google cloud storage
>>> encoded_dataset.save_to_disk("gcs://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to microsoft azure blob/datalake
>>> encoded_dataset.save_to_disk("adl://my-private-datasets/imdb/train", fs=fs)
```

## Azure Blob Storage

1. Install the Azure Blob Storage implementation:
<Tip>

```
>>> conda install -c conda-forge adlfs
# or install with pip
>>> pip install adlfs
```
Remember to define your credentials in your [FileSystem instance](#set-up-your-cloud-storage-filesystem) `fs` whenever you are interacting with private cloud storage.

2. Save your dataset:
</Tip>

```py
>>> import adlfs
## Listing serialized datasets

# create AzureBlobFileSystem instance with account_name and account_key
>>> abfs = adlfs.AzureBlobFileSystem(account_name="XXXX", account_key="XXXX")
List files from a cloud storage with your FileSystem instance `fs`, using `fs.ls`:

# saves encoded_dataset to your azure container
>>> encoded_dataset.save_to_disk('abfs://my-private-datasets/imdb/train', fs=abfs)
```py
>>> fs.ls("my-private-datasets/imdb/train")
["dataset_info.json", "dataset.arrow", "state.json"]
```
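The same listing API can be tried locally with fsspec's in-memory filesystem, assuming `fsspec` is installed — a sandbox stand-in for a real bucket:

```python
import fsspec

# An in-memory filesystem exposes the same fsspec API as s3fs, gcsfs, adlfs...
fs = fsspec.filesystem("memory")

# Write a small file into the in-memory "bucket".
with fs.open("/my-datasets/imdb/train/state.json", "w") as f:
    f.write("{}")

# List it back, exactly as you would against a real cloud store.
print(fs.ls("/my-datasets/imdb/train", detail=False))
```

Swapping `fsspec.filesystem("memory")` for a real implementation (with your `storage_options`) leaves the rest of the code unchanged.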

3. Load your dataset:
## Load serialized datasets

When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]:

```py
>>> import adlfs
>>> from datasets import load_from_disk

# create AzureBlobFileSystem instance with account_name and account_key
>>> abfs = adlfs.AzureBlobFileSystem(account_name="XXXX", account_key="XXXX")

# loads encoded_dataset from your azure container
>>> dataset = load_from_disk('abfs://my-private-datasets/imdb/train', fs=abfs)
# load encoded_dataset from cloud storage
>>> dataset = load_from_disk("s3://a-public-datasets/imdb/train", fs=fs)
>>> print(len(dataset))
25000
```
2 changes: 1 addition & 1 deletion src/datasets/arrow_dataset.py
@@ -1134,7 +1134,7 @@ def save_to_disk(self, dataset_path: str, fs=None):
fs.makedirs(dataset_path, exist_ok=True)
with fs.open(Path(dataset_path, config.DATASET_ARROW_FILENAME).as_posix(), "wb") as dataset_file:
with ArrowWriter(stream=dataset_file) as writer:
writer.write_table(dataset._data)
writer.write_table(dataset._data.table)
writer.finalize()
with fs.open(
Path(dataset_path, config.DATASET_STATE_JSON_FILENAME).as_posix(), "w", encoding="utf-8"