Download and prepare as Parquet for cloud storage #4724

Merged

Commits (40)
- `6d89cfb` use fsspec for caching (lhoestq)
- `606a48f` add parquet writer (lhoestq)
- `cdf8dcd` add file_format argument (lhoestq)
- `ad91270` style (lhoestq)
- `742d2a9` use "gs" instead of "gcs" for apache beam + use is_remote_filesystem (lhoestq)
- `aed8ce6` typo (lhoestq)
- `93d5660` fix test (lhoestq)
- `65c2037` test ArrowWriter with filesystem (lhoestq)
- `84d8397` test parquet writer (lhoestq)
- `4c46349` more tests (lhoestq)
- `ee7e3f5` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `033a3b8` more tests (lhoestq)
- `ce8d7f9` fix nullcontext on 3.6 (lhoestq)
- `15dccf9` parquet_writer.write_batch is not available in pyarrow 6 (lhoestq)
- `3a3d784` remove reference to open file (lhoestq)
- `3eef46d` fix test (lhoestq)
- `b480549` docs (lhoestq)
- `713f83c` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `1db12b9` docs: dask from parquet files (lhoestq)
- `874b2a0` Apply suggestions from code review (lhoestq)
- `f6ecb64` use contextlib.nullcontext (lhoestq)
- `b0e4222` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `e7f3ac4` fix missing import (lhoestq)
- `df0343a` Use unstrip_protocol to merge protocol and path (mariosasko)
- `a0f84f4` remove bad "raise" and add TODOs (lhoestq)
- `509ff3f` Merge branch 'main' into dl-and-pp-as-parquet (lhoestq)
- `1b02b66` add output_dir arg to download_and_prepare (lhoestq)
- `2e85216` update tests (lhoestq)
- `ba167db` update docs (lhoestq)
- `a9379f8` fix tests (lhoestq)
- `ec94a4b` fix tests (lhoestq)
- `f47871a` fix output parent dir creattion (lhoestq)
- `460e1a6` Apply suggestions from code review (lhoestq)
- `88daa8a` revert changes for remote cache_dir (lhoestq)
- `fdf7252` fix wording in the docs: load -> download and prepare (lhoestq)
- `22aaf7b` style (lhoestq)
- `c051b31` fix (lhoestq)
- `e0a7742` simplify incomplete_dir (lhoestq)
- `53d46cc` fix tests (lhoestq)
- `606951f` albert's comments (lhoestq)
# Cloud storage

🤗 Datasets supports access to cloud storage providers through `fsspec` FileSystem implementations.
You can save and load datasets from any cloud storage in a Pythonic way.
Take a look at the following table for some examples of supported cloud storage providers:

| Storage provider | Filesystem implementation                                       |
|------------------|-----------------------------------------------------------------|
| …                | …                                                               |
| Dropbox          | [dropboxdrivefs](https://github.com/MarineChap/dropboxdrivefs) |
| Google Drive     | [gdrivefs](https://github.com/intake/gdrivefs)                 |

This guide will show you how to save and load datasets with any cloud storage.
Here are examples for S3, Google Cloud Storage and Azure Blob Storage.

## Set up your cloud storage FileSystem

### Amazon S3
1. Install the S3 dependency with 🤗 Datasets:

```
pip install datasets[s3]
```

2. Define your credentials

To use an anonymous connection, use `anon=True`.
Otherwise, include your `aws_access_key_id` and `aws_secret_access_key` whenever you are interacting with a private S3 bucket.

```py
>>> storage_options = {"anon": True}  # for anonymous connection
# or use your credentials
>>> storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key}  # for private buckets
# or use a botocore session
>>> import botocore
>>> s3_session = botocore.session.Session(profile="my_profile_name")
>>> storage_options = {"session": s3_session}
```

3. Load your FileSystem instance

```py
>>> import s3fs
>>> fs = s3fs.S3FileSystem(**storage_options)
```
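The `storage_options` pattern is the same for any fsspec-compatible filesystem: build a plain dict, then unpack it into the filesystem constructor. As a minimal sketch that runs without any cloud credentials, fsspec's built-in `memory` filesystem is used below purely as a stand-in for `s3fs` (the bucket path is made up for illustration):

```python
import fsspec

# An empty dict here; for a real private S3 bucket this would hold
# e.g. {"key": ..., "secret": ...} as shown above.
storage_options = {}
fs = fsspec.filesystem("memory", **storage_options)

# Write a small file, then list the directory, just as you would with fs.ls on S3.
with fs.open("my-bucket/imdb/train/state.json", "w") as f:
    f.write("{}")

print(fs.ls("my-bucket/imdb/train", detail=False))
```

The same dict can later be passed as `storage_options=` to APIs that accept it, so credentials are defined once and reused.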

### Google Cloud Storage

1. Install the Google Cloud Storage implementation:

```
conda install -c conda-forge gcsfs
# or install with pip
pip install gcsfs
```

2. Define your credentials

```py
>>> storage_options = {"token": "anon"}  # for anonymous connection
# or use your default gcloud credentials or credentials from the google metadata service
>>> storage_options = {"project": "my-google-project"}
# or use your credentials from elsewhere, see the documentation at https://gcsfs.readthedocs.io/
>>> storage_options = {"project": "my-google-project", "token": TOKEN}
```

3. Load your FileSystem instance

```py
>>> import gcsfs
>>> fs = gcsfs.GCSFileSystem(**storage_options)
```

### Azure Blob Storage

1. Install the Azure Blob Storage implementation:

```
conda install -c conda-forge adlfs
# or install with pip
pip install adlfs
```

2. Define your credentials

```py
>>> storage_options = {"anon": True}  # for anonymous connection
# or use your credentials
>>> storage_options = {"account_name": ACCOUNT_NAME, "account_key": ACCOUNT_KEY}  # gen 2 filesystem
# or use your credentials with the gen 1 filesystem
>>> storage_options = {"tenant_id": TENANT_ID, "client_id": CLIENT_ID, "client_secret": CLIENT_SECRET}
```

3. Load your FileSystem instance

```py
>>> import adlfs
>>> fs = adlfs.AzureBlobFileSystem(**storage_options)
```
## Load and Save your datasets using your cloud storage FileSystem

### Download and prepare a dataset into a cloud storage

You can download and prepare a dataset into your cloud storage by specifying a remote `output_dir` in `download_and_prepare`.
Don't forget to use the previously defined `storage_options` containing your credentials to write into a private cloud storage.

The `download_and_prepare` method works in two steps:
1. it first downloads the raw data files (if any) into your local cache. You can set your cache directory by passing `cache_dir` to [`load_dataset_builder`]
2. then it generates the dataset in Arrow or Parquet format in your cloud storage by iterating over the raw data files.

Download and prepare a dataset from the Hugging Face Hub (see [how to load from the Hugging Face Hub](./loading#hugging-face-hub)):

```py
>>> from datasets import load_dataset_builder
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("imdb")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```
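The two-step flow can be sketched with a stdlib-only mock (no real builder, no cloud storage; `fake_download_and_prepare`, the file names, and the shard contents are all made up for illustration): raw files land in a local cache directory first, and the prepared shards are written to a separate output directory, which with a real builder could be a remote fsspec path.

```python
import pathlib
import tempfile

def fake_download_and_prepare(cache_dir: pathlib.Path, output_dir: pathlib.Path):
    """Toy stand-in for builder.download_and_prepare."""
    # Step 1: download the raw data files into the local cache.
    cache_dir.mkdir(parents=True, exist_ok=True)
    raw = cache_dir / "raw.csv"
    raw.write_text("text,label\ngreat movie,1\nbad movie,0\n")

    # Step 2: iterate over the raw files and generate dataset shards in
    # output_dir (a real builder writes actual Arrow or Parquet data here,
    # possibly to a remote path like s3://my-bucket/imdb).
    output_dir.mkdir(parents=True, exist_ok=True)
    shard = output_dir / "csv-train-00000-of-00001.parquet"
    shard.write_bytes(raw.read_bytes())  # placeholder bytes, not real Parquet
    return sorted(p.name for p in output_dir.iterdir())

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    shards = fake_download_and_prepare(root / "cache", root / "out")
    print(shards)  # ['csv-train-00000-of-00001.parquet']
```

The point of the separation is that the cache stays local and reusable, while only the prepared dataset is pushed to (possibly remote) storage.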

Download and prepare a dataset using a loading script (see [how to load a local loading script](./loading#local-loading-script)):

```py
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

```py
>>> data_files = {"train": ["path/to/train.csv"]}
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("csv", data_files=data_files)
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```

It is highly recommended to save the files as compressed Parquet files to optimize I/O by specifying `file_format="parquet"`.
Otherwise the dataset is saved as an uncompressed Arrow file.
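A rough intuition for why compressed files optimize I/O: dataset rows are highly repetitive, so compressed formats move far fewer bytes over the network. This stdlib sketch uses gzip as a loose stand-in for Parquet's columnar compression, on made-up rows shaped like a text-classification dataset:

```python
import gzip
import json

# Made-up, repetitive rows, loosely shaped like a labeled text dataset.
rows = [{"text": "this movie was great", "label": i % 2} for i in range(1000)]

raw = json.dumps(rows).encode("utf-8")  # uncompressed serialization
compressed = gzip.compress(raw)         # stand-in for Parquet compression

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

Real Parquet does better still, since it compresses column by column; this only illustrates the direction of the effect.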

#### Dask

Dask is a parallel computing library with a pandas-like API for working with larger-than-memory Parquet datasets in parallel.
Dask can use multiple threads or processes on a single machine, or a cluster of machines, to process data in parallel.
Dask supports local data as well as data from a cloud storage.

Therefore you can load a dataset saved as sharded Parquet files in Dask with:

```py
import dask.dataframe as dd

df = dd.read_parquet(output_dir, storage_options=storage_options)

# or if your dataset is split into train/valid/test
df_train = dd.read_parquet(output_dir + f"/{builder.name}-train-*.parquet", storage_options=storage_options)
df_valid = dd.read_parquet(output_dir + f"/{builder.name}-validation-*.parquet", storage_options=storage_options)
df_test = dd.read_parquet(output_dir + f"/{builder.name}-test-*.parquet", storage_options=storage_options)
```

You can find more about Dask dataframes in their [documentation](https://docs.dask.org/en/stable/dataframe.html).
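The `{builder.name}-{split}-*.parquet` expressions above are ordinary glob patterns. A stdlib sketch (with hypothetical shard names following that layout) shows which files each split pattern selects:

```python
import fnmatch

# Hypothetical shard names, following the `{builder.name}-{split}-*` layout
# used for sharded dataset output.
files = [
    "imdb-train-00000-of-00002.parquet",
    "imdb-train-00001-of-00002.parquet",
    "imdb-test-00000-of-00001.parquet",
]

builder_name = "imdb"
train_shards = fnmatch.filter(files, f"{builder_name}-train-*.parquet")
test_shards = fnmatch.filter(files, f"{builder_name}-test-*.parquet")
print(train_shards)  # ['imdb-train-00000-of-00002.parquet', 'imdb-train-00001-of-00002.parquet']
print(test_shards)   # ['imdb-test-00000-of-00001.parquet']
```

Dask expands the same patterns against the storage backend, so each `read_parquet` call picks up exactly the shards of its split.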

## Saving serialized datasets

After you have processed your dataset, you can save it to your cloud storage with [`Dataset.save_to_disk`]:

```py
# saves encoded_dataset to amazon s3
>>> encoded_dataset.save_to_disk("s3://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to google cloud storage
>>> encoded_dataset.save_to_disk("gcs://my-private-datasets/imdb/train", fs=fs)
# saves encoded_dataset to microsoft azure blob/datalake
>>> encoded_dataset.save_to_disk("adl://my-private-datasets/imdb/train", fs=fs)
```

> **Contributor review comment:** Similar to my comment above, it would be nice to use the more "native" save methods, like `encoded_dataset.to_csv("s3://my-bucket/dataset.csv", storage_options=storage_options)`, similar to pandas as well.

<Tip>

Remember to define your credentials in your [FileSystem instance](#set-up-your-cloud-storage-filesystem) `fs` whenever you are interacting with a private cloud storage.

</Tip>

## Listing serialized datasets

List files from a cloud storage with your FileSystem instance `fs`, using `fs.ls`:

```py
>>> fs.ls("my-private-datasets/imdb/train")
["dataset_info.json.json", "dataset.arrow", "state.json"]
```

### Load serialized datasets

When you are ready to use your dataset again, reload it with [`Dataset.load_from_disk`]:

```py
>>> from datasets import load_from_disk
# load encoded_dataset from cloud storage
>>> dataset = load_from_disk("s3://a-public-datasets/imdb/train", fs=fs)
>>> print(len(dataset))
25000
```