Merged
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -46,6 +46,8 @@
   title: Cloud storage
 - local: faiss_es
   title: Search index
+- local: cli
+  title: CLI
 - local: how_to_metrics
   title: Metrics
 - local: beam
48 changes: 48 additions & 0 deletions docs/source/cli.mdx
@@ -0,0 +1,48 @@
# Command Line Interface (CLI)

🤗 Datasets provides a command line interface (CLI) with useful shell commands to interact with your dataset.

You can check the available commands:
```shell
$ datasets-cli --help
usage: datasets-cli <command> [<args>]

positional arguments:
  {convert,env,test,run_beam,dummy_data,convert_to_parquet}
                        datasets-cli command helpers
    convert             Convert a TensorFlow Datasets dataset to a HuggingFace Datasets dataset.
    env                 Print relevant system environment info.
    test                Test dataset implementation.
    run_beam            Run a Beam dataset processing pipeline
    dummy_data          Generate dummy data.
    convert_to_parquet  Convert dataset to Parquet

optional arguments:
  -h, --help            show this help message and exit
```

## Convert to Parquet

Convert a script-based dataset on the Hub to Parquet files so that the dataset viewer can support it.

```shell
$ datasets-cli convert_to_parquet --help
usage: datasets-cli <command> [<args>] convert_to_parquet [-h] [--token TOKEN] [--revision REVISION] [--trust_remote_code] dataset_id

positional arguments:
  dataset_id            source dataset ID

optional arguments:
  -h, --help            show this help message and exit
  --token TOKEN         access token to the Hugging Face Hub
  --revision REVISION   source revision
  --trust_remote_code   whether to trust the code execution of the load script
```
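
As a minimal sketch of the usage text above, the snippet below composes a basic invocation. The dataset ID `username/my_dataset`, the `run` helper, and the `DRY_RUN` variable are all hypothetical names made up for this example; by default the helper only prints the command so you can review it before running it for real.

```shell
# Hypothetical dataset ID; replace it with your own "<user>/<name>".
DATASET_ID="username/my_dataset"

# run: print the command when DRY_RUN=1 (the default in this sketch),
# otherwise execute it. Both the helper and DRY_RUN are illustrative only.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Dry run of the conversion command for the dataset above.
run datasets-cli convert_to_parquet "$DATASET_ID"
```

Setting `DRY_RUN=0` makes the helper execute the command instead of printing it, which assumes `datasets-cli` is installed and that you have write access to the dataset.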

This command:
- makes a copy of the script on the "main" branch into a dedicated branch called "script" (if it does not already exist)
- creates a pull request on the Hub dataset to convert it to Parquet files (and deletes the script from the "main" branch)

If in the future you need to recreate the Parquet files from the "script" branch, pass the `--revision script` argument.
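
A possible way to spell out that rebuild is sketched here. The dataset ID and the `rebuild_from_script` helper are hypothetical; the helper only prints the command it would run, so remove the `echo` once the printed line looks right.

```shell
# Hypothetical dataset ID; replace it with your own "<user>/<name>".
DATASET_ID="username/my_dataset"
# Branch that holds the copy of the loading script after the first conversion.
SOURCE_REVISION="script"

# Illustrative helper: prints the command instead of running it.
rebuild_from_script() {
  echo datasets-cli convert_to_parquet --revision "$SOURCE_REVISION" "$1"
}

rebuild_from_script "$DATASET_ID"
```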

Note that you should pass the `--trust_remote_code` argument only if you trust the remote code to be executed locally on your machine.
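
One way to keep that decision explicit is to make the flag opt-in, as in this sketch. `TRUST_REMOTE_CODE`, `convert_cmd`, and the dataset ID are names invented for the example, not part of `datasets-cli`; the helper prints the command rather than running it.

```shell
# Hypothetical dataset ID; replace it with your own "<user>/<name>".
DATASET_ID="username/my_dataset"

# Illustrative helper: adds --trust_remote_code only when the (made-up)
# TRUST_REMOTE_CODE variable is set to "yes"; anything else omits the flag.
convert_cmd() {
  if [ "${TRUST_REMOTE_CODE:-no}" = "yes" ]; then
    echo datasets-cli convert_to_parquet --trust_remote_code "$1"
  else
    echo datasets-cli convert_to_parquet "$1"
  fi
}

# The flag is left out unless TRUST_REMOTE_CODE=yes is set by the caller.
convert_cmd "$DATASET_ID"
```

Defaulting to "no" means the loading script is never trusted by accident; you have to review it and opt in deliberately.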