diff --git a/.github/ISSUE_TEMPLATE/add-dataset.md b/.github/ISSUE_TEMPLATE/add-dataset.md new file mode 100644 index 00000000000..773eccde40b --- /dev/null +++ b/.github/ISSUE_TEMPLATE/add-dataset.md @@ -0,0 +1,15 @@ +--- +name: "Add Dataset" +about: Request the addition of a specific dataset to the library. +title: '' +labels: 'dataset request' +assignees: '' + +--- + +## Adding a Dataset +- **Name:** *name of the dataset* +- **Description:** *short description of the dataset (or link to social media or blog post)* +- **Paper:** *link to the dataset paper if available* +- **Data:** *link to the Github repository or current dataset location* +- **Motivation:** *what are some good reasons to have this dataset* diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml index 8d778801a30..2d3ae9f1e29 100644 --- a/.github/ISSUE_TEMPLATE/config.yml +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -1,7 +1,7 @@ contact_links: - name: Datasets on the Hugging Face Hub url: https://huggingface.co/datasets - about: Open a Pull request / Discussion related to a specific dataset on the Hugging Face Hub (PRs for datasets with no namespace still have to be on GitHub though) + about: Please use the "Community" tab of the dataset on the Hugging Face Hub to open a discussion or a pull request - name: Forum url: https://discuss.huggingface.co/c/datasets/10 about: Please ask and answer questions here, and engage with other community members diff --git a/ADD_NEW_DATASET.md b/ADD_NEW_DATASET.md index 0705e4d8fc0..916213064d3 100644 --- a/ADD_NEW_DATASET.md +++ b/ADD_NEW_DATASET.md @@ -1,357 +1,8 @@ -# How to add one (or several) new datasets to 🤗 Datasets - -ADD DATASETS DIRECTLY ON THE 🤗 HUGGING FACE HUB ! 
- -You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation: - -* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset) -* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share) - -## What about the datasets scripts in this GitHub repository then ? - -Datasets used to be hosted in this GitHub repository, but all datasets have now been migrated to the Hugging Face Hub. -The legacy GitHub datasets were added originally on the GitHub repository and therefore don't have a namespace: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name". -Those datasets are still maintained on GitHub, and if you'd like to edit them, please open a Pull Request on the huggingface/datasets repository. - -Sharing your dataset to the Hub is the recommended way of adding a dataset. - -In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under. - -The following presents how to open a Pull Request on GitHub to add a new dataset to this repository. - -## Add a new dataset to this repository (legacy) - -### Start by preparing your environment - -1. Fork the [repository](https://github.com/huggingface/datasets) by clicking on the 'Fork' button on the repository's page. -This creates a copy of the code under your GitHub user account. - -2. Clone your fork to your local disk, and add the base repository as a remote: - - ```bash - git clone https://github.com//datasets - cd datasets - git remote add upstream https://github.com/huggingface/datasets.git - ``` - -3. 
(**For Windows**) You will need to install [the right version](https://pytorch.org/get-started/locally/) of PyTorch before continuing because `pip install torch` may not work well for PyTorch on Windows. - -4. Set up a development environment, for instance by running the following command: - - ```bash - conda create -n env python=3.7 --y - conda activate env - pip install -e ".[dev]" - ``` - -5. Open the [online Datasets Tagging application](https://huggingface.co/spaces/huggingface/datasets-tagging). - -6. You should also open the online form that will allow you to [create dataset cards](https://huggingface.co/datasets/card-creator/) in a browser window (courtesy of [Evrard t'Serstevens](https://huggingface.co/evrardts).) - -Now you are ready, each time you want to add a new dataset, follow the steps in the following section: - -### Adding a new dataset - -#### Understand the structure of the dataset - -1. Find a short-name for the dataset: - - - Select a `short name` for the dataset which is unique but not too long and is easy to guess for users, e.g. `squad`, `natural_questions` - - Sometimes the short-list name is already given/proposed (e.g. in the spreadsheet of the data sprint to reach v2.0 if you are participating in the effort) - -You are now ready to start the process of adding the dataset. We will create the following files: - -- a **dataset script** which contains the code to download and pre-process the dataset: e.g. `squad.py`, -- a **dataset card** with tags and information on the dataset in a `README.md`. - -2. Let's start by creating a new branch to hold your development changes with the name of your dataset: - - ```bash - git fetch upstream - git rebase upstream/main - git checkout -b a-descriptive-name-for-my-changes - ``` - - **Do not** work on the `main` branch. - -3. Create your dataset folder under `datasets/`: - - ```bash - mkdir ./datasets/ - ``` - -4. 
Open a new online [dataset card form](https://huggingface.co/datasets/card-creator/) to fill out: you will be able to download it to your dataset folder with the `Export` button when you are done. Alternatively, you can also manually create and edit a dataset card in the folder by copying the template: - - ```bash - cp ./templates/README.md ./datasets//README.md - ``` - -5. Now explore the dataset you have selected while completing some fields of the **dataset card** while you are doing it: - - - Find the research paper or description presenting the dataset you want to add - - Read the relevant part of the paper/description presenting the dataset - - Find the location of the data for your dataset - - Download/open the data to see how it looks like - - While you explore and read about the dataset, you can complete some sections of the dataset card (the online form or the one you have just created at `./datasets//README.md`). You can just copy the information you meet in your readings in the relevant sections of the dataset card (typically in `Dataset Description`, `Dataset Structure` and `Dataset Creation`). - - If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md. - - There is a also a (very detailed) example here: https://github.com/huggingface/datasets/tree/main/datasets/eli5. - - Don't spend too much time completing the dataset card, just copy what you find when exploring the dataset documentation. If you can't find all the information it's ok. You can always spend more time completing the dataset card while we are reviewing your PR (see below) and the dataset card will be open for everybody to complete them afterwards. If you don't know what to write in a section, just leave the `[More Information Needed]` text. 
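The short-name convention described above (a unique, lowercase, snake-case identifier such as `squad` or `natural_questions`) can be captured in a small regex check. The helper below is a hypothetical illustration of that convention only — it is not part of the `datasets` tooling:

```python
import re

# Hypothetical check: a dataset short name should be lowercase snake_case,
# e.g. "squad" or "natural_questions". Not part of the datasets library.
SHORT_NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def is_valid_short_name(name: str) -> bool:
    return bool(SHORT_NAME_RE.match(name))
```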
- - -#### Write the loading/processing code - -Now let's get coding :-) - -The dataset script is the main entry point to load and process the data. It is a python script under `datasets//.py`. - -There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/main/about_dataset_load.html). - -Note on naming: the dataset class should be camel case, while the dataset short_name is its snake case equivalent (ex: `class BookCorpus` for the dataset `book_corpus`). - -To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py): - -```bash -cp ./templates/new_dataset_script.py ./datasets//.py -``` - -And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/main/dataset_script.html). - -You can also start (or copy any part) from one of the datasets of reference listed below. The main criteria for choosing among these reference dataset is the format of the data files (JSON/JSONL/CSV/TSV/text) and whether you need or don't need several configurations (see above explanations on configurations). 
Feel free to reuse any parts of the following examples and adapt them to your case: - -- question-answering: [squad](https://github.com/huggingface/datasets/blob/main/datasets/squad/squad.py) (original data are in json) -- natural language inference: [snli](https://github.com/huggingface/datasets/blob/main/datasets/snli/snli.py) (original data are in text files with tab separated columns) -- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/main/datasets/conll2003/conll2003.py) (original data are in text files with one token per line) -- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/main/datasets/allocine/allocine.py) (original data are in jsonl files) -- text classification: [ag_news](https://github.com/huggingface/datasets/blob/main/datasets/ag_news/ag_news.py) (original data are in csv files) -- translation: [flores](https://github.com/huggingface/datasets/blob/main/datasets/flores/flores.py) (original data come from text files - one per language) -- summarization: [billsum](https://github.com/huggingface/datasets/blob/main/datasets/billsum/billsum.py) (original data are in json files) -- benchmark: [glue](https://github.com/huggingface/datasets/blob/main/datasets/glue/glue.py) (original data are various formats) -- multilingual: [xquad](https://github.com/huggingface/datasets/blob/main/datasets/xquad/xquad.py) (original data are in json) -- multitask: [matinf](https://github.com/huggingface/datasets/blob/main/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication) -- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/main/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format) -- image classification: 
[beans](https://github.com/huggingface/datasets/blob/main/datasets/beans/beans.py) (original data are in .jpg format) -- object detection: [wider_face](https://github.com/huggingface/datasets/blob/main/datasets/wider_face/wider_face.py) (image files are in .jpg format and metadata come from text files) - -While you are developing the dataset script you can list test it by opening a python interpreter and running the script (the script is dynamically updated each time you modify it): - -```python -from datasets import load_dataset - -data = load_dataset('./datasets/') -``` - -This let you for instance use `print()` statements inside the script as well as seeing directly errors and the final dataset format. - -**What are configurations and splits** - -Sometimes you need to use several *configurations* and/or *splits* (usually at least splits will be defined). - -* Using several **configurations** allow to have like sub-datasets inside a dataset and are needed in two main cases: - - - The dataset covers or group several sub-datasets or domains that the users may want to access independently and/or - - The dataset comprise several sub-part with different features/organizations of the data (e.g. two types of CSV files with different types of columns). Inside a configuration of a dataset, all the data should have the same format (columns) but the columns can change across configurations. - -* **Splits** are a more fine grained division than configurations. They allow you, inside a configuration of the dataset, to split the data in typically train/validation/test splits. All the splits inside a configuration should have the same columns/features and splits are thus defined for each specific configurations of there are several. - - -**Some rules to follow when adding the dataset**: - -- try to give access to all the data, columns, features and information in the dataset. 
If the dataset contains various sub-parts with differing formats, create several configurations to give access to all of them. -- datasets in the `datasets` library are typed. Take some time to carefully think about the `features` (see an introduction [here](https://huggingface.co/docs/datasets/about_dataset_features.html) and the full list of possible features [here](https://huggingface.co/docs/datasets/package_reference/main_classes.html#features) -- if some of you dataset features are in a fixed set of classes (e.g. labels), you should use a `ClassLabel` feature. - - -#### Tests (optional) - - To check that your dataset works correctly and to create its `dataset_info` metadata in the dataset card, run the command: - - -```bash -datasets-cli test datasets/ --save_info --all_configs -``` - -**Note:** If your dataset requires manually downloading the data and having the user provide the path to the dataset you can run the following command: -```bash -datasets-cli test datasets/ --save_info --all_configs --data_dir your/manual/dir -``` -To have the configs use the path from `--data_dir` when generating them. - -#### Automatically add code metadata - -Now that your dataset script runs and create a dataset with the format you expected, you can add the JSON metadata and test data. - -**Make sure you run all of the following commands from the root of your `datasets` git clone.** - -1. To create the dummy data for continuous testing, there is a tool that automatically generates dummy data for you. At the moment it supports data files in the following format: txt, csv, tsv, jsonl, json, xml. 
- - If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with: - - ```bash - datasets-cli dummy_data datasets/ --auto_generate - ``` - - Example: - - ```bash - datasets-cli dummy_data ./datasets/snli --auto_generate - ``` - - If your data files are not in the supported format, you can run the same command without the `--auto_generate` flag. It should give you instructions on the files to manually create (basically, the same ones as for the real dataset but with only five items). - - ```bash - datasets-cli dummy_data datasets/ - ``` - - If this doesn't work more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/dataset_script.html#dummy-data). - - If you've been fighting with dummy data creation without success for some time and can't seems to make it work: Go to the next step (open a Pull Request) and we'll help you cross the finish line 🙂. - -2. Now test that both the real data and the dummy data work correctly using the following commands: - - *For the real data*: - ```bash - RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_ - ``` - and - - *For the dummy data*: - ```bash - RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_ - ``` - - On **Windows**, you may need to run: - ``` - $Env:RUN_SLOW = "1" - pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_ - pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_ - ``` - to enable the slow tests, instead of `RUN_SLOW=1`. - -3. If all tests pass, your dataset works correctly. 
You can finally create the metadata by running the command: - - ```bash - datasets-cli test datasets/ --all_configs - ``` - - This first command should create a `README.md` file containing the metadata if this file doesn't exist already, or add the metadata to an existing `README.md` file in your dataset folder. - - -You have now finished the coding part, congratulation! 🎉 You are Awesome! 😎 - -Note: You can use the CLI tool from the root of the repository with the following command: -```bash -python src/datasets/commands/datasets_cli.py -``` - -#### Open a Pull Request on the main HuggingFace repo and share your work!! - -Here are the step to open the Pull-Request on the main repo. - -1. Format your code. Run black, isort and flake8 so that your newly added files look nice with the following commands: - - ```bash - make style - flake8 datasets - ``` - - If you are on windows and `make style` doesn't work you can do the following steps instead: - - ```bash - pip install black - pip install isort - pip install flake8 - - black --line-length 119 --target-version py36 datasets/your_dataset/your_dataset.py - - isort datasets/your_dataset/your_dataset.py - - flake8 datasets/your_dataset - ``` - -2. Make sure that you have a dataset card (more information in the [next section](#tag-the-dataset-and-write-the-dataset-card)) with: - - 1. **Required:** - - The YAML tags obtained with the [online Datasets Tagging app](https://huggingface.co/spaces/huggingface/datasets-tagging). - - A description of the various fields in your dataset. - 2. Any relevant information you would like to share with users of your dataset in the appropriate paragraphs. - - You can use the online [dataset card creator](https://huggingface.co/datasets/card-creator/) - -3. 
Once you're happy with your dataset script file, add your changes and make a commit to record your changes locally: - - ```bash - git add datasets/ - git commit - ``` - - It is a good idea to sync your copy of the code with the original - repository regularly. This way you can quickly account for changes: - - - If you haven't pushed your branch yet, you can rebase on upstream/main: - - ```bash - git fetch upstream - git rebase upstream/main - ``` - - - If you have already pushed your branch, do not rebase but merge instead: - - ```bash - git fetch upstream - git merge upstream/main - ``` - - Push the changes to your account using: - - ```bash - git push -u origin a-descriptive-name-for-my-changes - ``` - -3. Once you are satisfied, go the webpage of your fork on GitHub. Click on "Pull request" to send your to the project maintainers for review. - -Congratulation you have open a PR to add a new dataset 🙏 - -**Important note:** In order to merge your Pull Request the maintainers will require you to tag and add a dataset card. Here is now how to do this last step: - -#### Tag the dataset and write the dataset card - -Each dataset is provided with a dataset card. - -The dataset card and in particular the tags which are on it are **really important** to make sure the dataset can be found on the hub and will be used by the users. Users need to have the best possible idea of what's inside the dataset and how it was created so that they can use it safely and have a good idea of the content. - -Creating the dataset card goes in two steps: - -1. **Tagging the dataset using the Datasets Tagging app** - - - Use the [online Datasets Tagging application](https://huggingface.co/spaces/huggingface/datasets-tagging). - - Enter the full path to your dataset folder on the left, and tag the different configs :-) (And don't forget to save to file after you're done with a config!) - -2. 
**Copy the tags in the dataset card and complete the dataset card** - - - You can use the online [dataset card creator](https://huggingface.co/datasets/card-creator/) - - - **Essential:** Once you have saved the tags for all configs, you can expand the **Show YAML output aggregating the tags** section on the right, which will show you a YAML formatted block to put in the relevant section of the [online form](https://huggingface.co/datasets/card-creator/) (or manually paste into your README.md). - - - **Very important as well:** On the right side of the tagging app, you will also find an expandable section called **Show Markdown Data Fields**. This gives you a starting point for the description of the fields in your dataset: you should paste it into the **Data Fields** section of the [online form](https://huggingface.co/datasets/card-creator/) (or your local README.md), then modify the description as needed. Briefly describe each of the fields and indicate if they have a default value (e.g. when there is no label). If the data has span indices, describe their attributes (character level or word level, contiguous or not, etc). If the datasets contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points. - - Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/main/datasets/eli5#data-fields): - - Data Fields: - - q_id: a string question identifier for each example, corresponding to its ID in the Pushshift.io Reddit submission dumps. 
- - subreddit: One of explainlikeimfive, askscience, or AskHistorians, indicating which subreddit the question came from - - title: title of the question, with URLs extracted and replaced by URL_n tokens - - title_urls: list of the extracted URLs, the nth element of the list was replaced by URL_n - - - - **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completed it which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md. - - Here is a completed example: https://github.com/huggingface/datasets/tree/main/datasets/eli5 for inspiration - - If you don't know what to write in a field and can find it, write: `[More Information Needed]` - -If you are using the online form, you can then click the `Export` button at the top to download a `README.md` file to your data folder. Once your `README.md` is ok you have finished all the steps to add your dataset, congratulation your Pull Request can be merged. - -**You have made another dataset super easy to access for everyone in the community! 🤯** +# How to add a new dataset + +Add datasets directly to the 🤗 Hugging Face Hub! 
+ +You can share your dataset on https://huggingface.co/datasets directly using your account; see the documentation: + +* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset) +* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share) diff --git a/README.md b/README.md index 348c7bc5464..f924eb0b6d0 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,7 @@ [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb) -[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md) +[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://huggingface.co/docs/datasets/share.html)

@@ -127,8 +127,6 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub. -However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md). - # Main differences between 🤗 Datasets and `tfds` If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗 Datasets and `tfds`: diff --git a/docs/source/about_dataset_load.mdx b/docs/source/about_dataset_load.mdx index ebddab3ddc5..d5c32a71bb7 100644 --- a/docs/source/about_dataset_load.mdx +++ b/docs/source/about_dataset_load.mdx @@ -102,12 +102,12 @@ To ensure a dataset is complete, [`load_dataset`] will perform a series of tests If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files. In this case, an error is raised to alert that the dataset has changed. To ignore the error, one needs to specify `ignore_verifications=True` in [`load_dataset`]. -Anytime you see a verification error, feel free to [open an issue on GitHub](https://github.com/huggingface/datasets/issues) so that we can update the integrity checks for this dataset. +Anytime you see a verification error, feel free to open a discussion or pull request in the corresponding dataset "Community" tab, so that the integrity checks for that dataset are updated. ## Security The dataset repositories on the Hub are scanned for malware, see more information [here](https://huggingface.co/docs/hub/security#malware-scanning). -Moreover the datasets that were constributed on our GitHub repository have all been reviewed by our maintainers. +Moreover the datasets without a namespace (originally contributed on our GitHub repository) have all been reviewed by our maintainers. 
The code of these datasets is considered **safe**. It concerns datasets that are not under a namespace, e.g. "squad" or "glue", unlike the other datasets that are named "username/dataset_name" or "org/dataset_name". diff --git a/docs/source/loading.mdx b/docs/source/loading.mdx index 3ef4a059809..3dd51e98639 100644 --- a/docs/source/loading.mdx +++ b/docs/source/loading.mdx @@ -281,7 +281,7 @@ An object data type in [pandas.Series](https://pandas.pydata.org/docs/reference/ ## Offline -Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub or 🤗 Datasets GitHub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline. +Even if you don't have an internet connection, it is still possible to load a dataset. As long as you've downloaded a dataset from the Hub repository before, it should be cached. This means you can reload the dataset from the cache and use it offline. If you know you won't have internet access, you can run 🤗 Datasets in full offline mode. This saves time because instead of waiting for the Dataset builder download to time out, 🤗 Datasets will look directly in the cache. Set the environment variable `HF_DATASETS_OFFLINE` to `1` to enable full offline mode. diff --git a/docs/source/share.mdx b/docs/source/share.mdx index 2fc73db2253..fb34edf237a 100644 --- a/docs/source/share.mdx +++ b/docs/source/share.mdx @@ -144,18 +144,14 @@ Members of the Hugging Face team will be happy to review your dataset script and ## Datasets on GitHub (legacy) Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub. -The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name". 
-Those datasets are still maintained on GitHub, and if you'd like to edit them, please open a Pull Request on the huggingface/datasets repository. -Sharing your dataset to the Hub is the recommended way of adding a dataset. + +The legacy GitHub datasets were added originally on our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc. unlike the other datasets that are named "username/dataset_name" or "org/dataset_name". -The distinction between a Hub dataset and a dataset from GitHub only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself. +The distinction between a Hub dataset with or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decisioning, or opinion regarding the contents of the dataset itself. -The code of these datasets are reviewed by the Hugging Face team, and they require test data in order to be regularly tested. - -In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under. - -For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md). +Those datasets are now maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request. +The code of these datasets is reviewed by the Hugging Face team. diff --git a/src/datasets/inspect.py b/src/datasets/inspect.py index 1c0fa7e916a..8aa48274c1a 100644 --- a/src/datasets/inspect.py +++ b/src/datasets/inspect.py @@ -341,12 +341,9 @@ def get_dataset_config_info( data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s). 
     download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
     download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
-    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-        - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-          You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-        - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-          You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+        As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+        You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
     use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
         If True, will get token from `"~/.huggingface"`.
     **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
@@ -405,12 +402,9 @@ def get_dataset_split_names(
     data_files (:obj:`str` or :obj:`Sequence` or :obj:`Mapping`, optional): Path(s) to source data file(s).
     download_config (:class:`~download.DownloadConfig`, optional): Specific download configuration parameters.
     download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
-    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-        - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-          You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-        - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-          You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+        As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+        You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
     use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
         If True, will get token from `"~/.huggingface"`.
     **config_kwargs (additional keyword arguments): optional attributes for builder class which will override the attributes if supplied.
diff --git a/src/datasets/load.py b/src/datasets/load.py
index 7d8b88cb167..26e13bafde1 100644
--- a/src/datasets/load.py
+++ b/src/datasets/load.py
@@ -1079,12 +1079,9 @@ def dataset_module_factory(
             -> load the dataset builder from the dataset script in the dataset repository
             e.g. ``glue``, ``squad``, ``'username/dataset_name'``, a dataset repository on the HF hub containing a dataset script `'dataset_name.py'`.
-    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-        - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-          You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-        - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-          You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+        As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+        You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
     download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters.
     download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
     dynamic_modules_path (Optional str, defaults to HF_MODULES_CACHE / "datasets_modules", i.e. ~/.cache/huggingface/modules/datasets_modules):
@@ -1121,9 +1118,6 @@ def dataset_module_factory(
     # - if path is a local directory (but no python file)
     #   -> use a packaged module (csv, text etc.) based on content of the directory
     #
-    # - if path has no "/" and is a module on GitHub (in /datasets)
-    #   -> use the module from the python file on GitHub
-    #   Note that this case will be removed in favor of loading from the HF Hub instead eventually
     # - if path has one "/" and is dataset repository on the HF hub with a python file
     #   -> the module from the python file in the dataset repository
     # - if path has one "/" and is dataset repository on the HF hub without a python file
@@ -1459,12 +1453,9 @@ def load_dataset_builder(
     features (:class:`Features`, optional): Set the features type to use for this dataset.
     download_config (:class:`~utils.DownloadConfig`, optional): Specific download configuration parameters.
     download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode.
-    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-        - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-          You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-        - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-          You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+        As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+        You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
     use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
         If True, will get token from `"~/.huggingface"`.
     **config_kwargs (additional keyword arguments): Keyword arguments to be passed to the :class:`BuilderConfig`
@@ -1629,12 +1620,9 @@ def load_dataset(
         will not be copied in-memory unless explicitly enabled by setting `datasets.config.IN_MEMORY_MAX_SIZE` to
         nonzero. See more details in the :ref:`load_dataset_enhancing_performance` section.
     save_infos (:obj:`bool`, default ``False``): Save the dataset information (checksums/size/splits/...).
-    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:
-
-        - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib.
-          You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues.
-        - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
-          You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
+    revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load.
+        As datasets have their own git repository on the Datasets Hub, the default version "main" corresponds to their "main" branch.
+        You can specify a different version than the default "main" by using a commit SHA or a git tag of the dataset repository.
     use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
         If True, will get token from `"~/.huggingface"`.
     task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` to
        standardized column names and types as detailed in :py:mod:`datasets.tasks`.
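The docstring change above describes how `revision` selects a branch, tag, or commit SHA of a dataset's git repository on the Hub. A minimal sketch of the corresponding call (the import is deferred so the sketch stays self-contained; actually running it downloads data and requires the `datasets` package and network access):

```python
def load_squad_at(revision="main"):
    """Sketch: load the "squad" dataset script pinned to a git revision.

    ``revision`` may be a branch name ("main", the default), a git tag,
    or a commit SHA of the dataset repository on the Hugging Face Hub.
    """
    # Imported lazily: calling this function needs `datasets` installed
    # and network access to the Hub.
    from datasets import load_dataset

    return load_dataset("squad", revision=revision)
```

Passing a commit SHA rather than a branch name makes the load reproducible, since a branch like "main" can move while a SHA cannot.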