New documentation structure #2718

# Datasets 🤝 Arrow

TO DO:

A brief introduction on why Datasets chose to use Arrow. For example, include some context and background like design decisions and constraints. The user should understand why we decided to use Arrow instead of something else.

## What is Arrow?

Arrow enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages:

* Arrow's standard format allows zero-copy reads, which removes virtually all serialization overhead.
* Arrow is language-agnostic, so the same data can be shared across different programming languages.
* Arrow is column-oriented, so it is faster at querying and processing data.

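To make the column-oriented layout concrete, here is a minimal sketch using `pyarrow` directly (this is plain `pyarrow`, not Datasets' internal code, and the table contents are made up):

```python
import pyarrow as pa

# Build a small in-memory Arrow table: the data is stored column by column.
table = pa.table({"idx": [0, 1, 2], "sentence1": ["a", "b", "c"]})

print(table.schema)          # the column names and types travel with the data
print(table.column("idx"))   # a single column can be read without touching the others
```
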
## Performance

TO DO:

Discussion on Arrow's speed and performance, especially as it relates to Datasets. In particular, this [tweet] from Thom is worth covering: it helps explain how Datasets can iterate over such a massive dataset so quickly.

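Until that discussion is written, here is a rough sketch of how you could time a full pass over a dataset yourself (the dataset and batch size are only illustrative):

```python
import time
from datasets import load_dataset

dataset = load_dataset("glue", "mrpc", split="train")

start = time.time()
for i in range(0, len(dataset), 1000):
    batch = dataset[i : i + 1000]  # each slice is read from the on-disk Arrow data
print(f"Iterated over {len(dataset)} examples in {time.time() - start:.3f}s")
```
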
## Memory

TO DO:

Discussion on memory-mapping and efficiency, which enables this:

```python
>>> from datasets import load_dataset, total_allocated_bytes
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> print("The number of bytes allocated on the drive is", dataset.dataset_size)
The number of bytes allocated on the drive is 1492156
>>> print("For comparison, here is the number of bytes allocated in memory:", total_allocated_bytes())
For comparison, here is the number of bytes allocated in memory: 0
```
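
As an illustrative sketch of the mechanism behind this, memory-mapping can be reproduced with `pyarrow` directly (plain `pyarrow`, not Datasets' internal code; the file name is made up):

```python
import pyarrow as pa

# Write a small table to disk in the Arrow IPC file format.
table = pa.table({"a": list(range(1_000))})
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Read it back through a memory map: the table is backed by the file on disk,
# so almost nothing is allocated in RAM.
with pa.memory_map("data.arrow") as source:
    loaded = pa.ipc.open_file(source).read_all()
print("RAM allocated by Arrow:", pa.total_allocated_bytes())
```
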

[tweet]: https://twitter.com/Thom_Wolf/status/1272512974935203841

The Cache
=========

One of the reasons why Datasets is so efficient is the cache. It stores previously downloaded and processed datasets, so when you need to use them again, Datasets reloads them straight from the cache. This avoids having to download a dataset all over again, or recompute all the processing functions you applied.

Fingerprint
-----------

How does the cache keep track of what transforms you applied to your dataset? Datasets assigns a fingerprint to the cache file. This fingerprint keeps track of the current state of a dataset. The initial fingerprint is computed using a hash of the Arrow table, or a hash of the Arrow files if the dataset is on disk. Subsequent fingerprints are computed by combining the fingerprint of the previous state and a hash of the latest transform applied.

.. tip::

    Transforms are any of the processing methods from the :doc:`How-to Process <./process>` guides, such as :func:`datasets.Dataset.map` or :func:`datasets.Dataset.shuffle`.

Here is what the actual fingerprints look like:

>>> from datasets import Dataset
>>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})
>>> print(dataset1._fingerprint, dataset2._fingerprint)
d19493523d95e2dc 5b86abacd4b42434

In order for a transform to be hashable, it needs to be picklable using `dill <https://dill.readthedocs.io/en/latest/>`_ or `pickle <https://docs.python.org/3/library/pickle.html>`_. When you use a non-hashable transform, Datasets uses a random fingerprint instead and raises a warning. The non-hashable transform is considered different from the previous transforms, so Datasets will recompute everything. Make sure your transforms are serializable with pickle or dill to avoid this!

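As a quick, illustrative check (not part of the Datasets API), you can verify that a transform is serializable with ``dill`` before mapping it:

>>> import dill
>>> transform = lambda x: {"a": x["a"] + 1}
>>> dill.pickles(transform)
True

A transform that closes over an object dill cannot serialize (an open database connection, for example) would return ``False`` here, and Datasets would then fall back to a random fingerprint and recompute the result on every run.
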
An example of when Datasets recomputes everything is when caching is disabled. When this happens, cache files are still created every time, but they are written to a temporary directory that gets deleted once the session ends. These cache files are assigned a random hash instead of a fingerprint.

.. tip::

    If caching is disabled, use :func:`datasets.Dataset.save_to_disk` to save your transformed dataset, or it will be deleted once the session ends.

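For example, a minimal sketch (the directory path is only illustrative):

>>> dataset2.save_to_disk("path/to/dataset/directory")
>>> from datasets import load_from_disk
>>> reloaded_dataset = load_from_disk("path/to/dataset/directory")
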
TO DO: Explain why it needs to be picklable to give the user more context.


Dataset features
================

:class:`datasets.Features` defines the internal structure of a dataset and is used to specify the underlying serialization format. What's more relevant to you though is that :class:`datasets.Features` contains high-level information about everything from the column names and types to the :class:`datasets.ClassLabel`. Datasets relies on `Apache Arrow Automatic Type Inference <https://arrow.apache.org/docs/python/json.html#automatic-type-inference>`_ to generate the features of your dataset. This guide will help you gain a better understanding of Datasets features.

.. tip::

    See the Troubleshooting section of the How-to Load a dataset guide to learn how you can manually specify features in case Arrow inferred an unexpected data type.

The format of :class:`datasets.Features` is simple: ``dict[column_name, column_type]``. The column type provides a wide range of options for describing the type of data you have. Let's take a look at the features of the MRPC dataset again:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
}

The :class:`datasets.Value` feature tells Datasets that the ``idx`` data type is ``int32``, and the sentences' data types are ``string``. A large number of other data types are supported, such as ``bool``, ``float32`` and ``binary`` to name just a few. Take a look at the :class:`datasets.Value` reference for a full list of supported data types.

:class:`datasets.ClassLabel` informs Datasets that the ``label`` column contains two classes with the labels ``not_equivalent`` and ``equivalent``. The labels are stored as integers in the dataset. When you retrieve the labels, :func:`datasets.ClassLabel.str2int` and :func:`datasets.ClassLabel.int2str` carry out the conversion from label name to integer value, and vice versa.

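For example, continuing with the MRPC dataset loaded above:

>>> dataset.features['label'].int2str(0)
'not_equivalent'
>>> dataset.features['label'].str2int('equivalent')
1
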
If your data type contains a list of objects, then you want to use the :class:`datasets.Sequence` feature. Remember the SQuAD dataset?

>>> from datasets import load_dataset
>>> dataset = load_dataset('squad', split='train')
>>> dataset.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}

The ``answers`` field is constructed with the :class:`datasets.Sequence` feature because it contains two sub-fields, ``text`` and ``answer_start``.

.. tip::

    See the Flatten section from How-to Process a dataset to see how you can extract the nested sub-fields as their own independent columns.

Lastly, there are two specific features for machine translation: :class:`datasets.Translation` and :class:`datasets.TranslationVariableLanguages`.

[I think for the translation features, we should either add some example code like we did for the other features or remove it altogether.]

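Here is a minimal, illustrative sketch of how a :class:`datasets.Translation` feature can be declared (the language pair and sentences are made up):

>>> from datasets import Dataset, Features, Translation
>>> features = Features({'translation': Translation(languages=['en', 'fr'])})
>>> dataset = Dataset.from_dict({'translation': [{'en': 'the cat', 'fr': 'le chat'}]}, features=features)
>>> dataset[0]['translation']
{'en': 'the cat', 'fr': 'le chat'}
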

Loading a Dataset
=================

Entire datasets are readily available to you with a single line of code: :func:`datasets.load_dataset`. But how does this simple function deliver whatever dataset you request? This guide will help you understand how :func:`datasets.load_dataset` works.

What happens when you call :func:`datasets.load_dataset`?
------------------------------------------------------------

First, :func:`datasets.load_dataset` downloads and imports the dataset loading script associated with the dataset you requested from the Hugging Face Hub. The Hub is the central repository where all the Hugging Face datasets and models are stored. Code in the loading script defines the dataset information (description, features, URL to the original files, etc.) and tells Datasets how to generate and display examples from it.

.. seealso::

    Read the Share section for a step-by-step guide on how to write your own Dataset loading script!

The loading script will download the dataset files from the original URL and cache the dataset in an Arrow table on the drive. If you've downloaded the dataset before, then Datasets will reload it from the cache to save you the trouble of downloading it again. Finally, Datasets will return the dataset built from the splits specified by the user.

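For example (the dataset and split are just an illustration):

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')

The first call downloads and prepares the dataset; subsequent calls reload it straight from the cache.
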
Maintaining integrity
---------------------

To ensure a dataset is complete, :func:`datasets.load_dataset` will perform some tests on the downloaded files to make sure everything is there. This way, you won't run into any nasty surprises from a dataset that didn't get generated as expected. :func:`datasets.load_dataset` verifies:

* the list of downloaded files
* the number of bytes of the downloaded files
* the SHA256 checksums of the downloaded files
* the number of splits in the generated ``DatasetDict``
* the number of samples in each split of the generated ``DatasetDict``

TO DO: Explain why you would want to disable the verifications or override the information used to perform the verifications.

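As a hedged sketch, skipping these checks currently looks like this, using the ``ignore_verifications`` argument of :func:`datasets.load_dataset` (the dataset is just an example); this can be handy while a loading script is still being written and the expected sizes aren't final:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train', ignore_verifications=True)
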

Batch mapping
=============

Combining the utility of :func:`datasets.Dataset.map` with batch mode is very powerful because it allows you to freely control the size of the generated dataset. You can get creative with this, and take advantage of it for many interesting use cases. In the Map section of the :doc:`How-to Process <./process>` guides, there are some examples of using :func:`datasets.Dataset.map` in batched mode to:

* split long sentences into shorter chunks
* augment the dataset with additional tokens

It will be helpful to understand how this works, so you can come up with your own ways of using :func:`datasets.Dataset.map` in batched mode.

Input size != output size
-------------------------

You may be wondering how you can control the size of the generated dataset. The answer is:

✨ The mapped function accepts a batch of inputs, but the output batch is not required to be the same size. ✨

In other words, the mapped function can take a batch of size ``N`` and return a batch of size ``M``, where ``M`` can be greater or less than ``N``. This means you can concatenate your examples, divide them up, and even add more examples!

However, remember that each field in the output dictionary must contain the **same number of elements** as the other fields in the output dictionary. Otherwise, it is not possible to define the number of examples in the output returned by the mapped function. The number can vary between successive batches processed by the mapped function, but within a single batch, all fields of the output dictionary must have the same number of elements.

TO DO:

Maybe add a code example of when the number of elements in the fields of an output dictionary aren't the same, so the user knows what not to do.

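Here is a minimal, illustrative sketch of both cases (the toy dataset and chunk size are made up):

>>> from datasets import Dataset
>>> dataset = Dataset.from_dict({"text": ["the quick brown fox jumps over the lazy dog"]})
>>> def chunk(batch):
...     # valid: one input example becomes several output examples (N=1, M=3)
...     chunks = []
...     for text in batch["text"]:
...         words = text.split()
...         chunks += [" ".join(words[i:i + 3]) for i in range(0, len(words), 3)]
...     return {"text": chunks}
>>> chunked = dataset.map(chunk, batched=True, remove_columns=dataset.column_names)
>>> len(chunked)
3

>>> # invalid: the output fields have different lengths (3 vs. 1), so the number of
>>> # examples in the output batch is undefined and this call raises an error
>>> dataset.map(lambda batch: {"text": ["a", "b", "c"], "label": [0]}, batched=True, remove_columns=dataset.column_names)
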

Access a Dataset
================

In the previous tutorial, you learned how to successfully load a dataset. This section will familiarize you with the :class:`datasets.Dataset` object. You will learn what a Dataset contains, and how to access all of that information.

A :class:`datasets.Dataset` object is returned when you load an instance of a dataset. This object behaves like a normal Python container.

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')

Metadata
--------

The :class:`datasets.Dataset` object contains a lot of useful information about your dataset. For example, you can return a short description of the dataset, its authors, and even its size by calling ``dataset.info``. This gives you a quick snapshot of the dataset's most important attributes.

>>> dataset.info
DatasetInfo(
    description='GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n',
    citation='@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n', homepage='https://www.microsoft.com/en-us/download/details.aspx?id=52398',
    license='',
    features={'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}, post_processed=None, supervised_keys=None, builder_name='glue', config_name='mrpc', version=1.0.0, splits={'train': SplitInfo(name='train', num_bytes=943851, num_examples=3668, dataset_name='glue'), 'validation': SplitInfo(name='validation', num_bytes=105887, num_examples=408, dataset_name='glue'), 'test': SplitInfo(name='test', num_bytes=442418, num_examples=1725, dataset_name='glue')},
    download_checksums={'https://dl.fbaipublicfiles.com/glue/data/mrpc_dev_ids.tsv': {'num_bytes': 6222, 'checksum': '971d7767d81b997fd9060ade0ec23c4fc31cbb226a55d1bd4a1bac474eb81dc7'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt': {'num_bytes': 1047044, 'checksum': '60a9b09084528f0673eedee2b69cb941920f0b8cd0eeccefc464a98768457f89'}, 'https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt': {'num_bytes': 441275, 'checksum': 'a04e271090879aaba6423d65b94950c089298587d9c084bf9cd7439bd785f784'}},
    download_size=1494541,
    post_processing_size=None,
    dataset_size=1492156,
    size_in_bytes=2986697
)

To access more specific attributes of the dataset, like the ``description``, ``citation``, and ``homepage``, you can call them directly. Take a look at :class:`datasets.DatasetInfo` for a complete list of attributes you can return.

>>> dataset.split
NamedSplit('train')
>>> dataset.description
'GLUE, the General Language Understanding Evaluation benchmark\n(https://gluebenchmark.com/) is a collection of resources for training,\nevaluating, and analyzing natural language understanding systems.\n\n'
>>> dataset.citation
'@inproceedings{dolan2005automatically,\n title={Automatically constructing a corpus of sentential paraphrases},\n author={Dolan, William B and Brockett, Chris},\n booktitle={Proceedings of the Third International Workshop on Paraphrasing (IWP2005)},\n year={2005}\n}\n@inproceedings{wang2019glue,\n title={{GLUE}: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},\n author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R.},\n note={In the Proceedings of ICLR.},\n year={2019}\n}\n\nNote that each GLUE dataset has its own citation. Please see the source to see\nthe correct citation for each contained dataset.'
>>> dataset.homepage
'https://www.microsoft.com/en-us/download/details.aspx?id=52398'

Features and columns
--------------------

A dataset is a table of rows and typed columns. Querying a dataset returns a Python dictionary where the keys correspond to column names, and the values correspond to column values.

>>> dataset[0]
{'idx': 0,
 'label': 1,
 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

You can get the number of rows and columns using the following standard attributes:

>>> dataset.shape
(3668, 4)
>>> dataset.num_columns
4
>>> dataset.num_rows
3668
>>> len(dataset)
3668

List the column names with :func:`datasets.Dataset.column_names`:

>>> dataset.column_names
['idx', 'label', 'sentence1', 'sentence2']

Get detailed information about the columns with :attr:`datasets.Dataset.features`:

>>> dataset.features
{'idx': Value(dtype='int32', id=None),
 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
 'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
}

And you can even retrieve information about a specific feature like :class:`datasets.ClassLabel`:

>>> dataset.features['label'].num_classes
2
>>> dataset.features['label'].names
['not_equivalent', 'equivalent']
>>> dataset.features['label'].str2int('equivalent')
1
>>> dataset.features['label'].str2int('not_equivalent')
0

Rows, slices, batches, and columns
----------------------------------

You can access several rows at a time with slice notation or a list of indices.

>>> dataset[:3]
{'idx': [0, 1, 2],
 'label': [1, 0, 1],
 'sentence1': ['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .'],
 'sentence2': ['Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .', "Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", "On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale ."]
}
>>> dataset[[1, 3, 5]]
{'idx': [1, 3, 5],
 'label': [0, 0, 1],
 'sentence1': ["Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .', 'Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier .'],
 'sentence2': ["Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .", 'Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .', "With the scandal hanging over Stewart 's company , revenue the first quarter of the year dropped 15 percent from the same period a year earlier ."]
}

Querying by the column name will return its values. For example, if you only wanted the first three examples:

>>> dataset['sentence1'][:3]
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .", 'They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .']

Depending on how a :class:`datasets.Dataset` object is queried, the format returned will be different:

* A single row like ``dataset[0]`` returns a Python dictionary of values.
* A batch like ``dataset[5:10]`` returns a Python dictionary of lists of values.
* A column like ``dataset['sentence1']`` returns a Python list of values.