
Conversation

@lewtun (Member) commented May 20, 2021

Closes #2354

I am not sure what post_processed and post_processing_size correspond to, so I have left them empty for now. I also took a guess at some of the other fields, like dataset_size vs size_in_bytes, so I might have misunderstood their meaning.

@lhoestq (Member) left a comment

Thanks :) I added some suggestions.

homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset's license.
features (Features, optional): The features used to specify the dataset's columns, types and conversion methods.
post_processed (PostProcessedInfo, optional):

Suggested change:
- post_processed (PostProcessedInfo, optional):
+ post_processed (PostProcessedInfo, optional): Info regarding the resources of a possible post-processing of a dataset. It can contain the information of an index, for example.
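
For readers skimming the thread, a minimal sketch of how this field surfaces in practice; the dataset name is an arbitrary illustrative choice, and on a freshly loaded dataset the field is typically just None:

```python
from datasets import load_dataset

ds = load_dataset("squad", split="validation")  # arbitrary example dataset

# post_processed is None unless the dataset defines a post-processing
# step, e.g. one that attaches a retrieval index to the table.
print(ds.info.post_processed)
```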

splits (dict, optional): The mapping between split name and metadata.
download_checksums (dict, optional): The mapping between the URL to download the dataset's checksums and corresponding metadata.
download_size (int, optional): The size of the compressed dataset in bytes.
post_processing_size (int, optional):

Suggested change:
- post_processing_size (int, optional):
+ post_processing_size (int, optional): Size of the dataset after the post-processing, if any.
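
As a hedged illustration of how these byte-count fields sit side by side on DatasetInfo (dataset name arbitrary; values depend on what the builder recorded):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")  # arbitrary example dataset

# Both fields hold byte counts; post_processing_size stays None unless
# a post-processing step (such as building an index) actually ran.
print(ds.info.download_size)
print(ds.info.post_processing_size)
```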

download_checksums (dict, optional):
download_size (int, optional):
supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset.
builder_name (str, optional): The name of the :class:`GeneratorBasedBuilder` subclass used to create the dataset. Usually matched to the corresponding script name, but with CamelCase instead of snake_case.

Suggested change:
- builder_name (str, optional): The name of the :class:`GeneratorBasedBuilder` subclass used to create the dataset. Usually matched to the corresponding script name, but with CamelCase instead of snake_case.
+ builder_name (str, optional): The name of the :class:`GeneratorBasedBuilder` subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.
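
A quick sketch of what the suggested wording implies, assuming a dataset whose script name is "squad" and whose builder class is Squad:

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")

# Per the suggestion, builder_name is the snake_case version of the
# builder class name and usually matches the dataset script name.
print(ds.info.builder_name)  # expected: "squad"
```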

splits (dict, optional):
download_checksums (dict, optional):
download_size (int, optional):
supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset.

Suggested change:
- supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset.
+ supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).
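
An illustrative check of this legacy field; "imdb" is an arbitrary choice, and many datasets simply leave the field unset:

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")  # arbitrary example dataset

# A legacy field from TFDS: when set, it names the input feature and the
# label column for supervised learning; None is a common value.
print(ds.info.supervised_keys)
```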

description (str): A description of the dataset.
citation (str): A BibTeX citation of the dataset.
homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset's license.

Suggested change:
- license (str): The dataset's license.
+ license (str): The dataset's license. It can be the name of the license or a paragraph containing the terms of the license.
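
To make the suggested wording concrete, a small sketch (dataset name illustrative; the value may be a short identifier or a full paragraph of terms):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")

# Per the suggested wording, this may be a short license name or the
# full text of the license terms, depending on the dataset author.
print(ds.info.license)
```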

size_in_bytes (int, optional):
task_templates (List[TaskTemplate], optional):
dataset_size (int, optional): The combined size of the Apache Arrow tables for all splits in bytes.
size_in_bytes (int, optional): The combined size of all files associated with the dataset.

Suggested change:
- size_in_bytes (int, optional): The combined size of all files associated with the dataset.
+ size_in_bytes (int, optional): The combined size of all files associated with the dataset (downloaded files + arrow files).
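
A hedged sketch of the relationship the parenthetical describes; the equality is only approximate and depends on which fields the builder actually filled in:

```python
from datasets import load_dataset

info = load_dataset("squad", split="train").info

# Per the suggested parenthetical, size_in_bytes should roughly equal
# downloaded files + arrow files, i.e. download_size + dataset_size.
print(info.download_size, info.dataset_size, info.size_in_bytes)
```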

version (str or Version, optional): The version of the dataset.
splits (dict, optional): The mapping between split name and metadata.
download_checksums (dict, optional): The mapping between the URL to download the dataset's checksums and corresponding metadata.
download_size (int, optional): The size of the compressed dataset in bytes.

Suggested change:
- download_size (int, optional): The size of the compressed dataset in bytes.
+ download_size (int, optional): The size of the files to download to generate the dataset, in bytes.
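
For completeness, a tiny sketch relating download_size to the download_checksums mapping shown in the context above (dataset name arbitrary):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")

# download_checksums maps each downloaded URL to recorded metadata
# (typically num_bytes and a checksum); it can be None if not tracked.
if ds.info.download_checksums:
    for url, meta in list(ds.info.download_checksums.items())[:2]:
        print(url, meta)
print(ds.info.download_size)  # total bytes to download, per the suggestion
```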

citation (str): A BibTeX citation of the dataset.
homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset's license.
features (Features, optional): The features used to specify the dataset's columns, types and conversion methods.

Suggested change:
- features (Features, optional): The features used to specify the dataset's columns, types and conversion methods.
+ features (Features, optional): The features used to specify the dataset's column types.
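
To ground the suggested one-liner, a minimal Features definition mapping column names to types; the columns below are invented for illustration:

```python
from datasets import ClassLabel, Features, Value

# Features maps each column name to its type, as the suggestion says.
features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})
print(features)
```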

@lewtun (Member, Author) commented May 21, 2021

Thanks for the suggestions! I've included them and made a few minor tweaks along the way.

@lhoestq (Member) commented May 21, 2021

Please merge master into this branch to fix the CI; I just fixed the metadata validation tests.

@lewtun lewtun merged commit 74751e3 into huggingface:master May 22, 2021
@lewtun lewtun deleted the improve-info-docs branch May 22, 2021 09:26
JayantGoel001 added a commit to JayantGoel001/datasets-1 that referenced this pull request May 22, 2021
Add args description to DatasetInfo (huggingface#2384)

Development

Successfully merging this pull request may close these issues.

Document DatasetInfo attributes
