
Conversation

@lewtun (Member) commented May 20, 2021

Closes #2354

I am not sure what post_processed and post_processing_size correspond to, so I have left them empty for now. I also took a guess at some of the other fields, like dataset_size vs size_in_bytes, so I might have misunderstood their meaning.

@lhoestq (Member) left a comment

Thanks :) I added some suggestions.

homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset's license.
features (Features, optional): The features used to specify the dataset's columns, types and conversion methods.
post_processed (PostProcessedInfo, optional):

Suggested change:
- post_processed (PostProcessedInfo, optional):
+ post_processed (PostProcessedInfo, optional): Info regarding the resources of a possible post-processing of a dataset. It can contain the information of an index, for example.
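
For readers skimming the thread, a minimal sketch of how this field surfaces in practice; the dataset name is an arbitrary illustrative choice, and on a freshly loaded dataset the field is typically just None:

```python
from datasets import load_dataset

ds = load_dataset("squad", split="validation")  # arbitrary example dataset

# post_processed is None unless the dataset defines a post-processing
# step, e.g. one that attaches a retrieval index to the table.
print(ds.info.post_processed)
```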

splits (dict, optional): The mapping between split name and metadata.
download_checksums (dict, optional): The mapping between the URL to download the dataset's checksums and corresponding metadata.
download_size (int, optional): The size of the compressed dataset in bytes.
post_processing_size (int, optional):

Suggested change:
- post_processing_size (int, optional):
+ post_processing_size (int, optional): Size of the dataset after the post-processing, if any.
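
As a hedged illustration of how these byte-count fields sit side by side on DatasetInfo (dataset name arbitrary; values depend on what the builder recorded):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")  # arbitrary example dataset

# Both fields hold byte counts; post_processing_size stays None unless
# a post-processing step (such as building an index) actually ran.
print(ds.info.download_size)
print(ds.info.post_processing_size)
```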

download_checksums (dict, optional):
download_size (int, optional):
supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset.
builder_name (str, optional): The name of the :class:`GeneratorBasedBuilder` subclass used to create the dataset. Usually matched to the corresponding script name, but with CamelCase instead of snake_case.

Suggested change:
- builder_name (str, optional): The name of the :class:`GeneratorBasedBuilder` subclass used to create the dataset. Usually matched to the corresponding script name, but with CamelCase instead of snake_case.
+ builder_name (str, optional): The name of the :class:`GeneratorBasedBuilder` subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.
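
A quick sketch of what the suggested wording implies, assuming a dataset whose script name is "squad" and whose builder class is Squad:

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")

# Per the suggestion, builder_name is the snake_case version of the
# builder class name and usually matches the dataset script name.
print(ds.info.builder_name)  # expected: "squad"
```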

splits (dict, optional):
download_checksums (dict, optional):
download_size (int, optional):
supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset.

Suggested change:
- supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset.
+ supervised_keys (SupervisedKeysData, optional): Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).
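
An illustrative check of this legacy field; "imdb" is an arbitrary choice, and many datasets simply leave the field unset:

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")  # arbitrary example dataset

# A legacy field from TFDS: when set, it names the input feature and the
# label column for supervised learning; None is a common value.
print(ds.info.supervised_keys)
```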

description (str): A description of the dataset.
citation (str): A BibTeX citation of the dataset.
homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset's license.

Suggested change:
- license (str): The dataset's license.
+ license (str): The dataset's license. It can be the name of the license or a paragraph containing the terms of the license.
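
To make the suggested wording concrete, a small sketch (dataset name illustrative; the value may be a short identifier or a full paragraph of terms):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")

# Per the suggested wording, this may be a short license name or the
# full text of the license terms, depending on the dataset author.
print(ds.info.license)
```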

size_in_bytes (int, optional):
task_templates (List[TaskTemplate], optional):
dataset_size (int, optional): The combined size of the Apache Arrow tables for all splits in bytes.
size_in_bytes (int, optional): The combined size of all files associated with the dataset.

Suggested change:
- size_in_bytes (int, optional): The combined size of all files associated with the dataset.
+ size_in_bytes (int, optional): The combined size of all files associated with the dataset (downloaded files + arrow files).
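
A hedged sketch of the relationship the parenthetical describes; the equality is only approximate and depends on which fields the builder actually filled in:

```python
from datasets import load_dataset

info = load_dataset("squad", split="train").info

# Per the suggested parenthetical, size_in_bytes should roughly equal
# downloaded files + arrow files, i.e. download_size + dataset_size.
print(info.download_size, info.dataset_size, info.size_in_bytes)
```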

version (str or Version, optional): The version of the dataset.
splits (dict, optional): The mapping between split name and metadata.
download_checksums (dict, optional): The mapping between the URL to download the dataset's checksums and corresponding metadata.
download_size (int, optional): The size of the compressed dataset in bytes.

Suggested change:
- download_size (int, optional): The size of the compressed dataset in bytes.
+ download_size (int, optional): The size of the files to download to generate the dataset, in bytes.
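
For completeness, a tiny sketch relating download_size to the download_checksums mapping shown in the context above (dataset name arbitrary):

```python
from datasets import load_dataset

ds = load_dataset("squad", split="train")

# download_checksums maps each downloaded URL to recorded metadata
# (typically num_bytes and a checksum); it can be None if not tracked.
if ds.info.download_checksums:
    for url, meta in list(ds.info.download_checksums.items())[:2]:
        print(url, meta)
print(ds.info.download_size)  # total bytes to download, per the suggestion
```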

citation (str): A BibTeX citation of the dataset.
homepage (str): A URL to the official homepage for the dataset.
license (str): The dataset's license.
features (Features, optional): The features used to specify the dataset's columns, types and conversion methods.

Suggested change:
- features (Features, optional): The features used to specify the dataset's columns, types and conversion methods.
+ features (Features, optional): The features used to specify the dataset's column types.
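
To ground the suggested one-liner, a minimal Features definition mapping column names to types; the columns below are invented for illustration:

```python
from datasets import ClassLabel, Features, Value

# Features maps each column name to its type, as the suggestion says.
features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["neg", "pos"]),
})
print(features)
```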

@lewtun (Member, Author) commented May 21, 2021

Thanks for the suggestions! I've included them and made a few minor tweaks along the way.

@lhoestq (Member) commented May 21, 2021

Please merge master into this branch to fix the CI; I just fixed the metadata validation tests.

@lewtun lewtun merged commit 74751e3 into huggingface:master May 22, 2021
@lewtun lewtun deleted the improve-info-docs branch May 22, 2021 09:26
JayantGoel001 added a commit to JayantGoel001/datasets-1 that referenced this pull request May 22, 2021
Add args description to DatasetInfo (huggingface#2384)

Development

Successfully merging this pull request may close these issues.

Document DatasetInfo attributes
