
Conversation

@bhavitvyamalik
Contributor

I'm updating pretty_name for datasets in the YAML tags as discussed with @lhoestq. Here are the first 10; please let me know if they look good.

If a dataset has a single config, I've added the pretty name as config_name: full_name_of_dataset, since the config names were things like plain_text, default, squad, etc. (not very meaningful in that case). When a dataset has more than one config, I've added config_name: full_name_of_dataset+config_name so the user also knows which config the name refers to.

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 22, 2021

Initially I removed the - since there was only one pretty_name per config, but it turned out to break here in from_yaml_string:

metada_dict[key] = list(set(sum(metada_dict[key].values(), [])))
in /utils/metadata.py
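A minimal illustration (hypothetical data, not the actual datasets code) of why that flattening line forces the - list syntax: sum(..., []) concatenates lists, so every config entry must be a list, and a plain string value breaks it.

```python
# Each config maps to a list, mimicking the `-` YAML syntax.
metadata_dict = {"pretty_name": {"config1": ["Name One"], "config2": ["Name Two"]}}

# The flattening step: concatenate the per-config lists and deduplicate,
# dropping the config keys entirely.
flattened = {key: list(set(sum(value.values(), []))) for key, value in metadata_dict.items()}
# flattened["pretty_name"] now holds both names, with no config information left

# With plain string values (no `-`), the very same expression raises a TypeError,
# because [] + "Name One" is not allowed.
try:
    sum({"config1": "Name One"}.values(), [])
    string_values_work = True
except TypeError:
    string_values_work = False
```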

@gchhablani
Contributor

@lhoestq I guess this will also need some validation?

@lhoestq
Member

lhoestq commented May 28, 2021

Looks like the parser doesn't allow things like

pretty_name:
  config_name1: My awesome config number 1
  config_name2: My amazing config number 2

therefore you had to use - and consider them as a list.

It would be nice to add support for this case in the validator.
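For context, YAML itself has no trouble with that nested mapping; the limitation was in the datasets-side metadata parser/validator. A quick check with PyYAML (assuming PyYAML is installed):

```python
import yaml  # PyYAML, assumed available

snippet = """
pretty_name:
  config_name1: My awesome config number 1
  config_name2: My amazing config number 2
"""

parsed = yaml.safe_load(snippet)
# parsed["pretty_name"] is an ordinary dict mapping config names to strings
```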

There's one thing though: the DatasetMetadata object currently corresponds to the yaml tags that are flattened: the config names are just ignored, and the lists are concatenated.

Therefore I think we would potentially need to instantiate several DatasetMetadata objects: one per config. Otherwise we would end up with a list of several pretty_name while we actually need at most 1 per config.

What do you think @gchhablani ?

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 28, 2021

I was thinking of returning metada_dict (on line 193) whenever load_dataset_card is called (we can pass an extra parameter to from_readme or from_yaml_string for that to happen).

One just needs config_name as the key for the dictionary inside the pretty_name dict, and for a single config there would be only one value to print. We can do this for other fields as well, like size_categories, languages, etc. This would obviate the need to flatten the YAML tags, so that we don't have to instantiate several DatasetMetadata objects. What do you guys think @lhoestq @gchhablani?

Update: I was thinking of returning the whole dictionary before flattening, so that users can access whatever they want for specific configs. Let's say this is my metadata_dict before flattening (the loaded YAML string): instead of validating it and then returning the items individually, we can return it just after loading the YAML string.

@gchhablani
Contributor

gchhablani commented May 28, 2021

Hi @lhoestq @bhavitvyamalik

@bhavitvyamalik, I'm not sure I understand your approach, can you please elaborate? The metadata_dict is flattened before instantiating the object, do you want to remove that? Still confused.

Few things come to my mind after going through this PR. They might not be entirely relevant to the current task, but I'm just trying to think about possible cases and discuss them here.

  1. Instead of instantiating a new DatasetMetadata for each config with flattened tags, why can't we make it more flexible and validate only non-dict items? However, in that case, the types wouldn't be as strict for the class attributes. It would also not work for cases like Dict[str, List[Dict[str, str]]], but I guess that won't be needed anyway in the foreseeable future?

    Ideally, it would be something like: check the metadata tag type (root), do a DFS, find the non-dict objects (leaves), and validate them. Is this overkill for the problem?

  2. For single-config datasets, there can be slightly different validation for pretty_names than for multi-config. The user shouldn't need to provide a config name for single-config datasets, wdyt @bhavitvyamalik @lhoestq? Either way, for multi-config, the validation can use the dictionary keys in the path to that leaf node to verify pretty_names: ... (config) as well. This will check whether the config name is the same as the key (might be unnecessary, but it prevents typos, so less work for the reviewer(s)). For the future, however, it might be beneficial to have something like this.

  3. Should we have a default config name for single-config datasets? People use any string they feel like. I've seen plain_text, default and the dataset name. I've used image for a few datasets myself AFAIR. For smarter validation (again, a future case ;-;), it'd be easier for us to have a simple rule for naming configs in single-config datasets. Wdyt @lhoestq?
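The DFS idea from point 1 could be sketched roughly like this (hypothetical helper names, not the library's API):

```python
# Walk a (possibly nested) tag dict and validate only the non-dict
# leaves, reporting the key path to any offending value.
def validate_leaves(node, is_valid, path=()):
    if isinstance(node, dict):
        errors = []
        for key, child in node.items():
            errors.extend(validate_leaves(child, is_valid, path + (key,)))
        return errors
    # leaf reached: either a plain string or a list of strings
    return [] if is_valid(node) else [path]

# Example leaf rule: a non-empty string, or a list of strings
def is_valid_leaf(value):
    if isinstance(value, str):
        return bool(value.strip())
    return isinstance(value, list) and all(isinstance(v, str) for v in value)
```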

@gchhablani
Contributor

gchhablani commented May 28, 2021

Btw, pretty_names can also be used to handle this during validation :P

-# Dataset Card for [Dataset Name]
+# Dataset Card for Allegro Reviews

This is where DatasetMetadata and ReadMe should be combined. But there are very few overlaps, I guess.

@bhavitvyamalik @lhoestq What about adding a pretty name across all configs, and then config-specific names?

Like

pretty_names:
    all_configs: X (dataset_name)
    config_1: X1 (config_1_name)
    config_2: X2 (config_2_name)

Then, using the metadata_dict, the ReadMe header can be validated against X.
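That header check could be as simple as this hypothetical helper (names are assumptions, not the ReadMe validator's API):

```python
# Check that the card title line matches "# Dataset Card for <pretty name>".
def header_matches_pretty_name(header_line, pretty_name):
    return header_line.strip() == f"# Dataset Card for {pretty_name}"
```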

Sorry if I'm throwing too many ideas at once.

@gchhablani
Contributor

@bhavitvyamalik

Now, I think I better understand what you're saying. So you want to skip validation for the unflattened metadata and just return it? And let the validation run for the flattened version?

@bhavitvyamalik
Contributor Author

Exactly! Validation is important but once the YAML tags are validated I feel we shouldn't do that again while calling load_dataset_card. +1 for default config name for single-config datasets.

@gchhablani
Contributor

gchhablani commented May 29, 2021

@bhavitvyamalik
Actually, I made the ReadMe validation similar to the DatasetMetadata validation, and the class validates the metadata during creation.

Maybe we need to have a separate validation method instead of having it in __post_init__? Wdyt @lhoestq?

I'm sensing too many things to look into. It'd be great to discuss these sometime.

But if this PR is urgent then @bhavitvyamalik's logic seems good to me. It doesn't need major modifications in validation.

@lhoestq
Member

lhoestq commented May 31, 2021

Maybe we need to have a separate validation method instead of having it in __post_init__? Wdyt @lhoestq?

We can definitely have a is_valid() method instead of doing it in the post init.

What about adding a pretty name across all configs, and then config-specific names?

Let's keep things simple to start with. If we can allow both single-config and multi-config cases, it would already be great :)

for single-config:

pretty_name: Allegro Reviews

for multi-config:

pretty_name:
  mrpc: Microsoft Research Paraphrase Corpus (MRPC)
  sst2: Stanford Sentiment Treebank
  ...

To support the multi-config case I see two options:

  1. Don't allow DatasetMetadata to have dictionaries but instead have separate DatasetMetadata objects per config
  2. Allow DatasetMetadata to have dictionaries. This implies removing the flattening step. Then we could get the metadata for a specific config this way, for example:
from datasets import load_dataset_card

glue_dataset_card = load_dataset_card("glue")
print(glue_dataset_card.metadata)
# DatasetMetadata object with dictionaries since there are many configs
print(glue_dataset_card.metadata.get_metadata_for_config("mrpc"))
# DatasetMetadata object with no dictionaries since there are only the mrpc tags

Let me know what you think or if you have other ideas.

@gchhablani
Contributor

I think Option 2 is better.

Just to clarify, will get_metadata_for_config also return common details, like language, say?

@lhoestq
Member

lhoestq commented May 31, 2021

Just to clarify, will get_metadata_for_config also return common details, like language, say?

Yes that would be more convenient IMO. For example a dataset card like this

languages:
- en
pretty_name:
  config1: Pretty Name for Config 1
  config2: Pretty Name for Config 2

then metadata.get_metadata_for_config("config1") would return something like

DatasetMetadata(languages=["en"], pretty_name="Pretty Name for Config 1")
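A minimal sketch of how get_metadata_for_config could combine shared and config-specific tags (a hypothetical implementation that stores the unflattened tags in a plain dict; not the final design):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    tags: dict = field(default_factory=dict)

    def get_metadata_for_config(self, config_name):
        selected = {}
        for key, value in self.tags.items():
            if isinstance(value, dict):
                # config-specific tag: keep only the requested config's entry
                if config_name in value:
                    selected[key] = value[config_name]
            else:
                # shared tag (e.g. languages): keep as-is
                selected[key] = value
        return DatasetMetadata(tags=selected)

card_tags = {
    "languages": ["en"],
    "pretty_name": {"config1": "Pretty Name for Config 1",
                    "config2": "Pretty Name for Config 2"},
}
config1_meta = DatasetMetadata(tags=card_tags).get_metadata_for_config("config1")
# config1_meta.tags keeps the shared languages and the flat pretty_name
```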

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 31, 2021

@lhoestq, should we do this post-processing in load_dataset_card (using the unflattened dictionary returned by DatasetMetadata), or should DatasetMetadata do it itself? There isn't much left to do, I feel, once we have the unflattened dictionary.

@lhoestq
Member

lhoestq commented May 31, 2021

Not sure I understand the difference @bhavitvyamalik , could you elaborate please ?

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 31, 2021

I was talking about this unflattened dictionary:

I was thinking of returning the whole dictionary before flattening so that user can access whatever they want with specific configs. Let's say this is my metadata_dict before flattening (the loaded YAML string), so instead of validating it and then returning the items individually we can return it just after loading the YAML string.

Post-processing meant extracting the config-specific fields from this dictionary and then returning something like languages=["en"], pretty_name="Pretty Name for Config 1".

@lhoestq
Member

lhoestq commented May 31, 2021

I still don't understand what you mean by "returning unflattened dictionary from DatasetMetadata or send this from DatasetMetadata", sorry. Can you give an example or rephrase this ?

IMO load_dataset_card can return a dataset card object with a metadata field. If the metadata isn't flat (i.e. it has several configs), you can get the flat metadata of one specific config with get_metadata_for_config. But of course, if you have better ideas or suggestions, we can discuss this.

@gchhablani
Contributor

gchhablani commented May 31, 2021

@lhoestq, I think he is saying whatever get_metadata_for_config is doing can be done in load_dataset_card by taking the unflattened metadata_dict as input.

@bhavitvyamalik, I think it'd be better to have this "post-processing" in DatasetMetadata instead of load_dataset_card, as @lhoestq has shown. I'll quickly get on that.


Three things that are to be changed in DatasetMetadata:

  1. Allow non-flat elements and validate them.
  2. Create a method to get metadata by config name.
  3. Create a validate() method.

Once that is done, this PR can be updated and reviewed, wdys?
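A rough sketch of step 3, the explicit validate() method replacing validation in __post_init__ (assumed field names and rules, not the final design):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    languages: list = field(default_factory=list)
    pretty_name: object = None  # str for single config, dict for multi-config

    def validate(self):
        """Return a list of validation errors instead of raising at creation."""
        errors = []
        if not self.languages:
            errors.append("languages must not be empty")
        if isinstance(self.pretty_name, dict):
            for config, name in self.pretty_name.items():
                if not isinstance(name, str) or not name:
                    errors.append(f"pretty_name for config '{config}' must be a non-empty string")
        elif not isinstance(self.pretty_name, str):
            errors.append("pretty_name must be a string or a dict of config name -> string")
        return errors
```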

@lhoestq
Member

lhoestq commented Jun 14, 2021

Thanks @gchhablani for the help ! Now that #2436 is merged you can remove the - in the pretty_name @bhavitvyamalik :)


@lhoestq lhoestq left a comment


Thanks :)

I added a few suggestions for the pretty names to try to make them prettier :P

@albertvillanova albertvillanova added the dataset contribution Contribution to a dataset script label Sep 23, 2022
@albertvillanova
Member

Thanks @bhavitvyamalik.

I think this PR was superseded by these others also made by you:

I'm closing this.
