
Conversation

@bhavitvyamalik
Contributor

I'm updating pretty_name for datasets in the YAML tags as discussed with @lhoestq. Here are the first 10; please let me know if they look good.

If a dataset has a single config, I've added the pretty name as config_name: full_name_of_dataset, since the config names were things like plain_text, default, squad, etc. (not very meaningful in that case). When a dataset has more than one config, I've added config_name: full_name_of_dataset+config_name so the user also knows which config the name refers to.

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 22, 2021

Initially I removed the - since there was only one pretty_name per config, but it turned out to break here in from_yaml_string:

metada_dict[key] = list(set(sum(metada_dict[key].values(), [])))
in /utils/metadata.py
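A minimal illustration (hypothetical data, not the actual datasets code) of why that flattening line forces the - list syntax: sum(..., []) concatenates lists, so every config entry must be a list, and a plain string value breaks it.

```python
# Each config maps to a list, mimicking the `-` YAML syntax.
metadata_dict = {"pretty_name": {"config1": ["Name One"], "config2": ["Name Two"]}}

# The flattening step: concatenate the per-config lists and deduplicate,
# dropping the config keys entirely.
flattened = {key: list(set(sum(value.values(), []))) for key, value in metadata_dict.items()}
# flattened["pretty_name"] now holds both names, with no config information left

# With plain string values (no `-`), the very same expression raises a TypeError,
# because [] + "Name One" is not allowed.
try:
    sum({"config1": "Name One"}.values(), [])
    string_values_work = True
except TypeError:
    string_values_work = False
```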

@gchhablani
Contributor

@lhoestq I guess this will also need some validation?

@lhoestq
Member

lhoestq commented May 28, 2021

Looks like the parser doesn't allow things like

pretty_name:
  config_name1: My awesome config number 1
  config_name2: My amazing config number 2

therefore you had to use - and consider them as a list.

It would be nice to add support for this case in the validator.
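For context, YAML itself has no trouble with that nested mapping; the limitation was in the datasets-side metadata parser/validator. A quick check with PyYAML (assuming PyYAML is installed):

```python
import yaml  # PyYAML, assumed available

snippet = """
pretty_name:
  config_name1: My awesome config number 1
  config_name2: My amazing config number 2
"""

parsed = yaml.safe_load(snippet)
# parsed["pretty_name"] is an ordinary dict mapping config names to strings
```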

There's one thing though: the DatasetMetadata object currently corresponds to the yaml tags that are flattened: the config names are just ignored, and the lists are concatenated.

Therefore I think we would potentially need to instantiate several DatasetMetadata objects: one per config. Otherwise we would end up with a list of several pretty_name while we actually need at most 1 per config.

What do you think @gchhablani ?

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 28, 2021

I was thinking of returning metada_dict (on line 193) whenever load_dataset_card is called (we can pass an extra parameter to from_readme or from_yaml_string for that to happen).

One just needs config_name as the key for the dictionary inside the pretty_name dict, and for a single config there would be only one value to print. We can do this for other fields as well, like size_categories, languages, etc. This would obviate the need to flatten the YAML tags, so that we don't have to instantiate several DatasetMetadata objects. What do you guys think @lhoestq @gchhablani?

Update: I was thinking of returning the whole dictionary before flattening, so that users can access whatever they want for specific configs. Let's say this is my metadata_dict before flattening (the loaded YAML string): instead of validating it and then returning the items individually, we can return it just after loading the YAML string.

@gchhablani
Contributor

gchhablani commented May 28, 2021

Hi @lhoestq @bhavitvyamalik

@bhavitvyamalik, I'm not sure I understand your approach, can you please elaborate? The metadata_dict is flattened before instantiating the object, do you want to remove that? Still confused.

Few things come to my mind after going through this PR. They might not be entirely relevant to the current task, but I'm just trying to think about possible cases and discuss them here.

  1. Instead of instantiating a new DatasetMetadata for each config with flattened tags, why can't we make it more flexible and validate only non-dict items? However, in that case, the types wouldn't be as strict for the class attributes. It would also not work for cases like Dict[str, List[Dict[str, str]]], but I guess that won't be needed anyway in the foreseeable future?

    Ideally, it would be something like: check the metadata tag type (root), do a DFS, find the non-dict objects (leaves), and validate them. Is this overkill for the problem?

  2. For single-config datasets, there can be slightly different validation for pretty_names than for multi-config. The user shouldn't need to provide a config name for single-config datasets, wdyt @bhavitvyamalik @lhoestq? Either way, for multi-config, the validation can use the dictionary keys in the path to that leaf node to verify pretty_names: ... (config) as well. This will check whether the config name is the same as the key (might be unnecessary, but it prevents typos, so less work for the reviewer(s)). For the future, however, it might be beneficial to have something like this.

  3. Should we have a default config name for single-config datasets? People use any string they feel like. I've seen plain_text, default and the dataset name. I've used image for a few datasets myself AFAIR. For smarter validation (again, a future case ;-;), it'd be easier for us to have a simple rule for naming configs in single-config datasets. Wdyt @lhoestq?
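The DFS idea from point 1 could be sketched roughly like this (hypothetical helper names, not the library's API):

```python
# Walk a (possibly nested) tag dict and validate only the non-dict
# leaves, reporting the key path to any offending value.
def validate_leaves(node, is_valid, path=()):
    if isinstance(node, dict):
        errors = []
        for key, child in node.items():
            errors.extend(validate_leaves(child, is_valid, path + (key,)))
        return errors
    # leaf reached: either a plain string or a list of strings
    return [] if is_valid(node) else [path]

# Example leaf rule: a non-empty string, or a list of strings
def is_valid_leaf(value):
    if isinstance(value, str):
        return bool(value.strip())
    return isinstance(value, list) and all(isinstance(v, str) for v in value)
```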

@gchhablani
Contributor

gchhablani commented May 28, 2021

Btw, pretty_names can also be used to handle this during validation :P

-# Dataset Card for [Dataset Name]
+# Dataset Card for Allegro Reviews

This is where DatasetMetadata and ReadMe should be combined. But there are very few overlaps, I guess.

@bhavitvyamalik @lhoestq What about adding a pretty name across all configs, and then config-specific names?

Like

pretty_names:
    all_configs: X (dataset_name)
    config_1: X1 (config_1_name)
    config_2: X2 (config_2_name)

Then, using the metadata_dict, the ReadMe header can be validated against X.
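That header check could be as simple as this hypothetical helper (names are assumptions, not the ReadMe validator's API):

```python
# Check that the card title line matches "# Dataset Card for <pretty name>".
def header_matches_pretty_name(header_line, pretty_name):
    return header_line.strip() == f"# Dataset Card for {pretty_name}"
```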

Sorry if I'm throwing too many ideas at once.

@gchhablani
Contributor

@bhavitvyamalik

Now, I think I better understand what you're saying. So you want to skip validation for the unflattened metadata and just return it? And let the validation run for the flattened version?

@bhavitvyamalik
Contributor Author

Exactly! Validation is important but once the YAML tags are validated I feel we shouldn't do that again while calling load_dataset_card. +1 for default config name for single-config datasets.

@gchhablani
Contributor

gchhablani commented May 29, 2021

@bhavitvyamalik
Actually, I made the ReadMe validation similar to the DatasetMetadata validation, and the class validates the metadata during creation.

Maybe we need to have a separate validation method instead of having it in __post_init__? Wdyt @lhoestq?

I'm sensing too many things to look into. It'd be great to discuss these sometime.

But if this PR is urgent then @bhavitvyamalik's logic seems good to me. It doesn't need major modifications in validation.

@lhoestq
Member

lhoestq commented May 31, 2021

Maybe we need to have a separate validation method instead of having it in __post_init__? Wdyt @lhoestq?

We can definitely have a is_valid() method instead of doing it in the post init.

What about adding a pretty name across all configs, and then config-specific names?

Let's keep things simple to start with. If we can allow both single-config and multi-config cases, it would already be great :)

for single-config:

pretty_name: Allegro Reviews

for multi-config:

pretty_name:
  mrpc: Microsoft Research Paraphrase Corpus (MRPC)
  sst2: Stanford Sentiment Treebank
  ...

To support the multi-config case I see two options:

  1. Don't allow DatasetMetadata to have dictionaries but instead have separate DatasetMetadata objects per config
  2. Allow DatasetMetadata to have dictionaries. This implies removing the flattening step. Then we could get the metadata for a specific config this way, for example:
from datasets import load_dataset_card

glue_dataset_card = load_dataset_card("glue")
print(glue_dataset_card.metadata)
# DatasetMetadata object with dictionaries since there are many configs
print(glue_dataset_card.metadata.get_metadata_for_config("mrpc"))
# DatasetMetadata object with no dictionaries since there are only the mrpc tags

Let me know what you think or if you have other ideas.

@gchhablani
Contributor

I think Option 2 is better.

Just to clarify, will get_metadata_for_config also return common details, like language, say?

@lhoestq
Member

lhoestq commented May 31, 2021

Just to clarify, will get_metadata_for_config also return common details, like language, say?

Yes that would be more convenient IMO. For example a dataset card like this

languages:
- en
pretty_name:
  config1: Pretty Name for Config 1
  config2: Pretty Name for Config 2

then metadata.get_metadata_for_config("config1") would return something like

DatasetMetadata(languages=["en"], pretty_name="Pretty Name for Config 1")
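A minimal sketch of how get_metadata_for_config could combine shared and config-specific tags (a hypothetical implementation that stores the unflattened tags in a plain dict; not the final design):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    tags: dict = field(default_factory=dict)

    def get_metadata_for_config(self, config_name):
        selected = {}
        for key, value in self.tags.items():
            if isinstance(value, dict):
                # config-specific tag: keep only the requested config's entry
                if config_name in value:
                    selected[key] = value[config_name]
            else:
                # shared tag (e.g. languages): keep as-is
                selected[key] = value
        return DatasetMetadata(tags=selected)

card_tags = {
    "languages": ["en"],
    "pretty_name": {"config1": "Pretty Name for Config 1",
                    "config2": "Pretty Name for Config 2"},
}
config1_meta = DatasetMetadata(tags=card_tags).get_metadata_for_config("config1")
# config1_meta.tags keeps the shared languages and the flat pretty_name
```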

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 31, 2021

@lhoestq, should we do this post-processing in load_dataset_card (using the unflattened dictionary returned by DatasetMetadata), or should DatasetMetadata do it itself? There isn't much left to do, I feel, once we have the unflattened dictionary.

@lhoestq
Member

lhoestq commented May 31, 2021

Not sure I understand the difference @bhavitvyamalik , could you elaborate please ?

@bhavitvyamalik
Contributor Author

bhavitvyamalik commented May 31, 2021

I was talking about this unflattened dictionary:

I was thinking of returning the whole dictionary before flattening so that user can access whatever they want with specific configs. Let's say this is my metadata_dict before flattening (the loaded YAML string), so instead of validating it and then returning the items individually we can return it just after loading the YAML string.

Post-processing meant extracting the config-specific fields from this dictionary and then returning something like languages=["en"], pretty_name="Pretty Name for Config 1".

@lhoestq
Member

lhoestq commented May 31, 2021

I still don't understand what you mean by "returning unflattened dictionary from DatasetMetadata or send this from DatasetMetadata", sorry. Can you give an example or rephrase this ?

IMO load_dataset_card can return a dataset card object with a metadata field. If the metadata isn't flat (i.e. it has several configs), you can get the flat metadata of one specific config with get_metadata_for_config. But of course, if you have better ideas or suggestions, we can discuss this.

@gchhablani
Contributor

gchhablani commented May 31, 2021

@lhoestq, I think he is saying whatever get_metadata_for_config is doing can be done in load_dataset_card by taking the unflattened metadata_dict as input.

@bhavitvyamalik, I think it'd be better to have this "post-processing" in DatasetMetadata instead of load_dataset_card, as @lhoestq has shown. I'll quickly get on that.


Three things that are to be changed in DatasetMetadata:

  1. Allow non-flat elements and validate them.
  2. Create a method to get metadata by config name.
  3. Create a validate() method.

Once that is done, this PR can be updated and reviewed, wdys?
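A rough sketch of step 3, the explicit validate() method replacing validation in __post_init__ (assumed field names and rules, not the final design):

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    languages: list = field(default_factory=list)
    pretty_name: object = None  # str for single config, dict for multi-config

    def validate(self):
        """Return a list of validation errors instead of raising at creation."""
        errors = []
        if not self.languages:
            errors.append("languages must not be empty")
        if isinstance(self.pretty_name, dict):
            for config, name in self.pretty_name.items():
                if not isinstance(name, str) or not name:
                    errors.append(f"pretty_name for config '{config}' must be a non-empty string")
        elif not isinstance(self.pretty_name, str):
            errors.append("pretty_name must be a string or a dict of config name -> string")
        return errors
```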

@lhoestq
Member

lhoestq commented Jun 14, 2021

Thanks @gchhablani for the help ! Now that #2436 is merged you can remove the - in the pretty_name @bhavitvyamalik :)


@lhoestq lhoestq left a comment


Thanks :)

I added a few suggestions for the pretty names to try to make them prettier :P

@albertvillanova albertvillanova added the dataset contribution Contribution to a dataset script label Sep 23, 2022
@albertvillanova
Member

Thanks @bhavitvyamalik.

I think this PR was superseded by these others also made by you:

I'm closing this.
