pretty_name for dataset in YAML tags
#2395
|
Initially I removed the datasets/src/datasets/utils/metadata.py Line 197 in 74751e3
/utils/metadata.py
|
|
@lhoestq I guess this will also need some validation? |
|
Looks like the parser doesn't allow things like therefore you had to use It would be nice to add support for this case in the validator. There's one thing though: the DatasetMetadata object currently corresponds to the YAML tags after flattening: the config names are just ignored, and the lists are concatenated. Therefore I think we would potentially need to instantiate several What do you think @gchhablani ? |
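For readers following the thread, the flattening described above (config names ignored, lists concatenated) can be sketched roughly as follows. This is only an illustration on plain dictionaries: `flatten_tags` and the sample tags are hypothetical names made up for this note, not the actual code in `metadata.py`.

```python
from typing import Any, Dict, List


def flatten_tags(tags: Dict[str, Any]) -> Dict[str, List[str]]:
    """Hypothetical helper: drop config names and concatenate per-config lists."""
    flat: Dict[str, List[str]] = {}
    for field, value in tags.items():
        if isinstance(value, dict):
            # Per-config values: the config names are ignored and the
            # lists are concatenated in the order the configs appear.
            merged: List[str] = []
            for config_values in value.values():
                merged.extend(config_values)
            flat[field] = merged
        else:
            # Already a flat list shared by every config.
            flat[field] = list(value)
    return flat


# Sample tags invented for illustration (assumption, not from a real card).
tags = {
    "languages": ["en"],
    "task_categories": {
        "config1": ["text-classification"],
        "config2": ["question-answering"],
    },
}
flat = flatten_tags(tags)
print(flat)
```

A single string per config (such as a pretty name) does not fit this scheme, since there is no meaningful way to concatenate names, which is why the multi-config case needs separate handling.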
|
I was thinking of returning One just needs config_name as key for the dictionary inside Update: I was thinking of returning the whole dictionary before flattening so that user can access whatever they want with specific configs. Let's say this is my |
|
@bhavitvyamalik, I'm not sure I understand your approach, can you please elaborate? The Few things come to my mind after going through this PR. They might not be entirely relevant to the current task, but I'm just trying to think about possible cases and discuss them here.
|
|
Btw, This is where @bhavitvyamalik @lhoestq What about adding a pretty name across all configs, and then config-specific names? Like
pretty_names:
  all_configs: X (dataset_name)
  config_1: X1 (config_1_name)
  config_2: X2 (config_2_name)
Then, using the Sorry if I'm throwing too many ideas at once. |
|
Now, I think I better understand what you're saying. So you want to skip validation for the unflattened metadata and just return it? And let the validation run for the flattened version? |
|
Exactly! Validation is important but once the YAML tags are validated I feel we shouldn't do that again while calling |
|
@bhavitvyamalik Maybe we need to have a separate validation method instead of having it in I'm sensing too many things to look into. It'd be great to discuss these sometime. But if this PR is urgent then @bhavitvyamalik's logic seems good to me. It doesn't need major modifications in validation. |
We can definitely have a
Let's keep things simple to start with. If we can allow both single-config and multi-config cases it would already be great :)
for single-config:
pretty_name: Allegro Reviews
for multi-config:
pretty_name:
  mrpc: Microsoft Research Paraphrase Corpus (MRPC)
  sst2: Stanford Sentiment Treebank
  ...
To support the multi-config case I see two options:
from datasets import load_dataset_card
glue_dataset_card = load_dataset_card("glue")
print(glue_dataset_card.metadata)
# DatasetMetadata object with dictionaries since there are many configs
print(glue_dataset_card.metadata.get_metadata_for_config("mrpc"))
# DatasetMetadata object with no dictionaries since there are only the mrpc tags
Let me know what you think or if you have other ideas. |
|
I think Option 2 is better. Just to clarify, will |
Yes that would be more convenient IMO. For example a dataset card like this
languages:
- en
pretty_name:
  config1: Pretty Name for Config 1
  config3: Pretty Name for Config 2
then DatasetMetadata(languages=["en"], pretty_name="Pretty Name for Config 1") |
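The per-config lookup discussed here could look something like the sketch below. It works on a plain dictionary instead of a `DatasetMetadata` object, and `metadata_for_config` is a hypothetical stand-in for the proposed `get_metadata_for_config`, whose real behaviour was still under discussion at this point in the thread.

```python
from typing import Any, Dict


def metadata_for_config(metadata: Dict[str, Any], config_name: str) -> Dict[str, Any]:
    """Hypothetical sketch: resolve per-config fields for one config."""
    resolved: Dict[str, Any] = {}
    for field, value in metadata.items():
        if isinstance(value, dict):
            # Field keyed by config name: keep only the requested config's value.
            if config_name in value:
                resolved[field] = value[config_name]
        else:
            # Field shared by all configs: keep it unchanged.
            resolved[field] = value
    return resolved


# Mirrors the dataset card example from the comment above.
card_metadata = {
    "languages": ["en"],
    "pretty_name": {
        "config1": "Pretty Name for Config 1",
        "config3": "Pretty Name for Config 2",
    },
}
config1_metadata = metadata_for_config(card_metadata, "config1")
print(config1_metadata)
```

In this sketch a config that has no entry for a per-config field simply ends up without that field; whether that should be an error instead is a design choice the thread does not settle.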
|
@lhoestq, should we do this post-processing in |
|
Not sure I understand the difference @bhavitvyamalik , could you elaborate please ? |
|
I was talking about this unflattened dictionary:
Post-processing meant extracting config-specific fields from this dictionary and then return this |
|
I still don't understand what you mean by "returning unflattened dictionary from DatasetMetadata or send this from DatasetMetadata", sorry. Can you give an example or rephrase this ? IMO load_dataset_card can return a dataset card object with a metadata field. If the metadata isn't flat (i.e. it has several configs), you can get the flat metadata of 1 specific config with |
|
@lhoestq, I think he is saying whatever @bhavitvyamalik, I think it'd be better to have this "post-processing" in Three things that are to be changed in
Once that is done, this PR can be updated and reviewed, wdys? |
|
Thanks @gchhablani for the help ! Now that #2436 is merged you can remove the |
lhoestq
left a comment
Thanks :)
I added a few suggestions for the pretty names to try to make them prettier :P
|
Thanks @bhavitvyamalik. I think this PR was superseded by these others also made by you: I'm closing this. |
I'm updating pretty_name for datasets in YAML tags as discussed with @lhoestq. Here are the first 10, please let me know if they're looking good.
If dataset has 1 config, I've added pretty_name as config_name: full_name_of_dataset as config names were plain_text, default, squad etc (not so important in this case) whereas when dataset has >1 configs, I've added config_name: full_name_of_dataset+config_name so as to let user know about the config here.
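The naming convention in the description can be illustrated with a small sketch. The helper `build_pretty_names` is hypothetical, and the way the config name is appended (in parentheses) is an assumption made here for illustration; the description only says the config name is added to the full dataset name.

```python
from typing import Dict, List


def build_pretty_names(full_name: str, config_names: List[str]) -> Dict[str, str]:
    """Hypothetical sketch of the pretty_name convention described above."""
    if len(config_names) == 1:
        # Single config: names like "plain_text" or "default" carry no
        # information, so the pretty name is just the dataset's full name.
        return {config_names[0]: full_name}
    # Several configs: append the config name so users can tell them apart.
    # The "(config)" suffix format is an assumption, not taken from the PR.
    return {name: f"{full_name} ({name})" for name in config_names}


single = build_pretty_names("Allegro Reviews", ["default"])
multi = build_pretty_names("Example Dataset", ["config_a", "config_b"])
print(single)
print(multi)
```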