Skip to content

Conversation

@albertvillanova
Copy link
Member

Currently, when mirroring datasets on the Hub, the license tags are hacked: removed of characters "." and "$". On the contrary, this hacking is not applied to community datasets on the Hub. This generates multiple variants of the same tag on the Hub.

I guess this hacking is no longer necessary:

  • it is not applied to community datasets
  • all canonical datasets are validated by maintainers before being merged: CI + maintainers make sure license tags are the right ones

Fix #4298.

@albertvillanova albertvillanova requested a review from lhoestq May 10, 2022 05:52
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 10, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Copy link
Member

lhoestq commented May 10, 2022

The Hub doesn't allow these characters in the YAML tags, and git push fails if you want to push a dataset card containing these characters.

Comment on lines -109 to -113
if (dataset_repo_path / "README.md").is_file():
with (dataset_repo_path / "README.md").open() as f:
readme_content = f.read()
if readme_content.count("---\n") > 1:
_, tags, content = readme_content.split("---\n", 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI there is a function in huggingface_hub that does that

Copy link
Member

@julien-c julien-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code removed by this PR was pretty heavy handed (it should have affected only the keys not all strings).

i would advocate once again to rename the config names that contain a . (web_nlg and a very small number of other ones) and accept this PR

@lhoestq
Copy link
Member

lhoestq commented May 11, 2022

Ok, let me rename the bad config names :) I think I can also keep backward compatibility with a warning

@lhoestq
Copy link
Member

lhoestq commented May 12, 2022

Almost done with it btw, will submit a PR that shows all the configuration name changes (from a bit more than 20 datasets)

@albertvillanova
Copy link
Member Author

Please, let me know when the renaming of configs is done. If not enough bandwidth, I can take care of it...

@lhoestq
Copy link
Member

lhoestq commented May 17, 2022

Will focus on this this afternoon ;)

@lhoestq
Copy link
Member

lhoestq commented May 18, 2022

I realized when renaming all the configurations with dots in #4365 that it's not ideal for certain cases. For example:

  • many configurations have a version like "1.0.0" in their names
  • to avoid breaking changes we need to replace dots with underscores in the user input and show a warning, which hurts the experience
  • our second most downloaded dataset at the moment is affected: newsgroup
  • if we disallow dots, then we'll never be able to make the allenai/c4 work with its different configurations since they contain dots, and we can't rename them because they are the official download links

I was thinking of other alternatives:

  1. just stop separating tags per config name completely, and have a single flat YAML for all configurations. Dataset search doesn't use this info anyway
  2. use another YAML structure to avoid having config names as keys, such as
languages:
- config: 20220301_en
  values:
  - en

I'm down for 1, to keep things simple

@albertvillanova
Copy link
Member Author

@lhoestq I agree:

  • better not changing config names (so that we do not introduce any braking change)
  • therefore, we should not use them as keys

In relation with the proposed solutions, I have no strong opinion:

  • option 1 is simpler and aligns better with current usage on the Hub (configs are ignored)
  • however:
    • we will lose all the information per config we already have (for those datasets containing config keys; contributors made an effort to put that information per config)
    • and this information might be useful on the Hub in the future, in case we would like to enrich the search feature with more granularity; this is only applicable if this feature could eventually make sense

So, no strong opinion...

@lhoestq
Copy link
Member

lhoestq commented May 20, 2022

Closing in favor of #4367

@lhoestq lhoestq closed this May 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Normalise license names

4 participants