Remove hacking license tags when mirroring datasets on the Hub #4302
The documentation is not available anymore as the PR was closed or merged. |
|
The Hub doesn't allow these characters in the YAML tags, and git push fails if you want to push a dataset card containing these characters. |
    if (dataset_repo_path / "README.md").is_file():
        with (dataset_repo_path / "README.md").open() as f:
            readme_content = f.read()
        if readme_content.count("---\n") > 1:
            _, tags, content = readme_content.split("---\n", 2)
FYI there is a function in huggingface_hub that does that
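For readers following the diff above, the split it performs can be sketched with the standard library alone. This is only an illustration of the heuristic shown in the diff, not the huggingface_hub function mentioned in the comment; the name `split_card` and the sample card text are hypothetical.

```python
from typing import Tuple


def split_card(readme_content: str) -> Tuple[str, str]:
    """Return (yaml_block, body); yaml_block is "" when no front matter is found.

    Hypothetical helper mirroring the heuristic in the diff: the card is only
    split when at least two "---\n" delimiters are present.
    """
    if readme_content.count("---\n") > 1:
        # Text before the first delimiter is discarded, as in the diff.
        _, tags, content = readme_content.split("---\n", 2)
        return tags, content
    return "", readme_content


# Made-up card content, for illustration only.
card = "---\nlicense:\n- cc-by-4.0\n---\n# Dataset Card\n"
tags, body = split_card(card)
```

Note that this heuristic counts every `---\n` in the file, so a horizontal rule in the card body could also trigger the split; a dedicated parser would likely be more robust.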
julien-c left a comment:
The code removed by this PR was pretty heavy-handed (it should have affected only the keys, not all strings).
I would once again advocate renaming the config names that contain a "." (web_nlg and a very small number of others) and accepting this PR.
|
Ok, let me rename the bad config names :) I think I can also keep backward compatibility with a warning |
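A minimal sketch of what "backward compatibility with a warning" could look like. Everything here is an assumption for illustration: the function name `resolve_config_name`, the choice of replacing "." with "_", and the sample config name. The actual renames are the ones submitted in the follow-up PR.

```python
import warnings


def resolve_config_name(name: str) -> str:
    """Map a legacy config name containing "." to a dot-free name.

    Hypothetical helper: the "." -> "_" rule is an assumption, not the
    rule used by the library.
    """
    if "." in name:
        new_name = name.replace(".", "_")
        # Keep old names working, but tell the user about the new one.
        warnings.warn(
            f"Config name '{name}' is deprecated, use '{new_name}' instead.",
            FutureWarning,
        )
        return new_name
    return name
```

With this approach, existing scripts that pass an old name keep working while users are nudged toward the new name.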
|
Almost done with it btw, will submit a PR that shows all the configuration name changes (from a bit more than 20 datasets) |
|
Please, let me know when the renaming of configs is done. If not enough bandwidth, I can take care of it... |
|
Will focus on this this afternoon ;) |
|
I realized when renaming all the configurations with dots in #4365 that it's not ideal for certain cases. For example:
I was thinking of other alternatives:
languages:
- config: 20220301_en
  values:
  - en

I'm down for 1, to keep things simple |
|
@lhoestq I agree:
Regarding the proposed solutions, I have no strong opinion:
So, no strong opinion... |
|
Closing in favor of #4367 |
Currently, when mirroring datasets on the Hub, the license tags are hacked: the characters "." and "$" are removed from them. By contrast, this hacking is not applied to community datasets on the Hub. As a result, the same tag ends up with multiple variants on the Hub.
I guess this hacking is no longer necessary:
Fix #4298.
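To make the problem in the description concrete, here is a small sketch of how stripping "." and "$" yields a second variant of the same license tag. The function name `strip_tag_characters` is hypothetical and the tag value is only an example; the sketch just applies the character removal the description mentions.

```python
def strip_tag_characters(tag: str) -> str:
    """Remove "." and "$" from a tag (hypothetical stand-in for the mirroring hack)."""
    return tag.replace(".", "").replace("$", "")


original = "cc-by-4.0"                      # tag as a community dataset would keep it
mirrored = strip_tag_characters(original)   # tag after the mirroring hack
variants = {original, mirrored}             # two distinct strings for one license
```

Because the stripped and unstripped strings differ, the Hub would see them as two separate tags, which is the duplication this PR removes.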