@leondz
Contributor

@leondz leondz commented May 17, 2022

Checksum update to udhr for issue #4361

@mariosasko mariosasko linked an issue May 18, 2022 that may be closed by this pull request

@albertvillanova
Member

Thanks for contributing @leondz.

The checksums of the files have changed because more languages have been added:

  • the new language codes need to be added to the dataset card (README file)
  • I think the dataset version number should also be increased, so that users who had previously cached the dataset get a fresh download (with the additional languages)
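
For readers unfamiliar with the failure behind this PR, a checksum mismatch of this kind can be illustrated with a minimal sketch. It assumes the recorded metadata is a SHA-256 digest plus a byte count for each source file; the helper names `file_checksum` and `matches_recorded` are hypothetical and are not part of the `datasets` API.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as handle:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_recorded(path, recorded_sha256, recorded_size):
    """True if the file still matches the recorded checksum and size."""
    sha256, size = file_checksum(path)
    return sha256 == recorded_sha256 and size == recorded_size
```

When the upstream archive gains new languages, both the digest and the size change, so a check like `matches_recorded` starts returning `False` until the recorded values are regenerated.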

@leondz
Contributor Author

leondz commented May 19, 2022

Yep! All done (I also fixed the language tags in the README, which were ISO 639-3 instead of the expected BCP 47)

@leondz
Contributor Author

leondz commented May 20, 2022

I guess the language code CI failure is due to languages.json being a subset of BCP 47 (see issue #4304). Happy to contribute a solution here, e.g. autogenerating the language list from the relevant ISO standards and the IETF BCP 47 subtag registry, or full code for validation.

Member

@albertvillanova albertvillanova left a comment

Thanks again for your contribution, @leondz.

Yes, I think it is OK to set version 1.0.0 (as the previous one was 0.0.0).

One of the CI failures is related to dummy data: once you have updated the dataset version, the dummy_data ZIP file should be moved from "dummy/0.0.0/dummy_data.zip" to "dummy/1.0.0/dummy_data.zip".
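
The move described above can be sketched as a small helper. This is only an illustration: the function name `move_dummy_data` and the dataset directory passed to it are assumptions, and in a git checkout a plain `git mv` achieves the same result while also staging the change.

```python
import shutil
from pathlib import Path


def move_dummy_data(dataset_dir, old_version, new_version):
    """Move dummy/<old_version>/dummy_data.zip to dummy/<new_version>/."""
    dummy_root = Path(dataset_dir) / "dummy"
    source = dummy_root / old_version / "dummy_data.zip"
    target = dummy_root / new_version / "dummy_data.zip"
    if not source.is_file():
        raise FileNotFoundError(f"no dummy data at {source}")
    # Create dummy/<new_version>/ if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    # Remove the old version directory if nothing else is left in it.
    if not any(source.parent.iterdir()):
        source.parent.rmdir()
    return target
```

A call such as `move_dummy_data("datasets/udhr", "0.0.0", "1.0.0")` would perform the move for this PR, where `datasets/udhr` is an assumed location of the dataset folder.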

The other CI failure is related to missing languages in our resources file. This has been addressed in this PR:

You should merge the master branch into your feature branch to incorporate that fix.

@leondz
Contributor Author

leondz commented May 20, 2022

> Thanks again for your contribution, @leondz.
>
> Yes, I think it is OK to set version 1.0.0 (as the previous one was 0.0.0).
>
> One of the CI failures is related to dummy data: once you have updated the dataset version, the dummy_data ZIP file should be moved from "dummy/0.0.0/dummy_data.zip" to "dummy/1.0.0/dummy_data.zip".

Oh, thanks, I missed that one

> The other CI failure is related to missing languages in our resources file. This has been addressed in this PR:
>
> You should merge the master branch into your feature branch to incorporate that fix.

Yeah, I saw this :) I already have the merge, thanks. I'm talking about the longer-term picture: every time another language code comes up (e.g. da-bornholm or es-VE), the JSON file will need updating, because the current approach is non-exhaustive manual whitelisting instead of relying on the established BCP 47 standard.
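
As a rough illustration of rule-based validation in place of a whitelist, the sketch below checks whether a tag is well-formed under a simplified reading of the BCP 47 grammar: language, then optional script, region, and variants. It is an assumption-laden sketch, not the validation used by the repository's CI. It ignores extended language subtags, extensions, private-use and grandfathered tags, and it does not check that each subtag is actually registered. The name `is_well_formed` is hypothetical.

```python
import re

# Simplified langtag: language[-script][-region][-variant]*
# Extensions, private-use subtags and grandfathered tags are not handled.
_LANGTAG = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3})                  # e.g. "es", "da"
    (?:-(?P<script>[A-Za-z]{4}))?                # e.g. "Hant"
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?       # e.g. "VE", "419"
    (?P<variants>
        (?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*   # e.g. "bornholm"
    )
    """,
    re.VERBOSE,
)


def is_well_formed(tag):
    """True if `tag` matches the simplified language-tag grammar above."""
    return _LANGTAG.fullmatch(tag) is not None
```

Under this sketch, both examples from the comment above (`da-bornholm` and `es-VE`) pass without any list needing an update, while a tag such as `es_VE` is rejected. Checking that subtags are registered would additionally require the IANA subtag registry data.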

Member

@albertvillanova albertvillanova left a comment

Hi @leondz,

Could you please merge the master branch to see if the tests pass?

git fetch upstream master
git merge upstream/master
git push

Member

@albertvillanova albertvillanova left a comment

Thank you!

@albertvillanova albertvillanova merged commit 8a95aa1 into huggingface:master Jun 8, 2022
Development

Successfully merging this pull request may close these issues.

udhr doesn't load, dataset checksum mismatch