Closed
Labels
bug (Something isn't working)
Description
Describe the bug
Loading the udhr dataset fails with a checksum mismatch on its source files. It looks like both source files on unicode.org have changed:
Size and checksum recorded in the datasets repo:
(hfdev) leon@blade:~/datasets/datasets/udhr$ jq .default.download_checksums < dataset_infos.json
{
  "https://unicode.org/udhr/assemblies/udhr_xml.zip": {
    "num_bytes": 2273633,
    "checksum": "0565fa62c2ff155b84123198bcc967edd8c5eb9679eadc01e6fb44a5cf730fee"
  },
  "https://unicode.org/udhr/assemblies/udhr_txt.zip": {
    "num_bytes": 2107471,
    "checksum": "087b474a070dd4096ae3028f9ee0b30dcdcb030cc85a1ca02e143be46327e5e5"
  }
}
Size and checksum regenerated from the current source files:
(hfdev) leon@blade:~/datasets/datasets/udhr$ rm dataset_infos.json
(hfdev) leon@blade:~/datasets/datasets/udhr$ datasets-cli test --save_infos udhr.py
Using custom data configuration default
Testing builder 'default' (1/1)
Downloading and preparing dataset udhn/default (download: 4.18 MiB, generated: 6.15 MiB, post-processed: Unknown size, total: 10.33 MiB) to /home/leon/.cache/huggingface/datasets/udhn/default/0.0.0/ad74b91fa2b3c386e5751b0c52bdfda76d334f76731142fd432d4acc2e2fde66...
Dataset udhn downloaded and prepared to /home/leon/.cache/huggingface/datasets/udhn/default/0.0.0/ad74b91fa2b3c386e5751b0c52bdfda76d334f76731142fd432d4acc2e2fde66. Subsequent calls will reuse this data.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 686.69it/s]
Dataset Infos file saved at dataset_infos.json
Test successful.
(hfdev) leon@blade:~/datasets/datasets/udhr$ jq .default.download_checksums < dataset_infos.json
{
  "https://unicode.org/udhr/assemblies/udhr_xml.zip": {
    "num_bytes": 2389690,
    "checksum": "a3350912790196c6e1b26bfd1c8a50e8575f5cf185922ecd9bd15713d7d21438"
  },
  "https://unicode.org/udhr/assemblies/udhr_txt.zip": {
    "num_bytes": 2215441,
    "checksum": "cb87ecb25b56f34e4fd6f22b323000524fd9c06ae2a29f122b048789cf17e9fe"
  }
}
(hfdev) leon@blade:~/datasets/datasets/udhr$
Is unicode.org a sustainable hosting solution for this dataset?
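The size and checksum of a manually downloaded archive can also be checked without datasets-cli. The following is a minimal sketch, assuming the "checksum" field is a SHA-256 hex digest (its 64 hex characters are consistent with that); the helper name `size_and_sha256` is hypothetical and not part of the datasets library.

```python
import hashlib
import os
import tempfile


def size_and_sha256(path, chunk_size=1 << 20):
    """Return (num_bytes, SHA-256 hex digest) for the file at `path`.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    digest = hashlib.sha256()
    num_bytes = 0
    with open(path, "rb") as f:
        # Read in chunks so a large archive is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            num_bytes += len(chunk)
    return num_bytes, digest.hexdigest()


# Demo on a throwaway file. For the real check, pass the path of a manually
# downloaded udhr_xml.zip or udhr_txt.zip instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"placeholder bytes, not the real archive")
    demo_path = tmp.name
try:
    print(size_and_sha256(demo_path))
finally:
    os.remove(demo_path)
```

Comparing the returned pair against the "num_bytes" and "checksum" entries in dataset_infos.json shows whether a given download matches the recorded values.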
Steps to reproduce the bug
from datasets import load_dataset
udhr = load_dataset("udhr")
Expected results
A Dataset object containing the UDHR data is returned.
Actual results
>>> d = load_dataset('udhr')
Using custom data configuration default
Downloading and preparing dataset udhn/default (download: 4.18 MiB, generated: 6.15 MiB, post-processed: Unknown size, total: 10.33 MiB) to /home/leon/.cache/huggingface/datasets/udhn/default/0.0.0/ad74b91fa2b3c386e5751b0c52bdfda76d334f76731142fd432d4acc2e2fde66...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/load.py", line 1731, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/builder.py", line 613, in download_and_prepare
    self._download_and_prepare(
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/builder.py", line 1117, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/builder.py", line 684, in _download_and_prepare
    verify_checksums(
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://unicode.org/udhr/assemblies/udhr_xml.zip', 'https://unicode.org/udhr/assemblies/udhr_txt.zip']
>>>
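The error names both URLs because each recorded checksum differs from the one computed for the current download. A minimal sketch of that comparison, using the values from the two jq outputs in this report, might look like the following; `find_mismatches` is a hypothetical helper written for illustration and is not the library's `verify_checksums` implementation.

```python
# Values copied from the two jq outputs quoted in this report.
recorded = {
    "https://unicode.org/udhr/assemblies/udhr_xml.zip": {
        "num_bytes": 2273633,
        "checksum": "0565fa62c2ff155b84123198bcc967edd8c5eb9679eadc01e6fb44a5cf730fee",
    },
    "https://unicode.org/udhr/assemblies/udhr_txt.zip": {
        "num_bytes": 2107471,
        "checksum": "087b474a070dd4096ae3028f9ee0b30dcdcb030cc85a1ca02e143be46327e5e5",
    },
}
regenerated = {
    "https://unicode.org/udhr/assemblies/udhr_xml.zip": {
        "num_bytes": 2389690,
        "checksum": "a3350912790196c6e1b26bfd1c8a50e8575f5cf185922ecd9bd15713d7d21438",
    },
    "https://unicode.org/udhr/assemblies/udhr_txt.zip": {
        "num_bytes": 2215441,
        "checksum": "cb87ecb25b56f34e4fd6f22b323000524fd9c06ae2a29f122b048789cf17e9fe",
    },
}


def find_mismatches(expected, actual):
    """Return the URLs whose expected checksum differs from the actual one.

    Hypothetical helper; a URL missing from `actual` also counts as a mismatch.
    """
    return [
        url
        for url, info in expected.items()
        if actual.get(url, {}).get("checksum") != info["checksum"]
    ]


bad_urls = find_mismatches(recorded, regenerated)
for url in bad_urls:
    # Both URLs are present in both dicts here, so the size lookup is safe.
    delta = regenerated[url]["num_bytes"] - recorded[url]["num_bytes"]
    print(f"{url}: checksum changed, size delta {delta:+d} bytes")
```

Both archives grew as well as changing checksum, which is consistent with the upstream files having been updated rather than with a corrupted download.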
Environment info
- datasets version: 2.2.1 commit/4110fb6034f79c5fb470cf1043ff52180e9c63b7
- Platform: Linux Ubuntu 20.04
- Python version: 3.9.12
- PyArrow version: 8.0.0