udhr doesn't load, dataset checksum mismatch #4361

@leondz

Describe the bug

Loading udhr fails with a checksum mismatch on its source files. It looks like both source files on unicode.org have changed:

size + checksum in datasets repo:

(hfdev) leon@blade:~/datasets/datasets/udhr$ jq .default.download_checksums < dataset_infos.json 
{
  "https://unicode.org/udhr/assemblies/udhr_xml.zip": {
    "num_bytes": 2273633,
    "checksum": "0565fa62c2ff155b84123198bcc967edd8c5eb9679eadc01e6fb44a5cf730fee"
  },
  "https://unicode.org/udhr/assemblies/udhr_txt.zip": {
    "num_bytes": 2107471,
    "checksum": "087b474a070dd4096ae3028f9ee0b30dcdcb030cc85a1ca02e143be46327e5e5"
  }
}

size + checksum regenerated from current source files:

(hfdev) leon@blade:~/datasets/datasets/udhr$ rm dataset_infos.json
(hfdev) leon@blade:~/datasets/datasets/udhr$ datasets-cli test --save_infos udhr.py
Using custom data configuration default
Testing builder 'default' (1/1)
Downloading and preparing dataset udhn/default (download: 4.18 MiB, generated: 6.15 MiB, post-processed: Unknown size, total: 10.33 MiB) to /home/leon/.cache/huggingface/datasets/udhn/default/0.0.0/ad74b91fa2b3c386e5751b0c52bdfda76d334f76731142fd432d4acc2e2fde66...
Dataset udhn downloaded and prepared to /home/leon/.cache/huggingface/datasets/udhn/default/0.0.0/ad74b91fa2b3c386e5751b0c52bdfda76d334f76731142fd432d4acc2e2fde66. Subsequent calls will reuse this data.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 686.69it/s]
Dataset Infos file saved at dataset_infos.json
Test successful.
(hfdev) leon@blade:~/datasets/datasets/udhr$ jq .default.download_checksums < dataset_infos.json 
{
  "https://unicode.org/udhr/assemblies/udhr_xml.zip": {
    "num_bytes": 2389690,
    "checksum": "a3350912790196c6e1b26bfd1c8a50e8575f5cf185922ecd9bd15713d7d21438"
  },
  "https://unicode.org/udhr/assemblies/udhr_txt.zip": {
    "num_bytes": 2215441,
    "checksum": "cb87ecb25b56f34e4fd6f22b323000524fd9c06ae2a29f122b048789cf17e9fe"
  }
}
(hfdev) leon@blade:~/datasets/datasets/udhr$ 
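The same two fields can be recomputed by hand for a manually downloaded archive, which helps confirm that the upstream file itself changed. This is a minimal sketch, assuming the `checksum` field is a SHA-256 hex digest (the 64-character values above are consistent with that); `size_and_sha256` is a hypothetical helper name, not part of `datasets`.

```python
import hashlib
from pathlib import Path


def size_and_sha256(path, chunk_size=1 << 20):
    """Return {"num_bytes": ..., "checksum": ...} for a local file.

    Hypothetical helper: mirrors the shape of the entries in
    dataset_infos.json, assuming the checksum is SHA-256.
    """
    digest = hashlib.sha256()
    num_bytes = 0
    with Path(path).open("rb") as f:
        # Read in chunks so a large archive does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            num_bytes += len(chunk)
    return {"num_bytes": num_bytes, "checksum": digest.hexdigest()}
```

For example, after fetching `udhr_xml.zip` into the working directory, `size_and_sha256("udhr_xml.zip")` can be compared field by field against the entry in `dataset_infos.json`.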

Is unicode.org a sustainable hosting solution for this dataset?

Steps to reproduce the bug

from datasets import load_dataset
udhr = load_dataset("udhr")

Expected results

A Dataset object containing the UDHR data is returned.

Actual results

>>> d = load_dataset('udhr')
Using custom data configuration default
Downloading and preparing dataset udhn/default (download: 4.18 MiB, generated: 6.15 MiB, post-processed: Unknown size, total: 10.33 MiB) to /home/leon/.cache/huggingface/datasets/udhn/default/0.0.0/ad74b91fa2b3c386e5751b0c52bdfda76d334f76731142fd432d4acc2e2fde66...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/load.py", line 1731, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/builder.py", line 613, in download_and_prepare
    self._download_and_prepare(
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/builder.py", line 1117, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/builder.py", line 684, in _download_and_prepare
    verify_checksums(
  File "/home/leon/.local/lib/python3.9/site-packages/datasets/utils/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://unicode.org/udhr/assemblies/udhr_xml.zip', 'https://unicode.org/udhr/assemblies/udhr_txt.zip']
>>> 
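To illustrate why both URLs end up in the error, here is a simplified sketch of the comparison using the values quoted in this report. It is not the library's `verify_checksums` implementation; `find_mismatched_urls` is a hypothetical name, and the logic simply flags any URL whose size or checksum differs between the two dictionaries.

```python
XML_URL = "https://unicode.org/udhr/assemblies/udhr_xml.zip"
TXT_URL = "https://unicode.org/udhr/assemblies/udhr_txt.zip"

# Values stored in dataset_infos.json in the datasets repo.
expected = {
    XML_URL: {
        "num_bytes": 2273633,
        "checksum": "0565fa62c2ff155b84123198bcc967edd8c5eb9679eadc01e6fb44a5cf730fee",
    },
    TXT_URL: {
        "num_bytes": 2107471,
        "checksum": "087b474a070dd4096ae3028f9ee0b30dcdcb030cc85a1ca02e143be46327e5e5",
    },
}

# Values regenerated from the files currently served by unicode.org.
recorded = {
    XML_URL: {
        "num_bytes": 2389690,
        "checksum": "a3350912790196c6e1b26bfd1c8a50e8575f5cf185922ecd9bd15713d7d21438",
    },
    TXT_URL: {
        "num_bytes": 2215441,
        "checksum": "cb87ecb25b56f34e4fd6f22b323000524fd9c06ae2a29f122b048789cf17e9fe",
    },
}


def find_mismatched_urls(expected, recorded):
    """Return the URLs whose recorded size/checksum differs from the expected entry."""
    return [url for url, info in expected.items() if recorded.get(url) != info]


bad_urls = find_mismatched_urls(expected, recorded)
print(bad_urls)
```

Both archives differ in size as well as checksum, so both URLs are reported.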

Environment info

  • datasets version: 2.2.1 commit/4110fb6034f79c5fb470cf1043ff52180e9c63b7
  • Platform: Linux Ubuntu 20.04
  • Python version: 3.9.12
  • PyArrow version: 8.0.0

Labels: bug
