@changjonathanc
Contributor

The dataset has been updated and the old URL no longer works, so I updated the URL.

I ran into an error while fixing this, so I'm documenting the solution here. Maybe we can add it to the docs (CONTRIBUTING.md and ADD_NEW_DATASET.md).

To make the command run without the ExpectedMoreDownloadedFiles error, you just need to pass the --ignore_verifications flag.
#2076 (comment)

@changjonathanc
Contributor Author

changjonathanc commented Jun 6, 2021

I just noticed that while
load_dataset('local_path/datastes/xor_tydi_qa') works,
load_dataset('xor_tydi_qa')
raises an error:
FileNotFoundError: Couldn't find file at https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_retrieve_eng_span.jsonl
(the old URL)

I tried clearing the caches at .cache/huggingface/modules and .cache/huggingface/datasets, but that didn't help.

Does anyone know how to fix this? Thanks.

@mariosasko
Collaborator

It seems like the error is not on your end. By default, the lib tries to download the version of the dataset script that matches the version of the lib, and that version of the script is, in your case, broken because the old URL no longer works. Once this PR gets merged, you can wait for the new release or set script_version to "master" in load_dataset to get the fixed version of the script.

@changjonathanc
Contributor Author

@mariosasko Thanks! It works now.

Pasting the docstring here for reference.

    script_version (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load:

        - For canonical datasets in the `huggingface/datasets` library like "squad", the default version of the module is the local version of the lib.
          You can specify a different version from your local version of the lib (e.g. "master" or "1.2.0") but it might cause compatibility issues.
        - For community provided datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch.
          You can specify a different version than the default "main" by using a commit sha or a git tag of the dataset repository.

Branch name didn't work, but commit sha works.
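
For reference, here is a minimal sketch of the workaround discussed above. It assumes a `datasets` release in which `load_dataset` still accepts `script_version`, as in the pasted docstring. The helper name `load_xor_tydi_qa` is made up for illustration, and `"master"` is only the value suggested earlier in the thread; a commit sha string can be passed instead.

```python
def load_xor_tydi_qa(script_version="master"):
    """Load xor_tydi_qa with the dataset script taken from a given revision.

    Hypothetical helper, not part of the library. `script_version` may be a
    branch name such as "master" or a commit sha of the huggingface/datasets
    repository, as described in the docstring quoted above.
    """
    # Imported inside the function so that defining the helper does not
    # require the library to be importable.
    from datasets import load_dataset

    # Without script_version, the library fetches the script that matches the
    # installed release, which in this thread still points at the old URL.
    # The call mirrors the one used in the thread (no config name passed).
    return load_dataset("xor_tydi_qa", script_version=script_version)
```

Calling `load_xor_tydi_qa()` needs network access. If the branch name does not resolve in your setup, passing the sha of a commit that contains the fix is the alternative reported above.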

Member

@lhoestq lhoestq left a comment

Thanks for updating the urls and updating the dataset_infos.json with the new checksum files :)

@lhoestq
Member

lhoestq commented Jun 7, 2021

Regarding the issue you mentioned about the --ignore_verifications flag, I think we should actually change the current behavior of the --save_infos flag to make it ignore the verifications as well, so that you don't need to specify --ignore_verifications in this case.

@lhoestq lhoestq merged commit 800f500 into huggingface:master Jun 7, 2021
@changjonathanc
Contributor Author

@lhoestq I realized I forgot to change this:

super(XORTyDiConfig, self).__init__(version=datasets.Version("1.0.0", ""), **kwargs)
self.data_url = data_url

What should I do?

@lhoestq
Member

lhoestq commented Jun 7, 2021

Oh, indeed. Please open a PR to change this. The version should be 1.1.0.
