Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Aug 27, 2021

Previously it was unable to infer the compression protocol for files at URLs like

https://foo.bar/train.json.gz?dl=1

because of the query parameters.

I fixed that, this should allow 10+ datasets to work in streaming mode:

  "discovery",
  "emotion",
  "grail_qa",
  "guardian_authorship",
  "pragmeval",
  "simple_questions_v2",
  "versae/adobo",
  "w-nicole/childes_data",
  "w-nicole/childes_data_no_tags_",
  "w-nicole/childes_data_with_tags",
  "w-nicole/childes_data_with_tags_"

cc @severo

@lhoestq lhoestq merged commit 6f7bca7 into master Aug 30, 2021
@lhoestq lhoestq deleted the fix-extraction-protocol-inference-for-urls-with-params branch August 30, 2021 13:12
@lhoestq
Copy link
Member Author

lhoestq commented Aug 30, 2021

merging since the windows error is just a CircleCI issue

severo added a commit to huggingface/dataset-viewer that referenced this pull request Aug 30, 2021
@severo
Copy link
Collaborator

severo commented Aug 30, 2021

@lhoestq
Copy link
Member Author

lhoestq commented Aug 30, 2021

Nice !

severo added a commit that referenced this pull request Aug 31, 2021
A lot of URL use the query strings, for example
http://opus.nlpl.eu/download.php?f=Bianet/v1/moses/en-ku.txt.zip, we
must not remove it when trying to detect the protocol. We thus remove it
only in the case of the query string being ?dl=1 which occurs on dropbox
and dl.orangedox.com. Also: add unit tests.

See #2843 for the original
discussion.
lhoestq pushed a commit that referenced this pull request Aug 31, 2021
A lot of URL use the query strings, for example
http://opus.nlpl.eu/download.php?f=Bianet/v1/moses/en-ku.txt.zip, we
must not remove it when trying to detect the protocol. We thus remove it
only in the case of the query string being ?dl=1 which occurs on dropbox
and dl.orangedox.com. Also: add unit tests.

See #2843 for the original
discussion.
mattstern31 added a commit to mattstern31/datasets-server-storage-admin that referenced this pull request Nov 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants