Commit 6f7bca7

Fix extraction protocol inference from urls with params (#2843)
* fix extraction protocol inference from urls with params
* severo's comment
* Update streaming_download_manager.py
1 parent 93f3c44 commit 6f7bca7

1 file changed: 4 additions, 0 deletions

src/datasets/utils/streaming_download_manager.py

Lines changed: 4 additions & 0 deletions
@@ -149,7 +149,11 @@ def _extract(self, urlpath: str) -> str:
         return f"{protocol}://::{urlpath}"
 
     def _get_extraction_protocol(self, urlpath: str) -> Optional[str]:
+        # get inner file: zip://train-00000.json.gz::https://foo.bar/data.zip -> zip://train-00000.json.gz
         path = urlpath.split("::")[0]
+        # remove query params: https://foo.bar/train.json.gz?dl=1 -> https://foo.bar/train.json.gz
+        path = path.split("?")[0]
+        # Get extension: https://foo.bar/train.json.gz -> gz
         extension = path.split(".")[-1]
         if extension in BASE_KNOWN_EXTENSIONS:
             return None
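
The effect of the added `split("?")` step can be sketched in isolation. This is a minimal, standalone illustration rather than the library's code: the function names `extension_before_fix` and `extension_after_fix` are hypothetical, and the lookup against `BASE_KNOWN_EXTENSIONS` is left out because its contents are not shown in this diff.

```python
def extension_before_fix(urlpath: str) -> str:
    """Hypothetical helper: extension parsing as it was before this commit."""
    # Keep only the first element of a chained URL (the inner file).
    path = urlpath.split("::")[0]
    # Everything after the last dot is treated as the extension,
    # so a trailing query string is included in it.
    return path.split(".")[-1]


def extension_after_fix(urlpath: str) -> str:
    """Hypothetical helper: extension parsing with the query string removed."""
    # Keep only the first element of a chained URL (the inner file).
    path = urlpath.split("::")[0]
    # Drop query parameters such as "?dl=1".
    path = path.split("?")[0]
    # Everything after the last dot is the extension.
    return path.split(".")[-1]


url = "https://foo.bar/train.json.gz?dl=1"
print(extension_before_fix(url))  # gz?dl=1
print(extension_after_fix(url))   # gz

chained = "zip://train-00000.json.gz::https://foo.bar/data.zip"
print(extension_after_fix(chained))  # gz
```

Without the query-string step, the extension of a URL with parameters would be `gz?dl=1`, which presumably would not match any known extension. With it, the same URL yields `gz`.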
