@lhoestq commented Jul 30, 2021

This code was failing because the parquet extension wasn't recognized:

from datasets import load_dataset
base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
data_files = {"train": base_url + "wikipedia-train.parquet"}
wiki = load_dataset("parquet", data_files=data_files, split="train", streaming=True)

It raises:

NotImplementedError: Extraction protocol for file at https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/wikipedia-train.parquet is not implemented yet

I added parquet to the list of known extensions.

EDIT: added the pickle, conllu, and xml extensions as well.
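
For context, here is a minimal sketch of how extension-based detection of this kind could work. It is not the library's actual code: the function name guess_extraction_protocol, the KNOWN_EXTENSIONS and COMPRESSED_EXTENSIONS tables, and their contents are hypothetical and chosen only for illustration. The idea is that a file whose extension is in the known list is read as-is, a compressed extension maps to an extraction protocol, and anything else raises the error shown above.

```python
from typing import Optional

# Hypothetical table of extensions that need no extraction (illustrative only).
KNOWN_EXTENSIONS = {
    "txt", "csv", "json", "jsonl", "tsv",
    "parquet", "pkl", "pickle", "conllu", "xml",
}

# Hypothetical mapping from compressed extensions to an extraction protocol.
COMPRESSED_EXTENSIONS = {"gz": "gzip", "zip": "zip"}


def guess_extraction_protocol(url: str) -> Optional[str]:
    """Return None if the file can be read directly, else a protocol name."""
    # Drop any query string, then keep only the last path component.
    filename = url.split("?")[0].rsplit("/", 1)[-1]
    # Take the text after the final dot as the extension (empty if no dot).
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension in KNOWN_EXTENSIONS:
        return None  # plain data file, nothing to extract
    if extension in COMPRESSED_EXTENSIONS:
        return COMPRESSED_EXTENSIONS[extension]
    raise NotImplementedError(
        f"Extraction protocol for file at {url} is not implemented yet"
    )
```

With a table like this, a URL ending in wikipedia-train.parquet is treated as a plain data file, whereas removing parquet from the set would make the same call raise NotImplementedError.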

@lhoestq merged commit b00ef30 into master Jul 30, 2021
@lhoestq deleted the add-missing-parquet-known-extension branch July 30, 2021 13:24