Add missing parquet known extension #2733

lhoestq · 2021-07-30T13:01:20Z

This code was failing because the parquet extension wasn't recognized:

from datasets import load_dataset
base_url = "https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/"
data_files = {"train": base_url + "wikipedia-train.parquet"}
wiki = load_dataset("parquet", data_files=data_files, split="train", streaming=True)

It raises:

NotImplementedError: Extraction protocol for file at https://storage.googleapis.com/huggingface-nlp/cache/datasets/wikipedia/20200501.en/1.0.0/wikipedia-train.parquet is not implemented yet

I added parquet to the list of known extensions.

EDIT: added the pickle, conllu, and xml extensions as well.
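
For readers unfamiliar with why a missing extension produces this error, here is a minimal sketch of how such a check might work. It is not the library's actual code: the names `KNOWN_EXTENSIONS` and `infer_extraction_protocol`, and the exact contents of the list, are assumptions made for illustration. The idea is that a file whose extension is known to be a plain data format needs no extraction step, a compressed file maps to an extraction protocol, and anything else is rejected.

```python
from typing import Optional

# Hypothetical list for illustration only; the real list lives inside the
# library and may contain different entries.
KNOWN_EXTENSIONS = {"txt", "csv", "json", "jsonl", "tsv", "parquet", "pickle", "conllu", "xml"}


def infer_extraction_protocol(urlpath: str) -> Optional[str]:
    """Return the extraction protocol for a file, or None if no extraction is needed."""
    # Take whatever follows the last dot as the extension.
    extension = urlpath.rsplit(".", 1)[-1].lower()
    if extension in KNOWN_EXTENSIONS:
        # Plain data file: it can be read directly, nothing to extract.
        return None
    if extension == "gz":
        return "gzip"
    if extension == "zip":
        return "zip"
    # Unknown extension: fail the same way as the error reported above.
    raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")


print(infer_extraction_protocol("https://example.com/wikipedia-train.parquet"))
print(infer_extraction_protocol("https://example.com/data.json.gz"))
```

Under this sketch, removing "parquet" from the set makes the first call raise `NotImplementedError`, which mirrors the failure described in this PR.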

lhoestq added 3 commits July 30, 2021 14:59

add missing parquet known extension

638ae72

add conllu

c7c70d4

add xml

78a7908

lhoestq merged commit b00ef30 into master Jul 30, 2021

lhoestq deleted the add-missing-parquet-known-extension branch July 30, 2021 13:24

stevhliu mentioned this pull request Jul 30, 2021

New documentation structure #2718

Merged
