lhoestq commented Jul 1, 2021

The old script for the C4 dataset generated C4 with Apache Beam, as in TensorFlow Datasets.
However, AllenAI is now hosting the processed C4 dataset in this repo: https://huggingface.co/datasets/allenai/c4
Thanks a lot to them for their amazing work!

In this PR I changed the script to download and prepare the data directly from this repo.
It has 4 variants: en, en.noblocklist, en.noclean, realnewslike

You can load it with

from datasets import load_dataset

c4 = load_dataset("c4", "en")

It also supports streaming, if you don't want to download hundreds of GB of data:

c4 = load_dataset("c4", "en", streaming=True)
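
To make the streaming mode a bit more concrete, here is a minimal sketch of peeking at the first few examples without downloading the full dataset. The helper names `first_n` and `preview_c4` are hypothetical and not part of this PR, and the sketch assumes that passing `split="train"` together with `streaming=True` returns an iterable of dict-like examples.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_n(examples: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n examples, leaving the rest of the stream unread."""
    return list(islice(examples, n))


def preview_c4(variant: str = "en", n: int = 3) -> List[Dict[str, Any]]:
    """Stream the train split of a C4 variant and return its first n examples.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    # Imported lazily so that first_n can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: with streaming=True nothing is downloaded up front, and
    # examples are fetched as the iterator advances.
    stream = load_dataset("c4", variant, split="train", streaming=True)
    return first_n(stream, n)
```

Calling `preview_c4("realnewslike", n=2)` would then return two examples from that variant, assuming network access to the hosted files. Each example is expected to be a dict, so fields can be read by key once you have checked which keys are present.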

Regarding dataset_infos.json, I haven't added the infos for en.noclean yet. I will add them once I have them.

We can also work on the dataset card at https://huggingface.co/datasets/c4
For now I have just added a link to https://huggingface.co/datasets/allenai/c4 as well as a few sections.

lhoestq merged commit 70a72ec into master Jul 2, 2021
lhoestq deleted the c4 branch July 2, 2021 14:50
lhoestq mentioned this pull request Jul 5, 2021