@albertvillanova
Member

Add:

  • Free Law subset of The Pile: "free_law" config

Close bigscience-workshop/data_tooling#75.

CC: @StellaAthena

@albertvillanova albertvillanova merged commit 702389e into master Dec 1, 2021
@albertvillanova albertvillanova deleted the the-pile-free-law branch December 1, 2021 17:30
@StellaAthena

@albertvillanova Is there a specific reason you’re adding the Pile under “the” instead of under “pile”? That does not appear to be consistent with other datasets.

@albertvillanova
Member Author

albertvillanova commented Dec 2, 2021

Hi @StellaAthena,

I asked myself the same question, but in the end I decided to be consistent with the previously added Pile subsets.

I guess the reason is to stress that the definite article is always used before the name of the dataset (your site says: "The Pile. An 800GB Dataset of Diverse Text for Language Modeling"). Other datasets are not usually preceded by the definite article: nobody says "the SQuAD", "the GLUE", or "the Common Voice".

CC: @lhoestq

@lhoestq
Member

lhoestq commented Dec 6, 2021

> I guess the reason is to stress that the definite article is always used before the name of the dataset (your site says: "The Pile. An 800GB Dataset of Diverse Text for Language Modeling").

Yes, that is why it starts with "the".

Development

Successfully merging this pull request may close these issues.

Create license-compliant version of the Pile: FreeLaw
