@lhoestq
Member

@lhoestq lhoestq commented Aug 19, 2021

After discussing with @yjernite we think it's better to have the subsets of The Pile explicitly have "the_pile" in their names.

I'm doing the changes for the subsets that @richarddwang added:

For consistency we should also rename bookcorpusopen to the_pile_bookcorpus IMO, but let me know what you think.
(we can just add a deprecation message to bookcorpusopen for now and add the_pile_bookcorpus)

@thomwolf
Member

Sounds good. Should we also have a “the_pile” dataset with the subsets as configurations?

@lhoestq
Member Author

lhoestq commented Aug 19, 2021

I think the main the_pile dataset will be the one that mixes all the subsets: https://the-eye.eu/public/AI/pile/

We can also add configurations for each subset, and even allow users to specify the subsets they want:

from datasets import load_dataset

load_dataset("the_pile", subsets=["openwebtext2", "books3", "hn"])

We're already doing something similar for mC4, where users can specify the list of languages they want to load.
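
As a rough illustration of the idea, here is a minimal sketch of how a `subsets` argument could be resolved into a single configuration name. Everything in it is an assumption made for illustration: `KNOWN_SUBSETS`, `resolve_subsets` and `config_name` are hypothetical names, the subset list only contains the three names from the example above, and none of this is the actual loader code.

```python
from typing import List, Optional

# Hypothetical subset names, copied from the example above (not a complete list).
KNOWN_SUBSETS = ("openwebtext2", "books3", "hn")


def resolve_subsets(subsets: Optional[List[str]] = None) -> List[str]:
    """Validate the requested subsets; no selection means the full mix ("all")."""
    if subsets is None:
        return ["all"]
    unknown = sorted(set(subsets) - set(KNOWN_SUBSETS))
    if unknown:
        raise ValueError(f"Unknown subsets: {unknown}")
    # Deduplicate and sort so the same selection always yields the same result.
    return sorted(set(subsets))


def config_name(subsets: Optional[List[str]] = None) -> str:
    """Build a deterministic configuration name from the selected subsets."""
    return "+".join(resolve_subsets(subsets))


if __name__ == "__main__":
    print(config_name())                  # full mix
    print(config_name(["hn", "books3"]))  # user-selected subsets
```

Sorting the selection means that `["hn", "books3"]` and `["books3", "hn"]` map to the same configuration name, so a cache keyed on that name would not store the same selection twice.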

@lhoestq lhoestq merged commit 374e171 into master Aug 23, 2021
@lhoestq lhoestq deleted the rename-the-pile-subsets branch August 23, 2021 16:24