Contributor

@richarddwang richarddwang commented Aug 14, 2021

books3 is part of EleutherAI/The Pile, but AFAIK The Pile blends all of its sub-datasets together, so we cannot use just one sub-dataset from The Pile data. So I created an independent dataset using The Pile preliminary components.

While creating the dataset card, I found there is room for improvement in creating / editing dataset cards. I've opened an issue for it: #2797

Also, I am wondering whether the import of The Pile dataset is actively being worked on (because I may need it soon)? #1675

Member

@lhoestq lhoestq left a comment

Awesome thank you ! :D

Also note that dataset streaming won't work on this dataset because:

  • it is stored as a TAR archive, which is not easily streamable
  • we haven't added support for pathlib.Path manipulations to navigate in remote archives
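
To illustrate the first point, here is a minimal sketch using only Python's standard `tarfile` module. It is not how the `datasets` library handles archives; the helper names and the file names inside the demo archive are hypothetical. It shows that a TAR archive opened as a forward-only stream (mode `"r|"`) has to be read member by member from the front, with no index to jump to a given file.

```python
import io
import tarfile


def build_demo_tar(files):
    """Create an in-memory TAR archive from a {name: text} mapping (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def iter_tar_members(fileobj):
    """Yield (name, text) pairs by reading the archive strictly front to back."""
    # Mode "r|" treats fileobj as a forward-only stream: no seeking, no index.
    with tarfile.open(fileobj=fileobj, mode="r|") as archive:
        for member in archive:
            if member.isfile():
                # The member must be read now, before advancing to the next one.
                extracted = archive.extractfile(member)
                yield member.name, extracted.read().decode("utf-8")


# Hypothetical file names standing in for the books in the archive.
demo_bytes = build_demo_tar({"books/a.txt": "first book", "books/b.txt": "second book"})
members = list(iter_tar_members(io.BytesIO(demo_bytes)))
```

Reaching the last file this way means passing over every earlier member first, which is why random access into a remote TAR file is awkward.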

Member

lhoestq commented Aug 18, 2021

While creating the dataset card, I found there is room for improvement in creating / editing dataset cards. I've opened an issue for it: #2797

Thanks for the message, we'll definitely improve this

Also, I am wondering whether the import of The Pile dataset is actively being worked on (because I may need it soon)? #1675

Well, currently no, but I think @lewtun was about to do it (though he's currently on vacation)

@lhoestq lhoestq merged commit 8052935 into huggingface:master Aug 18, 2021
Member

lewtun commented Aug 19, 2021

Also, I am wondering whether the import of The Pile dataset is actively being worked on (because I may need it soon)? #1675

Well, currently no, but I think @lewtun was about to do it (though he's currently on vacation)

yes i plan to start working on this next week #2185

one question for @richarddwang - do you know if eleutherai happened to also release the "existing" datasets like enron emails and opensubtitles?

in appendix c of their paper, they provide details on how they extracted these datasets, but it would be nice if we could just point to a url so we can be as close as possible to original implementation.

@lhoestq lhoestq mentioned this pull request Aug 19, 2021
Contributor Author

richarddwang commented Aug 19, 2021

@lewtun

yes i plan to start working on this next week

Nice! Looking forward to it.

one question for @richarddwang - do you know if eleutherai happened to also release the "existing" datasets like enron emails and opensubtitles?

Sadly, I don't know of any existing dataset of enron emails, but I believe the opensubtitles dataset is hosted here: https://the-eye.eu/public/AI/pile_preliminary_components/

Member

lewtun commented Aug 19, 2021

thanks for the link @richarddwang! i think that corpus is actually the youtube subtitles one and my impression is that eleutherai have only uploaded the 14 new datasets they created. i've contacted one of the authors so hopefully they can share some additional info for us :)

btw it might take a while to put together all the corpora if i also need to preprocess them (e.g. the open subtitles / enron email etc), but i expect no longer than a few weeks.
