add books3 #2801
lhoestq
left a comment
Awesome, thank you! :D
Also note that dataset streaming won't work on this dataset because:
- it is stored as a TAR archive, which is not easily streamable
- we haven't added support for pathlib.Path manipulations to navigate in remote archives
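To illustrate the first point above, here is a minimal sketch using only Python's standard `tarfile` module, not the `datasets` library's own streaming code. A TAR archive has no central index, so a stream can only be read front to back, one member at a time; there is no way to jump to a given path the way `pathlib.Path` navigation would need. The helper names `build_sample_tar` and `iter_tar_stream` and the file names are hypothetical, made up for this example.

```python
import io
import tarfile


def build_sample_tar() -> bytes:
    """Create a tiny in-memory TAR archive with two hypothetical text files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in [("books/a.txt", "first book"), ("books/b.txt", "second book")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def iter_tar_stream(fileobj):
    """Yield (name, text) pairs by reading the archive strictly front to back."""
    # Mode "r|" treats fileobj as a non-seekable stream: members can only be
    # visited in the order they were written, never looked up by path.
    with tarfile.open(fileobj=fileobj, mode="r|") as archive:
        for member in archive:
            if member.isfile():
                extracted = archive.extractfile(member)
                yield member.name, extracted.read().decode("utf-8")


examples = list(iter_tar_stream(io.BytesIO(build_sample_tar())))
for name, text in examples:
    print(name, text)
```

In this sketch the in-memory buffer stands in for a remote download. Sequential iteration works, but anything that needs to open a specific file inside the archive by path would first have to scan past every member before it.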
Thanks for the message, we'll definitely improve this.
Well, currently no, but I think @lewtun was about to do it (though he's currently on vacation).
yes, i plan to start working on this next week #2185. one question for @richarddwang: do you know if eleutherai happened to also release the "existing" datasets like enron emails and opensubtitles? in appendix c of their paper, they provide details on how they extracted these datasets, but it would be nice if we could just point to a url so we can be as close as possible to the original implementation.
Nice! Looking forward to it.
Sadly, I don't know of any existing dataset of enron emails, but I believe the opensubtitles dataset is hosted here: https://the-eye.eu/public/AI/pile_preliminary_components/
thanks for the link @richarddwang! i think that corpus is actually the youtube subtitles one, and my impression is that eleutherai have only uploaded the 14 new datasets they created. i've contacted one of the authors, so hopefully they can share some additional info for us :) btw it might take a while to put together all the corpora if i also need to preprocess them (e.g. the open subtitles / enron email etc), but i expect no longer than a few weeks.

books3 is part of EleutherAI/The Pile, but AFAIK The Pile blends all of its sub-datasets together, so we cannot load just one sub-dataset from The Pile data. So I created an independent dataset using The Pile preliminary components.
While creating the dataset card, I found there is room for improvement in how dataset cards are created and edited. I've opened an issue for it: #2797
Also, I am wondering whether the import of The Pile dataset is actively being worked on (I may need it soon)? #1675