add books3 #2801

richarddwang · 2021-08-14T07:04:25Z

books3 is part of EleutherAI's The Pile, but AFAIK The Pile blends all of its sub-datasets together, so we cannot use just one sub-dataset from The Pile data. So I created an independent dataset from The Pile preliminary components.

While creating the dataset card, I found there is room for improvement in how dataset cards are created and edited. I've opened an issue for it: #2797

Also, I am wondering whether the import of The Pile dataset is actively being worked on (because I may need it soon)? #1675

lhoestq

Awesome, thank you! :D

Also note that dataset streaming won't work on this dataset because:

- it is stored as a TAR archive, which is not easily streamable
- we haven't added support for pathlib.Path manipulations to navigate in remote archives
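
To illustrate the first point, here is a minimal sketch using only Python's standard `tarfile` module. It is not how `datasets` implements streaming; the helper names `build_demo_tar` and `iter_tar_texts` and the demo file names are hypothetical. It shows that a TAR archive opened as a stream has no index to jump to, so its members can only be visited in order from the start.

```python
import io
import tarfile


def build_demo_tar(files):
    """Build an in-memory TAR archive from a {name: text} mapping (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


def iter_tar_texts(fileobj):
    """Yield (member name, text) pairs by reading the archive front to back."""
    # mode="r|" treats fileobj as a non-seekable stream: members can only be
    # visited in order, so reaching one file means reading everything before it.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                # Read the member before advancing; a stream cannot go back.
                yield member.name, tar.extractfile(member).read().decode("utf-8")


# Hypothetical stand-in for the real archive, which is far larger.
demo_files = {"books/a.txt": "first book", "books/b.txt": "second book"}
records = list(iter_tar_texts(build_demo_tar(demo_files)))
print(records)
```

With a remote archive the same constraint applies over the network: getting to a member near the end means downloading everything that precedes it.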

datasets/books3/README.md

lhoestq · 2021-08-18T15:02:04Z

> While creating the dataset card, I found there is room for improvement in how dataset cards are created and edited. I've opened an issue for it: #2797

Thanks for the message, we'll definitely improve this

> Also, I am wondering whether the import of The Pile dataset is actively being worked on (because I may need it soon)? #1675

Well, currently no, but I think @lewtun was about to do it (though he's currently on vacation)

lewtun · 2021-08-19T08:07:52Z

> > Also, I am wondering whether the import of The Pile dataset is actively being worked on (because I may need it soon)? #1675
>
> Well, currently no, but I think @lewtun was about to do it (though he's currently on vacation)

yes i plan to start working on this next week #2185

one question for @richarddwang - do you know if eleutherai happened to also release the "existing" datasets like enron emails and opensubtitles?

in appendix c of their paper, they provide details on how they extracted these datasets, but it would be nice if we could just point to a url so we can be as close as possible to original implementation.

richarddwang · 2021-08-19T11:33:47Z

@lewtun

> yes i plan to start working on this next week

Nice! Looking forward to it.

> one question for @richarddwang - do you know if eleutherai happened to also release the "existing" datasets like enron emails and opensubtitles?

Sadly, I don't know of any existing Enron emails dataset, but I believe the opensubtitles dataset is hosted here: https://the-eye.eu/public/AI/pile_preliminary_components/

lewtun · 2021-08-19T16:43:09Z

thanks for the link @richarddwang! i think that corpus is actually the youtube subtitles one and my impression is that eleutherai have only uploaded the 14 new datasets they created. i've contacted one of the authors so hopefully they can share some additional info for us :)

btw it might take a while to put together all the corpora if i also need to preprocess them (e.g. the open subtitles / enron email etc), but i expect no longer than a few weeks.

richarddwang added 3 commits August 14, 2021 16:22

add books3

510de17

fix flake8, datraset card

4a41e3d

fix dset card

aefc6f0

lhoestq approved these changes Aug 18, 2021

lhoestq added 2 commits August 18, 2021 17:02

Apply suggestions from code review

24cee31

Update README.md

bbdbfa4

lhoestq merged commit 8052935 into huggingface:master Aug 18, 2021

lhoestq mentioned this pull request Aug 19, 2021

Rename The Pile subsets #2817

Merged

3 tasks
