@richarddwang
Contributor

@richarddwang richarddwang commented Aug 14, 2021

Stack Exchange is part of EleutherAI's The Pile, but AFAIK The Pile blends all of its sub-datasets together, so we cannot load just one sub-dataset from The Pile data. I therefore created an independent dataset from The Pile's preliminary components.

I also changed the default timeout from 10 seconds to 100 seconds; otherwise I kept getting read timeouts when downloading the source data of the stack exchange and cc100 datasets.
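
For context, the effect of a longer timeout combined with retries can be sketched in isolation. This is a minimal, hypothetical helper built only on the Python standard library; it is not the `http_get` touched by this PR. The name `fetch_with_timeout` and its `opener` and `backoff` parameters are assumptions made for illustration.

```python
import socket
import time
import urllib.request


def fetch_with_timeout(url, timeout=100.0, max_retries=0,
                       opener=urllib.request.urlopen, backoff=0.0):
    """Fetch `url`, retrying up to `max_retries` times on a read timeout.

    Hypothetical helper for illustration; not the library's `http_get`.
    `opener` is injectable so the retry logic can be exercised offline.
    """
    attempt = 0
    while True:
        try:
            # `timeout` bounds each blocking socket operation,
            # not the duration of the whole download.
            with opener(url, timeout=timeout) as response:
                return response.read()
        except (socket.timeout, TimeoutError):
            # Timeouts raised while connecting may surface as
            # urllib.error.URLError instead; this sketch does not retry those.
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(backoff * attempt)
```

A larger `timeout` helps with slow hosts, while `max_retries` covers occasional stalls; which of the two matters more depends on the server.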

While creating the dataset card, I found there is room for improvement in how dataset cards are created and edited. I've opened an issue for it: #2797

Also, I am wondering whether the import of The Pile dataset is actively being worked on (I may need it soon)? #1675

Member

@lhoestq lhoestq left a comment

Thanks ! This looks all good to me :)



-def http_get(url, temp_file, proxies=None, resume_size=0, headers=None, cookies=None, timeout=10.0, max_retries=0):
+def http_get(url, temp_file, proxies=None, resume_size=0, headers=None, cookies=None, timeout=100.0, max_retries=0):
Member

I'm fine with it

@lhoestq
Member

lhoestq commented Aug 19, 2021

Hi ! Merging this one since it's all good :)

However, I think it would be better to rename it to the_pile_stack_exchange, to make things clearer and to avoid name collisions in the future. I would like to do the same for books3 as well.

If you don't mind I'll open a PR to do the renaming

@lhoestq lhoestq merged commit b9fb8b2 into huggingface:master Aug 19, 2021
@lhoestq lhoestq mentioned this pull request Aug 19, 2021
3 tasks
@richarddwang
Contributor Author

> If you don't mind I'll open a PR to do the renaming

@lhoestq That will be nice !!
