-
Notifications
You must be signed in to change notification settings - Fork 3k
add stack exchange #2803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add stack exchange #2803
Conversation
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! This looks all good to me :)
|
|
||
|
|
||
| def http_get(url, temp_file, proxies=None, resume_size=0, headers=None, cookies=None, timeout=10.0, max_retries=0): | ||
| def http_get(url, temp_file, proxies=None, resume_size=0, headers=None, cookies=None, timeout=100.0, max_retries=0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with it
|
Hi ! Merging this one since it's all good :) However I think it would also be better to actually rename it If you don't mind I'll open a PR to do the renaming |
@lhoestq That will be nice !! |
stack exchange is part of EleutherAI/The Pile, but AFAIK, The Pile dataset blend all sub datasets together thus we are not able to use just one of its sub dataset from The Pile data. So I create an independent dataset using The Pile preliminary components.
I also change default
timeoutto 100 seconds instead of 10 seconds, otherwise I keep getting read time out when downloading source data of stack exchange and cc100 dataset.When I was creating dataset card. I found there is room for creating / editing dataset card. I've made it an issue. #2797
Also I am wondering whether the import of The Pile dataset is actively undertaken (because I may need it recently)? #1675