@richarddwang
Contributor

@richarddwang richarddwang commented Aug 14, 2021

openwebtext2 is part of EleutherAI's The Pile, but AFAIK The Pile blends all of its sub-datasets together, so we cannot use just one sub-dataset from The Pile data. I therefore created an independent dataset from The Pile's preliminary components.

While creating the dataset card, I found there is room for improvement in how dataset cards are created / edited. I've opened an issue for it: #2797

Also, I am wondering whether the import of The Pile dataset is being actively worked on (I may need it soon). #1675

@richarddwang
Contributor Author

It seems we need to pip install jsonlines to pass the checks?

@lhoestq
Member

lhoestq commented Aug 18, 2021

Hi ! Do you really need jsonlines ? I think it simply uses json.loads under the hood.

Currently the tests are failing because jsonlines is not part of the extra requirements TESTS_REQUIRE in setup.py

So you can either replace jsonlines with a simple for loop over the lines of the files and call json.loads on each line, or add jsonlines to TESTS_REQUIRE (but in that case users will have to install it as well).
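
For reference, the first option could look roughly like the sketch below. It is only an illustration and not the code in this PR: the helper name `iter_jsonl` and the sample records are made up, and it assumes each non-empty line of the file holds exactly one JSON value.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed object per non-empty line of a JSON Lines stream.

    `lines` can be an open text file or any iterable of strings.
    (Hypothetical helper, named here for illustration only.)
    """
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


# Hypothetical sample data standing in for one of the dataset files.
sample = io.StringIO('{"text": "first document"}\n{"text": "second document"}\n')
records = list(iter_jsonl(sample))
```

With a real file, the same loop would run over the object returned by `open(path, encoding="utf-8")`, so no dependency beyond the standard library is needed.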

@lhoestq lhoestq mentioned this pull request Aug 19, 2021
3 tasks
@richarddwang
Contributor Author

richarddwang commented Aug 19, 2021

Thanks for your suggestion. I now understand io and the JSON Lines format better, and I have changed jsonlines to just readlines.

Member

@lhoestq lhoestq left a comment

Awesome thanks !

I'm also doing some minor modifications and we can merge :)

Comment on lines 60 to 64
- **Homepage:** https://openwebtext2.readthedocs.io/en/latest/
- **Repository:** [Needs More Information]
- **Paper:** https://arxiv.org/abs/2101.00027
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]
Member

Suggested change
- **Homepage:** [openwebtext2](https://openwebtext2.readthedocs.io/en/latest/)
- **Repository:** [Needs More Information]
- **Paper:** [arXiv](https://arxiv.org/abs/2101.00027)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]

@lhoestq lhoestq merged commit 72ba8c3 into huggingface:master Aug 23, 2021