Support Zstandard compressed files #2578
lhoestq
left a comment
Thanks ! This can be useful for datasets like The Pile
@lhoestq I think I'm missing something here... Tests are a development tool (to ensure we deliver a good quality lib), not something we offer to the end users of the lib. Users of the lib just On the contrary, developers (contributors) of the lib do need to be able to run tests (TDD). And because of that, they are required to install datasets differently: Apart from
So IMHO, to run tests you should first install datasets with dev or tests dependencies: either
Hi !
Thank you ! I think we can merge now
@lhoestq does this mean that the Pile could have streaming support in the future? AFAIK streaming doesn't support the Zstandard compression type
just for reference, i tried to stream one of the
    data_files = ["https://the-eye.eu/public/AI/pile/train/00.jsonl.zst"]
    streamed_dataset = load_dataset('json', split='train', data_files=data_files, streaming=True)
and got the following error:
i'm not sure whether @Shashi456 is referring to a fundamental limitation with "streaming" zstandard compression files or simply that we need to support the protocol in the streaming api of
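On the "support the protocol" reading, here is a minimal sketch of how a loader could tell a Zstandard file apart from other inputs by looking at its leading magic bytes. It is only an illustration: `MAGIC_NUMBERS` and `detect_compression` are hypothetical names, not part of the `datasets` API, and this thread does not say that the library detects compression this way.

```python
from typing import Optional

# Leading "magic" bytes of a few compressed formats.
# Hypothetical table for illustration; not taken from the datasets library.
MAGIC_NUMBERS = {
    b"\x28\xb5\x2f\xfd": "zstd",  # Zstandard frame, 0xFD2FB528 stored little-endian
    b"\x1f\x8b": "gzip",
    b"\xfd7zXZ\x00": "xz",
}


def detect_compression(head: bytes) -> Optional[str]:
    """Return a compression name for the first bytes of a file, or None.

    Zstandard skippable frames use a different magic number and are
    deliberately not handled in this sketch.
    """
    for magic, name in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return name
    return None
```

Passing the first six bytes of a file is enough for the three formats listed, so a caller can read a small header and only then pick a decompressor.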
@lewtun our streaming mode patches the Python
thanks a lot @albertvillanova - now i can stream the pile :)
Close #2572.
cc: @thomwolf