Support remote cache_dir #4347
The documentation is not available anymore as the PR was closed or merged.
lhoestq
left a comment
Awesome! Just one minor comment: you can use `xjoin` directly to support both URLs and local paths.
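To illustrate the idea behind that suggestion, here is a minimal sketch of a join helper that accepts both URLs and local paths. The names `is_remote` and `join_path` and the `gs://bucket/cache` URL are hypothetical and used only for illustration; this is not the library's `xjoin` implementation, just one plausible way to get the same behavior.

```python
import os
import posixpath
from urllib.parse import urlparse


def is_remote(path: str) -> bool:
    """Hypothetical check: treat any path with a multi-letter scheme as remote."""
    scheme = urlparse(path).scheme
    # A one-letter "scheme" is a Windows drive letter (e.g. "C:"), not a URL.
    return len(scheme) > 1


def join_path(base: str, *parts: str) -> str:
    """Hypothetical helper: join with "/" for URLs, with the OS separator otherwise."""
    if is_remote(base):
        # URLs always use forward slashes, whatever the local OS is.
        return posixpath.join(base, *parts)
    return os.path.join(base, *parts)


remote_dir = join_path("gs://bucket/cache", "wikipedia", "20220301.en")
local_dir = join_path("/tmp/cache", "wikipedia")
```

With a helper like this, calling code does not need to branch on whether `cache_dir` is local or remote.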
@lhoestq thanks for your review. Please note that
Actually you are right.
datasets/src/datasets/utils/streaming_download_manager.py, lines 104 to 105 in 08ec04c
Though this is not an issue, because POSIX paths (as returned by `Path().as_posix()`) work on Windows. That's why we can replace
Until now, we have always replaced "/" in paths with
Now, you suggest ignoring this and working with POSIX strings (with "/"). As an example, when passing
You say this is OK and that we don't care if we work with POSIX strings on Windows machines. I'm incorporating your suggested changes then...
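The POSIX-versus-backslash question discussed above can be sketched with the standard library alone. The path `C:\Users\me\cache` is a made-up example, and `PureWindowsPath` is used so the snippet runs on any OS. It shows that `pathlib` treats both spellings as the same Windows path, while a plain string comparison does not, which is worth keeping in mind wherever paths are compared or stored as strings.

```python
import ntpath
from pathlib import PureWindowsPath

# Hypothetical Windows path written with backslashes.
backslash_form = "C:\\Users\\me\\cache"

# The same path as a POSIX-style string (forward slashes).
posix_form = PureWindowsPath(backslash_form).as_posix()

# pathlib's Windows flavor considers both spellings the same path...
same_path = PureWindowsPath(posix_form) == PureWindowsPath(backslash_form)

# ...but the raw strings differ, so string comparisons see two different values.
same_string = posix_form == backslash_form

# ntpath.normpath converts forward slashes back to backslashes.
normalized = ntpath.normpath(posix_form)
```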
Also note that using
It looks like it broke the CI on Windows :/ maybe this was not a good idea, sorry
This reverts commit 43714db.
lhoestq
left a comment
Thanks and sorry for the bad indications ;)
This PR implements complete support for remote `cache_dir`. Before, the support was just partial.
This is useful to create datasets using the Apache Beam (parallel data processing) builder with `cache_dir` in a remote bucket, e.g., for the Wikipedia dataset.