Skip to content

Conversation

@albertvillanova
Copy link
Member

@albertvillanova albertvillanova commented May 13, 2022

This PR implements complete support for remote cache_dir. Before, the support was just partial.

This is useful to create datasets using Apache Beam (parallel data processing) builder with cache_dir in a remote bucket, e.g., for Wikipedia dataset.

@albertvillanova albertvillanova changed the title Support cache_dir when remote Support remote cache_dir May 13, 2022
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 13, 2022

The documentation is not available anymore as the PR was closed or merged.

@albertvillanova albertvillanova marked this pull request as draft May 13, 2022 14:45
@albertvillanova albertvillanova marked this pull request as ready for review May 14, 2022 15:19
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome ! Just one minor comment: you can use xjoin directly to support both URLs and local paths

@albertvillanova
Copy link
Member Author

albertvillanova commented May 20, 2022

@lhoestq thanks for your review.

Please note that xjoin cannot be used in this context, as it always returns a POSIX path string and this is not suitable on Windows machines.

@lhoestq
Copy link
Member

lhoestq commented May 24, 2022

xjoin returns windows paths (not posix) on windows, since it just extendsos.path.join

Actually you are right.

if is_local_path(a):
a = Path(a, *p).as_posix()

Though this is not an issue because posix paths (as returned by Path().as_posix()) work on windows. That's why we can replace os.path.join with xjoin in streaming mode. They look like c:/Program Files/ or something (can't confirm right now, I don't have a windows with me)

@albertvillanova
Copy link
Member Author

albertvillanova commented May 25, 2022

Until now, we have always replaced "/" in paths with os.path.join (os.sep,...) in order to support Windows paths (that contain r"\").

Now, you suggest ignoring this and work with POSIX strings (with "/").

As an example, when passing cache_dir=r"C:\Users\Username\.mycache":

  • Until now, it results in self._cache_downloaded_dir = r"C:\Users\Username\.mycache\downloads"
  • If we use xjoin, it will give self._cache_downloaded_dir = "C:/Users/Username/.mycache/downloads"

You say this is OK and we don't care if we work with POSIX strings on Windows machines.

I'm incorporating your suggested changes then...

@albertvillanova
Copy link
Member Author

Also note that using xjoin, if we pass cache_dir="C:\\Users\\Username\\.mycache", we get:

  • self._cache_dir_root = "C:\\Users\\Username\\.mycache"
  • self._cache_downloaded_dir = "C:/Users/Username/.mycache/downloads"

@lhoestq
Copy link
Member

lhoestq commented May 25, 2022

It looks like it broke the CI on windows :/ maybe this was not a good idea, sorry

This reverts commit 43714db.
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and sorry for the bad indications ;)

@albertvillanova albertvillanova merged commit c1dce72 into huggingface:master May 25, 2022
@albertvillanova albertvillanova deleted the fix-remote-cache-dir branch May 25, 2022 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants