Skip to content

Conversation

@mariosasko
Copy link
Collaborator

Fix #4549

@mariosasko mariosasko requested a review from lhoestq June 23, 2022 14:49
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 23, 2022

The documentation is not available anymore as the PR was closed or merged.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! LGTM :)

Though it doesn't fix the issue if the relative path points to an actual file (not a pattern) and if it contains a double underscores (so it doesn't completely "Fix #4549"):

resolve_patterns_locally_or_by_urls(
    base_path="/Users/quentinlhoest/Desktop/hf/datasets/playground",
    patterns=[".yo/data.txt"]
)

^ this one should return the file

but we still have to make sure this one doesn't:

resolve_patterns_locally_or_by_urls(
    base_path="/Users/quentinlhoest/Desktop/hf/datasets/playground",
    patterns=["*/data.txt"]
)

(and probably out of scope for now - not important)

maybe this one should return the file

resolve_patterns_locally_or_by_urls(
    base_path="/Users/quentinlhoest/Desktop/hf/datasets/playground",
    patterns=[".yo/*"]
)

@mariosasko
Copy link
Collaborator Author

mariosasko commented Jun 24, 2022

I'm aware of this behavior, which is tricky to solve due to fsspec's hidden file handling (see #4115 (comment)). I've tested some regex patterns to address this, and they seem to work (will push them on Monday; btw they don't break any of fsspec's tests, so maybe we can contribute this as an enhancement to them). Also, perhaps we should include the files starting with __ in the results again (we hadn't had issues with this pattern before). WDYT?

@lhoestq
Copy link
Member

lhoestq commented Jun 28, 2022

I see. Feel free to merge this one if it's good for you btw :)

Also, perhaps we should include the files starting with __ in the results again (we hadn't had issues with this pattern before)

The point was mainly to ignore __pycache__ directories for example. Also also for consistency with the iter_files/iter_archive which are already ignoring them

@mariosasko
Copy link
Collaborator Author

Very elegant solution! Feel free to merge if the CI is green after adding the tests.

@lhoestq lhoestq merged commit 0e1c629 into master Jun 30, 2022
@lhoestq lhoestq deleted the fix-4549 branch June 30, 2022 14:38
@lhoestq
Copy link
Member

lhoestq commented Jun 30, 2022

CI failure is unrelated to this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FileNotFoundError when passing a data_file inside a directory starting with double underscores

4 participants