-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix regex get_data_files formatting for base paths
#6322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
get_data_files formatting for url base pathsget_data_files formatting for base paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the proposed fix, @ZachNagengast.
EDIT:
The reason why I used the the glob_pattern_to_regex in the entire pattern is because otherwise I got an error for Windows local paths: a base_path like 'C:\\Users\\runneradmin... made the function string_to_dict raise re.error: incomplete escape \U at position 2
See: https://github.com/huggingface/datasets/actions/runs/6544904352/job/17772361643
We should include a test that includes the case you mention, and find a solution that works for all cases.
That issue was fixed once we pass the base_path as POSIX.
Maybe we could add a test that fails in the case you mention.
|
The documentation is not available anymore as the PR was closed or merged. |
What is the expected inputs and outputs for the windows
I'm not sure what you meant by that, are there still changes needed? |
|
We took the liberty of continuing this PR to include it in today's patch release :) |
| splits: Set[str] = { | ||
| string_to_dict(xbasename(p), glob_pattern_to_regex(xbasename(split_pattern)))["split"] | ||
| for p in data_files | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are matching just in the basename, then what is the point of having 2 kinds of patterns?
- ALL_SPLIT_PATTERNS:
data/{split}-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9]*.* - ALL_DEFAULT_PATTERNS:
**/*[{sep}/]{keyword}[{sep}/]**
Maybe I'm missing something, but why do we need the former? I would naively say the latter contains the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only ALL_SPLIT_PATTERNS are parsed to infer custom split names.
While the second only detects train/valid/test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, and what is the point of the directory data/ in ALL_SPLIT_PATTERNS if we only match the basename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for old push_to_hub to work: they push custom splits using this pattern in the data directory.
New push_to_hub have some YAML to specify the pattern to use, so get_data_patterns isn't called
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, all clear now. Thanks.
* Fix regex from formatting url base_path * Test test_get_data_patterns from Hub * simply match basename instead * more tests * minor * remove comment --------- Co-authored-by: Albert Villanova del Moral <[email protected]> Co-authored-by: Quentin Lhoest <[email protected]> Co-authored-by: Quentin Lhoest <[email protected]>
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
* Fix regex from formatting url base_path * Test test_get_data_patterns from Hub * simply match basename instead * more tests * minor * remove comment --------- Co-authored-by: Albert Villanova del Moral <[email protected]> Co-authored-by: Quentin Lhoest <[email protected]> Co-authored-by: Quentin Lhoest <[email protected]>

With this pr #6309, it is formatting the entire base path into regex, which results in the undesired formatting error
doesn't match the patternbecause of the line inglob_pattern_to_regex:.replace("//", "/"):hf://datasets/...hf:/datasets/...This fix will only convert the
split_patternto regex and keep thebase_pathunchanged.cc @albertvillanova hopefully this still works with your implementation