-
Notifications
You must be signed in to change notification settings - Fork 3k
[data_files] Only match separated split names #4633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The documentation is not available anymore as the PR was closed or merged. |
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, just one nit.
PS: I think we should also check that this change doesn't affect the existing Hub repos and warn their owners if it does.
|
I ran a script to find affected datasets (just did it on non-private non-gated). Adding "testing" and "evaluation" fixes all of of them except one:
Let me open a PR on their repository to fix it |
mariosasko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good now. Thanks!
|
Feel free to merge @albertvillanova if it's all good to you :) |
albertvillanova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!! This will definitely make the user experience much more pleasant and aligned with their expectations.
docs/source/repository_structure.mdx
Outdated
| Files that contain *train* in their names are considered part of the train split, e.g. `train.csv`, `my_train_file.csv`, etc. | ||
| The same idea applies to the test and validation split: | ||
|
|
||
| - All the files that contain *test* in their names are considered part of the test split, e.g. `test.csv`, `my_test_file.csv` | ||
| - All the files that contain *validation* in their names are considered part of the validation split, e.g. `validation.csv`, `my_validation_file.csv` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the explanation more clear:
- You are saying here that if a filename contains "train", it is considered part of the train split: But this is only true if the subword "train" is delimited by some specific non-word characters.
- You do this clarification afterwards, but I think it would be better to make it right here: if a user stops reading here, they will misunderstand how this works
A comment about style (feel free to ignore it):
- You first make a sentence with "train" and then a partial unordered list enumeration with just "test" and "validation". I think it might be better:
- either an unordered list with all 3 cases ("train", "validation", and "test"; in this order)
- or no partial unordered list and just 3 sentences.
- The text in the 3 cases is quite repetitive: maybe just a single sentence and then 3 itemized examples
Something like (just for inspiration):
All the files that contain a split name in their names (delimited by some specific non-word characters, see below) are considered part of that split:
- train split:
train.csv,my_train_file.csv- validation split:
validation.csv,my_validation_file.csv- test split:
test.csv,my_test_file.csv
docs/source/repository_structure.mdx
Outdated
| └── validation.csv | ||
| ``` | ||
|
|
||
| Note that if a file contains *test* but is embedded in another word (e.g. `contest.csv`), it's not counted as a test file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe worth enumerating which are the non-word characters considered as valid delimiters of the split name subword?
docs/source/repository_structure.mdx
Outdated
|
|
||
| ## Split names keywords | ||
|
|
||
| Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just enumerating the equivalent names?
| Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets. | |
| Validation split is sometimes called "dev", or test split is called "eval". |
src/datasets/data_files.py
Outdated
| str(Split.TRAIN): ["**[-._ /]train[-._ ]*", "train[-._ ]*", "**[-._ /]training[-._ ]*", "training[-._ ]*"], | ||
| str(Split.TEST): [ | ||
| "**[-._ /]test[-._ ]*", | ||
| "test[-._ ]*", | ||
| "**[-._ /]testing[-._ ]*", | ||
| "testing[-._ ]*", | ||
| "**[-._ /]eval[-._ ]*", | ||
| "eval[-._ ]*", | ||
| "**[-._ /]evaluation[-._ ]*", | ||
| "evaluation[-._ ]*", | ||
| ], | ||
| str(Split.VALIDATION): [ | ||
| "**[-._ /]dev[-._ ]*", | ||
| "dev[-._ ]*", | ||
| "**[-._ /]valid[-._ ]*", | ||
| "valid[-._ ]*", | ||
| "**[-._ /]validation[-._ ]*", | ||
| "validation[-._ ]*", | ||
| ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are a combination of pattern and split_name. Maybe better with list comprehensions?
test_split_names = ["test", "testing", "eval", "evaluation"]
...
default_patterns_split_in_filename = ["**[-._ /]{split_name}[-._ ]*", "{split_name}[-._ ]*"]
...
str(Split.TEST): [pattern.format(split_name=split_name) for split_name in test_split_names for pattern in default_patterns_split_in_filename]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And also a question: why numbers are not valid delimiters?
- "test1.txt", "test2.txt" are not considered as "test" files
| str(Split.TRAIN): ["**train*/**"], | ||
| str(Split.TEST): ["**test*/**", "**eval*/**"], | ||
| str(Split.VALIDATION): ["**dev*/**", "**valid*/**"], | ||
| str(Split.TRAIN): ["train[-._ /]**", "**[-._ /]train[-._ /]**", "training[-._ /]**", "**[-._ /]training[-._ /]**"], | ||
| str(Split.TEST): [ | ||
| "test[-._ /]**", | ||
| "**[-._ /]test[-._ /]**", | ||
| "testing[-._ /]**", | ||
| "**[-._ /]testing[-._ /]**", | ||
| "eval[-._ /]**", | ||
| "**[-._ /]eval[-._ /]**", | ||
| "evaluation[-._ /]**", | ||
| "**[-._ /]evaluation[-._ /]**", | ||
| ], | ||
| str(Split.VALIDATION): [ | ||
| "dev[-._ /]**", | ||
| "**[-._ /]dev[-._ /]**", | ||
| "valid[-._ /]**", | ||
| "**[-._ /]valid[-._ /]**", | ||
| "validation[-._ /]**", | ||
| "**[-._ /]validation[-._ /]**", | ||
| ], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same as above about list comprehensions.
tests/test_data_files.py
Outdated
| @classmethod | ||
| def get_test_paths(cls, start_with=""): | ||
| """Helper to return directory and file paths with no details""" | ||
| all = [file["name"] for file in cls._fs_contents if file["name"].startswith(start_with)] | ||
| return all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method can be removed.
I just copied-pasted it but we do not use it.
tests/test_data_files.py
Outdated
| with patch.dict(fsspec.registry.target, {"mock": DummyTestFS}): | ||
| yield DummyTestFS() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a comment (feel free to ignore it): you use here unittest.mock.patch, but you could use pytest.monkeypatch instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is a context manager that doesn't have access to the monkeypatch fixture of pytest, so I used unittest.mock.patch instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhoestq again not important: but indeed you are not using the patching. You are just using the returned instance DummyTestFS().
So I guess you could just remove the patching (unittest.mock.patch) and the test will pass anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| with patch.dict(fsspec.registry.target, {"mock": DummyTestFS}): | |
| yield DummyTestFS() | |
| yield DummyTestFS() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing the patching and the context manager :) merging
It makes sense if it is not indeed necessary.
tests/test_data_files.py
Outdated
|
|
||
| @contextmanager | ||
| def mock_fs(file_paths: List[str]): | ||
| """context manager to set up a mock:// filesystem in sfspec containing the provided files""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
| """context manager to set up a mock:// filesystem in sfspec containing the provided files""" | |
| """context manager to set up a mock:// filesystem in fsspec containing the provided files""" |
| {"train": "developers_list.txt"}, | ||
| {"train": "data/seqeval_results.txt"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also adding a test for "test": "contest.txt"?
tests/test_data_files.py
Outdated
| with mock_fs( | ||
| [file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths] | ||
| ) as fs: | ||
|
|
||
| def resolver(pattern): | ||
| return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)] | ||
|
|
||
| patterns_per_split = _get_data_files_patterns(resolver) | ||
| assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys()) | ||
| for split, patterns in patterns_per_split.items(): | ||
| matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)] | ||
| assert matched == data_file_per_split[split] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better moving the context manager to the specific position where it is necessary?
| with mock_fs( | |
| [file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths] | |
| ) as fs: | |
| def resolver(pattern): | |
| return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)] | |
| patterns_per_split = _get_data_files_patterns(resolver) | |
| assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys()) | |
| for split, patterns in patterns_per_split.items(): | |
| matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)] | |
| assert matched == data_file_per_split[split] | |
| def resolver(pattern): | |
| with mock_fs( | |
| [file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths] | |
| ) as fs: | |
| return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)] | |
| patterns_per_split = _get_data_files_patterns(resolver) | |
| assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys()) | |
| for split, patterns in patterns_per_split.items(): | |
| matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)] | |
| assert matched == data_file_per_split[split] |
|
Thanks for the feedback @albertvillanova I took your comments into account :)
Let me know what you think ! |
albertvillanova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!! Nice job.
| KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"] | ||
| KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**[{sep}/]{keyword}[{sep}/]**"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Indeed, much clearer this way! Thanks.
|
I ended up removing the patching and the context manager :) merging |
As reported in #4477, the current pattern matching to infer which file goes into which split is too permissive. For example a file "contest.py" would be considered part of a test split (it contains "test") and "seqeval.py" as well (it contains "eval").
In this PR I made the pattern matching more robust by only matching split names between separators. The supported separators are dots, dashes, spaces and underscores.
I updated the docs accordingly.
One detail about the tests: I had to update one test because it was using
PurePath.matchas a reference for globbing, but it doesn't support the[..]glob pattern. Therefore I added amock_fscontext manager that can be used to easily define a dummy filesystem with certain files in it and run pattern matching tests. Its code comes mostly from test_streaming_download_manager.pyClose #4477