Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Jul 5, 2022

As reported in #4477, the current pattern matching to infer which file goes into which split is too permissive. For example a file "contest.py" would be considered part of a test split (it contains "test") and "seqeval.py" as well (it contains "eval").

In this PR I made the pattern matching more robust by only matching split names between separators. The supported separators are dots, dashes, spaces and underscores.

I updated the docs accordingly.

One detail about the tests: I had to update one test because it was using PurePath.match as a reference for globbing, but it doesn't support the [..] glob pattern. Therefore I added a mock_fs context manager that can be used to easily define a dummy filesystem with certain files in it and run pattern matching tests. Its code comes mostly from test_streaming_download_manager.py

Close #4477

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jul 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq lhoestq marked this pull request as ready for review July 6, 2022 12:30
Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, just one nit.

PS: I think we should also check that this change doesn't affect the existing Hub repos and warn their owners if it does.

@lhoestq
Copy link
Member Author

lhoestq commented Jul 8, 2022

I ran a script to find affected datasets (just did it on non-private non-gated). Adding "testing" and "evaluation" fixes all of of them except one:

  • projecte-aina/cat_manynames: human_annotated_testset.tsv

Let me open a PR on their repository to fix it
EDIT: pr here

Copy link
Collaborator

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good now. Thanks!

@lhoestq
Copy link
Member Author

lhoestq commented Jul 8, 2022

Feel free to merge @albertvillanova if it's all good to you :)

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!! This will definitely make the user experience much more pleasant and aligned with their expectations.

Comment on lines 27 to 31
Files that contain *train* in their names are considered part of the train split, e.g. `train.csv`, `my_train_file.csv`, etc.
The same idea applies to the test and validation split:

- All the files that contain *test* in their names are considered part of the test split, e.g. `test.csv`, `my_test_file.csv`
- All the files that contain *validation* in their names are considered part of the validation split, e.g. `validation.csv`, `my_validation_file.csv`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the explanation more clear:

  • You are saying here that if a filename contains "train", it is considered part of the train split: But this is only true if the subword "train" is delimited by some specific non-word characters.
    • You do this clarification afterwards, but I think it would be better to make it right here: if a user stops reading here, they will misunderstand how this works

A comment about style (feel free to ignore it):

  • You first make a sentence with "train" and then a partial unordered list enumeration with just "test" and "validation". I think it might be better:
    • either an unordered list with all 3 cases ("train", "validation", and "test"; in this order)
    • or no partial unordered list and just 3 sentences.
  • The text in the 3 cases is quite repetitive: maybe just a single sentence and then 3 itemized examples

Something like (just for inspiration):

All the files that contain a split name in their names (delimited by some specific non-word characters, see below) are considered part of that split:

  • train split: train.csv, my_train_file.csv
  • validation split: validation.csv, my_validation_file.csv
  • test split: test.csv, my_test_file.csv

└── validation.csv
```

Note that if a file contains *test* but is embedded in another word (e.g. `contest.csv`), it's not counted as a test file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth enumerating which are the non-word characters considered as valid delimiters of the split name subword?


## Split names keywords

Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just enumerating the equivalent names?

Suggested change
Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets.
Validation split is sometimes called "dev", or test split is called "eval".

Comment on lines 30 to 48
str(Split.TRAIN): ["**[-._ /]train[-._ ]*", "train[-._ ]*", "**[-._ /]training[-._ ]*", "training[-._ ]*"],
str(Split.TEST): [
"**[-._ /]test[-._ ]*",
"test[-._ ]*",
"**[-._ /]testing[-._ ]*",
"testing[-._ ]*",
"**[-._ /]eval[-._ ]*",
"eval[-._ ]*",
"**[-._ /]evaluation[-._ ]*",
"evaluation[-._ ]*",
],
str(Split.VALIDATION): [
"**[-._ /]dev[-._ ]*",
"dev[-._ ]*",
"**[-._ /]valid[-._ ]*",
"valid[-._ ]*",
"**[-._ /]validation[-._ ]*",
"validation[-._ ]*",
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are a combination of pattern and split_name. Maybe better with list comprehensions?

test_split_names = ["test", "testing", "eval", "evaluation"]
...
default_patterns_split_in_filename = ["**[-._ /]{split_name}[-._ ]*", "{split_name}[-._ ]*"]
...
str(Split.TEST): [pattern.format(split_name=split_name) for split_name in test_split_names for pattern in default_patterns_split_in_filename]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also a question: why numbers are not valid delimiters?

  • "test1.txt", "test2.txt" are not considered as "test" files

Comment on lines 36 to 70
str(Split.TRAIN): ["**train*/**"],
str(Split.TEST): ["**test*/**", "**eval*/**"],
str(Split.VALIDATION): ["**dev*/**", "**valid*/**"],
str(Split.TRAIN): ["train[-._ /]**", "**[-._ /]train[-._ /]**", "training[-._ /]**", "**[-._ /]training[-._ /]**"],
str(Split.TEST): [
"test[-._ /]**",
"**[-._ /]test[-._ /]**",
"testing[-._ /]**",
"**[-._ /]testing[-._ /]**",
"eval[-._ /]**",
"**[-._ /]eval[-._ /]**",
"evaluation[-._ /]**",
"**[-._ /]evaluation[-._ /]**",
],
str(Split.VALIDATION): [
"dev[-._ /]**",
"**[-._ /]dev[-._ /]**",
"valid[-._ /]**",
"**[-._ /]valid[-._ /]**",
"validation[-._ /]**",
"**[-._ /]validation[-._ /]**",
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above about list comprehensions.

Comment on lines 530 to 534
@classmethod
def get_test_paths(cls, start_with=""):
"""Helper to return directory and file paths with no details"""
all = [file["name"] for file in cls._fs_contents if file["name"].startswith(start_with)]
return all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be removed.

I just copied-pasted it but we do not use it.

Comment on lines 555 to 556
with patch.dict(fsspec.registry.target, {"mock": DummyTestFS}):
yield DummyTestFS()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a comment (feel free to ignore it): you use here unittest.mock.patch, but you could use pytest.monkeypatch instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a context manager that doesn't have access to the monkeypatch fixture of pytest, so I used unittest.mock.patch instead.

Copy link
Member

@albertvillanova albertvillanova Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhoestq again not important: but indeed you are not using the patching. You are just using the returned instance DummyTestFS().

So I guess you could just remove the patching (unittest.mock.patch) and the test will pass anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with patch.dict(fsspec.registry.target, {"mock": DummyTestFS}):
yield DummyTestFS()
yield DummyTestFS()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing the patching and the context manager :) merging

It makes sense if it is not indeed necessary.


@contextmanager
def mock_fs(file_paths: List[str]):
"""context manager to set up a mock:// filesystem in sfspec containing the provided files"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

Suggested change
"""context manager to set up a mock:// filesystem in sfspec containing the provided files"""
"""context manager to set up a mock:// filesystem in fsspec containing the provided files"""

Comment on lines +610 to +611
{"train": "developers_list.txt"},
{"train": "data/seqeval_results.txt"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also adding a test for "test": "contest.txt"?

Comment on lines 621 to 632
with mock_fs(
[file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths]
) as fs:

def resolver(pattern):
return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)]

patterns_per_split = _get_data_files_patterns(resolver)
assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys())
for split, patterns in patterns_per_split.items():
matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)]
assert matched == data_file_per_split[split]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better moving the context manager to the specific position where it is necessary?

Suggested change
with mock_fs(
[file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths]
) as fs:
def resolver(pattern):
return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)]
patterns_per_split = _get_data_files_patterns(resolver)
assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys())
for split, patterns in patterns_per_split.items():
matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)]
assert matched == data_file_per_split[split]
def resolver(pattern):
with mock_fs(
[file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths]
) as fs:
return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)]
patterns_per_split = _get_data_files_patterns(resolver)
assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys())
for split, patterns in patterns_per_split.items():
matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)]
assert matched == data_file_per_split[split]

@lhoestq
Copy link
Member Author

lhoestq commented Jul 18, 2022

Thanks for the feedback @albertvillanova I took your comments into account :)

  • added numbers as supported delimiters
  • used list comprehension to create the patterns list
  • updated the docs and the tests according to your comments

Let me know what you think !

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!! Nice job.

Comment on lines +33 to +34
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]
KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**[{sep}/]{keyword}[{sep}/]**"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Indeed, much clearer this way! Thanks.

@lhoestq
Copy link
Member Author

lhoestq commented Jul 18, 2022

I ended up removing the patching and the context manager :) merging

@lhoestq lhoestq merged commit cd674a3 into main Jul 18, 2022
@lhoestq lhoestq deleted the only-match-separated-split-names branch July 18, 2022 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dataset Viewer issue for fgrezes/WIESP2022-NER

5 participants