[data_files] Only match separated split names #4633

lhoestq · 2022-07-05T14:18:11Z

As reported in #4477, the current pattern matching to infer which file goes into which split is too permissive. For example a file "contest.py" would be considered part of a test split (it contains "test") and "seqeval.py" as well (it contains "eval").

In this PR I made the pattern matching more robust by only matching split names between separators. The supported separators are dots, dashes, spaces and underscores.
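To make the rule above concrete, here is a minimal sketch of the "keyword between separators" idea. It is not the code from this PR: the library uses glob patterns, whereas this sketch approximates the same check with a regular expression. The helper name `has_separated_keyword` is hypothetical.

```python
import re

# Separators listed in the PR description: dots, dashes, spaces and underscores.
SEPARATORS = "-._ "


def has_separated_keyword(path: str, keyword: str) -> bool:
    """Return True if `keyword` appears in `path` delimited by separators.

    Hypothetical helper for illustration only; the library uses glob patterns.
    """
    sep = re.escape(SEPARATORS)
    # The keyword must start the path or follow a separator or "/",
    # and it must be followed by a separator.
    pattern = rf"(?:^|[{sep}/]){re.escape(keyword)}[{sep}]"
    return re.search(pattern, path) is not None


for name in ["test.csv", "my_test_file.csv", "contest.py", "seqeval.py"]:
    print(name, has_separated_keyword(name, "test"), has_separated_keyword(name, "eval"))
```

With this check, "contest.py" and "seqeval.py" are no longer picked up, because "test" and "eval" are embedded in longer words.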

I updated the docs accordingly.

One detail about the tests: I had to update one test because it was using PurePath.match as a reference for globbing, but it doesn't support the [..] glob pattern. Therefore I added a mock_fs context manager that can be used to easily define a dummy filesystem with certain files in it and run pattern matching tests. Its code comes mostly from test_streaming_download_manager.py.

Close #4477

HuggingFaceDocBuilderDev · 2022-07-05T14:24:32Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko

Good job, just one nit.

PS: I think we should also check that this change doesn't affect the existing Hub repos and warn their owners if it does.

src/datasets/data_files.py

lhoestq · 2022-07-08T13:53:56Z

I ran a script to find affected datasets (I only checked non-private, non-gated ones). Adding "testing" and "evaluation" fixes all of them except one:

projecte-aina/cat_manynames: human_annotated_testset.tsv

Let me open a PR on their repository to fix it
EDIT: pr here

mariosasko

All looks good now. Thanks!

lhoestq · 2022-07-08T16:55:59Z

Feel free to merge @albertvillanova if it's all good to you :)

albertvillanova

Awesome!! This will definitely make the user experience much more pleasant and aligned with their expectations.

albertvillanova · 2022-07-15T08:22:58Z

docs/source/repository_structure.mdx

+Files that contain *train* in their names are considered part of the train split, e.g. `train.csv`, `my_train_file.csv`, etc.
+The same idea applies to the test and validation split:
+
+- All the files that contain *test* in their names are considered part of the test split, e.g. `test.csv`, `my_test_file.csv`
+- All the files that contain *validation* in their names are considered part of the validation split, e.g. `validation.csv`, `my_validation_file.csv`


To make the explanation more clear:

You are saying here that if a filename contains "train", it is considered part of the train split. But this is only true if the subword "train" is delimited by some specific non-word characters.

You make this clarification afterwards, but I think it would be better to make it right here: if a user stops reading here, they will misunderstand how this works.

A comment about style (feel free to ignore it):

You first make a sentence with "train" and then a partial unordered list enumeration with just "test" and "validation". I think it might be better:

- either an unordered list with all 3 cases ("train", "validation", and "test"; in this order)
- or no partial unordered list and just 3 sentences.

The text in the 3 cases is quite repetitive: maybe just a single sentence and then 3 itemized examples

Something like (just for inspiration):

All the files that contain a split name in their names (delimited by some specific non-word characters, see below) are considered part of that split:

- train split: train.csv, my_train_file.csv
- validation split: validation.csv, my_validation_file.csv
- test split: test.csv, my_test_file.csv

albertvillanova · 2022-07-15T08:56:54Z

docs/source/repository_structure.mdx

+    └── validation.csv
 ```

+Note that if a file contains *test* but is embedded in another word (e.g. `contest.csv`), it's not counted as a test file.


Maybe worth enumerating which are the non-word characters considered as valid delimiters of the split name subword?

albertvillanova · 2022-07-15T08:57:02Z

docs/source/repository_structure.mdx


+## Split names keywords
+
+Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets.


Maybe just enumerating the equivalent names?

Suggested change

-Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets.
+Validation split is sometimes called "dev", or test split is called "eval".

albertvillanova · 2022-07-15T08:57:08Z

src/datasets/data_files.py

+    str(Split.TRAIN): ["**[-._ /]train[-._ ]*", "train[-._ ]*", "**[-._ /]training[-._ ]*", "training[-._ ]*"],
+    str(Split.TEST): [
+        "**[-._ /]test[-._ ]*",
+        "test[-._ ]*",
+        "**[-._ /]testing[-._ ]*",
+        "testing[-._ ]*",
+        "**[-._ /]eval[-._ ]*",
+        "eval[-._ ]*",
+        "**[-._ /]evaluation[-._ ]*",
+        "evaluation[-._ ]*",
+    ],
+    str(Split.VALIDATION): [
+        "**[-._ /]dev[-._ ]*",
+        "dev[-._ ]*",
+        "**[-._ /]valid[-._ ]*",
+        "valid[-._ ]*",
+        "**[-._ /]validation[-._ ]*",
+        "validation[-._ ]*",
+    ],


These are a combination of pattern and split_name. Maybe better with list comprehensions?

test_split_names = ["test", "testing", "eval", "evaluation"]
...
default_patterns_split_in_filename = ["**[-._ /]{split_name}[-._ ]*", "{split_name}[-._ ]*"]
...
str(Split.TEST): [pattern.format(split_name=split_name) for split_name in test_split_names for pattern in default_patterns_split_in_filename]
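As a runnable illustration of the suggestion above, the sketch below expands the two base patterns for each test split name with a nested list comprehension. The variable names follow the reviewer's snippet (upper-cased here as module constants); this shows the idea only and is not the final code of the PR.

```python
# Split names and base patterns taken from the review suggestion above.
TEST_SPLIT_NAMES = ["test", "testing", "eval", "evaluation"]
DEFAULT_PATTERNS_SPLIT_IN_FILENAME = [
    "**[-._ /]{split_name}[-._ ]*",
    "{split_name}[-._ ]*",
]

# One pattern per (split name, base pattern) pair, grouped by split name.
test_patterns = [
    pattern.format(split_name=split_name)
    for split_name in TEST_SPLIT_NAMES
    for pattern in DEFAULT_PATTERNS_SPLIT_IN_FILENAME
]

for pattern in test_patterns:
    print(pattern)
```

The resulting list has the same eight entries as the hand-written `str(Split.TEST)` list in the diff, so adding a new split name only requires one extra string.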

And also a question: why are numbers not valid delimiters?

"test1.txt", "test2.txt" are not considered as "test" files

albertvillanova · 2022-07-15T08:58:17Z

src/datasets/data_files.py

-    str(Split.TRAIN): ["**train*/**"],
-    str(Split.TEST): ["**test*/**", "**eval*/**"],
-    str(Split.VALIDATION): ["**dev*/**", "**valid*/**"],
+    str(Split.TRAIN): ["train[-._ /]**", "**[-._ /]train[-._ /]**", "training[-._ /]**", "**[-._ /]training[-._ /]**"],
+    str(Split.TEST): [
+        "test[-._ /]**",
+        "**[-._ /]test[-._ /]**",
+        "testing[-._ /]**",
+        "**[-._ /]testing[-._ /]**",
+        "eval[-._ /]**",
+        "**[-._ /]eval[-._ /]**",
+        "evaluation[-._ /]**",
+        "**[-._ /]evaluation[-._ /]**",
+    ],
+    str(Split.VALIDATION): [
+        "dev[-._ /]**",
+        "**[-._ /]dev[-._ /]**",
+        "valid[-._ /]**",
+        "**[-._ /]valid[-._ /]**",
+        "validation[-._ /]**",
+        "**[-._ /]validation[-._ /]**",
+    ],


The same as above about list comprehensions.

albertvillanova · 2022-07-15T09:05:14Z

tests/test_data_files.py

+        @classmethod
+        def get_test_paths(cls, start_with=""):
+            """Helper to return directory and file paths with no details"""
+            all = [file["name"] for file in cls._fs_contents if file["name"].startswith(start_with)]
+            return all


This method can be removed.

I just copied-pasted it but we do not use it.

albertvillanova · 2022-07-15T09:08:25Z

tests/test_data_files.py

+    with patch.dict(fsspec.registry.target, {"mock": DummyTestFS}):
+        yield DummyTestFS()


Just a comment (feel free to ignore it): you use here unittest.mock.patch, but you could use pytest.monkeypatch instead.

This function is a context manager that doesn't have access to the monkeypatch fixture of pytest, so I used unittest.mock.patch instead.

@lhoestq again not important: but indeed you are not using the patching. You are just using the returned instance DummyTestFS().

So I guess you could just remove the patching (unittest.mock.patch) and the test will pass anyway.

Suggested change

-    with patch.dict(fsspec.registry.target, {"mock": DummyTestFS}):
-        yield DummyTestFS()
+    yield DummyTestFS()

I ended up removing the patching and the context manager :) merging

It makes sense if it is not indeed necessary.

albertvillanova · 2022-07-15T09:08:54Z

tests/test_data_files.py


+@contextmanager
+def mock_fs(file_paths: List[str]):
+    """context manager to set up a mock:// filesystem in sfspec containing the provided files"""


Typo

Suggested change

-    """context manager to set up a mock:// filesystem in sfspec containing the provided files"""
+    """context manager to set up a mock:// filesystem in fsspec containing the provided files"""

albertvillanova · 2022-07-15T09:10:19Z

tests/test_data_files.py

+        {"train": "developers_list.txt"},
+        {"train": "data/seqeval_results.txt"},


Maybe also adding a test for "test": "contest.txt"?

albertvillanova · 2022-07-15T09:16:18Z

tests/test_data_files.py

+    with mock_fs(
+        [file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths]
+    ) as fs:
+
+        def resolver(pattern):
+            return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)]
+
+        patterns_per_split = _get_data_files_patterns(resolver)
+        assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys())
+        for split, patterns in patterns_per_split.items():
+            matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)]
+            assert matched == data_file_per_split[split]


Maybe better moving the context manager to the specific position where it is necessary?

Suggested change

-    with mock_fs(
-        [file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths]
-    ) as fs:
-
-        def resolver(pattern):
-            return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)]
-
-        patterns_per_split = _get_data_files_patterns(resolver)
-        assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys())
-        for split, patterns in patterns_per_split.items():
-            matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)]
-            assert matched == data_file_per_split[split]
+    def resolver(pattern):
+        with mock_fs(
+            [file_path for split_file_paths in data_file_per_split.values() for file_path in split_file_paths]
+        ) as fs:
+            return [PurePath(file_path) for file_path in fs.glob(pattern) if fs.isfile(file_path)]
+
+    patterns_per_split = _get_data_files_patterns(resolver)
+    assert sorted(patterns_per_split.keys()) == sorted(data_file_per_split.keys())
+    for split, patterns in patterns_per_split.items():
+        matched = [file_path.as_posix() for pattern in patterns for file_path in resolver(pattern)]
+        assert matched == data_file_per_split[split]

lhoestq · 2022-07-18T10:29:37Z

Thanks for the feedback @albertvillanova! I took your comments into account :)

- added numbers as supported delimiters
- used list comprehension to create the patterns list
- updated the docs and the tests according to your comments

Let me know what you think!

albertvillanova

Thank you!! Nice job.

albertvillanova · 2022-07-18T10:43:15Z

src/datasets/data_files.py

+KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]
+KEYWORDS_IN_DIR_NAME_BASE_PATTERNS = ["{keyword}[{sep}/]**", "**[{sep}/]{keyword}[{sep}/]**"]


Great! Indeed, much clearer this way! Thanks.
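For readers who want to see how the two-placeholder base patterns expand, here is a small sketch. The base patterns are copied from the diff above and the keyword lists from the earlier diffs in this thread; the separator string `"-._ 0-9"` (with digits added after the review), the plain-string split keys, and the helper names are assumptions made for illustration. `fnmatchcase` is only a stand-in for the globbing the library performs, and is applied here only to the patterns without `**`.

```python
from fnmatch import fnmatchcase

# Assumed separator set: the original separators plus digits (added after review).
SEP = "-._ 0-9"

# Base patterns as shown in the diff above.
KEYWORDS_IN_FILENAME_BASE_PATTERNS = ["**[{sep}/]{keyword}[{sep}]*", "{keyword}[{sep}]*"]

# Keywords per split, as listed in the earlier diffs (plain strings instead of Split members).
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "test": ["test", "testing", "eval", "evaluation"],
    "validation": ["dev", "valid", "validation"],
}


def build_patterns(base_patterns, sep=SEP):
    """Expand every base pattern for every keyword of every split (hypothetical helper)."""
    return {
        split: [p.format(keyword=kw, sep=sep) for kw in keywords for p in base_patterns]
        for split, keywords in SPLIT_KEYWORDS.items()
    }


filename_patterns = build_patterns(KEYWORDS_IN_FILENAME_BASE_PATTERNS)


def name_starts_with_split_keyword(filename, split):
    """Check a bare file name against the patterns that do not use '**'."""
    return any(
        fnmatchcase(filename, p)
        for p in filename_patterns[split]
        if not p.startswith("**")
    )


print(filename_patterns["train"])
print(name_starts_with_split_keyword("test1.txt", "test"))
print(name_starts_with_split_keyword("contest.txt", "test"))
```

Under these assumptions, "test1.txt" is matched because a digit now counts as a delimiter, while "contest.txt" is still rejected.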

lhoestq · 2022-07-18T13:07:25Z

I ended up removing the patching and the context manager :) merging

lhoestq added 3 commits July 5, 2022 15:18

only match separated split names

d963a0e

docs

159649a

add space separator

7893856

fix win

31bf1bb

lhoestq marked this pull request as ready for review July 6, 2022 12:30

lhoestq requested review from albertvillanova and mariosasko July 6, 2022 12:31

mariosasko reviewed Jul 8, 2022

src/datasets/data_files.py (outdated)

lhoestq added 3 commits July 8, 2022 15:10

Merge branch 'main' into only-match-separated-split-names

3681589

add testing

93f4526

add evaluation

e6adffb

mariosasko approved these changes Jul 8, 2022

albertvillanova reviewed Jul 15, 2022

lhoestq added 4 commits July 18, 2022 11:49

suggestions in doc

9393386

use list comprehension + support numbers

8efbb78

update tests

ee15ede

Merge branch 'main' into only-match-separated-split-names

61a18f8

albertvillanova approved these changes Jul 18, 2022

lhoestq added 2 commits July 18, 2022 14:25

remove unnecessary patching and context manager

d372b28

style

3e97fd5

lhoestq merged commit cd674a3 into main Jul 18, 2022

lhoestq deleted the only-match-separated-split-names branch July 18, 2022 13:07

