docs

lhoestq · lhoestq · commit 159649a9b22d · 2022-07-05T16:08:47.000+02:00
diff --git a/docs/source/repository_structure.mdx b/docs/source/repository_structure.mdx
@@ -22,10 +22,13 @@ my_dataset_repository/
 
 ## Splits and file names
 
-🤗 Datasets automatically infer a dataset's train, validation, and test splits from the file names. Files that contain *train* in their names are considered part of the train split. The same idea applies to the test and validation split:
+🤗 Datasets automatically infers a dataset's train, validation, and test splits from the file names.
 
-- All the files that contain *test* in their names are considered part of the test split.
-- All the files that contain *valid* in their names are considered part of the validation split.
+Files that contain *train* in their names are considered part of the train split, e.g. `train.csv`, `my_train_file.csv`, etc.
+The same idea applies to the test and validation splits:
+
+- All the files that contain *test* in their names are considered part of the test split, e.g. `test.csv`, `my_test_file.csv`
+- All the files that contain *validation* in their names are considered part of the validation split, e.g. `validation.csv`, `my_validation_file.csv`
 
 Here is an example where all the files are placed into a directory named `data`:
 
@@ -35,9 +38,11 @@ my_dataset_repository/
 └── data/
     ├── train.csv
     ├── test.csv
-    └── valid.csv
+    └── validation.csv
 ```
 
+Note that if a file name contains *test* only as part of another word (e.g. `contest.csv`), the file is not counted as a test file.
+
 ## Multiple files per split
 
 If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, or test split from the file name.
@@ -58,7 +63,8 @@ Make sure all the files of your `train` set have *train* in their names (same fo
 Even if you add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example),
 🤗 Datasets can still infer the appropriate split.
 
-For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
+For convenience, you can also place your data files into different directories.
+In this case, the split name is inferred from the directory name.
 
 ```
 my_dataset_repository/
@@ -80,6 +86,28 @@ Eventually, you'll also be able to structure your repository to specify differen
 
 </Tip>
 
+## Split names keywords
+
+Train/validation/test splits are sometimes named train/dev/test, or simply train and eval.
+These other names are also supported.
+In particular, these keywords are equivalent:
+
+- train, training
+- validation, valid, dev
+- test, eval
+
+Therefore this is also a valid repository:
+
+```
+my_dataset_repository/
+├── README.md
+└── data/
+    ├── training.csv
+    ├── eval.csv
+    └── valid.csv
+```
+
+
 ## Custom split names
 
 If you have other data files in addition to the traditional train, validation, and test sets, you must use a different structure.
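Reviewer note: the naming rules documented in this change can be illustrated with a small Python sketch. It is not the 🤗 Datasets implementation; `SPLIT_KEYWORDS` and `infer_split` are hypothetical names introduced here, and treating every non-alphanumeric character (including `_`) as a word delimiter is an assumption. The sketch only covers file names, not the directory-based inference described above.

```python
import re
from typing import Optional

# Keyword groups copied from the list added in this diff.
# The dictionary itself is a hypothetical helper, not a library constant.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "dev"],
    "test": ["test", "eval"],
}


def infer_split(file_name: str) -> Optional[str]:
    """Guess a split name from a file name, or return None if no keyword matches."""
    # Drop the extension and normalise case.
    stem = file_name.rsplit(".", 1)[0].lower()
    # Assumption: keywords must appear as whole tokens, where tokens are
    # separated by any non-alphanumeric character (e.g. "_", "-", "/").
    tokens = [tok for tok in re.split(r"[^a-z0-9]+", stem) if tok]
    for split, keywords in SPLIT_KEYWORDS.items():
        if any(tok in keywords for tok in tokens):
            return split
    return None


print(infer_split("my_train_file_00001.csv"))
print(infer_split("contest.csv"))
```

Matching whole tokens rather than substrings is what keeps a name such as `contest.csv` out of the test split while still accepting prefixed or suffixed names such as `my_test_file.csv`.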