Commit 159649a

docs
1 parent d963a0e commit 159649a

1 file changed, +33 -5 lines changed

docs/source/repository_structure.mdx

Lines changed: 33 additions & 5 deletions
@@ -22,10 +22,13 @@ my_dataset_repository/
 
 ## Splits and file names
 
-🤗 Datasets automatically infer a dataset's train, validation, and test splits from the file names. Files that contain *train* in their names are considered part of the train split. The same idea applies to the test and validation split:
+🤗 Datasets automatically infer a dataset's train, validation, and test splits from the file names.
 
-- All the files that contain *test* in their names are considered part of the test split.
-- All the files that contain *valid* in their names are considered part of the validation split.
+Files that contain *train* in their names are considered part of the train split, e.g. `train.csv`, `my_train_file.csv`, etc.
+The same idea applies to the test and validation split:
+
+- All the files that contain *test* in their names are considered part of the test split, e.g. `test.csv`, `my_test_file.csv`
+- All the files that contain *validation* in their names are considered part of the validation split, e.g. `validation.csv`, `my_validation_file.csv`
 
 Here is an example where all the files are placed into a directory named `data`:
 
@@ -35,9 +38,11 @@ my_dataset_repository/
 └── data/
     ├── train.csv
     ├── test.csv
-    └── valid.csv
+    └── validation.csv
 ```
 
+Note that if a file contains *test* but is embedded in another word (e.g. `contest.csv`), it's not counted as a test file.
+
 ## Multiple files per split
 
 If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, and test split from the file name.
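
The file-name rule described in this hunk, including the `contest.csv` exception, can be illustrated with a small sketch. It is not the library's implementation: the helper name `infer_split` and the exact boundary rule (a keyword only counts when it is not directly adjacent to another letter) are assumptions made for illustration, and the patterns 🤗 Datasets actually uses may differ.

```python
import re
from typing import Optional

# Keywords taken from the documentation text in this commit.
SPLIT_KEYWORDS = ("train", "validation", "test")


def infer_split(file_name: str) -> Optional[str]:
    """Guess the split from a file name (hypothetical helper, not the library API).

    Assumption: a keyword matches only if it is not glued to other letters,
    so `my_test_file.csv` matches *test* but `contest.csv` does not.
    """
    for keyword in SPLIT_KEYWORDS:
        # Negative lookbehind/lookahead reject letters on either side of the keyword.
        pattern = rf"(?<![a-z]){keyword}(?![a-z])"
        if re.search(pattern, file_name, flags=re.IGNORECASE):
            return keyword
    return None


if __name__ == "__main__":
    for name in ["train.csv", "my_train_file_00001.csv", "my_test_file.csv",
                 "validation.csv", "contest.csv"]:
        print(name, "->", infer_split(name))
```

Under this assumed rule, underscores, dots, and digits act as separators, which is why a prefixed or suffixed name such as `my_train_file_00001.csv` still resolves to the train split.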
@@ -58,7 +63,8 @@ Make sure all the files of your `train` set have *train* in their names (same fo
 Even if you add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example),
 🤗 Datasets can still infer the appropriate split.
 
-For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
+For convenience, you can also place your data files into different directories.
+In this case, the split name is inferred from the directory name.
 
 ```
 my_dataset_repository/
@@ -80,6 +86,28 @@ Eventually, you'll also be able to structure your repository to specify differen
 
 </Tip>
 
+## Split names keywords
+
+Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets.
+These other names are also supported.
+In particular, these keywords are equivalent:
+
+- train, training
+- validation, valid, dev
+- test, eval
+
+Therefore this is also a valid repository:
+
+```
+my_dataset_repository/
+├── README.md
+└── data/
+    ├── training.csv
+    ├── eval.csv
+    └── valid.csv
+```
+
+
 ## Custom split names
 
 If you have other data files in addition to the traditional train, validation, and test sets, you must use a different structure.
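
The keyword equivalences added under "Split names keywords" amount to a lookup from alias to canonical split name. The sketch below is only an illustration of that idea: `SPLIT_ALIASES`, `canonical_split`, and `group_by_split` are hypothetical names, and splitting the path on non-letter characters is an assumed tokenization, not the library's actual matching logic.

```python
import re
from typing import Dict, List, Optional

# Alias table copied from the keyword list in this commit.
SPLIT_ALIASES: Dict[str, str] = {
    "train": "train", "training": "train",
    "validation": "validation", "valid": "validation", "dev": "validation",
    "test": "test", "eval": "test",
}


def canonical_split(file_name: str) -> Optional[str]:
    """Map a file path to a canonical split name (hypothetical helper)."""
    stem = file_name.rsplit(".", 1)[0]  # drop the extension
    # Assumption: tokens are runs of letters; anything else is a separator.
    for token in re.split(r"[^a-zA-Z]+", stem):
        split = SPLIT_ALIASES.get(token.lower())
        if split is not None:
            return split
    return None


def group_by_split(file_names: List[str]) -> Dict[str, List[str]]:
    """Group files by inferred split, skipping files with no split keyword."""
    groups: Dict[str, List[str]] = {}
    for name in file_names:
        split = canonical_split(name)
        if split is not None:
            groups.setdefault(split, []).append(name)
    return groups


if __name__ == "__main__":
    repo_files = ["README.md", "data/training.csv", "data/eval.csv", "data/valid.csv"]
    print(group_by_split(repo_files))
```

Applied to the example repository from the hunk above, `training.csv`, `eval.csv`, and `valid.csv` land in the train, test, and validation groups respectively, while `README.md` is left out.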
