docs/source/repository_structure.mdx
33 additions & 5 deletions
@@ -22,10 +22,13 @@ my_dataset_repository/
 
 ## Splits and file names
 
-🤗 Datasets automatically infer a dataset's train, validation, and test splits from the file names. Files that contain *train* in their names are considered part of the train split. The same idea applies to the test and validation split:
+🤗 Datasets automatically infer a dataset's train, validation, and test splits from the file names.
 
-- All the files that contain *test* in their names are considered part of the test split.
-- All the files that contain *valid* in their names are considered part of the validation split.
+Files that contain *train* in their names are considered part of the train split, e.g. `train.csv`, `my_train_file.csv`, etc.
+The same idea applies to the test and validation split:
+
+- All the files that contain *test* in their names are considered part of the test split, e.g. `test.csv`, `my_test_file.csv`
+- All the files that contain *validation* in their names are considered part of the validation split, e.g. `validation.csv`, `my_validation_file.csv`
 
 Here is an example where all the files are placed into a directory named `data`:
 
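The naming rule in the hunk above can be illustrated with a minimal sketch. It is not the 🤗 Datasets implementation: the helper name `contains_keyword` is hypothetical, and the regex assumes that "embedded in another word" (see the `contest.csv` note added later in this diff) means the keyword has an ASCII letter directly before or after it.

```python
import re


def contains_keyword(file_name: str, keyword: str) -> bool:
    """Return True if `keyword` occurs in `file_name` with no letter on either side.

    Assumption: a keyword is "embedded in another word" when an ASCII letter
    directly precedes or follows it. Digits, `_`, `-` and `.` act as separators.
    """
    pattern = rf"(?<![a-zA-Z]){re.escape(keyword)}(?![a-zA-Z])"
    return re.search(pattern, file_name) is not None


for name in ["train.csv", "my_test_file.csv", "validation.csv", "contest.csv"]:
    # Collect every split keyword that this file name matches.
    matched = [k for k in ("train", "test", "validation") if contains_keyword(name, k)]
    print(name, matched)
```

Under this assumption, `my_test_file.csv` matches `test` because underscores surround the keyword, while `contest.csv` does not because the letter `n` precedes it.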
@@ -35,9 +38,11 @@ my_dataset_repository/
 └── data/
     ├── train.csv
     ├── test.csv
-    └── valid.csv
+    └── validation.csv
 ```
 
+Note that if a file contains *test* but is embedded in another word (e.g. `contest.csv`), it's not counted as a test file.
+
 ## Multiple files per split
 
 If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, and test split from the file name.
@@ -58,7 +63,8 @@ Make sure all the files of your `train` set have *train* in their names (same fo
 Even if you add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example),
 🤗 Datasets can still infer the appropriate split.
 
-For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
+For convenience, you can also place your data files into different directories.
+In this case, the split name is inferred from the directory name.
 
 ```
 my_dataset_repository/
@@ -80,6 +86,28 @@ Eventually, you'll also be able to structure your repository to specify differen
 
 </Tip>
 
+## Split names keywords
+
+Train/validation/test splits are sometimes called train/dev/test, or sometimes train & eval sets.
+These other names are also supported.
+In particular, these keywords are equivalent:
+
+- train, training
+- validation, valid, dev
+- test, eval
+
+Therefore this is also a valid repository:
+
+```
+my_dataset_repository/
+├── README.md
+└── data/
+    ├── training.csv
+    ├── eval.csv
+    └── valid.csv
+```
+
+
 ## Custom split names
 
 If you have other data files in addition to the traditional train, validation, and test sets, you must use a different structure.
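The keyword equivalences added under "Split names keywords" in this hunk can be sketched as a small lookup that maps each file name to a canonical split. This is an illustration only: `SPLIT_KEYWORDS`, `infer_split`, and `group_by_split` are hypothetical names, and the letter-boundary regex is an assumption about how "contains a keyword" is decided, not the library's actual matching code.

```python
import re
from collections import defaultdict

# Keyword groups copied from the list in the diff; the dict itself is hypothetical.
SPLIT_KEYWORDS = {
    "train": ["train", "training"],
    "validation": ["validation", "valid", "dev"],
    "test": ["test", "eval"],
}


def infer_split(file_name):
    """Return the canonical split for `file_name`, or None if no keyword matches."""
    for split, keywords in SPLIT_KEYWORDS.items():
        for keyword in keywords:
            # Assumption: the keyword must not have a letter directly before or after it.
            pattern = rf"(?<![a-zA-Z]){re.escape(keyword)}(?![a-zA-Z])"
            if re.search(pattern, file_name):
                return split
    return None


def group_by_split(file_names):
    """Group file names by inferred split, skipping files that match no keyword."""
    groups = defaultdict(list)
    for name in file_names:
        split = infer_split(name)
        if split is not None:
            groups[split].append(name)
    return dict(groups)


# The repository layout from the example in the diff.
print(group_by_split(["README.md", "training.csv", "eval.csv", "valid.csv"]))
```

In this sketch, files that match none of the keywords (such as `README.md`) are left out of the result, which is the situation the "Custom split names" section addresses with a different structure.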