`scripts/datasets/pretrain_corpus/README.md`

We provide a series of shared scripts for downloading/preparing the text corpus for pretraining NLP models.
This helps create a unified text corpus for studying the performance of different pretraining algorithms.
When releasing the datasets, we follow the [FAIR principles](https://www.go-fair.org/fair-principles/),
i.e., the dataset needs to be findable, accessible, interoperable, and reusable.
## BookCorpus
Unfortunately, we are unable to provide the original [Toronto BookCorpus dataset](https://yknzhu.wixsite.com/mbweb) due to licensing issues.
There are some open-source efforts for reproducing the dataset, e.g.,
using [soskek/bookcorpus](https://github.com/soskek/bookcorpus) or directly downloading the [preprocessed version](https://drive.google.com/file/d/16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z/view).
Nevertheless, we use [Project Gutenberg](https://www.gutenberg.org/) as an alternative to the Toronto BookCorpus.

You can use the following command to download and prepare the Gutenberg dataset.
```bash
python prepare_bookcorpus.py --dataset gutenberg
```
Also, you should follow the [license](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License) when using the data.
## Wikipedia
## OpenWebText

After downloading and extracting OpenWebText (i.e., `tar xf openwebtext.tar.xz`), you can use the following command to preprocess the dataset.
`scripts/pretraining/README.md`

Following the instructions in [Prepare OpenWebTextCorpus](../datasets/pretrain_corpus#openwebtext), download and prepare the dataset, obtaining a total of 20610 text files in the folder `prepared_owt`.
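To sanity-check the preparation step, you can count the files in `prepared_owt`; a minimal sketch (assuming the prepared files sit flat in that directory, which is not spelled out in this excerpt):

```shell
# Count the files in the prepared folder; a complete run should report 20610.
count=$(find prepared_owt -maxdepth 1 -type f 2>/dev/null | wc -l)
echo "found ${count} files in prepared_owt"
```

If the count differs, re-check that the download, extraction, and preparation steps all completed without errors.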