
Commit 9773efd: fix openwebtext

1 parent 1fb8eb8 · commit 9773efd
5 files changed: 19 additions & 21 deletions

scripts/datasets/pretrain_corpus/README.md
Lines changed: 7 additions & 7 deletions

````diff
@@ -2,24 +2,24 @@
 
 We provide a series of shared scripts for downloading/preparing the text corpus for pretraining NLP models.
 This helps create a unified text corpus for studying the performance of different pretraining algorithms.
-When releasing the datasets, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
-i.e., the dataset needs to be findable, accessible, interoperable, and reusable.
+When releasing the datasets, we follow the [FAIR principle](https://www.go-fair.org/fair-principles/),
+i.e., the dataset needs to be findable, accessible, interoperable, and reusable.
 
 ## BookCorpus
 Unfortunately, we are unable to provide the original [Toronto BookCorpus dataset](https://yknzhu.wixsite.com/mbweb) due to licensing issues.
 
 There are some open source efforts for reproducing the dataset, e.g.,
-using [soskek/bookcorpus](https://github.com/soskek/bookcorpus) or directly downloading the [preprocessed version](https://drive.google.com/file/d/16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z/view).
-
+using [soskek/bookcorpus](https://github.com/soskek/bookcorpus) or directly downloading the [preprocessed version](https://drive.google.com/file/d/16KCjV9z_FHm8LgZw05RSuk4EsAWPOP_z/view).
+
 Nevertheless, we utilize the [Project Gutenberg](https://www.gutenberg.org/) as an alternative to Toronto BookCorpus.
 
-You can use the following command to download and prepare the Gutenberg dataset.
+You can use the following command to download and prepare the Gutenberg dataset.
 
 ```bash
 python prepare_bookcorpus.py --dataset gutenberg
 ```
 
-Also, you should follow the [license](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License) for using the data.
+Also, you should follow the [license](https://www.gutenberg.org/wiki/Gutenberg:The_Project_Gutenberg_License) for using the data.
 
 ## Wikipedia
 
@@ -43,7 +43,7 @@ You can download the OpenWebText from [link](https://skylion007.github.io/OpenWe
 After downloading and extracting the OpenWebText (i.e., `tar xf openwebtext.tar.xz`), you can use the following command to preprocess the dataset.
 
 ```bash
-python prepare_openwebtext.py --input openwebtext/ --output prepared_owt
+python prepare_openwebtext.py --input openwebtext/ --output prepared_owt --shuffle
 ```
 
 In this step, the archived txt are directly read without decompressing.
````

(Here and elsewhere in this commit, paired `-`/`+` lines that look identical appear to differ only in trailing whitespace, which is invisible here.)
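For context on the new `--shuffle` flag: judging from the `extract_files(full_name, output_dir, shuffle=False)` signature visible in the hunk headers of the next file, the flag plausibly randomizes the order of the text members read from each `.xz` archive before they are written out. A minimal sketch of that assumed behavior (the archive name is hypothetical):

```python
import random
import tarfile

def iter_member_names(archive_path, shuffle=False):
    """Yield the member names of a tar archive, optionally in random order."""
    with tarfile.open(archive_path) as t:  # tarfile reads .xz transparently
        names = t.getnames()
        if shuffle:
            # assumed effect of --shuffle: randomize document order
            random.shuffle(names)
        yield from names

# Hypothetical member archive; the real ones live under openwebtext/
for name in iter_member_names('openwebtext/urlsf_subset00-1_data.xz', shuffle=True):
    print(name)
```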

scripts/datasets/pretrain_corpus/prepare_openwebtext.py
Lines changed: 2 additions & 2 deletions

```diff
@@ -51,7 +51,7 @@ def extract_files(full_name, output_dir, shuffle=False):
     """
     if not full_name.endswith(".xz"):
         return
-    file_prefix = re.split('\.|/',full_name)[1]
+    file_prefix = re.split('\.|/', full_name)[-2]
     with open("{}.txt".format(os.path.join(output_dir, file_prefix)),"w") as fp:
         with tarfile.open(full_name) as t:
             txt_names = t.getnames()
@@ -65,7 +65,7 @@ def extract_files(full_name, output_dir, shuffle=False):
                     if line:
                         fp.write(line.decode()+'\n')
                # Two extra line break to mark the document separation
                fp.write('\n\n')
+                fp.write('\n')
 
 
 @DATA_MAIN_REGISTRY.register('prepare_openwebtext')
```
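Why `[-2]` rather than `[1]`: `re.split('\.|/', full_name)` splits the path on both dots and slashes, so a fixed index from the front only names the archive when the input path has exactly one directory component, while indexing from the end is robust to `./` prefixes, nested directories, and absolute paths. A quick illustration (the file names are hypothetical):

```python
import re

paths = [
    'openwebtext/urlsf_subset00-1_data.xz',        # one directory component
    './openwebtext/urlsf_subset00-1_data.xz',      # './' prefix
    '/data/openwebtext/urlsf_subset00-1_data.xz',  # absolute path
]
for p in paths:
    parts = re.split(r'\.|/', p)
    print(repr(parts[1]), repr(parts[-2]))
# 'urlsf_subset00-1_data' 'urlsf_subset00-1_data'
# '' 'urlsf_subset00-1_data'       <- old index breaks
# 'data' 'urlsf_subset00-1_data'   <- old index breaks
```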

scripts/pretraining/README.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -3,7 +3,7 @@
 Following the instruction of [Prepare OpenWebTextCorpus](../datasets/pretrain_corpus#openwebtext), download and prepare the dataset, obtaining a total of 20610 text files in the folder `prepared_owt`.
 
 ```bash
-python preprocesse_owt.py --input prepared_owt --output preprocessed_owt --shuffle
+python preprocesse_owt.py --input prepared_owt --output preprocessed_owt --max_seq_length 128
 ```
 The above command allows us to generate the preprocessed Numpy features saved in `.npz`.
 # Pretrain Model
````
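The two README edits in this commit are consistent with each other: shuffling now happens once, in the earlier `prepare_openwebtext.py` step, so the `--shuffle` flag is dropped here, and the command instead spells out `--max_seq_length 128`, which matches the argparse default shown in the next file's diff.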

scripts/pretraining/preprocesse_owt.py
Lines changed: 7 additions & 6 deletions

```diff
@@ -8,8 +8,11 @@
 import multiprocessing
 
 from pretraining_utils import get_all_features
+from gluonnlp.base import get_repo_model_zoo_url
+from gluonnlp.utils.misc import download
 from gluonnlp.data.tokenizers import HuggingFaceWordPieceTokenizer
 
+VOCAB_PATH = 'google_electra_small/vocab-e6d2b21d.json'
 
 def get_parser():
     parser = argparse.ArgumentParser(description=__doc__)
@@ -19,9 +22,6 @@ def get_parser():
                         help="directory for preprocessed features")
     parser.add_argument("--num_process", type=int, default=8,
                         help="number of processes for multiprocessing")
-    parser.add_argument("--vocab_file", default="vocab-c3b41053.json",
-                        help="vocabulary file of HuggingFaceWordPieceTokenizer"
-                        " for electra small model")
     parser.add_argument("--max_seq_length", type=int, default=128,
                         help="the maximum length of the pretraining sequence")
     parser.add_argument("--num_out_files", type=int, default=1000,
@@ -40,10 +40,11 @@ def get_parser():
 
 def main(args):
     num_process = min(multiprocessing.cpu_count(), args.num_process)
-    assert os.path.isfile(args.vocab_file), 'Cannot find vocab file'
-    # TODO(zheyuye), download the vocab_file from zoos and check it with sha1 hash.
+    vocab_file = os.path.join(os.getcwd(), 'vocab-e6d2b21d.json')
+    download(get_repo_model_zoo_url() + VOCAB_PATH, vocab_file,
+             sha1_hash='e6d2b21d910ccb356aa18f27a1c7d70660edc058')
     tokenizer = HuggingFaceWordPieceTokenizer(
-        vocab_file=args.vocab_file,
+        vocab_file=vocab_file,
         unk_token='[UNK]',
         pad_token='[PAD]',
         cls_token='[CLS]',
```
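With the `--vocab_file` argument removed, the script resolves the ELECTRA-small vocabulary itself, pinning the download to a SHA-1 hash (this also settles the removed TODO). A minimal sketch of the equivalent standalone fetch, assuming `gluonnlp`'s `download` returns the local path and skips the transfer when a file with a matching hash is already present:

```python
from gluonnlp.base import get_repo_model_zoo_url
from gluonnlp.utils.misc import download

# Same model zoo path and hash as in the diff above; the file lands in
# the current working directory, mirroring what main() now does.
vocab_file = download(
    get_repo_model_zoo_url() + 'google_electra_small/vocab-e6d2b21d.json',
    'vocab-e6d2b21d.json',
    sha1_hash='e6d2b21d910ccb356aa18f27a1c7d70660edc058')
```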

scripts/pretraining/pretraining_utils.py
Lines changed: 2 additions & 5 deletions

```diff
@@ -41,17 +41,14 @@ def tokenize_lines_to_ids(lines, tokenizer):
     """
     results = []
     # tag line delimiters or doc delimiters
-    line_delimiters = False
     for line in lines:
         if not line:
             break
         line = line.strip()
         # Single empty lines are used as line delimiters
         # Double empty lines are used as document delimiters
         if not line:
-            if not line_delimiters:
-                results.append([])
-            line_delimiters = not line_delimiters
+            results.append([])
         else:
             token_ids = tokenizer.encode(line, int)
             if token_ids:
@@ -125,7 +122,7 @@ def process_a_text(text_file, tokenizer, max_seq_length, short_seq_prob=0.05):
     for tokenized_line in tokenized_lines:
         current_sentences.append(tokenized_line)
         current_length += len(tokenized_line)
-        # Create feature when meets the empty line or reaches the target length
+        # Create feature when meets the empty line or reaches the target length
         if (not tokenized_line and current_length != 0) or (current_length >= target_seq_length):
             first_segment, second_segment = \
                 sentenceize(current_sentences, max_seq_length, target_seq_length)
```
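This simplification pairs with the `fp.write('\n')` change in `prepare_openwebtext.py`: every content line is already written with its own trailing newline, so one extra newline produces a single blank line between documents. The reader therefore no longer needs the `line_delimiters` toggle that told single blank lines apart from double ones; every blank line now appends one `[]` sentinel, which `process_a_text` treats as a boundary for cutting features. A toy rendering of the simplified loop (the character-code encoder is a stand-in for the real tokenizer):

```python
def tokenize_lines_to_ids_sketch(lines, encode):
    """Simplified version of the updated tokenize_lines_to_ids."""
    results = []
    for line in lines:
        if not line:          # '' signals end of input
            break
        line = line.strip()
        if not line:          # every blank line is now a document delimiter
            results.append([])
        else:
            ids = encode(line)
            if ids:
                results.append(ids)
    return results

print(tokenize_lines_to_ids_sketch(
    ['First document.\n', '\n', 'Second document.\n'],
    lambda s: [ord(c) for c in s[:3]]))  # stand-in encoder
# [[70, 105, 114], [], [83, 101, 99]]
```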
