Commit e730e12

loubnabnl, lvwerra, and Loubna ben allal authored
Update codeparrot data preprocessing (#16944)
* add new preprocessing arguments
* add new filters
* add new filters to readme
* fix config and test count, update function names and docstrings
* reformat code
* update readme
* Update readme
* rename config_test filter
* rename few_assignments filter
* rename tokenizer in arguments
* rename functions and add limit_line argument for config_test filter
* update threshold for config_test filter

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
1 parent 518dd12 commit e730e12

3 files changed

Lines changed: 89 additions & 6 deletions

examples/research_projects/codeparrot/README.md

Lines changed: 8 additions & 3 deletions
@@ -37,20 +37,25 @@ Additionally, sure you have git-lfs installed. You can find instructions for how
 The source of the dataset is the GitHub dump available on Google's [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). The database was queried for all Python files with less than 1MB in size resulting in a 180GB dataset with over 20M files. The dataset is available on the Hugging Face Hub [here](https://huggingface.co/datasets/transformersbook/codeparrot).

 ### Preprocessing
-The raw dataset contains many duplicates. We deduplicated and filtered the dataset using the heuristics proposed in OpenAI's Codex [paper](https://arxiv.org/abs/2107.03374):
+The raw dataset contains many duplicates. We deduplicated and filtered the dataset using the heuristics proposed in OpenAI's Codex [paper](https://arxiv.org/abs/2107.03374) and some new ones:

 - exact deduplication using each file's hash
 - filtering files with max line length > 1000
 - filtering files with mean line length > 100
 - fraction of alphanumeric characters < 0.25
 - containing the word "auto-generated" or similar in the first 5 lines
+- filtering, with a probability of 0.7, files that mention "test file", "configuration file" or similar in the first 5 lines
+- filtering, with a probability of 0.7, files with a high occurrence of the keywords "test " or "config"
+- filtering, with a probability of 0.7, files that contain none of the keywords `def`, `for`, `while` and `class`
+- filtering files that use the assignment operator `=` fewer than 5 times
+- filtering files with a ratio between the number of characters and the number of tokens after tokenization < 1.5 (the average ratio is 3.6)

-The script to process the full dataset can be found in `scripts/preprocessing.py`. Executing the script on 16 vCPUs takes roughly 3h and removes 70% of the original dataset. The cleaned [train](https://huggingface.co/datasets/lvwerra/codeparrot-clean-train) and [validation](https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid) splits are also available on the Hub if you want to skip this step or use the data for another project.
+The script to process the full dataset can be found in `scripts/preprocessing.py`. Executing the script on 16 vCPUs takes roughly 3h and removes 70% of the original dataset. The cleaned [train](https://huggingface.co/datasets/loubnabnl/codeparrot-clean-train-v2) and [validation](https://huggingface.co/datasets/loubnabnl/codeparrot-clean-valid-v2) splits are also available on the Hub if you want to skip this step or use the data for another project.

 To execute the preprocessing run the following command:
 ```bash
 python scripts/preprocessing.py \
---dataset_name lvwerra/codeparrot \
+--dataset_name transformersbook/codeparrot \
 --output_dir codeparrot-clean
 ```
 During preprocessing the dataset is downloaded and stored locally as well as caches of the computations. Make sure you have more than 500GB free disk space to execute it.
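
The keyword-frequency heuristic for "test " and "config" scales with file length. The following is a minimal sketch, assuming the threshold is 5% of the newline count as in `is_config_or_test` from `scripts/preprocessing.py` in this commit. The helper name `keyword_threshold_exceeded` and the sample strings are hypothetical and only serve the illustration.

```python
def keyword_threshold_exceeded(content: str, coeff: float = 0.05) -> bool:
    """Return True if 'config' or 'test' occurs more than coeff * (newline count) times."""
    # Same threshold rule as the commit: a fraction of the number of newlines, truncated to int.
    threshold = int(coeff * content.count("\n"))
    lowered = content.lower()
    return lowered.count("config") > threshold or lowered.count("test") > threshold


# Hypothetical inputs: a 2-line file and a 103-line file.
short_file = "import os\nprint('test')\n"
long_file = "x = 1\n" * 100 + "# test\n" * 3

short_flagged = keyword_threshold_exceeded(short_file)  # threshold is 0 here
long_flagged = keyword_threshold_exceeded(long_file)    # threshold is 5 here
print(short_flagged, long_flagged)
```

Under this rule any file with fewer than 20 newlines has a threshold of 0, so a single occurrence of either keyword is enough to flag it. Flagged files are then dropped only with the configured probability, not unconditionally.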

examples/research_projects/codeparrot/scripts/arguments.py

Lines changed: 11 additions & 1 deletion
@@ -133,7 +133,7 @@ class PreprocessingArguments:
         },
     )
     dataset_name: Optional[str] = field(
-        default="codeparrot", metadata={"help": "Folder or name of dataset to process."}
+        default="transformersbook/codeparrot", metadata={"help": "Folder or name of dataset to process."}
     )
     output_dir: Optional[str] = field(
         default="codeparrot-clean", metadata={"help": "Folder to save processed dataset."}
@@ -151,6 +151,16 @@ class PreprocessingArguments:
     alpha_frac: Optional[float] = field(
         default=0.25, metadata={"help": "Maximum fraction of non-alphanumeric characters, otherwise file is filtered."}
     )
+    min_token_ratio: Optional[float] = field(
+        default=1.5, metadata={"help": "Minimum character token ratio for the file, otherwise file is filtered."}
+    )
+    filter_proba: Optional[float] = field(
+        default=0.7, metadata={"help": "Probability for filtering config, test and uncommon files."}
+    )
+    tokenizer: Optional[str] = field(
+        default="lvwerra/codeparrot",
+        metadata={"help": "Name or path to the tokenizer."},
+    )


 @dataclass
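
The three new fields follow the dataclass pattern that `HfArgumentParser` reads to build command-line flags. As a rough, standard-library-only sketch of how the defaults and help strings hang together, the class below mirrors just those fields. `FilterArguments` is a hypothetical stand-in for `PreprocessingArguments`, not the class from the repository.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class FilterArguments:
    """Hypothetical stand-in holding only the fields added in this commit."""

    min_token_ratio: Optional[float] = field(
        default=1.5, metadata={"help": "Minimum character token ratio for the file, otherwise file is filtered."}
    )
    filter_proba: Optional[float] = field(
        default=0.7, metadata={"help": "Probability for filtering config, test and uncommon files."}
    )
    tokenizer: Optional[str] = field(
        default="lvwerra/codeparrot",
        metadata={"help": "Name or path to the tokenizer."},
    )


# Override one default, as a command-line flag would, and keep the others.
args = FilterArguments(filter_proba=0.5)

# The help text lives in each field's metadata and can be listed without parsing anything.
help_by_name = {f.name: f.metadata["help"] for f in fields(FilterArguments)}
for name, text in help_by_name.items():
    print(f"--{name}: {text}")
```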

examples/research_projects/codeparrot/scripts/preprocessing.py

Lines changed: 70 additions & 2 deletions
@@ -9,7 +9,7 @@
 from datasets import load_dataset

 from arguments import PreprocessingArguments
-from transformers import HfArgumentParser
+from transformers import AutoTokenizer, HfArgumentParser


 def get_hash(example):
@@ -50,18 +50,77 @@ def is_autogenerated(example, scan_width=5):
     return {"autogenerated": False}


+def is_config_or_test(example, scan_width=5, coeff=0.05):
+    """Check if file is a configuration file or a unit test by:
+    1- looking for keywords in the first few lines of the file.
+    2- counting number of occurrences of the words 'config' and 'test' with respect to number of lines.
+    """
+
+    keywords = ["unit tests", "test file", "configuration file"]
+    lines = example["content"].splitlines()
+    count_config = 0
+    count_test = 0
+    # first test
+    for _, line in zip(range(scan_width), lines):
+        for keyword in keywords:
+            if keyword in line.lower():
+                return {"config_or_test": True}
+    # second test
+    nlines = example["content"].count("\n")
+    threshold = int(coeff * nlines)
+    for line in lines:
+        count_config += line.lower().count("config")
+        count_test += line.lower().count("test")
+        if count_config > threshold or count_test > threshold:
+            return {"config_or_test": True}
+    return {"config_or_test": False}
+
+
+def has_no_keywords(example):
+    """Check if a python file has none of the keywords for: function, class, for loop, while loop."""
+    keywords = ["def ", "class ", "for ", "while "]
+    lines = example["content"].splitlines()
+    for line in lines:
+        for keyword in keywords:
+            if keyword in line.lower():
+                return {"has_no_keywords": False}
+    return {"has_no_keywords": True}
+
+
+def has_few_assignments(example, minimum=4):
+    """Check if file uses symbol '=' at most `minimum` times."""
+    lines = example["content"].splitlines()
+    counter = 0
+    for line in lines:
+        counter += line.lower().count("=")
+        if counter > minimum:
+            return {"has_few_assignments": False}
+    return {"has_few_assignments": True}
+
+
+def char_token_ratio(example):
+    """Compute character/token ratio of the file with tokenizer."""
+    input_ids = tokenizer(example["content"], truncation=False)["input_ids"]
+    ratio = len(example["content"]) / len(input_ids)
+    return {"ratio": ratio}
+
+
 def preprocess(example):
     """Chain all preprocessing steps into one function to not fill cache."""
     results = dict()
     results.update(get_hash(example))
     results.update(line_stats(example))
     results.update(alpha_stats(example))
+    results.update(char_token_ratio(example))
     results.update(is_autogenerated(example))
+    results.update(is_config_or_test(example))
+    results.update(has_no_keywords(example))
+    results.update(has_few_assignments(example))
     return results


 def filter(example, uniques, args):
-    """Filter dataset with heuristics."""
+    """Filter dataset with heuristics. Config, test and has_no_keywords files are removed with a given probability."""
     if not check_uniques(example, uniques):
         return False
     elif example["autogenerated"]:
@@ -72,6 +131,14 @@ def filter(example, uniques, args):
         return False
     elif example["alpha_frac"] < args.alpha_frac:
         return False
+    elif example["ratio"] < args.min_token_ratio:
+        return False
+    elif example["config_or_test"] and np.random.rand() <= args.filter_proba:
+        return False
+    elif example["has_no_keywords"] and np.random.rand() <= args.filter_proba:
+        return False
+    elif example["has_few_assignments"]:
+        return False
     else:
         return True
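
The probabilistic branches of `filter` drop a flagged file only some of the time. The sketch below isolates that rule, assuming a flagged file is removed when a uniform draw is at most `filter_proba`. The helper `keep_file` is hypothetical, and Python's `random.Random` stands in for `np.random.rand()` so the example runs without NumPy.

```python
import random


def keep_file(flagged: bool, filter_proba: float, rng: random.Random) -> bool:
    """Drop a flagged file with probability filter_proba; always keep unflagged files."""
    if flagged and rng.random() <= filter_proba:
        return False
    return True


# Seeded only so this sketch is reproducible; the seed value is arbitrary.
rng = random.Random(0)
n_files = 10_000

# Simulate many flagged files and measure how many survive at filter_proba=0.7.
kept = sum(keep_file(True, 0.7, rng) for _ in range(n_files))
kept_fraction = kept / n_files
print(f"kept {kept_fraction:.1%} of flagged files")
```

With the default of 0.7, roughly 30% of flagged files are expected to survive. The hunks shown here do not set a random seed, so the exact surviving subset may differ between runs.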
@@ -89,6 +156,7 @@ def compress_file(file_path):
 args = parser.parse_args()
 if args.num_workers is None:
     args.num_workers = multiprocessing.cpu_count()
+tokenizer = AutoTokenizer.from_pretrained(args.tokenizer)

 # Load dataset
 t_start = time.time()
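
The tokenizer loaded here is what `char_token_ratio` uses to score each file. To illustrate the idea without downloading a model, the sketch below swaps in a toy regex tokenizer. `toy_tokenize` is a hypothetical stand-in and will not reproduce the ratios of the real CodeParrot tokenizer. The empty-input guard is also an addition of this sketch and is not part of the commit.

```python
import re
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: word runs and single non-space symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_token_ratio(text: str, tokenize: Callable[[str], List[str]] = toy_tokenize) -> float:
    """Characters per token; low values suggest text that tokenizes poorly."""
    tokens = tokenize(text)
    if not tokens:  # guard added in this sketch to avoid dividing by zero
        return 0.0
    return len(text) / len(tokens)


# Ordinary code has long identifiers, so it yields several characters per token.
code_like = "def add(first, second):\n    return first + second\n"
# Symbol-heavy data splits into one token per character under the toy tokenizer.
symbol_heavy = "[[0,1],[2,3],[4,5],[6,7]]"

min_token_ratio = 1.5  # default from arguments.py in this commit
for sample in (code_like, symbol_heavy):
    ratio = char_token_ratio(sample)
    print(f"ratio={ratio:.2f} filtered={ratio < min_token_ratio}")
```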
