Description
(@albertvillanova)
The MBPP dataset on the Hub has only a test split for both its "full" and its "sanitized" subsets, while the paper states in subsection 2.1, regarding the full set:
In the experiments described later in the paper, we hold out 10 problems for few-shot prompting, another 500 as our test dataset (which is used to evaluate both few-shot inference and fine-tuned models), 374 problems for fine-tuning, and the rest for validation.
If the dataset on the Hub is meant to reproduce the original authors' usage as closely as possible, this four-way split should be reflected.
The paper doesn't explicitly state the task_id ranges of the splits, but the GitHub readme referenced in the paper specifies exact task_id ranges, although it misstates the total number of samples:
We specify a train and test split to use for evaluation. Specifically:
- Task IDs 11-510 are used for evaluation.
- Task IDs 1-10 and 511-1000 are used for training and/or prompting. We typically used 1-10 for few-shot prompting, although you can feel free to use any of the training examples.
I.e. the few-shot, train, and validation splits are combined into one split, with only a soft suggestion to use the first ten problems for few-shot prompting. It is not explicitly stated whether the 374 fine-tuning samples mentioned in the paper have task_ids 511 to 884 or 601 to 974, or are randomly sampled from task_ids 511 to 974.
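To make the proposed four-way split concrete, here is a minimal sketch of the task_id-to-split mapping. The prompt and test ranges come from the readme quoted above; the validation/train boundary (511-600 validation, 601-974 train) is an assumption, since the paper does not state which 374 problems were used for fine-tuning.

```python
from collections import Counter

# Hypothetical split assignment for the full MBPP set. The 1-10 (prompt)
# and 11-510 (test) ranges come from the readme; the 511-600 / 601-974
# boundary between validation and train is an ASSUMPTION, not stated in
# the paper.
def split_for_task_id(task_id: int) -> str:
    if 1 <= task_id <= 10:
        return "prompt"      # few-shot prompting examples
    if 11 <= task_id <= 510:
        return "test"        # 500 evaluation problems
    if 511 <= task_id <= 600:
        return "validation"  # assumed: the remaining 90 problems
    if 601 <= task_id <= 974:
        return "train"       # assumed: the 374 fine-tuning problems
    raise ValueError(f"task_id {task_id} outside the MBPP range 1-974")

# Sanity check: the assumed ranges reproduce the split sizes from the paper.
sizes = Counter(split_for_task_id(i) for i in range(1, 975))
assert sizes == {"prompt": 10, "test": 500, "validation": 90, "train": 374}
```

Under this reading the sizes match the paper's 10/500/374-and-rest description; shifting the assumed boundary (e.g. train 511-884, validation 885-974) would keep the counts but change which problems land where.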
Regarding the "sanitized" subset, the paper states the following:
For evaluations involving the edited dataset, we perform comparisons with 100 problems that appear in both the original and edited dataset, using the same held out 10 problems for few-shot prompting and 374 problems for fine-tuning.
The statement doesn't appear to be very precise: among the 10 few-shot problems, those with task_ids 1, 5, and 10 are not even part of the sanitized variant, and many task_ids in the range 511 to 974 are missing (e.g. task_ids 511 to 553). I suppose the idea is that the task_id ranges for each split remain the same, even if some of the task_ids are not present. That would result in 7 few-shot, 257 test, 141 train, and 22 validation examples in the sanitized split.
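The "keep the ranges, drop the absent ids" interpretation can be sketched as follows. The `toy_ids` list is a stand-in for the real sanitized task_ids (which would come from the Hub), and the 511-600 / 601-974 validation/train boundary is again an assumption.

```python
# Sketch: derive sanitized-subset split sizes by keeping the full-set
# task_id ranges and simply dropping task_ids absent from the subset.
# The validation/train boundary (511-600 / 601-974) is an ASSUMPTION.
def sanitized_split_sizes(sanitized_ids):
    ranges = {
        "prompt": range(1, 11),
        "test": range(11, 511),
        "validation": range(511, 601),
        "train": range(601, 975),
    }
    present = set(sanitized_ids)
    return {name: len(present & set(r)) for name, r in ranges.items()}

# Toy stand-in for the sanitized subset: task_ids 1, 5, 10 and 511-553
# are missing, mirroring the gaps noted above (NOT the real id list).
toy_ids = [i for i in range(1, 975)
           if i not in {1, 5, 10} and not (511 <= i <= 553)]
print(sanitized_split_sizes(toy_ids))
```

Running the same function on the actual sanitized task_ids from the Hub would give the real per-split counts under this interpretation.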