Description
(@albertvillanova)
The MBPP dataset on the Hub has only a test split for both its "full" and its "sanitized" subsets, while the paper states in subsection 2.1, regarding the full set:
In the experiments described later in the paper, we hold out 10 problems for few-shot prompting, another 500 as our test dataset (which is used to evaluate both few-shot inference and fine-tuned models), 374 problems for fine-tuning, and the rest for validation.
If the dataset on the Hub is meant to reproduce the original authors' usage as closely as possible, this four-way split should be reflected.
The paper doesn't explicitly state the task_id ranges of the splits, but the GitHub readme referenced in the paper specifies exact task_id ranges, although it misstates the total number of samples:
We specify a train and test split to use for evaluation. Specifically:
- Task IDs 11-510 are used for evaluation.
- Task IDs 1-10 and 511-1000 are used for training and/or prompting. We typically used 1-10 for few-shot prompting, although you can feel free to use any of the training examples.
I.e. the few-shot, train, and validation splits are combined into one split, with only a soft suggestion to use the first ten problems for few-shot prompting. It is not explicitly stated whether the 374 fine-tuning samples mentioned in the paper have task_ids 511 to 884 or 601 to 974, or are randomly sampled from task_ids 511 to 974.
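To make the proposed four-way split concrete, here is a minimal sketch of the task_id-to-split mapping. The prompt and test ranges come from the readme quoted above; the validation/train boundary (511-600 validation, 601-974 train) is an assumption, since the paper does not state which 374 problems were used for fine-tuning.

```python
from collections import Counter

# Hypothetical split assignment for the full MBPP set. The 1-10 (prompt)
# and 11-510 (test) ranges come from the readme; the 511-600 / 601-974
# boundary between validation and train is an ASSUMPTION, not stated in
# the paper.
def split_for_task_id(task_id: int) -> str:
    if 1 <= task_id <= 10:
        return "prompt"      # few-shot prompting examples
    if 11 <= task_id <= 510:
        return "test"        # 500 evaluation problems
    if 511 <= task_id <= 600:
        return "validation"  # assumed: the remaining 90 problems
    if 601 <= task_id <= 974:
        return "train"       # assumed: the 374 fine-tuning problems
    raise ValueError(f"task_id {task_id} outside the MBPP range 1-974")

# Sanity check: the assumed ranges reproduce the split sizes from the paper.
sizes = Counter(split_for_task_id(i) for i in range(1, 975))
assert sizes == {"prompt": 10, "test": 500, "validation": 90, "train": 374}
```

Under this reading the sizes match the paper's 10/500/374-and-rest description; shifting the assumed boundary (e.g. train 511-884, validation 885-974) would keep the counts but change which problems land where.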
Regarding the "sanitized" subset, the paper states the following:
For evaluations involving the edited dataset, we perform comparisons with 100 problems that appear in both the original and edited dataset, using the same held out 10 problems for few-shot prompting and 374 problems for fine-tuning.
The statement doesn't appear to be very precise: among the 10 few-shot problems, those with task_ids 1, 5, and 10 are not even part of the sanitized variant, and many task_ids in the range 511 to 974 are missing (e.g. task_ids 511 to 553). I suppose the idea is that the task_id ranges for each split remain the same, even if some of the task_ids are not present. That would result in 7 few-shot, 257 test, 141 train, and 22 validation examples in the sanitized split.
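The "keep the ranges, drop the absent ids" interpretation can be sketched as follows. The `toy_ids` list is a stand-in for the real sanitized task_ids (which would come from the Hub), and the 511-600 / 601-974 validation/train boundary is again an assumption.

```python
# Sketch: derive sanitized-subset split sizes by keeping the full-set
# task_id ranges and simply dropping task_ids absent from the subset.
# The validation/train boundary (511-600 / 601-974) is an ASSUMPTION.
def sanitized_split_sizes(sanitized_ids):
    ranges = {
        "prompt": range(1, 11),
        "test": range(11, 511),
        "validation": range(511, 601),
        "train": range(601, 975),
    }
    present = set(sanitized_ids)
    return {name: len(present & set(r)) for name, r in ranges.items()}

# Toy stand-in for the sanitized subset: task_ids 1, 5, 10 and 511-553
# are missing, mirroring the gaps noted above (NOT the real id list).
toy_ids = [i for i in range(1, 975)
           if i not in {1, 5, 10} and not (511 <= i <= 553)]
print(sanitized_split_sizes(toy_ids))
```

Running the same function on the actual sanitized task_ids from the Hub would give the real per-split counts under this interpretation.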