Missing MBPP splits #4795

@stadlerb

Description

(@albertvillanova)
The MBPP dataset on the Hub has only a test split for both its "full" and its "sanitized" subset, while the paper states in subsection 2.1 regarding the full split:

In the experiments described later in the paper, we hold out 10 problems for few-shot prompting, another 500 as our test dataset (which is used to evaluate both few-shot inference and fine-tuned models), 374 problems for fine-tuning, and the rest for validation.

If the dataset on the Hub is meant to reproduce the original authors' setup as closely as possible, this four-way split should be reflected.

The paper doesn't explicitly state the task_id ranges of the splits, but the GitHub readme referenced in the paper specifies exact task_id ranges, although it misstates the total number of samples:

We specify a train and test split to use for evaluation. Specifically:

  • Task IDs 11-510 are used for evaluation.
  • Task IDs 1-10 and 511-1000 are used for training and/or prompting. We typically used 1-10 for few-shot prompting, although you can feel free to use any of the training examples.

That is, the few-shot, train, and validation splits are combined into one split, with a soft suggestion to use the first ten problems for few-shot prompting. It is not explicitly stated whether the 374 fine-tuning samples mentioned in the paper have task_id 511 to 884 or 601 to 974, or whether they are randomly sampled from task_id 511 to 974.
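To make the ambiguity concrete, here is a sketch of a range-based split assignment. The test (11-510) and few-shot (1-10) ranges come from the readme; the boundary between fine-tuning and validation is an assumption (here 511-600 validation, 601-974 train), since neither the paper nor the readme states it.

```python
from collections import Counter

def assign_split(task_id: int) -> str:
    """Map an MBPP task_id to a hypothesized split name.

    Ranges for "prompt" and "test" follow the GitHub readme; the
    "validation"/"train" boundary is an assumption chosen so that the
    train split has the 374 fine-tuning problems the paper mentions.
    """
    if 1 <= task_id <= 10:
        return "prompt"       # few-shot prompting examples
    if 11 <= task_id <= 510:
        return "test"         # held-out evaluation set (500 problems)
    if 511 <= task_id <= 600:
        return "validation"   # assumed: the 90 remaining problems
    if 601 <= task_id <= 974:
        return "train"        # assumed: the 374 fine-tuning problems
    raise ValueError(f"task_id {task_id} outside the 1-974 MBPP range")

# Sanity-check the resulting split sizes against the paper's numbers.
sizes = Counter(assign_split(i) for i in range(1, 975))
print(sizes)  # prompt=10, test=500, validation=90, train=374
```

Note that swapping the assumed validation and train ranges (511-884 train, 885-974 validation) yields the same sizes, which is exactly why the paper's wording leaves the fine-tuning ids ambiguous.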

Regarding the "sanitized" split the paper states the following:

For evaluations involving the edited dataset, we perform comparisons with 100 problems that appear in both the original and edited dataset, using the same held out 10 problems for few-shot prompting and 374 problems for fine-tuning.

The statement doesn't appear to be very precise: among the 10 few-shot problems, those with task_id 1, 5, and 10 are not even part of the sanitized variant, and many problems in the task_id range 511 to 974 are missing (e.g. task_id 511 to 553). I suppose the idea is that the task_id ranges for each split remain the same, even if some of the task_ids are not present. That would result in 7 few-shot, 257 test, 141 train, and 22 validation examples in the sanitized split.
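Under that reading, the sanitized split sizes would be obtained by keeping the full-split task_id ranges and counting only the ids actually present in the sanitized subset. A toy sketch (the id list below is a made-up stand-in, not the real sanitized ids, and the validation/train boundary is the same assumption as above; with the real id list the counts should come out to 7/257/22/141):

```python
from collections import Counter

# Assumed full-split boundaries (half-open Python ranges).
RANGES = {
    "prompt": range(1, 11),
    "test": range(11, 511),
    "validation": range(511, 601),   # assumed boundary
    "train": range(601, 975),        # assumed boundary
}

def sanitized_split_sizes(present_ids):
    """Count how many of the given task_ids fall into each range."""
    counts = Counter()
    for tid in present_ids:
        for name, r in RANGES.items():
            if tid in r:
                counts[name] += 1
                break
    return counts

# Hypothetical subset: task_ids 1, 5, and 10 are absent from the prompt
# range, mirroring the situation described above for the sanitized data.
toy_ids = [2, 3, 4, 6, 7, 8, 9, 11, 200, 510, 554, 600, 601, 974]
print(sanitized_split_sizes(toy_ids))
# prompt=7, test=3, validation=2, train=2 for this toy id list
```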

Labels: bug