Questions regarding the WSC evaluation results #172

@mutiann

Hi,

I've recently been trying to run lm-eval on the Pythia models using the benchmarks listed in the paper. All of the benchmarks give results similar to those reported in the paper, except WSC: the paper reports WSC scores of 0.3~0.5 for the Pythia models, while the models easily reach 0.6~0.8 accuracy on the WSC273 task in lm-eval. Could you confirm which WSC task is reported in the paper and how it was evaluated?
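
For reference, this is roughly how I ran the evaluation (a minimal sketch using the lm-evaluation-harness Python API; the `pythia-1.4b` checkpoint is just an example, and the exact API may differ depending on the harness version):

```python
# Sketch: score a Pythia checkpoint on the wsc273 task with lm-evaluation-harness.
# The checkpoint name below is an example; I ran the same evaluation across the suite.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",
    tasks=["wsc273"],  # Winograd Schema Challenge (273 examples)
)

# Print the metrics for each task to compare against the numbers in the paper.
for task, metrics in results["results"].items():
    print(task, metrics)
```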

Thanks!
