Questions regarding the WSC evaluation results #172

@mutiann

Hi,

I've recently been trying to run lm-eval on the Pythia models using the benchmarks listed in the paper. All of the benchmarks give results similar to those reported in the paper, except WSC: the paper reports WSC scores of 0.3~0.5 for the Pythia models, while the models easily reach 0.6~0.8 accuracy on the WSC273 task in lm-eval. Could you confirm which WSC task is reported in the paper and how it was evaluated?
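
For reference, this is roughly how I ran the evaluation (a minimal sketch using the lm-evaluation-harness Python API; the `pythia-1.4b` checkpoint is just an example, and the exact API may differ depending on the harness version):

```python
# Sketch: score a Pythia checkpoint on the wsc273 task with lm-evaluation-harness.
# The checkpoint name below is an example; I ran the same evaluation across the suite.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-1.4b",
    tasks=["wsc273"],  # Winograd Schema Challenge (273 examples)
)

# Print the metrics for each task to compare against the numbers in the paper.
for task, metrics in results["results"].items():
    print(task, metrics)
```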

Thanks!
