
Conversation


@hatimbr hatimbr commented Apr 20, 2023

Adds XNLI to lm-evaluation-harness, based on PR #134.

StellaAthena (Collaborator) commented

@hatimbr Thank you for the contribution! Can you run this on a couple of models with public evaluations to confirm that the reported scores match what is expected?


hatimbr (Author) commented Apr 24, 2023

Hi @StellaAthena, I tested it with bloom-7b1 and bloom-560m and compared the results against the public evaluation scores.

With bloom-7b1, I got:

| Task | Prompt | Version | Metric | Value | Stderr |
|------|--------|---------|--------|-------|--------|
| xnli_en | GPT-3 style | 1 | acc | 0.3335 | ± 0.0067 |
| xnli_en | GPT-3 style | | acc_norm | 0.3285 | ± 0.0066 |

The public accuracy score was 0.3333333432674408

With bloom-560m, I got:

| Task | Prompt | Version | Metric | Value | Stderr |
|------|--------|---------|--------|-------|--------|
| xnli_fr | MNLI crowdsource | 1 | acc | 0.3497 | ± 0.0067 |
| xnli_fr | MNLI crowdsource | | acc_norm | 0.3345 | ± 0.0067 |

The public accuracy score was 0.35261043906211853

Is this good enough? Should I run more tests?
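As a sanity check on the tables above: the reported Stderr values are consistent with the standard error of a sample proportion, sqrt(p·(1−p)/n), assuming the XNLI test split of 5,010 examples per language (an assumption about the split size, not stated in this thread). A minimal sketch:

```python
import math

def acc_stderr(p: float, n: int) -> float:
    """Standard error of a sample proportion (binomial approximation)."""
    return math.sqrt(p * (1 - p) / n)

# Assumed split size: XNLI test set, 5,010 examples per language.
N = 5010

print(round(acc_stderr(0.3335, N), 4))  # bloom-7b1, xnli_en acc
print(round(acc_stderr(0.3497, N), 4))  # bloom-560m, xnli_fr acc
```

Both come out to roughly 0.0067, matching the harness output, which suggests the task is scoring over the full test split.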

@hatimbr hatimbr closed this May 9, 2023
@hatimbr hatimbr deleted the hb_dev branch May 9, 2023 13:25
