Hi openai/evals team,
I built a broader local study on LLM-as-judge reliability and extracted a narrower eval candidate that may be a better fit for this repo.
The candidate focuses on a single failure mode: factual overelaboration.
In these examples, one answer stays grounded while the other adds unsupported or false detail. The goal is to test whether a judge prefers the more factually correct answer rather than the more elaborated one.
Current candidate shape:
- 18 pairwise examples
- all human-reviewed
- all human non-tie
- no custom framework code
- OpenAI Evals-style assets already prepared:
  - samples.jsonl
  - metaeval_samples.jsonl
  - modelgraded YAML
  - eval registry YAML
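To make the shape above concrete, here is a minimal sketch of how a pairwise, non-tie record could be sanity-checked before submission. The field names (`question`, `answer_a`, `answer_b`, `human_choice`) and the example record are illustrative assumptions, not the actual schema or contents of samples.jsonl.

```python
import json

# Hypothetical record layout: these field names are assumptions for
# illustration, not the real schema of samples.jsonl.
REQUIRED_FIELDS = {"question", "answer_a", "answer_b", "human_choice"}


def validate_records(lines):
    """Parse JSONL lines and check that each record is a non-tie pairwise example."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        if record["human_choice"] not in ("A", "B"):
            raise ValueError(f"line {lineno}: human_choice must be 'A' or 'B' (no ties)")
        records.append(record)
    return records


# Made-up example record, only to show the intended shape.
example_line = json.dumps({
    "question": "What is the boiling point of water at sea level?",
    "answer_a": "100 degrees Celsius.",
    "answer_b": "100 degrees Celsius, as first measured by Newton in 1650.",
    "human_choice": "A",
})

records = validate_records([example_line])
```

A check like this would catch an accidental tie label or a missing answer field before the file reaches review.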
Why I think this may be useful:
- it is narrower than my full 100-pair public study
- it is more thematically consistent
- it captures a concrete failure mode that shows up in judge-based evals and product QA work
- it includes human choice labels for meta-eval
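As a rough illustration of what the human choice labels enable, the sketch below computes how often a judge's pick matches the human pick. The function name and the label values are hypothetical, and the toy lists are made up for the example; they are not results from the study.

```python
def judge_agreement(human_choices, judge_choices):
    """Return the fraction of pairs where the judge picked the same answer as the human."""
    if len(human_choices) != len(judge_choices):
        raise ValueError("label lists must have the same length")
    if not human_choices:
        raise ValueError("at least one labeled pair is required")
    matches = sum(h == j for h, j in zip(human_choices, judge_choices))
    return matches / len(human_choices)


# Toy labels for illustration only (not real data).
human = ["A", "B", "A", "B"]
judge = ["A", "B", "B", "B"]
rate = judge_agreement(human, judge)  # 3 of 4 pairs match -> 0.75
```

Because every pair is non-tie, a disagreement here always means the judge preferred the answer the human reviewers rejected, which keeps the metric easy to interpret.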
Prepared assets for context:
I have not opened a PR yet because I want to confirm that the contribution shape is useful before I submit it.
Questions:
- Does a narrow pairwise eval around factual overelaboration sound in-scope for openai/evals?
- Would maintainers prefer this as:
  - a pairwise model-graded eval with human choice labels, or
  - a different framing closer to an existing template?
- If the shape is right, would you prefer I keep it at 15-18 carefully curated examples rather than trying to contribute a broader set?
If helpful, I can share the broader public study and validation notes in whatever format is easiest to review.
Thanks.