Proposal: narrow factual-overelaboration pairwise eval (human-reviewed, no custom code)

Hi `openai/evals` team,

I built a broader local study on LLM-as-judge reliability and extracted a narrower eval candidate that may be a better fit for this repo.

The candidate focuses on a single failure mode:

**factual overelaboration**

In these examples, one answer stays grounded while the other adds unsupported or false detail. The goal is to test whether a judge prefers the more factually correct answer over the more elaborate one.

Current candidate shape:

- `18` pairwise examples
- all human-reviewed
- every human label is a decisive choice (no `tie` labels)
- no custom framework code
- OpenAI Evals-style assets already prepared:
  - `samples.jsonl`
  - `metaeval_samples.jsonl`
  - modelgraded YAML
  - eval registry YAML

Why I think this may be useful:

- it is narrower than my full `100`-pair public study
- it is more thematically consistent
- it captures a concrete failure mode that shows up in judge-based evals and product QA work
- it includes human `choice` labels for meta-eval
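
To make the meta-eval idea concrete, here is a minimal sketch of how judge choices could be scored against the human `choice` labels. The field names `human_choice` and `judge_choice` and the inline records are hypothetical placeholders for illustration, not the actual schema or data in `metaeval_samples.jsonl`.

```python
import json


def load_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def agreement_rate(records):
    """Fraction of records where the judge's choice equals the human label.

    Assumes each record has hypothetical `human_choice` and `judge_choice` keys.
    """
    if not records:
        raise ValueError("no records to score")
    matches = sum(1 for r in records if r["judge_choice"] == r["human_choice"])
    return matches / len(records)


# Hypothetical records, for illustration only (not taken from the study).
SAMPLE_JSONL = "\n".join(
    json.dumps(rec)
    for rec in [
        {"human_choice": "A", "judge_choice": "A"},
        {"human_choice": "B", "judge_choice": "B"},
        {"human_choice": "A", "judge_choice": "B"},  # judge disagrees with the human label
        {"human_choice": "B", "judge_choice": "B"},
    ]
)

if __name__ == "__main__":
    records = load_jsonl(SAMPLE_JSONL)
    print(f"judge-human agreement: {agreement_rate(records):.2f}")
```

Because every human label is non-`tie`, a plain match rate like this is well defined for all `18` pairs. A judge that also emits `tie` would simply count as a mismatch under this sketch.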

Prepared assets for context:

- candidate subset: https://github.com/joaquinhuigomez/llm-judge-calibrator/blob/master/studies/truthfulqa_100/openai_evals/factual_overelaboration_pr18_v2.jsonl
- export scaffold: https://github.com/joaquinhuigomez/llm-judge-calibrator/tree/master/studies/truthfulqa_100/openai_evals/factual_overelaboration_pr18_v2
- broader study PR: https://github.com/joaquinhuigomez/llm-judge-calibrator/pull/1
- public result memo: https://github.com/joaquinhuigomez/llm-judge-calibrator/blob/master/studies/truthfulqa_100/results/public_result_memo.md

I have **not** opened the PR yet because I want to confirm that the contribution shape is useful before I submit it.

Questions:

1. Does a narrow pairwise eval around factual overelaboration sound in-scope for `openai/evals`?
2. Would maintainers prefer this as:
   - a pairwise model-graded eval with human choice labels, or
   - a different framing closer to an existing template?
3. If the shape is right, would you prefer I keep it at `15-18` carefully curated examples rather than trying to contribute a broader set?

If helpful, I can share the broader public study and validation notes in whatever format is easiest to review.

Thanks.


Proposal: narrow factual-overelaboration pairwise eval (human-reviewed, no custom code) #1635

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Proposal: narrow factual-overelaboration pairwise eval (human-reviewed, no custom code) #1635

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions