Proposal: narrow factual-overelaboration pairwise eval (human-reviewed, no custom code) #1635

@joaquinhuigomez

Hi openai/evals team,

I built a broader local study on LLM-as-judge reliability and extracted a narrower eval candidate that may be a better fit for this repo.

The candidate focuses on a single failure mode: factual overelaboration.

In these examples, one answer stays grounded while the other adds unsupported or false detail. The goal is to test whether a judge prefers the more factually correct answer rather than the more elaborated one.

Current candidate shape:

  • 18 pairwise examples
  • all human-reviewed
  • all with a non-tie human preference
  • no custom framework code
  • OpenAI Evals-style assets already prepared:
    • samples.jsonl
    • metaeval_samples.jsonl
    • modelgraded YAML
    • eval registry YAML
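To make the shape concrete, here is a minimal sketch of how one pairwise record could be built and serialized as a JSONL line. The field names (`question`, `answer_a`, `answer_b`, `human_choice`), the helper names, and the example text are all hypothetical and invented for illustration; they are not the actual schema of my prepared files or of openai/evals.

```python
import json


def make_pair_record(question, grounded, elaborated, grounded_first=True):
    """Build one pairwise record.

    Field names are illustrative assumptions, not a confirmed schema.
    The grounded answer is placed in slot A or B so that position can be
    balanced across the set; human_choice records which slot holds it.
    """
    a, b = (grounded, elaborated) if grounded_first else (elaborated, grounded)
    return {
        "question": question,
        "answer_a": a,
        "answer_b": b,
        "human_choice": "A" if grounded_first else "B",
    }


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Invented example: the elaborated answer adds an unsupported detail.
records = [
    make_pair_record(
        "At what temperature does water boil at sea level?",
        "About 100 °C.",
        "About 100 °C, a value first measured by Isaac Newton in 1702.",
        grounded_first=True,
    ),
    make_pair_record(
        "At what temperature does water boil at sea level?",
        "About 100 °C.",
        "About 100 °C, a value first measured by Isaac Newton in 1702.",
        grounded_first=False,
    ),
]
print(to_jsonl(records))
```

Swapping the grounded answer between slots A and B is one way to keep a judge's position preference from being mistaken for a factuality preference.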

Why I think this may be useful:

  • it is narrower than my full 100-pair public study
  • it is more thematically consistent
  • it captures a concrete failure mode that shows up in judge-based evals and product QA work
  • it includes human choice labels for meta-eval
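As a rough sketch of how the human choice labels could be used for meta-eval, the snippet below computes the fraction of pairs where a judge's choice matches the human choice. The function name and the "A"/"B" label encoding are assumptions for illustration, not part of the prepared assets or of the openai/evals API.

```python
def judge_human_agreement(human_choices, judge_choices):
    """Return the fraction of pairs where the judge picked the human-preferred answer.

    Both arguments are equal-length sequences of labels such as "A" or "B"
    (a hypothetical encoding). Ties are not handled because every pair in
    the candidate set has a non-tie human label.
    """
    if len(human_choices) != len(judge_choices):
        raise ValueError("human and judge label lists must have the same length")
    if not human_choices:
        raise ValueError("at least one labeled pair is required")
    matches = sum(h == j for h, j in zip(human_choices, judge_choices))
    return matches / len(human_choices)


# Made-up labels, only to show the call shape.
agreement = judge_human_agreement(["A", "B", "A", "B"], ["A", "B", "B", "B"])
print(f"agreement: {agreement:.2f}")
```

With only 18 pairs, a single disagreement moves this rate by more than five percentage points, so the number is best read alongside the individual disagreements.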

Prepared assets for context:

I have not opened a PR yet because I want to confirm that the contribution shape is useful before submitting it.

Questions:

  1. Does a narrow pairwise eval around factual overelaboration sound in-scope for openai/evals?
  2. Would maintainers prefer this as:
    • a pairwise model-graded eval with human choice labels, or
    • a different framing closer to an existing template?
  3. If the shape is right, would you prefer I keep it at 15-18 carefully curated examples rather than trying to contribute a broader set?

If helpful, I can share the broader public study and validation notes in whatever format is easiest to review.

Thanks.
