Proposal: add WFGY 16-problem RAG failure map as a taxonomy for eval analysis #1629

@onestardao

Hi Evals team,

OpenAI Evals is a natural place to understand why a prompt or system fails, not just whether it passes or fails a metric.

I maintain the WFGY RAG 16 Problem Map, an open-source failure taxonomy for RAG / LLM pipelines. It includes:

  • A list of 16 common failure types
  • A one-page Global Debug Card
  • A triage prompt that can classify failing examples and suggest structural fixes

Repo (MIT):
https://github.com/onestardao/WFGY
Main reference page:
https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md

The map is already integrated or cited in other RAG / LLM ecosystems, including RAGFlow, LlamaIndex, ToolUniverse (Harvard MIMS Lab), Rankify (Univ. of Innsbruck), Multimodal RAG Survey (QCRI LLM Lab) and curated lists like Awesome LLM Apps.

Proposal:

Provide WFGY’s 16-problem map as a recommended taxonomy for eval analysis in Evals, for example:

  1. An example eval where:

    • RAG outputs that fail a metric are further classified into one of the 16 WFGY failure types by an LLM.
    • The analysis notebook aggregates failure counts per type.
  2. A short docs section that:

    • Introduces the 16-problem map.
    • Shows how users can plug in their own failure taxonomies, with WFGY as a concrete example.
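To make items 1 and 2 concrete, here is a minimal sketch of the flow, assuming a taxonomy can be represented as a mapping from a type ID to a label. Everything in it is hypothetical: `Taxonomy`, `classify_failures`, `aggregate`, and `stub_classifier` are illustrative names, not existing Evals APIs, and the `problem-N` labels are placeholders rather than the actual WFGY problem names. The stub classifier stands in for an LLM call that would use the triage prompt.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Taxonomy:
    """A pluggable failure taxonomy: type ID -> human-readable label."""
    name: str
    types: Dict[int, str]


# Placeholder labels only (assumption); the real names live in the WFGY Problem Map.
WFGY_16 = Taxonomy("wfgy-16", {i: f"problem-{i}" for i in range(1, 17)})

# A classifier takes (failing sample, taxonomy) and returns a type ID.
# In practice this would wrap an LLM call with the triage prompt.
Classifier = Callable[[dict, Taxonomy], int]


def classify_failures(samples: Iterable[dict], taxonomy: Taxonomy,
                      classifier: Classifier) -> List[int]:
    """Classify only the samples that failed the metric."""
    labels = []
    for sample in samples:
        if sample["passed"]:
            continue
        type_id = classifier(sample, taxonomy)
        if type_id not in taxonomy.types:
            raise ValueError(f"unknown failure type: {type_id}")
        labels.append(type_id)
    return labels


def aggregate(labels: Iterable[int]) -> Counter:
    """Count failures per type, as the analysis notebook would."""
    return Counter(labels)


def stub_classifier(sample: dict, taxonomy: Taxonomy) -> int:
    """Stand-in for the LLM: reads a pre-assigned label from the sample."""
    return sample["debug_type"]


# Hypothetical eval results: one pass, three failures.
samples = [
    {"id": "a", "passed": True},
    {"id": "b", "passed": False, "debug_type": 3},
    {"id": "c", "passed": False, "debug_type": 7},
    {"id": "d", "passed": False, "debug_type": 3},
]
counts = aggregate(classify_failures(samples, WFGY_16, stub_classifier))
```

Users with their own taxonomy would only need to construct a different `Taxonomy` and pass it in; the classification and aggregation steps stay the same.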

This would help eval users move from “the score went down” to “we see more cases of failure types 3 and 7, which point to specific pipeline fixes”.
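As a rough illustration of that shift, the sketch below compares per-type failure counts between two runs and reports only the types whose counts changed. The function name `failure_deltas` and the counts are made up for the example; they are not real eval results or part of any existing API.

```python
from collections import Counter
from typing import Dict


def failure_deltas(before: Counter, after: Counter) -> Dict[int, int]:
    """Return the per-type change in failure count, omitting unchanged types."""
    deltas = {}
    for type_id in sorted(set(before) | set(after)):
        # Counter returns 0 for types absent from a run.
        change = after[type_id] - before[type_id]
        if change != 0:
            deltas[type_id] = change
    return deltas


# Hypothetical per-type failure counts from two eval runs.
baseline = Counter({3: 2, 7: 1, 11: 4})
candidate = Counter({3: 5, 7: 4, 11: 4})
deltas = failure_deltas(baseline, candidate)
```

Here `deltas` contains entries only for types 3 and 7, which is the kind of signal that points at a specific part of the pipeline rather than at the overall score.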

If this aligns with Evals’ direction, I am happy to draft the example and the documentation text.
