Hi Evals team,
OpenAI Evals is a natural place for people to understand why certain prompts or systems fail, not just whether they pass or fail a metric.
I maintain the WFGY RAG 16 Problem Map, an open-source failure taxonomy for RAG / LLM pipelines. It includes:
- A list of 16 common failure types
- A one-page Global Debug Card
- A triage prompt that can classify failing examples and suggest structural fixes
Repo (MIT):
https://github.com/onestardao/WFGY
Main reference page:
https://github.com/onestardao/WFGY/tree/main/ProblemMap/README.md
The map is already integrated or cited in other RAG / LLM ecosystems, including RAGFlow, LlamaIndex, ToolUniverse (Harvard MIMS Lab), Rankify (Univ. of Innsbruck), Multimodal RAG Survey (QCRI LLM Lab) and curated lists like Awesome LLM Apps.
Proposal:
Provide WFGY’s 16-problem map as a recommended taxonomy for eval analysis in Evals, for example:
- An example eval where:
  - RAG outputs that fail a metric are further classified into one of the 16 WFGY failure types by an LLM.
  - The analysis notebook aggregates failure counts per type.
- A short docs section that:
  - Introduces the 16-problem map.
  - Shows how users can plug in their own failure taxonomies, with WFGY as a concrete example.
This would help eval users move from “the score went down” to “we see more cases of failure types 3 and 7, which point to specific pipeline fixes”.
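
To make the proposal concrete, here is a minimal sketch of the classify-then-aggregate step. It is not the Evals API and not WFGY's actual triage prompt: `TAXONOMY`, `aggregate_failures`, `stub_classifier`, and the sample fields `passed` and `debug_type` are all hypothetical names chosen for illustration, and the taxonomy labels are placeholders rather than the real names from the Problem Map. In a real eval the classifier would be an LLM call.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Mapping

# Placeholder labels only (assumption): the real failure-type names
# live in the WFGY Problem Map README and are not reproduced here.
TAXONOMY: Dict[int, str] = {i: f"wfgy_type_{i}" for i in range(1, 17)}


def aggregate_failures(
    samples: Iterable[Mapping],
    classify: Callable[[Mapping], int],
    taxonomy: Mapping[int, str] = TAXONOMY,
) -> Counter:
    """Count failing samples per failure type.

    `classify` maps a failing sample to a taxonomy id. Any taxonomy
    (not only WFGY) can be plugged in through the `taxonomy` argument.
    """
    counts: Counter = Counter()
    for sample in samples:
        if sample["passed"]:
            continue  # only failing samples are triaged
        type_id = classify(sample)
        # Ids outside the taxonomy are kept visible instead of dropped.
        counts[taxonomy.get(type_id, "unclassified")] += 1
    return counts


def stub_classifier(sample: Mapping) -> int:
    """Stand-in for an LLM triage call (assumption): reads a canned id."""
    return sample.get("debug_type", 0)


# Hypothetical eval results, purely for illustration.
samples = [
    {"id": "a", "passed": True},
    {"id": "b", "passed": False, "debug_type": 3},
    {"id": "c", "passed": False, "debug_type": 7},
    {"id": "d", "passed": False, "debug_type": 3},
    {"id": "e", "passed": False},
]

counts = aggregate_failures(samples, stub_classifier)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

Passing the taxonomy as a plain mapping is a deliberate choice in this sketch: users could swap in their own failure categories without touching the aggregation logic.
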
If this aligns with Evals’ direction, I am happy to draft the example and documentation text.