feat(giskard-checks): add Correctness LLM judge check #2350
kevinmessiaen wants to merge 3 commits into
Conversation
Register correctness judge with reference answer, Jinja prompt, public exports, and unit tests. Sync uv.lock for editable package versions. Made-with: Cursor
Code Review
This pull request introduces a new Correctness check that uses an LLM to validate agent answers against reference ground-truth answers. The implementation includes the Correctness class, a corresponding Jinja2 prompt template, and unit tests. Feedback was provided regarding the handling of missing values in get_inputs, specifically to avoid passing the literal string "None" to the LLM and to ensure a reference answer is present before execution.
    return {
        "description": str(
            provided_or_resolve(
                trace,
                key=self.description_key,
                value=provide_not_none(self.description),
            )
        ),
        "conversation": trace,
        "answer": str(
            provided_or_resolve(
                trace,
                key=self.answer_key,
                value=provide_not_none(self.answer),
            )
        ),
        "reference_answer": str(
            provided_or_resolve(
                trace,
                key=self.reference_answer_key,
                value=provide_not_none(self.reference_answer),
            )
        ),
    }
The current implementation of get_inputs wraps the results of provided_or_resolve in str(). If a value (such as description or reference_answer) is missing from both the check instance and the trace, provided_or_resolve returns None, and str(None) converts it to the literal string "None". This string is then passed to the LLM prompt, which can lead to confusing or incorrect evaluations (e.g., the LLM comparing the agent's answer against the word "None" instead of a real reference answer).
Additionally, since Correctness is a reference-based check, it is highly recommended to validate that a reference_answer has been successfully resolved before proceeding, as the check cannot function correctly without it.
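The pitfall is easy to demonstrate in isolation: `str(None)` yields the literal string `"None"`, which then ends up embedded in the judge prompt as if it were a real value.

```python
# Demonstration of the str(None) pitfall described above.
missing_value = None  # what provided_or_resolve returns when nothing is found

rendered = str(missing_value)
print(rendered == "None")  # True: the literal string "None", not an empty value

# Embedded in a prompt, the LLM would see a bogus reference answer:
prompt = f"<REFERENCE ANSWER>{rendered}</REFERENCE ANSWER>"
print(prompt)  # <REFERENCE ANSWER>None</REFERENCE ANSWER>
```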
description = provided_or_resolve(
trace,
key=self.description_key,
value=provide_not_none(self.description),
)
answer = provided_or_resolve(
trace,
key=self.answer_key,
value=provide_not_none(self.answer),
)
reference_answer = provided_or_resolve(
trace,
key=self.reference_answer_key,
value=provide_not_none(self.reference_answer),
)
if reference_answer is None:
raise ValueError(
"Correctness check failed: No reference answer provided or found in trace. "
"Please provide 'reference_answer' or ensure it exists in the trace metadata."
)
return {
"description": str(description) if description is not None else "",
"conversation": trace,
"answer": str(answer) if answer is not None else "",
"reference_answer": str(reference_answer),
}
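As a self-contained illustration of the suggested behaviour (`build_inputs` is a hypothetical stand-in for `get_inputs` with the resolution step already done), optional fields degrade to empty strings while a missing reference answer fails fast:

```python
from typing import Any, Optional


def build_inputs(
    description: Optional[Any],
    answer: Optional[Any],
    reference_answer: Optional[Any],
) -> dict:
    """Mimics the suggested get_inputs: empty string for optional fields,
    hard failure when the mandatory reference answer is missing."""
    if reference_answer is None:
        raise ValueError(
            "Correctness check failed: No reference answer provided or found in trace."
        )
    return {
        "description": str(description) if description is not None else "",
        "answer": str(answer) if answer is not None else "",
        "reference_answer": str(reference_answer),
    }


# Optional fields degrade gracefully:
print(build_inputs(None, "Paris", "Paris"))
# Missing reference answer fails fast instead of judging against "None":
try:
    build_inputs("desc", "Paris", None)
except ValueError as e:
    print(e)
```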
davidberenstein1957
left a comment
some minor remarks
| """LLM-based check that validates an answer against a reference (ground-truth). | ||
|
|
||
| Uses an LLM to decide whether the agent's answer is correct relative to a | ||
| reference answer. The judge prompt receives the same ``Trace`` instance that is |
is this a reference answer or a reference context?
    **How the trace appears in the prompt**

    Template rendering (via the agents Jinja environment) formats values for the LLM.
    If the trace type implements ``_repr_prompt_()`` (see
    ``giskard.agents.templates.LLMFormattable``), that method supplies the
    conversation text. Otherwise the trace is serialized as indented JSON from
    ``model_dump()`` (Pydantic).
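The fallback described in this docstring can be sketched as follows. The helper name and dispatch logic are illustrative only; the real implementation lives in `giskard.agents.templates`.

```python
import json
from typing import Any


def render_for_prompt(obj: Any) -> str:
    """Illustrative fallback chain: prefer a prompt-specific repr,
    else serialize a Pydantic model as indented JSON, else str()."""
    repr_prompt = getattr(obj, "_repr_prompt_", None)
    if callable(repr_prompt):
        return repr_prompt()
    if hasattr(obj, "model_dump"):
        return json.dumps(obj.model_dump(), indent=2, default=str)
    return str(obj)


class FakeTrace:
    """Stand-in trace type that supplies its own conversation text."""

    def _repr_prompt_(self) -> str:
        return "user: hi\nassistant: hello"


print(render_for_prompt(FakeTrace()))  # uses _repr_prompt_()
print(render_for_prompt(123))          # falls back to str()
```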
do we add this everywhere?
    description_key : str
        JSONPath expression to extract the description from the trace
        (default: ``trace.annotations.description``).

    ## Markers

    Markers <AGENT DESCRIPTION>...</AGENT DESCRIPTION>, <CONVERSATION>...</CONVERSATION>,
    <AGENT ANSWER>...</AGENT ANSWER>, and <REFERENCE ANSWER>...</REFERENCE ANSWER>
    indicate where each input is. Everything inside a marker belongs to that category.
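To make the marker layout concrete, here is a hypothetical prompt skeleton (plain `str.format` used for illustration; the actual check renders a Jinja2 template):

```python
# Hypothetical prompt skeleton showing how markers delimit each input.
PROMPT = """<AGENT DESCRIPTION>
{description}
</AGENT DESCRIPTION>

<CONVERSATION>
{conversation}
</CONVERSATION>

<AGENT ANSWER>
{answer}
</AGENT ANSWER>

<REFERENCE ANSWER>
{reference_answer}
</REFERENCE ANSWER>"""

rendered = PROMPT.format(
    description="A geography assistant.",
    conversation="user: What is the capital of France?",
    answer="Paris",
    reference_answer="Paris",
)
print(rendered)
```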
Is there a reason we don't add `_` between the words of markers that consist of multiple words, like AGENT_DESCRIPTION?
    from pydantic import Field


    class LLMTrace(Trace[str, str], frozen=True):
can we create a shared version of this, perhaps?
Yes, it's planned.
It's not yet defined what the input and output types should be (str, OpenAI format, ...) or how we want to represent tool calls, thinking, etc.
    )


    class MockGenerator(BaseGenerator):
also this could perhaps be shared?
Added a correctness judge with reference answer, Jinja prompt, public exports, and unit tests.