
feat(giskard-checks): add Correctness LLM judge check #2350

Open

kevinmessiaen wants to merge 3 commits into main from feature/eng-1496-giskard-checks-correctness-check

Conversation

@kevinmessiaen (Member)

Adds a Correctness LLM judge check that validates the agent's answer against a reference (ground-truth) answer, with a Jinja prompt template, public exports, and unit tests. Also syncs uv.lock for editable package versions.

Made-with: Cursor
@linear linear Bot commented Mar 31, 2026

@gemini-code-assist gemini-code-assist Bot (Contributor) left a comment


Code Review

This pull request introduces a new Correctness check that uses an LLM to validate agent answers against reference ground-truth answers. The implementation includes the Correctness class, a corresponding Jinja2 prompt template, and unit tests. Feedback was provided regarding the handling of missing values in get_inputs, specifically to avoid passing the literal string "None" to the LLM and to ensure a reference answer is present before execution.

Comment on lines +116 to +139
return {
"description": str(
provided_or_resolve(
trace,
key=self.description_key,
value=provide_not_none(self.description),
)
),
"conversation": trace,
"answer": str(
provided_or_resolve(
trace,
key=self.answer_key,
value=provide_not_none(self.answer),
)
),
"reference_answer": str(
provided_or_resolve(
trace,
key=self.reference_answer_key,
value=provide_not_none(self.reference_answer),
)
),
}
Contributor

medium

The current implementation of get_inputs wraps the results of provided_or_resolve in str(). If a value (such as description or reference_answer) is missing from both the check instance and the trace, provided_or_resolve returns None, and str(None) converts it to the literal string "None". This string is then passed to the LLM prompt, which can lead to confusing or incorrect evaluations (e.g., the LLM comparing the agent's answer against the word "None" instead of a real reference answer).

Additionally, since Correctness is a reference-based check, it is highly recommended to validate that a reference_answer has been successfully resolved before proceeding, as the check cannot function correctly without it.

        description = provided_or_resolve(
            trace,
            key=self.description_key,
            value=provide_not_none(self.description),
        )
        answer = provided_or_resolve(
            trace,
            key=self.answer_key,
            value=provide_not_none(self.answer),
        )
        reference_answer = provided_or_resolve(
            trace,
            key=self.reference_answer_key,
            value=provide_not_none(self.reference_answer),
        )

        if reference_answer is None:
            raise ValueError(
                "Correctness check failed: No reference answer provided or found in trace. "
                "Please provide 'reference_answer' or ensure it exists in the trace metadata."
            )

        return {
            "description": str(description) if description is not None else "",
            "conversation": trace,
            "answer": str(answer) if answer is not None else "",
            "reference_answer": str(reference_answer),
        }
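
If the suggestion is adopted, the new failure mode could be pinned down with a small unit test. A hedged sketch (the empty_trace fixture and the no-argument constructor are assumptions; only the error message comes from the suggestion above):

# Illustrative test sketch; assumes Correctness is importable from the package's
# public exports and that all judge inputs default to None when not provided.
import pytest

def test_get_inputs_requires_reference_answer(empty_trace):
    # ``empty_trace`` is a hypothetical fixture: a trace carrying no reference
    # answer in its annotations.
    check = Correctness()
    with pytest.raises(ValueError, match="No reference answer"):
        check.get_inputs(empty_trace)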

@davidberenstein1957 davidberenstein1957 (Member) left a comment

some minor remarks

"""LLM-based check that validates an answer against a reference (ground-truth).

Uses an LLM to decide whether the agent's answer is correct relative to a
reference answer. The judge prompt receives the same ``Trace`` instance that is
Member

is this a reference answer or a reference context?

Comment on lines +24 to +30
**How the trace appears in the prompt**

Template rendering (via the agents Jinja environment) formats values for the LLM.
If the trace type implements ``_repr_prompt_()`` (see
``giskard.agents.templates.LLMFormattable``), that method supplies the
conversation text. Otherwise the trace
is serialized as indented JSON from ``model_dump()`` (Pydantic).
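
To make the two rendering paths concrete, here is a minimal sketch. ToyTrace is illustrative rather than the library's Trace type; only the _repr_prompt_ / model_dump() contract comes from the docstring above:

from pydantic import BaseModel

class ToyTrace(BaseModel):
    # Illustrative fields; the real Trace type is generic over its input/output types.
    question: str
    answer: str

    def _repr_prompt_(self) -> str:
        # Preferred path: the trace itself decides how the conversation is
        # presented to the judge LLM.
        return f"User: {self.question}\nAgent: {self.answer}"

# Without _repr_prompt_, the template falls back to rendering
# ToyTrace(...).model_dump() as indented JSON.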
Member

do we add this everywhere?

description_key : str
JSONPath expression to extract the description from the trace
(default: ``trace.annotations.description``).

Member

Suggested change


## Markers

Markers <AGENT DESCRIPTION>...</AGENT DESCRIPTION>, <CONVERSATION>...</CONVERSATION>, <AGENT ANSWER>...</AGENT ANSWER>, and <REFERENCE ANSWER>...</REFERENCE ANSWER> indicate where each input is. Everything inside a marker belongs to that category.
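
For context, a rough illustration of how a rendered prompt section might look with these markers. The wording is invented; only the marker names come from the documentation above, and the actual text is produced by the Jinja template added in this PR:

# Invented example values; marker names taken from the prompt documentation above.
EXAMPLE_SECTION = """\
<AGENT DESCRIPTION>
A support agent for the ACME store.
</AGENT DESCRIPTION>
<CONVERSATION>
User: Can I return my order?
Agent: Yes, within 60 days.
</CONVERSATION>
<AGENT ANSWER>
Yes, within 60 days.
</AGENT ANSWER>
<REFERENCE ANSWER>
Orders can be returned within 30 days of delivery.
</REFERENCE ANSWER>
"""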
Member

Is there a reason we don't add a _ between the words of markers that consist of multiple words, like AGENT_DESCRIPTION?

from pydantic import Field


class LLMTrace(Trace[str, str], frozen=True):
Member

can we create a shared version of this, perhaps?

Member Author

Yes, it's planned.

It's not yet defined what the input and output types should be (str, OpenAI format, ...) or how we want to represent tool calls, thinking, ...

)


class MockGenerator(BaseGenerator):
Member

also this could perhaps be shared?

@davidberenstein1957 davidberenstein1957 enabled auto-merge (squash) May 13, 2026 09:15
@davidberenstein1957 davidberenstein1957 self-requested a review May 13, 2026 09:15